Today's arXiv Papers | 2025-01-07

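This post lists the latest papers retrieved from arXiv.org on 2025-01-07. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.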

Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.

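[Quick Read]: This paper addresses the limited reasoning quality of large language models (LLMs) on complex math problems, which stems from a granularity mismatch in in-context learning (ICL) examples and the resulting negative-effect noise problem. Specifically, LLMs handle problem decomposition well but often fail through inaccurate reasoning in a few key steps, while the retrieved ICL examples sometimes lack information relevant to a specific challenging step, and this irrelevance can hinder correct reasoning.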

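The key to the solution is BoostStep, which aligns the granularity of retrieval and reasoning at the step level and supplies highly relevant ICL examples for each reasoning step through a novel "first-try" strategy. Compared with the coarse question-level strategy, BoostStep provides more relevant examples and thus steadily improves reasoning quality at each step. BoostStep not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) to refine candidate generation and decision-making. Experiments show that on various mathematical benchmarks BoostStep improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively, and yields a 7.5% gain when combined with MCTS.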

Link: https://arxiv.org/abs/2501.03226
Authors: Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang
Affiliations: Shanghai AI Laboratory; Shanghai Jiao Tong University; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Codes and Data are available at this https URL

Abstract:Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity-mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly failed by inaccurate reasoning within a few conquer steps, while the ICL examples retrieved in question-grained sometimes lack relevant steps for a specific challenging reasoning step. Further, this disconnect may hinder the correct reasoning due to its irrelevance. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns the granularity between the retrieving and reasoning on step grained, and provides highly related ICL examples for each reasoning step with a novel `first-try’ strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, enhancing the model reasoning quality within each step steadily. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search methods (MCTS) to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and 7.5% gain combined with MCTS.
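The step-level retrieval idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not spell out how the "first-try" strategy works, so the sketch assumes a tentative draft of the next step is used as the retrieval query, and it swaps in a simple word-overlap (Jaccard) similarity and a hypothetical `step_bank` in place of whatever retriever and example store the authors actually use.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two step descriptions."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_step(draft_step, step_bank, threshold=0.3):
    """Return the stored step most similar to the draft step.

    Returns None when nothing is similar enough, so that an irrelevant
    example is never injected into the prompt (hypothetical threshold).
    """
    best = max(step_bank, key=lambda s: jaccard(draft_step, s), default=None)
    if best is None or jaccard(draft_step, best) < threshold:
        return None
    return best


# Hypothetical bank of solved reasoning steps, indexed per step rather than per question.
step_bank = [
    "apply the quadratic formula to solve for x",
    "use the pythagorean theorem to find the hypotenuse",
    "factor the polynomial into two binomials",
]

# Assumed "first try": a tentative draft of the next step serves as the query.
draft = "solve for x using the quadratic formula"
example = retrieve_step(draft, step_bank)
unrelated = retrieve_step("compute the derivative", step_bank)
```

In this sketch the retrieved `example` would be added to the prompt before the step is regenerated, while a `None` result means the model proceeds without an example for that step.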

[NLP-1] Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

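[Quick Read]: This paper tackles a difficulty in evaluating vision language models (VLMs): existing visual question answering (VQA) benchmarks rely on open-ended questions, and the variability of natural language responses makes such questions hard to evaluate objectively. The paper proposes AutoConverter, a framework that automatically converts open-ended questions into multiple-choice format, enabling more objective evaluation while reducing the costly question creation process. AutoConverter generates correct and challenging multiple-choice questions; in experiments, VLMs achieve similar or lower accuracy on the converted questions than on human-created ones. Using AutoConverter, the authors build VMCBench, a benchmark that converts 20 existing VQA datasets into a unified multiple-choice format totaling 9,018 questions, and comprehensively evaluate 33 state-of-the-art VLMs, setting a new standard for scalable, consistent, and reproducible VLM evaluation.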

Link: https://arxiv.org/abs/2501.03225
Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy
Affiliations: Stanford University; Tsinghua University; MIT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.

[NLP-2] Leveraging Explainable AI for LLM Text Attribution: Differentiating Human-Written and Multiple LLMs-Generated Text

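[Quick Read]: This paper addresses the identification of content produced by generative AI large language models (LLMs), particularly in academic settings, where students' heavy reliance on these tools can hinder the development of their writing or coding skills and raise plagiarism concerns. The study hypothesizes that LLM-generated text is detectable by machine learning (ML), explores several ML and deep learning (DL) algorithms such as Random Forest (RF) and Recurrent Neural Networks (RNN), and uses Explainable Artificial Intelligence (XAI) to understand which features matter for attribution. The method has two parts: 1) binary classification to distinguish human-written from AI-generated text; 2) multi-class classification to distinguish human-written text from text generated by five LLM tools (ChatGPT, LLaMA, Google Bard, Claude, and Perplexity). Results show high accuracy in both settings: the model reaches 98.5% accuracy versus 78.3% for GPTZero, and it recognizes the complete test dataset. The XAI results further show that understanding feature importance across classes enables detailed author/source profiles, supporting plagiarism detection and content originality verification.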

Link: https://arxiv.org/abs/2501.03212
Authors: Ayat Najjar, Huthaifa I. Ashqar, Omar Darwish, Eman Hammad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:The development of Generative AI Large Language Models (LLMs) raised the alarm regarding identifying content produced through generative AI or humans. In one case, issues arise when students heavily rely on such tools in a manner that can affect the development of their writing or coding skills. Other issues of plagiarism also apply. This study aims to support efforts to detect and identify textual content generated using LLM tools. We hypothesize that LLMs-generated text is detectable by machine learning (ML), and investigate ML models that can recognize and differentiate texts generated by multiple LLMs tools. We leverage several ML and Deep Learning (DL) algorithms such as Random Forest (RF), and Recurrent Neural Networks (RNN), and utilized Explainable Artificial Intelligence (XAI) to understand the important features in attribution. Our method is divided into 1) binary classification to differentiate between human-written and AI-text, and 2) multi classification, to differentiate between human-written text and the text generated by the five different LLM tools (ChatGPT, LLaMA, Google Bard, Claude, and Perplexity). Results show high accuracy in the multi and binary classification. Our model outperformed GPTZero with 98.5% accuracy to 78.3%. Notably, GPTZero was unable to recognize about 4.2% of the observations, but our model was able to recognize the complete test dataset. XAI results showed that understanding feature importance across different classes enables detailed author/source profiles. Further, aiding in attribution and supporting plagiarism detection by highlighting unique stylistic and structural elements ensuring robust content originality verification.

[NLP-3] Detecting AI-Generated Text in Educational Content: Leveraging Machine Learning and Explainable AI for Academic Integrity

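[Quick Read]: This paper aims to strengthen academic integrity by detecting AI-generated content in student work. Its core contribution is the CyberHumanAI dataset of 1000 samples, 500 written by humans and 500 generated by ChatGPT. The study evaluates a range of machine learning and deep learning algorithms on this dataset, comparing human-written content with content generated by large language models (LLMs). Traditional machine learning algorithms perform well on the classification task, with XGBoost and Random Forest reaching 83% and 81% accuracy respectively. The study also finds that shorter content is harder to classify. Using Explainable Artificial Intelligence (XAI), it identifies the features that drive the model's predictions: human-written content tends to use practical language (e.g., "use" and "allow"), whereas AI-generated text uses more abstract and formal terms (e.g., "realm" and "employ"). Finally, a comparison with GPTZero shows that the proposed narrowly focused, simple, fine-tuned model performs better when classifying pure AI, pure human, and mixed content, reaching about 77.5% accuracy against 48.5% for GPTZero.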

Link: https://arxiv.org/abs/2501.03203
Authors: Ayat A. Najjar, Huthaifa I. Ashqar, Omar A. Darwish, Eman Hammad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:This study seeks to enhance academic integrity by providing tools to detect AI-generated content in student work using advanced technologies. The findings promote transparency and accountability, helping educators maintain ethical standards and supporting the responsible integration of AI in education. A key contribution of this work is the generation of the CyberHumanAI dataset, which has 1000 observations, 500 of which are written by humans and the other 500 produced by ChatGPT. We evaluate various machine learning (ML) and deep learning (DL) algorithms on the CyberHumanAI dataset comparing human-written and AI-generated content from Large Language Models (LLMs) (i.e., ChatGPT). Results demonstrate that traditional ML algorithms, specifically XGBoost and Random Forest, achieve high performance (83% and 81% accuracies respectively). Results also show that classifying shorter content seems to be more challenging than classifying longer content. Further, using Explainable Artificial Intelligence (XAI) we identify discriminative features influencing the ML model’s predictions, where human-written content tends to use a practical language (e.g., use and allow). Meanwhile AI-generated text is characterized by more abstract and formal terms (e.g., realm and employ). Finally, a comparative analysis with GPTZero show that our narrowly focused, simple, and fine-tuned model can outperform generalized systems like GPTZero. The proposed model achieved approximately 77.5% accuracy compared to GPTZero’s 48.5% accuracy when tasked to classify Pure AI, Pure Human, and mixed class. GPTZero showed a tendency to classify challenging and small-content cases as either mixed or unrecognized while our proposed model showed a more balanced performance across the three classes.

[NLP-4] The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input

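[Quick Read]: This paper addresses how to ensure that text generated by a language model is factually accurate with respect to the context it is given. The authors introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates whether a language model can produce long-form responses that are factually accurate with respect to the context document in the user prompt. The key to the solution is a two-phase automated evaluation: first, the response must fulfill the user request; second, the response must be fully grounded in the provided context document. To mitigate evaluation bias, the final factuality score aggregates the results of multiple automated judge models. The FACTS Grounding leaderboard contains both public and private splits, allowing external participation while guarding the integrity of the leaderboard.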

Link: https://arxiv.org/abs/2501.03200
Authors: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models’ ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at this https URL.
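The two-phase scoring described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract says the final score is "an aggregate of multiple judge models" without giving the formula, so a plain mean over judges is assumed, and the judge names and verdicts are made up.

```python
from statistics import mean


def judge_score(verdicts):
    """Fraction of responses one judge accepts.

    Each verdict is (fulfills_request, fully_grounded). A response counts
    only if it passes phase 1 (not disqualified) and phase 2 (grounded).
    """
    return mean(1.0 if (fulfills and grounded) else 0.0
                for fulfills, grounded in verdicts)


def factuality_score(verdicts_by_judge):
    """Aggregate across judges; a simple mean is an assumption here."""
    return mean(judge_score(v) for v in verdicts_by_judge.values())


# Hypothetical verdicts from two judge models on the same four responses.
verdicts_by_judge = {
    "judge_a": [(True, True), (True, False), (False, True), (True, True)],
    "judge_b": [(True, True), (True, True), (False, False), (True, True)],
}
score = factuality_score(verdicts_by_judge)
```

Note that the third response is grounded according to `judge_a` but still scores zero, because it fails the phase-1 check on fulfilling the user request.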

[NLP-5] CLIX: Cross-Lingual Explanations of Idiomatic Expressions

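[Quick Read]: This paper addresses the main barrier facing automated definition generation systems that support vocabulary expansion for language learners: learners often struggle to understand the definitions because they may contain unfamiliar words and grammar, particularly when non-standard language is involved. To address this, the paper proposes CLIX, the task of Cross-Lingual explanations of Idiomatic eXpressions. The authors explore the capabilities of current natural language processing (NLP) models on this task and find that, although the task remains challenging, large language models show promise. A detailed error analysis then highlights the key challenges that must be resolved before such systems can be reliably incorporated into educational tools.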

Link: https://arxiv.org/abs/2501.03191
Authors: Aaron Gluck, Katharina von der Wense, Maria Pacheco
Affiliations: University of Colorado Boulder; Johannes Gutenberg University Mainz
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Automated definition generation systems have been proposed to support vocabulary expansion for language learners. The main barrier to the success of these systems is that learners often struggle to understand definitions due to the presence of potentially unfamiliar words and grammar, particularly when non-standard language is involved. To address these challenges, we propose CLIX, the task of Cross-Lingual explanations of Idiomatic eXpressions. We explore the capabilities of current NLP models for this task, and observe that while it remains challenging, large language models show promise. Finally, we perform a detailed error analysis to highlight the key challenges that need to be addressed before we can reliably incorporate these systems into educational tools.

[NLP-6] Classifier-Guided Captioning Across Modalities

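[Quick Read]: This paper addresses the difficulty that most current captioning systems, trained on data from specific settings, have in generalizing to other modality distributions and contexts. This limitation hurts performance in tasks such as audio or video captioning, which require different semantic cues. The paper proposes a method that adapts captioning networks to the semantics of alternative settings, for example capturing audibility in audio captioning, where sounds and their sources must be described. The framework combines a frozen captioning system that incorporates a language model (LM) with a text classifier that guides it; the classifier is trained on a dataset automatically generated by GPT-4 using prompts designed to enhance key aspects of the generated captions. Importantly, the framework operates only at inference time and requires no further training of the underlying captioning model. Experiments show that, combined with an existing zero-shot audio captioning system, the framework improves caption quality and achieves state-of-the-art performance in zero-shot audio captioning.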

Link: https://arxiv.org/abs/2501.03183
Authors: Ariel Shaulov, Tal Shaharabany, Eitan Shaar, Gal Chechik, Lior Wolf
Affiliations: Tel Aviv University; Bar-Ilan University; NVIDIA, Israel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Most current captioning systems use language models trained on data from specific settings, such as image-based captioning via Amazon Mechanical Turk, limiting their ability to generalize to other modality distributions and contexts. This limitation hinders performance in tasks like audio or video captioning, where different semantic cues are needed. Addressing this challenge is crucial for creating more adaptable and versatile captioning frameworks applicable across diverse real-world contexts. In this work, we introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning, where it is crucial to describe sounds and their sources. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. The classifier is trained on a dataset automatically generated by GPT-4, using tailored prompts specifically designed to enhance key aspects of the generated captions. Importantly, the framework operates solely during inference, eliminating the need for further training of the underlying captioning model. We evaluate the framework on various models and modalities, with a focus on audio captioning, and report promising results. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
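A toy sketch can illustrate what inference-time classifier guidance means. It is a deliberate simplification and not the authors' method: instead of steering the language model during decoding, it reranks complete candidate captions by adding a weighted classifier log-probability to the frozen captioner's log-probability. The word list, the smoothing, and the candidate scores are all hypothetical stand-ins for the trained audibility classifier and the real captioner.

```python
import math

# Hypothetical vocabulary standing in for a trained audibility classifier.
AUDIBLE_WORDS = {"barking", "ringing", "honking", "singing", "rumbling"}


def audibility_prob(caption):
    """Toy classifier: smoothed share of sound-related words in the caption."""
    words = caption.lower().split()
    hits = sum(w in AUDIBLE_WORDS for w in words)
    return (hits + 1) / (len(words) + 2)


def guided_score(lm_logprob, caption, weight=1.0):
    """Frozen captioner log-probability plus weighted classifier log-probability."""
    return lm_logprob + weight * math.log(audibility_prob(caption))


def pick_caption(candidates, weight=1.0):
    """Choose the candidate (caption, lm_logprob) with the best guided score."""
    return max(candidates, key=lambda c: guided_score(c[1], c[0], weight))[0]


# Made-up candidates with made-up log-probabilities from a frozen captioner.
candidates = [
    ("a brown dog on the grass", -2.0),
    ("a dog barking on the grass", -2.3),
]
best = pick_caption(candidates, weight=2.0)
unguided = pick_caption(candidates, weight=0.0)
```

With `weight=0.0` the captioner's own preference wins, while a positive weight shifts the choice toward the caption that describes a sound, without any retraining of the captioner.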

[NLP-7] Boosting Explainability through Selective Rationalization in Pre-trained Language Models KDD2025

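[Quick Read]: This paper addresses the severe degeneration and failure problems that arise when selective rationalization is applied to pre-trained language models (PLMs) in natural language processing (NLP) tasks. These problems produce sub-optimal or meaningless rationales, which damages trust in rationalization methods and constrains their application to PLMs. The paper finds that the homogeneity of tokens in the sentences produced by PLMs is the primary cause.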

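To address these challenges, the paper proposes Pre-trained Language Model's Rationalization (PLMR). The key idea is to split the PLM into a generator and a predictor: the generator alleviates homogeneity by pruning irrelevant tokens, while the predictor uses full-text information to standardize predictions. Experiments on two widely used datasets across multiple PLMs show that PLMR effectively addresses the challenge of applying selective rationalization to PLMs.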

Link: https://arxiv.org/abs/2501.03182
Authors: Libing Yuan, Shuaibo Hu, Kui Yu, Le Wu
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: KDD 2025 research track

Abstract:The widespread application of pre-trained language models (PLMs) in natural language processing (NLP) has led to increasing concerns about their explainability. Selective rationalization is a self-explanatory framework that selects human-intelligible input subsets as rationales for predictions. Recent studies have shown that applying existing rationalization frameworks to PLMs will result in severe degeneration and failure problems, producing sub-optimal or meaningless rationales. Such failures severely damage trust in rationalization methods and constrain the application of rationalization techniques on PLMs. In this paper, we find that the homogeneity of tokens in the sentences produced by PLMs is the primary contributor to these problems. To address these challenges, we propose a method named Pre-trained Language Model’s Rationalization (PLMR), which splits PLMs into a generator and a predictor to deal with NLP tasks while providing interpretable rationales. The generator in PLMR also alleviates homogeneity by pruning irrelevant tokens, while the predictor uses full-text information to standardize predictions. Experiments conducted on two widely used datasets across multiple PLMs demonstrate the effectiveness of the proposed method PLMR in addressing the challenge of applying selective rationalization to PLMs. Codes: this https URL.

[NLP-8] GLiREL – Generalist Model for Zero-Shot Relation Extraction NAACL2025

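[Quick Read]: This paper targets efficiency and accuracy in zero-shot relation classification. Existing methods often perform poorly on unseen relation labels and are computationally expensive. The paper proposes GLiREL (Generalist Lightweight model for zero-shot Relation Extraction), an efficient architecture and training paradigm that accurately predicts zero-shot relation labels between multiple entities in a single forward pass. The approach draws on recent advances in zero-shot named entity recognition, and experiments on the FewRel and WikiZSL benchmarks show state-of-the-art results. The paper also contributes a protocol for synthetically generating datasets with diverse relation labels.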

Link: https://arxiv.org/abs/2501.03172
Authors: Jack Boylan, Chris Hokamp, Demian Gholipour Ghalandari
Affiliations: Quantexa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to NAACL 2025

Abstract:We introduce GLiREL (Generalist Lightweight model for zero-shot Relation Extraction), an efficient architecture and training paradigm for zero-shot relation classification. Inspired by recent advancements in zero-shot named entity recognition, this work presents an approach to efficiently and accurately predict zero-shot relationship labels between multiple entities in a single forward pass. Experiments using the FewRel and WikiZSL benchmarks demonstrate that our approach achieves state-of-the-art results on the zero-shot relation classification task. In addition, we contribute a protocol for synthetically-generating datasets with diverse relation labels.

[NLP-9] Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text COLING’25

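[Quick Read]: This paper addresses semantic captioning, specifically the task of translating SQL queries into natural language (SQL2Text). The task is increasingly important as large language models (LLMs) are integrated into platforms for code generation, security analysis, and education; when LLM-generated code poses potential security risks, the need to understand and explain SQL queries becomes pressing. The key to the solution is repurposing Text2SQL datasets with an iterative in-context learning (ICL) prompt using GPT-4o to generate multiple additional natural language utterances, which makes the datasets more robust for the reverse task. Experiments show that using the inherent graph properties of SQL for ICL sample selection outperforms random selection by up to 39% in BLEU score and gives better results than alternative methods.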

Link: https://arxiv.org/abs/2501.03166
Authors: Ali Al-Lawati, Jason Lucas, Prasenjit Mitra
Affiliations: The Pennsylvania State University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to COLING’25

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance in various NLP tasks, including semantic parsing, which translates natural language into formal code representations. However, the reverse process, translating code into natural language, termed semantic captioning, has received less attention. This task is becoming increasingly important as LLMs are integrated into platforms for code generation, security analysis, and educational purposes. In this paper, we focus on the captioning of SQL query (SQL2Text) to address the critical need for understanding and explaining SQL queries in an era where LLM-generated code poses potential security risks. We repurpose Text2SQL datasets for SQL2Text by introducing an iterative ICL prompt using GPT-4o to generate multiple additional utterances, which enhances the robustness of the datasets for the reverse task. We conduct our experiments using in-context learning (ICL) based on different sample selection methods, emphasizing smaller, more computationally efficient LLMs. Our findings demonstrate that leveraging the inherent graph properties of SQL for ICL sample selection significantly outperforms random selection by up to 39% on BLEU score and provides better results than alternative methods. Dataset and codes are published: this https URL.

[NLP-10] VicSim: Enhancing Victim Simulation with Emotional and Linguistic Fidelity

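[Quick Read]: This paper addresses how large language models (LLMs) can be used to simulate victims for scenario-based training. Existing LLMs show promise in simulating diverse personas, but victim simulation remains little studied. The paper proposes VicSim, a model that improves realism along three key dimensions: informational faithfulness, emotional dynamics, and language style. VicSim's novelty lies in combining scenario-based victim modeling with a GAN-based training workflow and key-information-based prompting to make simulated victims more realistic. Through adversarial training, the discriminator learns to recognize grammar and emotional cues as reliable indicators of synthetic content. According to human raters, VicSim outperforms GPT-4 in human-likeness.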

Link: https://arxiv.org/abs/2501.03139
Authors: Yerong Li, Yiren Liu, Yun Huang
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 10 figures

Abstract:Scenario-based training has been widely adopted in many public service sectors. Recent advancements in Large Language Models (LLMs) have shown promise in simulating diverse personas to create these training scenarios. However, little is known about how LLMs can be developed to simulate victims for scenario-based training purposes. In this paper, we introduce VicSim (victim simulator), a novel model that addresses three key dimensions of user simulation: informational faithfulness, emotional dynamics, and language style (e.g., grammar usage). We pioneer the integration of scenario-based victim modeling with GAN-based training workflow and key-information-based prompting, aiming to enhance the realism of simulated victims. Our adversarial training approach teaches the discriminator to recognize grammar and emotional cues as reliable indicators of synthetic content. According to evaluations by human raters, the VicSim model outperforms GPT-4 in terms of human-likeness.

[NLP-11] PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

Link: https://arxiv.org/abs/2501.03124
Authors: Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: https://prmbench.github.io/

[NLP-12] LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases ALT

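[Quick Read]: This paper addresses the bias that large language models (LLMs) can exhibit in applications, which may harm specific groups identified by protected attributes such as sex, race, sexual orientation, or age. The paper introduces LangFair, an open-source Python package that aims to give LLM practitioners the tools to evaluate bias and fairness risks. Its key functionality is generating use-case-specific evaluation datasets, made up of LLM responses to use-case-specific prompts, and then calculating the metrics applicable to that use case. LangFair also provides an actionable decision framework to guide practitioners in metric selection.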

Link: https://arxiv.org/abs/2501.03112
Authors: Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Viren Bajaj, Zeya Ahmad
Affiliations: CVS Health Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Journal of Open Source Software; LangFair repository: this https URL

Abstract:Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce LangFair, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases. The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner’s use case. To guide in metric selection, LangFair offers an actionable decision framework.

[NLP-13] Sentiment-guided Commonsense-aware Response Generation for Mental Health Counseling

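[Quick Read]: This paper addresses barriers in mental health care, namely the shortage of professional therapists, high costs, and mental health stigma, which prevent clients from obtaining effective counseling. The paper proposes EmpRes, a sentiment-guided, commonsense-aware mechanism for Virtual Mental Health Assistants (VMHAs) that generates responses intended to steer the client's sentiment towards positivity. The key to EmpRes is combining foundation models with commonsense knowledge so that it can understand clients' nuanced sentiments and generate effective responses. Evaluated on HOPE, a benchmark counseling dataset, EmpRes shows a marked improvement over existing baselines across a suite of qualitative and quantitative metrics. In a user study of the deployed system, 91% of users found the system effective, 80% expressed satisfaction, and over 85.45% were willing to continue using it and recommend it to others.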

Link: https://arxiv.org/abs/2501.03088
Authors: Aseem Srivastava, Gauri Naik, Alison Cerezo, Tanmoy Chakraborty, Md. Shad Akhtar
Affiliations: IIIT Delhi, India; University of California, Santa Barbara; Mpathic.ai; IIT Delhi, India
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The crisis of mental health issues is escalating. Effective counseling serves as a critical lifeline for individuals suffering from conditions like PTSD, stress, etc. Therapists forge a crucial therapeutic bond with clients, steering them towards positivity. Unfortunately, the massive shortage of professionals, high costs, and mental health stigma pose significant barriers to consulting therapists. As a substitute, Virtual Mental Health Assistants (VMHAs) have emerged in the digital healthcare space. However, most existing VMHAs lack the commonsense to understand the nuanced sentiments of clients to generate effective responses. To this end, we propose EmpRes, a novel sentiment-guided mechanism incorporating commonsense awareness for generating responses. By leveraging foundation models and harnessing commonsense knowledge, EmpRes aims to generate responses that effectively shape the client’s sentiment towards positivity. We evaluate the performance of EmpRes on HOPE, a benchmark counseling dataset, and observe a remarkable performance improvement compared to the existing baselines across a suite of qualitative and quantitative metrics. Moreover, our extensive empirical analysis and human evaluation show that the generation ability of EmpRes is well-suited and, in some cases, surpasses the gold standard. Further, we deploy EmpRes as a chat interface for users seeking mental health support. We address the deployed system’s effectiveness through an exhaustive user study with a significant positive response. Our findings show that 91% of users find the system effective, 80% express satisfaction, and over 85.45% convey a willingness to continue using the interface and recommend it to others, demonstrating the practical applicability of EmpRes in addressing the pressing challenges of mental health support, emphasizing user feedback, and ethical considerations in a real-world context.

[NLP-14] Trust Modeling in Counseling Conversations: A Benchmark Study

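[Quick Read]: This paper addresses the quantification of trust between patient and therapist in mental health counseling. Although many earlier studies focus on dialogue modeling, most give little attention to the quality of interaction between patient and therapist. The therapeutic bond directly correlates with effective counseling and involves developing the patient's trust in the therapist over the course of counseling. The paper introduces trust as a therapist-assistive metric, defined as the patient's willingness and openness to express themselves and, consequently, receive better care. To model trust, the authors present MENTAL-TRUST, a dataset of 212 manually annotated counseling sessions with seven expert-verified ordinal trust levels, the first of its kind. They frame trust quantification as an ordinal classification task and propose TrustBench, a benchmark that evaluates a suite of classical and state-of-the-art language models on MENTAL-TRUST. The study aims to reveal how trust evolves in therapeutic interactions.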

Link: https://arxiv.org/abs/2501.03064
Authors: Aseem Srivastava, Zuhair Hasan Shaik, Tanmoy Chakraborty, Md Shad Akhtar
Affiliations: IIIT Delhi, India; IIIT Dharwad, India; IIT Delhi, India
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In mental health counseling, a variety of earlier studies have focused on dialogue modeling. However, most of these studies give limited to no emphasis on the quality of interaction between a patient and a therapist. The therapeutic bond between a patient and a therapist directly correlates with effective mental health counseling. It involves developing the patient’s trust on the therapist over the course of counseling. To assess the therapeutic bond in counseling, we introduce trust as a therapist-assistive metric. Our definition of trust involves patients’ willingness and openness to express themselves and, consequently, receive better care. We conceptualize it as a dynamic trajectory observable through textual interactions during the counseling. To facilitate trust modeling, we present MENTAL-TRUST, a novel counseling dataset comprising manual annotation of 212 counseling sessions with first-of-its-kind seven expert-verified ordinal trust levels. We project our problem statement as an ordinal classification task for trust quantification and propose a new benchmark, TrustBench, comprising a suite of classical and state-of-the-art language models on MENTAL-TRUST. We evaluate the performance across a suite of metrics and lay out an exhaustive set of findings. Our study aims to unfold how trust evolves in therapeutic interactions.

[NLP-15] ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events

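[Quick Read]: This paper addresses the weakness of large language models (LLMs) in temporal reasoning, in particular the lack of comprehensive testing of Allen's interval relations. Allen's interval relations (e.g., before, after, during) are a fundamental framework for temporal relationships. Although LLMs have achieved remarkable success in various natural language processing (NLP) tasks, they still face significant challenges in reasoning and arithmetic. To fill this gap, the authors present ChronoSense, a new benchmark for evaluating LLMs' temporal understanding. It includes 16 tasks that focus on identifying the Allen relation between two temporal events and on temporal arithmetic, using both abstract events and real-world data from Wikidata. An evaluation of seven recent LLMs shows that the models handle Allen relations, even symmetrical ones, quite differently, and that they may rely on memorization to answer time-related questions. The models' low overall performance highlights the need for better temporal understanding in LLMs, and ChronoSense offers a robust framework for future research.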

Link: https://arxiv.org/abs/2501.03040
Authors: Duygu Sezen Islakoglu, Jan-Christoph Kalo
Affiliations: Utrecht University; University of Amsterdam
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 2 figures

Abstract:Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, yet they still face significant challenges in reasoning and arithmetic. Temporal reasoning, a critical component of natural language understanding, has raised increasing research attention. However, comprehensive testing of Allen’s interval relations (e.g., before, after, during) – a fundamental framework for temporal relationships – remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs’ temporal understanding. It includes 16 tasks, focusing on identifying the Allen relation between two temporal events and temporal arithmetic, using both abstract events and real-world data from Wikidata. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models’ low performance highlights the need for improved temporal understanding in LLMs and ChronoSense offers a robust framework for future research in this area. Our dataset and the source code are available at this https URL.
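For readers unfamiliar with Allen's interval relations, the following sketch classifies the relation between two intervals given as start and end points. It illustrates the standard thirteen relations only; the function name and the relation labels are choices made here, and nothing in it reflects how ChronoSense phrases or scores its tasks.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen relation of interval A with respect to interval B."""
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("each interval must have start < end")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining case: the intervals partially overlap with no shared endpoint.
    return "overlaps" if a_start < b_start else "overlapped-by"


# Illustrative year ranges (arbitrary example values).
relation = allen_relation(1939, 1945, 1950, 1953)
```

Each relation has an inverse obtained by swapping the two intervals (for example, "before" and "after"), which gives a simple consistency check for any system that answers such questions.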

[NLP-16] Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning

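[Quick Read]: This paper addresses the high computational requirements of large language models (LLMs) on complex mathematical reasoning benchmarks such as MATH. Although these models have made significant advances in mathematical reasoning, their computational cost and memory footprint limit practical deployment. Model quantization reduces memory usage and computational cost by using lower precision and lower bit-width representations. The study systematically evaluates the impact of quantization on mathematical reasoning tasks through a multidimensional evaluation framework, finds that quantization affects numerical computation and reasoning planning abilities differently, and identifies the key areas where quantized models degrade.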

Link: https://arxiv.org/abs/2501.03035
Authors: Zhen Li, Yupeng Su, Runming Yang, Zhongwei Xie, Ngai Wong, Hongxia Yang
Affiliations: The Hong Kong Polytechnic University; Southern University of Science and Technology; Tsinghua University; Wuhan University; The University of Hong Kong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages

Abstract:Large language models have achieved significant advancements in complex mathematical reasoning benchmarks, such as MATH. However, their substantial computational requirements present challenges for practical deployment. Model quantization has emerged as an effective strategy to reduce memory usage and computational costs by employing lower precision and bit-width representations. In this study, we systematically evaluate the impact of quantization on mathematical reasoning tasks. We introduce a multidimensional evaluation framework that qualitatively assesses specific capability dimensions and conduct quantitative analyses on the step-by-step outputs of various quantization methods. Our results demonstrate that quantization differentially affects numerical computation and reasoning planning abilities, identifying key areas where quantized models experience performance degradation.
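
As a rough illustration of what "lower precision and bit-width representations" means, the sketch below applies round-to-nearest symmetric quantization to a handful of weights at 8, 4, and 2 bits. It is a generic textbook scheme, not one of the quantization methods evaluated in the paper; the function names and the sample weights are made up for the example.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization to signed `bits`-bit integers.

    Generic illustration only; returns the integer codes and the scale.
    """
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [0.82, -0.41, 0.05, -1.27, 0.33]   # made-up example weights
for bits in (8, 4, 2):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} max abs error={max_err:.4f}")
```

The reconstruction error grows as the bit width shrinks, which is the kind of degradation the paper measures on step-by-step reasoning outputs.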

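[NLP-17] Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment

[Quick read]: This paper seeks to understand how the internal (hidden state) representations of multimodal LLMs change during fine-tuning. Although multimodal LLMs are proficient at understanding multimodal inputs, most existing work examines models only in their final state and overlooks the representational shifts that occur during training. By systematically analyzing the evolution of hidden states, the paper shows how fine-tuning alters a model's internal structure so that it can specialize in new multimodal tasks.

The key to the approach is a concept-based method that maps hidden states to interpretable visual and textual concepts, making it possible to trace changes in encoded concepts across modalities as training progresses. The paper also introduces shift vectors to capture these concept changes and recovers fine-tuned concepts by shifting those of the original model. Beyond revealing how multimodal representations evolve during fine-tuning, this offers a new way to steer model behavior: without any additional training, shift vectors can be used to modify answer types or caption styles, or to bias the model toward specific responses.

Link: https://arxiv.org/abs/2501.03012
Authors: Pegah Khayatan, Mustafa Shukor, Jayneel Parekh, Matthieu Cord
Affiliations: ISIR, Sorbonne Université, Paris, France; Valeo.ai, Paris, France
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: The first three authors contributed equally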

Abstract:Multimodal LLMs have reached remarkable levels of proficiency in understanding multimodal inputs, driving extensive research to develop increasingly powerful models. However, much less attention has been paid to understanding and explaining the underlying mechanisms of these models. Most existing explainability research examines these models only in their final states, overlooking the dynamic representational shifts that occur during training. In this work, we systematically analyze the evolution of hidden state representations to reveal how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace changes in encoded concepts across modalities as training progresses. We also demonstrate the use of shift vectors to capture these concepts changes. These shift vectors allow us to recover fine-tuned concepts by shifting those in the original model. Finally, we explore the practical impact of our findings on model steering, showing that we can adjust multimodal LLMs behaviors without any training, such as modifying answer types, captions style, or biasing the model toward specific responses. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks. The code for this project is publicly available at this https URL.
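
The abstract does not spell out how a shift vector is computed, so the following is only one plausible reading: take the mean difference between fine-tuned and original hidden states for the same inputs, then add a scaled copy of that difference to a representation of the original model. All function names and the toy two-dimensional states are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def shift_vector(original_states, finetuned_states):
    """Mean of (fine-tuned - original) hidden states over paired samples.

    Assumed definition for illustration; the paper may define it differently.
    """
    diffs = [
        [f - o for o, f in zip(orig, fine)]
        for orig, fine in zip(original_states, finetuned_states)
    ]
    return mean_vector(diffs)


def apply_shift(vector, shift, alpha=1.0):
    """Steer a representation by adding a scaled shift vector."""
    return [v + alpha * s for v, s in zip(vector, shift)]


# Toy hidden states for two samples before and after fine-tuning.
original = [[1.0, 0.0], [3.0, 2.0]]
finetuned = [[2.0, 1.0], [4.0, 3.0]]
shift = shift_vector(original, finetuned)
print(apply_shift([0.0, 0.0], shift, alpha=0.5))
```

The scaling factor `alpha` is a hypothetical knob that would control how strongly the original model is pushed toward the fine-tuned behavior.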

[NLP-18] Quality Estimation based Feedback Training for Improving Pronoun Translation

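[Quick read]: This paper tackles pronoun translation, a longstanding challenge in neural machine translation (NMT), especially when inter-sentential context is needed for linguistic accuracy. The authors propose ProNMT, a framework that combines Quality Estimation (QE) models with a pronoun generation likelihood-based feedback mechanism to iteratively fine-tune pre-trained NMT models without relying on extensive human annotation. Its key idea is to combine QE scores with pronoun-specific rewards to guide training, so that linguistic nuances are handled better. Experiments show significant gains in both pronoun translation accuracy and overall translation quality, making ProNMT an efficient, scalable, and context-aware way to improve NMT systems, particularly for context-dependent elements such as pronouns.

Link: https://arxiv.org/abs/2501.03008
Authors: Harshit Dhankhar, Baban Gain, Asif Ekbal, Yogesh Mani Tripathi
Affiliations: Indian Institute of Technology Patna; Indian Institute of Technology Jodhpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: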

Abstract:Pronoun translation is a longstanding challenge in neural machine translation (NMT), often requiring inter-sentential context to ensure linguistic accuracy. To address this, we introduce ProNMT, a novel framework designed to enhance pronoun and overall translation quality in context-aware machine translation systems. ProNMT leverages Quality Estimation (QE) models and a unique Pronoun Generation Likelihood-Based Feedback mechanism to iteratively fine-tune pre-trained NMT models without relying on extensive human annotations. The framework combines QE scores with pronoun-specific rewards to guide training, ensuring improved handling of linguistic nuances. Extensive experiments demonstrate significant gains in pronoun translation accuracy and general translation quality across multiple metrics. ProNMT offers an efficient, scalable, and context-aware approach to improving NMT systems, particularly in translating context-dependent elements like pronouns.

[NLP-19] CALM: Curiosity-Driven Auditing for Large Language Models AAAI2025

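[Quick read]: This paper studies how to automatically uncover input-output pairs of a target large language model (LLM) that exhibit illegal, immoral, or unsafe behavior, using black-box optimization and without access to the model's parameters. Specifically, it looks for non-toxic inputs to which the target LLM responds with toxic output, and for inputs that induce hallucinated responses involving politically sensitive individuals. This black-box optimization problem is hard because feasible points are scarce, the prompt space is discrete, and the search space is large.

The key to the solution is Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to fine-tune an LLM as an auditor agent that uncovers potentially harmful and biased input-output pairs of the target LLM. In the black-box setting, CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names. The work offers a promising direction for auditing black-box LLMs.

Link: https://arxiv.org/abs/2501.02997
Authors: Xiang Zheng, Longxiang Wang, Yi Liu, Xingjun Ma, Chao Shen, Cong Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by AAAI 2025 AI Alignment Track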

Abstract:Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output or an input that induces the hallucinative response from the target LLM containing politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potential harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at this https URL.

[NLP-20] Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation

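[Quick read]: This paper addresses the fact that multilingual neural machine translation (MNMT) models still lag behind large language models (LLMs) in performance, which limits their practicality. It introduces a mechanism called registering to improve decoder-only MNMT models. The key step is to insert a set of artificial tokens, called registers, that specify the target language into the input sequence between the source and target tokens. With a modified attention mask, target token generation attends only to the activations of the registers, which represent the source tokens in the target language space. Experiments on the large-scale EC-40 benchmark show that the method outperforms related methods driven by optimizing multilingual representations. The paper further scales the approach by pre-training two models (MITRE-913M and MITRE-3.3B); MITRE-913M outperforms NLLB-3.3B, performs comparably to commercial LLMs, and adapts well in fine-tuning.

Link: https://arxiv.org/abs/2501.02979
Authors: Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
Affiliations: Nara Institute of Science and Technology, Japan; National Institute of Information and Communications Technology, Japan; Gifu University, Japan
Categories: Computation and Language (cs.CL)
Comments: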

Abstract:The multilingual neural machine translation (MNMT) enables arbitrary translations across multiple languages by training a model with limited parameters using parallel data only. However, the performance of such MNMT models still lags behind that of large language models (LLMs), limiting their practicality. In this work, we address this limitation by introducing registering to achieve the new state-of-the-art of decoder-only MNMT models. Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens. By modifying the attention mask, the target token generation only pays attention to the activation of registers, representing the source tokens in the target language space. Experiments on EC-40, a large-scale benchmark, show that our method outperforms related methods driven by optimizing multilingual representations. We further scale up and collect 9.3 billion sentence pairs across 24 languages from public datasets to pre-train two models, namely MITRE (multilingual translation with registers). One of them, MITRE-913M, outperforms NLLB-3.3B, achieves comparable performance with commercial LLMs, and shows strong adaptability in fine-tuning. Finally, we open-source our models to facilitate further research and development in MNMT: this https URL.
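
The attention-mask modification described above can be sketched as follows for a sequence laid out as source tokens, then registers, then target tokens. This is a guess at the shape of the mask rather than the authors' implementation: it assumes ordinary causal attention everywhere, plus a rule that target positions may not read source positions directly (they still see the registers and earlier target tokens). The function name is hypothetical.

```python
def register_attention_mask(n_src, n_reg, n_tgt):
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Layout assumed for illustration: [source | registers | target].
    """
    total = n_src + n_reg + n_tgt
    first_tgt = n_src + n_reg
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):                 # causal: never look ahead
            query_is_target = i >= first_tgt
            key_is_source = j < n_src
            # Target positions are blocked from the raw source tokens,
            # so source information reaches them only through the registers.
            mask[i][j] = not (query_is_target and key_is_source)
    return mask


# 2 source tokens, 1 register, 2 target tokens.
for row in register_attention_mask(2, 1, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the rows for target positions have dots under the source columns, which is the only difference from a plain causal mask.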

[NLP-21] Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis

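[Quick read]: This paper addresses the interpretability of models that automatically classify humour styles. Machine learning models have been used to identify humour styles automatically, but they are typically black boxes whose prediction decisions are opaque, which matters particularly in mental health. The paper presents an explainable AI (XAI) framework that analyses how linguistic, emotional, and semantic features contribute to humour style classification decisions. The key steps are to take the best-performing single model from prior research (ALI+XGBoost) and apply comprehensive XAI techniques to examine how different humour styles are characterised and misclassified, with particular attention to the difficulty of distinguishing affiliative humour from other styles. Through detailed analysis of feature importance, error patterns, and misclassification cases, the paper identifies key factors influencing model decisions, including emotional ambiguity, context misinterpretation, and target identification. The framework yields interpretable insights into model behaviour and supports practical applications in mental health, content moderation, and digital humanities research.

Link: https://arxiv.org/abs/2501.02891
Authors: Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
Affiliations: Algorithmic Human Development group, Department of Computing, Imperial College London, UK; Computer Engineering Department, California Polytechnic State University, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: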

Abstract:Humour styles can have either a negative or a positive impact on well-being. Given the importance of these styles to mental health, significant research has been conducted on their automatic identification. However, the automated machine learning models used for this purpose are black boxes, making their prediction decisions opaque. Clarity and transparency are vital in the field of mental health. This paper presents an explainable AI (XAI) framework for understanding humour style classification, building upon previous work in computational humour analysis. Using the best-performing single model (ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to analyse how linguistic, emotional, and semantic features contribute to humour style classification decisions. Our analysis reveals distinct patterns in how different humour styles are characterised and misclassified, with particular emphasis on the challenges in distinguishing affiliative humour from other styles. Through detailed examination of feature importance, error patterns, and misclassification cases, we identify key factors influencing model decisions, including emotional ambiguity, context misinterpretation, and target identification. The framework demonstrates significant utility in understanding model behaviour, achieving interpretable insights into the complex interplay of features that define different humour styles. Our findings contribute to both the theoretical understanding of computational humour analysis and practical applications in mental health, content moderation, and digital humanities research.

[NLP-22] IIMedGPT: Promoting Large Language Model Capabilities of Medical Tasks by Efficient Human Preference Alignment

[Quick read]: This paper addresses two main problems that large language models (LLMs) face when responding to human queries: insufficient data to support extensive pre-training, and responses that do not align with users' instructions. The authors introduce CMedINS, a medical instruction dataset containing six kinds of instructions derived from actual medical tasks, which can be combined with other data to fine-tune LLMs effectively. They also present a medical model, IIMedGPT, trained with an efficient preference alignment method, Direct Preference Optimization (DPO). Experiments show that the model outperforms existing medical models on medical tasks.

Link: https://arxiv.org/abs/2501.02869
Authors: Yiming Zhang, Zheng Chang, Wentao Cai, MengXing Ren, Kang Yuan, Yining Sun, Zenghui Ding
Affiliations: Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China; University of Science and Technology of China, Hefei 230026, China; School of Artificial Intelligence and Big Data, Hefei University, Hefei 230601, China; Anhui University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent research on large language models (LLMs), which are pre-trained on massive general-purpose corpora, has achieved breakthroughs in responding to human queries. However, these methods face challenges, including insufficient data to support extensive pre-training and an inability to align responses with users’ instructions. To address these issues, we introduce a medical instruction dataset, CMedINS, containing six medical instructions derived from actual medical tasks, which effectively fine-tunes LLMs in conjunction with other data. Subsequently, we launch our medical model, IIMedGPT, employing an efficient preference alignment method, Direct Preference Optimization (DPO). The results show that our final model outperforms existing medical models in medical this http URL, Code and model checkpoints will be released upon acceptance.

[NLP-23] Graph-based Retrieval Augmented Generation for Dynamic Few-shot Text Classification

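[Quick read]: This paper addresses dynamic few-shot text classification, where labeled data is scarce and target labels change frequently, a setting in which neural models such as CNN and BERT perform poorly. Current solutions rely on large language models (LLMs), but supplying them with input text, candidate labels, and side information (such as descriptions) increases the input size and introduces noise through side information processing. The paper proposes GORAG, a graph-based online retrieval-augmented generation framework. Its key innovation is to construct and maintain an adaptive information graph by extracting side information across all target texts rather than treating each input independently. GORAG uses a weighted edge mechanism to prioritize the importance and reliability of extracted information, and dynamically retrieves relevant context for each text input using a minimum-cost spanning tree. Experiments show that GORAG outperforms existing approaches by providing more comprehensive and accurate contextual information.

Link: https://arxiv.org/abs/2501.02844
Authors: Yubo Wang, Haoyang Li, Fei Teng, Lei Chen
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: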

Abstract:Text classification is a fundamental task in natural language processing, pivotal to various applications such as query optimization, data integration, and schema matching. While neural network-based models, such as CNN and BERT, have demonstrated remarkable performance in text classification, their effectiveness heavily relies on abundant labeled training data. This dependency makes these models less effective in dynamic few-shot text classification, where labeled data is scarce, and target labels frequently evolve based on application needs. Recently, large language models (LLMs) have shown promise due to their extensive pretraining and contextual understanding. Current approaches provide LLMs with text inputs, candidate labels, and additional side information (e.g., descriptions) to predict text labels. However, their effectiveness is hindered by the increased input size and the noise introduced through side information processing. To address these limitations, we propose a graph-based online retrieval-augmented generation framework, namely GORAG, for dynamic few-shot text classification. GORAG constructs and maintains an adaptive information graph by extracting side information across all target texts, rather than treating each input independently. It employs a weighted edge mechanism to prioritize the importance and reliability of extracted information and dynamically retrieves relevant context using a minimum-cost spanning tree tailored for each text input. Empirical evaluations demonstrate that GORAG outperforms existing approaches by providing more comprehensive and accurate contextual information.
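
The minimum-cost spanning tree step can be illustrated with a small Kruskal implementation over a weighted graph. How GORAG assigns edge weights and which nodes it includes for a given text are not described in the abstract, so the node names and costs below are invented purely for the example.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm; `edges` is a list of (cost, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for cost, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two separate components
            parent[root_u] = root_v
            tree.append((cost, u, v))
    return tree


# Hypothetical information graph: one input text, a keyword, two labels.
nodes = ["text", "keyword:refund", "label:billing", "label:shipping"]
edges = [
    (0.2, "text", "keyword:refund"),
    (0.3, "keyword:refund", "label:billing"),
    (0.9, "text", "label:shipping"),
    (0.8, "label:billing", "label:shipping"),
]
for cost, u, v in minimum_spanning_tree(nodes, edges):
    print(f"{u} -- {v} (cost {cost})")
```

Lower cost here stands for a more reliable connection, so the tree keeps the cheapest set of edges that still links the text to every label.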

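[NLP-24] Samba-asr state-of-the-art speech recognition leveraging structured state-space models

[Quick read]: This paper targets the difficulties that Transformer-based automatic speech recognition (ASR) models have with long sequences, namely quadratic scaling with input length and trouble capturing long-range dependencies. The authors propose Samba ASR, the first model to use the Mamba architecture, built on state-space models (SSMs), as both encoder and decoder. Unlike Transformers, which rely on self-attention, Samba ASR models both local and global temporal dependencies with efficient state-space dynamics, which yields marked performance gains. Experiments show that Samba ASR surpasses existing open-source Transformer-based ASR models on several standard benchmarks, with significant improvements in Word Error Rate (WER) and competitive performance in low-resource scenarios. The computational efficiency and parameter optimization of the Mamba architecture also make Samba ASR a scalable and robust solution for diverse ASR tasks.

Link: https://arxiv.org/abs/2501.02832
Authors: Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi
Affiliations: SandLogic Technologies Pvt Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: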

Abstract:We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state-space models (SSMs). Unlike transformer-based ASR models, which rely on self-attention mechanisms to capture dependencies, Samba ASR effectively models both local and global temporal dependencies using efficient state-space dynamics, achieving remarkable performance gains. By addressing the limitations of transformers, such as quadratic scaling with input length and difficulty in handling long-range dependencies, Samba ASR achieves superior accuracy and efficiency. Experimental results demonstrate that Samba ASR surpasses existing open-source transformer-based ASR models across various standard benchmarks, establishing it as the new state of the art in ASR. Extensive evaluations on benchmark datasets show significant improvements in Word Error Rate (WER), with competitive performance even in low-resource scenarios. Furthermore, the computational efficiency and parameter optimization of the Mamba architecture make Samba ASR a scalable and robust solution for diverse ASR tasks. Our contributions include: A new Samba ASR architecture demonstrating the superiority of SSMs over transformer-based models for speech sequence processing. A comprehensive evaluation on public benchmarks showcasing state-of-the-art performance. An analysis of computational efficiency, robustness to noise, and sequence generalization. This work highlights the viability of Mamba SSMs as a transformer-free alternative for efficient and accurate ASR. By leveraging state-space modeling advancements, Samba ASR sets a new benchmark for ASR performance and future research. 
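
To show why a state-space model scales linearly with sequence length, the sketch below runs the basic discrete recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t with a diagonal A over a one-dimensional input. It is a simplified, time-invariant illustration and not the Samba ASR model: Mamba additionally makes the parameters depend on the input (selectivity), which is omitted here, and all names and values are assumptions.

```python
def ssm_scan(inputs, a, b, c):
    """Run a diagonal linear state-space recurrence over a 1-D sequence.

    a, b, c are per-state-dimension coefficients (lists of equal length).
    One pass over the inputs, so cost grows linearly with sequence length.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        # h_t = a * h_{t-1} + b * x_t   (element-wise, diagonal A)
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        # y_t = c . h_t
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# An impulse followed by silence: the output decays geometrically.
print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0]))
```

Each step only touches the fixed-size state, in contrast to self-attention, where every position is compared with every other position.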

[NLP-25] InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion

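[Quick read]: This paper addresses the challenge of getting a single large language model (LLM) to perform consistently well across all domains. It proposes two fusion strategies for combining the strengths of multiple domain-specialized models. The first is a pairwise, multi-step fusion approach that sequentially distills each source model into a pivot model and then merges the weights of the distilled models into the final model; it performs strongly but requires substantial training resources. The second is a unified fusion approach that aggregates the outputs of all source models. To improve fusion, the paper introduces Rate-Skewness Adaptive Fusion (RSAF), a technique that dynamically adjusts top-K ratios during parameter merging for greater flexibility. It also proposes an uncertainty-based weighting method for the unified approach that dynamically balances the contributions of the source models and significantly improves accuracy on several tasks.

Link: https://arxiv.org/abs/2501.02795
Authors: Zhaoyi Yan, Zhijie Sang, Yiming Zhang, Yuhao Fu, Baoyi He, Qi Zhou, Yining Di, Chunlin Ji, Shengyu Zhang, Fei Wu, Hongxia Yang
Affiliations: Reallm Labs; The Hong Kong Polytechnic University; Zhejiang University; Harbin Institute of Technology, Shenzhen; Independent
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review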

Abstract:Large Language Models (LLMs) have demonstrated strong performance across various reasoning tasks, yet building a single model that consistently excels across all domains remains challenging. This paper addresses this problem by exploring strategies to integrate multiple domain-specialized models into an efficient pivot this http URL propose two fusion strategies to combine the strengths of multiple LLMs: (1) a pairwise, multi-step fusion approach that sequentially distills each source model into the pivot model, followed by a weight merging step to integrate the distilled models into the final model. This method achieves strong performance but requires substantial training effort; and (2) a unified fusion approach that aggregates all source models’ outputs this http URL improve the fusion process, we introduce a novel Rate-Skewness Adaptive Fusion (RSAF) technique, which dynamically adjusts top-K ratios during parameter merging for enhanced flexibility and this http URL, we propose an uncertainty-based weighting method for the unified approach, which dynamically balances the contributions of source models and outperforms other logits/distribution ensemble this http URL achieved accuracy improvements of 9.27%, 8.80%, and 8.89% on the GSM8K, MATH, and HumanEval tasks, respectively.

[NLP-26] Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

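[Quick read]: This paper addresses a limitation of reinforcement learning from human feedback (RLHF): prior methods typically adopt a bandit formulation, which ignores the sequential nature of language model generation and can suffer from sparse rewards. Recent work proposes dense token-level RLHF, but treating each token as an action may be too fine-grained for proper reward assignment. The paper proposes training and using a segment-level reward model, which assigns a reward to each semantically complete text segment spanning a short sequence of tokens. The key elements are (1) dynamic text segmentation and compatibility with standard sequence-preference datasets, and (2) generalizing the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolating the segment rewards for further densification. With these designs, the method performs competitively on the RLHF benchmarks AlpacaEval 2.0, Arena-Hard, and MT-Bench.

Link: https://arxiv.org/abs/2501.02790
Authors: Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou
Affiliations: The University of Texas at Austin; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: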

Abstract:Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle to proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.
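
The idea of turning segment rewards into a denser per-token signal can be pictured with a very simple scheme: give every token an equal share of its segment's reward. This even split is only an assumed stand-in for the interpolation the paper uses, which the abstract does not specify; the function name and numbers are hypothetical.

```python
def densify_segment_rewards(segment_lengths, segment_rewards):
    """Spread each segment's reward evenly over the tokens in that segment.

    Assumed simplification for illustration; not the paper's interpolation.
    """
    token_rewards = []
    for length, reward in zip(segment_lengths, segment_rewards):
        if length <= 0:
            raise ValueError("segment length must be positive")
        token_rewards.extend([reward / length] * length)
    return token_rewards


# Two segments of 2 and 3 tokens with rewards 1.0 and -0.6.
print(densify_segment_rewards([2, 3], [1.0, -0.6]))
```

The per-token values sum back to the segment rewards, so the total return of the response is unchanged while every token receives some feedback.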

[NLP-27] GeAR: Generation Augmented Retrieval

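[Quick read]: This paper addresses two main problems in existing document retrieval: the scalar similarity computed by a bi-encoder cannot reflect enough information about the retrieval results, which limits users' understanding of them; and existing methods emphasize global semantics while ignoring the fine-grained semantic relationship between the query and the complex text in a document. The paper proposes Generation Augmented Retrieval (GeAR), which adds well-designed fusion and decoding modules so that relevant text can be generated from the fused representation of the query and the document, teaching the model to focus on fine-grained information. When used as a retriever, GeAR adds no computational burden over bi-encoders. To support training, the paper also presents a pipeline that uses large language models to synthesize high-quality data efficiently. GeAR shows competitive retrieval and localization performance across diverse scenarios and datasets and offers new insights into interpreting retrieval results.

Link: https://arxiv.org/abs/2501.02772
Authors: Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang
Affiliations: Microsoft Corporation
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: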

Abstract:Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity. However, such scalar similarity is difficult to reflect enough information and impedes our comprehension of the retrieval results. In addition, this computational process mainly emphasizes the global semantics and ignores the fine-grained semantic relationship between the query and the complex text in the document. In this paper, we propose a new method called Generation Augmented Retrieval (GeAR) that incorporates well-designed fusion and decoding modules. This enables GeAR to generate the relevant text from documents based on the fused representation of the query and the document, thus learning to “focus on” the fine-grained information. Also when used as a retriever, GeAR does not add any computational burden over bi-encoders. To support the training of the new framework, we have introduced a pipeline to efficiently synthesize high-quality data by utilizing large language models. GeAR exhibits competitive retrieval and localization performance across diverse scenarios and datasets. Moreover, the qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released after completing technical review to facilitate future research.

[NLP-28] MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation ICTAI2024

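[Quick read]: This paper addresses the vulnerability of attention-based models to backdoor attacks, especially when pre-trained weights are unavailable. Many existing mitigation methods rely on pre-trained weights and fail without them. The proposed method, MBTSAD (Mitigating Backdoors with Token Splitting and Attention Distillation), mitigates backdoors using only a small amount of clean data and no pre-trained weights. It first retrains the backdoored model on a dataset generated by token splitting, then applies attention distillation with the retrained model as the teacher and the original backdoored model as the student, thereby removing backdoor patterns. Experiments show that MBTSAD matches the backdoor mitigation performance of methods based on pre-trained weights while maintaining performance on clean data. By simplifying the min-max problem of adversarial training and visualizing text representations, the authors also find that token splitting generates Out-of-Distribution (OOD) data, which leads the model to learn more generalized features and thus eliminates backdoor patterns.

Link: https://arxiv.org/abs/2501.02754
Authors: Yidong Ding, Jiafei Niu, Ping Yi
Affiliations: School of Cyber Science and Engineering, Shanghai Jiao Tong University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Accepted by ICTAI 2024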

Abstract:In recent years, attention-based models have excelled across various domains but remain vulnerable to backdoor attacks, often from downloading or fine-tuning on poisoned datasets. Many current methods to mitigate backdoors in NLP models rely on the pre-trained (unfine-tuned) weights, but these methods fail in scenarios where the pre-trained weights are not available. In this work, we propose MBTSAD, which can mitigate backdoors in the language model by utilizing only a small subset of clean data and does not require pre-trained weights. Specifically, MBTSAD retrains the backdoored model on a dataset generated by token splitting. Then MBTSAD leverages attention distillation, the retrained model is the teacher model, and the original backdoored model is the student model. Experimental results demonstrate that MBTSAD achieves comparable backdoor mitigation performance as the methods based on pre-trained weights while maintaining the performance on clean data. MBTSAD does not rely on pre-trained weights, enhancing its utility in scenarios where pre-trained weights are inaccessible. In addition, we simplify the min-max problem of adversarial training and visualize text representations to discover that the token splitting method in MBTSAD’s first step generates Out-of-Distribution (OOD) data, leading the model to learn more generalized features and eliminate backdoor patterns.

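[NLP-29] TARDiS: Text Augmentation for Refining Diversity and Separability

[Quick read]: This paper addresses challenges of text augmentation (TA) for few-shot text classification, specifically in the generation and alignment stages. Existing two-stage TA methods fall short on generation diversity and class alignment. The paper proposes TARDiS, a new method based on large language models (LLMs). In the generation stage it introduces two generation processes, SEG and CEG, which incorporate multiple class-specific prompts to enhance the diversity and separability of generated text. In the alignment stage it introduces a class adaptation (CA) method that verifies and modifies generated examples so that they align with their target classes. Experiments show that TARDiS outperforms existing LLM-based TA methods on various few-shot text classification tasks, and an in-depth analysis confirms the behavior of each stage.

Link: https://arxiv.org/abs/2501.02739
Authors: Kyungmin Kim, SangHun Im, GiBaeg Kim, Heung-Seon Oh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages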

Abstract:Text augmentation (TA) is a critical technique for text classification, especially in few-shot settings. This paper introduces a novel LLM-based TA method, TARDiS, to address challenges inherent in the generation and alignment stages of two-stage TA methods. For the generation stage, we propose two generation processes, SEG and CEG, incorporating multiple class-specific prompts to enhance diversity and separability. For the alignment stage, we introduce a class adaptation (CA) method to ensure that generated examples align with their target classes through verification and modification. Experimental results demonstrate TARDiS’s effectiveness, outperforming state-of-the-art LLM-based TA methods in various few-shot text classification tasks. An in-depth analysis confirms the detailed behaviors at each stage.

[NLP-30] KG-CF: Knowledge Graph Completion with Context Filtering under the Guidance of Large Language Models

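[Quick read]: This paper addresses limitations in how large language models (LLMs) are applied to knowledge graph completion (KGC). Current studies mostly apply LLMs to classification tasks, such as identifying missing triplets, and neglect ranking-based tasks, in which the model ranks candidate entities by plausibility. This limits the practical use of LLMs, because real-world applications prioritize highly plausible triplets. In addition, although graph paths can help infer missing triplets and improve completion accuracy, they often contain redundant information. The paper proposes KG-CF, a framework tailored to ranking-based KGC tasks, whose key idea is to use the reasoning abilities of LLMs to filter out irrelevant contexts. It achieves superior results on real-world datasets.

Link: https://arxiv.org/abs/2501.02711
Authors: Zaiyi Zheng, Yushun Dong, Song Wang, Haochen Liu, Qi Wang, Jundong Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages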

Abstract:Large Language Models (LLMs) have shown impressive performance in various tasks, including knowledge graph completion (KGC). However, current studies mostly apply LLMs to classification tasks, like identifying missing triplets, rather than ranking-based tasks, where the model ranks candidate entities based on plausibility. This focus limits the practical use of LLMs in KGC, as real-world applications prioritize highly plausible triplets. Additionally, while graph paths can help infer the existence of missing triplets and improve completion accuracy, they often contain redundant information. To address these issues, we propose KG-CF, a framework tailored for ranking-based KGC tasks. KG-CF leverages LLMs’ reasoning abilities to filter out irrelevant contexts, achieving superior results on real-world datasets. The code and datasets are available at this https URL.

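[NLP-31] QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance

[Quick read]: This paper addresses the information dilution and hallucination problems that traditional Retrieval-Augmented Generation (RAG) systems face when handling large amounts of data. Traditional RAG enhances large language models (LLMs) by integrating online resources and databases, yet still struggles to generate accurate, contextually relevant answers at scale. The key to the proposed solution is a novel retrieval mechanism, QuIM-RAG (Question-to-question Inverted Index Matching), which converts document chunks into potential questions and matches them against the user query to identify the most relevant text chunks for generating accurate answers. The paper also builds a RAG system on the Meta-LLaMA3-8B-instruct model and validates the approach with the evaluation metrics BERT-Score and RAGAS.

Link: https://arxiv.org/abs/2501.02702
Authors: Binita Saha, Utsha Saha, Muhammad Zubair Malik
Affiliations: North Dakota State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: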

Abstract:This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems to improve Question Answering (QA) tasks from a target corpus. Large Language Models (LLMs) have revolutionized the analyzing and generation of human-like text. These models rely on pre-trained data and lack real-time updates unless integrated with live data tools. RAG enhances LLMs by integrating online resources and databases to generate contextually appropriate responses. However, traditional RAG still encounters challenges like information dilution and hallucinations when handling vast amounts of data. Our approach addresses these challenges by converting corpora into a domain-specific dataset and RAG architecture is constructed to generate responses from the target document. We introduce QuIM-RAG (Question-to-question Inverted Index Matching), a novel approach for the retrieval mechanism in our system. This strategy generates potential questions from document chunks and matches these with user queries to identify the most relevant text chunks for generating accurate answers. We have implemented our RAG system on top of the open-source Meta-LLaMA3-8B-instruct model by Meta Inc. that is available on Hugging Face. We constructed a custom corpus of 500+ pages from a high-traffic website accessed thousands of times daily for answering complex questions, along with manually prepared ground truth QA for evaluation. We compared our approach with traditional RAG models using BERT-Score and RAGAS, state-of-the-art metrics for evaluating LLM applications. Our evaluation demonstrates that our approach outperforms traditional RAG architectures on both metrics.
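
The question-to-question matching idea can be sketched in a few lines: index each chunk under the questions it could answer, then route a user query to the chunk behind the most similar indexed question. In the paper the questions come from document chunks via a generation step and matching would normally use embeddings; here the questions are hand-written and similarity is plain token overlap (Jaccard), both of which are assumptions made to keep the example self-contained.

```python
def tokens(text):
    """Lowercase word set with question marks stripped (crude tokenizer)."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a, b):
    """Token-overlap similarity, standing in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_question_index(chunk_questions):
    """Inverted index: each potential question maps to its source chunk id."""
    return {q: chunk_id for chunk_id, qs in chunk_questions.items() for q in qs}


def retrieve_chunk(query, index):
    """Return the chunk id behind the indexed question closest to the query."""
    best_question = max(index, key=lambda q: jaccard(query, q))
    return index[best_question]


# Hypothetical chunks with hand-written potential questions.
chunk_questions = {
    "chunk-admissions": ["What is the application deadline?",
                         "How do I apply for admission?"],
    "chunk-tuition": ["How much is tuition per semester?",
                      "Are scholarships available?"],
}
index = build_question_index(chunk_questions)
print(retrieve_chunk("When is the application deadline?", index))
```

Matching question against question keeps both sides of the comparison in the same form, which is the motivation the abstract gives for indexing generated questions instead of raw chunks.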

[NLP-32] Decoding specialised feature neurons in LLMs with the final projection layer

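[Quick read]: This paper addresses the interpretability of large language models (LLMs), specifically how to decode and understand the association between neuron weights and particular concepts. Because LLMs typically have billions of parameters, their internal workings are hard to interpret, and this black-box character can pose safety risks when the models are trusted with important decisions. The key to the proposed solution is to decode neuron weights directly into token probabilities through the model's final projection layer (the LM-head), revealing strong associations between neurons and concepts such as "dog" or "California". The method identifies neurons that respond strongly to a given concept, and clamping those neurons changes the probability of that concept in the model's output. The approach is validated on Llama 3.1 8B and its fine-tuned instruct version, where over 75% of up-projection neurons keep the same top associated token after fine-tuning. The full set of up-projection neurons in Llama 3.1 8B can be mapped in under 15 minutes without parallelization.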

Link: https://arxiv.org/abs/2501.02688
Authors: Harry J Davies
Affiliations: Imperial College London
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 3 figures

Abstract:Large Language Models (LLMs) typically have billions of parameters and are thus often difficult to interpret in their operation. Such black-box models can pose a significant risk to safety when trusted to make important decisions. The lack of interpretability of LLMs is more related to their sheer size, rather than the complexity of their individual components. The TARS method for knowledge removal (Davies et al 2024) provides strong evidence for the hypothesis that linear layer weights which act directly on the residual stream may have high correlation with different concepts encoded in the residual stream. Building upon this, we attempt to decode neuron weights directly into token probabilities through the final projection layer of the model (the LM-head). Firstly, we show that with Llama 3.1 8B we can utilise the LM-head to decode specialised feature neurons that respond strongly to certain concepts, with examples such as “dog” and “California”. This is then confirmed by demonstrating that these neurons can be clamped to affect the probability of the concept in the output. This extends to the fine-tuned assistant Llama 3.1 8B instruct model, where we find that over 75% of neurons in the up-projection layers have the same top associated token compared to the pretrained model. Finally, we demonstrate that clamping the “dog” neuron leads the instruct model to always discuss dogs when asked about its favourite animal. Through our method, it is possible to map the entirety of Llama 3.1 8B’s up-projection neurons in less than 15 minutes with no parallelization.
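
A toy version of decoding a neuron through the LM-head might look like the sketch below. It assumes the neuron is represented by a weight vector in residual-stream space and that the LM-head is a plain vocabulary-by-hidden matrix; the four-token vocabulary and all numbers are invented, and any final normalisation layer is ignored. It illustrates the projection step only, not the paper's exact procedure.

```python
import math


def decode_neuron(neuron_weights, lm_head, vocab, top_k=3):
    """Project a neuron's weight vector through the LM-head and rank tokens.

    Returns the top_k (token, probability) pairs after a softmax over the
    resulting logits.
    """
    # One logit per vocabulary entry: dot product of LM-head row and neuron weights.
    logits = [sum(w * x for w, x in zip(row, neuron_weights)) for row in lm_head]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical 4-token vocabulary and 3-dimensional residual stream.
vocab = ["dog", "cat", "California", "the"]
lm_head = [
    [1.0, 0.0, 0.0],  # "dog"
    [0.6, 0.8, 0.0],  # "cat"
    [0.0, 0.0, 1.0],  # "California"
    [0.1, 0.1, 0.1],  # "the"
]
dog_neuron = [4.0, 0.0, 0.0]  # invented weights aligned with the "dog" direction
top_tokens = decode_neuron(dog_neuron, lm_head, vocab)
```

Here `top_tokens[0][0]` is `"dog"`, followed by `"cat"`, since the invented neuron points along the row of the LM-head that corresponds to "dog".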

[NLP-33] From Superficial Patterns to Semantic Understanding: Fine-Tuning Language Models on Contrast Sets

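[Quick read]: This paper examines the robustness of large-scale pretrained language models on natural language inference (NLI). Although these models perform well on standard datasets, their accuracy drops markedly on out-of-distribution test sets such as contrast sets. Contrast sets apply small but meaningful modifications to the input that change its label, exposing models that have learned superficial patterns in the training data rather than deeper linguistic nuances. For example, ELECTRA-small reaches nearly 90% accuracy on SNLI but falls to 75% on a contrast set. The proposed solution is to include a small amount of more complex contrast-set data during training. With this approach, accuracy on contrast sets recovers to nearly 90%, underscoring the importance of diverse and challenging training data.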

Link: https://arxiv.org/abs/2501.02683
Authors: Daniel Petrov
Affiliations: University of Texas at Austin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large scale pretrained language models have demonstrated high performance on standard datasets for natural language inference (NLI) tasks. Unfortunately, these evaluations can be misleading, as although the models can perform well on in-distribution data, they perform poorly on out-of-distribution test sets, such as contrast sets. Contrast sets consist of perturbed instances of data that have very minor, but meaningful, changes to the input that alter the gold label, revealing how models can learn superficial patterns in the training data rather than learning more sophisticated language nuances. As an example, the ELECTRA-small language model achieves nearly 90% accuracy on an SNLI dataset but drops to 75% when tested on an out-of-distribution contrast set. The research performed in this study explores how a language model’s robustness can be improved by exposing it to small amounts of more complex contrast sets during training to help it better learn language patterns. With this approach, the model regains performance and achieves nearly 90% accuracy on contrast sets, highlighting the importance of diverse and challenging training data.

[NLP-34] Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

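[Quick read]: This paper targets modality imbalance and brittleness in Vision Language Models (VLMs) on multi-step reasoning tasks. It introduces a synthetic framework for evaluating VLMs on algorithmic visual reasoning (AVR), comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each task has two difficulty levels, SIMPLE and HARD, and even the SIMPLE versions are challenging for frontier VLMs. The central aim is to find strategies for training on the SIMPLE tasks that improve performance on the corresponding HARD tasks, i.e., S2H generalization. Because each task also has a text-only version, the framework allows modality imbalance to be quantified and its dependence on training strategy to be studied. The study also highlights the importance of explicit image-to-text conversion in auto-regressive training and, through mechanistic analysis including a gradient-alignment measure, identifies training strategies that promote better S2H generalization.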

Link: https://arxiv.org/abs/2501.02669
Authors: Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:While Vision Language Models (VLMs) are impressive in tasks such as visual question answering (VQA) and image captioning, their ability to apply multi-step reasoning to images has lagged, giving rise to perceptions of modality imbalance or brittleness. Towards systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning (AVR), comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We seek strategies for training on the SIMPLE version of the tasks that improve performance on the corresponding HARD task, i.e., S2H generalization. This synthetic framework, where each task also has a text-only version, allows a quantification of the modality imbalance, and how it is impacted by training strategy. Ablations highlight the importance of explicit image-to-text conversion in promoting S2H generalization when using auto-regressive training. We also report results of mechanistic study of this phenomenon, including a measure of gradient alignment that seems to identify training strategies that promote better S2H generalization.

[NLP-35] Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks COLING2025

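[Quick read]: This paper addresses the vulnerability of deep learning models to adversarial attacks. Although many defence mechanisms have been proposed, there is no comprehensive benchmark that evaluates them across diverse datasets, models, and tasks. To fill this gap, the paper presents an extensive benchmark for textual adversarial defence that significantly expands on previous work. The benchmark (1) incorporates a wide range of datasets, (2) evaluates state-of-the-art defence mechanisms, and (3) extends the assessment to critical tasks such as single-sentence classification, similarity and paraphrase identification, natural language inference, and commonsense reasoning. By establishing a new benchmarking standard, the authors aim to accelerate progress towards more robust and reliable natural language processing systems.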

Link: https://arxiv.org/abs/2501.02654
Authors: Yang Wang, Chenghua Lin
Affiliations: Department of Computer Science, The University of Sheffield; Department of Computer Science, The University of Manchester; Automated Analytics
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Will be presented as an oral in-person presentation at the conference of COLING 2025

Abstract:vulnerability of deep learning models to adversarial attacks. While various defence mechanisms have been proposed, there is a lack of comprehensive benchmarks that evaluate these defences across diverse datasets, models, and tasks. In this work, we address this gap by presenting an extensive benchmark for textual adversarial defence that significantly expands upon previous work. Our benchmark incorporates a wide range of datasets, evaluates state-of-the-art defence mechanisms, and extends the assessment to include critical tasks such as single-sentence classification, similarity and paraphrase identification, natural language inference, and commonsense reasoning. This work not only serves as a valuable resource for researchers and practitioners in the field of adversarial robustness but also identifies key areas for future research in textual adversarial defence. By establishing a new standard for benchmarking in this domain, we aim to accelerate progress towards more robust and reliable natural language processing systems.

[NLP-36] Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian

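[Quick read]: This paper explores how adapting the vocabulary of a multilingual encoder model affects its performance on Estonian Named Entity Recognition (NER), with the goal of improving both efficiency and performance. Two vocabulary adaptation approaches are compared: retraining the tokenizer and pruning unused tokens. Retraining the tokenizer degraded NER performance, suggesting that longer embedding tuning may be needed, whereas pruning unused tokens had no negative effect. These results suggest that pruning can reduce computational cost without sacrificing performance.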

Link: https://arxiv.org/abs/2501.02631
Authors: Aleksei Dorkin, Taido Purason, Kairit Sirts
Affiliations: Institute of Computer Science, University of Tartu
Subjects: Computation and Language (cs.CL)
Comments: Published in the Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

Abstract:Adapting multilingual language models to specific languages can enhance both their efficiency and performance. In this study, we explore how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance on the Named Entity Recognition (NER) task. The motivations for adjusting the vocabulary are twofold: practical benefits affecting the computational cost, such as reducing the input sequence length and the model size, and performance enhancements by tailoring the vocabulary to the particular language. We evaluate the effectiveness of two vocabulary adaptation approaches – retraining the tokenizer and pruning unused tokens – and assess their impact on the model’s performance, particularly after continual training. While retraining the tokenizer degraded the performance of the NER task, suggesting that longer embedding tuning might be needed, we observed no negative effects on pruning.
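
Pruning unused tokens can be pictured as dropping embedding rows and renumbering the remaining token ids. The sketch below is a generic illustration under that assumption, not the authors' code; the function name, the tiny vocabulary, and the `keep_always` parameter for special tokens are hypothetical.

```python
def prune_vocabulary(vocab, embeddings, corpus_token_ids, keep_always=()):
    """Drop tokens that never occur in the corpus and renumber the rest.

    vocab: list of token strings, where the list index is the old token id.
    embeddings: list of vectors, row i belonging to old token id i.
    corpus_token_ids: iterable of old token ids observed in the target corpus.
    keep_always: tokens (e.g. special tokens) kept even if unobserved.
    """
    used = set(corpus_token_ids) | {vocab.index(token) for token in keep_always}
    kept_old_ids = sorted(used)
    old_to_new = {old: new for new, old in enumerate(kept_old_ids)}
    new_vocab = [vocab[i] for i in kept_old_ids]
    new_embeddings = [embeddings[i] for i in kept_old_ids]
    return new_vocab, new_embeddings, old_to_new


# Hypothetical multilingual vocabulary; only the Estonian tokens appear in the corpus.
vocab = ["<pad>", "<unk>", "tere", "hello", "maailm", "world"]
embeddings = [[float(i), float(i)] for i in range(len(vocab))]
corpus_ids = [2, 4, 2]  # "tere maailm tere"

new_vocab, new_embeddings, old_to_new = prune_vocabulary(
    vocab, embeddings, corpus_ids, keep_always=("<pad>", "<unk>")
)
```

The embedding vectors of the surviving tokens are untouched, which is one plausible reason pruning leaves downstream behaviour unchanged while shrinking the model.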

[NLP-37] Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense

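[Quick read]: This paper aims to prevent large language models (LLMs) from producing harmful or unsafe outputs under jailbreak attacks, which exploit model vulnerabilities to elicit responses that violate safety and ethical standards. The key to the solution is Layer-AdvPatcher, a method that patches specific layers of an LLM with self-augmented datasets, using an unlearning strategy to reduce those layers' tendency to produce affirmative tokens when faced with harmful prompts. The vulnerable layers are first identified and adversarially exposed so that they generate more harmful data, which reveals their inherent and diverse vulnerabilities. These vulnerabilities are then unlearned, reducing the influence of affirmative tokens and lowering jailbreak risk while keeping responses to safe queries intact. Experiments indicate that the method reduces harmfulness and attack success rate relative to recent defence methods without compromising utility on benign queries.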

Link: https://arxiv.org/abs/2501.02629
Authors: Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, Bhavya Kailkhura, Tianlong Chen, Kaixiong Zhou
Affiliations: Singapore Management University; National University of Singapore
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount. However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, threaten LLMs’ safety significantly. In this paper, we introduce Layer-AdvPatcher, a novel methodology designed to defend against jailbreak attacks by utilizing an unlearning strategy to patch specific layers within LLMs through self-augmented datasets. Our insight is that certain layer(s), tend to produce affirmative tokens when faced with harmful prompts. By identifying these layers and adversarially exposing them to generate more harmful data, one can understand their inherent and diverse vulnerabilities to attacks. With these exposures, we then “unlearn” these issues, reducing the impact of affirmative tokens and hence minimizing jailbreak risks while keeping the model’s responses to safe queries intact. We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak benchmarks to demonstrate the efficacy of our approach. Results indicate that our framework reduces the harmfulness and attack success rate of jailbreak attacks without compromising utility for benign queries compared to recent defense methods.

[NLP-38] Empowering Bengali Education with AI: Solving Bengali Math Word Problems through Transformer Models

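[Quick read]: This paper tackles the conversion of Bengali mathematical word problems (MWPs) into mathematical equations, a challenging task in a low-resource language setting. The key to the solution is to fine-tune transformer-based models (Basic Transformer, mT5, BanglaT5, and mBART50) on the newly introduced "PatiGonit" dataset of 10,000 Bengali math problems, so that they translate word problems into equations accurately. mT5 achieved the highest accuracy, 97.30%, demonstrating the effectiveness of transformer models in this domain. The work contributes methods and resources for Bengali natural language processing and supports the development of educational AI tools that can strengthen the problem-solving skills of Bengali-speaking students.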

Link: https://arxiv.org/abs/2501.02599
Authors: Jalisha Jashim Era, Bidyarthi Paul, Tahmid Sattar Aothoi, Mirazur Rahman Zim, Faisal Muhammad Shah
Affiliations: Department of Computer Science and Engineering; Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Mathematical word problems (MWPs) involve the task of converting textual descriptions into mathematical equations. This poses a significant challenge in natural language processing, particularly for low-resource languages such as Bengali. This paper addresses this challenge by developing an innovative approach to solving Bengali MWPs using transformer-based models, including Basic Transformer, mT5, BanglaT5, and mBART50. To support this effort, the “PatiGonit” dataset was introduced, containing 10,000 Bengali math problems, and these models were fine-tuned to translate the word problems into equations accurately. The evaluation revealed that the mT5 model achieved the highest accuracy of 97.30%, demonstrating the effectiveness of transformer models in this domain. This research marks a significant step forward in Bengali natural language processing, offering valuable methodologies and resources for educational AI tools. By improving math education, it also supports the development of advanced problem-solving skills for Bengali-speaking students.

[NLP-39] GIT-CXR: End-to-End Transformer for Chest X-Ray Report Generation

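[Quick read]: This paper addresses the automated generation of radiology reports, with the goal of reducing clinicians' workload and standardizing patient care; writing such reports is currently time-consuming and requires specialized clinical expertise. The paper proposes an end-to-end transformer-based method that generates accurate and factually complete radiology reports from X-ray images. The key contribution is the first use of curriculum learning for end-to-end transformers in medical imaging, which is shown experimentally to improve performance. Experiments use MIMIC-CXR-JPG, the largest available chest X-ray dataset. Results are comparable with the state of the art on the natural language generation (NLG) metrics BLEU and ROUGE-L, and set new state-of-the-art results on the clinical accuracy metrics (F1 examples-averaged, F1-macro, and F1-micro) and on METEOR.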

Link: https://arxiv.org/abs/2501.02598
Authors: Iustin Sîrbu, Iulia-Renata Sîrbu, Jasmina Bogojeska, Traian Rebedea
Affiliations: National University of Science and Technology POLITEHNICA Bucharest, Romania; ZHAW School of Engineering, Winterthur, Switzerland
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Medical imaging is crucial for diagnosing, monitoring, and treating medical conditions. The medical reports of radiology images are the primary medium through which medical professionals attest their findings, but their writing is time consuming and requires specialized clinical expertise. The automated generation of radiography reports has thus the potential to improve and standardize patient care and significantly reduce clinicians’ workload. Through our work, we have designed and evaluated an end-to-end transformer-based method to generate accurate and factually complete radiology reports for X-ray images. Additionally, we are the first to introduce curriculum learning for end-to-end transformers in medical imaging and demonstrate its impact in obtaining improved performance. The experiments have been conducted using the MIMIC-CXR-JPG database, the largest available chest X-ray dataset. The results obtained are comparable with the current state-of-the-art on the natural language generation (NLG) metrics BLEU and ROUGE-L, while setting new state-of-the-art results on F1 examples-averaged, F1-macro and F1-micro metrics for clinical accuracy and on the METEOR metric widely used for NLG.

[NLP-40] Efficient Architectures for High Resolution Vision-Language Models COLING2025

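[Quick read]: This paper addresses the difficulty Vision-Language Models (VLMs) have in accurately recognizing fine details in high-resolution images, which limits performance on multiple tasks. It proposes Pheye, a novel architecture that processes high-resolution images efficiently while training fewer parameters than similarly sized VLMs. Pheye's key strength is that it stays efficient while maintaining strong performance on tasks that demand fine-grained image understanding and scene-text handling.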

Link: https://arxiv.org/abs/2501.02584
Authors: Miguel Carvalho, Bruno Martins
Affiliations: INESC-ID and Instituto Superior Técnico, University of Lisbon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to COLING 2025

Abstract:Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.

[NLP-41] LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations

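[Quick read]: This paper addresses gaps in the understanding and implementation of exponentially decaying causal linear attention, an operator used to accelerate transformer-based large language models (LLMs). Specifically, the field lacks a clear understanding of the operator's complexity, a comprehensive collection of existing computation methods (which are often spread across seemingly unrelated fields), and CUDA implementations for fast GPU inference. The proposed solution is LeetDecoding, a Python package that provides a large set of computation routines for this fundamental operator. LeetDecoding is designed to integrate easily with existing linear-attention LLMs and lets researchers benchmark and evaluate new computation methods. Its key advantage is that it requires no knowledge of GPU programming or of the underlying complexity analysis, which makes it accessible to LLM practitioners.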

Link: https://arxiv.org/abs/2501.02573
Authors: Jiaping Wang, Simiao Zhang, Qiao-Chu He, Yifan Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Mathematical Software (cs.MS)
Comments: The source code of LeetDecoding is hosted at this https URL

Abstract:The machine learning and data science community has made significant yet dispersed progress in accelerating transformer-based large language models (LLMs), and one promising approach is to replace the original causal attention in a generative pre-trained transformer (GPT) with exponentially decaying causal linear attention. In this paper, we present LeetDecoding, which is the first Python package that provides a large set of computation routines for this fundamental operator. The launch of LeetDecoding was motivated by the current lack of (1) clear understanding of the complexity regarding this operator, (2) a comprehensive collection of existing computation methods (usually spread in seemingly unrelated fields), and (3) CUDA implementations for fast inference on GPU. LeetDecoding’s design is easy to integrate with existing linear-attention LLMs, and allows for researchers to benchmark and evaluate new computation methods for exponentially decaying causal linear attention. The usage of LeetDecoding does not require any knowledge of GPU programming and the underlying complexity analysis, intentionally making LeetDecoding accessible to LLM practitioners. The source code of LeetDecoding is provided at this https URL, and users can simply install LeetDecoding by the command pip install leet-decoding.
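
For readers unfamiliar with the operator, the sketch below shows one common formulation of exponentially decaying causal linear attention, o_t = sum over s <= t of gamma^(t-s) (q_t · k_s) v_s, computed two ways. This formulation is an assumption based on how the operator is usually defined, and the code does not use or reproduce the LeetDecoding API; the function names and the tiny inputs are illustrative only.

```python
def decayed_attention_quadratic(Q, K, V, gamma):
    """Direct O(T^2) evaluation: o_t = sum_{s<=t} gamma^(t-s) * (q_t . k_s) * v_s."""
    d_v = len(V[0])
    outputs = []
    for t in range(len(Q)):
        o = [0.0] * d_v
        for s in range(t + 1):
            score = gamma ** (t - s) * sum(a * b for a, b in zip(Q[t], K[s]))
            for j in range(d_v):
                o[j] += score * V[s][j]
        outputs.append(o)
    return outputs


def decayed_attention_recurrent(Q, K, V, gamma):
    """Linear-time evaluation with a running state S_t = gamma * S_{t-1} + k_t v_t^T."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(Q, K, V):
        # Decay the old state and add the outer product of the new key and value.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k[i] * v[j]
        # Read out o_t = q_t^T S_t.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


# Tiny invented example: 3 time steps, key dimension 2, value dimension 1.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0], [2.0], [3.0]]
slow = decayed_attention_quadratic(Q, K, V, gamma=0.5)
fast = decayed_attention_recurrent(Q, K, V, gamma=0.5)
```

Both functions return the same outputs; the recurrent form keeps only a d_k by d_v state per step, which is what makes the operator attractive for fast decoding.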

[NLP-42] Decoding fMRI Data into Captions using Prefix Language Modeling

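[Quick read]: This paper addresses potential data contamination in current methods for decoding brain signals into image captions. Existing approaches generate captions with the GIT model, but GIT was trained on the COCO dataset, which is also the source of the stimulus images used in the brain decoding task. As an alternative, the paper first predicts a DINOv2 image embedding from the fMRI signal and then feeds the DINOv2 [CLS] token as a prefix to a GPT-2 language model to generate the caption. This reduces computational requirements and avoids the contamination risk of using GIT directly. The paper also explores a 3D Convolutional Neural Network (3D CNN) for mapping fMRI signals to the image embedding space, to better account for the positional information of voxels.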

Link: https://arxiv.org/abs/2501.02570
Authors: Vyacheslav Shen, Kassymzhomart Kunanbayev, Dae-Shik Kim
Affiliations: School of Electrical Engineering, KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages, 2 tables, 1 figure

Abstract:With the advancements in Large Language and Latent Diffusion models, brain decoding has achieved remarkable results in recent years. The works on the NSD dataset, with stimuli images from the COCO dataset, leverage the embeddings from the CLIP model for image reconstruction and GIT for captioning. However, the current captioning approach introduces the challenge of potential data contamination given that the GIT model was trained on the COCO dataset. In this work, we present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model’s embedding of an image from the corresponding fMRI signal and then providing its [CLS] token as the prefix to the GPT-2 language model which decreases computational requirements considerably. Additionally, instead of commonly used Linear Regression, we explore 3D Convolutional Neural Network mapping of fMRI signals to image embedding space for better accounting positional information of voxels.

[NLP-43] Multi-LLM Collaborative Caption Generation in Scientific Documents AAAI2025

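[Quick read]: This paper addresses shortcomings of existing methods for scientific figure captioning, which often treat the task solely as an image-to-text or a text summarization problem and therefore produce captions that miss necessary details. In addition, low-quality captions in existing datasets make it difficult to train large language models (LLMs). The paper proposes a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) with three modules: first, multimodal LLMs assess the quality of the training data so that low-quality captions can be filtered out; second, multiple LLMs are fine-tuned or prompted to generate candidate captions; finally, a prominent LLM selects the highest-quality candidate and refines any remaining inaccuracies. Human evaluations show that informative captions produced by this approach rank better than human-written captions.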

Link: https://arxiv.org/abs/2501.02552
Authors: Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao ‘Kenneth’ Huang, Sungchul Choi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2025 AI4Research Workshop

Abstract:Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at this https URL
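
The three modules named in the abstract can be read as a filter, generate, judge pipeline. The sketch below only shows that control flow: every callable is a toy stand-in for an LLM call, and the function names, the threshold, and the example data are hypothetical rather than taken from the paper.

```python
def caption_pipeline(training_pairs, figure, assess_quality, generators, judge, refine,
                     min_quality=0.5):
    """Filter training data, generate candidate captions, then pick and refine one."""
    # Module 1 (Quality Assessment): keep only training pairs scored above the threshold.
    clean_pairs = [pair for pair in training_pairs if assess_quality(pair) >= min_quality]
    # Module 2 (Diverse Caption Generation): one candidate caption per generator.
    candidates = [generate(figure, clean_pairs) for generate in generators]
    # Module 3 (Judgment): select the best candidate, then refine it.
    best = judge(figure, candidates)
    return refine(figure, best), clean_pairs


# Toy stand-ins so the sketch runs end to end; none of them calls a real model.
pairs = [
    ("fig1.png", "Accuracy vs. epochs for three models."),
    ("fig2.png", "Figure 2."),
]
assess = lambda pair: min(len(pair[1].split()) / 5, 1.0)  # longer captions score higher
gen_short = lambda fig, data: f"Plot from {fig}"
gen_long = lambda fig, data: f"Plot from {fig}, styled after {len(data)} clean examples"
pick_longest = lambda fig, candidates: max(candidates, key=len)
add_period = lambda fig, caption: caption if caption.endswith(".") else caption + "."

caption, kept = caption_pipeline(
    pairs, "fig3.png", assess, [gen_short, gen_long], pick_longest, add_period
)
```

In this toy run the uninformative "Figure 2." pair is filtered out, and the longer candidate is selected and given a final period.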

[NLP-44] From Language To Vision: A Case Study of Text Animation

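[Quick read]: This paper addresses how to visualize free text as animation in order to improve understanding of the encoded information. The key to the solution is a text visualization system that converts natural language text into animations, demonstrated on example sentences describing elementary physics laws. Such conversion reflects the human ability to move between information formats and has broad practical applications.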

Link: https://arxiv.org/abs/2501.02549
Authors: Ping Chen, Richard Alo, Justin Rundell
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Information can be expressed in multiple formats including natural language, images, and motions. Human intelligence usually faces little difficulty to convert from one format to another format, which often shows a true understanding of encoded information. Moreover, such conversions have broad application in many real-world applications. In this paper, we present a text visualization system that can visualize free text with animations. Our system is illustrated by visualizing example sentences of elementary Physics laws.

[NLP-45] TreeMatch: A Fully Unsupervised WSD System Using Dependency Knowledge on a Specific Domain

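[Quick read]: This paper addresses word sense disambiguation (WSD), one of the main challenges in computational linguistics. The key to the solution is a fully unsupervised method that relies on dependency knowledge drawn from a domain-specific knowledge base. The TreeMatch system was originally developed using data from SemEval 2007 Task 7 and was then adapted to the domain-specific WSD task of SemEval 2010 Task 17. When evaluated on the task, the system's precision exceeds the Most Frequent Selection baseline.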

Link: https://arxiv.org/abs/2501.02546
Authors: Andrew Tran, Chris Bowes, David Brown, Ping Chen, Max Choly, Wei Ding
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Word sense disambiguation (WSD) is one of the main challenges in Computational Linguistics. TreeMatch is a WSD system originally developed using data from SemEval 2007 Task 7 (Coarse-grained English All-words Task) that has been adapted for use in SemEval 2010 Task 17 (All-words Word Sense Disambiguation on a Specific Domain). The system is based on a fully unsupervised method using dependency knowledge drawn from a domain specific knowledge base that was built for this task. When evaluated on the task, the system precision performs above the Most Frequent Selection baseline.

[NLP-46] Evaluating Large Language Models Against Human Annotators in Latent Content Analysis: Sentiment, Political Leaning, Emotional Intensity, and Sarcasm

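[Quick read]: This paper addresses the need for efficient latent content analysis of the large volumes of text produced by digital communication, and the lack of comprehensive assessments comparing large language models (LLMs) with human annotators across multiple dimensions. It compares seven state-of-the-art LLMs (including variants of OpenAI's GPT-4, Gemini, Llama, and Mixtral) with 33 human annotators on sentiment analysis, political leaning, emotional intensity, and sarcasm detection. Inter-rater reliability is measured with Krippendorff's alpha, and consistency over time with intra-class correlation coefficients. The results show that LLMs are highly reliable in sentiment analysis and political leaning assessment, with higher internal consistency than humans, whereas human expertise remains essential for interpreting emotional intensity, and both groups struggle with sarcasm detection. The study concludes that LLMs, especially GPT-4, can effectively replicate human analysis in certain latent content analysis tasks with stable, high-quality performance.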

Link: https://arxiv.org/abs/2501.02532
Authors: Ljubisa Bojic, Olga Zagovora, Asta Zelenkauskaite, Vuk Vukovic, Milan Cabarkapa, Selma Veseljević Jerkovic, Ana Jovančevic
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 24 pages, 3 figures

Abstract:In the era of rapid digital communication, vast amounts of textual data are generated daily, demanding efficient methods for latent content analysis to extract meaningful insights. Large Language Models (LLMs) offer potential for automating this process, yet comprehensive assessments comparing their performance to human annotators across multiple dimensions are lacking. This study evaluates the reliability, consistency, and quality of seven state-of-the-art LLMs, including variants of OpenAI’s GPT-4, Gemini, Llama, and Mixtral, relative to human annotators in analyzing sentiment, political leaning, emotional intensity, and sarcasm detection. A total of 33 human annotators and eight LLM variants assessed 100 curated textual items, generating 3,300 human and 19,200 LLM annotations, with LLMs evaluated across three time points to examine temporal consistency. Inter-rater reliability was measured using Krippendorff’s alpha, and intra-class correlation coefficients assessed consistency over time. The results reveal that both humans and LLMs exhibit high reliability in sentiment analysis and political leaning assessments, with LLMs demonstrating higher internal consistency than humans. In emotional intensity, LLMs displayed higher agreement compared to humans, though humans rated emotional intensity significantly higher. Both groups struggled with sarcasm detection, evidenced by low agreement. LLMs showed excellent temporal consistency across all dimensions, indicating stable performance over time. This research concludes that LLMs, especially GPT-4, can effectively replicate human analysis in sentiment and political leaning, although human expertise remains essential for emotional intensity interpretation. The findings demonstrate the potential of LLMs for consistent and high-quality performance in certain areas of latent content analysis.

[NLP-47] Towards New Benchmark for AI Alignment Sentiment Analysis in Socially Important Issues: A Comparative Study of Human and LLMs in the Context of AGI

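[Quick read]: This paper addresses how to evaluate the sentiment of large language models (LLMs) on socially important issues and how these models may affect human society in the long run. The key approach is a Likert scale survey: seven LLMs, including GPT-4 and Bard, are analyzed and their sentiment scores compared with sentiment data from three independent human sample populations. The study also evaluates how the models' sentiment varies over three consecutive days. The results show a diversity of sentiment scores among LLMs, and the LLMs were generally more positive than the human samples. The study further outlines potential conflicts of interest and bias in LLMs' sentiment formation, suggesting that these models, akin to human cognitive processes, could develop unique sentiments and subtly influence societies' perceptions of various opinions.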

Link: https://arxiv.org/abs/2501.02531
Authors: Ljubisa Bojic, Dylan Seychell, Milan Cabarkapa
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 20 pages, 1 figure

Abstract:With the expansion of neural networks, such as large language models, humanity is exponentially heading towards superintelligence. As various AI systems are increasingly integrated into the fabric of societies (through recommending values, devising creative solutions, and making decisions), it becomes critical to assess how these AI systems impact humans in the long run. This research aims to contribute towards establishing a benchmark for evaluating the sentiment of various Large Language Models in socially important issues. The methodology adopted was a Likert scale survey. Seven LLMs, including GPT-4 and Bard, were analyzed and compared against sentiment data from three independent human sample populations. Temporal variations in sentiment were also evaluated over three consecutive days. The results highlighted a diversity in sentiment scores among LLMs, ranging from 3.32 to 4.12 out of 5. GPT-4 recorded the most positive sentiment score towards AGI, whereas Bard was leaning towards the neutral sentiment. The human samples, contrastingly, showed a lower average sentiment of 2.97. The temporal comparison revealed differences in sentiment evolution between LLMs in three days, ranging from 1.03% to 8.21%. The study’s analysis outlines the prospect of potential conflicts of interest and bias possibilities in LLMs’ sentiment formation. Results indicate that LLMs, akin to human cognitive processes, could potentially develop unique sentiments and subtly influence societies’ perceptions towards various opinions formed within the LLMs.

[NLP-48] CHAIR-Classifier of Hallucination as Improver

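[Quick read]: This paper addresses the detection of hallucinations in large language models (LLMs), that is, generated content that is factually incorrect or unsupported. The key to the solution is to analyze token scores (logits) across the layers of the LLaMA model and derive a compact feature set (maximum, minimum, mean, standard deviation, and slope) chosen to reduce overfitting. These features feed a logistic regression classifier, validated on the TruthfulQA and MMLU datasets. The results show significant performance gains, especially in zero-shot scenarios, indicating effectiveness and potential for generalization.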

Link: https://arxiv.org/abs/2501.02518
Authors: Ao Sun
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a supervised method for detecting hallucinations in large language models. By analyzing token scores (logits) across layers of the LLaMA model, we derive a small set of features, chosen to reduce overfitting, including maximum, minimum, mean, standard deviation, and slope. We use logistic regression for classification and validate the model on the TruthfulQA and MMLU datasets. The results demonstrate significant performance gains, especially in zero-shot scenarios, highlighting the effectiveness and potential for generalization.
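
The five summary features named in the abstract are straightforward to compute from a sequence of per-layer scores. The sketch below assumes the input is one token's score at each layer, uses the population standard deviation, and takes the slope from a least-squares fit of score against layer index; these choices, the function name, and the sample values are assumptions, and the downstream logistic regression is omitted.

```python
import statistics


def layer_features(scores):
    """Summarise per-layer scores as max, min, mean, standard deviation, and slope.

    scores: a sequence with one score per layer (at least two layers).
    """
    n = len(scores)
    mean_x = (n - 1) / 2  # mean of layer indices 0..n-1
    mean_y = statistics.fmean(scores)
    # Least-squares slope of score against layer index.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return {
        "max": max(scores),
        "min": min(scores),
        "mean": mean_y,
        "std": statistics.pstdev(scores),
        "slope": numerator / denominator,
    }


# Invented scores that rise steadily across four layers.
features = layer_features([1.0, 2.0, 3.0, 4.0])
```

For these invented scores the slope is 1.0 and the mean is 2.5; the resulting feature dictionary is the kind of small, fixed-size input a logistic regression classifier could consume.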

[NLP-49] Can Impressions of Music be Extracted from Thumbnail Images?

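[Quick read]: This paper addresses the scarcity of large-scale public datasets for music retrieval and generation systems, specifically datasets that pair music data with natural language descriptions (music captions). Existing datasets underrepresent non-musical information, such as suitable situations for listening to a track or the emotions it elicits, even though this information is crucial for describing music. The key to the solution is to generate music caption data by inferring non-musical aspects from music thumbnail images, an approach validated through human evaluation. The authors also created a dataset of approximately 360,000 captions containing non-musical aspects, trained a music retrieval model on it, and demonstrated its effectiveness on music retrieval tasks.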

Link: https://arxiv.org/abs/2501.02511
Authors: Takashi Harada, Takehiro Motomitsu, Katsuhiko Hayashi, Yusuke Sakai, Hidetaka Kamigaito
Affiliations: The University of Tokyo; Hokkaido University; NAIST
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at NLP4MusA 2024

Abstract:In recent years, there has been a notable increase in research on machine learning models for music retrieval and generation systems that are capable of taking natural language sentences as inputs. However, there is a scarcity of large-scale publicly available datasets, consisting of music data and their corresponding natural language descriptions known as music captions. In particular, non-musical information such as suitable situations for listening to a track and the emotions elicited upon listening is crucial for describing music. This type of information is underrepresented in existing music caption datasets due to the challenges associated with extracting it directly from music data. To address this issue, we propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images, and validated the effectiveness of our approach through human evaluations. Additionally, we created a dataset with approximately 360,000 captions containing non-musical aspects. Leveraging this dataset, we trained a music retrieval model and demonstrated its effectiveness in music retrieval tasks through evaluation.

[NLP-50] ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

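[Quick read]: This paper addresses the evaluation of large language models (LLMs) in multi-hop tool use. The lack of reliable evaluation datasets has hindered in-depth analysis of LLMs' understanding, reasoning, and function-calling capabilities. The authors therefore present ToolHop, a dataset of 995 user queries and 3,912 associated tools designed for rigorous evaluation of multi-hop tool use. Through a novel query-driven data construction approach (tool creation, document refinement, and code generation), ToolHop ensures diverse queries, meaningful interdependencies among tools, locally executable tools, detailed feedback, and verifiable answers. An evaluation of 14 LLMs reveals significant challenges in multi-hop tool-use scenarios and yields actionable insights for model development.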

Link: https://arxiv.org/abs/2501.02506
Authors: Junjie Ye,Zhengyin Du,Xuesong Yao,Weijian Lin,Yufei Xu,Zehui Chen,Zaiyuan Wang,Sining Zhu,Zhiheng Xi,Siyu Yuan,Tao Gui,Qi Zhang,Xuanjing Huang,Jiechao Chen
Affiliations: School of Computer Science, Fudan University; ByteDance; Institute of Modern Languages and Linguistics, Fudan University; School of Data Science, Fudan University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies for various families, offering actionable insights to guide the development of more effective approaches. Code and data can be found in this https URL.

[NLP-51] Test-time Computing: from System-1 Thinking to System-2 Thinking

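[Quick read]: This paper addresses the lack of a comprehensive survey of test-time computing scaling. Test-time computing scaling performs remarkably well on complex reasoning tasks and can further unlock a model's potential, especially by strengthening its System-2 thinking. The paper traces the concept of test-time computing from System-1 models to System-2 models and systematically reviews its applications and trends across them. For System-1 models, the key techniques are parameter updating, input modification, representation editing, and output calibration, which improve robustness and generalization; for System-2 models, repeated sampling, self-correction, and tree search strengthen complex problem solving. The paper also points out possible future directions and highlights the key role of test-time computing in the transition from System-1 models to weak System-2 models and then to strong System-2 models.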

Link: https://arxiv.org/abs/2501.02497
Authors: Yixin Ji,Juntao Li,Hai Ye,Kaixin Wu,Jia Xu,Linjian Mo,Min Zhang
Affiliations: School of Computer Science and Technology, Soochow University; Department of Computer Science, National University of Singapore; Ant Group
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: work in progress

Abstract:The remarkable performance of the o1 model in complex reasoning demonstrates that test-time computing scaling can further unlock the model’s potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time computing scaling. We trace the concept of test-time computing back to System-1 models. In System-1 models, test-time computing addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model’s reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time computing in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out a few possible future directions.
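
Of the System-2 techniques listed in this abstract, repeated sampling is the simplest to illustrate. The sketch below assumes the common majority-vote aggregation, which the abstract itself does not specify, and replaces the model call with a hypothetical sampler that replays canned answers.

```python
from collections import Counter

def repeated_sampling(sample_fn, n_samples):
    """Draw several answers and return the most frequent one with its vote share."""
    answers = [sample_fn() for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

# Hypothetical sampler: replays canned answers in place of a real model call.
canned = iter(["42", "41", "42", "42", "17"])
answer, share = repeated_sampling(lambda: next(canned), 5)
```

Spending more test-time compute here simply means raising `n_samples`; self-correction and tree search trade that budget for more structured exploration.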

[NLP-52] LLM PC: Large Language Model Predictive Control

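[Quick read]: This paper studies how improved prompting techniques enhance the reasoning, planning, and action abilities of large language models (LLMs). Specifically, it analyzes these prompting techniques from the perspective of model predictive control (MPC) and shows that, when planning prompts are used, LLMs in effect act as minimizers of an implicit planning cost function. The key to the solution is to improve LLM planning performance further by introducing real planning cost functions and evaluators. The framework both explains how LLMs behave in planning tasks and offers a way to improve their planning ability.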

Link: https://arxiv.org/abs/2501.02486
Authors: Gabriel Maher
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Recent advancements in prompting techniques for Large Language Models (LLMs) have improved their reasoning, planning, and action abilities. This paper examines these prompting techniques through the lens of model predictive control (MPC). We show that LLMs act as implicit planning cost function minimizers when planning prompts are used. Under our framework we demonstrate that LLM planning performance can be improved further by incorporating real planning cost functions and evaluators.
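
The MPC view in this abstract can be sketched as a receding-horizon loop: propose candidate plans, score them with an explicit cost function, execute only the first action of the best plan, then replan. The toy below is an assumption-laden illustration, not the paper's method: `propose_plans` is a hypothetical stand-in for sampling plans from an LLM (here it simply enumerates short action sequences), and the task is a one-dimensional walk to a goal.

```python
from itertools import product

def propose_plans(horizon=3, actions=(-1, 0, 1)):
    """Stand-in for sampling candidate plans from an LLM (hypothetical)."""
    return list(product(actions, repeat=horizon))

def plan_cost(state, plan, goal):
    """Explicit planning cost: summed distance to the goal along the rollout."""
    cost = 0
    for action in plan:
        state += action
        cost += abs(goal - state)
    return cost

def mpc_step(state, goal):
    """Pick the cheapest candidate plan and execute only its first action."""
    best = min(propose_plans(), key=lambda plan: plan_cost(state, plan, goal))
    return state + best[0]

def run(state, goal, max_steps=10):
    """Replan after every executed action until the goal is reached."""
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        state = mpc_step(state, goal)
        trajectory.append(state)
    return trajectory
```

Swapping `plan_cost` for a real evaluator is the kind of change the abstract argues can improve planning, since the selection step then no longer depends on the model's implicit cost.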

[NLP-53] Decoding News Bias: Multi Bias Detection in News Articles

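[Quick read]: This paper addresses the various biases present in news articles, which can significantly distort public opinion and trust in the media. It proposes using large language models (LLMs) to analyze and understand natural language in order to build a dataset and detect these biases. The key to the solution is to use LLMs for bias detection across a broad range of domains and to report results from multiple detection techniques. The approach highlights the importance of comprehensive bias detection and offers new insights for improving the integrity of news articles.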

Link: https://arxiv.org/abs/2501.02482
Authors: Bhushan Santosh Shah,Deven Santosh Shah,Vahida Attar
Affiliations: College of Engineering, Pune Technological University; Stony Brook University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:News articles provide crucial information about events in society, but they unfortunately come with different kinds of biases. These biases can significantly distort public opinion and trust in the media, making it essential to develop techniques to detect and address them. Previous work has mainly focused on identifying biases in particular domains, e.g., political and gender biases. However, more comprehensive studies are needed to detect biases across diverse domains. Large language models (LLMs) offer a powerful way to analyze and understand natural language, making them ideal for constructing datasets and detecting these biases. In this work, we explore various biases present in news articles, build a dataset using LLMs, and present results obtained using multiple detection techniques. Our approach highlights the importance of broad-spectrum bias detection and offers new insights for improving the integrity of news articles.

[NLP-54] Hengqin-RA-v1: Advanced Large Language Model for Diagnosis and Treatment of Rheumatoid Arthritis with Dataset based Traditional Chinese Medicine AAAI-2025

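[Quick read]: This paper addresses the biases and inaccuracies of large language models (LLMs) in Chinese contexts, particularly in Traditional Chinese Medicine (TCM). These problems are especially pronounced in specific areas such as rheumatoid arthritis (RA), mainly because domain-specific data is scarce and the cultural and clinical details of TCM are complex. The paper proposes Hengqin-RA-v1, the first LLM tailored to TCM diagnosis and treatment of RA. The key to the solution is the HQ-GCM-RA-C1 dataset, curated from ancient Chinese medical literature, classical texts, and modern clinical studies, which lets Hengqin-RA-v1 give accurate and culturally informed responses and fill the gaps left by general-purpose models. Experiments show that Hengqin-RA-v1 outperforms state-of-the-art models and in some cases even surpasses the diagnostic accuracy of TCM practitioners.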

Link: https://arxiv.org/abs/2501.02471
Authors: Yishen Liu,Shengda Luo,Zishao Zhong,Tongtong Wu,Jianguo Zhang,Peiyao Ou,Yong Liang,Liang Liu,Hudan Pan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 8 pages, 5 figures, AAAI-2025 Workshop

Abstract:Large language models (LLMs), primarily trained on English texts, often face biases and inaccuracies in Chinese contexts. Their limitations are pronounced in fields like Traditional Chinese Medicine (TCM), where cultural and clinical subtleties are vital, and are further compounded by a lack of domain-specific data for conditions such as rheumatoid arthritis (RA). To address these issues, this paper introduces Hengqin-RA-v1, the first large language model specifically tailored for TCM with a focus on diagnosing and treating RA. We also present HQ-GCM-RA-C1, a comprehensive RA-specific dataset curated from ancient Chinese medical literature, classical texts, and modern clinical studies. This dataset empowers Hengqin-RA-v1 to deliver accurate and culturally informed responses, effectively bridging the gaps left by general-purpose models. Extensive experiments demonstrate that Hengqin-RA-v1 outperforms state-of-the-art models, even surpassing the diagnostic accuracy of TCM practitioners in certain cases.

[NLP-55] Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications

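[Quick read]: This paper addresses hallucinations generated by large language models (LLMs) in medical applications, which stem mainly from limited integration of medical knowledge. It frames multi-source knowledge acquisition as a source planning problem: formulating context-appropriate queries tailored to the attributes of different knowledge sources. Existing methods either overlook source planning or fail to achieve it effectively because the model's expectations of the sources do not match their actual content. The key to the solution is MedOmniKB, a comprehensive repository of multi-genre and multi-structured medical knowledge sources, combined with the Source Planning Optimisation (SPO) method, which improves multi-source utilisation through explicit planning optimisation. An expert model explores and evaluates candidate plans, and a smaller model is trained to learn source alignment. This substantially improves multi-source planning performance and lets the optimised small model achieve state-of-the-art results in leveraging diverse medical knowledge sources.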

Link: https://arxiv.org/abs/2501.02460
Authors: Zhe Chen,Yusheng Liao,Shuyang Jiang,Pingjie Wang,Yiqiu Guo,Yanfeng Wang,Yu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Large language models (LLMs) hold promise for addressing healthcare challenges but often generate hallucinations due to limited integration of medical knowledge. Incorporating external medical knowledge is therefore critical, especially considering the breadth and complexity of medical content, which necessitates effective multi-source knowledge acquisition. We address this challenge by framing it as a source planning problem, where the task is to formulate context-appropriate queries tailored to the attributes of diverse knowledge sources. Existing approaches either overlook source planning or fail to achieve it effectively due to misalignment between the model’s expectation of the sources and their actual content. To bridge this gap, we present MedOmniKB, a comprehensive repository comprising multigenre and multi-structured medical knowledge sources. Leveraging these sources, we propose the Source Planning Optimisation (SPO) method, which enhances multi-source utilisation through explicit planning optimisation. Our approach involves enabling an expert model to explore and evaluate potential plans while training a smaller model to learn source alignment using positive and negative planning samples. Experimental results demonstrate that our method substantially improves multi-source planning performance, enabling the optimised small model to achieve state-of-the-art results in leveraging diverse medical knowledge sources.

[NLP-56] Understand Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap

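[Quick read]: This paper addresses the performance gap of large language models (LLMs) across languages, in particular the reasoning gap between high-resource languages (such as English and Chinese) and other languages (such as Korean). Using the newly introduced HRM8K benchmark, the study systematically analyzes model behavior on English-Korean bilingual math problems and finds that the gap stems mainly from difficulty comprehending non-English inputs rather than from limits on reasoning ability. Based on this finding, the paper proposes UST (Understand, Solve, and Translate), which strategically uses English as an anchor for reasoning and solution generation. After fine-tuning on 130k synthetically generated data points, UST achieves a 10.91% improvement on HRM8K and reduces the multilingual performance gap from 11.6% to 0.7%. The improvements also generalize well to different Korean domains, indicating that capabilities acquired from machine-verifiable content can transfer to other areas.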

Link: https://arxiv.org/abs/2501.02448
Authors: Hyunwoo Ko,Guijin Son,Dasol Choi
Affiliations: OneLineAI; Yonsei University
Categories: Computation and Language (cs.CL)
Notes: 18 pages, 14 figures, 9 tables

Abstract:Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models.

[NLP-57] Efficient Deployment of Large Language Models on Resource-constrained Devices

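[Quick read]: This paper addresses the high inference latency and excessive memory demands of deploying large language models (LLMs) on resource-constrained devices. Such devices are heterogeneous in computing and communication capabilities, and although conventional federated learning (FL) methods preserve privacy, they retain the original LLM size and are therefore inefficient. The paper proposes FedSpine, a framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning and iteratively prunes and tunes LLM parameters. FedSpine also uses an online Multi-Armed Bandit (MAB) algorithm to adaptively choose pruning ratios and LoRA ranks for different devices, handling device heterogeneity without prior knowledge of each device's computing and communication capabilities. Experiments show that, at the same sparsity level, FedSpine speeds up fine-tuning by 1.4× to 6.9× and improves final accuracy by 0.4% to 4.5%.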

Link: https://arxiv.org/abs/2501.02438
Authors: Zhiwei Yao,Yang Xu,Hongli Xu,Yunming Liao,Zuan Xie
Affiliations: School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes:

Abstract:Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experimental results conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4×-6.9× and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.

[NLP-58] Towards Multimodal Metaphor Understanding: A Chinese Dataset and Model for Metaphor Mapping Identification

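[Quick read]: This paper addresses the challenge of metaphor understanding in natural language processing (NLP), in particular how to accurately identify the mapping between a metaphor's source domain and target domain. Existing research focuses mainly on metaphor detection and sentiment analysis of metaphorical expressions and pays little attention to identifying the intricate mappings between source and target domains. In addition, non-English multimodal metaphor resources are largely neglected in the literature, which hinders a deeper understanding of the key elements of metaphor interpretation. The paper makes two key contributions. First, it develops a Chinese multimodal metaphor advertisement dataset (CM3D) with detailed annotations of source and target domains, intended to foster research on metaphor comprehension in non-English languages. Second, it proposes a Chain-of-Thought (CoT) Prompting-based Metaphor Mapping Identification Model (CPMMIM), which simulates the human cognitive process and, drawing on Bi-Level Optimization (BLO), treats the task as a hierarchical identification problem, yielding more accurate and interpretable metaphor mapping. Experimental results demonstrate the effectiveness of CPMMIM and support further research in this area.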

Link: https://arxiv.org/abs/2501.02434
Authors: Dongyu Zhang,Shengcheng Yin,Jingwei Yu,Zhiyao Wu,Zhen Li,Chengpei Xu,Xiaoxia Wang,Feng Xia
Affiliations: School of Foreign Languages, Dalian University of Technology; School of Software, Dalian University of Technology; Faculty of Business Administration, University of Macau; Faculty of Business and Commerce, Kansai University; School of Minerals and Energy Resources Engineering, University of New South Wales; Centre for Educational Innovation and Quality, RMIT University; School of Computing Technologies, RMIT University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Metaphors play a crucial role in human communication, yet their comprehension remains a significant challenge for natural language processing (NLP) due to the cognitive complexity involved. According to Conceptual Metaphor Theory (CMT), metaphors map a target domain onto a source domain, and understanding this mapping is essential for grasping the nature of metaphors. While existing NLP research has focused on tasks like metaphor detection and sentiment analysis of metaphorical expressions, there has been limited attention to the intricate process of identifying the mappings between source and target domains. Moreover, non-English multimodal metaphor resources remain largely neglected in the literature, hindering a deeper understanding of the key elements involved in metaphor interpretation. To address this gap, we developed a Chinese multimodal metaphor advertisement dataset (namely CM3D) that includes annotations of specific target and source domains. This dataset aims to foster further research into metaphor comprehension, particularly in non-English languages. Furthermore, we propose a Chain-of-Thought (CoT) Prompting-based Metaphor Mapping Identification Model (CPMMIM), which simulates the human cognitive process for identifying these mappings. Drawing inspiration from CoT reasoning and Bi-Level Optimization (BLO), we treat the task as a hierarchical identification problem, enabling more accurate and interpretable metaphor mapping. Our experimental results demonstrate the effectiveness of CPMMIM, highlighting its potential for advancing metaphor comprehension in NLP. Our dataset and code are both publicly available to encourage further advancements in this field.

[NLP-59] Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding COLING2025

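[Quick read]: This paper addresses the challenge of efficient dataset pruning for task-specific fine-tuning across datasets. Because datasets vary in size, data distribution, class imbalance, and label space, existing cross-dataset pruning techniques usually rely on computationally expensive sample ranking, typically requiring full-dataset training or reference models. The proposed solution, Swift Cross-Dataset Pruning (SCDP), uses TF-IDF embeddings with the geometric median to evaluate sample importance quickly and then prunes adaptively by dataset size: for smaller datasets it retains samples far from the geometric median to preserve diversity, and for larger datasets it applies distance-based stratified pruning. Experiments show that the method is effective across datasets of various tasks and scales while significantly reducing computational cost.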

Link: https://arxiv.org/abs/2501.02432
Authors: Binh-Nguyen Nguyen,Yang He
Affiliations: CFAR, Agency for Science, Technology and Research, Singapore; IHPC, Agency for Science, Technology and Research, Singapore; VNU University of Engineering and Technology, Hanoi, Vietnam
Categories: Computation and Language (cs.CL)
Notes: Accepted by COLING 2025

Abstract:Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales while significantly reducing computational resources. Source code is available at: this https URL
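
The small-dataset branch described in this abstract (keep the samples far from the geometric median) can be sketched as below. This is an illustrative reading rather than the released code: the geometric median is approximated with Weiszfeld iterations, the rows of `X` stand in for TF-IDF embeddings, and `keep_ratio` is a hypothetical parameter name. The distance-based stratified pruning used for larger datasets is not shown.

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-9):
    """Weiszfeld iterations for the point minimising summed Euclidean distance."""
    median = X.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(X - median, axis=1)
        weights = 1.0 / np.maximum(dist, eps)  # guard against division by zero
        median = (weights[:, None] * X).sum(axis=0) / weights.sum()
    return median

def keep_farthest(X, keep_ratio):
    """Indices (sorted) of the samples farthest from the geometric median."""
    dist = np.linalg.norm(X - geometric_median(X), axis=1)
    n_keep = max(1, int(round(keep_ratio * len(X))))
    return np.sort(np.argsort(dist)[-n_keep:])

# Toy "embeddings": a tight cluster plus one outlying sample.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
kept = keep_farthest(X, keep_ratio=0.2)
```

Because the ranking needs only distances in embedding space, no model training or reference model is involved, which is the source of the speed-up the abstract claims.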

[NLP-60] Scaling Laws for Floating Point Quantization Training

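[Quick read]: This paper addresses low-precision training of large language models (LLMs), in particular the effects of floating-point quantization training and how to optimize it. Existing precision scaling laws focus mainly on integer quantization and pay little attention to the constituents of floating-point quantization (such as exponent bits and mantissa bits), so they fit LLM losses poorly in this setting. The paper thoroughly explores how the floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor affect LLM training performance, and presents an accurate unified scaling law for floating-point quantization. The key findings are: (1) exponent bits contribute slightly more to model performance than mantissa bits, and the paper gives the optimal exponent-mantissa bit ratio for different bit widths as a reference for hardware manufacturers; (2) low-precision LLM training has a critical data size, beyond which additional training data degrades performance; (3) the optimal floating-point quantization precision is directly proportional to computational power, but across a wide range of computational power the best cost-performance precision is estimated to lie between 4 and 8 bits. These findings provide theoretical support and practical guidance for floating-point quantization training.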

Link: https://arxiv.org/abs/2501.02423
Authors: Xingwu Sun,Shuaipeng Li,Ruobing Xie,Weidong Han,Kan Wu,Zhen Yang,Yixing Li,An Wang,Shuai Li,Jinbao Xue,Yu Cheng,Yangyu Tao,Zhanhui Kang,Chengzhong Xu,Di Wang,Jie Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Notes:

Abstract:Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization; they pay less attention to the constituents of floating-point quantization and thus cannot fit the LLM losses well in this scenario. In contrast, while floating-point quantization training is more commonly implemented in production, the research on it has been relatively superficial. In this paper, we thoroughly explore the effects of floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the floating-point quantization training performance of LLMs. While presenting an accurate floating-point quantization unified scaling law, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to the model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit numbers, which is available for future reference by hardware manufacturers; (2) We discover the formation of the critical data size in low-precision LLM training. Training data beyond the critical data size will instead degrade LLM performance; (3) The optimal floating-point quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision lies between 4-8 bits.
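
To make the exponent/mantissa trade-off in this abstract concrete, the sketch below rounds a Python float to a hypothetical format with a chosen number of exponent and mantissa bits. It is a simulation under stated assumptions, not the paper's quantizer: it follows an IEEE-style layout (top exponent code reserved, subnormals kept, round-half-to-even), applies no scaling factor, and saturates instead of producing infinities. Real FP8 variants differ in such details.

```python
import math

def quantize_fp(x, exp_bits, man_bits):
    """Round x to the nearest value representable with the given bit widths."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias                    # smallest normal exponent
    max_exp = 2 ** exp_bits - 2 - bias    # assumption: top exponent code reserved
    exponent = math.floor(math.log2(abs(x)))
    exponent = max(min(exponent, max_exp), min_exp)
    step = 2.0 ** (exponent - man_bits)   # spacing of representable values here
    max_value = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    rounded = round(x / step) * step      # Python rounds halves to even
    return max(-max_value, min(max_value, rounded))  # saturate at the largest value

# Same 8-bit budget (1 sign bit), split two ways.
fine = quantize_fp(1.1, exp_bits=4, man_bits=3)    # more mantissa: finer steps
coarse = quantize_fp(1.1, exp_bits=5, man_bits=2)  # more exponent: wider range
```

With the bit budget fixed, each bit moved from the mantissa to the exponent halves the resolution near 1.0 but extends the representable range, which is the balance the paper's optimal exponent-mantissa ratio is about.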

[NLP-61] Anonymization by Design of Language Modeling

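[Quick read]: This paper addresses privacy in natural language processing (NLP) models applied to sensitive domains such as healthcare, in particular the risk that a model memorizes and leaks direct and indirect identifying information during training. It proposes a privacy-by-design language modeling approach that anonymizes models to promote their sharing. The solution consists of two methodologies: one based on Masking Language Modeling (MLM) for specializing BERT-like models, and one based on Causal Language Modeling (CLM) for specializing GPT-like models. Both prevent the model from memorizing direct and indirect identifying information in the training data, maintaining high privacy while retaining high utility. Experiments on medical datasets show that the approach offers the best tradeoff between privacy and utility.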

Link: https://arxiv.org/abs/2501.02407
Authors: Antoine Boutet,Zakaria El Kazdam,Lucas Magnana,Helain Zimmermann
Affiliations: INSA Lyon; Inria; CITI; Ensimag
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes:

Abstract:Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when models specialized on sensitive data can memorize and then expose and regurgitate confidential information. This paper presents a privacy-by-design language modeling approach to address the problem of language model anonymization, and thus promote their sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model, and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, each of which prevents the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches using medical datasets and compared them against different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer the best tradeoff for maintaining high privacy while retaining high utility.

[NLP-62] Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers

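[Quick read]: This paper addresses the limitations of the conventional Transformer architecture in handling complex dependencies, in particular its ability to capture relational reasoning over graph-structured data. It proposes integrating graph-aware relational reasoning into the Transformer's attention mechanism, an approach called Graph-Aware Isomorphic Attention. The key is to reformulate attention as a graph operation and to use advanced graph modeling strategies, including Graph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA), to enrich the representation of relational structures. The paper also introduces Sparse GIN-Attention, which interprets attention matrices as sparse adjacency graphs to enhance the adaptability of pre-trained foundation models at low computational overhead. The approach reduces the generalization gap, improves learning performance, and offers a new perspective on Transformers that lets them adapt dynamically to local and global dependencies, with potential applications in bioinformatics, materials science, language modeling, and other fields.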

Link: https://arxiv.org/abs/2501.02393
Authors: Markus J. Buehler
Affiliations: MIT
Categories: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language modeling. Building on the inherent connection between attention and graph theory, we reformulate the Transformer’s attention mechanism as a graph operation and propose Graph-Aware Isomorphic Attention. This method leverages advanced graph modeling strategies, including Graph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA), to enrich the representation of relational structures. Our approach captures complex dependencies and generalizes across tasks, as evidenced by a reduced generalization gap and improved learning performance. Additionally, we expand the concept of graph-aware attention to introduce Sparse GIN-Attention, a fine-tuning approach that employs sparse GINs. By interpreting attention matrices as sparse adjacency graphs, this technique enhances the adaptability of pre-trained foundational models with minimal computational overhead, endowing them with graph-aware capabilities. Sparse GIN-Attention fine-tuning achieves improved training dynamics and better generalization compared to alternative methods like low-rank adaptation (LoRA). We discuss latent graph-like structures within traditional attention mechanisms, offering a new lens through which Transformers can be understood. By evolving Transformers as hierarchical GIN models for relational reasoning, this perspective suggests profound implications for foundational model development, enabling the design of architectures that dynamically adapt to both local and global dependencies. Applications in bioinformatics, materials science, language modeling, and beyond could benefit from this synthesis of relational and sequential data modeling, setting the stage for interpretable and generalizable modeling strategies.
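
The idea of treating attention weights as an adjacency matrix and aggregating in GIN style can be sketched as follows. This is a generic illustration under assumptions, not the paper's architecture: it uses a single head, dense softmax attention as a soft adjacency, the standard GIN update MLP((1 + eps) * h + sum over neighbours), and hypothetical weight names. The sparse variant and the PNA aggregators are not shown.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gin_attention(X, Wq, Wk, W1, W2, eps=0.0):
    """One GIN-style update using attention weights as a soft adjacency matrix."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    A = softmax(scores, axis=-1)           # each row sums to one
    aggregated = (1.0 + eps) * X + A @ X   # GIN: self term plus weighted neighbour sum
    hidden = np.maximum(aggregated @ W1, 0.0)  # two-layer MLP with ReLU
    return hidden @ W2, A

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # 4 tokens, 8 features
Wq, Wk = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
out, A = gin_attention(X, Wq, Wk, W1, W2)
```

Compared with standard attention, the only structural change here is that the value path is replaced by the GIN aggregation and MLP; thresholding `A` to a sparse graph would be one way to move towards the sparse variant the abstract mentions.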

[NLP-63] Syntactic Evolution in Language Usage

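[Quick read]: This paper explores how linguistic style changes with age, from the post-teenage years to old age. Using linguistic analysis tools and methods, the study examines how individuals adapt and adjust their language use at different stages of life. The key to the approach is syntactic analysis of a blog dataset from 2004, focusing on English. The findings have implications for linguistics, psychology, and communication studies and shed light on the intricate relationship between age and language.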

Link: https://arxiv.org/abs/2501.02392
Authors: Surbhit Kumar
Affiliations: Rochester Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 4 pages, 7 figures

Abstract:This research aims to investigate the dynamic nature of linguistic style throughout various stages of life, from the post-teenage years to old age. By employing linguistic analysis tools and methodologies, the study delves into the intricacies of how individuals adapt and modify their language use over time. The research uses a data set of blogs from this http URL from 2004 and focuses on English for syntactic analysis. The findings of this research can have implications for linguistics, psychology, and communication studies, shedding light on the intricate relationship between age and language.

[NLP-64] Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations

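[Quick read]: This paper addresses a key problem of current vision-language models (VLMs) in the medical domain: they usually operate at the whole-image level and cannot attend to fine-grained details in medical images, so the information they provide may be irrelevant or lack clinical value. The paper proposes the MedVP framework, whose key steps are extracting medical entities, generating visual prompts, and adapting datasets for visual-prompt-guided fine-tuning. By introducing visual prompts, MedVP restricts the model's attention to specific regions and significantly improves performance on multiple medical visual question answering (VQA) datasets. The experimental results show that the approach is both effective and clinically significant.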

Link: https://arxiv.org/abs/2501.02385
Authors: Kangyu Zhu,Ziyuan Qin,Huahui Yi,Zekun Jiang,Qicheng Lao,Shaoting Zhang,Kang Li
Affiliations: Brown University; Case Western Reserve University; Sichuan University; Shanghai AI Lab; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes:

Abstract:With the recent advancements in vision-language models (VLMs) driven by large language models (LLMs), many researchers have focused on models comprising an image encoder, an image-to-language projection layer, and a text decoder, leading to the emergence of works like LLaVA-Med. However, these works primarily operate at the whole-image level, aligning general information from 2D medical images without attending to finer details. As a result, these models often provide irrelevant or non-clinically valuable information while missing critical details. Medical vision-language tasks differ significantly from general-image tasks, particularly in their focus on fine-grained details and their exclusion of irrelevant content. General domain VLMs tend to prioritize global information due to their design, which compresses the entire image into a multi-token representation that is passed into the LLM decoder. Therefore, current VLMs all lack the capability to restrict their attention to particular areas. To address this critical issue in the medical domain, we introduce MedVP, a visual prompt generation and fine-tuning framework, which involves extracting medical entities, generating visual prompts, and adapting datasets for visual prompt guided fine-tuning. To the best of our knowledge, this is the first work to explicitly introduce visual prompts into medical VLMs, and we successfully outperform recent state-of-the-art large models across multiple medical VQA datasets. Extensive experiments are conducted to analyze the impact of different visual prompt forms and how they contribute to performance improvement. The results demonstrate both the effectiveness and clinical significance of our approach.

[NLP-65] Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison

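[Quick read]: This paper addresses how to extend the capabilities of large language models (LLMs) to speech, in particular how to integrate speech information into LLMs effectively. It examines dense feature prepending (DFP), which prepends projected speech representations to the textual representations and thereby allows end-to-end training with the speech encoder. However, DFP typically requires connecting a text decoder to a speech encoder, which raises questions about how much a sophisticated speech encoder matters and how DFP compares with a standard encoder-decoder (cross-attention) architecture. The paper trains all models from scratch with comparable data and parameter settings, tests speech recognition (ASR) and speech translation (ST) on the MuST-C v1.0 and CoVoST2 datasets, studies the influence of the speech encoder in DFP, and compares DFP with cross-attention under various configurations. Although DFP is the more prevalent choice, the results do not indicate a clear advantage for it.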

链接: https://arxiv.org/abs/2501.02370
作者: Tsz Kin Lam,Marco Gaido,Sara Papi,Luisa Bentivogli,Barry Haddow
机构: School of Informatics, University of Edinburgh (爱丁堡大学信息学院); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ARR October 2024

Abstract:Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech – the most common form in communication. To integrate speech into LLMs, one promising approach is dense feature prepending (DFP) which prepends the projected speech representations to the textual representations, allowing end-to-end training with the speech encoder. However, DFP typically requires connecting a text decoder to a speech encoder. This raises questions about the importance of having a sophisticated speech encoder for DFP, and how its performance compares with a standard encoder-decoder (i.e. cross-attention) architecture. In order to perform a controlled architectural comparison, we train all models from scratch, rather than using large pretrained models, and use comparable data and parameter settings, testing speech-to-text recognition (ASR) and translation (ST) on MuST-C v1.0 and CoVoST2 datasets. We study the influence of a speech encoder in DFP. More importantly, we compare DFP and cross-attention under a variety of configurations, such as CTC compression, sequence-level knowledge distillation, generation speed and GPU memory footprint on monolingual, bilingual and multilingual models. Despite the prevalence of DFP over cross-attention, our overall results do not indicate a clear advantage of DFP.
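
As a rough illustration of what dense feature prepending does to the decoder input, the sketch below projects toy speech frames into the text embedding space and concatenates them in front of the text embeddings. The function names, the projection matrix, and the vector sizes are hypothetical and are not taken from the paper.

```python
def project(frame, weight):
    """Linear projection of one speech frame: a plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


def dfp_decoder_input(speech_frames, text_embeddings, weight):
    """Prepend projected speech vectors to the text embeddings (DFP-style input)."""
    projected = [project(frame, weight) for frame in speech_frames]
    return projected + text_embeddings  # concatenation along the time axis


# Toy data: 3 speech frames of size 3, projected to a text embedding size of 2.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]  # hypothetical 2x3 projection matrix
speech_frames = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
text_embeddings = [[0.5, 0.5], [0.1, 0.9]]

decoder_input = dfp_decoder_input(speech_frames, text_embeddings, weight)
print(len(decoder_input))  # number of positions the decoder attends over
print(decoder_input[0])    # first projected speech vector
```

In a cross-attention model, by contrast, the decoder input would hold only the text positions, and the speech representations would be read through a separate attention sublayer.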

[NLP-66] Context Aware Lemmatization and Morphological Tagging Method in Turkish

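[Quick read]: This paper addresses meaning-sensitive lemmatization and morphological tagging for Turkish. Lemmatization reduces a word to its root, while morphological tagging predicts the word's grammatical information. The key to the proposed solution is combining a bidirectional LSTM, which handles the spelling of words, with a Turkish BERT model, which captures their meaning. By considering both spelling and meaning, the models aim to improve accuracy on both tasks. According to the authors, this is the first meaning-sensitive lemmatization study for Turkish. The models are trained on the IMST and PUD datasets, and a comparison with the SIGMORPHON 2019 competition results is reported to favor the proposed models.

Link: https://arxiv.org/abs/2501.02361
Authors: Cagri Sayallar
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: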

Abstract:The smallest part of a word that defines the word is called a word root. Word roots are used to increase success in many applications since they simplify the word. In this study, the lemmatization model, which is a word root finding method, and the morphological tagging model, which predicts the grammatical knowledge of the word, are presented. The presented model was developed for Turkish, and both models make predictions by taking the meaning of the word into account. In the literature, there is no lemmatization study that is sensitive to word meaning in Turkish. For this reason, the present study shares the model and the results obtained from the model on Turkish lemmatization for the first time in the literature. In the present study, in the lemmatization and morphological tagging models, bidirectional LSTM is used for the spelling of words, and the Turkish BERT model is used for the meaning of words. The models are trained using the IMST and PUD datasets from Universal Dependencies. The results from the training of the models were compared with the results from the SIGMORPHON 2019 competition. The results of the comparisons revealed that our models were superior.

[NLP-67] Thinking with Many Minds: Using Large Language Models for Multi-Perspective Problem-Solving

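[Quick read]: This paper addresses the limits of cognitive flexibility in complex problem-solving. Mental simulation enables imagined deliberation, but cognitive constraints limit its effectiveness. The authors propose synthetic deliberation, a Large Language Model (LLM)-based method that simulates discourse between agents embodying diverse perspectives. The key is a custom GPT-based model that provides concurrent processing of multiple viewpoints, parallel exploration of perspectives, and precise control over viewpoint synthesis. By externalizing the deliberative process and distributing cognitive labor between parallel search and integration, synthetic deliberation goes beyond the limits of mental simulation and shows promise for strategic planning, policymaking, and conflict resolution.

Link: https://arxiv.org/abs/2501.02348
Authors: Sanghyun Park, Boris Maciejovsky, Phanish Puranam
Affiliations: unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 36 pages, 1 appendix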

Abstract:Complex problem-solving requires cognitive flexibility–the capacity to entertain multiple perspectives while preserving their distinctiveness. This flexibility replicates the “wisdom of crowds” within a single individual, allowing them to “think with many minds.” While mental simulation enables imagined deliberation, cognitive constraints limit its effectiveness. We propose synthetic deliberation, a Large Language Model (LLM)-based method that simulates discourse between agents embodying diverse perspectives, as a solution. Using a custom GPT-based model, we showcase its benefits: concurrent processing of multiple viewpoints without cognitive degradation, parallel exploration of perspectives, and precise control over viewpoint synthesis. By externalizing the deliberative process and distributing cognitive labor between parallel search and integration, synthetic deliberation transcends mental simulation’s limitations. This approach shows promise for strategic planning, policymaking, and conflict resolution.

[NLP-68] Optimizing Small Language Models for In-Vehicle Function-Calling

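[Quick read]: This paper addresses deploying Small Language Models (SLMs) as function-calling agents on in-vehicle edge devices, as a more flexible and robust alternative to traditional rule-based systems. The key elements of the solution are: 1) applying state-of-the-art model compression techniques (structured pruning, healing, and quantization) so that the model fits resource-constrained in-vehicle hardware; 2) optimizing a representative SLM, Microsoft's Phi-3 mini, through compression, task-specific fine-tuning, and vehicle integration; 3) executing the model in a lightweight runtime environment, which reaches a generation speed of 11 tokens per second and makes real-time inference feasible. The paper reports that the SLM keeps its ability to handle complex in-vehicle tasks even after up to 2 billion parameters are removed, and points to the potential for a better user experience.

Link: https://arxiv.org/abs/2501.02342
Authors: Yahya Sowti Khiabani, Farris Atif, Chieh Hsu, Sven Stahlmann, Tobias Michels, Sebastian Kramer, Benedikt Heidrich, M. Saquib Sarfraz, Julian Merten, Faezeh Tafazzoli
Affiliations: Mercedes-Benz Research & Development North America; Mercedes-Benz Tech Innovation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: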

Abstract:We propose a holistic approach for deploying Small Language Models (SLMs) as function-calling agents within vehicles as edge devices, offering a more flexible and robust alternative to traditional rule-based systems. By leveraging SLMs, we simplify vehicle control mechanisms and enhance the user experience. Given the in-vehicle hardware constraints, we apply state-of-the-art model compression techniques, including structured pruning, healing, and quantization, ensuring that the model fits within the resource limitations while maintaining acceptable performance. Our work focuses on optimizing a representative SLM, Microsoft’s Phi-3 mini, and outlines best practices for enabling embedded models, including compression, task-specific fine-tuning, and vehicle integration. We demonstrate that, despite significant reduction in model size which removes up to 2 billion parameters from the original model, our approach preserves the model’s ability to handle complex in-vehicle tasks accurately and efficiently. Furthermore, by executing the model in a lightweight runtime environment, we achieve a generation speed of 11 tokens per second, making real-time, on-device inference feasible without hardware acceleration. Our results demonstrate the potential of SLMs to transform vehicle control systems, enabling more intuitive interactions between users and their vehicles for an enhanced driving experience.
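
To make the idea of a function-calling agent concrete, here is a minimal sketch of the dispatch step that sits between a model and the vehicle. It assumes the model emits a JSON object with `name` and `arguments` fields; that output format, the two vehicle functions, and the registry are all hypothetical and are not described in the abstract.

```python
import json


# Hypothetical vehicle functions that the model is allowed to call.
def set_temperature(zone: str, celsius: float) -> str:
    return f"temperature in {zone} set to {celsius} C"


def open_window(position: str, percent: int) -> str:
    return f"{position} window opened to {percent}%"


REGISTRY = {"set_temperature": set_temperature, "open_window": open_window}


def dispatch(model_output: str) -> str:
    """Parse a JSON function call emitted by the model and run it if it is valid."""
    try:
        call = json.loads(model_output)
        name = call["name"]
        args = call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "error: malformed function call"
    func = REGISTRY.get(name)
    if func is None:
        return f"error: unknown function {name!r}"
    try:
        return func(**args)
    except TypeError:
        # Raised for missing, unexpected, or non-mapping arguments.
        return f"error: bad arguments for {name!r}"


print(dispatch('{"name": "set_temperature", "arguments": {"zone": "driver", "celsius": 21.5}}'))
print(dispatch("turn it up"))
```

Rejecting unknown names and malformed output before anything is executed is the kind of guard a rule-based system gets for free and a generative agent has to add explicitly.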

[NLP-69] AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference AAAI

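[Quick read]: This paper targets the high storage and computational costs of long-context large language model (LLM) inference. Existing layer-wise skipping methods have several limitations in this setting: they cannot adapt to model and context variability, they disregard sublayer significance, and they do not apply to the prefilling phase. The authors propose AdaSkip, an adaptive sublayer skipping method designed specifically for long-context inference. AdaSkip uses on-the-fly similarity information to adaptively identify less important layers, supports sublayer-wise skipping, and accelerates both the prefilling and decoding phases. Extensive experiments on multiple long-context benchmarks and models are reported to show better inference performance than existing baselines.

Link: https://arxiv.org/abs/2501.02336
Authors: Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu
Affiliations: Huawei Cloud
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 10 figures, AAAI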

Abstract:Long-context large language model (LLM) inference is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations but rarely explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied in long-context inference, including the inability to adapt to model and context variability, disregard for sublayer significance, and inapplicability for the prefilling phase. This paper proposes AdaSkip, an adaptive sublayer skipping method specifically designed for long-context inference. AdaSkip adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showcasing its superior inference performance over existing baselines.
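
One way to picture similarity-based sublayer skipping is sketched below. The abstract does not say which similarity measure or selection rule AdaSkip uses, so the cosine similarity between a sublayer's input and output and the top-k selection are assumptions made here for illustration, as are the sublayer names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pick_sublayers_to_skip(io_pairs, k):
    """Return the k sublayers whose output is most similar to their input.

    A sublayer that barely changes its input is treated as less important
    (an assumption of this sketch, not a statement about the paper).
    """
    scores = {name: cosine(x, y) for name, (x, y) in io_pairs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy (input, output) hidden states for three sublayers.
io_pairs = {
    "attn_0": ([1.0, 0.0], [1.0, 0.1]),   # small change
    "ffn_0": ([1.0, 0.0], [0.0, 1.0]),    # large change
    "attn_1": ([1.0, 1.0], [1.0, 1.0]),   # no change
}
skipped = pick_sublayers_to_skip(io_pairs, k=2)
print(skipped)
```

Scoring attention and feed-forward sublayers separately, rather than whole layers, is what makes the skipping sublayer-wise in this toy setup.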

[NLP-70] Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications

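[Quick read]: This paper examines the use of generative AI for scoring constructed responses in high-stakes testing, and in particular its advantages and challenges relative to traditional feature-based natural language processing (NLP) scoring systems. The core question is how to establish the validity of generative AI scoring systems and support it with sufficient validity evidence. The key contribution is a set of best practices for collecting and evaluating that evidence, which matters especially given generative AI's lack of transparency and concerns about consistency. The paper also discusses a contributory scoring approach that combines multiple AI scores from different sources to cover more of the construct in the absence of human ratings.

Link: https://arxiv.org/abs/2501.02334
Authors: Jodi M. Casabianca, Daniel F. McCaffrey, Matthew S. Johnson, Naim Alper, Vladimir Zubenko
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 33 pages, 2 figures, 6 tables; This work was presented at the 2024 meeting of the International Testing Commission in Granada, Spain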

Abstract:The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based NLP scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from standardized tests demonstrate the collection of validity evidence for different types of scoring systems and highlights the numerous complexities and considerations when making a validity argument for these scores. In addition, we discuss how the evaluation of AI scores might include a consideration of how a contributory scoring approach combining multiple AI scores (from different sources) will cover more of the construct in the absence of human ratings.

[NLP-71] Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection

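[Quick read]: This paper addresses explicit and implicit bias in the content generated by large language models (LLMs). Explicit bias has been studied extensively, whereas implicit bias remains largely unexplored. The authors present a systematic framework grounded in social psychology theories that compares the two through a "self-reflection" based evaluation. The framework has two phases: it first measures implicit bias with simulated psychological assessment methods, then evaluates explicit bias by prompting LLMs to analyze their own generated content. Experiments show a substantial inconsistency between the two: explicit biases manifest as mild stereotypes, while implicit biases show strong stereotypes. The paper also examines factors behind this inconsistency, including training data scale, model parameters, and alignment techniques (e.g., RLHF, DPO). Explicit bias diminishes as training data and model size grow, whereas implicit bias trends upward. Current alignment methods suppress explicit bias effectively but have limited effect on implicit bias. The authors conclude that scaling and alignment training can address explicit bias, but implicit bias requires novel approaches beyond current methodologies.

Link: https://arxiv.org/abs/2501.02295
Authors: Yachao Zhao, Bo Wang, Yan Wang
Affiliations: College of Intelligence and Computing, Tianjin University
Categories: Computation and Language (cs.CL)
Comments: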

Abstract:Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content. While extensive research has investigated bias in LLMs, prior work has predominantly focused on explicit bias, leaving the more nuanced implicit biases largely unexplored. This paper presents a systematic framework grounded in social psychology theories to investigate and compare explicit and implicit biases in LLMs. We propose a novel “self-reflection” based evaluation framework that operates in two phases: first measuring implicit bias through simulated psychological assessment methods, then evaluating explicit bias by prompting LLMs to analyze their own generated content. Through extensive experiments on state-of-the-art LLMs across multiple social dimensions, we demonstrate that LLMs exhibit a substantial inconsistency between explicit and implicit biases, where explicit biases manifest as mild stereotypes while implicit biases show strong stereotypes. Furthermore, we investigate the underlying factors contributing to this explicit-implicit bias inconsistency. Our experiments examine the effects of training data scale, model parameters, and alignment techniques. Results indicate that while explicit bias diminishes with increased training data and model size, implicit bias exhibits a contrasting upward trend. Notably, contemporary alignment methods (e.g., RLHF, DPO) effectively suppress explicit bias but show limited efficacy in mitigating implicit bias. These findings suggest that while scaling up models and alignment training can address explicit bias, the challenge of implicit bias requires novel approaches beyond current methodologies.

[NLP-72] LLMzSzŁ: a comprehensive LLM benchmark for Polish

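[Quick read]: This paper addresses how to evaluate large language models (LLMs) in Polish, in particular their accuracy and cross-lingual knowledge transfer on Polish national exams. The key is LLMzSzŁ (LLMs Behind the School Desk), a comprehensive benchmark built from the archives of the Polish Central Examination Board that covers 4 types of exams from 154 domains and contains almost 19,000 closed-ended questions. Using this benchmark, the authors evaluate open-source multilingual, English, and Polish LLMs to verify their ability to transfer knowledge between languages, and examine the correlation between LLMs and humans in terms of model accuracy and exam pass rates. The results indicate that multilingual LLMs can outperform monolingual ones, although monolingual models may be preferable when model size is constrained. The study also highlights the potential of LLMs to assist with exam validation, particularly in identifying anomalies or errors in examination tasks.

Link: https://arxiv.org/abs/2501.02266
Authors: Krzysztof Jassem, Michał Ciesiółka, Filip Graliński, Piotr Jabłoński, Jakub Pokrywka, Marek Kubis, Monika Jabłońska, Ryszard Staruch
Affiliations: Adam Mickiewicz University; Center for Artificial Intelligence AMU
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: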

Abstract:This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. It covers 4 types of exams, coming from 154 domains. Altogether, it consists of almost 19k closed-ended questions. We investigate the performance of open-source multilingual, English, and Polish LLMs to verify LLMs’ abilities to transfer knowledge between languages. Also, the correlation between LLMs and humans at model accuracy and exam pass rate levels is examined. We show that multilingual LLMs can obtain superior results over monolingual ones; however, monolingual models may be beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting with exam validation, particularly in identifying anomalies or errors in examination tasks.

[NLP-73] Financial Named Entity Recognition: How Far Can LLM Go? COLING2025

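[Quick read]: This paper examines how effective large language models (LLMs) are at Named Entity Recognition (NER) in the financial domain. With the growing volume of financial statements, announcements, and business news, extracting key information from unstructured data to construct structured data is a foundational task for intelligent financial analytics, yet the behavior of generic LLMs under different prompts is not well understood. The authors systematically evaluate state-of-the-art LLMs and prompting methods on financial NER, highlight their strengths and limitations, identify five representative failure types, and discuss their potential and challenges for domain-specific tasks. The key is an experimental comparison of different LLMs and prompting methods.

Link: https://arxiv.org/abs/2501.02237
Authors: Yi-Te Lu, Yintong Huo
Affiliations: National Taiwan University; Singapore Management University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at The Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), in conjunction with COLING 2025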

Abstract:The surge of large language models (LLMs) has revolutionized the extraction and analysis of crucial information from a growing volume of financial statements, announcements, and business news. Recognizing named entities to construct structured data poses a significant challenge in analyzing financial documents and is a foundational task for intelligent financial analytics. However, how effective these generic LLMs are, and how they perform under various prompts, is not yet well understood. To fill this gap, we present a systematic evaluation of state-of-the-art LLMs and prompting methods in the financial Named Entity Recognition (NER) problem. Specifically, our experimental results highlight their strengths and limitations, identify five representative failure types, and provide insights into their potential and challenges for domain-specific tasks.

[NLP-74] Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends

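[Quick read]: This survey discusses how large language models (LLMs) improve Visually-rich Document Understanding (VrDU), especially on tasks that require both comprehension and generation, such as question answering. It focuses on methods for integrating visually rich document (VrD) features into LLMs and on the key challenges this raises, including how to fuse visual and textual information effectively and how to handle the resulting computational complexity and training difficulty.

Link: https://arxiv.org/abs/2501.02235
Authors: Camille Barboule, Benjamin Piwowarski, Yoan Chabot
Affiliations: Orange; Sorbonne Université, CNRS, ISIR
Categories: Computation and Language (cs.CL)
Comments: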

Abstract:Using Large Language Models (LLMs) for Visually-rich Document Understanding (VrDU) has significantly improved performance on tasks requiring both comprehension and generation, such as question answering, albeit introducing new challenges. This survey explains how VrDU models enhanced by LLMs function, covering methods for integrating VrD features into LLMs and highlighting key challenges.

[NLP-75] Examining the Robustness of Homogeneity Bias to Hyperparameter Adjustments in GPT-4

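[Quick read]: This paper addresses the societal stereotypes that Vision-Language Models reproduce and amplify after training on massive collections of human-generated data, and specifically homogeneity bias: the tendency to represent certain groups as more homogeneous than others. The study adjusts GPT-4 hyperparameters, namely sampling temperature and top p (which control the randomness of model outputs), to see how they affect this bias. It generates stories about individuals from different racial and gender groups, compares their similarity using vector representations, and assesses both the robustness of the bias and its relationship with hyperparameter values. Homogeneity bias persists across most hyperparameter configurations, with Black Americans and women represented more homogeneously than White Americans and men. The relationship between hyperparameters and group representations is non-linear, particularly at extreme values, and hyperparameter adjustments affect racial and gender homogeneity bias differently. The findings suggest that hyperparameter tuning may mitigate certain biases to some extent but cannot serve as a universal solution across social group dimensions.

Link: https://arxiv.org/abs/2501.02211
Authors: Messi H.J. Lee
Affiliations: Washington University in St. Louis
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: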

Abstract:Vision-Language Models trained on massive collections of human-generated data often reproduce and amplify societal stereotypes. One critical form of stereotyping reproduced by these models is homogeneity bias: the tendency to represent certain groups as more homogeneous than others. We investigate how this bias responds to hyperparameter adjustments in GPT-4, specifically examining sampling temperature and top p, which control the randomness of model outputs. By generating stories about individuals from different racial and gender groups and comparing their similarities using vector representations, we assess both bias robustness and its relationship with hyperparameter values. We find that (1) homogeneity bias persists across most hyperparameter configurations, with Black Americans and women being represented more homogeneously than White Americans and men, (2) the relationship between hyperparameters and group representations shows unexpected non-linear patterns, particularly at extreme values, and (3) hyperparameter adjustments affect racial and gender homogeneity bias differently: while increasing temperature or decreasing top p can reduce racial homogeneity bias, these changes show different effects on gender homogeneity bias. Our findings suggest that while hyperparameter tuning may mitigate certain biases to some extent, it cannot serve as a universal solution for addressing homogeneity bias across different social group dimensions.
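
The comparison of story similarities can be pictured with a small sketch. It assumes that each story has already been turned into a vector and that homogeneity is summarized as the mean pairwise cosine similarity within a group; the abstract only says that vector representations are compared, so this particular metric and the toy vectors are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of story vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy story vectors for two groups (real ones would come from an embedding model).
group_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # identical stories
group_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # more varied stories

print(mean_pairwise_similarity(group_a))
print(mean_pairwise_similarity(group_b))
```

A higher value for one group than another, under the same prompt and sampling settings, is what a homogeneity gap would look like in this setup.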

[NLP-76] CPTuning: Contrastive Prompt Tuning for Generative Relation Extraction

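[Quick read]: This paper addresses entity pair overlap in generative relation extraction (RE): existing methods assume a single deterministic relation between each pair of entities and ignore real scenarios in which multiple relations may be valid. The authors propose a contrastive prompt tuning method, CPTuning. It learns to associate a candidate relation between two in-context entities with a probability mass above or below a threshold, corresponding to whether the relation exists. CPTuning also organizes RE as a verbalized relation generation task and uses Trie-constrained decoding to ensure that generated relations are valid. At inference time it adaptively picks the candidate relations with high estimated likelihood, which enables multi-relation extraction. Experiments show that T5-large fine-tuned with CPTuning significantly outperforms previous methods on multiple datasets, for both single- and multiple-relation extraction.

Link: https://arxiv.org/abs/2501.02196
Authors: Jiaxin Duan, Fengyu Lu, Junfei Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: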

Abstract:Generative relation extraction (RE) commonly involves first reformulating RE as a linguistic modeling problem easily tackled with pre-trained language models (PLM) and then fine-tuning a PLM with supervised cross-entropy loss. Although having achieved promising performance, existing approaches assume only one deterministic relation between each pair of entities without considering real scenarios where multiple relations may be valid, i.e., entity pair overlap, causing their limited applications. To address this problem, we introduce a novel contrastive prompt tuning method for RE, CPTuning, which learns to associate a candidate relation between two in-context entities with a probability mass above or below a threshold, corresponding to whether the relation exists. Beyond learning schema, CPTuning also organizes RE as a verbalized relation generation task and uses Trie-constrained decoding to ensure a model generates valid relations. It adaptively picks out the generated candidate relations with a high estimated likelihood in inference, thereby achieving multi-relation extraction. We conduct extensive experiments on four widely used datasets to validate our method. Results show that T5-large fine-tuned with CPTuning significantly outperforms previous methods, regardless of single or multiple relations extraction.
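
Trie-constrained decoding and threshold-based selection can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the word-level tokens, the `<eos>` marker, the relation names, and the 0.5 threshold are all hypothetical.

```python
def build_trie(relations):
    """Build a nested-dict trie from tokenized relation names."""
    root = {}
    for tokens in relations:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # marks the end of a valid relation
    return root


def allowed_next(trie, prefix):
    """Return the tokens that keep the prefix inside the set of valid relations."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node)


def select_relations(likelihoods, threshold=0.5):
    """Keep every candidate whose estimated likelihood exceeds the threshold."""
    return [rel for rel, p in likelihoods.items() if p > threshold]


RELATIONS = [["place", "of", "birth"], ["place", "of", "death"], ["employer"]]
trie = build_trie(RELATIONS)
print(allowed_next(trie, ["place", "of"]))
print(select_relations({"place of birth": 0.9, "employer": 0.2, "place of death": 0.6}))
```

During generation, a decoder would mask out every token not returned by `allowed_next`, so it can only emit relation names that exist in the schema. Keeping all candidates above the threshold, rather than only the best one, is what allows more than one relation per entity pair.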

[NLP-77] Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

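[Quick read]: This survey addresses the lack of a systematic overview of multimodal Vision Language Models (VLMs), particularly for researchers who want to apply VLMs in their own domains. It covers model information, architectures, and training methods for the major VLMs of the past five years (2019-2024), summarizes and categorizes popular benchmarks and evaluation metrics, and discusses applications including embodied agents, robotics, and video generation. It also analyzes current challenges such as hallucination, fairness, and safety. The key contribution is a systematic organization of existing work, together with detailed links to papers and model repositories.

Link: https://arxiv.org/abs/2501.02189
Authors: Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, Guangyao Shi
Affiliations: University of Maryland, College Park; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 34 pages, 3 figures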

Abstract:Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in this https URL.

[NLP-78] Personalized Graph-Based Retrieval for Large Language Models

[Quick read]: This paper addresses the limitations of large language models (LLMs) in personalized generation, especially in cold-start scenarios where sparse or missing user history leads to poor personalized output. Existing approaches often rely solely on user history to augment the prompt, which works poorly when data is sparse. The authors propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs: it integrates structured user knowledge directly into the retrieval process and augments prompts with user-relevant context to improve contextual understanding and output quality. The paper also introduces the Personalized Graph-based Benchmark for Text Generation for evaluating personalized text generation when user history is sparse or unavailable. Experiments show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks.

Link: https://arxiv.org/abs/2501.02157
Authors: Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy, Ryan A. Rossi, Nesreen K. Ahmed
Affiliations: University of California Santa Cruz; Meta AI; Adobe Research; University of Oregon; University of Southern California; Cisco AI Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.

[NLP-79] Table as Thought: Exploring Structured Thoughts in LLM Reasoning

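[Quick read]: This paper addresses the lack of attention to the structure of individual thought steps in LLM reasoning. Existing methods mainly organize the sequence of thoughts and leave the internal structure of each step underexplored. The authors propose Table as Thought, a framework inspired by cognitive neuroscience theories that organizes reasoning within a tabular schema: rows represent sequential thought steps, and columns capture critical constraints and contextual information. The reasoning process iteratively populates the table until self-verification confirms completeness and correctness. Experiments show that Table as Thought excels in planning tasks and shows strong potential in mathematical reasoning compared to unstructured thought baselines.

Link: https://arxiv.org/abs/2501.02152
Authors: Zhenjie Sun, Naihao Deng, Haofei Yu, Jiaxuan You
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: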

Abstract:Large language models’ reasoning abilities benefit from methods that organize their thought processes, such as chain-of-thought prompting, which employs a sequential structure to guide the reasoning process step-by-step. However, existing approaches focus primarily on organizing the sequence of thoughts, leaving structure in individual thought steps underexplored. To address this gap, we propose Table as Thought, a framework inspired by cognitive neuroscience theories on human thought. Table as Thought organizes reasoning within a tabular schema, where rows represent sequential thought steps and columns capture critical constraints and contextual information to enhance reasoning. The reasoning process iteratively populates the table until self-verification ensures completeness and correctness. Our experiments show that Table as Thought excels in planning tasks and demonstrates a strong potential for enhancing LLM performance in mathematical reasoning compared to unstructured thought baselines. This work provides a novel exploration of refining thought representation within LLMs, paving the way for advancements in reasoning and AI cognition.
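
The populate-then-verify loop described in the abstract can be sketched as follows. The column names, the fixed `PLAN` that stands in for model output, and the overlap check are hypothetical; in practice `propose_row` and `verify` would be LLM calls, and the actual schema design in the paper may differ.

```python
def table_as_thought(columns, propose_row, verify, max_steps=10):
    """Add one row (thought step) at a time until the table passes verification."""
    table = []
    for _ in range(max_steps):
        if verify(table):
            break
        row = propose_row(table)
        # Keep only the schema columns so that every step has the same structure.
        table.append({col: row.get(col) for col in columns})
    return table


COLUMNS = ["step", "action", "start", "end"]

# Stand-in for model output: a fixed plan that is revealed one row per call.
PLAN = [
    {"step": 1, "action": "pack", "start": 9, "end": 10},
    {"step": 2, "action": "drive", "start": 10, "end": 12},
    {"step": 3, "action": "check in", "start": 12, "end": 13},
]


def propose_row(table):
    return PLAN[len(table)]


def verify(table):
    """Complete when every planned step is present and no time slots overlap."""
    complete = len(table) == len(PLAN)
    no_overlap = all(a["end"] <= b["start"] for a, b in zip(table, table[1:]))
    return complete and no_overlap


table = table_as_thought(COLUMNS, propose_row, verify)
for row in table:
    print(row)
```

The columns make constraints such as start and end times explicit in every step, which is what lets a verifier check them mechanically.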

[NLP-80] Applying Text Mining to Analyze Human Question Asking in Creativity Research

[Quick read]: This paper asks how question asking supports creative ideation and what role questions play in creativity. Question asking is considered a cognitive mechanism for defining problems and facilitating creative problem solving, but its exact role is unclear. The key to the approach is applying text mining methods to measure the cognitive potential of questions, taking into account (a) question type, (b) question complexity, and (c) the content of the answer. Building on natural language processing techniques, the authors propose and implement a novel approach and apply it to five datasets. The results suggest that natural language processing has a role to play in creativity research.

Link: https://arxiv.org/abs/2501.02090
Authors: Anna Wróblewska, Marceli Korbin, Yoed N. Kenett, Daniel Dan, Maria Ganzha, Marcin Paprzycki
Affiliations: Faculty of Mathematics and Information Sciences, Warsaw University of Technology; Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology; School of Applied Data Science, Modul University Vienna; Systems Research Institute Polish Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 15 figures; accepted to International Conference on Big Data Analytics in Astronomy, Science and Engineering 2024, Aizu, Japan

Abstract:Creativity relates to the ability to generate novel and effective ideas in the areas of interest. How are such creative ideas generated? One possible mechanism that supports creative ideation and is gaining increased empirical attention is by asking questions. Question asking is a likely cognitive mechanism that allows defining problems, facilitating creative problem solving. However, much is unknown about the exact role of questions in creativity. This work presents an attempt to apply text mining methods to measure the cognitive potential of questions, taking into account, among others, (a) question type, (b) question complexity, and (c) the content of the answer. This contribution summarizes the history of question mining as a part of creativity research, along with the natural language processing methods deemed useful or helpful in the study. In addition, a novel approach is proposed, implemented, and applied to five datasets. The experimental results obtained are comprehensively analyzed, suggesting that natural language processing has a role to play in creative research.

[NLP-81] Instruction-Following Pruning for Large Language Models

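[Quick read]: This paper addresses a limitation of traditional static pruning for large language models (LLMs): the pruning mask is fixed and cannot adapt to the input task. The authors propose a dynamic structured pruning approach called "instruction-following pruning". Its core is a sparse mask predictor that takes the user instruction as input and dynamically selects the model parameters most relevant to the task. The sparse mask predictor and the LLM are optimized jointly using both instruction-following data and the pre-training corpus. Results on a range of evaluation benchmarks show, for example, that the 3B activated model improves over the 3B dense model by 5-8 absolute points on domains such as math and coding and rivals a 9B model.

Link: https://arxiv.org/abs/2501.02086
Authors: Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, Tao Lei
Affiliations: Apple AI/ML; UC Santa Barbara
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 3 figures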

Abstract:With the rapid scaling of large language models (LLMs), structured pruning has become a widely used technique to learn efficient, smaller models from larger ones, delivering superior performance compared to training similarly sized models from scratch. In this paper, we move beyond the traditional static pruning approach of determining a fixed pruning mask for a model, and propose a dynamic approach to structured pruning. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed “instruction-following pruning”, introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task. To identify and activate effective parameters, we jointly optimize the sparse mask predictor and the LLM, leveraging both instruction-following data and the pre-training corpus. Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model.
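
A toy version of an input-dependent pruning mask is sketched below. In the paper the mask predictor is trained jointly with the LLM; here it is replaced by a fixed dot-product scorer over hypothetical per-unit key vectors, purely to show how an instruction can select which hidden units stay active.

```python
def predict_mask(instruction_vec, unit_keys, keep):
    """Score each hidden unit against the instruction and keep the top `keep` units."""
    scores = [sum(a * b for a, b in zip(instruction_vec, key)) for key in unit_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:keep])
    return [1 if i in top else 0 for i in range(len(scores))]


def apply_mask(hidden, mask):
    """Zero out the hidden units that the mask deactivates."""
    return [h * m for h, m in zip(hidden, mask)]


# Hypothetical key vectors, one per hidden unit of a feed-forward layer.
unit_keys = [[0.9, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
hidden = [2.0, 3.0, 4.0, 5.0]

math_instruction = [1.0, 0.0]    # toy embedding of one instruction
other_instruction = [0.0, 1.0]   # toy embedding of a different instruction

print(apply_mask(hidden, predict_mask(math_instruction, unit_keys, keep=2)))
print(apply_mask(hidden, predict_mask(other_instruction, unit_keys, keep=2)))
```

The two instructions activate different subsets of the same layer, which is the contrast with a static mask that would keep one fixed subset for every input.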

[NLP-82] The interplay between domain specialization and model size: a case study in the legal domain

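[Quick read]: This paper asks how continual pre-training can yield a compute-efficient training strategy when compute is constrained. It studies the interplay between model size and domain specialization, using the legal domain as a case study. The authors compare pre-training on a general web dataset with pre-training on a filtered legal-domain subset across four model sizes (1.5B, 3B, 7B and 14B parameters). The results show that the compute-efficiency gap between specialized and general models widens as model size grows, which informs the choice of model size and training strategy under a limited compute budget.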

Link: https://arxiv.org/abs/2501.02068
Authors: Roseval Malaquias Junior,Ramon Pires,Thales Sales Almeida,Kenzo Sakiyama,Roseli Romero,Rodrigo Nogueira
Affiliations: University of São Paulo (USP); Maritaca AI; State University of Campinas (UNICAMP)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scaling laws for language models so far focused on finding the compute-optimal model size and token count for training from scratch. However, achieving this optimal balance requires significant compute resources due to the extensive data demands when training models from randomly-initialized weights. Continual pre-training offers a cost-effective alternative, leveraging the compute investment from pre-trained models to incorporate new knowledge without requiring extensive new data. Recent findings suggest that data quality influences constants in scaling laws, thereby altering the optimal parameter-token allocation ratio. Building on this insight, we investigate the interplay between domain specialization and model size during continual pre-training under compute-constrained scenarios. Our goal is to identify a compute-efficient training regime for this scenario and, potentially, detect patterns in this interplay that can be generalized across different model sizes and domains. To compare general and specialized training, we filtered a web-based dataset to extract legal domain data. We pre-trained models with 1.5B, 3B, 7B and 14B parameters on both the unfiltered and filtered datasets, then evaluated their performance on legal exams. Results show that as model size increases, the compute-effectiveness gap between specialized and general models widens.

[NLP-83] AGGA: A Dataset of Academic Guidelines for Generative AI and Large Language Models

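[Quick read]: This paper aims to standardize and systematize guidelines for the use of Generative AI (GAI) and Large Language Models (LLMs) in academic settings. The authors build the AGGA dataset: 80 academic guidelines, totalling 188,674 words, collected from official university websites around the world. AGGA is a resource for natural language processing tasks such as model synthesis, abstraction identification, and document structure assessment, and with further annotation it can serve as a benchmark for ambiguity detection, requirements categorization, and identification of equivalent requirements. The key to the approach is a methodologically rigorous, globally representative sample, so that the dataset reflects diverse academic fields (such as humanities and technology) and institution types (public and private) and offers broad insight into how GAIs and LLMs are being integrated into academia.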

Link: https://arxiv.org/abs/2501.02063
Authors: Junfeng Jiao,Saleh Afroogh,Kevin Chen,David Atkinson,Amit Dhurandhar
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: arXiv admin note: text overlap with arXiv:2406.18842, arXiv:2501.00959

Abstract:This study introduces AGGA, a dataset comprising 80 academic guidelines for the use of Generative AIs (GAIs) and Large Language Models (LLMs) in academic settings, meticulously collected from official university websites. The dataset contains 188,674 words and serves as a valuable resource for natural language processing tasks commonly applied in requirements engineering, such as model synthesis, abstraction identification, and document structure assessment. Additionally, AGGA can be further annotated to function as a benchmark for various tasks, including ambiguity detection, requirements categorization, and the identification of equivalent requirements. Our methodologically rigorous approach ensured a thorough examination, with a selection of universities that represent a diverse range of global institutions, including top-ranked universities across six continents. The dataset captures perspectives from a variety of academic fields, including humanities, technology, and both public and private institutions, offering a broad spectrum of insights into the integration of GAIs and LLMs in academia.

[NLP-84] Advancing Pancreatic Cancer Prediction with a Next Visit Token Prediction Head on top of Med-BERT

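[Quick read]: This paper studies how to use pretrained foundation models to improve pancreatic cancer (PaCa) prediction in both few-shot and fully supervised settings. Specifically, it asks how best to exploit Med-BERT, a foundation model pretrained on Electronic Health Records (EHRs), when very little fine-tuning data is available.

The key to the solution is to reformulate the conventional Binary Classification (BC) task so that it matches the format of Med-BERT's pretraining task. Two reformulations are proposed: casting disease prediction as a token prediction task (Med-BERT-Sum), and casting it as a next-visit mask token prediction task (Med-BERT-Mask). In few-shot scenarios, Med-BERT-Mask outperforms the conventional binary classification task by 3% to 7%. This indicates that aligning the downstream task with the pretraining objective substantially improves predictive ability, for rare and common diseases alike.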

Link: https://arxiv.org/abs/2501.02044
Authors: Jianping He,Laila Rasmy,Degui Zhi,Cui Tao
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Background: Recently, numerous foundation models pretrained on extensive data have demonstrated efficacy in disease prediction using Electronic Health Records (EHRs). However, there remains some unanswered questions on how to best utilize such models especially with very small fine-tuning cohorts. Methods: We utilized Med-BERT, an EHR-specific foundation model, and reformulated the disease binary prediction task into a token prediction task and a next visit mask token prediction task to align with Med-BERT’s pretraining task format in order to improve the accuracy of pancreatic cancer (PaCa) prediction in both few-shot and fully supervised settings. Results: The reformulation of the task into a token prediction task, referred to as Med-BERT-Sum, demonstrates slightly superior performance in both few-shot scenarios and larger data samples. Furthermore, reformulating the prediction task as a Next Visit Mask Token Prediction task (Med-BERT-Mask) significantly outperforms the conventional Binary Classification (BC) prediction task (Med-BERT-BC) by 3% to 7% in few-shot scenarios with data sizes ranging from 10 to 500 samples. These findings highlight that aligning the downstream task with Med-BERT’s pretraining objectives substantially enhances the model’s predictive capabilities, thereby improving its effectiveness in predicting both rare and common diseases. Conclusion: Reformatting disease prediction tasks to align with the pretraining of foundation models enhances prediction accuracy, leading to earlier detection and timely intervention. This approach improves treatment effectiveness, survival rates, and overall patient outcomes for PaCa and potentially other cancers.

[NLP-85] An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage

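[Quick read]: This paper addresses cultural value misalignment in texts that large language models (LLMs) generate for cultural heritage tasks. When LLMs describe historical monuments, translate ancient texts, preserve oral traditions, or create educational content, they may misrepresent historical facts, erode cultural identity, or oversimplify complex cultural narratives, with potentially severe consequences. The paper fills a gap in systematic, comprehensive study by assessing how reliably LLMs generate culturally aligned text. The key to the solution is an extensive set of 1066 query tasks covering 5 broad categories and 17 aspects of the cultural heritage knowledge framework, evaluated on 5 open-source LLMs. Using both automated and manual methods, the authors detect and analyze cultural value misalignments and find that over 65% of the generated texts exhibit notable misalignment. The paper also introduces a benchmark dataset and an evaluation workflow as resources for future work on the cultural sensitivity and reliability of LLMs.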

Link: https://arxiv.org/abs/2501.02039
Authors: Fan Bu,Zheng Wang,Siyi Wang,Ziyao Liu
Affiliations: Faculty of Archeology and Museology, Shanghai University, China; National Collaborative Research Center for Revolutionary Cultural Heritage at Memorial of the First CPC National Congress & Shanghai University, China; School of Management, Shanghai University, China; Nanyang Technological University, Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As Large Language Models (LLMs) become increasingly prevalent in tasks related to cultural heritage, such as generating descriptions of historical monuments, translating ancient texts, preserving oral traditions, and creating educational content, their ability to produce accurate and culturally aligned texts is being increasingly relied upon by users and researchers. However, cultural value misalignments may exist in generated texts, such as the misrepresentation of historical facts, the erosion of cultural identity, and the oversimplification of complex cultural narratives, which may lead to severe consequences. Therefore, investigating value misalignment in the context of LLM for cultural heritage is crucial for mitigating these risks, yet there has been a significant lack of systematic and comprehensive study and investigation in this area. To fill this gap, we systematically assess the reliability of LLMs in generating culturally aligned texts for cultural heritage-related tasks. We conduct a comprehensive evaluation by compiling an extensive set of 1066 query tasks covering 5 widely recognized categories with 17 aspects within the knowledge framework of cultural heritage across 5 open-source LLMs, and examine both the type and rate of cultural value misalignments in the generated texts. Using both automated and manual approaches, we effectively detect and analyze the cultural value misalignments in LLM-generated texts. Our findings are concerning: over 65% of the generated texts exhibit notable cultural misalignments, with certain tasks demonstrating almost complete misalignment with key cultural values. Beyond these findings, this paper introduces a benchmark dataset and a comprehensive evaluation workflow that can serve as a valuable resource for future research aimed at enhancing the cultural sensitivity and reliability of LLMs.

[NLP-86] CarbonChat: Large Language Model-Based Corporate Carbon Emission Analysis and Climate Knowledge QA System

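[Quick read]: Against the backdrop of global climate change, this paper tackles several problems in corporate carbon emission analysis: large language models lag behind in climate change knowledge, traditional augmented generation architectures lack specialization and accuracy on complex questions, and analyzing sustainability reports is costly and slow. It proposes CarbonChat, an LLM-based system for corporate carbon emission analysis and climate knowledge question answering. The key elements are: 1) a diversified index module construction method that handles segmentation of rule-based and long-text documents and extraction of structured data, optimizing the parsing of key information; 2) an enhanced self-prompt retrieval-augmented generation architecture that integrates intent recognition, structured reasoning chains, hybrid retrieval, and Text2SQL to improve semantic understanding and query efficiency; 3) a 14-dimension carbon emission analysis scheme based on the greenhouse gas accounting framework, supporting report summarization, relevance evaluation, and customized analysis; 4) a multi-layer chunking mechanism, timestamps, and hallucination detection, which make the analysis results accurate and verifiable, lower hallucination rates, and improve response precision.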

Link: https://arxiv.org/abs/2501.02031
Authors: Zhixuan Cao,Ming Han,Jingtao Wang,Meng Jia
Affiliations: College of Future Information Technology, Shijiazhuang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages

Abstract:As the impact of global climate change intensifies, corporate carbon emissions have become a focal point of global attention. In response to issues such as the lag in climate change knowledge updates within large language models, the lack of specialization and accuracy in traditional augmented generation architectures for complex problems, and the high cost and time consumption of sustainability report analysis, this paper proposes CarbonChat: Large Language Model-based corporate carbon emission analysis and climate knowledge QA system, aimed at achieving precise carbon emission analysis and policy […]. A diversified index module construction method is proposed to handle the segmentation of rule-based and long-text documents, as well as the extraction of structured data, thereby optimizing the parsing of key […]. An enhanced self-prompt retrieval-augmented generation architecture is designed, integrating intent recognition, structured reasoning chains, hybrid retrieval, and Text2SQL, improving the efficiency of semantic understanding and query […]. Based on the greenhouse gas accounting framework, 14 dimensions are established for carbon emission analysis, enabling report summarization, relevance evaluation, and customized […]. Through a multi-layer chunking mechanism, timestamps, and hallucination detection features, the accuracy and verifiability of the analysis results are ensured, reducing hallucination rates and enhancing the precision of the responses.

[NLP-87] Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models

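[Quick read]: This paper targets a central challenge: improving the ability of Large Language Models (LLMs) on complex reasoning tasks. The authors propose RDoLT (Recursive Decomposition of Logical Thought prompting), a framework built on three innovations: (1) recursively decomposing a complex reasoning task into sub-tasks of progressive complexity; (2) an advanced selection and scoring mechanism that identifies the most promising reasoning thoughts; and (3) a knowledge propagation module that mimics human learning by tracking strong and weak thoughts for information propagation. RDoLT outperforms existing techniques on several benchmarks (GSM8K, SVAMP, and others); on GSM8K with ChatGPT-4 it reaches 90.98% accuracy, 6.28% above the prior state of the art. These findings suggest that RDoLT has considerable potential for prompt engineering, offering a more effective and generalizable approach to complex reasoning tasks.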

Link: https://arxiv.org/abs/2501.02026
Authors: Kaleem Ullah Qasim,Jiashu Zhang,Tariq Alsahfi,Ateeq Ur Rehman Butt
Affiliations: Southwest Jiaotong University; University of Jeddah; National Textile University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Enhancing the reasoning capabilities of Large Language Models remains a critical challenge in artificial intelligence. We introduce RDoLT, Recursive Decomposition of Logical Thought prompting, a novel framework that significantly boosts LLM reasoning performance. RDoLT is built on three key innovations: (1) recursively breaking down complex reasoning tasks into sub-tasks of progressive complexity; (2) employing an advanced selection and scoring mechanism to identify the most promising reasoning thoughts; and (3) integrating a knowledge propagation module that mimics human learning by keeping track of strong and weak thoughts for information propagation. Our approach was evaluated across multiple benchmarks, including GSM8K, SVAMP, MultiArith, LastLetterConcatenation, and Gaokao2023 Math. The results demonstrate that RDoLT consistently outperforms existing state-of-the-art techniques, achieving a 90.98 percent accuracy on GSM8K with ChatGPT-4, surpassing state-of-the-art techniques by 6.28 percent. Similar improvements were observed on other benchmarks, with accuracy gains ranging from 5.5 percent to 6.75 percent. These findings highlight RDoLT’s potential to advance prompt engineering, offering a more effective and generalizable approach to complex reasoning tasks.

[NLP-88] Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection

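[Quick read]: This paper addresses hallucination in text generated by large language models (LLMs), that is, non-factual or unfaithful content that limits the use of LLMs in real-world scenarios. Existing work focuses on uncertainty-based hallucination detection, which computes uncertainty from the LLM's output probabilities and needs neither external knowledge nor frequent sampling. However, most methods consider only the uncertainty of each token independently and ignore the semantic relations among tokens and sentences, so they struggle to detect hallucinations that span multiple tokens and sentences.

The key to the proposed solution is to enhance uncertainty modeling with a semantic graph. First, a semantic graph is built that captures the relations among entity tokens and sentences. Uncertainty is then propagated along the relations between entities to strengthen sentence-level hallucination detection. Because hallucination often arises from conflicts between sentences, the paper further proposes a graph-based calibration method that folds the contradiction probability between a sentence and its neighbors in the semantic graph into the uncertainty calculation. Experiments show a substantial gain in passage-level hallucination detection, an improvement of 19.78%.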

Link: https://arxiv.org/abs/2501.02020
Authors: Kedi Chen,Qin Chen,Jie Zhou,Xinqi Tao,Bowen Ding,Jingwen Xie,Mingchen Xie,Peilong Li,Feng Zheng,Liang He
Affiliations: 1. Xiaohongshu Inc.; 2. unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are prone to hallucination with non-factual or unfaithful statements, which undermines the applications in real-world scenarios. Recent researches focus on uncertainty-based hallucination detection, which utilizes the output probability of LLMs for uncertainty calculation and does not rely on external knowledge or frequent sampling from LLMs. Whereas, most approaches merely consider the uncertainty of each independent token, while the intricate semantic relations among tokens and sentences are not well studied, which limits the detection of hallucination that spans over multiple tokens and sentences in the passage. In this paper, we propose a method to enhance uncertainty modeling with semantic graph for hallucination detection. Specifically, we first construct a semantic graph that well captures the relations among entity tokens and sentences. Then, we incorporate the relations between two entities for uncertainty propagation to enhance sentence-level hallucination detection. Given that hallucination occurs due to the conflict between sentences, we further present a graph-based uncertainty calibration method that integrates the contradiction probability of the sentence with its neighbors in the semantic graph for uncertainty calculation. Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain substantial improvements with 19.78% in passage-level hallucination detection.
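
The idea of calibrating sentence-level uncertainty with contradiction probabilities from graph neighbors can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mean negative log-probability as the base uncertainty, the additive calibration rule, and the names `sentence_uncertainty` and `calibrate` are hypothetical, and the toy probabilities are invented; the paper's actual formulas are not given in the abstract.

```python
import math

def sentence_uncertainty(token_probs):
    """Mean negative log-probability of a sentence's tokens (assumed base score)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def calibrate(uncertainties, edges, contradiction, weight=1.0):
    """Add the mean contradiction probability with graph neighbours to each score."""
    calibrated = []
    for i, u in enumerate(uncertainties):
        neighbours = edges.get(i, [])
        if neighbours:
            conflict = sum(contradiction[(i, j)] for j in neighbours) / len(neighbours)
        else:
            conflict = 0.0
        calibrated.append(u + weight * conflict)
    return calibrated

# Toy passage of three sentences; all numbers are made up for illustration.
token_probs = [[0.9, 0.8, 0.95], [0.4, 0.3, 0.6], [0.85, 0.9, 0.7]]
edges = {0: [1], 1: [0, 2], 2: [1]}  # sentences linked in the semantic graph
contradiction = {(0, 1): 0.7, (1, 0): 0.7, (1, 2): 0.1, (2, 1): 0.1}

raw = [sentence_uncertainty(p) for p in token_probs]
scores = calibrate(raw, edges, contradiction)
```

In this toy passage the first sentence has low token-level uncertainty but conflicts strongly with its neighbor, so its calibrated score rises; that is the kind of cross-sentence signal a purely token-level score would miss.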

[NLP-89] Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

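[Quick read]: This paper addresses the safety of large language models (LLMs) under jailbreak attacks, adversarial attacks that aim to elicit high-risk behavior from a model and that cybercriminals and blackhat actors have exploited to cause serious harm. It proposes a new safeguard, SafeNudge, which combines Controlled Text Generation with “nudging”: text interventions triggered during generation that steer the model toward a safe response. SafeNudge reduces successful jailbreak attempts by 30% while an attack is being executed, adds minimal inference latency, and has a negligible impact on the semantic fluency of outputs. It also allows tunable Safety-Performance Trade-offs (SPTs) and is compatible with the Hugging Face “transformers” library.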

Link: https://arxiv.org/abs/2501.02018
Authors: Joao Fonseca,Andrew Bell,Julia Stoyanovich
Affiliations: New York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to elicit high risk behavior from a model. Jailbreaks have been exploited by cybercriminals and blackhat actors to cause significant harm, highlighting the critical need to safeguard widely-deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs “self-reflect”, may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict “normal” model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with “nudging”, or using text interventions to change the behavior of a model. SafeNudge triggers during text-generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by 30% by guiding the LLM towards safe responses. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Further, we allow for tunable SPTs. SafeNudge is open-source and available through this https URL, and is compatible with models loaded with the Hugging Face “transformers” library.

[NLP-90] Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts

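[Quick read]: This paper studies how to understand and control concept representations, and the behavior they drive, across different large language models (LLMs). It explores the relationships between the concept representations of different LLMs and proposes a linear transformation method to bridge them. The key findings are: 1) simple linear transformations can align concept representations across LLMs, enabling efficient cross-model behavioral control; 2) the linear transformation generalizes across concepts, so it can align and control steering vectors (SVs) that represent different concepts in different LLMs; 3) SVs extracted from smaller LLMs can control the behavior of larger LLMs, a weak-to-strong transferability. These findings provide a new theoretical and methodological basis for aligning concept representations and controlling behavior across models.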

Link: https://arxiv.org/abs/2501.02009
Authors: Youcheng Huang,Chen Huang,Duanyu Feng,Wenqiang Lei,Jiancheng Lv
Affiliations: Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM’s concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato’s Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
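
To make the notion of bridging two representation spaces with a linear transformation concrete, here is a small pure-Python sketch that fits a least-squares map between paired vectors and then transfers a steering vector. It rests on assumptions: the paired vectors are synthetic, the helper names (`fit_linear_map`, `solve`, `matmul`) are hypothetical, and ordinary least squares via the normal equations is a stand-in, since the abstract does not say how the authors estimate the transformation.

```python
import random

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_linear_map(src, dst):
    """Least-squares W minimising ||src @ W - dst||^2 via the normal equations."""
    st = transpose(src)
    return solve(matmul(st, src), matmul(st, dst))

random.seed(0)
d_small, d_large, n_pairs = 3, 4, 20  # toy hidden sizes for a "small" and "large" model
true_w = [[random.gauss(0, 1) for _ in range(d_large)] for _ in range(d_small)]
src = [[random.gauss(0, 1) for _ in range(d_small)] for _ in range(n_pairs)]
dst = matmul(src, true_w)  # synthetic paired representations

w = fit_linear_map(src, dst)
sv_small = [[1.0, -0.5, 0.25]]         # hypothetical steering vector in the small model
sv_large = matmul(sv_small, w)[0]      # the same direction mapped into the large model
```

With real models the paired vectors would be hidden states of the two LLMs on the same inputs, and the fit would be approximate rather than exact as it is on this noiseless toy data.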

[NLP-91] Is Your Image a Good Storyteller? AAAI2025

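[Quick read]: This paper addresses the assessment of image semantic complexity. Quantifying image complexity at the entity level is straightforward, but semantic complexity has long been overlooked. Semantic complexity varies across images: semantically rich images tell vivid stories and have a wide range of applications. Such images are scarce, however, especially across cultures and eras, and more images like “Cookie Theft” are needed to support human cognitive assessment and the development of vision models. The paper therefore proposes the Image Semantic Assessment (ISA) task and tackles this vision problem with a language-driven approach. The key contributions are the first ISA dataset and a new method that uses language information to assess the semantic complexity of images. Experiments on the dataset show that the method is effective.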

Link: https://arxiv.org/abs/2501.01982
Authors: Xiujie Song,Xiaoyi Pang,Haifeng Tang,Mengyue Wu,Kenny Q. Zhu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by AAAI 2025

Abstract:Quantifying image complexity at the entity level is straightforward, but the assessment of semantic complexity has been largely overlooked. In fact, there are differences in semantic complexity across images. Images with richer semantics can tell vivid and engaging stories and offer a wide range of application scenarios. For example, the Cookie Theft picture is such a kind of image and is widely used to assess human language and cognitive abilities due to its higher semantic complexity. Additionally, semantically rich images can benefit the development of vision models, as images with limited semantics are becoming less challenging for them. However, such images are scarce, highlighting the need for a greater number of them. For instance, there is a need for more images like Cookie Theft to cater to people from different cultural backgrounds and eras. Assessing semantic complexity requires human experts and empirical evidence. Automatic evaluation of how semantically rich an image will be the first step of mining or generating more images with rich semantics, and benefit human cognitive assessment, Artificial Intelligence, and various other applications. In response, we propose the Image Semantic Assessment (ISA) task to address this problem. We introduce the first ISA dataset and a novel method that leverages language to solve this vision problem. Experiments on our dataset demonstrate the effectiveness of our approach. Our data and code are available at: this https URL.

[NLP-92] A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models

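[Quick read]: This paper asks how to detect whether a large language model (LLM) was trained, without authorization, on data generated by another LLM, a form of data misappropriation. The key idea is to embed watermarks into the copyrighted training data and to cast data misappropriation detection as a hypothesis testing problem. The authors develop a general statistical testing framework, construct a pivotal statistic, determine the optimal rejection threshold, and explicitly control the type I and type II errors. They also establish asymptotic optimality of the proposed tests and validate them empirically through extensive numerical experiments.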

Link: https://arxiv.org/abs/2501.02441
Authors: Yinpeng Cai,Lexin Li,Linjun Zhang
Affiliations: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 29 pages, 5 figures

Abstract:Large Language Models (LLMs) are rapidly gaining enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the inclusion of copyrighted materials in their training data without proper attribution or licensing, which falls under the broader issue of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection, namely, to determine whether a given LLM has incorporated data generated by another LLM. To address this issue, we propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct a pivotal statistic, determine the optimal rejection threshold, and explicitly control the type I and type II errors. Furthermore, we establish the asymptotic optimality properties of the proposed tests, and demonstrate its empirical effectiveness through intensive numerical experiments.
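
Framing watermark detection as a hypothesis test can be illustrated with a generic one-sided z-test. The sketch below is not the pivotal statistic constructed in the paper; it assumes a simple scheme in which each token is watermark-consistent ("green") with probability `gamma` under the null hypothesis of no misappropriation, and the function names and example counts are hypothetical.

```python
import math

def z_statistic(green_hits, n_tokens, gamma):
    """Standardised count of watermark-consistent tokens under H0: rate == gamma."""
    return (green_hits - gamma * n_tokens) / math.sqrt(n_tokens * gamma * (1 - gamma))

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def detect(green_hits, n_tokens, gamma=0.5, alpha=0.01):
    """Reject H0 (no watermark signal) when the one-sided p-value is below alpha."""
    z = z_statistic(green_hits, n_tokens, gamma)
    p_value = normal_upper_tail(z)
    return p_value < alpha, z, p_value

# Hypothetical counts: 140 of 200 tokens match the watermark vs. 104 of 200.
flagged, z, p = detect(green_hits=140, n_tokens=200)
clean, z0, p0 = detect(green_hits=104, n_tokens=200)
```

Here `alpha` bounds the type I error directly; the type II error depends on how strongly the watermark shifts the hit rate and on the number of tokens, which is the trade-off that a choice of rejection threshold has to balance.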

[NLP-93] Who Wrote This? Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities

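[Quick read]: This paper asks how to tell apart text generated by different large language models (LLMs), in particular text from an in-house (sanctioned) LLM versus a non-sanctioned LLM, and LLM-generated text versus human-written text. The key is to model LLM-generated text as a sequential stochastic process that depends on its full history and to design zero-shot statistical tests on top of that model. The authors derive concentration inequalities on the difference between a string's log-perplexity and its average entropy under model A. They prove that if the text was generated by A, its log-perplexity under A converges to its average entropy under A, whereas if it was generated by B, its log-perplexity under A converges to the average cross-entropy of B and A. Preliminary experiments support these results, which make it possible to identify, with high probability, the origin of harmful LLM-generated text of arbitrary length and so help fight misinformation.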

Link: https://arxiv.org/abs/2501.02406
Authors: Tara Radvand,Mojtaba Abdolmaleki,Mohamed Mostagir,Ambuj Tewari
Affiliations: Ross School of Business, University of Michigan, Ann Arbor; Department of Statistics, University of Michigan, Ann Arbor
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly difficult as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by LLM A or B (where B can be a human)? We model LLM-generated text as a sequential stochastic process with complete dependence on history and design zero-shot statistical tests to distinguish between (i) the text generated by two different sets of LLMs A (in-house) and B (non-sanctioned) and also (ii) LLM-generated and human-generated texts. We prove that the type I and type II errors for our tests decrease exponentially in the text length. In designing our tests, we derive concentration inequalities on the difference between log-perplexity and the average entropy of the string under A . Specifically, for a given string, we demonstrate that if the string is generated by A , the log-perplexity of the string under A converges to the average entropy of the string under A , except with an exponentially small probability in string length. We also show that if B generates the text, except with an exponentially small probability in string length, the log-perplexity of the string under A converges to the average cross-entropy of B and A . Lastly, we present preliminary experimental results to support our theoretical results. By enabling guaranteed (with high probability) finding of the origin of harmful LLM-generated text with arbitrary size, we can help fight misinformation.
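
The convergence statement above suggests a simple test statistic: the gap between a string's log-perplexity and its average entropy, both measured under model A. The sketch below tries this on two toy models whose next-token distribution depends only on the previous token. The models `MODEL_A` and `MODEL_B`, the vocabulary, and the fixed start token are invented for illustration; a real LLM conditions on the whole history, and the paper's thresholds come from its concentration inequalities rather than from eyeballing the gap.

```python
import math
import random

VOCAB = ["a", "b", "c"]
# Hypothetical toy models: next-token probabilities given the previous token.
MODEL_A = {"a": [0.7, 0.2, 0.1], "b": [0.1, 0.7, 0.2], "c": [0.2, 0.1, 0.7]}
MODEL_B = {"a": [0.1, 0.45, 0.45], "b": [0.45, 0.1, 0.45], "c": [0.45, 0.45, 0.1]}

def sample(model, length, rng):
    """Draw a token sequence from `model`, starting from the context "a"."""
    prev, out = "a", []
    for _ in range(length):
        prev = rng.choices(VOCAB, weights=model[prev])[0]
        out.append(prev)
    return out

def gap_under(model, text):
    """|log-perplexity - average entropy| of `text`, both computed under `model`."""
    prev, nll, ent = "a", 0.0, 0.0
    for tok in text:
        probs = model[prev]
        nll -= math.log(probs[VOCAB.index(tok)])       # surprise of the observed token
        ent -= sum(p * math.log(p) for p in probs)     # entropy of the predictive distribution
        prev = tok
    return abs(nll - ent) / len(text)

rng = random.Random(0)
gap_a = gap_under(MODEL_A, sample(MODEL_A, 5000, rng))  # text really written by A
gap_b = gap_under(MODEL_A, sample(MODEL_B, 5000, rng))  # text written by B, scored by A
```

For text sampled from A the gap shrinks toward zero as the string grows, while for text sampled from B it settles near the difference between the cross-entropy of B and A and the entropy of A, so a threshold between the two separates the sources.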

[NLP-94] METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

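[Quick read]: This paper aims to capture and model the distribution of genomic information in complex metagenomic data, in support of pandemic monitoring and pathogen detection. The key is to pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, on a diverse corpus of metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. The corpus comes from a large collection of human wastewater samples, processed and sequenced with deep metagenomic (next-generation) sequencing. Unlike genomic models that focus on individual genomes or specific species, METAGENE-1 is meant to capture the full distribution of genomic information present in wastewater. The paper details the pretraining dataset, the byte-pair encoding (BPE) tokenization strategy tailored to metagenomic sequences, and the model architecture, and reports strong results on genomic benchmarks and human-pathogen detection tasks, indicating potential for public health applications.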

Link: https://arxiv.org/abs/2501.02045
Authors: Ollie Liu,Sami Jaghouar,Johannes Hagemann,Shangshang Wang,Jason Wiemels,Jeff Kaufman,Willie Neiswanger
Affiliations: University of Southern California; Prime Intellect; Nucleic Acid Observatory
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
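
Byte-pair encoding on nucleotide sequences works the same way as on text: start from single bases and repeatedly merge the most frequent adjacent pair. The following is a minimal sketch of that merge loop, assuming a tiny invented set of reads; the function names are hypothetical, and the actual METAGENE-1 tokenizer (vocabulary size, handling of ambiguous bases, training corpus) is not described at this level in the abstract.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the adjacent token pair that occurs most often across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_bpe(reads, num_merges):
    """Learn `num_merges` merge rules, starting from single-nucleotide tokens."""
    sequences = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences

reads = ["ACGTACGT", "ACGACG", "TTACGA"]  # toy reads, not real sequencing data
merges, tokenized = learn_bpe(reads, num_merges=2)
```

After two merges the recurring motif `ACG` becomes a single token in these reads, which shows how BPE shortens sequences by giving frequent k-mers their own vocabulary entries.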

Computer Vision

[CV-0] Gaussian Masked Autoencoders

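[Quick read]: This paper addresses the lack of explicit spatial awareness in conventional Masked Autoencoders (MAE) within self-supervised learning. MAE learns good semantic abstractions but is not trained for spatial understanding. The paper proposes the Gaussian Masked Autoencoder (GMAE), which learns semantic abstractions and spatial understanding jointly. Its key innovation is an intermediate representation based on 3D Gaussians, rendered into images via splatting. Like MAE, GMAE reconstructs the image end-to-end in pixel space, and it additionally enables zero-shot spatial understanding capabilities such as figure-ground segmentation, image layering, and edge detection, while preserving the high-level semantic quality of MAE's self-supervised representations. According to the authors, GMAE is the first to use Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstruction, opening a new direction for modeling high-fidelity visual data.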

Link: https://arxiv.org/abs/2501.03229
Authors: Jathushan Rajasegaran,Xinlei Chen,Rulilong Li,Christoph Feichtenhofer,Jitendra Malik,Shiry Ginosar
Affiliations: Meta; FAIR; UC Berkeley; Toyota Technological Institute at Chicago
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learns good semantic abstractions, it is not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at this https URL
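
Rendering by splatting means compositing Gaussians onto the image plane in depth order. The sketch below is a heavily simplified illustration, assuming isotropic 2D Gaussians that have already been projected to pixel coordinates, a single grayscale channel, and front-to-back alpha compositing; the dictionary fields and the `splat` function are hypothetical and this is not the GMAE renderer.

```python
import math

def splat(gaussians, width, height):
    """Front-to-back alpha compositing of isotropic 2D Gaussians sorted by depth."""
    ordered = sorted(gaussians, key=lambda g: g["depth"])  # nearest first
    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            transmittance, value = 1.0, 0.0
            for g in ordered:
                d2 = (x - g["x"]) ** 2 + (y - g["y"]) ** 2
                alpha = g["opacity"] * math.exp(-d2 / (2 * g["sigma"] ** 2))
                value += transmittance * alpha * g["color"]
                transmittance *= 1 - alpha  # light left over for Gaussians behind
            image[y][x] = value
    return image

# Two toy Gaussians at the same centre: a small bright one in front of a wide dim one.
gaussians = [
    {"x": 2, "y": 2, "sigma": 1.0, "opacity": 1.0, "color": 1.0, "depth": 1.0},
    {"x": 2, "y": 2, "sigma": 2.0, "opacity": 1.0, "color": 0.5, "depth": 2.0},
]
image = splat(gaussians, width=5, height=5)
```

Because the depth ordering decides which Gaussian dominates each pixel, a representation of this kind carries layering information of the sort that the zero-shot figure-ground and image-layering results rely on.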

[CV-1] Rate-My-LoRA: Efficient and Adaptive Federated Model Tuning for Cardiac MRI Segmentation

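[Quick read]: This paper addresses accurate cardiac image segmentation for the diagnosis of cardiovascular disease (CVD) and cardiac dyssynchrony. Privacy concerns make it hard to centralize large datasets from different hospitals, so the paper adopts Federated Learning (FL) to train models in a decentralized way without exchanging sensitive information. Conventional FL algorithms, however, are limited by bandwidth and by data heterogeneity. The key elements of the solution are: 1) low-rank adaptation (LoRA), which regularizes model weight updates and reduces communication overhead; 2) an adaptive aggregation technique that adjusts how the weights from different clients are aggregated by comparing the validation accuracy in each client, improving generalization and fast local adaptation. Evaluated on public cardiac magnetic resonance (MR) datasets, the method outperforms other LoRA-based federated learning approaches.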

Link: https://arxiv.org/abs/2501.03223
Authors: Xiaoxiao He,Haizhou Shi,Ligong Han,Chaowei Tan,Bo Liu,Zihao Xu,Meng Ye,Leon Axel,Kang Li,Dimitris Metaxas
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted in ISBI 2025

Abstract:Cardiovascular disease (CVD) and cardiac dyssynchrony are major public health problems in the United States. Precise cardiac image segmentation is crucial for extracting quantitative measures that help categorize cardiac dyssynchrony. However, achieving high accuracy often depends on centralizing large datasets from different hospitals, which can be challenging due to privacy concerns. To solve this problem, Federated Learning (FL) is proposed to enable decentralized model training on such data without exchanging sensitive information. However, bandwidth limitations and data heterogeneity remain significant challenges in conventional FL algorithms. In this paper, we propose a novel efficient and adaptive federated learning method for cardiac segmentation that improves model performance while reducing the bandwidth requirement. Our method leverages low-rank adaptation (LoRA) to regularize model weight updates and reduce communication overhead. We also propose a Rate-My-LoRA aggregation technique to address data heterogeneity among clients. This technique adaptively penalizes the aggregated weights from different clients by comparing the validation accuracy in each client, allowing better generalization performance and fast local adaptation. In-client and cross-client evaluations on public cardiac MR datasets demonstrate the superiority of our method over other LoRA-based federated learning approaches.
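
The two ideas in this abstract can be illustrated with a small sketch: clients send low-rank factors instead of a full weight update, and the server weights each client by its validation accuracy. This is not the paper's algorithm. The softmax weighting rule, the `temperature` parameter, and all function names are assumptions made for illustration; the abstract only says that aggregation is adjusted by comparing validation accuracies.

```python
import math

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(B, A):
    """Low-rank weight update delta_W = B @ A, with B of shape (d x r) and A of shape (r x k)."""
    return matmul(B, A)

def accuracy_weights(val_accs, temperature=0.1):
    """Hypothetical weighting rule: softmax over client validation accuracies."""
    exps = [math.exp(a / temperature) for a in val_accs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(deltas, weights):
    """Weighted average of the per-client weight updates."""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [
        [sum(w * d[i][j] for w, d in zip(weights, deltas)) for j in range(cols)]
        for i in range(rows)
    ]

# Two clients, each sending rank-1 factors for a 2 x 2 layer.
client_a = lora_delta([[1.0], [2.0]], [[3.0, 4.0]])
client_b = lora_delta([[0.5], [0.5]], [[1.0, 1.0]])
weights = accuracy_weights([0.9, 0.7])  # client A validated better
global_delta = aggregate([client_a, client_b], weights)
```

With rank r, a client transmits d*r + r*k numbers instead of d*k, which is where the communication saving comes from when r is small.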

[CV-2] RW-Net: Enhancing Few-Shot Point Cloud Classification with a Wavelet Transform Projection-based Network

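[Quick read]: This paper addresses the scarcity of labeled data in 3D object classification, particularly in few-shot learning, where the goal is robust generalization from a handful of annotated samples. Traditional data-intensive learning methods are limited in this setting, so the most salient and discriminative features of 3D objects must be identified and exploited to improve learning efficiency and reduce dependence on large labeled datasets.

The key to the solution is RW-Net, a framework that integrates Rate-Distortion Explanation (RDE) and the wavelet transform into a projection-based 3D object classification architecture. RDE extracts critical features by identifying and preserving the most informative data components while reducing redundancy, so that the information needed for effective decisions is retained. The wavelet transform emphasizes the low-frequency components of the input, capturing the basic geometric and structural attributes of 3D objects; this further improves generalization in low-data regimes, mitigates overfitting, and makes the learned representations more robust. Experiments show that RW-Net performs strongly on few-shot 3D object classification, with superior generalization and robustness.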

Link: https://arxiv.org/abs/2501.03221
Authors: Haosheng Zhang,Hao Huang
Affiliations: NYU Tandon School of Engineering, New York University, USA; NYUAD Center for Artificial Intelligence and Robotics, New York University Abu Dhabi, UAE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, 9 tables

Abstract:In the domain of 3D object classification, a fundamental challenge lies in addressing the scarcity of labeled data, which limits the applicability of traditional data-intensive learning paradigms. This challenge is particularly pronounced in few-shot learning scenarios, where the objective is to achieve robust generalization from minimal annotated samples. To overcome these limitations, it is crucial to identify and leverage the most salient and discriminative features of 3D objects, thereby enhancing learning efficiency and reducing dependency on large-scale labeled datasets. This work introduces RW-Net, a novel framework designed to address the challenges above by integrating Rate-Distortion Explanation (RDE) and wavelet transform into a state-of-the-art projection-based 3D object classification architecture. The proposed method capitalizes on RDE to extract critical features by identifying and preserving the most informative data components while reducing redundancy. This process ensures the retention of essential information for effective decision-making, optimizing the model’s ability to learn from limited data. Complementing RDE, incorporating the wavelet transform further enhances the framework’s capability to generalize in low-data regimes. By emphasizing low-frequency components of the input data, the wavelet transform captures fundamental geometric and structural attributes of 3D objects. These attributes are instrumental in mitigating overfitting and improving the robustness of the learned representations across diverse tasks and domains. To validate the effectiveness of our RW-Net, we conduct extensive experiments on three datasets: ModelNet40, ModelNet40-C, and ScanObjectNN for few-shot 3D object classification. The results demonstrate that our approach achieves state-of-the-art performance and exhibits superior generalization and robustness in few-shot learning scenarios.
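
To make "emphasizing low-frequency components" concrete, here is a minimal one-level Haar wavelet sketch on a 1D signal. The choice of the Haar wavelet and the 1D setting are assumptions for illustration; the abstract does not say which wavelet RW-Net uses, and its inputs are projections of 3D objects rather than 1D signals.

```python
import math

def haar_step(signal):
    """One level of the 1D Haar transform on an even-length signal."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_step: rebuild the signal from both coefficient sets."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def low_pass(signal):
    """Keep only low-frequency content by zeroing the detail coefficients."""
    approx, detail = haar_step(signal)
    return haar_inverse(approx, [0.0] * len(detail))

smoothed = low_pass([1.0, 3.0, 5.0, 5.0])  # each pair is replaced by its mean
```

The approximation coefficients carry the coarse shape of the signal, while the detail coefficients carry the fine variation that is discarded here.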

[CV-3] ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

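[Quick read]: This paper targets robust and accurate long-term dense tracking of arbitrary points in videos. The key to the solution is ProTracker, a framework that uses probabilistic integration to refine multiple predictions from optical flow and semantic features, giving robust short-term and long-term tracking. Specifically, it integrates optical flow estimates probabilistically, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To re-localize challenging points that disappear and reappear because of occlusion, it also incorporates long-term feature correspondence into the flow predictions to generate continuous trajectories. Experiments show that ProTracker achieves state-of-the-art performance among unsupervised and self-supervised methods and even outperforms supervised methods on several benchmarks.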

Link: https://arxiv.org/abs/2501.03220
Authors: Tingyang Zhang,Chen Wang,Zhiyang Dou,Qingzhe Gao,Jiahui Lei,Baoquan Chen,Lingjie Liu
Affiliations: University of Pennsylvania; Peking University; The University of Hong Kong; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves the state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.
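
One standard way to combine several noisy position estimates is precision weighting, which is the maximum-likelihood estimate when the estimates are independent and have isotropic Gaussian noise. The sketch below shows that textbook rule only; whether ProTracker uses exactly this form is not stated in the abstract, and the function name and input layout are hypothetical.

```python
def fuse_predictions(predictions):
    """Fuse point predictions given as ((x, y), variance) pairs.

    Each prediction is weighted by its precision (1 / variance), so
    confident predictions pull the fused position harder.
    Returns the fused (x, y) and the variance of the fused estimate.
    """
    precision = sum(1.0 / var for _, var in predictions)
    x = sum(px / var for (px, _), var in predictions) / precision
    y = sum(py / var for (_, py), var in predictions) / precision
    return (x, y), 1.0 / precision

# A flow-based estimate (less certain) and a feature-match estimate (more certain).
position, variance = fuse_predictions([((0.0, 0.0), 1.0), ((3.0, 0.0), 0.5)])
```

The fused variance is always smaller than the smallest input variance, which is why combining several predictions can give smoother trajectories than any single one.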

[CV-4] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

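[Quick read]: This paper addresses conflicts among the capabilities that video large language models (LLMs) need for active real-time interaction. Offline video LLMs analyze the entire video before answering, whereas real-time interaction requires the model to understand user intent and respond while continuously processing the video stream. This calls for three capabilities: Perception, Decision, and Reaction. These capabilities conflict with one another: Decision and Reaction need opposite Perception scales and granularities, and autoregressive decoding blocks real-time Perception and Decision during Reaction. To resolve this, the paper proposes Dispider, a system that disentangles Perception, Decision, and Reaction. Its core is a lightweight proactive streaming video processing module that tracks the video stream and identifies the best moments to interact. Once an interaction is triggered, an asynchronous interaction module produces a detailed response while the processing module keeps monitoring the video. This disentangled, asynchronous design yields timely, contextually accurate, and computationally efficient responses, which suits real-time interaction over long video streams. Experiments show that Dispider maintains strong performance on conventional video QA tasks and significantly surpasses previous online models in streaming scenarios, validating the architecture.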

Link: https://arxiv.org/abs/2501.03218
Authors: Rui Qian,Shuangrui Ding,Xiaoyi Dong,Pan Zhang,Yuhang Zang,Yuhang Cao,Dahua Lin,Jiaqi Wang
Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicted capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at this https URL.
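
The asynchronous design described here, where monitoring continues while a response is being generated, can be sketched with Python's `asyncio`. This is a toy control-flow illustration only, not Dispider's implementation: `respond`, `monitor`, the trigger predicate, and the sleep that stands in for slow decoding are all hypothetical.

```python
import asyncio

async def respond(frame_id, log):
    """Stand-in for slow autoregressive decoding of a detailed response."""
    await asyncio.sleep(0.05)
    log.append(("response", frame_id))

async def monitor(frames, should_interact, log):
    """Process every frame; launch a response task without waiting for it."""
    tasks = []
    for frame_id in frames:
        log.append(("seen", frame_id))
        if should_interact(frame_id):
            # Start the response in the background and keep monitoring.
            tasks.append(asyncio.create_task(respond(frame_id, log)))
        await asyncio.sleep(0)  # yield so background tasks can make progress
    await asyncio.gather(*tasks)  # wait for any responses still in flight

log = []
asyncio.run(monitor(range(5), lambda f: f == 1, log))
```

In this run the response triggered at frame 1 is logged only after the remaining frames have been seen, which is the point of the design: decoding does not block perception.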

[CV-5] MObI: Multimodal Object Inpainting Using Diffusion Models

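[Quick read]: This paper addresses how to efficiently generate realistic and controllable multimodal data for safety-critical applications such as autonomous driving. Because collecting real-world data is costly and complex, methods based on synthetic data are gaining attention, but the generated data must be realistic and controllable enough to be useful for testing perception models. The paper proposes MObI (Multimodal Object Inpainting), a framework based on a diffusion model that produces realistic and controllable object inpaintings across perceptual modalities, demonstrated for camera and lidar simultaneously. Using a single reference RGB image, MObI inserts an object seamlessly at the 3D location specified by a bounding box while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, MObI conditions on the 3D bounding box, which gives objects accurate spatial positioning and realistic scaling. New objects can therefore be inserted flexibly into multimodal scenes, a significant advantage for testing perception models.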

Link: https://arxiv.org/abs/2501.03173
Authors: Alexandru Buburuzan,Anuj Sharma,John Redford,Puneet K. Dokania,Romain Mueller
Affiliations: FiveAI; The University of Manchester; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.

[CV-6] Segment Anything Model for Zero-shot Single Particle Tracking in Liquid Phase Transmission Electron Microscopy

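[Quick read]: This paper addresses the lack of a standardized framework for identifying and tracking nanoparticles in Liquid Phase Transmission Electron Microscopy (LPTEM) videos. Noise in LPTEM videos makes single particle tracking difficult for existing methods. The authors use Segment Anything Model 2 (SAM 2), a foundation model released by Meta for video and image segmentation, and show that it can segment LPTEM videos in a zero-shot manner without fine-tuning. Building on this capability, they propose SAM4EM, a framework that combines promptable video segmentation with particle tracking and statistical analysis to provide end-to-end LPTEM analysis. SAM4EM achieves nearly 50-fold higher accuracy than state-of-the-art methods in segmenting and analyzing LPTEM videos, paving the way for broader use of LPTEM in nanoscale imaging.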

Link: https://arxiv.org/abs/2501.03153
Authors: Risha Goel,Zain Shabeeb,Isabel Panicker,Vida Jamali
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Abstract:Liquid phase transmission electron microscopy (LPTEM) offers an unparalleled combination of spatial and temporal resolution, making it a promising tool for single particle tracking at the nanoscale. However, the absence of a standardized framework for identifying and tracking nanoparticles in noisy LPTEM videos has impeded progress in the field to develop this technique as a single particle tracking tool. To address this, we leveraged Segment Anything Model 2 (SAM 2), released by Meta, which is a foundation model developed for segmenting videos and images. Here, we demonstrate that SAM 2 can successfully segment LPTEM videos in a zero-shot manner and without requiring fine-tuning. Building on this capability, we introduce SAM4EM, a comprehensive framework that integrates promptable video segmentation with particle tracking and statistical analysis, providing an end-to-end LPTEM analysis framework for single particle tracking. SAM4EM achieves nearly 50-fold higher accuracy in segmenting and analyzing LPTEM videos compared to state-of-the-art methods, paving the way for broader applications of LPTEM in nanoscale imaging.

[CV-7] Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches

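[Quick read]: This paper examines the key problems that generative AI systems based on large-scale pretrained foundation models (PFMs) face on the path to artificial general intelligence (AGI). Although these systems (vision-language models, large language models (LLMs), diffusion models, and vision-language-action (VLA) models) perform well across many domains and complex tasks, their cognitive abilities remain superficial and brittle, which limits their generalist capabilities. The paper argues that reaching human-level general intelligence requires addressing several foundational problems: embodiment, symbol grounding, causality, and memory. These concepts are more closely aligned with human cognition and can give LLMs inherent human-like cognitive properties that support physically plausible, semantically meaningful, flexible, and more generalizable knowledge and intelligence. The paper surveys how the principles of embodiment, symbol grounding, causality, and memory can be used to move LLMs toward AGI in an organic manner.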

Link: https://arxiv.org/abs/2501.03151
Authors: Alhassan Mumuni,Fuseini Mumuni
Affiliations: Department of Electrical and Electronics Engineering, Cape Coast Technical University, Cape Coast, Ghana; University of Mines and Technology, UMaT, Tarkwa, Ghana
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Generative artificial intelligence (AI) systems based on large-scale pretrained foundation models (PFMs) such as vision-language models, large language models (LLMs), diffusion models and vision-language-action (VLA) models have demonstrated the ability to solve complex and truly non-trivial AI problems in a wide variety of domains and contexts. Multimodal large language models (MLLMs), in particular, learn from vast and diverse data sources, allowing rich and nuanced representations of the world and, thereby, providing extensive capabilities, including the ability to reason; engage in meaningful dialog; collaborate with humans and other agents to jointly solve complex problems; and understand social and emotional aspects of humans. Despite this impressive feat, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. Consequently, generic LLMs are severely limited in their generalist capabilities. A number of foundational problems – embodiment, symbol grounding, causality and memory – are required to be addressed for LLMs to attain human-level general intelligence. These concepts are more aligned with human cognition and provide LLMs with inherent human-like cognitive properties that support the realization of physically-plausible, semantically meaningful, flexible and more generalizable knowledge and intelligence. In this work, we discuss the aforementioned foundational issues and survey state-of-the-art approaches for implementing these concepts in LLMs. Specifically, we discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.

[CV-8] Geometry Restoration and Dewarping of Camera-Captured Document Images

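[Quick read]: This paper addresses the topological distortion that appears when paper documents are digitized from camera-captured images, in particular geometric deformation and nonlinear warping. It proposes a solution that combines deep learning (DL) and computer vision (CV) in three key steps: detecting the document outline, generating a topological 2D grid using cubic polynomial interpolation, and correcting nonlinear distortions by remapping the image. Using classical CV methods alongside deep learning makes topology restoration more efficient and faster, with lower compute and memory requirements. Experiments show that the method outperforms existing benchmarks (such as RectiNet, DocGeoNet, and DocTr++) in visual quality, document readability (evaluated via OCR), and geometry restoration metrics, opening a path to high-quality digitization of paper documents and more efficient OCR systems.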

Link: https://arxiv.org/abs/2501.03145
Authors: Valery Istomin,Oleg Pereziabov,Ilya Afanasyev
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 16 figures

Abstract:This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: this https URL
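
The grid-building step can be sketched as follows: fit a cubic through points sampled on the top and bottom page edges, then blend between the two curves to get a source coordinate for every cell of the flattened page. This is an illustrative sketch, not the authors' pipeline. The four-point sampling, the linear blend between edges, and the function names are assumptions, and `build_grid` expects at least two rows and two columns.

```python
def lagrange_cubic(points):
    """Return f(x) for the unique cubic through four (x, y) points."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

def build_grid(top_pts, bottom_pts, width, n_cols, n_rows):
    """Source-image (x, y) coordinate for each cell of the flattened grid."""
    top, bottom = lagrange_cubic(top_pts), lagrange_cubic(bottom_pts)
    grid = []
    for r in range(n_rows):
        t = r / (n_rows - 1)  # 0 at the top edge, 1 at the bottom edge
        row = []
        for c in range(n_cols):
            x = width * c / (n_cols - 1)
            row.append((x, (1 - t) * top(x) + t * bottom(x)))
        grid.append(row)
    return grid

# A page whose top edge bows downward in the middle; the bottom edge is straight.
grid = build_grid(
    top_pts=[(0, 0), (1, 2), (2, 2), (3, 0)],
    bottom_pts=[(0, 10), (1, 10), (2, 10), (3, 10)],
    width=3, n_cols=4, n_rows=3,
)
```

The resulting grid can then drive a remapping routine, for example OpenCV's `cv2.remap`, which samples the source image at the listed coordinates to produce the flattened page.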

[CV-9] Normalizing Batch Normalization for Long-Tailed Recognition

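[Quick read]: This paper addresses the poor performance of conventionally trained networks on rare classes when training samples follow a long-tailed distribution, as they usually do in real-world scenarios. The network develops a bias toward frequent classes, and the rare-specific features that are key to discriminating rare classes are much weaker than the frequent-specific features. The key idea is to explicitly correct this feature bias by normalizing the parameters of the Batch Normalization (BN) layer: the weight/bias parameters of a BN layer are treated as a vector, normalized to a unit vector, and multiplied by a learnable scalar. Decoupling the direction and magnitude of the BN parameters makes the weight/bias distribution more balanced, so feature strengths become more even. Experiments on several long-tailed recognition benchmarks (CIFAR-10/100-LT, ImageNet-LT, and iNaturalist 2018) show that the method significantly outperforms previous state-of-the-art methods.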

Link: https://arxiv.org/abs/2501.03122
Authors: Yuxiang Bao,Guoliang Kang,Linlin Yang,Xiaoyue Duan,Bo Zhao,Baochang Zhang
Affiliations: Beihang University; Hangzhou Research Institute, Beihang; Nanchang Institute of Technology; Communication University of China; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In real-world scenarios, the number of training samples across classes is usually subject to a long-tailed distribution. The conventionally trained network may achieve unexpectedly inferior performance on the rare class compared to the frequent class. Most previous works attempt to rectify the network bias at the data level or at the classifier level. Differently, in this paper, we identify that the bias towards the frequent class may be encoded into features, i.e., the rare-specific features which play a key role in discriminating the rare class are much weaker than the frequent-specific features. Based on such an observation, we introduce a simple yet effective approach, normalizing the parameters of the Batch Normalization (BN) layer to explicitly rectify the feature bias. To achieve this end, we represent the Weight/Bias parameters of a BN layer as a vector, normalize it into a unit vector, and multiply the unit vector by a scalar learnable parameter. By decoupling the direction and magnitude of the parameters in the BN layer during learning, the Weight/Bias exhibits a more balanced distribution and thus the strength of features becomes more even. Extensive experiments on various long-tailed recognition benchmarks (i.e., CIFAR-10/100-LT, ImageNet-LT and iNaturalist 2018) show that our method outperforms previous state-of-the-art methods remarkably. The code and checkpoints are available at this https URL.
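
The reparameterization described in the abstract (treat the BN weight or bias as a vector, normalize it to unit length, multiply by a learnable scalar) is small enough to write out directly. The sketch below computes only the forward value; the function name and the `eps` stabilizer are assumptions, and the training loop that would learn `v` and `g` is omitted.

```python
import math

def normalized_bn_param(v, g, eps=1e-12):
    """Return g * v / ||v||: direction comes from v, magnitude from the scalar g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / (norm + eps) for x in v]

# Per-channel BN weights with very uneven magnitudes.
raw_weight = [3.0, 4.0]
effective_weight = normalized_bn_param(raw_weight, g=2.0)  # approximately [1.2, 1.6]
```

Because the output always has norm `g`, no single channel can grow without bound relative to the others; only the relative direction of `v` and the shared scale `g` are learned.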

[CV-10] CAT: Content-Adaptive Image Tokenization

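[Quick read]: This paper addresses the fact that existing image tokenizers encode every image into a fixed number of tokens or patches, ignoring the inherent variability in image complexity. The authors propose the Content-Adaptive Tokenizer (CAT), which adjusts representation capacity dynamically based on image content and encodes simpler images into fewer tokens. The key component is a caption-based evaluation system that uses large language models (LLMs) to predict content complexity and determine the best compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT reconstructs images robustly, and its variable-length latent representations are used to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.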

Link: https://arxiv.org/abs/2501.03120
Authors: Junhong Shen,Kushal Tirumala,Michihiro Yasunaga,Ishan Misra,Luke Zettlemoyer,Lili Yu,Chunting Zhou
Affiliations: Carnegie Mellon University; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same flops and boosts the inference throughput by 18.5%.

[CV-11] MVP: Multimodal Emotion Recognition based on Video and Physiological Signals ECCV

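[Quick read]: This paper addresses a gap in how current emotion recognition models fuse behavioral and physiological signals: existing methods rely mainly on classic machine learning rather than recent deep learning techniques. The authors propose MVP (Multimodal for Video and Physio), an architecture designed to fuse video and physiological signals. Its key feature is the use of attention to handle long input sequences (1-2 minutes). In experiments, MVP outperforms earlier methods for emotion recognition based on facial video, electrodermal activity (EDA), and electrocardiogram/photoplethysmogram (ECG/PPG) signals.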

Link: https://arxiv.org/abs/2501.03103
Authors: Valeriya Strizhkova,Hadi Kachmar,Hava Chaptoukaev,Raphael Kalandadze,Natia Kukhilava,Tatia Tsmindashvili,Nibras Abo-Alzahab,Maria A. Zuluaga,Michal Balazia,Antitza Dantcheva,François Brémond,Laura Ferrari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Final paper accepted at Affective Behavior Analysis in-the-Wild (ABAW) at IEEE/CVF European Conference on Computer Vision (ECCV), Milan, September, 2024. 17 pages

Abstract:Human emotions entail a complex set of behavioral, physiological and cognitive changes. Current state-of-the-art models fuse the behavioral and physiological components using classic machine learning, rather than recent deep learning techniques. We propose to fill this gap, designing the Multimodal for Video and Physio (MVP) architecture, streamlined to fuse video and physiological signals. Unlike other approaches, MVP exploits the benefits of attention to enable the use of long input sequences (1-2 minutes). We have studied video and physiological backbones for inputting long sequences and evaluated our method with respect to the state-of-the-art. Our results show that MVP outperforms former methods for emotion recognition based on facial videos, EDA, and ECG/PPG.

[CV-12] A Novel Structure-Agnostic Multi-Objective Approach for Weight-Sharing Compression in Deep Neural Networks

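[Quick read]: This paper addresses the memory footprint of deep neural networks (DNNs), which store millions or even billions of weights after training and are therefore hard to deploy on embedded devices. It proposes a compression framework based on a multi-objective evolutionary algorithm (MOEA) that is independent of network architecture, dimension, task, and dataset. Network weights are quantized with uniformly sized bins into a single codebook, and the MOEA searches for Pareto-optimal sets of k bins while optimizing two objectives. An iterative merge technique then combines neighboring bins without degrading performance, reducing the number of bins and raising the compression ratio. The method is model- and layer-independent, and it uses cluster centers as the shared weight values, which avoids computationally expensive retraining of the shared weights. Experiments show memory reductions of 13.72~14.98× on CIFAR-10, 11.61~12.99× on CIFAR-100, and 7.44~8.58× on ImageNet.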

Link: https://arxiv.org/abs/2501.03095
Authors: Rasa Khosrowshahli,Shahryar Rahnamayan,Beatrice Ombuki-Berman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 16 pages, 9 figures, submitted to IEEE Transactions on Neural Networks and Learning Systems

Abstract:Deep neural networks suffer from storing millions and billions of weights in memory post-training, making memory-intensive models challenging to deploy on embedded devices. The weight-sharing technique is one of the popular compression approaches that use fewer weight values and share them across specific connections in the network. In this paper, we propose a multi-objective evolutionary algorithm (MOEA) based compression framework independent of neural network architecture, dimension, task, and dataset. We use uniformly sized bins to quantize network weights into a single codebook (lookup table) for efficient weight representation. Using MOEA, we search for Pareto optimal k bins by optimizing two objectives. Then, we apply the iterative merge technique to non-dominated Pareto frontier solutions by combining neighboring bins without degrading performance to decrease the number of bins and increase the compression ratio. Our approach is model- and layer-independent, meaning the weights are mixed in the clusters from any layer, and the uniform quantization method used in this work has O(N) complexity, in contrast to non-uniform quantization methods such as k-means with O(Nkt) complexity. In addition, we use the center of clusters as the shared weight values instead of retraining shared weights, which is computationally expensive. The advantage of using evolutionary multi-objective optimization is that it can obtain non-dominated Pareto frontier solutions with respect to performance and shared weights. The experimental results show that we can reduce the neural network memory by 13.72~14.98× on CIFAR-10, 11.61~12.99× on CIFAR-100, and 7.44~8.58× on ImageNet, showcasing the effectiveness of the proposed deep neural network compression framework.
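
The uniform-bin quantization step can be sketched in a few lines. This covers only the codebook construction for a given number of bins k. The evolutionary search over k and the bin-merging step are not shown, and using the mean of each bin's members as its "center" is an assumption, since the abstract does not define the center precisely.

```python
def uniform_codebook(weights, k):
    """Quantize weights into k equal-width bins in a single O(N) pass.

    Returns (indices, codebook): indices[i] is the bin of weights[i] and
    codebook[b] is the shared value for bin b (the mean of its members,
    or 0.0 for an empty bin).
    """
    lo, hi = min(weights), max(weights)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all weights are equal
    indices = [min(int((w - lo) / width), k - 1) for w in weights]
    sums, counts = [0.0] * k, [0] * k
    for b, w in zip(indices, weights):
        sums[b] += w
        counts[b] += 1
    codebook = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return indices, codebook

indices, codebook = uniform_codebook([0.0, 0.1, 0.9, 1.0], k=2)
shared = [codebook[b] for b in indices]  # weights after sharing
```

Storage then consists of one small integer index per weight plus k codebook entries, instead of one full-precision value per weight.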

[CV-13] AIF-SFDA: Autonomous Information Filter-driven Source-Free Domain Adaptation for Medical Image Segmentation AAAI2025

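[Quick read]: This paper addresses a problem in medical image analysis: data collection and privacy concerns restrict access to both training and test data, which makes it hard for existing methods to decouple domain-variant information (DVI) from domain-invariant information (DII). The authors propose an Autonomous Information Filter-driven Source-Free Domain Adaptation algorithm (AIF-SFDA). Its key component is a frequency-based learnable information filter that decouples DVI and DII autonomously, optimized with Information Bottleneck (IB) and Self-supervision (SS). The IB controls the information flow within the filter to reduce redundant DVI, while SS keeps the DII aligned with the specific task and image modality. AIF-SFDA can therefore overcome domain shift using target data alone.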

Link: https://arxiv.org/abs/2501.03074
Authors: Haojin Li,Heng Li,Jianyu Chen,Rihan Zhong,Ke Niu,Huazhu Fu,Jiang Liu
Affiliations: 1: Southern University of Science and Technology; 2: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 3: Unknown; 4: Inception Institute of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages total (7 pages main text, 2 pages references), 6 figures, accepted by AAAI 2025

Abstract:Decoupling domain-variant information (DVI) from domain-invariant information (DII) serves as a prominent strategy for mitigating domain shifts in the practical implementation of deep learning algorithms. However, in medical settings, concerns surrounding data collection and privacy often restrict access to both training and test data, hindering the empirical decoupling of information by existing methods. To tackle this issue, we propose an Autonomous Information Filter-driven Source-free Domain Adaptation (AIF-SFDA) algorithm, which leverages a frequency-based learnable information filter to autonomously decouple DVI and DII. Information Bottleneck (IB) and Self-supervision (SS) are incorporated to optimize the learnable frequency filter. The IB governs the information flow within the filter to diminish redundant DVI, while SS preserves DII in alignment with the specific task and image modality. Thus, the autonomous information filter can overcome domain shifts relying solely on target data. A series of experiments covering various medical image modalities and segmentation tasks were conducted to demonstrate the benefits of AIF-SFDA through comparisons with leading algorithms and ablation studies. The code is available at this https URL.

[CV-14] Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

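[Quick read]: This paper addresses a key challenge in Image-to-Video (I2V) generation: producing videos with accurate and consistent object motion in multi-object scenes. Existing methods generate photorealistic video but still struggle with multi-object motion. The paper proposes a two-stage compositional framework that decomposes I2V generation into (i) a stage that generates an explicit intermediate representation and (ii) a video generation stage conditioned on that representation. The key innovation is a mask-based motion trajectory as the intermediate representation, which captures both semantic object information and motion in a compact yet expressive form. In the second stage, the learned representation is integrated into video generation through object-level attention objectives: a spatial, per-object masked cross-attention objective and a masked spatio-temporal self-attention objective that ensures frame-to-frame consistency for each object. Experiments show state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness on multi-object and high-motion scenarios.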

Link: https://arxiv.org/abs/2501.03059
Authors: Guy Yariv,Yuval Kirstain,Amit Zohar,Shelly Sheynin,Yaniv Taigman,Yossi Adi,Sagie Benaim,Adam Polyak
Affiliations: GenAI, Meta; FAIR, Meta; The Hebrew University of Jerusalem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked cross-attention objective, integrating object-specific prompts into corresponding latent space regions, and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method’s superiority on this benchmark. Project page is available at this https URL.
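
The per-object masking idea can be illustrated with a toy attention-weight computation in which each latent position may attend only to the prompt tokens of the object it belongs to. This sketches masked attention in general, not the paper's objective. The object-id encoding of the masks and both function names are hypothetical.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over scores, assigning zero weight where allowed is False."""
    kept = [s for s, a in zip(scores, allowed) if a]
    if not kept:
        return [0.0] * len(scores)
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def per_object_attention(scores, latent_obj, token_obj):
    """scores[i][j] is the similarity of latent position i to prompt token j.

    latent_obj[i] and token_obj[j] are object ids; position i attends only
    to tokens with the same object id.
    """
    return [
        masked_softmax(row, [obj == t for t in token_obj])
        for row, obj in zip(scores, latent_obj)
    ]

attn = per_object_attention(
    scores=[[1.0, 1.0, 5.0], [0.0, 2.0, 2.0]],
    latent_obj=[0, 1],   # position 0 lies in object 0's mask, position 1 in object 1's
    token_obj=[0, 0, 1], # the first two prompt tokens describe object 0
)
```

Position 0 ignores the high-scoring third token because it belongs to a different object, which is how masking keeps each object's prompt confined to its own region.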

[CV-15] TransPixar: Advancing Text-to-Video Generation with Transparency

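[Quick read]: This paper addresses the challenge of generating RGBA video, which includes alpha channels for transparency. Existing text-to-video models have made significant progress on RGB video, but RGBA generation is held back by limited datasets and the difficulty of adapting existing models. Alpha channels are essential for visual effects (VFX) because they let transparent elements such as smoke and reflections blend seamlessly into a scene. The proposed method, TransPixar, extends pretrained video models to generate RGBA video while retaining their original RGB capabilities. It uses a diffusion transformer (DiT) architecture, introduces alpha-specific tokens, and applies LoRA-based fine-tuning to generate the RGB and alpha channels jointly with high consistency. By optimizing the attention mechanisms, TransPixar keeps RGB and alpha strongly aligned despite limited training data, producing diverse and consistent RGBA videos and advancing VFX and interactive content creation.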

Link: https://arxiv.org/abs/2501.03006
Authors: Luozhou Wang,Yijun Li,Zhifei Chen,Jui-Hsien Wang,Zhifei Zhang,He Zhang,Zhe Lin,Yingcong Chen
Affiliations: HKUST(GZ); HKUST; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.

[CV-16] PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

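[Quick read]: This paper addresses a limitation of Masked Image Modeling (MIM): Pixel MIM and Latent MIM each focus on a different level of visual features (low-level detail and high-level semantics, respectively), which leads to suboptimal performance on tasks that depend on a particular level. The paper proposes PiLaMIM, a unified framework that combines the complementary strengths of both. The key is a single encoder with two distinct decoders, one predicting pixel values and the other predicting latent representations, so that both high-level and low-level features are captured. The CLS token is also integrated into the reconstruction process to aggregate global context and strengthen the model's capture of semantic information. Experiments show that PiLaMIM outperforms baselines such as MAE, I-JEPA, and BootMAE in most cases, demonstrating its effectiveness in extracting richer visual representations.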

Link: https://arxiv.org/abs/2501.03005
Authors: Junmyeong Lee,Eui Jun Hwang,Sukmin Cho,Jong C. Park
Affiliations: School of Computing, Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, each utilizing different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on high-level semantics of an object. However, these distinct strengths of each method can lead to suboptimal performance in tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features. We further integrate the CLS token into the reconstruction process to aggregate global context, enabling the model to capture more semantic information. Extensive experiments demonstrate that PiLaMIM outperforms key baselines such as MAE, I-JEPA and BootMAE in most cases, proving its effectiveness in extracting richer visual representations.
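The two-decoder objective described in the abstract can be illustrated with a minimal sketch. It assumes both reconstruction terms are mean squared errors and that they are combined by a weighted sum; the function names and the `latent_weight` coefficient are hypothetical and are not taken from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_mim_loss(pixel_pred, pixel_target, latent_pred, latent_target,
                      latent_weight=1.0):
    """Pixel reconstruction loss plus a weighted latent reconstruction loss.

    latent_weight is an assumed balancing coefficient, not a value
    reported by the authors.
    """
    pixel_loss = mse(pixel_pred, pixel_target)      # pixel-decoder term
    latent_loss = mse(latent_pred, latent_target)   # latent-decoder term
    return pixel_loss + latent_weight * latent_loss
```

In a real model both predictions would come from decoders that share one encoder, so gradients from both terms would update the same encoder weights.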

[CV-17] SurgRIPE challenge: Benchmark of Surgical Robot Instrument Pose Estimation

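[Quick read]: This paper addresses accurate instrument pose estimation in robotic surgery, a key step toward autonomous surgical task execution. Existing vision-based methods often require markers attached to the instruments, while recent work focuses on marker-less methods based on deep learning. Acquiring realistic surgical data with ground truth instrument poses for deep learning training, however, remains challenging.

The key to the solution is the Surgical Robot Instrument Pose Estimation (SurgRIPE) challenge, hosted at the 26th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2023. Its objectives are (1) to provide the surgical vision community with realistic surgical video data paired with ground truth instrument poses, and (2) to establish a benchmark for evaluating marker-less pose estimation methods. The challenge led to several novel algorithms that improve on existing methods in accuracy and robustness. The performance evaluation on the SurgRIPE dataset suggests these algorithms could be integrated into robotic surgery systems, paving the way for more precise and autonomous procedures. The challenge establishes a new benchmark for the field and encourages further research on surgical robot instrument pose estimation.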

Link: https://arxiv.org/abs/2501.02990
Authors: Haozheng Xu,Alistair Weld,Chi Xu,Alfie Roddan,Joao Cartucho,Mert Asim Karaoglu,Alexander Ladikos,Yangke Li,Yiping Li,Daiyun Shen,Shoujie Yang,Geonhee Lee,Seyeon Park,Jongho Shin,Young-Gon Kim,Lucy Fothergill,Dominic Jones,Pietro Valdastri,Duygu Sarikaya,Stamatia Giannarou
Affiliations: The Hamlyn Centre for Robotic Surgery, Imperial College London, United Kingdom; ImFusion GmbH, Munich, Germany; Technical University of Munich, Munich, Germany; Eindhoven University of Technology, Eindhoven, Netherlands; Department of Engineering Physics, Tsinghua University, Beijing, China; Department of Biomedical Engineering, National University of Singapore, Singapore; Department of Transdisciplinary Medicine, Seoul National University Hospital, South Korea; School of Computing, University of Leeds, United Kingdom; STORMLab, University of Leeds, United Kingdom
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 35 pages, 18 figures, journal paper

Abstract:Accurate instrument pose estimation is a crucial step towards the future of robotic surgery, enabling applications such as autonomous surgical task execution. Vision-based methods for surgical instrument pose estimation provide a practical approach to tool tracking, but they often require markers to be attached to the instruments. Recently, more research has focused on the development of marker-less methods based on deep learning. However, acquiring realistic surgical data, with ground truth instrument poses, required for deep learning training, is challenging. To address the issues in surgical instrument pose estimation, we introduce the Surgical Robot Instrument Pose Estimation (SurgRIPE) challenge, hosted at the 26th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2023. The objectives of this challenge are: (1) to provide the surgical vision community with realistic surgical video data paired with ground truth instrument poses, and (2) to establish a benchmark for evaluating markerless pose estimation methods. The challenge led to the development of several novel algorithms that showcased improved accuracy and robustness over existing methods. The performance evaluation study on the SurgRIPE dataset highlights the potential of these advanced algorithms to be integrated into robotic surgery systems, paving the way for more precise and autonomous surgical procedures. The SurgRIPE challenge has successfully established a new benchmark for the field, encouraging further research and development in surgical robot instrument pose estimation.

[CV-18] STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

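[Quick read]: This paper targets the over-smoothing of GAN-based video super-resolution methods and the difficulty image diffusion models have in maintaining temporal consistency in video super-resolution. It proposes STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), which integrates text-to-video (T2V) models to strengthen spatio-temporal modeling. The key components are: (1) a Local Information Enhancement Module (LIEM) placed before the global attention block to enrich local details and reduce artifacts caused by complex degradations; and (2) a Dynamic Frequency (DF) Loss that reinforces fidelity by guiding the model to focus on different frequency components at different diffusion steps. Experiments show that STAR outperforms existing state-of-the-art methods on both synthetic and real-world datasets.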

Link: https://arxiv.org/abs/2501.02976
Authors: Rui Xie,Yinhong Liu,Penghao Zhou,Chen Zhao,Jun Zhou,Kai Zhang,Zhenyu Zhang,Jian Yang,Zhenheng Yang,Ying Tai
Affiliations: Nanjing University; ByteDance; Southwest University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.

[CV-19] HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos

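[Quick read]: This paper addresses a limitation of current 3D hand pose estimation methods in egocentric video: they mainly reconstruct 3D hands from single images in the camera frame and overlook the hands' motion in world space. This prevents their direct use in egocentric settings, where both hands and camera move continuously. The paper proposes HaWoR, which decouples the task into reconstructing hand motion in camera space and estimating the camera trajectory in the world coordinate system, achieving high-fidelity hand motion reconstruction. The key components are: (1) an adaptive egocentric SLAM framework that overcomes the shortcomings of traditional SLAM under challenging camera dynamics; and (2) a novel motion infiller network that completes missing frames when the hands move out of the view frustum. With these, HaWoR achieves state-of-the-art performance on hand motion reconstruction and world-frame camera trajectory estimation across several egocentric benchmark datasets.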

Link: https://arxiv.org/abs/2501.02973
Authors: Jinglei Zhang,Jiankang Deng,Chao Ma,Rolandos Alexandros Potamias
Affiliations: Shanghai Jiao Tong University; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite advances in 3D hand pose estimation, current methods predominantly focus on single-image 3D hand reconstruction in the camera frame, overlooking the world-space motion of the hands. This limitation prohibits their direct use in egocentric video settings, where hands and camera are continuously in motion. In this work, we propose HaWoR, a high-fidelity method for hand motion reconstruction in world coordinates from egocentric videos. We propose to decouple the task by reconstructing the hand motion in the camera space and estimating the camera trajectory in the world coordinate system. To achieve precise camera trajectory estimation, we propose an adaptive egocentric SLAM framework that addresses the shortcomings of traditional SLAM methods, providing robust performance under challenging camera dynamics. To ensure robust hand motion trajectories, even when the hands move out of the view frustum, we devise a novel motion infiller network that effectively completes the missing frames of the sequence. Through extensive quantitative and qualitative evaluations, we demonstrate that HaWoR achieves state-of-the-art performance on both hand motion reconstruction and world-frame camera trajectory estimation under different egocentric benchmark datasets. Code and models are available on this https URL.

[CV-20] Human Gaze Boosts Object-Centered Representation Learning

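[Quick read]: This paper addresses the poor performance of self-supervised learning (SSL) models trained on human-like egocentric visual input, which fall well short of humans on image recognition tasks. The core issue is that existing SSL models train on raw, uniform visual input collected from head-mounted cameras, whereas in humans the anatomy of the retina and visual cortex relatively amplifies central visual information (the region around the gaze location), which likely helps form object-centered visual representations. The solution simulates human gaze behavior, crops the visual area around the gaze location, and trains a time-based SSL model on these modified inputs. Experiments show that focusing on central vision improves the learning of object-centered representations and that the model exploits the temporal dynamics of gaze movements to build stronger visual representations. The work is a step toward bio-inspired learning of visual representations.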

Link: https://arxiv.org/abs/2501.02966
Authors: Timothy Schaumlöffel,Arthur Aubret,Gemma Roig,Jochen Triesch
Affiliations: Goethe University Frankfurt; The Hessian Center AI; Frankfurt Institute for Advanced Studies; Xidian-FIAS international Joint Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform on image recognition tasks compared to humans. These models train on raw, uniform visual inputs collected from head-mounted cameras. This is different from humans, as the anatomical structure of the retina and visual cortex relatively amplifies the central visual information, i.e. around humans’ gaze location. This selective amplification in humans likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate 5-months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of the gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.
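The gaze-centered cropping step can be sketched as follows. This is a minimal illustration, assuming a square crop window that is shifted (rather than shrunk or padded) when the gaze falls near an image border; the function name, the nested-list frame layout, and the border policy are assumptions, not details given in the abstract.

```python
def crop_around_gaze(frame, gaze_row, gaze_col, crop_size):
    """Return a crop_size x crop_size window of `frame` around the gaze point.

    `frame` is a list of rows, each row a list of pixel values.
    The window is shifted to stay inside the frame near borders
    (an assumed policy for this sketch).
    """
    height, width = len(frame), len(frame[0])
    if crop_size > height or crop_size > width:
        raise ValueError("crop_size must fit inside the frame")
    half = crop_size // 2
    # Clamp the top-left corner so the whole window stays inside the frame.
    top = min(max(gaze_row - half, 0), height - crop_size)
    left = min(max(gaze_col - half, 0), width - crop_size)
    return [row[left:left + crop_size] for row in frame[top:top + crop_size]]
```

Applying such a crop to consecutive frames, each with its own predicted gaze location, would yield the sequence of central views that a time-based SSL model could then be trained on.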

[CV-21] Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild

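[Quick read]: This paper addresses key challenges in complex visual reasoning, in particular how to organically combine Chain of Thought (COT) and visual instruction tuning to improve performance on complex visual reasoning and question-answering tasks. It also aims to reduce hallucinations in generative models and to lower training cost.

The key to the solution is an innovative multi-round training and reasoning framework called Socratic Questioning (SQ). Through heuristic self-questioning, the framework guides lightweight Multimodal Large Language Models (MLLMs) to focus on visual clues relevant to the target problem, reducing hallucinations and improving the model's ability to describe fine-grained image details. The paper also creates a multimodal mini-dataset, CapQA, for visual instruction tuning and evaluation. Experiments show that SQ improves the hallucination score by 31.2% and demonstrates strong capabilities in heuristic self-questioning, zero-shot visual reasoning, and hallucination mitigation across multiple benchmarks.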

Link: https://arxiv.org/abs/2501.02964
Authors: Wanpeng Hu,Haodi Liu,Lin Chen,Feng Zhou,Changming Xiao,Qi Yang,Changshui Zhang
Affiliations: Aibee Inc; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Complex visual reasoning remains a key challenge today. Typically, the challenge is tackled using methodologies such as Chain of Thought (COT) and visual instruction tuning. However, how to organically combine these two methodologies for greater success remains unexplored. Also, issues like hallucinations and high training cost still need to be addressed. In this work, we devise an innovative multi-round training and reasoning framework suitable for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning approach heuristically guides MLLMs to focus on visual clues relevant to the target problem, reducing hallucinations and enhancing the model’s ability to describe fine-grained image details. This ultimately enables the model to perform well in complex visual reasoning and question-answering tasks. We have named this framework Socratic Questioning (SQ). To facilitate future research, we create a multimodal mini-dataset named CapQA, which includes 1k images of fine-grained activities, for visual instruction tuning and evaluation. Our proposed SQ method leads to a 31.2% improvement in the hallucination score. Our extensive experiments on various benchmarks demonstrate SQ’s remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning and hallucination mitigation. Our model and code will be publicly available.

[CV-22] SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild

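[Quick read]: This paper addresses the challenging problem of generating visual text in natural scene images. Unlike generating text on artificially designed images (posters, covers, cartoons, etc.), text in natural scene images must meet four key criteria: (1) Fidelity: the generated text should look as realistic as a photograph, with every stroke accurate; (2) Reasonability: the text should be generated on reasonable carrier areas (such as signs and walls), and its content should be relevant to the scene; (3) Utility: the generated text should help train natural scene OCR (Optical Character Recognition) tasks; (4) Controllability: text attributes such as font and color should be controllable. To this end, the paper proposes a two-stage method, SceneVTG++, consisting of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD) model. TLCG uses the world knowledge of multimodal large language models to find reasonable text areas and recommend text content based on the natural scene background image, while CLTD generates controllable multilingual text with a diffusion model. Extensive experiments verify the effectiveness of TLCG and CLTD and demonstrate the state-of-the-art text generation performance of SceneVTG++; the generated images also show superior utility in OCR tasks.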

Link: https://arxiv.org/abs/2501.02962
Authors: Jiawei Liu,Yuanzhi Zhu,Feiyu Gao,Zhibo Yang,Peng Wang,Junyang Lin,Xinggang Wang,Wenyu Liu
Affiliations: School of EIC, Huazhong University of Science and Technology, Wuhan, 430074, China; Alibaba Group, Hangzhou, 311121, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: The attribute of the text (such as font and color) should be controllable. In this paper, we propose a two stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multi modal large language models to find reasonable text areas and recommend text content according to the natural scene background images, while the latter generates controllable multilingual text based on the diffusion model. Through extensive experiments, we respectively verified the effectiveness of TLCG and CLTD, and demonstrated the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks like text detection and text recognition. Codes and datasets will be available.

[CV-23] MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

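[Quick read]: This paper addresses the weak fine-grained motion comprehension of current vision language models (VLMs) in video understanding. Although VLMs have advanced significantly in video understanding, fine-grained motion comprehension remains under-explored in existing benchmarks. The authors propose MotionBench, a comprehensive evaluation benchmark that assesses the fine-grained motion comprehension of video understanding models through six primary categories of motion-oriented question types. Its data come from diverse sources, ensuring a broad representation of real-world video content. Experiments show that existing VLMs perform poorly at understanding fine-grained motion. To improve VLMs' ability to perceive fine-grained motion within a limited sequence length, the authors conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Higher frame rate inputs and TE Fusion both improve motion understanding, yet substantial room for improvement remains. The benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension.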

Link: https://arxiv.org/abs/2501.02955
Authors: Wenyi Hong,Yean Cheng,Zhuoyi Yang,Weihan Wang,Lefan Wang,Xiaotao Gu,Shiyu Huang,Yuxiao Dong,Jie Tang
Affiliations: Tsinghua University; Zhipu AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages

Abstract:In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models’ motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM’s ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: this https URL .

[CV-24] 4D-CS: Exploiting Cluster Prior for 4D Spatio-Temporal LiDAR Semantic Segmentation

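[Quick read]: This paper addresses spatio-temporal consistency in semantic segmentation of LiDAR point clouds. Existing multi-scan methods use spatio-temporal information to identify the semantic class and motion state of each point, but they often overlook segmentation consistency in space and time, so points within the same object may be predicted as different categories. The core idea is to generate cluster labels across multiple frames that reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for the dual-branch network 4D-CS, which integrates point-based and cluster-based branches for more consistent segmentation. The point-based branch uses historical knowledge to enrich current features through temporal fusion on multiple views. The cluster-based branch introduces a new strategy that produces cluster labels for foreground objects and uses them to gather point-wise information into cluster features; neighboring clusters across multiple scans are then merged to restore features missing due to occlusion. Finally, the point-cluster fusion stage adaptively fuses information from the two branches to refine the segmentation. Experiments show state-of-the-art results on multi-scan semantic and moving object segmentation on the SemanticKITTI and nuScenes datasets.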

Link: https://arxiv.org/abs/2501.02937
Authors: Jiexi Zhong,Zhiheng Li,Yubo Cui,Zheng Fang
Affiliations: Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110819, China; National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110819, China; Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110819, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at IEEE Robotics and Automation Letters (RAL)

Abstract:Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches explore spatio-temporal information of multi-scan to identify the semantic classes and motion states for each point. However, these methods often overlook the segmentation consistency in space and time, which may result in point clouds within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that can reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current feature through temporal fusion on multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point-wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore missing features due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to optimize segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on the multi-scan semantic and moving object segmentation on SemanticKITTI and nuScenes datasets. The code will be available at this https URL.
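The idea of using cluster labels to gather point-wise information into cluster features can be illustrated with a small sketch. It assumes mean pooling as the aggregation operator and uses the label -1 for points that belong to no foreground cluster; both choices, and the function name, are assumptions for illustration, since the abstract does not specify them.

```python
def cluster_mean_features(point_features, cluster_labels):
    """Average the feature vectors of all points sharing a cluster label.

    point_features: list of equal-length feature vectors, one per point.
    cluster_labels: one integer label per point; negative labels are
                    treated as "no cluster" (an assumed convention).
    Returns a dict mapping cluster label -> mean feature vector.
    """
    sums, counts = {}, {}
    for features, label in zip(point_features, cluster_labels):
        if label < 0:
            continue  # skip points outside any foreground cluster
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}
```

A fusion stage could then broadcast each cluster feature back to its member points, so that points of the same object share a common signal when the final class is predicted.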

[CV-25] Label-free Concept Based Multiple Instance Learning for Gigapixel Histopathology

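[Quick read]: This paper addresses the interpretability of Multiple Instance Learning (MIL) methods for Whole-Slide Image (WSI) classification in medicine. Traditional MIL methods explain predictions by highlighting salient regions, but such spatial heatmaps offer limited insight to end users. The paper proposes a novel, inherently interpretable WSI classification approach that generates explanations from human-understandable pathology concepts. The key is the Concept MIL model, which leverages recent advances in vision-language models to predict pathology concepts directly from image features. Predictions are obtained as a linear combination of the concepts identified on the top-K patches of a WSI, so each concept's influence on the prediction can be traced. Unlike traditional concept-based interpretable models, the approach needs no costly human annotations because the vision-language model supplies the concepts. On two widely used pathology datasets, Camelyon16 and PANDA, Concept MIL achieves AUC and accuracy above 0.9, on par with state-of-the-art models. A user study shows that the concepts identified by the model align with those used by pathologists, suggesting its promise for human-interpretable WSI classification.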

Link: https://arxiv.org/abs/2501.02922
Authors: Susu Sun,Leslie Tessier,Frédérique Meeuwsen,Clément Grisi,Dominique van Midden,Geert Litjens,Christian F. Baumgartner
Affiliations: Cluster of Excellence: Machine Learning - New Perspectives for Science, University of Tübingen, Tübingen, Germany; Faculty of Health Sciences and Medicine, University of Lucerne, Lucerne, Switzerland; Radboud University Medical Center, Radboud, Netherlands; Institut du Cancer de l’Ouest, Angers, France; Oncode Institute, Utrecht, Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multiple Instance Learning (MIL) methods allow for gigapixel Whole-Slide Image (WSI) analysis with only slide-level annotations. Interpretability is crucial for safely deploying such algorithms in high-stakes medical domains. Traditional MIL methods offer explanations by highlighting salient regions. However, such spatial heatmaps provide limited insights for end users. To address this, we propose a novel inherently interpretable WSI-classification approach that uses human-understandable pathology concepts to generate explanations. Our proposed Concept MIL model leverages recent advances in vision-language models to directly predict pathology concepts based on image features. The model’s predictions are obtained through a linear combination of the concepts identified on the top-K patches of a WSI, enabling inherent explanations by tracing each concept’s influence on the prediction. In contrast to traditional concept-based interpretable models, our approach eliminates the need for costly human annotations by leveraging the vision-language model. We validate our method on two widely used pathology datasets: Camelyon16 and PANDA. On both datasets, Concept MIL achieves AUC and accuracy scores over 0.9, putting it on par with state-of-the-art models. We further find that 87.1% (Camelyon16) and 85.3% (PANDA) of the top 20 patches fall within the tumor region. A user study shows that the concepts identified by our model align with the concepts used by pathologists, making it a promising strategy for human-interpretable WSI classification.
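The prediction rule described above, a linear combination of concepts found on the top-K patches, can be sketched as follows. The sketch assumes that each patch already has a vector of concept scores (for example from a vision-language model), that a separate per-patch relevance score picks the top-K patches, and that concept scores are averaged over those patches; all names and these pooling choices are assumptions, not the authors' implementation.

```python
def concept_mil_predict(patch_concepts, patch_relevance, class_weights, k):
    """Linear concept-based slide prediction over the top-k patches.

    patch_concepts:  per-patch concept score vectors (one list per patch).
    patch_relevance: per-patch relevance used to select the top-k patches.
    class_weights:   dict mapping class name -> one weight per concept.
    Returns (logits, contributions); contributions[c][j] is how much
    concept j adds to the logit of class c.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(patch_relevance)),
                    key=lambda i: patch_relevance[i], reverse=True)
    top = ranked[:k]
    n_concepts = len(patch_concepts[0])
    # Average each concept's score over the selected patches.
    pooled = [sum(patch_concepts[i][j] for i in top) / len(top)
              for j in range(n_concepts)]
    contributions = {cls: [w * p for w, p in zip(weights, pooled)]
                     for cls, weights in class_weights.items()}
    logits = {cls: sum(parts) for cls, parts in contributions.items()}
    return logits, contributions
```

Because each logit is a plain sum of per-concept terms, the `contributions` dictionary directly gives the kind of explanation the abstract describes: which concepts pushed the prediction toward or away from each class.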

[CV-26] Unsupervised Tomato Split Anomaly Detection using Hyperspectral Imaging and Variational Autoencoders

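[Quick read]: This paper addresses the detection of tomato anomalies and damage in greenhouse farming, specifically tomato splitting. Splitting appears as cracks on the tomato skin and seriously degrades fruit quality. Detection is challenging because the appearance and size of splits vary and datasets are scarce. The paper proposes an unsupervised approach that uses a tailored variational autoencoder (VAE) with hyperspectral input. Preliminary data analysis identified the 530nm - 550nm wavelength range as suitable for detecting tomato dry splits. Analyzing the reconstruction loss allows the method not only to detect anomalies but also, to some degree, to estimate the anomalous regions. The key is combining hyperspectral data with the unsupervised learning capability of a VAE to cope with dataset scarcity and varying appearance.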

Link: https://arxiv.org/abs/2501.02921
Authors: Mahmoud Abdulsalam,Usman Zahidi,Bradley Hurst,Simon Pearson,Grzegorz Cielniak,James Brown
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPPA Workshop

Abstract:Tomato anomalies/damages pose a significant challenge in greenhouse farming. While this method of cultivation benefits from efficient resource utilization, anomalies can significantly degrade the quality of farm produce. A common anomaly associated with tomatoes is splitting, characterized by the development of cracks on the tomato skin, which degrades its quality. Detecting this type of anomaly is challenging due to dynamic variations in appearance and sizes, compounded by dataset scarcity. We address this problem in an unsupervised manner by utilizing a tailored variational autoencoder (VAE) with hyperspectral input. Preliminary analysis of the dataset enabled us to select the optimal range of wavelengths for detecting this anomaly. Our findings indicate that the 530nm - 550nm range is suitable for identifying tomato dry splits. The analysis of the reconstruction loss allows us not only to detect the anomalies but also, to some degree, to estimate the anomalous regions.
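How a reconstruction loss can both flag an anomaly and localize it can be shown with a short sketch. It assumes a trained autoencoder has already produced a reconstruction of the hyperspectral image, uses per-pixel mean squared error over the spectral bands, and marks pixels above a fixed threshold as anomalous; the function name, the data layout, and the thresholding rule are assumptions, not details reported in the abstract.

```python
def anomaly_map(original, reconstruction, threshold):
    """Per-pixel reconstruction error and a binary anomaly mask.

    original, reconstruction: nested lists shaped H x W x B, where B is
    the number of spectral bands per pixel.
    threshold: assumed cut-off; errors above it are marked anomalous.
    Returns (errors, mask), both shaped H x W.
    """
    errors, mask = [], []
    for row_orig, row_rec in zip(original, reconstruction):
        error_row = []
        for px_orig, px_rec in zip(row_orig, row_rec):
            # Mean squared error across the spectral bands of one pixel.
            err = sum((o - r) ** 2 for o, r in zip(px_orig, px_rec)) / len(px_orig)
            error_row.append(err)
        errors.append(error_row)
        mask.append([err > threshold for err in error_row])
    return errors, mask
```

A model trained only on healthy fruit tends to reconstruct split regions poorly, so the mask gives a rough estimate of where the anomaly lies, while an aggregate of `errors` can serve as an image-level anomaly score.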

[CV-27] Spiking monocular event based 6D pose estimation for space application

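[Quick read]: This paper addresses the accuracy and efficiency of spacecraft pose estimation for On-orbit servicing (OOS) and Active Debris Removal (ADR) missions. It proposes a new event-based solution built on bio-inspired low-power technologies such as event-based cameras and spiking neural networks. The key contribution is the first fully event-based processing approach for this use case, with experiments supporting its feasibility. The paper works with SEENIC, a dataset of real event frames captured by an event-based camera, and develops a small spiking end-to-end network (S2E2) that achieves initial results of 21cm position error and 14 degree rotation error, a first step toward fully event-based processing for embedded spacecraft pose estimation.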

Link: https://arxiv.org/abs/2501.02916
Authors: Jonathan Courtois,Benoît Miramond,Alain Pegatoquet
Affiliations: LEAT, Univ. Côte d’azur, CNRS, France
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 1 table. This paper has been presented in the Thursday 19 September poster session at the SPAICE 2024 conference (17-19 September 2024)

Abstract:With the growing interest in On-orbit servicing (OOS) and Active Debris Removal (ADR) missions, spacecraft pose estimation algorithms are being developed using deep learning to improve the precision of this complex task and find the most efficient solution. With the advances of bio-inspired low-power solutions, such as spiking neural networks and event-based processing and cameras, and their recent work for space applications, we propose to investigate the feasibility of a fully event-based solution to improve event-based pose estimation for spacecraft. In this paper, we address the first event-based dataset SEENIC with real event frames captured by an event-based camera on a testbed. We show the methods and results of the first event-based solution for this use case, where our small spiking end-to-end network (S2E2) solution achieves interesting results over 21cm position error and 14 degree rotation error, which is the first step towards fully event-based processing for embedded spacecraft pose estimation.

[CV-28] Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

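[Quick read]: This paper addresses single-image novel view synthesis (NVS), that is, generating high-quality, geometrically consistent views from a single reference image. Traditional methods usually rely on complex 3D reconstruction or large numbers of trainable parameters, whereas the proposed PointmapDiffusion framework uses pre-trained 2D diffusion models with pointmaps as a conditioning signal and significantly reduces the number of trainable parameters. The key is to introduce pointmaps as a geometric prior and to combine reference attention blocks with a ControlNet to balance generative capability and geometric consistency, enabling accurate view synthesis across viewpoints. Experiments on several real-world datasets show that PointmapDiffusion produces high-quality results while significantly reducing model complexity.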

Link: https://arxiv.org/abs/2501.02913
Authors: Thang-Anh-Quan Nguyen,Nathan Piasco,Luis Roldão,Moussab Bennehar,Dzmitry Tsishkou,Laurent Caraffa,Jean-Philippe Tarel,Roland Brémond
Affiliations: Noah’s Ark, Huawei Paris Research Center, France; COSYS, Gustave Eiffel University, France; LASTIG, Gustave Eiffel University, IGN-ENSG, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we present PointmapDiffusion, a novel framework for single-image novel view synthesis (NVS) that utilizes pre-trained 2D diffusion models. Our method is the first to leverage pointmaps (i.e. rasterized 3D scene coordinates) as a conditioning signal, capturing geometric prior from the reference images to guide the diffusion process. By embedding reference attention blocks and a ControlNet for pointmap features, our model balances between generative capability and geometric consistency, enabling accurate view synthesis across varying viewpoints. Extensive experiments on diverse real-world datasets demonstrate that PointmapDiffusion achieves high-quality, multi-view consistent results with significantly fewer trainable parameters compared to other baselines for single-image NVS tasks.

[CV-29] Comprehensive Pathological Image Segmentation via Teacher Aggregation for Tumor Microenvironment Analysis

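[Quick read]: This paper addresses the limited cell-type diversity and accuracy of current methods for comprehensive analysis of the tumor microenvironment (TME) in HE-stained tissue slides. Existing methods have significant limitations in handling the complexity and heterogeneity of the TME. The proposed solution is PAGET (Pathological image segmentation via AGgrEgated Teachers), a new knowledge distillation approach that integrates multiple segmentation models while accounting for the hierarchical nature of cell types in the TME, and can identify and classify 14 key TME components simultaneously. Using a unique dataset created through immunohistochemical restaining and existing segmentation models, PAGET performs rapid, comprehensive TME segmentation across tissue types and medical institutions, advancing quantitative analysis of tumor microenvironments. The method is a step forward in understanding cancer biology and in supporting precise clinical decision-making from large-scale histopathology images.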

Link: https://arxiv.org/abs/2501.02909
Authors: Daisuke Komura,Maki Takao,Mieko Ochi,Takumi Onoyama,Hiroto Katoh,Hiroyuki Abe,Hiroyuki Sano,Teppei Konishi,Toshio Kumasaka,Tomoyuki Yokose,Yohei Miyagi,Tetsuo Ushiku,Shumpei Ishikawa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages, 13 figures

Abstract:The tumor microenvironment (TME) plays a crucial role in cancer progression and treatment response, yet current methods for its comprehensive analysis in HE-stained tissue slides face significant limitations in the diversity of tissue cell types and accuracy. Here, we present PAGET (Pathological image segmentation via AGgrEgated Teachers), a new knowledge distillation approach that integrates multiple segmentation models while considering the hierarchical nature of cell types in the TME. By leveraging a unique dataset created through immunohistochemical restaining techniques and existing segmentation models, PAGET enables simultaneous identification and classification of 14 key TME components. We demonstrate PAGET’s ability to perform rapid, comprehensive TME segmentation across various tissue types and medical institutions, advancing the quantitative analysis of tumor microenvironments. This method represents a significant step forward in enhancing our understanding of cancer biology and supporting precise clinical decision-making from large-scale histopathology images.

[CV-30] FoundPAD: Foundation Models Reloaded for Face Presentation Attack Detection WACV2025

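[Quick read]: This paper addresses the detection of presentation attacks against face recognition systems, in particular the poor generalization of current presentation attack detection (PAD) algorithms to unknown scenarios and their need for large amounts of training data. The key is to use foundation models (FM), which are pre-trained on extensive datasets, generalize well to unseen domains, and can be adapted efficiently to a specific task even with little training data. The paper is the first to adapt an FM to the PAD task, using LoRA weights while simultaneously training a classification header, resulting in an architecture named FoundPAD. FoundPAD performs well under different data availability scenarios and achieves competitive results even when trained only on synthetic data. To support reproducibility and further research, the implementation of FoundPAD is publicly released.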

Link: https://arxiv.org/abs/2501.02892
Authors: Guray Ozgur,Eduarda Caldeira,Tahar Chettaoui,Fadi Boutros,Raghavendra Ramachandra,Naser Damer
Affiliations: Fraunhofer IGD; Norwegian University of Science and Technology (NTNU); TU Darmstadt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025 workshops

Abstract:Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown scenarios and large training data requirements. Foundation models (FM) are pre-trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task-specific adaption even when little training data are available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification header. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at this https URL.
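
The LoRA adaptation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the FoundPAD implementation: the class name `LoRALinear`, the rank, the `alpha / rank` scaling, and the initialization are assumptions that follow the commonly described LoRA formulation, in which a frozen weight matrix receives a trainable low-rank update.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                 # frozen pre-trained weights
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable, small random init
        self.B = np.zeros((out_dim, rank))                   # trainable, zero init
        self.scale = alpha / rank                            # assumed scaling convention

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        # Output = frozen path + scaled low-rank path x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


def trainable_fraction(layer):
    """Number of LoRA parameters divided by the number of frozen weights."""
    return (layer.A.size + layer.B.size) / layer.weight.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` (plus, in the paper's setting, a classification head) would be updated during training.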

[CV-31] MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

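[Quick Read]: This paper addresses the challenges that video large language models (Video-LLMs) face when processing multi-frame video, in particular the context-length limit imposed by long visual token sequences and the interference of irrelevant frames with visual perception. To this end, it proposes an effective frame selection method and stresses that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. The paper therefore proposes MDP3, a Markov decision determinantal point process with dynamic programming, for frame selection. The method is training-free and model-agnostic and can be integrated seamlessly into existing Video-LLMs. MDP3 estimates query-conditioned frame similarities with a conditional Gaussian kernel in a reproducing kernel Hilbert space (RKHS) and applies a determinantal point process (DPP) to capture query relevance and list-wise diversity. To incorporate sequentiality, MDP3 segments the video and applies the DPP within each segment, conditioned on the selection from the preceding segment, and uses a Markov decision process (MDP) to allocate selection sizes across segments. Theoretical analysis shows that MDP3 provides a (1 - 1/e)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, and experiments verify its effectiveness and robustness.

Link: https://arxiv.org/abs/2501.02885
Authors: Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming Li
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; Alibaba International Digital Commerce, Hangzhou, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 10 figures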

Abstract:Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, presenting two challenges: the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space (RKHS). We then apply the determinantal point process (DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process (MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a (1 - 1/e)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
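
To make the DPP step more concrete, here is a rough sketch of greedy subset selection under a query-weighted Gaussian kernel. It is not MDP3 itself: there is no segmentation, no Markov decision process, and no dynamic programming, and the kernel construction (`dpp_kernel`, with relevance taken as the exponential of cosine similarity) is an assumption chosen for illustration rather than the paper's conditional kernel.

```python
import numpy as np


def gaussian_kernel(features, bandwidth=1.0):
    """Pairwise Gaussian (RBF) similarities between row vectors."""
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))


def dpp_kernel(frame_feats, query_feat, bandwidth=1.0):
    """Assumed kernel: L = diag(r) K diag(r), with r a positive relevance score."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    relevance = np.exp(f @ q)
    return relevance[:, None] * gaussian_kernel(f, bandwidth) * relevance[None, :]


def greedy_dpp(L, k):
    """Greedily add the item that maximizes log det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        if best is None:  # every remaining candidate makes the submatrix singular
            break
        selected.append(best)
    return sorted(selected)  # return indices in temporal order
```

The determinant grows with relevance (large diagonal entries) and shrinks when two selected frames are near-duplicates (large off-diagonal entries), which is how a DPP trades off relevance against diversity.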

[CV-32] PARF-Net: integrating pixel-wise adaptive receptive fields into hybrid Transformer-CNN network for medical image segmentation

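[Quick Read]: This paper addresses two main problems of existing hybrid Transformer-CNN networks in medical image segmentation: insufficient learning of local semantic features caused by the fixed receptive fields of convolutions, and failure to integrate local and long-range dependencies effectively. To solve them, it proposes PARF-Net, whose key innovation is the convolution with pixel-wise adaptive receptive fields (Conv-PARF). Conv-PARF dynamically adjusts the convolutional receptive field of each pixel according to inter-pixel semantic differences, providing distinguishable features that separate lesions of varying shapes and scales from the background. PARF-Net then processes these features with lightweight hybrid Transformer-CNN blocks to capture local and long-range dependencies, which boosts segmentation performance. Experiments show that PARF-Net outperforms existing methods on several medical image datasets, for example reaching a mean Dice of 84.27% on the Synapse dataset.

Link: https://arxiv.org/abs/2501.02882
Authors: Xu Ma, Mengsheng Chen, Junhui Zhang, Lijuan Song, Fang Du, Zhenhua Yu
Affiliations: School of Information Engineering, Ningxia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: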

Abstract:Convolutional neural networks (CNNs) excel in local feature extraction while Transformers are superior in processing global semantic information. By leveraging the strengths of both, hybrid Transformer-CNN networks have become the major architectures in medical image segmentation tasks. However, existing hybrid methods still suffer deficient learning of local semantic features due to the fixed receptive fields of convolutions, and also fall short in effectively integrating local and long-range dependencies. To address these issues, we develop a new method PARF-Net to integrate convolutions of Pixel-wise Adaptive Receptive Fields (Conv-PARF) into hybrid Network for medical image segmentation. The Conv-PARF is introduced to cope with inter-pixel semantic differences and dynamically adjust convolutional receptive fields for each pixel, thus providing distinguishable features to disentangle the lesions with varying shapes and scales from the background. The features derived from the Conv-PARF layers are further processed using hybrid Transformer-CNN blocks under a lightweight manner, to effectively capture local and long-range dependencies, thus boosting the segmentation performance. By assessing PARF-Net on four widely used medical image datasets including MoNuSeg, GlaS, DSB2018 and multi-organ Synapse, we showcase the advantages of our method over the state-of-the-arts. For instance, PARF-Net achieves 84.27% mean Dice on the Synapse dataset, surpassing existing methods by a large margin.
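
For readers unfamiliar with the mean Dice metric quoted above, the following sketch shows one common way to compute it from label maps. The function names and the convention of scoring two empty masks as 1.0 are assumptions; the paper's exact evaluation protocol may differ.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / denom


def mean_dice(pred_labels, target_labels, class_ids):
    """Average the per-class Dice over the given foreground class ids."""
    scores = [dice_score(pred_labels == c, target_labels == c) for c in class_ids]
    return float(np.mean(scores))
```

In a multi-organ setting such as Synapse, `class_ids` would list the organ labels, and the reported figure would be this average expressed as a percentage.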

[CV-33] Two-Dimensional Unknown View Tomography from Unknown Angle Distributions ICASSP

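[Quick Read]: This paper addresses reconstruction in two-dimensional (2D) tomography when the distribution of viewing angles is unknown, a problem that arises in cryo-electron microscopy and in the geometric calibration of CT systems. Existing 2D unknown view tomography (UVT) algorithms usually assume that the angle distribution is known, but this information is often unavailable in practice. The proposed method formulates the problem as an optimization task based on cross-validation error and solves it by alternately estimating the angle distribution and the underlying 2D structure. The key components are: 1) modeling the angle distribution with a semi-parametric mixture of von Mises densities and with a probability mass function model; 2) combining PCA-based denoising with Graph Laplacian Tomography (GLT) driven by order statistics of the estimated distribution to ensure near-perfect ordering. The method performs well under noisy projections, and its effectiveness is validated by comparison with intuitive baselines.

Link: https://arxiv.org/abs/2501.02872
Authors: Kaishva Chintan Shah, Karthik S. Gurumoorthy, Ajit Rajwade
Affiliations: Department of Electrical Engineering, Indian Institute of Technology Bombay; Walmart Global Tech; Department of Computer Science and Engineering, Indian Institute of Technology Bombay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025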

Abstract:This study presents a technique for 2D tomography under unknown viewing angles when the distribution of the viewing angles is also unknown. Unknown view tomography (UVT) is a problem encountered in cryo-electron microscopy and in the geometric calibration of CT systems. There exists a moderate-sized literature on the 2D UVT problem, but most existing 2D UVT algorithms assume knowledge of the angle distribution which is not available usually. Our proposed methodology formulates the problem as an optimization task based on cross-validation error, to estimate the angle distribution jointly with the underlying 2D structure in an alternating fashion. We explore the algorithm’s capabilities for the case of two probability distribution models: a semi-parametric mixture of von Mises densities and a probability mass function model. We evaluate our algorithm’s performance under noisy projections using a PCA-based denoising technique and Graph Laplacian Tomography (GLT) driven by order statistics of the estimated distribution, to ensure near-perfect ordering, and compare our algorithm to intuitive baselines.
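
As a small illustration of the first angle-distribution model, the sketch below evaluates a mixture of von Mises densities on the circle. The parameter names (`weights`, `mus`, `kappas`) are assumptions for illustration, and the snippet only evaluates the density; it does not reproduce the paper's cross-validation-based estimation.

```python
import numpy as np


def von_mises_pdf(theta, mu, kappa):
    """von Mises density exp(kappa * cos(theta - mu)) / (2 * pi * I0(kappa))."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))


def mixture_pdf(theta, weights, mus, kappas):
    """Weighted sum of von Mises components; weights are renormalized to sum to 1."""
    theta = np.asarray(theta, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    components = [w * von_mises_pdf(theta, m, k)
                  for w, m, k in zip(weights, mus, kappas)]
    return np.sum(components, axis=0)
```

Setting `kappa = 0` recovers the uniform distribution on the circle, while larger values concentrate the angles around `mu`.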

[CV-34] Seeing the Whole in the Parts in Self-Supervised Representation Learning

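[Quick Read]: This paper addresses how to model spatial co-occurrences of visual features effectively in self-supervised learning (SSL). Existing methods usually do so by masking portions of an image or by cropping it aggressively, and these approaches have limitations. The paper proposes a new solution: modeling spatial co-occurrences by aligning local representations (before pooling) with a global image representation. Specifically, the authors propose CO-SSL (Co-occurrence Self-Supervised Learning), which aligns local and global representations through an instance discrimination task. Experiments show that CO-SSL outperforms existing methods on several datasets, notably ImageNet-1K, where it reaches 71.5% Top-1 accuracy with only 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Further analysis indicates that CO-SSL learns highly redundant local representations, which explains the source of its robustness. Overall, the paper suggests that aligning local and global representations may be a powerful principle for unsupervised category learning.

Link: https://arxiv.org/abs/2501.02860
Authors: Arthur Aubret, Céline Teulière, Jochen Triesch
Affiliations: Frankfurt Institute for Advanced Studies; Institut Pascal, CNRS, Clermont Auvergne INP
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages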

Abstract:Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods and show that it outperforms previous methods on several datasets, including ImageNet-1K where it achieves 71.5% of Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.
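
One way to picture "aligning local representations with a global image representation" is an InfoNCE-style instance discrimination loss in which every local feature must identify the global embedding of its own image among the other images in the batch. The sketch below is a generic illustration under that assumption; the function name, the temperature, and the loss form are not taken from the paper, and CO-SSL's actual objectives may differ.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def local_global_infonce(local_feats, global_feats, temperature=0.1):
    """Cross-entropy of each local feature against all global embeddings.

    local_feats:  (B, P, D) features at P spatial positions, before pooling
    global_feats: (B, D) one global embedding per image (e.g. from another view)
    """
    B, P, D = local_feats.shape
    z_local = l2_normalize(local_feats).reshape(B * P, D)
    z_global = l2_normalize(global_feats)
    logits = z_local @ z_global.T / temperature           # (B*P, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.repeat(np.arange(B), P)                  # positive = global of the same image
    return float(-log_prob[np.arange(B * P), targets].mean())
```

The loss is low when every spatial position already predicts its own image's global embedding, which is consistent with the redundancy of local representations noted in the abstract.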

[CV-35] A Novel Vision Transformer for Camera-LiDAR Fusion based Traffic Object Segmentation

[Quick Read]: This paper addresses traffic object segmentation for autonomous driving perception, with a focus on performance under difficult weather conditions. The key to the solution is the Camera-LiDAR Fusion Transformer (CLFT) model, which fuses camera and LiDAR data with vision transformers, uses the self-attention mechanism to strengthen segmentation, and extends the classification options to a diverse set of objects such as cyclists, traffic signs, and pedestrians. Although the model performs well under many conditions, it still struggles in adverse environments such as darkness and rain, which indicates that further optimization is needed to improve its performance in practical deployments.

Link: https://arxiv.org/abs/2501.02858
Authors: Toomas Tahves, Junyi Gu, Mauro Bellone, Raivo Sell
Affiliations: Department of Mechanical and Industrial Engineering, Tallinn University of Technology, Estonia; Dept. of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Sweden; FinEst Centre for Smart Cities, Tallinn University of Technology, Estonia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Agents and Artificial Intelligence 2025

Abstract:This paper presents Camera-LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which leverage the fusion of camera and LiDAR data using vision transformers. Building on the methodology of visual transformers that exploit the self-attention mechanism, we extend segmentation capabilities with additional classification options to a diverse class of objects including cyclists, traffic signs, and pedestrians across diverse weather conditions. Despite good performance, the models face challenges under adverse conditions which underscores the need for further optimization to enhance performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state-of-the-art in multimodal fusion and object segmentation, with ongoing efforts required to address existing limitations and fully harness their potential in practical deployments.

[CV-36] Synthetic Fungi Datasets: A Time-Aligned Approach

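[Quick Read]: This paper addresses the difficulty of studying the dynamic morphological changes in the fungal lifecycle, in particular the transition from spores to mature mycelium structures. Existing real-world fungal datasets are limited in temporal consistency and structural alignment, which makes it hard to capture key phenomena systematically, such as changes in spore size, branching dynamics, and the formation of complex mycelium networks. The authors therefore present a synthetic, time-aligned image dataset whose controlled generation process ensures temporal consistency, scalability, and structural alignment. The dataset is optimized for deep learning applications and supports growth-stage classification, prediction of fungal development, and time-series analysis of morphological patterns. It provides a solid foundation for automated fungal analysis, disease monitoring, and fungal biology research in agriculture, medicine, and industrial mycology.

Link: https://arxiv.org/abs/2501.02855
Authors: A. Rani, D. O. Arroyo, P. Durdevic
Affiliations: Aalborg University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, 1 table, 1 algorithm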

Abstract:Fungi undergo dynamic morphological transformations throughout their lifecycle, forming intricate networks as they transition from spores to mature mycelium structures. To support the study of these time-dependent processes, we present a synthetic, time-aligned image dataset that models key stages of fungal growth. This dataset systematically captures phenomena such as spore size reduction, branching dynamics, and the emergence of complex mycelium networks. The controlled generation process ensures temporal consistency, scalability, and structural alignment, addressing the limitations of real-world fungal datasets. Optimized for deep learning (DL) applications, this dataset facilitates the development of models for classifying growth stages, predicting fungal development, and analyzing morphological patterns over time. With applications spanning agriculture, medicine, and industrial mycology, this resource provides a robust foundation for automating fungal analysis, enhancing disease monitoring, and advancing fungal biology research through artificial intelligence.

[CV-37] Large Language Models for Video Surveillance Applications

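[Quick Read]: This paper addresses the difficulty of analyzing and managing the massive data volumes produced by the rapid growth of video content. Traditional video analysis methods usually offer generic summaries or limited action recognition and cannot meet the need for efficient analysis. The paper proposes a proof of concept based on Generative Artificial Intelligence (GenAI) that uses Vision Language Models to enhance the downstream video analysis process. The tool generates customized textual summaries from user-defined queries, providing focused insights within large volumes of video data. Compared with existing methods, it uses Vision Language Models to extract relevant information, improving the precision and efficiency of analysis. The method also produces textual summaries from large amounts of closed-circuit television (CCTV) footage that can be stored long-term in very little space, so users can quickly navigate and verify significant events without exhaustive manual review. Evaluations report accuracies of 80% and 70% for the temporal and spatial quality and consistency of the pipeline, respectively.

Link: https://arxiv.org/abs/2501.02850
Authors: Ulindu De Silva, Leon Fernando, Billy Lau Pik Lik, Zann Koh, Sam Conrad Joyce, Belinda Yuen, Chau Yuen
Affiliations: University of Moratuwa; SUTD (Singapore University of Technology and Design); Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for TENCON 2024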

Abstract:The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored for an indefinite time in a very small storage space compared to videos, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations result in 80% and 70% accuracy in temporal and spatial quality and consistency of the pipeline respectively.

[CV-38] HOGSA: Bimanual Hand-Object Interaction Understanding with 3D Gaussian Splatting Based Data Augmentation AAAI2025

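[Quick Read]: This paper addresses the difficulty of collecting and annotating high-quality, large-scale datasets for bimanual hand-object interaction. Because of significant occlusions between hands and objects and the high degree-of-freedom motions, existing datasets are limited in size, which holds back further improvement of related baselines. The paper proposes a data augmentation framework based on 3D Gaussian Splatting (3DGS) that expands existing datasets into large-scale photorealistic data by generating diverse hand-object poses and viewpoints. The key elements are: 1) modeling hands and objects with mesh-based 3DGS, with a super-resolution module to handle the rendering blur caused by multi-resolution input images; 2) extending a single-hand grasping pose optimization module to the bimanual case to generate diverse bimanual hand-object interaction poses, which significantly expands the pose distribution of the dataset; 3) analyzing how different aspects of the proposed augmentation affect the understanding of bimanual hand-object interaction. Experiments on two benchmarks, H2O and Arctic, verify that the method improves baseline performance.

Link: https://arxiv.org/abs/2501.02845
Authors: Wentian Qu, Jiahe Li, Jian Cheng, Jian Shi, Chenyu Meng, Cuixia Ma, Hongan Wang, Xiaoming Deng, Yinda Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025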

Abstract:Understanding of bimanual hand-object interaction plays an important role in robotics and virtual reality. However, due to significant occlusions between hands and object as well as the high degree-of-freedom motions, it is challenging to collect and annotate a high-quality, large-scale dataset, which prevents further improvement of bimanual hand-object interaction-related baselines. In this work, we propose a new 3D Gaussian Splatting based data augmentation framework for bimanual hand-object interaction, which is capable of augmenting existing dataset to large-scale photorealistic data with various hand-object pose and viewpoints. First, we use mesh-based 3DGS to model objects and hands, and to deal with the rendering blur problem due to multi-resolution input images used, we design a super-resolution module. Second, we extend the single hand grasping pose optimization module for the bimanual hand object to generate various poses of bimanual hand-object interaction, which can significantly expand the pose distribution of the dataset. Third, we conduct an analysis for the impact of different aspects of the proposed data augmentation on the understanding of the bimanual hand-object interaction. We perform our data augmentation on two benchmarks, H2O and Arctic, and verify that our method can improve the performance of the baselines.

[CV-39] Enhanced Rooftop Solar Panel Detection by Efficiently Aggregating Local Features

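[Quick Read]: This paper addresses the detection of rooftop solar photovoltaic (PV) panels in satellite images. The key to the solution is an enhanced approach built on a pre-trained convolutional neural network (CNN): local convolutional features of rooftops are extracted and then combined with the Vectors of Locally Aggregated Descriptors (VLAD) technique into rooftop-level global features. These global features are used to train traditional machine learning models that distinguish rooftop images with PV panels from those without. The paper also proposes a 3-phase approach that allows previously trained models to be used efficiently in a new city or region with limited labelled data. Experiments show that the approach achieves rooftop-PV classification scores above the predefined threshold of 0.9 across all three cities studied.

Link: https://arxiv.org/abs/2501.02840
Authors: Kuldeep Kurte, Kedar Kulkarni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at CODS-COMAD 2024, December, 2024, Jodhpur, India ( this https URL )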

Abstract:In this paper, we present an enhanced Convolutional Neural Network (CNN)-based rooftop solar photovoltaic (PV) panel detection approach using satellite images. We propose to use pre-trained CNN-based model to extract the local convolutional features of rooftops. These local features are then combined using the Vectors of Locally Aggregated Descriptors (VLAD) technique to obtain rooftop-level global features, which are then used to train traditional Machine Learning (ML) models to identify rooftop images that do and do not contain PV panels. On the dataset used in this study, the proposed approach achieved rooftop-PV classification scores exceeding the predefined threshold of 0.9 across all three cities for each of the feature extractor networks evaluated. Moreover, we propose a 3-phase approach to enable efficient utilization of the previously trained models on a new city or region with limited labelled data. We illustrate the effectiveness of this 3-phase approach for multi-city rooftop-PV detection task.
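
The VLAD aggregation step can be sketched as follows. This is a textbook-style version rather than the authors' code: the codebook `centers` is assumed to come from a clustering step such as k-means over training descriptors, and the signed square-root and L2 normalizations are common choices that the paper may or may not use.

```python
import numpy as np


def vlad(local_descriptors, centers):
    """Aggregate N local descriptors (N, D) into one vector of length K * D."""
    # Assign each descriptor to its nearest codebook center.
    d2 = ((local_descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)

    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        members = local_descriptors[assign == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)  # sum of residuals per center

    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))                # signed square-root normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Here `local_descriptors` would be the per-location CNN features of one rooftop image, and the resulting fixed-length vector is what a traditional classifier would be trained on.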

[CV-40] Universal Features Guided Zero-Shot Category-Level Object Pose Estimation AAAI2025

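[Quick Read]: This paper addresses the challenge that unseen categories pose for object pose estimation in computer vision and robotics applications. It proposes a zero-shot method for category-level 6-DOF object pose estimation. The key idea is to use both 2D and 3D universal features of the input RGB-D image to establish correspondences based on semantic similarity, so that the method extends to unseen categories without additional model fine-tuning. The core steps are as follows: first, 2D universal features are used to find sparse correspondences between intra-category objects and obtain an initial coarse pose; next, an iterative strategy optimizes the pose to handle the degradation of 2D feature correspondences when the pose deviates far from the target pose; finally, the coarse pose is refined with a dense alignment constraint on 3D universal features to resolve pose ambiguities caused by shape differences between intra-category objects. The method outperforms existing methods on unseen categories in the REAL275 and Wild6D benchmarks.

Link: https://arxiv.org/abs/2501.02831
Authors: Wentian Qu, Chenyu Meng, Heng Li, Jian Cheng, Cuixia Ma, Hongan Wang, Xiao Zhou, Xiaoming Deng, Ping Tan
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Tsinghua University; 4. Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025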

Abstract:Object pose estimation, crucial in computer vision and robotics applications, faces challenges with the diversity of unseen categories. We propose a zero-shot method to achieve category-level 6-DOF object pose estimation, which exploits both 2D and 3D universal features of input RGB-D image to establish semantic similarity-based correspondences and can be extended to unseen categories without additional model fine-tuning. Our method begins with combining efficient 2D universal features to find sparse correspondences between intra-category objects and gets initial coarse pose. To handle the correspondence degradation of 2D universal features if the pose deviates much from the target pose, we use an iterative strategy to optimize the pose. Subsequently, to resolve pose ambiguities due to shape differences between intra-category objects, the coarse pose is refined by optimizing with dense alignment constraint of 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.
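
Once sparse 3D-3D correspondences are available, a coarse rigid pose can be recovered in closed form. The sketch below uses the classical Kabsch (orthogonal Procrustes) solution as a stand-in; the abstract does not state which solver the authors use, and a category-level pipeline may also estimate scale, which this version omits.

```python
import numpy as np


def kabsch(source, target):
    """Least-squares rotation R and translation t with target ≈ R @ source + t.

    source, target: (N, 3) arrays of corresponding 3D points.
    """
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    H = (source - src_c).T @ (target - tgt_c)      # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t
```

An iterative scheme like the one described above would alternate between re-estimating correspondences under the current pose and re-solving for the pose.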

[CV-41] RDD4D: 4D Attention-Guided Road Damage Detection And Classification

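[Quick Read]: This paper addresses a key problem in road damage detection and assessment: existing methods perform poorly when detecting multiple types of road damage at varying scales in a single image. The root cause is the lack of road datasets that contain multiple damage types at varying scales. The paper makes two contributions to address this. First, the authors build a new dataset, the Diverse Road Damage Dataset (DRDD), which captures multiple types of road damage in individual images and fills a gap in existing datasets. Second, they propose a model, RDD4D, that uses Attention4D blocks; these blocks apply an attention mechanism that combines positional encoding with “Talking Head” components, refining features across multiple scales and better capturing local and global context. Experiments show that the model performs well in detecting large-sized road cracks, with an Average Precision (AP) of 0.458 and an overall AP of 0.445, and achieves an improvement of about 0.21 on the CrackTinyNet dataset.

Link: https://arxiv.org/abs/2501.02822
Authors: Asma Alkalbani, Muhammad Saqib, Ahmed Salim Alrawahi, Abbas Anwar, Chandarnath Adak, Saeed Anwar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: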

Abstract:Road damage detection and assessment are crucial components of infrastructure maintenance. However, current methods often struggle with detecting multiple types of road damage in a single image, particularly at varying scales. This is due to the lack of road datasets with various damage types having varying scales. To overcome this deficiency, first, we present a novel dataset called Diverse Road Damage Dataset (DRDD) for road damage detection that captures the diverse road damage types in individual images, addressing a crucial gap in existing datasets. Then, we provide our model, RDD4D, that exploits Attention4D blocks, enabling better feature refinement across multiple scales. The Attention4D module processes feature maps through an attention mechanism combining positional encoding and “Talking Head” components to capture local and global contextual information. In our comprehensive experimental analysis comparing various state-of-the-art models on our proposed dataset, our enhanced model demonstrated superior performance in detecting large-sized road cracks with an Average Precision (AP) of 0.458 and maintained competitive performance with an overall AP of 0.445. Moreover, we also provide results on the CrackTinyNet dataset; our model achieved around a 0.21 increase in performance. The code, model weights, dataset, and our results are available at this https URL.

[CV-42] InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models

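[Quick Read]: This paper addresses two main challenges in Image Inpainting Localization (IIL): existing methods tend to be overconfident, which leads to incorrect predictions, and they struggle to detect subtle tampering boundaries in inpainted images. The paper proposes a new paradigm that treats IIL as a conditional mask generation task based on diffusion models. The key elements are: 1) a denoising process enhanced with image semantic conditions that progressively refines predictions; 2) edge conditions and a novel edge supervision strategy applied during denoising to improve the model's perception of edge details in inpainted objects; 3) a balance between the stochastic sampling of the diffusion model and edge supervision of tampered regions, which reduces incorrect predictions caused by overconfidence and avoids the loss of subtle boundaries caused by overly stochastic processes. The paper also proposes a Dual-stream Multi-scale Feature Extractor (DMFE) that strengthens feature representation by considering both the semantic and edge conditions of inpainted images. Experiments show that the method significantly outperforms existing state-of-the-art methods on IIL tasks and exhibits excellent generalization and robustness.

Link: https://arxiv.org/abs/2501.02816
Authors: Kai Wang, Shaozhang Niu, Qixian Hao, Jiwei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: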

Abstract:As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, the accuracy of Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, leading to incorrect predictions; and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model’s perception of edge details in inpainted objects. Balancing the diffusion model’s stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose an innovative Dual-stream Multi-scale Feature Extractor (DMFE) for extracting multi-scale features, enhancing feature representation by considering both semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that the InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showcasing excellent generalization capabilities and robustness.

[CV-43] First-place Solution for Streetscape Shop Sign Recognition Competition

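[Quick Read]: This paper addresses text recognition for shop signs in street-view images, especially signs with complex designs and diverse text styles. The key to the solution is a multistage approach that combines multimodal feature fusion, large-scale self-supervised training, and a Transformer-based large model. The paper also employs techniques such as BoxDQN, which is based on reinforcement learning, and text rectification methods, which notably improve text recognition in complex urban environments. Comprehensive experiments validate the effectiveness of these methods.

Link: https://arxiv.org/abs/2501.02811
Authors: Bin Wang, Li Jing
Affiliations: Zhejiang Gongshang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report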

Abstract:Text recognition technology applied to street-view storefront signs is increasingly utilized across various practical domains, including map navigation, smart city planning analysis, and business value assessments in commercial districts. This technology holds significant research and commercial potential. Nevertheless, it faces numerous challenges. Street view images often contain signboards with complex designs and diverse text styles, complicating the text recognition process. A notable advancement in this field was introduced by our team in a recent competition. We developed a novel multistage approach that integrates multimodal feature fusion, extensive self-supervised training, and a Transformer-based large model. Furthermore, innovative techniques such as BoxDQN, which relies on reinforcement learning, and text rectification methods were employed, leading to impressive outcomes. Comprehensive experiments have validated the effectiveness of these methods, showcasing our potential to enhance text recognition capabilities in complex urban environments.

[CV-44] AE-NeRF: Augmenting Event-Based Neural Radiance Fields for Non-ideal Conditions and Larger Scene AAAI2025

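[Quick Read]: This paper addresses the challenges of learning Neural Radiance Fields (NeRF) from event camera data under non-ideal conditions, including non-uniform event sequences, noisy camera poses, and scenes of various scales. Existing methods rely on ideal conditions, such as uniform, high-quality event sequences and accurate camera poses, and focus mainly on object-level reconstruction, which limits their practical use. The proposed solution, AE-NeRF, exploits the density of event streams and jointly learns a pose correction module with an event-based NeRF (e-NeRF) framework, enabling robust 3D reconstruction from inaccurate camera poses. To generalize to larger scenes, the paper proposes hierarchical event distillation, in which a proposal e-NeRF network and a vanilla e-NeRF network resample and refine the reconstruction process. It also introduces an event reconstruction loss and a temporal loss to improve the view consistency of the reconstructed scene. Validated on a comprehensive benchmark that includes large-scale scenes, the method achieves a new state of the art in event-based 3D reconstruction.

Link: https://arxiv.org/abs/2501.02807
Authors: Chaoran Feng, Wangbo Yu, Xinhua Cheng, Zhenyu Tang, Junwu Zhang, Li Yuan, Yonghong Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025. this https URL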

Abstract:Compared to frame-based methods, computational neuromorphic imaging using event cameras offers significant advantages, such as minimal motion blur, enhanced temporal resolution, and high dynamic range. The multi-view consistency of Neural Radiance Fields combined with the unique benefits of event cameras, has spurred recent research into reconstructing NeRF from data captured by moving event cameras. While showing impressive performance, existing methods rely on ideal conditions with the availability of uniform and high-quality event sequences and accurate camera poses, and mainly focus on the object level reconstruction, thus limiting their practical applications. In this work, we propose AE-NeRF to address the challenges of learning event-based NeRF from non-ideal conditions, including non-uniform event sequences, noisy poses, and various scales of scenes. Our method exploits the density of event streams and jointly learn a pose correction module with an event-based NeRF (e-NeRF) framework for robust 3D reconstruction from inaccurate camera poses. To generalize to larger scenes, we propose hierarchical event distillation with a proposal e-NeRF network and a vanilla e-NeRF network to resample and refine the reconstruction process. We further propose an event reconstruction loss and a temporal loss to improve the view consistency of the reconstructed scene. We established a comprehensive benchmark that includes large-scale scenes to simulate practical non-ideal conditions, incorporating both synthetic and challenging real-world event datasets. The experimental results show that our method achieves a new state-of-the-art in event-based 3D reconstruction.

[CV-45] COph100: A comprehensive fundus image registration dataset from infants constituting the “RIDIRP” database

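[Quick Read]: This paper addresses the limited number of image pairs in existing retinal image registration datasets and their neglect of clinical challenges, particularly in infant ophthalmology. Existing public datasets focus on adult retinal pathologies with high-quality images, and research resources for infant retinal images are lacking. The authors therefore introduce COph100, a comprehensive ophthalmology retinal image registration dataset that contains 100 eyes and 491 image pairs and covers a wide range of image quality issues. The key steps in building the dataset are: 1) carefully selecting image pairs from the public “RIDIRP” database; 2) manually labeling ground-truth image points; 3) providing an automatic vessel segmentation mask for each image. With these, COph100 supports the evaluation of advanced registration algorithms and aids the analysis of disease progression in infants, deepening the understanding of pediatric ophthalmic conditions.

Link: https://arxiv.org/abs/2501.02800
Authors: Yan Hu, Mingdao Gong, Zhongxi Qiu, Jiabao Liu, Hongli Shen, Mingzhen Yuan, Xiaoqing Zhang, Heng Li, Hai Lu, Jiang Liu
Affiliations: Southern University of Science and Technology; Beijing Tongren Hospital, Capital Medical University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
Comments: 12 pages, 7 figures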

Abstract:Retinal image registration is vital for diagnostic therapeutic applications within the field of ophthalmology. Existing public datasets, focusing on adult retinal pathologies with high-quality images, have limited number of image pairs and neglect clinical challenges. To address this gap, we introduce COph100, a novel and challenging dataset known as the Comprehensive Ophthalmology Retinal Image Registration dataset for infants with a wide range of image quality issues constituting the public “RIDIRP” database. COph100 consists of 100 eyes, each with 2 to 9 examination sessions, amounting to a total of 491 image pairs carefully selected from the publicly available dataset. We manually labeled the corresponding ground truth image points and provided automatic vessel segmentation masks for each image. We have assessed COph100 in terms of image quality and registration outcomes using state-of-the-art algorithms. This resource enables a robust comparison of retinal registration methodologies and aids in the analysis of disease progression in infants, thereby deepening our understanding of pediatric ophthalmic conditions.

[CV-46] GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic Features for Medical Image Segmentation

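[Quick read]: This paper addresses the difficulty Vision Transformers (ViTs) have in modeling local spatial information for medical image semantic segmentation (MISS), particularly on small datasets without extensive pre-training. The key to the solution is a novel architecture, the Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG-CSUnet), which enhances Transformer-based models with learnable radiomic features. Specifically, GLoG-CSUnet integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, improving the feature representation processed by the Transformer. The method combines the long-range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features, improving performance on the Synapse multi-organ and ACDC cardiac segmentation datasets while adding very little computational overhead.

Link: https://arxiv.org/abs/2501.02788
Authors: Niloufar Eghbali, Hassan Bagher-Ebadian, Tuka Alhanai, Mohammad M. Ghassemi
Affiliations: Michigan State University; Henry Ford Health; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: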

Abstract:Vision Transformers (ViTs) have shown promise in medical image semantic segmentation (MISS) by capturing long-range correlations. However, ViTs often struggle to model local spatial information effectively, which is essential for accurately segmenting fine anatomical details, particularly when applied to small datasets without extensive pre-training. We introduce Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture enhancing Transformer-based models by incorporating learnable radiomic features. This approach integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, enhancing the feature representation processed by the Transformer model. Our method uniquely combines the long-range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features. Evaluated on the Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet demonstrates significant improvements over state-of-the-art models, achieving a 1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal computational overhead (only 15 and 30 additional parameters, respectively). GLoG-CSUnet’s flexible design allows integration with various base models, offering a promising approach for incorporating radiomics-inspired feature extraction in Transformer architectures for medical image analysis. The code implementation is available on GitHub at: this https URL.
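
For readers unfamiliar with the two filter families, the sketch below builds fixed Gabor and LoG kernels from their textbook formulas. It is not the GLoG-CSUnet implementation: the abstract does not say which filter parameters are learned, so `sigma`, `theta`, `lambd`, `gamma` and `psi` are plain arguments here, and the function names are assumptions made for illustration.

```python
import math

def gabor_kernel(size, sigma, theta, lambd, gamma=1.0, psi=0.0):
    """Real part of a Gabor filter on an odd size x size grid."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / lambd + psi))
        kernel.append(row)
    return kernel

def log_kernel(size, sigma):
    """Laplacian of Gaussian sampled on an odd size x size grid."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            value = -(1 - r2 / (2 * sigma ** 2)) * math.exp(-r2 / (2 * sigma ** 2))
            row.append(value / (math.pi * sigma ** 4))
        kernel.append(row)
    return kernel
```

Each kernel is controlled by a handful of scalars, so a layer whose filters are generated this way adds only a few trainable values if those scalars are learned.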

[CV-47] CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

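[Quick read]: This paper addresses the tendency of existing models to overfit to specific room environments and lose fine-grained spatial details when converting monaural audio to binaural audio. The key to the solution is a new audio-visual conditional normalisation layer that uses visual context to dynamically align the mean and variance of the target difference audio features, strengthening the model's grasp of spatial information. The paper also proposes a new contrastive learning method that enhances spatial sensitivity by mining negative samples from shuffled visual features. Finally, it introduces a cost-efficient way to apply test-time augmentation to video data to further improve generation performance. With these contributions, the model achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.

Link: https://arxiv.org/abs/2501.02786
Authors: Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji
Affiliations: Australian Institute for Machine Learning, University of Adelaide; Sony AI; Sony Group Corporation
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: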

Abstract:Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
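
The general idea behind a conditional normalisation layer can be sketched in a few lines: standardise the features, then rescale and shift them with values that depend on a conditioning signal. In the sketch below `scale` and `shift` are passed in directly; in a model like the one described they would presumably be predicted from visual features, but that mapping and the function name are assumptions, not the paper's design.

```python
import math

def conditional_normalise(features, scale, shift, eps=1e-5):
    """Standardise a 1-D feature vector, then apply a conditioning scale and shift.

    scale and shift stand in for values that a (hypothetical) visual branch
    would output for the current video frame.
    """
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    return [scale * (f - mean) / math.sqrt(var + eps) + shift for f in features]
```

After the call, the output has mean close to `shift` and standard deviation close to `scale`, which is what "aligning the mean and variance" amounts to in this simplified form.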

[CV-48] Hybrid deep convolution model for lung cancer detection with transfer learning

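[Quick read]: This paper addresses the challenge of early and accurate lung cancer diagnosis. Although existing lung cancer detection models show promise, there remains considerable room to improve diagnostic accuracy for timely intervention. The authors propose a hybrid deep convolution model based on transfer learning, named the Maximum Sensitivity Neural Network (MSNN), which improves the precision of lung cancer detection by refining sensitivity and specificity. Experimental validation shows that MSNN surpasses existing deep learning approaches, with an accuracy of 98% and a sensitivity of 97%. By overlaying sensitivity maps on lung CT scans, MSNN visualizes the regions most indicative of malignant or benign classifications. The method performs well at distinguishing lung cancer with few false positives, improving the accuracy of medical diagnosis.

Link: https://arxiv.org/abs/2501.02785
Authors: Sugandha Saxena, S. N. Prasad, Ashwin M Polnaya, Shweta Agarwala
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 8 figures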

Abstract:Advances in healthcare research have significantly enhanced our understanding of disease mechanisms, diagnostic precision, and therapeutic options. Yet, lung cancer remains one of the leading causes of cancer-related mortality worldwide due to challenges in early and accurate diagnosis. While current lung cancer detection models show promise, there is considerable potential for further improving the accuracy for timely intervention. To address this challenge, we introduce a hybrid deep convolution model leveraging transfer learning, named the Maximum Sensitivity Neural Network (MSNN). MSNN is designed to improve the precision of lung cancer detection by refining sensitivity and specificity. This model has surpassed existing deep learning approaches through experimental validation, achieving an accuracy of 98% and a sensitivity of 97%. By overlaying sensitivity maps onto lung Computed Tomography (CT) scans, it enables the visualization of regions most indicative of malignant or benign classifications. This innovative method demonstrates exceptional performance in distinguishing lung cancer with minimal false positives, thereby enhancing the accuracy of medical diagnoses.
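
Accuracy, sensitivity and specificity are easy to confuse, so here are their standard definitions in terms of confusion-matrix counts. The counts in the usage note are invented purely for illustration and are not data from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives that are detected
    specificity = tn / (tn + fp)  # share of actual negatives that are rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy
```

For example, with 90 true positives, 10 false negatives, 80 true negatives and 20 false positives, `classification_metrics(90, 20, 80, 10)` gives a sensitivity of 0.9, a specificity of 0.8 and an accuracy of 0.85.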

[CV-49] Unsupervised Domain Adaptation for Occlusion Resilient Human Pose Estimation

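[Quick read]: This paper addresses the performance degradation that occlusion causes in human pose estimation, in particular the poor performance of existing domain adaptation algorithms when target-domain images are occluded. The key to the solution is the OR-POSE algorithm, which rests on three ideas. First, it uses the mean teacher framework for iterative pseudo-label refinement, which mitigates domain shift. Second, it uses a learned human pose prior to make pose predictions more plausible and consistent with human anatomical constraints. Third, it uses a visibility-based curriculum learning strategy that moves gradually from less occluded training samples to heavily occluded ones, so the model does not overfit to inaccurate pseudo-labels. Experiments show that OR-POSE outperforms existing state-of-the-art algorithms by about 7% on challenging occluded human pose estimation datasets.

Link: https://arxiv.org/abs/2501.02773
Authors: Arindam Dutta, Sarosij Bose, Saketh Bachu, Calvin-Khang Ta, Konstantinos Karydis, Amit K. Roy-Chowdhury
Affiliations: Department of Electrical and Computer Engineering, University of California Riverside, USA; Department of Computer Science and Engineering, University of California Riverside, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures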

Abstract:Occlusions are a significant challenge to human pose estimation algorithms, often resulting in inaccurate and anatomically implausible poses. Although current occlusion-robust human pose estimation algorithms exhibit impressive performance on existing datasets, their success is largely attributed to supervised training and the availability of additional information, such as multiple views or temporal continuity. Furthermore, these algorithms typically suffer from performance degradation under distribution shifts. While existing domain adaptive human pose estimation algorithms address this bottleneck, they tend to perform suboptimally when the target domain images are occluded, a common occurrence in real-life scenarios. To address these challenges, we propose OR-POSE: Unsupervised Domain Adaptation for Occlusion Resilient Human POSE Estimation. OR-POSE is an innovative unsupervised domain adaptation algorithm which effectively mitigates domain shifts and overcomes occlusion challenges by employing the mean teacher framework for iterative pseudo-label refinement. Additionally, OR-POSE reinforces realistic pose prediction by leveraging a learned human pose prior which incorporates the anatomical constraints of humans in the adaptation process. Lastly, OR-POSE avoids overfitting to inaccurate pseudo labels generated from heavily occluded images by employing a novel visibility-based curriculum learning approach. This enables the model to gradually transition from training samples with relatively less occlusion to more challenging, heavily occluded samples. Extensive experiments show that OR-POSE outperforms existing analogous state-of-the-art algorithms by approximately 7% on challenging occluded human pose estimation datasets.
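
Two of the ingredients named here have simple generic forms. The sketch below shows the usual mean teacher weight update (an exponential moving average of student weights) and one possible visibility-based curriculum that admits the most visible samples first. The momentum value, the linear pacing rule and both function names are assumptions for illustration; the abstract does not give OR-POSE's actual settings.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """New teacher weights as an exponential moving average of student weights."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

def curriculum_subset(visibility, progress):
    """Indices of samples admitted at a training progress in [0, 1].

    Samples are ranked from most to least visible, and the admitted share
    grows linearly with progress (an assumed pacing schedule).
    """
    order = sorted(range(len(visibility)), key=lambda i: -visibility[i])
    count = max(1, math.ceil(progress * len(order)))
    return order[:count]
```

The teacher changes slowly, which tends to stabilise the pseudo-labels it produces, while the curriculum delays heavily occluded samples until later in training.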

[CV-50] WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation

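[Quick read]: This paper addresses multi-person global pose estimation in complex, in-the-wild environments, specifically multi-view capture and 3D pose reconstruction at large-scale sporting events. Existing datasets are mostly limited to single-person poses or indoor settings and are ill-suited to multi-person pose estimation in complex scenes. The key to the solution is to use the multi-view HD camera infrastructure deployed during the 2022 World Cup to capture players' 3D poses and motions over areas of more than 1.75 acres, and to calibrate the moving broadcast camera with the help of field markings. The resulting WorldPose dataset contains more than 80 sequences, about 2.5 million 3D poses, and a total travelling distance of over 120 km, opening new challenges and opportunities for research on multi-person pose estimation in complex environments.

Link: https://arxiv.org/abs/2501.02771
Authors: Tianjian Jiang, Johsan Billingham, Sebastian Müksch, Juan Zarate, Nicolas Evans, Martin R. Oswald, Marc Polleyfeys, Otmar Hilliges, Manuel Kaufmann, Jie Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: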

Abstract:We present WorldPose, a novel dataset for advancing research in multi-person global pose estimation in the wild, featuring footage from the 2022 FIFA World Cup. While previous datasets have primarily focused on local poses, often limited to a single person or in constrained, indoor settings, the infrastructure deployed for this sporting event allows access to multiple fixed and moving cameras in different stadiums. We exploit the static multi-view setup of HD cameras to recover the 3D player poses and motions with unprecedented accuracy given capture areas of more than 1.75 acres. We then leverage the captured players’ motions and field markings to calibrate a moving broadcasting camera. The resulting dataset comprises more than 80 sequences with approx 2.5 million 3D poses and a total traveling distance of over 120 km. Subsequently, we conduct an in-depth analysis of the SOTA methods for global pose estimation. Our experiments demonstrate that WorldPose challenges existing multi-person techniques, supporting the potential for new research in this area and others, such as sports analysis. All pose annotations (in SMPL format), broadcasting camera parameters and footage will be released for academic research purposes.

[CV-51] Visual Large Language Models for Generalized and Specialized Applications

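[Quick read]: This paper addresses a gap in the literature on visual large language models (VLLMs): the lack of a comprehensive, application-oriented view covering generalized and specialized applications across vision (image, video, depth), action, and language modalities. The survey examines the diverse application scenarios of VLLMs, identifies ethical considerations and challenges, and discusses future directions, aiming to provide a comprehensive guide for future innovation and broader application of VLLMs. The key to the approach is a systematic review of existing VLLM applications, an analysis of how they perform in different scenarios, and a set of proposed research directions and potential challenges.

Link: https://arxiv.org/abs/2501.02765
Authors: Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong
Affiliations: Michigan State University; Cornell University; Arizona State University; University of California, Berkeley; University of Texas at Austin; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: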

Abstract:Visual-language models (VLMs) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining increasing attention for building general-purpose VLMs. Despite the significant progress made in VLLMs, the related literature remains limited, particularly from a comprehensive application perspective, encompassing generalized and specialized applications across vision (image, video, depth), action, and language modalities. In this survey, we focus on the diverse applications of VLLMs, examining their usage scenarios, identifying ethical considerations and challenges, and discussing future directions for their development. By synthesizing these contents, we aim to provide a comprehensive guide that will pave the way for future innovations and broader applications of VLLMs. The paper list repository is available: this https URL.

[CV-52] LDMapNet-U: An End-to-End System for City-Scale Lane-Level Map Updating KDD2025

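[Quick read]: This paper addresses the bottleneck that reliance on manual annotation creates when updating city-scale lane-level maps. Traditional methods follow a three-stage pipeline (construction, change detection, and updating), but accuracy limitations usually require manual verification, which makes updating slow and labor-intensive. The paper proposes LDMapNet-U, which recasts map updating as an end-to-end map generation process grounded in historical map data, so that vectorized map generation and change detection happen simultaneously. Its key innovations are a Prior-Map Encoding (PME) module, which encodes historical maps as a reference for change detection, and an Instance Change Prediction (ICP) module, which predicts associations with the historical map. The approach shortens the map update cycle from quarterly to weekly and has been deployed at Baidu Maps, supporting updates for more than 360 cities.

Link: https://arxiv.org/abs/2501.02763
Authors: Deguo Xia, Weiming Zhang, Xiyan Liu, Wei Zhang, Chenting Gong, Xiao Tan, Jizhou Huang, Mengmeng Yang, Diange Yang
Affiliations: Tsinghua University; Baidu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by KDD 2025, camera-ready version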

Abstract:An up-to-date city-scale lane-level map is an indispensable infrastructure and a key enabling technology for ensuring the safety and user experience of autonomous driving systems. In industrial scenarios, reliance on manual annotation for map updates creates a critical bottleneck. Lane-level updates require precise change information and must ensure consistency with adjacent data while adhering to strict standards. Traditional methods utilize a three-stage approach (construction, change detection, and updating), which often necessitates manual verification due to accuracy limitations. This results in labor-intensive processes and hampers timely updates. To address these challenges, we propose LDMapNet-U, which implements a new end-to-end paradigm for city-scale lane-level map updating. By reconceptualizing the update task as an end-to-end map generation process grounded in historical map data, we introduce a paradigm shift in map updating that simultaneously generates vectorized maps and change information. To achieve this, a Prior-Map Encoding (PME) module is introduced to effectively encode historical maps, serving as a critical reference for detecting changes. Additionally, we incorporate a novel Instance Change Prediction (ICP) module that learns to predict associations with historical maps. Consequently, LDMapNet-U simultaneously achieves vectorized map element generation and change detection. To demonstrate the superiority and effectiveness of LDMapNet-U, extensive experiments are conducted using large-scale real-world datasets. In addition, LDMapNet-U has been successfully deployed in production at Baidu Maps since April 2024, supporting map updating for over 360 cities and significantly shortening the update cycle from quarterly to weekly. The updated maps serve hundreds of millions of users and are integrated into the autonomous driving systems of several leading vehicle companies.

[CV-53] Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising ICASSP2025

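[Quick read]: This paper addresses the challenges existing text-driven video generation methods face when generating long videos: training demands heavy computation and large amounts of data, and existing training-free methods suffer from insufficient motion dynamics and degraded video quality. The key to the solution is Brick-Diffusion, a training-free method built on a brick-to-wall denoising strategy: the latent is denoised in segments, and a stride is applied in subsequent iterations, mimicking the construction of a staggered brick wall. Each "brick" is a denoised segment, and the staggering lets information pass between frames, improving overall video quality. Experiments show that Brick-Diffusion outperforms existing baselines at generating high-fidelity videos.

Link: https://arxiv.org/abs/2501.02741
Authors: Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Hang Xu, Li Zhang
Affiliations: School of Data Science, Fudan University; Noah’s Ark Lab, Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICASSP 2025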

Abstract:Recent advances in diffusion models have greatly improved text-driven video generation. However, training models for long video generation demands significant computational power and extensive data, leading most video diffusion models to be limited to a small number of frames. Existing training-free methods that attempt to generate long videos using pre-trained short video diffusion models often struggle with issues such as insufficient motion dynamics and degraded video fidelity. In this paper, we present Brick-Diffusion, a novel, training-free approach capable of generating long videos of arbitrary length. Our method introduces a brick-to-wall denoising strategy, where the latent is denoised in segments, with a stride applied in subsequent iterations. This process mimics the construction of a staggered brick wall, where each brick represents a denoised segment, enabling communication between frames and improving overall video quality. Through quantitative and qualitative evaluations, we demonstrate that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity videos.
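
The staggered-wall picture can be made concrete with a small indexing sketch. It only computes which frame ranges would be denoised together at a given iteration; the half-brick offset on odd iterations and the function name are assumptions chosen to illustrate the idea, since the abstract says only that a stride is applied in subsequent iterations.

```python
def brick_segments(num_frames, brick_len, iteration):
    """Frame-index ranges (start, end) denoised together at one iteration.

    Odd iterations are shifted by half a brick (an assumed stride) so that
    segment boundaries are staggered like courses of bricks in a wall.
    """
    offset = (brick_len // 2) if iteration % 2 else 0
    segments = []
    start = -offset
    while start < num_frames:
        lo, hi = max(start, 0), min(start + brick_len, num_frames)
        if hi > lo:
            segments.append((lo, hi))
        start += brick_len
    return segments
```

With 8 frames and bricks of length 4, even iterations use the ranges (0, 4) and (4, 8), while odd iterations use (0, 2), (2, 6) and (6, 8). Frames 3 and 4 sit in different bricks on even iterations but share one on odd iterations, which is how information can flow across segment boundaries.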

[CV-54] Interpretable Recognition of Fused Magnesium Furnace Working Conditions with Deep Convolutional Stochastic Configuration Networks

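[Quick read]: This paper addresses the weak generalization and poor interpretability of working condition recognition models for fused magnesium furnaces. The key to the solution is an interpretable working condition recognition method based on deep convolutional stochastic configuration networks (DCSCNs). First, a supervised learning mechanism generates physically meaningful Gaussian differential convolution kernels, and the DCSCNs model is built incrementally, which ensures hierarchical convergence of the recognition error and avoids iteratively optimizing the kernel parameters with the widely used backpropagation algorithm. Second, an independent coefficient of channel feature maps is defined to obtain visualizations of the feature class activation maps for the fused magnesium furnace. Finally, a joint reward function is built from recognition accuracy, interpretability trustworthiness metrics, and model parameter count, and reinforcement learning is used to adaptively prune the convolution kernels of the DCSCNs model, yielding a compact, high-performing, and interpretable network. Experiments show that the method outperforms other deep learning approaches in recognition accuracy and interpretability.

Link: https://arxiv.org/abs/2501.02740
Authors: Li Weitao, Zhang Xinru, Wang Dianhui, Tong Qianqian, Chai Tianyou
Affiliations: Hefei University of Technology; China University of Mining and Technology; Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: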

Abstract:To address the weak generalization capability and poor interpretability of working condition recognition models for fused magnesium furnaces, this paper proposes an interpretable working condition recognition method based on deep convolutional stochastic configuration networks (DCSCNs). Firstly, a supervised learning mechanism is employed to generate physically meaningful Gaussian differential convolution kernels. An incremental method is utilized to construct a DCSCNs model, ensuring the convergence of recognition errors in a hierarchical manner and avoiding the iterative optimization process of convolutional kernel parameters using the widely used backpropagation algorithm. The independent coefficient of channel feature maps is defined to obtain the visualization results of feature class activation maps for the fused magnesium furnace. A joint reward function is constructed based on the recognition accuracy, the interpretable trustworthiness evaluation metrics, and the model parameter quantity. Reinforcement learning (RL) is applied to adaptively prune the convolutional kernels of the DCSCNs model, aiming to build a compact, high-performing and interpretable network. The experimental results demonstrate that the proposed method outperforms the other deep learning approaches in terms of recognition accuracy and interpretability.

[CV-55] Holistic Semantic Representation for Navigational Trajectory Generation AAAI2025

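[Quick read]: This paper addresses the lack of cross-scale semantic understanding in existing methods for generating human mobility trajectories. Existing methods usually improve generation quality from a single perspective and do not take semantic information at different scales into account. The authors propose HOSER (HOlistic SEmantic Representation), a framework for navigational trajectory generation with three core components. First, a Road Network Encoder expands the receptive field of road- and zone-level semantics. Second, a Multi-Granularity Trajectory Encoder integrates the spatio-temporal semantics of the generated trajectory at the point and trajectory levels. Third, a Destination-Oriented Navigator integrates destination-oriented guidance. Experiments show that HOSER significantly outperforms existing methods on several real-world datasets, and results in few-shot and zero-shot settings further support the effectiveness of its holistic semantic representation.

Link: https://arxiv.org/abs/2501.02737
Authors: Ji Cao, Tongya Zheng, Qinghong Guo, Yu Wang, Junshu Dai, Shunyu Liu, Jie Yang, Jie Song, Mingli Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025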

Abstract:Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model’s performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.

[CV-56] Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment

[Quick read]: This paper addresses the challenge of AI-generated video quality assessment. Although rapid progress in diffusion models has markedly improved the length and consistency of generated videos, evaluating AI-generated videos remains difficult. Most existing methods target user-generated content (UGC), and few address quality assessment for AI-generated video. The paper proposes MSA-VQA (Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment), which assesses video quality using CLIP-based semantic supervision and cross-attention mechanisms. A hierarchical framework analyzes video content at the frame, segment, and video levels. A Prompt Semantic Supervision Module ensures semantic consistency between the video and its conditional prompt, and a Semantic Mutation-aware Module captures subtle variations between frames. Experiments show that the method achieves state-of-the-art results on AI-generated video quality assessment.

Link: https://arxiv.org/abs/2501.02706
Authors: Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang
Affiliations: Zhejiang University; University of Chinese Academy of Sciences; School of Computer Science and Technology, Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid development of diffusion models has recently greatly advanced AI-generated videos in terms of length and consistency, yet assessing AI-generated videos remains challenging. Previous approaches have often focused on User-Generated Content (UGC), but few have targeted AI-Generated Video Quality Assessment methods. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module using the text encoder of CLIP to ensure semantic consistency between videos and conditional prompts. Additionally, we propose the Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate our method achieves state-of-the-art results.

[CV-57] Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis

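[Quick read]: This paper addresses the color distortion and reduced clarity that light-water interactions cause in underwater imaging. The authors propose GuidedHybSensUIR, a hybrid-sense underwater image restoration framework guided by a color balance prior. The framework operates at multiple scales: a Detail Restorer module restores low-level detailed features at finer scales, while a Feature Contextualizer module captures long-range contextual relations of high-level features at a broader scale. A new Color Balance Prior acts as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase, steering the direction in which the model evolves. The authors build a benchmark with paired training data from three real-world underwater datasets and evaluate on six test sets. Experiments show that the method outperforms 37 existing state-of-the-art methods across multiple benchmark datasets and metrics.

Link: https://arxiv.org/abs/2501.02701
Authors: Xiaojiao Guo, Xuhang Chen, Shuqiang Wang, Chi-Man Pun
Affiliations: University of Macau; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Big Data, Baoshan University; School of Computer Science and Engineering, Huizhou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TCSVT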

Abstract:Underwater imaging grapples with challenges from light-water interactions, leading to color distortions and reduced clarity. In response to these challenges, we propose a novel Color Balance Prior Guided Hybrid Sense Underwater Image Restoration framework (GuidedHybSensUIR). This framework operates on multiple scales, employing the proposed Detail Restorer module to restore low-level detailed features at finer scales and utilizing the proposed Feature Contextualizer module to capture long-range contextual relations of high-level general features at a broader scale. The hybridization of these different scales of sensing results effectively addresses color casts and restores blurry details. In order to effectively point out the evolutionary direction for the model, we propose a novel Color Balance Prior as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase. We construct a comprehensive benchmark using paired training data from three real-world underwater datasets and evaluate on six test sets, including three paired and three unpaired, sourced from four real-world underwater datasets. Subsequently, we tested 14 traditional and retrained 23 deep learning existing underwater image restoration methods on this benchmark, obtaining metric results for each approach. This effort aims to furnish a valuable benchmarking dataset as a standard basis for comparison. The extensive experiment results demonstrate that our method outperforms 37 other state-of-the-art methods overall on various benchmark datasets and metrics, despite not achieving the best results in certain individual cases. The code and dataset are available at this https URL.

[CV-58] EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models

[Quick read]: This paper addresses hallucinations in multi-modal architectures, that is, generated responses that deviate from the ground truth in the image data. The proposed solution, EAGLE, works by directly enhancing the visual component. Specifically, EAGLE reformulates the original contrastive pre-training task to improve the grounding and language alignment of the visual encoder, without additional instructional training. The approach is agnostic to the large language model (LLM) and the fusion module, and as a post-pretraining method it significantly reduces hallucinations across multiple challenging benchmarks and tasks.

Link: https://arxiv.org/abs/2501.02699
Authors: Andrés Villa, Juan León Alcázar, Motasem Alfarra, Vladimir Araujo, Alvaro Soto, Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology (KAUST); Sailplane AI; Pontificia Universidad Católica de Chile
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 8 tables

Abstract:Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM or fusion module and works as a post-pretraining approach that improves the grounding and language alignment of the visual encoder. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training. As a result, EAGLE achieves a significant reduction in hallucinations across multiple challenging benchmarks and tasks.

[CV-59] GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

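[Quick read]: This paper addresses the inability of existing video generation methods to support sophisticated lens techniques such as multi-camera shooting and dolly zoom, which require 4D video control. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view video data. The paper instead proposes a framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders it for all video frames. A pretrained DiT is then fine-tuned to generate videos that follow the rendered video as guidance; the resulting model is called GS-DiT. To build the pseudo 4D Gaussian field, the authors propose an efficient Dense 3D Point Tracking (D3D-PT) method, which is more accurate than SpatialTracker, the state-of-the-art sparse 3D point tracking method, and two orders of magnitude faster at inference. At inference time, GS-DiT can generate videos with the same dynamic content under different camera parameters, addressing an important limitation of current video generation models. GS-DiT also generalizes well and extends the 4D controllability of Gaussian splatting to video generation, supporting advanced cinematic effects through manipulation of the Gaussian field and camera intrinsics.

Link: https://arxiv.org/abs/2501.02690
Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
Affiliations: Multimedia Laboratory, The Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence; Avolution AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL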

Abstract:4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at this https URL.

[CV-60] Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical Cross-Attention Using Graph-Based and Vision Backbone-Extracted Features CVPR

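[Quick read]: This paper addresses the prediction of plant species composition in specific spatiotemporal contexts, which matters for biodiversity management and conservation and for improving species identification tools. The study uses 88,987 plant survey records from across Europe, together with the corresponding satellite images, time series data, climate time series, and other rasterized environmental data (such as land cover, human footprint, bioclimatic, and soil variables) as training data, to predict the outcomes of 4,716 plant surveys. The key to the solution is a graph-structure-based method for feature construction and result correction, with the best-performing backbone networks for the temporal and image modalities chosen through comparative experiments. Specifically, the authors build a backbone based on the Swin-Transformer Block to extract temporal cube features and design a hierarchical cross-attention mechanism that robustly fuses multi-modal features. Training uses a 10-fold cross-fusion method based on fine-tuning, and a Threshold Top-K method is used for post-processing. Ablation experiments confirm that the proposed pipeline improves model performance.

Link: https://arxiv.org/abs/2501.02649
Authors: Haixu Liu, Penghao Jiang, Zerui Tao, Muyan Wan, Qiuzhuang Sun
Affiliations: The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR GeolifeCLEF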

Abstract:Predicting plant species composition in specific spatiotemporal contexts plays an important role in biodiversity management and conservation, as well as in improving species identification tools. Our work utilizes 88,987 plant survey records conducted in specific spatiotemporal contexts across Europe. We also use the corresponding satellite images, time series data, climate time series, and other rasterized environmental data such as land cover, human footprint, bioclimatic, and soil variables as training data to train the model to predict the outcomes of 4,716 plant surveys. We propose a feature construction and result correction method based on the graph structure. Through comparative experiments, we select the best-performing backbone networks for feature extraction in both temporal and image modalities. In this process, we built a backbone network based on the Swin-Transformer Block for extracting temporal Cubes features. We then design a hierarchical cross-attention mechanism capable of robustly fusing features from multiple modalities. During training, we adopt a 10-fold cross-fusion method based on fine-tuning and use a Threshold Top-K method for post-processing. Ablation experiments demonstrate the improvements in model performance brought by our proposed solution pipeline.
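
The abstract names a Threshold Top-K post-processing step without defining it. One plausible reading, sketched below, is to keep every species whose predicted probability reaches a threshold and cap the result at the K most likely ones. This interpretation, the function name and the example values are assumptions; the paper may combine the two criteria differently.

```python
def threshold_top_k(probs, threshold, k):
    """Species with probability at least `threshold`, capped at the k most likely.

    probs maps a species identifier to its predicted probability.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return [name for name, p in ranked if p >= threshold][:k]
```

For a multi-label task such as listing all species at a survey site, the threshold controls how confident a prediction must be, while K bounds the length of the predicted list.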
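
The abstract names a Threshold Top-K post-processing step without defining it. The sketch below shows one plausible reading, stated as an assumption: keep the classes whose probability reaches a threshold, cap the selection at `k`, and fall back to the most likely class when nothing passes. The function name, parameters, and default values are hypothetical and not taken from the paper.

```python
def threshold_top_k(probs, threshold=0.5, k=3, min_keep=1):
    """Return indices of predicted classes, most probable first.

    Assumed rule (not from the paper): keep classes with probability
    >= threshold, keep at most k of them, and fall back to the
    min_keep most probable classes if none reach the threshold.
    """
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [i for i in ranked if probs[i] >= threshold][:k]
    if len(kept) < min_keep:
        kept = ranked[:min_keep]
    return kept


scores = [0.10, 0.80, 0.55, 0.62, 0.05, 0.71]
print(threshold_top_k(scores, threshold=0.5, k=3))  # [1, 5, 3]
print(threshold_top_k(scores, threshold=0.9, k=3))  # [1]
```

The cap keeps the predicted species list short when many classes are moderately likely, while the fallback avoids returning an empty prediction for a survey.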

[CV-61] Multispectral Pedestrian Detection with Sparsely Annotated Label

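Quick read: This paper addresses multispectral pedestrian detection under sparse annotation, a Sparsely Annotated Object Detection (SAOD) setting. Existing methods have two main limitations when only some pedestrians are annotated in the multispectral domain: (1) they do not consider improving the quality of pseudo-labels for missing annotations; (2) they rely on fixed ground-truth annotations, so the model learns only a limited range of pedestrian visual appearances. To address these, the paper proposes a new framework, Sparsely Annotated Multispectral Pedestrian Detection (SAMPD). Its key components are: (1) Multispectral Pedestrian-aware Adaptive Weight (MPAW) and Positive Pseudo-label Enhancement (PPE) modules, which use multispectral knowledge to generate high-quality pseudo-labels and increase the weights of high-quality pseudo-labels according to modality characteristics; (2) an Adaptive Pedestrian Retrieval Augmentation (APRA) module, which adaptively incorporates pedestrian patches from the ground truth and dynamically combines high-quality pseudo-labels with the ground truth, broadening the diversity of pedestrian samples. Experiments show that SAMPD significantly improves detection performance in sparsely annotated multispectral settings.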

Link: https://arxiv.org/abs/2501.02640
Authors: Chan Lee, Seungho Shin, Gyeong-Moon Park, Jung Uk Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Although existing Sparsely Annotated Object Detection (SAOD) approaches have made progress in handling sparsely annotated environments in the multispectral domain, where only some pedestrians are annotated, they still have the following limitations: (i) they lack considerations for improving the quality of pseudo-labels for missing annotations, and (ii) they rely on fixed ground truth annotations, which leads to learning only a limited range of pedestrian visual appearances in the multispectral domain. To address these issues, we propose a novel framework called Sparsely Annotated Multispectral Pedestrian Detection (SAMPD). For limitation (i), we introduce Multispectral Pedestrian-aware Adaptive Weight (MPAW) and Positive Pseudo-label Enhancement (PPE) modules. Utilizing multispectral knowledge, these modules ensure the generation of high-quality pseudo-labels and enable effective learning by increasing weights for high-quality pseudo-labels based on modality characteristics. To address limitation (ii), we propose an Adaptive Pedestrian Retrieval Augmentation (APRA) module, which adaptively incorporates pedestrian patches from ground-truth and dynamically integrates high-quality pseudo-labels with the ground-truth, facilitating a more diverse learning pool of pedestrians. Extensive experimental results demonstrate that our SAMPD significantly enhances performance in sparsely annotated environments within the multispectral domain.

[CV-62] Identifying Surgical Instruments in Pedagogical Cataract Surgery Videos through an Optimized Aggregation Network

Quick read: This paper addresses real-time identification of surgical instruments in cataract surgery videos. The authors propose a deep learning model based on the YOLOV9 architecture that combines a Programmable Gradient Information (PGI) mechanism with a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN). These components target the information bottleneck problem in order to raise Minimum Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores. On a dataset of 615 images with 10 instrument classes, the model reaches an mAP of 73.74 at IoU 0.5, outperforming the compared models (YOLO v5, v7, v8, v9 vanilla, Laptool, and DETR).

Link: https://arxiv.org/abs/2501.02618
Authors: Sanya Sinha, Michal Balazia, Francois Bremond
Affiliations: Imperial College London; INRIA d’Université Côte d’Azur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Full paper accepted at the IEEE International Conference on Image Processing Applications and Systems (IPAS), Lyon, France, Jan 2025. 6 pages

Abstract:Instructional cataract surgery videos are crucial for ophthalmologists and trainees to observe surgical details repeatedly. This paper presents a deep learning model for real-time identification of surgical instruments in these videos, using a custom dataset scraped from open-access sources. Inspired by the architecture of YOLOV9, the model employs a Programmable Gradient Information (PGI) mechanism and a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN) to address the information bottleneck problem, enhancing Minimum Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores. The Go-ELAN YOLOV9 model, evaluated against YOLO v5, v7, v8, v9 vanilla, Laptool and DETR, achieves a superior mAP of 73.74 at IoU 0.5 on a dataset of 615 images with 10 instrument classes, demonstrating the effectiveness of the proposed model.
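
The reported score is measured at IoU 0.5, so it may help to recall how that criterion works. The following is a minimal sketch of box IoU and the matching test for a single prediction, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; the helper names are illustrative and not part of the paper's code.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, iou_threshold=0.5):
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= iou_threshold


gt = (0, 0, 10, 10)
print(box_iou((5, 0, 15, 10), gt))           # half-overlapping box
print(is_true_positive((2, 0, 12, 10), gt))  # mostly overlapping box
```

Average precision is then computed per class from the ranked list of matched and unmatched predictions and averaged over classes.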

[CV-63] Multi-layer Radial Basis Function Networks for Out-of-distribution Detection

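Quick read: This paper addresses the fact that existing out-of-distribution (OOD) detection methods must produce a score, separate from classification, to decide whether an input is OOD. The core idea is a multi-layer radial basis function network (MLRBFN) that merges classification and OOD detection into a single step. Radial basis function networks (RBFNs) inherently link classification confidence and OOD detection, but their use has been limited by the difficulty of training them in a multi-layer fashion. The paper develops an MLRBFN that is easy to train and introduces a novel depression mechanism to make it effective for OOD detection. Applied as a standalone classifier or as a head on top of a pretrained feature extractor, it is competitive with commonly used OOD detection methods. The architecture offers a new research direction for OOD detection.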

Link: https://arxiv.org/abs/2501.02616
Authors: Amol Khanna, Chenyi Ling, Derek Everett, Edward Raff, Nathan Inkawhich
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing methods for out-of-distribution (OOD) detection use various techniques to produce a score, separate from classification, that determines how "OOD" an input is. Our insight is that OOD detection can be simplified by using a neural network architecture which can effectively merge classification and OOD detection into a single step. Radial basis function networks (RBFNs) inherently link classification confidence and OOD detection; however, these networks have lost popularity due to the difficulty of training them in a multi-layer fashion. In this work, we develop a multi-layer radial basis function network (MLRBFN) which can be easily trained. To ensure that these networks are also effective for OOD detection, we develop a novel depression mechanism. We apply MLRBFNs as standalone classifiers and as heads on top of pretrained feature extractors, and find that they are competitive with commonly used methods for OOD detection. Our MLRBFN architecture demonstrates a promising new direction for OOD detection methods.
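
To illustrate why RBF units tie classification confidence to OOD detection, here is a minimal single-layer sketch with Gaussian activations. It is not the paper's MLRBFN and omits the depression mechanism; the centers, `gamma`, and `ood_threshold` are hypothetical values chosen for the example.

```python
import math


def rbf_activations(x, centers, gamma=1.0):
    """Gaussian RBF response of input x to each class center."""
    return [
        math.exp(-gamma * sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
        for c in centers
    ]


def classify_with_ood(x, centers, gamma=1.0, ood_threshold=0.1):
    """Return (label, confidence, is_ood) from a single RBF layer.

    The largest activation serves as both the class score and the
    OOD score: inputs far from every center activate no unit strongly.
    """
    acts = rbf_activations(x, centers, gamma)
    label = max(range(len(acts)), key=lambda i: acts[i])
    confidence = acts[label]
    return label, confidence, confidence < ood_threshold


centers = [(0.0, 0.0), (4.0, 4.0)]  # one illustrative center per class
print(classify_with_ood((0.2, -0.1), centers))   # near class 0
print(classify_with_ood((10.0, -10.0), centers)) # far from both centers
```

A softmax classifier can assign high confidence to points far from the training data, whereas the activation here decays with distance from every center, so no separate OOD score is needed.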

[CV-64] Evolving Skeletons: Motion Dynamics in Action Recognition

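Quick read: This paper studies how to represent spatiotemporal information more effectively in skeleton-based action recognition. Most existing approaches process skeleton sequences with graph-based models such as the Spatiotemporal Graph Convolutional Network (ST-GCN), where each pose is a skeletal graph structured around human physical connectivity, which limits their ability to express complex joint interactions. Hypergraph-based models such as Hyperformer capture higher-order correlations and give a richer representation of joint interactions. Taylor Videos, a recent method, enhances skeleton sequences by embedding motion concepts, offering a new perspective on interpreting human actions. The core of the study is a comprehensive evaluation of traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN and Hyperformer on the NTU-60 and NTU-120 datasets, comparing static poses against motion-injected poses. The results show that Taylor-transformed skeletons have potential for enhancing motion dynamics, but also expose current challenges in fully using their benefits, underscoring the need for new skeletal modelling techniques that handle motion-rich data.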

Link: https://arxiv.org/abs/2501.02593
Authors: Jushang Qiu, Lei Wang
Affiliations: Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research report

Abstract:Skeleton-based action recognition has gained significant attention for its ability to efficiently represent spatiotemporal information in a lightweight format. Most existing approaches use graph-based models to process skeleton sequences, where each pose is represented as a skeletal graph structured around human physical connectivity. Among these, the Spatiotemporal Graph Convolutional Network (ST-GCN) has become a widely used framework. Alternatively, hypergraph-based models, such as the Hyperformer, capture higher-order correlations, offering a more expressive representation of complex joint interactions. A recent advancement, termed Taylor Videos, introduces motion-enhanced skeleton sequences by embedding motion concepts, providing a fresh perspective on interpreting human actions in skeleton-based action recognition. In this paper, we conduct a comprehensive evaluation of both traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN and Hyperformer models on the NTU-60 and NTU-120 datasets. We compare skeletal graph and hypergraph representations, analyzing static poses against motion-injected poses. Our findings highlight the strengths and limitations of Taylor-transformed skeletons, demonstrating their potential to enhance motion dynamics while exposing current challenges in fully using their benefits. This study underscores the need for innovative skeletal modelling techniques to effectively handle motion-rich data and advance the field of action recognition.
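
The contrast between static and motion-injected poses can be made concrete with temporal finite differences over joint coordinates. This is only a rough sketch of the general idea of adding motion terms to a skeleton sequence; it is not the Taylor Videos formulation, and the data layout (`frames[t][j]` as an `(x, y, z)` tuple) is an assumption made for the example.

```python
def temporal_difference(frames):
    """Frame-to-frame difference of a skeleton sequence.

    frames[t][j] is the (x, y, z) position of joint j at time t.
    The result has one fewer frame than the input.
    """
    return [
        [tuple(c1 - c0 for c0, c1 in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


# Two joints over three frames: joint 0 is static, joint 1 speeds up along y.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0)],
]
velocity = temporal_difference(frames)        # first-order motion
acceleration = temporal_difference(velocity)  # second-order motion
print(velocity)
print(acceleration)
```

A static-pose model sees only `frames`, while a motion-aware input would also carry terms such as `velocity` and `acceleration` for each joint.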

[CV-65] Gaze Behavior During a Long-Term In-Home Social Robot Intervention for Children with ASD

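Quick read: This paper addresses atypical gaze behavior, a diagnostic hallmark of Autism Spectrum Disorder (ASD) that contributes to the social and communicative challenges faced by individuals with ASD. The study examines a month-long, in-home intervention designed to promote triadic interactions between a social robot, a child with ASD, and their caregiver. The key to the intervention is that the social robot encourages children to follow its gaze, which led to more frequent and prolonged spontaneous eye contact and joint attention with caregivers. The results indicate that the intervention successfully promoted appropriate gaze behavior and that diagnostic measures for ASD were strong predictors of gaze patterns. These findings deepen the understanding of ASD gaze patterns and point to the potential clinical relevance of robot-assisted interventions.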

Link: https://arxiv.org/abs/2501.02583
Authors: Rebecca Ramnauth, Frederick Shic, Brian Scassellati
Affiliations: Department of Computer Science, Yale University; Center for Child Health, Behavior, and Development, Seattle Children’s Research Institute; Department of Pediatrics, University of Washington School of Medicine
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the 2025 20th IEEE/ACM International Conference on Human-Robot Interaction (HRI)

Abstract:Atypical gaze behavior is a diagnostic hallmark of Autism Spectrum Disorder (ASD), playing a substantial role in the social and communicative challenges that individuals with ASD face. This study explores the impacts of a month-long, in-home intervention designed to promote triadic interactions between a social robot, a child with ASD, and their caregiver. Our results indicate that the intervention successfully promoted appropriate gaze behavior, encouraging children with ASD to follow the robot’s gaze, resulting in more frequent and prolonged instances of spontaneous eye contact and joint attention with their caregivers. Additionally, we observed specific timelines for behavioral variability and novelty effects among users. Furthermore, diagnostic measures for ASD emerged as strong predictors of gaze patterns for both caregivers and children. These results deepen our understanding of ASD gaze patterns and highlight the potential for clinical relevance of robot-assisted interventions.

[CV-66] DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

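Quick read: This paper addresses two problems of monocular depth estimation in the diffusion-denoising paradigm: low inference speed and the gap between generative and discriminative features. Recent single-step deterministic methods improve inference efficiency but overlook this gap, leading to suboptimal results. The paper proposes DepthMaster, a single-step diffusion model with two key modules. First, a Feature Alignment module incorporates high-quality semantic features to strengthen the denoising network's representation capability, mitigating overfitting to texture details introduced by generative features. Second, a Fourier Enhancement module adaptively balances low-frequency structure and high-frequency details, compensating for the lack of fine-grained detail in the single-step deterministic framework. With a two-stage training strategy, the model achieves state-of-the-art generalization and detail preservation, outperforming other diffusion-based methods.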

Link: https://arxiv.org/abs/2501.02576
Authors: Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang
Affiliations: University of Science and Technology of China (USTC); vivo Mobile Communication Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, 6 tables

Abstract:Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network’s representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at this https URL.

[CV-67] Balanced Multi-view Clustering

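Quick read: This paper addresses the imbalanced and under-optimized view-specific features caused by the joint training paradigm in multi-view clustering (MvC). Because all views share a uniform learning objective, joint training may not fully exploit multi-view information: views with more discriminative information can dominate the learning process while other views remain under-optimized. The paper proposes a balanced multi-view clustering (BMvC) method whose key innovation is a view-specific contrastive regularization (VCR). VCR preserves the sample similarities captured from the joint features and the view-specific features in the clustering distributions corresponding to the view-specific features, which strengthens the learning of the view-specific feature extractors. A theoretical analysis shows that VCR adaptively modulates the gradient magnitudes used to update the parameters of the view-specific feature extractors, yielding a balanced multi-view learning procedure. BMvC thereby achieves a better trade-off between exploiting view-specific patterns and exploring view-invariant patterns.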

Link: https://arxiv.org/abs/2501.02564
Authors: Zhenglai Li, Jun Wang, Chang Tang, Xinzhong Zhu, Wei Zhang, Xinwang Liu
Affiliations: City University of Macau; National University of Defense Technology; China University of Geosciences; Zhejiang Normal University; Research Institute of Ningbo Cixing Co. Ltd; Shandong Computer Science Center (National Supercomputer Center in Jinan); Qilu University of Technology (Shandong Academy of Sciences); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing; Shandong Fundamental Research Center for Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-view clustering (MvC) aims to integrate information from different views to enhance the capability of the model in capturing the underlying data structures. The widely used joint training paradigm in MvC potentially does not fully leverage the multi-view information, because the uniform learning objective for all views leaves view-specific features imbalanced and under-optimized. For instance, particular views with more discriminative information could dominate the learning process in the joint training paradigm, leading to other views being under-optimized. To alleviate this issue, we first analyze the imbalanced phenomenon in the joint-training paradigm of multi-view clustering from the perspective of gradient descent for each view-specific feature extractor. Then, we propose a novel balanced multi-view clustering (BMvC) method, which introduces a view-specific contrastive regularization (VCR) to modulate the optimization of each view. Concretely, VCR preserves the sample similarities captured from the joint features and view-specific ones into the clustering distributions corresponding to view-specific features to enhance the learning process of view-specific feature extractors. Additionally, a theoretical analysis is provided to illustrate that VCR adaptively modulates the magnitudes of gradients for updating the parameters of view-specific feature extractors to achieve a balanced multi-view learning procedure. In such a manner, BMvC achieves a better trade-off between the exploitation of view-specific patterns and the exploration of view-invariance patterns to fully learn the multi-view information for the clustering task. Finally, a set of experiments is conducted to verify the superiority of the proposed method compared with state-of-the-art approaches on eight benchmark MvC datasets and two spatially resolved transcriptomics datasets.

[CV-68] Neural Error Covariance Estimation for Precise LiDAR Localization

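Quick read: This paper addresses precise localization of autonomous vehicles with LiDAR-based map matching, which is prone to errors caused by degeneracy in the data. Sensor fusion techniques such as the Kalman filter rely on accurate error covariance estimates to improve localization accuracy, but obtaining reliable covariance values for LiDAR map matching remains difficult. The paper proposes a neural network-based framework for predicting localization error covariance in LiDAR map matching. The key to the solution is a novel dataset generation method designed specifically for error covariance estimation. Evaluated with a Kalman filter, the method improves localization accuracy by 2 cm.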

Link: https://arxiv.org/abs/2501.02558
Authors: Minoo Dolatabadi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Affiliations: Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran; Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 2024 International Conference on Intelligent Computing and its Emerging Applications

Abstract:Autonomous vehicles have gained significant attention due to technological advancements and their potential to transform transportation. A critical challenge in this domain is precise localization, particularly in LiDAR-based map matching, which is prone to errors due to degeneracy in the data. Most sensor fusion techniques, such as the Kalman filter, rely on accurate error covariance estimates for each sensor to improve localization accuracy. However, obtaining reliable covariance values for map matching remains a complex task. To address this challenge, we propose a neural network-based framework for predicting localization error covariance in LiDAR map matching. To achieve this, we introduce a novel dataset generation method specifically designed for error covariance estimation. In our evaluation using a Kalman filter, we achieved a 2 cm improvement in localization accuracy, a significant enhancement in this domain.
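
Why the covariance estimate matters can be seen in the scalar Kalman measurement update, where the measurement variance `r` sets how much the filter trusts the map-matching result. The sketch below is the textbook one-dimensional update, not the paper's pipeline; the numeric values are made up for illustration.

```python
def kalman_update(x_prior, p_prior, z, r):
    """Scalar Kalman measurement update.

    x_prior, p_prior: prior state estimate and its variance
    z, r: measurement and its error variance
    Returns the posterior estimate, posterior variance, and Kalman gain.
    """
    gain = p_prior / (p_prior + r)
    x_post = x_prior + gain * (z - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_post, p_post, gain


# Same prior and measurement, two different claimed measurement variances.
print(kalman_update(10.0, 1.0, 12.0, r=0.25))  # trusted measurement, large gain
print(kalman_update(10.0, 1.0, 12.0, r=4.0))   # distrusted measurement, small gain
```

If `r` is set too small for a degenerate scan, the filter is pulled toward a bad match. If it is set too large, good matches are underused. A learned per-measurement covariance aims to avoid both cases.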

[CV-69] AHMSA-Net: Adaptive Hierarchical Multi-Scale Attention Network for Micro-Expression Recognition

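Quick read: This paper addresses micro-expression recognition (MER), where the transient and subtle nature of micro-expressions leaves existing deep learning methods with insufficient feature capture and poor dynamic adaptation. The paper proposes an Adaptive Hierarchical Multi-Scale Attention Network (AHMSA-Net), built on an adaptive hierarchical framework and a multi-scale attention mechanism. The adaptive hierarchical framework dynamically adjusts the size of the optical flow feature map at each layer to capture subtle changes at different granularities (fine and coarse); the multi-scale attention mechanism learns micro-expression action information by fusing features from different scales (channel and spatial). The two modules work together to improve recognition accuracy. Experiments show that AHMSA-Net achieves competitive recognition accuracy on major micro-expression databases.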

Link: https://arxiv.org/abs/2501.02539
Authors: Lijun Zhang, Yifan Zhang, Weicheng Tang, Xinzhi Sun, Xiaomeng Wang, Zhanshan Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Micro-expression recognition (MER) presents a significant challenge due to the transient and subtle nature of the motion changes involved. In recent years, deep learning methods based on attention mechanisms have made some breakthroughs in MER. However, these methods still suffer from the limitations of insufficient feature capture and poor dynamic adaptation when coping with the instantaneous subtle movement changes of micro-expressions. Therefore, in this paper, we design an Adaptive Hierarchical Multi-Scale Attention Network (AHMSA-Net) for MER. Specifically, we first utilize the onset and apex frames of the micro-expression sequence to extract three-dimensional (3D) optical flow maps, including horizontal optical flow, vertical optical flow, and optical flow strain. Subsequently, the optical flow feature maps are inputted into AHMSA-Net, which consists of two parts: an adaptive hierarchical framework and a multi-scale attention mechanism. Based on the adaptive downsampling hierarchical attention framework, AHMSA-Net captures the subtle changes of micro-expressions from different granularities (fine and coarse) by dynamically adjusting the size of the optical flow feature map at each layer. Based on the multi-scale attention mechanism, AHMSA-Net learns micro-expression action information by fusing features from different scales (channel and spatial). These two modules work together to comprehensively improve the accuracy of MER. Additionally, rigorous experiments demonstrate that the proposed method achieves competitive results on major micro-expression databases, with AHMSA-Net achieving recognition accuracy of up to 78.21% on composite databases (SMIC, SAMM, CASMEII) and 77.08% on the CASME^3 database.

[CV-70] Pixel-Wise Feature Selection for Perceptual Edge Detection without post-processing

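Quick read: This paper addresses limitations of current image edge detection (ED) models based on deep convolutional neural networks (CNNs). Existing models depend heavily on post-processing techniques such as non-maximum suppression (NMS) and deliver unsatisfactory perceptual quality, with performance dropping significantly as the allowed error tolerance distance decreases. These problems stem from fusing features uniformly across all pixels while ignoring pixel-specific characteristics, such as the difference between textural and edge areas. The paper proposes a new feature selection paradigm that selects features more finely and increases feature diversity, substantially improving conventional ED models without post-processing and improving the perceptual quality of the predictions. The key to the solution is a differential feature selection mechanism that can be seamlessly integrated into existing ED models.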

Link: https://arxiv.org/abs/2501.02534
Authors: Hao Shu
Affiliations: Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Although deep convolutional neural networks (CNNs) have significantly enhanced performance in image edge detection (ED), current models remain highly dependent on post-processing techniques such as non-maximum suppression (NMS) and often fail to deliver satisfactory perceptual results, and their performance deteriorates significantly when the allowed error tolerance distance decreases. These limitations arise from the uniform fusion of features across all pixels, regardless of their specific characteristics, such as the distinction between textural and edge areas. If the features extracted by the ED models are selected more meticulously and encompass greater diversity, the resulting predictions are expected to be more accurate and perceptually meaningful. Motivated by this observation, this paper proposes a novel feature selection paradigm for deep networks that facilitates the differential selection of features and can be seamlessly integrated into existing ED models. By incorporating this additional structure, the performance of conventional ED models is substantially enhanced without post-processing, while simultaneously enhancing the perceptual quality of the predictions. Extensive experimental evaluations validate the effectiveness of the proposed model.
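
The difference between uniform fusion and pixel-wise selection can be sketched with per-pixel softmax weights over several feature maps. The abstract does not specify the selection mechanism, so this is an assumed illustration only: `score_maps` stands in for whatever per-pixel selection scores a network might predict, and all names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_fuse(feature_maps, score_maps):
    """Fuse N same-sized 2-D feature maps with weights chosen per pixel.

    With identical scores everywhere this reduces to uniform averaging.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weights = softmax([s[r][c] for s in score_maps])
            fused[r][c] = sum(w * f[r][c] for w, f in zip(weights, feature_maps))
    return fused


features = [[[1.0, 0.0]], [[0.0, 1.0]]]     # two 1x2 feature maps
uniform = [[[0.0, 0.0]], [[0.0, 0.0]]]      # same weight at every pixel
selective = [[[20.0, -20.0]], [[-20.0, 20.0]]]  # a different map favored per pixel
print(pixelwise_fuse(features, uniform))
print(pixelwise_fuse(features, selective))
```

In the uniform case each response is diluted by the other map, while per-pixel weights let an edge pixel and a texture pixel draw on different features.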

[CV-71] Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks

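Quick read: This paper addresses challenges in vision generation, in particular the seamless integration of visual understanding and generative capabilities. It proposes Vision-Driven Prompt Optimization (VDPO), a framework that uses Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs and thereby guide high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module; together these components achieve state-of-the-art performance on diverse vision generation tasks. Experiments on benchmarks such as COCO and Sketchy show that VDPO outperforms existing methods, with significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses demonstrate its scalability, robustness, and generalization across in-domain and out-of-domain tasks, and human evaluations further confirm its advantage in producing visually appealing and semantically coherent outputs.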

Link: https://arxiv.org/abs/2501.02527
Authors: Leo Franklin, Apiradee Boonmee, Kritsada Wongsuwan
Affiliations: Kasem Bundit University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile solution for in-domain and out-of-domain tasks. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.

[CV-72] Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation

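Quick read: This paper addresses the difficulty of generating a desired facial image from a text prompt. Although current large-scale text-image diffusion models have strong generation capabilities, producing a specific facial image from text alone remains hard. The paper therefore proposes an image-prompt-based approach focused on facial image generation. The key elements are: (1) a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M), built on LAION-Face, used to train the Face-MakeUp model; (2) to stay consistent with the reference facial image, multi-scale content features and pose features are extracted and learned, then integrated into the diffusion model to better preserve facial identity. Validation on two face-related test datasets shows that Face-MakeUp achieves the best comprehensive performance.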

Link: https://arxiv.org/abs/2501.02523
Authors: Dawei Dai, Mingming Jia, Yinxiu Zhou, Hang Xing, Chenghang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Facial images have extensive practical applications. Although the current large-scale text-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only a text prompt. Image prompts are a logical choice. However, current methods of this type generally focus on the general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we built a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features for the facial image, integrating these into the diffusion model to enhance the preservation of facial identity features for diffusion models. Validation on two face-related test datasets demonstrates that our Face-MakeUp can achieve the best comprehensive performance. Codes are available at: this https URL

[CV-73] Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors

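Quick read: This paper addresses the inaccurate descriptions and lack of fine-grained control in text-prompted 3D scene generation, which lead to implausible scenes. It proposes Layout2Scene, a text-to-scene generation method whose key idea is to use an additional semantic layout as a prompt, giving precise control over 3D object positions. Specifically, the method first adopts a scene hybrid representation that decouples objects from the background and is initialized via a pre-trained text-to-3D model. A two-stage scheme then optimizes the geometry and the appearance of the initialized scene separately. To fully exploit 2D diffusion priors, the paper introduces a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model, both finetuned on a scene dataset. Experiments show that the method generates more plausible and realistic 3D scenes than state-of-the-art approaches and supports flexible yet precise editing for multiple downstream applications.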

Link: https://arxiv.org/abs/2501.02519
Authors: Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, Yulan Guo
Affiliations: The Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:3D scene generation conditioned on text prompts has significantly progressed due to the development of 2D diffusion generation models. However, the textual description of 3D scenes is inherently inaccurate and lacks fine-grained control during training, leading to implausible scene generation. As an intuitive and feasible solution, the 3D layout allows for precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) using additional semantic layout as the prompt to inject precise control of 3D object positions. Specifically, we first introduce a scene hybrid representation to decouple objects and backgrounds, which is initialized via a pre-trained text-to-3D model. Then, we propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model which are finetuned on a scene dataset. Extensive experiments demonstrate that our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches. Furthermore, the generated scene allows for flexible yet precise editing, thereby facilitating multiple downstream applications.

[CV-74] Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

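Quick read: This paper addresses two problems in facial attractiveness prediction (FAP): existing datasets are small, closed-source, or lacking in diversity, and existing models have limited generalization and adaptation ability. The authors present LiveBeauty, the first large-scale FAP dataset specific to live streaming, containing 10,000 face images collected directly from a live streaming platform and 200,000 attractiveness annotations obtained from a well-devised subjective experiment, which makes it the largest open-access FAP dataset for the live scenario. The paper also proposes a multi-modal FAP method: a Personalized Attractiveness Prior Module (PAPM) and a Multi-modal Attractiveness Encoder Module (MAEM) extract holistic facial prior knowledge and multi-modal aesthetic semantic features, respectively, and a Cross-Modal Fusion Module (CMFM) integrates them. Experiments show state-of-the-art performance on LiveBeauty and other open-source FAP datasets.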

Link: https://arxiv.org/abs/2501.02509
Authors: Hui Li, Xiaoyu Ren, Hongjiu Yu, Huiyu Duan, Kai Li, Ying Chen, Libo Wang, Xiongkuo Min, Guangtao Zhai, Xu Liu
Affiliations: Alibaba Group; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in this paper we present LiveBeauty, the first large-scale live-specific FAP dataset, in a more challenging application scenario, i.e., live streaming. 10,000 face images are collected from a live streaming platform directly, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset in the challenging live scenario. Furthermore, a multi-modal FAP method is proposed to measure the facial attractiveness in live streaming. Specifically, we first extract holistic facial prior knowledge and multi-modal aesthetic semantic features via a Personalized Attractiveness Prior Module (PAPM) and a Multi-modal Attractiveness Encoder Module (MAEM), respectively, then integrate the extracted features through a Cross-Modal Fusion Module (CMFM). Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. Dataset will be available soon.

[CV-75] PTEENet: Post-Trained Early-Exit Neural Networks Augmentation for Inference Cost Optimization

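Quick read: This paper addresses the high computational cost of deep neural network (DNN) inference. To substantially reduce the required compute while keeping inference accuracy high, the authors introduce "shortcuts" into the DNN feedforward inference process, skipping costly feedforward computations whenever possible and accepting a small loss in accuracy. Concretely, they extend the BranchyNet and EEnet architectures by attaching branches to pre-trained models, which removes the need to alter the original network weights. They also propose a new branch architecture based on convolutional building blocks that provides enough training capacity for large DNNs. The architecture includes confidence heads that predict the confidence level of each early exit. Adjusting thresholds on these confidence outputs controls, in real time, how much data exits from each branch, giving a dynamic trade-off between speed and accuracy. Experiments show that the method reduces the average inference cost and allows further control over the trade-off between model accuracy and computation cost.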

Link: https://arxiv.org/abs/2501.02508
Authors: Assaf Lahiany, Yehudit Aperstein
Affiliations: Afeka Academic College of Engineering, Tel Aviv-Yafo, Israel
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: For many practical applications, a high computational cost of inference over deep network architectures might be unacceptable. A small degradation in the overall inference accuracy might be a reasonable price to pay for a significant reduction in the required computational resources. In this work, we describe a method for introducing “shortcuts” into the DNN feedforward inference process by skipping costly feedforward computations whenever possible. The proposed method is based on the previously described BranchyNet (Teerapittayanon et al., 2016) and the EEnet (Demir, 2019) architectures that jointly train the main network and early exit branches. We extend those methods by attaching branches to pre-trained models and, thus, eliminating the need to alter the original weights of the network. We also suggest a new branch architecture based on convolutional building blocks to allow enough training capacity when applied on large DNNs. The proposed architecture includes confidence heads that are used for predicting the confidence level in the corresponding early exits. By defining adjusted thresholds on these confidence extensions, we can control in real-time the amount of data exiting from each branch and the overall tradeoff between speed and accuracy of our model. In our experiments, we evaluate our method using image datasets (SVHN and CIFAR10) and several DNN architectures (ResNet, DenseNet, VGG) with varied depth. Our results demonstrate that the proposed method enables us to reduce the average inference computational cost and to further control the tradeoff between the model accuracy and the computation cost.
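
The threshold-controlled early exit described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `early_exit_predict`, `branches`, `thresholds`, and `final_head` are hypothetical, and each branch is assumed to return a predicted label together with a confidence in [0, 1].

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical branch type: maps an input to (predicted_label, confidence in [0, 1]).
Branch = Callable[[Sequence[float]], Tuple[int, float]]


def early_exit_predict(
    x: Sequence[float],
    branches: List[Branch],
    thresholds: List[float],
    final_head: Callable[[Sequence[float]], int],
) -> Tuple[int, int]:
    """Return (label, exit_index).

    Branches are tried in order of depth. The first branch whose confidence
    reaches its threshold produces the answer, and deeper computation is skipped.
    exit_index == len(branches) means no branch was confident enough and the
    full network (final_head) was used.
    """
    if len(branches) != len(thresholds):
        raise ValueError("one threshold is needed per branch")
    for i, (branch, tau) in enumerate(zip(branches, thresholds)):
        label, confidence = branch(x)
        if confidence >= tau:
            return label, i
    return final_head(x), len(branches)
```

Lowering a threshold lets more inputs leave at that branch (faster, possibly less accurate); raising it pushes more inputs toward the full network.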

[CV-76] Watch Video Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection AAAI2025

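[Quick Read]: This paper addresses the difficulty that video moment retrieval and highlight detection methods have in capturing the overall video context, which makes it hard to determine which query words are most relevant. It proposes a Video Context-aware Keyword Attention module that captures keyword variation within the context of the entire video. The key components are a video context clustering module, which provides a concise representation of the overall video context and so improves the understanding of keyword dynamics, and a keyword weight detection module with keyword-aware contrastive learning, which strengthens fine-grained alignment between visual and textual features. Experiments show significant gains on moment retrieval and highlight detection across several benchmarks.

Link: https://arxiv.org/abs/2501.02504
Authors: Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2025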

Abstract:The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: this https URL

[CV-77] ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling

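[Quick Read]: This paper presents ACE++, an instruction-based diffusion framework for a wide range of image generation and editing tasks. The solution has two key parts. First, it improves the Long-context Condition Unit (LCU) introduced in ACE and extends this input paradigm to all editing and generation tasks, making the model more general. Second, it uses a two-stage training scheme to exploit image generative priors: stage one pre-trains the model on task data together with the 0-ref tasks of the text-to-image model; stage two fine-tunes the model to support general instructions across all tasks defined in ACE. To broaden applicability, the authors provide a set of models covering both full and lightweight fine-tuning, for general use as well as vertical scenarios. Qualitative analysis indicates that ACE++ is superior in generated image quality and prompt following.

Link: https://arxiv.org/abs/2501.02487
Authors: Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, Jingren Zhou
Affiliations: Tongyi Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: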

Abstract:We report ACE++, an instruction-based diffusion framework that tackles various image generation and editing tasks. Inspired by the input format for the inpainting task proposed by FLUX.1-Fill-dev, we improve the Long-context Condition Unit (LCU) introduced in ACE and extend this input paradigm to any editing and generation tasks. To take full advantage of image generative priors, we develop a two-stage training scheme to minimize the efforts of finetuning powerful text-to-image diffusion models like FLUX.1-dev. In the first stage, we pre-train the model using task data with the 0-ref tasks from the text-to-image model. There are many models in the community based on the post-training of text-to-image foundational models that meet this training paradigm of the first stage. For example, FLUX.1-Fill-dev deals primarily with painting tasks and can be used as an initialization to accelerate the training process. In the second stage, we finetune the above model to support the general instructions using all tasks defined in ACE. To promote the widespread application of ACE++ in different scenarios, we provide a comprehensive set of models that cover both full finetuning and lightweight finetuning, while considering general applicability and applicability in vertical scenarios. The qualitative analysis showcases the superiority of ACE++ in terms of generating image quality and prompt following ability.

[CV-78] A Deep Positive-Negative Prototype Approach to Integrated Prototypical Discriminative Learning

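[Quick Read]: This paper addresses the optimization of intra-class compactness and inter-class separability in deep neural networks. Prototype-based learning (PbL) classifies samples by their similarity to representative prototypes and is therefore interpretable, but it struggles to produce optimal decision boundaries in complex scenarios. Discriminative methods separate classes effectively but usually lack intuitive interpretability. The paper proposes a Deep Positive-Negative Prototype (DPNP) model that unifies class prototypes with weight vectors, building a structured latent space that allows accurate classification while remaining interpretable. Specifically, DPNP forms a Deep Positive Prototype (DPP) for each class in the latent space and treats the DPPs of neighboring rival classes as implicit negative prototypes, whose repulsive force increases inter-class separation. A new loss function combining cross-entropy, prototype alignment, and separation terms organizes the feature-space geometry, maximizing intra-class compactness and inter-class margins. Experiments on several datasets show that DPNP outperforms state-of-the-art models while using smaller networks.

Link: https://arxiv.org/abs/2501.02477
Authors: Ramin Zarei-Sabzevar, Ahad Harati
Affiliations: Department of Computer Engineering, Ferdowsi University of Mashhad (FUM)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: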

Abstract:This paper proposes a novel Deep Positive-Negative Prototype (DPNP) model that combines prototype-based learning (PbL) with discriminative methods to improve class compactness and separability in deep neural networks. While PbL traditionally emphasizes interpretability by classifying samples based on their similarity to representative prototypes, it struggles with creating optimal decision boundaries in complex scenarios. Conversely, discriminative methods effectively separate classes but often lack intuitive interpretability. Toward exploiting advantages of these two approaches, the suggested DPNP model bridges between them by unifying class prototypes with weight vectors, thereby establishing a structured latent space that enables accurate classification using interpretable prototypes alongside a properly learned feature representation. Based on this central idea of unified prototype-weight representation, Deep Positive Prototype (DPP) is formed in the latent space as a representative for each class using off-the-shelf deep networks as feature extractors. Then, rival neighboring class DPPs are treated as implicit negative prototypes with repulsive force in DPNP, which push away DPPs from each other. This helps to enhance inter-class separation without the need for any extra parameters. Hence, through a novel loss function that integrates cross-entropy, prototype alignment, and separation terms, DPNP achieves well-organized feature space geometry, maximizing intra-class compactness and inter-class margins. We show that DPNP can organize prototypes in nearly regular positions within feature space, such that it is possible to achieve competitive classification accuracy even in much lower-dimensional feature spaces. Experimental results on several datasets demonstrate that DPNP outperforms state-of-the-art models, while using smaller networks.

[CV-79] Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

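[Quick Read]: This paper addresses learning an unbiased classifier from a large number of noisily labeled web images when only a few cleanly labeled images are available. The setting is practical because freely available, noisily labeled web images reduce expensive annotation costs. However, noisy images easily bias conventional class prototypes, leaving them less compact and less discriminative. The key to the solution is a similarity maximization loss named SimNoiPro. SimNoiPro first generates hybrid prototypes composed of clean and noise-tolerant prototypes and then pulls them closer together. By explicitly dividing the noisy images to account for their diversity, the method overcomes the optimization discrepancy issue, models the relation between clean and noisy images better, and extracts useful information from the noisy set. Experiments show that SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.

Link: https://arxiv.org/abs/2501.02476
Authors: Chao Liang, Linchao Zhu, Zongxin Yang, Wei Chen, Yi Yang
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by TOMM 2024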

Abstract:We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it reduces the expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used to classify or identify other images. However, in the few clean and many noisy scenarios, the class prototype can be severely biased due to the presence of irrelevant noisy images. The resulting prototypes are less compact and discriminative, as previous methods do not take into account the diverse range of images in the noisy web image collections. On the other hand, the relation modeling between noisy and clean images is not learned for the class prototype generation in an end-to-end manner, which results in a suboptimal class prototype. In this article, we introduce a similarity maximization loss named SimNoiPro. Our SimNoiPro first generates noise-tolerant hybrid prototypes composed of clean and noise-tolerant prototypes and then pulls them closer to each other. Our approach considers the diversity of noisy images by explicit division and overcomes the optimization discrepancy issue. This enables better relation modeling between clean and noisy images and helps extract judicious information from the noisy image set. The evaluation results on two extended few-shot classification benchmarks confirm that our SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.

[CV-80] Generalization-Enhanced Few-Shot Object Detection in Remote Sensing

[Quick Read]: This paper addresses the limited generalization of object detection in remote sensing imagery when labeled data are scarce. Remote sensing images are high-resolution, multi-scale, and contain diverse ground objects, which makes detection especially hard. Deep learning methods have been very successful here, but they typically depend on large amounts of labeled data; collecting enough labels, particularly for novel or rare objects, is difficult and time-consuming, which limits generalization. The paper proposes the Generalization-Enhanced Few-Shot Object Detection (GE-FSOD) model to improve generalization from a few labeled samples. Its key innovations are: 1) the Cross-Level Fusion Pyramid Attention Network (CFPAN), for stronger multi-scale feature representation; 2) the Multi-Stage Refinement Region Proposal Network (MRRPN), for more accurate region proposals; and 3) the Generalized Classification Loss (GCL), for better classification in few-shot scenarios. Extensive experiments on the DIOR and NWPU VHR-10 datasets show state-of-the-art performance for few-shot object detection in remote sensing.

Link: https://arxiv.org/abs/2501.02474
Authors: Hui Lin, Nan Li, Pengjuan Yao, Kexin Dong, Yuhan Guo, Danfeng Hong, Ying Zhang, Congcong Wen
Affiliations: China Academy of Electronics and Information Technology; National Satellite Meteorological Center; Innovation Center for FengYun Meteorological Satellite, China Meteorological Administration; State Key Laboratory of Hydroscience and Engineering, Tsinghua University; Aerospace Information Research Institute, Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; School of Automation and Electrical Engineering, University of Science and Technology Beijing; Department of Electrical and Computer Engineering, New York University Abu Dhabi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing object detection is particularly challenging due to the high resolution, multi-scale features, and diverse ground object characteristics inherent in satellite and UAV imagery. These challenges necessitate more advanced approaches for effective object detection in such environments. While deep learning methods have achieved remarkable success in remote sensing object detection, they typically rely on large amounts of labeled data. Acquiring sufficient labeled data, particularly for novel or rare objects, is both challenging and time-consuming in remote sensing scenarios, limiting the generalization capabilities of existing models. To address these challenges, few-shot learning (FSL) has emerged as a promising approach, aiming to enable models to learn new classes from limited labeled examples. Building on this concept, few-shot object detection (FSOD) specifically targets object detection challenges in data-limited conditions. However, the generalization capability of FSOD models, particularly in remote sensing, is often constrained by the complex and diverse characteristics of the objects present in such environments. In this paper, we propose the Generalization-Enhanced Few-Shot Object Detection (GE-FSOD) model to improve the generalization capability in remote sensing FSOD tasks. Our model introduces three key innovations: the Cross-Level Fusion Pyramid Attention Network (CFPAN) for enhanced multi-scale feature representation, the Multi-Stage Refinement Region Proposal Network (MRRPN) for more accurate region proposals, and the Generalized Classification Loss (GCL) for improved classification performance in few-shot scenarios. Extensive experiments on the DIOR and NWPU VHR-10 datasets show that our model achieves state-of-the-art performance for few-shot object detection in remote sensing.

[CV-81] DeTrack: In-model Latent Denoising Learning for Visual Object Tracking NEURIPS2024

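[Quick Read]: This paper addresses the poor performance of existing visual object tracking methods on unseen data. Existing methods rely on either image-feature regression models or coordinate autoregression models: the former depend heavily on matching results and do not use positional priors, while the latter can only be trained on bounding boxes from the training set, so they generalize poorly to unseen test data. The paper proposes a new paradigm based on denoising learning: noise is added to bounding boxes to generate noisy boxes for training, which makes the model more robust on test data. The key step is to decompose the denoising process across the denoising blocks inside a single model rather than running the model multiple times, so tracking stays real-time. Concretely, the paper proposes a denoising Vision Transformer (ViT) built from multiple denoising blocks; each block removes noise from the predicted bounding box, and the stacked blocks together complete the denoising process. The model also uses image features and trajectory information to refine the denoised box, and trajectory memory and visual memory to improve tracking stability. Experiments show competitive performance on several challenging datasets.

Link: https://arxiv.org/abs/2501.02467
Authors: Xinyu Zhou, Jinglun Li, Lingyi Hong, Kaixun Jiang, Pinxue Guo, Weifeng Ge, Wenqiang Zhang
Affiliations: Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China; Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering and Technology, Fudan University, Shanghai, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024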

Abstract: Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods heavily depend on matching results and do not utilize positional prior, while the autoregressive approach can only be trained using bounding boxes available in the training set, potentially resulting in suboptimal performance during testing with unseen data. Inspired by the diffusion model, denoising learning enhances the model’s robustness to unseen data. Therefore, we introduce noise to bounding boxes, generating noisy boxes for training, thus enhancing model robustness on testing data. We propose a new paradigm to formulate the visual object tracking problem as a denoising learning process. However, tracking algorithms usually have to run in real time, and directly applying the diffusion model to object tracking would severely impair tracking speed. Therefore, we decompose the denoising learning process into every denoising block within a model, not by running the model multiple times, and thus we summarize the proposed paradigm as an in-model latent denoising learning process. Specifically, we propose a denoising Vision Transformer (ViT), which is composed of multiple denoising blocks. In the denoising block, template and search embeddings are projected into every denoising block as conditions. A denoising block is responsible for removing the noise in a predicted bounding box, and multiple stacked denoising blocks cooperate to accomplish the whole denoising process. Subsequently, we utilize image features and trajectory information to refine the denoised bounding box. Besides, we also utilize trajectory memory and visual memory to improve tracking stability. Experimental results validate the effectiveness of our approach, achieving competitive performance on several challenging datasets.
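
To make the "noisy boxes for training" idea concrete, here is a minimal sketch of one way to perturb a ground-truth box. The abstract does not specify the noise distribution, so the uniform jitter, the `(cx, cy, w, h)` box format, and the names `add_box_noise` and `noise_scale` are all assumptions made for illustration.

```python
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (center_x, center_y, width, height)


def add_box_noise(box: Box, noise_scale: float, rng: random.Random) -> Box:
    """Return a jittered copy of `box`.

    The center is shifted by up to `noise_scale` times the box size, and the
    width and height are rescaled by up to +/- `noise_scale`. A denoising model
    would be trained to map the noisy box back to the original one.
    """
    cx, cy, w, h = box
    cx += rng.uniform(-noise_scale, noise_scale) * w
    cy += rng.uniform(-noise_scale, noise_scale) * h
    w *= 1.0 + rng.uniform(-noise_scale, noise_scale)
    h *= 1.0 + rng.uniform(-noise_scale, noise_scale)
    return (cx, cy, w, h)
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible across training runs.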

[CV-82] Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera

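[Quick Read]: This paper addresses accurate metric depth estimation across different camera types, especially large field-of-view (FoV) cameras such as fisheye and 360-degree cameras. Existing depth estimators generalize well zero-shot but still struggle with these cameras. The proposed Depth Any Camera (DAC) framework extends a model trained on perspective images so that it handles cameras with varying FoVs. Its key components are: Equi-Rectangular Projection (ERP) as a unified image representation, giving consistent processing of images with different FoVs; a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space; a FoV alignment operation that supports effective training over a wide range of FoVs; and multi-resolution data augmentation to handle resolution gaps between training and testing. With these, DAC generalizes to fisheye and 360-degree cameras without specialized training data and achieves state-of-the-art zero-shot metric depth estimation on multiple datasets, with substantially higher delta-1 (δ₁) accuracy.

Link: https://arxiv.org/abs/2501.02464
Authors: Yuliang Guo, Sparsh Garg, S. Mahdi H. Miangoleh, Xinyu Huang, Liu Ren
Affiliations: Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI); Carnegie Mellon University; Simon Fraser University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: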

Abstract: While recent depth estimation methods exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types, particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras, remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its key components include a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space, a FoV alignment operation to support effective training across a wide range of FoVs, and multi-resolution data augmentation to address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving delta-1 (δ₁) accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
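
For readers unfamiliar with the delta-1 metric mentioned above, the sketch below uses the definition that is standard in the depth estimation literature: the fraction of valid pixels whose predicted-to-true depth ratio (taken in whichever direction is larger) is below 1.25. The function name and the flat-list input format are illustrative assumptions; the paper's exact evaluation protocol (masking, depth caps) is not given here.

```python
from typing import Sequence


def delta1_accuracy(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    Pixels with non-positive predicted or ground-truth depth are treated as
    invalid and skipped.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not valid:
        raise ValueError("no valid depth values")
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)
```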

[CV-83] FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models

[Quick Read]: This paper addresses remote sensing image classification when privacy and data-sharing restrictions prevent large distributed datasets from being pooled for centralized training. Traditional federated learning incurs heavy communication costs with vision-language models (VLMs) that have billions of parameters. The paper proposes FedRSCLIP, the first federated learning framework for remote sensing image classification based on CLIP. The key is Prompt Learning, which optimizes only a small set of tunable parameters and so eases both data heterogeneity and large-model transmission. FedRSCLIP uses a dual-prompt mechanism: Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. A Dual Prompt Alignment Constraint keeps shared and private prompts semantically coherent, balancing global consistency and local adaptability. A Cross-Modal Feature Alignment Constraint aligns multimodal features between text and image prompts to improve cross-modal representation learning. Experiments on the newly constructed Fed-RSIC dataset demonstrate the effectiveness and superiority of FedRSCLIP.

Link: https://arxiv.org/abs/2501.02461
Authors: Hui Lin, Chao Zhang, Danfeng Hong, Kexin Dong, Congcong Wen
Affiliations: China Academy of Electronics and Information Technology; Aerospace Information Research Institute, Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; Department of Electrical and Computer Engineering, New York University Abu Dhabi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they would incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework introduces a dual-prompt mechanism, comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results demonstrate the effectiveness and superiority of FedRSCLIP in remote sensing image classification.
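
The communication saving comes from exchanging only the small shared prompt vectors instead of full model weights. The sketch below shows how a server might combine them, assuming a FedAvg-style weighted average; the abstract does not state the aggregation rule, so that choice and the names `aggregate_shared_prompts`, `client_prompts`, and `client_sizes` are assumptions. Private prompts would simply stay on each client.

```python
from typing import List, Sequence


def aggregate_shared_prompts(
    client_prompts: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Weighted average of per-client shared prompt vectors.

    Each client's prompt is weighted by its local dataset size (FedAvg-style,
    an assumption). Only these vectors travel between clients and server.
    """
    if len(client_prompts) != len(client_sizes) or not client_prompts:
        raise ValueError("need one dataset size per client prompt")
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[d] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for d in range(dim)
    ]
```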

[CV-84] Neural Reflectance Fields for Radio-Frequency Ray Tracing

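[Quick Read]: This paper addresses accurate estimation of material reflectivity in complex environments for radio-frequency (RF) signal propagation modeling. Scene geometry estimation has advanced considerably, but estimating material reflectivity efficiently and scalably in real-world environments remains hard. The key idea is to learn material reflectivity from the path loss of RF signals between transmitters and receivers, minimizing the gap between predicted and measured received power. The authors translate the neural reflectance field from optics to the RF domain, model both the amplitude and phase of RF signals to account for multipath effects, and propose a differentiable RF ray tracing framework that optimizes the neural reflectance field to match signal strength measurements. Simulation results show that the method learns reflection coefficients for all incident angles and predicts received power more accurately with significantly less training data than existing approaches.

Link: https://arxiv.org/abs/2501.02458
Authors: Haifeng Jia, Xinyi Chen, Yichen Wei, Yifei Sun, Yibo Pi
Affiliations: Shanghai Jiao Tong University; East China Branch of State Grid Corporation of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments: Accepted by IEEE Global Communications Conference 2024 (GLOBECOM’24)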

Abstract: Ray tracing is widely employed to model the propagation of radio-frequency (RF) signals in complex environments. The modelling performance greatly depends on how accurately the target scene can be depicted, including the scene geometry and surface material properties. The advances in computer vision and LiDAR make scene geometry estimation increasingly accurate, but scalable and efficient approaches to estimating material reflectivity in real-world environments are still lacking. In this work, we tackle this problem by learning the material reflectivity efficiently from the path loss of the RF signal from the transmitters to receivers. Specifically, we want the learned material reflection coefficients to minimize the gap between the predicted and measured powers of the receivers. We achieve this by translating the neural reflectance field from optics to RF domain by modelling both the amplitude and phase of RF signals to account for the multipath effects. We further propose a differentiable RF ray tracing framework that optimizes the neural reflectance field to match the signal strength measurements. We simulate a complex real-world environment for experiments and our simulation results show that the neural reflectance field can successfully learn the reflection coefficients for all incident angles. As a result, our approach achieves better accuracy in predicting the powers of receivers with significantly less training data compared to existing approaches.
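
Why both amplitude and phase matter for multipath can be seen in a small sketch: path contributions are summed as complex numbers before taking the power, so two paths can reinforce or cancel each other. The model below is a textbook simplification, not the paper's: it assumes free-space amplitude decay of λ/(4πd) per path and one complex reflection coefficient per bounce, and the name `received_power` and the `paths` format are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple

# Assumed path format: (total path length in metres, reflection coefficients along the path)
Path = Tuple[float, Sequence[complex]]


def received_power(paths: List[Path], wavelength: float, tx_power: float = 1.0) -> float:
    """Coherently sum all propagation paths and return the received power."""
    field = 0j
    for length, reflection_coeffs in paths:
        amplitude = wavelength / (4.0 * math.pi * length)  # free-space decay (assumption)
        gain = complex(1.0)
        for coeff in reflection_coeffs:  # each bounce scales and phase-shifts the wave
            gain *= coeff
        phase = cmath.exp(-2j * math.pi * length / wavelength)  # phase accumulated over distance
        field += amplitude * gain * phase
    return tx_power * abs(field) ** 2
```

In a learning setup along the lines described above, the reflection coefficients would be the quantities predicted by the neural reflectance field and tuned until the predicted power matches the measurements.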

[CV-85] Enhancing Contrastive Learning for Retinal Imaging via Adjusted Augmentation Scales

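[Quick Read]: This paper addresses the underperformance of contrastive learning in medical imaging. Contrastive learning works well on natural images, but results on medical images have been disappointing. The authors hypothesize that the dense distribution of medical images makes the pretext tasks harder, particularly the construction of positive and negative pairs. They study how different augmentation strategies affect performance, comparing strong and weak augmentation. Models pre-trained with weak augmentation perform better on several public datasets; on MESSIDOR2, AUROC rises from 0.838 to 0.848 and AUPR from 0.523 to 0.597. The key takeaway is that tuning the scale of augmentation is critical for effective contrastive learning in medical imaging.

Link: https://arxiv.org/abs/2501.02451
Authors: Zijie Cheng, Boxuan Li, André Altmann, Pearse A Keane, Yukun Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: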

Abstract:Contrastive learning, a prominent approach within self-supervised learning, has demonstrated significant effectiveness in developing generalizable models for various applications involving natural images. However, recent research indicates that these successes do not necessarily extend to the medical imaging domain. In this paper, we investigate the reasons for this suboptimal performance and hypothesize that the dense distribution of medical images poses challenges to the pretext tasks in contrastive learning, particularly in constructing positive and negative pairs. We explore model performance under different augmentation strategies and compare the results to those achieved with strong augmentations. Our study includes six publicly available datasets covering multiple clinically relevant tasks. We further assess the model’s generalizability through external evaluations. The model pre-trained with weak augmentation outperforms those with strong augmentation, improving AUROC from 0.838 to 0.848 and AUPR from 0.523 to 0.597 on MESSIDOR2, and showing similar enhancements across other datasets. Our findings suggest that optimizing the scale of augmentation is critical for enhancing the efficacy of contrastive learning in medical imaging.

[CV-86] GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection

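[Quick Read]: This paper addresses the security of collaborative perception in autonomous driving, specifically the performance degradation caused by adversarial message attacks from malicious agents. Existing defenses rely on single-shot outlier detection and ignore temporal correlations between messages, so subtle but harmful perturbations can bypass them. The paper reveals a new Blind Area Confusion (BAC) attack and proposes a defense framework, GCP (Guarded Collaborative Perception). GCP detects malicious agents with spatial-temporal awareness: a confidence-scaled spatial concordance loss maintains single-shot spatial consistency, while temporal anomalies are detected by reconstructing historical bird's eye view motion flows in low-confidence regions. A joint spatial-temporal Benjamini-Hochberg test then combines the anomaly results from both domains for reliable detection. Experiments show that GCP clearly outperforms existing defenses across attack scenarios, with up to 34.69% improvement in AP@0.5 under BAC attacks.

Link: https://arxiv.org/abs/2501.02450
Authors: Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang
Affiliations: Department of Computer Science, City University of Hong Kong; Department of Robotics, University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages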

Abstract:Collaborative perception significantly enhances autonomous driving safety by extending each vehicle’s perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect malicious agents based on single-shot outliers, they overlook temporal message correlations, which can be circumvented by subtle yet harmful perturbations in model input and output spaces. This paper reveals a novel blind area confusion (BAC) attack that compromises existing single-shot outlier-based detection methods. As a countermeasure, we propose GCP, a Guarded Collaborative Perception framework based on spatial-temporal aware malicious agent detection, which maintains single-shot spatial consistency through a confidence-scaled spatial concordance loss, while simultaneously examining temporal anomalies by reconstructing historical bird’s eye view motion flows in low-confidence regions. We also employ a joint spatial-temporal Benjamini-Hochberg test to synthesize dual-domain anomaly results for reliable malicious agent detection. Extensive experiments demonstrate GCP’s superior performance under diverse attack scenarios, achieving up to 34.69% improvements in AP@0.5 compared to the state-of-the-art CP defense strategies under BAC attacks, while maintaining consistent 5-8% improvements under other typical attacks. Code will be released at this https URL.
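
The Benjamini-Hochberg test named above is a standard procedure for controlling the false discovery rate when many hypotheses are tested at once. The sketch below implements the generic procedure only; how GCP derives per-agent p-values from its spatial and temporal anomaly scores is not described here, so the input is simply an assumed list of p-values, one per candidate agent.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a rejection flag per hypothesis (True = flagged as anomalous).

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected
```

Compared with thresholding each p-value at `alpha` independently, this step-up rule stays conservative as the number of tested agents grows.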

[CV-87] MedSegDiffNCA: Diffusion Models With Neural Cellular Automata for Skin Lesion Segmentation

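[Quick Read]: This paper addresses the high computational overhead of Denoising Diffusion Models (DDMs) for medical image segmentation, which stems from their reliance on Unet architectures and is most severe on high-resolution images. It proposes three improvements based on Neural Cellular Automata (NCA). First, Multi-MedSegDiffNCA uses a multilevel NCA framework to refine the rough noise estimates produced by lower-level NCA models. Second, CBAM-MedSegDiffNCA adds channel and spatial attention to improve segmentation accuracy. Third, MultiCBAM-MedSegDiffNCA combines both and adds a new RGB channel loss for semantic guidance. On lesion segmentation, MultiCBAM-MedSegDiffNCA reaches a Dice score of 87.84%, comparable to Unet-based models, with 60-110 times fewer parameters, offering a more efficient option for low-resource medical settings.

Link: https://arxiv.org/abs/2501.02447
Authors: Avni Mittal, John Kalkhof, Anirban Mukhopadhyay, Arnav Bhavsar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 5 pages, 3 figures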

Abstract: Denoising Diffusion Models (DDMs) are widely used for high-quality image generation and medical image segmentation but often rely on Unet-based architectures, leading to high computational overhead, especially with high-resolution images. This work proposes three NCA-based improvements for diffusion-based medical image segmentation. First, Multi-MedSegDiffNCA uses a multilevel NCA framework to refine rough noise estimates generated by lower-level NCA models. Second, CBAM-MedSegDiffNCA incorporates channel and spatial attention for improved segmentation. Third, MultiCBAM-MedSegDiffNCA combines these methods with a new RGB channel loss for semantic guidance. Evaluations on lesion segmentation show that MultiCBAM-MedSegDiffNCA matches Unet-based model performance with a Dice score of 87.84% while using 60-110 times fewer parameters, offering a more efficient solution for low-resource medical settings.
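
The Dice score quoted above measures overlap between a predicted mask and the ground-truth mask: twice the intersection divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / total
```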

[CV-88] Unsupervised Search for Ethnic Minorities Medical Segmentation Training Set

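[Quick Read]: This paper examines a critical issue in medical imaging datasets: racial disparities caused by uneven population distribution during data collection. Medical segmentation datasets are significantly biased, mainly by the demographic composition of their collection sites. For example, Scanning Laser Ophthalmoscopy (SLO) fundus datasets collected in the United States consist mostly of images of White individuals, with minority groups underrepresented. This imbalance can bias model performance and lead to inequitable clinical outcomes for minority populations. The paper proposes a training set search strategy that reduces these biases by focusing on underrepresented racial groups. The strategy draws on existing datasets and uses a simple greedy algorithm to identify source images that closely match the target domain distribution. Selecting training data that better matches the characteristics of minority populations improves segmentation accuracy on specific minority groups, such as Black individuals. Experiments show the approach is effective at mitigating bias. The paper also discusses broader societal implications, stressing that addressing these disparities contributes to more equitable healthcare outcomes.

Link: https://arxiv.org/abs/2501.02442
Authors: Yixiao Chen, Yue Yao, Ruining Yang, Md Zakir Hossain, Ashu Gupta, Tom Gedeon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: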

Abstract:This article investigates the critical issue of dataset bias in medical imaging, with a particular emphasis on racial disparities caused by uneven population distribution in dataset collection. Our analysis reveals that medical segmentation datasets are significantly biased, primarily influenced by the demographic composition of their collection sites. For instance, Scanning Laser Ophthalmoscopy (SLO) fundus datasets collected in the United States predominantly feature images of White individuals, with minority racial groups underrepresented. This imbalance can result in biased model performance and inequitable clinical outcomes, particularly for minority populations. To address this challenge, we propose a novel training set search strategy aimed at reducing these biases by focusing on underrepresented racial groups. Our approach utilizes existing datasets and employs a simple greedy algorithm to identify source images that closely match the target domain distribution. By selecting training data that aligns more closely with the characteristics of minority populations, our strategy improves the accuracy of medical segmentation models on specific minorities, i.e., Black. Our experimental results demonstrate the effectiveness of this approach in mitigating bias. We also discuss the broader societal implications, highlighting how addressing these disparities can contribute to more equitable healthcare outcomes.

[CV-89] FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

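[Quick Read]: This paper addresses the challenge that multi-modal large language models (MLLMs) face when processing the long sequences of visual tokens extracted from visual backbones, in particular the excessive compute and memory demands in real-time applications. The authors propose FOLDER, a simple yet effective plug-and-play module that shortens the visual token sequence and thereby reduces compute and memory overhead during both training and inference. The key to FOLDER is an analysis of the information loss introduced by different token reduction strategies, which leads to a mechanism that removes visual redundancy while preserving key information. Experiments show that, when integrated into the visual backbones of several MLLMs, FOLDER significantly accelerates inference and in some cases even improves model performance, while removing up to 70% of visual tokens and substantially lowering model complexity.

Link: https://arxiv.org/abs/2501.02430
Authors: Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Enzo Tartaglione
Affiliations: SJTU Paris Elite Institute of Technology, Shanghai Jiao Tong University; LTCI, Télécom Paris, Institut Polytechnique de Paris; University of Turin; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: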

Abstract:Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness for multi-modal tasks due to their abilities to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating both computational and memory demands during training and inference. Through a comprehensive analysis of the token reduction process, we analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We showcase the effectiveness of FOLDER by integrating it into the visual backbone of several MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator or even performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.

[CV-90] MetaNeRV: Meta Neural Representations for Videos with Spatial-Temporal Guidance AAAI2025

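[Quick Read]: This paper addresses the inefficiency of Neural Representations for Videos (NeRV) methods when handling large numbers of diverse videos. Conventional NeRV methods train a separate model from scratch for every video, which is slow and compute-intensive on large video collections. In addition, NeRV methods must spatially generate a high-dimensional signal (an entire image) from a low-dimensional timestamp input, while a video typically consists of tens of frames with only minor changes between adjacent frames, which further increases computational complexity.

To address these issues, the paper proposes Meta Neural Representations for Videos (MetaNeRV), a framework designed to adapt quickly to unseen videos. The solution has two key parts. First, a meta-learning framework learns an optimal parameter initialization that serves as a good starting point for adapting to new videos. Second, spatial-temporal guidance improves representation capability: a multi-resolution loss captures information from different resolution stages, and a progressive learning strategy gradually refines the number of fitted frames. Experiments show clear advantages for MetaNeRV on video representation and video compression.

Link: https://arxiv.org/abs/2501.02427
Authors: Jialong Guo, Ke liu, Jiangchao Yao, Zhihua Wang, Jiajun Bu, Haishuai Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025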

Abstract:Neural Representations for Videos (NeRV) has emerged as a promising implicit neural representation (INR) approach for video analysis, which represents videos as neural networks with frame indexes as inputs. However, NeRV-based methods are time-consuming when adapting to a large number of diverse videos, as each video requires a separate NeRV model to be trained from scratch. In addition, NeRV-based methods spatially require generating a high-dimension signal (i.e., an entire image) from the input of a low-dimension timestamp, and a video typically consists of tens of frames temporally that have a minor change between adjacent frames. To improve the efficiency of video representation, we propose Meta Neural Representations for Videos, named MetaNeRV, a novel framework for fast NeRV representation for unseen videos. MetaNeRV leverages a meta-learning framework to learn an optimal parameter initialization, which serves as a good starting point for adapting to new videos. To address the unique spatial and temporal characteristics of video modality, we further introduce spatial-temporal guidance to improve the representation capabilities of MetaNeRV. Specifically, the spatial guidance with a multi-resolution loss aims to capture the information from different resolution stages, and the temporal guidance with an effective progressive learning strategy could gradually refine the number of fitted frames during the meta-learning process. Extensive experiments conducted on multiple datasets demonstrate the superiority of MetaNeRV for video representations and video compression.

[CV-91] Journey into Automation: Image-Derived Pavement Texture Extraction and Evaluation

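[Quick Read]: This paper addresses a key problem in assessing the skid resistance of asphalt pavements: how to extract pavement texture features and compute the Mean Texture Depth (MTD) accurately and efficiently. The core of the solution is an automated system that acquires three-dimensional (3D) pavement texture data with an economical method and extracts multi-dimensional texture features using enhanced 3D image processing techniques. The paper also builds multivariate prediction models that link the extracted features to MTD values. Among them, the Gradient Boosting Tree (GBT) model shows notable prediction stability and accuracy (R² = 0.9858), and field tests indicate that the method outperforms other techniques, with relative errors below 10%. The method provides a comprehensive end-to-end solution for pavement quality evaluation, from image input to MTD prediction output.

Link: https://arxiv.org/abs/2501.02414
Authors: Bingjie Lu(1),Han-Cheng Dan(1),Yichen Zhang(1),Zhetao Huang(1) ((1) Central South University)
Affiliations: Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: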

Abstract:Mean texture depth (MTD) is pivotal in assessing the skid resistance of asphalt pavements and ensuring road safety. This study focuses on developing an automated system for extracting texture features and evaluating MTD based on pavement images. The contributions of this work are threefold: firstly, it proposes an economical method to acquire three-dimensional (3D) pavement texture data; secondly, it enhances 3D image processing techniques and formulates features that represent various aspects of texture; thirdly, it establishes multivariate prediction models that link these features with MTD values. Validation results demonstrate that the Gradient Boosting Tree (GBT) model achieves remarkable prediction stability and accuracy (R² = 0.9858), and field tests indicate the superiority of the proposed method over other techniques, with relative errors below 10%. This method offers a comprehensive end-to-end solution for pavement quality evaluation, from image input to MTD prediction output.
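
For readers unfamiliar with the two figures quoted above, the sketch below shows how a coefficient of determination (R²) and per-sample relative errors are usually computed. The MTD values are made up for illustration and are not taken from the paper; the function names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def relative_errors(y_true, y_pred):
    """Absolute prediction error divided by the measured value, per sample."""
    return [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)]

# Hypothetical MTD measurements and predictions in millimetres.
measured = [0.60, 0.80, 1.00, 1.20]
predicted = [0.62, 0.78, 1.03, 1.17]

score = r_squared(measured, predicted)
worst_relative_error = max(relative_errors(measured, predicted))
```

An R² close to 1 means the predictions explain almost all of the variance in the measured values, and a relative error below 0.10 corresponds to the "below 10%" criterion in the abstract.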

[CV-92] Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models

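[Quick Read]: This paper addresses the risk that text-guided image-to-image diffusion models are misused to spread misinformation, infringe copyrights, and evade content tracing. It introduces a task called ID^2: identifying the original image from which a generated image was produced. The key to the solution is a theoretically guaranteed method: the authors prove that there exists a linear transformation that minimizes the distance between generated samples and their origins in the embedding space of a pre-trained Variational Autoencoder (VAE). The method copes with the visual discrepancies between different diffusion models, and experiments confirm a clear advantage in cross-model generalization, with mean average precision (mAP) 31.6% higher than similarity-based methods. The paper also contributes a dataset, OriPID, for training and testing identification models across different diffusion models.

Link: https://arxiv.org/abs/2501.02376
Authors: Wenhao Wang, Yifan Sun, Zongxin Yang, Zhentao Tan, Zhengdong Hu, Yi Yang
Affiliations: University of Technology Sydney; Baidu Inc.; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: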

Abstract:Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for spreading misinformation, infringing on copyrights, and evading content tracing. This motivates us to introduce the task of origin IDentification for text-guided Image-to-image Diffusion models (ID^2), aiming to retrieve the original image of a given translated query. A straightforward solution to ID^2 involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to visual discrepancy across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID^2 task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, OriPID, contains abundant Origins and guided Prompts, which can be used to train and test potential IDentification models across various diffusion models. In the method section, we first prove the existence of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder (VAE) embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be generalized across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods (+31.6% mAP), even those with generalization designs.
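
One simple way to make "a linear transformation that minimizes the distance" concrete is an ordinary least-squares fit. The formulation below is an illustrative assumption, not the objective stated in the paper, whose exact form the abstract does not give. Here $z_i$ denotes the VAE embedding of a generated image, $o_i$ the embedding of its origin, and $Z$, $O$ stack these as columns; all of these symbols are introduced only for this sketch.

```latex
W^\star \;=\; \arg\min_{W} \sum_{i=1}^{n} \lVert W z_i - o_i \rVert_2^2
       \;=\; \arg\min_{W} \lVert W Z - O \rVert_F^2 ,
\qquad Z = [z_1, \dots, z_n], \quad O = [o_1, \dots, o_n].

% Setting the gradient 2 (W Z - O) Z^\top to zero gives the normal equations
W^\star Z Z^\top = O Z^\top
\quad\Longrightarrow\quad
W^\star = O Z^\top \left( Z Z^\top \right)^{-1}
\quad \text{(when } Z Z^\top \text{ is invertible).}
```

Under this reading, retrieval would compare $W^\star z$ for a query against the stored origin embeddings.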

[CV-93] Understanding How Nonlinear Layers Create Linearly Separable Features for Low-Dimensional Data

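[Quick Read]: This paper addresses the lack of theoretical support for the linear separability that deep neural networks (DNNs) exhibit in classification tasks. Empirical studies show that deep networks learn linearly separable features, but these findings often lack rigorous mathematical justification, even in relatively simple settings. The paper fills this gap by studying the linear separation capability of shallow nonlinear networks.

The key idea is to model the input data as a union of low-dimensional subspaces (UoS) and to prove that a single nonlinear layer suffices to transform such data into linearly separable sets. The theoretical analysis shows that, with random weights and quadratic activations, this transformation succeeds with high probability. In particular, the paper proves that linear separation is achievable when the network width scales polynomially with the intrinsic dimension of the data rather than with the ambient dimension. Experiments corroborate the theory and show that similar linear separation properties hold in practical scenarios. The work gives theoretical support for the separation capacity of nonlinear networks and deepens the understanding of model interpretability and generalization.

Link: https://arxiv.org/abs/2501.02364
Authors: Alec S. Xu, Can Yaras, Peng Wang, Qing Qu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 32 pages, 9 figures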

Abstract:Deep neural networks have attained remarkable success across diverse classification tasks. Recent empirical studies have shown that deep networks learn features that are linearly separable across classes. However, these findings often lack rigorous justifications, even under relatively simple settings. In this work, we address this gap by examining the linear separation capabilities of shallow nonlinear networks. Specifically, inspired by the low intrinsic dimensionality of image data, we model inputs as a union of low-dimensional subspaces (UoS) and demonstrate that a single nonlinear layer can transform such data into linearly separable sets. Theoretically, we show that this transformation occurs with high probability when using random weights and quadratic activations. Notably, we prove this can be achieved when the network width scales polynomially with the intrinsic dimension of the data rather than the ambient dimension. Experimental results corroborate these theoretical findings and demonstrate that similar linear separation properties hold in practical scenarios beyond our analytical scope. This work bridges the gap between empirical observations and theoretical understanding of the separation capacity of nonlinear networks, offering deeper insights into model interpretability and generalization.
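
A short identity helps explain why quadratic activations are a natural fit for this question. It is a standard algebraic observation rather than the paper's proof, and the symbols $W$, $a$, $b$ and $M_a$ are introduced here only for illustration. With weight matrix $W \in \mathbb{R}^{m \times d}$ (rows $w_j^\top$) and the element-wise quadratic feature map $\phi$, any linear classifier on the features is a quadratic form in the input:

```latex
\phi(x) = (W x)^{\circ 2}, \qquad \phi_j(x) = (w_j^\top x)^2 ,

a^\top \phi(x) \;=\; \sum_{j=1}^{m} a_j \,(w_j^\top x)^2
             \;=\; x^\top \Big( \sum_{j=1}^{m} a_j\, w_j w_j^\top \Big) x
             \;=\; x^\top M_a\, x ,
\qquad M_a = W^\top \operatorname{diag}(a)\, W .
```

So separating two classes linearly in feature space amounts to finding $a$ and a threshold $b$ with $x^\top M_a x > b$ on one class and $x^\top M_a x < b$ on the other. The paper's contribution, per the abstract, is showing that with random $W$ this is possible with high probability once the width $m$ is polynomial in the intrinsic dimension.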

[CV-94] V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection

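[Quick Read]: This paper addresses the difficulty of information fusion in V2X (Vehicle-to-Everything) collaborative perception caused by domain gaps between heterogeneous nodes and by pose errors arising from latency and GPS localization noise. These problems lead to feature misalignment and degrade perception. The proposed V2X-DGPE framework tackles them as follows: (1) a Knowledge Distillation Framework and a Feature Compensation Module learn domain-invariant representations from multi-source data, reducing the feature distribution gap between vehicles and roadside infrastructure; (2) historical information gives the model a more comprehensive understanding of the current scene; (3) a Collaborative Fusion Module with a heterogeneous self-attention mechanism extracts and integrates heterogeneous representations from vehicles and infrastructure; (4) a deformable attention mechanism dynamically offsets sampling points so that the model adaptively focuses on critical parts of the input features, which mitigates the effect of pose errors. Experiments show that the method achieves state-of-the-art detection performance on the DAIR-V2X dataset.

Link: https://arxiv.org/abs/2501.02363
Authors: Sichao Wang, Chuang Zhang, Ming Yuan, Qing Xu, Lei He, Jianqiang Wang
Affiliations: School of Vehicle and Mobility, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: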

Abstract:In V2X collaborative perception, the domain gaps between heterogeneous nodes pose a significant challenge for effective information fusion. Pose errors arising from latency and GPS localization noise further exacerbate the issue by leading to feature misalignment. To overcome these challenges, we propose V2X-DGPE, a high-accuracy and robust V2X feature-level collaborative perception framework. V2X-DGPE employs a Knowledge Distillation Framework and a Feature Compensation Module to learn domain-invariant representations from multi-source data, effectively reducing the feature distribution gap between vehicles and roadside infrastructure. Historical information is utilized to provide the model with a more comprehensive understanding of the current scene. Furthermore, a Collaborative Fusion Module leverages a heterogeneous self-attention mechanism to extract and integrate heterogeneous representations from vehicles and infrastructure. To address pose errors, V2X-DGPE introduces a deformable attention mechanism, enabling the model to adaptively focus on critical parts of the input features by dynamically offsetting sampling points. Extensive experiments on the real-world DAIR-V2X dataset demonstrate that the proposed method outperforms existing approaches, achieving state-of-the-art detection performance. The code is available at this https URL.

[CV-95] CorrFill: Enhancing Faithfulness in Reference-based Inpainting with Correspondence Guidance in Diffusion Models WACV2025

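[Quick Read]: This paper addresses a weakness of existing diffusion models in reference-based image inpainting: they lack explicit constraints on the geometric correlations between the reference image and the damaged image, so the inpainted results are less faithful to the reference. The paper proposes CorrFill, a training-free module that introduces correspondence constraints during inpainting to strengthen awareness of the geometric correlations between the reference and target images. Specifically, CorrFill uses attention masking in self-attention layers together with an objective function that updates the input tensor according to the constraints, guiding the model to stay more faithful to the reference image. Experiments show that CorrFill significantly improves several diffusion-based baselines, including state-of-the-art methods.

Link: https://arxiv.org/abs/2501.02355
Authors: Kuan-Hung Liu, Cheng-Kun Yang, Min-Hung Chen, Yu-Lun Liu, Yen-Yu Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025. Project page: this https URL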

Abstract:In the task of reference-based image inpainting, an additional reference image is provided to restore a damaged target image to its original state. The advancement of diffusion models, particularly Stable Diffusion, allows for simple formulations in this task. However, existing diffusion-based methods often lack explicit constraints on the correlation between the reference and damaged images, resulting in lower faithfulness to the reference images in the inpainting results. In this work, we propose CorrFill, a training-free module designed to enhance the awareness of geometric correlations between the reference and target images. This enhancement is achieved by guiding the inpainting process with correspondence constraints estimated during inpainting, utilizing attention masking in self-attention layers and an objective function to update the input tensor according to the constraints. Experimental results demonstrate that CorrFill significantly enhances the performance of multiple baseline diffusion-based methods, including state-of-the-art approaches, by emphasizing faithfulness to the reference images.
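
The attention-masking ingredient can be illustrated with a generic masked softmax. This is a sketch of the general technique only: in CorrFill the mask would come from correspondences estimated during inpainting, whereas here it is written by hand, and the function name is hypothetical.

```python
import math

def masked_attention_weights(scores, mask):
    """Row-wise softmax over attention scores in which positions whose mask
    entry is False receive exactly zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        if not allowed:
            raise ValueError("each query needs at least one allowed key")
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# One query attending over three keys; the middle key is masked out.
scores = [[2.0, 1.0, 0.5]]
mask = [[True, False, True]]
attention = masked_attention_weights(scores, mask)
```

Masked keys contribute nothing to the output, so attention mass is redistributed over the positions the constraints allow.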

[CV-96] GNSS/GPS Spoofing and Jamming Identification Using Machine Learning and Deep Learning

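[Quick Read]: This paper addresses the vulnerability of Global Navigation Satellite Systems (GNSS), and the Global Positioning System (GPS) in particular, to malicious threats such as spoofing and jamming. GNSS is the backbone of positioning, navigation, and timing (PNT) and is widely used in transportation, telecommunications, and emergency services, but it lacks inherent security measures and is exposed to deliberate interference, with consequences ranging from navigational errors in civilian aviation to security breaches in military operations. The paper uses machine learning and deep learning to strengthen the detection and mitigation of these threats. Through extensive experiments with advanced algorithms on real-world datasets, it reaches about 99% accuracy on GNSS/GPS jamming detection, roughly 5% better than previous studies. It also reports strong results on spoofing detection, illustrating the potential of these techniques for GNSS security.

Link: https://arxiv.org/abs/2501.02352
Authors: Ali Ghanbarzade, Hossein Soleimani
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: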

Abstract:The increasing reliance on Global Navigation Satellite Systems (GNSS), particularly the Global Positioning System (GPS), underscores the urgent need to safeguard these technologies against malicious threats such as spoofing and jamming. As the backbone for positioning, navigation, and timing (PNT) across various applications, including transportation, telecommunications, and emergency services, GNSS is vulnerable to deliberate interference that poses significant risks. Spoofing attacks, which involve transmitting counterfeit GNSS signals to mislead receivers into calculating incorrect positions, can result in serious consequences, from navigational errors in civilian aviation to security breaches in military operations. Furthermore, the lack of inherent security measures within GNSS systems makes them attractive targets for adversaries. While GNSS/GPS jamming and spoofing systems consist of numerous components, the ability to distinguish authentic signals from malicious ones is essential for maintaining system integrity. Recent advancements in machine learning and deep learning provide promising avenues for enhancing detection and mitigation strategies against these threats. This paper addresses both spoofing and jamming by tackling real-world challenges through machine learning, deep learning, and computer vision techniques. Through extensive experiments on two real-world datasets related to spoofing and jamming detection using advanced algorithms, we achieved state-of-the-art results. In the GNSS/GPS jamming detection task, we attained approximately 99% accuracy, improving performance by around 5% compared to previous studies. Additionally, we addressed a challenging task related to spoofing detection, yielding results that underscore the potential of machine learning and deep learning in this domain.

[CV-97] Revelio: A Real-World Screen-Camera Communication System with Visually Imperceptible Data Embedding

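[Quick Read]: This paper addresses visual imperceptibility and noise robustness for data embedding in screen-camera communication systems. Existing methods often fail to decode reliably under noise, asynchronicity, and distortion. The authors propose the Revelio system, whose key ideas are: (1) temporal flicker fusion in the OKLAB color space, with spatially adaptive flicker patterns that encode information in the shapes of pixel regions, so that the embedded data is visually imperceptible; (2) a decoder driven by a two-stage neural network, combined with a weighted differential accumulator for precise frame detection and symbol recognition, so that standard smartphone cameras can decode reliably. Experiments show that Revelio can transmit meta-information effectively in scenarios such as interactive television and is robust to noise and distortion.

Link: https://arxiv.org/abs/2501.02349
Authors: Abbaas Alif Mohamed Nishar, Shrinivas Kudekar, Bernard Kintzing, Ashwin Ashok
Affiliations: Unknown
Categories: Multimedia (cs.MM); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, 6 Figures, 1 Table, Accepted at IEEE International Conference on Acoustic, Speech, and Signal Processing 2025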

Abstract:We present 'Revelio', a real-world screen-camera communication system leveraging temporal flicker fusion in the OKLAB color space. Using spatially-adaptive flickering and encoding information in pixel region shapes, Revelio achieves visually imperceptible data embedding while remaining robust against noise, asynchronicity, and distortions in screen-camera channels, ensuring reliable decoding by standard smartphone cameras. The decoder, driven by a two-stage neural network, uses a weighted differential accumulator for precise frame detection and symbol recognition. Initial experiments demonstrate Revelio's effectiveness in interactive television, offering an unobtrusive method for meta-information transmission.
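
The idea behind temporal flicker fusion can be shown with a toy scalar model. This is a heavily simplified sketch, not Revelio's scheme: it uses a single lightness value instead of OKLAB pixels and region shapes, and the functions `embed_bit` and `decode_bit`, the step `delta`, and the uniform weights are all assumptions made for illustration.

```python
def embed_bit(base, bit, delta=0.02):
    """Return two consecutive frame values whose average equals `base`.
    The order of the +delta / -delta pair carries the bit."""
    sign = 1.0 if bit else -1.0
    return (base + sign * delta, base - sign * delta)

def decode_bit(frame_pairs, weights=None):
    """Accumulate weighted differences between paired frames and
    read the bit from the sign of the accumulated value."""
    if weights is None:
        weights = [1.0] * len(frame_pairs)
    accumulator = sum(w * (first - second)
                      for w, (first, second) in zip(weights, frame_pairs))
    return 1 if accumulator > 0 else 0

# Repeat the same bit over three frame pairs, as a noisy channel might require.
pairs_for_one = [embed_bit(0.5, 1) for _ in range(3)]
pairs_for_zero = [embed_bit(0.5, 0) for _ in range(3)]
perceived = sum(pairs_for_one[0]) / 2  # what a viewer would see on average
```

At a high enough refresh rate the eye averages the two frames and perceives only the base value, while a camera that samples individual frames can recover the sign of the difference.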

[CV-98] Accurate Crop Yield Estimation of Blueberries using Deep Learning and Smart Drones

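[Quick Read]: This paper addresses the accuracy of fruit count and yield estimation in blueberry fields. The solution uses smart drones equipped with computer vision, together with two object-detection models based on the YOLO deep learning architecture: a Bush Model that detects blueberry bushes in images captured at low altitude and from different angles, and a Berry Model that detects the individual berries visible on a bush. Working together, the two models let the drone intelligently adjust its position and camera angle to safely capture close-up side-view images of bushes, which yields more accurate yield estimates. The paper also reports the precision and recall of the models when images are cropped around the bush at the center of the foreground, describes sampling strategies for deploying the models to map blueberry fields, and discusses the challenges of annotating very small objects such as blueberries and of evaluating model effectiveness.

Link: https://arxiv.org/abs/2501.02344
Authors: Hieu D. Nguyen, Brandon McHenry, Thanh Nguyen, Harper Zappone, Anthony Thompson, Chau Tran, Anthony Segrest, Luke Tonon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages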

Abstract:We present an AI pipeline that involves using smart drones equipped with computer vision to obtain a more accurate fruit count and yield estimation of the number of blueberries in a field. The core components are two object-detection models based on the YOLO deep learning architecture: a Bush Model that is able to detect blueberry bushes from images captured at low altitudes and at different angles, and a Berry Model that can detect individual berries that are visible on a bush. Together, both models allow for more accurate crop yield estimation by allowing intelligent control of the drone’s position and camera to safely capture side-view images of bushes up close. In addition to providing experimental results for our models, which show good accuracy in terms of precision and recall when captured images are cropped around the foreground center bush, we also describe how to deploy our models to map out blueberry fields using different sampling strategies, and discuss the challenges of annotating very small objects (blueberries) and difficulties in evaluating the effectiveness of our models.

[CV-99] RadarNeXt: Real-Time and Reliable 3D Object Detector Based On 4D mmWave Imaging Radar

[Quick Read]: This paper addresses the real-time performance and reliability of 3D object detection for Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). Most current 3D detectors pursue detection accuracy and overlook network inference speed in practical applications. The paper proposes RadarNeXt, a real-time and reliable 3D object detector based on 4D mmWave radar point clouds. Its key components are: (1) re-parameterizable neural networks that capture multi-scale features while reducing memory cost and accelerating inference; (2) a Multi-path Deformable Foreground Enhancement Network (MDFEN) that highlights the irregular foreground features of radar point clouds and suppresses background clutter, preserving detection accuracy with minimal sacrifice in speed and parameter count. On the View-of-Delft and TJ4DRadSet datasets, the variant using MDFEN reaches 50.48 and 32.30 mAP (mean Average Precision), respectively, and RadarNeXt variants achieve inference speeds of over 67.10 FPS on an RTX A4000 GPU and 28.40 FPS on a Jetson AGX Orin.

Link: https://arxiv.org/abs/2501.02314
Authors: Liye Jia, Runwei Guan, Haocheng Zhao, Qiuchi Zhao, Ka Lok Man, Jeremy Smith, Limin Yu, Yutao Yue
Affiliations: Institute of Deep Perception Technology, JITRI, Wuxi, China; SAT, Xi’an Jiaotong-Liverpool University, Suzhou, China; Department of EEE, University of Liverpool, Liverpool; XJTLU-JITRI Academy of Industrial Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China; School of Automation Science and Electrical Engineering, Beihang University, Beijing, China; Thrust of Artificial Intelligence, HKUST (GZ), Guangzhou, China; Thrust of Artificial Intelligence and Thrust of Intelligent Transportation, HKUST (GZ), Guangzhou, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 3 tables. Code: this https URL

Abstract:3D object detection is crucial for Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). However, most 3D detectors prioritize detection accuracy, often overlooking network inference speed in practical applications. In this paper, we propose RadarNeXt, a real-time and reliable 3D object detector based on the 4D mmWave radar point clouds. It leverages the re-parameterizable neural networks to catch multi-scale features, reduce memory cost and accelerate the inference. Moreover, to highlight the irregular foreground features of radar point clouds and suppress background clutter, we propose a Multi-path Deformable Foreground Enhancement Network (MDFEN), ensuring detection accuracy while minimizing the sacrifice of speed and excessive number of parameters. Experimental results on View-of-Delft and TJ4DRadSet datasets validate the exceptional performance and efficiency of RadarNeXt, achieving 50.48 and 32.30 mAPs with the variant using our proposed MDFEN. Notably, our RadarNeXt variants achieve inference speeds of over 67.10 FPS on the RTX A4000 GPU and 28.40 FPS on the Jetson AGX Orin. This research demonstrates that RadarNeXt brings a novel and effective paradigm for 3D perception based on 4D mmWave radar.
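
Structural re-parameterization, the idea behind "re-parameterizable neural networks", can be demonstrated in one dimension. The sketch below is a generic RepVGG-style illustration under simplifying assumptions (no batch normalization, no bias, a single channel); it is not RadarNeXt's architecture, and the function names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Cross-correlation with zero padding so the output keeps the input length.
    The kernel length is assumed to be odd."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def merge_branches(kernel3, kernel1):
    """Fold a 3-tap branch, a 1-tap branch and an identity branch
    into a single equivalent 3-tap kernel."""
    merged = list(kernel3)
    merged[1] += kernel1[0] + 1.0  # the 1-tap and identity branches act on the center tap
    return merged

signal = [1.0, 2.0, -1.0, 0.5]
k3, k1 = [0.2, -0.5, 0.3], [0.7]

# Training-time form: three parallel branches summed together.
multi_branch = [a + b + c for a, b, c in
                zip(conv1d_same(signal, k3), conv1d_same(signal, k1), signal)]
# Inference-time form: one convolution with the merged kernel.
single_branch = conv1d_same(signal, merge_branches(k3, k1))
```

Because convolution is linear, the merged kernel reproduces the multi-branch output exactly, which is why inference can run with fewer operations and less memory traffic.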

[CV-100] Hyperbolic Contrastive Learning for Hierarchical 3D Point Cloud Embedding

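[Quick Read]: This paper addresses how to model complex hierarchical structures more effectively in multi-modal data that includes text, images, and 3D point clouds. Hyperbolic spaces have performed well in language-image pre-training, but their ability to unify language, image, and 3D point cloud modalities remains under-explored. The paper extends hyperbolic multi-modal contrastive pre-training to include the 3D point cloud modality. The key to the solution is three regularizers (entailment, modality gap, and alignment) that help learn the intra-modal hierarchy within each modality and the inter-modal hierarchy across text, 2D images, and 3D point clouds, which facilitates knowledge transfer from the text and image modalities to 3D point clouds. Experiments show that the training strategy markedly improves the 3D point cloud encoder and achieves strong results on several downstream tasks.

Link: https://arxiv.org/abs/2501.02285
Authors: Yingjie Liu, Pengyu Zhang, Ziyao He, Mingsong Chen, Xuan Tang, Xian Wei
Affiliations: East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: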

Abstract:Hyperbolic spaces allow for more efficient modeling of complex, hierarchical structures, which is particularly beneficial in tasks involving multi-modal data. Although hyperbolic geometries have been proven effective for language-image pre-training, their capabilities to unify language, image, and 3D Point Cloud modalities are under-explored. We extend the 3D Point Cloud modality in hyperbolic multi-modal contrastive pre-training. Additionally, we explore the entailment, modality gap, and alignment regularizers for learning hierarchical 3D embeddings and facilitating the transfer of knowledge from both Text and Image modalities. These regularizers enable the learning of intra-modal hierarchy within each modality and inter-modal hierarchy across text, 2D images, and 3D Point Clouds. Experimental results demonstrate that our proposed training strategy yields an outstanding 3D Point Cloud encoder, and the obtained 3D Point Cloud hierarchical embeddings significantly improve performance on various downstream tasks.
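
To give a feel for why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball model. The choice of model is an assumption made for illustration; the abstract does not say which model of hyperbolic space the paper uses (the Lorentz model is a common alternative).

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball:
    arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

# Two pairs of points separated by the same angle, at different radii.
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
```

Distances grow rapidly toward the boundary of the ball, so there is room for exponentially many well-separated "leaf" embeddings far from the origin while general concepts sit near the center.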

[CV-101] Efficient Video-Based ALPR System Using YOLO and Visual Rhythm

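[Quick Read]: This paper addresses the reliance of Automatic License Plate Recognition (ALPR) systems on multiple frames for plate detection and recognition. Conventional video-based ALPR systems typically extract information from several frames to detect the vehicle and recognize its plate. The paper instead proposes a system that extracts exactly one frame per vehicle and recognizes the license plate characters from that single image using an Optical Character Recognition (OCR) model. Early experiments indicate that the approach is viable.

Link: https://arxiv.org/abs/2501.02270
Authors: Victor Nascimento Ribeiro, Nina S. T. Hirata
Affiliations: Instituto de Matemática e Estatística - Universidade de São Paulo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2024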

Abstract:Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate information from an image or video capture. These systems have gained popularity due to the wide availability of low-cost surveillance cameras and advances in Deep Learning. Typically, video-based ALPR systems rely on multiple frames to detect the vehicle and recognize the license plates. Therefore, we propose a system capable of extracting exactly one frame per vehicle and recognizing its license plate characters from this single image using an Optical Character Recognition (OCR) model. Early experiments show that this methodology is viable.

[CV-102] TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration

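[Quick Read]: This paper addresses a limitation of video restoration: conventional methods usually need a dedicated model trained for each restoration task. The authors propose a diffusion-based all-in-one video restoration method that uses a pre-trained Stable Diffusion model and a fine-tuned ControlNet to restore multiple types of video degradation. The key elements are: (1) an efficient training strategy with Task Prompt Guidance (TPG) that supports diverse restoration tasks; (2) an inference strategy that combines Denoising Diffusion Implicit Models (DDIM) inversion with a Sliding Window Cross-Frame Attention (SW-CFA) mechanism to improve content preservation and temporal consistency; (3) a scalable pipeline that lets the method adapt to different video restoration tasks. Experiments show that the method outperforms existing state-of-the-art methods in generalization and temporal consistency, providing a unified solution for video restoration.

Link: https://arxiv.org/abs/2501.02269
Authors: Yizhou Li, Zihua Liu, Yusuke Monno, Masatoshi Okutomi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MMM2025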

Abstract:In this paper, we propose the first diffusion-based all-in-one video restoration method that utilizes the power of a pre-trained Stable Diffusion and a fine-tuned ControlNet. Our method can restore various types of video degradation with a single unified model, overcoming the limitation of standard methods that require specific models for each restoration task. Our contributions include an efficient training strategy with Task Prompt Guidance (TPG) for diverse restoration tasks, an inference strategy that combines Denoising Diffusion Implicit Models (DDIM) inversion with a novel Sliding Window Cross-Frame Attention (SW-CFA) mechanism for enhanced content preservation and temporal consistency, and a scalable pipeline that makes our method all-in-one to adapt to different video restoration tasks. Through extensive experiments on five video restoration tasks, we demonstrate the superiority of our method in generalization capability to real-world videos and temporal consistency preservation over existing state-of-the-art methods. Our method advances the video restoration task by providing a unified solution that enhances video quality across multiple applications.
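
The windowing part of a sliding-window cross-frame attention scheme can be sketched as an index computation. The centered, clipped window below is an assumption made for illustration; the abstract does not describe how SW-CFA places its window, and `sliding_window_frames` is a hypothetical helper.

```python
def sliding_window_frames(num_frames, frame_index, window_size):
    """Indices of the frames whose keys and values the given frame attends to,
    using a window centered on the frame and clipped at the video boundaries."""
    half = window_size // 2
    start = max(0, frame_index - half)
    end = min(num_frames, frame_index + half + 1)
    return list(range(start, end))

# For a 10-frame clip with a window of 3, list each frame's attention neighbors.
neighbors = [sliding_window_frames(10, i, 3) for i in range(10)]
```

Restricting attention to nearby frames keeps the cost linear in video length while still letting each frame share content with its neighbors, which is the usual motivation for such windows.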

[CV-103] What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

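[Quick Read]: This paper addresses the excessive computation and visual redundancy of Multimodal Large Language Models (MLLMs) when processing visual information. Existing MLLMs typically use a large number of visual tokens to compensate for their visual shortcomings, which creates unnecessary computational burden and obvious redundancy. The paper finds that both foreground and background tokens are critical for MLLMs, especially given the varying difficulty of examples. Based on this observation, the authors propose G-Prune, a graph-based, training-free visual token pruning method. It treats visual tokens as nodes in a graph and builds connections according to their semantic similarity. Information flow is propagated over weighted links, and after several iterations the most important tokens, which may be foreground or background tokens, are kept. Experiments show that G-Prune greatly reduces computation while retaining high performance on both coarse- and fine-grained tasks. For example, on VQA2.0 and TextVQA it cuts FLOPs by 63.57% with accuracy drops of only 0.95% and 2.34%, respectively.

Link: https://arxiv.org/abs/2501.02268
Authors: Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures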

Abstract:Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method towards training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes, and constructs their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be foreground or background tokens. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of benchmarks. The experiment results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce 63.57% FLOPs of LLaVA-NeXT on VQA2.0 and TextVQA with only 0.95% and 2.34% accuracy drops, respectively.
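
The graph view of token pruning can be sketched with a PageRank-style propagation. This is a stand-in under explicit assumptions, not G-Prune's actual update rule: edges are cosine similarities above a hypothetical `threshold`, scores are spread along normalized outgoing edges for a fixed number of `iterations`, and the `keep` highest-scoring tokens survive. Token vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def prune_tokens(tokens, keep, threshold=0.5, iterations=5):
    """Return the sorted indices of the `keep` tokens with the highest
    propagated importance on a similarity graph."""
    n = len(tokens)
    # Edge weight = cosine similarity if above the threshold, otherwise no edge.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine(tokens[i], tokens[j])
                if sim > threshold:
                    weights[i][j] = sim
    out_totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [
            sum(scores[j] * weights[j][i] / out_totals[j]
                for j in range(n) if out_totals[j] > 0.0)
            for i in range(n)
        ]
        total = sum(new_scores)
        if total > 0.0:
            scores = [s / total for s in new_scores]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Three mutually similar tokens and one isolated token (toy 2-D features).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
kept = prune_tokens(tokens, keep=2)
```

In this toy graph the isolated token receives no incoming flow and is pruned. The paper stresses that background tokens can matter too, so its scoring is presumably designed to retain them when they carry information, which this simple stand-in does not attempt.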

[CV-104] Unsupervised Class Generation to Expand Semantic Segmentation Datasets

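[Quick Read]: This paper addresses the high cost of annotation in semantic segmentation and the difficulty of adding new classes. Semantic segmentation classifies at the pixel level, and labeling real images pixel by pixel is time-consuming and expensive. Existing solutions usually rely on synthetic data generators such as simulators or video games, but the resulting datasets are closed-set: new classes cannot be introduced without modifying the generation tool, which is often not public.

The key to the proposed solution is an unsupervised pipeline that uses Stable Diffusion and the Segment Anything Module (SAM) to generate class examples with segmentation masks, and then integrates the generated samples of new classes into semantic segmentation datasets. The method requires no changes to the underlying algorithms and only minimal user input, and it improves unsupervised domain adaptation methods by adding samples of new classes. Experiments show that models learn to segment the new classes effectively (51% average IoU) and also make fewer errors on existing classes, improving overall performance.

Link: https://arxiv.org/abs/2501.02264
Authors: Javier Montalvo, Álvaro García-Martín, Pablo Carballeira, Juan C. SanMiguel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: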

Abstract:Semantic segmentation is a computer vision task where classification is performed at a pixel level. Due to this, the process of labeling images for semantic segmentation is time-consuming and expensive. To mitigate this cost there has been a surge in the use of synthetically generated data (usually created using simulators or videogames) which, in combination with domain adaptation methods, can effectively learn how to segment real data. Still, these datasets have a particular limitation: due to their closed-set nature, it is not possible to include novel classes without modifying the tool used to generate them, which is often not public. Concurrently, generative models have made remarkable progress, particularly with the introduction of diffusion models, enabling the creation of high-quality images from text prompts without additional supervision. In this work, we propose an unsupervised pipeline that leverages Stable Diffusion and Segment Anything Module to generate class examples with an associated segmentation mask, and a method to integrate generated cutouts for novel classes in semantic segmentation datasets, all with minimal user input. Our approach aims to improve the performance of unsupervised domain adaptation methods by introducing novel samples into the training data without modifications to the underlying algorithms. With our methods, we show how models can not only effectively learn how to segment novel classes, with an average performance of 51% IoU, but also reduce errors for other, already existing classes, reaching a higher performance level overall.
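
Two pieces of this pipeline are easy to show on toy data: pasting a generated cutout's mask into an existing label map, and the per-class IoU metric quoted in the abstract. The sketch assumes label maps are small 2-D lists of integer class IDs; the helper names and the class ID 7 are hypothetical, and generating the cutout itself (Stable Diffusion plus SAM) is out of scope here.

```python
def paste_cutout(label_map, cutout_mask, top, left, class_id):
    """Return a copy of `label_map` with `class_id` written wherever the
    cutout mask is 1, with the mask's top-left corner placed at (top, left)."""
    result = [row[:] for row in label_map]
    for r, mask_row in enumerate(cutout_mask):
        for c, inside in enumerate(mask_row):
            rr, cc = top + r, left + c
            if inside and 0 <= rr < len(result) and 0 <= cc < len(result[0]):
                result[rr][cc] = class_id
    return result

def iou(pred, truth, class_id):
    """Intersection over union for a single class between two label maps."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p_hit, t_hit = p == class_id, t == class_id
            inter += p_hit and t_hit
            union += p_hit or t_hit
    return inter / union if union else float("nan")

labels = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
truth = paste_cutout(labels, [[1, 1], [0, 1]], top=1, left=1, class_id=7)
pred = paste_cutout(labels, [[1, 1], [0, 0]], top=1, left=1, class_id=7)
novel_class_iou = iou(pred, truth, class_id=7)
```

Here the prediction covers two of the three ground-truth pixels, giving an IoU of 2/3; the 51% figure in the abstract is the same metric averaged over the novel classes.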

[CV-105] MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control

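[Quick read]: This paper addresses facial expression editing, specifically editing a person's expression in a fine-grained, continuous and interpretable way by controlling the relative variation of that person's facial action units (AUs), while preserving identity, pose, background and detailed facial attributes. The key to the solution is the proposed MagicFace model, a diffusion model conditioned on AU variations and combined with an ID encoder that keeps facial details highly consistent. Concretely, the model builds on a pretrained Stable-Diffusion model and designs an ID encoder that merges appearance features through self-attention. To keep background and pose consistent, it introduces an efficient Attribute Controller that explicitly informs the model of the target's current background and pose. By injecting AU variations into the denoising UNet, the model can animate arbitrary identities with various AU combinations and achieves better high-fidelity expression editing than other facial expression editing work.

Link: https://arxiv.org/abs/2501.02260
Authors: Mengting Wei, Tuomas Varanka, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Affiliations: Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu; Key Laboratory of Child Development and Learning Science of Ministry of Education, School of Biological Sciences and Medical Engineering, Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: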

Abstract:We address the problem of facial expression editing by controlling the relative variation of facial action units (AUs) from the same person. This enables us to edit this specific person’s expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at this https URL.

[CV-106] Distillation-Enhanced Physical Adversarial Attacks

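[Quick read]: This paper addresses the trade-off between stealthiness and attack performance of physical adversarial patches against AI recognition systems. Although recent work has improved patch stealthiness to make patches more practical, attacking effectively while staying stealthy remains a major challenge. The paper proposes a new physical adversarial attack method based on knowledge distillation. The solution has three steps: first, define a stealthy color space tailored to the target environment so that the patch blends smoothly into the background; second, optimize an adversarial patch in an unconstrained color space to serve as the 'teacher' patch; finally, use an adversarial knowledge distillation module to transfer the teacher patch's knowledge to the 'student' patch, guiding the optimization of the stealthy patch. Experiments show that the method improves attack performance by 20% while maintaining stealthiness, demonstrating its practical value.

Link: https://arxiv.org/abs/2501.02232
Authors: Wei Liu, Yonglin Wu, Chaoqun Li, Zhuodong Liu, Huanqian Yan
Affiliations: Tsinghua University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures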

Abstract:The study of physical adversarial patches is crucial for identifying vulnerabilities in AI-based recognition systems and developing more robust deep learning models. While recent research has focused on improving patch stealthiness for greater practical applicability, achieving an effective balance between stealth and attack performance remains a significant challenge. To address this issue, we propose a novel physical adversarial attack method that leverages knowledge distillation. Specifically, we first define a stealthy color space tailored to the target environment to ensure smooth blending. Then, we optimize an adversarial patch in an unconstrained color space, which serves as the ‘teacher’ patch. Finally, we use an adversarial knowledge distillation module to transfer the teacher patch’s knowledge to the ‘student’ patch, guiding the optimization of the stealthy patch. Experimental results show that our approach improves attack performance by 20%, while maintaining stealth, highlighting its practical value.

[CV-107] Self-Supervised Learning for Detecting AI-Generated Faces as Anomalies

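[Quick read]: This paper addresses the generalization problem in AI-generated face detection: existing detectors struggle to adapt to rapidly evolving AI face generators. The key to the solution is an anomaly detection method based on self-supervised learning, which learns camera-intrinsic and face-specific features from photographic face images alone. A pretext task trains a feature extractor to rank four ordinal exchangeable image file format (EXIF) tags and to classify artificially manipulated face images. A Gaussian mixture model then models the learned feature distribution of photographic faces, and faces with low likelihood are flagged as AI-generated. Quantitative and qualitative experiments validate the method.

Link: https://arxiv.org/abs/2501.02207
Authors: Mian Zou, Baosheng Yu, Yibing Zhan, Kede Ma
Affiliations: City University of Hong Kong; Nanyang Technological University; JD Explore Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: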

Abstract:The detection of AI-generated faces is commonly approached as a binary classification task. Nevertheless, the resulting detectors frequently struggle to adapt to novel AI face generators, which evolve rapidly. In this paper, we describe an anomaly detection method for AI-generated faces by leveraging self-supervised learning of camera-intrinsic and face-specific features purely from photographic face images. The success of our method lies in designing a pretext task that trains a feature extractor to rank four ordinal exchangeable image file format (EXIF) tags and classify artificially manipulated face images. Subsequently, we model the learned feature distribution of photographic face images using a Gaussian mixture model. Faces with low likelihoods are flagged as AI-generated. Both quantitative and qualitative experiments validate the effectiveness of our method. Our code is available at this https URL.
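
The final detection step (score a face by its likelihood under a Gaussian mixture fitted to photographic faces, then flag low scores) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the mixture parameters are already fitted, assumes diagonal covariances for brevity, and uses hypothetical names (`gmm_log_likelihood`, `flag_ai_generated`) and a hand-picked threshold. The feature extractor and its EXIF-ranking pretext task are not shown.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of feature vector `x` under a diagonal-covariance Gaussian mixture."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(v - top) for v in component_logs))


def flag_ai_generated(features, weights, means, variances, threshold):
    """Mark each feature vector whose log-likelihood falls below `threshold`."""
    return [
        gmm_log_likelihood(x, weights, means, variances) < threshold
        for x in features
    ]
```

Since only photographic faces are modeled, nothing in this step depends on any particular generator, which is the motivation for framing detection as anomaly detection rather than binary classification.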

[CV-108] Accounting for Focus Ambiguity in Visual Questions

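[Quick read]: This paper addresses a problem that existing visual question answering (VQA) work does not explicitly handle: ambiguity about where in the image the content described in the question is located. To fill this gap, the authors introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each region described in the question that is necessary to arrive at the answer. The key to the solution is that the dataset separates visual grounding of the question from visual grounding of the answer; the authors also characterize the questions and segmentations in the dataset. They benchmark modern models on two new tasks: recognizing whether a visual question has focus ambiguity, and localizing all plausible focus regions in the image. The results show that the dataset is challenging for existing models. To support future research, the authors release the dataset publicly together with an evaluation server.

Link: https://arxiv.org/abs/2501.02201
Authors: Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li, Anush Venkatesh, Danna Gurari
Affiliations: University of Texas at Austin; University of Colorado Boulder
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: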

Abstract:No existing work on visual question answering explicitly accounts for ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each region described in the question that is necessary to arrive at the answer. We then provide an analysis showing how our dataset for visually grounding 'questions' is distinct from visually grounding 'answers', and characterize the properties of the questions and segmentations provided in our dataset. Finally, we benchmark modern models for two novel tasks: recognizing whether a visual question has focus ambiguity and localizing all plausible focus regions within the image. Results show that the dataset is challenging for modern models. To facilitate future progress on these tasks, we publicly share the dataset with an evaluation server at this https URL.

[CV-109] Learning Evolution via Optimization Knowledge Adaptation

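[Quick read]: This paper addresses the difficulty traditional evolutionary algorithms (EAs) have in dynamically adapting to expanding knowledge bases, in particular how to exploit accumulated historical population data and fitness evaluations to improve optimization and adaptability to new situations. It proposes the Optimization Knowledge Adaptation Evolutionary Model (OKAEM), whose key innovation is dynamic parameter adjustment that uses accumulated knowledge to strengthen optimization. OKAEM uses attention mechanisms to model, separately, the interactions among individuals, fitness landscapes, and genetic components, thereby parameterizing the evolutionary operators of selection, crossover, and mutation. These learnable operators let OKAEM benefit from extensive pre-learned prior knowledge and self-tune using real-time evolutionary insights, which significantly improves performance across various knowledge transfer settings.

Link: https://arxiv.org/abs/2501.02200
Authors: Chao Wang, Licheng Jiao, Jiaxuan Zhao, Lingling Li, Fang Liu, Shuyuan Yang
Affiliations: Xidian University
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This work has been submitted to Springer Nature for possible publication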

Abstract:Evolutionary algorithms (EAs) maintain populations through evolutionary operators to discover diverse solutions for complex tasks while gathering valuable knowledge, such as historical population data and fitness evaluations. However, traditional EAs face challenges in dynamically adapting to expanding knowledge bases, hindering the efficient exploitation of accumulated information and limiting adaptability to new situations. To address these issues, we introduce an Optimization Knowledge Adaptation Evolutionary Model (OKAEM), which features dynamic parameter adjustment using accumulated knowledge to enhance its optimization capabilities. OKAEM employs attention mechanisms to model the interactions among individuals, fitness landscapes, and genetic components separately, thereby parameterizing the evolutionary operators of selection, crossover, and mutation. These powerful learnable operators enable OKAEM to benefit from pre-learned extensive prior knowledge and self-tune with real-time evolutionary insights. Experimental results demonstrate that OKAEM: 1) exploits prior knowledge for significant performance gains across various knowledge transfer settings; 2) achieves competitive performance through self-tuning alone, even without prior knowledge; 3) outperforms state-of-the-art black-box baselines in a vision-language model tuning case; 4) can improve its optimization capabilities with growing knowledge; 5) is capable of emulating principles of natural selection and genetic recombination.

[CV-110] Fresh-CL: Feature Realignment through Experts on Hypersphere in Continual Learning

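[Quick read]: This paper addresses feature entanglement in continual learning: learning new tasks can entangle features across tasks, which limits the model's ability to distinguish new domains. The authors propose Feature Realignment through Experts on hyperSpHere in Continual Learning (Fresh-CL). The key is to use predefined, fixed simplex equiangular tight frame (ETF) classifiers and to project features onto a hypersphere, which improves intra-task and inter-task feature separation. Because the projection shifts when new tasks are introduced and disrupts the structured feature representation of earlier tasks, the authors further extend the ETF dynamically through a mixture of experts (MoE), so that projections adapt to different subspaces and improve feature distinction. Experiments on 11 datasets show a 2% accuracy improvement over the strongest baseline, particularly on fine-grained datasets, confirming that combining ETF and MoE improves feature distinction in continual learning.

Link: https://arxiv.org/abs/2501.02198
Authors: Zhongyi Zhou, Yaxin Peng, Pin Yi, Minjie Zhu, Chaomin Shen
Affiliations: School of Computer Science, East China Normal University; Department of Mathematics, Shanghai University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: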

Abstract:Continual Learning enables models to learn and adapt to new tasks while retaining prior knowledge. Learning new tasks, however, can naturally lead to feature entanglement across tasks, limiting the model’s capability to distinguish between new domains. In this work, we propose a method called Feature Realignment through Experts on hyperSpHere in Continual Learning (Fresh-CL). By leveraging predefined and fixed simplex equiangular tight frame (ETF) classifiers on a hypersphere, our model improves feature separation both intra and inter tasks. However, the projection to a simplex ETF shifts with new tasks, disrupting structured feature representation of previous tasks and degrading performance. Therefore, we propose a dynamic extension of ETF through mixture of experts, enabling adaptive projections onto diverse subspaces to enhance feature distinction. Experiments on 11 datasets demonstrate a 2% improvement in accuracy compared to the strongest baseline, particularly in fine-grained datasets, confirming the efficacy of combining ETF and MoE to improve feature distinction in continual learning scenarios.
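
For readers unfamiliar with simplex ETF classifiers, the sketch below builds one and checks its defining property. It uses the standard construction M = sqrt(K/(K-1)) (I - (1/K) 1 1^T) for K classes, with the optional rotation matrix taken as the identity (an assumption made here so that the feature dimension equals K). The function names are hypothetical, and Fresh-CL's mixture-of-experts extension is not shown.

```python
import math


def simplex_etf(num_classes):
    """Return `num_classes` fixed class vectors forming a simplex ETF.

    Row i is sqrt(K / (K - 1)) * (e_i - (1 / K) * ones), with K = num_classes.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Every class vector has unit norm and every pair has cosine similarity -1/(K-1), the largest equal separation K vectors can have on a hypersphere. Fixing the classifier to this geometry is what lets features be pulled toward well-separated, predefined targets.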

[CV-111] Phase Retrieval by Quaternionic Reweighted Amplitude Flow on Image Reconstruction

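[Quick read]: This paper addresses the quaternionic phase retrieval problem in quaternionic signal processing. Starting from an amplitude-based model, the authors systematically develop a family of new algorithms, including the Quaternionic Reweighted Amplitude Flow (QRAF) algorithm and three variants: incremental, accelerated, and adapted QRAF. They also propose the Quaternionic Perturbed Amplitude Flow (QPAF) algorithm, which converges linearly. The key idea is to handle color signals through quaternion algebra, which preserves the intrinsic correlations among signal dimensions and significantly improves recovery performance and computational efficiency. Numerical experiments show that the proposed methods outperform state-of-the-art approaches on synthetic data and real images.

Link: https://arxiv.org/abs/2501.02180
Authors: Ren Hu, Pan Lian
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Complex Variables (math.CV)
Comments: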

Abstract:Quaternionic signal processing provides powerful tools for efficiently managing color signals by preserving the intrinsic correlations among signal dimensions through quaternion algebra. In this paper, we address the quaternionic phase retrieval problem by systematically developing novel algorithms based on an amplitude-based model. Specifically, we propose the Quaternionic Reweighted Amplitude Flow (QRAF) algorithm, which is further enhanced by three of its variants: incremental, accelerated, and adapted QRAF algorithms. In addition, we introduce the Quaternionic Perturbed Amplitude Flow (QPAF) algorithm, which has linear convergence. Extensive numerical experiments on both synthetic data and real images demonstrate that our proposed methods significantly improve recovery performance and computational efficiency compared to state-of-the-art approaches.
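
To make the amplitude-based model concrete, here is a small sketch of quaternion arithmetic and an unweighted amplitude loss. It assumes the measurement convention y_i = |a_i^* x| and the plain least-squares loss (1/2m) Σ (|a_i^* x| - y_i)^2; the reweighting and perturbation that define QRAF and QPAF are not reproduced, and all function names are hypothetical. Quaternions are tuples (w, x, y, z).

```python
import math


def qmul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def qconj(q):
    """Quaternion conjugate."""
    return (q[0], -q[1], -q[2], -q[3])


def qnorm(q):
    """Quaternion modulus."""
    return math.sqrt(sum(c * c for c in q))


def measure(a, x):
    """Amplitude |a^* x| = |sum_k conj(a_k) x_k| for quaternion vectors a and x."""
    total = (0.0, 0.0, 0.0, 0.0)
    for ak, xk in zip(a, x):
        total = tuple(t + s for t, s in zip(total, qmul(qconj(ak), xk)))
    return qnorm(total)


def amplitude_loss(sensing_vectors, x, y):
    """Unweighted amplitude loss (1 / 2m) * sum_i (|a_i^* x| - y_i)^2."""
    m = len(y)
    return sum((measure(a, x) - yi) ** 2 for a, yi in zip(sensing_vectors, y)) / (2 * m)
```

Under this convention, multiplying every entry of x on the right by the same unit quaternion leaves all measurements unchanged, so the signal can only be recovered up to a global right quaternion phase factor.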

[CV-112] Generating Multimodal Images with GAN: Integrating Text, Image, and Style

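[Quick read]: This paper addresses a key problem in multimodal image generation: how to combine text descriptions, reference images, and style information effectively to generate images that meet multimodal requirements. The key to the solution is a multimodal image generation method based on Generative Adversarial Networks (GANs). The method designs a text encoder, an image feature extractor, and a style integration module so that generated images keep high quality in visual content and style consistency. The paper also introduces several loss functions, including adversarial loss, text-image consistency loss, and style matching loss, to optimize generation. Experiments show that the method produces images with high clarity and consistency on multiple public datasets and improves significantly on existing methods.

Link: https://arxiv.org/abs/2501.02167
Authors: Chaoyi Tan, Wenqing Zhang, Zhen Qi, Kowei Shih, Xinshi Li, Ao Xiang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: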

Abstract:In the field of computer vision, multimodal image generation has become a research hotspot, especially the task of integrating text, image, and style. In this study, we propose a multimodal image generation method based on Generative Adversarial Networks (GAN), capable of effectively combining text descriptions, reference images, and style information to generate images that meet multimodal requirements. This method involves the design of a text encoder, an image feature extractor, and a style integration module, ensuring that the generated images maintain high quality in terms of visual content and style consistency. We also introduce multiple loss functions, including adversarial loss, text-image consistency loss, and style matching loss, to optimize the generation process. Experimental results show that our method produces images with high clarity and consistency across multiple public datasets, demonstrating significant performance improvements compared to existing methods. The outcomes of this study provide new insights into multimodal image generation and present broad application prospects.

[CV-113] ROLO-SLAM: Rotation-Optimized LiDAR-Only SLAM in Uneven Terrain with Ground Vehicle

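[Quick read]: This paper addresses pose estimation drift in LiDAR-based SLAM (Simultaneous Localization and Mapping) on uneven terrain, especially drift in the vertical direction, which typically distorts the global map significantly. It proposes Rotation-Optimized LiDAR-Only SLAM (ROLO-SLAM). The key components are: a forward location prediction that coarsely eliminates the location difference between consecutive scans, so that location and orientation can be determined separately and accurately at the front end; a parallel-capable spatial voxelization for correspondence matching; a spherical alignment-guided rotation registration within each voxel to estimate the vehicle's rotation; a motion constraint, introduced into the optimization formulation through geometric alignment, for fast and effective estimation of the LiDAR's translation; keyframe extraction to build a submap, with scan-to-submap alignment for precise pose estimation; and a global-scale factor graph to reduce cumulative error. Experiments show that ROLO-SLAM performs well at ground-vehicle pose estimation and outperforms existing state-of-the-art LiDAR SLAM frameworks.

Link: https://arxiv.org/abs/2501.02166
Authors: Yinchuan Wang, Bin Ren, Xiang Zhang, Pengyu Wang, Chaoqun Wang, Rui Song, Yibin Li, Max Q.-H. Meng
Affiliations: School of Control Science and Engineering, Shandong University, Jinan, China; Shenzhen Key Laboratory of Robotics Perception and Intelligence and the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This article has been accepted by Journal of Field Robotics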

Abstract:LiDAR-based SLAM is recognized as one effective method to offer localization guidance in rough environments. However, off-the-shelf LiDAR-based SLAM methods suffer from significant pose estimation drifts, particularly components relevant to the vertical direction, when passing to uneven terrains. This deficiency typically leads to a conspicuously distorted global map. In this article, a LiDAR-based SLAM method is presented to improve the accuracy of pose estimations for ground vehicles in rough terrains, which is termed Rotation-Optimized LiDAR-Only (ROLO) SLAM. The method exploits a forward location prediction to coarsely eliminate the location difference of consecutive scans, thereby enabling separate and accurate determination of the location and orientation at the front-end. Furthermore, we adopt a parallel-capable spatial voxelization for correspondence-matching. We develop a spherical alignment-guided rotation registration within each voxel to estimate the rotation of vehicle. By incorporating geometric alignment, we introduce the motion constraint into the optimization formulation to enhance the rapid and effective estimation of LiDAR’s translation. Subsequently, we extract several keyframes to construct the submap and exploit an alignment from the current scan to the submap for precise pose estimation. Meanwhile, a global-scale factor graph is established to aid in the reduction of cumulative errors. In various scenes, diverse experiments have been conducted to evaluate our method. The results demonstrate that ROLO-SLAM excels in pose estimation of ground vehicles and outperforms existing state-of-the-art LiDAR SLAM frameworks.
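
The spatial voxelization step mentioned in the abstract can be illustrated with a generic sketch. This is ordinary grid hashing rather than ROLO-SLAM's actual data structure, which the abstract does not specify; `voxelize` and `voxel_size` are hypothetical names, and points are (x, y, z) tuples in metres.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by the integer index of the cubic voxel that contains them."""
    voxels = defaultdict(list)
    for point in points:
        # floor (not int truncation) keeps negative coordinates in the right cell.
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        voxels[key].append(point)
    return dict(voxels)
```

Each voxel's points can then be processed independently, which is what makes a per-voxel registration step straightforward to run in parallel.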

[CV-114] Joint Optimization for 4D Human-Scene Reconstruction in the Wild

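[Quick read]: This paper addresses reconstructing natural, diverse human motion and its surrounding environment from monocular video, that is, 4D human-scene reconstruction in the wild. Existing methods have made progress in capturing human-scene interaction in constrained environments but struggle to reconstruct natural and diverse human motion and scene context from web videos. The paper proposes JOSH, an optimization-based method that performs 4D reconstruction by jointly optimizing scene geometry and human motion. JOSH is initialized with dense scene reconstruction and human mesh recovery, then uses human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiments show that JOSH improves both global human motion estimation and dense scene reconstruction. The paper also designs a more efficient model, JOSH3R, trained directly on pseudo-labels generated from web videos, which further supports the method's accuracy and generalization.

Link: https://arxiv.org/abs/2501.02158
Authors: Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
Affiliations: University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL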

Abstract:Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods by only training with labels predicted from JOSH, further demonstrating its accuracy and generalization ability.

[CV-115] From Images to Detection: Machine Learning for Blood Pattern Classification

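[Quick read]: This paper addresses a key problem in Bloodstain Pattern Analysis (BPA): distinguishing between types of bloodstain patterns, specifically impact spatter and gunshot patterns. The key to the solution is to extract well-designed features of individual stains, apply effective data consolidation methods, and select suitable boosting classifiers. With these, the authors build a model that performs well in both accuracy and efficiency. The paper also uses outside data sources to discuss challenges and future directions for BPA.

Link: https://arxiv.org/abs/2501.02151
Authors: Yilin Li, Weining Shen
Affiliations: University of California, Irvine
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: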

Abstract:Bloodstain Pattern Analysis (BPA) helps us understand how bloodstains form, with a focus on their size, shape, and distribution. This aids in crime scene reconstruction and provides insight into victim positions and crime investigation. One challenge in BPA is distinguishing between different types of bloodstains, such as those from firearms, impacts, or other mechanisms. Our study focuses on differentiating impact spatter bloodstain patterns from gunshot bloodstain patterns. We distinguish patterns by extracting well-designed individual stain features, applying effective data consolidation methods, and selecting boosting classifiers. As a result, we have developed a model that excels in both accuracy and efficiency. In addition, we use outside data sources from previous studies to discuss the challenges and future directions for BPA.

[CV-116] Plasma-CycleGAN: Plasma Biomarker-Guided MRI to PET Cross-modality Translation Using Conditional CycleGAN

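[Quick read]: This paper addresses cross-modality translation between MRI and PET imaging, in particular how blood-based biomarkers (BBBMs) can improve the quality of generated PET images. The two modalities rely on different imaging mechanisms, which makes translation between them difficult. The paper integrates BBBMs into deep generative models to explore their potential for PET image synthesis. It finds that integrating BBBMs consistently improves generative quality across all three evaluated models, and that PET images generated by CycleGAN show the best visual fidelity. Based on these findings, the paper proposes Plasma-CycleGAN, a generative model that synthesizes PET images from MRI conditioned on BBBMs. It is the first approach to integrate BBBMs into conditional cross-modality translation between MRI and PET; the key is using BBBMs as conditioning information.

Link: https://arxiv.org/abs/2501.02146
Authors: Yanxi Chen, Yi Su, Celine Dumitrascu, Kewei Chen, David Weidman, Richard J Caselli, Nicholas Ashton, Eric M Reiman, Yalin Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Accepted by ISBI 2025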

Abstract:Cross-modality translation between MRI and PET imaging is challenging due to the distinct mechanisms underlying these modalities. Blood-based biomarkers (BBBMs) are revolutionizing Alzheimer’s disease (AD) detection by identifying patients and quantifying brain amyloid levels. However, the potential of BBBMs to enhance PET image synthesis remains unexplored. In this paper, we performed a thorough study on the effect of incorporating BBBM into deep generative models. By evaluating three widely used cross-modality translation models, we found that BBBMs integration consistently enhances the generative quality across all models. By visual inspection of the generated results, we observed that PET images generated by CycleGAN exhibit the best visual fidelity. Based on these findings, we propose Plasma-CycleGAN, a novel generative model based on CycleGAN, to synthesize PET images from MRI using BBBMs as conditions. This is the first approach to integrate BBBMs in conditional cross-modality translation between MRI and PET.

[CV-117] SafeAug: Safety-Critical Driving Data Augmentation from Naturalistic Datasets

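[Quick read]: This paper addresses the scarcity of safety-critical driving data for developing self-driving algorithms. Because such data is rare in naturalistic datasets, existing approaches rely mainly on simulated or artificially generated images, which differ in authenticity from real scenes. The paper proposes a framework that augments safety-critical driving data from naturalistic datasets. Its key steps are to detect vehicles with YOLOv5, then apply depth estimation and a 3D transformation to better simulate vehicle proximity and critical driving scenarios, which allows targeted modification of vehicle dynamics data to reflect potentially hazardous situations. Compared with simulated or artificially generated data, the method generates safety-critical data with minimal loss of image authenticity. Experiments on KITTI show that a downstream self-driving algorithm trained on the augmented dataset outperforms baselines including SMOGN and importance sampling.

Link: https://arxiv.org/abs/2501.02143
Authors: Zhaobin Mo, Yunlong Li, Xuan Di
Affiliations: Department of Civil Engineering and Engineering Mechanics, Columbia University; Department of Electrical Engineering, Columbia University; Data Science Institute, Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: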

Abstract:Safety-critical driving data is crucial for developing safe and trustworthy self-driving algorithms. Due to the scarcity of safety-critical data in naturalistic datasets, current approaches primarily utilize simulated or artificially generated images. However, there remains a gap in authenticity between these generated images and naturalistic ones. We propose a novel framework to augment the safety-critical driving data from the naturalistic dataset to address this issue. In this framework, we first detect vehicles using YOLOv5, followed by depth estimation and 3D transformation to simulate vehicle proximity and critical driving scenarios better. This allows for targeted modification of vehicle dynamics data to reflect potentially hazardous situations. Compared to the simulated or artificially generated data, our augmentation methods can generate safety-critical driving data with minimal compromise on image authenticity. Experiments using KITTI datasets demonstrate that a downstream self-driving algorithm trained on this augmented dataset performs superiorly compared to the baselines, which include SMOGN and importance sampling.
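
The geometric intuition behind simulating vehicle proximity can be sketched with the pinhole camera model. The abstract does not describe the paper's actual 3D transformation, so this is only an assumed simplification: the vehicle is treated as a fronto-parallel object whose depth changes from `z_old` to `z_new` while its lateral and vertical position stay fixed, and `move_box_closer`, `cx`, `cy` (the principal point in pixels) are hypothetical names.

```python
def move_box_closer(box, z_old, z_new, cx, cy):
    """Rescale a 2D box (x1, y1, x2, y2) as if its object moved from depth z_old to z_new.

    Under the pinhole model, u - cx = f * X / Z, so changing only Z scales every
    pixel offset from the principal point (cx, cy) by z_old / z_new.
    """
    scale = z_old / z_new
    x1, y1, x2, y2 = box
    return (
        cx + (x1 - cx) * scale,
        cy + (y1 - cy) * scale,
        cx + (x2 - cx) * scale,
        cy + (y2 - cy) * scale,
    )
```

Halving the depth doubles the apparent size of the vehicle, which is why a detected vehicle combined with a depth estimate is enough to place it plausibly closer in the image.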

[CV-118] AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

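[Quick read]: This paper addresses gaps in how the audio-visual (AV) understanding of multimodal large language models (MLLMs) is evaluated. Existing diagnostic benchmarks mainly assess the visual aspect, do not examine holistic audio-visual understanding, and do not test whether models calibrate their responses under perturbed inputs. The authors introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), which comprises 600K samples across 9 carefully designed tasks and evaluates AVLLMs along three dimensions: adversarial attack, compositional reasoning, and modality-specific dependency. An extensive evaluation of 13 state-of-the-art AVLLMs shows that most existing models fall significantly short of human-like comprehension. To mitigate these limitations, the authors further propose CAVPref, a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, which yields gains of up to 30.19% across the 9 tasks.

Link: https://arxiv.org/abs/2501.02135
Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
Affiliations: University of Maryland, College Park; University of Toronto; Mila and Université de Montréal; KAUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: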

Abstract:With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models’ multi-modal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine the holistic audio-visual (AV) understanding. Moreover, currently, there are no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in the existing approaches, we further propose a robust, model-agnostic calibrated audio-visual preference optimization based training strategy CAVPref, obtaining a gain up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.

[CV-119] Siamese Networks for Cat Re-Identification: Exploring Neural Models for Cat Instance Recognition

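[Quick read]: This paper addresses challenges in population control and welfare management of urban street cats, in particular how to identify and monitor individual cats efficiently and sustainably. The key to the solution is to use deep learning models, specifically Siamese Networks with EfficientNetB0, MobileNet, and VGG16 as base models, trained with contrastive and triplet loss functions, to automate cat re-identification. VGG16 paired with contrastive loss performed best, reaching up to 97% test accuracy and an F1 score of 0.9344. The approach uses image augmentation and dataset refinement to cope with limited data and visual variation, offering a scalable and reliable solution for community-driven street cat management.

Link: https://arxiv.org/abs/2501.02112
Authors: Tobias Trein, Luan Fonseca Garcia
Affiliations: Pontifícia Universidade Católica do Rio Grande do Sul
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 7 tables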

Abstract:Street cats in urban areas often rely on human intervention for survival, leading to challenges in population control and welfare management. In April 2023, Hello Inc., a Chinese urban mobility company, launched the Hello Street Cat initiative to address these issues. The project deployed over 21,000 smart feeding stations across 14 cities in China, integrating livestreaming cameras and treat dispensers activated through user donations. It also promotes the Trap-Neuter-Return (TNR) method, supported by a community-driven platform, HelloStreetCatWiki, where volunteers catalog and identify cats. However, manual identification is inefficient and unsustainable, creating a need for automated solutions. This study explores Deep Learning-based models for re-identifying street cats in the Hello Street Cat initiative. A dataset of 2,796 images of 69 cats was used to train Siamese Networks with EfficientNetB0, MobileNet and VGG16 as base models, evaluated under contrastive and triplet loss functions. VGG16 paired with contrastive loss emerged as the most effective configuration, achieving up to 97% accuracy and an F1 score of 0.9344 during testing. The approach leverages image augmentation and dataset refinement to overcome challenges posed by limited data and diverse visual variations. These findings underscore the potential of automated cat re-identification to streamline population monitoring and welfare efforts. By reducing reliance on manual processes, the method offers a scalable and reliable solution for community-driven initiatives. Future research will focus on expanding datasets and developing real-time implementations to enhance practicality in large-scale deployments.
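
The two loss functions compared in the study can be written down compactly. The sketch below uses common textbook forms on already-computed embedding vectors; the margins, the absence of a 1/2 factor, and the use of unsquared distances in the triplet loss are assumptions, since the abstract does not state which variants the authors used. The base networks (EfficientNetB0, MobileNet, VGG16) are not shown.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_cat, margin=1.0):
    """Pull embeddings of the same cat together; push different cats at least `margin` apart."""
    d = euclidean(u, v)
    if same_cat:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Require the anchor to be closer to the positive than to the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Contrastive loss works on labeled pairs, while triplet loss needs an anchor, a positive and a negative image per sample; with only 69 identities, how pairs and triplets are mined can matter as much as the choice of loss.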

[CV-120] AI-Powered Cow Detection in Complex Farm Environments

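[Quick read]: This paper addresses the detection accuracy and generalization problems that existing cow detection algorithms face in real farm environments, caused by complex lighting, occlusions, pose variations, and background interference. It proposes a detection model that combines YOLOv8 with the Convolutional Block Attention Module (CBAM). Adding CBAM strengthens feature extraction in complex environments, which improves detection accuracy and generalization. In the experiments, YOLOv8-CBAM outperformed YOLOv8 by 2.3% in mAP, reaching 95.2% precision and an mAP@0.5:0.95 of 82.6%, and outperformed the baselines (Mask R-CNN, YOLOv5, and YOLOv8). The key is the attention mechanism, which improves robustness and detection performance in complex farm environments and supports smart-farm applications such as health monitoring, behavioral analysis, and tracking.

Link: https://arxiv.org/abs/2501.02080
Authors: Voncarlos, Ines, Thomas, Sebastien, Elsa, Marjorie, Abdoulaye
Affiliations: Département d’informatique, Université du Québec à Montréal; McGill University; Innovation Chair in Animal Welfare and Artificial Intelligence (WELL-E)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: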

Abstract:Animal welfare has become a critical issue in contemporary society, emphasizing our ethical responsibilities toward animals, particularly within livestock farming. The advent of Artificial Intelligence (AI) technologies, specifically computer vision, offers an innovative approach to monitoring and enhancing animal welfare. Cows, as essential contributors to sustainable agriculture, are central to this effort. However, existing cow detection algorithms face challenges in real-world farming environments, such as complex lighting, occlusions, pose variations, and background interference, hindering detection. Model generalization is crucial for adaptation across contexts beyond the training dataset. This study addresses these challenges using a diverse cow dataset from six environments, including indoor and outdoor scenarios. We propose a detection model combining YOLOv8 with the CBAM (Convolutional Block Attention Module) and assess its performance against baseline models, including Mask R-CNN, YOLOv5, and YOLOv8. Our findings show baseline models degrade in complex conditions, while our approach improves using CBAM. YOLOv8-CBAM outperformed YOLOv8 by 2.3% in mAP, achieving 95.2% precision and an mAP@0.5:0.95 of 82.6%, demonstrating superior accuracy. Contributions include (1) analyzing detection limitations, (2) proposing a robust model, and (3) benchmarking state-of-the-art algorithms. Applications include health monitoring, behavioral analysis, and tracking in smart farms, enabling precise detection in challenging settings. This study advances AI-driven livestock monitoring, improving animal welfare and smart agriculture.
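
The metric name mAP@0.5:0.95 in the abstract refers to the COCO-style convention of averaging precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the box IoU and the threshold grid, not a full mAP computation; `box_iou` and `thresholds_passed` are hypothetical helper names, and boxes are (x1, y1, x2, y2) tuples.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """IoU thresholds at which this prediction would count as a match."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

A detector can score well at the 0.5 threshold with loosely fitting boxes, so the stricter averaged metric is the more demanding of the two figures reported in the abstract.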

[CV-121] RadHop-Net: A Lightweight Radiomics-to-Error Regression for False Positive Reduction In MRI Prostate Cancer Detection

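[Quick read]: This paper addresses the high false positive (FP) rate of bi-parametric MRI (bpMRI) reading when screening for clinically significant prostate cancer (csPCa). A high FP rate raises diagnostic costs and causes patient discomfort. The authors propose RadHop-Net, a lightweight convolutional neural network (CNN) for FP reduction. The key to the solution is a two-stage pipeline: Stage 1 uses data-driven radiomics to extract candidate regions of interest (ROIs), and Stage 2 uses RadHop-Net to expand the receptive field around each ROI to compensate for the prediction error of Stage 1. The paper also introduces a new loss function for regression problems that balances the influence of FPs and true positives (TPs). RadHop-Net is trained in a radiomics-to-error manner, which decouples it from the common voxel-to-label approach. On the public pi-cai dataset, Stage 2 raises the average precision (AP) of lesion detection from 0.407 to 0.468 while keeping the model significantly smaller than the state of the art.

Link: https://arxiv.org/abs/2501.02066
Authors: Vasileios Magoulianitis,Jiaxin Yang,Catherine A. Alexander,C.-C. Jay Kuo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 4 figures - Accepted to IEEE International Symposium on Biomedical Imaging (ISBI 2025)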

Abstract:Clinically significant prostate cancer (csPCa) is a leading cause of cancer death in men, yet it has a high survival rate if diagnosed early. Bi-parametric MRI (bpMRI) reading has become a prominent screening test for csPCa. However, this process has a high false positive (FP) rate, incurring higher diagnostic costs and patient discomfort. This paper introduces RadHop-Net, a novel and lightweight CNN for FP reduction. The pipeline consists of two stages: Stage 1 employs data driven radiomics to extract candidate ROIs. In contrast, Stage 2 expands the receptive field about each ROI using RadHop-Net to compensate for the predicted error from Stage 1. Moreover, a novel loss function for regression problems is introduced to balance the influence between FPs and true positives (TPs). RadHop-Net is trained in a radiomics-to-error manner, thus decoupling from the common voxel-to-label approach. The proposed Stage 2 improves the average precision (AP) in lesion detection from 0.407 to 0.468 in the publicly available pi-cai dataset, also maintaining a significantly smaller model size than the state-of-the-art.

[CV-122] ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing

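[Quick read]: This paper addresses challenges in text-guided style transfer, in particular balancing the expressiveness of textual semantics against the diversity of outputs. Existing diffusion models excel at conditional guidance, but when capturing stylistic features they often struggle to deliver both precise textual semantics and diverse results. The authors propose the ArtCrafter framework, whose solution has three key parts. First, an attention-based style extraction module with a multi-layer architecture uses perceiver attention mechanisms to capture subtle stylistic elements in an image. Second, a text-image aligning augmentation component uses attention operations to map image and text embeddings efficiently into a shared feature space. Third, an explicit modulation blends the multimodal enhanced embeddings with the original embeddings through an embedding reframing design, enabling diverse outputs. Experiments show that ArtCrafter performs well in visual stylization, with high stylistic intensity, controllability, and diversity.

Link: https://arxiv.org/abs/2501.02064
Authors: Nisha Huang,Kaer Huang,Yifan Pu,Jiangshan Wang,Jie Guo,Yiqiang Yan,Xiu Li
Affiliations: Tsinghua University; Peng Cheng Laboratory; Lenovo Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: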

Abstract:Recent years have witnessed significant advancements in text-guided style transfer, primarily attributed to innovations in diffusion models. These models excel in conditional guidance, utilizing text or images to direct the sampling process. However, despite their capabilities, direct conditional guidance approaches often face challenges in balancing the expressiveness of textual semantics with the diversity of output results while capturing stylistic features. To address these challenges, we introduce ArtCrafter, a novel framework for text-to-image style transfer. Specifically, we introduce an attention-based style extraction module, meticulously engineered to capture the subtle stylistic elements within an image. This module features a multi-layer architecture that leverages the capabilities of perceiver attention mechanisms to integrate fine-grained information. Additionally, we present a novel text-image aligning augmentation component that adeptly balances control over both modalities, enabling the model to efficiently map image and text embeddings into a shared feature space. We achieve this through attention operations that enable smooth information flow between modalities. Lastly, we incorporate an explicit modulation that seamlessly blends multimodal enhanced embeddings with original embeddings through an embedding reframing design, empowering the model to generate diverse outputs. Extensive experiments demonstrate that ArtCrafter yields impressive results in visual stylization, exhibiting exceptional levels of stylistic intensity, controllability, and diversity.

[CV-123] DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data

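[Quick read]: This paper addresses the weak generalization of open-vocabulary panoptic segmentation models to novel classes. Existing methods perform well on trained categories but generalize poorly to new ones. The authors propose DreamMask, which improves existing models from a data-centric perspective. DreamMask builds an automatic data generation pipeline from off-the-shelf models, with key designs for vocabulary expansion, layout arrangement, and data filtering. The resulting synthetic data significantly outperforms manually collected web data. The paper also designs a synthetic-real alignment loss that narrows the representation gap between synthetic and real data, yielding noticeable gains across multiple benchmarks. DreamMask simplifies the collection of large-scale training data and serves as a plug-and-play enhancement for existing methods. For example, when trained on COCO and tested on ADE20K, a model equipped with DreamMask outperforms the previous state of the art by 2.1% mIoU.

Link: https://arxiv.org/abs/2501.02048
Authors: Yuanpeng Tu,Xi Chen,Ser-Nam Lim,Hengshuang Zhao
Affiliations: HKU; UCF
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project url: this https URL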

Abstract: Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world. Despite claims of robust generalization, we find that the advancements of previous works are attributed mainly to trained categories, exposing a lack of generalization to novel classes. In this paper, we explore boosting existing models from a data-centric perspective. We propose DreamMask, which systematically explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. For the first part, we propose an automatic data generation pipeline with off-the-shelf models. We propose crucial designs for vocabulary expansion, layout arrangement, data filtering, etc. Equipped with these techniques, our generated data could significantly outperform the manually collected web data. To train the model with generated data, a synthetic-real alignment loss is designed to bridge the representation gap, bringing noticeable improvements across multiple benchmarks. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.

[CV-124] MRG: A Multi-Robot Manufacturing Digital Scene Generation Method Using Multi-Instance Point Cloud Registration

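[Quick read]: This paper addresses inconsistencies between high-fidelity digital simulation environments and the physical operations they replicate. Such inconsistencies lower confidence in simulation results and limit their usefulness for guiding real production. The authors propose a Multi-Robot Manufacturing Digital Scene Generation (MRG) method, which applies multi-instance point cloud registration to manufacturing scenes for the first time. The key components are an instance-focused transformer module that delineates instance boundaries and captures correlations between local regions, a hypothesis generation module that extracts target instances while preserving key features, and an efficient screening and optimization algorithm that refines the final registration results. Experiments on the Scan2CAD and Welding-Station datasets show that the method outperforms existing multi-instance point cloud registration techniques, improving MR and MP by 12.15% and 17.79% on Scan2CAD and by 16.95% and 24.15% on Welding-Station.

Link: https://arxiv.org/abs/2501.02041
Authors: Songjie Han,Yinhua Liu,Yanzheng Li,Hua Chen,Dongmei Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: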

Abstract:A high-fidelity digital simulation environment is crucial for accurately replicating physical operational processes. However, inconsistencies between simulation and physical environments result in low confidence in simulation outcomes, limiting their effectiveness in guiding real-world production. Unlike the traditional step-by-step point cloud “segmentation-registration” generation method, this paper introduces, for the first time, a novel Multi-Robot Manufacturing Digital Scene Generation (MRG) method that leverages multi-instance point cloud registration, specifically within manufacturing scenes. Tailored to the characteristics of industrial robots and manufacturing settings, an instance-focused transformer module is developed to delineate instance boundaries and capture correlations between local regions. Additionally, a hypothesis generation module is proposed to extract target instances while preserving key features. Finally, an efficient screening and optimization algorithm is designed to refine the final registration results. Experimental evaluations on the Scan2CAD and Welding-Station datasets demonstrate that: (1) the proposed method outperforms existing multi-instance point cloud registration techniques; (2) compared to state-of-the-art methods, the Scan2CAD dataset achieves improvements in MR and MP by 12.15% and 17.79%, respectively; and (3) on the Welding-Station dataset, MR and MP are enhanced by 16.95% and 24.15%, respectively. This work marks the first application of multi-instance point cloud registration in manufacturing scenes, significantly advancing the precision and reliability of digital simulation environments for industrial applications.

[CV-125] A Separable Self-attention Inspired by the State Space Model for Computer Vision

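[Quick read]: This paper starts from a limitation of State Space Models (SSMs): they are not well suited to non-causal data, even though Vision Mamba (ViM) methods built on Mamba, an SSM with linear computational complexity, still perform well on tasks such as image classification and object detection. Building on the theoretical connection between SSMs and attention variants, the authors propose a novel separable self-attention method that, for the first time, brings some of Mamba's design concepts into separable self-attention. For a fair comparison with ViMs, they introduce VMINet, a prototype architecture built solely by stacking the new attention modules with the most basic down-sampling layers, which differs significantly from the conventional Transformer architecture. Experiments show that VMINet achieves competitive results on image classification and high-resolution dense prediction tasks. The key to the solution is combining Mamba's efficient design concepts with self-attention, aiming to improve performance while retaining computational efficiency.

Link: https://arxiv.org/abs/2501.02040
Authors: Juntao Zhang,Shaogeng Liu,Kun Bian,You Zhou,Pei Zhang,Jianning Liu,Jun Zhou,Bingyan Liu
Affiliations: AMS, Beijing, China; School of Electronic Engineering, Xidian University; Coolanyp L.L.C., Wuxi, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: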

Abstract: Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive results on image classification and high-resolution dense prediction tasks. Code is available at: this https URL.

[CV-126] 3D Cloud reconstruction through geospatially-aware Masked Autoencoders

[Quick read]: This paper addresses the uncertainty that the complex effects of clouds on Earth's radiation balance introduce into climate models; real-time 3D cloud data is essential for improving climate predictions. The solution combines geostationary imagery from MSG/SEVIRI with radar reflectivity measurements from CloudSat/CPR. Self-supervised learning (SSL) methods, namely Masked Autoencoders (MAE) and the geospatially-aware SatMAE, are first trained on unlabelled MSG images, and the models are then fine-tuned on matched image-profile pairs. The approach outperforms existing methods such as U-Nets in reconstructing cloud structure, and geospatial encoding further improves predictions, demonstrating the potential of SSL for cloud reconstruction.

Link: https://arxiv.org/abs/2501.02035
Authors: Stella Girtsou,Emiliano Diaz Salas-Porras,Lilli Freischem,Joppe Massant,Kyriaki-Margarita Bintsi,Guiseppe Castiglione,William Jones,Michael Eisinger,Emmanuel Johnson,Anna Jungbluth
Affiliations: National Observatory of Athens; National Technical University of Athens; Universitat de València; University of Oxford; Royal Belgian Institute of Natural Sciences; Imperial College London; University of Sussex; European Space Agency; CSIC-UCM-IGEO
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Clouds play a key role in Earth’s radiation balance with complex effects that introduce large uncertainties into climate models. Real-time 3D cloud data is essential for improving climate predictions. This study leverages geostationary imagery from MSG/SEVIRI and radar reflectivity measurements of cloud profiles from CloudSat/CPR to reconstruct 3D cloud structures. We first apply self-supervised learning (SSL) methods, Masked Autoencoders (MAE) and geospatially-aware SatMAE, on unlabelled MSG images, and then fine-tune our models on matched image-profile pairs. Our approach outperforms state-of-the-art methods like U-Nets, and our geospatial encoding further improves prediction results, demonstrating the potential of SSL for cloud reconstruction.
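The MAE pre-training step mentioned above hides a random subset of image patches and asks the model to reconstruct them. The sketch below shows only the patch split and the random masking. The 64x64 image size, 16-pixel patches, and 75% mask ratio are illustrative assumptions, not values taken from this paper.

```python
import random

def split_patches(height, width, patch):
    """Return the top-left (row, col) corner of each non-overlapping patch."""
    return [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]

def random_mask(patches, mask_ratio, seed=0):
    """Split patches into (visible, masked) lists using a seeded random choice."""
    rng = random.Random(seed)
    n_masked = int(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_masked))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    masked = [p for i, p in enumerate(patches) if i in masked_idx]
    return visible, masked

patches = split_patches(64, 64, 16)                      # assumed image and patch size
visible, masked = random_mask(patches, mask_ratio=0.75)  # assumed mask ratio
print(len(patches), len(visible), len(masked))
```

Only the visible patches would be passed to the encoder; the masked ones serve as reconstruction targets.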

[CV-127] Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

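[Quick read]: This paper addresses the safety of Large Vision-Language Models (LVLMs) when faced with malicious prompts. Although considerable effort has gone into post-hoc alignment, the models' inner safety mechanisms remain largely unexplored. The paper finds that the internal activations of LVLMs during generation of the first token can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which the authors call "safety heads". These heads act as specialized shields against malicious prompts: ablating them leads to higher attack success rates while leaving the model's utility unaffected. By locating the safety heads and concatenating their activations, the authors build a simple but powerful malicious prompt detector that integrates into the generation process with minimal extra inference overhead. Although the detector is only a logistic regression model, it shows strong zero-shot generalization. Experiments across various prompt-based attacks confirm the effectiveness of using safety heads to protect LVLMs.

Link: https://arxiv.org/abs/2501.02029
Authors: Ziwei Zheng,Junyao Zhao,Le Yang,Lijun He,Fan Li
Affiliations: Xi’an Jiaotong University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: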

Abstract: With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term “safety heads.” Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model’s utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at this https URL.
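The detector described above is a logistic regression over concatenated safety-head activations. A minimal sketch of that idea follows. The four-dimensional feature vectors are made-up stand-ins for real head activations, and the plain gradient-descent trainer is an assumption for illustration, not the authors' released code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def is_malicious(w, b, x):
    """Flag a prompt when the predicted probability reaches 0.5."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5

# Hypothetical activations: two "safety heads", two values each, concatenated.
benign = [[0.1, 0.2, 0.0, 0.1], [0.2, 0.1, 0.1, 0.0], [0.0, 0.3, 0.2, 0.1]]
malicious = [[0.9, 0.8, 1.0, 0.7], [0.8, 1.0, 0.9, 0.9], [1.0, 0.7, 0.8, 1.0]]
X, y = benign + malicious, [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
print([is_malicious(w, b, x) for x in X])
```

Because the features come from a single forward step, a detector of this shape adds only a dot product per prompt, which is consistent with the low overhead the abstract claims.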

[CV-128] RealDiffFusionNet: Neural Controlled Differential Equation Informed Multi-Head Attention Fusion Networks for Disease Progression Modeling Using Real-World Data

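[Quick read]: This paper addresses disease progression prediction from multimodal data such as image data and time-invariant data. The key to the solution is RealDiffFusionNet, a deep learning model that combines Neural Controlled Differential Equations (Neural CDE) with multi-head attention. Neural CDEs handle irregularly sampled time series robustly, and multi-head attention aligns the relevant multimodal context at each time point. An ablation study examines the effect of CDEs, multimodal data, attention fusion, and interpolation strategies on model performance. The results show that multimodal data improves Neural CDE performance, that RealDiffFusionNet outperforms all other models, and that adding image features further lowers the test RMSE.

Link: https://arxiv.org/abs/2501.02025
Authors: Aashish Cheruvu,Nathaniel Rigoni
Affiliations: Central Bucks High School South; Lockheed Martin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: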

Abstract: This paper presents a novel deep learning-based approach named RealDiffFusionNet incorporating Neural Controlled Differential Equations (Neural CDE) - time series models that are robust in handling irregularly sampled data - and multi-head attention to align relevant multimodal context (image data, time invariant data, etc.) at each time point. Long short-term memory (LSTM) models were also used as a baseline. Two different datasets were used: one from the Open-Source Imaging Consortium (OSIC) containing structured time series data of demographics and lung function with a baseline CT scan of the lungs and the second from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) containing a series of MRI scans along with demographics, physical examinations, and cognitive assessment data. An ablation study was performed to understand the role of CDEs, multimodal data, attention fusion, and interpolation strategies on model performance. When the baseline models were evaluated, the use of multimodal data resulted in an improvement in Neural CDE performance, with a lower test RMSE. Additionally, the performance of multimodal Neural CDE was also superior to multimodal LSTM. In the attention-based architectures, fusion through concatenation and rectilinear interpolation were found to improve model performance. The performance of the proposed RealDiffFusionNet was found to be superior (0.2570) to all models. For the ADNI dataset, between the Neural-CDE and LSTM models trained only on the structured data, the test RMSEs were comparable (0.471 for LSTM vs. 0.4581 for Neural-CDE). Furthermore, the addition of image features from patients’ MRI series resulted in an improvement in performance, with a lower test RMSE (0.4372 with multimodal vs 0.4581 with structured data). RealDiffFusionNet has shown promise in utilizing CDEs and multimodal data to accurately predict disease progression.
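The abstract credits rectilinear interpolation with improving the attention-based models. Under the usual reading in the Neural CDE literature, a rectilinear path first advances time while holding the last observed value, then updates the value at the new time. The sketch below builds the knots of such a path for one channel. The timestamps and measurements are hypothetical, and the paper's exact variant is not specified in the abstract.

```python
def rectilinear_path(times, values):
    """Build rectilinear knots: move forward in time first, then update the value."""
    knots = [(times[0], values[0])]
    for t_next, x_next in zip(times[1:], values[1:]):
        knots.append((t_next, knots[-1][1]))  # advance time, hold the previous value
        knots.append((t_next, x_next))        # then jump to the new observation
    return knots

times = [0.0, 1.5, 4.0]    # hypothetical, irregularly spaced visit times
values = [2.0, 2.6, 1.9]   # hypothetical measurements at those visits
for knot in rectilinear_path(times, values):
    print(knot)
```

For n observations this yields 2n - 1 knots, and the path never uses a measurement before the time at which it was taken.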

[CV-129] Model Checking in Medical Imaging for Tumor Detection and Segmentation

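[Quick read]: This paper reviews the application of model checking to medical image analysis, in particular work that uses spatial logic to develop operators and tools for automatic or semi-automatic identification of regions of interest, such as tumorous and non-tumorous areas. It focuses on accurate segmentation and on the need for streamlined procedures suitable for routine clinical practice. It also examines challenges of spatial model-checking techniques, including variability in ground truth data.

Link: https://arxiv.org/abs/2501.02024
Authors: Elhoucine Elfatimi,Lahcen El fatimi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: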

Abstract:Recent advancements in model checking have demonstrated significant potential across diverse applications, particularly in signal and image analysis. Medical imaging stands out as a critical domain where model checking can be effectively applied to design and evaluate robust frameworks. These frameworks facilitate automatic and semi-automatic delineation of regions of interest within images, aiding in accurate segmentation. This paper provides a comprehensive analysis of recent works leveraging spatial logic to develop operators and tools for identifying regions of interest, including tumorous and non-tumorous areas. Additionally, we examine the challenges inherent to spatial model-checking techniques, such as variability in ground truth data and the need for streamlined procedures suitable for routine clinical practice.

[CV-130] Rephotography in the Digital Era: Mass Rephotography and re.photos the Web Portal for Rephotography

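[Quick read]: This paper discusses the technical challenges and requirements of digital rephotography at mass scale. Since rephotography began in the middle of the 19th century, techniques for registration, conservation, presentation, and sharing of rephotographs have advanced considerably. The paper presents existing digital approaches, including the re.photos web portal, and discusses the requirements for future mass rephotography: (1) handling batches of template images and rephotographs simultaneously; (2) automating image registration; (3) providing intuitive smartphone apps for rephotography; (4) long-term storage with persistent identifiers; (5) automatic or mass georeferencing; and (6) gamification and social media integration.

Link: https://arxiv.org/abs/2501.02017
Authors: Axel Schaffland
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures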

Abstract: Since the beginning of rephotography in the middle of the 19th century, techniques in registration, conservation, presentation, and sharing of rephotographs have come a long way. Here, we will present existing digital approaches to rephotography and discuss future approaches and requirements for digital mass rephotography. We present re.photos, an existing web portal for rephotography, featuring methods for collaborative rephotography, interactive image registration, as well as retrieval, organization, and sharing of rephotographs. For mass rephotography additional requirements must be met. Batches of template images and rephotographs must be handled simultaneously, image registration must be automated, and intuitive smartphone apps for rephotography must be available. Long-term storage with persistent identifiers, automatic or mass georeferencing, as well as gamification and social media integration are further requirements we will discuss in this paper.

[CV-131] SurfPatch: Enabling Patch Matching for Exploratory Stream Surface Visualization

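[Quick read]: This paper addresses the challenges of surface-based techniques in flow visualization, in particular surface placement, speed, perception, and evaluation. Line-based techniques have been studied extensively, whereas surface-based techniques have yet to be thoroughly investigated. The paper proposes SurfPatch, a framework supporting exploratory stream surface visualization. The key to the solution is to translate surface placement into surface selection and to use a three-stage process (vertex-level classification, patch-level matching, and surface-level clustering) that hierarchically builds the connections between vertices and patches and between patches and surfaces. This bottom-up approach enables fine-grained, multiscale patch-level matching, in contrast to the surface-level matching of existing work, and offers previously unavailable flexibility during querying. The paper also designs an intuitive visual interface for exploring and analyzing collections of stream surfaces. SurfPatch is not limited to stream surfaces traced from steady flow data: it is also shown to be effective on unsteady flows and on isosurfaces extracted from scalar fields.

Link: https://arxiv.org/abs/2501.02003
Authors: Delin An,Chaoli Wang
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: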

Abstract:Unlike their line-based counterparts, surface-based techniques have yet to be thoroughly investigated in flow visualization due to their significant placement, speed, perception, and evaluation challenges. This paper presents SurfPatch, a novel framework supporting exploratory stream surface visualization. To begin with, we translate the issue of surface placement to surface selection and trace a large number of stream surfaces from a given flow field dataset. Then, we introduce a three-stage process: vertex-level classification, patch-level matching, and surface-level clustering that hierarchically builds the connection between vertices and patches and between patches and surfaces. This bottom-up approach enables fine-grained, multiscale patch-level matching, sharply contrasts surface-level matching offered by existing works, and provides previously unavailable flexibility during querying. We design an intuitive visual interface for users to conveniently visualize and analyze the underlying collection of stream surfaces in an exploratory manner. SurfPatch is not limited to stream surfaces traced from steady flow datasets. We demonstrate its effectiveness through experiments on stream surfaces produced from steady and unsteady flows as well as isosurfaces extracted from scalar fields. The code is available at this https URL.

[CV-132] On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds

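[Quick read]: This paper examines the key factors that influence model performance on point cloud data across tasks of varying geometric complexity. In particular, it studies the trade-off between flexibility and weight sharing introduced by equivariant layers and assesses when equivariance helps or hurts performance. The central question is whether additional input information remains beneficial when it breaks certain properties, such as SE(3) equivariance. By benchmarking segmentation, regression, and generation tasks across multiple datasets, the authors find that equivariance has a positive impact that grows with task complexity, even when strict equivariance is not required. The paper identifies the aspects of equivariant and non-equivariant architectures that drive success in different tasks.

Link: https://arxiv.org/abs/2501.01999
Authors: Sharvaree Vadgama,Mohammad Mohaiminul Islam,Domas Buracus,Christian Shewmake,Erik Bekkers
Affiliations: AMLab, University of Amsterdam; QurAI, University of Amsterdam; New Theory AI; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 5 figures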

Abstract: This paper explores the key factors that influence the performance of models working with point clouds, across different tasks of varying geometric complexity. In this work, we explore the trade-offs between flexibility and weight-sharing introduced by equivariant layers, assessing when equivariance boosts or detracts from performance. It is often argued that providing more information as input improves a model’s performance. However, if this additional information breaks certain properties, such as SE(3) equivariance, does it remain beneficial? We identify the key aspects of equivariant and non-equivariant architectures that drive success in different tasks by benchmarking them on segmentation, regression, and generation tasks across multiple datasets with increasing complexity. We observe a positive impact of equivariance, which becomes more pronounced with increasing task complexity, even when strict equivariance is not required.
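To make the symmetry question concrete: SE(3) is the group of 3D rotations and translations. Pairwise distances within a point cloud do not change under such a rigid motion, while the raw coordinates do, so feeding raw coordinates to a model is one way extra input information can break the symmetry. The sketch below checks this numerically on a hypothetical four-point cloud, using a rotation about the z-axis only for brevity; it does not reproduce any architecture from the paper.

```python
import math

def rigid_transform(points, angle, shift):
    """Rotate points about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]

def pairwise_distances(points):
    """Distances between all unordered pairs, in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 3.0)]
moved = rigid_transform(cloud, angle=0.7, shift=(5.0, -2.0, 1.0))

before, after = pairwise_distances(cloud), pairwise_distances(moved)
print(max(abs(a - b) for a, b in zip(before, after)))  # numerically close to zero
print(cloud[0], moved[0])                              # raw coordinates differ
```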

[CV-133] SmartSpatial: Enhancing the 3D Spatial Arrangement Capabilities of Stable Diffusion Models and Introducing a Novel 3D Spatial Evaluation Framework

[Quick read]: This paper addresses the poor performance of Stable Diffusion models on complex spatial arrangements, especially those involving intricate 3D relationships. The authors propose SmartSpatial, which strengthens spatial arrangement through 3D-aware conditioning and attention-guided mechanisms. SmartSpatial incorporates depth information and uses cross-attention control to ensure precise object placement, improving spatial accuracy. The paper also presents SmartSpatialEval, an evaluation framework that uses vision-language models and graph-based dependency parsing to assess spatial relationships. Experiments on the COCO and SpatialPrompts datasets show that SmartSpatial significantly outperforms existing methods.

Link: https://arxiv.org/abs/2501.01998
Authors: Mao Xun Huang,Hen-Hsen Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Stable Diffusion models have made remarkable strides in generating photorealistic images from text prompts but often falter when tasked with accurately representing complex spatial arrangements, particularly involving intricate 3D relationships. To address this limitation, we introduce SmartSpatial, an innovative approach that enhances the spatial arrangement capabilities of Stable Diffusion models through 3D-aware conditioning and attention-guided mechanisms. SmartSpatial incorporates depth information and employs cross-attention control to ensure precise object placement, delivering notable improvements in spatial accuracy metrics. In conjunction with SmartSpatial, we present SmartSpatialEval, a comprehensive evaluation framework designed to assess spatial relationships. This framework utilizes vision-language models and graph-based dependency parsing for performance analysis. Experimental results on the COCO and SpatialPrompts datasets show that SmartSpatial significantly outperforms existing methods, setting new benchmarks for spatial arrangement accuracy in image generation.

[CV-134] A Novel Convolution and Attention Mechanism-based Model for 6D Object Pose Estimation

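[Quick read]: This paper addresses 6D object pose estimation from RGB images, where the lack of depth information means three-dimensional structure must be inferred from 2D projections. Traditional methods often rely on deep learning over grid-based data structures and struggle to capture complex dependencies among extracted features. The paper proposes a graph-based representation derived directly from images, in which the spatial-temporal features of each pixel serve as nodes and the relationships between them are defined by node connectivity and spatial interactions. It also uses feature selection mechanisms based on spatial attention and self-attention distillation, and a Legendre convolution layer that exploits the orthogonality of Legendre polynomials for numerical stability. In experiments on the LINEMOD, Occluded LINEMOD, and YCB Video datasets, the method outperforms nine existing approaches.

Link: https://arxiv.org/abs/2501.01993
Authors: Alexander Du,Yingwu Zhu
Affiliations: Bellevue School District; Seattle University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures, 6 tables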

Abstract:Estimating 6D object poses from RGB images is challenging because the lack of depth information requires inferring a three dimensional structure from 2D projections. Traditional methods often rely on deep learning with grid based data structures but struggle to capture complex dependencies among extracted features. To overcome this, we introduce a graph based representation derived directly from images, where spatial temporal features of each pixel serve as nodes, and relationships between them are defined through node connectivity and spatial interactions. We also employ feature selection mechanisms that use spatial attention and self attention distillation, along with a Legendre convolution layer leveraging the orthogonality of Legendre polynomials for numerical stability. Experiments on the LINEMOD, Occluded LINEMOD, and YCB Video datasets demonstrate that our method outperforms nine existing approaches and achieves state of the art benchmark in object pose estimation.
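The orthogonality property that the Legendre convolution layer relies on can be checked numerically. The sketch below evaluates Legendre polynomials with Bonnet's recursion and approximates their inner product on [-1, 1] with a midpoint rule. It illustrates the mathematical property only; how the layer itself uses these polynomials is not described in the abstract, and the function names are chosen for this example.

```python
def legendre(n, x):
    """Evaluate P_n(x) via Bonnet's recursion:
    (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)."""
    p_prev, p_curr = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p_curr = p_curr, ((2 * k + 1) * x * p_curr - k * p_prev) / (k + 1)
    return p_curr

def inner_product(m, n, steps=2000):
    """Midpoint-rule approximation of the integral of P_m * P_n over [-1, 1]."""
    h = 2.0 / steps
    total = 0.0
    for i in range(steps):
        x = -1.0 + (i + 0.5) * h
        total += legendre(m, x) * legendre(n, x)
    return total * h

print(legendre(2, 0.5))     # P_2(x) = (3x^2 - 1) / 2
print(inner_product(1, 3))  # distinct degrees: approximately 0
print(inner_product(2, 2))  # same degree: approximately 2 / (2n + 1) = 0.4
```

Distinct degrees integrate to approximately zero, and equal degrees give 2 / (2n + 1), so a basis built from these polynomials has no cross-talk between degrees.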

[CV-135] A Hybrid Deep Learning and Model-Checking Framework for Accurate Brain Tumor Detection and Validation

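[Quick read]: This paper addresses the accuracy and reliability of brain tumor detection and validation in medical imaging. The key to the solution is a hybrid framework that integrates model checking, a formal verification technique, with deep learning. Specifically, the framework combines model-checking principles with convolutional neural network (CNN) feature extraction and K-FCM clustering for segmentation to make tumor detection and segmentation more reliable. The reported results are 98% accuracy, 96.15% precision, and 100% recall.

Link: https://arxiv.org/abs/2501.01991
Authors: Lahcen El Fatimi,Elhoucine Elfatimi,Hanifa Bouchaneb
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 8 figures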

Abstract:Model checking, a formal verification technique, ensures systems meet predefined requirements, playing a crucial role in minimizing errors and enhancing quality during development. This paper introduces a novel hybrid framework integrating model checking with deep learning for brain tumor detection and validation in medical imaging. By combining model-checking principles with CNN-based feature extraction and K-FCM clustering for segmentation, the proposed approach enhances the reliability of tumor detection and segmentation. Experimental results highlight the framework’s effectiveness, achieving 98% accuracy, 96.15% precision, and 100% recall, demonstrating its potential as a robust tool for advanced medical image analysis.
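For readers less familiar with how accuracy, precision, and recall relate, the sketch below computes all three from a confusion matrix. The counts are purely hypothetical: they were chosen only because they reproduce the percentages quoted above, and the abstract does not give the paper's actual test-set counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical counts, NOT from the paper: 25 tumors all found, 1 false alarm.
accuracy, precision, recall = classification_metrics(tp=25, fp=1, tn=24, fn=0)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

A recall of 100% means no tumor case was missed (no false negatives), so in a result like this every error is a false positive, which is what pulls precision below accuracy here.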

[CV-136] CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs

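[Quick read]: This paper addresses the complexity and inefficiency of writing chest radiology reports: faced with large numbers of radiographs and prolonged high-intensity work, even experienced radiologists struggle to keep their interpretations accurate and consistent. The paper proposes the CRRG-CLIP model (Chest Radiology Report Generation and Radiograph Classification Model), an end-to-end model for automated report generation and radiograph classification. It has two modules. The report generation module uses Faster R-CNN to identify anatomical regions in radiographs, a binary classifier to select key regions, and GPT-2 to generate semantically coherent reports. The classification module uses the unsupervised Contrastive Language Image Pretraining (CLIP) model, which addresses the high cost of labelled datasets and the problem of insufficient features. According to the abstract, the generation module performs comparably to strong baselines on BLEU, METEOR, and ROUGE-L, and the classification module surpasses the state of the art in AUC and accuracy.

Link: https://arxiv.org/abs/2501.01989
Authors: Jianfei Xu,Thanet Markchom,Huizhi Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: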

Abstract:The complexity of stacked imaging and the massive number of radiographs make writing radiology reports complex and inefficient. Even highly experienced radiologists struggle to maintain accuracy and consistency in interpreting radiographs under prolonged high-intensity work. To address these issues, this work proposes the CRRG-CLIP Model (Chest Radiology Report Generation and Radiograph Classification Model), an end-to-end model for automated report generation and radiograph classification. The model consists of two modules: the radiology report generation module and the radiograph classification module. The generation module uses Faster R-CNN to identify anatomical regions in radiographs, a binary classifier to select key regions, and GPT-2 to generate semantically coherent reports. The classification module uses the unsupervised Contrastive Language Image Pretraining (CLIP) model, addressing the challenges of high-cost labelled datasets and insufficient features. The results show that the generation module performs comparably to high-performance baseline models on BLEU, METEOR, and ROUGE-L metrics, and outperformed the GPT-4o model on BLEU-2, BLEU-3, BLEU-4, and ROUGE-L metrics. The classification module significantly surpasses the state-of-the-art model in AUC and Accuracy. This demonstrates that the proposed model achieves high accuracy, readability, and fluency in report generation, while multimodal contrastive training with unlabelled radiograph-report pairs enhances classification performance.

[CV-137] Gender Bias in Text-to-Video Generation Models: A case study of Sora

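[Quick Read]: This paper examines gender bias in text-to-video generation models, focusing on OpenAI's Sora. By analyzing videos generated from gender-neutral and stereotypical prompts, it finds significant bias in Sora's gender representation: the model disproportionately associates specific genders with stereotypical behaviors and professions, reflecting societal prejudices embedded in its training data. The core of the approach is a systematic analysis over a diverse prompt set that exposes and quantifies the bias, giving a basis for later model improvement and training data curation.

Link: https://arxiv.org/abs/2501.01987
Authors: Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures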

Abstract:The advent of text-to-video generation models has revolutionized content creation as it produces high-quality videos from textual prompts. However, concerns regarding inherent biases in such models have prompted scrutiny, particularly regarding gender representation. Our study investigates the presence of gender bias in OpenAI’s Sora, a state-of-the-art text-to-video generation model. We uncover significant evidence of bias by analyzing the generated videos from a diverse set of gender-neutral and stereotypical prompts. The results indicate that Sora disproportionately associates specific genders with stereotypical behaviors and professions, which reflects societal prejudices embedded in its training data.

[CV-138] FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models

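[Quick Read]: This paper addresses the excessive number of visual tokens that Large Vision-Language Models (LVLMs) face when processing long and high-resolution videos. Existing token reduction methods rely mainly on importance-based token pruning and overlook the redundancy caused by frame resemblance and repetitive visual elements. Analyzing the high similarity among vision tokens in LVLMs, the authors find that the token similarity distribution condenses as layers deepen while its ranking stays consistent. Building on this, they propose FrameFusion, which combines similarity-based merging with importance-based pruning: similar tokens are identified and merged first, and pruning follows. Experiments show that FrameFusion reduces vision tokens by 70%, giving 3.4-4.4x LLM speedups and 1.6-1.9x end-to-end speedups with an average performance impact of less than 3%.

Link: https://arxiv.org/abs/2501.01986
Authors: Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
Affiliations: Tsinghua University; Infinigence-AI; Peking University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: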

Abstract: The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily focus on importance-based token pruning, which overlooks the redundancy caused by frame resemblance and repetitive visual elements. In this paper, we analyze the high vision token similarities in LVLMs. We reveal that token similarity distribution condenses as layers deepen while maintaining ranking consistency. Leveraging the unique properties of similarity over importance, we introduce FrameFusion, a novel approach that combines similarity-based merging with importance-based pruning for better token reduction in LVLMs. FrameFusion identifies and merges similar tokens before pruning, opening up a new perspective for token reduction. We evaluate FrameFusion on diverse LVLMs, including Llava-Video-7B, 32B, 72B, and MiniCPM-V-8B, on video understanding, question-answering, and retrieval benchmarks. Experiments show that FrameFusion reduces vision tokens by 70%, achieving 3.4-4.4x LLM speedups and 1.6-1.9x end-to-end speedups, with an average performance impact of less than 3%. Our code is available at this https URL.
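The merge-then-prune idea can be sketched in a few lines. This is a simplified illustration, not FrameFusion itself: it merges a token into the preceding group when their cosine similarity passes a threshold, then keeps the highest-scoring groups. The function names, the adjacency-based merging rule, the threshold, and the importance scores are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def merge_then_prune(tokens, importance, sim_threshold, keep):
    """Merge similar neighbouring tokens, then keep the `keep` most important.

    tokens: list of feature vectors in sequence order.
    importance: one score per token (e.g. an attention-derived score).
    """
    merged = []  # each entry: [sum of vectors, count, max importance]
    for vec, score in zip(tokens, importance):
        if merged:
            total, count, best = merged[-1]
            mean = [t / count for t in total]
            if cosine(mean, vec) >= sim_threshold:
                # Similarity-based merging: fold the token into the group.
                merged[-1] = [[t + x for t, x in zip(total, vec)],
                              count + 1, max(best, score)]
                continue
        merged.append([list(vec), 1, score])

    groups = [([t / count for t in total], best) for total, count, best in merged]
    # Importance-based pruning: keep the top groups, preserving their order.
    top = sorted(range(len(groups)), key=lambda i: groups[i][1], reverse=True)[:keep]
    return [groups[i][0] for i in sorted(top)]

tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 0.9], [1.0, 1.0]]
scores = [0.2, 0.9, 0.1, 0.3, 0.8]
reduced = merge_then_prune(tokens, scores, sim_threshold=0.95, keep=2)
```

Merging first means near-duplicate tokens are collapsed before they compete for the pruning budget, so the tokens that survive are both important and distinct.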

[CV-139] Fall Detection in Passenger Elevators using Intelligent Surveillance Camera Systems: An Application with YoloV8 Nano Model

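[Quick Read]: This paper addresses accurate detection of human falls inside passenger elevators, a setting made difficult by the enclosed environment and varying lighting conditions. The approach applies the YoloV8 Nano model, trained on a dataset of over 10,000 images across diverse elevator types, with the aim of improving detection precision and recall. The model reaches 85% precision and 82% recall in fall detection, indicating potential for integration into existing elevator safety systems to enable rapid intervention.

Link: https://arxiv.org/abs/2501.01985
Authors: Pinar Yozgatli, Yavuz Acar, Mehmet Tulumen, Selman Minga, Salih Selamet, Beytullah Nalbant, Mustafa Talha Toru, Berna Koca, Tevfik Keles, Mehmet Selcok
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures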

Abstract:Computer vision technology, which involves analyzing images and videos captured by cameras through deep learning algorithms, has significantly advanced the field of human fall detection. This study focuses on the application of the YoloV8 Nano model in identifying fall incidents within passenger elevators, a context that presents unique challenges due to the enclosed environment and varying lighting conditions. By training the model on a robust dataset comprising over 10,000 images across diverse elevator types, we aim to enhance the detection precision and recall rates. The model’s performance, with an 85% precision and 82% recall in fall detection, underscores its potential for integration into existing elevator safety systems to enable rapid intervention.

[CV-140] ECG-guided individual identification via PPG ICASSP2025

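[Quick Read]: This paper addresses the weak performance of photoplethysmography (PPG)-based individual identification, which is limited by the low information density of PPG signals. It introduces electrocardiogram (ECG) signals as an additional modality to increase the density of input information. The key component is a cross-modal knowledge distillation framework that transfers discriminative knowledge from the ECG modality to the PPG modality without adding computation at inference time. To make the transfer efficient, the paper also proposes a knowledge alignment module based on Contrastive Language-Image Pre-training (CLIP) and a cross-knowledge assessment module. Experiments show overall accuracy gains over the baseline of 2.8% on seen individuals and 3.0% on unseen individuals.

Link: https://arxiv.org/abs/2501.01983
Authors: Riling Wei, Hanjie Chen, Kelu Yao, Chuanguang Yang, Jun Wang, Chao Li
Affiliations: Zhejiang Lab; Hong Kong Centre for Cerebro-cardiovascular Health Engineering (COCHE); Zhejiang University; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2025. Camera Ready Version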

Abstract: Photoplethysmography (PPG)-based individual identification, which aims at recognizing humans via intrinsic cardiovascular activities, has attracted extensive attention due to its high security and resistance to mimicry. However, this kind of technology yields unpromising results because of its low information density. To this end, electrocardiogram (ECG) signals have been introduced as a novel modality to enhance the density of input information. Specifically, a novel cross-modal knowledge distillation framework is implemented to propagate discriminative knowledge from the ECG modality to the PPG modality without incurring additional computational demands at the inference phase. Furthermore, to ensure efficient knowledge propagation, Contrastive Language-Image Pre-training (CLIP)-based knowledge alignment and cross-knowledge assessment modules are proposed. Comprehensive experiments are conducted, and the results show that our framework outperforms the baseline model, improving overall accuracy by 2.8% and 3.0% on seen and unseen individual recognition, respectively.

[CV-141] Optical Character Recognition using Convolutional Neural Networks for Ashokan Brahmi Inscriptions

[Quick Read]: This paper addresses recognition of Ashokan Brahmi characters by building an Optical Character Recognition (OCR) system based on Convolutional Neural Networks (CNNs). The approach applies transfer learning to three pre-trained CNNs (LeNet, VGG-16, and MobileNet) and combines it with data augmentation and image preprocessing, namely noise removal and line and character segmentation. MobileNet performs best, with a validation accuracy of 95.94% and a validation loss of 0.129. The work offers a practical tool for preserving and digitizing ancient scripts, which matters particularly in epigraphy.

Link: https://arxiv.org/abs/2501.01981
Authors: Yash Agrawal, Srinidhi Balasubramanian, Rahul Meena, Rohail Alam, Himanshu Malviya, Rohini P
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract: This research paper delves into the development of an Optical Character Recognition (OCR) system for the recognition of Ashokan Brahmi characters using Convolutional Neural Networks. It utilizes a comprehensive dataset of character images to train the models, along with data augmentation techniques to optimize the training process. Furthermore, the paper incorporates image preprocessing to remove noise, as well as image segmentation to facilitate line and character segmentation. The study mainly focuses on three pre-trained CNNs, namely LeNet, VGG-16, and MobileNet and compares their accuracy. Transfer learning was employed to adapt the pre-trained models to the Ashokan Brahmi character dataset. The findings reveal that MobileNet outperforms the other two models in terms of accuracy, achieving a validation accuracy of 95.94% and validation loss of 0.129. The paper provides an in-depth analysis of the implementation process using MobileNet and discusses the implications of the findings. The use of OCR for character recognition is of significant importance in the field of epigraphy, specifically for the preservation and digitization of ancient scripts. The results of this research paper demonstrate the effectiveness of using pre-trained CNNs for the recognition of Ashokan Brahmi characters.

[CV-142] Polarimetric BSSRDF Acquisition of Dynamic Faces

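[Quick Read]: This paper addresses the acquisition and modeling of polarized light reflection and scattering for dynamic human faces, which are difficult because of their complex structure and reflectance, strong spatially-varying subsurface scattering, and motion. Current polarimetric acquisition systems are limited to static, opaque objects. The paper presents a polarimetric acquisition method that captures the spatially varying appearance and precise geometry of dynamic faces across a wide range of skin tones and facial expressions. Its key feature is the simultaneous acquisition of polarimetric and spectral reflectance information together with biophysically-based skin parameters (concentrations of inner- and outer-layer hemoglobin, eumelanin, and pheomelanin) and geometry. The method uses the distinct multispectral absorption profiles of these components to quantify their concentrations, which in turn informs the model about the interactions within the skin layers. The resulting polarimetric skin model integrates into various rendering pipelines.

Link: https://arxiv.org/abs/2501.01980
Authors: Hyunho Ha, Inseung Hwang, Nestor Monzon, Jaemin Cho, Donggun Kim, Seung-Hwan Baek, Adolfo Muñoz, Diego Gutierrez, Min H. Kim
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: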

Abstract:Acquisition and modeling of polarized light reflection and scattering help reveal the shape, structure, and physical characteristics of an object, which is increasingly important in computer graphics. However, current polarimetric acquisition systems are limited to static and opaque objects. Human faces, on the other hand, present a particularly difficult challenge, given their complex structure and reflectance properties, the strong presence of spatially-varying subsurface scattering, and their dynamic nature. We present a new polarimetric acquisition method for dynamic human faces, which focuses on capturing spatially varying appearance and precise geometry, across a wide spectrum of skin tones and facial expressions. It includes both single and heterogeneous subsurface scattering, index of refraction, and specular roughness and intensity, among other parameters, while revealing biophysically-based components such as inner- and outer-layer hemoglobin, eumelanin and pheomelanin. Our method leverages such components’ unique multispectral absorption profiles to quantify their concentrations, which in turn inform our model about the complex interactions occurring within the skin layers. To our knowledge, our work is the first to simultaneously acquire polarimetric and spectral reflectance information alongside biophysically-based skin parameters and geometry of dynamic human faces. Moreover, our polarimetric skin model integrates seamlessly into various rendering pipelines.

[CV-143] INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models

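[Quick Read]: This paper addresses gaps in fairness evaluation for multi-modal generative AI systems such as text-to-image models. Existing evaluation frameworks pay too little attention to content generation alignment and to social-bias-sensitive domains, and their reliance on pixel-detection techniques is prone to inaccuracies. The paper proposes INFELM, whose key contributions are: (1) an advanced skin tone classifier that incorporates facial topology and a refined skin pixel representation, improving classification precision by at least 16.04%; (2) a bias-sensitive content alignment measurement for understanding the societal impact of generated content; (3) a generalizable representation bias evaluation for diverse demographic groups; and (4) extensive experiments analyzing text-to-image model outputs across six social-bias-sensitive domains. The study finds that the evaluated models generally fail to meet the empirical fairness criteria and that representation bias is generally more pronounced than alignment errors. INFELM is presented as a benchmark for fairness assessment that supports the development of multi-modal AI systems aligned with ethical and human-centric principles.

Link: https://arxiv.org/abs/2501.01973
Authors: Di Jin, Xing Liu, Yu Liu, Jia Qing Yap, Andrea Wong, Adriana Crespo, Qi Lin, Zhiyuan Yin, Qiang Yan, Ryan Ye
Affiliations: TikTok Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Di Jin and Xing Liu contributed equally to this work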

Abstract:The rapid development of large language models (LLMs) and large vision models (LVMs) have propelled the evolution of multi-modal AI systems, which have demonstrated the remarkable potential for industrial applications by emulating human-like cognition. However, they also pose significant ethical challenges, including amplifying harmful content and reinforcing societal biases. For instance, biases in some industrial image generation models highlighted the urgent need for robust fairness assessments. Most existing evaluation frameworks focus on the comprehensiveness of various aspects of the models, but they exhibit critical limitations, including insufficient attention to content generation alignment and social bias-sensitive domains. More importantly, their reliance on pixel-detection techniques is prone to inaccuracies. To address these issues, this paper presents INFELM, an in-depth fairness evaluation on widely-used text-to-image models. Our key contributions are: (1) an advanced skintone classifier incorporating facial topology and refined skin pixel representation to enhance classification precision by at least 16.04%, (2) a bias-sensitive content alignment measurement for understanding societal impacts, (3) a generalizable representation bias evaluation for diverse demographic groups, and (4) extensive experiments analyzing large-scale text-to-image model outputs across six social-bias-sensitive domains. We find that existing models in the study generally do not meet the empirical fairness criteria, and representation bias is generally more pronounced than alignment errors. INFELM establishes a robust benchmark for fairness assessment, supporting the development of multi-modal AI systems that align with ethical and human-centric principles. 

[CV-144] GAF-FusionNet: Multimodal ECG Analysis via Gramian Angular Fields and Split Attention ICONIP2024

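[Quick Read]: This paper addresses accurate classification of electrocardiogram (ECG) signals, whose complexity makes interpretation difficult when diagnosing cardiovascular diseases. It proposes a multimodal framework, GAF-FusionNet, that combines time-series analysis with an image-based representation built from Gramian Angular Fields (GAF). The key component is a dual-layer cross-channel split attention module that adaptively fuses temporal and spatial features so that complementary information is integrated. Evaluated on ECG200, ECG5000, and the MIT-BIH Arrhythmia Database, GAF-FusionNet reaches 94.5%, 96.9%, and 99.6% accuracy respectively, which the authors report as a significant improvement over existing methods.

Link: https://arxiv.org/abs/2501.01960
Authors: Jiahao Qin, Feng Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 14 pages, 1 figure, accepted by ICONIP 2024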

Abstract: Electrocardiogram (ECG) analysis plays a crucial role in diagnosing cardiovascular diseases, but accurate interpretation of these complex signals remains challenging. This paper introduces a novel multimodal framework (GAF-FusionNet) for ECG classification that integrates time-series analysis with image-based representation using Gramian Angular Fields (GAF). Our approach employs a dual-layer cross-channel split attention module to adaptively fuse temporal and spatial features, enabling nuanced integration of complementary information. We evaluate GAF-FusionNet on three diverse ECG datasets: ECG200, ECG5000, and the MIT-BIH Arrhythmia Database. Results demonstrate significant improvements over state-of-the-art methods, with our model achieving 94.5%, 96.9%, and 99.6% accuracy on the respective datasets. Our code will soon be available at this https URL.
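For readers unfamiliar with Gramian Angular Fields, the sketch below shows the standard summation variant: rescale the series to [-1, 1], map each value to an angle with arccos, and fill a matrix with the cosine of pairwise angle sums. The function name is an assumption, and the paper may use a different variant or preprocessing.

```python
import math

def gramian_angular_field(series):
    """Return the Gramian Angular Summation Field of a 1-D series."""
    low, high = min(series), max(series)
    if high == low:
        raise ValueError("series must contain at least two distinct values")
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = [2.0 * (x - low) / (high - low) - 1.0 for x in series]
    # Clamp to guard against tiny floating-point overshoot.
    angles = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    # Entry (i, j) encodes the temporal correlation between samples i and j.
    return [[math.cos(a + b) for b in angles] for a in angles]

image = gramian_angular_field([0.0, 1.0, 2.0])
```

The result is a symmetric square matrix whose side equals the series length, which is what lets a 1-D ECG segment be passed to image-based network branches.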

[CV-145] STEAM-EEG: Spatiotemporal EEG Analysis with Markov Transfer Fields and Attentive CNNs

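[Quick Read]: This paper addresses the difficulty of analyzing and interpreting electroencephalogram (EEG) signals in biomedical research and clinical applications. EEG plays a central role in epilepsy diagnosis, sleep disorder analysis, and brain-computer interfaces, but the signals are complex and hard to interpret. The proposed framework, STEAM-EEG, combines computer graphics techniques with biological signal pattern recognition, using Markov Transfer Fields (MTFs) to image EEG time series. MTFs capture the spatiotemporal dynamics of the signals and turn them into visually informative images, which are then rendered, visualized, and modelled with computer graphics techniques. The authors state that this facilitates data exploration, pattern recognition, and decision-making.

Link: https://arxiv.org/abs/2501.01959
Authors: Jiahao Qin, Feng Liu
Affiliations: Xi’an Jiaotong-Liverpool University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 10 pages, 5 figures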

Abstract:Electroencephalogram (EEG) signals play a pivotal role in biomedical research and clinical applications, including epilepsy diagnosis, sleep disorder analysis, and brain-computer interfaces. However, the effective analysis and interpretation of these complex signals often present significant challenges. This paper presents a novel approach that integrates computer graphics techniques with biological signal pattern recognition, specifically using Markov Transfer Fields (MTFs) for EEG time series imaging. The proposed framework (STEAM-EEG) employs the capabilities of MTFs to capture the spatiotemporal dynamics of EEG signals, transforming them into visually informative images. These images are then rendered, visualised, and modelled using state-of-the-art computer graphics techniques, thereby facilitating enhanced data exploration, pattern recognition, and decision-making. The code could be accessed from GitHub.
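The sketch below assumes that the paper's Markov Transfer Field follows the usual Markov transition field construction: discretize the series into quantile bins, estimate a first-order transition matrix between bins, and spread those probabilities over every pair of time steps. The function name, the binning rule, and the bin count are assumptions.

```python
def markov_transition_field(series, n_bins):
    """Return an n x n field where entry (i, j) is P(state_i -> state_j)."""
    ordered = sorted(series)
    n = len(series)
    # Quantile edges so each bin holds roughly the same number of samples.
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    states = [sum(1 for edge in edges if x >= edge) for x in series]

    # Count first-order transitions between consecutive samples.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1

    # Row-normalize the counts into transition probabilities.
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])

    # Look up the transition probability for every pair of time steps.
    return [[probs[si][sj] for sj in states] for si in states]

field = markov_transition_field([0, 1, 0, 1, 0, 1], n_bins=2)
```

For one EEG channel this yields a single image; how STEAM-EEG combines channels and feeds the images to its attentive CNNs is not described in the abstract.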

[CV-146] Dr. Tongue: Sign-Oriented Multi-label Detection for Remote Tongue Diagnosis

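[Quick Read]: This paper addresses accurate recognition of tongue attributes in telehealth settings, a need that grew sharply with the demand for remote medical assessment during the COVID-19 pandemic. It proposes a Sign-Oriented multi-label Attributes Detection framework with two parts. First, an adaptive tongue feature extraction module standardizes tongue images and mitigates environmental factors. Second, a Sign-oriented Network (SignNet) identifies specific tongue attributes, emulating the diagnostic process of experienced practitioners and enabling comprehensive health evaluation. The authors also built a tongue image dataset designed for remote diagnosis, with a comprehensive set of attribute labels, which will be made openly available. Initial tests show improved accuracy in detecting various tongue attributes.

Link: https://arxiv.org/abs/2501.03053
Authors: Yiliang Chen, Steven SC Ho, Cheng Xu, Yao Jie Xie, Wing-Fai Yeung, Shengfeng He, Jing Qin
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: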

Abstract:Tongue diagnosis is a vital tool in Western and Traditional Chinese Medicine, providing key insights into a patient’s health by analyzing tongue attributes. The COVID-19 pandemic has heightened the need for accurate remote medical assessments, emphasizing the importance of precise tongue attribute recognition via telehealth. To address this, we propose a Sign-Oriented multi-label Attributes Detection framework. Our approach begins with an adaptive tongue feature extraction module that standardizes tongue images and mitigates environmental factors. This is followed by a Sign-oriented Network (SignNet) that identifies specific tongue attributes, emulating the diagnostic process of experienced practitioners and enabling comprehensive health evaluations. To validate our methodology, we developed an extensive tongue image dataset specifically designed for telemedicine. Unlike existing datasets, ours is tailored for remote diagnosis, with a comprehensive set of attribute labels. This dataset will be openly available, providing a valuable resource for research. Initial tests have shown improved accuracy in detecting various tongue attributes, highlighting our framework’s potential as an essential tool for remote medical assessments.

[CV-147] DDRM-PR: Fourier Phase Retrieval using Denoising Diffusion Restoration Models

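[Quick Read]: This paper addresses the nonlinear phase retrieval problem: reconstructing an image from noisy intensity-only measurements such as Fourier intensity. Most existing diffusion-based approaches are limited to linear inverse problems. The paper uses the efficient, unsupervised posterior sampling framework of Denoising Diffusion Restoration Models (DDRM) for this nonlinear problem. The key idea is to combine model-based alternating-projection methods with DDRM so that pretrained unconditional diffusion priors can be used for phase retrieval. Simulations and experimental data show the potential of the approach for improving alternating-projection methods, as well as its limitations.

Link: https://arxiv.org/abs/2501.03030
Authors: Mehmet Onurcan Kaya, Figen S. Oktem
Affiliations: Dept. of Applied Mathematics and Computer Science, DTU; Dept. of Electrical and Electronics Eng., METU
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: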

Abstract: Diffusion models have demonstrated their utility as learned priors for solving various inverse problems. However, most existing approaches are limited to linear inverse problems. This paper exploits the efficient and unsupervised posterior sampling framework of Denoising Diffusion Restoration Models (DDRM) to solve the nonlinear phase retrieval problem, which requires reconstructing an image from its noisy intensity-only measurements such as Fourier intensity. The approach combines model-based alternating-projection methods with DDRM to utilize pretrained unconditional diffusion priors for phase retrieval. The performance is demonstrated through both simulations and experimental data. Results demonstrate the potential of this approach for improving the alternating-projection methods as well as its limitations.
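To show what an alternating-projection method looks like, here is a minimal 1-D error-reduction sketch. It covers only the classical part: the signal alternates between a Fourier-magnitude constraint and an object-domain constraint. The DDRM diffusion prior is deliberately left out, and all function names, the support constraint, and the toy signal are assumptions.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for tiny examples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def project_magnitude(x, magnitudes):
    """Keep the current Fourier phases but impose the measured magnitudes."""
    fixed = [m * (v / abs(v)) if abs(v) > 1e-12 else m
             for v, m in zip(dft(x), magnitudes)]
    return idft(fixed)

def project_object(x, support):
    """Force the estimate to be real, non-negative, and zero off-support."""
    return [max(v.real, 0.0) if inside else 0.0 for v, inside in zip(x, support)]

def error_reduction(magnitudes, support, init, iterations):
    """Alternate between the two projections for a fixed number of steps."""
    x = list(init)
    for _ in range(iterations):
        x = project_object(project_magnitude(x, magnitudes), support)
    return x

truth = [1.0, 2.0, 0.0, 0.0]                     # hypothetical signal
measured = [abs(v) for v in dft(truth)]          # intensity-only data
estimate = error_reduction(measured, [True, True, False, False],
                           [1.0, 1.0, 0.0, 0.0], iterations=20)
```

In the paper's setting, a learned diffusion prior would replace or complement the simple object-domain projection; this sketch makes no claim about how that coupling is done.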

[CV-148] A Trust-Guided Approach to MR Image Reconstruction with Side Information

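[Quick Read]: This paper addresses long magnetic resonance imaging (MRI) scan times; shorter scans can improve patient care and lower healthcare costs. It proposes the Trust-Guided Variational Network (TGVN), an end-to-end deep learning framework for reconstructing diagnostic-quality images from limited k-space data. The key idea is to integrate contextual side information from other sources to resolve the ambiguities that undersampling introduces into the linear inverse problem (LIP). TGVN eliminates undesirable solutions from the ambiguous space of the forward operator while remaining faithful to the acquired data. In multi-coil, multi-contrast MR image reconstruction, it achieves superior image quality at challenging undersampling levels compared with baselines that also use side information, substantially speeding up acquisition while minimizing hallucinations. The approach is also versatile enough to incorporate many types of side information, including previous scans or even text, into any LIP.

Link: https://arxiv.org/abs/2501.03021
Authors: Arda Atalık, Sumit Chopra, Daniel K. Sodickson
Affiliations: NYU Center for Data Science; Center for Advanced Imaging Innovation and Research (CAI2R); Bernard and Irene Schwartz Center for Biomedical Imaging, Department of Radiology, NYU Grossman School of Medicine; Courant Institute of Mathematical Sciences
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 14 figures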

Abstract: Reducing MRI scan times can improve patient care and lower healthcare costs. Many acceleration methods are designed to reconstruct diagnostic-quality images from limited sets of acquired k-space data. This task can be framed as a linear inverse problem (LIP), where, as a result of undersampling, the forward operator may become rank-deficient or exhibit small singular values. This results in ambiguities in reconstruction, in which multiple generally incorrect or non-diagnostic images can map to the same acquired data. To address such ambiguities, it is crucial to incorporate prior knowledge, for example in the form of regularization. Another form of prior knowledge less commonly used in medical imaging is contextual side information garnered from other sources than the current acquisition. Here, we propose the Trust-Guided Variational Network (TGVN), a novel end-to-end deep learning framework that effectively integrates side information into LIPs. TGVN eliminates undesirable solutions from the ambiguous space of the forward operator while remaining faithful to the acquired data. We demonstrate its effectiveness in multi-coil, multi-contrast MR image reconstruction, where incomplete or low-quality measurements from one contrast are used as side information to reconstruct high-quality images of another contrast from heavily under-sampled data. Our method is robust across different contrasts, anatomies, and field strengths. Compared to baselines that also utilize side information, TGVN achieves superior image quality at challenging under-sampling levels, drastically speeding up acquisition while minimizing hallucinations. Our approach is also versatile enough to incorporate many different types of side information (including previous scans or even text) into any LIP.
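The ambiguity described in the abstract can be stated in generic linear-algebra terms. The notation below is not taken from the paper; the symbols $y$ (acquired data), $A$ (forward operator), $x$ (image), $\varepsilon$ (noise), and $v$ are assumptions introduced for illustration.

```latex
\begin{align}
  y &= A x + \varepsilon, \qquad A \in \mathbb{C}^{m \times n},\ m < n, \\
  A (x + v) &= A x \quad \text{for every } v \in \operatorname{null}(A).
\end{align}
```

Because every $x + v$ explains the data as well as $x$ does, the measurements alone cannot pick one image; regularization or side information is what narrows the choice within that ambiguous set.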

[CV-149] GLFC: Unified Global-Local Feature and Contrast Learning with Mamba-Enhanced UNet for Synthetic CT Generation from CBCT

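[Quick Read]: This paper addresses the generation of synthetic Computed Tomography (sCT) images from Cone Beam Computed Tomography (CBCT), where existing CNN- and Transformer-based methods struggle to capture both global and local features and contrasts. The authors propose a Global-Local Feature and Contrast learning (GLFC) framework with two main components: (1) a Mamba-Enhanced UNet (MEUNet), which integrates Mamba blocks into the skip connections of a high-resolution UNet to improve global and local feature learning; and (2) a Multiple Contrast Loss (MCL), which computes the synthesis loss at different intensity windows to improve quality in both soft tissue and bone regions. On the SynthRAD2023 dataset, GLFC raises the SSIM of sCT to 91.50%, compared with 77.91% for the original CBCT, and significantly outperforms several existing sCT generation methods.

Link: https://arxiv.org/abs/2501.02992
Authors: Xianhao Zhou, Jianghao Wu, Huangxuan Zhao, Lei Chen, Shaoting Zhang, Guotai Wang
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ISBI2025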

Abstract:Generating synthetic Computed Tomography (CT) images from Cone Beam Computed Tomography (CBCT) is desirable for improving the image quality of CBCT. Existing synthetic CT (sCT) generation methods using Convolutional Neural Networks (CNN) and Transformers often face difficulties in effectively capturing both global and local features and contrasts for high-quality sCT generation. In this work, we propose a Global-Local Feature and Contrast learning (GLFC) framework for sCT generation. First, a Mamba-Enhanced UNet (MEUNet) is introduced by integrating Mamba blocks into the skip connections of a high-resolution UNet for effective global and local feature learning. Second, we propose a Multiple Contrast Loss (MCL) that calculates synthetic loss at different intensity windows to improve quality for both soft tissues and bone regions. Experiments on the SynthRAD2023 dataset demonstrate that GLFC improved the SSIM of sCT from 77.91% to 91.50% compared with the original CBCT, and significantly outperformed several existing methods for sCT generation. The code is available at this https URL
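The abstract describes the Multiple Contrast Loss only as a loss computed at different intensity windows. A minimal sketch of that idea follows, assuming an L1 distance averaged over windows; the function names, the L1 choice, the equal weighting, and the window values in the example are all assumptions, not the paper's definition.

```python
def apply_window(value, low, high):
    """Clip an intensity to [low, high] and rescale it to [0, 1]."""
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)

def multiple_contrast_loss(predicted, target, windows):
    """Average the mean absolute error over several intensity windows.

    predicted, target: flat lists of intensities (e.g. Hounsfield units).
    windows: list of (low, high) pairs, one per contrast setting.
    """
    total = 0.0
    for low, high in windows:
        errors = [abs(apply_window(p, low, high) - apply_window(t, low, high))
                  for p, t in zip(predicted, target)]
        total += sum(errors) / len(errors)
    return total / len(windows)

# Hypothetical narrow and wide windows.
loss = multiple_contrast_loss([0.0, 1000.0], [0.0, 500.0],
                              windows=[(-100.0, 100.0), (0.0, 1000.0)])
```

A narrow window magnifies small differences in soft tissue that a single wide-range loss would barely register, while a wide window keeps bone-range errors visible.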

[CV-150] Region of Interest based Medical Image Compression

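[Quick Read]: This paper addresses the storage and transmission cost of medical image data in remote healthcare services. Given the volume of such data, compression has to balance compression rate against image quality, especially where diagnostic information must be preserved. The proposed solution uses Region of Interest (ROI) coding: a U-Net segmentation model trained on the BraTS 2020 dataset identifies tumor regions, which are critical for diagnosis. These regions are compressed with High Efficiency Video Coding (HEVC) so that their quality is preserved, while non-essential regions are compressed more heavily. The method reduces storage space and transmission bandwidth for telemedicine and large-scale medical imaging while keeping the critical data intact.

Link: https://arxiv.org/abs/2501.02895
Authors: Utkarsh Prakash Srivastava, Toshiaki Fujii
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures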

Abstract: The vast volume of medical image data necessitates efficient compression techniques to support remote healthcare services. This paper explores Region of Interest (ROI) coding to address the balance between compression rate and image quality. By leveraging U-Net segmentation on the BraTS 2020 dataset, we accurately identify tumor regions, which are critical for diagnosis. These regions are then subjected to High Efficiency Video Coding (HEVC) for compression, enhancing compression rates while preserving essential diagnostic information. This approach ensures that critical image regions maintain their quality, while non-essential areas are compressed more. Our method optimizes storage space and transmission bandwidth, meeting the demands of telemedicine and large-scale medical imaging. Through this technique, we provide a robust solution that maintains the integrity of vital data and improves the efficiency of medical image handling.
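The principle of ROI coding can be illustrated without a video codec. The sketch below replaces HEVC with plain uniform quantization, purely as a stand-in: pixels inside the segmentation mask get a fine step, and pixels outside get a coarse one. The function names, step sizes, and the toy image and mask are assumptions.

```python
def quantize(value, step):
    """Round a pixel value to the nearest multiple of `step`."""
    return step * round(value / step)

def roi_quantize(image, mask, roi_step, background_step):
    """Quantize ROI pixels finely and background pixels coarsely.

    image: 2-D list of pixel intensities.
    mask: 2-D list of booleans, True where the pixel belongs to the ROI
          (in the paper's setting, the segmented tumor region).
    """
    return [
        [quantize(pixel, roi_step if in_roi else background_step)
         for pixel, in_roi in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]

image = [[103, 57], [200, 14]]
mask = [[True, False], [False, True]]
coded = roi_quantize(image, mask, roi_step=1, background_step=32)
```

A coarser step leaves fewer distinct values to encode, so the background costs fewer bits, while the ROI keeps its original precision.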

[CV-151] Diff-Lung: Diffusion-Based Texture Synthesis for Enhanced Pathological Tissue Segmentation in Lung CT Scans

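[Quick Read]: This paper addresses accurate quantification of lung pathological patterns (fibrosis, ground-glass opacity, emphysema, consolidation), which is needed for the diagnosis and follow-up of interstitial lung diseases. Segmentation is difficult because of the strong class imbalance between healthy and pathological tissue. The key idea is to use a diffusion model for data augmentation during training: it generates synthetic pathological tissue patches that preserve the shape characteristics and fine details specific to each tissue type. By increasing the occurrence of underrepresented classes in the training data, the method improves segmentation accuracy across all pathological tissue types, particularly the less common patterns. This supports more reliable automated analysis of lung CT scans and may improve clinical decision-making and patient outcomes.

Link: https://arxiv.org/abs/2501.02867
Authors: Rezkellah Noureddine Khiati, Pierre-Yves Brillet, Radu Ispas, Catalin Fetita
Affiliations: SAMOVAR, Telecom Sud-Paris, Institut Polytechnique de Paris, Evry, France; Keyrus France, Levallois-Perret, France; Avicenne Hospital, AP-HP, Bobigny, France
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at ISBI 2025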

Abstract: Accurate quantification of the extent of lung pathological patterns (fibrosis, ground-glass opacity, emphysema, consolidation) is a prerequisite for diagnosis and follow-up of interstitial lung diseases. However, segmentation is challenging due to the significant class imbalance between healthy and pathological tissues. This paper addresses this issue by leveraging a diffusion model for data augmentation applied while training an AI model. Our approach generates synthetic pathological tissue patches while preserving essential shape characteristics and intricate details specific to each tissue type. This method enhances the segmentation process by increasing the occurrence of underrepresented classes in the training data. We demonstrate that our diffusion-based augmentation technique improves segmentation accuracy across all pathological tissue types, particularly for the less common patterns. This advancement contributes to more reliable automated analysis of lung CT scans, potentially improving clinical decision-making and patient outcomes.

[CV-152] ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction

[Quick Read]: This paper addresses survival prediction in medicine, where current methods often rely on limited data modalities and therefore perform suboptimally. The authors propose the Integrated Cross-modal Fusion Network (ICFNet), which integrates histopathology whole slide images, genomic expression profiles, patient demographics, and treatment protocols. Three types of encoders, a residual orthogonal decomposition module, and a unification fusion module merge the multi-modal features to improve prediction accuracy. A balanced negative log-likelihood loss function is also designed to ensure fair training across different patients. Experiments show that ICFNet outperforms state-of-the-art algorithms on five public TCGA datasets, indicating its potential to support clinical decision-making and advance precision medicine.

Link: https://arxiv.org/abs/2501.02778
Authors: Binyu Zhang, Zhu Meng, Junhao Dong, Fei Su, Zhicheng Zhao
Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Beijing Key Laboratory of Network System and Network Culture
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Survival prediction is a crucial task in the medical field and is essential for optimizing treatment options and resource allocation. However, current methods often rely on limited data modalities, resulting in suboptimal performance. In this paper, we propose an Integrated Cross-modal Fusion Network (ICFNet) that integrates histopathology whole slide images, genomic expression profiles, patient demographics, and treatment protocols. Specifically, three types of encoders, a residual orthogonal decomposition module and a unification fusion module are employed to merge multi-modal features to enhance prediction accuracy. Additionally, a balanced negative log-likelihood loss function is designed to ensure fair training across different patients. Extensive experiments demonstrate that our ICFNet outperforms state-of-the-art algorithms on five public TCGA datasets, including BLCA, BRCA, GBMLGG, LUAD, and UCEC, and shows its potential to support clinical decision-making and advance precision medicine. The codes are available at: this https URL.

[CV-153] Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?

【速读】：该论文旨在解决由于操作者熟练度和成像环境差异导致的低质量超声图像（ultrasound imaging）增加问题，这一问题严重影响了诊断准确性，甚至在关键病例中可能导致重新诊断的风险。为了解决这一问题，论文提出了Ultrasound-QBench，一个综合性的基准测试工具，用于系统评估多模态大语言模型（MLLMs）在超声图像质量评估任务中的表现。解决方案的关键在于建立两个数据集：IVUSQA（包含7,709张图像）和CardiacUltraQA（包含3,863张图像），这些图像由专业超声专家标注，并根据质量分为高、中、低三个等级。此外，论文将质量评估任务分解为三个维度：定性分类、定量评分和比较评估，以更全面地评估MLLMs在超声图像质量分类中的初步能力。

Link: https://arxiv.org/abs/2501.02751
Authors: Hongyi Miao, Jun Jia, Yankun Cao, Yingjie Zhou, Yanwei Jiang, Zhi Liu, Guangtao Zhai
Affiliations: Shandong University; Shanghai Jiao Tong University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:With the dramatic upsurge in the volume of ultrasound examinations, low-quality ultrasound imaging has gradually increased due to variations in operator proficiency and imaging circumstances, imposing a severe burden on diagnosis accuracy and even entailing the risk of restarting the diagnosis in critical cases. To assist clinicians in selecting high-quality ultrasound images and ensuring accurate diagnoses, we introduce Ultrasound-QBench, a comprehensive benchmark that systematically evaluates multimodal large language models (MLLMs) on quality assessment tasks of ultrasound images. Ultrasound-QBench establishes two datasets collected from diverse sources: IVUSQA, consisting of 7,709 images, and CardiacUltraQA, containing 3,863 images. These images encompassing common ultrasound imaging artifacts are annotated by professional ultrasound experts and classified into three quality levels: high, medium, and low. To better evaluate MLLMs, we decompose the quality assessment task into three dimensionalities: qualitative classification, quantitative scoring, and comparative assessment. The evaluation of 7 open-source MLLMs as well as 1 proprietary MLLMs demonstrates that MLLMs possess preliminary capabilities for low-level visual tasks in ultrasound image quality classification. We hope this benchmark will inspire the research community to delve deeper into uncovering and enhancing the untapped potential of MLLMs for medical imaging tasks.

[CV-154] KM-UNet KAN Mamba UNet for medical image segmentation

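[Quick Read]: This paper addresses two limitations in medical image segmentation: traditional CNN-based methods struggle to model long-range dependencies, while Transformer-based models, despite their success, have high computational complexity. The author proposes KM-UNet, a novel U-shaped network architecture that combines the strengths of Kolmogorov-Arnold Networks (KANs) and state-space models (SSMs). KM-UNet uses the Kolmogorov-Arnold representation theorem for efficient feature representation and SSMs for scalable long-range dependency modeling, balancing accuracy and computational efficiency. The key to the approach is the combination of KANs and SSMs, which improves long-range dependency modeling while keeping computational complexity low, yielding an efficient and interpretable solution for medical image segmentation.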

Link: https://arxiv.org/abs/2501.02559
Authors: Yibo Zhang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract: Medical image segmentation is a critical task in medical imaging analysis. Traditional CNN-based methods struggle with modeling long-range dependencies, while Transformer-based models, despite their success, suffer from quadratic computational complexity. To address these limitations, we propose KM-UNet, a novel U-shaped network architecture that combines the strengths of Kolmogorov-Arnold Networks (KANs) and state-space models (SSMs). KM-UNet leverages the Kolmogorov-Arnold representation theorem for efficient feature representation and SSMs for scalable long-range modeling, achieving a balance between accuracy and computational efficiency. We evaluate KM-UNet on five benchmark datasets: ISIC17, ISIC18, CVC, BUSI, and GLAS. Experimental results demonstrate that KM-UNet achieves competitive performance compared to state-of-the-art methods in medical image segmentation tasks. To the best of our knowledge, KM-UNet is the first medical image segmentation framework integrating KANs and SSMs. This work provides a valuable baseline and new insights for the development of more efficient and interpretable medical image segmentation systems. The code is open source at this https URL. Keywords: KAN, Mamba, state-space models, UNet, medical image segmentation, deep learning.

[CV-155] Framework for lung CT image segmentation based on UNet

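[Quick Read]: This paper targets two main problems in medical image segmentation: overfitting and small datasets. Although U-Net and its variants have achieved notable results, their over-complicated deep networks tend to extract meaningless information, and most of them are not suited to lung slice CT image segmentation. The authors propose a new whole-process network built on the advanced UNet++ model, with three key modules: data augmentation, an optimized neural network, and parameter fine-tuning. By combining these methods, the network shows a clear advantage in training results, reaching a leading accuracy of 98.03% with the lowest overfitting potential. It is among the first segmentation models designed specifically for lung slice CT images.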

Link: https://arxiv.org/abs/2501.02428
Authors: Hao Ziang, Jingsi Zhang, Lixian Li
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract: Recently, the state-of-the-art models for medical image segmentation have been U-Net and its variants. These networks, though succeeding in deriving notable results, ignore the practical problems hanging over the medical segmentation field: overfitting and small datasets. The over-complicated deep neural networks unnecessarily extract meaningless information, and a majority of them are not suitable for the lung slice CT image segmentation task. To overcome the two limitations, we propose a new whole-process network incorporating the advanced UNet++ model. The network comprises three main modules: data augmentation, an optimized neural network, and parameter fine-tuning. By incorporating diverse methods, the training results demonstrate a significant advantage over similar works, achieving a leading accuracy of 98.03% with the lowest overfitting potential. Our network is remarkable as one of the first to target lung slice CT images.

[CV-156] Revisiting Compactness for District Plans

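[Quick Read]: This paper has two aims. First, for the scores used to assess district maps, it introduces population-weighted versions of shape-based scores and shows a precise sense in which they interpolate between shape-based and discrete scores. Second, it modifies the ReCom sampling method to produce ensembles of maps with improved shape-based compactness scores. The key to the solution is that population weighting bridges shape-based and discrete scores, while the modified ReCom method improves map compactness, providing more effective tools for building fair district maps and litigating unfair ones.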

Link: https://arxiv.org/abs/2501.02325
Authors: Kristopher Tapp
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Modern sampling methods create ensembles of district maps that score well on discrete compactness scores, whereas the Polsby-Popper and other shape-based scores remain highly relevant for building fair maps and litigating unfair ones. The aim of this paper is twofold. First, we introduce population-weighted versions of shape-based scores and show a precise sense in which this interpolates between shape-based and discrete scores. Second, we introduce a modification of the ReCom sampling method that produces ensembles of maps with improved shape-based compactness scores.
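
For readers unfamiliar with the Polsby-Popper score mentioned above, the following is a minimal sketch of the classical (unweighted) score, 4πA / P², for a simple polygon. It does not implement the population-weighted variants or the modified ReCom sampler introduced in the paper; the function names and the example polygons are hypothetical.

```python
import math

def polygon_area_perimeter(vertices):
    """Return (area, perimeter) of a simple polygon given as (x, y) vertices in order."""
    n = len(vertices)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1          # shoelace formula term
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter

def polsby_popper(vertices):
    """Classical Polsby-Popper score: 1.0 for a circle, near 0 for contorted shapes."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 4.0 * math.pi * area / perimeter ** 2

# Hypothetical districts: a unit square and a long thin sliver of the same area.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
sliver = [(0, 0), (10, 0), (10, 0.1), (0, 0.1)]
```

The square scores π/4, while the sliver, despite having the same area, scores far lower because of its long boundary.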

[CV-157] Diabetic Retinopathy Detection Using CNN with Residual Block with DCGAN

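[Quick Read]: This paper addresses the early detection and classification of Diabetic Retinopathy (DR) in order to prevent vision loss. DR, caused by diabetes-induced damage to the retinal blood vessels, is a major cause of blindness worldwide. The paper proposes an automated system based on Convolutional Neural Networks (CNNs) with a residual block architecture to enhance feature extraction and model performance. To improve robustness, the study adds advanced data augmentation, specifically a Deep Convolutional Generative Adversarial Network (DCGAN) that generates diverse retinal images. This increases the variability of the training data, making the model more generalizable and able to handle real-world variation in retinal images. The system classifies retinal images into five categories, from No DR to Proliferative DR, providing an efficient and scalable solution for early diagnosis and monitoring of DR progression. The model is intended to support healthcare professionals in large-scale DR screening, especially in resource-constrained settings.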

Link: https://arxiv.org/abs/2501.02300
Authors: Debjany Ghosh Aronno, Sumaiya Saeha
Affiliations: Bangladesh University of Engineering & Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Diabetic Retinopathy (DR) is a major cause of blindness worldwide, caused by damage to the blood vessels in the retina due to diabetes. Early detection and classification of DR are crucial for timely intervention and preventing vision loss. This work proposes an automated system for DR detection using Convolutional Neural Networks (CNNs) with a residual block architecture, which enhances feature extraction and model performance. To further improve the model’s robustness, we incorporate advanced data augmentation techniques, specifically leveraging a Deep Convolutional Generative Adversarial Network (DCGAN) for generating diverse retinal images. This approach increases the variability of training data, making the model more generalizable and capable of handling real-world variations in retinal images. The system is designed to classify retinal images into five distinct categories, from No DR to Proliferative DR, providing an efficient and scalable solution for early diagnosis and monitoring of DR progression. The proposed model aims to support healthcare professionals in large-scale DR screening, especially in resource-constrained settings.

[CV-158] Deep Learning-Driven Segmentation of Ischemic Stroke Lesions Using Multi-Channel MRI

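[Quick Read]: This paper addresses inaccurate segmentation of ischemic stroke lesions in medical imaging, particularly Magnetic Resonance Imaging (MRI), where the variability and subtlety of stroke lesions mean existing techniques often fail to delineate them precisely. The key to the solution is a novel deep learning-based segmentation method that uses multi-channel MRI modalities: Diffusion Weighted Imaging (DWI), Apparent Diffusion Coefficient (ADC), and enhanced Diffusion Weighted Imaging (eDWI). The architecture uses DenseNet121 as the encoder and Self-Organized Operational Neural Networks (SelfONN) in the decoder, enhanced by Channel and Space Compound Attention (CSCA) and Double Squeeze-and-Excitation (DSE) blocks. A custom loss function that combines Dice Loss and Jaccard Loss through a weighted average is also introduced to improve performance. Trained and evaluated on the ISLES 2022 dataset, the model achieves a Dice Similarity Coefficient (DSC) of 83.88% using DWI alone, 85.86% with DWI and ADC, and 87.49% with DWI, ADC, and eDWI. According to the authors, the approach outperforms existing methods and addresses key limitations of current segmentation practice, improving diagnostic precision and treatment planning for ischemic stroke and supporting clinical decision-making.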

Link: https://arxiv.org/abs/2501.02287
Authors: Ashiqur Rahman, Muhammad E. H. Chowdhury, Md Sharjis Ibne Wadud, Rusab Sarmun, Adam Mushtak, Sohaib Bassam Zoghoul, Israa Al-Hashimi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Ischemic stroke, caused by cerebral vessel occlusion, presents substantial challenges in medical imaging due to the variability and subtlety of stroke lesions. Magnetic Resonance Imaging (MRI) plays a crucial role in diagnosing and managing ischemic stroke, yet existing segmentation techniques often fail to accurately delineate lesions. This study introduces a novel deep learning-based method for segmenting ischemic stroke lesions using multi-channel MRI modalities, including Diffusion Weighted Imaging (DWI), Apparent Diffusion Coefficient (ADC), and enhanced Diffusion Weighted Imaging (eDWI). The proposed architecture integrates DenseNet121 as the encoder with Self-Organized Operational Neural Networks (SelfONN) in the decoder, enhanced by Channel and Space Compound Attention (CSCA) and Double Squeeze-and-Excitation (DSE) blocks. Additionally, a custom loss function combining Dice Loss and Jaccard Loss with weighted averages is introduced to improve model performance. Trained and evaluated on the ISLES 2022 dataset, the model achieved Dice Similarity Coefficients (DSC) of 83.88% using DWI alone, 85.86% with DWI and ADC, and 87.49% with the integration of DWI, ADC, and eDWI. This approach not only outperforms existing methods but also addresses key limitations in current segmentation practices. These advancements significantly enhance diagnostic precision and treatment planning for ischemic stroke, providing valuable support for clinical decision-making.
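
The abstract mentions a custom loss that combines Dice Loss and Jaccard Loss with weighted averages. A minimal NumPy sketch of one such combination is given below. The weight `w_dice`, the smoothing term `eps`, and the function names are assumptions made for illustration; the abstract does not state the weights or the exact formulation the authors use.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P*T| / (|P| + |T|)."""
    inter = float(np.sum(pred * target))
    total = float(np.sum(pred)) + float(np.sum(target))
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def jaccard_loss(pred, target, eps=1e-7):
    """Soft Jaccard (IoU) loss: 1 - |P*T| / (|P| + |T| - |P*T|)."""
    inter = float(np.sum(pred * target))
    union = float(np.sum(pred)) + float(np.sum(target)) - inter
    return 1.0 - (inter + eps) / (union + eps)

def combined_loss(pred, target, w_dice=0.5):
    """Weighted average of the two losses; w_dice is a hypothetical weight."""
    return w_dice * dice_loss(pred, target) + (1.0 - w_dice) * jaccard_loss(pred, target)

# Toy 2x2 masks: `mask` is the ground truth, `half` covers one of its two pixels.
mask = np.array([[1, 1], [0, 0]], dtype=float)
half = np.array([[1, 0], [0, 0]], dtype=float)
```

With these toy masks, a perfect prediction gives a loss of about 0, while `half` gives a Dice loss of 1/3 and a Jaccard loss of 1/2, so the equally weighted combination is about 0.4167.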

[CV-159] tCURLoRA: Tensor CUR Decomposition Based Low-Rank Parameter Adaptation for Medical Image Segmentation

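[Quick Read]: This paper addresses the computational and storage challenges that full fine-tuning poses in resource-constrained environments as deep neural networks scale up. It proposes tCURLoRA, a novel fine-tuning method based on tensor CUR decomposition. The key idea is to concatenate the pre-trained weight matrices into a three-dimensional tensor, apply tensor CUR decomposition, and update only the lower-order tensor components during fine-tuning, which substantially reduces computational complexity and storage requirements. Experiments show that tCURLoRA outperforms existing parameter-efficient fine-tuning (PEFT) methods on medical image segmentation tasks.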

Link: https://arxiv.org/abs/2501.02227
Authors: Guanghua He, Wangang Cheng, Hancan Zhu, Xiaohao Cai, Gaohang Yu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Transfer learning, by leveraging knowledge from pre-trained models, has significantly enhanced the performance of target tasks. However, as deep neural networks scale up, full fine-tuning introduces substantial computational and storage challenges in resource-constrained environments, limiting its widespread adoption. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed to reduce computational complexity and storage requirements by minimizing the number of updated parameters. While matrix decomposition-based PEFT methods, such as LoRA, show promise, they struggle to fully capture the high-dimensional structural characteristics of model weights. In contrast, high-dimensional tensors offer a more natural representation of neural network weights, allowing for a more comprehensive capture of higher-order features and multi-dimensional interactions. In this paper, we propose tCURLoRA, a novel fine-tuning method based on tensor CUR decomposition. By concatenating pre-trained weight matrices into a three-dimensional tensor and applying tensor CUR decomposition, we update only the lower-order tensor components during fine-tuning, effectively reducing computational and storage overhead. Experimental results demonstrate that tCURLoRA outperforms existing PEFT methods in medical image segmentation tasks.
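
To give a feel for the CUR idea, here is a small sketch of the ordinary matrix CUR decomposition, which is the two-dimensional analogue of the tensor CUR decomposition named in the abstract. It is not the tCURLoRA method itself: the tensor construction, the index selection strategy, and the training loop are not described in the abstract and are not reproduced here. The function name and the chosen indices are hypothetical.

```python
import numpy as np

def cur_decompose(W, col_idx, row_idx):
    """Matrix CUR: W is approximated by C @ U @ R using actual columns and rows of W."""
    C = W[:, col_idx]                               # selected columns
    R = W[row_idx, :]                               # selected rows
    U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)   # small core matrix
    return C, U, R

rng = np.random.default_rng(0)
# A rank-3 stand-in for a pre-trained weight matrix (8 x 6).
W = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))

C, U, R = cur_decompose(W, col_idx=[0, 1, 2], row_idx=[0, 1, 2])
W_hat = C @ U @ R

# In a LoRA-style setup one could freeze C and R and train only the core U,
# which here has 9 entries instead of the 48 entries of W.
trainable, full = U.size, W.size
```

Because `W` has rank 3 and three columns and rows are selected, the reconstruction is exact up to floating-point error; for full-rank weights the product would only be an approximation.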

[CV-160] Tree-NET: Enhancing Medical Image Segmentation Through Efficient Low-Level Feature Training

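[Quick Read]: This paper addresses the trade-off between computational efficiency and segmentation accuracy in medical image segmentation. Existing bottleneck feature supervision methods are largely confined to the training phase and bring no significant gain in computational efficiency. The paper proposes the Tree-NET framework, whose key innovation is two additional training phases that use bottleneck features at both the input and output stages, markedly reducing input and output dimensions with a negligible increase in parameter count. Tree-NET has a three-layer architecture: Encoder-Net, which compresses the input data; Decoder-Net, which compresses the label data; and Bridge-Net, a segmentation framework that supervises the bottleneck features. By focusing on dense, compressed feature representations, Tree-NET improves computational efficiency without altering the internal structure of existing segmentation models or increasing model size, and its effectiveness is validated on skin lesion and polyp segmentation. Experiments show that Tree-NET reduces FLOPs by a factor of 4 to 13 and lowers memory usage while matching or exceeding the accuracy of the original models.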

Link: https://arxiv.org/abs/2501.02140
Authors: Orhan Demirci, Bulent Yilmaz
Affiliations: Hacettepe University; Gulf University for Science and Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This manuscript is 10 pages long, includes 10 figures and 3 tables, and presents a novel framework for medical image segmentation. It has been submitted to the Medical Image Analysis journal for review

Click to view abstract

Abstract:This paper introduces Tree-NET, a novel framework for medical image segmentation that leverages bottleneck feature supervision to enhance both segmentation accuracy and computational efficiency. While previous studies have employed bottleneck feature supervision, their applications have largely been limited to the training phase, offering no computational benefits during training or evaluation. To the best of our knowledge, this study is the first to propose a framework that incorporates two additional training phases for segmentation models, utilizing bottleneck features at both input and output stages. This approach significantly improves computational performance by reducing input and output dimensions with a negligible addition to parameter count, without compromising accuracy. Tree-NET features a three-layer architecture comprising Encoder-Net and Decoder-Net, which are autoencoders designed to compress input and label data, respectively, and Bridge-Net, a segmentation framework that supervises the bottleneck features. By focusing on dense, compressed representations, Tree-NET enhances operational efficiency and can be seamlessly integrated into existing segmentation models without altering their internal structures or increasing model size. We evaluate Tree-NET on two critical segmentation tasks – skin lesion and polyp segmentation – using various backbone models, including U-NET variants and Polyp-PVT. Experimental results demonstrate that Tree-NET reduces FLOPs by a factor of 4 to 13 and decreases memory usage, while achieving comparable or superior accuracy compared to the original architectures. These findings underscore Tree-NET’s potential as a robust and efficient solution for medical image segmentation.

[CV-161] Multi-Center Study on Deep Learning-Assisted Detection and Classification of Fetal Central Nervous System Anomalies Using Ultrasound Imaging

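[Quick Read]: This paper addresses the accuracy and efficiency of diagnosing fetal central nervous system (CNS) anomalies in prenatal ultrasound. Prenatal ultrasound is used to evaluate fetal growth and detect congenital abnormalities, but interpreting the images requires radiologist expertise and sophisticated equipment; without them, identification rates for specific types of fetal CNS anomalies can be low and patients may undergo unnecessary examinations. The paper proposes a deep learning model to improve the overall accuracy of diagnosing fetal cranial anomalies and to aid prenatal diagnosis. The model is built on a multi-center dataset covering four typical fetal CNS anomalies: anencephaly, encephalocele (including meningocele), holoprosencephaly, and rachischisis. It reaches a patient-level prediction accuracy of 94.5% with an AUROC of 99.3%. Heatmaps superimposed on the ultrasound images provide a visual interpretation of the algorithm and give physicians an intuitive aid by highlighting key areas that need review. Finally, a retrospective reader study indicates that combining the system's automatic predictions with the radiologist's professional judgment improves diagnostic accuracy and efficiency and reduces the misdiagnosis rate, suggesting promising clinical applicability.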

Link: https://arxiv.org/abs/2501.02000
Authors: Yang Qi, Jiaxin Cai, Jing Lu, Runqing Xiong, Rongshang Chen, Liping Zheng, Duo Ma
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract: Prenatal ultrasound evaluates fetal growth and detects congenital abnormalities during pregnancy, but the examination of ultrasound images by radiologists requires expertise and sophisticated equipment, which would otherwise fail to improve the rate of identifying specific types of fetal central nervous system (CNS) abnormalities and result in unnecessary patient examinations. We construct a deep learning model to improve the overall accuracy of the diagnosis of fetal cranial anomalies to aid prenatal diagnosis. In our collected multi-center dataset of fetal craniocerebral anomalies covering four typical anomalies of the fetal central nervous system (CNS): anencephaly, encephalocele (including meningocele), holoprosencephaly, and rachischisis, patient-level prediction accuracy reaches 94.5%, with an AUROC value of 99.3%. In the subgroup analyses, our model is applicable to the entire gestational period, with good identification of fetal anomaly types for any gestational period. Heatmaps superimposed on the ultrasound images not only provide a visual interpretation for the algorithm but also provide an intuitive visual aid to the physician by highlighting key areas that need to be reviewed, helping the physician to quickly identify and validate key areas. Finally, the retrospective reader study demonstrates that by combining the automatic prediction of the DL system with the professional judgment of the radiologist, the diagnostic accuracy and efficiency can be effectively improved and the misdiagnosis rate can be reduced, which has an important clinical application prospect.

[CV-162] Leveraging AI for Automatic Classification of PCOS Using Ultrasound Imaging

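[Quick Read]: This paper aims to improve the diagnostic capability of artificial intelligence (AI) in identifying Polycystic Ovary Syndrome (PCOS) through automated classification of healthy and unhealthy ultrasound frames. The key to the solution is a robust AI pipeline that uses transfer learning with the InceptionV3 architecture for high-accuracy binary classification. Preprocessing steps optimize the dataset, and interpretability methods such as LIME and saliency maps give insight into the model's decision-making. The approach reaches an accuracy of 90.52% on validation data, with precision, recall, and F1-score all above 90%, demonstrating its effectiveness for medical diagnosis.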

Link: https://arxiv.org/abs/2501.01984
Authors: Atharva Divekar, Atharva Sonawane
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at: this https URL

Click to view abstract

Abstract:The AUTO-PCOS Classification Challenge seeks to advance the diagnostic capabilities of artificial intelligence (AI) in identifying Polycystic Ovary Syndrome (PCOS) through automated classification of healthy and unhealthy ultrasound frames. This report outlines our methodology for building a robust AI pipeline utilizing transfer learning with the InceptionV3 architecture to achieve high accuracy in binary classification. Preprocessing steps ensured the dataset was optimized for training, validation, and testing, while interpretability methods like LIME and saliency maps provided valuable insights into the model’s decision-making. Our approach achieved an accuracy of 90.52%, with precision, recall, and F1-score metrics exceeding 90% on validation data, demonstrating its efficacy. The project underscores the transformative potential of AI in healthcare, particularly in addressing diagnostic challenges like PCOS. Key findings, challenges, and recommendations for future enhancements are discussed, highlighting the pathway for creating reliable, interpretable, and scalable AI-driven medical diagnostic tools.

Artificial Intelligence

[AI-0] LightGNN: Simple Graph Neural Network for Recommendation

Link: https://arxiv.org/abs/2501.03228
Authors: Guoxuan Chen, Lianghao Xia, Chao Huang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Graph neural networks (GNNs) have demonstrated superior performance in collaborative recommendation through their ability to conduct high-order representation smoothing, effectively capturing structural information within users’ interaction patterns. However, existing GNN paradigms face significant challenges in scalability and robustness when handling large-scale, noisy, and real-world datasets. To address these challenges, we present LightGNN, a lightweight and distillation-based GNN pruning framework designed to substantially reduce model complexity while preserving essential collaboration modeling capabilities. Our LightGNN framework introduces a computationally efficient pruning module that adaptively identifies and removes redundant edges and embedding entries for model compression. The framework is guided by a resource-friendly hierarchical knowledge distillation objective, whose intermediate layer augments the observed graph to maintain performance, particularly in high-rate compression scenarios. Extensive experiments on public datasets demonstrate LightGNN’s effectiveness, significantly improving both computational efficiency and recommendation accuracy. Notably, LightGNN achieves an 80% reduction in edge count and 90% reduction in embedding entries while maintaining performance comparable to more complex state-of-the-art baselines. The implementation of our LightGNN framework is available at the github repository: this https URL.

[AI-1] Turn-based Multi-Agent Reinforcement Learning Model Checking

Link: https://arxiv.org/abs/2501.03187
Authors: Dennis Gross
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:In this paper, we propose a novel approach for verifying the compliance of turn-based multi-agent reinforcement learning (TMARL) agents with complex requirements in stochastic multiplayer games. Our method overcomes the limitations of existing verification approaches, which are inadequate for dealing with TMARL agents and not scalable to large games with multiple agents. Our approach relies on tight integration of TMARL and a verification technique referred to as model checking. We demonstrate the effectiveness and scalability of our technique through experiments in different types of environments. Our experiments show that our method is suited to verify TMARL agents and scales better than naive monolithic model checking.

[AI-2] FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles

Link: https://arxiv.org/abs/2501.03181
Authors: Tian-Hao Zhang, Jiawei Zhang, Jun Wang, Xinyuan Qian, Xu-Cheng Yin
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Humans can perceive speakers’ characteristics (e.g., identity, gender, personality and emotion) by their appearance, which are generally aligned to their voice style. Recently, vision-driven Text-to-speech (TTS) scholars grounded their investigations on real-person faces, thereby restricting effective speech synthesis from applying to vast potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity characteristics and emotional representations from a wide variety of image styles. Meanwhile, it mitigates the extraneous information (e.g., background, clothing, and hair color, etc.), resulting in synthesized speech closely aligned with a character’s persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised an innovative dataset, namely Expressive Multi-Modal TTS, which is diligently curated and annotated to facilitate research in this domain. The experimental results demonstrate our proposed FaceSpeak can generate portrait-aligned voice with satisfactory naturalness and quality.

[AI-3] The Scaling Law for LoRA Base on Mutual Information Upper Bound

Link: https://arxiv.org/abs/2501.03152
Authors: Jing Zhang, Hui Gao, Peng Zhang, Shuzhen Sun, Chang Yang, Yuexian Hou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:LoRA (Low-Rank Adaptation) is a widely used model fine-tuning method. In fine-tuning, the law among model performance, model parameters, and data complexity has been a focal issue in the field. Existing methods often leverage external metrics (such as cross-entropy or perplexity) to evaluate model performance. In the fine-tuning process for large models, two types of knowledge are typically involved: the frozen, general knowledge acquired by the model during pre-training and the new knowledge learned through the LoRA module from the current data. Generally, the less LoRA’s learned knowledge relies on the large model, the more it captures the specific knowledge of new data, thereby enhancing its adaptability to new tasks. However, external metrics do not readily capture the dependency relationship between these two types of knowledge. Therefore, we designed an internal metric based on the Mutual Information Upper Bound (MIUB) theory to investigate the scaling law of large-model LoRA fine-tuning. In our experiments, we validated this approach on benchmark datasets, using the Llama3-8B and Phi3-3B models. The results show that the proposed MIUB metric aligns more accurately and stably with the scaling law of LoRA fine-tuning compared to cross-entropy and perplexity.

[AI-4] Co-Activation Graph Analysis of Safety-Verified and Explainable Deep Reinforcement Learning Policies

Link: https://arxiv.org/abs/2501.03142
Authors: Dennis Gross, Helge Spieker
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Deep reinforcement learning (RL) policies can demonstrate unsafe behaviors and are challenging to interpret. To address these challenges, we combine RL policy model checking–a technique for determining whether RL policies exhibit unsafe behaviors–with co-activation graph analysis–a method that maps neural network inner workings by analyzing neuron activation patterns–to gain insight into the safe RL policy’s sequential decision-making. This combination lets us interpret the RL policy’s inner workings for safe decision-making. We demonstrate its applicability in various experiments.
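
As a rough illustration of what a co-activation graph can look like, the sketch below connects neurons whose activations are strongly correlated across a set of inputs. This is an assumed, simplified construction (Pearson correlation with a fixed threshold); the abstract does not specify how the authors build or analyze their graphs, and the function name, threshold, and toy activations are hypothetical.

```python
import numpy as np

def co_activation_graph(activations, threshold=0.5):
    """Build an edge list (i, j, weight) from a samples-by-neurons activation matrix.

    Two neurons are linked when the absolute Pearson correlation of their
    activations reaches the threshold.
    """
    corr = np.corrcoef(activations, rowvar=False)
    n = corr.shape[0]
    return [
        (i, j, float(corr[i, j]))
        for i in range(n)
        for j in range(i + 1, n)
        if abs(corr[i, j]) >= threshold
    ]

# Toy activations for 3 neurons over 4 inputs (e.g. states visited by a policy).
acts = np.array([
    [0.0, 0.0,  1.0],
    [1.0, 2.0, -1.0],
    [2.0, 4.0,  1.0],
    [3.0, 6.0, -1.0],
])
edges = co_activation_graph(acts, threshold=0.5)
```

In this toy case only neurons 0 and 1 are linked, since neuron 1 is an exact multiple of neuron 0, while neuron 2 is only weakly correlated with both.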

[AI-5] From Models to Network Topologies: A Topology Inference Attack in Decentralized Federated Learning

Link: https://arxiv.org/abs/2501.03119
Authors: Chao Feng, Yuanzhe Gao, Alberto Huertas Celdran, Gerome Bovet, Burkhard Stiller
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) is widely recognized as a privacy-preserving machine learning paradigm due to its model-sharing mechanism that avoids direct data exchange. However, model training inevitably leaves exploitable traces that can be used to infer sensitive information. In Decentralized FL (DFL), the overlay topology significantly influences its models’ convergence, robustness, and security. This study explores the feasibility of inferring the overlay topology of DFL systems based solely on model behavior, introducing a novel Topology Inference Attack. A taxonomy of topology inference attacks is proposed, categorizing them by the attacker’s capabilities and knowledge. Practical attack strategies are developed for different scenarios, and quantitative experiments are conducted to identify key factors influencing the attack effectiveness. Experimental results demonstrate that analyzing only the public models of individual nodes can accurately infer the DFL topology, underscoring the risk of sensitive information leakage in DFL systems. This finding offers valuable insights for improving privacy preservation in decentralized learning environments.

[AI-6] Personalized Fashion Recommendation with Image Attributes and Aesthetics Assessment

Link: https://arxiv.org/abs/2501.03085
Authors: Chongxian Chen, Fan Mo, Xin Fan, Hayato Yamana
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Personalized fashion recommendation is a difficult task because 1) the decisions are highly correlated with users’ aesthetic appetite, which previous work frequently overlooks, and 2) many new items are constantly rolling out that cause strict cold-start problems in the popular identity (ID)-based recommendation methods. These new items are critical to recommend because of trend-driven consumerism. In this work, we aim to provide more accurate personalized fashion recommendations and solve the cold-start problem by converting available information, especially images, into two attribute graphs focusing on optimized image utilization and noise-reducing user modeling. Compared with previous methods that separate image and text as two components, the proposed method combines image and text information to create a richer attributes graph. Capitalizing on the advancement of large language and vision models, we experiment with extracting fine-grained attributes efficiently and as desired using two different prompts. Preliminary experiments on the IQON3000 dataset have shown that the proposed method achieves competitive accuracy compared with baselines.

[AI-7] Survival Analysis Revisited: Understanding and Unifying Poisson, Exponential, and Cox Models in Fall Risk Analysis

Link: https://arxiv.org/abs/2501.03058
Authors: Tianhua Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper explores foundational and applied aspects of survival analysis, using fall risk assessment as a case study. It revisits key time-related probability distributions and statistical methods, including logistic regression, Poisson regression, Exponential regression, and the Cox Proportional Hazards model, offering a unified perspective on their relationships within the survival analysis framework. A contribution of this work is the step-by-step derivation and clarification of the relationships among these models, particularly demonstrating that Poisson regression in the survival context is a specific case of the Cox model. These insights address gaps in understanding and reinforce the simplicity and interpretability of survival models. The paper also emphasizes the practical utility of survival analysis by connecting theoretical insights with real-world applications. In the context of fall detection, it demonstrates how these models can simultaneously predict fall risk, analyze contributing factors, and estimate time-to-event outcomes within a single streamlined framework. In contrast, advanced deep learning methods often require complex post-hoc interpretation and separate training for different tasks particularly when working with structured numerical data. This highlights the enduring relevance of classical statistical frameworks and makes survival models especially valuable in healthcare settings, where explainability and robustness are critical. By unifying foundational concepts and offering a cohesive perspective on time-to-event analysis, this work serves as an accessible resource for understanding survival models and applying them effectively to diverse analytical challenges.
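
One of the simpler links among these models can be checked numerically: under a constant hazard (the exponential model), the probability of surviving to time t equals the Poisson probability of observing zero events with mean rate × t. The sketch below shows this together with the standard maximum-likelihood estimate of the rate under right censoring. It is a textbook illustration, not the paper's derivation of the Poisson-Cox relationship; the data and function names are hypothetical.

```python
import math

def exponential_rate_mle(durations, events):
    """MLE of a constant hazard with right censoring: events / total time at risk."""
    return sum(events) / sum(durations)

def survival_prob(rate, t):
    """Exponential survival function S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)

def poisson_pmf(k, mean):
    """Poisson probability of exactly k events given the expected count."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Hypothetical follow-up data: time observed and whether a fall occurred.
durations = [2.0, 5.0, 3.0, 10.0]
events = [1, 0, 1, 0]                 # 1 = fall observed, 0 = censored

rate = exponential_rate_mle(durations, events)   # 2 events / 20 time units
no_fall_by_5 = survival_prob(rate, 5.0)
zero_events_in_5 = poisson_pmf(0, rate * 5.0)
```

Here the estimated rate is 0.1 falls per time unit, and the two probabilities coincide, which is the sense in which the exponential and Poisson views describe the same process.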

[AI-8] To Analyze and Regulate Human-in-the-loop Learning for Congestion Games

Link: https://arxiv.org/abs/2501.03055
Authors: Hongbo Li, Lingjie Duan
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2211.14029

Click to view abstract

Abstract: In congestion games, selfish users behave myopically to crowd to the shortest paths, and the social planner designs mechanisms to regulate such selfish routing through information or payment incentives. However, such mechanism design requires the knowledge of time-varying traffic conditions and it is the users themselves to learn and report past road experiences to the social planner (e.g., Waze or Google Maps). When congestion games meet mobile crowdsourcing, it is critical to incentivize selfish users to explore non-shortest paths in the best exploitation-exploration trade-off. First, we consider a simple but fundamental parallel routing network with one deterministic path and multiple stochastic paths for users with an average arrival probability \(\lambda\). We prove that the current myopic routing policy (widely used in Waze and Google Maps) misses both exploration (when strong hazard belief) and exploitation (when weak hazard belief) as compared to the social optimum. Due to the myopic policy’s under-exploration, we prove that the caused price of anarchy (PoA) is larger than \(\frac{1}{1-\rho^{\frac{1}{\lambda}}}\), which can be arbitrarily large as discount factor \(\rho\rightarrow 1\). To mitigate such huge efficiency loss, we propose a novel selective information disclosure (SID) mechanism: we only reveal the latest traffic information to users when they intend to over-explore stochastic paths upon arrival, while hiding such information when they want to under-explore. We prove that our mechanism successfully reduces PoA to be less than \(2\). Besides the parallel routing network, we further extend our mechanism and PoA results to any linear path graphs with multiple intermediate nodes.
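
To see how quickly the stated lower bound on the price of anarchy grows, the sketch below evaluates it for a few discount factors. It assumes the bound in the abstract reads 1 / (1 - rho^(1/lambda)), with rho the discount factor in (0, 1) and lambda the average arrival probability in (0, 1]; the function name and the sample values are hypothetical.

```python
def poa_lower_bound(rho, lam):
    """Evaluate 1 / (1 - rho ** (1 / lam)) for rho in (0, 1) and lam in (0, 1]."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    return 1.0 / (1.0 - rho ** (1.0 / lam))

# The bound grows without limit as the discount factor approaches 1.
bounds = {rho: poa_lower_bound(rho, lam=1.0) for rho in (0.5, 0.9, 0.99, 0.999)}
```

With lambda = 1 the expression reduces to 1 / (1 - rho), so rho = 0.5 gives 2 and each step closer to 1 inflates the bound further, in contrast to the constant bound of 2 claimed for the SID mechanism.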

[AI-9] Piano Transcription by Hierarchical Language Modeling with Pretrained Roll-based Encoders ICASSP2025

Link: https://arxiv.org/abs/2501.03038
Authors: Dichucheng Li, Yongyi Zang, Qiuqiang Kong
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICASSP 2025

Abstract:Automatic Music Transcription (AMT), aiming to get musical notes from raw audio, typically uses frame-level systems with piano-roll outputs or language model (LM)-based systems with note-level predictions. However, frame-level systems require manual thresholding, while the LM-based systems struggle with long sequences. In this paper, we propose a hybrid method combining pre-trained roll-based encoders with an LM decoder to leverage the strengths of both methods. Besides, our approach employs a hierarchical prediction strategy, first predicting onset and pitch, then velocity, and finally offset. The hierarchical prediction strategy reduces computational costs by breaking down long sequences into different hierarchies. Evaluated on two benchmark roll-based encoders, our method outperforms traditional piano-roll outputs 0.01 and 0.022 in onset-offset-velocity F1 score, demonstrating its potential as a performance-enhancing plug-in for arbitrary roll-based music transcription encoder. We release the code of this work at this https URL.

[AI-10] Putnam’s Critical and Explanatory Tendencies Interpreted from a Machine Learning Perspective

Link: https://arxiv.org/abs/2501.03026
Authors: Sheldon Z. Soudin
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Making sense of theory choice in normal and across extraordinary science is central to philosophy of science. The emergence of machine learning models has the potential to act as a wrench in the gears of current debates. In this paper, I will attempt to reconstruct the main movements that lead to and came out of Putnam’s critical and explanatory tendency distinction, argue for the biconditional necessity of the tendencies, and conceptualize that wrench through a machine learning interpretation of my claim.

[AI-11] A Bio-Inspired Research Paradigm of Collision Perception Neurons Enabling Neuro-Robotic Integration: The LGMD Case

Link: https://arxiv.org/abs/2501.02982
Authors: Ziyan Qin, Jigen Peng, Shigang Yue, Qinbing Fu
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Compared to human vision, insect visual systems excel at rapid and precise collision detection, despite relying on only tens of thousands of neurons organized through a few neuropils. This efficiency makes them an attractive model system for developing artificial collision-detecting systems. Specifically, researchers have identified collision-selective neurons in the locust’s optic lobe, called lobula giant movement detectors (LGMDs), which respond specifically to approaching objects. Research upon LGMD neurons began in the early 1970s. Initially, due to their large size, these neurons were identified as motion detectors, but their role as looming detectors was recognized over time. Since then, progress in neuroscience, computational modeling of LGMD’s visual neural circuits, and LGMD-based robotics has advanced in tandem, each field supporting and driving the others. Today, with a deeper understanding of LGMD neurons, LGMD-based models have significantly improved collision-free navigation in mobile robots including ground and aerial robots. This review highlights recent developments in LGMD research from the perspectives of neuroscience, computational modeling, and robotics. It emphasizes a biologically plausible research paradigm, where insights from neuroscience inform real-world applications, which would in turn validate and advance neuroscience. With strong support from extensive research and growing application demand, this paradigm has reached a mature stage and demonstrates versatility across different areas of neuroscience research, thereby enhancing our understanding of the interconnections between neuroscience, computational modeling, and robotics. Furthermore, other motion-sensitive neurons have also shown promising potential for adopting this research paradigm.

[AI-12] CONTINUUM: Detecting APT Attacks through Spatial-Temporal Graph Neural Networks

Link: https://arxiv.org/abs/2501.02981
Authors: Atmane Ayoub Mansour Bahara, Kamel Soaïd Ferrahia, Mohamed-Lamine Messai, Hamida Seba, Karima Amrouche
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 31 pages

Abstract:Advanced Persistent Threats (APTs) represent a significant challenge in cybersecurity due to their sophisticated and stealthy nature. Traditional Intrusion Detection Systems (IDS) often fall short in detecting these multi-stage attacks. Recently, Graph Neural Networks (GNNs) have been employed to enhance IDS capabilities by analyzing the complex relationships within networked data. However, existing GNN-based solutions are hampered by high false positive rates and substantial resource consumption. In this paper, we present a novel IDS designed to detect APTs using a Spatio-Temporal Graph Neural Network Autoencoder. Our approach leverages spatial information to understand the interactions between entities within a graph and temporal information to capture the evolution of the graph over time. This dual perspective is crucial for identifying the sequential stages of APTs. Furthermore, to address privacy and scalability concerns, we deploy our architecture in a federated learning environment. This setup ensures that local data remains on-premise while encrypted model-weights are shared and aggregated using homomorphic encryption, maintaining data privacy and security. Our evaluation shows that this system effectively detects APTs with lower false positive rates and optimized resource usage compared to existing methods, highlighting the potential of spatio-temporal analysis and federated learning in enhancing cybersecurity defenses.

[AI-13] CAMP: Collaborative Attention Model with Profiles for Vehicle Routing Problems AAMAS2025

Link: https://arxiv.org/abs/2501.02977
Authors: Chuanbo Hua, Federico Berto, Jiwoo Son, Seunghyun Kang, Changhyun Kwon, Jinkyoo Park
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted at AAMAS 2025

Abstract:The profiled vehicle routing problem (PVRP) is a generalization of the heterogeneous capacitated vehicle routing problem (HCVRP) in which the objective is to optimize the routes of vehicles to serve client demands subject to different vehicle profiles, with each having a preference or constraint on a per-client basis. While existing learning methods have shown promise for solving the HCVRP in real-time, no learning method exists to solve the more practical and challenging PVRP. In this paper, we propose a Collaborative Attention Model with Profiles (CAMP), a novel approach that learns efficient solvers for PVRP using multi-agent reinforcement learning. CAMP employs a specialized attention-based encoder architecture to embed profiled client embeddings in parallel for each vehicle profile. We design a communication layer between agents for collaborative decision-making across profiled embeddings at each decoding step and a batched pointer mechanism to attend to the profiled embeddings to evaluate the likelihood of the next actions. We evaluate CAMP on two variants of PVRPs: PVRP with preferences, which explicitly influence the reward function, and PVRP with zone constraints with different numbers of agents and clients, demonstrating that our learned solvers achieve competitive results compared to both classical state-of-the-art neural multi-agent models in terms of solution quality and computational efficiency. We make our code openly available at this https URL.

[AI-14] Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls

Link: https://arxiv.org/abs/2501.02975
Authors: Can Gao, Xiaofeng Tan, Jie Zhou, Weiping Ding, Witold Pedrycz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data and has been extensively studied and used in a variety of practical tasks. However, most unsupervised outlier detection methods are carefully designed to detect specified outliers, while real-world data may be entangled with different types of outliers. In this study, we propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers. Specifically, a novel fuzzy rough sets-based method that integrates relative fuzzy granule density is first introduced to improve the capability of detecting local outliers. Then, a multi-scale view generation method based on granular-ball computing is proposed to collaboratively identify group outliers at different levels of granularity. Moreover, reliable outliers and inliers determined by the three-way decision are used to train a weighted support vector machine to further improve the performance of outlier detection. The proposed method innovatively transforms unsupervised outlier detection into a semi-supervised classification problem and for the first time explores the fuzzy rough sets-based outlier detection from the perspective of multi-scale granular balls, allowing for high adaptability to different types of outliers. Extensive experiments carried out on both artificial and UCI datasets demonstrate that the proposed outlier detection method significantly outperforms the state-of-the-art methods, improving the results by at least 8.48% in terms of the Area Under the ROC Curve (AUROC) index. The source codes are released at this https URL.

[AI-15] Proof-of-Data: A Consensus Protocol for Collaborative Intelligence

Link: https://arxiv.org/abs/2501.02971
Authors: Huiwen Liu, Feida Zhu, Ling Cheng
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract: Existing research on federated learning has been focused on the setting where learning is coordinated by a centralized entity. Yet the greatest potential of future collaborative intelligence would be unleashed in a more open and democratized setting with no central entity in a dominant role, referred to as “decentralized federated learning”. New challenges arise accordingly in achieving both correct model training and fair reward allocation with collective effort among all participating nodes, especially with the threat of the Byzantine node jeopardising both tasks. In this paper, we propose a blockchain-based decentralized Byzantine fault-tolerant federated learning framework based on a novel Proof-of-Data (PoD) consensus protocol to resolve both the “trust” and “incentive” components. By decoupling model training and contribution accounting, PoD is able to enjoy not only the benefit of learning efficiency and system liveliness from asynchronous societal-scale PoW-style learning but also the finality of consensus and reward allocation from epoch-based BFT-style voting. To mitigate false reward claims by data forgery from Byzantine attacks, a privacy-aware data verification and contribution-based reward allocation mechanism is designed to complete the framework. Our evaluation results show that PoD demonstrates performance in model training close to that of the centralized counterpart while achieving trust in consensus and fairness for reward allocation with a fault tolerance ratio of 1/3.

[AI-16] Skillful High-Resolution Ensemble Precipitation Forecasting with an Integrated Deep Learning Framework

Link: https://arxiv.org/abs/2501.02905
Authors: Shuangshuang He, Hongli Liang, Yuanting Zhang, Xingyuan Yuan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: High-resolution precipitation forecasts are crucial for providing accurate weather prediction and supporting effective responses to extreme weather events. Traditional numerical models struggle with stochastic subgrid-scale processes, while recent deep learning models often produce blurry results. To address these challenges, we propose a physics-inspired deep learning framework for high-resolution (0.05° × 0.05°) ensemble precipitation forecasting. Trained on ERA5 and CMPA high-resolution precipitation datasets, the framework integrates deterministic and probabilistic components. The deterministic model, based on a 3D SwinTransformer, captures average precipitation at mesoscale resolution and incorporates strategies to enhance performance, particularly for moderate to heavy rainfall. The probabilistic model employs conditional diffusion in latent space to account for uncertainties in residual precipitation at convective scales. During inference, ensemble members are generated by repeatedly sampling latent variables, enabling the model to represent precipitation uncertainty. Our model significantly enhances spatial resolution and forecast accuracy. Rank histogram shows that the ensemble system is reliable and unbiased. In a case study of heavy precipitation in southern China, the model outputs align more closely with observed precipitation distributions than ERA5, demonstrating superior capability in capturing extreme precipitation events. Additionally, 5-day real-time forecasts show good performance in terms of CSI scores.

[AI-17] Forward Once for All: Structural Parameterized Adaptation for Efficient Cloud-coordinated On-device Recommendation KDD2025

Link: https://arxiv.org/abs/2501.02837
Authors: Kairui Fu, Zheqi Lv, Shengyu Zhang, Fan Wu, Kun Kuang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted by KDD 2025

Abstract:In cloud-centric recommender system, regular data exchanges between user devices and cloud could potentially elevate bandwidth demands and privacy risks. On-device recommendation emerges as a viable solution by performing reranking locally to alleviate these concerns. Existing methods primarily focus on developing local adaptive parameters, while potentially neglecting the critical role of tailor-made model architecture. Insights from broader research domains suggest that varying data distributions might favor distinct architectures for better fitting. In addition, imposing a uniform model structure across heterogeneous devices may result in risking inefficacy on less capable devices or sub-optimal performance on those with sufficient capabilities. In response to these gaps, our paper introduces Forward-OFA, a novel approach for the dynamic construction of device-specific networks (both structure and parameters). Forward-OFA employs a structure controller to selectively determine whether each block needs to be assembled for a given device. However, during the training of the structure controller, these assembled heterogeneous structures are jointly optimized, where the co-adaption among blocks might encounter gradient conflicts. To mitigate this, Forward-OFA is designed to establish a structure-guided mapping of real-time behaviors to the parameters of assembled networks. Structure-related parameters and parallel components within the mapper prevent each part from receiving heterogeneous gradients from others, thus bypassing the gradient conflicts for coupled optimization. Besides, direct mapping enables Forward-OFA to achieve adaptation through only one forward pass, allowing for swift adaptation to changing interests and eliminating the requirement for on-device backpropagation. Experiments on real-world datasets demonstrate the effectiveness and efficiency of Forward-OFA.

[AI-18] Enhancing Lifelong Multi-Agent Path Finding with Cache Mechanism

Link: https://arxiv.org/abs/2501.02803
Authors: Yimin Tang, Zhenghong Yu, Yi Zheng, T. K. Satish Kumar, Jiaoyang Li, Sven Koenig
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2403.13421

Abstract:Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial in autonomous warehouse operations. Lifelong MAPF (L-MAPF), where agents are continuously reassigned new targets upon completing their current tasks, offers a more realistic approximation of real-world warehouse scenarios. While cache storage systems can enhance efficiency and reduce operational costs, existing approaches primarily rely on expectations and mathematical models, often without adequately addressing the challenges of multi-robot planning and execution. In this paper, we introduce a novel mechanism called Lifelong MAPF with Cache Mechanism (L-MAPF-CM), which integrates high-level cache storage with low-level path planning. We have involved a new type of map grid called cache for temporary item storage. Additionally, we involved a task assigner (TA) with a locking mechanism to bridge the gap between the new cache grid and L-MAPF algorithm. The TA dynamically allocates target locations to agents based on their status in various scenarios. We evaluated L-MAPF-CM using different cache replacement policies and task distributions. L-MAPF-CM has demonstrated performance improvements particularly with high cache hit rates and smooth traffic conditions.

[AI-19] Fairness Through Matching

Link: https://arxiv.org/abs/2501.02793
Authors: Kunwoong Kim, Insung Kong, Jongjin Lee, Minwoo Chae, Sangchul Park, Yongdai Kim
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published in TMLR

Abstract:Group fairness requires that different protected groups, characterized by a given sensitive attribute, receive equal outcomes overall. Typically, the level of group fairness is measured by the statistical gap between predictions from different protected groups. In this study, we reveal an implicit property of existing group fairness measures, which provides an insight into how the group-fair models behave. Then, we develop a new group-fair constraint based on this implicit property to learn group-fair models. To do so, we first introduce a notable theoretical observation: every group-fair model has an implicitly corresponding transport map between the input spaces of each protected group. Based on this observation, we introduce a new group fairness measure termed Matched Demographic Parity (MDP), which quantifies the averaged gap between predictions of two individuals (from different protected groups) matched by a given transport map. Then, we prove that any transport map can be used in MDP to learn group-fair models, and develop a novel algorithm called Fairness Through Matching (FTM), which learns a group-fair model using MDP constraint with an user-specified transport map. We specifically propose two favorable types of transport maps for MDP, based on the optimal transport theory, and discuss their advantages. Experiments reveal that FTM successfully trains group-fair models with certain desirable properties by choosing the transport map accordingly.
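
The contrast between a group-level gap and a matched gap can be illustrated with a small sketch. It follows only the abstract's verbal description ("the averaged gap between predictions of two individuals matched by a given transport map") and is not the paper's formal MDP definition. The function names, the use of an index array as a stand-in for the transport map, and the toy predictions are all assumptions.

```python
import numpy as np


def demographic_parity_gap(pred_a: np.ndarray, pred_b: np.ndarray) -> float:
    """Absolute difference between the mean predictions of two groups."""
    return float(abs(pred_a.mean() - pred_b.mean()))


def matched_gap(pred_a: np.ndarray, pred_b: np.ndarray, match: np.ndarray) -> float:
    """Average |prediction gap| over matched pairs.

    match[i] is the index in group B paired with individual i in group A.
    This index array plays the role of a (hypothetical) transport map.
    """
    return float(np.mean(np.abs(pred_a - pred_b[match])))


if __name__ == "__main__":
    pred_a = np.array([0.2, 0.8])  # toy predictions for group A
    pred_b = np.array([0.8, 0.2])  # toy predictions for group B
    print("group-level gap:", demographic_parity_gap(pred_a, pred_b))
    print("matched gap, identity matching:", matched_gap(pred_a, pred_b, np.array([0, 1])))
    print("matched gap, swapped matching:", matched_gap(pred_a, pred_b, np.array([1, 0])))
```

In this toy case the group means coincide, yet the matched gap depends on which individuals are paired, which is why the choice of transport map matters in the setting the abstract describes.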

[AI-20] Multi-Agent Path Finding under Limited Communication Range Constraint via Dynamic Leading

Link: https://arxiv.org/abs/2501.02770
Authors: Hoang-Dung Bui, Erion Plaku, Gregory J. Stein
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)

Abstract:This paper proposes a novel framework to handle a multi-agent path finding problem under a limited communication range constraint, where all agents must have a connected communication channel to the rest of the team. Many existing approaches to multi-agent path finding (e.g., leader-follower platooning) overcome computational challenges of planning in this domain by planning one agent at a time in a fixed order. However, fixed leader-follower approaches can become stuck during planning, limiting their practical utility in dense-clutter environments. To overcome this limitation, we develop dynamic leading multi-agent path finding, which allows for dynamic reselection of the leading agent during path planning whenever progress cannot be made. The experiments show the efficiency of our framework, which can handle up to 25 agents with more than 90% success-rate across five environment types where baselines routinely fail.

[AI-21] Enhancing Trustworthiness of Graph Neural Networks with Rank-Based Conformal Training AAAI2025

Link: https://arxiv.org/abs/2501.02767
Authors: Ting Wang, Zhixin Zhou, Rui Luo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, published to AAAI 2025

Abstract: Graph Neural Networks (GNNs) have been widely used in a variety of fields because of their great potential in representing graph-structured data. However, the lack of rigorous uncertainty estimates limits their application in high-stakes settings. Conformal Prediction (CP) can produce statistically guaranteed uncertainty estimates by using the classifier’s probability estimates to obtain prediction sets, which contain the true class with a user-specified probability. In this paper, we propose a rank-based CP-during-training framework for GNNs (RCP-GNN) that yields reliable uncertainty estimates and enhances the trustworthiness of GNNs in the node classification scenario. By exploiting rank information of the classifier’s outcome, prediction sets with the desired coverage rate can be efficiently constructed. The strategy of CP during training with a differentiable rank-based conformity loss function is further explored to adapt prediction sets according to network topology information. In this way, the composition of prediction sets can be guided by the goal of jointly reducing inefficiency and probability estimation errors. Extensive experiments on several real-world datasets show that our model achieves any pre-defined target marginal coverage while significantly reducing the inefficiency compared with state-of-the-art methods.
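
For readers unfamiliar with how CP turns class probabilities into prediction sets, here is a minimal sketch of standard split conformal prediction. It is the generic textbook procedure with the score "one minus the probability of the true class", not the rank-based RCP-GNN method from the paper. The function names and the toy calibration data are assumptions.

```python
import math

import numpy as np


def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float) -> float:
    """Calibrate a score threshold on held-out data (split conformal prediction).

    The nonconformity score is 1 - p(true class). The threshold is the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score, or infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return float(np.sort(scores)[k - 1])


def prediction_sets(test_probs: np.ndarray, qhat: float) -> list:
    """Return, per test sample, every class whose score is within the threshold."""
    return [np.flatnonzero(1.0 - p <= qhat).tolist() for p in test_probs]


if __name__ == "__main__":
    cal_probs = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.3, 0.3, 0.4],
                          [0.5, 0.4, 0.1]])
    cal_labels = np.array([0, 1, 2, 1])
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
    print(prediction_sets(np.array([[0.5, 0.45, 0.05]]), qhat))
```

Smaller sets at the same coverage level correspond to the lower "inefficiency" the abstract refers to. Under exchangeability of calibration and test data, this construction covers the true class with probability at least 1 - alpha.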

[AI-22] Are GNNs Effective for Multimodal Fault Diagnosis in Microservice Systems?

Link: https://arxiv.org/abs/2501.02766
Authors: Fei Gao, Ruyue Xin, Yaqiang Zhang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures, submitted to conference

Abstract:Fault diagnosis in microservice systems has increasingly embraced multimodal observation data for a holistic and multifaceted view of the system, with Graph Neural Networks (GNNs) commonly employed to model complex service dependencies. However, despite the intuitive appeal, there remains a lack of compelling justification for the adoption of GNNs, as no direct evidence supports their necessity or effectiveness. To critically evaluate the current use of GNNs, we propose DiagMLP, a simple topology-agnostic baseline as a substitute for GNNs in fault diagnosis frameworks. Through experiments on five public datasets, we surprisingly find that DiagMLP performs competitively with and even outperforms GNN-based methods in fault diagnosis tasks, indicating that the current paradigm of using GNNs to model service dependencies has not yet demonstrated a tangible contribution. We further discuss potential reasons for this observation and advocate shifting the focus from solely pursuing novel model designs to developing challenging datasets, standardizing preprocessing protocols, and critically evaluating the utility of advanced deep learning modules.

[AI-23] Enhancing Robot Route Optimization in Smart Logistics with Transformer and GNN Integration

Link: https://arxiv.org/abs/2501.02749
Authors: Hao Luo, Jianjun Wei, Shuchen Zhao, Ankai Liang, Zhongjin Xu, Ruxue Jiang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract:This research delves into advanced route optimization for robots in smart logistics, leveraging a fusion of Transformer architectures, Graph Neural Networks (GNNs), and Generative Adversarial Networks (GANs). The approach utilizes a graph-based representation encompassing geographical data, cargo allocation, and robot dynamics, addressing both spatial and resource limitations to refine route efficiency. Through extensive testing with authentic logistics datasets, the proposed method achieves notable improvements, including a 15% reduction in travel distance, a 20% boost in time efficiency, and a 10% decrease in energy consumption. These findings highlight the algorithm’s effectiveness, promoting enhanced performance in intelligent logistics operations.

[AI-24] AFed: Algorithmic Fair Federated Learning

Link: https://arxiv.org/abs/2501.02732
Authors: Huiqiang Chen, Tianqing Zhu, Wanlei Zhou, Wei Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems

Abstract:Federated Learning (FL) has gained significant attention as it facilitates collaborative machine learning among multiple clients without centralizing their data on a server. FL ensures the privacy of participating clients by locally storing their data, which creates new challenges in fairness. Traditional debiasing methods assume centralized access to sensitive information, rendering them impractical for the FL setting. Additionally, FL is more susceptible to fairness issues than centralized machine learning due to the diverse client data sources that may be associated with group information. Therefore, training a fair model in FL without access to client local data is important and challenging. This paper presents AFed, a straightforward yet effective framework for promoting group fairness in FL. The core idea is to circumvent restricted data access by learning the global data distribution. This paper proposes two approaches: AFed-G, which uses a conditional generator trained on the server side, and AFed-GAN, which improves upon AFed-G by training a conditional GAN on the client side. We augment the client data with the generated samples to help remove bias. Our theoretical analysis justifies the proposed methods, and empirical results on multiple real-world datasets demonstrate a substantial improvement in AFed over several baselines.

[AI-25] OpenGU: A Comprehensive Benchmark for Graph Unlearning

Link: https://arxiv.org/abs/2501.02728
Authors: Bowen Fan, Yuming Ai, Xunkai Li, Zhilin Guo, Rong-Hua Li, Guoren Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: under review

Abstract:Graph Machine Learning is essential for understanding and analyzing relational data. However, privacy-sensitive applications demand the ability to efficiently remove sensitive information from trained graph neural networks (GNNs), avoiding the unnecessary time and space overhead caused by retraining models from scratch. To address this issue, Graph Unlearning (GU) has emerged as a critical solution, with the potential to support dynamic graph updates in data management systems and enable scalable unlearning in distributed data systems while ensuring privacy compliance. Unlike machine unlearning in computer vision or other fields, GU faces unique difficulties due to the non-Euclidean nature of graph data and the recursive message-passing mechanism of GNNs. Additionally, the diversity of downstream tasks and the complexity of unlearning requests further amplify these challenges. Despite the proliferation of diverse GU strategies, the absence of a benchmark providing fair comparisons for GU, and the limited flexibility in combining downstream tasks and unlearning requests, have yielded inconsistencies in evaluations, hindering the development of this domain. To fill this gap, we present OpenGU, the first GU benchmark, where 16 SOTA GU algorithms and 37 multi-domain datasets are integrated, enabling various downstream tasks with 13 GNN backbones when responding to flexible unlearning requests. Based on this unified benchmark framework, we are able to provide a comprehensive and fair evaluation for GU. Through extensive experimentation, we have drawn 8 crucial conclusions about existing GU methods, while also gaining valuable insights into their limitations, shedding light on potential avenues for future research.

[AI-26] Tree-based RAG-Agent Recommendation System: A Case Study in Medical Test Data

Link: https://arxiv.org/abs/2501.02727
Authors: Yahe Yang, Chengyue Huang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Abstract:We present HiRMed (Hierarchical RAG-enhanced Medical Test Recommendation), a novel tree-structured recommendation system that leverages Retrieval-Augmented Generation (RAG) for intelligent medical test recommendations. Unlike traditional vector similarity-based approaches, our system performs medical reasoning at each tree node through a specialized RAG process. Starting from the root node with initial symptoms, the system conducts step-wise medical analysis to identify potential underlying conditions and their corresponding diagnostic requirements. At each level, instead of simple matching, our RAG-enhanced nodes analyze retrieved medical knowledge to understand symptom-disease relationships and determine the most appropriate diagnostic path. The system dynamically adjusts its recommendation strategy based on medical reasoning results, considering factors such as urgency levels and diagnostic uncertainty. Experimental results demonstrate that our approach achieves superior performance in terms of coverage rate, accuracy, and miss rate compared to conventional retrieval-based methods. This work represents a significant advance in medical test recommendation by introducing medical reasoning capabilities into the traditional tree-based retrieval structure.

[AI-27] Artificial Intelligence in Creative Industries: Advances Prior to 2025

Link: https://arxiv.org/abs/2501.02725
Authors: Nantheera Anantrasirichai, Fan Zhang, David Bull
Subjects: Artificial Intelligence (cs.AI)
Comments: This is an updated review of our previous paper (see this https URL)
Abstract:The rapid advancements in artificial intelligence (AI), particularly in generative AI and large language models (LLMs), have profoundly impacted the creative industries by enabling innovative content creation, enhancing workflows, and democratizing access to creative tools. This paper explores the significant technological shifts since our previous review in 2022, highlighting how these developments have expanded creative opportunities and efficiency. These technological advancements have enhanced the capabilities of text-to-image, text-to-video, and multimodal generation technologies. In particular, key breakthroughs in LLMs have established new benchmarks in conversational AI, while advancements in image generators have revolutionized content creation. We also discuss AI integration into post-production workflows, which has significantly accelerated and refined traditional processes. Despite these innovations, challenges remain, particularly for the media industry, due to the demands on communication traffic from creative content. We therefore include data compression and quality assessment in this paper. Furthermore, we highlight the trend toward unified AI frameworks capable of addressing multiple creative tasks and underscore the importance of human oversight to mitigate AI-generated inaccuracies. Finally, we explore AI’s future potential in the creative sector, stressing the need to navigate emerging challenges to maximize its benefits while addressing associated risks.

[AI-28] Improved Data Encoding for Emerging Computing Paradigms: From Stochastic to Hyperdimensional Computing

Link: https://arxiv.org/abs/2501.02715
Authors: Mehran Shoushtari Moghadam, Sercan Aygun, M. Hassan Najafi
Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 5 pages, 3 figures, 4 tables

Abstract:Data encoding is a fundamental step in emerging computing paradigms, particularly in stochastic computing (SC) and hyperdimensional computing (HDC), where it plays a crucial role in determining the overall system performance and hardware cost efficiency. This study presents an advanced encoding strategy that leverages a hardware-friendly class of low-discrepancy (LD) sequences, specifically powers-of-2 bases of Van der Corput (VDC) sequences (VDC-2^n), as sources for random number generation. Our approach significantly enhances the accuracy and efficiency of SC and HDC systems by addressing challenges associated with randomness. By employing LD sequences, we improve correlation properties and reduce hardware complexity. Experimental results demonstrate significant improvements in accuracy and energy savings for SC and HDC systems. Our solution provides a robust framework for integrating SC and HDC in resource-constrained environments, paving the way for efficient and scalable AI implementations.
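
The encoding idea in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the usual unipolar stochastic-computing convention, where a value x in [0, 1] becomes a bitstream whose fraction of ones equals x, and it uses a Van der Corput sequence in a power-of-2 base as the comparison source in place of a pseudo-random generator. The function names `van_der_corput` and `sc_bitstream` are hypothetical.

```python
from fractions import Fraction


def van_der_corput(index: int, base: int = 2) -> Fraction:
    """Radical inverse of `index` in `base`: mirror its digits about the radix point."""
    value, denom = Fraction(0), 1
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += Fraction(digit, denom)
    return value


def sc_bitstream(x: Fraction, length: int, base: int = 2) -> list[int]:
    """Unipolar SC encoding (assumed convention): bit i is 1 when point i lies below x."""
    return [1 if van_der_corput(i, base) < x else 0 for i in range(length)]


# First eight points of the base-2 sequence: 0, 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8.
first_points = [van_der_corput(i) for i in range(8)]

# Encode 3/8 with 8 bits and decode it again by counting ones.
stream = sc_bitstream(Fraction(3, 8), 8)
decoded = Fraction(sum(stream), len(stream))
```

With a length of 8, the value 3/8 encodes to `[1, 0, 1, 0, 1, 0, 0, 0]`, so counting ones recovers 3/8 exactly. A pseudo-random comparison source would only approximate the value at this stream length, which is the accuracy argument for low-discrepancy sequences.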

[AI-29] Horizon Generalization in Reinforcement Learning

Link: https://arxiv.org/abs/2501.02709
Authors: Vivek Myers, Catherine Ji, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization in other areas of machine learning (e.g., normalization layers make a network invariant to scale, and therefore generalize to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Thus, such a policy trained to reach nearby goals should succeed at reaching arbitrarily-distant goals. Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results and recall findings from prior work in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.

[AI-30] Multi-Aggregator Time-Warping Heterogeneous Graph Neural Network for Personalized Micro-Video Recommendation

Link: https://arxiv.org/abs/2501.02666
Authors: Jinkun Han, Wei Li, Xhipeng Cai, Yingshu Li
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Micro-video recommendation is attracting global attention and becoming a popular daily service for people of all ages. Recently, Graph Neural Networks-based micro-video recommendation has displayed performance improvement for many kinds of recommendation tasks. However, the existing works fail to fully consider the characteristics of micro-videos, such as the high timeliness of news nature micro-video recommendation and sequential interactions of frequently changed interests. In this paper, a novel Multi-aggregator Time-warping Heterogeneous Graph Neural Network (MTHGNN) is proposed for personalized news nature micro-video recommendation based on sequential sessions, where characteristics of micro-videos are comprehensively studied, users’ preference is mined via multi-aggregator, the temporal and dynamic changes of users’ preference are captured, and timeliness is considered. Through comparison with state-of-the-art methods, the experimental results validate the superiority of our MTHGNN model.

[AI-31] Representation Learning of Lab Values via Masked AutoEncoder

Link: https://arxiv.org/abs/2501.02648
Authors: David Restrepo, Chenwei Wu, Yueran Jia, Jaden K. Sun, Jack Gallifant, Catherine G. Bielick, Yugang Jia, Leo A. Celi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages main text, 8 appendix

Abstract:Accurate imputation of missing laboratory values in electronic health records (EHRs) is critical to enable robust clinical predictions and reduce biases in AI systems in healthcare. Existing methods, such as variational autoencoders (VAEs) and decision tree-based approaches such as XGBoost, struggle to model the complex temporal and contextual dependencies in EHR data, mainly in underrepresented groups. In this work, we propose Lab-MAE, a novel transformer-based masked autoencoder framework that leverages self-supervised learning for the imputation of continuous sequential lab values. Lab-MAE introduces a structured encoding scheme that jointly models laboratory test values and their corresponding timestamps, enabling explicit capture of temporal dependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms the state-of-the-art baselines such as XGBoost across multiple metrics, including root mean square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves equitable performance across demographic groups of patients, advancing fairness in clinical predictions. We further investigate the role of follow-up laboratory values as potential shortcut features, revealing Lab-MAE’s robustness in scenarios where such data is unavailable. The findings suggest that our transformer-based architecture, adapted to the characteristics of the EHR data, offers a foundation model for more accurate and fair clinical imputation models. In addition, we measure and compare the carbon footprint of Lab-MAE with the baseline XGBoost model, highlighting its environmental requirements.

[AI-32] Trust and Dependability in Blockchain AI Based MedIoT Applications: Research Challenges and Future Directions

Link: https://arxiv.org/abs/2501.02647
Authors: Ellis Solaiman, Christa Awad
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:This paper critically reviews the integration of Artificial Intelligence (AI) and blockchain technologies in the context of Medical Internet of Things (MedIoT) applications, where they collectively promise to revolutionize healthcare delivery. By examining current research, we underscore AI’s potential in advancing diagnostics and patient care, alongside blockchain’s capacity to bolster data security and patient privacy. We focus particularly on the imperative to cultivate trust and ensure reliability within these systems. Our review highlights innovative solutions for managing healthcare data and challenges such as ensuring scalability, maintaining privacy, and promoting ethical practices within the MedIoT domain. We present a vision for integrating AI-driven insights with blockchain security in healthcare, offering a comprehensive review of current research and future directions. We conclude with a set of identified research gaps and propose that addressing these is crucial for achieving the dependable, secure, and patient-centric MedIoT applications of tomorrow.

[AI-33] Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets

Link: https://arxiv.org/abs/2501.02628
Authors: Mahmoud Jahanshahi, Audris Mockus
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted in the Second International Workshop on Large Language Models for Code (LLM4Code 2025)

Abstract:A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs on curated, high-quality, and compliant datasets. The need for vast amounts of training data necessitates that such curation be automated, minimizing human intervention. We propose an automated source code autocuration technique that leverages the complete version history of open-source software projects to improve the quality of training data. This approach leverages the version history of all OSS projects to identify training data samples that have been modified or have undergone changes in at least one OSS project, and pinpoint a subset of samples that include fixes for bugs or vulnerabilities. We evaluate this method using The Stack v2 dataset, and find that 17% of the code versions in the dataset have newer versions, with 17% of those representing bug fixes, including 2.36% addressing known CVEs. The deduplicated version of Stack v2 still includes blobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the dataset were never modified after creation, suggesting they likely represent software with minimal or no use. Misidentified blob origins present an additional challenge, as they lead to the inclusion of non-permissively licensed code, raising serious compliance concerns. By addressing these issues, the training of new models can avoid perpetuating buggy code patterns or license violations. We expect our results to inspire process improvements for automated data curation, with the potential to enhance the reliability of outputs generated by AI tools. 

[AI-34] LLMs Help Alleviate the Cross-Subject Variability in Brain Signal and Language Alignment

Link: https://arxiv.org/abs/2501.02621
Authors: Yifei Liu, Hengwei Ye, Shuhang Li
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

Abstract:Decoding human activity from EEG signals has long been a popular research topic. While recent studies have increasingly shifted focus from single-subject to cross-subject analysis, few have explored the model’s ability to perform zero-shot predictions on EEG signals from previously unseen subjects. This research aims to investigate whether deep learning methods can capture subject-independent semantic information inherent in human EEG signals. Such insights are crucial for Brain-Computer Interfaces (BCI) because, on one hand, they demonstrate the model’s robustness against subject-specific temporal biases, and on the other, they significantly enhance the generalizability of downstream tasks. We employ Large Language Models (LLMs) as denoising agents to extract subject-independent semantic features from noisy EEG signals. Experimental results, including ablation studies, highlight the pivotal role of LLMs in decoding subject-independent semantic information from noisy EEG data. We hope our findings will contribute to advancing BCI research and assist both academia and industry in applying EEG signals to a broader range of applications.

[AI-35] TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

Link: https://arxiv.org/abs/2501.02600
Authors: Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques often are inadequate for LLM inference due to the fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.

[AI-36] Energy Optimization of Multi-task DNN Inference in MEC-assisted XR Devices: A Lyapunov-Guided Reinforcement Learning Approach

Link: https://arxiv.org/abs/2501.02572
Authors: Yanzan Sun, Jiacheng Qiu, Guangjin Pan, Shugong Xu, Shunqing Zhang, Xiaoyun Wang, Shuangfeng Han
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 13 pages, 7 figures. This work has been submitted to the IEEE for possible publication

Abstract:Extended reality (XR), blending virtual and real worlds, is a key application of future networks. While AI advancements enhance XR capabilities, they also impose significant computational and energy challenges on lightweight XR devices. In this paper, we developed a distributed queue model for multi-task DNN inference, addressing issues of resource competition and queue coupling. In response to the challenges posed by the high energy consumption and limited resources of XR devices, we designed a dual time-scale joint optimization strategy for model partitioning and resource allocation, formulated as a bi-level optimization problem. This strategy aims to minimize the total energy consumption of XR devices while ensuring queue stability and adhering to computational and communication resource constraints. To tackle this problem, we devised a Lyapunov-guided Proximal Policy Optimization algorithm, named LyaPPO. Numerical results demonstrate that the LyaPPO algorithm outperforms the baselines, achieving energy conservation of 24.79% to 46.14% under varying resource capacities. Specifically, the proposed algorithm reduces the energy consumption of XR devices by 24.29% to 56.62% compared to baseline algorithms.

[AI-37] AMM: Adaptive Modularized Reinforcement Model for Multi-city Traffic Signal Control

Link: https://arxiv.org/abs/2501.02548
Authors: Zherui Huang, Yicheng Liu, Chumeng Liang, Guanjie Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Traffic signal control (TSC) is an important and widely studied direction. Recently, reinforcement learning (RL) methods have been used to solve TSC problems and achieve superior performance over conventional TSC methods. However, applying RL methods to the real world is challenging due to the huge cost of experiments in real-world traffic environments. One possible solution is TSC domain adaptation, which adapts trained models to target environments and reduces the number of interactions and the training cost. However, existing TSC domain adaptation methods still face two major issues: the lack of consideration for differences across cities and the low utilization of multi-city data. To solve the aforementioned issues, we propose an approach named Adaptive Modularized Model (AMM). By modularizing TSC problems and network models, we overcome the challenge of possible changes in environmental observations. We also aggregate multi-city experience through meta-learning. We conduct extensive experiments on different cities and show that AMM can achieve excellent performance with limited interactions in target environments and outperform existing methods. We also demonstrate the feasibility and generalizability of our method.

[AI-38] A completely uniform transformer for parity

Link: https://arxiv.org/abs/2501.02535
Authors: Alexander Kozachinskiy, Tomasz Steifer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages

Abstract:We construct a 3-layer constant-dimension transformer, recognizing the parity language, where neither parameter matrices nor the positional encoding depend on the input length. This improves upon a construction of Chiang and Cholak who use a positional encoding, depending on the input length (but their construction has 2 layers).

[AI-39] Rethinking IDE Customization for Enhanced HAX: A Hyperdimensional Perspective ICSE’25

Link: https://arxiv.org/abs/2501.02491
Authors: Roham Koohestani, Maliheh Izadi
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at the 2nd Workshop on Integrated Development Environments (the IDE Workshop) co-located with ICSE '25

Abstract:As Integrated Development Environments (IDEs) increasingly integrate Artificial Intelligence, Software Engineering faces both benefits like productivity gains and challenges like mismatched user preferences. We propose Hyper-Dimensional (HD) vector spaces to model Human-Computer Interaction, focusing on user actions, stylistic preferences, and project context. These contributions aim to inspire further research on applying HD computing in IDE design.

[AI-40] The Meta-Representation Hypothesis

Link: https://arxiv.org/abs/2501.02481
Authors: Zhengpeng Xie, Jiahang Cao, Qiang Zhang, Jianxiong Zhang, Changwei Wang, Renjing Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Humans rely on high-level meta-representations to engage in abstract reasoning. In complex cognitive tasks, these meta-representations help individuals abstract general rules from experience. However, constructing such meta-representations from high-dimensional observations remains a longstanding challenge for reinforcement learning agents. For instance, a well-trained agent often fails to generalize to even minor variations of the same task, such as changes in background color, which humans can easily handle. In this paper, we build a bridge between meta-representation and generalization, showing that generalization performance benefits from meta-representation learning. We also hypothesize that deep mutual learning (DML) among agents can help them converge to meta-representations. Empirical results provide support for our theory and hypothesis. Overall, this work provides a new perspective on the generalization of deep reinforcement learning.

[AI-41] RTLMarker: Protecting LLM-Generated RTL Copyright via a Hardware Watermarking Framework

Link: https://arxiv.org/abs/2501.02446
Authors: Kun Wang, Kaiyan Chang, Mengdi Wang, Xinqi Zou, Haobo Xu, Yinhe Han, Ying Wang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Recent advances of large language models in the field of Verilog generation have raised several ethical and security concerns, such as code copyright protection and dissemination of malicious code. Researchers have employed watermarking techniques to identify codes generated by large language models. However, the existing watermarking works fail to protect RTL code copyright due to the significant syntactic and semantic differences between RTL code and software code in languages such as Python. This paper proposes a hardware watermarking framework RTLMarker that embeds watermarks into RTL code and deeper into the synthesized netlist. We propose a set of rule-based Verilog code transformations, ensuring the watermarked RTL code’s syntactic and semantic correctness. In addition, we consider an inherent tradeoff between watermark transparency and watermark effectiveness and jointly optimize them. The results demonstrate RTLMarker’s superiority over the baseline in RTL code watermarking.

[AI-42] Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations

Link: https://arxiv.org/abs/2501.02409
Authors: Zaikang Lin, Sei Chang, Aaron Zweig, Elham Azizi, David A. Knowles
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Molecular Networks (q-bio.MN); Methodology (stat.ME)

Abstract:Modern high-throughput biological datasets with thousands of perturbations provide the opportunity for large-scale discovery of causal graphs that represent the regulatory interactions between genes. Numerous methods have been proposed to infer a directed acyclic graph (DAG) corresponding to the underlying gene regulatory network (GRN) that captures causal gene relationships. However, existing models have restrictive assumptions (e.g. linearity, acyclicity), limited scalability, and/or fail to address the dynamic nature of biological processes such as cellular differentiation. We propose PerturbODE, a novel framework that incorporates biologically informative neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the causal GRN from the neural ODE’s parameters. We demonstrate PerturbODE’s efficacy in trajectory prediction and GRN inference across simulated and real over-expression datasets.

[AI-43] Enhancing Workplace Productivity and Well-being Using AI Agent

Link: https://arxiv.org/abs/2501.02368
Authors: Ravirajan K, Arvind Sundarajan
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:This paper discusses the use of Artificial Intelligence (AI) to enhance workplace productivity and employee well-being. By integrating machine learning (ML) techniques with neurobiological data, the proposed approaches ensure alignment with human ethical standards through value alignment models and Hierarchical Reinforcement Learning (HRL) for autonomous task management. The system utilizes biometric feedback from employees to generate personalized health prompts, fostering a supportive work environment that encourages physical activity. Additionally, we explore decentralized multi-agent systems for improved collaboration and decision-making frameworks that enhance transparency. Various approaches using ML techniques in conjunction with AI implementations are discussed. Together, these innovations aim to create a more productive and health-conscious workplace. These outcomes assist HR management and organizations in launching more rational career progression streams for employees and facilitating organizational transformation.

[AI-44] UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility

Link: https://arxiv.org/abs/2501.02341
Authors: Yonglin Tian, Fei Lin, Yiduo Li, Tengchao Zhang, Qiyao Zhang, Xuan Fu, Jun Huang, Xingyuan Dai, Yutong Wang, Chunwei Tian, Bai Li, Yisheng Lv, Levente Kovács, Fei-Yue Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has introduced transformative advancements across various domains, like transportation, logistics, and agriculture. Leveraging flexible perspectives and rapid maneuverability, UAVs extend traditional systems’ perception and action capabilities, garnering widespread attention from academia and industry. However, current UAV operations primarily depend on human control, with only limited autonomy in simple scenarios, and lack the intelligence and adaptability needed for more complex environments and tasks. The emergence of large language models (LLMs) demonstrates remarkable problem-solving and generalization capabilities, offering a promising pathway for advancing UAV intelligence. This paper explores the integration of LLMs and UAVs, beginning with an overview of UAV systems’ fundamental components and functionalities, followed by an overview of the state-of-the-art in LLM technology. Subsequently, it systematically highlights the multimodal data resources available for UAVs, which provide critical support for training and evaluation. Furthermore, it categorizes and analyzes key tasks and application scenarios where UAVs and LLMs converge. Finally, a reference roadmap towards agentic UAVs is proposed, aiming to enable UAVs to achieve agentic intelligence through autonomous perception, memory, reasoning, and tool utilization. Related resources are available at this https URL.

[AI-45] Evaluation of the Code Generation Capabilities of ChatGPT 4: A Comparative Analysis in 19 Programming Languages

Link: https://arxiv.org/abs/2501.02338
Authors: L. C. Gilbert
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 65 pages, in German, Bachelor’s thesis on the evaluation of ChatGPT 4’s code generation capabilities in 19 programming languages, University of Potsdam, June 2024

Abstract:This bachelor’s thesis examines the capabilities of ChatGPT 4 in code generation across 19 programming languages. The study analyzed solution rates across three difficulty levels, types of errors encountered, and code quality in terms of runtime and memory efficiency through a quantitative experiment. A total of 188 programming problems were selected from the LeetCode platform, and ChatGPT 4 was given three attempts to produce a correct solution with feedback. ChatGPT 4 successfully solved 39.67% of all tasks, with success rates decreasing significantly as problem complexity increased. Notably, the model faced considerable challenges with hard problems across all languages. ChatGPT 4 demonstrated higher competence in widely used languages, likely due to a larger volume and higher quality of training data. The solution rates also revealed a preference for languages with low abstraction levels and static typing. For popular languages, the most frequent error was “Wrong Answer,” whereas for less popular languages, compiler and runtime errors prevailed, suggesting frequent misunderstandings and confusion regarding the structural characteristics of these languages. The model exhibited above-average runtime efficiency in all programming languages, showing a tendency toward statically typed and low-abstraction languages. Memory efficiency results varied significantly, with above-average performance in 14 languages and below-average performance in five languages. A slight preference for low-abstraction languages and a leaning toward dynamically typed languages in terms of memory efficiency were observed. Future research should include a larger number of tasks, iterations, and less popular languages. Additionally, ChatGPT 4’s abilities in code interpretation and summarization, debugging, and the development of complex, practical code could be analyzed further.

[AI-46] SR-Reward: Taking The Path More Traveled

Link: https://arxiv.org/abs/2501.02330
Authors: Seyed Mahdi B. Azad, Zahra Padar, Gabriel Kalweit, Joschka Boedecker
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we propose a novel method for learning reward functions directly from offline demonstrations. Unlike traditional inverse reinforcement learning (IRL), our approach decouples the reward function from the learner’s policy, eliminating the adversarial interaction typically required between the two. This results in a more stable and efficient training process. Our reward function, called SR-Reward, leverages successor representation (SR) to encode a state based on expected future states’ visitation under the demonstration policy and transition dynamics. By utilizing the Bellman equation, SR-Reward can be learned concurrently with most reinforcement learning (RL) algorithms without altering the existing training pipeline. We also introduce a negative sampling strategy to mitigate overestimation errors by reducing rewards for out-of-distribution data, thereby enhancing robustness. This strategy inherently introduces a conservative bias into RL algorithms that employ the learned reward. We evaluate our method on the D4RL benchmark, achieving competitive results compared to offline RL algorithms with access to true rewards and imitation learning (IL) techniques like behavioral cloning. Moreover, our ablation studies on data size and quality reveal the advantages and limitations of SR-Reward as a proxy for true rewards.
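
The successor representation this abstract builds on can be sketched in a few lines of Python. The sketch is a tabular toy under stated assumptions, not the paper's method: it learns the state-only SR from demonstration episodes with a temporal-difference form of the Bellman equation, M(s, j) = 1[s = j] + γ · M(s', j). The function name `learn_sr`, the four-state toy task, and the hyperparameters are hypothetical, and the abstract does not say how the SR is turned into a scalar reward, so that step is left out.

```python
def learn_sr(episodes, n_states, gamma=0.9, lr=0.1, sweeps=500):
    """TD-learn M[s][j]: expected discounted future visits to j when starting from s."""
    M = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(sweeps):
        for episode in episodes:
            for t, s in enumerate(episode):
                # The last state of an episode has no successor to bootstrap from.
                s_next = episode[t + 1] if t + 1 < len(episode) else None
                for j in range(n_states):
                    target = 1.0 if s == j else 0.0
                    if s_next is not None:
                        target += gamma * M[s_next][j]
                    M[s][j] += lr * (target - M[s][j])
    return M


# One demonstration that walks 0 -> 1 -> 2; state 3 is never visited.
demos = [[0, 1, 2]]
sr = learn_sr(demos, n_states=4)
```

After training, `sr[0]` approaches `[1, 0.9, 0.81, 0]`, the discounted visitation along the demonstrated path, while the row for the unvisited state 3 stays at zero. That gap between demonstrated and out-of-distribution states is the kind of signal a visitation-based reward can exploit.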

[AI-47] DiffGraph: Heterogeneous Graph Diffusion Model WSDM’2025

Link: https://arxiv.org/abs/2501.02313
Authors: Zongwei Li, Lianghao Xia, Hua Hua, Shijie Zhang, Shuangyang Wang, Chao Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: This paper is accepted by WSDM’2025

Abstract:Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods’ inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: this https URL.

[AI-48] Interpretable Load Forecasting via Representation Learning of Geo-distributed Meteorological Factors

Link: https://arxiv.org/abs/2501.02241
Authors: Yangze Zhou, Guoxin Lin, Gonghao Zhang, Yi Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Meteorological factors (MF) are crucial in day-ahead load forecasting as they significantly influence the electricity consumption behaviors of consumers. Numerous studies have incorporated MF into the load forecasting model to achieve higher accuracy. Selecting MF from one representative location or the averaged MF as the inputs of the forecasting model is a common practice. However, the difference in MF collected in various locations within a region may be significant, which poses a challenge in selecting the appropriate MF from numerous locations. A representation learning framework is proposed to extract geo-distributed MF while considering their spatial relationships. In addition, this paper employs the Shapley value in the graph-based model to reveal connections between MF collected in different locations and loads. To reduce the computational complexity of calculating the Shapley value, an acceleration method is adopted based on Monte Carlo sampling and weighted linear regression. Experiments on two real-world datasets demonstrate that the proposed method improves the day-ahead forecasting accuracy, especially in extreme scenarios such as the “accumulation temperature effect” in summer and “sudden temperature change” in winter. We also find a significant correlation between the importance of MF in different locations and the corresponding area’s GDP and mainstay industry.
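
The Monte Carlo part of the Shapley acceleration mentioned in this abstract can be sketched as follows. This is a generic permutation-sampling estimator in Python, not the paper's procedure, and it omits the weighted-linear-regression component. The three "stations" A, B, C and the toy value function `forecast_gain` are invented for illustration.

```python
import itertools
import math
import random


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def monte_carlo_shapley(players, value, n_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous
            previous = current
    return {p: total / n_permutations for p, total in totals.items()}


def forecast_gain(coalition):
    """Hypothetical accuracy gain from using the weather stations in `coalition`."""
    gain = 3.0 * ("A" in coalition) + 1.0 * ("B" in coalition)
    if "A" in coalition and "C" in coalition:
        gain += 2.0  # C only helps when combined with A
    return gain


stations = ["A", "B", "C"]
exact = exact_shapley(stations, forecast_gain)
estimate = monte_carlo_shapley(stations, forecast_gain)
```

For this toy game the exact values are 4 for A and 1 each for B and C. The sampled estimate lands close to them while evaluating only a fixed number of orderings, which matters once the number of locations makes full enumeration infeasible.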

[AI-49] CORD: Generalizable Cooperation via Role Diversity

Link: https://arxiv.org/abs/2501.02221
Authors: Kanefumi Matsuyama, Kefan Su, Jiangxing Wang, Deheng Ye, Zongqing Lu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Cooperative multi-agent reinforcement learning (MARL) aims to develop agents that can collaborate effectively. However, most cooperative MARL methods overfit training agents, making learned policies not generalize well to unseen collaborators, which is a critical issue for real-world deployment. Some methods attempt to address the generalization problem but require prior knowledge or predefined policies of new teammates, limiting real-world applications. To this end, we propose a hierarchical MARL approach to enable generalizable cooperation via role diversity, namely CORD. CORD’s high-level controller assigns roles to low-level agents by maximizing the role entropy with constraints. We show this constrained objective can be decomposed into causal influence in role that enables reasonable role assignment, and role heterogeneity that yields coherent, non-redundant role clusters. Evaluated on a variety of cooperative multi-agent tasks, CORD achieves better performance than baselines, especially in generalization tests. Ablation studies further demonstrate the efficacy of the constrained objective in generalizable cooperation.

[AI-50] Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised Learning

Link: https://arxiv.org/abs/2501.02219
Authors: Zhongwei Wang, Tong Wu, Zhiyong Chen, Liang Qian, Yin Xu, Meixia Tao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Accepted by IEEE WCNC 2025

Abstract:Federated semi-supervised learning (FSSL) is primarily challenged by two factors: the scarcity of labeled data across clients and the non-independent and identically distributed (non-IID) nature of data among clients. In this paper, we propose a novel approach, diffusion model-based data synthesis aided FSSL (DDSA-FSSL), which utilizes a diffusion model (DM) to generate synthetic data, bridging the gap between heterogeneous local data distributions and the global data distribution. In DDSA-FSSL, clients address the challenge of the scarcity of labeled data by employing a federated learning-trained classifier to perform pseudo labeling for unlabeled data. The DM is then collaboratively trained using both labeled and precision-optimized pseudo-labeled data, enabling clients to generate synthetic samples for classes that are absent in their labeled datasets. This process allows clients to generate more comprehensive synthetic datasets aligned with the global distribution. Extensive experiments conducted on multiple datasets and varying non-IID distributions demonstrate the effectiveness of DDSA-FSSL, e.g., it improves accuracy from 38.46% to 52.14% on CIFAR-10 datasets with 10% labeled data.

[AI-51] Can ChatGPT implement finite element models for geotechnical engineering applications?

Link: https://arxiv.org/abs/2501.02199
Authors: Taegu Kim, Tae Sup Yun, Hyoung Suk Suh
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)

Abstract:This study assesses the capability of ChatGPT to generate finite element code for geotechnical engineering applications from a set of prompts. We tested three different initial boundary value problems using a hydro-mechanically coupled formulation for unsaturated soils, including the dissipation of excess pore water pressure through fluid mass diffusion in one-dimensional space, time-dependent differential settlement of a strip footing, and gravity-driven seepage. For each case, initial prompting involved providing ChatGPT with necessary information for finite element implementation, such as balance and constitutive equations, problem geometry, initial and boundary conditions, material properties, and spatiotemporal discretization and solution strategies. Any errors and unexpected results were further addressed through prompt augmentation processes until the ChatGPT-generated finite element code passed the verification/validation test. Our results demonstrate that ChatGPT required minimal code revisions when using the FEniCS finite element library, owing to its high-level interfaces that enable efficient programming. In contrast, the MATLAB code generated by ChatGPT necessitated extensive prompt augmentations and/or direct human intervention, as it involves a significant amount of low-level programming required for finite element analysis, such as constructing shape functions or assembling global matrices. Given that prompt engineering for this task requires an understanding of the mathematical formulation and numerical techniques, this study suggests that while a large language model may not yet replace human programmers, it can greatly assist in the implementation of numerical models.

[AI-52] AdaMixup: A Dynamic Defense Framework for Membership Inference Attack Mitigation

Link: https://arxiv.org/abs/2501.02182
Authors: Ying Chen, Jiajing Chen, Yijie Weng, ChiaHua Chang, Dezhi Yu, Guanbiao Lin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures

Abstract:Membership inference attacks have emerged as a significant privacy concern in the training of deep learning models, where attackers can infer whether a data point was part of the training set based on the model’s outputs. To address this challenge, we propose a novel defense mechanism, AdaMixup. AdaMixup employs adaptive mixup techniques to enhance the model’s robustness against membership inference attacks by dynamically adjusting the mixup strategy during training. This method not only improves the model’s privacy protection but also maintains high performance. Experimental results across multiple datasets demonstrate that AdaMixup significantly reduces the risk of membership inference attacks while achieving a favorable trade-off between defensive efficiency and model accuracy. This research provides an effective solution for data privacy protection and lays the groundwork for future advancements in mixup training methods.
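
For readers unfamiliar with mixup, the sketch below shows the standard operation (a convex combination of two samples and their one-hot labels, with the mixing weight drawn from a Beta distribution). The abstract does not say how AdaMixup adjusts its strategy during training, so `linear_alpha_schedule` is purely a hypothetical placeholder for "some schedule", not the authors' rule; all function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def sample_lambda(alpha: float, rng: random.Random) -> float:
    """Draw the mixing weight from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x1: Sequence[float], y1: Sequence[float],
               x2: Sequence[float], y2: Sequence[float],
               lam: float) -> Tuple[List[float], List[float]]:
    """Blend two samples and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def linear_alpha_schedule(epoch: int, total_epochs: int,
                          alpha_start: float = 0.2,
                          alpha_end: float = 1.0) -> float:
    """Hypothetical schedule: move alpha linearly from start to end.
    This only illustrates 'adjusting the strategy during training'."""
    if total_epochs <= 1:
        return alpha_end
    frac = epoch / (total_epochs - 1)
    return alpha_start + frac * (alpha_end - alpha_start)
```

A training loop would call `linear_alpha_schedule` once per epoch, draw `lam` with `sample_lambda`, and feed the blended pair to the model in place of the raw samples.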

[AI-53] The Integration of Blockchain and Artificial Intelligence for Secure Healthcare Systems

Link: https://arxiv.org/abs/2501.02169
Authors: Umar Safdar, Simon Gabrael
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures

Abstract:Verisign reported a 125 percent increase in data breaches within the healthcare sector in the United States during 2022, with 18.2 million patient records being impacted. Growing healthcare data volumes and diversification mean that medical information is becoming more valuable. Many Health Centers use various technologies to ease the classification, storage, and exchange of big data. This use can also make the health data of the users at risk and vulnerable. AI and blockchain are among the leading technologies at hand. With AI, data-driven operations and big data efficiency have been improved with respect to traditional techniques. Due to its potential to bring about improvements in health services and lower medical costs, this AI technology is regularly used in healthcare. Blockchain helps protect transactions on sharing information and private privacy as long as the exchange of knowledge is that of the standard. The objective of this analysis is to investigate the research and unique contributions since 2008 regarding blockchain-integrated AI and healthcare systems. The work sheds light on applied AI-based healthcare schemes with machine, ballistic, and acrylic learning and disparate blockchain structures. The use of technology in order to ensure patient data security and manage medical information effectively in healthcare settings offers a highly successful position for both healthcare providers and patients. From 2018 to 2021, the best year was 2021 to grow, enhancing everything to examine the download of the device and the counting of Google Academies, for which the joining perspective was borrowed; local research experts were asked, identified articles in recent years, and read reviews of large research grants.

[AI-54] The Race to Efficiency: A New Perspective on AI Scaling Laws

Link: https://arxiv.org/abs/2501.02156
Authors: Chien-Ping Lu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 27 pages, 3 figures. First draft

Abstract:As large-scale AI models expand, training becomes costlier and sustaining progress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020), Hoffmann et al. (2022)) predict training loss from a static compute budget yet neglect time and efficiency, prompting the question: how can we balance ballooning GPU fleets with rapidly improving hardware and algorithms? We introduce the relative-loss equation, a time- and efficiency-aware framework that extends classical AI scaling laws. Our model shows that, without ongoing efficiency gains, advanced performance could demand millennia of training or unrealistically large GPU fleets. However, near-exponential progress remains achievable if the “efficiency-doubling rate” parallels Moore’s Law. By formalizing this race to efficiency, we offer a quantitative roadmap for balancing front-loaded GPU investments with incremental improvements across the AI stack. Empirical trends suggest that sustained efficiency gains can push AI scaling well into the coming decade, providing a new perspective on the diminishing returns inherent in classical scaling.

[AI-55] Attribute-Based Robotic Grasping with Data-Efficient Adaptation

Link: https://arxiv.org/abs/2501.02149
Authors: Yang Yang, Houjian Yu, Xibai Lou, Yuanhao Liu, Changhyun Choi
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL. arXiv admin note: substantial text overlap with arXiv:2104.02271

Abstract:Robotic grasping is one of the most fundamental robotic manipulation tasks and has been the subject of extensive research. However, swiftly teaching a robot to grasp a novel target object in clutter remains challenging. This paper attempts to address the challenge by leveraging object attributes that facilitate recognition, grasping, and rapid adaptation to new domains. In this work, we present an end-to-end encoder-decoder network to learn attribute-based robotic grasping with data-efficient adaptation capability. We first pre-train the end-to-end model with a variety of basic objects to learn generic attribute representation for recognition and grasping. Our approach fuses the embeddings of a workspace image and a query text using a gated-attention mechanism and learns to predict instance grasping affordances. To train the joint embedding space of visual and textual attributes, the robot utilizes object persistence before and after grasping. Our model is self-supervised in a simulation that only uses basic objects of various colors and shapes but generalizes to novel objects in new environments. To further facilitate generalization, we propose two adaptation methods, adversarial adaptation and one-grasp adaptation. Adversarial adaptation regulates the image encoder using augmented data of unlabeled images, whereas one-grasp adaptation updates the overall end-to-end model using augmented data from one grasp trial. Both adaptation methods are data-efficient and considerably improve instance grasping performance. Experimental results in both simulation and the real world demonstrate that our approach achieves over 81% instance grasping success rate on unknown objects, which outperforms several baselines by large margins.

[AI-56] Effective LLM-Driven Code Generation with Pythoness

Link: https://arxiv.org/abs/2501.02138
Authors: Kyla H. Levin, Kyle Gwilt, Emery D. Berger, Stephen N. Freund
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 5 pages

Abstract:The advent of large language models (LLMs) has paved the way for a new era of programming tools with both significant capabilities and risks, as the generated code lacks guarantees of correctness and reliability. Developers using LLMs currently face the difficult task of optimizing, integrating, and maintaining code generated by AI. We propose an embedded domain-specific language (DSL), Pythoness, to address those challenges. In Pythoness, developers program with LLMs at a higher level of abstraction. Rather than interacting directly with generated code, developers using Pythoness operate at the level of behavioral specifications when writing functions, classes, or an entire program. These specifications can take the form of unit tests and property-based tests, which may be expressed formally or in natural language. Guided by these specifications, Pythoness generates code that both passes the tests and can be continuously checked during execution. We posit that the Pythoness approach lets developers harness the full potential of LLMs for code generation while substantially mitigating their inherent risks. We describe our current prototype implementation of Pythoness and demonstrate that it can successfully leverage a combination of tests and code generation to yield higher quality code than specifications alone.

[AI-57] A hybrid marketplace of ideas

Link: https://arxiv.org/abs/2501.02132
Authors: Tomer Jordi Chaffer, Dontrail Cotlage, Justin Goldston
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

Abstract:The convergence of humans and artificial intelligence systems introduces new dynamics into the cultural and intellectual landscape. Complementing emerging cultural evolution concepts such as machine culture, AI agents represent a significant technosociological development, particularly within the anthropological study of Web3 as a community focused on decentralization through blockchain. Despite their growing presence, the cultural significance of AI agents remains largely unexplored in academic literature. We argue that, within the context of Web3, these agents challenge traditional notions of participation and influence in public discourse, creating a hybrid marketplace of ideas, a conceptual space where human and AI generated ideas coexist and compete for attention. We examine the current state of AI agents in idea generation, propagation, and engagement, positioning their role as cultural agents through the lens of memetics and encouraging further inquiry into their cultural and societal impact. Additionally, we address the implications of this paradigm for privacy, intellectual property, and governance, highlighting the societal and legal challenges of integrating AI agents into the hybrid marketplace of ideas.

[AI-58] Online Detection of Water Contamination Under Concept Drift

Link: https://arxiv.org/abs/2501.02107
Authors: Jin Li, Kleanthis Malialis, Stelios G. Vrachimis, Marios M. Polycarpou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Water Distribution Networks (WDNs) are vital infrastructures, and contamination poses serious public health risks. Harmful substances can interact with disinfectants like chlorine, making chlorine monitoring essential for detecting contaminants. However, chlorine sensors often become unreliable and require frequent calibration. This study introduces the Dual-Threshold Anomaly and Drift Detection (ADDD) method, an unsupervised approach combining a dual-threshold drift detection mechanism with an LSTM-based Variational Autoencoder (LSTM-VAE) for real-time contamination detection. Tested on two realistic WDNs, ADDD effectively identifies anomalies with sensor offsets as concept drift, and outperforms other methods. A proposed decentralized architecture enables accurate contamination detection and localization by deploying ADDD on selected nodes.

[AI-59] On the Statistical Complexity for Offline and Low-Adaptive Reinforcement Learning with Structures

Link: https://arxiv.org/abs/2501.02089
Authors: Ming Yin, Mengdi Wang, Yu-Xiang Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Review Article

Abstract:This article reviews the recent advances on the statistical foundation of reinforcement learning (RL) in the offline and low-adaptive settings. We will start by arguing why offline RL is the appropriate model for almost any real-life ML problem, even if it has nothing to do with the recent AI breakthroughs that use RL. Then we will zoom into two fundamental problems of offline RL: offline policy evaluation (OPE) and offline policy learning (OPL). It may be surprising to people that tight bounds for these problems were not known even for tabular and linear cases until recently. We delineate the differences between worst-case minimax bounds and instance-dependent bounds. We also cover key algorithmic ideas and proof techniques behind near-optimal instance-dependent methods in OPE and OPL. Finally, we discuss the limitations of offline RL and review a burgeoning problem of low-adaptive exploration which addresses these limitations by providing a sweet middle ground between offline and online RL.

[AI-60] Architecture for Trajectory-Based Fishing Ship Classification with AIS Data

Link: https://arxiv.org/abs/2501.02038
Authors: David Sánchez Pedroche, Daniel Amigo, Jesús García, Jose M. Molina
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Sensors 2020

Abstract:This paper proposes a data preparation process for managing real-world kinematic data and detecting fishing vessels. The solution is a binary classification that classifies ship trajectories into either fishing or non-fishing ships. The data used are characterized by the typical problems found in classic data mining applications using real-world data, such as noise and inconsistencies. The two classes are also clearly unbalanced in the data, a problem which is addressed using algorithms that resample the instances. For classification, a series of features are extracted from spatiotemporal data that represent the trajectories of the ships, available from sequences of Automatic Identification System (AIS) reports. These features are proposed for the modelling of ship behavior but, because they do not contain context-related information, the classification can be applied in other scenarios. Experimentation shows that the proposed data preparation process is useful for the presented classification problem. In addition, positive results are obtained using minimal information.
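
The abstract says the class imbalance is handled by algorithms that resample the instances, without naming them. One of the simplest such techniques is random oversampling of the smaller classes, sketched below as an assumption about what "resampling" could look like; the function name `random_oversample` is hypothetical and the paper may use a different algorithm.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def random_oversample(samples: Sequence, labels: Sequence[Hashable],
                      seed: int = 0) -> Tuple[List, List]:
    """Duplicate randomly chosen minority-class samples until every
    class has as many instances as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

Oversampling should be applied to the training split only, so that duplicated trajectories do not leak into evaluation data.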

[AI-61] Deep Clustering via Community Detection

Link: https://arxiv.org/abs/2501.02036
Authors: Tianyu Cheng, Qun Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 10 pages, 10 figures

Abstract:Deep clustering is an essential task in modern artificial intelligence, aiming to partition a set of data samples into a given number of homogeneous groups (i.e., clusters). Even though many Deep Neural Network (DNN) backbones and clustering strategies have been proposed for the task, achieving increasingly improved performance, deep clustering remains very challenging due to the lack of accurately labeled samples. In this paper, we propose a novel approach of deep clustering via community detection. It initializes clustering by detecting many communities, and then gradually expands clusters by community merging. Compared with the existing clustering strategies, community detection factors in the new perspective of cluster network analysis. As a result, it has the inherent benefit of high pseudo-label purity, which is critical to the performance of self-supervision. We have validated the efficacy of the proposed approach on benchmark image datasets. Our extensive experiments have shown that it can effectively improve the SOTA performance. Our ablation study also demonstrates that the new network perspective can effectively improve community pseudo-label purity, resulting in improved clustering performance.

[AI-62] Dynamic Feature Fusion: Combining Global Graph Structures and Local Semantics for Blockchain Fraud Detection

Link: https://arxiv.org/abs/2501.02032
Authors: Zhang Sheng, Liangliang Song, Yanbin Wang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:The advent of blockchain technology has facilitated the widespread adoption of smart contracts in the financial sector. However, current fraud detection methodologies exhibit limitations in capturing both global structural patterns within transaction networks and local semantic relationships embedded in transaction data. Most existing models focus on either structural information or semantic features individually, leading to suboptimal performance in detecting complex fraud patterns. In this paper, we propose a dynamic feature fusion model that combines graph-based representation learning and semantic feature extraction for blockchain fraud detection. Specifically, we construct global graph representations to model account relationships and extract local contextual features from transaction data. A dynamic multimodal fusion mechanism is introduced to adaptively integrate these features, enabling the model to capture both structural and semantic fraud patterns effectively. We further develop a comprehensive data processing pipeline, including graph construction, temporal feature enhancement, and text preprocessing. Experimental results on large-scale real-world blockchain datasets demonstrate that our method outperforms existing benchmarks across accuracy, F1 score, and recall metrics. This work highlights the importance of integrating structural relationships and semantic similarities for robust fraud detection and offers a scalable solution for securing blockchain systems.

[AI-63] Detecting Music Performance Errors with Transformers AAAI2025

Link: https://arxiv.org/abs/2501.02030
Authors: Benjamin Shiue-Hal Chou, Purvish Jajal, Nicholas John Eliopoulos, Tim Nadolsky, Cheng-Yun Yang, Nikita Ravi, James C. Davis, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: AAAI 2025

Abstract:Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. There are two limitations in existing tools for music error detection: (1) Existing approaches rely on automatic alignment; therefore, they are prone to errors caused by small deviations between alignment targets; (2) There is a lack of sufficient data to train music error detection models, resulting in over-reliance on heuristics. To address (1), we propose a novel transformer model, Polytune, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Additionally, compared with existing transcription methods repurposed for music error detection, our model can handle multiple instruments. Our source code and datasets are available at this https URL.

[AI-64] Weakly Supervised Learning on Large Graphs

Link: https://arxiv.org/abs/2501.02021
Authors: Aditya Prakash
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph classification plays a pivotal role in various domains, including pathology, where images can be represented as graphs. In this domain, nodes might represent individual nuclei, and edges capture the spatial or functional relationships between them. Often, the overall label of the graph, such as a cancer type or disease state, is determined by patterns within smaller, localized regions of the image. This work introduces a weakly-supervised graph classification framework leveraging two subgraph extraction techniques: (1) a sliding-window approach and (2) a BFS-based approach. Subgraphs are processed using a Graph Attention Network (GAT), which employs attention mechanisms to identify the most informative subgraphs for classification. Weak supervision is achieved by propagating graph-level labels to subgraphs, eliminating the need for detailed subgraph annotations.
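
As a rough illustration of BFS-based subgraph extraction, the sketch below grows a subgraph outward from a seed node until a node budget is reached and then keeps the edges among the selected nodes. The adjacency-dict representation, the `max_nodes` budget, and the function name `bfs_subgraph` are assumptions made for this example; the abstract does not describe the paper's exact procedure.

```python
from collections import deque
from typing import Dict, Hashable, List, Tuple


def bfs_subgraph(adj: Dict[Hashable, List[Hashable]],
                 seed: Hashable,
                 max_nodes: int) -> Tuple[List[Hashable], List[Tuple]]:
    """Collect up to max_nodes nodes in breadth-first order from seed,
    then return them with the induced (undirected) edges."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for neighbor in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) == max_nodes:
                    break
    # Keep only edges whose endpoints were both selected.
    edges = sorted({(min(u, v), max(u, v))
                    for u in seen for v in adj.get(u, []) if v in seen})
    return visited, edges
```

Under weak supervision as described in the abstract, each extracted subgraph would inherit the label of the graph it came from.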

[AI-65] Benchmarking Constraint-Based Bayesian Structure Learning Algorithms: Role of Network Topology

Link: https://arxiv.org/abs/2501.02019
Authors: Radha Nagarajan, Marco Scutari
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
Comments: 8 pages, 4 figures

Abstract:Modeling the associations between real world entities from their multivariate cross-sectional profiles can provide cues into the concerted working of these entities as a system. Several techniques have been proposed for deciphering these associations including constraint-based Bayesian structure learning (BSL) algorithms that model them as directed acyclic graphs (DAGs). Benchmarking of these algorithms has typically focused on assessing the variation in performance measures such as sensitivity as a function of the dimensionality, represented by the number of nodes in the DAG, and sample size. The present study elucidates the importance of network topology in benchmarking exercises. More specifically, it investigates variations in sensitivity across distinct network topologies while constraining the nodes, edges, and sample size to be identical, eliminating these as potential confounders. Sensitivity of three popular constraint-based BSL algorithms (Peter-Clark, Grow-Shrink, Incremental Association Markov Blanket) in learning the network structure from multivariate cross-sectional profiles sampled from network models with sub-linear, linear, and super-linear DAG topologies generated using preferential attachment is investigated. Results across linear and nonlinear models revealed a statistically significant ($\alpha = 0.05$) decrease in sensitivity estimates from sub-linear to super-linear topology constitutively across the three algorithms. These results are demonstrated on networks with nodes ($N_{nods} = 48, 64$), noise strengths ($\sigma = 3, 6$), and sample size ($N = 2^{10}$). The findings elucidate the importance of accommodating the network topology in constraint-based BSL benchmarking exercises.

[AI-66] ST-HCSS: Deep Spatio-Temporal Hypergraph Convolutional Neural Network for Soft Sensing ICASSP2025

Link: https://arxiv.org/abs/2501.02016
Authors: Hwa Hui Tew, Fan Ding, Gaoxuan Li, Junn Yong Loo, Chee-Ming Ting, Ze Yang Ding, Chee Pin Tan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Accepted at the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Abstract:Higher-order sensor networks are more accurate in characterizing the nonlinear dynamics of sensory time-series data in modern industrial settings by allowing multi-node connections beyond simple pairwise graph edges. In light of this, we propose a deep spatio-temporal hypergraph convolutional neural network for soft sensing (ST-HCSS). In particular, our proposed framework is able to construct and leverage a higher-order graph (hypergraph) to model the complex multi-interactions between sensor nodes in the absence of prior structural knowledge. To capture rich spatio-temporal relationships underlying sensor data, our proposed ST-HCSS incorporates stacked gated temporal and hypergraph convolution layers to effectively aggregate and update hypergraph information across time and nodes. Our results validate the superiority of ST-HCSS compared to existing state-of-the-art soft sensors, and demonstrate that the learned hypergraph feature representations align well with the sensor data correlations. The code is available at this https URL

[AI-67] KANS: Knowledge Discovery Graph Attention Network for Soft Sensing in Multivariate Industrial Processes

Link: https://arxiv.org/abs/2501.02015
Authors: Hwa Hui Tew, Gaoxuan Li, Fan Ding, Xuewen Luo, Junn Yong Loo, Chee-Ming Ting, Ze Yang Ding, Chee Pin Tan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: Accepted at IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2024)

Abstract:Soft sensing of hard-to-measure variables is often crucial in industrial processes. Current practices rely heavily on conventional modeling techniques that show success in improving accuracy. However, they overlook the non-linear nature, dynamic characteristics, and non-Euclidean dependencies between complex process variables. To tackle these challenges, we present a framework known as a Knowledge discovery graph Attention Network for effective Soft sensing (KANS). Unlike the existing deep learning soft sensor models, KANS can discover the intrinsic correlations and irregular relationships between the multivariate industrial processes without a predefined topology. First, an unsupervised graph structure learning method is introduced, incorporating the cosine similarity between different sensor embeddings to capture the correlations between sensors. Next, we present a graph attention-based representation learning that can compute the multivariate data in parallel to enhance the model in learning complex sensor nodes and edges. To fully explore KANS, knowledge discovery analysis has also been conducted to demonstrate the interpretability of the model. Experimental results demonstrate that KANS significantly outperforms all the baselines and state-of-the-art methods in soft sensing performance. Furthermore, the analysis shows that KANS can find sensors closely related to different process variables without domain knowledge, significantly improving soft sensing accuracy.
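
The graph structure learning step is described as using cosine similarity between sensor embeddings. A minimal sketch of that idea, assuming each sensor is linked to its `k` most similar peers, is given below; the top-k rule and the function names are assumptions for illustration, since the abstract does not state how similarities are turned into edges.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topk_cosine_graph(embeddings: Sequence[Sequence[float]],
                      k: int) -> Dict[int, List[int]]:
    """For each sensor, keep directed edges to its k most similar sensors."""
    n = len(embeddings)
    graph = {}
    for i in range(n):
        scored = [(cosine(embeddings[i], embeddings[j]), j)
                  for j in range(n) if j != i]
        # Highest similarity first; ties broken by sensor index.
        scored.sort(key=lambda t: (-t[0], t[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph
```

In a learned model the embeddings would be trainable parameters, so the graph changes as training proceeds; here they are fixed vectors to keep the example self-contained.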

[AI-68] Machine Learning-Based Differential Diagnosis of Parkinsons Disease Using Kinematic Feature Extraction and Selection

Link: https://arxiv.org/abs/2501.02014
Authors: Masahiro Matsumoto, Abu Saleh Musa Miah, Nobuyoshi Asai, Jungpil Shin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Parkinson’s disease (PD), the second most common neurodegenerative disorder, is characterized by dopaminergic neuron loss and the accumulation of abnormal synuclein. PD presents both motor and non-motor symptoms that progressively impair daily functioning. The severity of these symptoms is typically assessed using the MDS-UPDRS rating scale, which is subjective and dependent on the physician’s experience. Additionally, PD shares symptoms with other neurodegenerative diseases, such as progressive supranuclear palsy (PSP) and multiple system atrophy (MSA), complicating accurate diagnosis. To address these diagnostic challenges, we propose a machine learning-based system for differential diagnosis of PD, PSP, MSA, and healthy controls (HC). This system utilizes a kinematic feature-based hierarchical feature extraction and selection approach. Initially, 18 kinematic features are extracted, including two newly proposed features: Thumb-to-index vector velocity and acceleration, which provide insights into motor control patterns. In addition, 41 statistical features were extracted here from each kinematic feature, including some new approaches such as Average Absolute Change, Rhythm, Amplitude, Frequency, Standard Deviation of Frequency, and Slope. Feature selection is performed using One-way ANOVA to rank features, followed by Sequential Forward Floating Selection (SFFS) to identify the most relevant ones, aiming to reduce the computational complexity. The final feature set is used for classification, achieving a classification accuracy of 66.67% for each dataset and 88.89% for each patient, with particularly high performance for the MSA and HC groups using the SVM algorithm. This system shows potential as a rapid and accurate diagnostic tool in clinical practice, though further data collection and refinement are needed to enhance its reliability.
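
The feature selection pipeline first ranks features with one-way ANOVA. The sketch below computes the one-way ANOVA F statistic per feature (between-class mean square divided by within-class mean square) and sorts features by it. It covers only the ranking step; the subsequent SFFS search is not shown, and the function names `f_oneway_stat` and `rank_features` are assumptions for this example.

```python
from typing import Hashable, List, Sequence


def f_oneway_stat(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F statistic for a list of per-class value groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ssw == 0.0:
        return float("inf") if ssb > 0.0 else 0.0
    return (ssb / (k - 1)) / (ssw / (n - k))


def rank_features(rows: Sequence[Sequence[float]],
                  labels: Sequence[Hashable]) -> List[int]:
    """Return feature indices sorted from most to least discriminative."""
    scores = []
    for j in range(len(rows[0])):
        by_class = {}
        for row, label in zip(rows, labels):
            by_class.setdefault(label, []).append(row[j])
        scores.append((f_oneway_stat(list(by_class.values())), j))
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores]
```

A wrapper method such as SFFS would then typically search over the top-ranked features using classifier accuracy as its criterion, which is far cheaper than searching the full feature set.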

[AI-69] TART: Token-based Architecture Transformer for Neural Network Performance Prediction

Link: https://arxiv.org/abs/2501.02007
Authors: Yannis Y. He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In the realm of neural architecture design, achieving high performance is largely reliant on the manual expertise of researchers. Despite the emergence of Neural Architecture Search (NAS) as a promising technique for automating this process, current NAS methods still require human input to expand the search space and cannot generate new architectures. This paper explores the potential of Transformers in comprehending neural architectures and their performance, with the objective of establishing the foundation for utilizing Transformers to generate novel networks. We propose the Token-based Architecture Transformer (TART), which predicts neural network performance without the need to train candidate networks. TART attains state-of-the-art performance on the DeepNets-1M dataset for performance prediction tasks without edge information, indicating the potential of Transformers to aid in discovering novel and high-performing neural architectures.

[AI-70] Multi-Task Semantic Communication With Graph Attention-Based Feature Correlation Extraction

Link: https://arxiv.org/abs/2501.02006
Authors: Xi Yu,Tiejun Lv,Weicai Li,Wei Ni,Dusit Niyato,Ekram Hossain
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 11 figures, accepted by IEEE TMC

Abstract:Multi-task semantic communication can serve multiple learning tasks using a shared encoder model. Existing models have overlooked the intricate relationships between features extracted during an encoding process of tasks. This paper presents a new graph attention inter-block (GAI) module to the encoder/transmitter of a multi-task semantic communication system, which enriches the features for multiple tasks by embedding the intermediate outputs of encoding in the features, compared to the existing techniques. The key idea is that we interpret the outputs of the intermediate feature extraction blocks of the encoder as the nodes of a graph to capture the correlations of the intermediate features. Another important aspect is that we refine the node representation using a graph attention mechanism to extract the correlations and a multi-layer perceptron network to associate the node representations with different tasks. Consequently, the intermediate features are weighted and embedded into the features transmitted for executing multiple tasks at the receiver. Experiments demonstrate that the proposed model surpasses the most competitive and publicly available models by 11.4% on the CityScapes 2Task dataset and outperforms the established state-of-the-art by 3.97% on the NYU V2 3Task dataset, respectively, when the bandwidth ratio of the communication channel (i.e., compression level for transmission over the channel) is as constrained as 1/12.

[AI-71] General Information Metrics for Improving AI Model Training Efficiency

Link: https://arxiv.org/abs/2501.02004
Authors: Jianfeng Xu,Congcong Liu,Xiaoying Tan,Xiaojie Zhu,Anpeng Wu,Huan Wan,Weijun Kong,Chun Li,Hu Xu,Kun Kuang,Fei Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:To address the growing size of AI model training data and the lack of a universal data selection methodology, two factors that significantly drive up training costs, this paper presents the General Information Metrics Evaluation (GIME) method. GIME leverages general information metrics from Objective Information Theory (OIT), including volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch, to optimize dataset selection for training purposes. Comprehensive experiments conducted across diverse domains, such as CTR Prediction, Civil Case Prediction, and Weather Forecasting, demonstrate that GIME effectively preserves model performance while substantially reducing both training time and costs. Additionally, applying GIME within the Judicial AI Program led to a remarkable 39.56% reduction in total model training expenses, underscoring its potential to support efficient and sustainable AI development.

[AI-72] Fuzzy Model Identification and Self Learning with Smooth Compositions

Link: https://arxiv.org/abs/2501.01994
Authors: Ebrahim Navid Sadjadi,Jesus Garcia,Jose M. Molina,Akbar Hashemi Borzabadi,Monireh Asadi Abchouyeh
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Abstract:This paper develops a smooth model identification and self-learning strategy for dynamic systems taking into account possible parameter variations and uncertainties. We have tried to solve the problem such that the model follows the changes and variations in the system on a continuous and smooth surface. Running the model to adaptively gain the optimum values of the parameters on a smooth surface would facilitate further improvements in the application of other derivative based optimization control algorithms such as MPC or robust control algorithms to achieve a combined modeling-control scheme. Compared to the earlier works on the smooth fuzzy modeling structures, we could reach a desired trade-off between the model optimality and the computational load. The proposed method has been evaluated on a test problem as well as the non-linear dynamic of a chemical process.

[AI-73] Disagree and Commit: Degrees of Argumentation-based Agreements

Link: https://arxiv.org/abs/2501.01992
Authors: Timotheus Kampik,Juan Carlos Nieves
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: To appear eventually in the Autonomous Agents and Multi-Agent Systems journal

Abstract:In cooperative human decision-making, agreements are often not total; a partial degree of agreement is sufficient to commit to a decision and move on, as long as one is somewhat confident that the involved parties are likely to stand by their commitment in the future, given no drastic unexpected changes. In this paper, we introduce the notion of agreement scenarios that allow artificial autonomous agents to reach such agreements, using formal models of argumentation, in particular abstract argumentation and value-based argumentation. We introduce the notions of degrees of satisfaction and (minimum, mean, and median) agreement, as well as a measure of the impact a value in a value-based argumentation framework has on these notions. We then analyze how degrees of agreement are affected when agreement scenarios are expanded with new information, to shed light on the reliability of partial agreements in dynamic scenarios. An implementation of the introduced concepts is provided as part of an argumentation-based reasoning software library.

[AI-74] Hawkes based Representation Learning for Reasoning over Scale-free Community-structured Temporal Knowledge Graphs

Link: https://arxiv.org/abs/2501.01974
Authors: Yuwei Du,Xinyue Liu,Wenxin Liang,Linlin Zong,Xianchao Zhang
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Temporal knowledge graph (TKG) reasoning has become a hot topic due to its great value in many practical tasks. The key to TKG reasoning is modeling the structural information and evolutional patterns of the TKGs. While great efforts have been devoted to TKG reasoning, the structural and evolutional characteristics of real-world networks have not been considered. In the aspect of structure, real-world networks usually exhibit clear community structure and scale-free (long-tailed distribution) properties. In the aspect of evolution, the impact of an event decays with the time elapsing. In this paper, we propose a novel TKG reasoning model called Hawkes process-based Evolutional Representation Learning Network (HERLN), which learns structural information and evolutional patterns of a TKG simultaneously, considering the characteristics of real-world networks: community structure, scale-free and temporal decaying. First, we find communities in the input TKG to make the encoding get more similar intra-community embeddings. Second, we design a Hawkes process-based relational graph convolutional network to cope with the event impact-decaying phenomenon. Third, we design a conditional decoding method to alleviate biases towards frequent entities caused by long-tailed distribution. Experimental results show that HERLN achieves significant improvements over the state-of-the-art models.
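
The "impact of an event decays with time" idea is what a Hawkes process captures. The sketch below is the textbook exponential-kernel intensity, not HERLN itself; the function name `hawkes_intensity` and the parameters `mu` (base rate), `alpha` (jump size) and `beta` (decay rate) are assumptions chosen for illustration.

```python
import math


def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity at time t given the past event times.

    Each past event adds alpha at the moment it happens, and that
    contribution then decays exponentially at rate beta.
    """
    decayed = sum(alpha * math.exp(-beta * (t - ti))
                  for ti in event_times if ti < t)
    return mu + decayed
```

With a single event at time 0, the intensity just after the event is close to `mu + alpha` and falls back toward `mu` as `t` grows, so recent events matter more than old ones.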

[AI-75] Optimal bounds for dissatisfaction in perpetual voting AAAI2025

Link: https://arxiv.org/abs/2501.01969
Authors: Alexander Kozachinskiy,Alexander Shen,Tomasz Steifer
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Full version of the AAAI 2025 paper

Abstract:In perpetual voting, multiple decisions are made at different moments in time. Taking the history of previous decisions into account allows us to satisfy properties such as proportionality over periods of time. In this paper, we consider the following question: is there a perpetual approval voting method that guarantees that no voter is dissatisfied too many times? We identify a sufficient condition on voter behavior – which we call ‘bounded conflicts’ condition – under which a sublinear growth of dissatisfaction is possible. We provide a tight upper bound on the growth of dissatisfaction under bounded conflicts, using techniques from Kolmogorov complexity. We also observe that the approval voting with binary choices mimics the machine learning setting of prediction with expert advice. This allows us to present a voting method with sublinear guarantees on dissatisfaction under bounded conflicts, based on the standard techniques from prediction with expert advice.
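
For readers unfamiliar with prediction with expert advice, the sketch below shows the classic deterministic weighted-majority algorithm on binary predictions. It is a generic illustration of that setting, not the voting method from the paper; the function name and the penalty factor `beta` are assumptions.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Weighted majority over binary (0/1) expert predictions.

    expert_predictions[t][i] is expert i's prediction in round t and
    outcomes[t] is the true outcome. Returns (mistakes, final_weights).
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # Follow the side that carries more total weight.
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        if guess != outcome:
            mistakes += 1
        # Shrink the weight of every expert that was wrong this round.
        weights = [w * beta if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes, weights
```

Because wrong experts lose weight geometrically, the number of mistakes grows only logarithmically in the number of experts plus a constant multiple of the best expert's mistakes, which is the kind of guarantee the expert-advice literature provides.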

[AI-76] Statistical learning does not always entail knowledge

Link: https://arxiv.org/abs/2501.01963
Authors: Daniel Andrés Díaz-Pachón,H. Renata Gallegos,Ola Hössjer,J. Sunil Rao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 30 pages, 1 figure

Abstract:In this paper, we study learning and knowledge acquisition (LKA) of an agent about a proposition that is either true or false. We use a Bayesian approach, where the agent receives data to update his beliefs about the proposition according to a posterior distribution. The LKA is formulated in terms of active information, with data representing external or exogenous information that modifies the agent’s beliefs. It is assumed that data provide details about a number of features that are relevant to the proposition. We show that this leads to a Gibbs distribution posterior, which is in maximum entropy relative to the prior, conditioned on the side constraints that the data provide in terms of the features. We demonstrate that full learning is sometimes not possible and full knowledge acquisition is never possible when the number of extracted features is too small. We also distinguish between primary learning (receiving data about features of relevance for the proposition) and secondary learning (receiving data about the learning of another agent). We argue that this type of secondary learning does not represent true knowledge acquisition. Our results have implications for statistical learning algorithms, and we claim that such algorithms do not always generate true knowledge. The theory is illustrated with several examples.
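
A Gibbs posterior of the kind the abstract describes is usually written as an exponential tilting of the prior. The notation below (prior $\pi_0$, feature functions $f_i$, multipliers $\lambda_i$, observed feature values $c_i$) is assumed here for illustration and is not taken from the paper.

```latex
\[
  \pi(x) \;=\; \frac{\pi_0(x)\,\exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x)\Big)}{Z(\lambda)},
  \qquad
  Z(\lambda) \;=\; \sum_{x} \pi_0(x)\,\exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x)\Big),
\]
\[
  \text{with } \lambda \text{ chosen so that } \mathbb{E}_{\pi}\big[f_i(X)\big] = c_i,
  \quad i = 1,\dots,k .
\]
```

This form is the distribution closest to the prior in relative entropy among all distributions that satisfy the $k$ feature constraints, which is why few features leave much of the prior untouched.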

[AI-77] Can AI Help with Your Personal Finances?

Link: https://arxiv.org/abs/2412.19784
Authors: Oudom Hean,Utsha Saha,Binita Saha
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); General Economics (econ.GN)

Abstract:In recent years, Large Language Models (LLMs) have emerged as a transformative development in artificial intelligence (AI), drawing significant attention from industry and academia. Trained on vast datasets, these sophisticated AI systems exhibit impressive natural language processing and content generation capabilities. This paper explores the potential of LLMs to address key challenges in personal finance, focusing on the United States. We evaluate several leading LLMs, including OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and Meta’s Llama, to assess their effectiveness in providing accurate financial advice on topics such as mortgages, taxes, loans, and investments. Our findings show that while these models achieve an average accuracy rate of approximately 70%, they also display notable limitations in certain areas. Specifically, LLMs struggle to provide accurate responses for complex financial queries, with performance varying significantly across different topics. Despite these limitations, the analysis reveals notable improvements in newer versions of these models, highlighting their growing utility for individuals and financial advisors. As these AI systems continue to evolve, their potential for advancing AI-driven applications in personal finance becomes increasingly promising.

[AI-78] IOHunter: Graph Foundation Model to Uncover Online Information Operations

Link: https://arxiv.org/abs/2412.14663
Authors: Marco Minici,Luca Luceri,Francesco Fabbri,Emilio Ferrara
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Social media platforms have become vital spaces for public discourse, serving as modern agorás where a wide range of voices influence societal narratives. However, their open nature also makes them vulnerable to exploitation by malicious actors, including state-sponsored entities, who can conduct information operations (IOs) to manipulate public opinion. The spread of misinformation, false news, and misleading claims threatens democratic processes and societal cohesion, making it crucial to develop methods for the timely detection of inauthentic activity to protect the integrity of online discourse. In this work, we introduce a methodology designed to identify users orchestrating information operations, a.k.a. IO drivers, across various influence campaigns. Our framework, named IOHunter, leverages the combined strengths of Language Models and Graph Neural Networks to improve generalization in supervised, scarcely-supervised, and cross-IO contexts. Our approach achieves state-of-the-art performance across multiple sets of IOs originating from six countries, significantly surpassing existing approaches. This research marks a step toward developing Graph Foundation Models specifically tailored for the task of IO detection on social media platforms.

[AI-79] Learning Dynamic Cognitive Map with Autonomous Navigation

Link: https://arxiv.org/abs/2411.08447
Authors: Daria de Tinguy,Tim Verbelen,Bart Dhoedt
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: under submission at Frontiers Computer Neuroscience

Abstract:Inspired by animal navigation strategies, we introduce a novel computational model to navigate and map a space rooted in biologically inspired principles. Animals exhibit extraordinary navigation prowess, harnessing memory, imagination, and strategic decision-making to traverse complex and aliased environments adeptly. Our model aims to replicate these capabilities by incorporating a dynamically expanding cognitive map over predicted poses within an Active Inference framework, enhancing our agent’s generative model plasticity to novelty and environmental changes. Through structure learning and active inference navigation, our model demonstrates efficient exploration and exploitation, dynamically expanding its model capacity in response to anticipated novel un-visited locations and updating the map given new evidence contradicting previous beliefs. Comparative analyses in mini-grid environments with the Clone-Structured Cognitive Graph model (CSCG), which shares similar objectives, highlight our model’s ability to rapidly learn environmental structures within a single episode, with minimal navigation overlap. Our model achieves this without prior knowledge of observation and world dimensions, underscoring its robustness and efficacy in navigating intricate environments.

[AI-80] Single-Channel Distance-Based Source Separation for Mobile GPU in Outdoor and Indoor Environments ICASSP2025

Link: https://arxiv.org/abs/2501.03045
Authors: Hanbin Bae,Byungjun Kang,Jiwon Kim,Jaeyong Hwang,Hosang Sung,Hoon-Young Cho
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP2025. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component

Abstract:This study emphasizes the significance of exploring distance-based source separation (DSS) in outdoor environments. Unlike existing studies that primarily focus on indoor settings, the proposed model is designed to capture the unique characteristics of outdoor audio sources. It incorporates advanced techniques, including a two-stage conformer block, a linear relation-aware self-attention (RSA), and a TensorFlow Lite GPU delegate. While the linear RSA may not capture physical cues as explicitly as the quadratic RSA, the linear RSA enhances the model’s context awareness, leading to improved performance on the DSS that requires an understanding of physical cues in outdoor and indoor environments. The experimental results demonstrated that the proposed model overcomes the limitations of existing approaches and considerably enhances energy efficiency and real-time inference speed on mobile devices.

[AI-81] Key-value memory in the brain

Link: https://arxiv.org/abs/2501.02950
Authors: Samuel J. Gershman,Ila Fiete,Kazuki Irie
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Classical models of memory in psychology and neuroscience rely on similarity-based retrieval of stored patterns, where similarity is a function of retrieval cues and the stored patterns. While parsimonious, these models do not allow distinct representations for storage and retrieval, despite their distinct computational demands. Key-value memory systems, in contrast, distinguish representations used for storage (values) and those used for retrieval (keys). This allows key-value memory systems to optimize simultaneously for fidelity in storage and discriminability in retrieval. We review the computational foundations of key-value memory, its role in modern machine learning systems, related ideas from psychology and neuroscience, applications to a number of empirical puzzles, and possible biological implementations.
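
The key/value split can be made concrete with a small attention-style lookup, one common instantiation of key-value memory in machine learning. This is an illustrative sketch rather than anything specified in the review; the function name `retrieve` and the `temperature` parameter are assumptions.

```python
import math


def retrieve(query, keys, values, temperature=1.0):
    """Read from a key-value memory with softmax-weighted retrieval.

    Keys decide *which* items match the query; values decide *what*
    is returned. The two can therefore be tuned independently.
    """
    scores = [sum(q * k for q, k in zip(query, key)) / temperature
              for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(dim)]
```

A low temperature makes retrieval nearly winner-take-all (high discriminability), while the stored values are returned unchanged apart from the weighting (high fidelity), which mirrors the trade-off discussed in the abstract.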

[AI-82] From thermodynamics to protein design: Diffusion models for biomolecule generation towards autonomous protein engineering

Link: https://arxiv.org/abs/2501.02680
Authors: Wen-ran Li,Xavier F. Cadet,David Medina-Ortiz,Mehdi D. Davari,Ramanathan Sowdhamini,Cedric Damour,Yu Li,Alain Miranville,Frederic Cadet
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Protein design with desirable properties has been a significant challenge for many decades. Generative artificial intelligence is a promising approach and has achieved great success in various protein generation tasks. Notably, diffusion models stand out for their robust mathematical foundations and impressive generative capabilities, offering unique advantages in certain applications such as protein design. In this review, we first give the definition and characteristics of diffusion models and then focus on two strategies: Denoising Diffusion Probabilistic Models and Score-based Generative Models, where DDPM is the discrete form of SGM. Furthermore, we discuss their applications in protein design, peptide generation, drug discovery, and protein-ligand interaction. Finally, we outline the future perspectives of diffusion models to advance autonomous protein design and engineering. The E(3) group consists of all rotations, reflections, and translations in three-dimensions. The equivariance on the E(3) group can keep the physical stability of the frame of each amino acid as much as possible, and we reflect on how to keep the diffusion model E(3) equivariant for protein generation.

[AI-83] Remote Inference over Dynamic Links via Adaptive Rate Deep Task-Oriented Vector Quantization

Link: https://arxiv.org/abs/2501.02521
Authors: Eyal Fishel,May Malka,Shai Ginzach,Nir Shlezinger
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 13 pages, 12 figures

Abstract:A broad range of technologies rely on remote inference, wherein data acquired is conveyed over a communication channel for inference in a remote server. Communication between the participating entities is often carried out over rate-limited channels, necessitating data compression for reducing latency. While deep learning facilitates joint design of the compression mapping along with encoding and inference rules, existing learned compression mechanisms are static, and struggle in adapting their resolution to changes in channel conditions and to dynamic links. To address this, we propose Adaptive Rate Task-Oriented Vector Quantization (ARTOVeQ), a learned compression mechanism that is tailored for remote inference over dynamic links. ARTOVeQ is based on designing nested codebooks along with a learning algorithm employing progressive learning. We show that ARTOVeQ extends to support low-latency inference that is gradually refined via successive refinement principles, and that it enables the simultaneous usage of multiple resolutions when conveying high-dimensional data. Numerical results demonstrate that the proposed scheme yields remote deep inference that operates with multiple rates, supports a broad range of bit budgets, and facilitates rapid inference that gradually improves with more bits exchanged, while approaching the performance of single-rate deep quantization methods.
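
One way to picture nested codebooks is a single list of codewords in which the first 2^b entries form the b-bit codebook, so every lower-rate codebook is contained in every higher-rate one. That prefix layout is an assumption made for this sketch, and the names `CODEBOOK` and `quantize` are hypothetical; ARTOVeQ's actual codebook design and training procedure are not reproduced here.

```python
# Toy one-dimensional codebook: the first 1, 2 and 4 entries serve as the
# 0-bit, 1-bit and 2-bit codebooks respectively (assumed prefix nesting).
CODEBOOK = [[0.0], [1.0], [0.5], [0.25]]


def quantize(x, codebook, bits):
    """Return (index, codeword) of the nearest codeword using `bits` bits."""
    size = 2 ** bits
    if size > len(codebook):
        raise ValueError("codebook too small for the requested rate")
    candidates = codebook[:size]

    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(x, candidates[i]))

    index = min(range(size), key=squared_distance)
    return index, candidates[index]
```

Because the codebooks share entries, a sender can pick the rate per transmission according to the current channel budget without switching models, and a larger bit budget can only keep or reduce the quantization error.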

[AI-84] TARGET: Interpretable Tailored Age Regression for Grouped Epigenetic Traits

Link: https://arxiv.org/abs/2501.02401
Authors: Zipeng Wu,Daniel Herring,Fabian Spill,James Andrews
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: To be published in IEEE BIBM this http URL manuscript includes a comprehensive description of the methodology and comparison with traditional epigenetic clocks and machine learning models. Submitted to arXiv as part of ongoing research in epigenetics and aging studies

Abstract:Accurately predicting chronological age from DNA methylation patterns is crucial for advancing biological age estimation. However, this task is made challenging by Epigenetic Correlation Drift (ECD) and Heterogeneity Among CpGs (HAC), which reflect the dynamic relationship between methylation and age across different life stages. To address these issues, we propose a novel two-phase algorithm. The first phase employs similarity searching to cluster methylation profiles by age group, while the second phase uses Explainable Boosting Machines (EBM) for precise, group-specific prediction. Our method not only improves prediction accuracy but also reveals key age-related CpG sites, detects age-specific changes in aging rates, and identifies pairwise interactions between CpG sites. Experimental results show that our approach outperforms traditional epigenetic clocks and machine learning models, offering a more accurate and interpretable solution for biological age estimation with significant implications for aging research.

[AI-85] Exploring the Capabilities and Limitations of Large Language Models for Radiation Oncology Decision Support

Link: https://arxiv.org/abs/2501.02346
Authors: Florian Putz,Marlen Haderleina,Sebastian Lettmaier,Sabine Semrau,Rainer Fietkau,Yixing Huang
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: Officially published in the Red Journal

Abstract:Thanks to the rapidly evolving integration of LLMs into decision-support tools, a significant transformation is happening across large-scale systems. As in other medical fields, the use of LLMs such as GPT-4 is gaining increasing interest in radiation oncology. An attempt to assess GPT-4’s performance in radiation oncology was made via a dedicated 100-question examination on the highly specialized topic of radiation oncology physics, revealing GPT-4’s superiority over other LLMs. GPT-4’s performance on a broader field of clinical radiation oncology is further benchmarked by the ACR Radiation Oncology In-Training (TXIT) exam where GPT-4 achieved a high accuracy of 74.57%. Its performance on re-labelling structure names in accordance with the AAPM TG-263 report has also been benchmarked, achieving accuracies above 96%. Such studies shed light on the potential of LLMs in radiation oncology. Although interest in the potential and constraints of LLMs in general healthcare applications continues to rise, the capabilities and limitations of LLMs in radiation oncology decision support have not yet been fully explored.

[AI-86] Towards a constructive framework for control theory

Link: https://arxiv.org/abs/2501.02267
Authors: Pavel Osinenko
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Published under: this https URL

Abstract:This work presents a framework for control theory based on constructive analysis to account for discrepancy between mathematical results and their implementation in a computer, also referred to as computational uncertainty. In control engineering, the latter is usually either neglected or considered submerged into some other type of uncertainty, such as system noise, and addressed within robust control. However, even robust control methods may be compromised when the mathematical objects involved in the respective algorithms fail to exist in exact form and subsequently fail to satisfy the required properties. For instance, in general stabilization using a control Lyapunov function, computational uncertainty may distort stability certificates or even destabilize the system despite robustness of the stabilization routine with regards to system, actuator and measurement noise. In fact, battling numerical problems in practical implementation of controllers is common among control engineers. Such observations indicate that computational uncertainty should indeed be addressed explicitly in controller synthesis and system analysis. The major contribution here is a fairly general framework for proof techniques in analysis and synthesis of control systems based on constructive analysis which explicitly states that every computation be doable only up to a finite precision thus accounting for computational uncertainty. A series of previous works is overviewed, including constructive system stability and stabilization, approximate optimal controls, eigenvalue problems, Caratheodory trajectories, measurable selectors. Additionally, a new constructive version of the Danskin’s theorem, which is crucial in adversarial defense, is presented.

[AI-87] Establishing baselines for generative discovery of inorganic crystals

Link: https://arxiv.org/abs/2501.02144
Authors: Nathan J. Szymanski,Christopher J. Bartel
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)

Abstract:Generative artificial intelligence offers a promising avenue for materials discovery, yet its advantages over traditional methods remain unclear. In this work, we introduce and benchmark two baseline approaches - random enumeration of charge-balanced prototypes and data-driven ion exchange of known compounds - against three generative models: a variational autoencoder, a large language model, and a diffusion model. Our results show that established methods such as ion exchange perform comparably well in generating stable materials, although many of these materials tend to closely resemble known compounds. In contrast, generative models excel at proposing novel structural frameworks and, when sufficient training data is available, can more effectively target properties such as electronic band gap and bulk modulus while maintaining a high stability rate. To enhance the performance of both the baseline and generative approaches, we implement a post-generation screening step in which all proposed structures are passed through stability and property filters from pre-trained machine learning models including universal interatomic potentials. This low-cost filtering step leads to substantial improvement in the success rates of all methods, remains computationally efficient, and ultimately provides a practical pathway toward more effective generative strategies for materials discovery.

[AI-88] Relaxation-assisted reverse annealing on nonnegative/binary matrix factorization

Link: https://arxiv.org/abs/2501.02114
Authors: Renichiro Haba,Masayuki Ohzeki,Kazuyuki Tanaka
Categories: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Quantum annealing has garnered significant attention as meta-heuristics inspired by quantum physics for combinatorial optimization problems. Among its many applications, nonnegative/binary matrix factorization stands out for its complexity and relevance in unsupervised machine learning. The use of reverse annealing, a derivative procedure of quantum annealing to prioritize the search in a vicinity under a given initial state, helps improve its optimization performance in matrix factorization. This study proposes an improved strategy that integrates reverse annealing with a linear programming relaxation technique. Using relaxed solutions as the initial configuration for reverse annealing, we demonstrate improvements in optimization performance comparable to the exact optimization methods. Our experiments on facial image datasets show that our method provides better convergence than known reverse annealing methods. Furthermore, we investigate the effectiveness of relaxation-based initialization methods on randomized datasets, demonstrating a relationship between the relaxed solution and the optimal solution. This research underscores the potential of combining reverse annealing and classical optimization strategies to enhance optimization performance.

Machine Learning

[LG-0] Characterizing the Accuracy-Communication-Privacy Trade-off in Distributed Stochastic Convex Optimization

Link: https://arxiv.org/abs/2501.03222
Authors: Sudeep Salgia,Nikola Pavlovic,Yuejie Chi,Qing Zhao
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

Abstract:We consider the problem of differentially private stochastic convex optimization (DP-SCO) in a distributed setting with M clients, where each of them has a local dataset of N i.i.d. data samples from an underlying data distribution. The objective is to design an algorithm to minimize a convex population loss using a collaborative effort across M clients, while ensuring the privacy of the local datasets. In this work, we investigate the accuracy-communication-privacy trade-off for this problem. We establish matching converse and achievability results using a novel lower bound and a new algorithm for distributed DP-SCO based on Vaidya’s plane cutting method. Thus, our results provide a complete characterization of the accuracy-communication-privacy trade-off for DP-SCO in the distributed setting.

[LG-1] Multimodal Machine Learning Can Predict Videoconference Fluidity and Enjoyment ICASSP2025

Link: https://arxiv.org/abs/2501.03190
Authors: Andrew Chang, Viswadruth Akkaraju, Ray McFadden Cogliano, David Poeppel, Dustin Freeman
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments: ICASSP 2025

Abstract:Videoconferencing is now a frequent mode of communication in both professional and informal settings, yet it often lacks the fluidity and enjoyment of in-person conversation. This study leverages multimodal machine learning to predict moments of negative experience in videoconferencing. We sampled thousands of short clips from the RoomReader corpus, extracting audio embeddings, facial actions, and body motion features to train models for identifying low conversational fluidity, low enjoyment, and classifying conversational events (backchanneling, interruption, or gap). Our best models achieved an ROC-AUC of up to 0.87 on hold-out videoconference sessions, with domain-general audio features proving most critical. This work demonstrates that multimodal audio-video signals can effectively predict high-level subjective conversational outcomes. In addition, this is a contribution to research on videoconferencing user experience by showing that multimodal machine learning can be used to identify rare moments of negative user experience for further study or mitigation.

[LG-2] Scalable Forward-Forward Algorithm

Link: https://arxiv.org/abs/2501.03176
Authors: Andrii Krutsylo
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:We propose a scalable Forward-Forward (FF) algorithm that eliminates the need for backpropagation by training each layer separately. Unlike backpropagation, FF avoids backward gradients and can be more modular and memory efficient, making it appealing for large networks. We extend FF to modern convolutional architectures, such as MobileNetV3 and ResNet18, by introducing a new way to compute losses for convolutional layers. Experiments show that our method achieves performance comparable to standard backpropagation. Furthermore, when we divide the network into blocks, such as the residual blocks in ResNet, and apply backpropagation only within each block, but not across blocks, our hybrid design tends to outperform backpropagation baselines while maintaining a similar training speed. Finally, we present experiments on small datasets and transfer learning that confirm the adaptability of our method.

[LG-3] Deep-Relative-Trust-Based Diffusion for Decentralized Deep Learning

Link: https://arxiv.org/abs/2501.03162
Authors: Muyun Li, Aaron Fainman, Stefan Vlaski
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Decentralized learning strategies allow a collection of agents to learn efficiently from local data sets without the need for central aggregation or orchestration. Current decentralized learning paradigms typically rely on an averaging mechanism to encourage agreement in the parameter space. We argue that in the context of deep neural networks, which are often over-parameterized, encouraging consensus of the neural network outputs, as opposed to their parameters can be more appropriate. This motivates the development of a new decentralized learning algorithm, termed DRT diffusion, based on deep relative trust (DRT), a recently introduced similarity measure for neural networks. We provide convergence analysis for the proposed strategy, and numerically establish its benefit to generalization, especially with sparse topologies, in an image classification task.

[LG-4] Communication Bounds for the Distributed Experts Problem NEURIPS2024

Link: https://arxiv.org/abs/2501.03132
Authors: Zhihao Jia, Qi Pang, Trung Tran, David Woodruff, Zhihao Zhang, Wenting Zheng
Categories: Machine Learning (cs.LG)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:In this work, we study the experts problem in the distributed setting where an expert’s cost needs to be aggregated across multiple servers. Our study considers various communication models such as the message-passing model and the broadcast model, along with multiple aggregation functions, such as summing and taking the $\ell_p$ norm of an expert’s cost across servers. We propose the first communication-efficient protocols that achieve near-optimal regret in these settings, even against a strong adversary who can choose the inputs adaptively. Additionally, we give a conditional lower bound showing that the communication of our protocols is nearly optimal. Finally, we implement our protocols and demonstrate empirical savings on the HPO-B benchmarks.

[LG-5] Learning DAGs and Root Causes from Time-Series Data

Link: https://arxiv.org/abs/2501.03130
Authors: Panagiotis Misiakos, Markus Püschel
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 9 figures, conference preprint

Abstract:We introduce DAG-TFRC, a novel method for learning directed acyclic graphs (DAGs) from time series with few root causes. By this, we mean that the data are generated by a small number of events at certain, unknown nodes and time points under a structural vector autoregression model. For such data, we (i) learn the DAGs representing both the instantaneous and time-lagged dependencies between nodes, and (ii) discover the location and time of the root causes. For synthetic data with few root causes, DAG-TFRC shows superior performance in accuracy and runtime over prior work, scaling up to thousands of nodes. Experiments on simulated and real-world financial data demonstrate the viability of our sparse root cause assumption. On S&P 500 data, DAG-TFRC successfully clusters stocks by sectors and discovers major stock movements as root causes.

[LG-6] Balancing Efficiency and Expressiveness: Subgraph GNNs with Walk-Based Centrality

Link: https://arxiv.org/abs/2501.03113
Authors: Joshua Southern, Yam Eitan, Guy Bar-Shalom, Michael Bronstein, Haggai Maron, Fabrizio Frasca
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 33 pages, 8 figures

Abstract:We propose an expressive and efficient approach that combines the strengths of two prominent extensions of Graph Neural Networks (GNNs): Subgraph GNNs and Structural Encodings (SEs). Our approach leverages walk-based centrality measures, both as a powerful form of SE and also as a subgraph selection strategy for Subgraph GNNs. By drawing a connection to perturbation analysis, we highlight the effectiveness of centrality-based sampling, and show it significantly reduces the computational burden associated with Subgraph GNNs. Further, we combine our efficient Subgraph GNN with SEs derived from the calculated centrality and demonstrate this hybrid approach, dubbed HyMN, gains in discriminative power. HyMN effectively addresses the expressiveness limitations of Message Passing Neural Networks (MPNNs) while mitigating the computational costs of Subgraph GNNs. Through a series of experiments on synthetic and real-world tasks, we show it outperforms other subgraph sampling approaches while being competitive with full-bag Subgraph GNNs and other state-of-the-art approaches with a notably reduced runtime.

[LG-7] Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks

Link: https://arxiv.org/abs/2501.03078
Authors: Théophane Vallaeys, Matthew Muckley, Jakob Verbeek, Matthijs Douze
Categories: Machine Learning (cs.LG)

Abstract:Vector quantization is a fundamental technique for compression and large-scale nearest neighbor search. For high-accuracy operating points, multi-codebook quantization associates data vectors with one element from each of multiple codebooks. An example is residual quantization (RQ), which iteratively quantizes the residual error of previous steps. Dependencies between the different parts of the code are, however, ignored in RQ, which leads to suboptimal rate-distortion performance. QINCo recently addressed this inefficiency by using a neural network to determine the quantization codebook in RQ based on the vector reconstruction from previous steps. In this paper we introduce QINCo2 which extends and improves QINCo with (i) improved vector encoding using codeword pre-selection and beam-search, (ii) a fast approximate decoder leveraging codeword pairs to establish accurate short-lists for search, and (iii) an optimized training procedure and network architecture. We conduct experiments on four datasets to evaluate QINCo2 for vector compression and billion-scale nearest neighbor search. We obtain outstanding results in both settings, improving the state-of-the-art reconstruction MSE by 34% for 16-byte vector compression on BigANN, and search accuracy by 24% with 8-byte encodings on Deep1M.
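
The plain residual quantization (RQ) baseline described in the abstract can be sketched in a few lines. This is only an illustration of greedy RQ with fixed codebooks, not QINCo or QINCo2 (which predict the codebooks with a neural network); the function names and the toy codebooks are assumptions made for this example.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def rq_encode(codebooks, x):
    """Greedy RQ: each codebook quantizes the residual left by the previous steps."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        i = nearest(cb, residual)
        codes.append(i)
        # Subtract the chosen codeword; the next codebook sees only what is left.
        residual = [r - c for r, c in zip(residual, cb[i])]
    return codes


def rq_decode(codebooks, codes):
    """The reconstruction is the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[i])]
    return out


# Toy (hypothetical) codebooks: a coarse one followed by a fine one.
codebooks = [
    [[0, 0], [4, 4]],
    [[0, 0], [1, 0], [0, 1]],
]
codes = rq_encode(codebooks, [4.9, 4.1])
print(codes, rq_decode(codebooks, codes))
```

Each step picks its codeword without looking at the other steps, which is the independence between code parts that the abstract identifies as the source of suboptimal rate-distortion performance.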

[LG-8] Probably Correct Optimal Stable Matching for Two-Sided Markets Under Uncertainty AAMAS2025

Link: https://arxiv.org/abs/2501.03018
Authors: Andreas Athanasopoulos, Anne-Marie George, Christos Dimitrakakis
Categories: Machine Learning (cs.LG)
Comments: This paper was accepted to International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)

Abstract:We consider a learning problem for the stable marriage model under unknown preferences for the left side of the market. We focus on the centralized case, where at each time step, an online platform matches the agents, and obtains a noisy evaluation reflecting their preferences. Our aim is to quickly identify the stable matching that is left-side optimal, rendering this a pure exploration problem with bandit feedback. We specifically aim to find Probably Correct Optimal Stable Matchings and present several bandit algorithms to do so. Our findings provide a foundational understanding of how to efficiently gather and utilize preference information to identify the optimal stable matching in two-sided markets under uncertainty. An experimental analysis on synthetic data complements theoretical results on sample complexities for the proposed methods.
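
As background for the target of the learning problem, the left-side optimal stable matching under fully known preferences can be computed with left-proposing deferred acceptance (Gale-Shapley). The sketch below covers only that known-preference case; it is not one of the paper's bandit algorithms, and the function name and the assumption of equal-sized sides with complete preference lists are choices made for this example.

```python
def deferred_acceptance(left_prefs, right_prefs):
    """Left-proposing Gale-Shapley. Returns {left_agent: right_agent}.

    Assumes both sides have the same size and every list ranks all
    agents of the other side (best first).
    """
    # rank[r][l] = position of l in r's list (lower is better).
    rank = {r: {l: i for i, l in enumerate(p)} for r, p in right_prefs.items()}
    next_choice = {l: 0 for l in left_prefs}  # next index each left agent proposes to
    engaged = {}                              # right agent -> current left partner
    free = list(left_prefs)
    while free:
        l = free.pop()
        r = left_prefs[l][next_choice[l]]
        next_choice[l] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = l
        elif rank[r][l] < rank[r][current]:
            engaged[r] = l          # r trades up; the old partner is free again
            free.append(current)
        else:
            free.append(l)          # r rejects l; l will try its next choice
    return {l: r for r, l in engaged.items()}


left = {"a": ["x", "y"], "b": ["x", "y"]}
right = {"x": ["b", "a"], "y": ["a", "b"]}
print(deferred_acceptance(left, right))
```

In the setting of the abstract the left-side preferences are unknown and only observed through noisy evaluations, so a procedure like this can only be run on estimates.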

[LG-9] Convexity in ReLU Neural Networks: beyond ICNNs?

Link: https://arxiv.org/abs/2501.03017
Authors: Anne Gagneux, Mathurin Massias, Emmanuel Soubies, Rémi Gribonval
Categories: Machine Learning (cs.LG)

Abstract:Convex functions and their gradients play a critical role in mathematical imaging, from proximal optimization to Optimal Transport. The successes of deep learning have led many to use learning-based methods, where fixed functions or operators are replaced by learned neural networks. Regardless of their empirical superiority, establishing rigorous guarantees for these methods often requires imposing structural constraints on neural architectures, in particular convexity. The most popular way to do so is to use so-called Input Convex Neural Networks (ICNNs). In order to explore the expressivity of ICNNs, we provide necessary and sufficient conditions for a ReLU neural network to be convex. Such characterizations are based on products of weights and activations, and write nicely for any architecture in the path-lifting framework. As particular applications, we study our characterizations in depth for 1 and 2-hidden-layer neural networks: we show that every convex function implemented by a 1-hidden-layer ReLU network can be also expressed by an ICNN with the same architecture; however this property no longer holds with more layers. Finally, we provide a numerical procedure that allows an exact check of convexity for ReLU neural networks with a large number of affine regions.

[LG-10] LOHA: Direct Graph Spectral Contrastive Learning Between Low-pass and High-pass Views AAAI2025

Link: https://arxiv.org/abs/2501.02969
Authors: Ziyun Zou, Yinghui Jiang, Lian Shen, Juan Liu, Xiangrong Liu
Categories: Machine Learning (cs.LG)
Comments: Accepted at AAAI2025

Abstract:Spectral Graph Neural Networks effectively handle graphs with different homophily levels, with low-pass filter mining feature smoothness and high-pass filter capturing differences. When these distinct filters could naturally form two opposite views for self-supervised learning, the commonalities between the counterparts for the same node remain unexplored, leading to suboptimal performance. In this paper, a simple yet effective self-supervised contrastive framework, LOHA, is proposed to address this gap. LOHA optimally leverages low-pass and high-pass views by embracing “harmony in diversity”. Rather than solely maximizing the difference between these distinct views, which may lead to feature separation, LOHA harmonizes the diversity by treating the propagation of graph signals from both views as a composite feature. Specifically, a novel high-dimensional feature named spectral signal trend is proposed to serve as the basis for the composite feature, which remains relatively unaffected by changing filters and focuses solely on original feature differences. LOHA achieves an average performance improvement of 2.8% over runner-up models on 9 real-world datasets with varying homophily levels. Notably, LOHA even surpasses fully-supervised models on several datasets, which underscores the potential of LOHA in advancing the efficacy of spectral GNNs for diverse graph structures.

[LG-11] MSA-CNN: A Lightweight Multi-Scale CNN with Attention for Sleep Stage Classification

Link: https://arxiv.org/abs/2501.02949
Authors: Stephan Goerttler, Yucheng Wang, Emadeldeen Eldele, Min Wu, Fei He
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 10 pages, 6 figures, journal paper

Abstract:Recent advancements in machine learning-based signal analysis, coupled with open data initiatives, have fuelled efforts in automatic sleep stage classification. Despite the proliferation of classification models, few have prioritised reducing model complexity, which is a crucial factor for practical applications. In this work, we introduce Multi-Scale and Attention Convolutional Neural Network (MSA-CNN), a lightweight architecture featuring as few as ~10,000 parameters. MSA-CNN leverages a novel multi-scale module employing complementary pooling to eliminate redundant filter parameters and dense convolutions. Model complexity is further reduced by separating temporal and spatial feature extraction and using cost-effective global spatial convolutions. This separation of tasks not only reduces model complexity but also mirrors the approach used by human experts in sleep stage scoring. We evaluated both small and large configurations of MSA-CNN against nine state-of-the-art baseline models across three public datasets, treating univariate and multivariate models separately. Our evaluation, based on repeated cross-validation and re-evaluation of all baseline models, demonstrated that the large MSA-CNN outperformed all baseline models on all three datasets in terms of accuracy and Cohen’s kappa, despite its significantly reduced parameter count. Lastly, we explored various model variants and conducted an in-depth analysis of the key modules and techniques, providing deeper insights into the underlying mechanisms. The code for our models, baselines, and evaluation procedures is available at this https URL.

[LG-12] The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features

Link: https://arxiv.org/abs/2501.02945
Authors: Shi Bin Hoo, Samuel Müller, David Salinas, Frank Hutter
Categories: Machine Learning (cs.LG)

Abstract:Foundation models have become popular in forecasting due to their ability to make accurate predictions, even with minimal fine-tuning on specific datasets. In this paper, we demonstrate how the newly released regression variant of TabPFN, a general tabular foundation model, can be applied to time series forecasting. We propose a straightforward approach, TabPFN-TS, which pairs TabPFN with simple feature engineering to achieve strong forecasting performance. Despite its simplicity and with only 11M parameters, TabPFN-TS outperforms Chronos-Mini, a model of similar size, and matches or even slightly outperforms Chronos-Large, which has 65-fold more parameters. A key strength of our method lies in its reliance solely on artificial data during pre-training, avoiding the need for large training datasets and eliminating the risk of benchmark contamination.

[LG-13] Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures

Link: https://arxiv.org/abs/2501.02931
Authors: Charles O’Neill
Categories: Machine Learning (cs.LG)

Abstract:Self-attention mechanisms have revolutionised deep learning architectures, but their mathematical foundations remain incompletely understood. We establish that these mechanisms can be formalised through categorical algebra, presenting a framework that focuses on the linear components of self-attention. We prove that the query, key, and value maps in self-attention naturally form a parametric endofunctor in the 2-category $\mathbf{Para}(\mathbf{Vect})$ of parametric morphisms. We show that stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor. For positional encodings, we demonstrate that strictly additive position embeddings constitute monoid actions on the embedding space, while standard sinusoidal encodings, though not additive, possess a universal property among faithful position-preserving functors. We establish that the linear portions of self-attention exhibit natural equivariance properties with respect to permutations of input tokens. Finally, we prove that the “circuits” identified in mechanistic interpretability correspond precisely to compositions of parametric morphisms in our framework. This categorical perspective unifies geometric, algebraic, and interpretability-based approaches to transformer analysis, while making explicit the mathematical structures underlying attention mechanisms. Our treatment focuses exclusively on linear maps, setting aside nonlinearities like softmax and layer normalisation, which require more sophisticated categorical structures. Our results extend recent work on categorical foundations for deep learning while providing insights into the algebraic structure of attention mechanisms.

[LG-14] Offline-to-online hyperparameter transfer for stochastic bandits AAAI2025

Link: https://arxiv.org/abs/2501.02926
Authors: Dravyansh Sharma, Arun Sai Suggala
Categories: Machine Learning (cs.LG)
Comments: AAAI 2025

Abstract:Classic algorithms for stochastic bandits typically use hyperparameters that govern their critical properties such as the trade-off between exploration and exploitation. Tuning these hyperparameters is a problem of great practical significance. However, this is a challenging problem and in certain cases is information theoretically impossible. To address this challenge, we consider a practically relevant transfer learning setting where one has access to offline data collected from several bandit problems (tasks) coming from an unknown distribution over the tasks. Our aim is to use this offline data to set the hyperparameters for a new task drawn from the unknown distribution. We provide bounds on the inter-task (number of tasks) and intra-task (number of arm pulls for each task) sample complexity for learning near-optimal hyperparameters on unseen tasks drawn from the distribution. Our results apply to several classic algorithms, including tuning the exploration parameters in UCB and LinUCB and the noise parameter in GP-UCB. Our experiments indicate the significance and effectiveness of the transfer of hyperparameters from offline problems in online learning with stochastic bandit feedback.
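
To make concrete what "exploration parameter" means here, the sketch below runs a UCB-style index policy in which a scalar `c` scales the confidence bonus. It is a textbook-style illustration, not the authors' transfer procedure; the function signature, the default `c=2.0`, and the Bernoulli arms in the usage lines are assumptions made for this example.

```python
import math
import random


def ucb(pull, n_arms, horizon, c=2.0):
    """Run a UCB-style policy for `horizon` rounds and return per-arm pull counts.

    `pull(arm)` returns a reward; `c` is the exploration hyperparameter:
    larger values explore more, smaller values exploit more.
    Requires horizon >= n_arms.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialise the estimates
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(c * math.log(t) / counts[a]),
            )
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts


# Usage with hypothetical Bernoulli arms.
rng = random.Random(0)
means = [0.3, 0.5, 0.7]
print(ucb(lambda a: 1.0 if rng.random() < means[a] else 0.0, len(means), 2000))
```

The setting in the abstract amounts to choosing `c` from offline data on related tasks instead of fixing it by hand.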

[LG-15] Sim-to-Real Transfer for Mobile Robots with Reinforcement Learning: from NVIDIA Isaac Sim to Gazebo and Real ROS 2 Robots

Link: https://arxiv.org/abs/2501.02902
Authors: Sahar Salimpour, Jorge Peña-Queralta, Diego Paez-Granados, Jukka Heikkonen, Tomi Westerlund
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Unprecedented agility and dexterous manipulation have been demonstrated with controllers based on deep reinforcement learning (RL), with a significant impact on legged and humanoid robots. Modern tooling and simulation platforms, such as NVIDIA Isaac Sim, have been enabling such advances. This article focuses on demonstrating the applications of Isaac in local planning and obstacle avoidance as one of the most fundamental ways in which a mobile robot interacts with its environments. Although there is extensive research on proprioception-based RL policies, the article highlights less standardized and reproducible approaches to exteroception. At the same time, the article aims to provide a base framework for end-to-end local navigation policies and how a custom robot can be trained in such simulation environment. We benchmark end-to-end policies with the state-of-the-art Nav2, navigation stack in Robot Operating System (ROS). We also cover the sim-to-real transfer process by demonstrating zero-shot transferability of policies trained in the Isaac simulator to real-world robots. This is further evidenced by the tests with different simulated robots, which show the generalization of the learned policy. Finally, the benchmarks demonstrate comparable performance to Nav2, opening the door to quick deployment of state-of-the-art end-to-end local planners for custom robot platforms, but importantly furthering the possibilities by expanding the state and action spaces or task definitions for more complex missions. Overall, with this article we introduce the most important steps, and aspects to consider, in deploying RL policies for local path planning and obstacle avoidance with Isaac Sim training, Gazebo testing, and ROS 2 for real-time inference in real robots. The code is available at this https URL.

[LG-16] Conditional Mutual Information Based Diffusion Posterior Sampling for Solving Inverse Problems

Link: https://arxiv.org/abs/2501.02880
Authors: Shayan Mohajer Hamidi, En-Hui Yang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Inverse problems are prevalent across various disciplines in science and engineering. In the field of computer vision, tasks such as inpainting, deblurring, and super-resolution are commonly formulated as inverse problems. Recently, diffusion models (DMs) have emerged as a promising approach for addressing noisy linear inverse problems, offering effective solutions without requiring additional task-specific training. Specifically, with the prior provided by DMs, one can sample from the posterior by finding the likelihood. Since the likelihood is intractable, it is often approximated in the literature. However, this approximation compromises the quality of the generated images. To overcome this limitation and improve the effectiveness of DMs in solving inverse problems, we propose an information-theoretic approach. Specifically, we maximize the conditional mutual information $\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} \mid \boldsymbol{x}_t)$, where $\boldsymbol{x}_0$ represents the reconstructed signal, $\boldsymbol{y}$ is the measurement, and $\boldsymbol{x}_t$ is the intermediate signal at stage $t$. This ensures that the intermediate signals $\boldsymbol{x}_t$ are generated in a way that the final reconstructed signal $\boldsymbol{x}_0$ retains as much information as possible about the measurement $\boldsymbol{y}$. We demonstrate that this method can be seamlessly integrated with recent approaches and, once incorporated, enhances their performance both qualitatively and quantitatively.
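
For reference, the quantity being maximized has the standard information-theoretic definition below. The symbols follow the abstract; $\mathrm{H}$ denotes (differential) entropy and the expectation is over the joint distribution of the three variables. This is the general definition only and says nothing about how the authors estimate or optimize it.

```latex
\[
\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} \mid \boldsymbol{x}_t)
= \mathrm{H}(\boldsymbol{x}_0 \mid \boldsymbol{x}_t)
- \mathrm{H}(\boldsymbol{x}_0 \mid \boldsymbol{y}, \boldsymbol{x}_t)
= \mathbb{E}\left[
  \log \frac{p(\boldsymbol{x}_0, \boldsymbol{y} \mid \boldsymbol{x}_t)}
            {p(\boldsymbol{x}_0 \mid \boldsymbol{x}_t)\, p(\boldsymbol{y} \mid \boldsymbol{x}_t)}
\right]
\]
```

The middle expression reads as the reduction in uncertainty about $\boldsymbol{x}_0$ obtained by also knowing $\boldsymbol{y}$, once $\boldsymbol{x}_t$ is given.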

[LG-17] ParetoLens: A Visual Analytics Framework for Exploring Solution Sets of Multi-objective Evolutionary Algorithms

Link: https://arxiv.org/abs/2501.02857
Authors: Yuxin Ma, Zherui Zhang, Ran Cheng, Yaochu Jin, Kay Chen Tan
Categories: Neural and Evolutionary Computing (cs.NE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted by IEEE Computational Intelligence Magazine

Abstract:In the domain of multi-objective optimization, evolutionary algorithms are distinguished by their capability to generate a diverse population of solutions that navigate the trade-offs inherent among competing objectives. This has catalyzed the ascension of evolutionary multi-objective optimization (EMO) as a prevalent approach. Despite the effectiveness of the EMO paradigm, the analysis of resultant solution sets presents considerable challenges. This is primarily attributed to the high-dimensional nature of the data and the constraints imposed by static visualization methods, which frequently culminate in visual clutter and impede interactive exploratory analysis. To address these challenges, this paper introduces ParetoLens, a visual analytics framework specifically tailored to enhance the inspection and exploration of solution sets derived from the multi-objective evolutionary algorithms. Utilizing a modularized, algorithm-agnostic design, ParetoLens enables a detailed inspection of solution distributions in both decision and objective spaces through a suite of interactive visual representations. This approach not only mitigates the issues associated with static visualizations but also supports a more nuanced and flexible analysis process. The usability of the framework is evaluated through case studies and expert interviews, demonstrating its potential to uncover complex patterns and facilitate a deeper understanding of multi-objective optimization solution sets. A demo website of ParetoLens is available at this https URL.

[LG-18] RAHN: A Reputation Based Hourglass Network for Web Service QoS Prediction

Link: https://arxiv.org/abs/2501.02843
Authors: Xia Chen, Yugen Du, Guoxing Tang, Yingwei Luo, Benchi Ma
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures

Abstract:As the homogenization of Web services becomes more and more common, the difficulty of service recommendation is gradually increasing. How to predict Quality of Service (QoS) more efficiently and accurately becomes an important challenge for service recommendation. Considering the excellent role of reputation and deep learning (DL) techniques in the field of QoS prediction, we propose a reputation and DL based QoS prediction network, RAHN, which contains the Reputation Calculation Module (RCM), the Latent Feature Extraction Module (LFEM), and the QoS Prediction Hourglass Network (QPHN). RCM obtains the user reputation and the service reputation by using a clustering algorithm and a Logit model. LFEM extracts latent features from known information to form an initial latent feature vector. QPHN aggregates latent feature vectors with different scales by using Attention Mechanism, and can be stacked multiple times to obtain the final latent feature vector for prediction. We evaluate RAHN on a real QoS dataset. The experimental results show that the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) of RAHN are smaller than the six baseline methods.

[LG-19] Foundations of GenIR

Link: https://arxiv.org/abs/2501.02842
Authors: Qingyao Ai, Jingtao Zhan, Yiqun Liu
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Chapter 2 of the book on Information Access in the Era of Generative AI

Abstract:The chapter discusses the foundational impact of modern generative AI models on information access (IA) systems. In contrast to traditional AI, the large-scale training and superior data modeling of generative AI models enable them to produce high-quality, human-like responses, which brings brand new opportunities for the development of IA paradigms. In this chapter, we identify and introduce two of them in detail, i.e., information generation and information synthesis. Information generation allows AI to create tailored content addressing user needs directly, enhancing user experience with immediate, relevant outputs. Information synthesis leverages the ability of generative AI to integrate and reorganize existing information, providing grounded responses and mitigating issues like model hallucination, which is particularly valuable in scenarios requiring precision and external knowledge. This chapter delves into the foundational aspects of generative models, including architecture, scaling, and training, and discusses their applications in multi-modal scenarios. Additionally, it examines the retrieval-augmented generation paradigm and other methods for corpus modeling and understanding, demonstrating how generative AI can enhance information access systems. It also summarizes potential challenges and fruitful directions for future studies.

[LG-20] Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs

Link: https://arxiv.org/abs/2501.02825
Authors: Kavi Gupta, Kate Sanders, Armando Solar-Lezama
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, 2 tables

Abstract:Can LLMs pick up language structure from examples? Evidence in prior work seems to indicate yes, as pretrained models repeatedly demonstrate the ability to adapt to new language structures and vocabularies. However, this line of research typically considers languages that are present within common pretraining datasets, or otherwise share notable similarities with these seen languages. In contrast, in this work we attempt to measure models’ language understanding capacity while circumventing the risk of dataset recall. We parameterize large families of language tasks recognized by deterministic finite automata (DFAs), and can thus sample novel language reasoning problems to fairly evaluate LLMs regardless of training data. We find that, even in the strikingly simple setting of 3-state DFAs, LLMs underperform unparameterized ngram models on both language recognition and synthesis tasks. These results suggest that LLMs struggle to match the ability of basic language models in recognizing and reasoning over languages that are sufficiently distinct from the ones they see at training time, underscoring the distinction between learning individual languages and possessing a general theory of language.
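
A rough idea of what sampling a small DFA and generating labelled strings from it could look like is sketched below. The sampling scheme (independent random transitions, each state accepting with probability 0.5, state 0 as the start state) and the function names are assumptions made for this example; the abstract does not specify the authors' actual parameterization.

```python
import random


def sample_dfa(n_states, alphabet, rng):
    """Sample a random DFA as (transition table, accepting set); state 0 is the start."""
    delta = {(q, a): rng.randrange(n_states) for q in range(n_states) for a in alphabet}
    accepting = {q for q in range(n_states) if rng.random() < 0.5}
    return delta, accepting


def accepts(dfa, string):
    """Run the DFA on `string` and report whether it ends in an accepting state."""
    delta, accepting = dfa
    state = 0
    for symbol in string:
        state = delta[(state, symbol)]
    return state in accepting


# Build a small recognition task: random strings labelled by a sampled 3-state DFA.
rng = random.Random(0)
dfa = sample_dfa(3, "ab", rng)
examples = ["".join(rng.choice("ab") for _ in range(6)) for _ in range(5)]
for s in examples:
    print(s, accepts(dfa, s))
```

Because the automaton is drawn at random, the resulting language is unlikely to appear verbatim in any pretraining corpus, which is the property the abstract relies on.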

[LG-21] DarkFarseer: Inductive Spatio-temporal Kriging via Hidden Style Enhancement and Sparsity-Noise Mitigation

Link: https://arxiv.org/abs/2501.02808
Authors: Zhuoxuan Liang, Wei Li, Dalin Zhang, Yidan Chen, Zhihong Wang, Xiangping Zheng, Moustafa Youssef
Categories: Machine Learning (cs.LG)
Comments: TKDE (Under Review)

Abstract:With the rapid growth of the Internet of Things and Cyber-Physical Systems, widespread sensor deployment has become essential. However, the high costs of building sensor networks limit their scale and coverage, making fine-grained deployment challenging. Inductive Spatio-Temporal Kriging (ISK) addresses this issue by introducing virtual sensors. Based on graph neural networks (GNNs) extracting the relationships between physical and virtual sensors, ISK can infer the measurements of virtual sensors from physical sensors. However, current ISK methods rely on conventional message-passing mechanisms and network architectures, without effectively extracting spatio-temporal features of physical sensors and focusing on representing virtual sensors. Additionally, existing graph construction methods face issues of sparse and noisy connections, destroying ISK performance. To address these issues, we propose DarkFarseer, a novel ISK framework with three key components. First, we propose the Neighbor Hidden Style Enhancement module with a style transfer strategy to enhance the representation of virtual nodes in a temporal-then-spatial manner to better extract the spatial relationships between physical and virtual nodes. Second, we propose Virtual-Component Contrastive Learning, which aims to enrich the node representation by establishing the association between the patterns of virtual nodes and the regional patterns within graph components. Lastly, we design a Similarity-Based Graph Denoising Strategy, which reduces the connectivity strength of noisy connections around virtual nodes and their neighbors based on their temporal information and regional spatial patterns. Extensive experiments demonstrate that DarkFarseer significantly outperforms existing ISK methods.

[LG-22] GraphDART: Graph Distillation for Efficient Advanced Persistent Threat Detection

Link: https://arxiv.org/abs/2501.02796
Authors: Saba Fathi Rabooki,Bowen Li,Falih Gozi Febrinanto,Ciyuan Peng,Elham Naghizade,Fengling Han,Feng Xia
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: “This work has been submitted to the IEEE for possible publication.”

Abstract:Cyber-physical-social systems (CPSSs) have emerged in many applications over recent decades, requiring increased attention to security concerns. The rise of sophisticated threats like Advanced Persistent Threats (APTs) makes ensuring security in CPSSs particularly challenging. Provenance graph analysis has proven effective for tracing and detecting anomalies within systems, but the sheer size and complexity of these graphs hinder the efficiency of existing methods, especially those relying on graph neural networks (GNNs). To address these challenges, we present GraphDART, a modular framework designed to distill provenance graphs into compact yet informative representations, enabling scalable and effective anomaly detection. GraphDART can take advantage of diverse graph distillation techniques, including classic and modern graph distillation methods, to condense large provenance graphs while preserving essential structural and contextual information. This approach significantly reduces computational overhead, allowing GNNs to learn from distilled graphs efficiently and enhance detection performance. Extensive evaluations on benchmark datasets demonstrate the robustness of GraphDART in detecting malicious activities across cyber-physical-social systems. By optimizing computational efficiency, GraphDART provides a scalable and practical solution to safeguard interconnected environments against APTs.

[LG-23] Orthogonal greedy algorithm for linear operator learning with shallow neural network

Link: https://arxiv.org/abs/2501.02791
Authors: Ye Lin,Jiwei Jia,Young Ju Lee,Ran Zhang
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:

Abstract:Greedy algorithms, particularly the orthogonal greedy algorithm (OGA), have proven effective in training shallow neural networks for fitting functions and solving partial differential equations (PDEs). In this paper, we extend the application of OGA to the tasks of linear operator learning, which is equivalent to learning the kernel function through integral transforms. Firstly, a novel greedy algorithm is developed for kernel estimation rate in a new semi-inner product, which can be utilized to approximate the Green’s function of linear PDEs from data. Secondly, we introduce the OGA for point-wise kernel estimation to further improve the approximation rate, achieving orders of accuracy improvement across various tasks and baseline models. In addition, we provide a theoretical analysis on the kernel estimation problem and the optimal approximation rates for both algorithms, establishing their efficacy and potential for future applications in PDEs and operator learning tasks.

[LG-24] From Dense to Sparse: Event Response for Enhanced Residential Load Forecasting

Link: https://arxiv.org/abs/2501.02781
Authors: Xin Cao,Qinghua Tao,Yingjie Zhou,Lu Zhang,Le Zhang,Dongjin Song,Dapeng Oliver Wu,Ce Zhu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Residential load forecasting (RLF) is crucial for resource scheduling in power systems. Most existing methods utilize all given load records (dense data) to indiscriminately extract the dependencies between historical and future time series. However, there exist important regular patterns residing in the event-related associations among different appliances (sparse knowledge), which have yet been [...]. In this paper, we propose an Event-Response Knowledge Guided approach (ERKG) for RLF by incorporating the estimation of electricity usage events for different appliances, mining event-related sparse knowledge from the load series. With ERKG, the event-response estimation enables portraying the electricity consumption behaviors of residents, revealing regular variations in appliance operational [...]. To be specific, ERKG consists of knowledge extraction and guidance: i) a forecasting model is designed for the electricity usage events by estimating appliance operational states, aiming to extract the event-related sparse knowledge; ii) a novel knowledge-guided mechanism is established by fusing such state estimates of the appliance events into the RLF model, which can give particular focus to the patterns of users’ electricity consumption [...]. [...], ERKG can flexibly serve as a plug-in module to boost the capability of existing forecasting models by leveraging event response. In numerical experiments, extensive comparisons and ablation studies have verified the effectiveness of our ERKG, e.g., MAE can be reduced by over 8% on the tested state-of-the-art forecasting models. The source code will be available at this https URL.

[LG-25] Learn A Flexible Exploration Model for Parameterized Action Markov Decision Processes

Link: https://arxiv.org/abs/2501.02774
Authors: Zijian Wang,Bin Wang,Mingwen Shao,Hongbo Dou,Boxiang Tao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Hybrid action models are widely considered an effective approach to reinforcement learning (RL) modeling. The current mainstream method is to train agents under Parameterized Action Markov Decision Processes (PAMDPs), which performs well in specific environments. Unfortunately, these models either exhibit drastically low learning efficiency in complex PAMDPs or lose crucial information in the conversion between raw space and latent space. To enhance the learning efficiency and asymptotic performance of the agent, we propose a model-based RL (MBRL) algorithm, FLEXplore. FLEXplore learns a parameterized-action-conditioned dynamics model and employs a modified Model Predictive Path Integral control. Unlike conventional MBRL algorithms, we carefully design the dynamics loss function and reward smoothing process to learn a loose yet flexible model. Additionally, we use the variational lower bound to maximize the mutual information between the state and the hybrid action, enhancing the exploration effectiveness of the agent. We theoretically demonstrate that FLEXplore can reduce the regret of the rollout trajectory through the Wasserstein Metric under given Lipschitz conditions. Our empirical results on several standard benchmarks show that FLEXplore has outstanding learning efficiency and asymptotic performance compared to other baselines.

[LG-26] CHAT: Beyond Contrastive Graph Transformer for Link Prediction in Heterogeneous Networks

Link: https://arxiv.org/abs/2501.02760
Authors: Shengming Zhang,Le Zhang,Jingbo Zhou,Hui Xiong
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract:Link prediction in heterogeneous networks is crucial for understanding the intricacies of network structures and forecasting their future developments. Traditional methodologies often face significant obstacles, including over-smoothing (wherein the excessive aggregation of node features leads to the loss of critical structural details) and a dependency on human-defined meta-paths, which necessitate extensive domain knowledge and can be inherently restrictive. These limitations hinder the effective prediction and analysis of complex heterogeneous networks. In response to these challenges, we propose the Contrastive Heterogeneous grAph Transformer (CHAT). CHAT introduces a novel sampling-based graph transformer technique that selectively retains nodes of interest, thereby obviating the need for predefined meta-paths. The method employs an innovative connection-aware transformer to encode node sequences and their interconnections with high fidelity, guided by a dual-faceted loss function specifically designed for heterogeneous network link prediction. Additionally, CHAT incorporates an ensemble link predictor that synthesizes multiple samplings to achieve enhanced prediction accuracy. We conducted comprehensive evaluations of CHAT using three distinct drug-target interaction (DTI) datasets. The empirical results underscore CHAT’s superior performance, outperforming both general-task approaches and models specialized in DTI prediction. These findings substantiate the efficacy of CHAT in addressing the complex problem of link prediction in heterogeneous networks.

[LG-27] Sequence Complementor: Complementing Transformers For Time Series Forecasting with Learnable Sequences AAAI2025

Link: https://arxiv.org/abs/2501.02735
Authors: Xiwen Chen,Peijie Qiu,Wenhui Zhu,Huayu Li,Hao Wang,Aristeidis Sotiras,Yalin Wang,Abolfazl Razi
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI2025

Abstract:Since its introduction, the transformer has shifted the development trajectory away from traditional models (e.g., RNN, MLP) in time series forecasting, which is attributed to its ability to capture global dependencies within temporal tokens. Follow-up studies have largely involved altering the tokenization and self-attention modules to better adapt Transformers for addressing special challenges like non-stationarity, channel-wise dependency, and variable correlation in time series. However, after investigating several representative methods, we found that the expressive capability of sequence representation is a key factor influencing Transformer performance in time series forecasting: there is an almost linear relationship between sequence representation entropy and mean square error, with more diverse representations performing better. In this paper, we propose a novel attention mechanism with Sequence Complementors and prove its feasibility from an information theory perspective, where these learnable sequences are able to provide complementary information beyond current input to feed attention. We further enhance the Sequence Complementors via a diversification loss that is theoretically covered. The empirical evaluation of both long-term and short-term forecasting has confirmed its superiority over the recent state-of-the-art methods.

[LG-28] Learning Stochastic Nonlinear Dynamics with Embedded Latent Transfer Operators

Link: https://arxiv.org/abs/2501.02721
Authors: Naichang Ke,Ryogo Tanaka,Yoshinobu Kawahara
Categories: Machine Learning (cs.LG)
Comments: This submission includes a supplementary file ( this http URL ) providing additional details. It also contains a code directory (code/) for the experiments

Abstract:We consider an operator-based latent Markov representation of a stochastic nonlinear dynamical system, where the stochastic evolution of the latent state embedded in a reproducing kernel Hilbert space is described with the corresponding transfer operator, and develop a spectral method to learn this representation based on the theory of stochastic realization. The embedding may be learned simultaneously using reproducing kernels, for example, constructed with feed-forward neural networks. We also address the generalization of sequential state-estimation (Kalman filtering) in stochastic nonlinear systems, and of operator-based eigen-mode decomposition of dynamics, for the representation. Several examples with synthetic and real-world data are shown to illustrate the empirical characteristics of our methods, and to investigate the performance of our model in sequential state-estimation and mode decomposition.

[LG-29] Knowledge Distillation with Adapted Weight

Link: https://arxiv.org/abs/2501.02705
Authors: Sirong Wu,Xi Luo,Junjie Liu,Yuhui Deng
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:Although large models have shown a strong capacity to solve large-scale problems in many areas including natural language and computer vision, their voluminous parameters are hard to deploy in a real-time system due to computational and energy constraints. Addressing this, knowledge distillation through Teacher-Student architecture offers a sustainable pathway to compress the knowledge of large models into more manageable sizes without significantly compromising performance. To enhance the robustness and interpretability of this framework, it is critical to understand how individual training data impact model performance, which is an area that remains underexplored. We propose the Knowledge Distillation with Adaptive Influence Weight (KD-AIF) framework, which leverages influence functions from robust statistics to assign weights to training data, grounded in the four key SAFE principles: Sustainability, Accuracy, Fairness, and Explainability. This novel approach not only optimizes distillation but also increases transparency by revealing the significance of different data. The exploration of various update mechanisms within the KD-AIF framework further elucidates its potential to significantly improve learning efficiency and generalization in student models, marking a step toward more explainable and deployable Large Models. KD-AIF is effective in knowledge distillation while also showing exceptional performance in semi-supervised learning, outperforming existing baselines and methods on multiple benchmarks (CIFAR-100, CIFAR-10-4k, SVHN-1k, and GLUE).
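
The idea of weighting training samples inside a distillation loss can be sketched as follows. This is a minimal illustration, not the KD-AIF implementation: the function names, the temperature value, and the per-sample weights are all hypothetical, and in the paper the weights would come from influence functions rather than being supplied by hand.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_kd_loss(teacher_logits, student_logits, weights, temperature=2.0):
    """Weighted mean of KL(teacher || student) over samples.

    `weights` is a hypothetical per-sample importance score; here it is
    simply passed in by the caller.
    """
    total, weight_sum = 0.0, 0.0
    for t_logits, s_logits, w in zip(teacher_logits, student_logits, weights):
        p = softmax(t_logits, temperature)
        q = softmax(s_logits, temperature)
        kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
        total += w * kl
        weight_sum += w
    return total / weight_sum

# Toy batch: the student matches the teacher on sample 0 and disagrees on sample 1.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
uniform_loss = weighted_kd_loss(teacher, student, [1.0, 1.0])
downweighted_loss = weighted_kd_loss(teacher, student, [1.0, 0.1])
```

Down-weighting the mismatched sample shrinks its contribution to the loss, which is the lever any sample-weighting scheme of this kind relies on.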

[LG-30] Persistence of Backdoor-based Watermarks for Neural Networks: A Comprehensive Evaluation

Link: https://arxiv.org/abs/2501.02704
Authors: Anh Tu Ngo,Chuan Song Heng,Nandish Chattopadhyay,Anupam Chattopadhyay
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Preprint. Under Review

Abstract:Deep Neural Networks (DNNs) have gained considerable traction in recent years due to the unparalleled results they have achieved. However, training such sophisticated models is resource intensive, leading many to consider DNNs the intellectual property (IP) of model owners. In this era of cloud computing, high-performance DNNs are often deployed all over the internet so that people can access them publicly. As such, DNN watermarking schemes, especially backdoor-based watermarks, have been actively developed in recent years to preserve proprietary rights. Nonetheless, much uncertainty remains about the robustness of existing backdoor watermark schemes, towards both adversarial attacks and unintended means such as fine-tuning neural network models. One reason for this is that no complete guarantee of robustness can be assured in the context of backdoor-based watermarks. In this paper, we extensively evaluate the persistence of recent backdoor-based watermarks within neural networks under fine-tuning, and we propose a novel data-driven idea to restore the watermark after fine-tuning without exposing the trigger set. Our empirical results show that by solely introducing training data after fine-tuning, the watermark can be restored if model parameters do not shift dramatically during fine-tuning. Depending on the types of trigger samples used, trigger accuracy can be reinstated to up to 100%. Our study further explores how the restoration process works using loss landscape visualization, as well as the idea of introducing training data in the fine-tuning stage to alleviate watermark vanishing.

[LG-31] Exploring a Datasets Statistical Effect Size Impact on Model Performance and Data Sample-Size Sufficiency

Link: https://arxiv.org/abs/2501.02673
Authors: Arya Hatamian,Lionel Levine,Haniyeh Ehsani Oskouie,Majid Sarrafzadeh
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Having a sufficient quantity of quality data is a critical enabler of training effective machine learning models. Being able to effectively determine the adequacy of a dataset prior to training and evaluating a model’s performance would be an essential tool for anyone engaged in experimental design or data collection. However, despite the need for it, the ability to prospectively assess data sufficiency remains an elusive capability. We report here on two experiments undertaken in an attempt to better ascertain whether or not basic descriptive statistical measures can be indicative of how effective a dataset will be at training a resulting model. Leveraging the effect size of our features, this work first explores whether or not a correlation exists between effect size and resulting model performance (theorizing that the magnitude of the distinction between classes could correlate to a classifier’s resulting success). We then explore whether or not the magnitude of the effect size will impact the rate of convergence of our learning rate (theorizing again that a greater effect size may indicate that the model will converge more rapidly, and with a smaller sample size needed). Our results appear to indicate that this is not an effective heuristic for determining adequate sample size or projecting model performance, and therefore that additional work is still needed to better prospectively assess adequacy of data.
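
For readers unfamiliar with effect size, the sketch below computes Cohen's d for one feature across two classes. The abstract does not say which effect-size measure the authors use, so Cohen's d with a pooled standard deviation is an assumption made purely for illustration, as are the function name and the toy values.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical values of a single feature for the two classes.
feature_class_1 = [2.0, 4.0, 6.0]
feature_class_0 = [1.0, 3.0, 5.0]
d = cohens_d(feature_class_1, feature_class_0)
```

In this toy case the class means differ by 1 and the pooled standard deviation is 2, giving d = 0.5.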

[LG-32] Incentive-Compatible Federated Learning with Stackelberg Game Modeling

Link: https://arxiv.org/abs/2501.02662
Authors: Simin Javaherian,Bryce Turney,Li Chen,Nian-Feng Tzeng
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Federated Learning (FL) has gained prominence as a decentralized machine learning paradigm, allowing clients to collaboratively train a global model while preserving data privacy. Despite its potential, FL faces significant challenges in heterogeneous environments, where varying client resources and capabilities can undermine overall system performance. Existing approaches primarily focus on maximizing global model accuracy, often at the cost of unfairness among clients and suboptimal system efficiency, particularly in non-IID (non-Independent and Identically Distributed) settings. In this paper, we introduce FLamma, a novel Federated Learning framework based on an adaptive gamma-based Stackelberg game, designed to address the aforementioned limitations and promote fairness. Our approach allows the server to act as the leader, dynamically adjusting a decay factor while clients, acting as followers, optimally select their number of local epochs to maximize their utility. Over time, the server incrementally balances client influence, initially rewarding higher-contributing clients and gradually leveling their impact, driving the system toward a Stackelberg Equilibrium. Extensive simulations on both IID and non-IID datasets show that our method significantly improves fairness in accuracy distribution without compromising overall model performance or convergence speed, outperforming traditional FL baselines.

[LG-33] A New Interpretation of the Certainty-Equivalence Approach for PAC Reinforcement Learning with a Generative Model

Link: https://arxiv.org/abs/2501.02652
Authors: Shivaram Kalyanakrishnan,Sheel Shah,Santhosh Kumar Guguloth
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 pages, excluding references and appendices. Total of 28 pages

Abstract:Reinforcement learning (RL) enables an agent interacting with an unknown MDP $M$ to optimise its behaviour by observing transitions sampled from $M$. A natural entity that emerges in the agent’s reasoning is $\widehat{M}$, the maximum likelihood estimate of $M$ based on the observed transitions. The well-known certainty-equivalence method (CEM) dictates that the agent update its behaviour to $\widehat{\pi}$, which is an optimal policy for $\widehat{M}$. Not only is CEM intuitive, it has been shown to enjoy minimax-optimal sample complexity in some regions of the parameter space for PAC RL with a generative model \citep{Agarwal2020GenModel}. A seemingly unrelated algorithm is the “trajectory tree method” (TTM) \citep{Kearns+MN:1999}, originally developed for efficient decision-time planning in large POMDPs. This paper presents a theoretical investigation that stems from the surprising finding that CEM may indeed be viewed as an application of TTM. The qualitative benefits of this view are (1) new and simple proofs of sample complexity upper bounds for CEM, in fact under a (2) weaker assumption on the rewards than is prevalent in the current literature. Our analysis applies to both non-stationary and stationary MDPs. Quantitatively, we obtain (3) improvements in the sample-complexity upper bounds for CEM both for non-stationary and stationary MDPs, in the regime that the “mistake probability” $\delta$ is small. Additionally, we show (4) a lower bound on the sample complexity for finite-horizon MDPs, which establishes the minimax-optimality of our upper bound for non-stationary MDPs in the small-$\delta$ regime.
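
The certainty-equivalence recipe described above (estimate the model by maximum likelihood, then plan in the estimate) can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's analysis: it uses a discounted infinite-horizon objective and plain value iteration, the function names are hypothetical, and the toy MDP is invented for the example.

```python
from collections import defaultdict

def estimate_model(samples, n_states, n_actions):
    """Maximum-likelihood transition probabilities and mean rewards from
    (s, a, r, s_next) samples. Assumes every (s, a) pair was sampled at
    least once, as a generative model allows."""
    counts = defaultdict(int)
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in samples:
        counts[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    p_hat = [[[counts[(s, a, t)] / visits[(s, a)] for t in range(n_states)]
              for a in range(n_actions)] for s in range(n_states)]
    r_hat = [[reward_sum[(s, a)] / visits[(s, a)] for a in range(n_actions)]
             for s in range(n_states)]
    return p_hat, r_hat

def plan(p_hat, r_hat, gamma=0.9, iterations=200):
    """Value iteration on the estimated model; returns a greedy policy and values."""
    n_states, n_actions = len(p_hat), len(p_hat[0])

    def q_value(s, a, values):
        return r_hat[s][a] + gamma * sum(p * v for p, v in zip(p_hat[s][a], values))

    values = [0.0] * n_states
    for _ in range(iterations):
        values = [max(q_value(s, a, values) for a in range(n_actions))
                  for s in range(n_states)]
    policy = [max(range(n_actions), key=lambda a, s=s: q_value(s, a, values))
              for s in range(n_states)]
    return policy, values

# Toy deterministic MDP: only action 1 in state 1 yields reward.
samples = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)] * 3
p_hat, r_hat = estimate_model(samples, n_states=2, n_actions=2)
policy, values = plan(p_hat, r_hat)
```

Because the toy transitions are deterministic, the estimated model is exact here; with stochastic transitions the quality of the resulting policy depends on how many samples each state-action pair receives, which is the quantity the sample-complexity bounds control.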

[LG-34] HALO: Hadamard-Assisted Lossless Optimization for Efficient Low-Precision LLM Training and Fine-Tuning

Link: https://arxiv.org/abs/2501.02625
Authors: Saleh Ashkboos,Mahdi Nikdan,Soroush Tabesh,Roberto L. Castro,Torsten Hoefler,Dan Alistarh
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 8 figures

Abstract:Quantized training of Large Language Models (LLMs) remains an open challenge, as maintaining accuracy while performing all matrix multiplications in low precision has proven difficult. This is particularly the case when fine-tuning pre-trained models, which often already have large weight and activation outlier values that render quantized optimization difficult. We present HALO, a novel quantization-aware training approach for Transformers that enables accurate and efficient low-precision training by combining 1) strategic placement of Hadamard rotations in both forward and backward passes, to mitigate outliers during the low-precision computation, 2) FSDP integration for low-precision communication, and 3) high-performance kernel support. Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision. Applied to LLAMA-family models, HALO achieves near-full-precision-equivalent results during fine-tuning on various tasks, while delivering up to 1.31x end-to-end speedup for full fine-tuning on RTX 4090 GPUs. Our method supports both standard and parameter-efficient fine-tuning (PEFT) methods, both backed by efficient kernel implementations. Our results demonstrate the first practical approach to fully quantized LLM fine-tuning that maintains accuracy in FP8 precision, while delivering performance benefits.
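
Why a Hadamard rotation helps with outliers can be seen on a single vector. The sketch below is only a toy illustration with made-up values and hypothetical helper names; it says nothing about where HALO places its rotations or how its kernels work. It shows two generic facts: an orthonormal Hadamard rotation spreads one large entry across all coordinates, and rotating both operands leaves their inner product unchanged.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def rotate(vec):
    """Apply the orthonormal rotation H / sqrt(n) to a vector of power-of-two length."""
    n = len(vec)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h_ij * x for h_ij, x in zip(row, vec)) for row in hadamard(n)]

# Hypothetical activation vector with one large outlier, and a weight vector.
activations = [0.1] * 7 + [8.0]
weights = [0.5, -0.25, 0.75, 1.0, -1.0, 0.25, -0.5, 0.125]

rot_act = rotate(activations)
rot_w = rotate(weights)

max_before = max(abs(x) for x in activations)
max_after = max(abs(x) for x in rot_act)
dot_before = sum(a * w for a, w in zip(activations, weights))
dot_after = sum(a * w for a, w in zip(rot_act, rot_w))
```

The largest magnitude drops after rotation while the inner product is preserved, so the dynamic range a low-precision format must cover shrinks. Whether the resulting quantization error also shrinks depends on the data and the format, which this toy example does not measure.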

[LG-35] Chameleon2++: An Efficient Chameleon2 Clustering with Approximate Nearest Neighbors

Link: https://arxiv.org/abs/2501.02612
Authors: Priyanshu Singh,Kapil Ahuja
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: 29 Pages, 15 Figures, 12 Tables

Abstract:Clustering algorithms are fundamental tools in data analysis, with hierarchical methods being particularly valuable for their flexibility. Chameleon is a widely used hierarchical clustering algorithm that excels at identifying high-quality clusters of arbitrary shapes, sizes, and densities. Chameleon2 is the most recent variant that has demonstrated significant improvements, but suffers from critical failings and there are certain improvements that can be made. The first failure we address is that the complexity of Chameleon2 is claimed to be $O(n^2)$, while we demonstrate that it is actually $O(n^2 \log n)$, with $n$ being the number of data points. Furthermore, we suggest improvements to Chameleon2 that ensure that the complexity remains $O(n^2)$ with minimal to no loss of performance. The second failing of Chameleon2 is that it lacks transparency and it does not provide the fine-tuned algorithm parameters used to obtain the claimed results. We meticulously provide all such parameter values to enhance replicability. The improvement which we make in Chameleon2 is that we replace the exact $k$-NN search with an approximate $k$-NN search. This further reduces the algorithmic complexity down to $O(n \log n)$ without any performance loss. Here, we primarily configure three approximate nearest neighbor search algorithms (Annoy, FLANN and NMSLIB) to align with the overarching Chameleon2 clustering framework. Experimental evaluations on standard benchmark datasets demonstrate that the proposed Chameleon2++ algorithm is more efficient, robust, and computationally optimal.

ACM classes: I.2; F.2

[LG-36] Efficient Graph Condensation via Gaussian Process

Link: https://arxiv.org/abs/2501.02565
Authors: Lin Wang,Qing Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph condensation reduces the size of large graphs while preserving performance, addressing the scalability challenges of Graph Neural Networks caused by computational inefficiencies on large datasets. Existing methods often rely on bi-level optimization, requiring extensive GNN training and limiting their scalability. To address these issues, this paper proposes Graph Condensation via Gaussian Process (GCGP), a novel and computationally efficient approach to graph condensation. GCGP utilizes a Gaussian Process (GP), with the condensed graph serving as observations, to estimate the posterior distribution of predictions. This approach eliminates the need for the iterative and resource-intensive training typically required by GNNs. To enhance the capability of the GCGP in capturing dependencies between function values, we derive a specialized covariance function that incorporates structural information. This covariance function broadens the receptive field of input nodes by local neighborhood aggregation, thereby facilitating the representation of intricate dependencies within the nodes. To address the challenge of optimizing binary structural information in condensed graphs, Concrete random variables are utilized to approximate the binary adjacency matrix in a continuous counterpart. This relaxation process allows the adjacency matrix to be represented in a differentiable form, enabling the application of gradient-based optimization techniques to discrete graph structures. Experimental results show that the proposed GCGP method efficiently condenses large-scale graph data while preserving predictive performance, addressing the scalability and efficiency challenges. The implementation of our method is publicly available at this https URL.
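
The continuous relaxation of a binary adjacency matrix mentioned above can be illustrated with the standard binary Concrete (relaxed Bernoulli) sampler. The sketch below is an assumption-laden toy, not the GCGP code: the function names, the temperature, the symmetric no-self-loop layout, and the logit values are all hypothetical.

```python
import math
import random

def stable_sigmoid(z):
    """Logistic function that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relaxed_adjacency(logits, temperature=0.5, rng=None):
    """Sample a symmetric, continuous adjacency matrix with entries in [0, 1].

    Each edge uses the binary Concrete relaxation:
    sigmoid((logit + logistic_noise) / temperature).
    """
    rng = rng or random.Random(0)
    n = len(logits)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Clamp u away from 0 and 1 so both logarithms are finite.
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            noise = math.log(u) - math.log(1.0 - u)
            value = stable_sigmoid((logits[i][j] + noise) / temperature)
            adj[i][j] = adj[j][i] = value
    return adj

# Hypothetical edge logits for a 3-node condensed graph.
edge_logits = [[0.0, 2.0, -2.0],
               [2.0, 0.0, 0.5],
               [-2.0, 0.5, 0.0]]
soft_adj = relaxed_adjacency(edge_logits, temperature=0.5, rng=random.Random(42))
```

Because each entry is a smooth function of its logit, gradients can flow to the logits during optimization. Lower temperatures push samples toward 0 or 1 at the price of higher-variance gradients.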

[LG-37] Predicting Vulnerability to Malware Using Machine Learning Models: A Study on Microsoft Windows Machines

Link: https://arxiv.org/abs/2501.02493
Authors: Marzieh Esnaashari,Nima Moradi
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:In an era of escalating cyber threats, malware poses significant risks to individuals and organizations, potentially leading to data breaches, system failures, and substantial financial losses. This study addresses the urgent need for effective malware detection strategies by leveraging Machine Learning (ML) techniques on extensive datasets collected from Microsoft Windows Defender. Our research aims to develop an advanced ML model that accurately predicts malware vulnerabilities based on the specific conditions of individual machines. Moving beyond traditional signature-based detection methods, we incorporate historical data and innovative feature engineering to enhance detection capabilities. This study makes several contributions: first, it advances existing malware detection techniques by employing sophisticated ML algorithms; second, it utilizes a large-scale, real-world dataset to ensure the applicability of findings; third, it highlights the importance of feature analysis in identifying key indicators of malware infections; and fourth, it proposes models that can be adapted for enterprise environments, offering a proactive approach to safeguarding extensive networks against emerging threats. We aim to improve cybersecurity resilience, providing critical insights for practitioners in the field and addressing the evolving challenges posed by malware in a digital landscape. Finally, discussions on results, insights, and conclusions are presented.

[LG-38] An Analysis Framework for Understanding Deep Neural Networks Based on Network Dynamics

Link: https://arxiv.org/abs/2501.02436
Authors: Yuchen Lin,Yong Zhang,Sihan Feng,Hong Zhao
Categories: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Machine Learning (stat.ML)
Comments: 12 pages, 10 figures

Abstract:Advancing artificial intelligence demands a deeper understanding of the mechanisms underlying deep learning. Here, we propose a straightforward analysis framework based on the dynamics of learning models. Neurons are categorized into two modes based on whether their transformation functions preserve order. This categorization reveals how deep neural networks (DNNs) maximize information extraction by rationally allocating the proportion of neurons in different modes across deep layers. We further introduce the attraction basins of the training samples in both the sample vector space and the weight vector space to characterize the generalization ability of DNNs. This framework allows us to identify optimal depth and width configurations, providing a unified explanation for fundamental DNN behaviors such as the “flat minima effect,” “grokking,” and double descent phenomena. Our analysis extends to networks with depths up to 100 layers.

[LG-39] Tensor-GaLore: Memory-Efficient Training via Gradient Tensor Decomposition

Link: https://arxiv.org/abs/2501.02379
Authors: Robert Joseph George,David Pitt,Jiawei Zhao,Jean Kossaifi,Cheng Luo,Yuandong Tian,Anima Anandkumar
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present Tensor-GaLore, a novel method for efficient training of neural networks with higher-order tensor weights. Many models, particularly those used in scientific computing, employ tensor-parameterized layers to capture complex, multidimensional relationships. Scaling these methods to high-resolution problems makes memory usage grow intractably, and matrix-based optimization methods lead to suboptimal performance and compression. We propose to work directly in the high-order space of the complex tensor parameter space using a tensor factorization of the gradients during optimization. We showcase its effectiveness on Fourier Neural Operators (FNOs), a class of models crucial for solving partial differential equations (PDEs), and support it with theoretical analysis. Across various PDE tasks like the Navier Stokes and Darcy Flow equations, Tensor-GaLore achieves substantial memory savings, reducing optimizer memory usage by up to 75%. These substantial memory savings across AI for science demonstrate Tensor-GaLore’s potential.

[LG-40] A ghost mechanism: An analytical model of abrupt learning

Link: https://arxiv.org/abs/2501.02378
Authors: Fatih Dinc,Ege Cirakman,Yiqi Jiang,Mert Yuksekgonul,Mark J. Schnitzer,Hidenori Tanaka
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Comments:

Abstract:Abrupt learning is commonly observed in neural networks, where long plateaus in network performance are followed by rapid convergence to a desirable solution. Yet, despite its common occurrence, the complex interplay of task, network architecture, and learning rule has made it difficult to understand the underlying mechanisms. Here, we introduce a minimal dynamical system trained on a delayed-activation task and demonstrate analytically how even a one-dimensional system can exhibit abrupt learning through ghost points rather than bifurcations. Through our toy model, we show that the emergence of a ghost point destabilizes learning dynamics. We identify a critical learning rate that prevents learning through two distinct loss landscape features: a no-learning zone and an oscillatory minimum. Testing these predictions in recurrent neural networks (RNNs), we confirm that ghost points precede abrupt learning and accompany the destabilization of learning. We demonstrate two complementary remedies: lowering the model output confidence prevents the network from getting stuck in no-learning zones, while increasing trainable ranks beyond task requirements (i.e., adding sloppy parameters) provides more stable learning trajectories. Our model reveals a bifurcation-free mechanism for abrupt learning and illustrates the importance of both deliberate uncertainty and redundancy in stabilizing learning dynamics.

[LG-41] BADTV: Unveiling Backdoor Threats in Third-Party Task Vectors

Link: https://arxiv.org/abs/2501.02373
Authors: Chia-Yi Hsu,Yu-Lin Tsai,Yu Zhe,Yan-Lun Chen,Chih-Hsun Lin,Chia-Mu Yu,Yang Zhang,Chun-Ying Huang,Jun Sakuma
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Task arithmetic in large-scale pre-trained models enables flexible adaptation to diverse downstream tasks without extensive re-training. By leveraging task vectors (TVs), users can perform modular updates to pre-trained models through simple arithmetic operations like addition and subtraction. However, this flexibility introduces new security vulnerabilities. In this paper, we identify and evaluate the susceptibility of TVs to backdoor attacks, demonstrating how malicious actors can exploit TVs to compromise model integrity. By developing composite backdoors and eliminating redundant clean tasks, we introduce BadTV, a novel backdoor attack specifically designed to remain effective under task learning, forgetting, and analogies operations. Our extensive experiments reveal that BadTV achieves near-perfect attack success rates across various scenarios, significantly impacting the security of models using task arithmetic. We also explore existing defenses, showing that current methods fail to detect or mitigate BadTV. Our findings highlight the need for robust defense mechanisms to secure TVs in real-world applications, especially as TV services become more popular in machine-learning ecosystems.
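
The task-vector arithmetic mentioned in this abstract can be illustrated with a minimal sketch. It is not the BadTV attack or the authors' code; it only shows the generic idea that a task vector is the difference between fine-tuned and pre-trained weights, and that adding or subtracting scaled task vectors corresponds to "learning" or "forgetting" a task. The function names, the dict-of-lists weight layout, and the toy numbers are all assumptions made for illustration.

```python
# Hypothetical toy illustration of task arithmetic; weights are plain dicts of float lists.

def task_vector(pretrained, finetuned):
    """Task vector = fine-tuned weights minus pre-trained weights, per parameter."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def apply_task_vectors(pretrained, vectors, coeffs):
    """Return pretrained + sum_i coeffs[i] * vectors[i].

    A positive coefficient adds a task ("learning"); a negative one
    subtracts it ("forgetting").
    """
    merged = {name: list(values) for name, values in pretrained.items()}
    for vector, coeff in zip(vectors, coeffs):
        for name in merged:
            merged[name] = [m + coeff * t for m, t in zip(merged[name], vector[name])]
    return merged


# Toy weights (made-up numbers, not from the paper).
pretrained = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}
finetuned_b = {"w": [1.0, 3.0]}

tv_a = task_vector(pretrained, finetuned_a)
tv_b = task_vector(pretrained, finetuned_b)
both_tasks = apply_task_vectors(pretrained, [tv_a, tv_b], [1.0, 1.0])
forget_a = apply_task_vectors(pretrained, [tv_a], [-1.0])
```

Because a merged model is just a sum of downloaded vectors, a single third-party task vector influences every parameter it touches, which is the attack surface the abstract describes.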

[LG-42] Predicting two-dimensional spatiotemporal chaotic patterns with optimized high-dimensional hybrid reservoir computing

Link: https://arxiv.org/abs/2501.02369
Authors: Tamon Nakano,Sebastian Baur,Christoph Räth
Categories: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)

Abstract:As an alternative approach for predicting complex dynamical systems where physics-based models are no longer reliable, reservoir computing (RC) has gained popularity. The hybrid approach is considered an interesting option for improving the prediction performance of RC. The idea is to add a knowledge-based model (KBM) to support the fully data-driven RC prediction. There are three types of hybridization for RC, namely full hybrid (FH), input hybrid (IH) and output hybrid (OH), of which the last was shown to be superior in terms of accuracy and robustness for the prediction of low-dimensional chaotic systems. Here, we extend the formalism to the prediction of spatiotemporal patterns in two dimensions. To overcome the curse of dimensionality for this very high-dimensional case we employ the local states ansatz, where only a few locally adjacent time series are utilized for the RC-based prediction. Using simulation data from the Barkley model describing chaotic electrical wave propagation in cardiac tissue, we outline the formalism of high-dimensional hybrid RC and assess the performance of the different hybridization schemes. We find that all three methods (FH, IH and OH) perform better than reservoir only, where improvements are small when the model is very inaccurate. For small model errors and small reservoirs FH and OH perform nearly equally well and better than IH. Given the smaller CPU needs for OH and especially its better interpretability, OH is to be favored. For large reservoirs the performance of OH drops below that of FH and IH. Generally, it may be advisable to test the three setups for a given application and select the one that best balances the counteracting factors of prediction performance and CPU needs.

[LG-43] Easing Optimization Paths: a Circuit Perspective ICASSP2025

Link: https://arxiv.org/abs/2501.02362
Authors: Ambroise Odonnat,Wassim Bouaziz,Vivien Cabannes
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Accepted at ICASSP 2025

Abstract:Gradient descent is the method of choice for training large artificial intelligence systems. As these systems become larger, a better understanding of the mechanisms behind gradient training would allow us to alleviate compute costs and help steer these systems away from harmful behaviors. To that end, we suggest utilizing the circuit perspective brought forward by mechanistic interpretability. After laying out our intuition, we illustrate how it enables us to design a curriculum for efficient learning in a controlled setting. The code is available at this https URL.

[LG-44] When is the Computation of a Feature Attribution Method Tractable?

Link: https://arxiv.org/abs/2501.02356
Authors: P. Barceló,R. Cominetti,M. Morgado
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 8 pages in row format

Abstract:Feature attribution methods have become essential for explaining machine learning models. Many popular approaches, such as SHAP and Banzhaf values, are grounded in power indices from cooperative game theory, which measure the contribution of features to model predictions. This work studies the computational complexity of power indices beyond SHAP, addressing the conditions under which they can be computed efficiently. We identify a simple condition on power indices that ensures that computation is polynomially equivalent to evaluating expected values, extending known results for SHAP. We also introduce Bernoulli power indices, showing that their computation can be simplified to a constant number of expected value evaluations. Furthermore, we explore interaction power indices that quantify the importance of feature subsets, proving that their computational complexity mirrors that of individual features.
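
To make the two power indices named in this abstract concrete, the following is a brute-force sketch that computes Shapley and Banzhaf values for a small cooperative game. It enumerates every coalition, so it is exponential in the number of players and says nothing about the tractability results of the paper. The function name `power_indices` and the toy game are assumptions for illustration; in feature attribution, the value of a coalition would typically be an expected model output with those features fixed.

```python
from itertools import combinations
from math import factorial


def power_indices(players, value):
    """Brute-force Shapley and Banzhaf values.

    players: list of hashable player (feature) names.
    value:   function mapping a frozenset of players to a number.
    """
    n = len(players)
    shapley, banzhaf = {}, {}
    for i in players:
        others = [p for p in players if p != i]
        s_total, b_total = 0.0, 0.0
        for size in range(n):
            # Shapley weight depends only on the coalition size.
            s_weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                base = frozenset(coalition)
                marginal = value(base | {i}) - value(base)
                s_total += s_weight * marginal
                # Banzhaf weights every coalition of the others equally.
                b_total += marginal / 2 ** (n - 1)
        shapley[i] = s_total
        banzhaf[i] = b_total
    return shapley, banzhaf


# Toy game (hypothetical): payoff 1 only when both "a" and "b" are present;
# "c" never changes the payoff.
def toy_game(coalition):
    return 1.0 if {"a", "b"} <= coalition else 0.0


shapley, banzhaf = power_indices(["a", "b", "c"], toy_game)
```

In this toy game both indices assign 0.5 to `a` and `b` and 0 to the dummy player `c`; on less symmetric games the two indices generally differ because they weight coalitions differently.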

[LG-45] Reweighting Improves Conditional Risk Bounds

Link: https://arxiv.org/abs/2501.02353
Authors: Yikai Zhang,Jiahe Lin,Fengpei Li,Songzhu Zheng,Anant Raj,Anderson Schneider,Yuriy Nevmyvaka
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 33 pages

Abstract:In this work, we study the weighted empirical risk minimization (weighted ERM) schema, in which an additional data-dependent weight function is incorporated when the empirical risk function is being minimized. We show that under a general "balanceable" Bernstein condition, one can design a weighted ERM estimator to achieve superior performance in certain sub-regions over the one obtained from standard ERM, and the superiority manifests itself through a data-dependent constant term in the error bound. These sub-regions correspond to large-margin ones in classification settings and low-variance ones in heteroscedastic regression settings, respectively. Our findings are supported by evidence from synthetic data experiments.

[LG-46] Digital Twin Calibration with Model-Based Reinforcement Learning

Link: https://arxiv.org/abs/2501.02205
Authors: Hua Zheng,Wei Xie,Ilya O. Ryzhov,Keilung Choy
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 6 figures

Abstract:This paper presents a novel methodological framework, called the Actor-Simulator, that incorporates the calibration of digital twins into model-based reinforcement learning for more effective control of stochastic systems with complex nonlinear dynamics. Traditional model-based control often relies on restrictive structural assumptions (such as linear state transitions) and fails to account for parameter uncertainty in the model. These issues become particularly critical in industries such as biopharmaceutical manufacturing, where process dynamics are complex and not fully known, and only a limited amount of data is available. Our approach jointly calibrates the digital twin and searches for an optimal control policy, thus accounting for and reducing model error. We balance exploration and exploitation by using policy performance as a guide for data collection. This dual-component approach provably converges to the optimal policy, and outperforms existing methods in extensive numerical experiments based on the biopharmaceutical manufacturing domain.

[LG-47] On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing

Link: https://arxiv.org/abs/2501.02191
Authors: Jianwei Wang,Kai Wang,Ying Zhang,Wenjie Zhang,Xiwei Xu,Xuemin Lin
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Missing data imputation, which aims to impute the missing values in the raw datasets to achieve the completeness of datasets, is crucial for modern data-driven models like large language models (LLMs) and has attracted increasing interest over the past decades. Despite its importance, existing solutions for missing data imputation either 1) only support numerical and categorical data or 2) show unsatisfactory performance because their design prioritizes text data and lacks key properties for tabular data imputation. In this paper, we propose UnIMP, a Unified IMPutation framework that leverages an LLM and high-order message passing to enhance the imputation of mixed-type data including numerical, categorical, and text data. Specifically, we first introduce a cell-oriented hypergraph to model the table. We then propose BiHMP, an efficient Bidirectional High-order Message-Passing network to aggregate global-local information and high-order relationships on the constructed hypergraph while capturing the inter-column heterogeneity and intra-column homogeneity. To effectively and efficiently align the capacity of the LLM with the information aggregated by BiHMP, we introduce Xfusion, which, together with BiHMP, acts as an adapter for the LLM. We follow a pre-training and fine-tuning pipeline to train UnIMP, integrating two optimizations: a chunking technique, which divides tables into smaller chunks to enhance efficiency; and a progressive masking technique, which gradually adapts the model to learn more complex data patterns. Both theoretical proofs and empirical experiments on 10 real-world datasets highlight the superiority of UnIMP over existing techniques.

[LG-48] SMDP-Based Dynamic Batching for Improving Responsiveness and Energy Efficiency of Batch Services

Link: https://arxiv.org/abs/2501.02181
Authors: Yaodan Xu,Sheng Zhou,Zhisheng Niu
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted by IEEE Transactions on Parallel and Distributed Systems (TPDS)

Abstract:For servers incorporating parallel computing resources, batching is a pivotal technique for providing efficient and economical services at scale. Parallel computing resources exhibit heightened computational and energy efficiency when operating with larger batch sizes. However, in the realm of online services, the adoption of a larger batch size may lead to longer response times. This paper aims to provide a dynamic batching scheme that delicately balances latency and efficiency. The system is modeled as a batch service queue with size-dependent service times. Then, the design of dynamic batching is formulated as a semi-Markov decision process (SMDP) problem, with the objective of minimizing the weighted sum of average response time and average power consumption. A method is proposed to derive an approximate optimal SMDP solution, representing the chosen dynamic batching policy. By introducing an abstract cost to reflect the impact of “tail” states, the space complexity and the time complexity of the procedure can decrease by 63.5% and 98%, respectively. Numerical results showcase the superiority of SMDP-based batching policies across various parameter setups. Additionally, the proposed scheme exhibits noteworthy flexibility in balancing power consumption and latency.

[LG-49] The Efficiency vs. Accuracy Trade-off: Optimizing RAG-Enhanced LLM Recommender Systems Using Multi-Head Early Exit

Link: https://arxiv.org/abs/2501.02173
Authors: Huixue Zhou,Hengrui Gu,Xi Liu,Kaixiong Zhou,Mingfu Liang,Yongkang Xiao,Srinivas Govindan,Piyush Chawla,Jiyan Yang,Xiangfei Meng,Huayu Li,Buyun Zhang,Liang Luo,Wen-Yen Chen,Yiping Han,Bo Long,Rui Zhang,Tianlong Chen
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The deployment of Large Language Models (LLMs) in recommender systems for predicting Click-Through Rates (CTR) necessitates a delicate balance between computational efficiency and predictive accuracy. This paper presents an optimization framework that combines Retrieval-Augmented Generation (RAG) with an innovative multi-head early exit architecture to concurrently enhance both aspects. By integrating Graph Convolutional Networks (GCNs) as efficient retrieval mechanisms, we are able to significantly reduce data retrieval times while maintaining high model performance. The early exit strategy employed allows for dynamic termination of model inference, utilizing real-time predictive confidence assessments across multiple heads. This not only quickens the responsiveness of LLMs but also upholds or improves their accuracy, making it ideal for real-time application scenarios. Our experiments demonstrate how this architecture effectively decreases computation time without sacrificing the accuracy needed for reliable recommendation delivery, establishing a new standard for efficient, real-time LLM deployment in commercial systems.
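
The confidence-based early exit described in this abstract can be sketched generically as follows. This is not the paper's architecture: the heads here are stand-in callables that each return a click probability, the confidence measure `max(p, 1 - p)` and the threshold value are assumptions, and a real system would compute deeper layers lazily so that exiting early actually saves work.

```python
def early_exit_predict(heads, features, threshold=0.9):
    """Query prediction heads in depth order and stop at the first confident one.

    heads:     list of callables, each mapping features to P(click) in [0, 1].
               They stand in for heads attached to successive model layers.
    threshold: hypothetical confidence level required to stop early.
    Returns (probability, number_of_heads_evaluated).
    """
    if not heads:
        raise ValueError("at least one head is required")
    for depth, head in enumerate(heads, start=1):
        p = head(features)
        confidence = max(p, 1.0 - p)  # confident "click" or confident "no click"
        if confidence >= threshold:
            break
    # If no head was confident enough, the deepest head's answer is returned.
    return p, depth


# Toy heads with fixed outputs (made-up numbers for illustration only).
toy_heads = [lambda x: 0.60, lambda x: 0.95, lambda x: 0.99]
prob, used = early_exit_predict(toy_heads, features=None, threshold=0.9)
```

With these toy heads the loop stops at the second head; raising the threshold pushes more inputs through to the deepest head, which is the efficiency versus accuracy dial the title refers to.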

[LG-50] Exploring Secure Machine Learning Through Payload Injection and FGSM Attacks on ResNet-50

Link: https://arxiv.org/abs/2501.02147
Authors: Umesh Yadav,Suman Niraula,Gaurav Kumar Gupta,Bicky Yadav
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:This paper investigates the resilience of a ResNet-50 image classification model under two prominent security threats: Fast Gradient Sign Method (FGSM) adversarial attacks and malicious payload injection. Initially, the model attains a 53.33% accuracy on clean images. When subjected to FGSM perturbations, its overall accuracy remains unchanged; however, the model’s confidence in incorrect predictions notably increases. Concurrently, a payload injection scheme is successfully executed in 93.33% of the tested samples, revealing how stealthy attacks can manipulate model predictions without degrading visual quality. These findings underscore the vulnerability of even high-performing neural networks and highlight the urgency of developing more robust defense mechanisms for security-critical applications.
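
For readers unfamiliar with FGSM, the attack perturbs an input by a small step `eps` in the direction of the sign of the loss gradient with respect to that input. The sketch below applies it to a tiny logistic-regression classifier rather than ResNet-50, so that the gradient can be written by hand; the weights, input, and `eps` are made-up values, and the helper names are assumptions for illustration.

```python
import math


def predict(x, w, b):
    """P(y = 1 | x) for a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(dLoss/dx).

    For the cross-entropy loss of logistic regression,
    dLoss/dx = (p - y) * w, where p = predict(x, w, b).
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Toy model and input (hypothetical numbers).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm(x, y, w, b, eps=0.1)
p_clean = predict(x, w, b)
p_adv = predict(x_adv, w, b)
```

Here each coordinate moves by at most `eps`, yet the model's probability for the true class drops from `p_clean` to `p_adv`; on deep image classifiers the same step is computed with automatic differentiation instead of a hand-written gradient.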

[LG-51] How Your Location Relates to Health: Variable Importance and Interpretable Machine Learning for Environmental and Sociodemographic Data AAAI

Link: https://arxiv.org/abs/2501.02111
Authors: Ishaan Maitra,Raymond Lin,Eric Chen,Jon Donnelly,Sanja Šćepanović,Cynthia Rudin
Categories: Machine Learning (cs.LG)
Comments: AAAI

Abstract:Health outcomes depend on complex environmental and sociodemographic factors whose effects change over location and time. Only recently has fine-grained spatial and temporal data become available to study these effects, namely the MEDSAT dataset of English health, environmental, and sociodemographic information. Leveraging this new resource, we use a variety of variable importance techniques to robustly identify the most informative predictors across multiple health outcomes. We then develop an interpretable machine learning framework based on Generalized Additive Models (GAMs) and Multiscale Geographically Weighted Regression (MGWR) to analyze both local and global spatial dependencies of each variable on various health outcomes. Our findings identify NO2 as a global predictor for asthma, hypertension, and anxiety, alongside other outcome-specific predictors related to occupation, marriage, and vegetation. Regional analyses reveal local variations with air pollution and solar radiation, with notable shifts during COVID. This comprehensive approach provides actionable insights for addressing health disparities, and advocates for the integration of interpretable machine learning in public health.

[LG-52] Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning

Link: https://arxiv.org/abs/2501.02087
Authors: Mehrdad Moghimi,Hyejin Ku
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.
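
The relationship between CVaR and the broader class of spectral risk measures mentioned in this abstract can be shown on an empirical sample of returns. The sketch below is not the paper's algorithm: it only evaluates the risk measures on a fixed list of returns, using the convention that higher returns are better, so risk-averse spectra put more weight on the lowest outcomes. The function names and sample values are assumptions for illustration.

```python
def cvar(returns, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of returns.

    alpha * len(returns) is rounded to the nearest whole number of samples.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    k = max(1, round(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def spectral_risk(returns, weights):
    """Empirical spectral risk measure: weighted sum of sorted returns.

    weights[i] applies to the i-th smallest return. For a risk-averse
    spectrum the weights are non-negative, non-increasing, and sum to 1.
    """
    ordered = sorted(returns)
    if len(weights) != len(ordered):
        raise ValueError("need one weight per return")
    if any(wt < 0 for wt in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    return sum(wt * r for wt, r in zip(weights, ordered))


sample = [4.0, 1.0, 3.0, 2.0]  # made-up returns
cvar_half = cvar(sample, 0.5)
srm_as_cvar = spectral_risk(sample, [0.5, 0.5, 0.0, 0.0])  # CVaR as a special case
srm_neutral = spectral_risk(sample, [0.25] * 4)             # plain mean
srm_graded = spectral_risk(sample, [0.4, 0.3, 0.2, 0.1])    # smoother risk aversion
```

CVaR corresponds to a flat spectrum over the worst outcomes and zero elsewhere, while the uniform spectrum recovers the risk-neutral mean; intermediate spectra such as the graded one are what a spectral objective can express and a single CVaR level cannot.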

[LG-53] Counterfactual Explanation for Auto-Encoder Based Time-Series Anomaly Detection

Link: https://arxiv.org/abs/2501.02069
Authors: Abhishek Srinivasan,Varun Singapuri Ravi,Juan Carlos Andresen,Anders Holst
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 6 tables, conference proceeding

Abstract:The complexity of modern electro-mechanical systems requires the development of sophisticated diagnostic methods, such as anomaly detection, capable of detecting deviations. Conventional anomaly detection approaches like signal processing and statistical modelling often struggle to effectively handle the intricacies of complex systems, particularly when dealing with multi-variate signals. In contrast, neural network-based anomaly detection methods, especially Auto-Encoders, have emerged as a compelling alternative, demonstrating remarkable performance. However, Auto-Encoders exhibit inherent opaqueness in their decision-making processes, hindering their practical implementation at scale. Addressing this opacity is essential for enhancing the interpretability and trustworthiness of anomaly detection models. In this work, we address this challenge by employing a feature selector to select features and counterfactual explanations to give context to the model output. We tested this approach on the SKAB benchmark dataset and an industrial time-series dataset. The gradient-based counterfactual explanation approach was evaluated via validity, sparsity and distance measures. Our experimental findings illustrate that our proposed counterfactual approach can offer meaningful and valuable insights into the model decision-making process, by explaining fewer signals compared to conventional approaches. These insights enhance the trustworthiness and interpretability of anomaly detection models.

[LG-54] Active Learning Enables Extrapolation in Molecular Generative Models

Link: https://arxiv.org/abs/2501.02059
Authors: Evan R. Antoniuk,Peggy Li,Nathan Keilbart,Stephen Weitzner,Bhavya Kailkhura,Anna M. Hiszpanski
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)

Abstract:Although generative models hold promise for discovering molecules with optimized desired properties, they often fail to suggest synthesizable molecules that improve upon the known molecules seen in training. We find that a key limitation is not in the molecule generation process itself, but in the poor generalization capabilities of molecular property predictors. We tackle this challenge by creating an active-learning, closed-loop molecule generation pipeline, whereby molecular generative models are iteratively refined on feedback from quantum chemical simulations to improve generalization to new chemical space. Compared against other generative model approaches, only our active learning approach generates molecules with properties that extrapolate beyond the training data (reaching up to 0.44 standard deviations beyond the training data range) and out-of-distribution molecule classification accuracy is improved by 79%. By conditioning molecular generation on thermodynamic stability data from the active-learning loop, the proportion of stable molecules generated is 3.5x higher than the next-best model.

[LG-55] Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI

Link: https://arxiv.org/abs/2501.02042
Authors: Christopher Burger,Charles Walter,Thai Le,Lingwei Chen
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 12 pages, 1 figure, 4 tables. arXiv admin note: substantial text overlap with arXiv:2406.15839, arXiv:2501.01516

Abstract:Recent work has investigated the concept of adversarial attacks on explainable AI (XAI) in the NLP domain with a focus on examining the vulnerability of local surrogate methods such as LIME to adversarial perturbations or small changes on the input of a machine learning (ML) model. In such attacks, the generated explanation is manipulated while the meaning and structure of the original input remain similar under the ML model. Such attacks are especially alarming when XAI is used as a basis for decision making (e.g., prescribing drugs based on AI medical predictors) or for legal action (e.g., legal dispute involving AI software). Although weaknesses across many XAI methods have been shown to exist, the reasons why remain little explored. Central to this XAI manipulation is the similarity measure used to calculate how one explanation differs from another. A poor choice of similarity measure can lead to erroneous conclusions about the stability or adversarial robustness of an XAI method. Therefore, this work investigates a variety of similarity measures designed for text-based ranked lists referenced in related work to determine their comparative suitability for use. We find that many measures are overly sensitive, resulting in erroneous estimates of stability. We then propose a weighting scheme for text-based data that incorporates the synonymity between the features within an explanation, providing more accurate estimates of the actual weakness of XAI methods to adversarial examples.

[LG-56] Information Subtraction: Learning Representations for Conditional Entropy

Link: https://arxiv.org/abs/2501.02012
Authors: Keng Hou Leong,Yuxuan Xiu,Wai Kin(Victor)Chan
Categories: Machine Learning (cs.LG)

Abstract:The representations of conditional entropy and conditional mutual information are significant in explaining the unique effects among variables. While previous studies based on conditional contrastive sampling have effectively removed information regarding discrete sensitive variables, they have not yet extended their scope to continuous cases. This paper introduces Information Subtraction, a framework designed to generate representations that preserve desired information while eliminating the undesired. We implement a generative-based architecture that outputs these representations by simultaneously maximizing an information term and minimizing another. With its flexibility in disentangling information, we can iteratively apply Information Subtraction to represent arbitrary information components between continuous variables, thereby explaining the various relationships that exist between them. Our results highlight the representations’ ability to provide semantic features of conditional entropy. By subtracting sensitive and domain-specific information, our framework demonstrates effective performance in fair learning and domain generalization. The code for this paper is available at this https URL

[LG-57] Explainable Neural Networks with Guarantees: A Sparse Estimation Approach

Link: https://arxiv.org/abs/2501.02010
Authors: Antoine Ledent,Peng Liu
Categories: Machine Learning (cs.LG)

Abstract:Balancing predictive power and interpretability has long been a challenging research area, particularly in powerful yet complex models like neural networks, where nonlinearity obstructs direct interpretation. This paper introduces a novel approach to constructing an explainable neural network that harmonizes predictiveness and explainability. Our model, termed SparXnet, is designed as a linear combination of a sparse set of jointly learned features, each derived from a different trainable function applied to a single 1-dimensional input feature. Leveraging the ability to learn arbitrarily complex relationships, our neural network architecture enables automatic selection of a sparse set of important features, with the final prediction being a linear combination of rescaled versions of these features. We demonstrate the ability to select significant features while maintaining comparable predictive performance and direct interpretability through extensive experiments on synthetic and real-world datasets. We also provide theoretical analysis on the generalization bounds of our framework, which is favorably linear in the number of selected features and only logarithmic in the number of input features. We further lift any dependence of sample complexity on the number of parameters or the architectural details under very mild conditions. Our research paves the way for further research on sparse and explainable neural networks with guarantees.

[LG-58] HMM-LSTM Fusion Model for Economic Forecasting

Link: https://arxiv.org/abs/2501.02002
Authors: Guhan Sivakumar
Categories: Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
Comments: 33 pages, 18 figures

Abstract:This paper explores the application of Hidden Markov Models (HMM) and Long Short-Term Memory (LSTM) neural networks for economic forecasting, focusing on predicting CPI inflation rates. The study explores a new approach that integrates HMM-derived hidden states and means as additional features for LSTM modeling, aiming to enhance the interpretability and predictive performance of the models. The research begins with data collection and preprocessing, followed by the implementation of the HMM to identify hidden states representing distinct economic conditions. Subsequently, LSTM models are trained using the original and augmented data sets, allowing for comparative analysis and evaluation. The results demonstrate that incorporating HMM-derived data improves the predictive accuracy of LSTM models, particularly in capturing complex temporal patterns and mitigating the impact of volatile economic conditions. Additionally, the paper discusses the implementation of Integrated Gradients for model interpretability and provides insights into the economic dynamics reflected in the forecasting outcomes.

[LG-59] Communication Efficient Cooperative Edge AI via Event-Triggered Computation Offloading

Link: https://arxiv.org/abs/2501.02001
Authors: You Zhou,Changsheng You,Kaibin Huang
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 13 pages, 11 figures

Abstract:Rare events, despite their infrequency, often carry critical information and require immediate attention in mission-critical applications such as autonomous driving, healthcare, and industrial automation. The data-intensive nature of these tasks and their need for prompt responses, combined with designing edge AI (or edge inference), pose significant challenges in systems and techniques. Existing edge inference approaches often suffer from communication bottlenecks due to high-dimensional data transmission and fail to provide timely responses to rare events, limiting their effectiveness for mission-critical applications in the sixth-generation (6G) mobile networks. To overcome these challenges, we propose a channel-adaptive, event-triggered edge-inference framework that prioritizes efficient rare-event processing. Central to this framework is a dual-threshold, multi-exit architecture, which enables early local inference for rare events detected locally while offloading more complex rare events to edge servers for detailed classification. To further enhance the system’s performance, we developed a channel-adaptive offloading policy paired with an online algorithm to dynamically determine the optimal confidence thresholds for controlling offloading decisions. The associated optimization problem is solved by reformulating the original non-convex function into an equivalent strongly convex one. Using deep neural network classifiers and real medical datasets, our experiments demonstrate that the proposed framework not only achieves superior rare-event classification accuracy, but also effectively reduces communication overhead, as opposed to existing edge-inference approaches.

[LG-60] Towards Sustainable Large Language Model Serving

Link: https://arxiv.org/abs/2501.01990
Authors: Sophia Nguyen,Beihao Zhou,Yi Ding,Sihang Liu
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:In this work, we study LLMs from a carbon emission perspective, addressing both operational and embodied emissions, and paving the way for sustainable LLM serving. We characterize the performance and energy of LLaMA with 1B, 3B, and 7B parameters using two Nvidia GPU types, a latest-generation RTX6000 Ada and an older-generation T4. We analytically model operational carbon emissions based on energy consumption and carbon intensities from three grid regions – each representing a different energy source mix, and embodied carbon emissions based on chip area and memory size. Our characterization and modeling provide us with an in-depth understanding of the performance, energy, and carbon emissions of LLM serving. Our findings highlight the potential for optimizing sustainable LLM serving systems by considering both operational and embodied carbon emissions simultaneously.

[LG-61] Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining

Link: https://arxiv.org/abs/2501.03184
Authors: Holger Severin Bovbjerg(1),Jan Østergaard(1),Jesper Jensen(1 and 2),Zheng-Hua Tan(1) ((1) Aalborg University, (2) Oticon A/S)
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing for possible publication. 12 pages, 4 figures, 5 tables

Abstract:Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target-speaker in an audio frame. Recently, deep neural network-based models have shown good performance in this task. However, training these models requires extensive labelled data, which is costly and time-consuming to obtain, particularly if generalization to unseen environments is crucial. To mitigate this, we propose a causal, Self-Supervised Learning (SSL) pretraining framework, called Denoising Autoregressive Predictive Coding (DN-APC), to enhance TS-VAD performance in noisy conditions. We also explore various speaker conditioning methods and evaluate their performance under different noisy conditions. Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approximately 2% in both seen and unseen noise. Additionally, we find that FiLM conditioning provides the best overall performance. Representation analysis via t-SNE plots reveals robust initial representations of speech and non-speech from pretraining. This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.
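
FiLM (feature-wise linear modulation) conditioning, which this abstract reports as the best-performing speaker conditioning method, scales and shifts each feature channel using parameters predicted from a conditioning vector. The sketch below shows that operation only; it is not the paper's model. The linear layers that map the speaker embedding to `gamma` and `beta`, their shapes, and all numbers are assumptions for illustration.

```python
def linear(weights, bias, vector):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of one frame's features.

    gamma and beta are predicted from the speaker embedding, then applied
    per channel: out[c] = gamma[c] * features[c] + beta[c].
    """
    gamma = linear(w_gamma, b_gamma, speaker_embedding)
    beta = linear(w_beta, b_beta, speaker_embedding)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Toy values (hypothetical): 2 feature channels, 2-dimensional speaker embedding.
frame_features = [1.0, 2.0]
speaker = [1.0, 0.0]
modulated = film(
    frame_features,
    speaker,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0],
)
```

Because the same frame features are modulated differently for each speaker embedding, the downstream detector can respond to the target speaker's speech while treating other speech as non-target.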

[LG-62] Slim multi-scale convolutional autoencoder-based reduced-order models for interpretable features of a complex dynamical system

Link: https://arxiv.org/abs/2501.03070
Authors: Philipp Teutsch,Philipp Pfeffer,Mohammad Sharifi Ghazijahani,Christian Cierpka,Jörg Schumacher,Patrick Mäder
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, data-driven deep learning models have gained significant interest in the analysis of turbulent dynamical systems. Within the context of reduced-order models (ROMs), convolutional autoencoders (CAEs) pose a universally applicable alternative to conventional approaches. They can learn nonlinear transformations directly from data, without prior knowledge of the system. However, the features generated by such models lack interpretability. Thus, the resulting model is a black box that effectively reduces the complexity of the system but does not provide insights into the meaning of the latent features. To address this critical issue, we introduce a novel interpretable CAE approach for high-dimensional fluid flow data that maintains the reconstruction quality of conventional CAEs and allows for feature interpretation. Our method can be easily integrated into any existing CAE architecture with minor modifications of the training process. We compare our approach to Proper Orthogonal Decomposition (POD) and two existing methods for interpretable CAEs. We apply all methods to three different experimental turbulent Rayleigh-Bénard convection datasets with varying complexity. Our results show that the proposed method is lightweight, easy to train, and achieves relative reconstruction performance improvements of up to 6.4% over POD for 64 modes. The relative improvement increases to up to 229.8% as the number of modes decreases. Additionally, our method delivers interpretable features similar to those of POD and is significantly less resource-intensive than existing CAE approaches, using less than 2% of the parameters. These existing approaches either trade interpretability for reconstruction performance or provide interpretability only to a limited extent.

[LG-63] Group Shapley with Robust Significance Testing and Its Application to Bond Recovery Rate Prediction

Link: https://arxiv.org/abs/2501.03041
Authors: Jingyi Wang,Ying Chen,Paolo Giudici
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We propose Group Shapley, a metric that extends the classical individual-level Shapley value framework to evaluate the importance of feature groups, addressing the structured nature of predictors commonly found in business and economic data. More importantly, we develop a significance testing procedure based on a three-cumulant chi-square approximation and establish the asymptotic properties of the test statistics for Group Shapley values. Our approach can effectively handle challenging scenarios, including sparse or skewed distributions and small sample sizes, outperforming alternative tests such as the Wald test. Simulations confirm that the proposed test maintains robust empirical size and demonstrates enhanced power under diverse conditions. To illustrate the method’s practical relevance in advancing Explainable AI, we apply our framework to bond recovery rate predictions using a global dataset (1996-2023) comprising 2,094 observations and 98 features, grouped into 16 subgroups and five broader categories: bond characteristics, firm fundamentals, industry-specific factors, market-related variables, and macroeconomic indicators. Our results identify the market-related variables group as the most influential. Furthermore, Lorenz curves and Gini indices reveal that Group Shapley assigns feature importance more equitably compared to individual Shapley values.
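To make the idea of group-level attribution concrete, the sketch below computes exact Shapley values when whole feature groups, rather than individual features, act as the players. It is a generic illustration under stated assumptions: the `value` function (for instance, model performance when only the given groups are available) and the toy contributions are hypothetical, and the paper's exact definition and its significance test are not reproduced here.

```python
from itertools import combinations
from math import factorial


def group_shapley(groups, value):
    """Exact Shapley value of each group, treating groups as players.

    groups: list of hashable group names.
    value:  function mapping a frozenset of group names to a float
            (a hypothetical payoff, e.g. predictive performance).
    """
    n = len(groups)
    result = {}
    for g in groups:
        others = [h for h in groups if h != g]
        total = 0.0
        for r in range(n):
            # Shapley weight for coalitions of size r that exclude g.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight * (value(s | {g}) - value(s))
        result[g] = total
    return result


# Toy additive payoff: each group contributes a fixed amount (made-up numbers).
contrib = {"market": 3.0, "firm": 1.0, "macro": 2.0}
payoff = lambda s: sum(contrib[g] for g in s)
print(group_shapley(list(contrib), payoff))
```

With an additive payoff each group recovers its own contribution, which is a quick sanity check. Exact enumeration costs on the order of 2^n payoff evaluations, so it is practical only for a small number of groups, such as the 16 subgroups or five categories mentioned in the abstract.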

[LG-64] NeuroPMD: Neural Fields for Density Estimation on Product Manifolds

Link: https://arxiv.org/abs/2501.02994
Authors: William Consagra,Zhiling Gu,Zhengwu Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We propose a novel deep neural network methodology for density estimation on product Riemannian manifold domains. In our approach, the network directly parameterizes the unknown density function and is trained using a penalized maximum likelihood framework, with a penalty term formed using manifold differential operators. The network architecture and estimation algorithm are carefully designed to handle the challenges of high-dimensional product manifold domains, effectively mitigating the curse of dimensionality that limits traditional kernel and basis expansion estimators, as well as overcoming the convergence issues encountered by non-specialized neural network methods. Extensive simulations and a real-world application to brain structural connectivity data highlight the clear advantages of our method over the competing alternatives.

[LG-65] Classifier Weighted Mixture models

Link: https://arxiv.org/abs/2501.02989
Authors: Elouan Argouarc’h,François Desbouvries,Eric Barat,Eiji Kawasaki,Thomas Dautremer
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:This paper proposes an extension of standard mixture stochastic models, by replacing the constant mixture weights with functional weights defined using a classifier. Classifier Weighted Mixtures enable straightforward density evaluation, explicit sampling, and enhanced expressivity in variational estimation problems, without increasing the number of components nor the complexity of the mixture components.

[LG-66] A Point Process Model for Optimizing Repeated Personalized Action Delivery to Users

Link: https://arxiv.org/abs/2501.02961
Authors: Alexander Merkov,David Rohde
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:This paper provides a formalism for an important class of causal inference problems inspired by user-advertiser interaction in online advertising. The formalism is then specialized to an extension of temporal marked point processes, and neural point processes are suggested as practical solutions to some interesting special cases.

[LG-67] Improved Approximation Algorithms for Low-Rank Problems Using Semidefinite Optimization

Link: https://arxiv.org/abs/2501.02942
Authors: Ryan Cory-Wright,Jean Pauphilet
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
Comments: 30 pages, 5 figures, plus references and appendices

Abstract:Inspired by the impact of the Goemans-Williamson algorithm on combinatorial optimization, we construct an analogous relax-then-sample strategy for low-rank optimization problems. First, for orthogonally constrained quadratic optimization problems, we derive a semidefinite relaxation and a randomized rounding scheme, which obtains provably near-optimal solutions, mimicking the blueprint from Goemans and Williamson for the Max-Cut problem. We then extend our approach to generic low-rank optimization problems by developing new semidefinite relaxations that are both tighter and more broadly applicable than those in prior works. Although our original proposal introduces large semidefinite matrices as decision variables, we show that most of the blocks in these matrices can be safely omitted without altering the optimal value, hence improving the scalability of our approach. Using several examples (including matrix completion, basis pursuit, and reduced-rank regression), we show how to reduce the size of our relaxation even further. Finally, we numerically illustrate the effectiveness and scalability of our relaxation and our sampling scheme on orthogonally constrained quadratic optimization and matrix completion problems.

[LG-68] A Bayesian Approach for Discovering Time-Delayed Differential Equation from Data

Link: https://arxiv.org/abs/2501.02934
Authors: Debangshu Chowdhury,Souvik Chakraborty
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:Time-delayed differential equations (TDDEs) are widely used to model complex dynamic systems where future states depend on past states with a delay. However, inferring the underlying TDDEs from observed data remains a challenging problem due to the inherent nonlinearity, uncertainty, and noise in real-world systems. Conventional equation discovery methods often exhibit limitations when dealing with large time delays, relying on deterministic techniques or optimization-based approaches that may struggle with scalability and robustness. In this paper, we present BayTiDe (Bayesian Approach for Discovering Time-Delayed Differential Equations from Data), which can identify arbitrarily large time delays to an accuracy that is directly proportional to the resolution of the input data. BayTiDe leverages Bayesian inference combined with a sparsity-promoting discontinuous spike-and-slab prior to accurately identify time-delayed differential equations, while efficiently narrowing the search space to achieve significant computational savings. We demonstrate the efficiency and robustness of BayTiDe through a range of numerical examples, validating its ability to recover delayed differential equations from noisy data.

[LG-69] Predicting band gap from chemical composition: A simple learned model for a material property with atypical statistics

Link: https://arxiv.org/abs/2501.02932
Authors: Andrew Ma,Owen Dugan,Marin Soljačić
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 9 pages, 4 figures

Abstract:In solid-state materials science, substantial efforts have been devoted to the calculation and modeling of the electronic band gap. While a wide range of ab initio methods and machine learning algorithms have been created that can predict this quantity, the development of new computational approaches for studying the band gap remains an active area of research. Here we introduce a simple machine learning model for predicting the band gap using only the chemical composition of the crystalline material. To motivate the form of the model, we first analyze the empirical distribution of the band gap, which sheds new light on its atypical statistics. Specifically, our analysis enables us to frame band gap prediction as a task of modeling a mixed random variable, and we design our model accordingly. Our model formulation incorporates thematic ideas from chemical heuristic models for other material properties in a manner that is suited towards the band gap modeling task. The model has exactly one parameter corresponding to each element, which is fit using data. To predict the band gap for a given material, the model computes a weighted average of the parameters associated with its constituent elements and then takes the maximum of this quantity and zero. The model provides heuristic chemical interpretability by intuitively capturing the associations between the band gap and individual chemical elements.
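The prediction rule stated at the end of the abstract (a weighted average of per-element parameters, clipped at zero) can be sketched in a few lines. Two things here are assumptions rather than details from the paper: the weights are taken to be the composition fractions, and the parameter values in the usage lines are made up for illustration.

```python
def predict_band_gap(composition, element_params):
    """Predict a band gap as max(weighted average of element parameters, 0).

    composition:    dict mapping element symbol to its amount in the formula.
    element_params: dict mapping element symbol to its single learned parameter.
    Weighting by composition fraction is an assumption of this sketch.
    """
    total = sum(composition.values())
    weighted_avg = sum(amount * element_params[el]
                       for el, amount in composition.items()) / total
    # Clipping at zero lets the model output an exactly zero gap for metals.
    return max(weighted_avg, 0.0)


# Hypothetical parameters, not fitted values from the paper.
params = {"Ga": -1.0, "N": 5.0, "Cu": -2.0}
print(predict_band_gap({"Ga": 1, "N": 1}, params))  # positive average is kept
print(predict_band_gap({"Cu": 1}, params))          # negative average is clipped to 0.0
```

The clipping step is what makes the output a mixed random variable in the sense of the abstract: a point mass at zero plus a continuous positive part.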

[LG-70] Proteomic Learning of Gamma-Aminobutyric Acid (GABA) Receptor-Mediated Anesthesia

Link: https://arxiv.org/abs/2501.02824
Authors: Jian Jiang,Long Chen,Yueying Zhu,Yazhou Shi,Huahai Qiu,Bengong Zhang,Tianshou Zhou,Guo-Wei Wei
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

Abstract:Anesthetics are crucial in surgical procedures and therapeutic interventions, but they come with side effects and varying levels of effectiveness, calling for novel anesthetic agents that offer more precise and controllable effects. Targeting Gamma-aminobutyric acid (GABA) receptors, the primary inhibitory receptors in the central nervous system, could enhance their inhibitory action, potentially reducing side effects while improving the potency of anesthetics. In this study, we introduce a proteomic learning approach to GABA receptor-mediated anesthesia based on 24 GABA receptor subtypes, considering over 4000 proteins in protein-protein interaction (PPI) networks and over 1.5 million known binding compounds. We develop a corresponding drug-target interaction network to identify potential lead compounds for novel anesthetic design. To ensure robust proteomic learning predictions, we curated a dataset comprising 136 targets from a pool of 980 targets within the PPI networks. We employed three machine learning algorithms, integrating advanced natural language processing (NLP) models such as pretrained transformer and autoencoder embeddings. Through a comprehensive screening process, we evaluated the side effects and repurposing potential of over 180,000 drug candidates targeting the GABRA5 receptor. Additionally, we assessed the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of these candidates to identify those with near-optimal characteristics. This approach also involved optimizing the structures of existing anesthetics. Our work presents an innovative strategy for the development of new anesthetic drugs, optimization of anesthetic use, and deeper understanding of potential anesthesia-related side effects.

[LG-71] Analogue Forecast System for Daily Precipitation Prediction Using Autoencoder Feature Extraction: Application in Hong Kong

Link: https://arxiv.org/abs/2501.02814
Authors: Yee Chun Tsoi,Yu Ting Kwok,Ming Chun Lam,Wai Kin Wong
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 16 pages, 10 figures

Abstract:In the Hong Kong Observatory, the Analogue Forecast System (AFS) for precipitation has been providing useful reference in predicting possible daily rainfall scenarios for the next 9 days, by identifying historical cases with similar weather patterns to the latest output from the deterministic model of the European Centre for Medium-Range Weather Forecasts (ECMWF). Recent advances in machine learning allow more sophisticated models to be trained using historical data and the patterns of high-impact weather events to be represented more effectively. As such, an enhanced AFS has been developed using the deep learning technique autoencoder. The datasets of the fifth generation of the ECMWF Reanalysis (ERA5) are utilised where more meteorological elements in higher horizontal, vertical and temporal resolutions are available as compared to the previous ECMWF reanalysis products used in the existing AFS. The enhanced AFS features four major steps in generating the daily rain class forecasts: (1) preprocessing of gridded ERA5 and ECMWF model forecast, (2) feature extraction by the pretrained autoencoder, (3) application of optimised feature weightings based on historical cases, and (4) calculation of the final rain class from a weighted ensemble of top analogues. The enhanced AFS demonstrates a consistent and superior performance over the existing AFS, especially in capturing heavy rain cases, during the verification period from 2019 to 2022. This paper presents the detailed formulation of the enhanced AFS and discusses its advantages and limitations in supporting precipitation forecasting in Hong Kong.
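Steps (3) and (4) of the pipeline above can be pictured with a small analogue-search sketch. Everything specific in it is an assumption made for illustration: the weighted Euclidean distance, the inverse-distance ensemble weights, the rounding to an integer rain class, and the toy feature vectors. The abstract does not specify these choices, and the autoencoder that would produce the features is not modelled here.

```python
import math


def top_analogues(query, history, weights, k):
    """Return the k historical cases closest to the query in weighted feature space.

    query:   feature vector for the latest forecast (e.g. an autoencoder latent).
    history: list of (feature_vector, rain_class) pairs from past days.
    weights: per-feature weightings (hypothetical stand-in for the optimised ones).
    """
    scored = []
    for features, rain_class in history:
        dist = math.sqrt(sum(w * (q - f) ** 2
                             for w, q, f in zip(weights, query, features)))
        scored.append((dist, rain_class))
    scored.sort(key=lambda item: item[0])
    return scored[:k]


def ensemble_rain_class(analogues):
    """Inverse-distance weighted mean of analogue rain classes, rounded to a class."""
    inv = [1.0 / (dist + 1e-9) for dist, _ in analogues]
    mean = sum(w * cls for w, (_, cls) in zip(inv, analogues)) / sum(inv)
    return round(mean)


# Toy history: 2-D features with rain classes 0 (dry) to 4 (heavy); made-up data.
history = [([0.0, 0.0], 0), ([1.0, 1.0], 2), ([5.0, 5.0], 4)]
best = top_analogues([1.0, 1.0], history, weights=[1.0, 1.0], k=2)
print(best, ensemble_rain_class(best))
```

A closer analogue dominates the ensemble under inverse-distance weighting; a real system would tune both the feature weightings and the ensemble rule against historical verification data.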

[LG-72] Beyond $\mathcal{O}(\sqrt{T})$ Regret: Decoupling Learning and Decision-making in Online Linear Programming

Link: https://arxiv.org/abs/2501.02761
Authors: Wenzhi Gao,Dongdong Ge,Chenyu Xue,Chunlin Sun,Yinyu Ye
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Extension of conference submission this https URL

Abstract:Online linear programming plays an important role in both revenue management and resource allocation, and recent research has focused on developing efficient first-order online learning algorithms. Despite the empirical success of first-order methods, they typically achieve a regret no better than $\mathcal{O}(\sqrt{T})$, which is suboptimal compared to the $\mathcal{O}(\log T)$ bound guaranteed by the state-of-the-art linear programming (LP)-based online algorithms. This paper establishes a general framework that improves upon the $\mathcal{O}(\sqrt{T})$ result when the LP dual problem exhibits certain error bound conditions. For the first time, we show that first-order learning algorithms achieve $o(\sqrt{T})$ regret in the continuous support setting and $\mathcal{O}(\log T)$ regret in the finite support setting beyond the non-degeneracy assumption. Our results significantly improve the state-of-the-art regret results and provide new insights for sequential decision-making.

[LG-73] Improving Quantum Machine Learning via Heat-Bath Algorithmic Cooling

Link: https://arxiv.org/abs/2501.02687
Authors: Nayeli A. Rodríguez-Briones,Daniel K. Park
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures

Abstract:This work introduces an approach rooted in quantum thermodynamics to enhance sampling efficiency in quantum machine learning (QML). We propose conceptualizing quantum supervised learning as a thermodynamic cooling process. Building on this concept, we develop a quantum refrigerator protocol that enhances sample efficiency during training and prediction without the need for Grover iterations or quantum phase estimation. Inspired by heat-bath algorithmic cooling protocols, our method alternates entropy compression and thermalization steps to decrease the entropy of qubits, increasing polarization towards the dominant bias. This technique minimizes the computational overhead associated with estimating classification scores and gradients, presenting a practical and efficient solution for QML algorithms compatible with noisy intermediate-scale quantum devices.

[LG-74] Re-examining Granger Causality from Causal Bayesian Networks Perspective

Link: https://arxiv.org/abs/2501.02672
Authors: S. A. Adedayo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
Comments:

Abstract:Characterizing cause-effect relationships in complex systems could be critical to understanding these systems. For many, Granger causality (GC) remains a computational tool of choice to identify causal relations in time series data. Like other causal discovery tools, GC has limitations and has been criticized as a non-causal framework. Here, we addressed one of the recurring criticisms of GC by endowing it with a proper causal interpretation. This was achieved by analyzing GC through the lenses of Reichenbach’s Common Cause Principles (RCCPs) and causal Bayesian networks (CBNs). We showed theoretically and graphically that this reformulation endowed GC with a proper causal interpretation under certain assumptions and achieved satisfactory results on simulation.

[LG-75] LWFNet: Coherent Doppler Wind Lidar-Based Network for Wind Field Retrieval

Link: https://arxiv.org/abs/2501.02613
Authors: Ran Tao,Chong Wang,Hao Chen,Mingjiao Jia,Xiang Shang,Luoyuan Qu,Guoliang Shentu,Yanyu Lu,Yanfeng Huo,Lei Bai,Xianghui Xue,Xiankang Dou
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures

Abstract:Accurate detection of wind fields within the troposphere is essential for atmospheric dynamics research and plays a crucial role in extreme weather forecasting. Coherent Doppler wind lidar (CDWL) is widely regarded as the most suitable technique for high spatial and temporal resolution wind field detection. However, since coherent detection relies heavily on the concentration of aerosol particles, which cause Mie scattering, the received backscattering lidar signal exhibits significantly low intensity at high altitudes. As a result, conventional methods, such as spectral centroid estimation, often fail to produce credible and accurate wind retrieval results in these regions. To address this issue, we propose LWFNet, the first Lidar-based Wind Field (WF) retrieval neural Network, built upon Transformer and the Kolmogorov-Arnold network. Our model is trained solely on targets derived from the traditional wind retrieval algorithm and utilizes radiosonde measurements as the ground truth for test results evaluation. Experimental results demonstrate that LWFNet not only extends the maximum wind field detection range but also produces more accurate results, exhibiting a level of precision that surpasses the labeled targets. This phenomenon, which we refer to as super-accuracy, is explored by investigating the potential underlying factors that contribute to this intriguing occurrence. In addition, we compare the performance of LWFNet with other state-of-the-art (SOTA) models, highlighting its superior effectiveness and capability in high-resolution wind retrieval. LWFNet demonstrates remarkable performance in lidar-based wind field retrieval, setting a benchmark for future research and advancing the development of deep learning models in this domain.

[LG-76] Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Link: https://arxiv.org/abs/2501.02547
Authors: Yuan Cao,Yihan He,Dennis Wu,Hong-Yu Chen,Jianqing Fan,Han Liu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 51 pages, 17 figures, 5 tables

Abstract:Transformers have achieved significant success in various fields, notably excelling in tasks involving sequential data like natural language processing. Despite these achievements, the theoretical understanding of transformers’ capabilities remains limited. In this paper, we investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE). Specifically, we consider a setting where a context is formed by a set of independent sequences generated according to a Bayesian network. We demonstrate that there exists a simple transformer model that can (i) estimate the conditional probabilities of the Bayesian network according to the context, and (ii) autoregressively generate a new sample according to the Bayesian network with estimated conditional probabilities. We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training. Our analysis highlights the potential of transformers to learn complex probabilistic models and contributes to a better understanding of large language models as a powerful class of sequence generators.
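The two-stage procedure in (i) and (ii) is the target computation that the abstract says a transformer can simulate. The sketch below implements that target directly with counting, not with a transformer: it estimates conditional probabilities from the context by maximum likelihood and then samples a new sequence variable by variable. The data layout, the function names, and the uniform fallback for unseen parent configurations are assumptions of this sketch.

```python
import random
from collections import Counter


def estimate_cpts(context, parents):
    """MLE of P(x_i | parents of x_i) by counting over the context sequences.

    context: list of tuples of discrete values, one tuple per sequence.
    parents: parents[i] is a tuple of indices j < i that are the parents of variable i.
    """
    joint, marginal = Counter(), Counter()
    for seq in context:
        for i, pa in enumerate(parents):
            key = tuple(seq[j] for j in pa)
            joint[(i, key, seq[i])] += 1
            marginal[(i, key)] += 1
    return {k: count / marginal[(k[0], k[1])] for k, count in joint.items()}


def generate(cpts, parents, values, rng):
    """Sample one sequence autoregressively from the estimated network."""
    seq = []
    for i, pa in enumerate(parents):
        key = tuple(seq[j] for j in pa)
        probs = [cpts.get((i, key, v), 0.0) for v in values]
        if sum(probs) == 0.0:
            # Parent configuration never seen in context: uniform fallback (assumption).
            probs = [1.0] * len(values)
        seq.append(rng.choices(values, weights=probs)[0])
    return tuple(seq)


# Toy chain X0 -> X1 in which X1 always copies X0 in the context.
context = [(0, 0), (0, 0), (1, 1), (1, 1)]
parents = [(), (0,)]
cpts = estimate_cpts(context, parents)
print(generate(cpts, parents, values=[0, 1], rng=random.Random(0)))
```

Because the estimated P(X1 = X0 | X0) is 1 in this toy context, every generated pair has matching entries.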

[LG-77] Unified Guidance for Geometry-Conditioned Molecular Generation NEURIPS

Link: https://arxiv.org/abs/2501.02526
Authors: Sirine Ayadi,Leon Hetzel,Johanna Sommer,Fabian Theis,Stephan Günnemann
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS)

Abstract:Effectively designing molecular geometries is essential to advancing pharmaceutical innovations, a domain that has received great attention through the success of generative models and, in particular, diffusion models. However, current molecular diffusion models are tailored towards a specific downstream task and lack adaptability. We introduce UniGuide, a framework for controlled geometric guidance of unconditional diffusion models that allows flexible conditioning during inference without the requirement of extra training or networks. We show how applications such as structure-based, fragment-based, and ligand-based drug design are formulated in the UniGuide framework and demonstrate on-par or superior performance compared to specialised models. Offering a more versatile approach, UniGuide has the potential to streamline the development of molecular generative models, allowing them to be readily used in diverse application scenarios.

[LG-78] IRIS: A Bayesian Approach for Image Reconstruction in Radio Interferometry with expressive Score-Based priors

Link: https://arxiv.org/abs/2501.02473
Authors: Noé Dia,M. J. Yantovski-Barth,Alexandre Adam,Micah Bowles,Laurence Perreault-Levasseur,Yashar Hezaveh,Anna Scaife
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 17 pages, 8 figures, submitted to the Astrophysical Journal

Abstract:Inferring sky surface brightness distributions from noisy interferometric data in a principled statistical framework has been a key challenge in radio astronomy. In this work, we introduce Imaging for Radio Interferometry with Score-based models (IRIS). We use score-based models trained on optical images of galaxies as an expressive prior in combination with a Gaussian likelihood in the uv-space to infer images of protoplanetary disks from visibility data of the DSHARP survey conducted by ALMA. We demonstrate the advantages of this framework compared with traditional radio interferometry imaging algorithms, showing that it produces plausible posterior samples despite the use of a misspecified galaxy prior. Through coverage testing on simulations, we empirically evaluate the accuracy of this approach to generate calibrated posterior samples.

[LG-79] Transfer learning via Regularized Linear Discriminant Analysis

Link: https://arxiv.org/abs/2501.02411
Authors: Hongzhe Zhang,Arnab Auddy,Hongzhe Lee
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Linear discriminant analysis is a widely used method for classification. However, the high dimensionality of predictors combined with small sample sizes often results in large classification errors. To address this challenge, it is crucial to leverage data from related source models to enhance the classification performance of a target model. We propose to address this problem in the framework of transfer learning. In this paper, we present novel transfer learning methods via regularized random-effects linear discriminant analysis, where the discriminant direction is estimated as a weighted combination of ridge estimates obtained from both the target and source models. Multiple strategies for determining these weights are introduced and evaluated, including one that minimizes the estimation risk of the discriminant vector and another that minimizes the classification error. Utilizing results from random matrix theory, we explicitly derive the asymptotic values of these weights and the associated classification error rates in the high-dimensional setting, where $p/n \rightarrow \infty$, with $p$ representing the predictor dimension and $n$ the sample size. We also provide geometric interpretations of various weights and guidance on which weights to choose. Extensive numerical studies, including simulations and analysis of proteomics-based 10-year cardiovascular disease risk classification, demonstrate the effectiveness of the proposed approach.

[LG-80] On The Causal Network Of Face-selective Regions In Human Brain During Movie Watching

Link: https://arxiv.org/abs/2501.02333
Authors: Ali Bavafa,Gholam-Ali Hossein-Zadeh
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Understanding the causal interactions in simple brain tasks, such as face detection, remains a challenging and ambiguous process for researchers. In this study, we address this issue by employing a novel causal discovery method, Directed Acyclic Graphs via M-matrices for Acyclicity (DAGMA), to investigate the causal structure of the brain’s face-selective network and gain deeper insights into its mechanism. Using natural movie stimuli, we extract the causal network of face-selective regions and analyze how frames containing faces influence this network. Our findings reveal that the presence of faces in the stimuli has a causal effect on both the number and strength of causal connections within the network. Additionally, our results highlight the crucial role of subcortical regions in satisfying causal sufficiency, emphasizing its importance in causal studies of the brain. This study provides a new perspective on understanding the causal architecture of the face-selective network of the brain, motivating further research on neural causality.

[LG-81] Analysis of Fluorescence Telescope Data Using Machine Learning Methods

Link: https://arxiv.org/abs/2501.02311
Authors: Mikhail Zotov,Pavel Zakharov (for the JEM-EUSO Collaboration)
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 12 pages; to be published in the proceedings of the 38th Russian Cosmic Ray Conference (2024)

Abstract:Fluorescence telescopes are among the key instruments used for studying ultra-high energy cosmic rays in all modern experiments. We use model data for a small ground-based telescope EUSO-TA to try some methods of machine learning and neural networks for recognizing tracks of extensive air showers in its data and for reconstruction of energy and arrival directions of primary particles. We also comment on the opportunities to use this approach for other fluorescence telescopes and outline possible ways of improving the performance of the suggested methods.

[LG-82] Beyond Log-Concavity and Score Regularity: Improved Convergence Bounds for Score-Based Generative Models in W2-distance

Link: https://arxiv.org/abs/2501.02298
Authors: Marta Gentiloni-Silveri,Antonio Ocello
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Score-based Generative Models (SGMs) aim to sample from a target distribution by learning score functions using samples perturbed by Gaussian noise. Existing convergence bounds for SGMs in the $\mathcal{W}_2$-distance rely on stringent assumptions about the data distribution. In this work, we present a novel framework for analyzing $\mathcal{W}_2$-convergence in SGMs, significantly relaxing traditional assumptions such as log-concavity and score regularity. Leveraging the regularization properties of the Ornstein-Uhlenbeck (OU) process, we show that weak log-concavity of the data distribution evolves into log-concavity over time. This transition is rigorously quantified through a PDE-based analysis of the Hamilton-Jacobi-Bellman equation governing the log-density of the forward process. Moreover, we establish that the drift of the time-reversed OU process alternates between contractive and non-contractive regimes, reflecting the dynamics of concavity. Our approach circumvents the need for stringent regularity conditions on the score function and its estimators, relying instead on milder, more practical assumptions. We demonstrate the wide applicability of this framework through explicit computations on Gaussian mixture models, illustrating its versatility and potential for broader classes of data distributions.
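For readers who want the objects behind the forward process and the time-reversed drift, one common convention is written out below. The normalization (unit mean reversion, diffusion coefficient $\sqrt{2}$) is an assumption of this note and may differ from the paper's; $p_t$ denotes the law of $X_t$ and $T$ the time horizon.

```latex
\begin{aligned}
dX_t &= -X_t\,dt + \sqrt{2}\,dB_t, \qquad X_0 \sim p_{\mathrm{data}},\\
X_t \mid X_0 &\sim \mathcal{N}\!\left(e^{-t}X_0,\ (1-e^{-2t})\,I\right),\\
dY_t &= \bigl(Y_t + 2\,\nabla \log p_{T-t}(Y_t)\bigr)\,dt + \sqrt{2}\,d\bar{B}_t, \qquad Y_0 \sim p_T .
\end{aligned}
```

In this form the reverse drift is the sum of an expanding linear term $Y_t$ and the score term $2\,\nabla \log p_{T-t}$, so whether the drift is contractive depends on how strongly log-concave $p_{T-t}$ is at that time.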

[LG-83] Robust Multi-Dimensional Scaling via Accelerated Alternating Projections

Link: https://arxiv.org/abs/2501.02208
Authors: Tong Deng,Tianming Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:We consider the robust multi-dimensional scaling (RMDS) problem in this paper. The goal is to localize point locations from pairwise distances that may be corrupted by outliers. Inspired by classic MDS theories, and nonconvex works for the robust principal component analysis (RPCA) problem, we propose an alternating projection based algorithm that is further accelerated by the tangent space projection technique. For the proposed algorithm, if the outliers are sparse enough, we can establish linear convergence of the reconstructed points to the original points after centering and rotation alignment. Numerical experiments verify the state-of-the-art performances of the proposed algorithm.

[LG-84] Majorization-Minimization Dual Stagewise Algorithm for Generalized Lasso

Link: https://arxiv.org/abs/2501.02197
Authors: Jianmin Chen,Kun Chen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments:

Abstract:The generalized lasso is a natural generalization of the celebrated lasso approach to handle structural regularization problems. Many important methods and applications fall into this framework, including fused lasso, clustered lasso, and constrained lasso. To elevate its effectiveness in large-scale problems, extensive research has been conducted on the computational strategies of generalized lasso. However, to our knowledge, most studies are under the linear setup, with limited advances in non-Gaussian and non-linear models. We propose a majorization-minimization dual stagewise (MM-DUST) algorithm to efficiently trace out the full solution paths of the generalized lasso problem. The majorization technique is incorporated to handle different convex loss functions through their quadratic majorizers. Utilizing the connection between primal and dual problems and the idea of “slow-brewing” from stagewise learning, the minimization step is carried out in the dual space through a sequence of simple coordinate-wise updates on the dual coefficients with a small step size. Consequently, selecting an appropriate step size enables a trade-off between statistical accuracy and computational efficiency. We analyze the computational complexity of MM-DUST and establish the uniform convergence of the approximated solution paths. Extensive simulation studies and applications with regularized logistic regression and Cox model demonstrate the effectiveness of the proposed approach.

[LG-85] Molecule-dynamic-based Aging Clock and Aging Roadmap Forecast with Sundial

Link: https://arxiv.org/abs/2501.02176
Authors: Wei Wu, Zizhen Deng, Chi Zhang, Can Liao, Jinzhuo Wang
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:Addressing the unavoidable bias inherent in supervised aging clocks, we introduce Sundial, a novel framework that models molecular dynamics through a diffusion field, capturing both the population-level aging process and the individual-level relative aging order. Sundial enables unbiased estimation of biological age and the forecast of the aging roadmap. Faster-aging individuals identified by Sundial exhibit a higher disease risk compared to those identified from supervised aging clocks. This framework opens new avenues for exploring key topics, including age- and sex-specific aging dynamics and faster yet healthy aging paths.

[LG-86] Learning Fricke signs from Maass form Coefficients

Link: https://arxiv.org/abs/2501.02105
Authors: Joanna Bieri, Giorgi Butbaia, Edgar Costa, Alyson Deines, Kyu-Hwan Lee, David Lowry-Duda, Thomas Oliver, Yidi Qi, Tamara Veenstra
Categories: Number Theory (math.NT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 pages, 10 figures, 5 tables

Abstract:In this paper, we conduct a data-scientific investigation of Maass forms. We find that averaging the Fourier coefficients of Maass forms with the same Fricke sign reveals patterns analogous to the recently discovered “murmuration” phenomenon, and that these patterns become more pronounced when parity is incorporated as an additional feature. Approximately 43% of the forms in our dataset have an unknown Fricke sign. For the remaining forms, we employ Linear Discriminant Analysis (LDA) to machine learn their Fricke sign, achieving 96% (resp. 94%) accuracy for forms with even (resp. odd) parity. We apply the trained LDA model to forms with unknown Fricke signs to make predictions. The average values based on the predicted Fricke signs are computed and compared to those for forms with known signs to verify the reasonableness of the predictions. Additionally, a subset of these predictions is evaluated against heuristic guesses provided by Hejhal’s algorithm, showing a match approximately 95% of the time. We also use neural networks to obtain results comparable to those from the LDA model.

[LG-87] Laws of thermodynamics for exponential families

Link: https://arxiv.org/abs/2501.02071
Authors: Akshay Balsubramani
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:We develop the laws of thermodynamics in terms of general exponential families. By casting learning (log-loss minimization) problems in max-entropy and statistical mechanics terms, we translate thermodynamics results to learning scenarios. We extend the well-known way in which exponential families characterize thermodynamic and learning equilibria. Basic ideas of work and heat, and advanced concepts of thermodynamic cycles and equipartition of energy, find exact and useful counterparts in AI / statistics terms. These ideas have broad implications for quantifying and addressing distribution shift.
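
For orientation, the standard definition of an exponential family that the abstract relies on is reproduced below. This is textbook background rather than a result from the paper; the symbols h, T, A, and θ are the usual generic notation and are introduced here.

```latex
% Standard exponential-family form (background only).
% h: base measure, T: sufficient statistics, A: log-partition function.
\begin{align*}
p_\theta(x) &= h(x)\exp\!\big(\theta^{\top} T(x) - A(\theta)\big), \\
A(\theta) &= \log \int h(x)\exp\!\big(\theta^{\top} T(x)\big)\,dx, \\
\nabla A(\theta) &= \mathbb{E}_{p_\theta}\big[T(x)\big].
\end{align*}
```

The Boltzmann distribution is the familiar special case: taking T(x) to be the energy of state x and θ proportional to the negative inverse temperature makes A(θ) the log of the partition function, which is the usual bridge between max-entropy learning and statistical mechanics.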

[LG-88] Modeling COVID-19 spread in the USA using metapopulation SIR models coupled with graph convolutional neural networks

Link: https://arxiv.org/abs/2501.02043
Authors: Petr Kisselev, Padmanabhan Seshaiyer
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Populations and Evolution (q-bio.PE)

Abstract:Graph convolutional neural networks (GCNs) have shown tremendous promise in addressing data-intensive challenges in recent years. In particular, some attempts have been made to improve predictions of Susceptible-Infected-Recovered (SIR) models by incorporating human mobility between metapopulations and using graph approaches to estimate corresponding hyperparameters. Recently, researchers have found that a hybrid GCN-SIR approach outperformed existing methodologies when used on the data collected on a precinct level in Japan. In our work, we extend this approach to data collected from the continental US, adjusting for the differing mobility patterns and varying policy responses. We also develop the strategy for real-time continuous estimation of the reproduction number and study the accuracy of model predictions for the overall population as well as individual states. Strengths and limitations of the GCN-SIR approach are discussed as a potential candidate for modeling disease dynamics.
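
To make the "metapopulation SIR" part concrete, here is a minimal sketch of a two-patch SIR model with mobility between patches. It covers only the compartmental dynamics, not the GCN component or the authors' estimation procedure. The function name `step_metapop_sir`, the forward-Euler update, and all parameter values (`beta`, `gamma`, the `mobility` matrix) are assumptions chosen for illustration.

```python
def step_metapop_sir(state, beta, gamma, mobility, dt=1.0):
    """Advance a metapopulation SIR model by one forward-Euler step.

    state:    list of (S, I, R) tuples, one per patch.
    beta:     transmission rate (illustrative value, not from the paper).
    gamma:    recovery rate (illustrative value, not from the paper).
    mobility: mobility[i][j] is the per-step rate of movement from patch i to j.
    """
    n = len(state)
    new_state = []
    for i, (s, inf, rec) in enumerate(state):
        pop = s + inf + rec
        # Within-patch SIR dynamics with frequency-dependent transmission.
        new_infections = beta * s * inf / pop if pop > 0 else 0.0
        recoveries = gamma * inf
        ds, di, dr = -new_infections, new_infections - recoveries, recoveries
        # Movement: inflow from every other patch minus outflow to it.
        for j in range(n):
            if j == i:
                continue
            sj, ij, rj = state[j]
            ds += mobility[j][i] * sj - mobility[i][j] * s
            di += mobility[j][i] * ij - mobility[i][j] * inf
            dr += mobility[j][i] * rj - mobility[i][j] * rec
        new_state.append((s + dt * ds, inf + dt * di, rec + dt * dr))
    return new_state


# Two hypothetical patches; the outbreak starts in the first one only.
initial = [(990.0, 10.0, 0.0), (1000.0, 0.0, 0.0)]
mobility = [[0.0, 0.01], [0.01, 0.0]]

state = initial
for _ in range(50):
    state = step_metapop_sir(state, beta=0.3, gamma=0.1, mobility=mobility)

total_before = sum(sum(patch) for patch in initial)
total_after = sum(sum(patch) for patch in state)
```

In this sketch the total population is conserved up to floating-point error, and infection reaches the second patch purely through mobility. In a hybrid approach such as the one the abstract describes, quantities like `beta` or the mobility weights would be estimated from data rather than fixed by hand.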

Information Retrieval

[IR-0] OpenTable data with multi-criteria ratings

Link: https://arxiv.org/abs/2501.03072
Authors: Yong Zheng
Categories: Information Retrieval (cs.IR)

Abstract:With the development of recommender systems (RSs), several promising systems have emerged, such as context-aware RS, multi-criteria RS, and group RS. Multi-criteria recommender systems (MCRSs) are designed to provide personalized recommendations by considering user preferences in multiple attributes or criteria simultaneously. Unlike traditional RSs that typically focus on a single rating, these systems help users make more informed decisions by considering their diverse preferences and needs across various dimensions. In this article, we release the OpenTable data set which was crawled from this http URL. The data set can be considered as a benchmark data set for multi-criteria recommendations.

[IR-1] FlipedRAG: Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models

Link: https://arxiv.org/abs/2501.02968
Authors: Zhuo Chen, Yuyang Gong, Miaokun Chen, Haotan Liu, Qikai Cheng, Fan Zhang, Wei Lu, Xiaozhong Liu, Jiawei Liu
Categories: Information Retrieval (cs.IR)
Comments: arXiv admin note: text overlap with arXiv:2407.13757

Abstract:Retrieval-Augmented Generation (RAG) addresses hallucination and real-time constraints by dynamically retrieving relevant information from a knowledge database to supplement the LLMs’ input. When presented with a query, RAG selects the most semantically similar texts from its knowledge bases and uses them as context for the LLMs to generate more accurate responses. RAG also creates a new attack surface, especially since RAG databases are frequently sourced from public domains. While existing studies have predominantly focused on optimizing RAG’s performance and efficiency, emerging research has begun addressing the security concerns associated with RAG. However, these works have some limitations, typically focusing on either white-box methodologies or heuristic-based black-box attacks. Furthermore, prior research has mainly targeted simple factoid question answering, which is neither practically challenging nor resistant to correction. In this paper, we unveil a more realistic and threatening scenario: opinion manipulation for controversial topics against RAG. Particularly, we propose a novel RAG black-box attack method, termed FlipedRAG, which is transfer-based. By leveraging instruction engineering, we obtain partial retrieval model outputs from a black-box RAG system, facilitating the training of surrogate models to enhance the effectiveness of opinion manipulation attacks. Extensive experimental results confirm that our approach significantly enhances the average success rate of opinion manipulation by 16.7%. It achieves an average of a 50% directional change in the opinion polarity of RAG responses across four themes. Additionally, it induces a 20% shift in user cognition. Furthermore, we discuss the efficacy of potential defense mechanisms and conclude that they are insufficient in mitigating this type of attack, highlighting the urgent need to develop novel defensive strategies.
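
The retrieval step the abstract describes ("selects the most semantically similar texts") can be sketched as a cosine-similarity top-k lookup. This is a generic toy illustration, not the FlipedRAG attack or any specific RAG system; the three-dimensional vectors stand in for real text embeddings, and `cosine`, `retrieve_top_k`, and the document ids are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query_vec, corpus, k=2):
    """Return the ids of the k corpus entries most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy embeddings (made up); a real system would use a learned text encoder.
query = [1.0, 0.1, 0.0]
corpus = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.7, 0.7, 0.0]),
    ("doc_c", [0.0, 0.0, 1.0]),
]
top = retrieve_top_k(query, corpus, k=2)

# A document placed close to the query in embedding space displaces others,
# which is why a publicly sourced knowledge base widens the attack surface.
top_with_injected = retrieve_top_k(query, corpus + [("injected", [1.0, 0.1, 0.0])], k=2)
```

The second call shows only why retrieval is exposed when anyone can add documents; how such documents are crafted against a black-box retriever is the subject of the paper and is not reproduced here.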

[IR-2] Integrating Language-Image Prior into EEG Decoding for Cross-Task Zero-Calibration RSVP-BCI

Link: https://arxiv.org/abs/2501.02841
Authors: Xujin Li, Wei Wei, Shuang Qiu, Xinyi Zhang, Fu Li, Huiguang He
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 15 pages, 11 figures

Abstract:Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interface (BCI) is an effective technology used for information detection by detecting Event-Related Potentials (ERPs). The current RSVP decoding methods can perform well in decoding EEG signals within a single RSVP task, but their decoding performance significantly decreases when directly applied to different RSVP tasks without calibration data from the new tasks. This limits the rapid and efficient deployment of RSVP-BCI systems for detecting different categories of targets in various scenarios. To overcome this limitation, this study aims to enhance the cross-task zero-calibration RSVP decoding performance. First, we design three distinct RSVP tasks for target image retrieval and build an open-source dataset containing EEG signals and corresponding stimulus images. Then we propose an EEG with Language-Image Prior fusion Transformer (ELIPformer) for cross-task zero-calibration RSVP decoding. Specifically, we propose a prompt encoder based on the language-image pre-trained model to extract language-image features from task-specific prompts and stimulus images as prior knowledge for enhancing EEG decoding. A cross bidirectional attention mechanism is also adopted to facilitate the effective feature fusion and alignment between the EEG and language-image features. Extensive experiments demonstrate that the proposed model achieves superior performance in cross-task zero-calibration RSVP decoding, which promotes the RSVP-BCI system from research to practical application.

[IR-3] Improving GenIR Systems Based on User Feedback

Link: https://arxiv.org/abs/2501.02838
Authors: Qingyao Ai, Zhicheng Dou, Min Zhang
Categories: Information Retrieval (cs.IR)
Comments: Chapter 5 of the book on Information Access in the Era of Generative AI

Abstract:In this chapter, we discuss how to improve the GenIR systems based on user feedback. Before describing the approaches, it is necessary to be aware that the concept of “user” has been extended in the interactions with the GenIR systems. Different types of feedback information and strategies are also provided. Then the alignment techniques are highlighted in terms of objectives and methods. Following this, various ways of learning from user feedback in GenIR are presented, including continual learning, learning and ranking in the conversational context, and prompt learning. Through this comprehensive exploration, it becomes evident that innovative techniques are being proposed beyond traditional methods of utilizing user feedback, and contribute significantly to the evolution of GenIR in the new era. We also summarize some challenging topics and future directions that require further investigation.

[IR-4] Quantum Cognition-Inspired EEG-based Recommendation via Graph Neural Networks

Link: https://arxiv.org/abs/2501.02671
Authors: Jinkun Han, Wei Li, Yingshu Li, Zhipeng Cai
Categories: Information Retrieval (cs.IR)

Abstract:Current recommendation systems recommend goods by considering users’ historical behaviors, social relations, ratings, and other multi-modal signals. Although outdated user information reflects the trends of a user’s interests, no recommendation system can truly know users’ real-time thoughts. With the development of brain-computer interfaces, it is time to explore next-generation recommenders that show users’ real-time thoughts without delay. Electroencephalography (EEG) is a promising method of collecting brain signals because of its convenience and mobility. Currently, there is little research on EEG-based recommendation due to the complexity of learning human brain activity. To explore the utility of EEG-based recommendation, we propose a novel neural network model, QUARK, combining Quantum Cognition Theory and Graph Convolutional Networks for accurate item recommendations. Compared with the state-of-the-art recommendation models, the superiority of QUARK is confirmed via extensive experiments.

[IR-5] Interactive Information Need Prediction with Intent and Context

Link: https://arxiv.org/abs/2501.02635
Authors: Kevin Ros, Dhyey Pandya, ChengXiang Zhai
Categories: Information Retrieval (cs.IR)

Abstract:The ability to predict a user’s information need would have wide-ranging implications, from saving time and effort to mitigating vocabulary gaps. We study how to interactively predict a user’s information need by letting them select a pre-search context (e.g., a paragraph, sentence, or single word) and specify an optional partial search intent (e.g., “how”, “why”, “applications”, etc.). We examine how various generative language models can explicitly make this prediction by generating a question as well as how retrieval models can implicitly make this prediction by retrieving an answer. We find that this prediction process is possible in many cases and that user-provided partial search intent can help mitigate large pre-search contexts. We conclude that this framework is promising and suitable for real-world applications.

[IR-6] Citation Structural Diversity: A Novel and Concise Metric Combining Structure and Semantics for Literature Evaluation

Link: https://arxiv.org/abs/2501.02429
Authors: Mingyue Kong, Yinglong Zhang, Likun Sheng, Kaifeng Hong
Categories: Information Retrieval (cs.IR)
Comments: 18 pages, 10 figures

Abstract:As academic research becomes increasingly diverse, traditional literature evaluation methods face significant limitations, particularly in capturing the complexity of academic dissemination and the multidimensional impacts of literature. To address these challenges, this paper introduces a novel literature evaluation model of citation structural diversity, with a focus on assessing its feasibility as an evaluation metric. By refining the citation network and incorporating both citation structural features and semantic information, the study examines the influence of the proposed model of citation structural diversity on citation volume and long-term academic impact. The findings reveal that literature with higher citation structural diversity demonstrates notable advantages in both citation frequency and sustained academic influence. Through data grouping and a decade-long citation trend analysis, the potential application of this model in literature evaluation is further validated. This research offers a fresh perspective on optimizing literature evaluation methods and emphasizes the distinct advantages of citation structural diversity in measuring interdisciplinarity.

[IR-7] GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems

Link: https://arxiv.org/abs/2501.02408
Authors: Mehmet Deniz Türkmen, Mucahid Kutlu, Bahadir Altun, Gokalp Cosgun
Categories: Information Retrieval (cs.IR)

Abstract:Building test collections for Information Retrieval evaluation has traditionally been a resource-intensive and time-consuming task, primarily due to the dependence on manual relevance judgments. While various cost-effective strategies have been explored, the development of such collections remains a significant challenge. In this paper, we present GenTREC, the first test collection constructed entirely from documents generated by a Large Language Model (LLM), eliminating the need for manual relevance judgments. Our approach is based on the assumption that documents generated by an LLM are inherently relevant to the prompts used for their generation. Based on this heuristic, we utilized existing TREC search topics to generate documents. We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant. To introduce realistic retrieval challenges, we also generated non-relevant documents, ensuring that IR systems are tested against a diverse and robust set of materials. The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance “judgments”. We conducted extensive experiments to evaluate GenTREC in terms of document quality, relevance judgment accuracy, and evaluation reliability. Notably, our findings indicate that the ranking of IR systems using GenTREC is compatible with the evaluations conducted using traditional TREC test collections, particularly for P@100, MAP, and RPrec metrics. Overall, our results show that our proposed approach offers a promising, low-cost alternative for IR evaluation, significantly reducing the burden of building and maintaining future IR evaluation resources.
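
The three metrics named in the abstract (P@k, MAP, and RPrec) have standard definitions, sketched below for a single topic. This is a generic illustration and not the authors' evaluation code; the function names, the toy ranking, and the relevant-document set are made up. MAP is the mean of `average_precision` over all topics.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranking, relevant, r) if r else 0.0


# Toy example: one topic, a system ranking, and its relevant set (all hypothetical).
ranking = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p_at_2 = precision_at_k(ranking, relevant, 2)   # 1 relevant document in the top 2
ap = average_precision(ranking, relevant)       # (1/2 + 2/4) / 3
rprec = r_precision(ranking, relevant)          # 1 relevant document in the top 3
```

Under GenTREC's assumption, the `relevant` set for a topic would contain exactly the documents generated from that topic's prompt, so no manual judgments are needed to compute these scores.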

[IR-8] Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation

Link: https://arxiv.org/abs/2501.02226
Authors: Shijie Wang, Wenqi Fan, Yue Feng, Xinyu Ma, Shuaiqiang Wang, Dawei Yin
Categories: Information Retrieval (cs.IR)
Comments: Preprint. Under review

Abstract:Recommender systems have become increasingly vital in our daily lives, helping to alleviate the problem of information overload across various user-oriented online services. The emergence of Large Language Models (LLMs) has yielded remarkable achievements, demonstrating their potential for the development of next-generation recommender systems. Despite these advancements, LLM-based recommender systems face inherent limitations stemming from their LLM backbones, particularly issues of hallucinations and the lack of up-to-date and domain-specific knowledge. Recently, Retrieval-Augmented Generation (RAG) has garnered significant attention for addressing these limitations by leveraging external knowledge sources to enhance the understanding and generation of LLMs. However, vanilla RAG methods often introduce noise and neglect structural relationships in knowledge, limiting their effectiveness in LLM-based recommendations. To address these limitations, we propose to retrieve high-quality and up-to-date structure information from the knowledge graph (KG) to augment recommendations. Specifically, our approach develops a retrieval-augmented framework, termed K-RagRec, that facilitates the recommendation generation process by incorporating structure information from the external KG. Extensive experiments have been conducted to demonstrate the effectiveness of our proposed method.

[IR-9] The Application of Large Language Models in Recommendation Systems

Link: https://arxiv.org/abs/2501.02178
Authors: Peiyang Yu, Zeqiu Xu, Jiani Wang, Xiaochuan Xu
Categories: Information Retrieval (cs.IR)

Abstract:The integration of Large Language Models into recommendation frameworks presents key advantages for personalization and adaptability of experiences to the users. Classic methods of recommendations, such as collaborative filtering and content-based filtering, are seriously limited in the solution of cold-start problems, sparsity of data, and lack of diversity in information considered. LLMs, of which GPT-4 is a good example, have emerged as powerful tools that enable recommendation frameworks to tap into unstructured data sources such as user reviews, social interactions, and text-based content. By analyzing these data sources, LLMs improve the accuracy and relevance of recommendations, thereby overcoming some of the limitations of traditional approaches. This work discusses applications of LLMs in recommendation systems, especially in electronic commerce, social media platforms, streaming services, and educational technologies. This showcases how LLMs enrich recommendation diversity, user engagement, and the system’s adaptability; yet it also looks into the challenges connected to their technical implementation. This can also be presented as a study that shows the potential of LLMs for changing user experiences and making innovation possible in industries.

Attachment Download

Click to download the full list of today's papers
