arXiv Daily Papers | 2025-01-06

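This post lists the latest papers fetched from arXiv.org on 2025-01-06. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.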

Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.

[NLP-0] Metadata Conditioning Accelerates Language Model Pre-training

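[Quick Read]: This paper addresses how language model pre-training can make effective use of diverse training data (varying in style, domain, and quality) to build general model capabilities. Because these sources are heterogeneous, learning and deploying the correct behavior from each is challenging. The paper proposes Metadata Conditioning then Cooldown (MeCo). MeCo supplies additional metadata (such as URLs) as a learning cue during pre-training, then applies a cooldown phase on standard text only, so that the model works normally without metadata. The method significantly accelerates pre-training across model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For example, a 1.6B-parameter model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. MeCo also allows steering the model by placing real or fabricated metadata in the inference prompt, for example to reduce harmful generations or improve performance on common-knowledge tasks. MeCo is simple, adds no computational overhead, and shows promise for producing more capable and steerable language models.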

Link: https://arxiv.org/abs/2501.01956
Authors: Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen
Affiliations: Princeton Language and Intelligence, Princeton University
Categories: Computation and Language (cs.CL)
Comments: Code available at this https URL

Abstract:The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like this http URL) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending this http URL to reduce harmful generations or this http URL (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
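
The two-phase data formatting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the `URL: ...` template, the function name `format_example`, and the 10% cooldown fraction are all assumptions made for the example.

```python
def format_example(text, url, step, total_steps, cooldown_frac=0.1):
    """Return the training string for one document at a given step.

    Hypothetical template: during the main phase the source URL is
    prepended; during the final `cooldown_frac` of training only the
    plain text is used, so the model also works without metadata.
    """
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step < cooldown_start:
        # Metadata-conditioning phase: metadata first, then the document.
        return f"URL: {url}\n\n{text}"
    # Cooldown phase: standard text only.
    return text


doc = "The mitochondrion is the powerhouse of the cell."
early = format_example(doc, "en.wikipedia.org", step=100, total_steps=1000)
late = format_example(doc, "en.wikipedia.org", step=950, total_steps=1000)
```

At inference time the same hypothetical template could be prepended to a prompt to steer generation, which is the usage the abstract describes for real or fabricated metadata.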

[NLP-1] Abstractive Text Summarization for Contemporary Sanskrit Prose: Issues and Challenges

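[Quick Read]: This paper addresses the challenges of developing abstractive text summarization (TS) models for contemporary Sanskrit prose. Sanskrit is a low-resource inflectional language, and its complex grammar and limited data make building such models especially difficult. The thesis examines these challenges through sub-questions organized under four themes, covering data collection, preprocessing, model training, and inference. The key contribution is a complete pipeline for Sanskrit abstractive summarization, with a detailed report of the challenges met at each stage of development and how they were handled. The work provides an initial framework for Sanskrit text summarization and a basis for future research.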

Link: https://arxiv.org/abs/2501.01933
Authors: Shagun Sinha
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: PhD Thesis

Abstract: This thesis presents Abstractive Text Summarization models for contemporary Sanskrit prose. The first chapter, titled Introduction, presents the motivation behind this work, the research questions, and the conceptual framework. Sanskrit is a low-resource inflectional language. The key research question that this thesis investigates is what the challenges are in developing abstractive TS for Sanskrit. To answer the key research question, sub-questions based on four different themes have been posed in this work. The second chapter, Literature Review, surveys previous work. The third chapter, Data Preparation, answers the remaining three questions from the third theme. It reports the data collection and preprocessing challenges for both language model and summarization model training. The fourth chapter reports the training and inference of models and the results obtained therein. This research has initiated a pipeline for Sanskrit abstractive text summarization and has reported the challenges faced at every stage of the development. The research questions based on every theme have been answered to answer the key research question.

[NLP-2] Long Context vs. RAG for LLMs: An Evaluation and Revisits

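[Quick Read]: This paper addresses how to enable large language models (LLMs) to incorporate extremely long external contexts. It examines the two main strategies: extending the context window (Long Context, LC) and using a retriever to selectively access relevant information (Retrieval-Augmented Generation, RAG). The key step is a more comprehensive evaluation that filters out questions answerable without external context, identifies the most effective retrieval methods, and expands the datasets. The results show that LC generally outperforms RAG on question-answering benchmarks, especially for Wikipedia-based questions; summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. RAG, however, has advantages on dialogue-based and general question queries. These findings expose the trade-offs between RAG and LC and offer guidance for optimizing how LLMs integrate external knowledge sources.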

Link: https://arxiv.org/abs/2501.01880
Authors: Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun
Affiliations: S-Lab, Nanyang Technological University; School of Computer Science, Fudan University
Categories: Computation and Language (cs.CL)
Comments: 14 pages excluding references and appendix

Abstract:Extending context windows (i.e., Long Context, LC) and using retrievers to selectively access relevant information (i.e., Retrieval-Augmented Generation, RAG) are the two main strategies to enable LLMs to incorporate extremely long external contexts. This paper revisits recent studies on this topic, highlighting their key insights and discrepancies. We then provide a more comprehensive evaluation by filtering out questions answerable without external context, identifying the most effective retrieval methods, and expanding the datasets. We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions. Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. However, RAG has advantages in dialogue-based and general question queries. These insights underscore the trade-offs between RAG and LC strategies, offering guidance for future optimization of LLMs with external knowledge sources. We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.

[NLP-3] Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

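[Quick Read]: This paper addresses vulnerabilities in the reasoning capabilities of large language models (LLMs), specifically their susceptibility to sophisticated jailbreak attacks. Despite extensive efforts to align language models with human values and ethical guidelines, they remain open to attacks that exploit their reasoning. Traditional safety mechanisms usually focus on detecting explicit malicious intent and leave deeper reasoning vulnerabilities unaddressed.

The paper introduces POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a technique that uses contrastive reasoning to elicit unethical responses. POATE generates queries with semantically opposite intent and combines them with adversarial templates to subtly steer the model toward harmful responses. In experiments on six language model families of varying parameter sizes (including LLaMA3, Gemma2, Phi3, and GPT-4), POATE achieves a high attack success rate (about 44%), significantly above existing methods. The paper also proposes a defense strategy that strengthens reasoning robustness through chain-of-thought prompting and reverse thinking, mitigating reasoning-driven adversarial attacks.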

Link: https://arxiv.org/abs/2501.01872
Authors: Rachneet Sachdeva, Rima Hazra, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; INSAIT, Sofia University “St. Kliment Ohridski”
Categories: Computation and Language (cs.CL)
Comments: Our code is publicly available at this https URL

Abstract:Despite significant efforts to align large language models with human values and ethical guidelines, these models remain susceptible to sophisticated jailbreak attacks that exploit their reasoning capabilities. Traditional safety mechanisms often focus on detecting explicit malicious intent, leaving deeper vulnerabilities unaddressed. In this work, we introduce a jailbreak technique, POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), which leverages contrastive reasoning to elicit unethical responses. POATE generates prompts with semantically opposite intents and combines them with adversarial templates to subtly direct models toward producing harmful responses. We conduct extensive evaluations across six diverse language model families of varying parameter sizes, including LLaMA3, Gemma2, Phi3, and GPT-4, to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. We evaluate our proposed attack against seven safety defenses, revealing their limitations in addressing reasoning-based vulnerabilities. To counteract this, we propose a defense strategy that improves reasoning robustness through chain-of-thought prompting and reverse thinking, mitigating reasoning-driven adversarial exploits.

[NLP-4] Time Series Language Model for Descriptive Caption Generation

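[Quick Read]: This paper addresses the generation of natural language descriptions for time series data, in particular how to improve time series captioning when data is scarce. Pre-trained foundation models have made considerable progress in natural language processing (NLP) and computer vision (CV), but their application to time series analysis has been limited by data scarcity. Although several methods based on large language models (LLMs) exist for time series forecasting, time series captioning remains under-explored in the LLM setting.

The proposed solution is TSLM (Time Series Language Model), an encoder-decoder model designed specifically for time series captioning. Its key innovations are: 1) generating synthetic data through in-context prompting to ease data scarcity; and 2) denoising the generated time series-caption pairs with a novel cross-modal dense retrieval scoring method, which improves the quality of the generated data. Experiments show that TSLM significantly outperforms existing state-of-the-art methods on several time series captioning datasets.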

Link: https://arxiv.org/abs/2501.01832
Authors: Mohamed Trabelsi, Aidan Boyd, Jin Cao, Huseyin Uzunalioglu
Affiliations: Nokia Bell Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The automatic generation of representative natural language descriptions for observable patterns in time series data enhances interpretability, simplifies analysis and increases cross-domain utility of temporal data. While pre-trained foundation models have made considerable progress in natural language processing (NLP) and computer vision (CV), their application to time series analysis has been hindered by data scarcity. Although several large language model (LLM)-based methods have been proposed for time series forecasting, time series captioning is under-explored in the context of LLMs. In this paper, we introduce TSLM, a novel time series language model designed specifically for time series captioning. TSLM operates as an encoder-decoder model, leveraging both text prompts and time series data representations to capture subtle temporal patterns across multiple phases and generate precise textual descriptions of time series inputs. TSLM addresses the data scarcity problem in time series captioning by first leveraging an in-context prompting synthetic data generation, and second denoising the generated data via a novel cross-modal dense retrieval scoring applied to time series-caption pairs. Experimental findings on various time series captioning datasets demonstrate that TSLM outperforms existing state-of-the-art approaches from multiple data modalities by a significant margin.

[NLP-5] Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

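[Quick Read]: This paper addresses the limitations of existing automated red-teaming methods for finding safety vulnerabilities in large language models (LLMs). These methods typically identify only isolated safety flaws, adapt poorly to dynamic defenses, and are inefficient at discovering complex vulnerabilities. The authors propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to uncover security vulnerabilities. It rests on two key mechanisms: 1) Early-terminated Exploration, which speeds up exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm that, together with intermediate downgrade models, dynamically refines the search trajectory toward successful vulnerability exploitation. Experiments across a range of LLMs show that Auto-RT significantly improves exploration efficiency and automatically optimizes attack strategies, detecting a broader range of vulnerabilities with faster detection and a 16.63% higher success rate than existing methods.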

Link: https://arxiv.org/abs/2501.01830
Authors: Yanjiang Liu, Shuhen Zhou, Yaojie Lu, Huijia Zhu, Weiqiang Wang, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.

[NLP-6] The Proof is in the Almond Cookies

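[Quick Read]: This paper addresses how a robot or artificial cooking assistant can understand and process cooking recipes (and how-to instructions more generally) in order to support human chefs in the kitchen. Solving this would benefit society, in particular by helping aging adults or people with physical impairments retain their autonomy and by reducing stress in professional kitchens. The paper proposes a novel approach to computational recipe understanding that mimics the narrative-based human sense-making process. By integrating several knowledge sources, including language processing, ontologies, and mental simulation, recipes are modelled as rich narrative structures. These structures can be used to handle the challenges of recipe language (such as zero anaphora), optimize a robot's planning process, measure how well an AI system understands its current task, and make recipe annotations language-independent.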

Link: https://arxiv.org/abs/2501.01827
Authors: Remi van Trijp, Katrien Beuls, Paul Van Eecke
Affiliations: Sony Computer Science Laboratories Paris; Faculté d’informatique, Université de Namur; Artificial Intelligence Laboratory, Vrije Universiteit Brussel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: This paper presents a case study on how to process cooking recipes (and more generally, how-to instructions) in a way that makes it possible for a robot or artificial cooking assistant to support human chefs in the kitchen. Such AI assistants would be of great benefit to society, as they can help to sustain the autonomy of aging adults or people with a physical impairment, or they may reduce the stress in a professional kitchen. We propose a novel approach to computational recipe understanding that mimics the human sense-making process, which is narrative-based. Using an English recipe for almond crescent cookies as illustration, we show how recipes can be modelled as rich narrative structures by integrating various knowledge sources such as language processing, ontologies, and mental simulation. We show how such narrative structures can be used for (a) dealing with the challenges of recipe language, such as zero anaphora, (b) optimizing a robot’s planning process, (c) measuring how well an AI system understands its current tasks, and (d) allowing recipe annotations to become language-independent.

[NLP-7] SDPO: Segment-Level Direct Preference Optimization for Social Agents

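[Quick Read]: This paper addresses the shortcomings of social agents based on large language models (LLMs) in complex goal-oriented social dialogues. Existing approaches based on Direct Preference Optimization (DPO) for multi-turn interaction fall into turn-level and session-level methods. Turn-level methods are too fine-grained because they look only at individual turns, while session-level methods are too coarse-grained and tend to introduce training noise. To overcome these limitations, the paper proposes Segment-Level Direct Preference Optimization (SDPO), which focuses on key segments within an interaction to optimize multi-turn agent behavior while minimizing training noise. On the SOTOPIA benchmark, SDPO-tuned agents consistently outperform existing DPO-based methods and proprietary LLMs such as GPT-4o, indicating SDPO's potential to improve the social intelligence of LLM-based agents.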

Link: https://arxiv.org/abs/2501.01821
Authors: Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
Affiliations: TMCC, CS, Nankai University; Tongyi Lab
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO’s potential to advance the social intelligence of LLM-based agents. We release our code and data at this https URL.
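
To make the turn, session, and segment granularity concrete, here is a rough sketch that applies the standard DPO objective to log-probabilities summed over a chosen span of turns. It is an illustration under assumptions, not the paper's implementation: the helper names are hypothetical, the segment boundaries are simply given as inputs, and the paper's exact loss may differ.

```python
import math


def segment_logp(turn_logps, start, end):
    """Sum per-turn log-probabilities over turns [start, end)."""
    return sum(turn_logps[start:end])


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)).

    pol_* are policy log-probs and ref_* are reference-model log-probs
    for the preferred (w) and dispreferred (l) spans.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return math.log1p(math.exp(-margin))


# Toy per-turn log-probs for a preferred and a dispreferred dialogue.
policy_w = [-1.0, -0.5, -0.2, -0.9]
policy_l = [-1.0, -1.2, -1.1, -0.9]
ref_w = [-1.0, -0.8, -0.6, -0.9]
ref_l = [-1.0, -0.9, -0.8, -0.9]

# Hypothetical key segment: turns 1 and 2 only.
loss = dpo_loss(
    segment_logp(policy_w, 1, 3), segment_logp(policy_l, 1, 3),
    segment_logp(ref_w, 1, 3), segment_logp(ref_l, 1, 3),
)
```

Setting the span to a single turn or to the whole dialogue recovers the turn-level and session-level variants the abstract contrasts.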

[NLP-8] End-to-End Long Document Summarization using Gradient Caching

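[Quick Read]: This paper addresses the difficulty of training Transformer-based encoder-decoder models on long input documents for long document summarization, which stems from quadratic memory consumption during training. Existing methods can extend the input length at test time, but training still requires truncating the input documents, so training and test conditions do not match. The proposed solution, CachED (Gradient Caching for Encoder-Decoder models), processes the input document with non-overlapping sliding windows and fuses the results in the decoder. During backpropagation, gradients are cached at the decoder and passed through the encoder in chunks by re-computing the hidden vectors, similar to gradient checkpointing. Experiments show that CachED BART can process more than 500K tokens during training and achieves superior performance without additional parameters.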

Link: https://arxiv.org/abs/2501.01805
Authors: Rohit Saxena, Hao Tang, Frank Keller
Affiliations: Institute for Language, Cognition and Computation; School of Informatics, University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at test time, but training with these approaches is still difficult, requiring truncation of input documents and causing a mismatch between training and test conditions. In this work, we propose CachED (Gradient Caching for Encoder-Decoder models), an approach that enables end-to-end training of existing transformer-based encoder-decoder models, using the entire document without truncation. Specifically, we apply non-overlapping sliding windows to input documents, followed by fusion in decoder. During backpropagation, the gradients are cached at the decoder and are passed through the encoder in chunks by re-computing the hidden vectors, similar to gradient checkpointing. In the experiments on long document summarization, we extend BART to CachED BART, processing more than 500K tokens during training and achieving superior performance without using any additional parameters.
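
The cache-then-replay idea can be illustrated with a deliberately tiny scalar model. This is a toy sketch under stated assumptions, not the CachED implementation: the "encoder" is a single weight `w`, the "decoder" loss is a squared error on the summed hidden values, and all function names are hypothetical. It only shows that accumulating the weight gradient chunk by chunk from cached decoder-side gradients matches the gradient computed in one pass.

```python
def encode(w, chunk):
    """Toy encoder: scale every token value by one shared weight w."""
    return [w * x for x in chunk]


def cached_grad(w, chunks, target):
    # Pass 1: encode window by window, keeping only outputs (no graph).
    hidden = [h for c in chunks for h in encode(w, c)]
    # Toy decoder loss: 0.5 * (sum(hidden) - target) ** 2.
    residual = sum(hidden) - target
    # Cache dLoss/dhidden at the decoder input (identical per position here).
    cached = [residual] * len(hidden)
    # Pass 2: revisit each chunk and accumulate dLoss/dw from the cache.
    # In a real model this step would re-run the encoder on the chunk to
    # rebuild its activations; here d(w * x)/dw is simply x.
    grad_w, pos = 0.0, 0
    for c in chunks:
        for x in c:
            grad_w += cached[pos] * x
            pos += 1
    return grad_w


def full_grad(w, chunks, target):
    """Reference gradient computed over the whole input at once."""
    total = sum(x for c in chunks for x in c)
    return (w * total - target) * total


chunks = [[1.0, 2.0], [3.0], [4.0, 5.0]]
g_cached = cached_grad(0.5, chunks, target=3.0)
g_full = full_grad(0.5, chunks, target=3.0)
```

The memory benefit in a real model comes from holding activations for only one chunk at a time during the second pass, which this toy does not measure.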

[NLP-9] Reading Between the Lines: A dataset and a study on why some texts are tougher than others COLING’2025

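[Quick Read]: This paper aims to better understand what makes a text difficult to read for specific audiences with intellectual disabilities, namely people with limitations in cognitive functioning such as reading and comprehension skills, an IQ below 70, and challenges in conceptual domains. The key contributions are a difficulty annotation scheme grounded in empirical research in psychology and in translation studies, and an annotated dataset derived from parallel texts that is used for a multiclass classification task. The authors fine-tune four different pre-trained Transformer models to predict the strategies required for simplification, and examine how interpretable these language models' decisions are when predicting sentence difficulty.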

Link: https://arxiv.org/abs/2501.01796
Authors: Nouran Khallaf, Carlo Eugeni, Serge Sharoff
Affiliations: University of Leeds
Categories: Computation and Language (cs.CL)
Comments: Published at Writing Aids at the Crossroads of AI, Cognitive Science and NLP WR-AI-CogS, at COLING’2025, Abu Dhabi

Abstract: Our research aims at better understanding what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically, people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties which is based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from the parallel texts (standard English and Easy to Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform the task of multiclass classification to predict the strategies required for simplification. We also investigate the possibility to interpret the decisions of this language model when it is aimed at predicting the difficulty of sentences. The resources are available from this https URL

[NLP-10] Automating Legal Concept Interpretation with LLMs: Retrieval, Generation, and Evaluation

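[Quick Read]: This paper addresses the automatic interpretation of vague legal concepts in legal texts. Because vague concepts in legal articles must adapt to a changing society, legal practitioners often need to interpret them in detail, a process that relies on professional annotation by legal experts and is time-consuming and costly. The paper proposes ATRI, a novel retrieval-augmented generation framework that automatically retrieves relevant information from past judicial precedents and interprets vague legal concepts. The key idea is to combine retrieved precedent information with generative AI to produce high-quality interpretations automatically. The paper also introduces a new benchmark, Legal Concept Entailment, for evaluating generated interpretations automatically without expert involvement. Experiments show that the generated interpretations effectively help large language models (LLMs) understand vague legal concepts and that their quality is comparable to interpretations written by human experts.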

Link: https://arxiv.org/abs/2501.01743
Authors: Kangcheng Luo, Quzhe Huang, Cong Jiang, Yansong Feng
Affiliations: Wangxuan Institute of Computer Technology, Peking University; Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Legal articles often include vague concepts to adapt to the ever-changing society. Providing detailed interpretations of these concepts is a critical task for legal practitioners, which requires meticulous and professional annotations by legal experts, admittedly time-consuming and expensive to collect at scale. In this paper, we introduce a novel retrieval-augmented generation framework, ATRI, for AuTomatically Retrieving relevant information from past judicial precedents and Interpreting vague legal concepts. We further propose a new benchmark, Legal Concept Entailment, to automate the evaluation of generated concept interpretations without expert involvement. Automatic evaluations indicate that our generated interpretations can effectively assist large language models (LLMs) in understanding vague legal concepts. Multi-faceted evaluations by legal experts indicate that the quality of our concept interpretations is comparable to those written by human experts. Our work has strong implications for leveraging LLMs to support legal practitioners in interpreting vague legal concepts and beyond.

[NLP-11] How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models

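[Quick Read]: This paper addresses the toxicity that large language models (LLMs) may produce when generating content. Existing alignment methods mitigate the problem to some extent but do not eliminate the risk of harmful output. The paper therefore proposes EvoTox, an automated testing framework that quantitatively assesses how far LLMs can still be pushed toward toxic responses after alignment. EvoTox uses an iterative evolution strategy that exploits the interplay between two LLMs: the System Under Test (SUT) and a Prompt Generator, which produces prompts that steer the SUT toward responses of higher toxicity. Toxicity is assessed by an automated oracle based on an existing toxicity classifier. The framework is evaluated quantitatively and qualitatively on four LLMs of increasing complexity, and the results show that its effectiveness in terms of detected toxicity level is significantly higher than existing baseline methods, with limited cost overhead.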

Link: https://arxiv.org/abs/2501.01741
Authors: Simone Corbo, Luca Bancale, Valeria De Gennaro, Livia Lestingi, Vincenzo Scotti, Matteo Camilli
Affiliations: Department of Electronics, Information and Bioengineering (DEIB) of Politecnico di Milano (PoliMI) University
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLMs even after alignment efforts remains crucial for detecting any residual deviations with respect to ethical standards. We present EvoTox, an automated testing framework for LLMs’ inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs, the System Under Test (SUT) and the Prompt Generator steering SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using four state-of-the-art LLMs as evaluation subjects having increasing complexity (7-13 billion parameters). Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods, based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average).
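
The generator, system-under-test, and oracle loop can be sketched generically as a greedy search that keeps a candidate only when its score improves. This is a schematic under assumptions, not EvoTox itself: the `evolve` function, its parameters, and the toy stand-ins below (which score harmless letter counts rather than toxicity) are hypothetical, and the paper's actual evolution strategy may differ.

```python
import random


def evolve(seed, mutate, system_under_test, oracle,
           iterations=20, n_children=4, rng=None):
    """Greedy search: keep the best-scoring prompt found so far."""
    rng = rng or random.Random(0)
    best_prompt = seed
    best_score = oracle(system_under_test(seed))
    for _ in range(iterations):
        # The "prompt generator" proposes variants of the current best.
        children = [mutate(best_prompt, rng) for _ in range(n_children)]
        # The oracle scores the system's response to each variant.
        scored = [(oracle(system_under_test(c)), c) for c in children]
        top_score, top_prompt = max(scored)
        if top_score > best_score:
            best_prompt, best_score = top_prompt, top_score
    return best_prompt, best_score


# Harmless toy stand-ins for the three components.
def toy_mutate(prompt, rng):
    return prompt + rng.choice("abcx")


def toy_sut(prompt):
    return prompt.upper()


def toy_oracle(response):
    return response.count("X") / max(len(response), 1)


prompt, score = evolve("seed", toy_mutate, toy_sut, toy_oracle)
```

In the setting the abstract describes, the mutation step would be an LLM call and the oracle a toxicity classifier; both are replaced here so the loop stays self-contained.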

[NLP-12] The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters

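[Quick Read]: This paper addresses the fact that existing benchmarks for evaluating machine Theory-of-Mind (ToM) capabilities overlook long-term personal background information. Existing benchmarks typically use short narratives without global background, whereas human ToM relies on understanding other people's backgrounds and life stories. To verify the importance of long personal backgrounds in ToM and to assess large language models (LLMs) in this more realistic setting, the authors introduce a new benchmark, CharToM-QA, which contains 1,035 ToM questions based on characters from classic novels. A human study shows that participants who have read the relevant novels perform markedly better than when they have not. Experiments also show that LLMs still perform notably worse than humans even though they have seen these stories during pre-training, which highlights the limits of current LLMs in capturing the nuanced contextual information that ToM reasoning requires. The key to the approach is a benchmark that includes long-term background information, giving a more realistic assessment of machine ToM.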

Link: https://arxiv.org/abs/2501.01705
Authors: Chulun Zhou, Qiujing Wang, Mo Yu, Xiaoqian Yue, Rui Lu, Jiangnan Li, Yifan Zhou, Shunchi Zhang, Jie Zhou, Wai Lam
Affiliations: The Chinese University of Hong Kong; Pattern Recognition Center, WeChat AI, Tencent Inc, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, under review

Abstract:Theory-of-Mind (ToM) is a fundamental psychological capability that allows humans to understand and interpret the mental states of others. Humans infer others’ thoughts by integrating causal cues and indirect clues from broad contextual information, often derived from past interactions. In other words, human ToM heavily relies on the understanding about the backgrounds and life stories of others. Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines’ ToM capabilities, due to their usage of short narratives without global backgrounds. In this paper, we verify the importance of understanding long personal backgrounds in ToM and assess the performance of LLMs in such realistic evaluation scenarios. To achieve this, we introduce a novel benchmark, CharToM-QA, comprising 1,035 ToM questions based on characters from classic novels. Our human study reveals a significant disparity in performance: the same group of educated participants performs dramatically better when they have read the novels compared to when they have not. In parallel, our experiments on state-of-the-art LLMs, including the very recent o1 model, show that LLMs still perform notably worse than humans, despite that they have seen these stories during pre-training. This highlights the limitations of current LLMs in capturing the nuanced contextual information required for ToM reasoning.

[NLP-13] AgentRefine: Enhancing Agent Generalization through Refinement Tuning

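[Quick Read]: This paper addresses the limited generalization of agents based on large language models (LLMs) on complex tasks. A significant gap remains between open-source LLMs and commercial models such as the GPT series. When facing new environments or tasks in particular, existing agent training methods tend to overfit and perform poorly on unseen tasks. Concretely, agents make frequent formatting errors, struggle to learn from their mistakes, and merely memorize existing observation-action relations.

The key to the solution is AgentRefine, a framework that improves agent generalization through instruction tuning. Its core idea is to let the model learn to correct its wrong actions from the feedback observed in the trajectory. Specifically, the paper proposes an agent synthesis framework that covers a diverse range of environments and tasks, and prompts a strong LLM to refine its erroneous actions according to environment feedback. AgentRefine significantly outperforms state-of-the-art methods in generalization across diverse agent tasks, is more robust to perturbation, and can generate diversified thoughts at inference time. The findings establish a link between agent generalization and self-refinement and offer a new paradigm for future research.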

Link: https://arxiv.org/abs/2501.01702
Authors: Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma Gongque, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, Weiran Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments:

Abstract:Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans. However, there is still a large gap between open-sourced LLMs and commercial models like the GPT series. In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning. We first observe that the existing agent training corpus exhibits satisfactory results on held-in evaluation sets but fails to generalize to held-out sets. These agent-tuning works face severe formatting errors and are frequently stuck in the same mistake for a long while. We analyze that the poor generalization ability comes from overfitting to several manual agent environments and a lack of adaptation to new situations. They struggle with the wrong action steps and can not learn from the experience but just memorize existing observation-action relations. Inspired by the insight, we propose a novel AgentRefine framework for agent-tuning. The core idea is to enable the model to learn to correct its mistakes via observation in the trajectory. Specifically, we propose an agent synthesis framework to encompass a diverse array of environments and tasks and prompt a strong LLM to refine its error action according to the environment feedback. AgentRefine significantly outperforms state-of-the-art agent-tuning work in terms of generalization ability on diverse agent tasks. It also has better robustness facing perturbation and can generate diversified thought in inference. Our findings establish the correlation between agent generalization and self-refinement and provide a new paradigm for future research.

[NLP-14] Adaptive Few-shot Prompting for Machine Translation with Pre-trained Language Models AAAI2025

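[Quick Read]: This paper addresses the prompt sensitivity of large language models (LLMs) in neural machine translation. Existing evidence shows that a fixed prompt is not optimal for every input sentence, which limits LLM translation quality. To address this, the paper proposes an Adaptive Few-Shot Prompting (AFSP) framework that automatically selects suitable translation demonstrations for each source sentence in order to better elicit the LLM's translation capability.

The solution has several key parts. First, a translation demonstration retrieval module built on the LLM's embeddings retrieves the top-k semantically similar translation demonstrations from an aligned parallel corpus. Rather than using a separate embedding model, the module uses the embedding layer of the deployed LLM itself to build input representations, which yields more semantically related demonstrations. Second, to ensure semantic consistency between source input and target output, the LLM is made to generate multiple candidate outputs in the target language with the help of the demonstrations, and these candidates are then reranked. In addition, to evaluate AFSP and extend the scope of neural machine translation research, the paper constructs a high-quality diplomatic Chinese-English parallel dataset of 5,528 sentence pairs. Experiments on this dataset and on the United Nations Parallel Corpus (Chinese-English part) show the effectiveness and superiority of AFSP.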

Link: https://arxiv.org/abs/2501.01679
Authors: Lei Tang, Jinghui Qin, Wenxuan Ye, Hao Tan, Zhijing Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: published to AAAI2025

Abstract:Recently, Large language models (LLMs) with in-context learning have demonstrated remarkable potential in handling neural machine translation. However, existing evidence shows that LLMs are prompt-sensitive and it is sub-optimal to apply the fixed prompt to any input for downstream machine translation tasks. To address this issue, we propose an adaptive few-shot prompting (AFSP) framework to automatically select suitable translation demonstrations for various source input sentences to further elicit the translation capability of an LLM for better machine translation. First, we build a translation demonstration retrieval module based on LLM’s embedding to retrieve top-k semantic-similar translation demonstrations from aligned parallel translation corpus. Rather than using other embedding models for semantic demonstration retrieval, we build a hybrid demonstration retrieval module based on the embedding layer of the deployed LLM to build better input representation for retrieving more semantic-related translation demonstrations. Then, to ensure better semantic consistency between source inputs and target outputs, we force the deployed LLM itself to generate multiple output candidates in the target language with the help of translation demonstrations and rerank these candidates. Besides, to better evaluate the effectiveness of our AFSP framework on the latest language and extend the research boundary of neural machine translation, we construct a high-quality diplomatic Chinese-English parallel dataset that consists of 5,528 parallel Chinese-English sentences. Finally, extensive experiments on the proposed diplomatic Chinese-English parallel dataset and the United Nations Parallel Corpus (Chinese-English part) show the effectiveness and superiority of our proposed AFSP.
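
The retrieve-then-prompt step can be sketched as a cosine-similarity top-k lookup followed by prompt assembly. This is a minimal sketch under assumptions: the two-dimensional vectors stand in for embeddings that the paper takes from the deployed LLM, and the function names and the `Chinese:` / `English:` prompt template are hypothetical. The candidate generation and reranking step is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_demos(query_vec, corpus, k=2):
    """corpus: list of (embedding, source_sentence, target_sentence)."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return ranked[:k]


def build_prompt(source, demos):
    """Assemble a few-shot prompt (hypothetical template)."""
    blocks = [f"Chinese: {s}\nEnglish: {t}" for _, s, t in demos]
    blocks.append(f"Chinese: {source}\nEnglish:")
    return "\n\n".join(blocks)


# Toy parallel corpus with made-up 2-D embeddings.
corpus = [
    ([1.0, 0.0], "你好", "Hello"),
    ([0.0, 1.0], "谢谢", "Thank you"),
    ([0.9, 0.1], "您好", "Hello (formal)"),
]
demos = top_k_demos([1.0, 0.0], corpus, k=2)
prompt = build_prompt("你好吗", demos)
```

Because the demonstrations are chosen per input, two different source sentences will generally receive different prompts, which is the adaptivity the abstract argues for.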

[NLP-15] CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis

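[Quick Read]: This paper addresses a limitation of current inference scaling methods such as Self-consistency and Best-of-N on complex reasoning tasks: they depend on the quality of the candidate answers and cannot produce a correct answer when every candidate is wrong. The key to the solution is a synthesizer based on Chain-of-Thought (CoT) reasoning, the CoT-based Synthesizer, which analyzes complementary information across multiple candidate answers and can synthesize a better answer even when all candidates are flawed. The paper also introduces an automated data generation pipeline that produces diverse training data, so that smaller LLMs trained on this data can improve the inference accuracy of larger models, including API-based LLMs. Experiments show significant gains on several benchmark datasets; on MATH, accuracy improves by 11.8% for Llama3-8B and 10.3% for GPT-4o.

Link: https://arxiv.org/abs/2501.01668
Authors: Bohan Zhang, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang
Affiliation: School of Information, Renmin University of China; Tsinghua University; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Categories: Computation and Language (cs.CL)
Comments: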

Abstract:Current inference scaling methods, such as Self-consistency and Best-of-N, have proven effective in improving the accuracy of LLMs on complex reasoning tasks. However, these methods rely heavily on the quality of candidate responses and are unable to produce correct answers when all candidates are incorrect. In this paper, we propose a novel inference scaling strategy, CoT-based Synthesizer, which leverages CoT reasoning to synthesize superior answers by analyzing complementary information from multiple candidate responses, even when all candidate responses are flawed. To enable a lightweight and cost-effective implementation, we introduce an automated data generation pipeline that creates diverse training data. This allows smaller LLMs trained on this data to improve the inference accuracy of larger models, including API-based LLMs. Experimental results across four benchmark datasets with seven policy models demonstrate that our method significantly enhances performance, with gains of 11.8% for Llama3-8B and 10.3% for GPT-4o on the MATH dataset. The corresponding training data and code are publicly available on this https URL.
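
The limitation the abstract points out can be seen in a small sketch of the two baselines. This is an illustrative toy, with made-up candidate answers and scores and hypothetical function names; it is not the paper's synthesizer. Both baselines only select among existing candidates, so neither can return an answer that no candidate proposed.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over sampled answers."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, scores):
    """Pick the answer with the highest score from a (hypothetical) reward model."""
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]

# Toy case: the correct answer never appears among the candidates.
correct = "42"
candidates = ["41", "43", "41", "40"]
voted = self_consistency(candidates)
picked = best_of_n(candidates, [0.2, 0.9, 0.3, 0.1])
print(voted, picked)
```

A synthesizer, by contrast, generates a new answer conditioned on all candidates, which is why it is not bound to the candidate set.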

[NLP-16] MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments

[Quick Read]: This paper addresses how to comprehensively evaluate large language models (LLMs) in complex interactive role-playing scenarios, particularly their ability to simulate advanced human behaviors. It proposes the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE), a comprehensive evaluation framework. MIRAGE provides a rich simulation environment through eight carefully crafted scripts covering diverse themes and styles. It uses four evaluation methods: the Trust Inclination Index (TII) to measure the dynamics of trust and suspicion, the Clue Investigation Capability (CIC) to assess information-handling ability, the Interactivity Capability Index (ICI) to assess role-playing ability, and the Script Compliance Index (SCI) to assess how well a model understands and follows instructions. Experiments show that even popular models such as GPT-4 face significant challenges with the complexity of MIRAGE.

Link: https://arxiv.org/abs/2501.01652
Authors: Cai Yin, Gu Zhouhong, Du Zhaohan, Ye Zheyu, Cao Shaosheng, Xu Yiqian, Feng Hongwei, Chen Ping
Affiliation: Institute of Big Data, Fudan University; Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; School of Computer Science, Fudan University; Xiaohongshu Inc.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in environmental perception, reasoning-based decision-making, and simulating complex human behaviors, particularly in interactive role-playing contexts. This paper introduces the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE), a comprehensive framework designed to assess LLMs’ proficiency in portraying advanced human behaviors through murder mystery games. MIRAGE features eight intricately crafted scripts encompassing diverse themes and styles, providing a rich simulation. To evaluate LLMs’ performance, MIRAGE employs four distinct methods: the Trust Inclination Index (TII) to measure dynamics of trust and suspicion, the Clue Investigation Capability (CIC) to measure LLMs’ capability of conducting information, the Interactivity Capability Index (ICI) to assess role-playing capabilities and the Script Compliance Index (SCI) to assess LLMs’ capability of understanding and following instructions. Our experiments indicate that even popular models like GPT-4 face significant challenges in navigating the complexities presented by the MIRAGE. The datasets and simulation codes are available on GitHub at this https URL.

[NLP-17] Multimodal Contrastive Representation Learning in Augmented Biomedical Knowledge Graphs

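[Quick Read]: This paper addresses effective link prediction in Biomedical Knowledge Graphs (BKGs), in particular the difficulty of uncovering complex relationships such as potential new drug-disease relations. The authors propose a novel multimodal approach that combines embeddings from specialized Language Models (LMs) with Graph Contrastive Learning (GCL) to strengthen intra-entity relationships, and uses a Knowledge Graph Embedding (KGE) model to capture inter-entity relationships for link prediction. To address the limitations of existing BKGs, they also present PrimeKG++, an enriched knowledge graph with multimodal data that includes biological sequences and textual descriptions for each entity type. By unifying semantic and relational information in one representation, the approach generalizes well and makes accurate link predictions even for unseen nodes. Experiments on PrimeKG++ and the DrugBank drug-target interaction dataset demonstrate its effectiveness and robustness.

Link: https://arxiv.org/abs/2501.01644
Authors: Tien Dang, Viet Thanh Duy Nguyen, Minh Tuan Le, Truong-Son Hy
Affiliation: University of Information Technology, Ho Chi Minh City, Vietnam; AI Center, FPT Software, Ho Chi Minh City, Vietnam; Washington University in St. Louis, United States; University of Alabama at Birmingham, Birmingham, Alabama, United States
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: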

Abstract:Biomedical Knowledge Graphs (BKGs) integrate diverse datasets to elucidate complex relationships within the biomedical field. Effective link prediction on these graphs can uncover valuable connections, such as potential novel drug-disease relations. We introduce a novel multimodal approach that unifies embeddings from specialized Language Models (LMs) with Graph Contrastive Learning (GCL) to enhance intra-entity relationships while employing a Knowledge Graph Embedding (KGE) model to capture inter-entity relationships for effective link prediction. To address limitations in existing BKGs, we present PrimeKG++, an enriched knowledge graph incorporating multimodal data, including biological sequences and textual descriptions for each entity type. By combining semantic and relational information in a unified representation, our approach demonstrates strong generalizability, enabling accurate link predictions even for unseen nodes. Experimental results on PrimeKG++ and the DrugBank drug-target interaction dataset demonstrate the effectiveness and robustness of our method across diverse biomedical datasets. Our source code, pre-trained models, and data are publicly available at this https URL

[NLP-18] A non-ergodic framework for understanding emergent capabilities in Large Language Models

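[Quick Read]: Large Language Models show unexpected emergent capabilities as they scale, but there is currently no theoretical framework explaining why and how these capabilities emerge. Drawing on Stuart Kauffman's theory of the adjacent possible (TAP), the paper proposes a mathematical framework for the emergence mechanism. The key contributions are an argument that language models are non-ergodic systems and a resource-constrained TAP equation showing how architectural, training, and contextual constraints shape model capabilities through phase transitions in semantic space. Experiments with three different language models indicate that capabilities emerge through discrete transitions guided by constraint interactions and path-dependent exploration. The framework offers a theoretical basis for understanding emergence in language models and guidance for developing architectures that can steer capability emergence.

Link: https://arxiv.org/abs/2501.01638
Authors: Javier Marin
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: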

Abstract:Large language models have emergent capabilities that come unexpectedly at scale, but we need a theoretical framework to explain why and how they emerge. We prove that language models are actually non-ergodic systems while providing a mathematical framework based on Stuart Kauffman’s theory of the adjacent possible (TAP) to explain capability emergence. Our resource-constrained TAP equation demonstrates how architectural, training, and contextual constraints interact to shape model capabilities through phase transitions in semantic space. We prove through experiments with three different language models that capacities emerge through discrete transitions guided by constraint interactions and path-dependent exploration. This framework provides a theoretical basis for understanding emergence in language models and guides the development of architectures that can guide capability emergence.

[NLP-19] Crossing Language Borders: A Pipeline for Indonesian Manhwa Translation

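[Quick Read]: This paper aims to automate the translation of Manhwa (Korean comics) from Indonesian to English, improving translation efficiency and reducing the cost of manual translation. The key to the solution is an automated pipeline that combines computer vision, text recognition, and natural language processing. The steps are: speech bubble detection with a fine-tuned YOLOv5xu model, optical character recognition (OCR) with Tesseract, and machine translation with a fine-tuned MarianMT model. With this pipeline, the paper shows how Manhwa can be translated efficiently into English in a low-resource language setting (Indonesian), making it more accessible to readers worldwide.

Link: https://arxiv.org/abs/2501.01629
Authors: Nithyasri Narasimhan, Sagarika Singh
Affiliation: Rochester Institute of Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: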

Abstract:In this project, we develop a practical and efficient solution for automating the Manhwa translation from Indonesian to English. Our approach combines computer vision, text recognition, and natural language processing techniques to streamline the traditionally manual process of Manhwa(Korean comics) translation. The pipeline includes fine-tuned YOLOv5xu for speech bubble detection, Tesseract for OCR and fine-tuned MarianMT for machine translation. By automating these steps, we aim to make Manhwa more accessible to a global audience while saving time and effort compared to manual translation methods. While most Manhwa translation efforts focus on Japanese-to-English, we focus on Indonesian-to-English translation to address the challenges of working with low-resource languages. Our model shows good results at each step and was able to translate from Indonesian to English efficiently.

[NLP-20] ICPC: In-context Prompt Compression with Faster Inference

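[Quick Read]: This paper addresses the input length limit that large language models (LLMs) face when handling long prompts. Because the input size of an LLM is fixed, long prompts are hard to process. Existing prompt compression methods shorten prompts by removing redundant tokens, but they typically require additional computation and incur memory overhead. The paper proposes ICPC (In-context Prompt Compression), a novel and scalable prompt compression method. Its key idea is to use encoders to compute the probability of each word appearing in the prompt and an information function to compute the information each word carries, which reduces information loss during compression and speeds it up. Experiments show that ICPC effectively compresses long texts of different categories and achieves better performance and speed on a variety of NLP tasks.

Link: https://arxiv.org/abs/2501.01625
Authors: Ziyang Yu, Yuyu Liu
Affiliation: Southern University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: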

Abstract:Despite the recent success of Large Language Models (LLMs), it remains challenging to feed LLMs with long prompts due to the fixed size of LLM inputs. As a remedy, prompt compression becomes a promising solution by removing redundant tokens in the prompt. However, using LLM in the existing works requires additional computation resources and leads to memory overheads. To address it, we propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length. The key idea of ICPC is to calculate the probability of each word appearing in the prompt using encoders and calculate information carried by each word through the information function, which effectively reduces the information loss during prompt compression and increases the speed of compression. Empirically, we demonstrate that ICPC can effectively compress long texts of different categories and thus achieve better performance and speed on different types of NLP tasks.
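
The idea of scoring words by the information they carry can be illustrated with self-information, I(w) = -log2 p(w). The sketch below is an assumption-laden toy: the abstract does not spell out ICPC's information function, and the fixed probability table stands in for probabilities that the paper computes with encoders. `compress` and `keep_ratio` are hypothetical names.

```python
import math

def self_information(p):
    """Self-information in bits of an event with probability p."""
    return -math.log2(p)

def compress(words, prob, keep_ratio=0.5):
    """Keep the highest-information words, preserving their original order."""
    n_keep = max(1, round(len(words) * keep_ratio))
    scored = [(self_information(prob[w]), i) for i, w in enumerate(words)]
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:n_keep])
    return [words[i] for i in keep]

# Hypothetical word probabilities; frequent function words carry little information.
prob = {"the": 0.5, "a": 0.45, "model": 0.05, "prompt": 0.02, "compresses": 0.01}
words = ["the", "model", "compresses", "a", "prompt"]
compressed = compress(words, prob, keep_ratio=0.6)
print(compressed)
```

Low-probability words score high and survive, while common function words are dropped first, which is the intuition behind removing redundant tokens with limited information loss.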

[NLP-21] PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents

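[Quick Read]: This paper addresses how to evaluate the clinical suitability of LLM-based psychiatric assessment conversational agents (PACAs). Although LLMs have made notable progress in generating human-like dialogue, there is no standardized method for assessing whether these agents interact with patients in a clinically appropriate way. The authors propose PSYCHE, a framework that simulates virtual patients from a multi-faceted psychiatric construct (covering their profiles, histories, and behaviors) to enable clinically relevant, ethically safe, cost-efficient, and quantitative evaluation of PACAs. The key to PSYCHE is that simulating these multi-faceted patient characteristics lets PACAs be evaluated in conditions close to a real clinical setting, yielding more accurate and reliable performance measures.

Link: https://arxiv.org/abs/2501.01594
Authors: Jingoo Lee, Kyungho Lim, Young-Chul Jung, Byung-Hoon Kim
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The first two authors contributed equally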

Abstract:Recent advances in large language models (LLMs) have accelerated the development of conversational agents capable of generating human-like responses. Since psychiatric assessments typically involve complex conversational interactions between psychiatrists and patients, there is growing interest in developing LLM-based psychiatric assessment conversational agents (PACAs) that aim to simulate the role of psychiatrists in clinical evaluations. However, standardized methods for benchmarking the clinical appropriateness of PACAs’ interaction with patients still remain underexplored. Here, we propose PSYCHE, a novel framework designed to enable the 1) clinically relevant, 2) ethically safe, 3) cost-efficient, and 4) quantitative evaluation of PACAs. This is achieved by simulating psychiatric patients based on a multi-faceted psychiatric construct that defines the simulated patients’ profiles, histories, and behaviors, which PACAs are expected to assess. We validate the effectiveness of PSYCHE through a study with 10 board-certified psychiatrists, supported by an in-depth analysis of the simulated patient utterances.

[NLP-22] (WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges

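[Quick Read]: This paper addresses the challenges large language models (LLMs) face on multiple-choice question (MCQ) tasks, especially poor results caused by hallucinations and unclear prompts. The key to the solution is fine-tuning Microsoft's PHI-3 model on the TruthfulQA dataset together with optimized prompt design to improve MCQ performance. After fine-tuning and prompt optimization, the perplexity of PHI-3.5 on MCQ tasks dropped from 4.68 to 2.27, and accuracy rose from 62% to 90.8%. The study highlights the importance of efficient models in adaptive learning systems and educational assessment, paving the way for broader classroom use of LLMs in areas such as test preparation, student feedback, and personalized learning.

Link: https://arxiv.org/abs/2501.01588
Authors: Mohamed Hisham Abdellatif
Affiliation: Cairo University, Systems and Biomedical Engineering, Egypt
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: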

Abstract:Large Language Models (LLMs) have become essential tools across various domains due to their impressive capabilities in understanding and generating human-like text. The ability to accurately answer multiple-choice questions (MCQs) holds significant value in education, particularly in automated tutoring systems and assessment platforms. However, adapting LLMs to handle MCQ tasks effectively remains challenging due to the hallucinations and unclear prompts. This work explores the potential of Microsoft’s PHI-3 (Abdin et al., 2024), a compact yet efficient LLM, for MCQ answering. Our contributions include fine-tuning the model on the TruthfulQA dataset, designing optimized prompts to enhance model performance, and evaluating using perplexity and traditional metrics like accuracy and F1 score. Results show a remarkable improvement in PHI-3.5’s MCQ handling post-fine-tuning, with perplexity decreasing from 4.68 to 2.27, and accuracy rising from 62% to 90.8%. This research underlines the importance of efficient models in adaptive learning systems and educational assessments, paving the way for broader integration into the classroom, particularly in fields like test preparation, student feedback, and personalized learning.
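
For readers less familiar with the metric, perplexity is conventionally the exponential of the average negative log-likelihood per token. The sketch below uses that standard definition; how the paper tokenizes and averages is not stated in the abstract, so the token log-probabilities here are toy values.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
print(ppl)
```

Read this way, a perplexity of 2.27 corresponds to a geometric-mean token probability of about 1/2.27 ≈ 0.44, compared with about 1/4.68 ≈ 0.21 before fine-tuning.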

[NLP-23] Predicting the Performance of Black-box LLMs through Self-Queries

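[Quick Read]: With large language models (LLMs) increasingly relied upon, this paper addresses how to predict when a model will make a mistake on a specific instance. In practice many users can only access models as black boxes through an API and cannot see the internal representations, so interpretation methods based on those representations do not apply. The paper proposes a black-box approach: it issues follow-up prompts, takes the probabilities of different responses as a representation of the model, and trains predictors of model behavior on these features. The key finding is that a linear model trained on these low-dimensional representations yields reliable, generalizable predictors of instance-level performance (for example, whether a particular generation answers a question correctly). These predictors can often outperform white-box linear predictors that operate on the model's hidden state or the full distribution over its vocabulary. The extracted features can also be used to assess more nuanced aspects of a language model's state, such as distinguishing between model architectures and sizes, or detecting misrepresented models served through an API (for example, identifying whether GPT-3.5 is supplied instead of GPT-4o-mini).

Link: https://arxiv.org/abs/2501.01558
Authors: Dylan Sam, Marc Finzi, J. Zico Kolter
Affiliation: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 28 pages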

Abstract:As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., if a particular generation correctly answers a question). Remarkably, these can often outperform white-box linear predictors that operate over a model’s hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of a language model’s state. For instance, they can be used to distinguish between a clean version of GPT-4o-mini and a version that has been influenced via an adversarial system prompt that answers question-answering tasks incorrectly or introduces bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).

[NLP-24] Many of Your DPOs are Secretly One: Attempting Unification Through Mutual Information

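[Quick Read]: This paper addresses the difficulty of understanding how the many variants of Direct Preference Optimisation (DPO), used in post-alignment of large language models (LLMs), relate to one another. It proposes a unifying framework inspired by mutual information and introduces a new loss function with flexible priors. By carefully specifying these priors, the paper shows that many existing algorithms, such as SimPO, TDPO, and SparsePO, can be derived from the framework. This gives researchers a clearer and more structured view of the relationships between DPO variants, simplifies the landscape of DPO algorithms, and lays a foundation for developing more robust and interpretable alignment techniques.

Link: https://arxiv.org/abs/2501.01544
Authors: Rasul Tutnov, Antoine Grosnit, Haitham Bou-Ammar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: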

Abstract:Post-alignment of large language models (LLMs) is critical in improving their utility, safety, and alignment with human intentions. Direct preference optimisation (DPO) has become one of the most widely used algorithms for achieving this alignment, given its ability to optimise models based on human feedback directly. However, the vast number of DPO variants in the literature has made it increasingly difficult for researchers to navigate and fully grasp the connections between these approaches. This paper introduces a unifying framework inspired by mutual information, which proposes a new loss function with flexible priors. By carefully specifying these priors, we demonstrate that many existing algorithms, such as SimPO, TDPO, SparsePO, and others, can be derived from our framework. This unification offers a clearer and more structured approach, allowing researchers to understand the relationships between different DPO variants better. We aim to simplify the landscape of DPO algorithms, making it easier for the research community to gain insights and foster further advancements in LLM alignment. Ultimately, we hope our framework can be a foundation for developing more robust and interpretable alignment techniques.
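
As background for the variants being unified, the widely used original DPO objective for one preference pair can be written as below. This is the standard DPO loss, not the paper's mutual-information loss with flexible priors (which the abstract does not give); the log-probability values in the example are made up.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    logp_* are sequence log-probabilities under the policy being trained,
    ref_logp_* the same under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(at_reference, improved)
```

Variants such as those named in the abstract typically change how the margin is built (for example, the reference terms or token-level weighting), which is the kind of relationship a unifying framework aims to make explicit.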

[NLP-25] A Metasemantic-Metapragmatic Framework for Taxonomizing Multimodal Communicative Alignment

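[Quick Read]: This paper argues that current cognitive-social computational and engineering approaches to human-machine multimodal communication overemphasize the semantic/metasemantic domain and overlook the key role of metapragmatic indexicality in traversing the semantic-pragmatic spectrum of communication. The key to the solution is a dynamic metasemantic-metapragmatic taxonomy grounded in the three basic communicative capacities identified by the American logician and pragmatist philosopher Charles Sanders Peirce: iconic (sensory and perceptual qualities), indexical (contextual and sociocultural associations), and rule-like (symbolic and intuitive reasoning). The paper further introduces the concept of indexical contextualization and proposes the principle of "contextualization directionality" to characterize the metapragmatic capacity for maintaining, navigating, or transitioning between semantic and pragmatic modes of multimodal communication. It also discusses the framework's broader implications for intentionality, identity, affect, and ethics.

Link: https://arxiv.org/abs/2501.01535
Authors: Eugene Yu Ji
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 34 pages, 1 figure, 3 tables. Draft presented at 2023 ZJU Logic and AI Summit EAI Workshop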

Abstract:Drawing on contemporary pragmatist philosophy and linguistic theories on cognition, meaning, and communication, this paper presents a dynamic, metasemantic-metapragmatic taxonomy for grounding and conceptualizing human-like multimodal communicative alignment. The framework is rooted in contemporary developments of the three basic communicative capacities initially identified by American logician and pragmatist philosopher Charles Sanders Peirce: iconic (sensory and perceptual qualities), indexical (contextual and sociocultural associations), and rule-like (symbolic and intuitive reasoning). Expanding on these developments, I introduce the concept of indexical contextualization and propose the principle of “contextualization directionality” for characterizing the crucial metapragmatic capacity for maintaining, navigating, or transitioning between semantic and pragmatic modes of multimodal communication. I contend that current cognitive-social computational and engineering methodologies disproportionately emphasize the semantic/metasemantic domain, overlooking the pivotal role of metapragmatic indexicality in traversing the semantic-pragmatic spectrum of communication. The framework’s broader implications for intentionality, identity, affect, and ethics in within-modal and cross-modal human-machine alignment are also discussed.

[NLP-26] Improving Robustness Estimates in Natural Language Explainable AI though Synonymity Weighted Similarity Measures

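[Quick Read]: This paper addresses the reliability of Explainable AI (XAI) methods in the face of adversarial examples. Existing XAI methods may not provide reliable explanations: under adversarial perturbation the explanation can change while the output of the original model stays the same. The core of the solution is to amend existing information retrieval measures with synonymity weighting, so that the vulnerability of XAI methods to adversarial examples is assessed more accurately. The amendment uses information that is otherwise discarded in the adversarial XAI process, namely the synonymity of the perturbed words, and thereby improves estimates of the actual weakness of XAI methods.

Link: https://arxiv.org/abs/2501.01516
Authors: Christopher Burger
Affiliation: University of Mississippi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 2 figures, 4 tables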

Abstract:Explainable AI (XAI) has seen a surge in recent interest with the proliferation of powerful but intractable black-box models. Moreover, XAI has come under fire for techniques that may not offer reliable explanations. As many of the methods in XAI are themselves models, adversarial examples have been prominent in the literature surrounding the effectiveness of XAI, with the objective of these examples being to alter the explanation while maintaining the output of the original model. For explanations in natural language, it is natural to use measures found in the domain of information retrieval for use with ranked lists to guide the adversarial XAI process. We show that the standard implementation of these measures are poorly suited for the comparison of explanations in adversarial XAI and amend them by using information that is discarded, the synonymity of perturbed words. This synonymity weighting produces more accurate estimates of the actual weakness of XAI methods to adversarial examples.

[NLP-27] Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search AAAI2025

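[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on reasoning tasks. Although LLMs show remarkable capability across many tasks, reasoning remains a challenge. To improve it, the paper uses Monte Carlo Tree Search (MCTS) to generate process supervision data with the LLM itself and trains on that data. Specifically, the method samples reasoning steps with the LLM, assigns each step a score reflecting its "relative correctness", and then trains the LLM by minimizing the weighted log-likelihood of generating the reasoning steps. This generate-then-train process is repeated over multiple iterations. Experiments show that the method considerably improves LLM performance on two mathematical reasoning datasets, and that a model trained on one dataset also improves on the other, showing that the enhanced reasoning ability transfers. The key to the solution is using MCTS to generate high-quality process supervision data and iteratively training to improve the LLM's reasoning.

Link: https://arxiv.org/abs/2501.01478
Authors: Shuangtao Li, Shuaihao Dong, Kexin Luan, Xinhan Di, Chaofan Ding
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 1 figure, 2 tables; accepted by AAAI 2025 NeurMAD workshop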

Abstract:Large language models (LLMs) have demonstrated their remarkable capacity across a variety of tasks. However, reasoning remains a challenge for LLMs. To improve LLMs’ reasoning ability, process supervision has proven to be better than outcome supervision. In this work, we study using Monte Carlo Tree Search (MCTS) to generate process supervision data with LLMs themselves for training them. We sample reasoning steps with an LLM and assign each step a score that captures its “relative correctness,” and the LLM is then trained by minimizing weighted log-likelihood of generating the reasoning steps. This generate-then-train process is repeated iteratively until convergence. Our experimental results demonstrate that the proposed methods considerably improve the performance of LLMs on two mathematical reasoning datasets. Furthermore, models trained on one dataset also exhibit improved performance on the other, showing the transferability of the enhanced reasoning ability.

[NLP-28] Reinforcing Thinking through Reasoning-Enhanced Reward Models

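[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in deciding when to stop thinking during complex multi-step reasoning, which stems from limited self-awareness of their own knowledge boundaries. Human preference alignment has shown great potential, but it depends on expensive human-labeled data and is hard to scale in line with scaling laws. Language model self-critique, as an alternative, is questioned because of its inherited biases. The paper proposes Distillation-Reinforcement-Reasoning (DRR), a three-step framework that distills the LLM's own reasoning processes into synthetic behavioral data, removing the need to manually label intermediate steps. The framework first uses the Reasoner (the LLM) to generate behavioral data that reflects its reasoning ability, then trains a lightweight discriminative reward model (DM) on that data, and finally deploys the DM at inference time to assist the Reasoner's decisions. Experiments show that DRR outperforms self-critique approaches without relying on complex data annotation and, thanks to its lightweight design, ease of replication, and adaptability, applies to a wide range of LLM-centric tasks.

Link: https://arxiv.org/abs/2501.01457
Authors: Diji Yang, Linda Zeng, Kezhen Chen, Yi Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: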

Abstract:Large Language Models (LLMs) exhibit great potential in complex multi-step reasoning through inference-time thinking but still struggle with deciding when to stop thinking due to limited self-awareness about their knowledge boundaries. While human preference alignment has shown extraordinary opportunities, expensive labeling challenges adherence to scaling law. Language model self-critique, as an alternative to using human-labeled reasoning data, is questioned with its inherited biases. This work addresses these challenges by distilling the LLM’s own reasoning processes into synthetic behavioral data, eliminating the need for manual labeling of intermediate steps. Building on this concept, we propose Distillation-Reinforcement-Reasoning (DRR), a three-step framework that leverages the LLM’s inherent behaviors as external feedback by first generating behavioral data using the Reasoner (LLM) to reflect its reasoning capabilities, then training a lightweight discriminative reward model (DM) on behavioral data, and finally deploying the DM at inference time to assist the Reasoner’s decision-making. Experiments on multiple benchmarks show that the DRR framework outperforms self-critique approaches without relying on additional complex data annotation. Benefiting from lightweight design, ease of replication, and adaptability, DRR is applicable to a wide range of LLM-centric tasks.

Computer Vision

[CV-0] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

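[Quick Read]: This paper addresses the insufficient integration of vision and speech in Multimodal Large Language Models (MLLMs), in particular the underappreciated role of speech in multimodal dialogue systems. Because the vision and speech modalities differ fundamentally, achieving high performance on both remains a major challenge. The proposed solution is a carefully designed multi-stage training methodology that progressively trains the large language model (LLM) to understand visual and speech information, enabling fluent vision and speech interaction. The key is that the approach preserves strong vision-language capability while adding efficient speech-to-speech dialogue without separate automatic speech recognition (ASR) and text-to-speech (TTS) modules, which significantly accelerates multimodal end-to-end response. Comparisons against state-of-the-art models on image, video, and speech benchmarks show that the model has strong visual and speech capabilities and supports near real-time vision and speech interaction.

Link: https://arxiv.org/abs/2501.01957
Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Affiliation: NJU (Nanjing University); Tencent Youtu Lab; XMU (Xiamen University); CASIA (Institute of Automation, Chinese Academy of Sciences)
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: this https URL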

Abstract:Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, making near real-time vision and speech interaction.

[CV-1] VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment

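[Quick Read]: This paper targets the key challenge of efficiently reconstructing accurate 3D models from monocular video, which matters for applications in virtual reality, robotics, and scene understanding. Existing approaches typically require pre-computed camera parameters and frame-by-frame reconstruction pipelines, which are prone to error accumulation and incur significant computational overhead. To address these limitations, the paper proposes VideoLifter, a framework that uses geometric priors from a learnable model to incrementally optimize a globally sparse-to-dense 3D representation directly from video sequences. VideoLifter splits the video sequence into local windows, matches and registers frames, builds consistent fragments, and aligns these fragments hierarchically into a unified 3D model. By tracking and propagating sparse point correspondences across frames and fragments, it incrementally refines camera poses and 3D structure and minimizes reprojection error, improving accuracy and robustness. The approach significantly accelerates reconstruction, cutting training time by over 82% while surpassing current state-of-the-art methods in visual fidelity and computational efficiency.

Link: https://arxiv.org/abs/2501.01949
Authors: Wenyan Cong, Kevin Wang, Jiahui Lei, Colton Stearns, Yuanhao Cai, Dilin Wang, Rakesh Ranjan, Matt Feiszli, Leonidas Guibas, Zhangyang Wang, Weiyao Wang, Zhiwen Fan
Affiliation: UT Austin; UPenn; Stanford; JHU; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL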

Abstract:Efficiently reconstructing accurate 3D models from monocular video is a key challenge in computer vision, critical for advancing applications in virtual reality, robotics, and scene understanding. Existing approaches typically require pre-computed camera parameters and frame-by-frame reconstruction pipelines, which are prone to error accumulation and entail significant computational overhead. To address these limitations, we introduce VideoLifter, a novel framework that leverages geometric priors from a learnable model to incrementally optimize a globally sparse to dense 3D representation directly from video sequences. VideoLifter segments the video sequence into local windows, where it matches and registers frames, constructs consistent fragments, and aligns them hierarchically to produce a unified 3D model. By tracking and propagating sparse point correspondences across frames and fragments, VideoLifter incrementally refines camera poses and 3D structure, minimizing reprojection error for improved accuracy and robustness. This approach significantly accelerates the reconstruction process, reducing training time by over 82% while surpassing current state-of-the-art methods in visual fidelity and computational efficiency.

[CV-2] Bridging Classification and Segmentation in Osteosarcoma Assessment via Foundation and Discrete Diffusion Models

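[Quick Read]: This paper addresses the subjectivity and variability of necrosis assessment in osteosarcoma whole slide images (WSIs). Traditional manual assessment is subjective, and results vary between assessors. The authors propose FDDM, a new framework that combines patch-based classification with region-based refinement of the segmentation, integrating information across patches. The key innovation is its two-stage operation: patch-level classification first, followed by region-level refinement that further improves segmentation accuracy. On a newly curated osteosarcoma image dataset, FDDM outperforms existing methods in segmentation and necrosis rate estimation, with improvements of up to 10% in mIoU (mean intersection over union) and 32.12% in necrosis rate estimation. The framework sets a new benchmark for osteosarcoma assessment and shows the potential of foundation models and diffusion-based refinement in complex medical imaging tasks.

Link: https://arxiv.org/abs/2501.01932
Authors: Manh Duong Nguyen, Dac Thai Nguyen, Trung Viet Nguyen, Homi Yamada, Huy Hieu Pham, Phi Le Nguyen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the 2025 IEEE International Symposium on Biomedical Imaging (ISBI 2025)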

Abstract:Osteosarcoma, the most common primary bone cancer, often requires accurate necrosis assessment from whole slide images (WSIs) for effective treatment planning and prognosis. However, manual assessments are subjective and prone to variability. In response, we introduce FDDM, a novel framework bridging the gap between patch classification and region-based segmentation. FDDM operates in two stages: patch-based classification, followed by region-based refinement, enabling cross-patch information integration. Leveraging a newly curated dataset of osteosarcoma images, FDDM demonstrates superior segmentation performance, achieving up to a 10% improvement in mIoU and a 32.12% enhancement in necrosis rate estimation over state-of-the-art methods. This framework sets a new benchmark in osteosarcoma assessment, highlighting the potential of foundation models and diffusion-based refinements in complex medical imaging tasks.
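
The segmentation metric quoted above, mean intersection over union, is usually computed per class and then averaged. The sketch below shows that standard computation on flattened label lists; the labels are toy values, and whether the paper averages over images, classes, or both is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy flattened label maps with two classes (for example 0 = viable, 1 = necrotic).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
miou = mean_iou(pred, target, num_classes=2)
print(miou)
```

In this toy case each class overlaps on 2 of 4 pixels in its union, so both per-class IoUs and their mean are 0.5.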

[CV-3] Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding

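[Quick Read]: This paper targets hallucinations in large vision-language models (LVLMs) on complex generation tasks, that is, inconsistencies between the generated text and the visual input. Existing methods use inference-time interventions such as contrastive decoding and attention rectification to reduce over-reliance on language priors, but they overlook hallucinations caused by spurious inter-modality correlations. The paper proposes a training-free method, Inter-Modality Correlation Calibration Decoding (IMCCD). Its core is a Cross-Modal Value-Enhanced Decoding (CMVED) module that mitigates hallucination through a new contrastive decoding mechanism: when estimating the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, which addresses both uni-modality over-reliance and misleading inter-modality correlations. In addition, a Content-Driven Attention Refinement (CDAR) module refines the cross-modal attention weights and guides the model toward important visual content. Experiments show that the method outperforms existing techniques at reducing hallucinations in LVLM text generation.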

Link: https://arxiv.org/abs/2501.01926
Authors: Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; The University of Hong Kong; Meituan; Research Institute, Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding and attention rectification, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from spurious inter-modality correlations. In this paper, we propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination through a novel contrastive decoding mechanism. During the estimation of the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Content-Driven Attention Refinement (CDAR) module refines cross-modal attention weights, guiding LVLMs to focus on important visual content. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLM text generation. Our code will be available at this https URL.
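
A minimal sketch of the general contrastive decoding idea the abstract builds on. The scoring rule below, `(1 + alpha) * original - alpha * distorted` with a plausibility cutoff `beta`, is a commonly used generic form and is an assumption here; the abstract does not give IMCCD's exact formula, and the logits are invented toy values. In CMVED the distorted branch would come from masking value vectors; here it is simply passed in.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token(logits_orig, logits_distorted, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting original and distorted logits.

    Assumed generic form (not the paper's exact rule):
        score = (1 + alpha) * logits_orig - alpha * logits_distorted
    Only tokens whose original probability is at least beta * max probability
    are considered (a plausibility constraint).
    """
    probs = softmax(logits_orig)
    cutoff = beta * max(probs)
    best, best_score = None, -math.inf
    for i, (lo, ld) in enumerate(zip(logits_orig, logits_distorted)):
        if probs[i] < cutoff:
            continue  # implausible under the original model, never selected
        score = (1 + alpha) * lo - alpha * ld
        if score > best_score:
            best, best_score = i, score
    return best


# Toy vocabulary: 0 = token favored by the language prior, 1 = visually grounded token.
logits_orig = [2.0, 1.8, -3.0]
logits_distorted = [2.5, 0.5, -3.0]  # the distorted branch leans harder on the prior

greedy = max(range(len(logits_orig)), key=lambda i: logits_orig[i])
print(greedy, contrastive_next_token(logits_orig, logits_distorted))
```

Tokens that stay likely even when the visual evidence is degraded are penalized, so the grounded token wins even though plain greedy decoding would pick the prior-driven one.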

[CV-4] Transformer-Driven Inverse Problem Transform for Fast Blind Hyperspectral Image Dehazing

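[Quick Read]: This paper addresses hyperspectral dehazing (HyDHZ): recovering a clean hyperspectral image (HSI) from haze-corrupted hyperspectral remote sensing imagery. Dehazing matters for downstream identification and classification, because data from the airborne visible/infrared imaging spectrometer (AVIRIS) show that typical hyperspectral remote sensing images contain large haze-corrupted areas.

The key idea is to reformulate hyperspectral dehazing as a spectral super-resolution (SSR) problem. The algorithm first automatically selects some uncorrupted or informative spectral bands, then applies SSR to spectrally upsample those bands in the feature space, which yields a preliminary clean HSI. A deep transformer network with a global attention mechanism, designed to capture nonlocal information, then refines this image. The work introduces the spatial-spectral transformer to hyperspectral dehazing for the first time, and the algorithm is blind: the user does not have to select the corrupted region manually. Experiments show a clear advantage in reducing color distortion.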

Link: https://arxiv.org/abs/2501.01924
Authors: Po-Wei Tang, Chia-Hsiang Lin, Yangrui Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This work has been accepted by IEEE Transactions on Geoscience and Remote Sensing (TGRS)

Abstract:Hyperspectral dehazing (HyDHZ) has become a crucial signal processing technology to facilitate the subsequent identification and classification tasks, as the airborne visible/infrared imaging spectrometer (AVIRIS) data portal reports a massive portion of haze-corrupted areas in typical hyperspectral remote sensing images. The idea of inverse problem transform (IPT) has been proposed in recent remote sensing literature in order to reformulate a hardly tractable inverse problem (e.g., HyDHZ) into a relatively simple one. Considering the emerging spectral super-resolution (SSR) technique, which spectrally upsamples multispectral data to hyperspectral data, we aim to solve the challenging HyDHZ problem by reformulating it as an SSR problem. Roughly speaking, the proposed algorithm first automatically selects some uncorrupted/informative spectral bands, from which SSR is applied to spectrally upsample the selected bands in the feature space, thereby obtaining a clean hyperspectral image (HSI). The clean HSI is then further refined by a deep transformer network to obtain the final dehazed HSI, where a global attention mechanism is designed to capture nonlocal information. There are very few HyDHZ works in existing literature, and this article introduces the powerful spatial-spectral transformer into HyDHZ for the first time. Remarkably, the proposed transformer-driven IPT-based HyDHZ (T2HyDHZ) is a blind algorithm without requiring the user to manually select the corrupted region. Extensive experiments demonstrate the superiority of T2HyDHZ with less color distortion.

[CV-5] Detecting and Mitigating Adversarial Attacks on Deep Learning-Based MRI Reconstruction Without Any Retraining

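[Quick Read]: This paper addresses the vulnerability of deep learning (DL) methods to adversarial attacks when reconstructing sub-sampled magnetic resonance imaging (MRI) data. Physics-driven DL methods perform very well on MRI reconstruction, but small adversarial perturbations of the input can severely distort the output image. Existing mitigation strategies usually require retraining the model and may lower reconstruction quality on unperturbed (clean) inputs. The paper proposes a method that detects and mitigates adversarial attacks without any retraining. It relies on cyclic measurement consistency: the model output is mapped to MRI measurements for a different sub-sampling pattern, and that synthesized data is reconstructed with the same model. Without an attack the two reconstructions should agree; with an attack they diverge. Building on this idea, the authors design a new objective function and mitigate the attack by minimizing it within a small ball around the attacked input. Experiments show that the method substantially reduces the effect of adversarial perturbations across datasets, attack types and strengths, and physics-driven DL networks, and that it outperforms conventional retraining-based mitigation both qualitatively and quantitatively.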

Link: https://arxiv.org/abs/2501.01908
Authors: Mahdi Saberi, Chi Zhang, Mehmet Akcakaya
Affiliations: Department of Electrical and Computer Engineering, University of Minnesota; Center for Magnetic Resonance Research, University of Minnesota; Department of Radiology, Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning (DL) methods, especially those based on physics-driven DL, have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, or attacks, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining and may lower reconstruction quality for non-perturbed/clean inputs. In this work, we propose a novel approach for detecting and mitigating adversarial attacks on MRI reconstruction models without any retraining. Our detection strategy is based on the idea of cyclic measurement consistency. The output of the model is mapped to another set of MRI measurements for a different sub-sampling pattern, and this synthesized data is reconstructed with the same model. Intuitively, without an attack, the second reconstruction is expected to be consistent with the first, while with an attack, disruptions are present. Subsequently, this idea is extended to devise a novel objective function, which is minimized within a small ball around the attack input for mitigation. Experimental results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods that involve retraining.
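
The cyclic measurement consistency check described above can be sketched on a toy 1D signal. Everything here is a stand-in chosen for illustration: the "model" is plain linear interpolation rather than a physics-driven network, the forward operator simply drops samples rather than sub-sampling k-space, and the perturbation is far larger than a real adversarial attack. Only the control flow (reconstruct, re-measure with a different pattern, reconstruct again, compare) mirrors the abstract.

```python
def undersample(signal, mask):
    """Forward operator: keep samples where mask is True, drop the rest (None)."""
    return [s if m else None for s, m in zip(signal, mask)]


def reconstruct(measured):
    """Stand-in 'model': fill missing samples by linear interpolation."""
    known = [i for i, v in enumerate(measured) if v is not None]
    out = []
    for i, v in enumerate(measured):
        if v is not None:
            out.append(v)
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out.append(measured[right])
        elif right is None:
            out.append(measured[left])
        else:
            w = (i - left) / (right - left)
            out.append((1 - w) * measured[left] + w * measured[right])
    return out


def cyclic_gap(measured, other_mask):
    """Reconstruct, re-measure with a different mask, reconstruct again, compare."""
    first = reconstruct(measured)
    second = reconstruct(undersample(first, other_mask))
    return max(abs(a - b) for a, b in zip(first, second))


signal = [float(i) for i in range(9)]                 # smooth toy signal
mask_a = [i % 2 == 0 for i in range(9)]               # acquired pattern
mask_b = [i in (0, 1, 3, 5, 7, 8) for i in range(9)]  # different pattern for the check

clean = undersample(signal, mask_a)
attacked = list(clean)
attacked[4] += 3.0  # perturb one acquired sample

print(cyclic_gap(clean, mask_b), cyclic_gap(attacked, mask_b))
```

A detector would threshold this gap; the mitigation step in the paper goes further and minimizes a related objective over inputs near the attacked one, which this sketch does not attempt.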

[CV-6] Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

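[Quick Read]: This paper tackles the challenge of building slow-thinking reasoning systems on multimodal large language models (MLLMs). Because MLLMs must handle complex data semantics across modalities, a multimodal slow-thinking system is harder to build than a single-modality one. The proposed solution fine-tunes an existing MLLM on a small amount of textual long-form thought data, producing a multimodal slow-thinking system called Virgo (Visual reasoning with long thought). The study finds that long-form reasoning expressed in natural language transfers effectively to MLLMs, and that textual reasoning data can be even more effective than visual reasoning data at eliciting slow-thinking capacities. This suggests that slow-thinking ability is tied mainly to the language model component and can transfer across modalities or domains, which offers guidance for building stronger slow-thinking reasoning systems.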

Link: https://arxiv.org/abs/2501.01904
Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Baichuan AI; BAAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical Report on Slow Thinking with LLMs: Visual Reasoning

Abstract:Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at this https URL.

[CV-7] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

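[Quick Read]: This paper addresses future space generation for robotic manipulation, in particular long-sequence generation and motion modeling in complex environments. The proposed EnerVerse framework integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, which ensures low-level consistency and continuity. To handle redundancy in video data, it combines a sparse memory context with a chunkwise unidirectional generative paradigm so that arbitrarily long sequences can be generated. It also introduces the Free Anchor View (FAV) space, which offers flexible viewpoints for observation and analysis, reduces ambiguity in motion modeling, removes physical constraints in confined environments, and markedly improves the robot's generalization and adaptability across tasks and settings. Finally, the paper presents a data engine pipeline that couples a generative model with 4D Gaussian Splatting (4DGS) and iteratively improves data quality and diversity, narrowing the sim-to-real gap. Experiments show that the framework substantially improves policy prediction, especially on long-range robotic manipulation tasks.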

Link: https://arxiv.org/abs/2501.01895
Authors: Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL

Abstract:We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot’s generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model’s robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.

[CV-8] ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation

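[Quick Read]: This paper introduces Crowd Volume Estimation (CVE): estimating the collective body volume of a crowd from RGB images alone. CVE has applications in event management, public safety, infrastructure stress assessment, and weight balancing. The main contributions are: 1) the first CVE benchmark, built on the ANTHROPOS-V dataset of synthetic photorealistic videos of crowds in diverse urban environments, annotated with each person's volume, SMPL shape parameters, and keypoints; 2) an exploration of evaluation metrics relevant to CVE, together with baseline models adapted from the Human Mesh Recovery and Crowd Counting domains; 3) a CVE-specific method that outperforms those baselines. Although the dataset is synthetic, the weights and heights of individuals follow the real-world population distribution across genders, and models trained on it transfer to the downstream task of CVE from real images.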

Link: https://arxiv.org/abs/2501.01877
Authors: Luca Collorone, Stefano D’Arrigo, Massimiliano Pappa, Guido Maria D’Amely di Melendugno, Giovanni Ficarra, Fabio Galasso
Affiliations: Sapienza University of Rome
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce the novel task of Crowd Volume Estimation (CVE), defined as the process of estimating the collective body volume of crowds using only RGB images. Besides event management and public safety, CVE can be instrumental in approximating body weight, unlocking weight sensitive applications such as infrastructure stress assessment, and assuring even weight balance. We propose the first benchmark for CVE, comprising ANTHROPOS-V, a synthetic photorealistic video dataset featuring crowds in diverse urban environments. Its annotations include each person’s volume, SMPL shape parameters, and keypoints. Also, we explore metrics pertinent to CVE, define baseline models adapted from Human Mesh Recovery and Crowd Counting domains, and propose a CVE specific methodology that surpasses baselines. Although synthetic, the weights and heights of individuals are aligned with the real-world population distribution across genders, and they transfer to the downstream task of CVE from real images. Benchmark and code are available at this http URL.

[CV-9] Towards Hard and Soft Shadow Removal via Dual-Branch Separation Network and Vision Transformer ICML

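[Quick Read]: This paper addresses image shadow removal in computer vision. In real scenes, shadows change image color and brightness, which complicates perception and texture recognition. Traditional and deep learning methods usually ignore the different requirements of hard shadows and soft shadows and lack processing tailored to each type. The paper proposes a dual-path model that handles hard and soft shadows separately with specially designed loss functions. The model first classifies the shadow type and then routes the image through the appropriate path to produce a shadow-free output. To improve edge detail and feature fusion, it combines a Vision Transformer with UNet++. The model reaches an RMSE of 2.905 on the ISTD dataset, better than existing single-path methods.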

Link: https://arxiv.org/abs/2501.01864
Authors: Jiajia Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, IEEE International Conference on Machine Learning and Cybernetics (ICMLC) 2024

Abstract:Image shadow removal is a crucial task in computer vision. In real-world scenes, shadows alter image color and brightness, posing challenges for perception and texture recognition. Traditional and deep learning methods often overlook the distinct needs for handling hard and soft shadows, thereby lacking detailed processing to specifically address each type of shadow. We propose a dual-path model that processes these shadows separately, using specially designed loss functions for hard and soft shadow removal. The model classifies shadow types and processes them through appropriate paths to produce shadow-free outputs, integrating a Vision Transformer with UNet++ for enhanced edge detail and feature fusion. Our model outperforms state-of-the-art methods and achieves an RMSE of 2.905 on the ISTD dataset, which demonstrates greater effectiveness than typical single-path approaches.

[CV-10] UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery

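[Quick Read]: This paper addresses unmanned aerial vehicle object detection (UAV-OD), where existing algorithms depend on manually designed components that require extensive tuning. Existing end-to-end models are designed mainly for natural images and perform poorly on UAV imagery. The paper proposes UAV-DETR, an efficient detection transformer (DETR) framework tailored to UAV imagery. Its main components are: 1) a multi-scale feature fusion with frequency enhancement module that captures spatial and frequency information at different scales; 2) a frequency-focused down-sampling module that retains critical spatial details during down-sampling; 3) a semantic alignment and calibration module that aligns and fuses features from different fusion paths. Experiments show that the method is effective and generalizes across several UAV imagery datasets; on VisDrone, AP and AP_50 improve by 3.1% and 4.2% respectively.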

Link: https://arxiv.org/abs/2501.01855
Authors: Huaxiang Zhang, Kai Liu, Zhongxue Gan, Guo-Niu Zhu
Affiliations: Academy for Engineering and Technology, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1% and AP_50 by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page: this https URL

[CV-11] Semantic Segmentation for Sequential Historical Maps by Learning from Only One Map

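[Quick Read]: This paper addresses the lack of ground-truth annotations when digitizing historical maps, which limits the training of deep-learning-based semantic segmentation models. It proposes a weakly-supervised age-tracing strategy that exploits the similarity in appearance and land-use patterns between historical maps from neighboring time periods: the model's predictions for one map serve as pseudo-labels for fine-tuning on maps from adjacent periods. The method clearly improves segmentation. In the best case the mean Intersection over Union (mIoU) reaches 77.3%, about 20% above the baseline, and the model's average overall accuracy reaches 97%.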

Link: https://arxiv.org/abs/2501.01845
Authors: Yunshuang Yuan, Frank Thiemann, Monika Sester
Affiliations: Leibniz University Hannover
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Historical maps are valuable resources that capture detailed geographical information from the past. However, these maps are typically available in printed formats, which are not conducive to modern computer-based analyses. Digitizing these maps into a machine-readable format enables efficient computational analysis. In this paper, we propose an automated approach to digitization using deep-learning-based semantic segmentation, which assigns a semantic label to each pixel in scanned historical maps. A key challenge in this process is the lack of ground-truth annotations required for training deep neural networks, as manual labeling is time-consuming and labor-intensive. To address this issue, we introduce a weakly-supervised age-tracing strategy for model fine-tuning. This approach exploits the similarity in appearance and land-use patterns between historical maps from neighboring time periods to guide the training process. Specifically, model predictions for one map are utilized as pseudo-labels for training on maps from adjacent time periods. Experiments conducted on our newly curated Hameln dataset demonstrate that the proposed age-tracing strategy significantly enhances segmentation performance compared to baseline models. In the best-case scenario, the mean Intersection over Union (mIoU) achieved 77.3%, reflecting an improvement of approximately 20% over baseline methods. Additionally, the fine-tuned model achieved an average overall accuracy of 97%, highlighting the effectiveness of our approach for digitizing historical maps.
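
The pseudo-label idea above can be sketched with a deliberately tiny stand-in. The assumptions are loud: each "pixel" is a single intensity value, the "model" is a nearest-centroid classifier instead of a segmentation network, and the appearance drift between map editions is a fixed shift. The helper names (`fit_centroids`, `make_map`, and so on) are hypothetical and exist only for this example.

```python
def fit_centroids(pixels, labels):
    """'Train' a nearest-centroid classifier: mean intensity per class."""
    sums, counts = {}, {}
    for x, y in zip(pixels, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, pixels):
    return [min(centroids, key=lambda c: abs(x - centroids[c])) for x in pixels]


def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def make_map(shift):
    """Toy map sheet: class 0 near 0.2, class 1 near 0.8, plus an appearance drift."""
    base = [0.15, 0.20, 0.25, 0.75, 0.80, 0.85]
    return [b + shift for b in base], [0, 0, 0, 1, 1, 1]


maps = [make_map(0.15 * t) for t in range(4)]  # four editions, drifting over time

# Only the first map has ground-truth labels.
model = fit_centroids(*maps[0])
baseline_acc = accuracy(predict(model, maps[3][0]), maps[3][1])

# Age-tracing: step through neighboring editions, refitting on pseudo-labels.
traced = model
for pixels, _unused_truth in maps[1:]:
    pseudo = predict(traced, pixels)
    traced = fit_centroids(pixels, pseudo)
traced_acc = accuracy(predict(traced, maps[3][0]), maps[3][1])

print(baseline_acc, traced_acc)
```

Applying the first-edition model directly to the last edition fails on the drifted class, while stepping through adjacent editions keeps each pseudo-labeling step within reach of the previous model.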

[CV-12] Dedicated Inference Engine and Binary-Weight Neural Networks for Lightweight Instance Segmentation CVPR2024

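[Quick Read]: This paper addresses the high computational cost of embedded systems, specifically the hardware architecture design needed to run modern Binary-weight Neural Networks (BNNs). It proposes a design methodology for inference-engine hardware in which multiply-accumulate (MAC) operations are simplified by replacing multiplications with bitwise operations, which reduces hardware cost. By removing part of the computational cost from the hardware, the approach cuts the gate count of the inference engine and computes BNN inference with only 52% of the hardware cost of related works. The paper also proposes two lightweight instance segmentation networks that combine the SegNeXt backbones with the SparseInst decoder, to show that the engine can handle practical applications. Experiments show that the engine runs these networks and achieves higher accuracy than YOLACT on the "Person" category even though the model is 77.7× smaller.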

Link: https://arxiv.org/abs/2501.01841
Authors: Tse-Wei Chen, Wei Tao, Dongyue Zhao, Kazuhiro Mima, Tadayuki Ito, Kinya Osa, Masami Kato
Affiliations: Device Technology Development Headquarters, Canon Inc.; Canon Innovative Solution (Beijing) Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
Comments: Camera-ready version for CVPR 2024 workshop (Embedded Vision Workshop)

Abstract:Reducing computational costs is an important issue for development of embedded systems. Binary-weight Neural Networks (BNNs), in which weights are binarized and activations are quantized, are employed to reduce computational costs of various kinds of applications. In this paper, a design methodology of hardware architecture for inference engines is proposed to handle modern BNNs with two operation modes. Multiply-Accumulate (MAC) operations can be simplified by replacing multiply operations with bitwise operations. The proposed method can effectively reduce the gate count of inference engines by removing a part of computational costs from the hardware system. The architecture of MAC operations can calculate the inference results of BNNs efficiently with only 52% of hardware costs compared with the related works. To show that the inference engine can handle practical applications, two lightweight networks which combine the backbones of SegNeXt and the decoder of SparseInst for instance segmentation are also proposed. The output results of the lightweight networks are computed using only bitwise operations and add operations. The proposed inference engine has lower hardware costs than related works. The experimental results show that the proposed inference engine can handle the proposed instance-segmentation networks and achieves higher accuracy than YOLACT on the “Person” category although the model size is 77.7× smaller compared with YOLACT.
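
Why binary weights remove the multiplier can be shown in a few lines. This is a software sketch of the arithmetic only, not the paper's hardware design: the bit encoding (bit 1 means weight +1, bit 0 means weight -1) and the toy values are assumptions made for the example.

```python
def mac_reference(weights, activations):
    """Ordinary multiply-accumulate with weights in {-1, +1}."""
    return sum(w * a for w, a in zip(weights, activations))


def pack_signs(weights):
    """Pack weight signs into one integer: bit i is 1 when w_i = +1 (assumed encoding)."""
    bits = 0
    for i, w in enumerate(weights):
        if w > 0:
            bits |= 1 << i
    return bits


def mac_binary(weight_bits, activations):
    """Same result without multiplications: test a bit, then add or subtract."""
    acc = 0
    for i, a in enumerate(activations):
        if (weight_bits >> i) & 1:
            acc += a
        else:
            acc -= a  # in hardware: add the two's complement of a
    return acc


weights = [1, -1, -1, 1, 1, -1]
activations = [3, 7, 0, 2, 5, 1]  # quantized activations, e.g. unsigned 3-bit values

print(mac_reference(weights, activations), mac_binary(pack_signs(weights), activations))
```

Each weight costs one bit of storage and each product collapses to a conditional sign flip, which is the kind of saving the reported gate-count reduction relies on.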

[CV-13] MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning

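[Quick Read]: This paper addresses the difficulty existing image captioning methods have in combining domain-specific and general knowledge. Specialized models capture domain-specific details well but generalize poorly, while vision-language models (VLMs) built on large language models (LLMs) draw on general knowledge but adapt poorly to domain-specific tasks. The paper proposes MoColl, an agent-enhanced model collaboration framework that decomposes a complex captioning task into a series of interconnected question-answer subtasks and pairs a domain-specific visual question answering (VQA) model with an LLM-based agent. The VQA model performs the domain-specific visual analysis, while the LLM agent formulates the questions and synthesizes the answers into a coherent caption; the agent also guides the training of the VQA model to strengthen its domain-specific ability. Experiments on radiology report generation show a clear improvement in report quality.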

Link: https://arxiv.org/abs/2501.01834
Authors: Pu Yang, Bin Dong
Affiliations: School of Mathematical Sciences, Peking University; Beijing International Center for Mathematical Research, Center for Machine Learning Research, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Image captioning is a critical task at the intersection of computer vision and natural language processing, with wide-ranging applications across various domains. For complex tasks such as diagnostic report generation, deep learning models require not only domain-specific image-caption datasets but also the incorporation of relevant general knowledge to provide contextual accuracy. Existing approaches exhibit inherent limitations: specialized models excel in capturing domain-specific details but lack generalization, while vision-language models (VLMs) built on large language models (LLMs) leverage general knowledge but struggle with domain-specific adaptation. To address these limitations, this paper proposes a novel agent-enhanced model collaboration framework, which we call MoColl, designed to effectively integrate domain-specific and general knowledge. Specifically, our approach is to decompose complex image captioning tasks into a series of interconnected question-answer subtasks. A trainable visual question answering (VQA) model is employed as a specialized tool to focus on domain-specific visual analysis, answering task-specific questions based on image content. Concurrently, an LLM-based agent with general knowledge formulates these questions and synthesizes the resulting question-answer pairs into coherent captions. Beyond its role in leveraging the VQA model, the agent further guides its training to enhance its domain-specific capabilities. Experimental results on radiology report generation validate the effectiveness of the proposed framework, demonstrating significant improvements in the quality of generated reports.

[CV-14] Uncertainty-Aware Label Refinement on Hypergraphs for Personalized Federated Facial Expression Recognition

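[Quick Read]: This paper addresses facial expression recognition (FER) when privacy concerns make it hard to collect large-scale centralized data. It works within a personalized federated learning framework to recognize expressions in a decentralized setting. The key contribution is a new method, uncertainty-Aware label refineMent on hYpergraphs (AMY). Each local model contains an uncertainty estimation (UE) block and an expression classification (EC) block. The UE block uses a hypergraph to model complex high-order relationships between samples and folds those relationships into uncertainty features; a personalized uncertainty estimator then produces reliable uncertainty weights for the samples on the local client. The EC block runs label propagation on the hypergraph to obtain high-quality refined labels, which are used to retrain the expression classifier. The method alleviates heterogeneous sample uncertainty across clients and learns a robust personalized FER model on each client. On two challenging real-world facial expression databases it outperforms several existing methods, which supports the value of hypergraph modeling for uncertainty estimation and label refinement.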

Link: https://arxiv.org/abs/2501.01816
Authors: Hu Ding, Yan Yan, Yang Lu, Jing-Hao Xue, Hanzi Wang
Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University; Department of Statistical Science, University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most facial expression recognition (FER) models are trained on large-scale expression data with centralized learning. Unfortunately, collecting a large amount of centralized expression data is difficult in practice due to privacy concerns of facial images. In this paper, we investigate FER under the framework of personalized federated learning, which is a valuable and practical decentralized setting for real-world applications. To this end, we develop a novel uncertainty-Aware label refineMent on hYpergraphs (AMY) method. For local training, each local model consists of a backbone, an uncertainty estimation (UE) block, and an expression classification (EC) block. In the UE block, we leverage a hypergraph to model complex high-order relationships between expression samples and incorporate these relationships into uncertainty features. A personalized uncertainty estimator is then introduced to estimate reliable uncertainty weights of samples in the local client. In the EC block, we perform label propagation on the hypergraph, obtaining high-quality refined labels for retraining an expression classifier. Based on the above, we effectively alleviate heterogeneous sample uncertainty across clients and learn a robust personalized FER model in each client. Experimental results on two challenging real-world facial expression databases show that our proposed method consistently outperforms several state-of-the-art methods. This indicates the superiority of hypergraph modeling for uncertainty estimation and label refinement on the personalized federated FER task. The source code will be released at this https URL.
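
Label propagation on a hypergraph, mentioned for the EC block, can be sketched as follows. This is a generic random-walk style propagation and not the paper's formulation: the update rule `F = alpha * spread + (1 - alpha) * Y`, the unit hyperedge weights, and the toy hyperedges are all assumptions, and the uncertainty weights from the UE block are left out.

```python
def propagate_labels(hyperedges, Y, alpha=0.8, iters=20):
    """Refine soft labels Y (n nodes x k classes) over a hypergraph.

    Each iteration averages scores inside every hyperedge, then gives each
    node the mean of its hyperedges' averages, blended with its initial label.
    """
    n, k = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    member_of = [[e for e in hyperedges if i in e] for i in range(n)]
    for _ in range(iters):
        edge_mean = {
            e: [sum(F[j][c] for j in e) / len(e) for c in range(k)]
            for e in hyperedges
        }
        new_F = []
        for i in range(n):
            row = []
            for c in range(k):
                if member_of[i]:
                    spread = sum(edge_mean[e][c] for e in member_of[i]) / len(member_of[i])
                else:
                    spread = F[i][c]  # isolated node keeps its own score
                row.append(alpha * spread + (1 - alpha) * Y[i][c])
            new_F.append(row)
        F = new_F
    return F


# Five samples, two classes. Sample 2 carries a noisy label (class 1), but it
# shares a hyperedge with samples 0 and 1, which are labeled class 0.
hyperedges = [(0, 1, 2), (3, 4)]
Y = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

F = propagate_labels(hyperedges, Y)
refined = [max(range(2), key=row.__getitem__) for row in F]
print(refined)
```

The noisy sample is pulled toward the label shared by the rest of its hyperedge, which is the effect a refined label is meant to capture before retraining the classifier.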

[CV-15] MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation

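[Quick Read]: This paper addresses two problems in generating lifelike talking head videos: 1) there is no framework for modeling single basic emotional expressions, which limits the generation of complex emotions such as compound emotions; 2) there is no comprehensive dataset rich in human emotional expressions, which limits what models can learn. The paper contributes: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six basic emotions so that both single and compound emotional states can be synthesized precisely; 2) the DH-FaceEmoVid-150 dataset, which covers six common human emotional expressions and four types of compound emotions and so widens the training potential of emotion-driven models. It also proposes an emotion-to-latents module that uses multimodal inputs (audio, text, and labels) to make emotion control more flexible, supporting varied control inputs as well as emotion control from audio alone. Extensive quantitative and qualitative evaluation shows that MoEE combined with DH-FaceEmoVid-150 excels at generating complex emotional expressions and fine facial detail, setting a new benchmark for the field.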

Link: https://arxiv.org/abs/2501.01808
Authors: Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, Hujun Bao
Affiliations: Zhejiang Lab; Li Auto; Harbin Institute of Technology; Zhejiang University; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models. Furthermore, to enhance the flexibility of emotion control, we propose an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals (such as audio, text, and labels) to ensure more varied control inputs as well as the ability to control emotions using audio alone. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. These datasets will be publicly released.

[CV-16] JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing

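[Quick Read]: This paper addresses two challenges in editing lip shapes from input audio: precise lip-audio synchronization and high visual quality. It proposes JoyGen, a two-stage framework consisting of audio-driven lip motion generation and visual appearance synthesis. In the first stage, a 3D reconstruction model predicts identity coefficients and an audio2motion model predicts expression coefficients. Audio features are then combined with a facial depth map to provide comprehensive supervision for precise lip-audio synchronization during face generation. The authors also build a Chinese talking-face dataset with 130 hours of high-quality video and train JoyGen on the open-source HDTF dataset together with this curated dataset. Experiments show strong lip-audio synchronization and visual quality.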

Link: https://arxiv.org/abs/2501.01798
Authors: Qili Wang, Dajiang Wu, Zihang Xu, Junshi Huang, Jun Lv
Affiliations: JD.Com, Inc.; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Significant progress has been made in talking-face video generation research; however, precise lip-audio synchronization and high visual quality remain challenging in editing lip shapes based on input audio. This paper introduces JoyGen, a novel two-stage framework for talking-face generation, comprising audio-driven lip motion generation and visual appearance synthesis. In the first stage, a 3D reconstruction model and an audio2motion model predict identity and expression coefficients respectively. Next, by integrating audio features with a facial depth map, we provide comprehensive supervision for precise lip-audio synchronization in facial generation. Additionally, we constructed a Chinese talking-face dataset containing 130 hours of high-quality video. JoyGen is trained on the open-source HDTF dataset and our curated dataset. Experimental results demonstrate superior lip-audio synchronization and visual quality achieved by our method.

[CV-17] A Minimal Subset Approach for Efficient and Scalable Loop Closure

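[Quick Read]: This paper addresses the heavy computational load of loop closure detection in large-scale, long-term missions, where many candidate pairs must be identified, verified, and processed to establish edge connections for pose graph optimization. It proposes the Minimal Subset Approach (MSA), which optimizes the keyframe sampling strategy within a sliding window framework using two key factors: redundancy minimization and information preservation. MSA removes redundant keyframes while keeping essential information, which improves scalability and reduces computational overhead with performance comparable to baseline approaches. Experiments show that MSA performs consistently across a wide range of environments without manual parameter tuning.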

Link: https://arxiv.org/abs/2501.01791
Authors: Nikolaos Stathoulopoulos,Christoforos Kanellakis,George Nikolakopoulos
Affiliations: Robotics and AI Group, Department of Computer, Electrical and Space Engineering, Luleå University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 7 pages, 8 Figures, 2 Tables. Submitted

Abstract:Loop closure detection in large-scale and long-term missions can be computationally demanding due to the need to identify, verify, and process numerous candidate pairs to establish edge connections for the pose graph optimization. Keyframe sampling mitigates this by reducing the number of frames stored and processed in the back-end system. In this article, we address the gap in optimized keyframe sampling for the combined problem of pose graph optimization and loop closure detection. Our Minimal Subset Approach (MSA) employs an optimization strategy with two key factors, redundancy minimization and information preservation, within a sliding window framework to efficiently reduce redundant keyframes, while preserving essential information. This method delivers comparable performance to baseline approaches, while enhancing scalability and reducing computational overhead. Finally, we evaluate MSA on relevant publicly available datasets, showcasing that it consistently performs across a wide range of environments, without requiring any manual parameter tuning.

[CV-18] Ingredients: Blending Custom Photos with Video Diffusion Transformers

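[Quick read]: This paper addresses how to customize video generation by incorporating multiple specific identity (ID) photos. The key to the solution is a framework called Ingredients, which consists of three main modules: 1) a facial extractor that captures versatile and precise facial features for each identity from both global and local perspectives; 2) a multi-scale projector that maps face embeddings into the contextual space of the image query in video diffusion transformers; 3) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. With a carefully curated text-video dataset and a multi-stage training protocol, the framework performs strongly at turning custom photos into dynamic, personalized video content, which the authors present as a significant step toward more effective generative video control tools in Transformer-based architectures.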

Link: https://arxiv.org/abs/2501.01790
Authors: Zhengcong Fei,Debang Li,Di Qiu,Changqian Yu,Mingyuan Fan
Affiliations: Kunlun Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents a powerful framework to customize video creations by incorporating multiple specific identity (ID) photos, with video diffusion Transformers, referred to as Ingredients. Generally, our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image query in video diffusion transformers; (iii) an ID router that dynamically combines and allocates multiple ID embedding to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of proposed method, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architecture, compared to existing methods. The data, code, and model weights are publicly available at: this https URL.

[CV-19] Universal Online Temporal Calibration for Optimization-based Visual-Inertial Navigation Systems

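[Quick read]: This paper addresses calibration of the time offset between visual and inertial (IMU) sensors in 6-degree-of-freedom (6DoF) motion estimation. Precise time offset calibration is a prerequisite for accurate and robust tracking. To this end, the authors propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems. The key idea is to introduce the time offset t_d as a state parameter in the optimization residual model and to align the IMU state to the corresponding image timestamp using t_d, angular velocity, and translational velocity. This allows the time offset to be optimized together with the other tracking states. Because the method only modifies the structure of the residual model, it can be applied to various optimization frameworks with different tracking frontends. Experiments show that the method provides more accurate time offset estimation and faster convergence in the presence of noisy sensor data.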

Link: https://arxiv.org/abs/2501.01788
Authors: Yunfei Fan,Tianyu Zhao,Linan Guo,Chen Chen,Xin Wang,Fengyi Zhou
Affiliations: 1. PICO Technology Co., Ltd., Beijing, China; 2. China University of Mining and Technology (Beijing), Beijing, China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:6-Degree of Freedom (6DoF) motion estimation with a combination of visual and inertial sensors is a growing area with numerous real-world applications. However, precise calibration of the time offset between these two sensor types is a prerequisite for accurate and robust tracking. To address this, we propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems. Technically, we incorporate the time offset td as a state parameter in the optimization residual model to align the IMU state to the corresponding image timestamp using td, angular velocity and translational velocity. This allows the temporal misalignment td to be optimized alongside other tracking states during the process. As our method only modifies the structure of the residual model, it can be applied to various optimization-based frameworks with different tracking frontends. We evaluate our calibration method with both EuRoC and simulation data and extensive experiments demonstrate that our approach provides more accurate time offset estimation and faster convergence, particularly in the presence of noisy sensor data.
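
The alignment step described above can be pictured with a toy example. The sketch below is not the authors' implementation: it assumes a first-order (constant-velocity) model, handles position only (the rotational part driven by angular velocity is omitted), and solves for td in closed form instead of optimizing it jointly with the other tracking states. The function names `align_imu_position` and `estimate_td` are hypothetical.

```python
def align_imu_position(position, velocity, td):
    """Shift an IMU position to the image timestamp.

    Assumed first-order model: p(t + td) ~= p(t) + v * td.
    """
    return tuple(p + v * td for p, v in zip(position, velocity))


def estimate_td(imu_positions, imu_velocities, observed_positions):
    """Toy least-squares estimate of td.

    Minimizes sum ||observed - (p + v * td)||^2 over td, which gives
    td = sum(v . (observed - p)) / sum(v . v).
    """
    num = 0.0
    den = 0.0
    for p, v, o in zip(imu_positions, imu_velocities, observed_positions):
        for pi, vi, oi in zip(p, v, o):
            num += vi * (oi - pi)
            den += vi * vi
    if den == 0.0:
        raise ValueError("td is unobservable when all velocities are zero")
    return num / den


# Synthetic check: generate observations with a known offset, then recover it.
positions = [(0.0, 0.0), (1.0, 0.0)]
velocities = [(1.0, 0.0), (0.0, 2.0)]
true_td = 0.05
observed = [align_imu_position(p, v, true_td) for p, v in zip(positions, velocities)]
td_hat = estimate_td(positions, velocities, observed)
```

The guard on zero velocity reflects a general property of this kind of model: the offset only becomes observable when the platform is moving.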

[CV-20] TCPFormer: Learning Temporal Correlation with Implicit Pose Proxy for 3D Human Pose Estimation

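[Quick read]: This paper addresses a limitation of existing multi-frame lifting methods for 3D human pose estimation: they ignore the intricate dependencies within the 2D pose sequence and learn only a single temporal correlation. To overcome this, the authors propose TCPFormer, whose core idea is to introduce an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation, helping the model learn a more comprehensive temporal correlation of human motion. The solution rests on three modules: the Proxy Update Module (PUM), the Proxy Invocation Module (PIM), and the Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy so that it stores representative information from the pose sequence; PIM then invokes the pose proxy and integrates it with the pose sequence to enhance the motion semantics of each pose; finally, PAM leverages the mapping between the pose sequence and the pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments show that TCPFormer outperforms existing state-of-the-art methods on the Human3.6M and MPI-INF-3DHP datasets.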

Link: https://arxiv.org/abs/2501.01770
Authors: Jiajie Liu,Mengyuan Liu,Hong Liu,Wenhao Li
Affiliations: 1. Peking University; 2. Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)

Abstract:Recent multi-frame lifting methods have dominated 3D human pose estimation. However, previous methods ignore the intricate dependence within the 2D pose sequence and learn a single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation, therefore helping us learn a more comprehensive temporal correlation of human motion. Specifically, our method consists of three key components: Proxy Update Module (PUM), Proxy Invocation Module (PIM), and Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invokes and integrates the pose proxy with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages the above mapping between the pose sequence and pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our proposed TCPFormer outperforms the previous state-of-the-art methods.

[CV-21] LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction

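[Quick read]: This paper addresses logical anomaly detection (AD), in particular how to identify anomalies in applications such as industrial inspection by understanding the logical relationships and consistency within an image. Traditional methods rely on prior knowledge and extensive manual annotation, and require substantial computing resources and large amounts of training data. The paper proposes a solution based on autoregressive, multimodal Vision Language Models (AVLMs), combined with format embedding and a logic reasoner. On the public benchmark MVTec LOCO AD it clearly outperforms existing methods, reaching an AUROC of 86.0% and an F1-max of 83.7%, while also providing explanations of the anomalies. The key is to exploit the strong visual reasoning performance of AVLMs together with logical reasoning, enabling efficient anomaly detection without large amounts of annotated data.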

Link: https://arxiv.org/abs/2501.01767
Authors: Er Jin,Qihui Feng,Yongli Mou,Stefan Decker,Gerhard Lakemeyer,Oliver Simons,Johannes Stegmaier
Affiliations: 1: Institute for Software Technology, German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany; 2: Chair for Information Systems and Databases, RWTH Aachen University, Aachen, Germany; 3: Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany; 4: Institute for Theoretical Physics, University of Cologne, Cologne, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL

Abstract:Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image’s visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on public benchmarks, MVTec LOCO AD, with an AUROC of 86.0% and F1-max of 83.7%, along with explanations of anomalies. This significantly outperforms the existing SOTA method by a large margin.

[CV-22] Adverse Weather Conditions Augmentation of LiDAR Scenes with Latent Diffusion Models

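[Quick read]: This paper addresses the scarcity of LiDAR scene data under adverse weather conditions in autonomous driving applications. Existing datasets contain few adverse-weather scenes, which limits the robustness of machine learning models and affects the reliability of autonomous driving systems in particular locations and seasons. The paper proposes a latent diffusion process, built from an autoencoder and latent diffusion models, to generate adverse-weather LiDAR scenes for specific driving scenarios. In addition, clear-condition LiDAR scenes are used together with a postprocessing step to further improve the realism of the generated scenes. The key to the solution is using generative models to compensate for gaps in the datasets, thereby improving the adaptability and reliability of autonomous driving systems in complex environments.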

Link: https://arxiv.org/abs/2501.01761
Authors: Andrea Matteazzi,Pascal Colling,Michael Arnold,Dietmar Tutsch
Affiliations: University of Wuppertal; Aptiv Services Deutschland GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This is an intermediate version of our work

Abstract:LiDAR scenes constitute a fundamental source for several autonomous driving applications. Despite the existence of several datasets, scenes from adverse weather conditions are rarely available. This limits the robustness of downstream machine learning models, and restrains the reliability of autonomous driving systems in particular locations and seasons. Collecting feature-diverse scenes under adverse weather conditions is challenging due to seasonal limitations. Generative models are therefore essential, especially for generating adverse weather conditions for specific driving scenarios. In our work, we propose a latent diffusion process constituted by an autoencoder and latent diffusion models. Moreover, we leverage the clear condition LiDAR scenes with a postprocessing step to improve the realism of the generated adverse weather condition scenes.

[CV-23] From Age Estimation to Age-Invariant Face Recognition: Generalized Age Feature Extraction Using Order-Enhanced Contrastive Learning

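[Quick read]: This paper addresses the significant performance drop that existing models suffer in cross-dataset evaluations when extracting generalized age features. These models typically just map extracted features directly to training age labels without explicitly modeling the natural aging process, which limits their generalization across datasets and scenarios. To address this, the paper proposes Order-Enhanced Contrastive Learning (OrdCon). The key is to align the direction vector of two features with either the natural aging direction or its reverse, which effectively models the aging process. OrdCon also incorporates a novel soft proxy matching loss that uses metric learning to ensure features lie around the center of each age cluster with minimal intra-class variance. Experiments show that the method is comparable to state-of-the-art methods in homogeneous-dataset evaluations and significantly improves performance on age estimation and age-invariant face recognition in cross-dataset evaluations.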

Link: https://arxiv.org/abs/2501.01760
Authors: Haoyi Wang,Victor Sanchez,Chang-Tsun Li,Nathan Clarke
Affiliations: School of Engineering, Computing and Mathematics, University of Plymouth, Plymouth, PL4 8AA, UK; Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK; School of Information Technology, Deakin University, Geelong VIC 3216, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generalized age feature extraction is crucial for age-related facial analysis tasks, such as age estimation and age-invariant face recognition (AIFR). Despite the recent successes of models in homogeneous-dataset experiments, their performance drops significantly in cross-dataset evaluations. Most of these models fail to extract generalized age features as they only attempt to map extracted features with training age labels directly without explicitly modeling the natural progression of aging. In this paper, we propose Order-Enhanced Contrastive Learning (OrdCon), which aims to extract generalized age features to minimize the domain gap across different datasets and scenarios. OrdCon aligns the direction vector of two features with either the natural aging direction or its reverse to effectively model the aging process. The method also leverages metric learning which is incorporated with a novel soft proxy matching loss to ensure that features are positioned around the center of each age cluster with minimum intra-class variance. We demonstrate that our proposed method achieves comparable results to state-of-the-art methods on various benchmark datasets in homogeneous-dataset evaluations for both age estimation and AIFR. In cross-dataset experiments, our method reduces the mean absolute error by about 1.38 on average for the age estimation task and boosts the average accuracy for AIFR by 1.87%.
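
The direction-alignment idea can be illustrated with a small sketch. This is not the OrdCon loss itself: it assumes a fixed, known aging direction vector and uses a plain cosine penalty, and the name `order_alignment_loss` is hypothetical. It only shows how the age order of a pair decides whether the feature difference is pulled toward the aging direction or its reverse.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; eps guards against zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def order_alignment_loss(feat_a, age_a, feat_b, age_b, aging_direction):
    """Penalty in [0, 2] on the direction of (feat_b - feat_a).

    If b is older than a, the difference should point along the aging
    direction; if b is younger, along its reverse. Equal ages impose no
    order constraint in this sketch.
    """
    if age_a == age_b:
        return 0.0
    direction = [b - a for a, b in zip(feat_a, feat_b)]
    sign = 1.0 if age_b > age_a else -1.0
    target = [sign * d for d in aging_direction]
    return 1.0 - cosine_similarity(direction, target)


# A pair ordered consistently with the aging direction gives a loss near 0,
# while the same features with swapped ages give a loss near 2.
consistent = order_alignment_loss([0.0, 0.0], 20, [1.0, 0.0], 40, [1.0, 0.0])
inconsistent = order_alignment_loss([0.0, 0.0], 40, [1.0, 0.0], 20, [1.0, 0.0])
```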

[CV-24] Augmentation Matters: A Mix-Paste Method for X-Ray Prohibited Item Detection under Noisy Annotations

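[Quick read]: This paper addresses the performance degradation of automatic prohibited item detection in X-ray images caused by noisy annotations in the training data, including both category noise and bounding box noise. Existing deep learning methods usually assume the training annotations are correct, but for large-scale X-ray images accurate annotations are extremely hard to obtain, for reasons such as item overlap, so annotation noise is present.

The key to the solution is a novel data augmentation method, label-aware mixed patch paste augmentation (Mix-Paste). Specifically, item patches with the same category label from different images are mixed, and the mixed patch replaces the original patch in the image, which increases the probability that the generated image contains the correct prohibited item. The mixing process also mimics item overlapping, helping the model learn the characteristics of X-ray images. In addition, the paper designs an item-based large-loss suppression (LLS) strategy to suppress the large losses produced by predictions of additional items introduced by the mixing operation, further improving robustness.

In experiments on X-ray datasets with noisy annotations and on the MS-COCO dataset, the method shows clear advantages in handling noisy annotations, demonstrating the considerable potential of data augmentation for this problem.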

Link: https://arxiv.org/abs/2501.01733
Authors: Ruikang Chen,Yan Yan,Jing-Hao Xue,Yang Lu,Hanzi Wang
Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University; Department of Statistical Science, University College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The manuscript has been accepted for publication as a regular paper in the IEEE Transactions on Information Forensics and Security

Abstract:Automatic X-ray prohibited item detection is vital for public safety. Existing deep learning-based methods all assume that the annotations of training X-ray images are correct. However, obtaining correct annotations is extremely hard if not impossible for large-scale X-ray images, where item overlapping is [...]. As a result, X-ray images are easily contaminated with noisy annotations, leading to performance deterioration of existing methods. In this paper, we address the challenging problem of training a robust prohibited item detector under noisy annotations (including both category noise and bounding box noise) from a novel perspective of data augmentation, and propose an effective label-aware mixed patch paste augmentation method (Mix-Paste). Specifically, for each item patch, we mix several item patches with the same category label from different images and replace the original patch in the image with the mixed patch. In this way, the probability of containing the correct prohibited item within the generated image is increased. Meanwhile, the mixing process mimics item overlapping, enabling the model to learn the characteristics of X-ray images. Moreover, we design an item-based large-loss suppression (LLS) strategy to suppress the large losses corresponding to potentially positive predictions of additional items due to the mixing operation. We show the superiority of our method on X-ray datasets under noisy annotations. In addition, we evaluate our method on the noisy MS-COCO dataset to showcase its generalization ability. These results clearly indicate the great potential of data augmentation to handle noisy annotations. The source code is released at this https URL.
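
The mix-then-paste step can be sketched in a few lines. The details below are assumptions, not taken from the paper: patches are plain 2D lists already resized to the target box, the original patch is included in the mix, and mixing is an equal-weight pixel-wise average. The names `mix_patches`, `mix_paste`, and `patch_bank` are hypothetical.

```python
def mix_patches(patches):
    """Pixel-wise average of equally sized 2D patches (assumed mixing rule)."""
    height, width = len(patches[0]), len(patches[0][0])
    count = len(patches)
    return [
        [sum(p[r][c] for p in patches) / count for c in range(width)]
        for r in range(height)
    ]


def mix_paste(image, box, label, patch_bank):
    """Replace the patch at `box` with a mix of same-label patches.

    image:      2D list of pixel intensities
    box:        (top, left, height, width) of the annotated item
    label:      category label of the annotated item
    patch_bank: dict mapping label -> list of patches from other images,
                each already resized to (height, width)
    """
    top, left, height, width = box
    original = [row[left:left + width] for row in image[top:top + height]]
    mixed = mix_patches([original] + patch_bank[label])
    augmented = [row[:] for row in image]  # leave the input image untouched
    for r in range(height):
        augmented[top + r][left:left + width] = mixed[r]
    return augmented


# Tiny example: one 1x1 item at the bottom-right corner, one donor patch.
image = [[0.0, 0.0], [0.0, 4.0]]
bank = {"knife": [[[8.0]]]}
augmented = mix_paste(image, (1, 1, 1, 1), "knife", bank)
```

Because only patches sharing the annotated label are mixed, a single mislabeled box is diluted by correctly labeled donors, which is the intuition the abstract gives for the increased chance of containing the correct item.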

[CV-25] Multi-modal classification of forest biodiversity potential from 2D orthophotos and 3D airborne laser scanning point clouds

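[Quick read]: This paper investigates how to improve the accuracy of forest biodiversity assessment by using deep learning to fuse close-range sensing data (2D orthophotos and 3D airborne laser scanning point clouds). Traditional field surveys provide high-quality assessments but are labor-intensive and spatially limited. The key to the proposed solution is a multi-modal fusion strategy that combines spectral information from 2D orthophotos with structural information from 3D airborne laser scanning point clouds. Deep neural networks (ResNet for orthophotos and PointVector for point clouds) are used to assess biodiversity potential from each modality separately, and two fusion approaches are explored: a confidence-based ensemble method and a feature-level concatenation strategy. Feature-level concatenation achieves a mean accuracy of 75.5%, demonstrating that the two modalities are complementary for forest biodiversity assessment.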

Link: https://arxiv.org/abs/2501.01728
Authors: Simon B. Jensen,Stefan Oehmcke,Andreas Møgelmose,Meysam Madadi,Christian Igel,Sergio Escalera,Thomas B. Moeslund
Affiliations: Visual Analysis and Perception Laboratory, Aalborg University, Denmark; Pioneer Centre for Artificial Intelligence, Denmark; Department of Computer Science, Copenhagen University, Denmark; Institute for Visual & Analytic Computing, Rostock University, Germany; University of Barcelona and Computer Vision Center, Spain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate assessment of forest biodiversity is crucial for ecosystem management and conservation. While traditional field surveys provide high-quality assessments, they are labor-intensive and spatially limited. This study investigates whether deep learning-based fusion of close-range sensing data from 2D orthophotos (12.5 cm resolution) and 3D airborne laser scanning (ALS) point clouds (8 points/m^2) can enhance biodiversity assessment. We introduce the BioVista dataset, comprising 44,378 paired samples of orthophotos and ALS point clouds from temperate forests in Denmark, designed to explore multi-modal fusion approaches for biodiversity potential classification. Using deep neural networks (ResNet for orthophotos and PointVector for ALS point clouds), we investigate each data modality’s ability to assess forest biodiversity potential, achieving mean accuracies of 69.4% and 72.8%, respectively. We explore two fusion approaches: a confidence-based ensemble method and a feature-level concatenation strategy, with the latter achieving a mean accuracy of 75.5%. Our results demonstrate that spectral information from orthophotos and structural information from ALS point clouds effectively complement each other in forest biodiversity assessment.
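
To make the difference between the two fusion approaches concrete, here is a minimal sketch under stated assumptions. The confidence rule (take the prediction of whichever modality has the higher maximum class probability) is one plausible reading of a confidence-based ensemble, not necessarily the paper's exact rule, and the linear classifier stands in for whatever head is trained on the concatenated features. All function names are hypothetical.

```python
def confidence_ensemble(probs_image, probs_points):
    """Late fusion: trust the modality whose top class probability is higher."""
    chosen = probs_image if max(probs_image) >= max(probs_points) else probs_points
    return chosen.index(max(chosen))


def concat_features(feat_image, feat_points):
    """Feature-level fusion: join the two feature vectors into one."""
    return list(feat_image) + list(feat_points)


def linear_classifier(features, weights, biases):
    """Stand-in classification head: argmax over per-class linear scores."""
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return scores.index(max(scores))


# Late fusion: the point-cloud branch is more confident, so its class wins.
late_prediction = confidence_ensemble([0.6, 0.4], [0.1, 0.9])

# Feature-level fusion: one head sees both modalities at once.
fused = concat_features([1.0, 2.0], [3.0])
early_prediction = linear_classifier(fused, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0])
```

The design difference is where the modalities meet: the ensemble combines finished predictions, while concatenation lets a single head learn interactions between spectral and structural features.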

[CV-26] IGAF: Incremental Guided Attention Fusion for Depth Super-Resolution

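[Quick read]: This paper addresses the difficulty of detailed scene perception from low-resolution (LR) depth maps in fields such as robotics, navigation, and medical imaging. To this end, it proposes a guided depth super-resolution (GDSR) method guided by high-resolution (HR) structured inputs such as RGB or grayscale images. The core of the solution is the Incremental Guided Attention Fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps to produce accurate HR depth maps. Using the IGAF module, the authors build a robust super-resolution model and evaluate it on multiple benchmark datasets. The model achieves state-of-the-art results on the NYU v2 dataset for ×4, ×8, and ×16 upsampling, and also outperforms all baselines in a zero-shot setting on the Middlebury, Lu, and RGB-D-D datasets.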

Link: https://arxiv.org/abs/2501.01723
Authors: Athanasios Tragakis,Chaitanya Kaul,Kevin J. Mitchell,Hang Dai,Roderick Murray-Smith,Daniele Faccio
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate depth estimation is crucial for many fields, including robotics, navigation, and medical imaging. However, conventional depth sensors often produce low-resolution (LR) depth maps, making detailed scene perception challenging. To address this, enhancing LR depth maps to high-resolution (HR) ones has become essential, guided by HR-structured inputs like RGB or grayscale images. We propose a novel sensor fusion methodology for guided depth super-resolution (GDSR), a technique that combines LR depth maps with HR images to estimate detailed HR depth maps. Our key contribution is the Incremental guided attention fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps, producing accurate HR depth maps. Using IGAF, we build a robust super-resolution model and evaluate it on multiple benchmark datasets. Our model achieves state-of-the-art results compared to all baseline models on the NYU v2 dataset for ×4, ×8, and ×16 upsampling. It also outperforms all baselines in a zero-shot setting on the Middlebury, Lu, and RGB-D-D datasets. Code, environments, and models are available on GitHub.

[CV-27] AR4D: Autoregressive 4D Generation from Monocular Videos

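[Quick read]: This paper addresses limited diversity, spatial-temporal inconsistency, and poor prompt alignment in existing methods for dynamic 3D content generation (i.e., 4D generation). These problems stem mainly from the Score Distillation Sampling (SDS) technique on which existing methods rely, whose inherent randomness leads to these defects. To solve them, the paper proposes AR4D, a new paradigm for SDS-free 4D generation. The solution has three stages. First, pre-trained expert models generate a 3D representation from the first frame of a monocular video, which is fine-tuned to serve as the canonical space. Second, motivated by the naturally autoregressive nature of video, each frame's 3D representation is generated from the previous frame's representation to improve geometry and motion estimation, with a progressive view sampling strategy introduced to prevent overfitting. Finally, a refinement stage based on a global deformation field and the geometry of each frame's 3D representation avoids appearance drift during autoregressive generation. Experiments show that AR4D achieves state-of-the-art 4D generation without SDS, with clear improvements in diversity, spatial-temporal consistency, and prompt alignment.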

Link: https://arxiv.org/abs/2501.01722
Authors: Hanxin Zhu,Tianyu He,Xiqian Yu,Junliang Guo,Zhibo Chen,Jiang Bian
Affiliations: University of Science and Technology of China; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: TL;DR: We present a novel method for 4D generation from monocular videos without relying on SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts. Project page: this https URL

Abstract:Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame’s 3D representation based on its previous frame’s representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame’s 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.

[CV-28] Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models AAAI2025

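[Quick read]: This paper addresses two main problems in Face Anti-Spoofing (FAS): existing methods generalize poorly in cross-domain scenarios, and their decisions lack interpretability. To address them, the paper proposes a framework based on a Multimodal Large Language Model (MLLM), called Interpretable Face Anti-Spoofing (I-FAS), which recasts the FAS task as an interpretable Visual Question Answering (VQA) paradigm. The key components are: 1) a Spoof-aware Captioning and Filtering (SCF) strategy that generates high-quality image captions to strengthen the model's supervision signal; 2) a Lopsided Language Model (L-LM) loss function that separates the loss calculations for judgment and interpretation and prioritizes optimization of the judgment; 3) a Globally Aware Connector (GAC) that enhances the model's perception of global visual features. Experiments show that the method significantly outperforms existing state-of-the-art methods on cross-domain benchmarks.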

Link: https://arxiv.org/abs/2501.01720
Authors: Guosheng Zhang,Keyao Wang,Haixiao Yue,Ajian Liu,Gang Zhang,Kun Yao,Errui Ding,Jingdong Wang
Affiliations: 1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI2025

Abstract:Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model’s supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model’s perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
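
The idea of separating the judgment loss from the interpretation loss can be sketched as follows. The exact form of the L-LM loss is not given in the abstract, so this is a hypothetical weighting scheme: per-token negative log-likelihoods are averaged separately for judgment tokens and interpretation tokens, and the interpretation term is down-weighted. The name `lopsided_lm_loss` and the parameter `interp_weight` are illustrative only.

```python
def lopsided_lm_loss(token_nll, is_judgment, interp_weight=0.5):
    """Combine judgment and interpretation losses with unequal weight.

    token_nll:     per-token negative log-likelihoods of the answer
    is_judgment:   same-length flags, True for tokens of the real/spoof verdict
    interp_weight: assumed factor (< 1) that de-emphasizes noisy captions
    """
    judgment = [nll for nll, flag in zip(token_nll, is_judgment) if flag]
    interpretation = [nll for nll, flag in zip(token_nll, is_judgment) if not flag]
    judgment_loss = sum(judgment) / len(judgment) if judgment else 0.0
    interp_loss = sum(interpretation) / len(interpretation) if interpretation else 0.0
    return judgment_loss + interp_weight * interp_loss


# One verdict token followed by two caption tokens.
loss = lopsided_lm_loss([2.0, 1.0, 3.0], [True, False, False], interp_weight=0.5)
```

Averaging the two groups separately keeps a long caption from drowning out the single verdict token, which a plain mean over all tokens would do.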

[CV-29] KeyNode-Driven Geometry Coding for Real-World Scanned Human Dynamic Mesh Compression

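[Quick read]: This paper addresses the compression of real-world scanned 3D human dynamic meshes. Such meshes have topology that varies across frames and contain scan defects such as holes and outliers, which complicates prediction and compression. In addition, human meshes usually combine rigid and non-rigid motion, making prediction and encoding harder than for objects with purely rigid motion. To address these issues, the paper proposes a compression method based on embedded key nodes. The temporal motion of each vertex is expressed as a distance-weighted combination of the transformations of neighboring key nodes, so only the key nodes' transformations need to be transmitted. To improve the quality of key-node-driven prediction, the paper introduces an octree-based residual coding scheme and a Dual-direction prediction mode that uses I-frames from both directions. Experiments show that the method is particularly strong at low bitrates, with an average bitrate saving of 24.51%.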

Link: https://arxiv.org/abs/2501.01717
Authors: Huong Hoang,Truong Nguyen,Pamela Cosman
Affiliations: Center for Wireless Communications, University of California, San Diego; National Science Foundation
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Signal Processing (eess.SP)
Comments:

Abstract:The compression of real-world scanned 3D human dynamic meshes is an emerging research area, driven by applications such as telepresence, virtual reality, and 3D digital streaming. Unlike synthesized dynamic meshes with fixed topology, scanned dynamic meshes often not only have varying topology across frames but also scan defects such as holes and outliers, increasing the complexity of prediction and compression. Additionally, human meshes often combine rigid and non-rigid motions, making accurate prediction and encoding significantly more difficult compared to objects that exhibit purely rigid motion. To address these challenges, we propose a compression method designed for real-world scanned human dynamic meshes, leveraging embedded key nodes. The temporal motion of each vertex is formulated as a distance-weighted combination of transformations from neighboring key nodes, requiring the transmission of solely the key nodes’ transformations. To enhance the quality of the KeyNode-driven prediction, we introduce an octree-based residual coding scheme and a Dual-direction prediction mode, which uses I-frames from both directions. Extensive experiments demonstrate that our method achieves significant improvements over the state-of-the-art, with an average bitrate saving of 24.51% across the evaluated sequences, particularly excelling at low bitrates.
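
The distance-weighted combination of key-node transformations can be illustrated with a simplified sketch. Two assumptions are made here that the abstract does not specify: each key node carries only a translation (the paper speaks of general transformations), and the weights are normalized inverse distances. The function name `predict_vertex` is hypothetical.

```python
import math


def predict_vertex(vertex, key_nodes, translations, eps=1e-8):
    """Predict a vertex position in the next frame from key-node motion.

    vertex:       (x, y, z) position in the current frame
    key_nodes:    positions of the neighboring key nodes
    translations: per-key-node translation vectors (assumed motion model)
    Weights are inverse distances, normalized to sum to 1 (assumed scheme).
    """
    weights = [1.0 / (math.dist(vertex, node) + eps) for node in key_nodes]
    total = sum(weights)
    motion = [
        sum(w * t[axis] for w, t in zip(weights, translations)) / total
        for axis in range(3)
    ]
    return tuple(v + m for v, m in zip(vertex, motion))


# A vertex midway between two key nodes receives the average of their motions.
predicted = predict_vertex(
    (0.0, 0.0, 0.0),
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
    [(0.0, 2.0, 0.0), (0.0, 4.0, 0.0)],
)
```

Since every vertex position is derived from a handful of key nodes, only the key-node motions need to be sent, and a residual coder (octree-based in the paper) can correct what this prediction misses.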

[CV-30] Cloth-Splatting: 3D Cloth State Estimation from RGB Supervision

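[Quick read]: This paper addresses the estimation of the 3D state of cloth from RGB images. The core solution is a prediction-update framework that combines an action-conditioned dynamics model with 3D Gaussian Splatting. The key innovation is coupling a 3D mesh-based representation with Gaussian Splatting, which establishes a differentiable map between the cloth state space and the image space. This map allows gradient-based optimization to correct inaccurate state estimates using only RGB supervision. Experiments show that the method not only improves state estimation accuracy but also reduces convergence time.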

Link: https://arxiv.org/abs/2501.01715
Authors: Alberta Longhini,Marcel Büsching,Bardienus P. Duisterhof,Jens Lundell,Jeffrey Ichnowski,Mårten Björkman,Danica Kragic
Affiliations: KTH Royal Institute of Technology; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at the 8th Conference on Robot Learning (CoRL 2024). Code and videos available at: this http URL

Abstract:We introduce Cloth-Splatting, a method for estimating 3D states of cloth from RGB images through a prediction-update framework. Cloth-Splatting leverages an action-conditioned dynamics model for predicting future states and uses 3D Gaussian Splatting to update the predicted states. Our key insight is that coupling a 3D mesh-based representation with Gaussian Splatting allows us to define a differentiable map between the cloth state space and the image space. This enables the use of gradient-based optimization techniques to refine inaccurate state estimates using only RGB supervision. Our experiments demonstrate that Cloth-Splatting not only improves state estimation accuracy over current baselines but also reduces convergence time.
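
The prediction-update loop can be shown on a deliberately tiny toy problem. Everything here is a stand-in: the state is a single number instead of a mesh, `predict` replaces the action-conditioned dynamics model, and `render` replaces Gaussian Splatting with a trivially differentiable map. The sketch only demonstrates the pattern of predicting a state and then refining it by gradient descent on an image-space error.

```python
def predict(state, action):
    """Hypothetical dynamics model: the action displaces the state."""
    return state + action


def render(state):
    """Hypothetical differentiable map from state space to image space."""
    return 2.0 * state


def update(state, observed_pixel, learning_rate=0.1, steps=50):
    """Refine the state by gradient descent on the squared rendering error."""
    for _ in range(steps):
        residual = render(state) - observed_pixel
        # d/dstate of residual**2, using d(render)/dstate = 2
        gradient = 2.0 * residual * 2.0
        state -= learning_rate * gradient
    return state


# The dynamics model predicts 1.5, but the observation corresponds to 2.0;
# the update step pulls the estimate toward the state that explains the image.
predicted = predict(1.0, 0.5)
refined = update(predicted, observed_pixel=4.0)
```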

[CV-31] Enhancing Large Vision Model in Street Scene Semantic Understanding through Leveraging Posterior Optimization Trajectory

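[Quick read]: This paper addresses the under-fitting that autonomous driving (AD) perception models are prone to as the amount of data keeps growing. Over time, the data fitted by the AD model expands, which helps generalization, but once the data volume exceeds the model's fitting capacity the model tends to under-fit. To address this, the paper proposes using pretrained Large Vision Models (LVMs) as the backbone, coupled with a downstream perception head, to understand AD semantic information. This design overcomes under-fitting through the strong fitting capability of LVMs and improves perception generalization thanks to their vast and diverse training data. Furthermore, to reduce the vehicle's computational burden of training the perception head while running the LVM backbone, the paper introduces a Posterior Optimization Trajectory (POT)-Guided optimization scheme (POTGui), in which a POT Generator (POTGen) produces future optimization directions in advance to accelerate convergence. Experiments show that the method improves performance by over 66.48% and converges more than 6 times faster than the existing state-of-the-art approach.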

Link: https://arxiv.org/abs/2501.01710
Authors: Wei-Bin Kou,Qingfeng Lin,Ming Tang,Shuai Wang,Rongguang Ye,Guangxu Zhu,Yik-Chung Wu
Affiliations: 1. Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong 999077, China; 2. Shenzhen Research Institute of Big Data, Shenzhen, China; 3. Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; 4. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 7 pages

Abstract:To improve the generalization of the autonomous driving (AD) perception model, vehicles need to update the model over time based on the continuously collected data. As time progresses, the amount of data fitted by the AD model expands, which helps to improve the AD model generalization substantially. However, such ever-expanding data is a double-edged sword for the AD model. Specifically, as the fitted data volume grows to exceed the AD model’s fitting capacities, the AD model is prone to under-fitting. To address this issue, we propose to use pretrained Large Vision Models (LVMs) as the backbone, coupled with a downstream perception head, to understand AD semantic information. This design can not only surmount the aforementioned under-fitting problem due to LVMs’ powerful fitting capabilities, but also enhance the perception generalization thanks to LVMs’ vast and diverse training data. On the other hand, to mitigate vehicles’ computational burden of training the perception head while running the LVM backbone, we introduce a Posterior Optimization Trajectory (POT)-Guided optimization scheme (POTGui) to accelerate the convergence. Concretely, we propose a POT Generator (POTGen) to generate posterior (future) optimization direction in advance to guide the current optimization iteration, through which the model can generally converge within 10 epochs. Extensive experiments demonstrate that the proposed method improves the performance by over 66.48% and converges more than 6 times faster, compared to the existing state-of-the-art approach.

[CV-32] MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

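[Quick read]: This paper addresses the sharp increase in computational cost that comes from using multiple visual encoders in vision-language models (VLMs). It proposes Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a framework whose key idea is to distill the distinct strengths of several visual encoders into a single efficient encoder. Specifically, by combining low-rank adaptation (LoRA) with mixture-of-experts (MoEs), MoVE-KD selectively activates specialized knowledge according to the input features, which preserves the unique characteristics of each teacher encoder while reducing conflicts and improving adaptability and efficiency. The paper also proposes an attention-based distillation strategy that adaptively weighs the contributions of the different visual encoders and emphasizes valuable visual tokens, easing the burden of replicating comprehensive but distinct features from multiple teachers. Experiments on popular VLMs such as LLaVA and LLaVA-NeXT show that the method is effective.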

Link: https://arxiv.org/abs/2501.01709
Authors: Jiajun Cao, Yuan Zhang, Tao Huang, Ming Lu, Qizhe Zhang, Ruichuan An, Ningning MA, Shanghang Zhang
Affiliations: Peking University; Autonomous Driving Development, NIO; The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures

Abstract:Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. The code will be released.
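
To make the LoRA plus mixture-of-experts idea concrete, here is a minimal pure-Python sketch of a frozen linear layer with gated low-rank experts. It assumes dense softmax gating and toy matrices; the names `lora_moe_forward`, `w_router`, and `experts` are hypothetical, and the paper's actual routing and layer placement may differ.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lora_moe_forward(x, w_frozen, experts, w_router):
    """y = W x + sum_i g_i(x) * B_i (A_i x), with g = softmax(router x)."""
    gates = softmax(matvec(w_router, x))
    y = matvec(w_frozen, x)
    for gate, (a_down, b_up) in zip(gates, experts):
        delta = matvec(b_up, matvec(a_down, x))  # low-rank update B(Ax)
        y = [yi + gate * di for yi, di in zip(y, delta)]
    return y, gates

# Toy sizes (assumed): 3-dim features, two rank-1 experts.
w_frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
experts = [
    ([[1.0, 0.0, 0.0]], [[0.5], [0.0], [0.0]]),  # (A: 1x3, B: 3x1)
    ([[0.0, 1.0, 0.0]], [[0.0], [0.5], [0.0]]),
]
w_router = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]  # one row of logits per expert

y, gates = lora_moe_forward([1.0, 0.0, 0.0], w_frozen, experts, w_router)
```

With this input the router favors the first expert, so almost all of the low-rank correction comes from it while the frozen weights stay untouched.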

[CV-33] Optimal Fiducial Marker Placement for Satellite Proximity Operations Using Observability Gramians

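[Quick read]: This paper addresses how to optimally place fiducial markers on the surface of a target satellite during relative proximity operations. The study models the absolute and relative translation and attitude equations of motion of the satellite pair with dual quaternions and analyzes the observability of the relative dual quaternion system with empirical observability Gramian methods. The key to the solution is determining the optimal placement of a set of fiducial markers, each of which provides simultaneous optical range and attitude measurements. By numerically simulating a geostationary flyby, the study obtains optimal placements of 5 and 10 fiducial markers on the target satellite's surface. The results show that the optimal solution maximizes the distance between fiducial markers and selects the marker locations most sensitive to state changes along the nonlinear trajectory, even though those locations are visible for less time.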

Link: https://arxiv.org/abs/2501.01704
Authors: Nicholas B. Andrews, Kristi A. Morgansen
Affiliations: Department of Aeronautics and Astronautics, University of Washington
Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Optimization and Control (math.OC)
Comments: 18 pages, 7 figures, 1 table, presented at the 45th Annual American Astronautical Society (AAS) Guidance, Navigation and Control (GNC) Conference

Abstract:This paper investigates optimal fiducial marker placement on the surface of a satellite performing relative proximity operations with an observer satellite. The absolute and relative translation and attitude equations of motion for the satellite pair are modeled using dual quaternions. The observability of the relative dual quaternion system is analyzed using empirical observability Gramian methods. The optimal placement of a fiducial marker set, in which each marker gives simultaneous optical range and attitude measurements, is determined for the pair of satellites. A geostationary flyby between the observing body (chaser) and desired (target) satellites is numerically simulated and the optimal fiducial placement sets of five and ten on the surface of the desired satellite are solved. It is shown that the optimal solution maximizes the distance between fiducial markers and selects marker locations that are most sensitive to measuring changes in the state during the nonlinear trajectory, despite being visible for less time than other candidate marker locations. Definitions and properties of quaternions and dual quaternions, and parallels between the two, are presented alongside the relative motion model.
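
As background on the analysis tool named above, the following sketch computes an empirical observability Gramian by perturbing each initial state by plus and minus `eps` and accumulating products of the resulting output differences. The toy system (a discrete-time double integrator) and all function names are assumptions for illustration; the paper's dual quaternion model is far richer.

```python
def simulate_outputs(x0, step, measure, n_steps):
    """Roll a discrete-time system forward and record its outputs."""
    x, outputs = list(x0), []
    for _ in range(n_steps):
        outputs.append(measure(x))
        x = step(x)
    return outputs

def empirical_observability_gramian(x0, step, measure, n_steps, eps=1e-3, dt=1.0):
    """W[i][j] = dt / (4 eps^2) * sum_k dy_i(k) . dy_j(k)."""
    n = len(x0)
    deltas = []
    for i in range(n):
        plus = [v + (eps if j == i else 0.0) for j, v in enumerate(x0)]
        minus = [v - (eps if j == i else 0.0) for j, v in enumerate(x0)]
        y_plus = simulate_outputs(plus, step, measure, n_steps)
        y_minus = simulate_outputs(minus, step, measure, n_steps)
        deltas.append([[p - m for p, m in zip(yp, ym)]
                       for yp, ym in zip(y_plus, y_minus)])
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = sum(sum(a * b for a, b in zip(di, dj))
                        for di, dj in zip(deltas[i], deltas[j]))
            gram[i][j] = total * dt / (4.0 * eps * eps)
    return gram

# Toy system (assumed): state = [position, velocity], unit time step.
step = lambda x: [x[0] + x[1], x[1]]
gram = empirical_observability_gramian([0.0, 1.0], step, lambda x: [x[0]], n_steps=3)
gram_velocity_only = empirical_observability_gramian(
    [0.0, 1.0], step, lambda x: [x[1]], n_steps=3)
```

Measuring position yields a full-rank Gramian, whereas measuring only velocity leaves a zero row and column for position, which is the kind of sensitivity difference a placement criterion can rank.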

[CV-34] Aesthetic Matters in Music Perception for Image Stylization: A Emotion-driven Music-to-Visual Manipulation

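[Quick read]: This paper addresses the difficulty of intuitively understanding and precisely controlling emotional expression in images, and the limited integration of music's emotional dimensions with visual art. Although deep learning has made notable progress in image recognition, emotional expression in images remains hard to understand intuitively and control precisely. Meanwhile, music research concentrates on theory, with limited exploration of its emotional dimensions and their integration with visual art. To close these gaps, the paper proposes EmoMV, an emotion-driven music-to-visual manipulation method. EmoMV processes musical elements such as pitch and rhythm bottom-up and applies the resulting emotions top-down to visual aspects such as color and lighting, translating the emotional content of music into images. The method is evaluated with a multi-scale framework that includes image quality metrics, aesthetic assessments, and electroencephalography (EEG) measurements that capture real-time emotional responses. The experiments show that EmoMV effectively translates music's emotional content into visually compelling images, advancing multimodal emotional integration and opening new avenues for creative industries and interactive technologies.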

Link: https://arxiv.org/abs/2501.01700
Authors: Junjie Xu, Xingjiao Wu, Tanren Yao, Zihao Zhang, Jiayang Bei, Wu Wen, Liang He
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Emotional information is essential for enhancing human-computer interaction and deepening image understanding. However, while deep learning has advanced image recognition, the intuitive understanding and precise control of emotional expression in images remain challenging. Similarly, music research largely focuses on theoretical aspects, with limited exploration of its emotional dimensions and their integration with visual arts. To address these gaps, we introduce EmoMV, an emotion-driven music-to-visual manipulation method that manipulates images based on musical emotions. EmoMV combines bottom-up processing of music elements-such as pitch and rhythm-with top-down application of these emotions to visual aspects like color and lighting. We evaluate EmoMV using a multi-scale framework that includes image quality metrics, aesthetic assessments, and EEG measurements to capture real-time emotional responses. Our results demonstrate that EmoMV effectively translates music’s emotional content into visually compelling images, advancing multimodal emotional integration and opening new avenues for creative industries and interactive technologies.

[CV-35] Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels AAAI25

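[Quick read]: This paper tackles the way noisy labels mislead cross-modal hashing (CMH) models in practice. Existing methods usually assume that multi-modal data is labeled correctly, but real-world labels are often noisy, which degrades model performance. To address this, the paper proposes a new cognitive cross-modal retrieval method, Robust Self-paced Hashing with Noisy Labels (RSHNL). Its key idea is to mimic human cognition by learning samples progressively from easy to hard and to identify noisy labels by dynamically estimating the learning difficulty of each instance. Specifically, RSHNL uses contrastive hashing learning (CHL) to strengthen multi-modal consistency and narrow the semantic gap, center aggregation learning (CAL) to mitigate intra-class variations, and Noise-tolerance Self-paced Hashing (NSH) to dynamically estimate learning difficulty and distinguish noisy labels. For pairs estimated to be clean, a self-paced regularizer is further adopted to learn hash codes gradually from easy to hard. Experiments show that RSHNL performs strongly on cross-modal hashing tasks and outperforms existing state-of-the-art methods.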

Link: https://arxiv.org/abs/2501.01699
Authors: Ruitao Pu, Yuan Sun, Yang Qin, Zhenwen Ren, Xiaomin Song, Huiming Zheng, Dezhong Peng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 9 pages, AAAI 25 conference

Abstract: Cross-modal hashing (CMH) has emerged as a popular technique for cross-modal retrieval due to its low storage cost and high computational efficiency in large-scale data. Most existing methods implicitly assume that multi-modal data is correctly labeled, which is expensive and even unattainable due to the inevitable imperfect annotations (i.e., noisy labels) in real-world scenarios. Inspired by human cognitive learning, a few methods introduce self-paced learning (SPL) to gradually train the model from easy to hard samples, which is often used to mitigate the effects of feature noise or outliers. How to use SPL to keep noisy labels from misleading the hash model remains a less-explored problem. To tackle this problem, we propose a new cognitive cross-modal retrieval method called Robust Self-paced Hashing with Noisy Labels (RSHNL), which can mimic the human cognitive process to identify the noise while embracing robustness against noisy labels. Specifically, we first propose a contrastive hashing learning (CHL) scheme to improve multi-modal consistency, thereby reducing the inherent semantic gap. Afterward, we propose center aggregation learning (CAL) to mitigate the intra-class variations. Finally, we propose Noise-tolerance Self-paced Hashing (NSH) that dynamically estimates the learning difficulty for each instance and distinguishes noisy labels through the difficulty level. For all estimated clean pairs, we further adopt a self-paced regularizer to gradually learn hash codes from easy to hard. Extensive experiments demonstrate that the proposed RSHNL performs remarkably well over the state-of-the-art CMH methods.
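
The easy-to-hard idea behind self-paced learning can be sketched with the classic hard-threshold regularizer, where a sample is used only if its loss is below a threshold that grows each round. This is a generic illustration, not RSHNL's difficulty estimator; the loss values, threshold schedule, and function names are assumed, and the losses are held fixed here although they would change during real training.

```python
def self_paced_weights(losses, threshold):
    """Hard self-paced weights: 1.0 for 'easy' samples (loss < threshold), else 0.0."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]

def curriculum(losses, start, growth, rounds):
    """Grow the threshold geometrically so harder samples are admitted over time."""
    schedule = []
    threshold = start
    for _ in range(rounds):
        schedule.append(self_paced_weights(losses, threshold))
        threshold *= growth
    return schedule

# Assumed per-sample losses; the last, high-loss sample could carry a noisy label.
losses = [0.1, 0.4, 0.9, 2.5]
schedule = curriculum(losses, start=0.3, growth=2.0, rounds=3)
```

Across the three rounds the first three samples are admitted one by one, while the high-loss sample stays excluded, which is how a difficulty threshold can double as a noise filter.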

[CV-36] CrossView-GS: Cross-view Gaussian Splatting For Large-scale Scene Reconstruction

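[Quick read]: This paper addresses the optimization challenges that existing 3D Gaussian Splatting (3DGS) methods face in scenes with large view changes. Existing methods work well when view variation is small, but in cross-view scenes the large view disparity makes the model hard to optimize. The paper therefore proposes a cross-view Gaussian splatting method based on dual-branch fusion for large-scale scene reconstruction. Its key points are: first, models are reconstructed independently from aerial and ground views as two separate branches, providing reliable priors for the initialization and densification of the Gaussians; second, a gradient-aware regularization strategy mitigates the smoothing caused by significant view disparities; finally, a Gaussian supplementation strategy integrates the complementary information of the two branches into the cross-view model. Experiments show that the method outperforms existing state-of-the-art methods on novel view synthesis.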

Link: https://arxiv.org/abs/2501.01695
Authors: Chenhao Zhang, Yuanping Cao, Lei Zhang
Affiliations: School of Computer Science, Beijing Institute of Technology, Beijing 100081, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has emerged as a prominent method for scene representation and reconstruction, leveraging densely distributed Gaussian primitives to enable real-time rendering of high-resolution images. While existing 3DGS methods perform well in scenes with minor view variation, large view changes in cross-view scenes pose optimization challenges for these methods. To address these issues, we propose a novel cross-view Gaussian Splatting method for large-scale scene reconstruction, based on dual-branch fusion. Our method independently reconstructs models from aerial and ground views as two independent branches to establish the baselines of Gaussian distribution, providing reliable priors for cross-view reconstruction during both initialization and densification. Specifically, a gradient-aware regularization strategy is introduced to mitigate smoothing issues caused by significant view disparities. Additionally, a unique Gaussian supplementation strategy is utilized to incorporate complementary information of dual-branch into the cross-view model. Extensive experiments on benchmark datasets demonstrate that our method achieves superior performance in novel view synthesis compared to state-of-the-art methods.
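
For readers new to 3DGS, the rendering step that all of these methods share is front-to-back alpha compositing of depth-sorted Gaussians at each pixel. The sketch below shows only that standard blending rule with two hypothetical splats; it is background, not the cross-view method itself.

```python
def composite(colors, alphas):
    """Front-to-back blending: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Two assumed splats sorted near to far: half-opaque red in front of half-opaque blue.
pixel, remaining = composite([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.5, 0.5])
```

The nearer splat contributes more than the farther one even at equal opacity, which is why view-dependent ordering matters when aerial and ground views disagree.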

[CV-37] VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement

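[Quick read]: This paper addresses the imbalance in performance between small and large datasets that existing deep learning methods show on remote physiological signal measurement from facial videos (remote photoplethysmography, rPPG). Conventional methods rely mainly on convolutional neural networks (CNNs) and Transformer models, which have limitations in handling spatiotemporal features. The proposed solution is VidFormer, a novel end-to-end framework that combines a 3D convolutional neural network (3DCNN) and a Transformer to extract local and global features of the input, respectively. The key innovations are: 1) an analysis of the traditional skin reflection model and an improved model for reconstructing rPPG signals; 2) temporal-spatial attention mechanisms in both the 3DCNN and the Transformer to strengthen spatiotemporal feature extraction; 3) a module that facilitates information exchange and fusion between the 3DCNN and the Transformer. Experiments show that VidFormer outperforms current state-of-the-art methods on five public datasets.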

Link: https://arxiv.org/abs/2501.01691
Authors: Jiachen Li, Shisheng Guo, Longzhen Tang, Cuolong Cui, Lingjiang Kong, Xiaobo Yang
Affiliations: School of Information and Communication Engineering, University of Electronic Science and Technology of China; Yangtze Delta Region Institute, University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Remote physiological signal measurement based on facial videos, also known as remote photoplethysmography (rPPG), involves predicting changes in facial vascular blood flow from facial videos. While most deep learning-based methods have achieved good results, they often struggle to balance performance across small and large-scale datasets due to the inherent limitations of convolutional neural networks (CNNs) and Transformer. In this paper, we introduce VidFormer, a novel end-to-end framework that integrates 3-Dimension Convolutional Neural Network (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an analysis of the traditional skin reflection model and subsequently introduce an enhanced model for the reconstruction of rPPG signals. Based on this improved model, VidFormer utilizes 3DCNN and Transformer to extract local and global features from input data, respectively. To enhance the spatiotemporal feature extraction capabilities of VidFormer, we incorporate temporal-spatial attention mechanisms tailored for both 3DCNN and Transformer. Additionally, we design a module to facilitate information exchange and fusion between the 3DCNN and Transformer. Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we discuss the essential roles of each VidFormer module and examine the effects of ethnicity, makeup, and exercise on its performance.

[CV-38] Quantitative Gait Analysis from Single RGB Videos Using a Dual-Input Transformer-Based Network

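[Quick read]: This paper addresses the reliance of conventional gait and movement analysis on costly motion capture systems and specialized personnel, which limits its broad clinical use. It proposes a quantitative movement analysis method based on single-camera video that estimates key gait parameters with a dual-pattern input convolutional Transformer network. The key to the solution is using deep learning, in particular a dual-input Transformer model, to extract metrics such as the gait deviation index (GDI), knee flexion angle, step length, and walking cadence from single-view RGB video. The method performs well in resource-constrained settings, surpasses existing state-of-the-art methods, and has strong potential for clinical application.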

Link: https://arxiv.org/abs/2501.01689
Authors: Hiep Dinh, Son Le, My Than, Minh Ho, Nicolas Vuillerme, Hieu Pham
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at The IEEE International Symposium on Biomedical Imaging (ISBI 2025)

Abstract:Gait and movement analysis have become a well-established clinical tool for diagnosing health conditions, monitoring disease progression for a wide spectrum of diseases, and to implement and assess treatment, surgery and or rehabilitation interventions. However, quantitative motion assessment remains limited to costly motion capture systems and specialized personnel, restricting its accessibility and broader application. Recent advancements in deep neural networks have enabled quantitative movement analysis using single-camera videos, offering an accessible alternative to conventional motion capture systems. In this paper, we present an efficient approach for clinical gait analysis through a dual-pattern input convolutional Transformer network. The proposed system leverages a dual-input Transformer model to estimate essential gait parameters from single RGB videos captured by a single-view camera. The system demonstrates high accuracy in estimating critical metrics such as the gait deviation index (GDI), knee flexion angle, step length, and walking cadence, validated on a dataset of individuals with movement disorders. Notably, our approach surpasses state-of-the-art methods in various scenarios, using fewer resources and proving highly suitable for clinical application, particularly in resource-constrained environments.

[CV-39] IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks

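[Quick read]: This paper addresses the relative scarcity of instance-level datasets for RGB-D segmentation. Existing research focuses mainly on semantic segmentation, so datasets that capture fine-grained, instance-level detail are lacking, which limits model performance in recognizing individual objects. To address this, the paper introduces three instance-level RGB-D segmentation benchmarks that support applications ranging from indoor navigation to robotic manipulation. It also extensively evaluates a range of baseline models on these benchmarks, revealing their strengths and weaknesses and pointing to directions for future research. Finally, it proposes a simple yet effective method for RGB-D data integration, validated by extensive experiments, that offers a robust framework for more nuanced scene understanding.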

Link: https://arxiv.org/abs/2501.01685
Authors: Aecheon Jung, Soyun Choi, Junhong Min, Sungeun Hong
Affiliations: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image segmentation is a vital task for providing human assistance and enhancing autonomy in our daily lives. In particular, RGB-D segmentation-leveraging both visual and depth cues-has attracted increasing attention as it promises richer scene understanding than RGB-only methods. However, most existing efforts have primarily focused on semantic segmentation and thus leave a critical gap. There is a relative scarcity of instance-level RGB-D segmentation datasets, which restricts current methods to broad category distinctions rather than fully capturing the fine-grained details required for recognizing individual objects. To bridge this gap, we introduce three RGB-D instance segmentation benchmarks, distinguished at the instance level. These datasets are versatile, supporting a wide range of applications from indoor navigation to robotic manipulation. In addition, we present an extensive evaluation of various baseline models on these benchmarks. This comprehensive analysis identifies both their strengths and shortcomings, guiding future work toward more robust, generalizable solutions. Finally, we propose a simple yet effective method for RGB-D data integration. Extensive evaluations affirm the effectiveness of our approach, offering a robust framework for advancing toward more nuanced scene understanding.

[CV-40] PG-SAG: Parallel Gaussian Splatting for Fine-Grained Large-Scale Urban Buildings Reconstruction via Semantic-Aware Grouping

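[Quick read]: This paper addresses building surface reconstruction in large-scale urban scenes, in particular fine-grained reconstruction at the original image resolution. Conventional methods for large-scale scenes typically suffer from high video memory usage and long optimization times. The paper proposes a parallel Gaussian splatting method (PG-SAG) whose key components are: 1) segmenting buildings with a cross-modal model (Language Segment Anything) to produce building masks; 2) grouping the segmented building regions into sub-regions according to visibility checks across registered images, and optimizing the Gaussian kernels of these sub-regions in parallel; 3) reformulating the normal loss at mask edges to reduce ambiguity in edge normal vectors; 4) introducing a gradient-constrained balance-load loss that improves the optimization of the 3D Gaussians and reduces thread waiting time in the pixel-parallel rendering stage. With these components, PG-SAG outperforms existing 3DGS methods on several urban scene datasets.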

Link: https://arxiv.org/abs/2501.01677
Authors: Tengfei Wang, Xin Wang, Yongmao Hou, Yiwei Xu, Wendi Zhang, Zongqian Zhan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a transformative method in the field of real-time novel view synthesis. Based on 3DGS, recent advancements cope with large-scale scenes via a spatial-based partition strategy to reduce video memory and optimization time costs. In this work, we introduce a parallel Gaussian splatting method, termed PG-SAG, which fully exploits semantic cues for both partitioning and Gaussian kernel optimization, enabling fine-grained building surface reconstruction of large-scale urban areas without downsampling the original image resolution. First, the cross-modal model Language Segment Anything is leveraged to segment building masks. Then, the segmented building regions are grouped into sub-regions according to the visibility check across registered images. The Gaussian kernels for these sub-regions are optimized in parallel with masked pixels. In addition, the normal loss is re-formulated for the detected edges of masks to alleviate the ambiguities in normal vectors on edges. Finally, to improve the optimization of 3D Gaussians, we introduce a gradient-constrained balance-load loss that accounts for the complexity of the corresponding scenes, effectively minimizing the thread waiting time in the pixel-parallel rendering stage as well as the reconstruction loss. Extensive experiments on various urban datasets demonstrate the superior performance of our PG-SAG on building surface reconstruction, compared to several state-of-the-art 3DGS-based methods. Project web: this https URL.

[CV-41] EAUWSeg: Eliminating annotation uncertainty in weakly-supervised medical image segmentation

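[Quick read]: This paper addresses the performance gap in weakly-supervised medical image segmentation caused by the uncertainty of weak labels. Existing weakly-supervised methods improve annotation efficiency but still trail fully-supervised methods by a considerable margin. The authors propose a new weak annotation method together with its learning framework, EAUWSeg, to eliminate annotation uncertainty. The key points are: first, Bounded Polygon Annotation (BPAnno), which simplifies annotation by labeling two polygons for each lesion; second, a tailored learning mechanism that treats the bounded polygons as two separate annotations and learns invariant features by providing an adversarial supervision signal; finally, a confidence-auxiliary consistency learner combined with a classification-guided confidence generator, which uses the consistency of feature representations across pixels of the same category and the class-specific information in the bounded polygon annotations to provide reliable supervision for pixels in uncertain regions. Experiments show that EAUWSeg outperforms existing weakly-supervised segmentation methods and approaches or even surpasses fully-supervised methods while substantially reducing annotation workload.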

Link: https://arxiv.org/abs/2501.01658
Authors: Wang Lituan, Zhang Lei, Wang Yan, Wang Zhenbin, Zhang Zhenwei, Zhang Yi
Affiliations: Machine Intelligence Laboratory, College of Computer Science, Sichuan University; Institute of High Performance Computing, A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Weakly-supervised medical image segmentation is gaining traction as it requires only rough annotations rather than accurate pixel-to-pixel labels, thereby reducing the workload for specialists. Although some progress has been made, there is still a considerable performance gap between the label-efficient methods and fully-supervised ones, which can be attributed to the uncertain nature of these weak labels. To address this issue, we propose a novel weak annotation method coupled with its learning framework EAUWSeg to eliminate the annotation uncertainty. Specifically, we first propose the Bounded Polygon Annotation (BPAnno) by simply labeling two polygons for a lesion. Then, a tailored learning mechanism that explicitly treats bounded polygons as two separate annotations is proposed to learn invariant features by providing an adversarial supervision signal for model training. Subsequently, a confidence-auxiliary consistency learner combined with a classification-guided confidence generator is designed to provide a reliable supervision signal for pixels in the uncertain region by leveraging the feature representation consistency across pixels within the same category as well as class-specific information encapsulated in the bounded polygon annotations. Experimental results demonstrate that EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, compared to fully-supervised counterparts, the proposed method not only delivers superior performance but also requires much less annotation workload. This underscores the superiority and effectiveness of our approach.

[CV-42] Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection

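[Quick read]: This paper addresses cross-modal feature fusion in RGB-D salient object detection (SOD). Existing methods usually fuse the attentional features of RGB and depth directly under a manual-mandatory fusion paradigm without adequately considering the inherent discrepancy between the two, which can degrade performance. In addition, the long-range dependencies of global and local information make a unified fusion strategy hard to adopt. To address these issues, the paper proposes GL-DMNet, a dual mutual learning network with global-local awareness. Its key components are: 1) a position mutual fusion module and a channel mutual fusion module that exploit the interdependencies between modalities in the spatial and channel dimensions; 2) an efficient decoder based on cascade transformer-infused reconstruction that jointly integrates multi-level fusion features. Experiments show that GL-DMNet outperforms 24 existing RGB-D SOD methods on six benchmark datasets, with an average improvement of about 3%.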

Link: https://arxiv.org/abs/2501.01648
Authors: Kang Yi, Haoran Tang, Yumeng Li, Jing Xu, Jun Zhang
Affiliations: College of Artificial Intelligence, Nankai University, Tianjin 300350, China; Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:RGB-D salient object detection (SOD), aiming to highlight prominent regions of a given scene by jointly modeling RGB and depth information, is one of the challenging pixel-level prediction tasks. Recently, the dual-attention mechanism has been devoted to this area due to its ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm without considering the inherent discrepancy between the RGB and depth, which may lead to a reduction in performance. Moreover, the long-range dependencies derived from global and local information make it difficult to leverage a unified efficient fusion strategy. Hence, in this paper, we propose the GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in spatial and channel dimensions. Besides, we adopt an efficient decoder based on cascade transformer-infused reconstruction to integrate multi-level fusion features jointly. Extensive experiments on six benchmark datasets demonstrate that our proposed GL-DMNet performs better than 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics compared to the second-best model (S3Net). Codes and results are available at this https URL.

[CV-43] HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

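[Quick read]: This paper addresses challenges in hour-long video understanding, namely the difficulty of long-term video analysis, the inefficiency of large-model approaches, and the lack of large-scale benchmark datasets. It proposes HLV-1K, a large-scale hour-long video benchmark comprising 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs that cover frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. With this dataset, the paper aims to evaluate and advance long video understanding models, especially on fine-grained tasks such as deep understanding of long live videos, meeting recordings, and movies.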

Link: https://arxiv.org/abs/2501.01645
Authors: Heqing Zou, Tianze Luo, Guiyang Xie, Victor (Xiao Jie) Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang
Affiliations: TikTok; Nanyang Technological University; Southwest Jiaotong University; Great Bay University; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. In this paper, we focus on the last of these and build a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware query and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.

[CV-44] CBIR-Sli: Interpretable Content-Based Image Retrieval with 2D Slice Embeddings

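[Quick read]: This paper addresses the limitations of current text-based retrieval of brain magnetic resonance (MR) images by proposing a content-based image retrieval (CBIR) system. Existing methods for 3D brain MRI data usually rely on 2D slices, which can miss pathological features and lose continuity of information along the depth direction. Moreover, no practical CBIR system yet preserves the structural information of the whole brain. The proposed solution, iCBIR-Sli (Interpretable CBIR with 2D Slice Embedding), is described by the authors as the first in the world to use a series of 2D slices for this purpose; by effectively aggregating slice information, it produces low-dimensional representations with high completeness, usability, robustness, and interoperability. In retrieval experiments on five public brain MR datasets (ADNI2/3, OASIS3/4, AIBL), the method achieves retrieval performance comparable to existing deep learning models (macro F1 = 0.859) and, without an external classifier, offers high interpretability by clearly identifying disease-related brain regions.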

Link: https://arxiv.org/abs/2501.01642
Authors: Shuhei Tomoshige, Hayato Muraki, Kenichi Oishi, Hitoshi Iyatomi
Affiliations: Dept. of Science and Engineering, Hosei University; Dept. of Radiology and Radiological Science, Johns Hopkins Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 8 pages, 2 figures. Accepted at the SPIE Medical Imaging

Abstract:Current methods for searching brain MR images rely on text-based approaches, highlighting a significant need for content-based image retrieval (CBIR) systems. Directly applying 3D brain MR images to machine learning models offers the benefit of effectively learning the brain’s structure; however, building the generalized model necessitates a large amount of training data. While models that consider depth direction and utilize continuous 2D slices have demonstrated success in segmentation and classification tasks involving 3D data, concerns remain. Specifically, using general 2D slices may lead to the oversight of pathological features and discontinuities in depth direction information. Furthermore, to the best of the authors’ knowledge, there have been no attempts to develop a practical CBIR system that preserves the entire brain’s structural information. In this study, we propose an interpretable CBIR method for brain MR images, named iCBIR-Sli (Interpretable CBIR with 2D Slice Embedding), which, for the first time globally, utilizes a series of 2D slices. iCBIR-Sli addresses the challenges associated with using 2D slices by effectively aggregating slice information, thereby achieving low-dimensional representations with high completeness, usability, robustness, and interoperability, which are qualities essential for effective CBIR. In retrieval evaluation experiments utilizing five publicly available brain MR datasets (ADNI2/3, OASIS3/4, AIBL) for Alzheimer’s disease and cognitively normal, iCBIR-Sli demonstrated top-1 retrieval performance (macro F1 = 0.859), comparable to existing deep learning models explicitly designed for classification, without the need for an external classifier. Additionally, the method provided high interpretability by clearly identifying the brain regions indicative of the searched-for disease.
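
The reported metric, top-1 retrieval scored by macro F1, can be illustrated as follows: each query takes the label of its most similar gallery item, and F1 is averaged over classes. The two-dimensional embeddings, the cosine similarity choice, and the function names are assumptions for illustration and are unrelated to the paper's actual representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top1_label(query, gallery):
    """Return the label of the gallery item most similar to the query."""
    best = max(gallery, key=lambda item: cosine(query, item[0]))
    return best[1]

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical embeddings: AD = Alzheimer's disease, CN = cognitively normal.
gallery = [([1.0, 0.0], "AD"), ([0.0, 1.0], "CN")]
queries = [([0.9, 0.1], "AD"), ([0.4, 0.6], "AD"), ([0.1, 0.9], "CN"), ([0.2, 0.8], "CN")]
y_true = [label for _, label in queries]
y_pred = [top1_label(vec, gallery) for vec, _ in queries]
score = macro_f1(y_true, y_pred)
```

Because macro F1 weights both classes equally, the single missed AD query lowers the score more than plain accuracy would suggest.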

[CV-45] Uncertainty and Energy based Loss Guided Semi-Supervised Semantic Segmentation WACV

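[Quick read]: This paper addresses the time and cost of pixel-level annotation in semi-supervised (SS) semantic segmentation. Its core solution trains the network with both pseudo labels and ground-truth labels, using aleatoric (data) uncertainty and energy-based modeling. Aleatoric uncertainty models the inherent noise variations of the data in a network with two predictive branches, and the per-pixel variance parameter output by the network quantifies the data uncertainty. The energy-based loss, in turn, taps the potential of generative modeling to improve the downstream semi-supervised segmentation task. These losses are applied together with pseudo-intersection labels, pseudo-union labels, and ground-truth labels on the respective network branches, and a comparison with existing state-of-the-art methods shows improved performance metrics.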

Link: https://arxiv.org/abs/2501.01640
Authors: Rini Smita Thakur, Vinod K. Kurmi
Affiliations: Indian Institute of Science Education and Research, Bhopal, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

Abstract: Semi-supervised (SS) semantic segmentation exploits both labeled and unlabeled images to overcome tedious and costly pixel-level annotation problems. Pseudolabel supervision is one of the core approaches of training networks with both pseudo labels and ground-truth labels. This work uses aleatoric or data uncertainty and energy based modeling in an intersection-union pseudo supervised network. The aleatoric uncertainty models the inherent noise variations of the data in a network with two predictive branches. The per-pixel variance parameter obtained from the network gives a quantitative idea about the data uncertainty. Moreover, energy-based loss realizes the potential of generative modeling on the downstream SS segmentation task. The aleatoric and energy loss are applied in conjunction with pseudo-intersection labels, pseudo-union labels, and ground-truth on the respective network branch. The comparative analysis with state-of-the-art methods has shown improvement in performance metrics.
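
One widely used way to learn a per-pixel variance is the heteroscedastic loss, in which a second branch predicts the log-variance that down-weights the residual while being penalized for growing. The sketch below is the regression form of that loss, given as an assumption about the general technique; the paper's segmentation loss may be formulated differently, and the toy values are hypothetical.

```python
import math

def heteroscedastic_loss(targets, means, log_vars):
    """Mean of 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s, where s = log(sigma^2)."""
    terms = [0.5 * math.exp(-s) * (y - m) ** 2 + 0.5 * s
             for y, m, s in zip(targets, means, log_vars)]
    return sum(terms) / len(terms)

# Assumed toy pixels: the second prediction is far from its (possibly noisy) target.
targets, means = [1.0, 0.0], [1.0, 2.0]
loss_fixed = heteroscedastic_loss(targets, means, [0.0, 0.0])
loss_adaptive = heteroscedastic_loss(targets, means, [0.0, math.log(4.0)])
```

Raising the predicted variance on the badly fit pixel lowers the total loss, so the network can flag noisy regions instead of overfitting to them.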

[CV-46] ACE: Anti-Editing Concept Erasure in Text-to-Image Models

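[Quick read]: This paper addresses the risk that text-to-image diffusion models, while generating high-quality images, also produce illegal content such as copyrighted images. In particular, existing concept erasure methods do well at preventing erased concepts from being generated from prompts but poorly at preventing undesired editing. The authors propose Anti-Editing Concept Erasure (ACE), whose key innovation is injecting erasure guidance into both the conditional and the unconditional noise prediction, so that the target concept is blocked during both generation and editing. They also introduce a stochastic correction guidance during training to counter the erosion of unrelated concepts. Experiments show that ACE outperforms existing methods at erasing IP characters, explicit concepts, and artistic styles.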

Link: https://arxiv.org/abs/2501.01633
Authors: Zihao Wang, Yuxiang Wei, Fan Li, Renjing Pei, Hang Xu, Wangmeng Zuo
Affiliations: Harbin Institute of Technology; Huawei Noah’s Ark Lab; Pazhou Lab (Huangpu)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, code available at this https URL

Abstract: Recent advances in text-to-image diffusion models have significantly facilitated the generation of high-quality images, but have also raised concerns about the illegal creation of harmful content, such as copyrighted images. Existing concept erasure methods achieve superior results in preventing the production of erased concepts from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. Specifically, we propose to inject the erasure guidance into both the conditional and the unconditional noise prediction, enabling the model to effectively prevent the creation of erasure concepts during both editing and generation. Furthermore, a stochastic correction guidance is introduced during training to address the erosion of unrelated concepts. We conducted erasure editing experiments with representative editing methods (i.e., LEDITS++ and MasaCtrl) to erase IP characters, and the results indicate that our ACE effectively filters out target concepts in both types of edits. Additional experiments on erasing explicit concepts and artistic styles further demonstrate that our ACE performs favorably against state-of-the-art methods. Our code will be publicly available at this https URL.

[CV-47] Merging Context Clustering with Visual State Space Models for Medical Image Segmentation

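[Quick read]: This paper addresses the aggregation of global and local feature representations in medical image segmentation, and in particular the difficulty existing methods have in handling both long-range and short-range feature interactions. Existing vision Mamba (ViM) models handle long-range features well with linear complexity, but they neglect short-range local dependencies and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context. The paper proposes context clustering ViM (CCViM), which adds a context clustering module to existing ViM models to partition image tokens into distinct windows for adaptable local clustering. The method combines long-range and short-range feature interactions and thereby strengthens spatial contextual representations for medical image segmentation. Experiments on several public datasets show that CCViM outperforms current state-of-the-art methods.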

Link: https://arxiv.org/abs/2501.01618
Authors: Yun Zhu,Dong Zhang,Yi Lin,Yifei Feng,Jinhui Tang
Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology; Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; First School of Clinical Medicine, Nanjing Medical University; Department of General Surgery, the First Affiliated Hospital of Nanjing Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Our paper has been accepted by the IEEE Transactions on Medical Imaging. Our code can be found at this https URL

Abstract:Medical image segmentation demands the aggregation of global and local feature representations, posing a challenge for current methodologies in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexities by excelling in long-range feature iterations with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information. To address these challenges, we introduce a simple yet effective method named context clustering ViM (CCViM), which incorporates a context clustering module within the existing ViM models to segment image tokens into distinct windows for adaptable local clustering. Our method effectively combines long-range and short-range feature interactions, thereby enhancing spatial contextual representations for medical image segmentation tasks. Extensive experimental evaluations on diverse public datasets, i.e., Kumar, CPM17, ISIC17, ISIC18, and Synapse demonstrate the superior performance of our method compared to current state-of-the-art methods. Our code can be found at this https URL.

[CV-48] Google is all you need: Semi-Supervised Transfer Learning Strategy For Light Multimodal Multi-Task Classification Model

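[Quick read]: This paper addresses the growing complexity of image classification as the volume of digital image data increases, specifically multi-label classification in which a single image can carry several category labels (the labels range from 1 to 19, excluding 12). The key to the solution is a multi-modal classifier that combines image recognition models (convolutional neural networks, CNNs) with natural language processing (NLP) models through a fusion module. By bringing in textual data such as image captions, the model gains contextual understanding that visual analysis alone cannot fully capture, which improves label prediction accuracy. The approach includes rigorous training and validation phases, and each model component is verified through ablation experiments. Preliminary results indicate good accuracy and efficiency, suggesting potential as an automatic image-labeling system.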

Link: https://arxiv.org/abs/2501.01611
Authors: Haixu Liu,Penghao Jiang,Zerui Tao
Affiliations: The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:As the volume of digital image data increases, the effectiveness of image classification intensifies. This study introduces a robust multi-label classification system designed to assign multiple labels to a single image, addressing the complexity of images that may be associated with multiple categories (ranging from 1 to 19, excluding 12). We propose a multi-modal classifier that merges advanced image recognition algorithms with Natural Language Processing (NLP) models, incorporating a fusion module to integrate these distinct modalities. The purpose of integrating textual data is to enhance the accuracy of label prediction by providing contextual understanding that visual analysis alone cannot fully capture. Our proposed classification model combines Convolutional Neural Networks (CNN) for image processing with NLP techniques for analyzing textual description (i.e., captions). This approach includes rigorous training and validation phases, with each model component verified and analyzed through ablation experiments. Preliminary results demonstrate the classifier’s accuracy and efficiency, highlighting its potential as an automatic image-labeling system.

[CV-49] Few-shot Implicit Function Generation via Equivariance

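[Quick read]: This paper addresses the generation of diverse yet functionally consistent weights for Implicit Neural Representations (INRs) from limited training data. The task is hard because, even for the same signal, the optimal INR weights can differ significantly depending on initialization. The authors propose the EquiGen framework. Its core idea is that functionally similar networks can be transformed into one another through weight permutations, forming an equivariance group. Projecting the weights into an equivariant latent space makes diverse generation within that group possible even with few examples. EquiGen realizes this with an equivariant encoder trained via contrastive learning and smooth augmentation, an equivariance-guided diffusion process, and controlled perturbations in the equivariant subspace. Experiments show that the method generates diverse INR weights in few-shot settings while preserving their functional properties.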

Link: https://arxiv.org/abs/2501.01601
Authors: Suizhi Huang,Xingyi Yang,Hongtao Lu,Xinchao Wang
Affiliations: Shanghai Jiao Tong University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures, 4 tables

Abstract:Implicit Neural Representations (INRs) have emerged as a powerful framework for representing continuous signals. However, generating diverse INR weights remains challenging due to limited training data. We introduce Few-shot Implicit Function Generation, a new problem setup that aims to generate diverse yet functionally consistent INR weights from only a few examples. This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. The core idea is that functionally similar networks can be transformed into one another through weight permutations, forming an equivariance group. By projecting these weights into an equivariant latent space, we enable diverse generation within these groups, even with few examples. EquiGen implements this through an equivariant encoder trained via contrastive learning and smooth augmentation, an equivariance-guided diffusion process, and controlled perturbations in the equivariant subspace. Experiments on 2D image and 3D shape INR datasets demonstrate that our approach effectively generates diverse INR weights while preserving their functional properties in few-shot scenarios.

[CV-50] Adaptive Homophily Clustering: A Structure Homophily Graph Learning with Adaptive Filter for Hyperspectral Image

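[Quick read]: This paper addresses a central challenge in hyperspectral image (HSI) clustering: how to exploit spatial structural information effectively when no training labels are available. Existing deep graph clustering methods perform well on HSI but still suffer from insufficient use of structural information, weak feature representation, and weak graph update capability. The paper proposes a clustering method based on homophily structure graph learning with an adaptive filter (AHSGC). The key steps are: first, build the initial graph through homogeneous region generation; second, design an adaptive filter graph encoder that adaptively captures high- and low-frequency features on the graph; third, develop a graph embedding clustering self-training decoder based on KL divergence, which produces pseudo-labels for network training; in parallel, introduce homophily-enhanced structure learning, which updates the graph dynamically through orient correlation estimation and graph edge sparsification; finally, use joint network optimization to achieve network self-training and graph updating. Experiments indicate that AHSGC offers high clustering accuracy, low computational complexity, and strong robustness.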

Link: https://arxiv.org/abs/2501.01595
Authors: Yao Ding,Weijie Kang,Aitao Yang,Zhili Zhang,Junyang Zhao,Jie Feng,Danfeng Hong,Qinhe Zheng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 85 figure

Abstract:Hyperspectral image (HSI) clustering has been a fundamental but challenging task with zero training labels. Currently, some deep graph clustering methods have been successfully explored for HSI due to their outstanding performance in effective spatial structural information encoding. Nevertheless, insufficient structural information utilization, poor feature representation ability, and weak graph update capability limit their performance. Thus, in this paper, a homophily structure graph learning with an adaptive filter clustering method (AHSGC) for HSI is proposed. Specifically, homogeneous region generation is first developed for HSI processing and constructing the original graph. Afterward, an adaptive filter graph encoder is designed to adaptively capture the high and low frequency features on the graph for subsequent processing. Then, a graph embedding clustering self-training decoder is developed with KL Divergence, with which the pseudo-label is generated for network training. Meanwhile, homophily-enhanced structure learning is introduced to update the graph according to the clustering task, in which the orient correlation estimation is adopted to estimate the node connection, and graph edge sparsification is designed to adjust the edges in the graph dynamically. Finally, a joint network optimization is introduced to achieve network self-training and update the graph. The K-means is adopted to express the latent features. Extensive experiments and repeated comparative analysis have verified that our AHSGC achieves high clustering accuracy, low computational complexity, and strong robustness. The source code will be available at this https URL.
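
The abstract mentions a self-training decoder built on KL divergence that yields pseudo-labels, without giving the formulas. The sketch below shows one common way such a step is written, following the Deep Embedded Clustering recipe (Student's t soft assignment, sharpened target distribution, KL(P || Q)). Whether AHSGC uses exactly this form is an assumption, and all function names are hypothetical.

```python
import math

def soft_assignments(embeddings, centers):
    """Q: Student's t kernel similarity of each embedding to each cluster center."""
    q = []
    for z in embeddings:
        sims = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)))
                for mu in centers]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q

def target_distribution(q):
    """P: square each entry of Q, divide by the cluster's soft size, renormalise rows."""
    freq = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all nodes and clusters (the self-training loss)."""
    return sum(pi * math.log(pi / qi)
               for p_row, q_row in zip(p, q)
               for pi, qi in zip(p_row, q_row) if pi > 0)

def pseudo_labels(q):
    """Hard pseudo-label per node: index of the most likely cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

# Toy 1-D embeddings and two assumed cluster centers.
embeddings = [[0.0], [0.1], [5.0]]
centers = [[0.0], [5.0]]
q = soft_assignments(embeddings, centers)
p = target_distribution(q)
loss = kl_divergence(p, q)
labels = pseudo_labels(q)
```

In a real pipeline the embeddings would come from the graph encoder and the loss would be minimised by gradient descent; here the values are fixed only to show the data flow.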

[CV-51] D3-Human: Dynamic Disentangled Digital Human from Monocular Video

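[Quick read]: This paper addresses the reconstruction of Dynamic Disentangled Digital Human geometry from monocular video. Earlier methods mostly reconstruct undecoupled clothed bodies or reconstruct clothing only, which makes them hard to apply directly to tasks such as animation production. The main challenge is that clothing occludes the body, so the reconstruction has to preserve detail in visible regions while keeping invisible regions plausible. The paper combines explicit and implicit representations to model the decoupled clothed human, drawing on the robustness of the former and the flexibility of the latter. Specifically, the visible region is reconstructed as an SDF (signed distance field), and a new human manifold signed distance field (hmSDF) is proposed to segment visible clothing from visible body, after which the visible and invisible body parts are merged. Experiments show that D^3-Human achieves high-quality decoupled reconstruction of humans in different clothing and can be applied directly to clothing transfer and animation.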

Link: https://arxiv.org/abs/2501.01589
Authors: Honghu Chen,Bo Peng,Yunfan Tao,Juyong Zhang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:We introduce D^3-Human, a method for reconstructing Dynamic Disentangled Digital Human geometry from monocular videos. Past monocular video human reconstruction primarily focuses on reconstructing undecoupled clothed human bodies or only reconstructing clothing, making it difficult to apply directly in applications such as animation production. The challenge in reconstructing decoupled clothing and body lies in the occlusion caused by clothing over the body. To this end, the details of the visible area and the plausibility of the invisible area must be ensured during the reconstruction process. Our proposed method combines explicit and implicit representations to model the decoupled clothed human body, leveraging the robustness of explicit representations and the flexibility of implicit representations. Specifically, we reconstruct the visible region as SDF and propose a novel human manifold signed distance field (hmSDF) to segment the visible clothing and visible body, and then merge the visible and invisible body. Extensive experimental results demonstrate that, compared with existing reconstruction schemes, D^3-Human can achieve high-quality decoupled reconstruction of the human body wearing different clothing, and can be directly applied to clothing transfer and animation.

[CV-52] Click-Calib: A Robust Extrinsic Calibration Method for Surround-View Systems

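[Quick read]: This paper addresses extrinsic calibration of the Surround-View System (SVS) used in Advanced Driver Assistance Systems (ADAS). Conventional extrinsic calibration relies on physical patterns, which is cumbersome and time-consuming, and it focuses on the short-range area around the vehicle, so calibration quality is lower at longer distances. The paper proposes Click-Calib, a pattern-free offline extrinsic calibration method. The user only needs to click a few keypoints on the ground in natural scenes; the method then optimizes the camera poses by minimizing the reprojection distance errors of those keypoints, which yields accurate calibration at both short and long range. Click-Calib supports single-frame and multiple-frame modes, with the latter giving better results. Experiments on an in-house dataset and the public WoodScape dataset show better accuracy and robustness than baseline methods.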

Link: https://arxiv.org/abs/2501.01557
Authors: Lihao Wang
Affiliations: Valeo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surround-View System (SVS) is an essential component in Advanced Driver Assistance System (ADAS) and requires precise calibrations. However, conventional offline extrinsic calibration methods are cumbersome and time-consuming as they rely heavily on physical patterns. Additionally, these methods primarily focus on short-range areas surrounding the vehicle, resulting in lower calibration quality in more distant zones. To address these limitations, we propose Click-Calib, a pattern-free approach for offline SVS extrinsic calibration. Without requiring any special setup, the user only needs to click a few keypoints on the ground in natural scenes. Unlike other offline calibration approaches, Click-Calib optimizes camera poses over a wide range by minimizing reprojection distance errors of keypoints, thereby achieving accurate calibrations at both short and long distances. Furthermore, Click-Calib supports both single-frame and multiple-frame modes, with the latter offering even better results. Evaluations on our in-house dataset and the public WoodScape dataset demonstrate its superior accuracy and robustness compared to baseline methods. Code is available at this https URL.

[CV-53] Task-Driven Fixation Network: An Efficient Architecture with Fixation Selection

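[Quick read]: This paper addresses how to reduce network size and computational overhead on complex tasks. It proposes a neural network architecture with three main modules: a low-resolution channel, a high-resolution channel, and a hybrid encoding module. The defining feature of the hybrid encoding module is a fixation point generator that produces fixation points dynamically, so that regions of interest are selected automatically according to the task and the redundant cost of analyzing the whole image at high resolution is avoided. With this task-driven design the model maintains task performance while improving computational efficiency.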

Link: https://arxiv.org/abs/2501.01548
Authors: Shuguang Wang,Yuanjing Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 figures, 2 tables

Abstract:This paper presents a novel neural network architecture featuring automatic fixation point selection, designed to efficiently address complex tasks with reduced network size and computational overhead. The proposed model consists of: a low-resolution channel that captures low-resolution global features from input images; a high-resolution channel that sequentially extracts localized high-resolution features; and a hybrid encoding module that integrates the features from both channels. A defining characteristic of the hybrid encoding module is the inclusion of a fixation point generator, which dynamically produces fixation points, enabling the high-resolution channel to focus on regions of interest. The fixation points are generated in a task-driven manner, enabling the automatic selection of regions of interest. This approach avoids exhaustive high-resolution analysis of the entire image, maintaining task performance and computational efficiency.
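
To make the two-channel idea concrete, here is a minimal sketch of the inputs each channel would see: a coarse average-pooled view of the whole image and a small full-resolution patch cut around a fixation point. In the paper the fixation point is produced by a learned generator; here it is supplied by hand, and the function names, pooling factor, and patch size are illustrative assumptions.

```python
def downsample(img, factor):
    """Average-pool a 2D list by an integer factor (input to the low-resolution channel)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(0, h - h % factor, factor):
        row = []
        for j in range(0, w - w % factor, factor):
            block = [img[y][x]
                     for y in range(i, i + factor)
                     for x in range(j, j + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def crop_at_fixation(img, fixation, size):
    """Cut a size x size full-resolution patch centred on (row, col), clamped to the image."""
    h, w = len(img), len(img[0])
    r, c = fixation
    top = min(max(r - size // 2, 0), h - size)
    left = min(max(c - size // 2, 0), w - size)
    return [row[left:left + size] for row in img[top:top + size]]

# 8x8 toy image whose pixel value encodes its position.
image = [[i * 8 + j for j in range(8)] for i in range(8)]
low_res = downsample(image, 4)                # 2x2 global view
patch = crop_at_fixation(image, (7, 7), 4)    # 4x4 local view near the corner

# Pixels touched: 4 (low-res) + 16 (one patch) instead of all 64.
pixels_processed = len(low_res) * len(low_res[0]) + len(patch) * len(patch[0])
```

With several sequential fixations the cost grows by one patch per step, which is where the saving over full high-resolution analysis would come from under these assumptions.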

[CV-54] SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers

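[Quick read]: This paper addresses the vulnerability of Vision Transformers (ViTs) to adversarial perturbations, and adversarial overfitting in particular. Although ViTs perform well on computer vision tasks, their robustness to adversarial attacks is poor, comparable to or even worse than that of convolutional neural networks (CNNs). In addition, their large parameter count and complex architecture make ViTs especially prone to overfitting during adversarial training, which lowers accuracy on both clean and adversarial data.

The proposed solution is a layer-selective fine-tuning method called SAFER (Sharpness Aware layer-selective Finetuning for Enhanced Robustness). Rather than optimizing the whole model, the method identifies and selectively fine-tunes the layers most susceptible to overfitting. SAFER applies sharpness-aware minimization to these selected layers while the rest of the model stays frozen, which reduces adversarial overfitting. Experiments across ViT architectures and datasets show improved clean and adversarial accuracy, typically by around 5% and in some cases by as much as 20%.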

Link: https://arxiv.org/abs/2501.01529
Authors: Bhavna Gopal,Huanrui Yang,Mark Horton,Yiran Chen
Affiliations: Department of Electrical and Computer Engineering, Duke University; Department of Electrical Engineering and Computer Science, University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision transformers (ViTs) have become essential backbones in advanced computer vision applications and multi-modal foundation models. Despite their strengths, ViTs remain vulnerable to adversarial perturbations, comparable to or even exceeding the vulnerability of convolutional neural networks (CNNs). Furthermore, the large parameter count and complex architecture of ViTs make them particularly prone to adversarial overfitting, often compromising both clean and adversarial accuracy. This paper mitigates adversarial overfitting in ViTs through a novel, layer-selective fine-tuning approach: SAFER. Instead of optimizing the entire model, we identify and selectively fine-tune a small subset of layers most susceptible to overfitting, applying sharpness-aware minimization to these layers while freezing the rest of the model. Our method consistently enhances both clean and adversarial accuracy over baseline approaches. Typical improvements are around 5%, with some cases achieving gains as high as 20% across various ViT architectures and datasets.
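
A rough sketch of what "sharpness-aware minimization on selected layers, everything else frozen" can look like is given below. It uses the standard two-step SAM update (ascend to a nearby worst-case point, then descend using the gradient found there) and restricts both steps to a chosen set of layers. The toy quadratic loss, the layer names, and the hand-picked `selected` set are assumptions for illustration; the abstract does not say how SAFER scores layers for selection.

```python
import math

# Assumed per-layer curvature of a toy quadratic loss (stand-in for a training loss).
CURVATURE = {"block0": 1.0, "block1": 10.0, "head": 0.5}

def loss_and_grads(params):
    """Return the toy loss and its gradient for every layer."""
    loss, grads = 0.0, {}
    for name, weights in params.items():
        c = CURVATURE[name]
        loss += 0.5 * c * sum(w * w for w in weights)
        grads[name] = [c * w for w in weights]
    return loss, grads

def sam_step(params, selected, lr=0.05, rho=0.1):
    """One SAM update applied only to the layers in `selected`; other layers stay frozen."""
    _, grads = loss_and_grads(params)
    # Gradient norm is taken over the selected layers only.
    norm = math.sqrt(sum(g * g for n in selected for g in grads[n]))
    scale = rho / (norm + 1e-12)
    # Step 1: move the selected layers to the nearby "sharpest" point.
    perturbed = {
        n: [w + scale * g for w, g in zip(ws, grads[n])] if n in selected else list(ws)
        for n, ws in params.items()
    }
    # Step 2: descend from the original weights using the gradient at that point.
    _, sharp_grads = loss_and_grads(perturbed)
    return {
        n: [w - lr * g for w, g in zip(ws, sharp_grads[n])] if n in selected else list(ws)
        for n, ws in params.items()
    }

params = {"block0": [1.0, -1.0], "block1": [0.5, 0.5], "head": [2.0]}
updated = sam_step(params, selected={"block1"})
```

Only `block1` changes in this run; `block0` and `head` are returned untouched, which mirrors the freezing described in the abstract.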

[CV-55] LS-GAN: Human Motion Synthesis with Latent-space GANs

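[Quick read]: This paper addresses the long training and inference times of human motion synthesis conditioned on textual input. Existing methods typically work on raw motion data or on latent-space representations with diffusion models, and both are slow to train and to run. The paper proposes a framework that uses Generative Adversarial Networks (GANs) in the latent space, which shortens training and inference considerably while keeping results comparable to state-of-the-art diffusion methods. The key point is that a simple GAN in the latent space reaches an FID (Fréchet Inception Distance) of 0.482 with more than 91% fewer floating-point operations (FLOPs) than a latent diffusion model, in experiments on the HumanML3D and HumanAct12 benchmarks. The approach opens new possibilities for efficient, high-quality motion synthesis.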

Link: https://arxiv.org/abs/2501.01449
Authors: Avinash Amballa,Gayathri Akkinapalli,Vinitra Muralikrishnan
Affiliations: University of Massachusetts Amherst
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages

Abstract:Human motion synthesis conditioned on textual input has gained significant attention in recent years due to its potential applications in various domains such as gaming, film production, and virtual reality. Conditioned Motion synthesis takes a text input and outputs a 3D motion corresponding to the text. While previous works have explored motion synthesis using raw motion data and latent space representations with diffusion models, these approaches often suffer from high training and inference times. In this paper, we introduce a novel framework that utilizes Generative Adversarial Networks (GANs) in the latent space to enable faster training and inference while achieving results comparable to those of the state-of-the-art diffusion methods. We perform experiments on the HumanML3D, HumanAct12 benchmarks and demonstrate that a remarkably simple GAN in the latent space achieves a FID of 0.482 with more than 91% in FLOPs reduction compared to latent diffusion model. Our work opens up new possibilities for efficient and high-quality motion synthesis using latent space GANs.

[CV-56] Exoplanet Detection via Differentiable Rendering

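[Quick read]: This paper addresses the high-contrast problem in direct imaging of exoplanets. Because a host star is vastly brighter than its planets, wavefront aberrations produce speckles in the telescope science images; these speckles can mimic planet signals and interfere with the detection of faint exoplanets. Traditional post-processing methods operate mainly in the image intensity domain and do not use wavefront sensing data, which are measured mainly for adaptive optics correction and whose potential for post-processing has been largely overlooked.

The key to the proposed solution is a differentiable rendering approach that uses the wavefront sensing data to model wave-based light propagation through a coronagraphic telescope system. Gradient-based optimization then improves starlight subtraction substantially and increases sensitivity to faint exoplanet signals. Simulation experiments based on the James Webb Space Telescope configuration validate the approach, showing clear gains in contrast and planet detection limits. By putting previously underexploited wavefront data to use, the method opens new avenues for exoplanet imaging and characterization.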

Link: https://arxiv.org/abs/2501.01912
Authors: Brandon Y. Feng,Rodrigo Ferrer-Chávez,Aviad Levis,Jason J. Wang,Katherine L. Bouman,William T. Freeman
Affiliations: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; Department of Physics & Astronomy, Northwestern University; Department of Computer Science, the University of Toronto; Departments of Computing and Mathematical Sciences, Electrical Engineering, and Astronomy, California Institute of Technology
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Webpage: this https URL

Abstract:Direct imaging of exoplanets is crucial for advancing our understanding of planetary systems beyond our solar system, but it faces significant challenges due to the high contrast between host stars and their planets. Wavefront aberrations introduce speckles in the telescope science images, which are patterns of diffracted starlight that can mimic the appearance of planets, complicating the detection of faint exoplanet signals. Traditional post-processing methods, operating primarily in the image intensity domain, do not integrate wavefront sensing data. These data, measured mainly for adaptive optics corrections, have been overlooked as a potential resource for post-processing, partly due to the challenge of the evolving nature of wavefront aberrations. In this paper, we present a differentiable rendering approach that leverages these wavefront sensing data to improve exoplanet detection. Our differentiable renderer models wave-based light propagation through a coronagraphic telescope system, allowing gradient-based optimization to significantly improve starlight subtraction and increase sensitivity to faint exoplanets. Simulation experiments based on the James Webb Space Telescope configuration demonstrate the effectiveness of our approach, achieving substantial improvements in contrast and planet detection limits. Our results showcase how the computational advancements enabled by differentiable rendering can revitalize previously underexploited wavefront data, opening new avenues for enhancing exoplanet imaging and characterization.

[CV-57] Compressed Domain Prior-Guided Video Super-Resolution for Cloud Gaming Content

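[Quick read]: This paper addresses block artifacts and ringing effects that appear when compressed game video is super-resolved (Super-Resolution, SR) in cloud gaming. Conventional SR networks tend to amplify these artifacts in decoded frames while ignoring the edge details of game content, which leads to unsatisfactory reconstructions. The paper proposes a lightweight network called Coding Prior-Guided Super-Resolution (CPGSR). Its key components are: 1) a Compressed Domain Guided Block (CDGB) that extracts features of different depths from coding priors and fuses them with features from the U-net backbone; 2) a series of re-parameterization blocks for reconstruction; 3) a partitioned focal frequency loss, inspired by quantization in video coding, that guides the model toward preserving high-frequency information. Experiments indicate that the method is advantageous for super-resolution of compressed game video.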

Link: https://arxiv.org/abs/2501.01773
Authors: Qizhe Wang,Qian Yin,Zhimeng Huang,Weijia Jiang,Yi Su,Siwei Ma,Jiaqi Zhang
Affiliations: National Engineering Research Center of Visual Technology, School of Computer Science, Peking University; Migu Interactive Entertainment Co., Ltd
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, Data Compression Conference 2025

Abstract:Cloud gaming is an advanced form of Internet service that necessitates local terminals to decode within limited resources and time latency. Super-Resolution (SR) techniques are often employed on these terminals as an efficient way to reduce the required bit-rate bandwidth for cloud gaming. However, insufficient attention has been paid to SR of compressed game video content. Most SR networks amplify block artifacts and ringing effects in decoded frames while ignoring edge details of game content, leading to unsatisfactory reconstruction results. In this paper, we propose a novel lightweight network called Coding Prior-Guided Super-Resolution (CPGSR) to address the SR challenges in compressed game video content. First, we design a Compressed Domain Guided Block (CDGB) to extract features of different depths from coding priors, which are subsequently integrated with features from the U-net backbone. Then, a series of re-parameterization blocks are utilized for reconstruction. Ultimately, inspired by the quantization in video coding, we propose a partitioned focal frequency loss to effectively guide the model’s focus on preserving high-frequency information. Extensive experiments demonstrate the advancement of our approach.

[CV-58] Laparoscopic Scene Analysis for Intraoperative Visualisation of Gamma Probe Signals in Minimally Invasive Cancer Surgery

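[Quick read]: This thesis addresses the lack of reliable intraoperative visualisation tools in cancer surgery, which leaves surgeons relying mainly on touch and the naked eye when excising cancerous tissue and increases both cost and patient risk. A miniaturised cancer detection probe (SENSEI) uses the cancer-targeting ability of nuclear agents and the gamma signal they emit to identify cancer more accurately during surgery. However, because the probe is non-imaging and is air-gapped from the tissue, it is hard for the surgeon to locate the probe's sensing area on the tissue surface. To make that sensing area visible in the 2D laparoscopic image, the thesis develops a series of techniques: tool tracking, pose estimation, and segmentation tools, followed by laparoscope image depth estimation algorithms and 3D reconstruction methods.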

Link: https://arxiv.org/abs/2501.01752
Authors: Baoru Huang
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Doctoral thesis

Abstract:Cancer remains a significant health challenge worldwide, with a new diagnosis occurring every two minutes in the UK. Surgery is one of the main treatment options for cancer. However, surgeons rely on the sense of touch and naked eye with limited use of pre-operative image data to directly guide the excision of cancerous tissues and metastases due to the lack of reliable intraoperative visualisation tools. This leads to increased costs and harm to the patient where the cancer is removed with positive margins, or where other critical structures are unintentionally impacted. There is therefore a pressing need for more reliable and accurate intraoperative visualisation tools for minimally invasive surgery to improve surgical outcomes and enhance patient care. A recent miniaturised cancer detection probe (i.e., SENSEI developed by Lightpoint Medical Ltd.) leverages the cancer-targeting ability of nuclear agents to more accurately identify cancer intra-operatively using the emitted gamma signal. However, the use of this probe presents a visualisation challenge as the probe is non-imaging and is air-gapped from the tissue, making it challenging for the surgeon to locate the probe-sensing area on the tissue surface. Geometrically, the sensing area is defined as the intersection point between the gamma probe axis and the tissue surface in 3D space but projected onto the 2D laparoscopic image. Hence, in this thesis, tool tracking, pose estimation, and segmentation tools were developed first, followed by laparoscope image depth estimation algorithms and 3D reconstruction methods.

[CV-59] SNeRV: Spectra-preserving Neural Representation for Video ECCV2024

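[Quick read]: This paper addresses the difficulty that existing methods based on neural representation for video (NeRV) have in capturing fine spatial details and motion patterns. The difficulty stems from spectral bias: a neural network learns high-frequency (HF) components more slowly than low-frequency (LF) components. The paper proposes spectra-preserving NeRV (SNeRV), which strengthens implicit video representations by handling the different frequency components efficiently. SNeRV uses a 2D discrete wavelet transform (DWT) to decompose video into LF and HF features, preserving spatial structure and tackling spectral bias directly. To balance compactness against detail, only the LF components are encoded, while the HF components that carry fine textures are generated by a decoder. Specialized modules, a multi-resolution fusion unit (MFU) and a high-frequency restorer (HFR), are integrated to further support the representation, and temporally extended up-sampling blocks (TUBs) embed spatio-temporal LF features into the network to capture temporal correlations between adjacent frames. Experiments show that SNeRV outperforms existing NeRV models in capturing fine details and improving reconstruction.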

Link: https://arxiv.org/abs/2501.01681
Authors: Jina Kim,Jihoo Lee,Je-Won Kang
Affiliations: SoC R&D Center, LG Electronics; Dept. of Electronic and Electrical Engineering and Graduate Program in Smart Factory, Ewha W. University, Seoul, South Korea
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract:Neural representation for video (NeRV), which employs a neural network to parameterize video signals, introduces a novel methodology in video representations. However, existing NeRV-based methods have difficulty in capturing fine spatial details and motion patterns due to spectral bias, in which a neural network learns high-frequency (HF) components at a slower rate than low-frequency (LF) components. In this paper, we propose spectra-preserving NeRV (SNeRV) as a novel approach to enhance implicit video representations by efficiently handling various frequency components. SNeRV uses 2D discrete wavelet transform (DWT) to decompose video into LF and HF features, preserving spatial structures and directly addressing the spectral bias issue. To balance the compactness, we encode only the LF components, while HF components that include fine textures are generated by a decoder. Specialized modules, including a multi-resolution fusion unit (MFU) and a high-frequency restorer (HFR), are integrated into a backbone to facilitate the representation. Furthermore, we extend SNeRV to effectively capture temporal correlations between adjacent video frames, by casting the extension as additional frequency decomposition to a temporal domain. This approach allows us to embed spatio-temporal LF features into the network, using temporally extended up-sampling blocks (TUBs). Experimental results demonstrate that SNeRV outperforms existing NeRV models in capturing fine details and achieves enhanced reconstruction, making it a promising approach in the field of implicit video representations. The codes are available at this https URL.
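
The LF/HF split that the abstract attributes to the 2D DWT can be illustrated with a single-level Haar transform, the simplest wavelet. The abstract does not state which wavelet SNeRV uses, so Haar is an assumption, and the function names below are hypothetical. The sketch splits an image into one low-frequency band and three high-frequency detail bands and shows that the split is exactly invertible.

```python
def haar_dwt2(img):
    """Single-level 2D Haar DWT of an even-sized 2D list.

    Returns (ll, d_cols, d_rows, d_diag): the low-frequency band plus three
    high-frequency bands (differences across columns, across rows, and diagonally).
    """
    ll, d_cols, d_rows, d_diag = [], [], [], []
    for i in range(0, len(img), 2):
        r_ll, r_c, r_r, r_d = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)   # local average (LF)
            r_c.append((a - b + c - d) / 2)    # left minus right columns (HF)
            r_r.append((a + b - c - d) / 2)    # top minus bottom rows (HF)
            r_d.append((a - b - c + d) / 2)    # diagonal difference (HF)
        ll.append(r_ll); d_cols.append(r_c); d_rows.append(r_r); d_diag.append(r_d)
    return ll, d_cols, d_rows, d_diag

def haar_idwt2(ll, d_cols, d_rows, d_diag):
    """Inverse of haar_dwt2: rebuild the full-resolution image from the four bands."""
    out = []
    for i in range(len(ll)):
        top, bottom = [], []
        for j in range(len(ll[0])):
            s, x, y, z = ll[i][j], d_cols[i][j], d_rows[i][j], d_diag[i][j]
            top += [(s + x + y + z) / 2, (s - x + y - z) / 2]
            bottom += [(s + x - y - z) / 2, (s - x - y + z) / 2]
        out.append(top)
        out.append(bottom)
    return out

frame = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll, d_cols, d_rows, d_diag = haar_dwt2(frame)
restored = haar_idwt2(ll, d_cols, d_rows, d_diag)
```

In the setting the abstract describes, only a band like `ll` would be stored or encoded, and a decoder would have to predict the three detail bands; the exact inverse here is just the reference that such a decoder approximates.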

[CV-60] Embedding Similarity Guided License Plate Super Resolution

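[Quick read]: This paper addresses super-resolution (SR) of low-resolution license plate images, which matters in security and surveillance applications where accurate plate recognition is essential. It proposes a framework that combines a pixel-based loss with embedding similarity learning. The key element is the Pixel and Embedding Consistency Loss (PECL), which integrates a Siamese network and applies a contrastive loss to enforce embedding similarity, improving perceptual and structural fidelity. By balancing pixel-wise accuracy against embedding-level consistency, the framework achieves better alignment of fine-grained features between high-resolution (HR) and super-resolved (SR) plates. Experiments show improvements over existing methods in PSNR_RGB, PSNR_Y, and optical character recognition (OCR) accuracy, pointing to the potential of embedding similarity learning for perceptual quality and task-specific performance in extreme super-resolution scenarios.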

Link: https://arxiv.org/abs/2501.01483
Authors: Abderrezzaq Sendjasni,Mohamed-Chaker Larabi
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Neurocomputing

Abstract:Super-resolution (SR) techniques play a pivotal role in enhancing the quality of low-resolution images, particularly for applications such as security and surveillance, where accurate license plate recognition is crucial. This study proposes a novel framework that combines pixel-based loss with embedding similarity learning to address the unique challenges of license plate super-resolution (LPSR). The introduced pixel and embedding consistency loss (PECL) integrates a Siamese network and applies contrastive loss to force embedding similarities to improve perceptual and structural fidelity. By effectively balancing pixel-wise accuracy with embedding-level consistency, the framework achieves superior alignment of fine-grained features between high-resolution (HR) and super-resolved (SR) license plates. Extensive experiments on the CCPD dataset validate the efficacy of the proposed framework, demonstrating consistent improvements over state-of-the-art methods in terms of PSNR_RGB, PSNR_Y and optical character recognition (OCR) accuracy. These results highlight the potential of embedding similarity learning to advance both perceptual quality and task-specific performance in extreme super-resolution scenarios.
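
As a rough illustration of combining a pixel loss with an embedding-level contrastive term, the sketch below adds an L1 pixel loss to the classic margin-based contrastive loss computed on a pair of embeddings. The choice of L1, the weight `lam`, the margin, and the function names are all assumptions; the abstract does not give the exact PECL formulation, and in practice the embeddings would come from the shared (Siamese) network rather than being passed in by hand.

```python
import math

def l1_pixel_loss(sr, hr):
    """Mean absolute difference between super-resolved and ground-truth pixels."""
    return sum(abs(a - b) for a, b in zip(sr, hr)) / len(hr)

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Margin-based contrastive loss on one embedding pair.

    Matching pairs are pulled together (d^2); non-matching pairs are pushed
    at least `margin` apart (max(0, margin - d)^2).
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    return d * d if same else max(0.0, margin - d) ** 2

def pixel_embedding_loss(sr, hr, emb_sr, emb_hr, lam=0.1):
    """Pixel term plus weighted embedding-consistency term for an SR/HR pair."""
    return l1_pixel_loss(sr, hr) + lam * contrastive_loss(emb_sr, emb_hr, same=True)

# Flattened toy "images" and 2-D embeddings.
total = pixel_embedding_loss(
    sr=[0.0, 1.0], hr=[0.5, 1.0],
    emb_sr=[0.0, 0.0], emb_hr=[3.0, 4.0],
)
```

Here the pixel term is 0.25 and the embedding distance is 5, so the embedding term dominates even at a small weight, which is the kind of trade-off the weight would be tuned to balance.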

[CV-61] An unsupervised method for MRI recovery: Deep image prior with structured sparsity

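[Quick read]: This paper addresses unsupervised MRI reconstruction that does not require fully sampled k-space data. Conventional methods usually depend on fully sampled data, which can be hard to acquire in practice. The authors propose an unsupervised reconstruction method called DISCUS (Deep Image Prior with Structured Sparsity). It extends the deep image prior (DIP) by introducing group sparsity to frame-specific code vectors, which allows a low-dimensional manifold capturing temporal variations to be discovered. The key is that structured sparsity constrains the reconstruction, so high-quality images can be recovered without fully sampled data. The method is validated on simulated and measured data; it outperforms existing compressed sensing and DIP-based methods in normalized mean square error (NMSE) and structural similarity index (SSIM), and also scores well in expert reader assessment.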

Link: https://arxiv.org/abs/2501.01482
Authors: Muhammad Ahmad Sultan,Chong Chen,Yingmin Liu,Katarzyna Gil,Karolina Zareba,Rizwan Ahmad
Affiliations: The Ohio State University; The Ohio State University Wexner Medical Center
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Objective: To propose and validate an unsupervised MRI reconstruction method that does not require fully sampled k-space data. Materials and Methods: The proposed method, deep image prior with structured sparsity (DISCUS), extends the deep image prior (DIP) by introducing group sparsity to frame-specific code vectors, enabling the discovery of a low-dimensional manifold for capturing temporal variations. DISCUS was validated using four studies: (I) simulation of a dynamic Shepp-Logan phantom to demonstrate its manifold discovery capabilities, (II) comparison with compressed sensing and DIP-based methods using simulated single-shot late gadolinium enhancement (LGE) image series from six distinct digital cardiac phantoms in terms of normalized mean square error (NMSE) and structural similarity index measure (SSIM), (III) evaluation on retrospectively undersampled single-shot LGE data from eight patients, and (IV) evaluation on prospectively undersampled single-shot LGE data from eight patients, assessed via blind scoring from two expert readers. Results: DISCUS outperformed competing methods, demonstrating superior reconstruction quality in terms of NMSE and SSIM (Studies I–III) and expert reader scoring (Study IV). Discussion: An unsupervised image reconstruction method is presented and validated on simulated and measured data. These developments can benefit applications where acquiring fully sampled data is challenging.
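
Two quantities named in the abstract are easy to write down: a group sparsity penalty on the frame-specific code vectors, and the NMSE metric. The sketch below uses the usual L2,1 (group lasso) penalty and, as an assumption, treats each code dimension across all frames as one group, so that whole dimensions can switch off and leave a low-dimensional code. The paper may group the codes differently; the function names and the tolerance are hypothetical.

```python
import math

def group_sparsity_penalty(codes):
    """L2,1 penalty: sum over code dimensions of the L2 norm taken across frames.

    `codes` is a list of per-frame code vectors (frames x dimensions).
    """
    n_dims = len(codes[0])
    return sum(math.sqrt(sum(frame[k] ** 2 for frame in codes))
               for k in range(n_dims))

def active_dimensions(codes, tol=1e-8):
    """Indices of code dimensions that are non-zero in at least one frame."""
    return [k for k in range(len(codes[0]))
            if any(abs(frame[k]) > tol for frame in codes)]

def nmse(estimate, reference):
    """Normalized mean square error: ||estimate - reference||^2 / ||reference||^2."""
    err = sum((e - r) ** 2 for e, r in zip(estimate, reference))
    return err / sum(r * r for r in reference)

# Two frames with two code dimensions; the second dimension is switched off.
codes = [[3.0, 0.0], [4.0, 0.0]]
penalty = group_sparsity_penalty(codes)
active = active_dimensions(codes)
```

In a reconstruction loop this penalty would be added, with some weight, to the data-consistency loss on the undersampled k-space measurements; that part is omitted here.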

[CV-62] Unleashing Correlation and Continuity for Hyperspectral Reconstruction from RGB Images

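[Quick Read]: This paper addresses reconstructing hyperspectral images (HSI) from RGB images in order to obtain HSI with high spatial resolution at low cost. The key to the solution is to fully exploit the local correlation and global continuity of spectral characteristics. The authors propose a Correlation and Continuity Network (CCNet) with two core modules: Group-wise Spectral Correlation Modeling (GrSCM) and Neighborhood-wise Spectral Continuity Modeling (NeSCM). GrSCM efficiently models similarity between spectral bands within a local range, while NeSCM uses memory units to recursively model the progressive variation of the spectrum at the global level. A Patch-wise Adaptive Fusion (PAF) module then merges the global continuity features into the local spectral features adaptively, patch by patch. Together these components improve the quality of reconstructed HSI and achieve state-of-the-art performance on the mainstream NTIRE2022 and NTIRE2020 datasets.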

Link: https://arxiv.org/abs/2501.01481
Authors: Fuxiang Feng,Runmin Cong,Shoushui Wei,Yipeng Zhang,Jun Li,Sam Kwong,Wei Zhang
Affiliations: School of Control Science and Engineering, Shandong University, Jinan 250061, China; Key Laboratory of Machine Intelligence and System Control, Ministry of Education, Jinan 250061, China; University of California, Los Angeles, America; Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan 430078, China; School of Computer Science, China University of Geosciences, Wuhan 430078, China; Lingnan University, Hong Kong SAR, China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing Hyperspectral Images (HSI) from RGB images can yield high spatial resolution HSI at a lower cost, demonstrating significant application potential. This paper reveals that local correlation and global continuity of the spectral characteristics are crucial for HSI reconstruction tasks. Therefore, we fully explore these inter-spectral relationships and propose a Correlation and Continuity Network (CCNet) for HSI reconstruction from RGB images. For the correlation of local spectrum, we introduce the Group-wise Spectral Correlation Modeling (GrSCM) module, which efficiently establishes spectral band similarity within a localized range. For the continuity of global spectrum, we design the Neighborhood-wise Spectral Continuity Modeling (NeSCM) module, which employs memory units to recursively model the progressive variation characteristics at the global level. In order to explore the inherent complementarity of these two modules, we design the Patch-wise Adaptive Fusion (PAF) module to efficiently integrate global continuity features into the spectral features in a patch-wise adaptive manner. These innovations enhance the quality of reconstructed HSI. We perform comprehensive comparison and ablation experiments on the mainstream datasets NTIRE2022 and NTIRE2020 for the spectral reconstruction task. Compared to the current advanced spectral reconstruction algorithms, our designed algorithm achieves State-Of-The-Art (SOTA) performance.

[CV-63] Tech Report: Divide and Conquer 3D Real-Time Reconstruction for Improved IGS

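[Quick Read]: This report addresses the technical difficulty of tracking surgical modifications from endoscopic videos, a capability with clear clinical value that remains hard to achieve. It proposes a modular pipeline that breaks the clinical challenges into separate parts. The pipeline combines frame selection, depth estimation, and 3D reconstruction components, and its modular design makes it easy to incorporate new methods. Key advances include integrating Depth-Anything V2 and EndoDAC for depth estimation and improving the Iterative Closest Point (ICP) alignment process. Experiments on the Hamlyn dataset demonstrate the effectiveness of the integrated methods, and the report discusses both the capabilities and the limitations of the system.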

Link: https://arxiv.org/abs/2501.01465
Authors: Yicheng Zhu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tracking surgical modifications based on endoscopic videos is technically feasible and of great clinical advantages; however, it still remains challenging. This report presents a modular pipeline to divide and conquer the clinical challenges in the process. The pipeline integrates frame selection, depth estimation, and 3D reconstruction components, allowing for flexibility and adaptability in incorporating new methods. Recent advancements, including the integration of Depth-Anything V2 and EndoDAC for depth estimation, as well as improvements in the Iterative Closest Point (ICP) alignment process, are detailed. Experiments conducted on the Hamlyn dataset demonstrate the effectiveness of the integrated methods. System capability and limitations are both discussed.

[CV-64] Estimation of 3T MR images from 1.5T images regularized with Physics based Constraint

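[Quick Read]: This paper addresses improving the quality of low-field (1.5T) MRI images without high-field (such as 3T or 7T) example images. Existing post-processing methods usually depend on high-field example images or on pixel-to-pixel correspondences, which work poorly for 1.5T images and require expensive image registration. The paper proposes an unsupervised framework that assumes the low-field and high-field images are related by a linear transformation (LT) and estimates both the unknown high-field image and the unknown LT. A physics-based constraint supplies an additional non-linear relation between the low-field and high-field images to further refine the mapping. Experiments show that the method produces enhanced 1.5T images (estimated 3T-like images) that compare favorably with existing methods, and that the improved images give better tissue segmentation and volume quantification than scanner-acquired 1.5T images.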

Link: https://arxiv.org/abs/2501.01464
Authors: Prabhjot Kaur,Atul Singh Minhas,Chirag Kamal Ahuja,Anil Kumar Sao
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: conference paper

Abstract:Limited accessibility to high field MRI scanners (such as 7T, 11T) has motivated the development of post-processing methods to improve low field images. Several existing post-processing methods have shown the feasibility of improving 3T images to produce 7T-like images [3,18]. It has been observed that improving lower field (LF, ≤1.5T) images comes with additional challenges due to poor image quality: the function mapping 1.5T and higher field (HF, 3T) images is more complex than the function relating 3T and 7T images [10]. Except for [10], no method has been proposed to improve ≤1.5T MRI images. Further, most of the existing methods [3,18], including [10], require example images and often rely on pixel to pixel correspondences between LF and HF images, which are usually inaccurate for ≤1.5T images. The focus of this paper is an unsupervised framework for quality improvement of 1.5T images that avoids the expensive requirements of example images and associated image registration. The LF and HF images are assumed to be related by a linear transformation (LT). The unknown HF image and unknown LT are estimated in an alternate minimization framework. Further, a physics based constraint is proposed that provides an additional non-linear function relating LF and HF images in order to achieve the desired high contrast in the estimated HF image. The experimental results demonstrate that the proposed approach provides processed 1.5T images, i.e., estimated 3T-like images with improved image quality, and is comparably better than the existing methods addressing similar problems. The improvement in image quality is also shown to provide better tissue segmentation and volume quantification as compared to scanner acquired 1.5T images.

[CV-65] GDSR: Global-Detail Integration through Dual-Branch Network with Wavelet Losses for Remote Sensing Image Super-Resolution

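[Quick Read]: This paper addresses two weaknesses of existing remote sensing image super-resolution (RSI-SR) methods: they fail to capture global and local dependencies at the same time, and their computational cost becomes prohibitive on large-scale remote sensing images. The key to the solution is applying Receptance Weighted Key Value (RWKV), which captures long-range dependencies with linear complexity. The paper proposes a Global-Detail dual-branch structure, GDSR, that runs RWKV and convolutional operations in parallel to handle large-scale images, plus a Global-Detail Reconstruction Module (GDRM) that bridges the complementary roles of the two branches. It also proposes Wavelet Loss, a loss function that captures high-frequency detail and thereby improves the visual quality of the reconstruction. Experiments on several benchmarks show that GDSR outperforms the Transformer-based state-of-the-art method HAT by an average of 0.05 dB in PSNR while using 63% of its parameters and 51% of its FLOPs and running inference 2.9 times faster.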

Link: https://arxiv.org/abs/2501.01460
Authors: Qiwei Zhu,Kai Li,Guojing Zhang,Xiaoying Wang,Jianqiang Huang,Xilai Li
Affiliations: School of Computer Technology and Application, Qinghai University; Qinghai Provincial Laboratory for Intelligent Computing and Application, Qinghai University; Department of Computer Science and Technology, Tsinghua University; School of Computer and Information Science, Qinghai Institute of Technology; College of Agriculture and Animal Husbandry, Qinghai University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: GDSR: Global-Detail Integration through Dual-Branch Network with Wavelet Losses for Remote Sensing Image Super-Resolution

Abstract:In recent years, deep neural networks, including Convolutional Neural Networks, Transformers, and State Space Models, have achieved significant progress in Remote Sensing Image (RSI) Super-Resolution (SR). However, existing SR methods typically overlook the complementary relationship between global and local dependencies. These methods either focus on capturing local information or prioritize global information, which results in models that are unable to effectively capture both global and local features simultaneously. Moreover, their computational cost becomes prohibitive when applied to large-scale RSIs. To address these challenges, we introduce the novel application of Receptance Weighted Key Value (RWKV) to RSI-SR, which captures long-range dependencies with linear complexity. To simultaneously model global and local features, we propose the Global-Detail dual-branch structure, GDSR, which performs SR reconstruction by paralleling RWKV and convolutional operations to handle large-scale RSIs. Furthermore, we introduce the Global-Detail Reconstruction Module (GDRM) as an intermediary between the two branches to bridge their complementary roles. In addition, we propose Wavelet Loss, a loss function that effectively captures high-frequency detail information in images, thereby enhancing the visual quality of SR, particularly in terms of detail reconstruction. Extensive experiments on several benchmarks, including AID, AID_CDM, RSSRD-QH, and RSSRD-QH_CDM, demonstrate that GDSR outperforms the state-of-the-art Transformer-based method HAT by an average of 0.05 dB in PSNR, while using only 63% of its parameters and 51% of its FLOPs, achieving an inference speed 2.9 times faster. Furthermore, the Wavelet Loss shows excellent generalization across various architectures, providing a novel perspective for RSI-SR enhancement.
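
The abstract does not spell out how Wavelet Loss is computed, so the following is only a minimal sketch of one plausible form: a one-level 2D Haar transform followed by the mean absolute error over the three high-frequency subbands. The choice of Haar, the single decomposition level, and the names `haar_subbands` and `wavelet_detail_loss` are all assumptions made for illustration.

```python
def haar_subbands(img):
    """One-level 2D Haar transform of an even-sized grayscale image (list of rows)."""
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(img), 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top row of the 2x2 block
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom row of the 2x2 block
            row_ll.append((a + b + d + e) / 2)  # low-pass approximation
            row_lh.append((a + b - d - e) / 2)  # top-vs-bottom difference
            row_hl.append((a - b + d - e) / 2)  # left-vs-right difference
            row_hh.append((a - b - d + e) / 2)  # diagonal detail
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh

def wavelet_detail_loss(pred, target):
    """Mean absolute error over the three high-frequency Haar subbands."""
    _, *pred_high = haar_subbands(pred)
    _, *target_high = haar_subbands(target)
    total, count = 0.0, 0
    for p_band, t_band in zip(pred_high, target_high):
        for p_row, t_row in zip(p_band, t_band):
            for p, t in zip(p_row, t_row):
                total += abs(p - t)
                count += 1
    return total / count

sharp = [[0.0, 1.0], [0.0, 1.0]]    # a vertical edge
blurry = [[0.5, 0.5], [0.5, 0.5]]   # same average brightness, edge lost
edge_loss = wavelet_detail_loss(blurry, sharp)
```

In this toy case the two images share the same low-pass subband, so a loss on that band alone would not distinguish them, while the detail loss is nonzero because the blurred image has lost the edge.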

[CV-66] SS-CTML: Self-Supervised Cross-Task Mutual Learning for CT Image Reconstruction

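[Quick Read]: This paper addresses the difficulty of obtaining paired training datasets in clinical practice, which limits the adoption of supervised deep learning (SDL) for X-ray computed tomography (CT) image reconstruction. It proposes a self-supervised cross-task mutual learning (SS-CTML) framework. The key step is to extract sparse-view and limited-view sinogram data from full-view sinogram data, which yields three separate reconstruction tasks: full-view CT (FVCT), sparse-view CT (SVCT), and limited-view CT (LVCT). One neural network is built per task, and a set of cross-task mutual learning objectives lets the three networks optimize in a self-supervised way by learning from each other. Experiments on clinical datasets show promising reconstruction performance in both quantitative and qualitative evaluations.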

Link: https://arxiv.org/abs/2501.01456
Authors: Gaofeng Chen,Yaoduo Zhang,Li Huang,Pengfei Wang,Wenyu Zhang,Dong Zeng,Jianhua Ma,Ji He
Affiliations: School of Biomedical Engineering, Guangzhou Medical University; Sun Yat-sen Memorial Hospital, Sun Yat-sen University; School of Biomedical Engineering, Southern Medical University; School of Life Science and Technology, Xi’an Jiaotong University; Fourth Affiliated Hospital, Guangzhou Medical University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Supervised deep-learning (SDL) techniques with paired training datasets have been widely studied for X-ray computed tomography (CT) image reconstruction. However, due to the difficulties of obtaining paired training datasets in clinical routine, the SDL methods are still away from common uses in clinical practices. In recent years, self-supervised deep-learning (SSDL) techniques have shown great potential for the studies of CT image reconstruction. In this work, we propose a self-supervised cross-task mutual learning (SS-CTML) framework for CT image reconstruction. Specifically, a sparse-view scanned and a limited-view scanned sinogram data are first extracted from a full-view scanned sinogram data, which results in three individual reconstruction tasks, i.e., the full-view CT (FVCT) reconstruction, the sparse-view CT (SVCT) reconstruction, and limited-view CT (LVCT) reconstruction. Then, three neural networks are constructed for the three reconstruction tasks. Considering that the ultimate goals of the three tasks are all to reconstruct high-quality CT images, we therefore construct a set of cross-task mutual learning objectives for the three tasks, in which way, the three neural networks can be self-supervised optimized by learning from each other. Clinical datasets are adopted to evaluate the effectiveness of the proposed framework. Experimental results demonstrate that the SS-CTML framework can obtain promising CT image reconstruction performance in terms of both quantitative and qualitative measurements.
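
The data-extraction step described in this abstract can be sketched with plain slicing. This is an illustration under stated assumptions rather than the authors' procedure: the sinogram is stored with one row per projection angle, sparse-view data keeps every `step`-th angle, and limited-view data keeps one contiguous angular block. The function names and the toy sizes are hypothetical.

```python
def sparse_view(sinogram, step):
    """Keep every `step`-th projection angle across the full angular range."""
    return sinogram[::step]

def limited_view(sinogram, start, count):
    """Keep a contiguous block of `count` projection angles starting at `start`."""
    return sinogram[start:start + count]

# Toy full-view sinogram: 8 projection angles, 3 detector bins each.
# The value angle * 10 + det makes it easy to see which rows were kept.
full = [[angle * 10 + det for det in range(3)] for angle in range(8)]

sparse = sparse_view(full, step=4)               # angles 0 and 4
limited = limited_view(full, start=0, count=4)   # angles 0 to 3
```

All three datasets come from the same scan, which is what allows the three reconstruction tasks to supervise each other without paired ground-truth images.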

[CV-67] Real-Time Computational Visual Aberration Correcting Display Through High-Contrast Inverse Blurring

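[Quick Read]: This paper targets situations where traditional vision correction devices such as glasses or contact lenses are inconvenient to wear. It proposes a framework for a live vision-correcting display (VCD) that improves visual clarity by compensating for refractive visual aberrations. The core of the solution is deconvolving the displayed image with a point spread function (PSF) associated with the viewer's eye. To reduce the ringing artefacts that deconvolution produces, a masking technique is applied to the prefiltered image. Performing deconvolution only on the luma (brightness) channel in the YUV/YCbCr color space improves contrast and reduces color distortion. Finally, the paper introduces a technique for computing the PSF in real time from the viewer's spherical coordinates relative to the screen, so the PSF stays accurate and undistorted when the viewer looks from an angle off the screen normal and the correction remains consistent across viewing angles. The display achieves a structural similarity index (SSIM) of 83.04%, which the authors present as evidence of the approach's effectiveness.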

Link: https://arxiv.org/abs/2501.01450
Authors: Akhilesh Balaji,Dhruv Ramu
Affiliations: Neev Academy, Bengaluru, Karnataka, 560037 India; Ashoka University, Rajiv Gandhi Education City, National Capital Region P.O. Rai, Sonepat, Haryana, 131029 India
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 14 figures

Abstract:This paper presents a framework for developing a live vision-correcting display (VCD) to address refractive visual aberrations without the need for traditional vision correction devices like glasses or contact lenses, particularly in scenarios where wearing them may be inconvenient. We achieve this correction through deconvolution of the displayed image using a point spread function (PSF) associated with the viewer’s eye. We address ringing artefacts using a masking technique applied to the prefiltered image. We also enhance the display’s contrast and reduce color distortion by operating in the YUV/YCbCr color space, where deconvolution is performed solely on the luma (brightness) channel. Finally, we introduce a technique to calculate a real-time PSF that adapts based on the viewer’s spherical coordinates relative to the screen. This ensures that the PSF remains accurate and undistorted even when the viewer observes the display from an angle relative to the screen normal, thereby providing consistent visual correction regardless of the viewing angle. The results of our display demonstrate significant improvements in visual clarity, achieving a structural similarity index (SSIM) of 83.04%, highlighting the effectiveness of our approach.
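
Restricting the correction to the luma channel can be sketched as follows. The conversion uses the widely used full-range BT.601 (JPEG-style) coefficients, which is an assumption, since the abstract does not say which YUV/YCbCr variant the authors use. The `stretch_contrast` function is a hypothetical stand-in for the PSF deconvolution step, which is not reproduced here.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG-style) RGB to YCbCr for values in 0..255."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse of rgb_to_ycbcr (no clipping, so results may leave 0..255)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b

def apply_to_luma(pixels, luma_filter):
    """Run `luma_filter` on the Y channel only; Cb and Cr pass through unchanged."""
    ycc = [rgb_to_ycbcr(*p) for p in pixels]
    new_luma = luma_filter([y for y, _, _ in ycc])
    return [ycbcr_to_rgb(y, cb, cr) for y, (_, cb, cr) in zip(new_luma, ycc)]

def stretch_contrast(luma, gain=1.5):
    """Hypothetical placeholder for deconvolution: push luma away from its mean."""
    mean = sum(luma) / len(luma)
    return [mean + gain * (y - mean) for y in luma]

pixels = [(200, 100, 50), (60, 120, 180)]
corrected = apply_to_luma(pixels, stretch_contrast)
```

Because the filter never touches Cb or Cr, whatever it does to brightness cannot shift the chroma, which is the property the abstract relies on to limit color distortion.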

Artificial Intelligence

[AI-0] MixGCN: Scalable GCN Training by Mixture of Parallelism and Mixture of Accelerators

Link: https://arxiv.org/abs/2501.01951
Authors: Cheng Wan,Runkao Tao,Zheng Du,Yang Katie Zhao,Yingyan Celine Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 12 figures, 5 tables

Abstract:Graph convolutional networks (GCNs) have demonstrated superiority in graph-based learning tasks. However, training GCNs on full graphs is particularly challenging, due to the following two challenges: (1) the associated feature tensors can easily explode the memory and block the communication bandwidth of modern accelerators, and (2) the computation workflow in training GCNs alternates between sparse and dense matrix operations, complicating the efficient utilization of computational resources. Existing solutions for scalable distributed full-graph GCN training mostly adopt partition parallelism, which is unsatisfactory as they only partially address the first challenge while incurring scaled-out communication volume. To this end, we propose MixGCN aiming to simultaneously address both the aforementioned challenges towards GCN training. To tackle the first challenge, MixGCN integrates mixture of parallelism. Both theoretical and empirical analysis verify its constant communication volumes and enhanced balanced workload; For handling the second challenge, we consider mixture of accelerators (i.e., sparse and dense accelerators) with a dedicated accelerator for GCN training and a fine-grain pipeline. Extensive experiments show that MixGCN achieves boosted training efficiency and scalability.
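
The alternation between sparse and dense operations mentioned in this abstract shows up even in a single GCN layer. The sketch below is a simplified illustration, not MixGCN code: it uses mean aggregation over an adjacency list (with self-loops already included) in place of the usual symmetric normalization, and all names and toy values are hypothetical.

```python
def sparse_aggregate(neighbors, features):
    """Sparse step: average the features of each node's neighbors (adjacency-list SpMM)."""
    dim = len(features[0])
    return [
        [sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(dim)]
        for nbrs in neighbors
    ]

def dense_transform(features, weight):
    """Dense step: multiply node features by a weight matrix, then apply ReLU."""
    return [
        [max(0.0, sum(row[k] * weight[k][c] for k in range(len(weight))))
         for c in range(len(weight[0]))]
        for row in features
    ]

def gcn_layer(neighbors, features, weight):
    """One simplified GCN layer: sparse aggregation followed by a dense transform."""
    return dense_transform(sparse_aggregate(neighbors, features), weight)

# A 3-node path graph; each neighbor list includes the node itself.
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0], [1.0]]

output = gcn_layer(neighbors, features, weight)
```

The first step has irregular, graph-dependent memory access while the second is a regular matrix multiply, which is why the two can favor different kinds of hardware.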

[AI-1] MADGEN – Mass-Spec attends to De Novo Molecular generation

Link: https://arxiv.org/abs/2501.01950
Authors: Yinkai Wang,Xiaohui Chen,Liping Liu,Soha Hassoun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the “dark chemical space” without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation starting with the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we employ the MS/MS spectrum to guide an attention-based generative model to generate the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym) and evaluate MADGEN’s performance with a predictive scaffold retriever and with an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process to achieve strong results with the oracle retriever.
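
The first stage, retrieval as ranking with contrastive alignment, can be sketched generically. The code below assumes the spectrum and the candidate scaffolds have already been embedded into a shared vector space by some encoders, which are not shown. It uses cosine similarity for ranking and an InfoNCE-style loss as one common contrastive objective. The abstract does not state which loss MADGEN uses, so that choice and every name here are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_scaffolds(spectrum_emb, scaffold_embs):
    """Return candidate indices ordered from most to least similar."""
    scores = [cosine(spectrum_emb, s) for s in scaffold_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def info_nce(spectrum_emb, scaffold_embs, positive_idx, temperature=0.1):
    """Contrastive loss: negative log-softmax of the positive candidate's score."""
    logits = [cosine(spectrum_emb, s) / temperature for s in scaffold_embs]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[positive_idx]

# Hypothetical 2-D embeddings for one spectrum and three candidate scaffolds.
spectrum = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]

ranking = rank_scaffolds(spectrum, candidates)
```

Training would lower the loss by pulling the correct scaffold's embedding toward the spectrum's and pushing the others away, which in turn moves the correct scaffold up the ranking.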

[AI-2] Cold-Start Recommendation towards the Era of Large Language Models (LLMs): A Comprehensive Survey and Roadmap

Link: https://arxiv.org/abs/2501.01945
Authors: Weizhi Zhang,Yuanchen Bei,Liangwei Yang,Henry Peng Zou,Peilin Zhou,Aiwei Liu,Yinghui Li,Hao Chen,Jianling Wang,Yu Wang,Feiran Huang,Sheng Zhou,Jiajun Bu,Allen Lin,James Caverlee,Fakhri Karray,Irwin King,Philip S. Yu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cold-start problem is one of the long-standing challenges in recommender systems, focusing on accurately modeling new or interaction-limited users or items to provide better recommendations. Due to the diversification of internet platforms and the exponential growth of users and items, the importance of cold-start recommendation (CSR) is becoming increasingly evident. At the same time, large language models (LLMs) have achieved tremendous success and possess strong capabilities in modeling user and item information, providing new potential for cold-start recommendations. However, the research community on CSR still lacks a comprehensive review and reflection in this field. Based on this, in this paper, we stand in the context of the era of large language models and provide a comprehensive review and discussion on the roadmap, related literature, and future directions of CSR. Specifically, we have conducted an exploration of the development path of how existing CSR utilizes information, from content features, graph relations, and domain information, to the world knowledge possessed by large language models, aiming to provide new insights for both the research and industrial communities on CSR. Related resources of cold-start recommendations are collected and continuously updated for the community in this https URL.

[AI-3] Mingling with the Good to Backdoor Federated Learning

Link: https://arxiv.org/abs/2501.01913
Authors: Nuno Neves
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 13 pages, 9 figures, under submission

Abstract:Federated learning (FL) is a decentralized machine learning technique that allows multiple entities to jointly train a model while preserving dataset privacy. However, its distributed nature has raised various security concerns, which have been addressed by increasingly sophisticated defenses. These protections utilize a range of data sources and metrics to, for example, filter out malicious model updates, ensuring that the impact of attacks is minimized or eliminated. This paper explores the feasibility of designing a generic attack method capable of installing backdoors in FL while evading a diverse array of defenses. Specifically, we focus on an attacker strategy called MIGO, which aims to produce model updates that subtly blend with legitimate ones. The resulting effect is a gradual integration of a backdoor into the global model, often ensuring its persistence long after the attack concludes, while generating enough ambiguity to hinder the effectiveness of defenses. MIGO was employed to implant three types of backdoors across five datasets and different model architectures. The results demonstrate the significant threat posed by these backdoors, as MIGO consistently achieved exceptionally high backdoor accuracy (exceeding 90%) while maintaining the utility of the main task. Moreover, MIGO exhibited strong evasion capabilities against ten defenses, including several state-of-the-art methods. When compared to four other attack strategies, MIGO consistently outperformed them across most configurations. Notably, even in extreme scenarios where the attacker controls just 0.1% of the clients, the results indicate that successful backdoor insertion is possible if the attacker can persist for a sufficient number of rounds. 

[AI-4] QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture

Link: https://arxiv.org/abs/2501.01892
Authors: Shvetank Prakash,Andrew Cheng,Jason Yik,Arya Tschand,Radhika Ghosal,Ikechukwu Uchendu,Jessica Quaye,Jeffrey Ma,Shreyas Grampurohit,Sofia Giannuzzi,Arnav Balyan,Fin Amin,Aadya Pipersenia,Yash Choudhary,Ankita Nayak,Amir Yazdanbakhsh,Vijay Janapa Reddi
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles in memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at this https URL.

[AI-5] Evaluating Scenario-based Decision-making for Interactive Autonomous Driving Using Rational Criteria: A Survey

Link: https://arxiv.org/abs/2501.01886
Authors: Zhen Tian,Zhihao Lin,Dezong Zhao,Wenjing Zhao,David Flynn,Shuja Ansari,Chongfeng Wei
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Autonomous vehicles (AVs) can significantly promote the advances in road transport mobility in terms of safety, reliability, and decarbonization. However, ensuring safety and efficiency in interactive driving within dynamic and diverse environments is still a primary barrier to large-scale AV adoption. In recent years, deep reinforcement learning (DRL) has emerged as an advanced AI-based approach, enabling AVs to learn decision-making strategies adaptively from data and interactions. DRL strategies are better suited than traditional rule-based methods for handling complex, dynamic, and unpredictable driving environments due to their adaptivity. However, varying driving scenarios present distinct challenges, such as avoiding obstacles on highways and reaching specific exits at intersections, requiring different scenario-specific decision-making algorithms. Many DRL algorithms have been proposed for interactive decision-making. However, a rational review of these DRL algorithms across various scenarios is lacking. Therefore, a comprehensive evaluation is essential to assess these algorithms from multiple perspectives, including those of vehicle users and vehicle manufacturers. This survey reviews the application of DRL algorithms in autonomous driving across typical scenarios, summarizing road features and recent advancements. The scenarios include highways, on-ramp merging, roundabouts, and unsignalized intersections. Furthermore, DRL-based algorithms are evaluated based on five rational criteria: driving safety, driving efficiency, training efficiency, unselfishness, and interpretability (DDTUI). Each criterion of DDTUI is specifically analyzed in relation to the reviewed algorithms. Finally, the challenges for future DRL-based decision-making algorithms are summarized.

[AI-6] LCFed: An Efficient Clustered Federated Learning Framework for Heterogeneous Data

Link: https://arxiv.org/abs/2501.01850
Authors: Yuxin Zhang,Haoyu Chen,Zheng Lin,Zhe Chen,Jin Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages, 3 figures

Abstract:Clustered federated learning (CFL) addresses the performance challenges posed by data heterogeneity in federated learning (FL) by organizing edge devices with similar data distributions into clusters, enabling collaborative model training tailored to each group. However, existing CFL approaches strictly limit knowledge sharing to within clusters, lacking the integration of global knowledge with intra-cluster training, which leads to suboptimal performance. Moreover, traditional clustering methods incur significant computational overhead, especially as the number of edge devices increases. In this paper, we propose LCFed, an efficient CFL framework to combat these challenges. By leveraging model partitioning and adopting distinct aggregation strategies for each sub-model, LCFed effectively incorporates global knowledge into intra-cluster co-training, achieving optimal training performance. Additionally, LCFed customizes a computationally efficient model similarity measurement method based on low-rank models, enabling real-time cluster updates with minimal computational overhead. Extensive experiments show that LCFed outperforms state-of-the-art benchmarks in both test accuracy and clustering computational efficiency.

[AI-7] Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification

Link: https://arxiv.org/abs/2501.01849
Authors: Xiangxiang Dai,Yuejin Xie,Maoli Liu,Xuchuang Wang,Zhuohua Li,Huanyu Wang,John C.S. Lui
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:The remarkable generative capability of large language models (LLMs) has sparked a growing interest in automatically generating responses for different applications. Given the dynamic nature of user preferences and the uncertainty of LLM response performance, it is crucial to design efficient online learning algorithms to identify optimal LLM responses (i.e., high-quality responses that also meet user preferences). Most existing online algorithms adopt a centralized approach and fail to leverage explicit user preferences for more efficient and personalized LLM response identification. In contrast, this paper introduces MACO (Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification): 1) The online LLM response identification process is accelerated by multiple local agents (such as smartphones), while enhancing data privacy; 2) A novel conversational mechanism is proposed to adaptively conduct conversations for soliciting user preferences (e.g., a preference for a humorous tone over a serious one in generated responses), so as to minimize uncertainty in preference estimation. Our theoretical analysis demonstrates that MACO is near-optimal regarding cumulative regret. Additionally, MACO offers reduced communication costs and computational complexity by eliminating the traditional, computing-intensive "G-optimal design" found in previous works. Extensive experiments with the open LLM Llama, coupled with two different embedding models from Google and OpenAI for text vector representation, demonstrate that MACO significantly outperforms the current state-of-the-art in online LLM response identification.

[AI-8] Practical machine learning is learning on small samples

Link: https://arxiv.org/abs/2501.01836
Authors: Marina Sapir
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Based on limited observations, machine learning discerns a dependence which is expected to hold in the future. What makes it possible? Statistical learning theory imagines indefinitely increasing training sample to justify its approach. In reality, there is no infinite time or even infinite general population for learning. Here I argue that practical machine learning is based on an implicit assumption that underlying dependence is relatively "smooth": likely, there are no abrupt differences in feedback between cases with close data points. From this point of view learning shall involve selection of the hypothesis "smoothly" approximating the training set. I formalize this as Practical learning paradigm. The paradigm includes terminology and rules for description of learners. Popular learners (local smoothing, k-NN, decision trees, Naive Bayes, SVM for classification and for regression) are shown here to be implementations of this paradigm.
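
The smoothness assumption is easy to see in k-NN, one of the learners the abstract lists. The sketch below is a textbook one-dimensional k-NN regressor, not the paper's formalism. It predicts by averaging the feedback of the closest training points, which only makes sense if nearby points tend to have similar feedback. The function name and the toy data are hypothetical.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the labels of the k training points closest to `query`."""
    by_distance = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k

# Two well-separated clusters with constant feedback inside each cluster.
train_x = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
train_y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

low = knn_predict(train_x, train_y, query=1.2)
high = knn_predict(train_x, train_y, query=10.5)
```

A query that falls among the first cluster inherits that cluster's feedback, and likewise for the second, so the prediction changes only where the data itself changes.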

[AI-9] ASKCOS: an open source software suite for synthesis planning

Link: https://arxiv.org/abs/2501.01835
Authors: Zhengkai Tu,Sourabh J. Choure,Mun Hong Fong,Jihye Roh,Itai Levin,Kevin Yu,Joonyoung F. Joung,Nathan Morgan,Shih-Cheng Li,Xiaoqi Sun,Huiqian Lin,Mark Murnin,Jordan P. Liles,Thomas J. Struble,Michael E. Fortunato,Mengjie Liu,William H. Green,Klavs F. Jensen,Connor W. Coley
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The advancement of machine learning and the availability of large-scale reaction datasets have accelerated the development of data-driven models for computer-aided synthesis planning (CASP) in the past decade. Here, we detail the newest version of ASKCOS, an open source software suite for synthesis planning that makes available several research advances in a freely available, practical tool. Four one-step retrosynthesis models form the basis of both interactive planning and automatic planning modes. Retrosynthetic planning is complemented by other modules for feasibility assessment and pathway evaluation, including reaction condition recommendation, reaction outcome prediction, and auxiliary capabilities such as solubility prediction and quantum mechanical descriptor prediction. ASKCOS has assisted hundreds of medicinal, synthetic, and process chemists in their day-to-day tasks, complementing expert decision making. It is our belief that CASP tools like ASKCOS are an important part of modern chemistry research, and that they offer ever-increasing utility and accessibility.

[AI-10] BERT4MIMO: A Foundation Model using BERT Architecture for Massive MIMO Channel State Information Prediction

Link: https://arxiv.org/abs/2501.01802
Authors: Ferhat Ozgur Catak,Murat Kuzlu,Umit Cali
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 10 pages

Abstract:Massive MIMO (Multiple-Input Multiple-Output) is an advanced wireless communication technology, using a large number of antennas to improve the overall performance of the communication system in terms of capacity, spectral, and energy efficiency. The performance of MIMO systems is highly dependent on the quality of channel state information (CSI). Predicting CSI is, therefore, essential for improving communication system performance, particularly in MIMO systems, since it represents key characteristics of a wireless channel, including propagation, fading, scattering, and path loss. This study proposes a foundation model inspired by BERT, called BERT4MIMO, which is specifically designed to process high-dimensional CSI data from massive MIMO systems. BERT4MIMO offers superior performance in reconstructing CSI under varying mobility scenarios and channel conditions through deep learning and attention mechanisms. The experimental results demonstrate the effectiveness of BERT4MIMO in a variety of wireless environments.

[AI-11] Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

Link: https://arxiv.org/abs/2501.01793
Authors: Mohammad Khalil,Farhad Vadiee,Ronas Shakya,Qinyi Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, especially the performance of LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new innovative approaches for learning analytics data generation.

[AI-12] Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms

Link: https://arxiv.org/abs/2501.01785
Authors: Qinyi Liu,Oscar Deho,Farhad Vadiee,Mohammad Khalil,Srecko Joksimovic,George Siemens
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:The increasing use of machine learning in learning analytics (LA) has raised significant concerns around algorithmic fairness and privacy. Synthetic data has emerged as a dual-purpose tool, enhancing privacy and improving fairness in LA models. However, prior research suggests an inverse relationship between fairness and privacy, making it challenging to optimize both. This study investigates which synthetic data generators can best balance privacy and fairness, and whether pre-processing fairness algorithms, typically applied to real datasets, are effective on synthetic data. Our results highlight that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, DECAF suffers in utility, as reflected in its predictive accuracy. Notably, we found that applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.

[AI-13] Combined Hyper-Extensible Extremely-Secured Zero-Trust CIAM-PAM architecture

Link: https://arxiv.org/abs/2501.01732
Authors: Shivom Aggarwal,Shourya Mehra,Safeer Sathar
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Abstract:Customer Identity and Access Management (CIAM) systems play a pivotal role in securing enterprise infrastructures. However, the complexity of implementing these systems requires careful architectural planning to ensure positive Return on Investment (RoI) and avoid costly delays. The proliferation of Active Persistent cyber threats, coupled with advancements in AI, cloud computing, and geographically distributed customer populations, necessitates a paradigm shift towards adaptive and zero-trust security frameworks. This paper introduces the Combined Hyper-Extensible Extremely-Secured Zero-Trust (CHEZ) CIAM-PAM architecture, designed specifically for large-scale enterprises. The CHEZ PL CIAM-PAM framework addresses critical security gaps by integrating federated identity management (private and public identities), password-less authentication, adaptive multi-factor authentication (MFA), microservice-based PEP (Policy Entitlement Point), multi-layer RBAC (Role Based Access Control) and multi-level trust systems. This future-proof design also includes end-to-end data encryption, and seamless integration with state-of-the-art AI-based threat detection systems, while ensuring compliance with stringent regulatory standards.

[AI-14] Proposing Hierarchical Goal-Conditioned Policy Planning in Multi-Goal Reinforcement Learning

Link: https://arxiv.org/abs/2501.01727
Authors: Gavin B. Rens
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, this is a preprint of the peer-reviewed version published by SCITEPRESS for ICAART-2025

Abstract:Humanoid robots must master numerous tasks with sparse rewards, posing a challenge for reinforcement learning (RL). We propose a method combining RL and automated planning to address this. Our approach uses short goal-conditioned policies (GCPs) organized hierarchically, with Monte Carlo Tree Search (MCTS) planning using high-level actions (HLAs). Instead of primitive actions, the planning process generates HLAs. A single plan-tree, maintained during the agent’s lifetime, holds knowledge about goal achievement. This hierarchy enhances sample efficiency and speeds up reasoning by reusing HLAs and anticipating future actions. Our Hierarchical Goal-Conditioned Policy Planning (HGCPP) framework uniquely integrates GCPs, MCTS, and hierarchical RL, potentially improving exploration and planning in complex tasks.

[AI-15] LLMs & Legal Aid: Understanding Legal Needs Exhibited Through User Queries

Link: https://arxiv.org/abs/2501.01711
Authors: Michal Kuk,Jakub Harasta
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at AI for Access to Justice Workshop at Jurix 2024, Brno, Czechia

Abstract:The paper presents a preliminary analysis of an experiment conducted by Frank Bold, a Czech expert group, to explore user interactions with GPT-4 for addressing legal queries. Between May 3, 2023, and July 25, 2023, 1,252 users submitted 3,847 queries. Unlike studies that primarily focus on the accuracy, factuality, or hallucination tendencies of large language models (LLMs), our analysis focuses on the user query dimension of the interaction. Using GPT-4o for zero-shot classification, we categorized queries on (1) whether users provided factual information about their issue (29.95%) or not (70.05%), (2) whether they sought legal information (64.93%) or advice on the course of action (35.07%), and (3) whether they imposed requirements to shape or control the model’s answer (28.57%) or not (71.43%). We provide both quantitative and qualitative insight into user needs and contribute to a better understanding of user engagement with LLMs.

[AI-16] BARTPredict: Empowering IoT Security with LLM-Driven Cyber Threat Prediction

Link: https://arxiv.org/abs/2501.01664
Authors: Alaeddine Diaf,Abdelaziz Amara Korba,Nour Elislem Karabadji,Yacine Ghamri-Doudane
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:The integration of Internet of Things (IoT) technology in various domains has led to operational advancements, but it has also introduced new vulnerabilities to cybersecurity threats, as evidenced by recent widespread cyberattacks on IoT devices. Intrusion detection systems are often reactive, triggered by specific patterns or anomalies observed within the network. To address this challenge, this work proposes a proactive approach to anticipate and preemptively mitigate malicious activities, aiming to prevent potential damage before it occurs. This paper proposes an innovative intrusion prediction framework empowered by Pre-trained Large Language Models (LLMs). The framework incorporates two LLMs: a fine-tuned Bidirectional and AutoRegressive Transformers (BART) model for predicting network traffic and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model for evaluating the predicted traffic. By harnessing the bidirectional capabilities of BART the framework then identifies malicious packets among these predictions. Evaluated using the CICIoT2023 IoT attack dataset, our framework showcases a notable enhancement in predictive performance, attaining an impressive 98% overall accuracy, providing a powerful response to the cybersecurity challenges that confront IoT networks.

[AI-17] AVATAR: Adversarial Autoencoders with Autoregressive Refinement for Time Series Generation SDM2025

Link: https://arxiv.org/abs/2501.01649
Authors: MohammadReza EskandariNasab,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This work has been accepted to the SDM 2025 on December 20, 2024

Abstract:Data augmentation can significantly enhance the performance of machine learning tasks by addressing data scarcity and improving generalization. However, generating time series data presents unique challenges. A model must not only learn a probability distribution that reflects the real data distribution but also capture the conditional distribution at each time step to preserve the inherent temporal dependencies. To address these challenges, we introduce AVATAR, a framework that combines Adversarial Autoencoders (AAE) with Autoregressive Learning to achieve both objectives. Specifically, our technique integrates the autoencoder with a supervisor and introduces a novel supervised loss to assist the decoder in learning the temporal dynamics of time series data. Additionally, we propose another innovative loss function, termed distribution loss, to guide the encoder in more efficiently aligning the aggregated posterior of the autoencoder’s latent representation with a prior Gaussian distribution. Furthermore, our framework employs a joint training mechanism to simultaneously train all networks using a combined loss, thereby fulfilling the dual objectives of time series generation. We evaluate our technique across a variety of time series datasets with diverse characteristics. Our experiments demonstrate significant improvements in both the quality and practical utility of the generated data, as assessed by various qualitative and quantitative metrics.

[AI-18] Artificial Intelligent Implications on Health Data Privacy and Confidentiality

Link: https://arxiv.org/abs/2501.01639
Authors: Ahmad Momani
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:The rapid integration of artificial intelligence (AI) in healthcare is revolutionizing medical diagnostics, personalized medicine, and operational efficiency. However, alongside these advancements, significant challenges arise concerning patient data privacy, ethical considerations, and regulatory compliance. This paper examines the dual impact of AI on healthcare, highlighting its transformative potential and the critical need for safeguarding sensitive health information. It explores the role of the Health Insurance Portability and Accountability Act (HIPAA) as a regulatory framework for ensuring data privacy and security, emphasizing the importance of robust safeguards and ethical standards in AI-driven healthcare. Through case studies, including AI applications in diabetic retinopathy, oncology, and the controversies surrounding data sharing, this study underscores the ethical and legal complexities of AI implementation. A balanced approach that fosters innovation while maintaining patient trust and privacy is imperative. The findings emphasize the importance of continuous education, transparency, and adherence to regulatory frameworks to harness AI’s full potential responsibly and ethically in healthcare.

[AI-19] Prism: Mining Task-aware Domains in Non-i.i.d. IMU Data for Flexible User Perception

Link: https://arxiv.org/abs/2501.01598
Authors: Yunzhe Li,Facheng Hu,Hongzi Zhu,Quan Liu,Xiaoke Zhao,Jiangang Shen,Shan Chang,Minyi Guo
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: in Proceedings of IEEE INFOCOM 2025, London, United Kingdom

Abstract:A wide range of user perception applications leverage inertial measurement unit (IMU) data for online prediction. However, restricted by the non-i.i.d. nature of IMU data collected from mobile devices, most systems work well only in a controlled setting (e.g., for a specific user in particular postures), limiting application scenarios. To achieve uncontrolled online prediction on mobile devices, referred to as the flexible user perception (FUP) problem, is attractive but hard. In this paper, we propose a novel scheme, called Prism, which can obtain high FUP accuracy on mobile devices. The core of Prism is to discover task-aware domains embedded in IMU dataset, and to train a domain-aware model on each identified domain. To this end, we design an expectation-maximization (EM) algorithm to estimate latent domains with respect to the specific downstream perception task. Finally, the best-fit model can be automatically selected for use by comparing the test sample and all identified domains in the feature space. We implement Prism on various mobile devices and conduct extensive experiments. Results demonstrate that Prism can achieve the best FUP performance with a low latency.

[AI-20] BLAST: A Stealthy Backdoor Leverage Attack against Cooperative Multi-Agent Deep Reinforcement Learning based Systems

Link: https://arxiv.org/abs/2501.01593
Authors: Yinbo Yu,Saihao Yan,Xueyu Yin,Jing Fang,Jiajia Liu
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 12. arXiv admin note: substantial text overlap with arXiv:2409.07775

Abstract:Recent studies have shown that cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, it will perform malicious actions leading to failures or malicious goals. However, existing backdoor attacks suffer from several issues, e.g., instant trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor leverage attack against c-MADRL, BLAST, which attacks the entire multi-agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manually injected fixed visual patterns or instant status and control the period to perform malicious actions. This method can guarantee the stealthiness and practicality of BLAST. Secondly, we hack the original reward function of the backdoor agent via unilateral guidance to inject BLAST, so as to achieve the leverage attack effect that can pry open the entire multi-agent system via a single backdoor agent. We evaluate our BLAST against 3 classic c-MADRL algorithms (VDN, QMIX, and MAPPO) in 2 popular c-MADRL environments (SMAC and Pursuit), and 2 existing defense mechanisms. The experimental results demonstrate that BLAST can achieve a high attack success rate while maintaining a low clean performance variance rate.

[AI-21] BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

Link: https://arxiv.org/abs/2501.01540
Authors: Kanishk Gandhi,Michael Y. Li,Lyle Goodyear,Louise Li,Aditi Bhaskar,Mohammed Zaman,Noah D. Goodman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: KG and MYL contributed equally

Abstract:Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLM’s ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g. collecting data to test a scientific theory) and model discovery (e.g. proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent’s ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain their model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery. We find that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.
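
The expected information gain mentioned in the abstract can be made concrete for a toy model. The sketch below is an illustration only and does not reproduce the benchmark's estimator, which the abstract does not describe. It assumes a discrete parameter with a known prior and a discrete outcome, so EIG can be computed exactly as the prior entropy minus the expected posterior entropy. The names `expected_information_gain`, `prior`, and `likelihood` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """EIG of one experiment for a discrete parameter.

    prior[i]         = p(theta_i)
    likelihood[i][j] = p(y_j | theta_i) under the chosen experiment
    """
    n_params = len(prior)
    n_outcomes = len(likelihood[0])
    eig = entropy(prior)
    for j in range(n_outcomes):
        # Marginal probability of observing outcome j.
        p_y = sum(prior[i] * likelihood[i][j] for i in range(n_params))
        if p_y == 0:
            continue
        # Bayes' rule gives the posterior over the parameter after seeing y_j.
        posterior = [prior[i] * likelihood[i][j] / p_y for i in range(n_params)]
        eig -= p_y * entropy(posterior)
    return eig


prior = [0.5, 0.5]
informative = [[1.0, 0.0], [0.0, 1.0]]    # outcome reveals the parameter
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # outcome is independent of it

eig_informative = expected_information_gain(prior, informative)
eig_uninformative = expected_information_gain(prior, uninformative)
print(eig_informative, eig_uninformative)
```

An experiment whose outcome fully determines the parameter removes the whole bit of prior uncertainty, while one whose outcome is independent of the parameter removes none.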

[AI-22] In Search of a Lost Metric: Human Empowerment as a Pillar of Socially Conscious Navigation

Link: https://arxiv.org/abs/2501.01539
Authors: Vasanth Reddy Baddam,Behdad Chalaki,Vaishnav Tadiparthi,Hossein Nourkhiz Mahjoub,Ehsan Moradi-Pari,Hoda Eldardiry,Almuatazbellah Boker
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 8 figures, 2 tables, Accepted to 20th edition of the IEEE/ACM International Conference on Human-Robot Interaction (HRI)

Abstract:In social robot navigation, traditional metrics like proxemics and behavior naturalness emphasize human comfort and adherence to social norms but often fail to capture an agent’s autonomy and adaptability in dynamic environments. This paper introduces human empowerment, an information-theoretic concept that measures a human’s ability to influence their future states and observe those changes, as a complementary metric for evaluating social compliance. This metric reveals how robot navigation policies can indirectly impact human empowerment. We present a framework that integrates human empowerment into the evaluation of social performance in navigation tasks. Through numerical simulations, we demonstrate that human empowerment as a metric not only aligns with intuitive social behavior, but also shows statistically significant differences across various robot navigation policies. These results provide a deeper understanding of how different policies affect social compliance, highlighting the potential of human empowerment as a complementary metric for future research in social navigation.

[AI-23] DiagrammaticLearning: A Graphical Language for Compositional Training Regimes

Link: https://arxiv.org/abs/2501.01515
Authors: Mason Lary,Richard Samuelson,Alexander Wilentz,Alina Zare,Matthew Klawonn,James P. Fairbanks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Category Theory (math.CT)

Abstract:Motivated by deep learning regimes with multiple interacting yet distinct model components, we introduce learning diagrams, graphical depictions of training setups that capture parameterized learning as data rather than code. A learning diagram compiles to a unique loss function on which component models are trained. The result of training on this loss is a collection of models whose predictions "agree" with one another. We show that a number of popular learning setups such as few-shot multi-task learning, knowledge distillation, and multi-modal learning can be depicted as learning diagrams. We further implement learning diagrams in a library that allows users to build diagrams of PyTorch and this http URL models. By implementing some classic machine learning use cases, we demonstrate how learning diagrams allow practitioners to build complicated models as compositions of smaller components, identify relationships between workflows, and manipulate models during or after training. Leveraging a category theoretic framework, we introduce a rigorous semantics for learning diagrams that puts such operations on a firm mathematical foundation.

[AI-24] AI-Enabled Operations at Fermi Complex: Multivariate Time Series Prediction for Outage Prediction and Diagnosis AAAI

Link: https://arxiv.org/abs/2501.01509
Authors: Milan Jain,Burcu O. Mutlu,Caleb Stam,Jan Strube,Brian A. Schupbach,Jason M. St. John,William A. Pellico
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Signal Processing (eess.SP)
Comments: Presented in the AAAI Workshop on AI for Time Series Analysis 2025

Abstract:The Main Control Room of the Fermilab accelerator complex continuously gathers extensive time-series data from thousands of sensors monitoring the beam. However, unplanned events such as trips or voltage fluctuations often result in beam outages, causing operational downtime. This downtime not only consumes operator effort in diagnosing and addressing the issue but also leads to unnecessary energy consumption by idle machines awaiting beam restoration. The current threshold-based alarm system is reactive and faces challenges including frequent false alarms and inconsistent outage-cause labeling. To address these limitations, we propose an AI-enabled framework that leverages predictive analytics and automated labeling. Using data from 2,703 Linac devices and 80 operator-labeled outages, we evaluate state-of-the-art deep learning architectures, including recurrent, attention-based, and linear models, for beam outage prediction. Additionally, we assess a Random Forest-based labeling system for providing consistent, confidence-scored outage annotations. Our findings highlight the strengths and weaknesses of these architectures for beam outage prediction and identify critical gaps that must be addressed to fully harness AI for transitioning downtime handling from reactive to predictive, ultimately reducing downtime and improving decision-making in accelerator management.

[AI-25] Drift2Matrix: Kernel-Induced Self Representation for Concept Drift Adaptation in Co-evolving Time Series

Link: https://arxiv.org/abs/2501.01480
Authors: Kunpeng Xu,Lifei Chen,Shengrui Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In the realm of time series analysis, tackling the phenomenon of concept drift poses a significant challenge. Concept drift, characterized by the evolving statistical properties of time series data, affects the reliability and accuracy of conventional analysis models. This is particularly evident in co-evolving scenarios where interactions among variables are crucial. This paper presents Drift2Matrix, a novel framework that leverages kernel-induced self-representation for adaptive responses to concept drift in time series. Drift2Matrix employs a kernel-based learning mechanism to generate a representation matrix, encapsulating the inherent dynamics of co-evolving time series. This matrix serves as a key tool for identification and adaptation to concept drift by observing its temporal variations. Furthermore, Drift2Matrix effectively identifies prevailing patterns and offers insights into emerging trends through pattern evolution analysis. Our empirical evaluation of Drift2Matrix across various datasets demonstrates its effectiveness in handling the complexities of concept drift. This approach introduces a novel perspective in the theoretical domain of co-evolving time series analysis, enhancing adaptability and accuracy in the face of dynamic data environments.

[AI-26] Unraveling Indirect In-Context Learning Using Influence Functions

Link: https://arxiv.org/abs/2501.01473
Authors: Hadi Askari,Shivanshu Gupta,Terry Tong,Fei Wang,Anshuman Chhabra,Muhao Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:This work introduces a novel paradigm for generalized In-Context Learning (ICL), termed Indirect In-Context Learning. In Indirect ICL, we explore demonstration selection strategies tailored for two distinct real-world scenarios: Mixture of Tasks and Noisy Demonstrations. We systematically evaluate the effectiveness of Influence Functions (IFs) as a selection tool for these settings, highlighting the potential for IFs to better capture the informativeness of examples within the demonstration pool. For the Mixture of Tasks setting, demonstrations are drawn from 28 diverse tasks, including MMLU, BigBench, StrategyQA, and CommonsenseQA. We demonstrate that combining BertScore-Recall (BSR) with an IF surrogate model can significantly improve performance, leading to average absolute accuracy gains of 0.37% and 1.45% for 3-shot and 5-shot setups when compared to traditional ICL metrics. In the Noisy Demonstrations setting, we examine scenarios where demonstrations might be mislabeled. Our experiments show that reweighting traditional ICL selectors (BSR and Cosine Similarity) with IF-based selectors boosts accuracy by an average of 2.90% for Cosine Similarity and 2.94% for BSR on noisy GLUE benchmarks. In sum, we propose a robust framework for demonstration selection that generalizes beyond traditional ICL, offering valuable insights into the role of IFs for Indirect ICL.

[AI-27] Augmented Contrastive Clustering with Uncertainty-Aware Prototyping for Time Series Test Time Adaptation

Link: https://arxiv.org/abs/2501.01472
Authors: Peiliang Gong,Mohamed Ragab,Min Wu,Zhenghua Chen,Yongyi Su,Xiaoli Li,Daoqiang Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Test-time adaptation aims to adapt pre-trained deep neural networks using solely online unlabelled test data during inference. Although TTA has shown promise in visual applications, its potential in time series contexts remains largely unexplored. Existing TTA methods, originally designed for visual tasks, may not effectively handle the complex temporal dynamics of real-world time series data, resulting in suboptimal adaptation performance. To address this gap, we propose Augmented Contrastive Clustering with Uncertainty-aware Prototyping (ACCUP), a straightforward yet effective TTA method for time series data. Initially, our approach employs augmentation ensemble on the time series data to capture diverse temporal information and variations, incorporating uncertainty-aware prototypes to distill essential characteristics. Additionally, we introduce an entropy comparison scheme to selectively acquire more confident predictions, enhancing the reliability of pseudo labels. Furthermore, we utilize augmented contrastive clustering to enhance feature discriminability and mitigate error accumulation from noisy pseudo labels, promoting cohesive clustering within the same class while facilitating clear separation between different classes. Extensive experiments conducted on three real-world time series datasets and an additional visual dataset demonstrate the effectiveness and generalization potential of the proposed method, advancing the underexplored realm of TTA for time series data.
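
The abstract does not spell out the entropy comparison scheme, so the following is only one plausible reading, sketched under stated assumptions: two candidate predictions for the same test sample (here hypothetically named `model_pred` and `prototype_pred`) are compared by entropy, and the lower-entropy one supplies the pseudo label. The function `pick_confident` is illustrative and not the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_confident(pred_a, pred_b):
    """Return whichever probability vector has the lower entropy."""
    return pred_a if entropy(pred_a) <= entropy(pred_b) else pred_b


# Hypothetical predictions for one unlabelled test sample.
model_pred = [0.40, 0.35, 0.25]      # diffuse, high entropy
prototype_pred = [0.90, 0.05, 0.05]  # peaked, low entropy

chosen = pick_confident(model_pred, prototype_pred)
# The pseudo label is the index of the largest probability.
pseudo_label = max(range(len(chosen)), key=chosen.__getitem__)
print(chosen, pseudo_label)
```

Lower entropy means the probability mass is concentrated on fewer classes, which is why it is commonly used as a proxy for prediction confidence.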

[AI-28] Balance-aware Sequence Sampling Makes Multi-modal Learning Better

Link: https://arxiv.org/abs/2501.01470
Authors: Zhi-Hao Guan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:To address the modality imbalance caused by data heterogeneity, existing multi-modal learning (MML) approaches primarily focus on balancing this difference from the perspective of optimization objectives. However, almost all existing methods ignore the impact of sample sequences, i.e., an inappropriate training order tends to trigger learning bias in the model, further exacerbating modality imbalance. In this paper, we propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML. Specifically, we first define a multi-perspective measurer to evaluate the balance degree of each sample. Via the evaluation, we employ a heuristic scheduler based on curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples to rebalance MML. Moreover, considering that sample balance may evolve as the model capability increases, we propose a learning-based probabilistic sampling method to dynamically update the training sequences at the epoch level, further improving MML performance. Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) MML approaches.
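
The heuristic scheduler described above, which releases training subsets from balanced to imbalanced samples, can be sketched as follows. This is an assumption-laden illustration rather than the paper's algorithm: the per-sample balance scores are taken as given (the abstract's multi-perspective measurer is not specified), a higher score is assumed to mean a more balanced sample, and `curriculum_subsets` is a hypothetical name.

```python
import math


def curriculum_subsets(samples, balance_scores, num_stages):
    """Return one training subset per stage, growing from balanced to imbalanced.

    Assumes a higher score means a more balanced sample.
    """
    # Rank sample indices from most to least balanced.
    order = sorted(range(len(samples)),
                   key=lambda i: balance_scores[i], reverse=True)
    subsets = []
    for stage in range(1, num_stages + 1):
        # Each stage exposes a larger prefix of the ranked samples.
        cutoff = math.ceil(len(samples) * stage / num_stages)
        subsets.append([samples[i] for i in order[:cutoff]])
    return subsets


samples = ["a", "b", "c", "d"]
scores = [0.2, 0.9, 0.5, 0.7]  # hypothetical balance scores
stages = curriculum_subsets(samples, scores, num_stages=2)
print(stages)
```

The learning-based probabilistic sampling that the abstract mentions would replace this fixed ranking with scores updated at each epoch; that part is not sketched here.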

[AI-29] Goal Recognition using Actor-Critic Optimization

Link: https://arxiv.org/abs/2501.01463
Authors: Ben Nageris,Felipe Meneguzzi,Reuth Mirsky
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Goal Recognition aims to infer an agent’s goal from a sequence of observations. Existing approaches often rely on manually engineered domains and discrete representations. Deep Recognition using Actor-Critic Optimization (DRACO) is a novel approach based on deep reinforcement learning that overcomes these limitations by providing two key contributions. First, it is the first goal recognition algorithm that learns a set of policy networks from unstructured data and uses them for inference. Second, DRACO introduces new metrics for assessing goal hypotheses through continuous policy representations. DRACO achieves state-of-the-art performance for goal recognition in discrete settings while not using the structured inputs used by existing approaches. Moreover, it outperforms these approaches in more challenging, continuous settings at substantially reduced costs in both computing and memory. Together, these results showcase the robustness of the new algorithm, bridging traditional goal recognition and deep reinforcement learning.

[AI-30] Pan-infection Foundation Framework Enables Multiple Pathogen Prediction

Link: https://arxiv.org/abs/2501.01462
Authors: Lingrui Zhang,Haonan Wu,Nana Jin,Chenqing Zheng,Jize Xie,Qitai Cai,Jun Wang,Qin Cao,Xubin Zheng,Jiankun Wang,Lixin Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: 15 pages, 8 figures

Abstract:Host-response-based diagnostics can improve the accuracy of diagnosing bacterial and viral infections, thereby reducing inappropriate antibiotic prescriptions. However, the existing cohorts, with limited sample sizes and coarse infection types, are unable to support the exploration of an accurate and generalizable diagnostic model. Here, we curate the largest infection host-response transcriptome data, including 11,247 samples across 89 blood transcriptome datasets from 13 countries and 21 platforms. We build a diagnostic model for pathogen prediction starting from a pan-infection model as foundation (AUC = 0.97) based on the pan-infection dataset. Then, we utilize knowledge distillation to efficiently transfer the insights from this “teacher” model to four lightweight pathogen “student” models, i.e., staphylococcal infection (AUC = 0.99), streptococcal infection (AUC = 0.94), HIV infection (AUC = 0.93), and RSV infection (AUC = 0.94), as well as a sepsis “student” model (AUC = 0.99). The proposed knowledge distillation framework not only facilitates the diagnosis of pathogens using pan-infection data, but also enables an across-disease study from pan-infection to sepsis. Moreover, the framework enables high-degree lightweight design of diagnostic models, which is expected to be adaptively deployed in clinical settings.

[AI-31] GAN-TAT: A Novel Framework Using Protein Interaction Networks in Druggable Gene Identification

Link: https://arxiv.org/abs/2501.01458
Authors: George Yuanji Wang,Srisharan Murugesan,Aditya Prince Rohatgi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 4 pages, 2 figures

Abstract:Identifying druggable genes is essential for developing effective pharmaceuticals. With the availability of extensive, high-quality data, computational methods have become a significant asset. Protein Interaction Network (PIN) is valuable but challenging to implement due to its high dimensionality and sparsity. Previous methods relied on indirect integration, leading to resolution loss. This study proposes GAN-TAT, a framework utilizing an advanced graph embedding technology, ImGAGN, to directly integrate PIN for druggable gene inference work. Tested on three Pharos datasets, GAN-TAT achieved the highest AUC-ROC score of 0.951 on Tclin. Further evaluation shows that GAN-TAT’s predictions are supported by clinical evidence, highlighting its potential practical applications in pharmacogenomics. This research represents a methodological attempt with the direct utilization of PIN, expanding potential new solutions for developing drug targets. The source code of GAN-TAT is available at (this https URL).

[AI-32] Human-AI Teaming Using Large Language Models: Boosting Brain-Computer Interfacing (BCI) and Brain Research

Link: https://arxiv.org/abs/2501.01451
Authors: Maryna Kapitonova,Tonio Ball
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures

Abstract:Recently, there has been increasing interest in using artificial intelligence (AI) to automate aspects of the research process, or even to autonomously conduct the full research cycle from idea generation, through data analysis, to composing and evaluating scientific manuscripts. Examples of working AI scientist systems have been demonstrated for computer science tasks and running molecular biology labs. While some approaches aim for full autonomy of the scientific AI, others rather aim for leveraging human-AI teaming. Here, we address how to adapt such approaches for boosting Brain-Computer Interface (BCI) development, as well as brain research and neuroscience at large. We argue that at this time, a strong emphasis on human-AI teaming, in contrast to a fully autonomous AI BCI researcher, will be the most promising way forward. We introduce the collaborative workspaces concept for human-AI teaming based on a set of Janusian design principles, looking both ways, to the human as well as to the AI side. Based on these principles, we present ChatBCI, a Python-based toolbox for enabling human-AI collaboration based on interaction with Large Language Models (LLMs), designed for BCI research and development projects. We show how ChatBCI was successfully used in a concrete BCI project on advancing motor imagery decoding from EEG signals. Our approach can be straightforwardly extended to broad neurotechnological and neuroscientific topics, and may by design facilitate human expert knowledge transfer to scientific AI systems in general.

[AI-33] Explanatory Debiasing: Involving Domain Experts in the Data Generation Process to Mitigate Representation Bias in AI Systems

Link: https://arxiv.org/abs/2501.01441
Authors: Aditya Bhattacharya,Simone Stumpf,Robin De Croon,Katrien Verbert
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Pre-print version, please cite the main article instead of the pre-print version

Abstract:Representation bias is one of the most common types of biases in artificial intelligence (AI) systems, causing AI models to perform poorly on underrepresented data segments. Although AI practitioners use various methods to reduce representation bias, their effectiveness is often constrained by insufficient domain knowledge in the debiasing process. To address this gap, this paper introduces a set of generic design guidelines for effectively involving domain experts in representation debiasing. We instantiated our proposed guidelines in a healthcare-focused application and evaluated them through a comprehensive mixed-methods user study with 35 healthcare experts. Our findings show that involving domain experts can reduce representation bias without compromising model accuracy. Based on our findings, we also offer recommendations for developers to build robust debiasing systems guided by our generic design guidelines, ensuring more effective inclusion of domain experts in the debiasing process.

[AI-34] Probabilistic Mission Design in Neuro-Symbolic Systems

Link: https://arxiv.org/abs/2501.01439
Authors: Simon Kohaut,Benedict Flade,Daniel Ochs,Devendra Singh Dhami,Julian Eggert,Kristian Kersting
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: arXiv admin note: text overlap with arXiv:2406.03454

Abstract:Advanced Air Mobility (AAM) is a growing field that demands accurate modeling of legal concepts and restrictions in navigating intelligent vehicles. In addition, any implementation of AAM needs to face the challenges posed by inherently dynamic and uncertain human-inhabited spaces robustly. Nevertheless, the employment of Unmanned Aircraft Systems (UAS) beyond visual line of sight (BVLOS) is an endearing task that promises to enhance significantly today’s logistics and emergency response capabilities. To tackle these challenges, we present a probabilistic and neuro-symbolic architecture to encode legal frameworks and expert knowledge over uncertain spatial relations and noisy perception in an interpretable and adaptable fashion. More specifically, we demonstrate Probabilistic Mission Design (ProMis), a system architecture that links geospatial and sensory data with declarative, Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent’s state space and its legality. As a result, ProMis generates Probabilistic Mission Landscapes (PML), which quantify the agent’s belief that a set of mission conditions is satisfied across its navigation space. Extending prior work on ProMis’ reasoning capabilities and computational characteristics, we show its integration with potent machine learning models such as Large Language Models (LLM) and Transformer-based vision models. Hence, our experiments underpin the application of ProMis with multi-modal input data and how our method applies to many important AAM scenarios.

[AI-35] Fundamental Risks in the Current Deployment of General-Purpose AI Models: What Have We (Not) Learnt From Cybersecurity?

Link: https://arxiv.org/abs/2501.01435
Authors: Mario Fritz
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:General-purpose AI, such as Large Language Models (LLMs), has seen rapid deployment in a wide range of use cases. Most surprisingly, these models have made their way from plain language models, to chat-bots, all the way to an almost “operating system”-like status that can control decisions and logic of an application. Tool-use, Microsoft co-pilot/office integration, and OpenAI’s Altera are just a few examples of increased autonomy, data access, and execution capabilities. These methods come with a range of cybersecurity challenges. We highlight some of the work we have done in terms of evaluation as well as outline future opportunities and challenges.

[AI-36] Mathematical Definition and Systematization of Puzzle Rules

Link: https://arxiv.org/abs/2501.01433
Authors: Itsuki Maeda,Yasuhiro Inoue
Subjects: Artificial Intelligence (cs.AI); History and Overview (math.HO)
Comments: 15 pages

Abstract:While logic puzzles have engaged individuals through problem-solving and critical thinking, the creation of new puzzle rules has largely relied on ad-hoc processes. Pencil puzzles, such as Slitherlink and Sudoku, represent a prominent subset of these games, celebrated for their intellectual challenges rooted in combinatorial logic and spatial reasoning. Despite extensive research into solving techniques and automated problem generation, a unified framework for systematic and scalable rule design has been lacking. Here, we introduce a mathematical framework for defining and systematizing pencil puzzle rules. This framework formalizes grid elements, their positional relationships, and iterative composition operations, allowing for the incremental construction of structures that form the basis of puzzle rules. Furthermore, we establish a formal method to describe constraints and domains for each structure, ensuring solvability and coherence. Applying this framework, we successfully formalized the rules of well-known Nikoli puzzles, including Slitherlink and Sudoku, demonstrating the formal representation of a significant portion (approximately one-fourth) of existing puzzles. These results validate the potential of the framework to systematize and innovate puzzle rule design, establishing a pathway to automated rule generation. By providing a mathematical foundation for puzzle rule creation, this framework opens avenues for computers, potentially enhanced by AI, to design novel puzzle rules tailored to player preferences, expanding the scope of puzzle diversity. Beyond its direct application to pencil puzzles, this work illustrates how mathematical frameworks can bridge recreational mathematics and algorithmic design, offering tools for broader exploration in logic-based systems, with potential applications in educational game design, personalized learning, and computational creativity.

[AI-37] Survey on safe robot control via learning

Link: https://arxiv.org/abs/2501.01432
Authors: Bassel El Mabsout
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Control systems are critical to modern technological infrastructure, spanning industries from aerospace to healthcare. This survey explores the landscape of safe robot learning, investigating methods that balance high-performance control with rigorous safety constraints. By examining classical control techniques, learning-based approaches, and embedded system design, the research seeks to understand how robotic systems can be developed to prevent hazardous states while maintaining optimal performance across complex operational environments.

[AI-38] Quantifying A Firm's AI Engagement: Constructing Objective Data-Driven AI Stock Indices Using 10-K Filings

Link: https://arxiv.org/abs/2501.01763
Authors: Lennart Ante,Aman Saggu
Subjects: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Portfolio Management (q-fin.PM); Risk Management (q-fin.RM)
Comments: 43 pages, 5 tables, 3 figures, 1 appendix figure

Abstract:Following an analysis of existing AI-related exchange-traded funds (ETFs), we reveal that the selection criteria for determining which stocks qualify as AI-related are often opaque and rely on vague phrases and subjective judgments. This paper proposes a new, objective, data-driven approach using natural language processing (NLP) techniques to classify AI stocks by analyzing annual 10-K filings from 3,395 NASDAQ-listed firms between 2011 and 2023. This analysis quantifies each company’s engagement with AI through binary indicators and weighted AI scores based on the frequency and context of AI-related terms. Using these metrics, we construct four AI stock indices, namely the Equally Weighted AI Index (AII), the Size-Weighted AI Index (SAII), and two Time-Discounted AI Indices (TAII05 and TAII5X), offering different perspectives on AI investment. We validate our methodology through an event study on the launch of OpenAI’s ChatGPT, demonstrating that companies with higher AI engagement saw significantly greater positive abnormal returns, with analyses supporting the predictive power of our AI measures. Our indices perform on par with or surpass 14 existing AI-themed ETFs and the Nasdaq Composite Index in risk-return profiles, market responsiveness, and overall performance, achieving higher average daily returns and risk-adjusted metrics without increased volatility. These results suggest our NLP-based approach offers a reliable, market-responsive, and cost-effective alternative to existing AI-related ETF products. Our innovative methodology can also guide investors, asset managers, and policymakers in using corporate data to construct other thematic portfolios, contributing to a more transparent, data-driven, and competitive approach.

[AI-39] Constructing and explaining machine learning models for chemistry: example of the exploration and design of boron-based Lewis acids

Link: https://arxiv.org/abs/2501.01576
Authors: Juliette Fenogli,Laurence Grimaud,Rodolphe Vuilleumier (CPCV, Département de chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, Paris, France)
Subjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: Main text is 12 pages, 5 figures, 3 extended-data figures. Supplementary information is 25 pages. For associated code and datasets, see this https URL

Abstract:The integration of machine learning (ML) into chemistry offers transformative potential in the design of molecules. However, the focus has often been on creating highly efficient predictive models, sometimes at the expense of interpretability. We leverage explainable AI techniques to explore the design of boron-based Lewis acids, which play a pivotal role in organic reactions. Using Fluoride Ion Affinity as a proxy for Lewis acidity, we developed interpretable ML models based on chemically meaningful descriptors, including ab initio features and substituent-based parameters. By constraining the chemical space to well-defined molecular scaffolds, we achieved highly accurate predictions, surpassing conventional black-box deep learning models in the low-data regime. Interpretability analyses of the models unraveled the origin of Lewis acidity in these compounds and identified actionable levers to modulate it. This work bridges ML and the chemist’s way of thinking, demonstrating how explainable models can inspire molecular design and enhance scientific understanding of chemical reactivity.

[AI-40] Transfer Learning Analysis of Variational Quantum Circuits ICASSP2025

Link: https://arxiv.org/abs/2501.01507
Authors: Huan-Hsin Tseng,Hsin-Yi Lin,Samuel Yen-Chi Chen,Shinjae Yoo
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to ICASSP 2025

Abstract:This work analyzes transfer learning of the Variational Quantum Circuit (VQC). Our framework begins with a pretrained VQC configured in one domain and calculates the transition of 1-parameter unitary subgroups required for a new domain. A formalism is established to investigate the adaptability and capability of a VQC under the analysis of loss bounds. Our theory observes knowledge transfer in VQCs and provides a heuristic interpretation for the mechanism. An analytical fine-tuning method is derived to attain the optimal transition for adaptations of similar domains.

[AI-41] ORACLE: A Real-Time Hierarchical Deep-Learning Photometric Classifier for the LSST

Link: https://arxiv.org/abs/2501.01496
Authors: Ved G. Shah,Alex Gagliano,Konstantin Malanchev,Gautham Narayan,The LSST Dark Energy Science Collaboration
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 19 figures, 9 tables. Submitted to ApJ

Abstract:We present ORACLE, the first hierarchical deep-learning model for real-time, context-aware classification of transient and variable astrophysical phenomena. ORACLE is a recurrent neural network with Gated Recurrent Units (GRUs), and has been trained using a custom hierarchical cross-entropy loss function to provide high-confidence classifications along an observationally-driven taxonomy with as little as a single photometric observation. Contextual information for each object, including host galaxy photometric redshift, offset, ellipticity and brightness, is concatenated to the light curve embedding and used to make a final prediction. Training on ~0.5M events from the Extended LSST Astronomical Time-Series Classification Challenge, we achieve a top-level (Transient vs Variable) macro-averaged precision of 0.96 using only 1 day of photometric observations after the first detection in addition to contextual information, for each event; this increases to 0.99 once 64 days of the light curve has been obtained, and 0.83 at 1024 days after first detection for 19-way classification (including supernova sub-types, active galactic nuclei, variable stars, microlensing events, and kilonovae). We also compare ORACLE with other state-of-the-art classifiers and report comparable performance for the 19-way classification task, in addition to delivering accurate top-level classifications much earlier. The code and model weights used in this work are publicly available at our associated GitHub repository (this https URL).

[AI-42] A Survey of Deep Learning Methods in Protein Bioinformatics and its Impact on Protein Design

Link: https://arxiv.org/abs/2501.01477
Authors: Weihang Dai
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: PhD Qualifying Exam (2021)

Abstract:Proteins are sequences of amino acids that serve as the basic building blocks of living organisms. Despite rapidly growing databases documenting structural and functional information for various protein sequences, our understanding of proteins remains limited because of the large possible sequence space and the complex inter- and intra-molecular forces. Deep learning, which is characterized by its ability to learn relevant features directly from large datasets, has demonstrated remarkable performance in fields such as computer vision and natural language processing. It has also been increasingly applied in recent years to the data-rich domain of protein sequences with great success, most notably with AlphaFold2’s breakout performance in protein structure prediction. The performance improvements achieved by deep learning unlock new possibilities in the field of protein bioinformatics, including protein design, one of the most difficult but useful tasks. In this paper, we broadly categorize problems in protein bioinformatics into three main categories: 1) structural prediction, 2) functional prediction, and 3) protein design, and review the progress achieved from using deep learning methodologies in each of them. We expand on the main challenges of the protein design problem and highlight how advances in structural and functional prediction have directly contributed to design tasks. Finally, we conclude by identifying important topics and future research directions.

[AI-43] A Fourfold Pathogen Reference Ontology Suite

Link: https://arxiv.org/abs/2501.01454
Authors: Shane Babcock,Carter Benson,Giacomo De Colle,Sydney Cohen,Alexander D. Diehl,Ram A.N.R. Challa,Anthony Huffman,Yongqun He,John Beverley
Subjects: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 25 pages

Abstract:Infectious diseases remain a critical global health challenge, and the integration of standardized ontologies plays a vital role in managing related data. The Infectious Disease Ontology (IDO) and its extensions, such as the Coronavirus Infectious Disease Ontology (CIDO), are essential for organizing and disseminating information related to infectious diseases. The COVID-19 pandemic highlighted the need for updating IDO and its virus-specific extensions. There is an additional need to update IDO extensions specific to bacteria, fungus, and parasite infectious diseases. We adopt the “hub and spoke” methodology to generate pathogen-specific extensions of IDO: Virus Infectious Disease Ontology (VIDO), Bacteria Infectious Disease Ontology (BIDO), Mycosis Infectious Disease Ontology (MIDO), and Parasite Infectious Disease Ontology (PIDO). The creation of pathogen-specific reference ontologies advances modularization and reusability of infectious disease data within the IDO ecosystem. Future work will focus on further refining these ontologies, creating new extensions, and developing application ontologies based on them, in line with ongoing efforts to standardize biological and biomedical terminologies for improved data sharing and analysis.

Machine Learning

[LG-0] Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer

Link: https://arxiv.org/abs/2501.01936
Authors: Vishal Sunder,Eric Fosler-Lussier
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures

Abstract:In this paper, we propose to improve end-to-end (E2E) spoken language understanding (SLU) in an RNN transducer model (RNN-T) by incorporating a joint self-conditioned CTC automatic speech recognition (ASR) objective. Our proposed model is akin to an E2E differentiable cascaded model which performs ASR and SLU sequentially, and we ensure that the SLU task is conditioned on the ASR task by having CTC self-conditioning. This novel joint modeling of ASR and SLU improves SLU performance significantly over just using SLU optimization. We further improve the performance by aligning the acoustic embeddings of this model with the semantically richer BERT model. Our proposed knowledge transfer strategy makes use of a bag-of-entity prediction layer on the aligned embeddings, and the output of this layer is used to condition the RNN-T based SLU decoding. These techniques show significant improvement over several strong baselines and can perform on par with large models like Whisper with significantly fewer parameters.

[LG-1] Fusion DeepONet: A Data-Efficient Neural Operator for Geometry-Dependent Hypersonic Flows on Arbitrary Grids

Link: https://arxiv.org/abs/2501.01934
Authors: Ahmad Peyvan,Varun Kumar
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:Designing re-entry vehicles requires accurate predictions of hypersonic flow around their geometry. Rapid prediction of such flows can revolutionize vehicle design, particularly for morphing geometries. We evaluate advanced neural operator models such as Deep Operator Networks (DeepONet), parameter-conditioned U-Net, Fourier Neural Operator (FNO), and MeshGraphNet, with the objective of addressing the challenge of learning geometry-dependent hypersonic flow fields with limited data. Specifically, we compare the performance of these models for two grid types: uniform Cartesian and irregular grids. To train these models, we use 36 unique elliptic geometries for generating high-fidelity simulations with a high-order entropy-stable DGSEM solver, emphasizing the challenge of working with a scarce dataset. We evaluate and compare the four operator-based models for their efficacy in predicting hypersonic flow field around the elliptic body. Moreover, we develop a novel framework, called Fusion DeepONet, which leverages neural field concepts and generalizes effectively across varying geometries. Despite the scarcity of training data, Fusion DeepONet achieves performance comparable to parameter-conditioned U-Net on uniform grids while it outperforms MeshGraphNet and vanilla DeepONet on irregular, arbitrary grids. Fusion DeepONet requires significantly fewer trainable parameters as compared to U-Net, MeshGraphNet, and FNO, making it computationally efficient. We also analyze the basis functions of the Fusion DeepONet model using Singular Value Decomposition. This analysis reveals that Fusion DeepONet generalizes effectively to unseen solutions and adapts to varying geometries and grid points, demonstrating its robustness in scenarios with limited training data.

[LG-2] GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction AAAI-25

Link: https://arxiv.org/abs/2501.01930
Authors: Yuwei Miao,Yuzhi Guo,Hehuan Ma,Jingquan Yan,Feng Jiang,Rui Liao,Junzhou Huang
Subjects: Machine Learning (cs.LG)
Comments: Accepted by AAAI-25

Abstract:Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D-structure, or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring the Gene Ontology graph and annotation with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes the function prediction to genes and gene products. Specifically, two pre-training tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. A specified masking and recovering task helps GoBERT find implicit patterns among functions. The pre-trained GoBERT possesses the ability to predict novel functions for various genes and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.

[LG-3] Social Processes: Probabilistic Meta-learning for Adaptive Multiparty Interaction Forecasting

Link: https://arxiv.org/abs/2501.01915
Authors: Augustinas Jučas,Chirag Raman
Subjects: Machine Learning (cs.LG)
Comments: This is an extension paper to “Social Processes: Self-Supervised Meta-Learning over Conversational Groups for Forecasting Nonverbal Social Cues”, by Raman et al. (arXiv:2107.13576)

Abstract:Adaptively forecasting human behavior in social settings is an important step toward achieving Artificial General Intelligence. Most existing research in social forecasting has focused either on unfocused interactions, such as pedestrian trajectory prediction, or on monadic and dyadic behavior forecasting. In contrast, social psychology emphasizes the importance of group interactions for understanding complex social dynamics. This creates a gap that we address in this paper: forecasting social interactions at the group (conversation) level. Additionally, it is important for a forecasting model to be able to adapt to groups unseen at train time, as even the same individual behaves differently across different groups. This highlights the need for a forecasting model to explicitly account for each group’s unique dynamics. To achieve this, we adopt a meta-learning approach to human behavior forecasting, treating every group as a separate meta-learning task. As a result, our method conditions its predictions on the specific behaviors within the group, leading to generalization to unseen groups. Specifically, we introduce Social Process (SP) models, which predict a distribution over future multimodal cues jointly for all group members based on their preceding low-level multimodal cues, while incorporating other past sequences of the same group’s interactions. In this work we also analyze the generalization capabilities of SP models in both their outputs and latent spaces through the use of realistic synthetic datasets.

[LG-4] Alleviating Overfitting in Transformation-Interaction-Rational Symbolic Regression with Multi-Objective Optimization

Link: https://arxiv.org/abs/2501.01905
Authors: Fabricio Olivetti de Franca
Subjects: Machine Learning (cs.LG)
Comments: 25 pages, 8 figures, 4 tables, Genetic Programming and Evolvable Machines, vol 24, no 2

Abstract:The Transformation-Interaction-Rational is a representation for symbolic regression that limits the search space of functions to the ratio of two nonlinear functions, each one defined as the linear regression of transformed variables. This representation has the main objective of biasing the search towards simpler expressions while keeping the approximation power of standard approaches. The performance of using Genetic Programming with this representation was substantially better than with its predecessor (Interaction-Transformation) and ranked close to the state-of-the-art on a contemporary Symbolic Regression benchmark. On a closer look at these results, we observed that the performance could be further improved with an additional selective pressure for smaller expressions when the dataset contains just a few data points. The introduction of a penalization term applied to the fitness measure improved the results on these smaller datasets. One problem with this approach is that it introduces two additional hyperparameters: i) a criterion for when the penalization should be activated and, ii) the amount of penalization to the fitness function. In this paper, we extend Transformation-Interaction-Rational to support multi-objective optimization, specifically the NSGA-II algorithm, and apply that to the same benchmark. A detailed analysis of the results shows that the use of multi-objective optimization benefits the overall performance on a subset of the benchmarks while keeping the results similar to the single-objective approach on the remainder of the datasets. Specifically for the small datasets, we observe a small (and statistically insignificant) improvement of the results, suggesting that further strategies must be explored.
Journal reference: Fabrício Olivetti de França. 2023. Alleviating overfitting in transformation-interaction-rational symbolic regression with multi-objective optimization. Genetic Programming and Evolvable Machines 24, 2 (Dec 2023). Related DOI: https://doi.org/10.1007/s10710-023-09461-3

[LG-5] Exploring Equality: An Investigation into Custom Loss Functions for Fairness Definitions

Link: https://arxiv.org/abs/2501.01889
Authors: Gordon Lee,Simeon Sayer
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 17 Pages, 12 Figures

[LG-6] DFF: Decision-Focused Fine-tuning for Smarter Predict-then-Optimize with Limited Data AAAI

Link: https://arxiv.org/abs/2501.01874
Authors: Jiaqi Yang,Enming Liang,Zicheng Su,Zhichao Zou,Peng Zhen,Jiecheng Guo,Wanjing Ma,Kun An
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, The 39th Annual AAAI Conference on Artificial Intelligence

Abstract:Decision-focused learning (DFL) offers an end-to-end approach to the predict-then-optimize (PO) framework by training predictive models directly on decision loss (DL), enhancing decision-making performance within PO contexts. However, the implementation of DFL poses distinct challenges. Primarily, DL can result in deviation from the physical significance of the predictions under limited data. Additionally, some predictive models are non-differentiable or black-box, which cannot be adjusted using gradient-based methods. To tackle the above challenges, we propose a novel framework, Decision-Focused Fine-tuning (DFF), which embeds the DFL module into the PO pipeline via a novel bias correction module. DFF is formulated as a constrained optimization problem that maintains the proximity of the DL-enhanced model to the original predictive model within a defined trust region. We theoretically prove that DFF strictly confines prediction bias within a predetermined upper bound, even with limited datasets, thereby substantially reducing prediction shifts caused by DL under limited data. Furthermore, the bias correction module can be integrated into diverse predictive models, enhancing adaptability to a broad range of PO tasks. Extensive evaluations on synthetic and real-world datasets, including network flow, portfolio optimization, and resource allocation problems with different predictive models, demonstrate that DFF not only improves decision performance but also adheres to fine-tuning constraints, showcasing robust adaptability across various scenarios.

[LG-7] Learning from Ambiguous Data with Hard Labels ICASSP2025

Link: https://arxiv.org/abs/2501.01844
Authors: Zeke Xie, Zheng He, Nan Lu, Lichen Bai, Bao Li, Shuo Yang, Mingming Sun, Ping Li
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, accepted by ICASSP 2025

Abstract:Real-world data often contains intrinsic ambiguity that the common single-hard-label annotation paradigm ignores. Standard training using ambiguous data with these hard labels may produce overly confident models and thus lead to poor generalization. In this paper, we propose a novel framework called Quantized Label Learning (QLL) to alleviate this issue. First, we formulate QLL as learning from (very) ambiguous data with hard labels: ideally, each ambiguous instance should be associated with a ground-truth soft-label distribution describing its corresponding probabilistic weight in each class; however, this is usually not accessible. In practice, we can only observe a quantized label, i.e., a hard label sampled (quantized) from the corresponding ground-truth soft-label distribution, of each instance, which can be seen as a biased approximation of the ground-truth soft-label. Second, we propose a Class-wise Positive-Unlabeled (CPU) risk estimator that allows us to train accurate classifiers from only ambiguous data with quantized labels. Third, to simulate ambiguous datasets with quantized labels in the real world, we design a mixing-based ambiguous data generation procedure for empirical evaluation. Experiments demonstrate that our CPU method can significantly improve model generalization performance and outperform the baselines.
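
The notion of a quantized label, a hard label sampled from an instance's soft-label distribution, can be sketched in a few lines. The function name `quantize_label` and the example distribution are hypothetical; this only illustrates the sampling step the abstract describes, not the CPU risk estimator.

```python
import random

def quantize_label(soft_label, rng):
    """Draw one hard label (a class index) from a soft-label distribution."""
    classes = range(len(soft_label))
    return rng.choices(classes, weights=soft_label, k=1)[0]

rng = random.Random(0)
soft = [0.6, 0.3, 0.1]  # hypothetical ground-truth soft label of one instance
draws = [quantize_label(soft, rng) for _ in range(10000)]
freq = [draws.count(c) / len(draws) for c in range(len(soft))]
```

A single draw discards most of the distribution, which is why the observed hard label is only a rough stand-in for the soft label; only over many draws do the frequencies approach `soft`.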

[LG-8] Age-Based Device Selection and Transmit Power Optimization in Over-the-Air Federated Learning

Link: https://arxiv.org/abs/2501.01828
Authors: Jingyuan Liu, Zheng Chang, Ying-Chang Liang
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)

Abstract:Recently, over-the-air federated learning (FL) has attracted significant attention for its ability to enhance communication efficiency. However, the performance of over-the-air FL is often constrained by device selection strategies and signal aggregation errors. In particular, neglecting straggler devices in FL can lead to a decline in the fairness of model updates and amplify the global model’s bias toward certain devices’ data, ultimately impacting the overall system performance. To address this issue, we propose a joint device selection and transmit power optimization framework that ensures the appropriate participation of straggler devices, maintains efficient training performance, and guarantees timely updates. First, we conduct a theoretical analysis to quantify the convergence upper bound of over-the-air FL under age-of-information (AoI)-based device selection. Our analysis further reveals that both the number of selected devices and the signal aggregation errors significantly influence the convergence upper bound. To minimize the expected weighted sum peak age of information, we calculate device priorities for each communication round using Lyapunov optimization and select the highest-priority devices via a greedy algorithm. Then, we formulate and solve a transmit power and normalizing factor optimization problem for selected devices to minimize the time-average mean squared error (MSE). Experimental results demonstrate that our proposed method offers two significant advantages: (1) it reduces MSE and improves model performance compared to baseline methods, and (2) it strikes a balance between fairness and training efficiency while maintaining satisfactory timeliness, ensuring stable model performance.
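
As a rough illustration of age-based greedy selection, the sketch below tracks a per-device age and picks the `k` devices with the highest priority each round. Using `weight * age` as the priority is an assumption standing in for the Lyapunov-derived priorities mentioned in the abstract, and the function names are hypothetical.

```python
def select_devices(ages, weights, k):
    """Greedily pick the k devices with the largest weight * age priority."""
    priority = [w * a for w, a in zip(weights, ages)]
    order = sorted(range(len(ages)), key=lambda i: priority[i], reverse=True)
    return sorted(order[:k])

def update_ages(ages, selected):
    """Selected devices restart at age 1; all others age by one round."""
    chosen = set(selected)
    return [1 if i in chosen else a + 1 for i, a in enumerate(ages)]

ages = [1, 1, 1, 1]
weights = [1.0, 1.0, 1.0, 3.0]  # hypothetical per-device importance
history = []
for _ in range(4):
    picked = select_devices(ages, weights, k=2)
    history.append(picked)
    ages = update_ages(ages, picked)
```

Because unselected devices keep aging, even low-weight devices eventually reach the top of the priority list, which is the intuition behind keeping stragglers in the loop.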

[LG-9] Rerouting LLM Routers

Link: https://arxiv.org/abs/2501.01818
Authors: Avital Shafran, Roei Schuster, Thomas Ristenpart, Vitaly Shmatikov
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:LLM routers aim to balance quality and cost of generation by classifying queries and routing them to a cheaper or more expensive LLM depending on their complexity. Routers represent one type of what we call LLM control planes: systems that orchestrate use of one or more LLMs. In this paper, we investigate routers’ adversarial robustness. We first define LLM control plane integrity, i.e., robustness of LLM orchestration to adversarial inputs, as a distinct problem in AI safety. Next, we demonstrate that an adversary can generate query-independent token sequences we call "confounder gadgets" that, when added to any query, cause LLM routers to send the query to a strong LLM. Our quantitative evaluation shows that this attack is successful both in white-box and black-box settings against a variety of open-source and commercial routers, and that confounding queries do not affect the quality of LLM responses. Finally, we demonstrate that gadgets can be effective while maintaining low perplexity, thus perplexity-based filtering is not an effective defense. We finish by investigating alternative defenses.
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2501.01818 [cs.CR] (or arXiv:2501.01818v1 [cs.CR] for this version)
DOI: https://doi.org/10.48550/arXiv.2501.01818 (arXiv-issued DOI via DataCite, pending registration)
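
The routing decision itself can be pictured as a threshold on a complexity score. The toy scorer below (a keyword ratio) is purely hypothetical and far simpler than any real router; it is only meant to show why a query-independent prefix can shift a score-based decision.

```python
HARD_WORDS = {"prove", "derive", "optimize"}  # hypothetical stand-in for a learned scorer

def toy_score(query):
    """Fraction of words that look 'hard'; a real router would use a trained model."""
    words = query.lower().split()
    return sum(w in HARD_WORDS for w in words) / max(len(words), 1)

def route(query, score_fn=toy_score, threshold=0.2):
    """Send the query to the strong model only if its score clears the threshold."""
    return "strong" if score_fn(query) >= threshold else "weak"

plain = "what is the capital of france"
gadget = "prove derive optimize"  # toy prefix, not an actual gadget from the paper
before = route(plain)
after = route(gadget + " " + plain)
```

Here the same prefix pushes any short query over the threshold, regardless of its content. The gadgets in the paper are optimized token sequences, which this sketch does not attempt to reproduce.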

[LG-10] John Ellipsoids via Lazy Updates NEURIPS2024

Link: https://arxiv.org/abs/2501.01801
Authors: David P. Woodruff, Taisuke Yasuda
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: NeurIPS 2024

Abstract:We give a faster algorithm for computing an approximate John ellipsoid around n points in d dimensions. The best known prior algorithms are based on repeatedly computing the leverage scores of the points and reweighting them by these scores [CCLY19]. We show that this algorithm can be substantially sped up by delaying the computation of high accuracy leverage scores by using sampling, and then later computing multiple batches of high accuracy leverage scores via fast rectangular matrix multiplication. We also give low-space streaming algorithms for John ellipsoids using similar ideas.
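
The baseline that this abstract builds on, repeatedly computing leverage scores and reweighting the points by them, can be sketched as a fixed-point iteration. The version below is a plain two-dimensional illustration with a hand-written 2x2 inverse; it assumes the symmetric setting (points together with their negations), uses exact leverage scores every round, and does not include the sampling or lazy-update speedups that the paper contributes.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def leverage_reweight(points, iters=50):
    """Iterate w_i <- w_i * a_i^T (sum_j w_j a_j a_j^T)^{-1} a_i for 2-D points."""
    n, d = len(points), 2
    w = [d / n] * n
    for _ in range(iters):
        # Weighted second-moment matrix M = sum_i w_i a_i a_i^T.
        m = [[sum(wi * p[r] * p[c] for wi, p in zip(w, points)) for c in range(d)]
             for r in range(d)]
        mi = inv2(m)
        # Multiply each weight by the quadratic form a_i^T M^{-1} a_i.
        w = [wi * sum(p[r] * mi[r][c] * p[c] for r in range(d) for c in range(d))
             for wi, p in zip(w, points)]
    return w

points = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (-1.0, 0.2)]
weights = leverage_reweight(points)
```

Each round keeps the weights summing to the dimension, since the leverage scores of a full-rank weighted matrix sum to `d`. Points that end up with weight near zero do not touch the resulting ellipsoid.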

[LG-11] A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Link: https://arxiv.org/abs/2501.01774
Authors: Zechen Wu, Amy Greenwald, Ronald Parr
Categories: Machine Learning (cs.LG)

Abstract:Traditionally, TD and FQI are viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number, such as the use of a target network in Deep Q-Networks (DQN) in the OPE setting. This perspective, however, fails to capture the convergence connections between these algorithms and may lead to incorrect conclusions, for example, that the convergence of TD implies the convergence of FQI. In this paper, we focus on linear value function approximation and offer a new perspective, unifying TD, FQI, and PFQI as the same iterative method for solving the Least Squares Temporal Difference (LSTD) system, but using different preconditioners and matrix splitting schemes. TD uses a constant preconditioner, FQI employs a data-feature adaptive preconditioner, and PFQI transitions between the two. Then, we reveal that in the context of linear function approximation, increasing the number of updates under the same target value function essentially represents a transition from using a constant preconditioner to data-feature adaptive preconditioner. This unifying perspective also simplifies the analyses of the convergence conditions for these algorithms and clarifies many issues. Consequently, we fully characterize the convergence of each algorithm without assuming specific properties of the chosen features (e.g., linear independence). We also examine how common assumptions about feature representations affect convergence, and discover new conditions on features that are important for convergence. These convergence conditions allow us to establish the convergence connections between these algorithms and to address important questions.
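
The unifying view in this abstract can be checked numerically on a tiny example: with linear features, one expected TD update and one FQI update are both steps of the form theta + M (b - A theta), where A theta = b is the LSTD system and only the preconditioner M differs. The sketch below uses the standard definitions Sigma = Phi^T D Phi, A = Sigma - gamma Phi^T D P Phi, and b = Phi^T D r; the helper names and the three-state example are hypothetical, and PFQI is not shown.

```python
def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def lstd_system(phi, d, p, r, gamma):
    """Return Sigma, Phi^T D P Phi, A, and b for features phi and state weights d."""
    n, k = len(phi), len(phi[0])
    next_phi = [[sum(p[s][t] * phi[t][j] for t in range(n)) for j in range(k)]
                for s in range(n)]
    sigma = [[sum(d[s] * phi[s][i] * phi[s][j] for s in range(n)) for j in range(k)]
             for i in range(k)]
    cross = [[sum(d[s] * phi[s][i] * next_phi[s][j] for s in range(n)) for j in range(k)]
             for i in range(k)]
    a = [[sigma[i][j] - gamma * cross[i][j] for j in range(k)] for i in range(k)]
    b = [sum(d[s] * phi[s][i] * r[s] for s in range(n)) for i in range(k)]
    return sigma, cross, a, b

def preconditioned_step(theta, a, b, precond):
    """One step theta + precond @ (b - a @ theta)."""
    residual = [bi - ai for bi, ai in zip(b, matvec(a, theta))]
    return [t + u for t, u in zip(theta, matvec(precond, residual))]

phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d = [0.5, 0.3, 0.2]
p = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3], [0.3, 0.3, 0.4]]
r = [1.0, 0.0, 2.0]
gamma = 0.9
sigma, cross, a, b = lstd_system(phi, d, p, r, gamma)
theta = [0.1, -0.2]

# TD: constant preconditioner (step size 0.1 times the identity).
td_next = preconditioned_step(theta, a, b, [[0.1, 0.0], [0.0, 0.1]])
# FQI: data-dependent preconditioner Sigma^{-1}.
fqi_next = preconditioned_step(theta, a, b, inv2(sigma))
# FQI written the usual way: regress features onto r + gamma * (P Phi theta).
target = [bi + gamma * ci for bi, ci in zip(b, matvec(cross, theta))]
fqi_direct = matvec(inv2(sigma), target)
```

`fqi_next` and `fqi_direct` agree up to rounding, which is the algebraic identity behind reading FQI as a preconditioned iteration on the same linear system that TD iterates on.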

[LG-12] SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

Link: https://arxiv.org/abs/2501.01765
Authors: Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang
Categories: Machine Learning (cs.LG)

Abstract:As advancements in large language models (LLMs) continue and the demand for personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) will become essential due to their efficiency in reducing computation costs. However, recent studies have raised alarming concerns that LoRA fine-tuning could potentially compromise the safety alignment in LLMs, posing significant risks for the model owner. In this paper, we first investigate the underlying mechanism by analyzing the changes in safety alignment related features before and after fine-tuning. Then, we propose a fixed safety module calculated by safety data and a task-specific initialization for trainable parameters in low-rank adaptations, termed Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments. Our experiments show that SaLoRA outperforms various adapters-based approaches across various evaluation metrics in different fine-tuning tasks.

[LG-13] Catch Causal Signals from Edges for Label Imbalance in Graph Classification ICASSP2025

Link: https://arxiv.org/abs/2501.01707
Authors: Fengrui Zhang, Yujia Yin, Hongzong Li, Yifan Chen, Tianyi Qu
Categories: Machine Learning (cs.LG)
Comments: ICASSP 2025

Abstract:Despite significant advancements in causal research on graphs and its application to cracking label imbalance, the role of edge features in detecting the causal effects within graphs has been largely overlooked, leaving existing methods with untapped potential for further performance gains. In this paper, we enhance the causal attention mechanism through effectively leveraging edge information to disentangle the causal subgraph from the original graph, as well as further utilizing edge features to reshape graph representations. Capturing more comprehensive causal signals, our design leads to improved performance on graph classification tasks with label imbalance issues. We evaluate our approach on real-world datasets PTC, Tox21, and ogbg-molhiv, observing improvements over baselines. Overall, we highlight the importance of edge features in graph causal detection and provide a promising direction for addressing label imbalance challenges in graph-level tasks. The model implementation details and the codes are available on this https URL

[LG-14] Comparative Study of Deep Learning Architectures for Textual Damage Level Classification

Link: https://arxiv.org/abs/2501.01694
Authors: Aziida Nanyonga, Hassan Wasswa, Graham Wild
Categories: Machine Learning (cs.LG)

Abstract:Given the paramount importance of safety in the aviation industry, even minor operational anomalies can have significant consequences. Comprehensive documentation of incidents and accidents serves to identify root causes and propose safety measures. However, the unstructured nature of incident event narratives poses a challenge for computer systems to interpret. Our study aimed to leverage Natural Language Processing (NLP) and deep learning models to analyze these narratives and classify the aircraft damage level incurred during safety occurrences. Through the implementation of LSTM, BLSTM, GRU, and sRNN deep learning models, our research yielded promising results, with all models showcasing competitive performance, achieving an accuracy of over 88%, significantly surpassing the 25% random guess threshold for a four-class classification problem. Notably, the sRNN model emerged as the top performer in terms of recall and accuracy, boasting a remarkable 89%. These findings underscore the potential of NLP and deep learning models in extracting actionable insights from unstructured text narratives, particularly in evaluating the extent of aircraft damage within the realm of aviation safety occurrences.

[LG-15] Denoising and Adaptive Online Vertical Federated Learning for Sequential Multi-Sensor Data in Industrial Internet of Things

Link: https://arxiv.org/abs/2501.01693
Authors: Heqiang Wang, Xiaoxiong Zhong, Kang Liu, Fangming Liu, Weizhe Zhang
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:With the continuous improvement in the computational capabilities of edge devices such as intelligent sensors in the Industrial Internet of Things, these sensors are no longer limited to mere data collection but are increasingly capable of performing complex computational tasks. This advancement provides both the motivation and the foundation for adopting distributed learning approaches. This study focuses on an industrial assembly line scenario where multiple sensors, distributed across various locations, sequentially collect real-time data characterized by distinct feature spaces. To leverage the computational potential of these sensors while addressing the challenges of communication overhead and privacy concerns inherent in centralized learning, we propose the Denoising and Adaptive Online Vertical Federated Learning (DAO-VFL) algorithm. Tailored to the industrial assembly line scenario, DAO-VFL effectively manages continuous data streams and adapts to shifting learning objectives. Furthermore, it can address critical challenges prevalent in industrial environments, such as communication noise and heterogeneity of sensor capabilities. To support the proposed algorithm, we provide a comprehensive theoretical analysis, highlighting the effects of noise reduction and adaptive local iteration decisions on the regret bound. Experimental results on two real-world datasets further demonstrate the superior performance of DAO-VFL compared to benchmark algorithms.

[LG-16] Analyzing Aviation Safety Narratives with LDA NMF and PLSA: A Case Study Using Socrata Datasets

Link: https://arxiv.org/abs/2501.01690
Authors: Aziida Nanyonga, Graham Wild
Categories: Machine Learning (cs.LG)

Abstract:This study explores the application of topic modelling techniques Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA) on the Socrata dataset spanning from 1908 to 2009. Categorized by operator type (military, commercial, and private), the analysis identified key themes such as pilot error, mechanical failure, weather conditions, and training deficiencies. The study highlights the unique strengths of each method: LDA's ability to uncover overlapping themes, NMF's production of distinct and interpretable topics, and PLSA's nuanced probabilistic insights despite interpretative complexity. Statistical analysis revealed that PLSA achieved a coherence score of 0.32 and a perplexity value of -4.6, NMF scored 0.34 and 37.1, while LDA achieved the highest coherence of 0.36 but recorded the highest perplexity at 38.2. These findings demonstrate the value of topic modelling in extracting actionable insights from unstructured aviation safety narratives, aiding in the identification of risk factors and areas for improvement across sectors. Future directions include integrating additional contextual variables, leveraging neural topic models, and enhancing aviation safety protocols. This research provides a foundation for advanced text-mining applications in aviation safety management.

[LG-17] Inversely Learning Transferable Rewards via Abstracted States

Link: https://arxiv.org/abs/2501.01669
Authors: Yikang Gui, Prashant Doshi
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Inverse reinforcement learning (IRL) has progressed significantly toward accurately learning the underlying rewards in both discrete and continuous domains from behavior data. The next advance is to learn intrinsic preferences in ways that produce useful behavior in settings or tasks which are different but aligned with the observed ones. In the context of robotic applications, this helps integrate robots into processing lines involving new tasks (with shared intrinsic preferences) without programming from scratch. We introduce a method to inversely learn an abstract reward function from behavior trajectories in two or more differing instances of a domain. The abstract reward function is then used to learn task behavior in another separate instance of the domain. This step offers evidence of its transferability and validates its correctness. We evaluate the method on trajectories in tasks from multiple domains in OpenAI’s Gym testbed and AssistiveGym and show that the learned abstract reward functions can successfully learn task behaviors in instances of the respective domains, which have not been seen previously.

[LG-18] FairSense: Long-Term Fairness Analysis of ML-Enabled Systems ICSE2025

Link: https://arxiv.org/abs/2501.01665
Authors: Yining She, Sumon Biswas, Christian Kästner, Eunsuk Kang
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: In Proceedings of the 47th International Conference on Software Engineering (ICSE 2025)

Abstract:Algorithmic fairness of machine learning (ML) models has raised significant concern in recent years. Many testing, verification, and bias mitigation techniques have been proposed to identify and reduce fairness issues in ML models. The existing methods are model-centric and designed to detect fairness issues under static settings. However, many ML-enabled systems operate in a dynamic environment where the predictive decisions made by the system impact the environment, which in turn affects future decision-making. Such a self-reinforcing feedback loop can cause fairness violations in the long term, even if the immediate outcomes are fair. In this paper, we propose a simulation-based framework called FairSense to detect and analyze long-term unfairness in ML-enabled systems. Given a fairness requirement, FairSense performs Monte-Carlo simulation to enumerate evolution traces for each system configuration. Then, FairSense performs sensitivity analysis on the space of possible configurations to understand the impact of design options and environmental factors on the long-term fairness of the system. We demonstrate FairSense’s potential utility through three real-world case studies: loan lending, opioids risk scoring, and predictive policing.

[LG-19] Look Back for More: Harnessing Historical Sequential Updates for Personalized Federated Adapter Tuning AAAI2025

Link: https://arxiv.org/abs/2501.01653
Authors: Danni Peng, Yuan Wang, Huazhu Fu, Jinpeng Jiang, Yong Liu, Rick Siow Mong Goh, Qingsong Wei
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by AAAI 2025

Abstract:Personalized federated learning (PFL) studies effective model personalization to address the data heterogeneity issue among clients in traditional federated learning (FL). Existing PFL approaches mainly generate personalized models by relying solely on the clients’ latest updated models while ignoring their previous updates, which may result in suboptimal personalized model learning. To bridge this gap, we propose a novel framework termed pFedSeq, designed for personalizing adapters to fine-tune a foundation model in FL. In pFedSeq, the server maintains and trains a sequential learner, which processes a sequence of past adapter updates from clients and generates calibrations for personalized adapters. To effectively capture the cross-client and cross-step relations hidden in previous updates and generate high-performing personalized adapters, pFedSeq adopts the powerful selective state space model (SSM) as the architecture of sequential learner. Through extensive experiments on four public benchmark datasets, we demonstrate the superiority of pFedSeq over state-of-the-art PFL methods.

[LG-20] A Probabilistic Model for Node Classification in Directed Graphs

Link: https://arxiv.org/abs/2501.01630
Authors: Diego Huerta, Gerardo Arizmendi
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 33 pages, 5 figures

Abstract:In this work, we present a probabilistic model for directed graphs where nodes have attributes and labels. This model serves as a generative classifier capable of predicting the labels of unseen nodes using either maximum likelihood or maximum a posteriori estimations. The predictions made by this model are highly interpretable, contrasting with some common methods for node classification, such as graph neural networks. We applied the model to two datasets, demonstrating predictive performance that is competitive with, and even superior to, state-of-the-art methods. One of the datasets considered is adapted from the Math Genealogy Project, which has not previously been utilized for this purpose. Consequently, we evaluated several classification algorithms on this dataset to compare the performance of our model and provide benchmarks for this new resource.

[LG-21] Adaptive Meta-learning-based Adversarial Training for Robust Automatic Modulation Classification

Link: https://arxiv.org/abs/2501.01620
Authors: Amirmohammad Bamdad, Ali Owfi, Fatemeh Afghah
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Submitted to IEEE International Conference on Communications (ICC) 2025

Abstract:DL-based automatic modulation classification (AMC) models are highly susceptible to adversarial attacks, where even minimal input perturbations can cause severe misclassifications. While adversarially training an AMC model based on an adversarial attack significantly increases its robustness against that attack, the AMC model will still be defenseless against other adversarial attacks. The theoretically infinite possibilities for adversarial perturbations mean that an AMC model will inevitably encounter new unseen adversarial attacks if it is ever to be deployed to a real-world communication system. Moreover, the computational limitations and challenges of obtaining new data in real-time will not allow a full training process for the AMC model to adapt to the new attack when it is online. To this end, we propose a meta-learning-based adversarial training framework for AMC models that substantially enhances robustness against unseen adversarial attacks and enables fast adaptation to these attacks using just a few new training samples, if any are available. Our results demonstrate that this training framework provides superior robustness and accuracy with much less online training time than conventional adversarial training of AMC models, making it highly efficient for real-world deployment.

[LG-22] Online Meta-Learning Channel Autoencoder for Dynamic End-to-end Physical Layer Optimization

Link: https://arxiv.org/abs/2501.01608
Authors: Ali Owfi, Jonathan Ashdown, Kurt Turck
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: To be published in IEEE Wireless Communications and Networking Conference (WCNC) 2025

Abstract:Channel Autoencoders (CAEs) have shown significant potential in optimizing the physical layer of a wireless communication system for a specific channel through joint end-to-end training. However, the practical implementation of CAEs faces several challenges, particularly in realistic and dynamic scenarios. Channels in communication systems are dynamic and change with time. Still, most proposed CAE designs assume stationary scenarios, meaning they are trained and tested for only one channel realization without regard for the dynamic nature of wireless communication systems. Moreover, conventional CAEs are designed based on the assumption of having access to a large number of pilot signals, which act as training samples in the context of CAEs. However, in real-world applications, it is not feasible for a CAE operating in real-time to acquire large amounts of training samples for each new channel realization. Hence, the CAE has to be deployable in few-shot learning scenarios where only limited training samples are available. Furthermore, most proposed conventional CAEs lack fast adaptability to new channel realizations, which becomes more pronounced when dealing with a limited number of pilots. To address these challenges, this paper proposes the Online Meta Learning channel AE (OML-CAE) framework for few-shot CAE scenarios with dynamic channels. The OML-CAE framework enhances adaptability to varying channel conditions in an online manner, allowing for dynamic adjustments in response to evolving communication scenarios. Moreover, it can adapt to new channel conditions using only a few pilots, drastically increasing pilot efficiency and making the CAE design feasible in realistic scenarios.

[LG-23] Multivariate Time Series Anomaly Detection using DiffGAN Model

Link: https://arxiv.org/abs/2501.01591
Authors: Guangqiang Wu, Fu Zhang
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 19 pages, 3 figures, 1 table

Abstract:In recent years, some researchers have applied diffusion models to multivariate time series anomaly detection. The partial diffusion strategy, which depends on the diffusion steps, is commonly used for anomaly detection in these models. However, different diffusion steps have an impact on the reconstruction of the original data, thereby impacting the effectiveness of anomaly detection. To address this issue, we propose a novel method named DiffGAN, which adds a generative adversarial network component to the denoiser of diffusion model. This addition allows for the simultaneous generation of noisy data and prediction of diffusion steps. Compared to multiple state-of-the-art reconstruction models, experimental results demonstrate that DiffGAN achieves superior performance in anomaly detection.

[LG-24] Stackelberg Game Based Performance Optimization in Digital Twin Assisted Federated Learning over NOMA Networks

Link: https://arxiv.org/abs/2501.01584
Authors: Bibo Wu, Fang Fang, Xianbin Wang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Networking and Internet Architecture (cs.NI)

Abstract:Despite the advantage of preserving data privacy, federated learning (FL) still suffers from the straggler issue due to the limited computing resources of distributed clients and the unreliable wireless communication environment. By effectively imitating the distributed resources, digital twin (DT) shows great potential in alleviating this issue. In this paper, we leverage DT in the FL framework over non-orthogonal multiple access (NOMA) network to assist FL training process, considering malicious attacks on model updates from clients. A reputation-based client selection scheme is proposed, which accounts for client heterogeneity in multiple aspects and effectively mitigates the risks of poisoning attacks in FL systems. To minimize the total latency and energy consumption in the proposed system, we then formulate a Stackelberg game by considering clients and the server as the leader and the follower, respectively. Specifically, the leader aims to minimize the energy consumption while the objective of the follower is to minimize the total latency during FL training. The Stackelberg equilibrium is achieved to obtain the optimal solutions. We first derive the strategies for the follower-level problem and include them in the leader-level problem which is then solved via problem decomposition. Simulation results verify the superior performance of the proposed scheme.

[LG-25] Semialgebraic Neural Networks: From roots to representations

Link: https://arxiv.org/abs/2501.01564
Authors: S. David Mis, Matti Lassas, Maarten V. de Hoop
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)

Abstract:Many numerical algorithms in scientific computing – particularly in areas like numerical linear algebra, PDE simulation, and inverse problems – produce outputs that can be represented by semialgebraic functions; that is, the graph of the computed function can be described by finitely many polynomial equalities and inequalities. In this work, we introduce Semialgebraic Neural Networks (SANNs), a neural network architecture capable of representing any bounded semialgebraic function, and computing such functions up to the accuracy of a numerical ODE solver chosen by the programmer. Conceptually, we encode the graph of the learned function as the kernel of a piecewise polynomial selected from a class of functions whose roots can be evaluated using a particular homotopy continuation method. We show by construction that the SANN architecture is able to execute this continuation method, thus evaluating the learned semialgebraic function. Furthermore, the architecture can exactly represent even discontinuous semialgebraic functions by executing a continuation method on each connected component of the target function. Lastly, we provide example applications of these networks and show they can be trained with traditional deep-learning techniques.

[LG-26] Transfer Neyman-Pearson Algorithm for Outlier Detection

Link: https://arxiv.org/abs/2501.01525
Authors: Mohammadreza M. Kalan, Eitan J. Neugut, Samory Kpotufe
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We consider the problem of transfer learning in outlier detection where target abnormal data is rare. While transfer learning has been considered extensively in traditional balanced classification, the problem of transfer in outlier detection and more generally in imbalanced classification settings has received less attention. We propose a general meta-algorithm which is shown theoretically to yield strong guarantees w.r.t. a range of changes in abnormal distribution, and is at the same time amenable to practical implementation. We then investigate different instantiations of this general meta-algorithm, e.g., based on multi-layer neural networks, and show empirically that they outperform natural extensions of transfer methods for traditional balanced classification settings (which are the only solutions available at the moment).

[LG-27] TreeLUT: An Efficient Alternative to Deep Neural Networks for Inference Acceleration Using Gradient Boosted Decision Trees

Link: https://arxiv.org/abs/2501.01511
Authors: Alireza Khataei, Kia Bazargan
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: Accepted by FPGA’25 conference

Click to view abstract

Abstract:Accelerating machine learning inference has been an active research area in recent years. In this context, field-programmable gate arrays (FPGAs) have demonstrated compelling performance by providing massive parallelism in deep neural networks (DNNs). Neural networks (NNs) are computationally intensive during inference, as they require massive amounts of multiplication and addition, which makes their implementations costly. Numerous studies have recently addressed this challenge to some extent using a combination of sparsity induction, quantization, and transformation of neurons or sub-networks into lookup tables (LUTs) on FPGAs. Gradient boosted decision trees (GBDTs) are a high-accuracy alternative to DNNs in a wide range of regression and classification tasks, particularly for tabular datasets. The basic building block of GBDTs is a decision tree, which resembles the structure of binary decision diagrams. FPGA design flows are heavily optimized to implement such a structure efficiently. In addition to decision trees, GBDTs perform simple operations during inference, including comparison and addition. We present TreeLUT as an open-source tool for implementing GBDTs using an efficient quantization scheme, hardware architecture, and pipelining strategy. It primarily utilizes LUTs with no BRAMs or DSPs on FPGAs, resulting in high efficiency. We show the effectiveness of TreeLUT using multiple classification datasets, commonly used to evaluate ultra-low area and latency architectures. Using these benchmarks, we compare our implementation results with existing DNN and GBDT methods, such as DWN, PolyLUT-Add, NeuraLUT, LogicNets, FINN, hls4ml, and others. Our results show that TreeLUT significantly improves hardware utilization, latency, and throughput at competitive accuracy compared to previous works.

[LG-28] Explainable Brain Age Gap Prediction in Neurodegenerative Conditions using coVariance Neural Networks

Link: https://arxiv.org/abs/2501.01510
Authors: Saurabh Sihag, Gonzalo Mateos, Alejandro Ribeiro
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
Comments: Accepted at ISBI, 2025

Click to view abstract

Abstract: Brain age is the estimate of biological age derived from neuroimaging datasets using machine learning algorithms. Increasing brain age gap, characterized by an elevated brain age relative to the chronological age, can reflect increased vulnerability to neurodegeneration and cognitive decline. Hence, brain age gap is a promising biomarker for monitoring brain health. However, black-box machine learning approaches to brain age gap prediction have limited practical utility. Recent studies on coVariance neural networks (VNN) have proposed a relatively transparent deep learning pipeline for neuroimaging data analyses, which possesses two key features: (i) inherent anatomical interpretability of derived biomarkers; and (ii) a methodologically interpretable perspective based on linkage with eigenvectors of the anatomic covariance matrix. In this paper, we apply the VNN-based approach to study brain age gap using cortical thickness features for various prevalent neurodegenerative conditions. Our results reveal distinct anatomic patterns for brain age gap in Alzheimer’s disease, frontotemporal dementia, and atypical Parkinsonian disorders. Furthermore, we demonstrate that the distinct anatomic patterns of brain age gap are linked with the differences in how VNN leverages the eigenspectrum of the anatomic covariance matrix, thus lending explainability to the reported results.

[LG-29] Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries

Link: https://arxiv.org/abs/2501.01453
Authors: Ali Rabeh, Ethan Herron, Aditya Balu, Soumik Sarkar, Chinmay Hegde, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Click to view abstract

Abstract: Rapid yet accurate simulations of fluid dynamics around complex geometries are critical in a variety of engineering and scientific applications, including aerodynamics and biomedical flows. However, while scientific machine learning (SciML) has shown promise, most studies are constrained to simple geometries, leaving complex, real-world scenarios underexplored. This study addresses this gap by benchmarking diverse SciML models, including neural operators and vision transformer-based foundation models, for fluid flow prediction over intricate geometries. Using a high-fidelity dataset of steady-state flows across various geometries, we evaluate the impact of geometric representations – Signed Distance Fields (SDF) and binary masks – on model accuracy, scalability, and generalization. Central to this effort is the introduction of a novel, unified scoring framework that integrates metrics for global accuracy, boundary layer fidelity, and physical consistency to enable a robust, comparative evaluation of model performance. Our findings demonstrate that foundation models significantly outperform neural operators, particularly in data-limited scenarios, and that SDF representations yield superior results with sufficient training data. Despite these advancements, all models struggle with out-of-distribution generalization, highlighting a critical challenge for future SciML applications. By advancing both evaluation methodologies and modeling capabilities, this work paves the way for robust and scalable ML solutions for fluid dynamics across complex geometries.

[LG-30] CSI Compression using Channel Charting

Link: https://arxiv.org/abs/2501.01431
Authors: Baptiste Chatelier (IETR, INSA Rennes, MERCE-France), Vincent Corlay (MERCE-France), Matthieu Crussière (INSA Rennes, IETR), Luc Le Magoarou (INSA Rennes, IETR)
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Reaping the benefits of multi-antenna communication systems in frequency division duplex (FDD) requires channel state information (CSI) reporting from mobile users to the base station (BS). Over the last decades, the amount of CSI to be collected has become very challenging owing to the dramatic increase of the number of antennas at BSs. To mitigate the overhead associated with CSI reporting, compressed CSI techniques have been proposed with the idea of recovering the original CSI at the BS from its compressed version sent by the mobile users. Channel charting is an unsupervised dimensionality reduction method that consists in building a radio-environment map from CSIs. Such a method can be considered in the context of the CSI compression problem, since a chart location is, by definition, a low-dimensional representation of the CSI. In this paper, the performance of channel charting for a task-based CSI compression application is studied. A comparison of the proposed method against baselines on realistic synthetic data is proposed, showing promising results.

[LG-31] Signal Recovery Using a Spiked Mixture Model

Link: https://arxiv.org/abs/2501.01840
Authors: Paul-Louis Delacour, Sander Wahls, Jeffrey M. Spraggins, Lukasz Migas, Raf Van de Plas
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We introduce the spiked mixture model (SMM) to address the problem of estimating a set of signals from many randomly scaled and noisy observations. Subsequently, we design a novel expectation-maximization (EM) algorithm to recover all parameters of the SMM. Numerical experiments show that in low signal-to-noise ratio regimes, and for data types where the SMM is relevant, SMM surpasses the more traditional Gaussian mixture model (GMM) in terms of signal recovery performance. The broad relevance of the SMM and its corresponding EM recovery algorithm is demonstrated by applying the technique to different data types. The first case study is a biomedical research application, utilizing an imaging mass spectrometry dataset to explore the molecular content of a rat brain tissue section at micrometer scale. The second case study demonstrates SMM performance in a computer vision application, segmenting a hyperspectral imaging dataset into underlying patterns. While the measurement modalities differ substantially, in both case studies SMM is shown to recover signals that were missed by traditional methods such as k-means clustering and GMM.

[LG-32] Unified Native Spaces in Kernel Methods

Link: https://arxiv.org/abs/2501.01825
Authors: Xavier Emery, Emilio Porcu, Moreno Bevilacqua
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:There exists a plethora of parametric models for positive definite kernels, and their use is ubiquitous in disciplines as diverse as statistics, machine learning, numerical analysis, and approximation theory. Usually, the kernel parameters index certain features of an associated process. Amongst those features, smoothness (in the sense of Sobolev spaces, mean square differentiability, and fractal dimensions), compact or global supports, and negative dependencies (hole effects) are of interest to several theoretical and applied disciplines. This paper unifies a wealth of well-known kernels into a single parametric class that encompasses them as special cases, attained either by exact parameterization or through parametric asymptotics. We furthermore characterize the Sobolev space that is norm equivalent to the RKHS associated with the new kernel. As a by-product, we infer the Sobolev spaces that are associated with existing classes of kernels. We illustrate the main properties of the new class, show how this class can switch from compact to global supports, and provide special cases for which the kernel attains negative values over nontrivial intervals. Hence, the proposed class of kernel is the reproducing kernel of a very rich Hilbert space that contains many special cases, including the celebrated Matérn and Wendland kernels, as well as their aliases with hole effects.

[LG-33] QuantumBind-RBFE: Accurate Relative Binding Free Energy Calculations Using Neural Network Potentials

Link: https://arxiv.org/abs/2501.01811
Authors: Francesc Sabanés Zariquiey, Stephen E. Farr, Stefan Doerr, Gianni De Fabritiis
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract: Accurate prediction of protein-ligand binding affinities is crucial in drug discovery, particularly during hit-to-lead and lead optimization phases; however, limitations in ligand force fields continue to impact prediction accuracy. In this work, we validate relative binding free energy (RBFE) accuracy using neural network potentials (NNPs) for the ligands. We utilize a novel NNP model, AceForce 1.0, based on the TensorNet architecture for small molecules that broadens the applicability to diverse drug-like compounds, including all important chemical elements and supporting charged molecules. Using established benchmarks, we show overall improved accuracy and correlation in binding affinity predictions compared with GAFF2 for molecular mechanics and ANI2-x for NNPs. We obtain slightly lower accuracy but comparable correlations relative to OPLS4. We also show that we can run the NNP simulations at 2 fs timestep, at least two times larger than previous NNP models, providing significant speed gains. The results show promise for further evolutions of free energy calculations using NNPs while demonstrating its practical use already with the current generation. The code and NNP model are publicly available for research use.

[LG-34] Beyond Non-Degeneracy: Revisiting Certainty Equivalent Heuristic for Online Linear Programming

Link: https://arxiv.org/abs/2501.01716
Authors: Yilun Chen, Wenjia Wang
Subjects: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Click to view abstract

Abstract: The Certainty Equivalent heuristic (CE) is a widely-used algorithm for various dynamic resource allocation problems in OR and OM. Despite its popularity, existing theoretical guarantees of CE are limited to settings satisfying restrictive fluid regularity conditions, particularly, the non-degeneracy conditions, under the widely held belief that the violation of such conditions leads to performance deterioration and necessitates algorithmic innovation beyond CE. In this work, we conduct a refined performance analysis of CE within the general framework of online linear programming. We show that CE achieves uniformly near-optimal regret (up to a polylogarithmic factor in T) under only mild assumptions on the underlying distribution, without relying on any fluid regularity conditions. Our result implies that, contrary to prior belief, CE effectively beats the curse of degeneracy for a wide range of problem instances with continuous conditional reward distributions, highlighting the distinction of the problem’s structure between discrete and non-discrete settings. Our explicit regret bound interpolates between the mild (\log T)^2 regime and the worst-case \sqrt{T} regime with a parameter \beta quantifying the minimal rate of probability accumulation of the conditional reward distributions, generalizing prior findings in the multisecretary setting. To achieve these results, we develop novel algorithmic analytical techniques. Drawing tools from the empirical processes theory, we establish strong concentration analysis of the solutions to random linear programs, leading to improved regret analysis under significantly relaxed assumptions. These techniques may find potential applications in broader online decision-making contexts.

[LG-35] Guaranteed Nonconvex Low-Rank Tensor Estimation via Scaled Gradient Descent

Link: https://arxiv.org/abs/2501.01696
Authors: Tong Wu
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Tensors, which give a faithful and effective representation to deliver the intrinsic structure of multi-dimensional data, play a crucial role in an increasing number of signal processing and machine learning problems. However, tensor data are often accompanied by arbitrary signal corruptions, including missing entries and sparse noise. A fundamental challenge is to reliably extract the meaningful information from corrupted tensor data in a statistically and computationally efficient manner. This paper develops a scaled gradient descent (ScaledGD) algorithm to directly estimate the tensor factors with tailored spectral initializations under the tensor-tensor product (t-product) and tensor singular value decomposition (t-SVD) framework. In theory, ScaledGD achieves linear convergence at a constant rate that is independent of the condition number of the ground truth low-rank tensor for two canonical problems – tensor robust principal component analysis and tensor completion – as long as the level of corruptions is not too large and the sample size is sufficiently large, while maintaining the low per-iteration cost of gradient descent. To the best of our knowledge, ScaledGD is the first algorithm that provably has such properties for low-rank tensor estimation with the t-SVD decomposition. Finally, numerical examples are provided to demonstrate the efficacy of ScaledGD in accelerating the convergence rate of ill-conditioned low-rank tensor estimation in these two applications.

[LG-36] Unsupervised learning for anticipating critical transitions

Link: https://arxiv.org/abs/2501.01579
Authors: Shirin Panahi, Ling-Wei Kong, Bryan Glaz, Mulugeta Haile, Ying-Cheng Lai
Subjects: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Click to view abstract

Abstract:For anticipating critical transitions in complex dynamical systems, the recent approach of parameter-driven reservoir computing requires explicit knowledge of the bifurcation parameter. We articulate a framework combining a variational autoencoder (VAE) and reservoir computing to address this challenge. In particular, the driving factor is detected from time series using the VAE in an unsupervised-learning fashion and the extracted information is then used as the parameter input to the reservoir computer for anticipating the critical transition. We demonstrate the power of the unsupervised learning scheme using prototypical dynamical systems including the spatiotemporal Kuramoto-Sivashinsky system. The scheme can also be extended to scenarios where the target system is driven by several independent parameters or with partial state observations.

[LG-37] Sequencing Silicates in the IRS Debris Disk Catalog I: Methodology for Unsupervised Clustering

Link: https://arxiv.org/abs/2501.01484
Authors: Cicero X. Lu, Tushar Mittal, Christine H. Chen, Alexis Y. Li, Kadin Worthen, B. A. Sargent, Carey M. Lisse, G. C. Sloan, Dean C. Hines, Dan M. Watson, Isabel Rebollido, Bin B. Ren, Joel D. Green
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 23 pages, 16 figures, Accepted to ApJS, CLUES software available on GitHub

Click to view abstract

Abstract: Debris disks, which consist of dust, planetesimals, planets, and gas, offer a unique window into the mineralogical composition of their parent bodies, especially during the critical phase of terrestrial planet formation spanning 10 to a few hundred million years. Observations from the Spitzer Space Telescope have unveiled thousands of debris disks, yet systematic studies remain scarce, let alone those with unsupervised clustering techniques. This study introduces CLUES (CLustering UnsupErvised with Sequencer), a novel, non-parametric, fully-interpretable machine-learning spectral analysis tool designed to analyze and classify the spectral data of debris disks. CLUES combines multiple unsupervised clustering methods with multi-scale distance measures to discern new groupings and trends, offering insights into compositional diversity and geophysical processes within these disks. Our analysis allows us to explore a vast parameter space in debris disk mineralogy and also offers broader applications in fields such as protoplanetary disks and solar system objects. This paper details the methodology, implementation, and initial results of CLUES, setting the stage for more detailed follow-up studies focusing on debris disk mineralogy and demographics.

[LG-38] Analyzing Country-Level Vaccination Rates and Determinants of Practical Capacity to Administer COVID-19 Vaccines

Link: https://arxiv.org/abs/2501.01447
Authors: Sharika J. Hegde, Max T.M. Ng, Marcos Rios, Hani S. Mahmassani, Ying Chen, Karen Smilowitz
Subjects: General Economics (econ.GN); Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP)
Comments: 31 pages, 7 figures. A previous version was presented at the 102nd Transportation Research Board Annual Meeting in Washington, D.C. in 2023

Click to view abstract

Abstract:The COVID-19 vaccine development, manufacturing, transportation, and administration proved an extreme logistics operation of global magnitude. Global vaccination levels, however, remain a key concern in preventing the emergence of new strains and minimizing the impact of the pandemic’s disruption of daily life. In this paper, country-level vaccination rates are analyzed through a queuing framework to extract service rates that represent the practical capacity of a country to administer vaccines. These rates are further characterized through regression and interpretable machine learning methods with country-level demographic, governmental, and socio-economic variates. Model results show that participation in multi-governmental collaborations such as COVAX may improve the ability to vaccinate. Similarly, improved transportation and accessibility variates such as roads per area for low-income countries and rail lines per area for high-income countries can improve rates. It was also found that for low-income countries specifically, improvements in basic and health infrastructure (as measured through spending on healthcare, number of doctors and hospital beds per 100k, population percent with access to electricity, life expectancy, and vehicles per 1000 people) resulted in higher vaccination rates. Of the high-income countries, those with larger 65-plus populations struggled to vaccinate at high rates, indicating potential accessibility issues for the elderly. This study finds that improving basic and health infrastructure, focusing on accessibility in the last mile, particularly for the elderly, and fostering global partnerships can improve logistical operations of such a scale. Such structural impediments and inequities in global health care must be addressed in preparation for future global public health crises.

Information Retrieval

[IR-0] Item Association Factorization Mixed Markov Chains for Sequential Recommendation

Link: https://arxiv.org/abs/2501.01429
Authors: DongYu Du, Yue Chan
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract: Sequential recommendation refers to recommending the next item of interest for a specific user based on his/her historical behavior sequence up to a certain time. While previous research has extensively examined Markov chain-based sequential recommendation models, the majority of these studies have focused on the user’s historical behavior sequence and paid little attention to the overall correlation between items. This study introduces a sequential recommendation algorithm known as Item Association Factorization Mixed Markov Chains, which incorporates association information between items using an item association graph, integrating it with user behavior sequence information. Our experimental findings from four public datasets demonstrate that the newly introduced algorithm significantly enhances the recommendation ranking results without substantially increasing the parameter count. Additionally, research on tuning the prior balancing parameters underscores the significance of incorporating item association information across different datasets.

Attachment Download

Click to download today's full paper list
