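This post lists the latest papers retrieved from Arxiv.org on 2025-01-24. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.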

Note: The daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.



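Overview (2025-01-24)

407 papers were updated today, including:

  • Natural Language Processing: 59 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 101 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 90 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 136 papers (Machine Learning (cs.LG))

Natural Language Processing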

[NLP-0] CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

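[Quick Read]: This paper addresses the challenges large language models (LLMs) face in machine translation (MT), in particular the performance bottlenecks caused by English-centric pretraining data and the complexity of reinforcement learning from human feedback (RLHF). It proposes Confidence-Reward driven Preference Optimization (CRPO), which combines reward scores with model confidence to improve data selection, so that fine-tuning focuses on more challenging sentence pairs, namely those where the model is uncertain or underperforms. The method applies to LLMs and also generalizes to encoder-decoder models such as NLLB. Experiments show that CRPO outperforms existing methods such as RS-DPO, RSO, and MBR score in both translation accuracy and data efficiency.

Link: https://arxiv.org/abs/2501.13927
Authors: Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat
Affiliations: Rutgers University; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: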

Abstract:Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.
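
As a rough illustration of the data-selection idea in this abstract, the sketch below ranks candidate translation pairs by combining a reward gap with the policy model's own confidence gap. The `Candidate` type, the `pair_score` formula, and `select_pair` are assumptions made for illustration; the abstract does not give the actual CRPO scoring function, and this is not a reproduction of it.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Candidate:
    text: str
    reward: float   # score from an external reward model (assumed to be given)
    logprob: float  # sequence log-probability under the policy (assumed to be given)


def pair_score(chosen: Candidate, rejected: Candidate) -> float:
    """Hypothetical score: high when the reward model clearly prefers `chosen`
    while the policy is not yet confident in it (or even prefers `rejected`)."""
    reward_gap = chosen.reward - rejected.reward
    confidence_gap = chosen.logprob - rejected.logprob
    return reward_gap - confidence_gap


def select_pair(candidates: list[Candidate]) -> tuple[Candidate, Candidate]:
    """Return the (chosen, rejected) pair with the highest hypothetical score,
    considering only pairs in which the reward model prefers `chosen`."""
    valid = [(c, r) for c, r in permutations(candidates, 2) if c.reward > r.reward]
    return max(valid, key=lambda pair: pair_score(*pair))


candidates = [
    Candidate("A", reward=0.9, logprob=-12.0),  # best reward, low model confidence
    Candidate("B", reward=0.6, logprob=-5.0),   # worse reward, high model confidence
    Candidate("C", reward=0.5, logprob=-11.0),
]
chosen, rejected = select_pair(candidates)
print(chosen.text, rejected.text)
```

In this toy data the pair (A, B) is selected because the reward model prefers A while the policy assigns B a much higher log-probability, which is the kind of "model underperforms" case the abstract describes. Rewards and log-probabilities live on different scales, so a real implementation would need some weighting between the two terms.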

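[NLP-1] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

[Quick Read]: This paper investigates whether Chain-of-Thought (CoT) reasoning can be applied to verification and reinforcement in autoregressive image generation. It studies three techniques: (1) scaling test-time computation for verification; (2) aligning model preferences with Direct Preference Optimization (DPO); and (3) integrating these techniques for complementary effects. The results show that these approaches can be combined effectively and significantly improve image generation performance. The paper also proposes the Potential Assessment Reward Model (PARM) and its improved version PARM++, both specialized for autoregressive image generation: PARM adaptively assesses each generation step through a potential assessment approach, and PARM++ adds a reflection mechanism to self-correct unsatisfactory generated images. With these reasoning strategies, the baseline model Show-o improves by 24% on the GenEval benchmark and surpasses Stable Diffusion 3 by 15%.

Link: https://arxiv.org/abs/2501.13926
Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Journal Version. Code and models are released at this https URL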

Abstract:Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at this https URL
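
One simple form of "scaling test-time computation for verification" is best-of-N sampling: generate several candidates and keep the one a reward model scores highest. The sketch below shows only that generic pattern. The `generate` and `reward` callables are hypothetical stand-ins, and the step-wise assessment and reflection that the abstract attributes to PARM and PARM++ are not modelled here.

```python
from typing import Callable


def best_of_n(generate: Callable[[str, int], str],
              reward: Callable[[str, str], float],
              prompt: str,
              n: int = 4) -> str:
    """Sample n candidates for `prompt` and return the highest-scoring one."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda candidate: reward(prompt, candidate))


# Toy stand-ins: the "generator" tags the prompt with its seed, and the
# "reward model" prefers candidates whose seed is closest to 2.
def toy_generate(prompt: str, seed: int) -> str:
    return f"{prompt}-{seed}"


def toy_reward(prompt: str, candidate: str) -> float:
    return -abs(int(candidate.rsplit("-", 1)[1]) - 2)


print(best_of_n(toy_generate, toy_reward, "img", n=4))
```

Spending more samples (larger `n`) trades inference cost for a better chance that at least one candidate satisfies the verifier.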

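[NLP-2] The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities

[Quick Read]: This paper aims to address the weak representation of Traditional Chinese in multimodal tasks. The key to the solution is Breeze 2, a model family built on Llama 3 that continues pretraining on a large corpus to strengthen its linguistic and cultural representation of Traditional Chinese. Breeze 2 adds a visual encoder and a bridge module for vision-aware capabilities, and supports function calling through prompt templates and post-training on function-calling data. The models are benchmarked on Taiwan general knowledge, instruction following, long-context understanding, function calling, and vision understanding. The paper also showcases the 3B model in a mobile application.

Link: https://arxiv.org/abs/2501.13921
Authors: Chan-Jan Hsu, Chia-Sheng Liu, Meng-Hsi Chen, Muxi Chen, Po-Chun Hsu, Yi-Chang Chen, Da-Shan Shiu
Affiliations: MediaTek Research
Categories: Computation and Language (cs.CL)
Comments: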

Abstract:Breeze 2 is a suite of advanced multi-modal language models, available in 3B and 8B parameter configurations, specifically designed to enhance Traditional Chinese language representation. Building upon Llama 3, Breeze 2 continues pretraining on an extensive corpus to enhance the linguistic and cultural heritage of Traditional Chinese. It incorporates vision-aware capabilities through a visual encoder and a bridge module, and supports function-calling via prompt templates and post-training on function-calling data. The effectiveness of Breeze 2 is benchmarked across various tasks, including Taiwan general knowledge, instruction-following, long context, function calling, and vision understanding. Furthermore, we showcase the capabilities of its 3B model in a mobile application. We are publicly releasing all Breeze 2 models under the Llama 3 Community License.

[NLP-3] IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

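[Quick Read]: This paper addresses the problem of evaluating current text-to-image (T2I) models across multiple domains. As T2I capabilities expand to controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks such as semantic segmentation and depth estimation, existing evaluation frameworks can no longer assess these models comprehensively across such diverse tasks. The authors therefore develop the IMAGINE-E evaluation framework and test six prominent T2I models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. The evaluation covers five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. The assessment reveals each model's strengths and limitations, in particular the outstanding performance of FLUX.1 and Ideogram2.0 on structured and specific domain tasks, and underscores the potential of T2I models as foundational AI tools.

Link: https://arxiv.org/abs/2501.13920
Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 75 pages, 73 figures, Evaluation scripts: this https URL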

Abstract:With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models’ performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model’s strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at this https URL.

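[NLP-4] Temporal Preference Optimization for Long-Form Video Understanding

[Quick Read]: This paper aims to address the challenge of achieving effective temporal grounding in long-form videos with video large multimodal models (video-LMMs). Despite significant progress in video understanding, existing models remain limited in temporal grounding on long videos. The paper proposes Temporal Preference Optimization (TPO), a post-training framework that strengthens the temporal grounding of video-LMMs through preference learning. TPO takes a self-training approach and uses preference datasets at two granularities, localized temporal grounding and comprehensive temporal grounding, so that the model learns to distinguish well-grounded from less accurate temporal responses. Optimizing on these datasets significantly improves temporal understanding while reducing reliance on manually annotated data. Experiments on three long-form video understanding benchmarks (LongVideoBench, MLVU, and Video-MME) show that TPO is effective; notably, LLaVA-Video-TPO becomes the leading 7B model on Video-MME, which points to TPO's potential and scalability for temporal reasoning in long videos.

Link: https://arxiv.org/abs/2501.13919
Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
Affiliations: Stanford University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: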

Abstract:Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks–LongVideoBench, MLVU, and Video-MME–demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: this https URL.
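
The abstract says only that TPO relies on preference learning. As an illustration of what optimizing on (preferred, dispreferred) response pairs can look like, the sketch below computes the standard Direct Preference Optimization (DPO) loss for a single pair. Treating the objective as DPO-style is an assumption here, and the function name and arguments are illustrative.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                         - (log pi(y_l) - log pi_ref(y_l))])."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference model, the margin is zero and the
# loss is log(2); raising the chosen response's likelihood lowers the loss.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

In a temporal-grounding setting, the "chosen" response would be the well-grounded answer and the "rejected" one the less accurate answer for the same video and question.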

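[NLP-5] Analysis of Indic Language Capabilities in LLMs

[Quick Read]: This paper evaluates how well text-in, text-out large language models (LLMs) understand and generate Indic languages, in order to identify which Indic languages are suited for inclusion in safety benchmarks. The study reviews existing evaluation studies and datasets along with 28 LLMs that support Indic languages, and analyzes the models by training data, model and data license, type of access, and model developer. It also compares LLM performance across Indic languages and finds significant disparities, with Hindi the most widely represented language in models. Model performance roughly correlates with number of speakers for the top five languages, but the assessment varies after that.

Link: https://arxiv.org/abs/2501.13912
Authors: Aatman Vaidya, Tarunima Prabhakar, Denny George, Swair Shah
Affiliations: Tattle Civic Tech (India)
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 2 figures, 5 tables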

Abstract:This report evaluates the performance of text-in text-out Large Language Models (LLMs) to understand and generate Indic languages. This evaluation is used to identify and prioritize Indic languages suited for inclusion in safety benchmarks. We conduct this study by reviewing existing evaluation studies and datasets, and a set of twenty-eight LLMs that support Indic languages. We analyze the LLMs on the basis of the training data, license for model and data, type of access and model developers. We also compare Indic language performance across evaluation datasets and find significant performance disparities across Indic languages. Hindi is the most widely represented language in models. While model performance roughly correlates with number of speakers for the top five languages, the assessment after that varies.

[NLP-6] GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration

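[Quick Read]: This paper addresses the performance drop of Graphical User Interface (GUI) action grounding models in novel environments. Existing GUI action grounding models are typically trained by fine-tuning on GUI datasets that cover only a limited set of environments, so their performance deteriorates markedly in environments not seen during fine-tuning. To address this, the paper proposes GUI-Bee, an autonomous agent based on Multimodal Large Language Models (MLLMs) that collects high-quality, environment-specific data through exploration and uses it to continuously fine-tune GUI action grounding models. The key component is a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method that optimizes exploration efficiency and data quality. The paper also introduces NovelScreenSpot, a benchmark for testing how well the data helps models adapt to novel environments, and validates the effectiveness of the data collected by GUI-Bee experimentally.

Link: https://arxiv.org/abs/2501.13896
Authors: Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: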

Abstract:Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works of GUI action grounding leverage large GUI datasets to fine-tune MLLMs. However, the fine-tuning data always covers limited GUI environments, and we find the performance of the resulting model deteriorates in novel environments. We argue that the GUI grounding models should be further aligned to the novel environments to reveal their full potential, when the inference is known to involve novel environments, i.e., environments not used during the previous fine-tuning. To realize this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration and then continuously fine-tune GUI grounding models with the collected data. Our agent leverages a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments and demonstrate the effectiveness of data collected by GUI-Bee in the experiments. Furthermore, we conduct an ablation study to validate the Q-ICRL method in enhancing the efficiency of GUI-Bee. Project page: this https URL

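[NLP-7] A RAG-Based Institutional Assistant

[Quick Read]: This paper addresses the poor performance of large language models (LLMs) in scenarios that require access to structured knowledge bases or specific documents, a limitation in knowledge-intensive tasks. It designs and evaluates a retrieval-augmented generation (RAG) virtual assistant tailored to the University of São Paulo, in which relevant document fragments are incorporated into the generative model's input. The system architecture has two core modules: a retriever and a generative model. The authors experiment with different model types and tune hyperparameters such as chunk size and the number of retrieved documents. When the correct document chunks are supplied to the LLMs, accuracy rises to 54.02%; without contextual input, it falls to 13.68%. These results show that database access is critical to LLM performance and expose the limitations of current semantic search methods in accurately identifying relevant documents.

Link: https://arxiv.org/abs/2501.13880
Authors: Gustavo Kuratomi, Paulo Pirozelli, Fabio G. Cozman, Sarajane M. Peres
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: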

Abstract:Although large language models (LLMs) demonstrate strong text generation capabilities, they struggle in scenarios requiring access to structured knowledge bases or specific documents, limiting their effectiveness in knowledge-intensive tasks. To address this limitation, retrieval-augmented generation (RAG) models have been developed, enabling generative models to incorporate relevant document fragments into their inputs. In this paper, we design and evaluate a RAG-based virtual assistant specifically tailored for the University of São Paulo. Our system architecture comprises two key modules: a retriever and a generative model. We experiment with different types of models for both components, adjusting hyperparameters such as chunk size and the number of retrieved documents. Our optimal retriever model achieves a Top-5 accuracy of 30%, while our most effective generative model scores 22.04% against ground truth answers. Notably, when the correct document chunks are supplied to the LLMs, accuracy significantly improves to 54.02%, an increase of over 30 percentage points. Conversely, without contextual input, performance declines to 13.68%. These findings highlight the critical role of database access in enhancing LLM performance. They also reveal the limitations of current semantic search methods in accurately identifying relevant documents and underscore the ongoing challenges LLMs face in generating precise responses.

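[NLP-8] Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages

[Quick Read]: This paper examines the challenges of building social media content moderation tools for low-resource languages. It focuses on four low-resource languages spoken in the Global South: Tamil, Swahili, Maghrebi Arabic, and Quechua. The study finds that social media companies' restrictions on researchers' access to data exacerbate the historical marginalization of these languages, leaving them without datasets for studying online harms. In addition, existing preprocessing techniques and language models are designed mainly for data-rich English and cannot handle the complex morphology of low-resource languages, which leads to critical errors when moderating content in Tamil, Swahili, Arabic, and Quechua. The paper argues that the problems in current moderation pipelines are rooted in systemic inequities and continue to reinforce historical power imbalances. It concludes that improving moderation for low-resource languages requires multi-stakeholder approaches.

Link: https://arxiv.org/abs/2501.13836
Authors: Farhana Shahid, Mona Elswah, Aditya Vashistha
Affiliations: Cornell University; Center for Democracy and Technology
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: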

Abstract:Most social media users come from non-English speaking countries in the Global South. Despite the widespread prevalence of harmful content in these regions, current moderation systems repeatedly struggle in low-resource languages spoken there. In this work, we examine the challenges AI researchers and practitioners face when building moderation tools for low-resource languages. We conducted semi-structured interviews with 22 AI researchers and practitioners specializing in automatic detection of harmful content in four diverse low-resource languages from the Global South. These are: Tamil from South Asia, Swahili from East Africa, Maghrebi Arabic from North Africa, and Quechua from South America. Our findings reveal that social media companies’ restrictions on researchers’ access to data exacerbate the historical marginalization of these languages, which have long lacked datasets for studying online harms. Moreover, common preprocessing techniques and language models, predominantly designed for data-rich English, fail to account for the linguistic complexity of low-resource languages. This leads to critical errors when moderating content in Tamil, Swahili, Arabic, and Quechua, which are morphologically richer than English. Based on our findings, we establish that the precarities in current moderation pipelines are rooted in deep systemic inequities and continue to reinforce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve moderation for low-resource languages.

[NLP-9] On the Reasoning Capacity of AI Models and How to Quantify It

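[Quick Read]: This paper addresses the limitations of current large language models (LLMs) on complex reasoning tasks, and in particular how to evaluate and understand their reasoning capabilities more accurately. Traditional evaluation relies mainly on accuracy metrics, which often overstate a model's true reasoning ability because performance may rest on memorization and pattern matching rather than genuine logical deduction.

The proposed solution has two parts. First, it introduces a phenomenological approach that uses systematic perturbations to reveal the basic mechanisms of model decision-making. Second, it develops two complementary phenomenological models: a Probabilistic Mixture Model (PMM), which decomposes model responses into reasoning, memorization, and guessing components, and an Information-Theoretic Consistency (ITC) analysis, which quantifies the relationship between model confidence and strategy selection. With these methods, the paper shows how models dynamically balance different cognitive strategies when responding to queries, and proposes reliability thresholds based on strategy distributions rather than aggregate performance metrics. The framework provides quantitative criteria for real-world model deployment.

Link: https://arxiv.org/abs/2501.13833
Authors: Santosh Kumar Radha, Oktay Goktas
Affiliations: Second.institution.edu; institution.edu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: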

Abstract:Recent advances in Large Language Models (LLMs) have intensified the debate surrounding the fundamental nature of their reasoning capabilities. While achieving high performance on benchmarks such as GPQA and MMLU, these models exhibit limitations in more complex reasoning tasks, highlighting the need for more rigorous evaluation methodologies. We propose a novel phenomenological approach that goes beyond traditional accuracy metrics to probe the underlying mechanisms of model behavior, establishing a framework that could broadly impact how we analyze and understand AI systems. Using positional bias in multiple-choice reasoning tasks as a case study, we demonstrate how systematic perturbations can reveal fundamental aspects of model decision-making. To analyze these behaviors, we develop two complementary phenomenological models: a Probabilistic Mixture Model (PMM) that decomposes model responses into reasoning, memorization, and guessing components and an Information-Theoretic Consistency (ITC) analysis that quantifies the relationship between model confidence and strategy selection. Through controlled experiments on reasoning benchmarks, we show that true reasoning remains challenging for current models, with apparent success often relying on sophisticated combinations of memorization and pattern matching rather than genuine logical deduction. More fundamentally, we demonstrate that accuracy alone often overstates a model’s reasoning abilities, as model behavior can be characterized through underlying mechanisms in the phase space of cognitive strategies, revealing how models dynamically balance different approaches when responding to queries. This framework enables quantitative criteria for real-world deployments, allowing applications to specify reliability thresholds based on strategy distributions rather than aggregate performance metrics.
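
The abstract uses positional bias in multiple-choice tasks as its case study for systematic perturbation. A minimal probe in that spirit is sketched below: reorder the answer options and check whether the model keeps choosing the same option content. The `model` callable and the consistency measure are assumptions for illustration; this is not the paper's PMM or ITC analysis.

```python
from itertools import permutations
from typing import Callable, Sequence


def position_consistency(model: Callable[[str, Sequence[str]], int],
                         question: str,
                         options: Sequence[str]) -> float:
    """Fraction of option orderings in which the model selects the same option
    content as under the original ordering. Options are assumed to be distinct;
    `model` returns the index of the option it selects."""
    baseline = options[model(question, options)]
    orderings = list(permutations(options))
    same = sum(1 for perm in orderings if perm[model(question, perm)] == baseline)
    return same / len(orderings)


# Two toy "models": one always picks the first slot (pure positional bias),
# the other finds the correct content wherever it appears.
def always_first(question: str, options: Sequence[str]) -> int:
    return 0


def content_based(question: str, options: Sequence[str]) -> int:
    return list(options).index("4")


options = ["3", "4", "5", "22"]
print(position_consistency(always_first, "What is 2 + 2?", options))
print(position_consistency(content_based, "What is 2 + 2?", options))
```

A model that scores well on the original ordering but shows low consistency under reordering is relying on something other than the content of the options, which is the kind of gap between accuracy and reasoning the abstract points to.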

[NLP-10] Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing ICASSP2025

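[Quick Read]: This paper addresses the fact that decoding cost grows with output length when large language models (LLMs) perform text rewriting. Although inputs and outputs overlap heavily in these tasks, conventional rewriting still generates the entire output sequence, which wastes computation. The key idea is to exploit this overlap with phrase-based edit representations that compress the rewrite and reduce computation. Specifically, the authors propose a target-phrase-only edit representation and systematically compare it with the earlier edit span representations. Experiments on Automatic Speech Recognition (ASR) post editing show that the method gives the best efficiency-accuracy trade-off. On the LibriSpeech test set, it closes 50-60% of the WER gap between the edit span model and the full rewrite model while losing only 10-20% of the edit span model's length reduction rate.

Link: https://arxiv.org/abs/2501.13831
Authors: Hao Zhang, Felix Stahlberg, Shankar Kumar
Affiliations: Google Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: accepted by ICASSP 2025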

Abstract:Large Language Models (LLMs) excel at rewriting tasks such as text style transfer and grammatical error correction. While there is considerable overlap between the inputs and outputs in these tasks, the decoding cost still increases with output length, regardless of the amount of overlap. By leveraging the overlap between the input and the output, Kaneko and Okazaki (2023) proposed model-agnostic edit span representations to compress the rewrites to save computation. They reported an output length reduction rate of nearly 80% with minimal accuracy impact in four rewriting tasks. In this paper, we propose alternative edit phrase representations inspired by phrase-based statistical machine translation. We systematically compare our phrasal representations with their span representations. We apply the LLM rewriting model to the task of Automatic Speech Recognition (ASR) post editing and show that our target-phrase-only edit representation has the best efficiency-accuracy trade-off. On the LibriSpeech test set, our method closes 50-60% of the WER gap between the edit span model and the full rewrite model while losing only 10-20% of the length reduction rate of the edit span model.
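
To make the compression idea concrete, the sketch below encodes a rewrite as a list of edits over the source tokens instead of the full output, then reconstructs the output from those edits. It uses Python's `difflib` to find the edits and a `(start, end, replacement)` tuple format chosen for illustration; the exact span format of the prior work and the target-phrase-only representation proposed in the paper are not given in the abstract and are not reproduced here.

```python
from difflib import SequenceMatcher


def to_edit_spans(source: list[str], target: list[str]) -> list[tuple[int, int, list[str]]]:
    """Encode the rewrite as (start, end, replacement) spans over source tokens.
    Unchanged regions are omitted, so heavy overlap gives a short encoding."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    return [(i1, i2, target[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_edit_spans(source: list[str],
                     spans: list[tuple[int, int, list[str]]]) -> list[str]:
    """Reconstruct the full rewrite from the source tokens and the edit spans."""
    out, cursor = [], 0
    for start, end, replacement in spans:
        out.extend(source[cursor:start])  # copy the untouched tokens
        out.extend(replacement)           # emit the edited tokens
        cursor = end
    out.extend(source[cursor:])
    return out


asr_hypothesis = "i red the book yesterday".split()
corrected = "i read the book yesterday".split()
spans = to_edit_spans(asr_hypothesis, corrected)
print(spans)
print(" ".join(apply_edit_spans(asr_hypothesis, spans)))
```

Here the model would only need to produce one short edit rather than all five output tokens, which is where the output length reduction reported in the abstract comes from.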

[NLP-11] Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

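[Quick Read]: This paper addresses the inability of existing video benchmarks to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). Specifically, existing benchmarks do not measure how LMMs perform across three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. The paper introduces Video-MMMU, a multi-modal, multi-disciplinary benchmark for assessing LMMs' ability to acquire and use knowledge from videos. Video-MMMU contains 300 expert-level videos and 900 human-annotated questions across six disciplines, and evaluates knowledge acquisition through stage-aligned question-answer pairs (Perception, Comprehension, and Adaptation). The paper also proposes a knowledge gain metric, Δknowledge, which quantifies the improvement in model performance after watching a video. The evaluation shows that LMM performance declines steeply as cognitive demands increase and that a significant gap remains between human and model knowledge acquisition, underscoring the need for new methods that help LMMs learn and adapt from videos.

Link: https://arxiv.org/abs/2501.13826
Authors: Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
Affiliations: S-Lab, Nanyang Technological University; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: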

Abstract:Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs’ ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs’ capability to learn and adapt from videos.
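
The abstract names a knowledge gain metric but does not state its formula. One plausible way to quantify "improvement in performance after video viewing" is a normalized gain: the accuracy improvement divided by the headroom that was available before watching. The sketch below implements that form as an assumption; it should not be read as the paper's definition.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Normalized gain in percent (assumed form, not taken from the paper).

    Both accuracies are percentages. The improvement is divided by the
    remaining headroom (100 - acc_before), so a model that starts high is
    not penalized for having less room to improve.
    """
    if not 0.0 <= acc_before < 100.0:
        raise ValueError("acc_before must be in [0, 100)")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# A model that goes from 40% to 55% recovers a quarter of its headroom;
# a model that gets worse after watching has a negative gain.
print(knowledge_gain(40.0, 55.0))
print(knowledge_gain(50.0, 40.0))
```

A negative value is possible under this form and would indicate that watching the video hurt performance on the question set.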

[NLP-12] Hallucinations Can Improve Large Language Models in Drug Discovery

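[Quick Read]: This paper explores the potential usefulness of hallucinations in large language models (LLMs) for drug discovery. Although researchers have raised concerns about hallucinations, the paper hypothesizes that they may help in areas where creativity matters, such as drug discovery. To test the hypothesis, the authors have LLMs describe molecules' SMILES (Simplified Molecular Input Line Entry System) strings in natural language and include these descriptions in the prompt for specific drug discovery tasks. An evaluation on seven LLMs and five classification tasks shows that text containing hallucinations does improve LLM performance; notably, Llama-3.1-8B gains 18.35% in ROC-AUC (area under the receiver operating characteristic curve) over the baseline without hallucination. Hallucinations generated by GPT-4o give the most consistent improvements across models. The paper also uses empirical analyses and a case study to examine the key factors affecting performance and the underlying reasons.

Link: https://arxiv.org/abs/2501.13824
Authors: Shuzhou Yuan, Michael Färber
Affiliations: Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Germany; Dresden University of Technology, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: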

Abstract:Concerns about hallucinations in Large Language Models (LLMs) have been raised by researchers, yet their potential in areas where creativity is vital, such as drug discovery, merits exploration. In this paper, we come up with the hypothesis that hallucinations can improve LLMs in drug discovery. To verify this hypothesis, we use LLMs to describe the SMILES string of molecules in natural language and then incorporate these descriptions as part of the prompt to address specific tasks in drug discovery. Evaluated on seven LLMs and five classification tasks, our findings confirm the hypothesis: LLMs can achieve better performance with text containing hallucinations. Notably, Llama-3.1-8B achieves an 18.35% gain in ROC-AUC compared to the baseline without hallucination. Furthermore, hallucinations generated by GPT-4o provide the most consistent improvements across models. Additionally, we conduct empirical analyses and a case study to investigate key factors affecting performance and the underlying reasons. Our research sheds light on the potential use of hallucinations for LLMs and offers new perspectives for future research leveraging LLMs in drug discovery.

[NLP-13] Generation of reusable learning objects from digital medical collections: An analysis based on the MASMDOA framework

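[Quick Read]: This paper analyzes how the Clavy tool generates reusable learning objects (RLOs) and discusses their use in medical education. Clavy retrieves data from multiple medical knowledge sources and reconfigures it into diverse multimedia-based structures and organizations, from which it generates learning objects that can be adapted to different healthcare instructional scenarios and learning requirements. The key point is that Clavy can export these learning objects through educational standard specifications, which improves their reusability and lets medical students and healthcare practitioners access them through popular e-learning platforms. The tool's core value lies in turning existing digital medical collections into easily accessible learning objects.

Link: https://arxiv.org/abs/2501.13806
Authors: Félix Buendía, Joaquín Gayoso-Cabada, José-Luis Sierra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: first submitted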

Abstract:Learning Objects represent a widespread approach to structuring instructional materials in a large variety of educational contexts. The main aim of this work consists of analyzing from a qualitative point of view the process of generating reusable learning objects (RLOs) followed by Clavy, a tool that can be used to retrieve data from multiple medical knowledge sources and reconfigure such sources in diverse multimedia-based structures and organizations. From these organizations, Clavy is able to generate learning objects which can be adapted to various instructional healthcare scenarios with several types of user profiles and distinct learning requirements. Moreover, Clavy provides the capability of exporting these learning objects through educational standard specifications, which improves their reusability features. The analysis insights highlight the importance of having a tool able to transfer knowledge from the available digital medical collections to learning objects that can be easily accessed by medical students and healthcare practitioners through the most popular e-learning platforms.

[NLP-14] Parameter-Efficient Fine-Tuning for Foundation Models

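[Quick Read]: This survey addresses gaps in understanding the techniques, trends, and applications of Parameter-Efficient Fine-Tuning (PEFT) in the context of Foundation Models (FMs). FMs such as ChatGPT, DALL-E, and LLaVA perform well on language understanding, generative tasks, and multimodal tasks, but their large parameter counts and computational complexity make direct fine-tuning expensive. PEFT is a cost-effective fine-tuning technique that minimizes parameters and computational complexity while striving for optimal downstream task performance. The survey systematically reviews the key categories and core mechanisms of PEFT, explores its applications across different FMs, and identifies potential directions for improving PEFT. It is intended as a resource for both newcomers and experts who want to understand and use PEFT.

Link: https://arxiv.org/abs/2501.13787
Authors: Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang
Affiliations: The Knowledge Engineering Group (KEG), Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 6 figures, 7 tables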

Abstract:This survey delves into the realm of Parameter-Efficient Fine-Tuning (PEFT) within the context of Foundation Models (FMs). PEFT, a cost-effective fine-tuning technique, minimizes parameters and computational complexity while striving for optimal downstream task performance. FMs, like ChatGPT, DALL-E, and LLaVA specialize in language understanding, generative tasks, and multimodal tasks, trained on diverse datasets spanning text, images, and videos. The diversity of FMs guides various adaptation strategies for PEFT. Therefore, this survey aims to provide a comprehensive overview of PEFT techniques applied to diverse FMs and address critical gaps in understanding the techniques, trends, and applications. We start by providing a detailed development of FMs and PEFT. Subsequently, we systematically review the key categories and core mechanisms of PEFT across diverse FMs to offer a comprehensive understanding of trends. We also explore the most recent applications across various FMs to demonstrate the versatility of PEFT, shedding light on the integration of systematic PEFT methods with a range of FMs. Furthermore, we identify potential research and development directions for improving PEFTs in the future. This survey provides a valuable resource for both newcomers and experts seeking to understand and use the power of PEFT across FMs. All reviewed papers are listed at this https URL.
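
To give a sense of what "minimizes parameters" means in practice, the sketch below compares trainable parameter counts for full fine-tuning of one weight matrix against LoRA, a widely used PEFT method that this abstract does not name explicitly. LoRA freezes the original matrix W (d_out × d_in) and trains a low-rank update B·A with B of shape d_out × r and A of shape r × d_in. The layer sizes used are illustrative, not taken from the survey.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes: one 4096 x 4096 projection with a rank-8 adapter.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.4%}")
```

For these sizes the adapter trains well under one percent of the parameters that full fine-tuning would update for the same layer, which is the cost reduction the survey's framing refers to.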

[NLP-15] Explainable XR: Understanding User Behaviors of XR Environments using LLM -assisted Analytics Framework

【速读】: 该论文旨在解决现有扩展现实(Extended Reality, XR)用户分析框架在处理跨虚拟性(AR、VR、MR)转换、多用户协作应用场景以及多模态数据复杂性方面的挑战。为了解决这些问题,论文提出了Explainable XR框架,其关键解决方案包括三个主要组件:(1) 一种新颖的用户数据记录模式,称为用户行为描述符(User Action Descriptor, UAD),能够捕捉用户的多模态行为及其意图和上下文;(2) 一个平台无关的XR会话记录器;(3) 一个视觉分析界面,提供基于大语言模型(Large Language Models, LLMs)的洞察,帮助分析师从不同角度探索和分析记录的XR会话数据。通过这些组件,Explainable XR提供了一个虚拟性无关的解决方案,用于沉浸式会话的收集、分析和可视化,从而更好地理解用户行为并提供多方面的可操作洞察。

Link: https://arxiv.org/abs/2501.13778
Authors: Yoonsang Kim, Zainab Aamir, Mithilesh Singh, Saeed Boorboor, Klaus Mueller, Arie E. Kaufman
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 11 pages, 8 figures. This is the author's version of the article that has been accepted for publication in IEEE Transactions on Visualization and Computer Graphics


Abstract:We present Explainable XR, an end-to-end framework for analyzing user behavior in diverse eXtended Reality (XR) environments by leveraging Large Language Models (LLMs) for data interpretation assistance. Existing XR user analytics frameworks face challenges in handling cross-virtuality - AR, VR, MR - transitions, multi-user collaborative application scenarios, and the complexity of multimodal data. Explainable XR addresses these challenges by providing a virtuality-agnostic solution for the collection, analysis, and visualization of immersive sessions. We propose three main components in our framework: (1) A novel user data recording schema, called User Action Descriptor (UAD), that can capture the users’ multimodal actions, along with their intents and the contexts; (2) a platform-agnostic XR session recorder, and (3) a visual analytics interface that offers LLM-assisted insights tailored to the analysts’ perspectives, facilitating the exploration and analysis of the recorded XR session data. We demonstrate the versatility of Explainable XR by demonstrating five use-case scenarios, in both individual and collaborative XR applications across virtualities. Our technical evaluation and user studies show that Explainable XR provides a highly usable analytics solution for understanding user actions and delivering multifaceted, actionable insights into user behaviors in immersive environments.

[NLP-16] Do Large Language Models Truly Understand Geometric Structures?

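[Quick read]: This paper addresses the weakness of large language models (LLMs) in geometry, in particular in spatial comprehension and abstract thinking. Existing datasets mainly evaluate final answers, which cannot measure whether an LLM truly understands geometric structures, because a correct answer can be reached by coincidence. The paper introduces the GeomRel dataset, which evaluates this understanding by isolating the core step of geometric relationship identification. Using this benchmark, the authors evaluate a range of LLMs and identify key limitations in how they understand geometric structures. They further propose the Geometry Chain-of-Thought (GeoCoT) method, which strengthens LLMs' ability to identify geometric relationships and significantly improves performance.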

Link: https://arxiv.org/abs/2501.13773
Authors: Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Geometric ability is a significant challenge for large language models (LLMs) due to the need for advanced spatial comprehension and abstract thinking. Existing datasets primarily evaluate LLMs on their final answers, but they cannot truly measure their true understanding of geometric structures, as LLMs can arrive at correct answers by coincidence. To fill this gap, we introduce the GeomRel dataset, designed to evaluate LLMs’ understanding of geometric structures by isolating the core step of geometric relationship identification in problem-solving. Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs’ ability to identify geometric relationships, resulting in significant performance improvements.

[NLP-17] UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models ICLR2025

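[Quick read]: This paper addresses shortcomings of existing benchmarks for the mathematical reasoning of large language models (LLMs), in particular their limited coverage of undergraduate-level problems and possible test-set contamination. The authors propose UGMathBench, a diverse and dynamic benchmark designed for undergraduate-level mathematical reasoning. It contains 5,062 problems across 16 subjects and 111 topics, with 10 distinct answer types. Each problem comes in three randomized versions, and more versions are planned as leading open-source LLMs saturate the benchmark. The authors also propose two metrics: effective accuracy (EAcc), the percentage of problems solved correctly in all three versions, and the reasoning gap (Δ), the difference between the average accuracy across versions and EAcc, which measures reasoning robustness. An evaluation of 23 leading LLMs finds that OpenAI-o1-mini achieves the highest EAcc (56.3%) and that Δ is large across models, which motivates future work on "large reasoning models" with high EAcc and Δ = 0. The release of UGMathBench and its evaluation code is expected to be a useful resource for advancing LLMs in mathematical problem solving.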

Link: https://arxiv.org/abs/2501.13766
Authors: Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, Can Yang
Affiliations: The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2025


Abstract: Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or probably suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated in UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly solved problems across all three versions, and reasoning gap (Δ), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3% by OpenAI-o1-mini, with large Δ values observed across different models. This highlights the need for future research aimed at developing “large reasoning models” with high EAcc and Δ = 0. We anticipate that the release of UGMathBench, along with its detailed evaluation codes, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems.
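
The two metrics defined in the abstract can be illustrated with a small sketch. It assumes per-problem results are stored as a list of booleans, one per randomized version; the function names and the toy data are hypothetical and are not taken from the UGMathBench evaluation code.

```python
from typing import Dict, List


def effective_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of problems answered correctly in every randomized version."""
    return sum(all(versions) for versions in results.values()) / len(results)


def average_accuracy(results: Dict[str, List[bool]]) -> float:
    """Accuracy pooled over all versions of all problems."""
    total = sum(len(versions) for versions in results.values())
    correct = sum(sum(versions) for versions in results.values())
    return correct / total


def reasoning_gap(results: Dict[str, List[bool]]) -> float:
    """Delta: average accuracy across versions minus effective accuracy."""
    return average_accuracy(results) - effective_accuracy(results)


# Toy data (hypothetical): four problems, three versions each.
results = {
    "p1": [True, True, True],     # solved in all versions, counts toward EAcc
    "p2": [True, False, True],    # inconsistent, does not count toward EAcc
    "p3": [False, False, False],
    "p4": [True, True, False],
}

print(effective_accuracy(results))  # 0.25
print(reasoning_gap(results))
```

A model that answers every version of a problem consistently has a gap of zero, which is the Δ = 0 target the abstract mentions.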

[NLP-18] 2-Tier SimCSE: Elevating BERT for Robust Sentence Embeddings

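[Quick read]: This paper aims to improve sentence embeddings so that they capture semantic nuance and generalize across contexts. The authors apply SimCSE (Simple Contrastive Learning of Sentence Embeddings), a contrastive learning method, to fine-tune the minBERT model for sentiment analysis, semantic textual similarity (STS), and paraphrase detection. The key contributions are: (1) experiments with three dropout techniques (standard, curriculum, and adaptive dropout) to counter overfitting; (2) a novel 2-Tier SimCSE fine-tuning model that combines unsupervised and supervised SimCSE on the STS task; and (3) an exploration of transfer learning from SimCSE models to the paraphrase and SST tasks. The 2-Tier model performs best on the STS task, with an average test score of 0.742 across the three downstream tasks, while transfer learning brings no gain, suggesting that knowledge from the STS task transfers poorly.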

Link: https://arxiv.org/abs/2501.13758
Authors: Yumeng Wang, Ziran Zhou, Junjin Wang
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Effective sentence embeddings that capture semantic nuances and generalize well across diverse contexts are crucial for natural language processing tasks. We address this challenge by applying SimCSE (Simple Contrastive Learning of Sentence Embeddings) using contrastive learning to fine-tune the minBERT model for sentiment analysis, semantic textual similarity (STS), and paraphrase detection. Our contributions include experimenting with three different dropout techniques, namely standard dropout, curriculum dropout, and adaptive dropout, to tackle overfitting, proposing a novel 2-Tier SimCSE Fine-tuning Model that combines both unsupervised and supervised SimCSE on STS task, and exploring transfer learning potential for Paraphrase and SST tasks. Our findings demonstrate the effectiveness of SimCSE, with the 2-Tier model achieving superior performance on the STS task, with an average test score of 0.742 across all three downstream tasks. The results of error analysis reveals challenges in handling complex sentiments and reliance on lexical overlap for paraphrase detection, highlighting areas for future research. The ablation study revealed that removing Adaptive Dropout in the Single-Task Unsupervised SimCSE Model led to improved performance on the STS task, indicating overfitting due to added parameters. Transfer learning from SimCSE models on Paraphrase and SST tasks did not enhance performance, suggesting limited transferability of knowledge from the STS task.
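
For readers unfamiliar with the contrastive objective behind unsupervised SimCSE, here is a minimal pure-Python sketch. It assumes each sentence has been encoded twice with different dropout masks, giving two "views" per sentence, and that the other sentences in the batch act as negatives. The function names, the toy vectors, and the temperature value are illustrative assumptions; this is not the authors' minBERT code.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(views_a: List[Vector], views_b: List[Vector],
                temperature: float = 0.05) -> float:
    """In-batch contrastive loss: views_a[i] should match views_b[i]."""
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # equals -log softmax at index i
    return total / n


# Toy embeddings (hypothetical): two sentences, two dropout views each.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(views_a, views_b)
mismatched = simcse_loss(views_a, list(reversed(views_b)))
print(aligned < mismatched)  # True
```

The loss is small when the two views of the same sentence are closer to each other than to views of other sentences, which is the behavior the fine-tuning encourages.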

[NLP-19] A Study of the Plausibility of Attention between RNN Encoders in Natural Language Inference

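[Quick read]: This paper evaluates how well attention maps explain the decisions of NLP models, specifically in a sentence comparison task (natural language inference). It asks whether attention maps provide plausible explanations, that is, whether they help humans understand a model's decision. The approach compares the cross-attention weights between two RNN encoders with human-based and heuristic-based annotations on the eSNLI corpus. The heuristic annotations correlate reasonably well with human annotations and can therefore be used to evaluate plausible explanations in sentence comparison tasks. Raw attention weights, however, remain only loosely related to a plausible explanation.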

Link: https://arxiv.org/abs/2501.13735
Authors: Duc Hau Nguyen, Duc Hau Nguyen, Pascale Sébillot
Affiliations: IRISA, CNRS, INSA Rennes; IRISA, CNRS; IRISA, INSA Rennes
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Attention maps in neural models for NLP are appealing to explain the decision made by a model, hopefully emphasizing words that justify the decision. While many empirical studies hint that attention maps can provide such justification from the analysis of sound examples, only a few assess the plausibility of explanations based on attention maps, i.e., the usefulness of attention maps for humans to understand the decision. These studies furthermore focus on text classification. In this paper, we report on a preliminary assessment of attention maps in a sentence comparison task, namely natural language inference. We compare the cross-attention weights between two RNN encoders with human-based and heuristic-based annotations on the eSNLI corpus. We show that the heuristic reasonably correlates with human annotations and can thus facilitate evaluation of plausible explanations in sentence comparison tasks. Raw attention weights however remain only loosely related to a plausible explanation.

[NLP-20] Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks

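[Quick read]: This paper addresses the limited ability of large language models (LLMs) to comprehend complex graph structures and their high inference cost on graph computational tasks, which make existing approaches impractical for large-scale graphs. It proposes PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph Computational Tasks), a framework with three key steps: problem understanding, prompt design, and code generation. The LLM understands the problem and extracts the relevant information to generate correct code, while analyzing the graph structure and executing the code are delegated to the interpreter. Task-related pseudocode is injected into the prompt to help the LLM generate efficient code, and cost-effective trial-and-error techniques ensure that the generated code executes correctly. Unlike methods that invoke the LLM for every test case, PIE calls the LLM only during code generation, so the generated code can be reused and inference cost drops significantly. Experiments show that PIE outperforms existing baselines in both accuracy and computational efficiency.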

Link: https://arxiv.org/abs/2501.13731
Authors: Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
Affiliations: School of Data Science, Fudan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages


Abstract:Graph computational tasks are inherently challenging and often demand the development of advanced algorithms for effective solutions. With the emergence of large language models (LLMs), researchers have begun investigating their potential to address these tasks. However, existing approaches are constrained by LLMs’ limited capability to comprehend complex graph structures and their high inference costs, rendering them impractical for handling large-scale graphs. Inspired by human approaches to graph problems, we introduce a novel framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph Computational Tasks), which consists of three key steps: problem understanding, prompt design, and code generation. In this framework, LLMs are tasked with understanding the problem and extracting relevant information to generate correct code. The responsibility for analyzing the graph structure and executing the code is delegated to the interpreter. We inject task-related pseudocodes into the prompts to further assist the LLMs in generating efficient code. We also employ cost-effective trial-and-error techniques to ensure that the LLM-generated code executes correctly. Unlike other methods that require invoking LLMs for each individual test case, PIE only calls the LLM during the code generation phase, allowing the generated code to be reused and significantly reducing inference costs. Extensive experiments demonstrate that PIE outperforms existing baselines in terms of both accuracy and computational efficiency.

[NLP-21] RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation

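[Quick read]: This paper addresses the dependence of Retrieval-Augmented Generation (RAG) on the quality of the retrieved context. When retrieved knowledge differs from a model's internal memory, the model struggles to judge whether the retrieved knowledge is correct, which causes knowledge conflicts during generation. The paper proposes Retrieval Preference Optimization (RPO), a lightweight and effective alignment method that adaptively leverages multi-source knowledge based on retrieval relevance. RPO derives an implicit representation of retrieval relevance and incorporates it into the reward model, unifying retrieval evaluation and response generation in a single model and avoiding the separate retrieval-quality assessment step that earlier methods need. According to the authors, RPO is the only RAG-dedicated alignment approach that quantifies awareness of retrieval relevance in training. Experiments on four datasets show that RPO improves accuracy over RAG by 4-10% without any extra component.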

Link: https://arxiv.org/abs/2501.13726
Authors: Shi-Qi Yan, Zhen-Hua Ling
Affiliations: National Engineering Research Center of Speech and Language Information Processing; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments:


Abstract:While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce the Retrieval Preference Optimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, solving the problem that previous methods necessitate the additional procedure to assess the retrieval quality. Notably, RPO is the only RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization.

[NLP-22] Musical ethnocentrism in Large Language Models

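[Quick read]: This paper analyzes geocultural biases about music in large language models (LLMs), particularly ChatGPT and Mixtral. Such biases can stem from the unbalanced representation of geographic regions and cultures in the training data and from value judgments contained in it. The paper runs two experiments: first, it prompts LLMs for "Top 100" lists of musical contributors in various categories and analyzes their countries of origin; second, it asks LLMs to numerically rate aspects of the musical cultures of different countries. In both experiments the LLMs show a strong preference for Western music cultures. The contribution is to expose and quantify these biases as a first step toward detecting and mitigating them.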

Link: https://arxiv.org/abs/2501.13720
Authors: Anna Kruspe
Affiliations: Munich University of Applied Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:


Abstract:Large Language Models (LLMs) reflect the biases in their training data and, by extension, those of the people who created this training data. Detecting, analyzing, and mitigating such biases is becoming a focus of research. One type of bias that has been understudied so far are geocultural biases. Those can be caused by an imbalance in the representation of different geographic regions and cultures in the training data, but also by value judgments contained therein. In this paper, we make a first step towards analyzing musical biases in LLMs, particularly ChatGPT and Mixtral. We conduct two experiments. In the first, we prompt LLMs to provide lists of the “Top 100” musical contributors of various categories and analyze their countries of origin. In the second experiment, we ask the LLMs to numerically rate various aspects of the musical cultures of different countries. Our results indicate a strong preference of the LLMs for Western music cultures in both experiments.

[NLP-23] DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

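[Quick read]: This paper addresses dependency inference in LLM-driven automated software development, that is, correctly identifying the internal components and external packages a repository needs in order to run. Existing studies indicate that dependency-related issues cause over 40% of observed runtime errors in generated repositories. The paper introduces DI-BENCH, a large-scale benchmark and evaluation framework for assessing LLMs on dependency inference. It contains 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Experiments with textual and execution-based metrics show that the best-performing model reaches only a 42.9% execution pass rate, leaving significant room for improvement.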

Link: https://arxiv.org/abs/2501.13699
Authors: Linghao Zhang, Junhao Wang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Jiaheng Wen, Chengxing Xie, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Affiliations: Microsoft; Wuhan University; Tongji University; Shanghai AI Laboratory; Zhejiang University
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:


Abstract:Large Language Models have advanced automated software development, however, it remains a challenge to correctly infer dependencies, namely, identifying the internal components and external packages required for a repository to successfully run. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors on the generated repository. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
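
The abstract mentions textual metrics without defining them. One plausible textual metric, shown here purely as an assumption, is set-level precision, recall, and F1 between the dependencies a model predicts and the ones a repository actually declares. The function name and the package lists are hypothetical and are not taken from DI-BENCH.

```python
from typing import Iterable, Tuple


def dependency_prf(predicted: Iterable[str],
                   reference: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 between predicted and reference dependency sets."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the model finds two of the four required packages
# and adds one package that the repository does not need.
precision, recall, f1 = dependency_prf(
    predicted=["numpy", "requests", "flask"],
    reference=["numpy", "requests", "pandas", "scipy"],
)
print(precision, recall, f1)
```

A textual score like this can look reasonable while the repository still fails to build, which is why an execution-based pass rate is reported alongside it.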

[NLP-24] Question Answering on Patient Medical Records with Private Fine-Tuned LLMs

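[Quick read]: This paper addresses the difficulty users have in retrieving and interpreting key health information from electronic health records (EHRs) stored in the Fast Healthcare Interoperability Resources (FHIR) standard, given their complexity and volume. The approach uses large language models (LLMs) for semantic question answering (QA) in two tasks: identifying the FHIR resources most relevant to a user query (Task1), then answering the query based on those resources (Task2). To meet privacy and compliance needs, the LLMs are privately hosted and fine-tuned. The fine-tuned LLMs, although 250x smaller than GPT-4 family models, outperform them by 0.55% in F1 score on Task1 and by 42% on the METEOR metric on Task2. The paper also examines sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the effect of training data size on performance.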

Link: https://arxiv.org/abs/2501.13687
Authors: Sara Kothari, Ayush Gupta
Affiliations: Stanford University; Genloop Labs, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract: Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution, enabling semantic question answering (QA) over medical data, allowing users to interact with their health records more effectively. However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs by first identifying the most relevant FHIR resources for a user query (Task1) and subsequently answering the query based on these resources (Task2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task1 and 42% on Meteor Task in Task2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: this https URL

[NLP-25] Collective Memory and Narrative Cohesion: A Computational Study of Palestinian Refugee Oral Histories in Lebanon COLING2025

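[Quick read]: This paper investigates how Palestinian refugee groups in Lebanon sustain a cohesive collective memory of the Nakba through shared narratives, and how factors such as gender and location shape that memory. Grounded in Halbwachs' theory of group memory, the study uses the Palestinian Oral History Archive (POHA) and quantifies narrative similarity through statistical analysis of pairwise similarity, using textual representations and semantic embeddings of the narratives. Shared origin is a powerful determinant of narrative similarity across thematic keywords, landmarks, and significant figures, as well as in semantic embeddings. Shared residence also fosters cohesion, and its effect is significantly amplified when paired with shared origin. Women's narratives show heightened thematic cohesion, particularly in recounting experiences of the British occupation, underscoring the gendered dimensions of memory formation. The study emphasizes the role of oral histories in safeguarding Palestinian identity and resisting erasure.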

Link: https://arxiv.org/abs/2501.13682
Authors: Ghadeer Awwad, Lavinia Dunagan, David Gamba, Tamara N. Rayan
Affiliations: School of Information, University of Michigan
Categories: Computation and Language (cs.CL)
Comments: Appeared in the 1st International Workshop on Nakba Narratives as Language Resources as part of COLING 2025


Abstract:This study uses the Palestinian Oral History Archive (POHA) to investigate how Palestinian refugee groups in Lebanon sustain a cohesive collective memory of the Nakba through shared narratives. Grounded in Halbwachs’ theory of group memory, we employ statistical analysis of pairwise similarity of narratives, focusing on the influence of shared gender and location. We use textual representation and semantic embeddings of narratives to represent the interviews themselves. Our analysis demonstrates that shared origin is a powerful determinant of narrative similarity across thematic keywords, landmarks, and significant figures, as well as in semantic embeddings of the narratives. Meanwhile, shared residence fosters cohesion, with its impact significantly amplified when paired with shared origin. Additionally, women’s narratives exhibit heightened thematic cohesion, particularly in recounting experiences of the British occupation, underscoring the gendered dimensions of memory formation. This research deepens the understanding of collective memory in diasporic settings, emphasizing the critical role of oral histories in safeguarding Palestinian identity and resisting erasure.

[NLP-26] Certified Robustness Under Bounded Levenshtein Distance ICLR2025

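[Quick read]: This paper addresses the vulnerability of text classifiers to adversarial perturbations, specifically robustness verification under Levenshtein distance constraints, where existing verification methods are prohibitively expensive. It proposes LipsLev, the first method for computing the Lipschitz constant of convolutional classifiers with respect to the Levenshtein distance, and uses these estimates to train 1-Lipschitz classifiers. This makes it possible to compute a classifier's certified radius in a single forward pass. On the AG-News dataset, LipsLev obtains 38.80% and 13.93% verified accuracy at distance 1 and 2 respectively, while being 4 orders of magnitude faster than existing approaches.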

Link: https://arxiv.org/abs/2501.13676
Authors: Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher
Affiliations: LIONS - École Polytechnique Fédérale de Lausanne, Switzerland; Department of Electrical and Computer Engineering, University of Wisconsin-Madison, USA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted in ICLR 2025


Abstract: Text classifiers suffer from small perturbations, that if chosen adversarially, can dramatically change the output of the model. Verification methods can provide robustness certificates against such adversarial perturbations, by computing a sound lower bound on the robust accuracy. Nevertheless, existing verification methods incur in prohibitive costs and cannot practically handle Levenshtein distance constraints. We propose the first method for computing the Lipschitz constant of convolutional classifiers with respect to the Levenshtein distance. We use these Lipschitz constant estimates for training 1-Lipschitz classifiers. This enables computing the certified radius of a classifier in a single forward pass. Our method, LipsLev, is able to obtain 38.80% and 13.93% verified accuracy at distance 1 and 2 respectively in the AG-News dataset, while being 4 orders of magnitude faster than existing approaches. We believe our work can open the door to more efficient verification in the text domain.
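
To see why a Lipschitz bound yields a certificate from one forward pass, consider the following generic argument, sketched under an assumption that the abstract does not spell out: every logit is K-Lipschitz with respect to the Levenshtein distance. The gap between the top logit and any other logit can then shrink by at most 2K per unit of edit distance, so the prediction cannot change while the distance stays strictly below margin / (2K). The function below is a hypothetical illustration of that argument and is not the exact certificate used by LipsLev.

```python
import math
from typing import Sequence


def certified_radius(logits: Sequence[float], lipschitz_constant: float = 1.0) -> int:
    """Largest integer edit distance at which the top prediction provably holds.

    Assumes each logit is `lipschitz_constant`-Lipschitz w.r.t. Levenshtein
    distance, so the top-vs-runner-up margin changes by at most
    2 * lipschitz_constant per edit.
    """
    ranked = sorted(logits, reverse=True)
    margin = ranked[0] - ranked[1]
    bound = margin / (2.0 * lipschitz_constant)
    # Levenshtein distances are integers; we need distance strictly below bound.
    return max(0, math.ceil(bound) - 1)


print(certified_radius([7.0, 1.0, 0.5]))  # margin 6 -> certified up to 2 edits
print(certified_radius([1.0, 0.5]))       # margin 0.5 -> no edit is certified (0)
```

Only the logits of the clean input are needed, which matches the single-forward-pass property described in the abstract; the cost of verification moves into training the classifier to be Lipschitz in the first place.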

[NLP-27] How to Complete Domain Tuning while Keeping General Ability in LLM: Adaptive Layer-wise and Element-wise Regularization

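[Quick read]: This paper addresses catastrophic forgetting when fine-tuning large language models (LLMs) on domain-specific tasks, where the model overwrites or loses general knowledge acquired during pretraining, limiting its broader applicability. The proposed approach computes the element-wise importance of model parameters for preserving general knowledge. It uses a dual-objective optimization strategy: a regularization loss retains the parameters crucial for general knowledge, and a cross-entropy loss adapts the model to the domain task. Layer-wise coefficients account for the varying contributions of different layers and dynamically balance the two objectives. Experiments on scientific, medical, and physical tasks using GPT-J and LLaMA-3 show that the method mitigates catastrophic forgetting while improving adaptability. Compared with previous methods, it is approximately 20 times faster and needs only 10%-15% of the storage.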

Link: https://arxiv.org/abs/2501.13669
Authors: Shezheng Song, Hao Xu, Jun Ma, Shasha Li, Long Peng, Qian Wan, Xiaodong Liu, Jie Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress


Abstract:Large Language Models (LLMs) exhibit strong general-purpose language capabilities. However, fine-tuning these models on domain-specific tasks often leads to catastrophic forgetting, where the model overwrites or loses essential knowledge acquired during pretraining. This phenomenon significantly limits the broader applicability of LLMs. To address this challenge, we propose a novel approach to compute the element-wise importance of model parameters crucial for preserving general knowledge during fine-tuning. Our method utilizes a dual-objective optimization strategy: (1) regularization loss to retain the parameter crucial for general knowledge; (2) cross-entropy loss to adapt to domain-specific tasks. Additionally, we introduce layer-wise coefficients to account for the varying contributions of different layers, dynamically balancing the dual-objective optimization. Extensive experiments on scientific, medical, and physical tasks using GPT-J and LLaMA-3 demonstrate that our approach mitigates catastrophic forgetting while enhancing model adaptability. Compared to previous methods, our solution is approximately 20 times faster and requires only 10%-15% of the storage, highlighting the practical efficiency. The code will be released.
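
The dual-objective idea can be sketched as a task loss plus a weighted penalty on drift away from the pretrained weights. The abstract does not give the exact form of the regularizer, so the quadratic penalty below is an assumption in the style of elastic weight consolidation, and every name and number is hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def regularized_loss(task_loss: float,
                     params: Params,
                     pretrained: Params,
                     importance: Params,
                     layer_coeff: Dict[str, float],
                     reg_strength: float = 1.0) -> float:
    """Cross-entropy task loss plus an importance-weighted drift penalty.

    importance[layer][i] is the element-wise importance of one parameter;
    layer_coeff[layer] scales the whole layer's contribution.
    """
    penalty = 0.0
    for layer, weights in params.items():
        layer_penalty = sum(
            imp * (w - w0) ** 2
            for w, w0, imp in zip(weights, pretrained[layer], importance[layer])
        )
        penalty += layer_coeff[layer] * layer_penalty
    return task_loss + reg_strength * penalty


# Hypothetical two-layer example.
loss = regularized_loss(
    task_loss=0.3,
    params={"layer0": [1.0, 2.0], "layer1": [0.5]},
    pretrained={"layer0": [1.0, 1.0], "layer1": [0.0]},
    importance={"layer0": [0.1, 2.0], "layer1": [4.0]},
    layer_coeff={"layer0": 0.5, "layer1": 1.0},
)
print(loss)
```

Parameters with high importance are expensive to move, so the optimizer adapts the model mainly through parameters that matter less for general knowledge.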

[NLP-28] LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models

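[Quick read]: This paper addresses the high computational overhead that multi-modal large language models (MLLMs) incur from processing a large number of vision tokens, which limits their use in resource-constrained environments. It proposes Language-Guided Vision Token Pruning (LVPruning), which uses cross-attention modules to compute the importance of vision tokens from their interaction with language tokens and decides which to prune. LVPruning requires no change to the original MLLM parameters, so it is simple to apply or remove. Experiments show that LVPruning can remove up to 90% of vision tokens by the middle layer of LLaVA-1.5, cutting inference TFLOPs by 62.1% with an average performance loss of only 0.45% across nine multi-modal benchmarks.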

Link: https://arxiv.org/abs/2501.13652
Authors: Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro
Affiliations: University of Manchester; Imperial College London
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting their practicality in resource-constrained environments. We introduce Language-Guided Vision Token Pruning (LVPruning) for MLLMs, an effective yet simple method that significantly reduces the computational burden while preserving model performance. LVPruning employs cross-attention modules to compute the importance of vision tokens based on their interaction with language tokens, determining which to prune. Importantly, LVPruning can be integrated without modifying the original MLLM parameters, which makes LVPruning simple to apply or remove. Our experiments show that LVPruning can effectively reduce up to 90% of vision tokens by the middle layer of LLaVA-1.5, resulting in a 62.1% decrease in inference Tera Floating-Point Operations Per Second (TFLOPs), with an average performance loss of just 0.45% across nine multi-modal benchmarks.
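
As a rough intuition for language-guided pruning, the sketch below scores each vision token by the average attention it receives from language tokens and keeps the highest-scoring fraction. This top-k rule over raw attention weights is a simplified stand-in chosen for illustration: the actual LVPruning modules are learned, and the abstract does not describe their decision rule. All names and numbers are hypothetical.

```python
from typing import List


def vision_token_importance(attention: List[List[float]]) -> List[float]:
    """attention[l][v] is the weight language token l places on vision token v."""
    num_language = len(attention)
    num_vision = len(attention[0])
    return [
        sum(attention[l][v] for l in range(num_language)) / num_language
        for v in range(num_vision)
    ]


def prune_vision_tokens(attention: List[List[float]], keep_ratio: float) -> List[int]:
    """Indices of the vision tokens to keep, in their original order."""
    scores = vision_token_importance(attention)
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:keep])


# Two language tokens attending over four vision tokens (toy values).
attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
]
kept = prune_vision_tokens(attention, keep_ratio=0.5)
print(kept)  # [0, 2]
```

Because the pruning decision depends only on token scores, the underlying model's weights stay untouched, in line with the abstract's point that the method is easy to apply or remove.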

[NLP-29] Sigma: Differential Rescaling of Query Key and Value for Efficient Language Models

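[Quick read]: This paper targets the efficiency of large language models (LLMs) specialized for the system domain, especially inference speed in long-context scenarios. Its core contribution is DiffQKV attention, which optimizes the Query (Q), Key (K), and Value (V) components of attention differentially, according to their different effects on model performance and efficiency. Experiments show that the model's sensitivity to compression differs between K and V, which leads to differentially compressed KV; in addition, an augmented Q expands the Q head dimension to increase representation capacity with minimal impact on inference speed. DiffQKV attention improves inference speed by up to 33.36% over conventional grouped-query attention (GQA) in long-context scenarios. Pre-trained on system domain data, Sigma significantly outperforms GPT-4 on the AIMicius system-domain benchmark, with an absolute improvement of up to 52.5%.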

Link: https://arxiv.org/abs/2501.13629
Authors: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang
Affiliations: Microsoft Sigma Team
Categories: Computation and Language (cs.CL)
Comments:


Abstract:We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model’s varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model’s representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over the conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-arts models. In the system domain, we introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement up to 52.5%.
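
A back-of-the-envelope calculation shows why compressing K and V differently helps in long-context inference, where the KV cache dominates memory traffic. The head counts, dimensions, and layer count below are hypothetical and are not Sigma's actual configuration; compressing the key side more than the value side is likewise only an illustrative choice.

```python
def kv_cache_bytes_per_token(num_layers: int,
                             num_k_heads: int, k_head_dim: int,
                             num_v_heads: int, v_head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per generated token (2 bytes = fp16/bf16)."""
    values_per_layer = num_k_heads * k_head_dim + num_v_heads * v_head_dim
    return num_layers * values_per_layer * bytes_per_value


# Grouped-query attention baseline: K and V share the same head count.
baseline = kv_cache_bytes_per_token(32, num_k_heads=8, k_head_dim=128,
                                    num_v_heads=8, v_head_dim=128)

# Differential compression: halve the K heads, leave V unchanged.
compressed = kv_cache_bytes_per_token(32, num_k_heads=4, k_head_dim=128,
                                      num_v_heads=8, v_head_dim=128)

print(baseline, compressed, compressed / baseline)  # 131072 98304 0.75
```

Because the query projection is not cached, enlarging the Q head dimension adds compute but no per-token cache, which is consistent with the abstract's claim that augmented Q has minimal impact on inference speed.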

[NLP-30] Domain-Specific Machine Translation to Translate Medicine Brochures in English to Sorani Kurdish

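[Quick read]: This paper addresses the limited access of Sorani Kurdish speakers to medicine brochures, which deprives them of critical health information. The authors develop a specialized Machine Translation (MT) model that translates English medicine brochures into Sorani Kurdish. They build a parallel corpus of 22,940 aligned sentence pairs from 319 brochures, sourced from two pharmaceutical companies in the Kurdistan Region of Iraq (KRI). A Statistical Machine Translation (SMT) model trained with the Moses toolkit reaches BLEU scores from 22.65 to 48.93 over seven experiments. Unknown words encountered when translating three new brochures are handled by post-processing with a medical dictionary, giving BLEU scores of 56.87, 31.05, and 40.01. In human evaluation, 50% of professionals found the translations consistent and 83.3% rated them accurate, while 66.7% of users considered the translations clear and felt confident using the medications.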

Link: https://arxiv.org/abs/2501.13609
Authors: Mariam Shamal, Hossein Hassani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 6 figures, 3 tables


Abstract:Access to Kurdish medicine brochures is limited, depriving Kurdish-speaking communities of critical health information. To address this problem, we developed a specialized Machine Translation (MT) model to translate English medicine brochures into Sorani Kurdish using a parallel corpus of 22,940 aligned sentence pairs from 319 brochures, sourced from two pharmaceutical companies in the Kurdistan Region of Iraq (KRI). We trained a Statistical Machine Translation (SMT) model using the Moses toolkit, conducting seven experiments that resulted in BLEU scores ranging from 22.65 to 48.93. We translated three new brochures to improve the evaluation process and encountered unknown words. We addressed unknown words through post-processing with a medical dictionary, resulting in BLEU scores of 56.87, 31.05, and 40.01. Human evaluation by native Kurdish-speaking pharmacists, physicians, and medicine users showed that 50% of professionals found the translations consistent, while 83.3% rated them accurate. Among users, 66.7% considered the translations clear and felt confident using the medications.

[NLP-31] Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization

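[Quick read]: This paper aims to ensure contextual faithfulness in retrieval-augmented large language models (LLMs), particularly in long-form question answering (LFQA), where trustworthy information-seeking systems must tell faithful generations from unfaithful ones. The authors identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Building on this, they propose the RHIO framework, which augments unfaithful samples by selectively masking retrieval heads so that the samples simulate realistic model-intrinsic errors. These samples are then incorporated into joint training, so the model learns to distinguish faithful from unfaithful outputs conditioned on control tokens. The control tokens are also used to self-induce contrastive outputs, whose difference is amplified through contrastive decoding. To evaluate contextual faithfulness, the paper introduces GroundBench, a benchmark compiled from five existing LFQA datasets. Experiments show that RHIO significantly improves faithfulness, even outperforming GPT-4o.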

Link: https://arxiv.org/abs/2501.13573
Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Bing Qin
Affiliations: Harbin Institute of Technology, China; Peng Cheng Laboratory, China; Northeastern University, China; The University of Hong Kong, China; iFLYTEK Research, China
Categories: Computation and Language (cs.CL)
Comments: Submitted to ARR October 2024

Abstract:Ensuring contextual faithfulness in retrieval-augmented large language models (LLMs) is crucial for building trustworthy information-seeking systems, particularly in long-form question-answering (LFQA) scenarios. In this work, we identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Leveraging this insight, we propose RHIO, a framework designed to teach LLMs to explicitly discriminate between faithful and unfaithful generations. RHIO first augments unfaithful samples that simulate realistic model-intrinsic errors by selectively masking retrieval heads. Then, these samples are incorporated into joint training, enabling the model to distinguish unfaithful outputs from faithful ones conditioned on control tokens. Furthermore, these control tokens are leveraged to self-induce contrastive outputs, amplifying their difference through contrastive decoding. Additionally, to facilitate the evaluation of contextual faithfulness, we also introduce GroundBench, a comprehensive benchmark compiled from five existing LFQA datasets. Extensive experimental results on GroundBench demonstrate that RHIO significantly improves faithfulness, even outperforming GPT-4o.
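
The contrastive decoding step mentioned above can be illustrated with a minimal sketch. It assumes a commonly used formulation in which next-token logits from a "faithful" pass are pushed away from logits from an "unfaithful" pass. The function names, the `alpha` weight, and the toy logits are all hypothetical, and the exact scoring rule used by RHIO may differ.

```python
def contrastive_scores(faithful_logits, unfaithful_logits, alpha=1.0):
    """Amplify what the faithful pass prefers relative to the unfaithful pass.

    Assumed formulation: (1 + alpha) * faithful - alpha * unfaithful.
    """
    return [
        (1.0 + alpha) * f - alpha * u
        for f, u in zip(faithful_logits, unfaithful_logits)
    ]


def greedy_token(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens. Token 0 is liked equally by both passes,
# while token 1 is preferred only when conditioning on the faithful control token.
faithful = [2.0, 1.9, 0.0]
unfaithful = [2.0, 0.5, 0.0]

plain_choice = greedy_token(faithful)
contrastive_choice = greedy_token(contrastive_scores(faithful, unfaithful))
print(plain_choice, contrastive_choice)
```

In this toy case the plain greedy choice and the contrastive choice differ, because the contrast rewards the token on which the two passes disagree most.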

[NLP-32] K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor NAACL2025

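[Quick read]: This paper addresses a problem in retrieval-augmented question answering (QA): retrieved documents may require high domain expertise and contain a large amount of information unrelated to the question, so the reader model struggles to comprehend the text and may produce hallucinations. The proposed solution is K-COMP (Knowledge-injected compressor). Its key idea is to automatically generate the prior knowledge needed to answer the question before compressing the retrieved passages, and then to integrate that knowledge into the compression process. This keeps the compressed context aligned with the question intent, guiding the reader model toward relevant answers and helping it trust the context.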

Link: https://arxiv.org/abs/2501.13567
Authors: Jeonghun Cho, Gary Geunbae Lee
Affiliations: Graduate School of Artificial Intelligence, Pohang University of Science and Technology, Republic of Korea; Department of Computer Science and Engineering, Pohang University of Science and Technology, Republic of Korea
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL 2025

Abstract:Retrieval-augmented question answering (QA) integrates external information, and thereby increases the QA accuracy of reader models that lack domain knowledge. However, documents retrieved for closed domains require high expertise, so the reader model may have difficulty fully comprehending the text. Moreover, the retrieved documents contain thousands of tokens, some unrelated to the question. As a result, the documents include some inaccurate information, which could lead the reader model to mistrust the passages and could result in hallucinations. To solve these problems, we propose K-COMP (Knowledge-injected compressor) which provides the knowledge required to answer correctly. The compressor automatically generates the requisite prior knowledge to facilitate the answering process prior to the compression of retrieved passages. Subsequently, the passages are compressed autoregressively, with the generated knowledge being integrated into the compression process. This process ensures alignment between the question intent and the compressed context. By augmenting this prior knowledge and concise context, the reader models are guided toward relevant answers and trust the context.

[NLP-33] LLMs Can Plan Only If We Tell Them ICLR2025

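[Quick read]: This paper addresses the limitations of large language models (LLMs) in autonomous planning. Although LLMs show significant capabilities in natural language processing and reasoning, even the most advanced models such as GPT-4 struggle to match human performance on standard planning benchmarks such as Blocksworld without additional support. Existing approaches rely on external feedback mechanisms or controlled environments, which demand substantial computational and development resources because of careful design and iterative backprompting. The core contribution is AoT+, a set of enhancements to Algorithm-of-Thoughts (AoT). AoT+ lets the model generate long-horizon plans autonomously and achieves state-of-the-art results on planning benchmarks, outperforming prior methods and human baselines without external feedback or a controlled environment.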

Link: https://arxiv.org/abs/2501.13545
Authors: Bilgehan Sel, Ruoxi Jia, Ming Jin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2025

Abstract:Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs like GPT-4 struggle to match human performance on standard planning benchmarks, such as the Blocksworld, without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements to Algorithm-of-Thoughts (AoT), which we dub AoT+, help achieve state-of-the-art results in planning benchmarks out-competing prior methods and human baselines all autonomously.

[NLP-34] ReasVQA: Advancing VideoQA with Imperfect Reasoning Process NAACL2025

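[Quick read]: This paper addresses the difficulty of understanding complex visual and temporal relationships in Video Question Answering (VideoQA), with the goal of answering video-related questions more accurately. The key to the solution is ReasVQA (Reasoning-enhanced Video Question Answering), which uses reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve VideoQA models. ReasVQA has three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, additional MLLMs generate detailed reasoning processes. Second, a filtering step refines them to ensure data quality. Finally, the possibly imperfect reasoning data guides the VideoQA model, via multi-task learning, in how to interpret and answer questions about a given video. On three popular benchmarks the method improves performance by +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA, demonstrating the supervisory benefit of integrating reasoning processes into VideoQA.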

Link: https://arxiv.org/abs/2501.13536
Authors: Jianxin Liang, Xiaojun Meng, Huishuai Zhang, Yueqian Wang, Jiansheng Wei, Dongyan Zhao
Affiliations: Wangxuan Institute of Computer Technology, Peking University; Huawei Noah’s Ark Lab; National Key Laboratory of General Artificial Intelligence, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to main conference at NAACL 2025; 8 pages

Abstract:Video Question Answering (VideoQA) is a challenging task that requires understanding complex visual and temporal relationships within videos to answer questions accurately. In this work, we introduce ReasVQA (Reasoning-enhanced Video Question Answering), a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. Our approach consists of three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, we generate detailed reasoning processes using additional MLLMs; second, we refine them via a filtering step to ensure data quality. Finally, we use the reasoning data, which might be in an imperfect form, to guide the VideoQA model via multi-task learning, on how to interpret and answer questions based on a given video. We evaluate ReasVQA on three popular benchmarks, and our results establish new state-of-the-art performance with significant improvements of +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA. Our findings demonstrate the supervising benefits of integrating reasoning processes into VideoQA. Further studies validate each component of our method, also with different backbones and MLLMs, and again highlight the advantages of this simple but effective method. We offer a new perspective on enhancing VideoQA performance by utilizing advanced reasoning techniques, setting a new benchmark in this research field.

[NLP-35] DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition

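[Quick read]: This paper addresses a limitation of data2vec in self-supervised learning (SSL): its masked representation generation relies on multi-layer averaging, which inevitably couples language and phoneme features and hurts multilingual automatic speech recognition (ASR). The authors propose a decoupling quantization based data2vec (DQ-Data2vec), whose core idea is to use improved online K-means quantizers with specified cluster numbers to decouple language and phoneme information. In language quantization, the cluster number is set to match the number of languages, explicitly decoupling the language-related information in shallow layers from irrelevant features such as speaker information. The same strategy is applied to decouple phoneme and word features in the middle layers. In the self-supervised scenario on the CommonVoice dataset, DQ-Data2vec achieves relative reductions of 9.51% in phoneme error rate (PER) and 11.58% in word error rate (WER) compared to data2vec and UniData2vec. In the weakly supervised scenario, the relative reductions are 18.09% and 1.55%, respectively.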

Link: https://arxiv.org/abs/2501.13497
Authors: Qijie Shao, Linhao Dong, Kun Wei, Sining Sun, Lei Xie
Affiliations: Northwestern Polytechnical University; Bytedance Speech, Beijing Bytedance Technology Co Ltd
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Submitted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

Abstract:Data2vec is a self-supervised learning (SSL) approach that employs a teacher-student architecture for contextual representation learning via masked prediction, demonstrating remarkable performance in monolingual ASR. Previous studies have revealed that data2vec’s shallow layers capture speaker and language information, middle layers encode phoneme and word features, while deep layers are responsible for reconstruction. Language and phoneme features are crucial for multilingual ASR. However, data2vec’s masked representation generation relies on multi-layer averaging, inevitably coupling these features. To address this limitation, we propose a decoupling quantization based data2vec (DQ-Data2vec) for multilingual ASR, which includes a data2vec backbone and two improved online K-means quantizers. Our core idea is using the K-means quantizer with specified cluster numbers to decouple language and phoneme information for masked prediction. Specifically, in the language quantization, considering that the number of languages is significantly different from other irrelevant features (e.g., speakers), we assign the cluster number to match the number of languages, explicitly decoupling shallow layers’ language-related information from irrelevant features. This strategy is also applied to decoupling middle layers’ phoneme and word features. In a self-supervised scenario, experiments on the CommonVoice dataset demonstrate that DQ-Data2vec achieves a relative reduction of 9.51% in phoneme error rate (PER) and 11.58% in word error rate (WER) compared to data2vec and UniData2vec. Moreover, in a weakly-supervised scenario incorporating language labels and high-resource language text labels, the relative reduction is 18.09% and 1.55%, respectively.
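
As a rough illustration of quantizing features with a fixed number of clusters, the sketch below implements a textbook online K-means update (nearest centroid, then a running-mean step). It is only an assumption-laden toy: the class name `OnlineKMeans`, the running-mean learning rate, and the 2-D inputs are hypothetical, and the paper's "improved" quantizers are not reproduced here.

```python
class OnlineKMeans:
    """Textbook online K-means with a fixed number of clusters."""

    def __init__(self, centroids):
        # The number of initial centroids fixes the cluster count,
        # e.g. one cluster per language in a language quantizer.
        self.centroids = [list(c) for c in centroids]
        self.counts = [0] * len(self.centroids)

    def assign(self, x):
        """Return the index of the nearest centroid (squared Euclidean distance)."""
        dists = [
            sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centroids
        ]
        return min(range(len(dists)), key=dists.__getitem__)

    def update(self, x):
        """Assign x to a cluster and move that centroid toward x."""
        k = self.assign(x)
        self.counts[k] += 1
        lr = 1.0 / self.counts[k]  # running-mean step size
        self.centroids[k] = [
            c + lr * (a - c) for a, c in zip(x, self.centroids[k])
        ]
        return k


quantizer = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
codes = [quantizer.update(x) for x in ([1.0, 1.0], [9.0, 9.0], [3.0, 3.0])]
print(codes, quantizer.centroids)
```

The returned cluster indices play the role of discrete codes; choosing the cluster count to match the number of languages is the design choice the abstract highlights.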

[NLP-36] RECALL: Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal Cycles

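[Quick read]: This paper addresses the unidirectional causality of large language models (LLMs) on sequential data, which underlies the so-called "reversal curse". When prompted to recall the context that precedes a given passage, a model often fails; for example, it may not return the correct line when asked for the line preceding a given line of the U.S. National Anthem. This happens because the model generates text from preceding tokens, so information must be learned and reproduced in a consistent token order.

The paper introduces the self-referencing causal cycle (RECALL), a mechanism driven by "cycle tokens": sequences that connect different parts of the training data and thereby enable recall of preceding tokens from succeeding ones. Through rigorous probabilistic formalization and controlled experiments, the paper shows how these cycles influence a model's ability to reproduce information. The mechanism offers a new perspective on the reversal curse, suggesting that it is not always an obstacle in practice.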

Link: https://arxiv.org/abs/2501.13491
Authors: Munachiso Nwadike, Zangir Iklassov, Toluwani Aremu, Tatsuya Hiraoka, Velibor Bojkovic, Benjamin Heinzerling, Hilal Alqaubeh, Martin Takáč, Kentaro Inui
Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); RIKEN AIP (RIKEN Center for Advanced Intelligence Project); Tohoku University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce the concept of the self-referencing causal cycle (abbreviated RECALL) - a mechanism that enables large language models (LLMs) to bypass the limitations of unidirectional causality, which underlies a phenomenon known as the reversal curse. When an LLM is prompted with sequential data, it often fails to recall preceding context. For example, when we ask an LLM to recall the line preceding “O say does that star-spangled banner yet wave” in the U.S. National Anthem, it often fails to correctly return “Gave proof through the night that our flag was still there” - this is due to the reversal curse. It occurs because language models such as ChatGPT and Llama generate text based on preceding tokens, requiring facts to be learned and reproduced in a consistent token order. While the reversal curse is often viewed as a limitation, we offer evidence of an alternative view: it is not always an obstacle in practice. We find that RECALL is driven by what we designate as cycle tokens - sequences that connect different parts of the training data, enabling recall of preceding tokens from succeeding ones. Through rigorous probabilistic formalization and controlled experiments, we demonstrate how the cycles they induce influence a model’s ability to reproduce information. To facilitate reproducibility, we provide our code and experimental details at this https URL.

[NLP-37] MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods

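[Quick read]: This paper addresses the challenges of applying quantization to Mamba models. Mamba is an efficient sequence model that performs well across many tasks, but existing quantization methods such as Quarot work poorly on it and cause a significant accuracy drop (for example, a 21% drop on Vim-T even under W8A8). The paper identifies significant outliers in gate projections, output projections, and matrix multiplications, and shows that Mamba's unique parallel scan further amplifies these outliers, leading to uneven and heavy-tailed data distributions. Moreover, even after applying the Hadamard transform, the variance across channels in weights and activations remains inconsistent.

To address these issues, the paper proposes MambaQuant, a post-training quantization (PTQ) framework. Its key components are: 1) Karhunen-Loeve Transformation (KLT) enhanced rotation, which makes the rotation matrix adaptable to diverse channel distributions; and 2) Smooth-Fused rotation, which equalizes channel variances and can merge the additional parameters into the model weights. Experiments show that MambaQuant can quantize both weights and activations to 8 bits with less than 1% accuracy loss on Mamba-based vision and language tasks. According to the authors, this is the first comprehensive PTQ design for the Mamba family.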

Link: https://arxiv.org/abs/2501.13484
Authors: Zukang Xu, Yuxuan Yue, Xing Hu, Zhihang Yuan, Zixu Jiang, Zhixuan Chen, Jiangyong Yu, Chen Xu, Sifan Zhou, Dawei Yang
Affiliations: Houmo AI; Nanjing University; Harbin Institute of Technology (Shenzhen); Southeast University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Mamba is an efficient sequence model that rivals Transformers and demonstrates significant potential as a foundational architecture for various tasks. Quantization is commonly used in neural networks to reduce model size and computational latency. However, applying quantization to Mamba remains underexplored, and existing quantization methods, which have been effective for CNN and Transformer models, appear inadequate for Mamba models (e.g., Quarot suffers a 21% accuracy drop on Vim-T† even under W8A8). We have pioneered the exploration of this issue and identified several key challenges. First, significant outliers are present in gate projections, output projections, and matrix multiplications. Second, Mamba’s unique parallel scan further amplifies these outliers, leading to uneven and heavy-tailed data distributions. Third, even with the application of the Hadamard transform, the variance across channels in weights and activations still remains inconsistent. To this end, we propose MambaQuant, a post-training quantization (PTQ) framework consisting of: 1) Karhunen-Loeve Transformation (KLT) enhanced rotation, rendering the rotation matrix adaptable to diverse channel distributions, and 2) Smooth-Fused rotation, which equalizes channel variances and can merge additional parameters into model weights. Experiments show that MambaQuant can quantize both weights and activations into 8-bit with less than 1% accuracy loss for Mamba-based vision and language tasks. To the best of our knowledge, MambaQuant is the first comprehensive PTQ design for the Mamba family, paving the way for further advancements in its application.
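
For readers unfamiliar with rotation-based quantization, the sketch below shows only the baseline idea that the abstract refers to: multiplying activations by a normalized Hadamard matrix spreads a single outlier across channels while preserving the vector norm. It does not implement the KLT-enhanced or Smooth-Fused rotations proposed in the paper, and the helper names and toy vector are hypothetical.

```python
def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = n ** -0.5  # makes the matrix orthogonal
    return [[v * scale for v in row] for row in h]


def rotate(x, h):
    """Compute the row-vector product x @ h."""
    return [
        sum(x[i] * h[i][j] for i in range(len(x))) for j in range(len(h[0]))
    ]


def max_abs(x):
    """Largest magnitude, which drives the quantization step size."""
    return max(abs(v) for v in x)


activation = [8.0, 0.0, 0.0, 0.0]  # one channel carries a large outlier
rotated = rotate(activation, hadamard(4))
print(max_abs(activation), max_abs(rotated))
```

Because the rotation is orthogonal, the inverse rotation can be folded into the following weight matrix, so the layer output is unchanged while the activation range that must be quantized shrinks.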

[NLP-38] Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer

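[Quick read]: This paper addresses the shortcomings of the traditional Transformer in text classification, namely capturing deep semantic relationships and controlling computational complexity. It proposes two key improvements. The first is a multi-level attention mechanism that combines global attention with local attention to model both the global semantics and the local features of a text. The second is a contrastive learning strategy that constructs positive and negative sample pairs to strengthen the model's ability to distinguish between categories. In addition, a lightweight module optimizes the feature transformation process and reduces computational cost, improving training and inference efficiency on large-scale text data. Experiments show that the improved Transformer outperforms BiLSTM, CNN, the standard Transformer, and BERT in classification accuracy, F1 score, and recall, indicating stronger semantic representation and generalization.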

Link: https://arxiv.org/abs/2501.13467
Authors: Jia Gao, Guiran Liu, Binrong Zhu, Shicheng Zhou, Hongye Zheng, Xiaoxuan Liao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper studies a text classification algorithm based on an improved Transformer to improve the performance and efficiency of the model in text classification tasks. Aiming at the shortcomings of the traditional Transformer model in capturing deep semantic relationships and optimizing computational complexity, this paper introduces a multi-level attention mechanism and a contrastive learning strategy. The multi-level attention mechanism effectively models the global semantics and local features in the text by combining global attention with local attention; the contrastive learning strategy enhances the model’s ability to distinguish between different categories by constructing positive and negative sample pairs while improving the classification effect. In addition, in order to improve the training and inference efficiency of the model on large-scale text data, this paper designs a lightweight module to optimize the feature transformation process and reduce the computational cost. Experimental results on the dataset show that the improved Transformer model outperforms the comparative models such as BiLSTM, CNN, standard Transformer, and BERT in terms of classification accuracy, F1 score, and recall rate, showing stronger semantic representation ability and generalization performance. The method proposed in this paper provides a new idea for algorithm optimization in the field of text classification and has good application potential and practical value. Future work will focus on studying the performance of this model in multi-category imbalanced datasets and cross-domain tasks and explore the integration wi

[NLP-39] Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

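[Quick read]: This paper addresses the numerical instability and reduced performance of traditional Softmax attention as the inference length grows. The key to the solution is decomposing the Softmax operation into a non-linear transformation and the $\ell_1$-norm, and identifying the $\ell_1$-norm as essential for maintaining model performance. The authors replace the non-linear transformation with the Softplus activation function and introduce a dynamic length scale factor, based on invariance entropy, for different token lengths. To further improve length extrapolation, they add a re-weighting mechanism that amplifies significant attention weights and diminishes weaker ones, so that the model concentrates more effectively on relevant tokens. The resulting attention mechanism handles long sequences well, maintaining nearly constant validation loss even at 16 times the training token length while ensuring numerical stability.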

Link: https://arxiv.org/abs/2501.13428
Authors: Bo Gao, Michael W. Spratling
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages and 2 figures

Abstract:Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $\ell_1$-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic length scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens. When combined with our proposed attention mechanism, this approach demonstrates significant promise in managing longer sequences, maintaining nearly constant validation loss even at $16\times$ the training token length while ensuring numerical stability. Our code is available at: this https URL.
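
The decomposition described above can be made concrete with a small sketch: Softmax is an exponential followed by $\ell_1$ normalization, and the variant swaps the exponential for Softplus. This is a simplified illustration only. The `scale` argument is a hypothetical stand-in for the paper's dynamic length scale factor, whose formula is not given here, and the re-weighting mechanism is omitted.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def l1_normalize(values):
    """Divide by the l1-norm so the weights sum to one."""
    total = sum(abs(v) for v in values)
    return [v / total for v in values]


def softmax_weights(scores):
    """Softmax written as exp followed by l1 normalization."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    return l1_normalize([math.exp(s - m) for s in scores])


def softplus_weights(scores, scale=1.0):
    """Same l1 normalization, with exp replaced by Softplus.

    `scale` is a placeholder for a length-dependent factor (assumption).
    """
    return l1_normalize([softplus(scale * s) for s in scores])


scores = [1.0, -2.0, 0.5]
print(softmax_weights(scores))
print(softplus_weights(scores))
```

Both functions return non-negative weights that sum to one; the difference lies only in how raw scores are mapped to positive values before normalization.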

[NLP-40] A Survey of Code-switched Arabic NLP: Progress, Challenges, and Future Directions COLING2025

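[Quick read]: This paper addresses the challenges that the complex linguistic landscape of the Arab world poses for natural language processing (NLP). The region presents a diglossic and multilingual setting involving Modern Standard Arabic, various dialects and sub-dialects, and multiple European languages. This diversity makes code-switching widespread, both within Arabic varieties and between Arabic and foreign languages. The paper reviews the current literature on code-switched Arabic NLP and offers a broad perspective on ongoing efforts, challenges, research gaps, and recommendations for future research. Its central point is that language technologies for the region must understand and handle code-switched Arabic effectively.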

Link: https://arxiv.org/abs/2501.13419
Authors: Injy Hamed, Caroline Sabty, Slim Abdennadher, Ngoc Thang Vu, Thamar Solorio, Nizar Habash
Affiliations: MBZUAI; German International University, Egypt; University of Stuttgart; University of Houston; New York University Abu Dhabi
Categories: Computation and Language (cs.CL)
Comments: Accepted to COLING 2025

Abstract:Language in the Arab world presents a complex diglossic and multilingual setting, involving the use of Modern Standard Arabic, various dialects and sub-dialects, as well as multiple European languages. This diverse linguistic landscape has given rise to code-switching, both within Arabic varieties and between Arabic and foreign languages. The widespread occurrence of code-switching across the region makes it vital to address these linguistic needs when developing language technologies. In this paper, we provide a review of the current literature in the field of code-switched Arabic NLP, offering a broad perspective on ongoing efforts, challenges, research gaps, and recommendations for future research directions.

[NLP-41] ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models

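[Quick read]: This paper studies how [MASK] tokens affect Masked Language Models (MLMs), focusing on the corrupted semantics problem that masking introduces: the corrupted context may convey multiple, ambiguous meanings, which hurts performance on downstream tasks. To address this, the paper proposes ExLM, an enhanced-context MLM. Its key idea is to expand the [MASK] tokens in the input context and to model the dependencies between these expanded states. The expansion increases context capacity and lets the model capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experiments show that ExLM achieves significant improvements in both text modeling and SMILES modeling tasks, and that context enhancement reduces the multimodality problem commonly observed in MLMs.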

Link: https://arxiv.org/abs/2501.13397
Authors: Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 29 pages, 12 figures

Abstract:Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly replacing some tokens in the input sentences with [MASK] tokens and predicting the original tokens based on the remaining context. This paper explores the impact of [MASK] tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands [MASK] tokens in the input context and models the dependencies between these expanded states. This expansion increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enhances semantic representations through context enhancement, and effectively reduces the multimodality problem commonly observed in MLMs.

[NLP-42] Can Large Language Models Understand Preferences in Personalized Recommendation?

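[Quick read]: This paper addresses a weakness of existing recommendation evaluation: user rating bias and item quality both influence rating scores, which makes it hard to tell whether a method actually captures personal preferences. The authors introduce PerRecBench, which removes the effect of these two factors by grouping users and evaluates, in a grouped ranking manner, how well recommendation techniques capture personal preferences. With this benchmark, they find that LLM-based recommendation techniques that are generally good at rating prediction fail to identify users' favored and disfavored items once rating bias and item quality are eliminated. The study also reveals the superiority of pairwise and listwise ranking over pointwise ranking, as well as the importance of user profiles and pretraining data distributions.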

Link: https://arxiv.org/abs/2501.13391
Authors: Zhaoxuan Tan, Zinan Zeng, Qingkai Zeng, Zhenyu Wu, Zheyuan Liu, Fengran Mo, Meng Jiang
Affiliations: University of Notre Dame; Xi’an Jiaotong University; Université de Montréal
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) excel in various tasks, including personalized recommendations. Existing evaluation methods often focus on rating prediction, relying on regression errors between actual and predicted ratings. However, user rating bias and item quality, two influential factors behind rating scores, can obscure personal preferences in user-item pair data. To address this, we introduce PerRecBench, disassociating the evaluation from these two factors and assessing recommendation techniques on capturing the personal preferences in a grouped ranking manner. We find that the LLM-based recommendation techniques that are generally good at rating prediction fail to identify users’ favored and disfavored items when the user rating bias and item quality are eliminated by grouping users. With PerRecBench and 19 LLMs, we find that while larger models generally outperform smaller ones, they still struggle with personalized recommendation. Our findings reveal the superiority of pairwise and listwise ranking approaches over pointwise ranking, PerRecBench’s low correlation with traditional regression metrics, the importance of user profiles, and the role of pretraining data distributions. We further explore three supervised fine-tuning strategies, finding that merging weights from single-format training is promising but improving LLMs’ understanding of user preferences remains an open research problem. Code and data are available at this https URL

[NLP-43] Do as We Do, Not as You Think: the Conformity of Large Language Models ICLR2025

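[Quick read]: This paper studies conformity in multi-agent systems driven by large language models (LLMs), covering the existence of conformity, the factors that influence it, and potential mitigation strategies. Conformity here is analogous to conformity bias and groupthink in human group dynamics, and it may affect the collective problem-solving ability of multi-agent systems and raise ethical concerns. The paper introduces BenchForm, a conformity-oriented benchmark with reasoning-intensive tasks and five distinct interaction protocols for probing LLM behavior in collaborative scenarios. Using metrics such as conformity rate and independence rate, it analyzes how factors including interaction time and majority size influence conformity, and it explores two mitigation strategies: enhanced personas and a reflection mechanism. The results characterize how LLMs conform during collaboration and inform the design of more robust and ethically aligned collaborative AI systems.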

Link: https://arxiv.org/abs/2501.13381
Authors: Zhiyuan Weng, Guikun Chen, Wenguan Wang
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: ICLR 2025. Code: this https URL

Abstract:Recent advancements in large language models (LLMs) revolutionize the field of intelligent agents, enabling collaborative multi-agent systems capable of tackling complex problems across various domains. However, the potential of conformity within these systems, analogous to phenomena like conformity bias and groupthink in human group dynamics, remains largely unexplored, raising concerns about their collective problem-solving capabilities and possible ethical implications. This paper presents a comprehensive study on conformity in LLM-driven multi-agent systems, focusing on three aspects: the existence of conformity, the factors influencing conformity, and potential mitigation strategies. In particular, we introduce BenchForm, a new conformity-oriented benchmark, featuring reasoning-intensive tasks and five distinct interaction protocols designed to probe LLMs’ behavior in collaborative scenarios. Several representative LLMs are evaluated on BenchForm, using metrics such as conformity rate and independence rate to quantify conformity’s impact. Our analysis delves into factors influencing conformity, including interaction time and majority size, and examines how the subject agent rationalizes its conforming behavior. Furthermore, we explore two strategies to mitigate conformity effects, i.e., developing enhanced personas and implementing a reflection mechanism. Several interesting findings regarding LLMs’ conformity are derived from empirical results and case studies. We hope that these insights can pave the way for more robust and ethically-aligned collaborative AI systems. Our benchmark and code are available at BenchForm.

[NLP-44] AgentRec: Agent Recommendation Using Sentence Embeddings Aligned to Human Feedback

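[Quick read]: This paper addresses how a multi-agent system should choose the most appropriate agent for a given task. It proposes an architecture that recommends an agent from a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. The key to the solution is encoding prompts into sentence embeddings, then minimizing the distance between embeddings that belong to the same agent through fine-tuning and aligning them to human values through reinforcement learning from human feedback. The approach is computationally cheap, adaptive to new classes, interpretable, and controllable. Prompts are classified by their nearest neighbors using the cosine similarity between embeddings, and the abstract reports a top-1 accuracy of 92.2% on test data.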

Link: https://arxiv.org/abs/2501.13333
Authors: Joshua Park, Yongfeng Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 10 pages, 8 figures, preprint

Abstract:Multi-agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. On test data, we are able to achieve a top-1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine-tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open-sourced to the public along with the code for AgentRec recommendation system at this https URL.
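
The nearest-neighbor step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the 3-dimensional vectors stand in for real sentence embeddings, the agent names are invented, and `recommend_agent` is a hypothetical helper rather than part of the AgentRec code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend_agent(prompt_embedding, labeled_embeddings):
    """Return the (agent, similarity) of the most similar labeled embedding."""
    best_agent, best_score = None, -2.0  # cosine similarity is never below -1
    for agent, embedding in labeled_embeddings:
        score = cosine_similarity(prompt_embedding, embedding)
        if score > best_score:
            best_agent, best_score = agent, score
    return best_agent, best_score


# Toy stand-ins for embeddings of prompts previously routed to each agent.
labeled = [
    ("math_agent", [1.0, 0.0, 0.0]),
    ("code_agent", [0.0, 1.0, 0.0]),
    ("travel_agent", [0.0, 0.0, 1.0]),
]

agent, score = recommend_agent([0.9, 0.2, 0.0], labeled)
print(agent, round(score, 3))
```

Adding a new agent only requires appending labeled embeddings, which is one way to read the abstract's claim that the approach adapts to new classes without retraining a classifier head.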

[NLP-45] Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers NAACL2025

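[Quick read]: This paper examines the fairness and robustness of AI Safety Moderation (ASM) classifiers used for content moderation on social media platforms. It asks whether these classifiers unfairly flag content from users in minority groups as unsafe, and whether their behavior stays consistent across similar inputs. The study evaluates four widely used, closed-source ASM classifiers: OpenAI Moderation API, Perspective API, Google Cloud Natural Language (GCNL) API, and Clarifai API. Fairness is assessed with metrics such as demographic parity and conditional statistical parity, and robustness is analyzed by testing the classifiers' sensitivity to small and natural input perturbations. The findings reveal potential fairness and robustness gaps and highlight the need to mitigate them in future versions of these models.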

Link: https://arxiv.org/abs/2501.13302
Authors: Akshit Achara, Anshuman Chhabra
Affiliations: King’s Institute for Artificial Intelligence, King’s College London; Department of Computer Science and Engineering, University of South Florida
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025 Main Conference

Abstract:AI Safety Moderation (ASM) classifiers are designed to moderate content on social media platforms and to serve as guardrails that prevent Large Language Models (LLMs) from being fine-tuned on unsafe inputs. Owing to their potential for disparate impact, it is crucial to ensure that these classifiers: (1) do not unfairly classify content belonging to users from minority groups as unsafe compared to those from majority groups and (2) that their behavior remains robust and consistent across similar inputs. In this work, we thus examine the fairness and robustness of four widely-used, closed-source ASM classifiers: OpenAI Moderation API, Perspective API, Google Cloud Natural Language (GCNL) API, and Clarifai API. We assess fairness using metrics such as demographic parity and conditional statistical parity, comparing their performance against ASM models and a fair-only baseline. Additionally, we analyze robustness by testing the classifiers’ sensitivity to small and natural input perturbations. Our findings reveal potential fairness and robustness gaps, highlighting the need to mitigate these issues in future versions of these models.
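
To make the demographic parity metric concrete, here is a minimal sketch that measures the gap in "unsafe" prediction rates between groups. The group names and predictions are invented toy data, the helper names are hypothetical, and the paper's exact metric definitions (including conditional statistical parity) may differ from this simple gap.

```python
def unsafe_rate(predictions):
    """Fraction of items flagged unsafe (1 = unsafe, 0 = safe)."""
    return sum(predictions) / len(predictions)


def demographic_parity_gap(predictions_by_group):
    """Largest difference in unsafe rates across groups, plus the per-group rates."""
    rates = {
        group: unsafe_rate(preds)
        for group, preds in predictions_by_group.items()
    }
    gap = max(rates.values()) - min(rates.values())
    return gap, rates


# Toy classifier outputs on comparable content from two groups.
toy_predictions = {
    "group_a": [1, 0, 0, 0],
    "group_b": [1, 1, 0, 0],
}

gap, rates = demographic_parity_gap(toy_predictions)
print(gap, rates)
```

A gap of zero would mean every group is flagged at the same rate; larger gaps indicate that the classifier treats comparable content differently depending on the group.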

[NLP-46] Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents NAACL2025

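【Quick Read】: This paper studies how large language models (LLMs) can accelerate materials discovery and design, specifically their ability to generate viable hypotheses that, once validated, could speed up the development of new materials. The key contribution is a new dataset curated from recent journal publications, containing real-world goals, constraints, and methods. On this dataset the authors test LLM-based agents that generate hypotheses for given goals under specific constraints. To assess the relevance and quality of these hypotheses, they propose a scalable evaluation metric that emulates the process a materials scientist would use to evaluate a hypothesis critically. The dataset, method, and evaluation framework are intended to advance future research on LLM-assisted materials discovery and design.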

Link: https://arxiv.org/abs/2501.13299
Authors: Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, Chitta Baral
Affiliations: Arizona State University
Categories: Computation and Language (cs.CL)
Comments: Accepted in NAACL 2025

Abstract:Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this process. We explore the potential of LLMs to generate viable hypotheses that, once validated, can expedite materials discovery. Collaborating with materials science experts, we curated a novel dataset from recent journal publications, featuring real-world goals, constraints, and methods for designing real-world applications. Using this dataset, we test LLM-based agents that generate hypotheses for achieving given goals under specific constraints. To assess the relevance and quality of these hypotheses, we propose a novel scalable evaluation metric that emulates the process a materials scientist would use to evaluate a hypothesis critically. Our curated dataset, proposed method, and evaluation framework aim to advance future research in accelerating materials discovery and design with LLMs.

[NLP-47] RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering NAACL2025

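【Quick Read】: This paper addresses a mismatch in multi-modal retrieval-augmented question answering (MRAQA): traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs). The proposed RAMQA framework combines learning-to-rank methods with generative permutation-enhanced ranking. The authors first train a pointwise multi-modal ranker with LLaVA as the backbone, then instruction-tune a LLaMA model to re-rank the top-k documents using an autoregressive multi-task learning approach. The generative ranking model outputs re-ranked document IDs and specific answers from document candidates presented in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines.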

Link: https://arxiv.org/abs/2501.13297
Authors: Yang Bai, Christan Earl Grant, Daisy Zhe Wang
Affiliations: University of Florida
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by NAACL 2025 Findings

Abstract:Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: this https URL

[NLP-48] Automatic Fact-Checking with Frame-Semantics

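【Quick Read】: This paper tackles automatic fact-checking in the face of misinformation in today’s information ecosystem. It proposes a new paradigm based on frame semantics, which strengthens the structured understanding of claims and thereby improves evidence retrieval. To support the approach, the authors introduce a pilot dataset of real-world claims extracted from PolitiFact and annotated specifically for large-scale structured data. Two case studies validate the method: the first analyzes voting-related claims using the Vote semantic frame, and the second explores various semantic frames and data sources from the Organisation for Economic Co-operation and Development (OECD). The findings indicate that frame semantics is effective at improving evidence retrieval.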

Link: https://arxiv.org/abs/2501.13288
Authors: Jacob Devasier, Rishabh Mediratta, Akshith Putta, Chengkai Li
Affiliations: University of Texas at Arlington
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We propose a novel paradigm for automatic fact-checking that leverages frame semantics to enhance the structured understanding of claims, addressing the challenges posed by misinformation in today’s information ecosystem. To support this approach, we introduce a pilot dataset of real-world claims extracted from PolitiFact, specifically annotated for large-scale structured data. This dataset underpins two case studies: the first investigates voting-related claims using the Vote semantic frame, while the second explores various semantic frames and data sources from the Organisation for Economic Co-operation and Development (OECD). Our findings demonstrate the effectiveness of frame semantics in improving evidence retrieval, indicating a meaningful advancement in automatic fact-checking capabilities. Finally, we conducted a survey of frames evoked in fact-checked claims, identifying high-impact frames to guide future research.

[NLP-49] Toyteller: AI-powered Visual Storytelling Through Toy-Playing with Character Symbols

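【Quick Read】: This paper studies how users can generate a mix of story text and visuals by directly manipulating character symbols, as in playing with toys. The key idea is that anthropomorphized symbol motions can convey rich and nuanced social interactions. By mapping motions and text onto a shared semantic space, the system supports both motion-steered text generation and text-steered motion generation, with large language models and motion generation models using that space as a translation layer. Technical evaluations show that Toyteller outperforms a competitive baseline, GPT-4o. A user study found that toy-playing helps express intentions that are difficult to verbalize, but that motion alone cannot express every user intention, which suggests combining it with other modalities such as language.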

Link: https://arxiv.org/abs/2501.13284
Authors: John Joon Young Chung, Melissa Roemmele, Max Kreminski
Affiliations: Midjourney
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to CHI2025

Abstract:We introduce Toyteller, an AI-powered storytelling system where users generate a mix of story text and visuals by directly manipulating character symbols like they are toy-playing. Anthropomorphized symbol motions can convey rich and nuanced social interactions; Toyteller leverages these motions (1) to let users steer story text generation and (2) as a visual output format that accompanies story text. We enabled motion-steered text generation and text-steered motion generation by mapping motions and text onto a shared semantic space so that large language models and motion generation models can use it as a translational layer. Technical evaluations showed that Toyteller outperforms a competitive baseline, GPT-4o. Our user study identified that toy-playing helps express intentions difficult to verbalize. However, only motions could not express all user intentions, suggesting combining it with other modalities like language. We discuss the design space of toy-playing interactions and implications for technical HCI research on human-AI interaction.

[NLP-50] RAG-Reward: Optimizing RAG with Reward Modeling and RLHF

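【Quick Read】: This paper targets retrieval-augmented generation (RAG) for large language models (LLMs), and in particular how to improve generation quality and trustworthiness on knowledge-intensive tasks. While much work has focused on improving retrieval, generation, and evaluation, the role of reward models in reinforcement learning for optimizing RAG and for building automated benchmarking pipelines remains underexplored. The key contribution is RAG-Reward, a dataset designed to enable hallucination-free, comprehensive, reliable, and efficient RAG. The authors define four key metrics for generation quality and build an automated annotation pipeline that uses multiple LLMs to generate outputs across diverse RAG scenarios, with GPT-4o evaluating outputs and constructing preference data. They then train reward models and apply reinforcement learning with human feedback (RLHF) to improve LLM effectiveness in RAG. The reward model achieves state-of-the-art performance on a held-out test set, and the improved generation quality of the trained policy model highlights the feasibility of using RLHF to enhance RAG pipelines.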

Link: https://arxiv.org/abs/2501.13264
Authors: Hanning Zhang, Juntong Song, Juno Zhu, Yuanhao Wu, Tong Zhang, Cheng Niu
Affiliations: University of Illinois Urbana-Champaign; NewsBreak
Categories: Computation and Language (cs.CL)
Comments: Preprint, work in progress

Abstract: Retrieval-augmented generation (RAG) enhances Large Language Models (LLMs) with relevant and up-to-date knowledge, improving their ability to answer knowledge-intensive questions. It has been shown to enhance both generation quality and trustworthiness. While numerous works have focused on improving retrieval, generation, and evaluation, the role of reward models in reinforcement learning for optimizing RAG and establishing automated benchmarking pipelines remains underexplored. In this paper, we introduce RAG-Reward, a dataset designed to enable hallucination-free, comprehensive, reliable, and efficient RAG. We define four key metrics for assessing generation quality and develop an automated annotation pipeline that leverages multiple LLMs to generate outputs across diverse RAG scenarios. GPT-4o is used to evaluate and construct preference data. Using RAG-Reward, we train reward models and apply reinforcement learning with human feedback (RLHF) to improve LLMs’ effectiveness in RAG. Experimental results show that our reward model achieves state-of-the-art performance on a held-out test set, demonstrating both the effectiveness of our approach and the quality of our dataset. Furthermore, the improved generation quality of the trained policy model highlights the feasibility of using RLHF to enhance RAG pipelines.

[NLP-51] Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data

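【Quick Read】: This paper questions the common practice of pretraining large language models (LLMs) on a consistent data distribution throughout. As a model’s ability improves, it should intuitively be pretrained on differentiated data. The authors propose the Perplexity Difference based Preference Curriculum learning (PDPC) framework, which perceives and uses the data an LLM prefers at each stage of training. The first key component is the perplexity difference (PD) metric, which measures how differently strong and weak models fit a sample; high-PD samples are more challenging for weak models and are better placed in the later stage of pretraining. The second is the PD preference function, which approximates the model and predicts its data preference at any point in training, so the entire data arrangement can be completed offline and training can proceed without interruption. Experiments on 1.3B and 3B models show that PDPC significantly surpasses baselines; the 3B model in particular gains more than 4.1% in average accuracy across various benchmarks.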

Link: https://arxiv.org/abs/2501.13126
Authors: Xuemiao Zhang, Liangyu Xu, Feiyu Duan, Yongwei Zhou, Sirui Wang, Jingang Wang, Xunliang Cai
Affiliations: Peking University; Beihang University; Tsinghua University; Meituan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 13 figures

Abstract:Current large language models (LLMs) generally utilize a consistent data distribution throughout the entire pretraining process. However, as the model’s ability improves, it intuitively should be pretrained with differentiated data. To achieve it, we propose the Perplexity Difference based Preference Curriculum learning (PDPC) framework, which always perceives and uses the data preferred by LLMs to train and boost them. Firstly, we introduce the PD metric to measure the difference in how well strong and weak models fit the samples. Samples with high PD are more challenging for weak models to learn and are more suitable to be arranged in the later stage of pretraining. Secondly, we propose the PD preference function to approximate the model and predict the data preference of the LLM at any time, so as to complete the arrangement of the entire data offline and ensure continuous training without interruption. Experimental results on 1.3B and 3B models demonstrate that our PDPC significantly surpasses baselines. Notably, the 3B model achieved more substantial gains, with an increased average accuracy of over 4.1% across various benchmarks.
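
The ordering idea can be sketched in a few lines. The abstract does not give the PD formula, so the relative gap used below, (weak - strong) / weak, is an assumption, as are the function names and the toy perplexities. The paper's PD preference function is replaced here by a plain sort, which is a simplification.

```python
def perplexity_difference(ppl_weak, ppl_strong):
    """Relative perplexity gap between a weak and a strong model.

    Assumed form (not given in the abstract): (weak - strong) / weak.
    """
    return (ppl_weak - ppl_strong) / ppl_weak

def order_by_pd(samples):
    """Sort samples so that high-PD samples come later in pretraining.

    samples: list of (sample_id, ppl_weak, ppl_strong) tuples.
    """
    return sorted(samples, key=lambda s: perplexity_difference(s[1], s[2]))

# Hypothetical perplexities for three documents.
samples = [("doc_a", 20.0, 10.0), ("doc_b", 12.0, 11.4), ("doc_c", 30.0, 24.0)]
schedule = [sample_id for sample_id, _, _ in order_by_pd(samples)]
```

In this toy case `doc_b` is fitted almost equally well by both models and is scheduled first, while `doc_a`, where the strong model has the largest advantage, is scheduled last.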

[NLP-52] Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction

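【Quick Read】: This paper addresses distractor generation for multiple-choice questions (MCQs) in education, where plausible distractors are crucial for identifying students’ misconceptions and knowledge gaps and for accurately assessing their understanding. Prior work on distractor generation has paid too little attention to making distractors more difficult, which reduces the effectiveness of MCQs. The proposed pipeline has two key steps. First, a pairwise ranker is trained to reason about students’ misconceptions and assess the relative plausibility of two distractors. Second, the ranker is used to build a dataset of pairwise distractor ranks, on which a distractor generator is trained via Direct Preference Optimization (DPO). Experiments show that the pairwise ranker effectively identifies students’ potential misunderstandings with ranking accuracy comparable to human experts, and that the distractor generator outperforms several baselines and produces questions with a higher item discrimination index (DI).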

Link: https://arxiv.org/abs/2501.13125
Authors: Yooseop Lee, Suin Kim, Yohan Jo
Affiliations: Seoul National University; Elice
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In designing multiple-choice questions (MCQs) in education, creating plausible distractors is crucial for identifying students’ misconceptions and gaps in knowledge and accurately assessing their understanding. However, prior studies on distractor generation have not paid sufficient attention to enhancing the difficulty of distractors, resulting in reduced effectiveness of MCQs. This study presents a pipeline for training a model to generate distractors that are more likely to be selected by students. First, we train a pairwise ranker to reason about students’ misconceptions and assess the relative plausibility of two distractors. Using this model, we create a dataset of pairwise distractor ranks and then train a distractor generator via Direct Preference Optimization (DPO) to generate more plausible distractors. Experiments on computer science subjects (Python, DB, MLDL) demonstrate that our pairwise ranker effectively identifies students’ potential misunderstandings and achieves ranking accuracy comparable to human experts. Furthermore, our distractor generator outperforms several baselines in generating plausible distractors and produces questions with a higher item discrimination index (DI).
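
For readers unfamiliar with DPO, the per-pair loss it minimizes can be written compactly. The sketch below uses the standard DPO formulation on sequence log-probabilities; in this setting the "chosen" output would be the distractor the ranker judged more plausible. The `beta` value and the example numbers are placeholders, not settings from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities.

    logp_*: log-probabilities under the policy being trained.
    ref_logp_*: log-probabilities under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), guarded against overflow for very negative margins.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin

# Policy identical to the reference: loss equals log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy already favors the chosen distractor more than the reference does.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls below log(2) only when the policy raises the chosen output relative to the rejected one by more than the reference model does.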

[NLP-53] Debate Helps Weak-to-Strong Generalization AAAI2025

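【Quick Read】: This paper addresses the problem of aligning future superhuman models with desired behavior. Because such models will surpass human capability, humans will only be able to supervise them weakly, which may weaken the safety of future AI systems. The paper combines two complementary approaches, scalable oversight and weak-to-strong generalization. Specifically, it investigates how a strong pretrained model can be used to improve human supervision, and how the enhanced weak supervision can then be used to supervise the strong model. In the empirical analogy, a small weak model is finetuned on ground-truth labels with additional help from a large strong model, and the strong model is then finetuned on labels generated by the weak model. The experiments show that debate can help a weak model extract trustworthy information from an untrustworthy strong model, and that an ensemble of weak models helps exploit the long arguments generated by strong-model debaters to obtain a more robust supervision estimate. On the OpenAI weak-to-strong NLP benchmarks, the combined approach leads to better alignment, indicating that debate has the potential to help weak-to-strong generalization.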

Link: https://arxiv.org/abs/2501.13124
Authors: Hao Lang, Fei Huang, Yongbin Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: AAAI2025 Special Track on AI Alignment (Oral presentation)

Abstract:Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass the capability of humans. Therefore, humans will only be able to weakly supervise superhuman models. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue. In this paper, we attempt to combine the strengths of these two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise the strong model? We empirically test it by finetuning a small weak model on ground truth labels with the additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training a weak model. We also show that an ensemble of weak models helps exploit long arguments generated by strong model debaters and obtain a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.

[NLP-54] Zero-Shot Verification-guided Chain of Thoughts

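【Quick Read】: This paper studies how Chain-of-Thought (COT) prompting and LLM-based self-verification can guide reasoning in a completely zero-shot setting. Most existing work either uses a fine-tuned verifier or relies on manually handcrafted few-shot examples; this paper instead focuses on an LLM verifying its own self-generated reasoning steps without any examples. The key components are a new zero-shot prompt, called COT STEP, that aids the decomposition of reasoning into steps, and two new zero-shot prompts for LLM-based verifiers. The authors evaluate the verifiers’ ability to classify the correctness of reasoning chains and explore different ways of using verifier scores to guide reasoning on mathematical and commonsense reasoning tasks with different LLMs.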

Link: https://arxiv.org/abs/2501.13122
Authors: Jishnu Ray Chowdhury, Cornelia Caragea
Affiliations: University of Illinois at Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Previous works have demonstrated the effectiveness of Chain-of-Thought (COT) prompts and verifiers in guiding Large Language Models (LLMs) through the space of reasoning. However, most such studies either use a fine-tuned verifier or rely on manually handcrafted few-shot examples. In contrast, in this paper, we focus on LLM-based self-verification of self-generated reasoning steps via COT prompts in a completely zero-shot regime. To explore this setting, we design a new zero-shot prompt, which we call COT STEP, to aid zero-shot decomposition of reasoning steps and design two new zero-shot prompts for LLM-based verifiers. We evaluate the verifiers’ ability to classify the correctness of reasoning chains and explore different ways to use verifier scores in guiding reasoning for various mathematical and commonsense reasoning tasks with different LLMs.
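
One simple way verifier scores could guide reasoning is to sample several step-decomposed chains and keep the one whose weakest step scores highest. This is only a sketch: the abstract says the authors explore different ways of using verifier scores, and the min-aggregation rule, the `verify_step` callable, and the stub scores below are all assumptions.

```python
def select_chain(chains, verify_step):
    """Pick the reasoning chain whose weakest step scores highest.

    chains: list of chains, each a list of step strings.
    verify_step: callable(step) -> score in [0, 1]; stands in for a
        zero-shot LLM verifier prompt (hypothetical here).
    """
    def chain_score(chain):
        return min(verify_step(step) for step in chain)
    return max(chains, key=chain_score)

# Stub verifier for illustration: a lookup table instead of an LLM call.
stub_scores = {"2 + 3 = 5": 0.9, "5 * 2 = 10": 0.8, "2 + 3 = 6": 0.1, "6 * 2 = 12": 0.9}
chains = [["2 + 3 = 6", "6 * 2 = 12"], ["2 + 3 = 5", "5 * 2 = 10"]]
best = select_chain(chains, stub_scores.get)
```

Taking the minimum penalizes a chain for a single bad step; a product or mean of step scores would be an equally plausible aggregation.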

[NLP-55] Episodic Memories Generation and Evaluation Benchmark for Large Language Models

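【Quick Read】: This paper addresses the lack of a robust episodic memory mechanism in large language models (LLMs). Episodic memory, the ability to recall specific events grounded in time and space, is a cornerstone of human cognition and supports coherent storytelling, planning, and decision-making. Without it, LLMs are prone to confabulation in their reasoning and outputs. The paper introduces a comprehensive framework for modeling and evaluating LLM episodic memory. Drawing on cognitive science, it represents episodic events in a structured way that captures temporal and spatial context, the entities involved, and detailed descriptions. The authors synthesize a contamination-free episodic memory benchmark and release open-source code and datasets for assessing LLMs on various recall and episodic reasoning tasks. Evaluating state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, they find that even the most advanced LLMs struggle with multiple related events or complex spatio-temporal relationships, even in contexts as short as 10k-100k tokens.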

Link: https://arxiv.org/abs/2501.13121
Authors: Alexis Huet, Zied Ben Houidi, Dario Rossi
Affiliations: Huawei Technologies Co., Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Episodic memory – the ability to recall specific events grounded in time and space – is a cornerstone of human cognition, enabling not only coherent storytelling, but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLM is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and ground their output in real-world episodic events, hence avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships – even in contexts as short as 10k-100k tokens.

[NLP-56] Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness AAAI-2025

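【Quick Read】: This paper studies how non-English prompts affect task performance and fairness when large language models (LLMs) are used to design reward functions. In resource allocation problems such as those in public health, it asks whether prompting in low-resource languages degrades performance, and whether such prompts introduce biases along population groups that the user did not intend.

The approach is an experimental comparison of prompts in English and other languages. Using a synthetic environment, the authors feed prompts translated into multiple languages to the DLM algorithm, a recent method that uses LLMs to design reward functions for Restless Multi-Armed Bandits (RMABs). LLM-proposed reward functions are significantly better with English prompts than with other languages. As prompt complexity increases, performance worsens for all languages, but it is more robust with English prompts than with lower-resource languages. Low-resource languages and more complex prompts are also both highly likely to create unfairness along unintended dimensions.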

Link: https://arxiv.org/abs/2501.13120
Authors: Ambreesh Parthasarathy, Chandrasekar Subramanian, Ganesh Senrayan, Shreyash Adappanavar, Aparna Taneja, Balaraman Ravindran, Milind Tambe
Affiliations: Indian Institute of Technology Madras; Robert Bosch Centre for Data Science and Artificial Intelligence; Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted at the AAAI-2025 Deployable AI Workshop

Abstract:Restless Multi-Armed Bandits (RMABs) have been successfully applied to resource allocation problems in a variety of settings, including public health. With the rapid development of powerful large language models (LLMs), they are increasingly used to design reward functions to better match human preferences. Recent work has shown that LLMs can be used to tailor automated allocation decisions to community needs using language prompts. However, this has been studied primarily for English prompts and with a focus on task performance only. This can be an issue since grassroots workers, especially in developing countries like India, prefer to work in local languages, some of which are low-resource. Further, given the nature of the problem, biases along population groups unintended by the user are also undesirable. In this work, we study the effects on both task performance and fairness when the DLM algorithm, a recent work on using LLMs to design reward functions for RMABs, is prompted with non-English language commands. Specifically, we run the model on a synthetic environment for various prompts translated into multiple languages. The prompts themselves vary in complexity. Our results show that the LLM-proposed reward functions are significantly better when prompted in English compared to other languages. We also find that the exact phrasing of the prompt impacts task performance. Further, as prompt complexity increases, performance worsens for all languages; however, it is more robust with English prompts than with lower-resource languages. On the fairness side, we find that low-resource languages and more complex prompts are both highly likely to create unfairness along unintended dimensions.

[NLP-57] MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking

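【Quick Read】: This paper aims to improve the quality and coherence of the reasoning process of large language models (LLMs) in reasoning and decision-making tasks. Although LLMs perform well on these tasks, their reasoning can still benefit from enhanced introspection and self-reflection. The paper proposes Multiplex CoT (Chain of Thought), which simulates a form of self-review through double Chain of Thought thinking: the model first generates an initial chain of thought, then critiques and refines that reasoning in a second round of thought generation. This iterative approach is intended to produce more coherent, logical, and robust answers. The method is implemented with simple prompt engineering on existing LLM architectures and, according to the authors, achieves an effect similar to that of the Learning-Refinement Model (LRM) without additional training. The paper also provides a practical guide for implementing the method in Google Colab.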

Link: https://arxiv.org/abs/2501.13117
Authors: Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Tianhao Xu
Affiliations: Data Dream, AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in large language models (LLMs) have demonstrated their impressive abilities in various reasoning and decision-making tasks. However, the quality and coherence of the reasoning process can still benefit from enhanced introspection and self-reflection. In this paper, we introduce Multiplex CoT (Chain of Thought), a method that enables LLMs to simulate a form of self-review while reasoning, by initiating double Chain of Thought (CoT) thinking. Multiplex CoT leverages the power of iterative reasoning, where the model generates an initial chain of thought and subsequently critiques and refines this reasoning with a second round of thought generation. This recursive approach allows for more coherent, logical, and robust answers, improving the overall decision-making process. We demonstrate how this method can be effectively implemented using simple prompt engineering in existing LLM architectures, achieving an effect similar to that of the Learning-Refinement Model (LRM) without the need for additional training. Additionally, we present a practical guide for implementing the method in Google Colab, enabling easy integration into real-world applications.
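
The two-round flow described above can be sketched as a small wrapper around any text-generation call. The prompt wording below is an assumption and not the paper's prompt, and `generate` is a hypothetical callable standing in for an LLM API; a stub is used so the flow can be inspected offline.

```python
def multiplex_cot(question, generate):
    """Two-pass prompting: draft a chain of thought, then critique and refine it.

    generate: callable(prompt) -> str; stands in for any LLM call (hypothetical).
    """
    draft_prompt = f"Question: {question}\nThink step by step and give an answer."
    draft = generate(draft_prompt)
    review_prompt = (
        f"Question: {question}\n"
        f"Initial reasoning:\n{draft}\n"
        "Review the reasoning above for errors or gaps, then give a refined answer."
    )
    return draft, generate(review_prompt)

# Stub model that records the prompts it receives.
calls = []
def stub_generate(prompt):
    calls.append(prompt)
    return f"response {len(calls)}"

draft, refined = multiplex_cot("What is 17 * 6?", stub_generate)
```

The second prompt embeds the first response verbatim, so the review round sees exactly the reasoning it is asked to critique.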

[NLP-58] Dagger Behind Smile: Fool LLMs with a Happy Ending Story

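【Quick Read】: This paper studies the vulnerability of large language models (LLMs) to jailbreak attacks. Existing attacks are either optimization-based, with limited efficiency and transferability, or manually designed, in which case they are easily detected or require intricate interactions with the LLM. The paper offers a new perspective: LLMs are more responsive to positive prompts. Building on this, the authors propose the Happy Ending Attack (HEA), which wraps a malicious request in a scenario template containing a positive prompt formed mainly through a happy ending, inducing the LLM to produce malicious content within one or two steps. Experiments report an average attack success rate of 88.79% on state-of-the-art LLMs including GPT-4o, Llama3-70b, and Gemini-pro, and the paper offers potential quantitative explanations for the attack’s success.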

Link: https://arxiv.org/abs/2501.13115
Authors: Xurui Song, Zhixin Xie, Shuo Huai, Jiayi Kong, Jun Luo
Affiliations: College of Computing and Data Science, Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract: The wide adoption of Large Language Models (LLMs) has attracted significant attention from jailbreak attacks, where adversarial prompts crafted through optimization or manual design exploit LLMs to generate malicious content. However, optimization-based attacks have limited efficiency and transferability, while manual designs are either easily detectable or demand intricate interactions with LLMs. In this paper, we first point out a novel perspective for jailbreak attacks: LLMs are more responsive to positive prompts. Based on this, we deploy Happy Ending Attack (HEA) to wrap up a malicious request in a scenario template involving a positive prompt formed mainly via a happy ending, it thus fools LLMs into jailbreaking either immediately or at a follow-up malicious request. This has made HEA both efficient and effective, as it requires only up to two steps to fully jailbreak LLMs. Extensive experiments show that our HEA can successfully jailbreak on state-of-the-art LLMs, including GPT-4o, Llama3-70b, Gemini-pro, and achieves 88.79% Attack Success Rate on average. We also provide potential quantitative explanations for the success of HEA.

Computer Vision

[CV-0] Fast3R: Towards 3D Reconstruction of 1000 Images in One Forward Pass

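【Quick Read】: This paper addresses a core challenge in multi-view 3D reconstruction, particularly in applications that need accurate and scalable representations across diverse viewpoints. Current leading methods such as DUSt3R process images in pairs, which requires costly global alignment procedures to reconstruct from multiple views. The proposed Fast 3D Reconstruction (Fast3R) processes many views in parallel and so avoids iterative alignment. Its key innovation is a Transformer-based architecture that forwards N images in a single forward pass, which significantly improves inference speed and reduces error accumulation. Experiments on camera pose estimation and 3D reconstruction show state-of-the-art performance, with enhanced scalability and no loss of reconstruction accuracy.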

Link: https://arxiv.org/abs/2501.13928
Authors: Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli
Affiliations: Meta; University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
Comments: Project website: this https URL

Abstract:Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R’s Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.

[CV-1] GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

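【Quick Read】: This paper addresses the limitations of large multimodal models (LMMs) in understanding remote sensing (RS) imagery. Although LMMs have made notable progress in fine-grained grounding on natural images, they perform poorly on RS imagery: the distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution images pose a unique challenge for region-level comprehension. The lack of granular, RS domain-specific grounded data further hinders the development of grounded conversation capabilities for LMMs in this domain.

To address these issues, the paper proposes GeoPixel, described as the first end-to-end high-resolution RS-LMM that supports pixel-level grounding. GeoPixel achieves fine-grained visual perception by generating interleaved masks in conversation, and it supports up to 4K HD resolution in any aspect ratio, which suits high-precision RS image analysis. To support grounded conversation generation (GCG) in RS imagery, the authors curate a visually grounded dataset, GeoPixelD, through a semi-automated pipeline that uses set-of-marks prompting and spatial priors tailored to RS data to methodically control data generation. GeoPixel surpasses existing LMMs in both single-target and multi-target segmentation tasks, and ablation studies validate the effectiveness of each component of the architecture.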

Link: https://arxiv.org/abs/2501.13925
Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support the grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset GeoPixelD through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.

[CV-2] Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization ICLR2025

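【Quick Read】: This paper tackles Multimodal Open-set Test-time Adaptation (MM-OSTTA): at test time, a model must adapt online to unlabeled target-domain data that contains unknown classes and spans multiple modalities. Existing methods mainly target unimodal open-set test-time adaptation (OSTTA) and do not handle the complexity of multimodal data. The paper proposes the Adaptive Entropy-aware Optimization (AEO) framework, built on two key components: Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP). By amplifying the entropy difference between known and unknown samples, these components strengthen the model’s ability to distinguish unknown-class samples during online adaptation. Experiments show that AEO performs well on MM-OSTTA, including in long-term and continual adaptation settings.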

Link: https://arxiv.org/abs/2501.13924
Authors: Hao Dong, Eleni Chatzi, Olga Fink
Affiliations: ETH Zürich; EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICLR 2025

Abstract:Test-time adaptation (TTA) has demonstrated significant potential in addressing distribution shifts between training and testing data. Open-set test-time adaptation (OSTTA) aims to adapt a source pre-trained model online to an unlabeled target domain that contains unknown classes. This task becomes more challenging when multiple modalities are involved. Existing methods have primarily focused on unimodal OSTTA, often filtering out low-confidence samples without addressing the complexities of multimodal data. In this work, we present Adaptive Entropy-aware Optimization (AEO), a novel framework specifically designed to tackle Multimodal Open-set Test-time Adaptation (MM-OSTTA) for the first time. Our analysis shows that the entropy difference between known and unknown samples in the target domain strongly correlates with MM-OSTTA performance. To leverage this, we propose two key components: Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP). These components enhance the ability of model to distinguish unknown class samples during online adaptation by amplifying the entropy difference between known and unknown samples. To thoroughly evaluate our proposed methods in the MM-OSTTA setting, we establish a new benchmark derived from existing datasets. This benchmark includes two downstream tasks and incorporates five modalities. Extensive experiments across various domain shift situations demonstrate the efficacy and versatility of the AEO framework. Additionally, we highlight the strong performance of AEO in long-term and continual MM-OSTTA settings, both of which are challenging and highly relevant to real-world applications. Our source code is available at this https URL.
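
The quantity at the center of this analysis, prediction entropy, is easy to compute. The sketch below shows the entropy of a softmax distribution for a confident and an uncertain prediction. It illustrates only the signal being amplified, not the UAE or AMP objectives, and the logits are made-up values.

```python
import math

def softmax_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = softmax_entropy([8.0, 0.0, 0.0])  # peaked prediction, low entropy
uncertain = softmax_entropy([1.0, 1.0, 1.0])  # uniform prediction, entropy = ln(3)
gap = uncertain - confident
```

A larger gap between the typical entropy of known-class and unknown-class samples makes a simple entropy threshold more reliable for rejecting unknowns.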

[CV-3] Improving Video Generation with Human Feedback

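Quick read: This paper addresses two persistent problems in video generation: unsmooth motion and misalignment between videos and prompts. It proposes a systematic pipeline that uses human feedback to refine video generation models. The key components are a large-scale human preference dataset with pairwise annotations across multiple dimensions and a multi-dimensional video reward model, VideoReward. From a unified reinforcement learning perspective, the paper also proposes three alignment algorithms for flow-based models: Flow-DPO (direct preference optimization), Flow-RWR (reward weighted regression), and Flow-NRG (inference-time reward guidance). Experiments indicate that VideoReward significantly outperforms existing reward models, and that Flow-DPO outperforms both Flow-RWR and standard supervised fine-tuning. Flow-NRG lets users assign custom weights to multiple objectives at inference time to meet personalized video quality needs.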

Link: https://arxiv.org/abs/2501.13918
Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Affiliations: The Chinese University of Hong Kong; Tsinghua University; Kuaishou Technology; Shanghai Jiao Tong University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Notes:

Abstract:Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: this https URL.
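
Flow-DPO is described as an extension of direct preference optimization to flow-based models. The block below is not that extension. It is a minimal sketch of the standard DPO objective on scalar log-probabilities, included only to show the preference signal that such methods build on. The function name `dpo_loss`, the default `beta=0.1`, and the example numbers are assumptions for illustration.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    # Standard DPO loss for one preference pair:
    #   -log sigmoid(beta * [(logp_win - ref_logp_win)
    #                        - (logp_lose - ref_logp_lose)])
    # logp_* come from the model being tuned, ref_logp_* from a frozen
    # reference model.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Assumed example: the tuned model raises the preferred sample's
# log-probability and lowers the rejected one's, relative to the reference.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
```

When the tuned model and the reference agree, the loss equals log 2; it falls below that as the model favors the preferred sample more strongly than the reference does.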

[CV-4] Binary Diffusion Probabilistic Model

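Quick read: This paper addresses the limitations of conventional denoising diffusion probabilistic models (DDPMs) on discrete or binary data. Conventional DDPMs rely on continuous data representations, are trained with a mean squared error (MSE) loss, and use Gaussian noise models, which may be suboptimal for binary data. The proposed Binary Diffusion Probabilistic Model (BDPM) decomposes images into bitplanes, applies XOR-based noise transformations, and trains the denoising model with a binary cross-entropy loss. This enables precise noise control and computationally efficient inference, significantly lowering computational cost and improving convergence. BDPM performs strongly on image super-resolution, inpainting, and blind image restoration, and it reaches its best results in fewer inference steps than conventional DDPMs.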

Link: https://arxiv.org/abs/2501.13915
Authors: Vitaliy Kinakh, Slava Voloshynovskiy
Affiliations: University of Geneva
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:We introduce the Binary Diffusion Probabilistic Model (BDPM), a novel generative model optimized for binary data representations. While denoising diffusion probabilistic models (DDPMs) have demonstrated notable success in tasks like image synthesis and restoration, traditional DDPMs rely on continuous data representations and mean squared error (MSE) loss for training, applying Gaussian noise models that may not be optimal for discrete or binary data structures. BDPM addresses this by decomposing images into bitplanes and employing XOR-based noise transformations, with a denoising model trained using binary cross-entropy loss. This approach enables precise noise control and computationally efficient inference, significantly lowering computational costs and improving model convergence. When evaluated on image restoration tasks such as image super-resolution, inpainting, and blind image restoration, BDPM outperforms state-of-the-art methods on the FFHQ, CelebA, and CelebA-HQ datasets. Notably, BDPM requires fewer inference steps than traditional DDPM models to reach optimal results, showcasing enhanced inference efficiency.
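
The two data-side ingredients named in the abstract, bitplane decomposition and XOR-based noise, can be sketched in a few lines. This is an illustration under assumptions: 8-bit pixel values in a flat list, and noise modeled as independent bit flips with a single probability `flip_prob`. The paper's actual noise schedule and denoiser are not reproduced, and all function names are hypothetical.

```python
import random

def to_bitplanes(pixels, bits=8):
    # Plane k holds bit k (k = 0 is the least significant bit) of every pixel.
    return [[(p >> k) & 1 for p in pixels] for k in range(bits)]

def from_bitplanes(planes):
    # Inverse of to_bitplanes: reassemble integer pixel values.
    n = len(planes[0])
    return [sum(planes[k][i] << k for k in range(len(planes))) for i in range(n)]

def xor_noise(planes, flip_prob, rng):
    # Draw a random binary mask and XOR it into every plane.
    # Returns both the noisy planes and the mask that was applied.
    mask = [[1 if rng.random() < flip_prob else 0 for _ in plane] for plane in planes]
    noisy = [[b ^ z for b, z in zip(plane, m)] for plane, m in zip(planes, mask)]
    return noisy, mask

pixels = [0, 5, 128, 255]
planes = to_bitplanes(pixels)
noisy, mask = xor_noise(planes, flip_prob=0.3, rng=random.Random(0))
# XOR is its own inverse, so applying the same mask again restores the input.
restored = [[b ^ z for b, z in zip(plane, m)] for plane, m in zip(noisy, mask)]
```

Because XOR is self-inverse, a model that predicts the mask exactly recovers the clean bitplanes, which is one way to see why a binary cross-entropy loss on bits is a natural fit for this setup.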

[CV-5] PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection

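Quick read: This paper addresses single point-supervised oriented object detection (OOD). Existing methods typically need additional priors to generate pseudo rotated boxes, whereas the proposed PointOBB-v3 framework generates them from single-point supervision without additional priors and also supports an end-to-end training paradigm. The key is to use three image views (the original view, a resized view, and a rotated/flipped view) and to build a scale augmentation module and an angle acquisition module on top of them. The scale augmentation module introduces a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module to improve object scale estimation; the angle acquisition module uses symmetry-based self-supervised learning for precise angle prediction. The end-to-end version integrates a detector branch, which eliminates the pseudo-label generation step, and adds an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. Across multiple datasets, PointOBB-v3 improves accuracy by 3.56% on average over previous state-of-the-art methods.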

Link: https://arxiv.org/abs/2501.13898
Authors: Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li
Affiliations: School of Computer Science, Wuhan University; School of Remote Sensing and Information Engineering, Wuhan University; Department of Automation, Shanghai Jiao Tong University; School of Automation, Southeast University; SEIE, Harbin Institute of Technology; S-Lab, CCDS, Nanyang Technological University; CSE & SAI, Shanghai Jiao Tong University; The Chinese University of Hong Kong; Ant Group; VCIP, CS, Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 16 pages, 5 figures, 10 tables

Abstract:With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on the views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model’s ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at this https URL.

[CV-6] Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning

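Quick read: This paper addresses fine-grained visual understanding, specifically how to generate pixel-aligned, instance-specific captions for each object in an image. The authors present Pix2Cap-COCO, the first panoptic pixel-level caption dataset. An automated annotation pipeline prompts GPT-4V to produce pixel-aligned, instance-specific captions, helping models learn finer-grained relationships between objects and their contexts. The dataset contains 167,254 detailed captions averaging 22.94 words each. Building on Pix2Cap-COCO, the authors introduce a new task, panoptic segmentation-captioning, which requires a model to recognize the instances in an image and generate a detailed description for each one at the same time. To benchmark the task they design a baseline based on X-Decoder. Experiments show that Pix2Cap-COCO is a particularly challenging dataset because it demands both fine-grained visual understanding and detailed language generation. The authors also use Pix2Cap-COCO for supervised fine-tuning (SFT) of large multimodal models (LMMs), which significantly improves performance on visual understanding and language generation tasks.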

Link: https://arxiv.org/abs/2501.13893
Authors: Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu
Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation. Furthermore, we leverage Pix2Cap-COCO for Supervised Fine-Tuning (SFT) on large multimodal models (LMMs) to enhance their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains in CIDEr +1.4%, ROUGE +0.4%, and SPICE +0.5% on Visual Genome dataset, and strengthens its region understanding ability on the ViP-BENCH, with an overall improvement of +5.1%, including notable increases in recognition accuracy +11.2% and language generation quality +22.2%.

[CV-7] Generating Realistic Forehead-Creases for User Verification via Conditioned Piecewise Polynomial Curves WACV

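Quick read: This paper addresses image generation for forehead-crease verification systems, in particular how to generate forehead-crease images that are both realistic and diverse. It proposes a geometric modeling approach that uses B-spline and Bézier curves to model the shape of forehead creases, producing high-quality images that contain both principal creases and non-prominent crease patterns. These geometrically rendered images serve as visual prompts for a diffusion-based Edge-to-Image translation model, which generates the corresponding mated samples used to train a forehead-crease verification network. To increase intra-subject diversity, the paper uses two strategies: (a) perturbing the B-spline control points under defined constraints so that labels stay consistent, and (b) applying image-level augmentations to the geometric visual prompts, such as dropout and elastic transformations tailored to crease patterns. Combining the synthetic dataset with real data significantly improves forehead-crease verification performance under a cross-database verification protocol.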

Link: https://arxiv.org/abs/2501.13889
Authors: Abhishek Tandon, Geetanjali Sharma, Gaurav Jaswal, Aditya Nigam, Raghavendra Ramachandra
Affiliations: Indian Institute of Technology Mandi, India; Technology Innovation Hub - Indian Institute of Technology, Mandi; Norwegian University of Science and Technology (NTNU), Norway
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at WACV-W 2025

Abstract:We propose a trait-specific image generation method that models forehead creases geometrically using B-spline and Bézier curves. This approach ensures the realistic generation of both principal creases and non-prominent crease patterns, effectively constructing detailed and authentic forehead-crease images. These geometrically rendered images serve as visual prompts for a diffusion-based Edge-to-Image translation model, which generates corresponding mated samples. The resulting novel synthetic identities are then used to train a forehead-crease verification network. To enhance intra-subject diversity in the generated samples, we employ two strategies: (a) perturbing the control points of B-splines under defined constraints to maintain label consistency, and (b) applying image-level augmentations to the geometric visual prompts, such as dropout and elastic transformations, specifically tailored to crease patterns. By integrating the proposed synthetic dataset with real-world data, our method significantly improves the performance of forehead-crease verification systems under a cross-database verification protocol.
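
To make the curve-based modeling concrete, the sketch below evaluates a Bézier curve with De Casteljau's algorithm and jitters its control points within a bounded range, in the spirit of strategy (a). It is an illustration only: the control points, the bound `max_shift`, and the uniform jitter are assumptions, and the paper's actual constraints and its B-spline handling are not reproduced.

```python
import random

def bezier_point(ctrl, t):
    # De Casteljau's algorithm: repeatedly interpolate between neighboring
    # control points until a single point remains.
    pts = list(ctrl)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

def sample_curve(ctrl, n=50):
    # Sample n points along the curve at evenly spaced parameter values.
    return [bezier_point(ctrl, i / (n - 1)) for i in range(n)]

def perturb(ctrl, max_shift, rng):
    # Hypothetical constraint: each coordinate moves by at most max_shift,
    # so the perturbed curve stays close to the original crease.
    return [(x + rng.uniform(-max_shift, max_shift),
             y + rng.uniform(-max_shift, max_shift)) for x, y in ctrl]

# One assumed "crease" described by four control points (a cubic curve).
crease = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
variant = perturb(crease, max_shift=0.2, rng=random.Random(0))
polyline = sample_curve(variant, n=20)
```

Keeping `max_shift` small relative to the spacing between creases is one simple way to keep a perturbed sample recognizable as the same synthetic identity.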

[CV-8] Multimodal Sensor Dataset for Monitoring Older Adults Post Lower-Limb Fractures in Community Settings

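Quick read: This paper addresses the social isolation and functional decline that older adults face while recovering from lower-limb fractures (LLF). These problems hinder rehabilitation and can harm both physical and mental health. The proposed approach uses multi-modal sensor platforms that collect data continuously and analyze it with machine-learning algorithms, enabling remote monitoring of this population. Such platforms can identify individuals at risk of isolation and decline and alert clinicians in time. The paper introduces a new publicly available dataset, MAISON-LLF, collected from older adults recovering in community settings; it includes data from smartphones, smartwatches, motion detectors, sleep-tracking mattresses, and clinical questionnaires. Supervised machine-learning and deep-learning models trained on the sensor and questionnaire data provide a foundational comparison for the research community.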

Link: https://arxiv.org/abs/2501.13888
Authors: Ali Abedi, Charlene H. Chu, Shehroz S. Khan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Lower-Limb Fractures (LLF) are a major health concern for older adults, often leading to reduced mobility and prolonged recovery, potentially impairing daily activities and independence. During recovery, older adults frequently face social isolation and functional decline, complicating rehabilitation and adversely affecting physical and mental health. Multi-modal sensor platforms that continuously collect data and analyze it using machine-learning algorithms can remotely monitor this population and infer health outcomes. They can also alert clinicians to individuals at risk of isolation and decline. This paper presents a new publicly available multi-modal sensor dataset, MAISON-LLF, collected from older adults recovering from LLF in community settings. The dataset includes data from smartphone and smartwatch sensors, motion detectors, sleep-tracking mattresses, and clinical questionnaires on isolation and decline. The dataset was collected from ten older adults living alone at home for eight weeks each, totaling 560 days of 24-hour sensor data. For technical validation, supervised machine-learning and deep-learning models were developed using the sensor and clinical questionnaire data, providing a foundational comparison for the research community.

[CV-9] Eye Gaze as a Signal for Conveying User Attention in Contextual AI Systems

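Quick read: This paper explores the role of eye tracking in conveying a user's attention relative to the physical environment when multimodal AI agents collaborate with users on real-world problems. The hypothesis is that attention information obtained through eye tracking improves an agent's contextual understanding. By observing hours of human-object interaction data, the authors first measure the relationship between an eye tracker's signal quality and its ability to reliably place gaze on nearby physical objects. They then run experiments that pass the user's scanpath history to multimodal agents as additional context. The results show that eye tracking is a high-value user attention signal and can convey information about the user's current task and interests. The key to the approach is capturing the distribution of user attention with eye tracking and integrating it as context into the agent's decision process, which improves the quality of interaction.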

Link: https://arxiv.org/abs/2501.13878
Authors: Ethan Wilson, Naveen Sendhilnathan, Charlie S. Burlingham, Yusuf Mansour, Robert Cavin, Sai Deep Tetali, Ajoy Savio Fernandes, Michael J. Proulx
Affiliations: Meta Reality Labs Research, USA; Meta Reality Labs, USA
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Advanced multimodal AI agents can now collaborate with users to solve challenges in the world. We explore eye tracking’s role in such interaction to convey a user’s attention relative to the physical environment. We hypothesize that this knowledge improves contextual understanding for AI agents. By observing hours of human-object interactions, we first measure the relationship between an eye tracker’s signal quality and its ability to reliably place gaze on nearby physical objects. We then conduct experiments which relay the user’s scanpath history as additional context querying multimodal agents. Our results show that eye tracking provides high value as a user attention signal and can convey information about the user’s current task and interests to the agent.

[CV-10] Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning

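Quick read: This paper addresses the modality gap in Compositional Zero-Shot Learning (CZSL): textual prototypes cannot fully capture the optimal representation of every class prototype, especially for classes with fine-grained features. It proposes a Dual-Modal Prototype Joint Learning framework based on Vision-Language Models (VLMs). The key idea is to introduce prototypes in both the textual and the visual modality. Textual prototypes capture broad conceptual information and help the model generalize to unseen compositions, while visual prototypes mitigate classification errors caused by the modality gap and capture fine-grained details that distinguish images with similar appearances. Specialized decomposition modules and a joint learning strategy optimize both kinds of prototypes so that they capture key category information during training and serve as reference targets during inference. Experiments show state-of-the-art performance in the closed-world setting and competitive performance in the open-world setting, supporting the method's effectiveness for compositional generalization.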

Link: https://arxiv.org/abs/2501.13859
Authors: Shiyu Zhang, Cheng Yan, Yang Liu, Chenchen Jing, Lei Zhou, Wenjun Wang
Affiliations: Tianjin University; Zhejiang University; Hainan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions of attributes and objects by leveraging knowledge learned from seen compositions. Recent approaches have explored the use of Vision-Language Models (VLMs) to align textual and visual modalities. These methods typically employ prompt engineering, parameter-tuning, and modality fusion to generate rich textual prototypes that serve as class prototypes for CZSL. However, the modality gap results in textual prototypes being unable to fully capture the optimal representations of all class prototypes, particularly those with fine-grained features, which can be directly obtained from the visual modality. In this paper, we propose a novel Dual-Modal Prototype Joint Learning framework for the CZSL task. Our approach, based on VLMs, introduces prototypes in both the textual and visual modalities. The textual prototype is optimized to capture broad conceptual information, aiding the model’s generalization across unseen compositions. Meanwhile, the visual prototype is used to mitigate the classification errors caused by the modality gap and capture fine-grained details to distinguish images with similar appearances. To effectively optimize these prototypes, we design specialized decomposition modules and a joint learning strategy that enrich the features from both modalities. These prototypes not only capture key category information during training but also serve as crucial reference targets during inference. Experimental results demonstrate that our approach achieves state-of-the-art performance in the closed-world setting and competitive performance in the open-world setting across three publicly available CZSL benchmarks. These findings validate the effectiveness of our method in advancing compositional generalization.

[CV-11] First Lessons Learned of an Artificial Intelligence Robotic System for Autonomous Coarse Waste Recycling Using Multispectral Imaging-Based Methods

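Quick read: This paper addresses the inefficiency of manual material sorting at current coarse-waste disposal facilities, where large quantities of recyclable material are lost. It proposes automation along two lines. The first is material classification from multispectral images covering ultraviolet (UV), visible (VIS), near infrared (NIR), and short-wave infrared (SWIR), which supports object detection with material classification in mixed waste piles. The second is autonomous control of hydraulic heavy machinery using AI-based controllers and cost-effective cameras, with the goal of automating the sorting of bulky waste. Together, multispectral imaging and AI-based control are intended to raise sorting efficiency and the recovery rate of recyclable materials.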

Link: https://arxiv.org/abs/2501.13855
Authors: Timo Lange, Ajish Babu, Philipp Meyer, Matthis Keppner, Tim Tiedemann, Martin Wittmaier, Sebastian Wolff, Thomas Vögele
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Notes: Published in Proceedings of Sardinia 2023, 19th International Symposium on Waste Management, Resource Recovery and Sustainable Landfilling

Abstract:Current disposal facilities for coarse-grained waste perform manual sorting of materials with heavy machinery. Large quantities of recyclable materials are lost to coarse waste, so more effective sorting processes must be developed to recover them. Two key aspects of automating the sorting process are object detection with material classification in mixed piles of waste, and autonomous control of hydraulic machinery. Because most objects in those accumulations of waste are damaged or destroyed, object detection alone is not feasible in the majority of cases. To address these challenges, we propose a classification of materials with multispectral images of the ultraviolet (UV), visual (VIS), near infrared (NIR), and short-wave infrared (SWIR) spectra. A solution for the autonomous control of hydraulic heavy machines for sorting bulky waste is being investigated using cost-effective cameras and artificial intelligence-based controllers.

[CV-12] Where Do You Go? Pedestrian Trajectory Prediction using Scene Features

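Quick read: This paper addresses accuracy in pedestrian trajectory prediction, motivated by autonomous-vehicle safety and the reduction of traffic accidents involving pedestrians. Existing work focuses mainly on modeling interactions among pedestrians, while the influence of environmental factors and scene-object layout has received less attention. The paper proposes a trajectory prediction model that combines pedestrian interactions with environmental context. First, a sparse graph framework captures spatial and temporal interactions among pedestrians. Second, image enhancement and semantic segmentation extract detailed scene features. Third, a cross-attention mechanism fuses scene and interaction features so that the model can prioritize the environmental factors relevant to pedestrian movement. Finally, a temporal convolutional network processes the fused features to predict future trajectories. Experiments show that the method significantly outperforms existing state-of-the-art approaches, with an ADE (Average Displacement Error) of 0.252 meters and an FDE (Final Displacement Error) of 0.372 meters, underscoring the value of combining social interactions and environmental context.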

Link: https://arxiv.org/abs/2501.13848
Authors: Mohammad Ali Rezaei, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Affiliations: Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran; Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted by 2024 International Conference on Intelligent Computing and its Emerging Applications

Abstract:Accurate prediction of pedestrian trajectories is crucial for enhancing the safety of autonomous vehicles and reducing traffic fatalities involving pedestrians. While numerous studies have focused on modeling interactions among pedestrians to forecast their movements, the influence of environmental factors and scene-object placements has been comparatively underexplored. In this paper, we present a novel trajectory prediction model that integrates both pedestrian interactions and environmental context to improve prediction accuracy. Our approach captures spatial and temporal interactions among pedestrians within a sparse graph framework. To account for pedestrian-scene interactions, we employ advanced image enhancement and semantic segmentation techniques to extract detailed scene features. These scene and interaction features are then fused through a cross-attention mechanism, enabling the model to prioritize relevant environmental factors that influence pedestrian movements. Finally, a temporal convolutional network processes the fused features to predict future pedestrian trajectories. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches, achieving ADE and FDE values of 0.252 and 0.372 meters, respectively, underscoring the importance of incorporating both social interactions and environmental context in pedestrian trajectory prediction.
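
The two metrics reported above are simple to compute for a single trajectory. The sketch below uses the usual definitions: ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE is the distance at the final step. The toy trajectories are assumptions for illustration, and the paper's exact evaluation protocol (for example, averaging over pedestrians or choosing the best of several samples) is not reproduced.

```python
import math

def ade(pred, truth):
    # Average Displacement Error: mean Euclidean distance over all steps.
    return sum(math.dist(p, g) for p, g in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    # Final Displacement Error: Euclidean distance at the last step.
    return math.dist(pred[-1], truth[-1])

# Assumed toy trajectories in meters, one (x, y) position per time step.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
observed = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
avg_err = ade(predicted, observed)   # distances 0, 1, 2 -> mean 1.0
final_err = fde(predicted, observed)  # distance at last step -> 2.0
```

FDE is usually at least as large as ADE in practice because prediction error tends to accumulate over the horizon, as it does in this toy case.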

[CV-13] MV-GMN: State Space Model for Multi-View Action Recognition

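Quick read: This paper addresses the computational cost of Transformer-based models for multi-view action recognition, especially in scenarios with multiple views and multiple temporal sequences. It proposes MV-GMN, a state-space model designed to aggregate multi-modal data (RGB and skeleton), multi-view information, and multi-temporal information efficiently while reducing computational complexity. The core of MV-GMN is a Multi-View Graph Mamba network built from a series of MV-GMN blocks. Each block contains a Bidirectional State Space Block and a graph convolutional network (GCN) module. The Bidirectional State Space Block introduces four scanning strategies, including view-prioritized and time-prioritized approaches, and the GCN module builds its graph with rule-based and KNN-based methods to integrate features across viewpoints and time instances. MV-GMN outperforms prior work on several datasets; on NTU RGB+D 120 it reaches 97.3% accuracy in the cross-subject setting and 96.7% in the cross-view setting. It also surpasses Transformer-based baselines while requiring only linear inference complexity, which reduces computational load and improves the scalability and applicability of multi-view action recognition.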

Link: https://arxiv.org/abs/2501.13829
Authors: Yuhui Lin, Jiaxuan Lu, Yue Yong, Jiahao Zhang
Affiliations: Xi’an Jiaotong-Liverpool University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Recent advancements in multi-view action recognition have largely relied on Transformer-based models. While effective and adaptable, these models often require substantial computational resources, especially in scenarios with multiple views and multiple temporal sequences. Addressing this limitation, this paper introduces the MV-GMN model, a state-space model specifically designed to efficiently aggregate multi-modal data (RGB and skeleton), multi-view perspectives, and multi-temporal information for action recognition with reduced computational complexity. The MV-GMN model employs an innovative Multi-View Graph Mamba network comprising a series of MV-GMN blocks. Each block includes a proposed Bidirectional State Space Block and a GCN module. The Bidirectional State Space Block introduces four scanning strategies, including view-prioritized and time-prioritized approaches. The GCN module leverages rule-based and KNN-based methods to construct the graph network, effectively integrating features from different viewpoints and temporal instances. Demonstrating its efficacy, MV-GMN outperforms state-of-the-art methods on several datasets, achieving notable accuracies of 97.3% and 96.7% on the NTU RGB+D 120 dataset in cross-subject and cross-view scenarios, respectively. MV-GMN also surpasses Transformer-based baselines while requiring only linear inference complexity, underscoring the model’s ability to reduce computational load and enhance the scalability and applicability of multi-view action recognition technologies.

[CV-14] Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data

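Quick read: This paper addresses shortcut learning in deep neural networks (DNNs) used in high-stakes medical applications, where spurious correlations can lead to potentially fatal consequences in practice. It proposes a semi-automated framework that uses insights from eXplainable Artificial Intelligence (XAI) to identify spurious behavior from both the data and the model perspective. The framework retrieves spurious data points and detects the model circuits that encode the associated prediction rules. The paper also shows how these shortcut encodings can drive XAI-based sample-level and pixel-level data annotation, giving bias mitigation methods the information they need to remove undesired shortcut behavior. In experiments on four medical datasets, the framework identifies and mitigates biases in VGG16, ResNet50, and contemporary Vision Transformer models, improving their robustness and applicability for real-world medical tasks.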

Link: https://arxiv.org/abs/2501.13818
Authors: Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Affiliations: Fraunhofer Heinrich Hertz Institute; Technische Universität Berlin; Berlin Institute for the Foundations of Learning and Data
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:Deep neural networks are increasingly employed in high-stakes medical applications, despite their tendency for shortcut learning in the presence of spurious correlations, which can have potentially fatal consequences in practice. Detecting and mitigating shortcut behavior is a challenging task that often requires significant labeling efforts from domain experts. To alleviate this problem, we introduce a semi-automated framework for the identification of spurious behavior from both data and model perspective by leveraging insights from eXplainable Artificial Intelligence (XAI). This allows the retrieval of spurious data points and the detection of model circuits that encode the associated prediction rules. Moreover, we demonstrate how these shortcut encodings can be used for XAI-based sample- and pixel-level data annotation, providing valuable information for bias mitigation methods to unlearn the undesired shortcut behavior. We show the applicability of our framework using four medical datasets across two modalities, featuring controlled and real-world spurious correlations caused by data artifacts. We successfully identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision Transformer models, ultimately increasing their robustness and applicability for real-world medical tasks.

[CV-15] By-Example Synthesis of Vector Textures

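Quick read: This paper addresses the synthesis of a novel vector texture of arbitrary size from a single raster exemplar. The method proceeds in several steps. First, it segments the exemplar to extract the primary textons and clusters them by visual similarity. Second, it computes a neighborhood descriptor for each texton that captures inter-category relationships, which are used at synthesis time. Third, it uses a simple procedure to extract the secondary textons and place them behind the primary polygons. Finally, it constructs a background gradient field defined by a set of data points and colors, and adjusts the colors of the secondary polygons to better match that field. The result is a new method for generating high-quality vector textures.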

Link: https://arxiv.org/abs/2501.13812
Authors: Christopher Palazzolo, Oliver van Kaick, David Mould
Affiliations: Carleton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Notes:

Abstract:We propose a new method for synthesizing an arbitrarily sized novel vector texture given a single raster exemplar. Our method first segments the exemplar to extract the primary textons, and then clusters them based on visual similarity. We then compute a descriptor to capture each texton’s neighborhood which contains the inter-category relationships that are used at synthesis time. Next, we use a simple procedure to both extract and place the secondary textons behind the primary polygons. Finally, our method constructs a gradient field for the background which is defined by a set of data points and colors. The colors of the secondary polygons are also adjusted to better match the gradient field. To compare our work with other methods, we use a wide range of perceptual-based metrics.

[CV-16] EgoHand: Ego-centric Hand Pose Estimation and Gesture Recognition with Head-mounted Millimeter-wave Radar and IMUs

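Quick read: This paper addresses the privacy risk of the bottom-facing cameras that current virtual reality (VR) headsets use for hand gesture detection. These cameras may unintentionally capture private body parts or personal surroundings. The proposed system, EgoHand, recognizes hand gestures by integrating millimeter-wave radar and inertial measurement units (IMUs), offering a camera-free alternative that strengthens privacy protection. EgoHand uses a two-stage, skeleton-based recognition scheme: a novel end-to-end Transformer architecture first estimates the coordinates of the hand joints, and those coordinates are then used for gesture recognition. Experiments with 10 subjects show 90.8% gesture recognition accuracy, and the system remains robust in cross-domain tests covering different users, dominant hands, body postures, and scenes.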

Link: https://arxiv.org/abs/2501.13805
Authors: Yizhe Lv, Tingting Zhang, Yunpeng Song, Han Ding, Jinsong Han, Fei Wang
Affiliations: Xi’an Jiaotong University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages

Abstract:Recent advanced Virtual Reality (VR) headsets, such as the Apple Vision Pro, employ bottom-facing cameras to detect hand gestures and inputs, which offers users significant convenience in VR interactions. However, these bottom-facing cameras can sometimes be inconvenient and pose a risk of unintentionally exposing sensitive information, such as private body parts or personal surroundings. To mitigate these issues, we introduce EgoHand. This system provides an alternative solution by integrating millimeter-wave radar and IMUs for hand gesture recognition, thereby offering users an additional option for gesture interaction that enhances privacy protection. To accurately recognize hand gestures, we devise a two-stage skeleton-based gesture recognition scheme. In the first stage, a novel end-to-end Transformer architecture is employed to estimate the coordinates of hand joints. Subsequently, these estimated joint coordinates are utilized for gesture recognition. Extensive experiments involving 10 subjects show that EgoHand can detect hand gestures with 90.8% accuracy. Furthermore, EgoHand demonstrates robust performance across a variety of cross-domain tests, including different users, dominant hands, body postures, and scenes.

[CV-17] PromptMono: Cross Prompting Attention for Self-Supervised Monocular Depth Estimation in Challenging Environments

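Quick read: This paper addresses monocular depth estimation in challenging environments. Monocular depth estimation has improved considerably under ideal conditions, but its performance remains limited in difficult ones. The paper proposes PromptMono, a self-supervised learning framework that uses visual prompt learning to predict depth across different environments within a unified model. The key is a set of learnable parameters that act as visual prompts and capture domain-specific knowledge, together with a novel gated cross prompting attention (GCPA) module that integrates the prompt information into image representations and improves depth estimation across conditions. Experiments on the Oxford Robotcar and nuScenes datasets show superior performance.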

Link: https://arxiv.org/abs/2501.13796
Authors: Changhao Wang, Guanwen Zhang, Zhengyun Cheng, Wei Zhou
Affiliations: Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages

Abstract:Considerable efforts have been made to improve monocular depth estimation under ideal conditions. However, in challenging environments, monocular depth estimation still faces difficulties. In this paper, we introduce visual prompt learning for predicting depth across different environments within a unified model, and present a self-supervised learning framework called PromptMono. It employs a set of learnable parameters as visual prompts to capture domain-specific knowledge. To integrate prompting information into image representations, a novel gated cross prompting attention (GCPA) module is proposed, which enhances the depth estimation in diverse conditions. We evaluate the proposed PromptMono on the Oxford Robotcar dataset and the nuScenes dataset. Experimental results demonstrate the superior performance of the proposed method.

[CV-18] Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

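Quick read: This paper addresses the drawbacks of existing zero-shot temporal action detection (ZSTAD) methods, which rely on fully supervised or unsupervised training strategies and therefore suffer from domain shift and high computational cost. It proposes a training-free method, FreeZAD, that uses existing vision-language (ViL) models to classify and localize unseen activities in untrimmed videos directly, without any additional fine-tuning or adaptation. The key components are the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration, which reduce the need for explicit temporal modeling and the reliance on pseudo-label quality. The paper also introduces a test-time adaptation (TTA) strategy based on Prototype-Centric Sampling (PCS), which further improves how well ViL models adapt to ZSTAD. Experiments on THUMOS14 and ActivityNet-1.3 show that the method outperforms existing unsupervised methods while requiring only 1/13 of the runtime.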

链接: https://arxiv.org/abs/2501.13795
作者: Chaolei Han,Hongsong Wang,Jidong Kuang,Lei Zhang,Jie Gui
机构: Southeast University (东南大学); Nanjing Normal University (南京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.
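To make the outer-inner contrast idea behind LogOIC concrete, here is a minimal Python sketch of a plain, unweighted score: the mean actionness inside a candidate segment minus the mean actionness in a narrow band just outside it. It is an illustration only and not FreeZAD's scoring function. The logarithmic decay weighting is not specified in this digest and is left out, and the function name `oic_score`, the `inflate` ratio of 0.25, and the toy actionness values are all assumptions made for the example.

```python
def oic_score(actionness, start, end, inflate=0.25):
    """Unweighted outer-inner contrast for the segment [start, end).

    `actionness` is a list of per-frame scores in [0, 1].
    `inflate` (an assumed value) sets the width of the outer band
    as a fraction of the segment length.
    """
    length = end - start
    pad = max(1, int(round(length * inflate)))
    outer_start = max(0, start - pad)
    outer_end = min(len(actionness), end + pad)

    inner = actionness[start:end]
    # Frames just before and just after the segment form the outer band.
    outer = actionness[outer_start:start] + actionness[end:outer_end]

    inner_mean = sum(inner) / len(inner)
    outer_mean = sum(outer) / len(outer) if outer else 0.0
    return inner_mean - outer_mean


# Toy per-frame actionness: the action occupies frames 2..5.
scores = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
tight = oic_score(scores, 2, 6)   # boundaries aligned with the action
loose = oic_score(scores, 1, 7)   # boundaries one frame too wide
```

A segment whose boundaries match the action gets a higher score than one that bleeds into the background, which is why a contrast of this kind can rank candidate segments without any training.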

[CV-19] Solving the long-tailed distribution problem by exploiting the synergies and balance of different techniques

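[Quick read]: This paper addresses the difficulty models have in learning and classifying tail classes under long-tailed data distributions. Such distributions are common in real-world data, and models trained with conventional empirical risk minimisation handle them poorly. The paper studies the synergy among three long-tail recognition techniques: Supervised Contrastive Learning (SCL), Rare-Class Sample Generator (RSG), and Label-Distribution-Aware Margin Loss (LDAM). SCL strengthens intra-class clustering and inter-class separability but tends to favour dominant classes. RSG generates new tail features, compensating for the tail feature space squeezed by SCL, and further tightens intra-class clusters. LDAM introduces a larger margin for tail classes to improve performance on them. Combined, the three techniques amplify one another's strengths and offset one another's weaknesses, improving tail-class accuracy without sacrificing dominant-class accuracy.

Link: https://arxiv.org/abs/2501.13756
Authors: Ziheng Wang, Toni Lassila, Sharib Ali
Affiliations: Faculty of Engineering and Physical Sciences, University of Leeds
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13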

Abstract:In real-world data, long-tailed data distribution is common, making it challenging for models trained on empirical risk minimisation to learn and classify tail classes effectively. While many studies have sought to improve long tail recognition by altering the data distribution in the feature space and adjusting model decision boundaries, research on the synergy and corrective approach among various methods is limited. Our study delves into three long-tail recognition techniques: Supervised Contrastive Learning (SCL), Rare-Class Sample Generator (RSG), and Label-Distribution-Aware Margin Loss (LDAM). SCL enhances intra-class clusters based on feature similarity and promotes clear inter-class separability but tends to favour dominant classes only. When RSG is integrated into the model, we observed that the intra-class features further cluster towards the class centre, which demonstrates a synergistic effect together with SCL’s principle of enhancing intra-class clustering. RSG generates new tail features and compensates for the tail feature space squeezed by SCL. Similarly, LDAM is known to introduce a larger margin specifically for tail classes; we demonstrate that LDAM further bolsters the model’s performance on tail classes when combined with the more explicit decision boundaries achieved by SCL and RSG. Furthermore, SCL can compensate for the dominant class accuracy sacrificed by RSG and LDAM. Our research emphasises the synergy and balance among the three techniques, with each amplifying the strengths of the others and mitigating their shortcomings. Our experiment on long-tailed distribution datasets, using an end-to-end architecture, yields competitive results by enhancing tail class accuracy without compromising dominant class performance, achieving a balanced improvement across all classes.
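As a rough illustration of the larger margin that LDAM gives to tail classes, the sketch below follows the margin rule from the original LDAM formulation, where the margin of class j is proportional to n_j^(-1/4) for class size n_j, and subtracts that margin from the true-class logit before a softmax cross-entropy. It does not reproduce this paper's combined SCL, RSG and LDAM pipeline. The function names, the `max_margin` of 0.5, the scale `s`, and the class counts are assumptions chosen for the example.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j ** -0.25, rescaled so the
    rarest class gets `max_margin` (an assumed value)."""
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [m * scale for m in raw]


def ldam_loss(logits, target, margins, s=1.0):
    """Softmax cross-entropy after lowering the true-class logit
    by that class's margin. `s` is an optional logit scale."""
    adjusted = list(logits)
    adjusted[target] -= margins[target]
    scaled = [s * z for z in adjusted]
    # Numerically stable log-sum-exp.
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return log_sum - scaled[target]


# Hypothetical two-class problem: 1000 head samples, 10 tail samples.
margins = ldam_margins([1000, 10])
with_margin = ldam_loss([2.0, 1.0], target=1, margins=margins)
without_margin = ldam_loss([2.0, 1.0], target=1, margins=[0.0, 0.0])
```

Because the tail class receives the larger margin, a tail sample must be classified with more slack before its loss becomes small, which pushes the decision boundary away from the rare class.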

[CV-20] You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain

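[Quick read]: This paper addresses the real-time performance and computational efficiency of learning-based computer vision methods for detecting planetary, lunar, and small-body surface terrain on autonomous spacecraft. Labeled data are scarce and the methods rely on supervised learning, so training is complex, and many methods are too computationally expensive to run in real time on spacecraft processors. The paper builds on Unsupervised Domain Adaptation (UDA): integrating Visual Similarity-based Alignment (VSA) into lightweight one-stage object detection architectures improves UDA for space terrain detection. The key contribution is a set of novel additions to the VSA scheme that address YOCOv1's performance degradation in multi-class and high-altitude scenarios. The resulting method, YOCOv2, is evaluated on both simulated and real-world data and improves by upwards of 31% over YOCOv1 and terrestrial state-of-the-art methods; its practical utility is demonstrated through performance benchmarking on spacecraft flight hardware and qualitative evaluation on NASA mission data.

Link: https://arxiv.org/abs/2501.13725
Authors: Timothy Chase Jr, Christopher Wilson, Karthik Dantu
Affiliations: University at Buffalo, Buffalo, NY 14260, USA; NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: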

Abstract:The in-situ detection of planetary, lunar, and small-body surface terrain is crucial for autonomous spacecraft applications, where learning-based computer vision methods are increasingly employed to enable intelligence without prior information or human intervention. However, many of these methods remain computationally expensive for spacecraft processors and prevent real-time operation. Training of such algorithms is additionally complex due to the scarcity of labeled data and reliance on supervised learning approaches. Unsupervised Domain Adaptation (UDA) offers a promising solution by facilitating model training with disparate data sources such as simulations or synthetic scenes, although UDA is difficult to apply to celestial environments where challenging feature spaces are paramount. To alleviate such issues, You Only Crash Once (YOCOv1) has studied the integration of Visual Similarity-based Alignment (VSA) into lightweight one-stage object detection architectures to improve space terrain UDA. Although proven effective, the approach faces notable limitations, including performance degradations in multi-class and high-altitude scenarios. Building upon the foundation of YOCOv1, we propose novel additions to the VSA scheme that enhance terrain detection capabilities under UDA, and our approach is evaluated across both simulated and real-world data. Our second YOCO rendition, YOCOv2, is capable of achieving state-of-the-art UDA performance on surface terrain detection, where we showcase improvements upwards of 31% compared with YOCOv1 and terrestrial state-of-the-art. We demonstrate the practical utility of YOCOv2 with spacecraft flight hardware performance benchmarking and qualitative evaluation of NASA mission data.

[CV-21] A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation

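[Quick read]: This paper addresses how latent variables are utilized in Multiple Latent Variable Generative Models (MLVGMs) during image generation. MLVGMs such as StyleGAN and NVAE use multiple latent variables to shape an image progressively, from global characteristics to local details, yet their generative dynamics and latent variable utilization have only been observed empirically, without systematic quantitative analysis. The paper proposes a framework based on Mutual Information (MI) that quantifies the impact of each latent variable in an MLVGM, reveals underutilized variables, and guides the use of these models in downstream applications. Building on this, it introduces a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL): using the hierarchical, disentangled latent variables of MLVGMs and guided by the MI analysis, it applies tailored latent perturbations to produce diverse views without relying on real data. It also introduces a Continuous Sampling (CS) strategy, in which the generator creates new samples dynamically during SSCRL training, greatly increasing data variability. Experiments show that views generated by MLVGMs perform on par with, or better than, views generated from real data. The work offers a principled approach to understanding and exploiting MLVGMs.

Link: https://arxiv.org/abs/2501.13718
Authors: Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Affiliations: Istituto Italiano di Tecnologia; University of Genoa; University of Verona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: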

Abstract:In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images, from global characteristics to finer and local details (e.g., StyleGAN, NVAE), emerging as powerful tools for diverse applications. Yet their generative dynamics and latent variable utilization remain only empirically observed. In this work, we propose a novel framework to systematically quantify the impact of each latent variable in MLVGMs, using Mutual Information (MI) as a guiding metric. Our analysis reveals underutilized variables and can guide the use of MLVGMs in downstream applications. With this foundation, we introduce a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL). By leveraging the hierarchical and disentangled variables of MLVGMs, and guided by the previous analysis, we apply tailored latent perturbations to produce diverse views for SSCRL, without relying on real data altogether. Additionally, we introduce a Continuous Sampling (CS) strategy, where the generator dynamically creates new samples during SSCRL training, greatly increasing data variability. Our comprehensive experiments demonstrate the effectiveness of these contributions, showing that MLVGMs’ generated views compete on par with or even surpass views generated from real data. This work establishes a principled approach to understanding and exploiting MLVGMs, advancing both generative modeling and self-supervised learning.

[CV-22] Skin Disease Detection and Classification of Actinic Keratosis and Psoriasis Utilizing Deep Transfer Learning

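[Quick read]: This paper addresses two problems in skin disease diagnosis: the importance of early diagnosis, and the high cost and limited accessibility of existing diagnostic methods. The authors propose an efficient diagnostic method based on deep learning, built on a modified VGG16 Convolutional Neural Network (CNN). The model includes several convolutional layers and is initialized with ImageNet weights, and its top layers are replaced with fully connected layers and a final softmax activation layer that classifies skin diseases. Although the VGG16 architecture does not include data augmentation by default, rotation, shifting, and zooming were applied to augment the data before training. The method reaches 90.67% accuracy on the publicly available "Skin Disease Dataset", which the authors present as evidence of its reliability and potential for real-world use.

Link: https://arxiv.org/abs/2501.13713
Authors: Fahud Ahmmed, Md. Zaheer Raihan, Kamnur Nahar, D.M. Asadujjaman, Md. Mahfujur Rahman, Abdullah Tamim
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: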

Abstract:Skin diseases can arise from infections, allergies, genetic factors, autoimmune disorders, hormonal imbalances, or environmental triggers such as sun damage and pollution. Some skin diseases, such as Actinic Keratosis and Psoriasis, can be fatal if not treated in time. Early identification is crucial, but the diagnostic methods for these conditions are often expensive and not widely accessible. In this study, we propose a novel and efficient method for diagnosing skin diseases using deep learning techniques. This approach employs a modified VGG16 Convolutional Neural Network (CNN) model. The model includes several convolutional layers and utilizes ImageNet weights with modified top layers. The top layer is updated with fully connected layers and a final softmax activation layer to classify skin diseases. The dataset used, titled “Skin Disease Dataset,” is publicly available. While the VGG16 architecture does not include data augmentation by default, preprocessing techniques such as rotation, shifting, and zooming were applied to augment the data prior to model training. The proposed methodology achieved 90.67% accuracy using the modified VGG16 model, demonstrating its reliability in classifying skin diseases. The promising results highlight the potential of this approach for real-world applications.

[CV-23] YOLO11-JDE: Fast and Accurate Multi-Object Tracking with Self-Supervised Re-ID WACV2025

[Quick read]: This paper targets real-time performance and accuracy in multi-object tracking (MOT), specifically the joint task of object detection and Re-Identification (Re-ID). The key is YOLO11-JDE, which integrates a dedicated Re-ID branch into YOLO11s to perform Joint Detection and Embedding (JDE). The Re-ID branch is trained in a fully self-supervised setting at the same time as detection, avoiding costly identity-labeled datasets. A triplet loss with hard positive and semi-hard negative mining is used to learn discriminative embeddings. A custom tracking implementation integrates motion, appearance, and location cues to improve data association. YOLO11-JDE achieves competitive results on the MOT17 and MOT20 benchmarks, surpassing existing JDE methods in FPS while using up to ten times fewer parameters.

Link: https://arxiv.org/abs/2501.13710
Authors: Iñaki Erregue, Kamal Nasrollahi, Sergio Escalera
Affiliations: Universitat de Barcelona; Computer Vision Center; Aalborg University; Milestone Systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted to the 5th Workshop on Real-World Surveillance: Applications and Challenges (WACV 2025)

Abstract:We introduce YOLO11-JDE, a fast and accurate multi-object tracking (MOT) solution that combines real-time object detection with self-supervised Re-Identification (Re-ID). By incorporating a dedicated Re-ID branch into YOLO11s, our model performs Joint Detection and Embedding (JDE), generating appearance features for each detection. The Re-ID branch is trained in a fully self-supervised setting while simultaneously training for detection, eliminating the need for costly identity-labeled datasets. The triplet loss, with hard positive and semi-hard negative mining strategies, is used for learning discriminative embeddings. Data association is enhanced with a custom tracking implementation that successfully integrates motion, appearance, and location cues. YOLO11-JDE achieves competitive results on MOT17 and MOT20 benchmarks, surpassing existing JDE methods in terms of FPS and using up to ten times fewer parameters. This makes our method a highly attractive solution for real-world applications.
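The triplet loss with hard positive and semi-hard negative mining can be sketched in a few lines of Python. This is a generic illustration of the technique under common textbook definitions, not the YOLO11-JDE training code: the hard positive is taken as the farthest positive from the anchor, and a semi-hard negative is one farther than that positive but still inside the margin. The function names, the margin of 0.2, the Euclidean distance, and the choice to return 0 when no semi-hard negative exists are assumptions for the example. How positives and negatives are formed in the self-supervised setting is not described in this digest.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positives, negatives, margin=0.2):
    """Triplet loss for one anchor with simple mining.

    Hard positive: the positive farthest from the anchor.
    Semi-hard negative: a negative with
        d(anchor, positive) < d(anchor, negative) < d(anchor, positive) + margin.
    If no semi-hard negative exists, this sketch skips the triplet
    and returns 0.0 (an assumed fallback).
    """
    d_pos = max(euclidean(anchor, p) for p in positives)
    neg_dists = [euclidean(anchor, n) for n in negatives]
    semi_hard = [d for d in neg_dists if d_pos < d < d_pos + margin]
    if not semi_hard:
        return 0.0
    d_neg = min(semi_hard)
    return max(d_pos - d_neg + margin, 0.0)


# Toy 2-D embeddings, hypothetical values.
anchor = [0.0, 0.0]
positives = [[0.1, 0.0], [0.5, 0.0]]               # hard positive at distance 0.5
negatives = [[0.6, 0.0], [2.0, 0.0], [0.3, 0.0]]   # only 0.6 is semi-hard
loss = triplet_loss(anchor, positives, negatives)
```

Semi-hard negatives always yield a loss strictly between 0 and the margin, which avoids the very large gradients that the hardest negatives can produce early in training.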

[CV-24] Regularizing cross entropy loss via minimum entropy and K-L divergence

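[Quick read]: This paper proposes two new loss functions for deep learning classification that extend the standard cross-entropy loss: mixed entropy loss (MIX-ENT) and minimum entropy regularized cross-entropy loss (MIN-ENT). MIX-ENT adds a regularizer that is equivalent to the sum of a minimum entropy term and a Kullback-Leibler (K-L) divergence term; unlike the K-L divergence in the standard cross-entropy loss, this term swaps the roles of the target probability and the hypothesis probability. MIN-ENT simply adds a minimum entropy regularizer to the standard cross-entropy loss. In both cases, the regularizer minimizes the entropy of the hypothesis probability distribution output by the neural network. On the EMNIST-Letters dataset, a VGG model reaches 95.927% accuracy with MIX-ENT and 95.933% with MIN-ENT, compared with 95.86% for VGG and 95.88% for Spinal-VGG under standard cross-entropy.

Link: https://arxiv.org/abs/2501.13709
Authors: Abdulrahman Oladipupo Ibraheem
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages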

Abstract:I introduce two novel loss functions for classification in deep learning. The two loss functions extend standard cross entropy loss by regularizing it with minimum entropy and Kullback-Leibler (K-L) divergence terms. The first of the two novel loss functions is termed mixed entropy loss (MIX-ENT for short), while the second one is termed minimum entropy regularized cross-entropy loss (MIN-ENT for short). The MIX-ENT function introduces a regularizer that can be shown to be equivalent to the sum of a minimum entropy term and a K-L divergence term. However, it should be noted that the K-L divergence term here is different from that in the standard cross-entropy loss function, in the sense that it swaps the roles of the target probability and the hypothesis probability. The MIN-ENT function simply adds a minimum entropy regularizer to the standard cross entropy loss function. In both MIX-ENT and MIN-ENT, the minimum entropy regularizer minimizes the entropy of the hypothesis probability distribution which is output by the neural network. Experiments on the EMNIST-Letters dataset show that my implementation of MIX-ENT and MIN-ENT lets the VGG model climb from its previous 3rd position on the paperswithcode leaderboard to reach the 2nd position on the leaderboard, outperforming the Spinal-VGG model in so doing. Specifically, using standard cross-entropy, VGG achieves 95.86% while Spinal-VGG achieves 95.88% classification accuracies, whereas using VGG (without Spinal-VGG) our MIN-ENT achieved 95.933%, while our MIX-ENT achieved 95.927% accuracies. The pre-trained models for both MIX-ENT and MIN-ENT are at this https URL entropy project.
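The MIN-ENT description, standard cross-entropy plus a term that minimizes the entropy of the predicted distribution, can be written down directly. The Python sketch below is one plausible reading of that sentence and not the author's implementation: the weighting coefficient `lam`, its value of 0.1, and the function names are assumptions, and MIX-ENT is not shown because its exact form is not given here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)


def min_ent_loss(logits, target, lam=0.1):
    """Cross-entropy against a one-hot target plus lam times the entropy
    of the predicted distribution. `lam` is an assumed hyperparameter."""
    q = softmax(logits)
    cross_entropy = -math.log(q[target] + 1e-12)
    return cross_entropy + lam * entropy(q)


uniform = min_ent_loss([0.0, 0.0, 0.0, 0.0], target=0)    # maximally uncertain
confident = min_ent_loss([5.0, 0.0, 0.0, 0.0], target=0)  # peaked on the target
```

With `lam=0` the function reduces to ordinary cross-entropy; a positive `lam` adds a penalty on spread-out predictions, so both terms favour a confident, correct output.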

[CV-25] EventVL: Understand Event Streams via Multimodal Large Language Model

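[Quick read]: This paper addresses a limitation of existing event-based vision-language models (VLMs): most rely only on CLIP and focus on traditional perception tasks, so they cannot explicitly extract sufficient semantics and context from event streams. The authors propose EventVL, which they describe as the first generative event-based Multimodal Large Language Model (MLLM) framework for explicit semantic understanding. The key elements are: 1) an annotated event-image/video-text dataset with almost 1.4 million high-quality data pairs, which bridges the data gap between the semantics of different modalities; 2) Event Spatiotemporal Representation, which explores the information in the event stream by aggregating and segmenting it in diverse ways; and 3) Dynamic Semantic Alignment, which improves and completes the sparse semantic space of events. Experiments show that EventVL significantly surpasses existing MLLM baselines on event captioning and scene description generation.

Link: https://arxiv.org/abs/2501.13707
Authors: Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, Hui Xiong
Affiliations: HKUST(GZ); KU Leuven; CUHK
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: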

Abstract:Event-based Vision-Language Models (VLMs) have recently made good progress on practical vision tasks. However, most of these works just utilize CLIP and focus on traditional perception tasks, which prevents the model from explicitly understanding the rich semantics and context in event streams. To address the deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset, containing almost 1.4 million high-quality pairs of data, which enables effective learning across various scenes, e.g., driving scenes or human motion. After that, we design Event Spatiotemporal Representation to fully explore the comprehensive information by diversely aggregating and segmenting the event stream. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete sparse semantic spaces of events. Extensive experiments show that our EventVL can significantly surpass existing MLLM baselines in event captioning and scene description generation tasks. We hope our research could contribute to the development of the event vision community.

[CV-26] Training-Free Consistency Pipeline for Fashion Repose

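[Quick read]: This paper addresses the challenge of preserving object identity when performing non-rigid transformations in image editing, such as changing an object's pose or image-based conditioning. In the fashion industry in particular, existing diffusion models often fail to keep branding attributes and object identity consistent when adjusting the pose of long-sleeve garments, and fine-tuning them requires custom training data that is not always available in practice. The proposed solution, FashionRepose, is a training-free pipeline designed for the fashion industry that integrates off-the-shelf models to perform non-rigid pose editing in near real-time, without specialized training. Its key is a zero-shot approach that keeps object identity and branding attributes consistent during editing, meeting the precision and consistency demanded by industrial applications.

Link: https://arxiv.org/abs/2501.13692
Authors: Potito Aghilar, Vito Walter Anelli, Michelantonio Trizio, Tommaso Di Noia
Affiliations: Politecnico di Bari, Italy; Wideverse, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: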

Abstract:Recent advancements in diffusion models have significantly broadened the possibilities for editing images of real-world objects. However, performing non-rigid transformations, such as changing the pose of objects or image-based conditioning, remains challenging. Maintaining object identity during these edits is difficult, and current methods often fall short of the precision needed for industrial applications, where consistency is critical. Additionally, fine-tuning diffusion models requires custom training data, which is not always accessible in real-world scenarios. This work introduces FashionRepose, a training-free pipeline for non-rigid pose editing specifically designed for the fashion industry. The approach integrates off-the-shelf models to adjust poses of long-sleeve garments, maintaining identity and branding attributes. FashionRepose uses a zero-shot approach to perform these edits in near real-time, eliminating the need for specialized training. The solution holds potential for applications in the fashion industry and other fields demanding identity preservation in image editing.

[CV-27] MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

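[Quick read]: This paper addresses two main challenges in Referring Video Object Segmentation (RVOS), the task of segmenting video objects according to textual descriptions: translating the text into effective prompts, and strengthening the model's awareness of global context. It proposes a framework called MPG-SAM 2. The key elements are: 1) a unified multimodal encoder that jointly encodes video and textual features, producing semantically aligned video and text embeddings along with multimodal class tokens; 2) a mask prior generator that uses the video embeddings and class tokens to create pseudo masks of the target objects and global context; and 3) a hierarchical global-historical aggregator that lets SAM 2 aggregate global and historical information about target objects at both pixel and object levels, enhancing target representation and temporal consistency. Together, these modules improve performance on several RVOS benchmarks.

Link: https://arxiv.org/abs/2501.13667
Authors: Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: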

Abstract:Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we introduce a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of our proposed modules.

[CV-28] QMamba: Post-Training Quantization for Vision State Space Models

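[Quick read]: This paper addresses the efficient deployment of State Space Models (SSMs) on resource-limited edge devices. SSMs model long sequences efficiently, but their computational cost makes deployment on edge devices difficult. The paper proposes QMamba, a Post-Training Quantization (PTQ) framework for vision SSMs that reduces this cost through quantization.

The key to the solution is two quantization methods designed around the activation distributions in SSMs: Long-tailed Skewness Quantization (LtSQ), which quantizes the discrete parameters, and Temporal Group Quantization (TGQ), which quantizes the hidden state sequence. They respectively target the long-tailed skewness of the discrete parameter distribution and the highly dynamic variations of the hidden state sequence, reducing quantization error. Experiments show that QMamba outperforms existing PTQ methods on vision models across multiple model sizes and architectures; notably, it surpasses existing methods by 21.0% on ImageNet classification with 4-bit activations.

Link: https://arxiv.org/abs/2501.13624
Authors: Yinglong Li, Xiaoyu Liu, Jiacheng Li, Ruikang Xu, Yinda Chen, Zhiwei Xiong
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: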

Abstract:State Space Models (SSMs), as key components of Mamba, have gained increasing attention for vision models recently, thanks to their efficient long sequence modeling capability. Given the computational cost of deploying SSMs on resource-limited edge devices, Post-Training Quantization (PTQ) is a technique with the potential for efficient deployment of SSMs. In this work, we propose QMamba, one of the first PTQ frameworks to our knowledge, designed for vision SSMs based on the analysis of the activation distributions in SSMs. We reveal that the distribution of discrete parameters exhibits long-tailed skewness and the distribution of the hidden state sequence exhibits highly dynamic variations. Correspondingly, we design Long-tailed Skewness Quantization (LtSQ) to quantize discrete parameters and Temporal Group Quantization (TGQ) to quantize hidden states, which reduces the quantization errors. Extensive experiments demonstrate that QMamba outperforms advanced PTQ methods on vision models across multiple model sizes and architectures. Notably, QMamba surpasses existing methods by 21.0% on ImageNet classification with 4-bit activations.
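Why grouping along time can help when hidden states vary strongly across steps is easy to see with a toy uniform quantizer. The Python sketch below is a generic per-group symmetric quantizer, offered as an assumption-laden illustration of the intuition and not as QMamba's TGQ or LtSQ: the function names, the fixed `group_size`, the max-abs scale rule, and the sample sequence are all hypothetical.

```python
def quantize_group(values, bits=4):
    """Symmetric uniform quantization with one scale for the whole group."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_by_time_groups(sequence, group_size, bits=4):
    """Quantize then dequantize a 1-D sequence, using a separate scale
    for each consecutive block of `group_size` time steps."""
    restored = []
    for start in range(0, len(sequence), group_size):
        group = sequence[start:start + group_size]
        codes, scale = quantize_group(group, bits)
        restored.extend(code * scale for code in codes)
    return restored


def total_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored))


# Hypothetical hidden-state trace: small early values, large later values.
trace = [0.01, 0.02, -0.015, 0.005, 5.0, -3.0, 4.0, 1.0]
grouped = quantize_by_time_groups(trace, group_size=4)
single = quantize_by_time_groups(trace, group_size=len(trace))
err_grouped = total_abs_error(trace, grouped)
err_single = total_abs_error(trace, single)
```

With a single scale the small early values all collapse to zero at 4 bits, while per-group scales keep them distinguishable, so the grouped error is lower on this trace.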

[CV-29] Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task

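[Quick read]: This paper evaluates the reasoning capabilities of Vision-Language Models (VLMs) on complex visual tasks, using the Bongard Openworld Problems benchmark to probe their potential and limitations. It proposes three human-inspired reasoning paradigms: holistic analysis (global context processing), deductive rule learning (explicit rule derivation and application), and componential analysis (structured decomposition of images into components). The results show that state-of-the-art models, including GPT-4o and Gemini, surpass human benchmarks and do well on structured reasoning tasks, with componential analysis proving especially effective. Ablation studies, however, reveal key challenges in handling synthetic images, making fine-grained distinctions, and interpreting nuanced contextual information. These findings point to the need for further work on model robustness and generalization, and to the potential of structured reasoning approaches for enhancing VLM capabilities.

Link: https://arxiv.org/abs/2501.13620
Authors: Mohit Vaishnav, Tanel Tammet
Affiliations: Applied Artificial Intelligence Group, Tallinn University of Technology; Kimova AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: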

Abstract:Evaluating the reasoning capabilities of Vision-Language Models (VLMs) in complex visual tasks provides valuable insights into their potential and limitations. In this work, we assess the performance of VLMs on the challenging Bongard Openworld Problems benchmark, which involves reasoning over natural images. We propose and evaluate three human-inspired paradigms: holistic analysis (global context processing), deductive rule learning (explicit rule derivation and application), and componential analysis (structured decomposition of images into components). Our results demonstrate that state-of-the-art models, including GPT-4o and Gemini, not only surpass human benchmarks but also excel in structured reasoning tasks, with componential analysis proving especially effective. However, ablation studies reveal key challenges, such as handling synthetic images, making fine-grained distinctions, and interpreting nuanced contextual information. These insights underscore the need for further advancements in model robustness and generalization, while highlighting the transformative potential of structured reasoning approaches in enhancing VLM capabilities.

[CV-30] Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving

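[Quick read]: This paper addresses black-box adversarial attacks on Vision-Language Models (VLMs) used in Autonomous Driving (AD) systems. Existing research has mainly explored white-box attacks, while the more practical and more challenging black-box setting remains underexplored. The paper proposes Cascading Adversarial Disruption (CAD), which has two core parts. First, Decision Chain Disruption generates and injects deceptive semantics to break down low-level reasoning, so that the perturbations remain effective across the entire decision-making chain. Second, Risky Scene Induction uses a surrogate VLM to understand and construct high-level risky scenarios, adapting to the dynamic nature of the current driving context. Experiments on multiple AD VLMs and benchmarks show that CAD achieves state-of-the-art attack effectiveness, outperforming existing methods by 13.43% on average, and real-world attacks on VLM-powered AD vehicles confirm its practical applicability.

Link: https://arxiv.org/abs/2501.13563
Authors: Lu Wang, Tianyuan Zhang, Yang Qu, Siyuan Liang, Yuwei Chen, Aishan Liu, Xianglong Liu, Dacheng Tao
Affiliations: Beihang University; National University of Singapore; Aviation Industry Development Research Center of China; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: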

Abstract:Vision-language models (VLMs) have significantly advanced autonomous driving (AD) by enhancing reasoning capabilities; however, these models remain highly susceptible to adversarial attacks. While existing research has explored white-box attacks to some extent, the more practical and challenging black-box scenarios remain largely underexplored due to their inherent difficulty. In this paper, we take the first step toward designing black-box adversarial attacks specifically targeting VLMs in AD. We identify two key challenges for achieving effective black-box attacks in this context: the effectiveness across driving reasoning chains in AD systems and the dynamic nature of driving scenarios. To address this, we propose Cascading Adversarial Disruption (CAD). It first introduces Decision Chain Disruption, which targets low-level reasoning breakdown by generating and injecting deceptive semantics, ensuring the perturbations remain effective across the entire decision-making chain. Building on this, we present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios that are likely to result in critical errors in the current driving contexts. Extensive experiments conducted on multiple AD VLMs and benchmarks demonstrate that CAD achieves state-of-the-art attack effectiveness, significantly outperforming existing methods (+13.43% on average). Moreover, we validate its practical applicability through real-world attacks on AD vehicles powered by VLMs, where the route completion rate drops by 61.11% and the vehicle crashes directly into the obstacle vehicle with adversarial patches. Finally, we release CADA dataset, comprising 18,808 adversarial visual-question-answer pairs, to facilitate further evaluation and research in this critical domain. Our codes and dataset will be available after paper’s acceptance.

[CV-31] GoDe: Gaussians on Demand for Progressive Level of Detail and Scalable Compression

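[Quick read]: This paper addresses the limited scalability and adaptability of 3D Gaussian Splatting (3DGS) for novel view synthesis, which stem from its large storage and VRAM requirements. Existing methods typically need to re-train the model to accommodate different computing capabilities or bandwidth, and so lack flexibility. The paper proposes a model-agnostic technique that organizes Gaussians into several hierarchical layers, enabling a progressive Level of Detail (LoD) strategy. Combined with a recent 3DGS compression approach, this lets a single model scale instantly across several compression ratios, with minimal to no impact on quality compared with a single non-scalable model and without re-training. The approach is validated on typical datasets and benchmarks, showing low distortion and substantial gains in scalability and adaptability.

Link: https://arxiv.org/abs/2501.13558
Authors: Francesco Di Sario, Riccardo Renzulli, Marco Grangetto, Akihiro Sugimoto, Enzo Tartaglione
Affiliations: University of Turin, Italy; National Institute of Informatics, Japan; LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: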

Abstract:3D Gaussian Splatting enhances real-time performance in novel view synthesis by representing scenes with mixtures of Gaussians and utilizing differentiable rasterization. However, it typically requires large storage capacity and high VRAM, demanding the design of effective pruning and compression techniques. Existing methods, while effective in some scenarios, struggle with scalability and fail to adapt models based on critical factors such as computing capabilities or bandwidth, requiring the model to be re-trained under different configurations. In this work, we propose a novel, model-agnostic technique that organizes Gaussians into several hierarchical layers, enabling a progressive Level of Detail (LoD) strategy. This method, combined with a recent 3DGS compression approach, allows a single model to instantly scale across several compression ratios, with minimal to no impact on quality compared to a single non-scalable model and without requiring re-training. We validate our approach on typical datasets and benchmarks, showcasing low distortion and substantial gains in terms of scalability and adaptability.

[CV-32] One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt

[Quick read]: This paper addresses the difficulty text-to-image generation models have in producing consistent, identity-preserving images, particularly for storytelling. Existing approaches typically require extensive training on large datasets or additional modifications to the original model architecture, which limits their applicability across domains and diffusion model configurations. The paper proposes a training-free method called "One-Prompt-One-Story" (1Prompt1Story), which concatenates all prompts into a single input to keep character identities consistent. Two further techniques, Singular-Value Reweighting and Identity-Preserving Cross-Attention, refine the generation process so that each frame aligns better with its input description. Quantitative and qualitative comparisons against existing consistent text-to-image generation approaches demonstrate its effectiveness.

Link: https://arxiv.org/abs/2501.13554
Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Affiliations: VCIP, CS, Nankai University; Computer Vision Center, Universitat Autònoma de Barcelona; Mohamed bin Zayed University of AI; Linkoping University; Independent Researcher, Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training in large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed “One-Prompt-One-Story” (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at this https URL.

[CV-33] Overcoming Support Dilution for Robust Few-shot Semantic Segmentation

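[Quick read]: This paper addresses "support dilution" in few-shot semantic segmentation (FSS): as the support set grows, existing FSS networks struggle to focus on high-contributed supports and are easily distracted by low-contributed ones, which degrades segmentation. The paper makes three technical contributions. First, a contribution index quantitatively estimates whether a high-contributed support is being diluted. Second, a Symmetric Correlation (SC) module preserves and enhances high-contributed support features while reducing interference from low-contributed features. Third, a Support Image Pruning operation discards low-contributed supports to extract a compact, high-quality subset from the raw support pool. Experiments show that the method outperforms existing FSS methods on the COCO-20i and PASCAL-5i benchmarks and is practical for online and real-world segmentation.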

Link: https://arxiv.org/abs/2501.13529
Authors: Wailing Tang, Biqi Yang, Pheng-Ann Heng, Yun-Hui Liu, Chi-Wing Fu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 15 figures

Abstract:Few-shot Semantic Segmentation (FSS) is a challenging task that utilizes limited support images to segment associated unseen objects in query images. However, recent FSS methods are observed to perform worse as the number of shots increases. As the support set enlarges, existing FSS networks struggle to concentrate on the high-contributed supports and can easily be overwhelmed by the low-contributed supports, which can severely impair the mask predictions. In this work, we study this challenging issue, called support dilution. Our goal is to recognize, select, preserve, and enhance those high-contributed supports in the raw support pool. Technically, our method contains three novel parts. First, we propose a contribution index to quantitatively estimate whether a high-contributed support is diluted. Second, we develop the Symmetric Correlation (SC) module to preserve and enhance the high-contributed support features, minimizing the distraction by the low-contributed features. Third, we design the Support Image Pruning operation to retrieve a compact, high-quality subset by discarding low-contributed supports. We conduct extensive experiments on two FSS benchmarks, COCO-20i and PASCAL-5i, and the segmentation results demonstrate the compelling performance of our solution over state-of-the-art FSS approaches. In addition, we apply our solution to online segmentation and real-world segmentation, where convincing segmentation results show its practical ability for real-world demonstrations.
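The idea behind Support Image Pruning, keeping only the supports that contribute most, can be sketched as a top-k selection. This is a simplified stand-in rather than the paper's operation: the scores below are made-up numbers, and how the real contribution index is computed is not stated in the abstract.

```python
from typing import List, Sequence


def prune_supports(scores: Sequence[float], keep: int) -> List[int]:
    """Return indices of the `keep` highest-scoring supports, in original order.

    `scores` is a hypothetical per-support contribution score (higher is better).
    """
    if keep <= 0:
        return []
    # Rank support indices from highest to lowest score (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering among the survivors.
    return sorted(ranked[:keep])


contribution = [0.12, 0.81, 0.40, 0.77, 0.05]  # illustrative values only
kept = prune_supports(contribution, keep=2)
print(kept)
```

Returning indices rather than images keeps the sketch independent of any tensor library; a caller would use them to slice the support pool.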

[CV-34] Diffusion-based Perceptual Neural Video Compression with Temporal Diffusion Information Reuse

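[Quick read]: This paper addresses how to apply diffusion models to video compression, specifically how to integrate and optimize them within a perceptual neural video compression framework. It proposes DiffVC, which combines a foundational diffusion model with the video conditional coding paradigm, using temporal context from previously decoded frames and the reconstructed latent representation of the current frame to guide the diffusion model toward high-quality results. The key components are: 1) the Temporal Diffusion Information Reuse (TDIR) strategy, which reuses diffusion information from previous frames to significantly improve inference efficiency with minimal performance loss; and 2) the Quantization Parameter-based Prompting (QPP) mechanism, which feeds quantization parameters as prompts into the foundational diffusion model to explicitly modulate intermediate features, yielding a robust variable-bitrate diffusion-based neural compression framework. Experiments show excellent performance in both perception metrics and visual quality.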

Link: https://arxiv.org/abs/2501.13528
Authors: Wenzhuo Ma, Zhenzhong Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Recently, foundational diffusion models have attracted considerable attention in image compression tasks, whereas their application to video compression remains largely unexplored. In this article, we introduce DiffVC, a diffusion-based perceptual neural video compression framework that effectively integrates foundational diffusion model with the video conditional coding paradigm. This framework uses temporal context from previously decoded frame and the reconstructed latent representation of the current frame to guide the diffusion model in generating high-quality results. To accelerate the iterative inference process of diffusion model, we propose the Temporal Diffusion Information Reuse (TDIR) strategy, which significantly enhances inference efficiency with minimal performance loss by reusing the diffusion information from previous frames. Additionally, to address the challenges posed by distortion differences across various bitrates, we propose the Quantization Parameter-based Prompting (QPP) mechanism, which utilizes quantization parameters as prompts fed into the foundational diffusion model to explicitly modulate intermediate features, thereby enabling a robust variable bitrate diffusion-based neural compression framework. Experimental results demonstrate that our proposed solution delivers excellent performance in both perception metrics and visual quality.

[CV-35] Text-driven Online Action Detection

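[Quick read]: This paper addresses online action detection: classifying actions in streaming video as they occur while handling background noise and incomplete actions. The key to the solution is TOAD (Text-driven Online Action Detection), an architecture that uses CLIP (Contrastive Language-Image Pretraining) textual embeddings and supports zero-shot and few-shot learning. This lets TOAD use vision-language models (VLMs) efficiently without significant computational overhead. TOAD achieves 82.46% mAP (mean Average Precision) on THUMOS14, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.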

Link: https://arxiv.org/abs/2501.13518
Authors: Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez
Affiliations: Department of Computer Technology, University of Alicante, Spain; ValgrAI - Valencian Graduate School and Research Network of Artificial Intelligence, Valencia, Spain; Institute of Informatics Research, University of Alicante, Alicante, Spain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Integrated Computer-Aided Engineering

Abstract:Detecting actions as they occur is essential for applications like video surveillance, autonomous driving, and human-robot interaction. Known as online action detection, this task requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state-of-the-art, yet the potential of recent advancements in computer vision, particularly vision-language models (VLMs), remains largely untapped for this problem, partly due to high computational costs. In this paper, we introduce TOAD: a Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD leverages CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.
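Zero-shot classification with text embeddings generally comes down to picking the class whose text embedding is most similar to the visual embedding. The sketch below shows that generic cosine-similarity step on toy three-dimensional vectors. It is not TOAD's architecture, and the labels and numbers are invented; in practice the vectors would come from a CLIP text encoder and image encoder.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_frame(frame_emb: Sequence[float],
                   text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text embedding is closest to the frame embedding."""
    return max(text_embs, key=lambda label: cosine(frame_emb, text_embs[label]))


# Toy embeddings standing in for encoder outputs (hypothetical values).
text_embeddings = {
    "background": [1.0, 0.0, 0.0],
    "high jump": [0.0, 1.0, 0.0],
    "diving": [0.0, 0.0, 1.0],
}
frame = [0.1, 0.2, 0.9]
print(classify_frame(frame, text_embeddings))
```

Because the class set lives entirely in `text_embeddings`, adding an unseen action only requires a new text embedding, which is what makes the zero-shot setting possible.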

[CV-36] Propensity-driven Uncertainty Learning for Sample Exploration in Source-Free Active Domain Adaptation

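[Quick read]: This paper addresses source-free active domain adaptation (SFADA): adapting a pre-trained model to a new domain without access to source data while minimizing the need for target-domain annotations. The setting matters in practice wherever data privacy, storage limits, or labeling costs are a concern. Key challenges include selecting the most informative target samples for labeling, making effective use of both labeled and unlabeled target data, and adapting the model without relying on source-domain information. Existing methods often struggle with noisy or outlier samples and may require impractical progressive labeling during training.

To address this, the paper proposes the Propensity-driven Uncertainty Learning (ProULearn) framework. Its key element is a novel homogeneity propensity estimation mechanism that, combined with correlation index calculation, evaluates feature-level relationships to identify representative and challenging samples while avoiding noisy outliers. ProULearn also introduces a central correlation loss that refines pseudo-labels and produces compact class distributions, which bridges the domain gap and maximizes adaptation performance. At its core, the solution improves domain adaptation through informative sample selection.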

Link: https://arxiv.org/abs/2501.13517
Authors: Zicheng Pan, Xiaohan Yu, Weichuan Zhang, Yongsheng Gao
Affiliations: Griffith University; Macquarie University; Shaanxi University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Source-free active domain adaptation (SFADA) addresses the challenge of adapting a pre-trained model to new domains without access to source data while minimizing the need for target domain annotations. This scenario is particularly relevant in real-world applications where data privacy, storage limitations, or labeling costs are significant concerns. Key challenges in SFADA include selecting the most informative samples from the target domain for labeling, effectively leveraging both labeled and unlabeled target data, and adapting the model without relying on source domain information. Additionally, existing methods often struggle with noisy or outlier samples and may require impractical progressive labeling during training. To effectively select more informative samples without frequently requesting human annotations, we propose the Propensity-driven Uncertainty Learning (ProULearn) framework. ProULearn utilizes a novel homogeneity propensity estimation mechanism combined with correlation index calculation to evaluate feature-level relationships. This approach enables the identification of representative and challenging samples while avoiding noisy outliers. Additionally, we develop a central correlation loss to refine pseudo-labels and create compact class distributions during adaptation. In this way, ProULearn effectively bridges the domain gap and maximizes adaptation performance. The principles of informative sample selection underlying ProULearn have broad implications beyond SFADA, offering benefits across various deep learning tasks where identifying key data points or features is crucial. Extensive experiments on four benchmark datasets demonstrate that ProULearn outperforms state-of-the-art methods in domain adaptation scenarios.

[CV-37] Quantized Spike-driven Transformer ICLR2025

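[Quick read]: This paper addresses the excessive computational demands that spiking neural networks (SNNs) face when deployed on resource-constrained devices. Although SNNs are an energy-efficient alternative to traditional artificial neural networks (ANNs) thanks to their spike-driven paradigm, recent work has mainly pursued accuracy by designing large-scale Transformer structures that rely on substantial computational resources. The paper therefore proposes a quantized spike-driven Transformer baseline (QSD-Transformer) that reduces resource demands by using low bit-width parameters. However, the QSD-Transformer often suffers severe performance degradation caused by spike information distortion (SID) during quantization.

The key to the solution is a bi-level optimization strategy that rectifies the information distribution in quantized spike-driven self-attention (Q-SDSA). At the lower level, an information-enhanced LIF (Leaky Integrate-and-Fire) neuron rectifies the information distribution in Q-SDSA; at the upper level, a fine-grained distillation scheme aligns the distribution in Q-SDSA with that of the counterpart ANN. With this strategy, the QSD-Transformer gains energy efficiency while retaining high performance. On ImageNet, for example, it achieves 80.3% top-1 accuracy with 6.0× lower power consumption and 8.1× smaller model size than the prior SNN benchmark.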

Link: https://arxiv.org/abs/2501.13492
Authors: Xuerui Qiu, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Malu Zhang, Haizhou Li
Affiliations: University of Electronic Science and Technology of China; Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; China Agricultural University; University of California, Santa Cruz; Liaoning Technical University; Chinese University of Hong Kong (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2025

Abstract:Spiking neural networks are emerging as a promising energy-efficient alternative to traditional artificial neural networks due to their spike-driven paradigm. However, recent research in the SNN domain has mainly focused on enhancing accuracy by designing large-scale Transformer structures, which typically rely on substantial computational resources, limiting their deployment on resource-constrained devices. To overcome this challenge, we propose a quantized spike-driven Transformer baseline (QSD-Transformer), which achieves reduced resource demands by utilizing a low bit-width parameter. Regrettably, the QSD-Transformer often suffers from severe performance degradation. In this paper, we first conduct empirical analysis and find that the bimodal distribution of quantized spike-driven self-attention (Q-SDSA) leads to spike information distortion (SID) during quantization, causing significant performance degradation. To mitigate this issue, we take inspiration from mutual information entropy and propose a bi-level optimization strategy to rectify the information distribution in Q-SDSA. Specifically, at the lower level, we introduce an information-enhanced LIF to rectify the information distribution in Q-SDSA. At the upper level, we propose a fine-grained distillation scheme for the QSD-Transformer to align the distribution in Q-SDSA with that in the counterpart ANN. By integrating the bi-level optimization strategy, the QSD-Transformer can attain enhanced energy efficiency without sacrificing its high performance. For instance, when compared to the prior SNN benchmark on ImageNet, the QSD-Transformer achieves 80.3% top-1 accuracy, accompanied by significant reductions of 6.0× and 8.1× in power consumption and model size, respectively. Code is available at this https URL.
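For readers unfamiliar with the LIF neuron mentioned above, here is a minimal sketch of a plain discrete-time Leaky Integrate-and-Fire unit with hard reset. It is the textbook baseline, not the paper's information-enhanced LIF, and the `decay` and `threshold` values are arbitrary assumptions.

```python
from typing import List, Sequence


def lif_spikes(inputs: Sequence[float], decay: float = 0.5,
               threshold: float = 1.0) -> List[int]:
    """Simulate one LIF neuron over a sequence of input currents.

    Returns a binary spike train of the same length as `inputs`.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the previous potential, then integrate the new input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


print(lif_spikes([0.6, 0.6, 0.6, 0.0, 1.2]))
```

The output is a train of 0s and 1s, which is why downstream layers in a spike-driven network can replace multiplications with sparse additions.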

[CV-38] LDR-Net: A Novel Framework for AI-generated Image Detection via Localized Discrepancy Representation

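[Quick read]: This paper addresses the problem that images produced by generative models have become nearly indistinguishable from real images in visual quality, which challenges content authenticity verification. Existing detectors rely mainly on forgery clues tailored to particular generative models such as GANs or diffusion models, and generalize poorly across architectures. Building on the observation that generated images often show local anomalies (excessive smoothness, blurred textures, and unnatural pixel variations), the paper proposes the Localized Discrepancy Representation Network (LDR-Net). LDR-Net combines two complementary modules to capture smoothing artifacts and texture irregularities: Local Gradient Autocorrelation (LGA) models local smoothing anomalies, and Local Variation Pattern (LVP) captures unnatural regularities by modeling the complexity of image patterns. Fusing LGA and LVP features yields a comprehensive representation of localized discrepancies, giving state-of-the-art detection performance and good generalization to unseen generative models.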

Link: https://arxiv.org/abs/2501.13475
Authors: JiaXin Chen, Miao Hu, DengYong Zhang, Yun Song, Xin Liao
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid advancement of generative models, the visual quality of generated images has become nearly indistinguishable from the real ones, posing challenges to content authenticity verification. Existing methods for detecting AI-generated images primarily focus on specific forgery clues, which are often tailored to particular generative models like GANs or diffusion models. These approaches struggle to generalize across architectures. Building on the observation that generative images often exhibit local anomalies, such as excessive smoothness, blurred textures, and unnatural pixel variations in small regions, we propose the localized discrepancy representation network (LDR-Net), a novel approach for detecting AI-generated images. LDR-Net captures smoothing artifacts and texture irregularities, which are common but often overlooked. It integrates two complementary modules: local gradient autocorrelation (LGA), which models local smoothing anomalies, and local variation pattern (LVP), which captures unnatural regularities by modeling the complexity of image patterns. By merging LGA and LVP features, a comprehensive representation of localized discrepancies can be provided. Extensive experiments demonstrate that our LDR-Net achieves state-of-the-art performance in detecting generated images and exhibits satisfactory generalization across unseen generative models. The code will be released upon acceptance of this paper.

[CV-39] Leveraging Textual Anatomical Knowledge for Class-Imbalanced Semi-Supervised Multi-Organ Segmentation

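[Quick read]: This paper addresses class imbalance in 3D medical image segmentation caused by the complex anatomical structures of organs, together with the failure of existing semi-supervised learning methods to fully exploit prior information such as inter-organ relative positions and organ shape priors. The key to the solution is integrating textual anatomical knowledge (TAK) into the segmentation model. Specifically, the authors use GPT-4o to generate textual descriptions of anatomical priors, encode them with a CLIP-based model, and inject the encoded priors into the segmentation model as parameters of the segmentation head. Contrastive learning further strengthens the alignment between textual priors and visual features. Experiments show that the method significantly surpasses state-of-the-art approaches.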

Link: https://arxiv.org/abs/2501.13470
Authors: Yuliang Gu, Weilun Tsao, Bo Du, Thierry Géraud, Yongchao Xu
Affiliations: School of Computer Science, Wuhan University; EPITA Research and Development Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Annotating 3D medical images demands substantial time and expertise, driving the adoption of semi-supervised learning (SSL) for segmentation tasks. However, the complex anatomical structures of organs often lead to significant class imbalances, posing major challenges for deploying SSL in real-world scenarios. Despite the availability of valuable prior information, such as inter-organ relative positions and organ shape priors, existing SSL methods have yet to fully leverage these insights. To address this gap, we propose a novel approach that integrates textual anatomical knowledge (TAK) into the segmentation model. Specifically, we use GPT-4o to generate textual descriptions of anatomical priors, which are then encoded using a CLIP-based model. These encoded priors are injected into the segmentation model as parameters of the segmentation head. Additionally, contrastive learning is employed to enhance the alignment between textual priors and visual features. Extensive experiments demonstrate the superior performance of our method, significantly surpassing state-of-the-art approaches. The source code will be available at: this https URL.

[CV-40] Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge ICLR2025

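[Quick read]: This paper addresses the limitations of current video understanding models in processing long video sequences, supporting multi-turn dialogue, and adapting to real-world dynamic scenarios. The authors propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Its key innovation is a novel hierarchical memory system that efficiently processes and compresses video features over extended sequences, enabling real-time multi-turn dialogue. The framework also adopts a parallel system scheduling strategy that increases processing speed and reduces latency, for robust performance in real-world applications. The paper further introduces StreamBench, a benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios; evaluations on it and other public benchmarks show that StreamChat significantly outperforms existing state-of-the-art models in accuracy and response time.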

Link: https://arxiv.org/abs/2501.13468
Authors: Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu
Affiliations: Dalian University of Technology; Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2025. Code is available at this https URL

Abstract:Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory system to efficiently process and compress video features over extended sequences, enabling real-time, multi-turn dialogue. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications. Furthermore, we introduce StreamBench, a versatile benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios, including multi-turn interactions and complex reasoning tasks. Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. Code is available at StreamChat: this https URL.

[CV-41] Knowledge-Informed Multi-Agent Trajectory Prediction at Signalized Intersections for Infrastructure-to-Everything

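[Quick read]: This paper addresses multi-agent trajectory prediction at signalized intersections, with the aim of improving the efficiency and safety of intelligent transportation and autonomous driving systems. Existing methods rely mainly on single-vehicle perception and underuse critical intersection information such as traffic signals and behavior patterns induced by road structure, so prediction performance has plateaued. The paper proposes I2XTraj, a multi-agent trajectory prediction framework dedicated to Infrastructure-to-Everything (I2X) scenarios. Its key contributions are: 1) dynamic graph attention that integrates knowledge from traffic signals and driving behaviors; 2) a continuous signal-informed mechanism that adaptively processes real-time traffic signals from infrastructure devices; and 3) a driving strategy awareness mechanism that uses prior knowledge of the intersection topology to model the joint distribution of goal intentions and maneuvers. According to the authors, I2XTraj is the first multi-agent trajectory prediction framework explicitly designed for infrastructure deployment, supplying subscribable prediction services to all vehicles at an intersection. It achieves state-of-the-art performance on the V2X-Seq and SinD datasets, outperforming existing methods by more than 30% in both multi-agent and single-agent scenarios.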

Link: https://arxiv.org/abs/2501.13461
Authors: Huilin Yin, Yangwenhui Xu, Jiaxiang Li, Hao Zhang, Gerhard Rigoll
Affiliations: Tongji University; Technical University of Munich
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:

Abstract:Multi-agent trajectory prediction at signalized intersections is crucial for developing efficient intelligent transportation systems and safe autonomous driving systems. Due to the complexity of intersection scenarios and the limitations of single-vehicle perception, the performance of vehicle-centric prediction methods has reached a plateau. Furthermore, most works underutilize critical intersection information, including traffic signals, and behavior patterns induced by road structures. Therefore, we propose a multi-agent trajectory prediction framework at signalized intersections dedicated to Infrastructure-to-Everything (I2XTraj). Our framework leverages dynamic graph attention to integrate knowledge from traffic signals and driving behaviors. A continuous signal-informed mechanism is proposed to adaptively process real-time traffic signals from infrastructure devices. Additionally, leveraging the prior knowledge of the intersection topology, we propose a driving strategy awareness mechanism to model the joint distribution of goal intentions and maneuvers. To the best of our knowledge, I2XTraj represents the first multi-agent trajectory prediction framework explicitly designed for infrastructure deployment, supplying subscribable prediction services to all vehicles at intersections. I2XTraj demonstrates state-of-the-art performance on both the Vehicle-to-Infrastructure dataset V2X-Seq and the aerial-view dataset SinD for signalized intersections. Quantitative evaluations show that our approach outperforms existing methods by more than 30% in both multi-agent and single-agent scenarios.

[CV-42] EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion

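[Quick read]: This paper addresses the "copy-paste" artifacts and low similarity that affect existing methods for identity-preserving video generation (IPT2V). These problems stem mainly from over-reliance on low-level facial image information, which produces rigid facial appearances and artifacts that reflect irrelevant details. The paper proposes EchoVideo, built on two key strategies: (1) an Identity Image-Text Fusion Module (IITF) that integrates high-level semantic features from text to capture clean facial identity representations while discarding occlusions, poses, and lighting variations, so that artifacts are not introduced; and (2) a two-stage training strategy whose second stage uses shallow facial information at random, balancing the fidelity gains of shallow features against excessive reliance on them. This encourages the model to use high-level features during training and yields a more robust representation of facial identity. EchoVideo preserves facial identity and full-body integrity, and experiments show excellent results in generating videos with high quality, controllability, and fidelity.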

Link: https://arxiv.org/abs/2501.13452
Authors: Jiangchuan Wei, Shiyue Yan, Wenfeng Lin, Boyuan Liu, Renjie Chen, Mingyu Guo
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in video generation have significantly impacted various downstream applications, particularly in identity-preserving video generation (IPT2V). However, existing methods struggle with “copy-paste” artifacts and low similarity issues, primarily due to their reliance on low-level facial image information. This dependence can result in rigid facial appearances and artifacts reflecting irrelevant details. To address these challenges, we propose EchoVideo, which employs two key strategies: (1) an Identity Image-Text Fusion Module (IITF) that integrates high-level semantic features from text, capturing clean facial identity representations while discarding occlusions, poses, and lighting variations to avoid the introduction of artifacts; (2) a two-stage training strategy, incorporating a stochastic method in the second phase to randomly utilize shallow facial information. The objective is to balance the enhancements in fidelity provided by shallow features while mitigating excessive reliance on them. This strategy encourages the model to utilize high-level features during training, ultimately fostering a more robust representation of facial identities. EchoVideo effectively preserves facial identities and maintains full-body integrity. Extensive experiments demonstrate that it achieves excellent results in generating videos with high quality, controllability, and fidelity.

[CV-43] MultiDreamer3D: Multi-concept 3D Customization with Concept-Aware Diffusion Guidance

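[Quick read]: This paper addresses multi-concept customization in 3D generation, which remains largely unexplored. The proposed MultiDreamer3D follows a divide-and-conquer strategy: it first generates 3D bounding boxes with an LLM-based layout controller, then uses a selective point cloud generator to create a coarse point cloud for each concept. The point clouds are placed in the 3D bounding boxes and initialized into 3D Gaussian Splatting with concept labels, which allows concept attributions to be identified precisely in 2D projections. Finally, the 3D Gaussians are refined via concept-aware interval score matching guided by concept-aware diffusion. Experiments show that MultiDreamer3D ensures object presence, preserves the distinct identity of each concept, and handles complex cases such as property change or interaction.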

Link: https://arxiv.org/abs/2501.13449
Authors: Wooseok Song, Seunggyu Chang, Jaejun Yoo
Affiliations: Ulsan National Institute of Science and Technology (UNIST); NAVER Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:While single-concept customization has been studied in 3D, multi-concept customization remains largely unexplored. To address this, we propose MultiDreamer3D that can generate coherent multi-concept 3D content in a divide-and-conquer manner. First, we generate 3D bounding boxes using an LLM-based layout controller. Next, a selective point cloud generator creates coarse point clouds for each concept. These point clouds are placed in the 3D bounding boxes and initialized into 3D Gaussian Splatting with concept labels, enabling precise identification of concept attributions in 2D projections. Finally, we refine 3D Gaussians via concept-aware interval score matching, guided by concept-aware diffusion. Our experimental results show that MultiDreamer3D not only ensures object presence and preserves the distinct identities of each concept but also successfully handles complex cases such as property change or interaction. To the best of our knowledge, we are the first to address the multi-concept customization in 3D.

[CV-44] One-cycle Structured Pruning with Stability Driven Structure Search

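[Quick read]: This paper addresses the fact that existing structured pruning methods usually require multi-stage training and heavy computation. Pruning at initialization reduces training cost but struggles with performance. The paper proposes an efficient one-cycle structured pruning framework that integrates pre-training, pruning, and fine-tuning into a single training cycle, called the "one-cycle approach". The key points are: 1) searching for the optimal sub-network in the early stages of training, guided by norm-based group saliency criteria and structured sparsity regularization; 2) a novel pruning indicator that determines the stable pruning epoch by assessing the similarity between evolving pruning sub-networks across consecutive training epochs; and 3) group sparsity regularization that accelerates pruning and thereby the whole process. Experiments on CIFAR-10/100 and ImageNet with VGGNet, ResNet, MobileNet, and ViT architectures show state-of-the-art accuracy with highly efficient training time.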

Link: https://arxiv.org/abs/2501.13439
Authors: Deepak Ghimire, Dayoung Kil, Seonghwan Jeong, Jaesik Park, Seong-heum Kim
Affiliations: Korea Electronics Technology Institute; Soongsil University; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Existing structured pruning typically involves multi-stage training procedures that often demand heavy computation. Pruning at initialization, which aims to address this limitation, reduces training costs but struggles with performance. To address these challenges, we propose an efficient framework for one-cycle structured pruning without compromising model performance. In this approach, we integrate pre-training, pruning, and fine-tuning into a single training cycle, referred to as the “one cycle approach”. The core idea is to search for the optimal sub-network during the early stages of network training, guided by norm-based group saliency criteria and structured sparsity regularization. We introduce a novel pruning indicator that determines the stable pruning epoch by assessing the similarity between evolving pruning sub-networks across consecutive training epochs. Also, group sparsity regularization helps to accelerate the pruning process and results in speeding up the entire process. Extensive experiments on datasets, including CIFAR-10/100, and ImageNet, using VGGNet, ResNet, MobileNet, and ViT architectures, demonstrate that our method achieves state-of-the-art accuracy while being one of the most efficient pruning frameworks in terms of training time. The source code will be made publicly available.
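The pruning indicator compares the candidate sub-networks selected at consecutive epochs and waits until they stop changing. A minimal sketch of that idea follows, assuming Jaccard similarity over the sets of kept filter indices and a fixed threshold; both are illustrative choices, since the abstract does not say which similarity measure or stopping rule the paper uses.

```python
from typing import List, Optional, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two index sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stable_pruning_epoch(masks: List[Set[int]],
                         threshold: float = 0.9) -> Optional[int]:
    """Return the first epoch whose kept-filter set matches the previous epoch's.

    `masks[e]` is the hypothetical set of filter indices kept at epoch `e`.
    Returns None if the selection never stabilizes.
    """
    for epoch in range(1, len(masks)):
        if jaccard(masks[epoch - 1], masks[epoch]) >= threshold:
            return epoch
    return None


kept_filters = [
    {0, 1, 2, 3},
    {0, 2, 3, 5},
    {0, 2, 3, 6},
    {0, 2, 3, 6},
]
print(stable_pruning_epoch(kept_filters))
```

In a training loop, the returned epoch would be the point at which the sub-network is fixed and the remaining epochs act as fine-tuning.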

[CV-45] GC-ConsFlow: Leveraging Optical Flow Residuals and Global Context for Robust Deepfake Detection

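[Quick read]: This paper addresses the limitations of existing Deepfake detection methods in their use of spatial and temporal features. Existing methods usually focus on either spatial or temporal inconsistencies, neglecting the interplay between the two, or suffer from interference caused by natural facial motion. The paper proposes the global context consistency flow (GC-ConsFlow), a novel dual-stream framework that integrates spatial and temporal features for robust Deepfake detection. It has two core components. The global grouped context aggregation module (GGCA) strengthens spatial feature extraction by aggregating grouped global context information, enabling detection of subtle spatial artifacts within frames. The flow-gradient temporal consistency stream (FGTC) uses optical flow residuals and gradient-based features to make temporal feature extraction more robust to the inconsistency introduced by unnatural facial motion. Combining the two lets GC-ConsFlow capture complementary spatiotemporal forgery traces and outperform existing state-of-the-art methods under various compression scenarios.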

Link: https://arxiv.org/abs/2501.13435
Authors: Jiaxin Chen, Miao Hu, Dengyong Zhang, Jingyang Meng
Affiliations: Changsha University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid development of Deepfake technology has enabled the generation of highly realistic manipulated videos, posing severe social and ethical challenges. Existing Deepfake detection methods primarily focus on either spatial or temporal inconsistencies, often neglecting the interplay between the two or suffering from interference caused by natural facial motions. To address these challenges, we propose the global context consistency flow (GC-ConsFlow), a novel dual-stream framework that effectively integrates spatial and temporal features for robust Deepfake detection. The global grouped context aggregation module (GGCA), integrated into the global context-aware frame flow stream (GCAF), enhances spatial feature extraction by aggregating grouped global context information, enabling the detection of subtle spatial artifacts within frames. Rather than directly modeling the residuals, the flow-gradient temporal consistency stream (FGTC) uses optical flow residuals and gradient-based features to improve the robustness of temporal feature extraction against the inconsistency introduced by unnatural facial motion. By combining these two streams, GC-ConsFlow demonstrates effectiveness and robustness in capturing complementary spatiotemporal forgery traces. Extensive experiments show that GC-ConsFlow outperforms existing state-of-the-art methods in detecting Deepfake videos under various compression scenarios.

[CV-46] Emotion estimation from video footage with LSTM

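[Quick read]: This paper addresses emotion estimation: inferring the main emotion from facial expressions in a live video stream. The key to the solution is a Long Short-Term Memory (LSTM) model that processes the facial blend-shapes produced by the MediaPipe library and infers the main emotion from the detected facial expression. Trained on the FER2013 dataset, the model reaches 71% accuracy and a 62% F1 score, meeting the FER2013 accuracy benchmark at significantly reduced computational cost.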

Link: https://arxiv.org/abs/2501.13432
Authors: Samer Attrah
Affiliations: Engineering and Automotive Academy, Hogeschool Van Arnhem en Nijmegen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 11 pages, 6 figures, 32 references, 4 tables

Abstract:Emotion estimation is a field that has been studied for a long time, and several machine-learning approaches exist. In this paper, we present an LSTM model that processes the blend-shapes produced by the MediaPipe library for a face detected in a live camera stream, in order to estimate the main emotion from the facial expressions. The model is trained on the FER2013 dataset and delivers 71% accuracy and a 62% F1 score, which meets the accuracy benchmark of the FER2013 dataset with significantly reduced computation costs. this https URL Samir-atra/Emotion_estimation_from_video_footage_with_LSTM_ML_algorithm

[CV-47] Auto-Prompting SAM for Weakly Supervised Landslide Extraction

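[Quick read]: This paper addresses two main problems in weakly supervised landslide extraction: imprecise boundaries of extracted objects due to the lack of pixel-wise supervision, and the difficulty of precise segmentation caused by the properties of landslide objects. It proposes a simple yet effective method, APSAM, which auto-prompts the Segment Anything Model (SAM) to obtain fine-grained segmentation of landslide regions. Rather than depending on high-quality class activation maps (CAMs) for pseudo-labeling or on fine-tuning SAM, the method obtains fine-grained segmentation masks directly from SAM inference through prompt engineering. Specifically, an adaptive prompt generation (APG) algorithm generates hybrid prompts from the CAMs produced by an object localization network; these prompts identify the extent of landslide areas (box prompts) and mark the centers of landslide objects (point prompts), guiding SAM in landslide segmentation. On high-resolution aerial and satellite datasets, the method improves F1 score by at least 3.0% and IoU by at least 3.69% over other state-of-the-art methods.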

Link: https://arxiv.org/abs/2501.13426
Authors: Jian Wang, Xiaokang Zhang, Xianping Ma, Weikang Yu, Pedram Ghamisi
Affiliations: School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China; Chair of Data Science in Earth Observation, Technical University of Munich (TUM), 80333 Munich, Germany; Helmholtz-Zentrum Dresden-Rossendorf, 09599 Freiberg, Germany; Lancaster Environment Centre, Lancaster University, LA1 4YR Lancaster, U.K.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 5 figures

Abstract:Weakly supervised landslide extraction aims to identify landslide regions from remote sensing data using models trained with weak labels, particularly image-level labels. However, it is often challenged by the imprecise boundaries of the extracted objects due to the lack of pixel-wise supervision and the properties of landslide objects. To tackle these issues, we propose a simple yet effective method by auto-prompting the Segment Anything Model (SAM), i.e., APSAM. Instead of depending on high-quality class activation maps (CAMs) for pseudo-labeling or fine-tuning SAM, our method directly yields fine-grained segmentation masks from SAM inference through prompt engineering. Specifically, it adaptively generates hybrid prompts from the CAMs obtained by an object localization network. To provide sufficient information for SAM prompting, an adaptive prompt generation (APG) algorithm is designed to fully leverage the visual patterns of CAMs, enabling the efficient generation of pseudo-masks for landslide extraction. These informative prompts are able to identify the extent of landslide areas (box prompts) and denote the centers of landslide objects (point prompts), guiding SAM in landslide segmentation. Experimental results on high-resolution aerial and satellite datasets demonstrate the effectiveness of our method, achieving improvements of at least 3.0% in F1 score and 3.69% in IoU compared to other state-of-the-art methods. The source codes and datasets will be available at this https URL.
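
The prompt-generation idea described above can be illustrated with a minimal sketch. It is not the paper's APG algorithm: it assumes a single landslide region per CAM, a fixed threshold, and a hypothetical helper name `prompts_from_cam`. It only shows how a box prompt and a point prompt could be read off a class activation map before being passed to SAM.

```python
import numpy as np

def prompts_from_cam(cam, threshold=0.5):
    """Derive one box prompt and one point prompt from a 2-D CAM.

    Hypothetical helper for illustration only. Returns
    ((x_min, y_min, x_max, y_max), (x, y)) or None if nothing is activated.
    """
    cam = np.asarray(cam, dtype=float)
    # Normalize to [0, 1] so the threshold does not depend on the CAM scale.
    span = cam.max() - cam.min()
    norm = (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)
    ys, xs = np.nonzero(norm >= threshold)
    if ys.size == 0:
        return None
    # Box prompt: tight bounding box around the activated pixels.
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    # Point prompt: location of the strongest activation.
    peak_y, peak_x = np.unravel_index(np.argmax(norm), norm.shape)
    return box, (int(peak_x), int(peak_y))

# Toy CAM with one activated blob and a clear peak.
cam = np.zeros((6, 6))
cam[2:5, 1:4] = 0.6
cam[3, 2] = 1.0
box, point = prompts_from_cam(cam)
```

A real pipeline would need to handle several disconnected regions and adapt the threshold per image, which is presumably where the adaptive part of APG comes in.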

[CV-48] Atmospheric Noise-Resilient Image Classification in a Real-World Scenario: Using Hybrid CNN and Pin-GTSVM

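[Quick read]: This paper addresses the sharp performance drop of existing deep-learning-based parking space occupancy detection systems under haze. Current parking slot classifiers cope well with partial obstructions and varying lighting, but perform poorly in hazy conditions. The paper proposes a novel hybrid model that combines a pre-trained feature extractor with a Pinball Generalized Twin Support Vector Machine (Pin-GTSVM) classifier, removing the dependence of current state-of-the-art systems on a dehazing stage while remaining insensitive to atmospheric noise. A key point is that the system integrates seamlessly with existing smart parking infrastructure and needs only a small number of cameras to monitor and manage hundreds of parking spaces. Experiments on CNRPark Patches, PKLot, and a custom dataset of hazy parking scenes show a significant accuracy improvement under haze, underlining its effective handling of atmospheric noise.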

Link: https://arxiv.org/abs/2501.13422
Authors: Shlok Mehendale, Jajati Keshari Sahoo, Rajendra Kumar Roul
Affiliations: Department of Computer Science & Engineering, BITS Pilani K K Birla Goa Campus, Goa, India; Department of Mathematics, BITS Pilani K K Birla Goa Campus, Goa, India; Department of Computer Science & Engineering, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Parking space occupation detection using deep learning frameworks has seen significant advancements over the past few years. While these approaches effectively detect partial obstructions and adapt to varying lighting conditions, their performance significantly diminishes when haze is present. This paper proposes a novel hybrid model with a pre-trained feature extractor and a Pinball Generalized Twin Support Vector Machine (Pin-GTSVM) classifier, which removes the need for a dehazing system from the current State-of-The-Art hazy parking slot classification systems and is also insensitive to any atmospheric noise. The proposed system can seamlessly integrate with conventional smart parking infrastructures, leveraging a minimal number of cameras to monitor and manage hundreds of parking spaces efficiently. Its effectiveness has been evaluated against established parking space detection methods using the CNRPark Patches, PKLot, and a custom dataset specific to hazy parking scenarios. Furthermore, empirical results indicate a significant improvement in accuracy on a hazy parking system, thus emphasizing efficient atmospheric noise handling.
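
For readers unfamiliar with the "pinball" part of Pin-GTSVM, the sketch below shows the standard pinball loss used in pinball-loss SVMs next to the hinge loss. The generalized twin formulation of the paper is not reproduced here; the function names and the parameter `tau` are illustrative assumptions.

```python
import numpy as np

def hinge_loss(u):
    """Hinge loss on u = 1 - y * f(x): only margin violations are penalized."""
    return np.maximum(0.0, np.asarray(u, dtype=float))

def pinball_loss(u, tau=0.5):
    """Pinball loss on u = 1 - y * f(x).

    Violations (u >= 0) cost u; points beyond the margin (u < 0) still
    cost -tau * u. This two-sided penalty is what makes pinball-loss
    classifiers less sensitive to noise near the decision boundary.
    """
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, u, -tau * u)

u = np.array([-2.0, 0.0, 1.5])
losses = pinball_loss(u, tau=0.5)
```

With `tau = 0` the pinball loss reduces to the hinge loss, so `tau` controls how strongly correctly classified points far from the margin are still taken into account.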

[CV-49] LVFace: Large Vision model for Face Recognition

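[Quick read]: This paper addresses the observation that face recognition research still relies mainly on CNN-based architectures, which may leave state-of-the-art performance suboptimal. The authors propose LVFace, a new approach based on large vision models. The key is to use various loss functions from earlier research orthogonally during training to improve performance. On WebFace42M, the largest public face database, LVFace outperforms other advanced face recognition methods, and it took first place in the ICCV21 MFR-Ongoing challenge.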

Link: https://arxiv.org/abs/2501.13420
Authors: Jinghan You, Yuanrui Sun, Mingyu Guo, Chao Feng, Jiao Ran
Affiliations: ByteDance; FaceX
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, large vision models have demonstrated powerful representation capabilities in the field of computer vision. However, we unexpectedly found that face recognition research is still mainly focused on CNN-based model architectures, which may lead to suboptimal state-of-the-art (SOTA) performance in face recognition. Therefore, we study how to use various loss functions from historical research orthogonally to train a new state-of-the-art face recognition model based on large vision models, called LVFace. On the largest public face database, WebFace42M, we demonstrated the superiority of LVFace over other advanced face recognition methods and achieved first place in the ICCV21 MFR-Ongoing challenge, until the submission of this work (December 30, 2024, academic track).

[CV-50] Rethinking the Sample Relations for Few-Shot Classification

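[Quick read]: This paper addresses the fact that existing contrastive learning methods for Few-Shot Learning (FSL) overlook differences in semantic similarity across sample relations. Existing methods typically model sample relations of different granularities in the same way, which keeps contrastive learning from reaching its full potential. The paper proposes Multi-Grained Relation Contrastive Learning (MGRCL), a pre-training feature learning model that improves few-shot learning by carefully modeling sample relations at different granularities. MGRCL divides sample relations into three types: the intra-sample relation of one sample under different transformations, the intra-class relation among samples of the same class, and the inter-class relation among samples of different classes. The key components are Transformation Consistency Learning (TCL), which enforces semantic consistency of a sample across transformations, and Class Contrastive Learning (CCL), which keeps a sample closer to same-class samples than to different-class ones so that discriminative information is preserved. Experiments show strong results on several few-shot benchmarks, and the method can serve as the pre-trained model for other FSL methods, yielding significant gains.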

Link: https://arxiv.org/abs/2501.13418
Authors: Guowei Yin, Sheng Huang, Luwen Huangfu, Yi Zhang, Xiaohong Zhang
Affiliations: School of Big Data and Software Engineering, Chongqing University; Ministry of Education Key Laboratory of Dependable Service Computing in Cyber Physical Society, Chongqing University; Fowler College of Business, San Diego State University; Center for Human Dynamics in the Mobile Age (HDMA), San Diego State University; AI4Business Lab, San Diego State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 32 pages

Abstract:Feature quality is paramount for classification performance, particularly in few-shot scenarios. Contrastive learning, a widely adopted technique for enhancing feature quality, leverages sample relations to extract intrinsic features that capture semantic information and has achieved remarkable success in Few-Shot Learning (FSL). Nevertheless, current few-shot contrastive learning approaches often overlook the semantic similarity discrepancies at different granularities when employing the same modeling approach for different sample relations, which limits the potential of few-shot contrastive learning. In this paper, we introduce a straightforward yet effective contrastive learning approach, Multi-Grained Relation Contrastive Learning (MGRCL), as a pre-training feature learning model to boost few-shot learning by meticulously modeling sample relations at different granularities. MGRCL categorizes sample relations into three types: intra-sample relation of the same sample under different transformations, intra-class relation of homogenous samples, and inter-class relation of inhomogeneous samples. In MGRCL, we design Transformation Consistency Learning (TCL) to ensure the rigorous semantic consistency of a sample under different transformations by aligning predictions of input pairs. Furthermore, to preserve discriminative information, we employ Class Contrastive Learning (CCL) to ensure that a sample is always closer to its homogenous samples than its inhomogeneous ones, as homogenous samples share similar semantic content while inhomogeneous samples have different semantic content. Our method is assessed across four popular FSL benchmarks, showing that such a simple pre-training feature learning method surpasses a majority of leading FSL methods. Moreover, our method can be incorporated into other FSL methods as the pre-trained model and help them obtain significant performance gains.
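
The class-level idea, that a sample should sit closer to same-class samples than to different-class ones, can be sketched with a generic supervised contrastive loss. This is a stand-in under stated assumptions, not the paper's CCL objective: the function name, the temperature value, and the use of cosine similarity are all illustrative choices.

```python
import numpy as np

def class_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the paper's CCL).

    For each anchor, same-class samples are positives and all other
    samples are negatives. At least one class must have two samples.
    """
    z = np.asarray(features, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    labels = np.asarray(labels)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    sim = np.where(not_self, z @ z.T / temperature, -np.inf)
    # Log-softmax over all other samples in the batch.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    per_anchor = [-log_prob[i][positives[i]].mean()
                  for i in range(n) if positives[i].any()]
    return float(np.mean(per_anchor))

feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = class_contrastive_loss(feats, [0, 0, 1, 1])
loss_mismatched = class_contrastive_loss(feats, [0, 1, 0, 1])
```

When the labels agree with the feature clusters the loss is small; when they cut across the clusters it grows, which is the pressure that pulls same-class features together during pre-training.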

[CV-51] GeomGS: LiDAR-Guided Geometry-Aware Gaussian Splatting for Robot Localization

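[Quick read]: This paper addresses the difficulty that 3D Gaussian Splatting (3DGS) methods in robotics and autonomous driving have in building 3D maps that accurately reflect real-world scale and geometry, and the resulting loss of localization performance. It proposes Geometry-Aware Gaussian Splatting (GeomGS). The key is to fully integrate LiDAR data into the 3D Gaussian primitives through a probabilistic approach, rather than using LiDAR only for initial points or imposing simple constraints on Gaussian points. To this end, the paper introduces a Geometric Confidence Score (GCS) that identifies the structural reliability of each Gaussian point. The GCS is optimized jointly with the Gaussians under probabilistic distance constraints to build a precise structure. The paper also proposes a new localization method that exploits both the geometric and photometric properties of GeomGS. GeomGS achieves state-of-the-art geometric and localization performance on several benchmarks while also improving photometric performance.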

Link: https://arxiv.org/abs/2501.13417
Authors: Jaewon Lee, Mangyu Kong, Minseong Park, Euntai Kim
Affiliations: School of Electrical and Electronic Engineering, Yonsei University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint, Under review

Abstract:Mapping and localization are crucial problems in robotics and autonomous driving. Recent advances in 3D Gaussian Splatting (3DGS) have enabled precise 3D mapping and scene understanding by rendering photo-realistic images. However, existing 3DGS methods often struggle to accurately reconstruct a 3D map that reflects the actual scale and geometry of the real world, which degrades localization performance. To address these limitations, we propose a novel 3DGS method called Geometry-Aware Gaussian Splatting (GeomGS). This method fully integrates LiDAR data into 3D Gaussian primitives via a probabilistic approach, as opposed to approaches that only use LiDAR as initial points or introduce simple constraints for Gaussian points. To this end, we introduce a Geometric Confidence Score (GCS), which identifies the structural reliability of each Gaussian point. The GCS is optimized simultaneously with Gaussians under probabilistic distance constraints to construct a precise structure. Furthermore, we propose a novel localization method that fully utilizes both the geometric and photometric properties of GeomGS. Our GeomGS demonstrates state-of-the-art geometric and localization performance across several benchmarks, while also improving photometric performance.

[CV-52] VIGS SLAM: IMU-based Large-Scale 3D Gaussian Splatting SLAM

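[Quick read]: This paper addresses the challenges of running SLAM (Simultaneous Localization and Mapping) in large-scale indoor environments with radiance-field map representations such as 3D Gaussian Splatting and NeRF. These methods build highly realistic maps, but large-scale SLAM remains difficult because they need a large number of Gaussian images for mapping and adjacent images as keyframes for tracking. The paper proposes VIGS SLAM, which fuses RGB-D and IMU sensor data. The key is an ICP (Iterative Closest Point) based tracking framework combined with IMU preintegration, which supplies a good initial guess for pose estimation and reduces the computational load of 3DGS-based tracking. According to the authors, this is the first method to show that Gaussian Splatting SLAM can be performed effectively in large-scale environments by integrating IMU measurements; it extends Gaussian Splatting SLAM beyond room scale and achieves performance comparable to state-of-the-art methods in large-scale indoor environments.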

Link: https://arxiv.org/abs/2501.13402
Authors: Gyuhyeon Pak, Euntai Kim
Affiliations: Department of Electrical and Electronic Engineering, Yonsei University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures

Abstract:Recently, map representations based on radiance fields such as 3D Gaussian Splatting and NeRF, which are excellent for realistic depiction, have attracted considerable attention, leading to attempts to combine them with SLAM. While these approaches can build highly realistic maps, large-scale SLAM still remains a challenge because they require a large number of Gaussian images for mapping and adjacent images as keyframes for tracking. We propose a novel 3D Gaussian Splatting SLAM method, VIGS SLAM, that utilizes sensor fusion of RGB-D and IMU sensors for large-scale indoor environments. To reduce the computational load of 3DGS-based tracking, we adopt an ICP-based tracking framework that combines IMU preintegration to provide a good initial guess for accurate pose estimation. Our proposed method is the first to show that Gaussian Splatting-based SLAM can be effectively performed in large-scale environments by integrating IMU sensor measurements. It not only enhances the performance of Gaussian Splatting SLAM beyond room-scale scenarios but also achieves SLAM performance comparable to state-of-the-art methods in large-scale indoor environments.

[CV-53] YOLOv8 to YOLO11: A Comprehensive Architecture In-depth Comparative Review

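[Quick read]: This paper addresses the difficulty of understanding and comparing the architectures of current YOLO (You Only Look Once) models. Because some YOLO models lack scholarly publications or a publicly available official architecture diagram, researchers struggle to understand how these models actually work and how they differ. The paper analyzes in detail the architectures of the four most recent versions, YOLOv8 through YOLO11, and provides a comprehensive comparison so that readers can quickly grasp what each model does and how the models differ. The key is a careful study of the relevant academic papers, documentation, and source code, which reveals the architectural and feature-extraction improvements in each version and shows that certain blocks remain unchanged. The paper also encourages future developers to provide scholarly publications and official architecture diagrams to support understanding and future improvement.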

Link: https://arxiv.org/abs/2501.13400
Authors: Priyanto Hidayatullah, Nurjannah Syakrani, Muhammad Rizqi Sholahuddin, Trisna Gelar, Refdinal Tubagus
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: submitted to Journal of Applied Engineering and Technological Science

Abstract:In the field of deep learning-based computer vision, YOLO is revolutionary. With respect to deep learning models, YOLO is also the one that is evolving the most rapidly. Unfortunately, not every YOLO model possesses scholarly publications. Moreover, there exists a YOLO model that lacks a publicly accessible official architectural diagram. Naturally, this engenders challenges, such as complicating the understanding of how the model operates in practice. Furthermore, the review articles that are presently available do not delve into the specifics of each model. The objective of this study is to present a comprehensive and in-depth architecture comparison of the four most recent YOLO models, specifically YOLOv8 through YOLO11, thereby enabling readers to quickly grasp not only how each model functions, but also the distinctions between them. To analyze each YOLO version’s architecture, we meticulously examined the relevant academic papers, documentation, and scrutinized the source code. The analysis reveals that while each version of YOLO has improvements in architecture and feature extraction, certain blocks remain unchanged. The lack of scholarly publications and official diagrams presents challenges for understanding the model’s functionality and future enhancement. Future developers are encouraged to provide these resources.

[CV-54] Towards Intelligent Design: A Self-driven Framework for Collocated Clothing Synthesis Leveraging Fashion Styles and Textures ICASSP2024

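[Quick read]: This paper addresses collocated clothing synthesis (CCS) in fashion technology, that is, generating a clothing item that harmoniously matches a given item, without relying on paired outfits. Earlier methods train generative models on paired outfits assembled by fashion professionals, a laborious and time-consuming process. The paper introduces a self-driven framework, the style- and texture-guided generative network (ST-Net). Using self-supervised learning, the framework infers fashion compatibility rules from the style and texture attributes of clothing and realizes them with a generative adversarial network (GAN). The key innovation is that no paired outfits are needed; the model is trained and evaluated on a large-scale dataset built for unsupervised CCS, and it surpasses existing methods in both visual authenticity and fashion compatibility.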

Link: https://arxiv.org/abs/2501.13396
Authors: Minglong Dong, Dongliang Zhou, Jianghong Ma, Haijun Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for presentation at ICASSP 2024

Abstract:Collocated clothing synthesis (CCS) has emerged as a pivotal topic in fashion technology, primarily concerned with the generation of a clothing item that harmoniously matches a given item. However, previous investigations have relied on using paired outfits, such as a pair of matching upper and lower clothing, to train a generative model for achieving this task. This reliance on the expertise of fashion professionals in the construction of such paired outfits has engendered a laborious and time-intensive process. In this paper, we introduce a new self-driven framework, named style- and texture-guided generative network (ST-Net), to synthesize collocated clothing without the necessity for paired outfits, leveraging self-supervised learning. ST-Net is designed to extrapolate fashion compatibility rules from the style and texture attributes of clothing, using a generative adversarial network. To facilitate the training and evaluation of our model, we have constructed a large-scale dataset specifically tailored for unsupervised CCS. Extensive experiments substantiate that our proposed method outperforms the state-of-the-art baselines in terms of both visual authenticity and fashion compatibility.

[CV-55] AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning

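[Quick read]: This paper addresses robust training with noisy labels in image classification, in particular the mix of in-distribution (ID) and out-of-distribution (OOD) instance-dependent label noise that is common in real-world datasets. Existing methods rarely handle both noise types at once, and comprehensive benchmark datasets for evaluating them are lacking. In addition, although current noisy-label learning methods try to identify noisy-label samples during training, they do not estimate ID and OOD noise rates, which limits how effectively they select such samples, and they often rely on inefficient multi-stage learning algorithms.

The proposed solution is Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise (AEON), an efficient one-stage noisy-label learning method that dynamically estimates instance-dependent ID and OOD noise rates to improve robustness in complex noise settings. The paper also introduces a new benchmark that reflects real-world ID and OOD noise scenarios. Experiments show that AEON achieves state-of-the-art performance on both synthetic and real-world datasets.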

Link: https://arxiv.org/abs/2501.13389
Authors: Arpit Garg, Cuong Nguyen, Rafael Felix, Yuyuan Liu, Thanh-Toan Do, Gustavo Carneiro
Affiliations: University of Adelaide; University of Surrey; University of Oxford; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In Submission

Abstract:Robust training with noisy labels is a critical challenge in image classification, offering the potential to reduce reliance on costly clean-label datasets. Real-world datasets often contain a mix of in-distribution (ID) and out-of-distribution (OOD) instance-dependent label noise, a challenge that is rarely addressed simultaneously by existing methods and is further compounded by the lack of comprehensive benchmarking datasets. Furthermore, even though current noisy-label learning approaches attempt to find noisy-label samples during training, these methods do not aim to estimate ID and OOD noise rates to promote their effectiveness in the selection of such noisy-label samples, and they are often represented by inefficient multi-stage learning algorithms. We propose the Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise (AEON) approach to address these research gaps. AEON is an efficient one-stage noisy-label learning methodology that dynamically estimates instance-dependent ID and OOD label noise rates to enhance robustness to complex noise settings. Additionally, we introduce a new benchmark reflecting real-world ID and OOD noise scenarios. Experiments demonstrate that AEON achieves state-of-the-art performance on both synthetic and real-world datasets.

[CV-56] From Images to Point Clouds: An Efficient Solution for Cross-media Blind Quality Assessment without Annotated Training

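[Quick read]: This paper addresses point cloud quality assessment (PCQA) without annotated training data, specifically how to predict the perceptual quality of point clouds from new scenes when no annotations are available. The key is the Distribution-Weighted Image-Transferred Point Cloud Quality Assessment (DWIT-PCQA) method, which uses the prior knowledge in images to emulate the evaluation criteria of the human visual system (HVS) and transfers quality-prediction capability from images to point clouds. Concretely, domain adaptation (DA) aligns the feature distributions of images and point clouds in a shared feature space. To reduce the difficulty of alignment and account for differing distortion distributions, the optimization objective of conventional DA is decomposed into two sub-optimization functions with distortion as the transition. The paper further proposes distortion-guided biased feature alignment and quality-aware feature disentanglement, the latter to limit damage to the feature-to-quality mapping during alignment. Experiments show reliable performance compared with general blind PCQA methods, without requiring point cloud annotations.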

Link: https://arxiv.org/abs/2501.13387
Authors: Yipeng Liu, Qi Yang, Yujie Zhang, Yiling Xu, Le Yang, Zhu Li
Affiliations: Cooperative Medianet Innovation Center, Shanghai Jiaotong University; University of Missouri-Kansas City; Department of Electrical and Computer Engineering, University of Canterbury
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:We present a novel quality assessment method which can predict the perceptual quality of point clouds from new scenes without available annotations by leveraging the rich prior knowledge in images, called the Distribution-Weighted Image-Transferred Point Cloud Quality Assessment (DWIT-PCQA). Recognizing the human visual system (HVS) as the decision-maker in quality assessment regardless of media types, we can emulate the evaluation criteria for human perception via neural networks and further transfer the capability of quality prediction from images to point clouds by leveraging the prior knowledge in the images. Specifically, domain adaptation (DA) can be leveraged to bridge the images and point clouds by aligning feature distributions of the two media in the same feature space. However, the different manifestations of distortions in images and point clouds make feature alignment a difficult task. To reduce the alignment difficulty and consider the different distortion distribution during alignment, we have derived formulas to decompose the optimization objective of the conventional DA into two suboptimization functions with distortion as a transition. Specifically, through network implementation, we propose the distortion-guided biased feature alignment which integrates existing/estimated distortion distribution into the adversarial DA framework, emphasizing common distortion patterns during feature alignment. Besides, we propose the quality-aware feature disentanglement to mitigate the destruction of the mapping from features to quality during alignment with biased distortions. Experimental results demonstrate that our proposed method exhibits reliable performance compared to general blind PCQA methods without needing point cloud annotations.

[CV-57] Meta-Feature Adapter: Integrating Environmental Metadata for Enhanced Animal Re-identification

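[Quick read]: This paper addresses the fact that existing animal re-identification (Animal ReID) methods rely only on visual data and ignore environmental metadata. Metadata such as temperature and circadian rhythms is highly correlated with animal behavior and identity, yet existing methods do not exploit it. The paper proposes the Meta-Feature Adapter (MFA), a lightweight module that integrates environmental metadata into vision-language foundation models such as CLIP to improve Animal ReID performance. The key is to translate the metadata into natural language descriptions, encode them as metadata-aware text embeddings, and combine these embeddings with image features through a cross-attention mechanism. A Gated Cross-Attention mechanism additionally adjusts the weights of the metadata contributions dynamically. To validate the approach, the authors built the Metadata Augmented Animal Re-identification (MAAR) dataset, which covers six species from New Zealand with paired image data and environmental metadata. Experiments show that MFA consistently improves Animal ReID performance across multiple baseline models.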

Link: https://arxiv.org/abs/2501.13368
Authors: Yuzhuo Li, Di Zhao, Yihao Wu, Yun Sing Koh
Affiliations: The University of Auckland
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures

Abstract:Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing methods rely exclusively on visual data, neglecting environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. To bridge this gap, we propose the Meta-Feature Adapter (MFA), a lightweight module designed to integrate environmental metadata into vision-language foundation models, such as CLIP, to enhance Animal ReID performance. Our approach translates environmental metadata into natural language descriptions, encodes them into metadata-aware text embeddings, and incorporates these embeddings into image features through a cross-attention mechanism. Furthermore, we introduce a Gated Cross-Attention mechanism that dynamically adjusts the weights of metadata contributions, further improving performance. To validate our approach, we constructed the Metadata Augmented Animal Re-identification (MAAR) dataset, encompassing six species from New Zealand and featuring paired image data and environmental metadata. Extensive experiments demonstrate that MFA consistently improves Animal ReID performance across multiple baseline models.
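
A rough sketch of how gated cross-attention can mix metadata text embeddings into image features is given below. It assumes a single attention head, a scalar tanh gate in the style of other gated cross-attention layers, and hypothetical names and shapes throughout; the paper's actual gate may be computed differently.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(image_tokens, text_tokens, w_q, w_k, w_v, gate):
    """Image tokens attend to metadata text tokens; a gate scales the update.

    image_tokens: (n, d) image features (queries)
    text_tokens:  (m, d) metadata-aware text embeddings (keys and values)
    gate:         scalar; tanh(gate) = 0 leaves the image features unchanged
    """
    q = image_tokens @ w_q
    k = text_tokens @ w_k
    v = text_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n, m) weights
    return image_tokens + np.tanh(gate) * (attn @ v)

rng = np.random.default_rng(0)
img = rng.normal(size=(3, 4))   # 3 image tokens, width 4
txt = rng.normal(size=(2, 4))   # 2 metadata tokens, e.g. temperature, time of day
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
closed = gated_cross_attention(img, txt, w_q, w_k, w_v, gate=0.0)
opened = gated_cross_attention(img, txt, w_q, w_k, w_v, gate=1.0)
```

Starting with the gate near zero lets a pre-trained backbone behave as before, with the metadata contribution increasing only as training finds it useful.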

[CV-58] Enhanced Extractor-Selector Framework and Symmetrization Weighted Binary Cross-Entropy for Edge Detections

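[Quick read]: This paper addresses limitations of the existing extractor-selector (E-S) framework for edge detection (ED). The selectors operate on feature maps that suffer from information loss and lack diversity, which limits the feature selection mechanism. In addition, union training improves perceptual quality but usually does not yield the highest quantitative scores, creating a trade-off between quantitative accuracy and perceptual fidelity.

The proposed solution has two parts. First, an enhanced E-S architecture uses richer, less lossy feature representations and incorporates auxiliary features during selection, making feature selection more effective. Second, a new loss function, the Symmetrization Weight Binary Cross-Entropy (SWBCE), emphasizes both the recall of edge pixels and the suppression of erroneous edge predictions, improving perceptual quality and prediction accuracy together. Trained with SWBCE, the enhanced E-S architecture achieves average improvements of 8.25%, 8.01%, and 33.25% in ODS, OIS, and AP on BIPED2 compared with the baseline models, significantly outperforming the standard E-S method.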

Link: https://arxiv.org/abs/2501.13365
Authors: Hao Shu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Recent advancements have demonstrated the effectiveness of the extractor-selector (E-S) framework in edge detection (ED) tasks, which achieves state-of-the-art (SOTA) performance in both quantitative metrics and perceptual quality. However, this method still falls short of fully exploiting the potential of feature extractors, as selectors only operate on highly compressed feature maps that lack diversity and suffer from substantial information loss. Additionally, while union training can improve perceptual quality, the highest evaluation scores are typically obtained without it, creating a trade-off between quantitative accuracy and perceptual fidelity. To address these limitations, we propose an enhanced E-S architecture, which utilizes richer, less-loss feature representations and incorporates auxiliary features during the selection process, thereby improving the effectiveness of the feature selection mechanism. Additionally, we introduce a novel loss function, the Symmetrization Weight Binary Cross-Entropy (SWBCE), which simultaneously emphasizes both the recall of edge pixels and the suppression of erroneous edge predictions, thereby enhancing the predictions in both perceptual quality and prediction accuracy. The effectiveness and superiority of our approaches over baseline models, the standard E-S framework, and the standard Weight Binary Cross-Entropy (WBCE) loss function are demonstrated by extensive experiments. For example, our enhanced E-S architecture trained with the SWBCE loss function achieves average improvements of 8.25%, 8.01%, and 33.25% in ODS, OIS, and AP, measured on BIPED2 compared with the baseline models, significantly outperforming the standard E-S method. The results set new benchmarks for ED tasks and highlight the potential of the methods beyond them.
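
To make the motivation for SWBCE concrete, the sketch below implements only the baseline the abstract compares against: a class-balanced weighted BCE in the form commonly used for edge detection, where the rare edge pixels receive the larger weight. The exact SWBCE formula is not given in the abstract and is not reproduced; the function name and the weighting convention are assumptions.

```python
import numpy as np

def weighted_bce(pred, target, eps=1e-7):
    """Class-balanced BCE (assumed form of the WBCE baseline).

    Edge pixels are weighted by the non-edge fraction and non-edge pixels
    by the edge fraction, so the few edge pixels are not swamped.
    """
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    target = np.asarray(target, dtype=float)
    edge_frac = target.mean()
    weight = np.where(target > 0.5, 1 - edge_frac, edge_frac)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float((weight * bce).mean())

target = np.array([1.0, 0.0, 0.0, 0.0])        # one edge pixel out of four
missed_edge = np.array([0.1, 0.1, 0.1, 0.1])   # edge pixel not detected
false_edge = np.array([0.9, 0.9, 0.1, 0.1])    # edge found, plus one false edge
loss_missed = weighted_bce(missed_edge, target)
loss_false = weighted_bce(false_edge, target)
```

Under this weighting a missed edge costs much more than a spurious one, which favors recall and tends to produce thick or noisy edges. That asymmetry is the behavior the abstract says SWBCE counteracts by also suppressing erroneous edge predictions.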

[CV-59] A light-weight model to generate NDWI from Sentinel-1

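[Quick read]: This paper addresses the effect of cloud cover on water body area detection when the Normalized Difference Water Index (NDWI) is computed from Sentinel-2 imagery. The authors propose a deep learning model that generates NDWI from Sentinel-1 imagery, thereby overcoming the cloud barrier. The model predicts NDWI with an accuracy of 0.9134 and an AUC of 0.8656, and shows promising results with an R2 score of 0.4984 for regressing NDWI values and a Mean IoU of 0.4139 for the underlying segmentation task. The key is to exploit the microwave nature of Sentinel-1 imagery, which allows NDWI images to be generated even under challenging conditions such as cloud cover and nighttime, for use in applications such as water body detection.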

Link: https://arxiv.org/abs/2501.13357
Authors: Saleh Sakib Ahmed, Saifur Rahman Jony, Md. Toufikuzzaman, Saifullah Sayed, Rashed Uz Zzaman, Sara Nowreen, M. Sohel Rahman
Affiliations: Department of CSE, BUET; IWFM, BUET
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:The use of Sentinel-2 images to compute Normalized Difference Water Index (NDWI) has many applications, including water body area detection. However, cloud cover poses significant challenges in this regard, which hampers the effectiveness of Sentinel-2 images in this context. In this paper, we present a deep learning model that can generate NDWI given Sentinel-1 images, thereby overcoming this cloud barrier. We show the effectiveness of our model, where it demonstrates a high accuracy of 0.9134 and an AUC of 0.8656 to predict the NDWI. Additionally, we observe promising results with an R2 score of 0.4984 (for regressing the NDWI values) and a Mean IoU of 0.4139 (for the underlying segmentation task). In conclusion, our model offers a first and robust solution for generating NDWI images directly from Sentinel-1 images and subsequent use for various applications even under challenging conditions such as cloud cover and nighttime.
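
For context, the target quantity the model learns to predict can be computed from optical bands as shown in this short sketch. It assumes the common McFeeters definition, NDWI = (Green - NIR) / (Green + NIR), which on Sentinel-2 is usually taken from bands B3 and B8; the abstract does not state which variant or threshold the authors use, so the zero threshold below is also an assumption.

```python
import numpy as np

def ndwi(green, nir, eps=1e-12):
    """NDWI = (Green - NIR) / (Green + NIR), computed element-wise.

    eps guards against division by zero on empty pixels.
    """
    green = np.asarray(green, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (green - nir) / (green + nir + eps)

# Toy reflectances: first pixel water-like, second pixel vegetation-like.
green = np.array([0.3, 0.1])
nir = np.array([0.1, 0.3])
index = ndwi(green, nir)
water_mask = index > 0.0   # assumed threshold for a binary water mask
```

Water absorbs near-infrared light strongly, so water pixels get positive values, which is why a binary mask of the index supports the segmentation-style metrics (accuracy, AUC, Mean IoU) reported above.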

[CV-60] NUDT4MSTAR: A New Dataset and Benchmark Towards SAR Target Recognition in the Wild

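[Quick read]: This paper addresses the scarcity of datasets that holds back synthetic aperture radar (SAR) automatic target recognition (ATR) in a data-driven era. Existing SAR datasets are small and cannot support fine-grained target recognition in complex scenes. The paper introduces NUDT4MSTAR, a large-scale SAR dataset for vehicle target recognition in the wild. It covers 40 target types across 5 different scenes and contains over 190,000 images, ten times the size of its predecessors. Each image is annotated in detail with target information and imaging conditions, and the data is provided both as processed magnitude images and in the original complex format. The paper also builds a benchmark of 7 experiments with 15 recognition methods focused on stable and effective ATR, and uses transfer learning experiments to demonstrate the dataset's potential for ground-object ATR more broadly. The authors expect that open-sourcing NUDT4MSTAR will advance SAR ATR and attract more researchers to the field.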

Link: https://arxiv.org/abs/2501.13354
Authors: Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Xuying Xiong, Bowen Peng, Yafei Song, Wei Yang, Tianpeng Liu, Zhen Liu, Xiang Li
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 15 figures; link: this https URL

Abstract:Synthetic Aperture Radar (SAR) stands as an indispensable sensor for Earth observation, owing to its unique capability for all-day imaging. Nevertheless, in a data-driven era, the scarcity of large-scale datasets poses a significant bottleneck to advancing SAR automatic target recognition (ATR) technology. This paper introduces NUDT4MSTAR, a large-scale SAR dataset for vehicle target recognition in the wild, including 40 target types and a wide array of imaging conditions across 5 different scenes. NUDT4MSTAR represents a significant leap forward in dataset scale, containing over 190,000 images, tenfold the size of its predecessors. To enhance the utility of this dataset, we meticulously annotate each image with detailed target information and imaging conditions. We also provide data in both processed magnitude images and original complex formats. Then, we construct a comprehensive benchmark consisting of 7 experiments with 15 recognition methods focusing on the stable and effective ATR issues. Besides, we conduct transfer learning experiments utilizing various models trained on NUDT4MSTAR and applied to three other target datasets, thereby demonstrating its substantial potential for the broader field of ground object ATR. Finally, we discuss this dataset's application value and ATR's significant challenges. To the best of our knowledge, this work marks the first-ever endeavor to create a large-scale dataset benchmark for fine-grained SAR recognition in the wild, featuring an extensive collection of exhaustively annotated vehicle images. We expect that the open source of NUDT4MSTAR will facilitate the development of SAR ATR and attract a wider community of researchers.

[CV-61] Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision

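[Quick read]: This paper addresses the respective limitations of Transformer and Mamba architectures in image super-resolution (SR). Transformers model global context well, but their quadratic computational complexity forces the use of window-based attention, which restricts the receptive field. Mamba has linear computational complexity, so it can avoid windowing and keep a large receptive field, but its hidden state compresses and stores context only approximately, which costs accuracy on long-context dependencies in SR tasks that demand high pixel-level precision. The proposed solution is Contrast, a hybrid SR model that combines Convolutional, Transformer, and State Space components. By integrating transformer and state space mechanisms, it compensates for the shortcomings of each architecture and improves both global context modeling and pixel-level accuracy.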

Link: https://arxiv.org/abs/2501.13353
Authors: Aman Urumbekov, Zheng Chen
Affiliations: Kyrgyz State Technical University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:Transformers have become increasingly popular for image super-resolution (SR) tasks due to their strong global context modeling capabilities. However, their quadratic computational complexity necessitates the use of window-based attention mechanisms, which restricts the receptive field and limits effective context expansion. Recently, the Mamba architecture has emerged as a promising alternative with linear computational complexity, allowing it to avoid window mechanisms and maintain a large receptive field. Nevertheless, Mamba faces challenges in handling long-context dependencies when high pixel-level precision is required, as in SR tasks. This is due to its hidden state mechanism, which can compress and store a substantial amount of context but only in an approximate manner, leading to inaccuracies that transformers do not suffer from. In this paper, we propose Contrast, a hybrid SR model that combines Convolutional, Transformer, and State Space components, effectively blending the strengths of transformers and Mamba to address their individual limitations. By integrating transformer and state space mechanisms, Contrast compensates for the shortcomings of each approach, enhancing both global context modeling and pixel-level accuracy. We demonstrate that combining these two architectures allows us to mitigate the problems inherent in each, resulting in improved performance on image super-resolution tasks.

[CV-62] MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize

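[Quick read]: This paper addresses the high computational cost of traditional diffusion-based generative models when generating high-resolution images. Traditional methods denoise the whole image directly from noisy inputs and ignore the hierarchical structure of visual signals, which makes them computationally expensive. The proposed multi-scale diffusion framework, MSF (Multi-Scale Factorization), generates hierarchical visual representations and splits image generation into two stages: first a low-resolution base signal, then a high-resolution residual signal. The idea borrows from hierarchical decompositions in signal processing, such as Fourier and wavelet analysis: the image signal is divided into several spatial levels, where the low-resolution component carries the primary information and the higher-resolution components add high-frequency details such as texture. With this decomposition, each stage can be modeled with a simpler, lightweight Transformer architecture, which reduces computation. On the ImageNet 256x256 benchmark the method reaches an FID (Fréchet Inception Distance) of 2.2 and an IS (Inception Score) of 255.4 while cutting computational cost by 50% compared to baseline methods.

Link: https://arxiv.org/abs/2501.13349
Authors: Haohang Xu, Longyu Chen, Shuangrui Ding, Yilin Gao, Dongsheng Jiang, Yin Li, Shugong Xu, Junqing Yu, Wei Yang
Affiliations: Huawei Inc.; Huazhong University of Science & Technology; The Chinese University of Hong Kong; Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: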

Abstract: Diffusion-based generative models have achieved remarkable progress in visual content generation. However, traditional diffusion models directly denoise the entire image from noisy inputs, disregarding the hierarchical structure present in visual signals. This method is computationally intensive, especially for high-resolution image generation. Signal processing often leverages hierarchical decompositions; for instance, Fourier analysis decomposes signals by frequency, while wavelet analysis captures localized frequency components, reflecting both spatial and frequency information simultaneously. Inspired by these principles, we propose a multiscale diffusion framework that generates hierarchical visual representations, which are subsequently integrated to form the final output. The diffusion model target, whether raw RGB pixels or latent features from a Variational Autoencoder, is divided into multiple components that each capture distinct spatial levels. The low-resolution component contains the primary informative signal, while higher-resolution components add high-frequency details, such as texture. This approach divides image generation into two stages: producing a low-resolution base signal, followed by a high-resolution residual signal. Both stages can be effectively modeled using simpler, lightweight transformer architectures compared to full-resolution generation. This decomposition is conceptually similar to wavelet decomposition but offers a more streamlined and intuitive design. Our method, termed MSF (short for Multi-Scale Factorization), achieves an FID of 2.2 and an IS of 255.4 on the ImageNet 256x256 benchmark, reducing computational costs by 50% compared to baseline methods.
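The split into a low-resolution base and a high-resolution residual can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the 2x2 average pooling, the nearest-neighbour upsampling, and the helper names `downsample2x`, `upsample2x`, `factorize`, and `reconstruct` are assumptions chosen for illustration, and the abstract does not say which operators MSF actually uses.

```python
def downsample2x(img):
    """Average each non-overlapping 2x2 block (height and width must be even)."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def upsample2x(img):
    """Nearest-neighbour upsampling: repeat every value in a 2x2 block."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def factorize(img):
    """Split an image into a low-resolution base and a full-resolution residual."""
    base = downsample2x(img)
    up = upsample2x(base)
    residual = [[img[r][c] - up[r][c] for c in range(len(img[0]))]
                for r in range(len(img))]
    return base, residual

def reconstruct(base, residual):
    """Add the residual back onto the upsampled base."""
    up = upsample2x(base)
    return [[up[r][c] + residual[r][c] for c in range(len(up[0]))]
            for r in range(len(up))]

image = [[1.0, 3.0, 5.0, 7.0],
         [1.0, 3.0, 5.0, 7.0],
         [2.0, 2.0, 8.0, 8.0],
         [2.0, 2.0, 8.0, 8.0]]
base, residual = factorize(image)
restored = reconstruct(base, residual)
```

In this toy decomposition the base holds a quarter of the pixels and the residual is zero wherever the image is locally flat. According to the abstract, MSF produces the two components in two separate generation stages rather than computing them from a known image.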

[CV-63] YOLOSCM: An improved YOLO algorithm for cars detection

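[Quick read]: This paper targets three challenges of object detection in urban traffic images: 1) the images are extremely large while computational resources are limited; 2) vehicles are small in some scenes, so there is too little information for accurate detection; 3) vehicles are unevenly distributed, so computational resources are used inefficiently. To address them, the paper proposes YOLOSCM (You Only Look Once with Segmentation Clustering Module). Its key innovation is the Segmentation Clustering Module (SCM), which adaptively identifies regions where vehicles cluster so that the model can concentrate on those regions for more precise detection. The paper also proposes a new training strategy to improve detection of small vehicles and densely packed targets in complex urban traffic scenes. Extensive experiments on urban traffic datasets are reported to demonstrate the effectiveness and superiority of the approach.

Link: https://arxiv.org/abs/2501.13343
Authors: Changhui Deng, Lieyang Chen, Shinan Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: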

Abstract:Detecting objects in urban traffic images presents considerable difficulties because of the following reasons: 1) These images are typically immense in size, encompassing millions or even hundreds of millions of pixels, yet computational resources are constrained. 2) The small size of vehicles in certain scenarios leads to insufficient information for accurate detection. 3) The uneven distribution of vehicles causes inefficient use of computational resources. To address these issues, we propose YOLOSCM (You Only Look Once with Segmentation Clustering Module), an efficient and effective framework. To address the challenges of large-scale images and the non-uniform distribution of vehicles, we propose a Segmentation Clustering Module (SCM). This module adaptively identifies clustered regions, enabling the model to focus on these areas for more precise detection. Additionally, we propose a new training strategy to optimize the detection of small vehicles and densely packed targets in complex urban traffic scenes. We perform extensive experiments on urban traffic datasets to demonstrate the effectiveness and superiority of our proposed approach.

[CV-64] Multi-aspect Knowledge Distillation with Large Language Model

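[Quick read]: This paper addresses the difficulty traditional image classification methods have in learning the multiple aspects of a class (for example, natural positions and shape changes) when they rely only on class labels. Such methods mainly modify the model architecture or add features and optimize with cross-entropy loss, so they tend to overlook the complex, multi-dimensional characteristics of classes.

The key to the solution is a multi-aspect knowledge distillation method based on Multimodal Large Language Models (MLLMs). It has three main steps: 1) query the large language model with multi-aspect questions relevant to the knowledge to be transferred; 2) extract the corresponding logits from the MLLM; 3) expand the model's output dimensions to distill these multi-aspect logits. The authors then apply cross-entropy loss to the class logits and binary cross-entropy loss to the multi-aspect logits. In this way the model learns not only visual aspects but also abstract and complex aspects that require deeper understanding.

The experiments show that the method improves the performance of the baselines on image classification, and an extension to other tasks such as object detection illustrates its potential. The paper also analyzes the effect of multi-aspect knowledge distillation and argues that the method can transfer knowledge about various aspects to the model and thereby improve performance on computer vision tasks.

Link: https://arxiv.org/abs/2501.13341
Authors: Taegyeong Lee, Jinsik Bang, Soyeong Kwon, Taehwan Kim
Affiliations: Artificial Intelligence Graduate School, UNIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint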

Abstract: Recent advancements in deep learning have significantly improved performance on computer vision tasks. Previous image classification methods primarily modify model architectures or add features, and they optimize models using cross-entropy loss on class logits. Since they focus on classifying images by considering class labels, these methods may struggle to learn various aspects of classes (e.g., natural positions and shape changes). Rethinking the previous approach from a novel view, we propose a multi-aspect knowledge distillation method using Multimodal Large Language Models (MLLMs). Our approach involves: 1) querying a Large Language Model with multi-aspect questions relevant to the knowledge we want to transfer to the model, 2) extracting corresponding logits from the MLLM, and 3) expanding the model’s output dimensions to distill these multi-aspect logits. We then apply cross-entropy loss to class logits and binary cross-entropy loss to multi-aspect logits. Through our method, the model can learn not only the knowledge about visual aspects but also the abstract and complex aspects that require a deeper understanding. We primarily apply our method to image classification, and to explore the potential for extending our model, we expand it to other tasks, such as object detection. In all experimental results, our method improves the performance of the baselines. Additionally, we analyze the effect of multi-aspect knowledge distillation. These results demonstrate that our method can transfer knowledge about various aspects to the model and the aspect knowledge can enhance model performance in computer vision tasks. This paper demonstrates the great potential of multi-aspect knowledge distillation, and we believe it offers a promising direction for future research in computer vision and beyond.
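The combined objective (cross-entropy on class logits plus binary cross-entropy on the extra multi-aspect logits) might look roughly like the sketch below. Several details are assumptions rather than facts from the paper: that the MLLM's answers are converted into soft targets in [0, 1], that the two terms are simply added with a hypothetical `weight` factor, and all function names.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy between softmax(logits) and a one-hot class label."""
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_partition - logits[target_index]

def binary_cross_entropy_with_logits(logits, soft_targets):
    """Mean binary cross-entropy; each target is a probability in [0, 1]."""
    total = 0.0
    for z, t in zip(logits, soft_targets):
        # Numerically stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def multi_aspect_loss(outputs, num_classes, class_label, aspect_targets, weight=1.0):
    """Split the expanded output vector into class and aspect parts and sum both losses."""
    class_logits = outputs[:num_classes]
    aspect_logits = outputs[num_classes:]
    return (softmax_cross_entropy(class_logits, class_label)
            + weight * binary_cross_entropy_with_logits(aspect_logits, aspect_targets))

# Three class outputs plus two aspect outputs, all logits zero (hypothetical values).
loss = multi_aspect_loss([0.0] * 5, num_classes=3, class_label=1,
                         aspect_targets=[0.9, 0.2])
```

With all logits at zero the class term equals log 3 and the aspect term equals log 2, which makes the example easy to check by hand.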

[CV-65] Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models

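[Quick read]: This paper reveals that retrieval-augmented diffusion models (RDMs) are vulnerable to backdoor attacks and proposes a multimodal contrastive attack named BadRDM. The key idea is to manipulate the contents of the retrieval database and use a malicious variant of contrastive learning to inject a backdoor into the retriever, thereby controlling the generated content. Specifically, the authors insert a small number of images into the retrieval database as target toxicity surrogates and strengthen the attack with entropy-based selection and generative augmentation strategies. The attack is reported to be highly effective while preserving the model's benign utility.

Link: https://arxiv.org/abs/2501.13340
Authors: Hao Fang, Xiaohang Sui, Hongyao Yu, Jiawei Kong, Sijin Yu, Bin Chen, Hao Wu, Shu-Tao Xia
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: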

Abstract:Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models’ generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG’s characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects while preserving the model’s benign utility.

[CV-66] CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph

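[Quick read]: This paper addresses mobile exploration, a longstanding challenge in robotics. Existing methods rely mainly on active perception and neglect active interaction, which limits the robot's ability to interact with and fully explore its environment. Existing exploration approaches that do use active interaction are usually confined to tabletop scenes and do not handle the challenges specific to mobile exploration, such as large exploration spaces, complex action spaces, and diverse object relations.

The key to the solution is a 3D relational object graph that encodes diverse object relations and enables exploration through active interaction. The authors build a system on this representation and evaluate it across a variety of scenes. Qualitative and quantitative results show that the system is more effective and generalizes better than methods that rely solely on vision-language models (VLMs).

Link: https://arxiv.org/abs/2501.13338
Authors: Yixuan Wang, Leonor Fermoselle, Tarik Kelestemur, Jiuguang Wang, Yunzhu Li
Affiliations: Columbia University; Boston Dynamics AI Institute
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Page: this https URL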

Abstract:Mobile exploration is a longstanding challenge in robotics, yet current methods primarily focus on active perception instead of active interaction, limiting the robot’s ability to interact with and fully explore its environment. Existing robotic exploration approaches via active interaction are often restricted to tabletop scenes, neglecting the unique challenges posed by mobile exploration, such as large exploration spaces, complex action spaces, and diverse object relations. In this work, we introduce a 3D relational object graph that encodes diverse object relations and enables exploration through active interaction. We develop a system based on this representation and evaluate it across diverse scenes. Our qualitative and quantitative results demonstrate the system’s effectiveness and generalization capabilities, outperforming methods that rely solely on vision-language models (VLMs).

[CV-67] Gradient-Free Adversarial Purification with Diffusion Models

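[Quick read]: This paper addresses the limitations of existing adversarial defenses against both perturbation-based and unrestricted adversarial attacks. Adversarial training requires additional training, while adversarial purification suffers from low time efficiency. In addition, existing defenses are designed mainly for perturbation-based attacks and are ineffective against unrestricted adversarial attacks.

The key to the proposed solution is the observation that adversarial attacks typically lie near the decision boundary and are sensitive to pixel changes. Based on this, the paper introduces two techniques: adversarial anti-aliasing, which mitigates adversarial modifications, and adversarial super-resolution, which uses prior knowledge from clean datasets to benignly recover images. Neither technique requires additional training, and both are computationally efficient because no gradients are computed. Experiments show that the method outperforms existing adversarial purification methods against both perturbation-based and unrestricted attacks.

Link: https://arxiv.org/abs/2501.13336
Authors: Xuelong Dai, Dong Wang, Duan Mingxing, Bin Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: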

Abstract:Adversarial training and adversarial purification are two effective and practical defense methods to enhance a model’s robustness against adversarial attacks. However, adversarial training necessitates additional training, while adversarial purification suffers from low time efficiency. More critically, current defenses are designed under the perturbation-based adversarial threat model, which is ineffective against the recently proposed unrestricted adversarial attacks. In this paper, we propose an effective and efficient adversarial defense method that counters both perturbation-based and unrestricted adversarial attacks. Our defense is inspired by the observation that adversarial attacks are typically located near the decision boundary and are sensitive to pixel changes. To address this, we introduce adversarial anti-aliasing to mitigate adversarial modifications. Additionally, we propose adversarial super-resolution, which leverages prior knowledge from clean datasets to benignly recover images. These approaches do not require additional training and are computationally efficient without calculating gradients. Extensive experiments against both perturbation-based and unrestricted adversarial attacks demonstrate that our defense method outperforms state-of-the-art adversarial purification methods.

[CV-68] Deblur-Avatar: Animatable Avatars from Motion-Blurred Monocular Videos

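[Quick read]: This paper addresses the modeling of high-fidelity, animatable 3D human avatars from motion-blurred monocular video. Motion blur is common in real-world dynamic video capture, and in 3D human avatar modeling it is caused especially by human movement. Existing methods either assume sharp input images and therefore cannot handle the detail loss caused by motion blur, or mainly consider blur from camera motion and neglect blur from human motion. The key to the proposed solution is to integrate a human-movement-based motion blur model into 3D Gaussian Splatting (3DGS). By explicitly modeling human motion trajectories during the exposure time, the method jointly optimizes the trajectories and the 3D Gaussians to reconstruct sharp, high-quality avatars. A pose-dependent fusion mechanism distinguishes moving body regions so that blurred and sharp areas are both optimized effectively. Experiments show that Deblur-Avatar significantly outperforms existing methods in rendering quality and quantitative metrics and enables real-time rendering under challenging motion blur conditions.

Link: https://arxiv.org/abs/2501.13335
Authors: Xianrui Luo, Juewen Peng, Zhongang Cai, Lei Yang, Fan Yang, Zhiguo Cao, Guosheng Lin
Affiliations: S-Lab, Nanyang Technological University; Nanyang Technological University; Huazhong University of Science and Technology; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: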

Abstract:We introduce Deblur-Avatar, a novel framework for modeling high-fidelity, animatable 3D human avatars from motion-blurred monocular video inputs. Motion blur is prevalent in real-world dynamic video capture, especially due to human movements in 3D human avatar modeling. Existing methods either (1) assume sharp image inputs, failing to address the detail loss introduced by motion blur, or (2) mainly consider blur by camera movements, neglecting the human motion blur which is more common in animatable avatars. Our proposed approach integrates a human movement-based motion blur model into 3D Gaussian Splatting (3DGS). By explicitly modeling human motion trajectories during exposure time, we jointly optimize the trajectories and 3D Gaussians to reconstruct sharp, high-quality human avatars. We employ a pose-dependent fusion mechanism to distinguish moving body regions, optimizing both blurred and sharp areas effectively. Extensive experiments on synthetic and real-world datasets demonstrate that Deblur-Avatar significantly outperforms existing methods in rendering quality and quantitative metrics, producing sharp avatar reconstructions and enabling real-time rendering under challenging motion blur conditions.
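Modeling the motion trajectory during exposure usually rests on a simple image formation idea: a blurred frame is the average of sharp renders taken at several instants within the exposure window. The toy 1D sketch below shows only that idea. Whether Deblur-Avatar uses exactly this averaging is an assumption, and `render_sharp` and `render_blurred` are hypothetical stand-ins for a real 3DGS renderer.

```python
def render_sharp(position, width):
    """Toy 1D renderer: one bright pixel at the given integer position."""
    return [1.0 if i == position else 0.0 for i in range(width)]

def render_blurred(trajectory, width):
    """Average the sharp renders at each sampled position along the trajectory."""
    frames = [render_sharp(p, width) for p in trajectory]
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(width)]

# A point moving across four pixels during the exposure smears into a streak.
blurred = render_blurred([1, 2, 3, 4], width=6)
# A static point stays sharp.
static = render_blurred([2, 2, 2, 2], width=6)
```

Fitting the sampled positions so that the averaged render matches the observed blurry frame is what lets a sharp underlying model be recovered. In the paper, the trajectories and the 3D Gaussians are optimized jointly.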

[CV-69] From Cross-Modal to Mixed-Modal Visible-Infrared Re-Identification

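[Quick read]: This paper addresses the mixed-gallery problem that visible-infrared person re-identification (VI-ReID) faces in practice. Existing VI-ReID methods focus on cross-modality matching, but real galleries often contain both visible (V) and infrared (I) images. In such mixed galleries existing methods perform poorly, mainly because of the domain shift across modalities and the low discrimination in intra-modal matching. The paper introduces a new mixed-modal ReID setting and proposes the Mixed Modality-Erased and -Related (MixER) method. MixER disentangles modality-specific and modality-shared identity information through orthogonal decomposition, modality-confusion, and ID-modality-related objectives, which makes features more robust across modalities and improves performance in both cross-modal and mixed-modal settings. Experiments on the SYSU-MM01, RegDB, and LLMC datasets show state-of-the-art performance and demonstrate the flexibility of the approach in mixed-gallery applications.

Link: https://arxiv.org/abs/2501.13307
Authors: Mahdi Alehdaghi, Rajarshi Bhattacharya, Pourya Shamsolmoali, Rafael M. O. Cruz, Eric Granger
Affiliations: LIVIA, ILLS, Dept. of Systems Engineering, ETS Montreal, Canada; Dept. of Computer Science, University of York, UK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: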

Abstract:Visible-infrared person re-identification (VI-ReID) aims to match individuals across different camera modalities, a critical task in modern surveillance systems. While current VI-ReID methods focus on cross-modality matching, real-world applications often involve mixed galleries containing both V and I images, where state-of-the-art methods show significant performance limitations due to large domain shifts and low discrimination across mixed modalities. This is because gallery images from the same modality may have lower domain gaps but correspond to different identities. This paper introduces a novel mixed-modal ReID setting, where galleries contain data from both modalities. To address the domain shift among inter-modal and low discrimination capacity in intra-modal matching, we propose the Mixed Modality-Erased and -Related (MixER) method. The MixER learning approach disentangles modality-specific and modality-shared identity information through orthogonal decomposition, modality-confusion, and ID-modality-related objectives. MixER enhances feature robustness across modalities, improving cross-modal and mixed-modal settings performance. Our extensive experiments on the SYSU-MM01, RegDB and LLMC datasets indicate that our approach can provide state-of-the-art results using a single backbone, and showcase the flexibility of our approach in mixed gallery applications.

[CV-70] MEDFORM: A Foundation Model for Contrastive Learning of CT Imaging and Clinical Numeric Data in Multi-Cancer Analysis

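[Quick read]: This paper addresses the difficulty of building large-scale multimodal training datasets (CT images plus clinical numeric data) for medical foundation models, which stems from the structural complexity of multi-slice CT data and the high cost of expert annotation. The key to the solution is MEDFORM, a multimodal pre-training strategy that uses complementary information from clinical data to guide CT image representation learning. MEDFORM processes CT slices efficiently with multiple instance learning (MIL) and follows a dual pre-training strategy: it first pre-trains the CT slice feature extractor with SimCLR-based self-supervised learning, then aligns the CT and clinical modalities through cross-modal contrastive learning. The model was pre-trained on three cancer types (lung, breast, and colorectal cancer). The experiments show that the dual pre-training strategy improves cancer classification performance and remains robust in few-shot learning scenarios.

Link: https://arxiv.org/abs/2501.13277
Authors: Daeun Jung, Jaehyeok Jang, Sooyoung Jang, Yu Rang Park
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 1 figure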

Abstract: Computed tomography (CT) and clinical numeric data are essential modalities for cancer evaluation, but building large-scale multimodal training datasets for developing medical foundation models remains challenging due to the structural complexity of multi-slice CT data and high cost of expert annotation. In this study, we propose MEDFORM, a multimodal pre-training strategy that guides CT image representation learning using complementary information from clinical data for medical foundation model development. MEDFORM efficiently processes CT slices through multiple instance learning (MIL) and adopts a dual pre-training strategy: first pretraining the CT slice feature extractor using SimCLR-based self-supervised learning, then aligning CT and clinical modalities through cross-modal contrastive learning. Our model was pre-trained on three different cancer types: lung cancer (141,171 slices), breast cancer (8,100 slices), and colorectal cancer (10,393 slices). The experimental results demonstrated that this dual pre-training strategy improves cancer classification performance and maintains robust performance in few-shot learning scenarios. Code available at this https URL
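The second pre-training step, aligning CT and clinical embeddings with a contrastive objective, can be sketched with a symmetric InfoNCE-style loss of the kind used in CLIP-like models. The abstract does not give the exact loss, so the formulation, the `temperature` value, and the function names below are all assumptions.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def mean_diagonal_nll(similarity_rows):
    """Mean of -log softmax(row_i)[i]: each row should score its own pair highest."""
    total = 0.0
    for i, row in enumerate(similarity_rows):
        peak = max(row)
        log_partition = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_partition - row[i]
    return total / len(similarity_rows)

def cross_modal_contrastive_loss(ct_embeddings, clinical_embeddings, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired CT/clinical embeddings."""
    ct = [l2_normalize(v) for v in ct_embeddings]
    clinical = [l2_normalize(v) for v in clinical_embeddings]
    sim = [[sum(a * b for a, b in zip(u, v)) / temperature for v in clinical]
           for u in ct]
    sim_transposed = [list(column) for column in zip(*sim)]
    return 0.5 * (mean_diagonal_nll(sim) + mean_diagonal_nll(sim_transposed))

ct_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_contrastive_loss(ct_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped = cross_modal_contrastive_loss(ct_batch, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each CT embedding points in the same direction as its own patient's clinical embedding, and large when the pairs are swapped.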

[CV-71] Multimodal AI on Wound Images and Clinical Notes for Home Patient Referral

[Quick read]: This paper addresses inconsistent care for patients with chronic wounds who are treated at home, which arises because visiting nurses have varying levels of wound-care expertise. In particular, referral decisions made in non-clinical settings are often erroneous, delayed, or unnecessary. The paper proposes the Deep Multimodal Wound Assessment Tool (DM-WAT), a machine learning framework that helps visiting nurses decide whether a chronic wound patient should be referred to a wound specialist. The key elements of DM-WAT are: 1) it uses smartphone-captured wound images and clinical notes from Electronic Health Records (EHRs), extracting image features with DeiT-Base-Distilled (a Vision Transformer, ViT) and text features with DeBERTa-base; 2) it combines visual and text features with an intermediate fusion approach; 3) it uses image and text augmentation together with transfer learning to cope with a small, imbalanced dataset. The experiments show that DM-WAT outperforms prior approaches in accuracy and F1 score, and the Score-CAM and Captum interpretation algorithms improve the model's interpretability and trustworthiness.

Link: https://arxiv.org/abs/2501.13247
Authors: Reza Saadati Fard, Emmanuel Agu, Palawat Busaranuvong, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong
Affiliations: Department of Computer Science at Worcester Polytechnic Institute, Worcester, MA 01609, USA; Department of Data Science at Worcester Polytechnic Institute, Worcester, MA 01609, USA; Business School at Worcester Polytechnic Institute, Worcester, MA 01609, USA
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: arXiv admin note: text overlap with arXiv:2208.05051 by other authors

Abstract: Chronic wounds affect 8.5 million Americans, particularly the elderly and patients with diabetes. These wounds can take up to nine months to heal, making regular care essential to ensure healing and prevent severe outcomes like limb amputations. Many patients receive care at home from visiting nurses with varying levels of wound expertise, leading to inconsistent care. Problematic, non-healing wounds should be referred to wound specialists, but referral decisions in non-clinical settings are often erroneous, delayed, or unnecessary. This paper introduces the Deep Multimodal Wound Assessment Tool (DM-WAT), a machine learning framework designed to assist visiting nurses in deciding whether to refer chronic wound patients. DM-WAT analyzes smartphone-captured wound images and clinical notes from Electronic Health Records (EHRs). It uses DeiT-Base-Distilled, a Vision Transformer (ViT), to extract visual features from images and DeBERTa-base to extract text features from clinical notes. DM-WAT combines visual and text features using an intermediate fusion approach. To address challenges posed by a small and imbalanced dataset, it integrates image and text augmentation with transfer learning to achieve high performance. In evaluations, DM-WAT achieved an accuracy of 77% (std 3%) and an F1 score of 70% (std 2%), outperforming prior approaches. Score-CAM and Captum interpretation algorithms provide insights into specific parts of image and text inputs that influence recommendations, enhancing interpretability and trust.

[CV-72] Map Prediction and Generative Entropy for Multi-Agent Exploration

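[Quick read]: This paper addresses how robot teams in autonomous reconnaissance applications can explore and predict scenes in unknown environments more efficiently. Traditional methods act on historical observations. This work instead uses generative technologies to infer a distribution of plausible scene interpretations, filling in the unknown regions of a multi-agent 2D occupancy map during an exploration mission. The key to the solution is a map predictor built on a fine-tuned latent diffusion inpainting model, which produces rich and coherent interpretations of simulated urban environments. By inferring scene interpretations iteratively, the method identifies regions where the prediction is highly uncertain and formalizes this uncertainty as generative entropy. The paper proposes a new task-ranking method that prioritizes regions of high generative entropy, hypothesizing that this speeds up convergence to an accurate predicted map. Compared with the traditional approach of maximizing expected information recovery, the new method predicts a correct scene significantly faster in a simulated urban environment.

Link: https://arxiv.org/abs/2501.13189
Authors: Alexander Spinos, Bradley Woosley, Justin Rokisky, Christopher Korpela, John G. Rogers III, Brian A. Bittner
Affiliations: Johns Hopkins University Applied Physics Lab; DEVCOM Army Research Lab
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: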

Abstract:Traditionally, autonomous reconnaissance applications have acted on explicit sets of historical observations. Aided by recent breakthroughs in generative technologies, this work enables robot teams to act beyond what is currently known about the environment by inferring a distribution of reasonable interpretations of the scene. We developed a map predictor that inpaints the unknown space in a multi-agent 2D occupancy map during an exploration mission. From a comparison of several inpainting methods, we found that a fine-tuned latent diffusion inpainting model could provide rich and coherent interpretations of simulated urban environments with relatively little computation time. By iteratively inferring interpretations of the scene throughout an exploration run, we are able to identify areas that exhibit high uncertainty in the prediction, which we formalize with the concept of generative entropy. We prioritize tasks in regions of high generative entropy, hypothesizing that this will expedite convergence on an accurate predicted map of the scene. In our study we juxtapose this new paradigm of task ranking with the state of the art, which ranks regions to explore by those which maximize expected information recovery. We compare both of these methods in a simulated urban environment with three vehicles. Our results demonstrate that by using our new task ranking method, we can predict a correct scene significantly faster than with a traditional information-guided method.
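One plausible reading of generative entropy is the per-cell Shannon entropy of occupancy across several sampled map completions. The sketch below follows that reading. The paper's exact definition may differ, and the function names and the tiny 2x2 sample maps are hypothetical.

```python
import math

def generative_entropy(samples):
    """Per-cell binary entropy (in bits) across sampled 0/1 occupancy grids."""
    count = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    entropy = []
    for r in range(rows):
        entropy_row = []
        for c in range(cols):
            p = sum(sample[r][c] for sample in samples) / count
            if p in (0.0, 1.0):
                entropy_row.append(0.0)  # all samples agree: no uncertainty
            else:
                entropy_row.append(-(p * math.log2(p) + (1 - p) * math.log2(1 - p)))
        entropy.append(entropy_row)
    return entropy

def most_uncertain_cell(entropy):
    """Return the (row, col) of the cell with the highest entropy."""
    cells = [(value, (r, c))
             for r, row in enumerate(entropy)
             for c, value in enumerate(row)]
    return max(cells)[1]

# Four hypothetical inpainted completions of the same 2x2 map.
samples = [
    [[1, 0], [1, 0]],
    [[1, 0], [0, 0]],
    [[1, 0], [1, 1]],
    [[1, 0], [0, 0]],
]
entropy = generative_entropy(samples)
target = most_uncertain_cell(entropy)
```

Cells where the completions disagree most (here the bottom-left cell, occupied in half of the samples) get the highest entropy, so a task ranking based on this score would send a vehicle there first.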

[CV-73] MONA: Moving Object Detection from Videos Shot by Dynamic Camera

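[Quick read]: This paper addresses camera trajectory estimation in dynamic urban environments, where both the camera and objects move and it is hard to separate camera-induced motion from object motion. The authors propose MONA, a framework with two key modules: Dynamic Points Extraction and Moving Object Segmentation. Dynamic Points Extraction uses optical flow and tracking any point to identify dynamic points. Moving Object Segmentation uses adaptive bounding box filtering and Segment Anything to segment moving objects precisely. Integrated with the camera trajectory estimation method LEAP-VO, MONA outperforms existing methods on the MPI Sintel dataset, which supports its effectiveness for moving object detection and its potential for applications such as urban planning.

Link: https://arxiv.org/abs/2501.13183
Authors: Boxun Hu, Mingze Xia, Ding Zhao, Guanlin Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: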

Abstract: Dynamic urban environments, characterized by moving cameras and objects, pose significant challenges for camera trajectory estimation by complicating the distinction between camera-induced and object motion. We introduce MONA, a novel framework designed for robust moving object detection and segmentation from videos shot by dynamic cameras. MONA comprises two key modules: Dynamic Points Extraction, which leverages optical flow and tracking any point to identify dynamic points, and Moving Object Segmentation, which employs adaptive bounding box filtering and Segment Anything for precise moving object segmentation. We validate MONA by integrating it with the camera trajectory estimation method LEAP-VO, and it achieves state-of-the-art results on the MPI Sintel dataset compared to existing methods. These results demonstrate MONA’s effectiveness for moving object detection and its potential in many other applications in the urban planning field.

[CV-74] Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi

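[Quick read]: This paper examines the different ways convolutional neural networks (CNNs) and vision transformer (ViT) architectures solve complex classification tasks and argues that a unified learning mechanism underlies both. By quantitatively measuring the single-nodal performance (SNP) of each node in the feedforward (FF) and multi-head attention (MHA) subblocks, the authors find that CNNs and ViTs stem from the same learning mechanism: each node identifies small clusters of possible output labels, and the signal-to-noise ratio is progressively enhanced along the transformer encoders. Two main findings follow. First, the mechanism enables an efficient applied nodal diagonal connection (ANDC) pruning technique that does not affect accuracy. Second, based on the SNP, spontaneous symmetry breaking occurs among the MHA heads, so that each head, through cooperation among its SNPs, specializes in recognizing a particular subset of labels, which constitutes a quantitative MHA modus vivendi mechanism. The findings are based on a compact convolutional transformer architecture trained on the CIFAR-100 and Flowers-102 datasets, and the authors call for extending them to other architectures and applications such as natural language processing.

Link: https://arxiv.org/abs/2501.12900
Authors: Ella Koresh, Ronit D. Gross, Yuval Meir, Yarden Tzach, Tal Halevi, Ido Kanter
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 9 figures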

Abstract:Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers, whereas vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. Both are designed to solve complex classification tasks but from different perspectives. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism, which quantitatively measures the single-nodal performance (SNP) of each node in feedforward (FF) and multi-head attention (MHA) subblocks. Each node identifies small clusters of possible output labels, with additional noise represented as labels outside these clusters. These features are progressively sharpened along the transformer encoders, enhancing the signal-to-noise ratio. This unified underlying learning mechanism leads to two main findings. First, it enables an efficient applied nodal diagonal connection (ANDC) pruning technique without affecting the accuracy. Second, based on the SNP, spontaneous symmetry breaking occurs among the MHA heads, such that each head focuses its attention on a subset of labels through cooperation among its SNPs. Consequently, each head becomes an expert in recognizing its designated labels, representing a quantitative MHA modus vivendi mechanism. These results are based on a compact convolutional transformer architecture trained on the CIFAR-100 and Flowers-102 datasets and call for their extension to other architectures and applications, such as natural language processing.

[CV-75] On Disentangled Training for Nonlinear Transform in Learned Image Compression ICLR2025

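[Quick read]: This paper addresses the training inefficiency of learned image compression (LIC) models, in particular the slow convergence caused by energy compaction in learning nonlinear transforms. Existing methods overlook the two components of energy compaction: feature decorrelation and uneven energy modulation. The paper proposes a linear auxiliary transform (AuxT) that disentangles energy compaction and thereby simplifies the distribution fitting left to the nonlinear transforms. Specifically, AuxT uses wavelet-based linear shortcuts (WLSs) to perform feature decorrelation and subband-aware scaling, which speeds up training convergence. Experiments show that the method accelerates the training of LIC models by a factor of 2 while reducing BD-rate by 1% on average. AuxT is a lightweight, plug-and-play module that improves LIC training efficiency while keeping rate-distortion performance comparable or superior.

Link: https://arxiv.org/abs/2501.13751
Authors: Han Li, Shaohui Li, Wenrui Dai, Maida Cao, Nuowen Kan, Chenglin Li, Junni Zou, Hongkai Xiong
Affiliations: Shanghai Jiao Tong University; Tsinghua Shenzhen International Graduate School, Tsinghua University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR2025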

Abstract: Learned image compression (LIC) has demonstrated superior rate-distortion (R-D) performance compared to traditional codecs, but is challenged by training inefficiency that could incur more than two weeks to train a state-of-the-art model from scratch. Existing LIC methods overlook the slow convergence caused by compacting energy in learning nonlinear transforms. In this paper, we first reveal that such energy compaction consists of two components, i.e., feature decorrelation and uneven energy modulation. On such basis, we propose a linear auxiliary transform (AuxT) to disentangle energy compaction in training nonlinear transforms. The proposed AuxT obtains coarse approximation to achieve efficient energy compaction such that distribution fitting with the nonlinear transforms can be simplified to fine details. We then develop wavelet-based linear shortcuts (WLSs) for AuxT that leverage wavelet-based downsampling and orthogonal linear projection for feature decorrelation and subband-aware scaling for uneven energy modulation. AuxT is lightweight and plug-and-play to be integrated into diverse LIC models to address the slow convergence issue. Experimental results demonstrate that the proposed approach can accelerate training of LIC models by 2 times and simultaneously achieves an average 1% BD-rate reduction. To the best of our knowledge, this is one of the first successful attempts that can significantly improve the convergence of LIC with comparable or superior rate-distortion performance. Code will be released at this https URL
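To make the wavelet-based downsampling and subband-aware scaling more concrete, here is a minimal sketch of a one-level orthonormal 2D Haar transform followed by per-subband scaling. It illustrates the general technique only: the abstract does not specify which wavelet the WLSs use, the scale values are hypothetical, and the subband labels follow one of several naming conventions.

```python
def haar_downsample(img):
    """One-level orthonormal 2D Haar transform into four half-resolution subbands."""
    h, w = len(img), len(img[0])
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, h, 2):
        rows = {name: [] for name in bands}
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            rows["LL"].append((a + b + d + e) / 2.0)  # local average
            rows["LH"].append((a - b + d - e) / 2.0)  # left minus right column
            rows["HL"].append((a + b - d - e) / 2.0)  # top minus bottom row
            rows["HH"].append((a - b - d + e) / 2.0)  # diagonal difference
        for name in bands:
            bands[name].append(rows[name])
    return bands

def energy(img):
    """Sum of squared values."""
    return sum(v * v for row in img for v in row)

def scale_subbands(bands, scales):
    """Multiply every subband by its own scale factor."""
    return {name: [[scales[name] * v for v in row] for row in band]
            for name, band in bands.items()}

image = [[4.0, 2.0, 1.0, 1.0],
         [2.0, 0.0, 1.0, 1.0]]
bands = haar_downsample(image)
scaled = scale_subbands(bands, {"LL": 1.0, "LH": 0.5, "HL": 0.5, "HH": 0.25})
```

Because the transform is orthonormal, the total energy of the four subbands equals the energy of the input (28 here), and most of it (20) lands in `LL`. That concentration is the energy compaction effect the abstract refers to.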

[CV-76] Variational U-Net with Local Alignment for Joint Tumor Extraction and Registration (VALOR-Net) of Breast MRI Data Acquired at Two Different Field Strengths

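[Quick read]: This paper addresses accurate alignment and tumor segmentation of multiparametric breast MRI (Magnetic Resonance Imaging) data acquired at different field strengths, such as 3T and 7T. Because images acquired at different field strengths differ, aligning them and delineating tumors accurately remains a challenging research task. The key to the proposed solution is a joint image registration and tumor segmentation method that makes tumor segmentation consistent across field strengths. The data were acquired with a post-contrast T1-weighted three-dimensional time-resolved angiography with stochastic trajectories (TWIST) sequence, and the method was evaluated with several quantitative metrics (PSNR, SSIM, NCC, Dice coefficient, F1 score, and rel SSD). Initial results suggest that joint registration and segmentation of MRI data acquired at different field strengths is feasible with this method.

Link: https://arxiv.org/abs/2501.13690
Authors: Muhammad Shahkar Khan, Haider Ali, Laura Villazan Garcia, Noor Badshah, Siegfried Trattnig, Florian Schwarzhans, Ramona Woitek, Olgica Zaric
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: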

Abstract: Background: Multiparametric breast MRI data might improve tumor diagnostics, characterization, and treatment planning. Accurate alignment and delineation of images acquired at different field strengths, such as 3T and 7T, remain challenging research tasks. Purpose: To address alignment challenges and enable consistent tumor segmentation across different MRI field strengths. Study type: Retrospective. Subjects: Nine female subjects with breast tumors were involved: six histologically proven invasive ductal carcinomas (IDC) and three fibroadenomas. Field strength/sequence: Imaging was performed at 3T and 7T scanners using post-contrast T1-weighted three-dimensional time-resolved angiography with stochastic trajectories (TWIST) sequence. Assessments: The method's performance for joint image registration and tumor segmentation was evaluated using several quantitative metrics, including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), normalized cross-correlation (NCC), Dice coefficient, F1 score, and relative sum of squared differences (rel SSD). Statistical tests: The Pearson correlation coefficient was used to test the relationship between the registration and segmentation metrics. Results: When calculated for each subject individually, the PSNR was in a range from 27.5 to 34.5 dB, and the SSIM was from 82.6 to 92.8%. The model achieved an NCC from 96.4 to 99.3% and a Dice coefficient of 62.9 to 95.3%. The F1 score was between 55.4 and 93.2% and the rel SSD was in the range of 2.0 to 7.5%. The segmentation metrics Dice and F1 score are highly correlated (0.995), while a moderate correlation between NCC and SSIM (0.681) was found for registration. Data conclusion: Initial results demonstrate that the proposed method may be feasible in providing joint tumor segmentation and registration of MRI data acquired at different field strengths.
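Two of the metrics named in the abstract, the Dice coefficient and PSNR, have standard textbook definitions that can be sketched in a few lines. The snippet below is a minimal illustration only: it assumes flat Python lists rather than 3D volumes, the function names are hypothetical, and it is not the evaluation code used by the authors.

```python
import math


def dice_coefficient(pred, target):
    """Dice = 2 * |A & B| / (|A| + |B|) for two flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat intensity lists."""
    if len(reference) != len(test):
        raise ValueError("images must have the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: one overlapping voxel, predicted mask has 2 voxels, true mask has 1.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
print(dice_coefficient(pred_mask, true_mask))  # 2*1 / (2+1)
print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```

In practice a registration pipeline would compute these per subject on the warped image and the predicted mask, which is how the per-subject ranges in the abstract would arise.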

[CV-77] Enhancing Medical Image Analysis through Geometric and Photometric transformations

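[Quick Read]: This paper addresses the shortage of labeled data in medical image analysis caused by patient privacy constraints and a lack of experts. The key to the solution is data augmentation, which improves model performance and enlarges the dataset. Specifically, the authors apply traditional augmentation techniques to a skin cancer dataset, raising the test accuracy of a convolutional neural network (CNN) from 90.74% to 96.88% and lowering the test loss from 0.7921 to 0.1468. On a retina and blood vessels dataset, they apply Mixup, mixing two random images and their corresponding masks, and train a U-net model whose Dice coefficient rises from 0 to 0.4163. These results indicate that data augmentation can improve performance on both classification and segmentation tasks.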

Link: https://arxiv.org/abs/2501.13643
Authors: Khadija Rais, Mohamed Amroune, Mohamed Yassine Haouam
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract: Medical image analysis suffers from a lack of labeled data due to several challenges, including patient privacy and a lack of experts. Because some AI models perform well only with large amounts of data, we turn to data augmentation, which can improve model performance and increase dataset size through traditional or advanced techniques. In this paper, we evaluate the effectiveness of data augmentation techniques on two different medical image datasets. In the first step, we applied some transformation techniques to the skin cancer dataset containing benign and malignant classes. Then, we trained a convolutional neural network (CNN) on the dataset before and after augmentation; augmentation improved test accuracy from 90.74% to 96.88% and decreased test loss from 0.7921 to 0.1468. In the second step, we used the Mixup technique by mixing two random images and their corresponding masks using the retina and blood vessels dataset, then we trained the U-net model and obtained the Dice coefficient, which increased from 0 before augmentation to 0.4163 after augmentation. The result shows the effect of using data augmentation to increase the dataset size on the classification and segmentation performance.
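The Mixup step described above, mixing two random images together with their masks, can be sketched as follows. This is a minimal illustration under stated assumptions: images and masks are flat lists of floats, the mixing weight is drawn from a Beta(alpha, alpha) distribution as in the usual Mixup formulation, and `alpha=0.4` and the function name `mixup_pair` are hypothetical choices, since the abstract does not give the authors' settings.

```python
import random


def mixup_pair(image_a, mask_a, image_b, mask_b, alpha=0.4, rng=random):
    """Blend two images and their masks with a single shared weight lam.

    lam is sampled from Beta(alpha, alpha); the same lam is applied to the
    images and to the masks so that the mixed mask stays aligned with the
    mixed image.
    """
    lam = rng.betavariate(alpha, alpha)

    def mix(x, y):
        return [lam * u + (1.0 - lam) * v for u, v in zip(x, y)]

    return mix(image_a, image_b), mix(mask_a, mask_b), lam


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mixed_image, mixed_mask, lam = mixup_pair(
    [0.0, 1.0], [0.0, 1.0],  # first image and its mask
    [1.0, 1.0], [1.0, 1.0],  # second image and its mask
    rng=rng,
)
print(lam, mixed_image, mixed_mask)
```

Note that the mixed mask is soft (values between 0 and 1), so a segmentation loss that accepts soft targets is needed when training on such pairs.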

[CV-78] Self-Supervised Diffusion MRI Denoising via Iterative and Stable Refinement

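[Quick Read]: This paper addresses the compromises in temporal or spatial resolution that magnetic resonance imaging (MRI), and diffusion MRI (dMRI) in particular, makes to cope with low signal-to-noise ratio (SNR) scans; these compromises fail to meet clinical demands for efficiency and precision. Denoising therefore becomes a key preprocessing step, especially for dMRI, where clean data is hard to obtain. The proposed solution is Di-Fusion, a fully self-supervised denoising method whose key innovation is to leverage the latter diffusion steps and an adaptive sampling process. Unlike previous methods, Di-Fusion trains efficiently and stably within a single-stage framework, needs no additional noise-model training, and offers adaptive and controllable results during sampling. Experiments show that Di-Fusion achieves state-of-the-art performance in microstructure modeling, tractography, and other downstream tasks.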

Link: https://arxiv.org/abs/2501.13514
Authors: Chenxu Wu, Qingpeng Kong, Zihang Jiang, S. Kevin Zhou
Affiliations: MIRACLE, Suzhou Institute for Advance Research; University of Science and Technology of China (USTC)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 39 pages, 34 figures

Click to view abstract

Abstract: Magnetic Resonance Imaging (MRI), including diffusion MRI (dMRI), serves as a "microscope" for anatomical structures and routinely mitigates the influence of low signal-to-noise ratio scans by compromising temporal or spatial resolution. However, these compromises fail to meet clinical demands for both efficiency and precision. Consequently, denoising is a vital preprocessing step, particularly for dMRI, where clean data is unavailable. In this paper, we introduce Di-Fusion, a fully self-supervised denoising method that leverages the latter diffusion steps and an adaptive sampling process. Unlike previous approaches, our single-stage framework achieves efficient and stable training without extra noise model training and offers adaptive and controllable results in the sampling process. Our thorough experiments on real and simulated data demonstrate that Di-Fusion achieves state-of-the-art performance in microstructure modeling, tractography tracking, and other downstream tasks.

[CV-79] Scalable Evaluation Framework for Foundation Models in Musculoskeletal MRI Bridging Computational Innovation with Clinical Utility

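[Quick Read]: This paper addresses the clinical application of foundation models in medical imaging, in particular how to evaluate their practical clinical impact and translatability. Using musculoskeletal MRI as a case study, the study introduces an evaluation framework and assesses SAM, MedSAM, and SAM2 under zero-shot and finetuned paradigms, focusing on their ability to handle diverse anatomical structures and to produce clinically reliable biomarkers such as cartilage thickness, muscle volume, and disc height. The key to the solution is a modular pipeline that emphasizes scalability, clinical relevance, and workflow integration, reduces manual effort, and aligns validation with end-user expectations. In addition, hierarchical modeling reveals how dataset mixing, anatomical complexity, and MRI acquisition parameters affect model performance, offering insight into how imaging refinements can improve segmentation accuracy. By combining interdisciplinary collaboration with the alignment of technical innovation and clinical priorities, the framework provides a roadmap for turning machine learning technologies into scalable and impactful biomedical solutions.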

Link: https://arxiv.org/abs/2501.13376
Authors: Gabrielle Hoyer, Michelle W Tong, Rupsa Bhattacharjee, Valentina Pedoia, Sharmila Majumdar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Foundation models hold transformative potential for medical imaging, but their clinical utility requires rigorous evaluation to address their strengths and limitations. This study introduces an evaluation framework for assessing the clinical impact and translatability of SAM, MedSAM, and SAM2, using musculoskeletal MRI as a case study. We tested these models across zero-shot and finetuned paradigms to assess their ability to process diverse anatomical structures and effectuate clinically reliable biomarkers, including cartilage thickness, muscle volume, and disc height. We engineered a modular pipeline emphasizing scalability, clinical relevance, and workflow integration, reducing manual effort and aligning validation with end-user expectations. Hierarchical modeling revealed how dataset mixing, anatomical complexity, and MRI acquisition parameters influence performance, providing insights into the role of imaging refinements in improving segmentation accuracy. This work demonstrates how clinically focused evaluations can connect computational advancements with tangible applications, creating a pathway for foundation models to address medical challenges. By emphasizing interdisciplinary collaboration and aligning technical innovation with clinical priorities, our framework provides a roadmap for advancing machine learning technologies into scalable and impactful biomedical solutions.

[CV-80] Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization

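[Quick Read]: This paper addresses the limited generalizability of existing medical image analysis methods in clinical settings. Existing methods are usually designed for a specific modality and resolution and assume images from healthy subjects, so their performance degrades under different scan parameters, resolutions, and orientations, and when pathology is present. The paper proposes UNA (Unraveling Normal Anatomy), a modality-agnostic learning approach that reconstructs normal brain anatomy and handles both healthy scans and cases with pathology. Its key components are: 1) a fluid-driven anomaly randomization method that generates a large number of realistic pathology profiles on the fly; 2) training on a combination of synthetic and real data, so that the model can be applied directly to real images with potential pathology without fine-tuning. By bridging the gap between healthy and diseased images, UNA enables general-purpose models to be used for large-scale analysis of diseased images, opening new possibilities for analyzing uncurated clinical images.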

Link: https://arxiv.org/abs/2501.13370
Authors: Peirong Liu, Ana Lawry Aguila, Juan E. Iglesias
Affiliations: Harvard Medical School and Massachusetts General Hospital; UCL; MIT
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures

Click to view abstract

Abstract:Data-driven machine learning has made significant strides in medical image analysis. However, most existing methods are tailored to specific modalities and assume a particular resolution (often isotropic). This limits their generalizability in clinical settings, where variations in scan appearance arise from differences in sequence parameters, resolution, and orientation. Furthermore, most general-purpose models are designed for healthy subjects and suffer from performance degradation when pathology is present. We introduce UNA (Unraveling Normal Anatomy), the first modality-agnostic learning approach for normal brain anatomy reconstruction that can handle both healthy scans and cases with pathology. We propose a fluid-driven anomaly randomization method that generates an unlimited number of realistic pathology profiles on-the-fly. UNA is trained on a combination of synthetic and real data, and can be applied directly to real images with potential pathology without the need for fine-tuning. We demonstrate UNA’s effectiveness in reconstructing healthy brain anatomy and showcase its direct application to anomaly detection, using both simulated and real images from 3D healthy and stroke datasets, including CT and MRI scans. By bridging the gap between healthy and diseased images, UNA enables the use of general-purpose models on diseased images, opening up new opportunities for large-scale analysis of uncurated clinical images in the presence of pathology. Code is available at this https URL.

[CV-81] Polyhedra Encoding Transformers: Enhancing Diffusion MRI Analysis Beyond Voxel and Volumetric Embedding

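[Quick Read]: This paper addresses the limitations of conventional deep learning methods for analyzing diffusion-weighted magnetic resonance imaging (dMRI) data, in particular their failure to account for the unique distribution of different gradient encodings. Conventional methods typically use pixel-level or volumetric patch-level embeddings, as in structural MRI, and cannot fully capture the spherical nature of dMRI signals. The paper therefore proposes the Polyhedra Encoding Transformer (PE-Transformer). Its novelty lies in projecting an icosahedral polygon onto the unit sphere, resampling signals from predetermined directions, and converting these signals into embeddings, which are then processed by a transformer encoder that incorporates the orientational information of the icosahedral structure. Experimental validation shows that the method estimates multi-compartment models and fiber orientation distributions (FOD) more accurately than conventional convolutional neural network (CNN) architectures and standard transformers.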

Link: https://arxiv.org/abs/2501.13352
Authors: Tianyuan Yao, Zhiyuan Li, Praitayini Kanakaraj, Derek B. Archer, Kurt Schilling, Lori Beason-Held, Susan Resnick, Bennett A. Landman, Yuankai Huo
Affiliations: Department of Computer Science, Vanderbilt University; Department of Electrical and Computer Engineering, Vanderbilt University; Department of Neurology, Vanderbilt University Medical Center; Department of Biomedical Engineering, Vanderbilt University; Laboratory of Behavioral Neuroscience, National Institute on Aging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Diffusion-weighted Magnetic Resonance Imaging (dMRI) is an essential tool in neuroimaging. It is arguably the sole noninvasive technique for examining the microstructural properties and structural connectivity of the brain. Recent years have seen the emergence of machine learning and data-driven approaches that enhance the speed, accuracy, and consistency of dMRI data analysis. However, traditional deep learning models often fell short, as they typically utilize pixel-level or volumetric patch-level embeddings similar to those used in structural MRI, and do not account for the unique distribution of various gradient encodings. In this paper, we propose a novel method called Polyhedra Encoding Transformer (PE-Transformer) for dMRI, designed specifically to handle spherical signals. Our approach involves projecting an icosahedral polygon onto a unit sphere to resample signals from predetermined directions. These resampled signals are then transformed into embeddings, which are processed by a transformer encoder that incorporates orientational information reflective of the icosahedral structure. Through experimental validation with various gradient encoding protocols, our method demonstrates superior accuracy in estimating multi-compartment models and Fiber Orientation Distributions (FOD), outperforming both conventional CNN architectures and standard transformers.
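The first step of the approach, placing an icosahedron on the unit sphere to obtain a fixed set of sampling directions, can be sketched with standard geometry. The snippet below only generates the 12 vertex directions of a regular icosahedron, normalized to unit length; it assumes the common golden-ratio vertex construction and uses a hypothetical function name. How the paper resamples dMRI signals at these directions, and whether it subdivides the icosahedron further, is not specified in the abstract and is not modeled here.

```python
import math


def icosahedron_directions():
    """Return the 12 vertices of a regular icosahedron projected onto the unit sphere.

    Uses the standard construction from cyclic permutations of (0, +-1, +-phi),
    where phi is the golden ratio.
    """
    phi = (1.0 + math.sqrt(5.0)) / 2.0
    raw = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            # Three cyclic permutations of (0, a, b).
            raw.extend([(0.0, a, b), (a, b, 0.0), (b, 0.0, a)])
    norm = math.sqrt(1.0 + phi * phi)  # every raw vertex has this length
    return [(x / norm, y / norm, z / norm) for x, y, z in raw]


directions = icosahedron_directions()
for d in directions:
    print(tuple(round(c, 4) for c in d))
```

Each direction could then serve as a resampling target on the sphere, with one embedding token per direction fed to the transformer encoder, which is the role the abstract assigns to the predetermined directions.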

[CV-82] Revisiting Data Augmentation for Ultrasound Images

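[Quick Read]: This paper addresses the underuse of data augmentation in medical image processing, particularly in ultrasound image analysis tasks. Although data augmentation is a widely used and effective technique for improving the generalization of deep neural networks, it is often underutilized on medical images, even though data availability there is limited. The paper fills this gap by analyzing the effectiveness of different augmentation techniques across a range of ultrasound image analysis tasks. The key to the solution is a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources, covering 11 body regions. The results show that many augmentations commonly used for natural-image tasks are also effective on ultrasound images, in some cases more so than augmentations developed specifically for ultrasound. The paper also shows that TrivialAugment, which is widely used for natural images, is effective on ultrasound images, and proposes a structured methodology for assessing data augmentations that can be applied to other contexts and modalities.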

Link: https://arxiv.org/abs/2501.13193
Authors: Adam Tupper, Christian Gagné
Affiliations: Institut Intelligence et Données (IID), Université Laval; Mila
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: For associated source code see this https URL

Click to view abstract

Abstract:Data augmentation is a widely used and effective technique to improve the generalization performance of deep neural networks. Yet, despite often facing limited data availability when working with medical images, it is frequently underutilized. This appears to come from a gap in our collective understanding of the efficacy of different augmentation techniques across different tasks and modalities. One modality where this is especially true is ultrasound imaging. This work addresses this gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To achieve this, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources and covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for tasks on natural images are also effective on ultrasound images, even more so than augmentations developed specifically for ultrasound images in some cases. We also show that diverse augmentation using TrivialAugment, which is widely used for natural images, is also effective for ultrasound images. Moreover, our proposed methodology represents a structured approach for assessing various data augmentations that can be applied to other contexts and modalities.

Artificial Intelligence

[AI-0] Autoencoders for Anomaly Detection are Unreliable

Link: https://arxiv.org/abs/2501.13864
Authors: Roel Bouman, Tom Heskes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Autoencoders are frequently used for anomaly detection, both in the unsupervised and semi-supervised settings. They rely on the assumption that when trained using the reconstruction loss, they will be able to reconstruct normal data more accurately than anomalous data. Some recent works have posited that this assumption may not always hold, but little has been done to study the validity of the assumption in theory. In this work we show that this assumption indeed does not hold, and illustrate that anomalies, lying far away from normal data, can be perfectly reconstructed in practice. We revisit the theory of failure of linear autoencoders for anomaly detection by showing how they can perfectly reconstruct out of bounds, or extrapolate undesirably, and note how this can be dangerous in safety critical applications. We connect this to non-linear autoencoders through experiments on both tabular data and real-world image data, the two primary application areas of autoencoders for anomaly detection.
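The claim that a linear autoencoder can perfectly reconstruct anomalies far from the normal data can be illustrated with a toy example. The sketch below assumes that a trained linear autoencoder with a one-dimensional bottleneck behaves like a projection onto the first principal direction of the training data (a well-known property of linear autoencoders at their optimum); the data, function names, and the use of power iteration are illustrative choices, not taken from the paper.

```python
import math


def fit_first_principal_direction(points, iterations=200):
    """Estimate the mean and first principal direction of 2-D points via power iteration."""
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    v = (1.0, 0.5)  # arbitrary start vector, not orthogonal to the main direction
    for _ in range(iterations):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return mean, v


def reconstruction_error(point, mean, direction):
    """Encode to a 1-D code, decode back, and return the Euclidean error."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    code = dx * direction[0] + dy * direction[1]  # linear encoder
    rx = mean[0] + code * direction[0]            # linear decoder
    ry = mean[1] + code * direction[1]
    return math.hypot(point[0] - rx, point[1] - ry)


# "Normal" data scattered around the line y = x.
normal = [(-1.0, -1.1), (-0.5, -0.4), (0.0, 0.1), (0.5, 0.4), (1.0, 1.1)]
mean, direction = fit_first_principal_direction(normal)

# An anomaly 1000 units away, but lying on the learned subspace.
far_anomaly = (mean[0] + 1000.0 * direction[0], mean[1] + 1000.0 * direction[1])
# A point close to the data, but off the learned subspace.
off_axis_point = (0.5, -0.5)

far_error = reconstruction_error(far_anomaly, mean, direction)
off_axis_error = reconstruction_error(off_axis_point, mean, direction)
print(far_error, off_axis_error)
```

Under these assumptions the distant point receives a near-zero reconstruction error, lower than that of the nearby off-axis point, so a detector that thresholds reconstruction error would fail to flag it.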

[AI-1] Learning to Help in Multi-Class Settings ICLR2025

Link: https://arxiv.org/abs/2501.13810
Authors: Yu Wu, Yansong Li, Zeyu Dong, Nitya Sathyavageeswaran, Anand D. Sarwate
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, 7 figures, conference, ICLR 2025

Click to view abstract

Abstract:Deploying complex machine learning models on resource-constrained devices is challenging due to limited computational power, memory, and model retrainability. To address these limitations, a hybrid system can be established by augmenting the local model with a server-side model, where samples are selectively deferred by a rejector and then sent to the server for processing. The hybrid system enables efficient use of computational resources while minimizing the overhead associated with server usage. The recently proposed Learning to Help (L2H) model trains a server model given a fixed local (client) model, differing from the Learning to Defer (L2D) framework, which trains the client for a fixed (expert) server. In both L2D and L2H, the training includes learning a rejector at the client to determine when to query the server. In this work, we extend the L2H model from binary to multi-class classification problems and demonstrate its applicability in a number of different scenarios of practical interest in which access to the server may be limited by cost, availability, or policy. We derive a stage-switching surrogate loss function that is differentiable, convex, and consistent with the Bayes rule corresponding to the 0-1 loss for the L2H model. Experiments show that our proposed methods offer an efficient and practical solution for multi-class classification in resource-constrained environments.

[AI-2] Defending against Adversarial Malware Attacks on ML-based Android Malware Detection Systems

Link: https://arxiv.org/abs/2501.13782
Authors: Ping He, Lorenzo Cavallaro, Shouling Ji
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Android malware presents a persistent threat to users’ privacy and data integrity. To combat this, researchers have proposed machine learning-based (ML-based) Android malware detection (AMD) systems. However, adversarial Android malware attacks compromise the detection integrity of the ML-based AMD systems, raising significant concerns. Existing defenses against adversarial Android malware provide protections against feature space attacks which generate adversarial feature vectors only, leaving protection against realistic threats from problem space attacks which generate real adversarial malware an open problem. In this paper, we address this gap by proposing ADD, a practical adversarial Android malware defense framework designed as a plug-in to enhance the adversarial robustness of the ML-based AMD systems against problem space attacks. Our extensive evaluation across various ML-based AMD systems demonstrates that ADD is effective against state-of-the-art problem space adversarial Android malware attacks. Additionally, ADD shows the defense effectiveness in enhancing the adversarial robustness of real-world antivirus solutions.

[AI-3] Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling

Link: https://arxiv.org/abs/2501.13779
Authors: Tanya Rodchenko, Natasha Noy, Nino Scherrer, Jennifer Prendki
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While Large Language Models require more and more data to train and scale, rather than looking for any data to acquire, we should consider what types of tasks are more likely to benefit from data scaling. We should be intentional in our data acquisition. We argue that the topology of data itself informs which tasks to prioritize in data scaling, and shapes the development of the next generation of compute paradigms for tasks where data scaling is inefficient, or even insufficient.

[AI-4] Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak

Link: https://arxiv.org/abs/2501.13772
Authors: Erjia Xiao, Hao Cheng, Jing Shao, Jinhao Duan, Kaidi Xu, Le Yang, Jindong Gu, Renjing Xu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) demonstrate remarkable zero-shot performance across various natural language processing tasks. The integration of multimodal encoders extends their capabilities, enabling the development of Multimodal Large Language Models that process vision, audio, and text. However, these capabilities also raise significant security concerns, as these models can be manipulated to generate harmful or inappropriate content through jailbreak. While extensive research explores the impact of modality-specific input edits on text-based LLMs and Large Vision-Language Models in jailbreak, the effects of audio-specific edits on Large Audio-Language Models (LALMs) remain underexplored. Hence, this paper addresses this gap by investigating how audio-specific edits influence LALMs inference regarding jailbreak. We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection, and the Edited Audio Datasets (EADs), a comprehensive audio jailbreak benchmark. We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits. This work lays the groundwork for future explorations on audio-modality interactions in LALMs security.

[AI-5] Integrating Causality with Neurochaos Learning: Proposed Approach and Research Agenda

Link: https://arxiv.org/abs/2501.13763
Authors: Nanjangud C. Narendra, Nithin Nagaraj
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages

Click to view abstract

Abstract: Deep learning, implemented via neural networks, has revolutionized machine learning by providing methods for complex tasks such as object detection/classification and prediction. However, architectures based on deep neural networks have started to yield diminishing returns, primarily due to their statistical nature and inability to capture causal structure in the training data. Another issue with deep learning is its high energy consumption, which is not that desirable from a sustainability perspective. Therefore, alternative approaches are being considered to address these issues, both of which are inspired by the functioning of the human brain. One approach is causal learning, which takes into account causality among the items in the dataset on which the neural network is trained. It is expected that this will help minimize the spurious correlations that are prevalent in the learned representations of deep neural networks. The other approach is Neurochaos Learning, a recent development, which draws its inspiration from the nonlinear chaotic firing intrinsic to neurons in biological neural networks (brain/central nervous system). Both approaches have shown improved results over just deep learning alone. To that end, in this position paper, we investigate how causal and neurochaos learning approaches can be integrated together to produce better results, especially in domains that contain linked data. We propose an approach for this integration to enhance classification, prediction and reinforcement learning. We also propose a set of research questions that need to be investigated in order to make this integration a reality.

[AI-6] On Deciding the Data Complexity of Answering Linear Monadic Datalog Queries with LTL Operators (Extended Version) ICDT'2025

Link: https://arxiv.org/abs/2501.13762
Authors: Alessandro Artale, Anton Gnatenko, Vladislav Ryzhikov, Michael Zakharyaschev
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
Comments: Extended version of a paper accepted at ICDT'2025

Click to view abstract

Abstract: Our concern is the data complexity of answering linear monadic datalog queries whose atoms in the rule bodies can be prefixed by operators of linear temporal logic LTL. We first observe that, for data complexity, answering any connected query with operators $\bigcirc/\bigcirc^-$ (at the next/previous moment) is either in AC0, or in $\mathrm{ACC0} \setminus \mathrm{AC0}$, or $\mathrm{NC}^1$-complete, or LogSpace-hard and in NLogSpace. Then we show that the problem of deciding LogSpace-hardness of answering such queries is PSpace-complete, while checking membership in the classes AC0 and ACC0 as well as $\mathrm{NC}^1$-completeness can be done in ExpSpace. Finally, we prove that membership in AC0 or in ACC0, $\mathrm{NC}^1$-completeness, and LogSpace-hardness are undecidable for queries with operators $\Diamond_f/\Diamond_p$ (sometime in the future/past) provided that $\mathrm{NC}^1 \ne \mathrm{NLogSpace}$, and $\mathrm{LogSpace} \ne \mathrm{NLogSpace}$.

[AI-7] EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents

Link: https://arxiv.org/abs/2501.13746
Authors: Yuhui Yun, Huilong Ye, Xinru Li, Ruojia Li, Jingfeng Deng, Li Li, Haoyi Xiong
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract: The paper introduces EICopilot, a novel agent-based solution enhancing search and exploration of enterprise registration data within extensive online knowledge graphs like those detailing legal entities, registered capital, and major shareholders. Traditional methods necessitate text-based queries and manual subgraph explorations, often resulting in time-consuming processes. EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this landscape by utilizing Large Language Models (LLMs) to interpret natural language queries. This solution automatically generates and executes Gremlin scripts, providing efficient summaries of complex enterprise relationships. Distinctive features include a data pre-processing pipeline that compiles and annotates representative queries into a vector database of examples for In-context learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought with ICL to enhance Gremlin script generation for knowledge graph search and exploration, and a novel query masking strategy that improves intent recognition for heightened script accuracy. Empirical evaluations demonstrate the superior performance of EICopilot, including speed and accuracy, over baseline methods, with the Full Mask variant achieving a syntax error rate reduction to as low as 10.00% and an execution correctness of up to 82.14%. These components collectively contribute to superior querying capabilities and summarization of intricate datasets, positioning EICopilot as a groundbreaking tool in the exploration and exploitation of large-scale knowledge graphs for enterprise information search.

[AI-8] Scalable Safe Multi-Agent Reinforcement Learning for Multi-Agent System

Link: https://arxiv.org/abs/2501.13727
Authors: Haikuo Du, Fandi Gou, Yunze Cai
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Safety and scalability are two critical challenges faced by practical Multi-Agent Systems (MAS). However, existing Multi-Agent Reinforcement Learning (MARL) algorithms that rely solely on reward shaping are ineffective in ensuring safety, and their scalability is rather limited due to the fixed-size network output. To address these issues, we propose a novel framework, Scalable Safe MARL (SS-MARL), to enhance the safety and scalability of MARL methods. Leveraging the inherent graph structure of MAS, we design a multi-layer message passing network to aggregate local observations and communications of varying sizes. Furthermore, we develop a constrained joint policy optimization method in the setting of local observation to improve safety. Simulation experiments demonstrate that SS-MARL achieves a better trade-off between optimality and safety compared to baselines, and its scalability significantly outperforms the latest methods in scenarios with a large number of agents. The feasibility of our method is also verified by hardware implementation with Mecanum-wheeled vehicles.

[AI-9] Formally Verified Neurosymbolic Trajectory Learning via Tensor-based Linear Temporal Logic on Finite Traces

Link: https://arxiv.org/abs/2501.13712
Authors: Mark Chevallier, Filip Smola, Richard Schmoetten, Jacques D. Fleuriot
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Click to view abstract

Abstract:We present a novel formalisation of tensor semantics for linear temporal logic on finite traces (LTLf), with formal proofs of correctness carried out in the theorem prover Isabelle/HOL. We demonstrate that this formalisation can be integrated into a neurosymbolic learning process by defining and verifying a differentiable loss function for the LTLf constraints, and automatically generating an implementation that integrates with PyTorch. We show that, by using this loss, the process learns to satisfy pre-specified logical constraints. Our approach offers a fully rigorous framework for constrained training, eliminating many of the inherent risks of ad-hoc, manual implementations of logical aspects directly in an “unsafe” programming language such as Python, while retaining efficiency in implementation.

[AI-10] Unlearning Clients Features and Samples in Vertical Federated Learning

Link: https://arxiv.org/abs/2501.13683
Authors: Ayush K. Varshney, Konstantinos Vandikas, Vicenç Torra
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted for publication in PETS 2025, Issue II

Click to view abstract

Abstract: Federated Learning (FL) has emerged as a prominent distributed learning paradigm. Within the scope of privacy preservation, information privacy regulations such as GDPR entitle users to request the removal (or unlearning) of their contribution from a service that is hosting the model. For this purpose, a server hosting an ML model must be able to unlearn certain information in cases such as copyright infringement or security issues that can make the model vulnerable or impact the performance of a service based on that model. While most unlearning approaches in FL focus on Horizontal FL (HFL), where clients share the feature space and the global model, Vertical FL (VFL) has received less attention from the research community. VFL involves clients (passive parties) sharing the sample space among them while not having access to the labels. In this paper, we explore unlearning in VFL from three perspectives: unlearning clients, unlearning features, and unlearning samples. To unlearn clients and features we introduce VFU-KD which is based on knowledge distillation (KD) while to unlearn samples, VFU-GA is introduced which is based on gradient ascent. To provide evidence of approximate unlearning, we utilize Membership Inference Attack (MIA) to audit the effectiveness of our unlearning approach. Our experiments across six tabular datasets and two image datasets demonstrate that VFU-KD and VFU-GA achieve performance comparable to or better than both retraining from scratch and the benchmark R2S method in many cases, with improvements of 0-2%. In the remaining cases, utility scores remain comparable, with a modest utility loss ranging from 1-5%. Unlike existing methods, VFU-KD and VFU-GA require no communication between active and passive parties during unlearning. However, they do require the active party to store the previously communicated embeddings.

[AI-11] Coarse-to-Fine Process Reward Modeling for Enhanced Mathematical Reasoning

Link: https://arxiv.org/abs/2501.13622
Authors: Yulan Hu, Sheng Ouyang, Yong Liu
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract: A process reward model (PRM) is critical for mathematical reasoning tasks, assigning a reward to each intermediate step. Training a PRM requires process-wise supervision data, which relies on chain-of-thought (CoT) or tree-based methods to construct the reasoning steps; however, individual reasoning steps may be redundant or contain nuanced errors that are difficult to detect. We attribute this to overlooking granularity division during process data collection. In this paper, we propose a coarse-to-fine framework to tackle this issue. Specifically, while gathering the process supervision data, we collect coarse reasoning steps by merging adjacent steps according to a preset merging granularity, then sequentially reduce the merging granularity to collect fine-grained reasoning steps. Each synthesized new step is relabeled according to the label of the last step. During training, we also traverse the collected training corpus in a coarse-to-fine manner. We conduct extensive experiments on popular mathematical reasoning datasets across diverse loss criteria, and the proposed framework consistently boosts reasoning performance.
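The data-collection procedure described in the abstract (merge adjacent steps at a preset granularity, relabel each merged step with the label of its last step, then reduce the granularity) can be sketched as below. This is one possible reading, with several assumptions not stated in the abstract: steps are merged in non-overlapping windows, the granularity is reduced by one at each stage, and the function names are hypothetical.

```python
def merge_steps(steps, labels, granularity):
    """Merge adjacent reasoning steps in non-overlapping windows of `granularity`.

    Each merged step takes the label of the last original step in its window.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    merged = []
    for start in range(0, len(steps), granularity):
        window_steps = steps[start:start + granularity]
        window_labels = labels[start:start + granularity]
        merged.append((" ".join(window_steps), window_labels[-1]))
    return merged


def coarse_to_fine(steps, labels, max_granularity):
    """Return training stages ordered from the coarsest merge down to single steps."""
    return [
        merge_steps(steps, labels, g)
        for g in range(max_granularity, 0, -1)
    ]


steps = ["Let x = 3.", "Then 2x = 6.", "So 2x + 1 = 8.", "Answer: 8."]
labels = [1, 1, 0, 0]  # 1 = correct step, 0 = incorrect step (toy labels)
for stage in coarse_to_fine(steps, labels, max_granularity=2):
    print(stage)
```

Traversing the returned stages in order gives the coarse-to-fine curriculum mentioned for training: the model first sees merged steps, then the original fine-grained ones.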

[AI-12] Efficient Synaptic Delay Implementation in Digital Event-Driven AI Accelerators

Link: https://arxiv.org/abs/2501.13610
Authors: Roy Meijer,Paul Detterer,Amirreza Yousefzadeh,Alberto Patino-Saucedo,Guanghzi Tang,Kanishkan Vadivel,Yinfu Xu,Manil-Dev Gomony,Federico Corradi,Bernabe Linares-Barranco,Manolis Sifalakis
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2404.10597

Click to view abstract

Abstract:Synaptic delay parameterization of neural network models has remained largely unexplored, but recent literature has shown promising results, suggesting that delay-parameterized models are simpler, smaller, sparser, and thus more energy efficient than similarly performing (e.g., in task accuracy) non-delay-parameterized ones. We introduce the Shared Circular Delay Queue (SCDQ), a novel hardware structure for supporting synaptic delays on digital neuromorphic accelerators. Our analysis and hardware results show that it scales better in terms of memory than commonly used approaches and is more amortizable to algorithm-hardware co-optimizations, where, in fact, memory scaling is modulated by model sparsity and not merely network size. Besides memory, we also report performance on latency, area, and energy per inference.

[AI-13] Optimal Multi-Objective Best Arm Identification with Fixed Confidence AISTATS2025

Link: https://arxiv.org/abs/2501.13607
Authors: Zhirui Chen,P.N. Karthik,Yeow Meng Chee,Vincent Y. F. Tan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: Accepted to AISTATS 2025

Click to view abstract

Abstract:We consider a multi-armed bandit setting with finitely many arms, in which each arm yields an M-dimensional vector reward upon selection. We assume that the reward of each dimension (a.k.a. objective) is generated independently of the others. The best arm of any given objective is the arm with the largest component of mean corresponding to the objective. The end goal is to identify the best arm of every objective in the shortest (expected) time subject to an upper bound on the probability of error (i.e., fixed-confidence regime). We establish a problem-dependent lower bound on the limiting growth rate of the expected stopping time, in the limit of vanishing error probabilities. This lower bound, we show, is characterised by a max-min optimisation problem that is computationally expensive to solve at each time step. We propose an algorithm that uses the novel idea of surrogate proportions to sample the arms at each time step, eliminating the need to solve the max-min optimisation problem at each step. We demonstrate theoretically that our algorithm is asymptotically optimal. In addition, we provide extensive empirical studies to substantiate the efficiency of our algorithm. While existing works on pure exploration with multi-objective multi-armed bandits predominantly focus on Pareto frontier identification, our work fills the gap in the literature by conducting a formal investigation of the multi-objective best arm identification problem.
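
To make the target of the identification problem concrete, the short Python sketch below computes the best arm of every objective from a table of mean rewards, following the definition in this abstract (the best arm of an objective is the arm with the largest mean component for that objective). It only illustrates the definition; it is not the paper's sampling or stopping rule, and the function name `best_arms` and the nested-list layout are assumptions.

```python
from typing import List


def best_arms(means: List[List[float]]) -> List[int]:
    """Return, for each objective m, the index of the arm with the largest mean.

    `means[k][m]` is the mean reward of arm k on objective m. In practice the
    true means are unknown and would be replaced by empirical estimates.
    """
    if not means:
        raise ValueError("at least one arm is required")
    num_objectives = len(means[0])
    return [
        max(range(len(means)), key=lambda k: means[k][m])
        for m in range(num_objectives)
    ]
```

With three arms and two objectives, `best_arms([[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]])` selects arm 1 for the first objective and arm 0 for the second, so different objectives can have different best arms.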

[AI-14] Text-to-SQL based on Large Language Models and Database Keyword Search

Link: https://arxiv.org/abs/2501.13594
Authors: Eduardo R. Nascimento(1 and 3),Caio Viktor S. Avila(1 and 4),Yenier T. Izquierdo(1),Grettel M. García(1),Lucas Feijó L. Andrade(1),Michelle S.P. Facina(2),Melissa Lemos(1 and 3),Marco A. Casanova(1 and 3) ((1) Instituto Tecgraf, PUC-Rio, Rio de Janeiro, Brazil, (2) Petrobras, Rio de Janeiro, Brazil, (3) Departamento de Informática, PUC-Rio, Rio de Janeiro, Brazil, (4) Departamento de Computação, UFC, Fortaleza, Brazil)
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Text-to-SQL prompt strategies based on Large Language Models (LLMs) achieve remarkable performance on well-known benchmarks. However, when applied to real-world databases, their performance is significantly less than for these benchmarks, especially for Natural Language (NL) questions requiring complex filters and joins to be processed. This paper then proposes a strategy to compile NL questions into SQL queries that incorporates a dynamic few-shot examples strategy and leverages the services provided by a database keyword search (KwS) platform. The paper details how the precision and recall of the schema-linking process are improved with the help of the examples provided and the keyword-matching service that the KwS platform offers. Then, it shows how the KwS platform can be used to synthesize a view that captures the joins required to process an input NL question and thereby simplify the SQL query compilation step. The paper includes experiments with a real-world relational database to assess the performance of the proposed strategy. The experiments suggest that the strategy achieves an accuracy on the real-world relational database that surpasses state-of-the-art approaches. The paper concludes by discussing the results obtained.

[AI-15] Contrastive Representation Learning Helps Cross-institutional Knowledge Transfer: A Study in Pediatric Ventilation Management

Link: https://arxiv.org/abs/2501.13587
Authors: Yuxuan(Edison)Liu,Jinpei Han,Padmanabhan Ramnarayan,A. Aldo Faisal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Clinical machine learning deployment across institutions faces significant challenges when patient populations and clinical practices differ substantially. We present a systematic framework for cross-institutional knowledge transfer in clinical time series, demonstrated through pediatric ventilation management between a general pediatric intensive care unit (PICU) and a cardiac-focused unit. Using contrastive predictive coding (CPC) for representation learning, we investigate how different data regimes and fine-tuning strategies affect knowledge transfer across institutional boundaries. Our results show that while direct model transfer performs poorly, CPC with appropriate fine-tuning enables effective knowledge sharing between institutions, with benefits particularly evident in limited data scenarios. Analysis of transfer patterns reveals an important asymmetry: temporal progression patterns transfer more readily than point-of-care decisions, suggesting practical pathways for cross-institutional deployment. Through a systematic evaluation of fine-tuning approaches and transfer patterns, our work provides insights for developing more generalizable clinical decision support systems while enabling smaller specialized units to leverage knowledge from larger centers.

[AI-16] Towards a Theory of AI Personhood AAAI-25

Link: https://arxiv.org/abs/2501.13533
Authors: Francis Rhys Ward
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI-25 AI Alignment Track

Click to view abstract

Abstract:I am a person and so are you. Philosophically we sometimes grant personhood to non-human animals, and entities such as sovereign states or corporations can legally be considered persons. But when, if ever, should we ascribe personhood to AI systems? In this paper, we outline necessary conditions for AI personhood, focusing on agency, theory-of-mind, and self-awareness. We discuss evidence from the machine learning literature regarding the extent to which contemporary AI systems, such as language models, satisfy these conditions, finding the evidence surprisingly inconclusive. If AI systems can be considered persons, then typical framings of AI alignment may be incomplete. Whereas agency has been discussed at length in the literature, other aspects of personhood have been relatively neglected. AI agents are often assumed to pursue fixed goals, but AI persons may be self-aware enough to reflect on their aims, values, and positions in the world and thereby induce their goals to change. We highlight open research directions to advance the understanding of AI personhood and its relevance to alignment. Finally, we reflect on the ethical considerations surrounding the treatment of AI systems. If AI systems are persons, then seeking control and alignment may be ethically untenable.

[AI-17] GCAD: Anomaly Detection in Multivariate Time Series from the Perspective of Granger Causality AAAI2025

Link: https://arxiv.org/abs/2501.13493
Authors: Zehao Liu,Mengzhou Gao,Pengfei Jiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2025

Click to view abstract

Abstract:Multivariate time series anomaly detection has numerous real-world applications and is being extensively studied. Modeling pairwise correlations between variables is crucial. Existing methods employ learnable graph structures and graph neural networks to explicitly model the spatial dependencies between variables. However, these methods are primarily based on prediction or reconstruction tasks, which can only learn similarity relationships between sequence embeddings and lack interpretability in how graph structures affect time series evolution. In this paper, we designed a framework that models spatial dependencies using interpretable causal relationships and detects anomalies through changes in causal patterns. Specifically, we propose a method to dynamically discover Granger causality using gradients in nonlinear deep predictors and employ a simple sparsification strategy to obtain a Granger causality graph, detecting anomalies from a causal perspective. Experiments on real-world datasets demonstrate that the proposed model achieves more accurate anomaly detection compared to baseline methods.

[AI-18] A Polynomial-Time Algorithm for EFX Orientations of Chores

Link: https://arxiv.org/abs/2501.13481
Authors: Kevin Hsu,Valerie King
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments: 8 pages

Click to view abstract

Abstract:This paper addresses the problem of finding EFX orientations of graphs of chores, in which each vertex corresponds to an agent, each edge corresponds to a chore, and a chore has zero marginal utility to an agent if its corresponding edge is not incident to the vertex corresponding to the agent. Recently, Zhou et al. (IJCAI, 2024) analyzed the complexity of deciding whether graphs containing a mixture of goods and chores admit EFX orientations, and conjectured that deciding whether graphs containing only chores admit EFX orientations is NP-complete. In this paper, we resolve this conjecture by exhibiting a polynomial-time algorithm that finds an EFX orientation of a graph containing only chores if one exists, even if the graph contains self-loops. Remarkably, our first result demonstrates a surprising separation between the case of goods and the case of chores, because deciding whether graphs containing only goods admit EFX orientations of goods was shown to be NP-complete by Christodoulou et al. (EC, 2023). In addition, we show the analogous decision problem for multigraphs to be NP-complete.

[AI-19] Adaptive Testing for LLM-Based Applications: A Diversity-based Approach

Link: https://arxiv.org/abs/2501.13480
Authors: Juyeon Yoon,Robert Feldt,Shin Yoo
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages

Click to view abstract

Abstract:The recent surge of building software systems powered by Large Language Models (LLMs) has led to the development of various testing frameworks, primarily focused on treating prompt templates as the unit of testing. Despite the significant costs associated with test input execution and output assessment, the curation of optimized test suites is yet overlooked in these tools, which calls for tailored test selection or prioritization strategies. In this paper, we show that diversity-based testing techniques, such as Adaptive Random Testing (ART) with appropriate string distance metrics, can be effectively applied to the testing of prompt templates. Our proposed adaptive testing approach adjusts the conventional ART process to this context by selecting new test inputs based on scores derived from existing test suite and their labelling results. Our results, obtained using various implementations that explore several string-based distances, confirm that our approach enables the discovery of failures with reduced testing budgets and promotes the generation of more varied outputs.
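
As a rough illustration of the diversity-based selection this abstract builds on, the Python sketch below implements conventional Adaptive Random Testing over a pool of candidate test inputs with Levenshtein distance as the string metric. It does not reproduce the paper's adapted scoring (which also uses labelling results of already executed tests); the function names, the `sample_size` parameter, the fixed seed, and the choice of Levenshtein distance are assumptions for illustration.

```python
import random
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                   # delete ch_a
                curr[j - 1] + 1,               # insert ch_b
                prev[j - 1] + (ch_a != ch_b),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def art_select(pool: Sequence[str], budget: int,
               sample_size: int = 10, seed: int = 0) -> List[str]:
    """Pick up to `budget` inputs from `pool`, favouring inputs far from those already chosen."""
    rng = random.Random(seed)
    remaining = list(pool)
    if not remaining or budget <= 0:
        return []
    # The first test input is chosen uniformly at random.
    selected = [remaining.pop(rng.randrange(len(remaining)))]
    while remaining and len(selected) < budget:
        # Draw a random candidate set, then keep the candidate whose
        # nearest already-selected input is farthest away (max-min distance).
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        best = max(candidates,
                   key=lambda c: min(levenshtein(c, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because each executed prompt is costly, the max-min rule spends the budget on inputs that differ most from what has already been run; swapping `levenshtein` for another string distance only requires changing the `key` function.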

[AI-20] Adaptive Few-Shot Learning (AFSL): Tackling Data Scarcity with Stability, Robustness, and Versatility

Link: https://arxiv.org/abs/2501.13479
Authors: Rishabh Agrawal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Few-shot learning (FSL) enables machine learning models to generalize effectively with minimal labeled data, making it crucial for data-scarce domains such as healthcare, robotics, and natural language processing. Despite its potential, FSL faces challenges including sensitivity to initialization, difficulty in adapting to diverse domains, and vulnerability to noisy datasets. To address these issues, this paper introduces Adaptive Few-Shot Learning (AFSL), a framework that integrates advancements in meta-learning, domain alignment, noise resilience, and multi-modal integration. AFSL consists of four key modules: a Dynamic Stability Module for performance consistency, a Contextual Domain Alignment Module for domain adaptation, a Noise-Adaptive Resilience Module for handling noisy data, and a Multi-Modal Fusion Module for integrating diverse modalities. This work also explores strategies such as task-aware data augmentation, semi-supervised learning, and explainable AI techniques to enhance the applicability and robustness of FSL. AFSL provides scalable, reliable, and impactful solutions for real-world, high-stakes domains.

[AI-21] Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks

Link: https://arxiv.org/abs/2501.13457
Authors: Ruijia Liu,Ancheng Hou,Xiao Yu,Xiang Yin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: submitted

Click to view abstract

Abstract:Signal Temporal Logic (STL) is a powerful specification language for describing complex temporal behaviors of continuous signals, making it well-suited for high-level robotic task descriptions. However, generating executable plans for STL tasks is challenging, as it requires consideration of the coupling between the task specification and the system dynamics. Existing approaches either follow a model-based setting that explicitly requires knowledge of the system dynamics or adopt a task-oriented data-driven approach to learn plans for specific tasks. In this work, we investigate the problem of generating executable STL plans for systems whose dynamics are unknown a priori. We propose a new planning framework that uses only task-agnostic data during the offline training stage, enabling zero-shot generalization to new STL tasks. Our framework is hierarchical, involving: (i) decomposing the STL task into a set of progress and time constraints, (ii) searching for time-aware waypoints guided by task-agnostic data, and (iii) generating trajectories using a pre-trained safe diffusion model. Simulation results demonstrate the effectiveness of our method in achieving zero-shot generalization to various STL tasks.

[AI-22] KAA: Kolmogorov-Arnold Attention for Enhancing Attentive Graph Neural Networks

Link: https://arxiv.org/abs/2501.13456
Authors: Taoran Fang,Tianhong Gao,Chunping Wang,Yihao Shang,Wei Chow,Lei Chen,Yang Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Graph neural networks (GNNs) with attention mechanisms, often referred to as attentive GNNs, have emerged as a prominent paradigm in advanced GNN models in recent years. However, our understanding of the critical process of scoring neighbor nodes remains limited, leading to the underperformance of many existing attentive GNNs. In this paper, we unify the scoring functions of current attentive GNNs and propose Kolmogorov-Arnold Attention (KAA), which integrates the Kolmogorov-Arnold Network (KAN) architecture into the scoring process. KAA enhances the performance of scoring functions across the board and can be applied to nearly all existing attentive GNNs. To compare the expressive power of KAA with other scoring functions, we introduce Maximum Ranking Distance (MRD) to quantitatively estimate their upper bounds in ranking errors for node importance. Our analysis reveals that, under limited parameters and constraints on width and depth, both linear transformation-based and MLP-based scoring functions exhibit finite expressive power. In contrast, our proposed KAA, even with a single-layer KAN parameterized by zero-order B-spline functions, demonstrates nearly infinite expressive power. Extensive experiments on both node-level and graph-level tasks using various backbone models show that KAA-enhanced scoring functions consistently outperform their original counterparts, achieving performance improvements of over 20% in some cases.

[AI-23] BMG-Q: Localized Bipartite Match Graph Attention Q-Learning for Ride-Pooling Order Dispatch

Link: https://arxiv.org/abs/2501.13448
Authors: Yulong Hu,Siyuan Feng,Sen Li
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper introduces Localized Bipartite Match Graph Attention Q-Learning (BMG-Q), a novel Multi-Agent Reinforcement Learning (MARL) algorithm framework tailored for ride-pooling order dispatch. BMG-Q advances ride-pooling decision-making process with the localized bipartite match graph underlying the Markov Decision Process, enabling the development of novel Graph Attention Double Deep Q Network (GATDDQN) as the MARL backbone to capture the dynamic interactions among ride-pooling vehicles in fleet. Our approach enriches the state information for each agent with GATDDQN by leveraging a localized bipartite interdependence graph and enables a centralized global coordinator to optimize order matching and agent behavior using Integer Linear Programming (ILP). Enhanced by gradient clipping and localized graph sampling, our GATDDQN improves scalability and robustness. Furthermore, the inclusion of a posterior score function in the ILP captures the online exploration-exploitation trade-off and reduces the potential overestimation bias of agents, thereby elevating the quality of the derived solutions. Through extensive experiments and validation, BMG-Q has demonstrated superior performance in both training and operations for thousands of vehicle agents, outperforming benchmark reinforcement learning frameworks by around 10% in accumulative rewards and showing a significant reduction in overestimation bias by over 50%. Additionally, it maintains robustness amidst task variations and fleet size changes, establishing BMG-Q as an effective, scalable, and robust framework for advancing ride-pooling order dispatch operations.

[AI-24] M3PT: A Transformer for Multimodal Multi-Party Social Signal Prediction with Person-aware Blockwise Attention

Link: https://arxiv.org/abs/2501.13416
Authors: Yiming Tang,Abrar Anwar,Jesse Thomason
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Multi-party interactions include social signals like body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Incorporating all the multimodal signals in a multi-party interaction is difficult, and past work tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking which allows for the simultaneous processing of multiple social cues across multiple participants and their temporal interactions. This approach better captures social dynamics over time by considering longer horizons of social signals between individuals. We train and evaluate our unified model on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: this https URL

[AI-25] Load and Renewable Energy Forecasting Using Deep Learning for Grid Stability

Link: https://arxiv.org/abs/2501.13412
Authors: Kamal Sarkar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As the energy landscape changes quickly, grid operators face several challenges, especially when integrating renewable energy sources with the grid. The most important challenge is to balance supply and demand, because solar and wind energy are highly unpredictable. When dealing with such uncertainty, trustworthy short-term load and renewable energy forecasting can help stabilize the grid, maximize energy storage, and guarantee the effective use of renewable resources. Physical models and statistical techniques were the approaches previously employed for this kind of forecasting task. In forecasting renewable energy, machine learning and deep learning techniques have recently demonstrated encouraging results. More specifically, deep learning techniques like CNN and LSTM and conventional machine learning techniques like regression are the ones mostly utilized for load and renewable energy forecasting tasks. In this article, we will focus mainly on CNN and LSTM-based forecasting methods.

[AI-26] Concurrent Learning with Aggregated States via Randomized Least Squares Value Iteration

Link: https://arxiv.org/abs/2501.13394
Authors: Yan Chen,Qinxun Bai,Yiteng Zhang,Shi Dong,Maria Dimakopoulou,Qi Sun,Zhengyuan Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Designing learning agents that explore efficiently in a complex environment has been widely recognized as a fundamental challenge in reinforcement learning. While a number of works have demonstrated the effectiveness of techniques based on randomized value functions on a single agent, it remains unclear, from a theoretical point of view, whether injecting randomization can help a society of agents concurrently explore an environment. The theoretical results tender an affirmative answer to this question. We adapt the concurrent learning framework to randomized least-squares value iteration (RLSVI) with aggregated state representation. We demonstrate polynomial worst-case regret bounds in both finite- and infinite-horizon environments. In both setups the per-agent regret decreases at an optimal rate of $\Theta\left(\frac{1}{\sqrt{N}}\right)$, highlighting the advantage of concurrent learning. Our algorithm exhibits significantly lower space complexity compared to \cite{russo2019worst} and \cite{agrawal2021improved}. We reduce the space complexity by a factor of $K$ while incurring only a $\sqrt{K}$ increase in the worst-case regret bound, compared to \citep{agrawal2021improved,russo2019worst}. Additionally, we conduct numerical experiments to demonstrate our theoretical findings.

[AI-27] A review on development of eco-friendly filters in Nepal for use in cigarettes and masks and Air Pollution Analysis with Machine Learning and SHAP Interpretability

Link: https://arxiv.org/abs/2501.13369
Authors: Bishwash Paneru,Biplov Paneru,Tanka Mukhiya,Khem Narayan Poudyal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In Nepal, air pollution is a serious public health concern, especially in cities like Kathmandu where particulate matter (PM2.5 and PM10) has a major influence on respiratory health and air quality. The Air Quality Index (AQI) is predicted in this work using a Random Forest Regressor, and the model’s predictions are interpreted using SHAP (SHapley Additive exPlanations) analysis. With the lowest Testing RMSE (0.23) and flawless R2 scores (1.00), CatBoost performs better than other models, demonstrating its greater accuracy and generalization which is cross validated using a nested cross validation approach. NowCast Concentration and Raw Concentration are the most important elements influencing AQI values, according to SHAP research, which shows that the machine learning results are highly accurate. Their significance as major contributors to air pollution is highlighted by the fact that high values of these characteristics significantly raise the AQI. This study investigates the Hydrogen-Alpha (HA) biodegradable filter as a novel way to reduce the related health hazards. With removal efficiency of more than 98% for PM2.5 and 99.24% for PM10, the HA filter offers exceptional defense against dangerous airborne particles. These devices, which are biodegradable face masks and cigarette filters, address the environmental issues associated with traditional filters’ non-biodegradable trash while also lowering exposure to air contaminants.

[AI-28] One Fits All: General Mobility Trajectory Modeling via Masked Conditional Diffusion

Link: https://arxiv.org/abs/2501.13347
Authors: Qingyue Long,Can Rong,Huandong Wang,Yong Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Trajectory data play a crucial role in many applications, ranging from network optimization to urban planning. Existing studies on trajectory data are task-specific, and their applicability is limited to the specific tasks on which they have been trained, such as generation, recovery, or prediction. However, the potential of a unified model has not yet been fully explored in trajectory modeling. Although various trajectory tasks differ in inputs, outputs, objectives, and conditions, they share common mobility patterns. Based on these common patterns, we can construct a general framework that enables a single model to address different tasks. However, building a trajectory task-general framework faces two critical challenges: 1) the diversity in the formats of different tasks and 2) the complexity of the conditions imposed on different tasks. In this work, we propose a general trajectory modeling framework via masked conditional diffusion (named GenMove). Specifically, we utilize mask conditions to unify diverse formats. To adapt to complex conditions associated with different tasks, we utilize historical trajectory data to obtain contextual trajectory embeddings, which include rich contexts such as spatiotemporal characteristics and user preferences. Integrating the contextual trajectory embedding into diffusion models through a classifier-free guidance approach allows the model to flexibly adjust its outputs based on different conditions. Extensive experiments on mainstream tasks demonstrate that our model significantly outperforms state-of-the-art baselines, with the highest performance improvement exceeding 13% in generation tasks.
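
The classifier-free guidance step mentioned in this abstract can be sketched in a few lines of Python. This shows only the standard way a conditional and an unconditional noise prediction are blended at sampling time; how GenMove computes those predictions from contextual trajectory embeddings is not described here, so the function name, the plain-list representation, and the convention that a scale of 0 gives the unconditional prediction and 1 the conditional one are assumptions.

```python
from typing import List


def classifier_free_guidance(eps_cond: List[float],
                             eps_uncond: List[float],
                             guidance_scale: float) -> List[float]:
    """Blend conditional and unconditional noise predictions element-wise.

    Assumed convention: result = uncond + scale * (cond - uncond), so
    scale 0 ignores the condition, scale 1 uses the conditional prediction,
    and scale > 1 pushes further in the direction of the condition.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

Raising `guidance_scale` makes generated samples follow the condition more strongly, typically at some cost in diversity, which is one way a single model can adjust its outputs to different task conditions.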

[AI-29] Full-Stack Optimized Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation

Link: https://arxiv.org/abs/2501.13344
Authors: Rong Shan,Jiachen Zhu,Jianghao Lin,Chenxu Zhu,Bo Chen,Ruiming Tang,Yong Yu,Weinan Zhang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Under Review

Click to view abstract

Abstract:In this paper, we address the lifelong sequential behavior incomprehension problem in large language models (LLMs) for recommendation, where LLMs struggle to extract useful information from long user behavior sequences, even within their context limits. To tackle this, we propose ReLLaX (Retrieval-enhanced Large Language models Plus), a framework offering optimization across data, prompt, and parameter levels. At the data level, we introduce Semantic User Behavior Retrieval (SUBR) to reduce sequence heterogeneity, making it easier for LLMs to extract key information. For prompt-level enhancement, we employ Soft Prompt Augmentation (SPA) to inject collaborative knowledge, aligning item representations with recommendation tasks and improving LLMs’ exploration of item relationships. Finally, at the parameter level, we propose Component Fully-interactive LoRA (CFLoRA), which enhances LoRA’s expressiveness by enabling interactions between its components, allowing better capture of sequential information. Moreover, we present new perspectives to compare current LoRA-based LLM4Rec methods, i.e. from both a composite and a decomposed view. We theoretically demonstrate that the ways they employ LoRA for recommendation are degraded versions of our CFLoRA, with different constraints on atom component interactions. Extensive experiments on three public datasets demonstrate ReLLaX’s superiority over existing baselines and its ability to mitigate lifelong sequential behavior incomprehension effectively.

[AI-30] Sparse identification of nonlinear dynamics and Koopman operators with Shallow Recurrent Decoder Networks

Link: https://arxiv.org/abs/2501.13329
Authors: Mars Liyao Gao,Jan P. Williams,J. Nathan Kutz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
Comments:

Click to view abstract

Abstract:Spatiotemporal modeling of real-world data poses a challenging problem due to inherent high dimensionality, measurement noise, and expensive data collection procedures. In this paper, we present Sparse Identification of Nonlinear Dynamics with SHallow REcurrent Decoder networks (SINDy-SHRED), a method to jointly solve the sensing and model identification problems with simple implementation, efficient computation, and robust performance. SINDy-SHRED uses Gated Recurrent Units (GRUs) to model the temporal sequence of sensor measurements along with a shallow decoder network to reconstruct the full spatiotemporal field from the latent state space using only a few available sensors. Our proposed algorithm introduces a SINDy-based regularization; beginning with an arbitrary latent state space, the dynamics of the latent space progressively converge to a SINDy-class functional, provided the projection remains within the set. In restricting SINDy to a linear model, the architecture produces a Koopman-SHRED model which enforces linear latent space dynamics. We conduct a systematic experimental study including synthetic PDE data, real-world sensor measurements for sea surface temperature, and direct video data. With no explicit encoder, SINDy-SHRED and Koopman-SHRED enable efficient training with minimal hyperparameter tuning and laptop-level computing; further, they demonstrate robust generalization in a variety of applications with minimal to no hyperparameter adjustments. Finally, the interpretable SINDy and Koopman models of latent state dynamics enable accurate long-term video predictions, achieving state-of-the-art performance and outperforming all baseline methods considered, including Convolutional LSTM, PredRNN, ResNet, and SimVP.

[AI-31] Investigation of the Privacy Concerns in AI Systems for Young Digital Citizens: A Comparative Stakeholder Analysis

Link: https://arxiv.org/abs/2501.13321
Authors: Molly Campbell,Ankur Barthwal,Sandhya Joshi,Austin Shouli,Ajay Kumar Shrestha
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: To appear in the 2025 IEEE 14th Annual Computing and Communication Workshop and Conference (CCWC) proceedings

Click to view abstract

Abstract:The integration of Artificial Intelligence (AI) systems into technologies used by young digital citizens raises significant privacy concerns. This study investigates these concerns through a comparative analysis of stakeholder perspectives. A total of 252 participants were surveyed, with the analysis focusing on 110 valid responses from parents/educators and 100 from AI professionals after data cleaning. Quantitative methods, including descriptive statistics and Partial Least Squares Structural Equation Modeling, examined five validated constructs: Data Ownership and Control, Parental Data Sharing, Perceived Risks and Benefits, Transparency and Trust, and Education and Awareness. Results showed Education and Awareness significantly influenced data ownership and risk assessment, while Data Ownership and Control strongly impacted Transparency and Trust. Transparency and Trust, along with Perceived Risks and Benefits, showed minimal influence on Parental Data Sharing, suggesting other factors may play a larger role. The study underscores the need for user-centric privacy controls, tailored transparency strategies, and targeted educational initiatives. Incorporating diverse stakeholder perspectives offers actionable insights into ethical AI design and governance, balancing innovation with robust privacy protections to foster trust in a digital age.

[AI-32] Toward Ethical AI: A Qualitative Analysis of Stakeholder Perspectives

Link: https://arxiv.org/abs/2501.13320
Authors: Ajay Kumar Shrestha,Sandhya Joshi
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: To appear in the 2025 IEEE 14th Annual Computing and Communication Workshop and Conference (CCWC) proceedings

Abstract:As Artificial Intelligence (AI) systems become increasingly integrated into various aspects of daily life, concerns about privacy and ethical accountability are gaining prominence. This study explores stakeholder perspectives on privacy in AI systems, focusing on educators, parents, and AI professionals. Using qualitative analysis of survey responses from 227 participants, the research identifies key privacy risks, including data breaches, ethical misuse, and excessive data collection, alongside perceived benefits such as personalized services, enhanced efficiency, and educational advancements. Stakeholders emphasized the need for transparency, privacy-by-design, user empowerment, and ethical oversight to address privacy concerns effectively. The findings provide actionable insights into balancing the benefits of AI with robust privacy protections, catering to the diverse needs of stakeholders. Recommendations include implementing selective data use, fostering transparency, promoting user autonomy, and integrating ethical principles into AI development. This study contributes to the ongoing discourse on ethical AI, offering guidance for designing privacy-centric systems that align with societal values and build trust among users. By addressing privacy challenges, this research underscores the importance of developing AI technologies that are not only innovative but also ethically sound and responsive to the concerns of all stakeholders.

[AI-33] Parallel Belief Contraction via Order Aggregation

Link: https://arxiv.org/abs/2501.13295
Authors: Jake Chandler,Richard Booth
Categories: Artificial Intelligence (cs.AI)

Abstract:The standard "serial" (aka "singleton") model of belief contraction models the manner in which an agent’s corpus of beliefs responds to the removal of a single item of information. One salient extension of this model introduces the idea of "parallel" (aka "package" or "multiple") change, in which an entire set of items of information are simultaneously removed. Existing research on the latter has largely focussed on single-step parallel contraction: understanding the behaviour of beliefs after a single parallel contraction. It has also focussed on generalisations to the parallel case of serial contraction operations whose characteristic properties are extremely weak. Here we consider how to extend serial contraction operations that obey stronger properties. Potentially more importantly, we also consider the iterated case: the behaviour of beliefs after a sequence of parallel contractions. We propose a general method for extending serial iterated belief change operators to handle parallel change based on an n-ary generalisation of Booth & Chandler’s TeamQueue binary order aggregators.

[AI-34] Experience with GitHub Copilot for Developer Productivity at Zoominfo

Link: https://arxiv.org/abs/2501.13282
Authors: Gal Bakal,Ali Dasdan,Yaniv Katz,Michael Kaufman,Guy Levin
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 25 pages, 11 figures

Abstract:This paper presents a comprehensive evaluation of GitHub Copilot’s deployment and impact on developer productivity at Zoominfo, a leading Go-To-Market (GTM) Intelligence Platform. We describe our systematic four-phase approach to evaluating and deploying GitHub Copilot across our engineering organization, involving over 400 developers. Our analysis combines both quantitative metrics, focusing on acceptance rates of suggestions given by GitHub Copilot and qualitative feedback given by developers through developer satisfaction surveys. The results show an average acceptance rate of 33% for suggestions and 20% for lines of code, with high developer satisfaction scores of 72%. We also discuss language-specific performance variations, limitations, and lessons learned from this medium-scale enterprise deployment. Our findings contribute to the growing body of knowledge about AI-assisted software development in enterprise settings.

[AI-35] Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions

Link: https://arxiv.org/abs/2501.13230
Authors: Yan Ru Pei
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 7 figures

Abstract:We introduce Centaurus, a class of networks composed of generalized state-space model (SSM) blocks, where the SSM operations can be treated as tensor contractions during training. The optimal order of tensor contractions can then be systematically determined for every SSM block to maximize training efficiency. This allows more flexibility in designing SSM blocks beyond the depthwise-separable configuration commonly implemented. The new design choices will take inspiration from classical convolutional blocks including group convolutions, full convolutions, and bottleneck blocks. We architect the Centaurus network with a mixture of these blocks, to balance between network size and performance, as well as memory and computational efficiency during both training and inference. We show that this heterogeneous network design outperforms its homogeneous counterparts in raw audio processing tasks including keyword spotting, speech denoising, and automatic speech recognition (ASR). For ASR, Centaurus is the first network with competitive performance that can be made fully state-space based, without using any nonlinear recurrence (LSTMs), explicit convolutions (CNNs), or (surrogate) attention mechanism. Source code is available at this http URL

[AI-36] SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Link: https://arxiv.org/abs/2501.13200
Authors: Alsu Sagirova,Yuri Kuratov,Mikhail Burtsev
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 16 pages, 11 figures

Abstract:Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents’ behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: this https URL.

[AI-37] Learning in Log-Domain: Subthreshold Analog AI Accelerator Based on Stochastic Gradient Descent

Link: https://arxiv.org/abs/2501.13181
Authors: Momen K Tageldeen,Yacine Belgaid,Vivek Mohan,Zhou Wang,Emmanuel M Drakakis
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)

Abstract:The rapid proliferation of AI models, coupled with growing demand for edge deployment, necessitates the development of AI hardware that is both high-performance and energy-efficient. In this paper, we propose a novel analog accelerator architecture designed for AI/ML training workloads using stochastic gradient descent with L2 regularization (SGDr). The architecture leverages log-domain circuits in subthreshold MOS and incorporates volatile memory. We establish a mathematical framework for solving SGDr in the continuous time domain and detail the mapping of SGDr learning equations to log-domain circuits. By operating in the analog domain and utilizing weak inversion, the proposed design achieves significant reductions in transistor area and power consumption compared to digital implementations. Experimental results demonstrate that the architecture closely approximates ideal behavior, with a mean square error below 0.87% and precision as low as 8 bits. Furthermore, the architecture supports a wide range of hyperparameters. This work paves the way for energy-efficient analog AI hardware with on-chip training capabilities.
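The continuous-time view of SGD with L2 regularization mentioned in this abstract can be illustrated with a small numerical sketch. The code below is not the paper's circuit model: it assumes a one-dimensional linear model, replaces the stochastic gradient with the full-batch gradient (the noise-free limit), and integrates the resulting gradient flow dw/dt = -eta * (dL/dw + lam * w) with explicit Euler steps. The function names, data, and hyperparameters are hypothetical.

```python
def ridge_gradient_flow(xs, ys, lam=0.1, eta=1.0, dt=0.01, steps=5000):
    """Euler-integrate dw/dt = -eta * (dL/dw + lam * w) for the model y ~ w * x.

    L(w) = 1/(2n) * sum((w*x - y)^2) is the mean squared error and
    (lam/2) * w^2 is the L2 penalty. Full-batch gradients are an assumption
    made here for determinism; SGD would sample mini-batches instead.
    """
    n = len(xs)
    w = 0.0
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w += dt * (-eta * (grad + lam * w))
    return w


def ridge_closed_form(xs, ys, lam=0.1):
    """Stationary point of the same flow: w* = mean(x*y) / (mean(x^2) + lam)."""
    n = len(xs)
    sxx = sum(x * x for x in xs) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    return sxy / (sxx + lam)


# Hypothetical toy data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_flow = ridge_gradient_flow(xs, ys)
w_star = ridge_closed_form(xs, ys)
```

The integrated weight settles at the ridge solution, slightly below 2 because the L2 term shrinks it toward zero.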

[AI-38] AirRadar: Inferring Nationwide Air Quality in China with Deep Neural Networks

Link: https://arxiv.org/abs/2501.13141
Authors: Qiongyan Wang,Yutong Xia,Siru ZHong,Weichuang Li,Yuankai Wu,Shifen Cheng,Junbo Zhang,Yu Zheng,Yuxuan Liang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Monitoring real-time air quality is essential for safeguarding public health and fostering social progress. However, the widespread deployment of air quality monitoring stations is constrained by their significant costs. To address this limitation, we introduce AirRadar, a deep neural network designed to accurately infer real-time air quality in locations lacking monitoring stations by utilizing data from existing ones. By leveraging learnable mask tokens, AirRadar reconstructs air quality features in unmonitored regions. Specifically, it operates in two stages: first capturing spatial correlations and then adjusting for distribution shifts. We validate AirRadar’s efficacy using a year-long dataset from 1,085 monitoring stations across China, demonstrating its superiority over multiple baselines, even with varying degrees of unobserved data. The source code can be accessed at this https URL.

[AI-39] Graph Representation Learning with Diffusion Generative Models

Link: https://arxiv.org/abs/2501.13133
Authors: Daniel Wesego
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Not only that, diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We only need the encoder at the end to extract representations. Our approach demonstrates the potential of discrete diffusion models to be used for graph representation learning.

[AI-40] A Hierarchical Reinforcement Learning Framework for Multi-UAV Combat Using Leader-Follower Strategy

Link: https://arxiv.org/abs/2501.13132
Authors: Jinhui Pang,Jinglin He,Noureldin Mohamed Abdelaal Ahmed Mohamed,Changqing Lin,Zhihui Zhang,Xiaoshuai Hao
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:Multi-UAV air combat is a complex task involving multiple autonomous UAVs, an evolving field in both aerospace and artificial intelligence. This paper aims to enhance adversarial performance through collaborative strategies. Previous approaches predominantly discretize the action space into predefined actions, limiting UAV maneuverability and complex strategy implementation. Others simplify the problem to 1v1 combat, neglecting the cooperative dynamics among multiple UAVs. To address the high-dimensional challenges inherent in six-degree-of-freedom space and improve cooperation, we propose a hierarchical framework utilizing the Leader-Follower Multi-Agent Proximal Policy Optimization (LFMAPPO) strategy. Specifically, the framework is structured into three levels. The top level conducts a macro-level assessment of the environment and guides execution policy. The middle level determines the angle of the desired action. The bottom level generates precise action commands for the high-dimensional action space. Moreover, we optimize the state-value functions by assigning distinct roles with the leader-follower strategy to train the top-level policy, followers estimate the leader’s utility, promoting effective cooperation among agents. Additionally, the incorporation of a target selector, aligned with the UAVs’ posture, assesses the threat level of targets. Finally, simulation experiments validate the effectiveness of our proposed method.

[AI-41] Exploring Finetuned Audio-LLM on Heart Murmur Features ALT

Link: https://arxiv.org/abs/2501.13884
Authors: Adrian Florea,Xilin Jiang,Nima Mesgarani,Xiaofan Jiang
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 5 pages, 1 figure, and 3 tables. Submitted to IEEE/ACM Conference on Connected Health: Applications, Systems, and Engineering Technologies

Abstract:Large language models (LLMs) for audio have excelled in recognizing and analyzing human speech, music, and environmental sounds. However, their potential for understanding other types of sounds, particularly biomedical sounds, remains largely underexplored despite significant scientific interest. In this study, we focus on diagnosing cardiovascular diseases using phonocardiograms, i.e., heart sounds. Most existing deep neural network (DNN) paradigms are restricted to heart murmur classification (healthy vs unhealthy) and do not predict other acoustic features of the murmur such as timing, grading, harshness, pitch, and quality, which are important in helping physicians diagnose the underlying heart conditions. We propose to finetune an audio LLM, Qwen2-Audio, on the PhysioNet CirCor DigiScope phonocardiogram (PCG) dataset and evaluate its performance in classifying 11 expert-labeled murmur features. Additionally, we aim to achieve more noise-robust and generalizable system by exploring a preprocessing segmentation algorithm using an audio representation model, SSAMBA. Our results indicate that the LLM-based model outperforms state-of-the-art methods in 8 of the 11 features and performs comparably in the remaining 3. Moreover, the LLM successfully classifies long-tail murmur features with limited training data, a task that all previous methods have failed to classify. These findings underscore the potential of audio LLMs as assistants to human cardiologists in enhancing heart disease diagnosis.

[AI-42] A space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints

Link: https://arxiv.org/abs/2501.13830
Authors: Yan Yang,Bin Gao,Ya-xiang Yuan
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 48 pages, 12 figures, 6 tables

Abstract:Imposing additional constraints on low-rank optimization has garnered growing interest. However, the geometry of coupled constraints hampers the well-developed low-rank structure and makes the problem intricate. To this end, we propose a space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints. The "space-decoupling" is reflected in several ways. We show that the tangent cone of coupled constraints is the intersection of tangent cones of each constraint. Moreover, we decouple the intertwined bounded-rank and orthogonally invariant constraints into two spaces, leading to optimization on a smooth manifold. Implementing Riemannian algorithms on this manifold is painless as long as the geometry of additional constraints is known. In addition, we unveil the equivalence between the reformulated problem and the original problem. Numerical experiments on real-world applications – spherical data fitting, graph similarity measuring, low-rank SDP, model reduction of Markov processes, reinforcement learning, and deep learning – validate the superiority of the proposed framework.

[AI-43] Explainable AI-aided Feature Selection and Model Reduction for DRL-based V2X Resource Allocation

Link: https://arxiv.org/abs/2501.13552
Authors: Nasir Khan,Asmaa Abdallah,Abdulkadir Celik,Ahmed M. Eltawil,Sinem Coleri
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Artificial intelligence (AI) is expected to significantly enhance radio resource management (RRM) in sixth-generation (6G) networks. However, the lack of explainability in complex deep learning (DL) models poses a challenge for practical implementation. This paper proposes a novel explainable AI (XAI)-based framework for feature selection and model complexity reduction in a model-agnostic manner. Applied to a multi-agent deep reinforcement learning (MADRL) setting, our approach addresses the joint sub-band assignment and power allocation problem in cellular vehicle-to-everything (V2X) communications. We propose a novel two-stage systematic explainability framework leveraging feature relevance-oriented XAI to simplify the DRL agents. While the former stage generates a state feature importance ranking of the trained models using Shapley additive explanations (SHAP)-based importance scores, the latter stage exploits these importance-based rankings to simplify the state space of the agents by removing the least important features from the model input. Simulation results demonstrate that the XAI-assisted methodology achieves 97% of the original MADRL sum-rate performance while reducing optimal state features by 28%, average training time by 11%, and trainable weight parameters by 46% in a network with eight vehicular pairs.
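The two-stage idea in this abstract (rank input features by Shapley-based importance, then drop the least important ones) can be sketched on a toy model. The code below is an illustration under stated assumptions and not the authors' pipeline: it computes exact Shapley values by enumerating feature subsets, which is only feasible for a handful of features, it treats "missing" features as set to a baseline value, and the model, samples, and function names are all hypothetical. It does not use the SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x for a scalar-valued model."""
    n = len(x)

    def value(subset):
        # Features outside the coalition are replaced by the baseline.
        masked = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(masked)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def rank_features(model, samples, baseline):
    """Stage 1: mean absolute Shapley value per feature, plus a ranking."""
    n = len(baseline)
    totals = [0.0] * n
    for x in samples:
        for i, phi in enumerate(shapley_values(model, x, baseline)):
            totals[i] += abs(phi)
    importance = [t / len(samples) for t in totals]
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    return importance, order


def keep_top_features(order, keep):
    """Stage 2: indices of the features retained as model input."""
    return sorted(order[:keep])


# Hypothetical stand-in for a trained agent: a fixed linear scorer.
def toy_model(v):
    return 3.0 * v[0] + 0.1 * v[1] - 2.0 * v[2]


samples = [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0], [0.0, 3.0, 2.0]]
baseline = [0.0, 0.0, 0.0]
importance, order = rank_features(toy_model, samples, baseline)
kept = keep_top_features(order, keep=2)
```

For a linear model with a zero baseline, each Shapley value reduces to weight times feature value, so feature 1 (weight 0.1) ranks last and is the one removed.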

[AI-44] Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement ICASSP2025

Link: https://arxiv.org/abs/2501.13372
Authors: Jae-Sung Bae,Anastasia Kuznetsova,Dinesh Manocha,John Hershey,Trausti Kristjansson,Minje Kim
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2025 Satellite Workshop: Generative Data Augmentation for Real-World Signal Processing Applications

Abstract:This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP 2025. Collecting high-quality personalized data is challenging due to privacy concerns and technical difficulties in recording audio from the test scene. To address these issues, synthetic data generation using generative models has gained significant attention. In this challenge, participants are tasked first with building zero-shot TTS systems to augment personalized data. Subsequently, PSE systems are asked to be trained with this augmented personalized dataset. Through this challenge, we aim to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance. We also provide baseline experiments using open-source zero-shot TTS models to encourage participation and benchmark advancements. Our baseline code implementation and checkpoints are available online.

[AI-45] QuFeX: Quantum feature extraction module for hybrid quantum-classical deep neural networks

Link: https://arxiv.org/abs/2501.13165
Authors: Naman Jain,Amir Kalev
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures

Abstract:We introduce Quantum Feature Extraction (QuFeX), a novel quantum machine learning module. The proposed module enables feature extraction in a reduced-dimensional space, significantly decreasing the number of parallel evaluations required in typical quantum convolutional neural network architectures. Its design allows seamless integration into deep classical neural networks, making it particularly suitable for hybrid quantum-classical models. As an application of QuFeX, we propose Qu-Net – a hybrid architecture which integrates QuFeX at the bottleneck of a U-Net architecture. The latter is widely used for image segmentation tasks such as medical imaging and autonomous driving. Our numerical analysis indicates that the Qu-Net can achieve superior segmentation performance compared to a U-Net baseline. These results highlight the potential of QuFeX to enhance deep neural networks by leveraging hybrid computational paradigms, providing a path towards a robust framework for real-world applications requiring precise feature extraction.

[AI-46] Forecasting of Bitcoin Prices Using Hashrate Features: Wavelet and Deep Stacking Approach

Link: https://arxiv.org/abs/2501.13136
Authors: Ramin Mousa,Meysam Afrookhteh,Hooman Khaloo,Amir Ali Bengari,Gholamreza Heidary
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2402.05943 by other authors

Abstract:Digital currencies have become popular in the last decade due to their non-dependency and decentralized nature. The price of these currencies has seen a lot of fluctuations at times, which has increased the need for prediction. As the most popular of them, Bitcoin (BTC) has become a research hotspot. The main challenge and trend of digital currencies, especially BTC, is price fluctuations, which require studying the basic price prediction model. This research presents a classification and regression model based on stacked deep learning that uses a wavelet to remove noise to predict movements and prices of BTC at different time intervals. The proposed model based on the stacking technique uses models based on deep learning, especially neural networks and transformers, for one-, seven-, thirty- and ninety-day forecasting. Three feature selection models, Chi2, RFE and Embedded, were also applied to the data in the pre-processing stage. The classification model achieved 63% accuracy for predicting the next day and 64%, 67% and 82% for predicting the seventh, thirtieth and ninetieth days, respectively. For daily price forecasting, the percentage error was reduced to 0.58, while the error ranged from 2.72% to 2.85% for seven- to ninety-day horizons. These results show that the proposed model performed better than other models in the literature.
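The abstract does not say which wavelet or thresholding rule is used for noise removal. As a rough illustration of the general technique only, the sketch below assumes a single-level Haar transform with soft thresholding of the detail coefficients; the function name and the sample series are hypothetical.

```python
import math


def haar_denoise(signal, threshold):
    """One-level Haar wavelet denoising with soft thresholding.

    The signal is split into pairwise averages (approximation) and pairwise
    differences (detail); small details are shrunk toward zero before the
    signal is reconstructed.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]

    def soft(d):
        # Shrink the magnitude by the threshold, keeping the sign.
        return math.copysign(max(abs(d) - threshold, 0.0), d)

    detail = [soft(d) for d in detail]

    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / root2)
        out.append((a - d) / root2)
    return out


# Hypothetical price-like series.
series = [1.0, 3.0, 5.0, 5.0]
smoothed = haar_denoise(series, threshold=10.0)   # all details removed
unchanged = haar_denoise(series, threshold=0.0)   # perfect reconstruction
```

With a threshold larger than every detail coefficient, each pair collapses to its mean; with a zero threshold, the transform is inverted exactly up to floating-point error.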

[AI-47] Applications and Challenges of AI and Microscopy in Life Science Research: A Review

Link: https://arxiv.org/abs/2501.13135
Authors: Himanshu Buckchash,Gyanendra Kumar Verma,Dilip K. Prasad
Categories: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph); Subcellular Processes (q-bio.SC)

Abstract:The complexity of human biology and its intricate systems holds immense potential for advancing human health, disease treatment, and scientific discovery. However, traditional manual methods for studying biological interactions are often constrained by the sheer volume and complexity of biological data. Artificial Intelligence (AI), with its proven ability to analyze vast datasets, offers a transformative approach to addressing these challenges. This paper explores the intersection of AI and microscopy in life sciences, emphasizing their potential applications and associated challenges. We provide a detailed review of how various biological systems can benefit from AI, highlighting the types of data and labeling requirements unique to this domain. Particular attention is given to microscopy data, exploring the specific AI techniques required to process and interpret this information. By addressing challenges such as data heterogeneity and annotation scarcity, we outline potential solutions and emerging trends in the field. Written primarily from an AI perspective, this paper aims to serve as a valuable resource for researchers working at the intersection of AI, microscopy, and biology. It summarizes current advancements, key insights, and open problems, fostering an understanding that encourages interdisciplinary collaborations. By offering a comprehensive yet concise synthesis of the field, this paper aspires to catalyze innovation, promote cross-disciplinary engagement, and accelerate the adoption of AI in life science research.

Machine Learning

[LG-0] PBM-VFL: Vertical Federated Learning with Feature and Sample Privacy

Link: https://arxiv.org/abs/2501.13916
Authors: Linh Tran,Timothy Castiglia,Stacy Patterson,Ana Milanova
Categories: Machine Learning (cs.LG)

Abstract:We present Poisson Binomial Mechanism Vertical Federated Learning (PBM-VFL), a communication-efficient Vertical Federated Learning algorithm with Differential Privacy guarantees. PBM-VFL combines Secure Multi-Party Computation with the recently introduced Poisson Binomial Mechanism to protect parties’ private datasets during model training. We define the novel concept of feature privacy and analyze end-to-end feature and sample privacy of our algorithm. We compare sample privacy loss in VFL with privacy loss in HFL. We also provide the first theoretical characterization of the relationship between privacy budget, convergence error, and communication cost in differentially-private VFL. Finally, we empirically show that our model performs well with high levels of privacy.

[LG-1] On Learning Representations for Tabular Data Distillation

Link: https://arxiv.org/abs/2501.13905
Authors: Inwon Kang,Parikshit Ram,Yi Zhou,Horst Samulowitz,Oshani Seneviratne
Categories: Machine Learning (cs.LG)

Abstract:Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present TDColER, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, TDBench. Based on an elaborate evaluation on TDBench, resulting in 226,890 distilled datasets and 548,880 models trained on them, we demonstrate that TDColER is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.

[LG-2] Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models ICLR2025

Link: https://arxiv.org/abs/2501.13904
Authors: Linh Tran,Wei Sun,Stacy Patterson,Ana Milanova
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICLR 2025 main conference track

Abstract:Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated Prompt Learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs such as vision-language models with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting, reducing generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge by leveraging a low-rank adaptation scheme to capture generalization while maintaining a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method where we apply local differential privacy to the two low-rank components of the local prompt, and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on the model performance while balancing the tradeoff between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other benchmarks.

[LG-3] Federated Granger Causality Learning for Interdependent Clients with State Space Representation

Link: https://arxiv.org/abs/2501.13890
Authors: Ayush Mohanty,Nazal Mohamed,Paritosh Ramanan,Nagi Gebraeel
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Advanced sensors and IoT devices have improved the monitoring and control of complex industrial enterprises. They have also created an interdependent fabric of geographically distributed process operations (clients) across these enterprises. Granger causality is an effective approach to detect and quantify interdependencies by examining how one client’s state affects others over time. Understanding these interdependencies captures how localized events, such as faults and disruptions, can propagate throughout the system, possibly causing widespread operational impacts. However, the large volume and complexity of industrial data pose challenges in modeling these interdependencies. This paper develops a federated approach to learning Granger causality. We utilize a linear state space system framework that leverages low-dimensional state estimates to analyze interdependencies. This addresses bandwidth limitations and the computational burden commonly associated with centralized data processing. We propose augmenting the client models with the Granger causality information learned by the server through a Machine Learning (ML) function. We examine the co-dependence between the augmented client and server models and reformulate the framework as a standalone ML algorithm providing conditions for its sublinear and linear convergence rates. We also study the convergence of the framework to a centralized oracle model. Moreover, we include a differential privacy analysis to ensure data security while preserving causal insights. Using synthetic data, we conduct comprehensive experiments to demonstrate the robustness of our approach to perturbations in causality, the scalability to the size of communication, number of clients, and the dimensions of raw data. We also evaluate the performance on two real-world industrial control system datasets by reporting the volume of data saved by decentralization.

[LG-4] What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain

Link: https://arxiv.org/abs/2501.13887
Authors: Petr Grinberg,Ankur Kumar,Surya Koppisetti,Gaurav Bharaj
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight on the decision making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation on the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggest that the XAI results obtained from analyzing limited utterances don’t necessarily hold when evaluated on large datasets.

[LG-5] Utilizing Evolution Strategies to Train Transformers in Reinforcement Learning

Link: https://arxiv.org/abs/2501.13883
Authors: Matyáš Lorenc
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:We explore a capability of evolution strategies to train an agent with its policy based on a transformer architecture in a reinforcement learning setting. We performed experiments using OpenAI’s highly parallelizable evolution strategy to train Decision Transformer in Humanoid locomotion environment and in the environment of Atari games, testing the ability of this black-box optimization technique to train even such relatively large and complicated models (compared to those previously tested in the literature). We also proposed a method to aid the training by first pretraining the model before using the OpenAI-ES to train it further, and tested its effectiveness. The examined evolution strategy proved to be, in general, capable of achieving strong results and managed to obtain high-performing agents. Therefore, the pretraining was shown to be unnecessary; yet still, it helped us observe and formulate several further insights.
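To make the black-box optimization idea above concrete, here is a minimal sketch of an evolution strategy with mirrored (antithetic) Gaussian perturbations. It is not the authors' code or OpenAI's implementation: the toy objective `toy_return`, the plain gradient-ascent update, and all hyperparameter values are assumptions chosen for illustration, and a real setup would evaluate episode returns of a policy instead.

```python
import random


def evolution_strategy(objective, theta, sigma=0.1, lr=0.05,
                       population=50, steps=200, seed=0):
    """Maximize `objective` using only function evaluations (no backprop)."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    pairs = population // 2  # each perturbation is evaluated as +eps and -eps
    for _ in range(steps):
        grad = [0.0] * dim
        for _ in range(pairs):
            eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            plus = objective([t + sigma * e for t, e in zip(theta, eps)])
            minus = objective([t - sigma * e for t, e in zip(theta, eps)])
            for i in range(dim):
                grad[i] += (plus - minus) * eps[i]
        # Finite-difference estimate of the gradient of the smoothed objective.
        scale = lr / (2.0 * sigma * pairs)
        theta = [t + scale * g for t, g in zip(theta, grad)]
    return theta


def toy_return(params):
    """Hypothetical stand-in for an episode return; the best value is 0."""
    target = [1.0, -2.0, 0.5]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best = evolution_strategy(toy_return, [0.0, 0.0, 0.0])
print(best, toy_return(best))
```

Because every perturbation is evaluated independently, the inner loop is the part that parallelizes across workers.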

[LG-6] Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes

Link: https://arxiv.org/abs/2501.13851
Authors: Shiling Deng, Serge Belongie, Peter Ebert Christensen
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 5 figures, 13 tables, GitHub repository: this https URL

Abstract: Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels, overcoming the labor-intensive demands of manual annotation. Additionally, we propose a meme-text retrieval CLIP model (mtrCLIP) that utilizes cross-modal embedding to enhance meme analysis, significantly improving retrieval performance. Our contributions include: (1) a novel dataset for large-scale meme study, (2) a scalable meme annotation framework, and (3) a fine-tuned CLIP for meme-text retrieval, all aimed at advancing the understanding and analysis of memes at scale.

[LG-7] PhotoGAN: Generative Adversarial Neural Network Acceleration with Silicon Photonics

Link: https://arxiv.org/abs/2501.13828
Authors: Tharini Suresh, Salma Afifi, Sudeep Pasricha
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:

Abstract:Generative Adversarial Networks (GANs) are at the forefront of AI innovation, driving advancements in areas such as image synthesis, medical imaging, and data augmentation. However, the unique computational operations within GANs, such as transposed convolutions and instance normalization, introduce significant inefficiencies when executed on traditional electronic accelerators, resulting in high energy consumption and suboptimal performance. To address these challenges, we introduce PhotoGAN, the first silicon-photonic accelerator designed to handle the specialized operations of GAN models. By leveraging the inherent high throughput and energy efficiency of silicon photonics, PhotoGAN offers an innovative, reconfigurable architecture capable of accelerating transposed convolutions and other GAN-specific layers. The accelerator also incorporates a sparse computation optimization technique to reduce redundant operations, improving computational efficiency. Our experimental results demonstrate that PhotoGAN achieves at least 4.4x higher GOPS and 2.18x lower energy-per-bit (EPB) compared to state-of-the-art accelerators, including GPUs and TPUs. These findings showcase PhotoGAN as a promising solution for the next generation of GAN acceleration, providing substantial gains in both performance and energy efficiency.

[LG-8] Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction

Link: https://arxiv.org/abs/2501.13794
Authors: Zhi Sheng, Yuan Yuan, Jingtao Ding, Yong Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Accurate prediction of mobile traffic, i.e., network traffic from cellular base stations, is crucial for optimizing network performance and supporting urban development. However, the non-stationary nature of mobile traffic, driven by human activity and environmental changes, leads to both regular patterns and abrupt variations. Diffusion models excel in capturing such complex temporal dynamics due to their ability to capture the inherent uncertainties. Most existing approaches prioritize designing novel denoising networks but often neglect the critical role of noise itself, potentially leading to sub-optimal performance. In this paper, we introduce a novel perspective by emphasizing the role of noise in the denoising process. Our analysis reveals that noise fundamentally shapes mobile traffic predictions, exhibiting distinct and consistent patterns. We propose NPDiff, a framework that decomposes noise into prior and residual components, with the prior derived from data dynamics, enhancing the model’s ability to capture both regular and abrupt variations. NPDiff can seamlessly integrate with various diffusion-based prediction models, delivering predictions that are effective, efficient, and robust. Extensive experiments demonstrate that it achieves superior performance with an improvement over 30%, offering a new perspective on leveraging diffusion models in this domain.

[LG-9] Local Steps Speed Up Local GD for Heterogeneous Distributed Logistic Regression ICLR2025

Link: https://arxiv.org/abs/2501.13790
Authors: Michael Crawshaw, Blake Woodworth, Mingrui Liu
Categories: Machine Learning (cs.LG)
Comments: ICLR 2025

Abstract: We analyze two variants of Local Gradient Descent applied to distributed logistic regression with heterogeneous, separable data and show convergence at the rate $O(1/KR)$ for $K$ local steps and sufficiently large $R$ communication rounds. In contrast, all existing convergence guarantees for Local GD applied to any problem are at least $\Omega(1/R)$, meaning they fail to show the benefit of local updates. The key to our improved guarantee is showing progress on the logistic regression objective when using a large stepsize $\eta \gg 1/K$, whereas prior analysis depends on $\eta \leq 1/K$.
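For readers unfamiliar with Local GD, the following is a minimal sketch of the basic algorithm on logistic regression: each client runs K gradient steps on its own data, then the server averages the local models, and this repeats for R rounds. It is not one of the two variants analyzed in the paper, and it does not reproduce the stated rate. The two-client toy dataset, the stepsize, and the values of K and R are assumptions picked only so the example runs.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(w, data):
    """Mean of log(1 + exp(-y * <w, x>)), written to avoid overflow."""
    total = 0.0
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(data)


def logistic_grad(w, data):
    """Gradient of the mean logistic loss on one client's data."""
    grad = [0.0] * len(w)
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coeff = -y * sigmoid(-margin) / len(data)
        for i, xi in enumerate(x):
            grad[i] += coeff * xi
    return grad


def local_gd(clients, dim, rounds, local_steps, eta):
    """R rounds of: K local gradient steps per client, then model averaging."""
    w = [0.0] * dim
    for _ in range(rounds):
        local_models = []
        for data in clients:
            v = list(w)
            for _ in range(local_steps):
                g = logistic_grad(v, data)
                v = [vi - eta * gi for vi, gi in zip(v, g)]
            local_models.append(v)
        w = [sum(m[i] for m in local_models) / len(local_models)
             for i in range(dim)]
    return w


# Hypothetical heterogeneous clients whose pooled data is linearly separable.
clients = [
    [([1.0, 0.2], 1), ([-1.0, 0.1], -1)],
    [([0.3, 1.0], 1), ([-0.2, -1.0], -1)],
]
all_data = [sample for data in clients for sample in data]

initial_loss = logistic_loss([0.0, 0.0], all_data)
w = local_gd(clients, dim=2, rounds=20, local_steps=5, eta=0.5)
final_loss = logistic_loss(w, all_data)
print(initial_loss, final_loss)
```

Only the averaged model crosses the network, so communication cost scales with R while computation scales with K times R.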

[LG-10] Fast Iterative and Task-Specific Imputation with Online Learning

Link: https://arxiv.org/abs/2501.13786
Authors: Rahul Bordoloi, Clémence Réda, Saptarshi Bej
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Missing feature values are a significant hurdle for downstream machine-learning tasks such as classification and regression. However, they are pervasive in multiple real-life use cases, for instance, in drug discovery research. Moreover, imputation methods might be time-consuming and offer few guarantees on the imputation quality, especially for not-missing-at-random mechanisms. We propose an imputation approach named F3I based on the iterative improvement of a K-nearest neighbor imputation that learns the weights for each neighbor of a data point, optimizing for the most likely distribution of points over data points. This algorithm can also be jointly trained with a downstream task on the imputed values. We provide a theoretical analysis of the imputation quality by F3I for several types of missing mechanisms. We also demonstrate the performance of F3I on both synthetic data sets and real-life drug repurposing and handwritten-digit recognition data.

[LG-11] Matrix Completion in Group Testing: Bounds and Simulations

Link: https://arxiv.org/abs/2501.13780
Authors: Trung-Khang Tran, Thach V. Bui
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 15 pages, 2 figures

Abstract: The main goal of group testing is to identify a small number of defective items in a large population of items. A test on a subset of items is positive if the subset contains at least one defective item and negative otherwise. In non-adaptive design, all tests can be performed simultaneously and represented by a measurement matrix in which a row and a column represent a test and an item, respectively. An entry in row i and column j is 1 if item j belongs to test i and is 0 otherwise. Given an unknown set of defective items, the objective is to design a measurement matrix such that, by observing its corresponding outcome vector, the defective items can be recovered efficiently. The basic trait of this approach is that the measurement matrix remains unchanged throughout the course of generating the outcome vector and recovering defective items. In this paper, we study the case in which some entries in the measurement matrix are erased before the recovery phase of the defective items, yielding what we call the missing measurement matrix, and our objective is to fully recover the measurement matrix from the missing measurement matrix. In particular, we show that some specific rows with erased entries provide information aiding the recovery while others do not. Given that measurement matrices and erased entries follow the Bernoulli distribution, we show that before the erasing event happens, sampling sufficient sets of defective items and their corresponding outcome vectors can help us recover the measurement matrix from the missing measurement matrix.
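The setup described above can be sketched in a few lines: the outcome of each test is the logical OR of the defective items it contains, and sampled (defective set, outcome vector) pairs constrain erased entries. The recovery routine below uses only two elementary logical rules and is an illustrative assumption, not the paper's algorithm or its bounds. The names `outcome_vector`, `erase`, and `recover`, and the small matrix, are hypothetical.

```python
def outcome_vector(matrix, defectives):
    """Test i is positive (1) iff it contains at least one defective item."""
    return [int(any(row[j] for j in defectives)) for row in matrix]


def erase(matrix, erased):
    """Return a copy of the matrix with the given (row, col) entries set to None."""
    return [[None if (i, j) in erased else value
             for j, value in enumerate(row)]
            for i, row in enumerate(matrix)]


def recover(missing, samples):
    """Fill erased entries using samples taken before the erasure.

    Rule 1: a negative test implies every defective item is absent from it.
    Rule 2: a positive test whose known defective entries are all 0 and which
            has exactly one erased defective entry implies that entry is 1.
    """
    rec = [list(row) for row in missing]
    changed = True
    while changed:
        changed = False
        for defectives, outcomes in samples:
            for i, row in enumerate(rec):
                unknown = [j for j in defectives if row[j] is None]
                if not unknown:
                    continue
                if outcomes[i] == 0:
                    for j in unknown:
                        row[j] = 0
                    changed = True
                elif len(unknown) == 1 and all(
                        row[j] == 0 for j in defectives if row[j] is not None):
                    row[unknown[0]] = 1
                    changed = True
    return rec


# Rows are tests, columns are items.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
# Samples are collected from the intact matrix, before entries are erased.
samples = [(d, outcome_vector(matrix, d)) for d in ({0}, {2}, {1})]
missing = erase(matrix, {(0, 0), (1, 2), (2, 1)})
recovered = recover(missing, samples)
print(recovered)
```

In this toy instance three single-item samples are enough; entries that no sample constrains would simply stay `None`.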

[LG-12] Crossfire: An Elastic Defense Framework for Graph Neural Networks Under Bit Flip Attacks AAAI2025

Link: https://arxiv.org/abs/2501.13776
Authors: Lorenz Kummer, Samir Moustafa, Wilfried Gansterer, Nils Kriege
Categories: Machine Learning (cs.LG)
Comments: Accepted at AAAI 2025, DOI will be included after publication

Abstract:Bit Flip Attacks (BFAs) are a well-established class of adversarial attacks, originally developed for Convolutional Neural Networks within the computer vision domain. Most recently, these attacks have been extended to target Graph Neural Networks (GNNs), revealing significant vulnerabilities. This new development naturally raises questions about the best strategies to defend GNNs against BFAs, a challenge for which no solutions currently exist. Given the applications of GNNs in critical fields, any defense mechanism must not only maintain network performance, but also verifiably restore the network to its pre-attack state. Verifiably restoring the network to its pre-attack state also eliminates the need for costly evaluations on test data to ensure network quality. We offer first insights into the effectiveness of existing honeypot- and hashing-based defenses against BFAs adapted from the computer vision domain to GNNs, and characterize the shortcomings of these approaches. To overcome their limitations, we propose Crossfire, a hybrid approach that exploits weight sparsity and combines hashing and honeypots with bit-level correction of out-of-distribution weight elements to restore network integrity. Crossfire is retraining-free and does not require labeled data. Averaged over 2,160 experiments on six benchmark datasets, Crossfire offers a 21.8% higher probability than its competitors of reconstructing a GNN attacked by a BFA to its pre-attack state. These experiments cover up to 55 bit flips from various attacks. Moreover, it improves post-repair prediction quality by 10.85%. Computational and storage overheads are negligible compared to the inherent complexity of even the simplest GNNs.
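As a rough illustration of the hashing-based detection that the abstract uses as a baseline, the sketch below flips one bit in a byte array standing in for 8-bit quantized weights and locates the damaged block by comparing per-block digests. This is a generic integrity check under assumed names (`block_digests`, `flip_bit`, `tampered_blocks`) and an assumed block size; it does not implement Crossfire, its honeypots, or its bit-level correction.

```python
import hashlib


def block_digests(weights: bytes, block_size: int) -> list:
    """SHA-256 digest of each fixed-size block of the weight buffer."""
    return [hashlib.sha256(weights[i:i + block_size]).hexdigest()
            for i in range(0, len(weights), block_size)]


def flip_bit(weights: bytes, index: int, bit: int) -> bytes:
    """Return a copy of the buffer with one bit of one byte flipped."""
    corrupted = bytearray(weights)
    corrupted[index] ^= 1 << bit
    return bytes(corrupted)


def tampered_blocks(weights: bytes, reference: list, block_size: int) -> list:
    """Indices of blocks whose digest no longer matches the stored reference."""
    current = block_digests(weights, block_size)
    return [k for k, (now, ref) in enumerate(zip(current, reference))
            if now != ref]


# Hypothetical 8-bit weights; reference digests are stored before any attack.
weights = bytes([12, 200, 7, 99, 31, 64, 5, 18])
reference = block_digests(weights, block_size=4)

attacked = flip_bit(weights, index=5, bit=7)  # flips the top bit: 64 becomes 192
damaged = tampered_blocks(attacked, reference, block_size=4)
print(damaged)
```

A digest mismatch says which block changed but not which bit, which is why detection alone does not restore the pre-attack state.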

[LG-13] An Efficient Diffusion-based Non-Autoregressive Solver for Traveling Salesman Problem KDD2025

Link: https://arxiv.org/abs/2501.13767
Authors: Mingzhao Wang, You Zhou, Zhiguang Cao, Yubin Xiao, Xuan Wu, Wei Pang, Yuan Jiang, Hui Yang, Peng Zhao, Yuanshu Li
Categories: Machine Learning (cs.LG)
Comments: Accepted at KDD2025

Abstract: Recent advances in neural models have shown considerable promise in solving Traveling Salesman Problems (TSPs) without relying on much hand-crafted engineering. However, while non-autoregressive (NAR) approaches benefit from faster inference through parallelism, they typically deliver solutions of inferior quality compared to autoregressive ones. To enhance the solution quality while maintaining fast inference, we propose DEITSP, a diffusion model with efficient iterations tailored for TSP that operates in a NAR manner. Firstly, we introduce a one-step diffusion model that integrates the controlled discrete noise addition process with self-consistency enhancement, enabling optimal solution prediction through simultaneous denoising of multiple solutions. Secondly, we design a dual-modality graph transformer to bolster the extraction and fusion of features from node and edge modalities, while further accelerating the inference with fewer layers. Thirdly, we develop an efficient iterative strategy that alternates between adding and removing noise to improve exploration compared to previous diffusion methods. Additionally, we devise a scheduling framework to progressively refine the solution space by adjusting noise levels, facilitating a smooth search for optimal solutions. Extensive experiments on real-world and large-scale TSP instances demonstrate that DEITSP performs favorably against existing neural approaches in terms of solution quality, inference latency, and generalization ability. Our code is available at this https URL.

[LG-14] Exact Soft Analytical Side-Channel Attacks using Tractable Circuits ICML2024

Link: https://arxiv.org/abs/2501.13748
Authors: Thomas Wedenig, Rishub Nagpal, Gaëtan Cassiers, Stefan Mangard, Robert Peharz
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: ICML 2024 Conference Paper

Abstract: Detecting weaknesses in cryptographic algorithms is of utmost importance for designing secure information systems. The state-of-the-art soft analytical side-channel attack (SASCA) uses physical leakage information to make probabilistic predictions about intermediate computations and combines these “guesses” with the known algorithmic logic to compute the posterior distribution over the key. This attack is commonly performed via loopy belief propagation, which, however, lacks guarantees in terms of convergence and inference quality. In this paper, we develop a fast and exact inference method for SASCA, denoted as ExSASCA, by leveraging knowledge compilation and tractable probabilistic circuits. When attacking the Advanced Encryption Standard (AES), the most widely used encryption algorithm to date, ExSASCA outperforms SASCA by more than 31% top-1 success rate absolute. By leveraging sparse belief messages, this performance is achieved with little more computational cost than SASCA, and about 3 orders of magnitude less than exact inference via exhaustive enumeration. Even with dense belief messages, ExSASCA still uses 6 times fewer computations than exhaustive inference.

[LG-15] GPT-HTree: A Decision Tree Framework Integrating Hierarchical Clustering and Large Language Models for Explainable Classification

Link: https://arxiv.org/abs/2501.13743
Authors: Te Pei, Fuat Alican, Aaron Ontoyin Yin, Yigit Ihlamur
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces GPT-HTree, a framework combining hierarchical clustering, decision trees, and large language models (LLMs) to address this challenge. By leveraging hierarchical clustering to segment individuals based on salient features, resampling techniques to balance class distributions, and decision trees to tailor classification paths within each cluster, GPT-HTree ensures both accuracy and interpretability. LLMs enhance the framework by generating human-readable cluster descriptions, bridging quantitative analysis with actionable insights.

[LG-16] Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function

Link: https://arxiv.org/abs/2501.13734
Authors: Maria-Florina Balcan, Anh Tuan Nguyen, Dravyansh Sharma
Categories: Machine Learning (cs.LG)
Comments: 48 pages, 4 figures

Abstract: Modern machine learning algorithms, especially deep learning based techniques, typically involve careful hyperparameter tuning to achieve the best performance. Despite the surge of intense interest in practical techniques like Bayesian optimization and random search based approaches to automating this laborious and compute-intensive task, the fundamental learning theoretic complexity of tuning hyperparameters for deep neural networks is poorly understood. Inspired by this glaring gap, we initiate the formal study of hyperparameter tuning complexity in deep learning through a recently introduced data driven setting. We assume that we have a series of deep learning tasks, and we have to tune hyperparameters to do well on average over the distribution of tasks. A major difficulty is that the utility function as a function of the hyperparameter is very volatile and, furthermore, it is given implicitly by an optimization problem over the model parameters. This is unlike previous work in data driven design, where one can typically explicitly model the algorithmic behavior as a function of the hyperparameters. To tackle this challenge, we introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance as we vary the hyperparameter. Our analysis relies on subtle concepts including tools from differential/algebraic geometry and constrained optimization. This can be used to show that the learning theoretic complexity of the corresponding family of utility functions is bounded. We instantiate our results and provide sample complexity bounds for concrete applications: tuning a hyperparameter that interpolates neural activation functions and setting the kernel parameter in graph neural networks.

[LG-17] A real-time battle situation intelligent awareness system based on Meta-learning RNN

Link: https://arxiv.org/abs/2501.13704
Authors: Yuchun Li, Zihan Lin, Xize Wang, Chunyang Liu, Liaoyuan Wu, Fang Zhang
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract: In modern warfare, real-time and accurate battle situation analysis is crucial for making strategic and tactical decisions. The proposed real-time battle situation intelligent awareness system (BSIAS) is built on meta-learning analysis and stepwise RNN (recurrent neural network) modeling. The former carries out the basic processing and analysis of battlefield data, which includes multiple steps such as data cleansing, data fusion, data mining, and continuous updating; the latter optimizes the battlefield modeling by capturing the temporal dependencies of the data set step by step. Taking a simulated battle as an example, BSIAS can predict the possible movements and attack routes of either side, and can serve as an intelligent support platform for commanders to make scientific decisions during wartime. This work demonstrates the potential application of an integrated BSIAS in the field of battlefield command analysis engineering.

[LG-18] GenTL: A General Transfer Learning Model for Building Thermal Dynamics

Link: https://arxiv.org/abs/2501.13703
Authors: Fabian Raisch, Thomas Krug, Christoph Goebel, Benjamin Tischler
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in the ACM library in Jun 2025

Abstract: Transfer Learning (TL) is an emerging field in modeling building thermal dynamics. This method reduces the data required for a data-driven model of a target building by leveraging knowledge from a source building. Consequently, it enables the creation of data-efficient models that can be used for advanced control and fault detection diagnosis. A major limitation of the TL approach is its inconsistent performance across different sources. Although accurate source-building selection for a target is crucial, it remains a persistent challenge. We present GenTL, a general transfer learning model for single-family houses in Central Europe. GenTL can be efficiently fine-tuned to a large variety of target buildings. It is pretrained on a Long Short-Term Memory (LSTM) network with data from 450 different buildings. The general transfer learning model eliminates the need for source-building selection by serving as a universal source for fine-tuning. Comparative analysis with conventional single-source to single-target TL demonstrates the efficacy and reliability of the general pretraining approach. Testing GenTL on 144 target buildings for fine-tuning reveals an average prediction error (RMSE) reduction of 42.1% compared to fine-tuning single-source models.
Journal reference: The 16th ACM International Conference on Future and Sustainable Energy Systems, 2025
DOI: https://doi.org/10.48550/arXiv.2501.13703

[LG-19] HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor

Link: https://arxiv.org/abs/2501.13677
Authors: Zihui Wu, Haichang Gao, Jiacheng Luo, Zhaoxiang Liu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that fundamentally reimagines LLM safety by decoupling it from refusal prefixes through the use of humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests while maintaining engaging interactions. Our approach effectively addresses the common “over-defense” issues in existing safety mechanisms, demonstrating superior robustness against various attack vectors while preserving natural and high-quality interactions on legitimate tasks. Our findings suggest that innovations at the data level are even more fundamental than the alignment algorithm itself in achieving effective LLM safety, opening new directions for developing more resilient and user-friendly AI systems.

[LG-20] Revisiting Online Learning Approach to Inverse Linear Optimization: A Fenchel–Young Loss Perspective and Gap-Dependent Regret Analysis

Link: https://arxiv.org/abs/2501.13648
Authors: Shinsaku Sakaue, Han Bao, Taira Tsuchiya
Categories: Machine Learning (cs.LG)
Comments:

Abstract: This paper revisits the online learning approach to inverse linear optimization studied by Bärmann et al. (2017), where the goal is to infer an unknown linear objective function of an agent from sequential observations of the agent’s input-output pairs. First, we provide a simple understanding of the online learning approach through its connection to online convex optimization of Fenchel–Young losses. As a byproduct, we present an offline guarantee on the suboptimality loss, which measures how well predicted objectives explain the agent’s choices, without assuming the optimality of the agent’s choices. Second, assuming that there is a gap between optimal and suboptimal objective values in the agent’s decision problems, we obtain an upper bound independent of the time horizon $T$ on the sum of suboptimality and estimate losses, where the latter measures the quality of solutions recommended by predicted objectives. Interestingly, our gap-dependent analysis achieves a faster rate than the standard $O(\sqrt{T})$ regret bound by exploiting structures specific to inverse linear optimization, even though neither the loss functions nor their domains enjoy desirable properties, such as strong convexity.
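The two losses named in the abstract can be illustrated on a tiny decision problem. The sketch below assumes the agent maximizes a linear objective over a finite feasible set and uses the definitions common in the inverse-optimization literature: the suboptimality loss is how far the observed choice falls short of optimal under the predicted objective, and the estimate loss is how much true value is lost by following the predicted objective. The paper's exact formulation may differ; the function names and numbers are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def best_response(objective, feasible):
    """Feasible point that maximizes the linear objective."""
    return max(feasible, key=lambda x: dot(objective, x))


def suboptimality_loss(c_hat, x_observed, feasible):
    """Value gap of the observed choice when judged by the predicted objective."""
    x_hat = best_response(c_hat, feasible)
    return dot(c_hat, x_hat) - dot(c_hat, x_observed)


def estimate_loss(c_true, c_hat, x_observed, feasible):
    """True value lost by acting on the predicted objective."""
    x_hat = best_response(c_hat, feasible)
    return dot(c_true, x_observed) - dot(c_true, x_hat)


feasible = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
c_true = (1.0, 2.0)                           # unknown to the learner
x_observed = best_response(c_true, feasible)  # the agent picks (1.0, 1.0)
c_hat = (1.0, -1.0)                           # a poor prediction

sub = suboptimality_loss(c_hat, x_observed, feasible)
est = estimate_loss(c_true, c_hat, x_observed, feasible)
print(sub, est)
```

Both losses are zero when the predicted objective recommends the agent's own choice, and only the suboptimality loss can be computed without knowing the true objective.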

[LG-21] The Road to Learning Explainable Inverse Kinematic Models: Graph Neural Networks as Inductive Bias for Symbolic Regression

Link: https://arxiv.org/abs/2501.13641
Authors: Pravin Pandey (1), Julia Reuter (1), Christoph Steup (1), Sanaz Mostaghim (1 and 2) ((1) Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany, (2) Fraunhofer Institute for Transportation and Infrastructure Systems IVI, Dresden, Germany)
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract: This paper shows how a Graph Neural Network (GNN) can be used to learn an Inverse Kinematics (IK) based on an automatically generated dataset. The generated Inverse Kinematics is generalized to a family of manipulators with the same Degree of Freedom (DOF), but varying link length configurations. The results indicate a position error of less than 1.0 cm for 3 DOF and 4.5 cm for 5 DOF, and orientation error of 2° for 3 DOF and 8.2° for 6 DOF, which allows the deployment to certain real-world problems. However, out-of-domain errors and lack of extrapolation can be observed in the resulting GNN. An extensive analysis of these errors indicates potential for enhancement in the future. Consequently, the generated GNNs are tailored to be used in future work as an inductive bias to generate analytical equations through symbolic regression.

[LG-22] Quantification via Gaussian Latent Space Representations

Link: https://arxiv.org/abs/2501.13638
Authors: Olaya Pérez-Mon, Juan José del Coz, Pablo González
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Quantification, or prevalence estimation, is the task of predicting the prevalence of each class within an unknown bag of examples. Most existing quantification methods in the literature rely on prior probability shift assumptions to create a quantification model that uses the predictions of an underlying classifier to make optimal prevalence estimates. In this work, we present an end-to-end neural network that uses Gaussian distributions in latent spaces to obtain invariant representations of bags of examples. This approach addresses the quantification problem using deep learning, enabling the optimization of specific loss functions relevant to the problem and avoiding the need for an intermediate classifier, tackling the quantification problem as a direct optimization problem. Our method achieves state-of-the-art results, both against traditional quantification methods and other deep learning approaches for quantification. The code needed to reproduce all our experiments is publicly available at this https URL.

[LG-23] SMILES has to go : Representation of Molecules via Algebraic Data Types

Link: https://arxiv.org/abs/2501.13633
Authors: Oliver Goldstein
Categories: Programming Languages (cs.PL); Machine Learning (cs.LG)
Comments: 3 Figures

Abstract: This paper proposes a novel representation of molecules through Algebraic Data Types (ADTs). The representation has useful properties primarily by including type information. The representation uses the Dietz representation enabling representation of organometallics with multi-centre, multi-atom bonding and delocalised electrons, resonant structures and co-ordinate data of atoms. Furthermore, this representation goes further than any other in the literature, providing a natural data structure to represent shells, subshells and orbitals. Perks of the representation include its natural inclusion in reaction descriptions and the ability to make molecules instances of algebraic groups. The representation is further motivated as providing guarantees for those wishing to do Bayesian machine learning (probabilistic programming) over molecular structures. A criticism of competing and commonly used representations such as SMILES and SELFIES is provided and solutions are proposed to the weaknesses of these along with an open source library, written in Haskell. An example of integrating the library with LazyPPL, a lazy probabilistic programming library written in Haskell, is provided, conceptually justifying the efficiency of the representation over string based representations and recent work such as SELFIES. This library distinguishes between the data and the type of data, enabling a separation of concerns between interface and object. I solve three problems associated with the future of SELFIES, molecular programming language, 3D information, syntactic invalidity and Dietz representation.

[LG-24] FedPref: Federated Learning Across Heterogeneous Multi-objective Preferences

Link: https://arxiv.org/abs/2501.13604
Authors: Maria Hartmann, Grégoire Danoy, Pascal Bouvry
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted to ACM ToMPECS journal

Abstract: Federated Learning (FL) is a distributed machine learning strategy, developed for settings where training data is owned by distributed devices and cannot be shared. FL circumvents this constraint by carrying out model training in a distributed manner. The parameters of these local models are shared intermittently among participants and aggregated to enhance model accuracy. This strategy has been rapidly adopted by the industry in efforts to overcome privacy and resource constraints in model training. However, the application of FL to real-world settings brings additional challenges associated with heterogeneity between participants. Research into mitigating these difficulties in FL has largely focused on only two types of heterogeneity: the unbalanced distribution of training data, and differences in client resources. Yet more types of heterogeneity are becoming relevant as the capability of FL expands to cover more complex problems, from the tuning of LLMs to enabling machine learning on edge devices. In this work, we discuss a novel type of heterogeneity that is likely to become increasingly relevant in future applications: preference heterogeneity, which emerges when clients learn under multiple objectives, with different importance assigned to each objective on different clients. We discuss the implications of this type of heterogeneity and propose FedPref, a first algorithm designed to facilitate personalised FL in this setting. We demonstrate the effectiveness of the algorithm across different problems, preference distributions and model architectures. In addition, we introduce a new analytical point of view, based on multi-objective metrics, for evaluating the performance of FL algorithms in this setting beyond the traditional client-focused metrics. We perform a second experimental analysis based on this view, and show that FedPref outperforms compared algorithms.

[LG-25] A Transformer-based Autoregressive Decoder Architecture for Hierarchical Text Classification ECAI

Link: https://arxiv.org/abs/2501.13598
Authors: Younes Yousef, Lukas Galke, Ansgar Scherp
Categories: Machine Learning (cs.LG)
Comments: 7 pages + 1 for references. 2 figures. ECAI conference

Abstract:Recent approaches in hierarchical text classification (HTC) rely on the capabilities of a pre-trained transformer model and exploit the label semantics and a graph encoder for the label hierarchy. In this paper, we introduce an effective hierarchical text classifier RADAr (Transformer-based Autoregressive Decoder Architecture) that is based only on an off-the-shelf RoBERTa transformer to process the input and a custom autoregressive decoder with two decoder layers for generating the classification output. Thus, unlike existing approaches for HTC, the encoder of RADAr has no explicit encoding of the label hierarchy and the decoder solely relies on the label sequences of the samples observed during training. We demonstrate on three benchmark datasets that RADAr achieves results competitive to the state of the art with less training and inference time. Our model consistently performs better when organizing the label sequences from children to parents versus the inverse, as done in existing HTC approaches. Our experiments show that neither the label semantics nor an explicit graph encoder for the hierarchy is needed. This has strong practical implications for HTC as the architecture has fewer requirements and provides a speed-up by a factor of 2 at inference time. Moreover, training a separate decoder from scratch in conjunction with fine-tuning the encoder allows future researchers and practitioners to exchange the encoder part as new models arise. The source code is available at this https URL.

[LG-26] A Comprehensive Survey on Spectral Clustering with Graph Structure Learning

Link: https://arxiv.org/abs/2501.13597
Authors: Kamal Berahmand, Farid Saberi-Movahed, Razieh Sheikhpour, Yuefeng Li, Mahdi Jalili
Subjects: Machine Learning (cs.LG)

Abstract:Spectral clustering is a powerful technique for clustering high-dimensional data, utilizing graph-based representations to detect complex, non-linear structures and non-convex clusters. The construction of a similarity graph is essential for ensuring accurate and effective clustering, making graph structure learning (GSL) central for enhancing spectral clustering performance in response to the growing demand for scalable solutions. Despite advancements in GSL, there is a lack of comprehensive surveys specifically addressing its role within spectral clustering. To bridge this gap, this survey presents a comprehensive review of spectral clustering methods, emphasizing the critical role of GSL. We explore various graph construction techniques, including pairwise, anchor, and hypergraph-based methods, in both fixed and adaptive settings. Additionally, we categorize spectral clustering approaches into single-view and multi-view frameworks, examining their applications within one-step and two-step clustering processes. We also discuss multi-view information fusion techniques and their impact on clustering data. By addressing current challenges and proposing future research directions, this survey provides valuable insights for advancing spectral clustering methodologies and highlights the pivotal role of GSL in tackling large-scale and high-dimensional data clustering tasks.

[LG-27] WFCRL: A Multi-Agent Reinforcement Learning Benchmark for Wind Farm Control

Link: https://arxiv.org/abs/2501.13592
Authors: Claire Bizon Monroc, Ana Bušić, Donatien Dubuc, Jiamin Zhu
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Abstract:The wind farm control problem is challenging, since conventional model-based control strategies require tractable models of complex aerodynamical interactions between the turbines and suffer from the curse of dimensionality when the number of turbines increases. Recently, model-free and multi-agent reinforcement learning approaches have been used to address this challenge. In this article, we introduce WFCRL (Wind Farm Control with Reinforcement Learning), the first open suite of multi-agent reinforcement learning environments for the wind farm control problem. WFCRL frames a cooperative Multi-Agent Reinforcement Learning (MARL) problem: each turbine is an agent and can learn to adjust its yaw, pitch or torque to maximize the common objective (e.g. the total power production of the farm). WFCRL also offers turbine load observations that make it possible to optimize the farm performance while limiting turbine structural damage. Interfaces with two state-of-the-art farm simulators are implemented in WFCRL: a static simulator (FLORIS) and a dynamic simulator (FAST.Farm). For each simulator, 10 wind layouts are provided, including 5 real wind farms. Two state-of-the-art online MARL algorithms are implemented to illustrate the scaling challenges. As learning online on FAST.Farm is highly time-consuming, WFCRL offers the possibility of designing transfer learning strategies from FLORIS to FAST.Farm.

[LG-28] Towards Robust Incremental Learning under Ambiguous Supervision

Link: https://arxiv.org/abs/2501.13584
Authors: Rui Wang, Mingxuan Xia, Chang Yao, Lei Feng, Junbo Zhao, Gang Chen, Haobo Wang
Subjects: Machine Learning (cs.LG)

Abstract:Traditional Incremental Learning (IL) targets to handle sequential fully-supervised learning problems where novel classes emerge from time to time. However, due to inherent annotation uncertainty and ambiguity, collecting high-quality annotated data in a dynamic learning system can be extremely expensive. To mitigate this problem, we propose a novel weakly-supervised learning paradigm called Incremental Partial Label Learning (IPLL), where the sequentially arrived data relate to a set of candidate labels rather than the ground truth. Technically, we develop the Prototype-Guided Disambiguation and Replay Algorithm (PGDR) which leverages the class prototypes as a proxy to mitigate two intertwined challenges in IPLL, i.e., label ambiguity and catastrophic forgetting. To handle the former, PGDR encapsulates a momentum-based pseudo-labeling algorithm along with prototype-guided initialization, resulting in a balanced perception of classes. To alleviate forgetting, we develop a memory replay technique that collects well-disambiguated samples while maintaining representativeness and diversity. By jointly distilling knowledge from curated memory data, our framework exhibits a great disambiguation ability for samples of new tasks and achieves less forgetting of knowledge. Extensive experiments demonstrate that PGDR achieves superior

[LG-29] Minimizing Queue Length Regret for Arbitrarily Varying Channels

Link: https://arxiv.org/abs/2501.13551
Authors: G Krishnakumar, Abhishek Sinha
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:We consider an online channel scheduling problem for a single transmitter-receiver pair equipped with $N$ arbitrarily varying wireless channels. The transmission rates of the channels might be non-stationary and could be controlled by an oblivious adversary. At every slot, incoming data arrives at an infinite-capacity data queue located at the transmitter. A scheduler, which is oblivious to the current channel rates, selects one of the $N$ channels for transmission. At the end of the slot, the scheduler only gets to know the transmission rate of the selected channel. The objective is to minimize the queue length regret, defined as the difference between the queue length at some time $T$ achieved by an online policy and the queue length obtained by always transmitting over the single best channel in hindsight. We propose a weakly adaptive Multi-Armed Bandit (MAB) algorithm for minimizing the queue length regret in this setup. Unlike previous works, we do not make any stability assumptions about the queue or the arrival process. Hence, our result holds even when the queueing process is unstable. Our main observation is that the queue length regret can be upper bounded by the regret of a MAB policy that competes against the best channel in hindsight uniformly over all sub-intervals of $[T]$. As a technical contribution of independent interest, we then propose a weakly adaptive adversarial MAB policy which achieves $\tilde{O}(\sqrt{N} T^{\frac{3}{4}})$ regret with high probability, implying the same bound for queue length regret.
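
The regret notion defined in this abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it only evaluates the regret of a given sequence of channel choices. It assumes the queue follows the recursion Q(t+1) = max(Q(t) + a(t) - s(t), 0), which the abstract does not spell out, and it takes "best channel in hindsight" to mean the single channel that yields the shortest final queue. The names `queue_length` and `queue_length_regret` are hypothetical.

```python
from typing import Sequence


def queue_length(arrivals: Sequence[float], served: Sequence[float]) -> float:
    """Final queue length under the assumed recursion Q <- max(Q + a - s, 0)."""
    q = 0.0
    for a, s in zip(arrivals, served):
        q = max(q + a - s, 0.0)
    return q


def queue_length_regret(
    arrivals: Sequence[float],
    rates: Sequence[Sequence[float]],
    choices: Sequence[int],
) -> float:
    """Queue length of the policy minus that of the best fixed channel.

    rates[t][i] is the rate of channel i in slot t;
    choices[t] is the channel the online policy picked in slot t.
    """
    policy_served = [rates[t][c] for t, c in enumerate(choices)]
    q_policy = queue_length(arrivals, policy_served)

    # Assumption: the hindsight benchmark is the fixed channel with the
    # shortest final queue.
    n_channels = len(rates[0])
    q_best = min(
        queue_length(arrivals, [slot[i] for slot in rates])
        for i in range(n_channels)
    )
    return q_policy - q_best
```

A bandit policy would produce `choices` online, seeing only `rates[t][choices[t]]` after each slot; the function above scores such a run after the fact.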

[LG-30] Communication-Efficient Stochastic Distributed Learning

Link: https://arxiv.org/abs/2501.13516
Authors: Xiaoxing Ren, Nicola Bastianello, Karl H. Johansson, Thomas Parisini
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)

Abstract:We address distributed learning problems, both nonconvex and convex, over undirected networks. In particular, we design a novel algorithm based on the distributed Alternating Direction Method of Multipliers (ADMM) to address the challenges of high communication costs, and large datasets. Our design tackles these challenges i) by enabling the agents to perform multiple local training steps between each round of communications; and ii) by allowing the agents to employ stochastic gradients while carrying out local computations. We show that the proposed algorithm converges to a neighborhood of a stationary point, for nonconvex problems, and of an optimal point, for convex problems. We also propose a variant of the algorithm to incorporate variance reduction thus achieving exact convergence. We show that the resulting algorithm indeed converges to a stationary (or optimal) point, and moreover that local training accelerates convergence. We thoroughly compare the proposed algorithms with the state of the art, both theoretically and through numerical results.

[LG-31] Spurious Forgetting in Continual Learning of Language Models ICLR2025

Link: https://arxiv.org/abs/2501.13453
Authors: Junhao Zheng, Xidi Cai, Shengjie Qiu, Qianli Ma
Subjects: Machine Learning (cs.LG)
Comments: ICLR2025

Abstract:Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning: despite extensive training, models experience significant performance declines, raising questions about task alignment and underlying knowledge retention. This study first explores the concept of “spurious forgetting”, proposing that such performance drops often reflect a decline in task alignment rather than true knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fixes the bottom layers of the model, leading to substantial improvements in four continual learning scenarios. Our findings underscore the critical distinction between task alignment and knowledge retention, paving the way for more effective strategies in continual learning.

[LG-32] Deep Modularity Networks with Diversity–Preserving Regularization

Link: https://arxiv.org/abs/2501.13451
Authors: Yasmin Salehi, Dennis Giannacopoulos
Subjects: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Graph clustering plays a crucial role in graph representation learning but often faces challenges in achieving feature-space diversity. While Deep Modularity Networks (DMoN) leverage modularity maximization and collapse regularization to ensure structural separation, they do not explicitly encourage diversity in the feature space among clusters. We address this limitation by proposing Deep Modularity Networks with Diversity-Preserving Regularization (DMoN-DPR), which introduces three novel regularization terms: distance-based for inter-cluster separation, variance-based for intra-cluster diversity, and entropy-based for balanced assignments. Our method enhances clustering performance on benchmark datasets, namely Cora, CiteSeer, PubMed, Coauthor CS, and Coauthor Physics, achieving significant improvements in Normalized Mutual Information (NMI), and F1 scores. These results demonstrate the effectiveness of incorporating diversity-preserving regularizations in creating meaningful and interpretable clusters, especially in feature-rich datasets.

[LG-33] Billion-scale Similarity Search Using a Hybrid Indexing Approach with Advanced Filtering

Link: https://arxiv.org/abs/2501.13442
Authors: Simeon Emanuilov, Aleksandar Dimov
Subjects: Information Retrieval (cs.IR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 14 pages, 3 figures, published in Cybernetics and Information Technologies

Abstract:This paper presents a novel approach for similarity search with complex filtering capabilities on billion-scale datasets, optimized for CPU inference. Our method extends the classical IVF-Flat index structure to integrate multi-dimensional filters. The proposed algorithm combines dense embeddings with discrete filtering attributes, enabling fast retrieval in high-dimensional spaces. Designed specifically for CPU-based systems, our disk-based approach offers a cost-effective solution for large-scale similarity search. We demonstrate the effectiveness of our method through a case study, showcasing its potential for various practical uses.
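
As a rough illustration of the general idea of pairing an IVF-Flat index with attribute filters, here is a toy in-memory sketch. It is not the paper's method: the abstract does not say how filters are integrated, so the sketch simply stores one discrete attribute next to each vector and applies an equality filter before the exhaustive scan of the probed lists. The centroids are passed in rather than learned, and all names (`build_index`, `search`, `required_attr`) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(
    vectors: Sequence[Vector], attrs: Sequence[str], centroids: Sequence[Vector]
) -> Dict[int, List[Tuple[int, Vector, str]]]:
    """Assign each vector (with its attribute) to the list of its nearest centroid."""
    lists: Dict[int, List[Tuple[int, Vector, str]]] = {
        i: [] for i in range(len(centroids))
    }
    for vid, (vec, attr) in enumerate(zip(vectors, attrs)):
        cid = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[cid].append((vid, vec, attr))
    return lists


def search(
    lists: Dict[int, List[Tuple[int, Vector, str]]],
    centroids: Sequence[Vector],
    query: Vector,
    required_attr: str,
    k: int = 1,
    nprobe: int = 1,
) -> List[int]:
    """Probe the nprobe closest lists, filter by attribute, then scan exhaustively."""
    probe = sorted(
        range(len(centroids)), key=lambda c: sq_dist(query, centroids[c])
    )[:nprobe]
    candidates = [
        (sq_dist(query, vec), vid)
        for c in probe
        for vid, vec, attr in lists[c]
        if attr == required_attr  # filter before computing distances
    ]
    return [vid for _, vid in sorted(candidates)[:k]]
```

Filtering before the distance computation keeps the scan cheap when the filter is selective; a disk-based system such as the one described would also need to decide how inverted lists and attributes are laid out in storage, which this sketch ignores.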

[LG-34] Wasserstein-regularized Conformal Prediction under General Distribution Shift

Link: https://arxiv.org/abs/2501.13430
Authors: Rui Xu, Chao Chen, Yue Sun, Parvathinathan Venkitasubramaniam, Sihong Xie
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Conformal prediction yields a prediction set with guaranteed $1-\alpha$ coverage of the true target under the i.i.d. assumption, which may not hold, leading to a gap between $1-\alpha$ and the actual coverage. Prior studies bound the gap using total variation distance, which cannot identify the gap changes under distribution shift at a given $\alpha$. Besides, existing methods are mostly limited to covariate shift, while general joint distribution shifts are more common in practice but less studied. In response, we first propose a Wasserstein distance-based upper bound of the coverage gap and analyze the bound using probability measure pushforwards between the shifted joint data and conformal score distributions, enabling a separation of the effect of covariate and concept shifts over the coverage gap. We exploit the separation to design an algorithm based on importance weighting and regularized representation learning (WR-CP) to reduce the Wasserstein bound with a finite-sample error bound. WR-CP achieves a controllable balance between conformal prediction accuracy and efficiency. Experiments on six datasets prove that WR-CP can reduce coverage gaps to 3.1% across different confidence levels and outputs prediction sets 38% smaller than the worst-case approach on average.
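
For readers unfamiliar with the baseline this abstract builds on, the following is a minimal sketch of standard split conformal prediction under the i.i.d. assumption. It is not WR-CP and does nothing about distribution shift. It assumes absolute residuals as conformity scores and a regression setting; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    With exchangeable data, intervals built from this threshold cover the
    true target with probability at least 1 - alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set is unbounded.
        return math.inf
    return sorted(scores)[k - 1]


def prediction_interval(point_prediction: float, q: float) -> Tuple[float, float]:
    """Symmetric interval around a point prediction, using |y - y_hat| scores."""
    return (point_prediction - q, point_prediction + q)
```

Here `scores` would be the absolute residuals of a fitted model on a held-out calibration set. The coverage gap discussed in the abstract arises when the test data are not exchangeable with that calibration set.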

[LG-35] Perceived Fairness of the Machine Learning Development Process: Concept Scale Development

Link: https://arxiv.org/abs/2501.13421
Authors: Anoop Mishra, Deepak Khazanchi
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2304.03745

Abstract:In machine learning (ML) applications, unfairness is triggered due to bias in the data, the data curation process, erroneous assumptions, and implicit bias rendered during the development process. It is also well-accepted by researchers that fairness in ML application development is highly subjective, with a lack of clarity of what it means from an ML development and implementation perspective. Thus, in this research, we investigate and formalize the notion of the perceived fairness of ML development from a sociotechnical lens. Our goal in this research is to understand the characteristics of perceived fairness in ML applications. We address this research goal using a three-pronged strategy: 1) conducting virtual focus groups with ML developers, 2) reviewing existing literature on fairness in ML, and 3) incorporating aspects of justice theory relating to procedural and distributive justice. Based on our theoretical exposition, we propose operational attributes of perceived fairness to be transparency, accountability, and representativeness. These are described in terms of multiple concepts that comprise each dimension of perceived fairness. We use this operationalization to empirically validate the notion of perceived fairness of machine learning (ML) applications from both the ML practitioners' and users' perspectives. The multidimensional framework for perceived fairness offers a comprehensive understanding of perceived fairness, which can guide the creation of fair ML systems with positive implications for society and businesses.

[LG-36] Time Series Embedding Methods for Classification Tasks: A Review

Link: https://arxiv.org/abs/2501.13392
Authors: Yasamin Ghahremani, Vangelis Metsis
Subjects: Machine Learning (cs.LG)

Abstract:Time series analysis has become crucial in various fields, from engineering and finance to healthcare and social sciences. In this paper, we present a comprehensive review and evaluation of time series embedding methods for effective representations in machine learning and deep learning models. We introduce a taxonomy of embedding techniques, categorizing them based on their theoretical foundations and application contexts. Unlike previous surveys, our work provides a quantitative evaluation of representative methods from each category by assessing their performance on downstream classification tasks across diverse real-world datasets. Our experimental results demonstrate that the performance of embedding methods varies significantly depending on the dataset and classification algorithm used, highlighting the importance of careful model selection and extensive experimentation for specific applications, including engineering systems. To facilitate further research and practical applications, we provide an open-source code repository implementing these embedding methods. This study contributes to the field by offering a systematic comparison of time series embedding techniques, guiding practitioners in selecting appropriate methods for their specific applications, and providing a foundation for future advancements in time series analysis.

[LG-37] Beyond Task Diversity: Provable Representation Transfer for Sequential Multi-Task Linear Bandits NEURIPS2024

Link: https://arxiv.org/abs/2501.13390
Authors: Thang Duong, Zhi Wang, Chicheng Zhang
Subjects: Machine Learning (cs.LG)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:We study lifelong learning in linear bandits, where a learner interacts with a sequence of linear bandit tasks whose parameters lie in an $m$-dimensional subspace of $\mathbb{R}^d$, thereby sharing a low-rank representation. Current literature typically assumes that the tasks are diverse, i.e., their parameters uniformly span the $m$-dimensional subspace. This assumption allows the low-rank representation to be learned before all tasks are revealed, which can be unrealistic in real-world applications. In this work, we present the first nontrivial result for sequential multi-task linear bandits without the task diversity assumption. We develop an algorithm that efficiently learns and transfers low-rank representations. When facing $N$ tasks, each played over $\tau$ rounds, our algorithm achieves a regret guarantee of $\tilde{O}\big(Nm\sqrt{\tau} + N^{\frac{2}{3}} \tau^{\frac{2}{3}} d m^{\frac{1}{3}} + Nd^2 + \tau m d\big)$ under the ellipsoid action set assumption. This result can significantly improve upon the baseline of $\tilde{O}\left(Nd\sqrt{\tau}\right)$ that does not leverage the low-rank structure when the number of tasks $N$ is sufficiently large and $m \ll d$. We also demonstrate empirically on synthetic data that our algorithm outperforms baseline algorithms, which rely on the task diversity assumption.

[LG-38] Fast and Provable Tensor-Train Format Tensor Completion via Precondtioned Riemannian Gradient Descent

Link: https://arxiv.org/abs/2501.13385
Authors: Fengmiao Bian, Jian-Feng Cai, Xiaoqun Zhang, Yuanwei Zhang
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)

Abstract:Low-rank tensor completion aims to recover a tensor from partially observed entries, and it is widely applicable in fields such as quantum computing and image processing. Due to the significant advantages of the tensor train (TT) format in handling structured high-order tensors, this paper investigates the low-rank tensor completion problem based on the TT-format. We propose a preconditioned Riemannian gradient descent algorithm (PRGD) to solve low TT-rank tensor completion and establish its linear convergence. Experimental results on both simulated and real datasets demonstrate the effectiveness of the PRGD algorithm. On the simulated dataset, the PRGD algorithm reduced the computation time by two orders of magnitude compared to existing classical algorithms. In practical applications such as hyperspectral image completion and quantum state tomography, the PRGD algorithm significantly reduced the number of iterations, thereby substantially reducing the computational time.

[LG-39] Bridging The Multi-Modality Gaps of Audio Visual and Linguistic for Speech Enhancement

Link: https://arxiv.org/abs/2501.13375
Authors: Meng-Ping Lin, Jen-Cheng Hou, Chia-Wei Chen, Shao-Yi Chien, Jun-Cheng Chen, Xugang Lu, Yu Tsao
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:Speech Enhancement (SE) aims to improve the quality of noisy speech. It has been shown that additional visual cues can further improve performance. Given that speech communication involves audio, visual, and linguistic modalities, it is natural to expect another performance boost by incorporating linguistic information. However, bridging the modality gaps to efficiently incorporate linguistic information, along with audio and visual modalities during knowledge transfer, is a challenging task. In this paper, we propose a novel multi-modality learning framework for SE. In the model framework, a state-of-the-art diffusion Model backbone is utilized for Audio-Visual Speech Enhancement (AVSE) modeling where both audio and visual information are directly captured by microphones and video cameras. Based on this AVSE, the linguistic modality employs a PLM to transfer linguistic knowledge to the visual acoustic modality through a process termed Cross-Modal Knowledge Transfer (CMKT) during AVSE model training. After the model is trained, it is supposed that linguistic knowledge is encoded in the feature processing of the AVSE model by the CMKT, and the PLM will not be involved during inference stage. We carry out SE experiments to evaluate the proposed model framework. Experimental results demonstrate that our proposed AVSE system significantly enhances speech quality and reduces generative artifacts, such as phonetic confusion compared to the state-of-the-art. Moreover, our visualization results demonstrate that our Cross-Modal Knowledge Transfer method further improves the generated speech quality of our AVSE system. These findings not only suggest that Diffusion Model-based techniques hold promise for advancing the state-of-the-art in AVSE but also justify the effectiveness of incorporating linguistic information to improve the performance of Diffusion-based AVSE systems.

[LG-40] Learning to Bid in Non-Stationary Repeated First-Price Auctions

Link: https://arxiv.org/abs/2501.13358
Authors: Zihao Hu, Xiaoyu Fan, Yuan Yao, Jiheng Zhang, Zhengyuan Zhou
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Machine Learning (stat.ML)

Abstract:First-price auctions have recently gained significant traction in digital advertising markets, exemplified by Google’s transition from second-price to first-price auctions. Unlike in second-price auctions, where bidding one’s private valuation is a dominant strategy, determining an optimal bidding strategy in first-price auctions is more complex. From a learning perspective, the learner (a specific bidder) can interact with the environment (other bidders) sequentially to infer their behaviors. Existing research often assumes specific environmental conditions and benchmarks performance against the best fixed policy (static benchmark). While this approach ensures strong learning guarantees, the static benchmark can deviate significantly from the optimal strategy in environments with even mild non-stationarity. To address such scenarios, a dynamic benchmark, which represents the sum of the best possible rewards at each time step, offers a more suitable objective. However, achieving no-regret learning with respect to the dynamic benchmark requires additional constraints. By inspecting reward functions in online first-price auctions, we introduce two metrics to quantify the regularity of the bidding sequence, which serve as measures of non-stationarity. We provide a minimax-optimal characterization of the dynamic regret when either of these metrics is sub-linear in the time horizon.
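
The difference between the static and dynamic benchmarks mentioned in this abstract can be made concrete with a small sketch. It assumes a first-price reward of (value - bid) when the bid is at least the highest competing bid and zero otherwise (ties go to the learner), and it restricts bids to a finite grid. Both are simplifying assumptions for illustration and not necessarily the paper's exact model; the function names are hypothetical.

```python
from typing import Sequence


def reward(value: float, bid: float, highest_other_bid: float) -> float:
    """First-price payoff: win and pay your own bid, or get nothing."""
    return (value - bid) if bid >= highest_other_bid else 0.0


def static_benchmark(
    values: Sequence[float], others: Sequence[float], bids: Sequence[float]
) -> float:
    """Total reward of the best single fixed bid in hindsight."""
    return max(
        sum(reward(v, b, m) for v, m in zip(values, others)) for b in bids
    )


def dynamic_benchmark(
    values: Sequence[float], others: Sequence[float], bids: Sequence[float]
) -> float:
    """Sum over rounds of the best achievable reward in each round."""
    return sum(
        max(reward(v, b, m) for b in bids) for v, m in zip(values, others)
    )
```

When the competing bids alternate between low and high values, no single fixed bid is right in every round, so the dynamic benchmark exceeds the static one; dynamic regret is measured against the larger quantity.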

[LG-41] DoMINO: A Decomposable Multi-scale Iterative Neural Operator for Modeling Large Scale Engineering Simulations

Link: https://arxiv.org/abs/2501.13350
Authors: Rishikesh Ranade, Mohammad Amin Nabian, Kaustubh Tangsali, Alexey Kamenev, Oliver Hennigh, Ram Cherukuri, Sanjay Choudhry
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Numerical simulations play a critical role in the design and development of engineering products and processes. Traditional computational methods, such as CFD, can provide accurate predictions but are computationally expensive, particularly for complex geometries. Several machine learning (ML) models have been proposed in the literature to significantly reduce computation time while maintaining acceptable accuracy. However, ML models often face limitations in terms of accuracy and scalability and depend on significant mesh downsampling, which can negatively affect prediction accuracy and generalization. In this work, we propose a novel ML model architecture, DoMINO (Decomposable Multi-scale Iterative Neural Operator), developed in NVIDIA Modulus to address the various challenges of machine-learning-based surrogate modeling of engineering simulations. DoMINO is a point cloud-based ML model that uses local geometric information to predict flow fields on discrete points. The DoMINO model is validated for the automotive aerodynamics use case using the DrivAerML dataset. Through our experiments we demonstrate the scalability, performance, accuracy and generalization of our model to both in-distribution and out-of-distribution testing samples. Moreover, the results are analyzed using a range of engineering-specific metrics important for validating numerical simulations.

[LG-42] Co-Learning Bayesian Optimization

Link: https://arxiv.org/abs/2501.13332
Authors: Zhendong Guo, Yew-Soon Ong, Tiantian He, Haitao Liu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bayesian optimization (BO) is well known to be sample-efficient for solving black-box problems. However, BO algorithms can sometimes get stuck in suboptimal solutions even with plenty of samples. Intrinsically, this suboptimality can be attributed to the poor surrogate accuracy of the trained Gaussian process (GP), particularly in the regions where the optimal solutions are located. Hence, we propose to build multiple GP models instead of a single GP surrogate so that they complement each other and thus resolve the suboptimality problem of BO. Nevertheless, according to the bias-variance tradeoff equation, the individual prediction errors can increase when increasing the diversity of models, which may lead to even worse overall surrogate accuracy. On the other hand, based on the theory of Rademacher complexity, it has been proved that exploiting the agreement of models on unlabeled information can help to reduce the complexity of the hypothesis space, and therefore achieve the required surrogate accuracy with fewer samples. This value of model agreement has been extensively demonstrated for co-training style algorithms, which boost model accuracy with a small portion of samples. Inspired by the above, we propose a novel BO algorithm labeled as co-learning BO (CLBO), which exploits both model diversity and agreement on unlabeled information to improve the overall surrogate accuracy with limited samples, and therefore achieves more efficient global optimization. Through tests on five numerical toy problems and three engineering benchmarks, the effectiveness of the proposed CLBO has been well demonstrated.

[LG-43] Qrazor: Reliable and effortless 4-bit llm quantization by significant data razoring

Link: https://arxiv.org/abs/2501.13331
Authors: Dongyoung Lee, Seungkyu Choi, Ik Joon Chang
Subjects: Machine Learning (cs.LG)
Comments: 19 pages

Abstract:Large-scale language models (LLMs) have demonstrated outstanding performance in language processing tasks, yet their deployment is often hindered by high memory demands and computational complexity. Although low-bit quantization techniques, such as 4-bit quantization, present a potential solution, they frequently lead to significant accuracy degradation or require substantial effort for such aggressive quantization approaches. To overcome these challenges, we introduce QRazor, a reliable and effortless quantization scheme designed to enable 4-bit quantization for weights, activations, and KV cache in transformer-based LLMs. The scheme involves two main stages: quantization and compression. During the quantization stage, weights, activations, and KV cache values are quantized with wider 8 or 16-bit integers as a basis to achieve nearly identical accuracy to the original full-precision LLM models, using the absolute max scaling. Subsequently, all data are compressed to 4-bit using our proposed significant data razoring (SDR) technique, which retains only the four most salient bits while discarding the others. Furthermore, we present an integer-based arithmetic unit dedicated to QRazor, enabling direct low-precision arithmetic operations without decompressing the SDR data. Despite the reduced quantization effort, QRazor achieves LLM accuracies better or comparable to state-of-the-art 4-bit methods. By also validating the hardware efficiency, our decompression-free arithmetic unit achieves 61.2% and 57.8% reduction in area and power consumption, respectively.
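
The two stages described in this abstract (absolute max scaling to a wider integer, then keeping only the four most salient bits) can be sketched roughly as follows. This is an illustrative guess at the mechanics, not the paper's SDR procedure: it finds the leading set bit per value and truncates without rounding, whereas the actual scheme may operate on groups of values and handle rounding differently. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def absmax_quantize(values: Sequence[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric absolute-max scaling to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    return [round(v / scale) for v in values], scale


def keep_top_bits(q: int, keep: int = 4) -> int:
    """Keep the `keep` most significant bits of |q|, starting at its leading one.

    Lower-order bits are zeroed (truncation, an assumption of this sketch).
    """
    magnitude = abs(q)
    shift = max(magnitude.bit_length() - keep, 0)
    truncated = (magnitude >> shift) << shift
    return truncated if q >= 0 else -truncated
```

In a real 4-bit format only the four kept bits and the shift amount would be stored; this sketch returns the truncated integer so that the information loss is easy to inspect.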

[LG-44] Tensor-Var: Variational Data Assimilation in Tensor Product Feature Space

Link: https://arxiv.org/abs/2501.13312
Authors: Yiming Yang, Xiaoyuan Cheng, Daniel Giles, Sibo Cheng, Yi He, Xiao Xue, Boli Chen, Yukun Hu
Subjects: Machine Learning (cs.LG)

Abstract:Variational data assimilation estimates the dynamical system states by minimizing a cost function that fits the numerical models with observational data. The widely used method, four-dimensional variational assimilation (4D-Var), has two primary challenges: (1) computationally demanding for complex nonlinear systems and (2) relying on state-observation mappings, which are often not perfectly known. Deep learning (DL) has been used as a more expressive class of efficient model approximators to address these challenges. However, integrating such models into 4D-Var remains challenging due to their inherent nonlinearities and the lack of theoretical guarantees for consistency in assimilation results. In this paper, we propose Tensor-Var to address these challenges using kernel Conditional Mean Embedding (CME). Tensor-Var improves optimization efficiency by characterizing system dynamics and state-observation mappings as linear operators, leading to a convex cost function in the feature space. Furthermore, our method provides a new perspective to incorporate CME into 4D-Var, offering theoretical guarantees of consistent assimilation results between the original and feature spaces. To improve scalability, we propose a method to learn deep features (DFs) using neural networks within the Tensor-Var framework. Experiments on chaotic systems and global weather prediction with real-time observations show that Tensor-Var outperforms conventional and DL hybrid 4D-Var baselines in accuracy while achieving efficiency comparable to the static 3D-Var method.

[LG-45] Exploring Variance Reduction in Importance Sampling for Efficient DNN Training

Link: https://arxiv.org/abs/2501.13296
Authors: Takuro Kutsuna
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 19 pages

Abstract:Importance sampling is widely used to improve the efficiency of deep neural network (DNN) training by reducing the variance of gradient estimators. However, efficiently assessing the variance reduction relative to uniform sampling remains challenging due to computational overhead. This paper proposes a method for estimating variance reduction during DNN training using only minibatches sampled under importance sampling. By leveraging the proposed method, the paper also proposes an effective minibatch size to enable automatic learning rate adjustment. An absolute metric to quantify the efficiency of importance sampling is also introduced as well as an algorithm for real-time estimation of importance scores based on moving gradient statistics. Theoretical analysis and experiments on benchmark datasets demonstrated that the proposed algorithm consistently reduces variance, improves training efficiency, and enhances model accuracy compared with current importance-sampling approaches while maintaining minimal computational overhead.
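
The variance reduction that importance sampling offers over uniform sampling can be seen in a tiny exact calculation. The sketch below is not the paper's estimator, which works from minibatches drawn under importance sampling; it computes the mean and variance of the single-sample estimator g_i / (n * p_i) in closed form for scalar "gradients", purely to illustrate the quantity being estimated. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def estimator_moments(
    grads: Sequence[float], probs: Sequence[float]
) -> Tuple[float, float]:
    """Exact mean and variance of g_i / (n * p_i) when i is drawn with prob p_i.

    The 1 / (n * p_i) weight makes the estimator unbiased for the average
    gradient regardless of the sampling distribution.
    """
    n = len(grads)
    mean = sum(p * (g / (n * p)) for g, p in zip(grads, probs))
    second = sum(p * (g / (n * p)) ** 2 for g, p in zip(grads, probs))
    return mean, second - mean ** 2


def proportional_probs(grads: Sequence[float]) -> List[float]:
    """Sampling probabilities proportional to the gradient magnitude."""
    total = sum(abs(g) for g in grads)
    return [abs(g) / total for g in grads]
```

With `grads = [1.0, 2.0, 3.0, 10.0]`, both uniform and magnitude-proportional sampling give the same mean, but the proportional scheme concentrates samples on the large gradient and drives the variance of this scalar toy example to zero. With vector gradients and estimated importance scores the reduction is only partial, which is why measuring it during training is useful.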

[LG-46] T-Graphormer: Using Transformers for Spatiotemporal Forecasting

Link: https://arxiv.org/abs/2501.13274
Authors: Hao Yuan Bai, Xue Liu
Categories: Machine Learning (cs.LG)

Abstract:Time series data is ubiquitous and appears in all fields of study. In multivariate time series, observations are interconnected both temporally and across components. For instance, in traffic flow analysis, traffic speeds at different intersections exhibit complex spatiotemporal correlations. Modelling this dual structure poses significant challenges. Most existing forecasting methods tackle these challenges by separately learning spatial and temporal dependencies. In this work, we introduce T-Graphormer, a Transformer-based approach designed to model spatiotemporal correlations directly. Extending the Graphormer architecture to incorporate temporal dynamics, our method updates each node representation by selectively attending to all other nodes within a graph sequence. This design enables the model to capture rich spatiotemporal patterns with minimal reliance on predefined spacetime inductive biases. We validate the effectiveness of T-Graphormer on real-world traffic prediction benchmark datasets, achieving up to 10% reductions in both root mean squared error (RMSE) and mean absolute percentage error (MAPE) compared to state-of-the-art methods.

[LG-47] Enhancing Robust Fairness via Confusional Spectral Regularization ICLR2025

Link: https://arxiv.org/abs/2501.13273
Authors: Gaojie Jin, Sihao Wu, Jiaxu Liu, Tianjin Huang, Ronghui Mu
Categories: Machine Learning (cs.LG)
Comments: ICLR 2025

Abstract:Recent research has highlighted a critical issue known as "robust fairness", where robust accuracy varies significantly across different classes, undermining the reliability of deep neural networks (DNNs). A common approach to address this has been to dynamically reweight classes during training, giving more weight to those with lower empirical robust performance. However, we find there is a divergence of class-wise robust performance between training set and testing set, which limits the effectiveness of these explicit reweighting methods, indicating the need for a principled alternative. In this work, we derive a robust generalization bound for the worst-class robust error within the PAC-Bayesian framework, accounting for unknown data distributions. Our analysis shows that the worst-class robust error is influenced by two main factors: the spectral norm of the empirical robust confusion matrix and the information embedded in the model and training set. While the latter has been extensively studied, we propose a novel regularization technique targeting the spectral norm of the robust confusion matrix to improve worst-class robust accuracy and enhance robust fairness. We validate our approach through comprehensive experiments on various datasets and models, demonstrating its effectiveness in enhancing robust fairness.

[LG-48] Hybrid Two-Stage Reconstruction of Multiscale Subsurface Flow with Physics-informed Residual Connected Neural Operator

Link: https://arxiv.org/abs/2501.13271
Authors: Peiqi Li, Jie Chen
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 14 figures, 3 tables

Abstract:Novel neural networks show great potential in solving partial differential equations. For single-phase flow problems in subsurface porous media with high-contrast coefficients, the key is to develop neural operators with accurate reconstruction capability and strict adherence to physical laws. In this study, we propose a hybrid two-stage framework that uses multiscale basis functions and physics-guided deep learning to solve the Darcy flow problem in high-contrast fractured porous media. In the first stage, a data-driven model is used to reconstruct the multiscale basis function based on the permeability field to achieve effective dimensionality reduction while preserving the necessary multiscale features. In the second stage, a physics-informed neural network, together with a Transformer-based global information extractor, is used to reconstruct the pressure field by integrating the physical constraints derived from the Darcy equation, ensuring consistency with the physical laws of the real world. The model was evaluated on datasets with different combinations of permeability and basis functions and performed well in terms of reconstruction accuracy. Specifically, the framework achieves $R^2$ values above 0.9 in terms of basis function fitting and pressure reconstruction, and the residual indicator is on the order of $1\times 10^{-4}$. These results validate the ability of the proposed framework to achieve accurate reconstruction while maintaining physical consistency.

[LG-49] Exploring the Technology Landscape through Topic Modeling, Expert Involvement, and Reinforcement Learning

Link: https://arxiv.org/abs/2501.13252
Authors: Ali Nazari, Michael Weiss
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Quantum Physics (quant-ph)
Comments: 28 pages, 17 figures

Abstract:This study presents a method for exploring advancements in a specific technological domain. It combines topic modeling, expert input, and reinforcement learning (RL). The proposed approach has three key steps: (1) generate aspect-based topic models using expert-weighted keywords to emphasize critical aspects, (2) analyze similarities and entropy changes by comparing topic distributions across iterative models, and (3) refine topic selection using RL with a modified reward function that integrates changes in topic divergence and similarity across iterations. The method is tested on quantum communication documents with a focus on advances in cryptography and security protocols. The results show that the method is effective and can identify, rank, and track trends that match expert input. The framework provides a robust tool for exploring evolving technological landscapes.

[LG-50] State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

Link: https://arxiv.org/abs/2501.13241
Authors: Xintong Duan, Yutong He, Fahim Tajwar, Wen-Tse Chen, Ruslan Salakhutdinov, Jeff Schneider
Categories: Machine Learning (cs.LG)

Abstract:Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on expert trajectory generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.

[LG-51] MLPs at the EOC: Spectrum of the NTK

Link: https://arxiv.org/abs/2501.13225
Authors: Dávid Terjék, Diego González-Sánchez
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 1 figure

Abstract:We study the properties of the Neural Tangent Kernel (NTK) $\overset{\scriptscriptstyle\infty}{K} : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ corresponding to infinitely wide $l$-layer Multilayer Perceptrons (MLPs) taking inputs from $\mathbb{R}^{m_0}$ to outputs in $\mathbb{R}^{m_l}$, equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ and initialized at the Edge Of Chaos (EOC). We find that the entries $\overset{\scriptscriptstyle\infty}{K}(x_1,x_2)$ can be approximated by the inverses of the cosine distances of the activations corresponding to $x_1$ and $x_2$ increasingly better as the depth $l$ increases. By quantifying these inverse cosine distances and the spectrum of the matrix containing them, we obtain tight spectral bounds for the NTK matrix $\overset{\scriptscriptstyle\infty}{K} = [\frac{1}{n} \overset{\scriptscriptstyle\infty}{K}(x_{i_1},x_{i_2}) : i_1, i_2 \in [1:n]]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$, transferred from the inverse cosine distance matrix via our approximation result. Our results show that $\Delta_\phi = \frac{b^2}{a^2+b^2}$ determines the rate at which the condition number of the NTK matrix converges to its limit as depth increases, implying in particular that the absolute value ($\Delta_\phi=1$) is better than the ReLU ($\Delta_\phi=\frac{1}{2}$) in this regard.
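
As a quick sanity check of the two values quoted in the abstract, the ratio of b squared to (a squared plus b squared) can be worked out by hand for both activations. The only step not spelled out in the abstract is the standard identity that writes the ReLU in the form a s + b |s|; everything else follows from the stated definition.

```latex
\begin{align*}
\text{absolute value: } & \phi(s) = |s| \;\Rightarrow\; a = 0,\ b = 1,
  & \Delta_\phi &= \frac{1^2}{0^2 + 1^2} = 1, \\
\text{ReLU: } & \phi(s) = \max(0, s) = \tfrac{1}{2}\, s + \tfrac{1}{2}\, |s| \;\Rightarrow\; a = b = \tfrac{1}{2},
  & \Delta_\phi &= \frac{1/4}{1/4 + 1/4} = \frac{1}{2}.
\end{align*}
```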

[LG-52] Scaling for Fairness? Analyzing Model Size, Data Composition, and Multilinguality in Vision-Language Bias

Link: https://arxiv.org/abs/2501.13223
Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Categories: Machine Learning (cs.LG)

Abstract:As large-scale vision-language models (VLMs) become increasingly central to modern AI applications, understanding and mitigating social biases in these systems has never been more critical. We investigate how dataset composition, model size, and multilingual training affect gender and racial bias in a popular VLM, CLIP, and its open-source variants. In particular, we systematically evaluate models trained on varying dataset scales and architectures, as well as multilingual versions encompassing English along with Persian, Turkish, and Finnish, languages with minimal gender marking. To assess social perception bias, we measure the zero-shot performance on face images featuring socially charged terms rooted in the psychological constructs of communion and agency, and demographic labeling bias using both the FairFace and PATA datasets. Our findings reveal three key insights. First, while larger training datasets can mitigate some biases, they may also introduce or amplify others when the data composition is imbalanced. Second, although increasing model size generally improves performance, it does not consistently reduce bias and can, in certain cases, exacerbate it. Finally, while multilingual training broadens linguistic coverage, it does not inherently neutralize bias and can transfer or intensify inequities across languages. Taken together, these results highlight the necessity of inclusive, carefully curated training data to foster fairness rather than relying solely on model scaling or language expansion. We provide a systematic evaluation of vision language bias across diverse demographics, underscoring the urgent need for intentional bias mitigation strategies in next generation AI systems.

[LG-53] Enhancing Multi-Attribute Fairness in Healthcare Predictive Modeling

Link: https://arxiv.org/abs/2501.13219
Authors: Xiaoyang Wang, Christopher C. Yang
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Accepted to the 13th IEEE International Conference on Healthcare Informatics (IEEE ICHI 2025)

Abstract:Artificial intelligence (AI) systems in healthcare have demonstrated remarkable potential to improve patient outcomes. However, if not designed with fairness in mind, they also carry the risks of perpetuating or exacerbating existing health disparities. Although numerous fairness-enhancing techniques have been proposed, most focus on a single sensitive attribute and neglect the broader impact that optimizing fairness for one attribute may have on the fairness of other sensitive attributes. In this work, we introduce a novel approach to multi-attribute fairness optimization in healthcare AI, tackling fairness concerns across multiple demographic attributes concurrently. Our method follows a two-phase approach: initially optimizing for predictive performance, followed by fine-tuning to achieve fairness across multiple sensitive attributes. We develop our proposed method using two strategies, sequential and simultaneous. Our results show a significant reduction in Equalized Odds Disparity (EOD) for multiple attributes, while maintaining high predictive accuracy. Notably, we demonstrate that single-attribute fairness methods can inadvertently increase disparities in non-targeted attributes whereas simultaneous multi-attribute optimization achieves more balanced fairness improvements across all attributes. These findings highlight the importance of comprehensive fairness strategies in healthcare AI and offer promising directions for future research in this critical area.

[LG-54] S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning

Link: https://arxiv.org/abs/2501.13198
Authors: Yichen Wu, Hongming Piao, Long-Kai Huang, Renzhen Wang, Wanhua Li, Hanspeter Pfister, Deyu Meng, Kede Ma, Ying Wei
Categories: Machine Learning (cs.LG)

Abstract:Continual Learning (CL) with foundation models has recently emerged as a promising approach to harnessing the power of pre-trained models for sequential tasks. Existing prompt-based methods generally use a gating mechanism to select relevant prompts aligned with the test query for further processing. However, the success of these methods largely depends on the precision of the gating mechanism, which becomes less scalable with additional computational overhead as the number of tasks increases. To overcome these issues, we propose a Scalable Low-Rank Adaptation (S-LoRA) method for CL (in particular class incremental learning), which incrementally decouples the learning of the direction and magnitude of LoRA parameters. S-LoRA supports efficient inference by employing the last-stage trained model for direct testing without a gating process. Our theoretical and empirical analysis demonstrates that S-LoRA tends to follow a low-loss trajectory that converges to an overlapped low-loss region, resulting in an excellent stability-plasticity trade-off in CL. Furthermore, based on our findings, we develop variants of S-LoRA with further improved scalability. Extensive experiments across multiple CL benchmarks and various foundation models consistently validate the effectiveness of S-LoRA.

[LG-55] Efficient Implementation of LinearUCB through Algorithmic Improvements and Vector Computing Acceleration for Embedded Learning Systems

Link: https://arxiv.org/abs/2501.13139
Authors: Marco Angioli, Marcello Barbirotta, Abdallah Cheikh, Antonio Mastrandrea, Francesco Menichelli, Mauro Olivieri
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:As the Internet of Things expands, embedding Artificial Intelligence algorithms in resource-constrained devices has become increasingly important to enable real-time, autonomous decision-making without relying on centralized cloud servers. However, implementing and executing complex algorithms in embedded devices poses significant challenges due to limited computational power, memory, and energy resources. This paper presents algorithmic and hardware techniques to efficiently implement two LinearUCB Contextual Bandits algorithms on resource-constrained embedded devices. Algorithmic modifications based on the Sherman-Morrison-Woodbury formula streamline model complexity, while vector acceleration is harnessed to speed up matrix operations. We analyze the impact of each optimization individually and then combine them in a two-pronged strategy. The results show notable improvements in execution time and energy consumption, demonstrating the effectiveness of combining algorithmic and hardware optimizations to enhance learning models for edge computing environments with low-power and real-time requirements.
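
The abstract does not spell out how the Sherman-Morrison-Woodbury formula is applied, so the following is only a sketch of the standard rank-one special case (Sherman-Morrison) as it is commonly used in LinUCB-style bandits. Instead of re-inverting the design matrix A after each update A + x xᵀ, the stored inverse is corrected directly. The function name `sherman_morrison_update`, the identity initialisation and the example context vector are assumptions for illustration, not details taken from the paper.

```python
def sherman_morrison_update(a_inv, x):
    """Return (A + x x^T)^{-1} given a symmetric A^{-1} and a context vector x.

    Uses (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x),
    which costs O(d^2) instead of the O(d^3) of a fresh inversion.
    """
    d = len(x)
    # u = A^{-1} x
    u = [sum(a_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
    # denom = 1 + x^T A^{-1} x
    denom = 1.0 + sum(x[i] * u[i] for i in range(d))
    # Because A^{-1} is symmetric, x^T A^{-1} equals u^T.
    return [[a_inv[i][j] - u[i] * u[j] / denom for j in range(d)]
            for i in range(d)]


# Assumed starting point: A = I (a common ridge-style initialisation).
a_inv = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]  # hypothetical context vector
updated = sherman_morrison_update(a_inv, x)
```

Here A + x xᵀ is [[2, 2], [2, 5]], whose inverse is (1/6)·[[5, -2], [-2, 2]], and the update reproduces it without any explicit matrix inversion. On an embedded target the same inner loops are natural candidates for vectorisation.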

[LG-56] iServe: An Intent-based Serving System for LLMs

Link: https://arxiv.org/abs/2501.13111
Authors: Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, Neeraja J. Yadwadkar
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 19 pages, 24 figures

Abstract:Large Language Models (LLMs) are becoming ubiquitous across industries, where applications demand they fulfill diverse user intents. However, developers currently face the challenge of manually exploring numerous deployment configurations - combinations of parallelism and compression techniques that impact resource usage, latency, cost, and accuracy - to meet these intents. Assessing the impact of these configurations on user metrics requires extensive, costly profiling for each model. Existing approaches avoid this expense by using fixed, static configurations, but this often leads to sub-optimal performance and higher costs. Moreover, none of these solutions dynamically adapt to changing user intents to balance latency and cost effectively. We present iServe, an automated, intent-based system for distributed LLM inference. Instead of manually selecting deployment configurations, developers simply specify their intent - such as minimizing latency, reducing cost, or meeting specific targets for either. iServe introduces fingerprints, lightweight representations of LLMs, to efficiently estimate how different configurations impact latency and memory usage. Based on these insights and GPU availability, iServe dynamically selects the optimal configuration to align with the user's intent. For various LLMs and query arrival rates, iServe best meets user intents compared to state-of-the-art systems by reducing latency by 77.62% and SLO violations by 7.09x while improving GPU throughput by 4.72x. Moreover, iServe's fingerprint-based profiling reduces profiling cost by 6.05x (GPU-hours) compared to baselines.

[LG-57] Consistent spectral clustering in sparse tensor block models

Link: https://arxiv.org/abs/2501.13820
Authors: Ian Välimaa, Lasse Leskelä
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR)
Comments: 63 pages

Abstract:High-order clustering aims to classify objects in multiway datasets that are prevalent in various fields such as bioinformatics, social network analysis, and recommendation systems. These tasks often involve data that is sparse and high-dimensional, presenting significant statistical and computational challenges. This paper introduces a tensor block model specifically designed for sparse integer-valued data tensors. We propose a simple spectral clustering algorithm augmented with a trimming step to mitigate noise fluctuations, and identify a density threshold that ensures the algorithm’s consistency. Our approach models sparsity using a sub-Poisson noise concentration framework, accommodating heavier than sub-Gaussian tails. Remarkably, this natural class of tensor block models is closed under aggregation across arbitrary modes. Consequently, we obtain a comprehensive framework for evaluating the tradeoff between signal loss and noise reduction during data aggregation. The analysis is based on a novel concentration bound for sparse random Gram matrices. The theoretical findings are illustrated through simulation experiments.

[LG-58] A dimensionality reduction technique based on the Gromov-Wasserstein distance

Link: https://arxiv.org/abs/2501.13732
Authors: Rafael P. Eufrazio, Eduardo Fernandes Montesuma, Charles C. Cavalcante
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Analyzing relationships between objects is a pivotal problem within data science. In this context, dimensionality reduction (DR) techniques are employed to generate smaller and more manageable data representations. This paper proposes a new method for dimensionality reduction, based on optimal transportation theory and the Gromov-Wasserstein distance. We offer a new probabilistic view of the classical Multidimensional Scaling (MDS) algorithm and of Isomap (Isometric Mapping or Isometric Feature Mapping), the nonlinear dimensionality reduction algorithm that extends classical MDS, in which we use the Gromov-Wasserstein distance between the probability measure of high-dimensional data and its low-dimensional representation. Through gradient descent, our method embeds high-dimensional data into a lower-dimensional space, providing a robust and efficient solution for analyzing complex high-dimensional datasets.

[LG-59] The First Indoor Pathloss Radio Map Prediction Challenge ICASSP2025

Link: https://arxiv.org/abs/2501.13698
Authors: Stefanos Bakirtzis, Çağkan Yapar, Kehai Qiu, Ian Wassell, Jie Zhang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: ICASSP 2025

Abstract:To encourage further research and to facilitate fair comparisons in the development of deep learning-based radio propagation models, in the less explored case of directional radio signal emissions in indoor propagation environments, we have launched the ICASSP 2025 First Indoor Pathloss Radio Map Prediction Challenge. This overview paper describes the indoor path loss prediction problem, the datasets used, the Challenge tasks, and the evaluation methodology. Finally, the results of the Challenge and a summary of the submitted methods are presented.

[LG-60] Learning under Commission and Omission Event Outliers

Link: https://arxiv.org/abs/2501.13599
Authors: Yuecheng Zhang, Guanhua Fang, Wen Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 38 pages

Abstract:Event streams are an important data format in real life. The events are usually expected to follow some regular patterns over time. However, the patterns could be contaminated by unexpected absences or occurrences of events. In this paper, we adopt the temporal point process framework for learning event streams and we provide a simple-but-effective method to deal with both commission and omission event outliers. In particular, we introduce a novel weight function to dynamically adjust the importance of each observed event so that the final estimator could offer multiple statistical merits. We compare the proposed method with the vanilla one in the classification problems, where event streams can be clustered into different groups. Both theoretical and numerical results confirm the effectiveness of our new approach. To our knowledge, our method is the first one to provably handle both commission and omission outliers simultaneously.

[LG-61] LITE: Efficiently Estimating Gaussian Probability of Maximality AISTATS2025

Link: https://arxiv.org/abs/2501.13535
Authors: Nicolas Menet, Jonas Hübotter, Parnian Kassraie, Andreas Krause (ETH Zürich)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments: accepted in AISTATS 2025

Abstract:We consider the problem of computing the probability of maximality (PoM) of a Gaussian random vector, i.e., the probability for each dimension to be maximal. This is a key challenge in applications ranging from Bayesian optimization to reinforcement learning, where the PoM not only helps with finding an optimal action, but yields a fine-grained analysis of the action domain, crucial in tasks such as drug discovery. Existing techniques are costly, scaling polynomially in computation and memory with the vector size. We introduce LITE, the first approach for estimating Gaussian PoM with almost-linear time and memory complexity. LITE achieves SOTA accuracy on a number of tasks, while being in practice several orders of magnitude faster than the baselines. This also translates to a better performance on downstream tasks such as entropy estimation and optimal control of bandits. Theoretically, we cast LITE as entropy-regularized UCB and connect it to prior PoM estimators.
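
To illustrate what the probability of maximality measures, here is a naive Monte Carlo sketch. This is not LITE and makes no claim about the paper's algorithm or its baselines. It assumes independent Gaussian coordinates (a diagonal covariance), and the function name `monte_carlo_pom` and all parameter values are hypothetical.

```python
import random


def monte_carlo_pom(means, stds, num_samples=10000, seed=0):
    """Estimate, for each coordinate of an independent Gaussian vector,
    the probability that it is the largest coordinate."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    for _ in range(num_samples):
        draw = [rng.gauss(m, s) for m, s in zip(means, stds)]
        # Index of the largest coordinate in this sample.
        winner = max(range(len(draw)), key=draw.__getitem__)
        counts[winner] += 1
    return [c / num_samples for c in counts]


# Three hypothetical actions; the third has a much higher mean.
pom = monte_carlo_pom([0.0, 0.0, 5.0], [1.0, 1.0, 1.0])
# Two identically distributed actions should split the mass roughly evenly.
pom_tie = monte_carlo_pom([0.0, 0.0], [1.0, 1.0])
```

The estimates always sum to one, and the clearly better action absorbs almost all of the mass. The cost of this approach grows with both the number of samples and the vector size, which is the kind of overhead an almost-linear method aims to avoid.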

[LG-62] Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data

Link: https://arxiv.org/abs/2501.13483
Authors: Aayush Mishra, Daniel Habermann, Marvin Schmitt, Stefan T. Radev, Paul-Christian Bürkner
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Neural amortized Bayesian inference (ABI) can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, neural ABI is not yet sufficiently robust for widespread and safe applicability. In particular, when performing inference on observations outside of the scope of the simulated data seen during training, for example, because of model misspecification, the posterior approximations are likely to become highly biased. Due to the bad pre-asymptotic behavior of current neural posterior estimators in the out-of-simulation regime, the resulting estimation biases cannot be fixed in acceptable time by just simulating more training data. In this proof-of-concept paper, we propose a semi-supervised approach that enables training not only on (labeled) simulated data generated from the model, but also on unlabeled data originating from any source, including real-world data. To achieve the latter, we exploit Bayesian self-consistency properties that can be transformed into strictly proper losses without requiring knowledge of true parameter values, that is, without requiring data labels. The results of our initial experiments show remarkable improvements in the robustness of ABI on out-of-simulation data. Even if the observed data is far away from both labeled and unlabeled training data, inference remains highly accurate. If our findings also generalize to other scenarios and model classes, we believe that our new method represents a major breakthrough in neural ABI.

[LG-63] Radio Map Estimation via Latent Domain Plug-and-Play Denoising

Link: https://arxiv.org/abs/2501.13472
Authors: Le Xu, Lei Cheng, Junting Chen, Wenqiang Pu, Xiao Fu
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Radio map estimation (RME), also known as spectrum cartography, aims to reconstruct the strength of radio interference across different domains (e.g., space and frequency) from sparsely sampled measurements. To tackle this typical inverse problem, state-of-the-art RME methods rely on handcrafted or data-driven structural information of radio maps. However, the former often struggles to model complex radio frequency (RF) environments and the latter requires excessive training – making it hard to quickly adapt to in situ sensing tasks. This work presents a spatio-spectral RME approach based on plug-and-play (PnP) denoising, a technique from computational imaging. The idea is to leverage the observation that the denoising operations of signals like natural images and radio maps are similar – despite the nontrivial differences of the signals themselves. Hence, sophisticated denoisers designed for or learned from natural images can be directly employed to assist RME, avoiding using radio map data for training. Unlike conventional PnP methods that operate directly in the data domain, the proposed method exploits the underlying physical structure of radio maps and proposes an ADMM algorithm that denoises in a latent domain. This design significantly improves computational efficiency and enhances noise robustness. Theoretical aspects, e.g., recoverability of the complete radio map and convergence of the ADMM algorithm are analyzed. Synthetic and real data experiments are conducted to demonstrate the effectiveness of our approach.

[LG-64] Advancing Carbon Capture using AI: Design of permeable membrane and estimation of parameters for Carbon Capture using linear regression and membrane-based equations

Link: https://arxiv.org/abs/2501.13373
Authors: Bishwash Panerua, Biplov Paneru
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)

Abstract:This study focuses on membrane-based systems for CO$_2$ separation, addressing the urgent need for efficient carbon capture solutions to mitigate climate change. Linear regression models, based on membrane equations, were utilized to estimate key parameters, including porosity ($\epsilon$) of 0.4805, Kozeny constant (K) of 2.9084, specific surface area ($\sigma$) of 105.3272 m$^2$/m$^3$, mean pressure (Pm) of 6.2166 MPa, viscosity ($\mu$) of 0.1997 Ns/m$^2$, and gas flux (Jg) of 3.2559 kg m$^{-2}$ s$^{-1}$. These parameters were derived from the analysis of synthetic datasets using linear regression. The study also provides insights into the performance of the membrane, with a flow rate (Q) of $9.8778 \times 10^{-4}$ m$^3$/s, an injection pressure (P$_1$) of 2.8219 MPa, and an exit pressure (P$_2$) of 2.5762 MPa. The permeability value of 0.045 for CO$_2$ indicates the potential for efficient separation. Optimizing membrane properties to selectively block CO$_2$ while allowing other gases to pass is crucial for improving carbon capture efficiency. By integrating these technologies into industrial processes, significant reductions in greenhouse gas emissions can be achieved, fostering a circular carbon economy and contributing to global climate goals. This study also explores how artificial intelligence (AI) can aid in designing membranes for carbon capture, addressing the global climate change challenge and supporting the Sustainable Development Goals (SDGs) set by the United Nations.

[LG-65] Topological constraints on self-organisation in locally interacting systems

Link: https://arxiv.org/abs/2501.13188
Authors: Francesco Sacco, Dalton A R Sakthivadivel, Michael Levin
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments: 9+3 pages, four figures, four tikzpictures. To appear in Philos Trans R Soc A

Abstract:All intelligence is collective intelligence, in the sense that it is made of parts which must align with respect to system-level goals. Understanding the dynamics which facilitate or limit navigation of problem spaces by aligned parts thus impacts many fields ranging across life sciences and engineering. To that end, consider a system on the vertices of a planar graph, with pairwise interactions prescribed by the edges of the graph. Such systems can sometimes exhibit long-range order, distinguishing one phase of macroscopic behaviour from another. In networks of interacting systems we may view spontaneous ordering as a form of self-organisation, modelling neural and basal forms of cognition. Here, we discuss necessary conditions on the topology of the graph for an ordered phase to exist, with an eye towards finding constraints on the ability of a system with local interactions to maintain an ordered target state. By studying the scaling of free energy under the formation of domain walls in three model systems – the Potts model, autoregressive models, and hierarchical networks – we show how the combinatorics of interactions on a graph prevent or allow spontaneous ordering. As an application we are able to analyse why multiscale systems like those prevalent in biology are capable of organising into complex patterns, whereas rudimentary language models are challenged by long sequences of outputs.

[LG-66] A Learnt Half-Quadratic Splitting-Based Algorithm for Fast and High-Quality Industrial Cone-beam CT Reconstruction

Link: https://arxiv.org/abs/2501.13128
Authors: Aniket Pramanik, Singanallur V. Venkatakrishnan, Obaidullah Rahman, Amirkoushyar Ziabari
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Industrial X-ray cone-beam CT (XCT) scanners are widely used for scientific imaging and non-destructive characterization. Industrial CBCT scanners use large detectors containing millions of pixels and the subsequent 3D reconstructions can be of the order of billions of voxels. In order to obtain high-quality reconstruction when using typical analytic algorithms, the scan involves collecting a large number of projections/views which results in large measurement times - limiting the utility of the technique. Model-based iterative reconstruction (MBIR) algorithms can produce high-quality reconstructions from fast sparse-view CT scans, but are computationally expensive and hence are avoided in practice. Single-step deep-learning (DL) based methods have demonstrated that it is possible to obtain fast and high-quality reconstructions from sparse-view data but they do not generalize well to out-of-distribution scenarios. In this work, we propose a half-quadratic splitting-based algorithm that uses convolutional neural networks (CNN) in order to obtain high-quality reconstructions from large sparse-view cone-beam CT (CBCT) measurements while overcoming the challenges with typical approaches. The algorithm alternates between the application of a CNN and a conjugate gradient (CG) step enforcing data-consistency (DC). The proposed method outperforms other methods on the publicly available Walnuts data-set.

Information Retrieval

[IR-0] Graph Neural Controlled Differential Equations For Collaborative Filtering WWW2025

Link: https://arxiv.org/abs/2501.13908
Authors: Ke Xu, Weizhi Zhang, Zihe Song, Yuanjie Zhu, Philip S. Yu
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at WWW 2025 as a short paper

Abstract: Graph Convolution Networks (GCNs) are widely considered state-of-the-art for recommendation systems. Several studies have attempted to bring collaborative filtering (CF) into the Neural ODE framework. These studies follow the same idea as LightGCN, either removing the weight matrix or using a discrete weight matrix. However, we argue that weight control is critical for Neural ODE-based methods. Weights are crucial for tailoring the graph convolution to each node, and a fixed or discrete weight cannot adjust over time within the ODE function. This rigidity reduces the adaptability of the graph convolution and consequently hinders recommendation performance. In this study, to create an optimal control for Neural ODE-based recommendation, we introduce a new method called Graph Neural Controlled Differential Equations for Collaborative Filtering (CDE-CF). Our method improves the performance of Graph ODE-based methods by incorporating weight control in a continuous manner. To evaluate our approach, we conducted experiments on various datasets. The results show that our method surpasses competing baselines, including GCN-based models and state-of-the-art Graph ODE-based methods.

[IR-1] Large Language Model driven Policy Exploration for Recommender Systems

Link: https://arxiv.org/abs/2501.13816
Authors: Jie Wang, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose
Subjects: Information Retrieval (cs.IR)

Abstract: Recent advancements in Recommender Systems (RS) have incorporated Reinforcement Learning (RL), framing the recommendation as a Markov Decision Process (MDP). However, offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments. Additionally, excessive focus on exploiting short-term relevant items can hinder exploration, leading to suboptimal recommendations and negatively impacting long-term user gains. Online RL-based RS also face challenges in production deployment, due to the risks of exposing users to untrained or unstable policies. Large Language Models (LLMs) offer a promising solution to mimic user objectives and preferences for pre-training policies offline to enhance the initial recommendations in online settings. Effectively managing distribution shift and balancing exploration are crucial for improving RL-based RS, especially when leveraging LLM-based pre-training. To address these challenges, we propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM. Our approach involves prompting the LLM with user states to extract item preferences, learning rewards based on feedback, and updating the RL policy using an actor-critic framework. Furthermore, to deploy iALP in an online scenario, we introduce an adaptive variant, A-iALP, that implements a simple fine-tuning strategy (A-iALP_ft), and an adaptive approach (A-iALP_ap) designed to mitigate issues with compromised policies and limited exploration. Experiments across three simulated environments demonstrate that A-iALP introduces substantial performance improvements

[IR-2] AirTOWN: A Privacy-Preserving Mobile App for Real-time Pollution-Aware POI Suggestion

Link: https://arxiv.org/abs/2501.13608
Authors: Giuseppe Fasano, Yashar Deldjoo, Tommaso Di Noia
Subjects: Information Retrieval (cs.IR)

Abstract: This demo paper presents AirTOWN, a privacy-preserving mobile application that provides real-time, pollution-aware recommendations for points of interest (POIs) in urban environments. By combining real-time Air Quality Index (AQI) data with user preferences, the proposed system aims to help users make health-conscious decisions about the locations they visit. The application uses collaborative filtering for personalized suggestions and federated learning for privacy protection, and it integrates AQI data from sensor networks in cities such as Bari, Italy, and Cork, Ireland. In areas with sparse sensor coverage, interpolation techniques approximate AQI values, ensuring broad applicability. This system offers a promising, health-oriented POI recommendation solution that adapts dynamically to current urban air quality conditions while safeguarding user privacy.
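
The abstract does not say which interpolation technique is used where sensor coverage is sparse. Purely as an illustration of the idea, the sketch below uses inverse distance weighting (IDW), a common choice for this kind of spatial estimate. The function name `idw_aqi`, the `(x_km, y_km, aqi)` sensor tuples, and the `power` parameter are assumptions made here, not details taken from the paper.

```python
import math
from typing import List, Tuple

# Hypothetical sensor record: planar coordinates in km plus the measured AQI.
Sensor = Tuple[float, float, float]


def idw_aqi(x: float, y: float, sensors: List[Sensor], power: float = 2.0) -> float:
    """Estimate the AQI at (x, y) as a distance-weighted mean of sensor readings.

    Each sensor contributes with weight 1 / distance**power, so nearby
    sensors dominate the estimate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, aqi in sensors:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return aqi  # query point sits exactly on a sensor
        weight = 1.0 / distance ** power
        weighted_sum += weight * aqi
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one sensor reading is required")
    return weighted_sum / weight_total


if __name__ == "__main__":
    sensors = [(0.0, 0.0, 40.0), (2.0, 0.0, 80.0)]
    # Midpoint between two sensors: both weights are equal.
    print(idw_aqi(1.0, 0.0, sensors))
```

A larger `power` makes the estimate follow the nearest sensor more closely, while a smaller one smooths across the whole network.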

[IR-3] MixRec: Individual and Collective Mixing Empowers Data Augmentation for Recommender Systems WWW’25

Link: https://arxiv.org/abs/2501.13579
Authors: Yi Zhang, Yiwen Zhang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by WWW’25

Abstract: The core of general recommender systems lies in learning high-quality embedding representations of users and items to investigate their positional relations in the feature space. Unfortunately, data sparsity caused by difficult-to-access interaction data severely limits the effectiveness of recommender systems. Faced with this dilemma, various self-supervised learning methods have been introduced into recommender systems in an attempt to alleviate data sparsity through distribution modeling or data augmentation. However, most data augmentation relies on elaborate manual design, which is not universal, and the bloated, redundant augmentation process may significantly slow down model training. To tackle these limitations, we propose a novel Dual Mixing-based Recommendation Framework (MixRec) to empower data augmentation as we wish. Specifically, we propose individual mixing and collective mixing. The former aims to provide a new positive sample that is unique to the target (user or item) and to make the pair-wise recommendation loss benefit from it, while the latter aims to portray a new sample that contains group properties in a batch. The two mixing mechanisms allow for data augmentation with only one parameter, which does not need to be set multiple times, and can be done in linear time. In addition, we propose dual-mixing contrastive learning to maximize the utilization of these newly constructed samples and enhance the consistency between pairs of positive samples. Experimental results on four real-world datasets demonstrate the effectiveness of MixRec in terms of recommendation performance, training efficiency, sparsity resistance, and usability.

[IR-4] Federated Conformance Checking

Link: https://arxiv.org/abs/2501.13576
Authors: Majid Rafiei, Mahsa Pourbafrani, Wil M.P. van der Aalst
Subjects: Information Retrieval (cs.IR)

Abstract: Conformance checking is a crucial aspect of process mining, where the main objective is to compare the actual execution of a process, as recorded in an event log, with a reference process model, e.g., in the form of a Petri net or a BPMN model. Conformance checking enables identifying deviations, anomalies, or non-compliance instances. It offers different perspectives on problems in processes, bottlenecks, or process instances that are not compliant with the model. Performing conformance checking in federated (inter-organizational) settings allows organizations to gain insights into the overall process execution and to identify compliance issues across organizational boundaries, which facilitates process improvement efforts among collaborating entities. In this paper, we propose a privacy-aware federated conformance-checking approach that allows for evaluating the correctness of overall cross-organizational process models, identifying miscommunications, and quantifying their costs. For evaluation, we design and simulate a supply chain process with three organizations engaged in purchase-to-pay, order-to-cash, and shipment processes. We generate synthetic event logs for each organization as well as the complete process, and we apply our approach to identify and evaluate the cost of pre-injected miscommunications.

[IR-5] PCSI – The Platform for Content-Structure Inference

Link: https://arxiv.org/abs/2501.13272
Authors: Caleb Malchik, Joan Feigenbaum
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments: 9 pages

Abstract: The Platform for Content-Structure Inference (PCSI, pronounced “pixie”) facilitates the sharing of information about the process of converting Web resources into structured content objects that conform to a predefined format. PCSI records encode methods for deriving structured content from classes of URLs, and report the results of applying particular methods to particular URLs. The methods are scripts written in Hex, a variant of Awk with facilities for traversing the HTML DOM.

[IR-6] Exploring GPT's Ability as a Judge in Music Understanding

Link: https://arxiv.org/abs/2501.13261
Authors: Kun Fang, Ziyu Wang, Gus Xia, Ichiro Fujinaga
Subjects: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract: Recent progress in text-based Large Language Models (LLMs) and their extended ability to process multi-modal sensory data have led us to explore their applicability in addressing music information retrieval (MIR) challenges. In this paper, we use a systematic prompt engineering approach for LLMs to solve MIR problems. We convert the music data to symbolic inputs and evaluate LLMs’ ability to detect annotation errors in three key MIR tasks: beat tracking, chord extraction, and key estimation. A concept augmentation method is proposed to evaluate the consistency of LLMs’ music reasoning with the music concepts provided in the prompts. Our experiments tested the MIR capabilities of Generative Pre-trained Transformers (GPT). Results show that GPT has an error detection accuracy of 65.20%, 64.80%, and 59.72% in the beat tracking, chord extraction, and key estimation tasks, respectively, all exceeding the random baseline. Moreover, we observe a positive correlation between GPT’s error-finding accuracy and the amount of concept information provided. The current findings, based on symbolic music input, provide solid ground for future LLM-based MIR research.

Attachment Download

Click to download today's full paper list