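Today's arXiv Papers | 2024-12-13

This post lists the latest papers retrieved from arXiv.org on 2024-12-13. It is updated automatically and groups papers into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each morning.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.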

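[Quick Read]: This paper addresses the lack of visual accuracy and contextual richness in the images that existing text-to-image (T2I) models generate for complex or culturally specific subjects, a limitation caused by dataset constraints. The key to the solution is a graph-based retrieval-augmented generation (RAG) system that dynamically retrieves detailed character information and relational data from a knowledge graph, strengthening the model's contextual understanding so that it generates more accurate and richer images. The paper also proposes a self-correcting mechanism that uses the rich context in the graph to guide corrections to the generated image, keeping the visual output consistent and faithful. These contributions significantly improve popular models such as Flux, Stable Diffusion, and DALL-E, and enhance ControlNet for fine-grained image editing tasks.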

Link: https://arxiv.org/abs/2412.09614
Authors: Kavana Venkatesh,Yusuf Dalva,Ismini Lourentzou,Pinar Yanardag
Keywords (EN): Context Canvas, graph-based RAG, Context Canvas significantly, Canvas significantly enhances, Context Canvas represents
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. This capability significantly improves upon the limitations of existing T2I models, which often struggle with the accurate depiction of complex or culturally specific subjects due to dataset constraints. Furthermore, we propose a novel self-correcting mechanism for text-to-image models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. Our qualitative and quantitative experiments demonstrate that Context Canvas significantly enhances the capabilities of popular models such as Flux, Stable Diffusion, and DALL-E, and improves the functionality of ControlNet for fine-grained image editing tasks. To our knowledge, Context Canvas represents the first application of graph-based RAG in enhancing T2I models, representing a significant advancement for producing high-fidelity, context-aware multi-faceted images.

[NLP-1] Olympus: A Universal Task Router for Computer Vision Tasks

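[Quick Read]: This paper addresses the limitations of Multimodal Large Language Models (MLLMs) in handling a wide range of computer vision tasks. The key to the solution is Olympus, a unified framework in which a controller MLLM delegates more than 20 specialized tasks to dedicated modules through instruction-based routing. This mechanism lets complex workflows run as chained actions without training heavy generative models. Olympus integrates easily with existing MLLMs, extending their capabilities while keeping comparable performance. Experiments show that Olympus reaches an average routing accuracy of 94.75% across 20 tasks and a precision of 91.82% in chained-action scenarios, demonstrating its effectiveness as a universal task router.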

Link: https://arxiv.org/abs/2412.09612
Authors: Yuanze Lin,Yunsheng Li,Dongdong Chen,Weijian Xu,Ronald Clark,Philip H. S. Torr
Keywords (EN): Multimodal Large Language, transforms Multimodal Large, Large Language Models, Multimodal Large, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report

Abstract:We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: this https URL

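[NLP-2] AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

[Quick Read]: This paper addresses the lack of high-quality, multi-step trajectory data for training Graphical User Interface (GUI) agents that automate complex tasks. The key to the solution is AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories from web tutorials. Specifically, AgentTrek automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and uses a visual-language model (VLM) agent to simulate their execution in a real digital environment. A VLM-based evaluator checks the correctness of the generated trajectories. Training on these trajectories significantly improves GUI agents while reducing reliance on human annotation and improving cost efficiency.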

Link: https://arxiv.org/abs/2412.09605
Authors: Yiheng Xu,Dunjie Lu,Zhennan Shen,Junli Wang,Zekun Wang,Yuchen Mao,Caiming Xiong,Tao Yu
Keywords (EN): Graphical User Interface, Graphical User, User Interface, automating complex tasks, agents hold great
Categories: Computation and Language (cs.CL)
Comments: this https URL

Abstract:Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.

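[NLP-3] TimeRefine: Temporal Grounding with Time Refining Video LLM

[Quick Read]: This paper addresses the limitations of temporal token prediction in video temporal grounding. The key to the solution is TimeRefine, which reformulates temporal grounding as a temporal refining task: the model first makes a rough prediction and then progressively refines it by predicting offsets to the target segment. In addition, an auxiliary prediction head strengthens the model's temporal perception by penalizing predictions more heavily the further they deviate from the ground truth, which encourages more precise localization. The method can be plugged into most LLM-based temporal grounding approaches and improves mIoU by 3.6% on ActivityNet and 5.0% on Charades-STA.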

Link: https://arxiv.org/abs/2412.09601
Authors: Xizi Wang,Feng Cheng,Ziyang Wang,Huiyu Wang,Md Mohaiminul Islam,Lorenzo Torresani,Mohit Bansal,Gedas Bertasius,David Crandall
Keywords (EN): Video temporal grounding, localize relevant temporal, relevant temporal boundaries, Video, textual prompt
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy. Second, to enhance the model’s temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models will be released.
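
The coarse-then-refine idea and the IoU metric can be illustrated with a small sketch. This is not the authors' implementation: the function names `temporal_iou` and `refine` are hypothetical, and the offsets below are hand-picked for illustration, whereas in the paper the model itself predicts them.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def refine(segment, offsets):
    """Apply successive (start_offset, end_offset) corrections to a rough segment."""
    start, end = segment
    for d_start, d_end in offsets:
        start, end = start + d_start, end + d_end
    return (start, end)


# Hypothetical example: ground truth, a rough first guess, and two refinement steps.
ground_truth = (12.0, 20.0)
rough = (8.0, 18.0)
refined = refine(rough, [(3.0, 1.0), (1.0, 1.0)])

print(temporal_iou(rough, ground_truth))    # IoU of the rough guess
print(temporal_iou(refined, ground_truth))  # IoU after refinement
```

Averaging `temporal_iou` over a dataset's query-segment pairs gives the mIoU figure that the abstract reports.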

[NLP-4] InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

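[Quick Read]: This paper addresses the challenge of continuous perception, memory, and reasoning that multimodal large language models (MLLMs) face in long-term interaction. The sequence-to-sequence architecture of current MLLMs limits their ability to process inputs and generate responses at the same time, and relying on long contexts to store historical data is inefficient over long interactions. The key to the solution is to disentangle streaming perception, reasoning, and memory. Drawing on the concept of "Specialized Generalist AI", the paper proposes the InternLM-XComposer2.5-OmniLive (IXC2.5-OL) framework with three core modules: (1) a streaming perception module that processes multimodal information in real time and stores key details; (2) a multimodal long memory module that integrates short-term and long-term memory and compresses short-term memories to make retrieval more efficient; (3) a reasoning module that responds to queries and coordinates with the perception and memory modules. The framework simulates human-like cognition so that MLLMs can provide continuous and adaptive service.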

Link: https://arxiv.org/abs/2412.09596
Authors: Pan Zhang,Xiaoyi Dong,Yuhang Cao,Yuhang Zang,Rui Qian,Xilin Wei,Lin Chen,Yifei Li,Junbo Niu,Shuangrui Ding,Qipeng Guo,Haodong Duan,Xin Chen,Han Lv,Zheng Nie,Min Zhang,Bin Wang,Wenwei Zhang,Xinyue Zhang,Jiaye Ge,Wei Li,Jingwen Li,Zhongying Tu,Conghui He,Xingcheng Zhang,Kai Chen,Yu Qiao,Dahua Lin,Jiaqi Wang
Keywords (EN): longstanding research goal, Creating AI systems, similar to human, research goal, interact with environments
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Github Repo: this https URL

Abstract:Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.

[NLP-5] OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages

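[Quick Read]: This paper addresses the lack of dataset diversity and standardization in Named Entity Recognition (NER). The key to the solution is OpenNER 1.0, a standardized collection of openly available NER datasets comprising 34 datasets that span 51 languages and are annotated in a variety of named entity ontologies. The authors correct annotation format issues, convert the datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that supports multilingual and multi-ontology NER research. The paper also provides baselines built on three pretrained multilingual language models to compare recent models and to support future NER research.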

Link: https://arxiv.org/abs/2412.09587
Authors: Chester Palen-Michel,Maxwell Pickering,Maya Kruse,Jonne Sälevä,Constantine Lignos
Keywords (EN): named entity recognition, entity recognition, named entity, NER, named entity ontologies
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 34 datasets spanning 51 languages, annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline models using three pretrained multilingual language models to compare the performance of recent models and facilitate future research in NER.

[NLP-6] DISHONEST: Dissecting misInformation Spread using Homogeneous sOcial NEtworks and Semantic Topic classification

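[Quick Read]: This paper examines the relationship between the spread of misinformation on social media during the COVID-19 pandemic and the "echo chamber" effect. The key to the approach is to study two dimensions of user behavior together: homogeneity of social interactions, measured through Twitter's retweet network, and homogeneity of tweet content, measured through topic modeling. The study develops a new metric that tracks how the diversity of a user's interactions in the social network changes over time. Applied to misinformation-focused data from the pandemic, the analysis shows a correlation between social behavior and tweet content. This finding supports the common intuition about how antisocial users behave and suggests that the correlation holds even in subcommunities already rife with misinformation.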

Link: https://arxiv.org/abs/2412.09578
Authors: Caleb Stam,Emily Saldanha,Mahantesh Halappanavar,Anurag Acharya
Keywords (EN): significant rise, online platforms, echo chambers exists, pandemic resulted, social interactions
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The emergence of the COVID-19 pandemic resulted in a significant rise in the spread of misinformation on online platforms such as Twitter. Oftentimes this growth is blamed on the idea of the “echo chamber.” However, the behavior said to characterize these echo chambers exists in two dimensions. The first is in a user’s social interactions, where they are said to stick with the same clique of like-minded users. The second is in the content of their posts, where they are said to repeatedly espouse homogeneous ideas. In this study, we link the two by using Twitter’s network of retweets to study social interactions and topic modeling to study tweet content. In order to measure the diversity of a user’s interactions over time, we develop a novel metric to track the speed at which they travel through the social network. The application of these analysis methods to misinformation-focused data from the pandemic demonstrates correlation between social behavior and tweet content. We believe this correlation supports the common intuition about how antisocial users behave, and further suggests that it holds even in subcommunities already rife with misinformation.

[NLP-7] DiverseAgent Entropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction

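[Quick Read]: This paper addresses the problem of quantifying uncertainty in the factual parametric knowledge of large language models (LLMs), especially in a black-box setting. Existing methods gauge uncertainty by evaluating self-consistency in responses to the original query, but this does not always capture true uncertainty: a model may answer the original query consistently but wrongly while answering varied questions about the same query correctly, and vice versa. The key to the proposed solution, DiverseAgentEntropy, is to evaluate uncertainty through multi-agent interaction, under the assumption that a model that is certain should consistently recall the answer to the original query across a diverse collection of questions about it. The method also implements an abstention policy that withholds a response when uncertainty is high. It predicts model reliability more accurately, detects hallucinations, and outperforms other self-consistency-based methods.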

Link: https://arxiv.org/abs/2412.09572
Authors: Yu Feng,Phu Mon Htut,Zheng Qi,Wei Xiao,Manuel Mager,Nikolaos Pappas,Kishaloy Halder,Yang Li,Yassine Benajiba,Dan Roth
Keywords (EN): Large Language Models, Large Language, factual parametric knowledge, knowledge of Large, Language Models
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. Existing methods, which gauge a model’s uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. Models might respond consistently to the origin query with a wrong answer, yet respond correctly to varied questions from different perspectives about the same query, and vice versa. In this paper, we propose a novel method, DiverseAgentEntropy, for evaluating a model’s uncertainty using multi-agent interaction under the assumption that if a model is certain, it should consistently recall the answer to the original query across a diverse collection of questions about the same original query. We further implement an abstention policy to withhold responses when uncertainty is high. Our method offers a more accurate prediction of the model’s reliability and further detects hallucinations, outperforming other self-consistency-based methods. Additionally, it demonstrates that existing models often fail to consistently retrieve the correct answer to the same query under diverse varied questions even when knowing the correct answer.
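
A simplified sketch of the underlying idea follows: collect the final answers given under several differently phrased questions about the same query, measure the Shannon entropy of that answer distribution, and abstain when it is high. The paper's method also involves agent interaction, which is omitted here. The function names, the unweighted entropy, and the 0.9-bit threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def answer_or_abstain(answers, threshold=0.9):
    """Return the majority answer, or None (abstain) if entropy exceeds the threshold."""
    if answer_entropy(answers) > threshold:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical answers extracted from agents that each saw a different question
# about the same underlying query.
consistent = ["Paris", "Paris", "Paris", "Lyon"]
scattered = ["Paris", "Lyon", "Nice", "Paris"]

print(answer_or_abstain(consistent))  # low entropy: returns the majority answer
print(answer_or_abstain(scattered))   # high entropy: abstains
```

The threshold trades coverage against reliability: a lower value makes the system abstain more often.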

[NLP-8] JuStRank: Benchmarking LLM Judges for System Ranking

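[Quick Read]: This paper addresses the problem of systematically comparing and choosing among generative AI models and configurations, and in particular how to use LLM-based judges effectively in large-scale, diverse evaluations. The key to the approach is to first validate the quality of the LLM judge itself, and to do so through system-level ranking rather than the traditional instance-level assessment. Specifically, system scores are produced by aggregating judgment scores over multiple outputs of each system, and judge quality is assessed by comparing the resulting system ranking with a human-based ranking. The paper also gives a fine-grained analysis of judge behavior, including decisiveness and bias.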

Link: https://arxiv.org/abs/2412.09569
Authors: Ariel Gera,Odellia Boni,Yotam Perlitz,Roy Bar-Haim,Lilach Eden,Asaf Yehudai
Keywords (EN): rapid progress, progress of generative, systematically compare, compare and choose, numerous models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge’s positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge’s quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
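
The system-level evaluation described here can be sketched as follows. The abstract does not say which aggregation or rank-agreement measure is used, so the mean aggregation and Kendall's tau below are assumptions, and the system names and scores are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def system_scores(judgments):
    """Aggregate per-output judge scores into one score per system (mean, by assumption)."""
    return {system: mean(scores) for system, scores in judgments.items()}


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems (ties count as neither)."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical judge scores for three outputs per system, and hypothetical human scores.
judge = system_scores({
    "sys_a": [0.9, 0.8, 0.7],
    "sys_b": [0.6, 0.5, 0.7],
    "sys_c": [0.2, 0.4, 0.3],
})
human = {"sys_a": 2, "sys_b": 3, "sys_c": 1}

print(kendall_tau(judge, human))  # agreement between judge and human rankings
```

A tau of 1 means the judge orders the systems exactly as the human ranking does, and -1 means the order is fully reversed.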

[NLP-9] Does Representation Matter? Exploring Intermediate Layers in Large Language Models NEURIPS

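[Quick Read]: This paper addresses the question of how to define and evaluate a good representation in large language models (LLMs). The key to the approach is to study the quality of intermediate-layer representations across LLM architectures, including Transformers and State Space Models (SSMs); the authors find that intermediate layers often provide more informative representations for downstream tasks than the final layers. The paper uses a suite of metrics, such as prompt entropy, curvature, and augmentation-invariance, to assess representation quality, and it reveals architectural differences, how representations evolve during training, and how input randomness and prompt length affect each layer. Notably, the entropy of some intermediate layers shows a bimodal pattern, for which the authors consider explanations tied to the training data. These findings shed light on the internal mechanics of LLMs and offer guidance for architectural optimization and training strategies.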

Link: https://arxiv.org/abs/2412.09563
Authors: Oscar Skean,Md Rifat Arefin,Yann LeCun,Ravid Shwartz-Ziv
Keywords (EN): large language models, State Space Models, theoretical understanding, language models, practical applications
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to 2024 NeurIPS Workshop on Machine Learning and Compression

Abstract:Understanding what defines a good representation in large language models (LLMs) is fundamental to both theoretical understanding and practical applications. In this paper, we investigate the quality of intermediate representations in various LLM architectures, including Transformers and State Space Models (SSMs). We find that intermediate layers often yield more informative representations for downstream tasks than the final layers. To measure the representation quality, we adapt and apply a suite of metrics - such as prompt entropy, curvature, and augmentation-invariance - originally proposed in other contexts. Our empirical study reveals significant architectural differences, how representations evolve throughout training, and how factors like input randomness and prompt length affect each layer. Notably, we observe a bimodal pattern in the entropy of some intermediate layers and consider potential explanations tied to training data. Overall, our results illuminate the internal mechanics of LLMs and guide strategies for architectural optimization and training.

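[NLP-10] Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio Deepfake Detection

[Quick Read]: This paper addresses the security risks that audio deepfake technology poses in the financial and social security fields. The key to the solution is an audio deepfake detection method based on a Multi-Frequency Channel Attention (MFCA) mechanism and the 2D Discrete Cosine Transform (DCT). The method converts the audio signal into a mel spectrogram, extracts deep features with MobileNet V2, and uses the MFCA module to weight different frequency channels, which captures fine-grained frequency-domain features in the audio signal and improves the classification of fake audio. Experiments show that the method outperforms traditional methods on accuracy, precision, recall, F1 score, and other metrics, and that it is more robust and generalizes better in complex audio scenarios.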

Link: https://arxiv.org/abs/2412.09467
Authors: Yangguang Feng
Keywords (EN): artificial intelligence technology, audio deepfake detection, audio deepfake, intelligence technology, audio
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:With the rapid development of artificial intelligence technology, the application of deepfake technology in the audio field has gradually increased, resulting in a wide range of security risks. Especially in the financial and social security fields, the misuse of deepfake audios has raised serious concerns. To address this challenge, this study proposes an audio deepfake detection method based on multi-frequency channel attention mechanism (MFCA) and 2D discrete cosine transform (DCT). By processing the audio signal into a melspectrogram, using MobileNet V2 to extract deep features, and combining it with the MFCA module to weight different frequency channels in the audio signal, this method can effectively capture the fine-grained frequency domain features in the audio signal and enhance the classification capability of fake audios. Experimental results show that compared with traditional methods, the model proposed in this study shows significant advantages in accuracy, precision, recall, F1 score and other indicators. Especially in complex audio scenarios, this method shows stronger robustness and generalization capabilities and provides a new idea for audio deepfake detection and has important practical application value. In the future, more advanced audio detection technologies and optimization strategies will be explored to further improve the accuracy and generalization capabilities of audio deepfake detection.

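[NLP-11] The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

[Quick Read]: This paper addresses the critical legal and ethical questions raised by using copyrighted material to train generative language models. The key to the approach is an empirical assessment of how copyrighted material affects the performance of large language models (LLMs) for Norwegian. The study finds that books and newspapers contribute positively when models are evaluated on a diverse set of Norwegian benchmarks, while fiction works may reduce performance. These results could inform a compensation scheme for authors whose works contribute to AI development.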

Link: https://arxiv.org/abs/2412.09460
Authors: Javier de la Rosa,Vladislav Mikhailov,Lemei Zhang,Freddy Wetjen,David Samuel,Peng Liu,Rolv-Arild Braaten,Petter Mæhlum,Magnus Breder Birkenes,Andrey Kutuzov,Tita Enstad,Svein Arne Brygfjeld,Jon Atle Gulla,Stephan Oepen,Erik Velldal,Wilfred Østgulen,Liljia Øvrelid,Aslak Sira Myhre
Keywords (EN): raises critical legal, training generative language, models raises critical, generative language models, language models raises
Categories: Computation and Language (cs.CL)
Comments: pre-print, under review

Abstract:The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

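[NLP-12] From Intention To Implementation: Automating Biomedical Research via LLMs

[Quick Read]: This paper addresses the growing labor intensity of conventional biomedical research, driven by the exponential growth of scientific literature and datasets. The key to the solution is BioResearcher, an end-to-end automated system designed to streamline the entire biomedical research process for dry lab experiments. BioResearcher uses a modular multi-agent architecture that integrates specialized agents for search, literature processing, experimental design, and programming. By decomposing complex tasks into logically related sub-tasks and applying a hierarchical learning approach, it handles the challenges of multidisciplinary requirements and logical complexity. The system also includes an LLM-based reviewer for in-process quality control and introduces new evaluation metrics for the quality and degree of automation of experimental protocols.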

Link: https://arxiv.org/abs/2412.09429
Authors: Yi Luo,Linghang Shi,Yihao Li,Aobo Zhuang,Yeyun Gong,Ling Liu,Lin Chen
Keywords (EN): increasingly labor-intensive due, Large Language Models, Conventional biomedical research, Conventional biomedical, increasingly labor-intensive
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Conventional biomedical research is increasingly labor-intensive due to the exponential growth of scientific literature and datasets. Artificial intelligence (AI), particularly Large Language Models (LLMs), has the potential to revolutionize this process by automating various steps. Still, significant challenges remain, including the need for multidisciplinary expertise, logicality of experimental design, and performance measurements. This paper introduces BioResearcher, the first end-to-end automated system designed to streamline the entire biomedical research process involving dry lab experiments. BioResearcher employs a modular multi-agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming. By decomposing complex tasks into logically related sub-tasks and utilizing a hierarchical learning approach, BioResearcher effectively addresses the challenges of multidisciplinary requirements and logical complexity. Furthermore, BioResearcher incorporates an LLM-based reviewer for in-process quality control and introduces novel evaluation metrics to assess the quality and automation of experimental protocols. BioResearcher successfully achieves an average execution success rate of 63.07% across eight previously unmet research objectives. The generated protocols averagely outperform typical agent systems by 22.0% on five quality metrics. The system demonstrates significant potential to reduce researchers’ workloads and accelerate biomedical discoveries, paving the way for future innovations in automated research systems.

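[NLP-13] Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

[Quick Read]: This paper addresses how to evaluate the effectiveness and pedagogical ability of current state-of-the-art large language models (LLMs) as AI tutors in educational dialogues. The key to the solution is a unified evaluation taxonomy grounded in learning sciences principles, with eight pedagogical dimensions, for assessing the pedagogical value of LLM-powered tutor responses to student mistakes or confusion in the mathematical domain. The paper also releases the MRBench benchmark, which contains 192 conversations and 1,596 responses with gold annotations for the eight dimensions. It assesses the reliability of the Prometheus2 LLM as an evaluator and analyzes each tutor's pedagogical abilities, showing which LLMs make good tutors and which are better suited as question-answering systems.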

Link: https://arxiv.org/abs/2412.09416
Authors: Kaushal Kumar Maurya,KV Aditya Srivatsa,Kseniia Petukhova,Ekaterina Kochmar
Keywords (EN): large language models, investigate whether current, large language, language models, educational dialogues
Categories: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous efforts towards evaluation have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, which is designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusion in the mathematical domain. We release MRBench – a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, providing gold annotations for eight pedagogical dimensions. We assess reliability of the popular Prometheus2 LLM as an evaluator and analyze each tutor’s pedagogical abilities, highlighting which LLMs are good tutors and which ones are more suitable as question-answering systems. We believe that the presented taxonomy, benchmark, and human-annotated labels will streamline the evaluation process and help track the progress in AI tutors’ development.

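[NLP-14] Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy

[Quick Read]: This paper addresses the data scarcity that less-represented languages such as Luxembourgish face in language model development, a problem made worse by Luxembourg's multilingual context. The key to the solution is a text generation model based on the T5 architecture that combines the limited Luxembourgish data with equal amounts of German and French data, using cross-lingual transfer learning to improve performance. The authors hypothesize that this multilingual training will outperform both monolingual and large multilingual models. To test the hypothesis, the paper introduces LuxGen, the first text generation benchmark for Luxembourgish.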

Link: https://arxiv.org/abs/2412.09415
Authors: Alistair Plum,Tharindu Ranasinghe,Christoph Purschke
Keywords (EN): Luxembourgish, paper addresses, addresses the challenges, challenges in developing, Luxembourg multilingual context
Categories: Computation and Language (cs.CL)
Comments: Accepted at VarDial 2025

Abstract:This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish. Despite its active development, Luxembourgish faces a digital data scarcity, exacerbated by Luxembourg’s multilingual context. We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts, in terms of size and type, of German and French data. We hypothesise that a model trained on Luxembourgish, German, and French will improve the model’s cross-lingual transfer learning capabilities and outperform monolingual and large multilingual models. To verify this, the study at hand explores whether multilingual or monolingual training is more beneficial for Luxembourgish language generation. For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.

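[NLP-15] Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems

[Quick Read]: This paper addresses how to implement slow-thinking reasoning systems similar to o1. The key to the solution is an "imitate, explore, and self-improve" framework: the reasoning model is first fine-tuned on distilled long-form thought data so that it can enter a slow-thinking mode, then explores challenging problems by generating multiple rollouts, and finally improves itself by iteratively refining its training dataset. Experiments show that the approach achieves performance competitive with industry-level reasoning systems on three challenging benchmarks.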

Link: https://arxiv.org/abs/2412.09413
Authors: Yingqian Min,Zhipeng Chen,Jinhao Jiang,Jie Chen,Jia Deng,Yiwen Hu,Yiru Tang,Jiapeng Wang,Xiaoxue Cheng,Huatong Song,Wayne Xin Zhao,Zheng Liu,Zhongyuan Wang,Ji-Rong Wen
Keywords (EN): demonstrated remarkable capabilities, complex reasoning tasks, solving complex reasoning, reasoning systems, demonstrated remarkable
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report on Slow Thinking with LLMs: Part II

Abstract:Recently, slow-thinking reasoning systems, such as o1, have demonstrated remarkable capabilities in solving complex reasoning tasks. These systems typically engage in an extended thinking process before responding to a query, allowing them to generate more thorough, accurate, and well-reasoned solutions. These systems are primarily developed and maintained by industry, with their core techniques not publicly disclosed. In response, an increasing number of studies from the research community aim to explore the technical foundations underlying these powerful reasoning systems. Building on these prior efforts, this paper presents a reproduction report on implementing o1-like reasoning systems. We introduce an “imitate, explore, and self-improve” framework as our primary technical approach to train the reasoning model. In the initial phase, we use distilled long-form thought data to fine-tune the reasoning model, enabling it to invoke a slow-thinking mode. The model is then encouraged to explore challenging problems by generating multiple rollouts, which can result in increasingly more high-quality trajectories that lead to correct answers. Furthermore, the model undergoes self-improvement by iteratively refining its training dataset. To verify the effectiveness of this approach, we conduct extensive experiments on three challenging benchmarks. The experimental results demonstrate that our approach achieves competitive performance compared to industry-level reasoning systems on these benchmarks.

[NLP-16] Neural Text Normalization for Luxembourgish using Real-Life Variation Data

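[Quick Read]: This paper addresses the orthographic variation found in Luxembourgish texts, which results from the absence of a fully-fledged standard written variety. The key to the solution is a set of sequence-to-sequence normalization models based on the ByT5 and mT5 architectures, trained on word-level variation data drawn from real-life texts. A fine-grained, linguistically motivated evaluation shows that a sequence model trained on real-life variation data is an effective approach to text normalization for Luxembourgish.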

Link: https://arxiv.org/abs/2412.09383
Authors: Anne-Marie Lutgen,Alistair Plum,Christoph Purschke,Barbara Plank
Keywords (EN): fully-fledged standard variety, Orthographic variation, Luxembourgish texts due, standard variety, fully-fledged standard
Categories: Computation and Language (cs.CL)
Comments: Accepted at VarDial 2025

Abstract:Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

[NLP-17] From Bench to Bedside: A Review of Clinical Trials in Drug Discovery and Development

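[Quick read]: This paper examines the central role of clinical trials in drug development and the main challenges they face. Its key points are: 1) a thorough understanding of the trial stages (the characteristics of Phases I to IV and how they relate); 2) responses to challenges such as ethical issues, difficulty recruiting subjects, and insufficient diversity and representativeness; 3) the use of innovative technologies (artificial intelligence, big data, and digitalization) to improve the efficiency of trial design and implementation and the quality of data; 4) an outlook on how emerging therapies (such as gene therapy and immunotherapy) affect trial design, and on the importance of regulatory reform and global collaboration. Together, these points underpin the core role of clinical trials in drug development and support the development and clinical application of innovative drugs.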

Link: https://arxiv.org/abs/2412.09378
Authors: Tianyang Wang, Ming Liu, Benji Peng, Xinyuan Song, Charles Zhang, Xintian Sun, Qian Niu, Junyu Liu, Silin Chen, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Yunze Wang, Yichao Zhang, Cheng Fei, Lawrence KQ Yan
Keywords: Clinical trials, drug development process, drug development, Clinical, bridging the gap
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 11 pages

Abstract:Clinical trials are an indispensable part of the drug development process, bridging the gap between basic research and clinical application. During the development of new drugs, clinical trials are used not only to evaluate the safety and efficacy of the drug but also to explore its dosage, treatment regimens, and potential side effects. This review discusses the various stages of clinical trials, including Phase I (safety assessment), Phase II (preliminary efficacy evaluation), Phase III (large-scale validation), and Phase IV (post-marketing surveillance), highlighting the characteristics of each phase and their interrelationships. Additionally, the paper addresses the major challenges encountered in clinical trials, such as ethical issues, subject recruitment difficulties, diversity and representativeness concerns, and proposes strategies for overcoming these challenges. With the advancement of technology, innovative technologies such as artificial intelligence, big data, and digitalization are gradually transforming clinical trial design and implementation, improving trial efficiency and data quality. The article also looks forward to the future of clinical trials, particularly the impact of emerging therapies such as gene therapy and immunotherapy on trial design, as well as the importance of regulatory reforms and global collaboration. In conclusion, the core role of clinical trials in drug development will continue to drive the progress of innovative drug development and clinical treatment.

[NLP-18] Word Sense Linking: Disambiguating Outside the Sandbox

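[Quick read]: This paper addresses the difficulty of applying Word Sense Disambiguation (WSD) in practice: the standard WSD task assumes that all spans to disambiguate have already been identified and that all candidate senses are known, assumptions that are hard to meet in real applications. The paper therefore proposes a new task, Word Sense Linking (WSL), in which a system must both identify the spans to disambiguate and link them to the most suitable sense. It tackles WSL with a Transformer-based architecture and evaluates performance while progressively relaxing the WSD assumptions, aiming to make it easier to integrate lexical semantics into downstream applications.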

Link: https://arxiv.org/abs/2412.09370
Authors: Andrei Stefan Bejgu, Edoardo Barba, Luigi Procopio, Alberte Fernández-Castro, Roberto Navigli
Keywords: Word Sense Disambiguation, Sense Disambiguation, Word Sense Linking, Word Sense, task called Word
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Word Sense Disambiguation (WSD) is the task of associating a word in a given context with its most suitable meaning among a set of possible candidates. While the task has recently witnessed renewed interest, with systems achieving performances above the estimated inter-annotator agreement, at the time of writing it still struggles to find downstream applications. We argue that one of the reasons behind this is the difficulty of applying WSD to plain text. Indeed, in the standard formulation, models work under the assumptions that a) all the spans to disambiguate have already been identified, and b) all the possible candidate senses of each span are provided, both of which are requirements that are far from trivial. In this work, we present a new task called Word Sense Linking (WSL) where, given an input text and a reference sense inventory, systems have to both identify which spans to disambiguate and then link them to their most suitable meaning. We put forward a transformer-based architecture for the task and thoroughly evaluate both its performance and those of state-of-the-art WSD systems scaled to WSL, iteratively relaxing the assumptions of WSD. We hope that our work will foster easier integration of lexical semantics into downstream applications.

[NLP-19] Falcon-UI: Understanding GUI Before Following User Instructions

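[Quick read]: This paper addresses the tight coupling, in existing Graphical User Interface (GUI) agents, between understanding the GUI environment and following user instructions, and in particular the neglect of GUI understanding. The key to the solution is an instruction-free GUI navigation dataset, the Insight-UI Dataset, which is generated automatically by simulating multiple platforms and resolutions to strengthen the model's understanding of GUI environments. The paper argues that GUI operations can be learned independently, and it develops the GUI agent model Falcon-UI through pretraining followed by fine-tuning, validating the link between GUI context comprehension and agent performance.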

Link: https://arxiv.org/abs/2412.09362
Authors: Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji
Keywords: Graphical User Interface, Graphical User, Pursuing human-like interaction, Pursuing human-like, GUI
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 14 figures

Abstract:Pursuing human-like interaction for Graphical User Interface (GUI) agents requires understanding the GUI context and following user instructions. However, existing works typically couple these two aspects and focus more on instruct-following abilities, while ignoring the importance of understanding the GUI context. In this paper, we introduce an instruction-free GUI navigation dataset, termed Insight-UI Dataset, to enhance model comprehension of GUI environments. Insight-UI Dataset is automatically generated from the Common Crawl corpus, simulating various platforms – including iOS, Android, Windows, and Linux – across multiple resolutions on 312K domains. Although GUI interactions vary by context, diverse interfaces share common internal patterns, such as clicking an item to view its details. It implies the feasibility of independent GUI operation learning, followed by joint optimization with instruction tuning. Thereby, we develop the GUI agent model Falcon-UI, which is initially pretrained on Insight-UI Dataset and subsequently fine-tuned on Android and Web GUI datasets, including AITW, AITZ, Android Control, and Mind2Web. With 7 billion parameters, Falcon-UI achieves accuracy comparable to the 72 billion-parameter Qwen2VL on AITZ, validating the alignment between GUI context comprehension and agent performance. Our code and dataset will be open-sourced.

[NLP-20] Causal Graphical Models for Vision-Language Compositional Understanding

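[Quick read]: This paper addresses the weakness of Vision-Language Models (VLMs) in understanding the compositional properties of human language, which leads to poor performance on compositional tasks. The key to the solution is a Causal Graphical Model (CGM), built with a dependency parser, that models the dependency relations among textual and visual tokens. Unlike standard autoregressive or parallel prediction, the decoder's generative process follows the partial order of the CGM structure, which encourages it to learn only the main causal dependencies in a sentence and to discard spurious correlations. Experiments on five compositional benchmarks show that the method significantly outperforms state-of-the-art compositional approaches and also improves over methods trained on much larger datasets.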

Link: https://arxiv.org/abs/2412.09353
Authors: Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
Keywords: Recent work, bag of words, struggle to fully, human language, Causal Graphical Model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Abstract:Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a “bag of words”. As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships in order to be solved. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned by the VLM visual encoder. Differently from standard autoregressive or parallel predictions, our decoder’s generative process is partially-ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence discarding spurious correlations. Using extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all the state-of-the-art compositional approaches by a large margin, and it also improves over methods trained using much larger datasets.
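
To make the idea of a partially ordered generative process concrete, the sketch below groups the tokens of a sentence into levels from a dependency tree, so that each token comes after its syntactic head while tokens on the same level have no ordering between them. The function name, the hand-written head map, and the choice to follow head-to-dependent edges are illustrative assumptions; the abstract does not specify the exact structure of the paper's CGM.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def generation_levels(heads: Dict[str, Optional[str]]) -> List[List[str]]:
    """Group tokens into levels: a token appears one level below its head.

    `heads` maps each token to its syntactic head (None marks the root).
    Tokens within one level are mutually unordered, which is what makes
    the resulting order partial rather than strictly left to right.
    """
    children = defaultdict(list)
    roots = []
    for token, head in heads.items():
        if head is None:
            roots.append(token)
        else:
            children[head].append(token)

    levels = []
    frontier = roots
    while frontier:
        levels.append(frontier)
        frontier = [child for token in frontier for child in children[token]]
    return levels


# Hand-written dependency heads for "a dog chases the cat" (illustrative parse).
heads = {"a": "dog", "dog": "chases", "chases": None, "the": "cat", "cat": "chases"}
levels = generation_levels(heads)
```

Here the verb is produced first, its two arguments next, and the determiners last, in contrast to the left-to-right order of a standard autoregressive decoder.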

[NLP-21] Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain COLING2025

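[Quick read]: This paper addresses the poor performance of generic pre-trained neural networks in specialized domains such as finance and insurance, caused by a domain mismatch between training data and downstream tasks. The key to the solution is pre-training on domain-relevant documents to improve results on a specific task, here named-entity recognition (NER). Experiments on Payslips, a dataset of anonymized insurance-related financial documents, show that the strategy works and that a smaller, faster model can achieve competitive results.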

Link: https://arxiv.org/abs/2412.09341
Authors: Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
Keywords: Generic pre-trained neural, pre-trained neural networks, Generic pre-trained, finance and insurance, produce good results
Categories: Computation and Language (cs.CL)
Comments: Coling 2025 workshop (FinNLP)

Abstract:Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called Payslips. Moreover, we show that we can achieve competitive results using a smaller and faster model.

[NLP-22] Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction

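[Quick read]: This paper asks how well large language models (LLMs) can simulate the linguistic features of early child-caregiver interaction. Using both static and interactive benchmarking methods, it evaluates state-of-the-art LLMs such as Llama 3 and GPT-4o. The models can approximate child-caregiver dialogues at the word and utterance level, but they struggle to reproduce discursive patterns, exaggerate alignment, and fall short of human-level diversity. The broader goal is to initiate the development of a comprehensive benchmark for LLMs in child-oriented applications.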

Link: https://arxiv.org/abs/2412.09318
Authors: Jing Liu, Abdellah Fourtassi
Keywords: remains largely unexplored, simulate early child-adult, early child-adult interactions, child-adult interactions remains, interactions remains largely
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:LLMs can generate human-like dialogues, yet their ability to simulate early child-adult interactions remains largely unexplored. In this paper, we examined how effectively LLMs can capture the distinctive features of child-caregiver language in interaction, using both static and interactive benchmarking methods. We found that state-of-the-art LLMs like Llama 3 and GPT-4o can approximate child-caregiver dialogues at the word and utterance level, but they struggle to reproduce the child and caregiver’s discursive patterns, exaggerate alignment, and fail to reach the level of diversity shown by humans. The broader goal of this work is to initiate the development of a comprehensive benchmark for LLMs in child-oriented applications.

[NLP-23] CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs

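[Quick read]: This paper addresses the computational cost of deploying large language models (LLMs) on resource-constrained devices, specifically extreme compression through post-training quantization (PTQ). The key to the solution is a new technique, Channel-Relaxed Vector Quantization (CRVQ), which significantly improves on PTQ baselines through two innovations: (1) carefully selecting and reordering a very small subset of critical weight channels; (2) using multiple codebooks to relax the constraint on those critical channels. At the cost of only minimal additional bits, CRVQ achieves a 38.9% improvement over the strongest current sub-2-bit PTQ baseline, moving closer to lossless 1-bit compression, and it offers flexible customization of quantization bit-width and performance to suit diverse hardware platforms.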

Link: https://arxiv.org/abs/2412.09282
Authors: Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
Keywords: Powerful large language, large language models, Powerful large, lower computational costs, language models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 5 figures, 4 tables

Abstract:Powerful large language models (LLMs) are increasingly expected to be deployed with lower computational costs, enabling their capabilities on resource-constrained devices. Post-training quantization (PTQ) has emerged as a star approach to achieve this ambition, with best methods compressing weights to less than 2 bit on average. In this paper, we propose Channel-Relaxed Vector Quantization (CRVQ), a novel technique that significantly improves the performance of PTQ baselines at the cost of only minimal additional bits. This state-of-the-art extreme compression method achieves its results through two key innovations: (1) carefully selecting and reordering a very small subset of critical weight channels, and (2) leveraging multiple codebooks to relax the constraint of critical channels. With our method, we demonstrate a 38.9% improvement over the current strongest sub-2-bit PTQ baseline, enabling nearer lossless 1-bit compression. Furthermore, our approach offers flexible customization of quantization bit-width and performance, providing a wider range of deployment options for diverse hardware platforms.

[NLP-24] Learning to Solve Domain-Specific Calculation Problems with Knowledge-Intensive Programs Generator

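[Quick read]: This paper addresses the shortcomings of domain large language models (Domain LLMs) on knowledge-intensive calculation problems that involve complex domain-specific rules and knowledge documents. The key to the solution is a pipeline called Knowledge-Intensive Programs Generator (KIPG), which generates knowledge-intensive programs from domain-specific documents. For each query, KIPG extracts the key variables and computes the outcomes that depend on domain knowledge with the generated programs; iterative preference alignment improves the code generator's logical consistency with the domain knowledge. Experiments in the legal domain validate the pipeline and show that the code generator adapts to other domains without training on the new knowledge.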

Link: https://arxiv.org/abs/2412.09280
Authors: Chengyuan Liu, Shihang Wang, Lizhi Qing, Jun Lin, Ji Zhang, Fei Wu, Kun Kuang
Keywords: Large Language Models, Domain Large Language, Language Models, Large Language, domain-specific tasks based
Categories: Computation and Language (cs.CL)
Comments: Under review

Abstract:Domain Large Language Models (LLMs) are developed for domain-specific tasks based on general LLMs. But it still requires professional knowledge to facilitate the expertise for some domain-specific tasks. In this paper, we investigate into knowledge-intensive calculation problems. We find that the math problems to be challenging for LLMs, when involving complex domain-specific rules and knowledge documents, rather than simple formulations of terminologies. Therefore, we propose a pipeline to solve the domain-specific calculation problems with Knowledge-Intensive Programs Generator more effectively, named as KIPG. It generates knowledge-intensive programs according to the domain-specific documents. For each query, key variables are extracted, then outcomes which are dependent on domain knowledge are calculated with the programs. By iterative preference alignment, the code generator learns to improve the logic consistency with the domain knowledge. Taking legal domain as an example, we have conducted experiments to prove the effectiveness of our pipeline, and extensive analysis on the modules. We also find that the code generator is also adaptable to other domains, without training on the new knowledge.

[NLP-25] Towards Understanding the Robustness of LLM-based Evaluations under Perturbations

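[Quick read]: This paper addresses the inability of traditional evaluation metrics such as BLEU and ROUGE to capture the nuanced qualities of generated text, especially when there is no single ground truth. It explores the potential of large language models (LLMs), specifically Google Gemini 1, as automatic evaluators for non-standardized metrics in summarization and dialogue tasks. Using several prompting strategies, the study compares LLM judgments with human judgments on the SummEval and USR datasets, asking the model for both a score and a justification, and it tests the robustness of the LLM evaluator with perturbed inputs. The results suggest that although LLMs show promise, their alignment with human evaluators is limited and they are not robust to perturbations, so significant improvement is needed before they can serve as reliable standalone evaluators for subjective metrics.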

Link: https://arxiv.org/abs/2412.09269
Authors: Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma
Keywords: ROUGE fall short, Traditional evaluation metrics, single ground truth, BLEU and ROUGE, Traditional evaluation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICON 2024

Abstract:Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations and significant improvements are required for their standalone use as reliable evaluators for subjective metrics.

[NLP-26] First Train to Generate then Generate to Train: UnitedSynT5 for Few-Shot NLI

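[Quick read]: This paper addresses the limitations of existing datasets for natural language inference (NLI), aiming to improve model performance by increasing dataset diversity and complexity. The key to the solution is UnitedSynT5, a new approach based on synthetic data augmentation. It uses a T5-based generator to synthesize additional premise-hypothesis pairs, cleans them rigorously, and integrates them into training within the existing EFL framework. This augmentation strategy improves performance, surpassing previous SOTA models on several NLI datasets (SNLI, E-SNLI, and MultiNLI).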

Link: https://arxiv.org/abs/2412.09263
Authors: Sourav Banerjee, Anush Mahajan, Ayushi Agarwal, Eishkaran Singh
Keywords: Natural Language Inference, Stanford Natural Language, Language Inference, Natural Language, Entailment Few-Shot Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages

Abstract:Natural Language Inference (NLI) tasks require identifying the relationship between sentence pairs, typically classified as entailment, contradiction, or neutrality. While the current state-of-the-art (SOTA) model, Entailment Few-Shot Learning (EFL), achieves a 93.1% accuracy on the Stanford Natural Language Inference (SNLI) dataset, further advancements are constrained by the dataset’s limitations. To address this, we propose a novel approach leveraging synthetic data augmentation to enhance dataset diversity and complexity. We present UnitedSynT5, an advanced extension of EFL that leverages a T5-based generator to synthesize additional premise-hypothesis pairs, which are rigorously cleaned and integrated into the training data. These augmented examples are processed within the EFL framework, embedding labels directly into hypotheses for consistency. We train a GTR-T5-XL model on this expanded dataset, achieving a new benchmark of 94.7% accuracy on the SNLI dataset, 94.01% accuracy on the E-SNLI dataset, and 92.57% accuracy on the MultiNLI dataset, surpassing the previous SOTA models. This research demonstrates the potential of synthetic data augmentation in improving NLI models, offering a path forward for further advancements in natural language understanding tasks.
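
The phrase "embedding labels directly into hypotheses" can be read in more than one way. One plausible reading, sketched below, turns each three-way NLI example into three binary examples, one per candidate label, with the label verbalized inside the hypothesis. The template string and the function name are hypothetical; the abstract does not give the actual wording used by EFL or UnitedSynT5.

```python
from typing import Dict, List

LABELS = ("entailment", "contradiction", "neutral")


def embed_label(premise: str, hypothesis: str, gold_label: str) -> List[Dict[str, object]]:
    """Create one binary example per candidate label.

    The target is 1 only for the example whose embedded label matches
    the gold label, so exactly one of the three examples is positive.
    """
    examples = []
    for label in LABELS:
        examples.append(
            {
                "premise": premise,
                # Hypothetical template; the real verbalization is not in the abstract.
                "hypothesis": f"{hypothesis} The relation is {label}.",
                "target": int(label == gold_label),
            }
        )
    return examples


examples = embed_label(
    premise="A man is playing a guitar on stage.",
    hypothesis="A person is performing music.",
    gold_label="entailment",
)
```

Synthetic premise-hypothesis pairs from a generator could pass through the same function, so that original and augmented data share one input format.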

[NLP-27] Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by Utilizing Generative LLMs COLING2025

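[Quick read]: This paper addresses stylistic bias in satire detection: when training data lack diversity, detection performance suffers. The key to the solution is using generative large language models to reduce bias in the training data, improving the robustness and generalizability of models in cross-domain (irony detection) and cross-lingual (English) settings. The results show that the method is effective for satire and irony detection in Turkish and English, but its impact on causal language models such as Llama-3.1 is limited. The paper also presents the Turkish Satirical News Dataset, with case studies on classification, debiasing, and explainability.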

Link: https://arxiv.org/abs/2412.09247
Authors: Asli Umay Ozturk, Recep Firat Cekinel, Asli Umay Ozturk
Keywords: combating misinformation online, accurately extracting opinions, misinformation online, essential for accurately, accurately extracting
Categories: Computation and Language (cs.CL)
Comments: Accepted to BUCC2025 Workshop @COLING2025

Abstract:Satire detection is essential for accurately extracting opinions from textual data and combating misinformation online. However, the lack of diverse corpora for satire leads to the problem of stylistic bias which impacts the models’ detection performances. This study proposes a debiasing approach for satire detection, focusing on reducing biases in training data by utilizing generative large language models. The approach is evaluated in both cross-domain (irony detection) and cross-lingual (English) settings. Results show that the debiasing method enhances the robustness and generalizability of the models for satire and irony detection tasks in Turkish and English. However, its impact on causal language models, such as Llama-3.1, is limited. Additionally, this work curates and presents the Turkish Satirical News Dataset with detailed human annotations, with case studies on classification, debiasing, and explainability.

[NLP-28] CleanComedy: Creating Friendly Humor through Generative Techniques

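[Quick read]: This paper addresses the limited resources and low quality of existing datasets for humor generation, in particular the toxicity and duplication found in available humor corpora. The key to the solution is CleanComedy, a specialized, partially annotated, toxicity-filtered corpus of English and Russian jokes collected from various sources. The paper assesses its data filtering approach with a survey of humor and toxicity levels across joke groups, and it studies progress in computational humor generation by comparing human-written jokes with jokes from generative models, including baselines trained on the CleanComedy datasets.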

Link: https://arxiv.org/abs/2412.09203
Authors: Dmitry Vikhorev, Daria Galimzianova, Svetlana Gorovaia, Elizaveta Zhemchuzhina, Ivan P. Yamshchikov
Keywords: natural language processing, language processing due, challenging task, task in natural, processing due
Categories: Computation and Language (cs.CL)

Abstract:Humor generation is a challenging task in natural language processing due to limited resources and the quality of existing datasets. Available humor language resources often suffer from toxicity and duplication, limiting their effectiveness for training robust models. This paper proposes CleanComedy, a specialized, partially annotated toxicity-filtered corpus of English and Russian jokes collected from various sources. We study the effectiveness of our data filtering approach through a survey on humor and toxicity levels in various joke groups. In addition, we study advances in computer humor generation by comparing jokes written by humans with various groups of generative jokes, including our baseline models trained on the CleanComedy datasets.

[NLP-29] ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks AAAI2025

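[Quick read]: This paper addresses the failure of large language models (LLMs) to follow formatting requirements when generating structured content, a capability it calls format faithfulness. The key to the solution is Reinforce Format Faithfulness (ReFF), which exploits the decidable nature of formats to help LLMs produce output formatted as instructed without compromising general quality. Without any annotated data, ReFF substantially improves the format faithfulness rate (for example, from 21.6% for the original LLaMA3 to 95.0% on a caption segmentation task) while keeping general quality comparable (F1 from 47.3 to 46.4). Combined with labeled data, it improves both format faithfulness (from 21.6% to 75.5%) and general quality (F1 from 47.3 to 61.6).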

Link: https://arxiv.org/abs/2412.09173
Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Haoyu Wen, Wei Su, Boao Qian, Yuhang Guo
Keywords: large language models, format faithfulness, generate well-structured content, language models, format
Categories: Computation and Language (cs.CL)
Comments: Accepted to AAAI 2025

Abstract:Following formatting instructions to generate well-structured content is a fundamental yet often unmet capability for large language models (LLMs). To study this capability, which we refer to as format faithfulness, we present FormatBench, a comprehensive format-related benchmark. Compared to previous format-related benchmarks, FormatBench involves a greater variety of tasks in terms of application scenes (traditional NLP tasks, creative works, autonomous agency tasks), human-LLM interaction styles (single-turn instruction, multi-turn chat), and format types (inclusion, wrapping, length, coding). Moreover, each task in FormatBench is attached with a format checker program. Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiency in format faithfulness. By virtue of the decidable nature of formats, we propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality. Without any annotated data, ReFF can substantially improve the format faithfulness rate (e.g., from 21.6% in original LLaMA3 to 95.0% on caption segmentation task), while keep the general quality comparable (e.g., from 47.3 to 46.4 in F1 scores). Combined with labeled training data, ReFF can simultaneously improve both format faithfulness (e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from 47.3 to 61.6 in F1 scores). We further offer an interpretability analysis to explain how ReFF improves both format faithfulness and general quality.
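
Because a format requirement is decidable, a small program can check it, and the check can double as a reward signal. The sketch below shows a checker for a "wrapping" format, the faithfulness rate it induces over a batch of outputs, and a binary reward. The `<answer>` tag convention, the function names, and the reward values are assumptions made for illustration; they are not taken from FormatBench or ReFF.

```python
import re
from typing import List

# Assumed wrapping format: the whole output must be "<answer>...</answer>".
ANSWER_PATTERN = re.compile(r"<answer>.+</answer>", re.DOTALL)


def follows_format(output: str) -> bool:
    """Return True if the stripped output is fully wrapped in <answer> tags."""
    return ANSWER_PATTERN.fullmatch(output.strip()) is not None


def format_faithfulness_rate(outputs: List[str]) -> float:
    """Fraction of outputs that pass the format checker."""
    if not outputs:
        return 0.0
    return sum(follows_format(o) for o in outputs) / len(outputs)


def format_reward(output: str) -> float:
    """Binary reward derived from the checker (values are an assumption)."""
    return 1.0 if follows_format(output) else -1.0


outputs = ["<answer>42</answer>", "42", " <answer>x</answer>\n"]
rate = format_faithfulness_rate(outputs)
```

A reward of this kind needs no annotated data, which is consistent with the annotation-free setting the abstract reports.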

[NLP-30] When Text Embedding Meets Large Language Model: A Comprehensive Survey

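[Quick read]: This paper addresses how to use text embeddings effectively in natural language processing (NLP) in the era of large language models (LLMs). It organizes the interplay between LLMs and text embeddings into three themes: (1) LLM-augmented text embedding, which enhances traditional embedding methods with LLMs; (2) LLMs as text embedders, which uses their innate capabilities to generate embeddings; (3) text embedding understanding with LLMs, which uses LLMs to analyze and interpret embeddings. With this taxonomy the paper gives a systematic overview, discusses the new challenges that arise in the LLM era, and points out prospective directions for text embedding.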

Link: https://arxiv.org/abs/2412.09165
Authors: Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, Richong Zhang
Keywords: natural language processing, Text embedding, deep learning era, Text, driving advancements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Work in progress

Abstract:Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled using generative paradigms and leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications, such as semantic matching, clustering, and information retrieval, continue to rely on text embeddings for their efficiency and effectiveness. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders, utilizing their innate capabilities for embedding generation; and (3) Text embedding understanding with LLMs, leveraging LLMs to analyze and interpret embeddings. By organizing these efforts based on interaction patterns rather than specific downstream applications, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that persisted in the pre-LLM era with pre-trained language models (PLMs) and explore the emerging obstacles brought forth by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.

[NLP-31] PolyIPA – Multilingual Phoneme-to-Grapheme Conversion Model

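[Quick read]: This paper addresses multilingual phoneme-to-grapheme conversion, with applications in multilingual name transliteration, onomastic research, and information retrieval. The key to the solution is the PolyIPA model, which relies on two helper models for data augmentation: IPA2vec, for finding soundalikes across languages, and similarIPA, for handling variations in phonetic notation. On a test set spanning multiple languages and writing systems, PolyIPA achieves a low mean Character Error Rate (CER: 0.055) and a high character-level BLEU score (0.914), with particularly strong results on languages with shallow orthographies. Beam search improves practical utility further: considering the top-3 candidates reduces the effective error rate by 52.7% (to CER: 0.026).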

Link: https://arxiv.org/abs/2412.09102
Authors: Davor Lauc
Keywords: paper presents PolyIPA, conversion model designed, onomastic research, presents PolyIPA, information retrieval
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration, onomastic research, and information retrieval. The model leverages two helper models developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. Evaluated on a test set that spans multiple languages and writing systems, the model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies. The implementation of beam search further improves practical utility, with top-3 candidates reducing the effective error rate by 52.7% (to CER: 0.026), demonstrating the model’s effectiveness for cross-linguistic applications.
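
For readers unfamiliar with the metric, Character Error Rate is conventionally the character-level edit distance between a hypothesis and a reference, divided by the reference length. The sketch below computes it and also a best-of-top-k variant. Treating the "effective error rate" as the lowest CER among the top-3 beam candidates is an assumption here, since the abstract does not define the term, and the example names are made up.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic programme, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free when equal)
                )
            )
        prev = cur
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(hypothesis, reference) / len(reference)


def top_k_cer(candidates: List[str], reference: str, k: int = 3) -> float:
    """Lowest CER among the first k candidates (assumed reading of 'effective')."""
    return min(cer(c, reference) for c in candidates[:k])


single_best = cer("Jon", "John")
best_of_three = top_k_cer(["Jon", "John", "Joan"], "John")

# Consistency check on the reported figures: a 52.7% reduction from 0.055.
reduced = round(0.055 * (1 - 0.527), 3)
```

The last line confirms that the two reported numbers agree: 0.055 × (1 − 0.527) ≈ 0.026.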

[NLP-32] Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion COLING2025

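[Quick read]: This paper addresses the fact that large language models (LLMs) perform worse than conventional methods on knowledge graph completion (KGC). The key to the solution is a new instruction-tuning-based method, FtG. FtG adopts a filter-then-generate paradigm and reformulates KGC as a multiple-choice question, harnessing the capability of LLMs while mitigating hallucination. The paper also designs a flexible ego-graph serialization prompt and uses a structure-text adapter to couple structural and textual information in a contextualized manner. In experiments, FtG achieves substantial gains over existing state-of-the-art methods.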

Link: https://arxiv.org/abs/2412.09094
Authors: Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng
Keywords: Large Language Models, natural language processing, Large Language, Language Models, superior semantic comprehension
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: COLING 2025 Main Conference

Abstract:Large Language Models (LLMs) present massive inherent knowledge and superior semantic comprehension capability, which have revolutionized various tasks in natural language processing. Despite their success, a critical gap remains in enabling LLMs to perform knowledge graph completion (KGC). Empirical evidence suggests that LLMs consistently perform worse than conventional KGC approaches, even through sophisticated prompt design or tailored instruction-tuning. Fundamentally, applying LLMs on KGC introduces several critical challenges, including a vast set of entity candidates, hallucination issue of LLMs, and under-exploitation of the graph structure. To address these challenges, we propose a novel instruction-tuning-based method, namely FtG. Specifically, we present a filter-then-generate paradigm and formulate the KGC task into a multiple-choice question format. In this way, we can harness the capability of LLMs while mitigating the issue caused by hallucinations. Moreover, we devise a flexible ego-graph serialization prompt and employ a structure-text adapter to couple structure and text information in a contextualized manner. Experimental results demonstrate that FtG achieves substantial performance gain compared to existing state-of-the-art methods. The instruction dataset and code are available at this https URL.
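
The filter-then-generate idea can be illustrated with a small prompt builder: a conventional scorer first narrows the entity set to a few candidates, and the LLM then only has to pick a letter. In this sketch the scores are supplied by hand, and the prompt wording, function name, and example entities are all hypothetical; the paper's ego-graph serialization and structure-text adapter are not modeled.

```python
import string
from typing import List, Tuple


def build_multiple_choice_prompt(
    head: str,
    relation: str,
    scored_candidates: List[Tuple[str, float]],
    k: int = 4,
) -> str:
    """Keep the k highest-scoring candidates and format them as lettered options.

    `scored_candidates` holds (entity, score) pairs from some conventional
    KGC scorer (assumed to exist upstream). k must not exceed 26 letters.
    """
    top = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)[:k]
    options = [
        f"{string.ascii_uppercase[i]}. {name}" for i, (name, _) in enumerate(top)
    ]
    question = f"Which entity completes the triple ({head}, {relation}, ?)"
    return "\n".join([question] + options + ["Answer:"])


candidates = [("Lyon", 0.2), ("Paris", 0.9), ("Berlin", 0.1), ("Marseille", 0.3)]
prompt = build_multiple_choice_prompt("France", "capital", candidates, k=3)
```

Restricting the answer space to the listed options is what limits hallucinated entities: the model can only answer with a letter that maps back to a real candidate.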

[NLP-33] Evaluating Pixel Language Models on Non-Standardized Languages COLING2025

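[Quick read]: This paper addresses transfer learning from standard languages to dialects, in particular the out-of-vocabulary words common in dialectal data. The key to the solution is pixel-based models, which convert text into images divided into patches and thereby obtain a continuous vocabulary representation. In zero-shot dialect evaluation, these models outperform token-based models on part-of-speech tagging, dependency parsing, and intent detection, although they fall short on topic classification.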

Link: https://arxiv.org/abs/2412.09084
Authors: Alberto Muñoz-Ortiz, Verena Blaschke, Barbara Plank
Keywords: pixel-based models, transfer learning, models, pixel-based, standard languages
Categories: Computation and Language (cs.CL)
Comments: Accepted at COLING 2025

Abstract:We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.

[NLP-34] Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

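[Quick Read]: This paper addresses the limitations of large language models (LLMs) on complex reasoning problems, in particular that existing methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) perform a single reasoning pass and may fail to revisit flawed paths, which hurts accuracy. The key to the solution is a new reasoning framework, Forest-of-Thought (FoT). It integrates multiple reasoning trees to exploit collective decision-making, uses a sparse activation strategy to select the most relevant reasoning paths, and introduces a dynamic self-correction strategy and a consensus-guided decision-making strategy to correct errors in real time, learn from past mistakes, and balance correctness against computational cost. Experiments show that FoT significantly improves LLM reasoning, making models more precise and efficient on complex tasks.

Link: https://arxiv.org/abs/2412.09078
Authors: Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, Yunhe Wang
Keywords: Large Language Models, Large Language, Language Models, shown remarkable abilities, remains a challenge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint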

Abstract:Large Language Models (LLMs) have shown remarkable abilities across various language tasks, but solving complex reasoning problems remains a challenge. While existing methods like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) enhance reasoning by decomposing problems or structuring prompts, they typically perform a single pass of reasoning and may fail to revisit flawed paths, compromising accuracy. To address this, we propose a novel reasoning framework called Forest-of-Thought (FoT), which integrates multiple reasoning trees to leverage collective decision-making for solving complex logical problems. FoT utilizes sparse activation strategies to select the most relevant reasoning paths, improving both efficiency and accuracy. Additionally, we introduce a dynamic self-correction strategy that enables real-time error correction and learning from past mistakes, as well as consensus-guided decision making strategies to optimize correctness and computational resources. Experimental results demonstrate that the FoT framework, combined with these strategies, significantly enhances the reasoning capabilities of LLMs, enabling them to solve complex tasks with greater precision and efficiency.
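To make the idea of collective decision-making across several reasoning trees concrete, here is a minimal sketch that uses plain majority voting as a stand-in for the consensus-guided strategy. The abstract does not specify the actual rule, so the voting scheme, the `min_agreement` threshold, and the function name are assumptions; sparse activation and dynamic self-correction are not modeled.

```python
from collections import Counter


def consensus_answer(tree_answers, min_agreement=0.5):
    """Pick the most common final answer among reasoning trees.

    tree_answers: one final answer per tree, or None if a tree produced none.
    Returns (answer, agreed), where agreed is True only when the winning
    answer holds strictly more than min_agreement of the valid votes.
    """
    votes = Counter(a for a in tree_answers if a is not None)
    if not votes:
        return None, False
    answer, count = votes.most_common(1)[0]
    return answer, count / sum(votes.values()) > min_agreement


# Three of four hypothetical trees finished; two of them agree.
result = consensus_answer(["24", "24", "18", None])
print(result)
```

When `agreed` is False, a caller could spend additional compute (for example, growing more trees) instead of returning a weakly supported answer, which is one way to trade correctness against computational resources.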

[NLP-35] Dial-In LLM: Human-Aligned Dialogue Intent Clustering with LLM-in-the-loop

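[Quick Read]: This paper addresses the mismatch between traditional text clustering and human perception when discovering intents from customer dialogue. Because of the shift from embedding distance to semantic distance, existing quantitative metrics may not reflect the true quality of intent clusters. The key to the solution is to use the strong language understanding of Large Language Models (LLMs) to design better-calibrated intent clustering algorithms. The paper first verifies the robustness of a fine-tuned LLM for semantic coherence evaluation and cluster naming, with accuracies of 97.50% and 94.40% respectively. It then proposes an iterative clustering algorithm that supports cluster-level refinement and the continuous discovery of high-quality intent clusters, and introduces several LLM-in-the-loop semi-supervised clustering techniques tailored to intent discovery from customer service dialogue. Experiments show that these methods outperform existing ones, with a 6.25% improvement in quantitative metrics and a 12% improvement in application-level performance.

Link: https://arxiv.org/abs/2412.09049
Authors: Mengze Hong, Yuanfeng Song, Di Jiang, Wailing Ng, Yanjie Sun, Chen Jason Zhang
Keywords: automated support system, Large Language Models, support system, plays an important, important role
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)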

Abstract:The discovery of customer intention from dialogue plays an important role in automated support system. However, traditional text clustering methods are poorly aligned with human perceptions due to the shift from embedding distance to semantic distance, and existing quantitative metrics for text clustering may not accurately reflect the true quality of intent clusters. In this paper, we leverage the superior language understanding capabilities of Large Language Models (LLMs) for designing better-calibrated intent clustering algorithms. We first establish the foundation by verifying the robustness of fine-tuned LLM utility in semantic coherence evaluation and cluster naming, resulting in an accuracy of 97.50% and 94.40%, respectively, when compared to the human-labeled ground truth. Then, we propose an iterative clustering algorithm that facilitates cluster-level refinement and the continuous discovery of high-quality intent clusters. Furthermore, we present several LLM-in-the-loop semi-supervised clustering techniques tailored for intent discovery from customer service dialogue. Experiments on a large-scale industrial dataset comprising 1,507 intent clusters demonstrate the effectiveness of the proposed techniques. The methods outperformed existing counterparts, achieving 6.25% improvement in quantitative metrics and 12% enhancement in application-level performance when constructing an intent classifier.

[NLP-36] Multi-Task Learning with LLMs for Implicit Sentiment Analysis: Data-level and Task-level Automatic Weight Learning

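[Quick Read]: This paper addresses the limited reasoning ability and insufficient data in implicit sentiment analysis (ISA), which stem from the absence of salient cue words. The key to the solution is a new multi-task learning (MTL) framework called MT-ISA. It uses the generation and reasoning capabilities of large language models (LLMs) to construct auxiliary tasks that supplement sentiment elements, and applies automatic MTL to make full use of the auxiliary data. MT-ISA introduces data-level and task-level automatic weight learning (AWL), which dynamically identifies relationships and prioritizes more reliable data and critical tasks, so that models of different sizes can adaptively learn fine-grained weights according to their reasoning capabilities. With this approach, models of different sizes reach an optimal balance between the primary prediction task and the auxiliary tasks, which supports the method's effectiveness and adaptability.

Link: https://arxiv.org/abs/2412.09046
Authors: Wenna Lai, Haoran Xie, Guandong Xu, Qing Li
Keywords: Implicit sentiment analysis, presents significant challenges, salient cue words, significant challenges due, Implicit sentiment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures, and 6 tables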

Abstract:Implicit sentiment analysis (ISA) presents significant challenges due to the absence of salient cue words. Previous methods have struggled with insufficient data and limited reasoning capabilities to infer underlying opinions. Integrating multi-task learning (MTL) with large language models (LLMs) offers the potential to enable models of varying sizes to reliably perceive and recognize genuine opinions in ISA. However, existing MTL approaches are constrained by two sources of uncertainty: data-level uncertainty, arising from hallucination problems in LLM-generated contextual information, and task-level uncertainty, stemming from the varying capacities of models to process contextual information. To handle these uncertainties, we introduce MT-ISA, a novel MTL framework that enhances ISA by leveraging the generation and reasoning capabilities of LLMs through automatic MTL. Specifically, MT-ISA constructs auxiliary tasks using generative LLMs to supplement sentiment elements and incorporates automatic MTL to fully exploit auxiliary data. We introduce data-level and task-level automatic weight learning (AWL), which dynamically identifies relationships and prioritizes more reliable data and critical tasks, enabling models of varying sizes to adaptively learn fine-grained weights based on their reasoning capabilities. We investigate three strategies for data-level AWL, while also introducing homoscedastic uncertainty for task-level AWL. Extensive experiments reveal that models of varying sizes achieve an optimal balance between primary prediction and auxiliary tasks in MT-ISA. This underscores the effectiveness and adaptability of our approach.
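The abstract mentions homoscedastic uncertainty for task-level weighting without giving the formula. The sketch below shows one widely used formulation, in which each task loss is scaled by a learned inverse variance plus a regularizing log-variance term. Whether MT-ISA uses exactly this form is an assumption, and the function name and inputs are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses with homoscedastic-uncertainty weights.

    For task i with loss L_i and learned s_i = log(sigma_i^2), the total is
    sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i.
    A large s_i down-weights a noisy task, and the 0.5 * s_i term stops the
    model from inflating every variance to ignore all tasks.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance is required per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


# Primary task loss 1.0 and auxiliary task loss 2.0, both with sigma^2 = 1.
print(uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0]))
```

In a real training loop the log-variances would be trainable parameters updated by the optimizer alongside the model weights; here they are plain floats to keep the arithmetic visible.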

[NLP-37] Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation COLING2025

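[Quick Read]: This paper addresses the lack of naturally annotated data for Chinese Word Segmentation (CWS) and is the first to propose explicitly mining word boundaries from speech-text parallel data. The key to the solution is to use the Montreal Forced Aligner (MFA) toolkit to align speech-text data at the character level, treat pauses as candidate word boundaries, and filter out unreliable boundaries with a probability-based strategy. The paper also proposes a complete-then-train (CTT) strategy to use these word boundaries more effectively as extra training data. Experiments show that the method is effective for cross-domain CWS.

Link: https://arxiv.org/abs/2412.09045
Authors: Xuebin Wang, Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong, Yang Hou
Keywords: Chinese Word Segmentation, Montreal Forced Aligner, Inspired by early, explicitly mine word, Word Segmentation
Categories: Computation and Language (cs.CL)
Comments: COLING 2025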

Abstract:Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from speech-text parallel data. We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries. Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries. To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2. We have annotated about 1,000 sentences as the evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our proposed approach.
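As a rough illustration of how pauses can yield candidate word boundaries, the sketch below takes character-level alignments (character, start time, end time), such as a forced aligner might produce, and marks a boundary wherever the silence between adjacent characters reaches a threshold. The input layout, the `min_pause` value, and the timings are hypothetical. The paper's probability-based filter and the CTT strategy are not reproduced here.

```python
def boundaries_from_pauses(aligned_chars, min_pause=0.05):
    """Return indices i such that a candidate boundary follows character i.

    aligned_chars: list of (char, start_sec, end_sec) tuples in spoken order.
    """
    boundaries = []
    for i in range(len(aligned_chars) - 1):
        _, _, end = aligned_chars[i]
        _, next_start, _ = aligned_chars[i + 1]
        if next_start - end >= min_pause:
            boundaries.append(i)
    return boundaries


def segment(aligned_chars, boundaries):
    """Split the character sequence at the given boundary indices."""
    cut = set(boundaries)
    words, current = [], ""
    for i, (char, _, _) in enumerate(aligned_chars):
        current += char
        if i in cut:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hypothetical alignment with short pauses around the third character.
aligned = [("我", 0.00, 0.20), ("们", 0.20, 0.40), ("去", 0.50, 0.70),
           ("北", 0.80, 1.00), ("京", 1.00, 1.20)]
boundaries = boundaries_from_pauses(aligned)
words = segment(aligned, boundaries)
print(words)
```

Pauses give only partial evidence: speakers do not pause at every word boundary, which is why the abstract treats them as candidates and describes a separate strategy for completing the annotation before training.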

[NLP-38] ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty

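[Quick Read]: This paper addresses out-of-memory issues caused by KV cache growth during large language model (LLM) inference. The key to the solution is a KV cache compression method based on layer uncertainty, which dynamically allocates a different budget size to each layer in order to retain essential information, reducing memory use with nearly lossless performance. Experiments show that the method reduces KV cache memory usage to about 20% of that of full-KV inference.

Link: https://arxiv.org/abs/2412.09036
Authors: Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
Keywords: Large Language models, Large Language, research hotspot, Language models, Large
Categories: Computation and Language (cs.CL)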

Abstract:Large Language models (LLMs) have become a research hotspot. To accelerate the inference of LLMs, storing computed caches in memory has become the standard technique. However, as the inference length increases, growing KV caches might lead to out-of-memory issues. Many existing methods address this issue through KV cache compression, primarily by preserving key tokens throughout all layers to reduce information loss. Most of them allocate a uniform budget size for each layer to retain. However, we observe that the minimum budget sizes needed to retain essential information vary across layers and models based on the perspectives of attention and hidden state output. Building on this observation, this paper proposes a simple yet effective KV cache compression method that leverages layer uncertainty to allocate budget size for each layer. Experimental results show that the proposed method can reduce memory usage of the KV caches to only ~20% when compared to Full KV inference while achieving nearly lossless performance.
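The contrast between a uniform per-layer budget and a layer-dependent one can be sketched as follows. The abstract does not define how layer uncertainty is measured, so `layer_scores` is a hypothetical per-layer score, and proportional allocation is an assumed rule chosen for illustration rather than the paper's algorithm.

```python
def allocate_layer_budgets(layer_scores, total_budget, min_per_layer=1):
    """Split a total KV-cache token budget across layers in proportion to a score.

    layer_scores: hypothetical non-negative per-layer scores (higher means the
    layer is assumed to need more cached tokens).
    Every layer receives at least min_per_layer tokens; budgets sum to total_budget.
    """
    n = len(layer_scores)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget is too small for the per-layer minimum")
    spare = total_budget - n * min_per_layer
    total_score = sum(layer_scores)
    if total_score <= 0:
        shares = [spare / n] * n  # fall back to a uniform split
    else:
        shares = [spare * s / total_score for s in layer_scores]
    budgets = [min_per_layer + int(share) for share in shares]
    # Flooring can leave a few tokens unassigned; give them to the layers
    # with the largest fractional remainders.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


print(allocate_layer_budgets([1.0, 3.0], total_budget=10))
```

A uniform scheme would give both layers 5 tokens here; the score-based split shifts capacity toward the layer assumed to need it while keeping the same total memory.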

[NLP-39] Dialogue Language Model with Large-Scale Persona Data Engineering

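[Quick Read]: This paper addresses persona consistency in open-domain dialogue systems, particularly given the limited scale and diversity of existing datasets. The key to the solution is PPDS, an open-domain persona dialogue system that improves persona consistency through generative pre-training on a large-scale persona dialogue dataset. Specifically, the paper proposes a persona extraction model for autonomously and precisely generating large persona dialogue datasets, and a persona augmentation technique that addresses the invalid persona bias inherent in the constructed dataset. In both quantitative and human evaluations, the model shows strong response quality and persona consistency.

Link: https://arxiv.org/abs/2412.09034
Authors: Mengze Hong, Chen Zhang, Chaotao Chen, Rongzhong Lian, Di Jiang
Keywords: Maintaining persona consistency, Maintaining persona, persona dialogue datasets, persona dialogue, persona
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)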

Abstract:Maintaining persona consistency is paramount in the application of open-domain dialogue systems, as exemplified by models like ChatGPT. Despite significant advancements, the limited scale and diversity of current persona dialogue datasets remain challenges to achieving robust persona-consistent dialogue models. In this study, drawing inspiration from the success of large-scale pre-training, we introduce PPDS, an open-domain persona dialogue system that employs extensive generative pre-training on a persona dialogue dataset to enhance persona consistency. Specifically, we present a persona extraction model designed to autonomously and precisely generate vast persona dialogue datasets. Additionally, we unveil a pioneering persona augmentation technique to address the invalid persona bias inherent in the constructed dataset. Both quantitative and human evaluations consistently highlight the superior response quality and persona consistency of our proposed model, underscoring its effectiveness.

[NLP-40] Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

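[Quick Read]: This paper addresses the poor performance of Neural Machine Translation (NMT) models in Scientific, Technical and Educational domains, especially for low-resource Indian languages. The key to the solution is a multilingual parallel corpus of more than 2.8 million high-quality English-to-Indic and Indic-to-Indic translation pairs, obtained by bitext mining human-translated transcriptions of NPTEL video lectures. NMT models fine-tuned on this corpus surpass all other publicly available models on in-domain tasks and improve the baseline by more than 2 BLEU on average on out-of-domain tasks.

Link: https://arxiv.org/abs/2412.09025
Authors: Advait Joglekar, Srinivasan Umesh
Keywords: Neural Machine Translation, Neural Machine, Educational domains, Machine Translation, Technical and Educational
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)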

Abstract:Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to Scientific, Technical and Educational domains. Translation models thus, in general, struggle with tasks that involve scientific understanding or technical jargon. Their performance is found to be even worse for low-resource Indian languages. Finding a translation dataset that tends to these domains in particular, poses a difficult challenge. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks. We also demonstrate the potential for generalizing to out-of-domain translation tasks by improving the baseline by over 2 BLEU on average for these Indian languages on the Flores+ benchmark. We are pleased to release our model and dataset via this link: this https URL.

[NLP-41] Improvement in Sign Language Translation Using Text CTC Alignment

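[Quick Read]: This paper addresses the limitation of gloss-supervised Connectionist Temporal Classification (CTC) in sign language translation (SLT) when handling non-monotonic alignments between sign language video and spoken text. The key to the solution is a new method that combines joint CTC/Attention with transfer learning. Joint CTC/Attention introduces hierarchical encoding and integrates CTC with the attention mechanism during decoding, handling both monotonic and non-monotonic alignments. Transfer learning helps bridge the modality gap between vision and language in SLT. On two widely adopted benchmarks, the method achieves results comparable to the state of the art and outperforms the pure-attention baseline. The work also opens a direction for future research on gloss-free SLT based on text CTC alignment.

Link: https://arxiv.org/abs/2412.09014
Authors: Sihan Tan, Taro Miyazaki, Nabeela Khan, Kazuhiro Nakadai
Keywords: Connectionist Temporal Classification, Current sign language, Temporal Classification, Connectionist Temporal, Current sign
Categories: Computation and Language (cs.CL)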

Abstract:Current sign language translation (SLT) approaches often rely on gloss-based supervision with Connectionist Temporal Classification (CTC), limiting their ability to handle non-monotonic alignments between sign language video and spoken text. In this work, we propose a novel method combining joint CTC/Attention and transfer learning. The joint CTC/Attention introduces hierarchical encoding and integrates CTC with the attention mechanism during decoding, effectively managing both monotonic and non-monotonic alignments. Meanwhile, transfer learning helps bridge the modality gap between vision and language in SLT. Experimental results on two widely adopted benchmarks, RWTH-PHOENIX-Weather 2014 T and CSL-Daily, show that our method achieves results comparable to state-of-the-art and outperforms the pure-attention baseline. Additionally, this work opens a new door for future research into gloss-free SLT using text-based CTC alignment.
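For readers unfamiliar with joint CTC/Attention, the formulation commonly used in hybrid speech recognition systems interpolates the two components with a single weight, both in the training objective and in the score assigned to a hypothesis during decoding. The abstract does not state the exact form or the weight used in this paper, so the equations below are an assumed, generic version in which the symbol \lambda and the notation are illustrative.

```latex
% Training: interpolate the CTC loss and the attention (cross-entropy) loss.
\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
\qquad 0 \le \lambda \le 1

% Decoding: score a candidate text y given the input video x.
\hat{y} = \arg\max_{y} \; \lambda \log p_{\mathrm{CTC}}(y \mid x)
        + (1 - \lambda) \log p_{\mathrm{att}}(y \mid x)
```

Setting \lambda = 0 recovers a pure-attention model, which corresponds to the baseline the abstract compares against.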

[NLP-42] What Makes Cryptic Crosswords Challenging for LLMs? COLING2025

[Quick Read]: This paper examines why modern large language models (LLMs) perform poorly at solving cryptic crosswords. The key to the approach is to establish benchmark results for three popular LLMs (Gemma2, LLaMA3 and ChatGPT) on this task and to analyze why they fall short of human-level performance. The released code and datasets provide a basis for future improvements.

Link: https://arxiv.org/abs/2412.09012
Authors: Abdelrahman Sadallah, Daria Kotova, Ekaterina Kochmar
Keywords: Cryptic crosswords, including Large Language, Large Language Models, types of wordplay, rely on general
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: COLING 2025

Abstract:Cryptic crosswords are puzzles that rely on general knowledge and the solver’s ability to manipulate language on different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish the benchmark results for three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance. We release our code and introduced datasets at this https URL.

[NLP-43] Assessing the Robustness of Retrieval-Augmented Generation Systems in K-12 Educational Question Answering with Knowledge Discrepancies

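[Quick Read]: This paper addresses the drop in question answering performance of Retrieval-Augmented Generation (RAG) systems in K-12 education that is caused by discrepancies between textbook knowledge and the parametric knowledge of Large Language Models (LLMs). The key to the solution is the EduKDQA dataset, which simulates the knowledge discrepancies of real applications by applying hypothetical knowledge updates to answers and source documents, so that the robustness of RAG systems under such discrepancies can be evaluated systematically. EduKDQA contains 3,005 questions across five subjects, with a comprehensive question typology from the perspective of context utilization and knowledge integration. Extensive experiments confirm that RAG systems degrade under knowledge discrepancies, and that questions requiring integration of contextual and parametric knowledge are particularly challenging for LLMs.

Link: https://arxiv.org/abs/2412.08985
Authors: Tianshi Zheng, Weihan Li, Jiaxin Bai, Weiqi Wang, Yangqiu Song
Keywords: Large Language Models, demonstrated remarkable potential, Retrieval-Augmented Generation, Education domain, RAG systems
Categories: Computation and Language (cs.CL)
Comments: 10 pages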

Abstract:Retrieval-Augmented Generation (RAG) systems have demonstrated remarkable potential as question answering systems in the K-12 Education domain, where knowledge is typically queried within the restricted scope of authoritative textbooks. However, the discrepancy between textbooks and the parametric knowledge in Large Language Models (LLMs) could undermine the effectiveness of RAG systems. To systematically investigate the robustness of RAG systems under such knowledge discrepancies, we present EduKDQA, a question answering dataset that simulates knowledge discrepancies in real applications by applying hypothetical knowledge updates in answers and source documents. EduKDQA includes 3,005 questions covering five subjects, under a comprehensive question typology from the perspective of context utilization and knowledge integration. We conducted extensive experiments on retrieval and question answering performance. We find that most RAG systems suffer from a substantial performance drop in question answering with knowledge discrepancies, while questions that require integration of contextual knowledge and parametric knowledge pose a challenge to LLMs.

[NLP-44] RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

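[Quick Read]: This paper addresses the evaluation of large language models (LLMs) on complex, real-world rule-guided reasoning. The key to the solution is a new benchmark, RuleArena, covering three practical domains (airline baggage fees, NBA transactions, and tax regulations). It assesses how well LLMs handle intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it goes beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, offering insight into the suitability and reliability of LLMs for real-world applications.

Link: https://arxiv.org/abs/2412.08972
Authors: Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang
Keywords: large language models, paper introduces RuleArena, challenging benchmark designed, follow complex, paper introduces
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Data and code are available at this https URL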

Abstract:This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains – airline baggage fees, NBA transactions, and tax regulations – RuleArena assesses LLMs’ proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly in the benchmark. These results highlight significant challenges in advancing LLMs’ rule-guided reasoning capabilities in real-life applications.

[NLP-45] Reasoning-Aware Query-Focused Summarization over Multi-Table Data

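[Quick Read]: This paper addresses query-focused summarization over multi-table data, where existing methods often rely on complex preprocessing and struggle to generalize across domains or to handle the logical reasoning that multi-table queries require. The key to the solution is QueryTableSummarizer++, an end-to-end generative framework that uses large language models (LLMs) enhanced with table-aware pre-training, query-aligned fine-tuning, and reinforcement learning with feedback. The method needs no intermediate serialization step and generates query-relevant summaries directly. It is reported to significantly improve BLEU, ROUGE, and F1-score and to perform well in cross-domain generalization and complex query handling.

Link: https://arxiv.org/abs/2412.08970
Authors: Xiaochuan Lin, Xiangyong Chen
Keywords: structured data, challenging yet critical, extracting precise, precise and relevant, relevant information
Categories: Computation and Language (cs.CL)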

Abstract:Query-focused summarization over multi-table data is a challenging yet critical task for extracting precise and relevant information from structured data. Existing methods often rely on complex preprocessing steps and struggle to generalize across domains or handle the logical reasoning required for multi-table queries. In this paper, we propose QueryTableSummarizer++, an end-to-end generative framework leveraging large language models (LLMs) enhanced with table-aware pre-training, query-aligned fine-tuning, and reinforcement learning with feedback. Our method eliminates the need for intermediate serialization steps and directly generates query-relevant summaries. Experiments on a benchmark dataset demonstrate that QueryTableSummarizer++ significantly outperforms state-of-the-art baselines in terms of BLEU, ROUGE, and F1-score. Additional analyses highlight its scalability, generalization across domains, and robust handling of complex queries. Human evaluation further validates the superior quality and practical applicability of the generated summaries, establishing QueryTableSummarizer++ as a highly effective solution for multi-table summarization tasks.

[NLP-46] Align Generate Learn: A Novel Closed-Loop Framework for Cross-Lingual In-Context Learning

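[Quick Read]: This paper addresses the reliance of existing cross-lingual in-context learning (XICL) methods on external retrievers or task-specific fine-tuning, which limits their scalability and generalizability. The key to the solution is a self-supervised framework that uses the generative capabilities of large language models (LLMs) to internally select and use task-relevant examples. The framework introduces two objectives: a retrieval-generation alignment loss to optimize the quality of the selected examples, and a semantic coherence loss to ensure cross-lingual consistency. The method achieves state-of-the-art performance on multilingual benchmarks and shows robustness across language families and generalization to unseen tasks.

Link: https://arxiv.org/abs/2412.08955
Authors: Mateo Alejandro Rojas, Rafael Carranza
Keywords: large language models, leveraging large language, transformative paradigm, paradigm for leveraging, leveraging large
Categories: Computation and Language (cs.CL)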

Abstract:Cross-lingual in-context learning (XICL) has emerged as a transformative paradigm for leveraging large language models (LLMs) to tackle multilingual tasks, especially for low-resource languages. However, existing approaches often rely on external retrievers or task-specific fine-tuning, limiting their scalability and generalizability. In this paper, we propose a novel self-supervised framework that harnesses the generative capabilities of LLMs to internally select and utilize task-relevant examples. Our method introduces two key objectives: a retrieval-generation alignment loss to optimize the quality of selected examples and a semantic coherence loss to ensure cross-lingual consistency. Through extensive experiments on multilingual benchmarks, our approach achieves state-of-the-art performance, significantly outperforming existing baselines. Further analysis highlights its robustness across diverse language families and its ability to generalize to unseen tasks. Human evaluations confirm the superior fluency, relevance, and semantic correctness of outputs generated by our method. This work provides a scalable, effective, and generalizable solution for cross-lingual in-context learning.

[NLP-47] Mojito: Motion Trajectory and Intensity Control for Video Generation

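[Quick Read]: This paper addresses how to efficiently train diffusion models that integrate directional guidance and controllable motion intensity for high-quality video generation. The key to the solution is the Mojito model, built on two core modules. The Directional Motion Control module uses cross-attention to steer the motion direction of generated objects without additional training. The Motion Intensity Modulator uses optical flow maps generated from videos to guide different levels of motion intensity. Together, these modules let the model control motion trajectory and intensity precisely while keeping computational efficiency high.

Link: https://arxiv.org/abs/2412.08948
Authors: Xuehai He, Shuohang Wang, Jianwei Yang, Xiaoxia Wu, Yiping Wang, Kuan Wang, Zheng Zhan, Olatunji Ruwase, Yelong Shen, Xin Eric Wang
Keywords: shown great promise, Recent advancements, high-quality video content, producing high-quality video
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)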

Abstract:Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. This paper introduces Mojito, a diffusion model that incorporates both Motion trajectory and intensity control for text to video generation. Specifically, Mojito features a Directional Motion Control module that leverages cross-attention to efficiently direct the generated object’s motion without additional training, alongside a Motion Intensity Modulator that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito’s effectiveness in achieving precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match specified directions and intensities, providing realistic dynamics that align well with natural motion in real-world scenarios.

[NLP-48] MoSLD: An Extremely Parameter-Efficient Mixture-of-Shared LoRAs for Multi-Task Learning COLING2025

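[Quick Read]: This paper addresses the weak performance of low-rank adaptation (LoRA) in multi-task learning, as well as the mutual interference of cross-domain data and the forgetting of task knowledge in Mixture-of-Experts (MoE) architectures. The key to the solution is a mixture-of-shared-LoRAs model (MoSLD), which shares LoRA's upper projection matrix among experts to encourage learning of general cross-task knowledge, while keeping the lower projection matrix focused on each task's unique features. A dropout strategy alleviates imbalanced updates of the parameter matrices and parameter overfitting in LoRA. Experiments show that MoSLD performs well in both single-task and multi-task settings and generalizes robustly out of domain.

Link: https://arxiv.org/abs/2412.08946
Authors: Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua Zhou
Keywords: fine-tuning large pre-trained, large pre-trained models, falls short, crucial technique, technique for fine-tuning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by COLING 2025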

Abstract:Recently, LoRA has emerged as a crucial technique for fine-tuning large pre-trained models, yet its performance in multi-task learning scenarios often falls short. In contrast, the MoE architecture presents a natural solution to this issue. However, it introduces challenges such as mutual interference of data across multiple domains and knowledge forgetting of various tasks. Additionally, MoE significantly increases the number of parameters, posing a computational cost challenge. Therefore, in this paper, we propose MoSLD, a mixture-of-shared-LoRAs model with a dropout strategy. MoSLD addresses these challenges by sharing the upper projection matrix in LoRA among different experts, encouraging the model to learn general knowledge across tasks, while still allowing the lower projection matrix to focus on the unique features of each task. The application of dropout alleviates the imbalanced update of parameter matrix and mitigates parameter overfitting in LoRA. Extensive experiments demonstrate that our model exhibits excellent performance in both single-task and multi-task scenarios, with robust out-of-domain generalization capabilities.
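The parameter saving from sharing one LoRA matrix across experts can be checked with simple arithmetic. The sketch below assumes the usual LoRA factorization into a down-projection of shape (rank, d_in) and an up-projection of shape (d_out, rank), and it interprets the abstract's "upper projection matrix" as the up-projection. That reading, the function name, and the example sizes are assumptions, and routing and dropout are not modeled.

```python
def lora_param_counts(d_in, d_out, rank, num_experts):
    """Compare trainable parameter counts for a mixture of LoRA experts.

    Returns (independent, shared_up):
      independent: every expert owns both LoRA matrices.
      shared_up:   one up-projection is shared by all experts, and each
                   expert keeps its own down-projection.
    """
    down = rank * d_in   # down-projection, shape (rank, d_in)
    up = d_out * rank    # up-projection, shape (d_out, rank)
    independent = num_experts * (down + up)
    shared_up = up + num_experts * down
    return independent, shared_up


# Hypothetical layer: 4096 x 4096 weight, rank 8, 4 experts.
independent, shared_up = lora_param_counts(4096, 4096, 8, 4)
print(independent, shared_up)
```

With these example sizes the shared variant trains 163,840 parameters per adapted layer instead of 262,144, and the gap widens as the number of experts grows.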

[NLP-49] Multi-Scale Heterogeneous Text-Attributed Graph Datasets From Diverse Domains

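[Quick Read]: This paper addresses the scarcity of datasets for evaluating machine learning models on Heterogeneous Text-Attributed Graphs (HTAGs). Current research focuses mainly on homogeneous graphs, leaving heterogeneous graphs poorly understood and evaluated. The key to the solution is a collection of diverse benchmark datasets that are multi-scale, span years in duration, and cover multiple domains (movie, community question answering, academic, literature, and patent networks). The datasets provide original textual content and support realistic, reproducible evaluation on HTAGs of different sizes and domains. All source data, dataset construction code, processed HTAGs, data loaders, benchmark code, and evaluation setup are publicly released.

Link: https://arxiv.org/abs/2412.08937
Authors: Yunhui Liu, Qizhuo Xie, Jinwei Shi, Jiaxu Shen, Tieke He
Keywords: Heterogeneous Text-Attributed Graphs, gained widespread popularity, Heterogeneous Text-Attributed, gained widespread, widespread popularity
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)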

Abstract:Heterogeneous Text-Attributed Graphs (HTAGs), where different types of entities are not only associated with texts but also connected by diverse relationships, have gained widespread popularity and application across various domains. However, current research on text-attributed graph learning predominantly focuses on homogeneous graphs, which feature a single node and edge type, thus leaving a gap in understanding how methods perform on HTAGs. One crucial reason is the lack of comprehensive HTAG datasets that offer original textual content and span multiple domains of varying sizes. To this end, we introduce a collection of challenging and diverse benchmark datasets for realistic and reproducible evaluation of machine learning models on HTAGs. Our HTAG datasets are multi-scale, span years in duration, and cover a wide range of domains, including movie, community question answering, academic, literature, and patent networks. We further conduct benchmark experiments on these datasets with various graph neural networks. All source data, dataset construction codes, processed HTAGs, data loaders, benchmark codes, and evaluation setup are publicly available at GitHub and Hugging Face.

[NLP-50] From Text to Trajectory: Exploring Complex Constraint Representation and Decomposition in Safe Reinforcement Learning NEURIPS2024

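[Quick Read]: This paper addresses how to handle natural-language constraints flexibly and efficiently in safe reinforcement learning (RL). Previous methods require a manually designed cost function for each constraint, which depends on domain expertise and lacks flexibility. The key to the solution is the Trajectory-level Textual Constraints Translator (TTCT): text is used not only to provide the constraint but also as a training signal, replacing the manually designed cost function. Experiments show that TTCT effectively comprehends textual constraints and trajectories, and that policies trained with TTCT achieve a lower violation rate than those trained with the standard cost function. TTCT also shows zero-shot transfer capability, adapting to constraint-shift environments.

Link: https://arxiv.org/abs/2412.08920
Authors: Pusen Dong, Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li
Keywords: Safe reinforcement learning, obeying specific constraints, reinforcement learning, agent to finish, obeying specific
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024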

Abstract:Safe reinforcement learning (RL) requires the agent to finish a given task while obeying specific constraints. Giving constraints in natural language form has great potential for practical scenarios due to its flexible transfer capability and accessibility. Previous safe RL methods with natural language constraints typically need to design cost functions manually for each constraint, which requires domain expertise and lacks flexibility. In this paper, we harness the dual role of text in this task, using it not only to provide constraint but also as a training signal. We introduce the Trajectory-level Textual Constraints Translator (TTCT) to replace the manually designed cost function. Our empirical results demonstrate that TTCT effectively comprehends textual constraint and trajectory, and the policies trained by TTCT can achieve a lower violation rate than the standard cost function. Extra studies are conducted to demonstrate that the TTCT has zero-shot transfer capability to adapt to constraint-shift environments.

[NLP-51] Phi-4 Technical Report

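[Quick read]: This paper asks how better data quality and training methods can raise a language model's question-answering ability in STEM (science, technology, engineering, and mathematics). The key is a training recipe centered on data quality: synthetic data is strategically incorporated throughout training, rather than relying mainly on organic data sources such as web content or code. With improved data generation, training curriculum, and post-training techniques, phi-4 substantially surpasses its teacher model GPT-4 on STEM-focused QA, which supports the effectiveness of these methods. Although its architecture changes little from phi-3, phi-4 performs strongly on reasoning-focused benchmarks, mainly owing to better data and innovations in training.

Link: https://arxiv.org/abs/2412.08905
Authors: Marah Abdin,Jyoti Aneja,Harkirat Behl,Sébastien Bubeck,Ronen Eldan,Suriya Gunasekar,Michael Harrison,Russell J. Hewett,Mojan Javaheripi,Piero Kauffmann,James R. Lee,Yin Tat Lee,Yuanzhi Li,Weishung Liu,Caio C. T. Mendes,Anh Nguyen,Eric Price,Gustavo de Rosa,Olli Saarikivi,Adil Salim,Shital Shah,Xin Wang,Rachel Ward,Yue Wu,Dingli Yu,Cyril Zhang,Yi Zhang
Keywords: parameter language model, language model developed, centrally focused, parameter language, data quality
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: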

Abstract:We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size – especially on reasoning-focused benchmarks – due to improved data, training curriculum, and innovations in the post-training scheme.

[NLP-52] AI-assisted Knowledge Discovery in Biomedical Literature to Support Decision-making in Precision Oncology

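[Quick read]: This paper addresses how to deliver personalized targeted therapies to cancer patients by fully analyzing tumor molecular profiles and patient clinical characteristics in the context of existing knowledge and recent findings in the biomedical literature. The key is to evaluate and apply natural language processing (NLP) solutions, in particular models from the BERT family (such as BioBERT) and PubTator 3.0, to support knowledge discovery from biomedical literature. BioBERT performed well on both named entity recognition (NER) and relation extraction (RE). On RE it achieved the highest F1-score (0.79), and in a specific use case it recognized nearly all entity mentions and most of the relations.

Link: https://arxiv.org/abs/2412.08900
Authors: Ting He,Kory Kreimeyer,Mimi Najjar,Jonathan Spiker,Maria Fatteh,Valsamo Anagnostou,Taxiarchis Botsis
Keywords: cancer patients requires, patient clinical characteristics, cancer patients, patients requires, patient clinical
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at AMIA Annual Symposium 2024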

Abstract:The delivery of appropriate targeted therapies to cancer patients requires the complete analysis of the molecular profiling of tumors and the patient’s clinical characteristics in the context of existing knowledge and recent findings described in biomedical literature and several other sources. We evaluated the potential contributions of specific natural language processing solutions to support knowledge discovery from biomedical literature. Two models from the Bidirectional Encoder Representations from Transformers (BERT) family, two Large Language Models, and PubTator 3.0 were tested for their ability to support the named entity recognition (NER) and the relation extraction (RE) tasks. PubTator 3.0 and the BioBERT model performed best in the NER task (best F1-score equal to 0.93 and 0.89, respectively), while BioBERT outperformed all other solutions in the RE task (best F1-score 0.79) and a specific use case it was applied to by recognizing nearly all entity mentions and most of the relations.
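
The F1-scores quoted above are the harmonic mean of precision and recall. As a reading aid, here is a minimal sketch of how such a score could be computed for entity extraction. The entity sets are made-up illustrations, not data from the paper, and exact-match scoring is an assumption (the paper's matching criteria may differ).

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 between two sets of annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (mention, type) annotations for one abstract.
gold = {("EGFR", "gene"), ("erlotinib", "drug"), ("NSCLC", "disease"), ("KRAS", "gene")}
predicted = {("EGFR", "gene"), ("erlotinib", "drug"), ("NSCLC", "disease"), ("TP53", "gene")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Three of four predictions match three of four gold annotations, so precision, recall, and F1 all come out to 0.75 in this toy case.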

[NLP-53] A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions

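[Quick read]: This paper tackles the poor scalability and high cost of traditional methods for synthesizing high-quality reasoning data for continual training of large language models (LLMs). The key is the Graph-based Synthetic Data Pipeline (GSDP), a framework that extracts knowledge points from seed data, builds a knowledge-point relationship graph, and explores implicit relationships among knowledge points to achieve 255× data expansion. Led by open-source models, GSDP reaches synthesis quality comparable to GPT-4-0613 at 100× lower cost. For mathematical reasoning, the most challenging task, the paper also presents the GSDP-MATH dataset and demonstrates the method by fine-tuning Mistral-7B on it.

Link: https://arxiv.org/abs/2412.08864
Authors: Jiankang Wang,Jianjun Xu,Xiaorui Wang,Yuxin Wang,Mengting Xing,Shancheng Fang,Zhineng Chen,Hongtao Xie,Yongdong Zhang
Keywords: Large Language Models, Large Language, Synthesizing high-quality reasoning, performance of Large, Synthesizing high-quality
Subjects: Computation and Language (cs.CL)
Comments: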

Abstract: Synthesizing high-quality reasoning data for continual training has been proven to be effective in enhancing the performance of Large Language Models (LLMs). However, previous synthetic approaches struggle to easily scale up data and incur high costs in the pursuit of high quality. In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for high-quality reasoning data synthesis. Inspired by knowledge graphs, we extracted knowledge points from seed data and constructed a knowledge point relationships graph to explore their interconnections. By exploring the implicit relationships among knowledge, our method achieves 255× data expansion. Furthermore, GSDP, led by open-source models, achieves synthesis quality comparable to GPT-4-0613 while maintaining 100× lower costs. To tackle the most challenging mathematical reasoning task, we present the GSDP-MATH dataset comprising over 1.91 million pairs of math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B based on Mistral-7B achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method. The dataset and models trained in this paper will be available.
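
To make the graph idea concrete, here is a rough sketch of one way a knowledge-point graph could surface new combinations. It assumes that edges come from co-occurrence within a seed problem and that "implicit" relationships are two-hop neighbors. Both are illustrative assumptions, and the function names are hypothetical; the abstract does not spell out the paper's actual construction.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(seed_items):
    """Link knowledge points that co-occur in the same seed problem."""
    graph = defaultdict(set)
    for points in seed_items:
        for a, b in combinations(sorted(set(points)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def implicit_pairs(graph):
    """Pairs that share a neighbor but never co-occurred directly (two-hop links)."""
    pairs = set()
    for neighbors in graph.values():
        for a, b in combinations(sorted(neighbors), 2):
            if b not in graph[a]:
                pairs.add((a, b))
    return pairs


# Hypothetical knowledge points extracted from three seed problems.
seeds = [
    ["quadratic equations", "factoring"],
    ["factoring", "polynomial division"],
    ["polynomial division", "remainder theorem"],
]
graph = build_graph(seeds)
new_pairs = implicit_pairs(graph)
for pair in sorted(new_pairs):
    print(pair)
```

Each pair returned here would be a candidate prompt for synthesizing a new problem that combines two knowledge points never seen together in the seed data.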

[NLP-54] Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology

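[Quick read]: This paper examines how large language models (LLMs) understand and judge cultural values across countries, and how training methodology (model size, training corpus, alignment, and so on) affects that understanding. The key findings are: 1) training on a multilingual corpus reduces bias toward Western culture; 2) increasing model size improves understanding of social values; 3) synthetic data can enhance smaller models. Together these findings offer guidance for designing LLMs that better understand cultural values.

Link: https://arxiv.org/abs/2412.08846
Authors: Minsang Kim,Seungjun Baek
Keywords: Large language models, Large language, closely interact, Large, human society
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: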

Abstract:Large language models (LLMs) closely interact with humans, and thus need an intimate understanding of the cultural values of human society. In this paper, we explore how open-source LLMs make judgments on diverse categories of cultural values across countries, and its relation to training methodology such as model sizes, training corpus, alignment, etc. Our analysis shows that LLMs can judge socio-cultural norms similar to humans but less so on social systems and progress. In addition, LLMs tend to judge cultural values biased toward Western culture, which can be improved with training on the multilingual corpus. We also find that increasing model size helps a better understanding of social values, but smaller models can be enhanced by using synthetic data. Our analysis reveals valuable insights into the design methodology of LLMs in connection with their understanding of cultural values.

[NLP-55] Large Concept Models: Language Modeling in a Sentence Representation Space

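[Quick read]: This paper addresses the fact that current large language models (LLMs) process input and generate output only at the token level, in sharp contrast to humans, who analyze information and generate content at multiple levels of abstraction. The key is a new architecture, the Large Concept Model, which operates on an explicit higher-level semantic representation called a concept. Concepts are language- and modality-agnostic and represent a higher-level idea or action. The paper assumes that a concept corresponds to a sentence and uses the existing sentence embedding space SONAR, exploring several approaches: MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. Experiments with 1.6B-parameter models trained on about 1.3T tokens, and with one architecture scaled to 7B parameters and about 2.7T tokens, show that the approach is feasible and performs well on generative tasks, notably in zero-shot generalization across many languages.

Link: https://arxiv.org/abs/2412.08821
Authors: TheLCM team,Loïc Barrault,Paul-Ambroise Duquenne,Maha Elbayad,Artyom Kozhevnikov,Belen Alastruey,Pierre Andrews,Mariano Coria,Guillaume Couairon,Marta R. Costa-jussà,David Dale,Hady Elsahar,Kevin Heffernan,João Maria Janeiro,Tuan Tran,Christophe Ropers,Eduardo Sánchez,Robin San Roman,Alexandre Mourachko,Safiyyah Saleem,Holger Schwenk
Keywords: Large Concept Model, revolutionized the field, field of artificial, artificial intelligence, de-facto tool
Subjects: Computation and Language (cs.CL)
Comments: 49 pages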

Abstract: LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a “Large Concept Model”. In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.

[NLP-56] jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

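[Quick read]: This paper addresses the weak performance of CLIP models on text-only tasks, which forces cross-modal retrieval systems to rely on separate text and multimodal models. The key is a refined framework that uses multi-task, multi-stage contrastive learning and an improved training recipe to raise text-only retrieval performance, while adding multilingual support, better understanding of complex visual documents, and efficiency gains from Matryoshka Representation Learning and vector truncation. The resulting jina-clip-v2 model performs comparably to the state of the art on multilingual multimodal and multilingual text retrieval benchmarks, unifying text-only and multimodal retrieval.

Link: https://arxiv.org/abs/2412.08802
Authors: Andreas Koukounas,Georgios Mastrapas,Bo Wang,Mohammad Kalim Akram,Sedigheh Eslami,Michael Günther,Isabelle Mohr,Saba Sturua,Scott Martens,Nan Wang,Han Xiao
Keywords: Contrastive Language-Image Pretraining, shared embedding space, highly effective method, Language-Image Pretraining, embedding space
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 21 pages, 1-10 main paper, 10-12 refs, 12-21 benchmarks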

Abstract:Contrastive Language-Image Pretraining (CLIP) is a highly effective method for aligning images and texts in a shared embedding space. These models are widely used for tasks such as cross-modal information retrieval and multi-modal understanding. However, CLIP models often struggle with text-only tasks, underperforming compared to specialized text models. This performance disparity forces retrieval systems to rely on separate models for text-only and multi-modal tasks. In this work, we build upon our previous model, jina-clip-v1, by introducing a refined framework that utilizes multi-task, multi-stage contrastive learning across multiple languages, coupled with an improved training recipe to enhance text-only retrieval. The resulting model, jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks, while adding multilingual support, better understanding of complex visual documents and efficiency gains thanks to Matryoshka Representation Learning and vector truncation. The model performs comparably to the state-of-the-art in both multilingual-multimodal and multilingual text retrieval benchmarks, addressing the challenge of unifying text-only and multi-modal retrieval systems.
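
The vector truncation mentioned in the abstract can be pictured as follows: with Matryoshka-style training, the leading dimensions of an embedding are meant to remain useful on their own, so a vector can be cut to a prefix and re-normalized before similarity search. This is a minimal sketch of that general idea only. The four-dimensional vectors and the function name are hypothetical and say nothing about the real model's dimensions or API.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)
    return [x / norm for x in head]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy "full" embeddings (hypothetical values).
query = [3.0, 4.0, 12.0, 0.0]
document = [3.0, 4.0, 0.0, 5.0]

short_query = truncate_embedding(query, 2)
short_document = truncate_embedding(document, 2)
print(short_query, dot(short_query, short_document))
```

The trade-off is storage and search cost against accuracy: shorter prefixes are cheaper to index, and how much retrieval quality they keep depends on how the model was trained.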

[NLP-57] Coverage-based Fairness in Multi-document Summarization

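[Quick read]: This paper addresses fairness in multi-document summarization (MDS): whether a generated summary fairly represents information from documents with different social attribute values. The key is two new fairness measures. Equal Coverage is a summary-level measure based on document coverage that accounts for redundancy within the documents. Coverage Parity is a corpus-level measure for detecting corpus-level unfairness. Using these measures, the paper evaluates thirteen LLMs and finds that Claude3-sonnet is the fairest among them, but that almost all LLMs overrepresent different social attribute values.

Link: https://arxiv.org/abs/2412.08795
Authors: Haoyuan Li,Yusen Zhang,Rui Zhang,Snigdha Chaturvedi
Keywords: fairly representing information, summary fairly representing, Proportional Representation, multi-document summarization, Fairness
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: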

Abstract:Fairness in multi-document summarization (MDS) measures whether a system can generate a summary fairly representing information from documents with different social attribute values. Fairness in MDS is crucial since a fair summary can offer readers a comprehensive view. Previous works focus on quantifying summary-level fairness using Proportional Representation, a fairness measure based on Statistical Parity. However, Proportional Representation does not consider redundancy in input documents and overlooks corpus-level unfairness. In this work, we propose a new summary-level fairness measure, Equal Coverage, which is based on coverage of documents with different social attribute values and considers the redundancy within documents. To detect the corpus-level unfairness, we propose a new corpus-level measure, Coverage Parity. Our human evaluations show that our measures align more with our definition of fairness. Using our measures, we evaluate the fairness of thirteen different LLMs. We find that Claude3-sonnet is the fairest among all evaluated LLMs. We also find that almost all LLMs overrepresent different social attribute values.

[NLP-58] BDA: Bangla Text Data Augmentation Framework

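[Quick read]: This paper addresses how data augmentation can increase the volume of training data in resource-limited fields where high-quality data is scarce. The key is the Bangla Text Data Augmentation (BDA) framework, which combines pre-trained models and rule-based methods to generate new text variants, then applies a filtering step so that the new text keeps the original meaning while adding lexical variety. Experiments show that the framework significantly improves F1 scores on Bangla text classification, matching the performance of models trained on 100% of the data while using only 50% of the training set. The F1 gains from BDA are especially notable when training data is scarce.

Link: https://arxiv.org/abs/2412.08753
Authors: Md. Tariquzzaman,Audwit Nafi Anam,Naimul Haque,Mohsinul Kabir,Hasan Mahmud,Md Kamrul Hasan
Keywords: involves generating synthetic, generating synthetic samples, augmentation involves generating, Data augmentation involves, Text Data Augmentation
Subjects: Computation and Language (cs.CL)
Comments: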

Abstract:Data augmentation involves generating synthetic samples that resemble those in a given dataset. In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework that uses both pre-trained models and rule-based methods to create new variants of the text. A filtering process is included to ensure that the new text keeps the same meaning as the original while also adding variety in the words used. We conduct a comprehensive evaluation of the framework’s effectiveness in Bangla text classification tasks. Our framework achieved significant improvement in F1 scores across five distinct datasets, delivering performance equivalent to models trained on 100% of the data while utilizing only 50% of the training dataset. Additionally, we explore the impact of data scarcity by progressively reducing the training data and augmenting it through BDA, resulting in notable F1 score enhancements. The study offers a thorough examination of BDA’s performance, identifying key factors for optimal results and addressing its limitations through detailed analysis.
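
The filtering step described above balances two pulls: the augmented text should mean the same thing, yet use different words. Below is a minimal sketch of such a filter. It assumes a semantic similarity score is already available for each candidate (for instance from some sentence encoder). The thresholds, scores, and English example sentences are hypothetical and are not taken from the BDA paper.

```python
def new_token_ratio(original, candidate):
    """Fraction of candidate tokens that do not appear in the original text."""
    original_tokens = set(original.split())
    candidate_tokens = candidate.split()
    if not candidate_tokens:
        return 0.0
    new_tokens = [tok for tok in candidate_tokens if tok not in original_tokens]
    return len(new_tokens) / len(candidate_tokens)


def filter_augmentations(original, scored_candidates, min_similarity=0.8, min_new_ratio=0.2):
    """Keep candidates that stay close in meaning but add lexical variety."""
    kept = []
    for text, similarity in scored_candidates:
        varied_enough = new_token_ratio(original, text) >= min_new_ratio
        if similarity >= min_similarity and varied_enough:
            kept.append(text)
    return kept


original = "the movie was very good"
# (candidate text, assumed semantic similarity to the original)
scored_candidates = [
    ("the film was really good", 0.93),   # same meaning, new words
    ("the movie was very good", 1.00),    # exact copy, adds no variety
    ("the movie was terrible", 0.35),     # meaning changed
]
kept = filter_augmentations(original, scored_candidates)
print(kept)
```

Only the first candidate survives: the copy fails the variety check and the third fails the similarity check.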

[NLP-59] In-Context Learning with Topological Information for Knowledge Graph Completion

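[Quick read]: This paper addresses incompleteness in knowledge graphs through knowledge graph completion (KGC), in both the transductive and the inductive setting. The key is to use the in-context learning ability of large language models (LLMs), placing ontological knowledge and graph structure into the LLM's context to improve KGC performance. In the transductive setting, the method uses information about nodes that already appear in the training graph. In the more challenging inductive setting, it uses the ontology to infer useful information about missing nodes and supplies that information as contextual cues at inference time. The method outperforms baselines on the ILPC-small and ILPC-large datasets.

Link: https://arxiv.org/abs/2412.08742
Authors: Udari Madhushani Sehwag,Kassiani Papasotiriou,Jared Vann,Sumitra Ganesh
Keywords: question answering, supporting a wide, crucial for representing, representing and reasoning, reasoning over structured
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: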

Abstract:Knowledge graphs (KGs) are crucial for representing and reasoning over structured information, supporting a wide range of applications such as information retrieval, question answering, and decision-making. However, their effectiveness is often hindered by incompleteness, limiting their potential for real-world impact. While knowledge graph completion (KGC) has been extensively studied in the literature, recent advances in generative AI models, particularly large language models (LLMs), have introduced new opportunities for innovation. In-context learning has recently emerged as a promising approach for leveraging pretrained knowledge of LLMs across a range of natural language processing tasks and has been widely adopted in both academia and industry. However, how to utilize in-context learning for effective KGC remains relatively underexplored. We develop a novel method that incorporates topological information through in-context learning to enhance KGC performance. By integrating ontological knowledge and graph structure into the context of LLMs, our approach achieves strong performance in the transductive setting i.e., nodes in the test graph dataset are present in the training graph dataset. Furthermore, we apply our approach to KGC in the more challenging inductive setting, i.e., nodes in the training graph dataset and test graph dataset are disjoint, leveraging the ontology to infer useful information about missing nodes which serve as contextual cues for the LLM during inference. Our method demonstrates superior performance compared to baselines on the ILPC-small and ILPC-large datasets.
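
One way to picture "integrating ontological knowledge and graph structure into the context" is a prompt builder that collects a relation's ontology entry, the query entity's neighborhood, and a few same-relation examples. The sketch below is only that. The prompt wording, the `build_kgc_prompt` helper, and the toy triples are all hypothetical and not the authors' format.

```python
def build_kgc_prompt(head, relation, triples, ontology, max_examples=3):
    """Assemble an in-context prompt for predicting the tail of (head, relation, ?)."""
    domain, range_ = ontology.get(relation, ("unknown", "unknown"))
    # Topological context: triples that touch the query head entity.
    neighborhood = [t for t in triples if head in (t[0], t[2])][:max_examples]
    # Demonstrations: other triples that use the same relation.
    demonstrations = [t for t in triples if t[1] == relation and t[0] != head][:max_examples]

    lines = [f"Ontology: {relation} links a {domain} to a {range_}."]
    lines.append("Known facts about the query entity:")
    lines += [f"- ({h}, {r}, {t})" for h, r, t in neighborhood]
    lines.append("Examples of the same relation:")
    lines += [f"- ({h}, {r}, {t})" for h, r, t in demonstrations]
    lines.append(f"Complete the triple: ({head}, {relation}, ?)")
    return "\n".join(lines)


# Toy graph and ontology (illustrative only).
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Albert Einstein", "born_in", "Ulm"),
    ("Albert Einstein", "works_at", "ETH Zurich"),
]
ontology = {"works_at": ("person", "organization")}

prompt = build_kgc_prompt("Marie Curie", "works_at", triples, ontology)
print(prompt)
```

In an inductive setting the neighborhood of an unseen entity may be empty, in which case the ontology line is the main cue left in the prompt.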

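[NLP-60] Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

[Quick read]: This paper addresses the weakness of multimodal large language models (MLLMs) in low-level visual perception (LLVP), especially in accurately describing the geometric details of an image. The key is Geoperception, a benchmark that evaluates an MLLM's ability to transcribe 2D geometric information from an image. Using the benchmark, the paper exposes the limitations of leading MLLMs and runs a systematic empirical study of strategies for improving performance on geometric tasks, including high-fidelity synthetic data and multi-stage training with a data curriculum. A data curriculum lets models learn geometry understanding tasks that they fail to learn from scratch. Building on these findings, the paper develops Euclid, a family of models optimized for low-level geometric perception that significantly outperforms the best closed-source model, Gemini-1.5-Pro, on the Geoperception benchmark.

Link: https://arxiv.org/abs/2412.08737
Authors: Jiarui Zhang,Ollie Liu,Tianyu Yu,Jinyi Hu,Willie Neiswanger
Keywords: made rapid progress, Multimodal large language, large language models, recent years, large language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 33 pages, 22 figures, 5 tables, 7 algorithms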

Abstract:Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) – particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM’s ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks which they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although purely trained on synthetic multimodal data, Euclid shows strong generalization ability to novel geometry shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and 10.65% on average across all tasks.

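[NLP-61] LatentQA: Teaching LLMs to Decode Activations Into Natural Language

[Quick read]: This paper addresses the interpretability of language model representations, specifically how to turn the outputs of interpretability methods (circuits, vectors, scalars) into natural language that humans can read directly. The key is Latent Interpretation Tuning (LIT), which fine-tunes a decoder LLM on a dataset of activations and associated question-answer pairs so that it can answer open-ended questions about model activations in natural language. The decoder supports diverse reading applications, such as extracting relational knowledge from representations or uncovering the system prompts that govern model behavior. It also defines a differentiable loss that can be used to control models, for example to debias them or steer the sentiment of generations. The method is further extended to reveal harmful model capabilities.

Link: https://arxiv.org/abs/2412.08686
Authors: Alexander Pan,Lijie Chen,Jacob Steinhardt
Keywords: Interpretability methods seek, Interpretability methods, immediately human-interpretable, methods seek, seek to understand
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Project page is at this https URL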

Abstract:Interpretability methods seek to understand language model representations, yet the outputs of most such methods – circuits, vectors, scalars – are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we extend LatentQA to reveal harmful model capabilities, such as generating recipes for bioweapons and code for hacking.

[NLP-62] SiReRAG: Indexing Similar and Related Information for Multihop Reasoning

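[Quick read]: This paper addresses a limitation of existing retrieval-augmented generation (RAG) systems: they organize data by either semantic similarity or related information (relatedness), but not both, which leaves them short on the multihop reasoning that complex tasks require. The key is SiReRAG, a new RAG indexing approach that considers both similar and related information. On the similarity side, it builds a similarity tree through recursive summarization. On the relatedness side, it extracts propositions and entities from texts, groups propositions by shared entities, and builds a relatedness tree from recursive summaries. Both trees are indexed and flattened into a unified retrieval pool. On multihop datasets this yields an average F1 improvement of 1.9% over state-of-the-art indexing methods, and it improves existing reranking methods by up to 7.8% in average F1.

Link: https://arxiv.org/abs/2412.06206
Authors: Nan Zhang,Prafulla Kumar Choubey,Alexander Fabbri,Gabriel Bernadett-Shapiro,Rui Zhang,Prasenjit Mitra,Caiming Xiong,Chien-Sheng Wu
Keywords: retrieval-augmented generation, important step, step towards strong, related information, strong performance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: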

Abstract:Indexing is an important step towards strong performance in retrieval-augmented generation (RAG) systems. However, existing methods organize data based on either semantic similarity (similarity) or related information (relatedness), but do not cover both perspectives comprehensively. Our analysis reveals that modeling only one perspective results in insufficient knowledge synthesis, leading to suboptimal performance on complex tasks requiring multihop reasoning. In this paper, we propose SiReRAG, a novel RAG indexing approach that explicitly considers both similar and related information. On the similarity side, we follow existing work and explore some variances to construct a similarity tree based on recursive summarization. On the relatedness side, SiReRAG extracts propositions and entities from texts, groups propositions via shared entities, and generates recursive summaries to construct a relatedness tree. We index and flatten both similarity and relatedness trees into a unified retrieval pool. Our experiments demonstrate that SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SiReRAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores.
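
The relatedness side of the pipeline (extract propositions and entities, group propositions via shared entities, then flatten everything into one pool) can be sketched in a few lines. This is an illustrative simplification: the propositions and entities are hand-written rather than extracted by a model, the summaries are placeholder strings, and the rule of keeping only entities shared by at least two propositions is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_shared_entities(propositions):
    """Map each entity to the propositions mentioning it; keep only shared entities."""
    groups = defaultdict(list)
    for text, entities in propositions:
        for entity in sorted(set(entities)):
            groups[entity].append(text)
    return {entity: texts for entity, texts in groups.items() if len(texts) > 1}


def build_retrieval_pool(chunks, similarity_summaries, relatedness_summaries):
    """Flatten raw chunks and both kinds of summaries into one list to index."""
    return list(chunks) + list(similarity_summaries) + list(relatedness_summaries)


# Hand-written (proposition, entities) pairs standing in for model output.
propositions = [
    ("Ada Lovelace wrote notes on the Analytical Engine.", ["Ada Lovelace", "Analytical Engine"]),
    ("Charles Babbage designed the Analytical Engine.", ["Charles Babbage", "Analytical Engine"]),
    ("Charles Babbage was born in London.", ["Charles Babbage", "London"]),
]
groups = group_by_shared_entities(propositions)

# Placeholder summaries; a real system would generate these with an LLM.
relatedness_summaries = [f"Summary of facts about {entity}" for entity in sorted(groups)]
pool = build_retrieval_pool(
    chunks=[text for text, _ in propositions],
    similarity_summaries=["Summary of similar chunks"],
    relatedness_summaries=relatedness_summaries,
)
print(len(pool), sorted(groups))
```

Grouping by entity is what lets a retriever surface facts that are related (both mention Charles Babbage) even when their wording is not semantically similar.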

[NLP-63] Foundational Large Language Models for Materials Research

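[Quick read]: This paper addresses the bottlenecks in knowledge extraction, synthesis, and scientific reasoning caused by the explosive growth of materials science literature. The key is domain adaptation: the authors develop LLaMat, a family of foundational models for materials science, through continued pretraining of LLaMA models on a large corpus of materials literature and crystallographic data. The models excel at materials-specific tasks, particularly crystal structure generation and structured information extraction. The paper shows that domain adaptation is important for building practically deployable LLM (Large Language Model) copilots and offers guidance for developing future scientific AI systems.

Link: https://arxiv.org/abs/2412.09560
Authors: Vaibhav Mishra,Somaditya Singh,Dhruv Ahlawat,Mohd Zaki,Vaibhav Bihani,Hargun Singh Grover,Biswajit Mishra,Santiago Miret,Mausam,N. M. Anoop Krishnan
Keywords: addressing global challenges, global challenges, critical for addressing, addressing global, Materials
Subjects: Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: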

Abstract:Materials discovery and development are critical for addressing global challenges. Yet, the exponential growth in materials science literature comprising vast amounts of textual data has created significant bottlenecks in knowledge extraction, synthesis, and scientific reasoning. Large Language Models (LLMs) offer unprecedented opportunities to accelerate materials research through automated analysis and prediction. Still, their effective deployment requires domain-specific adaptation for understanding and solving domain-relevant tasks. Here, we present LLaMat, a family of foundational models for materials science developed through continued pretraining of LLaMA models on an extensive corpus of materials literature and crystallographic data. Through systematic evaluation, we demonstrate that LLaMat excels in materials-specific NLP and structured information extraction while maintaining general linguistic capabilities. The specialized LLaMat-CIF variant demonstrates unprecedented capabilities in crystal structure generation, predicting stable crystals with high coverage across the periodic table. Intriguingly, despite LLaMA-3’s superior performance in comparison to LLaMA-2, we observe that LLaMat-2 demonstrates unexpectedly enhanced domain-specific performance across diverse materials science tasks, including structured information extraction from text and tables, more particularly in crystal structure generation, a potential adaptation rigidity in overtrained LLMs. Altogether, the present work demonstrates the effectiveness of domain adaptation towards developing practically deployable LLM copilots for materials research. Beyond materials science, our findings reveal important considerations for domain adaptation of LLMs, such as model selection, training methodology, and domain-specific performance, which may influence the development of specialized scientific AI systems.

[NLP-64] Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection

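[Quick read]: This paper addresses the challenges that code-switching poses for end-to-end (E2E) automatic speech recognition (ASR) systems, particularly acoustic and semantic confusion. The solution rests on three contributions. First, language identification (LID) information is incorporated into several intermediate layers of the encoder to enrich the output embeddings with language information. Second, a language boundary alignment loss lets the subsequent ASR modules make better use of internal language posteriors. Third, language posteriors are used to facilitate deep interaction between the shared encoder and language-specific encoders. Experiments on the SEAME corpus show that the method outperforms the prior disentangle-based mixture-of-experts (D-MoE) approach and sharpens the encoder's sensitivity to language.

Link: https://arxiv.org/abs/2412.08651
Authors: Tzu-Ting Yang,Hsin-Wei Wang,Yi-Cheng Wang,Berlin Chen
Keywords: automatic speech recognition, multilingual speakers alternately, speakers alternately switch, conversations-still poses significant, poses significant challenges
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: SLT 2024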

Abstract: Code-switching, where multilingual speakers alternately switch between languages during conversations, still poses significant challenges to end-to-end (E2E) automatic speech recognition (ASR) systems due to phenomena of both acoustic and semantic confusion. This issue arises because ASR systems struggle to handle the rapid alternation of languages effectively, which often leads to significant performance degradation. Our main contributions are at least threefold: First, we incorporate language identification (LID) information into several intermediate layers of the encoder, aiming to enrich output embeddings with more detailed language information. Second, through the novel application of language boundary alignment loss, the subsequent ASR modules are enabled to more effectively utilize the knowledge of internal language posteriors. Third, we explore the feasibility of using language posteriors to facilitate deep interaction between shared encoder and language-specific encoders. Through comprehensive experiments on the SEAME corpus, we have verified that our proposed method outperforms the prior-art method, disentangle-based mixture-of-experts (D-MoE), further enhancing the acuity of the encoder to languages.

Computer Vision

[CV-0] Doe-1: Closed-Loop Autonomous Driving with Large World Model

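[Quick read]: This paper addresses problems in existing autonomous driving methods: open-loop operation, weak scalability, lack of high-order interactions, and inefficient decision-making. The key is a closed-loop framework built around Doe-1, a large driving world model that unifies perception, prediction, and planning. The paper formulates autonomous driving as a next-token generation problem and uses multi-modal tokens for the different tasks. Perception is expressed as free-form text (scene descriptions), prediction generates image tokens directly in RGB space, and planning uses a position-aware tokenizer to encode actions into discrete tokens. A multi-modal transformer is trained end to end to autoregressively generate perception, prediction, and planning tokens.

Link: https://arxiv.org/abs/2412.09627
Authors: Wenzhao Zheng,Zetian Xia,Yuanhui Huang,Sicheng Zuo,Jie Zhou,Jiwen Lu
Keywords: received increasing attention, increasing attention due, amounts of data, received increasing, increasing attention
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at: this https URL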

Abstract:End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 in various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code: this https URL.
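
Encoding a continuous action as discrete tokens usually comes down to some form of quantization. The sketch below uses plain uniform binning of a planar displacement to show the round trip from value to token and back. The value range, bin count, and function names are hypothetical, and the paper's position-aware tokenizer may work quite differently.

```python
def encode_value(value, low, high, num_bins):
    """Map a continuous value to a bin index in [0, num_bins - 1], clipping to range."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)


def decode_index(index, low, high, num_bins):
    """Return the center of the bin as the reconstructed continuous value."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# Assumed setup: displacements in meters within [-10, 10], 200 bins of 0.1 m each.
LOW, HIGH, NUM_BINS = -10.0, 10.0, 200

action = (1.23, -0.48)  # hypothetical (dx, dy) for one planning step
tokens = [encode_value(v, LOW, HIGH, NUM_BINS) for v in action]
recovered = [decode_index(t, LOW, HIGH, NUM_BINS) for t in tokens]
print(tokens, recovered)
```

With bins this size the reconstruction error stays within half a bin width (0.05 m here), which is the usual trade-off between vocabulary size and precision.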

[CV-1] FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

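[Quick read]: This paper addresses the difficulty visual diffusion models have in generating high-resolution images or videos: when a model generates content beyond its training resolution, the increase in high-frequency information leads to repetitive patterns and low-quality output. The key is FreeScale, a tuning-free inference paradigm that enables higher-resolution generation through scale fusion. FreeScale processes information from different receptive scales and fuses it by extracting the desired frequency components, which suppresses repetitive patterns and extends the high-resolution generation capability of both image and video models.

Link: https://arxiv.org/abs/2412.09626
Authors: Haonan Qiu,Shiwei Zhang,Yujie Wei,Ruihang Chu,Hangjie Yuan,Xiang Wang,Yingya Zhang,Ziwei Liu
Keywords: achieve remarkable progress, constrained computation resources, limited resolutions due, diffusion models achieve, models achieve remarkable
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this http URL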

Abstract:Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exhibit the untapped potential higher-resolution visual generation of pre-trained models. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the generation of 8k-resolution images for the first time.
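
The phrase "fuses it by extracting desired frequency components" can be illustrated with a toy one-dimensional example: take the low-frequency part of one signal and the high-frequency part of another, then add them. The sketch below uses a moving average as a stand-in low-pass filter. Which branch supplies which band, the filter choice, and the 1D setting are all assumptions made for illustration and are not FreeScale's actual procedure.

```python
def low_pass(signal, radius=1):
    """Moving-average smoothing, used here as a simple low-pass filter."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def fuse_scales(global_signal, local_signal, radius=1):
    """Low frequencies from the global branch plus high frequencies from the local branch."""
    global_low = low_pass(global_signal, radius)
    local_low = low_pass(local_signal, radius)
    local_high = [x - lo for x, lo in zip(local_signal, local_low)]
    return [lo + hi for lo, hi in zip(global_low, local_high)]


# Toy signals: a smooth global structure and a locally detailed version of it.
global_signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
local_signal = [0.2, 0.8, 2.3, 2.7, 4.2, 4.9]

fused = fuse_scales(global_signal, local_signal)
identity = fuse_scales(local_signal, local_signal)
print([round(v, 3) for v in fused])
```

A quick sanity check on the decomposition: fusing a signal with itself returns the original signal, since its low and high bands add back up.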

[CV-2] Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors

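[Quick read]: This paper addresses the automatic generation of multiview illusions, in which a single piece of visual content offers different interpretations from different viewpoints. Traditional methods such as shadow art and wire art can create interesting 3D illusions but are limited to simple visual outputs (such as figure-ground or line drawings), which restricts their artistic expressiveness and practical versatility. The paper proposes a simple yet effective solution based on user-provided text prompts or images. The key is to use a pre-trained text-to-image diffusion model to optimize the textures and geometry of neural 3D representations through differentiable rendering, so that viewing from multiple angles yields different interpretations. The method uses several techniques to improve the quality of the generated 3D multiview illusions, and extensive experiments demonstrate its effectiveness across diverse 3D forms.

Link: https://arxiv.org/abs/2412.09625
Authors: Yue Feng,Vaibhav Sanjay,Spencer Lutz,Badour AlBahar,Songwei Ge,Jia-Bin Huang
Keywords (EN): Automatically generating multiview, content offers distinct, Automatically generating, visual content offers, offers distinct interpretations
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL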

Abstract:Automatically generating multiview illusions is a compelling challenge, where a single piece of visual content offers distinct interpretations from different viewing perspectives. Traditional methods, such as shadow art and wire art, create interesting 3D illusions but are limited to simple visual outputs (i.e., figure-ground or line drawing), restricting their artistic expressiveness and practical versatility. Recent diffusion-based illusion generation methods can generate more intricate designs but are confined to 2D images. In this work, we present a simple yet effective approach for creating 3D multiview illusions based on user-provided text prompts or images. Our method leverages a pre-trained text-to-image diffusion model to optimize the textures and geometry of neural 3D representations through differentiable rendering. When viewed from multiple angles, this produces different interpretations. We develop several techniques to improve the quality of the generated 3D multiview illusions. We demonstrate the effectiveness of our approach through extensive experiments and showcase illusion generation with diverse 3D forms.

[CV-3] GenEx: Generating an Explorable World

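[Quick read]: This paper addresses a core challenge for artificial intelligence: understanding and exploring the 3D physical real world. The key to the solution is GenEx, a system that plans complex embodied world exploration guided by generative imagination. GenEx generates a complete, 3D-consistent imaginative environment from a single RGB image and brings it to life through panoramic video streams. Using scalable 3D world data curated from Unreal Engine, the system achieves high-quality world generation and robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. GPT-assisted agents use predictive expectations about unseen parts of the physical world to refine their beliefs, simulate the outcomes of different decisions, and make more informed choices, which lets them perform complex embodied tasks in both goal-agnostic exploration and goal-driven navigation.

Link: https://arxiv.org/abs/2412.09624
Authors: Taiming Lu,Tianmin Shu,Junfei Xiao,Luoxin Ye,Jiahao Wang,Cheng Peng,Chen Wei,Daniel Khashabi,Rama Chellappa,Alan Yuille,Jieneng Chen
Keywords (EN): artificial intelligence, central challenge, development of artificial, Understanding, physical real world
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Website: this http URL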

Abstract:Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.

[CV-4] OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation

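[Quick read]: This paper addresses two problems in omnidirectional video (ODV) generation: existing text-to-video generation methods fall short in content accuracy and consistency, and conventional motion control techniques cause spatial distortion and poor performance when handling complex spherical motion. The key to the solution is OmniDrag, the first omnidirectional image-to-video generation method to support both scene-level and object-level motion control. OmniDrag introduces an omnidirectional control module that is jointly fine-tuned with temporal attention layers to handle complex spherical motion effectively. The paper also develops a novel spherical motion estimator that accurately extracts motion-control signals and lets users perform drag-style ODV generation by drawing handle and target points. To support the method, the paper presents a new dataset, Move360, which addresses the scarcity of ODV data with large scene and object motion.

Link: https://arxiv.org/abs/2412.09623
Authors: Weiqi Li,Shijie Zhao,Chong Mou,Xuhan Sheng,Zhenyu Zhang,Qian Wang,Junlin Li,Li Zhang,Jian Zhang
Keywords (EN): reality gains popularity, virtual reality gains, gains popularity, virtual reality, reality gains
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: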

Abstract:As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points. We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. The project page is available at this https URL.

[CV-5] LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

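[Quick read]: This paper addresses the attribute entanglement that arises when multiple personalized models are merged for multi-concept image generation, and the complexity of the separate retraining otherwise needed to keep concepts distinct. The key to the solution is LoRACLR, a new method that uses a contrastive objective to align and merge multiple LoRA models, each fine-tuned for a different concept, into a single unified model without additional individual fine-tuning. The method ensures that the concepts remain compatible in weight space while minimizing interference between them, enabling efficient, scalable multi-concept image synthesis.

Link: https://arxiv.org/abs/2412.09622
Authors: Enis Simsar,Thomas Hofmann,Federico Tombari,Pinar Yanardag
Keywords (EN): Recent advances, allowing specific concepts, customization have enabled, enabled high-fidelity, allowing specific
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL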

Abstract:Recent advances in text-to-image customization have enabled high-fidelity, context-rich generation of personalized images, allowing specific concepts to appear in a variety of scenarios. However, current methods struggle with combining multiple personalized models, often leading to attribute entanglement or requiring separate training to preserve concept distinctiveness. We present LoRACLR, a novel approach for multi-concept image generation that merges multiple LoRA models, each fine-tuned for a distinct concept, into a single, unified model without additional individual fine-tuning. LoRACLR uses a contrastive objective to align and merge the weight spaces of these models, ensuring compatibility while minimizing interference. By enforcing distinct yet cohesive representations for each concept, LoRACLR enables efficient, scalable model composition for high-quality, multi-concept image synthesis. Our results highlight the effectiveness of LoRACLR in accurately merging multiple concepts, advancing the capabilities of personalized image generation.

[CV-6] Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

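[Quick read]: This paper addresses the problem of understanding dynamic 3D scenes from imagery, in particular how to recover 3D motion when large-scale supervised training data is unavailable. The key to the solution is a method for mining high-quality 4D reconstructions from stereoscopic, wide-angle internet videos. The system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods to produce high-quality dynamic 3D reconstructions. Concretely, the method generates world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories and uses this data to train a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, which enables generalization to diverse real-world scenes.

Link: https://arxiv.org/abs/2412.09621
Authors: Linyi Jin,Richard Tucker,Zhengqi Li,David Fouhey,Noah Snavely,Aleksander Holynski
Keywords (EN): Learning to understand, imagery is crucial, crucial for applications, applications ranging, ranging from robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: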

Abstract:Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes. Project page: this https URL

[CV-7] Learning Camera Movement Control from Real-World Drone Videos

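[Quick read]: This paper addresses three weaknesses of existing AI videography methods: limited appearance diversity in simulation training, the high cost of recording expert operations, and the difficulty of designing heuristic-based goals that cover all scenarios. The key to the solution is a scalable method that collects real-world training data to improve diversity, extracts camera trajectories automatically to reduce annotation costs, and trains an effective architecture that does not rely on heuristics. Concretely, the authors collect 99k high-quality trajectories by running 3D reconstruction on online videos, use a Kalman filter to remove low-quality data, and introduce DVGFormer, an auto-regressive transformer that uses past images and camera paths to predict the camera movement in the next frame. The system is evaluated on synthetic natural scenes and real city 3D scans, where it demonstrates its effectiveness at executing complex camera movements.

Link: https://arxiv.org/abs/2412.09620
Authors: Yunzhong Hou,Liang Zheng,Philip Torr
Keywords (EN): filming existing subjects, generating the pixels, study seeks, seeks to automate, subjects into attractive
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: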

Abstract:This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using a Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos. Data and code are available at this http URL.
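
The abstract does not say how the Kalman filter is used to flag low-quality trajectories. One plausible reading, sketched below purely as an assumption, is to run a filter over each camera path and reject paths whose measurements disagree strongly with the filter's predictions. The scalar random-walk model, the noise variances, and the names `innovation_score` and `keep_smooth` are all hypothetical.

```python
from typing import List


def innovation_score(positions: List[float],
                     process_var: float = 1e-2,
                     measurement_var: float = 1e-1) -> float:
    """Mean squared innovation of a scalar Kalman filter (random-walk state model)."""
    estimate, variance = positions[0], 1.0
    total = 0.0
    for z in positions[1:]:
        variance += process_var                  # predict: state unchanged, uncertainty grows
        innovation = z - estimate                # disagreement between measurement and prediction
        gain = variance / (variance + measurement_var)
        estimate += gain * innovation            # update the state estimate
        variance *= (1.0 - gain)                 # update the uncertainty
        total += innovation ** 2
    return total / max(1, len(positions) - 1)


def keep_smooth(trajectories: List[List[float]], threshold: float) -> List[List[float]]:
    """Keep trajectories whose score is below a hand-picked (hypothetical) threshold."""
    return [t for t in trajectories if innovation_score(t) < threshold]
```

A real camera path is a sequence of 3D poses, so a practical version would need a multivariate state (for example position and velocity); the scalar case is used here only to keep the sketch short.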

[CV-8] SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

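[Quick read]: This paper aims to address the large model sizes, slow runtime, and low generation quality on mobile devices of existing text-to-image (T2I) diffusion models. The key to the solution is an extremely small and fast T2I model that can generate high-resolution, high-quality images on mobile platforms. The specific measures are: 1) systematically optimizing the network architecture design to reduce model parameters and latency while preserving generation quality; 2) applying cross-architecture knowledge distillation from a larger model, with a multi-level approach that guides training from scratch to improve generation quality; 3) combining adversarial guidance with knowledge distillation to enable few-step generation. The resulting model, SnapGen, is the first to generate 1024x1024 px images on a mobile device. On ImageNet-1K it reaches an FID of 2.06 for 256x256 px generation with 372M parameters, and on T2I benchmarks it surpasses large-scale models while being far smaller (for example, 7x smaller than SDXL and 14x smaller than IF-XL).

Link: https://arxiv.org/abs/2412.09619
Authors: Dongting Hu,Jierun Chen,Xijie Huang,Huseyin Coskun,Arpit Sahni,Aarush Gupta,Anujraaj Goyal,Dishani Lahiri,Rajesh Singh,Yerlan Idelbayev,Junli Cao,Yanyu Li,Kwang-Ting Cheng,S.-H. Gary Chan,Mingming Gong,Sergey Tulyakov,Anil Kag,Yanwu Xu,Jian Ren
Keywords (EN): including large model, diffusion models face, slow runtime, face several limitations, including large
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: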

Abstract:Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable a few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).

[CV-9] EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

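[Quick read]: This paper addresses two limitations: conventional tuning-free methods cannot capture consistent visual elements across multiple reference images, and the tuning-based Low-Rank Adaptation (LoRA) approach requires specific fine-tuning for each image group. The key to the solution is EasyRef, a plug-and-play adaptation method that uses the multi-image comprehension and instruction-following capabilities of a multimodal large language model (MLLM) to capture the consistent visual elements across multiple reference images, and injects the MLLM's representations into the diffusion process through adapters, which allows generalization to unseen domains. The paper also introduces an efficient reference aggregation strategy and a progressive training scheme to reduce computational cost and improve detail preservation.

Link: https://arxiv.org/abs/2412.09618
Authors: Zhuofan Zong,Dongzhi Jiang,Bingqi Ma,Guanglu Song,Hao Shao,Dazhong Shen,Yu Liu,Hongsheng Li
Keywords (EN): Significant achievements, consistent visual elements, visual elements, consistent visual, capture consistent visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report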

Abstract:Significant achievements in personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements within multiple references. Although the tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements within multiple images through the training process, it necessitates specific finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Besides, injecting the MLLM’s representations into the diffusion process through adapters can easily generalize to unseen domains, mining the consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.

[CV-10] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

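[Quick read]: This paper addresses the performance degradation of Vision-Language Models (VLMs) in long-context scenarios such as videos, high-resolution images, or lengthy image-text documents. The key to the solution is a new positional encoding method for visual tokens, Variable Visual Position Encoding (V2PE), which uses variable and smaller position increments for visual tokens so that long multimodal sequences are managed more efficiently. Experiments show that V2PE improves the ability of VLMs to understand and reason over long contexts. The authors further fine-tune the open-source VLM InternVL2 with V2PE and their augmented long-context multimodal datasets; the resulting model can process multimodal sequences of up to 1M tokens.

Link: https://arxiv.org/abs/2412.09616
Authors: Junqi Ge,Ziyi Chen,Jintao Lin,Jinguo Zhu,Xihui Liu,Jifeng Dai,Xizhou Zhu
Keywords (EN): lengthy image-text documents, shown promising capabilities, tasks involving videos, high-resolution images, involving videos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code and models will be available at this https URL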

Abstract:Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model’s context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in enhancing VLMs’ ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM, InternVL2. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications.
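
The core idea, smaller position increments for visual tokens, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the paper's implementation: the function name `v2pe_positions`, the fixed `delta`, and the rule that a token's position is assigned before the counter advances are all hypothetical choices.

```python
from typing import List


def v2pe_positions(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign position indices: text tokens advance the counter by 1, visual tokens by `delta`."""
    positions = []
    current = 0.0
    for visual in is_visual:
        positions.append(current)
        current += delta if visual else 1.0   # assumed rule: smaller step after a visual token
    return positions


# One text token, four visual tokens, one text token.
mask = [False, True, True, True, True, False]
print(v2pe_positions(mask, delta=0.25))  # [0.0, 1.0, 1.25, 1.5, 1.75, 2.0]
print(v2pe_positions(mask, delta=1.0))   # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

With `delta=0.25` the four visual tokens consume one unit of position range instead of four, which is how a long visual sequence can stay inside a fixed context window.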

[CV-11] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

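[Quick read]: This paper addresses two problems in Vision-Language Models (VLMs): the inefficiency caused by the large token length of visual inputs when processing images and videos, and the fact that existing models usually handle images and videos separately, which limits their ability on tasks that combine the two. The key to the solution is a unified token compression strategy called Progressive Visual Token Compression (PVC). It progressively encodes and adaptively compresses the tokens of each frame, exploits temporal redundancy to compress video tokens efficiently, and extends each image into a "static" video whose spatial details are supplemented progressively. PVC unifies token compression for images and videos while preserving spatial details and temporal changes; in experiments it achieves state-of-the-art performance on various video understanding benchmarks with no performance loss on detail-sensitive image tasks.

Link: https://arxiv.org/abs/2412.09613
Authors: Chenyu Yang,Xuan Dong,Xizhou Zhu,Weijie Su,Jiahao Wang,Hao Tian,Zhe Chen,Wenhai Wang,Lewei Lu,Jifeng Dai
Keywords (EN): Large Vision-Language Models, Large Vision-Language, token compression, Visual token compression, compression
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: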

Abstract:Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, limiting the capabilities of combining images and videos. To this end, we extend each image into a “static” video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are efficiently compressed by exploiting the inherent temporal redundancy. Images are repeated as static videos, and the spatial details can be gradually supplemented in multiple frames. PVC unifies the token compression of images and videos. With a limited number of tokens per frame (64 tokens by default), spatial details and temporal changes can still be preserved. Experiments show that our model achieves state-of-the-art performance across various video understanding benchmarks, including long video tasks and fine-grained short video tasks. Meanwhile, our unified token compression strategy incurs no performance loss on image benchmarks, particularly in detail-sensitive tasks.

[CV-12] FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

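[Quick read]: This paper addresses the difficulty rectified flow models have with disentangled editing in image generation, that is, making attribute-specific modifications without affecting unrelated parts of the image. The key to the solution is FluxSpace, a domain-agnostic image editing method that uses the representation space learned by the transformer blocks of rectified flow models. By constructing a set of semantically interpretable representations, FluxSpace supports a wide range of tasks, from fine-grained image editing to artistic creation, with disentangled editing capabilities, offering a scalable and effective approach to image editing.

Link: https://arxiv.org/abs/2412.09611
Authors: Yusuf Dalva,Kavana Venkatesh,Pinar Yanardag
Keywords (EN): Rectified flow models, high-quality image synthesis, Rectified flow, showcasing impressive capabilities, flow models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL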

Abstract:Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation prevents the ability to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method leveraging a representation space with the ability to control the semantics of images generated by rectified flow transformers, such as Flux. By leveraging the representations learned by the transformer blocks within the rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable and effective image editing approach, along with its disentanglement capabilities.

[CV-13] Representing Long Volumetric Video with Temporal Gaussian Hierarchy SIGGRAPH

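[Quick read]: This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Existing dynamic view synthesis methods use powerful 4D representations (such as feature grids or point cloud sequences) to achieve high-quality rendering, but they are typically limited to short clips (1~2 seconds) and suffer from large memory footprints on longer videos. The proposed solution is a new 4D representation called Temporal Gaussian Hierarchy, whose key idea is to exploit the temporal redundancy in dynamic scenes. Because different regions of a scene change at different speeds, the method builds a multi-level hierarchy of 4D Gaussian primitives in which each level describes scene regions with a different degree of content change, and it adaptively shares Gaussian primitives to represent scene content that stays unchanged across temporal segments, which reduces the number of Gaussian primitives. In addition, the tree-like structure of the hierarchy allows the scene at a particular moment to be represented with a subset of the Gaussian primitives during training or rendering, so GPU memory usage stays nearly constant regardless of video length. Experiments show that the method outperforms existing methods in training cost, rendering speed, and storage usage, and, to the authors' knowledge, it is the first approach able to handle minutes of volumetric video data efficiently while maintaining state-of-the-art rendering quality.

Link: https://arxiv.org/abs/2412.09608
Authors: Zhen Xu,Yinghao Xu,Zhiyuan Yu,Sida Peng,Jiaming Sun,Hujun Bao,Xiaowei Zhou
Keywords (EN): multi-view RGB videos, multi-view RGB, Gaussian primitives, RGB videos, reconstructing long volumetric
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
Comments: SIGGRAPH Asia 2024 (TOG). Project page: this https URL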

Abstract:This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1~2s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that there are generally various degrees of temporal redundancy in dynamic scenes, which consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change, and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during the training or rendering regardless of the video length. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality. Our project page is available at: this https URL.
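
To see why a tree-like temporal hierarchy can keep memory nearly constant, consider the toy sketch below. It assumes, hypothetically, that level k splits the timeline into segments of length `root_length / 2**k` and that rendering time t needs only the one segment per level that contains t. The abstract does not specify the layout, so the function name and the halving rule are illustrative assumptions.

```python
from typing import List, Tuple


def active_segments(t: float, root_length: float, num_levels: int) -> List[Tuple[int, int]]:
    """Return (level, segment_index) pairs whose time span contains t.

    Assumes level k uses segments of length root_length / 2**k (an illustrative choice).
    """
    active = []
    for level in range(num_levels):
        segment_length = root_length / (2 ** level)
        active.append((level, int(t // segment_length)))
    return active


print(active_segments(5.0, root_length=8.0, num_levels=3))   # [(0, 0), (1, 1), (2, 2)]
print(active_segments(65.0, root_length=8.0, num_levels=3))  # [(0, 8), (1, 16), (2, 32)]
```

In this sketch the number of active segments equals the number of levels whether t is 5 seconds or 65 seconds into the video, so only a bounded subset of primitives would need to be resident at any moment.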

[CV-14] Spectral Image Tokenizer

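[Quick read]: This paper addresses a weakness of conventional image tokenizers in autoregressive generative models: tokens are tied to spatial locations and arranged in raster scan order, which is not ideal for autoregressive modeling. The key to the solution is to tokenize the image spectrum obtained from a discrete wavelet transform (DWT), so that the token sequence represents the image in a coarse-to-fine fashion. The advantages of this approach are: 1) it exploits the fact that natural images are more compressible at high frequencies; 2) it can take and reconstruct images of different resolutions without retraining; 3) it improves the conditioning for next-token prediction, which is based on a coarse reconstruction of the full image rather than a partial line-by-line reconstruction; 4) it enables partial decoding, in which the first few generated tokens reconstruct a coarse version of the image; 5) it enables autoregressive models to be used for image upsampling.

Link: https://arxiv.org/abs/2412.09607
Authors: Carlos Esteves,Mohammed Suhail,Ameesh Makadia
Keywords (EN): autoregressive transformer-based image, transformer-based image generation, Image, tokenizers map images, Image tokenizers map
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: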

Abstract:Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster scan order, which is not ideal for autoregressive modeling. In this paper, we propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT), such that the sequence of tokens represents the image in a coarse-to-fine fashion. Our tokenizer brings several advantages: 1) it leverages that natural images are more compressible at high frequencies, 2) it can take and reconstruct images of different resolutions without retraining, 3) it improves the conditioning for next-token prediction – instead of conditioning on a partial line-by-line reconstruction of the image, it takes a coarse reconstruction of the full image, 4) it enables partial decoding where the first few generated tokens can reconstruct a coarse version of the image, 5) it enables autoregressive models to be used for image upsampling. We evaluate the tokenizer reconstruction metrics as well as multiscale image generation, text-guided image upsampling and editing.
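
The coarse-to-fine ordering and the partial decoding property can be illustrated with a 1D Haar wavelet transform. This is a toy sketch, not the paper's tokenizer: it works on a 1D signal instead of an image, skips quantization into discrete tokens, and the names `haar_step`, `coarse_to_fine`, and `reconstruct` are hypothetical.

```python
from typing import List


def haar_step(signal: List[float]):
    """One Haar level: pairwise averages (coarse part) and pairwise half-differences (detail)."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def coarse_to_fine(signal: List[float]) -> List[List[float]]:
    """Return Haar bands ordered coarsest first; the length must be a power of two."""
    bands = []
    current = list(signal)
    while len(current) > 1:
        current, details = haar_step(current)
        bands.insert(0, details)          # later (coarser) detail bands go to the front
    return [current] + bands


def reconstruct(bands: List[List[float]], length: int) -> List[float]:
    """Invert the transform; detail bands that are missing are treated as zeros."""
    current = list(bands[0])
    level = 1
    while len(current) < length:
        details = bands[level] if level < len(bands) else [0.0] * len(current)
        current = [v for a, d in zip(current, details) for v in (a + d, a - d)]
        level += 1
    return current


bands = coarse_to_fine([4.0, 2.0, 6.0, 8.0])
print(bands)                      # [[5.0], [-2.0], [1.0, -1.0]]
print(reconstruct(bands[:1], 4))  # [5.0, 5.0, 5.0, 5.0]
print(reconstruct(bands[:2], 4))  # [3.0, 3.0, 7.0, 7.0]
print(reconstruct(bands, 4))      # [4.0, 2.0, 6.0, 8.0]
```

Decoding only a prefix of the bands already yields a full-length, lower-detail signal, which mirrors the partial decoding and upsampling properties listed in the abstract.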

[CV-15] Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

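[Quick read]: This paper addresses the limitations in evaluating the 3D understanding of visual foundation models (VFMs), in particular that existing evaluation methods rely on 3D data as ground truth, which limits the scale and diversity of the evaluation set. The key to the solution is the Feat2GS framework, which reads out 3D Gaussian attributes from VFM features extracted from unposed images, so that 3D awareness of geometry and texture can be probed through novel view synthesis without requiring 3D data. By disentangling the 3D Gaussian parameters into geometry and texture, Feat2GS analyzes geometry awareness and texture awareness separately. Building on these findings, the authors develop several variants that achieve state-of-the-art results, making Feat2GS an effective benchmark for probing VFMs.

Link: https://arxiv.org/abs/2412.09606
Authors: Yue Chen,Xingyu Chen,Anpei Chen,Gerard Pons-Moll,Yuliang Xiu
Keywords (EN): visual foundation models, natural question arises, foundation models, question arises, visual foundation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL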

Abstract:Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? With the differences in architecture and training protocols (i.e., objectives, proxy tasks), a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing works on 3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness, and require 3D data as ground-truth, which limits the scale and diversity of their evaluation set. To address these issues, we introduce Feat2GS, which reads out 3D Gaussian attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness for geometry and texture via novel view synthesis, without requiring 3D data. Additionally, the disentanglement of 3DGS parameters, geometry ($\boldsymbol{x}, \alpha, \Sigma$) and texture ($\boldsymbol{c}$), enables separate analysis of texture and geometry awareness. Under Feat2GS, we conduct extensive experiments to probe the 3D awareness of several VFMs, and investigate the ingredients that lead to a 3D-aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art across diverse datasets. This makes Feat2GS useful for probing VFMs, and as a simple-yet-effective baseline for novel-view synthesis. Code and data will be made available at this https URL.

[CV-16] SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

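[Quick read]: This paper addresses the complexity of model architectures and training pipelines in existing unified Multimodal Large Language Models (MLLMs), in particular encoder-free unified models. The key to the solution is SynerGen-VL, a simple yet powerful encoder-free MLLM that introduces a token folding mechanism and a vision-expert-based progressive alignment pretraining strategy to support high-resolution image understanding while reducing training complexity. With these innovations, SynerGen-VL matches or surpasses existing encoder-free unified MLLMs with comparable or smaller parameter sizes and narrows the gap with task-specific state-of-the-art models.

Link: https://arxiv.org/abs/2412.09604
Authors: Hao Li,Changyao Tian,Jie Shao,Xizhou Zhu,Zhaokai Wang,Jinguo Zhu,Wenhan Dou,Xiaogang Wang,Hongsheng Li,Lewei Lu,Jifeng Dai
Keywords (EN): Large Language Models, Multimodal Large Language, Large Language, unified Multimodal Large, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: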

Abstract:The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.

[CV-17] Do Multimodal Large Language Models See Like Humans?

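[Quick read]: This paper asks whether Multimodal Large Language Models (MLLMs) perceive visual information the way the Human Visual System (HVS) does, a question that current benchmarks cannot evaluate. To address it, the paper introduces HVSBench, a large-scale benchmark that assesses the alignment between MLLMs and the HVS on fundamental vision tasks. HVSBench contains over 85,000 multimodal samples spanning 13 categories and 5 HVS fields: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Experiments show that HVSBench provides a comprehensive evaluation of MLLMs and that even the best models leave significant room for improvement. The key contribution is a new perspective that supports research on human-aligned and explainable MLLMs.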

Link: https://arxiv.org/abs/2412.09603
Authors: Jiaying Lin, Shuquan Ye, Rynson W.H. Lau
Keywords: Large Language Models, Multimodal Large Language, Large Language, leveraging recent advancements, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.

[CV-18] Hidden Biases of End-to-End Driving Datasets CVPR2024

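[Quick read]: This paper makes a first attempt at applying end-to-end driving systems to CARLA Leaderboard 2.0 and studies how the training dataset affects model performance. The key to the solution is a systematic analysis of the training dataset, which yields three insights: (1) expert style significantly affects downstream policy performance; (2) in complex datasets, frames should not be weighted by simplistic criteria such as class frequencies; (3) estimating whether a frame changes the target labels relative to previous frames can reduce dataset size without removing important information. Incorporating these findings, the proposed model ranks first and second on the map and sensors tracks of the 2024 CARLA Challenge, respectively, and sets a new state of the art on the Bench2Drive test routes. The paper also uncovers a design flaw in the current evaluation metrics and proposes a modification.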

Link: https://arxiv.org/abs/2412.09602
Authors: Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, Kashyap Chitta
Keywords: made rapid progress, rapid progress, systems have made, made rapid, CARLA Leaderboard
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Technical report for the CVPR 2024 Workshop on Foundation Models for Autonomous Systems. Runner-up of the track ‘CARLA Autonomous Driving Challenge’ in the 2024 Autonomous Grand Challenge ( this https URL )


Abstract:End-to-end driving systems have made rapid progress, but have so far not been applied to the challenging new CARLA Leaderboard 2.0. Further, while there is a large body of literature on end-to-end architectures and training strategies, the impact of the training dataset is often overlooked. In this work, we make a first attempt at end-to-end driving for Leaderboard 2.0. Instead of investigating architectures, we systematically analyze the training dataset, leading to new insights: (1) Expert style significantly affects downstream policy performance. (2) In complex data sets, the frames should not be weighted on the basis of simplistic criteria such as class frequencies. (3) Instead, estimating whether a frame changes the target labels compared to previous frames can reduce the size of the dataset without removing important information. By incorporating these findings, our model ranks first and second respectively on the map and sensors tracks of the 2024 CARLA Challenge, and sets a new state-of-the-art on the Bench2Drive test routes. Finally, we uncover a design flaw in the current evaluation metrics and propose a modification for future challenges. Our dataset, code, and pre-trained models are publicly available at this https URL.
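Insight (3) can be pictured with a small filter that keeps a frame only when its target label differs from the last kept frame. This is a minimal sketch under simplifying assumptions: labels are single numbers (for example a target speed), and the function name `filter_frames` and the `tolerance` parameter are hypothetical. The paper's actual estimation procedure is not described in the abstract.

```python
def filter_frames(labels, tolerance=0.0):
    """Return indices of frames whose label differs from the last kept frame.

    Hypothetical helper: `labels` holds one scalar target label per frame.
    """
    kept = []
    last = None
    for index, label in enumerate(labels):
        # Always keep the first frame, then keep only label changes.
        if last is None or abs(label - last) > tolerance:
            kept.append(index)
            last = label
    return kept


# Toy sequence of target speeds; long constant runs collapse to one frame.
target_speeds = [5.0, 5.0, 5.0, 3.0, 3.0, 0.0, 0.0, 0.0, 4.0]
kept = filter_frames(target_speeds)
print(kept)  # [0, 3, 5, 8]
```

In this toy sequence, nine frames reduce to four while every change in the target label is preserved.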

[CV-19] Owl-1: Omni World Model for Consistent Long Video Generation

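[Quick read]: This paper addresses the long-horizon inconsistency of existing video generation models (VGMs), which produce long videos by iteratively conditioning only on the last generated frame. The key to the solution is the Omni World modeL (Owl-1), which models long-term developments in a latent space and uses VGMs to render them into video, producing coherent and consistent long videos. Specifically, Owl-1 represents the world with a latent state variable that can be decoded into explicit video observations; these observations are used to anticipate temporal dynamics, which in turn update the state variable. This interaction between evolving dynamics and persistent state improves the diversity and consistency of long videos.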

Link: https://arxiv.org/abs/2412.09600
Authors: Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, Jiwen Lu
Keywords: general-purpose large vision, long video generation, large vision models, received extensive attention, extensive attention recently
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at: this https URL


Abstract:Video generation models (VGMs) have received extensive attention recently and serve as promising candidates for general-purpose large vision models. While they can only generate short videos each time, existing methods achieve long video generation by iteratively calling the VGMs, using the last-frame output as the condition for the next-round generation. However, the last frame only contains short-term fine-grained information about the scene, resulting in inconsistency in the long horizon. To address this, we propose an Omni World modeL (Owl-1) to produce long-term coherent and comprehensive conditions for consistent long video generation. As videos are observations of the underlying evolving world, we propose to model the long-term developments in a latent space and use VGMs to film them into videos. Specifically, we represent the world with a latent state variable which can be decoded into explicit video observations. These observations serve as a basis for anticipating temporal dynamics which in turn update the state variable. The interaction between evolving dynamics and persistent state enhances the diversity and consistency of the long videos. Extensive experiments show that Owl-1 achieves comparable performance with SOTA methods on VBench-I2V and VBench-Long, validating its ability to generate high-quality video observations. Code: this https URL.
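The state, observation, and dynamics loop described in the abstract can be sketched as a plain rollout. Everything below is a toy stand-in: the callables `decode`, `anticipate`, and `update` are hypothetical placeholders for the learned components, and the integer "state" merely makes the control flow visible.

```python
def rollout(initial_state, decode, anticipate, update, rounds):
    """Alternate between decoding a clip and updating the latent state.

    All four arguments after `initial_state` are hypothetical stand-ins
    for learned models; this only shows the order of operations.
    """
    state = initial_state
    clips = []
    for _ in range(rounds):
        clip = decode(state)            # state -> explicit video observation
        dynamics = anticipate(clip)     # observation -> anticipated dynamics
        state = update(state, dynamics) # dynamics -> next latent state
        clips.append(clip)
    return clips, state


# Toy components: a "clip" is two consecutive integers starting at the state.
clips, final_state = rollout(
    initial_state=0,
    decode=lambda s: [s, s + 1],
    anticipate=lambda clip: clip[-1] - clip[0] + 1,
    update=lambda s, d: s + d,
    rounds=3,
)
print(clips, final_state)
```

The point of the structure is that each new clip is conditioned on a persistent state rather than only on the last frame of the previous clip.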

[CV-20] RatBodyFormer: Rodent Body Surface from Keypoints

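[Quick read]: This paper addresses the automatic analysis of the textureless rat body surface for behavior modeling, on the premise that body surface motion is a rich source of information for deciphering rat behavior. It makes two main contributions. The first is RatDome, a multi-camera system for capturing rat behavior, together with a large-scale dataset of paired 3D keypoints and 3D body surface points. The second is RatBodyFormer, a network that transforms detected keypoints into 3D body surface points; it is trained with masked learning and is agnostic to the exact locations of the 3D body surface points in the training data. Together these provide a new foundation for automated rat behavior analysis, with likely implications for biomedical and neuroscientific research.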

Link: https://arxiv.org/abs/2412.09599
Authors: Ayaka Higami, Karin Oshima, Tomoyo Isoguchi Shiramatsu, Hirokazu Takahashi, Shohei Nobuhara, Ko Nishino
Keywords: body surface points, surface evades automatic, body surface, textureless body surface, body surface evades
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Rat behavior modeling goes to the heart of many scientific studies, yet the textureless body surface evades automatic analysis as it literally has no keypoints that detectors can find. The movement of the body surface, however, is a rich source of information for deciphering the rat behavior. We introduce two key contributions to automatically recover densely 3D sampled rat body surface points, passively. The first is RatDome, a novel multi-camera system for rat behavior capture, and a large-scale dataset captured with it that consists of pairs of 3D keypoints and 3D body surface points. The second is RatBodyFormer, a novel network to transform detected keypoints to 3D body surface points. RatBodyFormer is agnostic to the exact locations of the 3D body surface points in the training data and is trained with masked-learning. We experimentally validate our framework with a number of real-world experiments. Our results collectively serve as a novel foundation for automated rat behavior analysis and will likely have far-reaching implications for biomedical and neuroscientific research.

[CV-21] LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors

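[Quick read]: This paper addresses the geometric ambiguity and limited viewpoint information in single-image 3D reconstruction, and in particular three challenges in exploiting the generative priors of Latent Video Diffusion Models (LVDMs): (1) quality degradation under large camera motions, (2) difficulty of precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. The key to the solution is the LiftImage3D framework, which designs an articulated trajectory strategy that decomposes large camera motions into controllable small motions, uses the neural matching model MASt3R to calibrate camera poses and produce point clouds, and introduces a distortion-aware 3D Gaussian splatting representation that learns independent per-frame distortions and outputs undistorted canonical Gaussians. This ensures 3D consistency while effectively releasing the generative priors of LVDMs.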

Link: https://arxiv.org/abs/2412.09597
Authors: Yabo Chen, Chen Yang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, Wei Shen, Wenrui Dai, Hongkai Xiong, Qi Tian
Keywords: limited viewpoint information, computer vision due, reconstruction remains, viewpoint information, inherent geometric ambiguities
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL


Abstract:Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that effectively releases LVDMs’ generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. Then we use robust neural matching models, i.e., MASt3R, to calibrate the camera poses of generated frames and produce corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which can learn independent distortions between frames and output undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on three challenging datasets, i.e., LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.

[CV-22] Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

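[Quick read]: This paper addresses the recovery of object geometry and materials from a single image, a task made difficult by its under-constrained nature. The key to the solution is Neural LightRig, a framework that boosts intrinsic estimation with auxiliary multi-lighting conditions derived from large-scale diffusion models. Specifically, it first uses the illumination priors of diffusion models to build a multi-light diffusion model on a synthetic relighting dataset, generating multiple consistent images lit by point light sources from different directions. These multi-light images then reduce estimation uncertainty when training a large G-buffer model with a U-Net backbone to predict surface normals and materials. Experiments show that the method significantly outperforms state-of-the-art approaches, enabling accurate surface normal and PBR material estimation with vivid relighting effects.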

Link: https://arxiv.org/abs/2412.09593
Authors: Zexin He, Tengfei Wang, Xin Huang, Xingang Pan, Ziwei Liu
Keywords: Recovering the geometry, under-constrained nature, challenging due, Recovering, present Neural LightRig
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Code and dataset are available on our project page at this https URL.

[CV-23] Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

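[Quick read]: This paper addresses gaze target estimation, that is, predicting where a person is looking in a scene. The key to the solution is Gaze-LLE, a transformer framework that extracts a single scene feature representation with a frozen DINOv2 encoder, applies a person-specific positional prompt, and decodes gaze with a lightweight module. The method replaces the complex hand-crafted pipelines of prior work and achieves state-of-the-art performance on several gaze benchmarks.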

Link: https://arxiv.org/abs/2412.09586
Authors: Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg
Keywords: gaze target estimation, gaze target, person gaze target, target estimation, address the problem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person’s gaze target requires reasoning both about the person’s appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: this http URL .

[CV-24] OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

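[Quick read]: This paper addresses the limited visual understanding of multimodal large language models (MLLMs), arguing that training with natural language supervision alone is sub-optimal. The key to the solution is OLA-VLM, which optimizes the intermediate LLM representations by distilling knowledge from a set of target visual representations into the LLM's hidden representations. Specifically, OLA-VLM couples predictive visual embedding optimization with next text-token prediction during pretraining, which improves the quality of the model's visual representations. It improves performance across multiple benchmarks, with a notable gain of 8.7% on the Depth task.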

Link: https://arxiv.org/abs/2412.09585
Authors: Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
Keywords: natural language supervision, developing contemporary MLLMs, natural language, language supervision, solely natural language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL


Abstract:The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., solely natural language supervision is sub-optimal for the MLLM’s visual understanding ability. To that end, we propose OLA-VLM, the first approach distilling knowledge into the LLM’s hidden representations from a set of target visual representations. Firstly, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction. Secondly, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization. Thirdly, we demonstrate that our OLA-VLM outperforms the single and multi-encoder baselines, proving our approach’s superiority over explicitly feeding the corresponding features to the LLM. Particularly, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our code is open-sourced at this https URL .
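The coupled objective mentioned in the abstract can be pictured as a weighted sum of a next-token loss and an embedding-prediction loss. The sketch below is an assumption-laden illustration: the cosine-distance form of the embedding term, the `weight` parameter, and all function names are hypothetical, and the abstract does not state the actual loss used by OLA-VLM.

```python
import math


def next_token_loss(probs, target_index):
    """Cross-entropy for one position: negative log-probability of the target."""
    return -math.log(probs[target_index])


def embedding_loss(predicted, target):
    """One minus cosine similarity (an assumed choice of distance)."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)


def coupled_loss(probs, target_index, predicted, target, weight=0.5):
    """Hypothetical coupled objective: text loss plus weighted embedding loss."""
    return next_token_loss(probs, target_index) + weight * embedding_loss(predicted, target)


# When the predicted embedding matches the target, only the text term remains.
loss = coupled_loss([0.1, 0.7, 0.2], 1, [1.0, 0.0], [1.0, 0.0])
print(round(loss, 4))
```

The design idea is that the embedding term gives the intermediate representations a vision-side training signal that natural language supervision alone does not provide.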

[CV-25] Neptune: The Long Orbit to Benchmarking Long Video Understanding

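[Quick read]: This paper addresses two problems: existing video datasets and models focus mainly on short clips (10 to 30 seconds), and the long video datasets that do exist are usually manually annotated at high cost and can often be solved by powerful image models applied frame by frame. The key to the solution is a semi-automatic dataset creation pipeline that uses large models (vision-language models (VLMs) and large language models (LLMs)) to automatically generate dense, time-aligned video captions and challenging question-answer-decoy sets for video segments of up to 15 minutes. The resulting dataset, Neptune, covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. The paper also proposes GEM, a new open-source model-based metric for scoring open-ended question answering responses. With these contributions, the paper aims to spur the development of more advanced models that understand long videos.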

Link: https://arxiv.org/abs/2412.09582
Authors: Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand
Keywords: paper describes, describes a semi-automatic, video, long video, long
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at this https URL

[CV-26] FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

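[Quick read]: This paper addresses the difficulty of obtaining accurate camera extrinsics and intrinsics when performing high-quality 3D reconstruction from sparse-view images. The key to the solution is FreeSplatter, a feed-forward framework built on a streamlined transformer architecture whose sequential self-attention blocks exchange information among multi-view image tokens and decode them into pixel-wise 3D Gaussian primitives. The predicted primitives are placed in a unified reference frame, enabling high-fidelity 3D modeling and fast camera parameter estimation with off-the-shelf solvers. FreeSplatter outperforms state-of-the-art baselines in reconstruction quality and pose estimation accuracy, and shows potential for improving the productivity of downstream applications such as text/image-to-3D content creation.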

Link: https://arxiv.org/abs/2412.09573
Authors: Jiale Xu, Shenghua Gao, Ying Shan
Keywords: Existing sparse-view reconstruction, Existing sparse-view, models heavily rely, heavily rely, rely on accurate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:Existing sparse-view reconstruction models heavily rely on accurate known camera poses. However, deriving camera extrinsics and intrinsics from sparse-view images presents significant challenges. In this work, we present FreeSplatter, a highly scalable, feed-forward reconstruction framework capable of generating high-quality 3D Gaussians from uncalibrated sparse-view images and recovering their camera parameters in mere seconds. FreeSplatter is built upon a streamlined transformer architecture, comprising sequential self-attention blocks that facilitate information exchange among multi-view image tokens and decode them into pixel-wise 3D Gaussian primitives. The predicted Gaussian primitives are situated in a unified reference frame, allowing for high-fidelity 3D modeling and instant camera parameter estimation using off-the-shelf solvers. To cater to both object-centric and scene-level reconstruction, we train two model variants of FreeSplatter on extensive datasets. In both scenarios, FreeSplatter outperforms state-of-the-art baselines in terms of reconstruction quality and pose estimation accuracy. Furthermore, we showcase FreeSplatter’s potential in enhancing the productivity of downstream applications, such as text/image-to-3D content creation.

[CV-27] Video Creation by Demonstration

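[Quick read]: This paper addresses the generation of a physically plausible video that, given a demonstration video and a context image from a different scene, continues naturally from the context image and carries out the action concepts of the demonstration. The key to the solution is δ-Diffusion, a self-supervised training approach that learns from unlabeled videos through conditional future frame prediction. Unlike most existing video generation controls, which rely on explicit signals, δ-Diffusion uses implicit latent control for maximal flexibility and expressiveness. It places an appearance bottleneck design on top of a video foundation model to extract action latents from demonstration videos, which condition the generation process with minimal appearance leakage. Experiments show that δ-Diffusion outperforms related baselines in both human preference and large-scale machine evaluations, demonstrating potential for interactive world simulation.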

Link: https://arxiv.org/abs/2412.09551
Authors: Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu
Keywords: video creation experience, creation experience, video creation, creation, video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL


Abstract:We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present δ-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, δ-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential for interactive world simulation. Sampled video generation results are available at this https URL.

[CV-28] Exemplar Masking for Multimodal Incremental Learning

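[Quick read]: This paper addresses two main challenges in multimodal incremental learning: the large storage size of multimodal data in exemplar-based methods, and the computational cost of fine-tuning large multimodal models. The key to the solution is to adopt a parameter-efficient tuning scheme to reduce the fine-tuning burden and to propose an exemplar masking framework for efficiently replaying old knowledge. Specifically, non-important tokens are masked according to attention weights and the correlation across modalities, which significantly reduces the storage size of each exemplar so that more exemplars fit in the same memory buffer. A multimodal data augmentation technique further diversifies the exemplars used for replay. Experiments show that the method is more efficient and more robust to catastrophic forgetting under the same limited memory buffer.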

Link: https://arxiv.org/abs/2412.09549
Authors: Yi-Lun Lee, Chen-Yu Lee, Wei-Chen Chiu, Yi-Hsuan Tsai
Keywords: previously learned information, Multimodal incremental learning, incremental learning, concurrently learning, learned information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:Multimodal incremental learning needs to digest the information from multiple modalities while concurrently learning new knowledge without forgetting the previously learned information. There are numerous challenges for this task, mainly including the larger storage size of multimodal data in exemplar-based methods and the computational requirement of finetuning on huge multimodal models. In this paper, we leverage the parameter-efficient tuning scheme to reduce the burden of fine-tuning and propose the exemplar masking framework to efficiently replay old knowledge. Specifically, the non-important tokens are masked based on the attention weights and the correlation across different modalities, significantly reducing the storage size of an exemplar and consequently saving more exemplars under the same memory buffer. Moreover, we design a multimodal data augmentation technique to diversify exemplars for replaying prior knowledge. In experiments, we not only evaluate our method in existing multimodal datasets but also extend the ImageNet-R dataset to a multimodal dataset as a real-world application, where captions are generated by querying multimodal large language models (e.g., InstructBLIP). Extensive experiments show that our exemplar masking framework is more efficient and robust to catastrophic forgetting under the same limited memory buffer. Code is available at this https URL.
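The storage saving from exemplar masking can be illustrated by keeping only the highest-attention tokens of an exemplar together with their positions. This is a minimal sketch that ranks by attention weight alone; the cross-modal correlation term from the abstract is omitted, and the function name `mask_exemplar` and the `keep_ratio` parameter are hypothetical.

```python
def mask_exemplar(tokens, attention, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by attention weight.

    Returns (position, token) pairs in original order so the exemplar can
    be restored with mask placeholders at the dropped positions.
    Hypothetical helper, not the paper's implementation.
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    keep = max(1, int(len(tokens) * keep_ratio))
    # Rank positions from most to least attended (stable for ties).
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    kept_positions = sorted(ranked[:keep])
    return [(i, tokens[i]) for i in kept_positions]


tokens = ["a", "photo", "of", "a", "red", "fox"]
attention = [0.05, 0.20, 0.05, 0.05, 0.25, 0.40]
stored = mask_exemplar(tokens, attention, keep_ratio=0.5)
print(stored)  # [(1, 'photo'), (4, 'red'), (5, 'fox')]
```

With `keep_ratio=0.5`, each exemplar takes roughly half the storage, so about twice as many exemplars fit in the same buffer.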

[CV-29] Meshtron: High-Fidelity Artist-Like 3D Mesh Generation at Scale

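[Quick read]: This paper addresses two key problems in generating high-quality 3D surface meshes: current methods are limited in face count and vertex coordinate resolution, which prevents them from producing high-quality meshes of complex 3D objects, and generation is costly in computation and memory. The key to the solution is Meshtron, an autoregressive mesh generation model built on four core components: (1) an hourglass neural architecture, (2) truncated sequence training, (3) sliding window inference, and (4) a robust sampling strategy that enforces the order of mesh sequences. These allow Meshtron to generate meshes with up to 64K faces at 1024-level coordinate resolution, over an order of magnitude more faces and 8× higher coordinate resolution than existing methods, while using over 50% less training memory and achieving 2.5× faster throughput. The generated meshes approach the detail and fidelity of work by professional artists.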

Link: https://arxiv.org/abs/2412.09548
Authors: Zekun Hao, David W. Romero, Tsung-Yi Lin, Ming-Yu Liu
Keywords: fundamental representations, Meshes, coordinate resolution, high-quality meshes, creating high-quality meshes
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:Meshes are fundamental representations of 3D surfaces. However, creating high-quality meshes is a labor-intensive task that requires significant time and expertise in 3D modeling. While a delicate object often requires over 10^4 faces to be accurately modeled, recent attempts at generating artist-like meshes are limited to 1.6K faces and heavy discretization of vertex coordinates. Hence, scaling both the maximum face count and vertex coordinate resolution is crucial to producing high-quality meshes of realistic, complex 3D objects. We present Meshtron, a novel autoregressive mesh generation model able to generate meshes with up to 64K faces at 1024-level coordinate resolution, over an order of magnitude higher face count and 8× higher coordinate resolution than current state-of-the-art methods. Meshtron’s scalability is driven by four key components: (1) an hourglass neural architecture, (2) truncated sequence training, (3) sliding window inference, (4) a robust sampling strategy that enforces the order of mesh sequences. This results in over 50% less training memory, 2.5× faster throughput, and better consistency than existing works. Meshtron generates meshes of detailed, complex 3D objects at unprecedented levels of resolution and fidelity, closely resembling those created by professional artists, and opening the door to more realistic generation of detailed 3D assets for animation, gaming, and virtual environments.
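To give a feel for the scale involved, the sketch below quantizes a vertex coordinate into 1024 bins and estimates the token sequence length of a mesh. It assumes triangle faces, one token per quantized coordinate, and coordinates normalized to [-1, 1]; these are common conventions for autoregressive mesh models but are assumptions here, as are the function names.

```python
def quantize(value, levels=1024, low=-1.0, high=1.0):
    """Map a coordinate in [low, high] to an integer bin in [0, levels - 1]."""
    if not low <= value <= high:
        raise ValueError("value outside the normalized range")
    scaled = (value - low) / (high - low)
    # Clamp so that value == high falls into the last bin.
    return min(levels - 1, int(scaled * levels))


def sequence_length(num_faces, vertices_per_face=3, coords_per_vertex=3):
    """Token count if every quantized coordinate is one token (an assumption)."""
    return num_faces * vertices_per_face * coords_per_vertex


print(quantize(-1.0), quantize(0.0), quantize(1.0))  # 0 512 1023
print(sequence_length(64_000))  # 576000
```

Under these assumptions a 64K-face mesh corresponds to a sequence of several hundred thousand tokens, which suggests why training on truncated sequences and inferring with a sliding window matter for scalability.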

[CV-30] SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing

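[Quick read]: This paper addresses the generation of simulation-ready clothed 3D human avatars from a text prompt. Current text-driven methods either model hair, clothing, and the human body with a unified geometry, or produce hair and garments that are hard to adapt to existing simulation pipelines. The key to the solution is a two-stage framework that combines the flexibility of 3D Gaussians with simulation-ready hair strands and garment meshes. Three text-conditioned 3D generative models first generate the garment mesh, body shape, and hair strands; 3D Gaussians are then attached to them to learn the avatar appearance through optimization, and physics simulators drive the motion of the garments and hair. The result is a simulation-ready 3D avatar with vivid texture and realistic dynamic motion.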

Link: https://arxiv.org/abs/2412.09545
Authors: Xueting Li, Ye Yuan, Shalini De Mello, Gilles Daviet, Jonathan Leaf, Miles Macklin, Jan Kautz, Umar Iqbal
Keywords: hair strands, introduce SimAvatar, hair, generate simulation-ready clothed, garment
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project website: this https URL


Abstract:We introduce SimAvatar, a framework designed to generate simulation-ready clothed 3D human avatars from a text prompt. Current text-driven human avatar generation methods either model hair, clothing, and the human body using a unified geometry or produce hair and garments that are not easily adaptable for simulation within existing simulation pipelines. The primary challenge lies in representing the hair and garment geometry in a way that allows leveraging established prior knowledge from foundational image diffusion models (e.g., Stable Diffusion) while being simulation-ready using either physics or neural simulators. To address this task, we propose a two-stage framework that combines the flexibility of 3D Gaussians with simulation-ready hair strands and garment meshes. Specifically, we first employ three text-conditioned 3D generative models to generate garment mesh, body shape and hair strands from the given text prompt. To leverage prior knowledge from foundational diffusion models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair strands and learn the avatar appearance through optimization. To drive the avatar given a pose sequence, we first apply physics simulators onto the garment meshes and hair strands. We then transfer the motion onto 3D Gaussians through carefully designed mechanisms for each body part. As a result, our synthesized avatars have vivid texture and realistic dynamic motion. To the best of our knowledge, our method is the first to produce highly realistic, fully simulation-ready 3D avatars, surpassing the capabilities of current approaches.

[CV-31] Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

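[Quick read]: This paper addresses the lack of high-quality video-text datasets for video understanding and the inefficiency of existing video language models (VideoLLMs) in handling the complexity of long videos. The key to the solution is a large-scale synthetic dataset created from proprietary models with carefully designed prompts covering a wide range of questions. The paper also proposes a dynamic visual token compression architecture that balances computational efficiency and performance. With these, the proposed model achieves state-of-the-art results across various video tasks and generalizes well to multi-image understanding.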

Link: https://arxiv.org/abs/2412.09530
Authors: Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
Keywords: Large Vision-Language Models, rapidly evolving field, application of Large, Large Vision-Language, evolving field
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we’ve seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at this https URL
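One way to picture dynamic visual token compression is a fixed token budget shared across frames: short clips keep many tokens per frame, long clips keep few. The sketch below is a guess at the general idea, not the Dynamic-VLM architecture. The budget rule, the average pooling, and the names `tokens_per_frame` and `pool_tokens` are all assumptions, and token features are reduced to single numbers for brevity.

```python
def tokens_per_frame(num_frames, budget, max_per_frame):
    """Split a total token budget evenly across frames, within a per-frame cap."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return max(1, min(max_per_frame, budget // num_frames))


def pool_tokens(frame_tokens, target):
    """Average-pool a frame's scalar token features down to `target` tokens."""
    n = len(frame_tokens)
    if not 1 <= target <= n:
        raise ValueError("target must be between 1 and the token count")
    pooled = []
    for k in range(target):
        # Contiguous, near-equal chunks that cover all n tokens.
        chunk = frame_tokens[k * n // target:(k + 1) * n // target]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


print(tokens_per_frame(8, budget=1024, max_per_frame=256))   # 128
print(tokens_per_frame(64, budget=1024, max_per_frame=256))  # 16
print(pool_tokens([1, 2, 3, 4, 5, 6, 7, 8], target=4))       # [1.5, 3.5, 5.5, 7.5]
```

The trade-off this exposes is the one named in the abstract: fewer tokens per frame lowers compute for long videos at the cost of spatial detail.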

[CV-32] Can Modern LLMs Act as Agent Cores in Radiology Environments?

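[Quick read]: This paper addresses the core question: "Can modern large language models (LLMs) act as agent cores in radiology environments?" To answer it, the paper introduces RadABench, with three contributions: (1) RadABench-Data, a comprehensive synthetic evaluation dataset covering 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks; (2) RadABench-EvalPlat, a novel evaluation platform with a prompt-driven workflow and the ability to simulate a wide range of radiology toolsets; (3) an assessment of 7 leading LLMs from 5 perspectives with multiple metrics. The results indicate that although current LLMs perform strongly in many areas, they are not yet advanced enough to serve as the central agent core of a fully operational radiology agent system. The paper also identifies key factors that influence the performance of LLM-based agent cores, offering insights for clinicians on applying agent systems effectively in real-world radiology practice.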

Link: https://arxiv.org/abs/2412.09529
Authors: Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie
Keywords: large language models, offer enhanced accuracy, Advancements in large, language models, large language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 7 figures


Abstract:Advancements in large language models (LLMs) have paved the way for LLM-based agent systems that offer enhanced accuracy and interpretability across various domains. Radiology, with its complex analytical requirements, is an ideal field for the application of these agents. This paper aims to investigate the prerequisite question for building concrete radiology agents, which is, "Can modern LLMs act as agent cores in radiology environments?" To investigate it, we introduce RadABench with three-fold contributions: First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy encompassing 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow and the capability to simulate a wide range of radiology toolsets. Third, we assess the performance of 7 leading LLMs on our benchmark from 5 perspectives with multiple metrics. Our findings indicate that while current LLMs demonstrate strong capabilities in many areas, they are still not sufficiently advanced to serve as the central agent core in a fully operational radiology agent system. Additionally, we identify key factors influencing the performance of LLM-based agent cores, offering insights for clinicians on how to apply agent systems in real-world radiology practices effectively. All of our code and data are open-sourced in this https URL.

[CV-33] Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Clinical Pathology Analysis

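[Quick read]: This paper addresses redundant feature extraction by traditional vision models in pathological diagnosis, as well as the limited efficiency and accuracy of existing large vision-language models (LVLMs) caused by input resolution constraints. The key to the solution is two strategies: mixed task-guided feature enhancement, which uses multi-scale analysis to focus on lesion-related details; and prompt-guided detail feature completion, which integrates coarse- and fine-grained features based on specific prompts, improving diagnostic efficiency and accuracy without sacrificing inference speed. Using a comprehensive dataset of 490,000 samples, the authors train OmniPath, a pathology-specialized LVLM; according to the paper, the model significantly outperforms existing methods in diagnostic accuracy and efficiency.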

Link: https://arxiv.org/abs/2412.09521
Authors: Shengxuming Zhang,Weihan Li,Tianhong Gao,Jiacong Hu,Haoming Luo,Mingli Song,Xiuming Zhang,Zunlei Feng
Keywords: determining disease characteristics, guiding treatment, disease characteristics, assessing prognosis, relying heavily
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pathological diagnosis is vital for determining disease characteristics, guiding treatment, and assessing prognosis, relying heavily on detailed, multi-scale analysis of high-resolution whole slide images (WSI). However, traditional pure vision models face challenges of redundant feature extraction, whereas existing large vision-language models (LVLMs) are limited by input resolution constraints, hindering their efficiency and accuracy. To overcome these issues, we propose two innovative strategies: the mixed task-guided feature enhancement, which directs feature extraction toward lesion-related details across scales, and the prompt-guided detail feature completion, which integrates coarse- and fine-grained features from WSI based on specific prompts without compromising inference speed. Leveraging a comprehensive dataset of 490,000 samples from diverse pathology tasks (including cancer detection, grading, vascular and neural invasion identification, and so on), we trained the pathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that this model significantly outperforms existing methods in diagnostic accuracy and efficiency, offering an interactive, clinically aligned approach for auxiliary diagnosis in a wide range of pathology applications.

[CV-34] Agent-based Video Trimming

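[Quick read]: This paper addresses the burden of sifting through information as user-generated videos grow longer, and proposes a new task, Video Trimming (VT), which aims to efficiently extract key video information and produce a coherent story. The key to the solution is Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, a Video Captioning Agent converts video slices into structured textual descriptions, a Filtering Module dynamically discards low-quality footage, and a Video Arrangement Agent selects and arranges valid clips into a coherent final video. The method is rated more favorably in user studies and achieves higher mAP and precision on the highlight detection task across several datasets.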

Link: https://arxiv.org/abs/2412.09513
Authors: Lingfeng Yang,Zhenyuan Chen,Xiang Li,Peiyang Jia,Liangqu Long,Jian Yang
Keywords: video, Video Trimming, increasing in length, placing a burden, Agent-based Video Trimming
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As information becomes more accessible, user-generated videos are increasing in length, placing a burden on viewers to sift through vast content for valuable insights. This trend underscores the need for an algorithm to extract key video information efficiently. Despite significant advancements in highlight detection, moment retrieval, and video summarization, current approaches primarily focus on selecting specific time intervals, often overlooking the relevance between segments and the potential for segment arranging. In this paper, we introduce a novel task called Video Trimming (VT), which focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. To address this task, we propose Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, we employ a Video Captioning Agent to convert video slices into structured textual descriptions, a Filtering Module to dynamically discard low-quality footage based on the structured information of each clip, and a Video Arrangement Agent to select and compile valid clips into a coherent final narrative. For evaluation, we develop a Video Evaluation Agent to assess trimmed videos, conducting assessments in parallel with human evaluations. Additionally, we curate a new benchmark dataset for video trimming using raw user videos from the internet. As a result, AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task. The code and models are available at this https URL.
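
The three phases described in the abstract can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the `Clip` fields, the numeric `quality` and `highlight` scores, and all function names are hypothetical stand-ins for what the Video Captioning Agent, Filtering Module, and Video Arrangement Agent would produce.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start: float      # clip start time in seconds
    end: float        # clip end time in seconds
    caption: str      # stand-in for the structured text description
    quality: float    # hypothetical score in [0, 1]; low means wasted footage
    highlight: float  # hypothetical score in [0, 1]; high means valuable


def structure_video(raw_clips):
    """Phase 1, Video Structuring: wrap raw tuples as structured clips.

    A real system would call a captioning model here; this stub only
    packages values that are assumed to be already available.
    """
    return [Clip(*fields) for fields in raw_clips]


def filter_clips(clips, min_quality=0.5):
    """Phase 2, Clip Filtering: drop clips whose quality is too low."""
    return [c for c in clips if c.quality >= min_quality]


def compose_story(clips, max_clips=3):
    """Phase 3, Story Composition: keep the most valuable clips.

    Assumption: the final cut is ordered chronologically. The paper's
    arrangement agent may order clips differently.
    """
    chosen = sorted(clips, key=lambda c: c.highlight, reverse=True)[:max_clips]
    return sorted(chosen, key=lambda c: c.start)


def trim(raw_clips, min_quality=0.5, max_clips=3):
    """Run the three phases in order and return the selected clips."""
    structured = structure_video(raw_clips)
    return compose_story(filter_clips(structured, min_quality), max_clips)


raw = [
    (0, 5, "intro", 0.9, 0.2),
    (5, 10, "shaky pan", 0.1, 0.9),
    (10, 15, "jump", 0.8, 0.95),
    (15, 20, "crowd", 0.7, 0.6),
]
print([c.caption for c in trim(raw, max_clips=2)])
```

Filtering runs before selection, so a clip with a high highlight score but poor quality (the "shaky pan" above) never reaches the composition phase.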

[CV-35] GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency

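[Quick read]: This paper addresses the limited generalization and robustness of existing 3D affordance learning methods, which stem from scarce annotated data and a reliance on 3D backbones focused on geometric encoding that lack resilience to real-world noise and data corruption. The key to the solution is GEAL, a framework that leverages large-scale pre-trained 2D models to improve the generalization and robustness of 3D affordance learning. Specifically, GEAL uses a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, producing realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, so that the 3D branch benefits from the rich semantics and generalization capacity of 2D models. The paper also introduces two new corruption-based benchmarks, PIAD-C and LASO-C; on public datasets and these benchmarks, GEAL outperforms existing methods, showing robust and adaptable affordance prediction under diverse conditions.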

Link: https://arxiv.org/abs/2412.09511
Authors: Dongyue Lu,Lingdong Kong,Tianxin Huang,Gim Hee Lee
Keywords: Identifying affordance regions, Identifying affordance, human-machine interaction, cues is essential, essential for robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 8 figures, 12 tables; Project Page at this https URL

Abstract:Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction. However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models. To holistically assess the robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions. Code and corruption datasets have been made publicly available.

[CV-36] Vision Transformers for Efficient Indoor Pathloss Radio Map Prediction

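[Quick read]: This paper addresses indoor pathloss radio map prediction. The key to the solution is a neural network based on Vision Transformers (ViTs), whose generalization ability is improved through extensive data augmentation and pretrained DINOv2 weights. The method achieves promising results across a range of challenging scenarios, including unseen buildings, different frequencies, and antennas with varying radiation patterns.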

Link: https://arxiv.org/abs/2412.09507
Authors: Edvard Ghukasyan,Hrant Khachatrian,Rafayel Mkrtchyan,Theofanis P. Raptis
Keywords: Vision Transformers, demonstrated remarkable success, success in achieving, demonstrated remarkable, remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Work partly supported by the RA Science Committee grant No. 22rl-052 (DISTAL) and the EU under Italian National Recovery and Resilience Plan of NextGenerationEU on “Telecommunications of the Future” (PE00000001 - program “RESTART”)

Abstract:Vision Transformers (ViTs) have demonstrated remarkable success in achieving state-of-the-art performance across various image-based tasks and beyond. In this study, we employ a ViT-based neural network to address the problem of indoor pathloss radio map prediction. The network’s generalization ability is evaluated across diverse settings, including unseen buildings, frequencies, and antennas with varying radiation patterns. By leveraging extensive data augmentation techniques and pretrained DINOv2 weights, we achieve promising results, even under the most challenging scenarios.

[CV-37] Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

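[Quick read]: This paper addresses the shortcomings of multi-modal large language models (MLLMs) in handling speech, in particular the insufficient integration of speech with other modalities such as vision and language. The key to the solution is Lyra, which achieves efficient speech processing and stronger multi-modal abilities through three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA (Low-Rank Adaptation) to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities and improve model performance; (3) constructing a high-quality, extensive dataset of 1.5M multi-modal data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. With these strategies, Lyra reaches state-of-the-art performance on vision-language, vision-speech, and speech-language benchmarks while using fewer computational resources and less training data.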

Link: https://arxiv.org/abs/2412.09501
Authors: Zhisheng Zhong,Chengyao Wang,Yuqi Liu,Senqiao Yang,Longxiang Tang,Yuechen Zhang,Jingyao Li,Tianyuan Qu,Yanwei Li,Yukang Chen,Shaozuo Yu,Sitong Wu,Eric Lo,Shu Liu,Jiaya Jia
Keywords: Multi-modal Large Language, Large Language Models, expanding beyond single-domain, essential to meet, meet the demands
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Tech report

Abstract:As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
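
Strategy (1) builds on LoRA, which leaves a pretrained weight matrix frozen and learns only a low-rank update. The sketch below shows the standard LoRA forward pass in plain Python. It is not Lyra's multi-modality variant, whose details the abstract does not give; keeping one `(A, B)` pair per modality is an assumption made purely for illustration, as are all names and values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Standard LoRA: y = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in weight, A is r x d_in, B is d_out x r,
    and r is the rank of the update.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight
adapters = {                  # hypothetical: one adapter pair per modality
    "speech": ([[1.0, 1.0]], [[1.0], [2.0]]),
    "vision": ([[1.0, 0.0]], [[0.0], [0.0]]),  # B = 0, so no change yet
}

x = [1.0, 2.0]
for modality, (A, B) in adapters.items():
    print(modality, lora_forward(W, A, B, x, alpha=2.0))
```

With a rank `r` much smaller than the layer width, only the entries of `A` and `B` are trained, which is where the reduction in training cost comes from.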

[CV-38] Video Seal: Open and Efficient Video Watermarking KR

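[Quick read]: This paper addresses the content moderation challenges that the proliferation of AI-generated content and sophisticated video editing tools poses for digital platforms. The key solution is neural video watermarking, which embeds imperceptible signals into videos so that they can be identified. The proposed Video Seal framework jointly trains an embedder and an extractor, applying transformations such as video codecs between the two to make the watermark robust. The paper also introduces temporal watermark propagation, a technique that efficiently converts an image watermarking model into a video watermarking model without watermarking every high-resolution frame. Experiments show that Video Seal performs well in speed, imperceptibility, and robustness, especially under challenging distortions that combine geometric transformations and video compression. The codebase, models, and a public demo are open-sourced to foster further research and development in the field.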

Link: https://arxiv.org/abs/2412.09492
Authors: Pierre Fernandez,Hady Elsahar,I. Zeki Yalniz,Alexandre Mourachko
Keywords: moderate digital platforms, sophisticated video editing, video editing tools, digital platforms, proliferation of AI-generated
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at this https URL

Abstract:The proliferation of AI-generated content and sophisticated video editing tools has made it both important and challenging to moderate digital platforms. Video watermarking addresses these challenges by embedding imperceptible signals into videos, allowing for identification. However, the rare open tools and methods often fall short on efficiency, robustness, and flexibility. To reduce these gaps, this paper introduces Video Seal, a comprehensive framework for neural video watermarking and a competitive open-sourced model. Our approach jointly trains an embedder and an extractor, while ensuring the watermark robustness by applying transformations in-between, e.g., video codecs. This training is multistage and includes image pre-training, hybrid post-training and extractor fine-tuning. We also introduce temporal watermark propagation, a technique to convert any image watermarking model to an efficient video watermarking model without the need to watermark every high-resolution frame. We present experimental results demonstrating the effectiveness of the approach in terms of speed, imperceptibility, and robustness. Video Seal achieves higher robustness compared to strong baselines especially under challenging distortions combining geometric transformations and video compression. Additionally, we provide new insights such as the impact of video compression during training, and how to compare methods operating on different payloads. Contributions in this work - including the codebase, models, and a public demo - are open-sourced under permissive licenses to foster further research and development in the field.
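
One way to picture temporal watermark propagation is to run the image embedder only on every `step`-th frame and reuse its additive watermark signal on the frames in between. The sketch below is a toy under that assumption. `embed_residual` is a hypothetical placeholder for a learned embedder, frames are flat lists of pixel values of equal length, and the exact propagation rule used by Video Seal may differ.

```python
def embed_residual(frame, message):
    """Hypothetical image embedder: returns an additive residual.

    This toy version ignores the frame content and tiles +1 / -1 values
    from the message bits; a learned embedder would depend on the frame.
    """
    return [1 if message[i % len(message)] else -1 for i in range(len(frame))]


def watermark_video(frames, message, step=4):
    """Watermark every `step`-th frame and propagate the residual.

    Returns the watermarked frames and the number of embedder calls.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    watermarked, calls, residual = [], 0, None
    for t, frame in enumerate(frames):
        if t % step == 0:
            residual = embed_residual(frame, message)  # key frame only
            calls += 1
        # Frames between key frames reuse the most recent residual.
        watermarked.append([p + r for p, r in zip(frame, residual)])
    return watermarked, calls


frames = [[10, 10, 10, 10] for _ in range(6)]
video, calls = watermark_video(frames, message=[1, 0], step=3)
print(calls, video[4])
```

In this toy, the number of embedder calls drops from one per frame to roughly one per `step` frames, which is the efficiency argument the abstract makes.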

[CV-39] New keypoint-based approach for recognising British Sign Language (BSL) from sequences ICCV

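[Quick read]: This paper addresses the recognition of British Sign Language (BSL) words within continuous signing sequences. The key to the solution is a keypoint-based classification model. By operating on keypoints extracted from the signing, the model improves on its RGB-based counterpart in computational efficiency and memory usage, trains faster, and requires fewer computational resources. According to the authors, this is the first application of a keypoint-based model to BSL word classification, so direct comparisons with existing work are not available.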

Link: https://arxiv.org/abs/2412.09475
Authors: Oishi Deb,KR Prajwal,Andrew Zisserman
Keywords: British Sign Language, recognise British Sign, Sign Language, continuous signing sequences, British Sign
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: International Conference on Computer Vision (ICCV) - HANDS Workshop

Abstract:In this paper, we present a novel keypoint-based classification model designed to recognise British Sign Language (BSL) words within continuous signing sequences. Our model’s performance is assessed using the BOBSL dataset, revealing that the keypoint-based approach surpasses its RGB-based counterpart in computational efficiency and memory usage. Furthermore, it offers expedited training times and demands fewer computational resources. To the best of our knowledge, this is the inaugural application of a keypoint-based model for BSL word classification, rendering direct comparisons with existing works unavailable.

[CV-40] OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs

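[Quick read]: This paper addresses the high computational cost and limited flexibility of diffusion and flow-based generative models in image restoration. Existing methods either need many sampling steps to generate high-quality images, which incurs significant computational overhead, or rely on model distillation, which usually fixes the fidelity-realism trade-off and so lacks flexibility. The proposed OFTSR is a flow-based framework for one-step image super-resolution. The key is to train a conditional flow-based super-resolution model as the teacher and then distill it under a specialized constraint, so that the one-step student's predictions lie on the teacher's sampling ODE trajectory. This alignment ensures that the student's single-step predictions from initial states match the teacher's predictions from a closer intermediate state, which enables one-step super-resolution with a tunable fidelity-realism trade-off.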

Link: https://arxiv.org/abs/2412.09465
Authors: Yuanzhi Zhu,Ruiqing Wang,Shilin Lu,Junnan Li,Hanshu Yan,Kai Zhang
Keywords: achieving superior perceptual, deep learning approaches, demonstrated remarkable success, superior perceptual quality, perceptual quality compared
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in diffusion and flow-based generative models have demonstrated remarkable success in image restoration tasks, achieving superior perceptual quality compared to traditional deep learning approaches. However, these methods either require numerous sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on model distillation, which usually imposes a fixed fidelity-realism trade-off and thus lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as a teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions from our one-step student model for the same input to lie on the same sampling ODE trajectory of the teacher model. This alignment ensures that the student model’s single-step predictions from initial states match the teacher’s predictions from a closer intermediate state. Through extensive experiments on challenging datasets including FFHQ (256×256), DIV2K, and ImageNet (256×256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off. Code and pre-trained models are available at this https URL and this https URL, respectively.

[CV-41] ATPrompt: Textual Prompt Learning with Embedded Attributes

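[Quick read]: This paper addresses a limitation of existing textual-based prompt learning methods: they align the image and text (category) spaces only for predefined known categories and cannot handle unknown ones. The key to the solution is the Attribute-embedded Textual Prompt learning method (ATPrompt), which incorporates multiple universal attribute tokens into the learnable soft prompts. This expands the learning space from the one-dimensional category level to the multi-dimensional attribute level and turns the text prompt from a category-centric form into an attribute-category hybrid form. The paper also proposes a differentiable attribute search method that selects representative and suitable attributes from a candidate pool summarized by a large language model, which finalizes the attributes for downstream tasks. As an easy-to-use plug-in technique, ATPrompt offers general improvements to existing textual-based prompt learning methods at a negligible computational cost.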

Link: https://arxiv.org/abs/2412.09442
Authors: Zheng Li,Yibing Song,Penghai Zhao,Ming-Ming Cheng,Xiang Li,Jian Yang
Keywords: primarily employ multiple, hard class tokens, methods primarily employ, aiming to align, primarily employ
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. Project Page: this https URL

Abstract:Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
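
The change from a category-centric prompt to an attribute-category hybrid prompt can be illustrated by the token layout alone. In the sketch below, strings such as `<soft_0>` stand in for learnable soft-prompt vectors; the placeholder names, the number of soft tokens, and the exact ordering of attribute and class tokens are assumptions for illustration, not the paper's specification.

```python
def category_centric_prompt(class_name, n_soft=4):
    """Conventional layout: soft tokens followed by the hard class token."""
    return [f"<soft_{i}>" for i in range(n_soft)] + [class_name]


def attribute_hybrid_prompt(class_name, attributes, n_soft=2):
    """Hybrid layout: each attribute token gets its own soft tokens,
    followed by soft tokens for the class and the class token itself.
    """
    tokens = []
    for attr in attributes:
        tokens += [f"<soft_{attr}_{i}>" for i in range(n_soft)] + [attr]
    tokens += [f"<soft_cls_{i}>" for i in range(n_soft)] + [class_name]
    return tokens


print(category_centric_prompt("cat", n_soft=2))
print(attribute_hybrid_prompt("cat", ["color", "shape"], n_soft=1))
```

Because the attribute tokens are shared across classes, the soft tokens attached to them are not tied to any single known category, which is the intuition behind using attributes as a bridge to unknown categories.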

[CV-42] MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental Learning AAAI2025 AAAI25

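[Quick read]: This paper addresses catastrophic forgetting in Class-Incremental Learning (CIL), where a model forgets knowledge of old classes as it learns new ones. The key to the solution is MOdel Surgery (MOS), which trains task-specific adapters to continually adjust Pre-trained Models (PTMs) to downstream tasks. Specifically, MOS mitigates parameter-level forgetting with an adapter merging approach that bridges the gap between different components while preserving task-specific information. To address retrieval-level forgetting, it introduces a training-free self-refined adapter retrieval mechanism that leverages the model's inherent ability to retrieve better adapters during inference. By jointly rectifying the model with these steps, MOS resists catastrophic forgetting during learning, and experiments on seven benchmark datasets validate its state-of-the-art performance.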

Link: https://arxiv.org/abs/2412.09441
Authors: Hai-Long Sun,Da-Wei Zhou,Hanbin Zhao,Le Gan,De-Chuan Zhan,Han-Jia Ye
Keywords: CIL, model, forgetting, requires models, continually acquire knowledge
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2025. Code is available at: this https URL

Abstract:Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones. Although Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts. Existing work seeks to utilize lightweight components to adjust the PTM, while the forgetting phenomenon still comes from the parameter and retrieval levels. Specifically, iterative updates of the model result in parameter drift, while mistakenly retrieving irrelevant modules leads to a mismatch during inference. To this end, we propose MOdel Surgery (MOS) to rescue the model from forgetting previous knowledge. By training task-specific adapters, we continually adjust the PTM to downstream tasks. To mitigate parameter-level forgetting, we present an adapter merging approach to learn task-specific adapters, which aims to bridge the gap between different components while preserving task-specific information. Besides, to address retrieval-level forgetting, we introduce a training-free self-refined adapter retrieval mechanism during inference, which leverages the model’s inherent ability for better adapter retrieval. By jointly rectifying the model with those steps, MOS can robustly resist catastrophic forgetting in the learning process. Extensive experiments on seven benchmark datasets validate MOS’s state-of-the-art performance. Code is available at: this https URL

[CV-43] Towards Robust and Fair Vision Learning in Open-World Environments

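[Quick read]: This dissertation addresses fairness and robustness in vision learning and presents four key contributions. First, to address the need for large-scale data, it proposes a Fairness Domain Adaptation approach based on Bijective Maximum Likelihood and Fairness Adaptation Learning. Second, to enable open-world modeling in vision learning, it proposes an Open-world Fairness Continual Learning Framework that combines results from Fairness Continual Learning and Open-world Continual Learning. Third, to model features that are invariant across multiple camera views, it proposes a Geometry-based Cross-view Adaptation framework for learning robust cross-view feature representations. Finally, to handle large-scale video and multimodal data, it proposes Transformer-based approaches and a Domain Generalization approach to improve the robustness of visual foundation models on multimodal and temporal data. Together, these methods improve the fairness and robustness of vision learning systems.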

Link: https://arxiv.org/abs/2412.09439
Authors: Thanh-Dat Truong
Keywords: Fairness Continual Learning, Bijective Maximum Likelihood, Fairness Adaptation Learning, Open-world Fairness Continual, Open-world Continual Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: PhD Dissertation

Abstract:The dissertation presents four key contributions toward fairness and robustness in vision learning. First, to address the problem of large-scale data requirements, the dissertation presents a novel Fairness Domain Adaptation approach derived from two major novel research findings of Bijective Maximum Likelihood and Fairness Adaptation Learning. Second, to enable the capability of open-world modeling of vision learning, this dissertation presents a novel Open-world Fairness Continual Learning Framework. The success of this research direction is the result of two research lines, i.e., Fairness Continual Learning and Open-world Continual Learning. Third, since visual data are often captured from multiple camera views, robust vision learning methods should be capable of modeling invariant features across views. To achieve this desired goal, the research in this thesis will present a novel Geometry-based Cross-view Adaptation framework to learn robust feature representations across views. Finally, with the recent increase in large-scale videos and multimodal data, understanding the feature representations and improving the robustness of large-scale visual foundation models is critical. Therefore, this thesis will present novel Transformer-based approaches to improve the robust feature representations against multimodal and temporal data. Then, a novel Domain Generalization Approach will be presented to improve the robustness of visual foundation models. The research’s theoretical analysis and experimental results have shown the effectiveness of the proposed approaches, demonstrating their superior performance compared to prior studies. The contributions in this dissertation have advanced the fairness and robustness of machine vision learning.

[CV-44] Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

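[Quick read]: This paper addresses data scarcity, weak cross-modal alignment, and limited controllability in multimodal music generation. The key to the solution is a new method, Visuals Music Bridge (VMB), which uses explicit text and music bridges for multimodal alignment. Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enhance user control. Finally, an Explicitly Conditioned Music Generation framework generates music based on the two bridges. Experiments show that VMB significantly outperforms existing methods in music quality, modality alignment, and customization.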

Link: https://arxiv.org/abs/2412.09428
Authors: Baisen Wang,Le Zhuo,Zhaokai Wang,Chenxi Bao,Wu Chengjing,Xuecheng Nie,Jiao Dai,Jizhong Han,Yue Liao,Si Liu
Keywords: Multimodal music generation, Multimodal music, music, music generation, music generation aims
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application in multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge; a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks, along with experiments on controllability. The results demonstrate that VMB significantly enhances music quality, modality, and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at this https URL.

[CV-45] MultiEYE: Dataset and Benchmark for OCT-Enhanced Retinal Disease Recognition from Fundus Images

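[Quick read]: This paper addresses a limitation of existing multi-modal learning methods for fundus and OCT images: they usually require both modalities to be available and strictly paired for both training and testing, which is impractical in clinical settings. The paper therefore formulates a new setting, "OCT-enhanced disease recognition from fundus images", which allows unpaired multi-modal data during training and relies only on widely available fundus photographs at test time. The key to the solution is an OCT-assisted Conceptual Distillation Approach (OCT-CoDA), which uses semantically rich concepts to extract disease-related knowledge from OCT images and transfer it to the fundus model. This considerably improves diagnostic performance based on fundus images and makes the cross-modal knowledge transfer an explainable process.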

Link: https://arxiv.org/abs/2412.09402
Authors: Lehan Wang,Chongchong Qi,Chubin Ou,Lin An,Mei Jin,Xiangbin Kong,Xiaomeng Li
Keywords: Existing multi-modal learning, multi-modal learning methods, Existing multi-modal, Conceptual Distillation Approach, OCT images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE TMI

Abstract:Existing multi-modal learning methods on fundus and OCT images mostly require both modalities to be available and strictly paired for training and testing, which appears less practical in clinical scenarios. To expand the scope of clinical applications, we formulate a novel setting, “OCT-enhanced disease recognition from fundus images”, that allows for the use of unpaired multi-modal data during the training phase and relies on the widespread fundus photographs for testing. To benchmark this setting, we present the first large multi-modal multi-class dataset for eye disease diagnosis, MultiEYE, and propose an OCT-assisted Conceptual Distillation Approach (OCT-CoDA), which employs semantically rich concepts to extract disease-related knowledge from OCT images and leverage them into the fundus model. Specifically, we regard the image-concept relation as a link to distill useful knowledge from the OCT teacher model to the fundus student model, which considerably improves the diagnostic performance based on fundus images and formulates the cross-modal knowledge transfer into an explainable process. Through extensive experiments on the multi-disease classification task, our proposed OCT-CoDA demonstrates remarkable results and interpretability, showing great potential for clinical application. Our dataset and code are available at this https URL.

[CV-46] SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos

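[Quick read]: This paper addresses real-time, high-quality dense 3D reconstruction from monocular RGB video. The key to the solution is SLAM3R, an end-to-end system of feed-forward neural networks that regresses 3D pointmaps directly from RGB images and progressively aligns and deforms these local pointmaps into a globally consistent scene reconstruction. Unlike traditional pose optimization-based methods, SLAM3R does not explicitly solve for camera parameters, which simplifies the pipeline and improves real-time performance. Experiments across datasets show that it achieves state-of-the-art reconstruction accuracy and completeness while running in real time at 20+ FPS.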

Link: https://arxiv.org/abs/2412.09401
Authors: Yuzheng Liu,Siyan Dong,Shuzhe Wang,Yingda Yin,Yanchao Yang,Qingnan Fan,Baoquan Chen
Keywords: monocular RGB SLAM, RGB SLAM system, effective monocular RGB, RGB SLAM, high-quality dense
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce SLAM3R, a novel and effective monocular RGB SLAM system for real-time and high-quality dense 3D reconstruction. SLAM3R provides an end-to-end solution by seamlessly integrating local 3D reconstruction and global coordinate registration through feed-forward neural networks. Given an input video, the system first converts it into overlapping clips using a sliding window mechanism. Unlike traditional pose optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB images in each window and progressively aligns and deforms these local pointmaps to create a globally consistent scene reconstruction - all without explicitly solving any camera parameters. Experiments across datasets consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS. Code and weights at: this https URL.
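
The first step, splitting the input video into overlapping clips with a sliding window, is easy to sketch. The window length and stride below are illustrative values, not settings taken from the paper, and dropping trailing frames that do not fill a whole window is a simplification made here.

```python
def sliding_window_clips(num_frames, window=5, stride=2):
    """Return lists of frame indices, one list per overlapping clip."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride >= window:
        raise ValueError("stride must be smaller than window so clips overlap")
    clips = []
    start = 0
    while start + window <= num_frames:
        clips.append(list(range(start, start + window)))
        start += stride
    return clips


for clip in sliding_window_clips(9, window=5, stride=2):
    print(clip)
```

Consecutive clips share `window - stride` frames, which gives a downstream alignment step common content between neighboring windows.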

[CV-47] UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer

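[Quick read]: This paper addresses weak consistency and declining image quality over time in diffusion-based video generation models. The key to the solution is a non-invasive plug-in, the Uniform Frame Organizer (UFO), which is compatible with any diffusion-based video generation model. UFO consists of a series of adaptive adapters with adjustable intensities; it significantly improves the consistency between the foreground and background of videos and raises image quality without altering the original model's parameters. Its modular design supports combining multiple UFOs to customize personalized video generation models, and it transfers directly across different models of the same specification without specific retraining.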

Link: https://arxiv.org/abs/2412.09389
Authors: Delong Liu,Zhaohui Hou,Mingjie Zhan,Shihao Han,Zhicheng Zhao,Fei Su
Keywords: achieved significant success, diffusion-based video generation, video generation, Uniform Frame Organizer, video generation models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: this https URL

Abstract:Recently, diffusion-based video generation models have achieved significant success. However, existing models often suffer from issues like weak consistency and declining image quality over time. To overcome these challenges, inspired by aesthetic principles, we propose a non-invasive plug-in called Uniform Frame Organizer (UFO), which is compatible with any diffusion-based video generation model. The UFO comprises a series of adaptive adapters with adjustable intensities, which can significantly enhance the consistency between the foreground and background of videos and improve image quality without altering the original model parameters when integrated. The training for UFO is simple, efficient, requires minimal resources, and supports stylized training. Its modular design allows for the combination of multiple UFOs, enabling the customization of personalized video generation models. Furthermore, the UFO also supports direct transferability across different models of the same specification without the need for specific retraining. The experimental results indicate that UFO effectively enhances video generation quality and demonstrates its superiority in public video generation benchmarks. The code will be publicly available at this https URL.
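
An adapter with adjustable intensity can be thought of as a scaled residual added to a frozen layer's output. The sketch below assumes that additive form, which the abstract does not spell out; `base_layer`, the toy `smooth_adapter`, and the intensity values are hypothetical.

```python
def apply_with_adapters(base_layer, adapters, x):
    """Frozen base output plus intensity-scaled adapter outputs.

    `adapters` is a list of (adapter_fn, intensity) pairs, so several
    plug-ins can be combined; intensity 0 recovers the base model.
    """
    y = base_layer(x)
    for adapter_fn, intensity in adapters:
        delta = adapter_fn(x)
        y = [yi + intensity * di for yi, di in zip(y, delta)]
    return y


def base_layer(x):
    """Stand-in for a frozen layer of the video model."""
    return [2.0 * v for v in x]


def smooth_adapter(x):
    """Toy adapter: pulls each value toward the mean of the input."""
    mean = sum(x) / len(x)
    return [mean - v for v in x]


x = [1.0, 3.0]
print(apply_with_adapters(base_layer, [(smooth_adapter, 0.0)], x))
print(apply_with_adapters(base_layer, [(smooth_adapter, 0.5)], x))
```

Because the base layer is only called, never modified, the original model parameters stay untouched, and the intensity acts as a single knob for how strongly each plug-in contributes.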

[CV-48] All You Need in Knowledge Distillation Is a Tailored Coordinate System

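[Quick read]: This paper addresses the reliance of existing Knowledge Distillation (KD) methods on a large teacher trained specifically for the target task, which is both inflexible and inefficient. The proposed solution uses a Self-Supervised Learning (SSL) pretrained model as the teacher and captures its dark knowledge through the coordinate system, or linear subspace, in which its features lie. The key is the Tailored Coordinate System (TCS) method, which the paper describes as teacher-free: it needs only one forward pass of the teacher and then tailors the coordinate system for the student network. TCS applies to diverse architectures, supports cross-architecture distillation, and works well for few-shot learning. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods while requiring roughly half of their training time and GPU memory.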

链接: https://arxiv.org/abs/2412.09388
作者: Junjie Zhou,Ke Zhu,Jianxin Wu
关键词-EN: transferring dark knowledge, small student network, essential in transferring, dark knowledge, large teacher
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

Abstract: Knowledge Distillation (KD) is essential in transferring dark knowledge from a large teacher to a small student network, such that the student can be much more efficient than the teacher but with comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both very inflexible and inefficient. In this paper, we argue that an SSL-pretrained model can effectively act as the teacher and its dark knowledge can be captured by the coordinate system or linear subspace in which the features lie. We then need only one forward pass of the teacher, and then tailor the coordinate system (TCS) for the student network. Our TCS method is teacher-free and applies to diverse architectures, works well for KD and practical few-shot learning, and allows cross-architecture distillation with large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods, while only requiring roughly half of their training time and GPU memory costs.

[CV-49] DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

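[Quick read]: This paper addresses motion alignment in controllable human image animation under sparse control signals such as skeleton pose. In particular, when the body shape of the reference character differs significantly from that of the character in the driving video, the dense conditions introduced by existing methods (e.g., depth maps) degrade the quality of the generated video. The key to the solution is DisPose, which disentangles the sparse skeleton pose into motion field guidance and keypoint correspondence: it generates a dense motion field that provides region-level dense guidance while retaining the generalization of sparse pose control. In addition, diffusion features are extracted from the reference image and transferred to the target pose to provide distinct identity information. The paper also proposes a plug-and-play hybrid ControlNet that improves the quality and consistency of generated videos without changing the parameters of the existing model.

Link: https://arxiv.org/abs/2412.09349
Authors: Hongxiang Li,Yaowei Li,Yuhang Yang,Junjie Cao,Zhihong Zhu,Xuxin Cheng,Long Chen
Keywords (EN): Controllable human image, Controllable human, image animation aims, human image animation, reference image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: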

Abstract: Controllable human image animation aims to generate videos from reference images using driving videos. Due to the limited control signals provided by sparse guidance (e.g., skeleton pose), recent works have attempted to introduce additional dense conditions (e.g., depth map) to ensure motion alignment. However, such strict dense guidance impairs the quality of the generated video when the body shape of the reference character differs significantly from that of the driving video. In this paper, we present DisPose to mine more generalizable and effective control signals without additional dense input, which disentangles the sparse skeleton pose in human image animation into motion field guidance and keypoint correspondence. Specifically, we generate a dense motion field from a sparse motion field and the reference image, which provides region-level dense guidance while maintaining the generalization of the sparse pose control. We also extract diffusion features corresponding to pose keypoints from the reference image, and then these point features are transferred to the target pose to provide distinct identity information. To seamlessly integrate into existing models, we propose a plug-and-play hybrid ControlNet that improves the quality and consistency of generated videos while freezing the existing model parameters. Extensive qualitative and quantitative experiments demonstrate the superiority of DisPose compared to current methods. Code: this https URL.

[CV-50] Quantitative Evaluation of Motif Sets in Time Series

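[Quick read]: This paper addresses the limitations of existing evaluation methods for Time Series Motif Discovery (TSMD). TSMD methods are usually evaluated qualitatively, and the few quantitative metrics make implicit assumptions that limit their applicability. The solution has two key parts: PROM, a broadly applicable evaluation metric that overcomes these limitations, and TSMD-Bench, a benchmark for quantitative evaluation of TSMD. Experiments show that PROM provides a more comprehensive evaluation than existing metrics, that TSMD-Bench is more challenging than earlier benchmarks, and that the combination helps in understanding the relative performance of different TSMD methods.

Link: https://arxiv.org/abs/2412.09346
Authors: Daan Van Wesenbeeck,Aras Yurtman,Wannes Meert,Hendrik Blockeel
Keywords (EN): numerous application domains, finding recurring patterns, Series Motif Discovery, Time Series Motif, Time Series
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: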

Abstract:Time Series Motif Discovery (TSMD), which aims at finding recurring patterns in time series, is an important task in numerous application domains, and many methods for this task exist. These methods are usually evaluated qualitatively. A few metrics for quantitative evaluation, where discovered motifs are compared to some ground truth, have been proposed, but they typically make implicit assumptions that limit their applicability. This paper introduces PROM, a broadly applicable metric that overcomes those limitations, and TSMD-Bench, a benchmark for quantitative evaluation of time series motif discovery. Experiments with PROM and TSMD-Bench show that PROM provides a more comprehensive evaluation than existing metrics, that TSMD-Bench is a more challenging benchmark than earlier ones, and that the combination can help understand the relative performance of TSMD methods. More generally, the proposed approach enables large-scale, systematic performance comparisons in this field.

[CV-51] MaskTerial: A Foundation Model for Automated 2D Material Flake Detection

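[Quick read]: This paper addresses the automatic detection and classification of two-dimensional (2D) material flakes in optical microscope images, in particular the challenge of identifying low-contrast materials. The key to the solution is a deep learning model called MaskTerial, which uses an instance segmentation network and is extensively pre-trained with a synthetic data generator, so that it can adapt quickly to new materials with as few as 5 to 10 images. In addition, the model uses uncertainty estimation to make the final classification of predictions based on optical contrast, which significantly improves detection of low-contrast materials such as hexagonal boron nitride.

Link: https://arxiv.org/abs/2412.09333
Authors: Jan-Lucas Uslu,Alexey Nekrasov,Alexander Hermans,Bernd Beschoten,Bastian Leibe,Lutz Waldecker,Christoph Stampfer
Keywords (EN): computer vision algorithms, exfoliated two-dimensional, automated using computer, computer vision, vision algorithms
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Image and Video Processing (eess.IV)
Comments: 9 pages, 5 figures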

Abstract: The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms often struggle to identify low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator that generates realistic microscopy images from unlabeled data. This results in a model that can quickly adapt to new materials with as few as 5 to 10 images. Furthermore, an uncertainty estimation model is used to finally classify the predictions based on optical contrast. We evaluate our method on eight different datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.

[CV-52] Are Conditional Latent Diffusion Models Effective for Image Restoration? CVPR2025

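[Quick read]: This paper examines whether conditional latent diffusion models (CLDMs) are suitable for image restoration (IR). CLDMs excel at capturing high-level semantic correlations, especially in tasks such as text-to-image generation, but IR requires modeling the relationship between degraded and ground-truth images through a low-level representation, and there CLDMs suffer from high distortion and semantic deviation. Extensive experimental comparisons with traditional image restoration models reveal that traditional methods outperform CLDMs when degradation is minimal. The key contribution is a re-examination of current CLDM-based IR methods, together with empirical studies of how different CLDM design elements affect restoration performance, in the hope of opening new research directions in this field.

Link: https://arxiv.org/abs/2412.09324
Authors: Yunchen Yuan,Junyuan Xiao,Xinjie Li
Keywords (EN): increasingly employ conditional, employ conditional latent, conditional latent diffusion, latent diffusion models, restoration increasingly employ
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 12 figures, submitted to IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR 2025)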

Abstract: Recent advancements in image restoration increasingly employ conditional latent diffusion models (CLDMs). While these models have demonstrated notable performance improvements in recent years, this work questions their suitability for IR tasks. CLDMs excel in capturing high-level semantic correlations, making them effective for tasks like text-to-image generation with spatial conditioning. However, in IR, where the goal is to enhance image perceptual quality, these models have difficulty modeling the relationship between degraded images and ground truth images using a low-level representation. To support our claims, we compare state-of-the-art CLDMs with traditional image restoration models through extensive experiments. Results reveal that despite the scaling advantages of CLDMs, they suffer from high distortion and semantic deviation, especially in cases with minimal degradation, where traditional methods outperform them. Additionally, we perform empirical studies to examine the impact of various CLDM design elements on their restoration performance. We hope this finding inspires a reexamination of current CLDM-based IR solutions, opening up more opportunities in this field.

[CV-53] T-SVG: Text-Driven Stereoscopic Video Generation

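[Quick read]: This paper addresses the technical complexity of stereoscopic video generation, in particular the challenge of generating stereo parallax for depth perception. The key to the solution is the Text-driven Stereoscopic Video Generation (T-SVG) system, which generates a reference video from a text prompt, converts it into a 3D point cloud sequence, and renders that sequence from two slightly different viewpoints to achieve a natural stereoscopic effect. Its core innovation is the integration of training-free techniques for text-to-video generation, depth estimation, and video inpainting, which keeps the system efficient and user-friendly, allows seamless updates to newer models, and simplifies the stereoscopic video production pipeline.

Link: https://arxiv.org/abs/2412.09323
Authors: Qiao Jin,Xiaodong Chen,Wu Liu,Tao Mei,Yongdong Zhang
Keywords (EN): extended reality, virtual reality, immersive content captivates, horizons in multimedia, opened new horizons
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures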

Abstract:The advent of stereoscopic videos has opened new horizons in multimedia, particularly in extended reality (XR) and virtual reality (VR) applications, where immersive content captivates audiences across various platforms. Despite its growing popularity, producing stereoscopic videos remains challenging due to the technical complexities involved in generating stereo parallax. This refers to the positional differences of objects viewed from two distinct perspectives and is crucial for creating depth perception. This complex process poses significant challenges for creators aiming to deliver convincing and engaging presentations. To address these challenges, this paper introduces the Text-driven Stereoscopic Video Generation (T-SVG) system. This innovative, model-agnostic, zero-shot approach streamlines video generation by using text prompts to create reference videos. These videos are transformed into 3D point cloud sequences, which are rendered from two perspectives with subtle parallax differences, achieving a natural stereoscopic effect. T-SVG represents a significant advancement in stereoscopic content creation by integrating state-of-the-art, training-free techniques in text-to-video generation, depth estimation, and video inpainting. Its flexible architecture ensures high efficiency and user-friendliness, allowing seamless updates with newer models without retraining. By simplifying the production pipeline, T-SVG makes stereoscopic video generation accessible to a broader audience, demonstrating its potential to revolutionize the field.
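The parallax idea in this abstract can be illustrated with a small pinhole-camera sketch. It is not the T-SVG pipeline: the function names, the focal length, and the 6.4 cm baseline are assumptions chosen for illustration. It only shows that projecting the same 3D points from two horizontally offset cameras yields a horizontal disparity of `focal * baseline / depth`, so nearer points shift more.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point: Point3D, cam_x: float, focal: float) -> Pixel:
    """Pinhole projection for a camera at (cam_x, 0, 0) looking along +Z."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return focal * (x - cam_x) / z, focal * y / z


def stereo_pair(points: List[Point3D], baseline: float,
                focal: float) -> Tuple[List[Pixel], List[Pixel]]:
    """Project a point cloud into a left and a right view.

    The two hypothetical cameras sit half a baseline to either side of the origin.
    """
    half = baseline / 2.0
    left = [project(p, -half, focal) for p in points]
    right = [project(p, +half, focal) for p in points]
    return left, right


def disparity(left_px: Pixel, right_px: Pixel) -> float:
    """Horizontal shift of a point between the left and right views."""
    return left_px[0] - right_px[0]


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 8.0)]  # a near point and a far point
    left, right = stereo_pair(cloud, baseline=0.064, focal=500.0)
    for p, l, r in zip(cloud, left, right):
        print(f"depth={p[2]:.1f}  disparity={disparity(l, r):.2f} px")
```

In a full system, the regions revealed by shifting the viewpoint have no source pixels, which is where the abstract's video inpainting step would come in.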

[CV-54] FAMNet: Frequency-aware Matching Network for Cross-domain Few-shot Medical Image Segmentation AAAI-25 AAAI

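[Quick read]: This paper addresses Cross-domain Few-shot Medical Image Segmentation (CD-FSMIS), in particular the domain shift caused by different imaging techniques, which limits the applicability of existing few-shot medical image segmentation (FSMIS) models. The key to the solution is the Frequency-aware Matching Network (FAMNet), which contains two core modules: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion (MSF) module. The FAM module improves generalization by handling, during the meta-learning phase, both intra-domain variance (caused by support-query bias) and inter-domain variance (caused by different imaging techniques). The MSF module integrates the different frequency features decoupled by the FAM module, further mitigating the impact of inter-domain variance on segmentation performance. Combining the two modules, FAMNet surpasses existing FSMIS models and cross-domain few-shot semantic segmentation models on three cross-domain datasets, achieving state-of-the-art performance on the CD-FSMIS task.

Link: https://arxiv.org/abs/2412.09319
Authors: Yuntian Bo,Yazhou Zhu,Lunbo Li,Haofeng Zhang
Keywords (EN): few-shot medical image, medical image segmentation, current FSMIS tasks, medical image, Frequency-aware Matching Network
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-25)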

Abstract:Existing few-shot medical image segmentation (FSMIS) models fail to address a practical issue in medical imaging: the domain shift caused by different imaging techniques, which limits the applicability to current FSMIS tasks. To overcome this limitation, we focus on the cross-domain few-shot medical image segmentation (CD-FSMIS) task, aiming to develop a generalized model capable of adapting to a broader range of medical image segmentation scenarios with limited labeled data from the novel target domain. Inspired by the characteristics of frequency domain similarity across different domains, we propose a Frequency-aware Matching Network (FAMNet), which includes two key components: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion (MSF) module. The FAM module tackles two problems during the meta-learning phase: 1) intra-domain variance caused by the inherent support-query bias, due to the different appearances of organs and lesions, and 2) inter-domain variance caused by different medical imaging techniques. Additionally, we design an MSF module to integrate the different frequency features decoupled by the FAM module, and further mitigate the impact of inter-domain variance on the model’s segmentation performance. Combining these two modules, our FAMNet surpasses existing FSMIS models and Cross-domain Few-shot Semantic Segmentation models on three cross-domain datasets, achieving state-of-the-art performance in the CD-FSMIS task.

[CV-55] Multimodal Sentiment Analysis based on Video and Audio Inputs

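[Quick read]: This paper addresses the accuracy of emotion recognition models on video and audio inputs and sets out to validate the usability of such models. The key to the solution is to process audio and video with separate fine-tuned models (Facebook/wav2vec2-large and Google/vivit-b-16x2-kinetics400) and to combine the two models' output probabilities through several decision frameworks (weighted average, confidence level threshold, dynamic weighting based on confidence, and rule-based logic) to improve emotion recognition accuracy.

Link: https://arxiv.org/abs/2412.09317
Authors: Antonio Fernandez,Suzan Awinat
Keywords (EN): current researches working, highest accuracy rate, abundance of current, current researches, researches working
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Presented as a full paper in the 15th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2024) October 28-30, 2024, Leuven, Belgium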

Abstract: Despite the abundance of current research on sentiment analysis from video and audio, finding the best model that gives the highest accuracy rate is still considered a challenge for researchers in this field. The main objective of this paper is to prove the usability of emotion recognition models that take video and audio inputs. The datasets used to train the models are the CREMA-D dataset for audio and the RAVDESS dataset for video. The fine-tuned models used are Facebook/wav2vec2-large for audio and Google/vivit-b-16x2-kinetics400 for video. The average of the probabilities for each emotion generated by the two models is used in the decision-making framework. If the results diverge and one of the models achieves much higher accuracy, another test framework is created. The methods used are the Weighted Average method, the Confidence Level Threshold method, the Dynamic Weighting Based on Confidence method, and the Rule-Based Logic method. This limited approach gives encouraging results that make future research into these methods viable.
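The decision frameworks named in this abstract are forms of late fusion over two per-emotion probability vectors. The abstract does not give the paper's exact rules, so the following is a minimal sketch under assumptions: the emotion labels, the 0.7 threshold, and the choice of top-class probability as the confidence score are illustrative and not taken from the paper.

```python
from typing import Dict

Probs = Dict[str, float]


def weighted_average(audio: Probs, video: Probs, w_audio: float = 0.5) -> Probs:
    """Blend two probability vectors; w_audio = 0.5 is the plain average."""
    return {e: w_audio * audio[e] + (1.0 - w_audio) * video[e] for e in audio}


def confidence_threshold(audio: Probs, video: Probs, threshold: float = 0.7) -> Probs:
    """Trust one model alone if its top probability clears the threshold.

    If both clear it, the more confident model wins; if neither does,
    fall back to the plain average.
    """
    best_a, best_v = max(audio.values()), max(video.values())
    if best_a >= threshold and best_a >= best_v:
        return dict(audio)
    if best_v >= threshold:
        return dict(video)
    return weighted_average(audio, video)


def dynamic_weighting(audio: Probs, video: Probs) -> Probs:
    """Weight each model by its own top-class probability (assumed confidence)."""
    conf_a, conf_v = max(audio.values()), max(video.values())
    return weighted_average(audio, video, conf_a / (conf_a + conf_v))


def predict(probs: Probs) -> str:
    """Return the emotion with the highest fused probability."""
    return max(probs, key=probs.get)


if __name__ == "__main__":
    audio = {"happy": 0.6, "sad": 0.3, "angry": 0.1}
    video = {"happy": 0.2, "sad": 0.7, "angry": 0.1}
    for name, fuse in [("average", weighted_average),
                       ("threshold", confidence_threshold),
                       ("dynamic", dynamic_weighting)]:
        print(name, predict(fuse(audio, video)))
```

A rule-based variant would replace the numeric blend with explicit conditions, for example preferring one modality for specific emotions. The abstract does not describe those rules, so they are left out here.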

[CV-56] Advancing Attribution-Based Neural Network Explainability through Relative Absolute Magnitude Layer-Wise Relevance Propagation and Multi-Component Evaluation

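[Quick read]: This paper addresses shortcomings of existing Layer-Wise Relevance Propagation (LRP) formulations for explainability and proposes a new method for determining the relevance of input neurons. The key is an improved LRP formulation realized through layer-wise relevance propagation. The method is also applied to the Vision Transformer architecture and evaluated on the ImageNet and PascalVOC datasets, where it outperforms existing methods. The paper further points out the insufficiencies of current evaluation metrics and proposes a new metric that combines faithfulness, robustness, and contrastiveness to evaluate attribution-based explanation methods more comprehensively.

Link: https://arxiv.org/abs/2412.09311
Authors: Davor Vukadin,Petar Afrić,Marin Šilić,Goran Delač
Keywords (EN): Recent advancement, deep-neural network performance, network performance led, advancement in deep-neural, Layer-Wise Relevance Propagation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages, 16 figures, 13 tables, ACM Transactions on Intelligence Systems and Technology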

Abstract: Recent advances in deep neural network performance have led to the development of new state-of-the-art approaches in numerous areas. However, the black-box nature of neural networks often prohibits their use in areas where model explainability and model transparency are crucial. Over the years, researchers have proposed many algorithms to aid neural network understanding and provide additional information to the human expert. One of the most popular methods is Layer-Wise Relevance Propagation (LRP). This method assigns local relevance based on the pixel-wise decomposition of nonlinear classifiers. With the rise of attribution method research, there has emerged a pressing need to assess and evaluate their performance. Numerous metrics have been proposed, each assessing an individual property of attribution methods such as faithfulness, robustness or localization. Unfortunately, no single metric is deemed optimal for every case, and researchers often use several metrics to test the quality of the attribution maps. In this work, we address the shortcomings of the current LRP formulations and introduce a novel method for determining the relevance of input neurons through layer-wise relevance propagation. Furthermore, we apply this approach to the recently developed Vision Transformer architecture and evaluate its performance against existing methods on two image classification datasets, namely ImageNet and PascalVOC. Our results clearly demonstrate the advantage of our proposed method. Furthermore, we discuss the insufficiencies of current evaluation metrics for attribution-based explainability and propose a new evaluation metric that combines the notions of faithfulness, robustness and contrastiveness. We utilize this new metric to evaluate the performance of various attribution-based methods. Our code is available at: this https URL

[CV-57] GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression AAAI2025

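[Quick read]: This paper addresses the challenges that diverse input portraits and the intricate correlations between audio and facial motion pose for audio-driven talking head generation. The key to the solution is a robust framework called GoHD, which produces highly realistic, expressive, and controllable portrait videos through three key modules: 1) an animation module based on latent navigation, which improves generalization to unseen styles, achieves high disentanglement of motion and identity, and corrects unnatural eye movements; 2) a conformer-structured conditional diffusion model, which ensures prosody-aware head poses; 3) a two-stage training strategy that estimates lip-synchronized and realistic expressions from the input audio while decoupling frequent, frame-wise lip motion distillation from the generation of other motions that are more temporally dependent but less audio-related.

Link: https://arxiv.org/abs/2412.09296
Authors: Ziqi Zhou,Weize Quan,Hailin Shi,Wei Li,Lili Wang,Dong-ming Yan
Keywords (EN): necessitates seamless integration, generation necessitates seamless, visual data amidst, Audio-driven talking head, diverse input portraits
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025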

Abstract:Audio-driven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework GoHD designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules: Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD’s advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.

[CV-58] InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

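[Quick read]: This paper addresses the insufficient detail, hallucinations, and imprecise motion depiction in the video captions used for text-to-video generation, which harm the fidelity and consistency of generated videos. The key to the solution is a novel instance-aware structured caption framework called InstanceCap, which achieves instance-level, fine-grained video captioning for the first time. The framework uses a cluster of auxiliary models to convert the original video into instances to enhance instance fidelity, and refines dense prompts into structured phrases for concise yet precise descriptions. The paper also curates InstanceVid, a dataset of 22K samples, and proposes an enhancement pipeline tailored to the InstanceCap structure for inference. Experimental results show that InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.

Link: https://arxiv.org/abs/2412.09283
Authors: Tiehan Fan,Kepan Nan,Rui Xie,Penghao Zhou,Zhenheng Yang,Chaoyou Fu,Xiang Li,Jian Yang,Ying Tai
Keywords (EN): delivering remarkable results, recent years, delivering remarkable, evolved rapidly, rapidly in recent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: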

Abstract: Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video caption for the first time. Based on this scheme, we design an auxiliary model cluster to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.

[CV-59] Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine AAAI2025

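[Quick read]: This paper addresses two limitations of current biomedical multimodal large language models (MLLMs): image understanding is confined to the image level, and interaction is restricted to textual commands. The key to the solution is MedPLIB, a novel end-to-end MLLM with pixel-level understanding that supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. The paper also proposes a novel Mixture-of-Experts (MoE) multi-stage training strategy, which trains a visual-language expert model and a pixel-grounding expert model in separate phases and then fine-tunes with MoE, coordinating multitask learning while keeping the inference cost equivalent to that of a single expert model. In addition, the paper introduces the Medical Complex Vision Question Answering Dataset (MeCoVQA) to advance research on biomedical MLLMs.

Link: https://arxiv.org/abs/2412.09278
Authors: Xiaoshuang Huang,Lingdong Shen,Jia Liu,Fangxin Shang,Hongxiang Li,Haifeng Huang,Yehui Yang
Keywords (EN): Multimodal Large Language, intelligent biomedical assistant, achieved notable advancements, Multimodal Large, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI2025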

Abstract:In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at this https URL.
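The mDice margins quoted above are easier to read with the Dice coefficient in hand: for a predicted mask A and a ground-truth mask B, Dice = 2|A ∩ B| / (|A| + |B|). The sketch below assumes that mDice is the plain mean of per-sample Dice scores and that two empty masks score 1.0. The abstract does not state the paper's exact averaging protocol, so treat both as assumptions.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    intersection = 0
    total = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            intersection += 1 if (p and t) else 0
            total += (1 if p else 0) + (1 if t else 0)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def mean_dice(preds: List[Mask], targets: List[Mask]) -> float:
    """Mean of per-sample Dice scores (assumed meaning of mDice)."""
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(f"Dice: {dice(pred, target):.3f}")
    print(f"mDice: {mean_dice([pred, target], [target, target]):.3f}")
```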

[CV-60] Text-Video Multi-Grained Integration for Video Moment Montage

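[Quick read]: This paper addresses how laborious and time-consuming it is for users of online short video platforms to manually select, crop, and assemble raw footage when editing short videos. The key to the solution is a new task called Video Moment Montage (VMM), which aims to accurately locate the video segments that correspond to a pre-provided narration text and arrange them into a complete video that matches the description. To this end, the paper proposes the Text-Video Multi-Grained Integration (TV-MGI) method, which efficiently fuses text features from the script with shot-level and frame-level video features, enabling both global and fine-grained alignment between video content and textual descriptions. The paper also introduces the Multiple Sentences with Shots Dataset (MSSD), a large-scale dataset designed specifically for the VMM task, to support further research and experimental validation.

Link: https://arxiv.org/abs/2412.09276
Authors: Zhihui Yin,Ye Ma,Xipeng Cao,Bo Wang,Quan Chen,Peng Jiang
Keywords (EN): short video editing, online short video, short video platforms, short video, online short
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: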

Abstract: The proliferation of online short video platforms has driven a surge in user demand for short video editing. However, manually selecting, cropping, and assembling raw footage into a coherent, high-quality video remains laborious and time-consuming. To accelerate this process, we focus on a user-friendly new task called Video Moment Montage (VMM), which aims to accurately locate the corresponding video segments based on a pre-provided narration text and then arrange these video clips to create a complete video that aligns with the corresponding descriptions. The challenge lies in extracting precise temporal segments while ensuring intra-sentence and inter-sentence context consistency, as a single script sentence may require trimming and assembling multiple video clips. To address this problem, we present a novel Text-Video Multi-Grained Integration method (TV-MGI) that efficiently fuses text features from the script with both shot-level and frame-level video features, which enables the global and fine-grained alignment between the video content and the corresponding textual descriptions in the script. To facilitate further research in this area, we introduce the Multiple Sentences with Shots Dataset (MSSD), a large-scale dataset designed explicitly for the VMM task. We conduct extensive experiments on the MSSD dataset to demonstrate the effectiveness of our framework compared to baseline methods.

[CV-61] LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

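[Quick read]: This paper addresses the temporal inconsistency of diffusion-based lip sync methods and improves the convergence accuracy of SyncNet. The key to the solution is the LatentSync framework, which uses audio conditioned latent diffusion models to model complex audio-visual correlations directly, without the intermediate motion representation that earlier methods rely on. The paper also proposes Temporal REPresentation Alignment (TREPA), which uses temporal representations extracted by large-scale self-supervised video models to improve temporal consistency between generated and ground-truth frames. Through comprehensive empirical studies of the factors affecting SyncNet convergence, including training hyperparameters and data preprocessing, the paper raises SyncNet's accuracy on the HDTF test set from 91% to 94%. With these innovations, LatentSync outperforms state-of-the-art lip sync methods across various metrics on the HDTF and VoxCeleb2 datasets.

Link: https://arxiv.org/abs/2412.09262
Authors: Chunyu Li,Chao Zhang,Weikai Xu,Jinghui Xie,Weiguo Feng,Bingyue Peng,Weiwei Xing
Keywords (EN): audio conditioned latent, conditioned latent diffusion, pixel space diffusion, intermediate motion representation, lip sync methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: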

Abstract:We present LatentSync, an end-to-end lip sync framework based on audio conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods based on pixel space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that the diffusion-based lip sync methods exhibit inferior temporal consistency due to the inconsistency in the diffusion process across different frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames. Furthermore, we observe the commonly encountered SyncNet convergence issue and conduct comprehensive empirical studies, identifying key factors affecting SyncNet convergence in terms of model architecture, training hyperparameters, and data preprocessing methods. We significantly improve the accuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet. Based on the above innovations, our method outperforms state-of-the-art lip sync methods across various metrics on the HDTF and VoxCeleb2 datasets.

[CV-62] FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection AAAI2025

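[Quick read]: This paper addresses the fact that existing infrared-visible object detection (IVOD) methods neglect the frequency characteristics of complementary information, in particular the high-frequency details in visible images and the low-frequency thermal information in infrared images. The key to the solution is a novel Frequency-Driven Feature Decomposition Network (FD2-Net), which captures high-frequency and low-frequency features with a high-frequency unit (HFU) and a low-frequency unit (LFU) respectively, and uses a parameter-free complementary strengths strategy and a multimodal reconstruction mechanism to strengthen multimodal feature representation, outperforming state-of-the-art models on several IVOD benchmarks.

Link: https://arxiv.org/abs/2412.09258
Authors: Ke Li,Di Wang,Zhangyuan Hu,Shaofeng Li,Weiping Ni,Lin Zhao,Quan Wang
Keywords (EN): Infrared-visible object detection, complementary information, visible images, Infrared-visible object, seeks to harness
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work is accepted by AAAI 2025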

Abstract:Infrared-visible object detection (IVOD) seeks to harness the complementary information in infrared and visible images, thereby enhancing the performance of detectors in complex environments. However, existing methods often neglect the frequency characteristics of complementary information, such as the abundant high-frequency details in visible images and the valuable low-frequency thermal information in infrared images, thus constraining detection performance. To solve this problem, we introduce a novel Frequency-Driven Feature Decomposition Network for IVOD, called FD2-Net, which effectively captures the unique frequency representations of complementary information across multimodal visual spaces. Specifically, we propose a feature decomposition encoder, wherein the high-frequency unit (HFU) utilizes discrete cosine transform to capture representative high-frequency features, while the low-frequency unit (LFU) employs dynamic receptive fields to model the multi-scale context of diverse objects. Next, we adopt a parameter-free complementary strengths strategy to enhance multimodal features through seamless inter-frequency recoupling. Furthermore, we innovatively design a multimodal reconstruction mechanism that recovers image details lost during feature extraction, further leveraging the complementary information from infrared and visible images to enhance overall representational capacity. Extensive experiments demonstrate that FD2-Net outperforms state-of-the-art (SOTA) models across various IVOD benchmarks, i.e. LLVIP (96.2% mAP), FLIR (82.9% mAP), and M3FD (83.5% mAP).
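The abstract says the high-frequency unit relies on the discrete cosine transform but does not say how it is applied to feature maps. As a rough illustration of frequency decomposition only, the sketch below splits a 1D signal into low- and high-frequency parts with an orthonormal DCT-II. The function names and the `cutoff` parameter are hypothetical, and nothing here reproduces FD2-Net's actual units.

```python
import math
from typing import List, Tuple


def _scale(k: int, n: int) -> float:
    """Orthonormal scaling factor for DCT coefficient k of an n-point signal."""
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(signal: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct(coeffs: List[float]) -> List[float]:
    """Inverse of dct(), i.e. the orthonormal DCT-III."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def split_bands(signal: List[float], cutoff: int) -> Tuple[List[float], List[float]]:
    """Split a signal into low (k < cutoff) and high (k >= cutoff) bands.

    Because the transform is linear, low + high reconstructs the input.
    """
    coeffs = dct(signal)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high


if __name__ == "__main__":
    row = [1.0, 2.0, 4.0, 3.0, 5.0, 2.0, 1.0, 0.0]  # e.g. one row of pixel values
    low, high = split_bands(row, cutoff=2)
    print("low :", [round(v, 3) for v in low])
    print("high:", [round(v, 3) for v in high])
```

For images, the same split is usually applied along both axes (a 2D DCT), with the high band carrying edges and texture and the low band carrying smooth structure.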

[CV-63] VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation

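Quick read: Segmentation models are constrained by the categories predefined at training time. This paper addresses that limitation by combining Vision-Language reasoning with Unsupervised Domain Adaptation (UDA). The key is to improve the fine-grained segmentation ability of vision-language models through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning, and then to integrate these improvements into a UDA framework that uses distillation and cross-domain mixed sampling to strengthen domain adaptability and generalization. The authors describe the resulting UDA-FROVSS framework as the first UDA approach to adapt effectively across domains without requiring shared categories.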

Link: https://arxiv.org/abs/2412.09240
Authors: Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescos
Keywords: typically constrained, adapting Vision-Language Models, Open Vocabulary Semantic, Unsupervised Domain Adaptation, Vocabulary Semantic Segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Segmentation models are typically constrained by the categories defined during training. To address this, researchers have explored two independent approaches: adapting Vision-Language Models (VLMs) and leveraging synthetic data. However, VLMs often struggle with granularity, failing to disentangle fine-grained concepts, while synthetic data-based methods remain limited by the scope of available datasets. This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA). First, we improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework. Next, we incorporate these enhancements into a UDA framework by employing distillation to stabilize training and cross-domain mixed sampling to boost adaptability without compromising generalization. The resulting UDA-FROVSS framework is the first UDA approach to effectively adapt across domains without requiring shared categories.

[CV-64] Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering

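Quick read: This paper addresses the inability of existing video question answering (VideoQA) methods to integrate questions effectively with video frames and semantic object-level abstractions. The key to the solution is Local-Global Question Aware Video Embedding (LGQAVE), which introduces three innovations to integrate multimodal knowledge better and to emphasize the semantic visual concepts relevant to a specific question. First, LGQAVE replaces traditional ad-hoc frame sampling with a cross-attention mechanism that identifies the frames most relevant to the question. Second, it captures the dynamics of the objects within these frames using distinct graphs and grounds them in the question semantics with the miniGPT model. Finally, a question-aware dynamic graph transformer (Q-DGT) processes these graphs to produce refined local and global video representations, which an additional cross-attention module integrates into the final video embedding used for answer generation.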

Link: https://arxiv.org/abs/2412.09230
Authors: Sai Bhargav Rongali, Mohamad Hassan N C, Ankit Jha, Neha Bhargava, Saurabh Prasad, Biplab Banerjee
Keywords: paper tackles, tackles the intricate, intricate challenge, video, video question-answering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper tackles the intricate challenge of video question-answering (VideoQA). Despite notable progress, current methods fall short of effectively integrating questions with video frames and semantic object-level abstractions to create question-aware video representations. We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to integrate multi-modal knowledge better and emphasize semantic visual concepts relevant to specific questions. LGQAVE moves beyond traditional ad-hoc frame sampling by utilizing a cross-attention mechanism that precisely identifies the most relevant frames concerning the questions. It captures the dynamics of objects within these frames using distinct graphs, grounding them in question semantics with the miniGPT model. These graphs are processed by a question-aware dynamic graph transformer (Q-DGT), which refines the outputs to develop nuanced global and local video representations. An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers. Extensive evaluations across multiple benchmarks demonstrate that LGQAVE significantly outperforms existing models in delivering accurate multi-choice and open-ended answers.

[CV-65] UADet: A Remarkably Simple Yet Effective Uncertainty-Aware Open-Set Object Detection Framework

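Quick read: This paper tackles Open-Set Object Detection (OSOD), which aims to detect both known and unknown objects in unlabelled images. The core difficulty is the absence of supervision for unknown classes, which makes unknown objects hard to distinguish from the background. To address this, the paper proposes UADet, an Uncertainty-Aware Open-Set Object Detector that combines appearance and geometric uncertainty to reduce the number of unannotated instances that previous methods misuse or omit. The reported experiments show that UADet substantially outperforms state-of-the-art (SOTA) methods at detecting both known and unknown objects, with a 1.8x improvement in unknown recall, and that it also shows clear advantages when extended to Open World Object Detection (OWOD).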

Link: https://arxiv.org/abs/2412.09229
Authors: Silin Cheng, Yuanpei Liu, Kai Han
Keywords: World Object Detection, Object Detection, unknown objects, unlabelled images, Open-Set Object Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:We tackle the challenging problem of Open-Set Object Detection (OSOD), which aims to detect both known and unknown objects in unlabelled images. The main difficulty arises from the absence of supervision for these unknown classes, making it challenging to distinguish them from the background. Existing OSOD detectors either fail to properly exploit or inadequately leverage the abundant unlabeled unknown objects in training data, restricting their performance. To address these limitations, we propose UADet, an Uncertainty-Aware Open-Set Object Detector that considers appearance and geometric uncertainty. By integrating these uncertainty measures, UADet effectively reduces the number of unannotated instances incorrectly utilized or omitted by previous methods. Extensive experiments on OSOD benchmarks demonstrate that UADet substantially outperforms previous state-of-the-art (SOTA) methods in detecting both known and unknown objects, achieving a 1.8x improvement in unknown recall while maintaining high performance on known classes. When extended to Open World Object Detection (OWOD), our method shows significant advantages over the current SOTA method, with average improvements of 13.8% and 6.9% in unknown recall on M-OWODB and S-OWODB benchmarks, respectively. Extensive results validate the effectiveness of our uncertainty-aware approach across different open-set scenarios.

[CV-66] DASK: Distribution Rehearsing via Adaptive Style Kernel Learning for Exemplar-Free Lifelong Person Re-Identification AAAI-25

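Quick read: This paper addresses catastrophic forgetting in lifelong person re-identification (LReID), which stems from significant domain gaps between training steps. Existing LReID methods usually rely on data replay and knowledge distillation to mitigate the problem, but data replay raises data-privacy concerns, and knowledge distillation is limited by the cumulative forgetting of undistilled knowledge. The paper proposes a new paradigm that models and rehearses the distribution of old domains to strengthen knowledge consolidation while learning new data, giving strong anti-forgetting capacity without storing any exemplars. The key is an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning (DRL) mechanism that learns to transform data from an arbitrary distribution into the current data style at each learning step. An Adaptive Kernel Prediction network (AKPNet) provides instance-specific distribution adjustment, which enhances the style-transfer capacity. The paper also designs a Distribution Rehearsing-driven LReID Training module that rehearses the old distribution from new data via the old AKPNet model, accumulating new and old knowledge under a joint knowledge consolidation scheme. The reported results show that DASK outperforms existing methods by 3.6%-6.8% in anti-forgetting capacity and by 4.5%-6.5% in generalization capacity.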

Link: https://arxiv.org/abs/2412.09224
Authors: Kunlun Xu, Chenghao Jiang, Peixi Xiong, Yuxin Peng, Jiahuan Zhou
Keywords: Lifelong person re-identification, Lifelong person, significant domain gaps, person re-identification, catastrophic forgetting due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: in Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract:Lifelong person re-identification (LReID) is an important but challenging task that suffers from catastrophic forgetting due to significant domain gaps between training steps. Existing LReID approaches typically rely on data replay and knowledge distillation to mitigate this issue. However, data replay methods compromise data privacy by storing historical exemplars, while knowledge distillation methods suffer from limited performance due to the cumulative forgetting of undistilled knowledge. To overcome these challenges, we propose a novel paradigm that models and rehearses the distribution of the old domains to enhance knowledge consolidation during the new data learning, possessing a strong anti-forgetting capacity without storing any exemplars. Specifically, we introduce an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning mechanism that learns to transform arbitrary distribution data into the current data style at each learning step. To enhance the style transfer capacity of DRL, an Adaptive Kernel Prediction network is explored to achieve an instance-specific distribution adjustment. Additionally, we design a Distribution Rehearsing-driven LReID Training module, which rehearses old distribution based on the new data via the old AKPNet model, achieving effective new-old knowledge accumulation under a joint knowledge consolidation scheme. Experimental results show our DASK outperforms the existing methods by 3.6%-6.8% and 4.5%-6.5% on anti-forgetting and generalization capacity, respectively. Our code is available at this https URL

[CV-67] USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation AAAI2025

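Quick read: This paper addresses two main problems of contrastive learning methods for skeleton-based representation learning. First, existing methods are predominantly negative-based and need an additional momentum encoder and memory bank, which complicates training. Second, they focus on global representations and overlook the local detail that dense prediction tasks depend on. The key to the solution is USDRL, a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation. It applies multi-grained feature decorrelation across the temporal, spatial, and instance domains to reduce redundancy among representation dimensions and maximize the information extracted. In addition, a Dense Spatio-Temporal Encoder (DSTE) captures fine-grained action representations, which improves performance on dense prediction tasks.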

Link: https://arxiv.org/abs/2412.09220
Authors: Wanjiang Weng, Hongsong Wang, Junbo He, Lei He, Guosen Xie
Keywords: achieved great success, Contrastive learning, representation learning recently, achieved great, great success
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Contrastive learning has achieved great success in skeleton-based representation learning recently. However, the prevailing methods are predominantly negative-based, necessitating additional momentum encoder and memory bank to get negative samples, which increases the difficulty of model training. Furthermore, these methods primarily concentrate on learning a global representation for recognition and retrieval tasks, while overlooking the rich and detailed local representations that are crucial for dense prediction tasks. To alleviate these issues, we introduce a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation, called USDRL, which employs feature decorrelation across temporal, spatial, and instance domains in a multi-grained manner to reduce redundancy among dimensions of the representations to maximize information extraction from features. Additionally, we design a Dense Spatio-Temporal Encoder (DSTE) to capture fine-grained action representations effectively, thereby enhancing the performance of dense prediction tasks. Comprehensive experiments, conducted on the benchmarks NTU-60, NTU-120, PKU-MMD I, and PKU-MMD II, across diverse downstream tasks including action recognition, action retrieval, and action detection, conclusively demonstrate that our approach significantly outperforms the current state-of-the-art (SOTA) approaches. Our code and models are available at this https URL.

[CV-68] Enhancing Implicit Neural Representations via Symmetric Power Transformation AAAI2025

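Quick read: This paper addresses how to enhance the capacity of Implicit Neural Representations (INRs); the key is a symmetric power transformation. Unlike prior work that uses random permutation or index rearrangement, the method is a reversible operation that transforms the data without additional storage. Specifically, the paper proposes the Range-Defined Symmetric Hypothesis, which posits that a specific range and symmetry can improve the expressive ability of an INR. Based on this, it designs a nonlinear symmetric power transformation that uses a power coefficient to redistribute the data so that it is approximately symmetric within the target range. It further introduces deviation-aware calibration and an adaptive soft boundary to make the transformation more robust, addressing extreme deviation boosting and continuity breaking.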

Link: https://arxiv.org/abs/2412.09213
Authors: Weixiang Zhang, Shuzhao Xie, Chengwei Ren, Shijia Ge, Mingzi Wang, Zhi Wang
Keywords: Implicit Neural Representation, Neural Representation, Implicit Neural, capacity of Implicit, symmetric power transformation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract: We propose symmetric power transformation to enhance the capacity of Implicit Neural Representation (INR) from the perspective of data transformation. Unlike prior work utilizing random permutation or index rearrangement, our method features a reversible operation that does not require additional storage consumption. Specifically, we first investigate the characteristics of data that can benefit the training of INR, proposing the Range-Defined Symmetric Hypothesis, which posits that specific range and symmetry can improve the expressive ability of INR. Based on this hypothesis, we propose a nonlinear symmetric power transformation to achieve both range-defined and symmetric properties simultaneously. We use the power coefficient to redistribute data to approximate symmetry within the target range. To improve the robustness of the transformation, we further design deviation-aware calibration and adaptive soft boundary to address issues of extreme deviation boosting and continuity breaking. Extensive experiments are conducted to verify the performance of the proposed method, demonstrating that our transformation can reliably improve INR compared with other data transformations. We also conduct 1D audio, 2D image and 3D video fitting tasks to demonstrate the effectiveness and applicability of our method.
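
The abstract describes a reversible power transformation that redistributes data to be roughly symmetric within a target range, but it does not give the formula. The sketch below is one possible reading under stated assumptions: the target range is taken to be [-1, 1], and the power coefficient `beta` is chosen so that the median maps to the centre of the range. Both choices, and all function names, are illustrative rather than the authors' definitions, and the calibration and soft-boundary steps are not modelled.

```python
import math
import statistics


def fit_power(values):
    """Pick a power coefficient that sends the median to the range centre.

    Assumes the values are not all identical.
    """
    lo, hi = min(values), max(values)
    med = (statistics.median(values) - lo) / (hi - lo)
    med = min(max(med, 1e-6), 1.0 - 1e-6)  # keep the logarithm finite
    beta = math.log(0.5) / math.log(med)
    return lo, hi, beta


def forward(values, lo, hi, beta):
    """Normalise to [0, 1], apply the power, then rescale to [-1, 1]."""
    return [2.0 * ((v - lo) / (hi - lo)) ** beta - 1.0 for v in values]


def inverse(values, lo, hi, beta):
    """Undo `forward`; only lo, hi and beta need to be stored."""
    return [lo + (hi - lo) * ((t + 1.0) / 2.0) ** (1.0 / beta) for t in values]


# Heavily skewed toy data: most values sit near the lower bound.
data = [0.0, 0.01, 0.02, 0.05, 0.1, 0.2, 1.0]
lo, hi, beta = fit_power(data)
transformed = forward(data, lo, hi, beta)
restored = inverse(transformed, lo, hi, beta)
```

Only three scalars are needed to invert the mapping, which is consistent with the abstract's point that the operation is reversible without extra storage for a permutation or index table.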

[CV-69] eCARLA-scenes: A synthetically generated dataset for event-based optical flow prediction

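Quick read: This paper addresses the lack of diverse, scalable real-world datasets for applying event-based vision and Spiking Neural Networks (SNNs) in robotics. The key to the solution is eWiz, a comprehensive library for processing event-based data, with tools for data loading, augmentation, visualization, encoding, and training-data generation, along with loss functions and performance metrics. Built on eWiz, the paper also presents the synthetic dataset eCARLA-scenes, which uses the CARLA simulator to generate self-driving car scenarios for optical flow prediction. The aim is to lay a foundation for event-based camera applications in autonomous navigation and to pave the way for running SNNs on neuromorphic hardware such as the Intel Loihi.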

Link: https://arxiv.org/abs/2412.09209
Authors: Jad Mansour, Hayat Rajani, Rafael Garcia, Nuno Gracias
Keywords: Spiking Neural Networks, Neural Networks, Spiking Neural, vision and Spiking, Unmanned Aerial Vehicles
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: The joint use of event-based vision and Spiking Neural Networks (SNNs) is expected to have a large impact in robotics in the near future, in tasks such as visual odometry and obstacle avoidance. While researchers have used real-world event datasets for optical flow prediction (mostly captured with Unmanned Aerial Vehicles (UAVs)), these datasets are limited in diversity and scalability and are challenging to collect. Thus, synthetic datasets offer a scalable alternative by bridging the gap between reality and simulation. In this work, we address the lack of datasets by introducing eWiz, a comprehensive library for processing event-based data. It includes tools for data loading, augmentation, visualization, encoding, and generation of training data, along with loss functions and performance metrics. We further present synthetic event-based datasets and data generation pipelines for optical flow prediction tasks. Built on top of eWiz, eCARLA-scenes makes use of the CARLA simulator to simulate self-driving car scenarios. The ultimate goal of this dataset is the depiction of diverse environments while laying a foundation for advancing event-based camera applications in autonomous field vehicle navigation, paving the way for using SNNs on neuromorphic hardware such as the Intel Loihi.

[CV-70] Temporal Action Localization with Cross Layer Task Decoupling and Refinement AAAI2025

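Quick read: This paper addresses the conflicting feature requirements of the classification and localization tasks in Temporal Action Localization (TAL). Existing methods usually use separate classification and localization heads that share the same input feature, which leads to suboptimal performance. The proposed solution is a Cross Layer Task Decoupling and Refinement (CLTDR) strategy, which combines semantically strong features from higher pyramid layers with boundary-aware features from lower layers to disentangle classification and localization. Features from multiple layers are also used to refine and align the decoupled classification and regression results. Finally, a lightweight Gated Multi-Granularity (GMG) module extracts and aggregates video features at instant, local, and global temporal granularities. With these components, the method reports state-of-the-art performance on five challenging benchmarks.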

Link: https://arxiv.org/abs/2412.09202
Authors: Qiang Li, Di Liu, Jun Kong, Sen Li, Hui Xu, Jianzhong Wang
Keywords: involves dual tasks, involves dual, classify and localize, Layer Task Decoupling, Cross Layer Task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025

Abstract: Temporal action localization (TAL) involves dual tasks to classify and localize actions within untrimmed videos. However, the two tasks often have conflicting requirements for features. Existing methods typically employ separate heads for classification and localization tasks but share the same input feature, leading to suboptimal performance. To address this issue, we propose a novel TAL method with Cross Layer Task Decoupling and Refinement (CLTDR). Based on the feature pyramid of video, the CLTDR strategy integrates semantically strong features from higher pyramid layers and detailed boundary-aware features from lower pyramid layers to effectively disentangle the action classification and localization tasks. Moreover, the multiple features from cross layers are also employed to refine and align the disentangled classification and regression results. Finally, a lightweight Gated Multi-Granularity (GMG) module is proposed to comprehensively extract and aggregate video features at instant, local, and global temporal granularities. Benefiting from the CLTDR and GMG modules, our method achieves state-of-the-art performance on five challenging benchmarks: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. Our code and pre-trained models are publicly available at: this https URL.

[CV-71] Accuracy Improvements for Convolutional and Differential Distance Function Approximations

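Quick read: This paper addresses the estimation, within a bounded domain, of the distance function from interior points of the domain to its boundary. The key is to consider convolutional and differential distance estimation schemes and to improve the accuracy of both using asymptotics of Laplace integrals and Taylor series extrapolation.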

Link: https://arxiv.org/abs/2412.09200
Authors: Alexander Belyaev, Pierre-Alain Fayolle
Keywords: bounded domain, problem of estimating, internal points, distance function, distance estimation schemes
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Given a bounded domain, we deal with the problem of estimating the distance function from the internal points of the domain to the boundary of the domain. Convolutional and differential distance estimation schemes are considered and, for both the schemes, accuracy improvements are proposed and evaluated. Asymptotics of Laplace integrals and Taylor series extrapolations are used to achieve the improvements.
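
To make the idea of a differential distance estimate concrete, the sketch below uses Varadhan's classical formula on the interval [0, 1]: solve u - t u'' = 0 with u = 1 on the boundary, then take -sqrt(t) log u as the distance estimate. Whether this is exactly the scheme analysed in the paper is an assumption, since the abstract does not name it, and the accuracy improvements the authors propose are not reproduced here. In 1D the boundary value problem has a closed-form solution, so no solver is needed.

```python
import math


def smoothed_distance(x, t):
    """Distance estimate on [0, 1] from u - t*u'' = 0 with u(0) = u(1) = 1.

    The exact solution is u(x) = cosh((x - 0.5)/s) / cosh(0.5/s) with
    s = sqrt(t), and the estimate is -s * log(u(x)).
    """
    s = math.sqrt(t)
    u = math.cosh((x - 0.5) / s) / math.cosh(0.5 / s)
    return -s * math.log(u)


def true_distance(x):
    """Exact distance from x to the boundary of [0, 1]."""
    return min(x, 1.0 - x)


# The estimate is accurate near the boundary and least accurate on the
# medial axis (x = 0.5); shrinking t reduces the error there.
error_coarse = abs(smoothed_distance(0.5, 1e-4) - true_distance(0.5))
error_fine = abs(smoothed_distance(0.5, 1e-6) - true_distance(0.5))
```

At the midpoint the error is about sqrt(t) * ln 2, which shows why the uncorrected estimate leaves room for the kind of accuracy improvement the abstract refers to.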

[CV-72] MVC-VPR: Mutual Learning of Viewpoint Classification and Visual Place Recognition

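Quick read: This paper addresses the inconsistent supervision for descriptor learning in Visual Place Recognition (VPR) that arises when images of the same place are captured from different viewpoints. The key to the solution is mutual learning between viewpoint self-classification and VPR. Starting from a coarse classification based on geographical coordinates, the method progresses to a finer viewpoint classification using simple clustering techniques, partitioning the dataset without supervision while simultaneously training a descriptor extractor for place recognition. The reported experiments show that this approach partitions the dataset by viewpoint almost perfectly, so that the two tasks reinforce each other, and that it even surpasses state-of-the-art (SOTA) methods that partition datasets using ground-truth labels.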

Link: https://arxiv.org/abs/2412.09199
Authors: Qiwen Gu, Xufei Wang, Fenglin Zhang, Junqiao Zhao, Siyue Tao, Chen Ye, Tiantian Feng, Changjun Jiang
Keywords: Visual Place Recognition, Visual Place, robustly identify locations, leveraging image retrieval, aims to robustly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract: Visual Place Recognition (VPR) aims to robustly identify locations by leveraging image retrieval based on descriptors encoded from environmental images. However, drastic appearance changes of images captured from different viewpoints at the same location pose incoherent supervision signals for descriptor learning, which severely hinder the performance of VPR. Previous work proposes classifying images based on manually defined rules or ground truth labels for viewpoints, followed by descriptor training based on the classification results. However, not all datasets have ground truth labels of viewpoints and manually defined rules may be suboptimal, leading to degraded descriptors. To address these challenges, we introduce the mutual learning of viewpoint self-classification and VPR. Starting from coarse classification based on geographical coordinates, we progress to finer classification of viewpoints using simple clustering techniques. The dataset is partitioned in an unsupervised manner while simultaneously training a descriptor extractor for place recognition. Experimental results show that this approach almost perfectly partitions the dataset based on viewpoints, thus achieving mutually reinforcing effects. Our method even surpasses state-of-the-art (SOTA) methods that partition datasets using ground truth labels.

[CV-73] ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local Motion Deblurring

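Quick read: This paper addresses blur caused by moving objects, where the moving foreground is blurry while the static background remains sharp. The key is a context-based local blur detection module that uses additional contextual information to identify blurry regions more accurately. Exploiting the short-exposure images that modern smartphones can provide, the paper also develops a blur-aware guided image restoration method that uses the sharp structural details of the short-exposure image to help reconstruct heavily blurred regions. For more realistic and visually pleasing results, it proposes a short-exposure guided diffusion model that extracts useful features from the short-exposure image and the blurred regions to better constrain the diffusion process. These components are combined into a simple yet effective network named ExpRDiff, which the reported experiments show performs favorably against state-of-the-art methods.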

Link: https://arxiv.org/abs/2412.09193
Authors: Zhongbao Yang, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Keywords: Removing blur caused, background remains clear, static background remains, moving objects, Removing blur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract: Removing blur caused by moving objects is challenging, as the moving objects are usually significantly blurry while the static background remains clear. Existing methods that rely on local blur detection often suffer from inaccuracies and cannot generate satisfactory results when focusing solely on blurred regions. To overcome these problems, we first design a context-based local blur detection module that incorporates additional contextual information to improve the identification of blurry regions. Considering that modern smartphones are equipped with cameras capable of providing short-exposure images, we develop a blur-aware guided image restoration method that utilizes sharp structural details from short-exposure images, facilitating accurate reconstruction of heavily blurred regions. Furthermore, to restore realistic and visually pleasing images, we develop a short-exposure guided diffusion model that explores useful features from short-exposure images and blurred regions to better constrain the diffusion process. Finally, we formulate the above components into a simple yet effective network, named ExpRDiff. Experimental results show that ExpRDiff performs favorably against state-of-the-art methods.

[CV-74] RAD: Region-Aware Diffusion Models for Image Inpainting

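Quick read: This paper addresses the inefficiency of existing diffusion-based image inpainting methods, which need nested loops in the generation process or additional conditioning components, leading to high computational cost and long inference times. The key to the solution is Region-Aware Diffusion Models (RAD), which use a different noise schedule for each pixel so that local regions can be generated asynchronously while the global image context is taken into account. RAD uses a plain reverse process with no additional components, which makes its inference up to 100 times faster than state-of-the-art approaches. In addition, RAD is fine-tuned from pretrained diffusion models with low-rank adaptation (LoRA), which reduces the computational burden of training.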

Link: https://arxiv.org/abs/2412.09191
Authors: Sora Kim, Sungho Suh, Minsik Lee
Keywords: achieved remarkable success, Diffusion models, Diffusion, achieved remarkable, remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Diffusion models have achieved remarkable success in image generation, with applications broadening across various domains. Inpainting is one such application that can benefit significantly from diffusion models. Existing methods either hijack the reverse process of a pretrained diffusion model or cast the problem into a larger framework, i.e., conditioned generation. However, these approaches often require nested loops in the generation process or additional components for conditioning. In this paper, we present region-aware diffusion models (RAD) for inpainting with a simple yet effective reformulation of the vanilla diffusion models. RAD utilizes a different noise schedule for each pixel, which allows local regions to be generated asynchronously while considering the global image context. A plain reverse process requires no additional components, enabling RAD to achieve inference time up to 100 times faster than the state-of-the-art approaches. Moreover, we employ low-rank adaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models, reducing computational burdens in training as well. Experiments demonstrated that RAD provides state-of-the-art results both qualitatively and quantitatively, on the FFHQ, LSUN Bedroom, and ImageNet datasets.
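
The central idea in the abstract is a separate noise schedule for each pixel. The sketch below illustrates only the forward (noising) side of that idea on a flattened toy image: pixels inside the hole are pushed to the final timestep while known pixels stay at timestep 0 and are left untouched. The cosine-style `alpha_bar` schedule, the mask layout, and all names are assumptions made for illustration; the paper's actual schedules and reverse process are not reproduced.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Cumulative signal level in [0, 1] (illustrative cosine-style schedule)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def noise_pixels(x0, steps_per_pixel, num_steps, rng):
    """Apply x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps with a per-pixel timestep."""
    noisy = []
    for value, t in zip(x0, steps_per_pixel):
        a = alpha_bar(t, num_steps)
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(a) * value + math.sqrt(1.0 - a) * eps)
    return noisy


image = [0.2, 0.4, 0.6, 0.8]  # flattened toy image
mask = [0, 0, 1, 1]           # 1 marks the hole to be inpainted
num_steps = 100

# Known pixels keep timestep 0 (no noise); hole pixels get the final timestep.
steps = [num_steps if m else 0 for m in mask]
noisy = noise_pixels(image, steps, num_steps, random.Random(0))
```

With such a schedule, a single reverse pass can denoise the hole while the known pixels already sit at their clean values, which is one way to read the abstract's claim that no nested loops or extra conditioning components are needed.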

[CV-75] On the effectiveness of Rotation-Equivariance in U-Net: A Benchmark for Image Segmentation

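Quick read: This paper examines the use of rotation-equivariance in complex architectures such as U-Net for image segmentation. The key is a comprehensive evaluation of rotation-equivariant U-Nets across a broad range of tasks, benchmarked against standard U-Net architectures in terms of performance and computational cost. The study covers datasets in which the objects of interest have arbitrary orientation (for example, Kvasir-SEG) as well as more standard segmentation datasets (such as COCO-Stuff), to explore how widely rotation-equivariance applies beyond tasks where it is obviously relevant.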

Link: https://arxiv.org/abs/2412.09182
Authors: Robin Ghyselinck, Valentin Delchevalerie, Bruno Dumas, Benoît Frénay
Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, Numerous studies, studies have recently
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Numerous studies have recently focused on incorporating different variations of equivariance in Convolutional Neural Networks (CNNs). In particular, rotation-equivariance has gathered significant attention due to its relevance in many applications related to medical imaging, microscopic imaging, satellite imaging, industrial tasks, etc. While prior research has primarily focused on enhancing classification tasks with rotation equivariant CNNs, their impact on more complex architectures, such as U-Net for image segmentation, remains scarcely explored. Indeed, previous work on integrating rotation-equivariance into the U-Net architecture has focused on solving specific applications with a limited scope. In contrast, this paper aims to provide a more exhaustive evaluation of rotation equivariant U-Net for image segmentation across a broader range of tasks. We benchmark their effectiveness against standard U-Net architectures, assessing improvements in terms of performance and sustainability (i.e., computational cost). Our evaluation focuses on datasets in which the orientation of the objects of interest is arbitrary (e.g., Kvasir-SEG), but also on more standard segmentation datasets (such as COCO-Stuff), so as to explore the wider applicability of rotation equivariance beyond tasks undoubtedly concerned by rotation equivariance. The main contribution of this work is to provide insights into the trade-offs and advantages of integrating rotation equivariance for segmentation tasks.
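
Rotation-equivariance means that rotating the input and then applying a layer gives the same result as applying the layer and then rotating the output. The sketch below checks this property for 90-degree rotations with a plain zero-padded 3x3 filter on a small integer grid. It is a toy check with hypothetical helper names, not one of the equivariant U-Net layers the paper benchmarks.

```python
def rot90(grid):
    """Rotate a square grid 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def filter3x3(grid, kernel):
    """Zero-padded 3x3 cross-correlation on a square integer grid."""
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n:
                        acc += kernel[dr + 1][dc + 1] * grid[rr][cc]
            out[r][c] = acc
    return out


def is_rot90_equivariant(grid, kernel):
    """True if filtering commutes with a 90-degree rotation for this input."""
    return filter3x3(rot90(grid), kernel) == rot90(filter3x3(grid, kernel))


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
symmetric_kernel = [[0, 1, 0], [1, 4, 1], [0, 1, 0]]    # unchanged by rot90
gradient_kernel = [[0, 0, 0], [-1, 0, 1], [0, 0, 0]]    # horizontal gradient

symmetric_ok = is_rot90_equivariant(image, symmetric_kernel)
gradient_ok = is_rot90_equivariant(image, gradient_kernel)
```

A kernel that is itself symmetric under 90-degree rotation passes the check, while a directional kernel does not; equivariant architectures obtain the property by construction instead of relying on the learned weights happening to be symmetric.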

[CV-76] Weighted Poisson-disk Resampling on Large-Scale Point Clouds AAAI2025

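Quick read: This paper addresses resampling in large-scale point cloud processing, in particular the failure of existing methods to balance point number, density, and geometric consistency, and their loss of efficiency and accuracy on large-scale point clouds. The key to the solution is a weighted Poisson-disk (WPD) resampling method. It first runs an initial Poisson resampling with a voxel-based estimation strategy, which is efficient and estimates the Poisson-disk radius more accurately. It then applies a weighted tangent smoothing step that optimizes the Voronoi diagram of each point while keeping the result isotropic and preserving sharp features. The result is a resampled point cloud with the specified point number, uniform density, and high geometric consistency.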

Link: https://arxiv.org/abs/2412.09177
Authors: Xianhe Jiao, Chenlei Lv, Junli Zhao, Ran Yi, Yu-Hui Wen, Zhenkuan Pan, Zhongke Wu, Yong-jin Liu
Keywords: important role, role of controlling, large-scale point cloud, point, large-scale point
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
Comments: Accepted to AAAI 2025

Abstract: For large-scale point cloud processing, resampling plays an important role in controlling the point number and density while keeping the geometric consistency. However, current methods cannot balance such different requirements. Particularly with large-scale point clouds, classical methods often struggle with decreased efficiency and accuracy. To address such issues, we propose a weighted Poisson-disk (WPD) resampling method to improve the usability and efficiency for the processing. We first design an initial Poisson resampling with a voxel-based estimation strategy. It is able to estimate a more accurate radius of the Poisson-disk while maintaining high efficiency. Then, we design a weighted tangent smoothing step to further optimize the Voronoi diagram for each point. At the same time, sharp features are detected and kept in the optimized results with isotropic property. Finally, we achieve a resampling copy from the original point cloud with the specified point number, uniform density, and high-quality geometric consistency. Experiments show that our method significantly improves the performance of large-scale point cloud resampling for different applications, and provides a highly practical solution.
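
For readers unfamiliar with Poisson-disk resampling, the sketch below shows the basic, unweighted version on a 3D point set: a point is kept only if no previously kept point lies within a given radius, and a voxel hash keeps the neighbour search local. This is a generic greedy baseline with hypothetical names, assumed here for illustration. It does not implement the paper's voxel-based radius estimation or the weighted tangent smoothing step.

```python
import math
import random


def poisson_disk_resample(points, radius):
    """Greedy Poisson-disk subsampling of 3D points using a voxel hash.

    The voxel edge equals `radius`, so any kept point closer than `radius`
    to a candidate must lie in one of the 27 surrounding voxels.
    """
    grid = {}
    kept = []
    for p in points:
        key = tuple(int(math.floor(c / radius)) for c in p)
        accepted = True
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    cell = (key[0] + dx, key[1] + dy, key[2] + dz)
                    for q in grid.get(cell, ()):
                        if math.dist(p, q) < radius:
                            accepted = False
        if accepted:
            kept.append(p)
            grid.setdefault(key, []).append(p)
    return kept


# Toy input: 500 random points in the unit cube (fixed seed for repeatability).
rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
sampled = poisson_disk_resample(cloud, radius=0.15)
```

In this baseline the output size is a by-product of the chosen radius. The abstract's emphasis on estimating the radius accurately suggests the authors work in the opposite direction, from a specified point number to a radius.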

[CV-77] DECOR: Decomposition and Projection of Text Embeddings for Text-to-Image Customization

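Quick read: This paper addresses the overfitting that occurs when text-to-image (T2I) models are fine-tuned on a small number of reference images. The overfitting shows up as prompt misalignment and content leakage, so that the model fails to follow the input prompt or generates undesired objects. The key to the solution is DECOR, which decomposes the text embedding matrix and projects the text embeddings onto a vector space orthogonal to undesired token vectors, reducing the influence of unwanted semantics on the embeddings. The reported experiments show that DECOR outperforms state-of-the-art customization models on text and visual alignment metrics and generates images that are more faithful to the input prompt.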

Link: https://arxiv.org/abs/2412.09169
Authors: Geonhui Jang, Jin-Hwa Kim, Yong-Hyun Park, Junho Kim, Gayoung Lee, Yonghyun Jeong
Keywords: perform high-quality customization, reference images, effectively capture, perform high-quality, text embeddings
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Text-to-image (T2I) models can effectively capture the content or style of reference images to perform high-quality customization. A representative technique for this is fine-tuning using low-rank adaptations (LoRA), which enables efficient model customization with reference images. However, fine-tuning with a limited number of reference images often leads to overfitting, resulting in issues such as prompt misalignment or content leakage. These issues prevent the model from accurately following the input prompt or cause it to generate undesired objects during inference. To address this problem, we examine the text embeddings that guide the diffusion model during inference. This study decomposes the text embedding matrix and conducts a component analysis to understand the embedding space geometry and identify the cause of overfitting. Based on this, we propose DECOR, which projects text embeddings onto a vector space orthogonal to undesired token vectors, thereby reducing the influence of unwanted semantics in the text embeddings. Experimental results demonstrate that DECOR outperforms state-of-the-art customization models and achieves Pareto frontier performance across text and visual alignment evaluation metrics. Furthermore, it generates images more faithful to the input prompts, showcasing its effectiveness in addressing overfitting and enhancing text-to-image customization.
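
The projection step mentioned in the abstract, mapping a text embedding onto the space orthogonal to undesired token vectors, is standard linear algebra and can be sketched with Gram-Schmidt. The toy vectors and function names below are hypothetical. How DECOR selects the undesired tokens, and its decomposition and component analysis, are not covered by this sketch.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt basis for the span of `vectors` (dependent vectors are skipped)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(embedding, undesired):
    """Remove from `embedding` every component lying in span(undesired)."""
    result = list(embedding)
    for b in orthonormal_basis(undesired):
        coeff = dot(result, b)
        result = [r - coeff * bi for r, bi in zip(result, b)]
    return result


# Toy 4-dimensional example with two undesired token directions.
token_embedding = [1.0, 2.0, 3.0, 4.0]
undesired_tokens = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
cleaned = project_out(token_embedding, undesired_tokens)
```

After the projection, `cleaned` has zero inner product with each undesired vector, while any component that was already orthogonal to them is left unchanged.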

[CV-78] YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

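Quick read: This paper addresses the generation of high-quality sound effects for product-level videos when only a small amount of labeled data is available for diverse scenes. The key to the solution is YingSound, a foundation model for video-guided sound generation that supports high-quality audio generation in few-shot settings. YingSound has two main modules. The first uses a conditional flow matching transformer to align the audio and visual modalities semantically, building a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with the corresponding audio features. The second uses a multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. The paper also presents an industry-standard video-to-audio (V2A) dataset covering various real-world scenarios, and evaluates YingSound with automated metrics and human studies, showing that it generates high-quality synchronized sound across diverse conditional inputs.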

Link: https://arxiv.org/abs/2412.09168
Authors: Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian, Yihua Wang, Chaofan Ding, Lei Xie
Keywords: few-shot settings, Generating sound effects, Generating sound, product-level videos, requires the production
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 4 figures

Abstract: Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, an industry-standard video-to-audio (V2A) dataset that encompasses various real-world scenarios is presented. We show that YingSound effectively generates high-quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: this https URL
zh

[CV-79] Pinpoint Counterfactuals: Reducing social bias in foundation models via localized counterfactual generation

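[Quick read]: This paper addresses the problem that foundation models trained on web-scraped datasets propagate societal biases to downstream tasks. The key to the solution is a localized counterfactual generation method that uses automated masking and guided inpainting to confine counterfactual modifications to attribute-relevant regions, producing high-quality counterfactual samples while preserving image context. When used to create gender counterfactuals on the Conceptual Captions dataset, the method achieves higher visual and semantic fidelity than existing methods, and on non-human-centric tasks it maintains the performance of models trained only on real data. Models fine-tuned with these counterfactuals show measurable bias reduction across multiple bias metrics while preserving ImageNet zero-shot performance.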

Link: https://arxiv.org/abs/2412.09160
Authors: Kirill Sirotkin,Marcos Escudero-Viñolo,Pablo Carballeira,Mayug Maniparambil,Catarina Barata,Noel E. O’Connor
Keywords: propagate societal biases, Foundation models trained, web-scraped datasets propagate, datasets propagate societal, Foundation models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models trained on web-scraped datasets propagate societal biases to downstream tasks. While counterfactual generation enables bias analysis, existing methods introduce artifacts by modifying contextual elements like clothing and background. We present a localized counterfactual generation method that preserves image context by constraining counterfactual modifications to specific attribute-relevant regions through automated masking and guided inpainting. When applied to the Conceptual Captions dataset for creating gender counterfactuals, our method results in higher visual and semantic fidelity than state-of-the-art alternatives, while maintaining the performance of models trained using only real data on non-human-centric tasks. Models fine-tuned with our counterfactuals demonstrate measurable bias reduction across multiple metrics, including a decrease in gender classification disparity and balanced person preference scores, while preserving ImageNet zero-shot performance. The results establish a framework for creating balanced datasets that enable both accurate bias profiling and effective mitigation.
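
The localized edit described in this abstract can be pictured as mask-based compositing: pixels inside an attribute mask come from the inpainted image, and everything else is copied from the original. The sketch below illustrates only that idea, under the assumption that the final image is a per-pixel blend. The function name `composite_with_mask` and the toy 2x3 grayscale images are hypothetical, and the paper's actual masking and inpainting models are not reproduced here.

```python
from typing import List

Image = List[List[float]]  # toy grayscale image: rows of pixel intensities


def composite_with_mask(original: Image, inpainted: Image, mask: Image) -> Image:
    """Blend two images so that only masked pixels are modified.

    Mask values lie in [0, 1]: 1 takes the inpainted pixel, 0 keeps the original.
    """
    if not (len(original) == len(inpainted) == len(mask)):
        raise ValueError("all inputs must have the same number of rows")
    result: Image = []
    for row_o, row_i, row_m in zip(original, inpainted, mask):
        if not (len(row_o) == len(row_i) == len(row_m)):
            raise ValueError("all rows must have the same width")
        result.append([m * i + (1.0 - m) * o for o, i, m in zip(row_o, row_i, row_m)])
    return result


original = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
inpainted = [[0.9, 0.9, 0.9], [0.9, 0.9, 0.9]]
mask = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical attribute region: middle column
edited = composite_with_mask(original, inpainted, mask)
```

In this toy case only the middle column changes, which is the property the abstract emphasizes: clothing, background, and other context outside the mask are left untouched.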

[CV-80] Evaluating Adversarial Attacks on Traffic Sign Classifiers beyond Standard Baselines ICMLA

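[Quick read]: This paper addresses the fact that research on adversarial attacks against traffic sign classifiers has largely been limited to reusing baseline models such as LISA-CNN or GTSRB-CNN in similar experimental settings. The key to the solution is to decouple model architectures from datasets and to evaluate a broader set of generic models for a fair comparison. The paper also compares two attack settings, inconspicuous and visible, which are usually not compared directly. The results show that the standard baselines are significantly more susceptible to attack than the generic models, so the authors recommend evaluating new attacks on a broader spectrum of baselines in the future.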

Link: https://arxiv.org/abs/2412.09150
Authors: Svetlana Pavlitska,Leopold Müller,J. Marius Zöllner
Keywords: traffic sign classification, sign classification models, Adversarial attacks, real world, sign classification
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication at ICMLA 2024

Abstract:Adversarial attacks on traffic sign classification models were among the first successfully tried in the real world. Since then, the research in this area has been mainly restricted to repeating baseline models, such as LISA-CNN or GTSRB-CNN, and similar experiment settings, including white and black patches on traffic signs. In this work, we decouple model architectures from the datasets and evaluate on further generic models to make a fair comparison. Furthermore, we compare two attack settings, inconspicuous and visible, which are usually regarded without direct comparison. Our results show that standard baselines like LISA-CNN or GTSRB-CNN are significantly more susceptible than the generic ones. We, therefore, suggest evaluating new attacks on a broader spectrum of baselines in the future. Our code is available at this https URL.

[CV-81] LVMark: Robust Watermark for latent video diffusion models

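[Quick read]: This paper addresses watermarking for video generative models, in particular the poor performance of existing methods, which do not account for temporal information in video. The key to the solution is LVMark, a new watermarking method that embeds watermark messages into a video diffusion model through a selective weight modulation strategy while preserving the quality of the generated videos. The paper also designs a watermark decoder that uses a cross-attention module to exploit spatio-temporal information in the 3D wavelet domain, so that messages can be decoded accurately even under malicious attacks and ownership of the generative model remains protected.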

Link: https://arxiv.org/abs/2412.09122
Authors: MinHyuk Jang,Youngdong Jang,JaeHyeok Lee,Kodai Kawamura,Feng Yang,Sangpil Kim
Keywords: Rapid advancements, create hyper-realistic videos, create hyper-realistic, Rapid, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rapid advancements in generative models have made it possible to create hyper-realistic videos. As their applicability increases, their unauthorized use has raised significant concerns, leading to the growing demand for techniques to protect the ownership of the generative model itself. While existing watermarking methods effectively embed watermarks into image-generative models, they fail to account for temporal information, resulting in poor performance when applied to video-generative models. To address this issue, we introduce a novel watermarking method called LVMark, which embeds watermarks into video diffusion models. A key component of LVMark is a selective weight modulation strategy that efficiently embeds watermark messages into the video diffusion model while preserving the quality of the generated videos. To accurately decode messages in the presence of malicious attacks, we design a watermark decoder that leverages spatio-temporal information in the 3D wavelet domain through a cross-attention module. To the best of our knowledge, our approach is the first to highlight the potential of video-generative model watermarking as a valuable tool for enhancing the effectiveness of ownership protection in video-generative models.

[CV-82] ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion Estimation

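[Quick read]: This paper addresses two key challenges in event-based high-temporal-resolution (HTR) optical flow estimation: the absence of HTR ground-truth data and the intrinsic sparsity of event data. The key to the solution is a residual-based paradigm that splits HTR flow estimation into two stages, global linear motion estimation and HTR residual flow refinement. The residual paradigm mitigates the impact of event sparsity on optimization and is compatible with any low-temporal-resolution (LTR) algorithm. To cope with the missing HTR ground truth, the paper introduces new learning strategies: a shared refiner that estimates the residual flows, and regional noise that simulates the residual patterns of intermediate flows. The noise-based strategy also supports in-domain self-supervised training. Experiments show state-of-the-art accuracy on both LTR and HTR metrics.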

Link: https://arxiv.org/abs/2412.09105
Authors: Qianang Zhou,Zhiyu Zhu,Junhui Hou,Yongjian Deng,Youfu Li,Junlin Xiong
Keywords: cameras hold significant, hold significant promise, Event cameras hold, HTR, HTR optical flow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Abstract:Event cameras hold significant promise for high-temporal-resolution (HTR) motion estimation. However, estimating event-based HTR optical flow faces two key challenges: the absence of HTR ground-truth data and the intrinsic sparsity of event data. Most existing approaches rely on the flow accumulation paradigms to indirectly supervise intermediate flows, often resulting in accumulation errors and optimization difficulties. To address these challenges, we propose a residual-based paradigm for estimating HTR optical flow with event data. Our approach separates HTR flow estimation into two stages: global linear motion estimation and HTR residual flow refinement. The residual paradigm effectively mitigates the impacts of event sparsity on optimization and is compatible with any LTR algorithm. Next, to address the challenge posed by the absence of HTR ground truth, we incorporate novel learning strategies. Specifically, we initially employ a shared refiner to estimate the residual flows, enabling both LTR supervision and HTR inference. Subsequently, we introduce regional noise to simulate the residual patterns of intermediate flows, facilitating the adaptation from LTR supervision to HTR inference. Additionally, we show that the noise-based strategy supports in-domain self-supervised training. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art accuracy in both LTR and HTR metrics, highlighting its effectiveness and superiority.
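
One way to read the two-stage decomposition in this abstract is that each intermediate flow equals a linearly scaled share of the full-interval (LTR) flow plus a learned residual. The sketch below assumes exactly that form for a single pixel. The function name `htr_flows`, the uniform time steps, and the residual values are hypothetical stand-ins for what a refiner network might output, not the authors' implementation.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]  # (dx, dy) displacement of one pixel


def htr_flows(ltr_flow: Vec2, residuals: List[Vec2]) -> List[Vec2]:
    """Compose intermediate flows as linear motion plus a per-step residual.

    ltr_flow: displacement over the full interval (the low-temporal-resolution estimate).
    residuals: one residual per intermediate time step (hypothetical refiner output).
    """
    n = len(residuals)
    flows: List[Vec2] = []
    for k, (rx, ry) in enumerate(residuals, start=1):
        tau = k / n  # fraction of the interval elapsed at step k
        flows.append((tau * ltr_flow[0] + rx, tau * ltr_flow[1] + ry))
    return flows


flows = htr_flows((4.0, -2.0), [(0.5, 0.0), (0.25, 0.1), (0.0, 0.0), (0.0, 0.0)])
```

With zero residuals the intermediate flows fall on a straight line between no motion and the LTR flow, so the residual term carries all of the non-linear motion within the interval.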

[CV-83] Towards Long-Horizon Vision-Language Navigation: Platform Benchmark and Method

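[Quick read]: This paper addresses the limitations of existing Vision-Language Navigation (VLN) methods on multi-stage, long-horizon tasks in complex and dynamic environments. The key to the solution is a new task, Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. To support it, the paper develops NavGen, an automated data generation platform that builds datasets with complex task structures through a bidirectional, multi-granularity generation approach, and constructs the LHPR-VLN benchmark of 3,260 tasks averaging 150 steps each, the first dataset designed specifically for long-horizon VLN. The paper also proposes fine-grained metrics, namely Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weight by Ground Truth (CGT), and designs a Multi-Granularity Dynamic Memory (MGDM) module that combines short-term memory blurring with long-term memory retrieval to improve adaptability in dynamic environments. Together these contributions provide a data generation pipeline, an evaluation dataset, metrics, and a new VLN model as a foundation for advancing LH-VLN.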

Link: https://arxiv.org/abs/2412.09082
Authors: Xinshuai Song,Weixing Chen,Yang Liu,Weikai Chen,Guanbin Li,Liang Lin
Keywords: Existing Vision-Language Navigation, Long-Horizon Vision-Language Navigation, Existing Vision-Language, Vision-Language Navigation, methods primarily focus
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A novel Vision-Language Navigation task: Long-Horizon Vision-Language Navigation

Abstract:Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. Furthermore, to support LH-VLN, we develop an automated data generation platform NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. Furthermore, we propose Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weight by Ground Truth (CGT) metrics, to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.

[CV-84] DomCLP: Domain-wise Contrastive Learning with Prototype Mixup for Unsupervised Domain Generalization

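[Quick read]: This paper addresses the difficulty self-supervised learning (SSL) models have in producing effective representations for unseen-domain data. Existing SSL methods based on instance discrimination with InfoNCE are not effective at extracting domain-irrelevant common features: they amplify domain-relevant features and suppress domain-irrelevant ones, which hinders domain generalization. The proposed solution has two parts. Domain-wise Contrastive Learning (DCon) enhances domain-irrelevant common features, and Prototype Mixup Learning (PMix) generalizes those features across multiple domains without relying on strong assumptions. Experiments show that the method clearly outperforms state-of-the-art methods on the PACS and DomainNet datasets.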

Link: https://arxiv.org/abs/2412.09074
Authors: Jin-Seop Lee,Noo-ri Kim,Jee-Hyong Lee
Keywords: domain-irrelevant common features, achieved remarkable success, Self-supervised learning, common features, domain-irrelevant common
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code page: this https URL

Abstract:Self-supervised learning (SSL) methods based on the instance discrimination tasks with InfoNCE have achieved remarkable success. Despite their success, SSL models often struggle to generate effective representations for unseen-domain data. To address this issue, research on unsupervised domain generalization (UDG), which aims to develop SSL models that can generate domain-irrelevant features, has been conducted. Most UDG approaches utilize contrastive learning with InfoNCE to generate representations, and perform feature alignment based on strong assumptions to generalize domain-irrelevant common features from multi-source domains. However, existing methods that rely on instance discrimination tasks are not effective at extracting domain-irrelevant common features. This leads to the suppression of domain-irrelevant common features and the amplification of domain-relevant features, thereby hindering domain generalization. Furthermore, strong assumptions underlying feature alignment can lead to biased feature learning, reducing the diversity of common features. In this paper, we propose a novel approach, DomCLP, Domain-wise Contrastive Learning with Prototype Mixup. We explore how InfoNCE suppresses domain-irrelevant common features and amplifies domain-relevant features. Based on this analysis, we propose Domain-wise Contrastive Learning (DCon) to enhance domain-irrelevant common features. We also propose Prototype Mixup Learning (PMix) to generalize domain-irrelevant common features across multiple domains without relying on strong assumptions. The proposed method consistently outperforms state-of-the-art methods on the PACS and DomainNet datasets across various label fractions, showing significant improvements. Our code will be released. Our project page is available at this https URL.
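
For readers unfamiliar with InfoNCE, the sketch below computes the standard loss for one anchor, one positive, and a set of negatives, and then recomputes it with negatives restricted to the anchor's own domain. The restriction is only one plausible reading of "domain-wise" contrastive learning and is an assumption here, not a statement of how DCon is defined. The helper `same_domain_negatives`, the toy 2-D features, and the domain labels are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]], temperature: float = 0.1) -> float:
    """Standard InfoNCE loss for one anchor with one positive and a set of negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def same_domain_negatives(features: List[Sequence[float]], domains: List[str],
                          anchor_idx: int) -> List[Sequence[float]]:
    """Hypothetical helper: keep only negatives that share the anchor's domain."""
    return [f for i, (f, d) in enumerate(zip(features, domains))
            if i != anchor_idx and d == domains[anchor_idx]]


features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.8, 0.6]]
domains = ["photo", "photo", "sketch", "photo"]
positive_view = [0.9, 0.1]  # stand-in for an augmented view of features[0]

all_negatives = features[1:]
domain_negatives = same_domain_negatives(features, domains, anchor_idx=0)

loss_all = info_nce(features[0], positive_view, all_negatives)
loss_domain = info_nce(features[0], positive_view, domain_negatives)
```

When negatives come from every domain, the model can lower the loss by separating samples along domain-specific cues. Restricting negatives to a single domain removes that shortcut, which matches the abstract's concern that plain InfoNCE amplifies domain-relevant features.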

[CV-85] SVasP: Self-Versatility Adversarial Style Perturbation for Cross-Domain Few-Shot Learning

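[Quick read]: This paper addresses the gradient instability and local optimization problems that arise in style-based Cross-Domain Few-Shot Learning (CD-FSL) methods. The key to the solution is a novel crop-global style perturbation method called Self-Versatility Adversarial Style Perturbation (SVasP). SVasP simulates more diverse potential target-domain adversarial styles by diversifying input patterns and aggregating localized crop style gradients, which improves gradient stability and avoids poor sharp minima. The paper also proposes a new objective function that maximizes visual discrepancy while maintaining semantic consistency among global, crop, and adversarial features, yielding flatter minima in the loss landscape during training and improving transferability to target domains.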

Link: https://arxiv.org/abs/2412.09073
Authors: Wenqian Li,Pengfei Fang,Hui Xue
Keywords: Cross-Domain Few-Shot Learning, Few-Shot Learning, Cross-Domain Few-Shot, aims to transfer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from seen source domains to unseen target domains, which is crucial for evaluating the generalization and robustness of models. Recent studies focus on utilizing visual styles to bridge the domain gap between different domains. However, the serious dilemma of gradient instability and local optimization problem occurs in those style-based CD-FSL methods. This paper addresses these issues and proposes a novel crop-global style perturbation method, called Self-Versatility Adversarial Style Perturbation (SVasP), which enhances the gradient stability and escapes from poor sharp minima jointly. Specifically, SVasP simulates more diverse potential target domain adversarial styles via diversifying input patterns and aggregating localized crop style gradients, to serve as global style perturbation stabilizers within one image, a concept we refer to as self-versatility. Then a novel objective function is proposed to maximize visual discrepancy while maintaining semantic consistency between global, crop, and adversarial features. Having the stabilized global style perturbation in the training phase, one can obtain a flattened minima in the loss landscape, boosting the transferability of the model to the target domains. Extensive experiments on multiple benchmark datasets demonstrate that our method significantly outperforms existing state-of-the-art methods. Our codes are available at this https URL.

[CV-86] Cross-View Completion Models are Zero-shot Correspondence Estimators

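[Quick read]: This paper offers a new perspective on cross-view completion learning by drawing an analogy to self-supervised correspondence learning. The key finding is that the cross-attention map inside cross-view completion models captures correspondence more effectively than other correlations derived from encoder or decoder features. The paper verifies this by evaluating the cross-attention map on zero-shot matching, learning-based geometric matching, and multi-frame depth estimation.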

Link: https://arxiv.org/abs/2412.09072
Authors: Honggyu An,Jinhyeon Kim,Seonghoon Park,Jaewoo Jung,Jisang Han,Sunghwan Hong,Seungryong Kim
Keywords: self-supervised correspondence learning, cross-view completion learning, explore new perspectives, drawing an analogy, analogy to self-supervised
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:In this work, we explore new perspectives on cross-view completion learning by drawing an analogy to self-supervised correspondence learning. Through our analysis, we demonstrate that the cross-attention map within cross-view completion models captures correspondence more effectively than other correlations derived from encoder or decoder features. We verify the effectiveness of the cross-attention map by evaluating on both zero-shot matching and learning-based geometric matching and multi-frame depth estimation. Project page is available at this https URL.

[CV-87] An Efficient Framework for Enhancing Discriminative Models via Diffusion Techniques AAAI2025

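[Quick read]: This paper addresses the high computational overhead and inferior accuracy of image classification methods derived from generative models. The key to the solution is the Diffusion-Based Discriminative Model Enhancement Framework (DBMEF), which integrates discriminative and generative models in a training-free manner: the discriminative model makes the initial prediction, and a diffusion model gives the deep neural network a rethinking capability, improving classification accuracy and generalization. Experiments show consistent gains across multiple deep architectures and datasets, for example a 1.51% improvement for ResNet-50 on ImageNet and 3.02% on ImageNet-A.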

Link: https://arxiv.org/abs/2412.09063
Authors: Chunxiao Li,Xiaoxiao Wang,Boming Miao,Chuanlong Xie,Zizhe Wang,Yao Zhu
Keywords: discriminative models based, discriminative models, computer vision, traditionally achieved, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Abstract:Image classification serves as the cornerstone of computer vision, traditionally achieved through discriminative models based on deep neural networks. Recent advancements have introduced classification methods derived from generative models, which offer the advantage of zero-shot classification. However, these methods suffer from two main drawbacks: high computational overhead and inferior performance compared to discriminative models. Inspired by the coordinated cognitive processes of rapid-slow pathway interactions in the human brain during visual signal recognition, we propose the Diffusion-Based Discriminative Model Enhancement Framework (DBMEF). This framework seamlessly integrates discriminative and generative models in a training-free manner, leveraging discriminative models for initial predictions and endowing deep neural networks with rethinking capabilities via diffusion models. Consequently, DBMEF can effectively enhance the classification accuracy and generalization capability of discriminative models in a plug-and-play manner. We have conducted extensive experiments across 17 prevalent deep model architectures with different training methods, including both CNN-based models such as ResNet and Transformer-based models like ViT, to demonstrate the effectiveness of the proposed DBMEF. Specifically, the framework yields a 1.51% performance improvement for ResNet-50 on the ImageNet dataset and 3.02% on the ImageNet-A dataset. In conclusion, our research introduces a novel paradigm for image classification, demonstrating stable improvements across different datasets and neural networks.

[CV-88] Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D Images AAAI25

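[Quick read]: This paper addresses the reliance of single-view point cloud reconstruction on expensive CAD models and complex geometric priors. The key to the solution is to introduce hyperbolic space, which lets the model represent and understand the complex hierarchical structures in point clouds with low distortion. The paper proposes a hyperbolic Chamfer distance and a regularized triplet loss to strengthen the relationship between partial and complete point clouds, and designs adaptive boundary conditions to improve the model's understanding and reconstruction of 3D structures. Experiments show that the method significantly improves feature extraction and performs well on 3D reconstruction tasks.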

Link: https://arxiv.org/abs/2412.09055
Authors: Wenrui Li,Zhe Yang,Wei Han,Hengyu Man,Xingtao Wang,Xiaopeng Fan
Keywords: Reconstructing desired objects, Reconstructing desired, computer vision, desired objects, objects and scenes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI25

Abstract:Reconstructing desired objects and scenes has long been a primary goal in 3D computer vision. Single-view point cloud reconstruction has become a popular technique due to its low cost and accurate results. However, single-view reconstruction methods often rely on expensive CAD models and complex geometric priors. Effectively utilizing prior knowledge about the data remains a challenge. In this paper, we introduce hyperbolic space to 3D point cloud reconstruction, enabling the model to represent and understand complex hierarchical structures in point clouds with low distortion. We build upon previous methods by proposing a hyperbolic Chamfer distance and a regularized triplet loss to enhance the relationship between partial and complete point clouds. Additionally, we design adaptive boundary conditions to improve the model’s understanding and reconstruction of 3D structures. Our model outperforms most existing models, and ablation studies demonstrate the significance of our model and its components. Experimental results show that our method significantly improves feature extraction capabilities. Our model achieves outstanding performance in 3D reconstruction tasks.
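
To make the idea of a hyperbolic Chamfer distance concrete, the sketch below pairs the standard Poincaré-ball geodesic distance with a symmetric Chamfer distance (mean nearest-neighbour distance in both directions). Combining the two in this way is an assumption made for illustration. The abstract does not give the authors' exact formulation, and Chamfer variants differ in whether they use sums, means, or squared distances.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]


def poincare_distance(u: Point, v: Point) -> float:
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def chamfer(set_a: List[Point], set_b: List[Point],
            dist: Callable[[Point, Point], float]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    if not set_a or not set_b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(dist(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(dist(b, a) for a in set_a) for b in set_b) / len(set_b)
    return a_to_b + b_to_a


# Toy partial and complete point clouds, already mapped into the unit ball.
partial = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
complete = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]
loss = chamfer(partial, complete, poincare_distance)
```

Because Poincaré distances grow rapidly near the boundary of the ball, points placed far from the origin are penalized more strongly for the same Euclidean offset, which is one reason hyperbolic metrics are used for hierarchical structure.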

[CV-89] ContextHOI: Spatial Context Learning for Human-Object Interaction Detection AAAI-25 AAAI

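[Quick read]: This paper addresses the insufficient use of spatial context in Human-Object Interaction (HOI) recognition, which matters most when foreground instances are blurred or occluded. The key to the solution is ContextHOI, a dual-branch framework that captures both object detection features and spatial context. Its context branch extracts informative spatial context without additional hand-crafted background labels, and context-aware spatial and semantic supervision filters out irrelevant noise while retaining the informative context. The method achieves state-of-the-art performance on the HICO-DET and v-coco benchmarks, and a newly constructed HICO-ambiguous benchmark further validates it on interactions involving occluded or blurred instances.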

Link: https://arxiv.org/abs/2412.09050
Authors: Mingda Jia,Liming Zhao,Ge Li,Yun Zheng
Keywords: considered critical, critical in Human-Object, instance-centric foreground, HOI, Spatial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: in proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract:Spatial contexts, such as the backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent advancements in HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-craft background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and v-coco benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.

[CV-90] Motif Guided Graph Transformer with Combinatorial Skeleton Prototype Learning for Skeleton-Based Person Re-Identification AAAI2025

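[Quick read]: This paper addresses person re-identification (re-ID) from 3D skeleton data, in particular the shortcomings of existing methods in exploiting key body structure and motion cues such as gait and in mining spatial-temporal sub-patterns. The key to the solution is a generic framework, Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos). Its motif guided graph transformer (MGT) incorporates hierarchical structural motifs and gait collaborative motifs, attending simultaneously to multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Its combinatorial skeleton prototype learning (CSP) uses random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Experiments show that MoCos outperforms existing state-of-the-art models and generalizes across different scenarios.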

Link: https://arxiv.org/abs/2412.09044
Authors: Haocong Rao,Chunyan Miao
Keywords: challenging task, task with significant, skeleton prototype learning, skeleton, Person re-identification
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025. Codes are available at this https URL

Abstract:Person re-identification (re-ID) via 3D skeleton data is a challenging task with significant value in many scenarios. Existing skeleton-based methods typically assume virtual motion relations between all joints, and adopt average joint or sequence representations for learning. However, they rarely explore key body structure and motion such as gait to focus on more important body joints or limbs, while lacking the ability to fully mine valuable spatial-temporal sub-patterns of skeletons to enhance model learning. This paper presents a generic Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos) that exploits structure-specific and gait-related body relations as well as combinatorial features of skeleton graphs to learn effective skeleton representations for person re-ID. In particular, motivated by the locality within joints’ structure and the body-component collaboration in gait, we first propose the motif guided graph transformer (MGT) that incorporates hierarchical structural motifs and gait collaborative motifs, which simultaneously focuses on multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Then, we devise the combinatorial skeleton prototype learning (CSP) that leverages random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Extensive experiments validate the superior performance of MoCos over existing state-of-the-art models. We further show its generality under RGB-estimated skeletons, different graph modeling, and unsupervised scenarios.

[CV-91] DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

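[Quick read]: This paper addresses 4D reconstruction of real street scenes for autonomous driving, where existing methods typically run offline and rely on time-consuming iterative processes that limit practical use. The key to the solution is the Large 4D Gaussian Reconstruction Model (DrivingRecon), which predicts 4D Gaussians directly from surround-view videos for efficient reconstruction. To fuse the multi-view images, the paper proposes the Prune and Dilate Block (PD-Block), which eliminates overlapping Gaussian points between adjacent views and removes redundant background points. A dynamic and static decoupling strategy strengthens cross-temporal information so that geometry and motion features are learned better. Experiments show that DrivingRecon significantly improves scene reconstruction quality and novel view synthesis over existing methods, and the paper explores its use in model pre-training, vehicle adaptation, and scene editing.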

Link: https://arxiv.org/abs/2412.09043
Authors: Hao Lu,Tianshuo Xu,Wenzhao Zheng,Yunpeng Zhang,Wei Zhan,Dalong Du,Masayoshi Tomizuka,Kurt Keutzer,Yingcong Chen
Keywords: developing real-world simulators, essential for developing, developing real-world, real-world simulators, simulators in autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Photorealistic 4D reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. However, most existing methods perform this task offline and rely on time-consuming iterative processes, limiting their practical applications. To this end, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon), a generalizable driving scene reconstruction model, which directly predicts 4D Gaussian from surround view videos. To better integrate the surround-view images, the Prune and Dilate Block (PD-Block) is proposed to eliminate overlapping Gaussian points between adjacent views and remove redundant background points. To enhance cross-temporal information, dynamic and static decoupling is tailored to better learn geometry and motion features. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality and novel view synthesis compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle adaptation, and scene editing. Our code is available at this https URL.

[CV-92] Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model AAAI2025

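[Quick read]: This paper addresses two weaknesses of existing video anomaly detection methods: they neglect the varied forms anomalies take, and they fail to capture small anomalous objects. The key to the solution is a patch-based diffusion model designed to capture fine-grained local information, together with motion and appearance conditions that account for both appearance and motion anomalies in video. With these conditions integrated into the diffusion model, the method generates predictions whose semantic content and motion relations are coherent and contextually appropriate, which improves the detection of abnormal behaviors.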

Link: https://arxiv.org/abs/2412.09026
Authors: Hang Zhou,Jiale Cai,Yuteng Ye,Yonghui Feng,Chenxing Gao,Junqing Yu,Zikai Song,Wei Yang
Keywords: normal patterns exclusively, recover normal patterns, reporting abnormal patterns, patterns exclusively, generation problem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Abstract:A recent endeavor in one class of video anomaly detection is to leverage diffusion models and posit the task as a generation problem, where the diffusion model is trained to recover normal patterns exclusively, thus reporting abnormal patterns as outliers. Yet, existing attempts neglect the various formations of anomaly and predict normal samples at the feature level regardless that abnormal objects in surveillance videos are often relatively small. To address this, a novel patch-based diffusion model is proposed, specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest themselves as deviations in both appearance and motion. Therefore, we argue that a comprehensive solution must consider both of these aspects simultaneously to achieve accurate frame prediction. To address this, we introduce innovative motion and appearance conditions that are seamlessly integrated into our patch diffusion model. These conditions are designed to guide the model in generating coherent and contextually appropriate predictions for both semantic content and motion relations. Experimental results in four challenging video anomaly detection datasets empirically substantiate the efficacy of our proposed approach, demonstrating that it consistently outperforms most existing methods in detecting abnormal behaviors.

[CV-93] STEAM: Squeeze and Transform Enhanced Attention Module

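[Quick read]: This paper addresses the increased parameter and computation costs that existing channel and spatial attention mechanisms incur when enhancing the representation ability of deep convolutional neural networks (CNNs). The key to the solution is a constant-parameter module, STEAM (Squeeze and Transform Enhanced Attention Module), which applies the principles of relational modeling in graphs to model channel and spatial attention together with minimal parameters and computation. According to the authors, STEAM is the first graph-based approach to model both channel and spatial attention, drawing on concepts from multi-head graph transformers. The paper also introduces Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. On standard benchmarks for large-scale image classification, object detection, and instance segmentation, STEAM improves accuracy by 2% over the standard ResNet-50 with only a small increase in GFLOPs, and it outperforms the leading ECA and GCT modules in accuracy while achieving a three-fold reduction in GFLOPs.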

Link: https://arxiv.org/abs/2412.09023
Authors: Rishabh Sabharwal,Ram Samarth B B,Parikshit Singh Rathore,Punit Rathore
Keywords: convolutional neural networks, deep convolutional neural, attention mechanisms introduced, spatial attention, earlier works enhance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.

[CV-94] Arbitrary-steps Image Super-resolution via Diffusion Inversion

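[Quick read]: This paper addresses image super-resolution (SR) with a new technique based on diffusion inversion. The key to the solution is a Partial noise Prediction strategy that constructs an intermediate state of the diffusion model as the starting point for sampling, using a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, the predictor initializes the sampling process partway along the diffusion trajectory to generate the high-resolution result. Compared with existing approaches, the technique offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, from one to five, and even with a single step it performs better than or comparably to recent state-of-the-art methods.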

链接: https://arxiv.org/abs/2412.09013
作者: Zongsheng Yue,Kang Liao,Chen Change Loy
关键词-EN: rich image priors, image priors encapsulated, large pre-trained diffusion, Partial noise Prediction, pre-trained diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures. Project: this https URL

Abstract:This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result. Compared to existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates superior or comparable performance to recent state-of-the-art approaches. The code and model are publicly available at this https URL.
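
The starting point described above can be pictured with the standard forward-diffusion formula x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps. The sketch below is only an illustration under stated assumptions: the linear beta schedule, the use of the (upsampled) low-resolution pixels as a stand-in for x_0, and `toy_noise_predictor` (a hypothetical placeholder for the trained deep noise predictor) are not taken from the paper.

```python
import math

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s) over the first t steps of an
    assumed linear beta schedule (alpha_bar(0) is 1.0)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def toy_noise_predictor(lr_pixels):
    """Hypothetical stand-in for the trained noise predictor: it only needs
    to return one noise value per pixel for this sketch."""
    return [0.1 * (0.5 - x) for x in lr_pixels]

def intermediate_state(lr_pixels, noise_map, t):
    """Build the intermediate state x_t that sampling would start from."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
            for x, e in zip(lr_pixels, noise_map)]

lr = [0.2, 0.5, 0.8]                      # toy "image" with three pixels
start = intermediate_state(lr, toy_noise_predictor(lr), t=250)
```

Choosing a smaller `t` keeps the start closer to the input image and leaves fewer reverse steps to run, which is one way to read the "one to five steps" flexibility mentioned in the abstract.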

[CV-95] MS2Mesh-XR: Multi-modal Sketch-to-Mesh Generation in XR Environments

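[Quick read]: This paper addresses the rapid creation of high-quality 3D objects in extended reality (XR) environments from hand-drawn sketches and voice input. The key to the solution is MS2Mesh-XR, a multi-modal sketch-to-mesh generation pipeline that combines hand-drawn sketches with voice input, uses ControlNet to infer realistic images, and reconstructs the selected image into a detailed 3D mesh with the Convolutional Reconstruction Model. The pipeline generates a high-quality 3D mesh in less than 20 seconds and supports immersive visualization and manipulation in run-time XR scenes, which can significantly facilitate creative production and improve user experience in XR.

Link: https://arxiv.org/abs/2412.09008
Authors: Yuqi Tong,Yue Qiu,Ruiyang Li,Shi Qiu,Pheng-Ann Heng
Keywords: hand-drawn sketches assisted, extended reality, Convolutional Reconstruction Model, create realistic, hand-drawn sketches
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: IEEE AIxVR 2025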

Abstract:We present MS2Mesh-XR, a novel multi-modal sketch-to-mesh generation pipeline that enables users to create realistic 3D objects in extended reality (XR) environments using hand-drawn sketches assisted by voice inputs. Specifically, users can intuitively sketch objects using natural hand movements in mid-air within a virtual environment. By integrating voice inputs, we use ControlNet to infer realistic images based on the drawn sketches and interpreted text prompts. Users can then review and select their preferred image, which is subsequently reconstructed into a detailed 3D mesh using the Convolutional Reconstruction Model. In particular, our proposed pipeline can generate a high-quality 3D mesh in less than 20 seconds, allowing for immersive visualization and manipulation in run-time XR scenes. We demonstrate the practicability of our pipeline through two use cases in XR settings. By leveraging natural user inputs and cutting-edge generative AI capabilities, our approach can significantly facilitate XR-based creative production and enhance user experiences. Our code and demo will be available at: this https URL

[CV-96] A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter AAAI2025

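[Quick read]: This paper addresses two main challenges in fine-tuning multimodal models: existing methods are designed mainly for vision-language tasks and are hard to extend beyond two modalities, and they are limited both in how well they exploit interactions between modalities and in efficiency. The key to the solution is the loW-rank sequence multimodal adapter (Wander). Wander uses the outer product to fuse information from different modalities element-wise and applies CP decomposition to factorize tensors into rank-one components, which substantially reduces the parameter count. It also implements a token-level low-rank decomposition to extract finer-grained features and sequence relationships between modalities. With these designs, Wander enables token-level interactions between sequences of different modalities in a parameter-efficient way and consistently outperforms existing efficient transfer learning methods on multimodal datasets.

Link: https://arxiv.org/abs/2412.08979
Authors: Zirun Guo,Xize Cheng,Yangyang Wu,Tao Jin
Keywords: shown great success, shown great, great success, success in unimodal, Efficient transfer learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2025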

Abstract:Efficient transfer learning methods such as adapter-based methods have shown great success in unimodal models and vision-language models. However, existing methods have two main challenges in fine-tuning multimodal models. Firstly, they are designed for vision-language tasks and fail to extend to situations where there are more than two modalities. Secondly, they exhibit limited exploitation of interactions between modalities and lack efficiency. To address these issues, in this paper, we propose the loW-rank sequence multimodal adapter (Wander). We first use the outer product to fuse the information from different modalities in an element-wise way effectively. For efficiency, we use CP decomposition to factorize tensors into rank-one components and achieve substantial parameter reduction. Furthermore, we implement a token-level low-rank decomposition to extract more fine-grained features and sequence relationships between modalities. With these designs, Wander enables token-level interactions between sequences of different modalities in a parameter-efficient way. We conduct extensive experiments on datasets with different numbers of modalities, where Wander outperforms state-of-the-art efficient transfer learning methods consistently. The results fully demonstrate the effectiveness, efficiency and universality of Wander.
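
To see why a CP factorization saves parameters for outer-product fusion, the sketch below compares two ways of computing the same scalar: contracting a full weight tensor with the outer product of the modality features, and summing, over rank-one components, the product of per-modality dot products. This is a generic low-rank fusion illustration, not Wander itself; the function names, the scalar output and the toy sizes are assumptions, and the token-level decomposition from the abstract is not reproduced.

```python
import itertools
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fuse_low_rank(features, factors):
    """features[m]: vector of modality m; factors[r][m]: CP factor of rank r
    for modality m. Returns sum_r prod_m <factors[r][m], features[m]>."""
    return sum(math.prod(dot(w, x) for w, x in zip(rank_factors, features))
               for rank_factors in factors)

def fuse_full_tensor(features, factors):
    """Reference path: rebuild every entry of the weight tensor from the CP
    factors and contract it with the explicit outer product of the features."""
    dims = [len(x) for x in features]
    total = 0.0
    for idx in itertools.product(*[range(d) for d in dims]):
        weight = sum(math.prod(rf[m][i] for m, i in enumerate(idx)) for rf in factors)
        outer = math.prod(features[m][i] for m, i in enumerate(idx))
        total += weight * outer
    return total

def full_param_count(dims):
    return math.prod(dims)          # one weight per outer-product entry

def cp_param_count(dims, rank):
    return rank * sum(dims)         # one vector per modality per rank

rng = random.Random(0)
dims, rank = (4, 5, 6), 2           # three toy modalities
features = [[rng.uniform(-1, 1) for _ in range(d)] for d in dims]
factors = [[[rng.uniform(-1, 1) for _ in range(d)] for d in dims] for _ in range(rank)]
```

With these toy sizes the full tensor needs 4 * 5 * 6 weights while the rank-2 factorization needs 2 * (4 + 5 + 6), and the gap widens quickly as modalities are added.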

[CV-97] Enhancing Facial Consistency in Conditional Video Generation via Facial Landmark Transformation

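[Quick read]: This paper addresses the inconsistency between generated facial features and the reference image in conditional video generation of character animations, especially for complex motions such as dancing. The key to the solution is a facial landmark transformation method based on the 3D Morphable Model (3DMM). It reconstructs 3D faces from the facial landmarks of the source video and adjusts the 3DMM parameters to match the target facial features in the reference image, yielding transformed landmarks aligned with the target face. This effectively reduces the facial feature mismatch between the generated videos and the reference images.

Link: https://arxiv.org/abs/2412.08976
Authors: Lianrui Mu,Xingze Zhou,Wenjie Zheng,Jiangnan Ye,Xiaoyu Liang,Yuchen Yang,Jianhong Bai,Jiedong Zhuang,Haoji Hu
Keywords: Landmark-guided character animation, Landmark-guided character, character animation generation, target facial features, Generating character animations
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: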

Abstract:Landmark-guided character animation generation is an important field. Generating character animations with facial features consistent with a reference image remains a significant challenge in conditional video generation, especially involving complex motions like dancing. Existing methods often fail to maintain facial feature consistency due to mismatches between the facial landmarks extracted from source videos and the target facial features in the reference image. To address this problem, we propose a facial landmark transformation method based on the 3D Morphable Model (3DMM). We obtain transformed landmarks that align with the target facial features by reconstructing 3D faces from the source landmarks and adjusting the 3DMM parameters to match the reference image. Our method improves the facial consistency between the generated videos and the reference images, effectively mitigating the facial feature mismatch problem.
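
The landmark transformation can be pictured with a linear 3DMM, where landmarks are a mean shape plus identity and expression offsets. The sketch below is a minimal illustration, assuming the identity coefficients come from the reference image and the expression coefficients from the source frame. The toy bases, the function names, and the omission of pose and of the coefficient-fitting step are all assumptions rather than the authors' implementation.

```python
def landmarks_from_3dmm(mean, id_basis, exp_basis, id_coef, exp_coef):
    """Linear 3DMM: mean + sum_k id_coef[k] * id_basis[k]
                         + sum_k exp_coef[k] * exp_basis[k]."""
    out = list(mean)
    for basis, coefs in ((id_basis, id_coef), (exp_basis, exp_coef)):
        for vec, c in zip(basis, coefs):
            out = [o + c * v for o, v in zip(out, vec)]
    return out

def transform_landmarks(mean, id_basis, exp_basis, src_exp_coef, ref_id_coef):
    """Keep the source frame's expression, swap in the reference identity."""
    return landmarks_from_3dmm(mean, id_basis, exp_basis, ref_id_coef, src_exp_coef)

# Toy model: 4 flattened landmark coordinates, 2 identity and 1 expression basis vectors.
mean = [0.0, 0.0, 0.0, 0.0]
id_basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
exp_basis = [[0.0, 0.0, 1.0, 0.0]]
transformed = transform_landmarks(mean, id_basis, exp_basis,
                                  src_exp_coef=[0.5], ref_id_coef=[2.0, 3.0])
```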

[CV-98] Elevating Flow-Guided Video Inpainting with Reference Generation AAAI2025

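[Quick read]: This paper addresses the twin challenges of video inpainting (VI): propagating observable content across frames and generating content that is absent from the original video. The key to the solution is combining a large generative model for reference generation with an advanced pixel propagation algorithm. Powered by a strong generative model, the method not only significantly improves frame-level quality for object removal but also synthesizes new content in the missing regions from user-provided text prompts. For propagation, the paper introduces a one-shot pixel pulling method that avoids the error accumulation caused by repeated sampling while maintaining sub-pixel precision. On public benchmarks and the proposed high-quality VI benchmark (HQVI), the method achieves significantly higher visual quality and metric scores, and it handles high-resolution videos exceeding 2K with ease, underscoring its suitability for real-world applications.

Link: https://arxiv.org/abs/2412.08975
Authors: Suhwan Cho,Seoung Wug Oh,Sangyoun Lee,Joon-Young Lee
Keywords: requires effective propagation, challenging task, task that requires, requires effective, frames while simultaneously
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025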

Abstract:Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video. In this study, we propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Powered by a strong generative model, our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts. For pixel propagation, we introduce a one-shot pixel pulling method that effectively avoids error accumulation from repeated sampling while maintaining sub-pixel precision. To evaluate various VI methods in realistic scenarios, we also propose a high-quality VI benchmark, HQVI, comprising carefully generated videos using alpha matte composition. On public benchmarks and the HQVI dataset, our method demonstrates significantly higher visual quality and metric scores compared to existing solutions. Furthermore, it can process high-resolution videos exceeding 2K resolution with ease, underscoring its superiority for real-world applications.

[CV-99] Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

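[Quick read]: This paper addresses the fact that existing cross-modal contrastive distillation methods for 3D representation learning neglect modality-specific features, which leads to suboptimal representations. The key to the solution is a new framework, CMCR, which improves on traditional methods by better integrating modality-shared and modality-specific features. Specifically, the paper introduces masked image modeling and occupancy estimation tasks to guide the network toward more comprehensive modality-specific features, and proposes a multi-modal unified codebook that learns an embedding space shared across modalities. It further introduces geometry-enhanced masked image modeling to boost 3D representation learning. Experiments show that the method consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks.

Link: https://arxiv.org/abs/2412.08973
Authors: Yifan Zhang,Junhui Hou
Keywords: Cross-modal contrastive distillation, Cross-modal contrastive, recently been explored, Cross-modal, learning effective
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review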

Abstract:Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, namely CMCR, to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. Besides, we introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at this https URL.

[CV-100] AFFAKT: A Hierarchical Optimal Transport based Method for Affective Facial Knowledge Transfer in Video Deception Detection AAAI2025

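[Quick read]: This paper addresses the scarcity of high-quality, large-scale labeled data, which is a major obstacle to applying deep learning models to video deception detection. The key to the solution is a new method, AFFAKT, which improves classification performance by transferring useful and correlated knowledge from a large facial expression dataset. Specifically, the H-OTKT module quantifies the optimal relation mapping between facial expression classes and deception samples and transfers knowledge from the facial expression dataset to the deception samples. In addition, the SRKB module maintains a correlation prototype that retains the invariant correlations between facial expression classes and deception classes through momentum updating. During inference, the transferred knowledge is fine-tuned with the correlation prototype using a sample-specific re-weighting strategy. Experiments on two deception detection datasets show superior performance, and the interpretability study reveals a high association between deception and negative affect, consistent with psychological theory.

Link: https://arxiv.org/abs/2412.08965
Authors: Zihan Ji,Xuetao Tian,Ye Liu
Keywords: high-quality large-scale labeled, employing deep learning, deep learning models, large-scale labeled datasets, labeled datasets poses
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025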

Abstract:The scarcity of high-quality large-scale labeled datasets poses a huge challenge for employing deep learning models in video deception detection. To address this issue, inspired by the psychological theory on the relation between deception and expressions, we propose a novel method called AFFAKT in this paper, which enhances the classification performance by transferring useful and correlated knowledge from a large facial expression dataset. Two key challenges in knowledge transfer arise: 1) how much knowledge of facial expression data should be transferred and 2) how to effectively leverage transferred knowledge for the deception classification model during inference. Specifically, the optimal relation mapping between facial expression classes and deception samples is first quantified using the proposed H-OTKT module, which then transfers knowledge from the facial expression dataset to deception samples. Moreover, a correlation prototype within another proposed module, SRKB, is designed to retain the invariant correlations between facial expression classes and deception classes through momentum updating. During inference, the transferred knowledge is fine-tuned with the correlation prototype using a sample-specific re-weighting strategy. Experimental results on two deception detection datasets demonstrate the superior performance of our proposed method. The interpretability study reveals high associations between deception and negative affections, which coincides with the theory in psychology.
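
For readers unfamiliar with optimal transport, the sketch below shows how a transport plan between a few "expression classes" (rows) and "deception samples" (columns) can be computed with plain entropic-regularized Sinkhorn iterations. It is a single-level illustration only, not the hierarchical H-OTKT module; the cost matrix, the uniform masses and the regularization strength are made-up values.

```python
import math

def sinkhorn(cost, row_mass, col_mass, reg=0.5, iters=200):
    """Entropic optimal transport: returns a plan whose row sums approach
    row_mass and whose column sums match col_mass."""
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    n, m = len(row_mass), len(col_mass)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target masses.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical costs: 3 expression classes (rows) vs 2 deception samples (columns).
cost = [[0.1, 0.9],
        [0.8, 0.2],
        [0.5, 0.5]]
plan = sinkhorn(cost, row_mass=[1 / 3] * 3, col_mass=[0.5, 0.5])
```

Each entry of `plan` can be read as how much mass of an expression class is assigned to a sample; cheaper pairs receive more mass.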

[CV-101] Multimodal Industrial Anomaly Detection by Crossmodal Reverse Distillation

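[Quick read]: This paper addresses a weakness of existing knowledge distillation (KD) based methods for unsupervised multimodal Industrial Image Anomaly Detection (AD): when multimodal features are fused, anomalies that appear in only one modality may not be captured. The key to the solution is Crossmodal Reverse Distillation (CRD), whose multi-branch design assigns an independent branch to each modality and so enables finer detection of anomalies within each modality. A Crossmodal Filter and Amplifier further strengthen the interaction between modalities, so that the student network learns normal features better while anomalies in all modalities are still detected effectively. Experiments show that the method achieves state-of-the-art performance on the MVTec 3D-AD dataset.

Link: https://arxiv.org/abs/2412.08949
Authors: Xinyue Liu,Jianyuan Wang,Biao Leng,Shuo Zhang
Keywords: unsupervised Industrial Image, Industrial Image Anomaly, Image Anomaly Detection, Industrial Image, Knowledge distillation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: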

Abstract:Knowledge distillation (KD) has been widely studied in unsupervised Industrial Image Anomaly Detection (AD), but its application to unsupervised multimodal AD remains underexplored. Existing KD-based methods for multimodal AD that use fused multimodal features to obtain teacher representations face challenges. Anomalies in one modality may not be effectively captured in the fused teacher features, leading to detection failures. Besides, these methods do not fully leverage the rich intra- and inter-modality information. In this paper, we propose Crossmodal Reverse Distillation (CRD) based on Multi-branch design to realize Multimodal Industrial AD. By assigning independent branches to each modality, our method enables finer detection of anomalies within each modality. Furthermore, we enhance the interaction between modalities during the distillation process by designing Crossmodal Filter and Amplifier. With the idea of crossmodal mapping, the student network is allowed to better learn normal features while anomalies in all modalities are ensured to be effectively detected. Experimental verifications on the MVTec 3D-AD dataset demonstrate that our method achieves state-of-the-art performance in multimodal anomaly detection and localization.
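
Distillation-based detectors usually score a location by how far the student's feature drifts from the teacher's. The sketch below illustrates why per-modality branches help: each modality is scored separately with a cosine distance, and a location is flagged if any one modality disagrees. Combining branches with `max` and the toy feature values are assumptions made for illustration; the Crossmodal Filter and Amplifier from the paper are not modeled.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-12)

def anomaly_map(teacher_feats, student_feats):
    """Both arguments map a modality name to a list of per-location feature
    vectors. Returns one score per location: the largest teacher-student
    distance over all modality branches."""
    num_locations = len(next(iter(teacher_feats.values())))
    return [max(cosine_distance(teacher_feats[m][loc], student_feats[m][loc])
                for m in teacher_feats)
            for loc in range(num_locations)]

# Toy example: the depth branch disagrees at location 1, the RGB branch never does.
teacher = {"rgb": [[1.0, 0.0], [0.0, 1.0]], "depth": [[1.0, 1.0], [1.0, 0.0]]}
student = {"rgb": [[1.0, 0.0], [0.0, 1.0]], "depth": [[1.0, 1.0], [0.0, 1.0]]}
scores = anomaly_map(teacher, student)
```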

[CV-102] Selective Visual Prompting in Vision Mamba AAAI-25

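[Quick read]: This paper addresses the shortcomings of existing visual prompting methods when applied to pre-trained Vision Mamba (Vim) models: they fail to effectively activate the input and forget gates in Mamba blocks, which makes discriminative information hard to extract and propagate. The key to the solution is a new Selective Visual Prompting (SVP) method that uses lightweight selective prompters to generate token-wise prompts, ensuring adaptive activation of the update and forget gates within Mamba blocks and thereby promoting the propagation of discriminative information. SVP also adopts a dual-path structure, Cross-Prompting and Inner-Prompting, which use shared and distinct parameters respectively to promote the propagation of cross-layer and inner-layer information, and it significantly outperforms existing methods on various large-scale benchmarks.

Link: https://arxiv.org/abs/2412.08947
Authors: Yifeng Yao,Zichen Liu,Zhenyu Cui,Yuxin Peng,Jiahuan Zhou
Keywords: Pre-trained Vision Mamba, demonstrated exceptional performance, computationally efficient manner, Pre-trained Vision, computer vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: in Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25)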

Abstract:Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks in a computationally efficient manner, attributed to their unique design of selective state space models. To further extend their applicability to diverse downstream vision tasks, Vim models can be adapted using the efficient fine-tuning technique known as visual prompting. However, existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models that leverage global attention, neglecting the distinctive sequential token-wise compression and propagation characteristics of Vim. Specifically, existing prompt tokens prefixed to the sequence are insufficient to effectively activate the input and forget gates across the entire sequence, hindering the extraction and propagation of discriminative information. To address this limitation, we introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim. To prevent the loss of discriminative information during state space propagation, SVP employs lightweight selective prompters for token-wise prompt generation, ensuring adaptive activation of the update and forget gates within Mamba blocks to promote discriminative information propagation. Moreover, considering that Vim propagates both shared cross-layer information and specific inner-layer information, we further refine SVP with a dual-path structure: Cross-Prompting and Inner-Prompting. Cross-Prompting utilizes shared parameters across layers, while Inner-Prompting employs distinct parameters, promoting the propagation of both shared and specific information, respectively. Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods. Our code is available at this https URL.

[CV-103] Optimized Gradient Clipping for Noisy Label Learning AAAI2025

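[Quick read]: This paper addresses the fact that existing methods for learning with noisy labels clip gradients with a fixed threshold and so overlook how the gradient distribution changes during training. The key to the solution is Optimized Gradient Clipping (OGC), which dynamically adjusts the clipping threshold based on the ratio of noise gradients to clean gradients after clipping, estimated by modeling the distributions of clean and noisy samples. The threshold can therefore be adapted at every training step, which effectively controls the influence of noise gradients and improves robustness to label noise.

Link: https://arxiv.org/abs/2412.08941
Authors: Xichen Ye,Yifan Wu,Weizhong Zhang,Xiaoqiang Li,Yifan Chen,Cheng Jin
Keywords: Previous research, Optimized Gradient Clipping, research has shown, shown that constraining, loss function
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025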

Abstract:Previous research has shown that constraining the gradient of loss function with respect to model-predicted probabilities can enhance the model robustness against noisy labels. These methods typically specify a fixed optimal threshold for gradient clipping through validation data to obtain the desired robustness against noise. However, this common practice overlooks the dynamic distribution of gradients from both clean and noisy-labeled samples at different stages of training, significantly limiting the model capability to adapt to the variable nature of gradients throughout the training process. To address this issue, we propose a simple yet effective approach called Optimized Gradient Clipping (OGC), which dynamically adjusts the clipping threshold based on the ratio of noise gradients to clean gradients after clipping, estimated by modeling the distributions of clean and noisy samples. This approach allows us to modify the clipping threshold at each training step, effectively controlling the influence of noise gradients. Additionally, we provide statistical analysis to certify the noise-tolerance ability of OGC. Our extensive experiments across various types of label noise, including symmetric, asymmetric, instance-dependent, and real-world noise, demonstrate the effectiveness of our approach. The code and a technical appendix for better digital viewing are included as supplementary materials and scheduled to be open-sourced upon publication.
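
The idea of choosing the threshold from a noise-to-clean gradient ratio can be illustrated with cross-entropy, whose gradient with respect to the predicted probability p of the labeled class has magnitude 1/p. The sketch below is a simplified illustration, not the OGC algorithm: it assumes the probabilities of clean and noisy samples are already known (the paper estimates them by modeling their distributions), and the candidate grid and the `max_ratio` budget are hypothetical.

```python
def clipped_grad(p, tau):
    """Magnitude of d(-log p)/dp, clipped at threshold tau."""
    return min(1.0 / p, tau)

def noise_to_clean_ratio(clean_probs, noisy_probs, tau):
    """Total clipped gradient from noisy samples divided by that from clean ones."""
    noisy = sum(clipped_grad(p, tau) for p in noisy_probs)
    clean = sum(clipped_grad(p, tau) for p in clean_probs)
    return noisy / clean

def pick_threshold(clean_probs, noisy_probs, max_ratio, candidates):
    """Largest candidate threshold that keeps the ratio within max_ratio
    (falls back to the smallest candidate if none qualifies)."""
    feasible = [t for t in candidates
                if noise_to_clean_ratio(clean_probs, noisy_probs, t) <= max_ratio]
    return max(feasible) if feasible else min(candidates)

# Clean samples tend to be predicted confidently, mislabeled ones are not.
clean_probs = [0.9, 0.8, 0.7, 0.6]
noisy_probs = [0.05, 0.1]
tau = pick_threshold(clean_probs, noisy_probs, max_ratio=1.0,
                     candidates=[1.0, 2.0, 5.0, 20.0])
```

Re-running `pick_threshold` as the predicted probabilities evolve is what makes the threshold dynamic in this toy version.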

[CV-104] Deep Clustering using Dirichlet Process Gaussian Mixture and Alpha Jensen-Shannon Divergence Clustering Loss

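[Quick read]: This paper addresses two key problems in deep clustering: the asymmetry of the conventional clustering loss (Kullback-Leibler divergence) and the reliance on prior knowledge of the number of clusters. The key to the solution is twofold. First, a closed-form variant of the Jensen-Shannon divergence is used to overcome the asymmetry. Second, an infinite cluster representation based on the Dirichlet process Gaussian mixture model performs joint clustering and model selection in the latent space, which the paper calls deep model selection. The number of clusters does not need to be preset; it gradually approaches the optimal number during training, removing the dependence on prior knowledge.

Link: https://arxiv.org/abs/2412.08940
Authors: Kart-Leong Lim
Keywords: deep learning feature, deep learning, Deep clustering, Deep, learning feature space
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: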

Abstract:Deep clustering is an emerging topic in deep learning where traditional clustering is performed in deep learning feature space. However, clustering and deep learning are often mutually exclusive. In autoencoder-based deep clustering, the challenge is how to jointly optimize clustering and dimension reduction, so that the weights in the hidden layers are guided not only by the reconstruction loss but also by a loss function associated with clustering. The current state-of-the-art has two fundamental flaws. First, it relies on the mathematical convenience of Kullback-Leibler divergence for the clustering loss function, but this divergence is asymmetric. Second, it assumes that prior knowledge of the number of clusters is always available for the dataset of interest. This paper tries to improve on these problems. For the first problem, we use a Jensen-Shannon divergence to overcome the asymmetry, specifically a closed-form variant. Next, we introduce an infinite cluster representation using a Dirichlet process Gaussian mixture model for joint clustering and model selection in the latent space, which we call deep model selection. The number of clusters in the latent space is not fixed but instead varies as it gradually approaches the optimal number during training. Thus, prior knowledge is not required. We evaluate our proposed deep model selection method against traditional model selection on datasets with a large number of classes, such as MIT67 and CIFAR100, and also compare with both a traditional variational Bayes model and a deep clustering method, with convincing results.
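
The asymmetry mentioned above is easy to check numerically. The sketch below compares KL divergence in both directions with the standard Jensen-Shannon divergence on two small discrete distributions; it uses the textbook definitions only, not the closed-form variant from the paper, and the example distributions are arbitrary.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

p = [0.9, 0.1]
q = [0.5, 0.5]
kl_pq, kl_qp = kl(p, q), kl(q, p)   # these two values differ
js_pq, js_qp = js(p, q), js(q, p)   # these two values coincide
```

Swapping the arguments changes KL but not JS, and JS is also bounded above by log 2, which keeps the loss scale predictable.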

[CV-105] Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration

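[Quick read]: This paper addresses three limitations of existing knowledge distillation (KD) methods for image restoration: they overlook the learning state of the student network, adopt a fixed solution space, and rely solely on an L1-type loss that struggles to exploit the distribution information of images. The key to the solution is a novel dynamic contrastive knowledge distillation (DCKD) framework. It introduces dynamic contrastive regularization to perceive the student's learning state and dynamically adjust the distilled solution space, and it proposes a distribution mapping module to extract and align the pixel-level category distributions of the teacher and student models. DCKD is structure-agnostic, adapts to different backbones, and can be combined with methods that optimize upper-bound constraints to further improve performance.

Link: https://arxiv.org/abs/2412.08939
Authors: Yunshuai Zhou,Junbo Qiao,Jincheng Liao,Wei Li,Simiao Li,Jiao Xie,Yunhang Shen,Jie Hu,Shaohui Lin
Keywords: compact student network, valuable yet challenging, challenging approach, high-performance but cumbersome, image restoration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: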

Abstract:Knowledge distillation (KD) is a valuable yet challenging approach that enhances a compact student network by learning from a high-performance but cumbersome teacher model. However, previous KD methods for image restoration overlook the state of the student during the distillation, adopting a fixed solution space that limits the capability of KD. Additionally, relying solely on L1-type loss struggles to leverage the distribution information of images. In this work, we propose a novel dynamic contrastive knowledge distillation (DCKD) framework for image restoration. Specifically, we introduce dynamic contrastive regularization to perceive the student’s learning state and dynamically adjust the distilled solution space using contrastive learning. Additionally, we also propose a distribution mapping module to extract and align the pixel-level category distribution of the teacher and student models. Note that the proposed DCKD is a structure-agnostic distillation framework, which can adapt to different backbones and can be combined with methods that optimize upper-bound constraints to further enhance model performance. Extensive experiments demonstrate that DCKD significantly outperforms the state-of-the-art KD methods across various image restoration tasks and backbones.

[CV-106] Deep clustering using adversarial net based clustering loss

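[Quick read]: This paper addresses a constraint of deep clustering: the need for a closed-form loss function to make backpropagation tractable. The key to the solution is to reformulate deep clustering as an adversarial net over the traditional closed-form KL divergence. Training becomes a task of minimizing with respect to the encoder while maximizing with respect to the discriminator, and at optimality the method theoretically approaches the JS divergence between the distribution assumptions of the encoder and the discriminator. On several well-known datasets (SVHN, USPS, MNIST and CIFAR10), the method performs on par with or better than existing state-of-the-art deep clustering methods.

Link: https://arxiv.org/abs/2412.08933
Authors: Kart-Leong Lim
Keywords: recent deep learning, deep learning technique, combines deep learning, Deep clustering, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: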

Abstract:Deep clustering is a recent deep learning technique which combines deep learning with traditional unsupervised clustering. At the heart of deep clustering is a loss function which penalizes samples for being an outlier from their ground truth cluster centers in the latent space. The probabilistic variant of deep clustering reformulates the loss using KL divergence. Often, the main constraint of deep clustering is the necessity of a closed form loss function to make backpropagation tractable. Inspired by deep clustering and adversarial net, we reformulate deep clustering as an adversarial net over traditional closed form KL divergence. Training deep clustering becomes a task of minimizing the encoder and maximizing the discriminator. At optimality, this method theoretically approaches the JS divergence between the distribution assumption of the encoder and the discriminator. We demonstrated the performance of our proposed method on several well cited datasets such as SVHN, USPS, MNIST and CIFAR10, achieving on-par or better performance with some of the state-of-the-art deep clustering methods.
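
The link between an adversarial objective and the JS divergence follows from the well-known GAN identity: for the value V(D) = E_p[log D] + E_q[log(1 - D)], the optimal discriminator is D* = p / (p + q) and V(D*) = 2 * JS(p, q) - log 4. The sketch below checks this numerically on toy discrete distributions; the distributions are arbitrary, and the code says nothing about how the paper parameterizes its encoder or discriminator.

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js(p, q):
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def gan_value(p, q, disc):
    """V(D) = sum_x p(x) log D(x) + sum_x q(x) log(1 - D(x))."""
    return (sum(a * math.log(d) for a, d in zip(p, disc))
            + sum(b * math.log(1.0 - d) for b, d in zip(q, disc)))

def optimal_discriminator(p, q):
    """Pointwise maximiser of V(D): D*(x) = p(x) / (p(x) + q(x))."""
    return [a / (a + b) for a, b in zip(p, q)]

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
d_star = optimal_discriminator(p, q)
value_at_optimum = gan_value(p, q, d_star)
js_based_value = 2 * js(p, q) - math.log(4)
```

Minimizing `value_at_optimum` over the generator-side distribution is therefore equivalent to minimizing the JS divergence, which is the sense in which the adversarial loss "approaches" JS at optimality.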

[CV-107] CAPrompt: Cyclic Prompt Aggregation for Pre-Trained Model Based Class Incremental Learning AAAI-25

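[Quick read]: This paper addresses two problems in Class Incremental Learning (CIL): prompt inconsistency caused by inaccurate task ID prediction, and the failure of existing methods to use the knowledge in the learned prompt parameters when predicting task IDs. The key to the solution is a new Cyclic Prompt Aggregation (CAPrompt) method, which introduces a prompt aggregation strategy in both training and inference and uses a weighted sum of different prompts to overcome prompt inconsistency, thereby removing the dependency on task ID prediction. Theoretical analysis shows that, under concave conditions, the aggregated prompt achieves lower error than selecting a single task-specific prompt, so a concave constraint and a linear constraint are introduced to guide prompt learning. The paper also proposes a cyclic weight prediction strategy that automatically adjusts the prompt weights for more accurate aggregation. Experiments show that CAPrompt outperforms existing state-of-the-art methods on multiple datasets.

Link: https://arxiv.org/abs/2412.08929
Authors: Qiwei Li,Jiahuan Zhou
Keywords: Class Incremental Learning, Class Incremental, Incremental Learning, demonstrated promising performance, prompt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: in Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25)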

Abstract:Recently, prompt tuning methods for pre-trained models have demonstrated promising performance in Class Incremental Learning (CIL). These methods typically involve learning task-specific prompts and predicting the task ID to select the appropriate prompts for inference. However, inaccurate task ID predictions can cause severe inconsistencies between the prompts used during training and inference, leading to knowledge forgetting and performance degradation. Additionally, existing prompt tuning methods rely solely on the pre-trained model to predict task IDs, without fully leveraging the knowledge embedded in the learned prompt parameters, resulting in inferior prediction performance. To address these issues, we propose a novel Cyclic Prompt Aggregation (CAPrompt) method that eliminates the dependency on task ID prediction by cyclically aggregating the knowledge from different prompts. Specifically, rather than predicting task IDs, we introduce an innovative prompt aggregation strategy during both training and inference to overcome prompt inconsistency by utilizing a weighted sum of different prompts. Thorough theoretical analysis demonstrates that under concave conditions, the aggregated prompt achieves lower error compared to selecting a single task-specific prompt. Consequently, we incorporate a concave constraint and a linear constraint to guide prompt learning, ensuring compliance with the concave condition requirement. Furthermore, to fully exploit the prompts and achieve more accurate prompt weights, we develop a cyclic weight prediction strategy. This strategy begins with equal weights for each task and automatically adjusts them to more appropriate values in a cyclical manner. Experiments on various datasets demonstrate that our proposed CAPrompt outperforms state-of-the-art methods by 2%-3%. Our code is available at this https URL.
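
The role of the concavity condition can be illustrated with Jensen's inequality: for a concave score f and non-negative weights summing to one, f(sum_i w_i * p_i) >= sum_i w_i * f(p_i), so the aggregated prompt scores at least as well as the weighted average of the individual prompts. The sketch below is a toy illustration only; `toy_concave_score` is a hypothetical stand-in for model quality as a function of the prompt, and the paper's exact condition and its cyclic weight prediction are not reproduced.

```python
def aggregate_prompts(prompts, weights):
    """Weighted sum of prompt vectors, with the weights normalised to sum to 1."""
    total = sum(weights)
    dim = len(prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, prompts)) / total
            for d in range(dim)]

def toy_concave_score(prompt):
    """Hypothetical concave score: negative squared distance to a fixed target."""
    target = [1.0, -1.0]
    return -sum((x - t) ** 2 for x, t in zip(prompt, target))

prompts = [[0.0, 0.0], [2.0, -2.0], [1.0, 1.0]]   # one toy prompt per task
weights = [1.0, 1.0, 1.0]                          # start from equal weights

aggregated = aggregate_prompts(prompts, weights)
score_of_aggregate = toy_concave_score(aggregated)
average_of_scores = sum(w * toy_concave_score(p)
                        for w, p in zip(weights, prompts)) / sum(weights)
```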

[CV-108] A Flexible Plug-and-Play Module for Generating Variable-Length

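[Quick read]: This paper addresses the fact that existing deep supervised hashing models generate fixed-length hash codes and therefore cannot balance efficiency against effectiveness. The key to the solution is the Nested Hash Layer (NHL), a plug-and-play module that lets existing deep supervised hashing models generate hash codes of several lengths simultaneously in a nested manner. To resolve the optimization conflicts caused by multiple learning objectives, the paper proposes an adaptive weights strategy that dynamically adjusts gradients during training. A long-short cascade self-distillation method additionally uses the structural information in longer hash codes to guide the generation of shorter ones, improving overall code quality. Experiments show that NHL not only accelerates training but also achieves better retrieval performance across a range of deep hashing models.

Link: https://arxiv.org/abs/2412.08922
Authors: Liyang He,Yuren Zhang,Rui Li,Zhenya Huang,Runze Wu,Enhong Chen
Keywords: offering significant benefits, Deep supervised hashing, hash codes, supervised hashing models, existing deep supervised
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: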

Abstract:Deep supervised hashing has become a pivotal technique in large-scale image retrieval, offering significant benefits in terms of storage and search efficiency. However, existing deep supervised hashing models predominantly focus on generating fixed-length hash codes. This approach fails to address the inherent trade-off between efficiency and effectiveness when using hash codes of varying lengths. To determine the optimal hash code length for a specific task, multiple models must be trained for different lengths, leading to increased training time and computational overhead. Furthermore, the current paradigm overlooks the potential relationships between hash codes of different lengths, limiting the overall effectiveness of the models. To address these challenges, we propose the Nested Hash Layer (NHL), a plug-and-play module designed for existing deep supervised hashing models. The NHL framework introduces a novel mechanism to simultaneously generate hash codes of varying lengths in a nested manner. To tackle the optimization conflicts arising from the multiple learning objectives associated with different code lengths, we further propose an adaptive weights strategy that dynamically monitors and adjusts gradients during training. Additionally, recognizing that the structural information in longer hash codes can provide valuable guidance for shorter hash codes, we develop a long-short cascade self-distillation method within the NHL to enhance the overall quality of the generated hash codes. Extensive experiments demonstrate that NHL not only accelerates the training process but also achieves superior retrieval performance across various deep hashing models. Our code is publicly available at this https URL.
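
One simple way to picture codes generated "in a nested manner" is to compute the longest code once and read shorter codes off as its prefixes, so a single forward pass serves every length. The sketch below follows that reading, which is an assumption about the nesting rather than a description of NHL; the random sign projections stand in for a trained hash layer, and the adaptive weighting and self-distillation parts are not shown.

```python
import random

def hash_bits(feature, projections):
    """One bit per projection row: 1 if the projection is non-negative, else 0."""
    return [1 if sum(w * x for w, x in zip(row, feature)) >= 0 else 0
            for row in projections]

def nested_codes(feature, projections, lengths):
    """Compute the full-length code once and take prefixes for shorter lengths."""
    full = hash_bits(feature, projections)
    return {k: full[:k] for k in lengths}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
feature_dim, max_bits = 16, 32
projections = [[rng.uniform(-1, 1) for _ in range(feature_dim)] for _ in range(max_bits)]
feature = [rng.uniform(-1, 1) for _ in range(feature_dim)]
codes = nested_codes(feature, projections, lengths=[8, 16, 32])
```

At retrieval time a short prefix can be used for a fast first pass and the longer code for re-ranking, which is one way to trade efficiency against effectiveness.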

[CV-109] Sensing for Space Safety and Sustainability: A Deep Learning Approach with Vision Transformers

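[Quick read]: This paper addresses the risks to space safety, operations, and the sustainability of the outer space environment posed by the rapid growth of small satellites in low Earth orbit. The key to the solution is an efficient satellite object detection (SOD) approach that uses onboard deep learning (DL) to assess and respond to collision risks while respecting the constrained resources of a small satellite platform. The paper proposes two new DL models, GELAN-ViT and GELAN-RepViT, which incorporate the vision transformer (ViT) into the Generalized Efficient Layer Aggregation Network (GELAN) architecture and overcome limitations of existing models by separating the convolutional neural network and ViT paths. The models outperform the state-of-the-art YOLOv9-t in mean average precision (mAP) and computational cost, reducing GFLOPs while improving detection accuracy.

Link: https://arxiv.org/abs/2412.08913
Authors: Wenxuan Zhang,Peng Hu
Keywords: low Earth orbit, enable ubiquitous digital, ubiquitous digital services, space assets represented, low Earth
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: To be published in the 12th Annual IEEE International Conference on Wireless for Space and Extreme Environments (WiSEE 2024)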

Abstract:The rapid increase of space assets represented by small satellites in low Earth orbit can enable ubiquitous digital services for everyone. However, due to the dynamic space environment, numerous space objects, complex atmospheric conditions, and unexpected events can easily introduce adverse conditions affecting space safety, operations, and sustainability of the outer space environment. This challenge calls for responsive, effective satellite object detection (SOD) solutions that allow a small satellite to assess and respond to collision risks, with the consideration of constrained resources on a small satellite platform. This paper discusses the SOD tasks and onboard deep learning (DL) approach to the tasks. Two new DL models are proposed, called GELAN-ViT and GELAN-RepViT, which incorporate vision transformer (ViT) into the Generalized Efficient Layer Aggregation Network (GELAN) architecture and address limitations by separating the convolutional neural network and ViT paths. These models outperform the state-of-the-art YOLOv9-t in terms of mean average precision (mAP) and computational costs. On the SOD dataset, our proposed models can achieve around 95% mAP50 with giga-floating point operations (GFLOPs) reduced by over 5.0. On the VOC 2012 dataset, they can achieve ≥60.7% mAP50 with GFLOPs reduced by over 5.2.
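
The mAP50 figures above count a predicted box as a match when its intersection over union (IoU) with a ground-truth box is at least 0.5. The following minimal sketch shows that matching criterion only, not the full mAP computation and not the paper's evaluation code; the function names and the `(x1, y1, x2, y2)` box layout are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def is_match_at_50(pred, gt, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count at IoU 0.5."""
    return iou(pred, gt) >= threshold
```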

[CV-110] Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression

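[Quick read]: This paper addresses the degradation of 8K video quality caused by codec compression (e.g., AV1 and HEVC). The key to the solution is a novel Transformer-Diffusion model (DiQP) that uses denoising diffusion, without introducing additional noise, to model and reverse the complex, non-Gaussian nature of compression artifacts. The model combines the ability of Transformers to capture long-range dependencies with an enhanced windowed mechanism that preserves spatiotemporal context within groups of pixels. In addition, auxiliary “Look Ahead” and “Look Around” modules supply information from future and surrounding frames, further improving detail reconstruction and overall visual quality. Experiments show that the model outperforms existing state-of-the-art methods at restoring high-resolution video such as 4K and 8K.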

Link: https://arxiv.org/abs/2412.08912
Authors: Ali Mollaahmadi Dehaghi,Reza Razavi,Mohammad Moshirpour
Keywords: introduce DiQP, Denoising Diffusion, video quality degraded, Transformer-Diffusion model, model
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 12 pages, 8 figures

Abstract:In this paper, we introduce DiQP, a novel Transformer-Diffusion model for restoring 8K video quality degraded by codec compression. To the best of our knowledge, our model is the first to consider restoring the artifacts introduced by various codecs (AV1, HEVC) by Denoising Diffusion without considering additional noise. This approach allows us to model the complex, non-Gaussian nature of compression artifacts, effectively learning to reverse the degradation. Our architecture combines the power of Transformers to capture long-range dependencies with an enhanced windowed mechanism that preserves spatiotemporal context within groups of pixels across frames. To further enhance restoration, the model incorporates auxiliary “Look Ahead” and “Look Around” modules, providing both future and surrounding frame information to aid in reconstructing fine details and enhancing overall visual quality. Extensive experiments on different datasets demonstrate that our model outperforms state-of-the-art methods, particularly for high-resolution videos such as 4K and 8K, showcasing its effectiveness in restoring perceptually pleasing videos from highly compressed sources.

[CV-111] GaGA: Towards Interactive Global Geolocation Assistant

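[Quick read]: This paper addresses global geolocation, i.e., predicting the geographic location of images taken anywhere in the world. The key to the solution is GaGA, an interactive global geolocation assistant built on large vision-language models (LVLMs). GaGA extracts geographic clues from images and combines them with the extensive world knowledge embedded in LVLMs to determine location, while also providing explanations and justifications for its predictions. The paper further designs a novel interactive geolocation method that goes beyond traditional static inference by letting users intervene, correct, or supply clues, making the model more flexible and practical. GaGA's development relies on the newly proposed Multi-modal Global Geolocation (MG-Geo) dataset, and it achieves state-of-the-art performance on the GWS15k dataset, improving accuracy by 4.57% at the country level and 2.92% at the city level.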

Link: https://arxiv.org/abs/2412.08907
Authors: Zhiyang Dou,Zipeng Wang,Xumeng Han,Chenhui Qiang,Kuiran Wang,Guorong Li,Zhibei Huang,Zhenjun Han
Keywords: Global geolocation, computer vision, Multi-modal Global Geolocation, interactive global geolocation, seeks to predict
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Global geolocation, which seeks to predict the geographical location of images captured anywhere in the world, is one of the most challenging tasks in the field of computer vision. In this paper, we introduce an innovative interactive global geolocation assistant named GaGA, built upon the flourishing large vision-language models (LVLMs). GaGA uncovers geographical clues within images and combines them with the extensive world knowledge embedded in LVLMs to determine the geolocations while also providing justifications and explanations for the prediction results. We further designed a novel interactive geolocation method that surpasses traditional static inference approaches. It allows users to intervene, correct, or provide clues for the predictions, making the model more flexible and practical. The development of GaGA relies on the newly proposed Multi-modal Global Geolocation (MG-Geo) dataset, a comprehensive collection of 5 million high-quality image-text pairs. GaGA achieves state-of-the-art performance on the GWS15k dataset, improving accuracy by 4.57% at the country level and 2.92% at the city level, setting a new benchmark. These advancements represent a significant leap forward in developing highly accurate, interactive geolocation systems with global applicability.

[CV-112] LV-CadeNet: Long View Feature Convolution-Attention Fusion Encoder-Decoder Network for Clinical MEG Spike Detection

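[Quick read]: This paper addresses automatic detection of interictal epileptic discharges (IEDs) in clinical magnetoencephalography (MEG) data, in particular the challenge that positive and negative samples are highly imbalanced in clinical data. The key to the solution is LV-CadeNet, a Long View feature Convolution-Attention fusion Encoder-Decoder Network, which uses semi-supervised learning to bridge the distribution gap between training data and clinical test data. In addition, LV-CadeNet mimics how human experts work by constructing long-view morphological input data, and it uses an advanced convolution-attention module to extract temporal and spatial features, raising IED detection accuracy on MEG data from 42.31% to 54.88%.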

Link: https://arxiv.org/abs/2412.08896
Authors: Kuntao Xiao,Xiongfei Wang,Pengfei Teng,Yi Sun,Wanli Yang,Liang Zhang,Hanyang Dong,Guoming Luan,Shurong Sheng
Keywords: interictal epileptic discharges, localizing interictal epileptic, source localizing interictal, MEG, MEG spike detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:It is widely acknowledged that the epileptic foci can be pinpointed by source localizing interictal epileptic discharges (IEDs) via Magnetoencephalography (MEG). However, manual detection of IEDs, which appear as spikes in MEG data, is extremely labor intensive and requires considerable professional expertise, limiting the broader adoption of MEG technology. Numerous studies have focused on automatic detection of MEG spikes to overcome this challenge, but these efforts often validate their models on synthetic datasets with balanced positive and negative samples. In contrast, clinical MEG data is highly imbalanced, raising doubts on the real-world efficacy of these models. To address this issue, we introduce LV-CadeNet, a Long View feature Convolution-Attention fusion Encoder-Decoder Network, designed for automatic MEG spike detection in real-world clinical scenarios. Beyond addressing the disparity between training data distribution and clinical test data through semi-supervised learning, our approach also mimics human specialists by constructing long view morphological input data. Moreover, we propose an advanced convolution-attention module to extract temporal and spatial features from the input data. LV-CadeNet significantly improves the accuracy of MEG spike detection, boosting it from 42.31% to 54.88% on a novel clinical dataset sourced from Sanbo Brain Hospital Capital Medical University. This dataset, characterized by a highly imbalanced distribution of positive and negative samples, accurately represents real-world clinical scenarios.

[CV-113] Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark AAAI2025

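[Quick read]: This paper addresses the video long-to-short task: generating short videos suitable for social media sharing from long videos. The key to the solution is Repurpose-10K, a large-scale dataset of more than 10,000 videos and 120,000 annotated clips for training and evaluating models. To cope with the inaccurate annotations that untrained annotators may produce, the paper proposes a two-stage annotation scheme, and it introduces a baseline model built on a cross-modal fusion and alignment framework that jointly uses audio, visual, and caption information.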

Link: https://arxiv.org/abs/2412.08879
Authors: Yongliang Wu,Wenbo Zhu,Jiawang Cao,Yi Lu,Bozheng Li,Weiheng Chi,Zihan Qiu,Lirian Su,Haolin Zheng,Jay Wu,Xu Yang
Keywords: social media platforms, experienced significant growth, producing short-form videos, recent times, demand for producing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Abstract:The demand for producing short-form videos for sharing on social media platforms has experienced significant growth in recent times. Despite notable advancements in the fields of video summarization and highlight detection, which can create partially usable short films from raw videos, these approaches are often domain-specific and require an in-depth understanding of real-world video content. To tackle this predicament, we propose Repurpose-10K, an extensive dataset comprising over 10,000 videos with more than 120,000 annotated clips aimed at resolving the video long-to-short task. Recognizing the inherent constraints posed by untrained human annotators, which can result in inaccurate annotations for repurposed videos, we propose a two-stage solution to obtain annotations from real-world user-generated content. Furthermore, we offer a baseline model to address this challenging task by integrating audio, visual, and caption aspects through a cross-modal fusion and alignment framework. We aspire for our work to ignite groundbreaking research in the lesser-explored realms of video repurposing. The code and data will be available at this https URL.

[CV-114] Inference-Time Diffusion Model Distillation

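[Quick read]: This paper addresses the performance drop of diffusion distillation models that accelerate reverse sampling, which stems from distribution shift and errors accumulated over multi-step sampling. The key to the solution is Distillation++, a novel inference-time distillation framework that narrows the gap by incorporating teacher-guided refinement during sampling. Specifically, the method recasts student model sampling as a proximal optimization problem with a score distillation sampling (SDS) loss and integrates distillation optimization into reverse sampling, so that a pre-trained diffusion model guides the student's sampling trajectory toward the clean manifold. Without additional data or fine-tuning, the approach improves the denoising process in real time and shows substantial gains over state-of-the-art distillation baselines, particularly in early sampling stages.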

Link: https://arxiv.org/abs/2412.08871
Authors: Geon Yeong Park,Sang Wan Lee,Jong Chul Ye
Keywords: effectively accelerate reverse, models effectively accelerate, fewer steps, distillation, Diffusion distillation models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: this https URL

Abstract:Diffusion distillation models effectively accelerate reverse sampling by compressing the process into fewer steps. However, these models still exhibit a performance gap compared to their pre-trained diffusion model counterparts, exacerbated by distribution shifts and accumulated errors during multi-step sampling. To address this, we introduce Distillation++, a novel inference-time distillation framework that reduces this gap by incorporating teacher-guided refinement during sampling. Inspired by recent advances in conditional sampling, our approach recasts student model sampling as a proximal optimization problem with a score distillation sampling loss (SDS). To this end, we integrate distillation optimization during reverse sampling, which can be viewed as teacher guidance that drives student sampling trajectory towards the clean manifold using pre-trained diffusion models. Thus, Distillation++ improves the denoising process in real-time without additional source data or fine-tuning. Distillation++ demonstrates substantial improvements over state-of-the-art distillation baselines, particularly in early sampling stages, positioning itself as a robust guided sampling process crafted for diffusion distillation models. Code: this https URL.

[CV-115] ViUniT: Visual Unit Tests for More Robust Visual Programming

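[Quick read]: This paper addresses the problem that, in visual reasoning tasks, models can produce correct answers with incorrect programs: on benchmark visual reasoning data, models that answer correctly still generate an incorrect program 33% of the time. The key to the solution is Visual Unit Testing (ViUniT), a framework that improves the reliability of visual programs by automatically generating unit tests. The framework uses a language model to generate unit tests as image descriptions and expected answers, and uses image synthesis to produce the corresponding images, thereby verifying the logical correctness of a program. Experiments show that ViUniT improves model performance by 11.4%, reduces programs that are correct for the wrong reasons by 40%, and enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7%.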

Link: https://arxiv.org/abs/2412.08859
Authors: Artemis Panagopoulou,Honglu Zhou,Silvio Savarese,Caiming Xiong,Chris Callison-Burch,Mark Yatskar,Juan Carlos Niebles
Keywords: Programming based approaches, Programming based, Unit, Unit tests, based approaches
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Programming based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.
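
Of the four applications listed, best program selection is the easiest to sketch. The example below assumes a simple rule, namely choosing the candidate program with the highest unit-test pass rate; the abstract does not give the paper's exact scoring, and the "images" here are toy dictionaries standing in for synthesized images. All names are hypothetical.

```python
def pass_rate(program, unit_tests):
    """Fraction of (image, expected_answer) pairs that the program answers correctly."""
    passed = 0
    for image, expected in unit_tests:
        try:
            if program(image) == expected:
                passed += 1
        except Exception:
            pass  # a program that crashes on a test simply fails that test
    return passed / len(unit_tests)


def select_best(programs, unit_tests):
    """Return the candidate program with the highest pass rate."""
    return max(programs, key=lambda p: pass_rate(p, unit_tests))


# Toy query: "Is there a cat in the image?" with images reduced to object counts.
tests = [({"cats": 2, "dogs": 1}, "yes"), ({"cats": 0, "dogs": 3}, "no")]
always_yes = lambda img: "yes"                                 # right only by luck
checks_cats = lambda img: "yes" if img["cats"] > 0 else "no"   # right for the right reason
best = select_best([always_yes, checks_cats], tests)
```

The `always_yes` program answers the first test correctly but fails the second, which is how a unit-test suite can expose a program that is right for the wrong reasons.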

[CV-116] Labits: Layered Bidirectional Time Surfaces Representation for Event Camera-based Continuous Dense Trajectory Estimation

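[Quick read]: This paper addresses how to optimize the representation of event-camera data in dynamic scenes, in particular how to reduce the information lost while constructing event representations. The key to the solution is Labits (Layered Bidirectional Time Surfaces), a new representation that simultaneously preserves fine-grained temporal information, stable 2D visual features, and temporally consistent information density. The paper also introduces a dedicated module for extracting active pixel local optical flow (APLOF), which significantly boosts performance and yields a 49% reduction in trajectory end-point error (TEPE) on the MultiFlow dataset compared to the previous state of the art.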

Link: https://arxiv.org/abs/2412.08849
Authors: Zhongyang Zhang,Jiacheng Qiu,Shuyang Cui,Yijun Luo,Tauhidur Rahman
Keywords: traditional frame-based sensors, capturing dynamic scenes, high temporal resolution, frame-based sensors, capturing dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 24 pages, 12 figures, 9 tables

Abstract:Event cameras provide a compelling alternative to traditional frame-based sensors, capturing dynamic scenes with high temporal resolution and low latency. Moving objects trigger events with precise timestamps along their trajectory, enabling smooth continuous-time estimation. However, few works have attempted to optimize the information loss during event representation construction, imposing a ceiling on this task. Fully exploiting event cameras requires representations that simultaneously preserve fine-grained temporal information, stable and characteristic 2D visual features, and temporally consistent information density, an unmet challenge in existing representations. We introduce Labits: Layered Bidirectional Time Surfaces, a simple yet elegant representation designed to retain all these features. Additionally, we propose a dedicated module for extracting active pixel local optical flow (APLOF), significantly boosting the performance. Our approach achieves an impressive 49% reduction in trajectory end-point error (TEPE) compared to the previous state-of-the-art on the MultiFlow dataset. The code will be released upon acceptance.
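
For background, the sketch below computes a classical exponentially decayed time surface from a list of events. It is not the Labits representation, whose layered and bidirectional construction the abstract does not detail; the function name, the `(x, y, t)` event layout, and the decay constant `tau` are assumptions for illustration.

```python
import math


def time_surface(events, width, height, t_ref, tau=0.05):
    """Return a height x width grid where each pixel holds exp(-(t_ref - t_last) / tau).

    events: iterable of (x, y, t) tuples; events later than t_ref are ignored.
    Pixels that never fired hold 0.0.
    """
    last = [[None] * width for _ in range(height)]
    for x, y, t in events:
        # Keep only the most recent timestamp per pixel.
        if t <= t_ref and (last[y][x] is None or t > last[y][x]):
            last[y][x] = t
    return [
        [0.0 if t is None else math.exp(-(t_ref - t) / tau) for t in row]
        for row in last
    ]


surface = time_surface([(0, 0, 0.0), (1, 0, 0.1), (0, 0, 0.05)], width=2, height=1, t_ref=0.1)
```

A single surface like this keeps only the latest timestamp per pixel, which is one way information can be lost when events are converted to a dense representation.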

[CV-117] DALI: Domain Adaptive LiDAR Object Detection via Distribution-level and Instance-level Pseudo Label Denoising

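[Quick read]: This paper addresses the shortage of annotated samples for training deep neural networks for LiDAR point cloud object detection, which arises because 3D bounding box annotation of large-scale datasets is costly and time-consuming. The key to the solution is the Domain Adaptive LIdar (DALI) object detection framework, which uses unsupervised domain adaptation (UDA) to transfer knowledge learned from labeled source-domain data to unlabeled target-domain data. DALI reduces the noise introduced by pseudo labels with two strategies: (1) a post-training size normalization (PTSN) strategy, which identifies an unbiased scale after network training to mitigate bias in the pseudo label size distribution; and (2) pseudo point clouds generation (PPCG) strategies, in ray-constrained and constraint-free variants, which generate pseudo point clouds for each instance to keep pseudo labels and pseudo points consistent during training. The method is validated on the public KITTI, Waymo, and nuScenes datasets and achieves leading unsupervised domain adaptation performance.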

Link: https://arxiv.org/abs/2412.08806
Authors: Xiaohu Lu,Hayder Radha
Keywords: underlying detectors’ deep, detectors’ deep neural, deep neural networks, large amount, amount of human-annotated
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object detection using LiDAR point clouds relies on a large amount of human-annotated samples when training the underlying detectors’ deep neural networks. However, generating 3D bounding box annotation for a large-scale dataset could be costly and time-consuming. Alternatively, unsupervised domain adaptation (UDA) enables a given object detector to operate on novel data with an unlabeled training dataset, by transferring the knowledge learned from training on labeled source domain data to the new unlabeled target domain. Pseudo label strategies, which involve training the 3D object detector using target-domain predicted bounding boxes from a pre-trained model, are commonly used in UDA. However, these pseudo labels often introduce noise, impacting performance. In this paper, we introduce the Domain Adaptive LIdar (DALI) object detection framework to address noise at both distribution and instance levels. Firstly, a post-training size normalization (PTSN) strategy is developed to mitigate bias in pseudo label size distribution by identifying an unbiased scale after network training. To address instance-level noise between pseudo labels and corresponding point clouds, two pseudo point clouds generation (PPCG) strategies, ray-constrained and constraint-free, are developed to generate pseudo point clouds for each instance, ensuring the consistency between pseudo labels and pseudo points during training. We demonstrate the effectiveness of our method on the publicly available and popular datasets KITTI, Waymo, and nuScenes. We show that the proposed DALI framework achieves state-of-the-art results and outperforms leading approaches on most of the domain adaptation tasks. Our code is available at this https URL.

[CV-118] Generative Modeling with Explicit Memory

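[Quick read]: This paper addresses the sharp increase in computational demands that deep generative diffusion models incur in training and inference when capturing complex data distributions. The key to the solution is Generative Modeling with Explicit Memory (GMem), which uses an external memory bank in both the training and sampling phases of diffusion models to retain semantic information from the data distribution, reducing reliance on neural network capacity. The approach markedly improves training and sampling efficiency as well as generation quality: on ImageNet at 256×256 resolution, GMem accelerates SiT training by over 46.7×, and it reaches an FID score of 5.75 within 250K steps, a 16× speedup over REPA, the most efficient existing method.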

Link: https://arxiv.org/abs/2412.08781
Authors: Yi Tang,Peng Sun,Zhenglin Cheng,Tao Lin
Keywords: Recent studies, deep generative diffusion, models implicitly learns, generative diffusion models, diffusion models implicitly
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 4 tables

Abstract:Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, leading to a substantial increase in computational demands, which in turn become the primary bottleneck in both training and inference of diffusion models. To this end, we introduce Generative Modeling with Explicit Memory (GMem), leveraging an external memory bank in both training and sampling phases of diffusion models. This approach preserves semantic information from data distributions, reducing reliance on neural network capacity for learning and generalizing across diverse datasets. The results are significant: our GMem enhances training efficiency, sampling efficiency, and generation quality. For instance, on ImageNet at 256×256 resolution, GMem accelerates SiT training by over 46.7×, achieving the performance of a SiT model trained for 7M steps in fewer than 150K steps. Compared to the most efficient existing method, REPA, GMem still offers a 16× speedup, attaining an FID score of 5.75 within 250K steps, whereas REPA requires over 4M steps. Additionally, our method achieves state-of-the-art generation quality, with an FID score of 3.56 without classifier-free guidance on ImageNet 256×256. Our code is available at this https URL.
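
The quoted speedups follow directly from the step counts in the abstract. The short check below uses the boundary values (7M versus 150K steps, 4M versus 250K steps); because the abstract says "fewer than" and "over", the resulting ratios are lower bounds rather than exact figures. Variable names are illustrative only.

```python
# Step counts as stated in the abstract (boundary values).
sit_steps, gmem_steps_vs_sit = 7_000_000, 150_000
repa_steps, gmem_steps_vs_repa = 4_000_000, 250_000

speedup_vs_sit = sit_steps / gmem_steps_vs_sit      # about 46.7
speedup_vs_repa = repa_steps / gmem_steps_vs_repa   # exactly 16.0
```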

[CV-119] ProtoOcc: Accurate Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder AAAI

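[Quick read]: This paper addresses 3D occupancy prediction, i.e., predicting the occupancy states and semantic classes of 3D voxels. The key to the solution is the design of the ProtoOcc model, whose core components are the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE combines multi-scale features from 3D voxel and BEV representations, improving both performance and computational efficiency. The PQD introduces Prototype Queries, using scene-adaptive and scene-agnostic prototypes to accelerate decoding, and adopts Robust Prototype Learning, which injects noise during training and trains the model to denoise, further improving robustness and prediction accuracy.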

Link: https://arxiv.org/abs/2412.08774
Authors: Jungho Kim,Changwon Kang,Dongyoung Lee,Sehwan Choi,Jun Won Choi
Keywords: deep semantic understanding, Prototype Query Decoder, Dual Branch Encoder, prediction model designed, semantic classes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI Conference on Artificial Intelligence 2025, 9 pages, 5 figures

Abstract:In this paper, we introduce ProtoOcc, a novel 3D occupancy prediction model designed to predict the occupancy states and semantic classes of 3D voxels through a deep semantic understanding of scenes. ProtoOcc consists of two main components: the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE produces a new 3D voxel representation by combining 3D voxel and BEV representations across multiple scales through a dual branch structure. This design enhances both performance and computational efficiency by providing a large receptive field for the BEV representation while maintaining a smaller receptive field for the voxel representation. The PQD introduces Prototype Queries to accelerate the decoding process. Scene-Adaptive Prototypes are derived from the 3D voxel features of the input sample, while Scene-Agnostic Prototypes are computed by applying Scene-Adaptive Prototypes to an Exponential Moving Average during the training phase. By using these prototype-based queries for decoding, we can directly predict 3D occupancy in a single step, eliminating the need for iterative Transformer decoding. Additionally, we propose Robust Prototype Learning, which injects noise into the prototype generation process and trains the model to denoise during the training phase. ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark. For the single-frame method, it reaches 39.56% mIoU with an inference speed of 12.83 FPS on an NVIDIA RTX 3090. Our code can be found at this https URL.
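
The abstract states that Scene-Agnostic Prototypes are obtained by applying an exponential moving average (EMA) to Scene-Adaptive Prototypes during training. A generic EMA update can be sketched as follows; the function name, the flat-list prototype layout, and the momentum value are assumptions for illustration and are not taken from the paper.

```python
def ema_update(agnostic, adaptive, momentum=0.99):
    """Blend the running (scene-agnostic) prototype with the current (scene-adaptive) one."""
    return [momentum * a + (1.0 - momentum) * b for a, b in zip(agnostic, adaptive)]


# Two training steps that both see the same scene-adaptive prototype.
prototype = [0.0, 0.0]
for scene_adaptive in ([1.0, 2.0], [1.0, 2.0]):
    prototype = ema_update(prototype, scene_adaptive, momentum=0.5)
```

With a momentum close to 1, the running prototype changes slowly and therefore averages over many scenes rather than tracking any single one.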

[CV-120] LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information

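[Quick read]: This paper addresses the problem that visual tokens in multi-modal large language models (MLLMs) consume a large share of the maximum token limit, which increases computational demands and degrades performance, especially when handling multiple images or videos. The key to the solution is Dynamic Feature Map Reduction (DFMR), built on LLaVA-1.5, which dynamically compresses visual tokens to free up token capacity and significantly improves performance across different visual token lengths. The method offers a practical option for resource-constrained academic environments and can also be used in industry for data augmentation, helping to mitigate the scarcity of open-domain image-text pair datasets in the continued pretraining stage.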

Link: https://arxiv.org/abs/2412.08771
Authors: Ke Wang,Hong Xuan
Keywords: Multi-modal large language, achieved great progress, large language models, utilizing instruction-following data, Multi-modal large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-modal large language models (MLLMs) utilizing instruction-following data, such as LLaVA, have achieved great progress in the industry. A major limitation in these models is that visual tokens consume a substantial portion of the maximum token limit in large language models (LLMs), leading to increased computational demands and decreased performance when prompts include multiple images or videos. Industry solutions often mitigate this issue by increasing computational power, but this approach is less feasible in academic environments with limited resources. In this study, we propose Dynamic Feature Map Reduction (DFMR) based on LLaVA-1.5 to address the challenge of visual token overload. DFMR dynamically compresses the visual tokens, freeing up token capacity. Our experimental results demonstrate that integrating DFMR into LLaVA-1.5 significantly improves the performance of LLaVA in varied visual token lengths, offering a promising solution for extending LLaVA to handle multi-image and video scenarios in resource-constrained academic environments. It can also be applied in industry settings for data augmentation to help mitigate the scarcity of open-domain image-text pair datasets in the continued pretraining stage.
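
One simple way to shrink a grid of visual tokens is non-overlapping average pooling. The sketch below illustrates that general idea only. The abstract does not say how DFMR chooses its compression level, so the budget-based rule used here (pick the smallest pooling factor that fits a token budget) is an assumption, as are the function names and the scalar-per-token grid.

```python
def avg_pool(grid, factor):
    """Average-pool a square grid (list of rows) with a factor x factor window."""
    n = len(grid) // factor
    return [
        [
            sum(grid[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(n)
        ]
        for i in range(n)
    ]


def compress_to_budget(grid, max_tokens):
    """Pool with the smallest factor (dividing the grid size) whose output fits max_tokens."""
    size = len(grid)
    for factor in range(1, size + 1):
        if size % factor == 0 and (size // factor) ** 2 <= max_tokens:
            return avg_pool(grid, factor)
    raise ValueError("max_tokens must be at least 1")


grid = [[4 * r + c for c in range(4)] for r in range(4)]  # 16 toy "tokens"
compressed = compress_to_budget(grid, max_tokens=4)        # pooled down to 2 x 2
```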

[CV-121] Beyond Knowledge Silos: Task Fingerprinting for Democratization of Medical Imaging AI

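[Quick read]: This paper addresses knowledge silos in medical image analysis: existing knowledge is scattered across publications, many details remain unpublished, and privacy regulations restrict data sharing, all of which hinder collaboration and progress. The key to the solution is a framework for secure knowledge transfer based on dataset “fingerprints”, structured representations of feature distributions that quantify task similarity. Tested across 71 distinct tasks and 12 medical imaging modalities by transferring neural architectures, pretraining, augmentation policies, and multi-task learning, the method outperforms traditional methods at identifying relevant knowledge and facilitates collaborative model training, advancing the democratization of medical imaging AI.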

Link: https://arxiv.org/abs/2412.08763
Authors: Patrick Godau,Akriti Srivastava,Tim Adler,Lena Maier-Hein
Keywords: undergoing rapid transformations, methodical research increasingly, research increasingly translated, rapid transformations, clinical practice
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The field of medical imaging AI is currently undergoing rapid transformations, with methodical research increasingly translated into clinical practice. Despite these successes, research suffers from knowledge silos, hindering collaboration and progress: Existing knowledge is scattered across publications and many details remain unpublished, while privacy regulations restrict data sharing. In the spirit of democratizing AI, we propose a framework for secure knowledge transfer in the field of medical image analysis. The key to our approach is dataset “fingerprints”, structured representations of feature distributions, that enable quantification of task similarity. We tested our approach across 71 distinct tasks and 12 medical imaging modalities by transferring neural architectures, pretraining, augmentation policies, and multi-task learning. According to comprehensive analyses, our method outperforms traditional methods for identifying relevant knowledge and facilitates collaborative model training. Our framework fosters the democratization of AI in medical imaging and could become a valuable tool for promoting faster scientific advancement.

[CV-122] Proactive Adversarial Defense: Harnessing Prompt Tuning in Vision-Language Models to Detect Unseen Backdoored Images

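[Quick read]: This paper addresses the detection of unseen backdoored samples during both training and inference. The key to the solution is prompt tuning in Vision Language Models (VLMs): learnable text prompts are trained to distinguish clean images from images carrying hidden backdoor triggers. The method performs well at detecting unseen backdoor triggers, reaching an average accuracy of 86% across two well-known datasets, which the authors describe as a new standard for backdoor defense.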

Link: https://arxiv.org/abs/2412.08755
Authors: Kyle Stein,Andrew Arash Mahyari,Guillermo Francia,Eman El-Sheikh
Keywords: target labels, Backdoor attacks pose, pose a critical, critical threat, threat by embedding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Backdoor attacks pose a critical threat by embedding hidden triggers into inputs, causing models to misclassify them into target labels. While extensive research has focused on mitigating these attacks in object recognition models through weight fine-tuning, much less attention has been given to detecting backdoored samples directly. Given the vast datasets used in training, manual inspection for backdoor triggers is impractical, and even state-of-the-art defense mechanisms fail to fully neutralize their impact. To address this gap, we introduce a groundbreaking method to detect unseen backdoored images during both training and inference. Leveraging the transformative success of prompt tuning in Vision Language Models (VLMs), our approach trains learnable text prompts to differentiate clean images from those with hidden backdoor triggers. Experiments demonstrate the exceptional efficacy of this method, achieving an impressive average accuracy of 86% across two renowned datasets for detecting unseen backdoor triggers, establishing a new standard in backdoor defense.

[CV-123] DocVLM: Make Your VLM an Efficient Reader

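[Quick read]: This paper addresses the computational overhead of high-resolution image inputs in document understanding, particularly in scenarios that require fine-grained text processing. The key to the solution is DocVLM, which encodes OCR (Optical Character Recognition) text and layout information into a compact set of learned queries and integrates them into Vision-Language Models (VLMs), improving document processing without relying on high-resolution images. At its core, an OCR encoder captures textual content and layout and compresses them into a set of learned queries, reducing the model's dependence on high-resolution images while preserving the original model weights.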

Link: https://arxiv.org/abs/2412.08746
Authors: Mor Shpigel Nacson,Aviad Aberdam,Roy Ganz,Elad Ben Avraham,Alona Golts,Yair Kittenplon,Shai Mazor,Ron Litman
Keywords: Vision-Language Models, requires fine-grained text, diverse visual tasks, excel in diverse, face challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms compared to full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM significantly reduces reliance on high-resolution images for document understanding. In limited-token regimes (448×448), DocVLM with 64 learned queries improves DocVQA results from 56.0% to 86.6% when integrated with InternVL2 and from 84.4% to 91.2% with Qwen2-VL. In LLaVA-OneVision, DocVLM achieves improved results while using 80% fewer image tokens. The reduced token usage allows processing multiple pages effectively, showing impressive zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA, highlighting DocVLM’s potential for applications requiring high performance and efficiency.

[CV-124] VisionArena: 230K Real World User-VLM Conversations with Preference Labels

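[Quick read]: This paper addresses the lack of benchmarks that capture real user interactions with Vision-Language Models (VLMs). The key to the solution is VisionArena, a large-scale dataset of 230K real conversations between users and VLMs, spanning 73K users, 45 VLMs, and 138 languages. The dataset has three subsets: VisionArena-Chat (200K single-turn and multi-turn conversations), VisionArena-Battle (30K comparison conversations with user preference votes), and VisionArena-Bench (an automatic benchmark of 500 diverse user prompts). After analyzing the types of questions users ask, the influence of response style on preference, and model weaknesses on specific tasks such as spatial reasoning and planning, the paper shows that a model fine-tuned on VisionArena-Chat outperforms one trained on Llava-Instruct-158K, with gains of 17 points on MMMU and 46 points on the WildVision benchmark.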

Link: https://arxiv.org/abs/2412.08687
Authors: Christopher Chou,Lisa Dunlap,Koki Mashita,Krishna Mandal,Trevor Darrell,Ion Stoica,Joseph E. Gonzalez,Wei-Lin Chiang
Keywords: authentic user-VLM interactions, capture authentic user-VLM, Chatbot Arena, user-VLM interactions, Chatbot Arena model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the growing adoption and capabilities of vision-language models (VLMs) comes the need for benchmarks that capture authentic user-VLM interactions. In response, we create VisionArena, a dataset of 230K real-world conversations between users and VLMs. Collected from Chatbot Arena - an open-source platform where users interact with VLMs and submit preference votes - VisionArena spans 73K unique users, 45 VLMs, and 138 languages. Our dataset contains three subsets: VisionArena-Chat, 200K single and multi-turn conversations between a user and a VLM; VisionArena-Battle, 30K conversations comparing two anonymous VLMs with user preference votes; and VisionArena-Bench, an automatic benchmark of 500 diverse user prompts that efficiently approximate the live Chatbot Arena model rankings. Additionally, we highlight the types of questions asked by users, the influence of response style on preference, and areas where models often fail. We find open-ended tasks like captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks. Lastly, we show finetuning the same base model on VisionArena-Chat outperforms Llava-Instruct-158K, with a 17-point gain on MMMU and a 46-point gain on the WildVision benchmark. Dataset at this https URL.

[CV-125] ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes

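[Quick Read]: This paper addresses the generation of realistic, interactive dynamics for traffic participants in street scene simulation, covering different participant types (such as vehicles and pedestrians) and the complex interactions between them. The key to the solution is ChatDyn, a system that uses a multi-LLM-agent role-playing approach to plan trajectories and behaviors for different traffic participants from natural language input. To turn these plans into realistic fine-grained dynamics, ChatDyn designs two novel executors: PedExecutor, a unified multi-task executor that generates pedestrian dynamics under different task plans, and VehExecutor, a physical transition-based policy that generates physically plausible vehicle dynamics. ChatDyn can thus generate realistic driving scene dynamics with multiple vehicles and pedestrians, and it significantly outperforms previous methods on subtasks.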

Link: https://arxiv.org/abs/2412.08685
Authors: Yuxi Wei,Jingbo Wang,Yuwen Du,Dingju Wang,Liang Pan,Chenxin Xu,Yao Feng,Bo Dai,Siheng Chen
Keywords (EN): street scene simulation, realistic, dynamics, scene simulation, specific instruction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating realistic and interactive dynamics of traffic participants according to specific instruction is critical for street scene simulation. However, there is currently a lack of a comprehensive method that generates realistic dynamics of different types of participants including vehicles and pedestrians, with different kinds of interactions between them. In this paper, we introduce ChatDyn, the first system capable of generating interactive, controllable and realistic participant dynamics in street scenes based on language instructions. To achieve precise control through complex language, ChatDyn employs a multi-LLM-agent role-playing approach, which utilizes natural language inputs to plan the trajectories and behaviors for different traffic participants. To generate realistic fine-grained dynamics based on the planning, ChatDyn designs two novel executors: the PedExecutor, a unified multi-task executor that generates realistic pedestrian dynamics under different task plannings; and the VehExecutor, a physical transition-based policy that generates physically plausible vehicle dynamics. Extensive experiments show that ChatDyn can generate realistic driving scene dynamics with multiple vehicles and pedestrians, and significantly outperforms previous methods on subtasks. Code and model will be available at this https URL.

[CV-126] Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion

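[Quick Read]: This paper addresses two problems in 3D portrait video reconstruction: per-frame single-image reconstruction is temporally inconsistent and forgets the user's appearance, while self-reenactment methods fail to faithfully preserve the user's per-frame appearance (for example, instantaneous facial expression and lighting). The key to the solution is a fusion-based method that combines a canonical 3D prior from a reference view with the dynamic appearance of each input frame, producing temporally stable 3D video that faithfully reconstructs per-frame appearance. Trained only on synthetic data produced by an expression-conditioned 3D generative adversarial network (GAN), the method achieves state-of-the-art 3D reconstruction and temporal consistency on both in-studio and in-the-wild datasets.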

Link: https://arxiv.org/abs/2412.08684
Authors: Shengze Wang,Xueting Li,Chao Liu,Matthew Chan,Michael Stengel,Henry Fuchs,Shalini De Mello,Koki Nagano
Keywords (EN): Recent breakthroughs, enabled telepresence systems, breakthroughs in single-image, systems to stream, camera in real-time
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.00794

Abstract:Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user’s appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but fail to faithfully preserve the user’s per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, none of these two frameworks is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos with faithful reconstruction of the user’s per-frame appearance. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets. this https URL

[CV-127] Emotional Vietnamese Speech-Based Depression Diagnosis Using Dynamic Attention Mechanism

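[Quick Read]: This paper addresses the early diagnosis of depression by analyzing a patient's speech signal for signs of the condition. The key to the solution is the Dynamic Convolutional Block Attention Module (Dynamic-CBAM), used together with an Attention-GRU network for emotion classification. By analyzing audio signals, the model identifies patients who are depressed or prone to depression. On the VNEMOS dataset it reaches an Unweighted Accuracy (UA) of 0.87, a Weighted Accuracy (WA) of 0.86, and an F1 score of 0.87.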

Link: https://arxiv.org/abs/2412.08683
Authors: Quang-Anh N.D.,Manh-Hung Ha,Thai Kim Dinh,Minh-Duc Pham,Ninh Nguyen Van
Keywords (EN): Major depressive disorder, mental health condition, Major depressive, depressive disorder, mental health
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 5 figures

Abstract: Major depressive disorder is a prevalent and serious mental health condition that negatively impacts a person's emotions, thoughts, actions, and overall perception of the world. Determining whether a person is depressed is difficult because the symptoms of depression are often not apparent. However, the voice can be one of the factors that reveal signs of depression. People who are depressed express discomfort and sadness, and they may speak slowly, with a trembling voice, and with little emotion. In this study, we propose the Dynamic Convolutional Block Attention Module (Dynamic-CBAM), used within an Attention-GRU network, to classify emotions by analyzing human audio signals. Based on the results, we can identify which patients are depressed or prone to depression, so that treatment and prevention can start as soon as possible. The research delves into the computational steps involved in implementing an Attention-GRU deep learning architecture. In experiments, the model achieved an Unweighted Accuracy (UA) of 0.87, a Weighted Accuracy (WA) of 0.86, and an F1 score of 0.87 on the VNEMOS dataset. Training code is released at this https URL

[CV-128] A Deep Semantic Segmentation Network with Semantic and Contextual Refinements

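[Quick Read]: This paper addresses the misalignment problem in semantic segmentation that arises when high-level feature maps are restored to full resolution after their spatial resolution has been progressively reduced. The key to the solution is a Semantic Refinement Module (SRM) that learns a transformation offset for each pixel of the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. Applying these offsets strengthens the network's semantic representation, particularly for pixels near object boundaries. The paper also proposes a Contextual Refinement Module (CRM) that captures global context across both spatial and channel dimensions and aggregates the semantic maps from all four backbone stages to enrich channel context. The modules are validated on three widely used datasets (Cityscapes, Bdd100K, and ADE20K), where they outperform state-of-the-art methods.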

Link: https://arxiv.org/abs/2412.08671
Authors: Zhiyan Wang,Deyin Liu,Lin Yuanbo Wu,Song Wang,Xin Guo,Lin Qi
Keywords (EN): feature maps, editing contents, images and videos, fundamental task, contents of images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted by TMM

Abstract:Semantic segmentation is a fundamental task in multimedia processing, which can be used for analyzing, understanding, editing contents of images and videos, among others. To accelerate the analysis of multimedia data, existing segmentation researches tend to extract semantic information by progressively reducing the spatial resolutions of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets-Cityscapes, Bdd100K, and ADE20K-demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.
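The abstract reports results as mIoU (mean intersection-over-union). As a rough illustration of what that metric measures, here is a minimal sketch that assumes predictions and ground truth are flat lists of integer class labels. The function name `mean_iou` and the toy label maps are hypothetical and are not taken from the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes (illustrative helper)."""
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy 6-pixel example with 3 classes (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

Benchmark implementations usually accumulate intersections and unions over the whole validation set before dividing, so per-image averages like this one can differ from reported numbers.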

[CV-129] A feature refinement module for light-weight semantic segmentation network ICIP2023

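[Quick Read]: This paper addresses the trade-off between inference speed and segmentation accuracy in real-world semantic segmentation. Existing methods speed up inference by designing light-weight networks, which often causes a considerable drop in accuracy. The key to the solution is a Feature Refinement Module (FRM) that extracts semantics from the multi-stage feature maps generated by the backbone and captures non-local contextual information with a transformer block, improving the light-weight network's ability to obtain semantic information without a large increase in computational complexity. Experiments on Cityscapes and Bdd100K show a promising balance between accuracy and computational cost, notably 80.4% mIoU on the Cityscapes test set at only 214.82 GFLOPs.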

Link: https://arxiv.org/abs/2412.08670
Authors: Zhiyan Wang,Xin Guo,Song Wang,Peixiao Zheng,Lin Qi
Keywords (EN): Low computational complexity, high segmentation accuracy, semantic segmentation tasks, real-world semantic segmentation, Low computational
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted by ICIP 2023

Abstract:Low computational complexity and high segmentation accuracy are both essential to the real-world semantic segmentation tasks. However, to speed up the model inference, most existing approaches tend to design light-weight networks with a very limited number of parameters, leading to a considerable degradation in accuracy due to the decrease of the representation ability of the networks. To solve the problem, this paper proposes a novel semantic segmentation method to improve the capacity of obtaining semantic information for the light-weight network. Specifically, a feature refinement module (FRM) is proposed to extract semantics from multi-stage feature maps generated by the backbone and capture non-local contextual information by utilizing a transformer block. On Cityscapes and Bdd100K datasets, the experimental results demonstrate that the proposed method achieves a promising trade-off between accuracy and computational cost, especially for Cityscapes test set where 80.4% mIoU is achieved and only 214.82 GFLOPs are required.

[CV-130] Detecting Visual Triggers in Cannabis Imagery: A CLIP-Based Multi-Labeling Framework with Local-Global Aggregation

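[Quick Read]: This paper examines how visual and textual features in online discussions of cannabis edibles affect user engagement, and it offers actionable insights for policymakers and regulators designing warning labels and marketing regulations. The key to the approach is using the CLIP model to analyze 42,743 Facebook images, detecting food-related visuals and measuring the influence of image attributes such as colorfulness and brightness on user interaction. For text, the BART model (a denoising autoencoder) classifies ten topics derived from structural topic modeling, which are then related to engagement. Food-related visuals (such as fruit, candy, and bakery items) are significantly and positively correlated with engagement, as are certain text topics such as cannabis legalization. In contrast, image colorfulness and certain other textual themes are negatively associated with engagement.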

Link: https://arxiv.org/abs/2412.08648
Authors: Linqi Lu,Xianshi Yu,Akhil Perumal Reddy
Keywords (EN): study investigates, investigates the interplay, features in online, online discussions, user engagement
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: This project was initiated in September 2023

Abstract:This study investigates the interplay of visual and textual features in online discussions about cannabis edibles and their impact on user engagement. Leveraging the CLIP model, we analyzed 42,743 images from Facebook (March 1 to August 31, 2021), with a focus on detecting food-related visuals and examining the influence of image attributes such as colorfulness and brightness on user interaction. For textual analysis, we utilized the BART model as a denoising autoencoder to classify ten topics derived from structural topic modeling, exploring their relationship with user engagement. Linear regression analysis identified significant positive correlations between food-related visuals (e.g., fruit, candy, and bakery) and user engagement scores, as well as between engagement and text topics such as cannabis legalization. In contrast, negative associations were observed with image colorfulness and certain textual themes. These findings offer actionable insights for policymakers and regulatory bodies in designing warning labels and marketing regulations to address potential risks associated with recreational cannabis edibles.

[CV-131] Open-Source Acceleration of Stable-Diffusion.cpp

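[Quick Read]: This paper addresses the high computational latency and memory consumption of Stable Diffusion image generation. The key to the solution is optimizing the ggml_conv_2d operator in the Sdcpp framework by using the Winograd algorithm to accelerate 2D convolution, the primary bottleneck in the pipeline. By analyzing both dependent and independent computation graphs, the work exploits the device's locality and parallelism to achieve substantial performance gains. Experiments show a speedup of up to 2.76x for individual convolutional layers and up to 4.79x for the overall image generation process.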

Link: https://arxiv.org/abs/2412.05781
Authors: Jingxu Ng,Cheng Lv,Pu Zhao,Wei Niu,Juyi Lin,Minzhou Pan,Yun Liang,Yanzhi Wang
Keywords (EN): generating high-quality images, Stable diffusion plays, plays a crucial, crucial role, role in generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Stable diffusion plays a crucial role in generating high-quality images. However, image generation is time-consuming and memory-intensive. To address this, Stable-Diffusion.cpp (Sdcpp) emerges as an efficient inference framework to accelerate the diffusion models. Although it is lightweight, the current implementation of ggml_conv_2d operator in Sdcpp is suboptimal, exhibiting both high inference latency and massive memory usage. To address this, in this work, we present an optimized version of Sdcpp leveraging the Winograd algorithm to accelerate 2D convolution operations, which is the primary bottleneck in the pipeline. By analyzing both dependent and independent computation graphs, we exploit the device’s locality and parallelism to achieve substantial performance improvements. Our framework delivers correct end-to-end results across various stable diffusion models, including SDv1.4, v1.5, v2.1, SDXL, and SDXL-Turbo. Our evaluation results demonstrate a speedup up to 2.76x for individual convolutional layers and an inference speedup up to 4.79x for the overall image generation process, compared with the original Sdcpp on M1 Pro. Homepage: this https URL
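To make the Winograd idea concrete, the sketch below shows the textbook 1D case F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is only an illustration of the general technique. The paper's kernel targets 2D convolution inside ggml, and the function names here (`winograd_f2_3`, `direct_f2_3`) are hypothetical.

```python
def winograd_f2_3(d, g):
    """Two outputs of a 3-tap sliding dot product from 4 inputs, 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # Input transform followed by the 4 element-wise multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f2_3(d, g):
    """Reference: the same two outputs with 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
fast = winograd_f2_3(d, g)
slow = direct_f2_3(d, g)
print(fast, slow)
```

The 2D variant nests this transform along rows and columns, which is where the larger savings in multiplications come from. The extra additions and the transformed-filter storage are the usual costs of the trade.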

[CV-132] Embeddings are all you need! Achieving High Performance Medical Image Classification through Training-Free Embedding Analysis

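[Quick Read]: This paper addresses the heavy resource use and long computation time of training and testing conventional AI and machine learning models for medical image analysis. The key to the solution is an embedding-based approach: pre-trained foundation models (such as ResNet and CLIP) generate image embeddings, and simple linear classifiers handle the multi-class classification tasks. Across several medical imaging modalities (retinal images, mammography, dermatoscopic images, and chest radiographs), the approach matches or exceeds conventionally trained benchmark models, surpassing benchmark AUC-ROC scores by up to 87%, while significantly reducing computational demands. It offers a more efficient and sustainable alternative for medical image analysis.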

Link: https://arxiv.org/abs/2412.09445
Authors: Raj Hansini Khoiwal,Alan B. McMillan
Keywords (EN): Developing artificial intelligence, typically involves extensive, Developing artificial, involves extensive training, imaging typically involves
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures, 3 tables

Abstract:Developing artificial intelligence (AI) and machine learning (ML) models for medical imaging typically involves extensive training and testing on large datasets, consuming significant computational time, energy, and resources. There is a need for more efficient methods that can achieve comparable or superior diagnostic performance without the associated resource burden. We investigated the feasibility of replacing conventional training procedures with an embedding-based approach that leverages concise and semantically meaningful representations of medical images. Using pre-trained foundational models-specifically, convolutional neural networks (CNN) like ResNet and multimodal models like Contrastive Language-Image Pre-training (CLIP)-we generated image embeddings for multi-class classification tasks. Simple linear classifiers were then applied to these embeddings. The approach was evaluated across diverse medical imaging modalities, including retinal images, mammography, dermatoscopic images, and chest radiographs. Performance was compared to benchmark models trained and tested using traditional methods. The embedding-based models surpassed the benchmark area under the receiver operating characteristic curve (AUC-ROC) scores by up to 87 percentage in multi-class classification tasks across the various medical imaging modalities. Notably, CLIP embedding models achieved the highest AUC-ROC scores, demonstrating superior classification performance while significantly reducing computational demands. Our study indicates that leveraging embeddings from pre-trained foundational models can effectively replace conventional, resource-intensive training and testing procedures in medical image analysis. This embedding-based approach offers a more efficient alternative for image segmentation, classification, and prediction, potentially accelerating AI technology integration into clinical practice.
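The pipeline described above (frozen embeddings, then a simple linear classifier) can be sketched in a few lines. The example below is a toy under loud assumptions: the 3-dimensional "embeddings" are hand-made stand-ins for vectors that would come from a pre-trained model such as ResNet or CLIP, the task is binary rather than multi-class, and the helper names `train_logreg` and `predict` are hypothetical.

```python
import math


def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression fitted with full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for i in range(dim):
                grad_w[i] += (p - t) * x[i]
            grad_b += p - t
        w = [wi - lr * gi / len(X) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Stand-in embeddings (assumption: real ones come from a frozen encoder).
X = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.8, 0.0, 0.2],
     [0.1, 1.0, 0.0], [0.2, 0.9, 0.1], [0.0, 0.8, 0.2]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_logreg(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
print(accuracy, predict(w, b, [0.1, 0.7, 0.3]))
```

Only the small linear head is fitted here; the encoder that produces the embeddings stays untouched, which is what keeps the compute cost low.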

[CV-133] A Plug-and-Play Algorithm for 3D Video Super-Resolution of Single-Photon LiDAR data

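[Quick Read]: This paper addresses the challenges of 3D reconstruction of dynamic scenes from single-photon avalanche diode (SPAD) data, in particular motion blur and low resolution. The key to the solution is a novel computational imaging algorithm that alternates between guided video super-resolution of the 3D scene and precise image realignment using optical flow, improving both the precision and the resolution of the reconstruction. The method follows a plug-and-play strategy within an optimization scheme. It significantly improves image resolution across various signal-to-noise ratios and photon levels, and its robustness and versatility are validated in several real-world scenarios.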

Link: https://arxiv.org/abs/2412.09427
Authors: Alice Ruget,Lewis Wilson,Jonathan Leach,Rachael Tobin,Aongus Mccarthy,Gerald S. Buller,Steve Mclaughlin,Abderrahim Halimi
Keywords (EN): Counting detection techniques, Single-Photon Counting detection, time-correlated Single-Photon Counting, Single-photon avalanche diodes, Counting detection
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures

Abstract:Single-photon avalanche diodes (SPADs) are advanced sensors capable of detecting individual photons and recording their arrival times with picosecond resolution using time-correlated Single-Photon Counting detection techniques. They are used in various applications, such as LiDAR, and can capture high-speed sequences of binary single-photon images, offering great potential for reconstructing 3D environments with high motion dynamics. To complement single-photon data, they are often paired with conventional passive cameras, which capture high-resolution (HR) intensity images at a lower frame rate. However, 3D reconstruction from SPAD data faces challenges. Aggregating multiple binary measurements improves precision and reduces noise but can cause motion blur in dynamic scenes. Additionally, SPAD arrays often have lower resolution than passive cameras. To address these issues, we propose a novel computational imaging algorithm to improve the 3D reconstruction of moving scenes from SPAD data by addressing the motion blur and increasing the native spatial resolution. We adopt a plug-and-play approach within an optimization scheme alternating between guided video super-resolution of the 3D scene, and precise image realignment using optical flow. Experiments on synthetic data show significantly improved image resolutions across various signal-to-noise ratios and photon levels. We validate our method using real-world SPAD measurements on three practical situations with dynamic objects. First on fast-moving scenes in laboratory conditions at short range; second very low resolution imaging of people with a consumer-grade SPAD sensor from STMicroelectronics; and finally, HR imaging of people walking outdoors in daylight at a range of 325 meters under eye-safe illumination conditions using a short-wave infrared SPAD camera. These results demonstrate the robustness and versatility of our approach.

[CV-134] Learned Compression for Compressed Learning

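[Quick Read]: This paper addresses the inefficiency of existing compression systems for compressed-domain learning. Linear transform coding and end-to-end learned compression reduce bitrate but do not uniformly reduce dimensionality, so the efficiency gain is limited. The key to the solution is WaLLoC (Wavelet Learned Lossy Compression), a neural codec architecture that combines linear transform coding with nonlinear dimensionality-reducing autoencoders. WaLLoC places a shallow, asymmetric autoencoder and an entropy bottleneck between an invertible wavelet packet transform and its inverse, giving higher effective resolution at lower dimensionality and outperforming, on several key metrics, the autoencoders used in state-of-the-art latent diffusion models. Its encoder consists almost entirely of linear operations, which makes it efficient and well suited to mobile computing, remote sensing, and learning directly from compressed data.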

Link: https://arxiv.org/abs/2412.09405
Authors: Dan Jacobellis,Neeraja J. Yadwadkar
Keywords (EN): Modern sensors produce, sensors produce increasingly, produce increasingly rich, increasingly rich streams, Modern sensors
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: Accepted as paper to 2025 IEEE Data Compression Conference

Abstract:Modern sensors produce increasingly rich streams of high-resolution data. Due to resource constraints, machine learning systems discard the vast majority of this information via resolution reduction. Compressed-domain learning allows models to operate on compact latent representations, allowing higher effective resolution for the same budget. However, existing compression systems are not ideal for compressed learning. Linear transform coding and end-to-end learned compression systems reduce bitrate, but do not uniformly reduce dimensionality; thus, they do not meaningfully increase efficiency. Generative autoencoders reduce dimensionality, but their adversarial or perceptual objectives lead to significant information loss. To address these limitations, we introduce WaLLoC (Wavelet Learned Lossy Compression), a neural codec architecture that combines linear transform coding with nonlinear dimensionality-reducing autoencoders. WaLLoC sandwiches a shallow, asymmetric autoencoder and entropy bottleneck between an invertible wavelet packet transform. Across several key metrics, WaLLoC outperforms the autoencoders used in state-of-the-art latent diffusion models. WaLLoC does not require perceptual or adversarial losses to represent high-frequency detail, providing compatibility with modalities beyond RGB images and stereo audio. WaLLoC’s encoder consists almost entirely of linear operations, making it exceptionally efficient and suitable for mobile computing, remote sensing, and learning directly from compressed data. We demonstrate WaLLoC’s capability for compressed-domain learning across several tasks, including image classification, colorization, document understanding, and music source separation. Our code, experiments, and pre-trained audio and image codecs are available at this https URL
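The "invertible wavelet transform" that wraps the learned part of such a codec can be illustrated with the simplest possible case. The sketch below uses a single-level 1D Haar transform as a stand-in. This is an assumption made for illustration only: WaLLoC uses a wavelet packet transform, and the function names `haar_forward` and `haar_inverse` are hypothetical.

```python
import math


def haar_forward(x):
    """One Haar level: split an even-length signal into two half-length bands."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward (up to floating-point rounding)."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
approx, detail = haar_forward(signal)
restored = haar_inverse(approx, detail)
print(len(approx), len(detail), restored)
```

The forward step trades sample resolution for channels (two bands of half the length) without losing information, so any loss in a codec of this shape comes from whatever sits between the forward and inverse transforms.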

[CV-135] Multi-Stage Segmentation and Cascade Classification Methods for Improving Cardiac MRI Analysis

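[Quick Read]: This paper addresses the segmentation and classification of cardiac magnetic resonance images, where current approaches struggle with accuracy and generalizability. The key to the solution is a deep learning approach built as a multi-stage pipeline: U-Net and ResNet models perform segmentation, followed by Gaussian smoothing, which improves segmentation accuracy to a Dice coefficient of 0.974 for the left ventricle and 0.947 for the right ventricle. For classification, a cascade of deep learning classifiers distinguishes heart conditions including hypertrophic cardiomyopathy, myocardial infarction, and dilated cardiomyopathy, with an average accuracy of 97.2%. The approach outperforms existing models in both segmentation accuracy and classification precision and shows promise for clinical use, although further validation and interpretation across diverse imaging protocols is still needed.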

Link: https://arxiv.org/abs/2412.09386
Authors: Vitalii Slobodzian,Pavlo Radiuk,Oleksander Barmak,Iurii Krak
Keywords (EN): cardiac magnetic resonance, current approaches face, approaches face challenges, cardiac magnetic, diagnosing heart conditions
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Cardiac MRI, heart pathology, deep learning, segmentation, Gaussian smoothing, classification, cascade

Abstract:The segmentation and classification of cardiac magnetic resonance imaging are critical for diagnosing heart conditions, yet current approaches face challenges in accuracy and generalizability. In this study, we aim to further advance the segmentation and classification of cardiac magnetic resonance images by introducing a novel deep learning-based approach. Using a multi-stage process with U-Net and ResNet models for segmentation, followed by Gaussian smoothing, the method improved segmentation accuracy, achieving a Dice coefficient of 0.974 for the left ventricle and 0.947 for the right ventricle. For classification, a cascade of deep learning classifiers was employed to distinguish heart conditions, including hypertrophic cardiomyopathy, myocardial infarction, and dilated cardiomyopathy, achieving an average accuracy of 97.2%. The proposed approach outperformed existing models, enhancing segmentation accuracy and classification precision. These advancements show promise for clinical applications, though further validation and interpretation across diverse imaging protocols is necessary.
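Segmentation quality above is reported as a Dice coefficient. As a small illustration of how that number is defined, the sketch below computes Dice for two binary masks given as flat lists of 0 and 1. The function name `dice` and the toy masks are hypothetical, not the paper's evaluation code.

```python
def dice(pred, target):
    """Dice coefficient 2*|A∩B| / (|A| + |B|) for binary masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


# Toy masks: 3 predicted pixels, 3 true pixels, 2 in common.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 1, 1, 0, 1, 0]
d = dice(pred_mask, true_mask)
print(f"Dice = {d:.3f}")
```

A value of 0.974 therefore means the predicted and reference regions overlap almost completely relative to their combined size.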

[CV-136] Physics-Driven Autoregressive State Space Models for Medical Image Reconstruction

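[Quick Read]: This paper addresses medical image reconstruction from undersampled acquisitions, specifically how to suppress undersampling artifacts while staying consistent with the acquired data. The key to the solution is a novel physics-driven autoregressive state space model, MambaRoll. In each cascade of an unrolled architecture, MambaRoll uses an autoregressive framework built on physics-driven state space modules (PSSM), which efficiently aggregate contextual features at a given spatial scale while maintaining fidelity to the acquired data. Autoregressively predicting next-scale feature maps from earlier spatial scales lets the model capture multi-scale context. On accelerated MRI and sparse-view CT reconstruction, MambaRoll outperforms state-of-the-art physics-driven methods based on convolutional, transformer, and conventional state space modules.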

Link: https://arxiv.org/abs/2412.09331
Authors: Bilal Kabas,Fuat Arslan,Valiyeh A. Nezhad,Saban Ozturk,Emine U. Saritas,Tolga Çukur
Keywords (EN): operator linking measurement, imaging operator linking, undersampled acquisitions, ill-posed problem, problem that involves
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures

Abstract:Medical image reconstruction from undersampled acquisitions is an ill-posed problem that involves inversion of the imaging operator linking measurement and image domains. In recent years, physics-driven (PD) models have gained prominence in learning-based reconstruction given their enhanced balance between efficiency and performance. For reconstruction, PD models cascade data-consistency modules that enforce fidelity to acquired data based on the imaging operator, with network modules that process feature maps to alleviate image artifacts due to undersampling. Success in artifact suppression inevitably depends on the ability of the network modules to tease apart artifacts from underlying tissue structures, both of which can manifest contextual relations over broad spatial scales. Convolutional modules that excel at capturing local correlations are relatively insensitive to non-local context. While transformers promise elevated sensitivity to non-local context, practical implementations often suffer from a suboptimal trade-off between local and non-local sensitivity due to intrinsic model complexity. Here, we introduce a novel physics-driven autoregressive state space model (MambaRoll) for enhanced fidelity in medical image reconstruction. In each cascade of an unrolled architecture, MambaRoll employs an autoregressive framework based on physics-driven state space modules (PSSM), where PSSMs efficiently aggregate contextual features at a given spatial scale while maintaining fidelity to acquired data, and autoregressive prediction of next-scale feature maps from earlier spatial scales enhance capture of multi-scale contextual features. Demonstrations on accelerated MRI and sparse-view CT reconstructions indicate that MambaRoll outperforms state-of-the-art PD methods based on convolutional, transformer and conventional SSM modules.

[CV-137] Computer-Aided Osteoporosis Diagnosis Using Transfer Learning with Enhanced Features from Stacked Deep Learning Modules

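[Quick Read]: This paper addresses the early detection of knee osteoporosis, aiming to improve patient outcomes through more accurate and efficient diagnosis. The key to the solution is a computer-aided diagnosis (CAD) system that combines transfer learning with stacked feature enhancement deep learning blocks. A pre-trained convolutional neural network (CNN) extracts features from knee X-ray images, and five sequential Conv-ReLU-MaxPooling blocks then enhance those features. This design captures high-level features related to bone structure, joint deformation, and osteoporotic markers, which a classification module uses to separate healthy from osteoporotic knees. In experiments the model reaches accuracies between 97.27% and 98.24% on three individual datasets and 98.00% on the combined dataset, an improvement of around 2% over existing methods.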

Link: https://arxiv.org/abs/2412.09330
Authors: Ayesha Siddiqua,Rakibul Hasan,Anichur Rahman,Abu Saleh Musa Miah
Keywords (EN): increasing fracture risk, increasing fracture, fracture risk, Knee osteoporosis weakens, Knee osteoporosis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Knee osteoporosis weakens the bone tissue in the knee joint, increasing fracture risk. Early detection through X-ray images enables timely intervention and improved patient outcomes. While some researchers have focused on diagnosing knee osteoporosis through manual radiology evaluation and traditional machine learning using hand-crafted features, these methods often struggle with performance and efficiency due to reliance on manual feature extraction and subjective interpretation. In this study, we propose a computer-aided diagnosis (CAD) system for knee osteoporosis, combining transfer learning with stacked feature enhancement deep learning blocks. Initially, knee X-ray images are preprocessed, and features are extracted using a pre-trained Convolutional Neural Network (CNN). These features are then enhanced through five sequential Conv-RELU-MaxPooling blocks. The Conv2D layers detect low-level features, while the ReLU activations introduce non-linearity, allowing the network to learn complex patterns. MaxPooling layers down-sample the features, retaining the most important spatial information. This sequential processing enables the model to capture complex, high-level features related to bone structure, joint deformation, and osteoporotic markers. The enhanced features are passed through a classification module to differentiate between healthy and osteoporotic knee conditions. Extensive experiments on three individual datasets and a combined dataset demonstrate that our model achieves 97.32%, 98.24%, 97.27%, and 98.00% accuracy for OKX Kaggle Binary, KXO-Mendeley Multi-Class, OKX Kaggle Multi-Class, and the combined dataset, respectively, showing an improvement of around 2% over existing methods.

[CV-138] Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations ICLR2025

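[Quick Read]: This paper asks whether the primate ventral visual stream is optimized only for object categorization or also for estimating spatial latents such as object position and pose. The key to the approach is training convolutional neural networks (CNNs) on synthetic image datasets generated by a 3D graphics engine to estimate different combinations of spatial and category latents. Models trained to estimate just a few spatial latents achieve neural alignment scores comparable to models trained on hundreds of categories, and spatial latent performance correlates strongly with neural alignment. Models trained on spatial latents and on categories develop very similar, but not identical, internal representations, especially in early and middle layers. The paper provides evidence that this convergence is partly driven by non-target latent variability in the training data, which promotes implicit learning of those non-target latents. Overall, the results suggest that many training objectives, including spatial latents, can produce similar models that align neurally with the ventral stream, so one should not assume the ventral stream is optimized for object categorization only.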

Link: https://arxiv.org/abs/2412.09115
Authors: Yudi Xie,Weichen Huang,Esther Alter,Jeremy Schwartz,Joshua B. Tenenbaum,James J. DiCarlo
Keywords (EN): primate ventral visual, ventral stream, object categorization, ventral, ventral visual stream
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 29 pages, 20 figures, ICLR 2025

Abstract:Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring – despite much prior evidence – its role in estimating “spatial” latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different – if at all – are representations learned from spatial latent estimation compared to categorization? To ask these questions, we leveraged synthetic image datasets generated by a 3D graphic engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar – but not identical – internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures of comparing models to brains to better understand the functional roles of the ventral stream.

[CV-139] A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection

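Quick read: This paper addresses the difficulty of accurately detecting surface defects and distinguishing them from noise in industrial imaging under complex, noisy conditions. The key to the solution is a hybrid framework that combines statistical feature selection with classification techniques to improve defect detection accuracy and reduce false positives. Specifically, the framework generates scalar scores that represent the likelihood that a region of interest (ROI) is classified as a defect or as noise, and applies statistical methods such as Fisher separation, the chi-squared test, and variance analysis to 55 features extracted from industrial images to identify the most discriminative ones, thereby maximizing the separation between defects and noise. Fisher's criterion ensures robust, real-time performance for automated systems. The framework can serve as a standalone assessment module or as an a posteriori enhancement to machine learning classifiers, providing an adaptable quality-control layer that optimizes predictions.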

Link: https://arxiv.org/abs/2412.08800
Authors: Alejandro Garnung Menéndez
Keywords: distinguishing surface defects, accurately detecting, critical and challenging, detecting and distinguishing, distinguishing surface
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 17 figures


Abstract:In industrial imaging, accurately detecting and distinguishing surface defects from noise is critical and challenging, particularly in complex environments with noisy data. This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy while minimizing false positives. The motivation of the system is based on the generation of scalar scores that represent the likelihood that a region of interest (ROI) is classified as a defect or noise. We present around 55 distinguished features that are extracted from industrial images, which are then analyzed using statistical methods such as Fisher separation, chi-squared test, and variance analysis. These techniques identify the most discriminative features, focusing on maximizing the separation between true defects and noise. Fisher’s criterion ensures robust, real-time performance for automated systems. This statistical framework opens up multiple avenues for application, functioning as a standalone assessment module or as an a posteriori enhancement to machine learning classifiers. The framework can be implemented as a black-box module that applies to existing classifiers, providing an adaptable layer of quality control and optimizing predictions by leveraging intuitive feature extraction strategies, emphasizing the rationale behind feature significance and the statistical rigor of feature selection. By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications, especially in complex, noisy environments.
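As a rough illustration of the feature-ranking step described above, the sketch below scores each feature with the common two-class Fisher separation, (mean difference)² / (sum of class variances), and sorts features by that score. This is an assumption about the general technique, not the paper's exact formulation; the function names, feature names, and toy values are all hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(defect_values, noise_values):
    """Two-class Fisher separation for a single feature.

    Computes (mu_defect - mu_noise)^2 / (var_defect + var_noise).
    Assumed textbook form; the paper may use a different variant.
    """
    mu_d, mu_n = mean(defect_values), mean(noise_values)
    denom = pvariance(defect_values) + pvariance(noise_values)
    if denom == 0:
        # Both classes are constant: perfectly separated or identical.
        return float("inf") if mu_d != mu_n else 0.0
    return (mu_d - mu_n) ** 2 / denom


def rank_features(defect_rois, noise_rois):
    """Return (feature, score) pairs, most discriminative first.

    Each argument maps a feature name to its values across ROIs.
    """
    scores = {
        name: fisher_score(defect_rois[name], noise_rois[name])
        for name in defect_rois
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: "contrast" separates the classes, "area" does not.
defect_rois = {"contrast": [0.9, 0.8, 0.85], "area": [10.0, 30.0, 20.0]}
noise_rois = {"contrast": [0.2, 0.3, 0.25], "area": [12.0, 28.0, 22.0]}
ranking = rank_features(defect_rois, noise_rois)
```

In this toy data the class means of `contrast` are far apart relative to their spread, so it ranks ahead of `area`, whose class distributions overlap almost completely.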

Artificial Intelligence

[AI-0] A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks

Link: https://arxiv.org/abs/2412.09579
Authors: Saptarshi Mandal,Xiaojun Lin,R. Srikant
Keywords: achieved substantial empirical, substantial empirical success, pre-trained large teacher, Knowledge distillation, student model learns
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Main Body of the Paper is under Review at L4DC 2025


Abstract:Knowledge distillation, where a small student model learns from a pre-trained large teacher model, has achieved substantial empirical success since the seminal work of \citep{hinton2015distilling}. Despite prior theoretical studies exploring the benefits of knowledge distillation, an important question remains unanswered: why does soft-label training from the teacher require significantly fewer neurons than directly training a small neural network with hard labels? To address this, we first present motivating experimental results using simple neural network models on a binary classification problem. These results demonstrate that soft-label training consistently outperforms hard-label training in accuracy, with the performance gap becoming more pronounced as the dataset becomes increasingly difficult to classify. We then substantiate these observations with a theoretical contribution based on two-layer neural network models. Specifically, we show that soft-label training using gradient descent requires only $O\left(\frac{1}{\gamma^2 \epsilon}\right)$ neurons to achieve a classification loss averaged over epochs smaller than some $\epsilon > 0$, where $\gamma$ is the separation margin of the limiting kernel. In contrast, hard-label training requires $O\left(\frac{1}{\gamma^4} \cdot \ln\left(\frac{1}{\epsilon}\right)\right)$ neurons, as derived from an adapted version of the gradient descent analysis in \citep{ji2020polylogarithmic}. This implies that when $\gamma \leq \epsilon$, i.e., when the dataset is challenging to classify, the neuron requirement for soft-label training can be significantly lower than that for hard-label training. Finally, we present experimental results on deep neural networks, further validating these theoretical findings.
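To get a feel for the two neuron-count bounds quoted in the abstract, the sketch below evaluates 1/(γ²ε) for soft labels and (1/γ⁴)·ln(1/ε) for hard labels at one sample point. It drops the constants hidden by the O(·) notation, so the numbers are only indicative of scaling; the function names and the chosen values of γ and ε are illustrative assumptions.

```python
import math


def soft_label_neurons(gamma, eps):
    """Soft-label bound 1 / (gamma^2 * eps), ignoring O(.) constants."""
    return 1.0 / (gamma ** 2 * eps)


def hard_label_neurons(gamma, eps):
    """Hard-label bound (1 / gamma^4) * ln(1 / eps), ignoring O(.) constants."""
    return (1.0 / gamma ** 4) * math.log(1.0 / eps)


# Hypothetical hard-to-classify regime with gamma <= eps.
gamma, eps = 0.01, 0.05
soft = soft_label_neurons(gamma, eps)
hard = hard_label_neurons(gamma, eps)

# The ratio hard / soft simplifies to eps * ln(1/eps) / gamma^2.
ratio = hard / soft
```

Because the ratio equals ε·ln(1/ε)/γ², it is at least ln(1/ε)/ε whenever γ ≤ ε, which grows as ε shrinks; this is consistent with the abstract's claim about the small-margin regime.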

[AI-1] Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

Link: https://arxiv.org/abs/2412.09544
Authors: Paria Rashidinejad,Yuandong Tian
Keywords: reward hacking, reward hacking problem, Aligning AI systems, infamous reward hacking, Reward Hacking due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 46 pages, 3 figures


Abstract:Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu’s weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of preference optimization and develop a novel technique that dynamically updates preference labels toward certain “stationary labels”, resulting in diminishing gradients for untrustworthy samples. Empirically, POWER with dynamic labels (POWER-DL) consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard over DPO, while also improving or maintaining performance on downstream tasks such as mathematical reasoning. Strong theoretical guarantees and empirical results demonstrate the promise of POWER-DL in mitigating reward hacking.

[AI-2] The Parameters of Educability

Link: https://arxiv.org/abs/2412.09480
Authors: Leslie G. Valiant
Keywords: create advanced civilizations, existing biological species, species on Earth, advanced civilizations, recently proposed
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 13 pages


Abstract:The educability model is a computational model that has been recently proposed to describe the cognitive capability that makes humans unique among existing biological species on Earth in being able to create advanced civilizations. Educability is defined as a capability for acquiring and applying knowledge. It is intended both to describe human capabilities and, equally, as an aspirational description of what can be usefully realized by machines. While the intention is to have a mathematically well-defined computational model, in constructing an instance of the model there are a number of decisions to make. We call these decisions parameters. In a standard computer, two parameters are the memory capacity and clock rate. There is no universally optimal choice for either one, or even for their ratio. Similarly, in a standard machine learning system, two parameters are the learning algorithm and the dataset used for training. Again, there are no universally optimal choices known for either. An educable system has many more parameters than either of these two kinds of system. This short paper discusses some of the main parameters of educable systems, and the broader implications of their existence.

[AI-3] STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized Variational Autoencoders for Financial Trading

Link: https://arxiv.org/abs/2412.09468
Authors: Yilei Zhao,Wentao Zhang,Tingran Yang,Yong Jiang,Fei Huang,Wei Yang Bryan Lim
Keywords: capture excess returns, returns from mispricing, price assets, excess returns, factor
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:In financial trading, factor models are widely used to price assets and capture excess returns from mispricing. Recently, we have witnessed the rise of variational autoencoder-based latent factor models, which learn latent factors self-adaptively. While these models focus on modeling overall market conditions, they often fail to effectively capture the temporal patterns of individual stocks. Additionally, representing multiple factors as single values simplifies the model but limits its ability to capture complex relationships and dependencies. As a result, the learned factors are of low quality and lack diversity, reducing their effectiveness and robustness across different trading periods. To address these issues, we propose a Spatio-Temporal factOR Model based on dual vector quantized variational autoencoders, named STORM, which extracts features of stocks from temporal and spatial perspectives, then fuses and aligns these features at the fine-grained and semantic level, and represents the factors as multi-dimensional embeddings. The discrete codebooks cluster similar factor embeddings, ensuring orthogonality and diversity, which helps distinguish between different factors and enables factor selection in financial trading. To show the performance of the proposed factor model, we apply it to two downstream experiments: portfolio management on two stock datasets and individual trading tasks on six specific stocks. The extensive experiments demonstrate STORM’s flexibility in adapting to downstream tasks and superior performance over baseline models.

[AI-4] Solving Multiagent Path Finding on Highly Centralized Networks

Link: https://arxiv.org/abs/2412.09433
Authors: Foivos Fioravantes,Dušan Knop,Jan Matyáš Křišťan,Nikolaos Melissinos,Michal Opler,Tung Anh Vu
Keywords: Multiagent Path Finding, Multiagent Path, Path Finding, consists of identifying, identifying the trajectories
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
Comments:


Abstract:The Multiagent Path Finding (MAPF) problem consists of identifying the trajectories that a set of agents should follow inside a given network in order to reach their desired destinations as soon as possible, but without colliding with each other. We aim to minimize the maximum time any agent takes to reach their goal, ensuring optimal path length. In this work, we complement a recent thread of results that aim to systematically study the algorithmic behavior of this problem, through the parameterized complexity point of view. First, we show that MAPF is NP-hard when the given network has a star-like topology (bounded vertex cover number) or is a tree with 11 leaves. Both of these results fill important gaps in our understanding of the tractability of this problem that were left untreated in the recent work of [Fioravantes et al. Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology. AAAI’24]. Nevertheless, our main contribution is an exact algorithm that scales well as the input grows (FPT) when the topology of the given network is highly centralized (bounded distance to clique). This parameter is significant as it mirrors real-world networks. In such environments, a bunch of central hubs (e.g., processing areas) are connected to only a few peripheral nodes.

[AI-5] Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer ICRA2025

Link: https://arxiv.org/abs/2412.09417
Authors: Adam Labiosa,Zhihan Wang,Siddhant Agarwal,William Cong,Geethika Hemkumar,Abhinav Narayan Harish,Benjamin Hong,Josh Kelle,Chen Li,Yuhao Li,Zisen Shao,Peter Stone,Josiah P. Hanna
Keywords: multi-agent environments remains, Standard Platform League, partially observable, remains a difficult, difficult and unsolved
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to ICRA 2025


Abstract:Robot decision-making in partially observable, real-time, dynamic, and multi-agent environments remains a difficult and unsolved challenge. Model-free reinforcement learning (RL) is a promising approach to learning decision-making in such domains; however, end-to-end RL in complex environments is often intractable. To address this challenge in the RoboCup Standard Platform League (SPL) domain, we developed a novel architecture integrating RL within a classical robotics stack, while employing a multi-fidelity sim2real approach and decomposing behavior into learned sub-behaviors with heuristic selection. Our architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division. In this work, we fully describe our system’s architecture and empirically analyze key design decisions that contributed to its success. Our approach demonstrates how RL-based behaviors can be integrated into complete robot behavior architectures.

[AI-6] Uncommon Belief in Rationality AAAI-25

Link: https://arxiv.org/abs/2412.09407
Authors: Qi Shi,Pavel Naumov
Keywords: traditional standard assumption, Common knowledge, traditional standard, standard assumption, assumption in analysing
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: The 39th Annual AAAI Conference on Artificial Intelligence (AAAI-25)


Abstract:Common knowledge/belief in rationality is the traditional standard assumption in analysing interaction among agents. This paper proposes a graph-based language for capturing significantly more complicated structures of higher-order beliefs that agents might have about the rationality of the other agents. The two main contributions are a solution concept that captures the reasoning process based on a given belief structure and an efficient algorithm for compressing any belief structure into a unique minimal form.

[AI-7] Distributed Intelligent System Architecture for UAV-Assisted Monitoring of Wind Energy Infrastructure

Link: https://arxiv.org/abs/2412.09387
Authors: Serhii Svystun,Oleksandr Melnychenko,Pavlo Radiuk,Oleg Savenko,Andrii Lysyi
Keywords: renewable energy production, rapid development, development of green, key to sustainable, sustainable renewable energy
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Wind turbine inspection, UAV, intelligent systems, distributed architecture, defect detection, renewable energy maintenance, automated monitoring


Abstract:With the rapid development of green energy, the efficiency and reliability of wind turbines are key to sustainable renewable energy production. For that reason, this paper presents a novel intelligent system architecture designed for the dynamic collection and real-time processing of visual data to detect defects in wind turbines. The system employs advanced algorithms within a distributed framework to enhance inspection accuracy and efficiency using unmanned aerial vehicles (UAVs) with integrated visual and thermal sensors. An experimental study conducted at the “Staryi Sambir-1” wind power plant in Ukraine demonstrates the system’s effectiveness, showing a significant improvement in defect detection accuracy (up to 94%) and a reduction in inspection time per turbine (down to 1.5 hours) compared to traditional methods. The results show that the proposed intelligent system architecture provides a scalable and reliable solution for wind turbine maintenance, contributing to the durability and performance of renewable energy infrastructure.

[AI-8] AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities

Link: https://arxiv.org/abs/2412.09385
Authors: Fabrizio Davide,Pietro Torre,Andrea Gaggioli
Keywords: Artificial General Intelligence, General Intelligence, Artificial General, large language models, likelihood of Artificial
Categories: Artificial Intelligence (cs.AI)
Comments: 47 pages, 8 figures, 17 tables, appendix with data and code


Abstract:We tasked 16 state-of-the-art large language models (LLMs) with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030. To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR). The LLMs’ estimates varied widely, ranging from 3% (Reka-Core) to 47.6% (GPT-4o), with a median of 12.5%. These estimates closely align with a recent expert survey that projected a 10% likelihood of AGI by 2027, underscoring the relevance of LLMs in forecasting complex, speculative scenarios. The LLM-PR process demonstrated strong reliability, evidenced by a high Intraclass Correlation Coefficient (ICC = 0.79), reflecting notable consistency in scoring across the models. Among the models, Pplx-70b-online emerged as the top performer, while Gemini-1.5-pro-api ranked the lowest. A cross-comparison with external benchmarks, such as LMSYS Chatbot Arena, revealed that LLM rankings remained consistent across different evaluation methods, suggesting that existing benchmarks may not encapsulate some of the skills relevant for AGI prediction. We further explored the use of weighting schemes based on external benchmarks, optimizing the alignment of LLMs’ predictions with human expert forecasts. This analysis led to the development of a new ‘AGI benchmark’ designed to highlight performance differences in AGI-related tasks. Our findings offer insights into LLMs’ capabilities in speculative, interdisciplinary forecasting tasks and emphasize the growing need for innovative evaluation frameworks for assessing AI performance in complex, uncertain real-world scenarios.

[AI-9] Diffusion Model with Representation Alignment for Protein Inverse Folding

Link: https://arxiv.org/abs/2412.09380
Authors: Chenglin Wang,Yucheng Zhou,Zijie Zhai,Jianbing Shen,Kai Zhang
Keywords: protein backbone structure, problem in bioinformatics, aiming to recover, fundamental problem, Protein inverse folding
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Protein inverse folding is a fundamental problem in bioinformatics, aiming to recover the amino acid sequences from a given protein backbone structure. Despite the success of existing methods, they struggle to fully capture the intricate inter-residue relationships critical for accurate sequence prediction. We propose a novel method that leverages diffusion models with representation alignment (DMRA), which enhances diffusion-based inverse folding by (1) proposing a shared center that aggregates contextual information from the entire protein structure and selectively distributes it to each residue; and (2) aligning noisy hidden representations with clean semantic representations during the denoising process. This is achieved by predefined semantic representations for amino acid types and a representation alignment method that utilizes type embeddings as semantic feedback to normalize each residue. In experiments, we conduct extensive evaluations on the CATH4.2 dataset to demonstrate that DMRA outperforms leading methods, achieving state-of-the-art performance and exhibiting strong generalization capabilities on the TS50 and TS500 datasets.

[AI-10] Does Low Spoilage Under Cold Conditions Foster Cultural Complexity During the Foraging Era? – A Theoretical and Computational Inquiry

Link: https://arxiv.org/abs/2412.09335
Authors: Minhyeok Lee
Keywords: Human cultural complexity, Human cultural, Human, cultural complexity, cultural
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:


Abstract:Human cultural complexity did not arise in a vacuum. Scholars in the humanities and social sciences have long debated how ecological factors, such as climate and resource availability, enabled early hunter-gatherers to allocate time and energy beyond basic subsistence tasks. This paper presents a formal, interdisciplinary approach that integrates theoretical modeling with computational methods to examine whether conditions that allow lower spoilage of stored food, often associated with colder climates and abundant large fauna, could indirectly foster the emergence of cultural complexity. Our contribution is twofold. First, we propose a mathematical framework that relates spoilage rates, yield levels, resource management skills, and cultural activities. Under this framework, we prove that lower spoilage and adequate yields reduce the frequency of hunting, thus freeing substantial time for cultural pursuits. Second, we implement a reinforcement learning simulation, inspired by engineering optimization techniques, to validate the theoretical predictions. By training agents in different (Y,p) environments, where Y is yield and p is the probability of daily spoilage, we observe patterns consistent with the theoretical model: stable conditions with lower spoilage strongly correlate with increased cultural complexity. While we do not claim to replicate prehistoric social realities directly, our results suggest that ecologically stable niches provided a milieu in which cultural forms could germinate and evolve. This study, therefore, offers an integrative perspective that unites humanistic inquiries into the origins of culture with the formal rigor and exploratory power of computational modeling.

[AI-11] Towards Open-Vocabulary Video Semantic Segmentation

Link: https://arxiv.org/abs/2412.09329
Authors: Xinhao Li,Yun Liu,Guolei Sun,Min Wu,Le Zhang,Ce Zhu
Keywords: recent research, Semantic segmentation, focal point, point of recent, Open Vocabulary Video
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures


Abstract:Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model’s understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model’s capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV-VSS’s zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS’s effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.

[AI-12] Auto-Regressive Moving Diffusion Models for Time Series Forecasting

Link: https://arxiv.org/abs/2412.09328
Authors: Jiaxin Gao,Qinglong Cao,Yuntian Chen
Keywords: shown considerable promise, diffusion-based TSF models, TSF, Time series, diffusion-based TSF
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: no comment


Abstract:Time series forecasting (TSF) is essential in various domains, and recent advancements in diffusion-based TSF models have shown considerable promise. However, these models typically adopt traditional diffusion patterns, treating TSF as a noise-based conditional generation task. This approach neglects the inherent continuous sequential nature of time series, leading to a fundamental misalignment between diffusion mechanisms and the TSF objective, thereby severely impairing performance. To bridge this misalignment, and inspired by the classic Auto-Regressive Moving Average (ARMA) theory, which views time series as continuous sequential progressions evolving from previous data points, we propose a novel Auto-Regressive Moving Diffusion (ARMD) model to first achieve the continuous sequential diffusion-based TSF. Unlike previous methods that start from white Gaussian noise, our model employs chain-based diffusion with priors, accurately modeling the evolution of time series and leveraging intermediate state information to improve forecasting accuracy and stability. Specifically, our approach reinterprets the diffusion process by considering future series as the initial state and historical series as the final state, with intermediate series generated using a sliding-based technique during the forward process. This design aligns the diffusion model’s sampling procedure with the forecasting objective, resulting in an unconditional, continuous sequential diffusion TSF model. Extensive experiments conducted on seven widely used datasets demonstrate that our model achieves state-of-the-art performance, significantly outperforming existing diffusion-based TSF models. Our code is available on GitHub: this https URL.
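The "sliding-based" forward process mentioned above can be pictured with a toy sketch: starting from the future window (initial state), each step slides the window one position back in time until it coincides with the historical window (final state). This is only one plausible reading of the abstract, assuming equal-length history and future windows and a one-step slide; the function name and data are hypothetical, and the paper's actual schedule may differ.

```python
def sliding_states(history, future):
    """Build intermediate series between the future and historical windows.

    Step 0 is the future series; the last step is the historical series.
    Each step shifts the window one position toward the past.
    Assumes len(history) == len(future) (an illustrative simplification).
    """
    if len(history) != len(future):
        raise ValueError("history and future must have the same length")
    length = len(future)
    full = list(history) + list(future)
    return [full[length - t: 2 * length - t] for t in range(length + 1)]


# Hypothetical toy series.
history = [1, 2, 3, 4]
future = [5, 6, 7, 8]
states = sliding_states(history, future)
# states[0] is the future window, states[-1] is the historical window.
```

Read in reverse, the same list of states goes from the historical window to the future window, which is the direction in which a sampler would run when forecasting.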

[AI-13] Beware of Metacognitive Laziness: Effects of Generative Artificial Intelligence on Learning Motivation Processes and Performance

Link: https://arxiv.org/abs/2412.09315
Authors: Yizhou Fan,Luzhen Tang,Huixiao Le,Kejie Shen,Shufang Tan,Yueying Zhao,Yuan Shen,Xinyu Li,Dragan Gašević
Keywords: generative artificial intelligence, self-regulated learning processes, educational innovation, generative artificial, continuous development
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:


Abstract:With the continuous development of technological and educational innovation, learners nowadays can obtain a variety of support from agents such as teachers, peers, education technologies, and recently, generative artificial intelligence such as ChatGPT. The concept of hybrid intelligence is still at a nascent stage, and how learners can benefit from a symbiotic relationship with various agents such as AI, human experts and intelligent learning systems is still unknown. The emerging concept of hybrid intelligence also lacks deep insights and understanding of the mechanisms and consequences of hybrid human-AI learning based on strong empirical research. In order to address this gap, we conducted a randomised experimental study and compared learners’ motivations, self-regulated learning processes and learning performances on a writing task among different groups who had support from different agents (ChatGPT, human expert, writing analytics tools, and no extra tool). A total of 117 university students were recruited, and their multi-channel learning, performance and motivation data were collected and analysed. The results revealed that: learners who received different learning support showed no difference in post-task intrinsic motivation; there were significant differences in the frequency and sequences of the self-regulated learning processes among groups; ChatGPT group outperformed in the essay score improvement but their knowledge gain and transfer were not significantly different. Our research found that in the absence of differences in motivation, learners with different supports still exhibited different self-regulated learning processes, ultimately leading to differentiated performance. What is particularly noteworthy is that AI technologies such as ChatGPT may promote learners’ dependence on technology and potentially trigger metacognitive laziness.

[AI-14] Learning Novel Skills from Language-Generated Demonstrations

Link: https://arxiv.org/abs/2412.09286
Authors: Ao-Qun Jin,Tian-Yu Xiang,Xiao-Hu Zhou,Mei-Jiang Gui,Xiao-Liang Xie,Shi-Qi Liu,Shuang-Yi Wang,Yue Cao,Sheng-Bin Duan,Fu-Chao Xie,Zeng-Guang Hou
Keywords: potential safety risks, high labor costs, Current robot learning, Current robot, resulting in high
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Current robot learning algorithms for acquiring novel skills often rely on demonstration datasets or environment interactions, resulting in high labor costs and potential safety risks. To address these challenges, this study proposes a skill-learning framework that enables robots to acquire novel skills from natural language instructions. The proposed pipeline leverages vision-language models to generate demonstration videos of novel skills, which are processed by an inverse dynamics model to extract actions from the unlabeled demonstrations. These actions are subsequently mapped to environmental contexts via imitation learning, enabling robots to learn new skills effectively. Experimental evaluations in the MetaWorld simulation environments demonstrate the pipeline’s capability to generate high-fidelity and reliable demonstrations. Using the generated demonstrations, various skill learning algorithms achieve an accomplishment rate three times the original on novel tasks. These results highlight a novel approach to robot learning, offering a foundation for the intuitive and intelligent acquisition of novel robotic skills.

[AI-15] Speeding up approximate MAP by applying domain knowledge about relevant variables

Link: https://arxiv.org/abs/2412.09264
Authors: Johan Kwisthout,Andrew Schroeder
Keywords: problem in Bayesian, Bayesian networks, Frugal Explanation heuristic, MAP, notoriously intractable
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures


Abstract:The MAP problem in Bayesian networks is notoriously intractable, even when approximated. In an earlier paper we introduced the Most Frugal Explanation heuristic approach to solving MAP, by partitioning the set of intermediate variables (neither observed nor part of the MAP variables) into a set of relevant variables, which are marginalized out, and irrelevant variables, which will be assigned a sampled value from their domain. In this study we explore whether knowledge about which variables are relevant for a particular query (i.e., domain knowledge) speeds up computation sufficiently to beat both exact MAP as well as approximate MAP while giving reasonably accurate results. Our results are inconclusive, but also show that this probably depends on the specifics of the MAP query, most prominently the number of MAP variables.

[AI-16] LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation

Link: https://arxiv.org/abs/2412.09237
Authors: Yijun Liu,Wu Liu,Xiaoyan Gu,Yong Rui,Xiaodong He,Yongdong Zhang
Keywords: crucial for understanding, understanding complex social, understanding complex, multimodal, Abstract
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:The believable simulation of multi-user behavior is crucial for understanding complex social systems. Recently, large language models (LLMs)-based AI agents have made significant progress, enabling them to achieve human-like intelligence across various tasks. However, real human societies are often dynamic and complex, involving numerous individuals engaging in multimodal interactions. In this paper, taking e-commerce scenarios as an example, we present LMAgent, a very large-scale and multimodal agents society based on multimodal LLMs. In LMAgent, besides freely chatting with friends, the agents can autonomously browse, purchase, and review products, even perform live streaming e-commerce. To simulate this complex system, we introduce a self-consistency prompting mechanism to augment agents’ multimodal capabilities, resulting in significantly improved decision-making performance over the existing multi-agent system. Moreover, we propose a fast memory mechanism combined with the small-world model to enhance system efficiency, which supports more than 10,000 agent simulations in a society. Experiments on agents’ behavior show that these agents achieve comparable performance to humans in behavioral indicators. Furthermore, compared with the existing LLMs-based multi-agent system, more different and valuable phenomena are exhibited, such as herd behavior, which demonstrates the potential of LMAgent in credible large-scale social behavior simulations.

[AI-17] CSSDH: An Ontology for Social Determinants of Health to Operational Continuity of Care Data Interoperability

Link: https://arxiv.org/abs/2412.09223
Authors: Subhashis Das, Debashis Naskar, Sara Rodriguez Gonzalez
Keywords: Electronic Health Records, home-based healthcare solutions, social determinants, reliance on technology-driven, professionals as needed
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures; conference: The 25th International Conference on Intelligent Data Engineering and Automated Learning

Abstract:The rise of digital platforms has led to an increasing reliance on technology-driven, home-based healthcare solutions, enabling individuals to monitor their health and share information with healthcare professionals as needed. However, creating an efficient care plan management system requires more than just analyzing hospital summaries and Electronic Health Records (EHRs). Factors such as individual user needs and social determinants of health, including living conditions and the flow of healthcare information between different settings, must also be considered. Challenges in this complex healthcare network involve schema diversity (in EHRs, personal health records, etc.) and terminology diversity (e.g., ICD, SNOMED-CT) across ancillary healthcare operations. Establishing interoperability among various systems and applications is crucial, with the European Interoperability Framework (EIF) emphasizing the need for patient-centric access and control of healthcare data. In this paper, we propose an integrated ontological model, the Common Semantic Data Model for Social Determinants of Health (CSSDH), by combining ISO/DIS 13940:2024 ContSys with WHO Social Determinants of Health. CSSDH aims to achieve interoperability within the Continuity of Care Network.

[AI-18] Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning

Link: https://arxiv.org/abs/2412.09126
Authors: Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See
Keywords: Training multimodal models, Training multimodal, requires a large, large amount, data
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, ACMMM Asia 2024, Oral Presentation

Abstract:Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce initially, leading to a cold-start problem. Additionally, most AL methods seldom address multimodal data, highlighting a research gap in this field. Our research addresses these issues by developing a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL). Firstly, we observe the modality gap, a significant distance between the centroids of representations from different modalities, when only using cross-modal pairing information as self-supervision signals. This modality gap affects the data selection process, as we calculate both uni-modal and cross-modal distances. To address this, we introduce uni-modal prototypes to bridge the modality gap. Secondly, conventional AL methods often falter in multimodal scenarios where alignment between modalities is overlooked. Therefore, we propose enhancing cross-modal alignment through regularization, thereby improving the quality of selected multimodal data pairs in AL. Finally, our experiments demonstrate MMCSAL’s efficacy in selecting multimodal data pairs across three multimodal datasets.
Related DOI: https://doi.org/10.1145/3696409.3700225

[AI-19] Goal-Driven Query Answering over First- and Second-Order Dependencies with Equality

Link: https://arxiv.org/abs/2412.09125
Authors: Efthymia Tsamoura, Boris Motik
Keywords: universal model, dependencies, Query, plays a central, central role
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB); Logic in Computer Science (cs.LO)
Comments: 47 pages

Abstract:Query answering over data with dependencies plays a central role in most applications of dependencies. The problem is commonly solved by using a suitable variant of the chase algorithm to compute a universal model of the dependencies and the data and thus explicate all knowledge implicit in the dependencies. After this preprocessing step, an arbitrary conjunctive query over the dependencies and the data can be answered by evaluating it in the computed universal model. If, however, the query to be answered is fixed and known in advance, computing the universal model is often inefficient as many inferences made during this process can be irrelevant to a given query. In such cases, a goal-driven approach, which avoids drawing unnecessary inferences, promises to be more efficient and thus preferable in practice. In this paper we present what we believe to be the first technique for goal-driven query answering over first- and second-order dependencies with equality reasoning. Our technique transforms the input dependencies so that applying the chase to the output avoids many inferences that are irrelevant to the query. The transformation proceeds in several steps, which comprise the following three novel techniques. First, we present a variant of the singularisation technique by Marnette [60] that is applicable to second-order dependencies and that corrects an incompleteness of a related formulation by ten Cate et al. [74]. Second, we present a relevance analysis technique that can eliminate from the input those dependencies that provably do not contribute to query answers. Third, we present a variant of the magic sets algorithm [19] that can handle second-order dependencies with equality reasoning. We also present the results of an extensive empirical evaluation, which show that goal-driven query answering can be orders of magnitude faster than computing the full universal model.
ACM classes: F.4.1; I.2.4
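The contrast between computing a full model and answering a fixed query can be shown on a toy reachability rule. The sketch below is only an analogy under simplifying assumptions: it uses a single Datalog-style rule without existential quantifiers or equality, and it is not the paper's transformation. The function names and the edge data are hypothetical.

```python
def full_closure(edges):
    """Forward chaining to a fixpoint: derive path(x, z) for every pair (the full model)."""
    path = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(path):
            for (y2, z) in edges:
                if y == y2 and (x, z) not in path:
                    path.add((x, z))
                    changed = True
    return path

def goal_driven(edges, source):
    """Derive only path(source, z) facts, skipping inferences irrelevant to the query."""
    reached, frontier = set(), [source]
    while frontier:
        y = frontier.pop()
        for (y2, z) in edges:
            if y == y2 and z not in reached:
                reached.add(z)
                frontier.append(z)
    return {(source, z) for z in reached}

edges = {("a", "b"), ("b", "c"), ("x", "y"), ("y", "z"), ("z", "w")}
all_facts = full_closure(edges)        # nine path facts, most unrelated to "a"
answers = goal_driven(edges, "a")      # only the two facts the query needs
```

Both functions agree on the facts that start at `"a"`, but the goal-driven version never touches the `x`, `y`, `z`, `w` component of the graph.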

[AI-20] In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning AAAI-25

Link: https://arxiv.org/abs/2412.09104
Authors: Songjun Tu, Jingbo Sun, Qichao Zhang, Yaocheng Zhang, Jia Liu, Ke Chen, Dongbin Zhao
Keywords: reward-free offline dataset, Offline preference-based reinforcement, preference-based reinforcement learning, typically operates, preference-based reinforcement
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract:Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a balance between maintaining fidelity to the behavior policy with high in-dataset trajectory returns and selecting optimal actions based on high reward labels. Additionally, we introduce an ensemble normalization technique that effectively integrates multiple reward models, balancing the tradeoff between reward differentiation and accuracy. Empirical evaluations on various benchmarks demonstrate the superiority of DTR over other state-of-the-art baselines.

[AI-21] Temporal Numeric Planning with Patterns AAAI-25

Link: https://arxiv.org/abs/2412.09101
Authors: Matteo Cardellini, Enrico Giunchiglia
Keywords: produce SMT formulas, numeric planning problems, recently proposed planning, temporal numeric planning, SMT formulas
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract:We consider temporal numeric planning problems $\Pi$ expressed in PDDL2.1 level 3, and show how to produce SMT formulas (i) whose models correspond to valid plans of $\Pi$, and (ii) that extend the recently proposed planning with patterns approach from the numeric to the temporal case. We prove the correctness and completeness of the approach and show that it performs very well on 10 domains with required concurrency.

[AI-22] Understanding Opportunities and Risks of Synthetic Relationships: Leveraging the Power of Longitudinal Research with Customised AI Tools

Link: https://arxiv.org/abs/2412.09086
Authors: Alfio Ventura, Nils Köbis
Keywords: position paper discusses, synthetic relationships, position paper, paper discusses, discusses the benefits
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This is a “Position paper accepted for CONVERSATIONS 2024 - the 8th International Workshop on Chatbots and Human-Centred AI, hosted by CERTH, Thessaloniki, Greece, December 4-5, 2024.” The original publication is available on the workshop website: this https URL. This document is identical to the original and is mainly available here for accessibility and discoverability

Abstract:This position paper discusses the benefits of longitudinal behavioural research with customised AI tools for exploring the opportunities and risks of synthetic relationships. Synthetic relationships are defined as “continuing associations between humans and AI tools that interact with one another wherein the AI tool(s) influence(s) humans’ thoughts, feelings, and/or actions.” (Starke et al., 2024). These relationships can potentially improve health, education, and the workplace, but they also bring the risk of subtle manipulation and privacy and autonomy concerns. To harness the opportunities of synthetic relationships and mitigate their risks, we outline a methodological approach that complements existing findings. We propose longitudinal research designs with self-assembled AI agents that enable the integration of detailed behavioural and self-reported data.

[AI-23] EmbedGenius: Towards Automated Software Development for Generic Embedded IoT Systems

Link: https://arxiv.org/abs/2412.09058
Authors: Huanqi Yang, Mingzhe Li, Mingda Han, Zhenjiang Li, Weitao Xu
Keywords: enabling seamless connectivity, range of applications, crucial for enabling, enabling seamless, seamless connectivity
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Embedded IoT system development is crucial for enabling seamless connectivity and functionality across a wide range of applications. However, such a complex process requires cross-domain knowledge of hardware and software and hence often necessitates direct developer involvement, making it labor-intensive, time-consuming, and error-prone. To address this challenge, this paper introduces EmbedGenius, the first fully automated software development platform for general-purpose embedded IoT systems. The key idea is to leverage the reasoning ability of Large Language Models (LLMs) and embedded system expertise to automate the hardware-in-the-loop development process. The main methods include a component-aware library resolution method for addressing hardware dependencies, a library knowledge generation method that injects utility domain knowledge into LLMs, and an auto-programming method that ensures successful deployment. We evaluate EmbedGenius’s performance across 71 modules and four mainstream embedded development platforms with over 350 IoT tasks. Experimental results show that EmbedGenius can generate code with an accuracy of 95.7% and complete tasks with a success rate of 86.5%, surpassing human-in-the-loop baselines by 15.6%–37.7% and 25.5%–53.4%, respectively. We also show EmbedGenius’s potential through case studies in environmental monitoring and remote control systems development.

[AI-24] A Context-Enhanced Framework for Sequential Graph Reasoning IJCAI2024

Link: https://arxiv.org/abs/2412.09056
Authors: Shuo Shi, Chao Peng, Chenyang Xu, Zhengfeng Yang
Keywords: automated math problem, math problem solving, paper studies sequential, graph algorithm learning, studies sequential reasoning
Subjects: Artificial Intelligence (cs.AI)
Comments: Appeared at IJCAI 2024

Abstract:The paper studies sequential reasoning over graph-structured data, which stands as a fundamental task in various trending fields like automated math problem solving and neural graph algorithm learning, attracting a lot of research interest. Simultaneously managing both sequential and graph-structured information in such tasks presents a notable challenge. Over recent years, many neural architectures in the literature have emerged to tackle the issue. In this work, we generalize the existing architectures and propose a context-enhanced framework. The crucial innovation is that the reasoning of each step does not only rely on the outcome of the preceding step but also leverages the aggregation of information from more historical outcomes. The idea stems from our observation that in sequential graph reasoning, the outcomes of different steps are much more strongly interconnected than in traditional seq-to-seq tasks. We show that the framework can effectively integrate with the existing methods, enhancing their reasoning abilities. Empirical evaluations are conducted on the challenging CLRS Reasoning Benchmark, and the results demonstrate that the proposed framework significantly improves the performance of existing architectures, yielding state-of-the-art results across the majority of the datasets within the benchmark.

[AI-25] Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis

Link: https://arxiv.org/abs/2412.09032
Authors: Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen
Keywords: increasingly crucial due, Detecting synthetic, identity impersonation, increasingly crucial, crucial due
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Detecting synthetic from real speech is increasingly crucial due to the risks of misinformation and identity impersonation. While various datasets for synthetic speech analysis have been developed, they often focus on specific areas, limiting their utility for comprehensive research. To fill this gap, we propose the Speech-Forensics dataset by extensively covering authentic, synthetic, and partially forged speech samples that include multiple segments synthesized by different high-quality algorithms. Moreover, we propose a TEmporal Speech LocalizaTion network, called TEST, aiming at simultaneously performing authenticity detection, multiple fake segments localization, and synthesis algorithms recognition, without any complex post-processing. TEST effectively integrates LSTM and Transformer to extract more powerful temporal speech representations and utilizes dense prediction on multi-scale pyramid features to estimate the synthetic spans. Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level. At the segment level, it attains an EER of 1.07% and a 92.19% F1 score. These results highlight the model’s robust capability for a comprehensive analysis of synthetic speech, offering a promising avenue for future research and practical applications in this field.
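The equal error rate (EER) quoted above is the operating point where the false-positive and false-negative rates coincide. The sketch below is a simple threshold sweep written for illustration, not the evaluation code used for TEST; the label convention (1 = synthetic, higher score = more likely synthetic) and the function name are assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds and taking the point where the two error rates are closest."""
    positives = [s for s, y in zip(scores, labels) if y == 1]   # synthetic samples
    negatives = [s for s, y in zip(scores, labels) if y == 0]   # real samples
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in negatives) / len(negatives)   # real flagged as synthetic
        fnr = sum(s < t for s in positives) / len(positives)    # synthetic missed
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer
```

On finite data the two rates rarely match exactly, so this sketch reports the average of the two at the closest threshold; production toolkits usually interpolate along the ROC curve instead.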

[AI-26] RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell Property Prediction AAAI2025

Link: https://arxiv.org/abs/2412.09030
Authors: Zhihao Ding, Ting Zhang, Yiran Li, Jieming Shi, Chen Jason Zhang
Keywords: Organic Solar Cells, Organic Solar, Solar Cells, sustainable energy production, OSC
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures. This is the extended version of the paper accepted at AAAI 2025, which includes all technical appendices and additional experimental details

Abstract:Organic Solar Cells (OSCs) are a promising technology for sustainable energy production. However, the identification of molecules with desired OSC properties typically involves laborious experimental research. To accelerate progress in the field, it is crucial to develop machine learning models capable of accurately predicting the properties of OSC molecules. While graph representation learning has demonstrated success in molecular property prediction, it remains underexplored for OSC-specific tasks. Existing methods fail to capture the unique structural features of OSC molecules, particularly the intricate ring systems that critically influence OSC properties, leading to suboptimal performance. To fill the gap, we present RingFormer, a novel graph transformer framework specially designed to capture both atom and ring level structural patterns in OSC molecules. RingFormer constructs a hierarchical graph that integrates atomic and ring structures and employs a combination of local message passing and global attention mechanisms to generate expressive graph representations for accurate OSC property prediction. We evaluate RingFormer’s effectiveness on five curated OSC molecule datasets through extensive experiments. The results demonstrate that RingFormer consistently outperforms existing methods, achieving a 22.77% relative improvement over the nearest competitor on the CEPDB dataset.

[AI-27] The AI Interface: Designing for the Ideal Machine-Human Experience (Editorial)

Link: https://arxiv.org/abs/2412.09000
Authors: Aparna Sundar, Tony Russell-Rose, Udo Kruschwitz, Karen Machleit
Keywords: emotionally resonant AI-human, resonant AI-human interfaces, artificial intelligence, daily life, critical challenge
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:As artificial intelligence (AI) becomes increasingly embedded in daily life, designing intuitive, trustworthy, and emotionally resonant AI-human interfaces has emerged as a critical challenge. This editorial introduces a Special Issue that explores the psychology of AI experience design, focusing on how interfaces can foster seamless collaboration between humans and machines. Drawing on insights from diverse fields (healthcare, consumer technology, workplace dynamics, and the cultural sector), the papers in this collection highlight the complexities of trust, transparency, and emotional sensitivity in human-AI interaction. Key themes include designing AI systems that align with user perceptions and expectations, overcoming resistance through transparency and trust, and framing AI capabilities to reduce user anxiety. By synthesizing findings from eight diverse studies, this editorial underscores the need for AI interfaces to balance efficiency with empathy, addressing both functional and emotional dimensions of user experience. Ultimately, it calls for actionable frameworks to bridge research and practice, ensuring that AI systems enhance human lives through thoughtful, human-centered design.

[AI-28] Predicting Quality of Video Gaming Experience Using Global-Scale Telemetry Data and Federated Learning

Link: https://arxiv.org/abs/2412.08950
Authors: Zhongyang Zhang, Jinhe Wen, Zixi Chen, Dara Arbab, Sruti Sahani, Bijan Arbab, Haojian Jin, Tauhidur Rahman
Keywords: FPS, gaming experience, affect game FPS, game FPS, accurate FPS
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 22 pages, 11 figures, 6 tables

Abstract:Frames Per Second (FPS) significantly affects the gaming experience. Providing players with accurate FPS estimates prior to purchase benefits both players and game developers. However, we have a limited understanding of how to predict a game’s FPS performance on a specific device. In this paper, we first conduct a comprehensive analysis of a wide range of factors that may affect game FPS on a global-scale dataset to identify the determinants of FPS. This includes player-side and game-side characteristics, as well as country-level socio-economic statistics. Furthermore, recognizing that accurate FPS predictions require extensive user data, which raises privacy concerns, we propose a federated learning-based model to ensure user privacy. Each player and game is assigned a unique learnable knowledge kernel that gradually extracts latent features for improved accuracy. We also introduce a novel training and prediction scheme that allows these kernels to be dynamically plug-and-play, effectively addressing cold start issues. To train this model with minimal bias, we collected a large telemetry dataset from 224 countries and regions, 100,000 users, and 835 games. Our model achieved a mean Wasserstein distance of 0.469 between predicted and ground truth FPS distributions, outperforming all baseline methods.
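For readers unfamiliar with the metric reported above, the sketch below computes a one-dimensional Wasserstein-1 distance between two empirical samples. It is only an illustration: the abstract does not say which Wasserstein order or normalization the authors use, and the function name and the FPS values in the usage note are made up.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equally sized empirical samples: mean gap between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

For example, `wasserstein_1d([60, 30, 45], [50, 35, 65])` compares the sorted samples `[30, 45, 60]` and `[35, 50, 65]` and gives 5.0, meaning the predicted FPS distribution would have to shift by 5 frames per second on average to match the observed one.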

[AI-29] Goal-Conditioned Supervised Learning for Multi-Objective Recommendation

Link: https://arxiv.org/abs/2412.08911
Authors: Shijun Li, Hilaf Hasson, Jing Hu, Joydeep Ghosh
Keywords: concurrently optimize multiple, Goal-Conditioned Supervised Learning, Multi-objective learning endeavors, optimize multiple objectives, Supervised Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Multi-objective learning endeavors to concurrently optimize multiple objectives using a single model, aiming to achieve high and balanced performance across these diverse objectives. However, it often involves a more complex optimization problem, particularly when navigating potential conflicts between objectives, leading to solutions with higher memory requirements and computational complexity. This paper introduces a Multi-Objective Goal-Conditioned Supervised Learning (MOGCSL) framework for automatically learning to achieve multiple objectives from offline sequential data. MOGCSL extends the conventional Goal-Conditioned Supervised Learning (GCSL) method to multi-objective scenarios by redefining goals from one-dimensional scalars to multi-dimensional vectors. The need for complex architectures and optimization constraints can be naturally eliminated. MOGCSL benefits from filtering out uninformative or noisy instances that do not achieve desirable long-term rewards. It also incorporates a novel goal-choosing algorithm to model and select “high” achievable goals for inference. While MOGCSL is quite general, we focus on its application to the next action prediction problem in commercial-grade recommender systems. In this context, any viable solution needs to be reasonably scalable and also be robust to large amounts of noisy data that is characteristic of this application space. We show that MOGCSL performs admirably on both counts. Specifically, extensive experiments conducted on real-world recommendation datasets validate its efficacy and efficiency. Also, analysis and experiments are included to explain its strength in discounting the noisier portions of training data in recommender systems. 
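The step from scalar goals to vector goals can be sketched as a relabeling of logged trajectories. The code below assumes, as an illustration rather than a statement of the paper's definition, that a goal at each step is the vector of cumulative future rewards per objective; the function name and the two example objectives are hypothetical.

```python
def goal_vectors(rewards):
    """rewards: list of per-step reward tuples, one entry per objective.

    Returns, for each step, the vector of cumulative rewards from that step onward.
    """
    n_objectives = len(rewards[0])
    running = [0.0] * n_objectives
    goals = []
    for step in reversed(rewards):
        running = [g + r for g, r in zip(running, step)]
        goals.append(tuple(running))
    return goals[::-1]

# Each step records (click, purchase) rewards for one logged session.
session = [(1, 0), (0, 1), (1, 1)]
goals = goal_vectors(session)
```

A supervised model can then be trained to predict the logged action from the state together with the relabeled goal vector, and queried at inference time with a desirable goal.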

[AI-30] Radiology Report Generation via Multi-objective Preference Optimization

Link: https://arxiv.org/abs/2412.08901
Authors: Ting Xiao, Lei Shi, Peng Liu, Zhe Wang, Chenjia Bai
Keywords: Automatic Radiology Report, Radiology Report Generation, Automatic Radiology, Report Generation, Radiology Report
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures

Abstract:Automatic Radiology Report Generation (RRG) is an important topic for alleviating the substantial workload of radiologists. Existing RRG approaches rely on supervised regression based on different architectures or additional knowledge injection, while the generated report may not align optimally with radiologists’ preferences. This matters especially because the preferences of radiologists are inherently heterogeneous and multidimensional, e.g., some may prioritize report fluency, while others emphasize clinical accuracy. To address this problem, we propose a new RRG method via Multi-objective Preference Optimization (MPO) to align the pre-trained RRG model with multiple human preferences, which can be formulated by multi-dimensional reward functions and optimized by multi-objective reinforcement learning (RL). Specifically, we use a preference vector to represent the weight of preferences and use it as a condition for the RRG model. Then, a linearly weighted reward is obtained via a dot product between the preference vector and the multi-dimensional reward. Then, the RRG model is optimized to align with the preference vector by optimizing such a reward via RL. In the training stage, we randomly sample diverse preference vectors from the preference space and align the model by optimizing the weighted multi-objective rewards, which leads to an optimal policy on the entire preference space. At inference time, our model can generate reports aligned with specific preferences without further fine-tuning. Extensive experiments on two public datasets show the proposed method can generate reports that cater to different preferences in a single model and achieve state-of-the-art performance.
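The reward scalarization described in this abstract is a dot product between a preference vector and a reward vector. The sketch below illustrates that step and one way to sample preference vectors; sampling uniformly from the simplex via normalized exponential draws is an assumption, since the abstract only says vectors are sampled randomly, and the function names and example reward values are hypothetical.

```python
import random

def sample_preference(n_objectives, rng):
    """Draw a random weight vector on the simplex (non-negative, sums to 1)."""
    raw = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(raw)
    return [w / total for w in raw]

def scalar_reward(preference, rewards):
    """Linearly weighted reward: dot product of preference vector and reward vector."""
    return sum(w * r for w, r in zip(preference, rewards))

rng = random.Random(0)
preference = sample_preference(2, rng)          # e.g. weights for fluency and clinical accuracy
reward = scalar_reward(preference, [0.8, 0.4])  # hypothetical per-objective rewards
```

During training, a fresh preference vector would be drawn per batch or per sample and also fed to the model as a condition, so that one policy covers the whole preference space.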

[AI-31] Neural Interactive Proofs

Link: https://arxiv.org/abs/2412.08897
Authors: Lewis Hammond, Sam Adam-Day
Keywords: computationally bounded agent, neural interactive proofs, verifier can learn, provers in order, computationally bounded
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 42 pages, 17 figures

Abstract:We consider the problem of how a trusted, but computationally bounded agent (a ‘verifier’) can learn to interact with one or more powerful but untrusted agents (‘provers’) in order to solve a given task. More specifically, we study the case in which agents are represented using neural networks and refer to solutions of this problem as neural interactive proofs. First we introduce a unifying framework based on prover-verifier games, which generalises previously proposed interaction protocols. We then describe several new protocols for generating neural interactive proofs, and provide a theoretical comparison of both new and existing approaches. Finally, we support this theory with experiments in two domains: a toy graph isomorphism problem that illustrates the key ideas, and a code validation task using large language models. In so doing, we aim to create a foundation for future work on neural interactive proofs and their application in building safer AI systems.

[AI-32] SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization

Link: https://arxiv.org/abs/2412.08894
Authors: Kwangryeol Park, Seulki Lee
Keywords: Square-Matricized Momentum Factorization, learning rate optimizers, adaptive learning rate, Square-Matricized Momentum, propose SMMF
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We propose SMMF (Square-Matricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of the widely used adaptive learning rate optimizers, such as Adam, by up to 96%. SMMF enables flexible and efficient factorization of an arbitrary rank (shape) of the first and second momentum tensors during optimization, based on the proposed square-matricization and one-time single matrix factorization. From this, it becomes effectively applicable to any rank (shape) of momentum tensors, i.e., bias, matrix, and any rank-d tensors, prevalent in various deep model architectures, such as CNNs (high rank) and Transformers (low rank), in contrast to existing memory-efficient optimizers that apply only to a particular (rank-2) momentum tensor, e.g., linear layers. We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC, providing a theoretical basis for its competitive optimization capability. In our experiment, SMMF takes up to 96% less memory compared to state-of-the-art memory efficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance on various CNN and Transformer tasks.
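The square-matricization step can be pictured as reshaping a tensor of any shape into the most nearly square matrix with the same number of elements. The sketch below finds such a shape; it is a reading of the abstract rather than the authors' code, and the function name is hypothetical.

```python
import math

def square_matricize_shape(shape):
    """Return (n, m) with n * m equal to the element count and n, m as close as possible."""
    total = math.prod(shape)
    n = math.isqrt(total)
    while total % n != 0:      # walk down to the largest divisor not exceeding sqrt(total)
        n -= 1
    return n, total // n

conv_shape = square_matricize_shape((64, 3, 3, 3))   # a rank-4 convolution kernel
```

For the convolution kernel above, the 1728 elements become a 36 x 48 matrix. If that matrix is then approximated by a rank-1 outer product (an assumption here about what the factorization stores), only 36 + 48 = 84 numbers need to be kept instead of 1728, and a near-square shape minimizes that sum.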

[AI-33] Efficient Reinforcement Learning for Optimal Control with Natural Images

Link: https://arxiv.org/abs/2412.08893
Authors: Peter N. Loxley
Keywords: control systems engineering, sequential decision problems, decision problems widely, systems engineering, artificial intelligence
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Reinforcement learning solves optimal control and sequential decision problems widely found in control systems engineering, robotics, and artificial intelligence. This work investigates optimal control over a sequence of natural images. The problem is formalized, and general conditions are derived for an image to be sufficient for implementing an optimal policy. Reinforcement learning is shown to be efficient only for certain types of image representations. This is demonstrated by developing a reinforcement learning benchmark that scales easily with number of states and length of horizon, and has optimal policies that are easily distinguished from suboptimal policies. Image representations given by overcomplete sparse codes are found to be computationally efficient for optimal control, using fewer computational resources to learn and evaluate optimal policies. For natural images of fixed size, representing each image as an overcomplete sparse code in a linear network is shown to increase network storage capacity by orders of magnitude beyond that possible for any complete code, allowing larger tasks with many more states to be solved. Sparse codes can be generated by devices with low energy requirements and low computational overhead.

[AI-34] Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model

Link: https://arxiv.org/abs/2412.08873
Authors: Hans Moen, Vishnu Raj, Andrius Vabalas, Markus Perola, Samuel Kaski, Andrea Ganna, Pekka Marttinen
Keywords: individuals’ health histories, health trajectories, registers contain rich, rich information, individuals’ health trajectories
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Health registers contain rich information about individuals’ health histories. Here our interest lies in understanding how individuals’ health trajectories evolve in a nationwide longitudinal dataset with coded features, such as clinical codes, procedures, and drug purchases. We introduce a straightforward approach for training a Transformer-based deep learning model in a way that lets us analyze how individuals’ trajectories change over time. This is achieved by modifying the training objective and by applying a causal attention mask. We focus here on a general task of predicting the onset of a range of common diseases in a given future forecast interval. However, instead of providing a single prediction about diagnoses that could occur in this forecast interval, our approach enables the model to provide continuous predictions at every time point up until, and conditioned on, the time of the forecast period. We find that this model performs comparably to other models, including a bi-directional transformer model, in terms of basic prediction performance while at the same time offering promising trajectory modeling properties. We explore a couple of ways to use this model for analyzing health trajectories and aiding in early detection of events that forecast possible later disease onsets. We hypothesize that this method may be helpful in continuous monitoring of people’s health trajectories and enabling interventions in ongoing health trajectories, as well as being useful in retrospective analyses.
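A causal attention mask is what lets a Transformer emit a prediction at every time point using only earlier events. The sketch below shows the mask and its effect on attention weights in plain Python; it illustrates the general mechanism and is not the authors' model, and the function names are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over scores, giving zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform raw scores over three events in a health history.
weights = masked_softmax([[0.0] * 3 for _ in range(3)], causal_mask(3))
```

Row `i` of `weights` spreads attention over events `0..i` only, so the representation at each time point never depends on later events.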

[AI-35] Key Safety Design Overview in AI-driven Autonomous Vehicles

Link: https://arxiv.org/abs/2412.08862
Authors: Vikas Vyas, Zheyuan Xu
Keywords: autonomous SAE level, incorporate artificial intelligence, autonomous SAE, artificial intelligence, automotive software
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract: With the increasing presence of autonomous SAE level 3 and level 4 vehicles, which incorporate artificial intelligence software, along with the complex technical challenges they present, it is essential to maintain a high level of functional safety and robust software design. This paper explores the necessary safety architecture and systematic approach for automotive software and hardware, including fail-soft handling of automotive safety integrity level (ASIL) D (the highest level of safety integrity) and the integration of artificial intelligence (AI) and machine learning (ML) in automotive safety architecture. By addressing the unique challenges presented by increasing AI-based automotive software, we propose various techniques, such as mitigation strategies and safety failure analysis, to ensure the safety and reliability of automotive software, and we discuss the role of AI in software reliability throughout the data lifecycle. Index Terms: Safety Design, Automotive Software, Performance Evaluation, Advanced Driver Assistance Systems (ADAS) Applications, Automotive Software Systems, Electronic Control Units.

[AI-36] Kajal: Extracting Grammar of a Source Code Using Large Language Models

Link: https://arxiv.org/abs/2412.08842
Authors: Mohammad Jalili Torkamani
Keywords: Large Language Models, software engineering tasks, Understanding and extracting, leveraging Large Language, manually creating
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, 1 table, preprint

Abstract:Understanding and extracting the grammar of a domain-specific language (DSL) is crucial for various software engineering tasks; however, manually creating these grammars is time-intensive and error-prone. This paper presents Kajal, a novel approach that automatically infers grammar from DSL code snippets by leveraging Large Language Models (LLMs) through prompt engineering and few-shot learning. Kajal dynamically constructs input prompts, using contextual information to guide the LLM in generating the corresponding grammars, which are iteratively refined through a feedback-driven approach. Our experiments show that Kajal achieves 60% accuracy with few-shot learning and 45% without it, demonstrating the significant impact of few-shot learning on the tool’s effectiveness. This approach offers a promising solution for automating DSL grammar extraction, and future work will explore using smaller, open-source LLMs and testing on larger datasets to further validate Kajal’s performance.

[AI-37] Structural Entropy Guided Probabilistic Coding

Link: https://arxiv.org/abs/2412.08841
Authors: Xiang Huang, Hao Peng, Li Sun, Hui Lin, Chunyang Liu, Jiang Cao, Philip S. Yu
Keywords: deterministic embeddings, data point, advantages over deterministic, describes the uncertainty, uncertainty and complexity
Categories: Artificial Intelligence (cs.AI)

Abstract:Probabilistic embeddings have several advantages over deterministic embeddings as they map each data point to a distribution, which better describes the uncertainty and complexity of data. Many works focus on adjusting the distribution constraint under the Information Bottleneck (IB) principle to enhance representation learning. However, these proposed regularization terms only consider the constraint of each latent variable, omitting the structural information between latent variables. In this paper, we propose a novel structural entropy-guided probabilistic coding model, named SEPC. Specifically, we incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss. Besides, as traditional structural information theory is not well-suited for regression tasks, we propose a probabilistic encoding tree, transferring regression tasks to classification tasks while diminishing the influence of the transformation. Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC compared to other state-of-the-art models in terms of effectiveness, generalization capability, and robustness to label noise. The codes and datasets are available at this https URL.

[AI-38] HadaCore: Tensor Core Accelerated Hadamard Transform Kernel

Link: https://arxiv.org/abs/2412.08832
Authors: Krish Agarwal, Rishi Astra, Adnan Hoque, Mudhakar Srivatsa, Raghu Ganti, Less Wright, Sijia Chen
Keywords: Fast Walsh-Hadamard Transform, modified Fast Walsh-Hadamard, Tensor Cores present, Tensor Core acceleration, modern GPU hardware
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:We present HadaCore, a modified Fast Walsh-Hadamard Transform (FWHT) algorithm optimized for the Tensor Cores present in modern GPU hardware. HadaCore follows the recursive structure of the original FWHT algorithm, achieving the same asymptotic runtime complexity but leveraging a hardware-aware work decomposition that benefits from Tensor Core acceleration. This reduces bottlenecks from compute and data exchange. On Nvidia A100 and H100 GPUs, HadaCore achieves speedups of 1.1-1.4x and 1.0-1.3x, with a peak gain of 3.5x and 3.6x respectively, when compared to the existing state-of-the-art implementation of the original algorithm. We also show that when using FP16 or BF16, our implementation is numerically accurate, enabling comparable accuracy on MMLU benchmarks when used in an end-to-end Llama3 inference run with quantized (FP8) attention.
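For readers unfamiliar with the baseline, here is a minimal sketch of the classic in-place Fast Walsh-Hadamard Transform in plain Python. It illustrates only the textbook butterfly structure with O(n log n) additions that the abstract says HadaCore follows; it is not the Tensor Core kernel from the paper, and the function name `fwht` is an assumption made for this example.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of `values`.

    The input length must be a power of two. Each pass combines pairs of
    elements that are `half` apart using the butterfly (a + b, a - b).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Because the unnormalized Hadamard matrix satisfies H·H = n·I, applying `fwht` twice returns the input scaled by its length, which makes a convenient sanity check.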

[AI-39] Efficient Dynamic Attributed Graph Generation ICDE2025

Link: https://arxiv.org/abs/2412.08810
Authors: Fan Li, Xiaoyang Wang, Dawei Cheng, Cong Chen, Ying Zhang, Xuemin Lin
Keywords: testing database engines, data management due, fundamental research problem, diverse use cases, ranging from testing
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures. Accepted by IEEE ICDE2025

Abstract:Data generation is a fundamental research problem in data management due to its diverse use cases, ranging from testing database engines to data-specific applications. However, real-world entities often involve complex interactions that cannot be effectively modeled by traditional tabular data. Therefore, graph data generation has attracted increasing attention recently. Although various graph generators have been proposed in the literature, there are three limitations: i) They cannot capture the co-evolution pattern of graph structure and node attributes. ii) Few of them consider edge direction, leading to substantial information loss. iii) Current state-of-the-art dynamic graph generators are based on the temporal random walk, making the simulation process time-consuming. To fill the research gap, we introduce VRDAG, a novel variational recurrent framework for efficient dynamic attributed graph generation. Specifically, we design a bidirectional message-passing mechanism to encode both directed structural knowledge and attribute information of a snapshot. Then, the temporal dependency in the graph sequence is captured by a recurrence state updater, generating embeddings that can preserve the evolution pattern of early graphs. Based on the hidden node embeddings, a conditional variational Bayesian method is developed to sample latent random variables at the neighboring timestep for new snapshot generation. The proposed generation paradigm avoids the time-consuming path sampling and merging process in existing random walk-based methods, significantly reducing the synthesis time. Finally, comprehensive experiments on real-world datasets are conducted to demonstrate the effectiveness and efficiency of the proposed model.

[AI-40] Autoformalizing and Simulating Game-Theoretic Scenarios using LLM-augmented Agents

Link: https://arxiv.org/abs/2412.08805
Authors: Agnieszka Mensfelt, Kostas Stathis, Vince Trencsenyi
Keywords: versatile tool, tool for exploring, exploring interactions, artificial agents, agents
Categories: Artificial Intelligence (cs.AI)
Comments: code: this https URL

Abstract:Game-theoretic simulations are a versatile tool for exploring interactions of both natural and artificial agents. However, modelling real-world scenarios and developing simulations often require substantial human expertise and effort. To streamline this process, we present a framework that enables the autoformalization of game-theoretic scenarios using agents augmented by large language models (LLMs). In this approach, LLM-augmented agents translate natural language scenario descriptions into executable logic programs that define the rules of each game, validating these programs for syntactic accuracy. A tournament simulation is then conducted, during which the agents test the functionality of the generated games by playing them. When a ground truth payoff matrix is available, an exact semantic validation can also be performed. The validated games can then be used in further simulations to assess the effectiveness of different strategies. We evaluate our approach on a diverse set of 55 natural language descriptions across five well-known 2x2 simultaneous-move games, demonstrating 96% syntactic and 87% semantic correctness in the generated game rules. Additionally, we assess the LLM-augmented agents’ capability to autoformalize strategies for gameplay.
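To make the notion of a ground-truth payoff matrix for a 2x2 simultaneous-move game concrete, here is a small Python sketch that stores such a matrix and enumerates its pure-strategy Nash equilibria. The paper generates logic programs rather than Python, so this is only an illustration under assumed conventions: the function name `pure_nash_equilibria`, the payoff values, and the encoding of actions as indices 0 (cooperate) and 1 (defect) are all hypothetical.

```python
def pure_nash_equilibria(payoffs):
    """List the (row, col) action pairs from which neither player gains by deviating.

    payoffs[r][c] is a tuple (row_player_payoff, col_player_payoff).
    """
    n_rows, n_cols = len(payoffs), len(payoffs[0])
    equilibria = []
    for r in range(n_rows):
        for c in range(n_cols):
            # The row player compares against every alternative row in column c.
            row_best = all(payoffs[r][c][0] >= payoffs[alt][c][0] for alt in range(n_rows))
            # The column player compares against every alternative column in row r.
            col_best = all(payoffs[r][c][1] >= payoffs[r][alt][1] for alt in range(n_cols))
            if row_best and col_best:
                equilibria.append((r, c))
    return equilibria


# Illustrative Prisoner's Dilemma payoffs: index 0 = cooperate, 1 = defect.
prisoners_dilemma = [[(3, 3), (0, 5)],
                     [(5, 0), (1, 1)]]
equilibria = pure_nash_equilibria(prisoners_dilemma)  # [(1, 1)]
```

A check of this kind compares outcomes rather than syntax, which is the distinction the abstract draws between syntactic and semantic validation.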

[AI-41] Integrating Optimization Theory with Deep Learning for Wireless Network Design

Link: https://arxiv.org/abs/2412.08761
Authors: Sinem Coleri, Aysun Gurur Onalan, Marco di Renzo
Keywords: Traditional wireless network, real-time applications due, Traditional wireless, domain-specific mathematical models, optimization algorithms derived
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Comments: Accepted for publication in IEEE Communications Magazine

Abstract:Traditional wireless network design relies on optimization algorithms derived from domain-specific mathematical models, which are often inefficient and unsuitable for dynamic, real-time applications due to high complexity. Deep learning has emerged as a promising alternative to overcome complexity and adaptability concerns, but it faces challenges such as accuracy issues, delays, and limited interpretability due to its inherent black-box nature. This paper introduces a novel approach that integrates optimization theory with deep learning methodologies to address these issues. The methodology starts by constructing the block diagram of the optimization theory-based solution, identifying key building blocks corresponding to optimality conditions and iterative solutions. Selected building blocks are then replaced with deep neural networks, enhancing the adaptability and interpretability of the system. Extensive simulations show that this hybrid approach not only reduces runtime compared to optimization theory based approaches but also significantly improves accuracy and convergence rates, outperforming pure deep learning models.

[AI-42] VEL: A Formally Verified Reasoner for OWL2 EL Profile

Link: https://arxiv.org/abs/2412.08739
Authors: Atalay Mert Ileri, Nalen Rangarajan, Jack Cannell, Hande McGinty
Keywords: Web Ontology Language, Ontology Language, Web Ontology, past two decades, knowledge graphs
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

Abstract:Over the past two decades, the Web Ontology Language (OWL) has been instrumental in advancing the development of ontologies and knowledge graphs, providing a structured framework that enhances the semantic integration of data. However, the reliability of deductive reasoning within these systems remains challenging, as evidenced by inconsistencies among popular reasoners in recent competitions. This evidence underscores the limitations of current testing-based methodologies, particularly in high-stakes domains such as healthcare. To mitigate these issues, in this paper, we have developed VEL, a formally verified EL++ reasoner equipped with machine-checkable correctness proofs that ensure the validity of outputs across all possible inputs. This formalization, based on the algorithm of Baader et al., has been transformed into executable OCaml code using the Coq proof assistant’s extraction capabilities. Our formalization revealed several errors in the original completeness proofs, which led to changes to the algorithm to ensure its completeness. Our work demonstrates the necessity of mechanization of reasoning algorithms to ensure their correctness at theoretical and implementation levels.

[AI-43] From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields

Link: https://arxiv.org/abs/2412.08731
Authors: Miltiadis Kofinas, Samuele Papa, Efstratios Gavves
Keywords: encoding spatio-temporal signals, recently emerged, encoding spatio-temporal, spatio-temporal signals, output nodes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Preprint. Source code: this https URL

Abstract:Neural fields (NeFs) have recently emerged as a state-of-the-art method for encoding spatio-temporal signals of various modalities. Despite the success of NeFs in reconstructing individual signals, their use as representations in downstream tasks, such as classification or segmentation, is hindered by the complexity of the parameter space and its underlying symmetries, in addition to the lack of powerful and scalable conditioning mechanisms. In this work, we draw inspiration from the principles of connectionism to design a new architecture based on MLPs, which we term NeoMLP. We start from an MLP, viewed as a graph, and transform it from a multi-partite graph to a complete graph of input, hidden, and output nodes, equipped with high-dimensional features. We perform message passing on this graph and employ weight-sharing via self-attention among all the nodes. NeoMLP has a built-in mechanism for conditioning through the hidden and output nodes, which function as a set of latent codes, and as such, NeoMLP can be used straightforwardly as a conditional neural field. We demonstrate the effectiveness of our method by fitting high-resolution signals, including multi-modal audio-visual data. Furthermore, we fit datasets of neural representations, by learning instance-specific sets of latent codes using a single backbone architecture, and then use them for downstream tasks, outperforming recent state-of-the-art methods. The source code is open-sourced at this https URL.

[AI-44] Learning Physics Informed Neural ODEs With Partial Measurements

Link: https://arxiv.org/abs/2412.08681
Authors: Paul Ghanem, Ahmet Demirkaya, Tales Imbiriba, Alireza Ramezani, Zachary Danziger, Deniz Erdogmus
Keywords: Learning dynamics governing, dynamics governing physical, Learning dynamics, dynamics governing, Physics Informed Neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Learning dynamics governing physical and spatiotemporal processes is a challenging problem, especially in scenarios where states are partially measured. In this work, we tackle the problem of learning dynamics governing these systems when parts of the system’s states are not measured, specifically when the dynamics generating the non-measured states are unknown. Inspired by state estimation theory and Physics Informed Neural ODEs, we present a sequential optimization framework in which dynamics governing unmeasured processes can be learned. We demonstrate the performance of the proposed approach leveraging numerical simulations and a real dataset extracted from an electro-mechanical positioning system. We show how the underlying equations fit into our formalism and demonstrate the improved performance of the proposed method when compared with baselines.

[AI-45] Distinguishing Scams and Fraud with Ensemble Learning

Link: https://arxiv.org/abs/2412.08680
Authors: Isha Chadalavada, Tianhui Huang, Jessica Staddon
Keywords: increasingly query LLM-enabled, query LLM-enabled web, LLM-enabled web chatbots, Users increasingly query, Consumer Financial Protection
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Users increasingly query LLM-enabled web chatbots for help with scam defense. The Consumer Financial Protection Bureau’s complaints database is a rich data source for evaluating LLM performance on user scam queries, but currently the corpus does not distinguish between scam and non-scam fraud. We developed an LLM ensemble approach to distinguishing scam and fraud CFPB complaints and describe initial findings regarding the strengths and weaknesses of LLMs in the scam defense context.

[AI-46] A Behavior Tree-inspired programming language for autonomous agents

Link: https://arxiv.org/abs/2412.08654
Authors: Oliver Biggar, Iman Shames
Keywords: Behavior Trees, designing agents behavior, propose a design, ideas and motivations, agents behavior
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)

Abstract: We propose a design for a functional programming language for autonomous agents, built off the ideas and motivations of Behavior Trees (BTs). BTs are a popular model for designing agent behavior in robotics and AI. However, as their use has grown dramatically, the simple model of BTs has come to be limiting. There is a growing push to increase the functionality of BTs, with the end goal of BTs evolving into a programming language in their own right, centred around the defining BT properties of modularity and reactiveness. In this paper, we examine how the BT model must be extended in order to grow into such a language. We identify some fundamental problems which must be solved: implementing ‘reactive’ selection, ‘monitoring’ safety-critical conditions, and passing data between actions. We provide a variety of small examples which demonstrate that these problems are complex, and that current BT approaches do not handle them in a manner consistent with modularity. We instead provide a simple set of modular programming primitives for handling these use cases, and show how they can be combined to build complex programs. We present a full specification for our BT-inspired language, and give an implementation in the functional programming language Haskell. Finally, we demonstrate our language by translating a large and complex BT into a simple, unambiguous program.

[AI-47] What AI evaluations for preventing catastrophic risks can and cannot do

Link: https://arxiv.org/abs/2412.08653
Authors: Peter Barnett, Lisa Thiergart
Keywords: underlying current approaches, preventing catastrophic risks, governance toolkit, important component, cases for preventing
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract: AI evaluations are an important component of the AI governance toolkit, underlying current approaches to safety cases for preventing catastrophic risks. Our paper examines what these evaluations can and cannot tell us. Evaluations can establish lower bounds on AI capabilities and assess certain misuse risks given sufficient effort from evaluators. Unfortunately, evaluations face fundamental limitations that cannot be overcome within the current paradigm. These include an inability to establish upper bounds on capabilities, reliably forecast future model capabilities, or robustly assess risks from autonomous AI systems. This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe. We conclude with recommendations for incremental improvements to frontier AI safety, while acknowledging these fundamental limitations remain unsolved.

[AI-48] A Practical Approach to Causal Inference over Time

Link: https://arxiv.org/abs/2410.10502
Authors: Martina Cinquini, Isacco Beretta, Salvatore Ruggieri, Isabel Valera
Keywords: focus on estimating, causal, causal VAR framework, time, causal VAR
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:In this paper, we focus on estimating the causal effect of an intervention over time on a dynamical system. To that end, we formally define causal interventions and their effects over time on discrete-time stochastic processes (DSPs). Then, we show under which conditions the equilibrium states of a DSP, both before and after a causal intervention, can be captured by a structural causal model (SCM). With such an equivalence at hand, we provide an explicit mapping from vector autoregressive models (VARs), broadly applied in econometrics, to linear, but potentially cyclic and/or affected by unmeasured confounders, SCMs. The resulting causal VAR framework allows us to perform causal inference over time from observational time series data. Our experiments on synthetic and real-world datasets show that the proposed framework achieves strong performance in terms of observational forecasting while enabling accurate estimation of the causal effect of interventions on dynamical systems. We demonstrate, through a case study, the potential practical questions that can be addressed using the proposed causal VAR framework.

[AI-49] Regression and Classification with Single-Qubit Quantum Neural Networks

Link: https://arxiv.org/abs/2412.09486
Authors: Leandro C. Souza, Bruno C. Guingo, Gilson Giraldi, Renato Portugal
Keywords: developing data-driven algorithms, classical machine learning, machine learning, data-driven algorithms, quantum machine learning
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures, 6 tables

Abstract:Since classical machine learning has become a powerful tool for developing data-driven algorithms, quantum machine learning is expected to similarly impact the development of quantum algorithms. The literature reflects a mutually beneficial relationship between machine learning and quantum computing, where progress in one field frequently drives improvements in the other. Motivated by the fertile connection between machine learning and quantum computing enabled by parameterized quantum circuits, we use a resource-efficient and scalable Single-Qubit Quantum Neural Network (SQQNN) for both regression and classification tasks. The SQQNN leverages parameterized single-qubit unitary operators and quantum measurements to achieve efficient learning. To train the model, we use gradient descent for regression tasks. For classification, we introduce a novel training method inspired by the Taylor series, which can efficiently find a global minimum in a single step. This approach significantly accelerates training compared to iterative methods. Evaluated across various applications, the SQQNN exhibits virtually error-free and strong performance in regression and classification tasks, including the MNIST dataset. These results demonstrate the versatility, scalability, and suitability of the SQQNN for deployment on near-term quantum devices.

[AI-50] Residual Channel Boosts Contrastive Learning for Radio Frequency Fingerprint Identification

Link: https://arxiv.org/abs/2412.08885
Authors: Rui Pan, Hui Chen, Guanxiong Shen, Hongyang Chen
Keywords: Frequency Fingerprint Identification, Radio Frequency Fingerprint, Fingerprint Identification, Radio Frequency, Frequency Fingerprint
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures

Abstract:In order to address the issue of limited data samples for the deployment of pre-trained models in unseen environments, this paper proposes a residual channel-based data augmentation strategy for Radio Frequency Fingerprint Identification (RFFI), coupled with a lightweight SimSiam contrastive learning framework. By applying least square (LS) and minimum mean square error (MMSE) channel estimations followed by equalization, signals with different residual channel effects are generated. These residual channels enable the model to learn more effective representations. Then the pre-trained model is fine-tuned with 1% samples in a novel environment for RFFI. Experimental results demonstrate that our method significantly enhances both feature extraction ability and generalization while requiring fewer samples and less time, making it suitable for practical wireless security applications.

[AI-51] Quantum Kernel-Based Long Short-term Memory for Climate Time-Series Forecasting

Link: https://arxiv.org/abs/2412.08851
Authors: Yu-Chao Hsu, Nan-Yow Chen, Tai-Yu Li, Po-Heng (Henry) Lee, Kuan-Cheng Chen
Keywords: Air Quality Index, Kernel-Based Long short-memory, Quality Index, Quantum Kernel-Based Long, Air Quality
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2411.13225

Abstract:We present the Quantum Kernel-Based Long short-memory (QK-LSTM) network, which integrates quantum kernel methods into classical LSTM architectures to enhance predictive accuracy and computational efficiency in climate time-series forecasting tasks, such as Air Quality Index (AQI) prediction. By embedding classical inputs into high-dimensional quantum feature spaces, QK-LSTM captures intricate nonlinear dependencies and temporal dynamics with fewer trainable parameters. Leveraging quantum kernel methods allows for efficient computation of inner products in quantum spaces, addressing the computational challenges faced by classical models and variational quantum circuit-based models. Designed for the Noisy Intermediate-Scale Quantum (NISQ) era, QK-LSTM supports scalable hybrid quantum-classical implementations. Experimental results demonstrate that QK-LSTM outperforms classical LSTM networks in AQI forecasting, showcasing its potential for environmental monitoring and resource-constrained scenarios, while highlighting the broader applicability of quantum-enhanced machine learning frameworks in tackling large-scale, high-dimensional climate datasets.

[AI-52] Quantum-Train-Based Distributed Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2412.08845
Authors: Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung
Keywords: traditional Reinforcement Learning, Reinforcement Learning, Multi-Agent Reinforcement Learning, Quantum-Train Reinforcement Learning, Distributed Multi-Agent Reinforcement
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)

Abstract: In this paper, we introduce Quantum-Train-Based Distributed Multi-Agent Reinforcement Learning (Dist-QTRL), a novel approach to addressing the scalability challenges of traditional Reinforcement Learning (RL) by integrating quantum computing principles. Quantum-Train Reinforcement Learning (QTRL) leverages parameterized quantum circuits to efficiently generate neural network parameters, achieving a $\mathrm{poly}(\log N)$ reduction in the dimensionality of trainable parameters while harnessing quantum entanglement for superior data representation. The framework is designed for distributed multi-agent environments, where multiple agents, modeled as Quantum Processing Units (QPUs), operate in parallel, enabling faster convergence and enhanced scalability. Additionally, the Dist-QTRL framework can be extended to high-performance computing (HPC) environments by utilizing distributed quantum training for parameter reduction in classical neural networks, followed by inference using classical CPUs or GPUs. This hybrid quantum-HPC approach allows for further optimization in real-world applications. In this paper, we provide a mathematical formulation of the Dist-QTRL framework and explore its convergence properties, supported by empirical results demonstrating performance improvements over centric QTRL models. The results highlight the potential of quantum-enhanced RL in tackling complex, high-dimensional tasks, particularly in distributed computing settings, where our framework achieves significant speedups through parallelization without compromising model accuracy. This work paves the way for scalable, quantum-enhanced RL systems in practical applications, leveraging both quantum and classical computational resources.

[AI-53] Sampling-based Continuous Optimization with Coupled Variables for RNA Design

Link: https://arxiv.org/abs/2412.08751
Authors: Wei Yu Tang, Ning Dai, Tianshuo Zhou, David H. Mathews, Liang Huang
Keywords: RNA design, aims to find, target structure aims, task of RNA, objective function
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: The task of RNA design given a target structure aims to find a sequence that can fold into that structure. It is a computationally hard problem where some version(s) have been proven to be NP-hard. As a result, heuristic methods such as local search have been popular for this task, but because they explore only a fixed number of candidates, they cannot keep up with the exponential growth of the design space, and often perform poorly on longer and harder-to-design structures. We instead formulate these discrete problems as continuous optimization, which starts with a distribution over all possible candidate sequences, and uses gradient descent to improve the expectation of an objective function. We define novel distributions based on coupled variables to rule out invalid sequences given the target structure and to model the correlation between nucleotides. To make it universally applicable to any objective function, we use sampling to approximate the expected objective function, to estimate the gradient, and to select the final candidate. Compared to the state-of-the-art methods, our work consistently outperforms them in key metrics such as Boltzmann probability, ensemble defect, and energy gap, especially on long and hard-to-design puzzles in the Eterna100 benchmark. Our code is available at: this http URL.

[AI-54] A quantum-classical reinforcement learning model to play Atari games

Link: https://arxiv.org/abs/2412.08725
Authors: Dominik Freinberger, Julian Lemmel, Radu Grosu, Sofiene Jerbi
Keywords: Recent advances, parametrized quantum circuits, learning models based, deep learning models, quantum learning models
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 + 13 pages

Abstract: Recent advances in reinforcement learning have demonstrated the potential of quantum learning models based on parametrized quantum circuits (PQCs) as an alternative to deep learning models. On the one hand, these findings have shown the ultimate exponential speed-ups in learning that full-blown quantum models can offer in certain, artificially constructed, environments. On the other hand, they have demonstrated the ability of experimentally accessible PQCs to solve OpenAI Gym benchmarking tasks. However, it remains an open question whether these near-term quantum reinforcement learning (QRL) techniques can be successfully applied to more complex problems exhibiting high-dimensional observation spaces. In this work, we bridge this gap and present a hybrid model combining a PQC with classical feature encoding and post-processing layers that is capable of tackling Atari games. A classical model, subjected to architectural restrictions similar to those present in the hybrid model, is constructed to serve as a reference. Our numerical investigation demonstrates that the proposed hybrid model is capable of solving the Pong environment and achieving scores comparable to the classical reference in Breakout. Furthermore, our findings shed light on important hyperparameter settings and design choices that impact the interplay of the quantum and classical components. This work contributes to the understanding of near-term quantum learning models and makes an important step towards their deployment in real-world RL scenarios.

[AI-55] A Mathematical Framework for Consciousness in Neural Networks

Link: https://arxiv.org/abs/1704.01148
Authors: T.R. Lima
Keywords: explanatory gap, paper presents, bridging the explanatory, Levine, qualia
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:This paper presents a novel mathematical framework for bridging the explanatory gap (Levine, 1983) between consciousness and its physical correlates. Specifically, we propose that qualia correspond to singularities in the mathematical representations of neural network topology. Crucially, we do not claim that qualia are singularities or that singularities “explain” why qualia feel as they do. Instead, we propose that singularities serve as principled, coordinate-invariant markers of points where attempts at purely quantitative description of a system’s dynamics reach an in-principle limit. By integrating these formal markers of irreducibility into models of the physical correlates of consciousness, we establish a framework that recognizes qualia as phenomena inherently beyond reduction to complexity, computation, or information. This approach draws on insights from philosophy of mind, mathematics, cognitive neuroscience, and artificial intelligence (AI). It does not solve the hard problem of consciousness (Chalmers, 1995), but it advances the discourse by integrating the irreducible nature of qualia into a rigorous, physicalist framework. While primarily theoretical, these insights also open avenues for future AI and artificial consciousness (AC) research, suggesting that recognizing and harnessing irreducible topological features may be an important unlock in moving beyond incremental, scale-based improvements and toward artificial general intelligence (AGI) and AC.

Machine Learning

[LG-0] Obfuscated Activations Bypass LLM Latent-Space Defenses

Link: https://arxiv.org/abs/2412.09565
Authors: Luke Bailey,Alex Serrano,Abhay Sheshadri,Mikhail Seleznyov,Jordan Taylor,Erik Jenner,Jacob Hilton,Stephen Casper,Carlos Guestrin,Scott Emmons
Keywords: Recent latent-space monitoring, latent-space monitoring techniques, Recent latent-space, LLM attacks, monitoring techniques
Categories: Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses – including sparse autoencoders, representation probing, and latent OOD detection – are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network’s behavior. This poses a fundamental challenge to latent-space defenses.

[LG-1] Improving the Reliability of Cable Broadband Networks via Proactive Network Maintenance

Link: https://arxiv.org/abs/2412.09564
Authors: Jiyao Hu,Zhenyu Zhou,Xiaowei Yang
Keywords: Proactive Network Maintenance, broadband technologies widely, PNM data, Cable broadband networks, Cable
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 15 pages including references. Submitted to IEEE/ACM Transactions on Networking. Partly published in NSDI’20; this is the extended version

Abstract:Cable broadband networks are one of the few “last-mile” broadband technologies widely available in the U.S. Unfortunately, they have poor reliability after decades of deployment. The cable industry proposed a framework called Proactive Network Maintenance (PNM) to diagnose the cable networks. However, there is little public knowledge or systematic study on how to use these data to detect and localize cable network problems. Existing tools in the public domain have prohibitively high false-positive rates. In this paper, we propose CableMon, the first public-domain system that applies machine learning techniques to PNM data to improve the reliability of cable broadband networks. CableMon tackles two key challenges faced by cable ISPs: accurately detecting failures, and distinguishing whether a failure occurs within a network or at a subscriber’s premise. CableMon uses statistical models to generate features from time series data and uses customer trouble tickets as hints to infer abnormal/failure thresholds for these generated features. Further, CableMon employs an unsupervised learning model to group cable devices sharing similar anomalous patterns and effectively distinguish impairments that occur inside a cable network from impairments that occur at a subscriber’s premise, as these two different faults require different types of technical personnel to repair them. We use eight months of PNM data and customer trouble tickets from an ISP and experimental deployment to evaluate CableMon’s performance. Our evaluation results show that CableMon can effectively detect and distinguish failures from PNM data and outperforms existing public-domain tools.

[LG-2] Capturing the Temporal Dependence of Training Data Influence

Link: https://arxiv.org/abs/2412.09538
Authors: Jiachen T. Wang,Dawn Song,James Zou,Prateek Mittal,Ruoxi Jia
Keywords: Traditional data influence, influence estimation methods, data, Traditional data, data influence estimation
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Correspondence to Jiachen T. Wang and Ruoxi Jia

Abstract:Traditional data influence estimation methods, like influence function, assume that learning algorithms are permutation-invariant with respect to training data. However, modern training paradigms, especially for foundation models using stochastic algorithms and multi-stage curricula, are sensitive to data ordering, thus violating this assumption. This mismatch renders influence functions inadequate for answering a critical question in machine learning: How can we capture the dependence of data influence on the optimization trajectory during training? To address this gap, we formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point from a specific iteration during training, accounting for the exact sequence of data encountered and the model’s optimization trajectory. However, exactly evaluating the trajectory-specific LOO presents a significant computational challenge. To address this, we propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. Specifically, we compute a training data embedding that encapsulates the cumulative interactions between data and the evolving model parameters. The LOO can then be efficiently approximated through a simple dot-product between the data value embedding and the gradient of the given test data. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics. In particular, we uncover distinct phases of data influence, revealing that data points in the early and late stages of training exert a greater impact on the final model. These insights translate into actionable strategies for managing the computational overhead of data selection by strategically timing the selection process, potentially opening new avenues in data curation research.

[LG-3] GainAdaptor: Learning Quadrupedal Locomotion with Dual Actors for Adaptable and Energy-Efficient Walking on Various Terrains

Link: https://arxiv.org/abs/2412.09520
Authors: Mincheol Kim,Nahyun Kwon,Jung-Yup Kim
Keywords: Deep reinforcement learning, Deep reinforcement, controlling legged robots, reinforcement learning, minimalist architectures
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures

Abstract:Deep reinforcement learning (DRL) has emerged as an innovative solution for controlling legged robots in challenging environments using minimalist architectures. Traditional control methods for legged robots, such as inverse dynamics, either directly manage joint torques or use proportional-derivative (PD) controllers to regulate joint positions at a higher level. In case of DRL, direct torque control presents significant challenges, leading to a preference for joint position control. However, this approach necessitates careful adjustment of joint PD gains, which can limit both adaptability and efficiency. In this paper, we propose GainAdaptor, an adaptive gain control framework that autonomously tunes joint PD gains to enhance terrain adaptability and energy efficiency. The framework employs a dual-actor algorithm to dynamically adjust the PD gains based on varying ground conditions. By utilizing a divided action space, GainAdaptor efficiently learns stable and energy-efficient locomotion. We validate the effectiveness of the proposed method through experiments conducted on a Unitree Go1 robot, demonstrating improved locomotion performance across diverse terrains.
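
For readers unfamiliar with joint position control, the PD law that such gains feed into can be sketched as follows. This is a generic PD torque computation, not GainAdaptor itself; the split into one source of target positions and another of gains mirrors the dual-actor idea only loosely, and every numeric value is hypothetical.

```python
def pd_torque(q_target, q, q_dot, kp, kd):
    """Joint torque from a PD law tracking a target position with zero target velocity."""
    return kp * (q_target - q) - kd * q_dot


def joint_torques(targets, positions, velocities, kps, kds):
    """Apply per-joint PD gains, e.g. gains proposed by a second policy head."""
    return [
        pd_torque(qt, q, qd, kp, kd)
        for qt, q, qd, kp, kd in zip(targets, positions, velocities, kps, kds)
    ]


# Hypothetical values: one actor proposes target positions, another proposes gains.
targets = [0.5, -0.2]
positions = [0.4, 0.0]
velocities = [0.0, 1.0]
stiff = joint_torques(targets, positions, velocities, kps=[40.0, 40.0], kds=[1.0, 1.0])
soft = joint_torques(targets, positions, velocities, kps=[20.0, 20.0], kds=[0.5, 0.5])
```

Halving the gains halves the commanded torques for the same tracking error, which is the kind of trade-off between stiffness and energy use that adaptive gains are meant to exploit.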

[LG-4] A novel ML-fuzzy control system for optimizing PHEV fuel efficiency and extending electric range under diverse driving conditions

Link: https://arxiv.org/abs/2412.09499
Authors: Mehrdad Raeesi,Saba Mansour,Sina Changizian
Keywords: greener transportation future, utilizes machine learning, optimize power allocation, plug-in hybrid electric, series hybrid
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 29 pages, 13 figures

Abstract:Aiming for a greener transportation future, this study introduces an innovative control system for plug-in hybrid electric vehicles (PHEVs) that utilizes machine learning (ML) techniques to forecast energy usage in the pure electric mode of the vehicle and optimize power allocation across different operational modes, including pure electric, series hybrid, parallel hybrid, and internal combustion operation. The fuzzy logic decision-making process governs the vehicle control system. The performance was assessed under various driving conditions. Key findings include a significant enhancement in pure electric mode efficiency, achieving an extended full-electric range of approximately 84 kilometers on an 80% utilization of a 20-kWh battery pack. During the WLTC driving cycle, the control system reduced fuel consumption to 2.86 L/100km, representing a 20% reduction in gasoline-equivalent fuel consumption. Evaluations of vehicle performance at discrete driving speeds highlighted effective energy management, with the vehicle battery charging at lower speeds and discharging at higher speeds, showing optimized energy recovery and consumption strategies. Initial battery charge levels notably influenced vehicle performance. A 90% initial charge enabled prolonged all-electric operation, lowering fuel consumption by 2 L/100km relative to the base control system. Real-world driving pattern analysis revealed significant variations, with shorter, slower cycles requiring lower fuel consumption due to prioritized electric propulsion, while longer, faster cycles increased internal combustion engine usage. The control system also adapted to different battery state of health (SOH) conditions, with higher SOH facilitating extended electric mode usage, reducing total fuel consumption by up to 2.87 L/100km.
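
The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The implied baseline is an assumption: it presumes the 20% reduction is measured against a single unreported baseline value on the same cycle.

```python
battery_kwh = 20.0
usable_fraction = 0.80
electric_range_km = 84.0

# Energy drawn from the pack over the full-electric range.
usable_kwh = battery_kwh * usable_fraction

# Average electric consumption implied by the reported range.
kwh_per_100km = 100.0 * usable_kwh / electric_range_km

reported_l_per_100km = 2.86
reduction = 0.20
# Assumption: the 20% reduction is relative to an unreported baseline.
implied_baseline = reported_l_per_100km / (1.0 - reduction)
```

Under these assumptions the pack delivers 16 kWh, the electric consumption is roughly 19 kWh/100 km, and the implied baseline is about 3.6 L/100km.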

[LG-5] Early Detection of At-Risk Students Using Machine Learning

Link: https://arxiv.org/abs/2412.09483
Authors: Azucena L. Jimenez Martinez,Kanika Sood,Rakeshkumar Mahto
Keywords: California State University, California State, research presents preliminary, Fullerton dashboard, State University
Categories: Machine Learning (cs.LG)

Abstract:This research presents preliminary work to address the challenge of identifying at-risk students using supervised machine learning and three unique data categories: engagement, demographics, and performance data collected from Fall 2023 using Canvas and the California State University, Fullerton dashboard. We aim to tackle the persistent challenges of higher education retention and student dropout rates by screening for at-risk students and building a high-risk identification system. By focusing on previously overlooked behavioral factors alongside traditional metrics, this work aims to address educational gaps, enhance student outcomes, and significantly boost student success across disciplines at the University. Pre-processing steps take place to establish a target variable, anonymize student information, manage missing data, and identify the most significant features. Given the mixed data types in the datasets and the binary classification nature of this study, this work considers several machine learning models, including Support Vector Machines (SVM), Naive Bayes, K-nearest neighbors (KNN), Decision Trees, Logistic Regression, and Random Forest. These models predict at-risk students and identify critical periods of the semester when student performance is most vulnerable. We will use validation techniques such as train test split and k-fold cross-validation to ensure the reliability of the models. Our analysis indicates that all algorithms generate an acceptable outcome for at-risk student predictions, while Naive Bayes performs best overall.
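
The k-fold cross-validation mentioned above splits the data into k folds and holds each one out once. A minimal, dependency-free sketch of the index bookkeeping follows. It is not the authors' pipeline: it uses contiguous folds without shuffling or stratification, both of which a real study on imbalanced at-risk labels would likely add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(10, 3))
```

Every sample lands in exactly one validation fold, so each model is scored on data it did not see during that fold's training.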

[LG-6] Bayesian Optimization via Continual Variational Last Layer Training

Link: https://arxiv.org/abs/2412.09477
Authors: Paul Brunzema,Mikkel Jordahn,John Willes,Sebastian Trimpe,Jasper Snoek,James Harrison
Keywords: Gaussian Processes, Euclidean metrics, defined by Euclidean, efficiently updated online, easily captured
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate models for Bayesian optimization (BO) due to their ability to model uncertainty, their performance on tasks where correlations are easily captured (such as those defined by Euclidean metrics), and their ability to be efficiently updated online. However, the performance of GPs depends on the choice of kernel, and kernel selection for complex correlation structures is often difficult or must be made bespoke. While Bayesian neural networks (BNNs) are a promising direction for higher capacity surrogate models, they have so far seen limited use due to poor performance on some problem types. In this paper, we propose an approach which shows competitive performance on many problem types, including some that BNNs typically struggle with. We build on variational Bayesian last layers (VBLLs), and connect training of these models to exact conditioning in GPs. We exploit this connection to develop an efficient online training algorithm that interleaves conditioning and optimization. Our findings suggest that VBLL networks significantly outperform GPs and other BNN architectures on tasks with complex input correlations, and match the performance of well-tuned GPs on established benchmark tasks.

[LG-7] A Novel Ensemble-Based Deep Learning Model with Explainable AI for Accurate Kidney Disease Diagnosis

Link: https://arxiv.org/abs/2412.09472
Authors: Md. Arifuzzaman,Iftekhar Ahmed,Md. Jalal Uddin Chowdhury,Shadman Sakib,Mohammad Shoaib Rahman,Md. Ebrahim Hossain,Shakib Absar
Keywords: Chronic Kidney Disease, Chronic Kidney, Kidney Disease, global health challenge, significant global health
Categories: Machine Learning (cs.LG)

Abstract:Chronic Kidney Disease (CKD) represents a significant global health challenge, characterized by the progressive decline in renal function, leading to the accumulation of waste products and disruptions in fluid balance within the body. Given its pervasive impact on public health, there is a pressing need for effective diagnostic tools to enable timely intervention. Our study delves into the application of cutting-edge transfer learning models for the early detection of CKD. Leveraging a comprehensive and publicly available dataset, we meticulously evaluate the performance of several state-of-the-art models, including EfficientNetV2, InceptionNetV2, MobileNetV2, and the Vision Transformer (ViT) technique. Remarkably, our analysis demonstrates superior accuracy rates, surpassing the 90% threshold with MobileNetV2 and achieving 91.5% accuracy with ViT. Moreover, to enhance predictive capabilities further, we integrate these individual methodologies through ensemble modeling, resulting in our ensemble model exhibiting a remarkable 96% accuracy in the early detection of CKD. This significant advancement holds immense promise for improving clinical outcomes and underscores the critical role of machine learning in addressing complex medical challenges.
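
The abstract does not say how the individual models are combined. One common choice, shown here purely as an assumption, is soft voting: average the class probabilities from each model and take the most likely class. The probabilities below are made up for illustration.

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several models and pick the argmax."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c]), mean


# Hypothetical per-model outputs for one sample: [P(no CKD), P(CKD)]
predictions = [
    [0.60, 0.40],
    [0.30, 0.70],
    [0.20, 0.80],
]
label, mean_probs = soft_vote(predictions)
```

Here one model votes against CKD but is outweighed by the other two, which is how an ensemble can exceed the accuracy of its members when their errors are not perfectly correlated.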

[LG-8] Neural Network Symmetrisation in Concrete Settings

Link: https://arxiv.org/abs/2412.09469
Authors: Rob Cornish
Keywords: neural network symmetrisation, Markov categories, recently gave, context of Markov, gave a general
Categories: Machine Learning (cs.LG)

Abstract:Cornish (2024) recently gave a general theory of neural network symmetrisation in the abstract context of Markov categories. We give a high-level overview of these results, and their concrete implications for the symmetrisation of deterministic functions and of Markov kernels.

[LG-9] Finite-PINN: A Physics-Informed Neural Network Architecture for Solving Solid Mechanics Problems with General Geometries

Link: https://arxiv.org/abs/2412.09453
Authors: Haolin Li,Yuyang Miao,Zahra Sharif Khodaei,M. H. Aliabadi
Keywords: solid mechanics problems, general solid mechanics, demonstrated impressive capabilities, solid mechanics, PINN
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Analysis of PDEs (math.AP)

Abstract:PINN models have demonstrated impressive capabilities in addressing fluid PDE problems, and their potential in solid mechanics is beginning to emerge. This study identifies two key challenges when using PINN to solve general solid mechanics problems. These challenges become evident when comparing the limitations of PINN with the well-established numerical methods commonly used in solid mechanics, such as the finite element method (FEM). Specifically: a) PINN models generate solutions over an infinite domain, which conflicts with the finite boundaries typical of most solid structures; and b) the solution space utilised by PINN is Euclidean, which is inadequate for addressing the complex geometries often present in solid structures. This work proposes a PINN architecture used for general solid mechanics problems, termed the Finite-PINN model. The proposed model aims to effectively address these two challenges while preserving as much of the original implementation of PINN as possible. The unique architecture of the Finite-PINN model addresses these challenges by separating the approximation of stress and displacement fields, and by transforming the solution space from the traditional Euclidean space to a Euclidean-topological joint space. Several case studies presented in this paper demonstrate that the Finite-PINN model provides satisfactory results for a variety of problem types, including both forward and inverse problems, in both 2D and 3D contexts. The developed Finite-PINN model offers a promising tool for addressing general solid mechanics problems, particularly those not yet well-explored in current research.

[LG-10] Search Strategy Generation for Branch and Bound Using Genetic Programming AAAI2025

Link: https://arxiv.org/abs/2412.09444
Authors: Gwen Maudet,Grégoire Danoy
Keywords: Search Strategy, recursively divides, search, search strategy heuristic, search strategy policy
Categories: Machine Learning (cs.LG)
Comments: Accepted at AAAI 2025

Abstract:Branch-and-Bound (B&B) is an exact method in integer programming that recursively divides the search space into a tree. During the resolution process, determining the next subproblem to explore within the tree, known as the search strategy, is crucial. Hand-crafted heuristics are commonly used, but none are effective over all problem classes. Recent approaches utilizing neural networks claim to make more intelligent decisions but are computationally expensive. In this paper, we introduce GP2S (Genetic Programming for Search Strategy), a novel machine learning approach that automatically generates a B&B search strategy heuristic, aiming to make intelligent decisions while being computationally lightweight. We define a policy as a function that evaluates the quality of a B&B node by combining features from the node and the problem; the search strategy policy is then defined by a best-first search based on this node ranking. The policy space is explored using a genetic programming algorithm, and the policy that achieves the best performance on a training set is selected. We compare our approach with the standard method of the SCIP solver, a recent graph neural network-based method, and handcrafted heuristics. Our first evaluation includes three types of primal hard problems, tested on instances similar to the training set and on larger instances. Our method is at most 2% slower than the best baseline and consistently outperforms SCIP, achieving an average speedup of 11.3%. Additionally, GP2S is tested on the MIPLIB 2017 dataset, generating multiple heuristics from different subsets of instances. It exceeds SCIP’s average performance in 7 out of 10 cases across 15 times more instances and under a time limit 15 times longer, with some GP2S methods leading on most experiments in terms of the number of feasible solutions or optimality gap.
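
The idea of a search strategy defined by a node-scoring policy plus best-first selection can be sketched with a priority queue. This is an illustration only: the node features, the scoring expression, and the function names are hypothetical, and a real solver would push child nodes onto the queue as branching proceeds rather than rank a fixed list.

```python
import heapq
import itertools


def best_first_order(nodes, score):
    """Return nodes in the order a best-first search would pop them (highest score first)."""
    counter = itertools.count()  # tie-breaker so nodes themselves are never compared
    heap = [(-score(node), next(counter), node) for node in nodes]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, node = heapq.heappop(heap)
        order.append(node)
    return order


# Hypothetical node features and a hypothetical evolved expression combining them.
open_nodes = [
    {"name": "a", "lower_bound": 10.0, "depth": 3},
    {"name": "b", "lower_bound": 8.0, "depth": 1},
    {"name": "c", "lower_bound": 9.0, "depth": 5},
]
policy = lambda n: -n["lower_bound"] + 0.5 * n["depth"]
visit_order = [n["name"] for n in best_first_order(open_nodes, policy)]
```

Because the policy is a plain arithmetic expression over node features, evaluating it costs far less than a neural network forward pass, which is the lightweight property the abstract emphasizes.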

[LG-11] Mixture of neural fields for heterogeneous reconstruction in cryo-EM

Link: https://arxiv.org/abs/2412.09420
Authors: Axel Levy,Rishwanth Raghu,David Shustin,Adele Rui-Yang Peng,Huan Li,Oliver Biggs Clarke,Gordon Wetzstein,Ellen D. Zhong
Keywords: Cryo-electron microscopy, near-physiological contexts, protein structure determination, determination that images, images an ensemble
Categories: Machine Learning (cs.LG)

Abstract:Cryo-electron microscopy (cryo-EM) is an experimental technique for protein structure determination that images an ensemble of macromolecules in near-physiological contexts. While recent advances enable the reconstruction of dynamic conformations of a single biomolecular complex, current methods do not adequately model samples with mixed conformational and compositional heterogeneity. In particular, datasets containing mixtures of multiple proteins require the joint inference of structure, pose, compositional class, and conformational states for 3D reconstruction. Here, we present Hydra, an approach that models both conformational and compositional heterogeneity fully ab initio by parameterizing structures as arising from one of K neural fields. We employ a new likelihood-based loss function and demonstrate the effectiveness of our approach on synthetic datasets composed of mixtures of proteins with large degrees of conformational variability. We additionally demonstrate Hydra on an experimental dataset of a cellular lysate containing a mixture of different protein complexes. Hydra expands the expressivity of heterogeneous reconstruction methods and thus broadens the scope of cryo-EM to increasingly complex samples.

[LG-12] Opinion de-polarization of social networks with GNNs

Link: https://arxiv.org/abs/2412.09404
Authors: Konstantinos Mylonas,Thrasyvoulos Spyropoulos
Keywords: social media, ground for political, political debate, debate and exchange, Nowadays
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:Nowadays, social media is the ground for political debate and exchange of opinions. There is a significant amount of research that suggests that social media are highly polarized. A phenomenon that is commonly observed is the echo chamber structure, where users are organized in polarized communities and form connections only with similar-minded individuals, limiting themselves to consuming specific content. In this paper we explore a way to decrease the polarization of networks with two echo chambers. Particularly, we observe that if some users adopt a moderate opinion about a topic, the polarization of the network decreases. Based on this observation, we propose an efficient algorithm to identify a good set of K users, such that if they adopt a moderate stance around a topic, the polarization is minimized. Our algorithm employs a Graph Neural Network and thus it can handle large graphs more effectively than other approaches.

[LG-13] A Geometry-Aware Message Passing Neural Network for Modeling Aerodynamics over Airfoils

Link: https://arxiv.org/abs/2412.09399
Authors: Jacob Helwig,Xuan Zhang,Haiyang Yu,Shuiwang Ji
Keywords: involving flows interacting, Computational modeling, aerospace engineering, problem in aerospace, solid objects
Categories: Machine Learning (cs.LG)

Abstract:Computational modeling of aerodynamics is a key problem in aerospace engineering, often involving flows interacting with solid objects such as airfoils. Deep surrogate models have emerged as purely data-driven approaches that learn direct mappings from simulation conditions to solutions based on either simulation or experimental data. Here, we consider modeling of incompressible flows over solid objects, wherein geometric structures are a key factor in determining aerodynamics. To effectively incorporate geometries, we propose a message passing scheme that efficiently and expressively integrates the airfoil shape with the mesh representation. Under this framework, we first obtain a representation of the geometry in the form of a latent graph on the airfoil surface. We subsequently propagate this representation to all collocation points through message passing on a directed, bipartite graph. We demonstrate that this framework supports efficient training by downsampling the solution mesh while avoiding distribution shifts at test time when evaluated on the full mesh. To enable our model to distinguish between distinct spatial regimes of dynamics relative to the airfoil, we represent mesh points in both a leading edge and trailing edge coordinate system. We further enhance the expressiveness of our coordinate system representations by embedding our hybrid Polar-Cartesian coordinates using sinusoidal and spherical harmonics bases. We additionally find that a change of basis to canonicalize input representations with respect to inlet velocity substantially improves generalization. Altogether, these design choices lead to a purely data-driven machine learning framework known as GeoMPNN, which won the Best Student Submission award at the NeurIPS 2024 ML4CFD Competition, placing 4th overall. Our code is publicly available as part of the AIRS library (this https URL).
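
The coordinate features described above can be sketched roughly as follows. This is an assumption-laden illustration, not the GeoMPNN code: it expresses a mesh point in Cartesian and polar form relative to an assumed leading edge and trailing edge, then applies a sinusoidal embedding only (the spherical harmonics basis is omitted). The airfoil geometry, frequency schedule, and function names are all hypothetical.

```python
import math


def hybrid_coords(point, origin):
    """Cartesian offset plus polar radius and angle of `point` relative to `origin`."""
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    return dx, dy, math.hypot(dx, dy), math.atan2(dy, dx)


def sinusoidal_embedding(value, n_frequencies=3):
    """Encode a scalar with sin/cos pairs at geometrically spaced frequencies."""
    feats = []
    for k in range(n_frequencies):
        feats += [math.sin(2 ** k * value), math.cos(2 ** k * value)]
    return feats


leading_edge, trailing_edge = (0.0, 0.0), (1.0, 0.0)  # assumed unit-chord airfoil
mesh_point = (0.5, 0.5)
features = []
for origin in (leading_edge, trailing_edge):
    for coord in hybrid_coords(mesh_point, origin):
        features += sinusoidal_embedding(coord)
```

Describing each point relative to both edges lets a model tell apart regions ahead of, above, and behind the airfoil even when their raw coordinates are close.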

[LG-14] Hybrid variable spiking graph neural networks for energy-efficient scientific machine learning

Link: https://arxiv.org/abs/2412.09379
Authors: Isha Jain,Shailesh Garg,Shaurya Shriyam,Souvik Chakraborty
Keywords: Graph Neural Networks, Graph-based representations, Spiking Graph Neural, Neural Networks, Variable Spiking Graph
Categories: Machine Learning (cs.LG)

Abstract:Graph-based representations for samples of computational mechanics-related datasets can prove instrumental when dealing with problems like irregular domains or molecular structures of materials, etc. To effectively analyze and process such datasets, deep learning offers Graph Neural Networks (GNNs) that utilize techniques like message-passing within their architecture. The issue, however, is that as the individual graph scales and/or the GNN architecture becomes increasingly complex, the increased energy budget of the overall deep learning model makes it unsustainable and restricts its use in applications like edge computing. To overcome this, we propose in this paper Hybrid Variable Spiking Graph Neural Networks (HVS-GNNs) that utilize Variable Spiking Neurons (VSNs) within their architecture to promote sparse communication and hence reduce the overall energy budget. VSNs, while promoting sparse event-driven computations, also perform well for regression tasks, which are often encountered in computational mechanics applications and are the main target of this paper. Three examples dealing with prediction of mechanical properties of material based on microscale/mesoscale structures are shown to test the performance of the proposed HVS-GNNs in regression tasks. We have also compared the performance of HVS-GNN architectures with the performance of vanilla GNNs and GNNs utilizing leaky integrate and fire neurons. The results produced show that HVS-GNNs perform well for regression tasks, all while promoting sparse communication and, hence, energy efficiency.
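
To make the notion of sparse, event-driven communication concrete, here is a sketch of the leaky integrate-and-fire (LIF) neuron that the abstract names as a comparison baseline. It is not the Variable Spiking Neuron proposed in the paper; the leak factor, threshold, hard reset, and input values are assumptions.

```python
def lif_spikes(inputs, leak=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of input currents."""
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane = leak * membrane + current  # leaky integration
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


spike_train = lif_spikes([0.5, 0.5, 0.5, 0.0, 1.2])
sparsity = 1.0 - sum(spike_train) / len(spike_train)
```

The neuron stays silent on most time steps, so downstream layers receive nothing to process for those steps; that silence is the source of the energy savings on event-driven hardware.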

[LG-15] A comprehensive interpretable machine learning framework for Mild Cognitive Impairment and Alzheimer's disease diagnosis

Link: https://arxiv.org/abs/2412.09376
Authors: Maria Eleftheria Vlontzou,Maria Athanasiou,Kalliopi Dalakleidi,Ioanna Skampardoni,Christos Davatzikos,Konstantina Nikita
Keywords: Mild Cognitive Impairment, Alzheimer Disease Neuroimaging, Cognitive Impairment, Mild Cognitive, Disease Neuroimaging Initiative
Categories: Machine Learning (cs.LG)
Comments: This preprint has not been peer-reviewed yet but has been submitted to a journal

Abstract:An interpretable machine learning (ML) framework is introduced to enhance the diagnosis of Mild Cognitive Impairment (MCI) and Alzheimer’s disease (AD) by ensuring robustness of the ML models’ interpretations. The dataset used comprises volumetric measurements from brain MRI and genetic data from healthy individuals and patients with MCI/AD, obtained through the Alzheimer’s Disease Neuroimaging Initiative. The existing class imbalance is addressed by an ensemble learning approach, while various attribution-based and counterfactual-based interpretability methods are leveraged towards producing diverse explanations related to the pathophysiology of MCI/AD. A unification method combining SHAP with counterfactual explanations assesses the interpretability techniques’ robustness. The best performing model yielded 87.5% balanced accuracy and 90.8% F1-score. The attribution-based interpretability methods highlighted significant volumetric and genetic features related to MCI/AD risk. The unification method provided useful insights regarding those features’ necessity and sufficiency, further showcasing their significance in MCI/AD diagnosis.
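
Balanced accuracy and F1 are reported because plain accuracy is misleading under class imbalance. The sketch below shows how both are computed from confusion-matrix counts in the binary case. The counts are invented for illustration and are unrelated to the paper's results, and the study itself may use a multi-class formulation.

```python
def balanced_accuracy_and_f1(tp, fp, tn, fn):
    """Compute balanced accuracy and F1 for the positive class from confusion counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    balanced_acc = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return balanced_acc, f1


# Hypothetical imbalanced test set: 90 patients, 10 healthy controls.
bal_acc, f1 = balanced_accuracy_and_f1(tp=81, fp=2, tn=8, fn=9)
```

With these counts plain accuracy would be 89%, while balanced accuracy is 85% because the minority class is recognized less reliably.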

[LG-16] Diffusion Predictive Control with Constraints

Link: https://arxiv.org/abs/2412.09342
Authors: Ralf Römer,Alexander von Rohr,Angela P. Schoellig
Keywords: recently gained popularity, multimodal distributions, recently gained, gained popularity, popularity for policy
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Code: this https URL . 14 pages, 3 figures, 3 tables

Abstract:Diffusion models have recently gained popularity for policy learning in robotics due to their ability to capture high-dimensional and multimodal distributions. However, diffusion policies are inherently stochastic and typically trained offline, limiting their ability to handle unseen and dynamic conditions where novel constraints not represented in the training data must be satisfied. To overcome this limitation, we propose diffusion predictive control with constraints (DPCC), an algorithm for diffusion-based control with explicit state and action constraints that can deviate from those in the training data. DPCC uses constraint tightening and incorporates model-based projections into the denoising process of a trained trajectory diffusion model. This allows us to generate constraint-satisfying, dynamically feasible, and goal-reaching trajectories for predictive control. We show through simulations of a robot manipulator that DPCC outperforms existing methods in satisfying novel test-time constraints while maintaining performance on the learned control task.
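
The general pattern of interleaving projections with denoising steps can be sketched on a one-dimensional toy. This is not the DPCC algorithm: the projection here is a simple clip onto a tightened box with no dynamics model, and the denoiser is a stand-in function. All names and numbers are assumptions.

```python
def project_to_box(trajectory, lower, upper, tightening=0.0):
    """Clip each state of a trajectory into [lower + tightening, upper - tightening]."""
    lo, hi = lower + tightening, upper - tightening
    if lo > hi:
        raise ValueError("tightening leaves an empty feasible set")
    return [min(max(x, lo), hi) for x in trajectory]


def denoise_with_projection(noisy, denoise_step, n_steps, lower, upper, tightening):
    """Alternate a (hypothetical) denoising step with a projection onto the constraints."""
    traj = list(noisy)
    for _ in range(n_steps):
        traj = denoise_step(traj)
        traj = project_to_box(traj, lower, upper, tightening)
    return traj


# Stand-in for a trained model: pull every state halfway toward 1.0.
toy_denoiser = lambda traj: [0.5 * x + 0.5 for x in traj]
result = denoise_with_projection(
    [-2.0, 0.0, 3.0], toy_denoiser, 3, lower=-1.0, upper=0.8, tightening=0.1
)
```

Tightening shrinks the feasible set by a margin so that the executed trajectory still satisfies the original bounds when the model is slightly wrong.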

[LG-17] Dynamic Prompt Allocation and Tuning for Continual Test-Time Adaptation

Link: https://arxiv.org/abs/2412.09308
Authors: Chaoran Cui,Yongrui Zhen,Shuai Gong,Chunyun Zhang,Hui Liu,Yilong Yin
Keywords: Continual test-time adaptation, Continual test-time, pre-trained source model, evolving target distributions, continuously evolving target
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 5 figures, and 6 tables

Abstract:Continual test-time adaptation (CTTA) has recently emerged to adapt a pre-trained source model to continuously evolving target distributions, which accommodates the dynamic nature of real-world environments. To mitigate the risk of catastrophic forgetting in CTTA, existing methods typically incorporate explicit regularization terms to constrain the variation of model parameters. However, they cannot fundamentally resolve catastrophic forgetting because they rely on a single shared model to adapt across all target domains, which inevitably leads to severe inter-domain interference. In this paper, we introduce learnable domain-specific prompts that guide the model to adapt to corresponding target domains, thereby partially disentangling the parameter space of different domains. In the absence of domain identity for target samples, we propose a novel dynamic Prompt AllocatIon aNd Tuning (PAINT) method, which utilizes a query mechanism to dynamically determine whether the current samples come from a known domain or an unexplored one. For known domains, the corresponding domain-specific prompt is directly selected, while for previously unseen domains, a new prompt is allocated. Prompt tuning is subsequently performed using mutual information maximization along with structural regularization. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our PAINT method for CTTA. We have released our code at this https URL.

[LG-18] Transfer Learning of RSSI to Improve Indoor Localisation Performance

Link: https://arxiv.org/abs/2412.09292
Authors: Thanaphon Suwannaphong,Ryan McConville,Ian Craddock
Keywords: Bluetooth Low Energy, Received Signal Strength, Signal Strength Indicator, tracking patient conditions, health monitoring systems
Categories: Machine Learning (cs.LG)

Abstract:With the growing demand for health monitoring systems, in-home localisation is essential for tracking patient conditions. The unique spatial characteristics of each house required annotated data for Bluetooth Low Energy (BLE) Received Signal Strength Indicator (RSSI)-based monitoring system. However, collecting annotated training data is time-consuming, particularly for patients with limited health conditions. To address this, we propose Conditional Generative Adversarial Networks (ConGAN)-based augmentation, combined with our transfer learning framework (T-ConGAN), to enable the transfer of generic RSSI information between different homes, even when data is collected using different experimental protocols. This enhances the performance and scalability of such intelligent systems by reducing the need for annotation in each home. We are the first to demonstrate that BLE RSSI data can be shared across different homes, and that shared information can improve the indoor localisation performance. Our T-ConGAN enhances the macro F1 score of room-level indoor localisation by up to 12.2%, with a remarkable 51% improvement in challenging areas such as stairways or outside spaces. This state-of-the-art RSSI augmentation model significantly enhances the robustness of in-home health monitoring systems.

[LG-19] Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices

Link: https://arxiv.org/abs/2412.09289
Authors: Thanaphon Suwannaphong,Ferdian Jovan,Ian Craddock,Ryan McConville
Keywords: efficient machine learning, paper proposes small, resource-constrained edge devices, machine learning models, on-device indoor localisation
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices, specifically for on-device indoor localisation. Typical approaches for indoor localisation rely on centralised remote processing of data transmitted from lower powered devices such as wearables. However, there are several benefits for moving this to the edge device itself, including increased battery life, enhanced privacy, reduced latency and lowered operational costs, all of which are key for common applications such as health monitoring. The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size while maintaining high predictive performance. We base our work on a large state-of-the-art transformer-based model and seek to deploy it within low-power MCUs. We also propose a state-space-based architecture using Mamba as a more compact alternative to the transformer. Our results show that the quantized transformer model performs well within a 64 KB RAM constraint, achieving an effective balance between model size and localisation precision. Additionally, the compact Mamba model has strong performance under even tighter constraints, such as a 32 KB of RAM, without the need for model compression, making it a viable option for more resource-limited environments. We demonstrate that, through our framework, it is feasible to deploy advanced indoor localisation models onto low-power MCUs with restricted memory limitations. The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring by providing accurate, real-time location data while minimizing power consumption, increasing data privacy, improving latency and reducing infrastructure costs.
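
The abstract does not say which quantization scheme the authors use. As a rough illustration of why quantization shrinks a model, here is a minimal sketch of symmetric per-tensor int8 quantization, which is an assumption chosen for simplicity and not the paper's method. The weight values are made up. Each weight is stored as one signed byte plus a shared scale, instead of a 4-byte float32.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: returns (codes, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

The rounding error per weight is at most half a quantization step (`scale / 2`), which is the accuracy cost traded for the smaller memory footprint.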

[LG-20] Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation

Link: https://arxiv.org/abs/2412.09265
Authors: Bofang Jia,Pengxiang Ding,Can Cui,Mingyang Sun,Pengfang Qian,Zhaoxin Fan,Donglin Wang
Keywords: Visual-motor policy learning, Visual-motor policy, complex robotic trajectories, modeling complex robotic, learning has advanced
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 17 pages

Abstract:Visual-motor policy learning has advanced with architectures like diffusion-based policies, known for modeling complex robotic trajectories. However, their prolonged inference times hinder high-frequency control tasks requiring real-time feedback. While consistency distillation (CD) accelerates inference, it introduces errors that compromise action quality. To address these limitations, we propose the Score and Distribution Matching Policy (SDM Policy), which transforms diffusion-based policies into single-step generators through a two-stage optimization process: score matching ensures alignment with true action distributions, and distribution matching minimizes KL divergence for consistency. A dual-teacher mechanism integrates a frozen teacher for stability and an unfrozen teacher for adversarial training, enhancing robustness and alignment with target distributions. Evaluated on a 57-task simulation benchmark, SDM Policy achieves a 6x inference speedup while having state-of-the-art action quality, providing an efficient and reliable framework for high-frequency robotic tasks.

[LG-21] Single-View Graph Contrastive Learning with Soft Neighborhood Awareness AAAI2025

Link: https://arxiv.org/abs/2412.09261
Authors: Qingqiang Sun,Chaoqi Chen,Ziyue Qiao,Xubin Zheng,Kai Wang
Keywords: designing effective augmentations, increased computational costs, methods heavily rely, graph contrastive learning, concomitant challenges
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI2025; full version including appendix

Abstract:Most graph contrastive learning (GCL) methods heavily rely on cross-view contrast, thus facing several concomitant challenges, such as the complexity of designing effective augmentations, the potential for information loss between views, and increased computational costs. To mitigate reliance on cross-view contrasts, we propose SIGNA, a novel single-view graph contrastive learning framework. Regarding the inconsistency between structural connection and semantic similarity of neighborhoods, we resort to soft neighborhood awareness for GCL. Specifically, we leverage dropout to obtain structurally-related yet randomly-noised embedding pairs for neighbors, which serve as potential positive samples. At each epoch, the role of partial neighbors is switched from positive to negative, leading to probabilistic neighborhood contrastive learning effect. Furthermore, we propose a normalized Jensen-Shannon divergence estimator for a better effect of contrastive learning. Surprisingly, experiments on diverse node-level tasks demonstrate that our simple single-view GCL framework consistently outperforms existing methods by margins of up to 21.74% (PPI). In particular, with soft neighborhood awareness, SIGNA can adopt MLPs instead of complicated GCNs as the encoder to generate representations in transductive learning tasks, thus speeding up its inference process by 109 times to 331 times. The source code is available at this https URL.

[LG-22] When Can Memorization Improve Fairness?

Link: https://arxiv.org/abs/2412.09254
Authors: Bob Pepin,Christian Igel,Raghavendra Selvan
Keywords: extent additive fairness, multi-class classification problem, additive fairness metrics, statistical parity, equal opportunity
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:We study to which extent additive fairness metrics (statistical parity, equal opportunity and equalized odds) can be influenced in a multi-class classification problem by memorizing a subset of the population. We give explicit expressions for the bias resulting from memorization in terms of the label and group membership distribution of the memorized dataset and the classifier bias on the unmemorized dataset. We also characterize the memorized datasets that eliminate the bias for all three metrics considered. Finally we provide upper and lower bounds on the total probability mass in the memorized dataset that is necessary for the complete elimination of these biases.

[LG-23] GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

Link: https://arxiv.org/abs/2412.09250
Authors: Abdessalam Ed-dib,Zhanibek Datbayev,Amine Mohamed Aboussalah
Keywords: Fine-tuning large language, large language models, Geometric Low-Rank Adaptation, large language, computationally intensive
Categories: Machine Learning (cs.LG); Geometric Topology (math.GT); Machine Learning (stat.ML)

Abstract:Fine-tuning large language models (LLMs) is computationally intensive because it requires updating all parameters. Low-Rank Adaptation (LoRA) improves efficiency by modifying only a subset of weights but introduces a trade-off between expressivity and computational cost: lower ranks reduce resources but limit expressiveness, while higher ranks enhance expressivity at increased cost. Despite recent advances in adaptive LoRA techniques, existing methods fail to provide a theoretical basis for optimizing the trade-off between model performance and efficiency. We propose Geometric Low-Rank Adaptation (GeLoRA), a novel framework that computes the intrinsic dimensionality of hidden state representations to adaptively select LoRA ranks. We demonstrate that the intrinsic dimension provides a lower bound for the optimal rank of LoRA matrices, allowing for a principled selection that balances efficiency and expressivity. GeLoRA dynamically adjusts the rank for each layer based on the intrinsic dimensionality of its input and output representations, recognizing that not all model parameters equally impact fine-tuning. Empirical validation on multiple tasks shows that GeLoRA consistently outperforms recent baselines within the same parameter budget.
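
To make the rank versus cost trade-off concrete, here is a small arithmetic sketch. It assumes the standard LoRA parametrisation, in which the update to a `d_out x d_in` weight matrix is factored into a `d_out x r` matrix and an `r x d_in` matrix. The layer size of 768 is a hypothetical example, and nothing here reproduces GeLoRA's rank selection rule.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of a rank-r LoRA update (two low-rank factors)."""
    return rank * (d_in + d_out)


# Hypothetical 768 x 768 projection layer.
d = 768
full = full_finetune_params(d, d)
for rank in (4, 8, 64):
    lora = lora_trainable_params(d, d, rank)
    print(f"rank={rank:>2}: {lora} trainable params ({lora / full:.1%} of full)")
```

The cost grows linearly in the rank, which is why a per-layer rank choice matters when the total parameter budget is fixed.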

[LG-24] Uplift modeling with continuous treatments: A predict-then-optimize approach

Link: https://arxiv.org/abs/2412.09232
Authors: Simon De Vos,Christopher Bockel-Rickermann,Stefan Lessmann,Wouter Verbeke
Keywords: optimize specific outcomes, recommend actions, actions that optimize, optimize specific, specific outcomes
Categories: Machine Learning (cs.LG)

Abstract:The goal of uplift modeling is to recommend actions that optimize specific outcomes by determining which entities should receive treatment. One common approach involves two steps: first, an inference step that estimates conditional average treatment effects (CATEs), and second, an optimization step that ranks entities based on their CATE values and assigns treatment to the top k within a given budget. While uplift modeling typically focuses on binary treatments, many real-world applications are characterized by continuous-valued treatments, i.e., a treatment dose. This paper presents a predict-then-optimize framework to allow for continuous treatments in uplift modeling. First, in the inference step, conditional average dose responses (CADRs) are estimated from data using causal machine learning techniques. Second, in the optimization step, we frame the assignment task of continuous treatments as a dose-allocation problem and solve it using integer linear programming (ILP). This approach allows decision-makers to efficiently and effectively allocate treatment doses while balancing resource availability, with the possibility of adding extra constraints like fairness considerations or adapting the objective function to take into account instance-dependent costs and benefits to maximize utility. The experiments compare several CADR estimators and illustrate the trade-offs between policy value and fairness, as well as the impact of an adapted objective function. This showcases the framework’s advantages and flexibility across diverse applications in healthcare, lending, and human resource management. All code is available on this http URL.
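
The optimization step can be pictured with a toy dose-allocation problem. The sketch below is not the authors' ILP formulation: it swaps in an exhaustive search over a handful of discrete dose levels, which is only workable for tiny instances. The function name `allocate_doses`, the response values, and the budget are all hypothetical.

```python
from itertools import product


def allocate_doses(responses, dose_levels, budget):
    """Pick one dose per entity to maximise total predicted response.

    responses[i][d] is the (hypothetical) estimated response of entity i
    at dose d. The total dose handed out must not exceed the budget.
    """
    best_value, best_plan = float("-inf"), None
    for plan in product(dose_levels, repeat=len(responses)):
        if sum(plan) > budget:
            continue  # plan violates the budget constraint
        value = sum(responses[i][d] for i, d in enumerate(plan))
        if value > best_value:
            best_value, best_plan = value, plan
    return best_plan, best_value


# Made-up dose-response estimates for two entities.
responses = [
    {0: 0.0, 1: 0.3, 2: 0.4},
    {0: 0.0, 1: 0.1, 2: 0.5},
]
plan, value = allocate_doses(responses, dose_levels=(0, 1, 2), budget=2)
print(plan, value)  # (0, 2) 0.5
```

With a budget of two dose units, giving both units to the second entity beats splitting them, because that entity's estimated response rises steeply at the higher dose.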

[LG-25] On the Generation and Removal of Speaker Adversarial Perturbation for Voice-Privacy Protection

Link: https://arxiv.org/abs/2412.09195
Authors: Chenyang Guo,Liping Chen,Zhuhai Li,Kong Aik Lee,Zhen-Hua Ling,Wu Guo
Keywords: adversarial attacks mounted, Neural networks, input data, attacks mounted, adversarial attacks
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 3 figures, published to IEEE SLT Workshop 2024

Abstract:Neural networks are commonly known to be vulnerable to adversarial attacks mounted through subtle perturbation on the input data. Recent development in voice-privacy protection has shown the positive use cases of the same technique to conceal speaker’s voice attribute with additive perturbation signal generated by an adversarial network. This paper examines the reversibility property where an entity generating the adversarial perturbations is authorized to remove them and restore original speech (e.g., the speaker him/herself). A similar technique could also be used by an investigator to deanonymize a voice-protected speech to restore criminals’ identities in security and forensic analysis. In this setting, the perturbation generative module is assumed to be known in the removal process. To this end, a joint training of perturbation generation and removal modules is proposed. Experimental results on the LibriSpeech dataset demonstrated that the subtle perturbations added to the original speech can be predicted from the anonymized speech while achieving the goal of privacy protection. By removing these perturbations from the anonymized sample, the original speech can be restored. Audio samples can be found in this https URL.

[LG-26] Student-Informed Teacher Training

Link: https://arxiv.org/abs/2412.09149
Authors: Nico Messikommer,Jiaxu Xing,Elie Aljalbout,Davide Scaramuzza
Keywords: teacher, privileged imitation learning, Imitation learning, high-dimensional inputs, learning complex control
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations, e.g., in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher’s behavior due to partial observability. This problem arises because the teacher is trained without considering if the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student despite the latter’s limited access to information and its partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.

[LG-27] A Brief Discussion on KPI Development in Public Administration

Link: https://arxiv.org/abs/2412.09142
Authors: Simona Fioretto,Elio Masciari,Enea Vincenzo Napolitano
Keywords: effective service delivery, Efficient and effective, leveraging Random Forest, Random Forest algorithms, key performance indicators
Categories: Machine Learning (cs.LG)

Abstract:Efficient and effective service delivery in Public Administration (PA) relies on the development and utilization of key performance indicators (KPIs) for evaluating and measuring performance. This paper presents an innovative framework for KPI construction within performance evaluation systems, leveraging Random Forest algorithms and variable importance analysis. The proposed approach identifies key variables that significantly influence PA performance, offering valuable insights into the critical factors driving organizational success. By integrating variable importance analysis with expert consultation, relevant KPIs can be systematically developed, ensuring that improvement strategies address performance-critical areas. The framework incorporates continuous monitoring mechanisms and adaptive phases to refine KPIs in response to evolving administrative needs. This study aims to enhance PA performance through the application of machine learning techniques, fostering a more agile and results-driven approach to public administration.

[LG-28] MMD-OPT: Maximum Mean Discrepancy Based Sample Efficient Collision Risk Minimization for Autonomous Driving

Link: https://arxiv.org/abs/2412.09121
Authors: Basant Sharma,Arun Kumar Singh
Keywords: Kernel Hilbert Space, Reproducing Kernel Hilbert, arbitrary prediction distribution, dynamic obstacles, sample-efficient approach
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:We propose MMD-OPT: a sample-efficient approach for minimizing the risk of collision under arbitrary prediction distribution of the dynamic obstacles. MMD-OPT is based on embedding distribution in Reproducing Kernel Hilbert Space (RKHS) and the associated Maximum Mean Discrepancy (MMD). We show how these two concepts can be used to define a sample efficient surrogate for collision risk estimate. We perform extensive simulations to validate the effectiveness of MMD-OPT on both synthetic and real-world datasets. Importantly, we show that trajectory optimization with our MMD-based collision risk surrogate leads to safer trajectories at low sample regimes than popular alternatives based on Conditional Value at Risk (CVaR).
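
For readers unfamiliar with MMD, the sketch below computes the standard biased sample estimate of the squared Maximum Mean Discrepancy between two sets of points, using a Gaussian (RBF) kernel. It illustrates the general quantity only. How the paper turns MMD into a collision-risk surrogate is not described in the abstract, and the kernel choice, bandwidth, and sample points here are assumptions.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two points given as equal-length tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2 between sample sets xs and ys."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Made-up 2-D samples: one cluster near the origin, one shifted away.
near = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]
far = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]
print(mmd_squared(near, near))  # numerically zero for identical sets
print(mmd_squared(near, far))   # clearly positive for separated sets
```

The estimate is zero when the two sample sets coincide and grows as they separate, which is the property that makes it usable as a distance between distributions represented only by samples.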

[LG-29] The Utility and Complexity of In- and Out-of-Distribution Machine Unlearning

Link: https://arxiv.org/abs/2412.09119
Authors: Youssef Allouah,Joshua Kazdan,Rachid Guerraoui,Sanmi Koyejo
Keywords: knowledge gaps post-deployment, selectively removing data, Machine unlearning, trained models, process of selectively
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC)

Abstract:Machine unlearning, the process of selectively removing data from trained models, is increasingly crucial for addressing privacy concerns and knowledge gaps post-deployment. Despite this importance, existing approaches are often heuristic and lack formal guarantees. In this paper, we analyze the fundamental utility, time, and space complexity trade-offs of approximate unlearning, providing rigorous certification analogous to differential privacy. For in-distribution forget data – data similar to the retain set – we show that a surprisingly simple and general procedure, empirical risk minimization with output perturbation, achieves tight unlearning-utility-complexity trade-offs, addressing a previous theoretical gap on the separation from unlearning “for free” via differential privacy, which inherently facilitates the removal of such data. However, such techniques fail with out-of-distribution forget data – data significantly different from the retain set – where unlearning time complexity can exceed that of retraining, even for a single sample. To address this, we propose a new robust and noisy gradient descent variant that provably amortizes unlearning time complexity without compromising utility.

[LG-30] An Algorithm-Centered Approach To Model Streaming Data

Link: https://arxiv.org/abs/2412.09118
Authors: Fabian Hinder,Valerie Vaquet,David Komnick,Barbara Hammer
Keywords: potentially non-stationary environments, non-stationary environments, constitutes a well-established, potentially non-stationary, stream learning constitutes
Categories: Machine Learning (cs.LG)
Comments: This manuscript is currently under review at the Symposium on Intelligent Data Analysis (IDA 2025)

Abstract:Besides the classical offline setup of machine learning, stream learning constitutes a well-established setup where data arrives over time in potentially non-stationary environments. Concept drift, the phenomenon that the underlying distribution changes over time, poses a significant challenge. Yet, despite high practical relevance, there is little to no foundational theory for learning in the drifting setup comparable to classical statistical learning theory in the offline setting. This can be attributed to the lack of an underlying object comparable to a probability distribution as in the classical setup. While there exist approaches to transfer ideas to the streaming setup, these start from a data perspective rather than an algorithmic one. In this work, we suggest a new model of data over time that is aimed at the algorithm’s perspective. Instead of defining the setup using time points, we utilize a window-based approach that resembles the inner workings of most stream learning algorithms. We compare our framework to others from the literature on a theoretical basis, showing that in many cases both model the same situation. Furthermore, we perform a numerical evaluation and showcase an application in the domain of critical infrastructure.

[LG-31] How to Re-enable PDE Loss for Physical Systems Modeling Under Partial Observation AAAI2025

Link: https://arxiv.org/abs/2412.09116
Authors: Haodong Feng,Yue Wang,Dixia Fan
Keywords: PDE loss, machine learning techniques, Re-enable PDE Loss, PDE, Re-enable PDE
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI2025

Abstract:In science and engineering, machine learning techniques are increasingly successful in physical systems modeling (predicting future states of physical systems). Effectively integrating PDE loss as a constraint of system transition can improve the model’s prediction by overcoming generalization issues due to data scarcity, especially when data acquisition is costly. However, in many real-world scenarios, due to sensor limitations, the data we can obtain is often only partial observation, making the calculation of PDE loss seem to be infeasible, as the PDE loss heavily relies on high-resolution states. We carefully study this problem and propose a novel framework named Re-enable PDE Loss under Partial Observation (RPLPO). The key idea is that although enabling PDE loss to constrain system transition solely is infeasible, we can re-enable PDE loss by reconstructing the learnable high-resolution state and constraining system transition simultaneously. Specifically, RPLPO combines an encoding module for reconstructing learnable high-resolution states with a transition module for predicting future states. The two modules are jointly trained by data and PDE loss. We conduct experiments in various physical systems to demonstrate that RPLPO has significant improvement in generalization, even when observation is sparse, irregular, noisy, and PDE is inaccurate. The code is available on GitHub: RPLPO.

[LG-32] Integrated trucks assignment and scheduling problem with mixed service mode docks: A Q-learning based adaptive large neighborhood search algorithm

Link: https://arxiv.org/abs/2412.09090
Authors: Yueyi Li,Mehrdad Mohammadi,Xiaodong Zhang,Yunxing Lan,Willem van Jaarsveld
Keywords: docks enhance efficiency, Mixed service mode, mode docks enhance, Mixed service, truck assignment
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 29 pages, 12 figures, 15 tables

Abstract:Mixed service mode docks enhance efficiency by flexibly handling both loading and unloading trucks in warehouses. However, existing research often predetermines the number and location of these docks prior to planning truck assignment and sequencing. This paper proposes a new model integrating dock mode decision, truck assignment, and scheduling, thus enabling adaptive dock mode arrangements. Specifically, we introduce a Q-learning-based adaptive large neighborhood search (Q-ALNS) algorithm to address the integrated problem. The algorithm adjusts dock modes via perturbation operators, while truck assignment and scheduling are solved using destroy and repair local search operators. Q-learning adaptively selects these operators based on their performance history and future gains, employing the epsilon-greedy strategy. Extensive experimental results and statistical analysis indicate that the Q-ALNS benefits from efficient operator combinations and its adaptive mechanism, consistently outperforming benchmark algorithms in terms of optimality gap and Pareto front discovery. In comparison to the predetermined service mode, our adaptive strategy results in lower average tardiness and makespan, highlighting its superior adaptability to varying demands.
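
The operator-selection idea can be sketched with a tabular Q-learning agent that picks a search operator with an epsilon-greedy rule and updates its value from the observed reward. This is a generic illustration, not the paper's Q-ALNS: the class name `OperatorSelector`, the state labels, the operator names, the reward, and the hyperparameters are all hypothetical.

```python
import random


class OperatorSelector:
    """Tabular Q-learning over (state, operator) pairs with epsilon-greedy choice."""

    def __init__(self, operators, epsilon=0.1, alpha=0.5, gamma=0.9, seed=0):
        self.operators = list(operators)
        self.epsilon = epsilon  # exploration probability
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount on future gains
        self.q = {}             # (state, operator) -> estimated value
        self.rng = random.Random(seed)

    def select(self, state):
        """Explore with probability epsilon, otherwise pick the best-valued operator."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.operators)
        return max(self.operators, key=lambda op: self.q.get((state, op), 0.0))

    def update(self, state, operator, reward, next_state):
        """Standard Q-learning update toward reward plus discounted best next value."""
        best_next = max(self.q.get((next_state, op), 0.0) for op in self.operators)
        old = self.q.get((state, operator), 0.0)
        target = reward + self.gamma * best_next
        self.q[(state, operator)] = old + self.alpha * (target - old)


# Hypothetical destroy/repair operators and a made-up reward signal.
selector = OperatorSelector(["random_destroy", "worst_destroy", "greedy_repair"], epsilon=0.0)
selector.update("improving", "worst_destroy", reward=1.0, next_state="improving")
print(selector.select("improving"))  # worst_destroy
```

In an ALNS loop the reward would typically reflect how much the chosen operator improved the incumbent solution, so operators that have worked well recently get selected more often.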

[LG-33] Neural Networks for Threshold Dynamics Reconstruction

Link: https://arxiv.org/abs/2412.09079
Authors: Elisa Negrini,Almanzo Jiahe Gao,Abigail Bowering,Wei Zhu,Luca Capogna
Keywords: convolutional neural network, MBO network, meta-learning MBO network, cellular automatons, learn threshold dynamics
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: Key words: threshold dynamics, cellular automaton, inverse problems, convolutional neural networks, deep learning

Abstract:We introduce two convolutional neural network (CNN) architectures, inspired by the Merriman-Bence-Osher (MBO) algorithm and by cellular automatons, to model and learn threshold dynamics for front evolution from video data. The first model, termed the (single-dynamics) MBO network, learns a specific kernel and threshold for each input video without adapting to new dynamics, while the second, a meta-learning MBO network, generalizes across diverse threshold dynamics by adapting its parameters per input. Both models are evaluated on synthetic and real-world videos (ice melting and fire front propagation), with performance metrics indicating effective reconstruction and extrapolation of evolving boundaries, even under noisy conditions. Empirical results highlight the robustness of both networks across varied synthetic and real-world dynamics.
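
A threshold-dynamics step alternates a convolution with a pointwise threshold, which is the structure the networks above are said to learn a kernel and threshold for. The sketch below shows one such step on a 1-D binary signal with a fixed, hand-picked kernel. It is a toy illustration under those assumptions, not the authors' architecture, and the kernel and threshold values are made up.

```python
def threshold_step(u, kernel, threshold):
    """One convolve-then-threshold step on a 1-D binary signal (zero padding)."""
    half = len(kernel) // 2
    n = len(u)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:  # cells outside the signal count as 0
                acc += w * u[idx]
        out.append(1 if acc >= threshold else 0)
    return out


kernel = [0.25, 0.5, 0.25]  # hypothetical smoothing kernel
print(threshold_step([0, 0, 1, 0, 0], kernel, 0.6))        # [0, 0, 0, 0, 0]
print(threshold_step([0, 0, 1, 1, 1, 0, 0], kernel, 0.6))  # [0, 0, 1, 1, 1, 0, 0]
```

With this kernel and threshold an isolated active cell disappears while a block of three survives, so changing either parameter changes how the front evolves.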

[LG-34] Multi-view Clustering via Unified Multi-kernel Learning and Matrix Factorization

Link: https://arxiv.org/abs/2412.09065
Authors: Chenxing Jia,Mingjie Cai,Hamido Fujita
Keywords: increasingly important due, Multi-view clustering, factorization-based multi-view clustering, clustering, multi-view clustering methods
Categories: Machine Learning (cs.LG)

Abstract:Multi-view clustering has become increasingly important due to the multi-source character of real-world data. Among existing multi-view clustering methods, multi-kernel clustering and matrix factorization-based multi-view clustering have gained widespread attention as mainstream approaches. However, multi-kernel clustering tends to learn an optimal kernel and then perform eigenvalue decomposition on it, which leads to high computational complexity. Matrix factorization-based multi-view clustering methods impose orthogonal constraints on individual views. This overly emphasizes the accuracy of clustering structures within single views and restricts the learning of individual views. Based on this analysis, we propose a multi-view clustering method that integrates multi-kernel learning with matrix factorization. This approach combines the advantages of both multi-kernel learning and matrix factorization. It removes the orthogonal constraints on individual views and imposes orthogonal constraints on the consensus matrix, resulting in an accurate final clustering structure. Ultimately, the method is unified into a simple form of multi-kernel clustering, but avoids learning an optimal kernel, thus reducing the time complexity. Furthermore, we propose an efficient three-step optimization algorithm to achieve a locally optimal solution. Experiments on widely-used real-world datasets demonstrate the effectiveness of our proposed method.

[LG-35] Go With the Flow: Fast Diffusion for Gaussian Mixture Models

Link: https://arxiv.org/abs/2412.09059
Authors: George Rapakoulias,Ali Reza Pedram,Panagiotis Tsiotras
Keywords: Schrödinger Bridges, suitable cost functional, processes that steer, finite time, cost functional
Categories: Machine Learning (cs.LG)

Abstract:Schrödinger Bridges (SB) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. Furthermore, our method generalizes naturally to more general classes of dynamical systems such as controllable Linear Time-Varying systems that cannot currently be solved using traditional neural SB approaches. We showcase the potential of this approach in low-to-moderate dimensional problems such as image-to-image translation in the latent space of an autoencoder, and various other examples. We also benchmark our approach on an Entropic Optimal Transport (EOT) problem and show that it outperforms state-of-the-art methods in cases where the boundary distributions are mixture models while requiring virtually no training.

[LG-36] Safe Active Learning for Gaussian Differential Equations

Link: https://arxiv.org/abs/2412.09053
Authors: Leon Glass,Katharina Ensinger,Christoph Zimmer
Keywords: Gaussian Process differential, Process differential equations, Gaussian Process, recently gained momentum, gained momentum due
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Gaussian Process differential equations (GPODE) have recently gained momentum due to their ability to capture the dynamic behavior of systems and to represent uncertainty in predictions. Prior work has described the process of training the hyperparameters and, thereby, calibrating GPODE to data. How to design efficient algorithms to collect data for training GPODE models is still an open field of research. Nevertheless, high-quality training data is key for model performance. Furthermore, data collection incurs time and financial costs and might in some areas even be safety-critical to the system under test. Therefore, algorithms for safe and efficient data collection are central for building high-quality GPODE models. Our novel Safe Active Learning (SAL) for GPODE algorithm addresses this challenge by suggesting a mechanism to propose efficient and non-safety-critical data to collect. SAL GPODE does so by sequentially suggesting new data, measuring it and updating the GPODE model with the new data. In this way, subsequent data points are iteratively suggested. The core of our SAL GPODE algorithm is a constrained optimization problem maximizing information of new data for GPODE model training constrained by the safety of the underlying system. We demonstrate our novel SAL GPODE’s superiority compared to a standard, non-active way of measuring new data on two relevant examples.

[LG-37] Beyond Confusion: A Fine-grained Dialectical Examination of Human Activity Recognition Benchmark Datasets

Link: https://arxiv.org/abs/2412.09037
Authors: Daniel Geissler, Dominique Nshimyimana, Vitor Fortes Rey, Sungho Suh, Bo Zhou, Paul Lukowicz
Keywords: human activity recognition, made significant progress, machine learning, algorithms for human, activity recognition
Categories: Machine Learning (cs.LG)

Abstract:Research on machine learning (ML) algorithms for human activity recognition (HAR) has made significant progress with publicly available datasets. However, most research prioritizes statistical metrics over examining negative sample details. While recent models like transformers have been applied to HAR datasets with limited success on the benchmark metrics, their counterparts have effectively solved problems on similar levels with near 100% accuracy. This raises questions about the limitations of current approaches. This paper aims to address these open questions by conducting a fine-grained inspection of six popular HAR benchmark datasets. We identified parts of the data that none of the six chosen state-of-the-art ML methods can correctly classify, denoted as the intersect of false classifications (IFC). Analysis of the IFC reveals several underlying problems, including ambiguous annotations, irregularities during recording execution, and misaligned transition periods. We contribute to the field by quantifying and characterizing annotated data ambiguities, providing a trinary categorization mask for dataset patching, and stressing potential improvements for future data collections.
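
The intersect of false classifications (IFC) described in this abstract can be illustrated with a minimal sketch: collect the indices each model gets wrong and intersect those sets. The function name, the toy labels, and the three model names below are hypothetical and chosen only for illustration; the paper's actual pipeline is not shown in the abstract.

```python
def intersect_of_false_classifications(y_true, predictions_by_model):
    """Return the sorted indices of samples that every model misclassifies.

    y_true: list of ground-truth labels.
    predictions_by_model: dict mapping a model name to its list of predictions.
    With an empty dict, every index is returned (no model got anything right).
    """
    ifc = set(range(len(y_true)))
    for y_pred in predictions_by_model.values():
        # Indices where this model disagrees with the annotation.
        wrong = {i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p}
        ifc &= wrong
    return sorted(ifc)


# Toy data (hypothetical): five windows, three models.
y_true = ["walk", "sit", "run", "sit", "walk"]
predictions_by_model = {
    "model_a": ["walk", "run", "sit", "sit", "run"],
    "model_b": ["sit", "run", "run", "sit", "run"],
    "model_c": ["walk", "walk", "run", "walk", "sit"],
}
ifc_indices = intersect_of_false_classifications(y_true, predictions_by_model)
print(ifc_indices)  # → [1, 4]
```

Samples in the IFC are the natural candidates for manual inspection, since a shared failure across all models points to the data (for example, an ambiguous annotation) rather than to any single model.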

[LG-38] Pulling the Carpet Below the Learner's Feet: Genetic Algorithm To Learn Ensemble Machine Learning Model During Concept Drift

Link: https://arxiv.org/abs/2412.09035
Authors: Teddy Lazebnik
Keywords: Data-driven models, machine learning, engineering domains, gained popularity, popularity over recent
Categories: Machine Learning (cs.LG)

Abstract:Data-driven models, in general, and machine learning (ML) models, in particular, have gained popularity over recent years with an increased usage of such models across the scientific and engineering domains. When using ML models in realistic and dynamic environments, users often need to handle the challenge of concept drift (CD). In this study, we explore the application of genetic algorithms (GAs) to address the challenges posed by CD in such settings. We propose a novel two-level ensemble ML model, which combines a global ML model with a CD detector, operating as an aggregator for a population of ML pipeline models, each with its own adjusted CD detector responsible for re-training its ML model. In addition, we show one can further improve the proposed model by utilizing off-the-shelf automatic ML methods. Through extensive synthetic dataset analysis, we show that the proposed model outperforms a single ML pipeline with a CD algorithm, particularly in scenarios with unknown CD characteristics. Overall, this study highlights the potential of ensemble ML and CD models obtained through a heuristic and adaptive optimization process such as the GA one to handle complex CD events.

[LG-39] Learning and Current Prediction of PMSM Drive via Differential Neural Networks

Link: https://arxiv.org/abs/2412.09028
Authors: Wenjie Mei, Xiaorui Wang, Yanrong Lu, Ke Yu, Shihua Li
Keywords: making accurate predictions, understanding complex phenomena, continuous time, time is significant, significant for understanding
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Learning models for dynamical systems in continuous time is significant for understanding complex phenomena and making accurate predictions. This study presents a novel approach utilizing differential neural networks (DNNs) to model nonlinear systems, specifically permanent magnet synchronous motors (PMSMs), and to predict their current trajectories. The efficacy of our approach is validated through experiments conducted under various load disturbances and no-load conditions. The results demonstrate that our method effectively and accurately reconstructs the original systems, showcasing strong short-term and long-term prediction capabilities and robustness. This study provides valuable insights into learning the inherent dynamics of complex dynamical data and holds potential for further applications in fields such as weather forecasting, robotics, and collective behavior analysis.

[LG-40] Training Physical Neural Networks for Analog In-Memory Computing

Link: https://arxiv.org/abs/2412.09010
Authors: Yusuke Sakemi, Yuji Okamoto, Takashi Morie, Sou Nobukawa, Takeo Hosomi, Kazuyuki Aihara
Keywords: von Neumann bottleneck, Neumann bottleneck encountered, In-memory computing, von Neumann, Neumann bottleneck
Categories: Machine Learning (cs.LG)
Comments: 53 pages, 20 figures

Abstract:In-memory computing (IMC) architectures mitigate the von Neumann bottleneck encountered in traditional deep learning accelerators. Their energy efficiency can realize deep learning-based edge applications. However, because IMC is implemented using analog circuits, inherent non-idealities in the hardware pose significant challenges. This paper presents physical neural networks (PNNs) for constructing physical models of IMC. PNNs can address the synaptic current’s dependence on membrane potential, a challenge in charge-domain IMC systems. The proposed model is mathematically equivalent to spiking neural networks with reversal potentials. With a novel technique called differentiable spike-time discretization, the PNNs are efficiently trained. We show that hardware non-idealities traditionally viewed as detrimental can enhance the model’s learning performance. This bottom-up methodology was validated by designing an IMC circuit with non-ideal characteristics using the sky130 process. When employing this bottom-up approach, the modeling error was reduced by an order of magnitude compared to conventional top-down methods in post-layout simulations.

[LG-41] A physics-informed transformer neural operator for learning generalized solutions of initial boundary value problems

Link: https://arxiv.org/abs/2412.09009
Authors: Sumanth Kumar Boya, Deepak Subramani
Keywords: problems arise commonly, natural systems governed, nonlinear partial differential, partial differential equations, problems arise
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 29 pages, 11 figures, 4 tables

Abstract:Initial boundary value problems arise commonly in applications with engineering and natural systems governed by nonlinear partial differential equations (PDEs). Operator learning is an emerging field for solving these equations by using a neural network to learn a map between infinite dimensional input and output function spaces. These neural operators are trained using a combination of data (observations or simulations) and PDE-residuals (physics-loss). A major drawback of existing neural approaches is the requirement to retrain with new initial/boundary conditions, and the necessity for a large amount of simulation data for training. We develop a physics-informed transformer neural operator (named PINTO) that efficiently generalizes to unseen initial and boundary conditions, trained in a simulation-free setting using only physics loss. The main innovation lies in our new iterative kernel integral operator units, implemented using cross-attention, to transform the PDE solution’s domain points into an initial/boundary condition-aware representation vector, enabling efficient learning of the solution function for new scenarios. The PINTO architecture is applied to simulate the solutions of important equations used in engineering applications: advection, Burgers, and steady and unsteady Navier-Stokes equations (three flow scenarios). For these five test cases, we show that the relative errors during testing under challenging conditions of unseen initial/boundary conditions are only one-fifth to one-third of other leading physics informed operator learning methods. Moreover, our PINTO model is able to accurately solve the advection and Burgers equations at time steps that are not included in the training collocation points. The code is available at this https URL

[LG-42] Motor Imagery Classification for Asynchronous EEG-Based Brain-Computer Interfaces

Link: https://arxiv.org/abs/2412.09006
Authors: Huanyu Wu, Siyang Li, Dongrui Wu
Keywords: based brain-computer interfaces, Motor imagery, based brain-computer, brain-computer interfaces, enable the direct
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Motor imagery (MI) based brain-computer interfaces (BCIs) enable the direct control of external devices through the imagined movements of various body parts. Unlike previous systems that used fixed-length EEG trials for MI decoding, asynchronous BCIs aim to detect the user’s MI without explicit triggers. They are challenging to implement, because the algorithm needs to first distinguish between resting-states and MI trials, and then classify the MI trials into the correct task, all without any triggers. This paper proposes a sliding window prescreening and classification (SWPC) approach for MI-based asynchronous BCIs, which consists of two modules: a prescreening module to screen MI trials out of the resting-state, and a classification module for MI classification. Both modules are trained with supervised learning followed by self-supervised learning, which refines the feature extractors. Within-subject and cross-subject asynchronous MI classifications on four different EEG datasets validated the effectiveness of SWPC, i.e., it always achieved the highest average classification accuracy, and outperformed the best state-of-the-art baseline on each dataset by about 2%.

[LG-43] Deep Learning Model Security: Threats and Defenses

Link: https://arxiv.org/abs/2412.08969
Authors: Tianyang Wang, Ziqian Bi, Yichao Zhang, Ming Liu, Weiche Hsieh, Pohsun Feng, Lawrence K.Q. Yan, Yizhu Wen, Benji Peng, Junyu Liu, Keyu Chen, Sen Zhang, Ming Li, Chuanqi Jiang, Xinyuan Song, Junjie Yang, Bowen Jing, Jintao Ren, Junhao Song, Hong-Ming Tseng, Silin Chen, Yunze Wang, Chia Xin Liang, Jiawei Xu, Xuanhe Pan, Jinlang Wang, Qian Niu
Keywords: faces critical security, including adversarial attacks, data poisoning, critical security challenges, transformed AI applications
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:Deep learning has transformed AI applications but faces critical security challenges, including adversarial attacks, data poisoning, model theft, and privacy leakage. This survey examines these vulnerabilities, detailing their mechanisms and impact on model integrity and confidentiality. Practical implementations, including adversarial examples, label flipping, and backdoor attacks, are explored alongside defenses such as adversarial training, differential privacy, and federated learning, highlighting their strengths and limitations. Advanced methods like contrastive and self-supervised learning are presented for enhancing robustness. The survey concludes with future directions, emphasizing automated defenses, zero-trust architectures, and the security challenges of large AI models. A balanced approach to performance and security is essential for developing reliable deep learning systems.

[LG-44] Stochastic Learning of Non-Conjugate Variational Posterior for Image Classification

Link: https://arxiv.org/abs/2412.08951
Authors: Kart-Leong Lim
Keywords: scale Bayesian nonparametrics, Large scale Bayesian, Bayesian nonparametrics, large training size, scale Bayesian
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Large scale Bayesian nonparametrics (BNP) learners such as stochastic variational inference (SVI) can handle datasets with large class number and large training size at fractional cost. Like its predecessor, SVI relies on the assumption of conjugate variational posterior to approximate the true posterior. A more challenging problem is to consider large scale learning on non-conjugate posterior. Recent works in this direction are mostly associated with using Monte Carlo methods for approximating the learner. However, these works are usually demonstrated on non-BNP related tasks and less complex models such as logistic regression, due to higher computational complexity. In order to overcome the issue faced by SVI, we develop a novel approach based on the recently proposed variational maximization-maximization (VMM) learner to allow large scale learning on non-conjugate posterior. Unlike SVI, our VMM learner does not require closed-form expressions for the variational posterior expectations. Our only requirement is that the variational posterior is differentiable. In order to ensure convergence in stochastic settings, SVI relies on decaying step-sizes to slow its learning. Inspired by SVI and Adam, we propose the novel use of decaying step-sizes on both gradient and ascent direction in our VMM to significantly improve its learning. We show that our proposed method is compatible with ResNet features when applied to large class number datasets such as MIT67 and SUN397. Finally, we compare our proposed learner with several recent works such as deep clustering algorithms and show that it performs on par with or outperforms the state-of-the-art methods in terms of clustering measures.

[LG-45] Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew's Treatise

Link: https://arxiv.org/abs/2412.08944
Authors: Tornike Karchkhadze, Keren Shao, Shlomo Dubnov
Keywords: Cornelius Cardew Treatise, Cornelius Cardew, Cardew Treatise, inspired by Cornelius, bridge graphic notation
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:This work presents a novel method for composing and improvising music inspired by Cornelius Cardew’s Treatise, using AI to bridge graphic notation and musical expression. By leveraging OpenAI’s ChatGPT to interpret the abstract visual elements of Treatise, we convert these graphical images into descriptive textual prompts. These prompts are then input into MusicLDM, a pre-trained latent diffusion model designed for music generation. We introduce a technique called “outpainting,” which overlaps sections of AI-generated music to create a seamless and cohesive composition. We demonstrate a new perspective on performing and interpreting graphic scores, showing how AI can transform visual stimuli into sound and expand the creative possibilities in contemporary/experimental music composition. Musical pieces are available at this https URL

[LG-46] Federated Foundation Models on Heterogeneous Time Series AAAI’25

Link: https://arxiv.org/abs/2412.08906
Authors: Shengchao Chen, Guodong Long, Jing Jiang, Chengqi Zhang
Keywords: time series foundation, time series, series foundation models, general-purpose time series, cross-domain time series
Categories: Machine Learning (cs.LG)
Comments: Accepted by Main Track in AAAI’25

Abstract:Training a general-purpose time series foundation model with robust generalization capabilities across diverse applications from scratch is still an open challenge. Efforts are primarily focused on fusing cross-domain time series datasets to extract shared subsequences as tokens for training models on Transformer architecture. However, due to significant statistical heterogeneity across domains, this cross-domain fusing approach does not work as effectively as fusing texts and images. To tackle this challenge, this paper proposes a novel federated learning approach, namely FFTS, to address the heterogeneity in training time series foundation models. Specifically, each data-holding organization is treated as an independent client in a collaborative learning framework with federated settings, and many client-specific local models are then trained to preserve the unique characteristics of each dataset. Moreover, a new regularization mechanism is applied on both the client side and the server side to align the shared knowledge across heterogeneous datasets from different domains. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed federated learning approach. The newly learned time series foundation models achieve superior generalization capabilities on cross-domain time series analysis tasks, including forecasting, imputation, and anomaly detection.

[LG-47] Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

Link: https://arxiv.org/abs/2412.08890
Authors: Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos
Keywords: leverages sparse coding, introduce Lexico, universal dictionary, Lexico, Abstract
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 7 figures

Abstract:We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that key-value cache in modern LLMs can be accurately approximated using sparse linear combination from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
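
To make the sparse-approximation step mentioned in this abstract concrete, here is a minimal NumPy sketch of generic orthogonal matching pursuit. It is not the authors' implementation: the random dictionary, its size (64 atoms of dimension 16), the function name `omp`, and the random "key vector" are all assumptions made for illustration, whereas the paper refers to a learned universal dictionary of about 4k atoms.

```python
import numpy as np


def omp(dictionary, target, sparsity):
    """Greedy orthogonal matching pursuit.

    dictionary: (dim, n_atoms) array with unit-norm columns.
    target: (dim,) vector to approximate.
    Returns (support, coeffs): the chosen atom indices and their coefficients.
    """
    target = np.asarray(target, dtype=float)
    residual = target.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = -np.inf  # never reselect an atom
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected coefficients jointly by least squares.
        sub = dictionary[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, target, rcond=None)
        residual = target - sub @ coeffs
    return support, coeffs


# Hypothetical setup: a random unit-norm dictionary stands in for a learned one.
rng = np.random.default_rng(0)
dictionary = rng.normal(size=(16, 64))
dictionary /= np.linalg.norm(dictionary, axis=0)
key_vector = rng.normal(size=16)

support, coeffs = omp(dictionary, key_vector, sparsity=4)
approx = dictionary[:, support] @ coeffs
rel_error = np.linalg.norm(key_vector - approx) / np.linalg.norm(key_vector)
print(support, rel_error)
```

In this sketch the compressed form of a vector is just the `sparsity` index/coefficient pairs, so raising or lowering `sparsity` trades reconstruction error against memory, which is the kind of direct sparsity control the abstract describes.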

[LG-48] FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Link: https://arxiv.org/abs/2412.08880
Authors: Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming
Keywords: maximize cumulative rewards, reinforcement learning aims, Informed Advantage Weighted, Advantage Weighted Regression, aims to learn
Categories: Machine Learning (cs.LG)

Abstract:Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems that involves tempting datasets where trajectories are predominantly high-rewarded but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance in learning policies from the static datasets.
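
One plausible reading of "incorporating a cost-advantage term into Advantage Weighted Regression" is sketched below. This is an assumption, not the paper's formulation: standard AWR weights each logged action by exp(A / temperature), and the sketch simply subtracts a scaled cost advantage inside the exponent. The names `cost_scale`, `temperature`, and `max_weight` are hypothetical.

```python
import numpy as np


def cost_aware_awr_weights(reward_adv, cost_adv, temperature=1.0,
                           cost_scale=1.0, max_weight=20.0):
    """Exponential advantage weights with a cost penalty (illustrative only).

    Actions with a high reward advantage are up-weighted; actions with a high
    cost advantage (more expected constraint violation) are down-weighted.
    """
    reward_adv = np.asarray(reward_adv, dtype=float)
    cost_adv = np.asarray(cost_adv, dtype=float)
    score = (reward_adv - cost_scale * cost_adv) / temperature
    # Clip the weights, as is common in AWR-style methods, for stability.
    return np.minimum(np.exp(score), max_weight)


def weighted_log_likelihood_loss(log_probs, weights):
    """Actor loss: negative weighted log-likelihood of the dataset actions."""
    return -float(np.mean(weights * np.asarray(log_probs, dtype=float)))


# Two logged actions with equal reward advantage; the second is costlier.
weights = cost_aware_awr_weights(reward_adv=[1.0, 1.0], cost_adv=[0.0, 2.0])
loss = weighted_log_likelihood_loss(log_probs=[-0.5, -0.5], weights=weights)
print(weights, loss)
```

With equal reward advantages, the costlier action receives the smaller weight, so the actor imitates it less strongly.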

[LG-49] Multi-objective Combinatorial Methodology for Nuclear Reactor Site Assessment: A Case Study for the United States

Link: https://arxiv.org/abs/2412.08878
Authors: Omer Erdem, Kevin Daley, Gabrielle Hoelzle, Majdi I. Radaideh
Keywords: clean energy intensifies, carbon emission goals, net-zero carbon emission, nuclear energy stands, clean energy
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 29 pages, 8 Tables, 12 figures

Abstract:As the global demand for clean energy intensifies to achieve sustainability and net-zero carbon emission goals, nuclear energy stands out as a reliable solution. However, fully harnessing its potential requires overcoming key challenges, such as the high capital costs associated with nuclear power plants (NPPs). One promising strategy to mitigate these costs involves repurposing sites with existing infrastructure, including coal power plant (CPP) locations, which offer pre-built facilities and utilities. Additionally, brownfield sites - previously developed or underutilized lands often impacted by industrial activity - present another compelling alternative. These sites typically feature valuable infrastructure that can significantly reduce the costs of NPP development. This study introduces a novel multi-objective optimization methodology, leveraging combinatorial search to evaluate over 30,000 potential NPP sites in the United States. Our approach addresses gaps in the current practice of assigning pre-determined weights to each site attribute that could lead to bias in the ranking. Each site is assigned a performance-based score, derived from a detailed combinatorial analysis of its site attributes. The methodology generates a comprehensive database comprising site locations (inputs), attributes (outputs), site score (outputs), and the contribution of each attribute to the site score (outputs). We then use this database to train a machine learning neural network model, enabling rapid predictions of nuclear siting suitability across any location in the contiguous United States.

[LG-50] Words of War: Exploring the Presidential Rhetorical Arsenal with Deep Learning

Link: https://arxiv.org/abs/2412.08868
Authors: Wyatt Scott, Brett Genz, Sarah Elmasry, Sodiq Adewole
Keywords: national leaders words, hold profound significance, leaders words hold, words hold profound, pivotal historical moments
Categories: Machine Learning (cs.LG)

Abstract:In political discourse and geopolitical analysis, national leaders' words hold profound significance, often serving as harbingers of pivotal historical moments. From impassioned rallying cries to calls for caution, presidential speeches preceding major conflicts encapsulate the multifaceted dynamics of decision-making at the apex of governance. This project aims to use deep learning techniques to decode the subtle nuances and underlying patterns of US presidential rhetoric that may signal US involvement in major wars. While accurate classification is desirable, we seek to take a step further and identify discriminative features between the two classes (i.e. interpretable learning). Through an interdisciplinary fusion of machine learning and historical inquiry, we aspire to unearth insights into the predictive capacity of neural networks in discerning the preparatory rhetoric of US presidents preceding war. Indeed, as the venerable Prussian General and military theorist Carl von Clausewitz admonishes, War is not merely an act of policy but a true political instrument, a continuation of political intercourse carried on with other means (Clausewitz, 1832).

[LG-51] MOPI-HFRS: A Multi-objective Personalized Health-aware Food Recommendation System with LLM-enhanced Interpretation

Link: https://arxiv.org/abs/2412.08847
Authors: Zheyuan Zhang, Zehong Wang, Tianyi Ma, Varun Sameer Taneja, Sofia Nelson, Nhi Ha Lan Le, Keerthiram Murugesan, Mingxuan Ju, Nitesh V Chawla, Chuxu Zhang, Yanfang Ye
Keywords: United States, unhealthy eating habits, health-aware food recommendation, food recommendation, health-aware food
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The prevalence of unhealthy eating habits has become an increasingly concerning issue in the United States. However, major food recommendation platforms (e.g., Yelp) continue to prioritize users’ dietary preferences over the healthiness of their choices. Although efforts have been made to develop health-aware food recommendation systems, the personalization of such systems based on users’ specific health conditions remains under-explored. In addition, few studies focus on the interpretability of these systems, which hinders users from assessing the reliability of recommendations and impedes the practical deployment of these systems. In response to this gap, we first establish two large-scale personalized health-aware food recommendation benchmarks, the first of their kind. We then develop a novel framework, Multi-Objective Personalized Interpretable Health-aware Food Recommendation System (MOPI-HFRS), which provides food recommendations by jointly optimizing three objectives: user preference, personalized healthiness and nutritional diversity, along with a large language model (LLM)-enhanced reasoning module to promote healthy dietary knowledge through the interpretation of recommended results. Specifically, this holistic graph learning framework first utilizes two structure learning modules and a structure pooling module to leverage both descriptive features and health data. Then it employs Pareto optimization to achieve the designed multi-facet objectives. Finally, to further promote healthy dietary knowledge and awareness, we exploit an LLM through knowledge infusion, prompting it with knowledge obtained from the recommendation model for interpretation.

[LG-52] Grothendieck Graph Neural Networks Framework: An Algebraic Platform for Crafting Topology-Aware GNNs

Link: https://arxiv.org/abs/2412.08835
Authors: Amirreza Shiralinasab Langari, Leila Yeganeh, Kim Khoa Nguyen
Keywords: Graph Neural Networks, Neural Networks, alternative aggregation strategies, Grothendieck Graph Neural, Sieve Neural Networks
Categories: Machine Learning (cs.LG)

Abstract:Due to the structural limitations of Graph Neural Networks (GNNs), in particular with respect to conventional neighborhoods, alternative aggregation strategies have recently been investigated. This paper investigates graph structure in message passing, aimed to incorporate topological characteristics. While the simplicity of neighborhoods remains alluring, we propose a novel perspective by introducing the concept of ‘cover’ as a generalization of neighborhoods. We design the Grothendieck Graph Neural Networks (GGNN) framework, offering an algebraic platform for creating and refining diverse covers for graphs. This framework translates covers into matrix forms, such as the adjacency matrix, expanding the scope of designing GNN models based on desired message-passing strategies. Leveraging algebraic tools, GGNN facilitates the creation of models that outperform traditional approaches. Based on the GGNN framework, we propose Sieve Neural Networks (SNN), a new GNN model that leverages the notion of sieves from category theory. SNN demonstrates outstanding performance in experiments, particularly on benchmarks designed to test the expressivity of GNNs, and exemplifies the versatility of GGNN in generating novel architectures.

[LG-53] Disentangling impact of capacity, objective, batchsize, estimators, and step-size on flow VI

Link: https://arxiv.org/abs/2412.08824
Authors: Abhinav Agrawal, Justin Domke
Keywords: Normalizing flow-based variational, approximate inference approach, flow-based variational inference, promising approximate inference, Normalizing flow-based
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Normalizing flow-based variational inference (flow VI) is a promising approximate inference approach, but its performance remains inconsistent across studies. Numerous algorithmic choices influence flow VI’s performance. We conduct a step-by-step analysis to disentangle the impact of some of the key factors: capacity, objectives, gradient estimators, number of gradient estimates (batchsize), and step-sizes. Each step examines one factor while neutralizing others using insights from the previous steps and/or using extensive parallel computation. To facilitate high-fidelity evaluation, we curate a benchmark of synthetic targets that represent common posterior pathologies and allow for exact sampling. We provide specific recommendations for different factors and propose a flow VI recipe that matches or surpasses leading turnkey Hamiltonian Monte Carlo (HMC) methods.

[LG-54] HARP: A challenging human-annotated math reasoning benchmark

Link: https://arxiv.org/abs/2412.08819
Authors: Albert S. Yue, Lovish Madaan, Ted Moskovitz, DJ Strouse, Aaditya K. Singh
Keywords: large language models, Human Annotated Reasoning, scale large language, increasing area, area of focus
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 17 figures

Abstract:Math reasoning is becoming an ever increasing area of focus as we scale large language models. However, even the previously-toughest evals like MATH are now close to saturated by frontier models (90.0% for o1-mini and 86.5% for Gemini 1.5 Pro). We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy). These problems range six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written, ground-truth solutions per problem, offering new avenues of research that we explore briefly. We report evaluations for many frontier models and share some interesting analyses, such as demonstrating that frontier models across families intrinsically scale their inference-time compute for more difficult problems. Finally, we open source all code used for dataset construction (including scraping) and all code for evaluation (including answer checking) to enable future research at: this https URL.

[LG-55] Test-Time Alignment via Hypothesis Reweighting

Link: https://arxiv.org/abs/2412.08812
Authors: Yoonho Lee,Jonathan Williams,Henrik Marklund,Archit Sharma,Eric Mitchell,Anikait Singh,Chelsea Finn
Keywords: desired behavior, define the desired, Large pretrained models, Large pretrained, training
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Large pretrained models often struggle with underspecified tasks – situations where the training data does not fully define the desired behavior. For example, chatbots must handle diverse and often conflicting user preferences, requiring adaptability to various user needs. We propose a novel framework to address the general challenge of aligning models to test-time user intent, which is rarely fully specified during training. Our approach involves training an efficient ensemble, i.e., a single neural network with multiple prediction heads, each representing a different function consistent with the training data. Our main contribution is HyRe, a simple adaptation technique that dynamically reweights ensemble members at test time using a small set of labeled examples from the target distribution, which can be labeled in advance or actively queried from a larger unlabeled pool. By leveraging recent advances in scalable ensemble training, our method scales to large pretrained models, with computational costs comparable to fine-tuning a single model. We empirically validate HyRe in several underspecified scenarios, including personalization tasks and settings with distribution shifts. Additionally, with just five preference pairs from each target distribution, the same ensemble adapted via HyRe outperforms the prior state-of-the-art 2B-parameter reward model accuracy across 18 evaluation distributions.
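
As a rough illustration of test-time ensemble reweighting, the sketch below weights each prediction head by its exponentiated negative loss on a handful of labeled adaptation examples. The update rule (a softmax over negative cumulative losses), the function names, and the toy data are assumptions made for illustration; the abstract does not spell out HyRe's exact update.

```python
import math

def reweight_heads(head_losses):
    """Return normalized head weights from per-head cumulative losses.

    head_losses[k] is the total loss of head k on the adaptation set.
    Assumed rule (hypothetical): w_k is proportional to exp(-loss_k).
    """
    m = min(head_losses)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(loss - m)) for loss in head_losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def ensemble_predict(head_preds, weights):
    """Weighted average of scalar predictions, one per head."""
    return sum(w * p for w, p in zip(weights, head_preds))

# Toy data: three heads scored with squared error on two labeled examples.
adapt_preds = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # adapt_preds[k][i]
adapt_labels = [1.0, 0.0]
losses = [
    sum((p - y) ** 2 for p, y in zip(preds, adapt_labels))
    for preds in adapt_preds
]
weights = reweight_heads(losses)
prediction = ensemble_predict([0.8, 0.5, 0.2], weights)
```

Heads that agree with the adaptation labels receive most of the weight, so the combined prediction moves toward them without any gradient update to the network.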

[LG-56] Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Link: https://arxiv.org/abs/2412.08794
Authors: Prajwal Koirala,Zhanhong Jiang,Soumik Sarkar,Cody Fleming
Keywords: safe offline reinforcement, offline reinforcement learning, Conditional Variational Autoencoders, offline data, offline reinforcement
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints, utilizing only offline data. Traditional methods often face difficulties in balancing these constraints, leading to either diminished performance or increased safety risks. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders, which model the latent safety constraints. Subsequently, we frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. Our methodology is supported by theoretical analysis, including bounds on policy performance and sample complexity. Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, demonstrates that our approach not only maintains safety compliance but also excels in cumulative reward optimization, surpassing existing methods. Additional visualizations provide further insights into the effectiveness and underlying mechanisms of our approach.

[LG-57] Reducing Popularity Influence by Addressing Position Bias

Link: https://arxiv.org/abs/2412.08780
Authors: Andrii Dzhoha,Alexey Kurennoy,Vladimir Vlasov,Marjan Celikik
Keywords: existing research focusing, Position bias, refining ranking relevance, Position bias poses, Position
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Position bias poses a persistent challenge in recommender systems, with much of the existing research focusing on refining ranking relevance and driving user engagement. However, in practical applications, the mitigation of position bias does not always result in detectable short-term improvements in ranking relevance. This paper provides an alternative, practically useful view of what position bias reduction methods can achieve. It demonstrates that position debiasing can spread visibility and interactions more evenly across the assortment, effectively reducing a skew in the popularity of items induced by the position bias through a feedback loop. We offer an explanation of how position bias affects item popularity. This includes an illustrative model of the item popularity histogram and the effect of the position bias on its skewness. Through offline and online experiments on our large-scale e-commerce platform, we show that position debiasing can significantly improve assortment utilization, without any degradation in user engagement or financial metrics. This makes the ranking fairer and helps attract more partners or content providers, benefiting the customers and the business in the long term.

[LG-58] Bayesian optimized deep ensemble for uncertainty quantification of deep neural networks: a system safety case study on sodium fast reactor thermal stratification modeling

Link: https://arxiv.org/abs/2412.08776
Authors: Zaid Abulawi,Rui Hu,Prasanna Balaprakash,Yang Liu
Keywords: Deep Neural Networks, Convolutional Neural Network, system safety modeling, essential for decision-making, decision-making in risk-sensitive
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Accurate predictions and uncertainty quantification (UQ) are essential for decision-making in risk-sensitive fields such as system safety modeling. Deep ensembles (DEs) are efficient and scalable methods for UQ in Deep Neural Networks (DNNs); however, their performance is limited when constructed by simply retraining the same DNN multiple times with randomly sampled initializations. To overcome this limitation, we propose a novel method that combines Bayesian optimization (BO) with DE, referred to as BODE, to enhance both predictive accuracy and UQ. We apply BODE to a case study involving a Densely connected Convolutional Neural Network (DCNN) trained on computational fluid dynamics (CFD) data to predict eddy viscosity in sodium fast reactor thermal stratification modeling. Compared to a manually tuned baseline ensemble, BODE estimates total uncertainty approximately four times lower in a noise-free environment, primarily due to the baseline’s overestimation of aleatoric uncertainty. Specifically, BODE estimates aleatoric uncertainty close to zero, while aleatoric uncertainty dominates the total uncertainty in the baseline ensemble. We also observe a reduction of more than 30% in epistemic uncertainty. When Gaussian noise with standard deviations of 5% and 10% is introduced into the data, BODE accurately fits the data and estimates uncertainty that aligns with the data noise. These results demonstrate that BODE effectively reduces uncertainty and enhances predictions in data-driven models, making it a flexible approach for various applications requiring accurate predictions and robust UQ.
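
The split between aleatoric and epistemic uncertainty mentioned above can be sketched with the decomposition commonly used for deep ensembles, where each member outputs a mean and a variance. Whether BODE uses exactly this decomposition is an assumption; the function name and the numbers are illustrative only.

```python
def decompose_uncertainty(means, variances):
    """Split ensemble predictive variance into two parts.

    means[k], variances[k]: mean and variance predicted by member k
    for a single input. Aleatoric = average predicted variance;
    epistemic = variance of the member means (disagreement).
    """
    k = len(means)
    mean_pred = sum(means) / k
    aleatoric = sum(variances) / k
    epistemic = sum((m - mean_pred) ** 2 for m in means) / k
    return mean_pred, aleatoric, epistemic, aleatoric + epistemic

# Toy ensemble of three members (made-up values).
mean_pred, aleatoric, epistemic, total = decompose_uncertainty(
    [1.0, 1.2, 0.8], [0.01, 0.02, 0.03]
)
```

In this picture, a baseline that overestimates aleatoric uncertainty is one whose members predict large variances even on noise-free data, which inflates the total regardless of how well the members agree.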

[LG-59] DeepNose: An Equivariant Convolutional Neural Network Predictive Of Human Olfactory Percepts

Link: https://arxiv.org/abs/2412.08747
Authors: Sergey Shuvaev,Khue Tran,Khristina Samoilova,Cyrille Mascart,Alexei Koulakov
Keywords: olfactory system employs, system employs responses, olfactory percepts, odorant receptors, human olfactory percepts
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: 8 pages, 8 figures, to appear in the proceedings of the Asilomar Conference on Signals, Systems, and Computers (ACSSC 2024)

Abstract:The olfactory system employs responses of an ensemble of odorant receptors (ORs) to sense molecules and to generate olfactory percepts. Here we hypothesized that ORs can be viewed as 3D spatial filters that extract molecular features relevant to the olfactory system, similarly to the spatio-temporal filters found in other sensory modalities. To build these filters, we trained a convolutional neural network (CNN) to predict human olfactory percepts obtained from several semantic datasets. Our neural network, the DeepNose, produced responses that are approximately invariant to the molecules’ orientation, due to its equivariant architecture. Our network offers high-fidelity perceptual predictions for different olfactory datasets. In addition, our approach allows us to identify molecular features that contribute to specific perceptual descriptors. Because the DeepNose network is designed to be aligned with the biological system, our approach predicts distinct perceptual qualities for different stereoisomers. The architecture of the DeepNose relying on the processing of several molecules at the same time permits inferring the perceptual quality of odor mixtures. We propose that the DeepNose network can use 3D molecular shapes to generate high-quality predictions for human olfactory percepts and help identify molecular features responsible for odor quality.

[LG-60] How to Count Coughs: An Event-Based Framework for Evaluating Automatic Cough Detection Algorithm Performance

Link: https://arxiv.org/abs/2406.01529
Authors: Lara Orlandic,Jonathan Dan,Jerome Thevenot,Tomas Teijeiro,Alain Sauty,David Atienza
Keywords: Chronic cough disorders, subjective patient questionnaires, running Machine Learning, Chronic cough, Machine Learning
Categories: Machine Learning (cs.LG); Sound (cs.SD)

Abstract:Chronic cough disorders are widespread and challenging to assess because they rely on subjective patient questionnaires about cough frequency. Wearable devices running Machine Learning (ML) algorithms are promising for quantifying daily coughs, providing clinicians with objective metrics to track symptoms and evaluate treatments. However, there is a mismatch between state-of-the-art metrics for cough counting algorithms and the information relevant to clinicians. Most works focus on distinguishing cough from non-cough samples, which does not directly provide clinically relevant outcomes such as the number of cough events or their temporal patterns. In addition, typical metrics such as specificity and accuracy can be biased by class imbalance. We propose using event-based evaluation metrics aligned with clinical guidelines on significant cough counting endpoints. We use an ML classifier to illustrate the shortcomings of traditional sample-based accuracy measurements, highlighting their variance due to dataset class imbalance and sample window length. We also present an open-source event-based evaluation framework to test algorithm performance in identifying cough events and rejecting false positives. We provide examples and best practice guidelines in event-based cough counting as a necessary first step to assess algorithm performance with clinical relevance.
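
To make the contrast with sample-based accuracy concrete, here is a minimal sketch of event-based scoring: predicted and annotated coughs are time intervals, and a prediction counts as a true positive if it overlaps a not-yet-matched annotated event. The any-overlap matching rule, the function names, and the toy intervals are assumptions for illustration and are not taken from the authors' open-source framework.

```python
def overlaps(a, b):
    """True if intervals a=(start, end) and b=(start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]

def event_scores(true_events, pred_events):
    """Greedy one-to-one matching of predicted events to annotated events."""
    matched = set()  # indices of annotated events already claimed
    tp = 0
    for p in pred_events:
        for i, t in enumerate(true_events):
            if i not in matched and overlaps(p, t):
                matched.add(i)
                tp += 1
                break
    fp = len(pred_events) - tp  # predictions with no free matching event
    fn = len(true_events) - tp  # annotated events never detected
    sensitivity = tp / (tp + fn) if true_events else 0.0
    precision = tp / (tp + fp) if pred_events else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}

# Toy recording: three annotated coughs, three detections (seconds).
truth = [(1.0, 1.4), (5.0, 5.3), (9.0, 9.5)]
preds = [(1.1, 1.5), (3.0, 3.2), (9.2, 9.4)]
scores = event_scores(truth, preds)
```

Because the counts are per event, long stretches of silence do not inflate the score the way true negatives do in sample-based specificity or accuracy.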

[LG-61] Wait-Less Offline Tuning and Re-solving for Online Decision Making

Link: https://arxiv.org/abs/2412.09594
Authors: Jingruo Sun,Wenzhi Gao,Ellen Vitercik,Yinyu Ye
Keywords: found broad applications, linear programming, Online linear programming, solving linear programming, OLP
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Online linear programming (OLP) has found broad applications in revenue management and resource allocation. State-of-the-art OLP algorithms achieve low regret by repeatedly solving linear programming (LP) subproblems that incorporate updated resource information. However, LP-based methods are computationally expensive and often inefficient for large-scale applications. In contrast, recent first-order OLP algorithms are more computationally efficient but typically suffer from worse regret guarantees. To address these shortcomings, we propose a new algorithm that combines the strengths of LP-based and first-order OLP methods. The algorithm re-solves the LP subproblems periodically at a predefined frequency $f$ and uses the latest dual prices to guide online decision-making. In addition, a first-order method runs in parallel during each interval between LP re-solves, smoothing resource consumption. Our algorithm achieves $\mathscr{O}(\log(T/f) + \sqrt{f})$ regret, delivering a “wait-less” online decision-making process that balances the computational efficiency of first-order methods and the superior regret guarantee of LP-based methods.

[LG-62] Experimental Machine Learning with Classical and Quantum Data via NMR Quantum Kernels

Link: https://arxiv.org/abs/2412.09557
Authors: Vivek Sabarad,T. S. Mahesh
Keywords: enabling linear algorithms, learn nonlinear functions, Kernel methods map, enabling linear, large Hilbert spaces
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 8 pages, 5 figures

Abstract:Kernel methods map data into high-dimensional spaces, enabling linear algorithms to learn nonlinear functions without explicitly storing the feature vectors. Quantum kernel methods promise efficient learning by encoding feature maps into exponentially large Hilbert spaces inherent in quantum systems. In this work we implement quantum kernels on a 10-qubit star-topology register in a nuclear magnetic resonance (NMR) platform. We experimentally encode classical data in the evolution of multiple quantum coherence orders using data-dependent unitary transformations and then demonstrate one-dimensional regression and two-dimensional classification tasks. By extending the register to a double-layered star configuration, we propose an extended quantum kernel to handle non-parametrized operator inputs. By numerically simulating the extended quantum kernel, we show classification of entangling and nonentangling unitaries. These results confirm that quantum kernels exhibit strong capabilities in classical as well as quantum machine learning tasks.

[LG-63] Enhancing Convergence of Decentralized Gradient Tracking under the KL Property

Link: https://arxiv.org/abs/2412.09556
Authors: Xiaokai Chen,Tianyu Cao,Gesualdo Scutari
Keywords: study decentralized multiagent, decentralized multiagent optimization, modeled as undirected, undirected graphs, theta
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments: 25 pages, 4 figures

Abstract:We study decentralized multiagent optimization over networks, modeled as undirected graphs. The optimization problem consists of minimizing a nonconvex smooth function plus a convex extended-value function, which enforces constraints or extra structure on the solution (e.g., sparsity, low-rank). We further assume that the objective function satisfies the Kurdyka-Łojasiewicz (KL) property, with given exponent $\theta \in [0,1)$. The KL property is satisfied by several (nonconvex) functions of practical interest, e.g., arising from machine learning applications; in the centralized setting, it permits strong convergence guarantees. Here we establish convergence of the same type for the notorious decentralized gradient-tracking-based algorithm SONATA. Specifically, (i) when $\theta \in (0,1/2]$, the sequence generated by SONATA converges to a stationary solution of the problem at an R-linear rate; (ii) when $\theta \in (1/2,1)$, a sublinear rate is certified; and finally (iii) when $\theta = 0$, the iterates either converge in a finite number of steps or converge at an R-linear rate. This matches the convergence behavior of centralized proximal-gradient algorithms except when $\theta = 0$. Numerical results validate our theoretical findings.

[LG-64] Loss function to optimise signal significance in particle physics NEURIPS2024

Link: https://arxiv.org/abs/2412.09500
Authors: Jai Bardhan,Cyrin Neeraj,Subhadip Mitra,Tanumoy Mandal
Keywords: construct a surrogate, directly optimise, particle physics, significance metric, surrogate loss
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 9 pages, 4 figures. Appeared in the Machine Learning for Physical Sciences (ML4PS) workshop at the NeurIPS 2024 conference

Abstract:We construct a surrogate loss to directly optimise the significance metric used in particle physics. We evaluate our loss function for a simple event classification task using a linear model and show that it produces decision boundaries that change according to the cross sections of the processes involved. We find that the models trained with the new loss have higher signal efficiency for similar values of estimated signal significance compared to ones trained with a cross-entropy loss, showing promise to improve sensitivity of particle physics searches at colliders.

[LG-65] Data Efficient Prediction of excited-state properties using Quantum Neural Networks

Link: https://arxiv.org/abs/2412.09423
Authors: Manuel Hagelüken,Marco F. Huber,Marco Roth
Keywords: physical processes, chemical and physical, Understanding, Understanding the properties, ground state counterparts
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 10 + 4 pages, 7 + 3 figures

Abstract:Understanding the properties of excited states of complex molecules is crucial for many chemical and physical processes. Calculating these properties is often significantly more resource-intensive than calculating their ground state counterparts. We present a quantum machine learning model that predicts excited-state properties from the molecular ground state for different geometric configurations. The model comprises a symmetry-invariant quantum neural network and a conventional neural network and is able to provide accurate predictions with only a few training data points. The proposed procedure is fully NISQ compatible. This is achieved by using a quantum circuit that requires a number of parameters linearly proportional to the number of molecular orbitals, along with a parameterized measurement observable, thereby reducing the number of necessary measurements. We benchmark the algorithm on three different molecules by evaluating its performance in predicting excited state transition energies and transition dipole moments. We show that, in many instances, the procedure is able to outperform various classical models that rely solely on classical features.

[LG-66] Distribution free uncertainty quantification in neuroscience-inspired deep operators

Link: https://arxiv.org/abs/2412.09369
Authors: Shailesh Garg,Souvik Chakraborty
Keywords: edge computing setups, feasible edge computing, Energy-efficient deep learning, Energy-efficient deep, deep learning algorithms
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Energy-efficient deep learning algorithms are essential for a sustainable future and feasible edge computing setups. Spiking neural networks (SNNs), inspired by neuroscience, are a positive step toward achieving the required energy efficiency. However, in a bid to lower the energy requirements, accuracy is marginally sacrificed. Hence, predictions of such deep learning algorithms require an uncertainty measure that can inform users regarding the bounds of a certain output. In this paper, we introduce the Conformalized Randomized Prior Operator (CRP-O) framework that leverages Randomized Prior (RP) networks and Split Conformal Prediction (SCP) to quantify uncertainty in both conventional and spiking neural operators. To further enable zero-shot super-resolution in UQ, we propose an extension incorporating Gaussian Process Regression. This enhanced super-resolution-enabled CRP-O framework is integrated with the recently developed Variable Spiking Wavelet Neural Operator (VSWNO). To test the performance of the obtained calibrated uncertainty bounds, we discuss four different examples covering both one-dimensional and two-dimensional partial differential equations. Results demonstrate that the uncertainty bounds produced by the conformalized RP-VSWNO significantly enhance UQ estimates compared to vanilla RP-VSWNO, Quantile WNO (Q-WNO), and Conformalized Quantile WNO (CQ-WNO). These findings underscore the potential of the proposed approach for practical applications.
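
For readers unfamiliar with Split Conformal Prediction, the following is a minimal sketch of the generic scalar-regression version: absolute residuals on a held-out calibration set give a quantile that is added to and subtracted from each new point prediction. This is the textbook procedure, not the CRP-O framework itself; the function names and residual values are made up for illustration.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of absolute residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]

def predict_interval(point_prediction, q):
    """Symmetric interval around a point prediction."""
    return point_prediction - q, point_prediction + q

# Absolute errors |y - prediction| on ten calibration points (made up).
calib_residuals = [0.12, 0.05, 0.30, 0.22, 0.41, 0.09, 0.18, 0.27, 0.35, 0.50]
q = conformal_quantile(calib_residuals, alpha=0.2)
lower, upper = predict_interval(2.0, q)
```

Under exchangeability of calibration and test points, intervals built this way cover the true value with probability at least 1 - alpha, without any distributional assumption on the underlying model.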

[LG-67] Dimensionality Reduction Techniques for Global Bayesian Optimisation NEURIPS2024

Link: https://arxiv.org/abs/2412.09183
Authors: Luo Long,Coralia Cartis,Paz Fink Shustin
Keywords: Bayesian Optimisation, Space Bayesian Optimisation, information is unavailable, technique for black-box, black-box problems
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Accepted at NeurIPS 2024 Workshop OPT for ML: Optimization for Machine Learning (Submission Number:67)

Abstract:Bayesian Optimisation (BO) is a state-of-the-art global optimisation technique for black-box problems where derivative information is unavailable, and sample efficiency is crucial. However, improving the general scalability of BO has proved challenging. Here, we explore Latent Space Bayesian Optimisation (LSBO), which applies dimensionality reduction to perform BO in a reduced-dimensional subspace. While early LSBO methods used (linear) random projections (Wang et al., 2013), we employ Variational Autoencoders (VAEs) to manage more complex data structures and general DR tasks. Building on Grosnit et al. (2021), we analyse the VAE-based LSBO framework, focusing on VAE retraining and deep metric loss. We suggest a few key corrections in their implementation, originally designed for tasks such as molecule generation, and reformulate the algorithm for broader optimisation purposes. Our numerical results show that structured latent manifolds improve BO performance. Additionally, we examine the use of the Matérn-$\frac{5}{2}$ kernel for Gaussian Processes in this LSBO context. We also integrate Sequential Domain Reduction (SDR), a standard global optimization efficiency strategy, into BO. SDR is included in a GPU-based environment using BoTorch, both in the original and VAE-generated latent spaces, marking the first application of SDR within LSBO.
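
For reference, the Matérn-5/2 kernel mentioned above has a simple closed form, sketched here in plain Python. The function name and the `lengthscale` and `variance` parameters are illustrative choices; this is a standalone sketch and not the BoTorch implementation used by the authors.

```python
import math

def matern52(x, y, lengthscale=1.0, variance=1.0):
    """Matern-5/2 covariance between two points given as sequences of floats.

    k(r) = variance * (1 + s + s^2 / 3) * exp(-s), with s = sqrt(5) * r / lengthscale
    and r the Euclidean distance between x and y.
    """
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    s = math.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)

k_same = matern52([0.0, 0.0], [0.0, 0.0])  # distance 0
k_near = matern52([0.0], [1.0])            # distance 1
k_far = matern52([0.0], [3.0])             # distance 3
```

Gaussian process sample paths under this kernel are twice differentiable, which makes it a common and less smooth alternative to the squared-exponential kernel in Bayesian optimisation.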

[LG-68] $(\epsilon, \delta)$-Differentially Private Partial Least Squares Regression

Link: https://arxiv.org/abs/2412.09164
Authors: Ramin Nikzad-Langerodi,Mohit Kumar,Du Nguyen Duy,Mahtab Alghasi
Keywords: statistical models based, protecting data-privacy, data-privacy requirements, PLS algorithm, increasingly stringent
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Abstract:As data-privacy requirements are becoming increasingly stringent and statistical models based on sensitive data are being deployed and used more routinely, protecting data-privacy becomes pivotal. Partial Least Squares (PLS) regression is the premier tool for building such models in analytical chemistry, yet it does not inherently provide privacy guarantees, leaving sensitive (training) data vulnerable to privacy attacks. To address this gap, we propose an $(\epsilon, \delta)$-differentially private PLS (edPLS) algorithm, which integrates well-studied and theoretically motivated Gaussian noise-adding mechanisms into the PLS algorithm to ensure the privacy of the data underlying the model. Our approach involves adding carefully calibrated Gaussian noise to the outputs of four key functions in the PLS algorithm: the weights, scores, $X$-loadings, and $Y$-loadings. The noise variance is determined based on the global sensitivity of each function, ensuring that the privacy loss is controlled according to the $(\epsilon, \delta)$-differential privacy framework. Specifically, we derive the sensitivity bounds for each function and use these bounds to calibrate the noise added to the model components. Experimental results demonstrate that edPLS effectively renders privacy attacks, aimed at recovering unique sources of variability in the training data, ineffective. Application of edPLS to the NIR corn benchmark dataset shows that the root mean squared error of prediction (RMSEP) remains competitive even at strong privacy levels (i.e., $\epsilon = 1$), given proper pre-processing of the corresponding spectra. These findings highlight the practical utility of edPLS in creating privacy-preserving multivariate calibrations and for the analysis of their privacy-utility trade-offs.
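
The noise-calibration step can be illustrated with the classical Gaussian mechanism, in which the noise standard deviation scales with the L2 sensitivity of the released quantity. The sketch assumes the standard calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon, which is usually stated for epsilon below 1. The sensitivity value, function names, and toy vector are hypothetical; the paper's own sensitivity bounds for the PLS weights, scores, and loadings are not reproduced here.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (assumes 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def privatize(vector, sensitivity, epsilon, delta, rng):
    """Add independent Gaussian noise to every component of a vector."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in vector]

rng = random.Random(0)        # fixed seed so the sketch is reproducible
weights = [0.6, -0.3, 0.74]   # stand-in for one PLS weight vector
sigma = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)
noisy_weights = privatize(weights, 1.0, 0.5, 1e-5, rng)
```

Smaller epsilon or delta means a larger sigma, which is the privacy-utility trade-off the abstract refers to: stronger guarantees cost more noise in the model components.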

[LG-69] Stellar parameter prediction and spectral simulation using machine learning

Link: https://arxiv.org/abs/2412.09002
Authors: Vojtěch Cvrček,Martino Romaniello,Radim Šára,Wolfram Freudling,Pascal Ballester
Keywords: Velocity Planet Searcher, Radial Velocity Planet, ESO High Accuracy, High Accuracy Radial, Accuracy Radial Velocity
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: Accepted for publication in Astronomy & Astrophysics

Abstract:We applied machine learning to the entire data history of ESO’s High Accuracy Radial Velocity Planet Searcher (HARPS) instrument. Our primary goal was to recover the physical properties of the observed objects, with a secondary emphasis on simulating spectra. We systematically investigated the impact of various factors on the accuracy and fidelity of the results, including the use of simulated data, the effect of varying amounts of real training data, network architectures, and learning paradigms. Our approach integrates supervised and unsupervised learning techniques within autoencoder frameworks. Our methodology leverages an existing simulation model that utilizes a library of existing stellar spectra in which the emerging flux is computed from first principles rooted in physics and a HARPS instrument model to generate simulated spectra comparable to observational data. We trained standard and variational autoencoders on HARPS data to predict spectral parameters and generate spectra. Our models excel at predicting spectral parameters and compressing real spectra, and they achieved a mean prediction error of approximately 50 K for effective temperatures, making them relevant for most astrophysical applications. Furthermore, the models predict metallicity ([M/H]) and surface gravity (log g) with an accuracy of approximately 0.03 dex and 0.04 dex, respectively, underscoring their broad applicability in astrophysical research. The models’ computational efficiency, with processing times of 779.6 ms on CPU and 3.97 ms on GPU, makes them valuable for high-throughput applications like massive spectroscopic surveys and large archival studies. By achieving accuracy comparable to classical methods with significantly reduced computation time, our methodology enhances the scope and efficiency of spectroscopic analysis.

[LG-70] Predicting Emergency Department Visits for Patients with Type II Diabetes

Link: https://arxiv.org/abs/2412.08984
Authors: Javad M Alizadeh,Jay S Patel,Gabriel Tajeu,Yuzhou Chen,Ilene L Hollin,Mukesh K Patel,Junchao Fei,Huanmei Wu
Keywords: million Americans, Type II diabetes, Americans are affected, affected by Type, significant health risks
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: This manuscript has been accepted and presented at AI-PHSS 2024: The 2024 International Workshop on AI Applications in Public Health and Social Services in conjunction with the 22nd International Conference of Artificial Intelligence in Medicine (AIME 2024)

Abstract:Over 30 million Americans are affected by Type II diabetes (T2D), a treatable condition with significant health risks. This study aims to develop and validate predictive models using machine learning (ML) techniques to estimate emergency department (ED) visits among patients with T2D. Data for these patients was obtained from the HealthShare Exchange (HSX), focusing on demographic details, diagnoses, and vital signs. Our sample contained 34,151 patients diagnosed with T2D, who accounted for 703,065 visits overall between 2017 and 2021. A workflow integrated EMR data with SDoH for ML predictions. A total of 87 out of 2,555 features were selected for model construction. Various machine learning algorithms, including CatBoost, Ensemble Learning, K-nearest Neighbors (KNN), Support Vector Classification (SVC), Random Forest, and Extreme Gradient Boosting (XGBoost), were employed with tenfold cross-validation to predict whether a patient is at risk of an ED visit. The areas under the ROC curve for Random Forest, XGBoost, Ensemble Learning, CatBoost, KNN, and SVC were 0.82, 0.82, 0.82, 0.81, 0.72, and 0.68, respectively. Ensemble Learning and Random Forest models demonstrated superior predictive performance in terms of discrimination, calibration, and clinical applicability. These models are reliable tools for predicting risk of ED visits among patients with T2D. They can estimate future ED demand and assist clinicians in identifying critical factors associated with ED utilization, enabling early interventions to reduce such visits. The top five important features were age, the difference between visitation gaps, visitation gaps, R10 (abdominal and pelvic pain), and the Index of Concentration at the Extremes (ICE) for income.

[LG-71] Belted and Ensembled Neural Network for Linear and Nonlinear Sufficient Dimension Reduction

Link: https://arxiv.org/abs/2412.08961
Authors: Yin Tang,Bing Li
Keywords: sufficient dimension reduction, Ensembled Neural Network, neural network, dimension reduction, sufficient dimension
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 35 pages, 5 figures, 2 tables

Abstract:We introduce a unified, flexible, and easy-to-implement framework of sufficient dimension reduction that can accommodate both linear and nonlinear dimension reduction, and both the conditional distribution and the conditional mean as the targets of estimation. This unified framework is achieved by a specially structured neural network – the Belted and Ensembled Neural Network (BENN) – that consists of a narrow latent layer, which we call the belt, and a family of transformations of the response, which we call the ensemble. By strategically placing the belt at different layers of the neural network, we can achieve linear or nonlinear sufficient dimension reduction, and by choosing the appropriate transformation families, we can achieve dimension reduction for the conditional distribution or the conditional mean. Moreover, thanks to the advantage of the neural network, the method is very fast to compute, overcoming a computation bottleneck of the traditional sufficient dimension reduction estimators, which involves the inversion of a matrix of dimension either p or n. We develop the algorithm and convergence rate of our method, compare it with existing sufficient dimension reduction methods, and apply it to two data examples.

[LG-72] Beyond Reweighting: On the Predictive Role of Covariate Shift in Effect Generalization

Link: https://arxiv.org/abs/2412.08869
Authors: Ying Jin,Naoki Egami,Dominik Rothenhäusler
Keywords: covariate shift, shift, generalizing statistical inference, statistical inference amidst, covariate shift assumption
Categories: Applications (stat.AP); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Many existing approaches to generalizing statistical inference amidst distribution shift operate under the covariate shift assumption, which posits that the conditional distribution of unobserved variables given observable ones is invariant across populations. However, recent empirical investigations have demonstrated that adjusting for shift in observed variables (covariate shift) is often insufficient for generalization. In other words, covariate shift does not typically “explain away” the distribution shift between settings. As such, addressing the unknown yet non-negligible shift in the unobserved variables given observed ones (conditional shift) is crucial for generalizable inference. In this paper, we present a series of empirical evidence from two large-scale multi-site replication studies to support a new role of covariate shift in “predicting” the strength of the unknown conditional shift. Analyzing 680 studies across 65 sites, we find that even though the conditional shift is non-negligible, its strength can often be bounded by that of the observable covariate shift. However, this pattern only emerges when the two sources of shifts are quantified by our proposed standardized, “pivotal” measures. We then interpret this phenomenon by connecting it to similar patterns that can be theoretically derived from a random distribution shift model. Finally, we demonstrate that exploiting the predictive role of covariate shift leads to reliable and efficient uncertainty quantification for target estimates in generalization tasks with partially observed data. Overall, our empirical and theoretical analyses suggest a new way to approach the problem of distributional shift, generalizability, and external validity.
Cite as: arXiv:2412.08869 [stat.AP] (or arXiv:2412.08869v1 [stat.AP] for this version), https://doi.org/10.48550/arXiv.2412.08869

[LG-73] Emulating the Global Change Analysis Model with Deep Learning NEURIPS2024

Link: https://arxiv.org/abs/2412.08850
Authors: Andrew Holmes, Matt Jensen, Sarah Coffland, Hidemi Mitani Shen, Logan Sizemore, Seth Bassetti, Brenna Nieva, Claudia Tebaldi, Abigail Snyder, Brian Hutchinson
Keywords (EN): Global Change Analysis, Change Analysis Model, Global Change, providing valuable insights, simulates complex interactions
Categories: General Economics (econ.GN); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Presented at Tackling Climate Change with Machine Learning, NeurIPS 2024

Abstract:The Global Change Analysis Model (GCAM) simulates complex interactions between the coupled Earth and human systems, providing valuable insights into the co-evolution of land, water, and energy sectors under different future scenarios. Understanding the sensitivities and drivers of this multisectoral system can lead to more robust understanding of the different pathways to particular outcomes. The interactions and complexity of the coupled human-Earth systems make GCAM simulations costly to run at scale - a requirement for large ensemble experiments which explore uncertainty in model parameters and outputs. A differentiable emulator with similar predictive power, but greater efficiency, could provide novel scenario discovery and analysis of GCAM and its outputs, requiring fewer runs of GCAM. As a first use case, we train a neural network on an existing large ensemble that explores a range of GCAM inputs related to different relative contributions of energy production sources, with a focus on wind and solar. We complement this existing ensemble with interpolated input values and a wider selection of outputs, predicting 22,528 GCAM outputs across time, sectors, and regions. We report a median R^2 score of 0.998 for the emulator’s predictions and an R^2 score of 0.812 for its input-output sensitivity.
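
The reported metric can be illustrated with a small sketch of how a median R^2 over many emulated outputs might be computed. The abstract does not say how the scores are aggregated, so computing one R^2 per output series and then taking the median is an assumption, and the names `r2_score` and `median_r2` are hypothetical.

```python
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def median_r2(true_by_output, pred_by_output):
    """Median of per-output R^2 scores (aggregation rule is an assumption)."""
    scores = [r2_score(t, p) for t, p in zip(true_by_output, pred_by_output)]
    return statistics.median(scores)
```

A median is less sensitive than a mean to a few poorly emulated outputs, which may be why it is the summary reported here.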

[LG-74] On the Precise Asymptotics and Refined Regret of the Variance-Aware UCB Algorithm

Link: https://arxiv.org/abs/2412.08843
Authors: Yuxuan Han, Xiaocong Xu
Keywords (EN): Upper Confidence Bound-Variance, canonical Upper Confidence, Upper Confidence Bound, Upper Confidence, incorporates variance estimates
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:In this paper, we study the behavior of the Upper Confidence Bound-Variance (UCB-V) algorithm for Multi-Armed Bandit (MAB) problems, a variant of the canonical Upper Confidence Bound (UCB) algorithm that incorporates variance estimates into its decision-making process. More precisely, we provide an asymptotic characterization of the arm-pulling rates of UCB-V, extending recent results for the canonical UCB in Kalvit and Zeevi (2021) and Khamaru and Zhang (2024). In an interesting contrast to the canonical UCB, we show that the behavior of UCB-V can exhibit instability, meaning that the arm-pulling rates may not always be asymptotically deterministic. Besides the asymptotic characterization, we also provide non-asymptotic bounds for arm-pulling rates in the high probability regime, offering insights into regret analysis. As an application of this high probability result, we show that UCB-V can achieve a refined regret bound, previously unknown even for more complicated and advanced variance-aware online decision-making algorithms.
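
To make the variance-aware index concrete, here is a minimal sketch of UCB-V on Bernoulli arms. It uses the index form introduced by Audibert, Munos and Szepesvári (empirical mean plus a variance-scaled bonus plus a range-scaled bonus). The constants `b`, `c` and `zeta`, the Bernoulli rewards, and the function names are assumptions for illustration and may differ from the exact variant analyzed in the paper.

```python
import math
import random


def ucbv_index(mean, var, pulls, t, b=1.0, c=1.0, zeta=1.2):
    """UCB-V index: mean + sqrt(2 * var * E / n) + 3 * c * b * E / n, E = zeta * log(t).

    b is an assumed upper bound on rewards; c and zeta are assumed tuning constants.
    """
    exploration = zeta * math.log(t)
    return (mean
            + math.sqrt(2.0 * var * exploration / pulls)
            + 3.0 * c * b * exploration / pulls)


def run_ucbv(arm_means, horizon, seed=0):
    """Simulate UCB-V on Bernoulli arms and return the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            indices = []
            for i in range(k):
                mean = sums[i] / counts[i]
                var = max(sq_sums[i] / counts[i] - mean * mean, 0.0)
                indices.append(ucbv_index(mean, var, counts[i], t))
            arm = max(range(k), key=indices.__getitem__)
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward * reward
    return counts
```

Calling `run_ucbv([0.2, 0.8], horizon=2000)` returns the pull counts whose asymptotic behavior (the arm-pulling rates) is the object the abstract studies.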

[LG-75] Efficient Gravitational Wave Parameter Estimation via Knowledge Distillation: A ResNet1D-IAF Approach

Link: https://arxiv.org/abs/2412.08672
Authors: Xihua Zhu, Yiqian Yang, Fan Zhang
Keywords (EN): necessitates efficient methods, gravitational wave astronomy, Inverse Autoregressive Flow, detected events necessitates, events necessitates efficient
Categories: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 7 pages, 4 figures, 2 tables

Abstract:With the rapid development of gravitational wave astronomy, the increasing number of detected events necessitates efficient methods for parameter estimation and model updates. This study presents a novel approach using knowledge distillation techniques to enhance computational efficiency in gravitational wave analysis. We develop a framework combining ResNet1D and Inverse Autoregressive Flow (IAF) architectures, where knowledge from a complex teacher model is transferred to a lighter student model. Our experimental results show that the student model achieves a validation loss of 3.70 with optimal configuration (40,100,0.75), compared to the teacher model’s 4.09, while reducing the number of parameters by 43%. The Jensen-Shannon divergence between teacher and student models remains below 0.0001 across network layers, indicating successful knowledge transfer. By optimizing ResNet layers (7-16) and hidden features (70-120), we achieve a 35% reduction in inference time while maintaining parameter estimation accuracy. This work demonstrates significant improvements in computational efficiency for gravitational wave data analysis, providing valuable insights for real-time event processing.
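
The Jensen-Shannon divergence used above to compare teacher and student can be sketched for discrete distributions as follows. The abstract does not describe how per-layer distributions are formed, so the plain probability vectors used as inputs here are a hypothetical stand-in.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With natural logarithms the divergence is symmetric and lies between 0 and log 2, so a value below 0.0001 indicates nearly identical distributions.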

[LG-76] GeoConformal prediction: a model-agnostic framework of measuring the uncertainty of spatial prediction

Link: https://arxiv.org/abs/2412.08661
Authors: Xiayin Lou, Peng Luo, Liqiu Meng
Keywords (EN): Spatial, task in geography, prediction, Spatial prediction, fundamental task
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)

Abstract:Spatial prediction is a fundamental task in geography. In recent years, with advances in geospatial artificial intelligence (GeoAI), numerous models have been developed to improve the accuracy of geographic variable predictions. Beyond achieving higher accuracy, it is equally important to obtain predictions with uncertainty measures to enhance model credibility and support responsible spatial prediction. Although geostatistical methods like Kriging offer some level of uncertainty assessment, such as Kriging variance, these measurements are not always accurate and lack general applicability to other spatial models. To address this issue, we propose a model-agnostic uncertainty assessment method called GeoConformal Prediction, which incorporates geographical weighting into conformal prediction. We applied it to two classic spatial prediction cases, spatial regression and spatial interpolation, to evaluate its reliability. First, in the spatial regression case, we used XGBoost to predict housing prices, followed by GeoConformal to calculate uncertainty. Our results show that GeoConformal achieved a coverage rate of 93.67%, while Bootstrap methods only reached a maximum coverage of 68.33% after 2000 runs. Next, we applied GeoConformal to spatial interpolation models. We found that the uncertainty obtained from GeoConformal aligned closely with the variance in Kriging. Finally, using GeoConformal, we analyzed the sources of uncertainty in spatial prediction. We found that explicitly including local features in AI models can significantly reduce prediction uncertainty, especially in areas with strong local dependence. Our findings suggest that GeoConformal holds potential not only for geographic knowledge discovery but also for guiding the design of future GeoAI models, paving the way for more reliable and interpretable spatial prediction frameworks.
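
One way to read "geographical weighting in conformal prediction" is as a split-conformal interval whose calibration residuals are weighted by distance to the prediction location. The sketch below follows that reading. The Gaussian kernel, the `bandwidth` parameter, the simple weighted-quantile rule (which omits the extra test-point mass used in formal weighted conformal prediction), and all function names are assumptions, not the authors' exact procedure.

```python
import math


def gaussian_weights(test_xy, calib_xy, bandwidth):
    """Gaussian kernel weights based on Euclidean distance to the test location."""
    weights = []
    for x, y in calib_xy:
        d2 = (x - test_xy[0]) ** 2 + (y - test_xy[1]) ** 2
        weights.append(math.exp(-0.5 * d2 / bandwidth ** 2))
    return weights


def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative weight reaches level * total weight."""
    total = sum(weights)
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= level * total:
            return score
    return max(scores)


def geo_conformal_interval(prediction, test_xy, calib_xy, calib_residuals,
                           alpha=0.1, bandwidth=1.0):
    """Interval prediction +/- q, where q is the locally weighted (1 - alpha)
    quantile of absolute calibration residuals."""
    weights = gaussian_weights(test_xy, calib_xy, bandwidth)
    q = weighted_quantile(calib_residuals, weights, 1.0 - alpha)
    return prediction - q, prediction + q
```

Because nearby calibration points dominate the quantile, the interval widens in regions where the underlying model (for example XGBoost) has larger local errors, which is the kind of spatially varying uncertainty the abstract describes.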

[LG-77] Capacitive Touch Sensor Modeling With a Physics-informed Neural Network and Maxwell's Equations

Link: https://arxiv.org/abs/2412.08650
Authors: Ganyong Mo, Krishna Kumar Narayanan, David Castells-Rufas, Jordi Carrabina
Keywords (EN): optimizing sensor systems, magnetic field interactions, Maxwell equations, switches and smartphones, understanding electric
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: ESM’2024 (The 38th annual European Simulation and Modelling Conference)

Abstract:Maxwell’s equations are the fundamental equations for understanding electric and magnetic field interactions and play a crucial role in designing and optimizing sensor systems like capacitive touch sensors, which are widely prevalent in automotive switches and smartphones. Ensuring robust functionality and stability of the sensors in dynamic environments necessitates profound domain expertise and computationally intensive multi-physics simulations. This paper introduces a novel approach using a Physics-Informed Neural Network (PINN) based surrogate model to accelerate the design process. The PINN model solves the governing electrostatic equations describing the interaction between a finger and a capacitive sensor. Inputs include spatial coordinates from a 3D domain encompassing the finger, sensor, and PCB, along with finger distances. By incorporating the electrostatic equations directly into the neural network’s loss function, the model captures the underlying physics. The learned model thus serves as a surrogate sensor model on which inference can be carried out in seconds for different experimental setups without the need to run simulations. Efficacy results evaluated on unseen test cases demonstrate the significant potential of PINNs in accelerating the development and design optimization of capacitive touch sensors.
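
A generic form of such a physics-informed loss, written here only as an illustration, penalizes the residual of Gauss's law for the learned potential together with a boundary mismatch. The charge-free region, the Dirichlet boundary values \phi_b, the weight \lambda, and the use of the finger distance d as an extra network input are all assumptions. The abstract does not give the paper's actual loss.

```latex
\mathcal{L}(\theta)
  = \frac{1}{N_r} \sum_{i=1}^{N_r}
      \left| \nabla \cdot \big( \varepsilon(\mathbf{x}_i)\,
      \nabla \phi_\theta(\mathbf{x}_i, d_i) \big) \right|^2
  + \lambda \, \frac{1}{N_b} \sum_{j=1}^{N_b}
      \left| \phi_\theta(\mathbf{x}_j, d_j) - \phi_b(\mathbf{x}_j) \right|^2
```

Here \phi_\theta is the network's electric potential, \varepsilon the permittivity, the first sum runs over N_r collocation points inside the 3D domain, and the second over N_b boundary points such as electrode surfaces.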

[LG-78] Multi-modal Representation Learning Enables Accurate Protein Function Prediction in Low-Data Setting

Link: https://arxiv.org/abs/2412.08649
Authors: Serbülent Ünsal (1 and 2), Sinem Özdemir (1), Bünyamin Kasap (3), M. Erşan Kalaycı (1), Kemal Turhan (1), Tunca Doğan (4 and 5), Aybar C. Acar (2) ((1) Department of Biostatistics and Medical Informatics, Faculty of Medicine, Graduate School of Health Sciences, Karadeniz Technical University, Trabzon, Türkiye, (2) Cancer Systems Biology Laboratory (KanSiL), Graduate School of Informatics, Middle East Technical University, Ankara, Türkiye, (3) Health Sciences University Trabzon Kanuni Training and Research Hospital, Medical Microbiology Laboratory, Trabzon, Türkiye, (4) Biological Data Science Lab, Dept. of Computer Engineering, Department of Computer Engineering, Hacettepe University, Ankara, Türkiye, (5) Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Türkiye)
Keywords (EN): learning framework designed, low-data settings, framework designed, designed to enhance, enhance protein function
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)

Abstract:In this study, we propose HOPER (HOlistic ProtEin Representation), a novel multimodal learning framework designed to enhance protein function prediction (PFP) in low-data settings. The challenge of predicting protein functions is compounded by the limited availability of labeled data. Traditional machine learning models already struggle in such cases, and while deep learning models excel with abundant data, they also face difficulties when data is scarce. HOPER addresses this issue by integrating three distinct modalities - protein sequences, biomedical text, and protein-protein interaction (PPI) networks - to create a comprehensive protein representation. The model utilizes autoencoders to generate holistic embeddings, which are then employed for PFP tasks using transfer learning. HOPER outperforms existing methods on a benchmark dataset across all Gene Ontology categories, i.e., molecular function, biological process, and cellular component. Additionally, we demonstrate its practical utility by identifying new immune-escape proteins in lung adenocarcinoma, offering insights into potential therapeutic targets. Our results highlight the effectiveness of multimodal representation learning for overcoming data limitations in biological research, potentially enabling more accurate and scalable protein function prediction. HOPER source code and datasets are available at this https URL

Information Retrieval

[IR-0] SPRec: Leveraging Self-Play to Debias Preference Alignment for Large Language Model-based Recommendations

Link: https://arxiv.org/abs/2412.09243
Authors: Chongming Gao, Ruijun Chen, Shuai Yuan, Kexin Huang, Yuanqing Yu, Xiangnan He
Keywords (EN): Large language models, attracted significant attention, Large language, attracted significant, significant attention
Categories: Information Retrieval (cs.IR)

Abstract:Large language models (LLMs) have attracted significant attention in recommendation systems. Current LLM-based recommender systems primarily rely on supervised fine-tuning (SFT) to train the model for recommendation tasks. However, relying solely on positive samples limits the model’s ability to align with user satisfaction and expectations. To address this, researchers have introduced Direct Preference Optimization (DPO), which explicitly aligns recommendations with user preferences using offline preference ranking data. Despite its advantages, our theoretical analysis reveals that DPO inherently biases the model towards a few items, exacerbating the filter bubble issue and ultimately degrading user experience. In this paper, we propose SPRec, a novel self-play recommendation framework designed to mitigate over-recommendation and improve fairness without requiring additional data or manual intervention. In each self-play iteration, the model undergoes an SFT step followed by a DPO step, treating offline interaction data as positive samples and the predicted outputs from the previous iteration as negative samples. This effectively re-weights the DPO loss function using the model’s logits, adaptively suppressing biased items. Extensive experiments on multiple real-world datasets demonstrate SPRec’s effectiveness in enhancing recommendation accuracy and addressing fairness concerns.
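
The DPO step in each self-play iteration can be illustrated with the standard per-pair DPO loss. In this sketch the positive is an item from the offline interaction data and the negative is an item predicted by the previous iteration's model, as the abstract describes. The scalar log-probabilities, the `beta` value, and the function name are hypothetical, and the SFT step is not shown.

```python
import math


def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one (positive, negative) pair.

    loss = -log(sigmoid(beta * [(log pi(pos) - log pi_ref(pos))
                                - (log pi(neg) - log pi_ref(neg))]))
    """
    margin = beta * ((policy_logp_pos - ref_logp_pos)
                     - (policy_logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the positive item's likelihood relative to the reference and lowers the likelihood of the item it previously over-predicted, which is how self-play negatives can suppress over-recommended items.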

Attachment Download

Click to download today's full paper list
