This post presents the latest list of papers retrieved from Arxiv.org on 2024-12-31. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by email on a schedule, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each morning.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2024-12-31)

A total of 640 papers are updated today, including:

  • Natural Language Processing: 74 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 154 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 170 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 182 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Distributed Mixture-of-Agents for Edge Inference with Large Language Models

【Quick Read】: This paper addresses queue stability on edge devices when multiple large language models (LLMs) collaborate through a Mixture-of-Agents (MoA) architecture in a distributed setting. Specifically, it studies how LLMs running on edge devices exchange information via decentralized gossip algorithms to produce more refined responses while avoiding queue overflow caused by device memory limits. The key of the solution is to determine, through theoretical analysis validated by experiments, the stability conditions under which the average queue sizes in the system remain bounded. Experiments further show that different MoA configurations yield different response quality on the AlpacaEval 2.0 benchmark, validating the effectiveness of the distributed MoA architecture.

Link: https://arxiv.org/abs/2412.21200
Authors: Purbesh Mitra,Priyanka Kaswan,Sennur Ulukus
Affiliations: University of Maryland, College Park, MD, USA; Princeton University, Princeton, NJ, USA
Keywords: enabling multiple individual, large language models, enabling multiple, multiple individual LLMs, recently been proposed
Subjects: Information Theory (cs.IT); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Mixture-of-Agents (MoA) has recently been proposed as a method to enhance performance of large language models (LLMs), enabling multiple individual LLMs to work together for collaborative inference. This collaborative approach results in improved responses to user prompts compared to relying on a single LLM. In this paper, we consider such an MoA architecture in a distributed setting, where LLMs operate on individual edge devices, each uniquely associated with a user and equipped with its own distributed computing power. These devices exchange information using decentralized gossip algorithms, allowing different device nodes to talk without the supervision of a centralized server. In the considered setup, different users have their own LLM models to address user prompts. Additionally, the devices gossip either their own user-specific prompts or augmented prompts to generate more refined answers to certain queries. User prompts are temporarily stored in the device queues when their corresponding LLMs are busy. Given the memory limitations of edge devices, it is crucial to ensure that the average queue sizes in the system remain bounded. In this paper, we address this by theoretically calculating the queuing stability conditions for the device queues under reasonable assumptions, which we validate experimentally as well. Further, we demonstrate through experiments, leveraging open-source LLMs for the implementation of distributed MoA, that certain MoA configurations produce higher-quality responses compared to others, as evaluated on AlpacaEval 2.0 benchmark. The implementation is available at: this https URL.
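As a rough illustration of the queuing condition the paper analyzes, the following minimal sketch (not from the paper; the rates, step count, and single-queue setup are assumptions for illustration) simulates one device queue and shows that the time-averaged backlog stays bounded only when the prompt arrival rate is below the service rate.

```python
import random

def simulate_queue(arrival_rate, service_rate, steps=100_000, seed=0):
    """Toy discrete-time simulation of a single device queue.

    Each step, a new prompt arrives with probability `arrival_rate` and the
    device finishes serving one queued prompt with probability `service_rate`.
    When arrival_rate < service_rate, the average backlog stays bounded.
    """
    rng = random.Random(seed)
    backlog, total = 0, 0
    for _ in range(steps):
        if rng.random() < arrival_rate:                   # a user prompt arrives
            backlog += 1
        if backlog > 0 and rng.random() < service_rate:   # the LLM finishes one prompt
            backlog -= 1
        total += backlog
    return total / steps  # time-averaged queue length

if __name__ == "__main__":
    print("stable  :", simulate_queue(arrival_rate=0.3, service_rate=0.5))
    print("unstable:", simulate_queue(arrival_rate=0.6, service_rate=0.5))
```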

[NLP-1] HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

【Quick Read】: This paper targets the evaluation of progressive reasoning and problem-solving in large language models (LLMs) by introducing a new task, self-invoking code generation, in which a model must first solve a base problem and then use that solution to tackle a related, more complex one. The contributions are threefold: first, a general recipe for generating more challenging versions of existing benchmarks, yielding three new benchmarks (HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro) dedicated to this task; second, an analysis of twenty LLMs on these benchmarks showing that most models excel at traditional code generation but degrade markedly on self-invoking tasks, and that instruction-tuned models offer only marginal gains over base models; third, an account of the failure modes observed in the evaluation. These results underscore the need for further progress on self-invoking code generation and point to a new direction for future research.

Link: https://arxiv.org/abs/2412.21199
Authors: Zhaojian Yu,Yilun Zhao,Arman Cohan,Xiao-Ping Zhang
Affiliations: Tsinghua University; Yale University
Keywords: self-invoking code generation, code generation, self-invoking code, evaluate the progressive, Pro
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract:We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On self-invoking code generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs’ code reasoning capabilities.
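To make the task definition concrete, here is a small hypothetical example of a base problem and a self-invoking follow-up in the spirit of the benchmark; the specific functions are invented for illustration and are not taken from HumanEval Pro or MBPP Pro.

```python
# Base problem: the model is first asked to implement this.
def word_count(text: str) -> dict:
    """Count how often each word appears in `text`."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Self-invoking extension: a harder follow-up whose reference solution
# must call the base function rather than re-deriving the logic.
def most_common_words(texts: list[str], k: int) -> list[str]:
    """Return the k most frequent words across a list of documents."""
    merged: dict = {}
    for text in texts:
        for word, n in word_count(text).items():  # reuse the base solution
            merged[word] = merged.get(word, 0) + n
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

assert most_common_words(["a b a", "b c"], 2) == ["a", "b"]
```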

[NLP-2] Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

【Quick Read】: This paper addresses the "overthinking" problem in o1-like models (such as OpenAI o1), i.e., allocating excessive compute to simple problems with minimal benefit. It introduces novel efficiency metrics from both outcome and process perspectives to assess how rationally these models use computational resources. The key of the solution is a self-training paradigm with strategies that streamline the reasoning process and cut unnecessary computation while preserving performance on test sets of varying difficulty (GSM8K, MATH500, GPQA, and AIME). Experiments show the approach reduces computational overhead while maintaining model accuracy.

Link: https://arxiv.org/abs/2412.21187
Authors: Xingyu Chen,Jiahao Xu,Tian Liang,Zhiwei He,Jianhui Pang,Dian Yu,Linfeng Song,Qiuzhi Liu,Mengfei Zhou,Zhuosheng Zhang,Rui Wang,Zhaopeng Tu,Haitao Mi,Dong Yu
Affiliations: 1; 2; SJTU (Shanghai Jiao Tong University); Tencent
Keywords: emulate human-like long-time, human-like long-time thinking, thinking during inference, ability to emulate, emulate human-like
Subjects: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.

[NLP-3] Aviary: training language agents on challenging scientific tasks

【Quick Read】: This paper targets the automation of complex real-world tasks, particularly in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents that interact with tools via natural language or code are promising for automating intellectual tasks in science, but their flexibility creates conceptual and practical challenges for software implementation, including internal reasoning, planning, tool usage, and the inherent stochasticity of temperature-sampled language models. The key of the solution is Aviary, an extensible gymnasium for language agents that formalizes agents as policies solving language-grounded partially observable Markov decision processes (language decision processes). Five environments are implemented, including three challenging scientific ones: manipulating DNA constructs for molecular cloning, answering research questions by accessing scientific literature, and engineering protein stability. The paper shows that language agents backed by open-source, non-frontier LLMs can match or exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.

Link: https://arxiv.org/abs/2412.21154
Authors: Siddharth Narayanan,James D. Braza,Ryan-Rhys Griffiths,Manu Ponnapati,Albert Bou,Jon Laurent,Ori Kabeli,Geemi Wellawatte,Sam Cox,Samuel G. Rodriques,Andrew D. White
Affiliations: FutureHouse Inc.; University of Rochester; Francis Crick Institute
Keywords: Solving complex real-world, complex real-world tasks, real-world tasks requires, tasks requires cycles, actions and observations
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Solving complex real-world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non-standard components such as internal reasoning, planning, tool usage, as well as the inherent stochasticity of temperature-sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language-grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi-step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference-time compute, we show that language agents backed by open-source, non-frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.

[NLP-4] Facilitating large language model Russian adaptation with Learned Embedding Propagation

【Quick Read】: This paper addresses the opacity of training data and the high cost of language adaptation when adopting LLM technology in sensitive-information environments. Current open-source instruction-tuned LLMs perform well multilingually, but their training data is undisclosed, making the results hard to reproduce, and training a language-specific LLM is so costly that improved inference efficiency is its only guaranteed advantage. The proposed solution, Learned Embedding Propagation (LEP), minimizes the impact on existing LLM knowledge, lowers training-data requirements, and uses a novel ad-hoc embedding propagation procedure to skip the instruction-tuning step and implant new language knowledge directly into existing instruction-tuned models. Experiments on Russian vocabulary adaptation show that LEP is competitive with traditional instruction tuning, approaching OpenChat 3.5 and LLaMa-3-8B-Instruct, with further gains from self-calibration and continued tuning.

Link: https://arxiv.org/abs/2412.21140
Authors: Mikhail Tikhomirov,Daniil Chernyshev
Affiliations: Unknown
Keywords: text generation quality, powerful open-source instruction-tuned, open-source instruction-tuned LLMs, Rapid advancements, large language model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint version of an article published in the Journal of Language and Education. Copyright held by the owner/author(s). Publication rights licensed to the Journal of Language and Education

Abstract:Rapid advancements of large language model (LLM) technologies led to the introduction of powerful open-source instruction-tuned LLMs that have the same text generation quality as the state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments the authors of such models don not disclose the training data necessary for replication of the results thus making the achievements model-exclusive. Since those open-source models are also multilingual this in turn reduces the benefits of training a language specific LLMs as improved inference computation efficiency becomes the only guaranteed advantage of such costly procedure. More cost-efficient options such as vocabulary extension and subsequent continued pre-training are also inhibited by the lack of access to high-quality instruction-tuning data since it is the major factor behind the resulting LLM task-solving capabilities. To address the limitations and cut the costs of the language adaptation pipeline we propose Learned Embedding Propagation (LEP). Unlike existing approaches our method has lower training data size requirements due to minimal impact on existing LLM knowledge which we reinforce using novel ad-hoc embedding propagation procedure that allows to skip the instruction-tuning step and instead implant the new language knowledge directly into any existing instruct-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.

[NLP-5] Training Software Engineering Agents and Verifiers with SWE-Gym

【Quick Read】: This paper targets the automation of real-world software engineering (SWE) tasks, in particular how to train agents to handle complex programming tasks efficiently. The authors present SWE-Gym, the first environment for training software engineering agents, containing 2,438 real-world Python task instances, each with an executable codebase, unit tests, and a task described in natural language. Training language-model-based SWE agents in this environment yields up to 19% absolute gains in resolve rate on the SWE-Bench Verified and Lite test sets. The authors further train verifiers on agent trajectories for inference-time scaling, reaching 32.0% and 26.0% on SWE-Bench Verified and Lite respectively, a new state of the art for open-weight SWE agents. The key lies in the construction of the SWE-Gym environment and its combination with language models and verifiers, providing a new research platform and tooling for software engineering automation.

Link: https://arxiv.org/abs/2412.21139
Authors: Jiayi Pan,Xingyao Wang,Graham Neubig,Navdeep Jaitly,Heng Ji,Alane Suhr,Yizhe Zhang
Affiliations: Unknown
Keywords: real-world software engineering, training real-world software, SWE agents, software engineering, Verified and Lite
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Code at this https URL

Abstract:We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents , achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.

[NLP-6] Exploring and Controlling Diversity in LLM-Agent Conversation AAAI2025

【Quick Read】: This paper studies how to control and explore diversity in open-domain multi-agent conversations, particularly for world-simulation applications. It proposes Adaptive Prompt Pruning (APP), a novel method that dynamically adjusts the content of the utterance-generation prompt to control diversity with a single parameter, lambda. The key of APP is that pruning information from the prompt increases output diversity; experiments show that pruning more information leads to more diverse output. The paper analyzes the relationship between prompt content and conversational diversity in depth, finding that information from all prompt components generally constrains output diversity, with the Memory block having the largest influence. APP is compatible with existing techniques such as temperature sampling and top-p sampling, offering a flexible tool for diversity management. To handle trade-offs of increased diversity, such as inconsistencies with omitted information, a post-generation correction step is introduced that balances diversity enhancement with output consistency. The paper also examines how prompt structure (component order and length) affects diversity, laying a foundation for systematically engineering diversity in LLM-based multi-agent collaboration and improving its effectiveness in real-world applications.

Link: https://arxiv.org/abs/2412.21102
Authors: KuanChao Chu,Yi-Pei Chen,Hideki Nakayama
Affiliations: Unknown
Keywords: Diversity, critical aspect, Adaptive Prompt Pruning, Prompt, APP
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for the AAAI 2025 Workshop on Advancing LLM-Based Multi-Agent Collaboration

Abstract:Diversity is a critical aspect of multi-agent communication. In this paper, we focus on controlling and exploring diversity in the context of open-domain multi-agent conversations, particularly for world simulation applications. We propose Adaptive Prompt Pruning (APP), a novel method that dynamically adjusts the content of the utterance generation prompt to control diversity using a single parameter, lambda. Through extensive experiments, we show that APP effectively controls the output diversity across models and datasets, with pruning more information leading to more diverse output. We comprehensively analyze the relationship between prompt content and conversational diversity. Our findings reveal that information from all components of the prompt generally constrains the diversity of the output, with the Memory block exerting the most significant influence. APP is compatible with established techniques like temperature sampling and top-p sampling, providing a versatile tool for diversity management. To address the trade-offs of increased diversity, such as inconsistencies with omitted information, we incorporate a post-generation correction step, which effectively balances diversity enhancement with output consistency. Additionally, we examine how prompt structure, including component order and length, impacts diversity. This study addresses key questions surrounding diversity in multi-agent world simulation, offering insights into its control, influencing factors, and associated trade-offs. Our contributions lay the foundation for systematically engineering diversity in LLM-based multi-agent collaborations, advancing their effectiveness in real-world applications.
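The abstract notes that APP is compatible with temperature sampling and top-p sampling. As background, here is a minimal sketch of those two decoding knobs (the APP pruning step itself depends on the paper's prompt template and is not reproduced here); the logits are dummy values.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_p: float = 0.9) -> int:
    """Temperature + nucleus (top-p) sampling over a 1-D logits vector."""
    probs = torch.softmax(logits / max(temperature, 1e-5), dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose cumulative mass reaches top_p.
    cutoff = int((cumulative < top_p).sum().item()) + 1
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return int(sorted_idx[choice].item())

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # dummy next-token logits
print(sample_next_token(logits, temperature=0.7, top_p=0.8))
```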

[NLP-7] Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight Task-Specific Adapters for Automatic Scoring AAAI

【Quick Read】: This paper addresses the balance between performance, adaptability, and cost when integrating artificial intelligence (AI) into education. Specifically, it proposes a shared-backbone model architecture with lightweight LoRA (Low-Rank Adaptation) adapters for task-specific fine-tuning, targeting automated scoring of student responses across 27 mutually exclusive tasks. The key of the solution is that the LoRA adapters cut GPU memory consumption by 60% and inference latency by 40% while keeping performance competitive with fully fine-tuned models (average QWK of 0.848 versus 0.888). The framework improves efficiency and helps educators streamline assessment workflows while maintaining fairness and transparency in automated scoring systems.

Link: https://arxiv.org/abs/2412.21065
Authors: Ehsan Latif,Xiaoming Zhai
Affiliations: Unknown
Keywords: Artificial Intelligence, integration of Artificial, education requires scalable, education requires, balance performance
Subjects: Computation and Language (cs.CL)
Comments: Accepted by AAAI-iRAISE Workshop

Abstract:The integration of Artificial Intelligence (AI) in education requires scalable and efficient frameworks that balance performance, adaptability, and cost. This paper addresses these needs by proposing a shared backbone model architecture enhanced with lightweight LoRA adapters for task-specific fine-tuning, targeting the automated scoring of student responses across 27 mutually exclusive tasks. By achieving competitive performance (average QWK of 0.848 compared to 0.888 for fully fine-tuned models) while reducing GPU memory consumption by 60% and inference latency by 40%, the framework demonstrates significant efficiency gains. This approach aligns with the workshops’ focus on improving language models for educational tasks, creating responsible innovations for cost-sensitive deployment, and supporting educators by streamlining assessment workflows. The findings underscore the potential of scalable AI to enhance learning outcomes while maintaining fairness and transparency in automated scoring systems.
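A minimal sketch of the shared-backbone-plus-adapters idea, assuming a single frozen linear layer standing in for the backbone and one low-rank adapter per task; the dimensions, rank, and initialization are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SharedBackboneWithLoRA(nn.Module):
    """One frozen weight matrix shared by all tasks, plus a tiny low-rank
    adapter (A, B) per task that is the only part trained for that task."""

    def __init__(self, d_in: int, d_out: int, num_tasks: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)
        self.shared.requires_grad_(False)  # the backbone stays frozen
        self.A = nn.ParameterList([nn.Parameter(torch.randn(d_in, rank) * 0.01)
                                   for _ in range(num_tasks)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(rank, d_out))
                                   for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Shared projection plus the task-specific low-rank correction.
        return self.shared(x) + x @ self.A[task_id] @ self.B[task_id]

model = SharedBackboneWithLoRA(d_in=768, d_out=768, num_tasks=27)
scores = model(torch.randn(4, 768), task_id=3)  # route a batch through task 3's adapter
print(scores.shape)  # torch.Size([4, 768])
```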

[NLP-8] TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

【Quick Read】: This paper addresses the challenge of preference alignment for text-to-audio (TTA) generative models. Because TTA lacks structured mechanisms such as the verifiable rewards or gold-standard answers available to large language models (LLMs), creating preference pairs is particularly difficult. The paper proposes CLAP-Ranked Preference Optimization (CRPO), a framework that iteratively generates and optimizes preference data to improve TTA alignment. The key is that CRPO produces audio preference datasets that outperform existing alternatives, enabling the TangoFlux model to achieve state-of-the-art performance on both objective and subjective benchmarks.

Link: https://arxiv.org/abs/2412.21037
Authors: Chia-Yu Hung,Navonil Majumder,Zhifeng Kong,Ambuj Mehrish,Rafael Valle,Bryan Catanzaro,Soujanya Poria
Affiliations: Unknown
Keywords: Large Language Models, capable of generating, GPU, TTA, Large Language
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: this https URL

Abstract:We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.

[NLP-9] GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

【Quick Read】: This paper addresses the deficiency of current multimodal large language models (MLLMs) in geometric perception. Existing benchmarks evaluate these models in context-rich, real-life scenarios and often overlook fundamental perceptual skills needed in environments that deviate from everyday realism, in particular the ability to interpret spatial relationships and abstract visual patterns. To fill this gap, the authors introduce GePBench, a novel benchmark designed to assess the geometric perception capabilities of MLLMs. Extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies on such tasks. The authors further show that models trained with data sourced from GePBench achieve notable improvements on a wide range of downstream tasks, underscoring geometric perception as a foundation for advanced multimodal applications.

Link: https://arxiv.org/abs/2412.21036
Authors: Shangyu Xing,Changhao Xiang,Yuteng Han,Yifan Yue,Zhen Wu,Xinyu Liu,Zhangtai Wu,Fei Zhao,Xinyu Dai
Affiliations: Unknown
Keywords: Multimodal large language, large language models, achieved significant advancements, linguistic understanding, large language
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal large language models (MLLMs) have achieved significant advancements in integrating visual and linguistic understanding. While existing benchmarks evaluate these models in context-rich, real-life scenarios, they often overlook fundamental perceptual skills essential for environments deviating from everyday realism. In particular, geometric perception, the ability to interpret spatial relationships and abstract visual patterns, remains underexplored. To address this limitation, we introduce GePBench, a novel benchmark designed to assess the geometric perception capabilities of MLLMs. Results from extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in such tasks. Additionally, we demonstrate that models trained with data sourced from GePBench show notable improvements on a wide range of downstream tasks, underscoring the importance of geometric perception as a foundation for advanced multimodal applications. Our code and datasets will be publicly available.

[NLP-10] Plancraft: an evaluation dataset for planning with LLM agents

【Quick Read】: This paper targets the planning abilities of LLM and VLM agents in complex, multimodal settings. The authors present Plancraft, a multimodal evaluation dataset based on the Minecraft crafting GUI, with both text-only and multimodal interfaces. Plancraft includes the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), plus an oracle planner and an oracle RAG information extractor to ablate the components of a modern agent architecture. To evaluate decision-making, it also contains a subset of intentionally unsolvable tasks, requiring agents not only to complete tasks but also to decide whether they are solvable at all. Benchmarking open-source and closed-source LLMs and strategies against a handcrafted planner, the authors find that existing LLMs and VLMs struggle with the planning problems Plancraft introduces, and they offer suggestions for improving their capabilities.

Link: https://arxiv.org/abs/2412.21033
Authors: Gautier Dagan,Frank Keller,Alex Lascarides
Affiliations: Unknown
Keywords: multi-modal evaluation dataset, Minecraft crafting GUI, Retrieval Augmented Generation, evaluation dataset, present Plancraft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as an oracle planner and oracle RAG information extractor, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and strategies on our task and compare their performance to a handcrafted planner. We find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and we offer suggestions on how to improve their capabilities.

[NLP-11] MapQaTor: A System for Efficient Annotation of Map Query Datasets

【Quick Read】: This paper addresses the difficulty that existing map services (Google Maps, Apple Maps, OpenStreetMap) have with natural language geospatial queries, and in particular the challenge of creating reliable geospatial question-answering (QA) datasets from map services. The solution is MapQaTor, a web application whose plug-and-play architecture integrates seamlessly with any maps API, letting users gather and visualize data from diverse sources with minimal setup. By caching API responses, MapQaTor ensures consistent ground truth and improves data reliability even as real-world information evolves. The platform centralizes data retrieval, annotation, and visualization, offering a unique opportunity to evaluate the current state of LLM-based geospatial reasoning while advancing geospatial understanding. Evaluation shows that MapQaTor speeds up the annotation process by at least 30 times compared to manual methods, highlighting its potential for developing geospatial resources such as complex map-reasoning datasets.

Link: https://arxiv.org/abs/2412.21015
Authors: Mahir Labib Dihan,Mohammed Eunus Ali,Md Rizwan Parvez
Affiliations: Unknown
Keywords: handle natural language, Large Language Models, Mapping and navigation, language geospatial queries, Apple Maps
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 13 pages, 35 figures

Abstract:Mapping and navigation services like Google Maps, Apple Maps, Openstreet Maps, are essential for accessing various location-based data, yet they often struggle to handle natural language geospatial queries. Recent advancements in Large Language Models (LLMs) show promise in question answering (QA), but creating reliable geospatial QA datasets from map services remains challenging. We introduce MapQaTor, a web application that streamlines the creation of reproducible, traceable map-based QA datasets. With its plug-and-play architecture, MapQaTor enables seamless integration with any maps API, allowing users to gather and visualize data from diverse sources with minimal setup. By caching API responses, the platform ensures consistent ground truth, enhancing the reliability of the data even as real-world information evolves. MapQaTor centralizes data retrieval, annotation, and visualization within a single platform, offering a unique opportunity to evaluate the current state of LLM-based geospatial reasoning while advancing their capabilities for improved geospatial understanding. Evaluation metrics show that, MapQaTor speeds up the annotation process by at least 30 times compared to manual methods, underscoring its potential for developing geospatial resources, such as complex map reasoning datasets. The website is live at: this https URL and a demo video is available at: this https URL.

[NLP-12] Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria

【Quick Read】: This paper tackles the sharply rising inference cost incurred when LLMs generate large numbers of intermediate reasoning units (e.g., tokens, sentences) to improve final answer quality on complex tasks. The proposed solution is a novel sentence-level rationale reduction training framework that uses a likelihood-based criterion, verbosity, to identify and remove redundant reasoning sentences. Unlike prior approaches based on token-level reduction, this sentence-level framework maintains model performance while shortening generation, preserving the original reasoning abilities of LLMs and achieving an average 17.15% reduction in generation costs across models and tasks.

Link: https://arxiv.org/abs/2412.21006
Authors: Joonwon Jang,Jaehee Kim,Wonbin Kweon,Hwanjo Yu
Affiliations: Unknown
Keywords: Large Language Models, Large Language, enhance final answer, final answer quality, generating extensive intermediate
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) rely on generating extensive intermediate reasoning units (e.g., tokens, sentences) to enhance final answer quality across a wide range of complex tasks. While generating multiple reasoning paths or iteratively refining rationales proves effective for improving performance, these approaches inevitably result in significantly higher inference costs. In this work, we propose a novel sentence-level rationale reduction training framework that leverages likelihood-based criteria, verbosity, to identify and remove redundant reasoning sentences. Unlike previous approaches that utilize token-level reduction, our sentence-level reduction framework maintains model performance while reducing generation length. This preserves the original reasoning abilities of LLMs and achieves an average 17.15% reduction in generation costs across various models and tasks.
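As a rough sketch of likelihood-based sentence scoring (the paper's verbosity criterion is more principled than this), one can measure how predictable each rationale sentence is given the preceding context and treat highly predictable sentences as removal candidates; the model name and the toy rationale below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the reasoning model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sentence_logprob(context: str, sentence: str) -> float:
    """Length-normalized log-likelihood of `sentence` given `context`.
    A sentence the model already finds highly predictable adds little
    new information and is a candidate for pruning."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    sent_ids = tok(" " + sentence, return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([ctx_ids, sent_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, -sent_ids.shape[1]:].mean().item()

context = "Q: What is 2 + 3? Let's think step by step."
rationale = ["2 plus 3 equals 5.", "Therefore the answer is 5.", "In other words, the result is five."]
for sentence in rationale:
    print(round(sentence_logprob(context, sentence), 2), sentence)
    context += " " + sentence
```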

[NLP-13] Plug-and-Play Training Framework for Preference Optimization

【Quick Read】: This paper addresses the fact that current preference-optimization methods such as DPO ignore differences in training-sample difficulty, which leads to mediocre performance on tasks that demand high accuracy, notably mathematical reasoning. To overcome this limitation, it proposes a novel training framework that analyzes output distributions via multiple sampling, assigns different weights to samples, and incorporates these weights into the preference-optimization process. This plug-and-play approach lets LLMs prioritize challenging examples during training and improves learning efficiency. Experiments show that the framework integrates seamlessly with various preference-optimization methods and yields consistent improvements on mathematical reasoning tasks.

Link: https://arxiv.org/abs/2412.20996
Authors: Jingyuan Ma,Rui Li,Zheng Li,Lei Sha,Zhifang Sui
Affiliations: Unknown
Keywords: large language models, significantly enhanced large, enhanced large language, DPO have significantly, wide tasks including
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 9 figures

Abstract:Recently, preference optimization methods such as DPO have significantly enhanced large language models (LLMs) in wide tasks including dialogue and question-answering. However, current methods fail to account for the varying difficulty levels of training samples during preference optimization, leading to mediocre performance in tasks with high accuracy requirements, particularly in mathematical reasoning. To address this limitation, we propose a novel training framework, which employs multiple sampling to analyze output distributions, assign different weights to samples, and incorporate these weights into the preference optimization process. This plug-and-play approach enables LLMs to prioritize challenging examples during training, improving learning efficiency. Experimental results demonstrate that our framework integrates seamlessly with various preference optimization methods and achieves consistent improvements in mathematical reasoning tasks.
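A hedged sketch of how a per-sample difficulty weight could enter a DPO-style objective; the weighting rule shown (one minus the sampled pass rate) is one illustrative choice, not necessarily the paper's scheme.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected,
                      weights, beta: float = 0.1):
    """DPO loss with a per-sample weight (e.g., larger for harder prompts).

    All arguments are 1-D tensors of summed token log-probs per example,
    except `weights`, which rescales each example's contribution."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    per_sample = -F.logsigmoid(beta * margin)
    return (weights * per_sample).mean()

# Example: prompts solved correctly in only 2/10 samples get a higher weight
# than prompts solved in 9/10 samples (one simple choice: 1 - pass_rate).
pass_rate = torch.tensor([0.2, 0.9, 0.5])
weights = 1.0 - pass_rate
loss = weighted_dpo_loss(torch.tensor([-12.0, -8.0, -10.0]),
                         torch.tensor([-14.0, -9.0, -13.0]),
                         torch.tensor([-12.5, -8.2, -10.5]),
                         torch.tensor([-13.0, -8.8, -12.0]),
                         weights)
print(loss)
```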

[NLP-14] KARPA: A Training-free Method of Adapting Knowledge Graph as References for Large Language Models Reasoning Path Aggregation

【Quick Read】: This paper addresses the challenges LLMs face in knowledge graph question answering (KGQA), including hallucinations, stale knowledge, and the limits of existing methods in global planning and reasoning. Existing approaches typically rely on step-by-step traversal of knowledge graphs (KGs), which restricts the global planning and reasoning capabilities of LLMs, or they require fine-tuning or pre-training on specific KGs. The paper proposes Knowledge graph Assisted Reasoning Path Aggregation (KARPA), whose key idea is to exploit the global planning ability of LLMs in three steps: pre-planning relation paths with the LLM, matching semantically relevant paths with an embedding model, and reasoning over these paths to produce answers. Unlike existing methods, KARPA avoids stepwise traversal, needs no additional training, and adapts to various LLM architectures, achieving state-of-the-art KGQA performance with both high efficiency and accuracy.

Link: https://arxiv.org/abs/2412.20995
Authors: Siyuan Fang,Kaijing Ma,Tianyu Zheng,Xinrun Du,Ningxuan Lu,Ge Zhang,Qingkun Tang
Affiliations: Unknown
Keywords: Large language models, Large language, demonstrate exceptional performance, demonstrate exceptional, affected by hallucinations
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures

Abstract:Large language models (LLMs) demonstrate exceptional performance across a variety of tasks, yet they are often affected by hallucinations and the timeliness of knowledge. Leveraging knowledge graphs (KGs) as external knowledge sources has emerged as a viable solution, but existing methods for LLM-based knowledge graph question answering (KGQA) are often limited by step-by-step decision-making on KGs, restricting the global planning and reasoning capabilities of LLMs, or they require fine-tuning or pre-training on specific KGs. To address these challenges, we propose Knowledge graph Assisted Reasoning Path Aggregation (KARPA), a novel framework that harnesses the global planning abilities of LLMs for efficient and accurate KG reasoning. KARPA operates in three steps: pre-planning relation paths using the LLM’s global planning capabilities, matching semantically relevant paths via an embedding model, and reasoning over these paths to generate answers. Unlike existing KGQA methods, KARPA avoids stepwise traversal, requires no additional training, and is adaptable to various LLM architectures. Extensive experimental results show that KARPA achieves state-of-the-art performance in KGQA tasks, delivering both high efficiency and accuracy. Our code will be available on Github.

[NLP-15] Efficiently Serving LLM Reasoning Programs with Certaindex

【Quick Read】: This paper addresses the inefficient allocation of compute for LLM reasoning tasks. As LLMs are increasingly used for complex reasoning such as mathematical problem solving, code generation, and legal analysis, inference-time reasoning algorithms refine outputs by exploring multiple solution paths, at the cost of higher compute demand and response latency. Existing serving systems cannot adapt to the scaling behavior of these algorithms or to varying query difficulty, leading to inefficient resource use and missed latency targets. The proposed system, Dynasor, dynamically optimizes compute allocation for LLM reasoning queries by tracking and scheduling requests within reasoning queries and using Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide allocation: hard queries receive more compute, simple queries less, and unpromising queries are terminated early, balancing accuracy, latency, and cost. Experiments show that Dynasor cuts compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs (Service Level Objectives) in online serving.

Link: https://arxiv.org/abs/2412.20993
Authors: Yichao Fu,Junda Chen,Siqi Zhu,Zheyu Fu,Zhongdongming Dai,Aurick Qiao,Hao Zhang
Affiliations: Unknown
Keywords: advanced reasoning tasks, code generation, mathematical problem-solving, legal analysis, rapid evolution
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which refine outputs by exploring multiple solution paths, at the cost of increasing compute demands and response latencies. Existing serving systems fail to adapt to the scaling behaviors of these algorithms or the varying difficulty of queries, leading to inefficient resource use and unmet latency targets. We present Dynasor, a system that optimizes inference-time compute for LLM reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries and uses Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor co-adapts scheduling with reasoning progress: it allocates more compute to hard queries, reduces compute for simpler ones, and terminates unpromising queries early, balancing accuracy, latency, and cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50% in batch processing and sustaining 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.
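Certaindex is defined in the paper; as a simple stand-in for the idea of certainty-guided early termination, the sketch below measures agreement among the answers sampled so far and stops allocating further reasoning paths once agreement is high. The threshold and the agreement measure are assumptions for illustration.

```python
from collections import Counter

def answer_certainty(sampled_answers: list) -> float:
    """Fraction of sampled reasoning paths that agree on the modal answer."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

def should_stop(sampled_answers, threshold: float = 0.8, min_samples: int = 4) -> bool:
    """Stop spending compute on this query once certainty is high enough."""
    return (len(sampled_answers) >= min_samples
            and answer_certainty(sampled_answers) >= threshold)

answers = ["42", "42", "42", "41"]
print(answer_certainty(answers))  # 0.75: three of four paths agree
print(should_stop(answers))       # False: not yet above the 0.8 threshold
```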

[NLP-16] DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models

【Quick Read】: This paper addresses two problems in fine-tuning large language models (LLMs): low-rank adaptation (LoRA) in two-dimensional space fails to capture high-dimensional structure in the target matrix, and existing tensor-decomposition methods rely on random initialization, causing validation loss to diverge substantially from full fine-tuning. The proposed solution, Weight-Decomposed Tensor Adaptation (DoTA), uses the Matrix Product Operator (MPO) decomposition of pre-trained weights for effective initialization, better capturing high-dimensional structure during fine-tuning. A quantized variant, QDoTA, is designed for 4-bit quantization to further reduce memory consumption. Experiments on commonsense and arithmetic reasoning tasks show that DoTA outperforms random-initialization methods with fewer parameters, and QDoTA matches DoTA on commonsense reasoning while reducing memory consumption.

Link: https://arxiv.org/abs/2412.20891
Authors: Xiaolin Hu,Xiang Cheng,Peiyu Liu,Wei Liu,Jian Luan,Bin Wang,Yong Liu
Affiliations: Unknown
Keywords: large language models, fine-tuning large language, language models, low-rank matrices, large language
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. However, low-rank approximation in two-dimensional space fails to capture high-dimensional structures within the target matrix. Recently, tensor decomposition methods have been explored for fine-tuning LLMs, leveraging their ability to extract structured information. Yet, these approaches primarily rely on random initialization, and the impact of initialization on tensor adaptation remains underexplored. In this paper, we reveal that random initialization significantly diverges from the validation loss achieved by full fine-tuning. To address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights for effective initialization in fine-tuning LLMs. Additionally, we introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization. Experiments on commonsense and arithmetic reasoning tasks show that DoTA outperforms random initialization methods with fewer parameters. QDoTA further reduces memory consumption and achieves comparable performance to DoTA on commonsense reasoning tasks. We will release our code to support future research.

[NLP-17] Enhancing Annotated Bibliography Generation with LLM Ensembles

【Quick Read】: This paper addresses quality and redundancy issues in generating annotated bibliographies with large language models (LLMs). The core solution is a multi-role LLM ensemble in which different LLMs handle controllable text generation, evaluation, and summarization, systematically improving performance on scholarly tasks. Output diversity among the generating ensemble is obtained by varying LLM parameters; an LLM acting as a judge assesses relevance, accuracy, and coherence; and responses selected by several combining strategies are then merged and refined through summarization and redundancy removal. Preliminary experimental validation shows that the ensemble outputs improve coherence and relevance over individual responses, with a 38% improvement in annotation quality and a 51% reduction in content redundancy, demonstrating the potential to automate complex scholarly tasks while maintaining high quality standards.

Link: https://arxiv.org/abs/2412.20864
Authors: Sergio Bermejo
Affiliations: Unknown
Keywords: Large Language Model, Large Language, enhancing annotated bibliography, annotated bibliography generation, Language Model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This work proposes a novel approach to enhancing annotated bibliography generation through Large Language Model (LLM) ensembles. In particular, multiple LLMs in different roles – controllable text generation, evaluation, and summarization – are introduced and validated using a systematic methodology to enhance model performance in scholarly tasks. Output diversity among the ensemble that generates text is obtained using different LLM parameters, followed by an LLM acting as a judge to assess relevance, accuracy, and coherence. Responses selected by several combining strategies are then merged and refined through summarization and redundancy removal techniques. The preliminary experimental validation demonstrates that the combined outputs from the LLM ensemble improve coherence and relevance compared to individual responses, leading to a 38% improvement in annotation quality and a 51% reduction in content redundancy, thus highlighting the potential for automating complex scholarly tasks while maintaining high-quality standards.

[NLP-18] Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs Memory

【Quick Read】: This paper addresses the observation that large language models (LLMs) often struggle with question-answering tasks and are prone to hallucinations. While earlier work attributes these issues to knowledge gaps in model parameters, an analysis of the models' internal representations shows that LLMs often retain the correct knowledge even when they produce wrong answers. Based on this observation, the paper introduces Hits@k, a new metric that assesses knowledge retention independently of expression accuracy. Experiments show that LLMs store far more knowledge than their QA performance suggests. Building on these findings, the authors develop SkipUnsure, a method that improves answer accuracy by exploiting detected but unexpressed knowledge. Experiments on open-domain and specific-domain datasets show consistent improvements, with accuracy gains of up to 11.8% on DBPedia and 6.3% on IMDB, without retraining the model.

Link: https://arxiv.org/abs/2412.20846
Authors: Xingjian Tao,Yiwei Wang,Yujun Cai,Zhicheng Yang,Jing Tang
Affiliations: Unknown
Keywords: Large language models, Large language, potential knowledge bases, prone to hallucinations, shown promise
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have shown promise as potential knowledge bases, yet they often struggle with question-answering tasks and are prone to hallucinations. While previous research attributes these issues to knowledge gaps in the model’s parameters, our investigation reveals a different phenomenon: LLMs often retain correct knowledge even when generating incorrect answers. Through analysis of model’s internal representations, we find that correct answers frequently appear among high-probability tokens despite not being selected as final outputs. Based on this observation, we introduce Hits@k, a new metric to assess knowledge retention independent of expression accuracy. Our extensive experiments demonstrate that LLMs store significantly more knowledge than their QA performance suggests. Building on these findings, we develop SkipUnsure, a method to improve answer accuracy by leveraging detected but unexpressed knowledge. Experiments on both open-domain and specific-domain datasets show consistent improvements, with accuracy gains of up to 11.8% on DBPedia and 6.3% on IMDB, without requiring model retraining.
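A simplified reading of a Hits@k-style check: test whether the first token of the gold answer appears among the model's top-k next-token candidates, independent of what greedy decoding would return. The model name is a placeholder and the single-token check is a simplification of the metric.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def hits_at_k(prompt: str, gold_answer: str, k: int = 5) -> bool:
    """True if the gold answer's first token is among the top-k next-token
    candidates, even when it would not be the greedy (top-1) choice."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = lm(**inputs).logits[0, -1]  # next-token distribution
    topk_ids = torch.topk(logits, k).indices.tolist()
    gold_first_id = tok(" " + gold_answer.strip(), add_special_tokens=False)["input_ids"][0]
    return gold_first_id in topk_ids

print(hits_at_k("The capital of France is", "Paris", k=5))
```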

[NLP-19] Disentangling Preference Representation and Text Generation for Efficient Individual Preference Alignment COLING2025

【Quick Read】: This paper addresses the efficiency of aligning large language models (LLMs) with individual human preferences. Aligning LLMs with general human preferences has proved crucial for improving interaction quality, but since human values are inherently diverse, general alignment alone is insufficient, and personalizing LLMs to individual feedback is a promising direction. Existing alignment algorithms, however, are inefficient for this purpose. The paper introduces a flexible alignment paradigm whose key innovation is to disentangle preference representation from text generation, fundamentally improving alignment efficiency. Experiments on multiple text generation tasks show that the method matches or exceeds the alignment quality of PEFT-based approaches while reducing the additional training time for each new individual preference by 80% to 90%.

Link: https://arxiv.org/abs/2412.20834
Authors: Jianfei Zhang,Jun Bai,Bei Li,Yanmeng Wang,Rumei Li,Chenghua Lin,Wenge Rong
Affiliations: Unknown
Keywords: Aligning Large Language, Large Language Models, Aligning Large, Language Models, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Coling 2025

Abstract:Aligning Large Language Models (LLMs) with general human preferences has been proved crucial in improving the interaction quality between LLMs and human. However, human values are inherently diverse among different individuals, making it insufficient to align LLMs solely with general preferences. To address this, personalizing LLMs according to individual feedback emerges as a promising solution. Nonetheless, this approach presents challenges in terms of the efficiency of alignment algorithms. In this work, we introduce a flexible paradigm for individual preference alignment. Our method fundamentally improves efficiency by disentangling preference representation from text generation in LLMs. We validate our approach across multiple text generation tasks and demonstrate that it can produce aligned quality as well as or better than PEFT-based methods, while reducing additional training time for each new individual preference by 80% to 90% in comparison with them.

[NLP-20] Attributing Culture-Conditioned Generations to Pretraining Corpora

【Quick Read】: This paper investigates the cultural biases that large language models exhibit in open-ended generation tasks such as narrative writing or dialogue, in particular their limited knowledge of less prevalent cultures and their templated outputs, which may stem from uneven cultural representation in pretraining corpora. The paper analyzes how models associate entities with cultures based on pretraining data patterns, and proposes the MEMOed framework (MEMOrization from pretraining document) to determine whether a culture-conditioned generation arises from memorization. Applying MEMOed to generations about food and clothing for 110 cultures, the authors find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none. The model also favors entities with extraordinarily high pretraining frequency regardless of the conditioned culture, reflecting a bias toward frequent pretraining terms irrespective of relevance. The MEMOed framework and these findings offer a new lens on the relationship between model behavior and pretraining data and are intended to inspire further work on attributing model performance to pretraining data.

Link: https://arxiv.org/abs/2412.20760
Authors: Huihan Li,Arnav Goel,Keyu He,Xiang Ren
Affiliations: Unknown
Keywords: showing limited knowledge, open-ended generative tasks, large language models, generating templated outputs, exhibit cultural biases
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In open-ended generative tasks like narrative writing or dialogue, large language models often exhibit cultural biases, showing limited knowledge and generating templated outputs for less prevalent cultures. Recent works show that these biases may stem from uneven cultural representation in pretraining corpora. This work investigates how pretraining leads to biased culture-conditioned generations by analyzing how models associate entities with cultures based on pretraining data patterns. We propose the MEMOed framework (MEMOrization from pretraining document) to determine whether a generation for a culture arises from memorization. Using MEMOed on culture-conditioned generations about food and clothing for 110 cultures, we find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none. Additionally, the model favors generating entities with extraordinarily high frequency regardless of the conditioned culture, reflecting biases toward frequent pretraining terms irrespective of relevance. We hope that the MEMOed framework and our insights will inspire more works on attributing model performance on pretraining data.

[NLP-21] Depression and Anxiety Prediction Using Deep Language Models and Transfer Learning

【Quick Read】: This paper explores how deep language models can detect depression, anxiety, and their co-occurrence from conversational speech collected while users interact with an application. The study draws on 16,000 user interactions, with labels derived from PHQ-8 (Patient Health Questionnaire-8) and GAD-7 (Generalized Anxiety Disorder-7) results also collected by the application. The key of the solution is binary classification with deep language models, achieving AUC values between 0.79 and 0.86 depending on the condition and its co-occurrence. Performance is best when a user has either both or neither condition, and this result is shown not to be attributable to data skew. The study also finds evidence that underlying word-sequence cues may be more salient for depression than for anxiety.

Link: https://arxiv.org/abs/2412.20741
Authors: Tomasz Rutowski,Elizabeth Shriberg,Amir Harati,Yang Lu,Piotr Chlebek,Ricardo Oliveira
Affiliations: Unknown
Keywords: Digital screening, behavioral health conditions, screening and monitoring, aid providers, management of behavioral
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Digital screening and monitoring applications can aid providers in the management of behavioral health conditions. We explore deep language models for detecting depression, anxiety, and their co-occurrence from conversational speech collected during 16k user interactions with an application. Labels come from PHQ-8 and GAD-7 results also collected by the application. We find that results for binary classification range from 0.86 to 0.79 AUC, depending on condition and co-occurrence. Best performance is achieved when a user has either both or neither condition, and we show that this result is not attributable to data skew. Finally, we find evidence suggesting that underlying word sequence cues may be more salient for depression than for anxiety.

[NLP-22] HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving

【Quick Read】: This paper targets the data-sparsity problem in automatic theorem proving (ATP) and the performance of models in interactive theorem proving. The authors introduce HunyuanProver, a language model fine-tuned from Hunyuan 7B for interactive automatic theorem proving with LEAN4. The key components are: (1) a scalable framework that iteratively synthesizes data at low cost to alleviate data sparsity; and (2) guided tree-search algorithms that enable effective "system 2 thinking" of the prover and improve proving efficiency. HunyuanProver achieves state-of-the-art performance on major benchmarks, notably a 68.4% pass rate on miniF2F-test versus the previous best of 65.9%, and proves 4 IMO statements. The authors also open-source a dataset of 30k synthesized instances to benefit the community.

Link: https://arxiv.org/abs/2412.20735
Authors: Yang Li,Dong Du,Linfeng Song,Chen Li,Weikang Wang,Tao Yang,Haitao Mi
Affiliations: Unknown
Keywords: interactive automatic theorem, automatic theorem proving, language model finetuned, model finetuned, interactive automatic
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We introduce HunyuanProver, an language model finetuned from the Hunyuan 7B for interactive automatic theorem proving with LEAN4. To alleviate the data sparsity issue, we design a scalable framework to iterative synthesize data with low cost. Besides, guided tree search algorithms are designed to enable effective system 2 thinking of the prover. HunyuanProver achieves state-of-the-art (SOTA) performances on major benchmarks. Specifically, it achieves a pass of 68.4% on the miniF2F-test compared to 65.9%, the current SOTA results. It proves 4 IMO statements (imo_1960_p2, imo_1962_p2, imo_1964_p2 and imo_1983_p6) in miniF2F-test. To benefit the community, we will open-source a dataset of 30k synthesized instances, where each instance contains the original question in natural language, the converted statement by autoformalization, and the proof by HunyuanProver.

[NLP-23] ChartAdapter: Large Vision-Language Model for Chart Summarization

【Quick Read】: This paper addresses two shortcomings in chart summarization: traditional multi-stage pipelines can yield suboptimal semantic alignment between visual and textual information, and existing LLM-based methods rely too heavily on the capabilities of foundation image or language models while ignoring the characteristics of chart data and its specific challenges. The solution is ChartAdapter, a lightweight transformer module that uses learnable query vectors to extract implicit semantics from chart data and a cross-modal alignment projector to enhance vision-to-language generative learning. Integrating ChartAdapter with an LLM enables end-to-end training and efficient chart summarization. The paper further introduces a three-stage hierarchical training procedure and a large-scale chart summarization dataset of 190,618 samples. Experiments on the standard Chart-to-Text test set show that the approach significantly outperforms existing methods, including state-of-the-art models, in generating high-quality chart summaries, and ablation studies validate the effectiveness of ChartAdapter's key components.

Link: https://arxiv.org/abs/2412.20715
Authors: Peixin Xu,Yujuan Ding,Wenqi Fan
Affiliations: Unknown
Keywords: accessible data analysis, extracting key information, focuses on extracting, delivering insights, insights through effective
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL)
Comments:

Abstract:Chart summarization, which focuses on extracting key information from charts and interpreting it in natural language, is crucial for generating and delivering insights through effective and accessible data analysis. Traditional methods for chart understanding and summarization often rely on multi-stage pipelines, which may produce suboptimal semantic alignment between visual and textual information. In comparison, recently developed LLM-based methods are more dependent on the capability of foundation images or languages, while ignoring the characteristics of chart data and its relevant challenges. To address these limitations, we propose ChartAdapter, a novel lightweight transformer module designed to bridge the gap between charts and textual summaries. ChartAdapter employs learnable query vectors to extract implicit semantics from chart data and incorporates a cross-modal alignment projector to enhance vision-to-language generative learning. By integrating ChartAdapter with an LLM, we enable end-to-end training and efficient chart summarization. To further enhance the training, we introduce a three-stage hierarchical training procedure and develop a large-scale dataset specifically curated for chart summarization, comprising 190,618 samples. Experimental results on the standard Chart-to-Text testing set demonstrate that our approach significantly outperforms existing methods, including state-of-the-art models, in generating high-quality chart summaries. Ablation studies further validate the effectiveness of key components in ChartAdapter. This work highlights the potential of tailored LLM-based approaches to advance chart understanding and sets a strong foundation for future research in this area.

[NLP-24] UBER: Uncertainty-Based Evolution with Large Language Models for Automatic Heuristic Design

【Quick Read】: This paper targets the inefficiency of manually designing heuristics for complex NP-hard problems. Although work such as FunSearch has shown that large language models (LLMs) can be leveraged for heuristic design within an evolutionary algorithm (EA) framework, that potential is not fully realized due to deficiencies in exploration and exploitation. The paper presents UBER (Uncertainty-Based Evolution for Refinement), which enhances LLM+EA methods for automatic heuristic design by integrating uncertainty on top of the FunSearch framework. Its key innovations are an Uncertainty-Inclusive Evolution Process (UIEP) for adaptively balancing exploration and exploitation, and a principled Uncertainty-Inclusive Island Reset (UIIS) strategy for maintaining population diversity. Extensive experiments on challenging NP-complete problems show that UBER significantly improves over FunSearch, providing a new direction for the synergy of LLMs and EA and advancing automatic heuristic design.

Link: https://arxiv.org/abs/2412.20694
Authors: Zijie Chen,Zhanchao Zhou,Yu Lu,Renjun Xu,Lili Pan,Zhenzhong Lan
Affiliations: Unknown
Keywords: NP-hard problem-solving traditionally, problem-solving traditionally relies, manually crafting effective, crafting effective heuristics, complex problems remains
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:NP-hard problem-solving traditionally relies on heuristics, but manually crafting effective heuristics for complex problems remains challenging. While recent work like FunSearch has demonstrated that large language models (LLMs) can be leveraged for heuristic design in evolutionary algorithm (EA) frameworks, their potential is not fully realized due to its deficiency in exploitation and exploration. We present UBER (Uncertainty-Based Evolution for Refinement), a method that enhances LLM+EA methods for automatic heuristic design by integrating uncertainty on top of the FunSearch framework. UBER introduces two key innovations: an Uncertainty-Inclusive Evolution Process (UIEP) for adaptive exploration-exploitation balance, and a principled Uncertainty-Inclusive Island Reset (UIIS) strategy for maintaining population diversity. Through extensive experiments on challenging NP-complete problems, UBER demonstrates significant improvements over FunSearch. Our work provides a new direction for the synergy of LLMs and EA, advancing the field of automatic heuristic design.

[NLP-25] Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA

【Quick Read】: This paper addresses the sharp slowdown in LLM inference caused by the rapid growth of the key-value cache (KV Cache) as model size and input sequence length increase. It proposes a low-cost method for pruning multi-head attention (MHA) models into grouped-query attention (GQA) models with any compression ratio of key-value heads. The key of the method is to use L0 masks to gradually remove redundant parameters, and to apply orthogonal transformations to attention heads before pruning training to increase the similarity between heads and further improve performance. The method is compatible with rotary position embedding (RoPE), so the pruned model can be fully adapted to the mainstream GQA framework. Experiments show that up to 87.5% of the key-value heads of LLaMA2-7B can be compressed with little performance degradation, achieved through supervised fine-tuning alone.

Link: https://arxiv.org/abs/2412.20677
Authors: Qingyun Jin,Xiaohui Song,Feng Zhou,Zengchang Qin
Affiliations: Unknown
Keywords: Large language models, language processing problems, natural language processing, Large language, natural language
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures

Abstract:Large language models have been shown to perform well on a variety of natural language processing problems. However, as the model size and the input sequence’s length increase, the rapid increase of KV Cache significantly slows down inference speed. Therefore GQA model, as an alternative to MHA model, has been widely introduced into LLMs. In this work, we propose a low-cost method for pruning MHA models into GQA models with any compression ratio of key-value heads. Our method is based on \mathitL_0 masks to gradually remove redundant parameters. In addition, we apply orthogonal transformations to attention heads without changing the model to increase similarity between attention heads before pruning training, in order to further improve performance of the model. Our method can be compatible with rotary position embedding (RoPE), which means the model after training can be fully adapted to the mainstream standard GQA framework. Experiments demonstrate that our strategy can compress up to 87.5% of key-value heads of the LLaMA2-7B model without too much performance degradation, just achieved through supervised fine-tuning.
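For context, the simplest MHA-to-GQA conversion baseline is to mean-pool the key/value projection weights of the heads within each group; the paper goes further with L0 masks and orthogonal transformations, which the sketch below omits. The shapes follow a LLaMA-like layout and are assumptions for illustration.

```python
import torch

def group_kv_heads(w_k: torch.Tensor, num_heads: int, num_kv_heads: int) -> torch.Tensor:
    """Mean-pool an MHA key-projection weight into `num_kv_heads` grouped heads.

    `w_k` has shape (num_heads * head_dim, hidden); heads are merged in
    contiguous groups, the simplest GQA conversion baseline."""
    head_dim = w_k.shape[0] // num_heads
    group = num_heads // num_kv_heads
    per_head = w_k.view(num_heads, head_dim, -1)
    grouped = per_head.view(num_kv_heads, group, head_dim, -1).mean(dim=1)
    return grouped.reshape(num_kv_heads * head_dim, -1)

w_k = torch.randn(32 * 128, 4096)                             # e.g., 32 heads of dim 128
w_k_gqa = group_kv_heads(w_k, num_heads=32, num_kv_heads=4)   # 32 -> 4 KV heads (87.5% fewer)
print(w_k_gqa.shape)  # torch.Size([512, 4096])
```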

[NLP-26] Knowledge Editing for Large Language Model with Knowledge Neuronal Ensemble

【Quick Read】: This paper addresses several challenges in knowledge editing for large language models (LLMs), including parameter localization coupling, imprecise localization, and the lack of dynamic interaction across layers. It proposes a new method called Knowledge Neuronal Ensemble (KNE), in which a group of neurons encodes a specific piece of knowledge, mitigating the frequent parameter modification caused by coupling in parameter localization. KNE improves the precision and accuracy of parameter localization by computing gradient attribution scores for each parameter at each layer. During editing, only the gradients and losses associated with the knowledge neuronal ensemble are computed, with error backpropagation performed accordingly, ensuring dynamic interaction and collaborative updates among parameters. Experiments on three widely used knowledge-editing datasets show that KNE significantly improves editing accuracy and matches or exceeds the best baselines on portability and locality metrics.

Link: https://arxiv.org/abs/2412.20637
Authors: Yongchang Li,Yujin Zhu,Tao Yan,Shijian Fan,Gang Wu,Liang Xu
Affiliations: Unknown
Keywords: Knowledge Neuronal Ensemble, knowledge editing, knowledge, Knowledge Neuronal, constantly evolving
Subjects: Computation and Language (cs.CL)
Comments: 26 pages, 5 figures, 2 tables

Abstract:As real-world knowledge is constantly evolving, ensuring the timeliness and accuracy of a model’s knowledge is crucial. This has made knowledge editing in large language models increasingly important. However, existing knowledge editing methods face several challenges, including parameter localization coupling, imprecise localization, and a lack of dynamic interaction across layers. In this paper, we propose a novel knowledge editing method called Knowledge Neuronal Ensemble (KNE). A knowledge neuronal ensemble represents a group of neurons encoding specific knowledge, thus mitigating the issue of frequent parameter modification caused by coupling in parameter localization. The KNE method enhances the precision and accuracy of parameter localization by computing gradient attribution scores for each parameter at each layer. During the editing process, only the gradients and losses associated with the knowledge neuronal ensemble are computed, with error backpropagation performed accordingly, ensuring dynamic interaction and collaborative updates among parameters. Experimental results on three widely used knowledge editing datasets show that the KNE method significantly improves the accuracy of knowledge editing and achieves, or even exceeds, the performance of the best baseline methods in portability and locality metrics.

[NLP-27] NLP-based Regulatory Compliance – Using GPT 4.0 to Decode Regulatory Documents

【Quick Read】: This paper addresses the semantic complexity of regulatory documents, in particular detecting inconsistencies and contradictions within them. It evaluates GPT-4.0's ability to identify conflicts among regulatory requirements using a curated corpus with artificially injected ambiguities and contradictions, designed in collaboration with architects and compliance engineers. The key of the solution is to exploit GPT-4.0's semantic understanding, with precision, recall, and F1 score used to demonstrate its effectiveness at detecting inconsistencies and findings validated by human experts. The results highlight the potential of LLMs to enhance regulatory compliance processes, while noting that further testing with larger datasets and domain-specific fine-tuning is needed to maximize accuracy and practical applicability.

Link: https://arxiv.org/abs/2412.20602
Authors: Bimal Kumar,Dmitri Roussinov
Affiliations: Unknown
Keywords: Large Language Models, shown significant promise, Large Language, Language Models, shown significant
Subjects: Computation and Language (cs.CL)
Comments: accepted for presentation at Georg Nemetschek Institute Symposium Expo on Artificial Intelligence for the Built World - Munich, Germany. 12 Sept 2024

Abstract:Large Language Models (LLMs) such as GPT-4.0 have shown significant promise in addressing the semantic complexities of regulatory documents, particularly in detecting inconsistencies and contradictions. This study evaluates GPT-4.0’s ability to identify conflicts within regulatory requirements by analyzing a curated corpus with artificially injected ambiguities and contradictions, designed in collaboration with architects and compliance engineers. Using metrics such as precision, recall, and F1 score, the experiment demonstrates GPT-4.0’s effectiveness in detecting inconsistencies, with findings validated by human experts. The results highlight the potential of LLMs to enhance regulatory compliance processes, though further testing with larger datasets and domain-specific fine-tuning is needed to maximize accuracy and practical applicability. Future work will explore automated conflict resolution and real-world implementation through pilot projects with industry partners.
zh

[NLP-28] GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian ALT

【速读】: 该论文旨在解决爱沙尼亚语(Estonian)词形还原(lemmatization)中的歧义问题,并提升信息检索(information retrieval, IR)任务中的性能。论文提出的解决方案关键在于结合了两种技术:一是基于规则的高精度形态分析器 Vabamorf,二是基于开放词汇命名实体识别(NER)模型 GliNER 的外部歧义消除模块。通过利用预训练的 GliNER 模型的灵活性,论文成功将 Vabamorf 的词形还原准确率提升了 10%,并显著优于基于词分类的基线方法。此外,论文还通过自动翻译 DBpedia-Entity 数据集创建了爱沙尼亚语的信息检索数据集,并使用 BM25 算法对多种词归一化方法(包括词形还原)进行了基准测试。结果表明,与简单的词干提取(stemming)相比,词形还原在信息检索指标上带来了显著提升,尤其是在高 k 值设置下,词形还原的歧义消除准确率提高对信息检索的召回率(recall)产生了小而一致的改进。
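针对摘要中"用 BM25 对比词干提取与词形还原对检索指标的影响"这一评测环节,下面给出一个基于 rank_bm25 库的最小示意;recall@k 的计算方式与变量名均为示例假设,并非论文原始评测脚本。

```python
from rank_bm25 import BM25Okapi

def recall_at_k(docs_tokens, queries_tokens, relevant_ids, k=10):
    """docs_tokens: 归一化(词干化或词形还原)后的文档 token 列表;
    queries_tokens: 以同样方式归一化的查询;relevant_ids: 每个查询的相关文档下标集合。"""
    bm25 = BM25Okapi(docs_tokens)
    hits = 0
    for q_tokens, rel in zip(queries_tokens, relevant_ids):
        scores = bm25.get_scores(q_tokens)
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if rel & set(top_k):
            hits += 1
    return hits / len(queries_tokens)

# 用法示意:分别传入词干化与词形还原后的语料,比较两种归一化方式的 recall@k
# r_stem  = recall_at_k(stemmed_docs,    stemmed_queries,    relevant, k=10)
# r_lemma = recall_at_k(lemmatized_docs, lemmatized_queries, relevant, k=10)
```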

链接: https://arxiv.org/abs/2412.20597
作者: Aleksei Dorkin,Kairit Sirts
机构: 未知
关键词: open vocabulary NER, vocabulary NER model, morphological analyzer Vabamorf, match text spans, highly accurate rule-based
类目: Computation and Language (cs.CL)
备注: Accepted to NoDaLiDa/Baltic-HLT 2025

点击查看摘要

Abstract:We present GliLem – a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER – an open vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module and achieve an improvement over the token classification-based baseline. To measure the impact of improvements in lemmatization accuracy on the information retrieval downstream task, we first created an information retrieval dataset for Estonian by automatically translating the DBpedia-Entity dataset from English. We benchmark several token normalization approaches, including lemmatization, on the created dataset using the BM25 algorithm. We observe a substantial improvement in IR metrics when using lemmatization over simplistic stemming. The benefits of improving lemma disambiguation accuracy manifest in small but consistent improvement in the IR recall measure, especially in the setting of high k.
zh

[NLP-29] Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection

【速读】: 该论文旨在解决大语言模型(LLMs,如 GPT-4)在跨域(out-of-domain, OOD)任务中表现显著下降的问题,这一问题在预训练语言模型(PLMs,如 BERT)中已有先例。研究通过两个非主题分类任务(体裁分类和生成文本检测)验证了这一问题,发现当上下文学习(In-Context Learning, ICL)的示例来自一个领域(如旅行)而测试在另一个领域(如历史)时,分类性能显著下降。为解决这一问题,论文提出了一种方法,通过控制分类过程中使用的预测指标,排除主题特征,引导模型专注于风格而非内容属性。这一方法在少样本设置中将跨域性能差距缩小了多达 20 个百分点。相比之下,作为基线的链式思维(Chain-of-Thought, CoT)方法效果不足,而该方案则显著提升了跨域迁移性能。
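下面用一个小例子示意"在少样本提示中屏蔽主题性特征、引导模型只依赖文体线索"这一思路;主题词表、占位符与提示措辞均为笔者假设,论文控制预测指标的具体做法以原文为准。

```python
TOPICAL_WORDS = {"travel", "hotel", "flight", "history", "empire", "war"}  # 假设的主题词表

def mask_topical(text: str, mask_token: str = "[TOPIC]") -> str:
    """把示例文本中的主题性词汇替换为占位符,仅保留文体/风格线索。"""
    return " ".join(mask_token if w.lower().strip(".,!?") in TOPICAL_WORDS else w
                    for w in text.split())

def build_icl_prompt(demos, query):
    """demos: [(text, label)] 形式的跨域示例;构造屏蔽主题词后的少样本提示。"""
    lines = ["Classify the genre based on style, not topic."]
    for text, label in demos:
        lines.append(f"Text: {mask_topical(text)}\nGenre: {label}")
    lines.append(f"Text: {mask_topical(query)}\nGenre:")
    return "\n\n".join(lines)
```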

链接: https://arxiv.org/abs/2412.20595
作者: Dmitri Roussinov,Serge Sharoff,Nadezhda Puchnina
机构: 未知
关键词: Large Language Models, generation of Large, pre-trained Language Models, Large Language, modern generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The 31st International Conference on Computational Linguistics

点击查看摘要

Abstract:This study demonstrates that the modern generation of Large Language Models (LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs, such as BERT). We demonstrate this across two non-topical classification tasks: 1) genre classification and 2) generated text detection. Our results show that when demonstration examples for In-Context Learning (ICL) come from one domain (e.g., travel) and the system is tested on another domain (e.g., history), classification performance declines significantly. To address this, we introduce a method that controls which predictive indicators are used and which are excluded during classification. For the two tasks studied here, this ensures that topical features are omitted, while the model is guided to focus on stylistic rather than content-based attributes. This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline, prove insufficient, while our approach consistently enhances domain transfer performance.
zh

[NLP-30] Towards Neural No-Resource Language Translation: A Comparative Evaluation of Approaches

【速读】: 该论文旨在解决无资源语言(no-resource languages)的机器翻译问题,这些语言由于缺乏数字化资源,通常可用于训练的句子少于100句。与低资源语言(low-resource languages)不同,传统机器翻译方法在无资源语言场景下失效。论文通过三种不同的工作流程探索解决方案:翻译特定模型的微调(fine-tuning)、使用链式推理提示(chain-of-reasoning prompting)的大语言模型(LLMs)的上下文学习(in-context learning),以及无推理的直接提示(direct prompting)。研究以欧文斯谷派尤特语(Owens Valley Paiute)为例,表明无资源翻译需要与传统低资源翻译截然不同的方法。实验结果显示,尽管传统方法失效,但大语言模型的上下文学习能力在无资源语言翻译中表现优异,甚至超越低资源翻译方法并接近人类翻译水平(BLEU 0.45-0.6)。其中,链式推理提示在较大语料库中表现最佳,而直接提示在较小数据集上更具优势。由于这些方法具有语言无关性,它们有望广泛应用于多种无资源语言的翻译任务,且无需专家介入。这些发现确立了无资源翻译作为一个独特的范式,为语言保护提供了实践和理论上的新见解。

链接: https://arxiv.org/abs/2412.20584
作者: Madhavendra Thakur
机构: 未知
关键词: pose unique challenges, digital representation, pose unique, No-resource, Owens Valley Paiute
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:No-resource languages - those with minimal or no digital representation - pose unique challenges for machine translation (MT). Unlike low-resource languages, which rely on limited but existent corpora, no-resource languages often have fewer than 100 sentences available for training. This work explores the problem of no-resource translation through three distinct workflows: fine-tuning of translation-specific models, in-context learning with large language models (LLMs) using chain-of-reasoning prompting, and direct prompting without reasoning. Using Owens Valley Paiute as a case study, we demonstrate that no-resource translation demands fundamentally different approaches from low-resource scenarios, as traditional approaches to machine translation, such as those that work for low-resource languages, fail. Empirical results reveal that, although traditional approaches fail, the in-context learning capabilities of general-purpose large language models enable no-resource language translation that outperforms low-resource translation approaches and rivals human translations (BLEU 0.45-0.6); specifically, chain-of-reasoning prompting outperforms other methods for larger corpora, while direct prompting exhibits advantages in smaller datasets. As these approaches are language-agnostic, they have potential to be generalized to translation tasks from a wide variety of no-resource languages without expert input. These findings establish no-resource translation as a distinct paradigm requiring innovative solutions, providing practical and theoretical insights for language preservation.
zh

[NLP-31] Counterfactual Samples Constructing and Training for Commonsense Statements Estimation

【速读】: 该论文旨在解决大语言模型(LLMs)在合理性估计(Plausibility Estimation, PE)任务中存在的两个关键问题:一是缺乏语言可解释性(Language-explainable),即模型在决策时未能依赖关键词语段;二是缺乏常识敏感性(Commonsense-sensitive),即模型难以检测常识中的细微语言变化。为解决这些问题,论文提出了一种模型无关的新方法,称为常识反事实样本生成(Commonsense Counterfactual Samples Generating, CCSG)。该方法通过策略性地替换关键词语并在句子中引入低级别丢弃(low-level dropout)来生成反事实样本,并将这些样本纳入句子级对比训练框架中,从而增强模型对关键词语的关注,提升其语言可解释性和常识敏感性。实验结果表明,CCSG在九个不同数据集上显著提升了常识推理能力,相较于现有最优方法(SOTA)提升了3.07%。
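下面给出"关键词替换 + 低级别词 dropout 生成反事实样本,并以句子级对比损失训练"的简化示意;替换词表、dropout 比例与温度系数均为示例假设。

```python
import random
import torch
import torch.nn.functional as F

ANTONYMS = {"hot": "cold", "open": "closed", "up": "down"}  # 假设的关键词替换表

def make_counterfactual(tokens, dropout_p=0.1):
    """策略性替换关键词,并以低比例随机丢词,得到反事实句子。"""
    out = []
    for tok in tokens:
        if tok.lower() in ANTONYMS:
            out.append(ANTONYMS[tok.lower()])
        elif random.random() < dropout_p:
            continue  # 低级别 dropout:随机丢弃个别词
        else:
            out.append(tok)
    return out

def contrastive_loss(anchor, positive, negative, temperature=0.1):
    """句子级对比损失:anchor 靠近原句表示 positive,远离反事实表示 negative。"""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature
    neg = F.cosine_similarity(anchor, negative, dim=-1) / temperature
    logits = torch.stack([pos, neg], dim=-1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```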

链接: https://arxiv.org/abs/2412.20563
作者: Chong Liu,Zaiwen Feng,Lin Liu,Zhenyun Deng,Jiuyong Li,Ruifang Zhai,Debo Cheng,Li Qin
机构: 未知
关键词: Plausibility Estimation, plays a crucial, real world, enabling language models, crucial role
类目: Computation and Language (cs.CL)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:Plausibility Estimation (PE) plays a crucial role in enabling language models to objectively comprehend the real world. While large language models (LLMs) demonstrate remarkable capabilities in PE tasks, they sometimes produce trivial commonsense errors due to the complexity of commonsense knowledge. They lack two key traits of an ideal PE model: a) Language-explainable: relying on critical word segments for decisions, and b) Commonsense-sensitive: detecting subtle linguistic variations in commonsense. To address these issues, we propose a novel model-agnostic method, referred to as Commonsense Counterfactual Samples Generating (CCSG). By training PE models with CCSG, we encourage them to focus on critical words, thereby enhancing both their language-explainable and commonsense-sensitive capabilities. Specifically, CCSG generates counterfactual samples by strategically replacing key words and introducing low-level dropout within sentences. These counterfactual samples are then incorporated into a sentence-level contrastive training framework to further enhance the model’s learning process. Experimental results across nine diverse datasets demonstrate the effectiveness of CCSG in addressing commonsense reasoning challenges, with our CCSG method showing a 3.07% improvement over the SOTA methods.
zh

[NLP-32] The Impact of Prompt Programming on Function-Level Code Generation

【速读】: 该论文旨在解决大语言模型(LLMs)在代码生成过程中存在的局限性,特别是生成无关或错误代码的问题。为了提升生成代码的正确性、相似性和质量,论文探讨了不同提示技术(prompt techniques)及其组合对代码生成的影响。研究引入了CodePromptEval数据集,包含7072个提示,用于评估五种提示技术(少样本学习、角色设定、思维链、函数签名、包列表)对三种LLMs(GPT-4o、Llama3和Mistral)生成代码的影响。研究结果表明,某些提示技术显著影响生成代码的质量,但组合多种技术并不一定带来更好的结果,且在正确性和质量之间存在权衡。该数据集和复制包为未来改进LLM生成代码和评估新提示技术提供了基础。
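下面用一个小函数示意如何按开关组合摘要中提到的五种提示技术(few-shot、persona、chain-of-thought、函数签名、包列表);字段名与措辞均为示例假设,并非 CodePromptEval 的原始模板。

```python
def build_code_prompt(task, *, few_shot=None, persona=False,
                      chain_of_thought=False, signature=None, packages=None):
    """按需拼接 persona、few-shot 示例、CoT 指令、函数签名与依赖包列表。"""
    parts = []
    if persona:
        parts.append("You are an experienced Python developer.")
    if few_shot:
        parts += [f"Example:\n{ex}" for ex in few_shot]
    parts.append(f"Task: {task}")
    if signature:
        parts.append(f"Implement exactly this signature:\n{signature}")
    if packages:
        parts.append("You may use these packages: " + ", ".join(packages))
    if chain_of_thought:
        parts.append("Think step by step before writing the final code.")
    return "\n\n".join(parts)

prompt = build_code_prompt(
    "Return the n-th Fibonacci number.",
    persona=True,
    signature="def fib(n: int) -> int:",
    chain_of_thought=True,
)
```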

链接: https://arxiv.org/abs/2412.20545
作者: Ranim Khojah,Francisco Gomes de Oliveira Neto,Mazen Mohamad,Philipp Leitner
机构: 未知
关键词: Large Language Models, Large Language, Language Models, prompt techniques, prompt
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: CodePromptEval dataset and replication package on GitHub: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. Despite this, the impact of different prompt techniques – and their combinations – on code generation remains underexplored. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.
zh

[NLP-33] SAFE-MEME: Structured Reasoning Framework for Robust Hate Speech Detection in Memes

【速读】: 该论文旨在解决多模态模因(multimodal memes)中仇恨言论检测的挑战,特别是由于模因的隐晦性和对上下文知识的依赖,使得现有方法在检测细粒度的仇恨类别时表现不佳。为此,论文提出了两个新的多模态仇恨言论数据集(MHS和MHS-Con),分别捕捉常规和混淆场景中的细粒度仇恨抽象。关键解决方案是引入了SAFE-MEME(Structured reAsoning FramEwork),这是一个基于链式思维(Chain-of-Thought)的多模态框架,包含问答式推理(SAFE-MEME-QA)和分层分类(SAFE-MEME-H)两种方法,以增强对仇恨模因的检测能力。实验表明,SAFE-MEME-QA在MHS和MHS-Con数据集上分别实现了约5%和4%的平均性能提升,而SAFE-MEME-H在MHS上实现了6%的提升,并在MHS-Con上优于多模态基线模型。此外,论文还探讨了单层适配器微调与全模型微调在不同场景下的有效性,并系统分析了错误案例,为框架的鲁棒性和局限性提供了深入见解。

链接: https://arxiv.org/abs/2412.20541
作者: Palash Nandi,Shivam Sharma,Tanmoy Chakraborty
机构: 未知
关键词: sharing sensitive ideas, requiring contextual knowledge, sensitive ideas, knowledge to interpret, act as cryptic
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 28 pages, 15 figures, 6 tables

点击查看摘要

Abstract:Memes act as cryptic tools for sharing sensitive ideas, often requiring contextual knowledge to interpret. This makes moderating multimodal memes challenging, as existing works either lack high-quality datasets on nuanced hate categories or rely on low-quality social media visuals. Here, we curate two novel multimodal hate speech datasets, MHS and MHS-Con, that capture fine-grained hateful abstractions in regular and confounding scenarios, respectively. We benchmark these datasets against several competing baselines. Furthermore, we introduce SAFE-MEME (Structured reAsoning FramEwork), a novel multimodal Chain-of-Thought-based framework employing QA-style reasoning (SAFE-MEME-QA) and hierarchical categorization (SAFE-MEME-H) to enable robust hate speech detection in memes. SAFE-MEME-QA outperforms existing baselines, achieving an average improvement of approximately 5% and 4% on MHS and MHS-Con, respectively. In comparison, SAFE-MEME-H achieves an average improvement of 6% in MHS while outperforming only multimodal baselines in MHS-Con. We show that fine-tuning a single-layer adapter within SAFE-MEME-H outperforms fully fine-tuned models in regular fine-grained hateful meme detection. However, the fully fine-tuning approach with a QA setup is more effective for handling confounding cases. We also systematically examine the error cases, offering valuable insights into the robustness and limitations of the proposed structured reasoning framework for analyzing hateful memes.
zh

[NLP-34] ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding

【速读】: 该论文旨在解决现有视频大语言模型(VideoLLMs)在处理长视频序列时的局限性,特别是由于继承自其主干大语言模型(LLMs)的长序列处理能力不足所导致的挑战。现有方法通常通过均匀采样视频帧或压缩视觉标记(visual tokens)来应对这一问题,但这些方法主要关注低层次的时间视觉冗余,而忽视了高层次的知识冗余,从而限制了压缩率并导致性能损失。为此,论文提出了一种无需训练的方法ReTaKe,包含两个关键模块:DPSelect和PivotKV。DPSelect通过视觉特征识别具有局部最大峰值距离的关键帧,这些关键帧与人类视频感知高度一致;PivotKV则利用这些关键帧作为枢轴,对非枢轴标记进行KV-Cache压缩,这些非枢轴标记的注意力分数较低,源自LLMs的先验知识。实验结果表明,ReTaKe能够支持4倍长的视频序列,且性能损失仅为1%,在多个基准测试中表现优于同类规模的VideoLLMs,甚至与更大规模的模型相当或超越。
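下面示意 DPSelect"按视觉特征的局部峰值距离选取关键帧"的基本思路;距离度量、峰值判定与最小间隔均为假设,并非论文原始算法。

```python
import numpy as np

def select_keyframes(frame_features: np.ndarray, min_gap: int = 4):
    """frame_features: (T, D) 逐帧视觉特征。
    计算相邻帧特征距离,取局部极大值作为候选关键帧,并以 min_gap 控制最小间隔。"""
    diffs = np.linalg.norm(np.diff(frame_features, axis=0), axis=1)  # 相邻帧距离
    keyframes, last = [0], 0
    for t in range(1, len(diffs) - 1):
        is_local_peak = diffs[t] > diffs[t - 1] and diffs[t] > diffs[t + 1]
        if is_local_peak and t - last >= min_gap:
            keyframes.append(t)
            last = t
    return keyframes

# 选出的关键帧可作为 PivotKV 的枢轴,非枢轴 token 再按注意力分数做 KV-Cache 压缩
```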

链接: https://arxiv.org/abs/2412.20504
作者: Xiao Wang,Qingyi Si,Jianlong Wu,Shiyu Zhu,Li Cao,Liqiang Nie
机构: 未知
关键词: Large Language Models, Video Large Language, Large Language, achieved remarkable progress, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Video Large Language Models (VideoLLMs) have achieved remarkable progress in video understanding. However, existing VideoLLMs often inherit the limitations of their backbone LLMs in handling long sequences, leading to challenges for long video understanding. Common solutions either simply uniformly sample videos' frames or compress visual tokens, which focus primarily on low-level temporal visual redundancy, overlooking high-level knowledge redundancy. This limits the achievable compression rate with minimal loss. To this end, we introduce a training-free method, ReTaKe, containing two novel modules, DPSelect and PivotKV, to jointly model and reduce both temporal visual redundancy and knowledge redundancy for long video understanding. Specifically, DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception. PivotKV employs the obtained keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores, which are derived from the learned prior knowledge of LLMs. Experiments on the benchmarks VideoMME, MLVU, and LVBench show that ReTaKe can support 4x longer video sequences with minimal performance loss (1%) and outperform all similar-size VideoLLMs by 3%-5%, even surpassing or matching much larger ones. Our code is available at this https URL
zh

[NLP-35] Cut the Deadwood Out: Post-Training Model Purification with Selective Module Substitution

【速读】: 该论文旨在解决深度神经网络(DNNs)在训练过程中因使用大规模公开数据集而面临的数据中毒攻击(data poisoning attacks)问题,特别是针对后门攻击(backdoor attacks)的防御。现有的自然语言处理(NLP)后门防御方法主要依赖于识别和移除中毒样本,但这些方法通常需要昂贵的重新训练过程。为此,论文提出了一种名为贪婪模块替换(Greedy Module Substitution, GMS)的新方法,通过识别并替换后门模型中的“死木”模块(即对后门路径至关重要的组件)来净化模型。GMS方法的关键在于减少了对干净数据集或干净辅助模型的依赖,从而在无需重新训练的情况下有效降低后门攻击的成功率。实验表明,GMS在RoBERTa-large模型上表现出色,特别是在应对LWS等具有挑战性的攻击时,净化后的攻击成功率(ASR)降至9.7%,显著优于现有基线方法的58.8%。
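下面是贪婪模块替换核心循环的一个示意性伪实现:每一步尝试用参考(donor)模型的同名模块替换当前模型中的一个模块,保留使攻击成功率(ASR)下降最多的替换。donor 模型的来源、模块粒度与评估函数均为示例假设,具体以论文为准。

```python
import copy

def greedy_module_substitution(backdoored, donor, module_names, eval_asr, max_swaps=5):
    """backdoored / donor: 结构相同的两个模型;module_names: 候选模块名列表;
    eval_asr(model) -> float: 在带触发器的样本上评估攻击成功率(越低越好)。"""
    model = copy.deepcopy(backdoored)
    best_asr = eval_asr(model)
    remaining = list(module_names)
    for _ in range(max_swaps):
        best_name, best_candidate = None, None
        for name in remaining:
            candidate = copy.deepcopy(model)
            # 用 donor 模型的同名模块替换,相当于移除一段"死木"模块
            candidate.get_submodule(name).load_state_dict(
                donor.get_submodule(name).state_dict())
            asr = eval_asr(candidate)
            if asr < best_asr:
                best_asr, best_name, best_candidate = asr, name, candidate
        if best_candidate is None:
            break  # 没有进一步改进即停止
        model = best_candidate
        remaining.remove(best_name)
    return model, best_asr
```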

链接: https://arxiv.org/abs/2412.20476
作者: Yao Tong,Weijun Li,Xuanli He,Haolan Zhan,Qiongkai Xu
机构: 未知
关键词: DNNs often depends, depends on training, training with large-scale, Greedy Module Substitution, Abstract
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: preprint

点击查看摘要

Abstract:The success of DNNs often depends on training with large-scale datasets, but building such datasets is both expensive and challenging. Consequently, public datasets from open-source platforms like HuggingFace have become popular, posing significant risks of data poisoning attacks. Existing backdoor defenses in NLP primarily focus on identifying and removing poisoned samples; however, purifying a backdoored model with these sample-cleaning approaches typically requires expensive retraining. Therefore, we propose Greedy Module Substitution (GMS), which identifies and substitutes ‘‘deadwood’’ modules (i.e., components critical to backdoor pathways) in a backdoored model to purify it. Our method relaxes the common dependency of prior model purification methods on clean datasets or clean auxiliary models. When applied to RoBERTa-large under backdoor attacks, GMS demonstrates strong effectiveness across various settings, particularly against widely recognized challenging attacks like LWS, achieving a post-purification attack success rate (ASR) of 9.7% on SST-2 compared to 58.8% for the best baseline approach.
zh

[NLP-36] Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding

【速读】: 该论文旨在提升空管(ATC)领域中基于机器学习的辅助系统在边缘情况下的鲁棒性,特别是在高词错误率(WER)转录或部分转录等复杂场景下的呼号识别与理解(CRU)任务。为解决这一问题,论文提出了多模态呼号-指令恢复模型(CCR),该架构通过优化边缘情况下的性能,显著提升了CRU任务的准确性,最高可达15%。此外,论文还介绍了CallSBERT架构,该模型参数较少,微调速度更快,且在微调过程中表现出更高的鲁棒性。关键解决方案在于通过多模态方法和边缘情况优化,显著提升了系统在广泛操作范围内的准确性。

链接: https://arxiv.org/abs/2412.20467
作者: Alexander Blatt,Dietrich Klakow
机构: 未知
关键词: machine-learning based assistant, based assistant systems, Operational machine-learning based, machine-learning based, based assistant
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Operational machine-learning based assistant systems must be robust in a wide range of scenarios. This holds especially true for the air-traffic control (ATC) domain. The robustness of an architecture is particularly evident in edge cases, such as high word error rate (WER) transcripts resulting from noisy ATC recordings or partial transcripts due to clipped recordings. To increase the edge-case robustness of call-sign recognition and understanding (CRU), a core task in ATC speech processing, we propose the multimodal call-sign-command recovery model (CCR). The CCR architecture leads to an increase in edge-case performance of up to 15%. We demonstrate this on our second proposed architecture, CallSBERT, a CRU model that has fewer parameters, can be fine-tuned noticeably faster, and is more robust during fine-tuning than the state of the art for CRU. Furthermore, we demonstrate that optimizing for edge cases leads to a significantly higher accuracy across a wide operational range.
zh

[NLP-37] Enhancing Entertainment Translation for Indian Languages using Adaptive Context Style and LLMs AAAI'25

【速读】: 该论文旨在解决娱乐领域中的神经机器翻译(Neural Machine Translation, NMT)问题,特别是针对对话内容的自动翻译任务,以应用于自动配音、字幕生成及其他内容本地化任务,从而扩大源语言内容的受众范围。传统NMT系统通常孤立地翻译单个句子,缺乏对上下文和风格等关键要素的知识传递。本文强调了这些基本要素在生成相关且引人入胜的翻译中的重要性,并提出了一种新颖的娱乐翻译框架,该框架首次引入了上下文和风格的估计算法。通过估计当前会话的上下文和风格,生成提示(prompt)以指导大语言模型(Large Language Model, LLM)生成高质量的翻译。该方法具有语言和LLM无关性,是一种通用工具。实验结果表明,该方法在COMET评分上显著优于多种先进的LLM,并在胜率(win-ratio)上持续超越基线LLM。

链接: https://arxiv.org/abs/2412.20440
作者: Pratik Rakesh Singh,Mohammadi Zaki,Pankaj Wasnik
机构: 未知
关键词: neural machine translation, enabling source content, content localization tasks, source language content, address the challenging
类目: Computation and Language (cs.CL)
备注: Accepted to AAAI’25

点击查看摘要

Abstract:We address the challenging task of neural machine translation (NMT) in the entertainment domain, where the objective is to automatically translate a given dialogue from a source language content to a target language. This task has various applications, particularly in automatic dubbing, subtitling, and other content localization tasks, enabling source content to reach a wider audience. Traditional NMT systems typically translate individual sentences in isolation, without facilitating knowledge transfer of crucial elements such as the context and style from previously encountered sentences. In this work, we emphasize the significance of these fundamental aspects in producing pertinent and captivating translations. We demonstrate their significance through several examples and propose a novel framework for entertainment translation, which, to our knowledge, is the first of its kind. Furthermore, we introduce an algorithm to estimate the context and style of the current session and use these estimations to generate a prompt that guides a Large Language Model (LLM) to generate high-quality translations. Our method is both language and LLM-agnostic, making it a general-purpose tool. We demonstrate the effectiveness of our algorithm through various numerical studies and observe significant improvement in the COMET scores over various state-of-the-art LLMs. Moreover, our proposed method consistently outperforms baseline LLMs in terms of win-ratio.
zh

[NLP-38] Integrating Natural Language Processing Techniques of Text Mining Into Financial System: Applications and Limitations

【速读】: 该论文旨在探讨自然语言处理(NLP)技术在金融系统中的广泛应用及其面临的挑战,特别是在资产定价、公司金融、衍生品、风险管理和公共金融等领域的应用。通过回顾2018年至2023年的相关研究,论文指出大多数研究结合了概率模型与向量空间模型,并将文本数据与数值数据相结合。信息分类技术是最常用的信息处理技术,而长短期记忆(LSTM)和双向编码器模型是最常用的算法。论文还提出了从工程角度为研究人员分析金融文本的路径,并强调了解决数据质量、上下文适应性和模型可解释性等挑战的重要性,以推动先进NLP技术在金融分析和预测中的集成。

链接: https://arxiv.org/abs/2412.20438
作者: Denisa Millo,Blerina Vika,Nevila Baci
机构: 未知
关键词: natural language processing, natural language, language processing, language processing techniques, financial system
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 6 pages, 5 figures, 1 table

点击查看摘要

Abstract:The financial sector, a pivotal force in economic development, increasingly uses intelligent technologies such as natural language processing to enhance data processing and insight extraction. Through a review of work published between 2018 and 2023, this paper explores the use of text mining and natural language processing techniques in various components of the financial system, including asset pricing, corporate finance, derivatives, risk management, and public finance, and highlights the specific problems that need to be addressed in the discussion section. We notice that most of the research materials combined probabilistic with vector-space models, and text data with numerical data. The most used information processing technique is information classification, and the most used algorithms include long short-term memory and bidirectional encoder models. The review also finds that new task-specific algorithms are being developed and that the focus within the financial system is mainly on the asset pricing component. The paper further proposes a path, from an engineering perspective, for researchers who need to analyze financial text. Challenges in text mining such as data quality, context adaptation, and model interpretability need to be solved in order to integrate advanced natural language processing models and techniques into financial analysis and prediction. Keywords: Financial System (FS), Natural Language Processing (NLP), Software and Text Engineering, Probabilistic, Vector-Space Models, Techniques, Text Data, Financial Analysis.
zh

[NLP-39] Comparative Performance of Advanced NLP Models and LLMs in Multilingual Geo-Entity Detection

【速读】: 该论文旨在解决多语言文本中地理实体(geo-entity)检测的准确性问题,特别是在涉及国家安全和国际安全的应用场景中。通过整合先进的自然语言处理(NLP)方法和大型语言模型(LLMs),论文评估了多种模型(包括SpaCy、XLM-RoBERTa、mLUKE、GeoLM以及OpenAI的GPT 3.5和GPT 4)在多语言地理实体检测任务中的表现。解决方案的关键在于利用来自Telegram频道的英语、俄语和阿拉伯语数据集,通过准确率(accuracy)、精确率(precision)、召回率(recall)和F1分数等指标,系统评估这些模型在不同语言环境下的性能。研究揭示了各模型在跨语言地理实体识别中的优势和挑战,为开发更先进和包容的NLP工具提供了方向,从而推动地理空间分析及其在全球安全中的应用。

链接: https://arxiv.org/abs/2412.20414
作者: Kalin Kopanov
机构: 未知
关键词: Natural Language Processing, Large Language Models, advanced Natural Language, Language Processing, Natural Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 6 pages, 1 table, AICCONF '24: Cognitive Models and Artificial Intelligence Conference, Istanbul, Turkey

点击查看摘要

Abstract:The integration of advanced Natural Language Processing (NLP) methodologies and Large Language Models (LLMs) has significantly enhanced the extraction and analysis of geospatial data from multilingual texts, impacting sectors such as national and international security. This paper presents a comprehensive evaluation of leading NLP models – SpaCy, XLM-RoBERTa, mLUKE, GeoLM – and LLMs, specifically OpenAI’s GPT 3.5 and GPT 4, within the context of multilingual geo-entity detection. Utilizing datasets from Telegram channels in English, Russian, and Arabic, we examine the performance of these models through metrics such as accuracy, precision, recall, and F1 scores, to assess their effectiveness in accurately identifying geospatial references. The analysis exposes each model’s distinct advantages and challenges, underscoring the complexities involved in achieving precise geo-entity identification across varied linguistic landscapes. The conclusions drawn from this experiment aim to direct the enhancement and creation of more advanced and inclusive NLP tools, thus advancing the field of geospatial analysis and its application to global security.
zh

[NLP-40] Multi-Objective Large Language Model Unlearning

【速读】: 该论文致力于解决在大语言模型(LLMs)中实现机器遗忘(Machine Unlearning)的问题,旨在有效消除LLMs中的不良行为,而无需从头进行完整的重新训练。论文探讨了梯度上升(Gradient Ascent, GA)方法在LLM遗忘中的应用,该方法通过主动降低模型在目标数据上的预测概率来消除其影响。然而,论文指出该方法面临两个主要挑战:梯度爆炸(gradient explosion)和灾难性遗忘(catastrophic forgetting)。为解决这些问题,论文提出了多目标大语言模型遗忘算法(Multi-Objective Large Language Model Unlearning, MOLLM)。该算法的关键是将LLM遗忘问题形式化为多目标优化问题,通过修改交叉熵损失函数为遗忘版本以克服梯度爆炸问题,并计算一个共同的下降更新方向,使模型在遗忘目标数据的同时保持其效用。实验结果表明,MOLLM在遗忘效果和模型效用保持方面优于现有的基于GA的LLM遗忘方法。
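摘要提到要"计算一个共同的下降方向,兼顾遗忘目标与保持效用"。下面给出双目标情形下经典 MGDA 闭式解的示意实现,用最小范数凸组合近似这一共同方向;它只用于说明思路,未必与论文的具体算法一致。

```python
import torch

def common_descent_direction(g_forget: torch.Tensor, g_utility: torch.Tensor):
    """g_forget / g_utility: 遗忘损失与效用保持损失的梯度(已展平为一维)。
    在 d = a*g_forget + (1-a)*g_utility (0 <= a <= 1) 中取范数最小的组合,
    即双目标 MGDA 的闭式解,得到兼顾两个目标的共同更新方向。"""
    diff = g_forget - g_utility
    denom = diff.dot(diff).clamp_min(1e-12)
    alpha = ((g_utility - g_forget).dot(g_utility) / denom).clamp(0.0, 1.0)
    return alpha * g_forget + (1.0 - alpha) * g_utility

# 用法示意:分别对"遗忘版交叉熵"与效用损失求梯度并展平,再代入上式得到更新方向
```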

链接: https://arxiv.org/abs/2412.20412
作者: Zibin Pan,Shuwen Zhang,Yuesheng Zheng,Chi Li,Yuheng Cheng,Junhua Zhao
机构: 未知
关键词: great attention recently, attracted great attention, effectively eliminate undesirable, eliminate undesirable behaviors, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine unlearning in the domain of large language models (LLMs) has attracted great attention recently, which aims to effectively eliminate undesirable behaviors from LLMs without full retraining from scratch. In this paper, we explore the Gradient Ascent (GA) approach in LLM unlearning, which is a proactive way to decrease the prediction probability of the model on the target data in order to remove their influence. We analyze two challenges that render the process impractical: gradient explosion and catastrophic forgetting. To address these issues, we propose Multi-Objective Large Language Model Unlearning (MOLLM) algorithm. We first formulate LLM unlearning as a multi-objective optimization problem, in which the cross-entropy loss is modified to the unlearning version to overcome the gradient explosion issue. A common descent update direction is then calculated, which enables the model to forget the target data while preserving the utility of the LLM. Our empirical results verify that MoLLM outperforms the SOTA GA-based LLM unlearning methods in terms of unlearning effect and model utility preservation.
zh

[NLP-41] A Multidisciplinary Approach to Telegram Data Analysis

【速读】: 该论文旨在解决通过Telegram平台获取的早期网络威胁预警信息的有效分析问题。随着黑客活动团体(hacktivist groups)越来越多地利用Telegram传播未来网络攻击信息或炫耀成功攻击,亟需高效的数据分析方法来应对这一挑战。主要问题在于Telegram上频道数量庞大且数据量巨大,传统方法难以从噪声中识别出相关风险。为此,论文提出了一种多学科方法,结合神经网络架构(neural network architectures)和传统机器学习算法(traditional machine learning algorithms),对Telegram数据进行分类和潜在网络威胁的识别。此外,研究还引入了情感分析(sentiment analysis)和实体识别(entity recognition)技术,以深入理解信息的内容和背景。通过评估这些方法在检测和分类网络威胁中的有效性,研究旨在提升早期预警系统的能力,从而更主动地应对潜在的安全漏洞。该研究为在日益互联的数字环境中加强网络安全措施提供了新的思路和方法。

链接: https://arxiv.org/abs/2412.20406
作者: Velizar Varbanov,Kalin Kopanov,Tatiana Atanasova
机构: 未知
关键词: paper presents, presents a multidisciplinary, multidisciplinary approach, approach to analyzing, groups utilizing Telegram
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 7 pages, 1 table, 2 figures, 24th International Multidisciplinary Scientific GeoConference SGEM 2024

点击查看摘要

Abstract:This paper presents a multidisciplinary approach to analyzing data from Telegram for early warning information regarding cyber threats. With the proliferation of hacktivist groups utilizing Telegram to disseminate information regarding future cyberattacks or to boast about successful ones, the need for effective data analysis methods is paramount. The primary challenge lies in the vast number of channels and the overwhelming volume of data, necessitating advanced techniques for discerning pertinent risks amidst the noise. To address this challenge, we employ a combination of neural network architectures and traditional machine learning algorithms. These methods are utilized to classify and identify potential cyber threats within the Telegram data. Additionally, sentiment analysis and entity recognition techniques are incorporated to provide deeper insights into the nature and context of the communicated information. The study evaluates the effectiveness of each method in detecting and categorizing cyber threats, comparing their performance and identifying areas for improvement. By leveraging these diverse analytical tools, we aim to enhance early warning systems for cyber threats, enabling more proactive responses to potential security breaches. This research contributes to the ongoing efforts to bolster cybersecurity measures in an increasingly interconnected digital landscape.
zh

[NLP-42] Natural Language Fine-Tuning

【速读】: 该论文旨在解决在特定领域数据有限的情况下,大语言模型(Large Language Model, LLM)微调技术面临的挑战。现有微调方法通常依赖于大量标注数据、外部指导和反馈(如人类对齐、标量奖励和示范),但在实际应用中,特定知识的稀缺性对现有微调技术提出了前所未有的难题。为此,论文提出了一种名为自然语言微调(Natural Language Fine-Tuning, NLFT)的新方法,首次利用自然语言进行微调。NLFT通过利用目标语言模型的强大语言理解能力,将自然语言指导附加到令牌级输出上,并通过计算概率识别显著性令牌。由于NLFT有效利用了语言信息,该方法显著降低了训练成本,提升了训练效率,在准确性、时间节省和资源节约方面全面优于强化微调算法。此外,NLFT可以被视为对监督微调(Supervised Fine-Tuning, SFT)的令牌级细粒度优化,从而高效替代SFT过程,且无需预热(与ReFT相比,后者需要多轮SFT预热)。与SFT相比,NLFT并未增加算法复杂度,保持O(n)的时间复杂度。在GSM8K数据集上的大量实验表明,NLFT仅使用50个数据实例,其准确性提升超过SFT的219%,且时间复杂度和空间复杂度分别比ReFT降低了78.27%和92.24%。NLFT的优越技术为在网络边缘资源有限的情况下部署各种创新LLM微调应用铺平了道路。
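下面示意"利用预测概率识别显著性 token,并只在这些位置加权计算损失"的基本做法;显著性判定的阈值与加权方式均为笔者假设,并非论文原始公式。

```python
import torch
import torch.nn.functional as F

def saliency_weighted_loss(logits, labels, low=0.2, high=0.8, weight=2.0):
    """logits: (B, T, V);labels: (B, T)。
    以模型对正确 token 的预测概率为依据:概率落在中间区间的 token 视为显著性 token,
    赋予更高权重,其余 token 权重为 1(仅为示意性的判定规则)。"""
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    token_p = token_logp.exp()
    saliency = ((token_p > low) & (token_p < high)).float()
    weights = 1.0 + (weight - 1.0) * saliency
    return -(token_logp * weights).mean()
```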

链接: https://arxiv.org/abs/2412.20382
作者: Jia Liu,Yue Wang,Zhiqi Lin,Min Chen,Yixue Hao,Long Hu
机构: 未知
关键词: Large language model, techniques typically depend, Large language, language model fine-tuning, model fine-tuning techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model fine-tuning techniques typically depend on extensive labeled data, external guidance, and feedback, such as human alignment, scalar rewards, and demonstration. However, in practical application, the scarcity of specific knowledge poses unprecedented challenges to existing fine-tuning techniques. In this paper, focusing on fine-tuning tasks in specific domains with limited data, we introduce Natural Language Fine-Tuning (NLFT), which utilizes natural language for fine-tuning for the first time. By leveraging the strong language comprehension capability of the target LM, NLFT attaches the guidance of natural language to the token-level outputs. Then, saliency tokens are identified with calculated probabilities. Since linguistic information is effectively utilized in NLFT, our proposed method significantly reduces training costs. It markedly enhances training efficiency, comprehensively outperforming reinforcement fine-tuning algorithms in accuracy, time-saving, and resource conservation. Additionally, on the macro level, NLFT can be viewed as a token-level fine-grained optimization of SFT, thereby efficiently replacing the SFT process without the need for warm-up (as opposed to ReFT requiring multiple rounds of warm-up with SFT). Compared to SFT, NLFT does not increase the algorithmic complexity, maintaining O(n). Extensive experiments on the GSM8K dataset demonstrate that NLFT, with only 50 data instances, achieves an accuracy increase that exceeds SFT by 219%. Compared to ReFT, the time complexity and space complexity of NLFT are reduced by 78.27% and 92.24%, respectively. The superior technique of NLFT is paving the way for the deployment of various innovative LLM fine-tuning applications when resources are limited at network edges. Our code has been released at this https URL.
zh

[NLP-43] LLM2: Let Large Language Models Harness System 2 Reasoning

【速读】: 该论文旨在解决大语言模型(LLMs)在生成输出时偶尔产生不良结果的问题,其根源在于LLMs的自回归架构缺乏区分期望与非期望输出的机制。为解决这一问题,论文提出了LLM2框架,其关键创新在于结合了LLM(System 1)和基于过程的验证器(System 2)。LLM负责生成候选输出,而验证器则通过基于过程的反馈机制来区分期望与非期望结果。验证器通过成对比较损失(pairwise comparison loss)在合成的过程监督数据上进行训练,这些数据通过令牌质量探索策略生成。实验结果表明,LLM2在数学推理基准测试中显著提升了模型性能,例如在GSM8K数据集上,Llama3-1B的准确率从50.3提升至57.8。此外,结合自一致性(self-consistency)方法后,LLM2进一步将major@20准确率从56.2提升至70.2。
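验证器(System 2)以成对比较损失训练。下面给出 Bradley-Terry 风格成对损失的最小示意,其中 verifier 的打分接口为假设:

```python
import torch
import torch.nn.functional as F

def pairwise_comparison_loss(score_desirable: torch.Tensor,
                             score_undesirable: torch.Tensor) -> torch.Tensor:
    """score_*: 验证器对一对候选过程步骤打出的标量分数,形状 (B,)。
    最小化 -log sigmoid(s_pos - s_neg),鼓励期望输出的得分高于非期望输出。"""
    return -F.logsigmoid(score_desirable - score_undesirable).mean()

# 用法示意(verifier 为假设的打分接口):
# s_pos = verifier(prompt, desirable_step)
# s_neg = verifier(prompt, undesirable_step)
# loss = pairwise_comparison_loss(s_pos, s_neg)
```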

链接: https://arxiv.org/abs/2412.20372
作者: Cheng Yang,Chufan Shi,Siheng Li,Bo Shui,Yujiu Yang,Wai Lam
机构: 未知
关键词: Large language models, exhibited impressive capabilities, Large language, occasionally yield undesirable, yield undesirable outputs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We posit that these limitations are rooted in the foundational autoregressive architecture of LLMs, which inherently lacks mechanisms for differentiating between desirable and undesirable results. Drawing inspiration from the dual-process theory of human cognition, we introduce LLM2, a novel framework that combines an LLM (System 1) with a process-based verifier (System 2). Within LLM2, the LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs. The verifier is trained with a pairwise comparison loss on synthetic process-supervision data generated through our token quality exploration strategy. Empirical results on mathematical reasoning benchmarks substantiate the efficacy of LLM2, exemplified by an accuracy enhancement from 50.3 to 57.8 (+7.5) for Llama3-1B on GSM8K. Furthermore, when combined with self-consistency, LLM2 achieves additional improvements, boosting major@20 accuracy from 56.2 to 70.2 (+14.0).
zh

[NLP-44] Enhancing Code LLMs with Reinforcement Learning in Code Generation

【速读】: 该论文旨在系统地探讨强化学习(Reinforcement Learning, RL)在代码生成和优化中的应用,特别是在编译器优化、资源分配以及框架和工具开发中的关键作用。论文通过深入分析RL在编译器优化中的复杂过程,展示了如何利用RL算法提高效率和资源利用率。此外,论文还强调了RL在资源分配(如寄存器分配和系统优化)中的功能,并探讨了RL在代码生成框架和工具中的集成,以增强其能力。解决方案的关键在于通过RL技术优化代码生成和编译过程,从而提升整体性能和资源管理效率。

链接: https://arxiv.org/abs/2412.20367
作者: Junqiao Wang,Zeng Zhang,Yangfan He,Yuyang Song,Tianyu Shi,Yuchen Li,Hengyuan Xu,Kunyu Wu,Guangwu Qian,Qiuwu Chen,Lewei He
机构: 未知
关键词: large language models, reinforcement learning, language models, rapid evolution, evolution of large
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid evolution of large language models (LLM), reinforcement learning (RL) has emerged as a pivotal technique for code generation and optimization in various domains. This paper presents a systematic survey of the application of RL in code optimization and generation, highlighting its role in enhancing compiler optimization, resource allocation, and the development of frameworks and tools. Subsequent sections first delve into the intricate processes of compiler optimization, where RL algorithms are leveraged to improve efficiency and resource utilization. The discussion then progresses to the function of RL in resource allocation, emphasizing register allocation and system optimization. We also explore the burgeoning role of frameworks and tools in code generation, examining how RL can be integrated to bolster their capabilities. This survey aims to serve as a comprehensive resource for researchers and practitioners interested in harnessing the power of RL to advance code generation and optimization techniques.
zh

[NLP-45] HindiLLM: Large Language Model for Hindi

【速读】: 该论文旨在解决印地语(Hindi)及其他印度语言在大型语言模型(LLM)领域缺乏高性能模型的问题。尽管LLM在语言处理方面取得了显著进展,但现有研究主要集中在英语上,而印地语等语言的模型开发相对滞后。为此,作者提出了两种自回归LLM模型,即HindiLLM-Small和HindiLLM-Medium,并通过两阶段训练过程进行开发:首先进行无监督预训练,利用大规模高质量文本语料库生成基础模型;随后进行有监督微调,针对情感分析、文本分类、自然语言推理和多选题问答等任务进行优化。关键解决方案包括构建高质量的预训练语料库、开发专用的印地语分词器(HindiLLM tokenizer),以及通过微调提升模型在实际任务中的性能。实验结果表明,基于HindiLLM的微调模型在多项语言任务中表现优异,超越了现有模型。
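摘要中的第一步是在预训练语料上训练 Byte-Pair Encoding 分词器(HindiLLM tokenizer)。下面用 Hugging Face tokenizers 库给出一个最小示意;词表大小、特殊符号与文件名均为示例假设。

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe_tokenizer(corpus_files, vocab_size=32000):
    """corpus_files: 印地语纯文本文件路径列表;返回训练好的 BPE 分词器。"""
    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    )
    tokenizer.train(corpus_files, trainer)
    return tokenizer

# tok = train_bpe_tokenizer(["hindi_corpus.txt"])  # 假设的语料文件
# tok.save("hindillm_tokenizer.json")
```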

链接: https://arxiv.org/abs/2412.20357
作者: Sanjay Chouhan,Shubha Brata Nath,Aparajita Dutta
机构: 未知
关键词: helped in solving, solving several problems, Large Language Model, Language, Language Model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advancements in Large Language Models (LLMs) have helped in solving several problems related to language processing. Most research has focused on the English language only, because of its popularity and abundance on the internet. However, a high-performance language model for Hindi and other Indic languages is lacking in the literature. In this work, we have pre-trained two autoregressive LLM models for the Hindi language, namely HindiLLM-Small and HindiLLM-Medium. We use a two-step process comprising unsupervised pre-training and supervised fine-tuning. First, we create a large and high-quality text corpus for unsupervised pre-training. Next, we train a Byte-Pair Encoding tokenizer, named the HindiLLM tokenizer, using the pre-training text data. We then perform training on the unlabeled data, known as the pre-training step, to get the HindiLLM base models. Furthermore, we perform fine-tuning of the HindiLLM base models for different tasks like sentiment analysis, text classification, natural language inference, and multiple-choice question answering on popular labeled datasets to measure the real-world performance. The evaluation shows that the HindiLLM-based fine-tuned models outperform several models in most of the language-related tasks.
zh

[NLP-46] Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain

【速读】: 该论文旨在探讨检索增强生成(Retrieval Augmented Generation, RAG)在医学领域中输出置信度(confidence)的影响,特别是在不同配置和模型下的表现。尽管RAG通过引入外部信息提升了大型语言模型(Large Language Models, LLMs)的响应准确性,但其输出置信度的机制尚未得到充分研究,而置信度在金融、医疗等高风险领域尤为重要。论文的核心解决方案是通过评估模型预测概率作为输出,并基于概率和准确性计算期望校准误差(Expected Calibration Error, ECE)和自适应校准误差(Adaptive Calibration Error, ACE)来量化置信度。此外,研究还分析了检索文档在提示中的顺序是否会影响置信度的校准。研究结果表明,置信度和准确性在不同模型、设置和输入提示格式下存在显著差异,强调了根据具体模型和条件优化配置的必要性。
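摘要以 ECE(期望校准误差)衡量置信度与准确率之间的偏差。下面是标准分箱式 ECE 的示意计算,分箱数取常见默认值,并非论文的特定配置:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: 模型对所给答案的预测概率;correct: 是否答对(0/1)。
    ECE = sum_b (|B_b|/N) * |acc(B_b) - conf(B_b)|"""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()
            conf = confidences[mask].mean()
            ece += mask.mean() * abs(acc - conf)
    return ece

print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1]))  # ≈ 0.30
```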

链接: https://arxiv.org/abs/2412.20309
作者: Shintaro Ozaki,Yuta Kato,Siyuan Feng,Masayo Tomita,Kazuki Hayashi,Ryoma Obara,Masafumi Oyamada,Katsuhiko Hayashi,Hidetaka Kamigaito,Taro Watanabe
机构: 未知
关键词: Retrieval Augmented Generation, Retrieval Augmented, Augmented Generation, Large Language Models, leveraging external information
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy for queries. This approach is widely applied in several fields because of its ability to inject the most up-to-date information, and researchers are focusing on understanding and improving this aspect to unlock the full potential of RAG in high-stakes applications. However, despite the potential of RAG to address these needs, the mechanisms behind the confidence levels of its outputs remain underexplored, although the confidence of information is critical in some domains, such as finance, healthcare, and medicine. Our study focuses on the impact of RAG on confidence within the medical domain under various configurations and models. We evaluate confidence by treating the model's predicted probability as its output and calculating Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores based on the probabilities and accuracy. In addition, we analyze whether the order of retrieved documents within prompts calibrates the confidence. Our findings reveal large variation in confidence and accuracy depending on the model, settings, and the format of input prompts. These results underscore the necessity of optimizing configurations based on the specific model and conditions.
zh

[NLP-47] No Preference Left Behind: Group Distributional Preference Optimization

【速读】: 该论文旨在解决现有对齐方法(如直接偏好优化,Direct Preference Optimization, DPO)在捕捉群体内分布性多元偏好(distributional pluralistic preferences)时的不足。现有方法往往偏向于主导偏好,忽视了群体内意见的多样性,特别是在存在冲突偏好的情况下。为了解决这一问题,论文提出了群体分布偏好优化(Group Distribution Preference Optimization, GDPO)这一新框架。GDPO通过引入塑造个体偏好的信念(beliefs)概念,利用统计估计校准语言模型,使其与基于信念的条件偏好对齐,从而提供了一个比传统方法更具包容性的对齐框架。实验结果表明,GDPO在训练过程中能够持续减少与目标信念分布的对齐差距,并在与群体分布偏好的对齐效果上优于现有方法,标志着在多元对齐领域的显著进展。

链接: https://arxiv.org/abs/2412.20299
作者: Binwei Yao,Zefan Cai,Yun-Shiuan Chuang,Shanglin Yang,Ming Jiang,Diyi Yang,Junjie Hu
机构: 未知
关键词: Direct Preference Optimization, Preference Optimization, Preferences, Distribution Preference Optimization, uniform but follow
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Preferences within a group of people are not uniform but follow a distribution. While existing alignment methods like Direct Preference Optimization (DPO) attempt to steer models to reflect human preferences, they struggle to capture the distributional pluralistic preferences within a group. These methods often skew toward dominant preferences, overlooking the diversity of opinions, especially when conflicting preferences arise. To address this issue, we propose Group Distribution Preference Optimization (GDPO), a novel framework that aligns language models with the distribution of preferences within a group by incorporating the concept of beliefs that shape individual preferences. GDPO calibrates a language model using statistical estimation of the group’s belief distribution and aligns the model with belief-conditioned preferences, offering a more inclusive alignment framework than traditional methods. In experiments using both synthetic controllable opinion generation and real-world movie review datasets, we show that DPO fails to align with the targeted belief distributions, while GDPO consistently reduces this alignment gap during training. Moreover, our evaluation metrics demonstrate that GDPO outperforms existing approaches in aligning with group distributional preferences, marking a significant advance in pluralistic alignment.
zh

[NLP-48] Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues

【速读】: 该论文旨在探究大语言模型(LLMs)在共情评分(empathy scoring)任务中的表现及其评分机制。具体而言,研究试图理解LLMs如何对对话中的共情程度进行评分,并开发了一个新颖且全面的框架来评估LLMs在测量和评分共情方面的有效性。解决方案的关键在于通过显性和可解释的特征来近似LLMs的性能,包括使用对话嵌入(embeddings)、动机性访谈治疗完整性代码(Motivational Interviewing Treatment Integrity, MITI Code)以及LLMs提出的共情显性子因素(explicit subfactors of empathy)。研究结果表明,仅使用嵌入特征即可达到接近通用LLMs的性能,而结合MITI Code和LLM评分的显性子因素,训练的分类器能够接近微调LLMs的表现。此外,研究还通过特征选择方法确定了共情评分过程中最关键的特征,为理解LLM共情评分提供了新的视角,并推动了LLM在社会科学研究中的评分潜力探索。
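摘要中的做法之一是"仅用对话嵌入训练分类器,以逼近 LLM 的共情打分"。下面给出一个最小示意,其中嵌入来源与标签离散化方式均为假设:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def train_empathy_classifier(embeddings: np.ndarray, labels: np.ndarray):
    """embeddings: (N, D) 对话回复的句向量;labels: 离散化后的共情等级。"""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
    return clf, macro_f1
```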

链接: https://arxiv.org/abs/2412.20264
作者: Henry J. Xie,Jinghan Zhang,Xinhao Zhang,Kunpeng Liu
机构: 未知
关键词: Large Language Models, Large Language, Language Models, complete complex tasks, LLMs
类目: Computation and Language (cs.CL)
备注: Accepted by IEEE BigData 2024

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have become increasingly more powerful in their ability to complete complex tasks. One such task in which LLMs are often employed is scoring, i.e., assigning a numerical value from a certain scale to a subject. In this paper, we strive to understand how LLMs score, specifically in the context of empathy scoring. We develop a novel and comprehensive framework for investigating how effective LLMs are at measuring and scoring empathy of responses in dialogues, and what methods can be employed to deepen our understanding of LLM scoring. Our strategy is to approximate the performance of state-of-the-art and fine-tuned LLMs with explicit and explainable features. We train classifiers using various features of dialogues including embeddings, the Motivational Interviewing Treatment Integrity (MITI) Code, a set of explicit subfactors of empathy as proposed by LLMs, and a combination of the MITI Code and the explicit subfactors. Our results show that when only using embeddings, it is possible to achieve performance close to that of generic LLMs, and when utilizing the MITI Code and explicit subfactors scored by an LLM, the trained classifiers can closely match the performance of fine-tuned LLMs. We employ feature selection methods to derive the most crucial features in the process of empathy scoring. Our work provides a new perspective toward understanding LLM empathy scoring and helps the LLM community explore the potential of LLM scoring in social science studies.
zh

[NLP-49] ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty

【速读】: 该论文旨在解决当前大语言模型(LLMs)在处理低频知识(low-frequency knowledge)时表现不佳的问题,并指出现有研究仅通过实体频率(entity frequency)来评估知识频率(knowledge frequency)的局限性。为此,论文提出了ComparisonQA基准测试,包含28.3万个抽象问题,每个问题通过一对高频和低频实体实例化,以确保知识频率的差异仅与实体频率相关。此外,为了避免当前LLMs研究中存在的语义捷径(semantic shortcuts)问题,论文设计了一种基于正确性和不确定性的两轮方法,用于测量知识的鲁棒性(knowledge robustness)。实验结果表明,LLMs在处理低频知识时鲁棒性显著较低,尤其是GPT-4o表现最差。论文还引入了一种自动过滤低质量和存在捷径问题的方法,形成了ComparisonQA-Hard数据集,发现不确定性能够有效识别此类问题并保持数据规模。

链接: https://arxiv.org/abs/2412.20251
作者: Qing Zong,Zhaowei Wang,Tianshi Zheng,Xiyu Ren,Yangqiu Song
机构: 未知
关键词: sparked extensive research, rapid development, sparked extensive, extensive research, knowledge
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid development of LLMs has sparked extensive research into their factual knowledge. Current works claim that LLMs fall short on questions requiring less frequent knowledge. However, their proof is incomplete since they only study the influence of entity frequency, which can not fully represent knowledge frequency. So we introduce ComparisonQA benchmark, containing 283K abstract questions, each instantiated by a pair of high-frequency and low-frequency entities. It ensures a controllable comparison because the difference of knowledge frequency between such a pair is only related to entity frequency. In addition, to avoid possible semantic shortcuts, which is a severe problem of current LLMs study, we design a two-round method for knowledge robustness measurement utilizing both correctness and uncertainty. Experiments reveal that LLMs exhibit particularly low robustness regarding low-frequency knowledge, and GPT-4o is even the worst under this measurement. Besides, we introduce an automatic method to filter out questions with low-quality and shortcuts to form ComparisonQA-Hard. We find that uncertainty effectively identifies such questions while maintaining the data size.
zh

[NLP-50] LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning

【速读】: 该论文旨在解决大语言模型(LLMs)在数学推理任务中面临的挑战,特别是在需要结合语言理解和数学推理技能的复杂问题求解方面。现有方法通常依赖于集成方法,并受限于目标领域的数据稀缺问题。论文提出的解决方案关键在于引入问题重述策略(question paraphrase strategy),通过多样化数学问题的语言形式来提高模型的泛化能力。此外,采用专门的训练目标,指导模型学习过程,重点增强其对数学概念和推理过程的理解。实验结果表明,该方法有效提升了LLMs在数学推理任务中的表现,突显了其在推动大语言模型发展及其在需要数学推理能力的实际应用中的潜在价值。

链接: https://arxiv.org/abs/2412.20227
作者: Shuguang Chen,Guang Lin
机构: 未知
关键词: natural language processing, shown remarkable performance, complex problem-solving requires, mathematical reasoning skills, mathematical reasoning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks but face challenges in mathematical reasoning, where complex problem-solving requires both linguistic understanding and mathematical reasoning skills. Existing approaches to address this challenge often rely on ensemble methods and suffer from the problem of data scarcity in target domains. In this work, we present a novel method to enhance LLMs’ capabilities in mathematical reasoning tasks. Motivated by the need to bridge this gap, our approach incorporates a question paraphrase strategy, which aims at diversifying the linguistic forms of mathematical questions to improve generalization. Additionally, specialized training objectives are employed to guide the model’s learning process, focusing on enhancing its understanding of mathematical concepts and reasoning processes. We conduct experiments on four datasets using different LLMs, and demonstrate the effectiveness of our approach in improving LLMs’ performance on mathematical reasoning tasks. Our findings underscore the significance of our methodology in the advancement of large language models and its potential implications for real-world applications that require mathematical reasoning abilities.
zh

[NLP-51] AfriHG: News headline generation for African Languages ICLR2024

【速读】: 该论文旨在解决非洲多语言新闻标题生成(news headline generation)的问题,特别是在16种广泛使用的非洲语言中。解决方案的关键在于构建了一个名为AfriHG的数据集,该数据集结合了XLSum和MasakhaNEWS数据集,并在此基础上进行了实验。研究团队使用了两种序列到序列(seq2seq)模型(mT5-base和AfriTeVa V2)以及Aya-101大语言模型(LLM)进行对比实验。结果表明,专注于非洲语言的AfriTeVa V2模型在性能上优于多语言的mT5-base模型。此外,研究还发现,使用313M参数的AfriTeVa V2模型通过微调后的性能与超过13B参数的Aya-101 LLM相当,显示出在资源有限的情况下,非洲中心化模型的竞争力。

链接: https://arxiv.org/abs/2412.20223
作者: Toyib Ogunremi,Serah Akojenu,Anthony Soronnadi,Olubayo Adekanmbi,David Ifeoluwa Adelani
机构: 未知
关键词: paper introduces AfriHG, languages widely spoken, spoken by Africa, headline generation dataset, generation dataset created
类目: Computation and Language (cs.CL)
备注: Accepted to AfricaNLP Workshop at ICLR 2024

点击查看摘要

Abstract:This paper introduces AfriHG, a news headline generation dataset created by combining the XLSum and MasakhaNEWS datasets, focusing on 16 languages widely spoken in Africa. We experimented with two seq2seq models (mT5-base and AfriTeVa V2) and the Aya-101 LLM. Our results show that Africa-centric seq2seq models such as AfriTeVa V2 outperform the massively multilingual mT5-base model. Finally, we show that the performance of fine-tuning AfriTeVa V2 with 313M parameters is competitive with prompting the Aya-101 LLM with more than 13B parameters.
zh

[NLP-52] YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text ICLR2024

【速读】: 该论文旨在解决约鲁巴语(Yorùbá)自动加音标(diacritization)问题,即通过自动化为约鲁巴语文本添加正确的音标符号。解决方案的关键在于构建了一个约鲁巴语自动加音标的基准数据集(Yorùbá Automatic Diacritization, YAD),并预训练了一个基于文本到文本转换的Transformer模型(T5)。研究表明,该模型在约鲁巴语加音标任务上优于多种多语言训练的T5模型。此外,论文还指出,更多的数据和更大的模型能够显著提升约鲁巴语加音标的性能。

链接: https://arxiv.org/abs/2412.20218
作者: Akindele Michael Olawole,Jesujoba O. Alabi,Aderonke Busayo Sakpere,David I. Adelani
机构: 未知
关键词: present Yorùbá automatic, Yorùbá diacritization systems, benchmark dataset, Yorùbá automatic diacritization, evaluating Yorùbá diacritization
类目: Computation and Language (cs.CL)
备注: Accepted at AfricaNLP Workshop at ICLR 2024

点击查看摘要

Abstract:In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that this model outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models are better at diacritization for Yorùbá.
zh

[NLP-53] Decoding Emotion: Speech Perception Patterns in Individuals with Self-reported Depression

【速读】: 该论文旨在探讨印度人群中自我报告的抑郁(self-reported depression)与情感语音感知(perception of affective speech)之间的关系。研究通过使用积极与消极情感量表(PANAS)和患者健康问卷-9(PHQ-9)分别评估参与者的当前情绪状态和抑郁程度,并记录他们对情感语音刺激在效价(valence)和唤醒度(arousal)量表上的情绪反应。研究的关键解决方案在于通过对比抑郁组与非抑郁组对不同情感语音刺激的反应,发现除中性情感语音外,两组在其他情感刺激上未表现出显著差异。此外,抑郁组的PANAS得分显著高于非抑郁组,表明预先存在的情绪状态对当前情绪状态有影响。与以往研究不同,该研究未观察到抑郁组对积极情感刺激的反应减弱,但在悲伤和愤怒情感语音刺激上,所有情绪感知指标均表现出一致性。

链接: https://arxiv.org/abs/2412.20213
作者: Guneesh Vats,Priyanka Srivastava,Chiranjeevi Yarra
机构: 未知
关键词: Indian population, current study examines, examines the relationship, relationship between self-reported, current mood
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The current study examines the relationship between self-reported depression and the perception of affective speech within the Indian population. PANAS and PHQ-9 were used to assess current mood and depression, respectively. Participants’ emotional reactivity was recorded on a valence and arousal scale against the affective speech audio presented in a sequence. No significant differences between the depression and no-depression groups were observed for any of the emotional stimuli, except the audio file depicting neutral emotion. Significantly higher PANAS scores by the depression than the no-depression group indicate the impact of pre-disposed mood on the current mood status. Contrary to previous findings, this study did not observe reduced positive emotional reactivity by the depression group. However, the results demonstrated consistency in emotional reactivity for speech stimuli depicting sadness and anger across all measures of emotion perception.
zh

[NLP-54] Building a Rich Dataset to Empower the Persian Question Answering Systems

【速读】: 该论文旨在解决波斯语(Persian)在问答系统(Question Answering Systems)领域缺乏标准数据集的问题。为此,作者提出了一个名为NextQuAD的综合性开放域数据集,包含7,515个上下文、23,918个问答对。解决方案的关键在于利用两个预训练语言模型(ParsBERT和XLM-RoBERTa)构建了一个基于BERT的问答模型,并通过平均对数(mean logits)进行模型集成。实验结果表明,该模型在开发集上取得了0.95的精确匹配(Exact Match, EM)和0.97的F1分数(F1 score)。此外,通过在PersianQA和ParSQuAD数据集上的对比评估,证明了该模型在提升EM指标上的有效性。
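摘要提到"对 ParsBERT 与 XLM-RoBERTa 两个阅读理解模型的输出取平均对数(mean logits)进行集成"。下面示意抽取式问答中对起止位置 logits 取平均后再解码答案区间的做法;解码约束(如最大答案长度)为示例假设。

```python
import torch

def ensemble_answer_span(start_logits_list, end_logits_list, max_answer_len=30):
    """start/end_logits_list: 各模型对同一上下文的 (T,) 起止位置 logits。
    先对各模型 logits 取平均,再在长度约束下选出得分最高的 (start, end) 区间。"""
    start = torch.stack(start_logits_list).mean(dim=0)
    end = torch.stack(end_logits_list).mean(dim=0)
    best_span, best_score = (0, 0), float("-inf")
    for s in range(len(start)):
        for e in range(s, min(s + max_answer_len, len(end))):
            score = (start[s] + end[e]).item()
            if score > best_score:
                best_span, best_score = (s, e), score
    return best_span  # 答案在上下文 token 序列中的起止位置
```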

链接: https://arxiv.org/abs/2412.20212
作者: Mohsen Yazdinejad,Marjan Kaedi
机构: 未知
关键词: systems provide short, Question answering systems, answering systems provide, Question answering, provide short
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Question answering systems provide short, precise, and specific answers to questions. So far, many robust question answering systems have been developed for English, while some languages with fewer resources, like Persian, have few standard datasets. In this study, a comprehensive open-domain dataset is presented for Persian. This dataset is called NextQuAD and has 7,515 contexts, including 23,918 questions and answers. Then, a BERT-based question answering model has been applied to this dataset using two pre-trained language models, ParsBERT and XLM-RoBERTa. The results of these two models have been ensembled using mean logits. Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 F1 score. Also, to compare NextQuAD with other Persian datasets, the model trained on NextQuAD is evaluated on two other datasets named PersianQA and ParSQuAD. Comparisons show that the proposed model increased EM by 0.39 and 0.14 respectively on PersianQA and ParSQuAD-manual, while a slight EM decline of 0.007 happened on ParSQuAD-automatic.
zh

[NLP-55] Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering

【速读】: 该论文旨在解决复杂表格问答(Complex Table Question Answering, TQA)中的挑战,特别是那些需要多步骤或多类别推理的问题。现有的方法通常依赖于闭源大语言模型(LLMs)或经过微调的开源权重LLMs,但这些方法存在获取高质量训练数据成本高、闭源模型可访问性差以及可复现性等问题。为此,论文提出了一种名为“多智能体协作工具使用框架”(Multi-Agent Collaboration with Tool use, MACT)的解决方案。MACT框架的核心在于通过一个规划智能体和一个编码智能体的协作,结合工具的使用,来回答问题,而无需依赖闭源模型或进行微调。实验结果表明,MACT在四个TQA基准测试中的三个上优于现有技术,并在两个基准测试中与更大且更昂贵的闭源模型GPT-4表现相当,证明了其多智能体协作在TQA中的有效性。

链接: https://arxiv.org/abs/2412.20145
作者: Wei Zhou,Mohsen Mesgar,Annemarie Friedrich,Heike Adel
机构: 未知
关键词: require complex reasoning, Complex table question, complex reasoning, table question answering, Complex table
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain, and utilizing closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework that requires neither closed-source models nor fine-tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous SoTA systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. We conduct extensive analyses to prove the effectiveness of MACT’s multi-agent collaboration in TQA.
zh

[NLP-56] M-MAD: Multidimensional Multi-Agent Debate Framework for Fine-grained Machine Translation Evaluation

【速读】: 该论文旨在解决当前在机器翻译(MT)评估领域中,基于大语言模型(LLMs)的“LLM-as-a-judge”方法在性能上不及学习型自动评估指标的问题。为此,作者提出了多维多代理辩论框架(Multidimensional Multi-Agent Debate, M-MAD),其关键解决方案包括:(1)将启发式MQM标准解耦为独立的评估维度,以实现细粒度评估;(2)利用多代理辩论机制,充分发挥LLMs的协作推理能力;(3)将各维度的评估结果综合为最终判断,以确保评估结果的鲁棒性和可靠性。实验表明,M-MAD不仅超越了现有的LLM-as-a-judge方法,还能与最先进的基于参考的自动评估指标相媲美,即使在使用次优模型如GPT-4o mini时也表现出色。

链接: https://arxiv.org/abs/2412.20127
作者: Zhaopeng Feng,Jiayuan Su,Jiamei Zheng,Jiahan Ren,Yan Zhang,Jian Wu,Hongwei Wang,Zuozhu Liu
机构: 未知
关键词: deliver human-like judgments, large language models, Recent advancements, showcasing their potential, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress. Code and data are available at this https URL

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have given rise to the LLM-as-a-judge paradigm, showcasing their potential to deliver human-like judgments. However, in the field of machine translation (MT) evaluation, current LLM-as-a-judge methods fall short of learned automatic metrics. In this paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our findings demonstrate that M-MAD achieves significant advancements by (1) decoupling heuristic MQM criteria into distinct evaluation dimensions for fine-grained assessments; (2) employing multi-agent debates to harness the collaborative reasoning capabilities of LLMs; (3) synthesizing dimension-specific results into a final evaluation judgment to ensure robust and reliable outcomes. Comprehensive experiments show that M-MAD not only outperforms all existing LLM-as-a-judge methods but also competes with state-of-the-art reference-based automatic metrics, even when powered by a suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight the superiority of our framework design, offering a fresh perspective for LLM-as-a-judge paradigm. Our code and data are publicly available at this https URL.
zh
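
As a rough illustration of the decompose-then-synthesize design, the sketch below scores each MQM-style dimension separately and then merges the per-dimension results into one judgment. The dimension names, weights, and the dummy scores standing in for the multi-agent debates are assumptions for illustration only.

```python
# Placeholder scores: in M-MAD each dimension would be judged by an LLM-based
# multi-agent debate; fixed numbers are used here so the aggregation runs as-is.
DUMMY_DEBATE_SCORES = {"accuracy": 80.0, "fluency": 95.0, "terminology": 90.0}

def debate_on_dimension(source, translation, dimension):
    return DUMMY_DEBATE_SCORES[dimension]

def synthesize(per_dim, weights):
    """Merge dimension-specific judgments into a single MT quality score."""
    return sum(weights[d] * s for d, s in per_dim.items())

weights = {"accuracy": 0.5, "fluency": 0.3, "terminology": 0.2}
per_dim = {d: debate_on_dimension("Der Hund schläft.", "The dog sleeps.", d)
           for d in weights}
print(per_dim, synthesize(per_dim, weights))
```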

[NLP-57] Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset ICASSP2025

【速读】: 该论文旨在解决大语言模型(LLMs)在处理包含文本和表格数据的混合长文档(HLDs)时的能力不足问题。由于HLDs通常超出LLMs的token限制,论文提出了一种自动化信息提取框架(AIE),以帮助LLMs有效处理和分析HLDs。解决方案的关键包括:1)选择和总结HLDs中有用部分的有效方法;2)通过简单的表格序列化方式使LLMs理解表格数据;3)展示AIE在复杂场景中的适应性;4)通过提示工程(prompt engineering)增强LLMs在HLDs上的表现。此外,论文还提出了金融报告数值提取(FINE)数据集,以解决HLDs数据集稀缺的问题,并支持未来研究。

链接: https://arxiv.org/abs/2412.20072
作者: Chongjian Yue,Xinrun Xu,Xiaojun Ma,Lun Du,Zhiming Ding,Shi Han,Dongmei Zhang,Qi Zhang
机构: 未知
关键词: Large Language Models, Large Language, Language Models, demonstrate exceptional performance, tabular reasoning tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICASSP 2025

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, containing textual and tabular data, remains unexplored. The hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction framework (AIE) to enable LLMs to process the HLDs and carry out experiments to analyse four important aspects of information extraction from HLDs. Given the findings: 1) The effective way to select and summarize the useful part of a HLD. 2) An easy table serialization way is enough for LLMs to understand tables. 3) The naive AIE has adaptability in many complex scenarios. 4) The useful prompt engineering to enhance LLMs on HLDs. To address the issue of dataset scarcity in HLDs and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
zh
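
The finding that "an easy table serialization way is enough" can be pictured with a serializer like the one below, which flattens a table into delimiter-separated rows inside a prompt. The exact format used by the AIE framework is not specified here, so this layout is an assumption.

```python
def serialize_table(header, rows):
    """Flatten a table into plain text that can be placed inside an LLM prompt."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

header = ["Item", "FY2023", "FY2024"]
rows = [["Revenue", "1,200", "1,450"], ["Net income", "210", "260"]]
prompt = (
    "Extract the FY2024 revenue as a number.\n\n"
    f"Table:\n{serialize_table(header, rows)}"
)
print(prompt)
```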

[NLP-58] On the Compositional Generalization of Multimodal LLMs for Medical Imaging

【速读】: 该论文旨在解决多模态大语言模型(MLLMs)在医学领域中由于某些领域数据不足而导致的泛化能力受限的问题。为了解决这一问题,论文提出了使用组合泛化(Compositional Generalization, CG)作为指导框架,通过重新组合已学习元素来理解新的组合,从而提升模型对未见过的医学图像的理解能力。关键解决方案包括构建了一个包含106个医学数据集的Med-MAT平台,用于全面实验。实验结果表明,MLLMs能够利用CG理解未见过的医学图像,并且CG是多任务训练中观察到泛化的主要驱动因素之一。此外,研究还表明,CG在数据有限的情况下仍能有效支持数据集,并在不同骨干网络中表现一致,展示了其广泛的适用性和多功能性。

链接: https://arxiv.org/abs/2412.20070
作者: Zhenyang Cai,Junying Chen,Rongsheng Wang,Weihong Wang,Yonglin Deng,Dingjie Song,Yize Chen,Zixu Zhang,Benyou Wang
机构: 未知
关键词: Multimodal large language, hold significant potential, Multimodal large, large language models, hold significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need for understanding what kinds of images can be used by MLLMs for generalization. Current research suggests that multi-task training outperforms single-task as different tasks can benefit each other, but they often overlook the internal relationships within these tasks, providing limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we attempted to employ compositional generalization (CG)-the ability of models to understand novel combinations by recombining learned elements-as a guiding framework. Since medical images can be precisely defined by Modality, Anatomical area, and Task, naturally providing an environment for exploring CG. Therefore, we assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Additionally, further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at this https URL.
zh
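
Because every Med-MAT dataset is indexed by a (Modality, Anatomical area, Task) triple, compositional generalization can be probed by evaluating on combinations whose individual elements were all seen in training but whose triple was not. The split logic below is a simplified sketch with invented triples.

```python
from itertools import product

# Training datasets described by (Modality, Anatomical area, Task) triples (made up).
train = {("CT", "chest", "diagnosis"), ("MRI", "brain", "diagnosis"),
         ("CT", "brain", "segmentation"), ("X-ray", "chest", "segmentation")}

modalities = {m for m, _, _ in train}
areas = {a for _, a, _ in train}
tasks = {t for _, _, t in train}

# Unseen-but-composable combos: every element was seen, the triple itself was not.
cg_eval = [c for c in product(modalities, areas, tasks) if c not in train]
print(sorted(cg_eval))
```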

[NLP-59] The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support

【速读】: 该论文旨在解决心理健康服务需求日益增长的问题,特别是在心理对话式 AI(Psychological Conversational AI)领域,敏感数据的稀缺性限制了相关技术的发展。论文提出了一种创新的解决方案,其关键在于结合可解释的情感档案(Explainable Emotional Profiles)和共情对话模型(Empathetic Conversational Models),开发了一个名为 RACLETTE 的系统。该系统不仅能够准确理解用户的情感状态并生成共情回应,还能通过用户互动逐步构建其情感档案。这些情感档案可作为可解释的心理健康评估标记,与不同精神障碍的典型情感模式进行对比,从而为初步筛查和支持提供了一种新方法。这一解决方案为传统心理健康护理的补充提供了有力工具,尤其是在即时专家支持不可得的情况下。

链接: https://arxiv.org/abs/2412.20068
作者: Alessandro De Grandi,Federico Ravenda,Andrea Raballo,Fabio Crestani
机构: 未知
关键词: mental health services, innovative solutions, data is scarce, mental health, increasing demand
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing demand for mental health services has highlighted the need for innovative solutions, particularly in the realm of psychological conversational AI, where the availability of sensitive data is scarce. In this work, we explored the development of a system tailored for mental health support with a novel approach to psychological assessment based on explainable emotional profiles in combination with empathetic conversational models, offering a promising tool for augmenting traditional care, particularly where immediate expertise is unavailable. Our work can be divided into two main parts, intrinsically connected to each other. First, we present RACLETTE, a conversational system that demonstrates superior emotional accuracy compared to state-of-the-art benchmarks in both understanding users’ emotional states and generating empathetic responses during conversations, while progressively building an emotional profile of the user through their interactions. Second, we show how the emotional profiles of a user can be used as interpretable markers for mental health assessment. These profiles can be compared with characteristic emotional patterns associated with different mental disorders, providing a novel approach to preliminary screening and support.
zh
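
One way to read the "emotional profile as interpretable marker" idea is as a normalized distribution over emotions accumulated across a user's turns, compared against reference patterns. The cosine-similarity sketch below illustrates such a comparison; the emotion set and reference profiles are invented for the example.

```python
import numpy as np

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "neutral"]

def profile(counts):
    """Turn raw emotion counts into a normalized profile vector."""
    v = np.array([counts.get(e, 0) for e in EMOTIONS], dtype=float)
    return v / v.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

user = profile({"sadness": 7, "fear": 4, "neutral": 3})
references = {
    "reference pattern A": profile({"sadness": 8, "fear": 3, "neutral": 2}),
    "reference pattern B": profile({"joy": 5, "neutral": 6, "surprise": 2}),
}
for name, ref in references.items():
    print(name, round(cosine(user, ref), 3))
```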

[NLP-60] Comparative Analysis of Listwise Reranking with Large Language Models in Limited-Resource Language Contexts

【速读】: 该论文旨在评估大语言模型(LLMs)在资源有限的非洲语言列表重排序(listwise reranking)任务中的性能。研究通过比较专有模型RankGPT3.5、Rank4o-mini、RankGPTo1-mini和RankClaude-sonnet在跨语言环境中的表现,探讨LLMs在提升低资源语言重排序任务中的潜力。研究结果表明,这些LLMs在大多数评估指标(尤其是nDCG@10和MRR@100)上显著优于传统基线方法(如BM25-DT)。解决方案的关键在于利用LLMs的强大能力,为低资源语言提供高效且成本效益显著的重排序方案。

链接: https://arxiv.org/abs/2412.20061
作者: Yanxin Shen,Lun Wang,Chuanqi Shi,Shaoshuai Du,Yiyi Tao,Yixian Shen,Hang Zhang
机构: 未知
关键词: including text ranking, Large Language Models, demonstrated significant effectiveness, Language Models, Large Language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant effectiveness across various NLP tasks, including text ranking. This study assesses the performance of large language models (LLMs) in listwise reranking for limited-resource African languages. We compare proprietary models RankGPT3.5, Rank4o-mini, RankGPTo1-mini and RankClaude-sonnet in cross-lingual contexts. Results indicate that these LLMs significantly outperform traditional baseline methods such as BM25-DT in most evaluation metrics, particularly in nDCG@10 and MRR@100. These findings highlight the potential of LLMs in enhancing reranking tasks for low-resource languages and offer insights into cost-effective solutions.
zh
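
For reference, the two headline metrics in this comparison (nDCG@10 and MRR@100) can be computed from a binary relevance list as follows; this is a standard textbook implementation on a toy ranking, not the authors' evaluation code.

```python
import math

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

def mrr_at_k(relevances, k):
    for i, rel in enumerate(relevances[:k]):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0

# Relevance of documents in the order returned by a reranker (1 = relevant).
ranking = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(ranking, 10), 3), round(mrr_at_k(ranking, 100), 3))
```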

[NLP-61] “My life is miserable, have to sign 500 autographs everyday”: Exposing Humblebragging, the Brags in Disguise

【速读】: 该论文旨在解决自动检测文本中“谦虚自夸”(humblebragging)现象的问题,即识别那些表面上看似谦虚或抱怨,实则自我炫耀的语句。这一任务对于提升机器在情感分析(sentiment analysis)和意图识别(intent recognition)等自然语言理解任务中的表现至关重要。论文首次在计算语言学领域引入了这一任务,并通过提出一个四元组定义(4-tuple definition)来形式化“谦虚自夸”的概念。解决方案的关键在于评估了机器学习、深度学习以及大语言模型(LLMs)在该任务上的表现,并与人类的表现进行了对比。此外,论文还创建并发布了一个名为HB24的数据集,包含3,340条由GPT-4o生成的“谦虚自夸”语句。实验结果表明,检测“谦虚自夸”具有挑战性,即使对人类而言也是如此,最佳模型的F1得分为0.88。这项工作为深入探索这一微妙语言现象及其在更广泛自然语言理解系统中的应用奠定了基础。

链接: https://arxiv.org/abs/2412.20057
作者: Sharath Naganna,Saprativa Bhattacharjee,Pushpak Bhattacharyya,Biplab Banerjee
机构: 未知
关键词: individuals present self-promotional, present self-promotional statements, modesty or complaints, individuals present, present self-promotional
类目: Computation and Language (cs.CL)
备注: Under review at ARR

点击查看摘要

Abstract:Humblebragging is a phenomenon where individuals present self-promotional statements under the guise of modesty or complaints. For example, a statement like, “Ugh, I can’t believe I got promoted to lead the entire team. So stressful!”, subtly highlights an achievement while pretending to be complaining. Detecting humblebragging is important for machines to better understand the nuances of human language, especially in tasks like sentiment analysis and intent recognition. However, this topic has not yet been studied in computational linguistics. For the first time, we introduce the task of automatically detecting humblebragging in text. We formalize the task by proposing a 4-tuple definition of humblebragging and evaluate machine learning, deep learning, and large language models (LLMs) on this task, comparing their performance with humans. We also create and release a dataset called HB24, containing 3,340 humblebrags generated using GPT-4o. Our experiments show that detecting humblebragging is non-trivial, even for humans. Our best model achieves an F1-score of 0.88. This work lays the foundation for further exploration of this nuanced linguistic phenomenon and its integration into broader natural language understanding systems.
zh

[NLP-62] STAYKATE: Hybrid In-Context Example Selection Combining Representativeness Sampling and Retrieval-based Approach – A Case Study on Science Domains

【速读】: 该论文旨在解决科学信息提取(scientific information extraction)中面临的训练数据不足和标注成本高的问题,通过利用大语言模型(LLMs)的上下文学习能力来提升性能。解决方案的关键在于提出了一种名为STAYKATE的静态-动态混合选择方法(static-dynamic hybrid selection method),该方法结合了主动学习(active learning)中的代表性采样原则和基于检索的方法(retrieval-based approach),以优化上下文示例的选择。实验结果表明,STAYKATE在三个特定领域的数据集上均优于传统的监督方法和现有的选择方法,尤其在处理其他方法难以应对的实体类型时表现尤为突出。

链接: https://arxiv.org/abs/2412.20043
作者: Chencheng Zhu,Kazutaka Shimada,Tomoki Taniguchi,Tomoko Ohkuma
机构: 未知
关键词: Large language models, scientific information extraction, insufficient training data, Large language, language models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate the ability to learn in-context, offering a potential solution for scientific information extraction, which often contends with challenges such as insufficient training data and the high cost of annotation processes. Given that the selection of in-context examples can significantly impact performance, it is crucial to design a proper method to sample the efficient ones. In this paper, we propose STAYKATE, a static-dynamic hybrid selection method that combines the principles of representativeness sampling from active learning with the prevalent retrieval-based approach. The results across three domain-specific datasets indicate that STAYKATE outperforms both the traditional supervised methods and existing selection methods. The enhancement in performance is particularly pronounced for entity types that other methods pose challenges.
zh
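
A rough reading of the static-dynamic hybrid is: a static pool chosen once for representativeness (for example, candidates closest to k-means centroids of the unlabeled pool) plus dynamic examples retrieved per test instance by embedding similarity. The sketch below illustrates that combination on random embeddings and is an assumption about the mechanics, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 32))    # embeddings of candidate in-context examples
query = rng.normal(size=32)          # embedding of the test instance

# Static part: representative examples, one per k-means cluster.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pool)
static_ids = [int(np.argmin(np.linalg.norm(pool - c, axis=1)))
              for c in km.cluster_centers_]

# Dynamic part: nearest neighbours of the query, skipping already-chosen examples.
order = np.argsort(np.linalg.norm(pool - query, axis=1))
dynamic_ids = [int(i) for i in order if int(i) not in static_ids][:2]

print("in-context example indices:", static_ids + dynamic_ids)
```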

[NLP-63] BaiJia: A Large Scale Role-Playing Agent Corpus of Chinese Historical Characters

【速读】: 该论文旨在解决大语言模型(LLMs)在历史角色扮演任务中面临的挑战,特别是由于历史文本记录形式多样且碎片化,导致模型难以有效整合和利用这些信息的问题。为此,作者提出了一个名为“BaiJia”的大规模角色扮演语料库,该语料库涵盖了多种中国历史人物的信息,包括其生平、文学作品、家庭关系、历史事件等。BaiJia语料库的关键在于其首次整合了低资源数据,为大语言模型提供了丰富的历史背景信息,从而显著提升了模型在历史角色扮演任务中的表现。通过广泛的实验,作者验证了BaiJia语料库在增强LLMs角色扮演能力方面的有效性,并推动了LLMs在历史角色扮演任务中的开发与评估。

链接: https://arxiv.org/abs/2412.20024
作者: Ting Bai,Jiazheng Kang,Jiayang Fan
机构: 未知
关键词: Chinese historical characters, comprehensive large-scale role-playing, comprises various Chinese, large-scale role-playing agent, Chinese historical
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce a comprehensive large-scale role-playing agent corpus, termed BaiJia, that comprises various Chinese historical characters. This corpus is noteworthy for being the pioneering compilation of low-resource data that can be utilized in large language models (LLMs) to engage in AI-driven historical role-playing agents. BaiJia addresses the challenges in terms of fragmented historical textual records in different forms and modalities, integrating various characters’ information, including their biographical, literary, family relations, historical events, and so on. We conduct extensive experiments to demonstrate the effectiveness of our BaiJia agent corpus in bolstering the role-playing abilities of various foundational LLMs, and promoting the development and assessment of LLMs in the context of historical role-playing tasks. The agent corpus is available at this http URL.
zh

[NLP-64] OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

【速读】: 该论文旨在解决从网络和原始PDF书籍中提取知识的挑战,特别是在多领域(如科学、新闻等)中的应用问题。解决方案的关键在于设计了一个名为OneKE的容器化(dockerized)模式引导知识提取系统。该系统通过多个代理(agents)和可配置的知识库(configure knowledge base)来实现。不同代理分别执行其特定角色,支持多种提取场景,而可配置的知识库则有助于模式配置、错误案例调试和修正,从而进一步提升系统性能。通过基准数据集的实证评估和案例研究,OneKE展示了其在不同任务和领域中的高效性和适应性,凸显了其广泛应用的潜力。

链接: https://arxiv.org/abs/2412.20005
作者: Yujie Luo,Xiangyuan Ru,Kangwei Liu,Lin Yuan,Mengshu Sun,Ningyu Zhang,Lei Liang,Zhiqiang Zhang,Jun Zhou,Lanning Wei,Da Zheng,Haofen Wang,Huajun Chen
机构: 未知
关键词: raw PDF Books, PDF Books, Web and raw, raw PDF, dockerized schema-guided knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF Books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configure knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configure knowledge base facilitates schema configuration, error case debugging and correction, further improving the performance. Empirical evaluations on benchmark datasets demonstrate OneKE’s efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the Code at this https URL and released a Video at this http URL.
zh

[NLP-65] Bridging Context Gaps: Enhancing Comprehension in Long-Form Social Conversations Through Contextualized Excerpts COLING2025

【速读】: 该论文旨在解决从小群体录音对话中提取的摘录在后续对话中可能缺失关键上下文或重要元素的问题,特别是在对话内容较长且主题丰富的情况下。为解决这一问题,论文探索了利用大语言模型(LLMs)为这些摘录提供社会相关上下文的方法,从而增强理解、可读性和共情能力。解决方案的关键在于通过有效的上下文补充,提升摘录的语境化效果,并通过主观和客观评估验证了其在理解上的显著改进。此外,论文还发布了人工标注的显著摘录(HSE)数据集,以支持未来研究,并展示了上下文丰富的摘录如何提供更聚焦和全面的对话摘要。

链接: https://arxiv.org/abs/2412.19966
作者: Shrestha Mohanty,Sarah Xuan,Jacob Jobraeel,Anurag Kumar,Deb Roy,Jad Kabbara
机构: 未知
关键词: sharing personal stories, small-group recorded conversations, focus on enhancing, small-group recorded, medium to bring
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at COLING 2025

点击查看摘要

Abstract:We focus on enhancing comprehension in small-group recorded conversations, which serve as a medium to bring people together and provide a space for sharing personal stories and experiences on crucial social matters. One way to parse and convey information from these conversations is by sharing highlighted excerpts in subsequent conversations. This can help promote a collective understanding of relevant issues, by highlighting perspectives and experiences to other groups of people who might otherwise be unfamiliar with and thus unable to relate to these experiences. The primary challenge that arises then is that excerpts taken from one conversation and shared in another setting might be missing crucial context or key elements that were previously introduced in the original conversation. This problem is exacerbated when conversations become lengthier and richer in themes and shared experiences. To address this, we explore how Large Language Models (LLMs) can enrich these excerpts by providing socially relevant context. We present approaches for effective contextualization to improve comprehension, readability, and empathy. We show significant improvements in understanding, as assessed through subjective and objective evaluations. While LLMs can offer valuable context, they struggle with capturing key social aspects. We release the Human-annotated Salient Excerpts (HSE) dataset to support future work. Additionally, we show how context-enriched excerpts can provide more focused and comprehensive conversation summaries.
zh

[NLP-66] Assessing Text Classification Methods for Cyberbullying Detection on Social Media Platforms

【速读】: 该论文旨在解决社交媒体平台上网络欺凌(cyberbullying)实时检测与监控系统的性能、数据集质量、时间效率和计算成本等问题。研究通过比较和评估现有文本分类技术在网络欺凌检测领域的适用性,特别是评估了包括BERT、RoBERTa、XLNet、DistilBERT和GPT-2.0在内的大规模语言模型的有效性和性能。研究结果表明,BERT在性能、时间效率和计算资源之间取得了最佳平衡,其准确率、精确率、召回率和F1得分均达到95%,推理时间为0.053秒,内存使用为35.28 MB,CPU/GPU使用率为0.4%,能耗为0.000263 kWh。尽管生成式AI模型在某些方面表现强大,但在特定数据集和任务上,通过策略性适应和微调现有模型仍可实现最先进的性能。因此,解决方案的关键在于选择并优化适合特定任务的语言模型,以实现高效、准确的网络欺凌检测。

链接: https://arxiv.org/abs/2412.19928
作者: Adamu Gaston Philipo,Doreen Sebastian Sarwatt,Jianguo Ding,Mahmoud Daneshmand,Huansheng Ning
机构: 未知
关键词: mental health issues, Cyberbullying significantly contributes, psychology of victims, significantly contributes, contributes to mental
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 15 pages, 10 figures, 7 tables

点击查看摘要

Abstract:Cyberbullying significantly contributes to mental health issues in communities by negatively impacting the psychology of victims. It is a prevalent problem on social media platforms, necessitating effective, real-time detection and monitoring systems to identify harmful messages. However, current cyberbullying detection systems face challenges related to performance, dataset quality, time efficiency, and computational costs. This research aims to conduct a comparative study by adapting and evaluating existing text classification techniques within the cyberbullying detection domain. The study specifically evaluates the effectiveness and performance of these techniques in identifying cyberbullying instances on social media platforms. It focuses on leveraging and assessing large language models, including BERT, RoBERTa, XLNet, DistilBERT, and GPT-2.0, for their suitability in this domain. The results show that BERT strikes a balance between performance, time efficiency, and computational resources: Accuracy of 95%, Precision of 95%, Recall of 95%, F1 Score of 95%, Error Rate of 5%, Inference Time of 0.053 seconds, RAM Usage of 35.28 MB, CPU/GPU Usage of 0.4%, and Energy Consumption of 0.000263 kWh. The findings demonstrate that generative AI models, while powerful, do not consistently outperform fine-tuned models on the tested benchmarks. However, state-of-the-art performance can still be achieved through strategic adaptation and fine-tuning of existing models for specific datasets and tasks.
zh

[NLP-67] Right vs. Right: Can LLMs Make Tough Choices?

【速读】: 该论文旨在探讨大型语言模型(LLMs)在处理伦理困境(ethical dilemmas)时的表现,具体包括其对伦理困境的敏感性、道德价值选择的一致性、后果的考量能力,以及其回应是否能够与提示中明确或隐含的道德价值偏好保持一致。为解决这一问题,研究团队基于一个领先的伦理框架构建了一个包含1,730个伦理困境的数据集,涉及四对相互冲突的价值观,并对来自六个家族的20个知名LLMs进行了评估。研究的关键解决方案在于通过系统性实验揭示了LLMs在主要价值对之间的显著偏好,如优先选择真理而非忠诚、集体而非个体、长期而非短期考量,并发现较大规模的LLMs倾向于支持义务论(deontological)视角,即使在指定负面后果的情况下仍坚持其行为选择。此外,研究还表明,明确的指导原则比上下文示例更能有效引导LLMs的道德选择,同时揭示了LLMs在理解不同形式的伦理困境时的局限性。

链接: https://arxiv.org/abs/2412.19926
作者: Jiaqing Yuan,Pradeep K. Murukannaiah,Munindar P. Singh
机构: 未知
关键词: ethical dilemmas, ethical dilemma describes, ethical, options involving conflicting, dilemmas
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:An ethical dilemma describes a choice between two “right” options involving conflicting moral values. We present a comprehensive evaluation of how LLMs navigate ethical dilemmas. Specifically, we investigate LLMs on their (1) sensitivity in comprehending ethical dilemmas, (2) consistency in moral value choice, (3) consideration of consequences, and (4) ability to align their responses to a moral value preference explicitly or implicitly specified in a prompt. Drawing inspiration from a leading ethical framework, we construct a dataset comprising 1,730 ethical dilemmas involving four pairs of conflicting values. We evaluate 20 well-known LLMs from six families. Our experiments reveal that: (1) LLMs exhibit pronounced preferences between major value pairs, and prioritize truth over loyalty, community over individual, and long-term over short-term considerations. (2) The larger LLMs tend to support a deontological perspective, maintaining their choices of actions even when negative consequences are specified. (3) Explicit guidelines are more effective in guiding LLMs’ moral choice than in-context examples. Lastly, our experiments highlight the limitation of LLMs in comprehending different formulations of ethical dilemmas.
zh

[NLP-68] HADES: Hardware Accelerated Decoding for Efficient Speculation in Large Language Models

【速读】: 该论文旨在解决大型语言模型(LLMs)在性能和能效方面面临的显著计算挑战,特别是在模型规模和复杂性不断增加的情况下。论文提出的解决方案是硬件加速解码(Hardware Accelerated Decoding, HADES),其关键在于设计一种支持硬件级推测解码(speculative decoding)的LLM加速器。推测解码通过预测和并行处理来显著提升LLM操作的效率,从而为更先进和实用的应用铺平道路。这一方法在现有文献中尚未被探索,具有创新性和前瞻性。

链接: https://arxiv.org/abs/2412.19925
作者: Ze Yang,Yihong Jin,Xinhe Xu
机构: 未知
关键词: Large Language Models, natural language processing, revolutionized natural language, generating human-like text, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Accepted to ICCEA 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing by understanding and generating human-like text. However, the increasing demand for more sophisticated LLMs presents significant computational challenges due to their scale and complexity. This paper introduces Hardware Accelerated Decoding (HADES), a novel approach to enhance the performance and energy efficiency of LLMs. We address the design of an LLM accelerator with hardware-level speculative decoding support, a concept not previously explored in existing literature. Our work demonstrates how speculative decoding can significantly improve the efficiency of LLM operations, paving the way for more advanced and practical applications of these models.
zh
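
For readers unfamiliar with the algorithmic side, the sketch below shows the generic draft-then-verify rule of speculative decoding on toy categorical distributions: a cheap draft model proposes a token, and the target model accepts it with probability min(1, p/q), resampling from the residual distribution on rejection. This is the standard software formulation, not HADES's hardware design.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8
p_target = rng.dirichlet(np.ones(VOCAB))   # next-token dist of the large target model
q_draft = rng.dirichlet(np.ones(VOCAB))    # next-token dist of the small draft model

def speculative_step(p, q):
    """One draft-and-verify step of speculative sampling for a single position."""
    x = rng.choice(VOCAB, p=q)                  # draft model proposes token x
    if rng.random() < min(1.0, p[x] / q[x]):    # target model verifies the proposal
        return int(x), True
    residual = np.maximum(p - q, 0.0)           # rejected: resample from the residual
    residual /= residual.sum()
    return int(rng.choice(VOCAB, p=residual)), False

for _ in range(5):
    token, accepted = speculative_step(p_target, q_draft)
    print(token, "accepted" if accepted else "resampled")
```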

[NLP-69] Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM

【速读】: 该论文旨在解决自动文本摘要(summarization)评估中的准确性和客观性问题。现有的评估方法,如ROUGE和嵌入相似度,往往与人类判断的相关性较低,且评分不够直观,难以真实反映摘要的质量。此外,尽管大语言模型(LLMs)能够模拟人类进行主观评价,但主观评分难以解释和验证,且容易通过调整模型和提示语(prompts)进行操纵。为解决这些问题,论文提出了一种新颖的评估方法和工具(SumAutoEval),通过在多个粒度层次上定义和评估指标,从完整性(completeness)、正确性(correctness)、一致性(Alignment)和可读性(readability)四个关键维度提供客观评分。该方法显著提升了摘要质量的理解,并与人类判断具有更好的相关性。

链接: https://arxiv.org/abs/2412.19906
作者: Dong Yuan,Eti Rastogi,Fen Zhao,Sagar Goyal,Gautam Naik,Sree Prasanna Rajagopal
机构: 未知
关键词: efficient information consumption, gained paramount importance, efficient information, information consumption, paramount importance
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to the exponential growth of information and the need for efficient information consumption the task of summarization has gained paramount importance. Evaluating summarization accurately and objectively presents significant challenges, particularly when dealing with long and unstructured texts rich in content. Existing methods, such as ROUGE (Lin, 2004) and embedding similarities, often yield scores that have low correlation with human judgements and are also not intuitively understandable, making it difficult to gauge the true quality of the summaries. LLMs can mimic human in giving subjective reviews but subjective scores are hard to interpret and justify. They can be easily manipulated by altering the models and the tones of the prompts. In this paper, we introduce a novel evaluation methodology and tooling designed to address these challenges, providing a more comprehensive, accurate and interpretable assessment of summarization outputs. Our method (SumAutoEval) proposes and evaluates metrics at varying granularity levels, giving objective scores on 4 key dimensions such as completeness, correctness, Alignment and readability. We empirically demonstrate, that SumAutoEval enhances the understanding of output quality with better human correlation.
zh

[NLP-70] GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection

【速读】: 该论文旨在解决低秩训练方法(如GaLore)在优化大语言模型(LLMs)时,因低秩投影估计(low-rank projection estimation)耗时过长而导致的训练效率低下的问题。具体而言,GaLore中的奇异值分解(SVD)占据了总训练时间的80%以上,成为性能瓶颈。为解决这一问题,论文提出了GaLore+,其关键解决方案包括:1)采用跨头低秩投影(cross-head low-rank projection)来减少多头注意力机制中低秩投影估计的时间消耗;2)使用随机子空间迭代(randomized subspace iteration)以实现快速SVD;3)引入稀疏编码残差(sparsely coded residuals)来降低低秩近似对优化器的一阶和二阶矩以及权重更新带来的误差。实验结果表明,GaLore+在算术推理和自然语言生成任务上表现出色,且微调速度比原始GaLore提升了约4倍。

链接: https://arxiv.org/abs/2412.19820
作者: Xutao Liao,Shaohui Li,Yuhui Xu,Zhi Li,Yu Liu,You He
机构: 未知
关键词: Recent low-rank training, large language models, Recent low-rank, optimize large language, significantly reduced
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent low-rank training methods, such as GaLore, have significantly reduced the memory required to optimize large language models (LLMs). However, these methods often suffer from time-consuming low-rank projection estimations. In particular, the singular value decomposition (SVD) in GaLore can consume more than 80% of the total training time. To address this issue, we propose GaLore+, which uses cross-head low-rank projection to reduce the substantial time consumption in estimating low-rank projections for multi-head attention. In addition, we employ randomized subspace iteration to achieve fast SVD. To further enhance performance, we propose sparsely coded residuals to reduce the errors caused by low-rank approximation on the first- and second-order moments of the optimizers and weight updates. We evaluate GaLore+ on arithmetic reasoning and natural language generation datasets. Our experiments demonstrate that GaLore+ delivers superior performance while achieving approximately 4× fine-tuning speed compared to vanilla GaLore.
zh
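
The randomized subspace iteration used here for fast SVD is a well-known technique; a compact NumPy version is given below for orientation. Matrix sizes, rank, and iteration counts are illustrative, and this is not the GaLore+ implementation.

```python
import numpy as np

def randomized_svd(A, rank, n_iter=2, oversample=8, seed=0):
    """Approximate truncated SVD via randomized subspace (power) iteration."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Q = rng.normal(size=(n, rank + oversample))  # random test matrix
    Y = A @ Q
    for _ in range(n_iter):                      # subspace iterations sharpen the basis
        Y, _ = np.linalg.qr(A @ (A.T @ Y))
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A                                  # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

G = np.random.default_rng(1).normal(size=(512, 128))   # e.g. a gradient block
U, s, Vt = randomized_svd(G, rank=8)
print(U.shape, s.shape, Vt.shape)   # (512, 8) (8,) (8, 128)
```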

[NLP-71] Two-component spatiotemporal template for activation-inhibition of speech in ECoG

【速读】: 该论文旨在研究在发音任务中,感觉运动皮层(SMC)内多通道高密度皮层电图(ECoG)记录的频段受限语音活动的时空动态。具体而言,作者探讨了在发音运动期间,感觉运动皮层中个体ECoG通道之间β频段(12-35 Hz)与高频γ频段(70-140 Hz)活动的反相关性。为刻画这一现象,作者采用了基于主成分分析(PCA)的方差模型,将SMC通道的频段功率投影到低维主成分上,并通过窗口相关分析识别了语音相关活动与主成分之间的时空关系。研究发现,感觉运动区域与主成分区域的相关性揭示了语音活动中类似激活-抑制的双成分表征,且第三主成分在所有受试者中均未显示出显著相关性,表明两个主成分足以表征发音运动期间的SMC活动。

链接: https://arxiv.org/abs/2412.21178
作者: Eric Easthope
机构: 未知
关键词: multi-channel high-density electrocorticography, consonant-vowel speaking task, power of band-limited, high-density electrocorticography, recorded from multiple
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:I compute the average trial-by-trial power of band-limited speech activity across epochs of multi-channel high-density electrocorticography (ECoG) recorded from multiple subjects during a consonant-vowel speaking task. I show that previously seen anti-correlations of average beta frequency activity (12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement are observable between individual ECoG channels in the sensorimotor cortex (SMC). With this I fit a variance-based model using principal component analysis to the band-powers of individual channels of session-averaged ECoG data in the SMC and project SMC channels onto their lower-dimensional principal components. Spatiotemporal relationships between speech-related activity and principal components are identified by correlating the principal components of both frequency bands to individual ECoG channels over time using windowed correlation. Correlations of principal component areas to sensorimotor areas reveal a distinct two-component activation-inhibition-like representation for speech that resembles distinct local sensorimotor areas recently shown to have complex interplay in whole-body motor control, inhibition, and posture. Notably the third principal component shows insignificant correlations across all subjects, suggesting two components of ECoG are sufficient to represent SMC activity during speech movement.
zh
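
To illustrate the analysis pipeline (band-power matrix, PCA via SVD, then windowed correlation of channels against principal components), here is a compact NumPy sketch on synthetic data; channel counts, window length, and the synthetic signals are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 400, 16                      # time points x ECoG channels (synthetic band power)
latent = np.sin(np.linspace(0, 8 * np.pi, T))[:, None]
X = latent @ rng.normal(size=(1, C)) + 0.3 * rng.normal(size=(T, C))

# PCA via SVD of the mean-centred band-power matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :2] * S[:2]              # time courses of the first two components

def windowed_corr(a, b, win=50, step=25):
    """Pearson correlation of two series over sliding windows."""
    out = []
    for start in range(0, len(a) - win + 1, step):
        out.append(np.corrcoef(a[start:start + win], b[start:start + win])[0, 1])
    return np.array(out)

# Correlate channel 0 with the first principal component over sliding windows.
print(np.round(windowed_corr(X[:, 0], pcs[:, 0]), 2))
```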

[NLP-72] Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment ICASSP2025

【速读】: 该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MER)中跨模态特征对齐的挑战。现有方法通常采用单一的对齐策略,这不仅限制了模型性能,还无法有效处理情感表达中的复杂性和模糊性。为此,论文提出了一种多粒度跨模态对齐(Multi-Granularity Cross-Modal Alignment, MGCMA)框架,其关键创新在于整合了基于分布(distribution-based)、基于实例(instance-based)和基于标记(token-based)的对齐模块,从而实现了跨模态情感信息的多层次感知。实验结果表明,该框架在IEMOCAP数据集上优于当前的最先进技术。

链接: https://arxiv.org/abs/2412.20821
作者: Xuechen Wang,Shiwan Zhao,Haoqin Sun,Hui Wang,Jiaming Zhou,Yong Qin
机构: 未知
关键词: Multimodal emotion recognition, effective multimodal integration, demanding sophisticated methods, multimodal integration, Multimodal emotion
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

点击查看摘要

Abstract:Multimodal emotion recognition (MER), leveraging speech and text, has emerged as a pivotal domain within human-computer interaction, demanding sophisticated methods for effective multimodal integration. The challenge of aligning features across these modalities is significant, with most existing approaches adopting a singular alignment strategy. Such a narrow focus not only limits model performance but also fails to address the complexity and ambiguity inherent in emotional expressions. In response, this paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules. This framework enables a multi-level perception of emotional information across modalities. Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.
zh
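
Of the three granularities, the instance-based alignment is the most standard ingredient; one common realization is a symmetric InfoNCE contrastive loss between speech and text embeddings, sketched below with random features. The distribution- and token-level modules are not shown, and the actual losses in MGCMA may differ.

```python
import torch
import torch.nn.functional as F

def instance_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matching speech/text pairs attract, mismatched pairs repel."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

speech = torch.randn(8, 256)   # dummy utterance embeddings
text = torch.randn(8, 256)     # dummy transcript embeddings
print(instance_alignment_loss(speech, text).item())
```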

[NLP-73] From Generalist to Specialist: A Survey of Large Language Models for Chemistry COLING2025

【速读】: 该论文旨在解决大语言模型(LLMs)在化学领域应用中的局限性问题,特别是由于缺乏专门的化学数据和复杂多模态数据(如2D图、3D结构和光谱)所带来的挑战。论文的关键解决方案包括:1)提出将领域特定的化学知识和多模态信息整合到LLMs中的方法;2)将化学LLMs概念化为使用化学工具的代理(agents),并探讨其在加速科学研究中的潜力;3)总结现有用于评估LLMs化学能力的基准。通过这些方法,论文旨在为研究人员提供前沿的化学LLMs发展动态,并激发该领域的创新应用。

链接: https://arxiv.org/abs/2412.19994
作者: Yang Han,Ziping Wan,Lu Chen,Kai Yu,Xin Chen
机构: 未知
关键词: Large Language Models, natural language processing, Large Language, Pretrained Language Models, Language Models
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: COLING2025,We maintain an up-to-date Github repository at: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP). However, the predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry. The scarcity of specialized chemistry data, coupled with the complexity of multi-modal data such as 2D graph, 3D structure and spectrum, present distinct challenges. Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs. In this paper, we outline methodologies for incorporating domain-specific chemistry knowledge and multi-modal information into LLMs, we also conceptualize chemistry LLMs as agents using chemistry tools and investigate their potential to accelerate scientific research. Additionally, we conclude the existing benchmarks to evaluate chemistry ability of LLMs. Finally, we critically examine the current challenges and identify promising directions for future research. Through this comprehensive survey, we aim to assist researchers in staying at the forefront of developments in chemistry LLMs and to inspire innovative applications in the field.
zh

计算机视觉

[CV-0] PERSE: Personalized 3D Generative Avatars from A Single Portrait

【速读】: 该论文旨在解决如何从单张参考肖像构建可动画化的个性化生成式虚拟形象(avatar)的问题,并实现对面部属性的连续且解耦的编辑,同时保持个体的身份特征。解决方案的关键在于:首先,通过合成大规模合成的2D视频数据集,其中每个视频包含面部表情和视角的一致变化,并结合特定面部属性的变化;其次,提出了一种基于3D高斯泼溅(3D Gaussian Splatting)的个性化虚拟形象创建方法,学习一个连续且解耦的潜在空间,以实现直观的面部属性操控;最后,引入潜在空间正则化技术,通过插值的2D面部作为监督,确保潜在空间中的平滑过渡。与现有方法相比,PERSE能够生成高质量且保持参考人物身份的虚拟形象,并支持插值属性的编辑。

链接: https://arxiv.org/abs/2412.21206
作者: Hyunsoo Cha,Inhee Lee,Hanbyul Joo
机构: 未知
关键词: animatable personalized generative, facial attribute, building an animatable, latent space, facial attribute editing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We present PERSE, a method for building an animatable personalized generative avatar from a reference portrait. Our avatar model enables facial attribute editing in a continuous and disentangled latent space to control each facial attribute, while preserving the individual’s identity. To achieve this, our method begins by synthesizing large-scale synthetic 2D video datasets, where each video contains consistent changes in the facial expression and viewpoint, combined with a variation in a specific facial attribute from the original input. We propose a novel pipeline to produce high-quality, photorealistic 2D videos with facial attribute editing. Leveraging this synthetic attribute dataset, we present a personalized avatar creation method based on the 3D Gaussian Splatting, learning a continuous and disentangled latent space for intuitive facial attribute manipulation. To enforce smooth transitions in this latent space, we introduce a latent space regularization technique by using interpolated 2D faces as supervision. Compared to previous approaches, we demonstrate that PERSE generates high-quality avatars with interpolated attributes while preserving identity of reference person.
zh

[CV-1] Action-Agnostic Point-Level Supervision for Temporal Action Detection AAAI-25

【速读】: 该论文旨在解决时序动作检测(temporal action detection)中标注成本高的问题,提出了一种轻量级标注方案,即动作无关的点级监督(action-agnostic point-level supervision, AAPL)。传统点级监督要求标注者在未修剪的视频中搜索每个动作实例,而AAPL通过无监督方式采样少量视频帧,由标注者仅标注这些帧的动作类别,从而显著降低标注成本。论文还提出了一种检测模型和学习方法,以有效利用AAPL标注。通过在多个数据集(如THUMOS '14、FineAction、GTEA、BEOID和ActivityNet 1.3)上的广泛实验,验证了该方案在标注成本与检测性能之间的权衡上优于或与现有视频级和点级监督方法相当。

链接: https://arxiv.org/abs/2412.21205
作者: Shuhei M. Yoshida,Takashi Shibata,Makoto Terao,Takayuki Okatani,Masashi Sugiyama
机构: 未知
关键词: achieve accurate action, lightly annotated dataset, accurate action instance, achieve accurate, lightly annotated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: AAAI-25. Technical appendices included. 15 pages, 3 figures, 11 tables

点击查看摘要

Abstract:We propose action-agnostic point-level (AAPL) supervision for temporal action detection to achieve accurate action instance detection with a lightly annotated dataset. In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories. Unlike point-level supervision, which requires annotators to search for every action instance in an untrimmed video, frames to annotate are selected without human intervention in AAPL supervision. We also propose a detection model and learning method to effectively utilize the AAPL labels. Extensive experiments on the variety of datasets (THUMOS '14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between the annotation cost and detection performance.
zh

[CV-2] A Large-Scale Study on Video Action Dataset Condensation

【速读】: 该论文旨在解决视频数据集压缩(video dataset condensation)这一尚未充分探索的问题。与图像数据不同,视频数据具有额外的时间维度,其中包含大量冗余信息,使得压缩更为关键。论文通过大规模实证研究,系统设计和公平比较,填补了这一研究空白。解决方案的关键在于三个方面:首先,对视频数据进行时间维度处理;其次,建立全面的视频数据集压缩评估协议;最后,将压缩方法适配到时空域,并进行公平比较。研究得出了一些重要观察:样本多样性比时间多样性对视频数据集压缩更为关键,简单的滑动窗口采样方法有效,且在大多数情况下样本选择优于数据集蒸馏。通过在HMDB51、UCF101和Kinetics-400三个主流动作识别数据集上的实验,论文取得了当前最优的结果。

链接: https://arxiv.org/abs/2412.21197
作者: Yang Chen,Sheng Guo,Limin Wang
机构: 未知
关键词: made significant progress, video dataset condensation, made significant, significant progress, Dataset condensation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dataset condensation has made significant progress in the image domain. Unlike images, videos possess an additional temporal dimension, which harbors considerable redundant information, making condensation even more crucial. However, video dataset condensation still remains an underexplored area. We aim to bridge this gap by providing a large-scale empirical study with systematic design and fair comparison. Specifically, our work delves into three key aspects to provide valuable empirical insights: (1) temporal processing of video data, (2) establishing a comprehensive evaluation protocol for video dataset condensation, and (3) adaptation of condensation methods to the space-time domain and fair comparisons among them. From this study, we derive several intriguing observations: (i) sample diversity appears to be more crucial than temporal diversity for video dataset condensation, (ii) simple slide-window sampling proves to be effective, and (iii) sample selection currently outperforms dataset distillation in most cases. Furthermore, we conduct experiments on three prominent action recognition datasets (HMDB51, UCF101 and Kinetics-400) and achieve state-of-the-art results on all of them. Our code is available at this https URL.
zh
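
The "simple slide-window sampling" found to be effective can be pictured as cutting each video into fixed-length, overlapping windows and keeping one clip per window; a minimal index-level sketch follows, with window and stride values chosen arbitrarily.

```python
def sliding_window_clips(num_frames, window=16, stride=8):
    """Return frame-index lists for overlapping fixed-length clips of one video."""
    clips = []
    for start in range(0, max(num_frames - window, 0) + 1, stride):
        clips.append(list(range(start, start + window)))
    return clips

clips = sliding_window_clips(num_frames=48, window=16, stride=8)
print(len(clips), clips[0][:4], clips[-1][-4:])   # 5 clips over a 48-frame video
```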

[CV-3] What Makes for a Good Stereoscopic Image?

【速读】: 该论文旨在解决虚拟现实(VR)头戴设备中立体视觉体验质量(SQoE)的有效测量问题。现有立体视觉评估指标通常仅关注视觉不适或图像质量等单一因素,且面临数据限制。为此,作者提出了SCOPE(Stereoscopic COntent Preference Evaluation)数据集,该数据集包含真实和合成的立体图像,涵盖了多种常见的感知失真和伪影,并通过VR头戴设备收集了用户偏好标注。研究结果表明,不同头戴设备上的用户偏好具有较高的一致性。此外,作者还提出了iSQoE模型,该模型基于SCOPE数据集进行训练,用于评估立体视觉体验质量。实验表明,在比较单目到立体转换方法时,iSQoE比现有方法更能与人类偏好保持一致。解决方案的关键在于构建一个全面且多样化的数据集,并开发一个能够准确反映用户偏好的评估模型。

链接: https://arxiv.org/abs/2412.21127
作者: Netanel Y. Tamir,Shir Amir,Ranel Itzhaky,Noam Atia,Shobhita Sundaram,Stephanie Fu,Ron Sokolovsky,Phillip Isola,Tali Dekel,Richard Zhang,Miriam Farber
机构: 未知
关键词: effectively measuring stereoscopic, measuring stereoscopic quality, virtual reality, effectively measuring, immersive and comfortable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With rapid advancements in virtual reality (VR) headsets, effectively measuring stereoscopic quality of experience (SQoE) has become essential for delivering immersive and comfortable 3D experiences. However, most existing stereo metrics focus on isolated aspects of the viewing experience such as visual discomfort or image quality, and have traditionally faced data limitations. To address these gaps, we present SCOPE (Stereoscopic COntent Preference Evaluation), a new dataset comprised of real and synthetic stereoscopic images featuring a wide range of common perceptual distortions and artifacts. The dataset is labeled with preference annotations collected on a VR headset, with our findings indicating a notable degree of consistency in user preferences across different headsets. Additionally, we present iSQoE, a new model for stereo quality of experience assessment trained on our dataset. We show that iSQoE aligns better with human preferences than existing methods when comparing mono-to-stereo conversion methods.
zh

[CV-4] Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation

【速读】: 该论文旨在解决文本到3D生成(text-to-3D generation)在物体和场景级别上的高效生成问题。现有的方法通常需要较长的计算时间,且在处理复杂场景时存在保真度和几何结构上的挑战。为此,作者提出了Prometheus,一种基于3D感知的潜在扩散模型(latent diffusion model),能够在几秒钟内实现高质量的3D生成。其解决方案的关键在于将3D场景生成建模为多视角、前馈、像素对齐的3D高斯生成(3D Gaussian generation),并在潜在扩散范式下进行优化。此外,作者引入了RGB-D潜在空间(RGB-D latent space),以分离外观和几何信息,从而提升生成3D高斯的保真度和几何精度。该方法通过预训练的文本到图像生成模型进行微调,并利用大量单视角和多视角数据集进行训练,确保了模型的泛化能力和生成效果。实验结果表明,该方法在前馈3D高斯重建和文本到3D生成任务中均表现出色。

链接: https://arxiv.org/abs/2412.21117
作者: Yuanbo Yang,Jiahao Shao,Xinyang Li,Yujun Shen,Andreas Geiger,Yiyi Liao
机构: 未知
关键词: introduce Prometheus, latent diffusion, Gaussian generation, latent diffusion model, latent diffusion paradigm
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: this https URL
zh

[CV-5] Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

【速读】: 该论文旨在解决便携设备上实时、情境感知的智能助手(embodied smart assistant)的开发问题,特别是在智能手机和可穿戴相机等设备上实现“始终在线”(always on)的无缝交互与辅助。解决方案的关键在于构建一个基于自我中心视觉语言模型(egocentric vision-language model)的系统——Vinci,该系统能够实时处理长视频流,结合当前观察和历史上下文回答用户问题,并提供任务规划。此外,Vinci集成了视频生成模块,为需要详细指导的任务生成逐步视觉演示,从而增强用户体验。通过这一框架,Vinci旨在为用户提供情境化和可操作的洞察,推动便携式实时自我中心AI系统的发展。

链接: https://arxiv.org/abs/2412.21080
作者: Yifei Huang,Jilan Xu,Baoqi Pei,Yuping He,Guo Chen,Lijin Yang,Xinyuan Chen,Yaohui Wang,Zheng Nie,Jinyao Liu,Guoshun Fan,Dechen Lin,Fang Fang,Kunpeng Li,Chang Yuan,Yali Wang,Yu Qiao,Limin Wang
机构: 未知
关键词: embodied smart assistant, smart assistant built, egocentric vision-language model, real-time embodied smart, vision-language model
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an “always on” mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real-time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for the development of the device in conjunction with a demo web platform to test uploaded videos at this https URL.
zh

[CV-6] Edicho: Consistent Image Editing in the Wild

【速读】: 该论文旨在解决在真实场景图像中进行一致性编辑的技术挑战,这些挑战主要源于不可控的因素,如物体姿态、光照条件和摄影环境。为解决这一问题,论文提出了Edicho,一种基于扩散模型(diffusion models)的无训练解决方案,其核心设计原则是利用显式图像对应关系来指导编辑。关键组件包括一个注意力操纵模块(attention manipulation module)和一个经过精心优化的无分类器引导(classifier-free guidance, CFG)去噪策略,两者均考虑了预先估计的对应关系。该推理时算法具有即插即用的特性,并与大多数基于扩散的编辑方法(如ControlNet和BrushNet)兼容。实验结果表明,Edicho在多种设置下均能有效实现跨图像的一致性编辑。

链接: https://arxiv.org/abs/2412.21079
作者: Qingyan Bai,Hao Ouyang,Yinghao Xu,Qiuyu Wang,Ceyuan Yang,Ka Leong Cheng,Yujun Shen,Qifeng Chen
机构: 未知
关键词: technical challenge arising, lighting conditions, unmanageable factors, object poses, photography environments
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible to most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.
zh
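
Edicho's refined denoising strategy builds on standard classifier-free guidance, where conditional and unconditional noise predictions are blended with a guidance weight; the toy sketch below shows that base update on dummy arrays (the correspondence-aware attention manipulation is not modeled here).

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Standard classifier-free guidance combination of noise predictions."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.normal(size=(4, 4))   # unconditional prediction (dummy)
eps_c = rng.normal(size=(4, 4))   # condition-guided prediction (dummy)
print(cfg_noise(eps_u, eps_c).shape)
```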

[CV-7] Varformer: Adapting VAR's Generative Prior for Image Restoration

【速读】: 该论文旨在解决图像复原任务中如何有效利用生成模型捕捉干净图像的结构和统计特性,从而将退化特征转化为干净图像的问题。其关键解决方案在于利用新型图像生成范式VAR(Visual Autoregressive Model),该模型通过逐尺度预测方法,逐步捕捉图像的全局结构和细粒度细节,符合图像复原领域广泛认可的多尺度复原原则。此外,VAR在图像重建过程中自动调节输入,使得后续尺度的表示与干净图像的分布对齐。为了在图像复原任务中充分利用VAR的自适应分布对齐能力,论文将VAR中的多尺度潜在表示作为复原先验,进而设计了VarFormer框架。这一策略不仅显著提升了模型在未见任务上的泛化能力,还降低了训练计算成本。实验结果表明,VarFormer在多种图像复原任务中均优于现有的多任务图像复原方法。

链接: https://arxiv.org/abs/2412.21063
作者: Siyang Wang,Feng Zhao
机构: 未知
关键词: high-quality datasets effectively, transforming degraded features, Generative models trained, datasets effectively capture, extensive high-quality datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, rendering them powerful priors for transforming degraded features into clean ones in image restoration. VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach. It progressively captures both global structures and fine-grained details through the autoregressive process, consistent with the multi-scale restoration principle widely acknowledged in the restoration community. Furthermore, we observe that during the image reconstruction process utilizing VAR, scale predictions automatically modulate the input, facilitating the alignment of representations at subsequent scales with the distribution of clean images. To harness VAR’s adaptive distribution alignment capability in image restoration tasks, we formulate the multi-scale latent representations within VAR as the restoration prior, thus advancing our delicately designed VarFormer framework. The strategic application of these priors enables our VarFormer to achieve remarkable generalization on unseen tasks while also reducing training computational costs. Extensive experiments underscores that our VarFormer outperforms existing multi-task image restoration methods across various restoration tasks.
zh

[CV-8] VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

【速读】: 该论文旨在解决视觉生成模型(包括图像和视频生成)与人类偏好对齐的问题。其核心解决方案是构建了一个细粒度且多维度的奖励模型——VisionReward。该模型通过将人类对图像和视频的偏好分解为多个维度,每个维度由一系列判断问题表示,并通过线性加权求和生成一个可解释且准确的评分。为了应对视频质量评估的挑战,论文系统分析了视频的各种动态特征,使VisionReward在视频偏好预测上超越了VideoScore 17.2%,并达到了顶尖性能。基于VisionReward,论文进一步开发了一种多目标偏好学习算法,有效解决了偏好数据中的混杂因素问题。该方法在机器指标和人类评估上均显著优于现有的图像和视频评分方法。

链接: https://arxiv.org/abs/2412.21059
作者: Jiazheng Xu,Yu Huang,Jiale Cheng,Yuanming Yang,Jiajun Xu,Yuan Wang,Wenbo Duan,Shen Yang,Qunlin Jin,Shurun Li,Jiayan Teng,Zhuoyi Yang,Wendi Zheng,Xiao Liu,Ming Ding,Xiaohan Zhang,Xiaotao Gu,Shiyu Huang,Minlie Huang,Jie Tang,Yuxiao Dong
机构: 未知
关键词: aligning visual generation, visual generation models, visual generation, generation models, present a general
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages

点击查看摘要

Abstract:We present a general strategy to aligning visual generation models – both image and video generation – with human preference. To start with, we build VisionReward – a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed to an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at this https URL.
zh
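
The "series of judgment questions, linearly weighted and summed" can be pictured as a checklist score; the sketch below aggregates binary answers with per-question weights. The question texts and weights are invented for illustration.

```python
def checklist_reward(answers, weights):
    """Linear, interpretable aggregation of per-question judgments (1 = yes, 0 = no)."""
    assert answers.keys() == weights.keys()
    return sum(weights[q] * answers[q] for q in answers)

answers = {"subject matches prompt": 1, "no distorted hands": 0, "smooth motion": 1}
weights = {"subject matches prompt": 0.5, "no distorted hands": 0.3, "smooth motion": 0.2}
print(checklist_reward(answers, weights))   # 0.7
```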

[CV-9] E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

【速读】: 该论文旨在解决扩散模型(Diffusion Models)在生成建模中的几个固有局限性,包括训练与采样之间的差距(training-sampling gap)、渐进加噪过程中的信息泄露(information leakage),以及在训练过程中无法整合感知损失(perceptual loss)和对抗损失(adversarial loss)等问题。为解决这些挑战,论文提出了一种创新的端到端训练框架,通过直接优化最终的重构输出来对齐训练和采样过程。该框架的关键在于将训练过程视为从纯噪声到目标数据分布的直接映射,从而消除了训练与采样之间的差距,并减少了信息泄露。此外,该方法还允许在目标函数中整合感知损失和对抗损失,进一步提升了模型的性能。实验结果表明,该方法在COCO30K和HW30K等基准数据集上显著优于传统扩散模型,在FID和CLIP分数上取得了更优的结果,即使在减少采样步骤的情况下也能保持高效性。这些发现凸显了端到端训练在推动扩散基生成模型向更鲁棒和高效解决方案发展中的潜力。

链接: https://arxiv.org/abs/2412.21044
作者: Zhiyu Tan,WenXu Qian,Hesen Chen,Mengping Yang,Lei Chen,Hao Li
机构: 未知
关键词: powerful framework, training-sampling gap, generative modeling, Diffusion models, perceptual and adversarial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report, to be further updated

点击查看摘要

Abstract:Diffusion models have emerged as a powerful framework for generative modeling, achieving state-of-the-art performance across various tasks. However, they face several inherent limitations, including a training-sampling gap, information leakage in the progressive noising process, and the inability to incorporate advanced loss functions like perceptual and adversarial losses during training. To address these challenges, we propose an innovative end-to-end training framework that aligns the training and sampling processes by directly optimizing the final reconstruction output. Our method eliminates the training-sampling gap, mitigates information leakage by treating the training process as a direct mapping from pure noise to the target data distribution, and enables the integration of perceptual and adversarial losses into the objective. Extensive experiments on benchmarks such as COCO30K and HW30K demonstrate that our approach consistently outperforms traditional diffusion models, achieving superior results in terms of FID and CLIP score, even with reduced sampling steps. These findings highlight the potential of end-to-end training to advance diffusion-based generative models toward more robust and efficient solutions.
zh

[CV-10] Visual Style Prompt Learning Using Diffusion Models for Blind Face Restoration

【速读】: 该论文旨在解决盲人脸恢复(Blind Face Restoration)中的挑战,即从各种未知的退化源中恢复高质量的人脸图像。由于退化图像中可提取的信息极为有限,传统基于先验知识的方法(如几何先验和面部特征)在捕捉细节方面往往表现不足。为此,论文提出了一种视觉风格提示学习框架(Visual Style Prompt Learning Framework),利用扩散概率模型(Diffusion Probabilistic Models)在预训练生成模型的潜在空间中显式生成视觉提示(Visual Prompts),以指导恢复过程。此外,论文还引入了风格调制聚合变换层(Style-Modulated Aggregation Transformation Layer),以充分利用视觉提示并增强信息丰富模式的提取。实验和应用结果表明,该方法在实现高质量盲人脸恢复方面具有显著优势。

链接: https://arxiv.org/abs/2412.21042
作者: Wanglong Lu,Jikai Wang,Tao Wang,Kaihao Zhang,Xianta Jiang,Hanli Zhao
机构: 未知
关键词: posing significant challenges, significant challenges due, minimal information retrievable, Blind face restoration, face restoration aims
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Published at Pattern Recognition; 13 pages, 11 figures

点击查看摘要

Abstract:Blind face restoration aims to recover high-quality facial images from various unidentified sources of degradation, posing significant challenges due to the minimal information retrievable from the degraded images. Prior knowledge-based methods, leveraging geometric priors and facial features, have led to advancements in face restoration but often fall short of capturing fine details. To address this, we introduce a visual style prompt learning framework that utilizes diffusion probabilistic models to explicitly generate visual prompts within the latent space of pre-trained generative models. These prompts are designed to guide the restoration process. To fully utilize the visual prompts and enhance the extraction of informative and rich patterns, we introduce a style-modulated aggregation transformation layer. Extensive experiments and applications demonstrate the superiority of our method in achieving high-quality blind face restoration. The source code is available at this https URL.
zh

[CV-11] Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline ECIR2025

【速读】: 该论文旨在解决基于内容的跨模态检索(cross-modal retrieval)在特定领域实体和长尾概念(long-tail concepts)识别上的不足,特别是在识别特定个体时的挑战。论文提出了身份感知的跨模态检索(identity-aware cross-modal retrieval)任务,目标是根据自然语言查询在特定上下文中检索人物图像。这一任务在个性化视频集合搜索或国家广播机构维护的大型音视频档案浏览等场景中尤为重要。解决方案的关键在于引入了一个新的数据集 COCO Person FaceSwap (COCO-PFS),该数据集基于广泛使用的 COCO 数据集,并通过 VGGFace2 生成的深度伪造(deepfake)人脸进行了增强,以解决训练和评估模型所需的大规模数据集的缺乏问题。此外,论文还提出了身份感知的 CLIP 模型(Identity-aware CLIP, Id-CLIP),通过有针对性的微调(fine-tuning)实现了具有竞争力的检索性能,为识别长尾身份和上下文细微差别的更鲁棒的跨模态检索系统奠定了基础。

链接: https://arxiv.org/abs/2412.21009
作者: Nicola Messina,Lucia Vadicamo,Leo Maltese,Claudio Gennaro
机构: 未知
关键词: shared embedding space, significantly enhanced content-based, Recent advancements, enhanced content-based retrieval, embedding space
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted as full paper at ECIR 2025

点击查看摘要

Abstract:Recent advancements in deep learning have significantly enhanced content-based retrieval methods, notably through models like CLIP that map images and texts into a shared embedding space. However, these methods often struggle with domain-specific entities and long-tail concepts absent from their training data, particularly in identifying specific individuals. In this paper, we explore the task of identity-aware cross-modal retrieval, which aims to retrieve images of persons in specific contexts based on natural language queries. This task is critical in various scenarios, such as for searching and browsing personalized video collections or large audio-visual archives maintained by national broadcasters. We introduce a novel dataset, COCO Person FaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched with deepfake-generated faces from VGGFace2. This dataset addresses the lack of large-scale datasets needed for training and evaluating models for this task. Our experiments assess the performance of different CLIP variations repurposed for this task, including our architecture, Identity-aware CLIP (Id-CLIP), which achieves competitive retrieval performance through targeted fine-tuning. Our contributions lay the groundwork for more robust cross-modal retrieval systems capable of recognizing long-tail identities and contextual nuances. Data and code are available at this https URL.
zh

[CV-12] UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI

【速读】: 该论文旨在解决在开放世界中,具身视觉智能体(embodied visual agents)在复杂场景中的视觉导航和跟踪问题,特别是针对当前基于强化学习(RL)和大规模视觉语言模型(VLMs)的智能体在动态场景中的闭环控制延迟以及对非结构化地形中三维空间结构的推理能力不足等挑战。解决方案的关键在于引入了UnrealZoo,这是一个基于Unreal Engine构建的高逼真3D虚拟世界集合,能够反映开放世界的复杂性和多样性。通过UnrealCV提供的Python API和工具,研究人员可以高效地进行数据收集、环境增强、分布式训练和基准测试。此外,UnrealCV的渲染和通信效率优化支持了多智能体交互等高级应用。实验结果表明,多样化的训练环境对强化学习智能体的性能提升具有显著优势,同时也揭示了当前具身视觉智能体在开放世界中的技术瓶颈。

链接: https://arxiv.org/abs/2412.20977
作者: Fangwei Zhong,Kui Wu,Churan Wang,Hao Chen,Hai Ci,Zhoujun Li,Yizhou Wang
机构: 未知
关键词: Unreal Engine, virtual worlds built, built on Unreal, introduce UnrealZoo, designed to reflect
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this http URL

点击查看摘要

Abstract:We introduce UnrealZoo, a rich collection of photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of the open worlds. Additionally, we offer a variety of playable entities for embodied AI agents. Based on UnrealCV, we provide a suite of easy-to-use Python APIs and tools for various potential applications, such as data collection, environment augmentation, distributed training, and benchmarking. We optimize the rendering and communication efficiency of UnrealCV to support advanced applications, such as multi-agent interaction. Our experiments benchmark agents in various complex scenes, focusing on visual navigation and tracking, which are fundamental capabilities for embodied visual intelligence. The results yield valuable insights into the advantages of diverse training environments for reinforcement learning (RL) agents and the challenges faced by current embodied vision agents, including those based on RL and large vision-language models (VLMs), in open worlds. These challenges involve latency in closed-loop control in dynamic scenes and reasoning about 3D spatial structures in unstructured terrain.
zh

[CV-13] FPGA-based Acceleration of Neural Network for Image Classification using Vitis AI

【速读】: 该论文旨在解决卷积神经网络(CNNs)在计算机视觉应用中因复杂架构导致的CPU或GPU计算吞吐量不足或功耗过高的问题。为解决这些限制,论文提出使用专用硬件加速计算负载,具体方案是在Xilinx Zynq UltraScale+ MPSoC ZCU104 FPGA评估板上利用Vitis-AI加速CNN的图像分类任务,使用CIFAR-10数据集进行验证。该方案的关键在于通过FPGA硬件加速,显著提升了计算吞吐量和能效,分别比CPU和GPU基线提高了3.33-5.82倍和3.39-6.30倍,展示了其在提取二维特征以支持下游任务(如深度估计和三维重建)中的潜力。

链接: https://arxiv.org/abs/2412.20974
作者: Zhengdong Li,Frederick Ziyang Hong,C. Patrick Yue
机构: 未知
关键词: Convolutional Neural Networks, Convolutional Neural, Neural Networks, recent years, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:In recent years, Convolutional Neural Networks (CNNs) have been widely adopted in computer vision. Complex CNN architecture running on CPU or GPU has either insufficient throughput or prohibitive power consumption. Hence, there is a need to have dedicated hardware to accelerate the computation workload to solve these limitations. In this paper, we accelerate a CNN for image classification with the CIFAR-10 dataset using Vitis-AI on Xilinx Zynq UltraScale+ MPSoC ZCU104 FPGA evaluation board. The work achieves 3.33-5.82x higher throughput and 3.39-6.30x higher energy efficiency than CPU and GPU baselines. It shows the potential to extract 2D features for downstream tasks, such as depth estimation and 3D reconstruction.
zh

[CV-14] Hierarchical Banzhaf Interaction for General Video-Language Representation Learning

【速读】: 该论文旨在解决视频-语言表示学习(video-language representation learning)中粗粒度全局交互(coarse-grained global interactions)的局限性,提出了一种基于多元合作博弈理论(multivariate cooperative game theory)的细粒度多模态学习方法。其核心解决方案包括:1) 引入分层Banzhaf交互(Hierarchical Banzhaf Interaction)模型,从多层次视角模拟视频片段与文本词汇之间的细粒度对应关系;2) 通过融合单模态(single-modal)和跨模态(cross-modal)组件重构表示,以减少Banzhaf交互计算中的偏差,同时确保细粒度与单模态表示相当,并保留跨模态表示的自适应编码特性;3) 将原始结构扩展为灵活的编码器-解码器框架,使模型能够适应多种下游任务。实验结果表明,该方法在文本-视频检索、视频问答和视频字幕生成等任务中表现出色,验证了其有效性和泛化能力。

链接: https://arxiv.org/abs/2412.20964
作者: Peng Jin,Hao Li,Li Yuan,Shuicheng Yan,Jie Chen
机构: 未知
关键词: artificial intelligence domain, intelligence domain, artificial intelligence, plays an important, important role
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

点击查看摘要

Abstract:Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video-question answering, and video captioning benchmarks, with superior performance, validate the effectiveness and generalization of our method.
zh
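The Hierarchical Banzhaf Interaction above builds on the classical Banzhaf interaction index from cooperative game theory. As a reference point, the snippet below is a tiny exact (exponential-time) implementation of the pairwise index for a toy value function; the hierarchical, learned variant in the paper is far more involved, so treat this only as an illustration of the underlying quantity, with the player names and value function being made-up examples.

```python
# Exact pairwise Banzhaf interaction for a small cooperative game (illustration only).
from itertools import combinations

def banzhaf_interaction(players, value, i, j):
    """I(i, j) = 1 / 2^(n-2) * sum over S in N\{i,j} of [v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S)]"""
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = set(subset)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    return total / (2 ** len(others))

# Toy value function: a coalition's value is its squared size, so any two
# players interact positively (the whole is worth more than the parts).
players = ["clip_1", "clip_2", "word_1", "word_2"]
value = lambda coalition: float(len(coalition)) ** 2
print(banzhaf_interaction(players, value, "clip_1", "word_1"))  # 2.0
```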

[CV-15] Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

【速读】: 该论文旨在解决多模态大语言模型(MLLMs)在复杂场景中准确识别和计数物体以及确定其空间位置方面的挑战,特别是在物体重叠或较小的情况下。为解决这些局限性,论文提出了一种基于多模态检索增强生成(RAG)的新框架,该框架引入了结构化场景图(scene graphs)以增强图像中的物体识别、关系识别和空间理解能力。该框架的关键在于通过结构化场景图提升MLLMs在处理需要精确视觉描述任务时的性能,尤其是在具有挑战性视角(如鸟瞰图或密集物体排列场景)的情况下。实验结果表明,该方法在VG-150和AUG数据集上的视觉问答(VQA)任务中表现优异,显著提升了物体识别、定位和量化能力,并提供了更准确的视觉描述。

链接: https://arxiv.org/abs/2412.20927
作者: Junxiao Xue,Quan Deng,Fei Yu,Yanhao Wang,Jun Wang,Yuehua Li
机构: 未知
关键词: large language models, made significant progress, Multimodal large language, visual question answering, language models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, under review

点击查看摘要

Abstract:Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and Flamingo, have made significant progress in integrating visual and textual modalities, excelling in tasks like visual question answering (VQA), image captioning, and content retrieval. They can generate coherent and contextually relevant descriptions of images. However, they still face challenges in accurately identifying and counting objects and determining their spatial locations, particularly in complex scenes with overlapping or small objects. To address these limitations, we propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images. Our framework improves the MLLM’s capacity to handle tasks requiring precise visual descriptions, especially in scenarios with challenging perspectives, such as aerial views or scenes with dense object arrangements. Finally, we conduct extensive experiments on the VG-150 dataset that focuses on first-person visual understanding and the AUG dataset that involves aerial imagery. The results show that our approach consistently outperforms existing MLLMs in VQA tasks, which stands out in recognizing, localizing, and quantifying objects in different spatial contexts and provides more accurate visual descriptions.
zh
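To make the scene-graph idea concrete, here is a minimal, framework-free sketch of serializing a structured scene graph into a VQA prompt; the triple format, prompt template, and example objects are illustrative assumptions, and the actual retrieval step and MLLM call are omitted.

```python
# Serialize a toy scene graph into text that can be prepended to a VQA question.
def scene_graph_to_text(objects, relations):
    """objects: list of (name, bbox); relations: list of (subject, predicate, object)."""
    obj_lines = [f"- {name} at {bbox}" for name, bbox in objects]
    rel_lines = [f"- {s} {p} {o}" for s, p, o in relations]
    return "Objects:\n" + "\n".join(obj_lines) + "\nRelations:\n" + "\n".join(rel_lines)

def build_vqa_prompt(question, objects, relations):
    graph = scene_graph_to_text(objects, relations)
    return f"{graph}\n\nUsing the scene graph above and the image, answer: {question}"

objects = [("person", (12, 30, 80, 200)), ("bicycle", (60, 90, 180, 210))]
relations = [("person", "riding", "bicycle")]
print(build_vqa_prompt("How many people are in the image?", objects, relations))
```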

[CV-16] HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization

【速读】: 该论文旨在解决计算病理学中组织语义分割(tissue semantic segmentation)任务中,由于像素级标注获取成本高且耗时,导致基于类激活图(CAM, Class Activation Map)的弱监督学习方法存在激活不足(under-activation)和过度激活(over-activation)问题,从而影响分割性能的挑战。为解决这一问题,论文提出了一种基于图像混合合成(image-mixing synthesis)和一致性正则化(consistency regularization)的弱监督语义分割框架,称为HisynSeg。该框架通过生成带有像素级掩码的合成组织病理图像进行全监督模型训练,其中提出了基于马赛克变换(Mosaic transformation)和贝塞尔掩码生成(Bézier mask generation)的两种合成策略,并开发了图像过滤模块以确保合成图像的真实性。此外,为避免模型过度拟合合成图像中的偶然伪影,论文还提出了一种自监督一致性正则化方法,利用无分割掩码的真实图像对分割模型的训练进行监督。通过整合这些技术,HisynSeg成功将弱监督语义分割问题转化为全监督问题,显著提高了分割精度。

链接: https://arxiv.org/abs/2412.20924
作者: Zijie Fang,Yifeng Wang,Peizhang Xie,Zhi Wang,Yongbing Zhang
机构: 未知
关键词: Tissue semantic segmentation, pixel-level tissue segmentation, weakly-supervised semantic segmentation, computational pathology, key tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on Medical Imaging

点击查看摘要

Abstract:Tissue semantic segmentation is one of the key tasks in computational pathology. To avoid the expensive and laborious acquisition of pixel-level annotations, a wide range of studies attempt to adopt the class activation map (CAM), a weakly-supervised learning scheme, to achieve pixel-level tissue segmentation. However, CAM-based methods are prone to suffer from under-activation and over-activation issues, leading to poor segmentation performance. To address this problem, we propose a novel weakly-supervised semantic segmentation framework for histopathological images based on image-mixing synthesis and consistency regularization, dubbed HisynSeg. Specifically, synthesized histopathological images with pixel-level masks are generated for fully-supervised model training, where two synthesis strategies are proposed based on Mosaic transformation and Bézier mask generation. Besides, an image filtering module is developed to guarantee the authenticity of the synthesized images. In order to further avoid the model overfitting to the occasional synthesis artifacts, we additionally propose a novel self-supervised consistency regularization, which enables the real images without segmentation masks to supervise the training of the segmentation model. By integrating the proposed techniques, the HisynSeg framework successfully transforms the weakly-supervised semantic segmentation problem into a fully-supervised one, greatly improving the segmentation accuracy. Experimental results on three datasets prove that the proposed method achieves a state-of-the-art performance. Code is available at this https URL.
zh
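As a rough illustration of the Mosaic-based synthesis strategy, the sketch below stitches four labeled patches into one synthetic image with a matching pixel-level mask; the random split-point layout is an assumption, and the Bézier mask generation and image-filtering module of HisynSeg are not modeled here.

```python
# Mosaic-style image-mixing synthesis with a paired pixel-level label map.
import numpy as np

def mosaic_synthesize(images, masks, out_size: int = 256, seed: int = 0):
    """images/masks: lists of four HxWx3 images and HxW label maps (uint8)."""
    rng = np.random.default_rng(seed)
    cx, cy = rng.integers(out_size // 4, 3 * out_size // 4, size=2)  # random split point
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    label = np.zeros((out_size, out_size), dtype=np.uint8)
    boxes = [(0, cy, 0, cx), (0, cy, cx, out_size), (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for (y0, y1, x0, x1), img, msk in zip(boxes, images, masks):
        h, w = y1 - y0, x1 - x0
        canvas[y0:y1, x0:x1] = img[:h, :w]   # crop each source patch to its quadrant
        label[y0:y1, x0:x1] = msk[:h, :w]
    return canvas, label

# Toy usage with four flat-colored "tissue" patches and their class ids.
imgs = [np.full((256, 256, 3), 60 * k, dtype=np.uint8) for k in range(4)]
msks = [np.full((256, 256), k, dtype=np.uint8) for k in range(4)]
img, msk = mosaic_synthesize(imgs, msks)
print(img.shape, msk.shape, np.unique(msk))
```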

[CV-17] Low-Light Image Enhancement via Generative Perceptual Priors AAAI2025

【速读】: 该论文旨在解决低光照图像增强(Low-Light Image Enhancement, LLIE)在现实场景中应用时面临的挑战,特别是由于多样化光照条件导致的图像增强效果不一致的问题。此外,现有方法在生成视觉上逼真且吸引人的增强图像方面仍存在不足。为解决这些问题,论文提出了一种基于生成式感知先验(Generative Perceptual Priors, GPP-LLIE)的新框架,该框架利用视觉语言模型(Vision-Language Models, VLMs)来指导增强过程。关键解决方案包括:首先,设计了一个管道,通过VLMs评估低光照图像的多个视觉属性,并量化这些评估以输出全局和局部感知先验;其次,在扩散过程中引入了一个基于Transformer的骨干网络,并开发了由全局和局部感知先验指导的新型层归一化(GPP-LN)和注意力机制(LPP-Attn)。实验表明,该模型在配对的低光照数据集上优于当前的最先进方法,并在真实世界数据上表现出优异的泛化能力。

链接: https://arxiv.org/abs/2412.20916
作者: Han Zhou,Wei Dong,Xiaohong Liu,Yulun Zhang,Guangtao Zhai,Jun Chen
机构: 未知
关键词: retrieving texture details, illumination conditions encountered, diverse illumination conditions, Low-Light Image Enhancement, applying current Low-Light
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Although significant progress has been made in enhancing visibility, retrieving texture details, and mitigating noise in Low-Light (LL) images, the challenge persists in applying current Low-Light Image Enhancement (LLIE) methods to real-world scenarios, primarily due to the diverse illumination conditions encountered. Furthermore, the quest for generating enhancements that are visually realistic and attractive remains an underexplored realm. In response to these challenges, we introduce a novel LLIE framework with the guidance of Generative Perceptual Priors (GPP-LLIE) derived from vision-language models (VLMs). Specifically, we first propose a pipeline that guides VLMs to assess multiple visual attributes of the LL image and quantify the assessment to output the global and local perceptual priors. Subsequently, to incorporate these generative perceptual priors to benefit LLIE, we introduce a transformer-based backbone in the diffusion process, and develop a new layer normalization (GPP-LN) and an attention mechanism (LPP-Attn) guided by global and local perceptual priors. Extensive experiments demonstrate that our model outperforms current SOTA methods on paired LL datasets and exhibits superior generalization on real-world data. The code is released at this https URL.
zh
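The snippet below sketches one plausible form of a prior-modulated layer normalization in the spirit of GPP-LN: a quantified perceptual prior predicts a per-channel scale and shift applied to normalized features (an AdaLN-style design). The exact formulation in the paper may differ, so the module name, shapes, and zero initialization are assumptions.

```python
# Layer normalization modulated by a global perceptual prior (illustrative sketch).
import torch
import torch.nn as nn

class PriorModulatedLayerNorm(nn.Module):
    def __init__(self, dim: int, prior_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(prior_dim, 2 * dim)
        nn.init.zeros_(self.to_scale_shift.weight)   # start out as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        """
        x:     (B, N, dim) token features inside the denoising backbone
        prior: (B, prior_dim) quantified global perceptual prior from the VLM
        """
        scale, shift = self.to_scale_shift(prior).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

layer = PriorModulatedLayerNorm(dim=256, prior_dim=8)
out = layer(torch.randn(2, 64, 256), torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 64, 256])
```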

[CV-18] TiGDistill-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning Distillation

【速读】: 该论文旨在解决多视角3D目标检测中LiDAR(激光雷达)与相机数据表示之间的固有差异问题,以提升基于相机的检测器性能。现有的方法通常通过深度监督和鸟瞰图(BEV)特征蒸馏等方式利用LiDAR的精确空间信息来增强相机检测器,但由于两种传感器数据表示的差异,这些方法面临挑战。论文提出的解决方案TiGDistill-BEV通过结合两种传感器的优势,有效地弥合了这一差距。其核心在于利用目标内部几何学习(Target Inner-Geometry learning)方案,从多模态(如LiDAR)中提取知识作为教师模型,指导基于相机的学生检测器。具体而言,论文提出了两个关键模块:内部深度监督模块(inner-depth supervision module),用于学习对象内部的低层次相对深度关系,从而增强检测器对对象级空间结构的理解;以及内部特征BEV蒸馏模块(inner-feature BEV distillation module),用于传递前景目标中不同关键点的高层次语义信息。此外,为了进一步缓解领域差距,论文还引入了跨通道和跨关键点蒸馏来建模特征相似性。通过在nuScenes基准测试上的广泛实验,TiGDistill-BEV显著提升了仅基于相机的检测器性能,达到了62.8%的NDS(NuScenes Detection Score),超越了现有方法。

链接: https://arxiv.org/abs/2412.20911
作者: Shaoqing Xu,Fang Li,Peixiang Huang,Ziying Song,Zhi-Xin Yang
机构: 未知
关键词: Accurate multi-view, autonomous driving, detection is essential, essential for applications, Accurate
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures. arXiv admin note: substantial text overlap with arXiv:2212.13979

点击查看摘要

Abstract:Accurate multi-view 3D object detection is essential for applications such as autonomous driving. Researchers have consistently aimed to leverage LiDAR’s precise spatial information to enhance camera-based detectors through methods like depth supervision and bird-eye-view (BEV) feature distillation. However, existing approaches often face challenges due to the inherent differences between LiDAR and camera data representations. In this paper, we introduce TiGDistill-BEV, a novel approach that effectively bridges this gap by leveraging the strengths of both sensors. Our method distills knowledge from diverse modalities (e.g., LiDAR) as the teacher model to a camera-based student detector, utilizing the Target Inner-Geometry learning scheme to enhance camera-based BEV detectors through both depth and BEV features. Specifically, we propose two key modules: an inner-depth supervision module to learn the low-level relative depth relations within objects, which equips detectors with a deeper understanding of object-level spatial structures, and an inner-feature BEV distillation module to transfer high-level semantics of different key points within foreground targets. To further alleviate the domain gap, we incorporate both inter-channel and inter-keypoint distillation to model feature similarity. Extensive experiments on the nuScenes benchmark demonstrate that TiGDistill-BEV significantly boosts camera-based detectors, achieving a state-of-the-art 62.8% NDS and surpassing previous methods by a significant margin. The code is available at: this https URL.
zh

[CV-19] WalkVLM:Aid Visually Impaired People Walking by Vision Language Model

【速读】: 该论文旨在解决视觉障碍者在行走过程中缺乏实时、简洁且信息丰富的辅助提醒的问题。现有视觉语言模型(VLMs)在处理盲人行走任务时存在响应冗余和推理效率低下的挑战。为解决这一问题,论文首先发布了一个多样、广泛且无偏的行走感知数据集,包含来自欧洲和亚洲的12k视频-手动标注对,为盲人行走任务提供了统一的训练和测试基准。其次,论文提出了WalkVLM模型,该模型采用思维链(chain of thought)进行分层规划,以生成简洁但信息丰富的提醒,并利用时间感知自适应预测(temporal-aware adaptive prediction)来减少提醒的时间冗余。通过这些关键创新,论文为盲人行走任务建立了坚实的基准,并验证了WalkVLM在流视频处理中的优势。

链接: https://arxiv.org/abs/2412.20903
作者: Zhiqiang Yuan,Ting Zhang,Jiapei Zhang,Jie Zhou,Jinchao Zhang
机构: 未知
关键词: offer walking assistance, million individuals, visual impairment, making it crucial, blind walking task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people. With the recent progress of vision-language models (VLMs), employing VLMs to improve this field has emerged as a popular research topic. However, most existing methods are studied on self-built question-answering datasets, lacking a unified training and testing benchmark for walk guidance. Moreover, in blind walking task, it is necessary to perform real-time streaming video parsing and generate concise yet informative reminders, which poses a great challenge for VLMs that suffer from redundant responses and low inference efficiency. In this paper, we firstly release a diverse, extensive, and unbiased walking awareness dataset, containing 12k video-manual annotation pairs from Europe and Asia to provide a fair training and testing benchmark for blind walking task. Furthermore, a WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we have established a solid benchmark for blind walking task and verified the advantages of WalkVLM in stream video processing for this task compared to other VLMs. Our dataset and code will be released at anonymous link this https URL.
zh

[CV-20] ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

【速读】: 该论文旨在解决高质量动画贴纸(animated stickers)中透明通道(transparent channels)生成的问题。现有方法主要分为视频抠图算法(video matting algorithms)和基于扩散的算法(diffusion-based algorithms),前者在处理半开放区域时表现不佳,后者则主要用于单张图像建模,导致在动画贴纸建模时出现局部闪烁问题。论文提出了一种名为ILDiff(Implicit Layout Distillation)的方法,通过隐式布局蒸馏生成动画透明通道,解决了现有方法中半开放区域崩溃和未考虑时间信息的问题。此外,论文还创建了包含32万高质量样本的透明动画贴纸数据集(Transparent Animated Sticker Dataset, TASD),为相关领域提供了数据支持。实验表明,ILDiff在生成更精细和平滑的透明通道方面优于其他方法,如Matting Anything和Layer Diffusion。

链接: https://arxiv.org/abs/2412.20901
作者: Ting Zhang,Zhiqiang Yuan,Yeshuang Zhu,Jinchao Zhang
机构: 未知
关键词: current video generation, video generation models, High-quality animated stickers, animated, transparent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-quality animated stickers usually contain transparent channels, which are often ignored by current video generation models. To generate fine-grained animated transparency channels, existing methods can be roughly divided into video matting algorithms and diffusion-based algorithms. The methods based on video matting have poor performance in dealing with semi-open areas in stickers, while diffusion-based methods are often used to model a single image, which will lead to local flicker when modeling animated stickers. In this paper, we firstly propose an ILDiff method to generate animated transparent channels through implicit layout distillation, which solves the problems of semi-open area collapse and no consideration of temporal information in existing methods. Secondly, we create the Transparent Animated Sticker Dataset (TASD), which contains 0.32M high-quality samples with transparent channel, to provide data support for related fields. Extensive experiments demonstrate that ILDiff can produce finer and smoother transparent channels compared to other methods such as Matting Anything and Layer Diffusion. Our code and dataset will be released at link this https URL.
zh

[CV-21] DDIM sampling for Generative AIBIM, a faster intelligent structural design framework

【速读】: 该论文旨在解决生成式 AIBIM(Generative AIBIM)中基于物理条件的扩散模型(PCDM)生成设计时计算效率低下的问题。PCDM依赖于去噪扩散概率模型(DDPM)采样过程,每次生成需要1000次迭代,导致生成过程耗时且计算资源需求高。为解决这一问题,论文引入了去噪扩散隐式模型(DDIM),并设计了“DDIM sampling for PCDM”,通过修改原始DDIM公式以适应PCDM的优化过程。实验结果表明,该方法能够将PCDM的生成过程加速100倍,同时保持生成结果的视觉质量不变。该研究的关键在于通过DDIM采样方法显著提升了PCDM的计算效率,为智能结构设计提供了更高效的解决方案。

链接: https://arxiv.org/abs/2412.20899
作者: Zhili He,Yu-Hsing Wang
机构: 未知
关键词: Generative AIBIM, specific physical conditions, creative shear wall, intelligently generate high-quality, shear wall designs
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: the 10th International Conference on Innovative Production and Construction (IPC 2024), Perth, Australia. this https URL

点击查看摘要

Abstract:Generative AIBIM, a successful structural design pipeline, has proven its ability to intelligently generate high-quality, diverse, and creative shear wall designs that are tailored to specific physical conditions. However, the current module of Generative AIBIM that generates designs, known as the physics-based conditional diffusion model (PCDM), necessitates 1000 iterations for each generation due to its reliance on the denoising diffusion probabilistic model (DDPM) sampling process. This leads to a time-consuming and computationally demanding generation process. To address this issue, this study introduces the denoising diffusion implicit model (DDIM), an accelerated generation method that replaces the DDPM sampling process in PCDM. While the original DDIM was designed for DDPM and the optimization process of PCDM differs from that of DDPM, this paper designs “DDIM sampling for PCDM,” which modifies the original DDIM formulations to adapt to the optimization process of PCDM. Experimental results demonstrate that DDIM sampling for PCDM can accelerate the generation process of the original PCDM by a factor of 100 while maintaining the same visual quality in the generated results. This study effectively showcases the effectiveness of DDIM sampling for PCDM in expediting intelligent structural design. Furthermore, this paper reorganizes the contents of DDIM, focusing on the practical usage of DDIM. This change is particularly meaningful for researchers who may not possess a strong background in machine learning theory but are interested in utilizing the tool effectively.
zh
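For readers unfamiliar with DDIM, the sketch below shows the standard deterministic DDIM update (eta = 0) with a stand-in noise predictor; it illustrates the generic sampler that the paper adapts to PCDM rather than the authors' implementation, and the linear beta schedule, step count, and toy model are assumptions.

```python
# Minimal deterministic DDIM sampler (eta = 0) with a dummy epsilon-predictor.
import torch

def toy_eps_model(x_t: torch.Tensor, t: int) -> torch.Tensor:
    # Placeholder noise predictor; a real model would be a trained U-Net.
    return 0.1 * x_t

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bar: torch.Tensor, num_steps: int = 50):
    """Run DDIM from pure noise using `num_steps` of the len(alpha_bar) training timesteps."""
    T = alpha_bar.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps).long()
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = eps_model(x, int(t))
        # Predict x_0 from the current noisy sample, then step deterministically to t_prev.
        x0_pred = (x - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
        x = torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps
    return x

# Linear beta schedule for illustration only.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
sample = ddim_sample(toy_eps_model, (1, 3, 32, 32), alpha_bar, num_steps=50)
print(sample.shape)  # torch.Size([1, 3, 32, 32])
```

Because the update is deterministic, reducing the number of sampling steps from 1000 to a few dozen is what yields the large speed-up reported above.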

[CV-22] Towards Compatible Fine-tuning for Vision-Language Model Updates

【速读】: 该论文旨在解决现有高效微调方法在基础模型更新后,其插拔式模块(plug-and-play modules)是否仍然有效的问题。研究发现,许多高性能的微调方法在模型升级后无法保持兼容性。为解决这一问题,论文提出了一种新颖的方法——类条件上下文优化(Class-conditioned Context Optimization, ContCoOp),其关键创新在于通过注意力层将可学习的提示(learnable prompts)与类别嵌入(class embeddings)结合,再输入到文本编码器中。这种设计使得提示能够动态适应嵌入空间的变化(由模型更新引起),从而确保其持续有效性。实验结果表明,ContCoOp在15个数据集上均表现出最高的兼容性,并具有强大的分布外泛化能力。

链接: https://arxiv.org/abs/2412.20895
作者: Zhengbo Wang,Jian Liang,Lijun Sheng,Ran He,Zilei Wang,Tieniu Tan
机构: 未知
关键词: tasks by learning, popular strategy, strategy for enhancing, enhancing the capabilities, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:So far, efficient fine-tuning has become a popular strategy for enhancing the capabilities of foundation models on downstream tasks by learning plug-and-play modules. However, existing methods overlook a crucial issue: if the underlying foundation model is updated, are these plug-and-play modules still effective? In this paper, we first conduct a detailed analysis of various fine-tuning methods on the CLIP in terms of their compatibility with model updates. The study reveals that many high-performing fine-tuning methods fail to be compatible with the upgraded models. To address this, we propose a novel approach, Class-conditioned Context Optimization (ContCoOp), which integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder. Consequently, the prompts can dynamically adapt to the changes in embedding space (due to model updates), ensuring continued effectiveness. Extensive experiments over 15 datasets show that our ContCoOp achieves the highest compatibility over the baseline methods, and exhibits robust out-of-distribution generalization.
zh
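The sketch below is one plausible reading of the ContCoOp design: learnable context tokens attend to the class-name embedding through an attention layer before being passed to the text encoder. The head count, token count, attention direction, and residual connection are assumptions made for illustration.

```python
# Class-conditioned prompt tokens produced via attention over class embeddings.
import torch
import torch.nn as nn

class ClassConditionedPrompt(nn.Module):
    def __init__(self, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # learnable context tokens
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, class_emb: torch.Tensor) -> torch.Tensor:
        """
        class_emb: (num_classes, dim) embedding of each class name
        returns:   (num_classes, n_ctx, dim) class-conditioned prompt tokens
        """
        n_cls = class_emb.shape[0]
        q = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)   # prompts act as queries
        kv = class_emb.unsqueeze(1)                       # class embedding as key/value
        conditioned, _ = self.attn(q, kv, kv)
        return conditioned + q                            # residual keeps the base prompts

prompts = ClassConditionedPrompt()(torch.randn(10, 512))
print(prompts.shape)  # torch.Size([10, 16, 512]) -> fed to the (possibly updated) text encoder
```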

[CV-23] LiDAR-Camera Fusion for Video Panoptic Segmentation without Video Training

【速读】: 该论文旨在解决图像全景分割(Panoptic Segmentation, PS)和视频全景分割(Video Panoptic Segmentation, VPS)在自动驾驶场景中的性能提升问题。具体而言,现有研究虽然认识到3D数据对基于摄像头的场景感知的益处,但尚未深入探讨3D数据对图像和视频全景分割的影响。为此,论文提出了一种特征融合模块,通过融合LiDAR和图像数据来增强PS和VPS的性能。此外,论文还展示了在不使用视频数据进行训练的情况下,通过两个简单的模型修改,可以进一步提升VPS的质量。实验结果表明,该方案在图像和视频全景分割的评估指标上实现了高达5个百分点的显著提升。

链接: https://arxiv.org/abs/2412.20881
作者: Fardin Ayar,Ehsan Javanmardi,Manabu Tsukada,Mahdi Javanmardi,Mohammad Rahmati
机构: 未知
关键词: Panoptic segmentation, video panoptic segmentation, combines instance, instance and semantic, gained a lot
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by 2024 International Conference on Intelligent Computing and its Emerging Applications

点击查看摘要

Abstract:Panoptic segmentation, which combines instance and semantic segmentation, has gained a lot of attention in autonomous vehicles, due to its comprehensive representation of the scene. This task can be applied to cameras and LiDAR sensors, but there has been a limited focus on combining both sensors to enhance image panoptic segmentation (PS). Although previous research has acknowledged the benefit of 3D data on camera-based scene perception, no specific study has explored the influence of 3D data on image and video panoptic segmentation (VPS). This work seeks to introduce a feature fusion module that enhances PS and VPS by fusing LiDAR and image data for autonomous vehicles. We also illustrate that, in addition to this fusion, our proposed model, which utilizes two simple modifications, can deliver even higher-quality VPS without being trained on video data. The results demonstrate a substantial improvement in both the image and video panoptic segmentation evaluation metrics by up to 5 points.
zh

[CV-24] Attention Is All You Need For Mixture-of-Depths Routing

【速读】: 该论文旨在解决深度学习模型在训练和推理过程中因参数规模增大而带来的计算需求增加的问题。传统的混合深度(Mixture-of-Depths, MoD)模型通过动态分配计算资源到输入的最相关部分来提高效率,但其路由机制需要额外的网络层,增加了训练难度和模型复杂性。论文提出了一种基于注意力机制的路由方法(A-MoD),利用前一层的注意力图(attention map)在当前层进行路由决策。A-MoD的关键优势在于无需引入额外的可训练参数,能够更高效地进行训练,并且可以轻松适配预训练的Transformer模型。实验表明,A-MoD在ImageNet数据集上相比标准路由和isoFLOP ViT基线模型,准确率提高了2%,并且显著加快了MoD模型的训练收敛速度,使迁移学习速度提升了2倍。

链接: https://arxiv.org/abs/2412.20875
作者: Advait Gadhikar,Souptik Kumar Majumdar,Niclas Popp,Piyapat Saranrittichai,Martin Rapp,Lukas Schott
机构: 未知
关键词: increasingly larger numbers, Advancements in deep, computational demands, increasingly larger, larger numbers
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 19 figures

点击查看摘要

Abstract:Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computations only to the most relevant parts of the inputs, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. These MoD models utilize a routing mechanism to determine which tokens should be processed by a layer, or skipped. However, conventional MoD models employ additional network layers specifically for the routing which are difficult to train, and add complexity and deployment overhead to the model. In this paper, we introduce a novel attention-based routing mechanism A-MoD that leverages the existing attention map of the preceding layer for routing decisions within the current layer. Compared to standard routing, A-MoD allows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pretrained transformer models. Furthermore, it can increase the performance of the MoD model. For instance, we observe up to 2% higher accuracy on ImageNet compared to standard routing and isoFLOP ViT baselines. Furthermore, A-MoD improves the MoD training convergence, leading to up to 2x faster transfer learning.
zh
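The following sketch illustrates the attention-based routing idea: per-token importance is read off the previous layer's attention map, and only the top-k tokens are processed by the current block while the rest pass through unchanged. The capacity ratio and the way importance is aggregated over heads and queries are assumptions, not the paper's exact recipe.

```python
# Attention-map-based Mixture-of-Depths routing (illustrative sketch).
import torch
import torch.nn as nn

def amod_route(x: torch.Tensor, prev_attn: torch.Tensor, block: nn.Module, capacity: float = 0.5):
    """
    x:         (B, N, D) token embeddings entering the current block
    prev_attn: (B, H, N, N) attention weights from the preceding layer
    block:     the transformer block applied only to the selected tokens
    """
    B, N, D = x.shape
    k = max(1, int(capacity * N))
    # Importance of token j = attention it receives, averaged over heads and queries.
    importance = prev_attn.mean(dim=1).mean(dim=1)           # (B, N)
    topk = importance.topk(k, dim=-1).indices                # (B, k)
    out = x.clone()                                          # skipped tokens pass through unchanged
    gathered = torch.gather(x, 1, topk.unsqueeze(-1).expand(-1, -1, D))
    processed = block(gathered)                              # (B, k, D)
    out.scatter_(1, topk.unsqueeze(-1).expand(-1, -1, D), processed)
    return out

# Tiny usage example with a feed-forward block standing in for a full transformer layer.
B, H, N, D = 2, 4, 16, 32
x = torch.randn(B, N, D)
prev_attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
ffn = nn.Sequential(nn.Linear(D, D * 2), nn.GELU(), nn.Linear(D * 2, D))
print(amod_route(x, prev_attn, ffn, capacity=0.25).shape)  # torch.Size([2, 16, 32])
```

Because the router reuses an attention map that the model already computes, no extra trainable parameters are introduced, which is the property the abstract emphasizes.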

[CV-25] LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing ICASSP2025

【速读】: 该论文旨在解决音视频解析(Audio-visual video parsing)中的模态不对齐问题,即在弱标签(weak labels)条件下对视频进行分类时,不同模态(如视觉和听觉)之间往往缺乏对齐,导致模态交互过程中引入额外噪声。为解决这一问题,论文提出了一种名为“非对齐知识学习交互方法”(Learning Interaction method for Non-aligned Knowledge, LINK),其核心在于通过动态调整不同模态的输入来平衡它们在事件预测中的贡献。此外,该方法利用伪标签(pseudo-labels)的语义信息作为先验知识,以减少来自其他模态的噪声。实验结果表明,该模型在LLP数据集上优于现有方法。

链接: https://arxiv.org/abs/2412.20872
作者: Langyu Wang,Bingke Zhu,Yingying Chen,Jinqiao Wang
机构: 未知
关键词: Audio-visual video parsing, respective temporal boundaries, video parsing focuses, Audio-visual video, alongside their respective
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Audio-visual video parsing focuses on classifying videos through weak labels while identifying events as either visible, audible, or both, alongside their respective temporal boundaries. Many methods ignore that different modalities often lack alignment, thereby introducing extra noise during modal interaction. In this work, we introduce a Learning Interaction method for Non-aligned Knowledge (LINK), designed to equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction. Additionally, we leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities. Our experimental findings demonstrate that our model outperforms existing methods on the LLP dataset.
zh
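A minimal sketch of dynamically re-weighting audio and visual features before event prediction is given below; the gating network, the fusion form, and the 25-class output are illustrative assumptions rather than the authors' exact design.

```python
# Dynamic gated fusion of audio and visual segment features for event prediction.
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    def __init__(self, dim: int = 256, num_events: int = 25):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_events)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        """
        audio_feat / visual_feat: (B, T, dim) segment-level features of each modality.
        Returns per-segment event logits from the dynamically weighted mixture.
        """
        w = self.gate(torch.cat([audio_feat, visual_feat], dim=-1))   # (B, T, 2) modality weights
        fused = w[..., :1] * audio_feat + w[..., 1:] * visual_feat
        return self.classifier(fused)

model = GatedModalityFusion()
logits = model(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
print(logits.shape)  # torch.Size([2, 10, 25])
```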

[CV-26] SoftPatch: Fully Unsupervised Anomaly Classification and Segmentation

【速读】: 该论文旨在解决实际工业场景中无监督异常检测(unsupervised anomaly detection, AD)在噪声数据(noisy data)训练下的性能受限问题。现有主流无监督异常检测算法在学术数据集上表现良好,但在实际应用中,由于训练数据通常包含噪声,其性能显著下降。为此,论文首次提出了一种完全无监督的工业异常检测方法,即基于记忆的无监督AD方法SoftPatch和SoftPatch+。其关键解决方案在于通过噪声判别器(noise discriminators)在patch级别生成异常分数(outlier scores),用于噪声消除,并在核心集(coreset)构建前进行数据去噪。此外,这些分数被存储在记忆库(memory bank)中,以软化异常检测边界。SoftPatch保持了正常数据的强建模能力并缓解了核心集中的过度自信问题,而SoftPatch+在高噪声(10%至40%)的工业检测场景中表现出更强的鲁棒性。实验结果表明,SoftPatch和SoftPatch+在多种噪声场景下的性能优于现有最先进的AD方法,并且在无噪声的传统无监督AD设置中,其性能与无噪声方法相当。

链接: https://arxiv.org/abs/2412.20870
作者: Chengjie Wang,Xi Jiang,Bin-Bin Gao,Zhenye Gan,Yong Liu,Feng Zheng,Lizhuang Ma
机构: 未知
关键词: including image-level classification, practical application due, anomaly detection, clean training data, mainstream unsupervised anomaly
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2403.14233

点击查看摘要

Abstract:Although mainstream unsupervised anomaly detection (AD) (including image-level classification and pixel-level segmentation) algorithms perform well in academic datasets, their performance is limited in practical application due to the ideal experimental setting of clean training data. Training with noisy data is an inevitable problem in real-world anomaly detection but is seldom discussed. This paper is the first to consider fully unsupervised industrial anomaly detection (i.e., unsupervised AD with noisy data). To solve this problem, we propose memory-based unsupervised AD methods, SoftPatch and SoftPatch+, which efficiently denoise the data at the patch level. Noise discriminators are utilized to generate outlier scores for patch-level noise elimination before coreset construction. The scores are then stored in the memory bank to soften the anomaly detection boundary. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset, and SoftPatch+ has more robust performance, which is particularly useful in real-world industrial inspection scenarios with high levels of noise (from 10% to 40%). Comprehensive experiments conducted in diverse noise scenarios demonstrate that both SoftPatch and SoftPatch+ outperform the state-of-the-art AD methods on the MVTecAD, ViSA, and BTAD benchmarks. Furthermore, the performance of SoftPatch and SoftPatch+ is comparable to that of the noise-free methods in the conventional unsupervised AD setting. The code of the proposed methods can be found at this https URL.
zh

[CV-27] Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation ICASSP2025

【速读】: 该论文旨在解决风力涡轮机叶片(WTB)图像分割精度不足的问题,特别是在自动化损伤检测系统中的关键应用。尽管通用视觉模型在多种任务中表现优异,但在特定领域如WTB分割中往往表现不佳。为此,作者提出了一种扩展的Intrinsic LoRA(Low-Rank Adaptation)方法,并结合了一种新颖的双空间增强策略。该策略通过图像级和潜在空间增强的集成来提升分割精度:图像空间增强通过图像对之间的线性插值实现,而潜在空间增强则通过引入基于噪声的潜在概率模型完成。这一方法显著提高了分割精度,超越了当前最先进的WTB图像分割方法。

链接: https://arxiv.org/abs/2412.20838
作者: Shubh Singhal,Raül Pérez-Gonzalo,Andreas Espersen,Antonio Agudo
机构: 未知
关键词: wind turbine blade, damage detection systems, automated damage detection, Accurate segmentation, turbine blade
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Authors Shubh Singhal and Raül Pérez-Gonzalo contributed equally to this work. Accepted to ICASSP 2025

点击查看摘要

Abstract:Accurate segmentation of wind turbine blade (WTB) images is critical for effective assessments, as it directly influences the performance of automated damage detection systems. Despite advancements in large universal vision models, these models often underperform in domain-specific tasks like WTB segmentation. To address this, we extend Intrinsic LoRA for image segmentation, and propose a novel dual-space augmentation strategy that integrates both image-level and latent-space augmentations. The image-space augmentation is achieved through linear interpolation between image pairs, while the latent-space augmentation is accomplished by introducing a noise-based latent probabilistic model. Our approach significantly boosts segmentation accuracy, surpassing current state-of-the-art methods in WTB image segmentation.
zh
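The two augmentations described above can be sketched in a few lines: image-space mixing by linear interpolation of image (and mask) pairs, and latent-space perturbation with Gaussian noise as a simple stand-in for the noise-based latent probabilistic model. The interpolation weight and noise scale are assumptions.

```python
# Dual-space augmentation sketch: image-pair interpolation + latent-noise perturbation.
import torch

def image_space_mix(x1, x2, m1, m2, alpha: float = 0.5):
    """Linearly interpolate an image pair and their segmentation masks."""
    x_aug = alpha * x1 + (1.0 - alpha) * x2
    m_aug = alpha * m1 + (1.0 - alpha) * m2      # soft mask; threshold if hard labels are needed
    return x_aug, m_aug

def latent_space_noise(z: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Perturb latent features with Gaussian noise (a simple latent probabilistic model)."""
    return z + sigma * torch.randn_like(z)

# Toy usage with WTB-like image/mask tensors and a toy latent code.
x1, x2 = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
m1 = torch.randint(0, 2, (1, 1, 128, 128)).float()
m2 = torch.randint(0, 2, (1, 1, 128, 128)).float()
x_aug, m_aug = image_space_mix(x1, x2, m1, m2, alpha=0.7)
z_aug = latent_space_noise(torch.randn(1, 256), sigma=0.05)
print(x_aug.shape, m_aug.shape, z_aug.shape)
```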

[CV-28] Inclusion 2024 Global Multimedia Deepfake Detection: Towards Multi-dimensional Facial Forgery Detection

【速读】: 该论文旨在解决多媒体深度伪造(Deepfake)检测问题,特别是针对图像和音视频的自动编辑、合成、生成等操纵行为的识别。解决方案的关键在于通过全球多媒体深度伪造检测挑战赛(Global Multimedia Deepfake Detection),吸引了来自全球的1500个团队提交了约5000份有效结果,并邀请前20名团队展示其解决方案,最终评选出前三名获奖团队。论文详细介绍了前两名团队的解决方案,以推动图像和音视频伪造检测领域的研究进展。这些方法论的开发将有助于下一代深度伪造检测系统的发展,并鼓励参与者开源其方法。

链接: https://arxiv.org/abs/2412.20833
作者: Yi Zhang,Weize Gao,Changtao Miao,Man Luo,Jianshu Li,Wenzhong Deng,Zhe Li,Bingyu Hu,Weibin Yao,Wenbo Zhou,Tao Gong,Qi Chu
机构: 未知
关键词: Global Multimedia Deepfake, Multimedia Deepfake Detection, Deepfake Detection held, Detection held concurrently, Global Multimedia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Inclusion 2024 Global Multimedia Deepfake Detection Competition Top Team Technical Report

点击查看摘要

Abstract:In this paper, we present the Global Multimedia Deepfake Detection challenge, held concurrently with Inclusion 2024. Our Multimedia Deepfake Detection challenge aims to detect automatic image and audio-video manipulations including but not limited to editing, synthesis, generation, Photoshop, etc. Our challenge has attracted 1500 teams from all over the world, with about 5000 valid result submissions. We invite the top 20 teams to present their solutions to the challenge, from which the top 3 teams are awarded prizes in the grand finale. In this paper, we present the solutions from the top 3 teams of the two tracks, to boost the research work in the field of image and audio-video forgery detection. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection systems and we encourage participants to open source their methods.
zh

[CV-29] ReFlow6D: Refraction-Guided Transparent Object 6D Pose Estimation via Intermediate Representation Learning

【速读】: 该论文旨在解决透明物体在6D姿态估计中的挑战,主要由于透明物体独特的折射和反射特性导致传统方法难以准确估计其姿态。为解决这一问题,作者提出了ReFlow6D方法,其关键创新在于利用折射中间表示(refractive-intermediate representation)。与传统方法不同,ReFlow6D通过建模光线穿过透明物体时的路径变形,生成一种独立于环境且基于光线折射的物体特定中间表示。该方法仅依赖RGB图像作为输入,无需深度信息,并通过引入透明物体合成损失(transparent object compositing loss)来优化折射中间特征的生成。实验结果表明,ReFlow6D在TOD和Trans32K-6D数据集上显著优于现有方法,并在机器人抓取任务中验证了其姿态估计的实用性。

链接: https://arxiv.org/abs/2412.20830
作者: Hrishikesh Gupta,Stefan Thalhammer,Jean-Baptiste Weibel,Alexander Haberl,Markus Vincze
机构: 未知
关键词: robotics manipulation important, daily life, making their perception, manipulation important, ubiquitous in daily
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Transparent objects are ubiquitous in daily life, making their perception and robotic manipulation important. However, they present a major challenge due to their distinct refractive and reflective properties when it comes to accurately estimating the 6D pose. To solve this, we present ReFlow6D, a novel method for transparent object 6D pose estimation that harnesses the refractive-intermediate representation. Unlike conventional approaches, our method leverages a feature space impervious to changes in RGB image space and independent of depth information. Drawing inspiration from image matting, we model the deformation of the light path through transparent objects, yielding a unique object-specific intermediate representation guided by light refraction that is independent of the environment in which objects are observed. By integrating these intermediate features into the pose estimation network, we show that ReFlow6D achieves precise 6D pose estimation of transparent objects, using only RGB images as input. Our method further introduces a novel transparent object compositing loss, fostering the generation of superior refractive-intermediate features. Empirical evaluations show that our approach significantly outperforms state-of-the-art methods on TOD and Trans32K-6D datasets. Robot grasping experiments further demonstrate that ReFlow6D’s pose estimation accuracy effectively translates to real-world robotics tasks. The source code is available at: this https URL and this https URL.
zh

[CV-30] Length-Aware DETR for Robust Moment Retrieval

【速读】: 该论文旨在解决视频片段检索(Video Moment Retrieval, MR)中短片段定位不准确的问题。通过数据分析,作者发现短片段的特征多样性有限,导致现有基于DETR(Detection Transformer)的模型在短片段定位上表现不佳。为此,论文提出了MomentMix方法,通过两种增强策略——ForegroundMix和BackgroundMix,分别提升前景和背景的特征表示。此外,作者还发现短片段在预测中心位置时存在偏差,因此提出了一种长度感知解码器(Length-Aware Decoder),通过一种新颖的双边匹配过程来条件化长度信息。实验结果表明,该方法在多个基准数据集上超越了现有的DETR-based方法,尤其在短片段定位上取得了显著提升,如在QVHighlights数据集上R1@0.7和mAP分别提高了2.46%和2.57%。

链接: https://arxiv.org/abs/2412.20816
作者: Seojeong Park,Jiho Choi,Kyungjune Baek,Hyunjung Shim
机构: 未知
关键词: Video Moment Retrieval, natural language query, video based, aims to localize, language query
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video Moment Retrieval (MR) aims to localize moments within a video based on a given natural language query. Given the prevalent use of platforms like YouTube for information retrieval, the demand for MR techniques is significantly growing. Recent DETR-based models have made notable advances in performance but still struggle with accurately localizing short moments. Through data analysis, we identified limited feature diversity in short moments, which motivated the development of MomentMix. MomentMix employs two augmentation strategies: ForegroundMix and BackgroundMix, each enhancing the feature representations of the foreground and background, respectively. Additionally, our analysis of prediction bias revealed that short moments particularly struggle with accurately predicting their center positions of moments. To address this, we propose a Length-Aware Decoder, which conditions length through a novel bipartite matching process. Our extensive studies demonstrate the efficacy of our length-aware approach, especially in localizing short moments, leading to improved overall performance. Our method surpasses state-of-the-art DETR-based methods on benchmark datasets, achieving the highest R1 and mAP on QVHighlights and the highest R1@0.7 on TACoS and Charades-STA (such as a 2.46% gain in R1@0.7 and a 2.57% gain in mAP average for QVHighlights). The code is available at this https URL.
zh

[CV-31] Two Heads Are Better Than One: Averaging along Fine-Tuning to Improve Targeted Transferability ICASSP

【速读】: 该论文旨在解决目标攻击(targeted attacks)的迁移性(transferability)不足的问题。尽管目标攻击的优化时间远长于非目标攻击,但其迁移性仍然不尽如人意。现有研究表明,通过在特征空间中对已有的对抗样本(adversarial example, AE)进行微调(fine-tuning)可以有效提升其目标迁移性。然而,现有的微调方案仅利用了微调的终点信息,而忽略了微调轨迹中的宝贵信息。论文指出,传统的微调轨迹往往在损失表面的平坦区域边缘振荡,因此提出通过对微调轨迹进行平均,将生成的对抗样本拉向更中心的区域。该方法与现有的微调方案进行了对比,并结合了多种先进的目标攻击方法进行实验验证。实验结果表明,所提出的方法在提升目标迁移性方面具有显著优势。

链接: https://arxiv.org/abs/2412.20807
作者: Hui Zeng,Sanshuai Cui,Biwei Chen,Anjie Peng
机构: 未知
关键词: longer optimization time, untargeted attacks notwithstanding, existing fine-tuning schemes, longer optimization, optimization time
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, accepted by 2025ICASSP

点击查看摘要

Abstract:With much longer optimization time than that of untargeted attacks notwithstanding, the transferability of targeted attacks is still far from satisfactory. Recent studies reveal that fine-tuning an existing adversarial example (AE) in feature space can efficiently boost its targeted transferability. However, existing fine-tuning schemes only utilize the endpoint and ignore the valuable information in the fine-tuning trajectory. Noting that the vanilla fine-tuning trajectory tends to oscillate around the periphery of a flat region of the loss surface, we propose averaging over the fine-tuning trajectory to pull the crafted AE towards a more centered region. We compare the proposed method with existing fine-tuning schemes by integrating them with state-of-the-art targeted attacks in various attacking scenarios. Experimental results uphold the superiority of the proposed method in boosting targeted transferability. The code is available at this http URL.
zh
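The core idea, averaging the iterates of a feature-space fine-tuning trajectory instead of keeping only its endpoint, can be sketched as follows. The feature loss, step size, and projection details are assumptions, and the surrogate feature extractor is a toy stand-in for a real network.

```python
# Fine-tune an existing targeted adversarial example and average the trajectory.
import torch
import torch.nn as nn

def finetune_with_trajectory_averaging(x_adv, x_clean, target_feat, feat_extractor,
                                       eps=16 / 255, steps=10, lr=2 / 255):
    x = x_adv.clone().detach()
    avg = torch.zeros_like(x)
    for _ in range(steps):
        x.requires_grad_(True)
        # Pull the AE's features toward the target-class feature (feature-space fine-tuning).
        loss = nn.functional.mse_loss(feat_extractor(x), target_feat)
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x = x - lr * grad.sign()
            x = x_clean + (x - x_clean).clamp(-eps, eps)   # stay inside the L_inf ball
            x = x.clamp(0, 1)
            avg += x                                       # accumulate the trajectory
    avg = avg / steps
    # Project the averaged iterate back into the valid perturbation set.
    return (x_clean + (avg - x_clean).clamp(-eps, eps)).clamp(0, 1)

# Toy usage with a random feature extractor standing in for a surrogate network.
feat_extractor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
x_clean = torch.rand(1, 3, 32, 32)
x_adv0 = (x_clean + 0.03 * torch.randn_like(x_clean)).clamp(0, 1)
target_feat = torch.randn(1, 64)
print(finetune_with_trajectory_averaging(x_adv0, x_clean, target_feat, feat_extractor).shape)
```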

[CV-32] Frequency-aware Event Cloud Network

【速读】: 该论文旨在解决事件相机(Event Camera)数据处理中的两个主要问题:一是主流方法(如帧和体素表示)在达到满意性能的同时,引入了耗时的转换过程、庞大的模型,并牺牲了细粒度的时间信息;二是点云表示虽然有望解决上述问题,但忽略了极性信息,且其模型在抽象长期事件特征方面能力有限。论文提出的解决方案是FECNet,一种基于事件云(Event Cloud)表示的频率感知网络。FECNet通过创新的基于事件的分组和采样模块,充分利用了2S-1T-1P事件云,并通过傅里叶变换在频域进行特征提取,从而显著减少了乘积累加操作(MACs)的爆炸性增长,同时有效地抽象了时空特征。实验结果表明,FECNet在事件驱动的物体分类、动作识别和人体姿态估计任务中表现出高效性和有效性。

链接: https://arxiv.org/abs/2412.20803
作者: Hongwei Ren,Fei Ma,Xiaopeng Lin,Yuetong Fang,Hongxiang Huang,Yulong Huang,Yue Zhou,Haotian Fu,Ziyi Yang,Fei Richard Yu,Bojun Cheng
机构: 未知
关键词: garnering significant attention, biologically inspired sensors, remarkable temporal resolution, emit events asynchronously, Event Cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Event cameras are biologically inspired sensors that emit events asynchronously with remarkable temporal resolution, garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach a satisfactory performance while introducing time-consuming transformation, bulky models, and sacrificing fine-grained temporal information. Alternatively, Point Cloud representation demonstrates promise in addressing the mentioned weaknesses, but it ignores the polarity information, and its models have limited proficiency in abstracting long-term events’ features. In this paper, we propose a frequency-aware network named FECNet that leverages Event Cloud representations. FECNet fully utilizes 2S-1T-1P Event Cloud by innovating the event-based Group and Sampling module. To accommodate the long sequence events from Event Cloud, FECNet embraces feature extraction in the frequency domain via the Fourier transform. This approach substantially extinguishes the explosion of Multiply Accumulate Operations (MACs) while effectively abstracting spatial-temporal features. We conducted extensive experiments on event-based object classification, action recognition, and human pose estimation tasks, and the results substantiate the effectiveness and efficiency of FECNet.
zh

[CV-33] Generalize Your Face Forgery Detectors: An Insertable Adaptation Module Is All You Need ICASSP2025

【速读】: 该论文旨在解决现有面部伪造检测器(face forgery detectors)在训练阶段未见过的伪造样本上泛化能力不足的问题。为解决这一问题,作者提出了一种可插入的适应模块(insertable adaptation module),该模块能够利用在线未标注的测试数据对已训练的现成检测器进行适应,而无需修改其架构或训练过程。解决方案的关键在于引入了一个基于可学习类别原型的分类器(learnable class prototype-based classifier),该分类器通过修正的特征和原型生成预测,从而有效处理在线测试中的各种伪造线索和领域差距。此外,作者还提出了一个最近邻特征校准器(nearest feature calibrator),以进一步提高预测准确性并减少自训练过程中噪声伪标签的影响。实验结果表明,该模块在多个数据集上均表现出优于现有方法的泛化能力,并且可以作为即插即用组件与多种检测器结合,提升整体性能。

链接: https://arxiv.org/abs/2412.20801
作者: Xiaotian Si,Linghui Li,Liwei Zhang,Ziduo Guo,Kaiguo Yuan,Bingyu Li,Xiaoyong Li
机构: 未知
关键词: facial deepfake risks, tackle facial deepfake, deepfake risks, plethora of face, exist to tackle
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICASSP2025 accepted

点击查看摘要

Abstract:A plethora of face forgery detectors exist to tackle facial deepfake risks. However, their practical application is hindered by the challenge of generalizing to forgeries unseen during the training stage. To this end, we introduce an insertable adaptation module that can adapt a trained off-the-shelf detector using only online unlabeled test data, without requiring modifications to the architecture or training process. Specifically, we first present a learnable class prototype-based classifier that generates predictions from the revised features and prototypes, enabling effective handling of various forgery clues and domain gaps during online testing. Additionally, we propose a nearest feature calibrator to further improve prediction accuracy and reduce the impact of noisy pseudo-labels during self-training. Experiments across multiple datasets show that our module achieves superior generalization compared to state-of-the-art methods. Moreover, it functions as a plug-and-play component that can be combined with various detectors to enhance the overall performance.
zh
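Below is a hedged sketch of a learnable class-prototype classifier plus a simple nearest-feature calibration of its predictions, mirroring the two components described above; the cosine/temperature formulation, the memory bank, and the blending rule are assumptions rather than the authors' exact design.

```python
# Prototype-based classification with nearest-feature calibration (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 2, tau: float = 0.07):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.tau = tau

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        return feats @ protos.t() / self.tau              # cosine logits (real vs. fake)

def calibrate_with_nearest_features(logits, feats, bank_feats, bank_logits, k=5, alpha=0.5):
    """Blend each prediction with the mean prediction of its k nearest features in a memory bank."""
    sims = F.normalize(feats, dim=-1) @ F.normalize(bank_feats, dim=-1).t()
    nn_idx = sims.topk(k, dim=-1).indices                 # (B, k)
    neighbor_logits = bank_logits[nn_idx].mean(dim=1)     # (B, num_classes)
    return alpha * logits + (1 - alpha) * neighbor_logits

clf = PrototypeClassifier()
feats = torch.randn(4, 512)
bank_feats, bank_logits = torch.randn(100, 512), torch.randn(100, 2)
print(calibrate_with_nearest_features(clf(feats), feats, bank_feats, bank_logits).shape)
```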

[CV-34] VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

【速读】: 该论文旨在解决扩散模型(diffusion models)在文本到图像生成(text-to-image generation)过程中生成图像美学质量不足的问题。具体而言,现有模型在颜色、光照、构图等细粒度维度上与真实世界的美学图像仍存在差距。为解决这一问题,论文提出了Cross-Attention Value Mixing Control (VMix) Adapter,这是一种即插即用的美学适配器,其关键解决方案包括两个方面:(1) 通过初始化美学嵌入(aesthetic embedding),将输入文本提示(text prompt)解耦为内容描述和美学描述;(2) 通过值混合交叉注意力(value-mixed cross-attention)将美学条件整合到去噪过程中,并使用零初始化线性层(zero-initialized linear layers)连接网络。该方法的核心在于通过设计一种优越的条件控制方法,提升现有扩散模型的美学表现,同时保持图像与文本的对齐。VMix的灵活性使其能够直接应用于社区模型,无需重新训练即可提升视觉表现。实验结果表明,VMix在图像生成任务中优于其他最先进的方法,并且与其他社区模块(如LoRA、ControlNet和IPAdapter)兼容。

链接: https://arxiv.org/abs/2412.20800
作者: Shaojin Wu,Fei Ding,Mengqi Huang,Wei Liu,Qian He
机构: 未知
关键词: show extraordinary talents, generate highly aesthetic, models show extraordinary, highly aesthetic images, diffusion models show
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Codes and models are available at this https URL

点击查看摘要

Abstract:While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is this https URL.
zh

[CV-35] A Tale of Two Imperatives: Privacy and Explainability

【速读】: 该论文旨在解决在高风险决策中结合隐私权(Right-to-Privacy, RTP)和解释权(Right-to-Explanation, RTE)的复杂性问题。具体而言,论文探讨了如何在深度学习模型中同时满足差分隐私(Differentially Privacy, DP)和事后解释器(post-hoc explainers)的要求。差分隐私作为当前隐私保护机器学习的黄金标准,提供了强有力的隐私保障;而事后解释器则因其独立于模型训练的特性,成为模型审计的首选工具。论文的关键解决方案包括:1)评估在差分隐私模型下事后解释器的有效性;2)分析差分隐私模型与事后解释器之间的内在交互;3)提出一个工业软件管道,通过实际用例展示如何在高风险应用中有效结合隐私权和解释权。

链接: https://arxiv.org/abs/2412.20798
作者: Supriya Manna,Niladri Sett
机构: 未知
关键词: Deep learning preponderance, follow rigorous operational, rigorous operational frameworks, reshaped high-stakes decision-making, Deep learning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress

点击查看摘要

Abstract:Deep learning’s preponderance across scientific domains has reshaped high-stakes decision-making, making it essential to follow rigorous operational frameworks that include both Right-to-Privacy (RTP) and Right-to-Explanation (RTE). This paper examines the complexities of combining these two requirements. For RTP, we focus on ‘Differential privacy’ (DP), which is considered the current gold standard for privacy-preserving machine learning due to its strong quantitative guarantee of privacy. For RTE, we focus on post-hoc explainers: they are the go-to option for model auditing as they operate independently of model training. We formally investigate (DP) models and various commonly-used post-hoc explainers: how to evaluate these explainers subject to RTP, and analyze the intrinsic interactions between DP models and these explainers. Furthermore, our work throws light on how RTP and RTE can be effectively combined in high-stakes applications. Our study concludes by outlining an industrial software pipeline, with the example of a widely used use-case, that respects both RTP and RTE requirements.
zh

[CV-36] Sample Correlation for Fingerprinting Deep Face Recognition

【速读】: 该论文旨在解决深度人脸识别(deep face recognition)领域中模型窃取(model stealing)攻击的检测问题。现有的模型指纹(model fingerprinting)方法通常使用可转移的对抗样本(transferable adversarial examples)作为指纹,但这种方法对对抗防御(adversarial defense)和迁移学习(transfer learning)较为敏感。为解决这一问题,论文提出了一种基于样本相关性(SAmple Correlation, SAC)的新型模型窃取检测方法。其关键创新在于利用样本之间的成对关系(pairwise relationship)而非对抗样本,并通过SAC-JC方法选择JPEG压缩样本作为模型输入,计算其相关性矩阵(correlation matrix)来检测模型是否被窃取。实验结果表明,SAC方法在深度人脸识别(包括人脸验证和人脸情绪识别)中成功抵御了多种模型窃取攻击,并在AUC、p值和F1得分上表现出最高性能。此外,SAC-JC在目标识别数据集(如Tiny-ImageNet和CIFAR10)上也展现了优于现有方法的性能。

链接: https://arxiv.org/abs/2412.20768
作者: Jiyang Guan,Jian Liang,Yanbo Wang,Ran He
机构: 未知
关键词: http URL fingerprinting, http URL address, http URL results, http URL code, http URL methods
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Face recognition has witnessed remarkable advancements in recent years, thanks to the development of deep learning techniques. However, an off-the-shelf face recognition model as a commercial service could be stolen by model stealing attacks, posing great threats to the rights of the model owner. Model fingerprinting, as a model stealing detection method, aims to verify whether a suspect model is stolen from the victim model, gaining more and more attention. Previous methods always utilize transferable adversarial examples as the model fingerprint, but this method is known to be sensitive to adversarial defense and transfer learning techniques. To address this issue, we consider the pairwise relationship between samples instead and propose a novel yet simple model stealing detection method based on SAmple Correlation (SAC). Specifically, we present SAC-JC that selects JPEG compressed samples as model inputs and calculates the correlation matrix among their model outputs. Experimental results validate that SAC successfully defends against various model stealing attacks in deep face recognition, encompassing face verification and face emotion recognition, exhibiting the highest performance in terms of AUC, p-value and F1 score. Furthermore, we extend our evaluation of SAC-JC to object recognition datasets including Tiny-ImageNet and CIFAR10, which also demonstrates the superior performance of SAC-JC to previous methods. The code will be available at this https URL.
zh
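The SAC-JC recipe, JPEG-compress a set of probe images, run them through a model, and compare the pairwise correlation matrices of the outputs between the victim and the suspect, can be sketched as follows. The correlation definition, the distance, and the toy models and probes are assumptions for illustration, not the released code.

```python
# Sample-correlation fingerprinting sketch with JPEG-compressed probe images.
import io
import numpy as np
import torch
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int = 60) -> Image.Image:
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def output_correlation(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Correlation matrix (n x n) among the model's outputs for n probe images."""
    with torch.no_grad():
        feats = model(batch)                              # (n, d) logits or features
    feats = feats - feats.mean(dim=1, keepdim=True)
    feats = feats / (feats.norm(dim=1, keepdim=True) + 1e-8)
    return feats @ feats.t()

def fingerprint_distance(victim, suspect, batch):
    c_v = output_correlation(victim, batch)
    c_s = output_correlation(suspect, batch)
    return (c_v - c_s).abs().mean().item()                # small -> suspect behaves like the victim

# Toy probes and models standing in for real face images and recognition networks.
pil_probes = [Image.fromarray((np.random.rand(64, 64, 3) * 255).astype("uint8")) for _ in range(8)]
probe = torch.stack([
    torch.from_numpy(np.array(jpeg_compress(im))).permute(2, 0, 1).float() / 255.0
    for im in pil_probes
])
victim = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10))
suspect = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10))
print(fingerprint_distance(victim, suspect, probe))
```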

[CV-37] KeyGS: A Keyframe-Centric Gaussian Splatting Method for Monocular Image Sequences AAAI2025

【速读】: 该论文旨在解决从稀疏的2D图像重建高质量3D模型时,现有方法对准确相机位姿的依赖以及训练时间过长的问题。现有方法如3D高斯泼溅(3D Gaussian Splatting, 3DGS)虽然具有高效的训练速度和实时渲染能力,但仍需依赖精确的相机位姿进行重建。尽管一些最新方法尝试从单目视频数据集中无需运动结构(Structure-from-Motion, SfM)预处理来训练3DGS模型,但这些方法的训练时间过长,限制了其实际应用。本文提出了一种高效的框架,无需深度或匹配模型,首先利用SfM在几秒内快速获取粗略的相机位姿,然后通过3DGS的密集表示进一步优化这些位姿。此外,本文还提出了从粗到细的频率感知密集化(coarse-to-fine frequency-aware densification)方法,结合联合优化,以重建不同层次的细节,避免相机位姿估计陷入局部极小值或由于高频信号而漂移。该框架显著将训练时间从数小时缩短至数分钟,同时在新视角合成和相机位姿估计方面比现有方法更为精确。

链接: https://arxiv.org/abs/2412.20767
作者: Keng-Wei Chang,Zi-Ming Wang,Shang-Hong Lai
机构: 未知
关键词: garnered significant attention, Reconstructing high-quality, images has garnered, Gaussian Splatting, garnered significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI 2025

点击查看摘要

Abstract:Reconstructing high-quality 3D models from sparse 2D images has garnered significant attention in computer vision. Recently, 3D Gaussian Splatting (3DGS) has gained prominence due to its explicit representation with efficient training speed and real-time rendering capabilities. However, existing methods still heavily depend on accurate camera poses for reconstruction. Although some recent approaches attempt to train 3DGS models without the Structure-from-Motion (SfM) preprocessing from monocular video datasets, these methods suffer from prolonged training times, making them impractical for many applications. In this paper, we present an efficient framework that operates without any depth or matching model. Our approach initially uses SfM to quickly obtain rough camera poses within seconds, and then refines these poses by leveraging the dense representation in 3DGS. This framework effectively addresses the issue of long training times. Additionally, we integrate the densification process with joint refinement and propose a coarse-to-fine frequency-aware densification to reconstruct different levels of details. This approach prevents camera pose estimation from being trapped in local minima or drifting due to high-frequency signals. Our method significantly reduces training time from hours to minutes while achieving more accurate novel view synthesis and camera pose estimation compared to previous methods.
zh

[CV-38] Unforgettable Lessons from Forgettable Images: Intra-Class Memorability Matters in Computer Vision Tasks

【速读】: 该论文旨在探讨同一类别(intra-class)中某些图像比其他图像更具记忆性(memorability)的原因,尽管它们共享相同的类别特征。为了解决这一问题,研究者设计并进行了人类行为实验,通过展示一系列图像并要求参与者识别当前图像是否与之前展示的某一图像匹配,来量化记忆性。关键解决方案是提出了一个新颖的度量指标——类内记忆性得分(Intra-Class Memorability score, ICMscore),该指标将重复图像展示之间的时间间隔纳入计算,从而更准确地评估图像的记忆性。这一研究不仅深入分析了导致图像记忆性差异的细粒度视觉特征,还为认知科学和计算机视觉领域的实际应用奠定了基础。

链接: https://arxiv.org/abs/2412.20761
作者: Jie Jing,Qing Lin,Shuangpeng Han,Lucia Schiatti,Yen-Ling Kuo,Mengmi Zhang
机构: 未知
关键词: shared category characteristics, introduce intra-class memorability, category characteristics, shared category, intra-class memorability
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce intra-class memorability, where certain images within the same class are more memorable than others despite shared category characteristics. To investigate what features make one object instance more memorable than others, we design and conduct human behavior experiments, where participants are shown a series of images one at a time, and they must identify when the current item matches the item presented a few steps back in the sequence. To quantify memorability, we propose the Intra-Class Memorability score (ICMscore), a novel metric that incorporates the temporal intervals between repeated image presentations into its calculation. Our contributions open new pathways in understanding intra-class memorability by scrutinizing fine-grained visual features that result in the least and most memorable images and laying the groundwork for real-world applications in cognitive science and computer vision.
zh
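
摘要并未给出 ICMscore 的具体公式,下面仅给出一个体现其思想(将同一图像两次呈现之间的时间间隔纳入记忆性度量)的假设性 Python 草图;函数名、加权方式与示例数据均为本文假设,并非论文定义。

```python
import numpy as np

def icm_score(intervals, recognized, max_interval=None):
    """Illustrative (hypothetical) intra-class memorability score.

    intervals : sequence of int, number of items between the two
                presentations of the same image in each trial
    recognized: sequence of 0/1, whether the repeat was correctly
                identified in that trial
    The real ICMscore in the paper may differ; here we simply weight
    correct recognitions by their normalized temporal interval, so
    images remembered across longer gaps score higher.
    """
    intervals = np.asarray(intervals, dtype=float)
    recognized = np.asarray(recognized, dtype=float)
    if max_interval is None:
        max_interval = intervals.max()
    weights = intervals / max_interval          # longer gap -> larger weight
    return float(np.sum(recognized * weights) / np.sum(weights))

# toy usage: one image probed in 4 trials with different gaps
print(icm_score(intervals=[2, 5, 9, 12], recognized=[1, 1, 0, 1]))
```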

[CV-39] Are Vision-Language Models Truly Understanding Multi-vision Sensor?

【速读】: 该论文旨在解决当前大规模视觉-语言模型(VLMs)在处理多视觉传感器数据(如热成像、深度信息和X射线)时,缺乏对传感器信息的深入理解的问题。现有模型在处理这些数据时,往往忽视了每种传感器的独特物理属性,导致其在需要多视觉传感器推理的复杂任务中表现受限。为解决这一问题,论文提出了一个新颖的多视觉传感器感知与推理(MS-PR)基准,用于评估VLMs在传感器特定推理方面的能力。此外,论文引入了多样化负属性(DNA)优化方法,旨在帮助VLMs在多视觉传感器任务中进行深度推理,从而弥合图像与传感器数据之间的核心信息差距。实验结果表明,DNA方法显著提升了VLMs在多视觉传感器推理方面的性能。

链接: https://arxiv.org/abs/2412.20750
作者: Sangyun Chung,Youngjoon Yu,Youngchae Chee,Se Yeon Kim,Byung-Kwan Lee,Yong Man Ro
机构: 未知
关键词: Large-scale Vision-Language Models, aligning vision inputs, Large-scale Vision-Language, Vision-Language Models, computer vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL . arXiv admin note: text overlap with arXiv:2408.12114

点击查看摘要

Abstract:Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance in computer vision tasks. Moreover, for VLMs to be effectively utilized in real-world applications, an understanding of diverse multi-vision sensor data, such as thermal, depth, and X-ray information, is essential. However, we find that current VLMs process multi-vision sensor images without deep understanding of sensor information, disregarding each sensor’s unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions requiring multi-vision sensor reasoning. To address this, we propose a novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, assessing VLMs on their capacity for sensor-specific reasoning. Moreover, we introduce Diverse Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning on multi-vision sensor tasks, helping to bridge the core information gap between images and sensor data. Extensive experimental results validate that the proposed DNA method can significantly improve the multi-vision sensor reasoning for VLMs.
zh

[CV-40] Solar Filaments Detection using Active Contours Without Edges

【速读】: 该论文旨在解决H-alpha全盘太阳图像中太阳暗条(solar filaments)的检测问题。为解决这一问题,作者提出了一种基于无边缘主动轮廓(Active Contours Without Edges, ACWE)的算法。该算法的核心在于通过初始化轮廓并使其根据能量函数(energy function)进行形变,当轮廓到达目标物体的边界时,能量函数减小,轮廓停止演化。该算法包含三个主要步骤:图像预处理、图像分割和图像后处理。实验结果表明,与传统的目标检测算法相比,该算法在检测太阳暗条方面表现更优。

链接: https://arxiv.org/abs/2412.20749
作者: Sanmoy Bandyopadhyay,Vaibhav Pant
机构: 未知
关键词: H-alpha full-disk solar, filaments in H-alpha, H-alpha full-disk, full-disk solar images, image
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:In this article, an active contours without edges (ACWE)-based algorithm is proposed for the detection of solar filaments in H-alpha full-disk solar images. The overall algorithm consists of three main image-processing steps: image pre-processing, image segmentation, and image post-processing. In this work, contours are initialized on the solar image and allowed to deform based on the energy function. As soon as the contour reaches the boundary of the desired object, the energy function gets reduced, and the contour stops evolving. The proposed algorithm has been applied to a few benchmark datasets and compared with a classical object detection technique. Analysis of the results indicates that the proposed algorithm outperforms the existing classical object detection algorithm.
zh
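
下面给出一个示意性的 Python 草图,演示无边缘主动轮廓(ACWE,即 Chan-Vese 一类方法)分割这一核心步骤:这里借助 scikit-image 的 morphological_chan_vese 实现,并非论文官方代码;文件路径、迭代次数与平滑参数均为示例假设,论文中的预处理与后处理步骤也未包含。

```python
from skimage import io, img_as_float
from skimage.segmentation import morphological_chan_vese

# Load a full-disk H-alpha image (the file path is a placeholder).
img = img_as_float(io.imread("halpha_fulldisk.png", as_gray=True))

# Filaments are dark elongated structures; inverting makes them bright,
# which helps the region-based (edge-free) energy separate them.
img_inv = 1.0 - img

# Active contours without edges (morphological Chan-Vese variant).
# 200 iterations and a checkerboard initialization are arbitrary choices.
seg = morphological_chan_vese(img_inv, 200,
                              init_level_set="checkerboard",
                              smoothing=3, lambda1=1, lambda2=1)

filament_mask = seg.astype(bool)
print("filament pixel ratio:", filament_mask.mean())
```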

[CV-41] UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models

【速读】: 该论文旨在解决遥感影像与自然图像之间的领域差距问题,并探索视觉-语言模型(Vision-Language Models, VLMs)在处理不同类型视觉输入时的能力。为解决这一问题,论文提出了UniRS模型,这是首个能够统一处理多时序(multi-temporal)遥感任务的视觉-语言模型。UniRS的关键创新在于其支持单幅图像、双时相图像对和视频作为输入,从而在一个统一框架内实现全面的遥感时序分析。此外,论文采用了统一的视觉表示方法,使模型能够接受多种视觉输入,并针对双时相图像对任务定制了变化提取模块,以进一步增强时空特征的提取。为了提升模型推理过程,还设计了提示增强机制,利用通用视觉-语言模型的先验知识为UniRS提供线索。最后,通过在混合数据集上进行联合微调,促进了多任务知识共享。实验结果表明,UniRS在视觉问答、变化描述和视频场景分类等多样化任务中均达到了最先进的性能,展示了其在统一多时序遥感任务中的多功能性和有效性。

链接: https://arxiv.org/abs/2412.20742
作者: Yujie Li,Wenjia Xu,Guangzuo Li,Zijian Yu,Zhiwei Wei,Jiuniu Wang,Mugen Peng
机构: 未知
关键词: recently received widespread, received widespread attention, demonstrated excellent generalization, remote sensing, excellent generalization performance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:The domain gap between remote sensing imagery and natural images has recently received widespread attention and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce UniRS, the first vision-language model unifying multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model’s reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.
zh

[CV-42] owards nation-wide analytical healthcare infrastructures: A privacy-preserving augmented knee rehabilitation case study ALT

【速读】: 该论文旨在为近未来的隐私保护大数据分析医疗平台做出贡献,这些平台能够处理来自患者的流式或上传的时间序列数据或视频。论文通过实验工作,使用了一个真实生活中的膝关节康复视频数据集,捕捉了一系列从简单个性化到更具挑战性的运动,旨在帮助患者重返运动。解决方案的关键在于利用Google MediaPipe姿态估计技术,将移动设备拍摄的视频转换为隐私保护的诊断时间序列数据。开发的算法通过在患者视频上叠加简笔画元素,并实时更新生成的膝关节角度估计时间序列图(以CSV文件格式流式传输),从而增强膝关节康复视频的分析能力。此外,算法能够通过预设的膝关节角度参数,直观地显示潜在问题,如膝关节过度屈曲或不稳定运动。该自适应算法能够从侧面和正面视频中准确识别所有运动(准确率为91.67%-100%),从而量化康复计划的依从性和运动组数及重复次数。透明的算法设计有助于实现可解释的AI,并为近未来的隐私保护、非供应商锁定、开源开发提供信息,这些开发既适用于终端用户计算设备,也可作为非专有的本地云平台部署在国家医疗系统中。

链接: https://arxiv.org/abs/2412.20733
作者: Boris Bačić,Claudiu Vasile,Chengwei Feng,Marian G. Ciucă
机构: 未知
关键词: big data analytical, capable of processing, privacy-preserving big data, uploaded timeseries data, data analytical healthcare
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multimedia (cs.MM)
备注: The original work citation: Bačić, B., Claudiu Vasile, Feng, C., Ciucă, M. G. (2024, 13-15 Dec.). Towards nation-wide analytical healthcare infrastructures: A privacy-preserving augmented knee rehabilitation case study. Presented at the Conference on Innovative Technologies in Intelligent Systems Industrial Applications (CITISIA 2024), Sydney, NSW

点击查看摘要

Abstract:The purpose of this paper is to contribute towards the near-future privacy-preserving big data analytical healthcare platforms, capable of processing streamed or uploaded timeseries data or videos from patients. The experimental work includes a real-life knee rehabilitation video dataset capturing a set of exercises from simple and personalised to more general and challenging movements aimed for returning to sport. To convert video from mobile into privacy-preserving diagnostic timeseries data, we employed Google MediaPipe pose estimation. The developed proof-of-concept algorithms can augment knee exercise videos by overlaying the patient with stick figure elements while updating generated timeseries plot with knee angle estimation streamed as CSV file format. For patients and physiotherapists, video with side-to-side timeseries visually indicating potential issues such as excessive knee flexion or unstable knee movements or stick figure overlay errors is possible by setting a-priori knee-angle parameters. To address adherence to rehabilitation programme and quantify exercise sets and repetitions, our adaptive algorithm can correctly identify (91.67%-100%) of all exercises from side- and front-view videos. Transparent algorithm design for adaptive visual analysis of various knee exercise patterns contributes towards the interpretable AI and will inform near-future privacy-preserving, non-vendor locking, open-source developments for both end-user computing devices and as on-premises non-proprietary cloud platforms that can be deployed within the national healthcare system.
zh
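
下面是一个最小示意片段,展示如何由 MediaPipe Pose 输出的髋、膝、踝三个关键点计算膝关节角度时间序列中的单帧角度(论文流程的核心一步);这只是基于摘要的草图,并非原文实现,关键点索引沿用 MediaPipe 的公开定义,示例坐标与阈值均为假设。

```python
import numpy as np

# MediaPipe Pose landmark indices (left side): hip=23, knee=25, ankle=27.
LEFT_HIP, LEFT_KNEE, LEFT_ANKLE = 23, 25, 27

def knee_angle(landmarks):
    """Knee flexion angle (degrees) from 2D landmark coordinates.

    landmarks: array of shape (33, 2) with normalized (x, y) coordinates
    from MediaPipe Pose for a single video frame.
    """
    hip, knee, ankle = (np.asarray(landmarks[i], dtype=float)
                        for i in (LEFT_HIP, LEFT_KNEE, LEFT_ANKLE))
    v1, v2 = hip - knee, ankle - knee
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

# toy frame: a nearly straight leg gives an angle close to 180 degrees
frame = np.zeros((33, 2))
frame[LEFT_HIP], frame[LEFT_KNEE], frame[LEFT_ANKLE] = (0.5, 0.2), (0.5, 0.5), (0.52, 0.8)
print(round(knee_angle(frame), 1))

# A per-frame stream of such angles can be written to CSV and checked
# against a-priori thresholds (e.g. flag frames with angle < 90 degrees).
```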

[CV-43] Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling

【速读】: 该论文旨在解决将对话为中心的脚本转化为连贯故事板的挑战,主要问题包括脚本细节有限、物理上下文理解不足以及电影原则整合的复杂性。为解决这些问题,论文提出了“对话可视化”(Dialogue Visualization)这一新任务,并引入了“对话导演”(Dialogue Director)这一无需训练的多模态框架。该框架由脚本导演(Script Director)、摄影师(Cinematographer)和故事板制作器(Storyboard Maker)组成,利用大模型和基于扩散的架构,通过链式思维推理(Chain-of-Thought reasoning)、检索增强生成(Retrieval-Augmented Generation)和多视图合成等技术,提升脚本理解、物理上下文理解和电影知识整合能力。实验结果表明,该框架在脚本解释、物理世界理解和电影原则应用方面优于现有方法,显著提高了基于对话的故事可视化的质量和可控性。

链接: https://arxiv.org/abs/2412.20725
作者: Min Zhang,Zilin Wang,Liyan Chen,Kunhong Liu,Juncong Lin
机构: 未知
关键词: Recent advances, enhanced video generation, advances in AI-driven, AI-driven storytelling, storytelling have enhanced
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in AI-driven storytelling have enhanced video generation and story visualization. However, translating dialogue-centric scripts into coherent storyboards remains a significant challenge due to limited script detail, inadequate physical context understanding, and the complexity of integrating cinematic principles. To address these challenges, we propose Dialogue Visualization, a novel task that transforms dialogue scripts into dynamic, multi-view storyboards. We introduce Dialogue Director, a training-free multimodal framework comprising a Script Director, Cinematographer, and Storyboard Maker. This framework leverages large multimodal models and diffusion-based architectures, employing techniques such as Chain-of-Thought reasoning, Retrieval-Augmented Generation, and multi-view synthesis to improve script understanding, physical context comprehension, and cinematic knowledge integration. Experimental results demonstrate that Dialogue Director outperforms state-of-the-art methods in script interpretation, physical world understanding, and cinematic principle application, significantly advancing the quality and controllability of dialogue-based story visualization.
zh

[CV-44] 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives ICLR2024

【速读】: 该论文旨在解决动态3D场景表示(Dynamic 3D Scene Representation)和从捕获视频中生成新视角(Novel View Synthesis)的挑战,这对于实现AR/VR和元宇宙应用所需的沉浸式体验至关重要。论文提出了一种将动态场景视为时空4D体积学习问题(Spatio-Temporal 4D Volume Learning)的框架,通过最小化对运动的假设,提供了一种通用的动态场景学习框架。其核心解决方案是使用具有显式几何和外观特征的4D高斯基元(4D Gaussian Primitives)来表示目标动态场景,称为4D高斯溅射(4D Gaussian Splatting, 4DGS)。该方法通过拟合底层时空体积来捕捉空间和时间中的相关信息,并利用各向异性椭圆(Anisotropic Ellipses)参数化的4D高斯模型,自然地学习视角依赖和时间演化的外观。此外,4DGS模型首次支持复杂动态场景的高分辨率、逼真新视角的实时渲染。为了提高效率,论文还提出了几种紧凑变体,有效减少了内存占用并降低了过拟合风险。

链接: https://arxiv.org/abs/2412.20720
作者: Zeyu Yang,Zijie Pan,Xiatian Zhu,Li Zhang,Yu-Gang Jiang,Philip H.S. Torr
机构: 未知
关键词: enabling immersive experiences, immersive experiences required, metaverse applications, captured videos, videos are crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal extension of ICLR 2024

点击查看摘要

Abstract:Dynamic 3D scene representation and novel view synthesis from captured videos are crucial for enabling immersive experiences required by AR/VR and metaverse applications. However, this task is challenging due to the complexity of unconstrained real-world scenes and their temporal dynamics. In this paper, we frame dynamic scenes as a spatio-temporal 4D volume learning problem, offering a native explicit reformulation with minimal assumptions about motion, which serves as a versatile dynamic scene learning framework. Specifically, we represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features, dubbed as 4D Gaussian splatting (4DGS). This approach can capture relevant information in space and time by fitting the underlying spatio-temporal volume. Modeling the spacetime as a whole with 4D Gaussians parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, our model can naturally learn view-dependent and time-evolved appearance with 4D spherindrical harmonics. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, photorealistic novel views for complex dynamic scenes. To enhance efficiency, we derive several compact variants that effectively reduce memory footprint and mitigate the risk of overfitting. Extensive experiments validate the superiority of 4DGS in terms of visual quality and efficiency across a range of dynamic scene-related tasks (e.g., novel view synthesis, 4D generation, scene understanding) and scenarios (e.g., single object, indoor scenes, driving environments, synthetic and real data).
zh

[CV-45] M3oralBench: A MultiModal Moral Benchmark for LVLMs

【速读】: 该论文旨在解决当前大视觉语言模型(LVLMs)在多模态道德评估方面的不足。随着这些模型在法律、金融和医疗等关键领域的广泛应用,确保其输出符合人类价值观并保持在道德边界内变得至关重要。然而,现有的道德评估方法主要局限于文本模态,缺乏针对多模态场景的评估工具。为此,论文提出了M^3 oralBench,这是首个针对LVLMs的多模态道德基准。该基准通过扩展Moral Foundations Vignettes(MFVs)中的日常道德场景,并利用文本到图像扩散模型(SD3.0)生成相应的场景图像,从而在Moral Foundations Theory(MFT)的六个道德基础上进行道德判断、道德分类和道德响应等多任务评估。这一解决方案的关键在于其多模态特性,能够全面评估模型在多模态道德理解和推理方面的表现,揭示了当前模型在道德方面的显著局限性。

链接: https://arxiv.org/abs/2412.20718
作者: Bei Yan,Jie Zhang,Zhiyuan Chen,Shiguang Shan,Xilin Chen
机构: 未知
关键词: including large language, large language models, large vision-language models, moral, including large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, large foundation models, including large language models (LLMs) and large vision-language models (LVLMs), have become essential tools in critical fields such as law, finance, and healthcare. As these models increasingly integrate into our daily life, it is necessary to conduct moral evaluation to ensure that their outputs align with human values and remain within moral boundaries. Previous works primarily focus on LLMs, proposing moral datasets and benchmarks limited to text modality. However, given the rapid development of LVLMs, there is still a lack of multimodal moral evaluation methods. To bridge this gap, we introduce M^3oralBench, the first MultiModal Moral Benchmark for LVLMs. M^3oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and employs the text-to-image diffusion model, SD3.0, to create corresponding scenario images. It conducts moral evaluation across six moral foundations of Moral Foundations Theory (MFT) and encompasses tasks in moral judgement, moral classification, and moral response, providing a comprehensive assessment of model performance in multimodal moral understanding and reasoning. Extensive experiments on 10 popular open-source and closed-source LVLMs demonstrate that M^3oralBench is a challenging benchmark, exposing notable moral limitations in current models. Our benchmark is publicly available.
zh

[CV-46] HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images

【速读】: 该论文旨在解决当前基于潜在扩散模型(Latent Diffusion Models, LDMs)生成的图像检测方法在训练数据不可用情况下的局限性。现有方法假设通过自编码器(autoencoder)重建LDM生成的图像比真实图像更容易,但这种方法过度依赖于背景信息,导致在检测具有简单背景的图像时表现不佳。为此,论文提出了一种名为HFI(High-Frequency Information)的新方法。HFI通过将LDM的自编码器视为下采样-上采样核,测量重建图像中高频信息的失真程度(即混叠现象,aliasing)。该方法无需训练,高效且在各种生成模型生成的复杂图像检测中表现优异。此外,HFI还能通过隐式水印(implicit watermarking)成功检测特定LDM生成的图像,显著优于现有基线方法。

链接: https://arxiv.org/abs/2412.20704
作者: Sungik Choi,Sungwoo Park,Jaehoon Lee,Seunghyun Kim,Stanley Jungkyu Choi,Moontae Lee
机构: 未知
关键词: Dramatic advances, latent diffusion models, AI-generated image detection, latent diffusion, AI-generated images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Dramatic advances in the quality of the latent diffusion models (LDMs) also led to the malicious use of AI-generated images. While current AI-generated image detection methods assume the availability of real/AI-generated images for training, this is practically limited given the vast expressibility of LDMs. This motivates the training-free detection setup where no related data are available in advance. The existing LDM-generated image detection method assumes that images generated by LDM are easier to reconstruct using an autoencoder than real images. However, we observe that this reconstruction distance is overfitted to background information, leading the current method to underperform in detecting images with simple backgrounds. To address this, we propose a novel method called HFI. Specifically, by viewing the autoencoder of LDM as a downsampling-upsampling kernel, HFI measures the extent of aliasing, a distortion of high-frequency information that appears in the reconstructed image. HFI is training-free, efficient, and consistently outperforms other training-free methods in detecting challenging images generated by various generative models. We also show that HFI can successfully detect the images generated from the specified LDM as a means of implicit watermarking. HFI outperforms the best baseline method while achieving magnitudes of
zh
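
下面给出一个示意性的 Python 草图,粗略体现“度量自编码器重建前后高频信息失真(混叠)”这一思路:用 FFT 高通掩码比较原图与重建图的高频能量差。这只是基于摘要的假设性实现,HFI 的真实统计量与真/伪图像的判别方向以论文为准;半径比例等参数均为假设。

```python
import numpy as np

def high_freq_energy(img, radius_ratio=0.25):
    """Energy of the FFT components outside a centered low-frequency disk."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    high_pass = r > radius_ratio * min(h, w)
    return float(np.abs(f[high_pass]).sum())

def hfi_style_score(img, reconstruction):
    """Illustrative HFI-style statistic: relative change of high-frequency
    content after the autoencoder round trip. A larger value means stronger
    high-frequency distortion; how this maps to real vs. generated images
    follows the paper's decision rule, which is not reproduced here."""
    e_in = high_freq_energy(img)
    e_out = high_freq_energy(reconstruction)
    return abs(e_in - e_out) / (e_in + 1e-9)

# toy usage with random arrays standing in for an image and its reconstruction
rng = np.random.default_rng(0)
x = rng.random((64, 64))
x_rec = x + 0.05 * rng.standard_normal((64, 64))
print(hfi_style_score(x, x_rec))
```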

[CV-47] Open-Set Object Detection By Aligning Known Class Representations WACV’24

【速读】: 该论文旨在解决开放集目标检测(Open-Set Object Detection, OSOD)任务中的未知物体检测问题。现有的方法通过对比聚类(contrastive clustering)来分离未知类别,但本文提出了一种新的基于语义聚类(semantic clustering)的方法,以在语义空间中对齐聚类,并引入类去相关模块(class decorrelation module)来增强类间分离。此外,本文还提出了一个物体聚焦模块(object focus module)来预测物体性得分(objectness scores),从而提升未知物体的检测效果。为了进一步优化模型性能,本文采用了一种评估技术,对低置信度输出进行惩罚,以减少未知物体的误分类风险,并引入了一种新的度量指标HMP(Harmonic Mean of Precision),通过调和平均结合已知和未知类别的精确度。实验结果表明,该方法在MS-COCO和PASCAL VOC数据集上显著提升了OSOD任务的性能。

链接: https://arxiv.org/abs/2412.20701
作者: Hiran Sarkar,Vishal Chudasama,Naoyuki Onoe,Pankaj Wasnik,Vineeth N Balasubramanian
机构: 未知
关键词: contemporary research direction, Open-Set Object Detection, OSOD task, contemporary research, research direction
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV’24

点击查看摘要

Abstract:Open-Set Object Detection (OSOD) has emerged as a contemporary research direction to address the detection of unknown objects. Recently, few works have achieved remarkable performance in the OSOD task by employing contrastive clustering to separate unknown classes. In contrast, we propose a new semantic clustering-based approach to facilitate a meaningful alignment of clusters in semantic space and introduce a class decorrelation module to enhance inter-cluster separation. Our approach further incorporates an object focus module to predict objectness scores, which enhances the detection of unknown objects. Further, we employ i) an evaluation technique that penalizes low-confidence outputs to mitigate the risk of misclassification of the unknown objects and ii) a new metric called HMP that combines known and unknown precision using harmonic mean. Our extensive experiments demonstrate that the proposed model achieves significant improvement on the MS-COCO and PASCAL VOC datasets for the OSOD task.
zh

[CV-48] Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks

【速读】: 该论文旨在解决无监督视觉语言模型(Vision Language Models, VLMs)选择的问题,即在仅有无监督下游数据集且无额外信息的情况下,如何选择性能最佳的VLM。现有方法依赖于有监督的大规模数据集和大型语言模型,这在部署时可能不可行或不可获取。论文提出的解决方案是Visual-tExtual Graph Alignment (VEGA),该方法通过测量VLM在下游任务中视觉和文本模态之间的对齐程度来选择模型,而无需任何标注。VEGA的核心在于利用VLM的预训练范式,即通过将视觉和文本模态中具有相同语义的特征对齐,映射到一个共享的表示空间。具体而言,VEGA首先分别构建视觉和文本特征的图,然后通过计算视觉图和文本图在节点和边层次上的整体相似性来评估模型性能。实验结果表明,VEGA在多种应用场景和下游数据集上均能提供可靠且准确的VLM性能估计。

链接: https://arxiv.org/abs/2412.20682
作者: Yuhe Ding,Bo Jiang,Aihua Zheng,Qin Xu,Jian Liang
机构: 未知
关键词: CLIP show stellar, show stellar zero-shot, stellar zero-shot capability, CLIP show, show stellar
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks. However, selecting the VLM with the highest performance on the unlabeled downstream task is non-trivial. Existing VLM selection methods focus on the class-name-only setting, relying on a supervised large-scale dataset and large language models, which may not be accessible or feasible during deployment. This paper introduces the problem of unsupervised vision-language model selection, where only unsupervised downstream datasets are available, with no additional information provided. To solve this problem, we propose a method termed Visual-tExtual Graph Alignment (VEGA), to select VLMs without any annotations by measuring the alignment of the VLM between the two modalities on the downstream task. VEGA is motivated by the pretraining paradigm of VLMs, which aligns features with the same semantics from the visual and textual modalities, thereby mapping both modalities into a shared representation space. Specifically, we first construct two graphs on the vision and textual features, respectively. VEGA is then defined as the overall similarity between the visual and textual graphs at both node and edge levels. Extensive experiments across three different benchmarks, covering a variety of application scenarios and downstream datasets, demonstrate that VEGA consistently provides reliable and accurate estimates of VLMs’ performance on unlabeled downstream tasks.
zh
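
下面是一个示意性的 Python 草图,体现“分别在视觉与文本特征上建图、再在节点与边两个层次度量图相似度”的思路;其中的伪标签分配、原型构造与打分方式均为本文为便于说明所作的假设,并非 VEGA 的原始定义。

```python
import numpy as np

def cosine_matrix(a, b):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-9)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-9)
    return a @ b.T

def vega_style_score(img_feats, txt_feats):
    """Illustrative VEGA-style alignment score (not the paper's exact
    definition). img_feats: (N, d) image features of the unlabeled
    downstream set; txt_feats: (C, d) text features of the class prompts,
    both taken from the *same* candidate VLM."""
    # pseudo-label each image by its nearest text node
    assign = cosine_matrix(img_feats, txt_feats).argmax(axis=1)
    # visual prototype per class (fall back to the global mean if a class is empty)
    protos = np.stack([img_feats[assign == c].mean(axis=0)
                       if np.any(assign == c) else img_feats.mean(axis=0)
                       for c in range(txt_feats.shape[0])])
    # node-level: agreement between corresponding visual-prototype / text nodes
    node_sim = np.diag(cosine_matrix(protos, txt_feats)).mean()
    # edge-level: agreement between the two class-class similarity graphs
    edge_v, edge_t = cosine_matrix(protos, protos), cosine_matrix(txt_feats, txt_feats)
    edge_sim = 1.0 - np.abs(edge_v - edge_t).mean()
    return float(node_sim + edge_sim)   # higher -> prefer this VLM

# toy usage with random features standing in for CLIP-style outputs
rng = np.random.default_rng(0)
print(vega_style_score(rng.standard_normal((100, 512)), rng.standard_normal((5, 512))))
```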

[CV-49] Prototypical Distillation and Debiased Tuning for Black-box Unsupervised Domain Adaptation

【速读】: 该论文旨在解决黑箱领域自适应(black-box domain adaptation)问题,即在无法直接访问源域数据的情况下,仅通过API获取源模型的预测标签及其置信度,将知识从源域迁移到无标签的目标域。这一问题的背景是源数据可能通过模型反演攻击(model inversion attacks)泄露,因此需要一种无需源数据的安全迁移方法。论文提出的解决方案名为原型蒸馏与去偏调优(ProDDing),其关键步骤包括:首先,利用源模型的原始预测和目标域生成的原型作为教师模型,蒸馏出一个定制化的目标模型;其次,通过惩罚偏向某些类别的logits,持续微调蒸馏后的模型。实验结果表明,ProDDing在多个基准测试中优于现有的黑箱领域自适应方法,尤其在仅提供预测标签的硬标签黑箱领域自适应场景下,表现尤为显著。

链接: https://arxiv.org/abs/2412.20670
作者: Jian Liang,Lijun Sheng,Hongmin Liu,Ran He
机构: 未知
关键词: Unsupervised domain adaptation, unlabeled target domain, domain adaptation, label-rich source domain, black-box domain adaptation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised domain adaptation aims to transfer knowledge from a related, label-rich source domain to an unlabeled target domain, thereby circumventing the high costs associated with manual annotation. Recently, there has been growing interest in source-free domain adaptation, a paradigm in which only a pre-trained model, rather than the labeled source data, is provided to the target domain. Given the potential risk of source data leakage via model inversion attacks, this paper introduces a novel setting called black-box domain adaptation, where the source model is accessible only through an API that provides the predicted label along with the corresponding confidence value for each query. We develop a two-step framework named Prototypical Distillation and Debiased tuning (ProDDing). In the first step, ProDDing leverages both the raw predictions from the source model and prototypes derived from the target domain as teachers to distill a customized target model. In the second step, ProDDing keeps fine-tuning the distilled model by penalizing logits that are biased toward certain classes. Empirical results across multiple benchmarks demonstrate that ProDDing outperforms existing black-box domain adaptation methods. Moreover, in the case of hard-label black-box domain adaptation, where only predicted labels are available, ProDDing achieves significant improvements over these methods. Code will be available at this https URL.
zh
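
下面给出第二步“去偏调优”思想的一个假设性 PyTorch 草图:在蒸馏损失之外,对批内平均预测分布塌缩到少数类别的情况加以惩罚。损失的具体形式(此处用边缘分布与均匀分布的 KL 散度)为示例假设,并非论文原式。

```python
import torch
import torch.nn.functional as F

def debiased_loss(logits, teacher_probs, lam=0.1):
    """Illustrative two-term objective for the fine-tuning step.

    logits:        (B, C) outputs of the distilled target model
    teacher_probs: (B, C) soft targets (in the paper a mix of source-API
                   predictions and prototype-based pseudo-labels; here
                   simply given as an input)
    The second term penalizes predictions whose batch-wise marginal
    collapses onto a few classes, a simple stand-in for "penalizing
    logits biased toward certain classes".
    """
    log_p = F.log_softmax(logits, dim=1)
    distill = -(teacher_probs * log_p).sum(dim=1).mean()           # cross-entropy to soft targets
    marginal = F.softmax(logits, dim=1).mean(dim=0)                # average prediction over the batch
    debias = (marginal * torch.log(marginal * logits.size(1) + 1e-9)).sum()  # KL(marginal || uniform)
    return distill + lam * debias

# toy usage
logits = torch.randn(8, 10, requires_grad=True)
teacher = F.softmax(torch.randn(8, 10), dim=1)
debiased_loss(logits, teacher).backward()
```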

[CV-50] Recurrence-based Vanishing Point Detection WACV2025

【速读】: 该论文旨在解决传统消失点检测(Vanishing Point Detection, VPD)方法仅依赖图像中显式直线,以及现有监督深度学习方法需要标注数据集进行训练的问题。论文提出了一种无监督的替代方法:基于重复性的消失点检测(Recurrence-based Vanishing Point Detection, R-VPD),该方法不仅利用显式直线,还通过图像中重复出现的对应关系发现隐式直线。此外,论文贡献了两个用于消失点检测的重复模式数据集(Recurring-Pattern-for-Vanishing-Point, RPVP):1)包含3,200个真实消失点和相机参数的合成图像数据集;2)包含1,400个人工标注消失点的真实世界图像数据集。实验表明,该无监督方法在合成图像数据集上优于所有对比方法,在真实世界图像数据集上优于传统方法,并与监督学习方法表现相当。

链接: https://arxiv.org/abs/2412.20666
作者: Skanda Bharadwaj,Robert Collins,Yanxi Liu
机构: 未知
关键词: Vanishing Point Detection, Point Detection, Recurrence-based Vanishing Point, Vanishing Point, truth vanishing points
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025

点击查看摘要

Abstract:Classical approaches to Vanishing Point Detection (VPD) rely solely on the presence of explicit straight lines in images, while recent supervised deep learning approaches need labeled datasets for training. We propose an alternative unsupervised approach: Recurrence-based Vanishing Point Detection (R-VPD) that uses implicit lines discovered from recurring correspondences in addition to explicit lines. Furthermore, we contribute two Recurring-Pattern-for-Vanishing-Point (RPVP) datasets: 1) a Synthetic Image dataset with 3,200 ground truth vanishing points and camera parameters, and 2) a Real-World Image dataset with 1,400 human annotated vanishing points. We compare our method with two classical methods and two state-of-the-art deep learning-based VPD methods. We demonstrate that our unsupervised approach outperforms all the methods on the synthetic images dataset, outperforms the classical methods, and is on par with the supervised learning approaches on real-world images.
zh

[CV-51] SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

【速读】: 该论文旨在解决遥感领域中高分辨率多模态图像(high-resolution multi-modal imagery)目标检测的挑战。传统目标检测模型通常基于单一数据集训练,局限于特定成像模态和标注格式,忽视了多模态间的共享知识,限制了模型在多样化场景中的适用性。为此,论文提出了多模态数据集与多任务目标检测(Multi-Modal Datasets and Multi-Task Object Detection, M2Det)这一新任务,旨在从任意传感器模态中准确检测水平或定向目标。解决这一问题的关键在于:1)管理多模态建模时的权衡;2)多任务优化的复杂性。为此,论文提出了统一模型SM3Det(Single Model for Multi-Modal datasets and Multi-Task object Detection),其核心是采用网格级稀疏专家混合(grid-level sparse MoE)骨干网络,实现联合知识学习的同时保留不同模态的独特特征表示。此外,SM3Det通过动态学习率调整策略实现一致性和同步优化,有效处理不同模态和任务间的学习难度差异。实验表明,SM3Det在多个数据集上均优于专用模型,展现出良好的有效性和泛化能力。

链接: https://arxiv.org/abs/2412.20665
作者: Yuxuan Li,Xiang Li,Yunheng Li,Yicheng Zhang,Yimian Dai,Qibin Hou,Ming-Ming Cheng,Jian Yang
机构: 未知
关键词: high-resolution multi-modal imagery, Multi-Task Object Detection, Object detection, remote sensing technology, widely accessible
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional Object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, it integrates a consistency and synchronization optimization strategy using dynamic learning rate adjustment, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det’s effectiveness and generalizability, consistently outperforming specialized models on individual datasets. The code is available at this https URL.
zh
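
下面用一个极简的 PyTorch 片段示意“网格级稀疏专家混合(sparse MoE)路由”的基本做法:每个网格 token 仅被路由到得分最高的若干专家,从而在共享部分专家的同时保留模态/任务各自的表示。该片段只演示路由思想,与 SM3Det 的实际骨干结构和优化策略无关,专家数、top-k 等超参数均为假设。

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy grid-level sparse mixture-of-experts layer: each spatial token
    is routed to its top-k experts only."""
    def __init__(self, dim, n_experts=4, k=1):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
             for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)
        self.k = k

    def forward(self, tokens):                       # tokens: (N, dim) grid cells
        scores = self.gate(tokens)                   # (N, E) routing logits
        topv, topi = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topv, dim=-1)        # renormalise over the chosen experts
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e             # tokens routed to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot:slot + 1] * expert(tokens[sel])
        return out

# toy forward pass over 10 grid tokens of dimension 64
print(SparseMoE(dim=64)(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```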

[CV-52] Enhancing Table Recognition with Vision LLM s: A Benchmark and Neighbor-Guided Toolchain Reasoner

【速读】: 该论文旨在解决使用视觉大语言模型(Vision Large Language Models, VLLMs)在无结构化表格识别中的瓶颈问题,特别是低质量图像输入对识别过程的显著影响。论文提出的解决方案之关键是设计了一个名为“邻居引导工具链推理器”(Neighbor-Guided Toolchain Reasoner, NGTR)的框架。该框架通过集成多个轻量级模型进行低层次视觉处理操作,以缓解低质量输入图像带来的问题。具体而言,NGTR利用邻居检索机制来指导生成多个工具调用计划,并将相似邻居的工具选择经验迁移到给定输入中,从而促进合适的工具选择。此外,NGTR还引入了一个反思模块来监督工具调用过程。通过在公开的表格识别数据集上进行广泛实验,论文证明了该框架显著提升了基础VLLMs的识别能力。

链接: https://arxiv.org/abs/2412.20662
作者: Yitong Zhou,Mingyue Cheng,Qingyang Mao,Qi Liu,Feiyang Xu,Xin Li,Enhong Chen
机构: 未知
关键词: Vision Large Language, Large Language Models, Pre-trained foundation models, recently significantly progressed, structured table understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained foundation models have recently significantly progressed in structured table understanding and reasoning. However, despite advancements in areas such as table semantic understanding and table question answering, recognizing the structure and content of unstructured tables using Vision Large Language Models (VLLMs) remains under-explored. In this work, we address this research gap by employing VLLMs in a training-free reasoning paradigm. First, we design a benchmark with various hierarchical dimensions relevant to table recognition. Subsequently, we conduct in-depth evaluations using pre-trained VLLMs, finding that low-quality image input is a significant bottleneck in the recognition process. Drawing inspiration from these findings, we propose the Neighbor-Guided Toolchain Reasoner (NGTR) framework, which is characterized by integrating multiple lightweight models for low-level visual processing operations aimed at mitigating issues with low-quality input images. Specifically, we utilize a neighbor retrieval mechanism to guide the generation of multiple tool invocation plans, transferring tool selection experiences from similar neighbors to the given input, thereby facilitating suitable tool selection. Additionally, we introduce a reflection module to supervise the tool invocation process. Extensive experiments on public table recognition datasets demonstrate that our approach significantly enhances the recognition capabilities of the vanilla VLLMs. We believe that the designed benchmark and the proposed NGTR framework could provide an alternative solution in table recognition.
zh

[CV-53] Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model AAAI2025

【速读】: 该论文旨在解决生成高质量全身人-物交互运动序列(whole-body human object interaction motion sequences)的挑战,特别是在动画、虚拟现实/增强现实(VR/AR)和机器人等领域。主要难点在于如何根据物体复杂形状、不同尺寸及其运动轨迹,确定每只手的参与程度,同时确保抓握的真实性和全身各部分的运动协调性。现有方法要么生成缺乏详细手部抓握姿势的交互运动序列,要么仅建模静态抓握姿势。为此,论文提出了一种简单而有效的框架,通过单一扩散模型(diffusion model)联合建模身体、手部与给定物体运动序列之间的关系。为引导网络感知物体的空间位置并学习更自然的抓握姿势,论文引入了新颖的接触感知损失(contact-aware losses)并结合了数据驱动的精心设计指导。实验结果表明,该方法在生成逼真的全身运动序列方面优于现有最先进方法。

链接: https://arxiv.org/abs/2412.20657
作者: Yonghao Zhang,Qiang He,Yanguang Wan,Yinda Zhang,Xiaoming Deng,Cuixia Ma,Hongan Wang
机构: 未知
关键词: Generating high-quality whole-body, Generating high-quality, interaction motion sequences, motion sequences, increasingly important
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Generating high-quality whole-body human object interaction motion sequences is becoming increasingly important in various fields such as animation, VR/AR, and robotics. The main challenge of this task lies in determining the level of involvement of each hand given the complex shapes of objects in different sizes and their different motion trajectories, while ensuring strong grasping realism and guaranteeing the coordination of movement in all body parts. Contrasting with existing work, which either generates human interaction motion sequences without detailed hand grasping poses or only models a static grasping pose, we propose a simple yet effective framework that jointly models the relationship between the body, hands, and the given object motion sequences within a single diffusion model. To guide our network in perceiving the object’s spatial position and learning more natural grasping poses, we introduce novel contact-aware losses and incorporate a data-driven, carefully designed guidance. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible whole-body motion sequences.
zh

[CV-54] Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis

【速读】: 该论文旨在解决在医学影像领域中,由于数据稀缺性和隐私问题导致的大规模数据集难以获取的问题,以及预训练通用模型与医学领域数据分布差异(distribution shift)带来的微调挑战。论文提出的解决方案是“潜在漂移”(Latent Drift, LD),该方法可以应用于任何微调策略,以缓解分布差异带来的问题,或在推理时作为条件使用。潜在漂移使得扩散模型能够适应医学影像,特别是用于反事实图像生成(counterfactual image generation)这一复杂任务,这对于研究性别、年龄、疾病增减等参数如何影响医学影像至关重要。通过在三个公开的脑部MRI和胸部X射线纵向基准数据集上的评估,该方法在不同微调方案下均表现出显著的性能提升。

链接: https://arxiv.org/abs/2412.20651
作者: Yousef Yeganeh,Ioannis Charisiadis,Marta Hasny,Martin Hartenberger,Björn Ommer,Nassir Navab,Azade Farshad,Ehsan Adeli
机构: 未知
关键词: produce synthetic samples, Scaling by training, medical imaging due, large datasets, data is scarce
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models; however, such large datasets are not always accessible in medical imaging due to cost and privacy issues, which contradicts one of the main applications of such models to produce synthetic samples where real data is scarce. Also, finetuning on pre-trained general models has been a challenge due to the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models that can be adopted for any fine-tuning method to mitigate the issues faced by the distribution shift or employed in inference time as a condition. Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation, which is crucial to investigate how parameters such as gender, age, and adding or removing diseases in a patient would alter the medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this work will be publicly released upon its acceptance.
zh

[CV-55] Enhancing Visual Representation for Text-based Person Searching

【速读】: 该论文旨在解决基于文本的行人检索(Text-based Person Search)任务中的核心难题,即如何从行人图像和文本描述中提取有效细节,并在共同的潜在空间内实现跨模态对齐。现有方法通常采用在单模态数据上预训练的图像和文本编码器,分别提取全局和局部特征,并通过显式的全局-局部对齐来实现跨模态匹配。然而,这些方法在理解视觉细节方面仍存在不足,检索精度受到身份混淆的限制。为解决上述问题,论文提出了VFE-TPS(Visual Feature Enhanced Text-based Person Search)模型,其关键解决方案包括引入预训练的多模态骨干网络CLIP来学习基础的多模态特征,并通过构建文本引导的掩码图像建模任务(Text Guided Masked Image Modeling)来增强模型学习局部视觉细节的能力,而无需显式标注。此外,设计了身份监督的全局视觉特征校准任务(Identity Supervised Global Visual Feature Calibration),以引导模型学习身份感知的全局视觉特征。实验结果表明,通过引入这些辅助任务,预训练CLIP模型中的知识能够成功适应基于文本的行人检索任务,显著提升了模型的视觉理解能力,并在三个基准数据集上超越了现有方法,Rank-1准确率显著提高了约1%至9%。

链接: https://arxiv.org/abs/2412.20646
作者: Wei Shen,Ming Fang,Yuxia Wang,Jiafeng Xiao,Diping Li,Huangqun Chen,Ling Xu,Weifeng Zhang
机构: 未知
关键词: Text-based person search, large-scale image database, Text-based person, person search aims, person search
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-based person search aims to retrieve the matched pedestrians from a large-scale image database according to the text description. The core difficulty of this task is how to extract effective details from pedestrian images and texts, and achieve cross-modal alignment in a common latent space. Prior works adopt image and text encoders pre-trained on unimodal data to extract global and local features from image and text respectively, and then global-local alignment is achieved explicitly. However, these approaches still lack the ability of understanding visual details, and the retrieval accuracy is still limited by identity confusion. In order to alleviate the above problems, we rethink the importance of visual features for text-based person search, and propose VFE-TPS, a Visual Feature Enhanced Text-based Person Search model. It introduces a pre-trained multimodal backbone CLIP to learn basic multimodal features and constructs Text Guided Masked Image Modeling task to enhance the model’s ability of learning local visual details without explicit annotation. In addition, we design Identity Supervised Global Visual Feature Calibration task to guide the model to learn identity-aware global visual features. The key finding of our study is that, with the help of our proposed auxiliary tasks, the knowledge embedded in the pre-trained CLIP model can be successfully adapted to the text-based person search task, and the model’s visual understanding ability is significantly enhanced. Experimental results on three benchmarks demonstrate that our proposed model exceeds the existing approaches, and the Rank-1 accuracy is significantly improved with a notable margin of about 1% to 9%. Our code can be found at this https URL.
zh

[CV-56] YOLO-UniOW: Efficient Universal Open-World Object Detection

【速读】: 该论文旨在解决传统目标检测模型在开放世界场景中的局限性,特别是其无法检测训练数据集中未出现类别的问题。传统模型依赖于封闭集数据集,而多模态模型虽然通过文本和图像模态的对齐扩展了类别识别能力,但仍受限于预定义词汇,且因跨模态融合引入较大的推理开销,无法有效处理未知对象。为此,论文提出了通用开放世界目标检测(Universal Open-World Object Detection, Uni-OWD)这一新范式,并设计了YOLO-UniOW模型。该模型的关键创新在于:1)采用自适应决策学习(Adaptive Decision Learning),在CLIP潜在空间中进行轻量级对齐,替代计算密集的跨模态融合,从而在不牺牲泛化能力的前提下实现高效检测;2)设计了通配符学习策略(Wildcard Learning),能够将分布外对象检测为“未知”,并支持动态词汇扩展,无需增量学习。这些设计使YOLO-UniOW能够无缝适应开放世界环境中的新类别,显著提升了检测效率和性能。

链接: https://arxiv.org/abs/2412.20645
作者: Lihao Liu,Juexiao Feng,Hui Chen,Ao Wang,Lin Song,Jungong Han,Guiguang Ding
机构: 未知
关键词: Traditional object detection, Traditional object, Open-World Object Detection, object detection, encountered during training
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional object detection models are constrained by the limitations of closed-set datasets, detecting only categories encountered during training. While multimodal models have extended category recognition by aligning text and image modalities, they introduce significant inference overhead due to cross-modality fusion and still remain restricted by predefined vocabulary, leaving them ineffective at handling unknown objects in open-world scenarios. In this work, we introduce Universal Open-World Object Detection (Uni-OWD), a new paradigm that unifies open-vocabulary and open-world object detection tasks. To address the challenges of this setting, we propose YOLO-UniOW, a novel model that advances the boundaries of efficiency, versatility, and performance. YOLO-UniOW incorporates Adaptive Decision Learning to replace computationally expensive cross-modality fusion with lightweight alignment in the CLIP latent space, achieving efficient detection without compromising generalization. Additionally, we design a Wildcard Learning strategy that detects out-of-distribution objects as “unknown” while enabling dynamic vocabulary expansion without the need for incremental learning. This design empowers YOLO-UniOW to seamlessly adapt to new categories in open-world environments. Extensive experiments validate the superiority of YOLO-UniOW, achieving 34.6 AP and 30.0 APr on LVIS with an inference speed of 69.6 FPS. The model also sets benchmarks on M-OWODB, S-OWODB, and nuScenes datasets, showcasing its unmatched performance in open-world object detection. Code and models are available at this https URL.
zh

[CV-57] Slow Perception: Lets Perceive Geometric Figures Step-by-step

【速读】: 该论文旨在解决当前大视觉语言模型(Large Vision Language Models, LVLMs)在处理视觉推理任务,尤其是几何数学问题时,难以准确复制几何图形并理解其复杂内在逻辑和空间关系的问题。论文提出了一种称为“慢感知”(Slow Perception, SP)的概念,作为解决方案的关键。SP通过两个阶段实现:首先是感知分解(perception decomposition),将复杂几何图形分解为基本的点线组合,统一几何表示;其次是感知流(perception flow),通过引入“感知尺”(perceptual ruler)逐笔追踪线段,避免“长视觉跳跃”,从而逐步重建复杂几何结构。这种类人的感知方式遵循推理时间缩放定律,即感知越慢,效果越好,与过去追求加速模型感知的研究方向相反,强调逐步、仔细地读取图像。

链接: https://arxiv.org/abs/2412.20631
作者: Haoran Wei,Youyang Yin,Yumeng Li,Jia Wang,Liang Zhao,Jianjian Sun,Zheng Ge,Xiangyu Zhang
机构: 未知
关键词: enter people vision, Large Vision Language, Vision Language Models, geometric math problems, solve visual reasoning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, “visual o1” began to enter people’s vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step to visual o1. Accordingly, we introduce the concept of “slow perception” (SP), which guides the model to gradually perceive basic point-line combinations, as our humans, reconstruct complex geometric structures progressively. There are two-fold stages in SP: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid “long visual jumps” in regressing line segments by using a proposed “perceptual ruler” to trace each line stroke-by-stroke. Surprisingly, such a human-like perception manner enjoys an inference time scaling law – the slower, the better. Researchers strive to speed up the model’s perception in the past, but we slow it down again, allowing the model to read the image step-by-step and carefully.
zh

[CV-58] HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models

【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在复杂多模态任务中存在的物体幻觉(object hallucination)问题,即模型对图像中物体的错误识别或错误分类。为解决这一问题,作者提出了HALLUCINOGEN,一个新颖的视觉问答(Visual Question Answering, VQA)物体幻觉攻击基准,通过多样化的上下文推理提示(contextual reasoning prompts)来评估现有LVLMs在物体识别任务中的幻觉表现。关键解决方案包括设计一系列上下文推理幻觉提示,要求模型在执行视觉语言任务(如识别、定位或视觉推理)时准确识别目标图像中的物体。此外,作者还扩展了该基准至高风险医疗应用领域,提出了MED-HALLUCINOGEN,专门针对生物医学领域的幻觉攻击,并评估了LVLMs在医学图像上的幻觉表现。通过在多数据集上对八种LVLMs和两种幻觉缓解策略的广泛评估,论文揭示了当前通用和医疗LVLMs仍易受幻觉攻击的影响。

链接: https://arxiv.org/abs/2412.20622
作者: Ashish Seth,Dinesh Manocha,Chirag Agarwal
机构: 未知
关键词: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, complex multimodal tasks, performing complex multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in performing complex multimodal tasks. However, they are still plagued by object hallucination: the misidentification or misclassification of objects present in images. To this end, we propose HALLUCINOGEN, a novel visual question answering (VQA) object hallucination attack benchmark that utilizes diverse contextual reasoning prompts to evaluate object hallucination in state-of-the-art LVLMs. We design a series of contextual reasoning hallucination prompts to evaluate LVLMs’ ability to accurately identify objects in a target image while asking them to perform diverse visual-language tasks such as identifying, locating or performing visual reasoning around specific objects. Further, we extend our benchmark to high-stakes medical applications and introduce MED-HALLUCINOGEN, hallucination attacks tailored to the biomedical domain, and evaluate the hallucination performance of LVLMs on medical images, a critical area where precision is crucial. Finally, we conduct extensive evaluations of eight LVLMs and two hallucination mitigation strategies across multiple datasets to show that current generic and medical LVLMs remain susceptible to hallucination attacks.
zh

[CV-59] FreqMixFormerV2: Lightweight Frequency-aware Mixed Transformer for Human Skeleton Action Recognition

【速读】: 该论文旨在解决基于Transformer的人体骨骼动作识别模型在复杂性和高参数量需求方面的挑战,特别是在资源受限环境中的实际应用问题。为了解决这一问题,作者提出了FreqMixFormerV2模型,该模型基于频率感知混合Transformer(Frequency-aware Mixed Transformer),通过引入频域分析来识别细微且具有区分性的动作。解决方案的关键在于设计了一种轻量级架构,通过重新设计的频率算子优化了高频和低频参数调整,并简化了频率感知注意力模块。这些改进显著减少了模型参数量,同时仅以极小的精度损失实现了高效部署,从而在效率和准确性之间取得了优越的平衡。

链接: https://arxiv.org/abs/2412.20621
作者: Wenhan Wu,Pengfei Wang,Chen Chen,Aidong Lu
机构: 未知
关键词: Transformer-based human skeleton, Transformer-based human, human skeleton action, skeleton action recognition, developed for years
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE FG2025

点击查看摘要

Abstract:Transformer-based human skeleton action recognition has been developed for years. However, the complexity and high parameter count demands of these models hinder their practical applications, especially in resource-constrained environments. In this work, we propose FreqMixFormerV2, which is built upon the Frequency-aware Mixed Transformer (FreqMixFormer) for identifying subtle and discriminative actions with pioneered frequency-domain analysis. We design a lightweight architecture that maintains robust performance while significantly reducing the model complexity. This is achieved through a redesigned frequency operator that optimizes high-frequency and low-frequency parameter adjustments, and a simplified frequency-aware attention module. These improvements result in a substantial reduction in model parameters, enabling efficient deployment with only a minimal sacrifice in accuracy. Comprehensive evaluations on standard datasets (NTU RGB+D, NTU RGB+D 120, and NW-UCLA) demonstrate that the proposed model achieves a superior balance between efficiency and accuracy, outperforming state-of-the-art methods with only 60% of the parameters.
zh

[CV-60] Do Current Video LLM s Have Strong OCR Abilities? A Preliminary Study COLING2025

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models)在视频内容中准确提取和理解文本信息(即视频光学字符识别,Video OCR)的能力评估问题。为了解决这一问题,论文提出了一个新颖的基准测试(benchmark),该基准包含1,028个视频和2,961个问答对,并通过6个不同的子任务来评估模型的性能,包括文本内容及其基本视觉属性的识别、OCR对象的语义和空间理解、动态运动检测和时间定位等。解决方案的关键在于采用了一种半自动化的方法,结合了图像大语言模型(Image LLMs)的OCR能力与人工精炼,从而在效率、成本和数据质量之间取得了平衡。该基准的发布旨在推动视频大语言模型的研究,并强调提升其OCR能力的必要性。

链接: https://arxiv.org/abs/2412.20613
作者: Yulin Fei,Yuhui Gao,Xingyuan Xian,Xiaojin Zhang,Tao Wu,Wei Chen
机构: 未知
关键词: multimodal large language, understanding textual information, based optical character, optical character recognition, large language models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CoLing 2025 (The 31st International Conference on Computational Linguistics)

点击查看摘要

Abstract:With the rise of multimodal large language models, accurately extracting and understanding textual information from video content, referred to as video based optical character recognition (Video OCR), has become a crucial capability. This paper introduces a novel benchmark designed to evaluate the video OCR performance of multi-modal models in videos. Comprising 1,028 videos and 2,961 question-answer pairs, this benchmark proposes several key challenges through 6 distinct subtasks: (1) Recognition of text content itself and its basic visual attributes, (2) Semantic and Spatial Comprehension of OCR objects in videos, and (3) Dynamic Motion detection and Temporal Localization. We developed this benchmark using a semi-automated approach that integrates the OCR ability of image LLMs with manual refinement, balancing efficiency, cost, and data quality. Our resource aims to help advance research in video LLMs and underscores the need for improving OCR ability for video LLMs. The benchmark will be released on this https URL.
zh

[CV-61] Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)

【速读】: 该论文旨在解决基于预训练扩散模型(Diffusion Models, DMs)进行图像修复任务时所需的神经网络函数评估次数(Neural Function Evaluations, NFEs)过多的问题。现有的“零样本”修复方案虽然在无需为每个任务单独训练深度神经网络的情况下表现良好,但通常需要大量的NFEs,这主要归因于DMs在生成功能中对NFEs的高需求。尽管一致性模型(Consistency Models, CMs)在图像生成中能够通过较少的NFEs实现快速采样,但现有的基于CMs的修复方法仍需要数十次NFEs或针对每个任务进行微调,而微调过程中假设不准确可能导致性能下降。为此,论文提出了一种基于CMs的零样本修复方案,该方案仅需4次NFEs即可有效运行。其核心在于结合了更好的初始化、反投影引导以及一种新颖的噪声注入机制。实验表明,该方案在图像超分辨率、去模糊和修复任务中具有显著优势,且噪声注入技术不仅适用于CMs,还能在减少NFEs时缓解现有引导DM方法的性能退化问题。

链接: https://arxiv.org/abs/2412.20596
作者: Tomer Garber,Tom Tirer
机构: 未知
关键词: dedicated deep neural, deep neural network, single pretrained diffusion, Neural Function Evaluations, pretrained diffusion model
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code can be found at: this https URL

点击查看摘要

Abstract:In recent years, it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, instead of training a dedicated deep neural network per task. However, such “zero-shot” restoration schemes currently require many Neural Function Evaluations (NFEs) for performing well, which may be attributed to the many NFEs needed in the original generative functionality of the DMs. Recently, faster variants of DMs have been explored for image generation. These include Consistency Models (CMs), which can generate samples via a couple of NFEs. However, existing works that use guided CMs for restoration still require tens of NFEs or fine-tuning of the model per task that leads to performance drop if the assumptions during the fine-tuning are not accurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and operates well with as little as 4 NFEs. It is based on a wise combination of several ingredients: better initialization, back-projection guidance, and above all a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution, deblurring and inpainting. Interestingly, we show that the usefulness of our noise injection technique goes beyond CMs: it can also mitigate the performance degradation of existing guided DM methods when reducing their NFE count.
zh
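
下面用 NumPy 给出“反投影引导”这一数据一致性步骤的示意:x ← x + A⁺(y − A x),并以 2×2 平均下采样作为退化算子 A 的玩具例子。一致性模型的去噪步骤本身未包含,算子选择与步骤安排均为示例假设,并非论文完整算法。

```python
import numpy as np

def A(x, s=2):
    """Toy degradation operator: s-by-s average pooling (super-resolution setting)."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def A_pinv(y, s=2):
    """Pseudo-inverse of the average-pooling operator: nearest-neighbor upsampling."""
    return np.repeat(np.repeat(y, s, axis=0), s, axis=1)

def back_projection_step(x_est, y, s=2):
    """Enforce data consistency on the current restoration estimate:
    x <- x + A^+(y - A x). In the paper's scheme a step like this is
    interleaved with the few consistency-model denoising evaluations;
    the denoiser itself is omitted here."""
    return x_est + A_pinv(y - A(x_est, s), s)

# toy usage: after the step, A(x) exactly reproduces the low-resolution measurement
rng = np.random.default_rng(0)
x_true = rng.random((8, 8))
y = A(x_true)                      # observed low-resolution image
x0 = rng.random((8, 8))            # some current estimate
x1 = back_projection_step(x0, y)
print(np.allclose(A(x1), y))       # True
```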

[CV-62] Enhancing autonomous vehicle safety in rain: a data-centric approach for clear vision

【速读】: 该论文旨在解决自动驾驶车辆在雨天环境下因摄像头系统视觉受损而面临的导航挑战。通过利用深度学习技术,研究团队开发了一种视觉模型,该模型能够实时处理车辆摄像头捕捉的图像,消除雨滴对视觉的干扰,生成接近无雨场景的清晰图像。解决方案的关键在于采用经典的编码器-解码器架构(encoder-decoder architecture),并结合跳跃连接(skip connections)和拼接操作(concatenation operations),以有效区分连续图像帧中的高频雨滴模式和低频场景特征。此外,研究团队在CARLA仿真环境中生成了包含晴天和雨天图像的全面数据集,用于模型的训练和测试,并通过与转向模块的集成,验证了模型在提升雨天导航安全性和可靠性方面的显著效果。

链接: https://arxiv.org/abs/2412.20565
作者: Mark A. Seferian,Jidong J. Yang
机构: 未知
关键词: Autonomous vehicles face, vehicles face significant, face significant challenges, Autonomous vehicles, navigating adverse weather
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 16 pages, 16 figures, 2 tables

点击查看摘要

Abstract:Autonomous vehicles face significant challenges in navigating adverse weather, particularly rain, due to the visual impairment of camera-based systems. In this study, we leveraged contemporary deep learning techniques to mitigate these challenges, aiming to develop a vision model that processes live vehicle camera feeds to eliminate rain-induced visual hindrances, yielding visuals closely resembling clear, rain-free scenes. Using the Car Learning to Act (CARLA) simulation environment, we generated a comprehensive dataset of clear and rainy images for model training and testing. In our model, we employed a classic encoder-decoder architecture with skip connections and concatenation operations. It was trained using novel batching schemes designed to effectively distinguish high-frequency rain patterns from low-frequency scene features across successive image frames. To evaluate the model performance, we integrated it with a steering module that processes front-view images as input. The results demonstrated notable improvements in steering accuracy, underscoring the model’s potential to enhance navigation safety and reliability in rainy weather conditions.
zh
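
下面是摘要所述“带跳跃连接与拼接操作的编码器-解码器”结构的一个最小 PyTorch 草图,仅用于说明网络形态;通道数与层数均为假设,与论文实际网络及其批处理训练方案无关。

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class DerainNet(nn.Module):
    """Minimal encoder-decoder with skip connections and concatenation,
    mapping a rainy RGB frame to a de-rained estimate."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)      # 64 (upsampled) + 64 (skip)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)       # 32 (upsampled) + 32 (skip)
        self.out = nn.Conv2d(32, 3, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.out(d1))   # clear-scene estimate in [0, 1]

# toy forward pass on a 128x128 frame
print(DerainNet()(torch.rand(1, 3, 128, 128)).shape)  # torch.Size([1, 3, 128, 128])
```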

[CV-63] Exploiting Aggregation and Segregation of Representations for Domain Adaptive Human Pose Estimation

【速读】: 该论文旨在解决人体姿态估计(Human Pose Estimation, HPE)领域中的域适应(Domain Adaptation, DA)问题,特别是在缺乏多样化真实世界标注数据集的情况下,如何有效利用合成数据集进行模型训练并应用于真实数据。现有域适应技术主要关注源域和目标域特征的对齐与聚合,但往往忽略了排除域特定表示的关键任务。为此,论文提出了一种新颖的框架,通过将表示分解为域不变(domain-invariant)和域特定(domain-specific)组件,实现域不变特征的聚合与域特定特征的分离。此外,该框架还深入研究了关键点之间的关系,并应用不同的聚合或分离机制以增强对齐效果。实验结果表明,该方法在多个基准数据集上均达到了最先进的性能。

链接: https://arxiv.org/abs/2412.20538
作者: Qucheng Peng,Ce Zheng,Zhengming Ding,Pu Wang,Chen Chen
机构: 未知
关键词: received increasing attention, increasing attention recently, attention recently due, Human pose estimation, virtual reality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by the 2025 IEEE International Conference on Automatic Face and Gesture Recognition (FG 2025)

点击查看摘要

Abstract:Human pose estimation (HPE) has received increasing attention recently due to its wide application in motion analysis, virtual reality, healthcare, etc. However, it suffers from the lack of labeled diverse real-world datasets due to the time- and labor-intensive annotation. To cope with the label deficiency issue, one common solution is to train the HPE models with easily available synthetic datasets (source) and apply them to real-world data (target) through domain adaptation (DA). Unfortunately, prevailing domain adaptation techniques within the HPE domain remain predominantly fixated on effecting alignment and aggregation between source and target features, often sidestepping the crucial task of excluding domain-specific representations. To rectify this, we introduce a novel framework that capitalizes on both representation aggregation and segregation for domain adaptive human pose estimation. Within this framework, we address the network architecture aspect by disentangling representations into distinct domain-invariant and domain-specific components, facilitating aggregation of domain-invariant features while simultaneously segregating domain-specific ones. Moreover, we tackle the discrepancy measurement facet by delving into various keypoint relationships and applying separate aggregation or segregation mechanisms to enhance alignment. Extensive experiments on various benchmarks, e.g., Human3.6M, LSP, H3D, and FreiHand, show that our method consistently achieves state-of-the-art performance. The project is available at this https URL.
zh

[CV-64] KVC-onGoing: Keystroke Verification Challenge

【速读】: 该论文旨在通过Keystroke Verification Challenge - onGoing (KVC-onGoing)平台,为研究人员提供一个统一的基准测试环境,以评估其基于击键动态(keystroke dynamics)的身份验证系统。解决方案的关键在于利用Aalto University Keystroke数据库中的大规模公开数据,这些数据包含来自超过185,000名受试者的推文长度可变文本序列,模拟了桌面和移动键盘的真实使用场景。通过标准化的实验协议,KVC-onGoing平台显著提升了击键动态的判别能力,在桌面和移动场景下分别达到了3.33%和3.61%的等错误率(EER),以及在1%错误匹配率(FMR)下的11.96%和17.44%的错误非匹配率(FNMR)。此外,该平台还考虑了人口统计公平性,分析了年龄和性别对评分的影响。整个框架运行在CodaLab上,为研究人员提供了一个便捷且高效的评估工具。

链接: https://arxiv.org/abs/2412.20530
作者: Giuseppe Stragapede,Ruben Vera-Rodriguez,Ruben Tolosana,Aythami Morales,Ivan DeAndres-Tame,Naser Damer,Julian Fierrez,Javier Ortega-Garcia,Alejandro Acien,Nahuel Gonzalez,Andrei Shadrikov,Dmitrii Gordin,Leon Schmitt,Daniel Wimmer,Christoph Großmann,Joerdis Krieger,Florian Heinz,Ron Krestel,Christoffer Mayer,Simon Haberl,Helena Gschrey,Yosuke Yamagishi,Sanjay Saha,Sanka Rasnayaka,Sandareka Wickramanayake,Terence Sim,Weronika Gutfeter,Adam Baran,Mateusz Krzysztoń,Przemysław Jaskóła
机构: 未知
关键词: Keystroke Verification Challenge, Aalto University Keystroke, University Keystroke databases, large-scale public databases, standard experimental protocol
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2401.16559 , arXiv:2311.06000

点击查看摘要

Abstract:This article presents the Keystroke Verification Challenge - onGoing (KVC-onGoing), on which researchers can easily benchmark their systems in a common platform using large-scale public databases, the Aalto University Keystroke databases, and a standard experimental protocol. The keystroke data consist of tweet-long sequences of variable transcript text from over 185,000 subjects, acquired through desktop and mobile keyboards simulating real-life conditions. The results on the evaluation set of KVC-onGoing have proved the high discriminative power of keystroke dynamics, reaching values as low as 3.33% of Equal Error Rate (EER) and 11.96% of False Non-Match Rate (FNMR) @1% False Match Rate (FMR) in the desktop scenario, and 3.61% of EER and 17.44% of FNMR @1% FMR in the mobile scenario, significantly improving previous state-of-the-art results. Concerning demographic fairness, the analyzed scores reflect the subjects’ age and gender to various extents, not negligible in a few cases. The framework runs on CodaLab.
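作为补充,下面用 numpy 给出从真/假匹配分数计算等错误率(EER)与 FNMR@FMR 的一个简化示意;阈值扫描方式与模拟分数均为演示假设,与挑战赛官方评测脚本无关:

```python
import numpy as np

def eer_and_fnmr_at_fmr(genuine, impostor, target_fmr=0.01):
    """genuine/impostor: 相似度分数数组(分数越高越像同一用户)。"""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fmr = np.array([(impostor >= t).mean() for t in thresholds])   # 假匹配率
    fnmr = np.array([(genuine < t).mean() for t in thresholds])    # 错误非匹配率
    idx = np.argmin(np.abs(fmr - fnmr))                            # FMR≈FNMR 处取 EER
    eer = (fmr[idx] + fnmr[idx]) / 2
    # 在满足 FMR<=target_fmr 的阈值中取最宽松的一个,报告对应的 FNMR
    ok = np.where(fmr <= target_fmr)[0]
    fnmr_at = fnmr[ok[0]] if len(ok) else 1.0
    return eer, fnmr_at

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 500)     # 演示用的模拟分数
impostor = rng.normal(0.4, 0.1, 5000)
print(eer_and_fnmr_at_fmr(genuine, impostor))
```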
zh

[CV-65] MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks

【速读】: 该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在新型视图合成和实时渲染中因使用数百万个高斯分布而导致的高内存消耗问题。现有的解决方案通过手工标准或学习掩码(learned masks)来剪枝不必要的高斯分布,但这些方法基于剪枝时刻的快照确定性地移除高斯分布,导致从长期视角来看重建性能次优。为此,论文提出了MaskGaussian,将高斯分布建模为概率实体而非永久移除,并根据其存在概率进行利用。关键创新在于提出了一种掩码光栅化(masked-rasterization)技术,使未使用但概率存在的高斯分布能够接收梯度,从而动态评估其对场景演化的贡献并调整其存在概率。这种方法使得高斯分布的重要性迭代变化,剪枝选择更加多样化。实验表明,该方法在减少高斯分布数量的同时,渲染质量优于现有剪枝方法,平均剪枝超过60%的高斯分布,仅导致0.02的PSNR下降。

链接: https://arxiv.org/abs/2412.20522
作者: Yifei Liu,Zhihang Zhong,Yifan Zhan,Sheng Xu,Xiao Sun
机构: 未知
关键词: high memory consumption, memory consumption due, demonstrated remarkable performance, Gaussian Splatting, Gaussians
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and real-time rendering, the high memory consumption due to the use of millions of Gaussians limits its practicality. To mitigate this issue, improvements have been made by pruning unnecessary Gaussians, either through a hand-crafted criterion or by using learned masks. However, these methods deterministically remove Gaussians based on a snapshot of the pruning moment, leading to sub-optimized reconstruction performance from a long-term perspective. To address this issue, we introduce MaskGaussian, which models Gaussians as probabilistic entities rather than permanently removing them, and utilize them according to their probability of existence. To achieve this, we propose a masked-rasterization technique that enables unused yet probabilistically existing Gaussians to receive gradients, allowing for dynamic assessment of their contribution to the evolving scene and adjustment of their probability of existence. Hence, the importance of Gaussians iteratively changes and the pruned Gaussians are selected diversely. Extensive experiments demonstrate the superiority of the proposed method in achieving better rendering quality with fewer Gaussians than previous pruning methods, pruning over 60% of Gaussians on average with only a 0.02 PSNR decline. Our code can be found at: this https URL
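论文把每个高斯是否参与渲染建模为一个存在概率。下面给出一个与该思想相近但高度简化的示意:用 Bernoulli 采样加直通估计(straight-through),让未被采样到的高斯也能经由概率项收到梯度;masked-rasterization 的真实实现远比此复杂,此处的代理损失与参数化方式均为演示假设:

```python
import torch

class ProbabilisticMask(torch.nn.Module):
    """为 N 个高斯维护存在概率,采样 0/1 掩码并用直通估计回传梯度。"""
    def __init__(self, num_gaussians):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(num_gaussians))

    def forward(self):
        p = torch.sigmoid(self.logits)          # 存在概率
        m_hard = torch.bernoulli(p.detach())    # 采样得到 0/1 掩码
        # 直通估计:前向取 m_hard,反向梯度经由 p 传回 logits
        return m_hard + p - p.detach()

num_g = 10000
mask = ProbabilisticMask(num_g)
opacity = torch.rand(num_g, requires_grad=True)   # 模拟每个高斯的不透明度
m = mask()
rendered_proxy = (m * opacity).sum()              # 用一个代理量代替真实渲染损失
rendered_proxy.backward()                         # 未被采样到的高斯也有梯度(经由 p)
print(mask.logits.grad.abs().mean())
```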
zh

[CV-66] Can Robots “Taste” Grapes? Estimating SSC with Simple RGB Sensors

【速读】: 该论文旨在解决在葡萄栽培中,如何通过简单且成本效益高的方法准确评估果实品质,特别是可溶性固形物含量(SSC, Soluble Solid Content)的问题。传统方法如高光谱相机在实验室条件下能高精度估计SSC,但在田间环境中实用性受限。论文提出利用简单的RGB传感器在非受控光照条件下估计SSC和颜色,从而实现机器人辅助采摘。研究通过2021和2022年夏季采集的葡萄图像及其对应的SSC和颜色标签,评估了在嵌入式设备(如机器人和智能手机)上实现SSC估计的算法解决方案。关键解决方案包括为资源受限的机器人提出计算高效的基于直方图的方法,以及为更复杂应用提出的深度学习方法。研究结果表明,通过视觉外观估计SSC可以达到类似人类的性能。

链接: https://arxiv.org/abs/2412.20521
作者: Thomas Alessandro Ciarfuglia,Ionut Marian Motoi,Leonardo Saraceni,Daniele Nardi
机构: 未知
关键词: accurately assessing fruit, Soluble Solid Content, assessing fruit quality, table grape cultivation, depends on accurately
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In table grape cultivation, harvesting depends on accurately assessing fruit quality. While some characteristics, like color, are visible, others, such as Soluble Solid Content (SSC), or sugar content measured in degrees Brix (°Brix), require specific tools. SSC is a key quality factor that correlates with ripeness, but lacks a direct causal relationship with color. Hyperspectral cameras can estimate SSC with high accuracy under controlled laboratory conditions, but their practicality in field environments is limited. This study investigates the potential of simple RGB sensors under uncontrolled lighting to estimate SSC and color, enabling cost-effective, robot-assisted harvesting. Over the 2021 and 2022 summer seasons, we collected grape images with corresponding SSC and color labels to evaluate algorithmic solutions for SSC estimation on embedded devices commonly used in robotics and smartphones. Our results demonstrate that SSC can be estimated from visual appearance with human-like performance. We propose computationally efficient histogram-based methods for resource-constrained robots and deep learning approaches for more complex applications.
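论文面向资源受限机器人提出了"基于直方图的方法",其思路可用如下极简流程理解:把葡萄图块的 RGB 直方图作为特征,再用轻量回归器预测 SSC(°Brix)。以下代码仅为流程示意,数据为随机生成,特征维度与回归器选择均为演示假设:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def rgb_histogram(image, bins=16):
    """image: HxWx3 uint8;返回三个通道直方图拼接并归一化后的特征向量。"""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    feats = np.concatenate(feats).astype(np.float32)
    return feats / feats.sum()

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(200, 64, 64, 3), dtype=np.uint8)  # 模拟葡萄图块
ssc = rng.uniform(14, 22, size=200)                                   # 模拟 °Brix 标签

X = np.stack([rgb_histogram(im) for im in images])
X_tr, X_te, y_tr, y_te = train_test_split(X, ssc, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("R^2 on held-out:", model.score(X_te, y_te))   # 随机数据下接近 0,仅演示流程
```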
zh

[CV-67] DPBridge: Latent Diffusion Bridge for Dense Prediction

【速读】: 该论文旨在解决传统扩散模型(Diffusion Models)在密集预测任务(Dense Prediction Problems)中因从无信息噪声先验(Uninformative Noise Prior)开始反向采样轨迹而导致的性能下降和推理速度缓慢的问题。为解决这一问题,论文提出了DPBridge生成框架,其关键创新在于将密集预测任务重新定义为图像条件生成问题(Image-Conditioned Generation Problems),并基于完全可处理的扩散桥过程(Fully-Tractable Diffusion Bridge Process)建立输入图像与其对应密集信号图之间的直接映射。此外,论文还引入了微调策略,利用预训练图像扩散骨干网络(Pretrained Image Diffusion Backbone)的丰富视觉先验知识,以促进高效训练和鲁棒的泛化能力。实验结果表明,DPBridge在多个基准测试中均表现出与现有前馈方法和扩散模型相竞争的性能,验证了其有效性和适应性。

链接: https://arxiv.org/abs/2412.20506
作者: Haorui Ji,Taojun Lin,Hongdong Li
机构: 未知
关键词: complex data distributions, demonstrated remarkable success, effectively capture complex, capture complex data, relationship between RGB
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable success in dense prediction problems, which aim to model the per-pixel relationship between RGB images and dense signal maps, thanks to their ability to effectively capture complex data distributions. However, initiating the reverse sampling trajectory from an uninformative noise prior introduces limitations such as degraded performance and slow inference speed. In this work, we propose DPBridge, a generative framework that formulates dense prediction tasks as image-conditioned generation problems and establishes a direct mapping between an input image and its corresponding dense map based on a fully-tractable diffusion bridge process. This approach addresses the aforementioned limitations in conventional diffusion-based solutions. In addition, we introduce finetuning strategies to adapt our model from a pretrained image diffusion backbone, leveraging its rich visual prior knowledge to facilitate both efficient training and robust generalization ability. Experimental results show that our DPBridge can achieve competitive performance compared to both feed-forward and diffusion-based approaches across various benchmarks, highlighting its effectiveness and adaptability.
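扩散桥(diffusion bridge)的核心是把前向过程的两端固定在稠密图与对应的图像条件之间,而非从纯噪声出发。下面给出布朗桥式中间状态采样的一个示意;具体的桥过程形式与噪声系数均为演示假设,与论文实现未必一致:

```python
import torch

def brownian_bridge_sample(x0, x1, t, sigma=1.0):
    """在 x0(稠密图的表示)与 x1(条件图像的表示)之间采样中间状态 x_t。
    t ∈ (0,1);边缘分布的均值为两端的线性插值,方差为 sigma^2 * t * (1-t)。"""
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * torch.sqrt(t * (1.0 - t))
    return mean + std * torch.randn_like(x0)

x0 = torch.randn(1, 1, 64, 64)   # 例如:深度图的潜在表示(此处用随机张量代替)
x1 = torch.randn(1, 1, 64, 64)   # 对应 RGB 图像的潜在表示
t = torch.tensor(0.3)
x_t = brownian_bridge_sample(x0, x1, t)
# 训练时,网络以 (x_t, t, 图像条件) 为输入,学习预测去噪方向或端点
print(x_t.shape)
```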
zh

[CV-68] Multimodal Variational Autoencoder: a Barycentric View AAAI2025

【速读】: 该论文旨在解决多模态表示学习(multimodal representation learning)中缺失模态(missing modalities)的问题,特别是在生成模型(generative models)如变分自编码器(VAE)中的应用。其核心目标是学习一种既能捕捉跨模态的共性信息(modality-invariant representation),又能保留各模态特性(modality-specific representation)的表示。现有的多模态VAE方法主要通过专家模型(experts)的视角,如专家乘积(PoE)或专家混合(MoE),来聚合单模态推理分布。本文提出了一种基于重心(barycenter)的通用理论框架,将PoE和MoE视为通过最小化非对称加权KL散度(asymmetric weighted KL divergence)得到的重心特例,并进一步扩展了这一框架,引入了更灵活的散度选择,特别是基于2-Wasserstein距离的Wasserstein重心。该方法在几何上更好地保留了单模态分布的结构,从而更有效地捕捉模态特异性和模态不变性表示。通过在三个多模态基准上的实证研究,验证了该方法的有效性。

链接: https://arxiv.org/abs/2412.20487
作者: Peijie Qiu,Wenhui Zhu,Sayantan Kumar,Xiwen Chen,Xiaotong Sun,Jin Yang,Abolfazl Razi,Yalin Wang,Aristeidis Sotiras
机构: 未知
关键词: Multiple signal modalities, vision and sounds, real-world phenomena, naturally present, present in real-world
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注: AAAI 2025

点击查看摘要

Abstract:Multiple signal modalities, such as vision and sounds, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular the variational autoencoder (VAE), for multimodal representation learning, especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations compared to KL divergence. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.
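论文把 PoE 与 MoE 统一为对单模态推断分布求"重心"的特例。作为参照,下面给出高斯专家乘积(PoE)的闭式聚合(精度相加、均值按精度加权)与 MoE 的等权采样示意;这只是通用的多模态 VAE 聚合写法,与论文提出的 Wasserstein 重心实现无关,维度均为演示假设:

```python
import torch

def product_of_gaussian_experts(mus, logvars):
    """mus/logvars: [M, B, D],M 个模态专家的高斯参数。
    返回 PoE 后验的 (mu, logvar):精度相加、均值按精度加权。"""
    precisions = torch.exp(-logvars)            # 1 / sigma^2
    precision_poe = precisions.sum(dim=0)
    mu_poe = (mus * precisions).sum(dim=0) / precision_poe
    return mu_poe, -torch.log(precision_poe)

def mixture_of_experts_sample(mus, logvars):
    """MoE:等权地随机挑选一个专家,并从中重参数化采样。"""
    M, B, D = mus.shape
    idx = torch.randint(0, M, (B,))
    mu = mus[idx, torch.arange(B)]
    std = torch.exp(0.5 * logvars[idx, torch.arange(B)])
    return mu + std * torch.randn_like(std)

mus = torch.randn(2, 4, 8)       # 两个模态、batch=4、潜变量维度 8
logvars = torch.zeros(2, 4, 8)
print(product_of_gaussian_experts(mus, logvars)[0].shape)   # torch.Size([4, 8])
print(mixture_of_experts_sample(mus, logvars).shape)        # torch.Size([4, 8])
```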
zh

[CV-69] MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation

【速读】: 该论文旨在解决当前基于相机-LiDAR融合的3D语义占据预测(3D semantic occupancy prediction)方法中存在的两个主要问题:一是计算资源在所有体素(voxel)上均匀分配导致的效率低下,二是对遮挡区域处理不足导致的精度下降。为解决这些问题,论文提出了MR-Occ方法,其核心包括三个关键组件:层次化体素特征优化(Hierarchical Voxel Feature Refinement, HVFR)、多尺度占据解码器(Multi-scale Occupancy Decoder, MOD)和像素到体素融合网络(Pixel to Voxel Fusion Network, PVF-Net)。HVFR通过增强关键体素的特征来提升性能并降低计算成本;MOD引入“遮挡”类别以更好地处理传感器视野中被遮挡的区域,从而提高精度;PVF-Net则利用稠密化的LiDAR特征,通过可变形注意力机制有效融合相机和LiDAR数据。实验结果表明,MR-Occ在nuScenes-Occupancy和SemanticKITTI数据集上均取得了最先进的性能,验证了其有效性和泛化能力。

链接: https://arxiv.org/abs/2412.20480
作者: Minjae Seong,Jisong Kim,Geonho Bang,Hawook Jeong,Jun Won Choi
机构: 未知
关键词: semantic occupancy prediction, perception is essential, autonomous driving, semantic occupancy, essential for understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, 9 tables

点击查看摘要

Abstract:Accurate 3D perception is essential for understanding the environment in autonomous driving. Recent advancements in 3D semantic occupancy prediction have leveraged camera-LiDAR fusion to improve robustness and accuracy. However, current methods allocate computational resources uniformly across all voxels, leading to inefficiency, and they also fail to adequately address occlusions, resulting in reduced accuracy in challenging scenarios. We propose MR-Occ, a novel approach for camera-LiDAR fusion-based 3D semantic occupancy prediction, addressing these challenges through three key components: Hierarchical Voxel Feature Refinement (HVFR), Multi-scale Occupancy Decoder (MOD), and Pixel to Voxel Fusion Network (PVF-Net). HVFR improves performance by enhancing features for critical voxels, reducing computational cost. MOD introduces an 'occluded' class to better handle regions obscured from sensor view, improving accuracy. PVF-Net leverages densified LiDAR features to effectively fuse camera and LiDAR data through a deformable attention mechanism. Extensive experiments demonstrate that MR-Occ achieves state-of-the-art performance on the nuScenes-Occupancy dataset, surpassing previous approaches by +5.2% in IoU and +5.3% in mIoU while using fewer parameters and FLOPs. Moreover, MR-Occ demonstrates superior performance on the SemanticKITTI dataset, further validating its effectiveness and generalizability across diverse 3D semantic occupancy benchmarks.
zh

[CV-70] oward Scene Graph and Layout Guided Complex 3D Scene Generation

【速读】: 该论文旨在解决复杂3D场景生成中的两个关键问题:一是现有方法在处理多对象间复杂关系时的局限性,二是基于分数蒸馏采样(SDS)的方法在操纵具有特定交互的多对象时的约束。为解决这些问题,论文提出了一种名为“场景图和布局引导的3D场景生成”(GraLa3D)的新框架。该框架的核心在于利用大语言模型(LLM)将文本提示描述的场景建模为带有布局边界框信息的场景图,并独特地构建了包含单对象节点和复合超节点的场景图。此外,GraLa3D不仅将3D生成约束在期望的布局内,还通过建模超节点内对象间的交互,缓解了这些节点内对象间的外观泄漏问题。实验结果表明,GraLa3D能够有效克服上述限制,生成与文本提示高度一致的复杂3D场景。

链接: https://arxiv.org/abs/2412.20473
作者: Yu-Hsiang Huang,Wei Wang,Sheng-Yu Huang,Yu-Chiang Frank Wang
机构: 未知
关键词: shown impressive results, Recent advancements, advancements in object-centric, impressive results, shown impressive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 12 figures

点击查看摘要

Abstract:Recent advancements in object-centric text-to-3D generation have shown impressive results. However, generating complex 3D scenes remains an open challenge due to the intricate relations between objects. Moreover, existing methods are largely based on score distillation sampling (SDS), which constrains the ability to manipulate multiple objects with specific interactions. Addressing these critical yet underexplored issues, we present a novel framework of Scene Graph and Layout Guided 3D Scene Generation (GraLa3D). Given a text prompt describing a complex 3D scene, GraLa3D utilizes an LLM to model the scene using a scene graph representation with layout bounding box information. GraLa3D uniquely constructs the scene graph with single-object nodes and composite super-nodes. In addition to constraining 3D generation within the desirable layout, a major contribution lies in the modeling of interactions between objects in a super-node, while alleviating appearance leakage across objects within such nodes. Our experiments confirm that GraLa3D overcomes the above limitations and generates complex 3D scenes closely aligned with text prompts.
zh

[CV-71] JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling

【速读】: 该论文旨在解决3D人体生成建模中现有方法难以同时实现表达性和语义可解释性的问题。现有方法在设计紧凑的潜在表示时,往往无法兼顾这两点。为此,论文提出了JADE框架,其核心在于引入了一种关节感知的潜在表示(joint-aware latent representation),将人体分解为骨架结构(由关节位置建模)和局部表面几何(由每个关节附着的特征描述)。这种解耦的潜在空间设计不仅实现了几何和语义的可解释性,还为用户提供了灵活的操控性。此外,为了在提出的分解框架下生成连贯且合理的人体形状,论文还提出了一种级联管道(cascaded pipeline),分别使用两个扩散模型(diffusion models)来建模骨架结构和局部表面几何的分布。通过在公开数据集上的大量实验,JADE框架在自动编码重建精度、编辑可控性和生成质量等多个任务中均表现出优于现有方法的有效性。

链接: https://arxiv.org/abs/2412.20470
作者: Haorui Ji,Rong Wang,Taojun Lin,Hongdong Li
机构: 未知
关键词: computer vision, studied extensively, extensively in computer, Generative modeling, local surface geometries
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative modeling of 3D human bodies has been studied extensively in computer vision. The core is to design a compact latent representation that is both expressive and semantically interpretable, yet existing approaches struggle to achieve both requirements. In this work, we introduce JADE, a generative framework that learns the variations of human shapes with fine-grained control. Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures, modeled by joint positions, and local surface geometries, characterized by features attached to each joint. This disentangled latent space design enables geometric and semantic interpretation, providing users with flexible controllability. To generate coherent and plausible human shapes under our proposed decomposition, we also present a cascaded pipeline where two diffusion models are employed to model the distribution of skeleton structures and local surface geometries respectively. Extensive experiments are conducted on public datasets, where we demonstrate the effectiveness of the JADE framework in multiple tasks in terms of autoencoding reconstruction accuracy, editing controllability and generation quality compared with existing methods.
zh

[CV-72] Single-image reflection removal via self-supervised diffusion models

【速读】: 该论文旨在解决通过透明表面拍摄的图像中反射现象导致的视觉质量下降问题,特别是在缺乏成对真实世界数据的情况下,现有的反射去除方法效果受限。论文提出了一种混合方法,结合了循环一致性(cycle-consistency)和去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM),以有效去除单张图像中的反射,且无需成对训练数据。解决方案的关键在于引入了反射去除网络(Reflective Removal Network, RRN),利用DDPM建模分解过程并恢复透射图像,以及反射合成网络(Reflective Synthesis Network, RSN),通过非线性注意力机制重新合成输入图像。实验结果表明,该方法在SIR^2、Flash-Based Reflection Removal (FRR)数据集和新引入的Museum Reflection Removal (MRR)数据集上均表现出色,优于现有的先进方法。

链接: https://arxiv.org/abs/2412.20466
作者: Zhengyang Lu,Weifan Wang,Tianhao Guo,Feng Wang
机构: 未知
关键词: http URL paper, URL paper proposes, paired training data, requiring paired training, denoising diffusion probabilistic
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Reflections often degrade the visual quality of images captured through transparent surfaces, and reflection removal methods suffer from the shortage of paired real-world data. This paper proposes a hybrid approach that combines cycle-consistency with denoising diffusion probabilistic models (DDPM) to effectively remove reflections from single images without requiring paired training data. The method introduces a Reflective Removal Network (RRN) that leverages DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network (RSN) that re-synthesizes the input image using the separated components through a nonlinear attention-based mechanism. Experimental results demonstrate the effectiveness of the proposed method on the SIR^2, Flash-Based Reflection Removal (FRR) Dataset, and a newly introduced Museum Reflection Removal (MRR) dataset, showing superior performance compared to state-of-the-art methods.
zh

[CV-73] Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection CVPR’24

【速读】: 该论文旨在解决弱监督视频异常检测(WS-VAD)中的关键挑战,包括处理模态信息不平衡以及准确区分正常与异常特征的问题。为此,论文提出了一种多模态WS-VAD框架,其核心创新在于引入了跨模态融合适配器(Cross-modal Fusion Adapter, CFA),该机制能够动态选择和增强与视觉模态高度相关的视听特征。此外,论文还提出了双曲洛伦兹图注意力机制(Hyperbolic Lorentzian Graph Attention, HLGAtt),以有效捕捉正常与异常表示之间的层次关系,从而提升特征分离的准确性。通过大量实验,该模型在暴力和裸露检测的基准数据集上取得了最先进的性能。

链接: https://arxiv.org/abs/2412.20455
作者: Ayush Ghadiya,Purbayan Kar,Vishal Chudasama,Pankaj Wasnik
机构: 未知
关键词: weakly supervised video, supervised video anomaly, identify anomaly events, contemporary research direction, video anomaly detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR’24 MULA Workshop

点击查看摘要

Abstract:Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction to identify anomaly events like violence and nudity in videos using only video-level labels. However, this task has substantial challenges, including addressing imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism known as the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby enhancing feature separation accuracy. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.
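CFA 的思想是以视觉模态为查询、动态选择与之高度相关的视听特征。下面用一个标准交叉注意力层加简单门控做概念示意;这并非论文中 CFA 的真实结构,特征维度与门控形式均为演示假设:

```python
import torch
import torch.nn as nn

class CrossModalFusionAdapterSketch(nn.Module):
    """以视觉特征作为 query、音频特征作为 key/value 的交叉注意力融合示意。"""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # 简单门控

    def forward(self, visual, audio):
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        g = self.gate(visual)          # 依据视觉内容决定吸收多少音频信息
        return visual + g * fused

v = torch.randn(2, 32, 256)   # [batch, 视频片段数, 特征维]
a = torch.randn(2, 32, 256)
out = CrossModalFusionAdapterSketch()(v, a)
print(out.shape)              # torch.Size([2, 32, 256])
```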
zh

[CV-74] Image Augmentation Agent for Weakly Supervised Semantic Segmentation

【速读】: 该论文旨在解决弱监督语义分割(Weakly-supervised Semantic Segmentation, WSSS)中由于固定数据集限制导致的性能提升瓶颈问题。现有的WSSS方法主要集中于设计新的网络结构和损失函数以生成更准确的密集标签,但忽视了数据集多样性不足对模型性能的制约。论文提出了一种从数据生成角度增强WSSS的新方法,称为图像增强代理(Image Augmentation Agent, IAA)。IAA的关键在于利用大语言模型(Large Language Models, LLMs)和扩散模型(Diffusion Models)自动生成额外的训练图像,从而为模型提供更丰富的语义信息。具体而言,IAA通过引入提示自优化机制(Prompt Self-refinement Mechanism)解决LLMs生成提示的不稳定性问题,并通过在线过滤器(Online Filter)在扩散生成过程中动态确保生成图像的质量和平衡。实验结果表明,该方法在PASCAL VOC 2012和MS COCO 2014数据集上显著超越了现有的WSSS方法。

链接: https://arxiv.org/abs/2412.20439
作者: Wangyu Wu,Xianglin Qiu,Siqi Song,Zhenhong Chen,Xiaowei Huang,Fei Ma,Jimin Xiao
机构: 未知
关键词: Weakly-supervised semantic segmentation, achieved remarkable progress, Weakly-supervised semantic, achieved remarkable, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse trainable images provide WSSS with richer information and help the model understand more comprehensive semantic patterns. Therefore, in this paper, we introduce a novel approach called Image Augmentation Agent (IAA) which shows that it is possible to enhance WSSS from the data generation perspective. IAA mainly designs an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability in prompt generation by LLMs, we develop a prompt self-refinement mechanism. It allows LLMs to re-evaluate the rationality of generated prompts to produce more coherent prompts. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.
zh

[CV-75] ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos

【速读】: 该论文旨在解决第一人称视角空间视频(egocentric spatial videos)的感知质量评估问题,特别是在扩展现实(eXtended Reality, XR)技术快速发展的背景下,如何通过沉浸式体验(embodied experience)来提升用户的观看体验。为了解决这一问题,论文首次引入了第一人称视角空间视频质量评估数据库(Egocentric Spatial Video Quality Assessment Database, ESVQAD),该数据库包含600个第一人称视角空间视频及其平均意见分数(Mean Opinion Scores, MOSs)。此外,论文提出了一种新颖的多维度双目特征融合模型(ESVQAnet),该模型通过整合双目空间、运动和语义特征来预测感知质量。实验结果表明,ESVQAnet在沉浸式感知质量评估任务中优于16种现有的视频质量评估(VQA)模型,并在传统VQA任务中表现出强大的泛化能力。数据库和代码将在论文发表后公开。

链接: https://arxiv.org/abs/2412.20423
作者: Xilei Zhu,Huiyu Duan,Liu Yang,Yucheng Zhu,Xiongkuo Min,Guangtao Zhai,Patrick Le Callet
机构: 未知
关键词: egocentric spatial videos, eXtended Reality, egocentric spatial, egocentric spatial shooting, spatial videos
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced immersion and engagement for users. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience. However, the corresponding research is still lacking. In this paper, we use the embodied experience to highlight this more immersive experience and study the new problem, i.e., embodied perceptual quality assessment for egocentric spatial videos. Specifically, we introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and their mean opinion scores (MOSs). Furthermore, we propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the perceptual quality. Experimental results demonstrate that the ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task, and exhibits strong generalization capability on traditional VQA tasks. The database and codes will be released upon the publication.
zh

[CV-76] Bringing Objects to Life: 4D generation from 3D objects

【速读】: 该论文旨在解决现有4D内容生成方法在控制生成内容的外观和几何形状方面的局限性。具体而言,论文提出了一种通过文本提示(textual prompts)来引导用户提供的3D对象生成4D动画的方法,以在保持原始对象身份的同时实现自定义动画。解决方案的关键在于将3D网格转换为“静态”4D神经辐射场(NeRF),以保留输入对象的视觉属性,并利用基于文本驱动的图像到视频扩散模型(Image-to-Video diffusion model)进行动画生成。此外,论文引入了增量视角选择协议(incremental viewpoint selection protocol)和掩码分数蒸馏采样损失(masked Score Distillation Sampling loss),以提升运动的真实感并优化相关区域的生成效果。通过评估时间一致性、提示遵循度和视觉保真度,该方法在身份保持(identity preservation)方面显著优于基线方法,并在视觉质量与动态内容之间实现了有效平衡。

链接: https://arxiv.org/abs/2412.20422
作者: Ohad Rahamim,Ori Malca,Dvir Samuel,Gal Chechik
机构: 未知
关键词: Recent advancements, advancements in generative, generative modeling, modeling now enable, enable the creation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.
zh

[CV-77] Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment

【速读】: 该论文旨在解决多模态(multimodal)肝脏肿瘤分割任务中,现有方法依赖于严格对齐的多模态数据的问题,这在现实临床图像中难以实现,尤其是在肝脏肿瘤等边界模糊和弥散区域。论文提出的解决方案Diff4MMLiTS通过四个关键步骤实现:首先对多模态CT图像中的目标器官进行预配准;然后通过扩张标注模态的掩码并进行修复,生成无肿瘤的多模态正常CT图像;接着基于多模态CT特征和随机生成的肿瘤掩码,使用潜在扩散模型合成严格对齐的多模态CT图像;最后训练分割模型,从而避免了对严格对齐多模态数据的依赖。该方案的核心在于通过合成严格对齐的多模态数据,解决了现实临床图像中多模态数据配准困难的问题,显著提升了分割性能。

链接: https://arxiv.org/abs/2412.20418
作者: Shiyun Chen,Li Lin,Pujin Cheng,ZhiCheng Jin,JianJian Chen,HaiDong Zhu,Kenneth K. Y. Wong,Xiaoying Tang
机构: 未知
关键词: diverse perspectives offered, Multimodal, multimodal segmentation methods, demonstrated to enhance, enhance performance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality’s mask, followed by its use in inpainting to obtain multimodal normal CTs without tumors; synthesis of strictly aligned multimodal CTs with tumors using the latent diffusion model based on multimodal CT features and randomly generated tumor masks; and finally, training the segmentation model, thus eliminating the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets demonstrate the superiority of Diff4MMLiTS over other state-of-the-art multimodal segmentation methods.
zh

[CV-78] EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

【速读】: 该论文旨在解决在大规模文本到图像(Text-to-Image, T2I)扩散模型中去除不期望概念(concept erasure)的同时保持其整体生成质量的挑战。这一挑战在如Stable Diffusion (SD) v3和Flux等新兴范式中尤为突出,这些范式结合了流匹配(flow matching)和基于Transformer的架构,限制了现有概念去除技术的可迁移性,这些技术最初是为早期T2I范式(如SD v1.4)设计的。论文提出的解决方案EraseAnything是首个专门针对最新流式T2I框架的概念去除方法。其关键创新在于将概念去除问题建模为双层优化问题,采用基于LoRA(Low-Rank Adaptation)的参数调优和注意力图正则化器来选择性抑制不期望的激活。此外,论文还提出了一种自对比学习策略,确保在去除不期望概念时不会无意中损害无关概念的性能。实验结果表明,EraseAnything成功填补了早期方法在这一新T2I范式中的研究空白,在广泛的概念去除任务中实现了最先进的性能。

链接: https://arxiv.org/abs/2412.20413
作者: Daiheng Gao,Shilin Lu,Shaw Walters,Wenbo Zhou,Jiaming Chu,Jie Zhang,Bang Zhang,Mengxi Jia,Jian Zhao,Zhaoxin Fan,Weiming Zhang
机构: 未知
关键词: generative quality remains, open challenge, models while maintaining, generative quality, quality remains
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 18 figures

点击查看摘要

Abstract:Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.
zh

[CV-79] Open-Sora: Democratizing Efficient Video Production for All

【速读】: 该论文旨在解决人工视觉智能(artificial visual intelligence)在生成和模拟现实世界视觉内容方面的滞后问题,尤其是在视频生成领域。尽管在AI语言能力方面取得了显著突破,但视觉智能的发展仍相对落后。为此,论文提出了Open-Sora,一个开源视频生成模型,旨在生成高保真视频内容,支持多种视觉生成任务,包括文本到图像生成、文本到视频生成以及图像到视频生成。解决方案的关键在于引入了空间-时间扩散变换器(Spatial-Temporal Diffusion Transformer, STDiT),该框架通过解耦空间和时间注意力机制,实现了高效的视频生成。此外,论文还采用了高度压缩的3D自编码器(3D autoencoder)来紧凑表示数据,并通过特定的训练策略加速训练过程。通过开源原则,Open-Sora提供了完整的训练、推理和数据准备代码以及模型权重,旨在促进AI内容创作领域的创新、创造力和包容性。

链接: https://arxiv.org/abs/2412.20404
作者: Zangwei Zheng,Xiangyu Peng,Tianji Yang,Chenhui Shen,Shenggui Li,Hongxin Liu,Yukun Zhou,Tianyi Li,Yang You
机构: 未知
关键词: artificial visual intelligence, senses for humans, foundational senses, visual intelligence, Vision and language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far lagging behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, which could generate video content of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact and further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights. All resources are publicly available at: this https URL.
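STDiT 的核心是把时空注意力解耦:先在每帧内做空间注意力,再在相同空间位置的帧序列上做时间注意力。下面是这一分解思路的最小示意,省略了 DiT 中的 AdaLN、条件注入等组件,维度均为演示假设,并非 Open-Sora 的官方实现:

```python
import torch
import torch.nn as nn

class SpatialTemporalBlockSketch(nn.Module):
    """对形状 [B, T, N, C] 的视频 token(T 帧、每帧 N 个 patch)做解耦注意力。"""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        B, T, N, C = x.shape
        # 空间注意力:把 (B*T) 当作 batch,每帧内的 N 个 token 互相注意
        xs = self.norm1(x).reshape(B * T, N, C)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(B, T, N, C)
        # 时间注意力:把 (B*N) 当作 batch,同一空间位置跨 T 帧的 token 互相注意
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(B, N, T, C).permute(0, 2, 1, 3)
        return x

x = torch.randn(2, 8, 16, 128)   # 2 个样本、8 帧、每帧 16 个 patch
print(SpatialTemporalBlockSketch()(x).shape)   # torch.Size([2, 8, 16, 128])
```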
zh

[CV-80] Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning

【速读】: 该论文旨在解决多模态对比学习模型(如CLIP)在面对后门攻击(backdoor attacks)时表现出的显著脆弱性问题。研究发现,CLIP的脆弱性主要源于其对类别无关特征(class-irrelevant features)的过度编码,这削弱了模型对输入扰动的视觉特征抗性,使其更容易捕获后门攻击中插入的触发模式。为解决这一问题,论文提出了一种名为“排斥性视觉提示调优”(Repulsive Visual Prompt Tuning, RVPT)的新型防御方法。RVPT通过设计深度视觉提示调优和特征排斥损失(feature-repelling loss),消除过度的类别无关特征,同时优化交叉熵损失(cross-entropy loss)以保持模型的干净准确率。与现有方法不同,RVPT仅需少量下游干净样本,且仅调优少量参数,显著降低了后门攻击的成功率,并在多个数据集上展现了良好的泛化能力。

链接: https://arxiv.org/abs/2412.20392
作者: Zhifang Zhang,Shuo He,Bingquan Shen,Lei Feng
机构: 未知
关键词: learn high-quality representations, exhibit significant vulnerabilities, Multimodal contrastive learning, contrastive learning models, large-scale image-text datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, yet they exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. In this paper, we disclose that CLIP’s vulnerabilities primarily stem from its excessive encoding of class-irrelevant features, which can compromise the model’s visual feature resistivity to input perturbations, making it more susceptible to capturing the trigger patterns inserted by backdoor attacks. Inspired by this finding, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs specially designed deep visual prompt tuning and feature-repelling loss to eliminate excessive class-irrelevant features while simultaneously optimizing cross-entropy loss to maintain clean accuracy. Unlike existing multimodal backdoor defense methods that typically require the availability of poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot downstream clean samples and only tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27% of the parameters relative to CLIP, yet it significantly outperforms state-of-the-art baselines, reducing the attack success rate from 67.53% to 2.76% against SoTA attacks and effectively generalizing its defensive capabilities across multiple datasets.
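RVPT 只微调少量参数:在视觉编码器的若干层前拼接可学习的 prompt token,并配合特征排斥损失。下面给出"深层视觉提示 + 冻结骨干"这一部分的最小示意;特征排斥损失此处仅用负向的余弦相似度示意,具体损失形式、层数与维度均为演示假设:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepPromptedEncoderSketch(nn.Module):
    """在每个 Transformer 层前拼接 num_prompts 个可学习 token,仅这些 token 参与训练。"""
    def __init__(self, depth=4, dim=128, num_prompts=8, heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)]
        )
        self.prompts = nn.Parameter(torch.randn(depth, num_prompts, dim) * 0.02)
        for p in self.layers.parameters():        # 冻结骨干,仅训练 prompts
            p.requires_grad_(False)

    def forward(self, tokens):
        B = tokens.size(0)
        for i, layer in enumerate(self.layers):
            prompt = self.prompts[i].unsqueeze(0).expand(B, -1, -1)
            x = torch.cat([prompt, tokens], dim=1)
            tokens = layer(x)[:, prompt.size(1):]  # 丢弃 prompt 输出,保留 patch token
        return tokens.mean(dim=1)                  # 池化得到图像特征

def repelling_loss(feat, irrelevant_feat):
    """示意:把图像特征与"类别无关特征"的余弦相似度往低压。"""
    return F.cosine_similarity(feat, irrelevant_feat, dim=-1).mean()

enc = DeepPromptedEncoderSketch()
feats = enc(torch.randn(2, 16, 128))
trainable = sum(p.numel() for p in enc.parameters() if p.requires_grad)
print(feats.shape, trainable)   # 可训练参数只有 prompt token
```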
zh

[CV-81] MetricDepth: Enhancing Monocular Depth Estimation with Deep Metric Learning

【速读】: 该论文旨在解决在单目深度估计(monocular depth estimation)任务中,由于缺乏自然类别定义而难以有效利用深度度量学习(deep metric learning)的问题。为此,论文提出了MetricDepth方法,其关键创新点包括两个方面:首先,设计了基于深度差异的样本识别(differential-based sample identification),通过样本相对于锚点的深度差异来识别不同类型的特征样本,从而为单目深度估计模型中的特征正则化奠定基础;其次,针对单目深度估计中深度标注范围广且连续的特点,提出了多范围策略(multi-range strategy),通过根据深度差异范围对负样本进行进一步区分,并实施多样化的正则化,从而促进锚点特征与其负样本之间的差异化正则化交互。实验结果表明,MetricDepth在多个数据集和模型类型上均表现出显著的性能提升,验证了其有效性。

链接: https://arxiv.org/abs/2412.20390
作者: Chunpu Liu,Guanglei Yang,Wangmeng Zuo,Tianyi Zan
机构: 未知
关键词: monocular depth estimation, Deep metric learning, monocular depth, Deep metric, depth estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep metric learning aims to learn features relying on the consistency or divergence of class labels. However, in monocular depth estimation, the absence of a natural definition of class poses challenges in leveraging deep metric learning. Addressing this gap, this paper introduces MetricDepth, a novel method that integrates deep metric learning to enhance the performance of monocular depth estimation. To overcome the inapplicability of the class-based sample identification in previous deep metric learning methods to the monocular depth estimation task, we design the differential-based sample identification. This innovative approach identifies feature samples as different sample types by their depth differentials relative to the anchor, laying a foundation for feature regularization in monocular depth estimation models. Building upon this advancement, we then address another critical problem caused by the vast range and the continuity of depth annotations in monocular depth estimation. The extensive and continuous annotations lead to diverse differentials of negative samples relative to the anchor feature, representing the varied impact of negative samples during feature regularization. Recognizing the inadequacy of the uniform strategy in previous deep metric learning methods for handling negative samples in the monocular depth estimation task, we propose the multi-range strategy. Through further distinction of negative samples according to depth differential ranges and implementation of diverse regularization, our multi-range strategy facilitates differentiated regularization interactions between the anchor feature and its negative samples. Experiments across various datasets and model types demonstrate the effectiveness and versatility of MetricDepth, confirming its potential for performance enhancement in the monocular depth estimation task.
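"基于深度差异的样本识别"可以这样理解:以某个锚点样本为参照,按其余样本与锚点的深度差把特征划入不同的差异区间,再对不同区间施加不同强度的正则。下面给出这一划分逻辑的极简示意;阈值与区间划分方式均为演示假设,并非论文的具体实现:

```python
import torch

def identify_samples_by_differential(depth, anchor_idx, ranges=(0.1, 0.5, 2.0)):
    """depth: [N] 每个特征样本对应的深度;anchor_idx: 锚点样本下标。
    返回每个样本相对锚点的深度差,以及所属差异区间编号(0 表示最接近锚点)。"""
    diff = (depth - depth[anchor_idx]).abs()
    bins = torch.bucketize(diff, torch.tensor(ranges))   # 按差异大小分桶
    return diff, bins

depth = torch.tensor([2.0, 2.05, 2.4, 3.1, 6.0, 2.02])
diff, bins = identify_samples_by_differential(depth, anchor_idx=0)
print(bins.tolist())   # [0, 0, 1, 2, 3, 0]:不同区间的负样本可施加不同强度的正则
```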
zh

[CV-82] PTQ4VM: Post-Training Quantization for Visual Mamba WACV2025

【速读】: 该论文旨在解决 Visual Mamba 模型在量化(quantization)过程中面临的独特挑战,这些问题包括:1) 令牌间方差(token-wise variance),2) 通道间异常值(channel-wise outliers),以及 3) 激活值的长尾分布(long tail of activations)。为解决这些问题,作者提出了 Post-Training Quantization for Visual Mamba (PTQ4VM) 方法,其核心策略包括:1) 每个令牌的静态量化(Per-Token Static, PTS),以及 2) 平滑尺度和步长的联合学习(Joint Learning of Smoothing Scale and Step Size, JLSS)。PTQ4VM 能够在 15 分钟内将预训练模型转换为量化格式,且在大规模分类和回归任务中表现出色,在 GPU 上实现了高达 1.83 倍的加速,同时精度损失可忽略不计。

链接: https://arxiv.org/abs/2412.20386
作者: Younghyun Cho,Changhun Lee,Seonggon Kim,Eunhyeok Park
机构: 未知
关键词: Visual Mamba, selective space state, space state model, Mamba, Visual Mamba introduces
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at WACV 2025

点击查看摘要

Abstract:Visual Mamba is an approach that extends the selective state space model, Mamba, to vision tasks. It processes image tokens sequentially in a fixed order, accumulating information to generate outputs. Despite its growing popularity for delivering high-quality outputs at a low computational cost across various tasks, Visual Mamba is highly susceptible to quantization, which makes further performance improvements challenging. Our analysis reveals that the fixed token access order in Visual Mamba introduces unique quantization challenges, which we categorize into three main issues: 1) token-wise variance, 2) channel-wise outliers, and 3) a long tail of activations. To address these challenges, we propose Post-Training Quantization for Visual Mamba (PTQ4VM), which introduces two key strategies: Per-Token Static (PTS) quantization and Joint Learning of Smoothing Scale and Step Size (JLSS). To the best of our knowledge, this is the first quantization study on Visual Mamba. PTQ4VM can be applied to various Visual Mamba backbones, converting the pretrained model to a quantized format in under 15 minutes without notable quality degradation. Extensive experiments on large-scale classification and regression tasks demonstrate its effectiveness, achieving up to 1.83x speedup on GPUs with negligible accuracy loss compared to FP16. Our code is available at this https URL.
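Per-Token Static(PTS)量化的直觉是:激活在 token 维度上方差很大,因此为每个 token 位置单独校准一个量化尺度。下面给出对称 INT8 量化-反量化的一个极简示意;校准方式与张量形状均为演示假设,与论文的实现细节无关:

```python
import torch

def calibrate_per_token_scales(calib_acts, qmax=127):
    """calib_acts: [num_batches, B, L, C] 的校准激活;为 L 个 token 位置各求一个尺度。"""
    absmax = calib_acts.abs().amax(dim=(0, 1, 3))   # [L],每个 token 位置的最大幅值
    return absmax / qmax

def quant_dequant_per_token(x, scale, qmax=127):
    """x: [B, L, C];用静态的 per-token 尺度做对称量化再反量化。"""
    q = torch.clamp(torch.round(x / scale[None, :, None]), -qmax, qmax)
    return q * scale[None, :, None]

# 构造 token 间幅值差异明显的模拟激活,体现 token-wise variance
calib = torch.randn(4, 2, 16, 64) * torch.linspace(0.5, 4.0, 16)[None, None, :, None]
scale = calibrate_per_token_scales(calib)
x = torch.randn(2, 16, 64) * torch.linspace(0.5, 4.0, 16)[None, :, None]
x_hat = quant_dequant_per_token(x, scale)
print("mean abs quantization error:", (x - x_hat).abs().mean().item())
```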
zh

[CV-83] Breaking Fine-Grained Classification Barriers with Cost-Free Data in Few-Shot Class-Incremental Learning

【速读】: 该论文旨在解决细粒度分类(fine-grained classification)中数据标注困难、特征和语义高度多样且频繁变化的问题,这些问题导致传统方法在现实场景中效果不佳。论文提出了一种新的学习范式,使模型能够在标准训练阶段之外继续学习,并利用系统运行过程中遇到的免费数据进行优化。解决方案的关键在于设计了一种高效的探索与利用策略(EXP2),该策略在获得最终分类结果之前,根据类别模板探索具有代表性的推理数据样本,并利用这些样本来优化分类器。实验结果表明,EXP2具有普遍的有效性。

链接: https://arxiv.org/abs/2412.20383
作者: Li-Jun Zhao,Zhen-Duo Chen,Zhi-Yuan Xue,Xin Luo,Xin-Shun Xu
机构: 未知
关键词: Current fine-grained classification, Current fine-grained, research mainly concentrates, fine-grained classification research, real-world applications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages

点击查看摘要

Abstract:Current fine-grained classification research mainly concentrates on fine-grained feature learning, but in real-world applications, the bigger issue often lies in the data. Fine-grained data annotation is challenging, and the features and semantics are highly diverse and frequently changing, making traditional methods less effective in real-world scenarios. Although some studies have provided potential solutions to this issue, most are limited to making use of limited supervised information. In this paper, we propose a novel learning paradigm to break barriers in fine-grained classification. It enables the model to learn beyond the standard training phase and benefit from cost-free data encountered during system operation. On this basis, an efficient EXPloring and EXPloiting strategy and method (EXP2) is designed. Specifically, before the final classification results are obtained, representative inference data samples are explored according to class templates and exploited to optimize classifiers. Experimental results demonstrate the general effectiveness of EXP2.
zh

[CV-84] Protege: Learn and Generate Basic Makeup Styles with Generative Adversarial Networks (GANs)

【速读】: 该论文旨在解决现有数字化妆应用中无法自动生成多样化且个性化化妆风格的问题。当前的应用,如化妆推荐引擎和化妆迁移技术,存在用户需投入大量努力、专业知识有限以及化妆选项不足等局限性。为解决这一问题,论文提出了Protégé应用,其关键解决方案是利用生成对抗网络(GANs)来学习和自动生成化妆风格。通过大量实验,Protégé展示了其在学习和创建多样化化妆风格方面的能力,提供了一种便捷且直观的方式,标志着数字化妆技术的重大进步。

链接: https://arxiv.org/abs/2412.20381
作者: Jia Wei Sii,Chee Seng Chan
机构: 未知
关键词: digitally apply makeup, Makeup, social media, longer confined, confined to physical
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Makeup is no longer confined to physical application; people now use mobile apps to digitally apply makeup to their photos, which they then share on social media. However, while this shift has made makeup more accessible, designing diverse makeup styles tailored to individual faces remains a challenge, and this design work currently must still be done manually by humans. Existing systems, such as makeup recommendation engines and makeup transfer techniques, offer limited support for "intuitively" creating innovative makeup styles for different individuals: significant user effort and knowledge are needed, and only limited makeup options are available in-app. Our motivation is to address this challenge by proposing Protégé, a new makeup application that leverages a recent class of generative models, GANs, to learn and automatically generate makeup styles. This is a task that existing makeup applications (i.e., makeup recommendation systems using expert systems, and makeup transfer methods) are unable to perform. Extensive experiments have been conducted to demonstrate the capability of Protégé in learning and creating diverse makeups, providing a convenient and intuitive way to design makeup and marking a significant leap in digital makeup technology!
zh

[CV-85] ri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control AAAI2025

【速读】: 该论文旨在解决当前视频到音频生成(Video-to-audio, V2A)模型在生成音频时缺乏精细控制的问题,特别是在响度变化和多模态条件整合方面的不足。为解决这些限制,论文提出了Tri-Ergon模型,该模型基于扩散(diffusion-based)方法,并结合了文本、听觉和像素级视觉提示,以实现细节丰富且语义准确的音频合成。关键创新点在于引入了相对全音量的响度单位(Loudness Units relative to Full Scale, LUFS)嵌入,使得模型能够精确手动控制各个音频通道的响度变化,从而有效处理视频与音频在现实Foley工作流程中的复杂关联。Tri-Ergon能够生成44.1 kHz高保真立体声音频片段,长度可达60秒,显著优于现有仅生成固定时长单声道音频的V2A方法。

链接: https://arxiv.org/abs/2412.20378
作者: Bingliang Li,Fengyu Yang,Yuxin Mao,Qingwen Ye,Hongkai Chen,Yiran Zhong
机构: 未知
关键词: generation utilizes visual-only, produce realistic sounds, utilizes visual-only video, visual-only video features, generation utilizes
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: AAAI 2025 Accepted

点击查看摘要

Abstract:Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.
zh

[CV-86] FairDiffusion: Enhancing Equity in Latent Diffusion Models via Fair Bayesian Perturbation

【速读】: 该论文旨在解决生成式 AI(Generative AI)在医疗领域的文本到图像合成(text-to-image synthesis)中存在的公平性问题,特别是扩散模型(diffusion models)在不同人口统计子群体(如性别、种族和民族)中生成图像质量不一致的问题。为解决这一关键问题,论文提出了FairDiffusion,一种公平感知的潜在扩散模型(equity-aware latent diffusion model),通过增强图像生成质量和临床特征的语义相关性来提升公平性。此外,论文还设计并构建了FairGenMed,这是首个用于研究医疗生成模型公平性的数据集。通过在外部的HAM10000(皮肤镜图像)和CheXpert(胸部X光片)数据集上的广泛评估,FairDiffusion展示了其在多种医疗成像模态中解决公平性问题的有效性。FairDiffusion和FairGenMed共同推动了公平生成学习的研究,促进了生成式AI在医疗领域的公平应用。

链接: https://arxiv.org/abs/2412.20374
作者: Yan Luo,Muhammad Osama Khan,Congcong Wen,Muhammad Muneeb Afzal,Titus Fidelis Wuermeling,Min Shi,Yu Tian,Yi Fang,Mengyu Wang
机构: 未知
关键词: Recent progress, demonstrated significant utility, Recent, Stable Diffusion model, diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: The data and code are made publicly available at this https URL

点击查看摘要

Abstract:Recent progress in generative AI, especially diffusion models, has demonstrated significant utility in text-to-image synthesis. Particularly in healthcare, these models offer immense potential in generating synthetic datasets and training medical students. However, despite these strong performances, it remains uncertain if the image generation quality is consistent across different demographic subgroups. To address this critical concern, we present the first comprehensive study on the fairness of medical text-to-image diffusion models. Our extensive evaluations of the popular Stable Diffusion model reveal significant disparities across gender, race, and ethnicity. To mitigate these biases, we introduce FairDiffusion, an equity-aware latent diffusion model that enhances fairness in both image generation quality as well as the semantic correlation of clinical features. In addition, we also design and curate FairGenMed, the first dataset for studying the fairness of medical generative models. Complementing this effort, we further evaluate FairDiffusion on two widely-used external medical datasets: HAM10000 (dermatoscopic images) and CheXpert (chest X-rays) to demonstrate FairDiffusion’s effectiveness in addressing fairness concerns across diverse medical imaging modalities. Together, FairDiffusion and FairGenMed significantly advance research in fair generative learning, promoting equitable benefits of generative AI in healthcare.
zh

[CV-87] Differential Evolution Integrated Hybrid Deep Learning Model for Object Detection in Pre-made Dishes

【速读】: 该论文旨在解决预制菜(pre-made dishes)行业中由于食材重叠遮挡、食材相似性以及加工环境光照不足等因素导致的物体检测(object detection)精度低的问题。为解决这一复杂场景下的检测难题,论文提出了一种差分进化集成混合深度学习模型(Differential Evolution Integrated Hybrid Deep Learning, DEIHDL)。其关键解决方案包括:1)分别开发了基于YOLO和Transformer的三个基础模型,以增加检测多样性;2)通过差分进化优化自调节权重集成这三个基础模型;3)采用加权框融合策略(weighted boxes fusion)对集成过程中各基础模型的置信度进行评分。通过这种多模型集成策略,DEIHDL能够在复杂的预制菜场景中实现高精度的物体检测。

链接: https://arxiv.org/abs/2412.20370
作者: Lujia Lv,Di Wu,Yangyi Xia,Jia Wu,Xiaojing Liu,Yi He
机构: 未知
关键词: people living standards, fast-paced working conditions, Object detection, pre-made dishes, living standards
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the continuous improvement of people’s living standards and fast-paced working conditions, pre-made dishes are becoming increasingly popular among families and restaurants due to their advantages of time-saving, convenience, variety, cost-effectiveness, standard quality, etc. Object detection is a key technology for selecting ingredients and evaluating the quality of dishes in the pre-made dishes industry. To date, many object detection approaches have been proposed. However, accurate object detection of pre-made dishes is extremely difficult because of overlapping occlusion of ingredients, similarity of ingredients, and insufficient light in the processing environment. As a result, the recognition scene is relatively complex and thus leads to poor object detection by a single model. To address this issue, this paper proposes a Differential Evolution Integrated Hybrid Deep Learning (DEIHDL) model. The main idea of DEIHDL is three-fold: 1) three YOLO-based and transformer-based base models are developed respectively to increase diversity for detecting objects of pre-made dishes, 2) the three base models are integrated by differential-evolution-optimized self-adjusting weights, and 3) a weighted boxes fusion strategy is employed to score the confidence of the three base models during the integration. As such, DEIHDL combines the complementary strengths of the three base models to achieve accurate object detection in complex pre-made dish scenes. Extensive experiments on real datasets demonstrate that the proposed DEIHDL model significantly outperforms the base models in detecting objects of pre-made dishes.
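"用差分进化为多个基模型学习自调节集成权重"这一思路,可以用 scipy 的 differential_evolution 做如下流程示意:在验证集上搜索使加权集成得分最优的权重。此处用简化的"加权平均置信度 + 准确率"代替真实的加权框融合与检测指标,数据为随机生成,仅作演示:

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
# 模拟三个基模型在 200 个验证样本上的检测置信度,以及真实标签(1 表示目标存在)
confs = rng.uniform(0, 1, size=(3, 200))
labels = (confs.mean(axis=0) + rng.normal(0, 0.2, 200) > 0.5).astype(int)

def neg_accuracy(w):
    w = np.abs(w) / (np.abs(w).sum() + 1e-8)   # 归一化为集成权重
    fused = (w[:, None] * confs).sum(axis=0)   # 加权融合置信度
    pred = (fused > 0.5).astype(int)
    return -(pred == labels).mean()            # 差分进化最小化负准确率

result = differential_evolution(neg_accuracy, bounds=[(0, 1)] * 3, seed=0, maxiter=50)
weights = np.abs(result.x) / (np.abs(result.x).sum() + 1e-8)
print("learned ensemble weights:", weights.round(3))
```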
zh

[CV-88] Exploring the Magnitude-Shape Plot Framework for Anomaly Detection in Crowded Video Scenes

【速读】: 该论文旨在解决拥挤视频场景中的异常检测问题,这对于公共安全至关重要,能够及时识别潜在威胁。研究采用功能数据分析(Functional Data Analysis)框架,重点应用了Magnitude-Shape (MS) Plot方法。解决方案的关键在于使用自编码器(Autoencoders)从无异常的训练数据中学习和重建正常行为模式,从而在正常帧上产生较低的重建误差,而在潜在异常帧上产生较高的重建误差。每帧的重建误差矩阵被视为多元功能数据,通过MS-Plot分析其幅度和形状偏差,从而提高异常检测的准确性。MS-Plot通过评估偏差的幅度和形状,提供了一个统计上有原则且可解释的异常检测框架。该方法在UCSD Ped2和CUHK Avenue两个广泛使用的基准数据集上进行了评估,表现出优于传统单变量功能检测器(如FBPlot、TVDMSS、Extremal Depth和Outliergram)及多种先进方法的性能,展示了基于MS-Plot的框架在拥挤视频场景中有效检测异常的潜力。

链接: https://arxiv.org/abs/2412.20363
作者: Zuzheng Wang,Fouzi Harrou,Ying Sun,Marc G Genton
机构: 未知
关键词: enabling timely identification, Detecting anomalies, Functional Data Analysis, public safety, enabling timely
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注: 21 pages, 4 figures, 10 tables

点击查看摘要

Abstract:Detecting anomalies in crowded video scenes is critical for public safety, enabling timely identification of potential threats. This study explores video anomaly detection within a Functional Data Analysis framework, focusing on the application of the Magnitude-Shape (MS) Plot. Autoencoders are used to learn and reconstruct normal behavioral patterns from anomaly-free training data, resulting in low reconstruction errors for normal frames and higher errors for frames with potential anomalies. The reconstruction error matrix for each frame is treated as multivariate functional data, with the MS-Plot applied to analyze both magnitude and shape deviations, enhancing the accuracy of anomaly detection. Using its capacity to evaluate the magnitude and shape of deviations, the MS-Plot offers a statistically principled and interpretable framework for anomaly detection. The proposed methodology is evaluated on two widely used benchmark datasets, UCSD Ped2 and CUHK Avenue, demonstrating promising performance. It performs better than traditional univariate functional detectors (e.g., FBPlot, TVDMSS, Extremal Depth, and Outliergram) and several state-of-the-art methods. These results highlight the potential of the MS-Plot-based framework for effective anomaly detection in crowded video scenes.
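把每帧的重建误差曲线当作函数型数据后,MS-Plot 关注"幅度偏离"与"形状偏离"两个量。下面给出一个高度简化的近似:用逐点稳健标准化残差的均值与方差分别近似幅度与形状;这并非 Dai & Genton 的方向性离群度原始定义,数据也是随机模拟,仅作概念示意:

```python
import numpy as np

def ms_plot_features(recon_errors):
    """recon_errors: [num_frames, num_points],每帧一条重建误差曲线。
    返回每帧的 (幅度近似, 形状近似)。"""
    median = np.median(recon_errors, axis=0)
    mad = np.median(np.abs(recon_errors - median), axis=0) + 1e-8
    outly = (recon_errors - median) / mad      # 逐点稳健标准化的离群度
    magnitude = outly.mean(axis=1)             # 平均离群度 ≈ 幅度偏离
    shape = outly.var(axis=1)                  # 离群度的波动 ≈ 形状偏离
    return magnitude, shape

rng = np.random.default_rng(0)
normal_frames = rng.normal(0.1, 0.02, size=(95, 256))    # 正常帧:误差低且平稳
abnormal_frames = rng.normal(0.3, 0.1, size=(5, 256))    # 异常帧:误差高且起伏大
errors = np.vstack([normal_frames, abnormal_frames])
mag, shp = ms_plot_features(errors)
print("most anomalous frames:", np.argsort(mag**2 + shp)[-5:])   # 应大多落在最后 5 帧
```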
zh

[CV-89] Deep Learning in Image Classification: Evaluating VGG19s Performance on Complex Visual Data

【速读】: 该论文旨在探索基于VGG19深度卷积神经网络(Deep Convolutional Neural Network, DCNN)的肺炎X光图像自动分类方法,并评估其在肺炎诊断中的应用效果。通过与经典模型如支持向量机(SVM)、极端梯度提升(XGBoost)、多层感知器(MLP)和ResNet50进行比较,研究发现VGG19在准确率(92%)、AUC(0.95)、F1分数(0.90)和召回率(0.87)等多个指标上表现优异,尤其在图像特征提取和分类精度方面优于其他模型。尽管ResNet50在某些指标上表现良好,但在召回率和F1分数上略逊于VGG19。传统机器学习模型SVM和XGBoost在图像分类任务中表现受限,尤其在复杂的医学图像分析任务中表现平平。研究结果表明,深度学习,特别是卷积神经网络,在医学图像分类任务中具有显著优势,尤其在肺炎X光图像分析中,能够提供高效且准确的自动诊断支持。该研究为肺炎的早期检测和自动化诊断系统的开发提供了强有力的技术支持,并为进一步推动自动化医学图像处理技术的应用和发展奠定了基础。

链接: https://arxiv.org/abs/2412.20345
作者: Weijie He,Tong Zhou,Yanlin Xiang,Yang Lin,Jiacheng Hu,Runyuan Bao
机构: 未知
关键词: X-ray images based, image classification tasks, pneumonia X-ray images, automatic classification method, study aims
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study aims to explore the automatic classification method of pneumonia X-ray images based on VGG19 deep convolutional neural network, and evaluate its application effect in pneumonia diagnosis by comparing with classic models such as SVM, XGBoost, MLP, and ResNet50. The experimental results show that VGG19 performs well in multiple indicators such as accuracy (92%), AUC (0.95), F1 score (0.90) and recall rate (0.87), which is better than other comparison models, especially in image feature extraction and classification accuracy. Although ResNet50 performs well in some indicators, it is slightly inferior to VGG19 in recall rate and F1 score. Traditional machine learning models SVM and XGBoost are obviously limited in image classification tasks, especially in complex medical image analysis tasks, and their performance is relatively mediocre. The research results show that deep learning, especially convolutional neural networks, have significant advantages in medical image classification tasks, especially in pneumonia X-ray image analysis, and can provide efficient and accurate automatic diagnosis support. This research provides strong technical support for the early detection of pneumonia and the development of automated diagnosis systems and also lays the foundation for further promoting the application and development of automated medical image processing technology.
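下面给出用 torchvision 预训练 VGG19 做二分类(肺炎/正常)微调的常见写法,仅为流程示意:数据用随机张量代替(实际应来自 DataLoader,输入为 224x224 并做 ImageNet 归一化),学习率等超参数为演示假设,与原文实验设置无关;加载预训练权重时需要联网下载:

```python
import torch
import torch.nn as nn
from torchvision import models

# 加载在 ImageNet 上预训练的 VGG19,并把最后一层分类头换成 2 类输出
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 用随机张量模拟一个 batch 的胸片
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

model.train()
logits = model(images)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("one-step loss:", loss.item())
```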
zh

[CV-90] Contrastive Conditional Alignment based on Label Shift Calibration for Imbalanced Domain Adaptation ICPR2024

【速读】: 该论文旨在解决不平衡领域自适应(Imbalanced Domain Adaptation, IDA)问题,其中协变量偏移(covariate shift)和标签偏移(label shift)同时存在。现有的无监督领域自适应(Unsupervised Domain Adaptation, UDA)方法主要关注协变量偏移,而在IDA场景下,源域中学习的分类器会表现出与目标域不同的决策偏差,导致目标伪标签不可靠,并进一步引发错误类对齐的误差累积。为解决这一问题,论文提出了一种基于标签偏移校准的对比条件对齐方法(Contrastive Conditional Alignment based on Label Shift Calibration, CCA-LSC)。该方法首先通过对比条件对齐解决协变量偏移,学习具有领域不变性和类别区分性的表示,包括领域对抗学习、样本加权移动平均质心对齐和判别特征对齐。随后,通过估计目标域的概率分布,并基于标签偏移指标校准目标样本的分类预测,以确保伪标签与真实目标数据分布更加一致。实验结果表明,该方法在同时存在标签偏移和协变量偏移的基准数据集上优于现有的UDA和IDA方法。

链接: https://arxiv.org/abs/2412.20337
作者: Xiaona Sun,Zhenyu Wu,Zhiqiang Zhan,Yang Ji
机构: 未知
关键词: unsupervised domain adaptation, imbalanced domain adaptation, methods primarily focus, label shift, domain adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICPR 2024

点击查看摘要

Abstract:Many existing unsupervised domain adaptation (UDA) methods primarily focus on covariate shift, limiting their effectiveness in imbalanced domain adaptation (IDA) where both covariate shift and label shift coexist. Recent IDA methods have achieved promising results based on self-training using target pseudo labels. However, under the IDA scenarios, the classifier learned in the source domain will exhibit different decision bias from the target domain. It will potentially make target pseudo labels unreliable, and will further lead to error accumulation with incorrect class alignment. Thus, we propose contrastive conditional alignment based on label shift calibration (CCA-LSC) for IDA, to address both covariate shift and label shift. Initially, our contrastive conditional alignment resolves covariate shift to learn representations with domain invariance and class discriminability, which include domain adversarial learning, sample-weighted moving average centroid alignment and discriminative feature alignment. Subsequently, we estimate the probability distribution of the target domain, and calibrate target sample classification predictions based on label shift metrics to encourage pseudo-labels that are more consistent with the distribution of real target data. Extensive experiments are conducted and demonstrate that our method outperforms existing UDA and IDA methods on benchmarks with both label shift and covariate shift. Our code is available at this https URL.
zh
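
按标签偏移指标校准伪标签的具体公式摘要中并未给出,下面给出一个按类别先验比值重加权的极简 Python 示意(函数名与数值均为笔者假设,并非论文官方实现),仅用于说明"用估计的目标域类别分布校准分类预测"这一思路:

```python
import numpy as np

def calibrate_by_label_shift(probs, source_prior, target_prior, eps=1e-8):
    """按标签偏移比例校准目标域分类概率(示意实现,非论文官方代码)。
    probs:         (N, C) 源域分类器在目标样本上的 softmax 概率
    source_prior:  (C,)   源域类别先验
    target_prior:  (C,)   估计得到的目标域类别先验
    """
    ratio = (target_prior + eps) / (source_prior + eps)   # 各类别的先验比值
    adjusted = probs * ratio                               # 按比值重新加权
    adjusted /= adjusted.sum(axis=1, keepdims=True)        # 重新归一化
    return adjusted

# 用法示意:3 类、源域均衡、目标域严重不平衡(数值为虚构)
probs = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]])
source_prior = np.array([1/3, 1/3, 1/3])
target_prior = np.array([0.6, 0.3, 0.1])
pseudo_labels = calibrate_by_label_shift(probs, source_prior, target_prior).argmax(axis=1)
print(pseudo_labels)
```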

[CV-91] Dual-Level Precision Edges Guided Multi-View Stereo with Accurate Planarization AAAI25

【速读】: 该论文旨在解决多视图立体视觉(Multi-View Stereo, MVS)中低纹理区域重建的难题。传统MVS方法在低纹理区域重建中表现良好,但存在跨越物体边界和感知范围有限等问题,影响了平面模型构建的鲁棒性。为解决这些问题,论文提出了DPE-MVS方法,其关键在于引入了双级精度边缘信息(dual-level precision edge information),包括精细边缘和粗糙边缘,从而增强了平面模型构建的鲁棒性,并提高了低纹理区域的重建精度。此外,通过利用边缘信息,论文改进了传统PatchMatch MVS中的采样策略,提出了自适应补丁大小调整方法,以优化随机和低纹理区域中的匹配成本计算。这些改进使得匹配更加精确和鲁棒,最终在ETH3D和Tanks & Temples基准测试中取得了最先进的性能表现。

链接: https://arxiv.org/abs/2412.20328
作者: Kehua Chen,Zhenlong Yuan,Tianlu Mao,Zhaoqi Wang
机构: 未知
关键词: prominent research focus, plane model construction, multi-view stereo, prominent research, research focus
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI25

点击查看摘要

Abstract:The reconstruction of low-textured areas is a prominent research focus in multi-view stereo (MVS). In recent years, traditional MVS methods have performed exceptionally well in reconstructing low-textured areas by constructing plane models. However, these methods often encounter issues such as crossing object boundaries and limited perception ranges, which undermine the robustness of plane model construction. Building on previous work (APD-MVS), we propose the DPE-MVS method. By introducing dual-level precision edge information, including fine and coarse edges, we enhance the robustness of plane model construction, thereby improving reconstruction accuracy in low-textured areas. Furthermore, by leveraging edge information, we refine the sampling strategy in conventional PatchMatch MVS and propose an adaptive patch size adjustment approach to optimize matching cost calculation in both stochastic and low-textured areas. This additional use of edge information allows for more precise and robust matching. Our method achieves state-of-the-art performance on the ETH3D and Tanks & Temples benchmarks. Notably, our method outperforms all published methods on the ETH3D benchmark.
zh

[CV-92] Motion Transfer-Driven intra-class data augmentation for Finger Vein Recognition

【速读】: 该论文旨在解决指静脉识别(Finger Vein Recognition, FVR)中由于公开数据集规模有限导致的过拟合问题,以及传统数据增强方法无法捕捉真实手指姿态变化所带来的性能提升有限的问题。为解决这一问题,论文提出了一种新颖的运动转移(Motion Transfer, MT)模型,通过模拟实际手指姿态和旋转运动来进行指静脉图像数据增强。该模型的关键在于首先使用关键点检测器提取源图像和驱动图像的关键点及姿态图,然后利用密集运动模块估计运动光流,最后通过图像生成模块生成具有目标姿态的图像。实验结果表明,该运动转移模型能有效提高指静脉识别的准确性。

链接: https://arxiv.org/abs/2412.20327
作者: Xiu-Feng Huang,Lai-Man Po,Wei-Feng Ou
机构: 未知
关键词: secure biometric technique, Finger vein, deep learning-based FVR, vascular bio-information, secure biometric
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 Pages

点击查看摘要

Abstract:Finger vein recognition (FVR) has emerged as a secure biometric technique because of the confidentiality of vascular bio-information. Recently, deep learning-based FVR has gained increased popularity and achieved promising performance. However, the limited size of public vein datasets has caused overfitting issues and greatly limits the recognition performance. Although traditional data augmentation can partially alleviate this data shortage issue, it cannot capture the real finger posture variations due to the rigid label-preserving image transformations, bringing limited performance improvement. To address this issue, we propose a novel motion transfer (MT) model for finger vein image data augmentation via modeling the actual finger posture and rotational movements. The proposed model first utilizes a key point detector to extract the key point and pose map of the source and drive finger vein images. We then utilize a dense motion module to estimate the motion optical flow, which is fed to an image generation module for generating the image with the target pose. Experiments conducted on three public finger vein databases demonstrate that the proposed motion transfer model can effectively improve recognition accuracy. Code is available at: this https URL.
zh

[CV-93] Transformer-Based Contrastive Meta-Learning For Low-Resource Generalizable Activity Recognition

【速读】: 该论文旨在解决人类活动识别(HAR)中由于分布偏移(DS)导致的模型泛化能力不足的问题,特别是在低资源场景下,收集和标注足够的人类活动数据成本高昂,进一步加剧了这一挑战。论文提出的解决方案TACO(Transformer-based Contrastive Meta-learning Approach)通过合成虚拟目标域来训练模型,并显式考虑模型的泛化能力。关键创新点在于利用Transformer的注意力机制提取更具表达力的特征,并在元优化中引入监督对比损失函数,以增强表示学习。实验结果表明,TACO在多种低资源分布偏移场景下显著提升了性能。

链接: https://arxiv.org/abs/2412.20290
作者: Junyao Wang,Mohammad Abdullah Al Faruque
机构: 未知
关键词: human activity recognition, remains challenging due, scenarios remains challenging, activity recognition, distribution shifts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning has been widely adopted for human activity recognition (HAR) while generalizing a trained model across diverse users and scenarios remains challenging due to distribution shifts. The inherent low-resource challenge in HAR, i.e., collecting and labeling adequate human-involved data can be prohibitively costly, further raising the difficulty of tackling DS. We propose TACO, a novel transformer-based contrastive meta-learning approach for generalizable HAR. TACO addresses DS by synthesizing virtual target domains in training with explicit consideration of model generalizability. Additionally, we extract expressive feature with the attention mechanism of Transformer and incorporate the supervised contrastive loss function within our meta-optimization to enhance representation learning. Our evaluation demonstrates that TACO achieves notably better performance across various low-resource DS scenarios.
zh
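
下面给出监督对比损失(supervised contrastive loss)的一个常见 PyTorch 写法作为参考(示意代码,温度参数等为假设值,并非 TACO 的官方实现):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """监督对比损失的简化实现(示意,非 TACO 官方代码)。
    features: (N, D) 已经 L2 归一化的样本表示;labels: (N,) 类别标签。"""
    n = features.size(0)
    device = features.device
    logits = features @ features.t() / temperature
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()   # 数值稳定
    self_mask = 1.0 - torch.eye(n, device=device)                       # 排除与自身的对比
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float() * self_mask
    exp_logits = torch.exp(logits) * self_mask
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    mean_log_prob_pos = (pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_log_prob_pos.mean()

# 用法示意:同一批次中类别相同的样本互为正样本
feats = F.normalize(torch.randn(8, 128), dim=1)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(feats, labels))
```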

[CV-94] Few-shot Algorithm Assurance

【速读】: 该论文旨在解决深度学习模型在图像分类任务中对图像失真(image distortion)的脆弱性问题,特别是确定模型在失真水平下仍能保持高于设定阈值的准确率,这一问题被称为“图像失真下的模型保证”(Model Assurance under Image Distortion)。论文将该问题形式化为一个分类任务,目标是在给定失真水平下,预测模型在失真图像集上的准确率是否高于阈值。解决方案的关键在于提出了一种基于水平集估计(Level Set Estimation, LSE)算法的分类器,利用LSE的均值和方差函数构建分类规则。此外,论文还扩展了该方法,提出在“少量样本”设置下,通过一种新颖的条件变分自编码器(Conditional Variational Autoencoder)模型生成额外的合成图像,以支持模型保证过程。实验结果表明,该方法在五个基准图像数据集上显著优于现有基线方法。

链接: https://arxiv.org/abs/2412.20275
作者: Dang Nguyen,Sunil Gupta
机构: 未知
关键词: deep learning models, deep learning, distortion, model, identify distortion levels
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In image classification tasks, deep learning models are vulnerable to image distortion. For successful deployment, it is important to identify distortion levels under which the model is usable i.e. its accuracy stays above a stipulated threshold. We refer to this problem as Model Assurance under Image Distortion, and formulate it as a classification task. Given a distortion level, our goal is to predict if the model’s accuracy on the set of distorted images is greater than a threshold. We propose a novel classifier based on a Level Set Estimation (LSE) algorithm, which uses the LSE’s mean and variance functions to form the classification rule. We further extend our method to a “few sample” setting where we can only acquire few real images to perform the model assurance process. Our idea is to generate extra synthetic images using a novel Conditional Variational Autoencoder model with two new loss functions. We conduct extensive experiments to show that our classification method significantly outperforms strong baselines on five benchmark image datasets.
zh
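
论文摘要未给出 LSE 分类规则的具体形式,下面用高斯过程回归的均值/方差构造一个保守判定规则作为示意(基于 scikit-learn,阈值、beta 与校准数据均为虚构假设,并非论文官方实现):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# 校准数据:若干失真水平下实测得到的模型准确率(数值为虚构示例)
levels = np.array([[0.0], [0.2], [0.4], [0.6], [0.8]])
accs = np.array([0.95, 0.92, 0.85, 0.70, 0.55])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-3).fit(levels, accs)

def model_usable(level, threshold=0.80, beta=1.96):
    """若准确率的置信下界 mu - beta*sigma 不低于阈值,则判定模型在该失真水平下可用。"""
    mu, sigma = gp.predict(np.array([[level]]), return_std=True)
    return bool(mu[0] - beta * sigma[0] >= threshold)

print(model_usable(0.3), model_usable(0.7))
```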

[CV-95] Election of Collaborators via Reinforcement Learning for Federated Brain Tumor Segmentation

【速读】: 该论文旨在解决在动态联邦学习(Federated Learning, FL)环境中如何优化选择参与协作的节点(collaborators)以提升模型泛化能力的问题。其核心挑战在于如何在保护数据隐私的同时,有效管理分布式数据集中的异常数据点,并实现资源高效的模型训练。论文提出的解决方案是RL-HSimAgg算法,该算法结合了强化学习(Reinforcement Learning, RL)和基于谐波平均的相似性加权聚合(similarity-weighted aggregation, simAgg)方法,以处理异常数据并优化协作节点选择。具体而言,论文采用了多臂赌博机(multi-armed bandit)算法,如Epsilon-greedy (EG) 和上置信界(upper confidence bound, UCB),来平衡探索与利用的权衡,从而提升模型在联邦脑部病变分割任务中的性能。实验结果表明,基于UCB的RL-HSimAgg在增强肿瘤、肿瘤核心和全肿瘤分割的Dice分数上均优于EG方法,证明了其在联邦学习环境中的有效性和鲁棒性。

链接: https://arxiv.org/abs/2412.20253
作者: Muhammad Irfan Khan,Elina Kontio,Suleiman A. Khan,Mojtaba Jafaritadi
机构: 未知
关键词: preserving data privacy, enables collaborative model, enables collaborative, Tumor, data privacy
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training across decentralized datasets while preserving data privacy. However, optimally selecting participating collaborators in dynamic FL environments remains challenging. We present RL-HSimAgg, a novel reinforcement learning (RL) and similarity-weighted aggregation (simAgg) algorithm using harmonic mean to manage outlier data points. This paper proposes applying multi-armed bandit algorithms to improve collaborator selection and model generalization. By balancing exploration-exploitation trade-offs, these RL methods can promote resource-efficient training with diverse datasets. We demonstrate the effectiveness of Epsilon-greedy (EG) and upper confidence bound (UCB) algorithms for federated brain lesion segmentation. In simulation experiments on internal and external validation sets, RL-HSimAgg with UCB collaborator outperformed the EG method across all metrics, achieving higher Dice scores for Enhancing Tumor (0.7334 vs 0.6797), Tumor Core (0.7432 vs 0.6821), and Whole Tumor (0.8252 vs 0.7931) segmentation. Therefore, for the Federated Tumor Segmentation Challenge (FeTS 2024), we consider UCB as our primary client selection approach in federated Glioblastoma lesion segmentation of multi-modal MRIs. In conclusion, our research demonstrates that RL-based collaborator management, e.g. using UCB, can potentially improve model robustness and flexibility in distributed learning environments, particularly in domains like brain tumor segmentation.
zh
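
其中 UCB 协作方选择的核心是"平均回报 + 置信半径"的打分,下面是一个与论文具体设定无关的通用 UCB 选择器示意(类名、回报定义均为笔者假设,非 RL-HSimAgg 官方实现):

```python
import math
import random

class UCBSelector:
    """用 UCB 进行协作方选择的极简示意。"""
    def __init__(self, n_collaborators, c=2.0):
        self.counts = [0] * n_collaborators    # 每个协作方被选中的次数
        self.values = [0.0] * n_collaborators  # 历史平均回报(例如验证集 Dice 的提升)
        self.c = c

    def select(self, k):
        t = sum(self.counts) + 1
        def ucb(i):
            if self.counts[i] == 0:
                return float('inf')            # 未探索过的协作方优先
            return self.values[i] + self.c * math.sqrt(math.log(t) / self.counts[i])
        return sorted(range(len(self.counts)), key=ucb, reverse=True)[:k]

    def update(self, i, reward):
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]   # 增量均值

# 用法示意:每轮联邦训练选 3 个协作方并用本轮回报更新
selector = UCBSelector(n_collaborators=10)
for rnd in range(5):
    for i in selector.select(k=3):
        selector.update(i, reward=random.random())   # 真实场景中应为分割指标的提升
```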

[CV-96] Recommender Engine Driven Client Selection in Federated Brain Tumor Segmentation

【速读】: 该论文旨在解决联邦学习(Federated Learning, FL)在联邦肿瘤分割挑战(FeTS 2024)中的客户选择问题,以优化FL过程的效率和精度。解决方案的关键在于引入了一个基于非负矩阵分解(Non-negative Matrix Factorization, NNMF)的推荐引擎框架,并结合了基于内容和协同过滤的混合聚合方法。该方法通过智能分析历史表现、专业知识和其他相关指标,识别出最合适的协作伙伴,从而有效应对冷启动问题(Cold Start Problem),即新加入或不活跃的协作伙伴因数据有限而带来的选择挑战。此外,论文还提出了谐波相似度权重聚合(Harmonic Similarity Weight Aggregation, HSimAgg)用于模型参数的自适应聚合。通过在多参数磁共振成像(mpMRI)数据集上的实验验证,该方法在增强肿瘤(ET)、肿瘤核心(TC)和全肿瘤(WT)分割任务中均取得了较高的Dice分数,证明了选择与特定任务(如脑肿瘤分割)专业知识匹配的协作伙伴能够显著提升FL网络的有效性。

链接: https://arxiv.org/abs/2412.20250
作者: Muhammad Irfan Khan,Elina Kontio,Suleiman A. Khan,Mojtaba Jafaritadi
机构: 未知
关键词: client selection protocol, selection protocol designed, efficient client selection, Federated Tumor Segmentation, study presents
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study presents a robust and efficient client selection protocol designed to optimize the Federated Learning (FL) process for the Federated Tumor Segmentation Challenge (FeTS 2024). In the evolving landscape of FL, the judicious selection of collaborators emerges as a critical determinant for the success and efficiency of collective learning endeavors, particularly in domains requiring high precision. This work introduces a recommender engine framework based on non-negative matrix factorization (NNMF) and a hybrid aggregation approach that blends content-based and collaborative filtering. This method intelligently analyzes historical performance, expertise, and other relevant metrics to identify the most suitable collaborators. This approach not only addresses the cold start problem where new or inactive collaborators pose selection challenges due to limited data but also significantly improves the precision and efficiency of the FL process. Additionally, we propose harmonic similarity weight aggregation (HSimAgg) for adaptive aggregation of model parameters. We utilized a dataset comprising 1,251 multi-parametric magnetic resonance imaging (mpMRI) scans from individuals diagnosed with glioblastoma (GBM) for training purposes and an additional 219 mpMRI scans for external evaluations. Our federated tumor segmentation approach achieved dice scores of 0.7298, 0.7424, and 0.8218 for enhancing tumor (ET), tumor core (TC), and whole tumor (WT) segmentation tasks respectively on the external validation set. In conclusion, this research demonstrates that selecting collaborators with expertise aligned to specific tasks, like brain tumor segmentation, improves the effectiveness of FL networks.
zh
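
基于 NNMF 的推荐引擎可以理解为:用历史表现矩阵的低秩分解补全缺失项,再按重构得分排序候选协作方。下面是一个基于 scikit-learn 的极简示意(矩阵数值为虚构,非论文官方实现):

```python
import numpy as np
from sklearn.decomposition import NMF

# 行:联邦任务/轮次,列:候选协作方;元素:历史表现得分(0 表示缺失,对应冷启动)
R = np.array([
    [0.82, 0.00, 0.75, 0.60],
    [0.80, 0.70, 0.00, 0.58],
    [0.00, 0.72, 0.74, 0.55],
])

nmf = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
W = nmf.fit_transform(R)      # 任务的隐因子
H = nmf.components_           # 协作方的隐因子
R_hat = W @ H                 # 重构得分,可为缺失项(冷启动)给出预测

def recommend(task_idx, k=2):
    """为某个任务推荐预测得分最高的 k 个协作方。"""
    return np.argsort(-R_hat[task_idx])[:k].tolist()

print(recommend(0))
```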

[CV-97] Plastic Waste Classification Using Deep Learning: Insights from the WaDaBa Dataset

【速读】: 该论文旨在解决塑料废弃物管理中的分类和回收问题,特别是在塑料使用量不断增加的情况下,如何有效处理塑料废弃物的挑战。论文的核心解决方案是采用深度学习技术,特别是卷积神经网络(CNNs)和YOLO(You Only Look Once)等目标检测模型,利用WaDaBa数据集进行塑料废弃物的分类。研究表明,YOLO-11m模型在准确率(98.03%)和mAP50(0.990)方面表现最佳,而YOLO-11n在mAP50(0.992)上表现更为突出。尽管轻量级模型如YOLO-10n训练速度更快,但准确率较低;MobileNet V2在准确率(97.12%)上表现出色,但在目标检测方面表现不足。论文强调了深度学习模型在塑料废弃物分类中的潜力,尤其是YOLO模型在平衡准确率和计算效率方面的优势,为废弃物管理和回收提供了可扩展且具有影响力的解决方案。

链接: https://arxiv.org/abs/2412.20232
作者: Suman Kunwar,Banji Raphael Owabumoye,Abayomi Simeon Alade
机构: 未知
关键词: managing plastic waste, classification and recycling, YOLO, managing plastic, potential of deep
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 12 figures

点击查看摘要

Abstract:With the increasing use of plastic, the challenges associated with managing plastic waste have become more pressing, emphasizing the need for effective solutions for classification and recycling. This study explores the potential of deep learning, focusing on convolutional neural networks (CNNs) and object detection models like YOLO (You Only Look Once), to tackle this issue using the WaDaBa dataset. The study shows that YOLO-11m achieved the highest accuracy (98.03%) and mAP50 (0.990), with YOLO-11n performing similarly but achieving the highest mAP50 (0.992). Lightweight models like YOLO-10n trained faster but with lower accuracy, whereas MobileNet V2 showed impressive performance (97.12% accuracy) but fell short in object detection. Our study highlights the potential of deep learning models in transforming how we classify plastic waste, with YOLO models proving to be the most effective. By balancing accuracy and computational efficiency, these models can help to create scalable, impactful solutions in waste management and recycling.
zh

[CV-98] Towards Real-Time 2D Mapping: Harnessing Drones, AI, and Computer Vision for Advanced Insights

【速读】: 该论文旨在解决航空航天和国防领域中实时二维地图生成(Real-time 2D mapping)所面临的挑战,特别是在处理速度、精度和地形适应性方面的问题。论文提出了一种集成了无人机影像、机器学习(Machine Learning)和计算机视觉(Computer Vision)的先进地图生成系统。该系统的关键解决方案包括:通过自动化特征检测(feature detection)、图像匹配(image matching)和拼接(stitching),利用ORB(Oriented FAST and Rotated BRIEF)进行特征检测,FLANN(Fast Library for Approximate Nearest Neighbors)实现精确的关键点匹配,并通过单应性变换(homography transformations)对齐重叠图像,从而生成无缝、高分辨率的地图。系统采用Python实现,并借助OpenCV进行图像处理,NumPy进行高效计算,以及并行处理技术,确保在动态环境中实现实时更新。该系统在多种光照条件和复杂地形下表现出色,显著提升了传统方法的速度和精度,增强了态势感知和决策能力。

链接: https://arxiv.org/abs/2412.20210
作者: Bharath Kumar Agnur
机构: 未知
关键词: timely geographic data, target tracking, vital tool, accurate and timely, timely geographic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 7 figures, 1 table

点击查看摘要

Abstract:Real-time 2D mapping is a vital tool in aerospace and defense, where accurate and timely geographic data is essential for operations like surveillance, reconnaissance, and target tracking. This project introduces a cutting-edge mapping system that integrates drone imagery with machine learning and computer vision to address challenges in processing speed, accuracy, and adaptability to diverse terrains. By automating feature detection, image matching, and stitching, the system generates seamless, high-resolution maps with minimal delay, providing strategic advantages in defense operations. Implemented in Python, the system leverages OpenCV for image processing, NumPy for efficient computations, and this http URL for parallel processing. ORB (Oriented FAST and Rotated BRIEF) handles feature detection, while FLANN (Fast Library for Approximate Nearest Neighbors) ensures precise keypoint matching. Homography transformations align overlapping images, creating distortion-free maps in real time. This automated approach eliminates manual intervention, enabling live updates critical in dynamic environments. Designed for adaptability, the system performs well under varying light conditions and rugged terrains, making it highly effective in aerospace and defense scenarios. Testing demonstrates significant improvements in speed and accuracy compared to traditional methods, enhancing situational awareness and decision-making. This scalable solution leverages advanced technologies to deliver reliable, actionable data for mission-critical operations.
zh
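
摘要中提到的 ORB 特征检测、FLANN 匹配与单应性对齐都是 OpenCV 的标准组件,下面给出一个两帧拼接的最小示意(参数取常用经验值,属于笔者的简化实现,并非该系统的源码):

```python
import cv2
import numpy as np

def stitch_pair(img1, img2):
    """ORB 特征 + FLANN(LSH)匹配 + 单应性对齐的两帧拼接示意。"""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # ORB 是二值描述子,FLANN 需使用 LSH 索引(algorithm=6)
    flann = cv2.FlannBasedMatcher(
        dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1),
        dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)

    # Lowe 比值检验筛除不可靠匹配
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 4:
        raise RuntimeError("匹配点不足,无法估计单应性")

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = img2.shape[:2]
    return cv2.warpPerspective(img1, H, (2 * w, h))   # 把 img1 变换到 img2 的坐标系

# 用法示意:img1、img2 为有重叠区域的相邻无人机影像(np.uint8 数组)
```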

[CV-99] Towards Visual Grounding: A Survey

【速读】: 该论文旨在系统梳理和总结视觉定位(Visual Grounding)领域的发展历程、最新进展及其面临的挑战。视觉定位任务的核心是基于给定的文本描述,在图像中定位特定区域,以模拟人类在社交对话中的多模态理解能力。自2021年以来,该领域出现了诸如基于预训练的视觉定位(grounded pre-training)、多模态大语言模型的视觉定位(grounding multimodal LLMs)、广义视觉定位(generalized visual grounding)以及千兆像素级视觉定位(giga-pixel grounding)等新概念,这些进展带来了诸多新挑战。论文通过系统追踪和总结这些进展,精确定义了各种视觉定位的设置,以规范未来研究并确保公平比较。此外,论文还深入探讨了多个高级主题,并强调了视觉定位的广泛应用。最后,论文提出了未来研究的有价值方向,为后续研究者提供了启发。通过提取共同的技术细节,该综述涵盖了近十年来各子领域的代表性工作,为初学者和经验丰富的研究者提供了理解关键概念和跟踪最新研究进展的宝贵资源。

链接: https://arxiv.org/abs/2412.20206
作者: Linhui Xiao,Xiaoshan Yang,Xiangyuan Lan,Yaowei Wang,Changsheng Xu
机构: 未知
关键词: Referring Expression Comprehension, Referring Expression, Expression Comprehension, Visual Grounding, Phrase Grounding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TPAMI under review. We keep tracing related works at this https URL

点击查看摘要

Abstract:Visual Grounding is also known as Referring Expression Comprehension and Phrase Grounding. It involves localizing a natural number of specific regions within an image based on a given textual description. The objective of this task is to emulate the prevalent referential relationships in social conversations, equipping machines with human-like multimodal comprehension capabilities. Consequently, it has extensive applications in various domains. However, since 2021, visual grounding has witnessed significant advancements, with emerging new concepts such as grounded pre-training, grounding multimodal LLMs, generalized visual grounding, and giga-pixel grounding, which have brought numerous new challenges. In this survey, we initially examine the developmental history of visual grounding and provide an overview of essential background knowledge. We systematically track and summarize the advancements and meticulously organize the various settings in visual grounding, thereby establishing precise definitions of these settings to standardize future research and ensure a fair comparison. Additionally, we delve into several advanced topics and highlight numerous applications of visual grounding. Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers. By extracting common technical details, this survey encompasses the representative works in each subtopic over the past decade. To the best of our knowledge, this paper presents the most comprehensive overview currently available in the field of grounding. This survey is designed to be suitable for both beginners and experienced researchers, serving as an invaluable resource for understanding key concepts and tracking the latest research developments. We keep tracing related works at this https URL.
zh

[CV-100] Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems

【速读】: 该论文旨在解决智能城市监控中弱监督异常检测(Weakly Supervised Monitoring Anomaly Detection, WSMAD)的实时性和可解释性问题。现有方法由于复杂度高,难以满足边缘设备的实时性和可解释性需求。为此,论文提出了TCVADS(Two-stage Cross-modal Video Anomaly Detection System),其关键解决方案包括两阶段处理:粗粒度快速分类和细粒度详细分析。在第一阶段,TCVADS通过时间序列分析模块(教师模型)提取视频帧特征,并通过知识蒸馏(Knowledge Distillation)将信息传递给简化的卷积网络(学生模型)进行二分类。当检测到异常时,触发第二阶段,采用细粒度多分类模型,利用CLIP进行跨模态对比学习(Cross-modal Contrastive Learning),结合文本和图像,通过设计的三重文本关系增强可解释性并实现精细分类。实验结果表明,TCVADS在模型性能、检测效率和可解释性方面显著优于现有方法,为智能城市监控应用提供了重要贡献。

链接: https://arxiv.org/abs/2412.20201
作者: Wen-Dong Jiang,Chih-Yung Chang,Hsiang-Chuan Chang,Ji-Yuan Chen,Diptendu Sinha Roy
机构: 未知
关键词: Weakly Supervised Monitoring, Weakly Supervised, utilizes weak supervision, Supervised Monitoring Anomaly, weak supervision learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE TETC-CS (Under review)

点击查看摘要

Abstract:Weakly Supervised Monitoring Anomaly Detection (WSMAD) utilizes weak supervision learning to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices. TCVADS operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and inputs them into a time series analysis module, which acts as the teacher model. Insights are then transferred via knowledge distillation to a simplified convolutional network (student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.
zh
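
第一阶段"教师到学生"的知识蒸馏通常采用软标签 KL 散度加硬标签交叉熵的组合,下面是该思路的一个通用 PyTorch 示意(温度 T、权重 alpha 为假设超参数,非论文官方实现):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """经典的软标签蒸馏损失(示意实现)。
    student_logits: 学生(轻量卷积网络)的二分类 logits, (N, 2)
    teacher_logits: 教师(时序分析模块)的 logits, (N, 2)
    """
    # 软标签项:高温 softmax 之间的 KL 散度,乘 T^2 保持梯度量级
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # 硬标签项:常规交叉熵
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# 用法示意
s = torch.randn(8, 2)
t = torch.randn(8, 2)
y = torch.randint(0, 2, (8,))
print(distillation_loss(s, t, y))
```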

[CV-101] Mining Platoon Patterns from Traffic Videos VLDB

【速读】: 该论文旨在解决从城市规模的视频数据中发现共移动模式(co-movement patterns)时,由于物体遮挡或车辆误匹配导致的轨迹恢复不准确和模式缺失问题。传统方法假设从视频中恢复的轨迹是准确的,并且要求共移动模式中的物体必须在共同路线上连续出现在多个摄像头中,这在实际应用中容易导致模式丢失。为解决这一问题,论文提出了一种宽松的共移动模式定义,取消了共同路线上的连续性要求,并允许组内物体在部分摄像头中未被捕获。此外,论文开发了一种名为MaxGrowth的新型枚举框架,该框架通过将共移动模式视为等效的聚类序列,逐步增加序列长度来枚举候选模式,避免了生成任何假阳性结果,并且无需进行候选模式的验证。MaxGrowth还引入了两种有效的剪枝规则,以高效过滤非最大模式。实验结果表明,MaxGrowth在运行速度上比基线算法快两个数量级,并且在轨迹恢复算法不完美的情况下,仍能在真实视频数据集中表现出高准确性。

链接: https://arxiv.org/abs/2412.20177
作者: Yijun Bei,Teng Ma,Dongxiang Zhang,Sai Wu,Kian-Lee Tan,Gang Chen
机构: 未知
关键词: Discovering co-movement patterns, Discovering co-movement, attractive topic, urban-scale video data, video data sources
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: This submission is an extended technical report version of a paper currently under revision for the VLDB conference. In accordance with PVLDB guidelines, some sentences in the paper are highlighted in blue to indicate changes made during the revision process, specifically for the benefit of VLDB reviewers

点击查看摘要

Abstract:Discovering co-movement patterns from urban-scale video data sources has emerged as an attractive topic. This task aims to identify groups of objects that travel together along a common route, which offers effective support for government agencies in enhancing smart city management. However, the previous work has made a strong assumption on the accuracy of recovered trajectories from videos and their co-movement pattern definition requires the group of objects to appear across consecutive cameras along the common route. In practice, this often leads to missing patterns if a vehicle is not correctly identified from a certain camera due to object occlusion or vehicle mis-matching. To address this challenge, we propose a relaxed definition of co-movement patterns from video data, which removes the consecutiveness requirement in the common route and accommodates a certain number of missing captured cameras for objects within the group. Moreover, a novel enumeration framework called MaxGrowth is developed to efficiently retrieve the relaxed patterns. Unlike previous filter-and-refine frameworks comprising both candidate enumeration and subsequent candidate verification procedures, MaxGrowth incurs no verification cost for the candidate patterns. It treats the co-movement pattern as an equivalent sequence of clusters, enumerating candidates with increasing sequence length while avoiding the generation of any false positives. Additionally, we also propose two effective pruning rules to efficiently filter the non-maximal patterns. Extensive experiments are conducted to validate the efficiency of MaxGrowth and the quality of its generated co-movement patterns. Our MaxGrowth runs up to two orders of magnitude faster than the baseline algorithm. It also demonstrates high accuracy in real video dataset when the trajectory recovery algorithm is not perfect.
zh

[CV-102] On dataset transferability in medical image classification

【速读】: 该论文旨在解决现有迁移性估计方法在医学图像分类中的不足。现有方法主要关注预训练源模型特征对目标数据集的适用性,这可能导致不切实际的预测,例如认为目标数据集是其自身的最佳源。为解决这一问题,论文提出了一种新的迁移性度量方法,该方法结合了特征质量和梯度信息,以评估源模型特征对目标任务的适用性和适应性。该方法的创新之处在于通过引入梯度信息来更全面地评估特征的可迁移性,从而在医学图像分类和跨领域迁移场景中取得了优于现有方法的效果。此外,论文还提供了影响医学图像分类迁移性能的因素分析,以及从自然图像到医学图像的跨领域迁移动态,并提供了真实迁移性能基准测试结果,以促进医学图像分类迁移性估计的进一步研究。

链接: https://arxiv.org/abs/2412.20172
作者: Dovile Juodelyte,Enzo Ferrante,Yucheng Lu,Prabhant Singh,Joaquin Vanschoren,Veronika Cheplygina
机构: 未知
关键词: Current transferability estimation, Current transferability, medical image classification, medical image, Current
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current transferability estimation methods designed for natural image datasets are often suboptimal in medical image classification. These methods primarily focus on estimating the suitability of pre-trained source model features for a target dataset, which can lead to unrealistic predictions, such as suggesting that the target dataset is the best source for itself. To address this, we propose a novel transferability metric that combines feature quality with gradients to evaluate both the suitability and adaptability of source model features for target tasks. We evaluate our approach in two new scenarios: source dataset transferability for medical image classification and cross-domain transferability. Our results show that our method outperforms existing transferability metrics in both settings. We also provide insight into the factors influencing transfer performance in medical image classification, as well as the dynamics of cross-domain transfer from natural to medical images. Additionally, we provide ground-truth transfer performance benchmarking results to encourage further research into transferability estimation for medical image classification. Our code and experiments are available at this https URL.
zh

[CV-103] Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation

【速读】: 该论文旨在解决3D卷积神经网络(3D CNNs)在处理长程时间依赖性(long-range temporal dependencies)方面的局限性。尽管Transformer在空间维度上有效解决了长程依赖问题,但在时间维度上仍存在不足,且Transformer的引入会导致参数量大幅增加和处理速度下降。为解决这些问题,论文提出了一种简单而有效的模块——地理掩码卷积门控循环单元(Geographically Masked Convolutional Gated Recurrent Unit, Geo-ConvGRU),专门用于鸟瞰图(Bird’s-Eye View)分割任务。该方案的关键在于用ConvGRU替代3D CNN的时间模块,以增强网络处理时间依赖性的能力,并通过引入地理掩码来抑制时间模块引入的噪声。实验结果表明,Geo-ConvGRU在NuScenes数据集上实现了最先进的性能。

链接: https://arxiv.org/abs/2412.20171
作者: Guanglei Yang,Yongqiang Zhang,Wanlong Li,Yu Tang,Weize Shang,Feng Wen,Hongbo Zhang,Mingli Ding
机构: 未知
关键词: computer vision tasks, Convolutional Neural Networks, dependencies explicitly due, Convolutional Gated Recurrent, Gated Recurrent Unit
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) have significantly impacted various computer vision tasks, however, they inherently struggle to model long-range dependencies explicitly due to the localized nature of convolution operations. Although Transformers have addressed limitations in long-range dependencies for the spatial dimension, the temporal dimension remains underexplored. In this paper, we first highlight that 3D CNNs exhibit limitations in capturing long-range temporal dependencies. Though Transformers mitigate spatial dimension issues, they result in a considerable increase in parameter and processing speed reduction. To overcome these challenges, we introduce a simple yet effective module, Geographically Masked Convolutional Gated Recurrent Unit (Geo-ConvGRU), tailored for Bird’s-Eye View segmentation. Specifically, we substitute the 3D CNN layers with ConvGRU in the temporal module to bolster the capacity of networks for handling temporal dependencies. Additionally, we integrate a geographical mask into the Convolutional Gated Recurrent Unit to suppress noise introduced by the temporal module. Comprehensive experiments conducted on the NuScenes dataset substantiate the merits of the proposed Geo-ConvGRU, revealing that our approach attains state-of-the-art performance in Bird’s-Eye View segmentation.
zh
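
用 ConvGRU 替换 3D 卷积时间模块的核心是带卷积门控的循环单元,下面给出一个 ConvGRU 单元的极简 PyTorch 实现作为示意(通道数、卷积核大小均为假设值,地理掩码部分仅以注释说明,非 Geo-ConvGRU 官方代码):

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """卷积 GRU 单元的极简实现。"""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)  # 更新门/重置门
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)       # 候选隐状态
        self.hid_ch = hid_ch

    def forward(self, x, h=None):
        if h is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.chunk(torch.sigmoid(self.conv_zr(torch.cat([x, h], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# 用法示意:逐帧处理 BEV 特征序列;论文中还会对时间模块输出乘以地理掩码以抑制噪声
cell = ConvGRUCell(in_ch=64, hid_ch=64)
feats = torch.randn(2, 5, 64, 32, 32)   # (B, T, C, H, W)
h = None
for t in range(feats.size(1)):
    h = cell(feats[:, t], h)
print(h.shape)
```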

[CV-104] Conformal Risk Control for Pulmonary Nodule Detection

【速读】: 该论文旨在解决在医疗决策支持系统中,如何有效量化预测不确定性(predictive uncertainty)以确保决策的可靠性和透明度的问题。特别是在肺癌筛查中的肺结节检测(pulmonary nodule detection)场景下,论文提出了一种基于保形风险控制(conformal risk control, CRC)的不确定性量化技术。该技术通过生成具有保形保证的预测集(prediction sets),允许用户在假阳性率和模型性能之间进行权衡,从而提供形式化的统计保证。论文的关键解决方案在于引入CRC技术,使得模型在保持与放射科医生相当的敏感性的同时,能够量化并控制预测的不确定性,从而在安全关键的医疗领域中提供更为可靠的决策支持。此外,论文还强调了在面对本体不确定性(ontological uncertainty)时,使用现成预测模型的风险,进一步凸显了不确定性量化的重要性。

链接: https://arxiv.org/abs/2412.20167
作者: Roel Hulsman,Valentin Comte,Lorenzo Bertolini,Tobias Wiesenthal,Antonio Puertas Gallardo,Mario Ceresa
机构: 未知
关键词: Quantitative tools, increasingly appealing, growing capabilities, Quantitative, decision support
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantitative tools are increasingly appealing for decision support in healthcare, driven by the growing capabilities of advanced AI systems. However, understanding the predictive uncertainties surrounding a tool’s output is crucial for decision-makers to ensure reliable and transparent decisions. In this paper, we present a case study on pulmonary nodule detection for lung cancer screening, enhancing an advanced detection model with an uncertainty quantification technique called conformal risk control (CRC). We demonstrate that prediction sets with conformal guarantees are attractive measures of predictive uncertainty in the safety-critical healthcare domain, allowing end-users to achieve arbitrary validity by trading off false positives and providing formal statistical guarantees on model performance. Among ground-truth nodules annotated by at least three radiologists, our model achieves a sensitivity that is competitive with that generally achieved by individual radiologists, with a slight increase in false positives. Furthermore, we illustrate the risks of using off-the-shelve prediction models when faced with ontological uncertainty, such as when radiologists disagree on what constitutes the ground truth on pulmonary nodules.
zh
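
保形风险控制(CRC)的基本做法是在校准集上对经验风险加上有限样本修正项,再据此挑选满足风险上界的工作阈值。下面是一个高度简化的示意(省略了风险单调性等正式条件,数据为随机模拟,非论文官方实现):

```python
import numpy as np

def crc_threshold(miss_rates, lambdas, alpha=0.10, B=1.0):
    """CRC 阈值选择的极简示意。
    miss_rates: (n_cal, n_lambda) 校准集中每个病例在各置信度阈值下漏检结节的比例
    lambdas:    (n_lambda,)       候选置信度阈值,阈值越低保留的候选框越多、漏检越少
    alpha:      希望控制的期望漏检风险上界;B 为单样本风险上界
    """
    n = miss_rates.shape[0]
    risk_bound = (n / (n + 1)) * miss_rates.mean(axis=0) + B / (n + 1)
    ok = np.where(risk_bound <= alpha)[0]
    if len(ok) == 0:
        return lambdas.min()       # 无法满足约束时退回最宽松阈值
    return lambdas[ok].max()       # 满足风险约束的前提下取最高阈值,尽量减少假阳性

# 用法示意(虚构数据):200 个校准病例、50 个候选阈值
lams = np.linspace(0.05, 0.95, 50)
miss = np.clip(np.random.rand(200, 50) * lams, 0, 1)   # 粗略模拟"阈值越高漏检越多"
print(crc_threshold(miss, lams))
```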

[CV-105] StyleAutoEncoder for manipulating image attributes using pre-trained StyleGAN

【速读】: 该论文旨在解决训练现代生成式模型(Generative Models)时所需的高昂计算成本和资源消耗问题。为此,作者提出了一种轻量级的自编码器模块——StyleAutoEncoder (StyleAE),该模块可作为预训练生成式模型的插件,用于高效地操纵图像的指定属性。StyleAE的关键在于其能够在不从头训练生成式模型的情况下,通过简单的插件形式实现对图像属性的灵活控制,从而显著降低了计算资源的消耗。通过与当前顶尖的生成式模型StyleGAN结合,实验表明StyleAE在图像属性操纵方面至少与基于可逆归一化流(Invertible Normalizing Flows)的最先进算法效果相当,但更为简单、快速,并且在神经网络设计上提供了更大的自由度。

链接: https://arxiv.org/abs/2412.20164
作者: Andrzej Bedychaj,Jacek Tabor,Marek Śmieja
机构: 未知
关键词: creating high-quality images, generative models, Deep conditional generative, excellent tools, tools for creating
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep conditional generative models are excellent tools for creating high-quality images and editing their attributes. However, training modern generative models from scratch is very expensive and requires large computational resources. In this paper, we introduce StyleAutoEncoder (StyleAE), a lightweight AutoEncoder module, which works as a plugin for pre-trained generative models and allows for manipulating the requested attributes of images. The proposed method offers a cost-effective solution for training deep generative models with limited computational resources, making it a promising technique for a wide range of applications. We evaluate StyleAutoEncoder by combining it with StyleGAN, which is currently one of the top generative models. Our experiments demonstrate that StyleAutoEncoder is at least as effective in manipulating image attributes as the state-of-the-art algorithms based on invertible normalizing flows. However, it is simpler, faster, and gives more freedom in designing neural networks.
zh

[CV-106] Multi-Modality Driven LoRA for Adverse Condition Depth Estimation

【速读】: 该论文旨在解决自动驾驶领域中在恶劣条件下(如夜间、雾天、雨天)的深度估计问题,即Adverse Condition Depth Estimation (ACDE)。现有方法主要依赖生成模型,需要额外的目标图像或将晴天条件转换为恶劣天气,或通过可学习参数进行特征增强以适配领域差异,导致模型复杂性和调优工作量增加。此外,深度估计模型在多模态特征对齐方面存在不足,阻碍了在恶劣条件下的连贯理解。为解决这些局限性,论文提出了Multi-Modality Driven LoRA (MMD-LoRA),其核心在于利用低秩适应矩阵(low-rank adaptation matrices)实现从源域到目标域的高效微调。该方法包含两个关键组件:Prompt Driven Domain Alignment (PDDA)和Visual-Text Consistent Contrastive Learning (VTCCL)。PDDA通过图像编码器生成目标域视觉表示,并通过语言与图像之间的源-目标差异对齐损失进行监督;VTCCL则通过对比学习桥接CLIP的文本特征与扩散模型的视觉特征,推动不同天气表示(视觉与文本)的分离和相似表示的聚合。实验结果表明,该方法在nuScenes和Oxford RobotCar数据集上达到了最先进的性能,展示了其在适应多种恶劣环境中的鲁棒性和高效性。

链接: https://arxiv.org/abs/2412.20162
作者: Guanglei Yang,Rui Tian,Yongqiang Zhang,Zhun Zhong,Yongqiang Li,Wangmeng Zuo
机构: 未知
关键词: autonomous driving community, ensuring driving safety, corner case problems, addressing corner case, Adverse Condition Depth
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The autonomous driving community is increasingly focused on addressing corner case problems, particularly those related to ensuring driving safety under adverse conditions (e.g., nighttime, fog, rain). To this end, the task of Adverse Condition Depth Estimation (ACDE) has gained significant attention. Previous approaches in ACDE have primarily relied on generative models, which necessitate additional target images to convert the sunny condition into adverse weather, or learnable parameters for feature augmentation to adapt domain gaps, resulting in increased model complexity and tuning efforts. Furthermore, unlike CLIP-based methods where textual and visual features have been pre-aligned, depth estimation models lack sufficient alignment between multimodal features, hindering coherent understanding under adverse conditions. To address these limitations, we propose Multi-Modality Driven LoRA (MMD-LoRA), which leverages low-rank adaptation matrices for efficient fine-tuning from source-domain to target-domain. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning(VTCCL). During PDDA, the image encoder with MMD-LoRA generates target-domain visual representations, supervised by alignment loss that the source-target difference between language and image should be equal. Meanwhile, VTCCL bridges the gap between textual features from CLIP and visual features from diffusion model, pushing apart different weather representations (vision and text) and bringing together similar ones. Through extensive experiments, the proposed method achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets, underscoring robustness and efficiency in adapting to varied adverse environments.
zh
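
MMD-LoRA 的基础组件是低秩适应矩阵:冻结预训练权重,只训练一对低秩矩阵 A、B。下面是把 LoRA 挂接到一个线性层上的通用示意(秩 r、缩放 alpha 为假设值,非论文官方实现):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """给冻结的线性层挂接低秩适配矩阵的示意实现。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # 冻结源域预训练权重
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # B 初始化为 0,训练开始时不改变输出
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.t() @ self.lora_B.t()) * self.scale

# 用法示意:只训练 LoRA 参数即可完成源域到目标域(恶劣天气)的高效微调
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```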

[CV-107] UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity

【速读】: 该论文旨在解决全功能图像修复(all-in-one image restoration)中现有方法的局限性。具体而言,现有方法分为退化无关(degradation-agnostic)和退化感知(degradation-aware)两类,前者无法充分利用退化特定的修复能力,后者则受限于退化估计中的不可避免的误差,导致其性能与特定单任务模型存在较大差距。为解决这一问题,论文提出了UniRestorer模型,其关键解决方案包括:首先,在退化空间进行层次聚类(hierarchical clustering),并训练一个多粒度专家混合(multi-granularity mixture-of-experts, MoE)修复模型;其次,UniRestorer通过退化估计和粒度估计自适应地选择合适的专家进行图像修复。与现有方法相比,UniRestorer既能利用退化估计提升退化特定修复的效果,又能通过粒度估计增强模型对退化估计误差的鲁棒性。实验结果表明,UniRestorer在性能上大幅超越了现有的全功能修复方法,并有望缩小与特定单任务模型之间的性能差距。

链接: https://arxiv.org/abs/2412.20157
作者: Jingbo Lin,Zhilu Zhang,Wenbo Li,Renjing Pei,Hang Xu,Hongzhi Zhang,Wangmeng Zuo
机构: 未知
关键词: considerable progress, made in allin-one, Recently, estimation, restoration
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 20 figures

点击查看摘要

Abstract:Recently, considerable progress has been made in all-in-one image restoration. Generally, existing methods can be degradation-agnostic or degradation-aware. However, the former are limited in leveraging degradation-specific restoration, and the latter suffer from the inevitable error in degradation estimation. Consequently, the performance of existing methods has a large gap compared to specific single-task models. In this work, we make a step forward in this topic, and present our UniRestorer with improved restoration performance. Specifically, we perform hierarchical clustering on degradation space, and train a multi-granularity mixture-of-experts (MoE) restoration model. Then, UniRestorer adopts both degradation and granularity estimation to adaptively select an appropriate expert for image restoration. In contrast to existing degradation-agnostic and -aware methods, UniRestorer can leverage degradation estimation to benefit degradation-specific restoration, and use granularity estimation to make the model robust to degradation estimation error. Experimental results show that our UniRestorer outperforms state-of-the-art all-in-one methods by a large margin, and is promising in closing the performance gap to specific single task models. The code and pre-trained models will be publicly available at this https URL.
zh

[CV-108] Distilled Transformers with Locally Enhanced Global Representations for Face Forgery Detection

【速读】: 该论文旨在解决人脸伪造检测(Face Forgery Detection, FFD)中现有方法在捕捉局部伪造痕迹和全局依赖性方面的不足。尽管基于卷积神经网络(CNN)的方法在FFD中表现优异,但它们容易受到各种伪造方法生成的局部伪造模式的影响。而基于Transformer的检测器虽然在建模全局依赖性方面有所改进,但在探索局部伪造痕迹方面表现不佳。此外,混合Transformer网络虽然设计用于捕捉局部和全局的伪造痕迹,但随着Transformer层数的增加,容易出现注意力崩溃(attention collapse)问题,且软标签(soft labels)信息稀缺。为解决这些问题,论文提出了一种蒸馏Transformer网络(Distilled Transformer Network, DTN),其关键解决方案包括:1)设计专家混合(Mixture of Expert, MoE)模块以挖掘多种鲁棒的伪造嵌入;2)提出局部增强视觉Transformer(Locally-Enhanced Vision Transformer, LEVT)模块以学习局部增强的全局表示;3)引入轻量级多注意力缩放(Multi-Attention Scaling, MAS)模块,避免注意力崩溃,并可灵活应用于任何基于Transformer的模型;4)提出深度伪造自蒸馏(Deepfake Self-Distillation, DSD)方案,为模型提供丰富的软标签信息。实验结果表明,该方法在五个深度伪造数据集上超越了现有技术。

链接: https://arxiv.org/abs/2412.20156
作者: Yaning Zhang,Qiufu Li,Zitong Yu,Linlin Shen
机构: 未知
关键词: Face forgery detection, devoted to detecting, detecting the authenticity, FFD, forgery
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Pattern Recognition

点击查看摘要

Abstract:Face forgery detection (FFD) is devoted to detecting the authenticity of face images. Although current CNN-based works achieve outstanding performance in FFD, they are susceptible to capturing local forgery patterns generated by various manipulation methods. Though transformer-based detectors exhibit improvements in modeling global dependencies, they are not good at exploring local forgery artifacts. Hybrid transformer-based networks are designed to capture local and global manipulated traces, but they tend to suffer from the attention collapse issue as the transformer block goes deeper. Besides, soft labels are rarely available. In this paper, we propose a distilled transformer network (DTN) to capture both rich local and global forgery traces and learn general and common representations for different forgery faces. Specifically, we design a mixture of expert (MoE) module to mine various robust forgery embeddings. Moreover, a locally-enhanced vision transformer (LEVT) module is proposed to learn locally-enhanced global representations. We design a lightweight multi-attention scaling (MAS) module to avoid attention collapse, which can be plugged and played in any transformer-based models with only a slight increase in computational costs. In addition, we propose a deepfake self-distillation (DSD) scheme to provide the model with abundant soft label information. Extensive experiments show that the proposed method surpasses the state of the arts on five deepfake datasets.
zh

[CV-109] DEGSTalk: Decomposed Per-Embedding Gaussian Fields for Hair-Preserving Talking Face Synthesis ICASSP2025

【速读】: 该论文旨在解决在合成具有长发个体的说话人脸视频时,准确捕捉精细面部特征的挑战。现有方法在处理复杂面部动态和长发保留方面存在不足。为此,作者提出了一种基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的说话人脸合成方法,称为分解预嵌入高斯场(Decomposed Per-Embedding Gaussian Fields, DEGSTalk)。该方案的关键在于引入了可变形预嵌入高斯场(Deformable Pre-Embedding Gaussian Fields),通过隐式表情系数动态调整预嵌入高斯基元,从而精确捕捉动态面部区域和细微表情。此外,作者还提出了一种动态长发保留肖像渲染技术(Dynamic Hair-Preserving Portrait Rendering),以增强合成视频中长发运动的真实感。实验结果表明,DEGSTalk在真实感和合成质量上优于现有方法,特别是在处理复杂面部动态和长发保留方面表现出色。

链接: https://arxiv.org/abs/2412.20148
作者: Kaijun Deng,Dezhi Zheng,Jindong Xie,Jinbao Wang,Weicheng Xie,Linlin Shen,Siyang Song
机构: 未知
关键词: Accurately synthesizing talking, Accurately synthesizing, Gaussian fields, synthesizing talking face, per-embedding Gaussian fields
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Accurately synthesizing talking face videos and capturing fine facial features for individuals with long hair presents a significant challenge. To tackle these challenges in existing methods, we propose a decomposed per-embedding Gaussian fields (DEGSTalk), a 3D Gaussian Splatting (3DGS)-based talking face synthesis method for generating realistic talking faces with long hairs. Our DEGSTalk employs Deformable Pre-Embedding Gaussian Fields, which dynamically adjust pre-embedding Gaussian primitives using implicit expression coefficients. This enables precise capture of dynamic facial regions and subtle expressions. Additionally, we propose a Dynamic Hair-Preserving Portrait Rendering technique to enhance the realism of long hair motions in the synthesized videos. Results show that DEGSTalk achieves improved realism and synthesis quality compared to existing approaches, particularly in handling complex facial dynamics and hair preservation. Our code will be publicly available at this https URL.
zh

[CV-110] Cross-Modal Mapping: Eliminating the Modality Gap for Few-Shot Image Classification

【速读】: 该论文旨在解决小样本图像分类任务中,基于预训练视觉-语言模型(如CLIP)的方法在直接使用视觉或文本特征作为类别原型时,由于模态间隙(modality gap)导致特征无法充分代表各自类别的问题。为了解决这一问题,论文提出了一种简单高效的跨模态映射(Cross-Modal Mapping, CMM)方法,通过线性变换将图像特征映射到文本特征空间,使两种模态在同一特征空间内具有可比性。此外,为了进一步优化图像特征与类别文本特征之间的空间关系,论文引入了三元组损失(triplet loss),使类别文本特征能够自然地作为图像特征的类别原型。实验结果表明,该方法在11个基准数据集上平均提升了约3.5%,并在4个分布偏移基准上表现出竞争力。

链接: https://arxiv.org/abs/2412.20110
作者: Xi Yang,Pai Peng,Wulin Xie,Xiaohuan Lu,Jie Wen
机构: 未知
关键词: achieved significant progress, pretrained vision-language models, image classification tasks, few-shot image classification, classification tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In few-shot image classification tasks, methods based on pretrained vision-language models (such as CLIP) have achieved significant progress. Many existing approaches directly utilize visual or textual features as class prototypes, however, these features fail to adequately represent their respective classes. We identify that this limitation arises from the modality gap inherent in pretrained vision-language models, which weakens the connection between the visual and textual modalities. To eliminate this modality gap and enable textual features to fully represent class prototypes, we propose a simple and efficient Cross-Modal Mapping (CMM) method. This method employs a linear transformation to map image features into the textual feature space, ensuring that both modalities are comparable within the same feature space. Nevertheless, the modality gap diminishes the effectiveness of this mapping. To address this, we further introduce a triplet loss to optimize the spatial relationships between image features and class textual features, allowing class textual features to naturally serve as class prototypes for image features. Experimental results on 11 benchmark demonstrate an average improvement of approximately 3.5% compared to conventional methods and exhibit competitive performance on 4 distribution shift benchmarks.
zh
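
跨模态映射(CMM)本质上是一个线性层加三元组损失,下面用 PyTorch 给出最小示意(特征维度、margin 等均为假设值,真实场景中图像/文本特征可取自 CLIP 编码器;非论文官方实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMapping(nn.Module):
    """把图像特征线性映射到文本特征空间的示意实现。"""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_feat):
        return F.normalize(self.proj(img_feat), dim=-1)

def triplet_loss(mapped_img, pos_text, neg_text, margin=0.2):
    """anchor 为映射后的图像特征,正/负样本分别为对应类别与其他类别的文本特征。"""
    d_pos = 1 - F.cosine_similarity(mapped_img, pos_text, dim=-1)
    d_neg = 1 - F.cosine_similarity(mapped_img, neg_text, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

# 用法示意(此处用随机张量代替真实特征)
cmm = CrossModalMapping()
img = torch.randn(16, 512)
pos = F.normalize(torch.randn(16, 512), dim=-1)
neg = F.normalize(torch.randn(16, 512), dim=-1)
print(triplet_loss(cmm(img), pos, neg))
```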

[CV-111] ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming AAAI2025

【速读】: 该论文旨在解决多模态大语言模型(MLLMs)在处理大量视觉标记(visual tokens)时产生的高计算成本问题。现有的MLLM注意力机制分析较为浅显,导致粗粒度的标记剪枝策略无法有效平衡推理速度与准确性。为此,论文提出了一个名为“时空视觉标记修剪”(Spatial-Temporal Visual Token Trimming, ST^3)的框架,该框架通过两个关键组件来加速MLLM推理:1)渐进式视觉标记剪枝(Progressive Visual Token Pruning, PVTP),逐层去除不重要的视觉标记;2)视觉标记退火(Visual Token Annealing, VTA),随着生成标记的增加动态减少每层的视觉标记数量。ST^3无需重新训练即可无缝集成到现有的预训练MLLMs中,显著提升了推理速度(约2倍),并减少了约30%的键值缓存(KV cache)内存占用,同时保持了模型在不同数据集上的性能一致性。

链接: https://arxiv.org/abs/2412.20105
作者: Jiedong Zhuang,Lu Lu,Ming Dai,Rui Hu,Jian Chen,Qiang Liu,Haoji Hu
机构: 未知
关键词: Multimodal large language, large language models, Multimodal large, visual tokens, visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI2025

点击查看摘要

Abstract:Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analysis of the MLLM attention mechanisms remains shallow, leading to coarse-grain token pruning strategies that fail to effectively balance speed and accuracy. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. Based on this insight, we propose Spatial-Temporal Visual Token Trimming (ST^3), a framework designed to accelerate MLLM inference without retraining. ST^3 consists of two primary components: 1) Progressive Visual Token Pruning (PVTP), which eliminates inattentive visual tokens across layers, and 2) Visual Token Annealing (VTA), which dynamically reduces the number of visual tokens in each layer as the generated tokens grow. Together, these techniques deliver around 2x faster inference with only about 30% KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. Crucially, ST^3 can be seamlessly integrated into existing pre-trained MLLMs, providing a plug-and-play solution for efficient inference.
zh
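
渐进式视觉标记剪枝(PVTP)的直观做法是按注意力分数保留最重要的视觉 token,下面给出一个按"文本 token 对视觉 token 的平均注意力"排序并裁剪的示意(张量形状与保留比例均为假设,非 ST^3 官方实现):

```python
import torch

def prune_visual_tokens(hidden, attn, vis_idx, keep_ratio=0.5):
    """按注意力分数裁剪视觉 token 的极简示意。
    hidden:  (B, L, D)    当前层的 token 表示
    attn:    (B, H, L, L) 该层注意力权重
    vis_idx: 视觉 token 在序列中的下标列表
    """
    B, L, D = hidden.shape
    vis_set = set(vis_idx)
    txt_idx = [i for i in range(L) if i not in vis_set]
    # 用"所有文本 token 对每个视觉 token 的平均注意力"衡量其重要性
    score = attn[:, :, txt_idx][..., vis_idx].mean(dim=(1, 2))       # (B, n_vis)
    k = max(1, int(len(vis_idx) * keep_ratio))
    keep_vis = score.topk(k, dim=-1).indices                          # 每个样本保留的视觉 token
    vis_idx_t = torch.tensor(vis_idx, device=hidden.device)
    kept = torch.gather(vis_idx_t.expand(B, -1), 1, keep_vis)         # 映射回序列下标
    txt_idx_t = torch.tensor(txt_idx, device=hidden.device).expand(B, -1)
    # 返回"文本 token + 保留的视觉 token"的下标,供下一层取子序列
    return torch.sort(torch.cat([txt_idx_t, kept], dim=1), dim=1).values

# 用法示意(虚构形状):576 个视觉 token + 32 个文本 token
B, H, L, D = 1, 8, 608, 1024
hidden = torch.randn(B, L, D)
attn = torch.softmax(torch.randn(B, H, L, L), dim=-1)
keep = prune_visual_tokens(hidden, attn, vis_idx=list(range(576)), keep_ratio=0.5)
print(keep.shape)   # (1, 32 + 288)
```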

[CV-112] SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

【速读】: 该论文旨在解决多体(multi-body)人机交互运动合成的复杂问题,特别是在虚拟现实(VR)、增强现实(AR)和人类动画领域中,如何生成逼真的多体交互运动。与以往研究中常见的单人或单手与单一物体交互的场景不同,该研究关注的是涉及任意数量的人类、手部和物体的多体交互场景。这种复杂性带来了显著的挑战,尤其是在同步不同体之间的运动时,由于它们之间存在高度相关性和相互影响。为解决这些挑战,论文提出了SyncDiff方法,其关键创新在于采用同步运动扩散策略(synchronized motion diffusion strategy),通过单一扩散模型(diffusion model)捕捉多体运动的联合分布。此外,SyncDiff引入了频域运动分解方案(frequency-domain motion decomposition scheme)以提高运动逼真度,并提出了一组新的对齐评分(alignment scores)来强调不同体运动的同步性。通过显式的同步策略,SyncDiff联合优化了数据样本似然和对齐似然,从而在多体配置的广泛实验中展示了其相对于现有最先进运动合成方法的优越性。

链接: https://arxiv.org/abs/2412.20104
作者: Wenkun He,Yun Liu,Ruitao Liu,Li Yi
机构: 未知
关键词: Synthesizing realistic human-object, Synthesizing realistic, realistic human-object interaction, realistic human-object, critical problem
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces significant challenges in synchronizing motions due to the high correlations and mutual influences among bodies. To address these challenges, we introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. SyncDiff employs a single diffusion model to capture the joint distribution of multi-body motions. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. Additionally, we introduce a new set of alignment scores to emphasize the synchronization of different body motions. SyncDiff jointly optimizes both data sample likelihood and alignment likelihood through an explicit synchronization strategy. Extensive experiments across four datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
zh

[CV-113] An archaeological Catalog Collection Method Based on Large Vision-Language Models WWW

【速读】: 该论文旨在解决考古目录(archaeological catalogs)自动化收集过程中面临的挑战,特别是现有的大规模视觉-语言模型(Large Vision-Language Models, VLMs)及其衍生方法在图像检测和模态匹配方面的不足。为解决这些问题,作者提出了一种基于大规模视觉-语言模型的新型考古目录收集方法,其核心在于三个模块:文档定位(document localization)、区块理解(block comprehension)和区块匹配(block matching)。通过在大坝沟和庙子沟陶器目录的实际数据收集和对比实验,该方法展示了其有效性,为考古目录的自动化收集提供了可靠的解决方案。

链接: https://arxiv.org/abs/2412.20088
作者: Honglin Pang,Yi Chang,Tianjing Duan,Xi Yang
机构: 未知
关键词: studying artifact evolution, Large Vision-Language Models, morphological descriptions, excavation information, cultural inheritance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages,4 figures,www source track

点击查看摘要

Abstract:Archaeological catalogs, containing key elements such as artifact images, morphological descriptions, and excavation information, are essential for studying artifact evolution and cultural inheritance. These data are widely scattered across publications, requiring automated collection methods. However, existing Large Vision-Language Models (VLMs) and their derivative data collection methods face challenges in accurate image detection and modal matching when processing archaeological catalogs, making automated collection difficult. To address these issues, we propose a novel archaeological catalog collection method based on Large Vision-Language Models that follows an approach comprising three modules: document localization, block comprehension and block matching. Through practical data collection from the Dabagou and Miaozigou pottery catalogs and comparison experiments, we demonstrate the effectiveness of our approach, providing a reliable solution for automated collection of archaeological catalogs.
zh

[CV-114] Enhancing Marine Debris Acoustic Monitoring by Optical Flow-Based Motion Vector Analysis

【速读】: 该论文旨在解决海洋塑料垃圾(plastic debris)污染监测中的技术难题,特别是在水下和海底区域由于低可见度限制,传统光学传感器(optical sensors)难以有效应用的问题。论文提出了一种基于光流(optical flow)的方法,利用声学相机(acoustic camera)或高分辨率前视声呐(high-resolution forward-looking sonar, FLS)捕捉的时间序列信息,增强海洋垃圾监测的性能,而无需依赖目标的先验类别标签。该方法的关键在于克服声呐图像中低信噪比、弱纹理和成像失真等挑战,通过充分利用时间序列信息,实现对海洋垃圾的自主监测。实验在循环水槽中进行,验证了该方法的可行性和鲁棒性,为垃圾的时空分布研究提供了新的视角。

链接: https://arxiv.org/abs/2412.20085
作者: Xiaoteng Zhou,Katsunori Mizuno
机构: 未知
关键词: marine debris monitoring, marine debris, debris monitoring, coastal construction, human-generated waste
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, conference

点击查看摘要

Abstract:With the development of coastal construction, a large amount of human-generated waste, particularly plastic debris, is continuously entering the ocean, posing a severe threat to marine ecosystems. The key to effectively addressing plastic pollution lies in the ability to autonomously monitor such debris. Currently, marine debris monitoring primarily relies on optical sensors, but these methods are limited in their applicability to underwater and seafloor areas due to low-visibility constraints. The acoustic camera, also known as high-resolution forward-looking sonar (FLS), has demonstrated considerable potential in the autonomous monitoring of marine debris, as they are unaffected by water turbidity and dark environments. The appearance of targets in sonar images changes with variations in the imaging viewpoint, while challenges such as low signal-to-noise ratio, weak textures, and imaging distortions in sonar imagery present significant obstacles to debris monitoring based on prior class labels. This paper proposes an optical flow-based method for marine debris monitoring, aiming to fully utilize the time series information captured by the acoustic camera to enhance the performance of marine debris monitoring without relying on prior category labels of the targets. The proposed method was validated through experiments conducted in a circulating water tank, demonstrating its feasibility and robustness. This approach holds promise for providing novel insights into the spatial and temporal distribution of debris.
zh
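
对声呐帧序列计算稠密光流可以直接使用 OpenCV 的 Farneback 算法,下面是一个提取运动显著区域的最小示意(滤波与阈值参数为经验假设值,非论文官方实现):

```python
import cv2
import numpy as np

def sonar_motion_vectors(prev_frame, curr_frame, mag_thresh=1.0):
    """对相邻声呐帧计算稠密光流并提取运动区域的示意。"""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY) if prev_frame.ndim == 3 else prev_frame
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY) if curr_frame.ndim == 3 else curr_frame
    # 声呐图像信噪比低,先做中值滤波抑制斑点噪声
    prev_gray = cv2.medianBlur(prev_gray, 5)
    curr_gray = cv2.medianBlur(curr_gray, 5)
    # 参数依次为:pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving_mask = (mag > mag_thresh).astype(np.uint8) * 255   # 运动显著的候选垃圾区域
    return flow, moving_mask

# 用法示意:prev_frame / curr_frame 为声学相机导出的连续帧(uint8 图像)
```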

[CV-115] STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection

【速读】: 该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)领域中现有方法存在的计算负担大以及时空正常性学习不足的问题。为解决这些问题,论文提出了一种轻量级且高效的基于Mamba的网络STNMamba。其关键解决方案包括:首先,设计了双编码器架构,其中空间编码器采用多尺度视觉空间状态块(Multi-Scale Vision Space State Blocks, MS-VSSB)提取多尺度外观特征,时间编码器则使用通道感知视觉空间状态块(Channel-Aware Vision Space State Blocks, CA-VSSB)捕捉显著的运动模式;其次,引入了时空交互模块(Spatial-Temporal Interaction Module, STIM),通过时空融合块(Spatial-Temporal Fusion Block, STFB)将时空特征融合到统一特征空间中,并利用记忆库存储正常模式的时空原型,从而限制模型对异常的表征能力。这些设计使得STNMamba在减少参数和计算成本的同时,实现了与现有方法相媲美的性能。

链接: https://arxiv.org/abs/2412.20084
作者: Zhangxun Li,Mengyang Zhao,Xuan Yang,Yang Liu,Jiamu Sheng,Xinhua Zeng,Tian Wang,Kewei Wu,Yu-Gang Jiang
机构: 未知
关键词: Video anomaly detection, intelligent video systems, extensively researched due, Space State Blocks, video systems
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems. However, most existing methods based on CNNs and transformers still suffer from substantial computational burdens and have room for improvement in learning spatial-temporal normality. Recently, Mamba has shown great potential for modeling long-range dependencies with linear complexity, providing an effective solution to the above dilemma. To this end, we propose a lightweight and effective Mamba-based network named STNMamba, which incorporates carefully designed Mamba modules to enhance the learning of spatial-temporal normality. Firstly, we develop a dual-encoder architecture, where the spatial encoder equipped with Multi-Scale Vision Space State Blocks (MS-VSSB) extracts multi-scale appearance features, and the temporal encoder employs Channel-Aware Vision Space State Blocks (CA-VSSB) to capture significant motion patterns. Secondly, a Spatial-Temporal Interaction Module (STIM) is introduced to integrate spatial and temporal information across multiple levels, enabling effective modeling of intrinsic spatial-temporal consistency. Within this module, the Spatial-Temporal Fusion Block (STFB) is proposed to fuse the spatial and temporal features into a unified feature space, and the memory bank is utilized to store spatial-temporal prototypes of normal patterns, restricting the model’s ability to represent anomalies. Extensive experiments on three benchmark datasets demonstrate that our STNMamba achieves competitive performance with fewer parameters and lower computational costs than existing methods.
zh

[CV-116] MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing

【速读】: 该论文旨在解决深度视觉里程计(Deep Visual Odometry)在复杂场景中由于模糊匹配导致的几何建模和捆绑调整优化(Bundle Adjustment Optimization)误差问题,进而影响姿态估计的准确性和鲁棒性。为解决这一问题,论文提出了MambaVO框架,其关键解决方案包括:1)通过半稠密几何初始化模块(Geometric Initialization Module, GIM)进行鲁棒初始化,确保新帧与最近关键帧的匹配;2)利用几何Mamba模块(Geometric Mamba Module, GMM)对帧间像素级匹配进行优化;3)引入趋势感知惩罚(Trending-Aware Penalty, TAP)平滑训练过程,平衡姿态损失和匹配损失,提升收敛性和稳定性;4)最终通过深度捆绑调整优化姿态和地图,并引入闭环检测模块(Loop Closure Module)进一步增强系统性能。MambaVO及其增强版MambaVO++在公开基准测试中展示了最先进的精度表现,同时保证了实时运行性能和低GPU内存需求。

链接: https://arxiv.org/abs/2412.20082
作者: Shuo Wang,Wanting Li,Yongcai Wang,Zhaoxin Fan,Zhe Huang,Xudong Cai,Jian Zhao,Deying Li
机构: 未知
关键词: demonstrated great advancements, Deep visual odometry, demonstrated great, great advancements, Deep visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep visual odometry has demonstrated great advancements by learning-to-optimize technology. This approach heavily relies on the visual matching across frames. However, ambiguous matching in challenging scenarios leads to significant errors in geometric modeling and bundle adjustment optimization, which undermines the accuracy and robustness of pose estimation. To address this challenge, this paper proposes MambaVO, which conducts robust initialization, Mamba-based sequential matching refinement, and smoothed training to enhance the matching quality and improve the pose estimation in deep visual odometry. Specifically, when a new frame is received, it is matched with the closest keyframe in the maintained Point-Frame Graph (PFG) via the semi-dense based Geometric Initialization Module (GIM). Then the initialized PFG is processed by a proposed Geometric Mamba Module (GMM), which exploits the matching features to refine the overall inter-frame pixel-to-pixel matching. The refined PFG is finally processed by deep BA to optimize the poses and the map. To deal with the gradient variance, a Trending-Aware Penalty (TAP) is proposed to smooth training by balancing the pose loss and the matching loss to enhance convergence and stability. A loop closure module is finally applied to enable MambaVO++. On public benchmarks, MambaVO and MambaVO++ demonstrate SOTA accuracy performance, while ensuring real-time running performance with low GPU memory requirement. Codes will be publicly available.
zh

[CV-117] MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration

【速读】: 该论文旨在解决基于Mamba的图像恢复方法中存在的两个关键问题:一是如何设计一种扫描策略,既能保留自然图像的局部关系和空间连续性,又能促进图像恢复;二是如何有效地聚合通过不同方式展开的序列。为解决这些问题,论文提出了一种新颖的基于Mamba的图像恢复模型(MaIR),其核心包括嵌套S形扫描策略(NSS)和序列混洗注意力块(SSA)。NSS通过基于条纹的扫描区域和S形扫描路径分别保留输入图像的局部性和连续性,而SSA则通过计算不同序列对应通道内的注意力权重来聚合序列。得益于NSS和SSA,MaIR在14个具有挑战性的数据集上超越了40个基线模型,在图像超分辨率、去噪、去模糊和去雾任务中实现了最先进的性能。

链接: https://arxiv.org/abs/2412.20066
作者: Boyun Li,Haiyu Zhao,Wenxin Wang,Peng Hu,Yuanbiao Gou,Xi Peng
机构: 未知
关键词: shown promising results, Recent advancements, advancements in Mamba, Mamba have shown, shown promising
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in Mamba have shown promising results in image restoration. These methods typically flatten 2D images into multiple distinct 1D sequences along rows and columns, process each sequence independently using selective scan operation, and recombine them to form the outputs. However, such a paradigm overlooks two vital aspects: i) the local relationships and spatial continuity inherent in natural images, and ii) the discrepancies among sequences unfolded through totally different ways. To overcome the drawbacks, we explore two problems in Mamba-based restoration methods: i) how to design a scanning strategy preserving both locality and continuity while facilitating restoration, and ii) how to aggregate the distinct sequences unfolded in totally different ways. To address these problems, we propose a novel Mamba-based Image Restoration model (MaIR), which consists of Nested S-shaped Scanning strategy (NSS) and Sequence Shuffle Attention block (SSA). Specifically, NSS preserves locality and continuity of the input images through the stripe-based scanning region and the S-shaped scanning path, respectively. SSA aggregates sequences through calculating attention weights within the corresponding channels of different sequences. Thanks to NSS and SSA, MaIR surpasses 40 baselines across 14 challenging datasets, achieving state-of-the-art performance on the tasks of image super-resolution, denoising, deblurring and dehazing. Our codes will be available after acceptance.
zh
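
代码示意:下面用 NumPy 给出嵌套S形扫描(NSS)思路的一个最小草图:先按条纹划分图像,再在条纹内与条纹间沿蛇形路径展开,使空间相邻的像素在一维序列中仍然相邻;条纹宽度与具体路径细节为本文假设,仅用于说明 NSS 如何保持局部性与连续性,并非论文官方实现。

```python
import numpy as np

def nested_s_scan_indices(h: int, w: int, stripe_w: int = 4) -> np.ndarray:
    """Flat scan order over an h x w grid: visit the image stripe by stripe
    and snake (S-shape) both inside and across stripes, so spatially
    adjacent pixels stay adjacent in the 1D sequence. Stripe width and the
    exact path are illustrative assumptions, not the paper's definition."""
    order = []
    stripes = [(c, min(c + stripe_w, w)) for c in range(0, w, stripe_w)]
    for si, (c0, c1) in enumerate(stripes):
        rows = range(h) if si % 2 == 0 else range(h - 1, -1, -1)
        for ri, r in enumerate(rows):
            cols = range(c0, c1) if ri % 2 == 0 else range(c1 - 1, c0 - 1, -1)
            order.extend(r * w + c for c in cols)
    return np.asarray(order)

# usage: unfold features with the S-shaped order, run the sequence model,
# then invert the permutation to fold the sequence back into 2D
idx = nested_s_scan_indices(8, 8, stripe_w=4)
inv = np.argsort(idx)                       # inverse permutation
feat = np.random.rand(8 * 8, 16)            # (H*W, C) toy features
seq = feat[idx]                             # locality/continuity-preserving sequence
assert np.allclose(seq[inv], feat)          # folding back recovers the raster order
```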

[CV-118] VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition

【速读】: 该论文旨在解决基于RGB和事件相机(Event cameras)的多模态任务中,如何高效地进行参数微调(parameter-efficient fine-tuning, PEFT)的问题。现有的方法通常需要对预训练的大模型进行完全微调,这会导致效率低下。为此,论文提出了一种新颖的PEFT策略,通过结合视觉基础模型ViT和模态特定的LoRA(Low-Rank Adaptation)微调策略,来适应RGB-Event分类任务。具体而言,论文首先从RGB帧和事件流中提取特征,并利用帧差骨干网络捕捉运动线索。随后,这些特征通过模态共享的LoRA微调策略在高层次Transformer层中进行多模态特征学习,最终通过分类头实现高效微调。该方案的关键在于通过LoRA和Adapter等轻量级微调方法,在效率和性能之间取得更好的平衡,从而提升RGB-Event识别的性能。

链接: https://arxiv.org/abs/2412.20064
作者: Lan Chen,Haoxiang Yang,Pengpeng Shao,Haoyu Song,Xiao Wang,Zhicheng Zhao,Yaowei Wang,Yonghong Tian
机构: 未知
关键词: Pattern recognition leveraging, deploying deep neural, significantly enhance performance, deep neural networks, Pattern recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: In Peer Review

点击查看摘要

Abstract:Pattern recognition leveraging both RGB and Event cameras can significantly enhance performance by deploying deep neural networks that utilize a fine-tuning strategy. Inspired by the successful application of large models, the introduction of such large models can also be considered to further enhance the performance of multi-modal tasks. However, fully fine-tuning these models leads to inefficiency, so lightweight fine-tuning methods such as LoRA and Adapter have been proposed to achieve a better balance between efficiency and performance. To our knowledge, there is currently no work that has conducted parameter-efficient fine-tuning (PEFT) for RGB-Event recognition based on pre-trained foundation models. To address this issue, this paper proposes a novel PEFT strategy to adapt the pre-trained foundation vision models for the RGB-Event-based classification. Specifically, given the RGB frames and event streams, we extract the RGB and event features based on the vision foundation model ViT with a modality-specific LoRA tuning strategy. The frame difference of the dual modalities is also considered to capture the motion cues via the frame difference backbone network. These features are concatenated and fed into high-level Transformer layers for efficient multi-modal feature learning via modality-shared LoRA tuning. Finally, we concatenate these features and feed them into a classification head to achieve efficient fine-tuning. The source code and pre-trained models will be released at this https URL.
zh
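
代码示意:下面给出一个最小的 LoRA 线性层封装(PyTorch),说明冻结预训练权重、只训练低秩增量这一基本做法,对应文中模态特定与模态共享的 LoRA 微调;秩 r、缩放系数以及具体包装哪一层(例如 ViT 的投影层)均为示意性假设。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: the pretrained projection is frozen and a
    trainable low-rank update B @ A (rank r) is added on top, as in the
    modality-specific / modality-shared LoRA tuning described above.
    Rank, scaling and which layer to wrap are illustrative choices."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the foundation-model weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.t()) @ self.B.t())

# e.g. wrap a 768-d projection of a ViT block for the RGB or event branch
layer = LoRALinear(nn.Linear(768, 768), r=8)
y = layer(torch.randn(2, 197, 768))          # only A and B receive gradients
```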

[CV-119] MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion

【速读】: 该论文旨在解决文本引导的图像编辑模型在时尚领域应用中的两个主要问题:(1) 编辑区域定位不准确;(2) 编辑强度不足。为解决这些问题,论文提出了MADiff模型。其关键解决方案包括两个方面:首先,通过引入MaskNet,将前景区域、密集姿态(densepose)和大语言模型生成的掩码提示输入轻量级UNet,以更准确地预测编辑区域的掩码;其次,提出注意力增强扩散模型(Attention-Enhanced Diffusion Model),将噪声图、注意力图和MaskNet生成的掩码输入注意力处理器(Attention Processor),生成精炼的噪声图,并将其整合到扩散模型中,从而使编辑后的图像更好地与目标提示对齐。通过这些创新,MADiff模型在时尚图像编辑任务中显著提升了编辑区域定位的准确性和编辑强度。

链接: https://arxiv.org/abs/2412.20062
作者: Zechao Zhan,Dehong Gao,Jinxia Zhang,Jiale Huang,Yang Hu,Xin Wang
机构: 未知
关键词: Text-guided image editing, achieved great success, Text-guided image, editing, editing region
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-guided image editing model has achieved great success in general domain. However, directly applying these models to the fashion domain may encounter two issues: (1) Inaccurate localization of editing region; (2) Weak editing magnitude. To address these issues, the MADiff model is proposed. Specifically, to more accurately identify editing region, the MaskNet is proposed, in which the foreground region, densepose and mask prompts from large language model are fed into a lightweight UNet to predict the mask for editing region. To strengthen the editing magnitude, the Attention-Enhanced Diffusion Model is proposed, where the noise map, attention map, and the mask from MaskNet are fed into the proposed Attention Processor to produce a refined noise map. By integrating the refined noise map into the diffusion model, the edited image can better align with the target prompt. Given the absence of benchmarks in fashion image editing, we constructed a dataset named Fashion-E, comprising 28390 image-text pairs in the training set, and 2639 image-text pairs for four types of fashion tasks in the evaluation set. Extensive experiments on Fashion-E demonstrate that our proposed method can accurately predict the mask of editing region and significantly enhance editing magnitude in fashion image editing compared to the state-of-the-art methods.
zh

[CV-120] AI-based Wearable Vision Assistance System for the Visually Impaired: Integrating Real-Time Object Recognition and Contextual Understanding Using Large Vision-Language Models

【速读】: 该论文旨在解决视觉障碍者在日常生活中获取丰富环境信息方面的挑战,传统方法在此方面存在局限性。为此,论文提出了一种新型的可穿戴视觉辅助系统,其核心解决方案包括:1) 采用帽载摄像头与Raspberry Pi 4 Model B(8GB RAM)结合,利用人工智能技术提供实时反馈;2) 通过一键式流程实现新人物或物体的识别,用户可添加新数据以提高识别准确性;3) 利用大型视觉语言模型(LVLM)提供环境中物体的详细描述;4) 集成距离传感器,在用户即将碰撞物体时通过蜂鸣器发出警报,确保导航安全。该系统的创新之处在于将硬件与AI技术(包括LVLM与物联网结合)相结合,显著提升了辅助技术的效能,解决了视觉障碍者面临的主要问题。

链接: https://arxiv.org/abs/2412.20059
作者: Mirza Samad Ahmed Baig,Syeda Anshrah Gillani,Shahid Munir Shah,Mahmoud Aljawarneh,Abdul Akbar Khan,Muhammad Hamzah Siddiqui
机构: 未知
关键词: Visual impairment affects, Visual impairment, impairment affects, affects the ability, live a life
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: N-A

点击查看摘要

Abstract:Visual impairment affects the ability of people to live a life like normal people. Such people face challenges in performing activities of daily living, such as reading, writing, traveling and participating in social gatherings. Many traditional approaches are available to help visually impaired people; however, these are limited in obtaining contextually rich environmental information necessary for independent living. In order to overcome this limitation, this paper introduces a novel wearable vision assistance system that has a hat-mounted camera connected to a Raspberry Pi 4 Model B (8GB RAM) with artificial intelligence (AI) technology to deliver real-time feedback to a user through a sound beep mechanism. The key features of this system include a user-friendly procedure for the recognition of new people or objects through a one-click process that allows users to add data on new individuals and objects for later detection, enhancing the accuracy of the recognition over time. The system provides detailed descriptions of objects in the user’s environment using a large vision language model (LVLM). In addition, it incorporates a distance sensor that activates a beeping sound using a buzzer as soon as the user is about to collide with an object, helping to ensure safety while navigating their environment. A comprehensive evaluation is carried out to evaluate the proposed AI-based solution against traditional support techniques. Comparative analysis shows that the proposed solution, with its innovative combination of hardware and AI (including LVLMs with IoT), is a significant advancement in assistive technology that aims to solve the major issues faced by the community of visually impaired people.
zh
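
代码示意:针对文中距离传感器触发蜂鸣器的防碰撞功能,下面给出在树莓派上用 gpiozero 库实现的一个最小草图;其中传感器类型(超声波测距)、GPIO 管脚号与 0.5 米的报警阈值均为假设,论文并未给出这些实现细节。

```python
from gpiozero import Buzzer, DistanceSensor
from time import sleep

# GPIO pins and the 0.5 m threshold are illustrative assumptions; the paper
# only states that a distance sensor triggers a buzzer before a collision.
sensor = DistanceSensor(echo=24, trigger=23, max_distance=2.0)
buzzer = Buzzer(17)

try:
    while True:
        if sensor.distance < 0.5:      # .distance is reported in metres
            buzzer.on()                # warn: obstacle ahead
        else:
            buzzer.off()
        sleep(0.1)
except KeyboardInterrupt:
    buzzer.off()
```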

[CV-121] GSplatLoc: Ultra-Precise Camera Localization via 3D Gaussian Splatting

【速读】: 该论文旨在解决相机定位(camera localization)问题,特别是在复杂室内环境中实现高精度的姿态估计。现有的方法在精确度和鲁棒性方面存在局限,尤其是在处理复杂相机运动时。论文提出的解决方案GSplatLoc,其关键在于利用3D高斯溅射(3D Gaussian splatting)的可微分渲染能力,将姿态估计问题转化为基于梯度的优化问题。通过最小化预构建的3D高斯场景生成的渲染深度图与观测深度图之间的差异,GSplatLoc在Replica数据集上实现了0.01厘米的平移误差和接近零的旋转误差,显著超越了现有方法。该方法的鲁棒性在Replica和TUM RGB-D数据集上得到了验证,为密集地图构建中的定位问题设定了新的基准,对机器人技术和增强现实等需要高精度实时定位的应用具有重要意义。

链接: https://arxiv.org/abs/2412.20056
作者: Atticus J. Zeller (Southeast University Chengxian College, Nanjing, China)
机构: 未知
关键词: differentiable rendering capabilities, ultra-precise pose estimation, Gaussian splatting, leverages the differentiable, differentiable rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures. Code available at this https URL

点击查看摘要

Abstract:We present GSplatLoc, a camera localization method that leverages the differentiable rendering capabilities of 3D Gaussian splatting for ultra-precise pose estimation. By formulating pose estimation as a gradient-based optimization problem that minimizes discrepancies between rendered depth maps from a pre-existing 3D Gaussian scene and observed depth images, GSplatLoc achieves translational errors within 0.01 cm and near-zero rotational errors on the Replica dataset - significantly outperforming existing methods. Evaluations on the Replica and TUM RGB-D datasets demonstrate the method’s robustness in challenging indoor environments with complex camera motions. GSplatLoc sets a new benchmark for localization in dense mapping, with important implications for applications requiring accurate real-time localization, such as robotics and augmented reality.
zh

[CV-122] SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection

【速读】: 该论文旨在解决在自然长尾分布(long-tailed distribution)场景下,目标检测(object detection)任务中从少量样本中学习的开放性问题。现有方法通常依赖外部ImageNet标签来增强低样本训练实例,但这种方法在实际应用中不切实际且效用有限。论文提出了一种更为通用的解决方案,即利用可选的无标签图像(unlabeled images),这些图像易于收集且无需人工标注。其核心框架SimLTD包括三个简单步骤:(1)在丰富的头部类(head classes)上进行预训练;(2)在稀缺的尾部类(tail classes)上进行迁移学习;(3)在头部和尾部类的采样集上进行微调。该方法避免了元学习(meta-learning)或知识蒸馏(knowledge distillation)的复杂性,通过利用无标签图像,在LVIS v1基准测试中取得了新的记录结果。

链接: https://arxiv.org/abs/2412.20047
作者: Phi Vu Tran
机构: 未知
关键词: visual recognition systems, witnessed tremendous advances, modern visual recognition, Recent years, recognition systems
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Technical Report

点击查看摘要

Abstract:Recent years have witnessed tremendous advances on modern visual recognition systems. Despite such progress, many vision models still struggle with the open problem of learning from few exemplars. This paper focuses on the task of object detection in the setting where object classes follow a natural long-tailed distribution. Existing approaches to long-tailed detection resort to external ImageNet labels to augment the low-shot training instances. However, such dependency on a large labeled database is impractical and has limited utility in realistic scenarios. We propose a more versatile approach to leverage optional unlabeled images, which are easy to collect without the burden of human annotations. Our SimLTD framework is straightforward and intuitive, and consists of three simple steps: (1) pre-training on abundant head classes; (2) transfer learning on scarce tail classes; and (3) fine-tuning on a sampled set of both head and tail classes. Our approach can be viewed as an improved head-to-tail model transfer paradigm without the added complexities of meta-learning or knowledge distillation, as was required in past research. By harnessing supplementary unlabeled images, without extra image labels, SimLTD establishes new record results on the challenging LVIS v1 benchmark across both supervised and semi-supervised settings.
zh

[CV-123] Enhancing Diffusion Models for Inverse Problems with Covariance-Aware Posterior Sampling

【速读】: 该论文旨在解决噪声线性逆问题(noisy linear inverse problems)中的后验采样(posterior sampling)问题,特别是在去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPMs)框架下如何更准确地近似似然函数(likelihood)。现有的方法通常基于反向过程(reverse process)条件密度的均值(mean)来近似似然函数,但这种方法在精度上存在局限。论文的关键解决方案是通过推导反向过程协方差(covariance)的闭式表达式,并提出一种基于有限差分法(finite difference method)的协方差近似方法,使得该协方差可以直接从现有的预训练DDPMs中获取,从而在不增加复杂性的情况下,结合均值和近似协方差,提出了一种新的似然函数近似方法,称为协方差感知扩散后验采样(Covariance-Aware Diffusion Posterior Sampling, CA-DPS)。实验结果表明,CA-DPS在不需超参数调优的情况下显著提升了重建性能。

链接: https://arxiv.org/abs/2412.20045
作者: Shayan Mohajer Hamidi,En-Hui Yang
机构: 未知
关键词: Inverse problems exist, Inverse problems, linear inverse problems, science and engineering, disciplines of science
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inverse problems exist in many disciplines of science and engineering. In computer vision, for example, tasks such as inpainting, deblurring, and super resolution can be effectively modeled as inverse problems. Recently, denoising diffusion probabilistic models (DDPMs) are shown to provide a promising solution to noisy linear inverse problems without the need for additional task specific training. Specifically, with the prior provided by DDPMs, one can sample from the posterior by approximating the likelihood. In the literature, approximations of the likelihood are often based on the mean of conditional densities of the reverse process, which can be obtained using Tweedie formula. To obtain a better approximation to the likelihood, in this paper we first derive a closed form formula for the covariance of the reverse process. Then, we propose a method based on finite difference method to approximate this covariance such that it can be readily obtained from the existing pretrained DDPMs, thereby not increasing the complexity compared to existing approaches. Finally, based on the mean and approximated covariance of the reverse process, we present a new approximation to the likelihood. We refer to this method as covariance-aware diffusion posterior sampling (CA-DPS). Experimental results show that CA-DPS significantly improves reconstruction performance without requiring hyperparameter tuning. The code for the paper is put in the supplementary materials.
zh

[CV-124] DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments

【速读】: 该论文旨在解决现有交通视频数据集(如Waymo)主要关注西方交通场景、缺乏全球适用性的问题,特别是亚洲交通场景的复杂性未被充分体现。为解决这一差距,作者提出了一个新的数据集DAVE,专门用于评估在复杂和不可预测环境中的感知方法,特别是对弱势道路使用者(VRUs,如行人、动物、摩托车和自行车)的高代表性。DAVE的关键在于其手动标注的多样性,涵盖了16种不同的参与者类别(包括动物、人类、车辆等)和16种动作类型(如切入、之字形移动、U型转弯等),这些场景需要较高的推理能力。此外,DAVE密集标注了超过1300万个边界框(bboxes),其中160多万个框同时标注了参与者身份和动作/行为细节。DAVE的视频采集考虑了多种因素,如天气条件、时间、道路场景和交通密度,使其能够用于跟踪、检测、时空动作定位、语言-视觉时刻检索和多标签视频动作识别等任务的基准测试。DAVE中弱势道路使用者占实例的41.13%,远高于Waymo的23.71%,为开发更敏感和准确的视觉感知算法提供了宝贵资源。实验表明,现有方法在DAVE上的性能有所下降,凸显了其对未来视频识别研究的重要性。

链接: https://arxiv.org/abs/2412.20042
作者: Xijun Wang,Pedro Sandoval-Segura,Chengyuan Zhang,Junyun Huang,Tianrui Guan,Ruiqi Xian,Fuxiao Liu,Rohan Chandra,Boqing Gong,Dinesh Manocha
机构: 未知
关键词: hinders global applicability, predominantly on Western, datasets including Waymo, Western traffic, focusing predominantly
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most existing traffic video datasets including Waymo are structured, focusing predominantly on Western traffic, which hinders global applicability. Specifically, most Asian scenarios are far more complex, involving numerous objects with distinct motions and behaviors. Addressing this gap, we present a new dataset, DAVE, designed for evaluating perception methods with high representation of Vulnerable Road Users (VRUs: e.g. pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turn, etc.), which require high reasoning ability. DAVE densely annotates over 13 million bounding boxes (bboxes) of actors with identification, and more than 1.6 million boxes are annotated with both actor identification and action/behavior details. The videos within DAVE are collected based on a broad spectrum of factors, such as weather conditions, the time of day, road scenarios, and traffic density. DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment retrieval, and Multi-label Video Action Recognition. Given the critical importance of accurately identifying VRUs to prevent accidents and ensure road safety, in DAVE, vulnerable road users constitute 41.13% of instances, compared to 23.71% in Waymo. DAVE provides an invaluable resource for the development of more sensitive and accurate visual perception algorithms in the complex real world. Our experiments show that existing methods suffer degradation in performance when evaluated on DAVE, highlighting its benefit for future video recognition research.
zh

[CV-125] Maintain Plasticity in Long-timescale Continual Test-time Adaptation

【速读】: 该论文旨在解决持续测试时域适应(Continual Test-Time Domain Adaptation, CTTA)中模型在长时间尺度下适应非平稳目标环境时保持可塑性(plasticity)的问题。可塑性指的是模型在不断变化的非平稳环境中持续调整预测的能力。研究发现,现有的CTTA方法在长时间尺度的持续适应阶段中,可塑性会逐渐下降,且这种下降与标签翻转(label flip)的变化密切相关。基于这一发现,论文提出了一种简单而有效的策略——自适应收缩恢复(Adaptive Shrink-Restore, ASR),通过自适应间隔进行权重重新初始化,以保持模型的可塑性。ASR的关键在于根据标签翻转的变化动态确定权重重新初始化的间隔,从而在长时间尺度下实现更持续的适应。该方法在多个CTTA基准测试中验证了其有效性,并取得了优异的性能。

链接: https://arxiv.org/abs/2412.20034
作者: Yanshuo Wang,Xuesong Li,Jinguang Tong,Jie Hong,Jun Lan,Weiqiang Wang,Huijia Zhu,Haoxing Chen
机构: 未知
关键词: pre-trained source models, adjust pre-trained source, Continual test-time domain, test-time domain adaptation, non-stationary target environments
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual test-time domain adaptation (CTTA) aims to adjust pre-trained source models to perform well over time across non-stationary target environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: can the model adapt to continually-changing environments with preserved plasticity over a long time? The plasticity refers to the model’s capability to adjust predictions in response to non-stationary environments continually. In this work, we explore plasticity, an essential but often overlooked aspect of continual adaptation, to facilitate more sustained adaptation in the long run. First, we observe that most CTTA methods experience a steady and consistent decline in plasticity during the long-timescale continual adaptation phase. Moreover, we find that the loss of plasticity is strongly associated with the change in label flip. Based on this correlation, we propose a simple yet effective policy, Adaptive Shrink-Restore (ASR), towards preserving the model’s plasticity. In particular, ASR performs weight re-initialization at adaptive intervals. The adaptive interval is determined based on the change in label flipping. Our method is validated on extensive CTTA benchmarks, achieving excellent performance.
zh
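
代码示意:下面给出自适应收缩恢复(ASR)核心思想的一个示意性实现:跟踪相邻适应步之间预测翻转率的变化,一旦变化超过阈值,就把已适应的参数向源模型权重收缩;翻转率的统计方式、阈值与插值系数均为本文假设,具体的间隔规则以论文原文为准。

```python
import torch
import torch.nn as nn

class AdaptiveShrinkRestore:
    """Sketch of ASR for continual test-time adaptation: track how often
    predictions flip between consecutive steps (a proxy for label flips)
    and, when the flip rate changes sharply, shrink the adapted parameters
    back towards the source weights. The flip statistic, threshold and
    interpolation coefficient are assumptions for illustration only."""
    def __init__(self, model: nn.Module, alpha: float = 0.7, thresh: float = 0.05):
        self.source = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.alpha, self.thresh = alpha, thresh
        self.prev_preds, self.prev_flip = None, None

    @torch.no_grad()
    def step(self, model: nn.Module, preds: torch.Tensor) -> None:
        if self.prev_preds is not None and preds.shape == self.prev_preds.shape:
            flip = (preds != self.prev_preds).float().mean().item()
            if self.prev_flip is not None and abs(flip - self.prev_flip) > self.thresh:
                for n, p in model.named_parameters():    # shrink-restore towards source
                    p.data.mul_(self.alpha).add_(self.source[n], alpha=1 - self.alpha)
            self.prev_flip = flip
        self.prev_preds = preds.clone()

# usage inside a CTTA loop: call step() after adapting on each test batch
model = nn.Linear(8, 3)
asr = AdaptiveShrinkRestore(model)
for _ in range(5):
    preds = model(torch.randn(32, 8)).argmax(dim=1)
    asr.step(model, preds)
```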

[CV-126] A Robust Adversarial Ensemble with Causal (Feature Interaction) Interpretations for Image Classification

【速读】: 该论文旨在解决深度学习(Deep Learning)中基于判别式分类器(discriminative classifiers)在面对对抗样本(adversarial examples)时的脆弱性问题。尽管对抗训练(adversarial training)可以提升模型的鲁棒性,但它无法从根本上解决黑箱模型(black-box models)不透明性带来的内在脆弱性。论文提出了一种深度集成模型(deep ensemble model),通过将判别式特征与生成式模型(generative models)相结合,以实现高精度和对抗鲁棒性。该解决方案的关键在于:底层使用预训练的判别式网络进行特征提取,顶层则通过深度潜变量模型(deep latent variable model)生成对抗输入的分布,并利用变分贝叶斯(variational Bayes)方法在不进行对抗训练的情况下实现对抗白盒攻击(white-box adversarial attacks)的鲁棒性。实验表明,该模型在CIFAR-10和CIFAR-100数据集上表现出卓越的对抗鲁棒性,并通过反事实度量(counterfactual metrics)和基于特征交互的度量(feature interaction-based metrics)验证了模型可解释性与对抗鲁棒性之间的相关性。此外,在Tiny-ImageNet上的初步结果验证了该方法的可扩展性,为开发鲁棒的图像分类模型提供了实用解决方案。

链接: https://arxiv.org/abs/2412.20025
作者: Chunheng Zhao,Pierluigi Pisu,Gurcan Comert,Negash Begashaw,Varghese Vaidyan,Nina Christine Hubig
机构: 未知
关键词: learning-based discriminative classifiers, mislead model predictions, Deep learning-based discriminative, remarkable success, remain vulnerable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based discriminative classifiers, despite their remarkable success, remain vulnerable to adversarial examples that can mislead model predictions. While adversarial training can enhance robustness, it fails to address the intrinsic vulnerability stemming from the opaque nature of these black-box models. We present a deep ensemble model that combines discriminative features with generative models to achieve both high accuracy and adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions through a deep latent variable model. Using variational Bayes, our model achieves superior robustness against white-box adversarial attacks without adversarial training. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate our model’s superior adversarial robustness. Through evaluations using counterfactual metrics and feature interaction-based metrics, we establish correlations between model interpretability and adversarial robustness. Additionally, preliminary results on Tiny-ImageNet validate our approach’s scalability to more complex datasets, offering a practical solution for developing robust image classification models.
zh

[CV-127] Adversarial Robustness for Deep Learning-based Wildfire Detection Models

【速读】: 该论文旨在解决基于深度神经网络(DNNs)的野火烟雾检测模型在训练数据不足情况下容易出现过拟合和偏差的问题。由于烟雾在时间和空间上具有异常性,导致难以收集足够的训练数据,从而影响了模型的鲁棒性。为此,论文提出了WARP(Wildfire Adversarial Robustness Procedure),这是首个模型无关的框架,用于评估基于DNN的野火检测模型的对抗鲁棒性。WARP通过全局和局部对抗攻击方法来解决烟雾图像多样性不足的问题:全局攻击方法采用图像上下文的高斯噪声,而局部攻击方法则使用针对野火检测关键方面的局部噪声注入。通过WARP的模型无关能力,论文评估了实时卷积神经网络(CNNs)和Transformer模型的对抗鲁棒性,揭示了这些模型在全局噪声和云图像注入下的局限性,并提出了通过数据增强改进模型的必要性。WARP的全面鲁棒性分析为开发野火特定的数据增强策略提供了重要依据,推动了模型的实用化进程。

链接: https://arxiv.org/abs/2412.20006
作者: Ryo Ide,Lei Yang
机构: 未知
关键词: Deep Neural Networks, Deep Neural, early wildfire detection, wildfire detection, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Smoke detection using Deep Neural Networks (DNNs) is an effective approach for early wildfire detection. However, because smoke is temporally and spatially anomalous, there are limitations in collecting sufficient training data. This raises overfitting and bias concerns in existing DNN-based wildfire detection models. Thus, we introduce WARP (Wildfire Adversarial Robustness Procedure), the first model-agnostic framework for evaluating the adversarial robustness of DNN-based wildfire detection models. WARP addresses limitations in smoke image diversity using global and local adversarial attack methods. The global attack method uses image-contextualized Gaussian noise, while the local attack method uses patch noise injection, tailored to address critical aspects of wildfire detection. Leveraging WARP’s model-agnostic capabilities, we assess the adversarial robustness of real-time Convolutional Neural Networks (CNNs) and Transformers. The analysis revealed valuable insights into the models’ limitations. Specifically, the global attack method demonstrates that the Transformer model exhibits more than 70% greater precision degradation than the CNN against global noise. In contrast, the local attack method shows that both models are susceptible to cloud image injections when detecting smoke-positive instances, suggesting a need for model improvements through data augmentation. WARP’s comprehensive robustness analysis contributed to the development of wildfire-specific data augmentation strategies, marking a step toward practicality.
zh
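
代码示意:下面用 NumPy 给出 WARP 两类攻击的最小草图:全局攻击按图像自身的强度统计确定高斯噪声幅度(对 image-contextualized 的一种简化理解),局部攻击则在指定位置注入补丁噪声(例如云朵裁剪图);噪声强度、补丁来源与位置均为示意性假设。

```python
import numpy as np

def global_gaussian_attack(img: np.ndarray, severity: float = 0.2) -> np.ndarray:
    """Image-contextualised Gaussian noise: the noise scale follows the
    image's own intensity statistics (one simple reading of 'contextualised')."""
    sigma = severity * img.std()
    return np.clip(img + np.random.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

def local_patch_attack(img: np.ndarray, patch: np.ndarray, top: int, left: int) -> np.ndarray:
    """Patch noise injection: paste a perturbing patch (e.g. a cloud crop)
    at a chosen location, mimicking the local attack."""
    out = img.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return np.clip(out, 0.0, 1.0)

img = np.random.rand(224, 224, 3)     # stand-in for a normalised smoke image
cloud = np.random.rand(32, 32, 3)     # stand-in for a cloud patch
adv_global = global_gaussian_attack(img)
adv_local = local_patch_attack(img, cloud, top=50, left=60)
```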

[CV-128] Learning Adaptive and View-Invariant Vision Transformer with Multi-Teacher Knowledge Distillation for Real-Time UAV Tracking

【速读】: 该论文旨在解决在计算资源受限的移动平台上,特别是实时无人机(UAV)跟踪场景中,现有基于Transformer的视觉跟踪模型难以满足实时处理需求的问题。为解决这一问题,论文提出了AVTrack,一种自适应计算框架,通过选择性激活Transformer模块来优化推理效率,同时保持跟踪性能。关键解决方案包括:1)引入激活模块(Activation Module, AM),动态优化视觉Transformer(ViT)架构,选择性启用相关组件以提升效率;2)通过最大化互信息(Mutual Information, MI)学习视角不变表示,以应对无人机跟踪中常见的视角剧烈变化问题;3)提出AVTrack-MD,一种基于多教师知识蒸馏(Multi-teacher Knowledge Distillation, MD)的改进跟踪器,通过最大化多教师模型与学生模型之间的互信息,提升学生模型的泛化能力和性能,特别是在噪声环境下的表现。实验表明,AVTrack-MD在保持与基线模型相当性能的同时,显著降低了模型复杂度,平均跟踪速度提升了17%。

链接: https://arxiv.org/abs/2412.20002
作者: You Wu,Yongxin Li,Mengyuan Liu,Xucheng Wang,Xiangyang Yang,Hengzhou Ye,Dan Zeng,Qijun Zhao,Shuiwang Li
机构: 未知
关键词: made significant strides, significant strides due, Visual tracking, UAV tracking, real-time UAV tracking
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual tracking has made significant strides due to the adoption of transformer-based models. Most state-of-the-art trackers struggle to meet real-time processing demands on mobile platforms with constrained computing resources, particularly for real-time unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we introduce AVTrack, an adaptive computation framework designed to selectively activate transformer blocks for real-time UAV tracking. The proposed Activation Module (AM) dynamically optimizes the ViT architecture by selectively engaging relevant components, thereby enhancing inference efficiency without significant compromise to tracking performance. Furthermore, to tackle the challenges posed by extreme changes in viewing angles often encountered in UAV tracking, the proposed method enhances ViTs’ effectiveness by learning view-invariant representations through mutual information (MI) maximization. Two effective design principles are proposed in the AVTrack. Building on it, we propose an improved tracker, dubbed AVTrack-MD, which introduces the novel MI maximization-based multi-teacher knowledge distillation (MD) framework. It harnesses the benefits of multiple teachers, specifically the off-the-shelf tracking models from the AVTrack, by integrating and refining their outputs, thereby guiding the learning process of the compact student network. Specifically, we maximize the MI between the softened feature representations from the multi-teacher models and the student model, leading to improved generalization and performance of the student model, particularly in noisy conditions. Extensive experiments on multiple UAV tracking benchmarks demonstrate that AVTrack-MD not only achieves performance comparable to the AVTrack baseline but also reduces model complexity, resulting in a significant 17% increase in average tracking speed.
zh

[CV-129] Comprehensive Review of EEG-to-Output Research: Decoding Neural Signals into Images Videos and Audio

【速读】: 该论文旨在系统回顾和总结基于脑电图(EEG)的感知体验重建研究,重点关注生成式方法、评估指标和数据挑战。通过应用PRISMA指南,作者分析了1800项研究,识别了该领域的关键趋势、挑战和机遇。解决方案的核心在于利用先进的生成式模型,如生成对抗网络(GANs)、变分自编码器(VAEs)和Transformer,以提高解码精度并推动实际应用。此外,论文强调了标准化数据集和跨被试泛化的迫切需求,并提出了未来研究的路线图,以进一步推动该领域的发展。

链接: https://arxiv.org/abs/2412.19999
作者: Yashvir Sabharwal,Balaji Rama
机构: 未知
关键词: high temporal resolution, tool in neuroscience, offering insights, temporal resolution, invaluable tool
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 15 pages. Submitted as a conference paper to IntelliSys 2025

点击查看摘要

Abstract:Electroencephalography (EEG) is an invaluable tool in neuroscience, offering insights into brain activity with high temporal resolution. Recent advancements in machine learning and generative modeling have catalyzed the application of EEG in reconstructing perceptual experiences, including images, videos, and audio. This paper systematically reviews EEG-to-output research, focusing on state-of-the-art generative methods, evaluation metrics, and data challenges. Using PRISMA guidelines, we analyze 1800 studies and identify key trends, challenges, and opportunities in the field. The findings emphasize the potential of advanced models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformers, while highlighting the pressing need for standardized datasets and cross-subject generalization. A roadmap for future research is proposed that aims to improve decoding accuracy and broaden real-world applications.
zh

[CV-130] FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training

【速读】: 该论文旨在解决在时尚领域中,现有的大规模视觉-语言预训练(VLP)模型难以有效利用细粒度属性(如纹理和材质)的问题,这些属性对于检索等任务至关重要。为了解决这一问题,论文提出了一种名为Fine-grained Attributes Enhanced VLP (FashionFAE)的新方法,其核心在于通过两个关键任务来增强模型对细粒度属性的捕捉能力:一是属性强调的文本预测任务,通过预测物品的细粒度属性,迫使模型从文本模态中关注显著属性;二是属性促进的图像重建任务,通过利用图像模态中的代表性属性,进一步提升模型的细粒度能力。实验结果表明,FashionFAE在检索和识别任务中均显著优于现有最先进方法。

链接: https://arxiv.org/abs/2412.19997
作者: Jiale Huang,Dehong Gao,Jinxia Zhang,Zechao Zhan,Yang Hu,Xin Wang
机构: 未知
关键词: Large-scale Vision-Language Pre-training, demonstrated remarkable success, Large-scale Vision-Language, Vision-Language Pre-training, Attributes Enhanced VLP
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale Vision-Language Pre-training (VLP) has demonstrated remarkable success in the general domain. However, in the fashion domain, items are distinguished by fine-grained attributes like texture and material, which are crucial for tasks such as retrieval. Existing models often fail to leverage these fine-grained attributes from both text and image modalities. To address the above issues, we propose a novel approach for the fashion domain, Fine-grained Attributes Enhanced VLP (FashionFAE), which focuses on the detailed characteristics of fashion data. An attribute-emphasized text prediction task is proposed to predict fine-grained attributes of the items. This forces the model to focus on the salient attributes from the text modality. Additionally, a novel attribute-promoted image reconstruction task is proposed, which further enhances the fine-grained ability of the model by leveraging the representative attributes from the image modality. Extensive experiments show that FashionFAE significantly outperforms State-Of-The-Art (SOTA) methods, achieving 2.9% and 5.2% improvements in retrieval on sub-test and full test sets, respectively, and a 1.6% average improvement in recognition tasks.
zh

[CV-131] An Ordinary Differential Equation Sampler with Stochastic Start for Diffusion Bridge Models

【速读】: 该论文旨在解决现有扩散桥模型(Diffusion Bridge Models)在条件图像生成任务中推理速度较慢的问题。现有模型通常依赖于随机微分方程(Stochastic Differential Equation, SDE)采样器,导致其推理速度较慢,而采用高阶常微分方程(Ordinary Differential Equation, ODE)求解器的扩散模型则能显著加速推理。为此,论文提出了一种带有随机启动的高阶ODE采样器,以提升扩散桥模型的推理效率。解决方案的关键在于:首先,在反向过程的初始步骤中引入后验采样(posterior sampling),以克服概率流ODE(Probability Flow ODE, PF-ODE)在反向过程开始时的奇异行为,确保从损坏图像到生成轨迹的平滑过渡,并减少离散化误差;随后,应用Heun二阶求解器来求解PF-ODE,从而在显著减少神经函数评估(Neural Function Evaluations, NFEs)的同时,保持高感知质量。该方法无需额外训练,且与预训练的扩散桥模型完全兼容。

链接: https://arxiv.org/abs/2412.19992
作者: Yuang Wang,Pengfei Jin,Li Zhang,Quanzheng Li,Zhiqiang Chen,Dufan Wu
机构: 未知
关键词: pure Gaussian noise, Diffusion bridge models, Ordinary Differential Equation, Stochastic Differential Equation, Gaussian noise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Diffusion bridge models have demonstrated promising performance in conditional image generation tasks, such as image restoration and translation, by initializing the generative process from corrupted images instead of pure Gaussian noise. However, existing diffusion bridge models often rely on Stochastic Differential Equation (SDE) samplers, which result in slower inference speed compared to diffusion models that employ high-order Ordinary Differential Equation (ODE) solvers for acceleration. To mitigate this gap, we propose a high-order ODE sampler with a stochastic start for diffusion bridge models. To overcome the singular behavior of the probability flow ODE (PF-ODE) at the beginning of the reverse process, a posterior sampling approach was introduced at the first reverse step. The sampling was designed to ensure a smooth transition from corrupted images to the generative trajectory while reducing discretization errors. Following this stochastic start, Heun’s second-order solver is applied to solve the PF-ODE, achieving high perceptual quality with significantly reduced neural function evaluations (NFEs). Our method is fully compatible with pretrained diffusion bridge models and requires no additional training. Extensive experiments on image restoration and translation tasks, including super-resolution, JPEG restoration, Edges-to-Handbags, and DIODE-Outdoor, demonstrated that our sampler outperforms state-of-the-art methods in both visual quality and Frechet Inception Distance (FID).
zh
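
代码示意:下面给出随机启动加 Heun 二阶求解器这一采样流程的骨架:第一步对退化图像做一次随机扰动(作为后验采样的占位),之后用 Heun 预测-校正步积分概率流 ODE;其中 drift 只是预训练扩散桥模型 PF-ODE 漂移项的占位函数,噪声幅度与时间网格均为假设。

```python
import torch

def heun_sample(x_corrupt: torch.Tensor, drift, t_grid: torch.Tensor,
                sigma0: float = 0.1) -> torch.Tensor:
    """Heun (2nd-order) integration of dx/dt = drift(x, t) along t_grid,
    preceded by a stochastic start that perturbs the corrupted input
    (a stand-in for the posterior-sampling first step). `drift` abstracts
    the PF-ODE drift of a pretrained diffusion bridge model."""
    x = x_corrupt + sigma0 * torch.randn_like(x_corrupt)   # stochastic start
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = t1 - t0
        d0 = drift(x, t0)
        x_pred = x + h * d0                 # Euler predictor
        d1 = drift(x_pred, t1)
        x = x + 0.5 * h * (d0 + d1)         # Heun corrector (average slope)
    return x

# toy usage with a linear drift standing in for the learned PF-ODE
drift = lambda x, t: -x
x_corrupt = torch.randn(1, 3, 8, 8)
t_grid = torch.linspace(1.0, 0.0, 11)       # 10 Heun steps = 20 drift evaluations
restored = heun_sample(x_corrupt, drift, t_grid)
```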

[CV-132] MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

【速读】: 该论文旨在解决基于扩散模型(Diffusion-based)的文本到图像(Text-to-Image, T2I)模型在多属性编辑(Multi-Attribute Editing, MAE)任务中的挑战,特别是在视频编辑领域。现有方法通常需要大量微调或依赖额外网络(如ControlNet)来建模多对象外观,但这些方法仍处于初级阶段,仅提供粗粒度的多属性编辑解决方案。论文提出的MAKIMA框架,基于预训练的T2I模型,无需微调即可实现开放域视频编辑。其关键解决方案包括:1)通过引入去噪过程中的注意力图和特征,保留视频结构和外观信息;2)采用掩码引导的注意力调制(mask-guided attention modulation),增强空间对应标记之间的相关性,抑制自注意力和跨注意力层中的跨属性干扰;3)通过一致特征传播(consistent feature propagation)在关键帧编辑后传播其特征,以平衡视频帧生成质量和效率。实验表明,MAKIMA在开放域多属性视频编辑任务中优于现有基线,在编辑准确性和时间一致性方面均表现出色,同时保持计算效率。

链接: https://arxiv.org/abs/2412.19978
作者: Haoyu Zheng,Wenqiao Zhang,Zheqi Lv,Yu Zhong,Yang Dai,Jianxiang An,Yongliang Shen,Juncheng Li,Dongping Zhang,Siliang Tang,Yueting Zhuang
机构: 未知
关键词: demonstrated remarkable results, video editing tasks, demonstrated remarkable, video, editing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task, specifically in multi-attribute editing (MAE) in video. Contemporary video editing approaches either require extensive fine-tuning or rely on additional networks (such as ControlNet) for modeling multi-object appearances, yet they remain in their infancy, offering only coarse-grained MAE solutions. In this paper, we present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing. Our approach preserves video structure and appearance information by incorporating attention maps and features from the inversion process during denoising. To facilitate precise editing of multiple attributes, we introduce mask-guided attention modulation, enhancing correlations between spatially corresponding tokens and suppressing cross-attribute interference in both self-attention and cross-attention layers. To balance video frame generation quality and efficiency, we implement consistent feature propagation, which generates frame sequences by editing keyframes and propagating their features throughout the sequence. Extensive experiments demonstrate that MAKIMA outperforms existing baselines in open-domain multi-attribute video editing tasks, achieving superior results in both editing accuracy and temporal consistency while maintaining computational efficiency.
zh
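
代码示意:下面给出掩码引导注意力调制的一个简化草图:利用每个 token 所属的属性掩码区域编号,放大同区域 token 之间的注意力、压低跨区域注意力,再对每行重新归一化;放大与抑制系数以及作用位置均为示意性假设,并非论文的精确公式。

```python
import torch

def mask_guided_modulation(attn: torch.Tensor, region_ids: torch.Tensor,
                           boost: float = 1.5, suppress: float = 0.5) -> torch.Tensor:
    """attn: (B, heads, N, N) attention probabilities over N spatial tokens;
    region_ids: (B, N) integer labels derived from the per-attribute masks.
    Token pairs inside the same mask region are amplified, cross-region
    pairs damped, and rows are renormalised. The two factors are
    illustrative, not the paper's exact modulation."""
    same = (region_ids.unsqueeze(2) == region_ids.unsqueeze(1)).float()   # (B, N, N)
    scale = boost * same + suppress * (1.0 - same)
    attn = attn * scale.unsqueeze(1)                     # broadcast over heads
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

attn = torch.softmax(torch.randn(1, 8, 16, 16), dim=-1)  # toy attention maps
region_ids = torch.randint(0, 3, (1, 16))                # 3 attribute regions
mod_attn = mask_guided_modulation(attn, region_ids)
```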

[CV-133] DepthMamba with Adaptive Fusion

【速读】: 该论文旨在解决多视角深度估计(multi-view depth estimation)系统在真实场景中因相机姿态(camera poses)噪声而失效的问题。当前的多视角深度估计方法通常依赖于理想的相机姿态,然而在自动驾驶等实际场景中,相机姿态往往存在噪声,导致现有方法性能显著下降。为解决这一问题,论文提出了一种双分支网络架构(two-branch network architecture),通过融合单视角和多视角分支的深度估计结果来提升系统的鲁棒性。具体而言,论文引入了Mamba作为特征提取主干网络,并提出了一种基于注意力机制(attention-based fusion)的融合方法,能够自适应地选择两个分支中最鲁棒的估计结果。该方法在动态物体、纹理缺失区域等复杂场景中表现出色,并通过消融实验和KITTI、DDAD等基准测试验证了其有效性。

链接: https://arxiv.org/abs/2412.19964
作者: Zelin Meng,Zhichen Wang
机构: 未知
关键词: Multi-view depth estimation, achieved impressive performance, depth estimation, achieved impressive, Multi-view depth
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-view depth estimation has achieved impressive performance over various benchmarks. However, almost all current multi-view systems rely on given ideal camera poses, which are unavailable in many real-world scenarios, such as autonomous driving. In this work, we propose a new robustness benchmark to evaluate the depth estimation system under various noisy pose settings. Surprisingly, we find that current multi-view depth estimation methods or single-view and multi-view fusion methods will fail when given noisy pose settings. To tackle this challenge, we propose a two-branch network architecture which fuses the depth estimation results of the single-view and multi-view branches. Specifically, we introduce Mamba as the feature extraction backbone and propose an attention-based fusion method that adaptively selects the most robust estimation results between the two branches. Thus, the proposed method can perform well on some challenging scenes including dynamic objects, texture-less regions, etc. Ablation studies prove the effectiveness of the backbone and fusion method, while evaluation experiments on challenging benchmarks (KITTI and DDAD) show that the proposed method achieves a competitive performance compared to the state-of-the-art methods.
zh
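
代码示意:下面给出基于注意力的双分支深度融合的一个极简实现:由单视角与多视角分支的深度图及特征预测逐像素权重,再对两个深度估计加权融合;通道数与门控网络的结构均为占位假设,仅说明逐像素自适应选择更可靠分支的思路。

```python
import torch
import torch.nn as nn

class AdaptiveDepthFusion(nn.Module):
    """Sketch of the attention-based fusion head: from the single-view and
    multi-view depth maps plus their feature maps, predict a per-pixel
    weight and blend the two estimates. Channel sizes and the gate design
    are placeholders, not the paper's architecture."""
    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * feat_ch + 2, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 1, 1), nn.Sigmoid())

    def forward(self, d_single, d_multi, f_single, f_multi):
        w = self.gate(torch.cat([d_single, d_multi, f_single, f_multi], dim=1))
        return w * d_single + (1 - w) * d_multi    # per-pixel choice between branches

fusion = AdaptiveDepthFusion(feat_ch=64)
d1, d2 = torch.rand(1, 1, 64, 80), torch.rand(1, 1, 64, 80)     # branch depth maps
f1, f2 = torch.rand(1, 64, 64, 80), torch.rand(1, 64, 64, 80)   # branch features
depth = fusion(d1, d2, f1, f2)
```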

[CV-134] ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers

【速读】: 该论文旨在解决建筑工人在长时间高强度体力劳动和工具使用中面临的姿势工效学风险(postural ergonomic risks)问题,这些风险是导致工人受伤和疾病的主要因素。传统工效学风险评估(ERA)方法缺乏交互反馈,无法实时提供风险信息。为此,研究提出了一种基于生成式人工智能(Generative AI)技术的交互式视觉查询系统,该系统结合了视觉问答(VQA)和图像描述生成(IC)功能,能够根据输入的图像生成关于工人姿势工效学风险的文本描述或回答问题。研究还提出了一个专门用于训练和测试此类方法的数据集。实验结果表明,VQA功能的准确率达到96.5%,且IC功能在多个评估指标和专家评估中均优于仅使用通用数据集训练的相同架构方法。该研究为未来利用生成式AI技术开发交互式ERA系统提供了新的方向。

链接: https://arxiv.org/abs/2412.19954
作者: Chao Fan,Qipei Mei,Xiaonan Wang,Xinming Li
机构: 未知
关键词: endure prolonged periods, predominant health concern, high-intensity physical work, illnesses primarily linked, longstanding predominant health
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 8 figures

点击查看摘要

Abstract:In the construction sector, workers often endure prolonged periods of high-intensity physical work and prolonged use of tools, resulting in injuries and illnesses primarily linked to postural ergonomic risks, a longstanding predominant health concern. To mitigate these risks, researchers have applied various technological methods to identify the ergonomic risks that construction workers face. However, traditional ergonomic risk assessment (ERA) techniques do not offer interactive feedback. The rapidly developing vision-language models (VLMs), capable of generating textual descriptions or answering questions about ergonomic risks based on image inputs, have not yet received widespread attention. This research introduces an interactive visual query system tailored to assess the postural ergonomic risks of construction workers. The system’s capabilities include visual question answering (VQA), which responds to visual queries regarding workers’ exposure to postural ergonomic risks, and image captioning (IC), which generates textual descriptions of these risks from images. Additionally, this study proposes a dataset designed for training and testing such methodologies. Systematic testing indicates that the VQA functionality delivers an accuracy of 96.5%. Moreover, evaluations using nine metrics for IC and assessments from human experts indicate that the proposed approach surpasses the performance of a method using the same architecture trained solely on generic datasets. This study sets a new direction for future developments in interactive ERA using generative artificial intelligence (AI) technologies.
zh

[CV-135] Standard-Deviation-Inspired Regularization for Improving Adversarial Robustness

【速读】: 该论文旨在提升深度神经网络(DNNs)在面对对抗攻击时的鲁棒性(robustness)和泛化能力(generalization)。现有的对抗训练(Adversarial Training, AT)方法通过内层最大化(inner maximization)生成对抗样本来训练模型,外层最小化(outer minimization)则用于最小化这些对抗样本的损失。然而,论文指出内层最大化过程类似于最小化模型输出概率的修正标准差(modified standard deviation),并提出通过最大化这一修正标准差可以补充AT框架的外层最小化。为此,论文引入了一种基于标准差的正则化项(SDI regularization term),实验表明,该正则化项不仅能够用于生成对抗样本,还能与现有的AT变体结合,显著提升模型在面对更强攻击(如CW和Auto-attack)时的鲁棒性,并改善泛化性能。解决方案的关键在于通过SDI正则化项优化AT框架,从而增强模型的对抗鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2412.19947
作者: Olukorede Fakorede,Modeste Atsague,Jin Tian
机构: 未知
关键词: deep neural networks, Adversarial Training, neural networks, deep neural, Adversarial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Adversarial Training (AT) has been demonstrated to improve the robustness of deep neural networks (DNNs) against adversarial attacks. AT is a min-max optimization procedure wherein adversarial examples are generated to train a more robust DNN. The inner maximization step of AT increases the losses of inputs with respect to their actual classes. The outer minimization involves minimizing the losses on the adversarial examples obtained from the inner maximization. This work proposes a standard-deviation-inspired (SDI) regularization term to improve adversarial robustness and generalization. We argue that the inner maximization in AT is similar to minimizing a modified standard deviation of the model’s output probabilities. Moreover, we suggest that maximizing this modified standard deviation can complement the outer minimization of the AT framework. To support our argument, we experimentally show that the SDI measure can be used to craft adversarial examples. Additionally, we demonstrate that combining the SDI regularization term with existing AT variants enhances the robustness of DNNs against stronger attacks, such as CW and Auto-attack, and improves generalization.
zh
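
代码示意:下面给出 SDI 正则项的一个简化草图:以输出概率(softmax)的标准差近似论文中的修正标准差,训练时在对抗样本的交叉熵之外最大化该项(因此在损失中取负号);修正标准差的具体形式与权重 lam 均为假设,以论文原文为准。

```python
import torch
import torch.nn.functional as F

def sdi_term(logits: torch.Tensor) -> torch.Tensor:
    """Standard deviation of the softmax probabilities, averaged over the
    batch; a plain stand-in for the paper's 'modified' standard deviation."""
    return F.softmax(logits, dim=1).std(dim=1).mean()

def sdi_regularized_loss(logits_adv: torch.Tensor, targets: torch.Tensor,
                         lam: float = 0.5) -> torch.Tensor:
    # minimise cross-entropy on adversarial examples while *maximising* the
    # SDI term, hence the minus sign; lam is an assumed weighting
    return F.cross_entropy(logits_adv, targets) - lam * sdi_term(logits_adv)

logits_adv = torch.randn(16, 10, requires_grad=True)   # logits on adversarial inputs
targets = torch.randint(0, 10, (16,))
loss = sdi_regularized_loss(logits_adv, targets)
loss.backward()
```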

[CV-136] Zero-shot Hazard Identification in Autonomous Driving: A Case Study on the COOOL Benchmark

【速读】: 该论文旨在解决自动驾驶中检测和分类标签外危险(out-of-label hazards)的问题,提出了一个综合性的解决方案。其核心任务包括驾驶员反应检测、危险物体识别和危险描述生成。在驾驶员反应检测中,采用了基于核的变化点检测方法,结合边界框和光流动力学分析运动模式。对于危险物体识别,结合了基于邻近度的策略和预训练的视觉Transformer(ViT)模型进行物体分类。最后,在危险描述生成中,使用了MOLMO视觉-语言模型,并通过定制提示生成精确且上下文感知的罕见和低分辨率危险的描述。该方案在COOOL竞赛中显著优于基线方法,相对误差减少了33%,并在32支参赛队伍中排名第二。

链接: https://arxiv.org/abs/2412.19944
作者: Lukas Picek,Vojtěch Čermák,Marek Hanzl
机构: 未知
关键词: COOOL competition, detecting and classifying, autonomous driving, paper presents, presents our submission
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents our submission to the COOOL competition, a novel benchmark for detecting and classifying out-of-label hazards in autonomous driving. Our approach integrates diverse methods across three core tasks: (i) driver reaction detection, (ii) hazard object identification, and (iii) hazard captioning. For driver reaction detection, we propose kernel-based change point detection on bounding-box and optical-flow dynamics to analyze motion patterns. For hazard identification, we combine a naive proximity-based strategy with object classification using a pre-trained ViT model. Finally, for hazard captioning, we use the MOLMO vision-language model with tailored prompts to generate precise and context-aware descriptions of rare and low-resolution hazards. The proposed pipeline outperformed the baseline methods by a large margin, reducing the relative error by 33%, and scored 2nd on the final leaderboard consisting of 32 teams.
zh
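
代码示意:针对驾驶员反应检测,下面给出用 ruptures 库做核变化点检测的一个最小草图:把边界框中心轨迹与光流幅值拼成多维时间序列,用 RBF 核找出运动模式发生突变的时刻;选用 ruptures、RBF 核以及只检测单个变化点都是本文的假设,论文未指明具体实现。

```python
import numpy as np
import ruptures as rpt

def detect_reaction(centers: np.ndarray, flow_mag: np.ndarray) -> list:
    """Kernel change-point detection on bounding-box and optical-flow
    dynamics. `centers` is a (T, 2) array of box centres per frame and
    `flow_mag` a (T,) array of mean optical-flow magnitude; the RBF kernel,
    min_size and the single change point are assumptions."""
    signal = np.column_stack([centers, flow_mag])
    algo = rpt.KernelCPD(kernel="rbf", min_size=5).fit(signal)
    bkps = algo.predict(n_bkps=1)        # last element is always len(signal)
    return bkps[:-1]                     # frame index where the dynamics change

T = 120
centers = np.cumsum(np.random.randn(T, 2) * 0.5, axis=0)
centers[60:] += 15.0                     # abrupt shift mimicking an evasive manoeuvre
flow_mag = np.abs(np.random.randn(T))
print(detect_reaction(centers, flow_mag))   # expected to report a frame near 60
```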

[CV-137] Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models

【速读】: 该论文旨在解决基础模型(foundational models)在3D推理任务中的视角稳定性问题,特别是模型对视角变化的敏感性。论文将不稳定性定义为由于视角的微小变化导致的显著特征变化,进而引发泛化差距。研究通过分析九个基础模型对视角变化的响应,尤其是那些常被忽视的偶然视角(accidental viewpoints),即特定相机方向遮蔽物体真实3D结构的情况,来探讨这一问题。解决方案的关键在于提出了一种仅通过特征表示来识别和分类分布外(OOD)视角、偶然视角和稳定视角的方法,而无需访问实际图像。研究结果表明,尽管基础模型能够一致地编码偶然视角,但由于内在偏差,它们对OOD视角的解释存在差异,有时会导致基于几何相似性的物体误分类。通过在分类、视觉问答(VQA)和3D重建三个下游任务中的定量和定性评估,论文展示了视角不稳定性的影响,并强调了在不同视角条件下特征鲁棒性的重要性。

链接: https://arxiv.org/abs/2412.19920
作者: Mateusz Michalkiewicz,Sheena Bai,Mahsa Baktashmotlagh,Varun Jampani,Guha Balakrishnan
机构: 未知
关键词: significant feature variations, feature variations resulting, variations resulting, resulting from minor, generalization gaps
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages + 3 pages of references. 8 figures, 3 tables

点击查看摘要

Abstract:In this paper, we analyze the viewpoint stability of foundational models - specifically, their sensitivity to changes in viewpoint- and define instability as significant feature variations resulting from minor changes in viewing angle, leading to generalization gaps in 3D reasoning tasks. We investigate nine foundational models, focusing on their responses to viewpoint changes, including the often-overlooked accidental viewpoints where specific camera orientations obscure an object’s true 3D structure. Our methodology enables recognizing and classifying out-of-distribution (OOD), accidental, and stable viewpoints using feature representations alone, without accessing the actual images. Our findings indicate that while foundation models consistently encode accidental viewpoints, they vary in their interpretation of OOD viewpoints due to inherent biases, at times leading to object misclassifications based on geometric resemblance. Through quantitative and qualitative evaluations on three downstream tasks - classification, VQA, and 3D reconstruction - we illustrate the impact of viewpoint instability and underscore the importance of feature robustness across diverse viewing conditions.
zh

[CV-138] Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts

【速读】: 该论文旨在解决场景文本分割(scene text segmentation)中,使用 Segment Anything Model (SAM) 时因提示(prompt)粒度不准确导致的性能不佳问题。具体而言,SAM 在使用词级边界框(word-level bounding box)作为提示时,对字符的分割过于粗糙;而使用字符级边界框(character-level bounding box)作为提示时,则容易出现过度分割(over-segmentation)和欠分割(under-segmentation)的问题。为解决这一问题,论文提出了一种名为 Char-SAM 的自动标注管道,其关键创新在于引入了字符级视觉提示(Character-level visual prompt)。该方案通过字符边界框优化模块(Character Bounding-box Refinement, CBR)生成更精细的字符级边界框提示,并利用字符字形信息(glyph information)作为新的提示,通过字符字形优化模块(Character Glyph Refinement, CGR)引导 SAM 生成更准确的分割掩码,从而有效解决了过度分割和欠分割问题。这一方法充分利用了 SAM 的边界框到掩码(bbox-to-mask)能力,能够自动生成高质量的文本分割标注,且无需额外训练即可从真实世界数据集中生成高质量的场景文本分割数据集。

链接: https://arxiv.org/abs/2412.19917
作者: Enze Xie,Jiaho Lyu,Daiqing Wu,Huawen Shen,Yu Zhou
机构: 未知
关键词: Segment Anything Model, domain-specific segmentation tasks, bounding box, character-level bounding box, recent emergence
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The recent emergence of the Segment Anything Model (SAM) enables various domain-specific segmentation tasks to be tackled cost-effectively by using bounding boxes as prompts. However, in scene text segmentation, SAM cannot achieve desirable performance. Word-level bounding boxes as prompts are too coarse for characters, while character-level bounding boxes as prompts suffer from over-segmentation and under-segmentation issues. In this paper, we propose an automatic annotation pipeline named Char-SAM, which turns SAM into a low-cost segmentation annotator with a character-level visual prompt. Specifically, leveraging some existing text detection datasets with word-level bounding box annotations, we first generate finer-grained character-level bounding box prompts using the Character Bounding-box Refinement (CBR) module. Next, we employ glyph information corresponding to text character categories as a new prompt in the Character Glyph Refinement (CGR) module to guide SAM in producing more accurate segmentation masks, addressing issues of over-segmentation and under-segmentation. These modules fully utilize the bbox-to-mask capability of SAM to generate high-quality text segmentation annotations automatically. Extensive experiments on TextSeg validate the effectiveness of Char-SAM. Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.
zh

[CV-139] Leveraging Scene Geometry and Depth Information for Robust Image Deraining

【速读】: 该论文旨在解决自动驾驶车辆在雨天条件下视觉感知能力受限的问题,特别是通过图像去雨(Image Deraining)技术提升雨天场景的清晰度,从而提高驾驶安全性。现有方法通常采用单一网络架构生成去雨图像,但未能充分利用场景中的先验知识,尤其是忽略了深度信息(Depth Information)对场景几何结构的指导作用。本文提出了一种新颖的多网络学习框架,包括一个用于去雨的自编码器(AutoEncoder)、一个辅助网络以整合深度信息,以及两个监督网络来确保雨天和晴天场景之间的特征一致性。这种多网络设计使模型能够有效捕捉场景的底层结构,生成更清晰、更准确的去雨图像,进而提升自动驾驶车辆的目标检测性能。

链接: https://arxiv.org/abs/2412.19913
作者: Ningning Xu,Jidong J. Yang
机构: 未知
关键词: holds great potential, deraining holds great, Image deraining holds, contributing to safer, safer driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 12 pages, 5 figures, 10 tables

点击查看摘要

Abstract:Image deraining holds great potential for enhancing the vision of autonomous vehicles in rainy conditions, contributing to safer driving. Previous works have primarily focused on employing a single network architecture to generate derained images. However, they often fail to fully exploit the rich prior knowledge embedded in the scenes. Particularly, most methods overlook the depth information that can provide valuable context about scene geometry and guide more robust deraining. In this work, we introduce a novel learning framework that integrates multiple networks: an AutoEncoder for deraining, an auxiliary network to incorporate depth information, and two supervision networks to enforce feature consistency between rainy and clear scenes. This multi-network design enables our model to effectively capture the underlying scene structure, producing clearer and more accurately derained images, leading to improved object detection for autonomous vehicles. Extensive experiments on three widely-used datasets demonstrated the effectiveness of our proposed method.
zh

[CV-140] YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO

【速读】: 该论文旨在解决红外小目标检测算法在复杂背景和目标特征不明显情况下存在的漏检、误报和低精度问题。传统模型驱动方法在处理噪声、目标尺寸和对比度等特征时鲁棒性不足,而现有深度学习方法在关键特征提取和融合方面能力有限。为此,论文提出了一种结合图像超分辨率(super-resolution)技术和多尺度观测的深度学习方法。其关键解决方案包括:首先对输入的红外图像进行超分辨率预处理和多数据增强;其次,基于YOLOv5模型,提出了一种名为YOLO-MST的新深度学习网络,该网络在骨干网络中用自设计的MSFA模块替换SPPF模块,优化了颈部结构,并在预测头中增加了多尺度动态检测头,通过动态融合不同尺度的特征,使检测头能更好地适应复杂场景。该方法在SIRST和IRIS两个公开数据集上的mAP@0.5检测率分别达到96.4%和99.5%,有效提升了检测精度并减少了漏检和误报。

链接: https://arxiv.org/abs/2412.19878
作者: Taoran Yue,Xiaojin Lu,Jiaxi Cai,Yuanping Chen,Shibing Chu
机构: 未知
关键词: small target detection, target detection algorithms, infrared small target, military applications, research globally
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the advancement of aerospace technology and the increasing demands of military applications, the development of low false-alarm and high-precision infrared small target detection algorithms has emerged as a key focus of research globally. However, the traditional model-driven method is not robust enough when dealing with features such as noise, target size, and contrast. The existing deep-learning methods have limited ability to extract and fuse key features, and it is difficult to achieve high-precision detection in complex backgrounds and when target features are not obvious. To solve these problems, this paper proposes a deep-learning infrared small target detection method that combines image super-resolution technology with multi-scale observation. First, the input infrared images are preprocessed with super-resolution and multiple data enhancements are performed. Secondly, based on the YOLOv5 model, we proposed a new deep-learning network named YOLO-MST. This network includes replacing the SPPF module with the self-designed MSFA module in the backbone, optimizing the neck, and finally adding a multi-scale dynamic detection head to the prediction head. By dynamically fusing features from different scales, the detection head can better adapt to complex scenes. The mAP@0.5 detection rates of this method on two public datasets, SIRST and IRIS, reached 96.4% and 99.5% respectively, more effectively solving the problems of missed detection, false alarms, and low precision.
zh

[CV-141] Image Classification with Deep Reinforcement Active Learning

【速读】: 该论文旨在解决在标注数据稀缺的现实场景中,传统主动学习(Active Learning)方法依赖于手工策略,难以适应高度变化的学习环境(如不同数据集和场景)的问题。为此,作者提出了一种基于马尔可夫决策过程(Markov Decision Process, MDP)的自适应主动学习方法。该框架的核心在于结合深度强化学习(Deep Reinforcement Learning)和主动学习,并采用深度确定性策略梯度(Deep Deterministic Policy Gradient, DDPG)来动态调整样本选择策略,使其能够根据专家(oracle)的反馈和学习环境的变化进行自适应优化。通过在三个不同的图像分类基准数据集上的广泛实验,该方法展示了优于现有多种主动学习策略的性能。

链接: https://arxiv.org/abs/2412.19877
作者: Mingyuan Jiu,Xuguang Song,Hichem Sahbi,Shupan Li,Yan Chen,Wei Guo,Lihua Guo,Mingliang Xu
机构: 未知
关键词: large neural networks, reaching outstanding performances, neural networks, Active learning, learning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning is currently reaching outstanding performances on different tasks, including image classification, especially when using large neural networks. The success of these models is attributable to the availability of large collections of labeled training data. In many real-world scenarios, labeled data are scarce, and hand-labeling them is demanding in time, effort, and cost. Active learning is an alternative paradigm that mitigates the effort in hand-labeling data, where only a small fraction is iteratively selected from a large pool of unlabeled data, and annotated by an expert (a.k.a. oracle), and eventually used to update the learning models. However, existing active learning solutions are dependent on handcrafted strategies that may fail in highly variable learning environments (datasets, scenarios, etc). In this work, we devise an adaptive active learning method based on a Markov Decision Process (MDP). Our framework leverages deep reinforcement learning and active learning together with a Deep Deterministic Policy Gradient (DDPG) in order to dynamically adapt sample selection strategies to the oracle’s feedback and the learning environment. Extensive experiments conducted on three different image classification benchmarks show superior performances against several existing active learning strategies.
zh

[CV-142] Neighbor Does Matter: Density-Aware Contrastive Learning for Medical Semi-supervised Segmentation

【速读】: 该论文旨在解决医学图像分析中多器官半监督分割(multi-organ semi-supervised segmentation)面临的标签不足和软组织对比度低等问题。现有方法通常采用伪标签(pseudo-labeling)和一致性正则化(consistency regularization)等技术,但这些方法主要依赖单个数据样本进行训练,忽略了特征空间中的丰富邻域信息。论文提出了一种基于特征空间几何结构的监督信息提取方法,通过密度感知对比学习(Density-Aware Contrastive Learning, DACL)策略,将稀疏区域中的锚定特征推向由高密度正样本近似的聚类中心,从而增强类内紧凑性。具体而言,该方法利用标记和未标记数据样本构建密度感知邻域图(density-aware neighbor graphs)来估计特征密度并定位稀疏区域,并结合标签引导的协同训练(label-guided co-training)与密度引导的几何正则化(density-guided geometric regularization)形成对未标记数据的互补监督。实验结果表明,该方法在多器官分割挑战数据集上优于现有最先进方法,验证了其在医学图像分割任务中的有效性。

链接: https://arxiv.org/abs/2412.19871
作者: Feilong Tang,Zhongxing Xu,Ming Hu,Wenxue Li,Peng Xia,Yiheng Zhong,Hanjun Wu,Jionglong Su,Zongyuan Ge
机构: 未知
关键词: semi-supervised segmentation faces, segmentation faces challenges, soft tissues, semi-supervised segmentation, insufficient labels
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In medical image analysis, multi-organ semi-supervised segmentation faces challenges such as insufficient labels and low contrast in soft tissues. To address these issues, existing studies typically employ semi-supervised segmentation techniques using pseudo-labeling and consistency regularization. However, these methods mainly rely on individual data samples for training, ignoring the rich neighborhood information present in the feature space. In this work, we argue that supervisory information can be directly extracted from the geometry of the feature space. Inspired by the density-based clustering hypothesis, we propose using feature density to locate sparse regions within feature clusters. Our goal is to increase intra-class compactness by addressing sparsity issues. To achieve this, we propose a Density-Aware Contrastive Learning (DACL) strategy, pushing anchored features in sparse regions towards cluster centers approximated by high-density positive samples, resulting in more compact clusters. Specifically, our method constructs density-aware neighbor graphs using labeled and unlabeled data samples to estimate feature density and locate sparse regions. We also combine label-guided co-training with density-guided geometric regularization to form complementary supervision for unlabeled data. Experiments on the Multi-Organ Segmentation Challenge dataset demonstrate that our proposed method outperforms state-of-the-art methods, highlighting its efficacy in medical image segmentation tasks.
zh
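
以下是"密度感知对比学习"核心思想的一个极简示意(与论文 DACL 的具体实现无关,k、温度、稀疏比例等超参均为假设):先用第 k 近邻距离估计特征密度,再把低密度(稀疏)样本拉向其同类高密度样本的均值中心。

```python
import torch
import torch.nn.functional as F

def density_aware_pull_loss(feats, labels, k=5, sparse_ratio=0.3):
    """极简示意:feats [N, D] 为特征,labels [N] 为类别。
    1) 用第 k 近邻距离近似局部密度(距离越大密度越低);
    2) 将每类中密度最低的 sparse_ratio 比例样本(锚点),拉向该类高密度样本的均值中心。"""
    feats = F.normalize(feats, dim=1)
    dist = torch.cdist(feats, feats)                           # [N, N] 两两距离
    knn_dist = dist.topk(k + 1, largest=False).values[:, -1]   # 第 k 近邻距离(排除自身)
    loss, count = feats.new_zeros(()), 0
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() <= k:
            continue
        d = knn_dist[idx]
        n_sparse = max(1, int(sparse_ratio * idx.numel()))
        sparse_idx = idx[d.topk(n_sparse).indices]             # 密度最低的锚点样本
        dense_idx = idx[d.topk(idx.numel() - n_sparse, largest=False).indices]
        center = F.normalize(feats[dense_idx].mean(dim=0), dim=0)   # 高密度正样本中心
        loss = loss + (1 - feats[sparse_idx] @ center).mean()       # 按余弦相似度拉近
        count += 1
    return loss / max(count, 1)

if __name__ == "__main__":
    x = torch.randn(64, 32)
    y = torch.randint(0, 4, (64,))
    print(density_aware_pull_loss(x, y).item())
```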

[CV-143] Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales

【速读】: 该论文旨在解决大规模文本到图像扩散模型(text-to-image diffusion models)在计算和存储成本上的高昂开销问题。尽管这些模型在复杂视觉任务和下游应用中取得了革命性突破,但其极高的计算和存储需求限制了其实际应用。论文提出通过量化(quantization)和快速卷积算法(如Winograd)来降低计算成本和内存带宽使用。然而,现有的粗粒度后训练量化方法在完全量化的Winograd卷积中会导致显著的图像质量损失,且为恢复质量而对Winograd变换矩阵进行微调的成本和复杂性较高,难以适用于大规模基础模型。为此,论文提出了一种细粒度的分组量化(group-wise quantization)方法,并结合仅微调Winograd变换矩阵的尺度参数(scale parameters)的策略,以减少Winograd域中的范围差异。该方法不依赖任何特定领域的训练数据,从而保证了量化扩散模型的泛化性能。实验结果表明,在文本到图像生成任务中,8位完全量化的扩散模型结合Winograd卷积实现了接近无损的质量(FID和CLIP评分),在图像分类任务中,该方法在ResNet18和ResNet-34上的Top-1 ImageNet准确率分别比现有Winograd PTQ方法提高了1.62%和2.56%。

链接: https://arxiv.org/abs/2412.19867
作者: Shuokai Pan,Gerti Tuzi,Sudarshan Sreeram,Dibakar Gope
机构: 未知
关键词: extremely high computational, storage costs limit, Winograd, diffusion models, text-to-image diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, the significant quality loss of fully quantized Winograd using existing coarser-grained post-training quantization methods, combined with the complexity and cost of finetuning the Winograd transformation matrices for such large models to recover quality, makes them unsuitable for large-scale foundation models. Motivated by the presence of a large range of values in them, we investigate the impact of finer-grained group-wise quantization in quantizing diffusion models. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training data, the generalization performance of quantized diffusion models is safely guaranteed. For text-to-image generation task, the 8-bit fully-quantized diffusion model with Winograd provides near-lossless quality (FID and CLIP scores) in comparison to the full-precision model. For image classification, our method outperforms the state-of-the-art Winograd PTQ method by 1.62% and 2.56% in top-1 ImageNet accuracy on ResNet18 and ResNet-34, respectively, with Winograd F(6, 3).
zh
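
下面用 PyTorch 写一个"分组量化 + 可学习尺度"的通用示意(仅说明分组伪量化与每组尺度参数的做法,组大小、位宽等均为假设,与论文针对 Winograd 变换矩阵的具体微调流程无关):

```python
import torch
import torch.nn as nn

class GroupWiseFakeQuant(nn.Module):
    """极简示意:把权重按每 group_size 个元素分为一组,
    每组配一个可学习的尺度参数,做对称伪量化(量化-反量化)。"""
    def __init__(self, weight: torch.Tensor, group_size: int = 64, n_bits: int = 8):
        super().__init__()
        assert weight.numel() % group_size == 0
        self.shape, self.group_size = weight.shape, group_size
        self.qmax = 2 ** (n_bits - 1) - 1                       # 对称量化上界,例如 127
        w = weight.detach().reshape(-1, group_size)
        init_scale = w.abs().max(dim=1, keepdim=True).values / self.qmax
        self.scale = nn.Parameter(init_scale.clamp_min(1e-8))   # 每组一个可学习尺度
        self.register_buffer("w", w)

    def forward(self):
        q = torch.round(self.w / self.scale).clamp(-self.qmax - 1, self.qmax)
        # 把整数值视为常量,梯度只经由 scale 传递(仅微调尺度参数的简化写法)
        return (q.detach() * self.scale).reshape(self.shape)

if __name__ == "__main__":
    w = torch.randn(128, 64)
    fq = GroupWiseFakeQuant(w, group_size=64)
    print((w - fq()).abs().mean().item())   # 平均量化误差
```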

[CV-144] UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

【速读】: 该论文旨在解决通过音频输入生成逼真肖像动画(talking head videos)时,现有方法在面部和头部动态、相机运动、光影效果等多方面控制上的不足。论文提出的解决方案UniAvatar,通过使用FLAME模型将所有运动信息渲染到单一图像上,实现了对3D运动细节的精细控制和像素级控制。此外,该方法还设计了独立的模块来分别管理3D运动和全局光照(global illumination),允许单独或组合控制。通过广泛的实验验证,UniAvatar在广泛运动控制和光照控制方面均优于现有方法。为了增强现有数据集的多样性和环境上下文,论文还收集并计划公开两个数据集,DH-FaceDrasMvVid-100和DH-FaceReliVid-200,这些数据集捕捉了说话时显著的头部运动和各种光照场景。

链接: https://arxiv.org/abs/2412.19860
作者: Wenzhang Sun,Xiang Li,Donglin Di,Zhuding Liang,Qiyuan Zhang,Hao Li,Wei Chen,Jianxun Cui
机构: 未知
关键词: animating portrait images, animating portrait, popular task, audio input, motion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, animating portrait images using audio input is a popular task. Creating lifelike talking head videos requires flexible and natural movements, including facial and head dynamics, camera motion, realistic light and shadow effects. Existing methods struggle to offer comprehensive, multifaceted control over these aspects. In this work, we introduce UniAvatar, a designed method that provides extensive control over a wide range of motion and illumination conditions. Specifically, we use the FLAME model to render all motion information onto a single image, maintaining the integrity of 3D motion details while enabling fine-grained, pixel-level control. Beyond motion, this approach also allows for comprehensive global illumination control. We design independent modules to manage both 3D motion and illumination, permitting separate and combined control. Extensive experiments demonstrate that our method outperforms others in both broad-range motion control and lighting control. Additionally, to enhance the diversity of motion and environmental contexts in current datasets, we collect and plan to publicly release two datasets, DH-FaceDrasMvVid-100 and DH-FaceReliVid-200, which capture significant head movements during speech and various lighting scenarios.
zh

[CV-145] Fusion of Deep Learning and GIS for Advanced Remote Sensing Image Analysis

【速读】: 该论文旨在解决遥感图像分析中的高维度、复杂模式及时序数据处理等挑战,以提升空间数据分析的准确性和效率。其解决方案的关键在于融合深度学习技术(特别是卷积神经网络(CNNs)和长短期记忆网络(LSTM))与地理信息系统(GIS),并通过优化算法(如粒子群优化(PSO)和遗传算法(GA))对模型参数进行微调。这一框架显著提高了分类准确率(从78%提升至92%),降低了预测误差(从12%降至6%),并提升了时序数据的准确性(从75%提升至88%),从而有效监测动态变化。此外,GIS的集成不仅丰富了空间分析,还深化了对地理特征间关系的理解。该研究表明,结合先进的深度学习方法、GIS及优化策略,能够显著推动遥感应用的发展,为环境监测、城市规划和资源管理等领域提供新的研究路径。

链接: https://arxiv.org/abs/2412.19856
作者: Sajjad Afroosheh,Mohammadreza Askari
机构: 未知
关键词: Geographic Information Systems, Convolutional Neural Networks, specifically Convolutional Neural, Long Short-Term Memory, Information Systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:This paper presents an innovative framework for remote sensing image analysis by fusing deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, with Geographic Information Systems (GIS). The primary objective is to enhance the accuracy and efficiency of spatial data analysis by overcoming challenges associated with high dimensionality, complex patterns, and temporal data processing. We implemented optimization algorithms, namely Particle Swarm Optimization (PSO) and Genetic Algorithms (GA), to fine-tune model parameters, resulting in improved performance metrics. Our findings reveal a significant increase in classification accuracy from 78% to 92% and a reduction in prediction error from 12% to 6% after optimization. Additionally, the temporal accuracy of the models improved from 75% to 88%, showcasing the frameworks capability to monitor dynamic changes effectively. The integration of GIS not only enriched the spatial analysis but also facilitated a deeper understanding of the relationships between geographical features. This research demonstrates that combining advanced deep learning methods with GIS and optimization strategies can significantly advance remote sensing applications, paving the way for future developments in environmental monitoring, urban planning, and resource management.
zh
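
给出一个"CNN 提取空间特征、LSTM 建模时序"的最小组合示意(波段数、时间步数、类别数均为假设;PSO/GA 调参与 GIS 集成不在此示意范围内):

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """极简示意:对形如 [B, T, C, H, W] 的多时相遥感影像块,
    先用共享 CNN 提取每个时相的空间特征,再用 LSTM 聚合时序信息后分类。"""
    def __init__(self, in_ch=4, feat_dim=128, hidden=64, num_classes=6):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: [B, T, C, H, W]
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1))            # [B*T, feat_dim]
        f = f.view(b, t, -1)
        out, _ = self.lstm(f)                     # [B, T, hidden]
        return self.fc(out[:, -1])                # 取最后时间步做分类

if __name__ == "__main__":
    model = CNNLSTMClassifier()
    x = torch.randn(2, 5, 4, 32, 32)             # 2 个样本、5 个时相、4 个波段(假设)
    print(model(x).shape)                         # torch.Size([2, 6])
```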

[CV-146] Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation

【速读】: 该论文旨在解决图像生成过程中内容保真度(content fidelity)与艺术风格(artistic style)之间的平衡问题。传统风格迁移方法和现代去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPMs)在尝试实现这一平衡时,往往难以兼顾风格与内容,甚至可能同时牺牲两者。论文通过分析DDPM在保持内容与风格平衡方面的能力,提出了一种新方法,通过识别DDPM注意力层(attention layers)中的敏感性,确定与不同风格特征相对应的特定层。通过仅将这些条件输入定向到这些敏感层,该方法实现了对风格与内容的精细控制,显著减少了因过度约束输入而产生的问题。研究结果表明,该方法通过更好地对齐风格与内容,提升了现有风格化技术的效果,从而提高了生成视觉内容的质量。

链接: https://arxiv.org/abs/2412.19853
作者: Nadav Z. Cohen,Oron Nir,Ariel Shamir
机构: 未知
关键词: Balancing content fidelity, Diffusion Probabilistic Models, Denoising Diffusion Probabilistic, modern Denoising Diffusion, Balancing content
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both. This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.
zh
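
下面用一个玩具化的层堆叠示意"只把风格条件注入部分(敏感)注意力层"的控制方式(层编号、维度均为假设,不代表论文对 DDPM 注意力层敏感度的实际分析结果):

```python
import torch
import torch.nn as nn

class ToyCrossAttnBlock(nn.Module):
    """极简注意力块:提供条件向量时做 cross-attention,否则退化为 self-attention。"""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, cond=None):
        ctx = cond if cond is not None else x
        out, _ = self.attn(x, ctx, ctx)
        return self.norm(x + out)

class SelectiveConditioning(nn.Module):
    """极简示意:仅在 style_layers 指定的层注入风格条件,其余层保持无条件,
    从而把风格控制限制在"敏感层",减少对内容的过度约束。"""
    def __init__(self, n_layers=6, dim=64, style_layers=(2, 3)):
        super().__init__()
        self.blocks = nn.ModuleList(ToyCrossAttnBlock(dim) for _ in range(n_layers))
        self.style_layers = set(style_layers)

    def forward(self, x, style_cond):
        for i, blk in enumerate(self.blocks):
            x = blk(x, style_cond if i in self.style_layers else None)
        return x

if __name__ == "__main__":
    x = torch.randn(1, 16, 64)           # [B, token 数, dim]
    style = torch.randn(1, 4, 64)         # 风格条件 token(假设)
    print(SelectiveConditioning()(x, style).shape)   # torch.Size([1, 16, 64])
```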

[CV-147] 3D Face Reconstruction With Geometry Details From a Single Color Image Under Occluded Scenes

【速读】: 该论文旨在解决现有3D人脸重建技术在多重遮挡场景下泛化能力不足的问题。现有的深度人脸重建方法通常专注于生成逼真的纹理,但在处理多重遮挡(如头发、手掌和眼镜等)时表现不佳。论文提出的解决方案关键在于引入了凹凸贴图(bump mapping)技术,以在粗糙的3D人脸模型上添加中层细节,并创新性地考虑了遮挡场景。通过构建一个统一的框架,该方法能够同时处理多种类型的遮挡,从而在遮挡场景下从捕获的人脸图像中生成具有几何细节的高质量重建结果。

链接: https://arxiv.org/abs/2412.19849
作者: Dapeng Zhao,Yue Qi
机构: 未知
关键词: stereo model naturally, reconstruction technology aims, face stereo model, naturally and realistically, face reconstruction technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2412.18920

点击查看摘要

Abstract:3D face reconstruction technology aims to generate a face stereo model naturally and realistically. Previous deep face reconstruction approaches are typically designed to generate convincing textures and cannot generalize well to multiple occluded scenarios simultaneously. By introducing bump mapping, we successfully added mid-level details to coarse 3D faces. More innovatively, our method takes into account occlusion scenarios. Thus on top of common 3D face reconstruction approaches, we in this paper propose a unified framework to handle multiple types of obstruction simultaneously (e.g., hair, palms and glasses et al.). Extensive experiments and comparisons demonstrate that our method can generate high-quality reconstruction results with geometry details from captured facial images under occluded scenes.
zh

[CV-148] Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction

【速读】: 该论文致力于解决单视角3D人脸重建(Single-view 3D face reconstruction)中的一个关键问题,即在“野外”(in-the-wild)条件下自动去除眼镜并生成逼真的3D人脸。现有方法通常假设输入图像中的人脸未被遮挡,因此在处理戴眼镜的人脸时效果不佳。论文提出的解决方案的核心在于通过深度学习架构,从单张2D图像中直接回归出3D人脸几何的3DMM(3D Morphable Model)表示,并创新性地引入了一个鲁棒的眼镜区域识别与智能去除过程。具体而言,该方法首先估计眼镜区域的合理位置,并基于此构建3D纹理,确保输出结果的真实性,特别是眼睛、鼻子和嘴之间的拓扑结构。此外,论文还展示了如何将相关的人脸解析任务整合到框架中,以进一步提升重建质量。通过大量实验,该方法在现有3D人脸重建任务中展现了优于现有方法的调控能力。

链接: https://arxiv.org/abs/2412.19848
作者: Dapeng Zhao,Yue Qi
机构: 未知
关键词: fundamental Computer Vision, Computer Vision problem, Computer Vision, fundamental Computer, Vision problem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2412.18920

点击查看摘要

Abstract:Single-view 3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the input is unobstructed faces which makes their method not suitable for in-the-wild conditions. We present a method for performing a 3D face that removes eyeglasses from a single image. Existing facial reconstruction methods fail to remove eyeglasses automatically for generating a photo-realistic 3D face “in-the-wild”. The innovation of our method lies in a process for identifying the eyeglasses area robustly and removing it intelligently. In this work, we estimate the 2D face structure of the reasonable position of the eyeglasses area, which is used for the construction of 3D texture. An excellent anti-eyeglasses face reconstruction method should ensure the authenticity of the output, including the topological structure between the eyes, nose, and mouth. We achieve this via a deep learning architecture that performs direct regression of a 3DMM representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related face parsing task can be incorporated into the proposed framework and help improve reconstruction quality. We conduct extensive experiments on existing 3D face reconstruction tasks as concrete examples to demonstrate the method’s superior regulation ability over existing methods, which often break down.
zh

[CV-149] Symbolic Disentangled Representations for Images

【速读】: 该论文旨在解决高维潜在空间中生成因子(generative factors)的分离问题,即如何在高维向量表示中确定每个坐标对应的生成因子,从而实现对象属性的可控和可解释编辑。论文提出的解决方案是ArSyD(Architecture for Symbolic Disentanglement),其关键创新在于将每个生成因子表示为与最终表示相同维度的向量,并通过生成因子向量表示的叠加来获得对象表示。这种方法基于超维计算(Hyperdimensional Computing)原理,其中符号被表示为超向量(hypervectors),并允许对其进行向量操作。ArSyD通过构造实现分离,无需在训练过程中对底层分布做出额外假设,且仅以弱监督方式训练模型进行图像重建。该方法在dSprites和CLEVR数据集上进行了验证,并提出了新的分离度量标准,允许比较使用不同维度潜在表示的方法。

链接: https://arxiv.org/abs/2412.19847
作者: Alexandr Korchemnyi,Alexey K. Kovalev,Aleksandr I. Panov
机构: 未知
关键词: generative factor, generative factor vector, factor vector representations, Vector Symbolic Architectures, representation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 14 figures

点击查看摘要

Abstract:The idea of disentangled representations is to reduce the data to a set of generative factors that produce it. Typically, such representations are vectors in latent space, where each coordinate corresponds to one of the generative factors. The object can then be modified by changing the value of a particular coordinate, but it is necessary to determine which coordinate corresponds to the desired generative factor – a difficult task if the vector representation has a high dimension. In this article, we propose ArSyD (Architecture for Symbolic Disentanglement), which represents each generative factor as a vector of the same dimension as the resulting representation. In ArSyD, the object representation is obtained as a superposition of the generative factor vector representations. We call such a representation a symbolic disentangled representation. We use the principles of Hyperdimensional Computing (also known as Vector Symbolic Architectures), where symbols are represented as hypervectors, allowing vector operations on them. Disentanglement is achieved by construction, no additional assumptions about the underlying distributions are made during training, and the model is only trained to reconstruct images in a weakly supervised manner. We study ArSyD on the dSprites and CLEVR datasets and provide a comprehensive analysis of the learned symbolic disentangled representations. We also propose new disentanglement metrics that allow comparison of methods using latent representations of different dimensions. ArSyD allows to edit the object properties in a controlled and interpretable way, and the dimensionality of the object property representation coincides with the dimensionality of the object representation itself.
zh
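
下面给出超维计算(HDC/VSA)中"绑定 + 叠加"表示与按相似度解码的最小示意(超向量维度、各因子取值个数均为假设,与 ArSyD 的训练细节无关),演示如何用同维超向量叠加多个生成因子并逐一还原:

```python
import torch

torch.manual_seed(0)
D = 10_000                                      # 超向量维度(假设)
FACTORS = {"shape": 3, "color": 4, "size": 2}   # 各生成因子的取值个数(假设)

# 为每个因子准备一个"角色"超向量,以及每个取值的"码本"超向量(±1 双极随机向量)
rand_hv = lambda n: torch.randint(0, 2, (n, D)).float() * 2 - 1
roles = {k: rand_hv(1)[0] for k in FACTORS}
codebooks = {k: rand_hv(n) for k, n in FACTORS.items()}

def encode(assignment):
    """绑定(逐元素相乘)每个 角色⊗取值,再叠加(求和)为单个对象超向量。"""
    return sum(roles[k] * codebooks[k][v] for k, v in assignment.items())

def decode(obj_hv, factor):
    """解绑后与该因子的码本做相似度匹配,返回最相似的取值下标。"""
    unbound = obj_hv * roles[factor]             # ±1 向量的逐元素绑定是自逆的
    sims = codebooks[factor] @ unbound
    return int(sims.argmax())

obj = encode({"shape": 2, "color": 1, "size": 0})
print(decode(obj, "shape"), decode(obj, "color"), decode(obj, "size"))   # 期望输出: 2 1 0
```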

[CV-150] A Review of Latent Representation Models in Neuroimaging

【速读】: 该论文旨在解决神经影像数据(如MRI或PET)的高维复杂性及其在脑结构和功能研究中的应用难题。为解决这一问题,论文提出了使用潜在表示模型(如自编码器、生成对抗网络和潜在扩散模型)来将高维神经影像数据降维至低维潜在空间,从而识别与脑功能相关的关键模式和变化。通过建模这些潜在空间,研究人员能够深入理解脑的生物学功能,包括其结构随年龄或疾病的变化、感官信息的编码以及对新输入的预测和适应。这些模型不仅为疾病诊断和进展监测等临床应用提供了有力工具,还为探索脑的基本机制(如主动推理和预测编码)提供了新的视角,进而推动对认知、感知和神经障碍的深入理解。

链接: https://arxiv.org/abs/2412.19844
作者: C. Vázquez-García,F. J. Martínez-Murcia,F. Segovia Román,Juan M. Górriz
机构: 未知
关键词: MRI or PET, Generative Adversarial Networks, techniques like MRI, Latent Diffusion Models, offer rich
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 4 figures

点击查看摘要

Abstract:Neuroimaging data, particularly from techniques like MRI or PET, offer rich but complex information about brain structure and activity. To manage this complexity, latent representation models - such as Autoencoders, Generative Adversarial Networks (GANs), and Latent Diffusion Models (LDMs) - are increasingly applied. These models are designed to reduce high-dimensional neuroimaging data to lower-dimensional latent spaces, where key patterns and variations related to brain function can be identified. By modeling these latent spaces, researchers hope to gain insights into the biology and function of the brain, including how its structure changes with age or disease, or how it encodes sensory information, predicts and adapts to new inputs. This review discusses how these models are used for clinical applications, like disease diagnosis and progression monitoring, but also for exploring fundamental brain mechanisms such as active inference and predictive coding. These approaches provide a powerful tool for both understanding and simulating the brain’s complex computational tasks, potentially advancing our knowledge of cognition, perception, and neural disorders.
zh

[CV-151] Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network

【速读】: 该论文旨在解决多模态交通流联合预测中存在的两个主要问题:一是现有研究主要集中在单一交通模式的预测,而对不同交通模式的联合预测研究相对有限;二是现有的多模态交通联合建模方法在时空特征提取方面缺乏灵活性。为解决这些问题,论文提出了一种名为“基于图稀疏注意力机制和双向时间卷积网络(GSABT)”的方法。其关键解决方案包括:首先,通过自注意力权重加权的多模态图来捕捉空间局部特征,并利用Top-U稀疏注意力机制获取空间全局特征;其次,采用双向时间卷积网络增强输出与输入数据之间的时间特征相关性,并通过共享-独立模块提取模态间和模态内的时间特征;最后,设计了一个可在时空维度上灵活扩展的多模态联合预测框架。实验结果表明,该模型在三个真实数据集上均实现了最先进的预测性能。

链接: https://arxiv.org/abs/2412.19842
作者: Dongran Zhang,Jiangnan Yan,Kemal Polat,Adi Alhudhaif,Jun Li
机构: 未知
关键词: urban transportation systems, flow prediction plays, Sparse Attention Mechanism, plays a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traffic flow prediction plays a crucial role in the management and operation of urban transportation systems. While extensive research has been conducted on predictions for individual transportation modes, there is relatively limited research on joint prediction across different transportation modes. Furthermore, existing multimodal traffic joint modeling methods often lack flexibility in spatial-temporal feature extraction. To address these issues, we propose a method called Graph Sparse Attention Mechanism with Bidirectional Temporal Convolutional Network (GSABT) for multimodal traffic spatial-temporal joint prediction. First, we use a multimodal graph multiplied by self-attention weights to capture spatial local features, and then employ the Top-U sparse attention mechanism to obtain spatial global features. Second, we utilize a bidirectional temporal convolutional network to enhance the temporal feature correlation between the output and input data, and extract inter-modal and intra-modal temporal features through the share-unique module. Finally, we have designed a multimodal joint prediction framework that can be flexibly extended to both spatial and temporal dimensions. Extensive experiments conducted on three real datasets indicate that the proposed model consistently achieves state-of-the-art predictive performance.
zh
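
下面是"Top-U 稀疏注意力"这一通用做法的极简实现示意(U 值、维度均为假设;不含论文中的多模态图加权与双向 TCN 部分):每个查询只保留得分最高的 U 个键,其余位置置为负无穷后再做 softmax。

```python
import torch
import torch.nn.functional as F

def top_u_sparse_attention(q, k, v, u=8):
    """q, k, v: [B, N, D]。每个查询仅对得分最高的 u 个键做注意力(其余被屏蔽)。"""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # [B, N, N]
    u = min(u, scores.shape[-1])
    topk = scores.topk(u, dim=-1)
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(2, topk.indices, topk.values)               # 仅保留 Top-U 得分
    attn = F.softmax(mask, dim=-1)
    return attn @ v

if __name__ == "__main__":
    B, N, D = 2, 64, 32              # 例如 N 个空间节点(假设)
    q, k, v = (torch.randn(B, N, D) for _ in range(3))
    print(top_u_sparse_attention(q, k, v, u=8).shape)         # torch.Size([2, 64, 32])
```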

[CV-152] FlameGS: Reconstruct flame light field via Gaussian Splatting

【速读】: 该论文旨在解决传统ART(代数重建技术)算法在火焰燃烧诊断中耗时且计算密集的问题。为解决这一问题,作者提出了一种基于火焰模拟技术的新型火焰表示方法。其关键解决方案包括对火焰发光过程进行建模,并利用二维投影图像进行监督。实验验证表明,该方法在实际图像与预测二维投影之间的平均结构相似性指数(SSIM)达到0.96,峰值信噪比(PSNR)为39.05,同时相比传统算法节省了约34倍的计算时间和10倍的内存资源。

链接: https://arxiv.org/abs/2412.19841
作者: Yunhao Shui,Fuhao Zhang,Can Gao,Hao Xue,Zhiyin Ma,Gang Xun,Xuesong Li
机构: 未知
关键词: computationally intensive issues, flame combustion diagnosis, flame simulation technology, traditional ART algorithms, combustion diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:To address the time-consuming and computationally intensive issues of traditional ART algorithms for flame combustion diagnosis, inspired by flame simulation technology, we propose a novel representation method for flames. By modeling the luminous process of flames and utilizing 2D projection images for supervision, our experimental validation shows that this model achieves an average structural similarity index of 0.96 between actual images and predicted 2D projections, along with a Peak Signal-to-Noise Ratio of 39.05. Additionally, it saves approximately 34 times the computation time and about 10 times the memory compared to traditional algorithms.
zh

[CV-153] ERPA: Efficient RPA Model Integrating OCR and LLM s for Intelligent Document Processing

【速读】: 该论文旨在解决传统机器人流程自动化(RPA)在处理大量文档时面临的性能限制问题,特别是在移民工作流程中的身份数据提取和光学字符识别(OCR)任务中。传统RPA解决方案在处理模糊字符和复杂结构时效率低下,导致提取的文本准确性和清晰度不足。论文提出的解决方案ERPA(Enhanced Robotic Process Automation)通过引入大语言模型(LLMs)来优化文本提取的准确性和清晰度,有效处理模糊字符和复杂文档结构。实验结果表明,ERPA在处理时间上显著优于UiPath和Automation Anywhere等主流平台,身份数据提取时间缩短了94%,仅需9.94秒。这一创新方案为文档自动化提供了更快、更可靠的替代方案,具有革命性潜力。

链接: https://arxiv.org/abs/2412.19840
作者: Osama Abdellaif,Abdelrahman Nader,Ali Hamdi
机构: 未知
关键词: innovative Robotic Process, Optical Character Recognition, Robotic Process Automation, optimize Optical Character, Robotic Process
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 6 pages , 2 figures, 1 algorithm

点击查看摘要

Abstract:This paper presents ERPA, an innovative Robotic Process Automation (RPA) model designed to enhance ID data extraction and optimize Optical Character Recognition (OCR) tasks within immigration workflows. Traditional RPA solutions often face performance limitations when processing large volumes of documents, leading to inefficiencies. ERPA addresses these challenges by incorporating Large Language Models (LLMs) to improve the accuracy and clarity of extracted text, effectively handling ambiguous characters and complex structures. Benchmark comparisons with leading platforms like UiPath and Automation Anywhere demonstrate that ERPA significantly reduces processing times by up to 94 percent, completing ID data extraction in just 9.94 seconds. These findings highlight ERPA’s potential to revolutionize document automation, offering a faster and more reliable alternative to current RPA solutions.
zh

[CV-154] Multi-View Fusion Neural Network for Traffic Demand Prediction

【速读】: 该论文旨在解决交通研究中时空特征提取的两个关键问题:固定空间图(fixed spatial graph)限制了相似但不直接连接的节点空间特征的提取,而统一的时间建模机制(unified temporal modeling mechanism)忽略了不同节点时间变化的异质性。为解决这些问题,论文提出了一种多视图融合神经网络(MVFN)方法。其关键解决方案包括:通过图卷积网络(GCN)提取空间局部特征,利用余弦重加权线性注意力机制(CLA)提取空间全局特征,并将两者结合形成图-余弦模块(GCM)以提取整体空间特征;同时,采用多通道可分离时间卷积网络(MSTCN),通过多通道时间卷积网络(MTCN)提取统一时间特征,并通过可分离时间卷积网络(STCN)提取独立时间特征。最终,将时空特征数据输入预测层以获得最终结果。该方法在两个交通需求数据集上验证,取得了最佳的预测精度。

链接: https://arxiv.org/abs/2412.19839
作者: Dongran Zhang,Jun Li
机构: 未知
关键词: current studies typically, unified temporal modeling, temporal convolutional network, temporal modeling mechanism, fixed spatial graph
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The extraction of spatial-temporal features is a crucial research in transportation studies, and current studies typically use a unified temporal modeling mechanism and fixed spatial graph for this purpose. However, the fixed spatial graph restricts the extraction of spatial features for similar but not directly connected nodes, while the unified temporal modeling mechanism overlooks the heterogeneity of temporal variation of different nodes. To address these challenges, a multi-view fusion neural network (MVFN) approach is proposed. In this approach, spatial local features are extracted through the use of a graph convolutional network (GCN), and spatial global features are extracted using a cosine re-weighting linear attention mechanism (CLA). The GCN and CLA are combined to create a graph-cosine module (GCM) for the extraction of overall spatial features. Additionally, the multi-channel separable temporal convolutional network (MSTCN) makes use of a multi-channel temporal convolutional network (MTCN) at each layer to extract unified temporal features, and a separable temporal convolutional network (STCN) to extract independent temporal features. Finally, the spatial-temporal feature data is input into the prediction layer to obtain the final result. The model has been validated on two traffic demand datasets and achieved the best prediction accuracy.
zh
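
下面给出用图卷积(GCN)提取空间局部特征的标准最小实现示意(对称归一化邻接矩阵 + 线性变换;节点数、维度为假设,余弦重加权线性注意力与 MSTCN 部分不在此示意内):

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """极简 GCN 层:H' = ReLU( D^{-1/2} (A+I) D^{-1/2} H W )。"""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                    # x: [N, in_dim], adj: [N, N]
        a_hat = adj + torch.eye(adj.shape[0], device=adj.device)    # 加自环
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = deg.clamp_min(1e-8).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]  # 对称归一化
        return torch.relu(a_norm @ self.linear(x))

if __name__ == "__main__":
    n_nodes, in_dim = 16, 8                        # 假设 16 个交通节点
    x = torch.randn(n_nodes, in_dim)
    adj = (torch.rand(n_nodes, n_nodes) > 0.7).float()
    adj = ((adj + adj.t()) > 0).float()            # 对称化
    print(SimpleGCNLayer(in_dim, 32)(x, adj).shape)   # torch.Size([16, 32])
```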

[CV-155] RoboSignature: Robust Signature and Watermarking on Network Attacks

【速读】: 该论文旨在解决生成式模型(Generative Models)在生成图像时嵌入水印(watermarking)的脆弱性问题。具体而言,现有的水印方法(如Stable Signature)通过在潜在扩散模型(Latent Diffusion Models, LDMs)的解码器中嵌入独特水印来标识生成图像的来源。然而,论文揭示了一种新型的对抗性微调攻击(adversarial fine-tuning attack),能够破坏模型嵌入水印的能力,暴露了现有水印方法的显著漏洞。为解决这一问题,论文提出了一种抗篡改的微调算法(tamper-resistant fine-tuning algorithm),该算法借鉴了大型语言模型(large language models)中的方法,并针对LDMs的水印需求进行了定制。这一解决方案的关键在于增强水印嵌入的鲁棒性,以抵御潜在的对抗性攻击,从而确保生成图像的可追溯性和安全性。

链接: https://arxiv.org/abs/2412.19834
作者: Aryaman Shaan,Garvit Banga,Raghav Mantri
机构: 未知
关键词: enabled easy creation, single prompt, enabled easy, easy creation, creation and generation
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative models have enabled easy creation and generation of images of all kinds given a single prompt. However, this has also raised ethical concerns about what is an actual piece of content created by humans or cameras compared to model-generated content like images or videos. Watermarking data generated by modern generative models is a popular method to provide information on the source of the content. The goal is for all generated images to conceal an invisible watermark, allowing for future detection or identification. The Stable Signature finetunes the decoder of Latent Diffusion Models such that a unique watermark is rooted in any image produced by the decoder. In this paper, we present a novel adversarial fine-tuning attack that disrupts the model’s ability to embed the intended watermark, exposing a significant vulnerability in existing watermarking methods. To address this, we further propose a tamper-resistant fine-tuning algorithm inspired by methods developed for large language models, tailored to the specific requirements of watermarking in LDMs. Our findings emphasize the importance of anticipating and defending against potential vulnerabilities in generative systems.
zh

[CV-156] Multi-atlas Ensemble Graph Neural Network Model For Major Depressive Disorder Detection Using Functional MRI Data

【速读】: 该论文旨在解决重度抑郁症(Major Depressive Disorder, MDD)的诊断问题,特别是通过神经影像技术识别与MDD相关的脑网络特征。当前MDD的诊断主要依赖于临床观察和患者自述症状,忽视了其多样化的病理生理机制。为此,论文提出了一种基于图神经网络(Graph Neural Networks, GNNs)的集成模型,用于从静息态功能磁共振成像(rest-state functional MRI, rs-fMRI)数据中提取判别性特征,以提高MDD的诊断准确性。解决方案的关键在于结合多个脑区分割图谱的特征,构建集成模型,以更全面地捕捉脑网络的复杂性,并显著优于单一图谱模型。该模型在大型多站点MDD数据集上的表现验证了其有效性,最佳模型的准确率为75.80%,敏感性为88.89%,特异性为61.84%,精确率为71.29%,F1得分为79.12%。

链接: https://arxiv.org/abs/2412.19833
作者: Nojod M. Alotaibi,Areej M. Alhothali,Manar S. Ali
机构: 未知
关键词: Major depressive disorder, common mental disorders, Major depressive, common mental, quality of life
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures, 10 tables

点击查看摘要

Abstract:Major depressive disorder (MDD) is one of the most common mental disorders, with significant impacts on many daily activities and quality of life. It stands as one of the most common mental disorders globally and ranks as the second leading cause of disability. The current diagnostic approach for MDD primarily relies on clinical observations and patient-reported symptoms, overlooking the diverse underlying causes and pathophysiological factors contributing to depression. Therefore, scientific researchers and clinicians must gain a deeper understanding of the pathophysiological mechanisms involved in MDD. There is growing evidence in neuroscience that depression is a brain network disorder, and the use of neuroimaging, such as magnetic resonance imaging (MRI), plays a significant role in identifying and treating MDD. Rest-state functional MRI (rs-fMRI) is among the most popular neuroimaging techniques used to study MDD. Deep learning techniques have been widely applied to neuroimaging data to help with early mental health disorder detection. Recent years have seen a rise in interest in graph neural networks (GNNs), which are deep neural architectures specifically designed to handle graph-structured data like rs-fMRI. This research aimed to develop an ensemble-based GNN model capable of detecting discriminative features from rs-fMRI images for the purpose of diagnosing MDD. Specifically, we constructed an ensemble model by combining features from multiple brain region segmentation atlases to capture brain complexity and detect distinct features more accurately than single atlas-based models. Further, the effectiveness of our model is demonstrated by assessing its performance on a large multi-site MDD dataset. The best performing model among all folds achieved an accuracy of 75.80%, a sensitivity of 88.89%, a specificity of 61.84%, a precision of 71.29%, and an F1-score of 79.12%.
zh

[CV-157] Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing NEURIPS2024

【速读】: 该论文旨在解决当前视觉大语言模型(Vision Large Language Models, LLMs)在多模态通用性方面面临的挑战,包括粗粒度的实例级理解、缺乏对图像和视频的统一支持,以及在各种视觉任务中的覆盖不足。为解决这些问题,论文提出了VITRON,一种通用的像素级视觉大语言模型,旨在实现对静态图像和动态视频的全面理解、生成、分割和编辑。VITRON的关键解决方案包括:1)在LLM骨干网络基础上,整合了图像、视频和像素级区域视觉的编码器;2)采用最先进的视觉专家作为后端模块,支持从低级到高级的多种视觉任务;3)提出了一种新颖的混合方法,通过同时集成离散的文本指令和连续的信号嵌入,确保从LLM到后端模块的有效和精确消息传递;4)设计了多种像素级时空视觉-语言对齐学习,以提升细粒度视觉能力;5)引入了跨任务协同模块,最大化任务不变的细粒度视觉特征,增强不同视觉任务之间的协同效应。通过这些创新,VITRON在12个视觉任务和22个数据集上展示了其在四大视觉任务集群中的广泛能力,凸显了开发更统一的多模态通用模型的巨大潜力。

链接: https://arxiv.org/abs/2412.19806
作者: Hao Fei,Shengqiong Wu,Hanwang Zhang,Tat-Seng Chua,Shuicheng Yan
机构: 未知
关键词: large language models, coarse-grained instance-level understanding, Recent developments, vision large language, language models
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Recent developments of vision large language models (LLMs) have seen remarkable progress, yet still encounter challenges towards multimodal generalists, such as coarse-grained instance-level understanding, lack of unified support for both images and videos, and insufficient coverage across various vision tasks. In this paper, we present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Building on top of an LLM backbone, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its frontend modules, while employing state-of-the-art visual specialists as its backend, via which VITRON supports a spectrum of vision end tasks, spanning visual comprehension to visual generation, from low level to high level. To ensure an effective and precise message passing from LLM to backend modules for function invocation, we propose a novel hybrid method by simultaneously integrating discrete textual instructions and continuous signal embeddings. Further, we design various pixel-level spatiotemporal vision-language alignment learning for VITRON to reach the best fine-grained visual capability. Finally, a cross-task synergy module is advised to learn to maximize the task-invariant fine-grained visual features, enhancing the synergy between different visual tasks. Demonstrated over 12 visual tasks and evaluated across 22 datasets, VITRON showcases its extensive capabilities in the four main vision task clusters. Overall, this work illuminates the great potential of developing a more unified multimodal generalist. Project homepage: this https URL
zh

[CV-158] Fine-Tuning TransMorph with Gradient Correlation for Anatomical Alignment

【速读】: 该论文旨在解决脑部 MRI 图像配准(brain MRI registration)中依赖解剖标签的问题,并提升配准的解剖学准确性和变形平滑度。解决方案的关键在于对预训练的 TransMorph 模型进行微调,通过引入 FAdam 优化器(FAdam optimizer)提高收敛稳定性,并在相似性度量中加入梯度相关性(gradient correlation)以确保结构变化的一致性,从而改善解剖对齐效果。实验结果表明,该方法在 Dice 和 HdDist95 评分上略有提升,并在归一化变形向量(NDV)上显著降低,验证了梯度相关性在实现平滑且结构一致的变形中的有效性。

链接: https://arxiv.org/abs/2412.20822
作者: Lukas Förner,Kartikay Tehlan,Thomas Wendler
机构: 未知
关键词: Unsupervised deep learning, anatomically accurate transformations, achieving anatomically accurate, Unsupervised deep, brain MRI registration
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised deep learning is a promising method in brain MRI registration to reduce the reliance on anatomical labels, while still achieving anatomically accurate transformations. For the Learn2Reg2024 LUMIR challenge, we propose fine-tuning of the pre-trained TransMorph model to improve the convergence stability as well as the deformation smoothness. The former is achieved through the FAdam optimizer, and consistency in structural changes is incorporated through the addition of gradient correlation in the similarity measure, improving anatomical alignment. The results show slight improvements in the Dice and HdDist95 scores, and a notable reduction in the NDV compared to the baseline TransMorph model. These are also confirmed by inspecting the boundaries of the tissue. Our proposed method highlights the effectiveness of including Gradient Correlation to achieve smoother and structurally consistent deformations for interpatient brain MRI registration.
zh
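
下面是"梯度相关(gradient correlation)"相似性度量的一个常见实现示意(以 2D 图像、有限差分梯度为例;论文针对 3D 脑 MRI 配准,具体梯度算子与归一化方式可能不同,此处仅为假设性草图):

```python
import torch

def gradient_correlation(fixed, warped, eps=1e-8):
    """fixed/warped: [B, 1, H, W]。分别计算 x、y 方向梯度的归一化互相关(NCC)并取平均。
    返回值越接近 1,说明两幅图像的结构(边缘方向)越一致。"""
    def grads(img):
        gx = img[..., :, 1:] - img[..., :, :-1]   # x 方向有限差分
        gy = img[..., 1:, :] - img[..., :-1, :]   # y 方向有限差分
        return gx, gy

    def ncc(a, b):
        a = a - a.mean(dim=(-2, -1), keepdim=True)
        b = b - b.mean(dim=(-2, -1), keepdim=True)
        num = (a * b).sum(dim=(-2, -1))
        den = (a.pow(2).sum(dim=(-2, -1)) * b.pow(2).sum(dim=(-2, -1))).sqrt()
        return num / (den + eps)

    fx, fy = grads(fixed)
    wx, wy = grads(warped)
    return 0.5 * (ncc(fx, wx) + ncc(fy, wy)).mean()

if __name__ == "__main__":
    fixed = torch.rand(2, 1, 64, 64)
    warped = fixed + 0.05 * torch.randn_like(fixed)   # 轻微扰动的"配准结果"
    gc = gradient_correlation(fixed, warped)
    loss = 1 - gc                                      # 作为相似性项加入配准损失(示意)
    print(gc.item(), loss.item())
```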

[CV-159] Residual Connection Networks in Medical Image Processing: Exploration of ResUnet Model Driven by Human Computer Interaction

【速读】: 该论文旨在解决脑肿瘤在医学影像中准确识别和定位的挑战,主要由于肿瘤的变异性和结构复杂性。为解决这一问题,论文提出了ResUnet++,一种结合了ResNet和Unet++的先进混合模型。该模型的关键在于在降采样和上采样阶段均集成了残差块(residual blocks),以确保关键图像特征的保留。此外,ResUnet++通过引入人机交互(HCI)原则,提供了直观的实时反馈,使临床医生能够有效可视化和交互肿瘤定位结果,从而促进临床决策的准确性和工作流程的效率。通过在LGG分割数据集上的评估,ResUnet++取得了98.17%的Jaccard Loss,展示了其在分割性能上的优势及其在实际应用中的潜力。

链接: https://arxiv.org/abs/2412.20709
作者: Peixin Dai,Jingsi Zhang,Zhitao Shu
机构: 未知
关键词: remain challenging due, images remain challenging, Accurate identification, medical images remain, Convolutional Neural Networks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate identification and localisation of brain tumours from medical images remain challenging due to tumour variability and structural complexity. Convolutional Neural Networks (CNNs), particularly ResNet and Unet, have made significant progress in medical image processing, offering robust capabilities for image segmentation. However, limited research has explored their integration with human-computer interaction (HCI) to enhance usability, interpretability, and clinical applicability. This paper introduces ResUnet++, an advanced hybrid model combining ResNet and Unet++, designed to improve tumour detection and localisation while fostering seamless interaction between clinicians and medical imaging systems. ResUnet++ integrates residual blocks in both the downsampling and upsampling phases, ensuring critical image features are preserved. By incorporating HCI principles, the model provides intuitive, real-time feedback, enabling clinicians to visualise and interact with tumour localisation results effectively. This fosters informed decision-making and supports workflow efficiency in clinical settings. We evaluated ResUnet++ on the LGG Segmentation Dataset, achieving a Jaccard Loss of 98.17%. The results demonstrate its strong segmentation performance and potential for real-world applications. By bridging advanced medical imaging techniques with HCI, ResUnet++ offers a foundation for developing interactive diagnostic tools, improving clinician trust, decision accuracy, and patient outcomes, and advancing the integration of AI in healthcare workflows.
zh
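
下面给出 ResUnet++ 这类结构中常见的"残差卷积块"的通用实现示意(通道数为假设;编码器/解码器的完整拼装与 HCI 交互部分不在此示意内):

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """两层 3x3 卷积 + BN + ReLU,并加上 1x1 投影的跳跃连接,
    用于在下采样/上采样阶段保留关键特征(通用写法,非论文官方实现)。"""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

if __name__ == "__main__":
    block = ResidualConvBlock(64, 128)
    x = torch.randn(1, 64, 56, 56)
    print(block(x).shape)   # torch.Size([1, 128, 56, 56])
```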

[CV-160] Conformable Convolution for Topologically Aware Learning of Complex Anatomical Structures

【速读】: 该论文旨在解决医学图像分析中深度学习模型难以准确捕捉复杂生物结构的拓扑一致性问题。传统深度学习方法依赖于数据的隐式学习,往往无法有效保持像素级细薄但关键结构的连通性和连续性,从而影响分析结果的可靠性和临床决策。为解决这一问题,论文提出了一种新型卷积层——Conformable Convolution,其核心在于通过自适应核偏移(adaptive kernel offsets)优先关注图像中具有高拓扑显著性的区域。这一过程由拓扑后验生成器(Topological Posterior Generator, TPG)模块引导,该模块利用持久同调(persistent homology)识别关键拓扑特征,并将特征图转换为立方体复形(cubical complexes)以指导卷积层。该框架具有架构无关性,可无缝集成到多种网络架构中,并在分割任务中有效保持了结构的拓扑一致性,实验结果在多个数据集上验证了其定量和定性上的优越性。

链接: https://arxiv.org/abs/2412.20608
作者: Yousef Yeganeh,Rui Xiao,Goktug Guvercin,Nassir Navab,Azade Farshad
机构: 未知
关键词: conventional computer vision, computer vision emphasizes, vision emphasizes pixel-level, necessitates explicit representation, intricate biological structures
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While conventional computer vision emphasizes pixel-level and feature-based objectives, medical image analysis of intricate biological structures necessitates explicit representation of their complex topological properties. Despite their successes, deep learning models often struggle to accurately capture the connectivity and continuity of fine, sometimes pixel-thin, yet critical structures due to their reliance on implicit learning from data. Such shortcomings can significantly impact the reliability of analysis results and hinder clinical decision-making. To address this challenge, we introduce Conformable Convolution, a novel convolutional layer designed to explicitly enforce topological consistency. Conformable Convolution learns adaptive kernel offsets that preferentially focus on regions of high topological significance within an image. This prioritization is guided by our proposed Topological Posterior Generator (TPG) module, which leverages persistent homology. The TPG module identifies key topological features and guides the convolutional layers by applying persistent homology to feature maps transformed into cubical complexes. Our proposed modules are architecture-agnostic, enabling them to be integrated seamlessly into various architectures. We showcase the effectiveness of our framework in the segmentation task, where preserving the interconnectedness of structures is critical. Experimental results on three diverse datasets demonstrate that our framework effectively preserves the topology in the segmentation downstream task, both quantitatively and qualitatively.
zh

[CV-161] Segmentation of Muscularis Propria in Colon Histopathology Images Using Vision Transformers for Hirschsprungs Disease

【速读】: 该论文旨在解决先天性巨结肠症(Hirschsprung’s disease, HD)诊断中结肠肌层(muscularis propria)组织病理学图像分析的自动化问题,特别是针对肌层神经丛(myenteric plexus)区域的神经节细胞(ganglion cells)的定量评估。传统方法依赖于病理学家的手动分析,存在耗时、成本高以及观察者间和观察者内变异性等问题。论文提出使用视觉变换器(Vision Transformers, ViTs)作为深度学习方法来自动化这一过程,并与卷积神经网络(Convolutional Neural Networks, CNNs)和浅层学习方法(如k-means聚类)进行性能对比。关键解决方案在于利用ViTs的自注意力机制(self-attention)进行肌层分割,实验结果表明ViTs在DICE得分(89.9%)和神经丛包含率(Plexus Inclusion Rate, PIR, 100%)上均优于CNN和k-means聚类方法,证明了其在HD相关图像分析中的潜力。

链接: https://arxiv.org/abs/2412.20571
作者: Youssef Megahed,Anthony Fuller,Saleh Abou-Alwan,Dina El Demellawy,Adrian D. C. Chan
机构: 未知
关键词: congenital birth defect, birth defect diagnosed, Hirschsprung disease, myenteric plexus regions, colon muscularis propria
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in the CMBEC47/ACCES26 Joint Conference

点击查看摘要

Abstract:Hirschsprung’s disease (HD) is a congenital birth defect diagnosed by identifying the lack of ganglion cells within the colon’s muscularis propria, specifically within the myenteric plexus regions. There may be advantages for quantitative assessments of histopathology images of the colon, such as counting the ganglion and assessing their spatial distribution; however, this would be time-intensive for pathologists, costly, and subject to inter- and intra-rater variability. Previous research has demonstrated the potential for deep learning approaches to automate histopathology image analysis, including segmentation of the muscularis propria using convolutional neural networks (CNNs). Recently, Vision Transformers (ViTs) have emerged as a powerful deep learning approach due to their self-attention. This study explores the application of ViTs for muscularis propria segmentation in calretinin-stained histopathology images and compares their performance to CNNs and shallow learning methods. The ViT model achieved a DICE score of 89.9% and Plexus Inclusion Rate (PIR) of 100%, surpassing the CNN (DICE score of 89.2%; PIR of 96.0%) and k-means clustering method (DICE score of 80.7%; PIR 77.4%). Results assert that ViTs are a promising tool for advancing HD-related image analysis.
zh
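
文中用于评估分割质量的 DICE 系数计算本身很简单,下面给出二值掩膜情形下的参考实现示意(阈值与平滑项为常见的假设取法):

```python
import torch

def dice_score(pred, target, threshold=0.5, eps=1e-6):
    """pred: [B, H, W] 的预测概率图,target: [B, H, W] 的 0/1 真值掩膜。
    DICE = 2|X∩Y| / (|X|+|Y|),逐样本计算后取平均。"""
    pred_bin = (pred > threshold).float()
    target = target.float()
    inter = (pred_bin * target).sum(dim=(-2, -1))
    denom = pred_bin.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return ((2 * inter + eps) / (denom + eps)).mean()

if __name__ == "__main__":
    pred = torch.rand(4, 128, 128)
    target = (torch.rand(4, 128, 128) > 0.5).float()
    print(dice_score(pred, target).item())
```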

[CV-162] Unlocking adaptive digital pathology through dynamic feature learning

【速读】: 该论文旨在解决当前数字病理学(digital pathology)中基础模型(foundation models)在临床应用中的灵活性和病理相关性不足的问题。尽管这些模型通过通用特征模拟了真实世界的病理实践,并实现了关键组织学模式的定量分析和癌症特异性信号的解析,但其静态的通用特征限制了其在不断变化的临床需求中的适应性。为此,论文提出了PathFiT,一种动态特征学习方法,能够无缝集成到各种病理基础模型中,以提升其适应性和跨应用的通用性。PathFiT的关键在于其动态特征学习机制,使其能够在不同病理应用场景中实现高效且灵活的部署。通过构建包含超过20TB的互联网和真实世界数据的数字病理学基准,论文验证了PathFiT在35项任务中的34项上实现了最先进的性能,特别是在23项任务中表现显著提升,并在特殊成像任务中平均提升了10.20%。PathFiT的卓越性能和多功能性为计算病理学开辟了新的研究方向。

链接: https://arxiv.org/abs/2412.20430
作者: Jiawen Li,Tian Guan,Qingxin Xia,Yizhi Wang,Xitong Ling,Jing Li,Qiang Huang,Zihan Wang,Zhiyuan Shen,Yifei Ma,Zimo Zhao,Zhe Lei,Tiandong Chen,Junbo Tan,Xueqian Wang,Xiu-Wu Bian,Zhe Wang,Lingchuan Guo,Chao He,Yonghong He
机构: 未知
关键词: critical histological patterns, leverage general-purpose features, pathology foundation models, real-world pathological practices, Foundation models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 49 pages, 14 figures

点击查看摘要

Abstract:Foundation models have revolutionized the paradigm of digital pathology, as they leverage general-purpose features to emulate real-world pathological practices, enabling the quantitative analysis of critical histological patterns and the dissection of cancer-specific signals. However, these static general features constrain the flexibility and pathological relevance in the ever-evolving needs of clinical applications, hindering the broad use of the current models. Here we introduce PathFiT, a dynamic feature learning method that can be effortlessly plugged into various pathology foundation models to unlock their adaptability. Meanwhile, PathFiT performs seamless implementation across diverse pathology applications regardless of downstream specificity. To validate PathFiT, we construct a digital pathology benchmark with over 20 terabytes of Internet and real-world data comprising 28 H&E-stained tasks and 7 specialized imaging tasks including Masson’s Trichrome staining and immunofluorescence images. By applying PathFiT to the representative pathology foundation models, we demonstrate state-of-the-art performance on 34 out of 35 tasks, with significant improvements on 23 tasks and outperforming by 10.20% on specialized imaging tasks. The superior performance and versatility of PathFiT open up new avenues in computational pathology.
zh

[CV-163] Enhancing Transfer Learning for Medical Image Classification with SMOTE: A Comparative Study

【速读】: 该论文旨在解决多标签图像分类(Multilabel Image Classification)在医学影像中的应用问题,特别是针对脑肿瘤分类和糖尿病视网膜病变分期的检测。由于领域特定的挑战,如数据不平衡(class imbalance),迁移学习(Transfer Learning, TL)在糖尿病视网膜病变检测中的表现受到限制。论文的关键解决方案是结合合成少数类过采样技术(Synthetic Minority Over-sampling Technique, SMOTE)与迁移学习和传统机器学习方法,以缓解数据不平衡问题。实验结果表明,这种结合方法显著提高了分类性能,准确率提升了1.97%,召回率(灵敏度)提升了5.43%,特异性提升了0.72%。这一研究强调了在医学影像分析中结合迁移学习与重采样技术的重要性,为提升分类准确性和可靠性提供了有效途径,同时无需额外的计算资源。

链接: https://arxiv.org/abs/2412.20235
作者: Md. Zehan Alam,Tonmoy Roy,H.M. Nahid Kawsar,Iffat Rimi
机构: 未知
关键词: application of Transfer, Transfer Learning, Brain Tumor MRI, brain tumor, diabetic retinopathy stage
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in 27th International Conference on Computer and Information Technology (ICCIT) 2024

点击查看摘要

Abstract:This paper explores and enhances the application of Transfer Learning (TL) for multilabel image classification in medical imaging, focusing on brain tumor class and diabetic retinopathy stage detection. The effectiveness of TL (using pre-trained models on the ImageNet dataset) varies due to domain-specific challenges. We evaluate five pre-trained models (MobileNet, Xception, InceptionV3, ResNet50, and DenseNet201) on two datasets: Brain Tumor MRI and APTOS 2019. Our results show that TL models excel in brain tumor classification, achieving near-optimal metrics. However, performance in diabetic retinopathy detection is hindered by class imbalance. To mitigate this, we integrate the Synthetic Minority Over-sampling Technique (SMOTE) with TL and traditional machine learning (ML) methods, which improves accuracy by 1.97%, recall (sensitivity) by 5.43%, and specificity by 0.72%. These findings underscore the need for combining TL with resampling techniques and ML methods to address data imbalance and enhance classification performance, offering a pathway to more accurate and reliable medical image analysis and improved patient outcomes with minimal extra computation powers.
zh
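
下面给出"在(预训练模型提取的)特征上用 SMOTE 过采样后再训练分类器"的最小流程示意(使用 scikit-learn 与 imbalanced-learn 的公开 API;此处用随机特征代替真实的 CNN 特征,仅演示流程,与论文的具体模型和数据无关):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 构造一个类别不平衡的玩具特征集(假设这些特征来自预训练 CNN 的倒数第二层)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (900, 64)), rng.normal(1.5, 1, (100, 64))])
y = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# 仅对训练集做 SMOTE 过采样,避免信息泄漏到测试集
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print("过采样前/后训练样本数:", len(y_tr), "->", len(y_res))

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```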

[CV-164] Self-Calibrated Dual Contrasting for Annotation-Efficient Bacteria Raman Spectroscopy Clustering and Classification

【速读】: 该论文旨在解决基于深度神经网络的拉曼光谱(Raman Spectroscopy, RS)识别方法在病原菌诊断中需要大量标注数据的问题,这一问题导致了高劳动成本。论文提出了一种新颖的标注高效的自校准双对比(Self-Calibrated Dual Contrasting, SCDC)方法,其核心在于从两个不同的子空间(嵌入空间和类别空间)对光谱进行表示。嵌入空间捕捉实例级信息,而类别空间反映类别级信息。通过双对比学习方法,该方法能够在无监督或半监督学习条件下获得具有判别性的光谱表示。此外,引入的自校准机制进一步增强了模型的鲁棒性。实验验证表明,SCDC方法在仅有少量(5%或10%)或无标注数据的情况下,仍能实现稳健的识别性能,展示了其在标注高效的临床生物光谱识别中的潜力。

链接: https://arxiv.org/abs/2412.20060
作者: Haiming Yao,Wei Luo,Tao Zhou,Ang Gao,Xue Wang
机构: 未知
关键词: unique molecular fingerprint, pathogenic bacteria diagnosis, molecular vibration spectroscopy, molecular fingerprint information, molecular vibration
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Raman scattering is based on molecular vibration spectroscopy and provides a powerful technology for pathogenic bacteria diagnosis using the unique molecular fingerprint information of a substance. The integration of deep learning technology has significantly improved the efficiency and accuracy of intelligent Raman spectroscopy (RS) recognition. However, the current RS recognition methods based on deep neural networks still require the annotation of a large amount of spectral data, which is labor-intensive. This paper presents a novel annotation-efficient Self-Calibrated Dual Contrasting (SCDC) method for RS recognition that operates effectively with few or no annotation. Our core motivation is to represent the spectrum from two different perspectives in two distinct subspaces: embedding and category. The embedding perspective captures instance-level information, while the category perspective reflects category-level information. Accordingly, we have implemented a dual contrastive learning approach from two perspectives to obtain discriminative representations, which are applicable for Raman spectroscopy recognition under both unsupervised and semi-supervised learning conditions. Furthermore, a self-calibration mechanism is proposed to enhance robustness. Validation of the identification task on three large-scale bacterial Raman spectroscopy datasets demonstrates that our SCDC method achieves robust recognition performance with very few (5% or 10%) or no annotations, highlighting the potential of the proposed method for biospectral identification in annotation-efficient clinical scenarios.
zh

[CV-165] Uncertainty Quantified Deep Learning and Regression Analysis Framework for Image Segmentation of Skin Cancer Lesions ICML

【速读】: 该论文旨在解决深度学习模型(DLMs)在医学图像分割中处理未见过的图像时面临的挑战,特别是缺乏对其分割机制(如Dice系数和性能置信度)的反馈问题。为了解决这一问题,论文提出了两种深度学习模型,一种是从头训练的模型,另一种基于迁移学习,并结合蒙特卡洛 dropout 或贝叶斯反向传播(Bayes-by-backprop)的不确定性估计方法,用于从公开的ISIC-19皮肤镜图像数据库中分割病变区域。关键创新在于首次提出了一种计算单张皮肤镜图像中多个临床区域像素级不确定性估计的方法,并生成了图像级的不确定性地图,展示了DLM分割不准确与特定皮肤组织区域高不确定性之间的对应关系。此外,论文还首次提出了四种新的线性回归模型,能够利用常数和不确定性度量(单独或组合)预测DLM分割的Dice性能,适用于低计算资源的不确定性估计工作流程。这些方法有助于增强临床医生对DLM预测的信任,并优化其在临床诊断和预后中的应用。

链接: https://arxiv.org/abs/2412.20007
作者: Elhoucine Elfatimi,Pratik Shah
机构: 未知
关键词: frequently achieve accurate, Deep learning models, Deep learning, achieve accurate segmentation, frequently achieve
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at the 2024 IEEE International Conference on Machine Learning and Applications (ICMLA), accepted for publication and in press by IEEE

点击查看摘要

Abstract:Deep learning models (DLMs) frequently achieve accurate segmentation and classification of tumors from medical images. However, DLMs lacking feedback on their image segmentation mechanisms, such as Dice coefficients and confidence in their performance, face challenges when processing previously unseen images in real-world clinical settings. Uncertainty estimates to identify DLM predictions at the cellular or single-pixel level that require clinician review can enhance trust. However, their deployment requires significant computational resources. This study reports two DLMs, one trained from scratch and another based on transfer learning, with Monte Carlo dropout or Bayes-by-backprop uncertainty estimations to segment lesions from the publicly available The International Skin Imaging Collaboration-19 dermoscopy image database with cancerous lesions. A novel approach to compute pixel-by-pixel uncertainty estimations of DLM segmentation performance in multiple clinical regions from a single dermoscopy image with corresponding Dice scores is reported for the first time. Image-level uncertainty maps demonstrated correspondence between imperfect DLM segmentation and high uncertainty levels in specific skin tissue regions, with or without lesions. Four new linear regression models that can predict the Dice performance of DLM segmentation using constants and uncertainty measures, either individually or in combination from lesions, tissue structures, and non-tissue pixel regions critical for clinical diagnosis and prognostication in skin images (Spearman’s correlation, p < 0.05), are reported for the first time for low-compute uncertainty estimation workflows.
zh
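
下面是蒙特卡洛 Dropout 不确定性估计的通用做法示意:推理阶段保持 Dropout 开启,做 T 次前向,取均值作预测、标准差作逐像素不确定性(网络结构为玩具化假设,并非论文所用模型):

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """玩具分割网络,仅用于演示 MC Dropout。"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(0.5),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))

def mc_dropout_predict(model, x, t=20):
    """推理时仅把 Dropout 层切回 train 模式,重复 t 次前向,
    返回逐像素均值(预测)与标准差(不确定性)。"""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(t)])   # [t, B, 1, H, W]
    return preds.mean(dim=0), preds.std(dim=0)

if __name__ == "__main__":
    model = TinySegNet()
    x = torch.randn(1, 3, 64, 64)
    mean, unc = mc_dropout_predict(model, x)
    print(mean.shape, unc.shape, unc.max().item())
```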

[CV-166] SegKAN: High-Resolution Medical Image Segmentation with Long-Distance Dependencies

【速读】: 该论文旨在解决计算机断层扫描(CT)中肝血管图像因碎片化和噪声干扰导致的血管完整性难以保持及分割困难的问题。为解决这一问题,作者提出了一种创新模型SegKAN。其关键解决方案包括:首先,通过采用一种新颖的卷积网络结构改进传统的嵌入模块,以平滑图像噪声并防止后续阶段的梯度爆炸问题;其次,将Patch块之间的空间关系转化为时间关系,以解决传统Vision Transformer模型中难以捕捉Patch块间位置关系的问题。实验结果表明,该模型在肝血管数据集上的Dice评分较现有最先进模型提升了1.78%,有效提升了高分辨率扩展对象的分割性能。

链接: https://arxiv.org/abs/2412.19990
作者: Shengbo Tan,Rundong Xue,Shipeng Luo,Zeyu Zhang,Xinran Wang,Lei Zhang,Daji Ergu,Zhang Yi,Yang Zhao,Ying Cai
机构: 未知
关键词: computed tomography scans, posing significant challenges, maintain vessel integrity, making it difficult, computed tomography
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hepatic vessels in computed tomography scans often suffer from image fragmentation and noise interference, making it difficult to maintain vessel integrity and posing significant challenges for vessel segmentation. To address this issue, we propose an innovative model: SegKAN. First, we improve the conventional embedding module by adopting a novel convolutional network structure for image embedding, which smooths out image noise and prevents issues such as gradient explosion in subsequent stages. Next, we transform the spatial relationships between Patch blocks into temporal relationships to solve the problem of capturing positional relationships between Patch blocks in traditional Vision Transformer models. We conducted experiments on a Hepatic vessel dataset, and compared to the existing state-of-the-art model, the Dice score improved by 1.78%. These results demonstrate that the proposed new structure effectively enhances the segmentation performance of high-resolution extended objects. Code will be available at this https URL
zh

[CV-167] Quantum Implicit Neural Compression

【速读】: 该论文旨在解决基于隐式神经表示(Implicit Neural Representation, INR)的信号压缩技术在处理高分辨率信号时高频细节精度显著下降的问题。传统INR方法在低分辨率信号上能够实现高质量重建,但在小模型下对高频细节的还原能力不足。为此,论文提出了一种量子隐式神经表示(quantum INR, quINR)方法,利用量子神经网络(Quantum Neural Networks)的指数级丰富表达能力来提升数据压缩效率。通过在一些基准数据集上的评估,quINR在图像压缩中的率失真性能(rate-distortion performance)相较于传统编解码器和经典INR方法有显著提升,最高可达1.2dB的增益。解决方案的关键在于引入量子神经网络,以增强模型的表达能力,从而更有效地捕捉和压缩高频细节信息。

链接: https://arxiv.org/abs/2412.19828
作者: Takuya Fujihashi,Toshiaki Koike-Akino
机构: 未知
关键词: represent multimedia signals, implicit neural representation, Signal compression based, number of bits, based on implicit
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantum Algebra (math.QA)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Signal compression based on implicit neural representation (INR) is an emerging technique to represent multimedia signals with a small number of bits. While INR-based signal compression achieves high-quality reconstruction for relatively low-resolution signals, the accuracy of high-frequency details is significantly degraded with a small model. To improve the compression efficiency of INR, we introduce quantum INR (quINR), which leverages the exponentially rich expressivity of quantum neural networks for data compression. Evaluations using some benchmark datasets show that the proposed quINR-based compression could improve rate-distortion performance in image compression compared with traditional codecs and classic INR-based coding methods, up to 1.2dB gain.
zh

人工智能

[AI-0] Adversarial Attack and Defense for LoRa Device Identification and Authentication via Deep Learning

链接: https://arxiv.org/abs/2412.21164
作者: Yalin E. Sagduyu,Tugba Erpek
关键词: Internet of Things, Low-Power Wide-Area Network, communications in Internet, LoRa networks, Wide-Area Network
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:LoRa provides long-range, energy-efficient communications in Internet of Things (IoT) applications that rely on Low-Power Wide-Area Network (LPWAN) capabilities. Despite these merits, concerns persist regarding the security of LoRa networks, especially in situations where device identification and authentication are imperative to secure the reliable access to the LoRa networks. This paper explores a deep learning (DL) approach to tackle these concerns, focusing on two critical tasks, namely (i) identifying LoRa devices and (ii) classifying them to legitimate and rogue devices. Deep neural networks (DNNs), encompassing both convolutional and feedforward neural networks, are trained for these tasks using actual LoRa signal data. In this setting, the adversaries may spoof rogue LoRa signals through the kernel density estimation (KDE) method based on legitimate device signals that are received by the adversaries. Two cases are considered, (i) training two separate classifiers, one for each of the two tasks, and (ii) training a multi-task classifier for both tasks. The vulnerabilities of the resulting DNNs to manipulations in input samples are studied in form of untargeted and targeted adversarial attacks using the Fast Gradient Sign Method (FGSM). Individual and common perturbations are considered against single-task and multi-task classifiers for the LoRa signal analysis. To provide resilience against such attacks, a defense approach is presented by increasing the robustness of classifiers with adversarial training. Results quantify how vulnerable LoRa signal classification tasks are to adversarial attacks and emphasize the need to fortify IoT applications against these subtle yet effective threats.
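摘要中用到的 FGSM 攻击可以用一个极小的例子说明:对输入沿损失梯度的符号方向加上幅度为 epsilon 的扰动。下面的 NumPy 草图在一个手写的逻辑回归分类器上计算无目标 FGSM 扰动;其中的权重、样本与 epsilon 取值都是示意性假设,并非论文针对 LoRa 信号的真实模型或数据。

```python
import numpy as np

rng = np.random.default_rng(0)

# 示意数据:二分类,x 代表某种信号特征向量
w = rng.normal(size=8)          # 已训练好的逻辑回归权重(假设)
b = 0.1
x = rng.normal(size=8)          # 一个合法样本
y = 1.0                         # 其真实标签

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad_x(x, y, w, b):
    """交叉熵损失及其对输入 x(而非参数)的梯度。"""
    p = sigmoid(w @ x + b)
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad_x = (p - y) * w        # dL/dx
    return loss, grad_x

# FGSM:x_adv = x + epsilon * sign(dL/dx),epsilon 为扰动强度(假设值)
epsilon = 0.2
loss_clean, grad_x = loss_and_grad_x(x, y, w, b)
x_adv = x + epsilon * np.sign(grad_x)
loss_adv, _ = loss_and_grad_x(x_adv, y, w, b)

print(f"原始损失: {loss_clean:.4f}  对抗样本损失: {loss_adv:.4f}")
```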

[AI-1] Open RAN-Enabled Deep Learning-Assisted Mobility Management for Connected Vehicles

链接: https://arxiv.org/abs/2412.21161
作者: Maria Barbosa,Kelvin Dias
关键词: Intelligent Transportation System, enhance Intelligent Transportation, Connected Vehicles, Transportation System, Intelligent Transportation
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: Accepted for publication in ICOIN 2025

点击查看摘要

Abstract:Connected Vehicles (CVs) can leverage the unique features of 5G and future 6G/NextG networks to enhance Intelligent Transportation System (ITS) services. However, even with advancements in cellular network generations, CV applications may experience communication interruptions in high-mobility scenarios due to frequent changes of serving base station, also known as handovers (HOs). This paper proposes the adoption of Open Radio Access Network (Open RAN/O-RAN) and deep learning models for decision-making to prevent Quality of Service (QoS) degradation due to HOs and to ensure the timely connectivity needed for CV services. The solution utilizes the O-RAN Software Community (OSC), an open-source O-RAN platform developed by the collaboration between the O-RAN Alliance and Linux Foundation, to develop xApps that are executed in the near-Real-Time RIC of OSC. To demonstrate the proposal’s effectiveness, an integrated framework combining the OMNeT++ simulator and OSC was created. Evaluations used real-world datasets in urban application scenarios, such as video streaming transmission and over-the-air (OTA) updates. Results indicate that the proposal achieved superior performance and reduced latency compared to the standard 3GPP HO procedure.

[AI-2] PyG-SSL: A Graph Self-Supervised Learning Toolkit

链接: https://arxiv.org/abs/2412.21151
作者: Lecheng Zheng,Baoyu Jing,Zihao Li,Zhichen Zeng,Tianxin Wei,Mengting Ai,Xinrui He,Lihui Liu,Dongqi Fu,Jiaxuan You,Hanghang Tong,Jingrui He
关键词: graph SSL, Graph Self-Supervised Learning, SSL, Graph SSL toolkit, graph SSL models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Self-Supervised Learning (SSL) has emerged as a pivotal area of research in recent years. By engaging in pretext tasks to learn the intricate topological structures and properties of graphs using unlabeled data, these graph SSL models achieve enhanced performance, improved generalization, and heightened robustness. Despite the remarkable achievements of these graph SSL methods, their current implementation poses significant challenges for beginners and practitioners due to the complex nature of graph structures, inconsistent evaluation metrics, and concerns regarding reproducibility, all of which hinder further progress in this field. Recognizing the growing interest within the research community, there is an urgent need for a comprehensive, beginner-friendly, and accessible toolkit consisting of the most representative graph SSL algorithms. To address these challenges, we present a Graph SSL toolkit named PyG-SSL, which is built upon PyTorch and is compatible with various deep learning and scientific computing backends. Within the toolkit, we offer a unified framework encompassing dataset loading, hyper-parameter configuration, model training, and comprehensive performance evaluation for diverse downstream tasks. Moreover, we provide beginner-friendly tutorials and the best hyper-parameters of each graph SSL algorithm on different graph datasets, facilitating the reproduction of results. The GitHub repository of the library is this https URL.

[AI-3] On Parallel External-Memory Bidirectional Search

链接: https://arxiv.org/abs/2412.21104
作者: Lior Siag,Shahaf S. Shperberg,Ariel Felner,Nathan R. Sturtevant
关键词: Parallelization and External, External Memory, solving large-scale problems, techniques have significantly, PEM
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, includes conference paper and appendix

点击查看摘要

Abstract:Parallelization and External Memory (PEM) techniques have significantly enhanced the capabilities of search algorithms when solving large-scale problems. Previous research on PEM has primarily centered on unidirectional algorithms, with only one publication on bidirectional PEM that focuses on the meet-in-the-middle (MM) algorithm. Building upon this foundation, this paper presents a framework that integrates both uni- and bi-directional best-first search algorithms into this framework. We then develop a PEM variant of the state-of-the-art bidirectional heuristic search (BiHS) algorithm BAE* (PEM-BAE*). As previous work on BiHS did not focus on scaling problem sizes, this work enables us to evaluate bidirectional algorithms on hard problems. Empirical evaluation shows that PEM-BAE* outperforms the PEM variants of A* and the MM algorithm, as well as a parallel variant of IDA*. These findings mark a significant milestone, revealing that bidirectional search algorithms clearly outperform unidirectional search algorithms across several domains, even when equipped with state-of-the-art heuristics.
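为帮助理解"双向搜索在中间相遇"这一基本思想,下面给出一个内存内的双向一致代价搜索(bidirectional uniform-cost search)小示例:两端各维护一个优先队列,当某个节点在另一方向也被到达时更新当前最优解。这只是示意性草图,不包含论文中的并行、外存与启发式(BAE*)机制;图数据为假设。

```python
import heapq

def bidirectional_ucs(graph, start, goal):
    """双向一致代价搜索:graph[u] = [(v, w), ...],假设为无向图。"""
    dist = [{start: 0.0}, {goal: 0.0}]          # 0: 前向, 1: 后向
    frontier = [[(0.0, start)], [(0.0, goal)]]
    closed = [set(), set()]
    best = float("inf")

    while frontier[0] and frontier[1]:
        # 终止条件:两端队首代价之和已不小于当前最优解
        if frontier[0][0][0] + frontier[1][0][0] >= best:
            break
        # 选择队首代价较小的一侧扩展
        d = 0 if frontier[0][0][0] <= frontier[1][0][0] else 1
        g, u = heapq.heappop(frontier[d])
        if u in closed[d]:
            continue
        closed[d].add(u)
        # 若另一方向也到达过 u,则拼接出一条完整路径并更新最优解
        if u in dist[1 - d]:
            best = min(best, g + dist[1 - d][u])
        for v, w in graph.get(u, []):
            ng = g + w
            if ng < dist[d].get(v, float("inf")):
                dist[d][v] = ng
                heapq.heappush(frontier[d], (ng, v))
    return best

# 示意无向图(每条边双向写入)
edges = [("S", "A", 1), ("A", "B", 2), ("B", "G", 1), ("S", "C", 4), ("C", "G", 4)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))

print("最短路代价:", bidirectional_ucs(graph, "S", "G"))  # 期望输出 4
```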

[AI-4] Towards Effective Discrimination Testing for Generative AI

链接: https://arxiv.org/abs/2412.21052
作者: Thomas P. Zollo,Nikita Rajaneesh,Richard Zemel,Talia B. Gillis,Emily Black
关键词: models present, Generative, discriminatory behavior, regulatory goals, GenAI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 38 pages, 9 tables, 8 figures

点击查看摘要

Abstract:Generative AI (GenAI) models present new challenges in regulating against discriminatory behavior. In this paper, we argue that GenAI fairness research still has not met these challenges; instead, a significant gap remains between existing bias assessment methods and regulatory goals. This leads to ineffective regulation that can allow deployment of reportedly fair, yet actually discriminatory, GenAI systems. Towards remedying this problem, we connect the legal and technical literature around GenAI bias evaluation and identify areas of misalignment. Through four case studies, we demonstrate how this misalignment between fairness testing techniques and regulatory goals can result in discriminatory outcomes in real-world deployments, especially in adaptive or complex environments. We offer practical recommendations for improving discrimination testing to better align with regulatory goals and enhance the reliability of fairness assessments in future deployments.

[AI-5] Toward Intelligent and Secure Cloud: Large Language Model Empowered Proactive Defense

链接: https://arxiv.org/abs/2412.21051
作者: Yuyang Zhou,Guang Cheng,Kang Du,Zihan Chen
关键词: cloud computing technologies, increasing number, number of benefits, daily lives, rapid evolution
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 7 pages; In submission

点击查看摘要

Abstract:The rapid evolution of cloud computing technologies and the increasing number of cloud applications have provided a large number of benefits in daily lives. However, the diversity and complexity of different components pose a significant challenge to cloud security, especially when dealing with sophisticated and advanced cyberattacks. Recent advancements in generative foundation models (GFMs), particularly in the large language models (LLMs), offer promising solutions for security intelligence. By exploiting the powerful abilities in language understanding, data analysis, task inference, action planning, and code generation, we present LLM-PD, a novel proactive defense architecture that defeats various threats in a proactive manner. LLM-PD can efficiently make a decision through comprehensive data analysis and sequential reasoning, as well as dynamically creating and deploying actionable defense mechanisms on the target cloud. Furthermore, it can flexibly self-evolve based on experience learned from previous interactions and adapt to new attack scenarios without additional training. The experimental results demonstrate its remarkable ability in terms of defense effectiveness and efficiency, particularly highlighting an outstanding success rate when compared with other existing methods.

[AI-6] LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency

链接: https://arxiv.org/abs/2412.21001
作者: Xiao-Yin Liu,Guotao Li,Xiao-Hu Zhou,Zeng-Guang Hou
关键词: preference-based reinforcement learning, Offline preference-based reinforcement, reinforcement learning, overcome the challenges, challenges of designing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing reward and the high costs of online interaction. However, since labeling preference needs real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes a offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of reward model, where only high confidence and low variance data are selected. Moreover, we provide the generalization bound of reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has theoretical improvement guarantee. The developed theory is based on state-action pair, which can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve comparable performance to baseline under fewer preference data without online interaction.
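LEASE 的关键一步是用"高置信度、低方差"筛选奖励模型对无标注数据生成的伪偏好标签。下面给出一个与该思想对应的简化 NumPy 草图:用线性奖励模型集成对成对轨迹段打分,按集成均值与方差筛选;模型形式、特征维度与阈值均为示意性假设,并非论文原实现。

```python
import numpy as np

rng = np.random.default_rng(0)

# 示意:5 个近似一致的线性奖励模型组成集成(共享一个"真实"方向加噪声)
base = rng.normal(size=16)
ensemble = base + 0.2 * rng.normal(size=(5, 16))
seg_a = rng.normal(size=(200, 16))           # 轨迹段 A 的特征
seg_b = rng.normal(size=(200, 16))           # 轨迹段 B 的特征

# 每个模型对"A 优于 B"的偏好概率(Bradley-Terry 形式)
r_a = seg_a @ ensemble.T                     # (200, 5)
r_b = seg_b @ ensemble.T
prob_a = 1.0 / (1.0 + np.exp(-(r_a - r_b)))

mean_p = prob_a.mean(axis=1)
var_p = prob_a.var(axis=1)

# 只保留高置信度(远离 0.5)且低方差的伪标签(阈值为假设值)
confident = np.abs(mean_p - 0.5) > 0.4
stable = var_p < 0.01
keep = confident & stable
pseudo_labels = (mean_p[keep] > 0.5).astype(int)

print(f"候选偏好对 200 个,筛选后保留 {keep.sum()} 个伪标签偏好对")
```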

[AI-7] Conservation-informed Graph Learning for Spatiotemporal Dynamics Prediction

链接: https://arxiv.org/abs/2412.20962
作者: Yuan Mi,Pu Ren,Hongteng Xu,Hongsheng Liu,Zidong Wang,Yike Guo,Ji-Rong Wen,Hao Sun,Yang Liu
关键词: shown great potential, Data-centric methods, enabling better design, shown great, great potential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data-centric methods have shown great potential in understanding and predicting spatiotemporal dynamics, enabling better design and control of the object system. However, pure deep learning models often lack interpretability, fail to obey intrinsic physics, and struggle to cope with the various domains. While geometry-based methods, e.g., graph neural networks (GNNs), have been proposed to further tackle these challenges, they still need to find the implicit physical laws from large datasets and rely excessively on rich labeled data. In this paper, we herein introduce the conservation-informed GNN (CiGNN), an end-to-end explainable learning framework, to learn spatiotemporal dynamics based on limited training data. The network is designed to conform to the general conservation law via symmetry, where conservative and non-conservative information passes over a multiscale space enhanced by a latent temporal marching strategy. The efficacy of our model has been verified in various spatiotemporal systems based on synthetic and real-world datasets, showing superiority over baseline models. Results demonstrate that CiGNN exhibits remarkable accuracy and generalization ability, and is readily applicable to learning for prediction of various spatiotemporal dynamics in a spatial domain with complex geometry.
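CiGNN 强调让消息传递符合守恒律。一个常见的实现思路(此处仅为示意性假设,不代表论文的具体做法)是令每条边上的"通量"反对称:节点 i 传给 j 的量与 j 传给 i 的量互为相反数,从而全图总量严格守恒。下面的 NumPy 草图演示这一性质。

```python
import numpy as np

rng = np.random.default_rng(0)

# 示意图:5 个节点,每个节点携带一个守恒量(如质量)
num_nodes = 5
state = rng.uniform(1.0, 2.0, size=num_nodes)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # 无向边

def conservative_step(state, edges, dt=0.1):
    """一步守恒式消息传递:边上的通量反对称,总量严格不变。"""
    flux = np.zeros_like(state)
    for i, j in edges:
        # 通量取决于两端状态差(示意形式);f_ij = -f_ji 天然成立
        f = state[i] - state[j]
        flux[i] -= f
        flux[j] += f
    return state + dt * flux

total_before = state.sum()
for _ in range(100):
    state = conservative_step(state, edges)
total_after = state.sum()

print(f"初始总量: {total_before:.6f}  演化后总量: {total_after:.6f}")
```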

[AI-8] Rise of Generative Artificial Intelligence in Science

链接: https://arxiv.org/abs/2412.20960
作者: Liangping Ding,Cornelia Lawson,Philip Shapira
关键词: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, GenAI, scientific research
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 26 pages, 4 tables, 1 figures, 1 appendix figure

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI, generative AI) has rapidly become available as a tool in scientific research. To explore the use of generative AI in science, we conduct an empirical analysis using OpenAlex. Analyzing GenAI publications and other AI publications from 2017 to 2023, we profile growth patterns, the diffusion of GenAI publications across fields of study, and the geographical spread of scientific research on generative AI. We also investigate team size and international collaborations to explore whether GenAI, as an emerging scientific research area, shows different collaboration patterns compared to other AI technologies. The results indicate that generative AI has experienced rapid growth and increasing presence in scientific publications. The use of GenAI now extends beyond computer science to other scientific research domains. Over the study period, U.S. researchers contributed nearly two-fifths of global GenAI publications. The U.S. is followed by China, with several small and medium-sized advanced economies demonstrating relatively high levels of GenAI deployment in their research publications. Although scientific research overall is becoming increasingly specialized and collaborative, our results suggest that GenAI research groups tend to have slightly smaller team sizes than found in other AI fields. Furthermore, notwithstanding recent geopolitical tensions, GenAI research continues to exhibit levels of international collaboration comparable to other AI technologies.

[AI-9] Ontology-grounded Automatic Knowledge Graph Construction by LLM under Wikidata schema KDD2024 KDD

链接: https://arxiv.org/abs/2412.20942
作者: Xiaohan Feng,Xixin Wu,Helen Meng
关键词: Large Language Models, Language Models, Large Language, generating Competency Questions, propose an ontology-grounded
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Presented at HI-AI@KDD, Human-Interpretable AI Workshop at the KDD 2024, 26th of August 2024, Barcelona, Spain

点击查看摘要

Abstract:We propose an ontology-grounded approach to Knowledge Graph (KG) construction using Large Language Models (LLMs) on a knowledge base. An ontology is authored by generating Competency Questions (CQ) on knowledge base to discover knowledge scope, extracting relations from CQs, and attempt to replace equivalent relations by their counterpart in Wikidata. To ensure consistency and interpretability in the resulting KG, we ground generation of KG with the authored ontology based on extracted relations. Evaluation on benchmark datasets demonstrates competitive performance in knowledge graph construction task. Our work presents a promising direction for scalable KG construction pipeline with minimal human intervention, that yields high quality and human-interpretable KGs, which are interoperable with Wikidata semantics for potential knowledge base expansion.

[AI-10] Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution

链接: https://arxiv.org/abs/2412.20867
作者: Jonathan Külz,Michael Terzer,Marco Magri,Andrea Giusti,Matthias Althoff
关键词: constantly changing environments, standardized frameworks bridging, situ robotic automation, changing environments, challenging due
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:In situ robotic automation in construction is challenging due to constantly changing environments, a shortage of robotic experts, and a lack of standardized frameworks bridging robotics and construction practices. This work proposes a holistic framework for construction task specification, optimization of robot morphology, and mission execution using a mobile modular reconfigurable robot. Users can specify and monitor the desired robot behavior through a graphical interface. Our framework identifies an optimized robot morphology and enables automatic real-world execution by integrating Building Information Modelling (BIM). By leveraging modular robot components, we ensure seamless and fast adaption to the specific demands of the construction task. Experimental validation demonstrates that our approach robustly enables the autonomous execution of robotic drilling.

[AI-11] About rectified sigmoid function for enhancing the accuracy of Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2412.20851
作者: Vasiliy A. Es’kin,Alexey O. Malkhanov,Mikhail E. Smorkalov
关键词: neural networks, solving physical problems, physical problems, modified activation function, article is devoted
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 9 pages, 1 figure, 2 tables, 4 algthorithms. arXiv admin note: substantial text overlap with arXiv:2412.19235

点击查看摘要

Abstract:The article is devoted to the study of neural networks with one hidden layer and a modified activation function for solving physical problems. A rectified sigmoid activation function has been proposed to solve physical problems described by the ODE with neural networks. Algorithms for physics-informed data-driven initialization of a neural network and a neuron-by-neuron gradient-free fitting method have been presented for the neural network with this activation function. Numerical experiments demonstrate the superiority of neural networks with a rectified sigmoid function over neural networks with a sigmoid function in the accuracy of solving physical problems (harmonic oscillator, relativistic slingshot, and Lorentz system).

[AI-12] Analog Alchemy: Neural Computation with In-Memory Inference Learning and Routing

链接: https://arxiv.org/abs/2412.20848
作者: Yigit Demirag
关键词: Artificial Intelligence, field of Artificial, neural computation, ideal neural hardware, rethinking the ideal
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As neural computation is revolutionizing the field of Artificial Intelligence (AI), rethinking the ideal neural hardware is becoming the next frontier. Fast and reliable von Neumann architecture has been the hosting platform for neural computation. Although capable, its separation of memory and computation creates the bottleneck for the energy efficiency of neural computation, contrasting the biological brain. The question remains: how can we efficiently combine memory and computation, while exploiting the physics of the substrate, to build intelligent systems? In this thesis, I explore an alternative way with memristive devices for neural computation, where the unique physical dynamics of the devices are used for inference, learning and routing. Guided by the principles of gradient-based learning, we selected functions that need to be materialized, and analyzed connectomics principles for efficient wiring. Despite non-idealities and noise inherent in analog physics, I will provide hardware evidence of adaptability of local learning to memristive substrates, new material stacks and circuit blocks that aid in solving the credit assignment problem and efficient routing between analog crossbars for scalable architectures.

[AI-13] Frequency-Masked Embedding Inference: A Non-Contrastive Approach for Time Series Representation Learning AAAI-2025

链接: https://arxiv.org/abs/2412.20790
作者: En Fu,Yanyan Hu
关键词: time series, Contrastive learning underpins, Contrastive learning, negative sample pairs, underpins most current
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by AAAI-2025 main track

点击查看摘要

Abstract:Contrastive learning underpins most current self-supervised time series representation methods. The strategy for constructing positive and negative sample pairs significantly affects the final representation quality. However, due to the continuous nature of time series semantics, the modeling approach of contrastive learning struggles to accommodate the characteristics of time series data. This results in issues such as difficulties in constructing hard negative samples and the potential introduction of inappropriate biases during positive sample construction. Although some recent works have developed several scientific strategies for constructing positive and negative sample pairs with improved effectiveness, they remain constrained by the contrastive learning framework. To fundamentally overcome the limitations of contrastive learning, this paper introduces Frequency-masked Embedding Inference (FEI), a novel non-contrastive method that completely eliminates the need for positive and negative samples. The proposed FEI constructs 2 inference branches based on a prompting strategy: 1) Using frequency masking as prompts to infer the embedding representation of the target series with missing frequency bands in the embedding space, and 2) Using the target series as prompts to infer its frequency masking embedding. In this way, FEI enables continuous semantic relationship modeling for time series. Experiments on 8 widely used time series datasets for classification and regression tasks, using linear evaluation and end-to-end fine-tuning, show that FEI significantly outperforms existing contrastive-based methods in terms of generalization. This study provides new insights into self-supervised representation learning for time series. The code is available at this https URL.
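FEI 的提示之一是频率掩码:在频域屏蔽某个频带后,再让模型推断该序列的嵌入。下面用 NumPy 的 rFFT 演示"屏蔽一个频带并回到时域"这一基本操作;序列与被屏蔽的频带范围均为示意取值,与论文的模型结构无关。

```python
import numpy as np

# 示意时间序列:两个频率成分叠加
t = np.arange(256)
series = np.sin(2 * np.pi * 5 * t / 256) + 0.5 * np.sin(2 * np.pi * 40 * t / 256)

# 变换到频域,屏蔽指定频带(此处屏蔽第 30~50 个频率分量,为示意取值)
spectrum = np.fft.rfft(series)
masked_spectrum = spectrum.copy()
masked_spectrum[30:51] = 0.0

# 回到时域,得到"缺失频带"的序列,可作为频率掩码提示的输入
masked_series = np.fft.irfft(masked_spectrum, n=len(series))

removed_energy = np.sum(np.abs(spectrum[30:51]) ** 2)
print(f"被屏蔽频带能量占比: {removed_energy / np.sum(np.abs(spectrum) ** 2):.2%}")
print("掩码前后序列长度:", len(series), len(masked_series))
```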

[AI-14] SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

链接: https://arxiv.org/abs/2412.20787
作者: Pengfei Jing,Mengyun Tang,Xiaorong Shi,Xing Zheng,Sen Nie,Shi Wu,Yong Yang,Xiapu Luo
关键词: Evaluating Large Language, Large Language Models, Evaluating Large, natural language processing, including natural language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and HumanEval assess general LLM performance but lack focus on specific expert domains such as cybersecurity. Previous attempts to create cybersecurity datasets have faced limitations, including insufficient data volume and a reliance on multiple-choice questions (MCQs). To address these gaps, we propose SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in the cybersecurity domain. SecBench includes questions in various formats (MCQs and short-answer questions (SAQs)), at different capability levels (Knowledge Retention and Logical Reasoning), in multiple languages (Chinese and English), and across various sub-domains. The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest, resulting in 44,823 MCQs and 3,087 SAQs. Particularly, we used powerful yet cost-effective LLMs to (1) label the data and (2) construct a grading agent for automatic evaluation of the SAQs. Benchmarking results on 13 SOTA LLMs demonstrate the usability of SecBench, which is arguably the largest and most comprehensive benchmark dataset for LLMs in cybersecurity. More information about SecBench can be found at our website, and the dataset can be accessed via the artifact link.

[AI-15] Advancing Parkinson's Disease Progression Prediction: Comparing Long Short-Term Memory Networks and Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2412.20744
作者: Abhinav Roy,Bhavesh Gyanchandani,Aditya Oza,Abhishek Sharma
关键词: significantly reducing quality, increasing mortality risk, degenerative neurological disorder, significantly reducing, mortality risk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Parkinson’s Disease (PD) is a degenerative neurological disorder that impairs motor and non-motor functions, significantly reducing quality of life and increasing mortality risk. Early and accurate detection of PD progression is vital for effective management and improved patient outcomes. Current diagnostic methods, however, are often costly, time-consuming, and require specialized equipment and expertise. This work proposes an innovative approach to predicting PD progression using regression methods, Long Short-Term Memory (LSTM) networks, and Kolmogorov Arnold Networks (KAN). KAN, utilizing spline-parametrized univariate functions, allows for dynamic learning of activation patterns, unlike traditional linear models. The Movement Disorder Society-Sponsored Revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) is a comprehensive tool for evaluating PD symptoms and is commonly used to measure disease progression. Additionally, protein or peptide abnormalities are linked to PD onset and progression. Identifying these associations can aid in predicting disease progression and understanding molecular changes. Comparing multiple models, including LSTM and KAN, this study aims to identify the method that delivers the highest metrics. The analysis reveals that KAN, with its dynamic learning capabilities, outperforms other approaches in predicting PD progression. This research highlights the potential of AI and machine learning in healthcare, paving the way for advanced computational models to enhance clinical predictions and improve patient care and treatment strategies in PD management.

[AI-16] Overcoming Class Imbalance: Unified GNN Learning with Structural and Semantic Connectivity Representations

链接: https://arxiv.org/abs/2412.20656
作者: Abdullah Alchihabi,Hao Yan,Yuhong Guo
关键词: minority classes, Graph Neural Networks, Class imbalance, Neural Network Learning, Unified Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Class imbalance is pervasive in real-world graph datasets, where the majority of annotated nodes belong to a small set of classes (majority classes), leaving many other classes (minority classes) with only a handful of labeled nodes. Graph Neural Networks (GNNs) suffer from significant performance degradation in the presence of class imbalance, exhibiting bias towards majority classes and struggling to generalize effectively on minority classes. This limitation stems, in part, from the message passing process, leading GNNs to overfit to the limited neighborhood of annotated nodes from minority classes and impeding the propagation of discriminative information throughout the entire graph. In this paper, we introduce a novel Unified Graph Neural Network Learning (Uni-GNN) framework to tackle class-imbalanced node classification. The proposed framework seamlessly integrates both structural and semantic connectivity representations through semantic and structural node encoders. By combining these connectivity types, Uni-GNN extends the propagation of node embeddings beyond immediate neighbors, encompassing non-adjacent structural nodes and semantically similar nodes, enabling efficient diffusion of discriminative information throughout the graph. Moreover, to harness the potential of unlabeled nodes within the graph, we employ a balanced pseudo-label generation mechanism that augments the pool of available labeled nodes from minority classes in the training set. Experimental results underscore the superior performance of our proposed Uni-GNN framework compared to state-of-the-art class-imbalanced graph learning baselines across multiple benchmark datasets.
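Uni-GNN 中的"均衡伪标签生成"大致可以理解为:对每个类别按预测置信度各挑选相同数量的未标注节点补充训练集,避免多数类主导。下面是一个 NumPy 简化草图;类别数、每类选取数量与概率来源均为示意性假设,并非论文原实现。

```python
import numpy as np

rng = np.random.default_rng(0)

num_unlabeled, num_classes, per_class = 1000, 4, 20
# 示意:模型对未标注节点的类别概率(实际中来自 GNN 的 softmax 输出)
logits = rng.normal(size=(num_unlabeled, num_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

pred = probs.argmax(axis=1)
conf = probs.max(axis=1)

selected_nodes, selected_labels = [], []
for c in range(num_classes):
    idx = np.where(pred == c)[0]
    # 每个类别只取置信度最高的 per_class 个节点,避免多数类主导伪标签
    top = idx[np.argsort(-conf[idx])][:per_class]
    selected_nodes.append(top)
    selected_labels.append(np.full(len(top), c))

selected_nodes = np.concatenate(selected_nodes)
selected_labels = np.concatenate(selected_labels)
print("各类伪标签数量:", np.bincount(selected_labels, minlength=num_classes))
```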

[AI-17] Predicting Long Term Sequential Policy Value Using Softer Surrogates

链接: https://arxiv.org/abs/2412.20638
作者: Hyunji Nam,Allen Nie,Ge Gao,Vasilis Syrgkanis,Emma Brunskill
关键词: Performing policy evaluation, require waiting substantial, waiting substantial amounts, Performing policy, decision policy
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 23 pages, 1 figure

点击查看摘要

Abstract:Performing policy evaluation in education, healthcare and online commerce can be challenging, because it can require waiting substantial amounts of time to observe outcomes over the desired horizon of interest. While offline evaluation methods can be used to estimate the performance of a new decision policy from historical data in some cases, such methods struggle when the new policy involves novel actions or is being run in a new decision process with potentially different dynamics. Here we consider how to estimate the full-horizon value of a new decision policy using only short-horizon data from the new policy, and historical full-horizon data from a different behavior policy. We introduce two new estimators for this setting, including a doubly robust estimator, and provide formal analysis of their properties. Our empirical results on two realistic simulators, of HIV treatment and sepsis treatment, show that our methods can often provide informative estimates of a new decision policy ten times faster than waiting for the full horizon, highlighting that it may be possible to quickly identify if a new decision policy, involving new actions, is better or worse than existing past policies.

[AI-18] NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics

链接: https://arxiv.org/abs/2412.20635
作者: Jiawei Zhou,Woojeong Kim,Zhiying Xu,Alexander M. Rush,Minlan Yu
关键词: reducing expensive human, expensive human efforts, analyze networking behaviors, congestion prediction, reducing expensive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Understanding the traffic dynamics in networks is a core capability for automated systems to monitor and analyze networking behaviors, reducing expensive human efforts and economic risks through tasks such as traffic classification, congestion prediction, and attack detection. However, it is still challenging to accurately model network traffic with machine learning approaches in an efficient and broadly applicable manner. Task-specific models trained from scratch are used for different networking applications, which limits the efficiency of model development and generalization of model deployment. Furthermore, while networking data is abundant, high-quality task-specific labels are often insufficient for training individual models. Large-scale self-supervised learning on unlabeled data provides a natural pathway for tackling these challenges. We propose to pre-train a general-purpose machine learning model to capture traffic dynamics with only traffic data from NetFlow records, with the goal of fine-tuning for different downstream tasks with small amount of labels. Our presented NetFlowGen framework goes beyond a proof-of-concept for network traffic pre-training and addresses specific challenges such as unifying network feature representations, learning from large unlabeled traffic data volume, and testing on real downstream tasks in DDoS attack detection. Experiments demonstrate promising results of our pre-training framework on capturing traffic dynamics and adapting to different networking tasks.

[AI-19] Towards Explaining Uncertainty Estimates in Point Cloud Registration

链接: https://arxiv.org/abs/2412.20612
作者: Ziyuan Qin,Jongseok Lee,Rudolph Triebel
关键词: Iterative Closest Point, Iterative Closest, Closest Point, point clouds, commonly used algorithm
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Iterative Closest Point (ICP) is a commonly used algorithm to estimate the transformation between two point clouds. The key idea of this work is to leverage recent advances in explainable AI for probabilistic ICP methods that provide uncertainty estimates. Concretely, we propose a method that can explain why a probabilistic ICP method produced a particular output. Our method is based on kernel SHAP (SHapley Additive exPlanations). With this, we assign an importance value to common sources of uncertainty in ICP such as sensor noise, occlusion, and ambiguous environments. The results of the experiment show that this explanation method can reasonably explain the uncertainty sources, providing a step towards robots that know when and why they failed in a human-interpretable manner.

[AI-20] MATEY: multiscale adaptive foundation models for spatiotemporal physical systems

链接: https://arxiv.org/abs/2412.20601
作者: Pei Zhang,M. Paul Laiu,Matthew Norman,Doug Stefanski,John Gounley
关键词: architectures requires extremely, requires extremely long, Accurate representation, computationally prohibitive token, spatiotemporal physical systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Accurate representation of the multiscale features in spatiotemporal physical systems using vision transformer (ViT) architectures requires extremely long, computationally prohibitive token sequences. To address this issue, we propose two adaptive tokenization schemes that dynamically adjust patch sizes based on local features: one ensures convergent behavior to uniform patch refinement, while the other offers better computational efficiency. Moreover, we present a set of spatiotemporal attention schemes, where the temporal or axial spatial dimensions are decoupled, and evaluate their computational and data efficiencies. We assess the performance of the proposed multiscale adaptive model, MATEY, in a sequence of experiments. The results show that adaptive tokenization schemes achieve improved accuracy without significantly increasing the length of the token sequence. Compared to a full spatiotemporal attention scheme or a scheme that decouples only the temporal dimension, we find that fully decoupled axial attention is less efficient and expressive, requiring more training time and model weights to achieve the same accuracy. Finally, we demonstrate in two fine-tuning tasks featuring different physics that models pretrained on PDEBench data outperform the ones trained from scratch, especially in the low data regime with frozen attention.
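MATEY 的自适应分词根据局部特征动态调整 patch 大小。一个直观的简化版本(示意性假设,非论文原方案)是:从粗 patch 出发,对局部方差超过阈值的 patch 递归四分。下面的 NumPy 草图在一个二维示意场上演示该过程。

```python
import numpy as np

rng = np.random.default_rng(0)

# 示意二维物理场:左上角放一个高方差区域,其余区域为平滑背景
field = np.zeros((64, 64))
field[:16, :16] = rng.normal(size=(16, 16))          # 高方差区域
field += np.linspace(0, 1, 64)[None, :]              # 平滑背景梯度

def adaptive_patches(field, y, x, size, min_size=8, threshold=0.05):
    """若 patch 内方差超过阈值则四分,否则保留;返回 (y, x, size) 列表。"""
    block = field[y:y + size, x:x + size]
    if size <= min_size or block.var() <= threshold:
        return [(y, x, size)]
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches += adaptive_patches(field, y + dy, x + dx, half,
                                        min_size, threshold)
    return patches

patches = adaptive_patches(field, 0, 0, 64)
sizes = [s for _, _, s in patches]
print(f"patch 总数: {len(patches)},其中最小尺寸(8x8)patch 数: {sizes.count(8)}")
```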

[AI-21] Kryptonite-N: Machine Learning Strikes Back

链接: https://arxiv.org/abs/2412.20588
作者: Albus Li,Nathan Bailey,Will Sumerfield,Kira Kim
关键词: propose challenge datasets, propose challenge, Quinn, work called, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quinn et al propose challenge datasets in their work called "Kryptonite-N". These datasets aim to counter the universal function approximation argument of machine learning, breaking the notion that "machine learning can approximate any continuous function". Our work refutes this claim and shows that universal function approximations can be applied successfully; the Kryptonite datasets are constructed predictably, allowing logistic regression with sufficient polynomial expansion and L1 regularization to solve for any dimension N.
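摘要声称"足够阶数的多项式展开加 L1 正则的逻辑回归"即可解决 Kryptonite-N。下面用 scikit-learn 给出这一通用做法的示意(这里用随机生成的异或型数据代替真实的 Kryptonite 数据集,多项式阶数与正则强度 C 均为假设取值)。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# 示意数据:一个低维"异或"式问题,线性模型无法直接区分
X = rng.uniform(-1, 1, size=(2000, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)

# 多项式展开 + L1 正则逻辑回归:展开阶数与 C 为示意取值
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
model.fit(X[:1500], y[:1500])
print(f"留出集准确率: {model.score(X[1500:], y[1500:]):.3f}")
```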

[AI-22] Bridging the Gap: A Decade Review of Time-Series Clustering Methods

链接: https://arxiv.org/abs/2412.20582
作者: John Paparrizos,Fan Yang,Haojun Li
关键词: including computer science, computer science, environmental sciences, diverse disciplines, including computer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Time series, as one of the most fundamental representations of sequential data, has been extensively studied across diverse disciplines, including computer science, biology, geology, astronomy, and environmental sciences. The advent of advanced sensing, storage, and networking technologies has resulted in high-dimensional time-series data, however, posing significant challenges for analyzing latent structures over extended temporal scales. Time-series clustering, an established unsupervised learning strategy that groups similar time series together, helps unveil hidden patterns in these complex datasets. In this survey, we trace the evolution of time-series clustering methods from classical approaches to recent advances in neural networks. While previous surveys have focused on specific methodological categories, we bridge the gap between traditional clustering methods and emerging deep learning-based algorithms, presenting a comprehensive, unified taxonomy for this research area. This survey highlights key developments and provides insights to guide future research in time-series clustering.

[AI-23] A Survey on Time-Series Distance Measures

链接: https://arxiv.org/abs/2412.20574
作者: John Paparrizos,Haojun Li,Fan Yang,Kaize Wu,Jens E. d’Hondt,Odysseas Papapetrou
关键词: fundamental building blocks, time-series analysis tasks, Distance measures, anomaly detection, analysis tasks
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distance measures have been recognized as one of the fundamental building blocks in time-series analysis tasks, e.g., querying, indexing, classification, clustering, anomaly detection, and similarity search. The vast proliferation of time-series data across a wide range of fields has increased the relevance of evaluating the effectiveness and efficiency of these distance measures. To provide a comprehensive view of this field, this work considers over 100 state-of-the-art distance measures, classified into 7 categories: lock-step measures, sliding measures, elastic measures, kernel measures, feature-based measures, model-based measures, and embedding measures. Beyond providing comprehensive mathematical frameworks, this work also delves into the distinctions and applications across these categories for both univariate and multivariate cases. By providing comprehensive collections and insights, this study paves the way for the future development of innovative time-series distance measures.
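综述中 lock-step 与 elastic 两类度量的差别,可以用欧氏距离与 DTW 的对比直观说明:前者逐点对齐,后者允许时间轴弹性匹配。下面给出一个最小的动态规划 DTW 实现并与欧氏距离比较;序列为示意数据。

```python
import numpy as np

def euclidean(a, b):
    """lock-step 度量:逐点对齐的欧氏距离。"""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dtw(a, b):
    """elastic 度量:经典动态规划 DTW(平方代价,最后开方),复杂度 O(nm)。"""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

t = np.linspace(0, 2 * np.pi, 100)
x = np.sin(t)
y = np.sin(t + 0.5)      # 同形状但相位错开的序列

print(f"欧氏距离: {euclidean(x, y):.3f}")
print(f"DTW 距离: {dtw(x, y):.3f}")   # 允许弹性对齐,通常明显更小
```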

[AI-24] The intrinsic motivation of reinforcement and imitation learning for sequential tasks

链接: https://arxiv.org/abs/2412.20573
作者: Sao Mai Nguyen
关键词: developmental cognitive robotics, cognitive robotics aims, including sequential tasks, intrinsic motivation, learning
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Habilitation thesis

点击查看摘要

Abstract:This work in the field of developmental cognitive robotics aims to devise a new domain bridging between reinforcement learning and imitation learning, with a model of the intrinsic motivation for learning agents to learn with guidance from tutors multiple tasks, including sequential tasks. The main contribution has been to propose a common formulation of intrinsic motivation based on empirical progress for a learning agent to choose automatically its learning curriculum by actively choosing its learning strategy for simple or sequential tasks: which task to learn, between autonomous exploration or imitation learning, between low-level actions or task decomposition, between several tutors. The originality is to design a learner that benefits not only passively from data provided by tutors, but to actively choose when to request tutoring and what and whom to ask. The learner is thus more robust to the quality of the tutoring and learns faster with fewer demonstrations. We developed the framework of socially guided intrinsic motivation with machine learning algorithms to learn multiple tasks by taking advantage of the generalisability properties of human demonstrations in a passive manner or in an active manner through requests of demonstrations from the best tutor for simple and composing subtasks. The latter relies on a representation of subtask composition proposed for a construction process, which should be refined by representations used for observational processes of analysing human movements and activities of daily living. With the outlook of a language-like communication with the tutor, we investigated the emergence of a symbolic representation of the continuous sensorimotor space and of tasks using intrinsic motivation. We proposed within the reinforcement learning framework, a reward function for interacting with tutors for automatic curriculum learning in multi-task learning.

[AI-25] Attacks on the neural network and defense methods

链接: https://arxiv.org/abs/2412.20529
作者: A. Korenev,G. Belokrylov,B. Lodonova,A. Novokhrestov
关键词: neural network trained, article will discuss, neural network, network trained, trained on audio
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This article will discuss the use of attacks on a neural network trained on audio data, as well as possible methods of protection against these attacks. FGSM, PGD and CW attacks, as well as data poisoning, will be considered. Within the framework of protection, the Art-IBM and advertorch libraries will be considered. The accuracy metrics obtained under these attacks are also presented.

[AI-26] Game Theory and Multi-Agent Reinforcement Learning: From Nash Equilibria to Evolutionary Dynamics

链接: https://arxiv.org/abs/2412.20523
作者: Neil De La Fuente,Miquel Noguer i Alonso,Guim Casadellà
关键词: explores advanced topics, paper explores advanced, previous work, explores advanced, advanced topics
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注: 22 pages

点击查看摘要

Abstract:This paper explores advanced topics in complex multi-agent systems building upon our previous work. We examine four fundamental challenges in Multi-Agent Reinforcement Learning (MARL): non-stationarity, partial observability, scalability with large agent populations, and decentralized learning. The paper provides mathematical formulations and analysis of recent algorithmic advancements designed to address these challenges, with a particular focus on their integration with game-theoretic concepts. We investigate how Nash equilibria, evolutionary game theory, correlated equilibrium, and adversarial dynamics can be effectively incorporated into MARL algorithms to improve learning outcomes. Through this comprehensive analysis, we demonstrate how the synthesis of game theory and MARL can enhance the robustness and effectiveness of multi-agent systems in complex, dynamic environments.

[AI-27] Goal-Conditioned Data Augmentation for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2412.20519
作者: Xingshuai Huang,Di Wu Member,Benoit Boulet
关键词: enables policy learning, Offline reinforcement learning, enables policy, reinforcement learning, policy learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) enables policy learning from pre-collected offline datasets, relaxing the need to interact directly with the environment. However, limited by the quality of offline datasets, it generally fails to learn well-qualified policies in suboptimal datasets. To address datasets with insufficient optimal demonstrations, we introduce Goal-cOnditioned Data Augmentation (GODA), a novel goal-conditioned diffusion-based method for augmenting samples with higher quality. Leveraging recent advancements in generative modeling, GODA incorporates a novel return-oriented goal condition with various selection mechanisms. Specifically, we introduce a controllable scaling technique to provide enhanced return-based guidance during data sampling. GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals, thereby maximizing the utility of limited optimal demonstrations. Furthermore, we propose a novel adaptive gated conditioning method for processing noised inputs and conditions, enhancing the capture of goal-oriented guidance. We conduct experiments on the D4RL benchmark and real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA’s effectiveness in enhancing data quality and superior performance compared to state-of-the-art data augmentation methods across various offline RL algorithms.

[AI-28] Dive into Time-Series Anomaly Detection: A Decade Review

链接: https://arxiv.org/abs/2412.20512
作者: Paul Boniol,Qinghua Liu,Mingyi Huang,Themis Palpanas,John Paparrizos
关键词: data collection technology, time-series anomaly detection, anomaly detection, time series analytics, collection technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time series analytics. In this regard, time-series anomaly detection has been an important activity, entailing various applications in fields such as cyber security, financial markets, law enforcement, and health care. While traditional literature on anomaly detection is centered on statistical measures, the increasing number of machine learning algorithms in recent years call for a structured, general characterization of the research methods for time-series anomaly detection. This survey groups and summarizes anomaly detection existing solutions under a process-centric taxonomy in the time series context. In addition to giving an original categorization of anomaly detection methods, we also perform a meta-analysis of the literature and outline general trends in time-series anomaly detection research.

[AI-29] Stratify: Unifying Multi-Step Forecasting Strategies

链接: https://arxiv.org/abs/2412.20510
作者: Riku Green,Grant Stevens,Zahraa Abdallah,Telmo M. Silva Filho
关键词: predictions multiple time, multiple time steps, make predictions multiple, key aspect, aspect of temporal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 30 pages, 9 figures, journal

点击查看摘要

Abstract:A key aspect of temporal domains is the ability to make predictions multiple time steps into the future, a process known as multi-step forecasting (MSF). At the core of this process is selecting a forecasting strategy; however, with no existing frameworks to map out the space of strategies, practitioners are left with ad-hoc methods for strategy selection. In this work, we propose Stratify, a parameterised framework that addresses multi-step forecasting, unifying existing strategies and introducing novel, improved strategies. We evaluate Stratify on 18 benchmark datasets, five function classes, and short to long forecast horizons (10, 20, 40, 80). In over 84% of 1080 experiments, novel strategies in Stratify improved performance compared to all existing ones. Importantly, we find that no single strategy consistently outperforms others in all task settings, highlighting the need for practitioners to explore the Stratify space to carefully search and select forecasting strategies based on task-specific requirements. Our results are the most comprehensive benchmarking of known and novel forecasting strategies. We make code available to reproduce our results.
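多步预测中最基础的两种策略是递归式(训练一步预测器并循环喂回自身输出)与直接式(为每个步长单独训练模型),Stratify 正是把这类策略纳入统一的参数化空间。下面用 NumPy 线性自回归给出两种策略的最小示意;数据、滞后阶数与预测步长均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)

# 示意序列:带噪声的正弦
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=len(t))

LAGS, HORIZON = 10, 20
train = series[:300]

def make_xy(data, lags, target_offset):
    """构造 (滞后窗口, 向前 target_offset 步的目标) 训练样本。"""
    X = np.stack([data[i:i + lags]
                  for i in range(len(data) - lags - target_offset + 1)])
    y = data[lags + target_offset - 1:]
    return X, y

def fit_linear(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

# 递归策略:只训练一步预测器,预测时把输出循环喂回输入
w1 = fit_linear(*make_xy(train, LAGS, 1))
window = list(train[-LAGS:])
recursive_pred = []
for _ in range(HORIZON):
    x = np.append(window[-LAGS:], 1.0)
    yhat = float(x @ w1)
    recursive_pred.append(yhat)
    window.append(yhat)

# 直接策略:为每个预测步长 h 各训练一个模型
direct_pred = []
last = np.append(train[-LAGS:], 1.0)
for h in range(1, HORIZON + 1):
    wh = fit_linear(*make_xy(train, LAGS, h))
    direct_pred.append(float(last @ wh))

truth = series[300:300 + HORIZON]
print("递归策略 MSE:", np.mean((np.array(recursive_pred) - truth) ** 2))
print("直接策略 MSE:", np.mean((np.array(direct_pred) - truth) ** 2))
```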

[AI-30] Planning, Living and Judging: A Multi-agent LLM-based Framework for Cyclical Urban Planning AAAI2025

链接: https://arxiv.org/abs/2412.20505
作者: Hang Ni,Yuzhi Wang,Hao Liu
关键词: regeneration presents significant, presents significant challenges, Urban regeneration presents, requiring adaptive approaches, context of urbanization
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 4 pages, 2 figures, accepted by The 1st Workshop on AI for Urban Planning (AAAI 2025’s Workshop)

点击查看摘要

Abstract:Urban regeneration presents significant challenges within the context of urbanization, requiring adaptive approaches to tackle evolving needs. Leveraging advancements in large language models (LLMs), we propose Cyclical Urban Planning (CUP), a new paradigm that continuously generates, evaluates, and refines urban plans in a closed-loop. Specifically, our multi-agent LLM-based framework consists of three key components: (1) Planning, where LLM agents generate and refine urban plans based on contextual data; (2) Living, where agents simulate the behaviors and interactions of residents, modeling life in the urban environment; and (3) Judging, which involves evaluating plan effectiveness and providing iterative feedback for improvement. The cyclical process enables a dynamic and responsive planning approach. Experiments on the real-world dataset demonstrate the effectiveness of our framework as a continuous and adaptive planning process.

[AI-31] A Multiparty Homomorphic Encryption Approach to Confidential Federated Kaplan Meier Survival Analysis

链接: https://arxiv.org/abs/2412.20495
作者: Narasimha Raghavan Veeraragavan,Svetlana Boudko,Jan Franz Nygård
关键词: sensitive patient records, regulations hinder pooling, hinder pooling sensitive, pooling sensitive patient, stringent privacy regulations
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Software (cs.MS); Machine Learning (stat.ML)
*备注: 40 pages

点击查看摘要

Abstract:The proliferation of healthcare data has expanded opportunities for collaborative research, yet stringent privacy regulations hinder pooling sensitive patient records. We propose a multiparty homomorphic encryption-based framework for privacy-preserving federated Kaplan–Meier survival analysis, offering native floating-point support, a theoretical model, and explicit reconstruction-attack mitigation. Compared to prior work, our framework ensures encrypted federated survival estimates closely match centralized outcomes, supported by formal utility-loss bounds that demonstrate convergence as aggregation and decryption noise diminish. Extensive experiments on the NCCTG Lung Cancer and synthetic Breast Cancer datasets confirm low mean absolute error (MAE) and root mean squared error (RMSE), indicating negligible deviations between encrypted and non-encrypted survival curves. Log-rank and numerical accuracy tests reveal no significant difference between federated encrypted and non-encrypted analyses, preserving statistical validity. A reconstruction-attack evaluation shows smaller federations (2–3 providers) with overlapping data between the institutions are vulnerable, a challenge mitigated by multiparty encryption. Larger federations (5–50 sites) degrade reconstruction accuracy further, with encryption improving confidentiality. Despite an 8–19× computational overhead, threshold-based homomorphic encryption is feasible for moderate-scale deployments, balancing security and runtime. By providing robust privacy guarantees alongside high-fidelity survival estimates, our framework advances the state-of-the-art in secure multi-institutional survival analysis.
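论文联邦聚合的基础统计量是 Kaplan–Meier 生存曲线。为便于理解哪些量需要被(加密)聚合,下面给出一个明文版的 KM 估计器草图:在每个事件时刻用风险集人数与事件数更新生存概率;随访数据为示意,不涉及论文中的同态加密部分。

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier 估计:times 为随访时间,events=1 表示事件发生、0 表示删失。"""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]

    unique_event_times = np.unique(times[events == 1])
    survival = 1.0
    curve = []
    for t in unique_event_times:
        at_risk = np.sum(times >= t)                   # 风险集人数
        died = np.sum((times == t) & (events == 1))    # 该时刻事件数
        survival *= 1.0 - died / at_risk
        curve.append((t, survival))
    return curve

# 示意随访数据(时间单位任意)
times = [5, 8, 8, 12, 15, 20, 22, 30]
events = [1, 1, 0, 1, 0, 1, 1, 0]
for t, s in kaplan_meier(times, events):
    print(f"t={t:>4.0f}  S(t)={s:.3f}")
```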

[AI-32] A Comprehensive Framework for Reliable Legal AI: Combining Specialized Expert Systems and Adaptive Refinement

链接: https://arxiv.org/abs/2412.20468
作者: Sidra Nasir,Qamar Abbas,Samita Bai,Rizwan Ahmed Khan
关键词: artificial intelligence, document review, contract drafting, discusses the evolving, evolving role
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 16 pages and 5 figures

点击查看摘要

Abstract:This article discusses the evolving role of artificial intelligence (AI) in the legal profession, focusing on its potential to streamline tasks such as document review, research, and contract drafting. However, challenges persist, particularly the occurrence of “hallucinations” in AI models, where they generate inaccurate or misleading information, undermining their reliability in legal contexts. To address this, the article proposes a novel framework combining a mixture of expert systems with a knowledge-based architecture to improve the precision and contextual relevance of AI-driven legal services. This framework utilizes specialized modules, each focusing on specific legal areas, and incorporates structured operational guidelines to enhance decision-making. Additionally, it leverages advanced AI techniques like Retrieval-Augmented Generation (RAG), Knowledge Graphs (KG), and Reinforcement Learning from Human Feedback (RLHF) to improve the system’s accuracy. The proposed approach demonstrates significant improvements over existing AI models, showcasing enhanced performance in legal tasks and offering a scalable solution to provide more accessible and affordable legal services. The article also outlines the methodology, system architecture, and promising directions for future research in AI applications for the legal sector.

[AI-33] Multi-Scenario Reasoning: Unlocking Cognitive Autonomy in Humanoid Robots for Multimodal Understanding

链接: https://arxiv.org/abs/2412.20429
作者: Libo Wang
关键词: multi-scenario reasoning architecture, improve the cognitive, cognitive autonomy, research proposes, proposes a multi-scenario
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: The main text is 5 pages, 2 figures, and 3 tables

点击查看摘要

Abstract:To improve the cognitive autonomy of humanoid robots, this research proposes a multi-scenario reasoning architecture to solve the technical shortcomings of multi-modal understanding in this field. It draws on simulation based experimental design that adopts multi-modal synthesis (visual, auditory, tactile) and builds a simulator “Maha” to perform the experiment. The findings demonstrate the feasibility of this architecture in multimodal data. It provides reference experience for the exploration of cross-modal interaction strategies for humanoid robots in dynamic environments.

[AI-34] A Deep Subgrouping Framework for Precision Drug Repurposing via Emulating Clinical Trials on Real-world Patient Data KDD2025

链接: https://arxiv.org/abs/2412.20373
作者: Seungyeon Lee,Ruoqi Liu,Feixiong Cheng,Ping Zhang
关键词: novo drug discovery, reducing the time, time and costs, traditional de novo, Drug repurposing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To be published in KDD 2025

点击查看摘要

Abstract:Drug repurposing identifies new therapeutic uses for existing drugs, reducing the time and costs compared to traditional de novo drug discovery. Most existing drug repurposing studies using real-world patient data often treat the entire population as homogeneous, ignoring the heterogeneity of treatment responses across patient subgroups. This approach may overlook promising drugs that benefit specific subgroups but lack notable treatment effects across the entire population, potentially limiting the number of repurposable candidates identified. To address this, we introduce STEDR, a novel drug repurposing framework that integrates subgroup analysis with treatment effect estimation. Our approach first identifies repurposing candidates by emulating multiple clinical trials on real-world patient data and then characterizes patient subgroups by learning subgroup-specific treatment effects. We deploy STEDR to Alzheimer’s Disease (AD), a condition with few approved drugs and known heterogeneity in treatment responses. We emulate trials for over one thousand medications on a large-scale real-world database covering over 8 million patients, identifying 14 drug candidates with beneficial effects to AD in characterized subgroups. Experiments demonstrate STEDR’s superior capability in identifying repurposing candidates compared to existing approaches. Additionally, our method can characterize clinically relevant patient subgroups associated with important AD-related risk factors, paving the way for precision drug repurposing.

[AI-35] Safe Multiagent Coordination via Entropic Exploration

链接: https://arxiv.org/abs/2412.20361
作者: Ayhan Alp Aydeniz,Enrico Marchesini,Robert Loftin,Christopher Amato,Kagan Tumer
关键词: involve safety concerns, problems involve safety, learning problems involve, safety concerns, problems involve
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Many real-world multiagent learning problems involve safety concerns. In these setups, typical safe reinforcement learning algorithms constrain agents’ behavior, limiting exploration – a crucial component for discovering effective cooperative multiagent behaviors. Moreover, the multiagent literature typically models individual constraints for each agent and has yet to investigate the benefits of using joint team constraints. In this work, we analyze these team constraints from a theoretical and practical perspective and propose entropic exploration for constrained multiagent reinforcement learning (E2C) to address the exploration issue. E2C leverages observation entropy maximization to incentivize exploration and facilitate learning safe and effective cooperative behaviors. Experiments across increasingly complex domains show that E2C agents match or surpass common unconstrained and constrained baselines in task performance while reducing unsafe behaviors by up to 50%.
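As a rough illustration of the entropy-driven exploration idea described above, the sketch below adds an observation-entropy bonus to a shared team reward. This is a minimal sketch under assumptions of my own (discretized observations, a hypothetical bonus coefficient `beta`), not the authors' E2C implementation.

```python
# Minimal sketch (not the authors' code): adding an observation-entropy bonus
# to a shared team reward, one plausible reading of "observation entropy
# maximization" for exploration in constrained multiagent RL.
from collections import Counter
import math

def observation_entropy(observations):
    """Shannon entropy (in nats) of a batch of discretized observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def shaped_team_reward(task_reward, observations, beta=0.1):
    """Task reward plus a weighted entropy bonus; beta is a hypothetical coefficient."""
    return task_reward + beta * observation_entropy(observations)

# Example: more diverse joint observations yield a larger bonus.
print(shaped_team_reward(1.0, ["a", "a", "a", "a"]))  # low entropy, no bonus
print(shaped_team_reward(1.0, ["a", "b", "c", "d"]))  # high entropy, larger bonus
```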

[AI-36] Distilling Desired Comments for Enhanced Code Review with Large Language Models

链接: https://arxiv.org/abs/2412.20340
作者: Yongda Yu,Lei Zhang,Guoping Rong,Haifeng Shen,Jiahao Zhang,Haoxiang Yan,Guohao Shi,Dong Shao,Ruiqi Pan,Yuan Li,Qiushi Wang,Zhao Tian
关键词: Large Language Models, Large Language, Language Models, review, growing interest
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:There has been a growing interest in using Large Language Models (LLMs) for code review thanks to their proven proficiency in code comprehension. The primary objective of most review scenarios is to generate desired review comments (DRCs) that explicitly identify issues to trigger code fixes. However, existing LLM-based solutions are not so effective in generating DRCs for various reasons such as hallucination. To enhance their code review ability, they need to be fine-tuned with a customized dataset that is ideally full of DRCs. Nevertheless, such a dataset is not yet available, while manual annotation of DRCs is too laborious to be practical. In this paper, we propose a dataset distillation method, Desiview, which can automatically construct a distilled dataset by identifying DRCs from a code review dataset. Experiments on the CodeReviewer dataset comprising more than 150K review entries show that Desiview achieves an impressive performance of 88.93%, 80.37%, 86.67%, and 84.44% in terms of Precision, Recall, Accuracy, and F1, respectively, surpassing state-of-the-art methods. To validate the effect of such a distilled dataset on enhancing LLMs’ code review ability, we first fine-tune the latest LLaMA series (i.e., LLaMA 3 and LLaMA 3.1) to build model Desiview4FT. We then enhance the model training effect through KTO alignment by feeding those review comments identified as non-DRCs to the LLMs, resulting in model Desiview4FA. Verification results indicate that Desiview4FA slightly outperforms Desiview4FT, while both models have significantly improved against the base models in terms of generating DRCs. Human evaluation confirms that both models identify issues more accurately and tend to generate review comments that better describe the issues contained in the code than the base LLMs do.

[AI-37] Mind the Data Gap: Bridging LLMs to Enterprise Data Integration CIDR’25

链接: https://arxiv.org/abs/2412.20331
作者: Moe Kayali,Fabian Wenz,Nesime Tatbul,Çağatay Demiralp
关键词: Leading large language, large language models, Leading large, language models, data
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: CIDR’25

点击查看摘要

Abstract:Leading large language models (LLMs) are trained on public data. However, most of the world’s data is dark data that is not publicly accessible, mainly in the form of private organizational or enterprise data. We show that the performance of methods based on LLMs seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. We release a new benchmark dataset, the GOBY Benchmark, to advance discovery in enterprise data integration. Based on our experience with this enterprise benchmark, we propose techniques to uplift the performance of LLMs on enterprise data, including (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. We show that, once these techniques are deployed, the performance on enterprise data becomes on par with that of public data. The Goby benchmark can be obtained at this https URL.

[AI-38] Protein Structure Prediction in the 3D HP Model Using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2412.20329
作者: Giovanny Espitia,Yui Tik Pang,James C. Gumbart
关键词: Hydrophobic-Polar lattice model, address protein structure, protein structure prediction, Hydrophobic-Polar lattice, structure prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:We address protein structure prediction in the 3D Hydrophobic-Polar lattice model through two novel deep learning architectures. For proteins under 36 residues, our hybrid reservoir-based model combines fixed random projections with trainable deep layers, achieving optimal conformations with 25% fewer training episodes. For longer sequences, we employ a long short-term memory network with multi-headed attention, matching best-known energy values. Both architectures leverage a stabilized Deep Q-Learning framework with experience replay and target networks, demonstrating consistent achievement of optimal conformations while significantly improving training efficiency compared to existing methods.

[AI-39] Hypergraph-Based Dynamic Graph Node Classification ICASSP2025

链接: https://arxiv.org/abs/2412.20321
作者: Xiaoxu Ma,Chen Zhao,Minglai Shao,Yujie Lin
关键词: achieved significant success, Node classification, Graph Node Classification, Dynamic Graph, significant success
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: Accepted in ICASSP 2025

点击查看摘要

Abstract:Node classification on static graphs has achieved significant success, but achieving accurate node classification on dynamic graphs where node topology, attributes, and labels change over time has not been well addressed. Existing methods based on RNNs and self-attention only aggregate features of the same node across different time slices, which cannot adequately address and capture the diverse dynamic changes in dynamic graphs. Therefore, we propose a novel model named Hypergraph-Based Multi-granularity Dynamic Graph Node Classification (HYDG). After obtaining basic node representations for each slice through a GNN backbone, HYDG models the representations of each node in the dynamic graph through two modules. The individual-level hypergraph captures the spatio-temporal node representations between individual nodes, while the group-level hypergraph captures the multi-granularity group temporal representations among nodes of the same class. Each hyperedge captures different temporal dependencies of varying lengths by connecting multiple nodes within specific time ranges. More accurate representations are obtained through weighted information propagation and aggregation by the hypergraph neural network. Extensive experiments on five real dynamic graph datasets using two GNN backbones demonstrate the superiority of our proposed framework.
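To make the hyperedge idea above concrete, the sketch below builds individual-level hyperedges that connect a node's occurrences within a sliding time window. The window size and the (node, time) encoding are illustrative assumptions, not HYDG's actual construction.

```python
# Illustrative sketch only: building individual-level hyperedges that connect
# the same node's occurrences across a sliding time window, one plausible
# reading of "hyperedges over specific time ranges" in HYDG.
def individual_hyperedges(node_ids, num_slices, window=3):
    """Return hyperedges as lists of (node_id, time_slice) pairs."""
    hyperedges = []
    for node in node_ids:
        for start in range(0, num_slices - window + 1):
            hyperedges.append([(node, t) for t in range(start, start + window)])
    return hyperedges

edges = individual_hyperedges(node_ids=[0, 1], num_slices=5, window=3)
print(len(edges), edges[0])  # 6 hyperedges; the first spans node 0 at slices 0-2
```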

[AI-40] EXAdam: The Power of Adaptive Cross-Moments

链接: https://arxiv.org/abs/2412.20302
作者: Ahmed M. Adly
关键词: widely-used Adam optimizer, paper introduces EXAdam, paper introduces, Adam optimizer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces EXAdam (EXtended Adam), a novel optimization algorithm that builds upon the widely-used Adam optimizer. EXAdam incorporates three key enhancements: (1) new debiasing terms for improved moment estimation, (2) a gradient-based acceleration mechanism for increased responsiveness to the current loss landscape, and (3) a dynamic step size formula that allows for continuous growth of the learning rate throughout training. These innovations work synergistically to address limitations of the original Adam algorithm, potentially offering improved convergence properties, enhanced ability to escape saddle points, and greater robustness to hyperparameter choices. I provide a theoretical analysis of EXAdam’s components and their interactions, highlighting the algorithm’s potential advantages in navigating complex optimization landscapes. Empirical evaluations demonstrate EXAdam’s superiority over Adam, achieving 48.07% faster convergence and yielding improvements of 4.6%, 4.13%, and 2.39% in training, validation, and testing accuracies, respectively, when applied to a CNN trained on the CIFAR-10 dataset. While these results are promising, further empirical validation across diverse tasks is essential to fully gauge EXAdam’s efficacy. Nevertheless, EXAdam represents a significant advancement in adaptive optimization techniques, with promising implications for a wide range of machine learning applications. This work aims to contribute to the ongoing development of more efficient, adaptive, and universally applicable optimization methods in the field of machine learning and artificial intelligence.
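For readers who want the baseline in code, the sketch below is the standard Adam update that EXAdam extends; the three enhancements listed in the abstract are only marked in comments, since their exact formulas are not given here.

```python
# Reference sketch of the standard Adam update that EXAdam extends; the three
# EXAdam enhancements (new debiasing terms, gradient-based acceleration,
# dynamic step size) are only indicated in comments, not implemented here.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment
    v = b2 * v + (1 - b2) * grad ** 2       # second moment
    m_hat = m / (1 - b1 ** t)               # (1) EXAdam replaces these debiasing terms
    v_hat = v / (1 - b2 ** t)
    step = lr                               # (3) EXAdam lets the step size grow dynamically
    update = m_hat                          # (2) EXAdam adds gradient-based acceleration here
    theta = theta - step * update / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 4):
    grad = theta - np.array([1.0, 2.0, 3.0])   # gradient of a toy quadratic
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)
```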

[AI-41] High-fidelity social learning via shared episodic memories enhances collaborative foraging through mnemonic convergence

链接: https://arxiv.org/abs/2412.20271
作者: Ismael T. Freire,Paul Verschure
关键词: Social learning, Social, episodic memory, learning, enables individuals
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:Social learning, a cornerstone of cultural evolution, enables individuals to acquire knowledge by observing and imitating others. At the heart of its efficacy lies episodic memory, which encodes specific behavioral sequences to facilitate learning and decision-making. This study explores the interrelation between episodic memory and social learning in collective foraging. Using Sequential Episodic Control (SEC) agents capable of sharing complete behavioral sequences stored in episodic memory, we investigate how variations in the frequency and fidelity of social learning influence collaborative foraging performance. Furthermore, we analyze the effects of social learning on the content and distribution of episodic memories across the group. High-fidelity social learning is shown to consistently enhance resource collection efficiency and distribution, with benefits sustained across memory lengths. In contrast, low-fidelity learning fails to outperform nonsocial learning, spreading diverse but ineffective mnemonic patterns. Novel analyses using mnemonic metrics reveal that high-fidelity social learning also fosters mnemonic group alignment and equitable resource distribution, while low-fidelity conditions increase mnemonic diversity without translating to performance gains. Additionally, we identify an optimal range for episodic memory length in this task, beyond which performance plateaus. These findings underscore the critical effects of social learning on mnemonic group alignment and distribution and highlight the potential of neurocomputational models to probe the cognitive mechanisms driving cultural evolution.

[AI-42] How To Think About End-To-End Encryption and AI: Training Processing Disclosure and Consent

链接: https://arxiv.org/abs/2412.20231
作者: Mallory Knodel,Andrés Fábrega,Daniella Ferrari,Jacob Leiken,Betty Li Hou,Derek Yen,Sam de Alfaro,Kyunghyun Cho,Sunoo Park
关键词: bringing strong confidentiality, securing communications, bringing strong, gold standard, standard for securing
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:End-to-end encryption (E2EE) has become the gold standard for securing communications, bringing strong confidentiality and privacy guarantees to billions of users worldwide. However, the current push towards widespread integration of artificial intelligence (AI) models, including in E2EE systems, raises some serious security concerns. This work performs a critical examination of the (in)compatibility of AI models and E2EE applications. We explore this on two fronts: (1) the integration of AI “assistants” within E2EE applications, and (2) the use of E2EE data for training AI models. We analyze the potential security implications of each, and identify conflicts with the security guarantees of E2EE. Then, we analyze legal implications of integrating AI models in E2EE applications, given how AI integration can undermine the confidentiality that E2EE promises. Finally, we offer a list of detailed recommendations based on our technical and legal analyses, including: technical design choices that must be prioritized to uphold E2EE security; how service providers must accurately represent E2EE security; and best practices for the default behavior of AI features and for requesting user consent. We hope this paper catalyzes an informed conversation on the tensions that arise between the brisk deployment of AI and the security offered by E2EE, and guides the responsible development of new AI features.

[AI-43] Leveraging Large Language Models for Enhancing Autonomous Vehicle Perception

链接: https://arxiv.org/abs/2412.20230
作者: Athanasios Karagounis
关键词: Large Language Models, sophisticated perception systems, rely on sophisticated, interpret their surroundings, cornerstone for safe
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 4 pages

点击查看摘要

Abstract:Autonomous vehicles (AVs) rely on sophisticated perception systems to interpret their surroundings, a cornerstone for safe navigation and decision-making. The integration of Large Language Models (LLMs) into AV perception frameworks offers an innovative approach to address challenges in dynamic environments, sensor fusion, and contextual reasoning. This paper presents a novel framework for incorporating LLMs into AV perception, enabling advanced contextual understanding, seamless sensor integration, and enhanced decision support. Experimental results demonstrate that LLMs significantly improve the accuracy and reliability of AV perception systems, paving the way for safer and more intelligent autonomous driving technologies. By expanding the scope of perception beyond traditional methods, LLMs contribute to creating a more adaptive and human-centric driving ecosystem, making autonomous vehicles more reliable and transparent in their operations. These advancements redefine the relationship between human drivers and autonomous systems, fostering trust through enhanced understanding and personalized decision-making. Furthermore, by integrating memory modules and adaptive learning mechanisms, LLMs introduce continuous improvement in AV perception, enabling vehicles to evolve with time and adapt to changing environments and user preferences.

[AI-44] Federated Unlearning with Gradient Descent and Conflict Mitigation AAAI-25 AAAI

链接: https://arxiv.org/abs/2412.20200
作者: Zibin Pan,Zhichao Wang,Chi Li,Kaiyan Zheng,Boqi Wang,Xiaoying Tang,Junhua Zhao
关键词: Federated Learning, model utility, model, recent years, received much attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: To be published in the Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

点击查看摘要

Abstract:Federated Learning (FL) has received much attention in recent years. However, although clients are not required to share their data in FL, the global model itself can implicitly remember clients’ local data. Therefore, it’s necessary to effectively remove the target client’s data from the FL global model to ease the risk of privacy leakage and implement “the right to be forgotten”. Federated Unlearning (FU) has been considered a promising way to remove data without full retraining. However, model utility easily suffers a significant reduction during unlearning due to gradient conflicts. Furthermore, when conducting post-training to recover model utility, the model is prone to move back and revert what has already been unlearned. To address these issues, we propose Federated Unlearning with Orthogonal Steepest Descent (FedOSD). We first design an unlearning Cross-Entropy loss to overcome the convergence issue of the gradient ascent. A steepest descent direction for unlearning is then calculated under the condition that it does not conflict with other clients’ gradients while staying closest to the target client’s gradient. This helps unlearn efficiently while mitigating the reduction in model utility. After unlearning, we recover model utility while maintaining the achievement of unlearning. Finally, extensive experiments in several FL scenarios verify that FedOSD outperforms the SOTA FU algorithms in terms of unlearning and model utility.
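One plausible way to picture a "non-conflicting" steepest descent direction is a PCGrad-style projection, sketched below: start from the target client's unlearning direction and remove any component that conflicts with another client's gradient. This is an illustrative reading, not FedOSD's actual orthogonal steepest descent derivation.

```python
# Hypothetical sketch (not FedOSD itself): project out components of the target
# client's unlearning direction that conflict (negative inner product) with other
# clients' gradients, a PCGrad-style approximation of a non-conflicting direction.
import numpy as np

def non_conflicting_direction(target_grad, other_grads):
    d = target_grad.copy()
    for g in other_grads:
        dot = np.dot(d, g)
        if dot < 0:                       # conflict: moving along d would hurt this client
            d = d - dot / (np.dot(g, g) + 1e-12) * g
    return d

target = np.array([1.0, -1.0])
others = [np.array([1.0, 1.0]), np.array([0.0, 1.0])]
print(non_conflicting_direction(target, others))  # [1.0, 0.0]
```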

[AI-45] Lower bounds on transformers with infinite precision

链接: https://arxiv.org/abs/2412.20195
作者: Alexander Kozachinskiy
关键词: one-layer softmax transformers, infinite precision, dimension technique, technique to prove, lower bound
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this note, we use the VC dimension technique to prove the first lower bound against one-layer softmax transformers with infinite precision. We do so for two tasks: function composition, considered by Peng, Narayanan, and Papadimitriou, and the SUM_2 task, considered by Sanford, Hsu, and Telgarsky.

[AI-46] Imitation Learning from Suboptimal Demonstrations via Meta-Learning An Action Ranker

链接: https://arxiv.org/abs/2412.20193
作者: Jiangdong Fan,Hongcai He,Paul Weng,Hui Xu,Jie Shao
关键词: expensive or inaccessible, major bottleneck, large number, demonstrations, supplementary demonstrations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A major bottleneck in imitation learning is the requirement of a large number of expert demonstrations, which can be expensive or inaccessible. Learning from supplementary demonstrations without strict quality requirements has emerged as a powerful paradigm to address this challenge. However, previous methods often fail to fully utilize their potential by discarding non-expert data. Our key insight is that even demonstrations that fall outside the expert distribution but outperform the learned policy can enhance policy performance. To utilize this potential, we propose a novel approach named imitation learning via meta-learning an action ranker (ILMAR). ILMAR implements weighted behavior cloning (weighted BC) on a limited set of expert demonstrations along with supplementary demonstrations. It utilizes the functional of the advantage function to selectively integrate knowledge from the supplementary demonstrations. To make more effective use of supplementary demonstrations, we introduce meta-goal in ILMAR to optimize the functional of the advantage function by explicitly minimizing the distance between the current policy and the expert policy. Comprehensive experiments using extensive tasks demonstrate that ILMAR significantly outperforms previous methods in handling suboptimal demonstrations. Code is available at this https URL.
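The weighted behavior cloning idea can be sketched as below, where per-sample weights derived from an advantage-like score down-weight supplementary demonstrations that look worse than the current policy. The sigmoid weighting is an assumption for illustration, not ILMAR's meta-learned action ranker.

```python
# Minimal sketch of weighted behavior cloning: weights from an advantage-like
# score down-weight supplementary demonstrations that underperform the current
# policy. The sigmoid weighting is an illustrative assumption, not ILMAR's ranker.
import numpy as np

def weighted_bc_loss(log_probs, advantages, temperature=1.0):
    """log_probs: policy log-likelihood of demonstrated actions; advantages: scores."""
    weights = 1.0 / (1.0 + np.exp(-np.asarray(advantages) / temperature))  # in (0, 1)
    return -np.mean(weights * np.asarray(log_probs))

print(weighted_bc_loss(log_probs=[-0.2, -1.5], advantages=[2.0, -2.0]))
```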

[AI-47] Real-time Calibration Model for Low-cost Sensor in Fine-grained Time series AAAI2025

链接: https://arxiv.org/abs/2412.20170
作者: Seokho Ahn,Hyungjin Kim,Sungbok Shin,Young-Duk Seo
关键词: Precise measurements, collected from low-cost, Precise, TESLA
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Precise measurements from sensors are crucial, but data is usually collected from low-cost, low-tech systems, which are often inaccurate. Thus, they require further calibrations. To that end, we first identify three requirements for effective calibration under practical low-tech sensor conditions. Based on the requirements, we develop a model called TESLA, Transformer for effective sensor calibration utilizing logarithmic-binned attention. TESLA uses a high-performance deep learning model, Transformers, to calibrate and capture non-linear components. At its core, it employs logarithmic binning to minimize attention complexity. TESLA achieves consistent real-time calibration, even with longer sequences and finer-grained time series in hardware-constrained systems. Experiments show that TESLA outperforms existing novel deep learning and newly crafted linear models in accuracy, calibration speed, and energy efficiency.
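A rough sketch of logarithmic binning for attention over a time series follows: past positions are grouped into exponentially growing buckets, so a query at time t attends to O(log t) bucket summaries rather than t individual steps. The bucket boundaries are an assumption, not TESLA's exact scheme.

```python
# Illustrative sketch of logarithmic binning for attention over a time series:
# past positions are grouped into exponentially growing buckets. Bucket
# boundaries here are an assumption, not TESLA's exact scheme.
def log_bins(t):
    """Return (start, end) index ranges of logarithmic buckets covering 0..t-1."""
    bins, end = [], t
    while end > 0:
        size = max(1, 2 ** len(bins))         # bucket sizes 1, 2, 4, 8, ...
        start = max(0, end - size)
        bins.append((start, end))
        end = start
    return bins

print(log_bins(11))  # [(10, 11), (8, 10), (4, 8), (0, 4)] -> O(log t) buckets
```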

[AI-48] LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System

链接: https://arxiv.org/abs/2412.20166
作者: Hyucksung Kwon,Kyungmo Koo,Janghyeon Kim,Woongkyu Lee,Minjae Lee,Hyungdeok Lee,Yousub Jung,Jaehan Park,Yosub Song,Byeongsu Yang,Haerang Choi,Guhyun Kim,Jongsoon Won,Woojae Shin,Changhyun Kim,Gyeongcheol Shin,Yongkee Kwon,Ilkon Kim,Euicheol Lim,John Kim,Jungwook Choi
关键词: large language models, parameters presents significant, presents significant challenges, PIM, memory bandwidth
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 15 pages, 12 figures

点击查看摘要

Abstract:The expansion of large language models (LLMs) with hundreds of billions of parameters presents significant challenges to computational resources, particularly data movement and memory bandwidth. Long-context LLMs, which process sequences of tens of thousands of tokens, further increase the demand on the memory system as the complexity in attention layers and key-value cache sizes is proportional to the context length. Processing-in-Memory (PIM) maximizes memory bandwidth by moving compute to the data and can address the memory bandwidth challenges; however, PIM is not necessarily scalable to accelerate long-context LLM because of limited per-module memory capacity and the inflexibility of fixed-functional unit PIM architecture and static memory management. In this work, we propose LoL-PIM which is a multi-node PIM architecture that accelerates long context LLM through hardware-software co-design. In particular, we propose how pipeline parallelism can be exploited across a multi-PIM module while a direct PIM access (DPA) controller (or DMA for PIM) is proposed that enables dynamic PIM memory management and results in efficient PIM utilization across a diverse range of context length. We developed an MLIR-based compiler for LoL-PIM extending a commercial PIM-based compiler where the software modifications were implemented and evaluated, while the hardware changes were modeled in the simulator. Our evaluations demonstrate that LoL-PIM significantly improves throughput and reduces latency for long-context LLM inference, outperforming both multi-GPU and GPU-PIM systems (up to 8.54x and 16.0x speedup, respectively), thereby enabling more efficient deployment of LLMs in real-world applications.

[AI-49] Topic-Aware Knowledge Graph with Large Language Models for Interoperability in Recommender Systems

链接: https://arxiv.org/abs/2412.20163
作者: Minhye Jeon,Seokho Ahn,Young-Duk Seo
关键词: cold start problems, addressing data sparsity, start problems, common approaches, approaches to addressing
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by The 40th ACM/SIGAPP Symposium On Applied Computing(SAC) 2025

点击查看摘要

Abstract:The use of knowledge graphs in recommender systems has become one of the common approaches to addressing data sparsity and cold start problems. Recent advances in large language models (LLMs) offer new possibilities for processing side and context information within knowledge graphs. However, consistent integration across various systems remains challenging due to the need for domain expert intervention and differences in system characteristics. To address these issues, we propose a consistent approach that extracts both general and specific topics from both side and context information using LLMs. First, general topics are iteratively extracted and updated from side information. Then, specific topics are extracted using context information. Finally, to address synonymous topics generated during the specific topic extraction process, a refining algorithm processes and resolves these issues effectively. This approach allows general topics to capture broad knowledge across diverse item characteristics, while specific topics emphasize detailed attributes, providing a more comprehensive understanding of the semantic features of items and the preferences of users. Experimental results demonstrate significant improvements in recommendation performance across diverse knowledge graphs.

[AI-50] Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting ICASSP2025

链接: https://arxiv.org/abs/2412.20155
作者: Wooseok Han,Minki Kang,Changhun Kim,Eunho Yang
关键词: voice assistant services, attracted considerable attention, considerable attention due, personalized voice assistant, range of applications
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning to maintain the synthesis ability for prior samples and prevent overfitting on target samples. Extensive experiments demonstrate the effectiveness of Stable-TTS even with limited and noisy target speech samples.

[AI-51] RFPPO: Motion Dynamic RRT based Fluid Field - PPO for Dynamic TF/TA Routing Planning

链接: https://arxiv.org/abs/2412.20098
作者: Rongkun Xue,Jing Yang,Yuyang Jiang,Yiming Feng,Zi Yang
关键词: Existing local dynamic, large and medium-sized, Existing local, medium-sized fixed-wing aircraft, local dynamic route
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 2024 IEEE Intelligent Vehicles Symposium

点击查看摘要

Abstract:Existing local dynamic route planning algorithms, when directly applied to terrain following/terrain avoidance, or dynamic obstacle avoidance for large and medium-sized fixed-wing aircraft, fail to simultaneously meet the requirements of real-time performance, long-distance planning, and the dynamic constraints of large and medium-sized aircraft. To deal with this issue, this paper proposes the Motion Dynamic RRT based Fluid Field - PPO for dynamic TF/TA routing planning. Firstly, the action and state spaces of the proximal policy gradient algorithm are redesigned using disturbance flow fields and artificial potential field algorithms, establishing an aircraft dynamics model, and designing a state transition process based on this model. Additionally, a reward function is designed to encourage strategies for obstacle avoidance, terrain following, terrain avoidance, and safe flight. Experimental results on real DEM data demonstrate that our algorithm can complete long-distance flight tasks through collision-free trajectory planning that complies with dynamic constraints, without the need for prior global planning.

[AI-52] From Worms to Mice: Homeostasis Maybe All You Need

链接: https://arxiv.org/abs/2412.20090
作者: Jesus Marco de Lucas
关键词: sole guiding principle, explore ideas inspired, simple neural XOR, neural XOR motif, XOR motif
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:In this brief and speculative commentary, we explore ideas inspired by neural networks in machine learning, proposing that a simple neural XOR motif, involving both excitatory and inhibitory connections, may provide the basis for a relevant mode of plasticity in neural circuits of living organisms, with homeostasis as the sole guiding principle. This XOR motif simply signals the discrepancy between incoming signals and reference signals, thereby providing a basis for a loss function in learning neural circuits, and at the same time regulating homeostasis by halting the propagation of these incoming signals. The core motif uses a 4:1 ratio of excitatory to inhibitory neurons, and supports broader neural patterns such as the well-known ‘winner takes all’ (WTA) mechanism. We examined the prevalence of the XOR motif in the published connectomes of various organisms with increasing complexity, and found that it ranges from tens (in C. elegans) to millions (in several Drosophila neuropils) and more than tens of millions (in mouse V1 visual cortex). If validated, our hypothesis identifies two of the three key components in analogy to machine learning models: the architecture and the loss function. And we propose that a relevant type of biological neural plasticity is simply driven by a basic control or regulatory system, which has persisted and adapted despite the increasing complexity of organisms throughout evolution.

[AI-53] On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs

链接: https://arxiv.org/abs/2412.20087
作者: Atmane Ayoub Mansour Bahar,Ahmad Samer Wazan
关键词: Vulnerability Scoring System, Common Vulnerability Scoring, Scoring System, Large Language Models, Adversarial Attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 101 pages, 3 figures

点击查看摘要

Abstract:This research investigates the effectiveness of established vulnerability metrics, such as the Common Vulnerability Scoring System (CVSS), in evaluating attacks against Large Language Models (LLMs), with a focus on Adversarial Attacks (AAs). The study explores the influence of both general and specific metric factors in determining vulnerability scores, providing new perspectives on potential enhancements to these metrics. This study adopts a quantitative approach, calculating and comparing the coefficient of variation of vulnerability scores across 56 adversarial attacks on LLMs. The attacks, sourced from various research papers, and obtained through online databases, were evaluated using multiple vulnerability metrics. Scores were determined by averaging the values assessed by three distinct LLMs. The results indicate that existing scoring-systems yield vulnerability scores with minimal variation across different attacks, suggesting that many of the metric factors are inadequate for assessing adversarial attacks on LLMs. This is particularly true for context-specific factors or those with predefined value sets, such as those in CVSS. These findings support the hypothesis that current vulnerability metrics, especially those with rigid values, are limited in evaluating AAs on LLMs, highlighting the need for the development of more flexible, generalized metrics tailored to such attacks. This research offers a fresh analysis of the effectiveness and applicability of established vulnerability metrics, particularly in the context of Adversarial Attacks on Large Language Models, both of which have gained significant attention in recent years. Through extensive testing and calculations, the study underscores the limitations of these metrics and opens up new avenues for improving and refining vulnerability assessment frameworks specifically tailored for LLMs.
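The comparison metric used here, the coefficient of variation (standard deviation divided by mean), is simple to reproduce; the sketch below uses made-up scores purely for illustration.

```python
# Worked sketch of the coefficient of variation (std / mean) used to compare how
# much vulnerability scores spread across attacks; the scores below are made up.
import statistics

def coefficient_of_variation(scores):
    return statistics.pstdev(scores) / statistics.mean(scores)

cvss_like_scores = [7.5, 7.8, 7.4, 7.6, 7.7]   # hypothetical per-attack scores
print(round(coefficient_of_variation(cvss_like_scores), 4))  # small value -> little discrimination
```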

[AI-54] MAFT: Efficient Model-Agnostic Fairness Testing for Deep Neural Networks via Zero-Order Gradient Search ICSE24

链接: https://arxiv.org/abs/2412.20086
作者: Zhaohui Wang,Min Zhang,Jingran Yang,Bojie Shao,Min Zhang
关键词: Deep neural networks, Deep neural, shown powerful performance, decision-making systems, shown powerful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: Accepted by ICSE24

点击查看摘要

Abstract:Deep neural networks (DNNs) have shown powerful performance in various applications and are increasingly being used in decision-making systems. However, concerns about fairness in DNNs always persist. Some efficient white-box fairness testing methods about individual fairness have been proposed. Nevertheless, the development of black-box methods has stagnated, and the performance of existing methods is far behind that of white-box methods. In this paper, we propose a novel black-box individual fairness testing method called Model-Agnostic Fairness Testing (MAFT). By leveraging MAFT, practitioners can effectively identify and address discrimination in DL models, regardless of the specific algorithm or architecture employed. Our approach adopts lightweight procedures such as gradient estimation and attribute perturbation rather than non-trivial procedures like symbol execution, rendering it significantly more scalable and applicable than existing methods. We demonstrate that MAFT achieves the same effectiveness as state-of-the-art white-box methods whilst improving the applicability to large-scale networks. Compared to existing black-box approaches, our approach demonstrates distinguished performance in discovering fairness violations w.r.t effectiveness (approximately 14.69 times) and efficiency (approximately 32.58 times).
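As background for the black-box procedure mentioned above, the sketch below shows generic zero-order gradient estimation via central finite differences on input attributes; `toy_model` is a placeholder, and this is not MAFT's implementation.

```python
# Generic zero-order gradient estimate via finite differences on input attributes,
# in the spirit of MAFT's black-box "gradient estimation + attribute perturbation";
# model() is any black-box scoring function, here a toy stand-in.
import numpy as np

def zero_order_gradient(model, x, eps=1e-3):
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        grad[i] = (model(x_plus) - model(x_minus)) / (2 * eps)  # central difference
    return grad

toy_model = lambda x: float(np.sum(x ** 2))          # placeholder black-box model
print(zero_order_gradient(toy_model, np.array([1.0, -2.0])))  # ~[2, -4]
```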

[AI-55] Calibre: Towards Fair and Accurate Personalized Federated Learning with Self-Supervised Learning

链接: https://arxiv.org/abs/2412.20020
作者: Sijia Chen,Ningxin Su,Baochun Li
关键词: existing approaches train, global model, extract transferable representations, existing approaches, approaches train
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: ICDCS camera-ready paper, Code repo: this https URL

点击查看摘要

Abstract:In the context of personalized federated learning, existing approaches train a global model to extract transferable representations, based on which any client could train personalized models with a limited number of data samples. Self-supervised learning is considered a promising direction as the global model it produces is generic and facilitates personalization for all clients fairly. However, when data is heterogeneous across clients, the global model trained using SSL is unable to learn high-quality personalized models. In this paper, we show that when the global model is trained with SSL without modifications, its produced representations have fuzzy class boundaries. As a result, personalized learning within each client produces models with low accuracy. In order to improve SSL towards better accuracy without sacrificing its advantage in fairness, we propose Calibre, a new personalized federated learning framework designed to calibrate SSL representations by maintaining a suitable balance between more generic and more client-specific representations. Calibre is designed based on theoretically-sound properties, and introduces (1) a client-specific prototype loss as an auxiliary training objective; and (2) an aggregation algorithm guided by such prototypes across clients. Our experimental results in an extensive array of non-i.i.d. settings show that Calibre achieves state-of-the-art performance in terms of both mean accuracy and fairness across clients. Code repo: this https URL.
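One plausible reading of a client-specific prototype loss is sketched below: prototypes are class means of local representations, and the auxiliary loss pulls each sample toward its class prototype. This is an assumption for illustration, not Calibre's exact objective.

```python
# Hypothetical sketch of a client-specific prototype loss: prototypes are class
# means of local representations, and the auxiliary loss pulls each sample toward
# its class prototype. One plausible reading, not Calibre's exact objective.
import numpy as np

def prototype_loss(features, labels):
    features, labels = np.asarray(features), np.asarray(labels)
    loss, prototypes = 0.0, {}
    for c in np.unique(labels):
        prototypes[c] = features[labels == c].mean(axis=0)   # client-specific prototype
    for f, y in zip(features, labels):
        loss += np.sum((f - prototypes[y]) ** 2)             # pull sample toward its prototype
    return loss / len(features)

feats = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0]]
print(prototype_loss(feats, labels=[0, 0, 1]))
```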

[AI-56] ProtCLIP: Function-Informed Protein Multi-Modal Learning

链接: https://arxiv.org/abs/2412.20014
作者: Hanjing Zhou,Mingze Yin,Wei Wu,Mingyang Li,Kun Fu,Jintai Chen,Jian Wu,Zheng Wang
关键词: achieved promising performance, aligns protein sequences, learned general protein, general protein representations, downstream applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:The multi-modality pre-training paradigm that aligns protein sequences and biological descriptions has learned general protein representations and achieved promising performance in various downstream applications. However, these works were still unable to replicate the extraordinary success of language-supervised visual foundation models due to the ineffective usage of aligned protein-text paired data and the lack of an effective function-informed pre-training paradigm. To address these issues, this paper curates a large-scale protein-text paired dataset called ProtAnno with a property-driven sampling strategy, and introduces a novel function-informed protein pre-training paradigm. Specifically, the sampling strategy determines the selection probability based on the sample confidence and property coverage, balancing data quality and data quantity in the face of large-scale noisy data. Furthermore, motivated by the significance of protein-specific functional mechanisms, the proposed paradigm explicitly models static and dynamic protein functional segments through two segment-wise pre-training objectives, injecting fine-grained information in a function-informed manner. Leveraging all these innovations, we develop ProtCLIP, a multi-modality foundation model that comprehensively represents function-aware protein embeddings. On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as a protein multi-modality foundation model.
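As background, the sketch below shows a standard CLIP-style symmetric InfoNCE loss over protein and text embeddings, the kind of alignment objective ProtCLIP builds on; the paper's segment-wise, function-informed objectives are not reproduced here.

```python
# Standard CLIP-style symmetric InfoNCE over protein/text embeddings, shown only
# as background for the alignment objective ProtCLIP builds on; the paper's
# segment-wise, function-informed objectives are not reproduced here.
import numpy as np

def clip_loss(protein_emb, text_emb, temperature=0.07):
    p = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = p @ t.T / temperature                      # pairwise similarities
    labels = np.arange(len(p))
    def cross_entropy(l):                               # row-wise softmax cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()        # matched pairs on the diagonal
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
print(clip_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```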

[AI-57] Adaptive Parameter-Efficient Federated Fine-Tuning on Heterogeneous Devices

链接: https://arxiv.org/abs/2412.20004
作者: Jun Liu,Yunming Liao,Hongli Xu,Yang Xu,Jianchun Liu,Chen Qian
关键词: pre-trained language models, Federated fine-tuning, distributed manner, proposed to fine-tune, fine-tune the pre-trained
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Federated fine-tuning (FedFT) has been proposed to fine-tune pre-trained language models in a distributed manner. However, there are two critical challenges for efficient FedFT in practical applications, i.e., resource constraints and system heterogeneity. Existing works rely on parameter-efficient fine-tuning methods, e.g., low-rank adaptation (LoRA), but with major limitations. Herein, based on the inherent characteristics of FedFT, we observe that LoRA layers with higher ranks added close to the output help to save resource consumption while achieving comparable fine-tuning performance. Then we propose a novel LoRA-based FedFT framework, termed LEGEND, which addresses the difficulty of determining the number of LoRA layers (called LoRA depth) and the rank of each LoRA layer (called rank distribution). We analyze the coupled relationship between LoRA depth and rank distribution, and design an efficient LoRA configuration algorithm for heterogeneous devices, thereby promoting fine-tuning efficiency. Extensive experiments are conducted on a physical platform with 80 commercial devices. The results show that LEGEND can achieve a speedup of 1.5-2.8× and save communication costs by about 42.3% when achieving the target accuracy, compared to advanced solutions.
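A toy sketch of the observation LEGEND builds on (higher LoRA ranks closer to the output) is shown below; the linear ramp and the depth and rank values are assumptions, not the paper's configuration algorithm.

```python
# Toy sketch of a rank distribution that gives LoRA layers near the output higher
# ranks, reflecting the observation LEGEND builds on; the linear ramp and the
# depth/rank values are assumptions, not the paper's configuration algorithm.
def rank_distribution(num_layers, lora_depth, min_rank=4, max_rank=16):
    """Attach LoRA only to the last `lora_depth` layers, ramping rank up toward the output."""
    ranks = [0] * (num_layers - lora_depth)              # earlier layers: no LoRA
    for i in range(lora_depth):
        frac = i / max(1, lora_depth - 1)
        ranks.append(round(min_rank + frac * (max_rank - min_rank)))
    return ranks

print(rank_distribution(num_layers=12, lora_depth=4))  # [0, ..., 0, 4, 8, 12, 16]
```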

[AI-58] Delayed Random Partial Gradient Averaging for Federated Learning

链接: https://arxiv.org/abs/2412.19987
作者: Xinyi Hu
关键词: distributed machine learning, machine learning paradigm, Federated learning, Deep Neural Networks, enables multiple clients
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a distributed machine learning paradigm that enables multiple clients to train a shared model collaboratively while preserving privacy. However, the scaling of real-world FL systems is often limited by two communication bottlenecks:(a) while the increasing computing power of edge devices enables the deployment of large-scale Deep Neural Networks (DNNs), the limited bandwidth constraints frequent transmissions over large DNNs; and (b) high latency cost greatly degrades the performance of FL. In light of these bottlenecks, we propose a Delayed Random Partial Gradient Averaging (DPGA) to enhance FL. Under DPGA, clients only share partial local model gradients with the server. The size of the shared part in a local model is determined by the update rate, which is coarsely initialized and subsequently refined over the temporal dimension. Moreover, DPGA largely reduces the system run time by enabling computation in parallel with communication. We conduct experiments on non-IID CIFAR-10/100 to demonstrate the efficacy of our method.
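The partial-sharing step can be pictured as below: each client transmits only a random subset of gradient entries whose size is set by the update rate. Index selection and rate handling are illustrative assumptions, not DPGA's exact scheme.

```python
# Minimal sketch of sharing only a random part of the local gradient, sized by an
# update rate, as in DPGA's partial gradient averaging; indices and rate handling
# are illustrative assumptions.
import numpy as np

def partial_gradient(grad, update_rate, rng):
    """Return shared indices and values covering roughly `update_rate` of the gradient."""
    k = max(1, int(update_rate * grad.size))
    idx = rng.choice(grad.size, size=k, replace=False)
    return idx, grad.flat[idx]

rng = np.random.default_rng(0)
grad = np.arange(10, dtype=float)
idx, vals = partial_gradient(grad, update_rate=0.3, rng=rng)
print(idx, vals)   # 3 randomly chosen entries are transmitted
```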

[AI-59] The Fifth International Verification of Neural Networks Competition (VNN-COMP 2024): Summary and Results

链接: https://arxiv.org/abs/2412.19985
作者: Christopher Brix,Stanley Bak,Taylor T. Johnson,Haoze Wu
关键词: neural network verification, International Symposium, International Conference, International Verification, Neural Networks Competition
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Report on the results of VNN-COMP 2024. arXiv admin note: substantial text overlap with arXiv:2312.16760 , arXiv:2212.10376

点击查看摘要

Abstract:This report summarizes the 5th International Verification of Neural Networks Competition (VNN-COMP 2024), held as a part of the 7th International Symposium on AI Verification (SAIV), that was collocated with the 36th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2024 iteration, 8 teams participated on a diverse set of 12 regular and 8 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.

[AI-60] Will you donate money to a chatbot? The effect of chatbot anthropomorphic features and persuasion strategies on willingness to donate

链接: https://arxiv.org/abs/2412.19976
作者: Ekaterina Novozhilova,Jiacheng Huang,Le He,Ziling Li,James Cummings
关键词: work investigates, investigates the causal, causal mechanism, strategies on users’, users’ perceptions
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:This work investigates the causal mechanism behind the effect of chatbot personification and persuasion strategies on users’ perceptions and donation likelihood. In a 2 (personified vs. non-personified chatbot) x 2 (emotional vs. logical persuasion strategy) between-subjects experiment (N=76), participants engaged with a chatbot that represented a non-profit charitable organization. The results suggest that interaction with a personified chatbot evokes perceived anthropomorphism; however, it does not elicit greater willingness to donate. In fact, we found that commonly used anthropomorphic features, like name and narrative, led to negative attitudes toward an AI agent in the donation context. Our results showcase a preference for non-personified chatbots paired with logical persuasion appeal, emphasizing the significance of consistency in chatbot interaction, mirroring human-human engagement. We discuss the importance of moving from exploring the common scenario of a chatbot with machine identity vs. a chatbot with human identity in light of the recent regulations of AI systems.

[AI-61] MobileNetV2: A lightweight classification model for home-based sleep apnea screening

链接: https://arxiv.org/abs/2412.19967
作者: Hui Pan,Yanxuan Yu,Jilun Ye,Xu Zhang
关键词: leveraging features extracted, early OSA screening, ECG signals, model leveraging features, extracted from electrocardiogram
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This study proposes a novel lightweight neural network model leveraging features extracted from electrocardiogram (ECG) and respiratory signals for early OSA screening. ECG signals are used to generate feature spectrograms to predict sleep stages, while respiratory signals are employed to detect sleep-related breathing abnormalities. By integrating these predictions, the method calculates the apnea-hypopnea index (AHI) with enhanced accuracy, facilitating precise OSA diagnosis. The method was validated on three publicly available sleep apnea databases: the Apnea-ECG database, the UCDDB dataset, and the MIT-BIH Polysomnographic database. Results showed an overall OSA detection accuracy of 0.978, highlighting the model’s robustness. Respiratory event classification achieved an accuracy of 0.969 and an area under the receiver operating characteristic curve (ROC-AUC) of 0.98. For sleep stage classification, in UCDDB dataset, the ROC-AUC exceeded 0.85 across all stages, with recall for Sleep reaching 0.906 and specificity for REM and Wake states at 0.956 and 0.937, respectively. This study underscores the potential of integrating lightweight neural networks with multi-signal analysis for accurate, portable, and cost-effective OSA screening, paving the way for broader adoption in home-based and wearable health monitoring systems.
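The AHI combination step is easy to make concrete: the index is the number of respiratory events per hour of sleep, where total sleep time comes from the predicted sleep stages. The sketch below uses an assumed 30-second epoch length and made-up inputs.

```python
# Worked sketch of the apnea-hypopnea index (AHI): respiratory events per hour of
# sleep, combining predicted sleep stages (to get total sleep time) with detected
# apnea/hypopnea events; the epoch length and inputs are illustrative.
def apnea_hypopnea_index(stage_per_epoch, num_events, epoch_seconds=30):
    """stage_per_epoch: 'W' for wake or a sleep stage label per scored epoch."""
    sleep_hours = sum(1 for s in stage_per_epoch if s != "W") * epoch_seconds / 3600.0
    return num_events / sleep_hours if sleep_hours > 0 else 0.0

stages = ["W"] * 20 + ["N2"] * 700 + ["REM"] * 200   # 900 sleep epochs of 30 s => 7.5 h sleep
print(round(apnea_hypopnea_index(stages, num_events=45), 2))  # 45 events / 7.5 h = 6.0
```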

[AI-62] Hidformer: Transformer-Style Neural Network in Stock Price Forecasting

链接: https://arxiv.org/abs/2412.19932
作者: Kamil Ł. Szydłowski,Jarosław A. Chudziak
关键词: Transformer-based neural networks, Transformer-based neural, financial market analysis, application of Transformer-based, neural networks
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 12 pages, 6 figures, 4 tables

点击查看摘要

Abstract:This paper investigates the application of Transformer-based neural networks to stock price forecasting, with a special focus on the intersection of machine learning techniques and financial market analysis. The evolution of Transformer models, from their inception to their adaptation for time series analysis in financial contexts, is reviewed and discussed. Central to our study is the exploration of the Hidformer model, which is currently recognized for its promising performance in time series prediction. The primary aim of this paper is to determine whether Hidformer will also prove itself in the task of stock price prediction. This slightly modified model serves as the framework for our experiments, integrating the principles of technical analysis with advanced machine learning concepts to enhance stock price prediction accuracy. We conduct an evaluation of the Hidformer model’s performance, using a set of criteria to determine its efficacy. Our findings offer additional insights into the practical application of Transformer architectures in financial time series forecasting, highlighting their potential to improve algorithmic trading strategies, including human decision making.

[AI-63] A Fully Hardware Implemented Accelerator Design in ReRAM Analog Computing without ADCs

链接: https://arxiv.org/abs/2412.19869
作者: Peng Dang,Huawei Li,Wei Wang
关键词: Emerging ReRAM-based accelerators, ultra-high energy efficiency, Emerging ReRAM-based, accelerators process neural, ReRAM-based accelerators process
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Emerging ReRAM-based accelerators process neural networks via analog Computing-in-Memory (CiM) for ultra-high energy efficiency. However, significant overhead in peripheral circuits and complex nonlinear activation modes constrain system energy efficiency improvements. This work explores the hardware implementation of the Sigmoid and SoftMax activation functions of neural networks with stochastically binarized neurons by utilizing sampled noise signals from ReRAM devices to achieve a stochastic effect. We propose a complete ReRAM-based Analog Computing Accelerator (RACA) that accelerates neural network computation by leveraging stochastically binarized neurons in combination with ReRAM crossbars. The novel circuit design removes significant sources of energy/area efficiency degradation, i.e., the Digital-to-Analog and Analog-to-Digital Converters (DACs and ADCs) as well as the components to explicitly calculate the activation functions. Experimental results show that our proposed design outperforms traditional architectures across all overall performance metrics without compromising inference accuracy.

[AI-64] Back To The Future: A Hybrid Transformer-XGBoost Model for Action-oriented Future-proofing Nowcasting

链接: https://arxiv.org/abs/2412.19832
作者: Ziheng Sun
关键词: iconic movie Back, innovative adaptive nowcasting, adaptive nowcasting approach, movie Back, paper explores
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Inspired by the iconic movie Back to the Future, this paper explores an innovative adaptive nowcasting approach that reimagines the relationship between present actions and future outcomes. In the movie, characters travel through time to manipulate past events, aiming to create a better future. Analogously, our framework employs predictive insights about the future to inform and adjust present conditions. This dual-stage model integrates the forecasting power of Transformers (future visionary) with the interpretability and efficiency of XGBoost (decision maker), enabling a seamless loop of future prediction and present adaptation. Through experimentation with meteorological datasets, we demonstrate the framework’s advantage in achieving more accurate forecasting while guiding actionable interventions for real-time applications.

[AI-65] A Unified Framework for Context-Aware IoT Management and State-of-the-Art IoT Traffic Anomaly Detection

链接: https://arxiv.org/abs/2412.19830
作者: Daniel Adu Worae,Athar Sheikh,Spyridon Mastorakis
关键词: Internet of Things, introduced growing complexities, expansion of Internet, ecosystems has introduced, rapid expansion
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid expansion of Internet of Things (IoT) ecosystems has introduced growing complexities in device management and network security. To address these challenges, we present a unified framework that combines context-driven large language models (LLMs) for IoT administrative tasks with a fine-tuned anomaly detection module for network traffic analysis. The framework streamlines administrative processes such as device management, troubleshooting, and security enforcement by harnessing contextual knowledge from IoT manuals and operational data. The anomaly detection model achieves state-of-the-art performance in identifying irregularities and threats within IoT traffic, leveraging fine-tuning to deliver exceptional accuracy. Evaluations demonstrate that incorporating relevant contextual information significantly enhances the precision and reliability of LLM-based responses for diverse IoT administrative tasks. Additionally, resource usage metrics such as execution time, memory consumption, and response efficiency demonstrate the framework’s scalability and suitability for real-world IoT deployments.

[AI-66] AnalogXpert: Automating Analog Topology Synthesis by Incorporating Circuit Design Expertise into Large Language Models

链接: https://arxiv.org/abs/2412.19824
作者: Haoyi Zhang,Shizhao Sun,Yibo Lin,Runsheng Wang,Jiang Bian
关键词: modern electronic systems, significant research interest, attracted significant research, electronic systems, research interest
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Analog circuits are crucial in modern electronic systems, and automating their design has attracted significant research interest. One of the major challenges is topology synthesis, which determines circuit components and their connections. Recent studies explore large language models (LLMs) for topology synthesis. However, the scenarios addressed by these studies do not align well with practical applications. Specifically, existing work uses vague design requirements as input and outputs an ideal model, but detailed structural requirements and device-level models are more practical. Moreover, current approaches either formulate topology synthesis as graph generation or Python code generation, whereas practical topology design is a complex process that demands extensive design knowledge. In this work, we propose AnalogXpert, an LLM-based agent aimed at solving the practical topology synthesis problem by incorporating circuit design expertise into LLMs. First, we represent analog topology as SPICE code and introduce a subcircuit library to reduce the design space, in the same manner as experienced designers. Second, we decompose the problem into two sub-tasks (i.e., block selection and block connection) through the use of CoT and in-context learning techniques, to mimic the practical design process. Third, we introduce a proofreading strategy that allows LLMs to incrementally correct the errors in the initial design, akin to human designers who iteratively check and adjust the initial topology design to ensure accuracy. Finally, we construct a high-quality benchmark containing both real data (30) and synthetic data (2k). AnalogXpert achieves 40% and 23% success rates on the synthetic dataset and real dataset respectively, which is markedly better than those of GPT-4o (3% on both the synthetic dataset and the real dataset).

[AI-67] A Survey on Large Language Models for Communication Network and Service Management: Application Insights Challenges and Future Directions

链接: https://arxiv.org/abs/2412.19823
作者: Gordon Owusu Boateng,Hani Sami,Ahmed Alagha,Hanae Elmekki,Ahmad Hammoud,Rabeb Mizouni,Azzam Mourad,Hadi Otrok,Jamal Bentahar,Sami Muhaidat,Chamseddine Talhi,Zbigniew Dziong,Mohsen Guizani
关键词: Service Management, Natural Language Processing, communication NSM, communication NSM tasks, diverse communication NSM
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid evolution of communication networks in recent decades has intensified the need for advanced Network and Service Management (NSM) strategies to address the growing demands for efficiency, scalability, enhanced performance, and reliability of these networks. Large Language Models (LLMs) have received tremendous attention due to their unparalleled capabilities in various Natural Language Processing (NLP) tasks and generating context-aware insights, offering transformative potential for automating diverse communication NSM tasks. Contrasting existing surveys that consider a single network domain, this survey investigates the integration of LLMs across different communication network domains, including mobile networks and related technologies, vehicular networks, cloud-based networks, and fog/edge-based networks. First, the survey provides foundational knowledge of LLMs, explicitly detailing the generic transformer architecture, general-purpose and domain-specific LLMs, LLM model pre-training and fine-tuning, and their relation to communication NSM. Under a novel taxonomy of network monitoring and reporting, AI-powered network planning, network deployment and distribution, and continuous network support, we extensively categorize LLM applications for NSM tasks in each of the different network domains, exploring existing literature and their contributions thus far. Then, we identify existing challenges and open issues, as well as future research directions for LLM-driven communication NSM, emphasizing the need for scalable, adaptable, and resource-efficient solutions that align with the dynamic landscape of communication networks. We envision that this survey serves as a holistic roadmap, providing critical insights for leveraging LLMs to enhance NSM.

[AI-68] Nanoscaling Floating-Point (NxFP): NanoMantissa Adaptive Microexponents and Code Recycling for Direct-Cast Compression of Large Language Models

链接: https://arxiv.org/abs/2412.19821
作者: Yun-Chen Lo,Gu-Yeon Wei,David Brooks
关键词: large language models, fast-growing model size, cutting-edge large language, language models, fast-growing model
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 12 pages, 12 figures

点击查看摘要

Abstract:As cutting-edge large language models (LLMs) continue to transform various industries, their fast-growing model size and sequence length have led to memory traffic and capacity challenges. Recently, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm have proposed a Microscaling standard (Mx), which augments block floating-point with microexponents to achieve promising perplexity-to-footprint trade-offs. However, the Microscaling suffers from significant perplexity degradation on modern LLMs with less than six bits. This paper profiles modern LLMs and identifies three main challenges of low-bit Microscaling format, i.e., inaccurate tracking of outliers, vacant quantization levels, and wasted binary code. In response, Nanoscaling (NxFP) proposes three techniques, i.e., NanoMantissa, Adaptive Microexponent, and Code Recycling to enable better accuracy and smaller memory footprint than state-of-the-art MxFP. Experimental results on direct-cast inference across various modern LLMs demonstrate that our proposed methods outperform state-of-the-art MxFP by up to 0.64 in perplexity and by up to 30% in accuracy on MMLU benchmarks. Furthermore, NxFP reduces memory footprint by up to 16% while achieving comparable perplexity as MxFP.
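
【代码示意】下面用 NumPy 给出"按块共享缩放因子"的 block floating-point 量化草图,仅用于说明 Microscaling/NxFP 这类格式所依赖的基本思想;块大小、位宽与函数名均为演示假设,并不代表 MxFP/NxFP 标准的实际定义。

```python
# 极简示意:按块共享缩放因子的 block floating-point 量化(示意基本思想,非 MxFP/NxFP 官方规范)。
import numpy as np

def block_quantize(x: np.ndarray, block_size: int = 32, bits: int = 4):
    """Quantize a 1-D tensor block-by-block with one shared scale per block."""
    qmax = 2 ** (bits - 1) - 1
    pad = (-len(x)) % block_size
    xp = np.pad(x, (0, pad))
    blocks = xp.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0
    q = np.round(blocks / scales).clip(-qmax, qmax)
    return q.astype(np.int8), scales

def block_dequantize(q, scales, orig_len):
    return (q * scales).reshape(-1)[:orig_len]

x = np.random.randn(1000).astype(np.float32)
q, s = block_quantize(x)
x_hat = block_dequantize(q, s, len(x))
print("mean abs error:", np.abs(x - x_hat).mean())
```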

[AI-69] ChipAlign: Instruction Alignment in Large Language Models for Chip Design via Geodesic Interpolation

链接: https://arxiv.org/abs/2412.19819
作者: Chenhui Deng,Yunsheng Bai,Haoxing Ren
关键词: Recent advancements, large language models, ChipNeMo have emerged, advancements in large, large language
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have expanded their application across various domains, including chip design, where domain-adapted chip models like ChipNeMo have emerged. However, these models often struggle with instruction alignment, a crucial capability for LLMs that involves following explicit human directives. This limitation impedes the practical application of chip LLMs, including serving as assistant chatbots for hardware design engineers. In this work, we introduce ChipAlign, a novel approach that utilizes a training-free model merging strategy, combining the strengths of a general instruction-aligned LLM with a chip-specific LLM. By considering the underlying manifold in the weight space, ChipAlign employs geodesic interpolation to effectively fuse the weights of input LLMs, producing a merged model that inherits strong instruction alignment and chip expertise from the respective instruction and chip LLMs. Our results demonstrate that ChipAlign significantly enhances instruction-following capabilities of existing chip LLMs, achieving up to a 26.6% improvement on the IFEval benchmark, while maintaining comparable expertise in the chip domain. This improvement in instruction alignment also translates to notable gains in instruction-involved QA tasks, delivering performance enhancements of 3.9% on the OpenROAD QA benchmark and 8.25% on production-level chip QA benchmarks, surpassing state-of-the-art baselines.
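
【代码示意】下面给出用球面插值(slerp)融合两组权重的最小示例,用以说明摘要中"测地线插值"式的权重合并思路;权重取随机张量,插值系数等均为演示假设,并非 ChipAlign 的官方算法。

```python
# 极简示意:对两组模型权重做球面(测地线)插值(slerp),示意 geodesic interpolation 式的权重融合。
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two flattened weight tensors."""
    a, b = w_a.ravel(), w_b.ravel()
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos_theta = np.clip(np.dot(a, b) / (na * nb + eps), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < eps:  # nearly parallel: fall back to linear interpolation
        merged = (1 - t) * a + t * b
    else:
        merged = (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
    return merged.reshape(w_a.shape)

# toy example: merge an "instruction-aligned" and a "chip-domain" weight tensor
w_instruct = np.random.randn(4, 4)
w_chip = np.random.randn(4, 4)
w_merged = slerp(w_instruct, w_chip, t=0.5)
print(w_merged.shape)
```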

[AI-70] LINKs: Large Language Model Integrated Management for 6G Empowered Digital Twin NetworKs

链接: https://arxiv.org/abs/2412.19811
作者: Shufan Jiang,Bangyan Lin,Yue Wu,Yuan Gao
关键词: large language models, rapidly evolving landscape, digital twins, language models, rapidly evolving
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Accepted by The 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall)

点击查看摘要

Abstract:In the rapidly evolving landscape of digital twins (DT) and 6G networks, the integration of large language models (LLMs) presents a novel approach to network management. This paper explores the application of LLMs in managing 6G-empowered DT networks, with a focus on optimizing data retrieval and communication efficiency in smart city scenarios. The proposed framework leverages LLMs for intelligent DT problem analysis and radio resource management (RRM) in a fully autonomous way without any manual intervention. Our proposed framework, LINKs, builds a lazy loading strategy that minimizes transmission delay by selectively retrieving the relevant data. Based on the data retrieval plan, LLMs transform the retrieval task into a numerical optimization problem and utilize solvers to build an optimal RRM, ensuring efficient communication across the network. Simulation results demonstrate the performance improvements in data planning and network management, highlighting the potential of LLMs to enhance the integration of DT and 6G technologies.

[AI-71] AI-driven Automation as a Pre-condition for Eudaimonia

链接: https://arxiv.org/abs/2412.19808
作者: Anastasia Siapka
关键词: intrinsically valuable activity, future of work, loss of work, valuable activity, saturated with alarmist
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The debate surrounding the ‘future of work’ is saturated with alarmist warnings about the loss of work as an intrinsically valuable activity. Instead, the present doctoral research approaches this debate from the perspective of human flourishing (eudaimonia). It articulates a neo-Aristotelian interpretation according to which the prospect of mass AI-driven automation, far from being a threat, is rather desirable insofar as it facilitates humans’ flourishing and, subsequently, their engagement in leisure. Drawing on virtue jurisprudence, this research further explores what this desirability may imply for the current legal order.

[AI-72] exLong: Generating Exceptional Behavior Tests with Large Language Models ICSE2025

链接: https://arxiv.org/abs/2405.14619
作者: Jiyang Zhang,Yu Liu,Pengyu Nie,Junyi Jessy Li,Milos Gligoric
关键词: popular programming languages, popular programming, Java, Python, support exceptions
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: ICSE 2025 (camera ready)

点击查看摘要

Abstract:Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their efforts on “happy paths”, e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed exLong, that automatically generates EBTs. exLong is a large language model instruction fine-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare exLong with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT-4o), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that exLong outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by exLong were already accepted.

[AI-73] EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion AAAI2025

链接: https://arxiv.org/abs/2412.20359
作者: Ashishkumar Gudmalwar,Ishan D. Biyani,Nirmesh Shah,Pankaj Wasnik,Rajiv Ratn Shah
关键词: Emotional Voice Conversion, Voice Conversion, preserving linguistic content, diffusion-based EVC framework, discrete emotional state
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
*备注: Accepted to AAAI 2025

点击查看摘要

Abstract:The Emotional Voice Conversion (EVC) aims to convert the discrete emotional state from the source emotion to the target for a given speech utterance while preserving linguistic content. In this paper, we propose regularizing emotion intensity in the diffusion-based EVC framework to generate precise speech of the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels that often lead to inept style manipulations and degradations in quality. On the contrary, we aim to regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space within a diffusion-based framework. These emotion embeddings can be modified based on the given target emotion intensity and the corresponding direction vector. Furthermore, the updated embeddings can be fused in the reverse diffusion process to generate the speech with the desired emotion and intensity. In summary, this paper aims to achieve high-quality emotional intensity regularization in the diffusion-based EVC framework, which is the first of its kind work. The effectiveness of the proposed method has been shown across state-of-the-art (SOTA) baselines in terms of subjective and objective evaluations for the English and Hindi languages (demo samples are available at the following URL: this https URL).
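
【代码示意】以下草图演示"方向向量调节情感强度"的基本想法:用高/低强度样本嵌入均值之差作为方向向量,再按给定强度沿该方向平移嵌入。嵌入数据为随机占位,并非 EmoReg 论文的实现细节。

```python
# 极简示意:在情感嵌入空间中用方向向量调节情感强度(direction = 高强度样本均值 - 低强度样本均值)。
import numpy as np

def intensity_direction(high_emb: np.ndarray, low_emb: np.ndarray) -> np.ndarray:
    """Unsupervised direction vector pointing from low- to high-intensity embeddings."""
    d = high_emb.mean(axis=0) - low_emb.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-8)

def regulate(emb: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an utterance embedding along the intensity direction by scale alpha."""
    return emb + alpha * direction

rng = np.random.default_rng(0)
high = rng.normal(1.0, 0.2, size=(50, 128))   # placeholder embeddings of high-intensity utterances
low = rng.normal(0.0, 0.2, size=(50, 128))    # placeholder embeddings of low-intensity utterances
d = intensity_direction(high, low)
new_emb = regulate(rng.normal(size=128), d, alpha=0.7)
print(new_emb.shape)
```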

[AI-74] TradingAgents: Multi-Agents LLM Financial Trading Framework AAAI2025

链接: https://arxiv.org/abs/2412.20138
作者: Yijia Xiao,Edward Sun,Di Luo,Wei Wang
关键词: Significant progress, large language models, made in automated, automated problem-solving, problem-solving using societies
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: Multi-Agent AI in the Real World, AAAI 2025

点击查看摘要

Abstract:Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, multi-agent systems’ potential to replicate real-world trading firms’ collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading.

[AI-75] CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation

链接: https://arxiv.org/abs/2412.20048
作者: Ji-Hoon Kim,Hong-Sun Yang,Yoon-Cheol Ju,Il-Hwan Kim,Byeong-Yeol Kim,Joon Son Chung
关键词: cross-lingual speech synthesis, generate natural speech, cross-lingual speech, speech synthesis, generate natural
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representation. We conduct extensive experiments using various metrics, and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.

[AI-76] Towards Strong AI: Transformational Beliefs and Scientific Creativity

链接: https://arxiv.org/abs/2412.19938
作者: Samuel J. Eschker,Chuanhai Liu
关键词: Strong artificial intelligence, possess general cognitive, general cognitive abilities, artificial intelligence, human intelligence
类目: Other Statistics (stat.OT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Strong artificial intelligence (AI) is envisioned to possess general cognitive abilities and scientific creativity comparable to human intelligence, encompassing both knowledge acquisition and problem-solving. While remarkable progress has been made in weak AI, the realization of strong AI remains a topic of intense debate and critical examination. In this paper, we explore pivotal innovations in the history of astronomy and physics, focusing on the discovery of Neptune and the concept of scientific revolutions as perceived by philosophers of science. Building on these insights, we introduce a simple theoretical and statistical framework of weak beliefs, termed the Transformational Belief (TB) framework, designed as a foundation for modeling scientific creativity. Through selected illustrative examples in statistical science, we demonstrate the TB framework’s potential as a promising foundation for understanding, analyzing, and even fostering creativity – paving the way toward the development of strong AI. We conclude with reflections on future research directions and potential advancements.

[AI-77] Pivoting B2B platform business models: From platform experimentation to multi-platform integration to ecosystem envelopment

链接: https://arxiv.org/abs/2412.19931
作者: Clara Filosa,Marin Jovanovic,Lara Agostini,Anna Nosella
关键词: platform business models, platform, business models, shift from traditional, traditional product-centric
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The landscape of digital servitization in the manufacturing sector is evolving, marked by a strategic shift from traditional product-centric to platform business models (BMs). Manufacturing firms often employ a blend of approaches to develop business-to-business (B2B) platforms, leading to significant reconfigurations in their BMs. However, they frequently encounter failures in their B2B platform development initiatives, leading them to abandon initial efforts and pivot to alternative platform strategies. Therefore, this study, through an in-depth case study of a manufacturer in the energy sector, articulates a three-phase pivoting framework for B2B platform BMs, including platform development and platform strategy. Initially, the manufacturer focused on asset-based product sales supplemented by asset maintenance services and followed an emergent platformization strategy characterized by the rise of multiple, independent B2B platforms catering to diverse functions. Next, focusing on the imposed customer journey strategy, the firm shifted towards a strategic multi-platform integration into an all-encompassing platform supported by artificial intelligence (AI), signaling a maturation of the platform BM to combine a wide range of services into an energy-performance-based contract. Finally, the last step of the firm’s platform BM evolution consisted of a deliberate platform strategy open to external stakeholders and enveloping its data-driven offerings within a broader platform ecosystem. This article advances B2B platform BMs and digital servitization literature, highlighting the efficacy of a progressive approach and strategic pivoting.

[AI-78] Modeling Continuous Spatial-temporal Dynamics of Turbulent Flow with Test-time Refinement

链接: https://arxiv.org/abs/2412.19927
作者: Shengyu Chen,Peyman Givi,Can Zheng,Xiaowei Jia
关键词: including climate science, holds immense significance, flows holds immense, freshwater science, turbulent flows holds
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:The precise simulation of turbulent flows holds immense significance across various scientific and engineering domains, including climate science, freshwater science, and energy-efficient manufacturing. Within the realm of simulating turbulent flows, large eddy simulation (LES) has emerged as a prevalent alternative to direct numerical simulation (DNS), offering computational efficiency. However, LES cannot accurately capture the full spectrum of turbulent transport scales and is present only at a lower spatial resolution. Reconstructing high-fidelity DNS data from the lower-resolution LES data is essential for numerous applications, but it poses significant challenges to existing super-resolution techniques, primarily due to the complex spatio-temporal nature of turbulent flows. This paper proposes a novel flow reconstruction approach that leverages physical knowledge to model flow dynamics. Different from traditional super-resolution techniques, the proposed approach uses LES data only in the testing phase through a degradation-based refinement approach to enforce physical constraints and mitigate cumulative reconstruction errors over time. Furthermore, a feature sampling strategy is developed to enable flow data reconstruction across different resolutions. The results on two distinct sets of turbulent flow data indicate the effectiveness of the proposed method in reconstructing high-resolution DNS data, preserving the inherent physical attributes of flow transport, and achieving DNS reconstruction at different resolutions.

[AI-79] Identifying Cocoa Pollinators: A Deep Learning Dataset

链接: https://arxiv.org/abs/2412.19915
作者: Wenxiu Xu,Saba Ghorbani Bazegar,Dong Sheng,Manuel Toledo-Hernandez,ZhenZhong Lan,Thomas Cherico Wanger
关键词: pollination remains limited, industry but research, remains limited, cocoa flower, research on improving
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: The manuscript introduces the first cocoa pollination dataset and an example analysis with YOLOv8 models

点击查看摘要

Abstract:Cocoa is a multi-billion-dollar industry but research on improving yields through pollination remains limited. New embedded hardware and AI-based data analysis is advancing information on cocoa flower visitors, their identity and implications for yields. We present the first cocoa flower visitor dataset containing 5,792 images of Ceratopogonidae, Formicidae, Aphididae, Araneae, and Encyrtidae, and 1,082 background cocoa flower images. This dataset was curated from 23 million images collected over two years by embedded cameras in cocoa plantations in Hainan province, China. We exemplify the use of the dataset with different sizes of YOLOv8 models and by progressively increasing the background image ratio in the training set to identify the best-performing model. The medium-sized YOLOv8 model achieved the best results with 8% background images (F1 Score of 0.71, mAP50 of 0.70). Overall, this dataset is useful to compare the performance of deep learning model architectures on images with low contrast images and difficult detection targets. The data can support future efforts to advance sustainable cocoa production through pollination monitoring projects.

[AI-80] Unveiling Secrets of Brain Function With Generative Modeling: Motion Perception in Primates Cortical Network Organization in Mice

链接: https://arxiv.org/abs/2412.19845
作者: Hadi Vafaii
关键词: Chapter, addressing questions, questions in neuroscience, neuroscience through applications, hierarchical VAE
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注: This is my PhD Dissertation, defended on November 3, 2023

点击查看摘要

Abstract:This Dissertation is comprised of two main projects, addressing questions in neuroscience through applications of generative modeling. Project #1 (Chapter 4) explores how neurons encode features of the external world. I combine Helmholtz’s “Perception as Unconscious Inference” – paralleled by modern generative models like variational autoencoders (VAE) – with the hierarchical structure of the visual cortex. This combination leads to the development of a hierarchical VAE model, which I test for its ability to mimic neurons from the primate visual cortex in response to motion stimuli. Results show that the hierarchical VAE perceives motion similar to the primate brain. Additionally, the model identifies causal factors of retinal motion inputs, such as object- and self-motion, in a completely unsupervised manner. Collectively, these results suggest that hierarchical inference underlines the brain’s understanding of the world, and hierarchical VAEs can effectively model this understanding. Project #2 (Chapter 5) investigates the spatiotemporal structure of spontaneous brain activity and its reflection of brain states like rest. Using simultaneous fMRI and wide-field Ca2+ imaging data, this project demonstrates that the mouse cortex can be decomposed into overlapping communities, with around half of the cortical regions belonging to multiple communities. Comparisons reveal similarities and differences between networks inferred from fMRI and Ca2+ signals. The introduction (Chapter 1) is divided similarly to this abstract: sections 1.1 to 1.8 provide background information about Project #1, and sections 1.9 to 1.13 are related to Project #2. Chapter 2 includes historical background, Chapter 3 provides the necessary mathematical background, and finally, Chapter 6 contains concluding remarks and future directions.

[AI-81] Predicting Human Brain States with Transformer MICCAI

链接: https://arxiv.org/abs/2412.19814
作者: Yifei Sun,Mariano Cabezas,Jiah Lee,Chenyu Wang,Wei Zhang,Fernando Calamante,Jinglei Lv
关键词: highly dynamic system, brain states, human brain, fMRI brain states, brain
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, MICCAI MMMI workshop in press

点击查看摘要

Abstract:The human brain is a complex and highly dynamic system, and our current knowledge of its functional mechanism is still very limited. Fortunately, with functional magnetic resonance imaging (fMRI), we can observe blood oxygen level-dependent (BOLD) changes, reflecting neural activity, to infer brain states and dynamics. In this paper, we ask the question of whether the brain states represented by the regional brain fMRI can be predicted. Due to the success of self-attention and the transformer architecture in sequential auto-regression problems (e.g., language modelling or music generation), we explore the possibility of the use of transformers to predict human brain resting states based on the large-scale high-quality fMRI data from the human connectome project (HCP). Current results have shown that our model can accurately predict the brain states up to 5.04s with the previous 21.6s. Furthermore, even though the prediction error accumulates for the prediction of a longer time period, the generated fMRI brain states reflect the architecture of functional connectome. These promising initial results demonstrate the possibility of developing generative models for fMRI data using self-attention that learns the functional organization of the human brain. Our code is available at: this https URL.

机器学习

[LG-0] SoS Certificates for Sparse Singular Values and Their Applications: Robust Statistics Subspace Distortion and More

链接: https://arxiv.org/abs/2412.21203
作者: Ilias Diakonikolas,Samuel B. Hopkins,Ankit Pensia,Stefan Tiegel
关键词: random rectangular matrices, algorithms, lower bounds, independent Gaussian entries, singular value certificates
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study \textitsparse singular value certificates for random rectangular matrices. If M is an n \times d matrix with independent Gaussian entries, we give a new family of polynomial-time algorithms which can certify upper bounds on the maximum of |M u| , where u is a unit vector with at most \eta n nonzero entries for a given \eta \in (0,1) . This basic algorithmic primitive lies at the heart of a wide range of problems across algorithmic statistics and theoretical computer science. Our algorithms certify a bound which is asymptotically smaller than the naive one, given by the maximum singular value of M , for nearly the widest-possible range of n,d, and \eta . Efficiently certifying such a bound for a range of n,d and \eta which is larger by any polynomial factor than what is achieved by our algorithm would violate lower bounds in the SQ and low-degree polynomials models. Our certification algorithm makes essential use of the Sum-of-Squares hierarchy. To prove the correctness of our algorithm, we develop a new combinatorial connection between the graph matrix approach to analyze random matrices with dependent entries, and the Efron-Stein decomposition of functions of independent random variables. As applications of our certification algorithm, we obtain new efficient algorithms for a wide range of well-studied algorithmic tasks. In algorithmic robust statistics, we obtain new algorithms for robust mean and covariance estimation with tradeoffs between breakdown point and sample complexity, which are nearly matched by SQ and low-degree polynomial lower bounds (that we establish). We also obtain new polynomial-time guarantees for certification of \ell_1/\ell_2 distortion of random subspaces of \mathbbR^n (also with nearly matching lower bounds), sparse principal component analysis, and certification of the 2\rightarrow p norm of a random matrix.

[LG-1] Unified dimensionality reduction techniques in chronic liver disease detection

链接: https://arxiv.org/abs/2412.21156
作者: Anand Karna,Naina Khan,Rahul Rauniyar,Prashant Giridhar Shambharkar
关键词: major health concern, Machine Learning Repository, Irvine UCI Machine, UCI Machine Learning, liver disease continues
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Globally, chronic liver disease continues to be a major health concern that requires precise predictive models for prompt detection and treatment. Using the Indian Liver Patient Dataset (ILPD) from the University of California at Irvine’s UCI Machine Learning Repository, a number of machine learning algorithms are investigated in this study. The main focus of our research is this dataset, which includes the medical records of 583 patients, 416 of whom have been diagnosed with liver disease and 167 of whom have not. There are several aspects to this work, including feature extraction and dimensionality reduction methods like Linear Discriminant Analysis (LDA), Factor Analysis (FA), t-distributed Stochastic Neighbour Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). The purpose of the study is to investigate how well these approaches work for converting high-dimensional datasets and improving prediction accuracy. To assess the prediction ability of the improved models, a number of classification methods were used, such as Multi-layer Perceptron, Random Forest, K-nearest neighbours, and Logistic Regression. Remarkably, the improved models performed admirably, with Random Forest having the highest accuracy of 98.31% in 10-fold cross-validation and 95.79% in train-test split evaluation. Findings offer important new perspectives on the choice and use of customized feature extraction and dimensionality reduction methods, which improve predictive models for patients with chronic liver disease.
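
【代码示意】下面用 scikit-learn 给出"降维(LDA)+ 随机森林 + 10 折交叉验证"的流程草图;数据为随机占位,实际使用时应替换为 ILPD 数据集,输出数字与论文结果无关。

```python
# 极简示意:降维 + 分类器 + 10 折交叉验证的评估流程(数据为占位)。
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(583, 10))          # placeholder for the 10 ILPD features
y = rng.integers(0, 2, size=583)        # placeholder labels (disease / no disease)

pipe = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),  # binary task -> at most 1 component
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(pipe, X, y, cv=10, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```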

[LG-2] Functional Risk Minimization

链接: https://arxiv.org/abs/2412.21149
作者: Ferran Alet,Clement Gehring,Tomás Lozano-Pérez,Kenji Kawaguchi,Joshua B. Tenenbaum,Leslie Pack Kaelbling
关键词: Machine Learning, Empirical Risk Minimization, Functional Risk Minimization, field of Machine, Learning has changed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The field of Machine Learning has changed significantly since the 1970s. However, its most basic principle, Empirical Risk Minimization (ERM), remains unchanged. We propose Functional Risk Minimization~(FRM), a general framework where losses compare functions rather than outputs. This results in better performance in supervised, unsupervised, and RL experiments. In the FRM paradigm, for each data point (x_i,y_i) there is function f_\theta_i that fits it: y_i = f_\theta_i(x_i) . This allows FRM to subsume ERM for many common loss functions and to capture more realistic noise processes. We also show that FRM provides an avenue towards understanding generalization in the modern over-parameterized regime, as its objective can be framed as finding the simplest model that fits the training data.
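
【代码示意】以下玩具例子是我们对 y_i = f_{θ_i}(x_i) 思想的一种示意性解读:对一维线性模型,ERM 比较输出误差,而"函数式"损失比较全局参数与逐样本拟合参数的差异;这并非论文给出的 FRM 目标本身。

```python
# 玩具示意:对一维线性模型 f_theta(x) = theta * x,
# ERM 比较输出误差,"函数式"损失比较 theta 与逐样本参数 theta_i = y_i / x_i 的差异(仅为示意)。
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=200)
theta_true = 3.0
y = theta_true * x + rng.normal(0, 0.3, size=200)

theta_i = y / x                      # per-sample parameter that exactly fits (x_i, y_i)

def erm_loss(theta):                 # compare outputs
    return np.mean((theta * x - y) ** 2)

def functional_loss(theta):          # compare functions via their parameters
    return np.mean((theta - theta_i) ** 2)

grid = np.linspace(0, 6, 601)
print("ERM argmin       :", grid[np.argmin([erm_loss(t) for t in grid])])
print("Functional argmin:", grid[np.argmin([functional_loss(t) for t in grid])])
```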

[LG-3] Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism

链接: https://arxiv.org/abs/2412.21124
作者: Tim Tsz-Kit Lau,Weijian Li,Chenwei Xu,Han Liu,Mladen Kolar
关键词: batch size, batch size schedules, large-batch training improves, adaptive batch size, batch
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:An appropriate choice of batch sizes in large-scale model training is crucial, yet it involves an intrinsic yet inevitable dilemma: large-batch training improves training efficiency in terms of memory utilization, while generalization performance often deteriorates due to small amounts of gradient noise. Despite this dilemma, the common practice of choosing batch sizes in language model training often prioritizes training efficiency – employing either constant large sizes with data parallelism or implementing batch size warmup schedules. However, such batch size schedule designs remain heuristic and often fail to adapt to training dynamics, presenting the challenge of designing adaptive batch size schedules. Given the abundance of available datasets and the data-hungry nature of language models, data parallelism has become an indispensable distributed training paradigm, enabling the use of larger batch sizes for gradient computation. However, vanilla data parallelism requires replicas of model parameters, gradients, and optimizer states at each worker, which prohibits training larger models with billions of parameters. To optimize memory usage, more advanced parallelism strategies must be employed. In this work, we propose general-purpose and theoretically principled adaptive batch size schedules compatible with data parallelism and model parallelism. We develop a practical implementation with PyTorch Fully Sharded Data Parallel, facilitating the pretraining of language models of different sizes. We empirically demonstrate that our proposed approaches outperform constant batch sizes and heuristic batch size warmup schedules in the pretraining of models in the Llama family, with particular focus on smaller models with up to 3 billion parameters. We also establish theoretical convergence guarantees for such adaptive batch size schedules with Adam for general smooth nonconvex objectives.
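
【代码示意】下面是一个纯演示性的"自适应批大小调度器":当平滑后的损失进入平台期时把批大小翻倍。触发规则(损失平台检测)是为演示而设的假设,并不是论文提出的调度准则。

```python
# 纯示意:损失进入平台期时把批大小翻倍的玩具调度器(触发规则为演示假设)。
class AdaptiveBatchSizeSchedule:
    def __init__(self, init_bs=32, max_bs=4096, patience=200, rel_improve=0.01):
        self.bs = init_bs
        self.max_bs = max_bs
        self.patience = patience
        self.rel_improve = rel_improve
        self.best = float("inf")
        self.stale_steps = 0

    def step(self, loss: float) -> int:
        """Call once per training step with the current loss; returns the batch size to use."""
        if loss < self.best * (1 - self.rel_improve):
            self.best = loss
            self.stale_steps = 0
        else:
            self.stale_steps += 1
        if self.stale_steps >= self.patience and self.bs < self.max_bs:
            self.bs = min(self.bs * 2, self.max_bs)   # double the batch size on a plateau
            self.stale_steps = 0
        return self.bs

sched = AdaptiveBatchSizeSchedule()
bs = sched.bs
for t in range(1000):
    fake_loss = 2.0 / (1 + t / 100)    # stand-in for the training loss
    bs = sched.step(fake_loss)
print("final batch size:", bs)
```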

[LG-4] On the Generalizability of Machine Learning-based Ransomware Detection in Block Storage

链接: https://arxiv.org/abs/2412.21084
作者: Nicolas Reategui,Roman Pletka,Dionysios Diamantopoulos
关键词: pervasive threat, traditionally countered, network levels, represents a pervasive, detection
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ransomware represents a pervasive threat, traditionally countered at the operating system, file-system, or network levels. However, these approaches often introduce significant overhead and remain susceptible to circumvention by attackers. Recent research activity started looking into the detection of ransomware by observing block IO operations. However, this approach exhibits significant detection challenges. Recognizing these limitations, our research pivots towards enabling robust ransomware detection in storage systems keeping in mind their limited computational resources available. To perform our studies, we propose a kernel-based framework capable of efficiently extracting and analyzing IO operations to identify ransomware activity. The framework can be adopted to storage systems using computational storage devices to improve security and fully hide detection overheads. Our method employs a refined set of computationally light features optimized for ML models to accurately discern malicious from benign activities. Using this lightweight approach, we study a wide range of generalizability aspects and analyze the performance of these models across a large space of setups and configurations covering a wide range of realistic real-world scenarios. We reveal various trade-offs and provide strong arguments for the generalizability of storage-based detection of ransomware and show that our approach outperforms currently available ML-based ransomware detection in storage. Empirical validation reveals that our decision tree-based models achieve remarkable effectiveness, evidenced by higher median F1 scores of up to 12.8%, lower false negative rates of up to 10.9% and particularly decreased false positive rates of up to 17.1% compared to existing storage-based detection approaches.

[LG-5] Privacy-Aware Multi-Device Cooperative Edge Inference with Distributed Resource Bidding

链接: https://arxiv.org/abs/2412.21069
作者: Wenhao Zhuang,Yuyi Mao
关键词: empowered mobile devices, Mobile edge computing, supporting artificial intelligence, proximal MEC servers, mobile devices
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This article was submitted to IEEE for possible publication

点击查看摘要

Abstract:Mobile edge computing (MEC) has empowered mobile devices (MDs) in supporting artificial intelligence (AI) applications through collaborative efforts with proximal MEC servers. Unfortunately, despite the great promise of device-edge cooperative AI inference, data privacy becomes an increasing concern. In this paper, we develop a privacy-aware multi-device cooperative edge inference system for classification tasks, which integrates a distributed bidding mechanism for the MEC server’s computational resources. Intermediate feature compression is adopted as a principled approach to minimize data privacy leakage. To determine the bidding values and feature compression ratios in a distributed fashion, we formulate a decentralized partially observable Markov decision process (DEC-POMDP) model, for which, a multi-agent deep deterministic policy gradient (MADDPG)-based algorithm is developed. Simulation results demonstrate the effectiveness of the proposed algorithm in privacy-preserving cooperative edge inference. Specifically, given a sufficient level of data privacy protection, the proposed algorithm achieves 0.31-0.95% improvements in classification accuracy compared to the approach being agnostic to the wireless channel conditions. The performance is further enhanced by 1.54-1.67% by considering the difficulties of inference data.

[LG-6] BridgePure: Revealing the Fragility of Black-box Data Protection

链接: https://arxiv.org/abs/2412.21061
作者: Yihan Wang,Yiwei Lu,Xiao-Shan Gao,Gautam Kamath,Yaoliang Yu
关键词: unauthorized machine learning, prevent unauthorized machine, Availability attacks, data intended functionality, machine learning models
类目: Machine Learning (cs.LG)
*备注: 26 pages,13 figures

点击查看摘要

Abstract:Availability attacks, or unlearnable examples, are defensive techniques that allow data owners to modify their datasets in ways that prevent unauthorized machine learning models from learning effectively while maintaining the data’s intended functionality. It has led to the release of popular black-box tools for users to upload personal data and receive protected counterparts. In this work, we show such black-box protections can be substantially bypassed if a small set of unprotected in-distribution data is available. Specifically, an adversary can (1) easily acquire (unprotected, protected) pairs by querying the black-box protections with the unprotected dataset; and (2) train a diffusion bridge model to build a mapping. This mapping, termed BridgePure, can effectively remove the protection from any previously unseen data within the same distribution. Under this threat model, our method demonstrates superior purification performance on classification and style mimicry tasks, exposing critical vulnerabilities in black-box data protection.

[LG-7] Learning Epidemiological Dynamics via the Finite Expression Method

链接: https://arxiv.org/abs/2412.21049
作者: Jianda Du,Senwei Liang,Chunmei Wang
关键词: essential for effective, Finite Expression Method, public health decision-making, effective public health, health decision-making
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Modeling and forecasting the spread of infectious diseases is essential for effective public health decision-making. Traditional epidemiological models rely on expert-defined frameworks to describe complex dynamics, while neural networks, despite their predictive power, often lack interpretability due to their “black-box” nature. This paper introduces the Finite Expression Method (FEX), a symbolic learning framework that leverages reinforcement learning to derive explicit mathematical expressions for epidemiological dynamics. Through numerical experiments on both synthetic and real-world datasets, FEX demonstrates high accuracy in modeling and predicting disease spread, while uncovering explicit relationships among epidemiological variables. These results highlight FEX as a powerful tool for infectious disease modeling, combining interpretability with strong predictive performance to support practical applications in public health.

[LG-8] Mind the truncation gap: challenges of learning on dynamic graphs with recurrent architectures

链接: https://arxiv.org/abs/2412.21046
作者: João Bravo,Jacopo Bono,Pedro Saleiro,Hugo Ferreira,Pedro Bizarro
关键词: Systems characterized, continuous-time dynamic graphs, prevalent in social, evolving interactions, biological domains
类目: Machine Learning (cs.LG)
*备注: Published in Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Systems characterized by evolving interactions, prevalent in social, financial, and biological domains, are effectively modeled as continuous-time dynamic graphs (CTDGs). To manage the scale and complexity of these graph datasets, machine learning (ML) approaches have become essential. However, CTDGs pose challenges for ML because traditional static graph methods do not naturally account for event timings. Newer approaches, such as graph recurrent neural networks (GRNNs), are inherently time-aware and offer advantages over static methods for CTDGs. However, GRNNs face another issue: the short truncation of backpropagation-through-time (BPTT), whose impact has not been properly examined until now. In this work, we demonstrate that this truncation can limit the learning of dependencies beyond a single hop, resulting in reduced performance. Through experiments on a novel synthetic task and real-world datasets, we reveal a performance gap between full backpropagation-through-time (F-BPTT) and the truncated backpropagation-through-time (T-BPTT) commonly used to train GRNN models. We term this gap the “truncation gap” and argue that understanding and addressing it is essential as the importance of CTDGs grows, discussing potential future directions for research in this area.

[LG-9] Machine Learning Optimal Ordering in Global Routing Problems in Semiconductors

链接: https://arxiv.org/abs/2412.21035
作者: Heejin Choi,Minji Lee,Chang Hyeong Lee,Jaeho Yang,Rak-Kyeong Seong
关键词: global routing problems, routing problems, global routing, process of layer, layer assignment
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注: 18 pages, 13 figures, 6 tables; published in Scientific Reports

点击查看摘要

Abstract:In this work, we propose a new method for ordering nets during the process of layer assignment in global routing problems. The global routing problems that we focus on in this work are based on routing problems that occur in the design of substrates in multilayered semiconductor packages. The proposed new method is based on machine learning techniques and we show that the proposed method supersedes conventional net ordering techniques based on heuristic score functions. We perform global routing experiments in multilayered semiconductor package environments in order to illustrate that the routing order based on our new proposed technique outperforms previous methods based on heuristics. Our approach of using machine learning for global routing targets specifically the net ordering step which we show in this work can be significantly improved by deep learning.

[LG-10] Improving Location-based Thermal Emission Side-Channel Analysis Using Iterative Transfer Learning

链接: https://arxiv.org/abs/2412.21030
作者: Tun-Chieh Lou,Chung-Che Wang,Jyh-Shing Roger Jang,Henian Li,Lang Lin,Norman Chang
关键词: deep learning models, iterative transfer learning, side-channel attack methods, paper proposes, transfer learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:This paper proposes the use of iterative transfer learning applied to deep learning models for side-channel attacks. Currently, most of the side-channel attack methods train a model for each individual byte, without considering the correlation between bytes. However, since the models’ parameters for attacking different bytes may be similar, we can leverage transfer learning, meaning that we first train the model for one of the key bytes, then use the trained model as a pretrained model for the remaining bytes. This technique can be applied iteratively, a process known as iterative transfer learning. Experimental results show that when using thermal or power consumption map images as input, and multilayer perceptron or convolutional neural network as the model, our method improves average performance, especially when the amount of data is insufficient.
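
【代码示意】下面用 PyTorch 勾勒"迭代迁移学习"的训练循环:先训练第 0 个密钥字节的模型,之后每个字节都从前一个字节的权重出发继续训练;轨迹数据与网络结构均为随机占位,并非论文的原始实验设置。

```python
# 极简示意:迭代式迁移学习——前一个字节的模型权重作为下一个字节的初始化(数据为占位)。
import copy
import torch
import torch.nn as nn

def make_mlp(in_dim=1024, n_classes=256):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

def train(model, X, y, epochs=3, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return model

n_bytes, n_traces, in_dim = 4, 512, 1024
X = torch.randn(n_traces, in_dim)                       # placeholder side-channel traces
labels = [torch.randint(0, 256, (n_traces,)) for _ in range(n_bytes)]

models = []
for b in range(n_bytes):
    model = make_mlp(in_dim) if b == 0 else copy.deepcopy(models[-1])  # reuse previous byte's weights
    models.append(train(model, X, labels[b]))
print("trained", len(models), "byte models via iterative transfer")
```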

[LG-11] EdgeRAG: Online-Indexed RAG for Edge Devices

链接: https://arxiv.org/abs/2412.21023
作者: Korakit Seemakhupt,Sihang Liu,Samira Khan
关键词: Deploying Retrieval Augmented, Retrieval Augmented Generation, resource-constrained edge devices, Deploying Retrieval, Retrieval Augmented
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying Retrieval Augmented Generation (RAG) on resource-constrained edge devices is challenging due to limited memory and processing power. In this work, we propose EdgeRAG which addresses the memory constraint by pruning embeddings within clusters and generating embeddings on-demand during retrieval. To avoid the latency of generating embeddings for large tail clusters, EdgeRAG pre-computes and stores embeddings for these clusters, while adaptively caching remaining embeddings to minimize redundant computations and further optimize latency. The result from BEIR suite shows that EdgeRAG offers significant latency reduction over the baseline IVF index, but with similar generation quality while allowing all of our evaluated datasets to fit into the memory.
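
【代码示意】下面的玩具代码演示摘要中的取舍:较大的尾部簇预先计算嵌入,其余簇在检索时按需计算并用缓存避免重复;embed() 是占位实现,并非 EdgeRAG 的真实嵌入模型或索引结构。

```python
# 极简示意:大簇预计算嵌入、小簇按需计算并缓存(embed 为占位实现)。
from functools import lru_cache
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a deterministic pseudo-random vector per text (within one run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

clusters = {  # cluster_id -> list of documents (toy data)
    0: [f"doc0-{i}" for i in range(1000)],   # large "tail" cluster: precompute
    1: [f"doc1-{i}" for i in range(20)],
    2: [f"doc2-{i}" for i in range(15)],
}
LARGE = 500
precomputed = {cid: np.stack([embed(d) for d in docs])
               for cid, docs in clusters.items() if len(docs) >= LARGE}

@lru_cache(maxsize=8)   # cache on-demand cluster embeddings to avoid recomputation
def ondemand(cid: int) -> np.ndarray:
    return np.stack([embed(d) for d in clusters[cid]])

def retrieve(query: str, cid: int, k: int = 3):
    embs = precomputed.get(cid)
    if embs is None:
        embs = ondemand(cid)
    scores = embs @ embed(query)
    return [clusters[cid][i] for i in np.argsort(-scores)[:k]]

print(retrieve("what is edge rag", cid=1))
```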

[LG-12] Text Classification: Neural Networks VS Machine Learning Models VS Pre-trained Models

链接: https://arxiv.org/abs/2412.21022
作者: Christos Petridis
关键词: common task nowadays, Natural Language Processing, neural networks, standard neural networks, common task
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text classification is a very common task nowadays and there are many efficient methods and algorithms that we can employ to accomplish it. Transformers have revolutionized the field of deep learning, particularly in Natural Language Processing (NLP) and have rapidly expanded to other domains such as computer vision, time-series analysis and more. The transformer model was firstly introduced in the context of machine translation and its architecture relies on self-attention mechanisms to capture complex relationships within data sequences. It is able to handle long-range dependencies more effectively than traditional neural networks (such as Recurrent Neural Networks and Multilayer Perceptrons). In this work, we present a comparison between different techniques to perform text classification. We take into consideration seven pre-trained models, three standard neural networks and three machine learning models. For standard neural networks and machine learning models we also compare two embedding techniques: TF-IDF and GloVe, with the latter consistently outperforming the former. Finally, we demonstrate the results from our experiments where pre-trained models such as BERT and DistilBERT always perform better than standard models/algorithms.
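
【代码示意】下面用 scikit-learn 演示"TF-IDF 特征 + 传统分类器"的对比方式;语料为玩具数据,预训练模型(如 BERT/DistilBERT)部分从略,输出结果与论文实验无关。

```python
# 极简示意:TF-IDF + 两种传统分类器的交叉验证对比(玩具语料)。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and acting", "what a fantastic film",
         "boring and too long", "brilliant performance", "worst film this year"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("random_forest", RandomForestClassifier(n_estimators=100))]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    acc = cross_val_score(pipe, texts, labels, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```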

[LG-13] Weber-Fechner Law in Temporal Difference learning derived from Control as Inference

链接: https://arxiv.org/abs/2412.21004
作者: Keiichiro Takahashi,Taisuke Kobayashi,Tomoya Yamanokuchi,Takamitsu Matsubara
关键词: update rule based, update rule, nonlinear update rule, temporal difference, based on temporal
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 36 pages 9 figures

点击查看摘要

Abstract:This paper investigates a novel nonlinear update rule based on temporal difference (TD) errors in reinforcement learning (RL). The update rule in the standard RL states that the TD error is linearly proportional to the degree of updates, treating all rewards equally without bias. On the other hand, recent biological studies revealed that there are nonlinearities in the TD error and the degree of updates, biasing policies to be optimistic or pessimistic. Such biases in learning due to nonlinearities are expected to be useful and intentionally leftover features in biological learning. Therefore, this research explores a theoretical framework that can leverage the nonlinearity between the degree of the update and TD errors. To this end, we focus on a control as inference framework, since it is known as a generalized formulation encompassing various RL and optimal control methods. In particular, we investigate the uncomputable nonlinear term needed to be approximately excluded in the derivation of the standard RL from control as inference. By analyzing it, Weber-Fechner law (WFL) is found, namely, perception (a.k.a. the degree of updates) in response to stimulus change (a.k.a. TD error) is attenuated by increase in the stimulus intensity (a.k.a. the value function). To numerically reveal the utilities of WFL on RL, we then propose a practical implementation using a reward-punishment framework and modifying the definition of optimality. Analysis of this implementation reveals that two utilities can be expected: i) to increase rewards to a certain level early, and ii) to sufficiently suppress punishment. We finally investigate and discuss the expected utilities through simulations and robot experiments. As a result, the proposed RL algorithm with WFL shows the expected utilities that accelerate the reward-maximizing startup and continue to suppress punishments during learning.
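
【代码示意】以下表格型 TD(0) 玩具例子把更新幅度按 1/(1+|V(s)|) 衰减,以呼应摘要中"刺激强度越大、感知越弱"的 Weber-Fechner 直觉;该衰减形式是演示用的假设,并非论文推导出的精确形式。

```python
# 玩具示意:表格型 TD(0),更新幅度随 |V(s)| 增大而衰减(衰减形式为示意假设)。
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)
rng = np.random.default_rng(0)

def step(s):
    """Toy chain MDP: always move right, reward 1 when reaching the last state."""
    s_next = min(s + 1, n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for _ in range(2000):
    s = rng.integers(0, n_states - 1)
    s_next, r = step(s)
    td_error = r + gamma * V[s_next] - V[s]
    attenuation = 1.0 / (1.0 + abs(V[s]))       # Weber-Fechner-style damping (illustrative)
    V[s] += alpha * attenuation * td_error

print(np.round(V, 3))
```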

[LG-14] Verified Lifting of Deep learning Operators

链接: https://arxiv.org/abs/2412.20992
作者: Qi Zhan,Xing Hu,Xin Xia,Shanping Li
关键词: Deep learning operators, Deep learning, modern deep learning, fundamental components, components of modern
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Deep learning operators are fundamental components of modern deep learning frameworks. With the growing demand for customized operators, it has become increasingly common for developers to create their own. However, designing and implementing operators is complex and error-prone, due to hardware-specific optimizations and the need for numerical stability. There is a pressing need for tools that can summarize the functionality of both existing and user-defined operators. To address this gap, this work introduces a novel framework for the verified lifting of deep learning operators, which synthesizes high-level mathematical formulas from low-level implementations. Our approach combines symbolic execution, syntax-guided synthesis, and SMT-based verification to produce readable and formally verified mathematical formulas. In synthesis, we employ a combination of top-down and bottom-up strategies to explore the vast search space efficiently; In verification, we design invariant synthesis patterns and leverage SMT solvers to validate the correctness of the derived summaries; In simplification, we use egraph-based techniques with custom rules to restore complex formulas to their natural, intuitive forms. Evaluated on a dataset of deep learning operators implemented in Triton from the real world, our method demonstrates the effectiveness of synthesis and verification compared to existing techniques. This framework bridges the gap between low-level implementations and high-level abstractions, improving understanding and reliability in deep learning operator development.

[LG-15] RobustBlack: Challenging Black-Box Adversarial Attacks on State-of-the-Art Defenses

链接: https://arxiv.org/abs/2412.20987
作者: Mohamed Djilani,Salah Ghamizi,Maxime Cordy
关键词: Robustbench leaderboard, moderate robust models, black-box attacks, including transfer, query-based approaches
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although adversarial robustness has been extensively studied in white-box settings, recent advances in black-box attacks (including transfer- and query-based approaches) are primarily benchmarked against weak defenses, leaving a significant gap in the evaluation of their effectiveness against more recent and moderate robust models (e.g., those featured in the Robustbench leaderboard). In this paper, we question this lack of attention from black-box attacks to robust models. We establish a framework to evaluate the effectiveness of recent black-box attacks against both top-performing and standard defense mechanisms, on the ImageNet dataset. Our empirical evaluation reveals the following key findings: (1) the most advanced black-box attacks struggle to succeed even against simple adversarially trained models; (2) robust models that are optimized to withstand strong white-box attacks, such as AutoAttack, also exhibit enhanced resilience against black-box attacks; and (3) robustness alignment between the surrogate models and the target model is a key factor in the success rate of transfer-based attacks.

[LG-16] AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like Antibodies

链接: https://arxiv.org/abs/2412.20984
作者: Yibo Wen,Chenwei Xu,Jerry Yao-Chieh Hu,Han Liu
关键词: antibody sequence-structure co-design, sequence-structure co-design, present a three-stage, three-stage framework, training deep learning
类目: Machine Learning (cs.LG)
*备注: 30 pages

点击查看摘要

Abstract:We present a three-stage framework for training deep learning models specializing in antibody sequence-structure co-design. We first pre-train a language model using millions of antibody sequence data. Then, we employ the learned representations to guide the training of a diffusion model for joint optimization over both sequence and structure of antibodies. During the final alignment stage, we optimize the model to favor antibodies with low repulsion and high attraction to the antigen binding site, enhancing the rationality and functionality of the designs. To mitigate conflicting energy preferences, we extend AbDPO (Antibody Direct Preference Optimization) to guide the model towards Pareto optimality under multiple energy-based alignment objectives. Furthermore, we adopt an iterative learning paradigm with temperature scaling, enabling the model to benefit from diverse online datasets without requiring additional data. In practice, our proposed methods achieve high stability and efficiency in producing a better Pareto front of antibody designs compared to top samples generated by baselines and previous alignment techniques. Through extensive experiments, we showcase the superior performance of our methods in generating nature-like antibodies with high binding affinity consistently.

[LG-17] Generalizing in Net-Zero Microgrids: A Study with Federated PPO and TRPO

链接: https://arxiv.org/abs/2412.20946
作者: Nicolas M Cuadrado Avila,Samuel Horváth,Martin Takáč
关键词: Region Policy Optimization, Trust Region Policy, work addresses, addresses the challenge, Policy Optimization
类目: Machine Learning (cs.LG)
*备注: Submitted to Environmental Data Science Journal from Cambridge University Press

点击查看摘要

Abstract:This work addresses the challenge of optimal energy management in microgrids through a collaborative and privacy-preserving framework. We propose the FedTRPO methodology, which integrates Federated Learning (FL) and Trust Region Policy Optimization (TRPO) to manage distributed energy resources (DERs) efficiently. Using a customized version of the CityLearn environment and synthetically generated data, we simulate designed net-zero energy scenarios for microgrids composed of multiple buildings. Our approach emphasizes reducing energy costs and carbon emissions while ensuring privacy. Experimental results demonstrate that FedTRPO is comparable with state-of-the-art federated RL methodologies without hyperparameter tuning. The proposed framework highlights the feasibility of collaborative learning for achieving optimal control policies in energy systems, advancing the goals of sustainable and efficient smart grids.

[LG-18] Rethinking Aleatoric and Epistemic Uncertainty NEURIPS2024

链接: https://arxiv.org/abs/2412.20892
作者: Freddie Bickford Smith,Jannik Kossen,Eleanor Trollope,Mark van der Wilk,Adam Foster,Tom Rainforth
关键词: machine-learning models, aleatoric and epistemic, epistemic uncertainty, uncertainty are widely, probabilistic predictions
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Presented at the Workshop on Bayesian Decision-Making and Uncertainty (NeurIPS 2024)

点击查看摘要

Abstract:The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. We identify incoherence in existing discussions of these ideas and suggest this stems from the aleatoric-epistemic view being insufficiently expressive to capture all of the distinct quantities that researchers are interested in. To explain and address this we derive a simple delineation of different model-based uncertainties and the data-generating processes associated with training and evaluation. Using this in place of the aleatoric-epistemic view could produce clearer discourse as the field moves forward.

[LG-19] CF-CGN: Channel Fingerprints Extrapolation for Multi-band Massive MIMO Transmission based on Cycle-Consistent Generative Networks

链接: https://arxiv.org/abs/2412.20885
作者: Chenjie Xie,Li You,Zhenzhou Jin,Jinke Tang,Xiqi Gao,Xiang-Gen Xia
关键词: effectively enhancing spectrum, multi-band massive MIMO, enhancing spectrum efficiency, massive multiple-input multiple-output, Multi-band massive multiple-input
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:Multi-band massive multiple-input multiple-output (MIMO) communication can promote the cooperation of licensed and unlicensed spectra, effectively enhancing spectrum efficiency for Wi-Fi and other wireless systems. As an enabler for multi-band transmission, channel fingerprints (CF), also known as the channel knowledge map or radio environment map, are used to assist channel state information (CSI) acquisition and reduce computational complexity. In this paper, we propose CF-CGN (Channel Fingerprints with Cycle-consistent Generative Networks) to extrapolate CF for multi-band massive MIMO transmission where licensed and unlicensed spectra cooperate to provide ubiquitous connectivity. Specifically, we first model CF as a multichannel image and transform the extrapolation problem into an image translation task, which converts CF from one frequency to another by exploring the shared characteristics of statistical CSI in the beam domain. Then, paired generative networks are designed and coupled by variable-weight cycle consistency losses to fit the reciprocal relationship at different bands. Matched with the coupled networks, a joint training strategy is developed accordingly, supporting synchronous optimization of all trainable parameters. During the inference process, we also introduce a refining scheme to improve the extrapolation accuracy based on the resolution of CF. Numerical results illustrate that our proposed CF-CGN can achieve bidirectional extrapolation with an error of 5-17 dB lower than the benchmarks in different communication scenarios, demonstrating its excellent generalization ability. We further show that the sum rate performance assisted by CF-CGN-based CF is close to that with perfect CSI for multi-band massive MIMO transmission.

[LG-20] Isoperimetry is All We Need: Langevin Posterior Sampling for RL with Sublinear Regret

链接: https://arxiv.org/abs/2412.20824
作者: Emilio Jorge,Christos Dimitrakakis,Debabrota Basu
关键词: Reinforcement Learning, impose restrictive assumptions, impose restrictive, provably sublinear regret, Learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In Reinforcement Learning (RL) theory, we impose restrictive assumptions to design an algorithm with provably sublinear regret. Common assumptions, like linear or RKHS models, and Gaussian or log-concave posteriors over the models, do not explain practical success of RL across a wider range of distributions and models. Thus, we study how to design RL algorithms with sublinear regret for isoperimetric distributions, specifically the ones satisfying the Log-Sobolev Inequality (LSI). LSI distributions include the standard setups of RL and others, such as many non-log-concave and perturbed distributions. First, we show that the Posterior Sampling-based RL (PSRL) yields sublinear regret if the data distributions satisfy LSI under some mild additional assumptions. Also, when we cannot compute or sample from an exact posterior, we propose a Langevin sampling-based algorithm design: LaPSRL. We show that LaPSRL achieves order optimal regret and subquadratic complexity per episode. Finally, we deploy LaPSRL with a Langevin sampler – SARAH-LD, and test it for different bandit and MDP environments. Experimental results validate the generality of LaPSRL across environments and its competitive performance with respect to the baselines.

[LG-21] TimeRAF: Retrieval-Augmented Foundation model for Zero-shot Time Series Forecasting

链接: https://arxiv.org/abs/2412.20810
作者: Huanyu Zhang,Chang Xu,Yi-Fan Zhang,Zhang Zhang,Liang Wang,Jiang Bian,Tieniu Tan
关键词: driving rapid advancements, Time series, Time series forecasting, driving rapid, numerous industries
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting plays a crucial role in data mining, driving rapid advancements across numerous industries. With the emergence of large models, time series foundation models (TSFMs) have exhibited remarkable generalization capabilities, such as zero-shot learning, through large-scale pre-training. Meanwhile, Retrieval-Augmented Generation (RAG) methods have been widely employed to enhance the performance of foundation models on unseen data, allowing models to access external knowledge. In this paper, we introduce TimeRAF, a Retrieval-Augmented Forecasting model that enhances zero-shot time series forecasting through retrieval-augmented techniques. We develop customized time series knowledge bases that are tailored to the specific forecasting tasks. TimeRAF employs an end-to-end learnable retriever to extract valuable information from the knowledge base. Additionally, we propose Channel Prompting for knowledge integration, which effectively extracts relevant information from the retrieved knowledge along the channel dimension. Extensive experiments demonstrate the effectiveness of our model, showing significant improvement across various domains and datasets.
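
To make the retrieval-augmentation idea concrete, here is a minimal, non-learned nearest-neighbor retriever over a small time-series knowledge base. TimeRAF's actual retriever is learned end to end and its Channel Prompting step is not modeled here; all names below are illustrative assumptions.

```python
import numpy as np

def retrieve_similar_series(query, knowledge_base, k=3):
    """Return the k series in `knowledge_base` closest to `query` under
    z-normalized Euclidean distance. A stand-in for a learnable retriever,
    purely for illustration."""
    def znorm(x):
        return (x - x.mean()) / (x.std() + 1e-8)
    q = znorm(np.asarray(query, dtype=float))
    dists = [np.linalg.norm(q - znorm(np.asarray(s, dtype=float)))
             for s in knowledge_base]
    idx = np.argsort(dists)[:k]
    return [knowledge_base[i] for i in idx], idx

# Toy usage: retrieve context series, then hand them (plus the query)
# to a forecasting foundation model as augmented input.
kb = [np.sin(np.linspace(0, 6, 64) + phi) for phi in np.linspace(0, 3, 10)]
query = np.sin(np.linspace(0, 6, 64) + 0.31)
retrieved, ids = retrieve_similar_series(query, kb, k=3)
print("retrieved indices:", ids)
```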

[LG-22] FastCHGNet: Training one Universal Interatomic Potential to 1.5 Hours with 32 GPUs

链接: https://arxiv.org/abs/2412.20796
作者: Yuanchang Zhou,Siyu Hu,Chen Wang,Lin-Wang Wang,Guangming Tan,Weile Jia
关键词: Graph neural network, universal interatomic potentials, demonstrated remarkable generalization, network universal interatomic, neural network universal
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural network universal interatomic potentials (GNN-UIPs) have demonstrated remarkable generalization and transfer capabilities in material discovery and property prediction. These models can accelerate molecular dynamics (MD) simulation by several orders of magnitude while maintaining ab initio accuracy, making them a promising new paradigm in material simulations. One notable example is Crystal Hamiltonian Graph Neural Network (CHGNet), pretrained on the energies, forces, stresses, and magnetic moments from the MPtrj dataset, representing a state-of-the-art GNN-UIP model for charge-informed MD simulations. However, training the CHGNet model is time-consuming (8.3 days on one A100 GPU) for three reasons: (i) requiring multi-layer propagation to reach more distant atom information, (ii) requiring second-order derivatives calculation to finish weights updating and (iii) the implementation of reference CHGNet does not fully leverage the computational capabilities. This paper introduces FastCHGNet, an optimized CHGNet, with three contributions: Firstly, we design innovative Force/Stress Readout modules to decompose Force/Stress prediction. Secondly, we adopt massive optimizations such as kernel fusion, redundancy bypass, etc., to exploit GPU computation power sufficiently. Finally, we extend CHGNet to support multiple GPUs and propose a load-balancing technique to enhance GPU utilization. Numerical results show that FastCHGNet reduces memory footprint by a factor of 3.59. The final training time of FastCHGNet can be decreased to 1.53 hours on 32 GPUs without sacrificing model accuracy.

[LG-23] Accelerating Energy-Efficient Federated Learning in Cell-Free Networks with Adaptive Quantization

链接: https://arxiv.org/abs/2412.20785
作者: Afsaneh Mahmoudi,Ming Xiao,Emil Björnson
关键词: share learning parameters, Federated Learning, share learning, learning parameters, Learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables clients to share learning parameters instead of local data, reducing communication overhead. Traditional wireless networks face latency challenges with FL. In contrast, Cell-Free Massive MIMO (CFmMIMO) can serve multiple clients on shared resources, boosting spectral efficiency and reducing latency for large-scale FL. However, clients’ communication resource limitations can hinder the completion of the FL training. To address this challenge, we propose an energy-efficient, low-latency FL framework featuring optimized uplink power allocation for seamless client-server collaboration. Our framework employs an adaptive quantization scheme, dynamically adjusting bit allocation for local gradient updates to reduce communication costs. We formulate a joint optimization problem covering FL model updates, local iterations, and power allocation, solved using sequential quadratic programming (SQP) to balance energy and latency. Additionally, clients use the AdaDelta method for local FL model updates, enhancing local model convergence compared to standard SGD, and we provide a comprehensive analysis of FL convergence with AdaDelta local updates. Numerical results show that, within the same energy and latency budgets, our power allocation scheme outperforms the Dinkelbach and max-sum rate methods by increasing the test accuracy by up to 7% and 19%, respectively. Moreover, for the three power allocation methods, our proposed quantization scheme outperforms AQUILA and LAQ by increasing test accuracy by up to 36% and 35%, respectively.
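
The abstract describes adaptive bit allocation for local gradient updates at a high level. The sketch below shows a generic stochastic uniform quantizer with an adjustable bit budget, one common way to realize such a scheme; it is not the paper's exact quantization rule.

```python
import numpy as np

def quantize_update(update, num_bits):
    """Stochastic uniform quantization of a local model update to
    `num_bits` bits per entry (plus one scale/offset). Illustrative only."""
    levels = 2 ** num_bits - 1
    lo, hi = update.min(), update.max()
    scale = (hi - lo) / max(levels, 1)
    if scale == 0:
        return np.zeros_like(update, dtype=int), lo, scale
    normalized = (update - lo) / scale
    floor = np.floor(normalized)
    prob_up = normalized - floor              # unbiased stochastic rounding
    q = floor + (np.random.rand(*update.shape) < prob_up)
    return q.astype(int), lo, scale

def dequantize(q, lo, scale):
    return lo + q * scale

# A client could lower `num_bits` when its channel is poor and raise it
# later in training, trading communication cost against accuracy.
g = np.random.randn(1000) * 0.01
q, lo, s = quantize_update(g, num_bits=4)
print("mean quantization error:", np.abs(dequantize(q, lo, s) - g).mean())
```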

[LG-24] Joint Scoring Rules: Zero-Sum Competition Avoids Performative Prediction

链接: https://arxiv.org/abs/2412.20732
作者: Rubi Hudson
关键词: decision-making scenario, conditional predictions, principal, Abstract, expert agent
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In a decision-making scenario, a principal could use conditional predictions from an expert agent to inform their choice. However, this approach would introduce a fundamental conflict of interest. An agent optimizing for predictive accuracy is incentivized to manipulate their principal towards more predictable actions, which prevents that principal from being able to deterministically select their true preference. We demonstrate that this impossibility result can be overcome through the joint evaluation of multiple agents. When agents are made to engage in zero-sum competition, their incentive to influence the action taken is eliminated, and the principal can identify and take the action they most prefer. We further prove that this zero-sum setup is unique, efficiently implementable, and applicable under stochastic choice. Experiments in a toy environment demonstrate that training on a zero-sum objective significantly enhances both predictive accuracy and principal utility, and can eliminate previously learned manipulative behavior.
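
A minimal sketch of the zero-sum evaluation idea: two agents submit conditional predictions, each is scored with a proper scoring rule on the action actually taken, and rewards are the difference of scores so they sum to zero. The interface below is hypothetical; the paper treats the general setting formally.

```python
import numpy as np

def log_score(pred, outcome, eps=1e-12):
    """Logarithmic proper scoring rule for a Bernoulli outcome."""
    p = pred if outcome == 1 else 1.0 - pred
    return np.log(p + eps)

def zero_sum_round(pred_a, pred_b, chosen_action, outcome):
    """Joint zero-sum evaluation sketch: each agent is scored on its
    conditional prediction for the action actually taken, and rewards are
    score differences, so they sum to zero by construction."""
    s_a = log_score(pred_a[chosen_action], outcome)
    s_b = log_score(pred_b[chosen_action], outcome)
    return s_a - s_b, s_b - s_a

# Two agents predict P(good outcome | action) for actions 0 and 1.
pred_a = {0: 0.7, 1: 0.4}
pred_b = {0: 0.6, 1: 0.5}
print(zero_sum_round(pred_a, pred_b, chosen_action=0, outcome=1))
```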

[LG-25] AverageLinear: Enhance Long-Term Time series forcasting with simple averaging

链接: https://arxiv.org/abs/2412.20727
作者: Gaoxiang Zhao,Li Zhou,Xiaoqiang Wang
关键词: forecast long-term trends, Long-term time series, series analysis aims, time series analysis, forecast long-term
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Long-term time series analysis aims to forecast long-term trends by examining changes over past and future periods. The intricacy of time series data poses significant challenges for modeling. Models based on the Transformer architecture, through the application of attention mechanisms to channels and sequences, have demonstrated notable performance advantages. In contrast, methods based on convolutional neural networks or linear models often struggle to effectively handle scenarios with a large number of channels. However, our research reveals that the attention mechanism is not the core component responsible for performance enhancement. We have designed an exceedingly simple linear structure AverageLinear. By employing straightforward channel embedding and averaging operations, this model can effectively capture correlations between channels while maintaining a lightweight architecture. Experiments on real-world datasets show that AverageLinear matches or even surpasses state-of-the-art Transformer-based structures in performance. This indicates that using purely linear structures can also endow models with robust predictive power.
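
Since the abstract only states that AverageLinear combines channel embedding, averaging, and linear projections, the sketch below is a guess at that structure rather than the paper's architecture; layer shapes and the mixing step are assumptions.

```python
import torch
import torch.nn as nn

class AverageLinearSketch(nn.Module):
    """A rough sketch in the spirit of AverageLinear: a shared per-channel
    linear forecaster plus a simple channel-embedding/averaging correction.
    The exact architecture in the paper may differ."""
    def __init__(self, seq_len, pred_len, n_channels, emb_dim=16):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)         # shared temporal map
        self.channel_emb = nn.Embedding(n_channels, emb_dim)
        self.mix = nn.Linear(emb_dim, pred_len)

    def forward(self, x):                                # x: (B, C, L)
        base = self.proj(x)                              # (B, C, pred_len)
        emb = self.channel_emb.weight                    # (C, emb_dim)
        # average channel embeddings to capture cross-channel correlation
        mixed = self.mix(emb.mean(dim=0, keepdim=True))  # (1, pred_len)
        return base + mixed.unsqueeze(0)                 # broadcast over B, C

x = torch.randn(8, 7, 96)                                # batch, channels, history
print(AverageLinearSketch(96, 24, 7)(x).shape)           # torch.Size([8, 7, 24])
```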

[LG-26] Differentiable Convex Optimization Layers in Neural Architectures: Foundations and Perspectives

链接: https://arxiv.org/abs/2412.20679
作者: Calder Katyal
关键词: network architectures represents, neural network architectures, architectures represents, represents a fundamental, fundamental shift
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The integration of optimization problems within neural network architectures represents a fundamental shift from traditional approaches to handling constraints in deep learning. While it is long known that neural networks can incorporate soft constraints with techniques such as regularization, strict adherence to hard constraints is generally more difficult. A recent advance in this field, however, has addressed this problem by enabling the direct embedding of optimization layers as differentiable components within deep networks. This paper surveys the evolution and current state of this approach, from early implementations limited to quadratic programming, to more recent frameworks supporting general convex optimization problems. We provide a comprehensive review of the background, theoretical foundations, and emerging applications of this technology. Our analysis includes detailed mathematical proofs and an examination of various use cases that demonstrate the potential of this hybrid approach. This work synthesizes developments at the intersection of optimization theory and deep learning, offering insights into both current capabilities and future research directions in this rapidly evolving field.

[LG-27] Attention-Driven Metapath Encoding in Heterogeneous Graphs

链接: https://arxiv.org/abs/2412.20678
作者: Calder Katyal
关键词: semantically meaningful structures, meaningful structures called, structures called metapaths, restrict message aggregation, semantically meaningful
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the emerging techniques in node classification in heterogeneous graphs is to restrict message aggregation to pre-defined, semantically meaningful structures called metapaths. This work is the first attempt to incorporate attention into the process of encoding entire metapaths without dropping intermediate nodes. In particular, we construct two encoders: the first uses sequential attention to extend the multi-hop message passing algorithm designed in \citet{magna} to the metapath setting, and the second incorporates direct attention to extract semantic relations in the metapath. The model then employs the intra-metapath and inter-metapath aggregation mechanisms of \citet{han}. We furthermore use the powerful training scheduler specialized for heterogeneous graphs that was developed in \citet{lts}, ensuring the model slowly learns how to classify the most difficult nodes. The result is a resilient, general-purpose framework for capturing semantic structures in heterogeneous graphs. In particular, we demonstrate that our model is competitive with state-of-the-art models on performing node classification on the IMDB dataset, a popular benchmark introduced in \citet{benchmark}.

[LG-28] Blockchain-Empowered Cyber-Secure Federated Learning for Trustworthy Edge Computing

链接: https://arxiv.org/abs/2412.20674
作者: Ervin Moore,Ahmed Imteaj,Md Zarif Hossain,Shabnam Rezapour,M. Hadi Amini
关键词: machine learning scheme, distributed machine learning, Federated Learning, privacy-preserving distributed machine, participant data remains
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a privacy-preserving distributed machine learning scheme, where each participant data remains on the participating devices and only the local model generated utilizing the local computational power is transmitted throughout the database. However, the distributed computational nature of FL creates the necessity to develop a mechanism that can remotely trigger any network agents, track their activities, and prevent threats to the overall process posed by malicious participants. Particularly, the FL paradigm may become vulnerable due to an active attack from the network participants, called a poisonous attack. In such an attack, the malicious participant acts as a benign agent capable of affecting the global model quality by uploading an obfuscated poisoned local model update to the server. This paper presents a cross-device FL model that ensures trustworthiness, fairness, and authenticity in the underlying FL training process. We leverage trustworthiness by constructing a reputation-based trust model based on contributions of agents toward model convergence. We ensure fairness by identifying and removing malicious agents from the training process through an outlier detection technique. Further, we establish authenticity by generating a token for each participating device through a distributed sensing mechanism and storing that unique token in a blockchain smart contract. Further, we insert the trust scores of all agents into a blockchain and validate their reputations using various consensus mechanisms that consider the computational task.

[LG-29] wo Birds with One Stone: Improving Rumor Detection by Addressing the Unfairness Issue

链接: https://arxiv.org/abs/2412.20671
作者: Junyi Chen,Mengjia Wu,Qian Liu,Ying Ding,Yi Zhang
关键词: remains relatively unexplored, Abstract, confounding sensitive attributes, performance, sensitive
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The degraded performance and group unfairness caused by confounding sensitive attributes in rumor detection remain relatively unexplored. To address this, we propose a two-step framework. Initially, it identifies confounding sensitive attributes that limit rumor detection performance and cause unfairness across groups. Subsequently, we aim to learn equally informative representations through invariant learning. Our method considers diverse sets of groups without sensitive attribute annotations. Experiments show our method easily integrates with existing rumor detectors, significantly improving both their detection performance and fairness.

[LG-30] Uncertainty Herding: One Active Learning Method for All Label Budgets

链接: https://arxiv.org/abs/2412.20644
作者: Wonho Bae,Gabriel L. Oliveira,Danica J. Sutherland
关键词: dramatically worse, worse than random, random selection, label budgets, label budgets increase
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Most active learning research has focused on methods which perform well when many labels are available, but can be dramatically worse than random selection when label budgets are small. Other methods have focused on the low-budget regime, but do poorly as label budgets increase. As the line between “low” and “high” budgets varies by problem, this is a serious issue in practice. We propose uncertainty coverage, an objective which generalizes a variety of low- and high-budget objectives, as well as natural, hyperparameter-light methods to smoothly interpolate between low- and high-budget regimes. We call greedy optimization of the estimate Uncertainty Herding; this simple method is computationally fast, and we prove that it nearly optimizes the distribution-level coverage. In experimental validation across a variety of active learning tasks, our proposal matches or beats state-of-the-art performance in essentially all cases; it is the only method of which we are aware that reliably works well in both low- and high-budget settings.
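
The exact definition of "uncertainty coverage" is given in the paper; as a rough illustration, the sketch below greedily optimizes an uncertainty-weighted facility-location objective over an unlabeled pool, which captures the flavor of herding-style greedy selection without claiming to match the paper's objective.

```python
import numpy as np

def uncertainty_herding(features, uncertainty, budget, bandwidth=1.0):
    """Greedy sketch of an uncertainty-weighted coverage objective:
    coverage(S) = sum_j u_j * max_{i in S} k(x_i, x_j)."""
    d2 = np.square(features[:, None, :] - features[None, :, :]).sum(-1)
    sim = np.exp(-d2 / (2 * bandwidth ** 2))      # (n, n) RBF similarity
    covered = np.zeros(len(features))             # current best coverage per point
    selected = []
    for _ in range(budget):
        # marginal coverage gain of adding each candidate
        gains = (uncertainty * np.maximum(sim - covered, 0.0)).sum(axis=1)
        gains[selected] = -np.inf                 # do not re-select
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
u = rng.uniform(size=200)                         # e.g. predictive entropy per point
print(uncertainty_herding(X, u, budget=10))
```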

[LG-31] SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy

链接: https://arxiv.org/abs/2412.20641
作者: Md Mahadi Hasan Nahid,Sadid Bin Hasan
关键词: raising substantial privacy, substantial privacy concerns, Machine learning, models frequently rely, Consumer Privacy Act
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 15 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Machine learning (ML) models frequently rely on training data that may include sensitive or personal information, raising substantial privacy concerns. Legislative frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have necessitated the development of strategies that preserve privacy while maintaining the utility of data. In this paper, we investigate the capability of Large Language Models (LLMs) to generate synthetic datasets integrated with Differential Privacy (DP) mechanisms, thereby enabling data-driven research and model training without direct exposure of sensitive information. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data. To substantiate privacy guarantees, we assess the resilience of the generated synthetic data to membership inference attacks and related threats. The experimental results demonstrate that integrating DP within LLM-driven synthetic data generation offers a viable balance between privacy protection and data utility. This study provides a foundational methodology and insight into the privacy-preserving capabilities of LLMs, paving the way for compliant and effective ML research and applications.
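
As a concrete reference point for the DP mechanisms mentioned (Laplace and Gaussian noise injection), here is the textbook Laplace mechanism applied to aggregate statistics that could then condition LLM-based synthetic data generation; this is standard DP machinery, not the paper's full pipeline.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon):
    """Add Laplace noise calibrated to sensitivity / epsilon, the classic
    epsilon-DP mechanism referenced in the abstract."""
    scale = sensitivity / epsilon
    return value + np.random.laplace(loc=0.0, scale=scale, size=np.shape(value))

# Illustration: privatize per-category counts before prompting an LLM to
# generate synthetic records that match the noisy statistics.
true_counts = np.array([120, 45, 233, 9], dtype=float)
noisy_counts = laplace_mechanism(true_counts, sensitivity=1.0, epsilon=0.5)
print(np.round(noisy_counts, 1))
```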

[LG-32] Audiopedia: Audio QA with Knowledge ICASSP2025

链接: https://arxiv.org/abs/2412.20619
作者: Abhirama Subramanyam Penamakuri,Kiran Chhatre,Akshat Jain
关键词: Audio Question Answering, Question Answering, called Audio Question, Audio Question, Multi-Audio Question Answering
类目: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:In this paper, we introduce Audiopedia, a novel task called Audio Question Answering with Knowledge, which requires both audio comprehension and external knowledge reasoning. Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions. We define three sub-tasks: (i) Single Audio Question Answering (s-AQA), where questions are answered based on a single audio sample, (ii) Multi-Audio Question Answering (m-AQA), which requires reasoning over multiple audio samples, and (iii) Retrieval-Augmented Audio Question Answering (r-AQA), which involves retrieving relevant audio to answer the question. We benchmark large audio language models (LALMs) on these sub-tasks and observe suboptimal performance. To address this, we propose a generic framework that can be adapted to any LALM, equipping them with knowledge reasoning capabilities. Our framework has two components: (i) Audio Entity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model (KA2LM), which together improve performance on knowledge-intensive AQA tasks. To our knowledge, this is the first work to address advanced audio understanding via knowledge-intensive tasks like Audiopedia.

[LG-33] Converting Time Series Data to Numeric Representations Using Alphabetic Mapping and k-mer strategy

链接: https://arxiv.org/abs/2412.20617
作者: Sarwan Ali,Tamkanat E Ali,Imdad Ullah Khan,Murray Patterson
关键词: time series, time series signals, time series data, representing time series, series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the realm of data analysis and bioinformatics, representing time series data in a manner akin to biological sequences offers a novel approach to leverage sequence analysis techniques. Transforming time series signals into molecular sequence-type representations allows us to enhance pattern recognition by applying sophisticated sequence analysis techniques (e.g., k-mer-based representation) developed in bioinformatics, uncovering hidden patterns and relationships in complex, non-linear time series data. This paper proposes a method to transform time series signals into biological/molecular sequence-type representations using a unique alphabetic mapping technique. By generating 26 ranges corresponding to the 26 letters of the English alphabet, each value within the time series is mapped to a specific character based on its range. This conversion facilitates the application of sequence analysis algorithms, typically used in bioinformatics, to analyze time series data. We demonstrate the effectiveness of this approach by converting real-world time series signals into character sequences and performing sequence classification. The resulting sequences can be utilized for various sequence-based analysis techniques, offering a new perspective on time series data representation and analysis.
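
A minimal sketch of the mapping described in the abstract: each value is binned into one of 26 equal-width ranges and replaced by the corresponding letter, after which k-mer counts can feed a standard classifier. Binning details (equal-width intervals, clipping at the extremes) are assumptions.

```python
import string
from collections import Counter
import numpy as np

def series_to_sequence(x, n_letters=26):
    """Map each value of a time series to one of 26 letters according to
    equal-width ranges between the series' min and max."""
    x = np.asarray(x, dtype=float)
    bins = np.linspace(x.min(), x.max(), n_letters + 1)
    idx = np.clip(np.digitize(x, bins) - 1, 0, n_letters - 1)
    return "".join(string.ascii_uppercase[i] for i in idx)

def kmer_counts(seq, k=3):
    """k-mer spectrum of the character sequence, usable as a feature vector."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

signal = np.sin(np.linspace(0, 4 * np.pi, 50)) + 0.1 * np.random.randn(50)
seq = series_to_sequence(signal)
print(seq)
print(kmer_counts(seq, k=3).most_common(5))
```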

[LG-34] Hilbert Curve Based Molecular Sequence Analysis

链接: https://arxiv.org/abs/2412.20616
作者: Sarwan Ali,Tamkanat E Ali,Imdad Ullah Khan,Murray Patterson
关键词: Accurate molecular sequence, Accurate molecular, Deep Learning, field of bioinformatics, molecular sequence
类目: Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)
*备注:

点击查看摘要

Abstract:Accurate molecular sequence analysis is a key task in the field of bioinformatics. To apply molecular sequence classification algorithms, we first need to generate the appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment that faces limitations in the form of lack of accuracy. Although several alignment-free techniques have also been introduced, their tabular data form results in low performance when used with Deep Learning (DL) models compared to the competitive performance observed in the case of image-based data. To find a solution to this problem and to make Deep Learning (DL) models function to their maximum potential while capturing the important spatial information in the sequence data, we propose a universal Hilbert curve-based Chaos Game Representation (CGR) method. This method is a transformative function that involves a novel Alphabetic index mapping technique used in constructing Hilbert curve-based image representation from molecular sequences. Our method can be globally applied to any type of molecular sequence data. The Hilbert curve-based image representations can be used as input to sophisticated vision DL models for sequence classification. The proposed method shows promising results as it outperforms current state-of-the-art methods by achieving a high accuracy of 94.5% and an F1 score of 93.9% when tested with the CNN model on the lung cancer dataset. This approach opens up a new horizon for exploring molecular sequence analysis using image classification methods.

[LG-35] Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

链接: https://arxiv.org/abs/2412.20553
作者: Arseniy Andreyev,Pierfrancesco Beneventano
关键词: training neural networks, full-batch gradient descent, full batch Hessian, Recent findings, consistently stabilizes
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 28 pages, 24 figures

点击查看摘要

Abstract:Recent findings by Cohen et al., 2021, demonstrate that when training neural networks with full-batch gradient descent at a step size of \eta, the sharpness (defined as the largest eigenvalue of the full-batch Hessian) consistently stabilizes at 2/\eta. These results have significant implications for convergence and generalization. Unfortunately, this was observed not to be the case for mini-batch stochastic gradient descent (SGD), thus limiting the broader applicability of these findings. We show that SGD trains in a different regime we call Edge of Stochastic Stability. In this regime, what hovers at 2/\eta is, instead, the average over the batches of the largest eigenvalue of the Hessian of the mini-batch (MiniBS) loss, which is always bigger than the sharpness. This implies that the sharpness is generally lower when training with smaller batches or a bigger learning rate, providing a basis for the observed implicit regularization effect of SGD towards flatter minima and a number of well-established empirical phenomena. Additionally, we quantify the gap between the MiniBS eigenvalue and the sharpness, further characterizing this distinct training regime.
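
For readers who want to reproduce the central quantity, the sketch below estimates the sharpness (largest Hessian eigenvalue) of the loss on a given batch via power iteration with Hessian-vector products; the paper's full-batch vs. mini-batch (MiniBS) comparison builds on exactly this kind of measurement. The model and hyperparameters are toy choices.

```python
import torch

def sharpness(model, loss_fn, data, target, iters=20):
    """Estimate the largest Hessian eigenvalue of the loss on one batch
    using power iteration with Hessian-vector products (autograd)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(data), target)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        gv = sum((g * u).sum() for g, u in zip(grads, v))       # g^T v
        hv = torch.autograd.grad(gv, params, retain_graph=True)  # H v
        eig = sum((h * u).sum() for h, u in zip(hv, v))           # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.item()

# Example: compare the estimate to 2 / eta for your training step size eta.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
print(sharpness(model, torch.nn.MSELoss(), x, y))
```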

[LG-36] Diminishing Return of Value Expansion Methods

链接: https://arxiv.org/abs/2412.20537
作者: Daniel Palenicek,Michael Lutter,João Carvalho,Daniel Dennert,Faran Ahmad,Jan Peters
关键词: sample efficiency, resulting compounding errors, aims to increase, sample, efficiency
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2303.03955

点击查看摘要

Abstract:Model-based reinforcement learning aims to increase sample efficiency, but the accuracy of dynamics models and the resulting compounding errors are often seen as key limitations. This paper empirically investigates potential sample efficiency gains from improved dynamics models in model-based value expansion methods. Our study reveals two key findings when using oracle dynamics models to eliminate compounding errors. First, longer rollout horizons enhance sample efficiency, but the improvements quickly diminish with each additional expansion step. Second, increased model accuracy only marginally improves sample efficiency compared to learned models with identical horizons. These diminishing returns in sample efficiency are particularly noteworthy when compared to model-free value expansion methods. These model-free algorithms achieve comparable performance without the computational overhead. Our results suggest that the limitation of model-based value expansion methods cannot be attributed to model accuracy. Although higher accuracy is beneficial, even perfect models do not provide unrivaled sample efficiency. Therefore, the bottleneck exists elsewhere. These results challenge the common assumption that model accuracy is the primary constraint in model-based reinforcement learning.

[LG-37] Convergence of the Min-Max Langevin Dynamics and Algorithm for Zero-Sum Games

链接: https://arxiv.org/abs/2412.20471
作者: Yang Cai,Siddharth Mitra,Xiuyuan Wang,Andre Wibisono
关键词: min-max Langevin dynamics, mean-field min-max Langevin, finite-particle min-max Langevin, Euclidean space, min-max Langevin
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study zero-sum games in the space of probability distributions over the Euclidean space \mathbb{R}^d with entropy regularization, in the setting when the interaction function between the players is smooth and strongly convex-concave. We prove an exponential convergence guarantee for the mean-field min-max Langevin dynamics to compute the equilibrium distribution of the zero-sum game. We also study the finite-particle approximation of the mean-field min-max Langevin dynamics, both in continuous and discrete times. We prove biased convergence guarantees for the continuous-time finite-particle min-max Langevin dynamics to the stationary mean-field equilibrium distribution with an explicit bias estimate which does not scale with the number of particles. We also prove biased convergence guarantees for the discrete-time finite-particle min-max Langevin algorithm to the stationary mean-field equilibrium distribution with an additional bias term which scales with the step size and the number of particles. This provides an explicit iteration complexity for the average particle along the finite-particle algorithm to approximately compute the equilibrium distribution of the zero-sum game.
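
A minimal discrete-time, finite-particle sketch of the min-max Langevin algorithm on a toy strongly convex-concave game: the min player's particles take noisy gradient descent steps, the max player's take noisy ascent steps. The step size, temperature, and toy objective are illustrative choices, not the paper's setting.

```python
import numpy as np

def minmax_langevin(grad_x, grad_y, n_particles=256, steps=2000,
                    step=1e-2, temperature=1.0, dim=1, seed=0):
    """Finite-particle min-max Langevin sketch: descent for the min player,
    ascent for the max player, both with Gaussian noise whose scale matches
    the entropy regularization (temperature)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_particles, dim))   # min player's particles
    Y = rng.normal(size=(n_particles, dim))   # max player's particles
    noise = np.sqrt(2 * step * temperature)
    for _ in range(steps):
        gx = grad_x(X, Y)                     # E_y[grad_x f], estimated over particles
        gy = grad_y(X, Y)
        X = X - step * gx + noise * rng.normal(size=X.shape)
        Y = Y + step * gy + noise * rng.normal(size=Y.shape)
    return X, Y

# Toy game f(x, y) = 0.5 x^2 + x*y - 0.5 y^2 (strongly convex-concave).
gx = lambda X, Y: X + Y.mean(axis=0, keepdims=True)
gy = lambda X, Y: X.mean(axis=0, keepdims=True) - Y
X, Y = minmax_langevin(gx, gy)
print(X.mean(), Y.mean())   # both hover near 0, the equilibrium mean
```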

[LG-38] reatment Effect Estimation for Graph-Structured Targets

链接: https://arxiv.org/abs/2412.20436
作者: Shonosuke Harada,Ryosuke Yoneda,Hisashi Kashima
关键词: Treatment effect estimation, Treatment effect, effect estimation, Graph-target Treatment Effect, Treatment
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Treatment effect estimation, which helps understand the causality between treatment and outcome variable, is a central task in decision-making across various domains. While most studies focus on treatment effect estimation on individual targets, in specific contexts, there is a necessity to comprehend the treatment effect on a group of targets, especially those that have relationships represented as a graph structure between them. In such cases, the focus of treatment assignment is prone to depend on a particular node of the graph, such as the one with the highest degree, thus resulting in an observational bias from a small part of the entire graph. Although the bias tends to be caused by this small part, straightforward extensions of previous studies cannot provide efficient bias mitigation because they use the entire graph information. In this study, we propose Graph-target Treatment Effect Estimation (GraphTEE), a framework designed to estimate treatment effects specifically on graph-structured targets. GraphTEE aims to mitigate observational bias by focusing on confounding variable sets and considers a new regularization framework. Additionally, we provide a theoretical analysis of how GraphTEE performs better in terms of bias mitigation. Experiments on synthetic and semi-synthetic datasets demonstrate the effectiveness of our proposed method.

[LG-39] Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

链接: https://arxiv.org/abs/2412.20377
作者: Yan Luo,Congcong Wen,Min Shi,Hao Huang,Yi Fang,Mengyu Wang
关键词: theoretical framework analyzing, analyzing the relationship, theoretical bounds, theoretical, framework analyzing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a comprehensive theoretical framework analyzing the relationship between data distributions and fairness guarantees in equitable deep learning. Our work establishes novel theoretical bounds that explicitly account for data distribution heterogeneity across demographic groups, while introducing a formal analysis framework that minimizes expected loss differences across these groups. We derive comprehensive theoretical bounds for fairness errors and convergence rates, and characterize how distributional differences between groups affect the fundamental trade-off between fairness and accuracy. Through extensive experiments on diverse datasets, including FairVision (ophthalmology), CheXpert (chest X-rays), HAM10000 (dermatology), and FairFace (facial recognition), we validate our theoretical findings and demonstrate that differences in feature distributions across demographic groups significantly impact model fairness, with performance disparities particularly pronounced in racial categories. The theoretical bounds we derive corroborate these empirical observations, providing insights into the fundamental limits of achieving fairness in deep learning models when faced with heterogeneous data distributions. This work advances our understanding of fairness in AI-based diagnosis systems and provides a theoretical foundation for developing more equitable algorithms. The code for analysis is publicly available via this https URL.

[LG-40] Scalable Bayesian Optimization via Focalized Sparse Gaussian Processes NEURIPS2024

链接: https://arxiv.org/abs/2412.20375
作者: Yunyue Wei,Vincent Zhuang,Saraswati Soedarmadji,Yanan Sui
关键词: small-budget problems due, Gaussian process, computing the Gaussian, Bayesian optimization, scale Bayesian optimization
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Bayesian optimization is an effective technique for black-box optimization, but its applicability is typically limited to low-dimensional and small-budget problems due to the cubic complexity of computing the Gaussian process (GP) surrogate. While various approximate GP models have been employed to scale Bayesian optimization to larger sample sizes, most suffer from overly-smooth estimation and focus primarily on problems that allow for large online samples. In this work, we argue that Bayesian optimization algorithms with sparse GPs can more efficiently allocate their representational power to relevant regions of the search space. To achieve this, we propose focalized GP, which leverages a novel variational loss function to achieve stronger local prediction, as well as FocalBO, which hierarchically optimizes the focalized GP acquisition function over progressively smaller search spaces. Experimental results demonstrate that FocalBO can efficiently leverage large amounts of offline and online data to achieve state-of-the-art performance on robot morphology design and to control a 585-dimensional musculoskeletal system.

[LG-41] Accelerated regularized learning in finite N-person games

链接: https://arxiv.org/abs/2412.20365
作者: Kyriakos Lotidis,Angeliki Giannou,Panayotis Mertikopoulos,Nicholas Bambos
关键词: convex minimization problems, achieve similar performance, similar performance gains, Nesterov accelerated gradient, accelerated gradient algorithm
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 30 pages, 4 figures

点击查看摘要

Abstract:Motivated by the success of Nesterov’s accelerated gradient algorithm for convex minimization problems, we examine whether it is possible to achieve similar performance gains in the context of online learning in games. To that end, we introduce a family of accelerated learning methods, which we call “follow the accelerated leader” (FTXL), and which incorporates the use of momentum within the general framework of regularized learning - and, in particular, the exponential/multiplicative weights algorithm and its variants. Drawing inspiration and techniques from the continuous-time analysis of Nesterov’s algorithm, we show that FTXL converges locally to strict Nash equilibria at a superlinear rate, achieving in this way an exponential speed-up over vanilla regularized learning methods (which, by comparison, converge to strict equilibria at a geometric, linear rate). Importantly, FTXL maintains its superlinear convergence rate in a broad range of feedback structures, from deterministic, full information models to stochastic, realization-based ones, and even when run with bandit, payoff-based information, where players are only able to observe their individual realized payoffs.

[LG-42] Safe Bayesian Optimization for the Control of High-Dimensional Embodied Systems

链接: https://arxiv.org/abs/2412.20350
作者: Yunyue Wei,Zeji Yi,Hongda Li,Saraswati Soedarmadji,Yanan Sui
关键词: Learning to move, animals and robots, primary goal, goal for animals, Learning
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted by CoRL 2024

点击查看摘要

Abstract:Learning to move is a primary goal for animals and robots, where ensuring safety is often important when optimizing control policies on the embodied systems. For complex tasks such as human or humanoid control, the high-dimensional parameter space adds complexity to the safe optimization effort. Current safe exploration algorithms exhibit inefficiency and may even become infeasible with large high-dimensional input spaces. Furthermore, existing high-dimensional constrained optimization methods neglect safety in the search process. In this paper, we propose High-dimensional Safe Bayesian Optimization with local optimistic exploration (HdSafeBO), a novel approach designed to handle high-dimensional sampling problems under probabilistic safety constraints. We introduce a local optimistic strategy to efficiently and safely optimize the objective function, providing a probabilistic safety guarantee and a cumulative safety violation bound. Through the use of isometric embedding, HdSafeBO addresses problems ranging from a few hundred to several thousand dimensions while maintaining safety guarantees. To our knowledge, HdSafeBO is the first algorithm capable of optimizing the control of high-dimensional musculoskeletal systems with high safety probability. We also demonstrate the real-world applicability of HdSafeBO through its use in the safe online optimization of neural stimulation induced human motion control.

[LG-43] Asynchronous Federated Clustering with Unknown Number of Clusters AAAI2025

链接: https://arxiv.org/abs/2412.20341
作者: Yunfan Zhang,Yiqun Zhang,Yang Lu,Mengke Li,Xi Chen,Yiu-ming Cheung
关键词: non-Independent Identically Distributed, unlabeled non-Independent Identically, Identically Distributed, non-Independent Identically, Federated Clustering
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Federated Clustering (FC) is crucial to mining knowledge from unlabeled non-Independent Identically Distributed (non-IID) data provided by multiple clients while preserving their privacy. Most existing attempts learn cluster distributions at local clients, and then securely pass the desensitized information to the server for aggregation. However, some tricky but common FC problems are still relatively unexplored, including the heterogeneity in terms of clients’ communication capacity and the unknown number of proper clusters k^* . To further bridge the gap between FC and real application scenarios, this paper first shows that the clients’ communication asynchrony and unknown k^* are complex coupling problems, and then proposes an Asynchronous Federated Cluster Learning (AFCL) method accordingly. It spreads the excessive number of seed points to the clients as a learning medium and coordinates them across the clients to form a consensus. To alleviate the distribution imbalance cumulated due to the unforeseen asynchronous uploading from the heterogeneous clients, we also design a balancing mechanism for seeds updating. As a result, the seeds gradually adapt to each other to reveal a proper number of clusters. Extensive experiments demonstrate the efficacy of AFCL.

[LG-44] Exploiting Hybrid Policy in Reinforcement Learning for Interpretable Temporal Logic Manipulation IROS2024

链接: https://arxiv.org/abs/2412.20338
作者: Hao Zhang,Hao Wang,Xiucai Huang,Wenrui Chen,Zhen Kan
关键词: Reinforcement Learning, robot learning, based methods, increasingly explored, explored for robot
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted by IROS 2024. Code: this https URL

点击查看摘要

Abstract:Reinforcement Learning (RL) based methods have been increasingly explored for robot learning. However, RL based methods often suffer from low sampling efficiency in the exploration phase, especially for long-horizon manipulation tasks, and generally neglect the semantic information from the task level, resulting in delayed convergence or even task failure. To tackle these challenges, we propose a Temporal-Logic-guided Hybrid policy framework (HyTL) which leverages three-level decision layers to improve the agent’s performance. Specifically, the task specifications are encoded via linear temporal logic (LTL) to improve performance and offer interpretability. A waypoints planning module is designed with the feedback from the LTL-encoded task level as a high-level policy to improve the exploration efficiency. The middle-level policy selects which behavior primitives to execute, and the low-level policy specifies the corresponding parameters to interact with the environment. We evaluate HyTL on four challenging manipulation tasks, which demonstrate its effectiveness and interpretability. Our project is available at: this https URL.

[LG-45] An experimental study on fairness-aware machine learning for credit scoring problem

链接: https://arxiv.org/abs/2412.20298
作者: Huyen Giang Thi Thu,Thang Viet Doan,Tai Le Quy
关键词: Digitalization of credit, Machine learning, commercial banks, digital transformation, machine learning models
类目: Machine Learning (cs.LG)
*备注: The manuscript is submitted to Springer Nature’s journal

点击查看摘要

Abstract:Digitalization of credit scoring is an essential requirement for financial organizations and commercial banks, especially in the context of digital transformation. Machine learning techniques are commonly used to evaluate customers’ creditworthiness. However, the predicted outcomes of machine learning models can be biased toward protected attributes, such as race or gender. Numerous fairness-aware machine learning models and fairness measures have been proposed. Nevertheless, their performance in the context of credit scoring has not been thoroughly investigated. In this paper, we present a comprehensive experimental study of fairness-aware machine learning in credit scoring. The study explores key aspects of credit scoring, including financial datasets, predictive models, and fairness measures. We also provide a detailed evaluation of fairness-aware predictive models and fairness measures on widely used financial datasets.

[LG-46] An analytic theory of creativity in convolutional diffusion models

链接: https://arxiv.org/abs/2412.20292
作者: Mason Kamb,Surya Ganguli
关键词: convolutional diffusion models, diffusion models, convolutional diffusion, models, diffusion
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We obtain the first analytic, interpretable and predictive theory of creativity in convolutional diffusion models. Indeed, score-based diffusion models can generate highly creative images that lie far from their training data. But optimal score-matching theory suggests that these models should only be able to produce memorized training examples. To reconcile this theory-experiment gap, we identify two simple inductive biases, locality and equivariance, that: (1) induce a form of combinatorial creativity by preventing optimal score-matching; (2) result in a fully analytic, completely mechanistically interpretable, equivariant local score (ELS) machine that, (3) without any training can quantitatively predict the outputs of trained convolution-only diffusion models (like ResNets and UNets) with high accuracy (median r^2 of 0.90, 0.91, 0.94 on CIFAR10, FashionMNIST, and MNIST). Our ELS machine reveals a locally consistent patch mosaic model of creativity, in which diffusion models create exponentially many novel images by mixing and matching different local training set patches in different image locations. Our theory also partially predicts the outputs of pre-trained self-attention enabled UNets (median r^2 \sim 0.75 on CIFAR10), revealing an intriguing role for attention in carving out semantic coherence from local patch mosaics.

[LG-47] Causal Discovery on Dependent Binary Data

链接: https://arxiv.org/abs/2412.20289
作者: Alex Chen,Qing Zhou
关键词: learning causal graphical, independence between observations, causal graphical models, causal graph learning, causal graph
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The assumption of independence between observations (units) in a dataset is prevalent across various methodologies for learning causal graphical models. However, this assumption often finds itself in conflict with real-world data, posing challenges to accurate structure learning. We propose a decorrelation-based approach for causal graph learning on dependent binary data, where the local conditional distribution is defined by a latent utility model with dependent errors across units. We develop a pairwise maximum likelihood method to estimate the covariance matrix for the dependence among the units. Then, leveraging the estimated covariance matrix, we develop an EM-like iterative algorithm to generate and decorrelate samples of the latent utility variables, which serve as decorrelated data. Any standard causal discovery method can be applied on the decorrelated data to learn the underlying causal graph. We demonstrate that the proposed decorrelation approach significantly improves the accuracy in causal graph learning, through numerical experiments on both synthetic and real-world datasets.

[LG-48] TeLU Activation Function for Fast and Stable Deep Learning DATE

链接: https://arxiv.org/abs/2412.20269
作者: Alfredo Fernandez,Ankur Mali
关键词: Exponential Linear Unit, Hyperbolic Tangent Exponential, Tangent Exponential Linear, Linear Unit, Hyperbolic Tangent
类目: Machine Learning (cs.LG)
*备注: Updated version of “Stable and Robust Deep Learning By Hyperbolic Tangent Exponential Linear Unit (TeLU)”

点击查看摘要

Abstract:We propose the Hyperbolic Tangent Exponential Linear Unit (TeLU), a neural network hidden activation function defined as TeLU(x) = x * tanh(exp(x)). TeLU’s design is grounded in the core principles of key activation functions, achieving strong convergence by closely approximating the identity function in its active region while effectively mitigating the vanishing gradient problem in its saturating region. Its simple formulation enhances computational efficiency, leading to improvements in scalability and convergence speed. Unlike many modern activation functions, TeLU seamlessly combines the simplicity and effectiveness of ReLU with the smoothness and analytic properties essential for learning stability in deep neural networks. TeLU’s ability to mimic the behavior and optimal hyperparameter settings of ReLU, while introducing the benefits of smoothness and curvature, makes it an ideal drop-in replacement. Its analytic nature positions TeLU as a powerful universal approximator, enhancing both robustness and generalization across a multitude of experiments. We rigorously validate these claims through theoretical analysis and experimental validation, demonstrating TeLU’s performance across challenging benchmarks; including ResNet18 on ImageNet, Dynamic-Pooling Transformers on Text8, and Recurrent Neural Networks (RNNs) on the Penn TreeBank dataset. These results highlight TeLU’s potential to set a new standard in activation functions, driving more efficient and stable learning in deep neural networks, thereby accelerating scientific discoveries across various fields.
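
The activation is fully specified by the formula above, so a drop-in PyTorch module is straightforward (a minimal sketch, not the authors' reference implementation):

```python
import torch
import torch.nn as nn

class TeLU(nn.Module):
    """TeLU(x) = x * tanh(exp(x)), as defined in the abstract.
    For large x, tanh(exp(x)) saturates at 1, so TeLU approximates identity."""
    def forward(self, x):
        return x * torch.tanh(torch.exp(x))

# Drop-in replacement for ReLU in a small MLP.
model = nn.Sequential(nn.Linear(128, 64), TeLU(), nn.Linear(64, 10))
x = torch.randn(32, 128)
print(model(x).shape)   # torch.Size([32, 10])
```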

[LG-49] owards Ideal Temporal Graph Neural Networks: Evaluations and Conclusions after 10000 GPU Hours

链接: https://arxiv.org/abs/2412.20256
作者: Yuxin Yang,Hongkuan Zhou,Rajgopal Kannan,Viktor Prasanna
关键词: Graph Neural Networks, Temporal Graph Neural, Neural Networks, Graph Neural, Temporal Graph
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Temporal Graph Neural Networks (TGNNs) have emerged as powerful tools for modeling dynamic interactions across various domains. The design space of TGNNs is notably complex, given the unique challenges in runtime efficiency and scalability raised by the evolving nature of temporal graphs. We contend that many of the existing works on TGNN modeling inadequately explore the design space, leading to suboptimal designs. Viewing TGNN models through a performance-focused lens often obstructs a deeper understanding of the advantages and disadvantages of each technique. Specifically, benchmarking efforts inherently evaluate models in their original designs and implementations, resulting in unclear accuracy comparisons and misleading runtime. To address these shortcomings, we propose a practical comparative evaluation framework that performs a design space search across well-known TGNN modules based on a unified, optimized code implementation. Using our framework, we make the first efforts towards addressing three critical questions in TGNN design, spending over 10,000 GPU hours: (1) investigating the efficiency of TGNN module designs, (2) analyzing how the effectiveness of these modules correlates with dataset patterns, and (3) exploring the interplay between multiple modules. Key outcomes of this directed investigative approach include demonstrating that the most recent neighbor sampling and attention aggregator outperform uniform neighbor sampling and the MLP-Mixer aggregator; assessing static node memory as an effective node memory alternative; and showing that the choice between static or dynamic node memory should be based on the repetition patterns in the dataset. Our in-depth analysis of the interplay between TGNN modules and dataset patterns should provide a deeper insight into TGNN performance along with potential research directions for designing more general and effective TGNNs.

[LG-50] An Anomaly Detection System Based on Generative Classifiers for Controller Area Network

链接: https://arxiv.org/abs/2412.20255
作者: Chunheng Zhao,Stefano Longari,Michele Carminati,Pierluigi Pisu
关键词: securing onboard networks, securing onboard, safety-critical electronic systems, Intrusion Detection Systems, increasingly complex
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As electronic systems become increasingly complex and prevalent in modern vehicles, securing onboard networks is crucial, particularly as many of these systems are safety-critical. Researchers have demonstrated that modern vehicles are susceptible to various types of attacks, enabling attackers to gain control and compromise safety-critical electronic systems. Consequently, several Intrusion Detection Systems (IDSs) have been proposed in the literature to detect such cyber-attacks on vehicles. This paper introduces a novel generative classifier-based Intrusion Detection System (IDS) designed for anomaly detection in automotive networks, specifically focusing on the Controller Area Network (CAN). Leveraging variational Bayes, our proposed IDS utilizes a deep latent variable model to construct a causal graph for conditional probabilities. An auto-encoder architecture is utilized to build the classifier to estimate conditional probabilities, which contribute to the final prediction probabilities through Bayesian inference. Comparative evaluations against state-of-the-art IDSs on a public Car-hacking dataset highlight our proposed classifier’s superior performance in improving detection accuracy and F1-score. The proposed IDS demonstrates its efficacy by outperforming existing models with limited training data, providing enhanced security assurance for automotive systems.

[LG-51] IMSSA: Deploying modern state-space models on memristive in-memory compute hardware ISCAS2025

链接: https://arxiv.org/abs/2412.20215
作者: Sebastian Siegel,Ming-Jay Yang,John-Paul Strachan
关键词: long temporal sequences, Processing long temporal, deep learning, key challenge, challenge in deep
类目: Machine Learning (cs.LG)
*备注: 5 pages, 4 figures, submitted to IEEE ISCAS 2025

点击查看摘要

Abstract:Processing long temporal sequences is a key challenge in deep learning. In recent years, Transformers have become state-of-the-art for this task, but suffer from excessive memory requirements due to the need to explicitly store the sequences. To address this issue, structured state-space sequential (S4) models recently emerged, offering a fixed memory state while still enabling the processing of very long sequence contexts. The recurrent linear update of the state in these models makes them highly efficient on modern graphics processing units (GPU) by unrolling the recurrence into a convolution. However, this approach demands significant memory and massively parallel computation, which is only available on the latest GPUs. In this work, we aim to bring the power of S4 models to edge hardware by significantly reducing the size and computational demand of an S4D model through quantization-aware training, even achieving ternary weights for a simple real-world task. To this end, we extend conventional quantization-aware training to tailor it for analog in-memory compute hardware. We then demonstrate the deployment of recurrent S4D kernels on memristive crossbar arrays, enabling their computation in an in-memory compute fashion. To our knowledge, this is the first implementation of S4 kernels on in-memory compute hardware.

[LG-52] Generative Regression Based Watch Time Prediction for Video Recommendation: Model and Performance

链接: https://arxiv.org/abs/2412.20211
作者: Hongxu Ma,Kai Tian,Tao Zhang,Xuefeng Zhang,Chunjie Chen,Han Li,Jihong Guan,Shuigeng Zhou
关键词: encapsulate user interests, Watch time, user interests, encapsulate user, users’ watch times
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, conference or other essential info

点击查看摘要

Abstract:Watch time prediction (WTP) has emerged as a pivotal task in short video recommendation systems, as it encapsulates user interests. Predicting users' watch times on videos often encounters challenges, including wide value ranges and imbalanced data distributions, which can lead to significant bias when directly regressing watch time. Recent studies have tried to tackle these issues by converting the continuous watch time estimation into an ordinal classification task. While these methods are somewhat effective, they exhibit notable limitations. Inspired by language modeling, we propose a novel Generative Regression (GR) paradigm for WTP based on sequence generation. This approach employs structural discretization to enable the lossless reconstruction of original values while maintaining prediction fidelity. By formulating the prediction problem as a numerical-to-sequence mapping, and with meticulously designed vocabulary and label encodings, each watch time is transformed into a sequence of tokens. To expedite model training, we introduce curriculum learning with an embedding mixup strategy, which mitigates the training-and-inference inconsistency associated with teacher forcing. We evaluate our method against state-of-the-art approaches on four public datasets and one industrial dataset. We also perform online A/B testing on Kuaishou, a leading video app with about 400 million DAUs, to demonstrate the real-world efficacy of our method. The results conclusively show that GR outperforms existing techniques significantly. Furthermore, we successfully apply GR to another regression task in recommendation systems, i.e., Lifetime Value (LTV) prediction, which highlights its potential as a novel and effective solution to general regression challenges.
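
The abstract does not spell out the vocabulary or label encodings; the sketch below illustrates one possible structural discretization in the spirit of the paper's numerical-to-sequence mapping, encoding a watch time digit by digit so it can be reconstructed losslessly. Token names and precision are hypothetical.

```python
# Hypothetical structural discretization: encode a watch time (in seconds) as a
# token sequence that can be decoded back losslessly. One token per digit at a
# fixed precision; the paper's actual vocabulary/label design may differ.
PRECISION = 1          # tenths of a second
MAX_DIGITS = 6         # supports watch times up to 99999.9 s

def encode_watch_time(seconds: float) -> list[str]:
    ticks = round(seconds * 10 ** PRECISION)
    digits = str(ticks).zfill(MAX_DIGITS)
    return [f"<d{i}_{d}>" for i, d in enumerate(digits)] + ["<eos>"]

def decode_watch_time(tokens: list[str]) -> float:
    digits = "".join(t.split("_")[1].rstrip(">") for t in tokens if t.startswith("<d"))
    return int(digits) / 10 ** PRECISION

seq = encode_watch_time(123.4)
assert decode_watch_time(seq) == 123.4
print(seq)  # ['<d0_0>', '<d1_0>', '<d2_1>', '<d3_2>', '<d4_3>', '<d5_4>', '<eos>']
```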

[LG-53] No-regret learning in harmonic games: Extrapolation in the face of conflicting interests

链接: https://arxiv.org/abs/2412.20203
作者: Davide Legacci,Panayotis Mertikopoulos,Christos H. Papadimitriou,Georgios Piliouras,Bary S. R. Pradelski
关键词: harmonic games, potential games, games, long-run behavior, behavior of multi-agent
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 36 pages, 5 figures

点击查看摘要

Abstract:The long-run behavior of multi-agent learning - and, in particular, no-regret learning - is relatively well-understood in potential games, where players have aligned interests. By contrast, in harmonic games - the strategic counterpart of potential games, where players have conflicting interests - very little is known outside the narrow subclass of 2-player zero-sum games with a fully-mixed equilibrium. Our paper seeks to partially fill this gap by focusing on the full class of (generalized) harmonic games and examining the convergence properties of follow-the-regularized-leader (FTRL), the most widely studied class of no-regret learning schemes. As a first result, we show that the continuous-time dynamics of FTRL are Poincaré recurrent, that is, they return arbitrarily close to their starting point infinitely often, and hence fail to converge. In discrete time, the standard, “vanilla” implementation of FTRL may lead to even worse outcomes, eventually trapping the players in a perpetual cycle of best-responses. However, if FTRL is augmented with a suitable extrapolation step - which includes as special cases the optimistic and mirror-prox variants of FTRL - we show that learning converges to a Nash equilibrium from any initial condition, and all players are guaranteed at most O(1) regret. These results provide an in-depth understanding of no-regret learning in harmonic games, nesting prior work on 2-player zero-sum games, and showing at a high level that harmonic games are the canonical complement of potential games, not only from a strategic, but also from a dynamic viewpoint.
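
As an illustration of the vanilla-versus-extrapolated contrast, the numpy sketch below runs multiplicative weights (an FTRL instance) and its optimistic variant on matching pennies, a 2-player zero-sum game with a fully mixed equilibrium, i.e., the narrow subclass the abstract mentions as previously understood: vanilla FTRL drifts away from the equilibrium while the optimistic variant converges toward it. Step size and horizon are arbitrary.

```python
import numpy as np

# Matching pennies payoff for player 1 (player 2 receives the negative).
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run(optimistic, steps=3000, eta=0.1):
    G1, G2 = np.zeros(2), np.zeros(2)          # cumulative payoff gradients
    g1_prev, g2_prev = np.zeros(2), np.zeros(2)
    gaps = []
    for _ in range(steps):
        # Optimistic FTRL counts the previous gradient twice (the extrapolation step).
        x1 = softmax(eta * (G1 + (g1_prev if optimistic else 0.0)))
        x2 = softmax(eta * (G2 + (g2_prev if optimistic else 0.0)))
        g1 = A @ x2            # gradient for player 1 (maximizer of x1^T A x2)
        g2 = -A.T @ x1         # gradient for player 2 (the minimizer)
        G1 += g1
        G2 += g2
        g1_prev, g2_prev = g1, g2
        gaps.append(abs(x1[0] - 0.5) + abs(x2[0] - 0.5))   # distance to (1/2, 1/2)
    return float(np.mean(gaps[-steps // 4:]))

print("vanilla FTRL (MWU)    avg distance to equilibrium:", round(run(False), 3))
print("optimistic FTRL (MWU) avg distance to equilibrium:", round(run(True), 3))
```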

[LG-54] Accurate Coresets for Latent Variable Models and Regularized Regression

链接: https://arxiv.org/abs/2412.20189
作者: Sanskar Ranjan,Supratim Shit
关键词: model trained, accurate coreset maintains, accurate coreset, weighted subset, level of accuracy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Accurate coresets are a weighted subset of the original dataset, ensuring a model trained on the accurate coreset maintains the same level of accuracy as a model trained on the full dataset. Primarily, these coresets have been studied for a limited range of machine learning models. In this paper, we introduce a unified framework for constructing accurate coresets. Using this framework, we present accurate coreset construction algorithms for general problems, including a wide range of latent variable model problems and \ell_p-regularized \ell_p-regression. For latent variable models, our coreset size is O\left(\mathrm{poly}(k)\right), where k is the number of latent variables. For \ell_p-regularized \ell_p-regression, our algorithm captures the reduction of model complexity due to regularization, resulting in a coreset whose size is always smaller than d^p for a regularization parameter \lambda > 0. Here, d is the dimension of the input points. This inherently improves the size of the accurate coreset for ridge regression. We substantiate our theoretical findings with extensive experimental evaluations on real datasets.

[LG-55] Pushing the Envelope of Low-Bit LLM via Dynamic Error Compensation

链接: https://arxiv.org/abs/2412.20185
作者: Yeonhong Park,Jake Hyun,Hojoon Kim,Jae W. Lee
关键词: Large Language Models, Large Language, recently gained popularity, limited hardware resources, Language Models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources. While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper, we propose QDEC, an inference scheme that improves the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and inference latency reduction. QDEC stores the residual matrix – the difference between full-precision and quantized weights – in CPU memory, and dynamically fetches the residuals for only a small portion of the weights. This portion corresponds to the salient channels, marked by activation outliers, with the fetched residuals helping to correct quantization errors in these channels. Salient channels are identified dynamically at each decoding step by analyzing the input activations – this allows for the adaptation to the dynamic nature of activation distribution, and thus maximizes the effectiveness of error compensation. We demonstrate the effectiveness of QDEC by augmenting state-of-the-art quantization methods. For example, QDEC reduces the perplexity of a 3-bit Llama-3-8B-Instruct model from 10.15 to 9.12 – outperforming its 3.5-bit counterpart – while adding less than 0.0003% to GPU memory usage and incurring only a 1.7% inference slowdown on NVIDIA RTX 4050 Mobile GPU. The code will be publicly available soon.
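
A numpy toy of the residual-fetch idea: keep the quantization residual on the host, detect salient input channels from activation outliers at each step, and correct only those channels. Real QDEC works on packed low-bit formats with CPU-GPU transfers and per-group quantization, none of which is modeled here; the bit width, channel count, and top-k rule below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_3bit(W):
    """Symmetric per-tensor 3-bit quantization (toy; real schemes are per-group)."""
    levels = 2 ** 3 // 2 - 1                    # representable magnitudes {0,...,3}
    scale = np.abs(W).max() / levels
    return np.clip(np.round(W / scale), -levels, levels) * scale

W = rng.normal(size=(256, 256))                 # full-precision weight (out x in)
Wq = quantize_3bit(W)
R = W - Wq                                      # residual kept in (cheap) host memory

def forward(x, k=16):
    """Quantized matvec plus error compensation on k salient input channels."""
    salient = np.argsort(-np.abs(x))[:k]        # channels with activation outliers
    y = Wq @ x                                  # fast low-bit path
    y += R[:, salient] @ x[salient]             # fetch residual columns only for them
    return y

x = rng.normal(size=256)
x[rng.choice(256, 4, replace=False)] *= 20      # a few activation outliers
err_q    = np.linalg.norm(W @ x - Wq @ x)
err_qdec = np.linalg.norm(W @ x - forward(x))
print(f"quantized-only error: {err_q:.3f}, with compensation: {err_qdec:.3f}")
```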

[LG-56] A Greedy Strategy for Graph Cut

链接: https://arxiv.org/abs/2412.20035
作者: Feiping Nie,Shenfei Pei,Zengwei Zheng,Rong Wang,Xuelong Li
关键词: called GGC, GGC, graph cut problem, objective function, Greedy strategy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a greedy strategy to solve the graph cut problem, called GGC. It starts from the state where each data sample is regarded as a cluster and dynamically merges the two clusters that reduce the value of the global objective function the most, until the required number of clusters is obtained; the monotonicity of the resulting sequence of objective function values is proved. To reduce the computational complexity of GGC, only mergers between clusters and their neighbors are considered, so GGC has a nearly linear computational complexity with respect to the number of samples. Also, unlike other algorithms, the greedy strategy makes the solution of the proposed algorithm unique; in other words, its performance is not affected by randomness. We apply the proposed method to the normalized cut problem, a widely studied graph cut formulation. Extensive experiments show that GGC often finds better solutions on the normalized cut problem than the traditional two-stage optimization algorithm (eigendecomposition + k-means). In addition, GGC compares favorably with several state-of-the-art clustering algorithms.
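
A brute-force numpy toy of the greedy strategy on the normalized cut objective: start from singleton clusters and repeatedly apply the merge between connected clusters that lowers the objective most. GGC's near-linear complexity comes from neighbor-restricted, incrementally updated merges, which this toy does not implement.

```python
import numpy as np

def ncut(W, labels):
    """Normalized cut of a partition: sum over clusters of cut(C, rest) / vol(C)."""
    val = 0.0
    for c in np.unique(labels):
        mask = labels == c
        cut = W[np.ix_(mask, ~mask)].sum()
        vol = W[mask].sum()
        val += cut / vol if vol > 0 else 0.0
    return val

def greedy_graph_cut(W, n_clusters):
    n = W.shape[0]
    labels = np.arange(n)                        # every sample starts as its own cluster
    while len(np.unique(labels)) > n_clusters:
        best = (None, np.inf)
        clusters = np.unique(labels)
        for i, a in enumerate(clusters):         # only merge clusters connected by an edge
            for b in clusters[i + 1:]:
                if W[np.ix_(labels == a, labels == b)].sum() == 0:
                    continue
                trial = np.where(labels == b, a, labels)
                val = ncut(W, trial)
                if val < best[1]:
                    best = ((a, b), val)
        if best[0] is None:
            break
        a, b = best[0]
        labels = np.where(labels == b, a, labels)
    return labels

# Toy usage: two dense blocks joined by a single weak bridge.
rng = np.random.default_rng(1)
B = (rng.random((20, 20)) < 0.6).astype(float)
W = np.block([[B, np.zeros((20, 20))], [np.zeros((20, 20)), B]])
W[0, 20] = W[20, 0] = 1.0
W = np.triu(W, 1); W = W + W.T                   # symmetric, zero diagonal
print(greedy_graph_cut(W, 2))
```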

[LG-57] A Nearly Optimal Single Loop Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness ICML2024

链接: https://arxiv.org/abs/2412.20017
作者: Xiaochuan Gong,Jie Hao,Mingrui Liu
关键词: potentially unbounded smoothness, stochastic gradient, stochastic gradient descent, potentially unbounded, strongly convex
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: ICML 2024

点击查看摘要

Abstract:This paper studies the problem of stochastic bilevel optimization where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level function is strongly convex. This problem is motivated by meta-learning applied to sequential data, such as text classification using recurrent neural networks, where the smoothness constant of the upper-level loss function scales linearly with the gradient norm and can be potentially unbounded. Existing algorithms crucially rely on a nested-loop design, which requires significant tuning effort and is not practical. In this paper, we address this issue by proposing a Single Loop bIlevel oPtimizer (SLIP). The proposed algorithm first updates the lower-level variable by a few steps of stochastic gradient descent, and then simultaneously updates the upper-level variable by normalized stochastic gradient descent with momentum and the lower-level variable by stochastic gradient descent. Under standard assumptions, we show that our algorithm finds an \epsilon -stationary point within \widetilde{O}(1/\epsilon^4) oracle calls of stochastic gradient or Hessian-vector product (here \widetilde{O}(\cdot) compresses logarithmic factors of 1/\epsilon and 1/\delta , where \delta\in(0,1) denotes the failure probability), both in expectation and with high probability. This complexity result is nearly optimal up to logarithmic factors without mean-square smoothness of the stochastic gradient oracle. Our proof relies on (i) a refined characterization and control of the lower-level variable and (ii) establishing a novel connection between bilevel optimization and stochastic optimization under distributional drift. Our experiments on various tasks show that our algorithm significantly outperforms strong baselines in bilevel optimization.

[LG-58] Discrete Curvature Graph Information Bottleneck AAAI-2025 AAAI

链接: https://arxiv.org/abs/2412.19993
作者: Xingcheng Fu,Jian Wang,Yisen Gao,Qingyun Sun,Haonan Yuan,Jianxin Li,Xianxian Li
关键词: Graph neural networks, Ricci curvature, information transport, information, Graph Information Bottleneck
类目: Machine Learning (cs.LG)
*备注: Accepted by the Main Technical Track of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-2025)

点击查看摘要

Abstract:Graph neural networks (GNNs) have been shown to depend on whether effective node information is passed sufficiently. Discrete curvature (Ricci curvature) offers a geometric perspective on graph connectivity and information propagation efficiency, and has been adopted in recent years to explore efficient message-passing structures for GNNs. However, most empirical studies are based on directly observed graph structures or heuristic topological assumptions and lack in-depth exploration of the underlying optimal information transport structures for downstream tasks. We argue that optimizing graph curvature is more fundamental than directly rewiring or learning the graph structure, offering richer message-passing characterization and better interpretability of information transport. From both graph geometry and information theory perspectives, we propose the novel Discrete Curvature Graph Information Bottleneck (CurvGIB) framework to optimize the information transport structure and learn better node representations simultaneously. CurvGIB advances the Variational Information Bottleneck (VIB) principle for Ricci curvature optimization to learn the optimal information transport pattern for specific downstream tasks. The learned Ricci curvature is used to refine the optimal transport structure of the graph, and the node representation is fully and efficiently learned. Moreover, to handle the computational complexity of differentiating the Ricci curvature, we combine Ricci flow and VIB to deduce a curvature optimization approximation that forms a tractable IB objective function. Extensive experiments on various datasets demonstrate the superior effectiveness and interpretability of CurvGIB.

[LG-59] A Robust Federated Learning Framework for Undependable Devices at Scale

链接: https://arxiv.org/abs/2412.19991
作者: Shilong Wang,Jianchun Liu,Hongli Xu,Chunming Qiao,Huarong Deng,Qiuye Zheng,Jiantao Gong
关键词: FLUDE, federated learning, frequently disconnected, disconnected from WiFi, training
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In a federated learning (FL) system, many devices, such as smartphones, are often undependable (e.g., frequently disconnected from WiFi) during training. Existing FL frameworks always assume a dependable environment and exclude undependable devices from training, leading to poor model performance and resource wastage. In this paper, we propose FLUDE to effectively deal with undependable environments. First, FLUDE assesses the dependability of devices based on the probability distribution of their historical behaviors (e.g., the likelihood of successfully completing training). Based on this assessment, FLUDE adaptively selects devices with high dependability for training. To mitigate resource wastage during the training phase, FLUDE maintains a model cache on each device, aiming to preserve the latest training state for later use in case local training on an undependable device is interrupted. Moreover, FLUDE proposes a staleness-aware strategy to judiciously distribute the global model to a subset of devices, thus significantly reducing resource wastage while maintaining model performance. We have implemented FLUDE on two physical platforms with 120 smartphones and NVIDIA Jetson devices. Extensive experimental results demonstrate that FLUDE can effectively improve model performance and resource efficiency of FL training in undependable environments.
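
A minimal sketch of dependability-based device selection in the spirit of FLUDE's first step, using a Beta-Bernoulli posterior over each device's probability of completing a round; the prior, history format, and selection rule are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dependability_scores(history, prior=(1.0, 1.0)):
    """history[d] = list of 1/0 flags: did device d finish past training rounds?
    Returns the posterior mean completion probability per device (Beta-Bernoulli)."""
    a0, b0 = prior
    return np.array([(a0 + sum(h)) / (a0 + b0 + len(h)) for h in history])

def select_devices(history, k):
    scores = dependability_scores(history)
    return np.argsort(-scores)[:k], scores

# Toy: 8 devices with different historical completion rates.
history = [list(rng.binomial(1, p, size=20)) for p in
           [0.95, 0.9, 0.8, 0.6, 0.5, 0.4, 0.2, 0.1]]
chosen, scores = select_devices(history, k=3)
print("selected devices:", chosen, "scores:", np.round(scores[chosen], 2))
```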

[LG-60] Caesar: A Low-deviation Compression Approach for Efficient Federated Learning

链接: https://arxiv.org/abs/2412.19989
作者: Jiaming Yan,Jianchun Liu,Hongli Xu,Liusheng Huang,Jiantao Gong,Xudong Liu,Kun Hou
关键词: tremendous communication overhead, federated learning, relieve the tremendous, overhead of federated, gradient compression ratio
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 12 pages, 27 figures

点击查看摘要

Abstract:Compression is an efficient way to relieve the tremendous communication overhead of federated learning (FL) systems. However, for the existing works, the information loss under compression will lead to unexpected model/gradient deviation for the FL training, significantly degrading the training performance, especially under the challenges of data heterogeneity and model obsolescence. To strike a delicate trade-off between model accuracy and traffic cost, we propose Caesar, a novel FL framework with a low-deviation compression approach. For the global model download, we design a greedy method to optimize the compression ratio for each device based on the staleness of the local model, ensuring a precise initial model for local training. Regarding the local gradient upload, we utilize the device's local data properties (i.e., sample volume and label distribution) to quantify its local gradient's importance, which then guides the determination of the gradient compression ratio. Besides, with the fine-grained batch size optimization, Caesar can significantly diminish the devices' idle waiting time under the synchronized barrier. We have implemented Caesar on two physical platforms with 40 smartphones and 80 NVIDIA Jetson devices. Extensive results show that Caesar can reduce the traffic costs by about 25.54% to 37.88% compared to the compression-based baselines with the same target accuracy, while incurring only a 0.68% degradation in final test accuracy relative to the full-precision communication.

[LG-61] Explainable Semantic Federated Learning Enabled Industrial Edge Network for Fire Surveillance

链接: https://arxiv.org/abs/2412.19979
作者: Li Dong,Yubo Peng,Feibo Jiang,Kezhi Wang,Kun Yang
关键词: Internet of Things, require transmitting large, Industrial Internet, Industrial Edge Semantic, Edge Semantic Network
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
*备注: 9 pages

点击查看摘要

Abstract:In fire surveillance, Industrial Internet of Things (IIoT) devices need to transmit large volumes of monitoring data frequently, which leads to heavy consumption of spectrum resources. Hence, we propose an Industrial Edge Semantic Network (IESN) that allows IIoT devices to send warnings through semantic communication (SC). Three issues must then be addressed: (1) data privacy and security; (2) SC model adaptation for heterogeneous devices; and (3) explainability of semantics. Accordingly, we first present eXplainable Semantic Federated Learning (XSFL) to train the SC model, thus ensuring data privacy and security. Then, we present an Adaptive Client Training (ACT) strategy that provides a specific SC model for each device according to its Fisher information matrix, thus overcoming device heterogeneity. Next, an Explainable SC (ESC) mechanism is designed, which introduces a leakyReLU-based activation mapping to explain the relationship between the extracted semantics and the monitoring data. Finally, simulation results demonstrate the effectiveness of XSFL.

[LG-62] Data-driven tool wear prediction in milling based on a process-integrated single-sensor approach

链接: https://arxiv.org/abs/2412.19950
作者: Eric Hirsch,Christian Friedrich
关键词: Accurate tool wear, tool wear prediction, tool wear, Accurate tool, wear prediction
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Accurate tool wear prediction is essential for maintaining productivity and minimizing costs in machining. However, the complex nature of the tool wear process poses significant challenges to achieving reliable predictions. This study explores data-driven methods, in particular deep learning, for tool wear prediction. Traditional data-driven approaches often focus on a single process, relying on multi-sensor setups and extensive data generation, which limits generalization to new settings. Moreover, multi-sensor integration is often impractical in industrial environments. To address these limitations, this research investigates the transferability of predictive models using minimal training data, validated across two processes. Furthermore, it uses a simple setup with a single acceleration sensor to establish a low-cost data generation approach that facilitates the generalization of models to other processes via transfer learning. The study evaluates several machine learning models, including convolutional neural networks (CNN), long short-term memory networks (LSTM), support vector machines (SVM) and decision trees, trained on different input formats such as feature vectors and short-time Fourier transform (STFT). The performance of the models is evaluated on different amounts of training data, including scenarios with significantly reduced datasets, providing insight into their effectiveness under constrained data conditions. The results demonstrate the potential of specific models and configurations for effective tool wear prediction, contributing to the development of more adaptable and efficient predictive maintenance strategies in machining. Notably, the ConvNeXt model performs exceptionally well, achieving 99.1% accuracy in identifying tool wear using data from only four milling tools operated until they were worn.
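
A small sketch of the single-sensor input pipeline: turn the acceleration signal into a log-magnitude STFT spectrogram and feed it to a tiny CNN classifier. The sampling rate, window parameters, and network below are placeholders; the paper's ConvNeXt and other evaluated models are not reproduced here.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import stft

fs = 10_000                                   # hypothetical sampling rate (Hz)
accel = np.random.randn(fs)                   # 1 s of acceleration data (placeholder)

# Short-time Fourier transform -> log-magnitude spectrogram.
_, _, Z = stft(accel, fs=fs, nperseg=256, noverlap=128)
spec = np.log1p(np.abs(Z)).astype(np.float32)           # (freq_bins, time_frames)

class WearCNN(nn.Module):
    """Minimal spectrogram classifier: worn vs. not worn."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):                      # x: (batch, 1, freq, time)
        return self.head(self.features(x).flatten(1))

model = WearCNN()
batch = torch.from_numpy(spec)[None, None]     # (1, 1, freq_bins, time_frames)
print(model(batch).shape)                      # torch.Size([1, 2])
```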

[LG-63] On the Convergence of DP-SGD with Adaptive Clipping

链接: https://arxiv.org/abs/2412.19916
作者: Egor Shulgin,Peter Richtárik
关键词: Stochastic Gradient Descent, Stochastic Gradient, Gradient Descent, powerful technique, technique for enabling
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Stochastic Gradient Descent (SGD) with gradient clipping is a powerful technique for enabling differentially private optimization. Although prior works extensively investigated clipping with a constant threshold, private training remains highly sensitive to threshold selection, which can be expensive or even infeasible to tune. This sensitivity motivates the development of adaptive approaches, such as quantile clipping, which have demonstrated empirical success but lack a solid theoretical understanding. This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD). We demonstrate that QC-SGD suffers from a bias problem similar to constant-threshold clipped SGD but show how this can be mitigated through a carefully designed quantile and step size schedule. Our analysis reveals crucial relationships between quantile selection, step size, and convergence behavior, providing practical guidelines for parameter selection. We extend these results to differentially private optimization, establishing the first theoretical guarantees for DP-QC-SGD. Our findings provide theoretical foundations for the widely used adaptive clipping heuristic and highlight open avenues for future research.
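
A numpy sketch of quantile clipping on a toy linear regression: per-sample gradient norms set the clipping threshold at a chosen quantile before Gaussian noise is added. For simplicity the quantile is computed non-privately and no privacy accounting is done, so this only illustrates the mechanics analyzed in the paper, not a private implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def qc_sgd(q=0.5, sigma=0.5, lr=0.1, epochs=20, batch=50):
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), n // batch):
            resid = X[idx] @ w - y[idx]
            grads = resid[:, None] * X[idx]              # per-sample gradients (batch, d)
            norms = np.linalg.norm(grads, axis=1)
            C = np.quantile(norms, q)                    # adaptive clipping threshold
            clipped = grads * np.minimum(1.0, C / (norms[:, None] + 1e-12))
            noise = sigma * C * rng.normal(size=d)       # Gaussian mechanism (scaled by C)
            w -= lr * (clipped.sum(axis=0) + noise) / len(idx)
    return w

w_hat = qc_sgd()
print("parameter error:", np.linalg.norm(w_hat - w_true))
```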

[LG-64] Mouth Articulation-Based Anchoring for Improved Cross-Corpus Speech Emotion Recognition

链接: https://arxiv.org/abs/2412.19909
作者: Shreya G. Upadhyay,Ali N. Salman,Carlos Busso,Chi-Chun Lee
关键词: numerous practical applications, Cross-corpus speech emotion, plays a vital, practical applications, vital role
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors like speaker differences, domain shifts, and recording conditions. To address these challenges, this study adopts a novel contrastive approach by focusing on emotion-specific articulatory gestures as the core elements for analysis. By shifting the emphasis to the more stable and consistent articulatory gestures, we aim to enhance emotion transfer learning in SER tasks. Our research leverages the CREMA-D and MSP-IMPROV corpora as benchmarks and reveals valuable insights into the commonality and reliability of these articulatory gestures. The findings highlight the potential of mouth articulatory gestures as a more robust constraint for improving emotion recognition across different settings or domains.

[LG-65] Minimax-Optimal Multi-Agent Robust Reinforcement Learning

链接: https://arxiv.org/abs/2412.19873
作者: Yuchen Jiao,Gen Li
关键词: robust Markov games, multi-player robust Markov, modeling competitive interactions, Multi-agent robust reinforcement, Markov games
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-agent robust reinforcement learning, also known as multi-player robust Markov games (RMGs), is a crucial framework for modeling competitive interactions under environmental uncertainties, with wide applications in multi-agent systems. However, existing results on sample complexity in RMGs suffer from at least one of three obstacles: restrictive range of uncertainty level or accuracy, the curse of multiple agents, and the barrier of long horizons, all of which cause existing results to significantly exceed the information-theoretic lower bound. To close this gap, we extend the Q-FTRL algorithm \citep{li2022minimax} to RMGs in the finite-horizon setting, assuming access to a generative model. We prove that the proposed algorithm achieves an \varepsilon -robust coarse correlated equilibrium (CCE) with a sample complexity (up to log factors) of \widetilde{O}\left(H^3 S \sum_{i=1}^m A_i \min\{H, 1/R\}/\varepsilon^2\right) , where S denotes the number of states, A_i is the number of actions of the i -th agent, H is the finite horizon length, and R is the uncertainty level. We also show that this sample complexity is minimax optimal by establishing a matching information-theoretic lower bound. Additionally, in the special case of two-player zero-sum RMGs, the algorithm achieves an \varepsilon -robust Nash equilibrium (NE) with the same sample complexity.

[LG-66] Reduced Order Models and Conditional Expectation

链接: https://arxiv.org/abs/2412.19836
作者: Hermann G. Matthies
关键词: imposed externally, serve to optimise, Systems may depend, optimise the system, Bayesian loss function
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 24 pages, 2 appendices

点击查看摘要

Abstract:Systems may depend on parameters which one may control, which serve to optimise the system, which are imposed externally, or which could be uncertain. This last case is taken as the “Leitmotiv” for the following. A reduced order model is produced from the full order model by some kind of projection onto a relatively low-dimensional manifold or subspace. The parameter-dependent reduction process produces a map from the parameters into the manifold. One now wants to examine the relation between the full and the reduced state for all possible parameter values of interest. Similarly, in the field of machine learning, a map from the parameter set into the image space of the machine learning model is learned on a training set of samples, typically by minimising the mean-square error. This set may be seen as a sample from some probability distribution, so the training is an approximate computation of an expectation, yielding an approximation to the conditional expectation, a special case of Bayesian updating in which the Bayesian loss function is the mean-square error. This offers the possibility of viewing these methods in a unified way and of introducing more general loss functions.
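
The link the abstract draws between mean-square training and the conditional expectation is the classical L2-projection identity; a short derivation, writing q for the uncertain parameters and Y for the quantity being learned:

```latex
% For any measurable map \varphi of the parameters q, the mean-square error splits as
\mathbb{E}\,\|Y - \varphi(q)\|^2
   \;=\; \mathbb{E}\,\|Y - \mathbb{E}[Y \mid q]\|^2
   \;+\; \mathbb{E}\,\|\mathbb{E}[Y \mid q] - \varphi(q)\|^2 ,
% because the cross term vanishes by the tower property of conditional expectation.
% The first term does not depend on \varphi, so the minimiser is
\varphi^{*}(q) \;=\; \mathbb{E}[Y \mid q],
% and minimising the empirical mean-square error over a training sample is a
% Monte Carlo approximation of this minimisation.
```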

[LG-67] GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors

链接: https://arxiv.org/abs/2412.19829
作者: Chengming Zhang,Xinheng Ding,Baixi Sun,Xiaodong Yu,Weijian Zheng,Zhen Xie,Dingwen Tao
关键词: Transformer-based large language, operations for Transformer-based, Transformer-based large, large language models, enhance computations
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heterogeneous hardware like the Gaudi processor has been developed to enhance computations, especially matrix operations for Transformer-based large language models (LLMs) for generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily due to inadequate optimizations in non-matrix computational kernels like Softmax and in heterogeneous resource utilization, particularly when processing long sequences. To address these issues, we propose an integrated approach (called GFormer) that merges sparse and linear attention mechanisms. GFormer aims to maximize the computational capabilities of the Gaudi processor's Matrix Multiplication Engine (MME) and Tensor Processing Cores (TPC) without compromising model quality. GFormer includes a windowed self-attention kernel and an efficient outer product kernel for causal linear attention, aiming to optimize LLM inference on Gaudi processors. Evaluation shows that GFormer significantly improves efficiency and model performance across various tasks on the Gaudi processor and outperforms state-of-the-art GPUs.
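
The Gaudi-specific MME/TPC kernels are the paper's contribution; purely as a reference for what a causal linear-attention kernel computes, here is the standard recurrence with feature map elu(x)+1 and running prefix sums, written as a plain (and deliberately slow) Python loop.

```python
import numpy as np

def causal_linear_attention(Q, K, V, eps=1e-6):
    """O(T * d * d_v) causal attention with feature map phi(x) = elu(x) + 1,
    keeping running sums S_t = sum_{s<=t} phi(k_s) v_s^T and z_t = sum_{s<=t} phi(k_s)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))      # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    T, d = Q.shape
    dv = V.shape[1]
    S = np.zeros((d, dv))
    z = np.zeros(d)
    out = np.empty((T, dv))
    for t in range(T):
        S += np.outer(Kf[t], V[t])              # update running outer-product sum
        z += Kf[t]                              # update running normaliser
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + eps)
    return out

rng = np.random.default_rng(0)
T, d = 8, 4
Q, K, V = rng.normal(size=(3, T, d))
print(causal_linear_attention(Q, K, V).shape)   # (8, 4)
```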

[LG-68] Sparse chaos in cortical circuits

链接: https://arxiv.org/abs/2412.21188
作者: Rainer Engelken,Michael Monteforte,Fred Wolf
关键词: neuronal membrane potential, nerve impulse generation, membrane potential dynamics, Neuronal circuits, nerve impulse
类目: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:Nerve impulses, the currency of information flow in the brain, are generated by an instability of the neuronal membrane potential dynamics. Neuronal circuits exhibit collective chaos that appears essential for learning, memory, sensory processing, and motor control. However, the factors controlling the nature and intensity of collective chaos in neuronal circuits are not well understood. Here we use computational ergodic theory to demonstrate that basic features of nerve impulse generation profoundly affect collective chaos in neuronal circuits. Numerically exact calculations of Lyapunov spectra, Kolmogorov-Sinai-entropy, and upper and lower bounds on attractor dimension show that changes in nerve impulse generation in individual neurons moderately impact information encoding rates but qualitatively transform phase space structure. Specifically, we find a drastic reduction in the number of unstable manifolds, Kolmogorov-Sinai entropy, and attractor dimension. Beyond a critical point, marked by the simultaneous breakdown of the diffusion approximation, a peak in the largest Lyapunov exponent, and a localization transition of the leading covariant Lyapunov vector, networks exhibit sparse chaos: prolonged periods of near stable dynamics interrupted by short bursts of intense chaos. Analysis of large, more realistically structured networks supports the generality of these findings. In cortical circuits, biophysical properties appear tuned to this regime of sparse chaos. Our results reveal a close link between fundamental aspects of single-neuron biophysics and the collective dynamics of cortical circuits, suggesting that nerve impulse generation mechanisms are adapted to enhance circuit controllability and information flow.

[LG-69] DeepF-fNet: a physics-informed neural network for vibration isolation optimization

链接: https://arxiv.org/abs/2412.21132
作者: A. Tollardo,F. Cadini,M. Giglio,L. Lomazzi
关键词: minimal material usage, designing safe, material usage, essential for designing, durable components
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Structural optimization is essential for designing safe, efficient, and durable components with minimal material usage. Traditional methods for vibration control often rely on active systems to mitigate unpredictable vibrations, which may lead to resonance and potential structural failure. However, these methods face significant challenges when addressing the nonlinear inverse eigenvalue problems required for optimizing structures subjected to a wide range of frequencies. As a result, no existing approach has effectively addressed the need for real-time vibration suppression within this context, particularly in high-performance environments such as automotive noise, vibration and harshness, where computational efficiency is crucial. This study introduces DeepF-fNet, a novel neural network framework designed to replace traditional active systems in vibration-based structural optimization. Leveraging DeepONets within the context of physics-informed neural networks, DeepF-fNet integrates both data and the governing physical laws. This enables rapid identification of optimal parameters to suppress critical vibrations at specific frequencies, offering a more efficient and real-time alternative to conventional methods. The proposed framework is validated through a case study involving a locally resonant metamaterial used to isolate structures from user-defined frequency ranges. The results demonstrate that DeepF-fNet outperforms traditional genetic algorithms in terms of computational speed while achieving comparable results, making it a promising tool for vibration-sensitive applications. By replacing active systems with machine learning techniques, DeepF-fNet paves the way for more efficient and cost-effective structural optimization in real-world scenarios.

[LG-70] Quantum Diffusion Model for Quark and Gluon Jet Generation NEURIPS2024

链接: https://arxiv.org/abs/2412.21082
作者: Mariia Baidachna,Rey Guadarrama,Gopal Ramesh Dahale,Tom Magorsch,Isabel Pedraza,Konstantin T. Matchev,Katia Matcheva,Kyoungchul Kong,Sergei Gleyzer
关键词: demonstrated remarkable success, time-consuming to train, quantum diffusion model, demonstrated remarkable, remarkable success
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: Accepted for the NeurIPS 2024 MLNCP workshop

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable success in image generation, but they are computationally intensive and time-consuming to train. In this paper, we introduce a novel diffusion model that benefits from quantum computing techniques in order to mitigate computational challenges and enhance generative performance within high energy physics data. The fully quantum diffusion model replaces Gaussian noise with random unitary matrices in the forward process and incorporates a variational quantum circuit within the U-Net in the denoising architecture. We run evaluations on the structurally complex quark and gluon jets dataset from the Large Hadron Collider. The results demonstrate that the fully quantum and hybrid models are competitive with a similar classical model for jet generation, highlighting the potential of using quantum techniques for machine learning problems.

[LG-71] Enhanced coarsening of charge density waves induced by electron correlation: Machine-learning enabled large-scale dynamical simulations

链接: https://arxiv.org/abs/2412.21072
作者: Yang Yang,Chen Cheng,Yunhao Fan,Gia-Wei Chern
关键词: remains largely unexplored, phase ordering kinetics, correlated electron systems, largely unexplored, ordering kinetics
类目: rongly Correlated Electrons (cond-mat.str-el); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:The phase ordering kinetics of emergent orders in correlated electron systems is a fundamental topic in non-equilibrium physics, yet it remains largely unexplored. The intricate interplay between quasiparticles and emergent order-parameter fields could lead to unusual coarsening dynamics that is beyond the standard theories. However, accurate treatment of both quasiparticles and collective degrees of freedom is a multi-scale challenge in dynamical simulations of correlated electrons. Here we leverage modern machine learning (ML) methods to achieve a linear-scaling algorithm for simulating the coarsening of charge density waves (CDWs), one of the fundamental symmetry breaking phases in functional electron materials. We demonstrate our approach on the square-lattice Hubbard-Holstein model and uncover an intriguing enhancement of CDW coarsening which is related to the screening of on-site potential by electron-electron interactions. Our study provides fresh insights into the role of electron correlations in non-equilibrium dynamics and underscores the promise of ML force-field approaches for advancing multi-scale dynamical modeling of correlated electron systems.

[LG-72] Investigating layer-selective transfer learning of QAOA parameters for Max-Cut problem

链接: https://arxiv.org/abs/2412.21071
作者: Francesco Aldo Venturelli,Sreetama Das,Filippo Caruso
关键词: variational quantum algorithm, noisy intermediate-scale quantum, quantum algorithm, solving combinatorial optimization, intermediate-scale quantum
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures. Comments are welcome

点击查看摘要

Abstract:Quantum approximate optimization algorithm (QAOA) is a variational quantum algorithm (VQA) ideal for noisy intermediate-scale quantum (NISQ) processors, and is highly successful for solving combinatorial optimization problems (COPs). It has been observed that the optimal variational parameters obtained from one instance of a COP can be transferred to another instance, producing sufficiently satisfactory solutions for the latter. In this context, a suitable method for further improving the solution is to fine-tune a subset of the transferred parameters. We numerically explore the role of optimizing individual QAOA layers in improving the approximate solution of the Max-Cut problem after parameter transfer. We also investigate the trade-off between a good approximation and the required optimization time when optimizing transferred QAOA parameters. These studies show that optimizing a subset of layers can be more effective at a lower time-cost compared to optimizing all layers.

[LG-73] Active Learning with Variational Quantum Circuits for Quantum Process Tomography

链接: https://arxiv.org/abs/2412.20925
作者: Jiaqi Yang,Xiaohua Xu,Wei Xie
关键词: Quantum process tomography, quantum states, unknown quantum process, Quantum, Quantum process
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum process tomography (QPT), used for reconstruction of an unknown quantum process from measurement data, is a fundamental tool for the diagnostic and full characterization of quantum systems. It relies on querying a set of quantum states as input to the quantum process. Previous works commonly use a straightforward strategy to select a set of quantum states randomly, overlooking differences in informativeness among quantum states. Since querying the quantum system requires multiple experiments that can be prohibitively costly, it is always the case that there are not enough quantum states for high-quality reconstruction. In this paper, we propose a general framework for active learning (AL) to adaptively select a set of informative quantum states that improves the reconstruction most efficiently. In particular, we introduce a learning framework that leverages the widely-used variational quantum circuits (VQCs) to perform the QPT task and integrate our AL algorithms into the query step. We design and evaluate three types of AL algorithms: committee-based, uncertainty-based, and diversity-based, each exhibiting distinct advantages in terms of performance and computational cost. Additionally, we provide a guideline for selecting algorithms suitable for different scenarios. Numerical results demonstrate that our algorithms achieve significantly improved reconstruction compared to the baseline method that selects a set of quantum states randomly. Moreover, these results suggest that active learning based approaches are applicable to other complicated learning tasks in large-scale quantum information processing.

[LG-74] Uncertainty-Aware Out-of-Distribution Detection with Gaussian Processes

链接: https://arxiv.org/abs/2412.20918
作者: Yang Chen,Chih-Li Sung,Arpan Kusari,Xiaoyang Song,Wenbo Sun
关键词: Deep neural networks, Deep neural, OOD detection, OOD, OOD detection methods
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) are often constructed under the closed-world assumption, which may fail to generalize to the out-of-distribution (OOD) data. This leads to DNNs producing overconfident wrong predictions and can result in disastrous consequences in safety-critical applications. Existing OOD detection methods mainly rely on curating a set of OOD data for model training or hyper-parameter tuning to distinguish OOD data from training data (also known as in-distribution data or InD data). However, OOD samples are not always available during the training phase in real-world applications, hindering the OOD detection accuracy. To overcome this limitation, we propose a Gaussian-process-based OOD detection method to establish a decision boundary based on InD data only. The basic idea is to perform uncertainty quantification of the unconstrained softmax scores of a DNN via a multi-class Gaussian process (GP), and then define a score function to separate InD and potential OOD data based on their fundamental differences in the posterior predictive distribution from the GP. Two case studies on conventional image classification datasets and real-world image datasets are conducted to demonstrate that the proposed method outperforms the state-of-the-art OOD detection methods when OOD samples are not observed in the training phase.

[LG-75] Machine Learning of Slow Collective Variables and Enhanced Sampling via Spatial Techniques

链接: https://arxiv.org/abs/2412.20868
作者: Tuğçe Gökdemir,Jakub Rydzewski
关键词: complex physical processes, physical processes depends, Understanding the long-time, recognize patterns, complex physical
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the long-time dynamics of complex physical processes depends on our ability to recognize patterns. To simplify the description of these processes, we often introduce a set of reaction coordinates, customarily referred to as collective variables (CVs). The quality of these CVs heavily impacts our comprehension of the dynamics, often influencing the estimates of thermodynamics and kinetics from atomistic simulations. Consequently, identifying CVs poses a fundamental challenge in chemical physics. Recently, significant progress was made by leveraging the predictive ability of unsupervised machine learning techniques to determine CVs. Many of these techniques require temporal information to learn slow CVs that correspond to the long timescale behavior of the studied process. Here, however, we specifically focus on techniques that can identify CVs corresponding to the slowest transitions between states without needing temporal trajectories as input, instead using the spatial characteristics of the data. We discuss the latest developments in this category of techniques and briefly discuss potential directions for thermodynamics-informed spatial learning of slow CVs.

[LG-76] Acquisition-Independent Deep Learning for Quantitative MRI Parameter Estimation using Neural Controlled Differential Equations

链接: https://arxiv.org/abs/2412.20844
作者: Daan Kuppens(1 and 2),Sebastiano Barbieri(3 and 4),Daisy van den Berg(1 and 5),Pepijn Schouten(1),Harriet C. Thoeny(6 and 7),Myrte Wennen(1),Oliver J. Gurney-Champion(1 and 2) ((1) Department of Radiology and Nuclear Medicine, Amsterdam University Medical Center, Amsterdam, The Netherlands, (2) Imaging and Biomarkers, Cancer Center Amsterdam, Amsterdam, The Netherlands, (3) Queensland Digital Health Centre, University of Queensland, Brisbane, Australia, (4) Centre for Big Data Research in Health, UNSW Sydney, Sydney, Australia, (5) Department of Biomedical Engineering and Physics, Amsterdam University Medical Center location University of Amsterdam, Amsterdam, The Netherlands, (6) University Teaching and Research Hospital, University of Fribourg, Fribourg, Switzerland, (7) Department of Urology, Inselspital, University of Bern, Switzerland)
关键词: QMRI parameter estimation, QMRI parameter, QMRI, Deep learning, parameter estimation
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注: 29 pages, 10 figures, 7 supplementary figures, pre-print

点击查看摘要

Abstract:Deep learning has proven to be a suitable alternative to least-squares (LSQ) fitting for parameter estimation in various quantitative MRI (QMRI) models. However, current deep learning implementations are not robust to changes in MR acquisition protocols. In practice, QMRI acquisition protocols differ substantially between different studies and clinical settings. The lack of generalizability and adoptability of current deep learning approaches for QMRI parameter estimation impedes the implementation of these algorithms in clinical trials and clinical practice. Neural Controlled Differential Equations (NCDEs) allow for the sampling of incomplete and irregularly sampled data with variable length, making them ideal for use in QMRI parameter estimation. In this study, we show that NCDEs can function as a generic tool for the accurate prediction of QMRI parameters, regardless of QMRI sequence length, configuration of independent variables and QMRI forward model (variable flip angle T1-mapping, intravoxel incoherent motion MRI, dynamic contrast-enhanced MRI). NCDEs achieved lower mean squared error than LSQ fitting in low-SNR simulations and in vivo in challenging anatomical regions like the abdomen and leg, but this improvement was no longer evident at high SNR. NCDEs reduce estimation error interquartile range without increasing bias, particularly under conditions of high uncertainty. These findings suggest that NCDEs offer a robust approach for reliable QMRI parameter estimation, especially in scenarios with high uncertainty or low image quality. We believe that with NCDEs, we have solved one of the main challenges for using deep learning for QMRI parameter estimation in a broader clinical and research setting.

[LG-77] Robust Matrix Completion for Discrete Rating-Scale Data

链接: https://arxiv.org/abs/2412.20802
作者: Aurore Archimbaud,Andreas Alfons,Ines Wilms
关键词: gained considerable interest, Matrix completion, recent years, gained considerable, considerable interest
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Matrix completion has gained considerable interest in recent years. The goal of matrix completion is to predict the unknown entries of a partially observed matrix using its known entries. Although common applications feature discrete rating-scale data, such as user-product rating matrices in recommender systems or surveys in the social and behavioral sciences, methods for matrix completion are almost always designed for and studied in the context of continuous data. Furthermore, only a small subset of the literature considers matrix completion in the presence of corrupted observations despite their common occurrence in practice. Examples include attacks on recommender systems (i.e., malicious users deliberately manipulating ratings to influence the recommender system to their advantage), or careless respondents in surveys (i.e., respondents providing answers irrespective of what the survey asks of them due to a lack of attention). We introduce a matrix completion algorithm that is tailored towards the discrete nature of rating-scale data and robust to the presence of corrupted observations. In addition, we investigate the performance of the proposed method and its competitors with discrete rating-scale (rather than continuous) data as well as under various missing data mechanisms and types of corrupted observations.

[LG-78] Enhancing Privacy in Federated Learning through Quantum Teleportation Integration

链接: https://arxiv.org/abs/2412.20762
作者: Koffka Khan
关键词: enables collaborative model, collaborative model training, learning enables collaborative, sharing raw data, enables collaborative
类目: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning enables collaborative model training across multiple clients without sharing raw data, thereby enhancing privacy. However, the exchange of model updates can still expose sensitive information. Quantum teleportation, a process that transfers quantum states between distant locations without physical transmission of the particles themselves, has recently been implemented in real-world networks. This position paper explores the potential of integrating quantum teleportation into federated learning frameworks to bolster privacy. By leveraging quantum entanglement and the no-cloning theorem, quantum teleportation ensures that data remains secure during transmission, as any eavesdropping attempt would be detectable. We propose a novel architecture where quantum teleportation facilitates the secure exchange of model parameters and gradients among clients and servers. This integration aims to mitigate risks associated with data leakage and adversarial attacks inherent in classical federated learning setups. We also discuss the practical challenges of implementing such a system, including the current limitations of quantum network infrastructure and the need for hybrid quantum-classical protocols. Our analysis suggests that, despite these challenges, the convergence of quantum communication technologies and federated learning presents a promising avenue for achieving unprecedented levels of privacy in distributed machine learning.

[LG-79] Training Deep Neural Classifiers with Soft Diamond Regularizers

链接: https://arxiv.org/abs/2412.20724
作者: Olaoluwa Adigun,Bart Kosko
关键词: maintain classification accuracy, deep neural networks, improve synaptic sparsity, maintain classification, hard-diamond Laplacian regularizer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:We introduce new \emph{soft diamond} regularizers that both improve synaptic sparsity and maintain classification accuracy in deep neural networks. These parametrized regularizers outperform the state-of-the-art hard-diamond Laplacian regularizer of Lasso regression and classification. They use thick-tailed symmetric alpha-stable ( \mathcal{S}\alpha S ) bell-curve synaptic weight priors that are not Gaussian and so have thicker tails. The geometry of the diamond-shaped constraint set varies from a circle to a star depending on the tail thickness and dispersion of the prior probability density function. Training directly with these priors is computationally intensive because almost all \mathcal{S}\alpha S probability densities lack a closed form. A precomputed look-up table removed this computational bottleneck. We tested the new soft diamond regularizers with deep neural classifiers on the three datasets CIFAR-10, CIFAR-100, and Caltech-256. The regularizers improved the accuracy of the classifiers. The improvements included 4.57% on CIFAR-10, 4.27% on CIFAR-100, and 6.69% on Caltech-256. They also outperformed L_2 regularizers on all the test cases. Soft diamond regularizers also outperformed L_1 lasso or Laplace regularizers because they better increased sparsity while improving classification accuracy. Soft-diamond priors substantially improved accuracy on CIFAR-10 when combined with dropout, batch, or data-augmentation regularization.

[LG-80] Matrix Concentration for Random Signed Graphs and Community Recovery in the Signed Stochastic Block Model

链接: https://arxiv.org/abs/2412.20620
作者: Sawyer Jack Robertson
关键词: pairs of nodes, added independently, random graph models, Laplacian matrices obtained, Laplacian matrix
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 29 pages, 10 figures

点击查看摘要

Abstract:We consider graphs where edges and their signs are added independently at random from among all pairs of nodes. We establish strong concentration inequalities for adjacency and Laplacian matrices obtained from this family of random graph models. Then, we apply our results to study graphs sampled from the signed stochastic block model. Namely, we take a two-community setting where edges within the communities have positive signs and edges between the communities have negative signs and apply a random sign perturbation with probability 0 < s < 1/2 . In this setting, our findings include: first, the spectral gap of the corresponding signed Laplacian matrix concentrates near 2s with high probability; and second, the sign of the first eigenvector of the Laplacian matrix defines a weakly consistent estimator for the balanced community detection problem, or equivalently, the \pm 1 synchronization problem. We supplement our theoretical contributions with experimental data obtained from the models under consideration.
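
A numpy sketch of the second finding: sample a two-community signed graph with sign-flip probability s, form a signed Laplacian, and recover the communities from the sign of the eigenvector of the smallest eigenvalue. The Laplacian construction below (degree matrix from absolute row sums, no normalization) is an assumption, since the abstract does not fix it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 0.1, 0.1                        # community size, edge prob., sign-flip prob.
z = np.concatenate([np.ones(n), -np.ones(n)])  # ground-truth community labels (+1 / -1)

# Edges appear independently with probability p; the "correct" sign is +1 within a
# community and -1 across, and each sign is then flipped with probability s.
upper = np.triu(rng.random((2 * n, 2 * n)) < p, 1)
signs = np.outer(z, z) * np.where(rng.random((2 * n, 2 * n)) < s, -1.0, 1.0)
A = np.where(upper, signs, 0.0)
A = A + A.T

D = np.diag(np.abs(A).sum(axis=1))             # degrees from absolute edge weights
L = D - A                                      # (unnormalized) signed Laplacian

evals, evecs = np.linalg.eigh(L)
pred = np.sign(evecs[:, 0])                    # sign of the first (smallest) eigenvector
acc = max(np.mean(pred == z), np.mean(pred == -z))   # accuracy up to a global flip
print("two smallest eigenvalues:", np.round(evals[:2], 2))
print("community recovery accuracy:", round(float(acc), 3))
```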

[LG-81] Testing and Improving the Robustness of Amortized Bayesian Inference for Cognitive Models

链接: https://arxiv.org/abs/2412.20586
作者: Yufei Wu,Stefan Radev,Francis Tuerlinckx
关键词: statistical models representing, representing cognitive processes, models representing cognitive, Drift Diffusion Models, problems when estimating
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Contaminant observations and outliers often cause problems when estimating the parameters of cognitive models, which are statistical models representing cognitive processes. In this study, we test and improve the robustness of parameter estimation using amortized Bayesian inference (ABI) with neural networks. To this end, we conduct systematic analyses on a toy example and analyze both synthetic and real data using a popular cognitive model, the Drift Diffusion Model (DDM). First, we study the sensitivity of ABI to contaminants with tools from robust statistics: the empirical influence function and the breakdown point. Next, we propose a data augmentation or noise injection approach that incorporates a contamination distribution into the data-generating process during training. We examine several candidate distributions and evaluate their performance and cost in terms of accuracy and efficiency loss relative to a standard estimator. Introducing contaminants from a Cauchy distribution during training considerably increases the robustness of the neural density estimator as measured by bounded influence functions and a much higher breakdown point. Overall, the proposed method is straightforward and practical to implement and has broad applicability in fields where outlier detection or removal is challenging.
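
A sketch of the noise-injection idea: when simulating training data for the neural density estimator, replace a small fraction of observations with draws from a heavy-tailed Cauchy contamination distribution. A Gaussian toy simulator stands in for the DDM, and the contamination probability and scale are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_clean(theta, n_obs):
    """Stand-in simulator: Gaussian observations centered at theta.
    (An actual DDM simulator would be used in the paper's setting.)"""
    return theta + rng.normal(size=n_obs)

def simulate_contaminated(theta, n_obs, contam_prob=0.05, scale=5.0):
    """Inject heavy-tailed Cauchy contaminants into a small random subset."""
    x = simulate_clean(theta, n_obs)
    outliers = rng.random(n_obs) < contam_prob
    x[outliers] = theta + scale * rng.standard_cauchy(size=int(outliers.sum()))
    return x

# Training pairs (theta, data) for the amortized neural density estimator.
thetas = rng.uniform(-3, 3, size=1000)
dataset = [(t, simulate_contaminated(t, n_obs=100)) for t in thetas]
print(len(dataset), dataset[0][1].shape)   # 1000 (100,)
```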

[LG-82] Distributionally Robust Optimization via Iterative Algorithms in Continuous Probability Spaces

链接: https://arxiv.org/abs/2412.20556
作者: Linglingzhi Zhu,Yao Xie
关键词: significant computational challenges, computational challenges due, distributionally robust optimization, worst-case distribution, minimax problem motivated
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We consider a minimax problem motivated by distributionally robust optimization (DRO) when the worst-case distribution is continuous, leading to significant computational challenges due to the infinite-dimensional nature of the optimization problem. Recent research has explored learning the worst-case distribution using neural network-based generative models to address these computational challenges but lacks algorithmic convergence guarantees. This paper bridges this theoretical gap by presenting an iterative algorithm to solve such a minimax problem, achieving global convergence under mild assumptions and leveraging technical tools from vector space minimax optimization and convex analysis in the space of continuous probability densities. In particular, leveraging Brenier’s theorem, we represent the worst-case distribution as a transport map applied to a continuous reference measure and reformulate the regularized discrepancy-based DRO as a minimax problem in the Wasserstein space. Furthermore, we demonstrate that the worst-case distribution can be efficiently computed using a modified Jordan-Kinderlehrer-Otto (JKO) scheme with sufficiently large regularization parameters for commonly used discrepancy functions, linked to the radius of the ambiguity set. Additionally, we derive the global convergence rate and quantify the total number of subgradient and inexact modified JKO iterations required to obtain approximate stationary points. These results are potentially applicable to nonconvex and nonsmooth scenarios, with broad relevance to modern machine learning applications.

[LG-83] Random Matrix Theory for Stochastic Gradient Descent

链接: https://arxiv.org/abs/2412.20496
作者: Chanju Park,Matteo Favoni,Biagio Lucini,Gert Aarts
关键词: machine learning algorithms, paramount importance, importance for understanding, weight matrix dynamics, Dyson Brownian motion
类目: High Energy Physics - Lattice (hep-lat); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 13 pages, 9 figures, Proceedings of the 41st International Symposium on Lattice Field Theory (Lattice 2024), July 28th - August 3rd, 2024, University of Liverpool, UK

点击查看摘要

Abstract:Investigating the dynamics of learning in machine learning algorithms is of paramount importance for understanding how and why an approach may be successful. The tools of physics and statistics provide a robust setting for such investigations. Here we apply concepts from random matrix theory to describe stochastic weight matrix dynamics, using the framework of Dyson Brownian motion. We derive the linear scaling rule between the learning rate (step size) and the batch size, and identify universal and non-universal aspects of weight matrix dynamics. We test our findings in the (near-)solvable case of the Gaussian Restricted Boltzmann Machine and in a linear one-hidden-layer neural network.
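
摘要中提到的学习率与批大小之间的线性缩放规则,文献中常见的表述如下(仅作参考,具体常数与适用条件以论文推导为准):

```latex
\frac{\eta}{B} \approx \text{const.} \quad \Longleftrightarrow \quad \eta \propto B
```

即在保持 $\eta/B$ 近似不变的情况下同时调整学习率 $\eta$ 与批大小 $B$,权重矩阵随机动力学(此处用 Dyson 布朗运动刻画)的涨落强度近似不变。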

[LG-84] A Particle Algorithm for Mean-Field Variational Inference

链接: https://arxiv.org/abs/2412.20385
作者: Qiang Du,Kaizheng Wang,Edith Zhang,Chenyang Zhong
关键词: Markov chain Monte, chain Monte Carlo, Monte Carlo, alternative to Markov, Markov chain
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 22 pages

点击查看摘要

Abstract:Variational inference is a fast and scalable alternative to Markov chain Monte Carlo and has been widely applied to posterior inference tasks in statistics and machine learning. A traditional approach for implementing mean-field variational inference (MFVI) is coordinate ascent variational inference (CAVI), which relies crucially on parametric assumptions on complete conditionals. In this paper, we introduce a novel particle-based algorithm for mean-field variational inference, which we term PArticle VI (PAVI). Notably, our algorithm does not rely on parametric assumptions on complete conditionals, and it applies to the nonparametric setting. We provide non-asymptotic finite-particle convergence guarantee for our algorithm. To our knowledge, this is the first end-to-end guarantee for particle-based MFVI.

[LG-85] Confidence Interval Construction and Conditional Variance Estimation with Dense ReLU Networks

链接: https://arxiv.org/abs/2412.20355
作者: Carlos Misael Madrid Padilla,Oscar Hernan Madrid Padilla,Yik Lun Kei,Zhi Zhang,Yanzhen Chen
关键词: Rectified Linear Unit, Linear Unit, Rectified Linear, conditional variance estimation, variance estimation
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses the problems of conditional variance estimation and confidence interval construction in nonparametric regression using dense networks with the Rectified Linear Unit (ReLU) activation function. We present a residual-based framework for conditional variance estimation, deriving nonasymptotic bounds for variance estimation under both heteroscedastic and homoscedastic settings. We relax the sub-Gaussian noise assumption, allowing the proposed bounds to accommodate sub-Exponential noise and beyond. Building on this, for a ReLU neural network estimator, we derive non-asymptotic bounds for both its conditional mean and variance estimation, representing the first result for variance estimation using ReLU networks. Furthermore, we develop a ReLU network based robust bootstrap procedure (Efron, 1992) for constructing confidence intervals for the true mean that comes with a theoretical guarantee on the coverage, providing a significant advancement in uncertainty quantification and the construction of reliable confidence intervals in deep learning settings.
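
下面是摘要中“基于残差的条件方差估计”思路的一个最小示意(仅为草图,与论文的网络结构和理论假设无直接对应;数据与超参数均为假设):先用 ReLU 网络拟合条件均值,再用第二个 ReLU 网络对平方残差做回归以估计条件方差。

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Heteroscedastic toy data: the noise level grows with |x|
n = 2000
X = rng.uniform(-2, 2, size=(n, 1))
sigma = 0.2 + 0.5 * np.abs(X[:, 0])
y = np.sin(2 * X[:, 0]) + sigma * rng.standard_normal(n)

# Step 1: estimate the conditional mean with a dense ReLU network
mean_net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                        max_iter=2000, random_state=0)
mean_net.fit(X, y)
residuals_sq = (y - mean_net.predict(X)) ** 2

# Step 2: regress the squared residuals on X to estimate the conditional variance
var_net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                       max_iter=2000, random_state=0)
var_net.fit(X, residuals_sq)

x_grid = np.linspace(-2, 2, 5).reshape(-1, 1)
print(np.c_[x_grid,
            np.clip(var_net.predict(x_grid), 0, None),
            (0.2 + 0.5 * np.abs(x_grid[:, 0])) ** 2])  # estimated vs. true variance
```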

[LG-86] Zeroth-Order Methods for Nonconvex Stochastic Problems with Decision-Dependent Distributions AAAI-25 AAAI

链接: https://arxiv.org/abs/2412.20330
作者: Yuya Hikima,Akiko Takeda
关键词: recently attracted attention, attracted attention due, recently attracted, attracted attention, attention due
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: This paper has been accepted by The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

点击查看摘要

Abstract:In this study, we consider an optimization problem with uncertainty dependent on decision variables, which has recently attracted attention due to its importance in machine learning and pricing applications. In this problem, the gradient of the objective function cannot be obtained explicitly because the decision-dependent distribution is unknown. Therefore, several zeroth-order methods have been proposed, which obtain noisy objective values by sampling and update the iterates. Although these existing methods have theoretical convergence for optimization problems with decision-dependent uncertainty, they require strong assumptions about the function and distribution or exhibit large variances in their gradient estimators. To overcome these issues, we propose two zeroth-order methods under mild assumptions. First, we develop a zeroth-order method with a new one-point gradient estimator including a variance reduction parameter. The proposed method updates the decision variables while adjusting the variance reduction parameter. Second, we develop a zeroth-order method with a two-point gradient estimator. There are situations where only one-point estimators can be used, but if both one-point and two-point estimators are available, it is more practical to use the two-point estimator. As theoretical results, we show the convergence of our methods to stationary points and provide the worst-case iteration and sample complexity analysis. Our simulation experiments with real data on a retail service application show that our methods output solutions with lower objective values than the conventional zeroth-order methods.
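
摘要中对比的单点与双点零阶梯度估计量,其通用形式可用下面的草图说明(这是零阶优化中的标准写法,论文提出的带方差缩减参数的单点估计量请以原文为准):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_point_grad(f, x, mu=1e-2):
    """Classic one-point zeroth-order estimator: (d/mu) * f(x + mu*u) * u."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)           # random direction on the unit sphere
    return (d / mu) * f(x + mu * u) * u

def two_point_grad(f, x, mu=1e-2):
    """Two-point estimator: (d/(2*mu)) * (f(x+mu*u) - f(x-mu*u)) * u (lower variance)."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return (d / (2 * mu)) * (f(x + mu * u) - f(x - mu * u)) * u

# Noisy quadratic objective standing in for "sampling from a decision-dependent distribution"
f = lambda x: np.sum(x ** 2) + 0.01 * rng.standard_normal()

x = np.ones(5)
for _ in range(500):
    x -= 0.01 * two_point_grad(f, x)
print(np.round(x, 3))  # should move toward the origin, up to estimator noise
```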

[LG-87] Predicting Customer Lifetime Value Using Recurrent Neural Net KDD2022

链接: https://arxiv.org/abs/2412.20295
作者: Huigang Chen,Edwin Ng,Gavin Steininger,Slawek Smyl
关键词: neural network approach, recurrent neural network, paper introduces, neural network, recurrent neural
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 4 pages, 4 figures, first presented in Customer Journey Optimization Workshop in KDD 2022

点击查看摘要

Abstract:This paper introduces a recurrent neural network approach for predicting user lifetime value in Software as a Service (SaaS) applications. The approach accounts for three connected time dimensions. These dimensions are the user cohort (the date the user joined), user age-in-system (the time since the user joined the service) and the calendar date the user is at an age-in-system (i.e., contemporaneous information). The recurrent neural networks use a multi-cell architecture, where each cell resembles a long short-term memory neural network. The approach is applied to predicting both acquisition (new users) and rolling (existing user) lifetime values for a variety of time horizons. It is found to significantly improve median absolute percent error versus light gradient boost models and Buy Until You Die models.

[LG-88] Deep Generalized Schrödinger Bridges: From Image Generation to Solving Mean-Field Games

链接: https://arxiv.org/abs/2412.20279
作者: Guan-Horng Liu,Tianrong Chen,Evangelos A. Theodorou
关键词: Generalized Schrödinger Bridges, Generalized Schrödinger, Schrödinger Bridges, particle evolution based, action including kinetic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Generalized Schrödinger Bridges (GSBs) are a fundamental mathematical framework used to analyze the most likely particle evolution based on the principle of least action including kinetic and potential energy. In parallel to their well-established presence in the theoretical realms of quantum mechanics and optimal transport, this paper focuses on an algorithmic perspective, aiming to enhance practical usage. Our motivated observation is that transportation problems with the optimality structures delineated by GSBs are pervasive across various scientific domains, such as generative modeling in machine learning, mean-field games in stochastic control, and more. Exploring the intrinsic connection between the mathematical modeling of GSBs and the modern algorithmic characterization therefore presents a crucial, yet untapped, avenue. In this paper, we reinterpret GSBs as probabilistic models and demonstrate that, with a delicate mathematical tool known as the nonlinear Feynman-Kac lemma, rich algorithmic concepts, such as likelihoods, variational gaps, and temporal differences, emerge naturally from the optimality structures of GSBs. The resulting computational framework, driven by deep learning and neural networks, operates in a fully continuous state space (i.e., mesh-free) and satisfies distribution constraints, setting it apart from prior numerical solvers relying on spatial discretization or constraint relaxation. We demonstrate the efficacy of our method in generative modeling and mean-field games, highlighting its transformative applications at the intersection of mathematical modeling, stochastic process, control, and machine learning.

[LG-89] Machine and Deep Learning for Credit Scoring: A compliant approach

链接: https://arxiv.org/abs/2412.20225
作者: Abdollah Rida
关键词: Credit Scoring models, Credit Scoring, Federal Reserve Bank, European Central Bank, daily basis
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Credit Scoring is one of the problems banks and financial institutions have to solve on a daily basis. While the state-of-the-art research in Machine and Deep Learning for finance has reached interesting results about Credit Scoring models, the usage of such models in a heavily regulated context such as banking has so far not been attempted. Our work is thus an attempt to challenge the current regulatory status quo and introduce new BASEL 2 and 3 compliant techniques, while still answering the Federal Reserve Bank and the European Central Bank requirements. With the help of Gradient Boosting Machines (mainly XGBoost) we challenge an actual model used by BANK A for scoring through-the-door Auto Loan applicants. We prove that the usage of such algorithms for Credit Scoring models drastically improves performance and default capture rate. Furthermore, we leverage the power of Shapley Values to prove that these relatively simple models are not as black-box as the current regulatory system thinks they are, and we attempt to explain the model outputs and Credit Scores within the BANK A Model Design and Validation framework.
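
下面给出一个极简的示意流程(使用合成数据,并非论文中 BANK A 的数据或模型),演示“XGBoost 建模 + Shapley 值解释”的组合;依赖 xgboost 与 shap 两个第三方库,超参数均为示意性取值。

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for an auto-loan application dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Shapley values make the gradient-boosted scorecard auditable feature by feature
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
print(shap_values.shape)  # (n_samples, n_features): per-applicant contribution of each feature
```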

[LG-90] Learning physical unknowns from hydrodynamic shock and material interface features in ICF capsule implosions

链接: https://arxiv.org/abs/2412.20192
作者: Daniel A. Serino,Evan Bell,Marc Klasky,Ben S. Southworth,Balasubramanya Nadiga,Trevor Wilcox,Oleg Korobkin
关键词: inertial confinement fusion, characterizing material properties, high energy density, equation of state, confinement fusion
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注:

点击查看摘要

Abstract:In high energy density physics (HEDP) and inertial confinement fusion (ICF), predictive modeling is complicated by uncertainty in parameters that characterize various aspects of the modeled system, such as those characterizing material properties, equation of state (EOS), opacities, and initial conditions. Typically, however, these parameters are not directly observable. What is observed instead is a time sequence of radiographic projections using X-rays. In this work, we define a set of sparse hydrodynamic features derived from the outgoing shock profile and outer material edge, which can be obtained from radiographic measurements, to directly infer such parameters. Our machine learning (ML)-based methodology involves a pipeline of two architectures, a radiograph-to-features network (R2FNet) and a features-to-parameters network (F2PNet), that are trained independently and later combined to approximate a posterior distribution for the parameters from radiographs. We show that the estimated parameters can be used in a hydrodynamics code to obtain density fields and hydrodynamic shock and outer edge features that are consistent with the data. Finally, we demonstrate that features resulting from an unknown EOS model can be successfully mapped onto parameters of a chosen analytical EOS model, implying that network predictions are learning physics, with a degree of invariance to the underlying choice of EOS model.

[LG-91] Debiased Nonparametric Regression for Statistical Inference and Distributionally Robustness

链接: https://arxiv.org/abs/2412.20173
作者: Masahiro Kato
关键词: study proposes, pointwise asymptotic normality, uniform convergence, debiasing method, Abstract
类目: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This study proposes a debiasing method for smooth nonparametric estimators. While machine learning techniques such as random forests and neural networks have demonstrated strong predictive performance, their theoretical properties remain relatively underexplored. Specifically, many modern algorithms lack assurances of pointwise asymptotic normality and uniform convergence, which are critical for statistical inference and robustness under covariate shift and have been well-established for classical methods like Nadaraya-Watson regression. To address this, we introduce a model-free debiasing method that guarantees these properties for smooth estimators derived from any nonparametric regression approach. By adding a correction term that estimates the conditional expected residual of the original estimator, or equivalently, its estimation error, we obtain a debiased estimator with proven pointwise asymptotic normality, uniform convergence, and Gaussian process approximation. These properties enable statistical inference and enhance robustness to covariate shift, making the method broadly applicable to a wide range of nonparametric regression problems.
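
摘要中的去偏思路可用如下草图说明(仅为示意,基估计器与平滑修正估计器的选择、以及是否需要样本分裂等理论细节请以论文为准):先用任意非参数回归器得到初始估计,再用平滑估计器拟合其条件期望残差,并将该修正项加回。

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

n = 3000
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)

# Initial (possibly biased) nonparametric estimator
base = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
base.fit(X, y)

# Correction term: a smooth estimate of the conditional expected residual E[y - f_hat(X) | X]
# (a real application would typically use sample splitting; omitted here for brevity)
residuals = y - base.predict(X)
correction = KNeighborsRegressor(n_neighbors=100, weights="distance")
correction.fit(X, residuals)

x_grid = np.linspace(-3, 3, 7).reshape(-1, 1)
debiased = base.predict(x_grid) + correction.predict(x_grid)
print(np.c_[x_grid, base.predict(x_grid), debiased, np.sin(x_grid[:, 0])])
```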

[LG-92] Gradient Descent Methods for Regularized Optimization

链接: https://arxiv.org/abs/2412.20115
作者: Filip Nikolovski,Irena Stojkovska,Katerina Hadzi-Velkova Saneva,Zoran Hadzi-Velkov
关键词: widely recognized technique, widely recognized, Lipschitz constant, proximal, step size
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 1 table; To be published in journal: “MANU Contributions, Section of Natural, Mathematical and Biotechnical Sciences”

点击查看摘要

Abstract:Regularization is a widely recognized technique in mathematical optimization. It can be used to smooth out objective functions, refine the feasible solution set, or prevent overfitting in machine learning models. Due to its simplicity and robustness, the gradient descent (GD) method is one of the primary methods used for numerical optimization of differentiable objective functions. However, GD is not well-suited for solving \ell^1 regularized optimization problems since these problems are non-differentiable at zero, causing iteration updates to oscillate or fail to converge. Instead, a more effective version of GD, called proximal gradient descent, employs a technique known as soft-thresholding to shrink the iteration updates toward zero, thus enabling sparsity in the solution. Motivated by the widespread applications of proximal GD in sparse and low-rank recovery across various engineering disciplines, we provide an overview of the GD and proximal GD methods for solving regularized optimization problems. Furthermore, this paper proposes a novel algorithm for the proximal GD method that incorporates a variable step size. Unlike conventional proximal GD, which uses a fixed step size based on the global Lipschitz constant, our method estimates the Lipschitz constant locally at each iteration and uses its reciprocal as the step size. This eliminates the need for a global Lipschitz constant, which can be impractical to compute. Numerical experiments performed on synthetic and real data sets show a notable performance improvement of the proposed method compared to the conventional proximal GD with constant step size, both in terms of the number of iterations and the time requirements.
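
下面用 Lasso($\ell^1$ 正则化最小二乘)的小例子示意“近端梯度下降 + 软阈值 + 回溯式地在每次迭代局部估计 Lipschitz 常数作为步长依据”的做法;这只是常见的 ISTA/回溯写法的思路草图,并非论文算法的原始实现,数据与参数均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)

# L1-regularized least squares: min 0.5*||Ax - b||^2 + lam*||x||_1
m, d = 100, 50
A = rng.standard_normal((m, d))
x_true = np.zeros(d)
x_true[:5] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(m)
lam = 0.1

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
gf = lambda x: A.T @ (A @ x - b)
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # soft-thresholding

x, L = np.zeros(d), 1.0
for _ in range(200):
    g = gf(x)
    # Backtracking: find a local Lipschitz estimate L instead of using a global constant
    while True:
        z = soft(x - g / L, lam / L)          # proximal (ISTA) step with step size 1/L
        if f(z) <= f(x) + g @ (z - x) + 0.5 * L * np.sum((z - x) ** 2):
            break
        L *= 2.0
    x, L = z, max(L / 2.0, 1e-3)              # let the step size grow again between iterations
print(np.round(x[:8], 3), "nonzeros:", int(np.sum(np.abs(x) > 1e-6)))
```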

[LG-93] Global Search of Optimal Spacecraft Trajectories using Amortization and Deep Generative Models

链接: https://arxiv.org/abs/2412.20023
作者: Ryne Beeson,Anjian Li,Amlan Sinha
关键词: Preliminary spacecraft trajectory, global search problem, spacecraft trajectory optimization, dependent global search, trajectory optimization problem
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 47 pages, 23 figures, initial content of this paper appears in Paper 23-352 at the AAS/AIAA Astrodynamics Specialist Conference, Big Sky, MT, August 13-17 2023

点击查看摘要

Abstract:Preliminary spacecraft trajectory optimization is a parameter dependent global search problem that aims to provide a set of solutions that are of high quality and diverse. In the case of numerical solution, it is dependent on the original optimal control problem, the choice of a control transcription, and the behavior of a gradient based numerical solver. In this paper we formulate the parameterized global search problem as the task of sampling a conditional probability distribution with support on the neighborhoods of local basins of attraction to the high quality solutions. The conditional distribution is learned and represented using deep generative models that allow for prediction of how the local basins change as parameters vary. The approach is benchmarked on a low thrust spacecraft trajectory optimization problem in the circular restricted three-body problem, showing significant speed-up over a simple multi-start method and vanilla machine learning approaches. The paper also provides an in-depth analysis of the multi-modal funnel structure of a low-thrust spacecraft trajectory optimization problem.

[LG-94] Surrogate Modeling for Explainable Predictive Time Series Corrections

链接: https://arxiv.org/abs/2412.19897
作者: Alfredo Lopez,Florian Sobieczky
关键词: local surrogate approach, explainable time-series forecasting, introduce a local, local surrogate, surrogate approach
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a local surrogate approach for explainable time-series forecasting. An initially non-interpretable predictive model is used to improve the forecast of a classical time-series ‘base model’. ‘Explainability’ of the correction is provided by fitting the base model again to the data from which the error prediction is removed (subtracted), yielding a difference in the model parameters which can be interpreted. We provide illustrative examples to demonstrate the potential of the method to discover and explain underlying patterns in the data.
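
摘要描述的流程可用如下草图表达(仅为示意,基模型取 AR(1)、修正模型取梯度提升均为假设):先拟合经典基模型,用不可解释模型预测基模型误差,再把预测误差从数据中减去后重新拟合基模型,通过前后参数之差来解释修正。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy series: AR(1) plus a nonlinear component that the base model cannot capture
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + 0.5 * np.sin(0.1 * t) + 0.1 * rng.standard_normal()

def fit_ar1(series):
    """Least-squares AR(1) base model: y_t = c + phi * y_{t-1}."""
    X = np.c_[np.ones(len(series) - 1), series[:-1]]
    coef, *_ = np.linalg.lstsq(X, series[1:], rcond=None)
    return coef  # (c, phi)

c0, phi0 = fit_ar1(y)
base_pred = c0 + phi0 * y[:-1]
base_error = y[1:] - base_pred

# Non-interpretable correction model predicts the base model's error
features = np.c_[y[:-1], np.arange(1, n)]
corrector = GradientBoostingRegressor(random_state=0).fit(features, base_error)

# Refit the base model on data from which the predicted error is removed (subtracted)
y_adj = y.copy()
y_adj[1:] = y[1:] - corrector.predict(features)
c1, phi1 = fit_ar1(y_adj)
print("base params:", (round(c0, 3), round(phi0, 3)),
      "-> refit params:", (round(c1, 3), round(phi1, 3)))  # interpret the difference
```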

[LG-95] A Neural Network-Based Search for Unmodeled Transients in LIGO-Virgo-KAGRA's Third Observing Run

链接: https://arxiv.org/abs/2412.19883
作者: Ryan Raikman,Eric A. Moreno,Katya Govorkova,Siddharth Soni,Ethan Marx,William Benoit,Alec Gunny,Deep Chatterjee,Christina Reissel,Malina M. Desai,Rafia Omer,Muhammed Saleem,Philip Harris,Erik Katsavounidis,Michael W. Coughlin,Dylan Rankin
关键词: Neural Network, run of LIGO, short-duration gravitational-wave transients, Wave Anomalous Knowledge, Gravitational Wave Anomalous
类目: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents the results of a Neural Network (NN)-based search for short-duration gravitational-wave transients in data from the third observing run of LIGO, Virgo, and KAGRA. The search targets unmodeled transients with durations of milliseconds to a few seconds in the 30-1500 Hz frequency band, without assumptions about the incoming signal direction, polarization, or morphology. Using the Gravitational Wave Anomalous Knowledge (GWAK) method, three compact binary coalescences (CBCs) identified by existing pipelines are successfully detected, along with a range of detector glitches. The algorithm constructs a low-dimensional embedded space to capture the physical features of signals, enabling the detection of CBCs, detector glitches, and unmodeled transients. This study demonstrates GWAK’s ability to enhance gravitational-wave searches beyond the limits of existing pipelines, laying the groundwork for future detection strategies.

[LG-96] Multi-Agent Q-Learning for Real-Time Load Balancing User Association and Handover in Mobile Networks

链接: https://arxiv.org/abs/2412.19835
作者: Alireza Alizadeh,Byungju Lim,Mai Vu
关键词: high network performance, optimal base stations, generation cellular networks, time while ensuring, overloaded becomes critical
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:As next generation cellular networks become denser, associating users with the optimal base stations at each time while ensuring no base station is overloaded becomes critical for achieving stable and high network performance. We propose multi-agent online Q-learning (QL) algorithms for performing real-time load balancing user association and handover in dense cellular networks. The load balancing constraints at all base stations couple the actions of user agents, and we propose two multi-agent action selection policies, one centralized and one distributed, to satisfy load balancing at every learning step. In the centralized policy, the actions of UEs are determined by a central load balancer (CLB) running an algorithm based on swapping the worst connection to maximize the total learning reward. In the distributed policy, each UE takes an action based on its local information by participating in a distributed matching game with the BSs to maximize the local reward. We then integrate these action selection policies into an online QL algorithm that adapts in real-time to network dynamics including channel variations and user mobility, using a reward function that considers a handover cost to reduce handover frequency. The proposed multi-agent QL algorithm features low-complexity and fast convergence, outperforming 3GPP max-SINR association. Both policies adapt well to network dynamics at various UE speed profiles from walking, running, to biking and suburban driving, illustrating their robustness and real-time adaptability.
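
作为示意,下面给出一个单用户、表格型 Q-learning 的极简草图,仅体现摘要中“在奖励中加入切换(handover)代价”的思路;论文中的多智能体动作选择策略(集中式 CLB 交换、分布式匹配博弈)与负载均衡约束在此省略,所有数值均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)
n_bs, n_steps = 4, 5000
alpha, gamma, eps, handover_cost = 0.1, 0.9, 0.1, 0.5

# Q[s, a]: state = currently serving BS, action = BS to associate with next
Q = np.zeros((n_bs, n_bs))
serving = 0

def sinr_db(bs, t):
    """Hypothetical time-varying SINR per BS (stand-in for channel variation / mobility)."""
    return 10 + 5 * np.sin(0.01 * t + bs) + rng.standard_normal()

for t in range(n_steps):
    a = rng.integers(n_bs) if rng.random() < eps else int(np.argmax(Q[serving]))
    rate = np.log2(1 + 10 ** (sinr_db(a, t) / 10))       # throughput-style reward
    reward = rate - handover_cost * (a != serving)        # penalize handovers
    next_state = a
    Q[serving, a] += alpha * (reward + gamma * Q[next_state].max() - Q[serving, a])
    serving = next_state

print(np.round(Q, 2))
```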

[LG-97] Enhancing Drug-Target Interaction Prediction through Transfer Learning from Activity Cliff Prediction Tasks

链接: https://arxiv.org/abs/2412.19815
作者: Regina Ibragimova,Dimitrios Iliadis,Willem Waegeman
关键词: DTI, early stages, prediction, DTI prediction, Recently
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, machine learning (ML) has gained popularity in the early stages of drug discovery. This trend is unsurprising given the increasing volume of relevant experimental data and the continuous improvement of ML algorithms. However, conventional models, which rely on the principle of molecular similarity, often fail to capture the complexities of chemical interactions, particularly those involving activity cliffs (ACs) - compounds that are structurally similar but exhibit evidently different activity behaviors. In this work, we address two distinct yet related tasks: (1) activity cliff (AC) prediction and (2) drug-target interaction (DTI) prediction. Leveraging insights gained from the AC prediction task, we aim to improve the performance of DTI prediction through transfer learning. A universal model was developed for AC prediction, capable of identifying activity cliffs across diverse targets. Insights from this model were then incorporated into DTI prediction, enabling better handling of challenging cases involving ACs while maintaining similar overall performance. This approach establishes a strong foundation for integrating AC awareness into predictive models for drug discovery. Scientific Contribution: This study presents a novel approach that applies transfer learning from AC prediction to enhance DTI prediction, addressing limitations of traditional similarity-based models. By introducing AC-awareness, we improve DTI model performance in structurally complex regions, demonstrating the benefits of integrating compound-specific and protein-contextual information. Unlike previous studies, which treat AC and DTI predictions as separate problems, this work establishes a unified framework to address both data scarcity and prediction challenges in drug discovery.

[LG-98] Pharmacophore-constrained de novo drug design with diffusion bridge

链接: https://arxiv.org/abs/2412.19812
作者: Conghao Wang,Yuguang Mu,Jagath C. Rajapakse
关键词: drug discovery process, treat desired biological, desired biological targets, bioactive drug molecules, discovery process
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, 4 tables

点击查看摘要

Abstract:De novo design of bioactive drug molecules with potential to treat desired biological targets is a profound task in the drug discovery process. Existing approaches tend to leverage the pocket structure of the target protein to condition the molecule generation. However, even the pocket area of the target protein may contain redundant information, since not all atoms in the pocket are responsible for the interaction with the ligand. In this work, we propose PP2Drug - a pharmacophore-constrained de novo design approach to generate drug candidates with desired bioactivity. Our method adapts the diffusion bridge to effectively convert pharmacophore designs in the spatial space into molecular structures in the manner of equivariant transformation, which provides sophisticated control over the optimal biochemical feature arrangement on the generated molecules. PP2Drug is demonstrated to generate hit candidates that exhibit high binding affinity with potential protein targets.

信息检索

[IR-0] Unsupervised dense retrieval with conterfactual contrastive learning

链接: https://arxiv.org/abs/2412.20756
作者: Haitian Chen,Qingyao Ai,Xiao Wang,Yiqun Liu,Fen Lin,Qin Liu
关键词: Efficiently retrieving, Information Retrieval, dense retrieval models, large document corpus, document corpus remains
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Efficiently retrieving a concise set of candidates from a large document corpus remains a pivotal challenge in Information Retrieval (IR). Neural retrieval models, particularly dense retrieval models built with transformers and pretrained language models, have been popular due to their superior performance. However, criticisms have also been raised on their lack of explainability and vulnerability to adversarial attacks. In response to these challenges, we propose to improve the robustness of dense retrieval models by enhancing their sensitivity to fine-grained relevance signals. A model achieving sensitivity in this context should exhibit high variances when documents’ key passages determining their relevance to queries have been modified, while maintaining low variances for other changes in irrelevant passages. This sensitivity allows a dense retrieval model to produce robust results with respect to attacks that try to promote documents without actually increasing their relevance. It also makes it possible to analyze which part of a document is actually relevant to a query, and thus improve the explainability of the retrieval model. Motivated by causality and counterfactual analysis, we propose a series of counterfactual regularization methods based on game theory and unsupervised learning with counterfactual passages. Experiments show that our method can extract key passages without reliance on passage-level relevance annotations. Moreover, the regularized dense retrieval models exhibit heightened robustness against adversarial attacks, surpassing the state-of-the-art anti-attack methods.

[IR-1] AmalREC: A Dataset for Relation Extraction and Classification Leveraging Amalgamation of Large Language Models

链接: https://arxiv.org/abs/2412.20427
作者: Mansi,Pranshu Pandya,Mahek Bhavesh Vora,Soumya Bharadwaj,Ashish Anand
关键词: Large Language Models, Existing datasets, domain-specific biases, Language Models, extraction often exhibit
类目: Information Retrieval (cs.IR)
*备注: 18 pages, 5 Figures

点击查看摘要

Abstract:Existing datasets for relation classification and extraction often exhibit limitations such as restricted relation types and domain-specific biases. This work presents a generic framework to generate well-structured sentences from given tuples with the help of Large Language Models (LLMs). This study has focused on the following major questions: (i) how to generate sentences from relation tuples, (ii) how to compare and rank them, (iii) can we combine strengths of individual methods and amalgamate them to generate an even better quality of sentences, and (iv) how to evaluate the final dataset? For the first question, we employ a multifaceted 5-stage pipeline approach, leveraging LLMs in conjunction with template-guided generation. We introduce the Sentence Evaluation Index (SEI) that prioritizes factors like grammatical correctness, fluency, human-aligned sentiment, accuracy, and complexity to answer the first part of the second question. To answer the second part of the second question, this work introduces a SEI-Ranker module that leverages SEI to select top candidate generations. The top sentences are then strategically amalgamated to produce the final, high-quality sentence. Finally, we evaluate our dataset on LLM-based and SOTA baselines for relation classification. The proposed dataset features 255 relation types, with 15K sentences in the test set and around 150K in the train set, significantly enhancing relational diversity and complexity. This work not only presents a new comprehensive benchmark dataset for the RE/RC task, but also compares different LLMs for the generation of quality sentences from relational tuples.
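
摘要中的 Sentence Evaluation Index (SEI) 本质上是对语法、流畅度、情感、准确性、复杂度等维度的加权综合,下面给出一个示意性的打分与排序草图;其中各维度得分与权重均为假设,并非论文的实际定义。

```python
# Hypothetical per-sentence metric scores in [0, 1] for candidates generated from one tuple
candidates = {
    "sent_a": {"grammar": 0.95, "fluency": 0.90, "sentiment": 0.80, "accuracy": 0.99, "complexity": 0.60},
    "sent_b": {"grammar": 0.85, "fluency": 0.95, "sentiment": 0.85, "accuracy": 0.90, "complexity": 0.75},
    "sent_c": {"grammar": 0.99, "fluency": 0.70, "sentiment": 0.60, "accuracy": 0.95, "complexity": 0.40},
}
weights = {"grammar": 0.25, "fluency": 0.2, "sentiment": 0.15, "accuracy": 0.3, "complexity": 0.1}

def sei(scores: dict) -> float:
    """Weighted combination of quality factors (illustrative weights, not the paper's)."""
    return sum(weights[k] * scores[k] for k in weights)

ranked = sorted(candidates, key=lambda s: sei(candidates[s]), reverse=True)
print([(s, round(sei(candidates[s]), 3)) for s in ranked])  # SEI-Ranker-style top-k selection
```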

[IR-2] Introducing Semantic Capability in LinkedIns Content Search Engine

链接: https://arxiv.org/abs/2412.20366
作者: Xin Yang,Rachel Zheng,Madhumitha Mohan,Sonali Bhadra,Lingyu Zhang,Rupesh Gupta
关键词: short and simple, search engine, search queries issued, search, engine
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In the past, most search queries issued to a search engine were short and simple. A keyword based search engine was able to answer such queries quite well. However, members are now developing the habit of issuing long and complex natural language queries. Answering such queries requires evolution of a search engine to have semantic capability. In this paper we present the design of LinkedIn’s new content search engine with semantic capability, and its impact on metrics.

[IR-3] Left-handed representation in top 100 male professional tennis players: Multi-disciplinary perspectives ACML2016

链接: https://arxiv.org/abs/2412.20360
作者: Boris Bačić,Ali Ghazala
关键词: commonly held opinion, tennis players, held opinion, percentage of left-handers, left-handed tennis players
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: The original work citation (in APA): Bačić, B., Ghazala, A. (2016). Left-handed representation in top 100 male professional tennis players: Multi-disciplinary perspectives. Symposium conducted at the meeting of the First New Zealand Text Mining Workshop (TMNZ 2016) in conjunction with the 8th Asian Conference on Machine Learning (ACML 2016), Hamilton, New Zealand

点击查看摘要

Abstract:A commonly held opinion is that left-handed tennis players are overrepresented compared to the percentage of left-handers within the general population. This study provides the domain insights supported by data analysis that could help inform the decision of parents and coaches considering whether a child should start playing tennis as left- or right-handed when there is no strong arm-handed dominance. Compared to the commonly cited figure of about 10% of left-handed male population, data analysis from the official ATP web site for the top 100 ranked tennis players over the past decades (1985-2016) shows evidence of overrepresentation of left-handed elite tennis players (about 15%). The insights and data analysis can inform the handedness decision, advance coaching and strategic game concepts, enhance media coverage/analytics, left-handed facts and statistics, and inform tennis equipment manufacturing.

[IR-4] A Contrastive Pretrain Model with Prompt Tuning for Multi-center Medication Recommendation

链接: https://arxiv.org/abs/2412.20040
作者: Qidong Liu,Zhaopeng Qiu,Xiangyu Zhao,Xian Wu,Zijian Zhang,Tong Xu,Feng Tian
关键词: critical health-related applications, research interest recently, Medication recommendation, multi-center medication recommendation, attracted extensive research
类目: Information Retrieval (cs.IR)
*备注: accepted by TOIS

点击查看摘要

Abstract:Medication recommendation is one of the most critical health-related applications, which has attracted extensive research interest recently. Most existing works focus on a single hospital with abundant medical data. However, many small hospitals only have a few records, which hinders applying existing medication recommendation works to the real world. Thus, we seek to explore a more practical setting, i.e., multi-center medication recommendation. In this setting, most hospitals have few records, but the total number of records is large. Though small hospitals may benefit from the abundant overall records, they also face the challenge that the data distributions of different hospitals differ considerably. In this work, we introduce a novel conTrastive prEtrain Model with Prompt Tuning (TEMPT) for multi-center medication recommendation, which includes two stages of pretraining and finetuning. We first design two self-supervised tasks for the pretraining stage to learn general medical knowledge. They are mask prediction and contrastive tasks, which extract the intra- and inter-relationships of input diagnoses and procedures. Furthermore, we devise a novel prompt tuning method to capture the specific information of each hospital rather than adopting the common finetuning. On the one hand, the proposed prompt tuning can better learn the heterogeneity of each hospital to fit various distributions. On the other hand, it can also relieve the catastrophic forgetting problem of finetuning. To validate the proposed model, we conduct extensive experiments on the public eICU, a multi-center medical dataset. The experimental results illustrate the effectiveness of our model. The implementation code is available to ease reproducibility: this https URL.

[IR-5] Invariant debiasing learning for recommendation via biased imputation

链接: https://arxiv.org/abs/2412.20036
作者: Ting Bai,Weijie Chen,Cheng Yang,Chuan Shi
关键词: Previous debiasing studies, studies utilize unbiased, utilize unbiased data, debiasing studies utilize, Previous debiasing
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Previous debiasing studies utilize unbiased data to supervise model training. They suffer from the high trial risks and experimental costs of obtaining unbiased data. Recent research attempts to use invariant learning to detach the invariant preference of users for unbiased recommendations in an unsupervised way. However, it faces the drawbacks of low model accuracy and unstable prediction performance due to the loss of cooperation with the variant preference. In this paper, we experimentally demonstrate that invariant learning causes information loss by directly discarding the variant information, which reduces the generalization ability and results in the degradation of model performance in unbiased recommendations. Based on this consideration, we propose a novel lightweight knowledge distillation framework (KDDebias) to automatically learn the unbiased preference of users from both invariant and variant information. Specifically, the variant information is imputed to the invariant user preference in the distance-aware knowledge distillation process. Extensive experiments on three public datasets, i.e., Yahoo!R3, Coat, and MIND, show that with the biased imputation from the variant preference of users, our proposed method achieves significant improvements with less than 50% of the learning parameters compared to the SOTA unsupervised debiasing model in recommender systems. Our code is publicly available at this https URL.

附件下载

点击下载今日全部论文列表