Arxiv今日论文 | 2024-12-24

本篇博文主要展示 2024-12-24 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决大视觉-语言模型 (Large Vision-Language Models, LVLMs) 在处理跨语言文本丰富的视觉输入时性能下降的问题。具体来说，现有模型在图像文本与问题语言不一致的情况下表现不佳，尤其是在多语言环境中。解决方案的关键在于提出了MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information) 方法，通过最大化模型输出与视觉信息之间的跨语言互信息 (Mutual Information) 来构建视觉-文本的跨语言对齐。这一方法通过从单语言设置中提取知识并利用KL散度最小化来实现跨语言知识蒸馏，从而有效减少跨语言场景下的性能差异，同时保持LVLMs的固有能力。

链接: https://arxiv.org/abs/2412.17787
作者: Xinmiao Yu,Xiaocheng Feng,Yun Li,Minghui Liao,Ya-Qi Yu,Xiachong Feng,Weihong Zhong,Ruihan Chen,Mengkang Hu,Jihao Wu,Dandan Tu,Duyu Tang,Bing Qin
机构: 1. School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院);
2. School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院);
3. School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)
关键词: Recent Large Vision-Language, Recent Large, shown promising reasoning, Large Vision-Language Models, promising reasoning capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. However, the abundant text within such images may increase the model’s sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a significant drop in performance for cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap stems from cross-lingual questions failing to adequately activate relevant visual information. To mitigate this issue, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), where a visual-text cross-lingual alignment is built by maximizing mutual information between the model’s outputs and visual information. This is achieved by distilling knowledge from monolingual to cross-lingual settings through KL divergence minimization, where monolingual output logits serve as a teacher. Experimental results on the XT-VQA demonstrate that MVCL-MI effectively reduces the visual-text cross-lingual performance disparity while preserving the inherent capabilities of LVLMs, shedding new light on the potential practice for improving LVLMs. Codes are available at: this https URL
zh

[NLP-1] ResearchTown: Simulator of Human Research Community

【速读】：该论文试图解决的核心问题是：能否利用大型语言模型 (Large Language Models, LLMs) 模拟人类科研社区的活动。解决方案的关键在于提出了一个名为 ResearchTown 的多智能体框架，通过将科研社区简化为一个智能体-数据图 (agent-data graph)，其中研究人员和论文分别表示为智能体节点和数据节点，并基于合作关系进行连接。此外，论文还引入了 TextGNN，一个基于文本的推理框架，将各种科研活动（如论文阅读、写作和评审）建模为智能体-数据图上的统一消息传递过程。通过 ResearchBench 基准测试，该框架能够实现对科研活动的真实模拟，并生成跨学科的研究思路，从而为自动发现新颖的科学见解提供可能性。

链接: https://arxiv.org/abs/2412.17767
作者: Haofei Yu,Zhaochen Hong,Zirui Cheng,Kunlun Zhu,Keyang Xuan,Jinwei Yao,Tao Feng,Jiaxuan You
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
关键词: Large Language Models, Large Language, demonstrated remarkable potential, question remains unanswered, fundamental question remains
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi-agent framework for research community simulation. Within this framework, the human research community is simplified and modeled as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph. To evaluate the quality of the research simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire novel research directions.
zh

[NLP-2] In Case You Missed It: ARC Challenge Is Not That Challenging

【速读】：该论文试图解决现代大型语言模型（LLMs）在ARC Challenge上的表现被错误评估的问题，认为其难度主要源于评估设置而非任务本身的复杂性。解决方案的关键在于采用更公平的评估方法，避免直接比较答案选项的限制。通过展示这种评估方式的改变可以显著缩小性能差距（如在SIQA上的表现），甚至在某些任务（如OpenBookQA）上达到超人水平，论文强调了评估方法对模型能力感知的影响，并提出了确保多选评估准确反映模型实际能力的指导原则。

链接: https://arxiv.org/abs/2412.17758
作者: Łukasz Borchmann
机构: Snowflake AI Research
关键词: modern LLMs primarily, LLMs primarily due, prevents direct comparison, ARC Challenge, ARC Easy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
zh

[NLP-3] Deliberation in Latent Space via Differentiable Cache Augmentation

【速读】：该论文试图解决大型语言模型（LLMs）在处理复杂问题时，由于生成离散token序列导致的显著延迟成本和优化困难的问题。解决方案的关键在于引入一个离线的协处理器（coprocessor），该协处理器通过操作模型的键值缓存（key-value cache, kv-cache）来增强缓存，具体方法是向缓存中添加一组潜在嵌入（latent embeddings），以提高后续解码的准确性。该协处理器通过语言建模损失（language modeling loss）在标准预训练数据上进行训练，同时保持解码器冻结。这种方法使得模型能够以端到端可微分的方式，将额外的计算提炼到kv-cache中。由于解码器保持不变，协处理器可以离线异步运行，且在协处理器不可用或特定缓存不需要额外计算时，语言模型仍能正常工作。实验表明，缓存增强后，解码器在后续多个token上的困惑度（perplexity）显著降低，并且在无需任务特定训练的情况下，缓存增强在多种推理密集型任务中均能持续提升性能。

链接: https://arxiv.org/abs/2412.17747
作者: Luyang Liu,Jonas Pfeiffer,Jiaxing Wu,Jun Xie,Arthur Szlam
机构: Google DeepMind(谷歌深度思维); Google DeepMind(谷歌深度思维); Google DeepMind(谷歌深度思维); Google DeepMind(谷歌深度思维); Google DeepMind(谷歌深度思维)
关键词: Techniques enabling large, solving complex problems, intermediate reasoning steps, Techniques enabling, enabling large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Techniques enabling large language models (LLMs) to “think more” by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model’s key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.
zh

[NLP-4] RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation

【速读】：该论文试图解决现有代码翻译基准（benchmarks）主要关注细粒度样本（如代码片段、函数或文件级别），而未能准确反映实际需求中整个代码仓库（repository-level）翻译的问题。解决方案的关键在于提出了一个新的基准——RepoTransBench，这是一个真实世界级别的代码仓库翻译基准，包含自动可执行的测试套件。通过实验评估了11个先进的LLMs在该基准上的表现，发现即使经过迭代调试，最佳模型的成功率（Success@1）也仅为21%，表明当前LLMs在仓库级别代码翻译上仍存在显著不足。

链接: https://arxiv.org/abs/2412.17744
作者: Yanli Wang,Yanlin Wang,Suiquan Wang,Daya Guo,Jiachi Chen,John Grundy,Xilin Liu,Yuchi Ma,Mingzhi Mao,Hongyu Zhang,Zibin Zheng
机构: Sun Yat-sen University(中山大学); Monash University(莫纳什大学); Huawei Cloud Computing Technologies Co., Ltd(华为云计算技术有限公司); Chongqing University(重庆大学)
关键词: Repository-level code translation, code translation, code, Repository-level code, source repository
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. However, previous benchmarks mostly provide fine-grained samples, focusing at either code snippet, function, or file-level code translation. Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code length and more complex functionalities. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. Finally, we conduct a detailed error analysis and highlight current LLMs’ deficiencies in repository-level code translation, which could provide a reference for further improvements.
zh

[NLP-5] YuLan-Mini: An Open Data-efficient Language Model

【速读】：该论文试图解决大规模语言模型（LLMs）预训练过程中面临的资源需求巨大和复杂技术过程的挑战。解决方案的关键在于三个主要技术贡献：一是通过精细的数据管道结合数据清洗和数据调度策略来提高训练效率；二是采用稳健的优化方法来缓解训练不稳定性；三是引入有效的退火方法，结合目标数据选择和长上下文训练，进一步提升模型性能。这些方法使得YuLan-Mini在仅使用1.08T tokens的情况下，达到了与需要更多数据的行业领先模型相媲美的性能。

链接: https://arxiv.org/abs/2412.17743
作者: Yiwen Hu,Huatong Song,Jia Deng,Jiapeng Wang,Jie Chen,Kun Zhou,Yutao Zhu,Jinhao Jiang,Zican Dong,Wayne Xin Zhao,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence(高岭人工智能学院); Renmin University of China(中国人民大学)
关键词: immense resource demands, technical processes involved, large language models, processes involved, large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: this https URL.
zh

[NLP-6] Fourier Position Embedding: Enhancing Attentions Periodic Extension for Length Generalization

【速读】：该论文试图解决基于旋转位置嵌入 (Rotary Position Embedding, RoPE) 的语言模型 (Language Models, LMs) 在长上下文扩展时遇到的频谱损伤问题，这种损伤主要由注意力机制外的线性层和激活函数以及时域截断导致的频率分量不足引起。论文的关键解决方案是提出傅里叶位置嵌入 (Fourier Position Embedding, FoPE)，通过构建傅里叶级数并剔除破坏性的频率分量，增强注意力机制的频域特性，从而提高周期扩展和长度泛化能力，实验结果表明FoPE在不同上下文窗口中能保持更稳定的困惑度和一致的准确性。

链接: https://arxiv.org/abs/2412.17739
作者: Ermo Hua,Che Jiang,Xingtai Lv,Kaiyan Zhang,Ning Ding,Youbang Sun,Biqing Qi,Yuchen Fan,Xue Kai Zhu,Bowen Zhou
机构: Tsinghua University(清华大学); Northeastern University(东北大学); Shanghai AI Laboratory(上海人工智能实验室); Shanghai Jiaotong University(上海交通大学)
关键词: Rotary Position Embedding, improving Rotary Position, improving Rotary, Rotary Position, Fourier Position Embedding
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE’s limitations within attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention’s frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs Fourier Series and zero-outs the destructive frequency components, increasing model robustness against the spectrum damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several analyses and ablations bring further support to our method and theoretical modeling.
zh

[NLP-7] Chumor 2.0: Towards Benchmarking Chinese Humor Understanding

【速读】：该论文试图解决非英语语言（如中文）中文化背景复杂的幽默数据集资源匮乏的问题。解决方案的关键在于构建了Chumor，这是首个规模超过现有幽默数据集的中文幽默解释数据集，数据来源于中国类似Reddit的平台“Ruo Zhi Ba”。通过测试十种大型语言模型（LLMs）在直接提示和思维链提示下的表现，研究发现现有LLMs在解释中文幽默方面的准确率仅略高于随机水平，远低于人类表现。此外，分析表明人类标注的幽默解释显著优于GPT-4o和ERNIE-4-turbo生成的解释。

链接: https://arxiv.org/abs/2412.17729
作者: Ruiqi He,Yushu He,Longju Bai,Jiarui Liu,Zhenjie Sun,Zenghao Tang,He Wang,Hanchen Xia,Rada Mihalcea,Naihao Deng
机构: University of Michigan(密歇根大学); Carnegie Mellon University(卡内基梅隆大学); Shanghai Jiaotong University(上海交通大学)
关键词: leaving limited resources, evaluations predominantly focus, focus on English, Existing humor datasets, https URL
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2406.12754

点击查看摘要

Abstract:Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct Chumor, the first Chinese humor explanation dataset that exceeds the size of existing humor datasets. Chumor is sourced from Ruo Zhi Ba, a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes. We test ten LLMs through direct and chain-of-thought prompting, revealing that Chumor poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human. In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE-4-turbo. We release Chumor at this https URL, our project page is at this https URL, our leaderboard is at this https URL, and our codebase is at this https URL.
zh

[NLP-8] Knowledge Editing through Chain-of-Thought

【速读】：该论文试图解决大语言模型（LLMs）在面对不断变化的世界知识时，频繁重新训练的高成本问题。解决方案的关键在于提出了一个名为EditCoT的新型知识编辑框架，该框架通过生成链式思维（Chain-of-Thought, CoT）并利用CoT编辑器基于更新知识迭代优化这一过程，从而在不重新训练模型的情况下灵活且高效地更新LLMs。这种方法不仅在多种任务和语言上实现了最先进的性能，还表现出更好的泛化能力、有效性和稳定性。

链接: https://arxiv.org/abs/2412.17727
作者: Changyue Wang,Weihang Su,Qingyao Ai,Yiqun Liu
机构: Tsinghua University(清华大学); Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)
关键词: Large Language Models, natural language processing, demonstrated exceptional capabilities, Large Language, demonstrated exceptional
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of natural language processing (NLP) tasks. However, keeping these models up-to-date with evolving world knowledge remains a significant challenge due to the high costs of frequent retraining. To address this challenge, knowledge editing techniques have emerged to update LLMs with new information without rebuilding the model from scratch. Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model’s original capabilities. Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples. Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks. In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining. EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge. We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks. The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating. Code and data are available at: this https URL. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2412.17727 [cs.CL] (or arXiv:2412.17727v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2412.17727 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-9] From Models to Microtheories: Distilling a Models Topical Knowledge for Grounded Question Answering

【速读】：该论文试图解决语言模型（LMs）在回答问题时缺乏透明性和可解释性的问题，特别是模型对特定主题的整体理解（即“理论”）难以被揭示，从而影响用户对模型的信任。解决方案的关键在于提出了一种称为微理论（microtheories）的方法，通过生成一组简洁、通用且非冗余的句子来概括模型对某一主题的核心知识。这些微理论不仅能够系统地推导出对一系列问题的回答，还能增强模型的可解释性和性能。具体实现上，论文首先通过模型生成句子填充知识库，然后从中提炼出核心微理论，并证明这些微理论能够有效补充现有语料库（如Wikipedia）中缺失的关键主题信息，从而提升模型的回答准确性和可验证性。

链接: https://arxiv.org/abs/2412.17701
作者: Nathaniel Weir,Bhavana Dalvi Mishra,Orion Weller,Oyvind Tafjord,Sam Hornstein,Alexander Sabol,Peter Jansen,Benjamin Van Durme,Peter Clark
机构: Johns Hopkins University(约翰斯·霍普金斯大学); Allen Institute for AI(艾伦人工智能研究所); Thomas Jefferson University(托马斯·杰斐逊大学); University of Arizona(亚利桑那大学)
关键词: Recent reasoning methods, Recent reasoning, reasoning methods, entailment reasoning, users understand
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent reasoning methods (e.g., chain-of-thought, entailment reasoning) help users understand how language models (LMs) answer a single question, but they do little to reveal the LM’s overall understanding, or “theory,” about the question’s \textittopic , making it still hard to trust the model. Our goal is to materialize such theories - here called \textitmicrotheories (a linguistic analog of logical microtheories) - as a set of sentences encapsulating an LM’s core knowledge about a topic. These statements systematically work together to entail answers to a \textitset of questions to both engender trust and improve performance. Our approach is to first populate a knowledge store with (model-generated) sentences that entail answers to training questions and then distill those down to a core microtheory that is concise, general, and non-redundant. We show that, when added to a general corpus (e.g., Wikipedia), microtheories can supply critical, topical information not necessarily present in the corpus, improving both a model’s ability to ground its answers to verifiable knowledge (i.e., show how answers are systematically entailed by documents in the corpus, fully grounding up to +8% more answers), and the accuracy of those grounded answers (up to +8% absolute). We also show that, in a human evaluation in the medical domain, our distilled microtheories contain a significantly higher concentration of topically critical facts than the non-distilled knowledge store. Finally, we show we can quantify the coverage of a microtheory for a topic (characterized by a dataset) using a notion of p -relevance. Together, these suggest that microtheories are an efficient distillation of an LM’s topic-relevant knowledge, that they can usefully augment existing corpora, and can provide both performance gains and an interpretable, verifiable window into the model’s knowledge of a topic.
zh

[NLP-10] Understanding the Logic of Direct Preference Alignment through Logic

【速读】：该论文试图解决直接偏好对齐算法（DPA）在理解和开发新损失函数时缺乏技术框架和概念基础的问题。解决方案的关键在于提出了一种新的形式化方法，通过离散推理问题来描述DPA损失的语义。具体来说，论文提出了一种符号表达式系统，用于表征单模型和参考模型方法的偏好损失，并识别了多种常用DPA变体的符号形式。此外，该框架揭示了DPA损失的全貌，不仅有助于严格分析现有损失函数之间的关系，还为从基本原理出发系统地探索和推导新的损失函数提供了可能。

链接: https://arxiv.org/abs/2412.17696
作者: Kyle Richardson,Vivek Srikumar,Ashish Sabharwal
机构: Allen Institute for AI(艾伦人工智能研究所); University of Utah(犹他大学)
关键词: shown great promise, aligning large language, large language models, DPA loss, original DPO loss
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic expression that characterizes its semantics? How do the semantics of two losses relate to each other? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.
zh

[NLP-11] RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF for Conversational QA over KGs with RAG

【速读】：该论文试图解决在基于RDF知识图谱（RDF knowledge graphs）的对话式问答（ConvQA）中，SPARQL查询在处理复杂意图和抽象需求时的脆弱性问题。解决方案的关键在于提出了一种双管齐下的系统，通过融合从知识图谱自动生成的数据库上的SQL查询结果和知识图谱事实的文本化结果的文本搜索结果，来支持迭代检索。当任一分支的结果不理想时，系统能够自动进行进一步的检索。最终，通过检索增强生成（RAG）框架，利用大型语言模型（LLM）从累积的搜索结果中生成连贯的回答。

链接: https://arxiv.org/abs/2412.17690
作者: Rishiraj Saha Roy,Chris Hinze,Joel Schlotthauer,Farzad Naderi,Viktor Hangya,Andreas Foltyn,Luzian Hahn,Fabian Kuech
机构: Fraunhofer Institute for Integrated Circuits IIS (弗劳恩霍夫集成电路研究所)
关键词: translate natural language, Conversational question answering, natural language questions, searching over RDF, RDF knowledge graphs
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at BTW 2025, 10 pages

点击查看摘要

Abstract:Conversational question answering (ConvQA) is a convenient means of searching over RDF knowledge graphs (KGs), where a prevalent approach is to translate natural language questions to SPARQL queries. However, SPARQL has certain shortcomings: (i) it is brittle for complex intents and conversational questions, and (ii) it is not suitable for more abstract needs. Instead, we propose a novel two-pronged system where we fuse: (i) SQL-query results over a database automatically derived from the KG, and (ii) text-search results over verbalizations of KG facts. Our pipeline supports iterative retrieval: when the results of any branch are found to be unsatisfactory, the system can automatically opt for further rounds. We put everything together in a retrieval augmented generation (RAG) setup, where an LLM generates a coherent response from accumulated search results. We demonstrate the superiority of our proposed system over several baselines on a knowledge graph of BMW automobiles.
zh

[NLP-12] Generating Completions for Fragmented Brocas Aphasic Sentences Using Large Language Models

【速读】：该论文试图解决传统布洛卡失语症（Broca’s aphasia）治疗方法耗时、劳动密集且不符合真实对话场景的问题。解决方案的关键在于利用基于自然语言处理（Natural Language Processing, NLP）的大语言模型（Large Language Models, LLMs）来改进现有的治疗手段。具体而言，研究通过生成模拟布洛卡失语症语言特征的合成数据，并使用这些数据微调预训练的LLMs，使其能够完成布洛卡失语症患者的碎片化句子。实验结果表明，LLMs在重建碎片化句子方面表现出色，尤其是在处理较长输入时，展示了其在提升布洛卡失语症患者沟通辅助工具方面的潜力。

链接: https://arxiv.org/abs/2412.17669
作者: Sijbren van Vaals,Yevgen Matusevych,Frank Tsiwah
机构: Center for Language and Cognition Groningen, University of Groningen (格罗宁根大学语言与认知中心)
关键词: Broca aphasic, characterized by non-fluent, good comprehension, Broca, Large Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Broca’s aphasia is a type of aphasia characterized by non-fluent, effortful and fragmented speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To address this issue, we explore the use of sequence-to-sequence LLMs for completing fragmented Broca’s aphasic sentences. We first generate synthetic Broca’s aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca’s aphasic speech. Using this synthetic data, we then fine-tune four pre-trained LLMs on the task of completing fragmented sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca’s aphasic data. We demonstrate LLMs’ capability for reconstructing fragmented sentences, with the models showing improved performance with longer input utterances. Our result highlights the LLMs’ potential in advancing communication aids for individuals with Broca’s aphasia and possibly other clinical populations.
zh

[NLP-13] racking the Feature Dynamics in LLM Training: A Mechanistic Study

【速读】：该论文试图解决大语言模型 (LLMs) 训练过程中特征演变的机制性解释问题。解决方案的关键在于提出了 SAE-Track 方法，该方法能够高效地获取一系列连续的稀疏自编码器 (SAEs)，并通过分析特征形成过程和训练期间的特征漂移，提供了对 LLMs 特征动态演变的深入理解。这一方法不仅有助于揭示训练机制，还增强了我们对特征演化过程的认识。

链接: https://arxiv.org/abs/2412.17626
作者: Yang Xu,Yi Wang,Hao Wang
机构: 未知
关键词: large language models, language models, interpretability of large, large language, Understanding training dynamics
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding training dynamics and feature evolution is crucial for the mechanistic interpretability of large language models (LLMs). Although sparse autoencoders (SAEs) have been used to identify features within LLMs, a clear picture of how these features evolve during training remains elusive. In this study, we: (1) introduce SAE-Track, a method to efficiently obtain a continual series of SAEs; (2) formulate the process of feature formation and conduct a mechanistic analysis; and (3) analyze and visualize feature drift during training. Our work provides new insights into the dynamics of features in LLMs, enhancing our understanding of training mechanisms and feature evolution.
zh

[NLP-14] LiveIdeaBench: Evaluating LLM s Scientific Creativity and Idea Generation with Minimal Context

【速读】：该论文试图解决现有评估框架主要依赖丰富上下文输入来评估大型语言模型（LLMs）在科学任务中的表现，而忽视了它们从极少信息中生成新颖创意的能力的问题。解决方案的关键在于引入LiveIdeaBench，这是一个综合性的基准测试，通过单关键词提示来评估LLMs的科学创造力和发散思维能力。该框架基于Guilford的创造力理论，从原创性（originality）、可行性（feasibility）、流畅性（fluency）和灵活性（flexibility）四个维度评估生成创意的质量，并揭示了科学创造能力与传统智能指标之间的不同模式。

链接: https://arxiv.org/abs/2412.17596
作者: Kai Ruan,Xuan Wang,Jixiang Hong,Hao Sun
机构: 未知
关键词: Large Language Models, Large Language, rich contextual inputs, demonstrated remarkable capabilities, contextual inputs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated remarkable capabilities in scientific tasks, existing evaluation frameworks primarily assess their performance using rich contextual inputs, overlooking their ability to generate novel ideas from minimal information. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs’ scientific creativity and divergent thinking capabilities using single-keyword prompts. Drawing from Guilford’s creativity theory, our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across four key dimensions: originality, feasibility, fluency, and flexibility. Through extensive experimentation with 20 leading models across 1,180 keywords spanning 18 scientific domains, we reveal that scientific creative ability shows distinct patterns from general intelligence metrics. Notably, our results demonstrate that models like QwQ-32B-preview achieve comparable creative performance to top-tier models like o1-preview, despite significant gaps in their general intelligence scores. These findings highlight the importance of specialized evaluation frameworks for scientific creativity and suggest that the development of creative capabilities in LLMs may follow different trajectories than traditional problem-solving abilities.
zh

[NLP-15] Investigating Length Issues in Document-level Machine Translation

【速读】：该论文试图解决文档级机器翻译 (document-level machine translation, MT) 在处理超长文本时性能下降的问题。解决方案的关键在于设计并实现一种新方法，精确测量输入文本长度增加对翻译输出的影响。通过实验，研究揭示了翻译性能随输入文本长度增加而下降的现象，并发现句子在文档中的位置对翻译质量有显著影响，早期出现的句子翻译质量更高。尽管通过调整文档长度分布和位置嵌入 (positional embeddings) 可以在一定程度上缓解这些问题，但文档级MT的性能仍未达到基于句子的MT水平。

链接: https://arxiv.org/abs/2412.17592
作者: Ziqian Peng,Rachel Bawden,François Yvon
机构: Sorbonne Université & CNRS, ISIR, Paris, France; Inria, Paris, France
关键词: Transformer architectures, opening new perspectives, increasingly effective, effective at processing, processing and generating
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Transformer architectures are increasingly effective at processing and generating very long chunks of texts, opening new perspectives for document-level machine translation (MT). In this work, we challenge the ability of MT systems to handle texts comprising up to several thousands of tokens. We design and implement a new approach designed to precisely measure the effect of length increments on MT outputs. Our experiments with two representative architectures unambiguously show that (a)~translation performance decreases with the length of the input text; (b)~the position of sentences within the document matters and translation quality is higher for sentences occurring earlier in a document. We further show that manipulating the distribution of document lengths and of positional embeddings only marginally mitigates such problems. Our results suggest that even though document-level MT is computationally feasible, it does not yet match the performance of sentence-based MT.
zh

[NLP-16] ERUPD – English to Roman Urdu Parallel Dataset

【速读】：该论文试图解决罗马乌尔都语（Roman Urdu）在数字通信中因缺乏标准化、语音变异性和与英语的代码转换而导致的语言处理难题。解决方案的关键在于创建了一个包含75,146个句子对的新型平行数据集，并通过混合方法结合了通过高级提示工程生成的合成数据和来自个人消息群的真实对话数据。随后，通过人工评估阶段进一步优化数据集，确保代码转换、语音表示和同义词变异的准确性和一致性。这一数据集为机器翻译、情感分析和多语言教育提供了重要资源。

链接: https://arxiv.org/abs/2412.17562
作者: Mohammed Furqan,Raahid Bin Khaja,Rayyan Habeeb
机构: Muffakham Jah College of Engineering and Technology(穆夫提哈姆·贾大学工程与技术学院)
关键词: gaps fosters global, fosters global growth, Bridging linguistic gaps, Roman Urdu, linguistic gaps fosters
类目: Computation and Language (cs.CL)
备注: 9 pages, 1 figure

点击查看摘要

Abstract:Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu – a Latin-script adaptation of Urdu widely used in digital communication – by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu’s lack of standardization, phonetic variability, and code-switching with English complicates language processing. We tackled this by employing a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We further refined the dataset through a human evaluation phase, addressing linguistic inconsistencies and ensuring accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu’s diverse linguistic features and serves as a critical resource for machine translation, sentiment analysis, and multilingual education.
zh

[NLP-17] A Survey of Query Optimization in Large Language Models

【速读】：该论文试图解决在大语言模型（LLMs）中，特别是在检索增强生成（RAG）场景下，如何通过查询优化（Query Optimization, QO）技术提高模型理解和回答复杂查询的效率与准确性。解决方案的关键在于通过QO技术优化RAG的检索阶段，确保能够准确地获取并整合多条相关证据，从而提升LLMs生成答案的准确性和可靠性。论文通过总结和分析现有QO技术，旨在整合这些技术并阐明其技术基础，以增强LLMs的多样性和应用范围。

链接: https://arxiv.org/abs/2412.17558
作者: Mingyang Song,Mao Zheng
机构: Tencent(腾讯)
关键词: Large Language Models, Query Optimization, Language Models, Large Language, Retrieval-Augmented Generation
类目: Computation and Language (cs.CL)
备注: Ongoing Work

点击查看摘要

Abstract:\textitQuery Optimization (QO) refers to techniques aimed at enhancing the efficiency and quality of Large Language Models (LLMs) in understanding and answering queries, especially complex ones in scenarios like Retrieval-Augmented Generation (RAG). Specifically, RAG mitigates the limitations of LLMs by dynamically retrieving and leveraging up-to-date relevant information, which provides a cost-effective solution to the challenge of LLMs producing plausible but potentially inaccurate responses. Recently, as RAG evolves and incorporates multiple components that influence its performance, QO has emerged as a critical element, playing a pivotal role in determining the effectiveness of RAG’s retrieval stage in accurately sourcing the necessary multiple pieces of evidence to answer queries correctly. In this paper, we trace the evolution of QO techniques by summarizing and analyzing significant studies. Through an organized framework and categorization, we aim to consolidate existing QO techniques in RAG, elucidate their technological foundations, and highlight their potential to enhance the versatility and applications of LLMs.
zh

[NLP-18] Comparative Analysis of Document-Level Embedding Methods for Similarity Scoring on Shakespeare Sonnets and Taylor Swift Lyrics

【速读】：该论文旨在评估TF-IDF权重、平均Word2Vec嵌入和BERT嵌入在两个不同文本领域中进行文档相似度评分的表现。解决方案的关键在于通过分析余弦相似度分数，揭示这些方法的优势和局限性。研究发现，TF-IDF依赖于词汇重叠，而Word2Vec在跨领域比较中表现出更强的语义泛化能力。BERT在具有挑战性的领域中表现较差，可能是由于缺乏领域特定的微调。

链接: https://arxiv.org/abs/2412.17552
作者: Klara Kramer
机构: 未知
关键词: contrasting textual domains, document similarity scoring, study evaluates, contrasting textual, TF-IDF weighting
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:This study evaluates the performance of TF-IDF weighting, averaged Word2Vec embeddings, and BERT embeddings for document similarity scoring across two contrasting textual domains. By analysing cosine similarity scores, the methods’ strengths and limitations are highlighted. The findings underscore TF-IDF’s reliance on lexical overlap and Word2Vec’s superior semantic generalisation, particularly in cross-domain comparisons. BERT demonstrates lower performance in challenging domains, likely due to insufficient domainspecific fine-tuning.
zh

[NLP-19] Resource-Aware Arabic LLM Creation: Model Adaptation Integration and Multi-Domain Testing

【速读】：该论文试图解决在资源受限的环境下（如仅有4GB VRAM的系统）对大型语言模型进行阿拉伯语处理的微调问题。解决方案的关键在于采用量化低秩适应 (Quantized Low-Rank Adaptation, QLoRA) 技术，结合自定义数据预处理、模型配置和训练优化技术（如梯度累积和混合精度训练），以适应阿拉伯语的复杂性，包括形态复杂性、方言变体和音调符号处理。通过这种方法，论文展示了在10,000个训练步骤后显著的性能提升，最终损失收敛至0.1083，并验证了模型在多种阿拉伯语任务中的鲁棒性和对特定语言现象的处理能力。

链接: https://arxiv.org/abs/2412.17548
作者: Prakash Aryan
机构: 未知
关键词: Quantized Low-Rank Adaptation, processing using Quantized, Quantized Low-Rank, Wikipedia Arabic corpora, Arabic language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a novel approach to fine-tuning the Qwen2-1.5B model for Arabic language processing using Quantized Low-Rank Adaptation (QLoRA) on a system with only 4GB VRAM. We detail the process of adapting this large language model to the Arabic domain, using diverse datasets including Bactrian, OpenAssistant, and Wikipedia Arabic corpora. Our methodology involves custom data preprocessing, model configuration, and training optimization techniques such as gradient accumulation and mixed-precision training. We address specific challenges in Arabic NLP, including morphological complexity, dialectal variations, and diacritical mark handling. Experimental results over 10,000 training steps show significant performance improvements, with the final loss converging to 0.1083. We provide comprehensive analysis of GPU memory usage, training dynamics, and model evaluation across various Arabic language tasks, including text classification, question answering, and dialect identification. The fine-tuned model demonstrates robustness to input perturbations and improved handling of Arabic-specific linguistic phenomena. This research contributes to multilingual AI by demonstrating a resource-efficient approach for creating specialized language models, potentially democratizing access to advanced NLP technologies for diverse linguistic communities. Our work paves the way for future research in low-resource language adaptation and efficient fine-tuning of large language models.
zh

[NLP-20] Domain adapted machine translation: What does catastrophic forgetting forget and why? EMNLP2024

【速读】：该论文试图解决神经机器翻译 (Neural Machine Translation, NMT) 模型在领域适应 (domain adaptation) 过程中出现的灾难性遗忘 (catastrophic forgetting) 问题，即模型在微调后迅速丧失通用翻译质量的现象。论文通过研究遗忘现象与适应数据之间的关系，发现遗忘的程度和类型与目标词汇覆盖率密切相关。解决方案的关键在于深入理解遗忘的原因，并通过分析适应数据的目标词汇覆盖率来指导更有效的领域适应策略。

链接: https://arxiv.org/abs/2412.17537
作者: Danielle Saunders,Steve DeNeefe
机构: RWS; RWS Language Weaver
关键词: Neural Machine Translation, Neural Machine, Machine Translation, dataset of interest, involving fine-tuning
类目: Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:Neural Machine Translation (NMT) models can be specialized by domain adaptation, often involving fine-tuning on a dataset of interest. This process risks catastrophic forgetting: rapid loss of generic translation quality. Forgetting has been widely observed, with many mitigation methods proposed. However, the causes of forgetting and the relationship between forgetting and adaptation data are under-explored. This paper takes a novel approach to understanding catastrophic forgetting during NMT adaptation by investigating the impact of the data. We provide a first investigation of what is forgotten, and why. We examine the relationship between forgetting and the in-domain data, and show that the amount and type of forgetting is linked to that data’s target vocabulary coverage. Our findings pave the way toward better informed NMT domain adaptation. Comments: EMNLP 2024 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2412.17537 [cs.CL] (or arXiv:2412.17537v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2412.17537 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-21] CiteBART: Learning to Generate Citations for Local Citation Recommendation

【速读】：该论文试图解决科学写作中引用生成的问题，特别是局部引用推荐 (Local Citation Recommendation, LCR)。解决方案的关键在于提出了CiteBART，这是一种基于BART预训练的自定义模型，通过引用标记掩码 (citation token masking) 来生成引用。具体来说，CiteBART在基础方案中通过掩码局部引用上下文中的引用标记来进行引用预测，而在全局方案中，将引用论文的标题和摘要与局部引用上下文连接，以学习重建引用标记。该方法在多个引用推荐基准测试中表现优异，尤其是在较大的数据集如Refseer和ArXiv上，展示了其生成式方法带来的零样本能力 (zero-shot capability)。

链接: https://arxiv.org/abs/2412.17534
作者: Ege Yiğit Çelik,Selma Tekir
机构: İzmir Institute of Technology (伊兹密尔理工大学); İzmir Institute of Technology (伊兹密尔理工大学)
关键词: essential building blocks, Citation, local citation context, local citation, essential building
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 2 figures, 7 tables

点击查看摘要

Abstract:Citations are essential building blocks in scientific writing. The scientific community is longing for support in their generation. Citation generation involves two complementary subtasks: Determining the citation worthiness of a context and, if it’s worth it, proposing the best candidate papers for the citation placeholder. The latter subtask is called local citation recommendation (LCR). This paper proposes CiteBART, a custom BART pre-training based on citation token masking to generate citations to achieve LCR. In the base scheme, we mask the citation token in the local citation context to make the citation prediction. In the global one, we concatenate the citing paper’s title and abstract to the local citation context to learn to reconstruct the citation token. CiteBART outperforms state-of-the-art approaches on the citation recommendation benchmarks except for the smallest FullTextPeerRead dataset. The effect is significant in the larger benchmarks, e.g., Refseer and ArXiv. We present a qualitative analysis and an ablation study to provide insights into the workings of CiteBART. Our analyses confirm that its generative nature brings about a zero-shot capability.
zh

[NLP-22] Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse

【速读】：该论文试图解决在非英语语境下，尤其是波兰语中，现有工具在检测色情内容方面的显著局限性问题。解决方案的关键在于提出了一个名为forePLay的新型波兰语数据集，该数据集包含超过24,000条标注句子，并采用多维分类法（涵盖模糊性、暴力和社会不可接受性维度）。通过评估，论文展示了专门针对波兰语的语言模型在性能上优于多语言模型，尤其是基于transformer的架构在处理不平衡类别方面表现出色。这一解决方案为开发语言感知的内容审核系统奠定了基础，并强调了在形态复杂的语言中扩展此类能力的关键考虑因素。

链接: https://arxiv.org/abs/2412.17533
作者: Anna Kołos,Katarzyna Lorenc,Emilia Wiśnios,Agnieszka Karlińska
机构: NASK National Research Institute(NASK国家研究所); Independent Researcher(独立研究员)
关键词: demonstrate significant limitations, current tools demonstrate, tools demonstrate significant, robust detection systems, significant limitations
类目: Computation and Language (cs.CL)
备注: The forePLay dataset and associated resources will be made publicly available for research purposes upon publication, in accordance with data sharing regulations

点击查看摘要

Abstract:The surge in online content has created an urgent demand for robust detection systems, especially in non-English contexts where current tools demonstrate significant limitations. We present forePLay, a novel Polish language dataset for erotic content detection, featuring over 24k annotated sentences with a multidimensional taxonomy encompassing ambiguity, violence, and social unacceptability dimensions. Our comprehensive evaluation demonstrates that specialized Polish language models achieve superior performance compared to multilingual alternatives, with transformer-based architectures showing particular strength in handling imbalanced categories. The dataset and accompanying analysis establish essential frameworks for developing linguistically-aware content moderation systems, while highlighting critical considerations for extending such capabilities to morphologically complex languages.
zh

[NLP-23] DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

【速读】：该论文试图解决大语言模型 (Large Language Models, LLMs) 在面对精心设计的输入时生成有害内容的问题，即LLM越狱 (jailbreaking)。解决方案的关键在于引入了一种名为DiffusionAttacker的端到端生成式方法，该方法受扩散模型 (diffusion models) 启发，采用序列到序列 (sequence-to-sequence, seq2seq) 文本扩散模型作为生成器，通过新颖的攻击损失 (attack loss) 引导去噪过程。与传统依赖自回归LLMs生成越狱提示的方法不同，DiffusionAttacker允许更灵活的令牌修改，从而在保留原始提示语义内容的同时生成有害内容。此外，通过利用Gumbel-Softmax技术，使扩散模型的输出分布采样过程可微分，避免了迭代令牌搜索的需求。实验结果表明，DiffusionAttacker在攻击成功率 (Attack Success Rate, ASR)、流畅性和多样性等评估指标上优于先前的方法。

链接: https://arxiv.org/abs/2412.17522
作者: Hao Wang,Hao Li,Junda Zhu,Xinyuan Wang,Chengwei Pan,MinLie Huang,Lei Sha
机构: Beihang University, Beijing, China; Tsinghua University, Beijing, China; Zhongguancun Laboratory, Beijing, China
关键词: Large Language Models, Large Language, carefully crafted inputs, Language Models, crafted inputs
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications. This approach preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model’s output distribution differentiable, eliminating the need for iterative token search. Extensive experiments on Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.
zh

[NLP-24] DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought

【速读】：该论文试图解决在神经机器翻译 (MT) 中处理包含比喻和隐喻的文本时，由于文化差异导致的翻译困难问题。解决方案的关键在于引入长链式思维 (long chain-of-thought, CoT) 方法，通过多智能体框架来模拟大语言模型 (LLMs) 的长期思考能力。具体来说，论文首先从现有文学书籍中挖掘包含比喻或隐喻的句子，然后通过一个包含翻译者、顾问和评估者的多智能体系统进行迭代翻译。翻译者在顾问的建议下逐步改进翻译，评估者则判断当前翻译是否优于之前的版本。通过这种方式，论文收集了大量长链式思维的翻译数据，并用于训练DRT-o1模型。实验结果表明，DRT-o1在文学翻译任务中显著提升了翻译质量，使用Qwen2.5-7B和Qwen2.5-14B作为骨干模型时，BLEU分数提升了7.33~8.26，CometScore提升了1.66~3.36，并且DRT-o1-7B在某些指标上超越了QwQ-32B-Preview。

链接: https://arxiv.org/abs/2412.17498
作者: Jiaan Wang,Fandong Meng,Yunlong Liang,Jie Zhou
机构: Pattern Recognition Center, WeChat AI, Tencent Inc(模式识别中心，微信AI，腾讯公司)
关键词: reasoning tasks, coding tasks, models have emerged, emerged as representative, math and coding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, O1-like models have emerged as representative examples, illustrating the effectiveness of long chain-of-thought (CoT) in reasoning tasks such as math and coding tasks. In this paper, we introduce DRT-o1, an attempt to bring the success of long CoT to neural machine translation (MT). Specifically, in view of the literature books that might involve similes and metaphors, translating these texts to a target language is very difficult in practice due to cultural differences. In such cases, literal translation often fails to convey the intended meaning effectively. Even for professional human translators, considerable thought must be given to preserving semantics throughout the translation process. To simulate LLMs’ long thought ability in MT, we first mine sentences containing similes or metaphors from existing literature books, and then develop a multi-agent framework to translate these sentences via long thought. In the multi-agent framework, a translator is used to iteratively translate the source sentence under the suggestions provided by an advisor. To ensure the effectiveness of the long thoughts, an evaluator is also employed to judge whether the translation in the current round is better than the previous one or not. In this manner, we collect tens of thousands of long-thought MT data, which is used to train our DRT-o1. The experimental results on literature translation demonstrate the effectiveness of the DRT-o1. Using Qwen2.5-7B and Qwen2.5-14B as the backbones, the improvement brought by DRT-o1 achieves 7.33~8.26 BLEU and 1.66~3.36 CometScore. Besides, DRT-o1-7B can outperform QwQ-32B-Preview by 7.82 BLEU and 1.46 CometScore, showing its effectiveness. The project is available at this https URL
zh

[NLP-25] A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

【速读】：该论文试图解决在大语言模型中处理长上下文时，基于要点（gist-based）的上下文压缩方法的性能和潜在失败模式问题。解决方案的关键在于提出两种策略：一是细粒度的自动编码（fine-grained autoencoding），以增强对原始标记信息的还原能力；二是基于段落的标记重要性估计（segment-wise token importance estimation），通过调整优化策略来适应标记间的依赖关系。这些策略旨在缓解压缩过程中出现的边界丢失、意外信息丢失和逐步信息丢失等失败模式，从而提升压缩方法的实用性和效果。

链接: https://arxiv.org/abs/2412.17483
作者: Chenlong Deng,Zhisong Zhang,Kelong Mao,Shuaiyi Li,Xinting Huang,Dong Yu,Zhicheng Dou
机构: Gaoling School of Artificial Intelligence, Renmin University of China(高瓴人工智能学院，中国人民大学); Tencent AI Lab(腾讯人工智能实验室)
关键词: improve long-context processing, large language models, improve long-context, long-context processing, processing in large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-document QA, it faces challenges in tasks like synthetic recall. Furthermore, we identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way. To mitigate these issues, we propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation, which adjusts optimization based on token dependencies. Our work provides valuable insights into the understanding of gist token-based context compression and offers practical strategies for improving compression capabilities.
zh

[NLP-26] A Survey on Multi-Generative Agent System: Recent Advances and New Frontiers

【速读】：该论文试图解决多生成式代理系统 (Multi-generative agent systems, MGAS) 研究领域中现有综述难以全面捕捉新进展的问题。解决方案的关键在于提供一个全面的综述框架，涵盖了MGAS的定义、应用领域（如解决复杂任务、模拟特定场景和评估生成式代理）以及当前面临的挑战和未来研究方向。通过系统性地梳理和总结现有研究，论文为该领域的进一步发展提供了清晰的指导。

链接: https://arxiv.org/abs/2412.17481
作者: Shuaihang Chen,Yuanxing Liu,Wei Han,Weinan Zhang,Ting Liu
机构: Harbin Institute of Technology, China(哈尔滨工业大学); Research Center for Social Computing and Information Retrieval(社会计算与信息检索研究中心)
关键词: large language models, Multi-generative agent systems, language models, rise of large, large language
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 13 pages, 1 figure

点击查看摘要

Abstract:Multi-generative agent systems (MGASs) have become a research hotspot since the rise of large language models (LLMs). However, with the continuous influx of new related works, the existing reviews struggle to capture them comprehensively. This paper presents a comprehensive survey of these studies. We first discuss the definition of MGAS, a framework encompassing much of previous work. We provide an overview of the various applications of MGAS in (i) solving complex tasks, (ii) simulating specific scenarios, and (iii) evaluating generative agents. Building on previous studies, we also highlight several challenges and propose future directions for research in this field.
zh

[NLP-27] Developmental Predictive Coding Model for Early Infancy Mono and Bilingual Vocal Continual Learning

【速读】：该论文试图解决婴儿如何感知语音和语言结构的问题，特别是语言声音在“关键期”内的习得以及后期婴儿对声音的模仿。解决方案的关键在于提出了一种基于预测编码的持续学习机制的小型生成神经网络，该网络能够在无学习参与的情况下进行组合优化生成。与依赖大规模数据集的深度网络不同，该模型通过在线学习机制不断更新，具有高适应性和对输入变化的响应能力，从而有效模拟了“感知窄化”现象，尤其是在后期婴儿期进行第二语言习得时面临的挑战。

链接: https://arxiv.org/abs/2412.17456
作者: Xiaodan Chen(ETIS, ASTAR, IPAL),Alexandre Pitti(ETIS, IPAL),Mathias Quoy(ETIS, IPAL),Nancy F Chen(ASTAR, IPAL)
机构: 未知
关键词: Understanding how infants, infants perceive speech, open problem, infants perceive, perceive speech sounds
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding how infants perceive speech sounds and language structures is still an open problem. Previous research in artificial neural networks has mainly focused on large dataset-dependent generative models, aiming to replicate language-related phenomena such as ‘‘perceptual narrowing’’. In this paper, we propose a novel approach using a small-sized generative neural network equipped with a continual learning mechanism based on predictive coding for mono-and bilingual speech sound learning (referred to as language sound acquisition during ‘‘critical period’’) and a compositional optimization mechanism for generation where no learning is involved (later infancy sound imitation). Our model prioritizes interpretability and demonstrates the advantages of online learning: Unlike deep networks requiring substantial offline training, our model continuously updates with new data, making it adaptable and responsive to changing inputs. Through experiments, we demonstrate that if second language acquisition occurs during later infancy, the challenges associated with learning a foreign language after the critical period amplify, replicating the perceptual narrowing effect.
zh

[NLP-28] Diving into Self-Evolving Training for Multimodal Reasoning

【速读】：该论文试图解决在缺乏多模态思维链（multimodal chain-of-thought）标注数据的情况下，如何有效提升大型多模态模型（Large Multimodal Models, LMMs）的推理能力问题。解决方案的关键在于自进化训练（self-evolving training），并深入探讨了影响训练效果的三个核心因素：训练方法（Training Method）、奖励模型（Reward Model）和提示变异（Prompt Variation）。通过系统分析这些因素及其不同配置对训练效果的影响，论文提出了一套最佳实践，并进一步研究了自进化动力学（Self-Evolution Dynamics）和自动平衡机制对性能提升的作用。最终，论文提出了一个名为MSTaR（Multimodal Self-evolving Training for Reasoning）的框架，该框架在多个基准测试中表现优异，无需额外的人类标注数据即可显著超越预进化模型。

链接: https://arxiv.org/abs/2412.17451
作者: Wei Liu,Junlong Li,Xiwen Zhang,Fan Zhou,Yu Cheng,Junxian He
机构: The Hong Kong University of Science and Technology; Shanghai Jiao Tong University; Helixon Research; The Chinese University of Hong Kong
关键词: Large Multimodal Models, self-evolving training, essential for Large, multimodal reasoning, Large Multimodal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities. Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training’s effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning. Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call MSTaR (Multimodal Self-evolving Training for Reasoning), which is universally effective for models with different sizes on various benchmarks, e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning benchmarks without using additional human annotations, as demonstrated on MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, is released to facilitate further investigation in multimodal reasoning.
zh

[NLP-29] Measuring Contextual Informativeness in Child-Directed Text COLING2025

【速读】：该论文旨在解决儿童故事中词汇丰富性（vocabulary enrichment）的自动评估问题，特别是如何衡量故事中目标词汇的语义传达效果。解决方案的关键在于提出了一个名为“测量儿童故事中的上下文信息量”的任务，并通过使用大型语言模型（LLM）来自动化这一任务。实验结果表明，基于LLM的方法在信息量评估上与人类判断的相关性（Spearman correlation）达到了0.4983，显著优于最强的基线方法（0.3534）。此外，该方法还展示了在成人导向文本中测量上下文信息量的泛化能力，同样优于所有基线方法。

链接: https://arxiv.org/abs/2412.17427
作者: Maria Valentini,Téa Wright,Ali Marashian,Jennifer Weber,Eliana Colunga,Katharina von der Wense
机构: University of Colorado Boulder(科罗拉多大学博尔德分校); Johannes Gutenberg University Mainz(约翰内斯古腾堡美因茨大学); University of California Berkeley(加州大学伯克利分校)
关键词: target vocabulary words, generating educational content, creating children stories, vocabulary enrichment, vocabulary words
类目: Computation and Language (cs.CL)
备注: COLING 2025 main conference short paper

点击查看摘要

Abstract:To address an important gap in creating children’s stories for vocabulary enrichment, we investigate the automatic evaluation of how well stories convey the semantics of target vocabulary words, a task with substantial implications for generating educational content. We motivate this task, which we call measuring contextual informativeness in children’s stories, and provide a formal task definition as well as a dataset for the task. We further propose a method for automating the task using a large language model (LLM). Our experiments show that our approach reaches a Spearman correlation of 0.4983 with human judgments of informativeness, while the strongest baseline only obtains a correlation of 0.3534. An additional analysis shows that the LLM-based approach is able to generalize to measuring contextual informativeness in adult-directed text, on which it also outperforms all baselines.
zh

[NLP-30] Just What You Desire: Constrained Timeline Summarization with Self-Reflection for Enhanced Relevance AAAI2025

【速读】：该论文试图解决时间线摘要任务（Timeline Summarization, TLS）中存在的模糊性问题，即不同读者对事件的关注点可能不同，导致没有单一的理想时间线。为此，论文提出了一个新的任务——约束时间线摘要（Constrained Timeline Summarization, CTLS），其关键在于生成满足特定约束条件的时间线。解决方案的核心是利用大型语言模型（LLM）根据指定的约束条件对新闻文章进行摘要，并通过聚类方法识别关键事件。此外，论文还提出了一种在摘要生成过程中进行自我反思的方法，以进一步提升性能。

链接: https://arxiv.org/abs/2412.17408
作者: Muhammad Reza Qorib,Qisheng Hu,Hwee Tou Ng
机构: National University of Singapore(新加坡国立大学); Nanyang Technological University(南洋理工大学)
关键词: Constrained Timeline Summarization, timeline summarization, timeline, Constrained Timeline, figure or organization
类目: Computation and Language (cs.CL)
备注: AAAI 2025 (with appendix)

点击查看摘要

Abstract:Given news articles about an entity, such as a public figure or organization, timeline summarization (TLS) involves generating a timeline that summarizes the key events about the entity. However, the TLS task is too underspecified, since what is of interest to each reader may vary, and hence there is not a single ideal or optimal timeline. In this paper, we introduce a novel task, called Constrained Timeline Summarization (CTLS), where a timeline is generated in which all events in the timeline meet some constraint. An example of a constrained timeline concerns the legal battles of Tiger Woods, where only events related to his legal problems are selected to appear in the timeline. We collected a new human-verified dataset of constrained timelines involving 47 entities and 5 constraints per entity. We propose an approach that employs a large language model (LLM) to summarize news articles according to a specified constraint and cluster them to identify key events to include in a constrained timeline. In addition, we propose a novel self-reflection method during summary generation, demonstrating that this approach successfully leads to improved performance.
zh

[NLP-31] WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models

【速读】：该论文试图解决代码大语言模型（LLMs）在数据收集和标注方面面临的挑战，特别是依赖高质量数据进行微调所导致的数据多样性和系统性偏差问题。解决方案的关键在于提出了一种名为 WarriorCoder 的新方法，通过创建一个专家代码 LLMs 的竞技场，让各个模型相互挑战并回应对方的挑战，由独立的评判模型进行评估。这种竞争框架能够从头生成新颖的训练数据，充分利用所有参与者的优势，从而在不依赖专有 LLMs 的情况下实现与现有方法相媲美的性能。

链接: https://arxiv.org/abs/2412.17395
作者: Huawen Feng,Pu Zhao,Qingfeng Sun,Can Xu,Fangkai Yang,Lu Wang,Qianli Ma,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Qi Zhang
机构: South China University of Technology(华南理工大学); Microsoft(微软)
关键词: recent progress achieved, code large language, large language models, collection and annotation, recent progress
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite recent progress achieved by code large language models (LLMs), their remarkable abilities are largely dependent on fine-tuning on the high-quality data, posing challenges for data collection and annotation. To address this, current methods often design various data flywheels to gather complex code instructions, enabling models to handle more intricate tasks. However, these approaches typically rely on off-the-shelf datasets and data augmentation from the limited pool of proprietary LLMs (e.g., Claude, GPT4, and so on), which limits the diversity of the constructed data and makes it prone to systemic biases. In this paper, we propose WarriorCoder which learns from expert battles to address these limitations. Specifically, we create an arena for current expert code LLMs, where each model challenges and responds to others’ challenges, with evaluations conducted by uninvolved judge models. This competitive framework generates novel training data constructed from scratch, harnessing the strengths of all participants. Experimental results demonstrate that WarriorCoder achieves competitive performance compared to previous methods, even without relying on proprietary LLMs.
zh

[NLP-32] Interweaving Memories of a Siamese Large Language Model AAAI2025

【速读】：该论文试图解决参数高效微调 (Parameter-efficient fine-tuning, PEFT) 方法在优化大型语言模型 (LLMs) 时可能导致的灾难性遗忘问题，即模型在适应下游任务时过度优先新知识而忽视了原有的全面世界知识。解决方案的关键在于提出了一个模型无关的 PEFT 框架 IMSM (Interweaves Memories of a Siamese Large Language Model)，通过引入孪生 LLM 并结合预训练和微调参数生成两种不同的记忆，利用交织机制 (interweaving mechanism) 调节原始记忆和增强记忆在生成下一个 token 时的贡献，从而有效缓解灾难性遗忘问题，同时保持与传统 PEFT 方法相当的时间和空间效率。

链接: https://arxiv.org/abs/2412.17383
作者: Xin Song,Zhikai Xue,Guoxiu He,Jiawei Liu,Wei Lu
机构: 未知
关键词: Parameter-efficient fine-tuning, optimize large language, large language models, large language, downstream tasks
类目: Computation and Language (cs.CL)
备注: Accepted by AAAI 2025 Main Conference

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) methods optimize large language models (LLMs) by modifying or introducing a small number of parameters to enhance alignment with downstream tasks. However, they can result in catastrophic forgetting, where LLMs prioritize new knowledge at the expense of comprehensive world knowledge. A promising approach to mitigate this issue is to recall prior memories based on the original knowledge. To this end, we propose a model-agnostic PEFT framework, IMSM, which Interweaves Memories of a Siamese Large Language Model. Specifically, our siamese LLM is equipped with an existing PEFT method. Given an incoming query, it generates two distinct memories based on the pre-trained and fine-tuned parameters. IMSM then incorporates an interweaving mechanism that regulates the contributions of both original and enhanced memories when generating the next token. This framework is theoretically applicable to all open-source LLMs and existing PEFT methods. We conduct extensive experiments across various benchmark datasets, evaluating the performance of popular open-source LLMs using the proposed IMSM, in comparison to both classical and leading PEFT methods. Our findings indicate that IMSM maintains comparable time and space efficiency to backbone PEFT methods while significantly improving performance and effectively mitigating catastrophic forgetting.
zh

[NLP-33] Boosting LLM via Learning from Data Iteratively and Selectively

【速读】：该论文试图解决在多源数据集上进行指令微调（instruction tuning）时，数据去噪和去重的问题。解决方案的关键在于提出了一种迭代数据选择方法（\ApproachName），通过同时考虑样本的复杂性（complexity）和多样性（diversity）来评估数据质量。与传统方法不同，该方法在微调过程中动态更新模型特定的复杂性评分，以适应模型的动态变化，并通过贪婪选择具有最高复杂性-多样性评分的样本来优化数据选择过程。实验结果表明，该方法在多个指令微调数据集上均优于强基线模型，并能很好地泛化到特定领域和不同骨干模型。

链接: https://arxiv.org/abs/2412.17365
作者: Qi Jia,Siyu Ren,Ziheng Qin,Fuzhao Xue,Jinjie Ni,Yang You
机构: 未知
关键词: making data de-noising, Datasets nowadays, synthetic techniques, nowadays are generally, generally constructed
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Datasets nowadays are generally constructed from multiple sources and using different synthetic techniques, making data de-noising and de-duplication crucial before being used for post-training. In this work, we propose to perform instruction tuning by iterative data selection (\ApproachName). We measure the quality of a sample from complexity and diversity simultaneously. Instead of calculating the complexity score once for all before fine-tuning, we highlight the importance of updating this model-specific score during fine-tuning to accurately accommodate the dynamic changes of the model. On the other hand, the diversity score is defined on top of the samples’ responses under the consideration of their informativeness. IterIT integrates the strengths of both worlds by iteratively updating the complexity score for the top-ranked samples and greedily selecting the ones with the highest complexity-diversity score. Experiments on multiple instruction-tuning data demonstrate consistent improvements of IterIT over strong baselines. Moreover, our approach also generalizes well to domain-specific scenarios and different backbone models. All resources will be available at this https URL.
zh

[NLP-34] An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification WWW

【速读】：该论文旨在研究三种流行的分词工具（tokenization tools）——MeCab、Sudachi 和 SentencePiece 在日语文本情感分类预处理步骤中的性能表现。解决方案的关键在于评估这些工具在结合TF-IDF向量化和传统机器学习分类器（Multinomial Naive Bayes 和 Logistic Regression）时的分类效果。研究表明，Sudachi 生成的分词更符合词典定义，而 MeCab 和 SentencePiece 在处理速度上更具优势。最终，SentencePiece 结合 TF-IDF 和 Logistic Regression 的组合在分类性能上表现最佳。

链接: https://arxiv.org/abs/2412.17361
作者: Andre Rusli,Makoto Shishido
机构: 未知
关键词: popular tokenization tools, Multinomial Naive Bayes, Japanese texts, sentiment-based text classification, Frequency-Inverse Document Frequency
类目: Computation and Language (cs.CL)
备注: Accepted at The 27th Annual Meeting of the Association for Natural Language Processing (NLP2021). Published version available at: this https URL

点击查看摘要

Abstract:This study investigates the performance of three popular tokenization tools: MeCab, Sudachi, and SentencePiece, when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.
zh

[NLP-35] hree-Class Text Sentiment Analysis Based on LSTM

【速读】：该论文试图解决微博评论中的情感分类问题，旨在通过三分类方法区分正面、中性和负面情感。解决方案的关键在于使用长短期记忆网络（LSTM），该深度学习模型能够有效捕捉文本数据中的长距离依赖关系，从而在情感分析中提供显著优势。通过预处理和特征提取，LSTM模型实现了高精度的情感预测，实验结果显示其准确率和F1分数分别达到98.31%和98.28%，显著优于传统机器学习方法和其他深度学习模型。

链接: https://arxiv.org/abs/2412.17347
作者: Yin Qixuan
机构: Zhongnan University of Economics and Law (中南财经政法大学)
关键词: public opinion monitoring, Long Short-Term Memory, natural language processing, market research, opinion monitoring
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sentiment analysis is a crucial task in natural language processing (NLP) with applications in public opinion monitoring, market research, and beyond. This paper introduces a three-class sentiment classification method for Weibo comments using Long Short-Term Memory (LSTM) networks to discern positive, neutral, and negative sentiments. LSTM, as a deep learning model, excels at capturing long-distance dependencies in text data, providing significant advantages over traditional machine learning approaches. Through preprocessing and feature extraction from Weibo comment texts, our LSTM model achieves precise sentiment prediction. Experimental results demonstrate superior performance, achieving an accuracy of 98.31% and an F1 score of 98.28%, notably outperforming conventional models and other deep learning methods. This underscores the effectiveness of LSTM in capturing nuanced sentiment information within text, thereby enhancing classification accuracy. Despite its strengths, the LSTM model faces challenges such as high computational complexity and slower processing times for lengthy texts. Moreover, complex emotional expressions like sarcasm and humor pose additional difficulties. Future work could explore combining pre-trained models or advancing feature engineering techniques to further improve both accuracy and practicality. Overall, this study provides an effective solution for sentiment analysis on Weibo comments.
zh

[NLP-36] MineAgent : Towards Remote-Sensing Mineral Exploration with Multimodal Large Language Models

【速读】：该论文试图解决多模态大语言模型（MLLMs）在遥感矿产勘探中的应用难题，特别是领域特定地质知识的局限性和多幅遥感图像间的推理困难，以及长上下文处理问题。解决方案的关键在于提出了MineAgent框架，该框架通过分层判断和决策模块来增强多图像推理和空间-光谱信息的整合能力。此外，论文还引入了MineBench基准，用于评估MLLMs在矿产勘探任务中的表现，特别是基于地质和超光谱数据的任务。实验结果表明MineAgent的有效性，展示了其在推进遥感矿产勘探中MLLMs应用的潜力。

链接: https://arxiv.org/abs/2412.17339
作者: Beibei Yu,Tao Shen,Hongbin Na,Ling Chen,Denqi Li
机构: Australian Artificial Intelligence Institute, University of Technology Sydney(澳大利亚人工智能研究所，悉尼科技大学); Faculty of Science and Engineering, Curtin University(科学与工程学院，科廷大学)
关键词: large language models, identifying economically viable, poses significant challenges, multimodal large language, viable mineral deposits
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Remote-sensing mineral exploration is critical for identifying economically viable mineral deposits, yet it poses significant challenges for multimodal large language models (MLLMs). These include limitations in domain-specific geological knowledge and difficulties in reasoning across multiple remote-sensing images, further exacerbating long-context issues. To address these, we present MineAgent, a modular framework leveraging hierarchical judging and decision-making modules to improve multi-image reasoning and spatial-spectral integration. Complementing this, we propose MineBench, a benchmark specific for evaluating MLLMs in domain-specific mineral exploration tasks using geological and hyperspectral data. Extensive experiments demonstrate the effectiveness of MineAgent, highlighting its potential to advance MLLMs in remote-sensing mineral exploration.
zh

[NLP-37] A Dual-Perspective Metaphor Detection Framework Using Large Language Models ICASSP2025

【速读】：该论文试图解决隐喻检测（metaphor detection）任务中传统监督学习模型决策过程不透明的问题。解决方案的关键在于提出了一种双视角框架（DMD），该框架结合了隐喻理论的隐式和显式应用，以指导大型语言模型（LLMs）进行隐喻检测，并通过自判断机制验证不同指导方式的响应，从而提高了推理过程的透明度和预测的可靠性。

链接: https://arxiv.org/abs/2412.17332
作者: Yujie Lin,Jingyao Liu,Yan Gao,Ante Wang,Jinsong Su
机构: School of Informatics; Xiamen University
关键词: natural language processing, involves identifying, critical task, task in natural, Metaphor detection
类目: Computation and Language (cs.CL)
备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:Metaphor detection, a critical task in natural language processing, involves identifying whether a particular word in a sentence is used metaphorically. Traditional approaches often rely on supervised learning models that implicitly encode semantic relationships based on metaphor theories. However, these methods often suffer from a lack of transparency in their decision-making processes, which undermines the reliability of their predictions. Recent research indicates that LLMs (large language models) exhibit significant potential in metaphor detection. Nevertheless, their reasoning capabilities are constrained by predefined knowledge graphs. To overcome these limitations, we propose DMD, a novel dual-perspective framework that harnesses both implicit and explicit applications of metaphor theories to guide LLMs in metaphor detection and adopts a self-judgment mechanism to validate the responses from the aforementioned forms of guidance. In comparison to previous methods, our framework offers more transparent reasoning processes and delivers more reliable predictions. Experimental results prove the effectiveness of DMD, demonstrating state-of-the-art performance across widely-used datasets.
zh

[NLP-38] Assessing Human Editing Effort on LLM -Generated Texts via Compression-Based Edit Distance

【速读】：该论文试图解决现有编辑距离度量（如Levenshtein、BLEU、ROUGE和TER）在评估大型语言模型（LLMs）生成文本的人工编辑量时，无法准确衡量涉及大量修改（如块操作）的编辑工作量的问题。解决方案的关键在于引入了一种基于Lempel-Ziv-77算法的压缩式编辑距离度量，通过利用文本压缩的特性来量化原始文本与编辑后文本之间的信息差异。该方法不仅在实际编辑时间和工作量的评估上表现出高度相关性，还能捕捉复杂的编辑操作，并具有线性计算效率的优势。

链接: https://arxiv.org/abs/2412.17321
作者: Nicolas Devatine,Louis Abraham
机构: Tiime, Paris (Tiime, 巴黎)
关键词: Large Language Models, Language Models, Large Language, Assessing the extent, generated by Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Assessing the extent of human edits on texts generated by Large Language Models (LLMs) is crucial to understanding the human-AI interactions and improving the quality of automated text generation systems. Existing edit distance metrics, such as Levenshtein, BLEU, ROUGE, and TER, often fail to accurately measure the effort required for post-editing, especially when edits involve substantial modifications, such as block operations. In this paper, we introduce a novel compression-based edit distance metric grounded in the Lempel-Ziv-77 algorithm, designed to quantify the amount of post-editing applied to LLM-generated texts. Our method leverages the properties of text compression to measure the informational difference between the original and edited texts. Through experiments on real-world human edits datasets, we demonstrate that our proposed metric is highly correlated with actual edit time and effort. We also show that LLMs exhibit an implicit understanding of editing speed, that aligns well with our metric. Furthermore, we compare our metric with existing ones, highlighting its advantages in capturing complex edits with linear computational efficiency. Our code and data are available at: this https URL
zh

[NLP-39] Fast Gradient Computation for RoPE Attention in Almost Linear Time

【速读】：该论文试图解决在基于旋转位置编码（Rotary Position Embedding, RoPE）的注意力机制中，反向传播计算复杂度过高的问题。解决方案的关键在于开发了一种几乎线性时间（即 (n^{1+o(1)})，其中 (n) 是输入token的数量）的反向传播算法，该算法在输入项有界的情况下有效。这一方法结合了多项式方法和快速傅里叶变换（Fast Fourier Transform），并基于对强指数时间假设（Strong Exponential Time Hypothesis, SETH）的下界分析，证明了有界输入条件对于实现次二次性能的必要性。

链接: https://arxiv.org/abs/2412.17316
作者: Yifang Chen,Jiayan Huo,Xiaoyu Li,Yingyu Liang,Zhenmei Shi,Zhao Song
机构: The University of Chicago; University of Arizona; Independent Researcher; The University of Hong Kong; University of Wisconsin-Madison; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley
关键词: Rotary Position Embedding, Position Embedding, Rotary Position, encoding positional information, capture token relationships
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Rotary Position Embedding (RoPE) mechanism has become a powerful enhancement to the Transformer architecture, which enables models to capture token relationships when encoding positional information. However, the RoPE mechanisms make the computations of attention mechanisms more complicated, which makes efficient algorithms challenging. Earlier research introduced almost linear time, i.e., n^1+o(1) where n is the number of input tokens, algorithms for the forward computation under specific parameter settings. However, achieving a subquadratic time algorithm for other parameter regimes remains impossible unless the widely accepted Strong Exponential Time Hypothesis (SETH) is disproven. In this work, we develop the first almost linear time algorithm for backward computations in the RoPE-based attention under bounded entries. Our approach builds on recent advancements in fast RoPE attention computations, utilizing a novel combination of the polynomial method and the Fast Fourier Transform. Furthermore, we show that with lower bounds derived from the SETH, the bounded entry condition is necessary for subquadratic performance.
zh

[NLP-40] CodeV: Issue Resolving with Visual Data

【速读】：该论文试图解决在软件工程中利用大型语言模型 (Large Language Models, LLMs) 处理GitHub问题时，现有方法忽视视觉数据的问题。解决方案的关键在于提出了CodeV，这是首个利用视觉数据增强LLMs问题解决能力的方案。CodeV通过数据处理和补丁生成两个阶段来解决每个问题，并构建了Visual SWE-bench基准来评估其效果，展示了视觉数据在提升GitHub问题解决中的重要性。

链接: https://arxiv.org/abs/2412.17315
作者: Linhao Zhang,Daoguang Zan,Quanshun Yang,Zhirong Huang,Dong Chen,Bo Shen,Tianyu Liu,Yongshun Gong,Pengjie Huang,Xudong Lu,Guangtai Liang,Lizhen Cui,Qianxiang Wang
机构: Shandong University(山东大学); Chinese Academy of Science(中国科学院); Huawei Co., Ltd(华为有限公司); Peking University(北京大学); Lingzhi-zhiguang Co., Ltd(灵芝之光有限公司)
关键词: Large Language Models, Large Language, Language Models, software engineering expanding, complex repository-level tasks
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have advanced rapidly in recent years, with their applications in software engineering expanding to more complex repository-level tasks. GitHub issue resolving is a key challenge among these tasks. While recent approaches have made progress on this task, they focus on textual data within issues, neglecting visual data. However, this visual data is crucial for resolving issues as it conveys additional knowledge that text alone cannot. We propose CodeV, the first approach to leveraging visual data to enhance the issue-resolving capabilities of LLMs. CodeV resolves each issue by following a two-phase process: data processing and patch generation. To evaluate CodeV, we construct a benchmark for visual issue resolving, namely Visual SWE-bench. Through extensive experiments, we demonstrate the effectiveness of CodeV, as well as provide valuable insights into leveraging visual data to resolve GitHub issues.
zh

[NLP-41] Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding AAAI2025

【速读】：该论文试图解决多模态多方对话 (Multi-modal multi-party conversation, MMC) 中的两个关键任务：对话发言人识别和对话响应预测。解决方案的关键在于构建了一个名为 Friends-MMC 的数据集，该数据集包含超过 24,000 条配对视频上下文的独特话语，并详细标注了每条话语的发言人、视频中出现的面部名称和边界框。针对对话发言人识别，论文指出预训练模型的不足，并提出了一种利用优化求解器结合视觉和文本上下文的新基线方法，以提升识别性能。对于对话响应预测，论文通过在 Friends-MMC 数据集上微调生成式对话模型，分析了发言人信息对模型性能的增益。

链接: https://arxiv.org/abs/2412.17295
作者: Yueqian Wang,Xiaojun Meng,Yuxuan Wang,Jianxin Liang,Qun Liu,Dongyan Zhao
机构: 未知
关键词: fits real-world scenarios, Multi-modal multi-party conversation, widely-used applications, traditional multi-modal conversations, studied yet important
类目: Computation and Language (cs.CL)
备注: Published at AAAI 2025

点击查看摘要

Abstract:Multi-modal multi-party conversation (MMC) is a less studied yet important topic of research due to that it well fits real-world scenarios and thus potentially has more widely-used applications. Compared with the traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities as there are many interlocutors appearing in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC in this paper, an MMC dataset that contains 24,000+ unique utterances paired with video context. To explore the character-centered understanding of the dialogue, we also annotate the speaker of each utterance, the names and bounding bboxes of faces that appear in the video. Based on this Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have the multi-party nature with the video or image as visual context. For conversation speaker identification, we demonstrate the inefficiencies of existing methods such as pre-trained models, and propose a simple yet effective baseline method that leverages an optimization solver to utilize the context of two modalities to achieve better performance. For conversation response prediction, we fine-tune generative dialogue models on Friend-MMC, and analyze the benefits of speaker information. The code and dataset is publicly available at this https URL and thus we call for more attention on modeling speaker information when understanding conversations.
zh

[NLP-42] Learning from Mistakes: Self-correct Adversarial Training for Chinese Unnatural Text Correction AAAI2025

【速读】：该论文试图解决自然语言文本中的非自然错误校正问题，特别是由于训练数据与实际应用场景之间的数据分布差异导致的泛化性能差（exposure bias problem）。解决方案的关键在于提出了一个自校正对抗训练框架（LIMIT），该框架通过利用模型在推理阶段主动暴露的错误（即与目标不一致的预测）进行训练，从而模拟真实应用场景中的潜在错误，并缓解传统训练过程中的暴露偏差问题。此外，论文还设计了一种新的解码干预策略，以保持语义一致性。实验结果表明，该方法在多种错误形式上表现优异，并可作为即插即用的防御模块，适用于新的模型和数据集，无需额外训练。

链接: https://arxiv.org/abs/2412.17279
作者: Xuan Feng,Tianlong Gu,Xiaoli Liu,Liang Chang
机构: 1. School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学计算机科学与工程学院);
2. School of Information and Software Engineering, University of Electronic Science and Technology of China (电子科技大学信息与软件工程学院);
3. School of Mathematical Sciences, University of Electronic Science and Technology of China (电子科技大学数学科学学院);
4. School of Information Science and Technology, Tsinghua University (清华大学信息科学与技术学院)
关键词: textbf, aims to automatically, automatically detect, adversarial perturbation errors, errors
类目: Computation and Language (cs.CL)
备注: AAAI 2025

点击查看摘要

Abstract:Unnatural text correction aims to automatically detect and correct spelling errors or adversarial perturbation errors in sentences. Existing methods typically rely on fine-tuning or adversarial training to correct errors, which have achieved significant success. However, these methods exhibit poor generalization performance due to the difference in data distribution between training data and real-world scenarios, known as the exposure bias problem. In this paper, we propose a self-correct adversarial training framework for \textbfLearn\textbfIng from \textbfMIs\textbfTakes (\textbfLIMIT), which is a task- and model-independent framework to correct unnatural errors or mistakes. Specifically, we fully utilize errors generated by the model that are actively exposed during the inference phase, i.e., predictions that are inconsistent with the target. This training method not only simulates potential errors in real application scenarios, but also mitigates the exposure bias of the traditional training process. Meanwhile, we design a novel decoding intervention strategy to maintain semantic consistency. Extensive experimental results on Chinese unnatural text error correction datasets show that our proposed method can correct multiple forms of errors and outperforms the state-of-the-art text correction methods. In addition, extensive results on Chinese and English datasets validate that LIMIT can serve as a plug-and-play defense module and can extend to new models and datasets without further training.
zh

[NLP-43] LegalAgent Bench: Evaluating LLM Agents in Legal Domain

【速读】：该论文试图解决现有通用领域基准无法全面捕捉现实世界司法认知和决策复杂性与细微差别的问题。解决方案的关键在于提出了LegalAgentBench，这是一个专门针对中国法律领域的综合基准，用于评估大型语言模型（LLM）代理。LegalAgentBench包含17个来自真实法律场景的语料库，并提供了37种与外部知识交互的工具。通过设计可扩展的任务构建框架并精心标注300个任务，涵盖多跳推理和写作等多种类型及不同难度级别，有效反映了现实法律场景的复杂性。此外，该基准不仅评估最终成功率，还通过中间过程的关键词分析计算进展率，实现更细粒度的评估。

链接: https://arxiv.org/abs/2412.17259
作者: Haitao Li,Junjie Chen,Jingli Yang,Qingyao Ai,Wei Jia,Youfeng Liu,Kai Lin,Yueyue Wu,Guozhi Yuan,Yiran Hu,Wuyue Wang,Yiqun Liu,Minlie Huang
机构: Tsinghua University(清华大学); Zhipu AI; Shanghai Amarsoft Enterprise Credit Information Service Co Ltd(上海安硕企业信用信息服务有限公司); Central South University(中南大学); University of Waterloo(滑铁卢大学); University of Notre Dame(圣母大学)
关键词: increasingly apparent, increasing intelligence, intelligence and autonomy, LLM agents, Chinese legal domain
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 23 pages

点击查看摘要

Abstract:With the increasing intelligence and autonomy of LLM agents, their potential applications in the legal domain are becoming increasingly apparent. However, existing general-domain benchmarks cannot fully capture the complexity and subtle nuances of real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. We designed a scalable task construction framework and carefully annotated 300 tasks. These tasks span various types, including multi-hop reasoning and writing, and range across different difficulty levels, effectively reflecting the complexity of real-world legal scenarios. Moreover, beyond evaluating final success, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, enabling more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at \urlthis https URL.
zh

[NLP-44] B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

【速读】：该论文试图解决在缺乏大量人工标注数据的情况下，复杂推理任务中模型自改进（self-improvement）的有效性和瓶颈问题。解决方案的关键在于识别并监控迭代过程中两个关键因素：(1) 模型生成多样化响应的能力（exploration）；(2) 外部奖励区分高质量和低质量候选者的有效性（exploitation）。通过引入B-STaR（Self-Taught Reasoning）框架，该框架能够在迭代过程中自主调整配置，平衡探索和利用，从而优化自改进效果，提升模型性能。实验结果表明，B-STaR不仅增强了模型的探索能力，还实现了探索与利用之间的更有效平衡，显著提升了模型在数学推理、编程和常识推理等任务中的表现。

链接: https://arxiv.org/abs/2412.17256
作者: Weihao Zeng,Yuzhen Huang,Lulu Zhao,Yijun Wang,Zifei Shan,Junxian He
机构: The Hong Kong University of Science and Technology (香港科技大学); BAAI; Tencent (腾讯)
关键词: extensive human-annotated data, complex reasoning tasks, absence of extensive, extensive human-annotated, human-annotated data
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement – where models are trained on their own outputs – has emerged as a primary method for enhancing performance. However, the critical factors underlying the mechanism of these iterative self-improving methods remain poorly understood, such as under what conditions self-improvement is effective, and what are the bottlenecks in the current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model’s ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation). Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model’s exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-improving effectiveness based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model’s exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.
zh

[NLP-45] Unlocking Cross-Lingual Sentiment Analysis through Emoji Interpretation: A Multimodal Generative AI Approach

【速读】：该论文试图解决表情符号（emojis）作为跨语言和文化背景下的通用情感指示器的能力问题。解决方案的关键在于利用大型语言模型（LLMs），特别是ChatGPT的多模态能力，来评估表情符号传达情感的准确性。研究通过分析来自32个国家的多语言数据集，发现基于LLM的表情符号情感传达准确率达到81.43%，并观察到随着文本中表情符号数量的增加，情感传达的准确性呈上升趋势。这一结果强调了表情符号作为全球情感指示器的潜力，特别是在跨语言和跨文化的社交媒体情感分析领域。

链接: https://arxiv.org/abs/2412.17255
作者: Rafid Ishrak Jahan,Heng Fan,Haihua Chen,Yunhe Feng
机构: University of North Texas (北德克萨斯大学)
关键词: online communication, decorative elements, ubiquitous in online, medium to convey, convey emotions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emojis have become ubiquitous in online communication, serving as a universal medium to convey emotions and decorative elements. Their widespread use transcends language and cultural barriers, enhancing understanding and fostering more inclusive interactions. While existing work gained valuable insight into emojis understanding, exploring emojis’ capability to serve as a universal sentiment indicator leveraging large language models (LLMs) has not been thoroughly examined. Our study aims to investigate the capacity of emojis to serve as reliable sentiment markers through LLMs across languages and cultures. We leveraged the multimodal capabilities of ChatGPT to explore the sentiments of various representations of emojis and evaluated how well emoji-conveyed sentiment aligned with text sentiment on a multi-lingual dataset collected from 32 countries. Our analysis reveals that the accuracy of LLM-based emoji-conveyed sentiment is 81.43%, underscoring emojis’ significant potential to serve as a universal sentiment marker. We also found a consistent trend that the accuracy of sentiment conveyed by emojis increased as the number of emojis grew in text. The results reinforce the potential of emojis to serve as global sentiment indicators, offering insight into fields such as cross-lingual and cross-cultural sentiment analysis on social media platforms. Code: this https URL.
zh

[NLP-46] On the Generalization Ability of Machine-Generated Text Detectors

【速读】：该论文试图解决机器生成文本 (Machine-Generated Text, MGT) 检测系统的泛化能力问题，特别是在学术写作领域中的应用。解决方案的关键在于构建了一个大规模的学术写作数据集 MGTAcademic，涵盖了不同学科领域的人类写作文本 (Human-Written Texts, HWTs) 和机器生成文本 (MGTs)，并设计了一个可扩展的代码框架用于高效基准测试。此外，论文还研究了检测器在不同领域和不同大语言模型 (Large Language Models, LLMs) 之间的迁移能力，通过细粒度数据集和少样本学习技术提升了约13.2%的性能。最后，论文引入了一个新的归因任务，要求模型在几乎没有或仅有有限先验训练数据的情况下适应新类别，并通过多种适应技术提升了约10%的性能。这些研究为构建鲁棒且适应性强的MGT检测系统奠定了基础。

链接: https://arxiv.org/abs/2412.17242
作者: Yule Liu,Zhiyuan Zhong,Yifan Liao,Zhen Sun,Jingyi Zheng,Jiaheng Wei,Qingyuan Gong,Fenghua Tong,Yang Chen,Yang Zhang,Xinlei He
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Southern University of Science and Technology(南方科技大学); National University of Singapore (Chongqing Research Institute)(新加坡国立大学(重庆研究院)); Fudan University(复旦大学); Qilu University of Technology(齐鲁工业大学); CISPA Helmholtz Center for Information Security(亥姆霍兹信息安全中心)
关键词: large language models, including ethical, plagiarism and misinformation, rise of large, large language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of large language models (LLMs) has raised concerns about machine-generated text (MGT), including ethical and practical issues like plagiarism and misinformation. Building a robust and highly generalizable MGT detection system has become increasingly important. This work investigates the generalization capabilities of MGT detectors in three aspects: First, we construct MGTAcademic, a large-scale dataset focused on academic writing, featuring human-written texts (HWTs) and MGTs across STEM, Humanities, and Social Sciences, paired with an extensible code framework for efficient benchmarking. Second, we investigate the transferability of detectors across domains and LLMs, leveraging fine-grained datasets to reveal insights into domain transferring and implementing few-shot techniques to improve the performance by roughly 13.2%. Third, we introduce a novel attribution task where models must adapt to new classes over time without (or with very limited) access to prior training data and benchmark detectors. We implement several adapting techniques to improve the performance by roughly 10% and highlight the inherent complexity of the task. Our findings provide insights into the generalization ability of MGT detectors across diverse scenarios and lay the foundation for building robust, adaptive detection systems.
zh

[NLP-47] Brain-to-Text Benchmark 24: Lessons Learned

【速读】：该论文旨在解决从神经活动直接解码为文本的问题，以恢复因瘫痪而失去语言能力的人的沟通能力。解决方案的关键在于采用集成方法 (ensembling approach)，即通过结合多个独立解码器的输出，并使用经过微调的大型语言模型 (large language model) 进行融合，从而显著提升解码准确性。此外，优化基线循环神经网络 (RNN) 模型的训练过程，包括学习率调度和使用双音素训练目标 (diphone training objective)，也带来了性能提升。尽管尝试使用深度状态空间模型或Transformer架构尚未显示出优于RNN基线的优势，但这些方法仍为未来的改进提供了方向。

链接: https://arxiv.org/abs/2412.17227
作者: Francis R. Willett,Jingyuan Li,Trung Le,Chaofei Fan,Mingfei Chen,Eli Shlizerman,Yue Chen,Xin Zheng,Tatsuo S. Okubo,Tyler Benster,Hyun Dong Lee,Maxwell Kounga,E. Kelly Buchanan,David Zoltowski,Scott W. Linderman,Jaimie M. Henderson
机构: Stanford University (斯坦福大学); Department of Electrical & Computer Engineering, University of Washington (华盛顿大学电气与计算机工程系); Department of Computer Science, Stanford University (斯坦福大学计算机科学系); Department of Applied Mathematics, University of Washington (华盛顿大学应用数学系); Beijing Institute for Brain Research, Chinese Academy of Medical Sciences (中国医学科学院北京脑科学研究所); Peking Union Medical College (北京协和医学院); Chinese Institute for Brain Research, Beijing (CIBR) (北京脑科学研究所); Wu Tsai Neurosciences Institute and Department of Statistics, Stanford University (斯坦福大学吴蔡神经科学研究所和统计学系)
关键词: Speech brain-computer interfaces, brain-computer interfaces aim, Speech brain-computer, convert neural activity, restoring communication
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Speech brain-computer interfaces aim to decipher what a person is trying to say from neural activity alone, restoring communication to people with paralysis who have lost the ability to speak intelligibly. The Brain-to-Text Benchmark '24 and associated competition was created to foster the advancement of decoding algorithms that convert neural activity to text. Here, we summarize the lessons learned from the competition ending on June 1, 2024 (the top 4 entrants also presented their experiences in a recorded webinar). The largest improvements in accuracy were achieved using an ensembling approach, where the output of multiple independent decoders was merged using a fine-tuned large language model (an approach used by all 3 top entrants). Performance gains were also found by improving how the baseline recurrent neural network (RNN) model was trained, including by optimizing learning rate scheduling and by using a diphone training objective. Improving upon the model architecture itself proved more difficult, however, with attempts to use deep state space models or transformers not yet appearing to offer a benefit over the RNN baseline. The benchmark will remain open indefinitely to support further work towards increasing the accuracy of brain-to-text algorithms.
zh

[NLP-48] COVID-19 on YouTube: A Data-Driven Analysis of Sentiment Toxicity and Content Recommendations

【速读】：该论文试图解决COVID-19相关内容在YouTube上的话语分析问题，并开发一个推荐系统以支持用户获取相关且负责任的内容。解决方案的关键在于应用先进的自然语言处理（NLP）技术，包括情感分析（VADER）、毒性检测（Detoxify）和主题建模（LDA），以识别视频内容的情感、毒性和主题模式。通过这些分析，论文揭示了YouTube上COVID-19相关内容的主要主题，并基于TF-IDF向量化和余弦相似度开发了一个推荐系统，该系统通过情感、毒性和主题过滤器进行优化，确保推荐内容的准确性和相关性。该推荐系统在覆盖率和性能上表现出色，能够有效支持用户在YouTube上获取与COVID-19相关的信息。

链接: https://arxiv.org/abs/2412.17180
作者: Vanessa Su,Nirmalya Thakur
机构: 未知
关键词: Latent Dirichlet Allocation, published between January, video content published, thematic patterns, Dirichlet Allocation
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study presents a data-driven analysis of COVID-19 discourse on YouTube, examining the sentiment, toxicity, and thematic patterns of video content published between January 2023 and October 2024. The analysis involved applying advanced natural language processing (NLP) techniques: sentiment analysis with VADER, toxicity detection with Detoxify, and topic modeling using Latent Dirichlet Allocation (LDA). The sentiment analysis revealed that 49.32% of video descriptions were positive, 36.63% were neutral, and 14.05% were negative, indicating a generally informative and supportive tone in pandemic-related content. Toxicity analysis identified only 0.91% of content as toxic, suggesting minimal exposure to toxic content. Topic modeling revealed two main themes, with 66.74% of the videos covering general health information and pandemic-related impacts and 33.26% focused on news and real-time updates, highlighting the dual informational role of YouTube. A recommendation system was also developed using TF-IDF vectorization and cosine similarity, refined by sentiment, toxicity, and topic filters to ensure relevant and context-aligned video recommendations. This system achieved 69% aggregate coverage, with monthly coverage rates consistently above 85%, demonstrating robust performance and adaptability over time. Evaluation across recommendation sizes showed coverage reaching 69% for five video recommendations and 79% for ten video recommendations per video. In summary, this work presents a framework for understanding COVID-19 discourse on YouTube and a recommendation system that supports user engagement while promoting responsible and relevant content related to COVID-19.
zh

[NLP-49] A Multi-AI Agent System for Autonomous Optimization of Agent ic AI Solutions via Iterative Refinement and LLM -Driven Feedback Loops

【速读】：该论文试图解决Agentic AI系统在复杂工作流程中自动化和效率优化的挑战，特别是手动调整角色、任务和交互的劳动密集型问题。解决方案的关键在于引入一个自主优化框架，该框架通过迭代反馈循环（由LLM Llama 3.2-3B驱动）实现自主生成和测试假设，从而优化系统配置。该框架通过Refinement、Execution、Evaluation、Modification和Documentation五个代理模块，实现了无需人工干预的最优性能，显著提升了系统的可扩展性和适应性，适用于动态环境中的实际应用。

链接: https://arxiv.org/abs/2412.17149
作者: Kamer Ali Yuksel,Hassan Sawaf
机构: aiXplain Inc.(aiXplain公司)
关键词: complex workflows, enabling automation, automation and efficiency, autonomously optimizing Agentic, handle tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Agentic AI systems use specialized agents to handle tasks within complex workflows, enabling automation and efficiency. However, optimizing these systems often requires labor-intensive, manual adjustments to refine roles, tasks, and interactions. This paper introduces a framework for autonomously optimizing Agentic AI solutions across industries, such as NLP-driven enterprise applications. The system employs agents for Refinement, Execution, Evaluation, Modification, and Documentation, leveraging iterative feedback loops powered by an LLM (Llama 3.2-3B). The framework achieves optimal performance without human input by autonomously generating and testing hypotheses to improve system configurations. This approach enhances scalability and adaptability, offering a robust solution for real-world applications in dynamic environments. Case studies across diverse domains illustrate the transformative impact of this framework, showcasing significant improvements in output quality, relevance, and actionability. All data for these case studies, including original and evolved agent codes, along with their outputs, are here: this https URL
zh

[NLP-50] Hate Speech Detection and Target Identification in Devanagari Languages via Parameter Efficient Fine-Tuning of LLM s

【速读】：该论文试图解决在基于天城文（Devanagari）的语言中进行仇恨言论检测的问题，特别是在资源和工具匮乏的情况下。解决方案的关键在于采用参数高效微调（Parameter Efficient Fine tuning, PEFT）方法，以克服传统微调大型语言模型（LLMs）时的高计算成本和资源需求。通过在由Thapa等人（2025）提供的天城文数据集（包含印地语和尼泊尔语的标注实例）上评估多个LLMs，研究展示了该方法在处理天城文内容时的有效性。

链接: https://arxiv.org/abs/2412.17131
作者: Rushendra Sidibomma,Pransh Patwa,Parth Patwa,Aman Chadha,Vinija Jain,Amitava Das
机构: IIIT Sri City, India; Aditya English Medium School, India; UCLA, USA; Stanford University, USA; Amazon GenAI, USA; University of South Carolina, USA
关键词: combating online hostility, hate speech detection, hate speech, real-world consequences, increasingly important
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The detection of hate speech has become increasingly important in combating online hostility and its real-world consequences. Despite recent advancements, there is limited research addressing hate speech detection in Devanagari-scripted languages, where resources and tools are scarce. While large language models (LLMs) have shown promise in language-related tasks, traditional fine-tuning approaches are often infeasible given the size of the models. In this paper, we propose a Parameter Efficient Fine tuning (PEFT) based solution for hate speech detection and target identification. We evaluate multiple LLMs on the Devanagari dataset provided by (Thapa et al., 2025), which contains annotated instances in 2 languages - Hindi and Nepali. The results demonstrate the efficacy of our approach in handling Devanagari-scripted content.
zh

[NLP-51] Lies Damned Lies and Distributional Language Statistics: Persuasion and Deception with Large Language Models

【速读】：该论文试图解决大型语言模型（LLMs）在生成具有说服力和欺骗性内容方面的潜在风险问题。解决方案的关键在于综合分析LLMs在说服和欺骗方面的能力和倾向，评估理论风险，并探讨可能的缓解策略。论文指出，尽管当前的说服效果相对较小，但通过微调（fine-tuning）、多模态（multimodality）和社会因素等机制，其影响可能增加。此外，论文还提出了未来研究的关键问题，如AI系统的说服力可能达到何种程度、真相是否在本质上优于谎言，以及不同缓解策略在实践中的有效性。

链接: https://arxiv.org/abs/2412.17128
作者: Cameron R. Jones,Benjamin K. Bergen
机构: University of California, San Diego (加利福尼亚大学圣地亚哥分校); Department of Cognitive Science (认知科学系)
关键词: Large Language Models, producing deceptive outputs, selectively producing deceptive, Large Language, Language Models
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 37 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) can generate content that is as persuasive as human-written text and appear capable of selectively producing deceptive outputs. These capabilities raise concerns about potential misuse and unintended consequences as these systems become more widely deployed. This review synthesizes recent empirical work examining LLMs’ capacity and proclivity for persuasion and deception, analyzes theoretical risks that could arise from these capabilities, and evaluates proposed mitigations. While current persuasive effects are relatively small, various mechanisms could increase their impact, including fine-tuning, multimodality, and social factors. We outline key open questions for future research, including how persuasive AI systems might become, whether truth enjoys an inherent advantage over falsehoods, and how effective different mitigation strategies may be in practice.
zh

[NLP-52] Learning to Adapt to Low-Resource Paraphrase Generation

【速读】：该论文试图解决在将复述生成模型从源领域迁移到目标领域时遇到的领域偏移问题，特别是在目标领域数据稀少的情况下，以及在大规模预训练语言模型 (PLMs) 在少量标注数据上训练时容易出现的过拟合问题。解决方案的关键是提出了 LAPA，一种基于元学习的 PLMs 适配器。LAPA 通过三阶段的训练过程来解决这些问题：首先在无监督语料上预训练 PLMs，然后在源领域标注数据上插入适配器层并进行元训练，最后在少量目标领域标注数据上微调适配器。这种方法使得模型能够先学习基本的语言知识，再学习复述任务本身，最后适应目标任务。实验结果表明，LAPA 在监督、无监督和低资源设置下均达到了最先进的性能。

链接: https://arxiv.org/abs/2412.17111
作者: Zhigen Li,Yanmeng Wang,Rizhao Fan,Ye Wang,Jianfeng Li,Shaojun Wang
机构: Ping An Technology(平安科技); University of Bologna(博洛尼亚大学)
关键词: longstanding NLP task, longstanding NLP, achieves great success, NLP task, great success
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Paraphrase generation is a longstanding NLP task and achieves great success with the aid of large corpora. However, transferring a paraphrasing model to another domain encounters the problem of domain shifting especially when the data is sparse. At the same time, widely using large pre-trained language models (PLMs) faces the overfitting problem when training on scarce labeled data. To mitigate these two issues, we propose, LAPA, an effective adapter for PLMs optimized by meta-learning. LAPA has three-stage training on three types of related resources to solve this problem: 1. pre-training PLMs on unsupervised corpora, 2. inserting an adapter layer and meta-training on source domain labeled data, and 3. fine-tuning adapters on a small amount of target domain labeled data. This method enables paraphrase generation models to learn basic language knowledge first, then learn the paraphrasing task itself later, and finally adapt to the target task. Our experimental results demonstrate that LAPA achieves state-of-the-art in supervised, unsupervised, and low-resource settings on three benchmark datasets. With only 2% of trainable parameters and 1% labeled data of the target task, our approach can achieve a competitive performance with previous work.
zh

[NLP-53] SAIL: Sample-Centric In-Context Learning for Document Information Extraction AAAI2025

【速读】：该论文试图解决文档信息提取 (Document Information Extraction, DIE) 在视觉丰富文档 (Visually Rich Documents, VRDs) 中面临的两个主要挑战：一是理解布局与文本元素之间的复杂关系，二是为预训练模型提供准确的指导。解决方案的关键在于提出了基于样本的上下文学习方法 (Sample-centric In-context Learning, SAIL)。SAIL 通过引入细粒度的实体级文本相似性来促进大语言模型 (Large Language Models, LLMs) 的深入文本分析，并结合布局相似性以增强对 VRDs 布局的分析。此外，SAIL 设计了一个统一的上下文学习 (In-Context Learning, ICL) 提示模板，用于各种样本中心示例，从而为每个样本提供精确的指导。实验结果表明，SAIL 在多个基准数据集上优于无训练基线，并接近全训练方法的性能，展示了其优越性和泛化能力。

链接: https://arxiv.org/abs/2412.17092
作者: Jinyu Zhang,Zhiyuan You,Jize Wang,Xinyi Le
机构: 未知
关键词: Visually Rich Documents, Document Information Extraction, extract structured information, Rich Documents, Visually Rich
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted by AAAI 2025

点击查看摘要

Abstract:Document Information Extraction (DIE) aims to extract structured information from Visually Rich Documents (VRDs). Previous full-training approaches have demonstrated strong performance but may struggle with generalization to unseen data. In contrast, training-free methods leverage powerful pre-trained models like Large Language Models (LLMs) to address various downstream tasks with only a few examples. Nonetheless, training-free methods for DIE encounter two primary challenges: (1) understanding the complex relationship between layout and textual elements in VRDs, and (2) providing accurate guidance to pre-trained models. To address these challenges, we propose Sample-centric In-context Learning (SAIL) for DIE. SAIL introduces a fine-grained entity-level textual similarity to facilitate in-depth text analysis by LLMs and incorporates layout similarity to enhance the analysis of layouts in VRDs. Additionally, SAIL formulates a unified In-Context Learning (ICL) prompt template for various sample-centric examples, enabling tailored prompts that deliver precise guidance to pre-trained models for each sample. Extensive experiments on FUNSD, CORD, and SROIE benchmarks with various base models (e.g., LLMs) indicate that our method outperforms training-free baselines, even closer to the full-training methods. The results show the superiority and generalization of our method.
zh

[NLP-54] Iterative NLP Query Refinement for Enhancing Domain-Specific Information Retrieval: A Case Study in Career Services

【速读】：该论文试图解决在特定领域（如Humber College的职业服务网页）中，基于传统TF-IDF的系统在检索语义相关文档时面临的挑战，这些系统通常导致低相似度评分和次优的检索性能。解决方案的关键在于引入了一种迭代和半自动化的查询优化方法，具体包括：1) 领域感知的查询优化，通过引入如“在线学习资源”、“学生在线服务”和“职业咨询”等专业术语；2) 整合结构化的教育描述符，如“在线简历和面试改进工具”；3) 自动化从高排名文档中提取领域特定关键词以建议查询扩展。通过这些方法，实验结果显示平均顶部相似度评分从约0.18提升至0.42，显著改善了检索性能。

链接: https://arxiv.org/abs/2412.17075
作者: Elham Peimani(1),Gurpreet Singh(1),Nisarg Mahyavanshi(1),Aman Arora(1),Awais Shaikh(1) ((1) Humber College, Toronto, Canada)
机构: Humber College(汉博学院)
关键词: Retrieving semantically relevant, niche domains poses, domains poses significant, Retrieving semantically, poses significant challenges
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: To be submitted to CoLM 2025

点击查看摘要

Abstract:Retrieving semantically relevant documents in niche domains poses significant challenges for traditional TF-IDF-based systems, often resulting in low similarity scores and suboptimal retrieval performance. This paper addresses these challenges by introducing an iterative and semi-automated query refinement methodology tailored to Humber College’s career services webpages. Initially, generic queries related to interview preparation yield low top-document similarities (approximately 0.2–0.3). To enhance retrieval effectiveness, we implement a two-fold approach: first, domain-aware query refinement by incorporating specialized terms such as resources-online-learning, student-online-services, and career-advising; second, the integration of structured educational descriptors like “online resume and interview improvement tools.” Additionally, we automate the extraction of domain-specific keywords from top-ranked documents to suggest relevant terms for query expansion. Through experiments conducted on five baseline queries, our semi-automated iterative refinement process elevates the average top similarity score from approximately 0.18 to 0.42, marking a substantial improvement in retrieval performance. The implementation details, including reproducible code and experimental setups, are made available in our GitHub repositories \urlthis https URL and \urlthis https URL. We also discuss the limitations of our approach and propose future directions, including the integration of advanced neural retrieval models.
zh

[NLP-55] Computational Analysis of Character Development in Holocaust Testimonies

【速读】：该论文试图通过计算方法分析叙事时间线中的角色发展，特别是主角在叙事过程中经历的内在和外在变化及其相互作用。解决方案的关键在于利用自然语言处理技术，对大屠杀幸存者的证词进行分析，聚焦于幸存者的宗教轨迹，研究其对宗教信仰和实践的态度演变。通过聚类分析，识别出宗教信仰和实践的常见模式，其中信仰多为恒定，而实践则呈现波动结构。这一方法展示了自然语言处理技术在分析叙事中主题轨迹和角色演变方面的潜力。

链接: https://arxiv.org/abs/2412.17063
作者: Esther Shizgal,Eitan Wagner,Renana Keydar,Omri Abend
机构: Hebrew University of Jerusalem(耶路撒冷希伯来大学); Department of Computer Science(计算机科学系); Faculty of Law and Digital Humanities(法律与数字人文学院)
关键词: analyze character development, computational approach, approach to analyze, narrative timeline, analyze character
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes the inner and outer changes the protagonist undergoes within a narrative, and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor’s religious trajectory, examining the evolution of their disposition toward religious belief and practice along the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, most present a constant disposition, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing techniques for analyzing character evolution through thematic trajectories in narratives.
zh

[NLP-56] Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agent ic Collaboration

【速读】：该论文试图解决多智能体系统中推理计算的扩展性问题，特别是在数据合成方面。解决方案的关键在于将模型协调视为一个多步骤的决策过程，通过动态优化生成结构来适应每个输入问题。论文提出了基于树搜索的协调智能体（Tree Search-based Orchestrated Agents, TOA），利用蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）和奖励模型进行实时反馈，从而在多智能体采样过程中实现高效的协作和性能优化。实验结果表明，多智能体采样在推理计算扩展时显著优于单智能体采样，并且在多个任务（如对齐、机器翻译和数学推理）中实现了最先进的性能。

链接: https://arxiv.org/abs/2412.17061
作者: Hai Ye,Mingbao Lin,Hwee Tou Ng,Shuicheng Yan
机构: National University of Singapore(新加坡国立大学); Skywork AI(天工AI)
关键词: systems remain under-explored, remain under-explored compared, Scaling laws, multi-agent systems remain, systems remain
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scaling laws for inference compute in multi-agent systems remain under-explored compared to single-agent scenarios. This work aims to bridge this gap by investigating the problem of data synthesis through multi-agent sampling, where synthetic responses are generated by sampling from multiple distinct language models. Effective model coordination is crucial for successful multi-agent collaboration. Unlike previous approaches that rely on fixed workflows, we treat model coordination as a multi-step decision-making process, optimizing generation structures dynamically for each input question. We introduce Tree Search-based Orchestrated Agents~(TOA), where the workflow evolves iteratively during the sequential sampling process. To achieve this, we leverage Monte Carlo Tree Search (MCTS), integrating a reward model to provide real-time feedback and accelerate exploration. Our experiments on alignment, machine translation, and mathematical reasoning demonstrate that multi-agent sampling significantly outperforms single-agent sampling as inference compute scales. TOA is the most compute-efficient approach, achieving SOTA performance on WMT and a 71.8% LC win rate on AlpacaEval. Moreover, fine-tuning with our synthesized alignment data surpasses strong preference learning methods on challenging benchmarks such as Arena-Hard and AlpacaEval.
zh

[NLP-57] he HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM s Internal States

【速读】：该论文试图解决大语言模型 (LLMs) 中幻觉 (hallucinations) 的检测问题，特别是针对那些涉及训练数据中未包含的信息的幻觉。解决方案的关键在于通过使用一个名为 HalluRAG 的数据集，在句子级别上检测这些幻觉，并利用不同内部状态的 LLMs 进行分类器的训练。研究结果表明，基于 HalluRAG 训练的多层感知器 (MLPs) 在检测幻觉方面表现出色，最高测试准确率可达 75%，尤其是在 Mistral-7B-Instruct-v0.1 模型上表现最佳。此外，研究发现可回答和不可回答的提示在编码上存在差异，针对这两类提示分别训练分类器可以提高准确性。然而，HalluRAG 在泛化能力上存在一定局限性，因此呼吁在幻觉数据集的构建中增加多样性。

链接: https://arxiv.org/abs/2412.17056
作者: Fabian Ridder,Malte Schilling
机构: University of Münster (明斯特大学); University of Münster (明斯特大学)
关键词: reliability and trustworthiness, critical for enhancing, enhancing their reliability, hallucinations, large language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages, 3 figures

点击查看摘要

Abstract:Detecting hallucinations in large language models (LLMs) is critical for enhancing their reliability and trustworthiness. Most research focuses on hallucinations as deviations from information seen during training. However, the opaque nature of an LLM’s parametric knowledge complicates the understanding of why generated texts appear ungrounded: The LLM might not have picked up the necessary knowledge from large and often inaccessible datasets, or the information might have been changed or contradicted during further training. Our focus is on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date. This study investigates these hallucinations by detecting them at sentence level using different internal states of various LLMs. We present HalluRAG, a dataset designed to train classifiers on these hallucinations. Depending on the model and quantization, MLPs trained on HalluRAG detect hallucinations with test accuracies ranging up to 75 %, with Mistral-7B-Instruct-v0.1 achieving the highest test accuracies. Our results show that IAVs detect hallucinations as effectively as CEVs and reveal that answerable and unanswerable prompts are encoded differently as separate classifiers for these categories improved accuracy. However, HalluRAG showed some limited generalizability, advocating for more diversity in datasets on hallucinations.
zh

[NLP-58] Modular Conversational Agents for Surveys and Interviews

【速读】：该论文试图解决传统人类主导的调查和访谈方法在成本、可扩展性和一致性方面的挑战，特别是在涉及重大公共投资和政策决策的场景中。解决方案的关键在于引入了一种模块化的设计框架，用于构建由大型语言模型（LLMs）驱动的对话代理（chatbots）。该框架通过集成工程化的提示（engineered prompts）、专门的领域知识库（specialized knowledge bases）以及可定制的目标导向对话逻辑，实现了透明、隐私保护、成本效益高的决策支持系统。论文通过三个实证研究展示了该方法的适应性、通用性和有效性，涵盖了多模态交互、多语言支持、实时澄清请求以及对非结构化输入的处理能力。

链接: https://arxiv.org/abs/2412.17049
作者: Jiangbo Yu,Jinhua Zhao,Luis Miranda-Moreno,Matthew Korp
机构: 未知
关键词: hypothetical scenarios, collecting insights, insights on emerging, emerging or hypothetical, modular approach
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Surveys and interviews (structured, semi-structured, or unstructured) are widely used for collecting insights on emerging or hypothetical scenarios. Traditional human-led methods often face challenges related to cost, scalability, and consistency. Recently, various domains have begun to explore the use of conversational agents (chatbots) powered by large language models (LLMs). However, as public investments and policies on infrastructure and services often involve substantial public stakes and environmental risks, there is a need for a rigorous, transparent, privacy-preserving, and cost-efficient development framework tailored for such major decision-making processes. This paper addresses this gap by introducing a modular approach and its resultant parameterized process for designing conversational agents. We detail the system architecture, integrating engineered prompts, specialized knowledge bases, and customizable, goal-oriented conversational logic in the proposed approach. We demonstrate the adaptability, generalizability, and efficacy of our modular approach through three empirical studies: (1) travel preference surveys, highlighting multimodal (voice, text, and image generation) capabilities; (2) public opinion elicitation on a newly constructed, novel infrastructure project, showcasing question customization and multilingual (English and French) capabilities; and (3) transportation expert consultation about future transportation systems, highlighting real-time, clarification request capabilities for open-ended questions, resilience in handling erratic inputs, and efficient transcript post-processing. The results show the effectiveness of this modular approach and how it addresses key ethical, privacy, security, and token consumption concerns, setting the stage for the next-generation surveys and interviews.
zh

[NLP-59] Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

【速读】：该论文试图解决大语言模型 (LLMs) 中的越狱 (jailbreaking) 问题，即如何防止模型生成有害文本。解决方案的关键在于识别并限制模型在低层和中层神经网络中的有害激活，使其保持在安全边界 (safety boundary) 内。论文提出了激活边界防御 (Activation Boundary Defense, ABD) 方法，通过自适应约束激活来防止越狱攻击，并利用贝叶斯优化选择性地应用于低层和中层。实验结果表明，ABD 在多种越狱攻击下实现了超过 98% 的防御成功率 (DSR)，同时对模型的一般能力影响小于 2%。

链接: https://arxiv.org/abs/2412.17034
作者: Lang Gao,Xiangliang Zhang,Preslav Nakov,Xiuying Chen
机构: MBZUAI; Huazhong University of Science and Technology; University of Notre Dame
关键词: Large Language Models, Large Language, major security concern, generate harmful text, Language Models
类目: Computation and Language (cs.CL)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text. Yet, there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light into this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that these disagreements stem from insufficient observation samples. In particular, we introduce \textitsafety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and the middle layers are critical in such shifts, while deeper layers have less impact. Leveraging on these insights, we propose a novel defense called \textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary. We further use Bayesian optimization to selectively apply the defense method to the low and the middle layers. Our experiments on several benchmarks show that ABD achieves an average DSR of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model’s general capabilities.
zh

[NLP-60] MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLM s on New and Tail Knowledge

【速读】：该论文试图解决大型语言模型 (LLMs) 在处理复杂、知识密集型多跳查询（尤其是涉及新知识或长尾知识）时面临的挑战。解决方案的关键在于引入了一个名为 MINTQA 的综合基准，用于评估 LLMs 在多跳推理中的能力，涵盖四个关键维度：问题处理策略、子问题生成、检索增强生成以及迭代或动态分解与检索。MINTQA 包含 10,479 个新知识问题对和 17,887 个长尾知识问题对，每个问题都配备了相应的子问题和答案，从而系统地评估了 22 个最先进的 LLMs 在处理复杂知识库查询方面的局限性，并为提升多跳推理能力提供了重要见解。

链接: https://arxiv.org/abs/2412.17032
作者: Jie He,Nan Hu,Wanqiu Long,Jiaoyan Chen,Jeff Z. Pan
机构: University of Edinburgh(爱丁堡大学); Southeast University(东南大学); University of Manchester(曼彻斯特大学)
关键词: Large language models, Large language, demonstrated impressive capabilities, language models, knowledge-intensive multi-hop queries
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks but face significant challenges with complex, knowledge-intensive multi-hop queries, particularly those involving new or long-tail knowledge. Existing benchmarks often fail to fully address these challenges. To bridge this gap, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a comprehensive benchmark to evaluate LLMs’ capabilities in multi-hop reasoning across four critical dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative or dynamic decomposition and retrieval. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge, with each question equipped with corresponding sub-questions and answers. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries, particularly in handling new or unpopular knowledge. Our findings highlight critical challenges and offer insights for advancing multi-hop reasoning capabilities. The MINTQA benchmark is available at this https URL.
zh

[NLP-61] A Reality Check on Context Utilisation for Retrieval-Augmented Generation

【速读】：该论文试图解决在真实世界场景中，语言模型（LM）如何利用不同复杂度的检索信息进行生成式增强（Retrieval-augmented generation, RAG）的问题。现有研究主要依赖于合成数据集（如CounterFact和ConflictQA），这些数据集未能充分反映真实世界中复杂的上下文环境。论文的关键解决方案是引入了DRUID数据集，该数据集包含真实世界的查询和上下文，并进行了立场的手动标注。通过对比DRUID与合成数据集，研究发现合成数据集往往夸大了真实检索数据中罕见的上下文特征，导致上下文利用率（ACU）结果被高估。此外，论文强调了在真实世界RAG设置中，需要进行与真实世界对齐的上下文利用研究，以提升模型性能。

链接: https://arxiv.org/abs/2412.17031
作者: Lovisa Hagström,Sara Vera Marjanović,Haeun Yu,Arnav Arora,Christina Lioma,Maria Maistro,Pepa Atanasova,Isabelle Augenstein
机构: 未知
关键词: parametric knowledge embedded, Retrieval-augmented generation, language model, address the limitations, parametric knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 43 pages, 18 figures

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) helps address the limitations of the parametric knowledge embedded within a language model (LM). However, investigations of how LMs utilise retrieved information of varying complexity in real-world scenarios have been limited to synthetic contexts. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complex and diverse real-world context settings. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.
zh

[NLP-62] Reversed Attention: On The Gradient Descent Of Attention Layers In GPT

【速读】：该论文试图解决Transformer-based语言模型（LMs）中注意力机制在反向传播过程中未被充分研究的问题。解决方案的关键在于揭示并研究反向传播过程中隐式计算的注意力矩阵，即“反向注意力”（Reversed Attention）。通过分析反向注意力的性质，论文展示了其能够解释模型行为和编辑动态的能力，并通过一种称为“注意力修补”（attention patching）的新方法，在不修改模型权重的情况下直接影响前向传播中的注意力机制。这一研究不仅增强了对于语言模型在反向传播过程中如何配置注意力层的理解，还为反向传播过程提供了更强的可解释性。

链接: https://arxiv.org/abs/2412.17019
作者: Shahar Katz,Lior Wolf
机构: Blavatnik School of Computer Science, Tel Aviv University (布劳沃特尼克计算机科学学院，特拉维夫大学)
关键词: Transformer-based Language Models, Transformer-based Language, success of Transformer-based, Reversed Attention, Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The success of Transformer-based Language Models (LMs) stems from their attention mechanism. While this mechanism has been extensively studied in explainability research, particularly through the attention values obtained during the forward pass of LMs, the backward pass of attention has been largely overlooked. In this work, we study the mathematics of the backward pass of attention, revealing that it implicitly calculates an attention matrix we refer to as “Reversed Attention”. We examine the properties of Reversed Attention and demonstrate its ability to elucidate the models’ behavior and edit dynamics. In an experimental setup, we showcase the ability of Reversed Attention to directly alter the forward pass of attention, without modifying the model’s weights, using a novel method called “attention patching”. In addition to enhancing the comprehension of how LM configure attention layers during backpropagation, Reversed Attention maps contribute to a more interpretable backward pass.
zh

[NLP-63] Robustness of Large Language Models Against Adversarial Attacks

【速读】：该论文试图解决大语言模型 (Large Language Models, LLMs) 在面对对抗性攻击时的鲁棒性问题。解决方案的关键在于采用两种不同的评估方法：一是通过字符级别的文本攻击（character-level text attack）对输入提示进行测试，涉及三个情感分类数据集（StanfordNLP/IMDB, Yelp Reviews, 和 SST-2）；二是使用越狱提示（jailbreak prompts）挑战模型的安全机制。实验结果显示，这些模型在不同程度的对抗性攻击下表现出显著的鲁棒性差异，强调了改进对抗训练和增强安全机制以提高LLMs鲁棒性的必要性。

链接: https://arxiv.org/abs/2412.17011
作者: Yiyi Tao,Yixian Shen,Hang Zhang,Yanxin Shen,Lun Wang,Chuanqi Shi,Shaoshuai Du
机构: Johns Hopkins University; University of Amsterdam; University of California San Diego; Simon Fraser University; Duke University; University of California San Diego; University of Amsterdam
关键词: Large Language Models, Large Language, deployment of Large, Language Models, increasing deployment
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing deployment of Large Language Models (LLMs) in various applications necessitates a rigorous evaluation of their robustness against adversarial attacks. In this paper, we present a comprehensive study on the robustness of GPT LLM family. We employ two distinct evaluation methods to assess their resilience. The first method introduce character-level text attack in input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2. The second method involves using jailbreak prompts to challenge the safety mechanisms of the LLMs. Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks. These findings underscore the necessity for improved adversarial training and enhanced safety mechanisms to bolster the robustness of LLMs.
zh

[NLP-64] On Fusing ChatGPT and Ensemble Learning in Discon-tinuous Named Entity Recognition in Health Corpora

【速读】：该论文试图解决不连续命名实体识别 (Discontinuous Named Entity Recognition, DNER) 问题，特别是如何通过集成学习方法来提升现有深度学习模型的性能。解决方案的关键在于将ChatGPT作为集成学习中的仲裁者，结合五个最先进的NER模型，通过自定义提示工程（custom prompt engineering）来评估集成算法的鲁棒性和泛化能力。实验结果表明，该方法在CADEC、ShARe13和ShARe14三个医疗数据集上优于现有的最先进模型和单一GPT模型的表现，展示了其在医疗领域NLP应用中的潜力。

链接: https://arxiv.org/abs/2412.16976
作者: Tzu-Chieh Chen,Wen-Yang Lin
机构: 未知
关键词: Named Entity Recognition, extract important terms, Discontinuous Named Entity, unstructured text data, Named Entity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Named Entity Recognition has traditionally been a key task in natural language processing, aiming to identify and extract important terms from unstructured text data. However, a notable challenge for contemporary deep-learning NER models has been identifying discontinuous entities, which are often fragmented within the text. To date, methods to address Discontinuous Named Entity Recognition have not been explored using ensemble learning to the best of our knowledge. Furthermore, the rise of large language models, such as ChatGPT in recent years, has shown significant effectiveness across many NLP tasks. Most existing approaches, however, have primarily utilized ChatGPT as a problem-solving tool rather than exploring its potential as an integrative element within ensemble learning algorithms. In this study, we investigated the integration of ChatGPT as an arbitrator within an ensemble method, aiming to enhance performance on DNER tasks. Our method combines five state-of-the-art NER models with ChatGPT using custom prompt engineering to assess the robustness and generalization capabilities of the ensemble algorithm. We conducted experiments on three benchmark medical datasets, comparing our method against the five SOTA models, individual applications of GPT-3.5 and GPT-4, and a voting ensemble method. The results indicate that our proposed fusion of ChatGPT with the ensemble learning algorithm outperforms the SOTA results in the CADEC, ShARe13, and ShARe14 datasets, showcasing its potential to enhance NLP applications in the healthcare domain.
zh

[NLP-65] Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLM s NEURIPS2024

【速读】：该论文试图解决大语言模型（LLMs）在拒绝执行用户指令时的行为分类和评估问题。现有分类和评估数据集存在不足，主要关注“不应”（should-not-related）而非“不能”（cannot-related）的拒绝类别，且缺乏对黑箱LLM输出中拒绝内容的审计工具。论文提出的解决方案包括：(a) 一个包含16个拒绝类别的分类体系；(b) 一个由公开的指令微调（IFT）和人类反馈强化学习（RLHF）数据集中的8,600多个实例组成的人工标注数据集；© 每个拒绝类别8,000个示例的合成数据集；(d) 用于拒绝分类的训练分类器。这些工具和数据集使得能够精确审计黑箱LLM的拒绝行为，并自动分析大规模IFT和RLHF数据集中的拒绝模式，从而有助于调整LLM的拒绝策略，提升其安全性和可靠性。

链接: https://arxiv.org/abs/2412.16974
作者: Alexander von Recum,Christoph Schnabl,Gabor Hollbeck,Silas Alberti,Philip Blinde,Marvin von Hagen
机构: Technical University of Munich(慕尼黑工业大学); University of Cambridge(剑桥大学); ETH Zurich(苏黎世联邦理工学院); Stanford University(斯坦福大学); TÜV Nord Mobility(TÜV Nord Mobility); Massachusetts Institute of Technology (MIT)(麻省理工学院)
关键词: fully execute user, execute user instructions, decline or fail, fail to fully, fully execute
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: NeurIPS 2024 Workshop SFLLM

点击查看摘要

Abstract:Refusals - instances where large language models (LLMs) decline or fail to fully execute user instructions - are crucial for both AI safety and AI capabilities and the reduction of hallucinations in particular. These behaviors are learned during post-training, especially in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF). However, existing taxonomies and evaluation datasets for refusals are inadequate, often focusing solely on should-not-related (instead of cannot-related) categories, and lacking tools for auditing refusal content in black-box LLM outputs. We present a comprehensive framework for classifying LLM refusals: (a) a taxonomy of 16 refusal categories, (b) a human-annotated dataset of over 8,600 instances from publicly available IFT and RLHF datasets, © a synthetic dataset with 8,000 examples for each refusal category, and (d) classifiers trained for refusal classification. Our work enables precise auditing of refusal behaviors in black-box LLMs and automatic analyses of refusal patterns in large IFT and RLHF datasets. This facilitates the strategic adjustment of LLM refusals, contributing to the development of more safe and reliable LLMs. Comments: NeurIPS 2024 Workshop SFLLM Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2412.16974 [cs.AI] (or arXiv:2412.16974v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.16974 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-66] Part-Of-Speech Sensitivity of Routers in Mixture of Experts Models COLING2025

【速读】：该论文旨在探讨在混合专家模型 (Mixture of Experts, MoE) 中，模型集成路由器如何基于词性标签 (Part-of-Speech, POS) 对 tokens 进行路由，并研究不同 MoE 架构下专家是否专门处理具有相似语言特征的 tokens。解决方案的关键在于通过分析 tokens 在专家和层之间的轨迹，揭示 MoE 模型如何处理语言信息，并发现专家对特定 POS 类别的专门化处理，从而验证路由路径在表征 tokens 方面的价值。

链接: https://arxiv.org/abs/2412.16971
作者: Elie Antoine,Frédéric Béchet,Philippe Langlais
机构: CNRS, LIS, Aix-Marseille Université, France; RALI, DIRO, Université de Montréal, Canada; International Laboratory on Learning Systems (ILLS - IRL CNRS), Montreal
关键词: routers in Mixture, study investigates, investigates the behavior, behavior of model-integrated, model-integrated routers
类目: Computation and Language (cs.CL)
备注: Accepted at COLING 2025

点击查看摘要

Abstract:This study investigates the behavior of model-integrated routers in Mixture of Experts (MoE) models, focusing on how tokens are routed based on their linguistic features, specifically Part-of-Speech (POS) tags. The goal is to explore across different MoE architectures whether experts specialize in processing tokens with similar linguistic traits. By analyzing token trajectories across experts and layers, we aim to uncover how MoE models handle linguistic information. Findings from six popular MoE models reveal expert specialization for specific POS categories, with routing paths showing high predictive accuracy for POS, highlighting the value of routing paths in characterizing tokens.
zh

[NLP-67] System-2 Mathematical Reasoning via Enriched Instruction Tuning

【速读】：该论文试图解决当前大型语言模型（LLMs）在复杂数学问题上的系统性推理能力不足的问题。其关键解决方案是引入Enriched Instruction Tuning (EIT)，通过结合人类和AI反馈来丰富现有的数学数据集，生成细粒度的推理轨迹。具体来说，EIT包括两个核心步骤：Enriching with Reasoning Plan (ERP) 用于生成高层次的推理计划，将复杂问题分解为简单目标序列；Enriching with Reasoning Step (ERS) 则补充人类标注者常忽略的推理细节，形成更平滑的推理轨迹。与仅依赖LLM内部知识的现有CoT提示方法不同，EIT利用人类标注的初始答案作为“元知识”，帮助LLM生成更详细和精确的推理过程，从而提升其在复杂数学问题上的表现。实验结果表明，EIT在GSM8K和MATH数据集上分别达到了84.1%和32.5%的准确率，超越了现有的微调和提示方法，甚至与工具增强方法的表现相当。

链接: https://arxiv.org/abs/2412.16964
作者: Huanqia Cai,Yijun Yang,Zhifeng Li
机构: Tencent(腾讯); University of Technology Sydney(悉尼科技大学)
关键词: large language models, current large language, Solving complex mathematical, natural human skill, Solving complex
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Solving complex mathematical problems via system-2 reasoning is a natural human skill, yet it remains a significant challenge for current large language models (LLMs). We identify the scarcity of deliberate multi-step reasoning data as a primary limiting factor. To this end, we introduce Enriched Instruction Tuning (EIT), a method that enriches existing human-annotated mathematical datasets by synergizing human and AI feedback to create fine-grained reasoning trajectories. These datasets are then used to fine-tune open-source LLMs, enhancing their mathematical reasoning abilities without reliance on any symbolic verification program. Concretely, EIT is composed of two critical steps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS). The former generates a high-level plan that breaks down complex instructions into a sequence of simpler objectives, while ERS fills in reasoning contexts often overlooked by human annotators, creating a smoother reasoning trajectory for LLM fine-tuning. Unlike existing CoT prompting methods that generate reasoning chains only depending on LLM’s internal knowledge, our method leverages human-annotated initial answers as ``meta-knowledge’’ to help LLMs generate more detailed and precise reasoning processes, leading to a more trustworthy LLM expert for complex mathematical problems. In experiments, EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods, and even matching the performance of tool-augmented methods.
zh

[NLP-68] LH-Mix: Local Hierarchy Correlation Guided Mixup over Hierarchical Prompt Tuning KDD2025

【速读】：该论文试图解决分层文本分类 (Hierarchical Text Classification, HTC) 中全局层次结构冗余的问题，并提出了一种新的解决方案。关键在于通过引入文本特定的局部层次结构 (local hierarchy)，并结合深度级别提示 (depth-level prompt) 来捕捉父子关系，同时利用Mixup技术增强兄弟/同辈关系 (sibling/peer relationships) 的潜在关联。论文提出的Local Hierarchy Mixup (LH-Mix) 模型通过一种由局部层次结构相关性引导的Mixup比例，有效捕捉内在关联，从而在多个广泛使用的数据集上表现出显著的性能提升。

链接: https://arxiv.org/abs/2412.16963
作者: Fanshuang Kong,Richong Zhang,Ziqiao Wang
机构: Beihang University(北京航空航天大学); Tongji University(同济大学)
关键词: Hierarchical text classification, text classification, aims to assign, local hierarchy, HTC
类目: Computation and Language (cs.CL)
备注: Accepted by KDD 2025

点击查看摘要

Abstract:Hierarchical text classification (HTC) aims to assign one or more labels in the hierarchy for each text. Many methods represent this structure as a global hierarchy, leading to redundant graph structures. To address this, incorporating a text-specific local hierarchy is essential. However, existing approaches often model this local hierarchy as a sequence, focusing on explicit parent-child relationships while ignoring implicit correlations among sibling/peer relationships. In this paper, we first integrate local hierarchies into a manual depth-level prompt to capture parent-child relationships. We then apply Mixup to this hierarchical prompt tuning scheme to improve the latent correlation within sibling/peer relationships. Notably, we propose a novel Mixup ratio guided by local hierarchy correlation to effectively capture intrinsic correlations. This Local Hierarchy Mixup (LH-Mix) model demonstrates remarkable performance across three widely-used datasets.
zh

[NLP-69] Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework

【速读】：该论文试图解决大型语言模型 (LLMs) 在逻辑推理任务中面临的效能和效率问题，这些问题源于现有系统未能充分利用逻辑任务的内在结构，如分解、搜索和解析。解决方案的关键在于提出了一个名为 Aristotle 的逻辑完备推理框架，该框架包含三个核心组件：逻辑分解器 (Logical Decomposer)、逻辑搜索路由器 (Logical Search Router) 和逻辑解析器 (Logical Resolver)。通过将符号表达式和逻辑规则全面整合到推理过程中，Aristotle 显著缓解了逻辑推理的瓶颈，包括降低子任务复杂性、减少搜索错误和解决逻辑矛盾。实验结果表明，Aristotle 在多个数据集上均优于现有的最先进推理框架，特别是在复杂逻辑推理场景中表现出色。

链接: https://arxiv.org/abs/2412.16953
作者: Jundong Xu,Hao Fei,Meng Luo,Qian Liu,Liangming Pan,William Yang Wang,Preslav Nakov,Mong-Li Lee,Wynne Hsu
机构: 未知
关键词: large language models, made impressive strides, current advanced reasoning, advanced reasoning methods, logical
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the context of large language models (LLMs), current advanced reasoning methods have made impressive strides in various reasoning tasks. However, when it comes to logical reasoning tasks, major challenges remain in both efficacy and efficiency. This is rooted in the fact that these systems fail to fully leverage the inherent structure of logical tasks throughout the reasoning processes such as decomposition, search, and resolution. To address this, we propose a logic-complete reasoning framework, Aristotle, with three key components: Logical Decomposer, Logical Search Router, and Logical Resolver. In our framework, symbolic expressions and logical rules are comprehensively integrated into the entire reasoning process, significantly alleviating the bottlenecks of logical reasoning, i.e., reducing sub-task complexity, minimizing search errors, and resolving logical contradictions. The experimental results on several datasets demonstrate that Aristotle consistently outperforms state-of-the-art reasoning frameworks in both accuracy and efficiency, particularly excelling in complex logical reasoning scenarios. We will open-source all our code at this https URL.
zh

[NLP-70] A Career Interview Dialogue System using Large Language Model-based Dynamic Slot Generation COLING2025

【速读】：该论文旨在提高护理管理者进行职业面试的效率和质量。其关键解决方案是开发一种基于槽填充（slot-filling）的对话系统，用于预面试阶段收集员工职业信息。与传统基于预定义槽集的对话系统不同，该研究利用大型语言模型（LLMs）动态生成新槽，以适应对话流程，从而实现更自然的对话。此外，通过在槽生成过程中引入溯因推理（abduction），进一步提升了槽生成的适当性和有效性。实验结果表明，该方法在增强信息收集能力和对话自然度方面具有显著效果。

链接: https://arxiv.org/abs/2412.16943
作者: Ekai Hashimoto,Mikio Nakano,Takayoshi Sakurai,Shun Shiramatsu,Toshitake Komazaki,Shiho Tsuchiya
机构: Nagoya Institute of Technology; Tokyo Healthcare University; Kitasato University Hospital; C4A Research Institute, Inc.
关键词: nursing managers, study aims, aims to improve, improve the efficiency, efficiency and quality
类目: Computation and Language (cs.CL)
备注: 9 pages, 9 tables, 2 figures; 14 pages of appendix. Accepted to COLING 2025

点击查看摘要

Abstract:This study aims to improve the efficiency and quality of career interviews conducted by nursing managers. To this end, we have been developing a slot-filling dialogue system that engages in pre-interviews to collect information on staff careers as a preparatory step before the actual interviews. Conventional slot-filling-based interview dialogue systems have limitations in the flexibility of information collection because the dialogue progresses based on predefined slot sets. We therefore propose a method that leverages large language models (LLMs) to dynamically generate new slots according to the flow of the dialogue, achieving more natural conversations. Furthermore, we incorporate abduction into the slot generation process to enable more appropriate and effective slot generation. To validate the effectiveness of the proposed method, we conducted experiments using a user simulator. The results suggest that the proposed method using abduction is effective in enhancing both information-collecting capabilities and the naturalness of the dialogue.
zh

[NLP-71] Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering

【速读】：该论文试图解决现有基于知识的大语言模型（LLMs）在视觉问答（VQA）任务中直接预测答案而忽略中间推理过程的问题。解决方案的关键在于提出了一种名为PLRH的框架，通过引入链式思维（Chain of Thought, CoT）来生成推理启发式（rationale heuristics），即中间推理过程，并利用这些启发式来引导LLMs进行更准确的答案预测。实验结果表明，该方法在OK-VQA和A-OKVQA数据集上分别比现有基线方法提升了2.2和2.1个百分点。

链接: https://arxiv.org/abs/2412.16936
作者: Zhongjian Hu,Peng Yang,Bing Li,Fengyuan Liu
机构: Southeast University (东南大学); Zhejiang University of Finance & Economics (浙江财经大学)
关键词: Large Language Models, Visual Question Answering, knowledge-based Visual Question, Large Language, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, Large Language Models (LLMs) have been used for knowledge-based Visual Question Answering (VQA). Despite the encouraging results of previous studies, prior methods prompt LLMs to predict answers directly, neglecting intermediate thought processes. We argue that prior methods do not sufficiently activate the capacities of LLMs. We propose a framework called PLRH that Prompts LLMs with Rationale Heuristics for knowledge-based VQA. The PLRH prompts LLMs with Chain of Thought (CoT) to generate rationale heuristics, i.e., intermediate thought processes, and then leverages the rationale heuristics to inspire LLMs to predict answers. Experiments show that our approach outperforms the existing baselines by more than 2.2 and 2.1 on OK-VQA and A-OKVQA, respectively.
zh

[NLP-72] owards a Unified Paradigm: Integrating Recommendation Systems as a New Language in Large Models

【速读】：该论文试图解决将大型语言模型 (LLMs) 应用于序列推荐 (sequential recommendation) 的问题，即基于用户过去的行为预测其未来的交互。解决方案的关键在于提出了“将推荐系统作为大型模型中的新语言” (RSLLM) 的概念，通过结合传统推荐系统的优势与 LLMs 的能力，使用独特的提示方法将基于 ID 的物品嵌入 (ID-based item embeddings) 与文本特征结合，并将用户的行为序列视为一种独立的语言，通过投影器 (projector) 将 ID 嵌入与 LLM 的输入空间对齐。此外，论文还提出了一个两阶段的 LLM 微调框架，首先使用纯文本提示进行微调，然后通过统一提示在目标领域进行微调，结合对比损失和语言建模损失，使模型能够整合传统序列推荐器的行为知识。

链接: https://arxiv.org/abs/2412.16933
作者: Kai Zheng,Qingfeng Sun,Can Xu,Peng Yu,Qingwei Guo
机构: Microsoft(微软); 未知
关键词: future interactions based, Integrating Recommendation Systems, Large Language Models, predicts users’ future, users’ future interactions
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:This paper explores the use of Large Language Models (LLMs) for sequential recommendation, which predicts users’ future interactions based on their past behavior. We introduce a new concept, “Integrating Recommendation Systems as a New Language in Large Models” (RSLLM), which combines the strengths of traditional recommenders and LLMs. RSLLM uses a unique prompting method that combines ID-based item embeddings from conventional recommendation models with textual item features. It treats users’ sequential behaviors as a distinct language and aligns the ID embeddings with the LLM’s input space using a projector. We also propose a two-stage LLM fine-tuning framework that refines a pretrained LLM using a combination of two contrastive losses and a language modeling loss. The LLM is first fine-tuned using text-only prompts, followed by target domain fine-tuning with unified prompts. This trains the model to incorporate behavioral knowledge from the traditional sequential recommender into the LLM. Our empirical results validate the effectiveness of our proposed framework.
zh

[NLP-73] Revisiting In-Context Learning with Long Context Language Models

【速读】：该论文试图解决的问题是：在长上下文语言模型 (Long Context Language Models, LCLMs) 的背景下，In-Context Learning (ICL) 在多示例场景下的性能是否仍然对示例选择方法敏感。解决方案的关键在于，论文通过实验发现，复杂的示例选择技术并未显著优于简单的随机选择方法，而LCLMs的出现使得ICL的主要挑战从选择最有效的示例转变为收集足够多的示例以填满上下文窗口。通过简单的数据增强方法来补充上下文中的示例，可以显著提升ICL性能达5%。

链接: https://arxiv.org/abs/2412.16926
作者: Jinheon Baek,Sun Jae Lee,Prakhar Gupta,Geunseob(GS)Oh,Siddharth Dalmia,Prateek Kolhar
机构: KAIST(韩国科学技术院); Google DeepMind(谷歌深度思维)
关键词: make predictions based, models make predictions, language models make, In-Context Learning, Context Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we find that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.
zh

[NLP-74] Quantifying Public Response to COVID-19 Events: Introducing the Community Sentiment and Engagement Index

【速读】：该论文旨在解决如何准确捕捉公众在社交媒体上对重大事件（尤其是与COVID-19相关的事件）的情绪和参与度变化的问题。解决方案的关键在于开发了社区情绪与参与度指数（Community Sentiment and Engagement Index, CSEI），该指数通过整合多种情绪指标（如参与度、每日发帖量、复合情绪、细粒度情绪、可读性、冒犯性和领域多样性），并利用基于主成分分析（PCA）的多步骤框架对各指标进行系统加权，动态调整各成分的重要性，从而精确捕捉公众情绪的高敏感性变化。CSEI的内部一致性和对特定情绪维度的敏感性通过与4,510,178条Reddit帖子数据的分析得到验证，显示出与重大事件（如WHO宣布COVID-19为大流行、首次病例报告、封锁措施、疫苗开发等）相关的显著情绪波动。

链接: https://arxiv.org/abs/2412.16925
作者: Nirmalya Thakur,Kesha A. Patel,Audrey Poon,Shuqi Cui,Nazif Azizi,Rishika Shah,Riyan Shah
机构: 未知
关键词: introduces the Community, Engagement Index, Community Sentiment, CSEI, Sentiment
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study introduces the Community Sentiment and Engagement Index (CSEI), developed to capture nuanced public sentiment and engagement variations on social media, particularly in response to major events related to COVID-19. Constructed with diverse sentiment indicators, CSEI integrates features like engagement, daily post count, compound sentiment, fine-grain sentiments (fear, surprise, joy, sadness, anger, disgust, and neutral), readability, offensiveness, and domain diversity. Each component is systematically weighted through a multi-step Principal Component Analysis (PCA)-based framework, prioritizing features according to their variance contributions across temporal sentiment shifts. This approach dynamically adjusts component importance, enabling CSEI to precisely capture high-sensitivity shifts in public sentiment. The development of CSEI showed statistically significant correlations with its constituent features, underscoring internal consistency and sensitivity to specific sentiment dimensions. CSEI’s responsiveness was validated using a dataset of 4,510,178 Reddit posts about COVID-19. The analysis focused on 15 major events, including the WHO’s declaration of COVID-19 as a pandemic, the first reported cases of COVID-19 across different countries, national lockdowns, vaccine developments, and crucial public health measures. Cumulative changes in CSEI revealed prominent peaks and valleys aligned with these events, indicating significant patterns in public sentiment across different phases of the pandemic. Pearson correlation analysis further confirmed a statistically significant relationship between CSEI daily fluctuations and these events (p = 0.0428), highlighting the capacity of CSEI to infer and interpret shifts in public sentiment and engagement in response to major events related to COVID-19.
zh

[NLP-75] Unsupervised Bilingual Lexicon Induction for Low Resource Languages

【速读】：该论文试图解决低资源语言（LRLs）在缺乏双语词典的情况下无法受益于监督式双语词典归纳（BLI）技术的问题。解决方案的关键在于探索无监督双语词典归纳（UBLI）技术中的结构化方法，特别是通过同时应用多种改进技术来提升性能。研究采用了无监督版本的VecMap框架，对英语-僧伽罗语、英语-泰米尔语和英语-旁遮普语进行了全面的实验，以确定最佳的改进组合，并发布了英语-僧伽罗语和英语-旁遮普语的双语词典。

链接: https://arxiv.org/abs/2412.16894
作者: Charitha Rathnayake,P.R.S. Thilakarathna,Uthpala Nethmini,Rishemjith Kaur,Surangika Ranathunga
机构: University of Moratuwa(莫拉图瓦大学); Central Scientific Instruments Organisation(中央科学仪器组织); Massey University(梅西大学)
关键词: Natural Language Processing, Language Processing tasks, Processing tasks, Bilingual Lexicon Induction, Natural Language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Bilingual lexicons play a crucial role in various Natural Language Processing tasks. However, many low-resource languages (LRLs) do not have such lexicons, and due to the same reason, cannot benefit from the supervised Bilingual Lexicon Induction (BLI) techniques. To address this, unsupervised BLI (UBLI) techniques were introduced. A prominent technique in this line is structure-based UBLI. It is an iterative method, where a seed lexicon, which is initially learned from monolingual embeddings is iteratively improved. There have been numerous improvements to this core idea, however they have been experimented with independently of each other. In this paper, we investigate whether using these techniques simultaneously would lead to equal gains. We use the unsupervised version of VecMap, a commonly used structure-based UBLI framework, and carry out a comprehensive set of experiments using the LRL pairs, English-Sinhala, English-Tamil, and English-Punjabi. These experiments helped us to identify the best combination of the extensions. We also release bilingual dictionaries for English-Sinhala and English-Punjabi.
zh

[NLP-76] PsychAdapter: Adapting LLM Transformers to Reflect Traits Personality and Mental Health

【速读】：该论文试图解决生成式语言模型在默认情况下生成“平均”语言，而未能反映个体差异的问题。解决方案的关键在于提出了一种轻量级的修改方法——“PsychAdapter”，它通过使用经验得出的特质-语言模式，直接在标准语言模型的Transformer架构中引入心理行为模式，从而生成符合特定人格、人口统计学和心理健康特征的自然语言。PsychAdapter的核心创新在于其能够在不依赖提示的情况下，通过影响每个Transformer层来实现这一目标，从而使生成的文本能够准确反映目标特质，如大五人格特质和抑郁与生活满意度等。

链接: https://arxiv.org/abs/2412.16882
作者: Huy Vu,Huy Anh Nguyen,Adithya V Ganesan,Swanie Juhng,Oscar N.E. Kjell,Joao Sedoc,Margaret L. Kern,Ryan L. Boyd,Lyle Ungar,H. Andrew Schwartz,Johannes C. Eichstaedt
机构: Computer Science Department, Stony Brook University (计算机科学系，石溪大学); Stern School of Business, New York University (斯特恩商学院，纽约大学); Centre for Wellbeing Science, University of Melbourne (幸福科学中心，墨尔本大学); Computer and Information Science, University of Pennsylvania (计算机与信息科学，宾夕法尼亚大学)
关键词: Artificial intelligence-based language, Artificial intelligence-based, intelligence-based language generators, people lives, language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Artificial intelligence-based language generators are now a part of most people’s lives. However, by default, they tend to generate “average” language without reflecting the ways in which people differ. Here, we propose a lightweight modification to the standard language model transformer architecture - “PsychAdapter” - that uses empirically derived trait-language patterns to generate natural language for specified personality, demographic, and mental health characteristics (with or without prompting). We applied PsychAdapters to modify OpenAI’s GPT-2, Google’s Gemma, and Meta’s Llama 3 and found generated text to reflect the desired traits. For example, expert raters evaluated PsychAdapter’s generated text output and found it matched intended trait levels with 87.3% average accuracy for Big Five personalities, and 96.7% for depression and life satisfaction. PsychAdapter is a novel method to introduce psychological behavior patterns into language models at the foundation level, independent of prompting, by influencing every transformer layer. This approach can create chatbots with specific personality profiles, clinical training tools that mirror language associated with psychological conditionals, and machine translations that match an authors reading or education level without taking up LLM context windows. PsychAdapter also allows for the exploration psychological constructs through natural language expression, extending the natural language processing toolkit to study human psychology.
zh

[NLP-77] Reconsidering SMT Over NMT for Closely Related Languages: A Case Study of Persian-Hindi Pair

【速读】：该论文试图解决在中等资源环境下，结构相似语言对的机器翻译性能问题。解决方案的关键在于证明基于短语的统计机器翻译 (Phrase-Based Statistical Machine Translation, PBSMT) 在波斯语-印地语等结构相似语言对中，能够显著优于基于 Transformer 的神经机器翻译 (Transformer-based Neural Machine Translation, NMT)。具体来说，PBSMT 在相同数据集上达到了 66.32 的 BLEU 分数，远超 Transformer-NMT 的 53.7 分。此外，论文还探讨了通过罗马化文本训练以及调整波斯语句子顺序以匹配印地语从左到右 (LTR) 结构的方法，进一步提升了翻译性能。这些发现强调了根据语言对特性选择合适架构的重要性，并主张在某些情况下，PBSMT 可以作为 NMT 的高性能替代方案。

链接: https://arxiv.org/abs/2412.16877
作者: Waisullah Yousofi,Pushpak Bhattacharyya
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); Computation for Indian Langauge Technology (CFILT) (印度语言技术计算)
关键词: Statistical Machine Translation, Neural Machine Translation, Transformer-based Neural Machine, Phrase-Based Statistical Machine, outperform Transformer-based Neural
类目: Computation and Language (cs.CL)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:This paper demonstrates that Phrase-Based Statistical Machine Translation (PBSMT) can outperform Transformer-based Neural Machine Translation (NMT) in moderate-resource scenarios, specifically for structurally similar languages, like the Persian-Hindi pair. Despite the Transformer architecture’s typical preference for large parallel corpora, our results show that PBSMT achieves a BLEU score of 66.32, significantly exceeding the Transformer-NMT score of 53.7 on the same dataset. Additionally, we explore variations of the SMT architecture, including training on Romanized text and modifying the word order of Persian sentences to match the left-to-right (LTR) structure of Hindi. Our findings highlight the importance of choosing the right architecture based on language pair characteristics and advocate for SMT as a high-performing alternative, even in contexts commonly dominated by NMT.
zh

[NLP-78] aching LLM s to Refine with Tools

【速读】：该论文试图解决现有大型语言模型（LLMs）在基于反馈进行自我改进时，主要局限于同一推理格式（reasoning format）内的优化，从而可能导致无法纠正错误的问题。解决方案的关键在于提出了一种名为CaP的新方法，该方法通过使用外部工具来优化由同一或不同LLMs生成的链式思维（chain-of-thought, CoT）响应。CaP采用两阶段训练过程：首先是监督微调（supervised fine-tuning），然后是使用DPO变体的偏好优化（preference optimization）。研究表明，偏好优化在实现有效优化中起着关键作用。此外，论文还比较了几种采样策略，以在推理时有效利用CoT和工具。实验结果显示，CaP在跨推理格式优化和高效推理方面具有潜力。

链接: https://arxiv.org/abs/2412.16871
作者: Dian Yu,Yuheng Zhang,Jiahao Xu,Tian Liang,Linfeng Song,Zhaopeng Tu,Haitao Mi,Dong Yu
机构: Tencent AI Lab(腾讯AI实验室)
关键词: Large language models, Large language, language models, based on feedback, self-improvement through iterative
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can refine their responses based on feedback, enabling self-improvement through iterative training or test-time refinement. However, existing methods predominantly focus on refinement within the same reasoning format, which may lead to non-correcting behaviors. We propose CaP, a novel approach that uses external tools to refine chain-of-thought (CoT) responses generated by the same or other LLMs. CaP employs a two-stage training process: supervised fine-tuning followed by preference optimization with DPO variants. Our observations highlight the critical role of preference optimization in enabling effective refinement. Additionally, we compare several sampling strategies to leverage CoT and tools at inference time. Experimental results demonstrate CaP’s potential for effective cross-reasoning refinement and efficient inference.
zh

[NLP-79] GME: Improving Universal Multimodal Retrieval by Multimodal LLM s

【速读】：该论文试图解决通用多模态检索 (Universal Multimodal Retrieval, UMR) 问题，即通过统一的模型实现跨文本、图像及其组合的检索。解决方案的关键在于通过合成训练数据管道构建大规模、高质量的融合模态训练数据集，并基于此数据集开发了多模态大语言模型 (Multimodal Large Language Models, MLLMs) 的通用多模态嵌入器 (General Multimodal Embedder, GME)。这一方法有效解决了现有多模态训练数据在模态上的不平衡问题，并通过实验验证了其在UMR任务中的最先进性能。

链接: https://arxiv.org/abs/2412.16855
作者: Xin Zhang,Yanzhao Zhang,Wen Xie,Mingxin Li,Ziqi Dai,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang
机构: Tongyi Lab, Alibaba Group(通义实验室，阿里巴巴集团); The Hong Kong Polytechnic University(香港理工大学)
关键词: Universal Multimodal Retrieval, Universal Multimodal, Multimodal Retrieval, aims to enable, enable search
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 32 pages, models at this https URL

点击查看摘要

Abstract:Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. Previous work has attempted to adopt multimodal large language models (MLLMs) to realize UMR using only text data. However, our preliminary experiments demonstrate that more diverse multimodal training data can further unlock the potential of MLLMs. Despite its effectiveness, the existing multimodal training data is highly imbalanced in terms of modality, which motivates us to develop a training data synthesis pipeline and construct a large-scale, high-quality fused-modal training dataset. Based on the synthetic training data, we develop the General Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR. Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the effectiveness of our approach. Experimental results show that our method achieves state-of-the-art performance among existing UMR methods. Last, we provide in-depth analyses of model scaling, training strategies, and perform ablation studies on both the model and synthetic data.
zh

[NLP-80] Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with an LLM -Enabled Simulation

【速读】：该论文旨在解决传统9-1-1调度员培训方法中存在的劳动密集、耗时且忽视特定社区需求的问题。解决方案的关键在于引入Sim911，这是首个基于大型语言模型 (LLMs) 的9-1-1调度员培训模拟系统。Sim911通过三项技术创新提升培训效果：(1) 知识构建，利用存档的9-1-1通话数据生成贴近真实场景的模拟；(2) 上下文感知控制生成，通过动态提示和向量基确保LLM行为与培训目标一致；(3) 循环校正验证，过滤低质量响应并优化系统性能。

链接: https://arxiv.org/abs/2412.16844
作者: Zirong Chen,Elizabeth Chason,Noah Mladenovski,Erin Wilson,Kristin Mullen,Stephen Martini,Meiyi Ma
机构: University of Florida (佛罗里达大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
关键词: enhancing public safety, Emergency response services, safeguarding the environment, human lives, vital for enhancing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emergency response services are vital for enhancing public safety by safeguarding the environment, property, and human lives. As frontline members of these services, 9-1-1 dispatchers have a direct impact on response times and the overall effectiveness of emergency operations. However, traditional dispatcher training methods, which rely on role-playing by experienced personnel, are labor-intensive, time-consuming, and often neglect the specific needs of underserved communities. To address these challenges, we introduce Sim911, the first training simulation for 9-1-1 dispatchers powered by Large Language Models (LLMs). Sim911 enhances training through three key technical innovations: (1) knowledge construction, which utilizes archived 9-1-1 call data to generate simulations that closely mirror real-world scenarios; (2) context-aware controlled generation, which employs dynamic prompts and vector bases to ensure that LLM behavior aligns with training objectives; and (3) validation with looped correction, which filters out low-quality responses and refines the system performance.
zh

[NLP-81] Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM -Powered Error Detector for Math Word Problem Solutions

【速读】：该论文试图解决数学应用题 (Math Word Problems, MWPs) 中由于存在多种有效解法而导致的“一致性偏差 (conformity bias)”问题。解决方案的关键在于引入Ask-Before-Detect (AskBD) 框架，该框架利用大型语言模型 (LLMs) 生成适应性参考解法，从而增强错误检测的准确性。通过在GSM8K数据集上的实验，AskBD框架结合链式思维提示 (chain-of-thought prompting) 等推理增强技术，有效缓解了偏差并提升了性能。

链接: https://arxiv.org/abs/2412.16838
作者: Hang Li,Tianlong Xu,Kaiqi Yang,Yucheng Chu,Yanling Chen,Yichi Song,Qingsong Wen,Hui Liu
机构: Michigan State University; Squirrel Ai Learning; Carleton College
关键词: large language models, math word problems, language models, offers new opportunities, word problems
类目: Computation and Language (cs.CL)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions for a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques like chain-of-thought prompting.
zh

[NLP-82] Quantum-Like Contextuality in Large Language Models

【速读】：该论文试图解决的问题是验证量子力学中的上下文性（contextuality）是否在自然语言处理领域中同样存在，并探讨其潜在的应用优势。解决方案的关键在于构建一个基于量子场景的语言模式（linguistic schema），并通过BERT模型在Simple English Wikipedia上提取概率分布，从而发现大量的上下文性实例。具体来说，研究采用了两种框架：信号修正的束理论模型（signalling-corrected sheaf theoretic model）和默认上下文性框架（Contextuality-by-Default, CbD），并证明了这些上下文性实例来源于语义相似的词汇。通过回归分析，研究进一步揭示了BERT嵌入向量的欧几里得距离是上下文性的最佳统计预测因子。这些结果表明，量子方法可能在语言任务中具有优势。

链接: https://arxiv.org/abs/2412.16806
作者: Kin Ian Lo,Mehrnoosh Sadrzadeh,Shane Mansfield
机构: 未知
关键词: distinguishing feature, Simple English Wikipedia, quantum advantage, growing evidence, quantum mechanics
类目: Computation and Language (cs.CL); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Contextuality is a distinguishing feature of quantum mechanics and there is growing evidence that it is a necessary condition for quantum advantage. In order to make use of it, researchers have been asking whether similar phenomena arise in other domains. The answer has been yes, e.g. in behavioural sciences. However, one has to move to frameworks that take some degree of signalling into account. Two such frameworks exist: (1) a signalling-corrected sheaf theoretic model, and (2) the Contextuality-by-Default (CbD) framework. This paper provides the first large scale experimental evidence for a yes answer in natural language. We construct a linguistic schema modelled over a contextual quantum scenario, instantiate it in the Simple English Wikipedia and extract probability distributions for the instances using the large language model BERT. This led to the discovery of 77,118 sheaf-contextual and 36,938,948 CbD contextual instances. We proved that the contextual instances came from semantically similar words, by deriving an equation between degrees of contextuality and Euclidean distances of BERT’s embedding vectors. A regression model further reveals that Euclidean distance is indeed the best statistical predictor of contextuality. Our linguistic schema is a variant of the co-reference resolution challenge. These results are an indication that quantum methods may be advantageous in language tasks.
zh

[NLP-83] SubData: A Python Library to Collect and Combine Datasets for Evaluating LLM Alignment on Downstream Tasks

【速读】：该论文试图解决在自然语言处理（NLP）领域中，缺乏评估大型语言模型（LLMs）在主观性标注任务中实际下游任务表现的问题。解决方案的关键在于提出了SubData，这是一个Python库，旨在为研究主观性标注任务的学者提供一个便捷的工具，用于收集、整合和使用适合的数据集，从而更好地评估LLMs在这些任务中的表现。

链接: https://arxiv.org/abs/2412.16783
作者: Leon Fröhling,Pietro Bernardelle,Gianluca Demartini
机构: GESIS(GESIS); The University of Queensland(昆士兰大学)
关键词: capable large language, large language models, capable large, large language, disciplines have started
类目: Computation and Language (cs.CL)
备注: 4 pages, 1 figure

点击查看摘要

Abstract:With the release of ever more capable large language models (LLMs), researchers in NLP and related disciplines have started to explore the usability of LLMs for a wide variety of different annotation tasks. Very recently, a lot of this attention has shifted to tasks that are subjective in nature. Given that the latest generations of LLMs have digested and encoded extensive knowledge about different human subpopulations and individuals, the hope is that these models can be trained, tuned or prompted to align with a wide range of different human perspectives. While researchers already evaluate the success of this alignment via surveys and tests, there is a lack of resources to evaluate the alignment on what oftentimes matters the most in NLP; the actual downstream tasks. To fill this gap we present SubData, a Python library that offers researchers working on topics related to subjectivity in annotation tasks a convenient way of collecting, combining and using a range of suitable datasets.
zh

[NLP-84] AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles

【速读】：该论文试图解决在生物医学研究中，特别是阿尔茨海默病领域，如何高效整合多模态数据以提升信息检索和生成模型性能的问题。解决方案的关键在于引入了一种名为AlzheimerRAG的多模态检索增强生成 (Multimodal Retrieval-Augmented Generation, RAG) 管道工具，该工具通过多模态融合技术，将文本和视觉数据处理相结合，能够高效地索引和访问大量生物医学文献。这种方法不仅在信息检索和领域特定信息的合成方面表现出优越性，还通过减少认知任务负担，帮助研究人员获得多模态洞察，从而提升对阿尔茨海默病的理解和治疗。

链接: https://arxiv.org/abs/2412.16701
作者: Aritra Kumar Lahiri,Qinmin Vivian Hu
机构: Toronto Metropolitan University, Canada(多伦多都会大学，加拿大)
关键词: Large Language Models, adept Large Language, highly adept Large, Large Language, Language Models
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in generative AI have flourished the development of highly adept Large Language Models (LLMs) that integrate diverse data types to empower decision-making. Among these, Multimodal Retrieval-Augmented Generation (RAG) applications are promising for their capability to combine the strengths of information retrieval and generative models, enhancing their utility across various domains, including biomedical research. This paper introduces AlzheimerRAG, a Multimodal RAG pipeline tool for biomedical research use cases, primarily focusing on Alzheimer’s disease from PubMed articles. Our pipeline incorporates multimodal fusion techniques to integrate textual and visual data processing by efficiently indexing and accessing vast amounts of biomedical literature. Preliminary experimental results against benchmarks, such as BioASQ and PubMedQA, have returned improved results in information retrieval and synthesis of domain-specific information. We also demonstrate a case study with our RAG pipeline across different Alzheimer’s clinical scenarios. We infer that AlzheimerRAG can generate responses with accuracy non-inferior to humans and with low rates of hallucination. Overall, a reduction in cognitive task load is observed, which allows researchers to gain multimodal insights, improving understanding and treatment of Alzheimer’s disease.
zh

[NLP-85] DragonVerseQA: Open-Domain Long-Form Context-Aware Question-Answering

【速读】：该论文试图解决现有问答数据集（QA datasets）在处理复杂叙事和深度上下文理解方面的不足，特别是针对《权力的游戏》和《龙之家族》这类幻想题材电视剧的开放领域长篇问答数据集的缺乏。解决方案的关键在于构建了一个名为DragonVerseQA的新型数据集，该数据集整合了来自HBO、粉丝维基、IMDb、Rotten Tomatoes等来源的全集摘要、用户评论和结构化数据，并通过严格的数据预处理和过滤方法，确保了数据的高质量和无偏性。这种多维度的上下文信息使得生成的长篇回答能够反映复杂的角色动态和剧情发展，从而提升了对话式AI、叙事分析、情感分析、摘要技术和关系抽取等方面的性能。与现有的SQuAD 2.0、TriviaQA等数据集相比，DragonVerseQA在上下文复杂性和回答长度上具有显著优势，为领域特定的问答系统设定了新的质量基准。

链接: https://arxiv.org/abs/2412.16694
作者: Aritra Kumar Lahiri,Qinmin Vivian Hu
机构: 未知
关键词: Game Of Thrones, specifically oriented, paper proposes, approach to develop, fantasy universe
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This paper proposes a novel approach to develop an open-domain and long-form Over-The-Top (OTT) Question-Answering (QA) dataset, DragonVerseQA, specifically oriented to the fantasy universe of “House of the Dragon” and “Game Of Thrones” TV series. Most existing QA datasets focus on short, fact-based answers sourced almost solely from Wikipedia articles, devoid of depth and contextual richness for sophisticated narrative understanding. We curate a dataset that combines full episode summaries sourced from HBO and fandom wiki websites, user reviews from sources like IMDb and Rotten Tomatoes, and high-quality, open-domain, legally admissible sources, and structured data from repositories like WikiData into one dataset. The dataset provides a multi-dimensional context, reflecting complex character dynamics and plot developments from these varied sources. That means, on equal footing, only after heavy data preprocessing and filtering methods will meaningful, non-spam unbiased reviews be available in this enriched dataset. The comprehensive insights are given through the long-form answers generated from this enriched context. This is what makes this valuable dataset for improving conversational AI, narrative analysis, sentiment analysis, summarization techniques, and relation extraction. A comparative analysis with state-of-the-art QA datasets such as SQuAD 2.0, TriviaQA, and Natural Questions brings to light the unique advantages of our dataset in terms of contextual complexity and answer length. Detailed reviews add layers to audience sentiment and narrative interpretation, raising the bar for domain-specific QA with a new quality benchmark. Our work also allows a deeper understanding of entertainment-industry content and opens the door to more knowledgeable and creative AI-driven interactions within digital media environments. Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR) Cite as: arXiv:2412.16694 [cs.CL] (or arXiv:2412.16694v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2412.16694 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-86] NILE: Internal Consistency Alignment in Large Language Models

【速读】：该论文试图解决现有指令微调（Instruction Fine-Tuning, IFT）数据集中知识与预训练语言模型（LLM）内部知识不一致的问题，这影响了IFT的有效性。解决方案的关键在于提出了NILE（iNternal consIstency aLignmEnt）框架，通过引出目标预训练LLM的内部知识来优化IFT数据集，并使用内部一致性过滤（Internal Consistency Filtering, ICF）方法筛选训练样本，确保数据集与LLM内部知识的高度一致性。实验结果表明，NILE框架显著提升了LLM在多个评估数据集上的性能，证明了数据集与预训练内部知识的一致性对最大化LLM潜力至关重要。

链接: https://arxiv.org/abs/2412.16686
作者: Minda Hu,Qiyuan Zhang,Yufei Wang,Bowei He,Hongru Wang,Jingyan Zhou,Liangyou Li,Yasheng Wang,Chen Ma,Irwin King
机构: The Chinese University of Hong Kong; City University of Hong Kong; Huawei Noah’s Ark Lab
关键词: IFT datasets, IFT, internal knowledge, human intentions, LLM internal knowledge
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As a crucial step to enhance LLMs alignment with human intentions, Instruction Fine-Tuning (IFT) has a high demand on dataset quality. However, existing IFT datasets often contain knowledge that is inconsistent with LLMs’ internal knowledge learned from the pre-training phase, which can greatly affect the efficacy of IFT. To address this issue, we introduce NILE (iNternal consIstency aLignmEnt) framework, aimed at optimizing IFT datasets to unlock LLMs’ capability further. NILE operates by eliciting target pre-trained LLM’s internal knowledge corresponding to instruction data. The internal knowledge is leveraged to revise the answer in IFT datasets. Additionally, we propose a novel Internal Consistency Filtering (ICF) method to filter training samples, ensuring its high consistency with LLM’s internal knowledge. Our experiments demonstrate that NILE-aligned IFT datasets sharply boost LLM performance across multiple LLM ability evaluation datasets, achieving up to 66.6% gain on Arena-Hard and 68.5% on Alpaca-Eval V2. Further analysis confirms that each component of the NILEframework contributes to these substantial performance improvements, and provides compelling evidence that dataset consistency with pre-trained internal knowledge is pivotal for maximizing LLM potential.
zh

[NLP-87] he Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents

【速读】：该论文试图解决大型语言模型 (LLM) 代理在通过工具集成执行复杂任务时面临的安全漏洞问题，特别是间接提示注入攻击 (indirect prompt injection attacks) 的威胁。解决方案的关键在于从传统的防止有害行为转向确保任务对齐 (task alignment)，即每个代理动作都必须服务于用户目标。基于这一视角，论文提出了Task Shield，一种测试时防御机制，通过系统性地验证每条指令和工具调用是否符合用户指定的目标，从而在保持高任务效用 (task utility) 的同时，显著降低攻击成功率。

链接: https://arxiv.org/abs/2412.16682
作者: Feiran Jia,Tong Wu,Xin Qin,Anna Squicciarini
机构: The Pennsylvania State University; Princeton University; California State University, Long Beach
关键词: Large Language Model, Large Language, Language Model, conversational assistants capable, performing complex real-world
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions. While existing defenses based on rule constraints, source spotlighting, and authentication protocols show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test-time defense mechanism that systematically verifies whether each instruction and tool call contributes to user-specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces attack success rates (2.07%) while maintaining high task utility (69.79%) on GPT-4o.
zh

[NLP-88] L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression

【速读】：该论文试图解决基于学习的概率模型在数据压缩中的高复杂性问题，特别是在文本压缩领域的实际应用。解决方案的关键在于提出了一种低复杂度的无损文本压缩方法，称为L3TC。具体来说，论文采用了RWKV模型作为骨干，实现了快速的解码速度和适中的压缩比；提出了异常感知分词器，通过有限的词汇表覆盖常见词元并允许异常词元绕过预测和编码；还引入了高秩重参数化策略，在训练期间增强学习能力而不增加推理复杂度。实验结果表明，L3TC在压缩性能上与现有学习型压缩器相当，同时模型参数减少了50倍，并且提供了所有学习型压缩器中最快的实时解码速度。

链接: https://arxiv.org/abs/2412.16642
作者: Junxuan Zhang,Zhengxue Cheng,Yan Zhao,Shihao Wang,Dajiang Zhou,Guo Lu,Li Song
机构: 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China (哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
2. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China (中山大学计算机科学与工程学院，广州，中国)
关键词: Learning-based probabilistic models, Learning-based probabilistic, entropy coder, coder for data, Lossless Low-complexity Text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Learning-based probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design while maintaining compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). Specifically, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making it the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass the prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances the learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves 48% bit saving compared to gzip compressor. Besides, \emphL3TC offers compression performance comparable to other learned compressors, with a 50\times reduction in model parameters. More importantly, \emphL3TC is the fastest among all learned compressors, providing real-time decoding speeds up to megabytes per second.
zh

[NLP-89] Large Language Model Can Be a Foundation for Hidden Rationale-Based Retrieval ECIR2025

【速读】：该论文试图解决的是一种更具挑战性的检索任务，称为隐藏理由检索（hidden rationale retrieval），其中查询与文档在语义上不相似，但可以通过推理链、逻辑关系或经验推断出关联。解决方案的关键在于利用指令调优的大型语言模型（LLM），并采用跨编码器架构，通过设计特定的指令将检索任务转化为生成任务，即通过提示LLM回答二选一问题来实现。此外，模型通过直接偏好优化（DPO）进行微调，并在计算效率上进行了优化，确保性能无损。该框架称为RaHoRe，在情感支持对话（ESC）任务中展示了零样本和微调后的性能优势。

链接: https://arxiv.org/abs/2412.16615
作者: Luo Ji,Feixiang Guo,Teng Chen,Qingqing Gu,Xiaoyu Wang,Ningyuan Xi,Yihong Wang,Peng Yu,Yue Zhao,Hongyang Lei,Zhonglin Jiang,Yong Chen
机构: 未知
关键词: Retrieval-Augmented Generation, recent advancement, advancement in Retrieval-Augmented, developed for factual, RAG
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, accepted by ECIR 2025

点击查看摘要

Abstract:Despite the recent advancement in Retrieval-Augmented Generation (RAG) systems, most retrieval methodologies are often developed for factual retrieval, which assumes query and positive documents are semantically similar. In this paper, we instead propose and study a more challenging type of retrieval task, called hidden rationale retrieval, in which query and document are not similar but can be inferred by reasoning chains, logic relationships, or empirical experiences. To address such problems, an instruction-tuned Large language model (LLM) with a cross-encoder architecture could be a reasonable choice. To further strengthen pioneering LLM-based retrievers, we design a special instruction that transforms the retrieval task into a generative task by prompting LLM to answer a binary-choice question. The model can be fine-tuned with direct preference optimization (DPO). The framework is also optimized for computational efficiency with no performance degradation. We name this retrieval framework by RaHoRe and verify its zero-shot and fine-tuned performance superiority on Emotional Support Conversation (ESC), compared with previous retrieval works. Our study suggests the potential to employ LLM as a foundation for a wider scope of retrieval tasks. Our codes, models, and datasets are available on this https URL.
zh

[NLP-90] Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive Learning with Dense Labeling

【速读】：该论文旨在解决家庭服务机器人（DSRs）在复杂室内环境中根据开放词汇指令准确搬运日常物品到指定家具的问题。解决方案的关键在于提出了RelaX-Former模型，该模型通过从大量室内环境图像中学习多样且鲁棒的表示，能够有效区分正样本、未标注正样本和负样本，从而在复杂的图像检索任务中表现出色。实验结果表明，RelaX-Former在标准图像检索指标上优于现有基线模型，并在零样本转移设置下的物理实验中实现了75%的成功率。

链接: https://arxiv.org/abs/2412.16576
作者: Daichi Yashima,Ryosuke Korekata,Komei Sugiura
机构: Keio University (庆应义塾大学)
关键词: Growing labor shortages, domestic service robots, Growing labor, service robots, labor shortages
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for IEEE RA-L 2025

点击查看摘要

Abstract:Growing labor shortages are increasing the demand for domestic service robots (DSRs) to assist in various settings. In this study, we develop a DSR that transports everyday objects to specified pieces of furniture based on open-vocabulary instructions. Our approach focuses on retrieving images of target objects and receptacles from pre-collected images of indoor environments. For example, given an instruction “Please get the right red towel hanging on the metal towel rack and put it in the white washing machine on the left,” the DSR is expected to carry the red towel to the washing machine based on the retrieved images. This is challenging because the correct images should be retrieved from thousands of collected images, which may include many images of similar towels and appliances. To address this, we propose RelaX-Former, which learns diverse and robust representations from among positive, unlabeled positive, and negative samples. We evaluated RelaX-Former on a dataset containing real-world indoor images and human annotated instructions including complex referring expressions. The experimental results demonstrate that RelaX-Former outperformed existing baseline models across standard image retrieval metrics. Moreover, we performed physical experiments using a DSR to evaluate the performance of our approach in a zero-shot transfer setting. The experiments involved the DSR to carry objects to specific receptacles based on open-vocabulary instructions, achieving an overall success rate of 75%.
zh

[NLP-91] Acquisition of Recursive Possessives and Recursive Locatives in Mandarin

【速读】：该论文试图解决儿童在语言学习过程中对递归结构（recursion）的习得问题，特别是普通话儿童对递归所有格和递归处所格的习得发展轨迹。研究的关键在于评估结构多样性对语言习得的影响，并通过对比3至7岁儿童对两层递归结构的理解，揭示递归结构习得的认知基础和复杂性。研究发现，儿童在6岁前无法达到成人水平的递归能力，且递归所有格与递归处所格的习得存在显著不对称性，这强调了结构复杂性和认知因素在语言习得过程中的重要性。

链接: https://arxiv.org/abs/2412.16556
作者: Chenxi Fu,Xiaoyi Wang,Zaijiang Man,Caimei Yang
机构: 未知
关键词: Mandarin-speaking children acquisition, point of inquiry, underlying any linguistic, linguistic work, focal point
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As recursion has been underlying any linguistic work for the last 60 years, the acquisition of recursive structures by children during language learning has become a focal point of inquiry. This study delves into the developmental trajectory of Mandarin-speaking children’s acquisition of recursive possessives and locatives, assessing the impact of structural diversity on language acquisition. The research contrasts the comprehension of two-level recursive structures among children aged 3 to 7 years, employing answering question while seeing a picture task to elicit responses. The findings indicate that children do not attain adult-like proficiency in two-level recursion until the age of 6, and there exists a notable asymmetry in the acquisition of recursive possessives versus locatives. These results underscore the primacy of structural complexity and cognitive factors in the acquisition process, enhancing our comprehension of the cognitive foundations of language development and the pivotal role of recursion in child language acquisition.
zh

[NLP-92] Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models

【速读】：该论文试图解决现有大型语言模型（LLMs）在面对jailbreaking攻击时的局限性问题，包括高查询次数、攻击模式覆盖有限、攻击成功率低以及评估方法简单等。解决方案的关键在于提出了一个多模态jailbreaking方法JMLLM，该方法通过整合多种策略，在文本、视觉和听觉模态上进行全面的jailbreaking攻击。此外，论文还贡献了一个新的多模态jailbreaking研究数据集TriJail，实验结果表明JMLLM在攻击成功率和时间开销方面均有显著提升。

链接: https://arxiv.org/abs/2412.16555
作者: Yanxu Mao,Peipei Liu,Tiehan Cui,Congying Liu,Datao You
机构: School of Software, Henan University, China(河南大学软件学院); Institute of Information Engineering, Chinese Academy of Sciences, China(中国科学院信息工程研究所); University of Chinese Academy of Sciences, China(中国科学院大学)
关键词: Large language models, Large language, powerful reasoning, generation capabilities, widely applied
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are widely applied in various fields of society due to their powerful reasoning, understanding, and generation capabilities. However, the security issues associated with these models are becoming increasingly severe. Jailbreaking attacks, as an important method for detecting vulnerabilities in LLMs, have been explored by researchers who attempt to induce these models to generate harmful content through various attack methods. Nevertheless, existing jailbreaking methods face numerous limitations, such as excessive query counts, limited coverage of jailbreak modalities, low attack success rates, and simplistic evaluation methods. To overcome these constraints, this paper proposes a multimodal jailbreaking method: JMLLM. This method integrates multiple strategies to perform comprehensive jailbreak attacks across text, visual, and auditory modalities. Additionally, we contribute a new and comprehensive dataset for multimodal jailbreaking research: TriJail, which includes jailbreak prompts for all three modalities. Experiments on the TriJail dataset and the benchmark dataset AdvBench, conducted on 13 popular LLMs, demonstrate advanced attack success rates and significant reduction in time overhead.
zh

[NLP-93] Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models

【速读】：该论文试图解决长序列处理中全自注意力机制（full self-attention）效率低下以及可能忽略输入结构的问题。解决方案的关键在于采用并行上下文编码（parallel context encoding），并通过引入注意力汇聚点（attention sinks）和选择性机制（selective mechanisms）来降低异常高的注意力熵（attention entropy），从而提升模型在各种任务中的性能。

链接: https://arxiv.org/abs/2412.16545
作者: Zhisong Zhang,Yan Wang,Xinting Huang,Tianqing Fang,Hongming Zhang,Chenlong Deng,Shuaiyi Li,Dong Yu
机构: Tencent AI Lab(腾讯AI实验室)
关键词: Large language models, Large language, shown remarkable performance, language models, models have shown
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling. The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers. Although powerful, this method can be inefficient for long sequences and may overlook inherent input structures. To address these problems, an alternative approach is parallel context encoding, which splits the context into sub-pieces and encodes them parallelly. Because parallel patterns are not encountered during training, naively applying parallel encoding leads to performance degradation. However, the underlying reasons and potential mitigations are unclear. In this work, we provide a detailed analysis of this issue and identify that unusually high attention entropy can be a key factor. Furthermore, we adopt two straightforward methods to reduce attention entropy by incorporating attention sinks and selective mechanisms. Experiments on various tasks reveal that these methods effectively lower irregular attention entropy and narrow performance gaps. We hope this study can illuminate ways to enhance context modeling mechanisms.
zh

[NLP-94] Self-guided Knowledgeable Network of Thoughts: Amplifying Reasoning with Large Language Models

【速读】：该论文试图解决现有大型语言模型（LLMs）在复杂任务处理中的局限性，特别是在使用链式思维（Chain-of-Thought, CoT）、树状思维（Tree of Thoughts, ToT）和图状思维（Graph of Thoughts, GoT）等范式时，难以生成高效且可靠的执行计划的问题。解决方案的关键在于引入了知识网络思维（Knowledgeable Network of Thoughts, kNoT）及其核心组件——LLM工作流模板（LLM Workflow Template, LWT）。LWT允许LLMs为其他LLMs生成可执行的计划，这些计划可以是任意网络结构，其中单步LLM操作作为节点，边对应于这些步骤之间的消息传递。LWT还支持通过索引选择单个元素，使得kNoT能够生成复杂的计划，每个LLM操作可以限制为基本操作，从而显著提高在长任务序列中的可靠性，并在多个用例中显著优于现有技术。

链接: https://arxiv.org/abs/2412.16533
作者: Chao-Chi Chen,Chin-Yuan Yeh,Hsi-Wen Chen,De-Nian Yang,Ming-Syan Chen
机构: Academia Sinica(中央研究院); National Taiwan University(台湾大学); IBM Research(IBM研究院)
关键词: Tree of Thoughts, Graph of Thoughts, introduce Knowledgeable Network, large language models, Thoughts
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: SOTA result over CoT, ToT, GoT

点击查看摘要

Abstract:We introduce Knowledgeable Network of Thoughts (kNoT): a prompt scheme that advances the capabilities of large language models (LLMs) beyond existing paradigms like Chain-of-Thought (CoT), Tree of Thoughts (ToT), and Graph of Thoughts (GoT). The key innovation of kNoT is the LLM Workflow Template (LWT), which allows for an executable plan to be specified by LLMs for LLMs. LWT allows these plans to be arbitrary networks, where single-step LLM operations are nodes, and edges correspond to message passing between these steps. Furthermore, LWT supports selection of individual elements through indexing, facilitating kNoT to produce intricate plans where each LLM operation can be limited to elementary operations, greatly enhancing reliability over extended task sequences. We demonstrate that kNoT significantly outperforms the state of the art on six use cases, while reducing the need for extensive prompt engineering. For instance, kNoT finds 92% accuracy for sorting 32 numbers over 12% and 31% for ToT and GoT, while utilizing up to 84.4% and 87.3% less task-specific prompts, respectively.
zh

[NLP-95] Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation ICASSP

【速读】：该论文试图解决在音视频语音到语音翻译（Audio-Visual Speech-to-Speech Translation, AVS2S）中，唇形同步（lip-synchrony）被忽视的问题。解决方案的关键在于将唇形同步损失（lip-synchrony loss）集成到AVS2S模型的训练过程中。通过这种方法，论文显著提升了直接音视频语音到语音翻译中的唇形同步效果，平均LSE-D得分达到10.67，相较于强基线模型降低了9.2%，同时保持了翻译语音的自然性和高质量，且未对翻译质量造成任何损害。

链接: https://arxiv.org/abs/2412.16530
作者: Lucas Goncalves,Prashant Mathur,Xing Niu,Brady Houston,Chandrashekhar Lavania,Srikanth Vishnubhotla,Lijia Sun,Anthony Ferritto
机构: 未知
关键词: typically prioritizes improving, prioritizes improving translation, Translation typically prioritizes, typically prioritizes, prioritizes improving
类目: ound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted at ICASSP, 4 pages

点击查看摘要

Abstract:Audio-Visual Speech-to-Speech Translation typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony-ensuring that the movements of the lips match the spoken content-essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
zh

[NLP-96] xt2midi: Generating Symbolic Music from Captions AAAI AAAI2025

【速读】：该论文试图解决从文本描述生成MIDI文件的问题，解决方案的关键在于利用预训练的大型语言模型 (LLMs) 构建一个端到端的生成模型。具体来说，论文提出的text2midi系统通过LLM编码器处理文本描述，并利用自回归Transformer解码器生成符合描述的MIDI序列。这种方法不仅简化了音乐创作流程，还通过结合自动化和人工评估，证明了生成的MIDI文件在质量和可控性方面表现出色，能够处理包含和弦、调性和速度等音乐理论术语的文本描述。

链接: https://arxiv.org/abs/2412.16526
作者: Keshav Bhandari,Abhinaba Roy,Kyra Wang,Geeta Puri,Simon Colton,Dorien Herremans
机构: 1. Queen Mary University of London (伦敦玛丽女王大学); 2. Singapore University of Technology and Design (新加坡科技设计大学)
关键词: MIDI files, paper introduces, model generates MIDI, generate MIDI files, MIDI
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 9 pages, 3 figures, Accepted at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)

点击查看摘要

Abstract:This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Specifically, we utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences that accurately reflect the provided descriptions. This intuitive and user-friendly method significantly streamlines the music creation process by allowing users to generate music pieces using text prompts. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality that are indeed controllable by text captions that may include music theory terms such as chords, keys, and tempo. We release the code and music samples on our demo page (this https URL) for users to interact with text2midi.
zh

[NLP-97] HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

【速读】：该论文试图解决在人机交互中评估大型语言模型（LLMs）功能调用能力的问题，特别是在复杂和开放的对话过程中。解决方案的关键在于提出了HammerBench，这是一个新颖的基准测试框架，通过模拟移动设备上的多种真实用户场景（包括不完美的指令、多样化的问答轨迹、意图/参数的转变以及通过代词使用外部个人信息）来更有效地评估LLMs的功能调用能力。为了构建高质量的数据集，论文提出了一种综合的流程，包括LLM生成的数据和多轮人工验证。此外，通过将对话分解为功能调用快照，实现了对每一轮对话的细粒度评估。实验结果表明，参数命名错误是导致不同数据类型对话失败的主要因素。

链接: https://arxiv.org/abs/2412.16516
作者: Jun Wang,Jiamu Zhou,Muning Wen,Xiaoyun Mo,Haoyu Zhang,Qiqiang Lin,Cheng Jin,Xihuai Wang,Weinan Zhang,Qiuying Peng,Jun Wang
机构: OPPO(欧珀); Shanghai Jiao Tong University (上海交通大学)
关键词: remains challenging due, human-LLM interactions remains, interactions remains challenging, large language models, Evaluating the capabilities
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating the capabilities of large language models (LLMs) in human-LLM interactions remains challenging due to the inherent complexity and openness of dialogue processes. This paper introduces HammerBench, a novel benchmarking framework designed to assess the function-calling ability of LLMs more effectively in such interactions. We model a wide range of real-world user scenarios on mobile devices, encompassing imperfect instructions, diverse question-answer trajectories, intent/argument shifts, and the use of external individual information through pronouns. To construct the corresponding datasets, we propose a comprehensive pipeline that involves LLM-generated data and multiple rounds of human validation, ensuring high data quality. Additionally, we decompose the conversations into function-calling snapshots, enabling a fine-grained evaluation of each turn. We evaluate several popular LLMs using HammerBench and highlight different performance aspects. Our empirical findings reveal that errors in parameter naming constitute the primary factor behind conversation failures across different data types.
zh

[NLP-98] Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding

【速读】：该论文试图解决代码转换自动语音识别 (Code-switching Automatic Speech Recognition, CS-ASR) 中由于口音、听觉相似性和无缝语言切换导致的语言混淆问题。解决方案的关键在于对预训练的多语言模型 Whisper 进行适应性调整，具体包括：1) 提出一个编码器优化器 (encoder refiner) 以增强编码器在句子内部语言切换的能力；2) 使用两组带有不同语言提示嵌入的语言感知适配器 (language-aware adapters)，在每个解码器层中实现语言特定的解码信息；3) 添加一个融合模块 (fusion module) 来融合语言感知的解码信息。实验结果表明，该方法在 SEAME 数据集上相较于基线模型，分别在 dev_man 和 dev_sge 测试集上实现了 4.1% 和 7.2% 的相对词错误率 (MER) 降低，超越了现有的最先进方法。

链接: https://arxiv.org/abs/2412.16507
作者: Jiahui Zhao,Hao Shi,Chenrui Cui,Tianrui Wang,Hexin Liu,Zhaoheng Ni,Lingxuan Ye,Longbiao Wang
机构: Tianjin University(天津大学); Nanyang Technological University(南洋理工大学); College of Intelligence and Computing(智能与计算学院); Graduate School of Informatics(信息学研究生院); Kyoto University(京都大学); College of Computing and Data Science(计算与数据科学学院); Meta(Meta); Key Laboratory of Speech Acoustics and Content Understanding(语音声学与内容理解重点实验室); Institute of Acoustics(声学研究所)
关键词: faces challenges due, seamless language switches, language confusion resulting, automatic speech recognition, auditory similarity
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Code-switching (CS) automatic speech recognition (ASR) faces challenges due to the language confusion resulting from accents, auditory similarity, and seamless language switches. Adaptation on the pre-trained multi-lingual model has shown promising performance for CS-ASR. In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. First, we propose an encoder refiner to enhance the encoder’s capacity of intra-sentence swithching. Second, we propose using two sets of language-aware adapters with different language prompt embeddings to achieve language-specific decoding information in each decoder layer. Then, a fusion module is added to fuse the language-aware decoding. The experimental results using the SEAME dataset show that, compared with the baseline model, the proposed approach achieves a relative MER reduction of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Through experiments, we found that the proposed method significantly improves the performance on non-native language in CS speech, indicating that our approach enables Whisper to better distinguish between the two languages.
zh

[NLP-99] Real-time Bangla Sign Language Translator

【速读】：该论文旨在解决聋哑人群的沟通障碍问题，特别是通过开发一种高效的孟加拉手语翻译系统 (Bangla Sign Language Translation, BSLT)。解决方案的关键在于使用Mediapipe Holistic采集关键点数据，采用长短期记忆网络 (LSTM) 架构进行数据训练，并结合计算机视觉 (Computer Vision) 技术实现实时手语检测，最终达到了94%的准确率。

链接: https://arxiv.org/abs/2412.16497
作者: Rotan Hawlader Pranto,Shahnewaz Siddique
机构: North South University (北南大学); Department of Electrical and Computer Engineering (电气与计算机工程系)
关键词: human body communicates, Sign Language Translation, sign language, meaningful gestures, Bangla Sign Language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted in 2024 27th international Conference on Computer and information Technology (ICCIT), Bangladesh

点击查看摘要

Abstract:The human body communicates through various meaningful gestures, with sign language using hands being a prominent example. Bangla Sign Language Translation (BSLT) aims to bridge communication gaps for the deaf and mute community. Our approach involves using Mediapipe Holistic to gather key points, LSTM architecture for data training, and Computer Vision for realtime sign language detection with an accuracy of 94%. Keywords=Recurrent Neural Network, LSTM, Computer Vision, Bangla font.
zh

[NLP-100] Evaluating the Performance of Large Language Models in Scientific Claim Detection and Classification

【速读】：该论文试图解决在COVID-19疫情期间社交媒体上广泛传播的错误信息问题，特别是通过评估大型语言模型（Large Language Models, LLMs）在检测和分类与COVID-19相关的科学声明中的有效性，以促进公众的知情决策。解决方案的关键在于利用LLMs的预训练和适应性优势，这些模型如OpenAI的GPT和Meta的LLaMA，能够避免传统机器学习模型中常见的过度训练问题，从而提供更为准确和自动化的信息核查工具。尽管LLMs在这一领域的应用尚处于初期阶段，但研究结果表明其具有显著潜力，论文还提出了一个在公共卫生传播中应用LLMs的框架。

链接: https://arxiv.org/abs/2412.16486
作者: Tanjim Bin Faruk
机构: Colorado State University (科罗拉多州立大学); Fort Collins (柯林斯堡)
关键词: simultaneously propagating misinformation, double-edged sword, Large Language Models, pervasive influence, influence of social
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The pervasive influence of social media during the COVID-19 pandemic has been a double-edged sword, enhancing communication while simultaneously propagating misinformation. This \textitDigital Infodemic has highlighted the urgent need for automated tools capable of discerning and disseminating factual content. This study evaluates the efficacy of Large Language Models (LLMs) as innovative solutions for mitigating misinformation on platforms like Twitter. LLMs, such as OpenAI’s GPT and Meta’s LLaMA, offer a pre-trained, adaptable approach that bypasses the extensive training and overfitting issues associated with traditional machine learning models. We assess the performance of LLMs in detecting and classifying COVID-19-related scientific claims, thus facilitating informed decision-making. Our findings indicate that LLMs have significant potential as automated fact-checking tools, though research in this domain is nascent and further exploration is required. We present a comparative analysis of LLMs’ performance using a specialized dataset and propose a framework for their application in public health communication.
zh

[NLP-101] Automated CVE Analysis: Harnessing Machine Learning In Designing Question-Answering Models For Cybersecurity Information Extraction

【速读】：该论文试图解决从网络安全领域的非结构化文本数据中自动提取答案的问题，关键在于开发一个针对网络安全领域的问答系统 (Question Answering, QA)。为了应对这一挑战，论文提出了一种新的网络安全专用数据集，并基于此数据集训练了一个机器学习模型，旨在增强模型对领域特定信息的理解和检索能力。这一解决方案的核心在于通过深度学习技术处理大量非结构化数据，并利用QA系统精确提取和映射数据点之间的关系，从而为构建网络安全知识图谱提供支持。

链接: https://arxiv.org/abs/2412.16484
作者: Tanjim Bin Faruk
机构: Colorado State University (科罗拉多州立大学)
关键词: MITRE ATTCK Framework, ATTCK Framework, MITRE ATTCK, including critical data, including critical
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The vast majority of cybersecurity information is unstructured text, including critical data within databases such as CVE, NVD, CWE, CAPEC, and the MITRE ATTCK Framework. These databases are invaluable for analyzing attack patterns and understanding attacker behaviors. Creating a knowledge graph by integrating this information could unlock significant insights. However, processing this large amount of data requires advanced deep-learning techniques. A crucial step towards building such a knowledge graph is developing a robust mechanism for automating the extraction of answers to specific questions from the unstructured text. Question Answering (QA) systems play a pivotal role in this process by pinpointing and extracting precise information, facilitating the mapping of relationships between various data points. In the cybersecurity context, QA systems encounter unique challenges due to the need to interpret and answer questions based on a wide array of domain-specific information. To tackle these challenges, it is necessary to develop a cybersecurity-specific dataset and train a machine learning model on it, aimed at enhancing the understanding and retrieval of domain-specific information. This paper presents a novel dataset and describes a machine learning model trained on this dataset for the QA task. It also discusses the model’s performance and key findings in a manner that maintains a balance between formality and accessibility.
zh

[NLP-102] Chained Tuning Leads to Biased Forgetting

【速读】：该论文试图解决大型语言模型（LLMs）在微调（fine-tuning）下游任务时可能出现的灾难性遗忘（catastrophic forgetting）问题，特别是这种遗忘对模型安全性和特定群体信息的负面影响。解决方案的关键在于引入了一个新的度量标准——偏差遗忘（biased forgetting），并通过系统评估任务顺序对遗忘的影响，以及应用缓解措施帮助模型恢复遗忘的安全信息。研究旨在为持续学习（continual learning）环境下的LLMs微调提供更安全、更少毒性的训练方法。

链接: https://arxiv.org/abs/2412.16469
作者: Megan Ung,Alicia Sun,Samuel J. Bell,Bhaktipriya Radharapu,Levent Sagun,Adina Williams
机构: Meta FAIR
关键词: Large language models, degrade capabilities learned, Large language, degrade capabilities, capabilities learned
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are often fine-tuned for use on downstream tasks, though this can degrade capabilities learned during previous training. This phenomenon, often referred to as catastrophic forgetting, has important potential implications for the safety of deployed models. In this work, we first show that models trained on downstream tasks forget their safety tuning to a greater extent than models trained in the opposite this http URL, we show that forgetting disproportionately impacts safety information about certain groups. To quantify this phenomenon, we define a new metric we term biased forgetting. We conduct a systematic evaluation of the effects of task ordering on forgetting and apply mitigations that can help the model recover from the forgetting observed. We hope our findings can better inform methods for chaining the finetuning of LLMs in continual learning settings to enable training of safer and less toxic models.
zh

[NLP-103] ransducer-Llama: Integrating LLM s into Streamable Transducer-based Speech Recognition ICASSP2025

【速读】：该论文试图解决将大型语言模型 (LLMs) 应用于自动语音识别 (ASR) 时，如何实现模型流式处理的问题。解决方案的关键在于提出了一种新的模型架构，即 Transducer-Llama，它将 LLMs 集成到因子分解转换器 (Factorized Transducer, FT) 模型中，从而自然地实现了流式处理能力。此外，论文还提出了一种高效的词汇适应技术，以解决 LLMs 的大词汇量导致的稀疏性问题和训练成本增加的问题。通过弱到强语言模型 (LM) 交换策略，在 RNN-T 损失训练期间使用弱 LM 预测器，随后替换为强 LLM，并结合最小词错误率 (MWER) 损失进行微调，最终在 LibriSpeech 和多语言 LibriSpeech 语料库上实现了显著的词错误率 (WER) 降低。

链接: https://arxiv.org/abs/2412.16464
作者: Keqi Deng,Jinxi Guo,Yingyi Ma,Niko Moritz,Philip C. Woodland,Ozlem Kalinli,Mike Seltzer
机构: Meta AI, USA; Department of Engineering, University of Cambridge, UK
关键词: automatic speech recognition, model streamable remains, remains a challenge, applied to automatic, task of making
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issue and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach gave a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.
zh

[NLP-104] Research on Violent Text Detection System Based on BERT-fasttext Model

【速读】：该论文试图解决网络环境中暴力文本泛滥的问题，解决方案的关键在于构建一个基于BERT-fasttext模型的有效系统来切断暴力文本。BERT模型具有强大的自然语言理解能力，能够深入挖掘和分析文本语义信息；而fasttext则是一个高效的文本分类工具，具有低复杂度和良好的效果，能够快速为文本处理提供基本判断。通过结合这两种技术，系统不仅能够准确识别暴力文本，还能高效合理地切断内容，防止有害信息在网络上自由传播。与单一的BERT模型和fasttext相比，该组合模型的准确率分别提高了0.7%和0.8%，有助于净化网络环境，维护网络信息的健康，为网民创造一个积极、文明、和谐的在线交流空间。

链接: https://arxiv.org/abs/2412.16455
作者: Yongsheng Yang,Xiaoying Wang
机构: School of Information Engineering, Zhongnan University of Economics and Law, Wuhan, 430073, China(信息工程学院，中南财经政法大学，武汉，430073，中国); School of Information Engineering, Zhongnan University of Economics and Law, Wuhan, 430073, China(信息工程学院，中南财经政法大学，武汉，430073，中国)
关键词: violent text, violent text cutting, age of today, people lives, digital age
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 11 pages,4 figures

点击查看摘要

Abstract:In the digital age of today, the internet has become an indispensable platform for people’s lives, work, and information exchange. However, the problem of violent text proliferation in the network environment has arisen, which has brought about many negative effects. In view of this situation, it is particularly important to build an effective system for cutting off violent text. The study of violent text cutting off based on the BERT-fasttext model has significant meaning. BERT is a pre-trained language model with strong natural language understanding ability, which can deeply mine and analyze text semantic information; Fasttext itself is an efficient text classification tool with low complexity and good effect, which can quickly provide basic judgments for text processing. By combining the two and applying them to the system for cutting off violent text, on the one hand, it can accurately identify violent text, and on the other hand, it can efficiently and reasonably cut off the content, preventing harmful information from spreading freely on the network. Compared with the single BERT model and fasttext, the accuracy was improved by 0.7% and 0.8%, respectively. Through this model, it is helpful to purify the network environment, maintain the health of network information, and create a positive, civilized, and harmonious online communication space for netizens, driving the development of social networking, information dissemination, and other aspects in a more benign direction.
zh

[NLP-105] Correcting Large Language Model Behavior via Influence Function

【速读】：该论文试图解决大语言模型 (LLMs) 在动态人类偏好变化下偏离当代人类偏好和社会规范的问题。解决方案的关键在于提出了一种无需人工干预的新方法，称为 Large Language Model Behavior Correction with Influence Function Recall and Post-Training (LANCET)。该方法通过两个阶段实现：首先利用影响函数 (influence functions) 识别对不良模型输出有显著影响的训练数据，然后采用基于影响函数的 Bregman 优化 (Influence function-driven Bregman Optimization, IBO) 技术，根据这些影响分布调整模型行为。实验结果表明，LANCET 能有效且高效地纠正 LLMs 的不当行为，并优于依赖收集人类偏好的方法，同时增强了学习人类偏好在 LLMs 中的可解释性。

链接: https://arxiv.org/abs/2412.16451
作者: Han Zhang,Zhuo Zhang,Yi Zhang,Yuanzhao Zhai,Hanyang Peng,Yu Lei,Yue Yu,Hui Wang,Bin Liang,Lin Gui,Ruifeng Xu
机构: Harbin Institute of Technology(哈尔滨工业大学); Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所); University of Chinese Academy of Sciences(中国科学院大学); Heilongjiang University(黑龙江大学); Harbin Medical University(哈尔滨医科大学)
关键词: Recent advancements, human preferences, static human preferences, large language, large language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether they involve the curation of new data for continual alignment or the manual correction of outdated data for re-alignment, demand costly human resources. To address this challenge, we propose a novel approach, Large Language Model Behavior Correction with Influence Function Recall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using influence functions to identify the training data that significantly impact undesirable model outputs, and (2) applying an Influence function-driven Bregman Optimization (IBO) technique to adjust the model’s behavior based on these influence distributions. Our experiments demonstrate that LANCET effectively and efficiently correct inappropriate behaviors of LLMs. Furthermore, LANCET can outperform methods that rely on collecting human preferences, and it enhances the interpretability of learning human preferences within LLMs.
zh

[NLP-106] Effective Context Modeling Framework for Emotion Recognition in Conversations

【速读】：该论文试图解决在对话中情感识别 (Emotion Recognition in Conversations, ERC) 时，现有方法难以充分捕捉多模态和对话上下文之间复杂交互的问题。解决方案的关键在于提出了一个基于图神经网络 (Graph Neural Networks, GNNs) 的新框架 ConxGNN，该框架通过两个并行的模块来增强对上下文信息的捕捉：一个是多尺度异构图 (multi-scale heterogeneous graph)，用于捕捉话语对情感变化的多样性影响；另一个是超图 (hypergraph)，用于建模多模态和话语之间的多元关系。这两个模块的输出通过融合层进行整合，并应用跨模态注意力机制 (cross-modal attention mechanism) 生成上下文丰富的表示。此外，ConxGNN 还通过在损失函数中引入重加权方案 (re-weighting scheme) 来解决少数类或语义相似情感类的识别难题。

链接: https://arxiv.org/abs/2412.16444
作者: Cuong Tran Van,Thanh V. T. Tran,Van Nguyen,Truong Son Hy
机构: FPT Software AI Center(FPT软件AI中心); The University of Alabama at Birmingham(阿拉巴马大学伯明翰分校)
关键词: Graph Neural Networks, Recognition in Conversations, facilitates a deeper, Emotion Recognition, deeper understanding
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Emotion Recognition in Conversations (ERC) facilitates a deeper understanding of the emotions conveyed by speakers in each utterance within a conversation. Recently, Graph Neural Networks (GNNs) have demonstrated their strengths in capturing data relationships, particularly in contextual information modeling and multimodal fusion. However, existing methods often struggle to fully capture the complex interactions between multiple modalities and conversational context, limiting their expressiveness. To overcome these limitations, we propose ConxGNN, a novel GNN-based framework designed to capture contextual information in conversations. ConxGNN features two key parallel modules: a multi-scale heterogeneous graph that captures the diverse effects of utterances on emotional changes, and a hypergraph that models the multivariate relationships among modalities and utterances. The outputs from these modules are integrated into a fusion layer, where a cross-modal attention mechanism is applied to produce a contextually enriched representation. Additionally, ConxGNN tackles the challenge of recognizing minority or semantically similar emotion classes by incorporating a re-weighting scheme into the loss functions. Experimental results on the IEMOCAP and MELD benchmark datasets demonstrate the effectiveness of our method, achieving state-of-the-art performance compared to previous baselines.
zh

[NLP-107] chnical Report: Small Language Model for Japanese Clinical and Medicine

【速读】：该论文旨在解决在临床和医学领域中使用小型语言模型（SLM）进行文本生成和理解的可行性问题。解决方案的关键在于开发了一个名为NCVC-slm-1的10亿参数日语语言模型，该模型经过高质量日语文本的训练，并针对临床和医学内容进行了增强，涵盖了多种疾病、药物和检查。通过精心设计的预处理、专用的形态分析器和分词器，NCVC-slm-1不仅能够生成文本，还展示了理解临床和医学文本的能力。在JMED-LLM的8项任务中，经过微调的NCVC-slm-1在6项任务中表现最佳，证明了其在临床和医学领域执行下游任务的可行性。

链接: https://arxiv.org/abs/2412.16423
作者: Shogo Watanabe
机构: National Cerebral and Cardiovascular Center Hospital, Japan; Graduate School of Medicine and Faculty of Medicine, Kyoto University, Japan
关键词: clinical and medicine, report presents, Japanese text classified, Japanese clinical, clinical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This report presents a small language model (SLM) for Japanese clinical and medicine, named NCVC-slm-1. This 1B parameters model was trained using Japanese text classified to be of high-quality. Moreover, NCVC-slm-1 was augmented with respect to clinical and medicine content that includes the variety of diseases, drugs, and examinations. Using a carefully designed pre-processing, a specialized morphological analyzer and tokenizer, this small and light-weight model performed not only to generate text but also indicated the feasibility of understanding clinical and medicine text. In comparison to other large language models, a fine-tuning NCVC-slm-1 demonstrated the highest scores on 6 tasks of total 8 on JMED-LLM. According to this result, SLM indicated the feasibility of performing several downstream tasks in the field of clinical and medicine. Hopefully, NCVC-slm-1 will be contributed to develop and accelerate the field of clinical and medicine for a bright future.
zh

[NLP-108] Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding

【速读】：该论文试图解决现有视觉-语言模型（VLMs）在流程图理解中的两个关键问题：(i) 控制性有限，用户只能修改输入图像而无法直接影响模型的训练；(ii) 缺乏可解释性，难以追溯模型错误的具体原因。解决方案的关键是提出TextFlow，它通过两个阶段来解决这些问题：(i) 视觉文本化器（Vision Textualizer），将流程图图像转换为文本表示；(ii) 文本推理器（Textual Reasoner），基于文本表示进行问答。TextFlow的优势在于：(i) 提供多种文本表示选择（如Graphviz、Mermaid、PlantUML），并可进一步转换为可执行的图对象，增强性能和控制性；(ii) 通过明确错误归因于视觉或文本处理组件，提高可解释性；(iii) 促进解决方案的模块化，允许在推理阶段使用更高级的大型语言模型（LLMs）以提升性能。实验结果表明，TextFlow在FlowVQA和FlowLearn基准测试中达到了最先进的性能和鲁棒性。

链接: https://arxiv.org/abs/2412.16420
作者: Junyi Ye,Ankan Dash,Wenpeng Yin,Guiling Wang
机构: New Jersey Institute of Technology(新泽西理工学院); Pennsylvania State University(宾夕法尼亚州立大学)
关键词: driving the trend, vision-language models, typically presented, flowchart understanding, modify input images
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Flowcharts are typically presented as images, driving the trend of using vision-language models (VLMs) for end-to-end flowchart understanding. However, two key challenges arise: (i) Limited controllability–users have minimal influence over the downstream task, as they can only modify input images, while the training of VLMs is often out of reach for most researchers. (ii) Lack of explainability–it is difficult to trace VLM errors to specific causes, such as failures in visual encoding or reasoning. We propose TextFlow, addressing aforementioned issues with two stages: (i) Vision Textualizer–which generates textual representations from flowchart images; and (ii) Textual Reasoner–which performs question-answering based on the text representations. TextFlow offers three key advantages: (i) users can select the type of text representations (e.g., Graphviz, Mermaid, PlantUML), or further convert them into executable graph object to call tools, enhancing performance and controllability; (ii) it improves explainability by helping to attribute errors more clearly to visual or textual processing components; and (iii) it promotes the modularization of the solution, such as allowing advanced LLMs to be used in the Reasoner stage when VLMs underperform in end-to-end fashion. Experiments on the FlowVQA and FlowLearn benchmarks demonstrate TextFlow’s state-of-the-art performance as well as its robustness. All code is publicly available.
zh

[NLP-109] Identifying Cyberbullying Roles in Social Media

【速读】：该论文试图解决在社交媒体环境中准确检测网络欺凌（cyberbullying）中个体角色的问题，以有效应对这一全球性威胁。解决方案的关键在于利用机器学习模型，特别是基于BERT、RoBERTa、T5和GPT-2等大型语言模型（LLMs），并通过过采样技术（oversampling techniques）处理类别不平衡问题，从而提升模型性能。研究结果表明，经过微调的RoBERTa模型在过采样数据上表现最佳，整体F1分数达到83.5%，并在应用预测阈值后提升至89.3%。该方法在处理具有更多样本和较少上下文混淆的类别（如旁观者其他）时表现优异，但在样本较少且上下文模糊的类别（如旁观者助手、施害者和受害者）上仍存在挑战。

链接: https://arxiv.org/abs/2412.16417
作者: Manuel Sandoval,Mohammed Abuhamad,Patrick Furman,Mujtaba Nazari,Deborah L. Hall,Yasin N. Silva
机构: 未知
关键词: allowing people worldwide, Social media, revolutionized communication, allowing people, interact instantly
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Social media has revolutionized communication, allowing people worldwide to connect and interact instantly. However, it has also led to increases in cyberbullying, which poses a significant threat to children and adolescents globally, affecting their mental health and well-being. It is critical to accurately detect the roles of individuals involved in cyberbullying incidents to effectively address the issue on a large scale. This study explores the use of machine learning models to detect the roles involved in cyberbullying interactions. After examining the AMiCA dataset and addressing class imbalance issues, we evaluate the performance of various models built with four underlying LLMs (i.e., BERT, RoBERTa, T5, and GPT-2) for role detection. Our analysis shows that oversampling techniques help improve model performance. The best model, a fine-tuned RoBERTa using oversampled data, achieved an overall F1 score of 83.5%, increasing to 89.3% after applying a prediction threshold. The top-2 F1 score without thresholding was 95.7%. Our method outperforms previously proposed models. After investigating the per-class model performance and confidence scores, we show that the models perform well in classes with more samples and less contextual confusion (e.g., Bystander Other), but struggle with classes with fewer samples (e.g., Bystander Assistant) and more contextual ambiguity (e.g., Harasser and Victim). This work highlights current strengths and limitations in the development of accurate models with limited data and complex scenarios.
zh

[NLP-110] InfoTech Assistant : A Multimodal Conversational Agent for InfoTechnology Web Portal Queries

【速读】：该论文旨在解决桥梁评估和基础设施技术领域中的信息查询问题，开发了一种特定领域的多模态聊天机器人——InfoTech Assistant。解决方案的关键在于整合了网络数据抓取、大型语言模型 (LLMs) 和检索增强生成 (RAG) 技术，以提供准确且上下文相关的响应。通过从公开文档中提取文本描述和图像数据，并以JSON格式组织，系统能够高效处理查询。其架构包括基于HTML的界面和连接到Llama 3.1模型的Flask后端，通过LLM Studio实现模型调用。这种RAG增强的设置使得InfoTech Assistant能够处理复杂的、多模态的查询，提供文本和视觉信息，从而在领域特定任务中实现了约95%的准确率，显示出其在基础设施专业人员中的潜在应用价值。

链接: https://arxiv.org/abs/2412.16412
作者: Sai Surya Gadiraju,Duoduo Liao,Akhila Kudupudi,Santosh Kasula,Charitha Chalasani
机构: George Mason University (乔治梅森大学)
关键词: pilot study presents, multimodal chatbot engineered, InfoTech Assistant, pilot study, study presents
类目: Computation and Language (cs.CL)
备注: Accepted by IEEE Big Data 2024

点击查看摘要

Abstract:This pilot study presents the development of the InfoTech Assistant, a domain-specific, multimodal chatbot engineered to address queries in bridge evaluation and infrastructure technology. By integrating web data scraping, large language models (LLMs), and Retrieval-Augmented Generation (RAG), the InfoTech Assistant provides accurate and contextually relevant responses. Data, including textual descriptions and images, are sourced from publicly available documents on the InfoTechnology website and organized in JSON format to facilitate efficient querying. The architecture of the system includes an HTML-based interface and a Flask back end connected to the Llama 3.1 model via LLM Studio. Evaluation results show approximately 95 percent accuracy on domain-specific tasks, with high similarity scores confirming the quality of response matching. This RAG-enhanced setup enables the InfoTech Assistant to handle complex, multimodal queries, offering both textual and visual information in its responses. The InfoTech Assistant demonstrates strong potential as a dependable tool for infrastructure professionals, delivering high accuracy and relevance in its domain-specific outputs.
zh

[NLP-111] Application of Multimodal Large Language Models in Autonomous Driving

【速读】：该论文试图解决自动驾驶系统(Autonomous Driving, AD)在复杂驾驶环境中的性能限制问题。解决方案的关键在于采用多模态大语言模型(Multi-modal Large Language Model, MLLM)，并通过构建虚拟问答数据集(Virtual Question Answering, VQA)对模型进行微调，以提升其在自动驾驶任务中的表现。论文进一步将自动驾驶的决策过程分解为场景理解、预测和决策，并使用思维链(Chain of Thought)方法来优化决策过程，从而提高系统的安全性和适应性。

链接: https://arxiv.org/abs/2412.16410
作者: Md Robiul Islam
机构: William & Mary (威廉与玛丽学院)
关键词: complex driving environments, enhance Autonomous Driving, technological advancements, focusing on improving, improving safety
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this era of technological advancements, several cutting-edge techniques are being implemented to enhance Autonomous Driving (AD) systems, focusing on improving safety, efficiency, and adaptability in complex driving environments. However, AD still faces some problems including performance limitations. To address this problem, we conducted an in-depth study on implementing the Multi-modal Large Language Model. We constructed a Virtual Question Answering (VQA) dataset to fine-tune the model and address problems with the poor performance of MLLM on AD. We then break down the AD decision-making process by scene understanding, prediction, and decision-making. Chain of Thought has been used to make the decision more perfectly. Our experiments and detailed analysis of Autonomous Driving give an idea of how important MLLM is for AD.
zh

[NLP-112] REFA: Reference Free Alignment for multi-preference optimization

【速读】：该论文试图解决在无参考对齐方法中优化多用户偏好并实现细粒度长度控制的问题。解决方案的关键在于引入REFA（Reference-Free Alignment）方法，该方法通过集成偏差加权（deviation-based weighting）、长度归一化（length normalization）和EOS概率正则化（EOS-probability regularizer）来强调高质量响应、防止短响应的简单解决方案，并缓解数据集导致的简洁偏差。REFA通过修正理论上的不确定性减少与序列长度断言（URSLA）框架下的长度归一化带来的隐性激励，引导模型生成更丰富且更符合人类偏好的输出，从而在AlpacaEval v2基准测试中显著超越了现有的无参考对齐方法。

链接: https://arxiv.org/abs/2412.16378
作者: Taneesh Gupta,Rahul Madhavan,Xuchao Zhang,Chetan Bansal,Saravan Rajmohan
机构: Microsoft(微软); IISc, Bangalore(印度科学理工学院，班加罗尔)
关键词: fine-grained length control, enforcing fine-grained length, multiple user preferences, Sequence Length Assertion, optimize over multiple
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce REFA, a family of reference-free alignment methods that optimize over multiple user preferences while enforcing fine-grained length control. Our approach integrates deviation-based weighting to emphasize high-quality responses more strongly, length normalization to prevent trivial short-response solutions, and an EOS-probability regularizer to mitigate dataset-induced brevity biases. Theoretically, we show that under the Uncertainty Reduction with Sequence Length Assertion (URSLA), naive length normalization can still incentivize length-based shortcuts. By contrast, REFA corrects these subtle incentives, guiding models toward genuinely more informative and higher-quality outputs. Empirically, REFA sets a new state-of-the-art among reference-free alignment methods, producing richer responses aligned more closely with human preferences. Compared to a base supervised fine-tuned (SFT) mistral-7b model that achieves 8.4% length-controlled win rate (LC-WR) and 6.2% win rate (WR), our best REFA configuration attains 21.62% LC-WR and 19.87% WR on the AlpacaEval v2 benchmark. This represents a substantial improvement over both the strongest multi-preference baseline, InfoNCA (16.82% LC-WR, 10.44% WR), and the strongest reference-free baseline, SimPO (20.01% LC-WR, 17.65% WR)
zh

[NLP-113] Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)

【速读】：该论文旨在解决低资源语言（low-resource languages）在自然语言处理（NLP）中的挑战，特别是由于神经语言模型（LMs）对高资源语言的偏向性问题。解决方案的关键在于通过LoResLM 2025研讨会为研究人员提供一个交流平台，分享和讨论针对低资源语言的语言模型研究进展，涵盖了从八个语系和十三个研究领域的广泛内容，从而推动NLP领域的语言多样性和包容性。

链接: https://arxiv.org/abs/2412.16365
作者: Hansi Hettiarachchi,Tharindu Ranasinghe,Paul Rayson,Ruslan Mitkov,Mohamed Gaber,Damith Premasiri,Fiona Anting Tan,Lasitha Uyangodage
机构: Lancaster University, UK; Birmingham City University, UK; National University of Singapore, Singapore; University of Münster, Germany
关键词: United Arab Emirates, International Conference, Abu Dhabi, United Arab, Arab Emirates
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)

点击查看摘要

Abstract:The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.
zh

[NLP-114] A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation COLING2025

【速读】：该论文试图解决大尺度多模态模型在处理文本丰富的图像时因训练数据不足而表现不佳的问题。解决方案的关键在于通过混合指令生成方法（hybrid instruction generation）来增强多模态对齐（multimodal alignment）。具体来说，研究提出了LLaVAR-2方法，结合人类标注者提供的详细图像描述和GPT-4o生成的定制文本提示，构建了一个高质量的数据集。该方法还通过多种机制过滤低质量数据，最终生成了包含424k高质量指令对的数据集。实验结果表明，基于该数据集微调的模型在性能上显著优于使用自指令（self-instruct）数据训练的模型。

链接: https://arxiv.org/abs/2412.16364
作者: Shijie Zhou,Ruiyi Zhang,Yufan Zhou,Changyou Chen
机构: 未知
关键词: inadequate training data, inadequate training, data, training data, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: COLING 2025

点击查看摘要

Abstract:Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way for generating instruction data, but its quality is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2, to enhance multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, it involves detailed image captions from human annotators, followed by the use of these annotations in tailored text prompts for GPT-4o to curate a dataset. It also implements several mechanisms to filter out low-quality data, and the resulting dataset comprises 424k high-quality pairs of instructions. Empirical results show that models fine-tuned on this dataset exhibit impressive enhancements over those trained with self-instruct data.
zh

[NLP-115] Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context

【速读】：该论文试图解决现有研究中依赖于容易被自动化方法检测到的无意义对抗提示（nonsensical adversarial prompts）的问题，提出了一种更现实且有效的威胁模型，即利用人类可读的对抗提示（human-readable adversarial prompts）。解决方案的关键在于：1) 情境驱动的攻击，通过电影剧本创建上下文相关的、人类可读的提示；2) 对抗后缀转换（adversarial suffix conversion），将无意义的后缀转化为有意义的文本；3) AdvPrompter 结合 p-nucleus 采样，生成多样化的人类可读对抗后缀，从而提高对 GPT-3.5 和 Gemma 7B 等模型的攻击效果。这些方法共同展示了通过复杂对抗提示成功欺骗大型语言模型（LLMs）的可能性，并指出了提升模型鲁棒性的必要性。

链接: https://arxiv.org/abs/2412.16359
作者: Nilanjana Das,Edward Raff,Manas Gaur
机构: UMBC, MD, USA (马里兰大学巴尔的摩分校); Booz Allen Hamilton (博思艾伦汉密尔顿)
关键词: Previous research, human-readable adversarial prompts, vulnerabilities often relied, easily detectable, detectable by automated
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts, which were easily detectable by automated methods. We address this gap by focusing on human-readable adversarial prompts, a more realistic and potent threat. Our key contributions are situation-driven attacks leveraging movie scripts to create contextually relevant, human-readable prompts that successfully deceive LLMs, adversarial suffix conversion to transform nonsensical adversarial suffixes into meaningful text, and AdvPrompter with p-nucleus sampling, a method to generate diverse, human-readable adversarial suffixes, improving attack efficacy in models like GPT-3.5 and Gemma 7B. Our findings demonstrate that LLMs can be tricked by sophisticated adversaries into producing harmful responses with human-readable adversarial prompts and that there exists a scope for improvement when it comes to robust LLMs.
zh

[NLP-116] A Machine Learning Approach for Emergency Detection in Medical Scenarios Using Large Language Models

【速读】：该论文试图解决通过数字通信渠道快速识别医疗紧急情况的问题，特别是在远程医疗日益普及的背景下。解决方案的关键在于利用大型语言模型 (LLMs) 和提示工程 (prompt engineering) 技术，通过开发和评估多个 LLaMA 模型变体（1B、3B 和 7B 参数）来自动分类医疗场景为紧急或非紧急情况。研究中采用了系统提示和提示内训练方法，并通过在不同硬件配置上进行测试，发现包含 10 个示例场景的提示设计能够实现最佳分类性能。LLaMA 2 (7B) 模型达到了 99.7% 的准确率，而 LLaMA 3.2 (3B) 模型也达到了 99.6% 的准确率。该系统特别强调了在紧急情况下减少高风险假阴性的重要性，这对于患者安全至关重要。

链接: https://arxiv.org/abs/2412.16341
作者: Ferit Akaybicen,Aaron Cummings,Lota Iwuagwu,Xinyue Zhang,Modupe Adewuyi
机构: 未知
关键词: digital communication channels, communication channels remains, modern healthcare delivery, prevalence of telemedicine, rapid identification
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid identification of medical emergencies through digital communication channels remains a critical challenge in modern healthcare delivery, particularly with the increasing prevalence of telemedicine. This paper presents a novel approach leveraging large language models (LLMs) and prompt engineering techniques for automated emergency detection in medical communications. We developed and evaluated a comprehensive system using multiple LLaMA model variants (1B, 3B, and 7B parameters) to classify medical scenarios as emergency or non-emergency situations. Our methodology incorporated both system prompts and in-prompt training approaches, evaluated across different hardware configurations. The results demonstrate exceptional performance, with the LLaMA 2 (7B) model achieving 99.7% accuracy and the LLaMA 3.2 (3B) model reaching 99.6% accuracy with optimal prompt engineering. Through systematic testing of training examples within the prompts, we identified that including 10 example scenarios in the model prompts yielded optimal classification performance. Processing speeds varied significantly between platforms, ranging from 0.05 to 2.2 seconds per request. The system showed particular strength in minimizing high-risk false negatives in emergency scenarios, which is crucial for patient safety. The code implementation and evaluation framework are publicly available on GitHub, facilitating further research and development in this crucial area of healthcare technology.
zh

[NLP-117] Deliberative Alignment: Reasoning Enables Safer Language Models

【速读】：该论文试图解决大规模语言模型在安全关键领域中可靠遵循预定义原则的挑战。解决方案的关键在于提出了Deliberative Alignment这一新范式，通过直接教授模型安全规范并训练其在回答前明确回忆和准确推理这些规范，从而实现对安全政策的精确遵循。该方法无需人工编写的推理链或答案，显著提升了模型的鲁棒性（robustness）、降低了过度拒绝率（overrefusal rates），并增强了模型在分布外（out-of-distribution）情况下的泛化能力，同时提高了模型的可扩展性、可信度和可解释性。

链接: https://arxiv.org/abs/2412.16339
作者: Melody Y. Guan,Manas Joglekar,Eric Wallace,Saachi Jain,Boaz Barak,Alec Heylar,Rachel Dias,Andrea Vallone,Hongyu Ren,Jason Wei,Hyung Won Chung,Sam Toyer,Johannes Heidecke,Alex Beutel,Amelia Glaese
机构: OpenAI; Google
关键词: impact safety-critical domains, increasingly impact safety-critical, well-defined principles remains, large-scale language models, language models increasingly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 24 pages

点击查看摘要

Abstract:As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models, and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
zh

[NLP-118] Decoding Linguistic Nuances in Mental Health Text Classification Using Expressive Narrative Stories

【速读】：该论文试图解决在分析表达性叙事故事（Expressive Narrative Stories, ENS）时，传统模型和专门设计的MentalBERT模型对缺乏明确心理健康术语的文本的敏感性问题。解决方案的关键在于发现BERT模型在处理ENS时表现出对主题词缺失的低敏感性，能够更好地理解深层语言特征，从而在实际应用中更为有效。此外，BERT和MentalBERT在识别语言细微差别和保持分类准确性方面表现出色，即使在叙事顺序被打乱的情况下也能保持稳定性，这为深入理解心理健康相关叙事提供了重要支持。

链接: https://arxiv.org/abs/2412.16302
作者: Jinwen Tang,Qiming Guo,Yunxin Zhao,Yi Shang
机构: University of Missouri (密苏里大学); Texas A&M University-Corpus Christi (德克萨斯A&M大学-科珀斯克里斯蒂)
关键词: analyzing social media, Recent advancements, advancements in NLP, NLP have spurred, Expressive Narrative Stories
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by IEEE CogMI 2024

点击查看摘要

Abstract:Recent advancements in NLP have spurred significant interest in analyzing social media text data for identifying linguistic features indicative of mental health issues. However, the domain of Expressive Narrative Stories (ENS)-deeply personal and emotionally charged narratives that offer rich psychological insights-remains underexplored. This study bridges this gap by utilizing a dataset sourced from Reddit, focusing on ENS from individuals with and without self-declared depression. Our research evaluates the utility of advanced language models, BERT and MentalBERT, against traditional models. We find that traditional models are sensitive to the absence of explicit topic-related words, which could risk their potential to extend applications to ENS that lack clear mental health terminology. Despite MentalBERT is design to better handle psychiatric contexts, it demonstrated a dependency on specific topic words for classification accuracy, raising concerns about its application when explicit mental health terms are sparse (P-value0.05). In contrast, BERT exhibited minimal sensitivity to the absence of topic words in ENS, suggesting its superior capability to understand deeper linguistic features, making it more effective for real-world applications. Both BERT and MentalBERT excel at recognizing linguistic nuances and maintaining classification accuracy even when narrative order is disrupted. This resilience is statistically significant, with sentence shuffling showing substantial impacts on model performance (P-value0.05), especially evident in ENS comparisons between individuals with and without mental health declarations. These findings underscore the importance of exploring ENS for deeper insights into mental health-related narratives, advocating for a nuanced approach to mental health text analysis that moves beyond mere keyword detection.
zh

[NLP-119] Benchmarking LLM s and SLMs for patient reported outcomes

【速读】：该论文试图解决在医疗领域中，如何利用语言模型（LLMs）和轻量级语言模型（SLMs）高效且隐私保护地总结患者报告结果（PROs）的问题。解决方案的关键在于利用SLMs的优势，即能够在本地部署，确保患者数据的隐私和符合医疗法规，从而在放射治疗背景下对患者报告的问答表单进行总结。通过对比LLMs和SLMs的性能，研究评估了它们的精确性和可靠性，揭示了SLMs在处理高风险医疗任务中的潜力和局限性，推动了更高效和隐私保护的AI驱动医疗解决方案的发展。

链接: https://arxiv.org/abs/2412.16291
作者: Matteo Marengo,Jarod Lévy,Jean-Emmanuel Bibault
机构: ENS Paris-Saclay(巴黎萨克雷高等师范学校); Hôpital Européen Georges Pompidou(乔治·蓬皮杜欧洲医院)
关键词: transformed the execution, execution of numerous, numerous tasks, medical domain, summarizing patient-reported outcomes
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:LLMs have transformed the execution of numerous tasks, including those in the medical domain. Among these, summarizing patient-reported outcomes (PROs) into concise natural language reports is of particular interest to clinicians, as it enables them to focus on critical patient concerns and spend more time in meaningful discussions. While existing work with LLMs like GPT-4 has shown impressive results, real breakthroughs could arise from leveraging SLMs as they offer the advantage of being deployable locally, ensuring patient data privacy and compliance with healthcare regulations. This study benchmarks several SLMs against LLMs for summarizing patient-reported Q\A forms in the context of radiotherapy. Using various metrics, we evaluate their precision and reliability. The findings highlight both the promise and limitations of SLMs for high-stakes medical tasks, fostering more efficient and privacy-preserving AI-driven healthcare solutions.
zh

[NLP-120] Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving

【速读】：该论文试图解决大型语言模型（LLMs）在追求高准确性和推理能力时，忽视计算效率的问题。解决方案的关键在于探索推理增强与计算效率之间的协同效应，通过分析Quiet-STaR和REBASE两种方法的集成，揭示了在推理深度与计算效率之间难以调和的根本挑战。研究结果表明，尽管单独使用Quiet-STaR和REBASE分别在准确性和效率上表现优异，但它们的集成却导致性能下降，这突显了在LLMs中平衡推理增强与效率优化的复杂性，并指出了未来研究需要开发新型架构和算法以弥合这一差距。

链接: https://arxiv.org/abs/2412.16260
作者: Marwan AbdElhameed,Pavly Halim
机构: New York University (纽约大学)
关键词: Recent advances, overlooking crucial computational, large language models, computational efficiency considerations, crucial computational efficiency
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have predominantly focused on maximizing accuracy and reasoning capabilities, often overlooking crucial computational efficiency considerations. While this approach has yielded impressive accuracy improvements, it has led to methods that may be impractical for real-world deployment due to computational overhead and latency constraints. This paper investigates the potential synergy between reasoning enhancement and computational efficiency by analyzing the integration of two contrasting approaches: Quiet-STaR (Self-Taught Reasoner) and REBASE (REward BAlanced SEarch). Through comprehensive empirical analysis using the Mistral-7B model on the GSM8K dataset, we demonstrate that while each method excels in its primary objective-Quiet-STaR achieving superior accuracy (32.03%) despite high computational cost (554.66s runtime, 12.73T FLOPs), and REBASE providing exceptional efficiency (8.47s runtime, 2.35T FLOPs) while maintaining baseline-comparable accuracy (10.94%)-their integration reveals fundamental challenges in reconciling reasoning depth with computational efficiency. The combined approach unexpectedly results in degraded performance (9.38% accuracy, 143.66s runtime), highlighting critical insights about the complex interplay between reasoning enhancement and efficiency optimization in LLMs. Our findings illuminate the need for novel architectures and algorithms specifically designed to bridge the gap between these competing objectives, while providing concrete directions for future research in compute-efficient reasoning methods.
zh

[NLP-121] Adversarial Robustness through Dynamic Ensemble Learning

【速读】：该论文试图解决预训练语言模型 (PLMs) 如 GPT、BERT、RoBERTa 和 T5 在面对对抗攻击时的可靠性问题。解决方案的关键是提出了一种名为 Adversarial Robustness through Dynamic Ensemble Learning (ARDEL) 的新方案，通过动态集成学习增强 PLMs 的对抗鲁棒性。ARDEL 的核心在于利用多个 PLMs 的多样性，并根据输入特征和检测到的对抗模式动态调整集成配置。其关键组件包括用于动态加权的元模型、对抗模式检测模块以及结合正则化技术的对抗训练。通过动态重新配置集成模型，优先选择对当前输入最鲁棒的模型，ARDEL 有效降低了攻击成功率并在对抗条件下保持了更高的准确性。

链接: https://arxiv.org/abs/2412.16254
作者: Hetvi Waghela,Jaydip Sen,Sneha Rakshit
机构: 未知
关键词: pre-trained language models, Dynamic Ensemble Learning, pose a significant, significant threat, reliability of pre-trained
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: This is the accepted version of our paper for the 2024 IEEE Silchar Subsection Conference (IEEE SILCON24), held from November 15 to 17, 2024, at the National Institute of Technology (NIT), Agartala, India. The paper is 6 pages long and contains 3 Figures and 7 Tables

点击查看摘要

Abstract:Adversarial attacks pose a significant threat to the reliability of pre-trained language models (PLMs) such as GPT, BERT, RoBERTa, and T5. This paper presents Adversarial Robustness through Dynamic Ensemble Learning (ARDEL), a novel scheme designed to enhance the robustness of PLMs against such attacks. ARDEL leverages the diversity of multiple PLMs and dynamically adjusts the ensemble configuration based on input characteristics and detected adversarial patterns. Key components of ARDEL include a meta-model for dynamic weighting, an adversarial pattern detection module, and adversarial training with regularization techniques. Comprehensive evaluations using standardized datasets and various adversarial attack scenarios demonstrate that ARDEL significantly improves robustness compared to existing methods. By dynamically reconfiguring the ensemble to prioritize the most robust models for each input, ARDEL effectively reduces attack success rates and maintains higher accuracy under adversarial conditions. This work contributes to the broader goal of developing more secure and trustworthy AI systems for real-world NLP applications, offering a practical and scalable solution to enhance adversarial resilience in PLMs.
zh

[NLP-122] GraphLoRA: Empowering LLM s Fine-Tuning via Graph Collaboration of MoE

【速读】：该论文试图解决现有低秩适应（Low-Rank Adaptation, LoRA）和专家混合（Mixture-of-Expert, MoE）技术在大型语言模型（LLMs）微调中，由于专家间缺乏有效协作和通信导致的负载不平衡问题。解决方案的关键在于提出了一种基于图神经网络（Graph Neural Networks, GNNs）的图路由函数，即GraphLoRA框架。该框架通过图路由函数捕捉专家间的协作信号，并通过聚合操作使专家能够理解输入知识并共享邻近专家的信息。此外，论文还设计了两种新的协调策略：基于泊松分布的区分策略和基于正态分布的负载平衡策略，以增强专家的能力及其协作效果。实验结果表明，GraphLoRA在参数高效微调LLMs方面具有显著优势。

链接: https://arxiv.org/abs/2412.16216
作者: Ting Bai,Yue Yu,Le Huang,Zenan Xu,Zhe Zhao,Chuan Shi
机构: Ting Bai1, Yue Yu1, Le Huang1, Chuan Shi1 -> Beijing Key Laboratory of Big Data Management and Analysis Methods, School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学计算机科学与工程学院大数据管理与分析方法北京市重点实验室); Zenan Xu2, Zhe Zhao2 -> School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学计算机科学与工程学院)
关键词: Low-Rank Adaptation, widely adopted, downstream applications, parameter-efficient fine-tuning method, Adaptation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that has been widely adopted in various downstream applications of LLMs. Together with the Mixture-of-Expert (MoE) technique, fine-tuning approaches have shown remarkable improvements in model capability. However, the coordination of multiple experts in existing studies solely relies on the weights assigned by the simple router function. Lack of communication and collaboration among experts exacerbate the instability of LLMs due to the imbalance load problem of MoE. To address this issue, we propose a novel MoE graph-based LLM fine-tuning framework GraphLoRA, in which a graph router function is designed to capture the collaboration signals among experts by graph neural networks (GNNs). GraphLoRA enables all experts to understand input knowledge and share information from neighbor experts by aggregating operations. Besides, to enhance each expert’s capability and their collaborations, we design two novel coordination strategies: the Poisson distribution-based distinction strategy and the Normal distribution-based load balance strategy. Extensive experiments on four real-world datasets demonstrate the effectiveness of our GraphLoRA in parameter-efficient fine-tuning of LLMs, showing the benefits of facilitating collaborations of multiple experts in the graph router of GraphLoRA.
zh

[NLP-123] Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

【速读】：该论文试图解决当前文本到视频（T2V）生成模型在处理多事件序列故事生成时的不足问题。现有模型虽然在生成高质量细节方面表现出色，但在呈现复杂故事情节时仍存在困难，尤其是在处理连续事件的连贯性上。解决方案的关键在于引入StoryEval，这是一个专门设计的以故事为导向的基准测试，用于评估T2V模型在完成故事生成任务中的能力。StoryEval通过423个涵盖7类故事的提示，测试模型在生成2-4个连续事件视频时的表现，并利用GPT-4V和LLaVA-OV-Chat-72B等先进视觉语言模型进行事件完成验证，采用一致投票方法提高评估可靠性。该基准测试揭示了现有模型在故事生成方面的挑战，平均故事完成率未超过50%，为未来开发更先进的T2V模型提供了新的评估标准和研究方向。

链接: https://arxiv.org/abs/2412.16211
作者: Yiping Wang,Xuehai He,Kuan Wang,Luyao Ma,Jianwei Yang,Shuohang Wang,Simon Shaolei Du,Yelong Shen
机构: University of Washington; University of California, Santa Cruz; Georgia Institute of Technology; University of California, San Diego; Microsoft
关键词: highly realistic details, produce commercial-grade videos, realistic details, produce commercial-grade, highly realistic
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
备注: benchmark paper, project page: this https URL

点击查看摘要

Abstract:The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeable an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short simple story ‘how to put an elephant into a refrigerator.’ While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models’ abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models’ story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models reveals its challenge, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.
zh

[NLP-124] Multi-head attention debiasing and contrastive learning for mitigating Dataset Artifacts in Natural Language Inference

【速读】：该论文试图解决自然语言推理（NLI）模型是否真正理解任务，还是主要依赖于数据集中的偏差（artifacts）的问题。通过分析Stanford Natural Language Inference (SNLI)数据集，论文揭示了多种偏差模式及其相互作用，并提出了新的结构化去偏方法。关键解决方案在于细粒度的偏差分析和多头部去偏架构（multi-head debiasing architecture），该架构在长度偏差、词汇重叠偏差、子集关系偏差和否定偏差等多个类别上实现了显著改进，同时降低了整体错误率并提高了对中性关系的处理能力。

链接: https://arxiv.org/abs/2412.16194
作者: Karthik Sivakoti
机构: 未知
关键词: Natural Language Inference, Stanford Natural Language, Language Inference, Natural Language, largely exploit dataset
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Natural Language Inference (NLI) models have achieved high performances on benchmark datasets, there are still concerns whether they truly capture the intended task, or largely exploit dataset artifacts. Through detailed analysis of the Stanford Natural Language Inference (SNLI) dataset, we have uncovered complex patterns of various types of artifacts and their interactions, leading to the development of our novel structural debiasing approach. Our fine-grained analysis of 9,782 validation examples reveals four major categories of artifacts: length-based patterns, lexical overlap, subset relationships, and negation patterns. Our multi-head debiasing architecture achieves substantial improvements across all bias categories: length bias accuracy improved from 86.03% to 90.06%, overlap bias from 91.88% to 93.13%, subset bias from 95.43% to 96.49%, and negation bias from 88.69% to 94.64%. Overall, our approach reduces the error rate from 14.19% to 10.42% while maintaining high performance on unbiased examples. Analysis of 1,026 error cases shows significant improvement in handling neutral relationships, traditionally one of the most challenging areas for NLI systems.
zh

[NLP-125] HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing

【速读】：该论文试图解决基于Transformer的大型语言模型（LLMs）在推理过程中由于使用键值缓存（KV cache）而导致的显著GPU内存消耗问题。解决方案的关键是引入了一种名为LSH-E的算法，该算法利用局部敏感哈希（LSH）来压缩KV缓存。LSH-E通过计算当前查询令牌与缓存令牌键之间的汉明距离，快速定位与查询令牌余弦不相似的令牌，从而减少缓存中的冗余信息。与现有依赖注意力计算的压缩策略不同，LSH-E在注意力计算之前做出令牌保留决策，从而降低计算成本。此外，LSH-E具有动态性，能够在每个解码步骤中替换预计产生最低注意力分数的令牌，实现缓存的动态更新。实验表明，LSH-E能够在保持高推理性能的同时，将KV缓存压缩30%-70%。

链接: https://arxiv.org/abs/2412.16187
作者: Minghui Liu,Tahseen Rabbani,Tony O’Halloran,Ananth Sankaralingam,Mary-Anne Hartley,Brian Gravelle,Furong Huang,Cornelia Fermüller,Yiannis Aloimonos
机构: University of Maryland(马里兰大学); Yale University(耶鲁大学); University of Galway(高威大学); Laboratory for Physical Sciences(物理科学实验室)
关键词: Transformer-based large language, large language models, significantly accelerate inference, Transformer-based large, language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Performance (cs.PF)
备注: 10 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes significant GPU memory. In this work, we introduce LSH-E, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. LSH-E quickly locates tokens in the cache that are cosine dissimilar to the current query token. This is achieved by computing the Hamming distance between binarized Gaussian projections of the current token query and cached token keys, with a projection length much smaller than the embedding dimension. We maintain a lightweight binary structure in GPU memory to facilitate these calculations. Unlike existing compression strategies that compute attention to determine token retention, LSH-E makes these decisions pre-attention, thereby reducing computational costs. Additionally, LSH-E is dynamic - at every decoding step, the key and value of the current token replace the embeddings of a token expected to produce the lowest attention score. We demonstrate that LSH-E can compress the KV cache by 30%-70% while maintaining high performance across reasoning, multiple-choice, long-context retrieval and summarization tasks.
zh

[NLP-126] Decoding Poultry Vocalizations – Natural Language Processing and Transformer Models for Semantic and Emotional Analysis

【速读】：该论文试图解决鸡群声音信号的语义解析问题，旨在通过解读鸡的声学语言来提升动物福利和生态信息学研究。解决方案的关键在于应用先进的自然语言处理 (Natural Language Processing) 和基于transformer的模型，结合Wave2Vec 2.0进行原始音频特征提取，并使用在广泛动物声音语料库上预训练并微调的BERT模型 (Bidirectional Encoder Representations from Transformers)，将鸡的叫声解码为可解释的类别，如痛苦呼叫、进食信号和交配叫声。该方法实现了92%的分类准确率，展示了实时自动化监测鸡群健康和压力的可行性，从而帮助农民主动应对环境或行为变化，改善家禽福利并支持可持续农场管理。

链接: https://arxiv.org/abs/2412.16182
作者: Venkatraman Manikandan,Suresh Neethirajan
机构: 未知
关键词: Deciphering the acoustic, ecological informatics, chickens offers, offers new opportunities, Natural Language Processing
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 28 Pages, 14 figures

点击查看摘要

Abstract:Deciphering the acoustic language of chickens offers new opportunities in animal welfare and ecological informatics. Their subtle vocal signals encode health conditions, emotional states, and dynamic interactions within ecosystems. Understanding the semantics of these calls provides a valuable tool for interpreting their functional vocabulary and clarifying how each sound serves a specific purpose in social and environmental contexts. We apply advanced Natural Language Processing and transformer based models to translate bioacoustic data into meaningful insights. Our method integrates Wave2Vec 2.0 for raw audio feature extraction with a fine tuned Bidirectional Encoder Representations from Transformers model, pretrained on a broad corpus of animal sounds and adapted to poultry tasks. This pipeline decodes poultry vocalizations into interpretable categories including distress calls, feeding signals, and mating vocalizations, revealing emotional nuances often overlooked by conventional analyses. Achieving 92 percent accuracy in classifying key vocalization types, our approach demonstrates the feasibility of real time automated monitoring of flock health and stress. By tracking this functional vocabulary, farmers can respond proactively to environmental or behavioral changes, improving poultry welfare, reducing stress related productivity losses, and supporting more sustainable farm management. Beyond agriculture, this research enhances our understanding of computational ecology. Accessing the semantic foundation of animal calls may indicate biodiversity, environmental stressors, and species interactions, informing integrative ecosystem level decision making.
zh

[NLP-127] Efficient VoIP Communications through LLM -based Real-Time Speech Reconstruction and Call Prioritization for Emergency Services

【速读】：该论文试图解决紧急通信系统中因丢包、带宽限制、信号质量差、延迟和抖动等问题导致的实时服务质量下降，以及因恐慌、语音障碍和背景噪音等因素使得受害者难以传达关键信息的问题。解决方案的关键在于利用大型语言模型 (LLMs) 来重建不完整的语音、填补上下文空白，并根据紧急程度优先处理呼叫。该系统通过集成实时转录与检索增强生成 (RAG) 技术，生成上下文相关的响应，并利用Twilio和AssemblyAI API实现无缝实施。评估结果显示该系统具有高精度、良好的BLEU和ROUGE评分，并与实际需求高度一致，展示了其在优化紧急响应流程和有效优先处理关键案例方面的潜力。

链接: https://arxiv.org/abs/2412.16176
作者: Danush Venkateshperumal,Rahman Abdul Rafi,Shakil Ahmed,Ashfaq Khokhar
机构: Iowa State University (艾奥瓦州立大学); Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa, USA (艾奥瓦州立大学电气与计算机工程系，艾姆斯，爱荷华州，美国)
关键词: poor signal quality, face disruptions due, real-time service quality, communication systems face, systems face disruptions
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 15 pages,8 figures

点击查看摘要

Abstract:Emergency communication systems face disruptions due to packet loss, bandwidth constraints, poor signal quality, delays, and jitter in VoIP systems, leading to degraded real-time service quality. Victims in distress often struggle to convey critical information due to panic, speech disorders, and background noise, further complicating dispatchers’ ability to assess situations accurately. Staffing shortages in emergency centers exacerbate delays in coordination and assistance. This paper proposes leveraging Large Language Models (LLMs) to address these challenges by reconstructing incomplete speech, filling contextual gaps, and prioritizing calls based on severity. The system integrates real-time transcription with Retrieval-Augmented Generation (RAG) to generate contextual responses, using Twilio and AssemblyAI APIs for seamless implementation. Evaluation shows high precision, favorable BLEU and ROUGE scores, and alignment with real-world needs, demonstrating the model’s potential to optimize emergency response workflows and prioritize critical cases effectively.
zh

[NLP-128] Experimenting with Multi-modal Information to Predict Success of Indian IPOs

【速读】：该论文试图解决的问题是如何通过数据驱动的方法预测印度公司首次公开募股（IPO）的成功性。解决方案的关键在于利用机器学习（Machine Learning）和自然语言处理（Natural Language Processing）技术，综合分析IPO招股说明书中提到的各种因素、宏观经济条件、市场状况、灰市价格（Grey Market Price）等多模态信息（文本、图像、数值和分类特征），以估计IPO在上市首日的股价走势和定价偏差。

链接: https://arxiv.org/abs/2412.16174
作者: Sohom Ghosh,Arnab Maji,N Harsha Vardhan,Sudip Kumar Naskar
机构: Jadavpur University (贾达夫普尔大学); International Institute of Information Technology, Hyderabad (海得拉巴国际信息技术学院)
关键词: Initial Public Offerings, Initial Public, Public Offerings, Indian Economy, consistent growth
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注: Dataset: this https URL Codes: this https URL

点击查看摘要

Abstract:With consistent growth in Indian Economy, Initial Public Offerings (IPOs) have become a popular avenue for investment. With the modern technology simplifying investments, more investors are interested in making data driven decisions while subscribing for IPOs. In this paper, we describe a machine learning and natural language processing based approach for estimating if an IPO will be successful. We have extensively studied the impact of various facts mentioned in IPO filing prospectus, macroeconomic factors, market conditions, Grey Market Price, etc. on the success of an IPO. We created two new datasets relating to the IPOs of Indian companies. Finally, we investigated how information from multiple modalities (texts, images, numbers, and categorical features) can be used for estimating the direction and underpricing with respect to opening, high and closing prices of stocks on the IPO listing day.
zh

[NLP-129] LABIIUM: AI-Enhanced Zero-configuration Measurement Automation System

【速读】：该论文试图解决实验室环境中仪器交互复杂性和测量自动化不足的问题。传统工具通常需要配置、软件和编程技能，限制了生产力。解决方案的关键在于LABIIUM，这是一个AI增强的零配置测量自动化系统，通过集成基于大语言模型（LLMs）的AI助手来生成代码，并利用Lab-Automation-Measurement Bridges (LAMBs)实现无缝仪器连接，使用标准工具如VSCode和Python，从而消除设置开销，简化实验流程并提高用户生产力。

链接: https://arxiv.org/abs/2412.16172
作者: Emmanuel A. Olowe,Danial Chitnis
机构: The University of Edinburgh(爱丁堡大学); The University of Edinburgh(爱丁堡大学)
关键词: simplify instrument interaction, environments requires solutions, laboratory environments requires, environments requires, Large Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: submitted to IEEE conference for review

点击查看摘要

Abstract:The complexity of laboratory environments requires solutions that simplify instrument interaction and enhance measurement automation. Traditional tools often require configuration, software, and programming skills, creating barriers to productivity. Previous approaches, including dedicated software suites and custom scripts, frequently fall short in providing user-friendly solutions that align with programming practices. We present LABIIUM, an AI-enhanced, zero-configuration measurement automation system designed to streamline experimental workflows and improve user productivity. LABIIUM integrates an AI assistant powered by Large Language Models (LLMs) to generate code. LABIIUM’s Lab-Automation-Measurement Bridges (LAMBs) enable seamless instrument connectivity using standard tools such as VSCode and Python, eliminating setup overhead. To demonstrate its capabilities, we conducted experiments involving the measurement of the parametric transfer curve of a simple two-transistor inverting amplifier with a current source load. The AI assistant was evaluated using different prompt scenarios and compared with multiple models, including Claude Sonnet 3.5, Gemini Pro 1.5, and GPT-4o. An expert solution implementing the Gradient-Weighted Adaptive Stochastic Sampling (GWASS) method was used as a baseline. The solutions generated by the AI assistant were compared with the expert solution and a uniform linear sweep baseline with 10,000 points. The graph results show that the LLMs were able to successfully complete the most basic uniform sweep, but LLMs were unable to develop adaptive sweeping algorithms to compete with GWASS. The evaluation underscores LABIIUM’s ability to enhance laboratory productivity and support digital transformation in research and industry, and emphasizes the future work required to improve LLM performance in Electronic Measurement Science Tasks.
zh

[NLP-130] Hanprome: Modified Hangeul for Expression of foreign language pronunciation

【速读】：该论文试图解决的问题是如何在不改变韩文字母（Hangeul）基本形式的前提下，通过修改笔画的形状来表达不同于原始字母的发音。解决方案的关键在于保留字母的基本结构，仅通过改变笔画的形状来实现这一目标。据作者所知，这是首次尝试通过改变字母笔画的形状来表达不同发音的方法。

链接: https://arxiv.org/abs/2412.11090
作者: Wonchan Kim,Michelle Meehyun Kim
机构: 未知
关键词: basic form, Hangeul was created, existing alphabets, Hangeul, phonetic alphabet
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 21 pages

点击查看摘要

Abstract:Hangeul was created as a phonetic alphabet and is known to have the best 1:1 correspondence between letters and pronunciation among existing alphabets. In this paper, we examine the possibility of modifying the basic form of Hangeul and using it as a kind of phonetic symbol. The core concept of this approach is to preserve the basic form of the alphabet, modifying only the shape of a stroke rather than the letter itself. To the best of our knowledge, no previous attempts in any language have been made to express pronunciations of an alphabet different from the original simply by changing the shape of the alphabet strokes, and this paper is probably the first attempt in this direction.
zh

[NLP-131] Adaptive Large Language Models By Layerwise Attention Shortcuts

【速读】：该论文试图解决传统Transformer架构在处理信息时依赖于固定层级顺序的问题，提出了一种自适应计算方法。解决方案的关键在于引入计算注意力捷径 (computational attention shortcuts)，使得最终层能够通过注意力机制根据需要选择性地关注所有中间层，从而实现架构深度和上下文的自适应性。这种方法在多种数据集（如声学标记、自然语言和符号音乐）上展示了优于GPT-like架构的性能，并通过注意力图证明了模型能够学习到依赖于输入标记的复杂跨层关系。

链接: https://arxiv.org/abs/2409.10870
作者: Prateek Verma,Mert Pilanci
机构: 未知
关键词: modern AI revolution, Transformer architectures, Transformer, processing information sequentially, Abstract
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages, 3 figures

点击查看摘要

Abstract:Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computations for LLM-like setups, which allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational \textbfattention shortcuts. These shortcuts can thus make the architecture depth and context adaptive. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.
zh

[NLP-132] Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization ICASSP2025

【速读】：该论文试图解决自动说话人验证和语音匿名化任务中语音时序动态的影响问题。解决方案的关键在于提出基于音素时长（phoneme durations）的自动说话人验证指标，并揭示音素时长中泄露的说话人信息。研究表明，音素时长不仅在原始语音中，甚至在匿名化语音中也能揭示说话人身份，因此强调了在开发具有强隐私保护能力的匿名化系统时，必须考虑并修改说话人的语速和音素时长特征。

链接: https://arxiv.org/abs/2412.17164
作者: Natalia Tomashenko,Emmanuel Vincent,Marc Tommasi
机构: Université de Lorraine, CNRS, Inria(洛林大学, 法国国家科学研究中心, 法国国家信息与自动化研究所); Université de Lille, CNRS, Inria(里尔大学, 法国国家科学研究中心, 法国国家信息与自动化研究所)
关键词: voice anonymization tasks, automatic speaker verification, speech temporal dynamics, investigate the impact, temporal dynamics
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted at ICASSP 2025

点击查看摘要

Abstract:In this paper, we investigate the impact of speech temporal dynamics in application to automatic speaker verification and speaker voice anonymization tasks. We propose several metrics to perform automatic speaker verification based only on phoneme durations. Experimental results demonstrate that phoneme durations leak some speaker information and can reveal speaker identity from both original and anonymized speech. Thus, this work emphasizes the importance of taking into account the speaker’s speech rate and, more importantly, the speaker’s phonetic duration characteristics, as well as the need to modify them in order to develop anonymization systems with strong privacy protection capacity.
zh

[NLP-133] Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

【速读】：该论文试图解决语音语言模型（Speech Language Models, SLMs）在生成语义连贯输出时面临的挑战，主要原因包括语音符号主要提供语音信息而非语义信息（Factor A）、语音序列长度远长于文本序列（Factor B）以及副语言信息（如韵律）引入的额外复杂性和变异性（Factor C）。论文通过逐步从文本到语音的转换方式，分别探讨了这三个关键因素的影响，发现Factor A影响较小，Factor B对句法和语义建模影响更明显，而Factor C在基本词汇建模中影响最大。基于这些发现，论文提供了关于训练SLMs独特挑战的见解，并强调了开发更有效的端到端SLMs的路径。

链接: https://arxiv.org/abs/2412.17048
作者: Hankun Wang,Haoran Wang,Yiwei Guo,Zhihan Li,Chenpeng Du,Xie Chen,Kai Yu
机构: 未知
关键词: large language models, language models exhibit, text-based large language, models exhibit human-level, semantically coherent outputs
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) the length of speech sequences is much longer than that of text sequences, and © paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of three key factors separately by transiting the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies. Factor A has a relatively minor impact, factor B influences syntactical and semantic modeling more obviously, and factor C exerts the most significant impact, particularly in the basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways to develop more effective end-to-end SLMs.
zh

[NLP-134] Speech-Based Depression Prediction Using Encoder-Weight-Only Transfer Learning and a Large Corpus

【速读】：该论文试图解决基于语音的行为健康状况管理问题，特别是抑郁症的诊断。解决方案的关键在于采用了一种基于语音的迁移学习方法，使用轻量级编码器并仅迁移编码器权重，从而简化了运行时模型。通过使用比先前研究大两个数量级的数据集，研究展示了迁移学习在PHQ-8标签预测中的显著性能提升，二分类任务中相对性能提升高达27%，且统计显著性接近零。此外，研究表明迁移学习的增益并不依赖于源任务的高性能，表明该方法具有灵活性和高效实现的潜力。

链接: https://arxiv.org/abs/2412.16900
作者: Amir Harati,Elizabeth Shriberg,Tomasz Rutowski,Piotr Chlebek,Yang Lu,Ricardo Oliveira
机构: 未知
关键词: behavioral health conditions, algorithms have gained, gained interest, management of behavioral, behavioral health
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speech-based algorithms have gained interest for the management of behavioral health conditions such as depression. We explore a speech-based transfer learning approach that uses a lightweight encoder and that transfers only the encoder weights, enabling a simplified run-time model. Our study uses a large data set containing roughly two orders of magnitude more speakers and sessions than used in prior work. The large data set enables reliable estimation of improvement from transfer learning. Results for the prediction of PHQ-8 labels show up to 27% relative performance gains for binary classification; these gains are statistically significant with a p-value close to zero. Improvements were also found for regression. Additionally, the gain from transfer learning does not appear to require strong source task performance. Results suggest that this approach is flexible and offers promise for efficient implementation.
zh

[NLP-135] Autoregressive Speech Synthesis with Next-Distribution Prediction

【速读】：该论文试图解决文本到语音合成 (TTS) 中现有方法依赖于变分自编码器 (VAE) 或扩散模型的问题，提出了一种新的自回归 (AR) 语言建模方法 KALL-E。其关键解决方案在于直接建模和预测基于文本的连续语音分布，而不使用离散语音标记，通过 WaveVAE 从波形中提取连续语音分布，并使用单一 AR 语言模型进行预测，同时以 Kullback-Leibler 散度损失作为约束。这种方法在零样本 TTS 场景中表现出更优的自然度和说话者相似性，并展示了在情感和口音克隆方面的卓越能力。

链接: https://arxiv.org/abs/2412.16846
作者: Xinfa Zhu,Wenjie Tian,Lei Xie
机构: Northwestern Polytechnical University (西北工业大学); Audio, Speech and Language Processing Group (ASLP@NPU) (音频、语音和语言处理组)
关键词: language modeling approach, modeling approach, approach with next-distribution, next-distribution prediction, continuous speech
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Technical report, work in progress

点击查看摘要

Abstract:We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete speech tokens. A single AR language model predicts these continuous speech distributions from text, with a Kullback-Leibler divergence loss as the constraint. Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2, and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E presents a more straightforward and effective paradigm for using continuous speech representations in TTS. Audio samples are available at: \urlthis https URL.
zh

[NLP-136] Speech Retrieval-Augmented Generation without Automatic Speech Recognition

【速读】：该论文试图解决在语音数据上的开放式问答任务中，自动语音识别（ASR）错误可能传播到后续的检索和生成步骤的问题。解决方案的关键在于提出了SpeechRAG框架，该框架通过微调预训练的语音编码器作为语音适配器，并将其输入到冻结的大型语言模型（LLM）检索模型中，从而实现文本和语音嵌入空间的对齐。这种设计使得语音检索器能够直接从文本查询中检索音频段落，而无需依赖ASR转录，从而避免了ASR错误的影响。此外，生成步骤采用基于音频段落的语音语言模型（SLM），在不需要微调SLM的情况下，能够在转录错误率较高时优于传统的基于文本的级联模型。

链接: https://arxiv.org/abs/2412.16500
作者: Do June Min,Karel Mundnich,Andy Lapastora,Erfan Soltanmohammadi,Srikanth Ronanki,Kyu Han
机构: University of Michigan(密歇根大学); AWS AI Labs(AWS AI实验室)
关键词: automatic speech recognition, employ text-based retrieval-augmented, RAG, ASR, speech
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)–based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade over the text-based baseline, and outperforms the cascaded systems using ASR. For generation, we use a speech language model (SLM) as a generator, conditioned on audio passages rather than transcripts. Without fine-tuning of the SLM, this approach outperforms cascaded text-based models when there is high WER in the transcripts.
zh

[NLP-137] Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling ICASSP2025

【速读】：该论文试图解决多语言自动语音识别 (Multilingual Automatic Speech Recognition, ASR) 系统在处理未见语言 (unseen languages) 时的性能问题。解决方案的关键在于利用语言之间的语言学特性 (linguistic characteristics) 关系，通过引入加权求和方法和基于预测器的方法来增强对未见语言的识别能力。具体来说，加权求和方法利用Whisper模型预测的语言概率计算语言标签嵌入的加权和，而基于预测器的方法则进一步优化该加权和嵌入，使其更接近未见语言的真实嵌入。实验结果表明，这些方法在零样本和微调设置下均显著提升了ASR性能，超越了基线方法。

链接: https://arxiv.org/abs/2412.16474
作者: Shao-Syuan Huang,Kuan-Po Huang,Andy T. Liu,Hung-yi Lee
机构: National Taiwan University (国立台湾大学); National Taiwan University, AICS ASUS (国立台湾大学, AICS 华硕)
关键词: Automatic Speech Recognition, Multilingual Automatic Speech, Automatic Speech, Speech Recognition, transcribe speech
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper’s predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.
zh

计算机视觉

[CV-0] FaceLift: Single Image to 3D Head with View Generation and GS-LRM

【速读】：该论文试图解决从单张图像快速、高质量地进行360度头部3D重建的问题。解决方案的关键在于采用了一种前馈方法，称为FaceLift，其核心包括两个主要组件：首先，使用多视角潜在扩散模型（multi-view latent diffusion model）从单张面部输入生成一致的侧面和背面视图；其次，将这些生成的视图输入到GS-LRM重建器（GS-LRM reconstructor），通过高斯样条（Gaussian splats）生成全面的3D表示。该系统通过在合成3D人头数据集上训练，展示了在真实世界图像上的显著泛化能力，并在3D头部重建任务中超越了现有最先进的方法。

链接: https://arxiv.org/abs/2412.17812
作者: Weijie Lyu,Yi Zhou,Ming-Hsuan Yang,Zhixin Shu
机构: University of California, Merced(加州大学默塞德分校); Adobe Research(Adobe研究)
关键词: approach for rapid, feed-forward approach, head, single, synthetic
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:We present FaceLift, a feed-forward approach for rapid, high-quality, 360-degree head reconstruction from a single image. Our pipeline begins by employing a multi-view latent diffusion model that generates consistent side and back views of the head from a single facial input. These generated views then serve as input to a GS-LRM reconstructor, which produces a comprehensive 3D representation using Gaussian splats. To train our system, we develop a dataset of multi-view renderings using synthetic 3D human head as-sets. The diffusion-based multi-view generator is trained exclusively on synthetic head images, while the GS-LRM reconstructor undergoes initial training on Objaverse followed by fine-tuning on synthetic head data. FaceLift excels at preserving identity and maintaining view consistency across views. Despite being trained solely on synthetic data, FaceLift demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that FaceLift outperforms state-of-the-art methods in 3D head reconstruction, highlighting its practical applicability and robust performance on real-world images. In addition to single image reconstruction, FaceLift supports video inputs for 4D novel view synthesis and seamlessly integrates with 2D reanimation techniques to enable 3D facial animation. Project page: this https URL.
zh

[CV-1] ChatGarment: Garment Estimation Generation and Editing via Large Language Models

【速读】：该论文试图解决从图像或文本描述中自动化估计、生成和编辑3D服装的问题。解决方案的关键在于利用大规模视觉-语言模型 (VLMs) 进行微调，以直接生成包含服装类型、样式和连续数值属性的JSON文件。这一JSON文件通过编程参数化模型生成缝纫图案，并进一步转化为可动画化和可模拟的3D服装。论文通过扩展现有的编程模型GarmentCode的服装类型覆盖范围并简化其结构，以及构建大规模的图像-缝纫图案和文本-缝纫图案数据集，实现了这一目标。

链接: https://arxiv.org/abs/2412.17811
作者: Siyuan Bian,Chenghao Xu,Yuliang Xiu,Artur Grigorev,Zhen Liu,Cewu Lu,Michael J. Black,Yao Feng
机构: Max Planck Institute for Intelligent Systems; Shanghai Jiao Tong University; EPFL; Westlake University; ETH Zürich; Mila, University of Montreal; Meshcapade
关键词: leverages large vision-language, large vision-language models, automate the estimation, approach that leverages, leverages large
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garments from images or text descriptions. Unlike previous methods that struggle in real-world scenarios or lack interactive editing capabilities, ChatGarment can estimate sewing patterns from in-the-wild images or sketches, generate them from text descriptions, and edit garments based on user instructions, all within an interactive dialogue. These sewing patterns can then be draped into 3D garments, which are easily animatable and simulatable. This is achieved by finetuning a VLM to directly generate a JSON file that includes both textual descriptions of garment types and styles, as well as continuous numerical attributes. This JSON file is then used to create sewing patterns through a programming parametric model. To support this, we refine the existing programming model, GarmentCode, by expanding its garment type coverage and simplifying its structure for efficient VLM fine-tuning. Additionally, we construct a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs through an automated data pipeline. Extensive evaluations demonstrate ChatGarment’s ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications. Code and data will be available at this https URL.
zh

[CV-2] Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

【速读】：该论文试图解决现有3D内容生成管道中，使用变分自编码器 (Variational Autoencoders, VAEs) 进行形状编码时，由于均匀点采样策略导致的形状几何细节丢失问题，从而限制了形状重建和下游生成任务的质量。解决方案的关键在于提出了Dora-VAE，通过引入锐边采样策略 (sharp edge sampling strategy) 和双重交叉注意力机制 (dual cross-attention mechanism)，优先处理几何复杂度高的区域，显著提升了细粒度形状特征的保留。此外，论文还提出了Dora-bench基准，通过量化锐边密度来评估形状复杂度，进一步验证了Dora-VAE在重建精度上的优势，尤其是在关键几何特征上的表现。

链接: https://arxiv.org/abs/2412.17808
作者: Rui Chen,Jianfeng Zhang,Yixun Liang,Guan Luo,Weiyu Li,Jiarui Liu,Xiu Li,Xiaoxiao Long,Jiashi Feng,Ping Tan
机构: The Hong Kong University of Science and Technology(香港科技大学); ByteDance Seed(字节跳动种子); Tsinghua University(清华大学)
关键词: employ Variational Autoencoders, commonly employ Variational, Variational Autoencoders, pipelines commonly employ, content generation pipelines
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent 3D content generation pipelines commonly employ Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion-based generation. However, the widely adopted uniform point sampling strategy in Shape VAE training often leads to a significant loss of geometric details, limiting the quality of shape reconstruction and downstream generation tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross-attention mechanism. By identifying and prioritizing regions with high geometric complexity during training, our method significantly improves the preservation of fine-grained shape features. Such sampling strategy and the dual attention mechanism enable the VAE to focus on crucial geometric details that are typically missed by uniform sampling approaches. To systematically evaluate VAE reconstruction quality, we additionally propose Dora-bench, a benchmark that quantifies shape complexity through the density of sharp edges, introducing a new metric focused on reconstruction accuracy at these salient geometric features. Extensive experiments on the Dora-bench demonstrate that Dora-VAE achieves comparable reconstruction quality to the state-of-the-art dense XCube-VAE while requiring a latent space at least 8 \times smaller (1,280 vs. 10,000 codes). We will release our code and benchmark dataset to facilitate future research in 3D shape modeling.
zh

[CV-3] Cross-View Referring Multi-Object Tracking AAAI2025

【速读】：该论文试图解决单视图引用多目标跟踪 (Referring Multi-Object Tracking, RMOT) 中由于物体外观在某些视角下不可见而导致匹配错误的问题。解决方案的关键在于提出了跨视图引用多目标跟踪 (Cross-view Referring Multi-Object Tracking, CRMOT) 任务，通过从多个视角获取物体外观信息，避免了单视图中的外观不可见问题。为此，论文构建了一个基于CAMPUS和DIVOTrack数据集的跨视图引用多目标跟踪基准CRTrack，并提出了一种端到端的跨视图引用多目标跟踪方法CRTracker，通过大量实验验证了该方法的有效性。

链接: https://arxiv.org/abs/2412.17807
作者: Sijia Chen,En Yu,Wenbing Tao
机构: 未知
关键词: Referring Multi-Object Tracking, Cross-view Referring Multi-Object, Referring Multi-Object, Multi-Object Tracking, current tracking field
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025!

点击查看摘要

Abstract:Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. Its task form is to guide the tracker to track objects that match the language description. Current research mainly focuses on referring multi-object tracking under single-view, which refers to a view sequence or multiple unrelated view sequences. However, in the single-view, some appearances of objects are easily invisible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces the cross-view to obtain the appearances of objects from multiple views, avoiding the problem of the invisible appearances of objects in RMOT task. CRMOT is a more challenging task of accurately tracking the objects that match the language description and maintaining the identity consistency of objects in each cross-view. To advance CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method. The dataset and code are available at this https URL.
zh

[CV-4] Reconstructing People Places and Cameras

【速读】：该论文试图解决从稀疏、未标定的多视角图像中联合重建多个人体网格、场景点云和相机参数的问题，特别是在度量世界坐标系中实现精确重建。解决方案的关键在于结合数据驱动的场景重建与传统的运动结构恢复 (Structure-from-Motion, SfM) 框架，通过利用人体统计模型来估计近似的度量尺度，并在同一世界坐标系中重建多个人体网格和场景点云，从而捕捉个体之间的空间关系及其在环境中的位置。该方法通过联合优化人体、场景和相机参数，显著提升了人体定位精度和相机姿态估计的准确性，同时在场景重建质量上也有所提高。

链接: https://arxiv.org/abs/2412.17806
作者: Lea Müller,Hongsuk Choi,Anthony Zhang,Brent Yi,Jitendra Malik,Angjoo Kanazawa
机构: UC Berkeley(加州大学伯克利分校)
关键词: Structure from Motion, images featuring people, uncalibrated multi-view images, multi-view images featuring, multiple human meshes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this http URL

点击查看摘要

Abstract:We present “Humans and Structure from Motion” (HSfM), a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system from a sparse set of uncalibrated multi-view images featuring people. Our approach combines data-driven scene reconstruction with the traditional Structure-from-Motion (SfM) framework to achieve more accurate scene reconstruction and camera estimation, while simultaneously recovering human meshes. In contrast to existing scene reconstruction and SfM methods that lack metric scale information, our method estimates approximate metric scale by leveraging a human statistical model. Furthermore, it reconstructs multiple human meshes within the same world coordinate system alongside the scene point cloud, effectively capturing spatial relationships among individuals and their positions in the environment. We initialize the reconstruction of humans, scenes, and cameras using robust foundational models and jointly optimize these elements. This joint optimization synergistically improves the accuracy of each component. We compare our method to existing approaches on two challenging benchmarks, EgoHumans and EgoExo4D, demonstrating significant improvements in human localization accuracy within the world coordinate frame (reducing error from 3.51m to 1.04m in EgoHumans and from 2.9m to 0.56m in EgoExo4D). Notably, our results show that incorporating human data into the SfM pipeline improves camera pose estimation (e.g., increasing RRA@15 by 20.3% on EgoHumans). Additionally, qualitative results show that our approach improves overall scene reconstruction quality. Our code is available at: this http URL.
zh

[CV-5] Large Motion Video Autoencoding with Cross-modal Video VAE

【速读】：该论文试图解决视频变分自编码器 (Video Variational Autoencoder, VAE) 在直接应用于单帧图像时导致的时序不一致性和压缩率不足的问题。解决方案的关键在于提出了一种新型的视频自编码器，通过以下三个主要创新来提升性能：首先，采用时序感知的空间压缩方法，避免将图像VAE简单扩展为3D VAE带来的运动模糊和细节失真问题；其次，引入轻量级的运动压缩模型以进一步优化时序压缩；最后，利用文本到视频数据集中的文本信息，将文本引导融入模型，显著提升重建质量，特别是在细节保留和时序稳定性方面。此外，通过联合训练图像和视频数据，增强了模型的多功能性，使其能够同时处理图像和视频的自编码任务。

链接: https://arxiv.org/abs/2412.17805
作者: Yazhou Xing,Yang Fei,Yingqing He,Jingye Chen,Jiaxin Xie,Xiaowei Chi,Qifeng Chen
机构: The Hong Kong University of Science and Technology (香港科技大学)
关键词: robust video Variational, video Variational Autoencoder, Learning a robust, Variational Autoencoder, efficient video generation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website: this https URL

点击查看摘要

Abstract:Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at~\hrefthis https URLthis https URL.
zh

[CV-6] GauSim: Registering Elastic Objects into Digital World by Gaussian Simulator WWW

【速读】：该论文试图解决传统基于粒子（particle-based）模拟方法在处理弹性物体动态行为时存在的理想化假设问题，并提出了一种基于神经网络的模拟器GauSim。解决方案的关键在于利用连续介质力学（continuum mechanics）将高斯核（Gaussian kernels）建模为连续物质，而非粒子，从而避免了传统方法中的理想化假设，实现了更真实的形变模拟。此外，GauSim通过引入层次结构（hierarchical structure）将高斯核组织成质心系统（Center of Mass Systems, CMS），并采用从粗到细（coarse-to-fine）的模拟策略，显著提高了计算效率和模拟精度。同时，GauSim还引入了明确的物理约束（如质量和动量守恒），确保了模拟结果的可解释性和物理合理性。

链接: https://arxiv.org/abs/2412.17804
作者: Yidi Shao,Mu Huang,Chen Change Loy,Bo Dai
机构: S-Lab Nanyang Technological University(南洋理工大学); Fudan University(复旦大学); The University of Hong Kong(香港大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)
关键词: neural network-based simulator, network-based simulator designed, represented through Gaussian, elastic objects represented, Gaussian kernels
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:In this work, we introduce GauSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. Unlike traditional methods that treat kernels as particles within particle-based simulations, we leverage continuum mechanics, modeling each kernel as a continuous piece of matter to account for realistic deformations without idealized assumptions. To improve computational efficiency and fidelity, we employ a hierarchical structure that organizes kernels into Center of Mass Systems (CMS) with explicit formulations, enabling a coarse-to-fine simulation approach. This structure significantly reduces computational overhead while preserving detailed dynamics. In addition, GauSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations. To validate our approach, we present a new dataset, READY, containing multi-view videos of real-world elastic deformations. Experimental results demonstrate that GauSim achieves superior performance compared to existing physics-driven baselines, offering a practical and accurate solution for simulating complex dynamic behaviors. Code and model will be released. Project page: this https URL .
zh

[CV-7] Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

【速读】：该论文试图解决在大规模词汇类别（vast-vocabulary）对象检测中，由于训练时类别词汇量扩展到真实世界级别，导致传统分类器与粗略类别名称对齐时识别性能显著下降的问题。解决方案的关键是引入Prova，一种多模态原型分类器（multi-modal prototype classifier）。Prova通过提取全面的多模态原型作为对齐分类器的初始化，有效应对大规模词汇对象识别失败的问题。该方法在V3Det数据集上显著提升了包括Faster R-CNN、FCOS和DINO在内的多种检测器的性能，并在监督和开放词汇设置下均取得了新的最先进性能。

链接: https://arxiv.org/abs/2412.17800
作者: Yitong Chen,Wenhao Yao,Lingchen Meng,Sihong Wu,Zuxuan Wu,Yu-Gang Jiang
机构: 1. Fudan University (复旦大学); 2. Shanghai Key Laboratory of Data Science (上海市数据科学重点实验室)
关键词: recognize vast open-world, vast open-world categories, Enabling models, longstanding pursuit, recognize vast
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories. However, when the scale of the category vocabularies during training expands to a real-world level, previous classifiers aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as initialization of alignment classifiers to tackle the vast-vocabulary object recognition failure problem. On V3Det, this simple method greatly enhances the performance among one-stage, two-stage, and DETR-based detectors with only additional projection layers in both supervised and open-vocabulary settings. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP respectively in the supervised setting of V3Det. For the open-vocabulary setting, Prova achieves a new state-of-the-art performance with 32.8 base AP and 11.0 novel AP, which is of 2.6 and 4.3 gain over the previous methods.
zh

[CV-8] ActiveGS: Active Scene Reconstruction using Gaussian Splatting

【速读】：该论文试图解决在未知场景中使用机载RGB-D相机主动构建精确地图的问题。解决方案的关键在于提出了一种混合地图表示方法，结合了高斯喷射图（Gaussian splatting map）的高保真场景重建能力和体素图（voxel map）的空间建模优势。核心技术包括为高斯喷射图设计有效的置信度建模，以识别未充分重建的区域，并利用体素图的空间信息来定位未探索区域，同时辅助无碰撞路径规划。通过主动收集未充分重建和未探索区域的信息进行地图更新，该方法在实际应用中，特别是在无人机上，实现了优于现有技术的高斯喷射重建效果。

链接: https://arxiv.org/abs/2412.17769
作者: Liren Jin,Xingguang Zhong,Yue Pan,Jens Behley,Cyrill Stachniss,Marija Popović
机构: 未知
关键词: enable downstream tasks, Robotics applications, Gaussian splatting map, Gaussian splatting, downstream tasks
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robotics applications often rely on scene reconstructions to enable downstream tasks. In this work, we tackle the challenge of actively building an accurate map of an unknown scene using an on-board RGB-D camera. We propose a hybrid map representation that combines a Gaussian splatting map with a coarse voxel map, leveraging the strengths of both representations: the high-fidelity scene reconstruction capabilities of Gaussian splatting and the spatial modelling strengths of the voxel map. The core of our framework is an effective confidence modelling technique for the Gaussian splatting map to identify under-reconstructed areas, while utilising spatial information from the voxel map to target unexplored areas and assist in collision-free path planning. By actively collecting scene information in under-reconstructed and unexplored areas for map updates, our approach achieves superior Gaussian splatting reconstruction results compared to state-of-the-art approaches. Additionally, we demonstrate the applicability of our active scene reconstruction framework in the real world using an unmanned aerial vehicle.
zh

[CV-9] Survey of Large Multimodal Model Datasets Application Categories and Taxonomy

【速读】：该论文旨在探讨多模态学习（Multimodal Learning）领域的发展，特别是通过整合和分析多种数据类型（如文本、图像、音频和视频）来构建更通用和鲁棒的系统。解决方案的关键在于大规模多模态数据集（Large-scale multimodal datasets）的开发与应用，这些数据集不仅支持模型的训练和测试，还为领域特定任务和实际应用提供了基准。论文强调了这些数据集在评估模型性能、扩展性和适用性方面的重要性，并指出克服当前挑战将推动AI研究和应用达到新的高度。

链接: https://arxiv.org/abs/2412.17759
作者: Priyaranjan Pattnayak,Hitesh Laxmichand Patel,Bhargava Kumar,Amit Agarwal,Ishan Banerjee,Srikant Panda,Tejaswini Kumar
机构: 未知
关键词: rapidly evolving field, analyzing diverse types, artificial intelligence, seeks to construct, types of data
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems by integrating and analyzing diverse types of data, including text, images, audio, and video. Inspired by the human ability to assimilate information through many senses, this method enables applications such as text-to-video conversion, visual question answering, and image captioning. Recent developments in datasets that support multimodal language models (MLLMs) are highlighted in this overview. Large-scale multimodal datasets are essential because they allow for thorough testing and training of these models. With an emphasis on their contributions to the discipline, the study examines a variety of datasets, including those for training, domain-specific tasks, and real-world applications. It also emphasizes how crucial benchmark datasets are for assessing models’ performance in a range of scenarios, scalability, and applicability. Since multimodal learning is always changing, overcoming these obstacles will help AI research and applications reach new heights.
zh

[CV-10] Reasoning to Attend: Try to Understand How SEG Token Works

【速读】：该论文试图解决当前大型多模态模型（Large Multimodal Models, LMMs）在视觉定位任务中依赖于 \texttt{SEG} 标记作为文本提示时，如何有效利用该标记进行视觉-语言模型的联合优化问题。解决方案的关键在于通过可视化语义相似度图（similarity maps），揭示 \texttt{SEG} 标记在图像-文本对中的语义相似性贡献，并提出了一种名为 READ 的方法。READ 方法通过引入 Similarity as Points 模块（SasP），利用相似度图中高度激活的点来指导模型在推理过程中注意力分配的位置，从而增强模型的鲁棒性推理能力。该方法在 ReasonSeg 和 RefCOCO(+/g) 数据集上进行了广泛实验，并验证了其在微调后不会出现灾难性遗忘的问题。

链接: https://arxiv.org/abs/2412.17741
作者: Rui Qian,Xin Yin,Dejing Dou
机构: School of Computer Science and Technology, Fudan University (复旦大学); The State Key Laboratory of Blockchain and Data Security, Zhejiang University (浙江大学)
关键词: Current Large Multimodal, Large Multimodal Models, empowered visual grounding, visual grounding typically, grounding typically rely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Current Large Multimodal Models (LMMs) empowered visual grounding typically rely on \textttSEG token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specified model (\eg, SAM). However, we observe that little research has looked into how it this http URL this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the \textttSEG token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map,which reveals that what \textttSEG token contributes to is the semantic similarity within image-text pairs. Specifically, \textttSEG token, a placeholder expanded in text vocabulary, extensively queries among individual tokenized image patches to match the semantics of an object from text to the paired image while the Large Language Models (LLMs) are being fine-tuned. Upon the above findings, we present READ, which facilitates LMMs’ resilient \textbfREA soning capability of where to atten \textbfD under the guidance of highly activated points borrowed from similarity maps. Remarkably, READ features an intuitive design, Similarity as Points module (SasP), which can be seamlessly applied to \textttSEG -like paradigms in a plug-and-play this http URL, extensive experiments have been conducted on the ReasonSeg and RefCOCO(+/g) datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, we further assess its generation ability on an augmented FP-RefCOCO(+/g) dataset. All codes and models are publicly available at this https URL.
zh

[CV-11] Mimicking-Bench: A Benchmark for Generalizable Humanoid-Scene Interaction Learning via Human Mimicking

【速读】：该论文试图解决人形机器人与3D场景交互时通用技能学习的问题，现有方法受限于小规模、手动收集的演示数据，缺乏通用数据集和基准支持，难以有效探索场景几何的泛化能力。解决方案的关键在于引入Mimicking-Bench，这是首个针对通过模仿大规模人类动画参考数据进行可泛化人形-场景交互学习的综合基准。Mimicking-Bench包含六个家庭场景中的全身人形-场景交互任务，涵盖11K多种物体形状，并提供20K合成和3K真实世界的人类交互技能参考。论文构建了完整的人形技能学习流程，并评估了运动重定向、运动跟踪、模仿学习及其组合方法的性能，通过大量实验展示了人类模仿在技能学习中的价值，揭示了关键挑战和研究方向。

链接: https://arxiv.org/abs/2412.17730
作者: Yun Liu,Bowen Yang,Licheng Zhong,He Wang,Li Yi
机构: Tsinghua University(清华大学); Galbot; Shanghai Qi Zhi Institute(上海期智研究院); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); Peking University(北京大学)
关键词: humanoid robots interacting, robots interacting, significant implications, implications for robotics, Learning generic skills
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning generic skills for humanoid robots interacting with 3D scenes by mimicking human data is a key research challenge with significant implications for robotics and real-world applications. However, existing methodologies and benchmarks are constrained by the use of small-scale, manually collected demonstrations, lacking the general dataset and benchmark support necessary to explore scene geometry generalization effectively. To address this gap, we introduce Mimicking-Bench, the first comprehensive benchmark designed for generalizable humanoid-scene interaction learning through mimicking large-scale human animation references. Mimicking-Bench includes six household full-body humanoid-scene interaction tasks, covering 11K diverse object shapes, along with 20K synthetic and 3K real-world human interaction skill references. We construct a complete humanoid skill learning pipeline and benchmark approaches for motion retargeting, motion tracking, imitation learning, and their various combinations. Extensive experiments highlight the value of human mimicking for skill learning, revealing key challenges and research directions.
zh

[CV-12] VidTwin: Video VAE with Decoupled Structure and Dynamics

【速读】：该论文试图解决视频生成中高效且高质量的压缩与重建问题。解决方案的关键在于提出了一种新颖且紧凑的视频自编码器（Video Autoencoder），名为VidTwin，通过将视频解耦为两个不同的潜在空间：结构潜在向量（Structure latent vectors）捕捉整体内容和全局运动，动态潜在向量（Dynamics latent vectors）捕捉细粒度细节和快速运动。具体实现上，VidTwin采用了一个编码器-解码器主干网络，并增加了两个子模块分别提取这两个潜在空间。实验结果表明，VidTwin在MCL-JCV数据集上实现了0.20%的高压缩率，同时保持了高质量的重建效果（PSNR为28.14），并在下游生成任务中表现出色。

链接: https://arxiv.org/abs/2412.17726
作者: Yuchi Wang,Junliang Guo,Xinyi Xie,Tianyu He,Xu Sun,Jiang Bian
机构: Peking University(北京大学); Microsoft Research Asia(微软亚洲研究院); CUHK (SZ)(香港中文大学(深圳))
关键词: Recent advancements, compact video autoencoder, significantly improved, Structure latent vectors, video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at this https URL.
zh

[CV-13] GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance AAAI2025

【速读】：该论文试图解决3D高斯喷洒（3D Gaussian splatting）中由于参数空间过大导致的非唯一性问题，以及传统测试时优化方法耗时过长的问题。解决方案的关键在于引入了一种创新的正向传播方法（feed-forward approach），通过估计每个点的表面法线（surface normal）来确定高斯旋转，从而在受限的空间内有效预测剩余的高斯参数。此外，论文还提出了外观注入模块（appearance injection module），通过多尺度三平面表示（multiscale triplane representation）将参考图像的外观信息融入高斯场，从而在单次前向传播中实现高效且高质量的3D内容生成。

链接: https://arxiv.org/abs/2412.17715
作者: Jingqiu Zhou,Lue Fan,Xuesong Chen,Linjiang Huang,Si Liu,Hongsheng Li
机构: 1. Guangdong Provincial People’s Hospital(广东省人民医院);
2. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院);
3. Shenzhen University(深圳大学);
4. The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))
关键词: Gaussian, Gaussian fields, Gaussian splatting, present GaussianPainter, high-quality Gaussian fields
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in AAAI 2025

点击查看摘要

Abstract:In this paper, we present GaussianPainter, the first method to paint a point cloud into 3D Gaussians given a reference image. GaussianPainter introduces an innovative feed-forward approach to overcome the limitations of time-consuming test-time optimization in 3D Gaussian splatting. Our method addresses a critical challenge in the field: the non-uniqueness problem inherent in the large parameter space of 3D Gaussian splatting. This space, encompassing rotation, anisotropic scales, and spherical harmonic coefficients, introduces the challenge of rendering similar images from substantially different Gaussian fields. As a result, feed-forward networks face instability when attempting to directly predict high-quality Gaussian fields, struggling to converge on consistent parameters for a given output. To address this issue, we propose to estimate a surface normal for each point to determine its Gaussian rotation. This strategy enables the network to effectively predict the remaining Gaussian parameters in the constrained space. We further enhance our approach with an appearance injection module, incorporating reference image appearance into Gaussian fields via a multiscale triplane representation. Our method successfully balances efficiency and fidelity in 3D Gaussian generation, achieving high-quality, diverse, and robust 3D content creation from point clouds in a single forward pass.
zh

[CV-14] Establishing Reality-Virtuality Interconnections in Urban Digital Twins for Superior Intelligent Road Inspection

【速读】：该论文试图解决传统道路检测方法（如人工评估）劳动密集、成本高且耗时的问题，以及现实世界中道路缺陷数据稀缺和空间分布稀疏的挑战。解决方案的关键在于利用城市数字孪生技术（Urban Digital Twin, UDT）构建智能道路检测系统。首先，通过从真实驾驶数据中构建分层道路模型，生成高度详细的道路缺陷结构和表面高程表示。接着，生成数字道路孪生体，创建模拟环境以进行全面分析和评估。这些场景随后被导入模拟器，用于数据采集和物理仿真。实验结果表明，该系统生成的高保真道路缺陷场景显著提升了驾驶任务中的感知和决策能力。

链接: https://arxiv.org/abs/2412.17699
作者: Yikang Zhang,Chuang-Wei Liu,Jiahang Li,Yingbing Chen,Jie Cheng,Rui Fan
机构: Tongji University(同济大学); Shanghai Research Institute for Intelligent Autonomous Systems(上海智能自主系统研究所); the State Key Laboratory of Intelligent Autonomous Systems(智能自主系统国家重点实验室); Frontiers Science Center for Intelligent Autonomous Systems(智能自主系统前沿科学中心); Individualized Interdisciplinary Program, Division of Emerging Interdisciplinary Areas, HKUST(香港科技大学新兴跨学科领域个性化跨学科项目); Department of Electronic and Computer Engineering, HKUST(香港科技大学电子与计算机工程系)
关键词: defects gradually emerge, compromise road functionality, ensuring road maintenance, Road, road defects gradually
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:Road inspection is essential for ensuring road maintenance and traffic safety, as road defects gradually emerge and compromise road functionality. Traditional methods, which rely on manual evaluations, are labor-intensive, costly, and time-consuming. Although data-driven approaches are gaining traction, the scarcity and spatial sparsity of road defects in the real world pose significant challenges in acquiring high-quality datasets. Existing simulators designed to generate detailed synthetic driving scenes, however, lack models for road defects. Furthermore, advanced driving tasks involving interactions with road surfaces, such as planning and control in defective areas, remain underexplored. To address these limitations, we propose a system based on Urban Digital Twin (UDT) technology for intelligent road inspection. First, hierarchical road models are constructed from real-world driving data, creating highly detailed representations of road defect structures and surface elevations. Next, digital road twins are generated to create simulation environments for comprehensive analysis and evaluation. These scenarios are subsequently imported into a simulator to enable both data acquisition and physical simulation. Experimental results demonstrate that driving tasks, including perception and decision-making, can be significantly improved using the high-fidelity road defect scenes generated by our system.
zh

[CV-15] EPE-P: Evidence-based Parameter-efficient Prompting for Multimodal Learning with Missing Modalities ICASSP2025

【速读】：该论文试图解决多模态学习中常见的缺失模态问题，尤其是在训练和测试阶段。现有方法通常需要为每种模态或缺失情况设计单独的提示（prompt），导致设计复杂且参数数量大幅增加，尤其在模态数量增加时，参数冗余问题更加严重。论文提出的解决方案是基于证据的参数高效提示方法（Evidence-based Parameter-Efficient Prompting, EPE-P），其关键在于通过简化的设计整合不同模态的提示信息，减少复杂性和参数冗余。此外，论文还提出了基于证据的损失函数（Evidence-based Loss function），以更好地处理缺失模态带来的不确定性，从而提升模型的决策能力。实验结果表明，EPE-P在有效性和效率上均优于现有的基于提示的方法。

链接: https://arxiv.org/abs/2412.17677
作者: Zhe Chen,Xun Lin,Yawen Cui,Zitong Yu
机构: 未知
关键词: multimodal learning scenarios, real-world multimodal learning, learning scenarios, training and testing, common challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Missing modalities are a common challenge in real-world multimodal learning scenarios, occurring during both training and testing. Existing methods for managing missing modalities often require the design of separate prompts for each modality or missing case, leading to complex designs and a substantial increase in the number of parameters to be learned. As the number of modalities grows, these methods become increasingly inefficient due to parameter redundancy. To address these issues, we propose Evidence-based Parameter-Efficient Prompting (EPE-P), a novel and parameter-efficient method for pretrained multimodal networks. Our approach introduces a streamlined design that integrates prompting information across different modalities, reducing complexity and mitigating redundant parameters. Furthermore, we propose an Evidence-based Loss function to better handle the uncertainty associated with missing modalities, improving the model’s decision-making. Our experiments demonstrate that EPE-P outperforms existing prompting-based methods in terms of both effectiveness and efficiency. The code is released at this https URL.
zh

[CV-16] A Bias-Free Training Paradigm for More General AI-generated Image Detection

【速读】：该论文试图解决现有法医检测器在监督学习基准测试中表现优异但在实际应用中表现不佳的问题，主要原因是训练数据质量不足。解决方案的关键在于提出了一种无偏训练范式（B-Free），通过使用稳定扩散模型的条件生成过程，从真实图像生成假图像，确保真实图像与假图像在语义上对齐，从而使检测器专注于检测AI生成过程中引入的细微伪影，而非数据偏差。这种方法通过基于内容的增强，显著提升了检测器的泛化能力和鲁棒性，并在27种不同的生成模型上取得了更校准的结果。

链接: https://arxiv.org/abs/2412.17671
作者: Fabrizio Guillaro,Giada Zingarini,Ben Usman,Avneesh Sud,Davide Cozzolino,Luisa Verdoliva
机构: University Federico II of Naples(那不勒斯费德里克二世大学); Google DeepMind(谷歌DeepMind)
关键词: supervised learning benchmarks, Successful forensic detectors, produce excellent results, Successful forensic, real-world applications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Successful forensic detectors can produce excellent results in supervised learning benchmarks but struggle to transfer to real-world applications. We believe this limitation is largely due to inadequate training data quality. While most research focuses on developing new algorithms, less attention is given to training data selection, despite evidence that performance can be strongly impacted by spurious correlations such as content, format, or resolution. A well-designed forensic detector should detect generator specific artifacts rather than reflect data biases. To this end, we propose B-Free, a bias-free training paradigm, where fake images are generated from real ones using the conditioning procedure of stable diffusion models. This ensures semantic alignment between real and fake images, allowing any differences to stem solely from the subtle artifacts introduced by AI generation. Through content-based augmentation, we show significant improvements in both generalization and robustness over state-of-the-art detectors and more calibrated results across 27 different generative models, including recent releases, like FLUX and Stable Diffusion 3.5. Our findings emphasize the importance of a careful dataset curation, highlighting the need for further research in dataset design. Code and data will be publicly available at this https URL
zh

[CV-17] Enhanced Temporal Processing in Spiking Neural Networks for Static Object Detection Using 3D Convolutions

【速读】：该论文试图解决直接训练的脉冲神经网络 (Spiking Neural Networks, SNNs) 在基于帧的静态对象检测任务（如COCO2017数据集）中性能显著落后于传统人工神经网络 (Artificial Neural Networks, ANNs) 的问题。解决方案的关键在于增强SNN处理时空信息的能力，具体通过以下两项创新实现：1) 将传统的二维卷积 (2D convolutions) 替换为三维卷积 (3D convolutions)，从而直接在卷积过程中引入时间维度信息；2) 在神经元内部引入时间信息循环机制 (temporal information recurrence mechanism)，以提高神经元对时间信息的利用效率。这些改进使得直接训练的SNNs在COCO2017和VOC数据集上达到了与ANNs相当的性能水平。

链接: https://arxiv.org/abs/2412.17654
作者: Huaxu He
机构: 未知
关键词: Artificial Neural Networks, Spiking Neural Networks, directly trained SNNs, Neural Networks, network models capable
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are a class of network models capable of processing spatiotemporal information, with event-driven characteristics and energy efficiency advantages. Recently, directly trained SNNs have shown potential to match or surpass the performance of traditional Artificial Neural Networks (ANNs) in classification tasks. However, in object detection tasks, directly trained SNNs still exhibit a significant performance gap compared to ANNs when tested on frame-based static object datasets (such as COCO2017). Therefore, bridging this performance gap and enabling directly trained SNNs to achieve performance comparable to ANNs on these static datasets has become one of the key challenges in the development of this http URL address this challenge, this paper focuses on enhancing the SNN’s unique ability to process spatiotemporal information. Spiking neurons, as the core components of SNNs, facilitate the exchange of information between different temporal channels during the process of converting input floating-point data into binary spike signals. However, existing neuron models still have certain limitations in the communication of temporal information. Some studies have even suggested that disabling the backpropagation in the time dimension during SNN training can still yield good training results. To improve the SNN handling of temporal information, this paper proposes replacing traditional 2D convolutions with 3D convolutions, thus directly incorporating temporal information into the convolutional process. Additionally, temporal information recurrence mechanism is introduced within the neurons to further enhance the neurons’ efficiency in utilizing temporal this http URL results show that the proposed method enables directly trained SNNs to achieve performance levels comparable to ANNs on the COCO2017 and VOC datasets.
zh

[CV-18] DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

【速读】：该论文试图解决从文本或图像提示生成以服装为中心的人类图像时，现有方法在轻量化与模型泛化能力之间的矛盾问题。解决方案的关键在于提出了一种名为DreamFit的新方法，该方法通过引入轻量级的Anything-Dressing Encoder，结合自适应注意力机制和LoRA模块，显著减少了模型复杂性（仅83.4M可训练参数），同时保持了在多种场景下的高质量生成能力。DreamFit还通过利用预训练的多模态大模型（LMMs）来丰富提示信息，进一步提升了生成质量，并具备即插即用的特性，便于与现有扩散模型插件集成。

链接: https://arxiv.org/abs/2412.17644
作者: Ente Lin,Xujie Zhang,Fuwei Zhao,Yuxuan Luo,Xin Dong,Long Zeng,Xiaodan Liang
机构: 1. Sun Yat-sen University (中山大学); 2. South China University of Technology (华南理工大学); 3. Tencent AI Lab (腾讯AI实验室); 4. Peng Cheng Laboratory (鹏城实验室)
关键词: great application potential, garnered emerging attention, garment-centric human generation, application potential, garnered emerging
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models for garment-centric human generation from text or image prompts have garnered emerging attention for their great application potential. However, existing methods often face a dilemma: lightweight approaches, such as adapters, are prone to generate inconsistent textures; while finetune-based methods involve high training costs and struggle to maintain the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for the garment-centric human generation. DreamFit has three key advantages: (1) \textbfLightweight training: with the proposed adaptive attention and LoRA modules, DreamFit significantly minimizes the model complexity to 83.4M trainable parameters. (2)\textbfAnything-Dressing: Our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios. (3) \textbfPlug-and-play: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers. To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both 768 \times 512 high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities of garment-centric human generation.
zh

[CV-19] Hierarchical Vector Quantization for Unsupervised Action Segmentation AAAI

【速读】：该论文试图解决无监督时间动作分割问题，即如何将一系列长且未经修剪的视频分割成在语义上有意义且在视频间一致的片段。现有方法在处理同一类别内时间片段的较大变化时表现不佳。为此，论文提出了一种名为分层向量量化 (Hierarchical Vector Quantization, \ours) 的新方法，该方法由两个连续的向量量化模块组成，形成分层聚类结构，其中额外的子簇覆盖了簇内的变化。这一关键创新使得该方法在捕捉片段长度分布方面优于现有技术，并通过引入基于Jensen-Shannon距离 (JSD) 的新度量标准进行评估。实验结果表明，该方法在Breakfast、YouTube Instructional和IKEA ASM三个公开数据集上，在F1分数、召回率和JSD方面均优于现有技术。

链接: https://arxiv.org/abs/2412.17640
作者: Federico Spurio,Emad Bahrami,Gianpiero Francesca,Juergen Gall
机构: University of Bonn(波恩大学); University of Florence(佛罗伦萨大学); German Center for Neurodegenerative Diseases (DZNE)(德国神经退行性疾病中心)
关键词: semantically meaningful segments, temporal action segmentation, untrimmed videos, set of long, unsupervised temporal action
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in Conference on Artificial Intelligence (AAAI) 2025

点击查看摘要

Abstract:In this work, we address unsupervised temporal action segmentation, which segments a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (\ours), that consists of two subsequent vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To this end, we introduce a new metric based on the Jensen-Shannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall and JSD.
zh

[CV-20] SCBench: A Sports Commentary Benchmark for Video LLM s

【速读】：该论文试图解决现有视频大语言模型（Video LLMs）在细粒度时间视觉能力评估和基准测试方面的不足。现有基准测试使用简单视频（如带字幕的电影片段），且任务格式单一（如问答或多选问答），未能充分评估模型生成深入和精确文本的能力。论文提出了一种新的任务：体育视频解说生成，并开发了 SCBench 基准测试。解决方案的关键在于引入了两个创新点：(1) SCORES，一个专门设计的六维评估指标，并基于此提出了GPT评估方法；(2) CommentarySet，包含5,775个标注视频片段和对应标签的数据集。通过这些方法，论文对多个Video LLMs进行了全面评估，发现InternVL-Chat-2表现最佳，为未来复杂视觉理解任务的研究提供了新视角。

链接: https://arxiv.org/abs/2412.17637
作者: Kuangzhi Ge,Lingjun Chen,Kevin Zhang,Yulin Luo,Tianyu Shi,Liaoyuan Fan,Xiang Li,Guanqun Wang,Shanghang Zhang
机构: State Key Laboratory of Multimedia Information Processing(多媒体信息处理国家重点实验室); School of Computer Science(计算机科学学院); Peking University(北京大学); Department of Computer Science(计算机科学系); University of Toronto(多伦多大学); The University of Hong Kong(香港大学); Southern University of Science and Technology(南方科技大学)
关键词: Large Language Models, Video Large Language, Large Language, Video LLMs, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, significant advances have been made in Video Large Language Models (Video LLMs) in both academia and industry. However, methods to evaluate and benchmark the performance of different Video LLMs, especially their fine-grained, temporal visual capabilities, remain very limited. On one hand, current benchmarks use relatively simple videos (e.g., subtitled movie clips) where the model can understand the entire video by processing just a few frames. On the other hand, their datasets lack diversity in task format, comprising only QA or multi-choice QA, which overlooks the models’ capacity for generating in-depth and precise texts. Sports videos, which feature intricate visual information, sequential events, and emotionally charged commentary, present a critical challenge for Video LLMs, making sports commentary an ideal benchmarking task. Inspired by these challenges, we propose a novel task: sports video commentary generation, developed \textbfSCBench for Video LLMs. To construct such a benchmark, we introduce (1) \textbfSCORES , a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method, and (2) \textbfCommentarySet , a dataset consisting of 5,775 annotated video clips and ground-truth labels tailored to our metric. Based on SCBench, we conduct comprehensive evaluations on multiple Video LLMs (e.g. VILA, Video-LLaVA, etc.) and chain-of-thought baseline methods. Our results found that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04. Our work provides a fresh perspective for future research, aiming to enhance models’ overall capabilities in complex visual understanding tasks. Our dataset will be released soon.
zh

[CV-21] LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding

【速读】：该论文试图解决现有方法在3D场景理解任务中，由于仅从新视角渲染2D特征图而导致的3D语言场不精确问题，特别是存在异常语言特征和无法准确对齐3D空间中物体的问题。解决方案的关键在于提出了语言嵌入表面场（Language-Embedded Surface Field, LangSurf），通过几何监督和对比损失的联合训练策略，将语言高斯分布精确对齐到物体表面，从而实现精确的2D和3D分割。此外，引入的分层上下文感知模块（Hierarchical-Context Awareness Module）通过分层掩码池化提取细粒度的语言特征，进一步提升了特征表示的准确性。实验结果表明，LangSurf在开放词汇的2D和3D语义分割任务中显著优于先前的最先进方法LangSplat。

链接: https://arxiv.org/abs/2412.17635
作者: Hao Li,Roy Qin,Zhengyu Zou,Diqi He,Bohan Li,Bingquan Dai,Dingewn Zhang,Junwei Han
机构: 未知
关键词: Applying Gaussian Splatting, Splatting to perception, Applying Gaussian, Gaussian Splatting, scene understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: \url{ this https URL }{Project Page}

点击查看摘要

Abstract:Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works primarily focus on rendering 2D feature maps from novel viewpoints, which leads to an imprecise 3D language field with outlier languages, ultimately failing to align objects in 3D space. By utilizing masked images for feature extraction, these approaches also lack essential contextual information, leading to inaccurate feature representation. To this end, we propose a Language-Embedded Surface Field (LangSurf), which accurately aligns the 3D language fields with the surface of objects, facilitating precise 2D and 3D segmentation with text query, widely expanding the downstream tasks such as removal and editing. The core of LangSurf is a joint training strategy that flattens the language Gaussian on the object surfaces using geometry supervision and contrastive losses to assign accurate language features to the Gaussians of objects. In addition, we also introduce the Hierarchical-Context Awareness Module to extract features at the image level for contextual information then perform hierarchical mask pooling using masks segmented by SAM to obtain fine-grained language features in different hierarchies. Extensive experiments on open-vocabulary 2D and 3D semantic segmentation demonstrate that LangSurf outperforms the previous state-of-the-art method LangSplat by a large margin. As shown in Fig.~\reffig:teaser, our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing, which is also supported by comprehensive experiments. \urlthis https URLProject Page.
zh

[CV-22] ANID: How Far Are We? Evaluating the Discrepancies Between AI-synthesized Images and Natural Images through Multimodal Guidance

【速读】：该论文试图解决生成式 AI (AIGC) 领域中区分 AI 合成图像与自然图像的关键挑战。解决方案的关键在于引入了一个名为 AI-Natural Image Discrepancy Evaluation 的基准，旨在系统地评估和量化 AI 生成图像 (AIGIs) 与真实图像之间的差异。为此，研究团队构建了一个大规模多模态数据集，即 Distinguishing Natural and AI-generated Images (DNAI) 数据集，包含超过 440,000 张由 8 种代表性模型生成的 AIGI 样本，涵盖文本到图像 (Text-to-Image, T2I)、图像到图像 (Image-to-Image, I2I) 以及文本与图像到图像 (Text vs. Image-to-Image, TI2I) 等多种生成方式。通过细粒度的评估框架，论文从五个关键维度（视觉特征质量、多模态生成中的语义对齐、美学吸引力、下游任务适用性及协同人类验证）对 DNAI 数据集进行了全面评估，强调了量化指标与人类判断相结合的必要性，以实现对 AI 生成图像质量的全面理解。

链接: https://arxiv.org/abs/2412.17632
作者: Renyang Liu,Ziyu Lyu,Wei Zhou,See-Kiong Ng
机构: Sun Yat-sen University (中山大学); School of Cyber Security, Sun Yat-sen University (中山大学网络空间安全学院)
关键词: Intelligence Generated Content, Artificial Intelligence Generated, Artificial Intelligence, rapidly evolving field, field of Artificial
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:In the rapidly evolving field of Artificial Intelligence Generated Content (AIGC), one of the key challenges is distinguishing AI-synthesized images from natural images. Despite the remarkable capabilities of advanced AI generative models in producing visually compelling images, significant discrepancies remain when these images are compared to natural ones. To systematically investigate and quantify these discrepancies, we introduce an AI-Natural Image Discrepancy Evaluation benchmark aimed at addressing the critical question: \textithow far are AI-generated images (AIGIs) from truly realistic images? We have constructed a large-scale multimodal dataset, the Distinguishing Natural and AI-generated Images (DNAI) dataset, which includes over 440,000 AIGI samples generated by 8 representative models using both unimodal and multimodal prompts, such as Text-to-Image (T2I), Image-to-Image (I2I), and Text \textitvs. Image-to-Image (TI2I). Our fine-grained assessment framework provides a comprehensive evaluation of the DNAI dataset across five key dimensions: naive visual feature quality, semantic alignment in multimodal generation, aesthetic appeal, downstream task applicability, and coordinated human validation. Extensive evaluation results highlight significant discrepancies across these dimensions, underscoring the necessity of aligning quantitative metrics with human judgment to achieve a holistic understanding of AI-generated image quality. Code is available at \hrefthis https URLthis https URL.
zh

[CV-23] Detail-Preserving Latent Diffusion for Stable Shadow Removal

【速读】：该论文试图解决复杂全局光照场景下阴影去除的泛化性问题。由于现有阴影去除数据集的多样性有限，当前方法容易过拟合训练数据，导致在新场景中表现不佳。解决方案的关键在于利用预训练的Stable Diffusion (SD)模型的丰富视觉先验，并提出两阶段微调流程：第一阶段固定VAE并在潜在空间微调去噪器，实现显著的阴影去除但可能丢失高频细节；第二阶段通过细节注入阶段，从VAE编码器中选择性提取特征并调制解码器，从而将精细细节注入最终结果。实验表明，该方法在阴影去除效果和泛化性上均优于现有技术。

链接: https://arxiv.org/abs/2412.17630
作者: Jiamin Xu,Yuxin Zheng,Zelong Li,Chi Wang,Renshu Gu,Weiwei Xu,Gang Xu
机构: Hangzhou Dianzi University (杭州电子科技大学); Zhejiang University (浙江大学)
关键词: Achieving high-quality shadow, complex global illumination, Achieving high-quality, shadow removal, high-quality shadow removal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving high-quality shadow removal with strong generalizability is challenging in scenes with complex global illumination. Due to the limited diversity in shadow removal datasets, current methods are prone to overfitting training data, often leading to reduced performance on unseen cases. To address this, we leverage the rich visual priors of a pre-trained Stable Diffusion (SD) model and propose a two-stage fine-tuning pipeline to adapt the SD model for stable and efficient shadow removal. In the first stage, we fix the VAE and fine-tune the denoiser in latent space, which yields substantial shadow removal but may lose some high-frequency details. To resolve this, we introduce a second stage, called the detail injection stage. This stage selectively extracts features from the VAE encoder to modulate the decoder, injecting fine details into the final results. Experimental results show that our method outperforms state-of-the-art shadow removal techniques. The cross-dataset evaluation further demonstrates that our method generalizes effectively to unseen data, enhancing the applicability of shadow removal methods.
zh

[CV-24] Editing Implicit and Explicit Representations of Radiance Fields: A Survey

【速读】：该论文试图解决神经辐射场 (NeRF) 在编辑方面的进展相对滞后的问题。解决方案的关键在于对现有文献中不同的编辑方法进行全面综述，并提出一种新的分类法 (taxonomy) 来对这些方法进行分类。论文回顾了开创性模型，探讨了当前和潜在的新应用，并比较了现有最先进方法在编辑选项和性能方面的表现。

链接: https://arxiv.org/abs/2412.17628
作者: Arthur Hubert,Gamal Elghazaly,Raphael Frank
机构: University of Luxembourg (卢森堡大学)
关键词: Neural Radiance Fields, high-quality image rendering, Neural Radiance, revolutionized novel view, image rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) revolutionized novel view synthesis in recent years by offering a new volumetric representation, which is compact and provides high-quality image rendering. However, the methods to edit those radiance fields developed slower than the many improvements to other aspects of NeRF. With the recent development of alternative radiance field-based representations inspired by NeRF as well as the worldwide rise in popularity of text-to-image models, many new opportunities and strategies have emerged to provide radiance field editing. In this paper, we deliver a comprehensive survey of the different editing methods present in the literature for NeRF and other similar radiance field representations. We propose a new taxonomy for classifying existing works based on their editing methodologies, review pioneering models, reflect on current and potential new applications of radiance field editing, and compare state-of-the-art approaches in terms of editing options and performance.
zh

[CV-25] Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection AAAI2025

【速读】：该论文试图解决少样本异常检测 (Few-shot Anomaly Detection, FSAD) 中现有方法忽视视觉特征内在上下文信息的问题，特别是不同视觉层之间的交互关系。解决方案的关键在于提出了一个核感知图提示学习框架 (Kernel-aware Graph prompt, KAG-prompt)，通过构建一个层次化的核感知图，将不同层的特征作为节点，节点间的交互关系作为边，通过消息传递机制捕捉跨层的上下文信息，从而实现更准确的异常检测。此外，论文还提出了一种基于多层次信息融合的图像级评分方法，以整合多个重要异常信号的信息，进一步提升检测性能。

链接: https://arxiv.org/abs/2412.17619
作者: Fenfang Tao,Guo-Sen Xie,Fang Zhao,Xiangbo Shu
机构: 未知
关键词: normal support images, Few-shot anomaly detection, detect unseen anomaly, Few-shot anomaly, aims to detect
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025

点击查看摘要

Abstract:Few-shot anomaly detection (FSAD) aims to detect unseen anomaly regions with the guidance of very few normal support images from the same class. Existing FSAD methods usually find anomalies by directly designing complex text prompts to align them with visual features under the prevailing large vision-language model paradigm. However, these methods, almost always, neglect intrinsic contextual information in visual features, e.g., the interaction relationships between different vision layers, which is an important clue for detecting anomalies comprehensively. To this end, we propose a kernel-aware graph prompt learning framework, termed as KAG-prompt, by reasoning the cross-layer relations among visual features for FSAD. Specifically, a kernel-aware hierarchical graph is built by taking the different layer features focusing on anomalous regions of different sizes as nodes, meanwhile, the relationships between arbitrary pairs of nodes stand for the edges of the graph. By message passing over this graph, KAG-prompt can capture cross-layer contextual information, thus leading to more accurate anomaly prediction. Moreover, to integrate the information of multiple important anomaly signals in the prediction map, we propose a novel image-level scoring method based on multi-level information fusion. Extensive experiments on MVTecAD and VisA datasets show that KAG-prompt achieves state-of-the-art FSAD results for image-level/pixel-level anomaly detection. Code is available at this https URL.
zh

[CV-26] CoSurfGS:Collaborative 3D Surface Gaussian Splatting with Distributed Learning for Large Scene Reconstruction

【速读】：该论文试图解决现有3D高斯样条（3D Gaussian Splatting, 3DGS）方法在大规模场景重建中面临的高内存成本、过长的时间消耗以及缺乏几何细节等问题。解决方案的关键在于提出了一个基于分布式学习的多智能体协作快速3DGS表面重建框架，并通过局部模型压缩（Local Model Compression, LMC）和模型聚合方案（Model Aggregation Schemes, MAS）来实现高质量的大场景表面表示，同时显著降低GPU内存消耗。实验结果表明，该方法能够实现快速、可扩展的高保真表面重建和逼真的渲染效果。

链接: https://arxiv.org/abs/2412.17612
作者: Yuanyuan Gao,Yalun Dai,Hao Li,Weicai Ye,Junyi Chen,Danpeng Chen,Dingwen Zhang,Tong He,Guofeng Zhang,Junwei Han
机构: Northwestern Polytechnical University(西北工业大学); Nanyang Technological University(南洋理工大学); Zhejiang University(浙江大学); Shanghai AI Lab(上海人工智能实验室)
关键词: Gaussian Splatting, demonstrated impressive performance, demonstrated impressive, impressive performance, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our project page is available at \url{ this https URL }

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has demonstrated impressive performance in scene reconstruction. However, most existing GS-based surface reconstruction methods focus on 3D objects or limited scenes. Directly applying these methods to large-scale scene reconstruction will pose challenges such as high memory costs, excessive time consumption, and lack of geometric detail, which makes it difficult to implement in practical applications. To address these issues, we propose a multi-agent collaborative fast 3DGS surface reconstruction framework based on distributed learning for large-scale surface reconstruction. Specifically, we develop local model compression (LMC) and model aggregation schemes (MAS) to achieve high-quality surface representation of large scenes while reducing GPU memory consumption. Extensive experiments on Urban3d, MegaNeRF, and BlendedMVS demonstrate that our proposed method can achieve fast and scalable high-fidelity surface reconstruction and photorealistic rendering. Our project page is available at \urlthis https URL.
zh

[CV-27] Personalized Large Vision-Language Models

【速读】：该论文试图解决在大规模视觉-语言模型 (LVLMs) 中个性化图像生成的问题，特别是在对话中使用特定指代概念（如“Mike和Susan在交谈”）而非通用描述（如“一个男孩和一个女孩在交谈”），以提升对话的定制性和指代友好性。解决方案的关键在于提出了PLVM模型，并引入了Aligner，这是一个预训练的视觉编码器，用于将指代概念与查询图像对齐。在对话过程中，Aligner能够提取参考图像的特征并与相应概念匹配，从而在查询图像中识别这些概念，实现个性化。此外，Aligner的计算成本和参数数量在整个框架中几乎可以忽略不计，显著增强了模型的实用性。

链接: https://arxiv.org/abs/2412.17610
作者: Chau Pham,Hoang Phan,David Doermann,Yunjie Tian
机构: University at Buffalo(布法罗大学); New York University(纽约大学)
关键词: large vision-language models, gained significant attention, vision-language models, gained significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A simple way to personalize your LLM

点击查看摘要

Abstract:The personalization model has gained significant attention in image generation yet remains underexplored for large vision-language models (LVLMs). Beyond generic ones, with personalization, LVLMs handle interactive dialogues using referential concepts (e.g., Mike and Susan are talking.'') instead of the generic form (e.g., a boy and a girl are talking.‘’), making the conversation more customizable and referentially friendly. In addition, PLVM is equipped to continuously add new concepts during a dialogue without incurring additional costs, which significantly enhances the practicality. PLVM proposes Aligner, a pre-trained visual encoder to align referential concepts with the queried images. During the dialogues, it extracts features of reference images with these corresponding concepts and recognizes them in the queried image, enabling personalization. We note that the computational cost and parameter count of the Aligner are negligible within the entire framework. With comprehensive qualitative and quantitative analyses, we reveal the effectiveness and superiority of PLVM.
zh

[CV-28] SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images AAAI-25

【速读】：该论文试图解决大规模图表问答数据集构建中的效率和多样性问题。解决方案的关键在于提出了SBSFigures（Stage-by-Stage Synthetic Figures）数据集及其分阶段生成流程。该流程通过自动化方式生成带有完整可视化数据标注和密集问答标注的图表，避免了手动标注的繁琐过程，同时有效减少了代码错误，并能够高效生成多样化主题和外观的图表。这一方法显著提升了预训练效果，使得在有限的真实图表数据下，从预训练权重开始进行高效训练成为可能。

链接: https://arxiv.org/abs/2412.17606
作者: Risa Shinoda,Kuniaki Saito,Shohei Tanaka,Tosho Hirasawa,Yoshitaka Ushiku
机构: 未知
关键词: Building a large-scale, attributes like text, generating QAs, requires a considerable, gathering and selecting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI-25 Workshop on Document Understanding and Intelligence. Dataset and code: this https URL

点击查看摘要

Abstract:Building a large-scale figure QA dataset requires a considerable amount of work, from gathering and selecting figures to extracting attributes like text, numbers, and colors, and generating QAs. Although recent developments in LLMs have led to efforts to synthesize figures, most of these focus primarily on QA generation. Additionally, creating figures directly using LLMs often encounters issues such as code errors, similar-looking figures, and repetitive content in figures. To address this issue, we present SBSFigures (Stage-by-Stage Synthetic Figures), a dataset for pre-training figure QA. Our proposed pipeline enables the creation of chart figures with complete annotations of the visualized data and dense QA annotations without any manual annotation process. Our stage-by-stage pipeline makes it possible to create diverse topic and appearance figures efficiently while minimizing code errors. Our SBSFigures demonstrate a strong pre-training effect, making it possible to achieve efficient training with a limited amount of real-world chart data starting from our pre-trained weights.
zh

[CV-29] AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation

【速读】：该论文试图解决在视觉密集型任务中，如少样本语义分割（few-shot semantic segmentation），像素级标注耗时且成本高的问题。解决方案的关键在于提出了一个自适应频率感知网络（adaptive frequency-aware network, AFANet），用于弱监督少样本语义分割（weakly-supervised few-shot semantic segmentation, WFSS）。具体来说，AFANet 包含两个核心模块：一是跨粒度频率感知模块（cross-granularity frequency-aware module, CFM），通过将 RGB 图像分解为高频和低频分布，并重新对齐以优化语义结构信息；二是 CLIP 引导的空间适配器模块（CLIP-guided spatial-adapter module, CSM），通过在线学习对文本信息进行空间域自适应变换，从而为 CFM 提供丰富的跨模态语义信息。这些创新使得 AFANet 在 Pascal-5i 和 COCO-20i 数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2412.17601
作者: Jiaqi Ma,Guo-Sen Xie,Fang Zhao,Zechao Li
机构: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; School of Intelligence Science and Technology, Nanjing University, Suzhou 215163, China
关键词: leveraging prior knowledge, prior knowledge learned, few-shot semantic segmentation, aims to recognize, recognize novel concepts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Few-shot learning aims to recognize novel concepts by leveraging prior knowledge learned from a few samples. However, for visually intensive tasks such as few-shot semantic segmentation, pixel-level annotations are time-consuming and costly. Therefore, in this paper, we utilize the more challenging image-level annotations and propose an adaptive frequency-aware network (AFANet) for weakly-supervised few-shot semantic segmentation (WFSS). Specifically, we first propose a cross-granularity frequency-aware module (CFM) that decouples RGB images into high-frequency and low-frequency distributions and further optimizes semantic structural information by realigning them. Unlike most existing WFSS methods using the textual information from the multi-modal language-vision model, e.g., CLIP, in an offline learning manner, we further propose a CLIP-guided spatial-adapter module (CSM), which performs spatial domain adaptive transformation on textual information through online learning, thus providing enriched cross-modal semantic information for CFM. Extensive experiments on the Pascal-5\textsuperscripti and COCO-20\textsuperscripti datasets demonstrate that AFANet has achieved state-of-the-art performance. The code is available at this https URL.
zh

[CV-30] V2-SfMLearner: Learning Monocular Depth and Ego-motion for Multimodal Wireless Capsule Endoscopy

【速读】：该论文试图解决胶囊内窥镜在胃肠道内碰撞引起的振动扰动问题，这些扰动会影响深度图和胶囊自身运动的预测，进而影响3D场景重建和病变定位。解决方案的关键在于提出了一种多模态方法V²-SfMLearner，通过整合振动信号与基于视觉的深度和胶囊运动估计，有效消除振动扰动。具体来说，该方法设计了一个振动网络分支和一个傅里叶融合模块，用于检测和减轻振动噪声，并通过多模态学习提升了算法的性能和鲁棒性。此外，该解决方案无需大型外部设备，具有临床应用的潜力，能够为医生提供实时且可靠的消化道检查工具。

链接: https://arxiv.org/abs/2412.17595
作者: Long Bai,Beilei Cui,Liangyu Wang,Yanheng Li,Shilong Yao,Sishen Yuan,Yanan Wu,Yang Zhang,Max Q.-H. Meng,Zhen Li,Weiping Ding,Hongliang Ren
机构: Southern University of Science and Technology (南方科技大学); The Chinese University of Hong Kong (香港中文大学); City University of Hong Kong (香港城市大学); Hubei University of Technology (湖北工业大学); Qilu Hospital of Shandong University (山东大学齐鲁医院); Nantong University (南通大学)
关键词: predict depth maps, Deep learning, scene reconstruction, lesion localization, capsule endoscopy videos
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: To appear in IEEE Transactions on Automation Science and Engineering (IEEE TASE)

点击查看摘要

Abstract:Deep learning can predict depth maps and capsule ego-motion from capsule endoscopy videos, aiding in 3D scene reconstruction and lesion localization. However, the collisions of the capsule endoscopies within the gastrointestinal tract cause vibration perturbations in the training data. Existing solutions focus solely on vision-based processing, neglecting other auxiliary signals like vibrations that could reduce noise and improve performance. Therefore, we propose V ^2 -SfMLearner, a multimodal approach integrating vibration signals into vision-based depth and capsule motion estimation for monocular capsule endoscopy. We construct a multimodal capsule endoscopy dataset containing vibration and visual signals, and our artificial intelligence solution develops an unsupervised method using vision-vibration signals, effectively eliminating vibration perturbations through multimodal learning. Specifically, we carefully design a vibration network branch and a Fourier fusion module, to detect and mitigate vibration noises. The fusion framework is compatible with popular vision-only algorithms. Extensive validation on the multimodal dataset demonstrates superior performance and robustness against vision-only algorithms. Without the need for large external equipment, our V ^2 -SfMLearner has the potential for integration into clinical capsule robots, providing real-time and dependable digestive examination tools. The findings show promise for practical implementation in clinical settings, enhancing the diagnostic capabilities of doctors.
zh

[CV-31] Improved Cotton Leaf Disease Classification Using Parameter-Efficient Deep Learning Framework

【速读】：该论文试图解决棉花作物中叶部病害的准确识别问题，特别是在保证高精度的同时，开发一种轻量级、计算效率高的模型。解决方案的关键在于创新性地整合了MobileNet的可训练层子集、迁移学习、数据增强、学习率衰减计划、模型检查点和早停机制。这些技术共同作用，使得模型在分类七种棉花病害时达到了98.42%的整体准确率和96%至100%的类别精度，显著提升了效率并降低了模型复杂性，使其在智能农业中的实际应用成为可能。

链接: https://arxiv.org/abs/2412.17587
作者: Aswini Kumar Patra,Tejashwini Gajurel
机构: 未知
关键词: face significant production, significant production challenges, white gold, face significant, primarily due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 figures, 3 Tables

点击查看摘要

Abstract:Cotton crops, often called “white gold,” face significant production challenges, primarily due to various leaf-affecting diseases. As a major global source of fiber, timely and accurate disease identification is crucial to ensure optimal yields and maintain crop health. While deep learning and machine learning techniques have been explored to address this challenge, there remains a gap in developing lightweight models with fewer parameters which could be computationally effective for agricultural practitioners. To address this, we propose an innovative deep learning framework integrating a subset of trainable layers from MobileNet, transfer learning, data augmentation, a learning rate decay schedule, model checkpoints, and early stopping mechanisms. Our model demonstrates exceptional performance, accurately classifying seven cotton disease types with an overall accuracy of 98.42% and class-wise precision ranging from 96% to 100%. This results in significantly enhanced efficiency, surpassing recent approaches in accuracy and model complexity. The existing models in the literature have yet to attain such high accuracy, even when tested on data sets with fewer disease types. The substantial performance improvement, combined with the lightweight nature of the model, makes it practically suitable for real-world applications in smart farming. By offering a high-performing and efficient solution, our framework can potentially address challenges in cotton cultivation, contributing to sustainable agricultural practices.
zh

[CV-32] HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLM s with Synthetic Benchmark Data

【速读】：该论文试图解决多模态大语言模型 (MLLMs) 在以人为中心的视频理解方面的不足，特别是现有基准主要关注物体和动作识别，而忽略了视频内容中复杂的情感、行为和语音视觉对齐等细微差别。解决方案的关键在于提出了HumanVBench，这是一个创新的基准，通过17个精心设计的任务，涵盖内在情感和外在表现两个主要维度，探索静态与动态、基础与复杂、单模态与跨模态的方面。HumanVBench利用先进的自动化视频标注和包含干扰项的问答生成管道，采用多样化的最先进 (SOTA) 技术，减少对人工标注的依赖，专注于以人为中心的多模态属性。通过全面评估16个SOTA视频MLLMs，揭示了当前模型在跨模态和时间对齐方面的显著局限性，强调了进一步改进以实现更类人理解的重要性。

链接: https://arxiv.org/abs/2412.17574
作者: Ting Zhou,Daoyuan Chen,Qirui Jiao,Bolin Ding,Yaliang Li,Ying Shen
机构: Sun Yat-Sen University(中山大学); Alibaba Group(阿里巴巴集团)
关键词: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 24 figures, 4 tables

点击查看摘要

Abstract:In the domain of Multimodal Large Language Models (MLLMs), achieving human-centric video understanding remains a formidable challenge. Existing benchmarks primarily emphasize object and action recognition, often neglecting the intricate nuances of human emotions, behaviors, and speech visual alignment within video content. We present HumanVBench, an innovative benchmark meticulously crafted to bridge these gaps in the evaluation of video MLLMs. HumanVBench comprises 17 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. With two advanced automated pipelines for video annotation and distractor-included QA generation, HumanVBench utilizes diverse state-of-the-art (SOTA) techniques to streamline benchmark data synthesis and quality assessment, minimizing human annotation dependency tailored to human-centric multimodal attributes. A comprehensive evaluation across 16 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and temporal alignment, underscoring the necessity for further refinement toward achieving more human-like understanding. HumanVBench is open-sourced to facilitate future advancements and real-world applications in video MLLMs.
zh

[CV-33] URoadNet: Dual Sparse Attentive U-Net for Multiscale Road Network Extraction

【速读】：该论文试图解决道路网络分割中的挑战，特别是传统编码-解码方法和简单Transformer嵌入在处理稀疏、不规则形状和多样化上下文时容易失败的问题。解决方案的关键在于提出了URoadNet框架，该框架通过集成连接性注意力（connectivity attention）和整体性注意力（integrality attention）机制，有效编码细粒度的局部道路连通性和全局拓扑语义，同时解码多尺度道路网络信息。这种双层稀疏注意力机制交替互补，能够在降低计算复杂度的同时提升分割性能，从而在多个高分辨率数据集上显著优于现有技术。

链接: https://arxiv.org/abs/2412.17573
作者: Jie Song,Yue Sun,Ziyun Cai,Liang Xiao,Yawen Huang,Yefeng Zheng
机构: College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology, Nanjing 210094, China; Jiangsu Key Laboratory of Spectral Imaging & Intelligent Sense, Nanjing University of Science and Technology, Nanjing 210094, China; Jarvis Research Center, Tencent YouTu Lab, Shenzhen, 518057, China
关键词: simple Transformer embeddings, leads traditional encoding-decoding, simple Transformer, Transformer embeddings, traditional encoding-decoding methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 12 figures

点击查看摘要

Abstract:The challenges of road network segmentation demand an algorithm capable of adapting to the sparse and irregular shapes, as well as the diverse context, which often leads traditional encoding-decoding methods and simple Transformer embeddings to failure. We introduce a computationally efficient and powerful framework for elegant road-aware segmentation. Our method, called URoadNet, effectively encodes fine-grained local road connectivity and holistic global topological semantics while decoding multiscale road network information. URoadNet offers a novel alternative to the U-Net architecture by integrating connectivity attention, which can exploit intra-road interactions across multi-level sampling features with reduced computational complexity. This local interaction serves as valuable prior information for learning global interactions between road networks and the background through another integrality attention mechanism. The two forms of sparse attention are arranged alternatively and complementarily, and trained jointly, resulting in performance improvements without significant increases in computational complexity. Extensive experiments on various datasets with different resolutions, including Massachusetts, DeepGlobe, SpaceNet, and Large-Scale remote sensing images, demonstrate that URoadNet outperforms state-of-the-art techniques. Our approach represents a significant advancement in the field of road network extraction, providing a computationally feasible solution that achieves high-quality segmentation results.
zh

[CV-34] Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor

【速读】：该论文试图解决聊天机器人在理解细微情感差异和管理长对话历史方面的挑战。解决方案的关键在于两个创新方法：首先，采用情感偏好优化 (Emotional Preference Optimization, EPO) 训练模型，使其不仅能够生成正确响应，还能生成与情境相似但情感相反的反情感响应 (counter-emotional responses)，从而增强模型对情感细微差异的辨别能力；其次，引入 MambaCompressor 来高效压缩和管理长对话历史，降低时间和内存复杂度，同时提升上下文理解能力。通过这些方法，模型在生成共情响应和处理长对话方面显著优于现有模型。

链接: https://arxiv.org/abs/2412.17572
作者: Yeonju Kim,Se Jin Park,Yong Man Ro
机构: Integrated Vision and Language Lab, KAIST, South Korea(集成视觉与语言实验室, 韩国科学技术院, 韩国)
关键词: require human interactions, mental health care, Emotional Preference Optimization, human interactions, health care
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chatbot research is advancing with the growing importance of chatbots in fields that require human interactions, such as customer support and mental health care. Despite these advancements, chatbots still face significant challenges in understanding subtle nuances and managing long conversation histories. To address these issues, our study introduces a dual approach: firstly, we employ Emotional Preference Optimization (EPO) to train chatbots not only with correct responses but also with counter-emotional responses-those that are contextually similar but emotionally divergent. This training enables the model to discern fine nuance distinctions between correct and counter-emotional responses, thereby enhancing the quality of its responses. Secondly, we introduce MambaCompressor to effectively compress and manage extensive conversation histories, significantly reducing time and memory complexities while improving the chatbot’s contextual understanding. Our comprehensive experiments across multiple datasets demonstrate that our model significantly outperforms existing models in generating empathetic responses and efficiently managing lengthy dialogues.
zh

[CV-35] he Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

【速读】：该论文试图解决自监督视觉表示学习中，传统掩码自编码器（Masked Autoencoders, MAE）在掩码策略和目标设定上忽视学生模型反馈的问题。解决方案的关键在于提出了协同掩码和目标机制（Collaborative Masking and Targets, CMT-MAE），通过教师和学生模型的注意力线性聚合来实现协同掩码，并利用两者的输出特征作为解码器的协同目标。这一简单而有效的框架在ImageNet-1K上预训练后，显著提升了线性探测和微调性能，尤其是在ViT-base模型上，将原始MAE的微调结果从83.6%提升至85.7%。

链接: https://arxiv.org/abs/2412.17566
作者: Shentong Mo
机构: 未知
关键词: vision representation learning, self-supervised vision representation, representation learning, Masked autoencoders, recently succeeded
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Masked autoencoders (MAE) have recently succeeded in self-supervised vision representation learning. Previous work mainly applied custom-designed (e.g., random, block-wise) masking or teacher (e.g., CLIP)-guided masking and targets. However, they ignore the potential role of the self-training (student) model in giving feedback to the teacher for masking and targets. In this work, we present to integrate Collaborative Masking and Targets for boosting Masked AutoEncoders, namely CMT-MAE. Specifically, CMT-MAE leverages a simple collaborative masking mechanism through linear aggregation across attentions from both teacher and student models. We further propose using the output features from those two models as the collaborative target of the decoder. Our simple and effective framework pre-trained on ImageNet-1K achieves state-of-the-art linear probing and fine-tuning performance. In particular, using ViT-base, we improve the fine-tuning results of the vanilla MAE from 83.6% to 85.7%.
zh

[CV-36] S-INF: Towards Realistic Indoor Scene Synthesis via Scene Implicit Neural Field

【速读】：该论文试图解决现有基于学习的3D室内场景合成 (ISS) 方法在生成真实感场景时，由于过度简化的显式表示和缺乏多模态关系指导而导致的物体布局和风格不真实的问题。解决方案的关键在于引入了一种新的方法——场景隐式神经场 (Scene Implicit Neural Field, S-INF)，通过学习多模态关系的有效表示来增强室内场景合成的真实感。S-INF将多模态关系分解为场景布局关系和详细物体关系，并通过隐式神经场 (INFs) 将两者融合，从而实现场景布局的真实生成。此外，S-INF通过可微渲染捕捉密集的物体关系，确保物体间的风格一致性。实验结果表明，该方法在3D-FRONT数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2412.17561
作者: Zixi Liang,Guowei Xu,Haifeng Wu,Ye Huang,Wen Li,Lixin Duan
机构: 1. University of Electronic Science and Technology of China (电子科技大学); 2. Chengdu University of Information Technology (成都信息工程大学); 3. University of Technology Sydney (悉尼科技大学)
关键词: traditional optimization-based approaches, indoor scene synthesis, scene, Learning-based methods, scene synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning-based methods have become increasingly popular in 3D indoor scene synthesis (ISS), showing superior performance over traditional optimization-based approaches. These learning-based methods typically model distributions on simple yet explicit scene representations using generative models. However, due to the oversimplified explicit representations that overlook detailed information and the lack of guidance from multimodal relationships within the scene, most learning-based methods struggle to generate indoor scenes with realistic object arrangements and styles. In this paper, we introduce a new method, Scene Implicit Neural Field (S-INF), for indoor scene synthesis, aiming to learn meaningful representations of multimodal relationships, to enhance the realism of indoor scene synthesis. S-INF assumes that the scene layout is often related to the object-detailed information. It disentangles the multimodal relationships into scene layout relationships and detailed object relationships, fusing them later through implicit neural fields (INFs). By learning specialized scene layout relationships and projecting them into S-INF, we achieve a realistic generation of scene layout. Additionally, S-INF captures dense and detailed object relationships through differentiable rendering, ensuring stylistic consistency across objects. Through extensive experiments on the benchmark 3D-FRONT dataset, we demonstrate that our method consistently achieves state-of-the-art performance under different types of ISS.
zh

[CV-37] Concept Discovery in Deep Neural Networks for Explainable Face Anti-Spoofing

【速读】：该论文试图解决人脸反欺骗（Face Anti-Spoofing）模型在识别假脸时缺乏解释性的问题。现有的模型虽然能够达到较高的分类准确率，但仅能告知“这张脸是假的”，而无法解释“为什么是假的”，这降低了系统的可信度并导致用户困惑。论文提出的解决方案是将可解释人工智能（XAI）引入人脸反欺骗领域，提出了一个新的问题——可解释人脸反欺骗（X-FAS），并设计了SPED（SPoofing Evidence Discovery）方法。SPED能够发现欺骗概念并基于这些概念提供可靠的解释，从而增强模型的透明性和用户信任。为评估X-FAS方法的质量，论文还提出了一个由专家标注欺骗证据的X-FAS基准，并通过实验验证了SPED在生成可靠解释方面的能力。

链接: https://arxiv.org/abs/2412.17541
作者: Haoyuan Zhang,Xiangyu Zhu,Li Gao,Jiawei Pan,Kai Pang,Guoying Zhao,Stan Z. Li,Zhen Lei
机构: University of Chinese Academy of Sciences, Beijing, China; Institute of Automation, Beijing, China; China Mobile Communications Company Limited Research Institute; Guangzhou Pixel Solutions Co., Ltd.; University of Oulu, 90014 Oulu, Finland; Westlake University, Hangzhou, China; Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong, SAR
关键词: avoid malicious attacks, rapid growth usage, people daily life, face anti-spoofing, face anti-spoofing models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 6 figures

点击查看摘要

Abstract:With the rapid growth usage of face recognition in people’s daily life, face anti-spoofing becomes increasingly important to avoid malicious attacks. Recent face anti-spoofing models can reach a high classification accuracy on multiple datasets but these models can only tell people this face is fake'' while lacking the explanation to answer why it is fake’'. Such a system undermines trustworthiness and causes user confusion, as it denies their requests without providing any explanations. In this paper, we incorporate XAI into face anti-spoofing and propose a new problem termed X-FAS (eXplainable Face Anti-Spoofing) empowering face anti-spoofing models to provide an explanation. We propose SPED (SPoofing Evidence Discovery), an X-FAS method which can discover spoof concepts and provide reliable explanations on the basis of discovered concepts. To evaluate the quality of X-FAS methods, we propose an X-FAS benchmark with annotated spoofing evidence by experts. We analyze SPED explanations on face anti-spoofing dataset and compare SPED quantitatively and qualitatively with previous XAI methods on proposed X-FAS benchmark. Experimental results demonstrate SPED’s ability to generate reliable explanations.
zh

[CV-38] WildPPG: A Real-World PPG Dataset of Long Continuous Recordings NEURIPS2024

【速读】：该论文试图解决在日常户外活动中，基于反射式光电容积描记术 (PPG) 的心率 (HR) 估计受到多种因素（如活动类型、传感器位置、运动伪影、环境温度和光照等）影响而可靠性下降的问题。解决方案的关键在于引入了一个新的多模态数据集，该数据集包含了16名参与者在13.5小时内的连续PPG记录，涵盖了多种户外活动和环境条件（如温度、海拔变化等），并同步采集了加速度计、温度、海拔数据以及基于Lead I的ECG作为真实心率参考。通过这一数据集，论文提出了一种在复杂现实场景中更稳健的心率估计方法，超越了现有基准方法的性能。

链接: https://arxiv.org/abs/2412.17540
作者: Manuel Meier,Berken Utku Demirel,Christian Holz
机构: ETH Zürich, Switzerland (瑞士苏黎世联邦理工学院); Department of Computer Science (计算机科学系)
关键词: person heart rate, default sensing technique, monitor cardiac activity, Reflective photoplethysmography, heart rate
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS2024

点击查看摘要

Abstract:Reflective photoplethysmography (PPG) has become the default sensing technique in wearable devices to monitor cardiac activity via a person’s heart rate (HR). However, PPG-based HR estimates can be substantially impacted by factors such as the wearer’s activities, sensor placement and resulting motion artifacts, as well as environmental characteristics such as temperature and ambient light. These and other factors can significantly impact and decrease HR prediction reliability. In this paper, we show that state-of-the-art HR estimation methods struggle when processing \emphrepresentative data from everyday activities in outdoor environments, likely because they rely on existing datasets that captured controlled conditions. We introduce a novel multimodal dataset and benchmark results for continuous PPG recordings during outdoor activities from 16 participants over 13.5 hours, captured from four wearable sensors, each worn at a different location on the body, totaling 216,hours. Our recordings include accelerometer, temperature, and altitude data, as well as a synchronized Lead I-based electrocardiogram for ground-truth HR references. Participants completed a round trip from Zurich to Jungfraujoch, a tall mountain in Switzerland over the course of one day. The trip included outdoor and indoor activities such as walking, hiking, stair climbing, eating, drinking, and resting at various temperatures and altitudes (up to 3,571,m above sea level) as well as using cars, trains, cable cars, and lifts for transport – all of which impacted participants’ physiological dynamics. We also present a novel method that estimates HR values more robustly in such real-world scenarios than existing baselines.
zh

[CV-39] Exploring Dynamic Novel View Synthesis Technologies for Cinematography

【速读】：该论文试图解决动态新视角合成 (dynamic novel view synthesis) 中的模型选择问题，关键在于通过展示不同新视角合成模型（如NeRF和Gaussian Splatting）在实际应用中的表现，帮助用户更有效地选择适合其需求的模型。论文通过拍摄一段使用多种NVS模型的短片，展示了这些技术在电影制作中的潜力，特别是在实现平滑的摄像机运动、虚拟重拍和慢动作效果等方面的优势。

链接: https://arxiv.org/abs/2412.17532
作者: Adrian Azzarelli,Nantheera Anantrasirichai,David R Bull
机构: University of Bristol(布里斯托大学)
关键词: Neural Radiance Fields, Radiance Fields, Gaussian Splatting, Neural Radiance, shown significant promise
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel view synthesis (NVS) has shown significant promise for applications in cinematographic production, particularly through the exploitation of Neural Radiance Fields (NeRF) and Gaussian Splatting (GS). These methods model real 3D scenes, enabling the creation of new shots that are challenging to capture in the real world due to set topology or expensive equipment requirement. This innovation also offers cinematographic advantages such as smooth camera movements, virtual re-shoots, slow-motion effects, etc. This paper explores dynamic NVS with the aim of facilitating the model selection process. We showcase its potential through a short montage filmed using various NVS models.
zh

[CV-40] Constructing Fair Latent Space for Intersection of Fairness and Explainability AAAI2025

【速读】：该论文试图解决机器学习模型中公平性（fairness）与可解释性（explainability）之间的结合问题，特别是在实际用户信任方面存在的不足。解决方案的关键在于提出了一种新颖的模块，通过构建一个公平的潜在空间（fair latent space）来实现这一目标。该模块通过解耦和重新分配标签和敏感属性（sensitive attributes），生成反事实解释（counterfactual explanations），从而在保持公平性的同时提供可信的解释。该模块可以附加到预训练的生成模型上，将其有偏的潜在空间转换为公平的潜在空间，且仅需训练该模块，从而节省时间和成本。通过多种公平性指标的验证，证明了该方法能够有效提供有偏决策的解释并确保公平性。

链接: https://arxiv.org/abs/2412.17523
作者: Hyungjun Joo,Hyeonggeun Han,Sehwan Kim,Sangwoo Hong,Jungwoo Lee
机构: 1. Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院); 2. Korea Institute of Science and Technology (KIST)(韩国科学技术院); 3. Korea University (高丽大学)
关键词: fair latent space, machine learning models, latent space, fair latent, numerous studies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures, accepted in AAAI 2025

点击查看摘要

Abstract:As the use of machine learning models has increased, numerous studies have aimed to enhance fairness. However, research on the intersection of fairness and explainability remains insufficient, leading to potential issues in gaining the trust of actual users. Here, we propose a novel module that constructs a fair latent space, enabling faithful explanation while ensuring fairness. The fair latent space is constructed by disentangling and redistributing labels and sensitive attributes, allowing the generation of counterfactual explanations for each type of information. Our module is attached to a pretrained generative model, transforming its biased latent space into a fair latent space. Additionally, since only the module needs to be trained, there are advantages in terms of time and cost savings, without the need to train the entire generative model. We validate the fair latent space with various fairness metrics and demonstrate that our approach can effectively provide explanations for biased decisions and assurances of fairness.
zh

[CV-41] Dataset for Real-World Human Action Detection Using FMCW mmWave Radar

【速读】：该论文旨在解决使用隐私保护的毫米波雷达传感器进行人体动作检测的问题，特别关注其在医疗保健和家庭自动化中的应用。与现有研究仅限于受控环境中的模拟不同，论文的关键创新在于提供了一个真实世界的毫米波雷达数据集，并展示了基于该数据集的人体动作检测的基线结果。

链接: https://arxiv.org/abs/2412.17517
作者: Dylan jayabahu,Parthipan Siva
机构: Chirp Inc.; Vision and Image Processing Group, Systems Design Engineering, University of Waterloo
关键词: Human action detection, home automation, Human action, privacy-preserving mmWave radar, mmWave radar sensors
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in JCVIS (proceedings of 10th Annual Conference on Vision and Intelligent Systems)

点击查看摘要

Abstract:Human action detection using privacy-preserving mmWave radar sensors is studied for its applications in healthcare and home automation. Unlike existing research, limited to simulations in controlled environments, we present a real-world mmWave radar dataset with baseline results for human action detection.
zh

[CV-42] BEE: Metric-Adapted Explanations via Baseline Exploration-Exploitation AAAI2025

【速读】：该论文试图解决解释性研究中的两个主要挑战：1) 解释的细致评估；2) 通过基线表示对缺失信息的建模。现有文献提出了多种评估指标和基线表示方法，但缺乏统一的共识。论文提出了一种名为基线探索-利用（Baseline Exploration-Exploitation, BEE）的路径积分方法，通过将基线建模为学习到的随机张量，并引入随机性到积分过程中，从而生成多样化的解释图。BEE通过上下文探索-利用过程优化基线分布，并从学习到的分布中重新采样基线，生成全面的解释图集合，以便根据特定指标选择最佳的解释图。该方法在多种模型架构和客观评估指标上展示了优于现有最先进解释方法的性能。

链接: https://arxiv.org/abs/2412.17512
作者: Oren Barkan,Yehonatan Elisha,Jonathan Weill,Noam Koenigstein
机构: Technion - Israel Institute of Technology(以色列理工学院); Meta
关键词: explainability research involve, baseline representations, baseline, research involve, prominent challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025

点击查看摘要

Abstract:Two prominent challenges in explainability research involve 1) the nuanced evaluation of explanations and 2) the modeling of missing information through baseline representations. The existing literature introduces diverse evaluation metrics, each scrutinizing the quality of explanations through distinct lenses. Additionally, various baseline representations have been proposed, each modeling the notion of missingness differently. Yet, a consensus on the ultimate evaluation metric and baseline representation remains elusive. This work acknowledges the diversity in explanation metrics and baselines, demonstrating that different metrics exhibit preferences for distinct explanation maps resulting from the utilization of different baseline representations and distributions. To address the diversity in metrics and accommodate the variety of baseline representations in a unified manner, we propose Baseline Exploration-Exploitation (BEE) - a path-integration method that introduces randomness to the integration process by modeling the baseline as a learned random tensor. This tensor follows a learned mixture of baseline distributions optimized through a contextual exploration-exploitation procedure to enhance performance on the specific metric of interest. By resampling the baseline from the learned distribution, BEE generates a comprehensive set of explanation maps, facilitating the selection of the best-performing explanation map in this broad set for the given metric. Extensive evaluations across various model architectures showcase the superior performance of BEE in comparison to state-of-the-art explanation methods on a variety of objective evaluation metrics.
zh

[CV-43] An Evaluation Framework for Product Images Background Inpainting based on Human Feedback and Product Consistency AAAI2025

【速读】：该论文试图解决产品广告中利用AI技术自动修复背景时出现的背景不合适和产品不一致的问题，并提出了一种基于人类反馈和产品一致性（Human Feedback and Product Consistency, HFPC）的自动化评估方法。解决方案的关键在于两个模块：首先，通过收集44,000张自动修复背景的产品图像的人类反馈，训练一个基于多模态特征（从BLIP提取）和对比学习的奖励模型，以解决背景不合适的问题；其次，使用微调的分割模型对原始和生成的产品图像进行产品分割，并通过比较两者的差异来过滤掉产品不一致的图像。实验表明，HFPC能够有效评估生成图像的质量，显著减少人工标注的成本，并在精度上达到96.4%，优于其他开源视觉质量评估模型。

链接: https://arxiv.org/abs/2412.17504
作者: Yuqi Liang,Jun Luo,Xiaoxi Guo,Jianqi Bi
机构: 未知
关键词: generated product images, product images, product advertising applications, generated product, product
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by AAAI2025

点击查看摘要

Abstract:In product advertising applications, the automated inpainting of backgrounds utilizing AI techniques in product images has emerged as a significant task. However, the techniques still suffer from issues such as inappropriate background and inconsistent product in generated product images, and existing approaches for evaluating the quality of generated product images are mostly inconsistent with human feedback causing the evaluation for this task to depend on manual annotation. To relieve the issues above, this paper proposes Human Feedback and Product Consistency (HFPC), which can automatically assess the generated product images based on two modules. Firstly, to solve inappropriate backgrounds, human feedback on 44,000 automated inpainting product images is collected to train a reward model based on multi-modal features extracted from BLIP and comparative learning. Secondly, to filter generated product images containing inconsistent products, a fine-tuned segmentation model is employed to segment the product of the original and generated product images and then compare the differences between the above two. Extensive experiments have demonstrated that HFPC can effectively evaluate the quality of generated product images and significantly reduce the expense of manual annotation. Moreover, HFPC achieves state-of-the-art(96.4% in precision) in comparison to other open-source visual-quality-assessment models. Dataset and code are available at: this https URL inpainting products dataset/.
zh

[CV-44] Guided Real Image Dehazing using YCbCr Color Space

【速读】：该论文试图解决基于学习的图像去雾方法在RGB颜色空间中难以完全去除残余雾气的问题，主要原因在于从有雾的RGB图像中提取清晰的纹理特征困难，以及在非受控环境下难以获取真实的雾/清晰图像对。解决方案的关键在于提出了一种新颖的结构引导去雾网络 (Structure Guided Dehazing Network, SGDN)，该网络利用YCbCr特征的优越结构特性来指导RGB空间，通过包含双色引导桥 (Bi-Color Guidance Bridge, BGB) 和颜色增强模块 (Color Enhancement Module, CEM) 的两个关键模块，分别在频率和空间域中恢复更清晰的特征，并通过YCbCr通道信息增强RGB特征的颜色感知，以保持色调一致性。此外，论文还引入了真实世界对齐雾气 (Real-World Well-Aligned Haze, RW²AH) 数据集，以支持有效的监督学习。

链接: https://arxiv.org/abs/2412.17496
作者: Wenxuan Fang,Jankai Fan,Yu Zheng,Jiangwei Weng,Ying Tai,Jun Li
机构: 1. School of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院); 2. Alibaba Group(阿里巴巴集团)
关键词: gained significant attention, significant attention due, Guided Dehazing Network, gained significant, Structure Guided Dehazing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image dehazing, particularly with learning-based methods, has gained significant attention due to its importance in real-world applications. However, relying solely on the RGB color space often fall short, frequently leaving residual haze. This arises from two main issues: the difficulty in obtaining clear textural features from hazy RGB images and the complexity of acquiring real haze/clean image pairs outside controlled environments like smoke-filled scenes. To address these issues, we first propose a novel Structure Guided Dehazing Network (SGDN) that leverages the superior structural properties of YCbCr features over RGB. It comprises two key modules: Bi-Color Guidance Bridge (BGB) and Color Enhancement Module (CEM). BGB integrates a phase integration module and an interactive attention module, utilizing the rich texture features of the YCbCr space to guide the RGB space, thereby recovering clearer features in both frequency and spatial domains. To maintain tonal consistency, CEM further enhances the color perception of RGB features by aggregating YCbCr channel information. Furthermore, for effective supervised learning, we introduce a Real-World Well-Aligned Haze (RW ^2 AH) dataset, which includes a diverse range of scenes from various geographical regions and climate conditions. Experimental results demonstrate that our method surpasses existing state-of-the-art methods across multiple real-world smoke/haze datasets. Code and Dataset: \textcolorblue\urlthis https URL.
zh

[CV-45] Predicting Satisfied User and Machine Ratio for Compressed Images: A Unified Approach

【速读】：该论文试图解决压缩图像在人类和机器视觉分析中的感知质量预测问题。解决方案的关键在于提出了一种统一的深度学习方法，通过预训练特征提取网络和基于MLP-Mixer的网络来同时预测压缩图像的满意用户比例（Satisfied User Ratio, SUR）和满意机器比例（Satisfied Machine Ratio, SMR）。具体来说，首先在大规模SMR标注数据集上预训练特征提取网络，然后利用MLP-Mixer网络融合多层特征进行SUR和SMR的预测。此外，引入了差异特征残差学习（Difference Feature Residual Learning, DFRL）模块和多头注意力聚合与池化（Multi-Head Attention Aggregation and Pooling, MHAAP）层，以增强差异特征的学习和减少冗余，从而显著提升预测性能。

链接: https://arxiv.org/abs/2412.17477
作者: Qi Zhang,Shanshe Wang,Xinfeng Zhang,Siwei Ma,Jingshan Pan,Wen Gao
机构: National Research Engineering Center of Visual Technology, School of Computer Science, Peking University, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; National Supercomputer Center in Jinan, Jinan, China
关键词: accurate visual analysis, visual analysis, viewing experience, accurate visual, Satisfied User Ratio
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Nowadays, high-quality images are pursued by both humans for better viewing experience and by machines for more accurate visual analysis. However, images are usually compressed before being consumed, decreasing their quality. It is meaningful to predict the perceptual quality of compressed images for both humans and machines, which guides the optimization for compression. In this paper, we propose a unified approach to address this. Specifically, we create a deep learning-based model to predict Satisfied User Ratio (SUR) and Satisfied Machine Ratio (SMR) of compressed images simultaneously. We first pre-train a feature extractor network on a large-scale SMR-annotated dataset with human perception-related quality labels generated by diverse image quality models, which simulates the acquisition of SUR labels. Then, we propose an MLP-Mixer-based network to predict SUR and SMR by leveraging and fusing the extracted multi-layer features. We introduce a Difference Feature Residual Learning (DFRL) module to learn more discriminative difference features. We further use a Multi-Head Attention Aggregation and Pooling (MHAAP) layer to aggregate difference features and reduce their redundancy. Experimental results indicate that the proposed model significantly outperforms state-of-the-art SUR and SMR prediction methods. Moreover, our joint learning scheme of human and machine perceptual quality prediction tasks is effective at improving the performance of both.
zh

[CV-46] CALLIC: Content Adaptive Learning for Lossless Image Compression AAAI2025

【速读】：该论文试图解决现有无损图像压缩方法在编码过程中对特定测试图像的概率分布估计不准确的问题。解决方案的关键在于结合最小描述长度原则 (Minimum Description Length, MDL) 和参数高效迁移学习 (Parameter-Efficient Transfer Learning, PETL)，提出了一种内容自适应的无损图像压缩方法，称为 CALLIC。具体来说，CALLIC 通过引入内容感知的自回归自注意力机制（Masked Gated ConvFormer, MGCF）和预训练的卷积门控操作，结合缓存裁剪推理 (Cache then Crop Inference, CCI) 加速编码过程，并利用速率引导的渐进微调 (Rate-guided Progressive Fine-Tuning, RPFT) 在编码时对测试图像进行增量权重调整，从而优化学习过程并减少适应时间。

链接: https://arxiv.org/abs/2412.17464
作者: Daxin Li,Yuanchao Bai,Kai Wang,Junjun Jiang,Xianming Liu,Wen Gao
机构: 未知
关键词: achieved significant advancements, Learned lossless image, lossless image compression, Learned lossless, Minimum Description Length
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in sub-optimal probability distribution estimation for specific testing images during encoding process. To address this challenge, we explore the connection between the Minimum Description Length (MDL) principle and Parameter-Efficient Transfer Learning (PETL), leading to the development of a novel content-adaptive approach for learned lossless image compression, dubbed CALLIC. Specifically, we first propose a content-aware autoregressive self-attention mechanism by leveraging convolutional gating operations, termed Masked Gated ConvFormer (MGCF), and pretrain MGCF on training dataset. Cache then Crop Inference (CCI) is proposed to accelerate the coding process. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights on testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing learning process and reducing adaptation time. Extensive experiments across diverse datasets demonstrate that CALLIC sets a new state-of-the-art (SOTA) for learned lossless image compression.
zh

[CV-47] Progressive Boundary Guided Anomaly Synthesis for Industrial Anomaly Detection

【速读】：该论文试图解决无监督异常检测中由于仅使用正常样本训练而导致的过拟合问题，以及现有异常合成策略依赖外部异常纹理数据集、覆盖范围和方向性不足的问题。解决方案的关键在于提出了一种新的渐进边界引导异常合成 (Progressive Boundary-guided Anomaly Synthesis, PBAS) 策略，该策略通过三个核心组件实现：近似边界学习 (Approximate Boundary Learning, ABL)、异常特征合成 (Anomaly Feature Synthesis, AFS) 和精细边界优化 (Refined Boundary Optimization, RBO)。ABL 通过中心约束学习近似决策边界，AFS 在正常特征的超球面分布引导下灵活合成异常特征，RBO 则通过二分类优化决策边界，从而在不依赖外部纹理的情况下实现方向性异常合成，提升检测性能。

链接: https://arxiv.org/abs/2412.17458
作者: Qiyu Chen,Huiyuan Luo,Han Gao,Chengkan Lv,Zhengtao Zhang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
关键词: Unsupervised anomaly detection, identify surface defects, Unsupervised anomaly, anomaly synthesis, identify surface
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on Circuits and Systems for Video Technology

点击查看摘要

Abstract:Unsupervised anomaly detection methods can identify surface defects in industrial images by leveraging only normal samples for training. Due to the risk of overfitting when learning from a single class, anomaly synthesis strategies are introduced to enhance detection capability by generating artificial anomalies. However, existing strategies heavily rely on anomalous textures from auxiliary datasets. Moreover, their limitations in the coverage and directionality of anomaly synthesis may result in a failure to capture useful information and lead to significant redundancy. To address these issues, we propose a novel Progressive Boundary-guided Anomaly Synthesis (PBAS) strategy, which can directionally synthesize crucial feature-level anomalies without auxiliary textures. It consists of three core components: Approximate Boundary Learning (ABL), Anomaly Feature Synthesis (AFS), and Refined Boundary Optimization (RBO). To make the distribution of normal samples more compact, ABL first learns an approximate decision boundary by center constraint, which improves the center initialization through feature alignment. AFS then directionally synthesizes anomalies with more flexible scales guided by the hypersphere distribution of normal features. Since the boundary is so loose that it may contain real anomalies, RBO refines the decision boundary through the binary classification of artificial anomalies and normal features. Experimental results show that our method achieves state-of-the-art performance and the fastest detection speed on three widely used industrial datasets, including MVTec AD, VisA, and MPDD. The code will be available at: this https URL.
zh

[CV-48] Multimodal Preference Data Synthetic Alignment with Reward Model

【速读】：该论文试图解决多模态大语言模型（MLLMs）在视觉-语言任务中因预训练数据与用户提示之间的差异而产生的误导性或幻觉内容的问题。解决方案的关键在于提出了一种新的框架，通过使用奖励模型（reward model）作为人类偏好的代理，生成合成数据以进行直接偏好优化（Direct Preference Optimization, DPO）训练。该方法显著提升了模型的可信度和推理能力，减少了对外部人工标注数据的依赖，并增强了模型在多模态对齐任务中的表现，为更安全的部署提供了可扩展的解决方案。

链接: https://arxiv.org/abs/2412.17417
作者: Robert Wijaya,Ngoc-Bao Nguyen,Ngai-Man Cheung
机构: Singapore University of Technology and Design (新加坡科技设计大学)
关键词: visual question answering, Multimodal large language, significantly advanced tasks, large language models, visual question
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have significantly advanced tasks like caption generation and visual question answering by integrating visual and textual data. However, they sometimes produce misleading or hallucinate content due to discrepancies between their pre-training data and real user prompts. Existing approaches using Direct Preference Optimization (DPO) in vision-language tasks often rely on strong models like GPT-4 or CLIP to determine positive and negative responses. Here, we propose a new framework in generating synthetic data using a reward model as a proxy of human preference for effective multimodal alignment with DPO training. The resulting DPO dataset ranges from 2K to 9K image-text pairs, was evaluated on LLaVA-v1.5-7B, where our approach demonstrated substantial improvements in both the trustworthiness and reasoning capabilities of the base model across multiple hallucination and vision-language benchmark. The experiment results indicate that integrating selected synthetic data, such as from generative and rewards models can effectively reduce reliance on human-annotated data while enhancing MLLMs’ alignment capability, offering a scalable solution for safer deployment.
zh

[CV-49] VidCtx: Context-aware Video Question Answering with Image Models

【速读】：该论文试图解决在视频问答任务中，大型多模态模型（Large Multimodal Models, LMM）在计算和内存方面的限制问题。现有的方法通过提取每帧的文本表示（如通过字幕生成）并将其输入到大型语言模型（Large Language Model, LLM）中，但这种方法无法利用视觉信息，且常需处理相邻帧的重复文本描述。论文提出的解决方案是引入VidCtx，一种无需训练的视频问答框架，该框架能够整合视觉信息和适当的文本上下文。关键在于使用预训练的多模态模型（LMM）定期提取问题感知的文本描述（captions），并将其作为上下文，结合特定帧、问题和适当的帧描述来回答问题。为避免冗余信息，上下文选择远距离帧的描述，并通过简单的最大池化机制（max pooling）聚合帧级决策，从而使模型能够聚焦于视频的相关片段并处理大量帧。实验表明，VidCtx在三个公开视频问答基准（NExT-QA, IntentQA, STAR）上表现出色。

链接: https://arxiv.org/abs/2412.17415
作者: Andreas Goulas,Vasileios Mezaris,Ioannis Patras
机构: CERTH-ITI, Thessaloniki, Greece, 570001; Queen Mary University of London, London, UK, E14NS
关键词: Large Language Model, Large Multimodal Models, Large Language, Video Question-Answering task, Large Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Submitted for publication

点击查看摘要

Abstract:To address computational and memory limitations of Large Multimodal Models in the Video Question-Answering task, several recent methods extract textual representations per frame (e.g., by captioning) and feed them to a Large Language Model (LLM) that processes them to produce the final response. However, in this way, the LLM does not have access to visual information and often has to process repetitive textual descriptions of nearby frames. To address those shortcomings, in this paper, we introduce VidCtx, a novel training-free VideoQA framework which integrates both modalities, i.e. both visual information from input frames and textual descriptions of others frames that give the appropriate context. More specifically, in the proposed framework a pre-trained Large Multimodal Model (LMM) is prompted to extract at regular intervals, question-aware textual descriptions (captions) of video frames. Those will be used as context when the same LMM will be prompted to answer the question at hand given as input a) a certain frame, b) the question and c) the context/caption of an appropriate frame. To avoid redundant information, we chose as context the descriptions of distant frames. Finally, a simple yet effective max pooling mechanism is used to aggregate the frame-level decisions. This methodology enables the model to focus on the relevant segments of the video and scale to a high number of frames. Experiments show that VidCtx achieves competitive performance among approaches that rely on open models on three public Video QA benchmarks, NExT-QA, IntentQA and STAR.
zh

[CV-50] Impact of Evidence Theory Uncertainty on Training Object Detection Models

【速读】：该论文试图解决对象检测模型训练效率的问题，通过引入不确定性（uncertainty）来优化反馈机制。解决方案的关键在于利用证据理论（Evidence Theory）在每次训练迭代中量化预测与真实标签之间的关系，并通过Dempster-Shafer组合规则计算不确定性。这种不确定性被用于加权反馈损失，从而动态调整模型的学习过程。实验表明，基于不确定性的反馈不仅减少了训练时间，还提升了模型性能，为机器学习工作流中的不确定性管理提供了新的视角，特别是在对象检测领域，并具有推广到其他AI领域的潜力。

链接: https://arxiv.org/abs/2412.17405
作者: M. Tahasanul Ibrahim,Rifshu Hussain Shaik,Andreas Schwung
机构: 未知
关键词: Evidence Theory, paper investigates, Evidence, Theory, feedback loop
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper investigates the use of Evidence Theory to enhance the training efficiency of object detection models by incorporating uncertainty into the feedback loop. In each training iteration, during the validation phase, Evidence Theory is applied to establish a relationship between ground truth labels and predictions. The Dempster-Shafer rule of combination is used to quantify uncertainty based on the evidence from these predictions. This uncertainty measure is then utilized to weight the feedback loss for the subsequent iteration, allowing the model to adjust its learning dynamically. By experimenting with various uncertainty-weighting strategies, this study aims to determine the most effective method for optimizing feedback to accelerate the training process. The results demonstrate that using uncertainty-based feedback not only reduces training time but can also enhance model performance compared to traditional approaches. This research offers insights into the role of uncertainty in improving machine learning workflows, particularly in object detection, and suggests broader applications for uncertainty-driven training across other AI disciplines.
zh

[CV-51] Learning Dynamic Local Context Representations for Infrared Small Target Detection

【速读】：该论文试图解决红外小目标检测 (Infrared Small Target Detection, ISTD) 中的挑战，包括复杂背景、低信杂比以及目标大小和形状的多样性。解决方案的关键在于提出了一种名为LCRNet的新方法，通过学习动态局部上下文表示来实现有效的检测。LCRNet包含三个核心组件：(1) C2FBlock，受偏微分方程求解器启发，用于高效捕捉小目标信息；(2) DLC-Attention，一种大核注意力机制，动态构建上下文并减少特征冗余；(3) HLKConv，基于大核分解的分层卷积操作，保持稀疏性并缓解扩张卷积的缺点。尽管模型简单，仅有1.65M参数，LCRNet在多个数据集上实现了最先进的性能，展示了其优越的性能和效率。

链接: https://arxiv.org/abs/2412.17401
作者: Guoyi Zhang,Guangsheng Xu,Han Wang,Siyang Chen,Yunxiao Shan,Xiaohu Zhang
机构: School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, Guangdong, China; School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, Guangdong 519082, China; Southern Marine Science and Engineering Guangdong Laboratory, Sun Yat-sen University, Zhuhai, Guangdong 519082, China; Shenzhen Institute, Sun Yat-sen University, Shenzhen 510275, China; Guangdong Key Laboratory of Big Data Analysis and Processing, Sun Yat-sen University, Guangzhou 510006, China
关键词: Infrared small target, varying target sizes, Infrared small, complex backgrounds, sizes and shapes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared small target detection (ISTD) is challenging due to complex backgrounds, low signal-to-clutter ratios, and varying target sizes and shapes. Effective detection relies on capturing local contextual information at the appropriate scale. However, small-kernel CNNs have limited receptive fields, leading to false alarms, while transformer models, with global receptive fields, often treat small targets as noise, resulting in miss-detections. Hybrid models struggle to bridge the semantic gap between CNNs and transformers, causing high this http URL address these challenges, we propose LCRNet, a novel method that learns dynamic local context representations for ISTD. The model consists of three components: (1) C2FBlock, inspired by PDE solvers, for efficient small target information capture; (2) DLC-Attention, a large-kernel attention mechanism that dynamically builds context and reduces feature redundancy; and (3) HLKConv, a hierarchical convolution operator based on large-kernel decomposition that preserves sparsity and mitigates the drawbacks of dilated convolutions. Despite its simplicity, with only 1.65M parameters, LCRNet achieves state-of-the-art (SOTA) this http URL on multiple datasets, comparing LCRNet with 33 SOTA methods, demonstrate its superior performance and efficiency.
zh

[CV-52] owards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning AAAI2025

【速读】：该论文旨在通过增强大型语言模型（LLMs）的逐步推理能力来提升其性能，特别是在算术推理任务中。解决方案的关键在于引入内在自我修正机制，并通过两阶段训练过程实现。第一阶段，模型通过自我预测增强其自我修正推理能力，主要依赖于自我生成的数据。第二阶段，利用第一阶段增强的自我修正策略，结合逐步偏好学习（step-wise preference learning）进行进一步优化。这种方法在算术推理任务中显著提高了模型的准确性，例如在MATH数据集上分别提升了4.18%和4.94%，在GSM8K数据集上分别提升了2.00%和2.28%。

链接: https://arxiv.org/abs/2412.17397
作者: Huchen Jiang,Yangyang Ma,Chaofan Ding,Kexin Luan,Xinhan Di
机构: 未知
关键词: Large Language Models, Language Models, Large Language, step-wise preference learning, iterative preference learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 Pages,3 figures, accepted by AAAI 2025 Workshop NeurMAD

点击查看摘要

Abstract:With current state-of-the-art approaches aimed at enhancing the reasoning capabilities of Large Language Models(LLMs) through iterative preference learning inspired by AlphaZero, we propose to further enhance the step-wise reasoning capabilities through intrinsic self-correction to some extent. Our work leverages step-wise preference learning to enhance self-verification via reinforcement learning. We initially conduct our work through a two-stage training procedure. At the first stage, the self-correction reasoning ability of an LLM is enhanced through its own predictions, relying entirely on self-generated data within the intrinsic self-correction to some extent. At the second stage, the baseline step-wise preference learning is leveraged via the application of the enhanced self-correct policy achieved at the first stage. In the evaluation of arithmetic reasoning tasks, our approach outperforms OpenMath2-Llama3.1-8B, dart-math-mistral-7b-uniform on MATH with increases in accuracy to 71.34%(+4.18%) and 48.06%(+4.94%) and LLama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.1 on GSM8K with increases in accuracy to 86.76%(+2.00%) and 38.06%(+2.28%).
zh

[CV-53] PointVoxelFormer – Reviving point cloud networks for 3D medical imaging

【速读】：该论文试图解决在医学影像中点云数据应用不足的问题，特别是在与体积3D卷积神经网络（CNNs）和视觉变换器相比时。解决方案的关键在于提出了一种混合方法，即结合点操作与中间可微分光栅化和局部密集CNNs，以克服点云中空间邻近点交互带来的计算瓶颈。具体而言，该方法通过早期融合坐标特征的方案，将两个点云在共同参考框架中结合，并采用逆一致的两步对齐架构，从而在分割和配准任务中实现了显著的速度提升、内存减少和配准误差降低。

链接: https://arxiv.org/abs/2412.17390
作者: Mattias Paul Heinrich
机构: 未知
关键词: represent volumetric data, Point clouds, medical imaging, represent volumetric, point cloud registration
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Point clouds are a very efficient way to represent volumetric data in medical imaging. First, they do not occupy resources for empty spaces and therefore can avoid trade-offs between resolution and field-of-view for voxel-based 3D convolutional networks (CNNs) - leading to smaller and robust models. Second, they provide a modality agnostic representation of anatomical surfaces and shapes to avoid domain gaps for generic geometric models. Third, they remove identifiable patient-specific information and may increase privacy preservation when publicly sharing data. Despite their benefits, point clouds are still underexplored in medical imaging compared to volumetric 3D CNNs and vision transformers. To date both datasets and stringent studies on comparative strengths and weaknesses of methodological choices are missing. Interactions and information exchange of spatially close points - e.g. through k-nearest neighbour graphs in edge convolutions or point transformations - within points clouds are crucial for learning geometrically meaningful features but may incur computational bottlenecks. This work presents a hybrid approach that combines point-wise operations with intermediate differentiable rasterisation and dense localised CNNs. For deformable point cloud registration, we devise an early fusion scheme for coordinate features that joins both clouds within a common reference frame and is coupled with an inverse consistent, two-step alignment architecture. Our extensive experiments on three different datasets for segmentation and registration demonstrate that our method, PointVoxelFormer, enables very compact models that excel with threefold speed-ups, fivefold memory reduction and over 30% registration error reduction against edge convolutions and other state-of-the-art models in geometric deep learning.
zh

[CV-54] Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement AAAI2025

【速读】：该论文试图解决在模型压缩过程中，剪枝方法仅关注保留关键连接而忽略剪枝权重对后续微调或蒸馏的影响，导致效率低下的问题。解决方案的关键在于引入奇异值缩放 (Singular Value Scaling, SVS) 技术，通过调整剪枝权重的奇异值差异，优化权重初始化，从而提升微调效率和模型性能。该方法适用于生成对抗网络 (GANs) 和扩散模型 (Diffusion models)，并在实验中证明了其在不同模型架构（如 StyleGAN2、StyleGAN3 和 DDPM）上的有效性，且无需额外训练成本。

链接: https://arxiv.org/abs/2412.17387
作者: Hyeonjin Kim,Jaejun Yoo
机构: 未知
关键词: preserving crucial connections, effectively maintain model, pruning methods effectively, methods effectively maintain, crucial connections
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2025

点击查看摘要

Abstract:While pruning methods effectively maintain model performance without extra training costs, they often focus solely on preserving crucial connections, overlooking the impact of pruned weights on subsequent fine-tuning or distillation, leading to inefficiencies. Moreover, most compression techniques for generative models have been developed primarily for GANs, tailored to specific architectures like StyleGAN, and research into compressing Diffusion models has just begun. Even more, these methods are often applicable only to GANs or Diffusion models, highlighting the need for approaches that work across both model types. In this paper, we introduce Singular Value Scaling (SVS), a versatile technique for refining pruned weights, applicable to both model types. Our analysis reveals that pruned weights often exhibit dominant singular vectors, hindering fine-tuning efficiency and leading to suboptimal performance compared to random initialization. Our method enhances weight initialization by minimizing the disparities between singular values of pruned weights, thereby improving the fine-tuning process. This approach not only guides the compressed model toward superior solutions but also significantly speeds up fine-tuning. Extensive experiments on StyleGAN2, StyleGAN3 and DDPM demonstrate that SVS improves compression performance across model types without additional training costs. Our code is available at: this https URL.
zh

[CV-55] Balanced 3DGS: Gaussian-wise Parallelism Rendering with Fine-Grained Tiling

【速读】：该论文试图解决3D Gaussian Splatting (3DGS)模型训练中的负载不平衡问题，特别是在像素和Gaussian球体之间工作负载多样性导致的渲染CUDA内核性能下降。解决方案的关键在于引入Balanced 3DGS，通过高斯粒度的并行渲染和细粒度分块方法来解决负载不平衡问题。具体措施包括：1) 创新的块间动态工作负载分配技术，动态地将工作负载映射到单个GPU的流多处理器(Streaming Multiprocessor, SM)资源上，奠定负载均衡的基础；2) 首次提出的高斯粒度并行渲染技术，显著减少warp内部的工作负载分歧，是解决负载不平衡的关键组件；3) 进一步提出的细粒度组合负载均衡技术，均匀分布所有SM的工作负载，提升前向渲染CUDA内核性能高达7.52倍。此外，论文还提出了基于不同负载平衡情况的自适应渲染内核选择策略，有效提高训练效率。

链接: https://arxiv.org/abs/2412.17378
作者: Hao Gui,Lin Hu,Rui Chen,Mingxiao Huang,Yuxin Yin,Jin Yang,Yong Wu
机构: 未知
关键词: increasingly attracting attention, superior visual quality, Gaussian Splatting, increasingly attracting, attracting attention
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is increasingly attracting attention in both academia and industry owing to its superior visual quality and rendering speed. However, training a 3DGS model remains a time-intensive task, especially in load imbalance scenarios where workload diversity among pixels and Gaussian spheres causes poor renderCUDA kernel performance. We introduce Balanced 3DGS, a Gaussian-wise parallelism rendering with fine-grained tiling approach in 3DGS training process, perfectly solving load-imbalance issues. First, we innovatively introduce the inter-block dynamic workload distribution technique to map workloads to Streaming Multiprocessor(SM) resources within a single GPU dynamically, which constitutes the foundation of load balancing. Second, we are the first to propose the Gaussian-wise parallel rendering technique to significantly reduce workload divergence inside a warp, which serves as a critical component in addressing load imbalance. Based on the above two methods, we further creatively put forward the fine-grained combined load balancing technique to uniformly distribute workload across all SMs, which boosts the forward renderCUDA kernel performance by up to 7.52x. Besides, we present a self-adaptive render kernel selection strategy during the 3DGS training process based on different load-balance situations, which effectively improves training efficiency.
zh

[CV-56] A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions

【速读】：该论文试图解决从单目视频中提取物理上合理的3D人体运动的问题，特别是针对高难度运动场景。解决方案的关键在于引入了一个基于掩码的运动校正模块（Mask-based Motion Correction Module, MCM），通过利用运动上下文和视频掩码来修复有缺陷的运动片段，生成适合模仿的运动；同时提出了一个基于物理的运动传递模块（Physics-based Motion Transfer Module, PTM），采用预训练和适应的方法来提升运动模仿的物理合理性，并能够处理复杂和具有挑战性的运动场景。这两个模块共同构成了一个即插即用的物理运动捕捉结果优化系统，适用于高难度和野外环境中的运动捕捉。

链接: https://arxiv.org/abs/2412.17377
作者: Youliang Zhang,Ronghui Li,Yachao Zhang,Liang Pan,Jingbo Wang,Yebin Liu,Xiu Li
机构: Tsinghua University, China(清华大学); Shanghai AI Laboratory(上海人工智能实验室)
关键词: Extracting physically plausible, Extracting physically, critical task, motion, motion capture results
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extracting physically plausible 3D human motion from videos is a critical task. Although existing simulation-based motion imitation methods can enhance the physical quality of daily motions estimated from monocular video capture, extending this capability to high-difficulty motions remains an open challenge. This can be attributed to some flawed motion clips in video-based motion capture results and the inherent complexity in modeling high-difficulty motions. Therefore, sensing the advantage of segmentation in localizing human body, we introduce a mask-based motion correction module (MCM) that leverages motion context and video mask to repair flawed motions, producing imitation-friendly motions; and propose a physics-based motion transfer module (PTM), which employs a pretrain and adapt approach for motion imitation, improving physical plausibility with the ability to handle in-the-wild and challenging motions. Our approach is designed as a plug-and-play module to physically refine the video motion capture results, including high-difficulty in-the-wild motions. Finally, to validate our approach, we collected a challenging in-the-wild test set to establish a benchmark, and our method has demonstrated effectiveness on both the new benchmark and existing public this http URL://physicalmotionrestoration.this http URL
zh

[CV-57] FlowMamba: Learning Point Cloud Scene Flow with Global Motion Propagation AAAI2025

【速读】：该论文试图解决基于深度学习的场景流方法在处理不适定区域（如大面积平坦区域或遮挡区域）时因局部证据不足而表现不佳的问题。解决方案的关键在于提出了一种名为FlowMamba的全局感知场景流估计网络，其核心是基于状态空间模型（State Space Model）的迭代单元（ISU）。ISU首先传播全局运动模式，然后自适应地将全局运动信息与先前的隐藏状态结合，从而提升对不适定区域的处理能力。此外，针对点云的不规则性对全局运动传播的限制，论文提出了特征诱导排序策略（FIO），通过语义相关和运动相关的特征将点排序为具有空间连续性的序列，进一步增强了全局运动传播的效果。实验结果表明，FlowMamba在FlyingThings3D和KITTI数据集上显著提升了预测精度，首次实现了毫米级精度的场景流估计。

链接: https://arxiv.org/abs/2412.17366
作者: Min Lin,Gangwei Xu,Yun Wang,Xianqi Wang,Xin Yang
机构: 未知
关键词: achieved impressive performance, Scene flow methods, global motion propagation, global motion, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025

点击查看摘要

Abstract:Scene flow methods based on deep learning have achieved impressive performance. However, current top-performing methods still struggle with ill-posed regions, such as extensive flat regions or occlusions, due to insufficient local evidence. In this paper, we propose a novel global-aware scene flow estimation network with global motion propagation, named FlowMamba. The core idea of FlowMamba is a novel Iterative Unit based on the State Space Model (ISU), which first propagates global motion patterns and then adaptively integrates the global motion information with previously hidden states. As the irregular nature of point clouds limits the performance of ISU in global motion propagation, we propose a feature-induced ordering strategy (FIO). The FIO leverages semantic-related and motion-related features to order points into a sequence characterized by spatial continuity. Extensive experiments demonstrate the effectiveness of FlowMamba, with 21.9% and 20.5% EPE3D reduction from the best published results on FlyingThings3D and KITTI datasets. Specifically, our FlowMamba is the first method to achieve millimeter-level prediction accuracy in FlyingThings3D and KITTI. Furthermore, the proposed ISU can be seamlessly embedded into existing iterative networks as a plug-and-play module, improving their estimation accuracy significantly.
zh

[CV-58] DiffFormer: a Differential Spatial-Spectral Transformer for Hyperspectral Image Classification

【速读】：该论文试图解决高光谱图像分类 (Hyperspectral Image Classification, HSIC) 中的关键问题，包括光谱冗余和空间不连续性。解决方案的关键在于提出了一个名为差分空间-光谱Transformer (Differential Spatial-Spectral Transformer, DiffFormer) 的新框架。该框架通过引入差分多头自注意力机制 (Differential Multi-Head Self-Attention, DMHSA) 来增强局部特征的判别能力，并通过三维卷积 (3D convolution) 进行光谱-空间分块嵌入 (Spectral-Spatial Tokenization)，结合位置编码和带有SWiGLU激活函数的Transformer层，实现高效特征提取。此外，基于token的分类头确保了鲁棒的表示学习，从而实现精确的高光谱像素标注。

链接: https://arxiv.org/abs/2412.17350
作者: Muhammad Ahmad,Manuel Mazzara,Salvatore Distefano,Adil Mehmood Khan,Silvia Liberata Ullo
机构: Dipartimento di Matematica e Informatica-MIFT, University of Messina(数学与信息科学系-MIFT，墨西拿大学); Institute of Software Development and Engineering, Innopolis University(软件开发与工程研究所，英诺波利斯大学); School of Computer Science, University of Hull(计算机科学学院，赫尔大学); Engineering Department, University of Sannio(工程系，萨尼奥大学)
关键词: analyzing high-dimensional data, gained significant attention, gained significant, potential in analyzing, analyzing high-dimensional
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral image classification (HSIC) has gained significant attention because of its potential in analyzing high-dimensional data with rich spectral and spatial information. In this work, we propose the Differential Spatial-Spectral Transformer (DiffFormer), a novel framework designed to address the inherent challenges of HSIC, such as spectral redundancy and spatial discontinuity. The DiffFormer leverages a Differential Multi-Head Self-Attention (DMHSA) mechanism, which enhances local feature discrimination by introducing differential attention to accentuate subtle variations across neighboring spectral-spatial patches. The architecture integrates Spectral-Spatial Tokenization through three-dimensional (3D) convolution-based patch embeddings, positional encoding, and a stack of transformer layers equipped with the SWiGLU activation function for efficient feature extraction (SwiGLU is a variant of the Gated Linear Unit (GLU) activation function). A token-based classification head further ensures robust representation learning, enabling precise labeling of hyperspectral pixels. Extensive experiments on benchmark hyperspectral datasets demonstrate the superiority of DiffFormer in terms of classification accuracy, computational efficiency, and generalizability, compared to existing state-of-the-art (SOTA) methods. In addition, this work provides a detailed analysis of computational complexity, showcasing the scalability of the model for large-scale remote sensing applications. The source code will be made available at \urlthis https URL after the first round of revision.
zh

[CV-59] FFA Sora video generation as fundus fluorescein angiography simulator

【速读】：该论文试图解决眼底荧光血管造影（FFA）图像解读困难的问题，特别是针对初学者。解决方案的关键在于开发了一个名为FFA Sora的文本到视频模型，该模型通过小波流变分自编码器（WF-VAE）和扩散变换器（DiT）将FFA报告转换为动态视频。该模型在匿名化数据集上训练，能够准确模拟输入文本中的疾病特征，并通过客观指标（如Frechet视频距离、学习感知图像块相似性和视觉问答评分）验证了其性能。此外，该模型在隐私保护方面表现出色，支持大规模FFA数据的共享，同时提升了医学教育的效果。

链接: https://arxiv.org/abs/2412.17346
作者: Xinyuan Wu,Lili Wang,Ruoyu Chen,Bowen Liu,Weiyi Zhang,Xi Yang,Yifan Feng,Mingguang He,Danli Shi
机构: 未知
关键词: Fundus fluorescein angiography, diagnosing retinal vascular, Fundus fluorescein, retinal vascular diseases, Wavelet-Flow Variational Autoencoder
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 3 figures

点击查看摘要

Abstract:Fundus fluorescein angiography (FFA) is critical for diagnosing retinal vascular diseases, but beginners often struggle with image interpretation. This study develops FFA Sora, a text-to-video model that converts FFA reports into dynamic videos via a Wavelet-Flow Variational Autoencoder (WF-VAE) and a diffusion transformer (DiT). Trained on an anonymized dataset, FFA Sora accurately simulates disease features from the input text, as confirmed by objective metrics: Frechet Video Distance (FVD) = 329.78, Learned Perceptual Image Patch Similarity (LPIPS) = 0.48, and Visual-question-answering Score (VQAScore) = 0.61. Specific evaluations showed acceptable alignment between the generated videos and textual prompts, with BERTScore of 0.35. Additionally, the model demonstrated strong privacy-preserving performance in retrieval evaluations, achieving an average Recall@K of 0.073. Human assessments indicated satisfactory visual quality, with an average score of 1.570(scale: 1 = best, 5 = worst). This model addresses privacy concerns associated with sharing large-scale FFA data and enhances medical education.
zh

[CV-60] Neural-MCRL: Neural Multimodal Contrastive Representation Learning for EEG-based Visual Decoding

【速读】：该论文试图解决基于脑电图 (EEG) 的脑机接口 (BMI) 中神经视觉表征解码的关键问题，特别是现有方法在多模态对比表征学习 (MCRL) 中忽视模态内语义一致性和完整性，以及模态间语义对齐不足的问题。解决方案的关键在于提出了Neural-MCRL框架，通过语义桥接和交叉注意力机制实现多模态对齐，同时确保模态内的完整性和模态间的一致性。此外，该框架还引入了具有光谱-时间适应性 (Spectral-Temporal Adaptation) 的神经编码器 (Neural Encoder with Spectral-Temporal Adaptation, NESTA)，能够自适应捕捉光谱模式并学习个体特定的变换，从而显著提升视觉解码精度和模型泛化能力。

链接: https://arxiv.org/abs/2412.17337
作者: Yueyang Li,Zijian Kang,Shengyu Gong,Wenhao Dong,Weiming Zeng,Hongjie Yan,Wai Ting Siok,Nizhuan Wang
机构: Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China; Affiliated Lianyungang Hospital of Xuzhou Medical University, Lianyungang 222002, China; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong SAR, China
关键词: based brain activity, neural sensory rehabilitation, advancing brain-machine interfaces, based brain, brain-machine interfaces
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Decoding neural visual representations from electroencephalogram (EEG)-based brain activity is crucial for advancing brain-machine interfaces (BMI) and has transformative potential for neural sensory rehabilitation. While multimodal contrastive representation learning (MCRL) has shown promise in neural decoding, existing methods often overlook semantic consistency and completeness within modalities and lack effective semantic alignment across modalities. This limits their ability to capture the complex representations of visual neural responses. We propose Neural-MCRL, a novel framework that achieves multimodal alignment through semantic bridging and cross-attention mechanisms, while ensuring completeness within modalities and consistency across modalities. Our framework also features the Neural Encoder with Spectral-Temporal Adaptation (NESTA), a EEG encoder that adaptively captures spectral patterns and learns subject-specific transformations. Experimental results demonstrate significant improvements in visual decoding accuracy and model generalization compared to state-of-the-art methods, advancing the field of EEG-based neural visual representation decoding in BMI. Codes will be available at: this https URL.
zh

[CV-61] Uncertainty-Participation Context Consistency Learning for Semi-supervised Semantic Segmentation ICASSP

【速读】：该论文试图解决半监督语义分割中现有的一致性正则化方法未能充分利用网络中潜在监督信息的问题。解决方案的关键在于提出了不确定性参与上下文一致性学习 (Uncertainty-participation Context Consistency Learning, UCCL) 方法，通过设计语义反向传播更新 (Semantic Backpropagation Update, SBU) 策略来挖掘不确定像素区域的潜在知识，并引入类别感知知识调节 (Class-aware Knowledge Regulation, CKR) 模块来增强不同增强视图间的类别级语义特征一致性。这些创新使得模型能够更全面地学习像素级和类别级的语义信息，从而在半监督语义分割任务中实现更优的性能。

链接: https://arxiv.org/abs/2412.17331
作者: Jianjian Yin,Yi Chen,Zhichao Zheng,Junsheng Zhou,Yanhui Gu
机构: School of Artificial Intelligence, Nanjing Normal University (人工智能学院，南京师范大学); School of Artificial Intelligence, Nanjing Normal University (人工智能学院，南京师范大学); School of Artificial Intelligence, Nanjing Normal University (人工智能学院，南京师范大学); School of Artificial Intelligence, Nanjing Normal University (人工智能学院，南京师范大学)
关键词: Semi-supervised semantic segmentation, extensive labeled data, attracted considerable attention, Uncertainty-participation Context Consistency, Semi-supervised semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in ICASSP

点击查看摘要

Abstract:Semi-supervised semantic segmentation has attracted considerable attention for its ability to mitigate the reliance on extensive labeled data. However, existing consistency regularization methods only utilize high certain pixels with prediction confidence surpassing a fixed threshold for training, failing to fully leverage the potential supervisory information within the network. Therefore, this paper proposes the Uncertainty-participation Context Consistency Learning (UCCL) method to explore richer supervisory signals. Specifically, we first design the semantic backpropagation update (SBU) strategy to fully exploit the knowledge from uncertain pixel regions, enabling the model to learn consistent pixel-level semantic information from those areas. Furthermore, we propose the class-aware knowledge regulation (CKR) module to facilitate the regulation of class-level semantic features across different augmented views, promoting consistent learning of class-level semantic information within the encoder. Experimental results on two public benchmarks demonstrate that our proposed method achieves state-of-the-art performance. Our code is available at this https URL.
zh

[CV-62] Feature Based Methods Domain Adaptation for Object Detection: A Review Paper

【速读】：该论文旨在解决领域自适应（Domain Adaptation）问题，特别是在目标域数据分布与源域显著不同的情况下，提升机器学习模型在目标域中的性能。关键解决方案在于通过多种先进的领域自适应方法，如对抗学习（Adversarial Learning）、基于差异的方法（Discrepancy-based）、多领域（Multi-domain）、师生模型（Teacher-student）、集成（Ensemble）和视觉语言模型（VLM）技术，来减少领域间差异并增强模型的鲁棒性。特别地，基于特征的方法（Feature-based methods），如特征对齐（Feature Alignment）、特征增强/重构（Feature Augmentation/Reconstruction）和特征变换（Feature Transformation），在跨领域特征表示的协调中发挥了重要作用，从而有效缩小领域差距并提升模型性能。此外，论文还强调了减少对大量标注数据的依赖，利用未标注数据，尤其是在合成到真实域（synthetic-to-real）的场景中，以应对实际应用中的挑战。

链接: https://arxiv.org/abs/2412.17325
作者: Helia Mohamadi,Mohammad Ali Keyvanrad,Mohammad Reza Mohammadi
机构: 未知
关键词: distinct data distributions, aims to enhance, machine learning models, pivotal branch, branch of transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 46 pages, 13 figures, It will be submitted to a journal

点击查看摘要

Abstract:Domain adaptation, a pivotal branch of transfer learning, aims to enhance the performance of machine learning models when deployed in target domains with distinct data distributions. This is particularly critical for object detection tasks, where domain shifts (caused by factors such as lighting conditions, viewing angles, and environmental variations) can lead to significant performance degradation. This review delves into advanced methodologies for domain adaptation, including adversarial learning, discrepancy-based, multi-domain, teacher-student, ensemble, and VLM techniques, emphasizing their efficacy in reducing domain gaps and enhancing model robustness. Feature-based methods have emerged as powerful tools for addressing these challenges by harmonizing feature representations across domains. These techniques, such as Feature Alignment, Feature Augmentation/Reconstruction, and Feature Transformation, are employed alongside or as integral parts of other domain adaptation strategies to minimize domain gaps and improve model performance. Special attention is given to strategies that minimize the reliance on extensive labeled data and using unlabeled data, particularly in scenarios involving synthetic-to-real domain shifts. Applications in fields such as autonomous driving and medical imaging are explored, showcasing the potential of these methods to ensure reliable object detection in diverse and complex settings. By providing a thorough analysis of state-of-the-art techniques, challenges, and future directions, this work offers a valuable reference for researchers striving to develop resilient and adaptable object detection frameworks, advancing the seamless deployment of artificial intelligence in dynamic environments.
zh

[CV-63] Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio ICASSP2025

【速读】：该论文试图解决预训练音频-语言模型 (Audio-Language Models, ALMs) 在零样本分类任务中的测试时适应 (Test-Time Adaptation, TTA) 方法中，模型容易陷入错误预测的问题。解决方案的关键在于提出了一种无需标注标签的多重指导提示学习方法。首先，通过在上下文和领域标记上设置一致性指导；其次，在单个测试样本的多个增强视图之间设置一致性指导，并在不同测试样本之间进行对比学习；最后，提出了一种端到端的适应学习框架。实验结果表明，该方法在12个下游任务中平均提升了4.41%（最高7.50%）的零样本分类性能。

链接: https://arxiv.org/abs/2412.17306
作者: Gongyu Chen,Haomin Zhang,Chaofan Ding,Zihao Chen,Xinhan Di
机构: Giant Network(巨人网络); Zhejiang University(浙江大学)
关键词: pre-trained Audio-Language Models, impressive zero-shot generalization, zero-shot generalization capability, TTA, fascinating aspect
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 6 pages, 1 figure, accepted by ICASSP 2025

点击查看摘要

Abstract:One fascinating aspect of pre-trained Audio-Language Models (ALMs) learning is their impressive zero-shot generalization capability and test-time adaptation (TTA) methods aiming to improve domain performance without annotations. However, previous test time adaptation (TTA) methods for ALMs in zero-shot classification tend to be stuck in incorrect model predictions. In order to further boost the performance, we propose multiple guidance on prompt learning without annotated labels. First, guidance of consistency on both context tokens and domain tokens of ALMs is set. Second, guidance of both consistency across multiple augmented views of each single test sample and contrastive learning across different test samples is set. Third, we propose a corresponding end-end learning framework for the proposed test-time adaptation method without annotated labels. We extensively evaluate our approach on 12 downstream tasks across domains, our proposed adaptation method leads to 4.41% (max 7.50%) average zero-shot performance improvement in comparison with the state-of-the-art models.
zh

[CV-64] FedLEC: Effective Federated Learning Algorithm with Spiking Neural Networks Under Label Skews

【速读】：该论文试图解决在联邦学习（Federated Learning, FL）系统中，由于客户端数据通常是非独立同分布（non-IID）且存在标签偏斜（label skews），导致在脉冲神经网络（Spiking Neural Networks, SNNs）上的学习任务面临显著困难的问题。解决方案的关键在于提出了一个名为FedLEC的后处理框架，该框架通过惩罚本地缺失标签的对应本地logits来增强每个本地模型的泛化能力，并利用从全局模型中提取的相关标签分布信息来缓解标签偏差。实验结果表明，FedLEC在多种标签偏斜分布设置下，相较于七种最先进的FL算法，平均准确率提升了约11.59%。

链接: https://arxiv.org/abs/2412.17305
作者: Di Yu,Xin Du,Linshan Jiang,Shunwen Bai,Wentao Tong,Shuiguang Deng
机构: 1. Beijing University of Posts and Telecommunications (北京邮电大学); 2. Tsinghua University (清华大学)
关键词: Spiking Neural Networks, Neural Networks, Spiking Neural, resource-constrained edge devices, implementing Federated Learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the advancement of neuromorphic chips, implementing Federated Learning (FL) with Spiking Neural Networks (SNNs) potentially offers a more energy-efficient schema for collaborative learning across various resource-constrained edge devices. However, one significant challenge in the FL systems is that the data from different clients are often non-independently and identically distributed (non-IID), with label skews presenting substantial difficulties in various federated SNN learning tasks. In this study, we propose a practical post-hoc framework named FedLEC to address the challenge. This framework penalizes the corresponding local logits for locally missing labels to enhance each local model’s generalization ability. Additionally, it leverages the pertinent label distribution information distilled from the global model to mitigate label bias. Extensive experiments with three different structured SNNs across five datasets (i.e., three non-neuromorphic and two neuromorphic datasets) demonstrate the efficiency of FedLEC. Compared to seven state-of-the-art FL algorithms, FedLEC achieves an average accuracy improvement of approximately 11.59% under various label skew distribution settings.
zh

[CV-65] Neural Spatial-Temporal Tensor Representation for Infrared Small Target Detection

【速读】：该论文试图解决传统基于优化的红外小目标检测方法在多帧场景中难以适应动态变化的问题。解决方案的关键在于引入神经表示的空间-时间张量模型 (NeurSTT)，通过非线性网络增强背景近似中的空间-时间特征相关性，从而实现无监督的目标检测。具体来说，NeurSTT 利用神经层在低秩约束的深度框架中近似连续背景，并开发了神经三维全变差来优化背景平滑度，同时减少序列中的静态目标簇。通过在损失函数中引入传统稀疏性约束来保留潜在目标，并采用深度更新策略替代复杂的优化求解器，简化了优化过程。实验结果表明，该方法在参数数量和交并比 (IoU) 方面均优于现有方法。

链接: https://arxiv.org/abs/2412.17302
作者: Fengyi Wu,Simin Liu,Haoan Wang,Bingjie Tao,Junhai Luo,Zhenming Peng
机构: 未知
关键词: Optimization-based approaches dominate, approaches dominate infrared, dominate infrared small, leverage infrared imagery, infrared imagery intrinsic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Optimization-based approaches dominate infrared small target detection as they leverage infrared imagery’s intrinsic low-rankness and sparsity. While effective for single-frame images, they struggle with dynamic changes in multi-frame scenarios as traditional spatial-temporal representations often fail to adapt. To address these challenges, we introduce a Neural-represented Spatial-Temporal Tensor (NeurSTT) model. This framework employs nonlinear networks to enhance spatial-temporal feature correlations in background approximation, thereby supporting target detection in an unsupervised manner. Specifically, we employ neural layers to approximate sequential backgrounds within a low-rank informed deep scheme. A neural three-dimensional total variation is developed to refine background smoothness while reducing static target-like clusters in sequences. Traditional sparsity constraints are incorporated into the loss functions to preserve potential targets. By replacing complex solvers with a deep updating strategy, NeurSTT simplifies the optimization process in a domain-awareness way. Visual and numerical results across various datasets demonstrate that our method outperforms detection challenges. Notably, it has 16.6 \times fewer parameters and averaged 19.19% higher in IoU compared to the suboptimal method on 256 \times 256 sequences.
zh

[CV-66] Revisiting Multimodal Fusion for 3D Anomaly Detection from an Architectural Perspective

【速读】：该论文试图解决3D异常检测（3D-AD）中多模态融合架构设计对性能的影响问题。现有研究主要集中在多模态融合策略的改进上，而忽视了多模态融合架构（topology）设计的作用。论文通过系统研究多模态融合架构设计对3D-AD的影响，提出了一个理论和实验相结合的分析框架，并在此基础上扩展了当前最先进的神经架构搜索（NAS）范式，提出了3D-ADNAS。该方法能够在多模态融合策略和特定模态模块之间进行联合搜索，从而在不同模型容量下实现准确性、帧率和内存使用方面的显著提升，并展现出在处理少样本3D-AD任务中的巨大潜力。解决方案的关键在于通过NAS技术优化多模态融合架构设计，以提升3D-AD的整体性能。

链接: https://arxiv.org/abs/2412.17297
作者: Kaifang Long,Guoyang Xie,Lianbo Ma,Jiaqi Liu,Zhichao Lu
机构: 1. School of Computer Science and Technology, Tianjin University, Tianjin, China(天津大学计算机科学与技术学院，天津，中国);
2. School of Computer Science, Fudan University, Shanghai, China(复旦大学计算机科学学院，上海，中国);
3. Department of Computer Science, Tsinghua University, Beijing, China(清华大学计算机科学系，北京，中国)
关键词: Existing efforts, multimodal fusion architecture, multimodal fusion, boost multimodal fusion, effective multimodal fusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing efforts to boost multimodal fusion of 3D anomaly detection (3D-AD) primarily concentrate on devising more effective multimodal fusion strategies. However, little attention was devoted to analyzing the role of multimodal fusion architecture (topology) design in contributing to 3D-AD. In this paper, we aim to bridge this gap and present a systematic study on the impact of multimodal fusion architecture design on 3D-AD. This work considers the multimodal fusion architecture design at the intra-module fusion level, i.e., independent modality-specific modules, involving early, middle or late multimodal features with specific fusion operations, and also at the inter-module fusion level, i.e., the strategies to fuse those modules. In both cases, we first derive insights through theoretically and experimentally exploring how architectural designs influence 3D-AD. Then, we extend SOTA neural architecture search (NAS) paradigm and propose 3D-ADNAS to simultaneously search across multimodal fusion strategies and modality-specific modules for the first this http URL experiments show that 3D-ADNAS obtains consistent improvements in 3D-AD across various model capacities in terms of accuracy, frame rate, and memory usage, and it exhibits great potential in dealing with few-shot 3D-AD tasks.
zh

[CV-67] AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues

【速读】：该论文试图解决在人机对话中如何更好地利用非语言信息（non-linguistic information）来增强情感和上下文感知的交互问题。解决方案的关键在于提出了一个名为AV-EmoDialog的对话系统，该系统通过整合用户的视听输入（audio-visual inputs），包括语音内容、情感语调（emotional tones）和面部表情（facial expressions），以端到端的方式生成更具情感响应性和共情能力的对话。通过系统性地提取和分析这些非语言线索，AV-EmoDialog能够生成不仅情感上合适，而且在上下文上也合适的响应，从而在多模态大型语言模型（multimodal LLMs）中表现出色。

链接: https://arxiv.org/abs/2412.17292
作者: Se Jin Park,Yeonju Kim,Hyeongseop Rha,Bella Godiva,Yong Man Ro
机构: KAIST(韩国科学技术院); School of Electrical Engineering(电气工程学院)
关键词: non-verbal cues play, conveying emotions, play a crucial, crucial role, role in conveying
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In human communication, both verbal and non-verbal cues play a crucial role in conveying emotions, intentions, and meaning beyond words alone. These non-linguistic information, such as facial expressions, eye contact, voice tone, and pitch, are fundamental elements of effective interactions, enriching conversations by adding emotional and contextual depth. Recognizing the importance of non-linguistic content in communication, we present AV-EmoDialog, a dialogue system designed to exploit verbal and non-verbal information from users’ audio-visual inputs to generate more responsive and empathetic interactions. AV-EmoDialog systematically exploits the emotional cues in audio-visual dialogues; extracting speech content and emotional tones from speech, analyzing fine-grained facial expressions from visuals, and integrating these cues to generate emotionally aware responses in an end-to-end manner. Through extensive experiments, we validate that the proposed AV-EmoDialog outperforms existing multimodal LLMs in generating not only emotionally appropriate but also contextually appropriate responses.
zh

[CV-68] Free-viewpoint Human Animation with Pose-correlated Reference Selection

【速读】：该论文试图解决基于扩散模型的人体动画在视角显著变化（尤其是缩放场景）中生成高质量姿态的挑战。解决方案的关键在于提出了一种姿态相关参考选择扩散网络（pose-correlated reference selection diffusion network），通过利用多张参考图像作为输入，以应对视角变化导致的外观细节缺失问题。具体实现上，论文引入了姿态相关模块（pose correlation module）来计算目标与源姿态之间的相似性，并提出了自适应参考选择策略（adaptive reference selection strategy），利用注意力图（attention map）识别关键区域以进行动画生成。此外，通过从公开的TED演讲中构建的大规模数据集进行训练，模型能够学习不同视角下的合成能力。实验结果表明，在相同数量的参考图像下，该模型在大视角变化下优于当前最先进的方法。

链接: https://arxiv.org/abs/2412.17290
作者: Fa-Ting Hong,Zhan Xu,Haiyang Liu,Qinjie Lin,Luchuan Song,Zhixin Shu,Yang Zhou,Duygu Ceylan,Dan Xu
机构: HKUST(香港科技大学); Adobe Research(Adobe研究); Northwestern University, USA(美国西北大学)
关键词: Diffusion-based human animation, Diffusion-based human, aims to animate, driving signals, human animation aims
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion model, existing approaches are able to generate high-fidelity poses, but struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where camera-character distance varies. This limits the applications such as cinematic shot type plan or camera control. We propose a pose-correlated reference selection diffusion network, supporting substantial viewpoint variations in human animation. Our key idea is to enable the network to utilize multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To eliminate the computational cost, we first introduce a novel pose correlation module to compute similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy, utilizing the attention map to identify key regions for animation generation. To train our model, we curated a large dataset from public TED talks featuring varied shots of the same character, helping the model learn synthesis for different perspectives. Our experimental results show that with the same number of reference images, our model performs favorably compared to the current SOTA methods under large viewpoint change. We further show that the adaptive reference selection is able to choose the most relevant reference regions to generate humans under free viewpoints.
zh

[CV-69] owards Unsupervised Model Selection for Domain Adaptive Object Detection NEURIPS2024

【速读】：该论文试图解决在目标域无标签数据的情况下，如何为域自适应目标检测（DAOD）任务选择最优模型的难题。解决方案的关键在于提出了一种无监督的模型选择方法，基于平坦最小值（flat minima）原理，通过设计检测适应性评分（DAS）来近似衡量模型的平坦性，而不依赖目标域的标签。具体来说，论文提出了平坦性指数评分（FIS）来评估模型参数扰动前后的分类和定位波动，以及原型距离比（PDR）来衡量模型的可迁移性和判别性，从而有效评估模型在目标域上的泛化能力。实验结果表明，该方法与DAOD模型的性能高度相关，可作为训练后模型选择的有效工具。

链接: https://arxiv.org/abs/2412.17284
作者: Hengfu Yu,Jinhong Deng,Wen Li,Lixin Duan
机构: University of Electronic Science and Technology of China(电子科技大学)
关键词: drawn increasing attention, Evaluating the performance, flat minima, model, recent years
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 5 figures, Accepted to NeurIPS 2024

点击查看摘要

Abstract:Evaluating the performance of deep models in new scenarios has drawn increasing attention in recent years. However, while it is possible to collect data from new scenarios, the annotations are not always available. Existing DAOD methods often rely on validation or test sets on the target domain for model selection, which is impractical in real-world applications. In this paper, we propose a novel unsupervised model selection approach for domain adaptive object detection, which is able to select almost the optimal model for the target domain without using any target labels. Our approach is based on the flat minima principle, i,e., models located in the flat minima region in the parameter space usually exhibit excellent generalization ability. However, traditional methods require labeled data to evaluate how well a model is located in the flat minima region, which is unrealistic for the DAOD task. Therefore, we design a Detection Adaptation Score (DAS) approach to approximately measure the flat minima without using target labels. We show via a generalization bound that the flatness can be deemed as model variance, while the minima depend on the domain distribution distance for the DAOD task. Accordingly, we propose a Flatness Index Score (FIS) to assess the flatness by measuring the classification and localization fluctuation before and after perturbations of model parameters and a Prototypical Distance Ratio (PDR) score to seek the minima by measuring the transferability and discriminability of the models. In this way, the proposed DAS approach can effectively evaluate the model generalization ability on the target domain. We have conducted extensive experiments on various DAOD benchmarks and approaches, and the experimental results show that the proposed DAS correlates well with the performance of DAOD models and can be used as an effective tool for model selection after training.
zh

[CV-70] VarAD: Lightweight High-Resolution Image Anomaly Detection via Visual Autoregressive Modeling

【速读】：该论文试图解决高分辨率图像异常检测 (High-Resolution Image Anomaly Detection, HRIAD) 问题，相较于传统低分辨率图像异常检测，HRIAD 具有更高的计算负担和更强的全局信息捕捉需求。解决方案的关键在于将图像异常检测转化为视觉标记预测问题，并提出了基于视觉自回归建模的 VarAD 方法。VarAD 首先提取多层次和多方向的视觉标记序列，然后使用先进的 Mamba 模型进行视觉自回归建模和标记预测。在预测过程中，VarAD 有效利用所有先前标记的信息来预测目标标记，并通过预测标记与原始标记之间的差异来评估异常。实验结果表明，VarAD 在保持轻量级的同时，显著提升了高分辨率图像异常检测的性能。

链接: https://arxiv.org/abs/2412.17263
作者: Yunkang Cao,Haiming Yao,Wei Luo,Weiming Shen
机构: State Key Laboratory of Intelligent Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; State Key Laboratory of Precision Measurement Technology and Instruments, Department of Precision Instrument, Tsinghua University, Beijing 100084, China
关键词: Image Anomaly Detection, Image Anomaly, Anomaly Detection, High-Resolution Image Anomaly, conventional image anomaly
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TII

点击查看摘要

Abstract:This paper addresses a practical task: High-Resolution Image Anomaly Detection (HRIAD). In comparison to conventional image anomaly detection for low-resolution images, HRIAD imposes a heavier computational burden and necessitates superior global information capture capacity. To tackle HRIAD, this paper translates image anomaly detection into visual token prediction and proposes VarAD based on visual autoregressive modeling for token prediction. Specifically, VarAD first extracts multi-hierarchy and multi-directional visual token sequences, and then employs an advanced model, Mamba, for visual autoregressive modeling and token prediction. During the prediction process, VarAD effectively exploits information from all preceding tokens to predict the target token. Finally, the discrepancies between predicted tokens and original tokens are utilized to score anomalies. Comprehensive experiments on four publicly available datasets and a real-world button inspection dataset demonstrate that the proposed VarAD achieves superior high-resolution image anomaly detection performance while maintaining lightweight, rendering VarAD a viable solution for HRIAD. Code is available at \hrefthis https URL\urlthis https URL.
zh

[CV-71] An Intrinsically Explainable Approach to Detecting Vertebral Compression Fractures in CT Scans via Neurosymbolic Modeling

【速读】：该论文试图解决椎体压缩性骨折（Vertebral Compression Fractures, VCFs）在骨质疏松症患者中常被漏诊的问题，并提出了一种结合深度学习（Deep Learning, DL）和基于形状的算法（Shape-Based Algorithm, SBA）的神经符号方法来实现高效的VCF检测。解决方案的关键在于利用DL进行椎体分割，并通过SBA分析椎体高度分布，从而定义一套基于高度分布的规则集来检测VCFs。这种方法不仅在VerSe19数据集上实现了96%的准确率和91%的敏感性，与黑箱模型DenseNet表现相当，还提供了内在的可解释性，增强了临床医生对AI推荐结果的信任，支持更明智的诊断和治疗决策。

链接: https://arxiv.org/abs/2412.17258
作者: Blanca Inigo,Yiqing Shen,Benjamin D. Killeen,Michelle Song,Axel Krieger,Christopher Bradley,Mathias Unberath
机构: Johns Hopkins University(约翰霍普金斯大学)
关键词: Vertebral compression fractures, compression fractures, consequence of osteoporosis, common and potentially, potentially serious consequence
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Vertebral compression fractures (VCFs) are a common and potentially serious consequence of osteoporosis. Yet, they often remain undiagnosed. Opportunistic screening, which involves automated analysis of medical imaging data acquired primarily for other purposes, is a cost-effective method to identify undiagnosed VCFs. In high-stakes scenarios like opportunistic medical diagnosis, model interpretability is a key factor for the adoption of AI recommendations. Rule-based methods are inherently explainable and closely align with clinical guidelines, but they are not immediately applicable to high-dimensional data such as CT scans. To address this gap, we introduce a neurosymbolic approach for VCF detection in CT volumes. The proposed model combines deep learning (DL) for vertebral segmentation with a shape-based algorithm (SBA) that analyzes vertebral height distributions in salient anatomical regions. This allows for the definition of a rule set over the height distributions to detect VCFs. Evaluation of VerSe19 dataset shows that our method achieves an accuracy of 96% and a sensitivity of 91% in VCF detection. In comparison, a black box model, DenseNet, achieved an accuracy of 95% and sensitivity of 91% in the same dataset. Our results demonstrate that our intrinsically explainable approach can match or surpass the performance of black box deep neural networks while providing additional insights into why a prediction was made. This transparency can enhance clinician’s trust thus, supporting more informed decision-making in VCF diagnosis and treatment planning.
zh

[CV-72] Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis Prompt Alignment and Theory

【速读】：该论文试图解决长视频生成中场景过渡不流畅和一致性不足的问题。解决方案的关键在于提出了基于时间-频率的时序注意力重加权算法 (Time-frequency based temporal Attention Reweighting Algorithm, TiARA)，通过离散短时傅里叶变换 (Discrete Short-Time Fourier Transform) 精细编辑注意力得分矩阵，从而提升视频的连贯性。此外，针对多提示词生成视频的情况，论文还提出了PromptBlend，一种先进的提示词插值流程，以改善提示词插值质量。这些方法在实验中表现出对基线方法的显著改进。

链接: https://arxiv.org/abs/2412.17254
作者: Xingyao Li,Fengzhuo Zhang,Jiachun Pan,Yunlong Hou,Vincent Y. F. Tan,Zhuoran Yang
机构: National University of Singapor
关键词: considerable progress achieved, video generation problem, long video generation, generation problem, transitions between scenes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 34 pages, 11 figures

点击查看摘要

Abstract:Despite the considerable progress achieved in the long video generation problem, there is still significant room to improve the consistency of the videos, particularly in terms of smoothness and transitions between scenes. We address these issues to enhance the consistency and coherence of videos generated with either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which meticulously edits the attention score matrix based on the Discrete Short-Time Fourier Transform. Our method is supported by a theoretical guarantee, the first-of-its-kind for frequency-based methods in diffusion models. For videos generated by multiple prompts, we further investigate key factors affecting prompt interpolation quality and propose PromptBlend, an advanced prompt interpolation pipeline. The efficacy of our proposed method is validated via extensive experimental results, exhibiting consistent and impressive improvements over baseline methods. The code will be released upon acceptance.
zh

[CV-73] GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning ICASSP2025

【速读】：该论文试图解决在有限标注数据的情况下，从视网膜图像生成准确医学报告的挑战。解决方案的关键在于提出了一种新的视觉-语言模型，通过引导上下文自注意力机制（guided context self-attention mechanism）来结合视觉和文本特征。这种方法能够在数据稀缺的情况下捕捉图像的细节和全局临床背景，从而生成更全面的医学描述。实验结果表明，该模型在DeepEyeNet数据集上实现了0.023的BLEU@4提升，并显著改进了生成报告的质量。

链接: https://arxiv.org/abs/2412.17251
作者: Teja Krishna Cherukuri,Nagur Shareef Shaik,Jyostna Devi Bodapati,Dong Hye Ye
机构: Georgia State University(佐治亚州立大学); Vignan’s Foundation for Science, Technology & Research University(Vignan科学、技术与研究大学)
关键词: treating eye diseases, remains challenging due, limited labeled data, images remains challenging, accurate medical reports
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: This paper has been accepted for presentation at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:Retinal image analysis is crucial for diagnosing and treating eye diseases, yet generating accurate medical reports from images remains challenging due to variability in image quality and pathology, especially with limited labeled data. Previous Transformer-based models struggled to integrate visual and textual information under limited supervision. In response, we propose a novel vision-language model for retinal image captioning that combines visual and textual features through a guided context self-attention mechanism. This approach captures both intricate details and the global clinical context, even in data-scarce scenarios. Extensive experiments on the DeepEyeNet dataset demonstrate a 0.023 BLEU@4 improvement, along with significant qualitative advancements, highlighting the effectiveness of our model in generating comprehensive medical captions.
zh

[CV-74] STeInFormer: Spatial-Temporal Interaction Transformer Architecture for Remote Sensing Change Detection

【速读】：该论文试图解决现有遥感变化检测 (RSCD) 方法在时空维度上缺乏交互性，导致特征提取质量不足的问题。解决方案的关键在于提出了STeInFormer，这是一种专门为RSCD设计的时空交互Transformer架构，用于多时相特征提取。此外，论文还引入了一种无参数的多频段token混合器，用于整合频域特征，从而为RSCD提供光谱信息。通过在三个数据集上的实验验证，该方法在效率和准确性之间取得了最佳平衡，并超越了现有的最先进方法。

链接: https://arxiv.org/abs/2412.17247
作者: Xiaowen Ma,Zhenkai Wu,Mengting Ma,Mengjiao Zhao,Fan Yang,Zhenhong Du,Wei Zhang
机构: Zhejiang University(浙江大学); Innovation Center of Yangtze River Delta, Zhejiang University(长江三角洲创新中心，浙江大学)
关键词: Convolutional neural networks, outstanding discriminative ability, greatly benefited remote, benefited remote sensing, remote sensing change
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: JSTARS 2025

点击查看摘要

Abstract:Convolutional neural networks and attention mechanisms have greatly benefited remote sensing change detection (RSCD) because of their outstanding discriminative ability. Existent RSCD methods often follow a paradigm of using a non-interactive Siamese neural network for multi-temporal feature extraction and change detection heads for feature fusion and change representation. However, this paradigm lacks the contemplation of the characteristics of RSCD in temporal and spatial dimensions, and causes the drawback on spatial-temporal interaction that hinders high-quality feature extraction. To address this problem, we present STeInFormer, a spatial-temporal interaction Transformer architecture for multi-temporal feature extraction, which is the first general backbone network specifically designed for RSCD. In addition, we propose a parameter-free multi-frequency token mixer to integrate frequency-domain features that provide spectral information for RSCD. Experimental results on three datasets validate the effectiveness of the proposed method, which can outperform the state-of-the-art methods and achieve the most satisfactory efficiency-accuracy trade-off. Code is available at this https URL.
zh

[CV-75] QTSeg: A Query Token-Based Architecture for Efficient 2D Medical Image Segmentation

【速读】：该论文试图解决医学图像分割中现有方法在处理长程依赖（long-range dependencies）时性能不足的问题，同时面临计算资源消耗过高的挑战。解决方案的关键在于提出了一种新型架构QTSeg，该架构结合了卷积神经网络（CNN）和Transformer的优势，通过特征金字塔网络（FPN）作为图像编码器提取多尺度特征，并引入多级特征融合（MLFF）模块在编码器和解码器之间进行自适应特征融合，最后使用多查询掩码解码器（MQM Decoder）在解码阶段整合查询令牌与金字塔特征，从而在保持低计算资源需求的同时，显著提升了分割精度。

链接: https://arxiv.org/abs/2412.17241
作者: Phuong-Nam Tran,Nhat Truong Pham,Duc Ngoc Minh Dang,Eui-Nam Huh,Choong Seon Hong
机构: Department of Artificial Intelligence Kyung Hee University(人工智能系庆熙大学); Department of Integrative Biotechnology Sungkyunkwan University(综合生物技术系成均馆大学); Department of Computing Fundamental, FPT University(计算基础系 FPT大学); Department of Computer Science and Engineering Kyung Hee University(计算机科学与工程系庆熙大学)
关键词: accurate automatic diagnosis, enabling accurate automatic, assisting medical doctors, automatic diagnosis, doctors in making
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical image segmentation is crucial in assisting medical doctors in making diagnoses and enabling accurate automatic diagnosis. While advanced convolutional neural networks (CNNs) excel in segmenting regions of interest with pixel-level precision, they often struggle with long-range dependencies, which is crucial for enhancing model performance. Conversely, transformer architectures leverage attention mechanisms to excel in handling long-range dependencies. However, the computational complexity of transformers grows quadratically, posing resource-intensive challenges, especially with high-resolution medical images. Recent research aims to combine CNN and transformer architectures to mitigate their drawbacks and enhance performance while keeping resource demands low. Nevertheless, existing approaches have not fully leveraged the strengths of both architectures to achieve high accuracy with low computational requirements. To address this gap, we propose a novel architecture for 2D medical image segmentation (QTSeg) that leverages a feature pyramid network (FPN) as the image encoder, a multi-level feature fusion (MLFF) as the adaptive module between encoder and decoder and a multi-query mask decoder (MQM Decoder) as the mask decoder. In the first step, an FPN model extracts pyramid features from the input image. Next, MLFF is incorporated between the encoder and decoder to adapt features from different encoder stages to the decoder. Finally, an MQM Decoder is employed to improve mask generation by integrating query tokens with pyramid features at all stages of the mask decoder. Our experimental results show that QTSeg outperforms state-of-the-art methods across all metrics with lower computational demands than the baseline and the existing methods. Code is available at this https URL (v0.1.0)
zh

[CV-76] Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification

【速读】：该论文试图解决基于图像的行人重识别（Person Re-identification, ReID）问题，旨在通过非重叠摄像头检索特定行人。解决方案的关键在于提出了一种名为FusionReID的新型融合框架，该框架结合了卷积神经网络（Convolutional Neural Networks, CNNs）和Transformer的各自优势，以学习更全面的行人表示。具体来说，论文设计了双分支特征提取（Dual-branch Feature Extraction, DFE）模块，分别通过CNNs和Transformers从单张图像中提取特征，并通过双注意力互融（Dual-attention Mutual Fusion, DMF）模块实现充分的特征融合。DMF模块包括局部细化单元（Local Refinement Units, LRU）和异构传输模块（Heterogenous Transmission Modules, HTM），其中LRU通过深度可分离卷积对特征进行通道维度和空间尺寸的对齐，而HTM则通过共享编码单元（Shared Encoding Unit, SEU）和互融单元（Mutual Fusion Units, MFU）的连续堆叠，进一步生成更具判别力的特征。实验结果表明，该方法在多个公开的ReID基准数据集上表现优于大多数现有最先进方法。

链接: https://arxiv.org/abs/2412.17239
作者: Yuhao Wang,Pingping Zhang,Xuehu Liu,Zhengzheng Tu,Huchuan Lu
机构: School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, Dalian, 116024, China; School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China; School of Computer Science and Technology, Anhui University, Hefei, 230039, China; School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116024, China
关键词: Convolutional Neural Networks, intelligent transportation systems, Person Re-identification, aims to retrieve, non-overlapping cameras
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted by Trans. on ITS

点击查看摘要

Abstract:Person Re-identification (ReID) aims to retrieve the specific person across non-overlapping cameras, which greatly helps intelligent transportation systems. As we all know, Convolutional Neural Networks (CNNs) and Transformers have the unique strengths to extract local and global features, respectively. Considering this fact, we focus on the mutual fusion between them to learn more comprehensive representations for persons. In particular, we utilize the complementary integration of deep features from different model structures. We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID. More specifically, we first deploy a Dual-branch Feature Extraction (DFE) to extract features through CNNs and Transformers from a single image. Moreover, we design a novel Dual-attention Mutual Fusion (DMF) to achieve sufficient feature fusions. The DMF comprises Local Refinement Units (LRU) and Heterogenous Transmission Modules (HTM). LRU utilizes depth-separable convolutions to align deep features in channel dimensions and spatial sizes. HTM consists of a Shared Encoding Unit (SEU) and two Mutual Fusion Units (MFU). Through the continuous stacking of HTM, deep features after LRU are repeatedly utilized to generate more discriminative features. Extensive experiments on three public ReID benchmarks demonstrate that our method can attain superior performances than most state-of-the-arts. The source code is available at this https URL.
zh

[CV-77] Modality-Aware Shot Relating and Comparing for Video Scene Detection

【速读】：该论文试图解决视频场景检测中多模态语义处理不均衡的问题，即现有方法未能充分考虑视觉实体和地点模态之间的上下文差异，导致检测性能不佳。解决方案的关键在于提出了模态感知镜头关联与比较方法（Modality-Aware Shot Relating and Comparing, MASRC），通过分别挖掘实体语义的长时相关性和地点语义的短时相关性，来学习具有场景内一致性和场景间区分性的镜头特征。此外，通过相似度卷积编码前后镜头的关系，进一步辅助识别场景结束的镜头。实验结果表明，MASRC在视频场景检测中显著提升了性能。

链接: https://arxiv.org/abs/2412.17238
作者: Jiawei Tan,Hongxing Wang,Kang Dang,Jiaxin Li,Zhilong Ou
机构: 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院);
2. Key Laboratory of Network Security and Privacy Protection, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学网络安全与隐私保护重点实验室);
3. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China(电子科技大学计算机科学与工程学院)
关键词: detection involves assessing, shot, involves assessing, surroundings belong, visual entity
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Video scene detection involves assessing whether each shot and its surroundings belong to the same scene. Achieving this requires meticulously correlating multi-modal cues, \ite.g. visual entity and place modalities, among shots and comparing semantic changes around each shot. However, most methods treat multi-modal semantics equally and do not examine contextual differences between the two sides of a shot, leading to sub-optimal detection performance. In this paper, we propose the \bfM odality- \bfA ware \bfS hot \bfR elating and \bfC omparing approach (MASRC), which enables relating shots per their own characteristics of visual entity and place modalities, as well as comparing multi-shots similarities to have scene changes explicitly encoded. Specifically, to fully harness the potential of visual entity and place modalities in modeling shot relations, we mine long-term shot correlations from entity semantics while simultaneously revealing short-term shot correlations from place semantics. In this way, we can learn distinctive shot features that consolidate coherence within scenes and amplify distinguishability across scenes. Once equipped with distinctive shot features, we further encode the relations between preceding and succeeding shots of each target shot by similarity convolution, aiding in the identification of scene ending shots. We validate the broad applicability of the proposed components in MASRC. Extensive experimental results on public benchmark datasets demonstrate that the proposed MASRC significantly advances video scene detection.
zh

[CV-78] OLiDM: Object-aware LiDAR Diffusion Models for Autonomous Driving AAAI2025

【速读】：该论文旨在解决复杂场景下自动驾驶安全问题，特别是现有方法在生成高质量、多样化和可控的前景物体（foreground objects）方面的不足。解决方案的关键在于提出了一个名为OLiDM的新框架，该框架包含两个核心模块：对象-场景渐进生成模块（Object-Scene Progressive Generation, OPG）和对象语义对齐模块（Object Semantic Alignment, OSA）。OPG模块根据用户特定的提示生成所需的前景物体，并将其作为条件用于场景生成，从而在对象和场景级别上实现可控输出，并支持用户定义的对象级标注与生成的LiDAR场景关联。OSA模块则用于纠正前景物体与背景场景之间的对齐误差，提升生成对象的整体质量。通过这些创新，OLiDM在LiDAR生成任务和3D感知任务中表现出色，显著优于现有方法。

链接: https://arxiv.org/abs/2412.17226
作者: Tianyi Yan,Junbo Yin,Xianpeng Lang,Ruigang Yang,Cheng-Zhong Xu,Jianbing Shen
机构: Tianyi Yan1,2\equalcontrib; Junbo Yin3\equalcontrib; Xianpeng Lang2; Ruigang Yang4; Cheng-Zhong Xu1; Jianbing Shen1
关键词: autonomous driving safety, point cloud data, simulate LiDAR point, LiDAR point cloud, enhance autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: AAAI 2025, this https URL

点击查看摘要

Abstract:To enhance autonomous driving safety in complex scenarios, various methods have been proposed to simulate LiDAR point cloud data. Nevertheless, these methods often face challenges in producing high-quality, diverse, and controllable foreground objects. To address the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating high-fidelity LiDAR data at both the object and the scene levels. OLiDM consists of two pivotal components: the Object-Scene Progressive Generation (OPG) module and the Object Semantic Alignment (OSA) module. OPG adapts to user-specific prompts to generate desired foreground objects, which are subsequently employed as conditions in scene generation, ensuring controllable outputs at both the object and scene levels. This also facilitates the association of user-defined object-level annotations with the generated LiDAR scenes. Moreover, OSA aims to rectify the misalignment between foreground objects and background scenes, enhancing the overall quality of the generated objects. The broad effectiveness of OLiDM is demonstrated across various LiDAR generation tasks, as well as in 3D perception tasks. Specifically, on the KITTI-360 dataset, OLiDM surpasses prior state-of-the-art methods such as UltraLiDAR by 17.5 in FPD. Additionally, in sparse-to-dense LiDAR completion, OLiDM achieves a significant improvement over LiDARGen, with a 57.47% increase in semantic IoU. Moreover, OLiDM enhances the performance of mainstream 3D detectors by 2.4% in mAP and 1.9% in NDS, underscoring its potential in advancing object-aware 3D tasks. Code is available at: this https URL.
zh

[CV-79] CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder

【速读】：该论文试图解决基于扩散的视觉文本生成模型在渲染复杂视觉文本时遇到的字符和笔画不准确的问题。解决方案的关键在于提出了一种名为 CharGen 的高精度字符级视觉文本生成与编辑模型。CharGen 通过采用字符级多模态编码器（character-level multimodal encoder），不仅提取字符级文本嵌入，还逐字符编码字形图像，从而更有效地捕捉细粒度的跨模态特征。此外，CharGen 引入了一种新的感知损失（perceptual loss），以增强字符形状的监督，解决生成文本中笔画不准确的问题。该模型还可以集成到现有的扩散模型中，显著提高文本渲染的准确性，在 AnyText-benchmark 和 MARIO-Eval 等公开基准测试中表现优异，分别提升了 8% 和 6% 的准确率，尤其在中文测试集上提高了 5.5%。

链接: https://arxiv.org/abs/2412.17225
作者: Lichen Ma,Tiezhu Yue,Pei Fu,Yujie Zhong,Kai Zhou,Xiaoming Wei,Jie Hu
机构: Meituan(美团)
关键词: visual text, visual text generation, diffusion-based visual text, significant advancements, text
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, significant advancements have been made in diffusion-based visual text generation models. Although the effectiveness of these methods in visual text rendering is rapidly improving, they still encounter challenges such as inaccurate characters and strokes when rendering complex visual text. In this paper, we propose CharGen, a highly accurate character-level visual text generation and editing model. Specifically, CharGen employs a character-level multimodal encoder that not only extracts character-level text embeddings but also encodes glyph images character by character. This enables it to capture fine-grained cross-modality features more effectively. Additionally, we introduce a new perceptual loss in CharGen to enhance character shape supervision and address the issue of inaccurate strokes in generated text. It is worth mentioning that CharGen can be integrated into existing diffusion models to generate visual text with high accuracy. CharGen significantly improves text rendering accuracy, outperforming recent methods in public benchmarks such as AnyText-benchmark and MARIO-Eval, with improvements of more than 8% and 6%, respectively. Notably, CharGen achieved a 5.5% increase in accuracy on Chinese test sets.
zh

[CV-80] Discriminative Image Generation with Diffusion Models for Zero-Shot Learning

【速读】：该论文试图解决生成式零样本学习 (Generative Zero-Shot Learning, ZSL) 方法在特征生成过程中缺乏可解释性以及依赖人工标注的语义原型 (semantic prototypes) 导致扩展性受限的问题。解决方案的关键在于提出了一种新的判别图像生成框架 (Discriminative Image Generation framework for Zero-Shot Learning, DIG-ZSL)，通过学习每个未见类别的判别类标记 (Discriminative Class Token, DCT)，并利用预训练的类别判别模型 (Category Discrimination Model, CDM) 指导生成多样且高质量的图像，从而为零样本学习任务提供信息丰富的未见样本。这种方法不仅生成了更具判别性的图像，还显著超越了基于非人工标注语义原型的现有方法，并在性能上与使用人工标注语义原型的基线方法相当或更优。

链接: https://arxiv.org/abs/2412.17219
作者: Dingjie Fu,Wenjin Hou,Shiming Chen,Shuhuang Chen,Xinge You,Salman Khan,Fahad Shahbaz Khan
机构: Huazhong University of Science and Technology; ReLER Lab, Zhejiang University, China; Mohamed bin Zayed University of AI; Australian National University; Linköping University
关键词: Generative Zero-Shot Learning, synthesize class-related features, class-related features based, showcasing superior performance, Generative Zero-Shot
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report, 16 pages

点击查看摘要

Abstract:Generative Zero-Shot Learning (ZSL) methods synthesize class-related features based on predefined class semantic prototypes, showcasing superior performance. However, this feature generation paradigm falls short of providing interpretable insights. In addition, existing approaches rely on semantic prototypes annotated by human experts, which exhibit a significant limitation in their scalability to generalized scenes. To overcome these deficiencies, a natural solution is to generate images for unseen classes using text prompts. To this end, We present DIG-ZSL, a novel Discriminative Image Generation framework for Zero-Shot Learning. Specifically, to ensure the generation of discriminative images for training an effective ZSL classifier, we learn a discriminative class token (DCT) for each unseen class under the guidance of a pre-trained category discrimination model (CDM). Harnessing DCTs, we can generate diverse and high-quality images, which serve as informative unseen samples for ZSL tasks. In this paper, the extensive experiments and visualizations on four datasets show that our DIG-ZSL: (1) generates diverse and high-quality images, (2) outperforms previous state-of-the-art nonhuman-annotated semantic prototype-based methods by a large margin, and (3) achieves comparable or better performance than baselines that leverage human-annotated semantic prototypes. The codes will be made available upon acceptance of the paper.
zh

[CV-81] Dual Conditioned Motion Diffusion for Pose-Based Video Anomaly Detection

【速读】：该论文试图解决基于姿态的视频异常检测 (Video Anomaly Detection, VAD) 问题，并提出了一种名为双条件运动扩散 (Dual Conditioned Motion Diffusion, DCMD) 的新框架。解决方案的关键在于综合利用重建和预测两种方法的优势：通过结合条件运动和条件嵌入，全面利用姿态特征和潜在语义信息；在反向扩散过程中，提出运动变换器 (motion transformer) 以捕捉人体运动频谱空间中的多层次特征相关性；并通过联合关联差异 (United Association Discrepancy, UAD) 正则化，增强正常与异常实例的区分能力，该正则化主要依赖于基于高斯核的时间关联和基于自注意力的全局关联。此外，在推理阶段的反向扩散过程中引入掩码完成策略，以提高条件运动在异常检测预测分支中的利用率。

链接: https://arxiv.org/abs/2412.17210
作者: Andi Xu,Hongsong Wang,Pinle Ding,Jie Gui
机构: 未知
关键词: computer vision research, Video Anomaly Detection, vision research, Anomaly Detection, Existing VAD methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is on this https URL

点击查看摘要

Abstract:Video Anomaly Detection (VAD) is essential for computer vision research. Existing VAD methods utilize either reconstruction-based or prediction-based frameworks. The former excels at detecting irregular patterns or structures, whereas the latter is capable of spotting abnormal deviations or trends. We address pose-based video anomaly detection and introduce a novel framework called Dual Conditioned Motion Diffusion (DCMD), which enjoys the advantages of both approaches. The DCMD integrates conditioned motion and conditioned embedding to comprehensively utilize the pose characteristics and latent semantics of observed movements, respectively. In the reverse diffusion process, a motion transformer is proposed to capture potential correlations from multi-layered characteristics within the spectrum space of human motion. To enhance the discriminability between normal and abnormal instances, we design a novel United Association Discrepancy (UAD) regularization that primarily relies on a Gaussian kernel-based time association and a self-attention-based global association. Finally, a mask completion strategy is introduced during the inference stage of the reverse diffusion process to enhance the utilization of conditioned motion for the prediction branch of anomaly detection. Extensive experiments on four datasets demonstrate that our method dramatically outperforms state-of-the-art methods and exhibits superior generalization performance.
zh

[CV-82] Where Did Your Model Learn That? Label-free Influence for Self-supervised Learning

【速读】：该论文试图解决自监督学习 (Self-supervised Learning, SSL) 中预训练数据与学习到的表示之间内在关系的理解问题。传统监督学习依赖于基于梯度的数据归属工具（如影响函数）来衡量单个数据点对模型预测的贡献，但这些方法依赖标签，不适用于无标签的 SSL 场景。论文提出的解决方案是引入 Influence-SSL，一种无需标签的新型影响函数定义方法，利用学习表示对数据增强的稳定性来识别对模型预测有解释作用的训练样本。该方法不仅提供了理论基础，还通过实验验证了其在重复检测、异常识别和公平性分析等应用中的有效性。

链接: https://arxiv.org/abs/2412.17170
作者: Nidhin Harilal,Amit Kiran Rege,Reza Akbarian Bafghi,Maziar Raissi,Claire Monteleoni
机构: University of Colorado Boulder(科罗拉多大学博尔德分校); University of California, Riverside(加州大学河滨分校); INRIA Paris(巴黎INRIA)
关键词: large-scale unlabeled datasets, remains poorly understood, Self-supervised learning, representations remains poorly, unlabeled datasets
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has revolutionized learning from large-scale unlabeled datasets, yet the intrinsic relationship between pretraining data and the learned representations remains poorly understood. Traditional supervised learning benefits from gradient-based data attribution tools like influence functions that measure the contribution of an individual data point to model predictions. However, existing definitions of influence rely on labels, making them unsuitable for SSL settings. We address this gap by introducing Influence-SSL, a novel and label-free approach for defining influence functions tailored to SSL. Our method harnesses the stability of learned representations against data augmentations to identify training examples that help explain model predictions. We provide both theoretical foundations and empirical evidence to show the utility of Influence-SSL in analyzing pre-trained SSL models. Our analysis reveals notable differences in how SSL models respond to influential data compared to supervised models. Finally, we validate the effectiveness of Influence-SSL through applications in duplicate detection, outlier identification and fairness analysis. Code is available at: \urlthis https URL.
zh

[CV-83] Generative Diffusion Modeling: A Practical Handbook

【速读】：该论文旨在解决扩散模型（diffusion models）领域中从理论研究到代码实现之间的鸿沟问题。其关键解决方案在于通过统一符号表示并将其与代码实现对齐，从而标准化扩散概率模型、基于分数的生成模型、一致性模型、校正流及相关方法的实现过程。此外，论文还涵盖了预训练和多种后训练技术（如模型蒸馏和基于奖励的微调），以促进稳健的实现和公平的比较。

链接: https://arxiv.org/abs/2412.17162
作者: Zihan Ding,Chi Jin
机构: 未知
关键词: encompassing diffusion probabilistic, rectified flow, handbook offers, offers a unified, unified perspective
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This handbook offers a unified perspective on diffusion models, encompassing diffusion probabilistic models, score-based generative models, consistency models, rectified flow, and related methods. By standardizing notations and aligning them with code implementations, it aims to bridge the “paper-to-code” gap and facilitate robust implementations and fair comparisons. The content encompasses the fundamentals of diffusion models, the pre-training process, and various post-training methods. Post-training techniques include model distillation and reward-based fine-tuning. Designed as a practical guide, it emphasizes clarity and usability over theoretical depth, focusing on widely adopted approaches in generative modeling with diffusion models.
zh

[CV-84] he Potential of Convolutional Neural Networks for Cancer Detection

【速读】：该论文试图解决癌症早期检测的关键问题，特别是针对肺癌、乳腺癌和前列腺癌等常见癌症，以提高治疗效果和生存率。解决方案的关键在于利用卷积神经网络 (CNN) 对医学图像进行分析和分类，通过不同的CNN架构识别与癌症相关的模式。论文详细比较了各种CNN架构的性能和局限性，探讨了将CNN技术整合到临床环境中作为早期检测工具的可行性，并指出了数据多样性、结果解释和伦理考虑等方面的挑战。通过识别表现最佳的CNN架构并进行比较分析，论文旨在为CNN在癌症检测中的应用提供全面的视角，并推动医疗诊断能力的进步。

链接: https://arxiv.org/abs/2412.17155
作者: Hossein Molaeian,Kaveh Karamjani,Sina Teimouri,Saeed Roshani,Sobhan Roshani
机构: 未知
关键词: increasing survival rates, global mortality burden, Convolutional Neural Networks, improving treatment outcomes, significant global mortality
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Early detection of cancer is critical in improving treatment outcomes and increasing survival rates, particularly for common cancers such as lung, breast, and prostate which collectively contribute to a significant global mortality burden. With advancements in imaging technologies and data processing, Convolutional Neural Networks (CNNs) have emerged as a powerful tool for analyzing and classifying medical images, enabling more precise cancer detection. This paper provides a comprehensive review of recent studies leveraging CNN models for detecting ten different types of cancer. Each study employs distinct CNN architectures to identify patterns associated with these cancers, utilizing diverse datasets. Key differences and strengths of these architectures are meticulously compared and analyzed, highlighting their efficacy in improving early detection. Beyond reviewing the performance and limitations of CNN-based cancer detection methods, this study explores the feasibility of integrating CNNs into clinical settings as an early detection tool, potentially complementing or replacing traditional methods. Despite significant progress, challenges remain, including data diversity, result interpretation, and ethical considerations. By identifying the best-performing CNN architectures and providing a comparative analysis, this study aims to contribute a comprehensive perspective on the application of CNNs in cancer detection and their role in advancing diagnostic capabilities in healthcare.
zh

[CV-85] Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

【速读】：该论文试图解决自回归模型（Autoregressive, AR）在文本和图像生成中因逐个生成标记（token-by-token）而导致的生成速度慢的问题。解决方案的关键是提出了一种名为Distilled Decoding (DD) 的方法，通过流匹配（flow matching）技术创建从高斯分布到预训练AR模型输出分布的确定性映射，并训练一个网络来提炼这一映射，从而实现少步生成（few-step generation）。DD方法无需原始AR模型的训练数据，显著提升了生成速度，并在图像生成任务中展示了显著的加速效果，如在ImageNet-256上实现了从10步到1步的生成（6.3倍加速），以及从256步到1步的生成（217.8倍加速），同时保持了可接受的FID指标。

链接: https://arxiv.org/abs/2412.17153
作者: Enshu Liu,Xuefei Ning,Yu Wang,Zinan Lin
机构: Department of EE, Tsinghua University(电子工程系，清华大学); Microsoft Research(微软研究院)
关键词: generation, performance in text, Autoregressive, models, FID increase
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn’t need the training data of the original AR model, making it more this http URL evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3 \times speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving an 217.8 \times speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at this https URL.
zh

[CV-86] Style Transfer Dataset: What Makes A Good Stylization?

【速读】：该论文旨在推动图像风格迁移 (image style transfer) 技术的发展，通过构建一个包含10,000个手动评级的风格化图像的新数据集，解决了风格迁移任务中缺乏高质量评估数据的问题。解决方案的关键在于：1) 提供了一个多样化的内容和风格图像数据集，涵盖不同尺寸；2) 通过三位标注者对风格化图像进行1-10分的评分，量化了影响用户评价的关键因素；3) 提出了创建风格迁移数据集的方法论，并展示了具有统计显著性的量化指标对用户评分的影响。该数据集可用于自动化与风格迁移相关的配置和评估任务。

链接: https://arxiv.org/abs/2412.17139
作者: Victor Kitov,Valentin Abramov,Mikhail Akhtyrchenko
机构: Lomonosov Moscow State University(莫斯科国立罗蒙诺索夫大学); Plekhanov Russian University of Economics(普列汉诺夫俄罗斯经济大学)
关键词: goal of advancing, advancing image style, advancing image, style, style transfer
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We present a new dataset with the goal of advancing image style transfer - the task of rendering one image in the style of another image. The dataset covers various content and style images of different size and contains 10.000 stylizations manually rated by three annotators in 1-10 scale. Based on obtained ratings, we find which factors are mostly responsible for favourable and poor user evaluations and show quantitative measures having statistically significant impact on user grades. A methodology for creating style transfer datasets is discussed. Presented dataset can be used in automating multiple tasks, related to style transfer configuration and evaluation.
zh

[CV-87] Similarity Trajectories: Linking Sampling Process to Artifacts in Diffusion-Generated Images

【速读】：该论文试图解决扩散模型生成图像中伪影（artifacts）检测的问题，特别是在训练数据需求量大、效率和可扩展性受限的情况下。解决方案的关键在于引入相似性轨迹（Similarity Trajectory）的概念，通过分析采样过程中连续时间步长去噪图像之间的相似性变化，来表征伪影的严重程度。利用仅占先前工作所需数据量0.1%的680张标注图像数据集，训练分类器预测图像中伪影的存在，实现了在有限训练数据下区分伪影图像与自然图像的目标，并通过10折交叉验证达到了72.35%的准确率。

链接: https://arxiv.org/abs/2412.17109
作者: Dennis Menn,Feng Liang,Hung-Yueh Chiang,Diana Marculescu
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
关键词: Artifact detection algorithms, detection algorithms, algorithms are crucial, crucial to correcting, correcting the output
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Artifact detection algorithms are crucial to correcting the output generated by diffusion models. However, because of the variety of artifact forms, existing methods require substantial annotated data for training. This requirement limits their scalability and efficiency, which restricts their wide application. This paper shows that the similarity of denoised images between consecutive time steps during the sampling process is related to the severity of artifacts in images generated by diffusion models. Building on this observation, we introduce the concept of Similarity Trajectory to characterize the sampling process and its correlation with the image artifacts presented. Using an annotated data set of 680 images, which is only 0.1% of the amount of data used in the prior work, we trained a classifier on these trajectories to predict the presence of artifacts in images. By performing 10-fold validation testing on the balanced annotated data set, the classifier can achieve an accuracy of 72.35%, highlighting the connection between the Similarity Trajectory and the occurrence of artifacts. This approach enables differentiation between artifact-exhibiting and natural-looking images using limited training data.
zh

[CV-88] Refining CNN-based Heatmap Regression with Gradient-based Corner Points for Electrode Localization

【速读】：该论文试图解决锂离子电池中电极位置的检测问题。解决方案的关键在于结合传统像素梯度分析与卷积神经网络（CNN）基于热图的回归方法进行关键点提取。具体步骤包括：首先通过角点检测确定电池X射线图像中的感兴趣区域（ROI），然后利用CNN在该ROI内回归电极位置，最后通过角点先验信息对回归位置进行优化和校正，从而有效缓解网络训练过程中由于特征图下采样和填充操作导致的定位精度损失。这种方法显著提升了检测的准确性和效率。

链接: https://arxiv.org/abs/2412.17105
作者: Lin Wu
机构: 未知
关键词: lithium-ion batteries, propose a method, method for detecting, detecting the electrode, battery X-ray image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a method for detecting the electrode positions in lithium-ion batteries. The process begins by identifying the region of interest (ROI) in the battery’s X-ray image through corner point detection. A convolutional neural network is then used to regress the pole positions within this ROI. Finally, the regressed positions are optimized and corrected using corner point priors, significantly mitigating the loss of localization accuracy caused by operations such as feature map down-sampling and padding during network training. Our findings show that combining traditional pixel gradient analysis with CNN-based heatmap regression for keypoint extraction enhances both accuracy and efficiency, resulting in significant performance improvements.
zh

[CV-89] DreamOmni: Unified Image Generation and Editing

【速读】：该论文试图解决计算机视觉领域中，文本到图像 (Text-to-Image, T2I) 模型与下游编辑任务（如各种类型的图像编辑）之间缺乏统一框架的问题。解决方案的关键在于提出了一个名为 DreamOmni 的统一模型，该模型不仅整合了 T2I 生成任务，还集成了多种编辑任务。为了实现这一目标，论文首先分析了现有框架和下游任务的需求，提出了一个统一的框架设计。此外，论文还解决了高质量编辑数据的高效创建问题，通过开发合成数据管道，利用类似贴纸的元素生成准确且高质量的数据集，从而支持大规模的统一模型训练。在训练过程中，DreamOmni 同时进行 T2I 生成和编辑任务的联合训练，使得模型在生成质量和编辑任务的细节理解上均得到显著提升。

链接: https://arxiv.org/abs/2412.17098
作者: Bin Xia,Yuechen Zhang,Jingyao Li,Chengyao Wang,Yitong Wang,Xinglong Wu,Bei Yu,Jiaya Jia
机构: CUHK(香港中文大学); ByteDance Inc(字节跳动公司); HKUST(香港科技大学)
关键词: foster synergistic benefits, unified multitasking approach, large language models, streamline deployment, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Currently, the success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, proposing a unified framework that integrates both T2I models and various editing tasks. Furthermore, another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline using sticker-like elements to synthesize accurate, high-quality datasets efficiently, which enables editing data scaling up for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model’s understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.
zh

[CV-90] Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation

【速读】：该论文旨在解决视频帧插值问题，特别是在处理大运动场景时，传统确定性方法的性能不足。解决方案的关键在于采用预训练的大规模图像到视频扩散模型，并通过引入一个条件编码器（conditional encoder）来实现模型的适应性。该条件编码器通过提取首尾帧的空间和时间特征，指导扩散模型生成关键帧引导的视频序列。这种方法在Fréchet视频距离（FVD）指标上表现出优于传统确定性方法的性能，突显了生成式方法在处理复杂运动场景中的优势。

链接: https://arxiv.org/abs/2412.17042
作者: Luoxu Jin,Hiroshi Watanabe
机构: Waseda University(早稻田大学); CSCE, Graduate School of FSE(CSCE，FSE研究生院)
关键词: recent years, video generation models, advanced significantly, significantly in recent, conditional encoder
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The development of video generation models has advanced significantly in recent years. For video frame interpolation, we adopt a pre-trained large-scale image-to-video diffusion model. To enable this adaptation, we propose a conditional encoder, which serves as a simple yet effective trainable module. By leveraging the first and last frames, we extract spatial and temporal features and input them into the conditional encoder. The computed features of the conditional encoder guide the video diffusion model in generating keyframe-guided video sequences. Our method demonstrates superior performance on the Fréchet Video Distance (FVD) metric compared to previous deterministic approaches in handling large-motion cases, highlighting advancements in generative-based methodologies.
zh

[CV-91] An OpenMind for 3D medical vision self-supervised learning

【速读】：该论文试图解决3D医学视觉自监督学习领域中缺乏一致性和标准化的问题。解决方案的关键在于：a) 发布了一个包含114k个3D脑部MRI体积的最大公开预训练数据集；b) 在通用架构下对现有的自监督学习方法进行了基准测试；c) 公开了框架代码，以促进快速采用和复现。这些措施为该领域的进一步方法改进奠定了基础。

链接: https://arxiv.org/abs/2412.17041
作者: Tassilo Wald,Constantin Ulrich,Jonathan Suprijadi,Michal Nohel,Robin Peretzke,Klaus H. Maier-Hein
机构: 未知
关键词: medical vision self-supervised, vision self-supervised learning, self-supervised learning lacks, learning lacks consistency, medical vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Pre-Print for Challenge proposal; Dataset, Benchmark and Codebase will be made available shortly once Benchmarking concludes

点击查看摘要

Abstract:The field of 3D medical vision self-supervised learning lacks consistency and standardization. While many methods have been developed it is impossible to identify the current state-of-the-art, due to i) varying and small pre-training datasets, ii) varying architectures, and iii) being evaluated on differing downstream datasets. In this paper we bring clarity to this field and lay the foundation for further method advancements: We a) publish the largest publicly available pre-training dataset comprising 114k 3D brain MRI volumes and b) benchmark existing SSL methods under common architectures and c) provide the code of our framework publicly to facilitate rapid adoption and reproduction. This pre-print \textitonly describes the dataset contribution (a); Data, benchmark, and codebase will be made available shortly.
zh

[CV-92] ErasableMask: A Robust and Erasable Privacy Protection Scheme against Black-box Face Recognition Models

【速读】：该论文试图解决现有面部隐私保护方案在对抗黑盒面部识别模型时存在的两个主要问题：一是对抗样本的弱迁移性，难以有效对抗黑盒模型；二是永久性破坏面部可识别信息，无法满足法医鉴定和身份认证等授权操作的需求。解决方案的关键在于提出了一种名为ErasableMask的鲁棒且可擦除的隐私保护方案。其核心创新包括：1) 引入元辅助攻击（meta-auxiliary attack），通过学习更通用的特征并采用稳定平衡的优化策略，显著提升对抗样本在黑盒模型上的迁移性；2) 提供一种扰动擦除机制（perturbation erasion mechanism），能够在不降低图像质量的前提下擦除保护面部中的语义扰动；3) 采用课程学习策略（curriculum learning strategy），有效缓解对抗攻击与扰动擦除之间的优化冲突。实验结果表明，ErasableMask在迁移性和扰动擦除性能上均达到了最先进的水平。

链接: https://arxiv.org/abs/2412.17038
作者: Sipeng Shen,Yunming Zhang,Dengpan Ye,Xiuwen Shi,Long Tang,Haoran Duan,Ziyi Liu
机构: Wuhan University (武汉大学)
关键词: brought remarkable convenience, pose substantial privacy, substantial privacy risks, brought remarkable, remarkable convenience
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While face recognition (FR) models have brought remarkable convenience in face verification and identification, they also pose substantial privacy risks to the public. Existing facial privacy protection schemes usually adopt adversarial examples to disrupt face verification of FR models. However, these schemes often suffer from weak transferability against black-box FR models and permanently damage the identifiable information that cannot fulfill the requirements of authorized operations such as forensics and authentication. To address these limitations, we propose ErasableMask, a robust and erasable privacy protection scheme against black-box FR models. Specifically, via rethinking the inherent relationship between surrogate FR models, ErasableMask introduces a novel meta-auxiliary attack, which boosts black-box transferability by learning more general features in a stable and balancing optimization strategy. It also offers a perturbation erasion mechanism that supports the erasion of semantic perturbations in protected face without degrading image quality. To further improve performance, ErasableMask employs a curriculum learning strategy to mitigate optimization conflicts between adversarial attack and perturbation erasion. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that ErasableMask achieves the state-of-the-art performance in transferability, achieving over 72% confidence on average in commercial FR systems. Moreover, ErasableMask also exhibits outstanding perturbation erasion performance, achieving over 90% erasion success rate.
zh

[CV-93] Parameter-Efficient Interventions for Enhanced Model Merging SDM

【速读】：该论文试图解决多任务模型合并中的表示偏差（representation bias）问题，即在将特定任务模型合并为统一的多任务模型时，不同任务的表示可能相互干扰，影响任务性能。解决方案的关键是提出了一种名为IntervMerge的新方法，通过任务特定的干预（task-specific interventions）来有效缓解模型中的表示偏差。此外，为了提高效率，论文还引入了迷你干预（mini-interventions），仅修改部分表示，从而在不牺牲性能的情况下减少额外参数。实验结果表明，IntervMerge在参数更少的情况下，持续优于当前最先进的方法。

链接: https://arxiv.org/abs/2412.17023
作者: Marcin Osial,Daniel Marczak,Bartosz Zieliński
机构: Society for Industrial and Applied Mathematics (工业与应用数学学会); Society for Industrial and Applied Mathematics (工业与应用数学学会)
关键词: avoid joint training, merging combines knowledge, Model merging combines, unified multi-task model, combines knowledge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, SIAM International Conference on Data Mining (SDM) 2025

点击查看摘要

Abstract:Model merging combines knowledge from task-specific models into a unified multi-task model to avoid joint training on all task data. However, current methods face challenges due to representation bias, which can interfere with tasks performance. As a remedy, we propose IntervMerge, a novel approach to multi-task model merging that effectively mitigates representation bias across the model using taskspecific interventions. To further enhance its efficiency, we introduce mini-interventions, which modify only part of the representation, thereby reducing the additional parameters without compromising performance. Experimental results demonstrate that IntervMerge consistently outperforms the state-of-the-art approaches using fewer parameters.
zh

[CV-94] FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos AAAI2025

【速读】：该论文试图解决现有视频问答（VideoQA）模型在深度视频理解（DVU）任务中的不足，特别是在处理故事视频时，模型难以理解复杂的情节发展和长期的故事主题演变。解决方案的关键在于提出了一种基于大型语言模型的多智能体协作框架，称为StoryMind，用于自动生成大规模的DVU数据集。该数据集FriendsQA基于著名情景喜剧《老友记》，包含44.6K个问题，均匀分布在14个细粒度主题上，从而能够全面评估模型对复杂故事情节的理解能力。

链接: https://arxiv.org/abs/2412.17022
作者: Zhengqian Wu,Ruizhe Li,Zijun Xu,Zhongyuan Wang,Chunxia Xiao,Chao Liang
机构: 未知
关键词: answer natural language, aims to answer, DVU, answer natural, Video question answering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Video question answering (VideoQA) aims to answer natural language questions according to the given videos. Although existing models perform well in the factoid VideoQA task, they still face challenges in deep video understanding (DVU) task, which focuses on story videos. Compared to factoid videos, the most significant feature of story videos is storylines, which are composed of complex interactions and long-range evolvement of core story topics including characters, actions and locations. Understanding these topics requires models to possess DVU capability. However, existing DVU datasets rarely organize questions according to these story topics, making them difficult to comprehensively assess VideoQA models’ DVU capability of complex storylines. Additionally, the question quantity and video length of these dataset are limited by high labor costs of handcrafted dataset building method. In this paper, we devise a large language model based multi-agent collaboration framework, StoryMind, to automatically generate a new large-scale DVU dataset. The dataset, FriendsQA, derived from the renowned sitcom Friends with an average episode length of 1,358 seconds, contains 44.6K questions evenly distributed across 14 fine-grained topics. Finally, We conduct comprehensive experiments on 10 state-of-the-art VideoQA models using the FriendsQA dataset.
zh

[CV-95] Where am I? Cross-View Geo-localization with Natural Language Descriptions

【速读】：该论文试图解决跨视图地理定位（cross-view geo-localization）中基于自然语言描述的检索问题，特别是在行人导航和应急响应等应用中，如何通过场景文本描述来检索对应的卫星图像或OSM数据库。解决方案的关键在于提出了一个新任务，即利用自然语言描述进行跨视图地理定位，并为此构建了CVG-Text数据集，该数据集通过大尺度多模态模型（Large Multimodal Models）生成高质量的场景文本描述。论文还提出了一种新的基于文本的检索定位方法CrossText2Loc，该方法在召回率上提升了10%，并展示了出色的长文本检索能力，同时提供了可解释的相似度评分和检索理由。

链接: https://arxiv.org/abs/2412.17007
作者: Junyan Ye,Honglin Lin,Leyan Ou,Dairong Chen,Zihao Wang,Conghui He,Weijia Li
机构: Sun Yat-Sen University; Shanghai AI Laboratory; Sensetime Research; Wuhan University
关键词: geo-tagged satellite images, Cross-view geo-localization identifies, identifies the locations, locations of street-view, Large Multimodal Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve corresponding satellite images or OSM database based on scene text. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization this http URL, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at this https URL.
zh

[CV-96] Multi-Scale Foreground-Background Confidence for Out-of-Distribution Segmentation

【速读】：该论文试图解决深度神经网络在语义分割任务中对未知物体（out-of-distribution (OOD) objects）的预测失败问题，尤其是在开放世界场景和安全关键应用（如自动驾驶）中的应用。解决方案的关键在于提出了一种多尺度OOD分割方法，该方法利用前景-背景分割模型的置信度信息。与传统的语义分割模型不同，前景-背景分割模型不受预定义类别限制，因此更适合处理OOD物体。通过聚合不同尺寸补丁的像素置信度分数，该方法能够在一个图像中识别出各种尺寸的OOD物体，并在SegmentMeIfYouCan基准测试中表现出优于现有基线的性能。

链接: https://arxiv.org/abs/2412.16990
作者: Samuel Marschall,Kira Maag
机构: Technical University of Berlin, Germany(柏林工业大学); Heinrich-Heine-University Düsseldorf, Germany(海因里希-海涅-杜塞尔多夫大学)
关键词: Deep neural networks, computer vision tasks, Deep neural, shown outstanding performance, OOD segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks have shown outstanding performance in computer vision tasks such as semantic segmentation and have defined the state-of-the-art. However, these segmentation models are trained on a closed and predefined set of semantic classes, which leads to significant prediction failures in open-world scenarios on unknown objects. As this behavior prevents the application in safety-critical applications such as automated driving, the detection and segmentation of these objects from outside their predefined semantic space (out-of-distribution (OOD) objects) is of the utmost importance. In this work, we present a multi-scale OOD segmentation method that exploits the confidence information of a foreground-background segmentation model. While semantic segmentation models are trained on specific classes, this restriction does not apply to foreground-background methods making them suitable for OOD segmentation. We consider the per pixel confidence score of the model prediction which is close to 1 for a pixel in a foreground object. By aggregating these confidence values for different sized patches, objects of various sizes can be identified in a single image. Our experiments show improved performance of our method in OOD segmentation compared to comparable baselines in the SegmentMeIfYouCan benchmark.
zh

[CV-97] Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection AAAI2025

【速读】：该论文试图解决现有基于卷积神经网络（CNN）的红外小目标检测方法在处理像素分布空间特性方面的不足，特别是标准卷积未能充分利用小目标的像素高斯空间分布特性。解决方案的关键在于提出了新型针轮形卷积（PConv），以替代骨干网络中较低层的标准卷积，从而更好地匹配小目标的像素分布特性，增强特征提取能力，并显著扩大感受野，同时仅引入极少的参数增加。此外，针对现有损失函数在不同目标尺度上敏感性不一致的问题，论文提出了基于尺度的动态损失（SD Loss），根据目标大小动态调整尺度和位置损失的影响，从而提升网络对不同尺度目标的检测能力。通过将PConv和SD Loss集成到最新的目标检测算法中，论文在IRSTD-1K和SIRST-UAVB数据集上实现了显著的性能提升，验证了其方法的有效性和通用性。

链接: https://arxiv.org/abs/2412.16986
作者: Jiangnan Yang,Shuangli Liu,Jingjun Wu,Xinyu Su,Nan Hai,Xueli Huang
机构: Jiangnan Yang1; Shuangli Liu1; Xinyu Su1; Nan Hai1; Xueli Huang1; Jingjun Wu2
关键词: detecting infrared small, convolutional neural network, infrared small targets, infrared small, years have witnessed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:These recent years have witnessed that convolutional neural network (CNN)-based methods for detecting infrared small targets have achieved outstanding performance. However, these methods typically employ standard convolutions, neglecting to consider the spatial characteristics of the pixel distribution of infrared small targets. Therefore, we propose a novel pinwheel-shaped convolution (PConv) as a replacement for standard convolutions in the lower layers of the backbone network. PConv better aligns with the pixel Gaussian spatial distribution of dim small targets, enhances feature extraction, significantly increases the receptive field, and introduces only a minimal increase in parameters. Additionally, while recent loss functions combine scale and location losses, they do not adequately account for the varying sensitivity of these losses across different target scales, limiting detection performance on dim-small targets. To overcome this, we propose a scale-based dynamic (SD) Loss that dynamically adjusts the influence of scale and location losses based on target size, improving the network’s ability to detect targets of varying scales. We construct a new benchmark, SIRST-UAVB, which is the largest and most challenging dataset to date for real-shot single-frame infrared small target detection. Lastly, by integrating PConv and SD Loss into the latest small target detection algorithms, we achieved significant performance improvements on IRSTD-1K and our SIRST-UAVB dataset, validating the effectiveness and generalizability of our approach. Code – this https URL Comments: Accepted by AAAI 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2412.16986 [cs.CV] (or arXiv:2412.16986v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2412.16986 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-98] InterDance:Reactive 3D Dance Generation with Realistic Duet Interactions

【速读】：该论文试图解决现有生成式模型在生成高质量双人舞蹈交互动作方面的不足，主要问题包括缺乏大规模高质量数据集、交互动作表示不完整以及交互优化不足。解决方案的关键在于提出了一个名为InterDance的大规模双人舞蹈数据集，显著提升了动作质量、数据规模和舞蹈种类多样性。基于此数据集，论文还提出了一种新的动作表示方法，能够准确全面地描述交互动作，并通过引入基于扩散的框架和交互优化指导策略，逐步提升交互的真实性。

链接: https://arxiv.org/abs/2412.16982
作者: Ronghui Li,Youliang Zhang,Yachao Zhang,Yuxiang Zhang,Mingyang Su,Jie Guo,Ziwei Liu,Yebin Liu,Xiu Li
机构: Tsinghua University(清华大学); Peng Cheng Laboratory(鹏城实验室); Xiamen University(厦门大学); S-Lab, Nanyang Technological University(南洋理工大学)
关键词: duet dance, Humans perform, human motion generative, duet dance dataset, large-scale duet dance
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: this https URL

点击查看摘要

Abstract:Humans perform a variety of interactive motions, among which duet dance is one of the most challenging interactions. However, in terms of human motion generative models, existing works are still unable to generate high-quality interactive motions, especially in the field of duet dance. On the one hand, it is due to the lack of large-scale high-quality datasets. On the other hand, it arises from the incomplete representation of interactive motion and the lack of fine-grained optimization of interactions. To address these challenges, we propose, InterDance, a large-scale duet dance dataset that significantly enhances motion quality, data scale, and the variety of dance genres. Built upon this dataset, we propose a new motion representation that can accurately and comprehensively describe interactive motion. We further introduce a diffusion-based framework with an interaction refinement guidance strategy to optimize the realism of interactions progressively. Extensive experiments demonstrate the effectiveness of our dataset and algorithm.
zh

[CV-99] A Conditional Diffusion Model for Electrical Impedance Tomography Image Reconstruction

【速读】：该论文试图解决电阻抗断层成像 (Electrical Impedance Tomography, EIT) 中由于电压数据欠采样与高分辨率电导率图像之间的不匹配导致的病态重建问题。解决方案的关键在于提出了一种基于条件扩散模型 (Conditional Diffusion Model) 的新方法，称为 CDEIT。CDEIT 通过前向扩散过程逐步向干净的电导率图像添加高斯噪声，并通过逆向去噪过程学习从噪声版本中预测原始电导率图像，同时条件化于边界电压。此外，论文还详细介绍了归一化程序，展示了如何在模拟数据集上训练的 EIT 图像重建模型应用于具有不同大小、激励电流和背景电导率的实际数据集。实验结果表明，CDEIT 在合成数据集和真实数据集上均优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.16979
作者: Shuaikai Shi,Ruiyuan Kang,Panos Liatsis
机构: Department of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates; Wave and Machine Intelligence Department, Technology Innovation Institute, Abu Dhabi, United Arab Emirates
关键词: Electrical impedance tomography, non-invasive imaging technique, Electrical impedance, EIT image reconstruction, electrical conductivity
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Electrical impedance tomography (EIT) is a non-invasive imaging technique, capable of reconstructing images of the electrical conductivity of tissues and materials. It is popular in diverse application areas, from medical imaging to industrial process monitoring and tactile sensing, due to its low cost, real-time capabilities and non-ionizing nature. EIT visualizes the conductivity distribution within a body by measuring the boundary voltages, given a current injection. However, EIT image reconstruction is ill-posed due to the mismatch between the under-sampled voltage data and the high-resolution conductivity image. A variety of approaches, both conventional and deep learning-based, have been proposed, capitalizing on the use of spatial regularizers, and the paradigm of image regression. In this research, a novel method based on the conditional diffusion model for EIT reconstruction is proposed, termed CDEIT. Specifically, CDEIT consists of the forward diffusion process, which first gradually adds Gaussian noise to the clean conductivity images, and a reverse denoising process, which learns to predict the original conductivity image from its noisy version, conditioned on the boundary voltages. Following model training, CDEIT applies the conditional reverse process on test voltage data to generate the desired conductivities. Moreover, we provide the details of a normalization procedure, which demonstrates how EIT image reconstruction models trained on simulated datasets can be applied on real datasets with varying sizes, excitation currents and background conductivities. Experiments conducted on a synthetic dataset and two real datasets demonstrate that the proposed model outperforms state-of-the-art methods. The CDEIT software is available as open-source (this https URL) for reproducibility purposes.
zh

[CV-100] PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask

【速读】：该论文试图解决文本可编辑虚拟试衣任务中的关键问题，即在基于文本描述的情况下，如何有效地更换服装并编辑穿着风格（如塞入式、合身度等）。解决方案的关键在于设计丰富的文本描述来训练模型，处理现有服装文本信息与新服装生成之间的冲突，以及根据文本描述自适应调整修复掩码，确保编辑区域的同时保留与新服装无关的原始人物外观。为此，论文提出了PromptDresser模型，利用大型多模态模型（LMM）通过上下文学习生成详细的文本描述，并根据文本提示自适应调整修复掩码，从而实现高质量和多样化的文本驱动编辑。

链接: https://arxiv.org/abs/2412.16978
作者: Jeongho Kim,Hoiyeong Jin,Sunghyun Park,Jaegul Choo
机构: KAIST, Daejeon, South Korea(韩国科学技术院, 大田, 韩国)
关键词: Recent virtual try-on, text-editable virtual try-on, virtual try-on, virtual try-on approaches, Recent virtual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Recent virtual try-on approaches have advanced by fine-tuning the pre-trained text-to-image diffusion models to leverage their powerful generative ability. However, the use of text prompts in virtual try-on is still underexplored. This paper tackles a text-editable virtual try-on task that changes the clothing item based on the provided clothing image while editing the wearing style (e.g., tucking style, fit) according to the text descriptions. In the text-editable virtual try-on, three key aspects exist: (i) designing rich text descriptions for paired person-clothing data to train the model, (ii) addressing the conflicts where textual information of the existing person’s clothing interferes the generation of the new clothing, and (iii) adaptively adjust the inpainting mask aligned with the text descriptions, ensuring proper editing areas while preserving the original person’s appearance irrelevant to the new clothing. To address these aspects, we propose PromptDresser, a text-editable virtual try-on model that leverages large multimodal model (LMM) assistance to enable high-quality and versatile manipulation based on generative text prompts. Our approach utilizes LMMs via in-context learning to generate detailed text descriptions for person and clothing images independently, including pose details and editing attributes using minimal human cost. Moreover, to ensure the editing areas, we adjust the inpainting mask depending on the text prompts adaptively. We found that our approach, utilizing detailed text prompts, not only enhances text editability but also effectively conveys clothing details that are difficult to capture through images alone, thereby enhancing image quality. Our code is available at this https URL.
zh

[CV-101] Breaking Barriers in Physical-World Adversarial Examples: Improving Robustness and Transferability via Robust Feature AAAI2025

【速读】：该论文试图解决物理世界对抗样本 (Physical-world Adversarial Examples, PAEs) 的两个主要挑战：攻击性能不佳（如迁移性差和环境条件下的鲁棒性不足）以及攻击效果与隐蔽性之间的平衡问题。解决方案的关键在于提出了两种策略：基于鲁棒特征 (Robust Features, RFs) 的欺骗性射频注入 (Deceptive RF injection) 和对抗语义模式最小化 (Adversarial Semantic Pattern Minimization)。前者通过覆盖其他类别的鲁棒特征来提高对抗样本的迁移性和鲁棒性，后者则通过去除大部分扰动并保留必要的对抗模式来增强隐蔽性。基于这两种策略，论文设计了鲁棒特征覆盖攻击 (Robust Feature Coverage Attack, RFCoA) 方法，包括鲁棒特征解耦 (Robust Feature Disentanglement) 和对抗特征融合 (Adversarial Feature Fusion)，从而在实验中展现出优于现有方法的迁移性、鲁棒性和隐蔽性。

链接: https://arxiv.org/abs/2412.16958
作者: Yichen Wang,Yuxuan Chou,Ziqi Zhou,Hangtao Zhang,Wei Wan,Shengshan Hu,Minghui Li
机构: 未知
关键词: deep neural networks, model incorrect outputs, neural networks, physical world, incorrect outputs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2025

点击查看摘要

Abstract:As deep neural networks (DNNs) are widely applied in the physical world, many researches are focusing on physical-world adversarial examples (PAEs), which introduce perturbations to inputs and cause the model’s incorrect outputs. However, existing PAEs face two challenges: unsatisfactory attack performance (i.e., poor transferability and insufficient robustness to environment conditions), and difficulty in balancing attack effectiveness with stealthiness, where better attack effectiveness often makes PAEs more perceptible. In this paper, we explore a novel perturbation-based method to overcome the challenges. For the first challenge, we introduce a strategy Deceptive RF injection based on robust features (RFs) that are predictive, robust to perturbations, and consistent across different models. Specifically, it improves the transferability and robustness of PAEs by covering RFs of other classes onto the predictive features in clean images. For the second challenge, we introduce another strategy Adversarial Semantic Pattern Minimization, which removes most perturbations and retains only essential adversarial patterns in AEsBased on the two strategies, we design our method Robust Feature Coverage Attack (RFCoA), comprising Robust Feature Disentanglement and Adversarial Feature Fusion. In the first stage, we extract target class RFs in feature space. In the second stage, we use attention-based feature fusion to overlay these RFs onto predictive features of clean images and remove unnecessary perturbations. Experiments show our method’s superior transferability, robustness, and stealthiness compared to existing state-of-the-art methods. Additionally, our method’s effectiveness can extend to Large Vision-Language Models (LVLMs), indicating its potential applicability to more complex tasks. Comments: Accepted by AAAI2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2412.16958 [cs.CV] (or arXiv:2412.16958v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2412.16958 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-102] Semantic Hierarchical Prompt Tuning for Parameter-Efficient Fine-Tuning ICASSP2025

【速读】：该论文试图解决视觉模型在迁移学习中使用视觉提示调优（Visual Prompt Tuning, VPT）时，由于不加区分地对每一层应用提示而导致的显著干扰问题，以及缺乏显式挖掘判别性视觉特征的机制问题。解决方案的关键在于提出了一种语义层次提示（Semantic Hierarchical Prompt, SHIP）微调策略，通过自适应构建语义层次结构，并使用语义独立和语义共享的提示来学习层次化表示。此外，SHIP还集成了属性提示和提示匹配损失以增强特征判别能力，并采用解耦注意力机制以提高鲁棒性和降低推理成本。实验结果表明，SHIP在VTAB-1k任务中使用ViT-B/16骨干网络时，相较于VPT显著提升了4.8%的准确率。

链接: https://arxiv.org/abs/2412.16956
作者: Haowei Zhu,Fangyuan Zhang,Rui Qin,Tianxiang Pan,Junhai Yong,Bin Wang
机构: 未知
关键词: transfer learning technique, Visual Prompt Tuning, parameter-efficient transfer learning, vision models continues, superior performance compared
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:As the scale of vision models continues to grow, Visual Prompt Tuning (VPT) has emerged as a parameter-efficient transfer learning technique, noted for its superior performance compared to full fine-tuning. However, indiscriminately applying prompts to every layer without considering their inherent correlations, can cause significant disturbances, leading to suboptimal transferability. Additionally, VPT disrupts the original self-attention structure, affecting the aggregation of visual features, and lacks a mechanism for explicitly mining discriminative visual features, which are crucial for classification. To address these issues, we propose a Semantic Hierarchical Prompt (SHIP) fine-tuning strategy. We adaptively construct semantic hierarchies and use semantic-independent and semantic-shared prompts to learn hierarchical representations. We also integrate attribute prompts and a prompt matching loss to enhance feature discrimination and employ decoupled attention for robustness and reduced inference costs. SHIP significantly improves performance, achieving a 4.8% gain in accuracy over VPT with a ViT-B/16 backbone on VTAB-1k tasks. Our code is available at this https URL. Comments: Accepted by ICASSP 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2412.16956 [cs.CV] (or arXiv:2412.16956v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2412.16956 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-103] NumbOD: A Spatial-Frequency Fusion Attack Against Object Detectors AAAI2025

【速读】：该论文试图解决现有目标检测器 (Object Detectors, ODs) 在面对对抗攻击时的脆弱性评估问题，特别是针对传统图像级攻击方法在目标检测任务中的低效性和冗余计算问题。解决方案的关键在于提出了一种全新的空间-频率融合攻击方法，称为 NumbOD。该方法通过直接利用目标检测器的输出特征，而不依赖其内部结构，来生成对抗样本。具体来说，NumbOD 采用双轨攻击目标选择策略来选择高质量的边界框进行攻击，并使用方向性扰动来干扰预测框和分类结果。此外，通过操控图像的高频成分来混淆目标检测器的注意力，从而提高攻击效率和隐蔽性。

链接: https://arxiv.org/abs/2412.16955
作者: Ziqi Zhou,Bowen Li,Yufei Song,Zhifei Yu,Shengshan Hu,Wei Wan,Leo Yu Zhang,Dezhong Yao,Hai Jin
机构: 1. School of Computer Science, Wuhan University(武汉大学计算机科学学院);
2. Hubei Key Laboratory of Intelligent Geo-Information Processing(湖北省智能地理信息处理重点实验室);
3. Collaborative Innovation Center of Geospatial Technology(地理空间技术协同创新中心);
4. State Key Laboratory of Software Engineering(软件工程国家重点实验室);
5. Wuhan National Laboratory for Optoelectronics(武汉光电国家实验室)
关键词: achieved significant success, deep learning, autonomous driving, NMS and RPN, advancement of deep
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:With the advancement of deep learning, object detectors (ODs) with various architectures have achieved significant success in complex scenarios like autonomous driving. Previous adversarial attacks against ODs have been focused on designing customized attacks targeting their specific structures (e.g., NMS and RPN), yielding some results but simultaneously constraining their scalability. Moreover, most efforts against ODs stem from image-level attacks originally designed for classification tasks, resulting in redundant computations and disturbances in object-irrelevant areas (e.g., background). Consequently, how to design a model-agnostic efficient attack to comprehensively evaluate the vulnerabilities of ODs remains challenging and unresolved. In this paper, we propose NumbOD, a brand-new spatial-frequency fusion attack against various ODs, aimed at disrupting object detection within images. We directly leverage the features output by the OD without relying on its internal structures to craft adversarial examples. Specifically, we first design a dual-track attack target selection strategy to select high-quality bounding boxes from OD outputs for targeting. Subsequently, we employ directional perturbations to shift and compress predicted boxes and change classification results to deceive ODs. Additionally, we focus on manipulating the high-frequency components of images to confuse ODs’ attention on critical objects, thereby enhancing the attack efficiency. Our extensive experiments on nine ODs and two datasets show that NumbOD achieves powerful attack performance and high stealthiness.
zh

[CV-104] DTSGAN: Learning Dynamic Textures via Spatiotemporal Generative Adversarial Network

【速读】：该论文试图解决动态纹理合成问题，即生成与参考视频纹理视觉上相似且在时间上具有特定平稳性属性的序列。解决方案的关键在于引入了一种时空生成对抗网络 (DTSGAN)，该网络通过捕捉动态纹理的运动和内容分布，能够从单一动态纹理中学习并生成新的视频序列。DTSGAN 采用从粗到细的生成流程，并通过提出一种新颖的数据更新策略来避免模式崩溃，从而提高生成结果的多样性。实验结果表明，该模型能够生成高质量的动态纹理和自然的运动。

链接: https://arxiv.org/abs/2412.16948
作者: Xiangtian Li,Xiaobo Wang,Zhen Qi,Han Cao,Zhaoyang Zhang,Ao Xiang
机构: 未知
关键词: exhibit specific stationary, specific stationary properties, texture synthesis aims, Dynamic texture synthesis, properties in time
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamic texture synthesis aims to generate sequences that are visually similar to a reference video texture and exhibit specific stationary properties in time. In this paper, we introduce a spatiotemporal generative adversarial network (DTSGAN) that can learn from a single dynamic texture by capturing its motion and content distribution. With the pipeline of DTSGAN, a new video sequence is generated from the coarsest scale to the finest one. To avoid mode collapse, we propose a novel strategy for data updates that helps improve the diversity of generated results. Qualitative and quantitative experiments show that our model is able to generate high quality dynamic textures and natural motion.
zh

[CV-105] Separating Drone Point Clouds From Complex Backgrounds by Cluster Filter – Technical Report for CVPR 2024 UG2 Challenge

【速读】：该论文试图解决小型无人机在复杂环境中难以被传统监督式点云或图像检测方法有效识别的问题。解决方案的关键在于采用无监督的时空序列处理方法，通过融合多个激光雷达（lidar）数据集，进行前后背景分割和点云数据的时空密度与体素解析，从而实现对无人机的检测与跟踪。此外，论文还提出了一种基于时间序列的点云移动目标评分机制，以提高检测的准确性和效率。

链接: https://arxiv.org/abs/2412.16947
作者: Hanfang Liang,Jinming Hu,Xiaohuan Ling,Bing Wang
机构: School of Jianghan University (江汉大学); School of Jianghan University (江汉大学); School of Jianghan University (江汉大学); School of Jianghan University (江汉大学)
关键词: effective anti-drone measures, amplified their threat, highlighting the urgent, anti-drone measures, increasing deployment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:The increasing deployment of small drones as tools of conflict and disruption has amplified their threat, highlighting the urgent need for effective anti-drone measures. However, the compact size of most drones presents a significant challenge, as traditional supervised point cloud or image-based object detection methods often fail to identify such small objects effectively. This paper proposes a simple UAV detection method using an unsupervised pipeline. It uses spatial-temporal sequence processing to fuse multiple lidar datasets effectively, tracking and determining the position of UAVs, so as to detect and track UAVs in challenging environments. Our method performs front and rear background segmentation of point clouds through a global-local sequence clusterer and parses point cloud data from both the spatial-temporal density and spatial-temporal voxels of the point cloud. Furthermore, a scoring mechanism for point cloud moving targets is proposed, using time series detection to improve accuracy and efficiency. We used the MMAUD dataset, and our method achieved 4th place in the CVPR 2024 UG2+ Challenge, confirming the effectiveness of our method in practical applications.
zh

[CV-106] Video Domain Incremental Learning for Human Action Recognition in Home Environments

【速读】：该论文试图解决在非受限家庭环境中识别日常人类行为时，由于环境多样性和动态变化导致的视频理解模型在新领域适应过程中出现的灾难性遗忘问题。解决方案的关键在于提出了视频领域增量学习 (Video Domain Incremental Learning, VDIL) 的框架，使模型能够在不同领域间持续学习并保持固定的动作类别集合。论文通过设计三种领域划分类型（用户、场景、混合）来系统评估实际家庭环境中领域转移带来的挑战，并提出了一种基于重放 (replay) 和水库采样 (reservoir sampling) 技术的基线学习策略，该策略无需领域标签，适用于内存有限和任务不可知的情况。实验结果表明，该简单采样和重放策略在三个提出的基准测试中优于大多数现有的持续学习方法。

链接: https://arxiv.org/abs/2412.16946
作者: Yuanda Hu,Xing Liu,Meiying Li,Yate Ge,Xiaohua Sun,Weiwei Guo
机构: 未知
关键词: recognize daily human, significantly challenging, challenging to recognize, recognize daily, diversity and dynamic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:It is significantly challenging to recognize daily human actions in homes due to the diversity and dynamic changes in unconstrained home environments. It spurs the need to continually adapt to various users and scenes. Fine-tuning current video understanding models on newly encountered domains often leads to catastrophic forgetting, where the models lose their ability to perform well on previously learned scenarios. To address this issue, we formalize the problem of Video Domain Incremental Learning (VDIL), which enables models to learn continually from different domains while maintaining a fixed set of action classes. Existing continual learning research primarily focuses on class-incremental learning, while the domain incremental learning has been largely overlooked in video understanding. In this work, we introduce a novel benchmark of domain incremental human action recognition for unconstrained home environments. We design three domain split types (user, scene, hybrid) to systematically assess the challenges posed by domain shifts in real-world home settings. Furthermore, we propose a baseline learning strategy based on replay and reservoir sampling techniques without domain labels to handle scenarios with limited memory and task agnosticism. Extensive experimental results demonstrate that our simple sampling and replay strategy outperforms most existing continual learning methods across the three proposed benchmarks.
zh

[CV-107] Linguistics-Vision Monotonic Consistent Network for Sign Language Production ICASSP2025

【速读】：该论文试图解决手语生成 (Sign Language Production, SLP) 中由于跨模态语义差距和缺乏强监督对齐标签而导致的手语词汇与动作序列之间的一致性问题。解决方案的关键在于提出了基于Transformer的语言视觉单调一致网络 (Linguistics-Vision Monotonic Consistent Network, LVMCN)，通过跨模态语义对齐器 (Cross-modal Semantic Aligner, CSA) 和多模态语义比较器 (Multimodal Semantic Comparator, MSC) 来约束细粒度的跨模态单调对齐和粗粒度的多模态语义一致性。CSA通过计算跨模态特征序列的余弦相似度关联矩阵来确保手语词汇与动作序列的顺序一致性，而MSC则通过构建多模态三元组来拉近对应文本-视觉对并推远非对应对，从而确保文本句子与手语视频的语义一致性。

链接: https://arxiv.org/abs/2412.16944
作者: Xu Wang,Shengeng Tang,Peipei Song,Shuo Wang,Dan Guo,Richang Hong
机构: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China(计算机科学与信息工程学院，合肥工业大学，合肥，中国); School of Information Science and Technology, University of Science and Technology of China, Hefei, China(信息科学与技术学院，中国科学技术大学，合肥，中国)
关键词: Sign Language Production, Language Production, spoken language sentences, spoken language, Cross-modal Semantic Aligner
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Sign Language Production (SLP) aims to generate sign videos corresponding to spoken language sentences, where the conversion of sign Glosses to Poses (G2P) is the key step. Due to the cross-modal semantic gap and the lack of word-action correspondence labels for strong supervision alignment, the SLP suffers huge challenges in linguistics-vision consistency. In this work, we propose a Transformer-based Linguistics-Vision Monotonic Consistent Network (LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment and coarse-grained multimodal semantic consistency in language-visual cues through Cross-modal Semantic Aligner (CSA) and Multimodal Semantic Comparator (MSC). In the CSA, we constrain the implicit alignment between corresponding gloss and pose sequences by computing the cosine similarity association matrix between cross-modal feature sequences (i.e., the order consistency of fine-grained sign glosses and actions). As for MSC, we construct multimodal triplets based on paired and unpaired samples in batch data. By pulling closer the corresponding text-visual pairs and pushing apart the non-corresponding text-visual pairs, we constrain the semantic co-occurrence degree between corresponding gloss and pose sequences (i.e., the semantic consistency of coarse-grained textual sentences and sign videos). Extensive experiments on the popular PHOENIX14T benchmark show that the LVMCN outperforms the state-of-the-art.
zh

[CV-108] BloomCoreset: Fast Coreset Sampling using Bloom Filters for Fine-Grained Self-Supervised Learning ICASSP2025

【速读】：该论文试图解决在细粒度自监督学习（Self-Supervised Learning, SSL）中，从大规模未标注数据集（Open-Set）中高效采样核心集（Core-Set）的问题。解决方案的关键在于提出了BloomCoreset方法，通过利用Bloom过滤器作为创新的哈希机制，存储由Open-CLIP捕获的细粒度数据集的低级和高级特征，从而在空间高效的方式下实现快速检索核心集。该方法显著减少了采样时间，同时仅以0.83%的平均精度损失实现了98.5%的采样时间减少。

链接: https://arxiv.org/abs/2412.16942
作者: Prajwal Singh,Gautam Vashishtha,Indra Deep Mastan,Shanmuganathan Raman
机构: CVIG Lab; IIT Gandhinagar; Computer Science & Eng.; IIT BHU
关键词: domain-specific tasks relies, tasks relies heavily, supervised fine-grained recognition, fine-grained Self-Supervised Learning, expert annotations
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICASSP 2025

点击查看摘要

Abstract:The success of deep learning in supervised fine-grained recognition for domain-specific tasks relies heavily on expert annotations. The Open-Set for fine-grained Self-Supervised Learning (SSL) problem aims to enhance performance on downstream tasks by strategically sampling a subset of images (the Core-Set) from a large pool of unlabeled data (the Open-Set). In this paper, we propose a novel method, BloomCoreset, that significantly reduces sampling time from Open-Set while preserving the quality of samples in the coreset. To achieve this, we utilize Bloom filters as an innovative hashing mechanism to store both low- and high-level features of the fine-grained dataset, as captured by Open-CLIP, in a space-efficient manner that enables rapid retrieval of the coreset from the Open-Set. To show the effectiveness of the sampled coreset, we integrate the proposed method into the state-of-the-art fine-grained SSL framework, SimCore [1]. The proposed algorithm drastically outperforms the sampling strategy of the baseline in SimCore [1] with a 98.5% reduction in sampling time with a mere 0.83% average trade-off in accuracy calculated across 11 downstream datasets.
zh

[CV-109] Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference

【速读】：该论文试图解决现有全参考图像质量评估 (FR-IQA) 方法在捕捉复杂因果机制方面的不足，这些机制影响人类对图像失真的感知，限制了其在多样化场景中的泛化能力。解决方案的关键在于提出了一种基于溯因反事实推理 (abductive counterfactual inference) 的FR-IQA方法，通过探索深度特征对感知的因果效应，并将因果推理与特征比较相结合，构建了一个能够有效处理不同IQA场景中复杂失真类型的模型。该方法的分析独立于骨干架构，适用于多种深度网络，并通过反事实实验验证了其因果关系的有效性，展示了其在感知相关性和质量评分可解释性方面的优越性。

链接: https://arxiv.org/abs/2412.16939
作者: Wenhao Shen,Mingliang Zhou,Yu Chen,Xuekai Wei,Jun Luo,Huayan Pu,Weijia Jia
机构: 未知
关键词: Existing full-reference image, Existing full-reference, image quality assessment, underlie human perceptual, human perceptual responses
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing full-reference image quality assessment (FR-IQA) methods often fail to capture the complex causal mechanisms that underlie human perceptual responses to image distortions, limiting their ability to generalize across diverse scenarios. In this paper, we propose an FR-IQA method based on abductive counterfactual inference to investigate the causal relationships between deep network features and perceptual distortions. First, we explore the causal effects of deep features on perception and integrate causal reasoning with feature comparison, constructing a model that effectively handles complex distortion types across different IQA scenarios. Second, the analysis of the perceptual causal correlations of our proposed method is independent of the backbone architecture and thus can be applied to a variety of deep networks. Through abductive counterfactual experiments, we validate the proposed causal relationships, confirming the model’s superior perceptual relevance and interpretability of quality scores. The experimental results demonstrate the robustness and effectiveness of the method, providing competitive quality predictions across multiple benchmarks. The source code is available at this https URL.
zh

[CV-110] ImagineMap: Enhanced HD Map Construction with SD Maps

【速读】：该论文试图解决无地图导航中多视角图像和标准定义（SD）地图的处理问题，目标是输出车道和交通元素的感知及其拓扑关系。解决方案的关键在于提出了一种新颖的架构，该架构通过集成SD地图先验信息来提升车道线和区域检测的性能。模型采用两阶段结构：感知和推理，其中下游的拓扑推理头利用上游检测头的输出，从而使得检测精度的提升显著改善了下游的拓扑推理性能。

链接: https://arxiv.org/abs/2412.16938
作者: Yishen Ji,Zhiqi Li,Tong Lu
机构: Nanjing University (南京大学)
关键词: Track Mapless demands, Track Mapless, Mapless demands models, process multi-view images, Mapless demands
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 1 figures, technical report

点击查看摘要

Abstract:Track Mapless demands models to process multi-view images and Standard-Definition (SD) maps, outputting lane and traffic element perceptions along with their topological relationships. We propose a novel architecture that integrates SD map priors to improve lane line and area detection performance. Inspired by TopoMLP, our model employs a two-stage structure: perception and reasoning. The downstream topology head uses the output from the upstream detection head, meaning accuracy improvements in detection significantly boost downstream performance.
zh

[CV-111] PINN-EMFNet: PINN-based and Enhanced Multi-Scale Feature Fusion Network for Breast Ultrasound Images Segmentation

【速读】：该论文试图解决乳腺癌早期诊断中乳腺超声图像分割的挑战，特别是由于低对比度、斑点噪声和肿瘤形态多样性导致的现有分割方法在准确性和鲁棒性方面的局限性。解决方案的关键在于提出了一种基于物理信息神经网络 (PINN) 和增强多尺度特征融合网络，通过引入分层聚合编码器 (Hierarchical Aggregation Encoder) 和多尺度特征精炼解码器 (Multi-Scale Feature Refinement Decoder)，结合多尺度监督机制和修正模块，显著提升了分割的准确性和适应性。此外，通过在损失函数中引入PINN机制，增强了模型在分割过程中对肿瘤边界准确描绘的能力。

链接: https://arxiv.org/abs/2412.16937
作者: Jiajun Ding,Beiyao Zhu,Wenjie Wang,Shurong Zhang,Dian Zhua,Zhao Liua
机构: 未知
关键词: computer vision technologies, medical image segmentation, breast ultrasound images, image segmentation plays, Feature Fusion Network
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid development of deep learning and computer vision technologies, medical image segmentation plays a crucial role in the early diagnosis of breast cancer. However, due to the characteristics of breast ultrasound images, such as low contrast, speckle noise, and the highly diverse morphology of tumors, existing segmentation methods exhibit significant limitations in terms of accuracy and robustness. To address these challenges, this study proposes a PINN-based and Enhanced Multi-Scale Feature Fusion Network. The network introduces a Hierarchical Aggregation Encoder in the backbone, which efficiently integrates and globally models multi-scale features through several structural innovations and a novel PCAM module. In the decoder section, a Multi-Scale Feature Refinement Decoder is employed, which, combined with a Multi-Scale Supervision Mechanism and a correction module, significantly improves segmentation accuracy and adaptability. Additionally, the loss function incorporating the PINN mechanism introduces physical constraints during the segmentation process, enhancing the model’s ability to accurately delineate tumor boundaries. Comprehensive evaluations on two publicly available breast ultrasound datasets, BUSIS and BUSI, demonstrate that the proposed method outperforms previous segmentation approaches in terms of segmentation accuracy and robustness, particularly under conditions of complex noise and low contrast, effectively improving the accuracy and reliability of tumor segmentation. This method provides a more precise and robust solution for computer-aided diagnosis of breast ultrasound images.
zh

[CV-112] Detecting and Classifying Defective Products in Images Using YOLO

【速读】：该论文试图解决工业自动化中产品缺陷检测的效率和准确性问题。解决方案的关键在于采用深度学习技术中的YOLO算法，该算法通过高效的实时检测能力和优秀的分类性能，能够在保持高检测精度的同时实现实时检测，从而显著提升产品质检的效率和准确性。

链接: https://arxiv.org/abs/2412.16935
作者: Zhen Qi,Liwei Ding,Xiangtian Li,Jiacheng Hu,Bin Lyu,Ao Xiang
机构: 未知
关键词: manufacturing process, product quality inspection, continuous advancement, increasingly important, YOLO algorithm
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the continuous advancement of industrial automation, product quality inspection has become increasingly important in the manufacturing process. Traditional inspection methods, which often rely on manual checks or simple machine vision techniques, suffer from low efficiency and insufficient accuracy. In recent years, deep learning technology, especially the YOLO (You Only Look Once) algorithm, has emerged as a prominent solution in the field of product defect detection due to its efficient real-time detection capabilities and excellent classification performance. This study aims to use the YOLO algorithm to detect and classify defects in product images. By constructing and training a YOLO model, we conducted experiments on multiple industrial product datasets. The results demonstrate that this method can achieve real-time detection while maintaining high detection accuracy, significantly improving the efficiency and accuracy of product quality inspection. This paper further analyzes the advantages and limitations of the YOLO algorithm in practical applications and explores future research directions.
zh

[CV-113] GSemSplat: Generalizable Semantic 3D Gaussian Splatting from Uncalibrated Image Pairs

【速读】：该论文试图解决从稀疏、未校准的图像对中进行可泛化的三维语义场建模的问题。传统方法依赖于密集校准图像的场景特定优化，限制了其实用性。解决方案的关键在于引入GSemSplat框架，该框架通过学习与三维高斯（3D Gaussians）关联的开放词汇语义表示，无需场景特定优化、密集图像集合或校准。为确保三维空间中语义特征的有效学习，采用了双特征方法，结合区域特定和上下文感知的语义特征作为二维空间的监督，从而利用它们的互补优势。实验结果表明，该方法在ScanNet++数据集上优于传统场景特定方法。

链接: https://arxiv.org/abs/2412.16932
作者: Xingrui Wang,Cuiling Lan,Hanxin Zhu,Zhibo Chen,Yan Lu
机构: University of Science and Technology of China(中国科学技术大学); Microsoft Research Asia(微软亚洲研究院)
关键词: world is crucial, robotic navigation, augmented reality, reality to robotic, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modeling and understanding the 3D world is crucial for various applications, from augmented reality to robotic navigation. Recent advancements based on 3D Gaussian Splatting have integrated semantic information from multi-view images into Gaussian primitives. However, these methods typically require costly per-scene optimization from dense calibrated images, limiting their practicality. In this paper, we consider the new task of generalizable 3D semantic field modeling from sparse, uncalibrated image pairs. Building upon the Splatt3R architecture, we introduce GSemSplat, a framework that learns open-vocabulary semantic representations linked to 3D Gaussians without the need for per-scene optimization, dense image collections or calibration. To ensure effective and reliable learning of semantic features in 3D space, we employ a dual-feature approach that leverages both region-specific and context-aware semantic features as supervision in the 2D space. This allows us to capitalize on their complementary strengths. Experimental results on the ScanNet++ dataset demonstrate the effectiveness and superiority of our approach compared to the traditional scene-specific method. We hope our work will inspire more research into generalizable 3D understanding.
zh

[CV-114] AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory Estimation and Classification ICRA2025

【速读】：该论文试图解决小型无人机 (UAV) 对公共安全的威胁问题，同时克服传统无人机检测系统体积庞大且成本高昂的局限。解决方案的关键在于提出了 AV-DTEC，一种轻量级的自监督音视频融合反无人机系统。AV-DTEC 通过自监督学习与 LiDAR 生成的标签进行训练，并利用并行的选择性状态空间模型同时学习音频和视觉特征。其核心创新包括一个即插即用的主辅特征增强模块，该模块将视觉特征整合到音频特征中，以提高在复杂光照条件下的鲁棒性。此外，通过教师-学生模型自适应调整视觉特征的权重，减少对辅助特征的依赖并实现模态对齐。该系统在多模态数据上的实际应用中表现出卓越的准确性和有效性。

链接: https://arxiv.org/abs/2412.16928
作者: Zhenyuan Xiao,Yizhuo Yang,Guili Xu,Xianglong Zeng,Shenghai Yuan
机构: College of Automation Engineering, Nanhang University, China(自动化工程学院，南京航空航天大学，中国); School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore(电气与电子工程学院，南洋理工大学，新加坡)
关键词: created significant threats, traditional drone detection, drone detection systems, public safety, bulky and costly
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Submitted to ICRA 2025

点击查看摘要

Abstract:The increasing use of compact UAVs has created significant threats to public safety, while traditional drone detection systems are often bulky and costly. To address these challenges, we propose AV-DTEC, a lightweight self-supervised audio-visual fusion-based anti-UAV system. AV-DTEC is trained using self-supervised learning with labels generated by LiDAR, and it simultaneously learns audio and visual features through a parallel selective state-space model. With the learned features, a specially designed plug-and-play primary-auxiliary feature enhancement module integrates visual features into audio features for better robustness in cross-lighting conditions. To reduce reliance on auxiliary features and align modalities, we propose a teacher-student model that adaptively adjusts the weighting of visual features. AV-DTEC demonstrates exceptional accuracy and effectiveness in real-world multi-modality data. The code and trained models are publicly accessible on GitHub \urlthis https URL. Comments: Submitted to ICRA 2025 Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS) Cite as: arXiv:2412.16928 [cs.SD] (or arXiv:2412.16928v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2412.16928 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-115] Leveraging Consistent Spatio-Temporal Correspondence for Robust Visual Odometry

【速读】：该论文试图解决现有视觉里程计（VO）方法在处理噪声和不一致的光流匹配时遇到的挑战，特别是在复杂场景和长序列估计中的困难。解决方案的关键在于引入了一种新的深度网络架构——时空视觉里程计（Spatio-Temporal Visual Odometry, STVO），通过有效利用固有的时空线索来提高多帧光流匹配的准确性和一致性。具体来说，STVO包含两个创新组件：1) 时间传播模块（Temporal Propagation Module），利用多帧信息提取并传播时间线索，保持时间一致性；2) 空间激活模块（Spatial Activation Module），利用深度图的几何先验增强空间一致性，同时过滤噪声和错误匹配。这些改进使得STVO在多个基准测试中达到了最先进的性能，显著提升了估计精度。

链接: https://arxiv.org/abs/2412.16923
作者: Zhaoxing Zhang,Junda Cheng,Gangwei Xu,Xiaoxiang Wang,Can Zhang,Xin Yang
机构: 未知
关键词: Recent approaches, predict optical flow, significantly improved performance, flow matching, significantly improved
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent approaches to VO have significantly improved performance by using deep networks to predict optical flow between video frames. However, existing methods still suffer from noisy and inconsistent flow matching, making it difficult to handle challenging scenarios and long-sequence estimation. To overcome these challenges, we introduce Spatio-Temporal Visual Odometry (STVO), a novel deep network architecture that effectively leverages inherent spatio-temporal cues to enhance the accuracy and consistency of multi-frame flow matching. With more accurate and consistent flow matching, STVO can achieve better pose estimation through the bundle adjustment (BA). Specifically, STVO introduces two innovative components: 1) the Temporal Propagation Module that utilizes multi-frame information to extract and propagate temporal cues across adjacent frames, maintaining temporal consistency; 2) the Spatial Activation Module that utilizes geometric priors from the depth maps to enhance spatial consistency while filtering out excessive noise and incorrect matches. Our STVO achieves state-of-the-art performance on TUM-RGBD, EuRoc MAV, ETH3D and KITTI Odometry benchmarks. Notably, it improves accuracy by 77.8% on ETH3D benchmark and 38.9% on KITTI Odometry benchmark over the previous best methods.
zh

[CV-116] AR3D: Creating High-Quality 3D Assets via Next-Part Prediction

【速读】：该论文试图解决高质量3D资产生成的问题，提出了一种名为TAR3D的新框架。解决方案的关键在于将多模态统一和下一词预测范式（next-token prediction paradigm）的学习能力迁移到条件3D物体生成中。具体实现包括两个核心组件：首先，使用3D感知向量量化变分自编码器（3D-aware Vector Quantized-Variational AutoEncoder, 3D VQ-VAE）将广泛的3D形状编码为紧凑的三平面潜在空间，并通过可训练的码本中的离散表示来重建细粒度几何结构；其次，利用配备自定义三平面位置嵌入（TriPE）的3D生成式预训练Transformer（3D GPT），以自回归方式预测码本索引序列，从而逐步构建3D几何体的组成。实验结果表明，TAR3D在文本到3D和图像到3D任务中优于现有方法。

链接: https://arxiv.org/abs/2412.16919
作者: Xuying Zhang,Yutong Liu,Yangguang Li,Renrui Zhang,Yufei Liu,Kai Wang,Wanli Ouyang,Zhiwei Xiong,Peng Gao,Qibin Hou,Ming-Ming Cheng
机构: Nankai University(南开大学); University of Science and Technology of China(中国科学技术大学); VAST(越南科学技术研究院); Shanghai AI Lab(上海人工智能实验室)
关键词: Generative Pre-trained Transformer, Vector Quantized-Variational AutoEncoder, Pre-trained Transformer, Generative Pre-trained, Vector Quantized-Variational
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the 3D VQ-VAE first encodes a wide range of 3D shapes into a compact triplane latent space and utilizes a set of discrete representations from a trainable codebook to reconstruct fine-grained geometries under the supervision of query point occupancy. Then, the 3D GPT, equipped with a custom triplane position embedding called TriPE, predicts the codebook index sequence with prefilling prompt tokens in an autoregressive manner so that the composition of 3D geometries can be modeled part by part. Extensive experiments on ShapeNet and Objaverse demonstrate that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks
zh

[CV-117] Detect Changes like Humans: Incorporating Semantic Priors for Improved Change Detection

【速读】：该论文试图解决现有变化检测模型在面对噪声和光照变化时，由于过度依赖标注的二值变化图而忽视语义理解，导致检测精度下降的问题。解决方案的关键在于引入语义先验，提出了一种语义感知的变化检测网络（SA-CDNet），通过结合语义感知特征和差异感知特征的双流特征解码器，模拟人类视觉范式来区分变化。此外，设计了单时相语义预训练策略，利用公开的单时相遥感分割数据集构建伪变化检测数据进行大规模预训练，进一步增强了语义理解能力。

链接: https://arxiv.org/abs/2412.16918
作者: Yuhang Gan,Wenjie Xuan,Zhiming Luo,Lei Fang,Zengmao Wang,Juhua Liu,Bo Du
机构: Wuhan University (武汉大学)
关键词: comparing the appearance, identify their differences, differences by comparing, change detection, mainstream change detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:When given two similar images, humans identify their differences by comparing the appearance (\it e.g., color, texture) with the help of semantics (\it e.g., objects, relations). However, mainstream change detection models adopt a supervised training paradigm, where the annotated binary change map is the main constraint. Thus, these methods primarily emphasize the difference-aware features between bi-temporal images and neglect the semantic understanding of the changed landscapes, which undermines the accuracy in the presence of noise and illumination variations. To this end, this paper explores incorporating semantic priors to improve the ability to detect changes. Firstly, we propose a Semantic-Aware Change Detection network, namely SA-CDNet, which transfers the common knowledge of the visual foundation models (\it i.e., FastSAM) to change detection. Inspired by the human visual paradigm, a novel dual-stream feature decoder is derived to distinguish changes by combining semantic-aware features and difference-aware features. Secondly, we design a single-temporal semantic pre-training strategy to enhance the semantic understanding of landscapes, which brings further increments. Specifically, we construct pseudo-change detection data from public single-temporal remote sensing segmentation datasets for large-scale pre-training, where an extra branch is also introduced for the proxy semantic segmentation task. Experimental results on five challenging benchmarks demonstrate the superiority of our method over the existing state-of-the-art methods. The code is available at \hrefthis https URLSA-CD.
zh

[CV-118] FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

【速读】：该论文试图解决基于扩散模型的音频驱动虚拟形象生成方法在推理速度上的瓶颈问题。解决方案的关键在于提出了FADA（Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation）方法。首先，通过设计混合监督损失（mixed-supervised loss），利用不同质量的数据来增强模型的整体能力和鲁棒性。其次，采用多CFG蒸馏（multi-CFG distillation）与可学习标记（learnable tokens）相结合的方式，有效利用音频与参考图像条件之间的关联，减少了多CFG带来的三倍推理运行次数，同时保持了可接受的质量损失。实验结果表明，FADA在生成高质量视频的同时，实现了4.17到12.5倍的NFE加速。

链接: https://arxiv.org/abs/2412.16915
作者: Tianyun Zhong,Chao Liang,Jianwen Jiang,Gaojie Lin,Jiaqi Yang,Zhou Zhao
机构: Zhejiang University(浙江大学); ByteDance(字节跳动)
关键词: Diffusion-based audio-driven talking, recently gained attention, Diffusion-based audio-driven, audio-driven talking avatar, audio-driven talking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first designed a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times. Demos are available at our webpage this http URL.
zh

[CV-119] Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation AAAI2025

【速读】：该论文试图解决生成式模型中流匹配（flow matching）方法在采样过程中需要大量函数评估的问题。解决方案的关键在于引入自校正流蒸馏方法，该方法将一致性模型（consistency models）和对抗训练（adversarial training）集成到流匹配框架中，从而在少步和单步采样中实现一致的生成质量。实验结果表明，该方法在CelebA-HQ和COCO数据集的零样本基准上，无论是在定量还是定性方面，均取得了优于现有方法的效果。

链接: https://arxiv.org/abs/2412.16906
作者: Quan Dao,Hao Phung,Trung Dao,Dimitris Metaxas,Anh Tran
机构: 1. VinAI Research(VinAI研究); 2. Rutgers University(罗格斯大学); 3. University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)
关键词: demonstrating impressive empirical, impressive empirical performance, offering relative ease, demonstrating impressive, matching has emerged
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while offering relative ease of training compared to diffusion-based models. However, this method still requires numerous function evaluations in the sampling process. To address these limitations, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. This work is a pioneer in achieving consistent generation quality in both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and zero-shot benchmarks on the COCO dataset. Our implementation is released at this https URL
zh

[CV-120] Learning to Generate Gradients for Test-Time Adaptation via Test-Time Training Layers

【速读】：该论文试图解决测试时适应（Test-time Adaptation, TTA）中由于无监督学习目标（如熵最小化）产生的噪声学习信号导致的不可靠梯度和优化不稳定问题。解决方案的关键在于采用学习优化（learning-to-optimize）的方法，自动学习一个优化器，称为元梯度生成器（Meta Gradient Generator, MGG）。MGG通过设计一个轻量且高效的梯度记忆层（gradient memory layer），利用自监督重构损失来压缩历史梯度信息，从而在在线优化过程中有效利用这些信息，提升模型的长期适应能力。该方法在预训练阶段仅需少量未标注样本，且在处理未见样本时表现出优于现有最先进方法的性能，具有更少的更新次数、更少的数据需求和更短的适应迭代时间。

链接: https://arxiv.org/abs/2412.16901
作者: Qi Deng,Shuaicheng Niu,Ronghao Zhang,Yaofo Chen,Runhao Zeng,Jian Chen,Xiping Hu
机构: 未知
关键词: demonstrating broad application, broad application potential, demonstrating broad, real-world scenarios, Test-time adaptation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 3 figures, 11 tables

点击查看摘要

Abstract:Test-time adaptation (TTA) aims to fine-tune a trained model online using unlabeled testing data to adapt to new environments or out-of-distribution data, demonstrating broad application potential in real-world scenarios. However, in this optimization process, unsupervised learning objectives like entropy minimization frequently encounter noisy learning signals. These signals produce unreliable gradients, which hinder the model ability to converge to an optimal solution quickly and introduce significant instability into the optimization process. In this paper, we seek to resolve these issues from the perspective of optimizer design. Unlike prior TTA using manually designed optimizers like SGD, we employ a learning-to-optimize approach to automatically learn an optimizer, called Meta Gradient Generator (MGG). Specifically, we aim for MGG to effectively utilize historical gradient information during the online optimization process to optimize the current model. To this end, in MGG, we design a lightweight and efficient sequence modeling layer – gradient memory layer. It exploits a self-supervised reconstruction loss to compress historical gradient information into network parameters, thereby enabling better memorization ability over a long-term adaptation process. We only need a small number of unlabeled samples to pre-train MGG, and then the trained MGG can be deployed to process unseen samples. Promising results on ImageNet-C, R, Sketch, and A indicate that our method surpasses current state-of-the-art methods with fewer updates, less data, and significantly shorter adaptation iterations. Compared with a previous SOTA method SAR, we achieve 7.4% accuracy improvement and 4.2 times faster adaptation speed on ImageNet-C.
zh

[CV-121] MVREC: A General Few-shot Defect Classification Model Using Multi-View Region-Context AAAI2025

【速读】：该论文试图解决工业制造中少样本缺陷多分类 (Few-shot defect multi-classification, FSDMC) 的通用性和有效提取图像上下文信息的问题。解决方案的关键在于提出了一个名为 MVREC 的通用 FSDMC 框架，该框架通过两个主要创新来实现目标：(1) 利用预训练的 AlphaCLIP 模型提取缺陷实例的通用特征；(2) 采用区域-上下文框架，通过掩码区域输入和多视角上下文增强来增强缺陷特征。此外，模型中引入了少样本 Zip-Adapter(-F) 分类器，用于缓存支持集的视觉特征并进行少样本分类。通过引入新的 MVTec-FS 基准，并在多个数据集上进行广泛实验，验证了该框架在通用缺陷分类和上下文信息利用方面的有效性。

链接: https://arxiv.org/abs/2412.16897
作者: Shuai Lyu,Fangjian Liao,Zeqi Ma,Rongchen Zhang,Dongmei Mo,Waikeung Wong
机构: 未知
关键词: industrial manufacturing, emerging trend, trend in quality, quality control, control within industrial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Few-shot defect multi-classification (FSDMC) is an emerging trend in quality control within industrial manufacturing. However, current FSDMC research often lacks generalizability due to its focus on specific datasets. Additionally, defect classification heavily relies on contextual information within images, and existing methods fall short of effectively extracting this information. To address these challenges, we propose a general FSDMC framework called MVREC, which offers two primary advantages: (1) MVREC extracts general features for defect instances by incorporating the pre-trained AlphaCLIP model. (2) It utilizes a region-context framework to enhance defect features by leveraging mask region input and multi-view context augmentation. Furthermore, Few-shot Zip-Adapter(-F) classifiers within the model are introduced to cache the visual features of the support set and perform few-shot classification. We also introduce MVTec-FS, a new FSDMC benchmark based on MVTec AD, which includes 1228 defect images with instance-level mask annotations and 46 defect types. Extensive experiments conducted on MVTec-FS and four additional datasets demonstrate its effectiveness in general defect classification and its ability to incorporate contextual information to improve classification performance. Code: this https URL
zh

[CV-122] Adaptive Dataset Quantization

【速读】：该论文试图解决深度学习中由于大规模数据集和复杂神经网络训练带来的计算和存储负担问题。解决方案的关键在于提出了一种名为自适应数据集量化 (Adaptive Dataset Quantization, ADQ) 的新型数据集压缩框架。与传统方法如数据集蒸馏 (Dataset Distillation, DD) 和核心集选择 (coreset selection) 相比，ADQ 通过引入自适应采样策略，评估生成样本的表示性分数、多样性分数和重要性分数，从而克服了传统方法在优化成本、泛化能力和数据保留率方面的局限性。具体而言，ADQ 通过量化技术生成样本，并根据样本的特征（如纹理级别和对比学习技术）进行自适应选择，显著提升了在不同架构上的泛化能力和压缩效果，实验结果表明其性能优于传统方法。

链接: https://arxiv.org/abs/2412.16895
作者: Muquan Li,Dongyang Zhang,Qiang Dong,Xiurui Xie,Ke Qin
机构: 未知
关键词: Contemporary deep learning, confronts substantial computational, substantial computational hurdles, cumbersome neural networks, Contemporary deep
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contemporary deep learning, characterized by the training of cumbersome neural networks on massive datasets, confronts substantial computational hurdles. To alleviate heavy data storage burdens on limited hardware resources, numerous dataset compression methods such as dataset distillation (DD) and coreset selection have emerged to obtain a compact but informative dataset through synthesis or selection for efficient training. However, DD involves an expensive optimization procedure and exhibits limited generalization across unseen architectures, while coreset selection is limited by its low data keep ratio and reliance on heuristics, hindering its practicality and feasibility. To address these limitations, we introduce a newly versatile framework for dataset compression, namely Adaptive Dataset Quantization (ADQ). Specifically, we first identify the sub-optimal performance of naive Dataset Quantization (DQ), which relies on uniform sampling and overlooks the varying importance of each generated bin. Subsequently, we propose a novel adaptive sampling strategy through the evaluation of generated bins’ representativeness score, diversity score and importance score, where the former two scores are quantified by the texture level and contrastive learning-based techniques, respectively. Extensive experiments demonstrate that our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results, surpassing DQ by average 3% on various datasets.
zh

[CV-123] Anchor3DLane: 3D Lane Detection via Sample-Adaptive Sparse 3D Anchor Regression

【速读】：该论文试图解决单目3D车道检测中的挑战，特别是传统方法依赖逆透视映射（IPM）将前视图（FV）图像或特征转换为鸟瞰图（BEV）空间时，由于平地假设和上下文信息丢失导致的3D信息估计不准确问题。解决方案的关键在于提出了一种无需BEV的全新方法——Anchor3DLane++，该方法通过定义3D车道锚点（3D lane anchors）作为结构化表示，并直接从前视图特征中进行预测。此外，论文设计了基于原型的自适应锚点生成模块（Prototype-based Adaptive Anchor Generation, PAAG），动态生成样本自适应的稀疏3D锚点，并开发了等宽损失（Equal-Width, EW）以利用车道的平行特性进行正则化。最后，论文还探讨了基于Anchor3DLane++的相机-LiDAR融合，以利用互补信息。实验结果表明，Anchor3DLane++在多个3D车道检测基准上优于现有的最先进方法。

链接: https://arxiv.org/abs/2412.16889
作者: Shaofei Huang,Zhenwei Shen,Zehao Huang,Yue Liao,Jizhong Han,Naiyan Wang,Si Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; TuSimple, Beijing, China; The Chinese University of Hong Kong, Hong Kong SAR, China; The Chinese University of Hong Kong, Shenzhen, China; Institute of Artificial Intelligence, Beihang University, Beijing, China; Hangzhou Innovation Institute, Beihang University, Hangzhou, China
关键词: task of monocular, challenging task, lane detection, BEV, IPM
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in IEEE Transactions on Pattern Analysis and Machine Intelligence

点击查看摘要

Abstract:In this paper, we focus on the challenging task of monocular 3D lane detection. Previous methods typically adopt inverse perspective mapping (IPM) to transform the Front-Viewed (FV) images or features into the Bird-Eye-Viewed (BEV) space for lane detection. However, IPM’s dependence on flat ground assumption and context information loss in BEV representations lead to inaccurate 3D information estimation. Though efforts have been made to bypass BEV and directly predict 3D lanes from FV representations, their performances still fall behind BEV-based methods due to a lack of structured modeling of 3D lanes. In this paper, we propose a novel BEV-free method named Anchor3DLane++ which defines 3D lane anchors as structural representations and makes predictions directly from FV features. We also design a Prototype-based Adaptive Anchor Generation (PAAG) module to generate sample-adaptive sparse 3D anchors dynamically. In addition, an Equal-Width (EW) loss is developed to leverage the parallel property of lanes for regularization. Furthermore, camera-LiDAR fusion is also explored based on Anchor3DLane++ to leverage complementary information. Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane++ outperforms previous state-of-the-art methods. Code is available at: this https URL.
zh

[CV-124] Lightweight Design and Optimization methods for DCNNs: Progress and Futures

【速读】：该论文试图解决深度卷积神经网络 (Deep Convolutional Neural Networks, DCNNs) 在资源受限硬件平台（如智能手机、机器人和物联网设备）上的广泛应用问题。解决方案的关键在于轻量化设计策略，包括轻量化的网络架构设计和模型压缩技术。通过这些策略，论文旨在降低DCNNs的计算成本和网络复杂性，从而使其能够在移动和嵌入式设备上高效运行，推动智能家庭、远程医疗和自动驾驶等领域的应用发展。

链接: https://arxiv.org/abs/2412.16886
作者: Hanhua Long,Wenbin Bi,Jian Sun
机构: Dalian Neusoft University of Information(大连东软信息学院); School of Computer & Software(计算机与软件学院)
关键词: alongside rapid development, deep learning technologies, deep learning models, deep learning, Deep Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 9 figures

点击查看摘要

Abstract:Lightweight design, as a key approach to mitigate disparity between computational requirements of deep learning models and hardware performance, plays a pivotal role in advancing application of deep learning technologies on mobile and embedded devices, alongside rapid development of smart home, telemedicine, and autonomous driving. With its outstanding feature extracting capabilities, Deep Convolutional Neural Networks (DCNNs) have demonstrated superior performance in computer vision tasks. However, high computational costs and large network architectures severely limit the widespread application of DCNNs on resource-constrained hardware platforms such as smartphones, robots, and IoT devices. This paper reviews lightweight design strategies for DCNNs and examines recent research progress in both lightweight architectural design and model compression. Additionally, this paper discusses current limitations in this field of research and propose prospects for future directions, aiming to provide valuable guidance and reflection for lightweight design philosophy on deep neural networks in the field of computer vision.
zh

[CV-125] Out-of-Distribution Detection with Prototypical Outlier Proxy AAAI

【速读】：该论文试图解决深度学习模型在实际部署中面临的分布外检测 (Out-of-distribution, OOD) 问题，特别是模型在未见过的测试数据上表现出过度自信的现象。解决方案的关键在于提出了一个简单而有效的框架——原型异常代理 (Prototypical Outlier Proxy, POP)，通过引入虚拟的 OOD 原型来重塑 ID 和 OOD 数据之间的决策边界。具体来说，POP 将可学习的分类器转换为固定的分类器，并增加一组原型权重向量，同时引入分层相似边界损失 (hierarchical similarity boundary loss)，根据误分类的程度施加自适应惩罚。实验结果表明，POP 在多个基准数据集上显著优于现有方法，且在训练和推理速度上均有显著提升。

链接: https://arxiv.org/abs/2412.16884
作者: Mingrong Gong,Chaoqi Chen,Qingqiang Sun,Yue Wang,Hui Huang
机构: 1. School of Information Science and Engineering, Central South University, Changsha, China(信息科学与工程学院，中南大学，长沙，中国);
2. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, China(计算机科学与工程学院，湖南科技大学，湘潭，中国);
3. School of Computer Science and Engineering, Hunan University, Changsha, China(计算机科学与工程学院，湖南大学，长沙，中国)
关键词: deploying deep learning, deep learning models, crucial task, task for deploying, well-trained deep models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The 39th Annual AAAI Conference on Artificial Intelligence

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is a crucial task for deploying deep learning models in the wild. One of the major challenges is that well-trained deep models tend to perform over-confidence on unseen test data. Recent research attempts to leverage real or synthetic outliers to mitigate the issue, which may significantly increase computational costs and be biased toward specific outlier characteristics. In this paper, we propose a simple yet effective framework, Prototypical Outlier Proxy (POP), which introduces virtual OOD prototypes to reshape the decision boundaries between ID and OOD data. Specifically, we transform the learnable classifier into a fixed one and augment it with a set of prototypical weight vectors. Then, we introduce a hierarchical similarity boundary loss to impose adaptive penalties depending on the degree of misclassification. Extensive experiments across various benchmarks demonstrate the effectiveness of POP. Notably, POP achieves average FPR95 reductions of 7.70%, 6.30%, and 5.42% over the second-best methods on CIFAR-10, CIFAR-100, and ImageNet-200, respectively. Moreover, compared to the recent method NPOS, which relies on outlier synthesis, POP trains 7.2X faster and performs inference 19.5X faster. The source code is available at: this https URL.
zh

[CV-126] Predicting the Reliability of an Image Classifier under Image Distortion

【速读】：该论文试图解决图像分类任务中深度学习模型对图像畸变（image distortions）的脆弱性问题，即模型在输入图像畸变时准确率显著下降。解决方案的关键在于构建一个训练集，该训练集包含不同畸变水平及其对应的“不可靠”或“可靠”标签，并训练一个机器学习预测模型（称为畸变分类器，distortion-classifier）来分类未见过的畸变水平。由于训练集高度不平衡，论文提出了两种基于高斯过程（Gaussian process）的方法来重新平衡训练集，从而有效提升畸变分类器的性能。实验结果表明，该方法在六个流行的图像数据集上显著优于多个基线模型。

链接: https://arxiv.org/abs/2412.16881
作者: Dang Nguyen,Sunil Gupta,Kien Do,Svetha Venkatesh
机构: Applied Artificial Intelligence Institute (A2I2), Deakin University, Geelong, Australia
关键词: image classification tasks, classification tasks, accuracy significantly drops, deep learning models, reliable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In image classification tasks, deep learning models are vulnerable to image distortions i.e. their accuracy significantly drops if the input images are distorted. An image-classifier is considered “reliable” if its accuracy on distorted images is above a user-specified threshold. For a quality control purpose, it is important to predict if the image-classifier is unreliable/reliable under a distortion level. In other words, we want to predict whether a distortion level makes the image-classifier “non-reliable” or “reliable”. Our solution is to construct a training set consisting of distortion levels along with their “non-reliable” or “reliable” labels, and train a machine learning predictive model (called distortion-classifier) to classify unseen distortion levels. However, learning an effective distortion-classifier is a challenging problem as the training set is highly imbalanced. To address this problem, we propose two Gaussian process based methods to rebalance the training set. We conduct extensive experiments to show that our method significantly outperforms several baselines on six popular image datasets.
zh

[CV-127] MAGIC: Efficient and Resilient Modality-Agnostic Semantic Segmentation via Hierarchical Modality Selection

【速读】：该论文试图解决模态无关的语义分割 (modality-agnostic semantic segmentation, MaSS) 问题，旨在在不同特征粒度上平衡各模态的价值。现有方法通常以RGB模态为中心，导致架构不对称，限制了在复杂场景（如夜间）中的表现。论文提出的解决方案是MAGIC++框架，其关键在于两个可插拔模块：多模态交互模块和分层模态选择模块。前者通过通道和空间指导有效处理多模态特征，提取互补信息；后者基于分层特征空间的相似性评分，动态选择和融合多模态特征，从而减少对RGB模态的依赖，增强对传感器故障和环境噪声的鲁棒性，并在多种基准测试中实现了最先进的性能。

链接: https://arxiv.org/abs/2412.16876
作者: Xu Zheng,Yuanhuiyi Lyu,Lutao Jiang,Jiazhou Zhou,Lin Wang,Xuming Hu
机构: HKUST(GZ); Dept. of CSE, HKUST; Dept. of EEE, NTU
关键词: challenging modality-agnostic semantic, aiming at centering, semantic segmentation, address the challenging, modality-agnostic semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we address the challenging modality-agnostic semantic segmentation (MaSS), aiming at centering the value of every modality at every feature granularity. Training with all available visual modalities and effectively fusing an arbitrary combination of them is essential for robust multi-modal fusion in semantic segmentation, especially in real-world scenarios, yet remains less explored to date. Existing approaches often place RGB at the center, treating other modalities as secondary, resulting in an asymmetric architecture. However, RGB alone can be limiting in scenarios like nighttime, where modalities such as event data excel. Therefore, a resilient fusion model must dynamically adapt to each modality’s strengths while compensating for weaker this http URL this end, we introduce the MAGIC++ framework, which comprises two key plug-and-play modules for effective multi-modal fusion and hierarchical modality selection that can be equipped with various backbone models. Firstly, we introduce a multi-modal interaction module to efficiently process features from the input multi-modal batches and extract complementary scene information with channel-wise and spatial-wise guidance. On top, a unified multi-scale arbitrary-modal selection module is proposed to utilize the aggregated features as the benchmark to rank the multi-modal features based on the similarity scores at hierarchical feature spaces. This way, our method can eliminate the dependence on RGB modality at every feature granularity and better overcome sensor failures and environmental noises while ensuring the segmentation performance. Under the common multi-modal setting, our method achieves state-of-the-art performance on both real-world and synthetic benchmarks. Moreover, our method is superior in the novel modality-agnostic setting, where it outperforms prior arts by a large margin.
zh

[CV-128] CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models ICASSP2025

【速读】：该论文试图解决多模态大语言模型 (Multi-modal Large Language Model, MLLM) 在处理细粒度多模态任务时的局限性，特别是由于视觉编码器的空间感知和感知敏锐度受限，导致模型难以有效处理图像中的背景干扰和忽略细微但关键的细节。解决方案的关键在于将多模态理解过程分解为从粗到细 (Coarse to Fine, CoF) 的两个阶段：首先引导模型定位答案的大致区域，然后通过视觉提示工程调整相关区域的注意力权重，从而增强模型的视觉定位能力并提升下游任务的性能。

链接: https://arxiv.org/abs/2412.16869
作者: Yeyuan Wang,Dehong Gao,Bin Li,Rujiao Long,Lei Yi,Xiaoyan Cai,Libin Yang,Jinxia Zhang,Shanqing Yu,Qi Xuan
机构: School of Automation, Northwestern Polytechnical University, Xi’an, Shaanxi, China; School of Cybersecurity, Northwestern Polytechnical University, Xi’an, Shaanxi, China; Alibaba Group, Hangzhou, Zhejiang, China; The Key Laboratory of Measurement and Control of CSE, Ministry of Education, School of Automation, Southeast University, Nanjing 210096, China; Advanced Ocean Institute of Southeast University, Nantong 226010, China; Zhejiang University of Technology, Hangzhou, Zhejiang, China; Binjiang Institute of Artificial Intelligence, Hangzhou, Zhejiang, China
关键词: Large Language Model, develop Multi-modal LLM, Large Language, shown great potential, Multi-modal LLM
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, Accepted by ICASSP2025, full paper

点击查看摘要

Abstract:The impressive performance of Large Language Model (LLM) has prompted researchers to develop Multi-modal LLM (MLLM), which has shown great potential for various multi-modal tasks. However, current MLLM often struggles to effectively address fine-grained multi-modal challenges. We argue that this limitation is closely linked to the models’ visual grounding capabilities. The restricted spatial awareness and perceptual acuity of visual encoders frequently lead to interference from irrelevant background information in images, causing the models to overlook subtle but crucial details. As a result, achieving fine-grained regional visual comprehension becomes difficult. In this paper, we break down multi-modal understanding into two stages, from Coarse to Fine (CoF). In the first stage, we prompt the MLLM to locate the approximate area of the answer. In the second stage, we further enhance the model’s focus on relevant areas within the image through visual prompt engineering, adjusting attention weights of pertinent regions. This, in turn, improves both visual grounding and overall performance in downstream tasks. Our experiments show that this approach significantly boosts the performance of baseline models, demonstrating notable generalization and effectiveness. Our CoF approach is available online at this https URL.
zh

[CV-129] SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera WACV2025

【速读】：该论文试图解决三维声源的精确定位及其语义标签估计问题，特别是在声源不可见但假设位于场景中物体物理表面上的情况下。解决方案的关键在于利用跨模态信息，通过一个包含针孔RGB-D相机和共面四通道麦克风阵列（Mic-Array）的声学相机装置，从多视角记录音视频信号。具体而言，提出的框架SoundLoc3D将任务视为集合预测问题，首先从单视角麦克风阵列信号中学习初始的集合表示，然后通过多视角RGB-D图像提供的物理表面线索进行精炼。该方法在大规模模拟数据集上展示了其效率和优越性，并表现出对RGB-D测量不准确和环境噪声干扰的鲁棒性。

链接: https://arxiv.org/abs/2412.16861
作者: Yuhang He,Sangyun Shin,Anoop Cherian,Andrew Markham
机构: Department of Computer Science, University of Oxford, UK; Mitsubishi Electric Research Labs, Cambridge, MA, US
关键词: including detecting gas, detecting gas leak, Accurately localizing, semantic labels, real applications
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted by WACV2025

点击查看摘要

Abstract:Accurately localizing 3D sound sources and estimating their semantic labels – where the sources may not be visible, but are assumed to lie on the physical surface of objects in the scene – have many real applications, including detecting gas leak and machinery malfunction. The audio-visual weak-correlation in such setting poses new challenges in deriving innovative methods to answer if or how we can use cross-modal information to solve the task. Towards this end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array~(Mic-Array). By using this rig to record audio-visual signals from multiviews, we can use the cross-modal cues to estimate the sound sources 3D locations. Specifically, our framework SoundLoc3D treats the task as a set prediction problem, each element in the set corresponds to a potential sound source. Given the audio-visual weak-correlation, the set representation is initially learned from a single view microphone array signal, and then refined by actively incorporating physical surface cues revealed from multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on large-scale simulated dataset, and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.
zh

[CV-130] Adversarial Diffusion Model for Unsupervised Domain-Adaptive Semantic Segmentation

【速读】：该论文试图解决语义分割任务中由于域差异（domain difference）导致的监督信号获取困难问题，提出了一种新的无监督域适应（Unsupervised Domain Adaptation, UDA）方法。解决方案的关键在于引入条件与编码器间连接的潜在扩散模型（Conditional and Inter-coder Connected Latent Diffusion, CICLD），通过潜在扩散模型和对抗学习的结合，有效弥合合成图像与真实图像之间的差距。CICLD通过条件机制提升分割过程中的上下文理解，并通过编码器间连接保留细粒度细节和空间层次结构。此外，对抗学习用于对齐源域、混合域和目标域的潜在特征分布，进一步增强模型的泛化能力。实验结果表明，CICLD在GTA5、Synthia和Cityscape三个基准数据集上均优于现有的UDA方法，特别是在GTA5到Cityscape和Synthia到Cityscape的UDA设置中，分别达到了74.4和67.2的平均交并比（mIoU）。

链接: https://arxiv.org/abs/2412.16859
作者: Jongmin Yu,Zhongtian Sun,Shan Luo
机构: 未知
关键词: requires labour-intensive labelling, Semantic segmentation requires, segmentation requires labour-intensive, weakly labelled target, labour-intensive labelling tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic segmentation requires labour-intensive labelling tasks to obtain the supervision signals, and because of this issue, it is encouraged that using domain adaptation, which transfers information from the existing labelled source domains to unlabelled or weakly labelled target domains, is essential. However, it is intractable to find a well-generalised representation which can describe two domains due to probabilistic or geometric difference between the two domains. This paper presents a novel method, the Conditional and Inter-coder Connected Latent Diffusion (CICLD) based Semantic Segmentation Model, to advance unsupervised domain adaptation (UDA) for semantic segmentation tasks. Leveraging the strengths of latent diffusion models and adversarial learning, our method effectively bridges the gap between synthetic and real-world imagery. CICLD incorporates a conditioning mechanism to improve contextual understanding during segmentation and an inter-coder connection to preserve fine-grained details and spatial hierarchies. Additionally, adversarial learning aligns latent feature distributions across source, mixed, and target domains, further enhancing generalisation. Extensive experiments are conducted across three benchmark datasets-GTA5, Synthia, and Cityscape-shows that CICLD outperforms state-of-the-art UDA methods. Notably, the proposed method achieves a mean Intersection over Union (mIoU) of 74.4 for the GTA5 to Cityscape UDA setting and 67.2 mIoU for the Synthia to Cityscape UDA setting. This project is publicly available on 'this https URL.
zh

[CV-131] Sharpness-Aware Minimization with Adaptive Regularization for Training Deep Neural Networks

【速读】：该论文试图解决Sharpness-Aware Minimization (SAM) 中固定正则化超参数限制模型泛化能力的问题。解决方案的关键在于提出了SAM with Adaptive Regularization (SAMAR)，通过引入灵活的锐度比率规则动态更新正则化参数，从而提升模型的泛化能力。该方法在满足Lipschitz连续性的函数上提供了理论收敛性证明，并在CIFAR-10和CIFAR-100图像识别任务中验证了其对准确性和泛化能力的提升。

链接: https://arxiv.org/abs/2412.16854
作者: Jinping Zou,Xiaoge Deng,Tao Sun
机构: College of Computer Science and Technology(计算机科学与技术学院); National University of Defense Technology(国防科技大学)
关键词: proven highly effective, Sharpness-Aware Minimization, machine learning tasks, improving model generalization, proven highly
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) has proven highly effective in improving model generalization in machine learning tasks. However, SAM employs a fixed hyperparameter associated with the regularization to characterize the sharpness of the model. Despite its success, research on adaptive regularization methods based on SAM remains scarce. In this paper, we propose the SAM with Adaptive Regularization (SAMAR), which introduces a flexible sharpness ratio rule to update the regularization parameter dynamically. We provide theoretical proof of the convergence of SAMAR for functions satisfying the Lipschitz continuity. Additionally, experiments on image recognition tasks using CIFAR-10 and CIFAR-100 demonstrate that SAMAR enhances accuracy and model generalization.
zh

[CV-132] Seamless Detection: Unifying Salient Object Detection and Camouflaged Object Detection

【速读】：该论文试图解决显著目标检测 (Salient Object Detection, SOD) 和伪装目标检测 (Camouflaged Object Detection, COD) 这两个任务之间的矛盾，因为它们具有截然不同的目标特性（显著性和伪装性）。现有的研究将这两个任务视为对立的，分别在大规模标注数据上交替训练模型，并独立评估它们，但这种任务特定的机制无法有效应对现实世界中未知的任务需求。为此，论文提出了一种任务无关的框架，通过对比蒸馏范式 (Contrastive Distillation Paradigm, CDP) 来统一 SOD 和 COD。CDP 通过提取前景与背景的对比信息，帮助识别显著和伪装对象。论文还设计了一个包含区间层和全局上下文的简单而有效的上下文解码器，实现了 67 fps 的推理速度。此外，CDP 可以无缝集成到无监督设置中，减少对大量人工标注的依赖。实验结果表明，该框架在有监督和无监督设置下均优于现有的最先进方法。

链接: https://arxiv.org/abs/2412.16840
作者: Yi Liu,Chengxin Li,Xiaohui Dong,Lei Li,Dingwen Zhang,Shoukun Xu,Jungong Han
机构: Changzhou University(常州大学); Science and Technology on Complex System Control and Intelligent Agent Cooperation Laboratory(复杂系统控制与智能代理合作实验室); Northwestern Polytechnical University(西北工业大学); Tsinghua University(清华大学)
关键词: Achieving joint learning, Camouflaged Object Detection, Salient Object Detection, Object Detection, extremely challenging due
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving joint learning of Salient Object Detection (SOD) and Camouflaged Object Detection (COD) is extremely challenging due to their distinct object characteristics, i.e., saliency and camouflage. The only preliminary research treats them as two contradictory tasks, training models on large-scale labeled data alternately for each task and assessing them independently. However, such task-specific mechanisms fail to meet real-world demands for addressing unknown tasks effectively. To address this issue, in this paper, we pioneer a task-agnostic framework to unify SOD and COD. To this end, inspired by the agreeable nature of binary segmentation for SOD and COD, we propose a Contrastive Distillation Paradigm (CDP) to distil the foreground from the background, facilitating the identification of salient and camouflaged objects amidst their surroundings. To probe into the contribution of our CDP, we design a simple yet effective contextual decoder involving the interval-layer and global context, which achieves an inference speed of 67 fps. Besides the supervised setting, our CDP can be seamlessly integrated into unsupervised settings, eliminating the reliance on extensive human annotations. Experiments on public SOD and COD datasets demonstrate the superiority of our proposed framework in both supervised and unsupervised settings, compared with existing state-of-the-art approaches. Code is available on this https URL.
zh

[CV-133] Human-Guided Image Generation for Expanding Small-Scale Training Image Datasets

【速读】：该论文试图解决计算机视觉模型在某些实际应用（如稀有野生动物观察）中因可用数据集规模较小而导致的性能受限问题。解决方案的关键在于提出了一种人引导的图像生成方法，通过多模态投影技术实现对生成图像的更可控扩展。该方法允许用户在探索原始和生成图像的基础上，通过样本级别的提示精炼来改进生成效果，从而提升模型在分类和目标检测任务中的性能。

链接: https://arxiv.org/abs/2412.16839
作者: Changjian Chen,Fei Lv,Yalong Guan,Pengcheng Wang,Shengjie Yu,Yifan Zhang,Zhuo Tang
机构: 未知
关键词: http URL datasets, rare wildlife observation, http URL, computer vision models, pre-trained generative models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The performance of computer vision models in certain real-world applications (e.g., rare wildlife observation) is limited by the small number of available this http URL datasets using pre-trained generative models is an effective way to address this limitation. However, since the automatic generation process is uncontrollable, the generated images are usually limited in diversity, and some of them are undesired. In this paper, we propose a human-guided image generation method for more controllable dataset expansion. We develop a multi-modal projection method with theoretical guarantees to facilitate the exploration of both the original and generated images. Based on the exploration, users refine the prompts and re-generate images for better performance. Since directly refining the prompts is challenging for novice users, we develop a sample-level prompt refinement method to make it easier. With this method, users only need to provide sample-level feedback (e.g., which samples are undesired) to obtain better prompts. The effectiveness of our method is demonstrated through the quantitative evaluation of the multi-modal projection method, improved model performance in the case study for both classification and object detection tasks, and positive feedback from the experts.
zh

[CV-134] RealisID: Scale-Robust and Fine-Controllable Identity Customization via Local and Global Complementation AAAI2025

【速读】：该论文试图解决现有身份定制方法在实际应用中难以同时满足多种需求的问题，包括小面部身份保真度、面部位置、姿态和表情的控制，以及多人物定制。解决方案的关键在于提出了一种尺度鲁棒且精细可控的方法，即RealisID，通过局部和全局分支的协作来学习不同的控制能力。具体来说，局部分支通过裁剪和上采样操作过滤掉与面部无关的信息，专注于面部区域的细节控制和尺度鲁棒的身份保真度；而全局分支则管理整个图像的整体和谐，并通过输入位置指导来控制面部位置。RealisID通过这两个分支的互补性实现了对多种需求的满足，并且通过使用ControlNet的不同变体，该方法可以轻松扩展到多人物定制，即使仅在单人物数据集上训练。

链接: https://arxiv.org/abs/2412.16832
作者: Zhaoyang Sun,Fei Du,Weihua Chen,Fan Wang,Yaxiong Chen,Yi Rong,Shengwu Xiong
机构: DAMO Academy, Alibaba Group(达摩院，阿里巴巴集团)
关键词: produce realistic identity-specific, realistic identity-specific photographs, identity-specific photographs based, identity customization techniques, reference face images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2025

点击查看摘要

Abstract:Recently, the success of text-to-image synthesis has greatly advanced the development of identity customization techniques, whose main goal is to produce realistic identity-specific photographs based on text prompts and reference face images. However, it is difficult for existing identity customization methods to simultaneously meet the various requirements of different real-world applications, including the identity fidelity of small face, the control of face location, pose and expression, as well as the customization of multiple persons. To this end, we propose a scale-robust and fine-controllable method, namely RealisID, which learns different control capabilities through the cooperation between a pair of local and global branches. Specifically, by using cropping and up-sampling operations to filter out face-irrelevant information, the local branch concentrates the fine control of facial details and the scale-robust identity fidelity within the face region. Meanwhile, the global branch manages the overall harmony of the entire image. It also controls the face location by taking the location guidance as input. As a result, RealisID can benefit from the complementarity of these two branches. Finally, by implementing our branches with two different variants of ControlNet, our method can be easily extended to handle multi-person customization, even only trained on single-person datasets. Extensive experiments and ablation studies indicate the effectiveness of RealisID and verify its ability in fulfilling all the requirements mentioned above.
zh

[CV-135] Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers

【速读】：该论文试图解决扩散变换器 (Diffusion Transformers, DiTs) 在图像生成任务中存在的高延迟和内存效率低下的问题，特别是在资源受限设备上的部署困难。解决方案的关键在于提出了一个动态推理框架 DiffRatio-MoD，通过可微分的压缩比率机制，自动学习在不同图像区域、层级和时间步长上动态分配计算资源。具体来说，DiffRatio-MoD 集成了三个核心特性：(1) 基于令牌级别的动态路由机制，通过联合微调模型权重来预测令牌重要性，从而跳过不重要令牌的计算；(2) 层级可微分比率机制，自动学习不同层的压缩比率，实现冗余层的较大压缩，而其他层保持较低压缩或不压缩；(3) 时间步长可微分比率机制，根据去噪时间步长的噪声水平动态调整压缩比率，噪声较大的时间步长采用较高压缩比率，图像清晰时则降低压缩比率。这些机制共同实现了在生成质量和效率之间的优越权衡。

链接: https://arxiv.org/abs/2412.16822
作者: Haoran You,Connelly Barnes,Yuqian Zhou,Yan Kang,Zhenbang Du,Wei Zhou,Lingzhi Zhang,Yotam Nitzan,Xiaoyang Liu,Zhe Lin,Eli Shechtman,Sohrab Amirghodsi,Yingyan Celine Lin
机构: Georgia Institute of Technology(乔治亚理工学院); Adobe Research(Adobe研究)
关键词: Diffusion Transformers, memory inefficiency, making them difficult, resource-constrained devices, suffer from high
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 13 figures, 4 tables

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One key efficiency bottleneck is that existing DiTs apply equal computation across all regions of an image. However, not all image tokens are equally important, and certain localized areas require more computation, such as objects. To address this, we propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in Mixture-of-Depths (MoD) efficient DiT models. Specifically, DiffRatio-MoD integrates three features: (1) A token-level routing scheme where each DiT layer includes a router that is jointly fine-tuned with model weights to predict token importance scores. In this way, unimportant tokens bypass the entire layer’s computation; (2) A layer-wise differentiable ratio mechanism where different DiT layers automatically learn varying compression ratios from a zero initialization, resulting in large compression ratios in redundant layers while others remain less compressed or even uncompressed; (3) A timestep-wise differentiable ratio mechanism where each denoising timestep learns its own compression ratio. The resulting pattern shows higher ratios for noisier timesteps and lower ratios as the image becomes clearer. Extensive experiments on both text-to-image and inpainting tasks show that DiffRatio-MoD effectively captures dynamism across token, layer, and timestep axes, achieving superior trade-offs between generation quality and efficiency compared to prior works.
zh

[CV-136] GeoTexDensifier: Geometry-Texture-Aware Densification for High-Quality Photorealistic 3D Gaussian Splatting

【速读】：该论文试图解决3D高斯喷射（3D Gaussian Splatting, 3DGS）在高质量重建中面临的挑战，即如何在保证几何结构和纹理细节的前提下，合理分布高斯点云以实现逼真的渲染效果。解决方案的关键在于提出了GeoTexDensifier框架，该框架通过几何感知（geometry-aware）和纹理感知（texture-aware）的密度增强策略来优化高斯点云的分布。具体来说，纹理感知密度增强方法在纹理丰富的区域生成更密集的点云分布，而在低纹理区域保持稀疏性以维持点云质量；几何感知分割策略则利用深度和法线先验来指导点云的分割采样，并通过深度比率变化验证来过滤掉远离实际几何表面的噪声点云。这两种策略的结合有效提升了3DGS模型的逼真度和渲染质量。

链接: https://arxiv.org/abs/2412.16809
作者: Hanqing Jiang,Xiaojun Xiang,Han Sun,Hongjie Li,Liyang Zhou,Xiaoyu Zhang,Guofeng Zhang
机构: SenseTime Research(商汤科技研究院); State Key Lab of CAD&CG, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室)
关键词: Virtual Reality, recently attracted wide, attracted wide attentions, Gaussian Splatting, efficient rendering performance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures, 1 table

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has recently attracted wide attentions in various areas such as 3D navigation, Virtual Reality (VR) and 3D simulation, due to its photorealistic and efficient rendering performance. High-quality reconstrution of 3DGS relies on sufficient splats and a reasonable distribution of these splats to fit real geometric surface and texture details, which turns out to be a challenging problem. We present GeoTexDensifier, a novel geometry-texture-aware densification strategy to reconstruct high-quality Gaussian splats which better comply with the geometric structure and texture richness of the scene. Specifically, our GeoTexDensifier framework carries out an auxiliary texture-aware densification method to produce a denser distribution of splats in fully textured areas, while keeping sparsity in low-texture regions to maintain the quality of Gaussian point cloud. Meanwhile, a geometry-aware splitting strategy takes depth and normal priors to guide the splitting sampling and filter out the noisy splats whose initial positions are far from the actual geometric surfaces they aim to fit, under a Validation of Depth Ratio Change checking. With the help of relative monocular depth prior, such geometry-aware validation can effectively reduce the influence of scattered Gaussians to the final rendering quality, especially in regions with weak textures or without sufficient training views. The texture-aware densification and geometry-aware splitting strategies are fully combined to obtain a set of high-quality Gaussian splats. We experiment our GeoTexDensifier framework on various datasets and compare our Novel View Synthesis results to other state-of-the-art 3DGS approaches, with detailed quantitative and qualitative evaluations to demonstrate the effectiveness of our method in producing more photorealistic 3DGS models.
zh

[CV-137] IMVB7t: A Multi-Modal Model for Food Preferences based on Artificially Produced Traits

【速读】：该论文试图解决视觉刺激对人类行为和互动，尤其是食物消费和选择的影响问题。解决方案的关键在于通过环境图像提取五个关键属性，并采用基于五种不同模型的集成模型IMVB7进行检测，取得了0.85的评分。此外，通过调查分析视觉刺激对食物偏好的影响模式，利用决策树模型基于这些属性生成推荐，最终获得了0.96的评分。这一研究为该跨学科领域的进一步探索奠定了基础。

链接: https://arxiv.org/abs/2412.16807
作者: Mushfiqur Rahman Abir,Md. Tanzib Hosain,Md. Abdullah-Al-Jubair,M. F. Mridha
机构: 未知
关键词: Human behavior, behavior and interactions, interactions are profoundly, profoundly influenced, visual stimuli present
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in Proceedings of the 3rd International Conference on Computing Advancements, 2024

点击查看摘要

Abstract:Human behavior and interactions are profoundly influenced by visual stimuli present in their surroundings. This influence extends to various aspects of life, notably food consumption and selection. In our study, we employed various models to extract different attributes from the environmental images. Specifically, we identify five key attributes and employ an ensemble model IMVB7 based on five distinct models for some of their detection resulted 0.85 mark. In addition, we conducted surveys to discern patterns in food preferences in response to visual stimuli. Leveraging the insights gleaned from these surveys, we formulate recommendations using decision tree for dishes based on the amalgamation of identified attributes resulted IMVB7t 0.96 mark. This study serves as a foundational step, paving the way for further exploration of this interdisciplinary domain.
zh

[CV-138] Forget Vectors at Play: Universal Input Perturbations Driving Machine Unlearning in Image Classification

【速读】：该论文试图解决机器遗忘（Machine Unlearning, MU）问题，即如何从已训练的模型中移除特定不想要数据的影响，特别是在遵守如“被遗忘权”等数据法规的背景下。解决方案的关键在于提出了一种新颖的基于输入扰动的方法，称为“遗忘向量”（forget vector），该方法在遗忘过程中保持模型权重不变。遗忘向量是一种输入无关的数据扰动，能够有效实现与基于模型的近似遗忘方法相同的效果。此外，论文还探索了遗忘向量算术，通过简单的操作（如线性组合）将多个类特定的遗忘向量组合，生成适用于未见遗忘任务的新向量，如跨类别的任意子集遗忘。实验结果表明，该方法在效果和适应性上与最先进的基于模型的方法相当。

链接: https://arxiv.org/abs/2412.16780
作者: Changchang Sun,Ren Wang,Yihua Zhang,Jinghan Jia,Jiancheng Liu,Gaowen Liu,Sijia Liu,Yan Yan
机构: University of Illinois Chicago(伊利诺伊大学芝加哥分校); Illinois Institute of Technology(伊利诺伊理工学院); Michigan State University(密歇根州立大学); Cisco Research(思科研究)
关键词: specific unwanted data, evolving data regulations, Machine unlearning, seeks to erase, erase the influence
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Machine unlearning (MU), which seeks to erase the influence of specific unwanted data from already-trained models, is becoming increasingly vital in model editing, particularly to comply with evolving data regulations like the ``right to be forgotten’'. Conventional approaches are predominantly model-based, typically requiring retraining or fine-tuning the model’s weights to meet unlearning requirements. In this work, we approach the MU problem from a novel input perturbation-based perspective, where the model weights remain intact throughout the unlearning process. We demonstrate the existence of a proactive input-based unlearning strategy, referred to forget vector, which can be generated as an input-agnostic data perturbation and remains as effective as model-based approximate unlearning approaches. We also explore forget vector arithmetic, whereby multiple class-specific forget vectors are combined through simple operations (e.g., linear combinations) to generate new forget vectors for unseen unlearning tasks, such as forgetting arbitrary subsets across classes. Extensive experiments validate the effectiveness and adaptability of the forget vector, showcasing its competitive performance relative to state-of-the-art model-based methods. Codes are available at this https URL.
zh

[CV-139] RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing

【速读】：该论文试图解决室内场景纹理合成中的跨视图不一致性和显著接缝问题，以及优化方法带来的高计算开销问题。解决方案的关键在于提出了RoomPainter框架，该框架通过零样本技术将2D扩散模型有效适应于3D一致性纹理合成，并采用两阶段生成策略确保全局和局部一致性。具体来说，RoomPainter引入了注意力引导的多视图集成采样（Attention-Guided Multi-View Integrated Sampling, MVIS）和邻域集成注意力机制，首先生成整个房间的纹理图以确保全局一致性，然后使用其变体——注意力引导的多视图集成重绘采样（Attention-Guided Multi-View Integrated Repaint Sampling, MVRS）对房间内的个体实例进行重绘，从而进一步增强局部一致性。

链接: https://arxiv.org/abs/2412.16778
作者: Zhipeng Huang,Wangbo Yu,Xinhua Cheng,ChengShu Zhao,Yunyang Ge,Mingyi Guo,Li Yuan,Yonghong Tian
机构: Peking University(北京大学); Pengcheng Laboratory(鹏城实验室)
关键词: garnered significant interest, significant interest due, important potential applications, digital media, virtual reality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Indoor scene texture synthesis has garnered significant interest due to its important potential applications in virtual reality, digital media, and creative arts. Existing diffusion model-based researches either rely on per-view inpainting techniques, which are plagued by severe cross-view inconsistencies and conspicuous seams, or they resort to optimization-based approaches that entail substantial computational overhead. In this work, we present RoomPainter, a framework that seamlessly integrates efficiency and consistency to achieve high-fidelity texturing of indoor scenes. The core of RoomPainter features a zero-shot technique that effectively adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency. Specifically, we introduce Attention-Guided Multi-View Integrated Sampling (MVIS) combined with a neighbor-integrated attention mechanism for zero-shot texture map generation. Using the MVIS, we firstly generate texture map for the entire room to ensure global consistency, then adopt its variant, namely an attention-guided multi-view integrated repaint sampling (MVRS) to repaint individual instances within the room, thereby further enhancing local consistency. Experiments demonstrate that RoomPainter achieves superior performance for indoor scene texture synthesis in visual quality, global consistency, and generation efficiency.
zh

[CV-140] HyperCLIP: Adapting Vision-Language models with Hypernetworks

【速读】：该论文试图解决在资源受限环境下部署大规模视觉-语言模型（vision-language models）的挑战。解决方案的关键在于提出了一种名为HyperCLIP的新型视觉-语言架构，该架构通过使用一个小型的图像编码器（image encoder）和一个超网络（hypernetwork）来动态调整图像编码器的权重，以适应每一组新的文本输入。所有三个组件（超网络、图像编码器和文本编码器）都经过端到端的联合预训练。通过这种方式，HyperCLIP能够在单次前向传播中生成适用于任何任务的零样本（zero-shot）部署友好的图像分类器，从而显著提高了使用小型图像编码器的SigLIP训练模型的零样本准确率，在ImageNet上提升了3%，在CIFAR-100上提升了5%，且训练吞吐量开销最小。

链接: https://arxiv.org/abs/2412.16777
作者: Victor Akinwande,Mohammad Sadegh Norouzzadeh,Devin Willmott,Anna Bair,Madan Ravi Ganesh,J. Zico Kolter
机构: Carnegie Mellon University(卡内基梅隆大学); Bosch Center for AI(博世人工智能中心)
关键词: contrastive objectives form, Self-supervised vision-language models, Self-supervised vision-language, basis of current, contrastive objectives
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. To address this, we propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork that dynamically adapts image encoder weights to each new set of text inputs. All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end, and with a trained HyperCLIP model, we can generate new zero-shot deployment-friendly image classifiers for any task with a single forward pass through the text encoder and hypernetwork. HyperCLIP increases the zero-shot accuracy of SigLIP trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead.
zh

[CV-141] DMesh: An Efficient Differentiable Mesh for Complex Shapes

【速读】：该论文试图解决现有概率方法在处理3D三角网格时，随着形状细节增加而导致的计算成本过高的问题。解决方案的关键在于提出了一种新的可微分网格处理方法，能够在2D和3D空间中高效处理具有复杂结构的网格。此外，论文还提出了一种算法，能够根据局部几何形状动态调整2D网格的分辨率，从而实现更高效的表示。通过在2D点云和3D多视图重建任务中的实验，验证了该方法的有效性。

链接: https://arxiv.org/abs/2412.16776
作者: Sanghyun Son,Matheus Gadelha,Yang Zhou,Matthew Fisher,Zexiang Xu,Yi-Ling Qiao,Ming C. Lin,Yi Zhou
机构: University of Maryland; Adobe Research
关键词: Recent probabilistic methods, increased shape details, face high computational, high computational costs, capture diverse shapes
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 26 pages, 27 figures, 4 tables

点击查看摘要

Abstract:Recent probabilistic methods for 3D triangular meshes capture diverse shapes by differentiable mesh connectivity, but face high computational costs with increased shape details. We introduce a new differentiable mesh processing method in 2D and 3D that addresses this challenge and efficiently handles meshes with intricate structures. Additionally, we present an algorithm that adapts the mesh resolution to local geometry in 2D for efficient representation. We demonstrate the effectiveness of our approach on 2D point cloud and 3D multi-view reconstruction tasks. Visit our project page (this https URL) for source code and supplementary material.
zh

[CV-142] SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization

【速读】：该论文试图解决现有视觉语言模型在处理语音指令时的局限性，特别是在推理和提示技术（如COT）方面的不足。解决方案的关键在于提出了一种名为SilVar的端到端多模态模型，该模型结合了CLIP、Whisper和LLaMA 3.1-8B，能够处理和推理来自语音指令的视觉问答任务。SilVar通过引入一个专门设计的语音指令数据集，增强了模型在复杂场景下的推理能力，使其能够从语音输入中解析和解释视觉场景，而不仅仅是进行对象识别。实验结果表明，SilVar在MMMU和ScienceQA基准测试中达到了最先进（SOTA）的性能，展示了其在处理语音指令方面的有效性。

链接: https://arxiv.org/abs/2412.16771
作者: Tan-Hanh Pham,Hoang-Nam Le,Phu-Vinh Nguyen,Chris Ngo,Truong-Son Hy
机构: Florida Institute of Technology, USA(美国佛罗里达理工学院); FPT University, Vietnam(越南FPT大学); Uppsala University, Sweden(瑞典乌普萨拉大学); Knovel Engineering Lab, Singapore(新加坡Knovel工程实验室); University of Alabama at Birmingham, USA(美国阿拉巴马大学伯明翰分校)
关键词: demonstrated remarkable capabilities, image captioning, demonstrated remarkable, remarkable capabilities, visual question answering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Visual Language Models have demonstrated remarkable capabilities across tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in human-machine interactions. Moreover, the quality of language models depends on reasoning and prompting techniques, such as COT, which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, a novel end-to-end multimodal model that uses speech instructions for reasoning in visual question answering. In addition, we investigate reasoning techniques with levels including conversational, simple, and complex speech instruction. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling intuitive interactions by allowing users to provide verbal or text instructions. To this end, we introduce a dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model ability to process and explain visual scenes from spoken input, moving beyond object recognition to reasoning-based interactions. The experiments show that SilVar achieves SOTA performance on the MMMU and ScienceQA benchmarks despite the challenge of speech-based instructions. We believe SilVar will inspire next-generation multimodal reasoning models, toward expert artificial general intelligence. Our code and dataset are available here.
zh

[CV-143] he Master Key Filters Hypothesis: Deep Filters Are General in DS-CNNs

【速读】：该论文试图挑战传统观点，即卷积神经网络 (CNN) 的滤波器在深层会变得越来越专门化。通过分析深度可分离卷积神经网络 (DS-CNNs) 在不同领域和数据集上的表现，研究发现深层滤波器仍然保持通用性，而非预期的类别特定性。关键解决方案在于展示了这些滤波器在迁移学习中的广泛适用性，表明即使在不同数据集上训练的模型中冻结的滤波器也能表现良好，并且来自更大数据集的滤波器可以进一步提高性能。这一发现揭示了深度可分离卷积学习到的空间特征在所有层、领域和架构中都保持通用性，为神经网络的泛化特性和模型设计提供了新的见解。

链接: https://arxiv.org/abs/2412.16751
作者: Zahra Babaiee,Peyman M. Kiasari,Daniela Rus,Radu Grosu
机构: Massachusetts Institute of Technology (MIT)(麻省理工学院); Technical University of Vienna (维也纳技术大学)
关键词: paper challenges, challenges the prevailing, prevailing view, view that convolutional, increasingly specialized
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper challenges the prevailing view that convolutional neural network (CNN) filters become increasingly specialized in deeper layers. Motivated by recent observations of clusterable repeating patterns in depthwise separable CNNs (DS-CNNs) trained on ImageNet, we extend this investigation across various domains and datasets. Our analysis of DS-CNNs reveals that deep filters maintain generality, contradicting the expected transition to class-specific filters. We demonstrate the generalizability of these filters through transfer learning experiments, showing that frozen filters from models trained on different datasets perform well and can be further improved when sourced from larger datasets. Our findings indicate that spatial features learned by depthwise separable convolutions remain generic across all layers, domains, and architectures. This research provides new insights into the nature of generalization in neural networks, particularly in DS-CNNs, and has significant implications for transfer learning and model design.
zh

[CV-144] ViM-Disparity: Bridging the Gap of Speed Accuracy and Memory for Disparity Map Generation

【速读】：该论文试图解决实时性和准确性之间的权衡问题，特别是在视差图生成 (Disparity Map Generation, DMG) 模型中，同时要求低计算开销。解决方案的关键在于提出了基于视觉Mamba (Visual Mamba, ViM) 的架构，该架构能够有效溶解这一权衡，并引入了一种性能评估方法，能够联合评估模型的推理速度、计算开销和准确性。

链接: https://arxiv.org/abs/2412.16745
作者: Maheswar Bora,Tushar Anand,Saurabh Atreya,Aritra Mukherjee,Abhijit Das
机构: Machine Intelligence Group, Department of CS&IS, Birla Institute of Technology and Sciences, Pilani, India(机器智能组，计算机科学与信息系统系，皮拉尼印度理工学院和科学研究所)
关键词: disparity map generation, Visual Mamba, propose a Visual, based architecture, low computation overhead
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work we propose a Visual Mamba (ViM) based architecture, to dissolve the existing trade-off for real-time and accurate model with low computation overhead for disparity map generation (DMG). Moreover, we proposed a performance measure that can jointly evaluate the inference speed, computation overhead and the accurateness of a DMG model.
zh

[CV-145] EasyVis2: A Real Time Multi-view 3D Visualization for Laparoscopic Surgery Training Enhanced by a Deep Neural Network YOLOv8-Pose

【速读】：该论文试图解决在腹腔镜手术中实现免手操作、实时3D可视化的问题。解决方案的关键在于开发了一个名为EasyVis2的系统，该系统通过在手术套管中集成微型摄像头，提供扩展的视野和3D视角。核心技术是定制化的YOLOv8-Pose深度神经网络算法，用于估计手术器械在每个摄像头视图中的位置和方向，并通过多视角的2D关键点进行3D手术器械姿态估计，从而实现手术器械的3D表面模型实时渲染。此外，论文还介绍了如何通过最小化标注工作量来为新手术器械定制训练数据集，并展示了该系统在3D重建精度和计算时间上的改进，以及在真实动物组织上的3D渲染实验，表明其在未来实际手术中的潜在应用。

链接: https://arxiv.org/abs/2412.16742
作者: Yung-Hong Sun,Gefei Shen,Jiangang Chen,Jayer Fernandes,Hongrui Jiang,Yu Hen Hu
机构: University of Wisconsin - Madison (威斯康星大学麦迪逊分校)
关键词: designed for hands-free, laparoscopic surgery, surgical tools, surgical, system designed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages (12 pages with citations), 11 figures

点击查看摘要

Abstract:EasyVis2 is a system designed for hands-free, real-time 3D visualization during laparoscopic surgery. It incorporates a surgical trocar equipped with a set of micro-cameras, which are inserted into the body cavity to provide an expanded field of view and a 3D perspective of the surgical procedure. A sophisticated deep neural network algorithm, YOLOv8-Pose, is tailored to estimate the position and orientation of surgical instruments in each individual camera view. Subsequently, 3D surgical tool pose estimation is performed using associated 2D key points across multiple views. This enables the rendering of a 3D surface model of the surgical tools overlaid on the observed background scene for real-time visualization. In this study, we explain the process of developing a training dataset for new surgical tools to customize YoLOv8-Pose while minimizing labeling efforts. Extensive experiments were conducted to compare EasyVis2 with the original EasyVis, revealing that, with the same number of cameras, the new system improves 3D reconstruction accuracy and reduces computation time. Additionally, experiments with 3D rendering on real animal tissue visually demonstrated the distance between surgical tools and tissues by displaying virtual side views, indicating potential applications in real surgeries in the future.
zh

[CV-146] UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning

【速读】：该论文试图解决小样本学习（few-shot learning）中由于超参数（hyper-parameters）设置不当导致的性能问题。当前方法依赖于经验性的网格搜索（grid-search）来调整控制测试批次预测统计的超参数，如类别平衡程度，但这种方法在不同数据集和预训练模型上表现不稳定且计算成本高。论文提出的解决方案是将“学习优化”（learning to optimize）范式引入小样本学习，通过展开期望最大化（Expectation-Maximization, EM）优化器的泛化形式为一个神经网络架构，从而自动学习并优化超参数。该方法能够适应不同的统计特征分布和预训练范式，包括基础视觉-语言模型和标准视觉分类器，实验结果显示在视觉和视觉-语言基准测试中分别带来了高达10%和7.5%的性能提升。

链接: https://arxiv.org/abs/2412.16739
作者: Long Zhou,Fereshteh Shakeri,Aymen Sadraoui,Mounir Kaaniche,Jean-Christophe Pesquet,Ismail Ben Ayed
机构: Politecnico di Milano; Université Paris-Saclay, Inria, CentraleSupélec, CVN; ÉTS Montréal; Université Sorbonne Paris Nord
关键词: recently triggered wide, triggered wide attention, Transductive few-shot learning, Transductive few-shot, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transductive few-shot learning has recently triggered wide attention in computer vision. Yet, current methods introduce key hyper-parameters, which control the prediction statistics of the test batches, such as the level of class balance, affecting performances significantly. Such hyper-parameters are empirically grid-searched over validation data, and their configurations may vary substantially with the target dataset and pre-training model, making such empirical searches both sub-optimal and computationally intractable. In this work, we advocate and introduce the unrolling paradigm, also referred to as “learning to optimize”, in the context of few-shot learning, thereby learning efficiently and effectively a set of optimized hyper-parameters. Specifically, we unroll a generalization of the ubiquitous Expectation-Maximization (EM) optimizer into a neural network architecture, mapping each of its iterates to a layer and learning a set of key hyper-parameters over validation data. Our unrolling approach covers various statistical feature distributions and pre-training paradigms, including recent foundational vision-language models and standard vision-only classifiers. We report comprehensive experiments, which cover a breadth of fine-grained downstream image classification tasks, showing significant gains brought by the proposed unrolled EM algorithm over iterative variants. The achieved improvements reach up to 10% and 7.5% on vision-only and vision-language benchmarks, respectively.
zh

[CV-147] LUCES-MV: A Multi-View Dataset for Near-Field Point Light Source Photometric Stereo

【速读】：该论文试图解决现有光度立体（Photometric Stereo, PS）数据集不足的问题，特别是缺乏针对近场点光源设置的实际、多视角数据集，这些数据集通常缺少具有挑战性的物体（如简单、光滑、无纹理的物体）以及实用的近场光照配置。解决方案的关键在于提出了LUCES-MV数据集，这是首个针对近场点光源光度立体的真实世界多视角数据集。该数据集包含15个具有多样材质的物体，每个物体在15个LED光源的不同光照条件下进行成像，光源距离相机中心30至40厘米。LUCES-MV不仅提供了地面真实法线、物体网格和姿态，还包括光源和相机的校准图像，以支持透明的端到端评估。通过评估当前最先进的近场光度立体算法，该数据集为开发更鲁棒、准确和可扩展的基于光度立体的3D重建方法提供了重要基准。

链接: https://arxiv.org/abs/2412.16737
作者: Fotios Logothetis,Ignas Budvytis,Stephan Liwicki,Roberto Cipolla
机构: Toshiba Europe(东芝欧洲); Independent researcher(独立研究员); University of Cambridge(剑桥大学)
关键词: Neural SDF achieving, SDF achieving impressive, Neural SDF, differentiable volumetric rendering, volumetric rendering techniques
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The biggest improvements in Photometric Stereo (PS) field has recently come from adoption of differentiable volumetric rendering techniques such as NeRF or Neural SDF achieving impressive reconstruction error of 0.2mm on DiLiGenT-MV benchmark. However, while there are sizeable datasets for environment lit objects such as Digital Twin Catalogue (DTS), there are only several small Photometric Stereo datasets which often lack challenging objects (simple, smooth, untextured) and practical, small form factor (near-field) light setup. To address this, we propose LUCES-MV, the first real-world, multi-view dataset designed for near-field point light source photometric stereo. Our dataset includes 15 objects with diverse materials, each imaged under varying light conditions from an array of 15 LEDs positioned 30 to 40 centimeters from the camera center. To facilitate transparent end-to-end evaluation, our dataset provides not only ground truth normals and ground truth object meshes and poses but also light and camera calibration images. We evaluate state-of-the-art near-field photometric stereo algorithms, highlighting their strengths and limitations across different material and shape complexities. LUCES-MV dataset offers an important benchmark for developing more robust, accurate and scalable real-world Photometric Stereo based 3D reconstruction methods. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2412.16737 [cs.CV] (or arXiv:2412.16737v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2412.16737 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-148] Divide and Conquer: Grounding a Bleeding Areas in Gastrointestinal Image with Two-Stage Model

【速读】：该论文试图解决传统多任务学习模型（Multi-Task Learning models）在同时优化分类和分割任务时面临的内在挑战，特别是任务间干扰和标签异质性问题。解决方案的关键在于提出了一种两阶段框架，首先将分类和定位任务解耦，先对图像进行出血与非出血的分类，从而隔离后续定位任务，避免任务间的干扰。此外，通过引入随机权重平均（Stochastic Weight Averaging）和测试时增强（Test-Time Augmentation），进一步提升了模型在面对领域偏移和标注不一致时的鲁棒性。该方法在Auto-WCEBleedGen Challenge V2数据集上验证，显著提高了分类准确性和分割精度，尤其是在视觉模式一致的序列数据上表现突出。

链接: https://arxiv.org/abs/2412.16723
作者: Yu-Fan Lin,Bo-Cheng Qiu,Chia-Ming Lee,Chih-Chung Hsu
机构: 未知
关键词: colorectal cancer, Accurate detection, critical for diagnosing, diagnosing diseases, peptic ulcers
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate detection and segmentation of gastrointestinal bleeding are critical for diagnosing diseases such as peptic ulcers and colorectal cancer. This study proposes a two-stage framework that decouples classification and grounding to address the inherent challenges posed by traditional Multi-Task Learning models, which jointly optimizes classification and segmentation. Our approach separates these tasks to achieve targeted optimization for each. The model first classifies images as bleeding or non-bleeding, thereby isolating subsequent grounding from inter-task interference and label heterogeneity. To further enhance performance, we incorporate Stochastic Weight Averaging and Test-Time Augmentation, which improve model robustness against domain shifts and annotation inconsistencies. Our method is validated on the Auto-WCEBleedGen Challenge V2 Challenge dataset and achieving second place. Experimental results demonstrate significant improvements in classification accuracy and segmentation precision, especially on sequential datasets with consistent visual patterns. This study highlights the practical benefits of a two-stage strategy for medical image analysis and sets a new standard for GI bleeding detection and segmentation. Our code is publicly available at this GitHub repository.
zh

[CV-149] GANFusion: Feed-Forward Text-to-3D with Diffusion in GAN Space

【速读】：该论文试图解决现有3D生成模型在生成高质量3D人体角色时，无法匹配图像或视频生成模型的保真度问题。现有3D生成模型要么依赖于显式的3D监督数据，受限于3D数据的多样性和数量，要么仅依赖2D数据监督，但通常生成结果较粗糙、无法进行文本条件化或需要测试时优化。论文提出的解决方案关键在于引入GANFusion，该方法结合了生成对抗网络（GAN）和扩散模型的优势。首先，使用仅依赖单视图2D数据的GAN架构生成无条件的3D数据三平面特征，然后通过生成随机样本并为其添加描述，训练一个文本条件化的扩散模型，该模型直接从高质量三平面特征空间中采样，最终解码为3D对象。

链接: https://arxiv.org/abs/2412.16717
作者: Souhaib Attaiki,Paul Guerrero,Duygu Ceylan,Niloy J. Mitra,Maks Ovsjanikov
机构: LIX, École Polytechnique, IPP Paris; Adobe Research; University College London (UCL)
关键词: human characters, data, supervision, trained, GAN
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: this https URL

点击查看摘要

Abstract:We train a feed-forward text-to-3D diffusion generator for human characters using only single-view 2D data for supervision. Existing 3D generative models cannot yet match the fidelity of image or video generative models. State-of-the-art 3D generators are either trained with explicit 3D supervision and are thus limited by the volume and diversity of existing 3D data. Meanwhile, generators that can be trained with only 2D data as supervision typically produce coarser results, cannot be text-conditioned, or must revert to test-time optimization. We observe that GAN- and diffusion-based generators have complementary qualities: GANs can be trained efficiently with 2D supervision to produce high-quality 3D objects but are hard to condition on text. In contrast, denoising diffusion models can be conditioned efficiently but tend to be hard to train with only 2D supervision. We introduce GANFusion, which starts by generating unconditional triplane features for 3D data using a GAN architecture trained with only single-view 2D data. We then generate random samples from the GAN, caption them, and train a text-conditioned diffusion model that directly learns to sample from the space of good triplane features that can be decoded into 3D objects.
zh

[CV-150] From Histopathology Images to Cell Clouds: Learning Slide Representations with Hierarchical Cell Transformer

【速读】：该论文试图解决在病理学全切片图像 (WSI) 中直接分析和建模细胞空间分布的问题，而现有的大多数 WSI 数据集缺乏细胞级别的标注，这主要由于对千兆像素图像进行标注的成本极高。论文的关键解决方案包括构建了一个大规模的 WSI 数据集 WSI-Cell5B，该数据集包含超过 50 亿个细胞级别的标注，并且基于 6,998 张来自癌症基因组图谱计划 (The Cancer Genome Atlas Program) 的 11 种癌症的 WSI。此外，论文提出了一种新的分层细胞云变换器 (Cell Cloud Transformer, CCFormer)，通过将每个 WSI 中的细胞集合建模为细胞云，并利用邻域信息嵌入 (Neighboring Information Embedding, NIE) 和分层空间感知 (Hierarchical Spatial Perception, HSP) 模块来学习细胞的空间分布关系。实验结果表明，仅通过学习细胞的空间分布，CCFormer 就能在生存预测和癌症分期任务中达到最先进的性能。

链接: https://arxiv.org/abs/2412.16715
作者: Zijiang Yang,Zhongwei Qiu,Tiancheng Lin,Hanqing Chao,Wanxing Chang,Yelin Yang,Yunshuo Zhang,Wenpei Jiao,Yixuan Shen,Wenbin Liu,Dongmei Fu,Dakai Jin,Ke Yan,Le Lu,Hui Jiang,Yun Bian
机构: 未知
关键词: clinically crucial, crucial and potentially, potentially very beneficial, histopathology whole slide, cell-level annotations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:It is clinically crucial and potentially very beneficial to be able to analyze and model directly the spatial distributions of cells in histopathology whole slide images (WSI). However, most existing WSI datasets lack cell-level annotations, owing to the extremely high cost over giga-pixel images. Thus, it remains an open question whether deep learning models can directly and effectively analyze WSIs from the semantic aspect of cell distributions. In this work, we construct a large-scale WSI dataset with more than 5 billion cell-level annotations, termed WSI-Cell5B, and a novel hierarchical Cell Cloud Transformer (CCFormer) to tackle these challenges. WSI-Cell5B is based on 6,998 WSIs of 11 cancers from The Cancer Genome Atlas Program, and all WSIs are annotated per cell by coordinates and types. To the best of our knowledge, WSI-Cell5B is the first WSI-level large-scale dataset integrating cell-level annotations. On the other hand, CCFormer formulates the collection of cells in each WSI as a cell cloud and models cell spatial distribution. Specifically, Neighboring Information Embedding (NIE) is proposed to characterize the distribution of cells within the neighborhood of each cell, and a novel Hierarchical Spatial Perception (HSP) module is proposed to learn the spatial relationship among cells in a bottom-up manner. The clinical analysis indicates that WSI-Cell5B can be used to design clinical evaluation metrics based on counting cells that effectively assess the survival risk of patients. Extensive experiments on survival prediction and cancer staging show that learning from cell spatial distribution alone can already achieve state-of-the-art (SOTA) performance, i.e., CCFormer strongly outperforms other competing methods.
zh

[CV-151] From Pixels to Gigapixels: Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba

【速读】：该论文试图解决全切片图像 (WSIs) 在深度学习模型中的处理难题，包括计算效率和有效表示学习。解决方案的关键在于引入了一种名为 Pixel-Mamba 的新型深度学习架构，该架构利用状态空间模型 (SSM) 的线性内存复杂度，并通过逐步扩展的局部归纳偏置 (local inductive biases) 来结合局部和全局信息，从而在不需要病理学特定预训练的情况下，在肿瘤分期和生存分析任务中达到或超越现有最先进 (SOTA) 模型的性能。

链接: https://arxiv.org/abs/2412.16711
作者: Zhongwei Qiu,Hanqing Chao,Tiancheng Lin,Wanxing Chang,Zijiang Yang,Wenpei Jiao,Yixuan Shen,Yunshuo Zhang,Yelin Yang,Wenbin Liu,Hui Jiang,Yun Bian,Ke Yan,Dakai Jin,Le Lu
机构: DAMO Academy, Alibaba Group(达摩院, 阿里巴巴集团); Hupan Lab(湖畔实验室); Department of Radiology, Changhai Hospital(放射科, 长海医院); Department of Pathology, Changhai Hospital(病理科, 长海医院); Zhejiang University(浙江大学); Fudan University(复旦大学); University of Science and Technology Beijing(北京科技大学); Peking University(北京大学)
关键词: offering valuable insights, influence clinical decision-making, directly influence clinical, Histopathology plays, medical diagnostics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Histopathology plays a critical role in medical diagnostics, with whole slide images (WSIs) offering valuable insights that directly influence clinical decision-making. However, the large size and complexity of WSIs may pose significant challenges for deep learning models, in both computational efficiency and effective representation learning. In this work, we introduce Pixel-Mamba, a novel deep learning architecture designed to efficiently handle gigapixel WSIs. Pixel-Mamba leverages the Mamba module, a state-space model (SSM) with linear memory complexity, and incorporates local inductive biases through progressively expanding tokens, akin to convolutional neural networks. This enables Pixel-Mamba to hierarchically combine both local and global information while efficiently addressing computational challenges. Remarkably, Pixel-Mamba achieves or even surpasses the quantitative performance of state-of-the-art (SOTA) foundation models that were pretrained on millions of WSIs or WSI-text pairs, in a range of tumor staging and survival analysis tasks, \bf even without requiring any pathology-specific pretraining. Extensive experiments demonstrate the efficacy of Pixel-Mamba as a powerful and efficient framework for end-to-end WSI analysis.
zh

[CV-152] CAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models

【速读】：该论文试图解决扩散模型在推理过程中由于复杂的网络架构和大量的时间步长导致的内存和时间开销问题。解决方案的关键在于提出了一种名为“时间步长-通道自适应量化方法（Timestep-Channel Adaptive Quantization for Diffusion Models, TCAQ-DM）”的新方法。具体来说，该方法通过时间步长-通道联合重参数化模块（TCR）来平衡激活值在时间步长和通道上的分布范围，从而促进后续的重建过程。此外，动态自适应量化模块（DAQ）根据不同层的分布特性选择最优量化器，以减少量化误差。最后，渐进对齐重建策略（PAR）用于缓解输入不一致导致的偏差。实验结果表明，该方法在多个基准测试和不同扩散模型上显著优于现有技术，尤其是在W6A6设置下，其FID指标与全精度模型相当，并在W4A4设置下能够生成可用图像。

链接: https://arxiv.org/abs/2412.16700
作者: Haocheng Huang,Jiaxin Chen,Jinyang Guo,Ruiyi Zhan,Yunhong Wang
机构: 未知
关键词: video generation tasks, achieved remarkable success, Diffusion models, generation tasks, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved remarkable success in the image and video generation tasks. Nevertheless, they often require a large amount of memory and time overhead during inference, due to the complex network architecture and considerable number of timesteps for iterative diffusion. Recently, the post-training quantization (PTQ) technique has proved a promising way to reduce the inference cost by quantizing the float-point operations to low-bit ones. However, most of them fail to tackle with the large variations in the distribution of activations across distinct channels and timesteps, as well as the inconsistent of input between quantization and inference on diffusion models, thus leaving much room for improvement. To address the above issues, we propose a novel method dubbed Timestep-Channel Adaptive Quantization for Diffusion Models (TCAQ-DM). Specifically, we develop a timestep-channel joint reparameterization (TCR) module to balance the activation range along both the timesteps and channels, facilitating the successive reconstruction procedure. Subsequently, we employ a dynamically adaptive quantization (DAQ) module that mitigate the quantization error by selecting an optimal quantizer for each post-Softmax layers according to their specific types of distributions. Moreover, we present a progressively aligned reconstruction (PAR) strategy to mitigate the bias caused by the input mismatch. Extensive experiments on various benchmarks and distinct diffusion models demonstrate that the proposed method substantially outperforms the state-of-the-art approaches in most cases, especially yielding comparable FID metrics to the full precision model on CIFAR-10 in the W6A6 setting, while enabling generating available images in the W4A4 settings.
zh

[CV-153] Interact with me: Joint Egocentric Forecasting of Intent to Interact Attitude and Social Actions

【速读】：该论文试图解决从代理（agent）的自我中心视角（egocentric perspective）出发，联合预测用户与代理交互的意图、态度及行为的问题。解决方案的关键在于提出了一个基于图的时空框架——SocialEgoNet，该框架通过分层多任务学习（hierarchical multitask learning）方法利用任务间的依赖关系，并使用仅1秒视频输入中提取的全身骨骼关键点（whole-body skeletons）进行高效推理。实验结果表明，SocialEgoNet在增强的JPL-Social数据集上实现了实时推理，并在各项任务中表现出优异的性能（平均准确率83.15%）。

链接: https://arxiv.org/abs/2412.16698
作者: Tongfei Bian,Yiming Ma,Mathieu Chollet,Victor Sanchez,Tanaya Guha
机构: University of Glasgow, Glasgow, United Kingdom(格拉斯哥大学，格拉斯哥，英国); University of Warwick, Warwick, United Kingdom(华威大学，沃里克，英国)
关键词: proactively recognize, recognize their target, target user, user and prepare, prepare for upcoming
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:For efficient human-agent interaction, an agent should proactively recognize their target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting a person’s intent to interact with the agent, their attitude towards the agent and the action they will perform, from the agent’s (egocentric) perspective. So we propose \emphSocialEgoNet - a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. SocialEgoNet uses whole-body skeletons (keypoints from face, hands and body) extracted from only 1 second of video input for high inference speed. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate \emphreal-time inference and superior performance (average accuracy across all tasks: 83.15%) of our model outperforming several competitive baselines. The additional annotations and code will be available upon acceptance.
zh

[CV-154] VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation

【速读】：该论文试图解决从文本描述生成高质量视频时面临的时序一致性和主体运动控制问题。解决方案的关键在于提出了一种两阶段框架VAST (Video As Storyboard from Text)。首先，StoryForge将文本描述转化为详细的脚本（storyboards），捕捉人体姿态和物体布局，以表示场景的结构本质；其次，VisionForge根据这些脚本生成视频，确保视频具有平滑的运动、时序一致性和空间连贯性。通过将文本理解与视频生成解耦，VAST实现了对主体动态和场景构成的精确控制。实验结果表明，VAST在视觉质量和语义表达上优于现有方法，为动态和连贯的视频生成设定了新标准。

链接: https://arxiv.org/abs/2412.16677
作者: Chi Zhang,Yuanzhi Liang,Xi Qiu,Fangqiu Yi,Xuelong Li
机构: Institute of Artificial Intelligence, China Telecom (TeleAI); Institute of Artificial Intelligence, China Telecom (TeleAI); Institute of Artificial Intelligence, China Telecom (TeleAI); Institute of Artificial Intelligence, China Telecom (TeleAI); Institute of Artificial Intelligence, China Telecom (TeleAI)
关键词: Generating high-quality videos, Generating high-quality, video generation, high-quality video generation, Generating
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating high-quality videos from textual descriptions poses challenges in maintaining temporal coherence and control over subject motion. We propose VAST (Video As Storyboard from Text), a two-stage framework to address these challenges and enable high-quality video generation. In the first stage, StoryForge transforms textual descriptions into detailed storyboards, capturing human poses and object layouts to represent the structural essence of the scene. In the second stage, VisionForge generates videos from these storyboards, producing high-quality videos with smooth motion, temporal consistency, and spatial coherence. By decoupling text understanding from video generation, VAST enables precise control over subject dynamics and scene composition. Experiments on the VBench benchmark demonstrate that VAST outperforms existing methods in both visual quality and semantic expression, setting a new standard for dynamic and coherent video generation.
zh

[CV-155] wo-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer

【速读】：该论文试图解决多人在线互动动作生成中的关键问题，特别是如何在保留个体动作细节的同时，有效建模人与人之间的互动，并生成在文本条件驱动下具有显著差异的动作。解决方案的关键在于提出了一种统一的方法，通过单一的潜在空间（latent space）来同时建模多人的动作及其互动。具体来说，该方法利用变分自编码器（VAE）将互动动作压缩到一个统一的潜在空间中，并在该空间内进行扩散过程，由自然语言条件引导。这种方法不仅简化了流程，还提高了生成质量和效率，特别是在处理动作不对称性时表现出色。

链接: https://arxiv.org/abs/2412.16670
作者: Boyuan Li,Xihua Wang,Ruihua Song,Wenbing Huang
机构: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China(高瓴人工智能学院，中国人民大学，北京，中国)
关键词: computer character animation, character animation, critical yet under-explored, under-explored domain, domain in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Multi-person interactive motion generation, a critical yet under-explored domain in computer character animation, poses significant challenges such as intricate modeling of inter-human interactions beyond individual motions and generating two motions with huge differences from one text condition. Current research often employs separate module branches for individual motions, leading to a loss of interaction information and increased computational demands. To address these challenges, we propose a novel, unified approach that models multi-person motions and their interactions within a single latent space. Our approach streamlines the process by treating interactive motions as an integrated data point, utilizing a Variational AutoEncoder (VAE) for compression into a unified latent space, and performing a diffusion process within this space, guided by the natural language conditions. Experimental results demonstrate our method’s superiority over existing approaches in generation quality, performing text condition in particular when motions have significant asymmetry, and accelerating the generation efficiency while preserving high quality.
zh

[CV-156] Adversarial Attack Against Images Classification based on Generative Adversarial Networks

【速读】：该论文试图解决图像分类系统中的对抗攻击问题，关键解决方案是利用生成对抗网络 (GANs) 生成具有微小扰动的对抗样本，这些样本足以影响分类器的决策。通过生成器与分类器之间的对抗学习，生成的对抗样本在保持自然性的同时成功欺骗了多种先进的分类器。实验结果表明，该方法在经典图像分类数据集上具有显著的有效性。

链接: https://arxiv.org/abs/2412.16662
作者: Yahe Yang
机构: 未知
关键词: generative adversarial networks, powerful generative capabilities, generative adversarial, adversarial networks, Adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures

点击查看摘要

Abstract:Adversarial attacks on image classification systems have always been an important problem in the field of machine learning, and generative adversarial networks (GANs), as popular models in the field of image generation, have been widely used in various novel scenarios due to their powerful generative capabilities. However, with the popularity of generative adversarial networks, the misuse of fake image technology has raised a series of security problems, such as malicious tampering with other people’s photos and videos, and invasion of personal privacy. Inspired by the generative adversarial networks, this work proposes a novel adversarial attack method, aiming to gain insight into the weaknesses of the image classification system and improve its anti-attack ability. Specifically, the generative adversarial networks are used to generate adversarial samples with small perturbations but enough to affect the decision-making of the classifier, and the adversarial samples are generated through the adversarial learning of the training generator and the classifier. From extensive experiment analysis, we evaluate the effectiveness of the method on a classical image classification dataset, and the results show that our model successfully deceives a variety of advanced classifiers while maintaining the naturalness of adversarial samples.
zh

[CV-157] Generalizable Articulated Object Perception with Superpoints

【速读】：该论文试图解决机器人操作铰接物体时面临的复杂运动学结构问题，特别是如何通过精确的部件分割实现高效操作。解决方案的关键在于引入了一种基于超点（superpoint）的感知方法，通过可学习的部件感知超点生成技术，基于几何和语义相似性高效地对点云进行分组，从而清晰地划分部件边界。此外，利用2D基础模型SAM的分割能力识别像素区域中心，并选择相应的超点作为候选查询点，结合基于查询的transformer解码器进一步提升了精确的部件分割能力。实验结果表明，该方法在跨类别部件分割任务中显著优于现有最先进方法。

链接: https://arxiv.org/abs/2412.16656
作者: Qiaojun Yu,Ce Hao,Xibin Yuan,Li Zhang,Liu Liu,Yukang Huo,Rohit Agarwal,Cewu Lu
机构: 未知
关键词: Manipulating articulated objects, complex kinematic structure, Manipulating articulated, kinematic structure, efficient manipulation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Manipulating articulated objects with robotic arms is challenging due to the complex kinematic structure, which requires precise part segmentation for efficient manipulation. In this work, we introduce a novel superpoint-based perception method designed to improve part segmentation in 3D point clouds of articulated objects. We propose a learnable, part-aware superpoint generation technique that efficiently groups points based on their geometric and semantic similarities, resulting in clearer part boundaries. Furthermore, by leveraging the segmentation capabilities of the 2D foundation model SAM, we identify the centers of pixel regions and select corresponding superpoints as candidate query points. Integrating a query-based transformer decoder further enhances our method’s ability to achieve precise part segmentation. Experimental results on the GAPartNet dataset show that our method outperforms existing state-of-the-art approaches in cross-category part segmentation, achieving AP50 scores of 77.9% for seen categories (4.4% improvement) and 39.3% for unseen categories (11.6% improvement), with superior results in 5 out of 9 part categories for seen objects and outperforming all previous methods across all part categories for unseen objects.
zh

[CV-158] IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks

【速读】：该论文试图解决在红外-可见光 (IR-VIS) 任务中，由于下游数据集稀缺和迁移能力有限，导致全参数微调方法缺乏通用性和效率低下的问题。解决方案的关键在于提出了一种名为“IV-tuning”的新型微调方法，该方法通过冻结预训练的可见光基础模型 (Vision Foundation Models, VFMs)，并在模型主干中集成模态特定的提示 (modal-specific prompts) 和适配器 (adapters)，从而在保留预训练模型通用表示的同时，有效弥合了VFMs与下游红外-可见光任务之间的差距，并学习了不同模态间的互补性。通过仅微调约3%的主干参数，IV-tuning在红外-可见光语义分割和目标检测任务中超越了全参数微调方法，展示了其在减少训练参数的同时实现更优性能的优势。

链接: https://arxiv.org/abs/2412.16654
作者: Yaming Zhang,Chenqiang Gao,Fangcen Liu,Junjie Guo,Lan Wang,Xinggan Peng,Deyu Meng
机构: 未知
关键词: Vision Foundation Models, greatly benefit, advantage of combining, combining infrared, infrared and visible
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared-visible (IR-VIS) tasks, such as semantic segmentation and object detection, greatly benefit from the advantage of combining infrared and visible modalities. To inherit the general representations of the Vision Foundation Models (VFMs), task-specific dual-branch networks are designed and fully fine-tuned on downstream datasets. Although effective, this manner lacks generality and is sub-optimal due to the scarcity of downstream infrared-visible datasets and limited transferability. In this paper, we propose a novel and general fine-tuning approach, namely “IV-tuning”, to parameter-efficiently harness VFMs for various infrared-visible downstream tasks. At its core, IV-tuning freezes pre-trained visible-based VFMs and integrates modal-specific prompts with adapters within the backbone, bridging the gap between VFMs and downstream infrared-visible tasks while simultaneously learning the complementarity between different modalities. By fine-tuning approximately 3% of the backbone parameters, IV-tuning outperforms full fine-tuning across various baselines in infrared-visible semantic segmentation and object detection, as well as previous state-of-the-art methods. Extensive experiments across various settings demonstrate that IV-tuning achieves superior performance with fewer training parameters, providing a good alternative to full fine-tuning and a novel method of extending visible-based models for infrared-visible tasks. The code is available at this https URL.
zh

[CV-159] PB-UAP: Hybrid Universal Adversarial Attack For Image Segmentation ICASSP2025

【速读】：该论文试图解决分割模型在面对对抗攻击时的鲁棒性问题，特别是针对通用对抗扰动（universal adversarial perturbations）的研究。解决方案的关键在于提出了一种新颖的通用对抗攻击方法，该方法包括双特征分离（dual feature separation）和低频散射（low-frequency scattering）模块。这两个模块分别在像素空间和频率空间中引导对抗样本的训练，从而实现了对分割模型的高成功率攻击，并展示了跨不同模型的强迁移性。

链接: https://arxiv.org/abs/2412.16651
作者: Yufei Song,Ziqi Zhou,Minghui Li,Xianlong Wang,Menghao Deng,Wei Wan,Shengshan Hu,Leo Yu Zhang
机构: School of Cyber Science and Engineering, Huazhong University of Science and Technology; School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Engineering, Huazhong University of Science and Technology; School of Information and Communication Technology, Griffith University
关键词: significant research hotspot, deep neural networks, deep learning, deep neural, rapid advancement
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:With the rapid advancement of deep learning, the model robustness has become a significant research hotspot, \ie, adversarial attacks on deep neural networks. Existing works primarily focus on image classification tasks, aiming to alter the model’s predicted labels. Due to the output complexity and deeper network architectures, research on adversarial examples for segmentation models is still limited, particularly for universal adversarial perturbations. In this paper, we propose a novel universal adversarial attack method designed for segmentation models, which includes dual feature separation and low-frequency scattering modules. The two modules guide the training of adversarial examples in the pixel and frequency space, respectively. Experiments demonstrate that our method achieves high attack success rates surpassing the state-of-the-art methods, and exhibits strong transferability across different models.
zh

[CV-160] Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising

【速读】：该论文试图解决复杂噪声图像去噪时细节恢复困难的问题，特别是通过近红外（NIR）图像辅助RGB图像去噪时，由于NIR和RGB图像之间的不一致性，现有方法难以平衡两者的贡献。解决方案的关键在于提出了跨域频率相关性利用网络（FCENet），并通过频率相关性先验（frequency correlation prior）揭示了NIR和RGB图像在频率域的互补性。基于此，论文设计了频率动态选择机制（FDSM）和频率穷举融合机制（FEFM），分别用于动态选择频率域的互补信息和增强融合过程中共同特征与差异特征的控制。实验结果表明，该方法在图像质量和计算效率上均优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.16645
作者: Yuchen Wang,Hongyuan Wang,Lizhi Wang,Xin Wang,Lin Zhu,Wanxuan Lu,Hua Huang
机构: Beijing Institute of Technology; Beijing Normal University; Chinese Academy of Sciences
关键词: NIR and RGB, single-image denoising algorithms, Existing single-image denoising, complex noisy images, RGB images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing single-image denoising algorithms often struggle to restore details when dealing with complex noisy images. The introduction of near-infrared (NIR) images offers new possibilities for RGB image denoising. However, due to the inconsistency between NIR and RGB images, the existing works still struggle to balance the contributions of two fields in the process of image fusion. In response to this, in this paper, we develop a cross-field Frequency Correlation Exploiting Network (FCENet) for NIR-assisted image denoising. We first propose the frequency correlation prior based on an in-depth statistical frequency analysis of NIR-RGB image pairs. The prior reveals the complementary correlation of NIR and RGB images in the frequency domain. Leveraging frequency correlation prior, we then establish a frequency learning framework composed of Frequency Dynamic Selection Mechanism (FDSM) and Frequency Exhaustive Fusion Mechanism (FEFM). FDSM dynamically selects complementary information from NIR and RGB images in the frequency domain, and FEFM strengthens the control of common and differential features during the fusion of NIR and RGB features. Extensive experiments on simulated and real data validate that our method outperforms various state-of-the-art methods in terms of image quality and computational efficiency. The code will be released to the public.
zh

[CV-161] Deep Learning for Spatio-Temporal Fusion in Land Surface Temperature Estimation: A Comprehensive Survey Experimental Analysis and Future Trends

【速读】：该论文试图解决高分辨率地表温度（Land Surface Temperature, LST）数据的获取难题，尤其是在卫星传感器面临空间分辨率与时间分辨率之间权衡的情况下。解决方案的关键在于利用时空融合（Spatio-Temporal Fusion, STF）技术，特别是基于深度学习（Deep Learning, DL）的方法，来整合高空间低时间分辨率和高时间低空间分辨率的卫星数据源。通过捕捉输入与输出LST数据之间的复杂非线性关系，DL方法能够有效提升LST数据的空间和时间分辨率。论文不仅综述了最新的DL-based STF技术进展，还提出了新的分类方法，并讨论了当前方法面临的挑战和未来研究方向，同时发布了首个用于LST估计的开源基准数据集。

链接: https://arxiv.org/abs/2412.16631
作者: Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
机构: INSA CVL, PRISME UR 4229, Bourges, 18022, Centre Val de Loire, France; Université d’Orléans, PRISME UR 4229, Orléans, 45067, Centre Val de Loire, France; Université d’Orléans, CEDETE, UR 1210, Orléans, 45067, Centre Val de Loire, France
关键词: Land Surface Temperature, Earth surface, satellite remote sensing, LST estimation, LST
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to the Proceedings of IEEE

点击查看摘要

Abstract:The rapid advancements in satellite remote sensing have enhanced the capability to monitor and analyze the Earth’s surface. Among the many variables captured through satellite sensors, Land Surface Temperature (LST) plays a critical role in understanding key environmental processes. However, obtaining high-resolution LST data remains a challenge, as satellite sensors often face a trade-off between spatial and temporal resolutions. In response, Spatio-Temporal Fusion (STF) has emerged as a powerful method to integrate two satellite data sources, one providing high spatial but low temporal resolution, and the other offering high temporal but low spatial resolution. Although a range of STF techniques have been proposed, from traditional methods to cutting-edge deep learning (DL) models, most have focused on surface reflectance, with limited application to LST estimation. DL approaches, in particular, show promise in improving the spatial and temporal resolutions of LST by capturing complex, non-linear relationships between input and output LST data. This paper offers a comprehensive review of the latest advancements in DL-based STF techniques for LST estimation. We analyze key research developments, mathematically formulate the STF problem, and introduce a novel taxonomy for DL-based STF methods. Furthermore, we discuss the challenges faced by current methods and highlight future research directions. In addition, we present the first open-source benchmark STF dataset for LST estimation, consisting of 51 pairs of MODIS-Landsat images spanning from 2013 to 2024. To support our findings, we conduct extensive experiments on state-of-the-art methods and present both quantitative and qualitative assessments. This is the first survey paper focused on DL-based STF for LST estimation. We hope it serves as a valuable reference for researchers and paves the way for future research in this field.
zh

[CV-162] Automated Bleeding Detection and Classification in Wireless Capsule Endoscopy with YOLOv8-X

【速读】：该论文试图解决胃肠道出血（GI bleeding）的检测问题，这是一种消化系统疾病的重要指标。解决方案的关键在于开发了一个统一的YOLOv8-X模型，用于无线胶囊内窥镜（WCE）图像中出血区域的检测和分类。通过精心策划和标注的数据集（包含6,345张多样化的图像），该模型在验证数据集上实现了96.10%的分类准确率和76.8%的平均精度（mAP），为胃肠道出血的准确检测提供了有效的技术手段。

链接: https://arxiv.org/abs/2412.16624
作者: Pavan C Shekar,Vivek Kanhangad,Shishir Maheshwari,T Sunil Kumar
机构: 未知
关键词: digestive system disorders, accurate detection methods, system disorders, Wireless Capsule Endoscopy, critical indicator
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, challenge

点击查看摘要

Abstract:Gastrointestinal (GI) bleeding, a critical indicator of digestive system disorders, re quires efficient and accurate detection methods. This paper presents our solution to the Auto-WCEBleedGen Version V1 Challenge, where we achieved the consolation position. We developed a unified YOLOv8-X model for both detection and classification of bleeding regions in Wireless Capsule Endoscopy (WCE) images. Our approach achieved 96.10% classification accuracy and 76.8% mean Average Precision (mAP) at 0.5 IoU on the val idation dataset. Through careful dataset curation and annotation, we assembled and trained on 6,345 diverse images to ensure robust model performance. Our implementa tion code and trained models are publicly available at this https URL.
zh

[CV-163] opology-Aware 3D Gaussian Splatting: Leveraging Persistent Homology for Optimized Structural Integrity

【速读】：该论文试图解决当前3D高斯拼接（Gaussian Splatting, GS）方法中的两个关键问题：由于初始几何覆盖不完全导致的像素级结构完整性受损，以及优化过程中拓扑约束不足导致的特征级完整性不足。解决方案的关键在于引入拓扑感知3D高斯拼接（Topology-Aware 3D Gaussian Splatting, Topology-GS），并通过两种创新方法克服这些问题：一是局部持久性Voronoi插值（Local Persistent Voronoi Interpolation, LPVI），利用持久同调（persistent homology）指导自适应插值，增强低曲率区域的点覆盖并保持拓扑结构；二是基于持久条码的拓扑正则化项（PersLoss），通过约束渲染图像与真实图像的拓扑特征距离，提升视觉感知相似性。这些方法在多个新视角合成基准测试中显著提升了PSNR、SSIM和LPIPS指标，同时保持了高效的内存使用。

链接: https://arxiv.org/abs/2412.16619
作者: Tianqi Shen,Shaohua Liu,Jiaqi Feng,Ziye Ma,Ning An
机构: 1. China Coal Technology & Engineering Group Big Data Research Institute(中国煤炭科技工程集团大数据研究院); 2. China University of Mining and Technology(中国矿业大学); 3. China Coal Technology & Engineering Group(中国煤炭科技工程集团); 4. China University of Mining and Technology(中国矿业大学); City University of Hong Kong(香港城市大学)
关键词: volumetric radiance fields, representing discrete volumetric, discrete volumetric radiance, Gaussian Splatting, radiance fields
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Algebraic Topology (math.AT); Geometric Topology (math.GT)
备注:

点击查看摘要

Abstract:Gaussian Splatting (GS) has emerged as a crucial technique for representing discrete volumetric radiance fields. It leverages unique parametrization to mitigate computational demands in scene optimization. This work introduces Topology-Aware 3D Gaussian Splatting (Topology-GS), which addresses two key limitations in current approaches: compromised pixel-level structural integrity due to incomplete initial geometric coverage, and inadequate feature-level integrity from insufficient topological constraints during optimization. To overcome these limitations, Topology-GS incorporates a novel interpolation strategy, Local Persistent Voronoi Interpolation (LPVI), and a topology-focused regularization term based on persistent barcodes, named PersLoss. LPVI utilizes persistent homology to guide adaptive interpolation, enhancing point coverage in low-curvature areas while preserving topological structure. PersLoss aligns the visual perceptual similarity of rendered images with ground truth by constraining distances between their topological features. Comprehensive experiments on three novel-view synthesis benchmarks demonstrate that Topology-GS outperforms existing methods in terms of PSNR, SSIM, and LPIPS metrics, while maintaining efficient memory usage. This study pioneers the integration of topology with 3D-GS, laying the groundwork for future research in this area.
zh

[CV-164] Concept Guided Co-saliency Objection Detection

【速读】：该论文试图解决传统协同显著目标检测 (Co-SOD) 方法在面对多样化的物体变化（如不同姿态）和背景噪声时表现不佳的问题。解决方案的关键在于提出了ConceptCoSOD，一种基于概念引导的新方法，通过利用文本语义信息来增强Co-SOD性能。具体来说，ConceptCoSOD将Co-SOD任务重新定义为(image-text)-to-image任务，首先捕捉图像组中的共享语义概念，然后利用这些概念在复杂场景中进行精确的物体分割。这种方法显著提高了检测精度，特别是在背景干扰和物体变化较大的挑战性场景中。

链接: https://arxiv.org/abs/2412.16609
作者: Jiayi Zhu,Qing Guo,Felix Juefei-Xu,Yihao Huang,Yang Liu,Geguang Pu
机构: East China Normal University, China; New York University, USA; IHPC & CFAR, Agency for Science, Technology and Research, Singapore; Nanyang Technological University, Singapore; Shanghai Industrial Control Safety Innovation Tech. Co., Ltd, China
关键词: examining shared visual, seeks to identify, identify common, shared visual features, visual features
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The task of co-saliency object detection (Co-SOD) seeks to identify common, salient objects across a collection of images by examining shared visual features. However, traditional Co-SOD methods often encounter limitations when faced with diverse object variations (e.g., different postures) and irrelevant background elements that introduce noise. To address these challenges, we propose ConceptCoSOD, a novel concept-guided approach that leverages text semantic information to enhance Co-SOD performance by guiding the model to focus on consistent object features. Through rethinking Co-SOD as an (image-text)-to-image task instead of an image-to-image task, ConceptCoSOD first captures shared semantic concepts within an image group and then uses them as guidance for precise object segmentation in complex scenarios. Experimental results on three benchmark datasets and six corruptions reveal that ConceptCoSOD significantly improves detection accuracy, especially in challenging settings with considerable background distractions and object variability.
zh

[CV-165] OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities

【速读】：该论文试图解决从少量全方位图像（omnidirectional images）快速生成前馈3D高斯喷射（feed-forward 3D Gaussian Splatting, 3DGS）的问题。现有前馈模型仅适用于透视图像，而全方位图像的独特光学特性使得特征编码器难以正确理解图像上下文，导致高斯分布在空间上不均匀，从而影响新视角下的图像质量。解决方案的关键在于提出OmniSplat模型，并引入阴阳网格（Yin-Yang grid）来分解图像，以缩小全方位图像与透视图像之间的领域差距。阴阳网格的准均匀特性使得分解后的图像接近透视图像，从而能够利用现有卷积神经网络（CNN）结构及其学习到的强先验知识，提升重建精度并增强全方位图像之间的分割一致性，实现快速且高质量的3DGS编辑。

链接: https://arxiv.org/abs/2412.16604
作者: Suyoung Lee,Jaeyoung Chung,Kihoon Kim,Jaeyoo Huh,Gunhee Lee,Minsoo Lee,Kyoung Mu Lee
机构: Dept. of ECE & ASRI(电子与计算机工程系 & 先进科学与技术研究所), Seoul National University(首尔国立大学); IPAI(智能与应用研究所), Seoul National University(首尔国立大学); LG AI Research(LG人工智能研究院)
关键词: needing per-scene optimization, gained significant popularity, significant popularity due, Gaussian Splatting, generate scenes immediately
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feed-forward 3D Gaussian Splatting (3DGS) models have gained significant popularity due to their ability to generate scenes immediately without needing per-scene optimization. Although omnidirectional images are getting more popular since they reduce the computation for image stitching to composite a holistic scene, existing feed-forward models are only designed for perspective images. The unique optical properties of omnidirectional images make it difficult for feature encoders to correctly understand the context of the image and make the Gaussian non-uniform in space, which hinders the image quality synthesized from novel views. We propose OmniSplat, a pioneering work for fast feed-forward 3DGS generation from a few omnidirectional images. We introduce Yin-Yang grid and decompose images based on it to reduce the domain gap between omnidirectional and perspective images. The Yin-Yang grid can use the existing CNN structure as it is, but its quasi-uniform characteristic allows the decomposed image to be similar to a perspective image, so it can exploit the strong prior knowledge of the learned feed-forward network. OmniSplat demonstrates higher reconstruction accuracy than existing feed-forward networks trained on perspective images. Furthermore, we enhance the segmentation consistency between omnidirectional images by leveraging attention from the encoder of OmniSplat, providing fast and clean 3DGS editing results.
zh

[CV-166] V"Mean"ba: Visual State Space Models only need 1 hidden dimension NEURIPS2024

【速读】：该论文试图解决视觉Transformer在图像处理任务中由于自注意力机制的二次复杂度导致的可扩展性受限问题，尤其是在资源受限设备上的部署挑战。解决方案的关键在于引入了一种名为VMeanba的无训练压缩方法，通过在状态空间模型(State Space Models, SSMs)中使用均值操作消除通道维度，从而优化计算复杂度。论文的核心观察是SSM块的输出激活在通道间具有低方差，VMeanba利用这一特性通过跨通道平均激活图来减少计算开销，同时保持精度。实验结果表明，该方法在图像分类和语义分割任务中实现了高达1.12倍的加速，且精度损失小于3%。

链接: https://arxiv.org/abs/2412.16602
作者: Tien-Yu Chi,Hung-Yueh Chiang,Chi-Chih Chang,Ning-Chi Huang,Kai-Chiang Wu
机构: National Yang Ming Chiao Tung University(阳明交通大学); The University of Texas at Austin(德克萨斯大学奥斯汀分校); Cornell University(康奈尔大学)
关键词: Vision transformers dominate, processing tasks due, transformers dominate image, dominate image processing, superior performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by NeurIPS 2024 Machine Learning for Systems workshop

点击查看摘要

Abstract:Vision transformers dominate image processing tasks due to their superior performance. However, the quadratic complexity of self-attention limits the scalability of these systems and their deployment on resource-constrained devices. State Space Models (SSMs) have emerged as a solution by introducing a linear recurrence mechanism, which reduces the complexity of sequence modeling from quadratic to linear. Recently, SSMs have been extended to high-resolution vision tasks. Nonetheless, the linear recurrence mechanism struggles to fully utilize matrix multiplication units on modern hardware, resulting in a computational bottleneck. We address this issue by introducing \textitVMeanba, a training-free compression method that eliminates the channel dimension in SSMs using mean operations. Our key observation is that the output activations of SSM blocks exhibit low variances across channels. Our \textitVMeanba leverages this property to optimize computation by averaging activation maps across the channel to reduce the computational overhead without compromising accuracy. Evaluations on image classification and semantic segmentation tasks demonstrate that \textitVMeanba achieves up to a 1.12x speedup with less than a 3% accuracy loss. When combined with 40% unstructured pruning, the accuracy drop remains under 3%.
zh

[CV-167] Leveraging Contrastive Learning for Semantic Segmentation with Consistent Labels Across Varying Appearances

【速读】：该论文试图解决在不同天气条件下城市场景的分割任务中的域适应和泛化问题。解决方案的关键在于引入了一个新颖的合成数据集，该数据集捕捉了多种天气条件下的城市场景，并提供了像素级精确、与地面实况对齐的图像，以促进跨域的特征对齐。此外，论文提出了一种利用每个场景的多个版本进行域适应和泛化的方法，通过在不同天气场景间强制特征一致性来提升性能。实验结果表明，该数据集在多个对齐指标上显著提升了性能，并为合成数据生成和域适应提供了一个新的范式。

链接: https://arxiv.org/abs/2412.16592
作者: Javier Montalvo,Roberto Alcover-Couso,Pablo Carballeira,Álvaro García-Martín,Juan C. SanMiguel,Marcos Escudero-Viñolo
机构: 未知
关键词: captures urban scenes, facilitate effective feature, providing pixel-perfect, effective feature alignment, paper introduces
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces a novel synthetic dataset that captures urban scenes under a variety of weather conditions, providing pixel-perfect, ground-truth-aligned images to facilitate effective feature alignment across domains. Additionally, we propose a method for domain adaptation and generalization that takes advantage of the multiple versions of each scene, enforcing feature consistency across different weather scenarios. Our experimental results demonstrate the impact of our dataset in improving performance across several alignment metrics, addressing key challenges in domain adaptation and generalization for segmentation tasks. This research also explores critical aspects of synthetic data generation, such as optimizing the balance between the volume and variability of generated images to enhance segmentation performance. Ultimately, this work sets forth a new paradigm for synthetic data generation and domain adaptation.
zh

[CV-168] REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation

【速读】：该论文试图解决视觉语言模型（Vision Language Models, VLMs）在地球观测（Earth Observation, EO）领域应用中的局限性问题，特别是其在地理和科学回归任务中的潜力未被充分挖掘。解决方案的关键在于引入了一个名为REO-Instruct的新型基准数据集，该数据集包含160万对多模态EO图像和语言数据，专门用于支持生物量回归和图像内容解释任务。基于此数据集，论文提出了REO-VLM模型，该模型不仅具备传统的生成功能，还整合了回归能力，通过语言驱动的推理结合科学领域知识，从而能够从EO数据中全面解释复杂的科学属性，显著提升了环境监测和资源管理的能力。

链接: https://arxiv.org/abs/2412.16583
作者: Xizhe Xue,Guoting Wei,Hao Chen,Haokui Zhang,Feng Lin,Chunhua Shen,Xiao Xiang Zhu
机构: Technical University of Munich, Germany; Zhejiang University, China; Lighthouse; Munich Center for Machine Learning
关键词: including Earth Observation, Earth Observation, catalyzed significant advancements, Vision Language Models, including Earth
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid evolution of Vision Language Models (VLMs) has catalyzed significant advancements in artificial intelligence, expanding research across various disciplines, including Earth Observation (EO). While VLMs have enhanced image understanding and data processing within EO, their applications have predominantly focused on image content description. This limited focus overlooks their potential in geographic and scientific regression tasks, which are essential for diverse EO applications. To bridge this gap, this paper introduces a novel benchmark dataset, called \textbfREO-Instruct to unify regression and generation tasks specifically for the EO domain. Comprising 1.6 million multimodal EO imagery and language pairs, this dataset is designed to support both biomass regression and image content interpretation tasks. Leveraging this dataset, we develop \textbfREO-VLM, a groundbreaking model that seamlessly integrates regression capabilities with traditional generative functions. By utilizing language-driven reasoning to incorporate scientific domain knowledge, REO-VLM goes beyond solely relying on EO imagery, enabling comprehensive interpretation of complex scientific attributes from EO data. This approach establishes new performance benchmarks and significantly enhances the capabilities of environmental monitoring and resource management.
zh

[CV-169] SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

【速读】：该论文试图解决协同语音动作生成中的问题，即如何在保持节奏一致性的基础上，增强语义动作的表达。解决方案的关键在于提出了一种名为SemTalk的方法，通过分别学习通用动作（general motions）和稀疏动作（sparse motions），并利用语义分数（semantic score）进行自适应融合。具体来说，论文首先通过节奏一致性学习（rhythmic consistency learning）建立与节奏相关的基本动作，确保动作与语音节奏同步；然后通过语义强调学习（semantic emphasis learning）生成语义感知的稀疏动作，聚焦于帧级别的语义线索；最后，利用学习到的语义分数将稀疏动作融入基本动作，生成具有语义强调的协同语音动作。实验结果表明，该方法在两个公开数据集上均优于现有技术，生成了高质量的协同语音动作，兼具稳定的节奏基础和丰富的语义表达。

链接: https://arxiv.org/abs/2412.16563
作者: Xiangyue Zhang,Jiangfang Li,Jiaxu Zhang,Ziqiang Dang,Jianqiang Ren,Liefeng Bo,Zhigang Tu
机构: Wuhan University(武汉大学); Alibaba(阿里巴巴); Zhejiang University(浙江大学)
关键词: co-speech motion generation, motion, good co-speech motion, common rhythmic motion, essential semantic motion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:A good co-speech motion generation cannot be achieved without a careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn general motions and sparse motions, and then adaptively fuse them. In particular, rhythmic consistency learning is explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, textitsemantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
zh

[CV-170] Semantics Prompting Data-Free Quantization for Low-Bit Vision Transformers

【速读】：该论文试图解决数据无损量化 (Data-free Quantization, DFQ) 在视觉变换器 (Vision Transformers, ViTs) 中应用时，由于现有方法生成的合成图像语义不足导致性能下降的问题。解决方案的关键在于提出了一种名为 SPDFQ 的语义提示数据无损量化方法，该方法通过三个核心技术提升合成图像的语义质量：首先，引入注意力先验对齐 (Attention Priors Alignment, APA)，利用随机生成的注意力先验增强合成图像的语义；其次，采用多语义强化 (Multi-Semantic Reinforcement, MSR)，通过局部化补丁优化促进高效参数化和多样化的语义；最后，使用软标签学习 (Softlabel Learning, SL)，通过适应性软学习目标鼓励更复杂的语义并适应由 MSR 增强的图像。实验结果表明，SPDFQ 在 ImageNet 上显著提升了 ViT-B 模型的性能，例如在 W4A4 配置下，top-1 准确率提高了 15.52%。

链接: https://arxiv.org/abs/2412.16553
作者: Yunshan Zhong,Yuyao Zhou,Yuxin Zhang,Shen Li,Yong Li,Fei Chao,Zhanpeng Zeng,Rongrong Ji
机构: Institute of Artificial Intelligence, Xiamen University; MAC Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University; Alibaba
关键词: model compression community, address increasing concerns, facilitates model quantization, garnered significant attention, data security
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data-free quantization (DFQ), which facilitates model quantization without real data to address increasing concerns about data security, has garnered significant attention within the model compression community. Recently, the unique architecture of vision transformers (ViTs) has driven the development of specialized DFQ techniques. However, we observe that the synthetic images from existing methods suffer from the deficient semantics issue compared to real images, thereby compromising performance. Motivated by this, we propose SPDFQ, a Semantics Prompting Data-Free Quantization method for ViTs. First, SPDFQ incorporates Attention Priors Alignment (APA), which uses randomly generated attention priors to enhance the semantics of synthetic images. Second, SPDFQ introduces Multi-Semantic Reinforcement (MSR), which utilizes localized patch optimization to prompt efficient parameterization and diverse semantics in synthetic images. Finally, SPDFQ employs Softlabel Learning (SL), where soft learning targets are adapted to encourage more complex semantics and accommodate images augmented by MSR. Experimental results demonstrate that SPDFQ significantly outperforms existing methods. For instance, SPDFQ achieves a 15.52% increase in top-1 accuracy on ImageNet for W4A4 ViT-B
zh

[CV-171] Diffusion Prior Interpolation for Flexibility Real-World Face Super-Resolution AAAI25

【速读】：该论文试图解决在无监督训练的情况下，基于先验的扩散模型在人脸超分辨率 (Face Super-Resolution, FSR) 任务中难以满足像素级精度要求和保持一致性的问题。解决方案的关键在于提出了一种名为扩散先验插值 (Diffusion Prior Interpolation, DPI) 的方法，通过引入强弱约束的掩码策略和迭代优化，结合条件校正器 (Condition Corrector, CRT) 来增强条件与样本之间的相互优化，从而在保持一致性的同时提升生成结果的质量和多样性。该方法能够无缝集成到预训练模型中，并在实验中表现出优于现有最先进 (SOTA) 方法的性能。

链接: https://arxiv.org/abs/2412.16552
作者: Jiarui Yang,Tao Dai,Yufei Zhu,Naiqi Li,Jinmin Li,Shutao Xia
机构: 1. School of Computer Science and Technology, Soochow University(苏州大学); 2. Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University(苏州大学); 3. School of Advanced Materials, Soochow University(苏州大学)
关键词: Diffusion models represent, generative modeling, Diffusion Prior Interpolation, diffusion models’ powerful, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI25

点击查看摘要

Abstract:Diffusion models represent the state-of-the-art in generative modeling. Due to their high training costs, many works leverage pre-trained diffusion models’ powerful representations for downstream tasks, such as face super-resolution (FSR), through fine-tuning or prior-based methods. However, relying solely on priors without supervised training makes it challenging to meet the pixel-level accuracy requirements of discrimination task. Although prior-based methods can achieve high fidelity and high-quality results, ensuring consistency remains a significant challenge. In this paper, we propose a masking strategy with strong and weak constraints and iterative refinement for real-world FSR, termed Diffusion Prior Interpolation (DPI). We introduce conditions and constraints on consistency by masking different sampling stages based on the structural characteristics of the face. Furthermore, we propose a condition Corrector (CRT) to establish a reciprocal posterior sampling process, enhancing FSR performance by mutual refinement of conditions and samples. DPI can balance consistency and diversity and can be seamlessly integrated into pre-trained models. In extensive experiments conducted on synthetic and real datasets, along with consistency validation in face recognition, DPI demonstrates superiority over SOTA FSR methods. The code is available at \urlthis https URL.
zh

[CV-172] FairDD: Enhancing Fairness with domain-incremental learning in dermatological disease diagnosis

【速读】：该论文试图解决在皮肤病诊断中深度学习模型面临的决策偏差问题，特别是在数据驱动方法中公平性与准确性之间的权衡。解决方案的关键在于提出了一个名为FairDD的新型公平皮肤病诊断网络，通过利用领域增量学习（domain incremental learning）来平衡不同群体的学习，并敏感地应对数据分布的变化。此外，该方法结合了mixup数据增强技术和监督对比学习（supervised contrastive learning），以增强网络的鲁棒性和泛化能力。实验结果表明，FairDD在公平性和性能之间的权衡上表现优异。

链接: https://arxiv.org/abs/2412.16542
作者: Yiqin Luo,Tianlong Gu
机构: College of Information Science and Technology; Engineering Research Center of Trustworthy AI (Ministry of Education); College of Cyber Security
关键词: deep learning technologies, artificial intelligence, rapid advancement, advancement of deep, increasingly prevalent
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:With the rapid advancement of deep learning technologies, artificial intelligence has become increasingly prevalent in the research and application of dermatological disease diagnosis. However, this data-driven approach often faces issues related to decision bias. Existing fairness enhancement techniques typically come at a substantial cost to accuracy. This study aims to achieve a better trade-off between accuracy and fairness in dermatological diagnostic models. To this end, we propose a novel fair dermatological diagnosis network, named FairDD, which leverages domain incremental learning to balance the learning of different groups by being sensitive to changes in data distribution. Additionally, we incorporate the mixup data augmentation technique and supervised contrastive learning to enhance the network’s robustness and generalization. Experimental validation on two dermatological datasets demonstrates that our proposed method excels in both fairness criteria and the trade-off between fairness and performance.
zh

[CV-173] Prior2Posterior: Model Prior Correction for Long-Tailed Learning WACV

【速读】：该论文试图解决长尾识别问题中，基于学习的解决方案在平衡测试数据集上泛化能力不足的问题。由于数据不平衡的先验（imbalanced prior），模型在学习过程中倾向于最频繁（head）类别，导致最不频繁（tail）类别的性能较差。解决方案的关键在于通过消除基于类别样本数（frequencies）建模的不平衡先验的影响，来纠正这种偏差。论文提出了一种新颖的方法，通过使用训练后模型的后验概率（posteriori probabilities）来准确建模有效先验（effective prior），并提出了一种后验概率调整方法（Prior2Posterior: P2P），在训练后以事后方式（post-hoc）调整预测的后验概率，从而提高模型性能。该方法在理论上证明了其在使用朴素交叉熵损失和logit调整损失训练的模型中的最优性，并在多个长尾基准数据集上实现了新的最先进（SOTA）性能。此外，该方法还可以用于检查现有方法，捕捉有效先验并消除残留偏差，从而在不重新训练模型的情况下进一步提高其性能。

链接: https://arxiv.org/abs/2412.16540
作者: S Divakar Bhat,Amit More,Mudit Soni,Surbhi Agrawal
机构: Honda R&D Co., Ltd.(本田研发有限公司)
关键词: long-tailed recognition face, recognition face difficulties, Learning-based solutions, balanced test datasets, prior
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Learning-based solutions for long-tailed recognition face difficulties in generalizing on balanced test datasets. Due to imbalanced data prior, the learned \textita posteriori distribution is biased toward the most frequent (head) classes, leading to an inferior performance on the least frequent (tail) classes. In general, the performance can be improved by removing such a bias by eliminating the effect of imbalanced prior modeled using the number of class samples (frequencies). We first observe that the \textiteffective prior on the classes, learned by the model at the end of the training, can differ from the empirical prior obtained using class frequencies. Thus, we propose a novel approach to accurately model the effective prior of a trained model using \textita posteriori probabilities. We propose to correct the imbalanced prior by adjusting the predicted \textita posteriori probabilities (Prior2Posterior: P2P) using the calculated prior in a post-hoc manner after the training, and show that it can result in improved model performance. We present theoretical analysis showing the optimality of our approach for models trained with naive cross-entropy loss as well as logit adjusted loss. Our experiments show that the proposed approach achieves new state-of-the-art (SOTA) on several benchmark datasets from the long-tail literature in the category of logit adjustment methods. Further, the proposed approach can be used to inspect any existing method to capture the \textiteffective prior and remove any residual bias to improve its performance, post-hoc, without model retraining. We also show that by using the proposed post-hoc approach, the performance of many existing methods can be improved further.
zh

[CV-174] LLaVA-SLT: Visual Language Tuning for Sign Language Translation

【速读】：该论文试图解决手语翻译 (Sign Language Translation, SLT) 领域中依赖昂贵的gloss注释数据集的问题。解决方案的关键在于引入LLaVA-SLT，这是一个开创性的大规模多模态模型 (Large Multimodal Model, LMM) 框架，通过有效学习视觉语言嵌入来利用大规模语言模型 (Large Language Models, LLMs) 的能力。具体步骤包括：1) 语言继续预训练，通过扩展LLM并适应手语领域，增强其对手语文本语言知识的理解；2) 视觉对比预训练，将视觉编码器与大规模预训练的文本编码器对齐，并提出分层视觉编码器以学习与LLM标记嵌入兼容的稳健词级中间表示；3) 视觉语言调优，通过冻结预训练模型并使用轻量级可训练的MLP连接器，将预训练的视觉语言嵌入高效映射到LLM标记嵌入空间，从而实现下游SLT任务。实验结果表明，LLaVA-SLT在性能上超越了现有最先进的方法，并接近基于gloss的翻译准确性。

链接: https://arxiv.org/abs/2412.16524
作者: Han Liang,Chengyu Huang,Yuecheng Xu,Cheng Tang,Weicai Ye,Juze Zhang,Xin Chen,Jingyi Yu,Lan Xu
机构: ShanghaiTech University; ByteDance; Zhejiang University
关键词: costly gloss-annotated datasets, Sign Language Translation, Large Multimodal Model, reliance on costly, significant barrier
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the realm of Sign Language Translation (SLT), reliance on costly gloss-annotated datasets has posed a significant barrier. Recent advancements in gloss-free SLT methods have shown promise, yet they often largely lag behind gloss-based approaches in terms of translation accuracy. To narrow this performance gap, we introduce LLaVA-SLT, a pioneering Large Multimodal Model (LMM) framework designed to leverage the power of Large Language Models (LLMs) through effectively learned visual language embeddings. Our model is trained through a trilogy. First, we propose linguistic continued pretraining. We scale up the LLM and adapt it to the sign language domain using an extensive corpus dataset, effectively enhancing its textual linguistic knowledge about sign language. Then, we adopt visual contrastive pretraining to align the visual encoder with a large-scale pretrained text encoder. We propose hierarchical visual encoder that learns a robust word-level intermediate representation that is compatible with LLM token embeddings. Finally, we propose visual language tuning. We freeze pretrained models and employ a lightweight trainable MLP connector. It efficiently maps the pretrained visual language embeddings into the LLM token embedding space, enabling downstream SLT task. Our comprehensive experiments demonstrate that LLaVA-SLT outperforms the state-of-the-art methods. By using extra annotation-free data, it even closes to the gloss-based accuracy.
zh

[CV-175] Enhancing Contrastive Learning Inspired by the Philosophy of “The Blind Men and the Elephant” AAAI2025

【速读】：该论文试图解决自监督视觉表示学习中对比学习方法的有效性问题，特别是如何通过设计更有效的数据增强策略来提升特征表示的质量。解决方案的关键在于引入JointCrop和JointBlur方法，这两种方法通过利用两个数据增强参数的联合分布生成更具挑战性的正样本对，从而使对比学习能够获得更有效的特征表示。这是首次将两个数据增强参数的联合分布明确引入对比学习框架，且该方法作为一个即插即用的框架，无需额外的计算开销，显著提升了SimCLR、BYOL、MoCo系列、SimSiam和Dino等基线模型的性能。

链接: https://arxiv.org/abs/2412.16522
作者: Yudong Zhang,Ruobing Xie,Jiansheng Chen,Xingwu Sun,Zhanhui Kang,Yu Wang
机构: 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
2. Baidu Inc., Beijing, China(百度公司，北京，中国);
3. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳校区计算机科学与技术学院，深圳，中国);
4. School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China(哈尔滨工业大学威海校区计算机科学与技术学院，威海，中国)
关键词: typically generating positive, self-supervised vision representation, Contrastive learning, generating positive pairs, vision representation learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Contrastive learning is a prevalent technique in self-supervised vision representation learning, typically generating positive pairs by applying two data augmentations to the same image. Designing effective data augmentation strategies is crucial for the success of contrastive learning. Inspired by the story of the blind men and the elephant, we introduce JointCrop and JointBlur. These methods generate more challenging positive pairs by leveraging the joint distribution of the two augmentation parameters, thereby enabling contrastive learning to acquire more effective feature representations. To the best of our knowledge, this is the first effort to explicitly incorporate the joint distribution of two data augmentation parameters into contrastive learning. As a plug-and-play framework without additional computational overhead, JointCrop and JointBlur enhance the performance of SimCLR, BYOL, MoCo v1, MoCo v2, MoCo v3, SimSiam, and Dino baselines with notable improvements.
zh

[CV-176] Anchor Learning with Potential Cluster Constraints for Multi-view Clustering

【速读】：该论文试图解决现有基于锚点的多视图聚类 (Anchor-based Multi-view Clustering, MVC) 方法中，锚点生成不均匀且可能散布在簇外的问题。解决方案的关键在于提出了一个新的方法，称为基于潜在簇约束的锚点学习 (Anchor Learning with Potential Cluster Constraints, ALPC)。该方法通过建立共享的潜在语义模块，确保锚点从特定簇中均匀生成，并通过调整锚点图来增强锚点的代表性和区分性，从而捕捉样本和锚点的共同聚类中心。最终，ALPC将锚点学习和图构建整合为一个统一的框架，进行协同学习和相互优化，以提升聚类性能。

链接: https://arxiv.org/abs/2412.16519
作者: Yawei Chen,Huibing Wang,Jinjia Peng,Yang Wang
机构: 1. School of Computer Science and Technology, Tianjin University, Tianjin, China(天津大学计算机科学与技术学院，天津，中国);
2. School of Computer Science and Technology, Shandong University, Jinan, China(山东大学计算机科学与技术学院，济南，中国);
3. School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China(南京理工大学计算机科学与技术学院，南京，中国)
关键词: Anchor-based multi-view clustering, Anchor-based multi-view, received extensive attention, extensive attention due, Potential Cluster Constraints
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Anchor-based multi-view clustering (MVC) has received extensive attention due to its efficient performance. Existing methods only focus on how to dynamically learn anchors from the original data and simultaneously construct anchor graphs describing the relationships between samples and perform clustering, while ignoring the reality of anchors, i.e., high-quality anchors should be generated uniformly from different clusters of data rather than scattered outside the clusters. To deal with this problem, we propose a noval method termed Anchor Learning with Potential Cluster Constraints for Multi-view Clustering (ALPC) method. Specifically, ALPC first establishes a shared latent semantic module to constrain anchors to be generated from specific clusters, and subsequently, ALPC improves the representativeness and discriminability of anchors by adapting the anchor graph to capture the common clustering center of mass from samples and anchors, respectively. Finally, ALPC combines anchor learning and graph construction into a unified framework for collaborative learning and mutual optimization to improve the clustering performance. Extensive experiments demonstrate the effectiveness of our proposed method compared to some state-of-the-art MVC methods. Our source code is available at this https URL.
zh

[CV-177] rojFlow: Flow Models are Natural Targets for Trojan Attacks

【速读】：该论文试图解决生成式模型（Flow-based generative models, FMs）在面对特洛伊木马/后门攻击（Trojan/Backdoor attacks）时的脆弱性问题。解决方案的关键在于提出了一种名为TrojFlow的攻击方法，该方法利用FMs能够拟合任意两个分布的特性，简化了攻击的训练和采样过程，从而使得FMs成为后门攻击的自然目标。通过在CIFAR-10和CelebA数据集上的实验，论文验证了TrojFlow能够高效地破坏FMs，并突破现有的防御机制。

链接: https://arxiv.org/abs/2412.16512
作者: Zhengyang Qi,Xiaohua Xu
机构: University of Science and Technology of China (中国科学技术大学)
关键词: Flow-based generative models, sampling process makes, Flow-based generative, noise to data, rapidly advanced
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:Flow-based generative models (FMs) have rapidly advanced as a method for mapping noise to data, its efficient training and sampling process makes it widely applicable in various fields. FMs can be viewed as a variant of diffusion models (DMs). At the same time, previous studies have shown that DMs are vulnerable to Trojan/Backdoor attacks, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. We found that Trojan attacks on generative models are essentially equivalent to image transfer tasks from the backdoor distribution to the target distribution, the unique ability of FMs to fit any two arbitrary distributions significantly simplifies the training and sampling setups for attacking FMs, making them inherently natural targets for backdoor attacks. In this paper, we propose TrojFlow, exploring the vulnerabilities of FMs through Trojan attacks. In particular, we consider various attack settings and their combinations and thoroughly explore whether existing defense methods for DMs can effectively defend against our proposed attack scenarios. We evaluate TrojFlow on CIFAR-10 and CelebA datasets, our experiments show that our method can compromise FMs with high utility and specificity, and can easily break through existing defense mechanisms.
zh

[CV-178] Context-Aware Outlier Rejection for Robust Multi-View 3D Tracking of Similar Small Birds in An Outdoor Aviary

【速读】：该论文试图解决在户外鸟舍中对多只鸟进行稳健的3D跟踪的问题，特别是针对视觉上相似的鸟类及其快速运动带来的挑战。解决方案的关键在于利用环境地标进行增强的特征匹配和3D重建，通过基于最近地标的异常值剔除来实现精确的3D建模和多鸟同时跟踪。这种方法显著提高了对视觉上相似鸟类的区分能力，解决了现有跟踪系统中的关键障碍，并在实验中展示了其有效性，3D重建过程中的异常值减少了20%，匹配准确率达到97%。

链接: https://arxiv.org/abs/2412.16511
作者: Keon Moradi,Ethan Haque,Jasmeen Kaur,Alexandra B. Bentz,Eli S. Bridge,Golnaz Habibi
机构: The University of Oklahoma (俄克拉荷马大学)
关键词: paper presents, multiple birds, visually similar birds, birds, tracking of multiple
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a novel approach for robust 3D tracking of multiple birds in an outdoor aviary using a multi-camera system. Our method addresses the challenges of visually similar birds and their rapid movements by leveraging environmental landmarks for enhanced feature matching and 3D reconstruction. In our approach, outliers are rejected based on their nearest landmark. This enables precise 3D-modeling and simultaneous tracking of multiple birds. By utilizing environmental context, our approach significantly improves the differentiation between visually similar birds, a key obstacle in existing tracking systems. Experimental results demonstrate the effectiveness of our method, showing a 20% elimination of outliers in the 3D reconstruction process, with a 97% accuracy in matching. This remarkable accuracy in 3D modeling translates to robust and reliable tracking of multiple birds, even in challenging outdoor conditions. Our work not only advances the field of computer vision but also provides a valuable tool for studying bird behavior and movement patterns in natural settings. We also provide a large annotated dataset of 80 birds residing in four enclosures for 20 hours of footage which provides a rich testbed for researchers in computer vision, ornithologists, and ecologists. Code and the link to the dataset is available at this https URL
zh

[CV-179] Unsupervised Domain Adaptive Person Search via Dual Self-Calibration

【速读】：该论文试图解决无监督域自适应（Unsupervised Domain Adaptive, UDA）行人搜索中由于域间差异导致的伪标签噪声问题，关键解决方案是提出了一个双自校准（Dual Self-Calibration, DSCA）框架。该框架通过图像级和实例级特征视角来有效消除噪声伪标签的干扰。具体来说，首先引入感知驱动自适应滤波器（Perception-Driven Adaptive Filter, PDAF），通过动态预测滤波阈值来消除噪声伪框和背景干扰，从而聚焦于前景目标。其次，提出聚类代理表示（Cluster Proxy Representation, CPR）模块，优化聚类表示的更新策略，减少误识别实例对聚类的污染，从而简化无标签目标域的训练过程。这些设计使得该方法在两个基准数据集上达到了最先进的性能，甚至超越了一些全监督方法。

链接: https://arxiv.org/abs/2412.16506
作者: Linfeng Qi,Huibing Wang,Jiqing Zhang,Jinjia Peng,Yang Wang
机构: 未知
关键词: UDA person search, Unsupervised Domain Adaptive, person search focuses, UDA person, effective UDA person
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptive (UDA) person search focuses on employing the model trained on a labeled source domain dataset to a target domain dataset without any additional annotations. Most effective UDA person search methods typically utilize the ground truth of the source domain and pseudo-labels derived from clustering during the training process for domain adaptation. However, the performance of these approaches will be significantly restricted by the disrupting pseudo-labels resulting from inter-domain disparities. In this paper, we propose a Dual Self-Calibration (DSCA) framework for UDA person search that effectively eliminates the interference of noisy pseudo-labels by considering both the image-level and instance-level features perspectives. Specifically, we first present a simple yet effective Perception-Driven Adaptive Filter (PDAF) to adaptively predict a dynamic filter threshold based on input features. This threshold assists in eliminating noisy pseudo-boxes and other background interference, allowing our approach to focus on foreground targets and avoid indiscriminate domain adaptation. Besides, we further propose a Cluster Proxy Representation (CPR) module to enhance the update strategy of cluster representation, which mitigates the pollution of clusters from misidentified instances and effectively streamlines the training process for unlabeled target domains. With the above design, our method can achieve state-of-the-art (SOTA) performance on two benchmark datasets, with 80.2% mAP and 81.7% top-1 on the CUHK-SYSU dataset, with 39.9% mAP and 81.6% top-1 on the PRW dataset, which is comparable to or even exceeds the performance of some fully supervised methods. Our source code is available at this https URL.
zh

[CV-180] First-frame Supervised Video Polyp Segmentation via Propagative and Semantic Dual-teacher Network ICASSP2024

【速读】：该论文试图解决视频息肉分割中高昂的逐帧标注成本问题，特别是针对长时间视频和大规模数据集。解决方案的关键在于提出了一个新的任务——First-Frame Supervised Video Polyp Segmentation (FSVPS)，并设计了一种新颖的Propagative and Semantic Dual-Teacher Network (PSDNet)。PSDNet采用双教师框架，其中传播教师通过通用对象跟踪器将首帧标注传播到后续帧作为伪标签，而语义教师则通过学生模型的指数移动平均来生成更稳定和时间不变的伪标签。通过精心设计的反向传播策略，PSDNet能够评估伪标签的质量，并确保高质量的伪标签在空间上与首帧标注对齐，从而实现更精确的教师到学生的知识传递和分割性能的提升。

链接: https://arxiv.org/abs/2412.16503
作者: Qiang Hu,Mei Liu,Qiang Li,Zhiwei Wang
机构: Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology; Tongji Medical College, Huazhong University of Science and Technology
关键词: Automatic video polyp, gastrointestinal cancer screening, video polyp segmentation, Automatic video, polyp segmentation plays
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2024. Code and models: this https URL

点击查看摘要

Abstract:Automatic video polyp segmentation plays a critical role in gastrointestinal cancer screening, but the cost of frameby-frame annotations is prohibitively high. While sparse-frame supervised methods have reduced this burden proportionately, the cost remains overwhelming for long-duration videos and large-scale datasets. In this paper, we, for the first time, reduce the annotation cost to just a single frame per polyp video, regardless of the video’s length. To this end, we introduce a new task, First-Frame Supervised Video Polyp Segmentation (FSVPS), and propose a novel Propagative and Semantic Dual-Teacher Network (PSDNet). Specifically, PSDNet adopts a teacher-student framework but employs two distinct types of teachers: the propagative teacher and the semantic teacher. The propagative teacher is a universal object tracker that propagates the first-frame annotation to subsequent frames as pseudo labels. However, tracking errors may accumulate over time, gradually degrading the pseudo labels and misguiding the student model. To address this, we introduce the semantic teacher, an exponential moving average of the student model, which produces more stable and time-invariant pseudo labels. PSDNet merges the pseudo labels from both teachers using a carefully-designed back-propagation strategy. This strategy assesses the quality of the pseudo labels by tracking them backward to the first frame. High-quality pseudo labels are more likely to spatially align with the firstframe annotation after this backward tracking, ensuring more accurate teacher-to-student knowledge transfer and improved segmentation performance. Benchmarking on SUN-SEG, the largest VPS dataset, demonstrates the competitive performance of PSDNet compared to fully-supervised approaches, and its superiority over sparse-frame supervised state-of-the-arts with a minimum improvement of 4.5% in Dice score.
zh

[CV-181] Autonomous Crack Detection using Deep Learning on Synthetic Thermogram Datasets

【速读】：该论文试图解决在钢板裂纹检测中，传统依赖人工干预和大量实验数据生成的问题。解决方案的关键在于通过有限元模拟（Finite Element Simulations）构建一个合成数据生成管道，并结合数据增强技术（data augmentation techniques）来增加数据的多样性和数量，从而减少对大量真实实验数据的依赖。这种方法不仅降低了数据生成的成本和时间，还提高了模型的泛化能力，并通过实验验证了其在实际场景中的有效性。

链接: https://arxiv.org/abs/2412.16499
作者: Chinmay Makarand Pimpalkhare,D. N. Pawaskar
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
关键词: number of experiments, lot of scientific, extensive number, Convolutional Neural Netowrks, data
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 14 figures

点击查看摘要

Abstract:In a lot of scientific problems, there is the need to generate data through the running of an extensive number of experiments. Further, some tasks require constant human intervention. We consider the problem of crack detection in steel plates. The way in which this generally happens is through humans looking at an image of the thermogram generated by heating the plate and classifying whether it is cracked or not. There has been a rise in the use of Artificial Intelligence (AI) based methods which try to remove the requirement of a human from this loop by using algorithms such as Convolutional Neural Netowrks (CNN)s as a proxy for the detection process. The issue is that CNNs and other vision models are generally very data-hungry and require huge amounts of data before they can start performing well. This data generation process is not very easy and requires innovation in terms of mechanical and electronic design of the experimental setup. It further requires massive amount of time and energy, which is difficult in resource-constrained scenarios. We try to solve exactly this problem, by creating a synthetic data generation pipeline based on Finite Element Simulations. We employ data augmentation techniques on this data to further increase the volume and diversity of data generated. The working of this concept is shown via performing inference on fine-tuned vision models and we have also validated the results by checking if our approach translates to realistic experimental data. We show the conditions where this translation is successful and how we can go about achieving that.
zh

[CV-182] Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance

【速读】：该论文试图解决多角色（multi-character）同时出现在场景中的视频生成问题，现有的方法主要集中在单一对象的视频生成，忽略了多角色共存的现实情况。解决方案的关键在于提出了一种无需微调（tuning-free）的多角色视频生成框架，基于分离的文本和姿态引导。具体来说，首先从姿态序列中提取角色掩码（character masks）以确定每个生成角色的空间位置，然后通过大型语言模型（LLMs）为每个角色获取精确的文本提示。此外，提出了空间对齐的交叉注意力机制（spatial-aligned cross attention）和多分支控制模块（multi-branch control module），以生成细粒度可控的多角色视频。实验结果表明，该方法在多角色生成方面具有精确的可控性，并在性能上优于以往的工作。

链接: https://arxiv.org/abs/2412.16495
作者: Beiyuan Zhang,Yue Ma,Chunlei Fu,Xinyang Song,Zhenan Sun,Ziqiang Li
机构: Chongqing University(重庆大学); The Hong Kong University of Science and Technology(香港科技大学); University of Chinese Academy of Sciences(中国科学院大学); New Laboratory of Pattern Recognition (NLPR)(模式识别新实验室); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); Shanghai Jiao Tong University(上海交通大学)
关键词: Text-editable and pose-controllable, pose-controllable character video, practical applications, challenging but prevailing, prevailing topic
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 5 pages,conference

点击查看摘要

Abstract:Text-editable and pose-controllable character video generation is a challenging but prevailing topic with practical applications. However, existing approaches mainly focus on single-object video generation with pose guidance, ignoring the realistic situation that multi-character appear concurrently in a scenario. To tackle this, we propose a novel multi-character video generation framework in a tuning-free manner, which is based on the separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position for each generating character, and then single prompts for each character are obtained with LLMs for precise text guidance. Moreover, the spatial-aligned cross attention and multi-branch control module are proposed to generate fine grained controllable multi-character video. The visualized results of generating video demonstrate the precise controllability of our method for multi-character generation. We also verify the generality of our method by applying it to various personalized T2I models. Moreover, the quantitative results show that our approach achieves superior performance compared with previous works.
zh

[CV-183] Cross-View Consistency Regularisation for Knowledge Distillation

【速读】：该论文试图解决基于logit的知识蒸馏（Knowledge Distillation, KD）方法中的两个主要问题：教师模型的过度自信（overconfident teacher）和确认偏差（confirmation bias）。解决方案的关键在于引入了视内正则化（within-view regularisation）和跨视正则化（cross-view regularisation），并结合基于置信度的软标签挖掘（confidence-based soft label mining），以改进教师模型传递的信号质量，从而缓解确认偏差问题。这些方法被整合到一致性正则化基础的logit蒸馏（Consistency-Regularisation-based Logit Distillation, CRLD）框架中，显著提升了学生模型的学习效果，并在多个基准数据集上取得了新的最先进结果，同时不增加额外的网络参数。

链接: https://arxiv.org/abs/2412.16493
作者: Weijia Zhang,Dongnan Liu,Weidong Cai,Chao Ma
机构: Shanghai Jiao Tong University(上海交通大学); University of Sydney(悉尼大学)
关键词: transferring privileged knowledge, privileged knowledge, established paradigm, paradigm for transferring, transferring privileged
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM Multimedia 2024

点击查看摘要

Abstract:Knowledge distillation (KD) is an established paradigm for transferring privileged knowledge from a cumbersome model to a lightweight and efficient one. In recent years, logit-based KD methods are quickly catching up in performance with their feature-based counterparts. However, previous research has pointed out that logit-based methods are still fundamentally limited by two major issues in their training process, namely overconfident teacher and confirmation bias. Inspired by the success of cross-view learning in fields such as semi-supervised learning, in this work we introduce within-view and cross-view regularisations to standard logit-based distillation frameworks to combat the above cruxes. We also perform confidence-based soft label mining to improve the quality of distilling signals from the teacher, which further mitigates the confirmation bias problem. Despite its apparent simplicity, the proposed Consistency-Regularisation-based Logit Distillation (CRLD) significantly boosts student learning, setting new state-of-the-art results on the standard CIFAR-100, Tiny-ImageNet, and ImageNet datasets across a diversity of teacher and student architectures, whilst introducing no extra network parameters. Orthogonal to on-going logit-based distillation research, our method enjoys excellent generalisation properties and, without bells and whistles, boosts the performance of various existing approaches by considerable margins.
zh

[CV-184] ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition AAAI2025

【速读】：该论文试图解决Vision Transformers (ViTs) 由于多头部自注意力机制 (Multi-head Self-Attention, MHSA) 导致的巨大计算成本问题，特别是在实际应用中加速ViTs的需求。解决方案的关键在于提出了一种新的重标记化策略，称为ImagePiece。该策略通过遵循自然语言处理中的MaxMatch策略，将语义不足但局部一致的标记分组，直到它们传达出足够的语义信息。这种简单的重标记化方法与之前的标记减少技术高度兼容，能够显著缩小相关标记的数量，从而大幅提升推理速度，例如在DeiT-S模型上实现了54%的速度提升，同时ImageNet分类准确率提高了0.39%。在超高速推理场景下，该方法相比其他基线方法在准确率上高出8%以上。

链接: https://arxiv.org/abs/2412.16491
作者: Seungdong Yoa,Seungjun Lee,Hyeseung Cho,Bumsoo Kim,Woohyung Lim
机构: Seungdong Yoa1; Seungjun Lee1; Hyeseung Cho1; Bumsoo Kim2; Woohyung Lim1
关键词: achieved remarkable success, computer vision tasks, Vision Transformers, achieved remarkable, remarkable success
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved remarkable success in various computer vision tasks. However, ViTs have a huge computational cost due to their inherent reliance on multi-head self-attention (MHSA), prompting efforts to accelerate ViTs for practical applications. To this end, recent works aim to reduce the number of tokens, mainly focusing on how to effectively prune or merge them. Nevertheless, since ViT tokens are generated from non-overlapping grid patches, they usually do not convey sufficient semantics, making it incompatible with efficient ViTs. To address this, we propose ImagePiece, a novel re-tokenization strategy for Vision Transformers. Following the MaxMatch strategy of NLP tokenization, ImagePiece groups semantically insufficient yet locally coherent tokens until they convey meaning. This simple retokenization is highly compatible with previous token reduction methods, being able to drastically narrow down relevant tokens, enhancing the inference speed of DeiT-S by 54% (nearly 1.5 \times faster) while achieving a 0.39% improvement in ImageNet classification accuracy. For hyper-speed inference scenarios (with 251% acceleration), our approach surpasses other baselines by an accuracy over 8%.
zh

[CV-185] rusted Mamba Contrastive Network for Multi-View Clustering ICASSP2025

【速读】：该论文试图解决多视图聚类中的不可信融合问题，主要原因包括当前方法忽视了视图中的噪声或冗余信息，以及对比学习的相似性来源于同一样本而非同一簇。解决方案的关键在于提出了Trusted Mamba Contrastive Network (TMCN)，并通过Trusted Mamba Fusion Network (TMFN)实现多视图数据的可信融合，同时利用Average-similarity Contrastive Learning (AsCL)模块对融合表示和视图特定表示进行对齐，从而提高来自同一簇的视图表示的相似性。

链接: https://arxiv.org/abs/2412.16487
作者: Jian Zhu,Xin Zou,Lei Liu,Zhangmin Huang,Ying Zhang,Chang Tang,Li-Rong Dai
机构: University of Science and Technology of China; China University of Geosciences; Zhejiang Lab
关键词: Multi-view clustering, deep multi-view clustering, recent years, Trusted Mamba Contrastive, Mamba Contrastive Network
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing(ICASSP2025)

点击查看摘要

Abstract:Multi-view clustering can partition data samples into their categories by learning a consensus representation in an unsupervised way and has received more and more attention in recent years. However, there is an untrusted fusion problem. The reasons for this problem are as follows: 1) The current methods ignore the presence of noise or redundant information in the view; 2) The similarity of contrastive learning comes from the same sample rather than the same cluster in deep multi-view clustering. It causes multi-view fusion in the wrong direction. This paper proposes a novel multi-view clustering network to address this problem, termed as Trusted Mamba Contrastive Network (TMCN). Specifically, we present a new Trusted Mamba Fusion Network (TMFN), which achieves a trusted fusion of multi-view data through a selective mechanism. Moreover, we align the fused representation and the view-specific representation using the Average-similarity Contrastive Learning (AsCL) module. AsCL increases the similarity of view presentation from the same cluster, not merely from the same sample. Extensive experiments show that the proposed method achieves state-of-the-art results in deep multi-view clustering tasks.
zh

[CV-186] Flash3D: Super-scaling Point Transformers through Joint Hardware-Geometry Locality

【速读】：该论文试图解决当前点云（point cloud）骨干网络在统一几何局部性（geometric locality）、注意力机制（attention mechanisms）和GPU架构方面存在的不足。解决方案的关键在于引入Flash3D Transformer，通过基于完美空间哈希（Perfect Spatial Hashing, PSH）的原理性局部性机制，将几何局部性与GPU的平铺（tiling）对齐。这种对齐使得PSH局部性机制能够与FlashAttention自然融合，且几乎不增加额外成本。该机制为骨干网络提供了灵活的设计选择，从而在下游任务中实现了更优的性能，包括在基准数据集上超越了现有的PTv3结果，提供了2.25倍的加速和2.4倍的内存效率提升。

链接: https://arxiv.org/abs/2412.16481
作者: Liyan Chen,Gregory P. Meyer,Zaiwei Zhang,Eric M. Wolff,Paul Vernaza
机构: The University of Texas at Austin(德克萨斯大学奥斯汀分校); Cruise LLC(克鲁斯有限责任公司)
关键词: Recent efforts recognize, Recent efforts, Perfect Spatial Hashing, efforts recognize, recognize the power
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent efforts recognize the power of scale in 3D learning (e.g. PTv3) and attention mechanisms (e.g. FlashAttention). However, current point cloud backbones fail to holistically unify geometric locality, attention mechanisms, and GPU architectures in one view. In this paper, we introduce Flash3D Transformer, which aligns geometric locality and GPU tiling through a principled locality mechanism based on Perfect Spatial Hashing (PSH). The common alignment with GPU tiling naturally fuses our PSH locality mechanism with FlashAttention at negligible extra cost. This mechanism affords flexible design choices throughout the backbone that result in superior downstream task results. Flash3D outperforms state-of-the-art PTv3 results on benchmark datasets, delivering a 2.25x speed increase and 2.4x memory efficiency boost. This efficiency enables scaling to wider attention scopes and larger models without additional overhead. Such scaling allows Flash3D to achieve even higher task accuracies than PTv3 under the same compute budget.
zh

[CV-187] Enhancing Nighttime Vehicle Detection with Day-to-Night Style Transfer and Labeling-Free Augmentation

【速读】：该论文试图解决深度学习目标检测模型在夜间条件下性能显著下降的问题，主要原因是这些模型大多基于日间图像进行训练，而夜间图像的标注难度较大，尤其是在缺乏路灯和车灯产生眩光的乡村道路环境中。解决方案的关键在于引入一种无需标注的数据增强框架，利用CARLA生成的合成数据进行日间到夜间的图像风格迁移。具体来说，该框架采用Efficient Attention Generative Adversarial Network (EAGAN) 实现逼真的日间到夜间风格转换，并结合CARLA生成的合成夜间图像帮助模型学习车灯效果。通过这种方法，论文在乡村夜间环境下对YOLO11模型进行了微调，显著提升了夜间车辆检测的性能。

链接: https://arxiv.org/abs/2412.16478
作者: Yunxiang Yang,Hao Zhen,Yongcan Huang,Jidong J. Yang
机构: University of Georgia, Athens, GA, USA(佐治亚大学，雅典，佐治亚州，美国)
关键词: Existing deep learning-based, Existing deep, deep learning-based object, deep learning-based, predominantly trained
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 12 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Existing deep learning-based object detection models perform well under daytime conditions but face significant challenges at night, primarily because they are predominantly trained on daytime images. Additionally, training with nighttime images presents another challenge: even human annotators struggle to accurately label objects in low-light conditions. This issue is particularly pronounced in transportation applications, such as detecting vehicles and other objects of interest on rural roads at night, where street lighting is often absent, and headlights may introduce undesirable glare. This study addresses these challenges by introducing a novel framework for labeling-free data augmentation, leveraging CARLA-generated synthetic data for day-to-night image style transfer. Specifically, the framework incorporates the Efficient Attention Generative Adversarial Network for realistic day-to-night style transfer and uses CARLA-generated synthetic nighttime images to help the model learn vehicle headlight effects. To evaluate the efficacy of the proposed framework, we fine-tuned the YOLO11 model with an augmented dataset specifically curated for rural nighttime environments, achieving significant improvements in nighttime vehicle detection. This novel approach is simple yet effective, offering a scalable solution to enhance AI-based detection systems in low-visibility environments and extend the applicability of object detection models to broader real-world contexts.
zh

[CV-188] Query Quantized Neural SLAM AAAI25

【速读】：该论文试图解决在同时定位与地图构建（SLAM）系统中，由于运行时效率要求，神经隐式表示方法在优化过程中因迭代次数不足导致的欠拟合问题，进而引发的相机跟踪漂移和重建伪影。解决方案的关键在于提出了查询量化神经SLAM（query quantized neural SLAM），通过将查询量化为离散表示，减少输入变异，从而更容易和快速地拟合每一帧。具体实现上，使用一组代码对查询进行量化，限制神经网络观察的输入变异数量，使其在拟合更多先前帧后逐渐熟悉这些代码。此外，论文还引入了新的初始化、损失函数和增强方法，以稳定早期优化阶段的不确定性，约束优化空间，并更准确地估计相机姿态。

链接: https://arxiv.org/abs/2412.16476
作者: Sijia Jiang,Jing Hua,Zhizhong Han
机构: 未知
关键词: shown remarkable abilities, jointly modeling geometry, localization and mapping, shown remarkable, remarkable abilities
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be appeared at AAAI25

点击查看摘要

Abstract:Neural implicit representations have shown remarkable abilities in jointly modeling geometry, color, and camera poses in simultaneous localization and mapping (SLAM). Current methods use coordinates, positional encodings, or other geometry features as input to query neural implicit functions for signed distances and color which produce rendering errors to drive the optimization in overfitting image observations. However, due to the run time efficiency requirement in SLAM systems, we are merely allowed to conduct optimization on each frame in few iterations, which is far from enough for neural networks to overfit these queries. The underfitting usually results in severe drifts in camera tracking and artifacts in reconstruction. To resolve this issue, we propose query quantized neural SLAM which uses quantized queries to reduce variations of input for much easier and faster overfitting a frame. To this end, we quantize a query into a discrete representation with a set of codes, and only allow neural networks to observe a finite number of variations. This allows neural networks to become increasingly familiar with these codes after overfitting more and more previous frames. Moreover, we also introduce novel initialization, losses, and argumentation to stabilize the optimization with significant uncertainty in the early optimization stage, constrain the optimization space, and estimate camera poses more accurately. We justify the effectiveness of each design and report visual and numerical comparisons on widely used benchmarks to show our superiority over the latest methods in both reconstruction and camera tracking.
zh

[CV-189] “ScatSpotter” 2024 – A Distributed Dog Poop Detection Dataset

【速读】：该论文试图解决的问题是创建一个大规模、动态增长的狗粪便图像数据集，并对其进行详细的多边形标注，以支持相关研究。解决方案的关键在于构建一个“活”数据集，通过手动或AI辅助的方式进行标注，并定期更新（每月约增加1GB数据）。研究中使用了VIT和MaskRCNN模型进行基准测试，展示了数据集的挑战性，并实现了较高的像素级平均精度（0.858和0.847）。此外，论文还探讨了数据集的分发方法（集中式和去中心化），并公开了数据集、代码和模型权重，以促进开放科学数据的共享。

链接: https://arxiv.org/abs/2412.16473
作者: Jon Crall
机构: Kitware
关键词: AI-assisted polygon labels, dog feces, annotated with manually, manually drawn, drawn or AI-assisted
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: dataset paper, unreviewed

点击查看摘要

Abstract:We introduce a new – currently 42 gigabyte – ``living’’ dataset of phone images of dog feces, annotated with manually drawn or AI-assisted polygon labels. There are 6k full resolution images and 4k detailed polygon annotations. The collection and annotation of images started in late 2020 and the dataset grows by roughly 1GB a month. We train VIT and MaskRCNN baseline models to explore the difficulty of the dataset. The best model achieves a pixelwise average precision of 0.858 on a 691-image validation set and 0.847 on a small independently captured 30-image contributor test set. The most recent snapshot of dataset is made publicly available through three different distribution methods: one centralized (Girder) and two decentralized (IPFS and BitTorrent). We study of the trade-offs between distribution methods and discuss the feasibility of each with respect to reliably sharing open scientific data. The code to reproduce the experiments is hosted on GitHub, and the data is published under the Creative Commons Attribution 4.0 International license. Model weights are made publicly available with the dataset. Experimental hardware, time, energy, and emissions are quantified.
zh

[CV-190] Sensing Surface Patches in Volume Rendering for Inferring Signed Distance Functions AAAI25

【速读】：该论文试图解决在多视角RGB图像中恢复三维几何（3D geometry）时，体积渲染（volume rendering）方法对表面细节感知能力有限的问题。解决方案的关键在于通过体积渲染推断带符号距离函数（Signed Distance Functions, SDFs），并利用梯度和带符号距离在估计的交点周围建立一个小表面片，从而能够显式地对感知到的表面片施加约束，如多视角光度一致性和深度或法线先验的监督。这种方法通过在体积渲染中引入表面约束，提升了对几何细节的推断能力。

链接: https://arxiv.org/abs/2412.16467
作者: Sijia Jiang,Tong Wu,Jing Hua,Zhizhong Han
机构: 未知
关键词: computer vision tasks, multi-view RGB images, RGB images, volume rendering, vital to recover
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be appeared at AAAI25

点击查看摘要

Abstract:It is vital to recover 3D geometry from multi-view RGB images in many 3D computer vision tasks. The latest methods infer the geometry represented as a signed distance field by minimizing the rendering error on the field through volume rendering. However, it is still challenging to explicitly impose constraints on surfaces for inferring more geometry details due to the limited ability of sensing surfaces in volume rendering. To resolve this problem, we introduce a method to infer signed distance functions (SDFs) with a better sense of surfaces through volume rendering. Using the gradients and signed distances, we establish a small surface patch centered at the estimated intersection along a ray by pulling points randomly sampled nearby. Hence, we are able to explicitly impose surface constraints on the sensed surface patch, such as multi-view photo consistency and supervision from depth or normal priors, through volume rendering. We evaluate our method by numerical and visual comparisons on scene benchmarks. Our superiority over the latest methods justifies our effectiveness.
zh

[CV-191] Positive2Negative: Breaking the Information-Lossy Barrier in Self-Supervised Single Image Denoising

【速读】：该论文试图解决现有自监督图像去噪方法（如Noise2Noise和Noise2Void）在处理单张噪声图像时因信息丢失操作（如下采样和掩码）导致去噪质量低的问题。解决方案的关键在于提出了一种新的自监督单图像去噪范式，称为Positive2Negative。该范式包括两个核心步骤：重噪数据构建（Renoised Data Construction, RDC）和去噪一致性监督（Denoised Consistency Supervision, DCS）。RDC通过预测的噪声对预测的去噪图像进行重噪，生成多张噪声图像，从而保留原始图像的所有信息；DCS则确保这些去噪图像之间的一致性，监督网络学习更鲁棒的去噪能力。该方法在自监督单图像去噪任务中实现了最先进的性能，并显著提升了处理速度。

链接: https://arxiv.org/abs/2412.16460
作者: Tong Li,Lizhi Wang,Zhiyuan Xu,Lin Zhu,Wanxuan Lu,Hua Huang
机构: Beijing Institute of Technology(北京理工大学); Beijing Normal University(北京师范大学); Chinese Academy of Sciences(中国科学院)
关键词: computational photography applications, Image denoising, self-supervised image denoising, Image denoising enhances, denoising enhances image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image denoising enhances image quality, serving as a foundational technique across various computational photography applications. The obstacle to clean image acquisition in real scenarios necessitates the development of self-supervised image denoising methods only depending on noisy images, especially a single noisy image. Existing self-supervised image denoising paradigms (Noise2Noise and Noise2Void) rely heavily on information-lossy operations, such as downsampling and masking, culminating in low quality denoising performance. In this paper, we propose a novel self-supervised single image denoising paradigm, Positive2Negative, to break the information-lossy barrier. Our paradigm involves two key steps: Renoised Data Construction (RDC) and Denoised Consistency Supervision (DCS). RDC renoises the predicted denoised image by the predicted noise to construct multiple noisy images, preserving all the information of the original image. DCS ensures consistency across the multiple denoised images, supervising the network to learn robust denoising. Our Positive2Negative paradigm achieves state-of-the-art performance in self-supervised single image denoising with significant speed improvements. The code will be released to the public.
zh

[CV-192] Rethinking Model Redundancy for Low-light Image Enhancement

【速读】：该论文试图解决低光图像增强 (Low-light image enhancement, LLIE) 任务中模型冗余的问题，特别是参数有害性 (parameter harmfulness) 和参数无用性 (parameter uselessness)。解决方案的关键在于提出了两种创新技术：注意力动态重分配 (Attention Dynamic Reallocation, ADR) 和参数正交生成 (Parameter Orthogonal Generation, POG)。ADR 通过根据原始注意力动态重新分配注意力，从而缓解参数有害性；POG 则通过学习参数的正交基嵌入，防止参数退化为静态参数，从而缓解参数无用性。这两种技术有效减少了模型冗余，提升了低光图像增强的性能。

链接: https://arxiv.org/abs/2412.16459
作者: Tong Li,Lizhi Wang,Hansen Feng,Lin Zhu,Wanxuan Lu,Hua Huang
机构: Beijing Institute of Technology; Beijing Normal University; Chinese Academy of Sciences
关键词: Low-light image enhancement, Low-light image, image enhancement, reduce noise, image quality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-light image enhancement (LLIE) is a fundamental task in computational photography, aiming to improve illumination, reduce noise, and enhance the image quality of low-light images. While recent advancements primarily focus on customizing complex neural network models, we have observed significant redundancy in these models, limiting further performance improvement. In this paper, we investigate and rethink the model redundancy for LLIE, identifying parameter harmfulness and parameter uselessness. Inspired by the rethinking, we propose two innovative techniques to mitigate model redundancy while improving the LLIE performance: Attention Dynamic Reallocation (ADR) and Parameter Orthogonal Generation (POG). ADR dynamically reallocates appropriate attention based on original attention, thereby mitigating parameter harmfulness. POG learns orthogonal basis embeddings of parameters and prevents degradation to static parameters, thereby mitigating parameter uselessness. Experiments validate the effectiveness of our techniques. We will release the code to the public.
zh

[CV-193] FACTS: Fine-Grained Action Classification for Tactical Sports

【速读】：该论文试图解决在快速、近距离的格斗运动（如击剑和拳击）中，由于动作复杂、速度快且具有细微差别，传统依赖姿态估计或传感器数据的分类方法难以准确捕捉这些动态的问题。解决方案的关键在于引入了一种名为FACTS的新型基于transformer的方法，该方法直接处理原始视频数据，无需姿态估计和使用笨重的身体标记和传感器，从而实现了对细粒度动作的高精度识别。FACTS在击剑动作上达到了90%的准确率，在拳击动作上达到了83.25%的准确率，并提供了一个包含8种详细击剑动作的新公开数据集，填补了体育分析资源中的关键空白。

链接: https://arxiv.org/abs/2412.16454
作者: Christopher Lai,Jason Mo,Haotian Xia,Yuan-fang Wang
机构: University of California, Santa Barbara(加州大学圣塔芭芭拉分校); Georgia Institute of Technology(佐治亚理工学院); University of California, Irvine(加州大学欧文分校)
关键词: unique challenges due, Classifying fine-grained actions, Classifying fine-grained, presents unique challenges, nuance of movements
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Classifying fine-grained actions in fast-paced, close-combat sports such as fencing and boxing presents unique challenges due to the complexity, speed, and nuance of movements. Traditional methods reliant on pose estimation or fancy sensor data often struggle to capture these dynamics accurately. We introduce FACTS, a novel transformer-based approach for fine-grained action recognition that processes raw video data directly, eliminating the need for pose estimation and the use of cumbersome body markers and sensors. FACTS achieves state-of-the-art performance, with 90% accuracy on fencing actions and 83.25% on boxing actions. Additionally, we present a new publicly available dataset featuring 8 detailed fencing actions, addressing critical gaps in sports analytics resources. Our findings enhance training, performance analysis, and spectator engagement, setting a new benchmark for action classification in tactical sports.
zh

[CV-194] Sensitive Image Classification by Vision Transformers

【速读】：该论文试图解决在分类儿童性虐待图像时，处理相似的类间相关性和多样化的类内相关性所面临的挑战。解决方案的关键在于利用视觉Transformer模型（Vision Transformer models）的自注意力机制（self-attention mechanism），该机制能够捕捉图像局部元素之间的全局交互，从而有效避免错误的关联并减少注意力图中的模糊性。通过构建包含清洁和色情图像的数据集以及包含色情、非色情和介于两者之间的图像的三类数据集，并结合成人内容图像基准数据集，研究对比了多种流行的视觉Transformer模型与传统的预训练ResNet模型以及基于注意力和度量学习的CNN和Bumble方法。结果表明，视觉Transformer网络在色情图像分类任务中表现出更优的分类和检测能力。

链接: https://arxiv.org/abs/2412.16446
作者: Hanxian He,Campbell Wilson,Thanh Thi Nguyen,Janis Dalins
机构: Monash University(蒙纳士大学); Australian Federal Police(澳大利亚联邦警察)
关键词: managing similar inter-class, similar inter-class correlations, diverse intra-class correlations, intra-class correlations poses, managing similar
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

点击查看摘要

Abstract:When it comes to classifying child sexual abuse images, managing similar inter-class correlations and diverse intra-class correlations poses a significant challenge. Vision transformer models, unlike conventional deep convolutional network models, leverage a self-attention mechanism to capture global interactions among contextual local elements. This allows them to navigate through image patches effectively, avoiding incorrect correlations and reducing ambiguity in attention maps, thus proving their efficacy in computer vision tasks. Rather than directly analyzing child sexual abuse data, we constructed two datasets: one comprising clean and pornographic images and another with three classes, which additionally include images indicative of pornography, sourced from Reddit and Google Open Images data. In our experiments, we also employ an adult content image benchmark dataset. These datasets served as a basis for assessing the performance of vision transformer models in pornographic image classification. In our study, we conducted a comparative analysis between various popular vision transformer models and traditional pre-trained ResNet models. Furthermore, we compared them with established methods for sensitive image detection such as attention and metric learning based CNN and Bumble. The findings demonstrated that vision transformer networks surpassed the benchmark pre-trained models, showcasing their superior classification and detection capabilities in this task.
zh

[CV-195] Mixed geometry information regularization for image multiplicative denoising

【速读】：该论文试图解决乘性伽马去噪问题，通过引入变分模型来实现。解决方案的关键在于提出了一种混合几何信息模型，结合面积项和曲率项作为先验知识，以有效去除乘性噪声并保留边缘，同时避免阶梯效应。此外，针对高阶正则化中的非线性和非凸性挑战，论文提出了高效的加性算子分裂算法 (AOS) 和标量辅助变量算法 (SAV)，确保了算法的无条件稳定性，并允许使用大时间步长，同时SAV方法在模型中表现出更高的计算精度。通过二阶SAV算法进一步加速计算并保持精度，最终通过大量数值实验验证了模型的有效性和效率，展示了其在纹理保留方面的优越性且不产生虚假信息。

链接: https://arxiv.org/abs/2412.16445
作者: Shengkun Yang,Zhichang Guo,Jia Li,Fanghui Song,Wenjuan Yao
机构: School of Mathematics, Harbin Institute of Technology(哈尔滨工业大学数学学院)
关键词: gamma denoising problem, multiplicative gamma denoising, focuses on solving, gamma denoising, model
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:This paper focuses on solving the multiplicative gamma denoising problem via a variation model. Variation-based regularization models have been extensively employed in a variety of inverse problem tasks in image processing. However, sufficient geometric priors and efficient algorithms are still very difficult problems in the model design process. To overcome these issues, in this paper we propose a mixed geometry information model, incorporating area term and curvature term as prior knowledge. In addition to its ability to effectively remove multiplicative noise, our model is able to preserve edges and prevent staircasing effects. Meanwhile, to address the challenges stemming from the nonlinearity and non-convexity inherent in higher-order regularization, we propose the efficient additive operator splitting algorithm (AOS) and scalar auxiliary variable algorithm (SAV). The unconditional stability possessed by these algorithms enables us to use large time step. And the SAV method shows higher computational accuracy in our model. We employ the second order SAV algorithm to further speed up the calculation while maintaining accuracy. We demonstrate the effectiveness and efficiency of the model and algorithms by a lot of numerical experiments, where the model we proposed has better features texturepreserving properties without generating any false information.
zh

[CV-196] Object Detection Approaches to Identifying Hand Images with High Forensic Values

【速读】：该论文旨在解决法医科学中手部图像检测的效率和准确性问题，通过比较多种基于机器学习的物体检测方法，提出了一种优化的解决方案。关键在于对YOLOv8和基于视觉变换器（vision transformer）的物体检测模型进行微调，并在四个手部图像数据集上进行实验。实验结果表明，YOLOv8模型（包括YOLOv8n和YOLOv8x）在所有数据集上均优于DETR和DETA模型，且相较于基于YOLOv3和YOLOv4的现有手部检测方法，YOLOv8模型表现出更优越的性能。通过微调后的YOLOv8模型，能够高效识别具有高法医价值的手部图像或视频帧，显著减少法医专家的工作时间，从而实现实际应用中的有效性。

链接: https://arxiv.org/abs/2412.16431
作者: Thanh Thi Nguyen,Campbell Wilson,Imad Khan,Janis Dalins
机构: Monash University(莫纳什大学); Australian Federal Police(澳大利亚联邦警察)
关键词: Forensic science plays, legal investigations, advanced technologies, science plays, plays a crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

点击查看摘要

Abstract:Forensic science plays a crucial role in legal investigations, and the use of advanced technologies, such as object detection based on machine learning methods, can enhance the efficiency and accuracy of forensic analysis. Human hands are unique and can leave distinct patterns, marks, or prints that can be utilized for forensic examinations. This paper compares various machine learning approaches to hand detection and presents the application results of employing the best-performing model to identify images of significant importance in forensic contexts. We fine-tune YOLOv8 and vision transformer-based object detection models on four hand image datasets, including the 11k hands dataset with our own bounding boxes annotated by a semi-automatic approach. Two YOLOv8 variants, i.e., YOLOv8 nano (YOLOv8n) and YOLOv8 extra-large (YOLOv8x), and two vision transformer variants, i.e., DEtection TRansformer (DETR) and Detection Transformers with Assignment (DETA), are employed for the experiments. Experimental results demonstrate that the YOLOv8 models outperform DETR and DETA on all datasets. The experiments also show that YOLOv8 approaches result in superior performance compared with existing hand detection methods, which were based on YOLOv3 and YOLOv4 models. Applications of our fine-tuned YOLOv8 models for identifying hand images (or frames in a video) with high forensic values produce excellent results, significantly reducing the time required by forensic experts. This implies that our approaches can be implemented effectively for real-world applications in forensics or related fields.
zh

[CV-197] Deepfake detection image manipulation detection fairness generalization

【速读】：该论文试图解决深度伪造检测中存在的公平性泛化问题，即训练数据中的偏见导致检测器在不同人口统计群体（如种族和性别）上的性能差异。解决方案的关键在于利用合成数据集和模型优化来增强公平性。具体而言，论文提出了一种数据驱动框架，通过生成代表不同人口统计群体的多样化合成样本，确保模型在平衡和具有代表性的数据集上进行训练，从而更有效地实现跨领域的公平性泛化。该方法结合了合成数据、损失锐度感知优化流程和多任务学习框架，以创建更公平的训练环境，并在跨数据集评估中保持公平性。实验结果表明，该方法在保持公平性方面优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.16428
作者: Uzoamaka Ezeakunne,Chrisantus Eze,Xiuwen Liu
机构: Florida State University(佛罗里达州立大学); Oklahoma State University(俄克拉荷马州立大学)
关键词: recent studies, race and gender, deepfake detection research, progress made, studies have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Accepted at ICAART 2025

点击查看摘要

Abstract:Despite the progress made in deepfake detection research, recent studies have shown that biases in the training data for these detectors can result in varying levels of performance across different demographic groups, such as race and gender. These disparities can lead to certain groups being unfairly targeted or excluded. Traditional methods often rely on fair loss functions to address these issues, but they under-perform when applied to unseen datasets, hence, fairness generalization remains a challenge. In this work, we propose a data-driven framework for tackling the fairness generalization problem in deepfake detection by leveraging synthetic datasets and model optimization. Our approach focuses on generating and utilizing synthetic data to enhance fairness across diverse demographic groups. By creating a diverse set of synthetic samples that represent various demographic groups, we ensure that our model is trained on a balanced and representative dataset. This approach allows us to generalize fairness more effectively across different domains. We employ a comprehensive strategy that leverages synthetic data, a loss sharpness-aware optimization pipeline, and a multi-task learning framework to create a more equitable training environment, which helps maintain fairness across both intra-dataset and cross-dataset evaluations. Extensive experiments on benchmark deepfake detection datasets demonstrate the efficacy of our approach, surpassing state-of-the-art approaches in preserving fairness during cross-dataset evaluation. Our results highlight the potential of synthetic datasets in achieving fairness generalization, providing a robust solution for the challenges faced in deepfake detection.
zh

[CV-198] Revisiting MLLM s: An In-Depth Analysis of Image Classification Abilities

【速读】：该论文试图解决多模态大语言模型（MLLMs）在图像分类能力评估方面的不足问题。解决方案的关键在于通过深入分析图像分类任务，重新审视MLLMs的表现，并揭示其在多个数据集上能够匹配甚至超越传统的视觉-语言模型（如CLIP）。研究通过分析网络架构、数据选择和训练方法，发现语言模型的进步和训练数据源的多样性是提升MLLMs图像分类能力的主要原因。具体来说，概念知识转移和目标概念的增强暴露是驱动这一改进的关键因素。

链接: https://arxiv.org/abs/2412.16418
作者: Huan Liu,Lingyu Xiao,Jiangjiang Liu,Xiaofan Li,Ze Feng,Sen Yang,Jingdong Wang
机构: Beijing Jiaotong University(北京交通大学); Baidu VIS(百度视觉)
关键词: Multimodal Large Language, Multimodal Large, Large Language Models, Large Language, evaluate their capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of Multimodal Large Language Models (MLLMs), a variety of benchmarks have been introduced to evaluate their capabilities. While most evaluations have focused on complex tasks such as scientific comprehension and visual reasoning, little attention has been given to assessing their fundamental image classification abilities. In this paper, we address this gap by thoroughly revisiting the MLLMs with an in-depth analysis of image classification. Specifically, building on established datasets, we examine a broad spectrum of scenarios, from general classification tasks (e.g., ImageNet, ObjectNet) to more fine-grained categories such as bird and food classification. Our findings reveal that the most recent MLLMs can match or even outperform CLIP-style vision-language models on several datasets, challenging the previous assumption that MLLMs are bad at image classification \citeVLMClassifier. To understand the factors driving this improvement, we conduct an in-depth analysis of the network architecture, data selection, and training recipe used in public MLLMs. Our results attribute this success to advancements in language models and the diversity of training data sources. Based on these observations, we further analyze and attribute the potential reasons to conceptual knowledge transfer and enhanced exposure of target concepts, respectively. We hope our findings will offer valuable insights for future research on MLLMs and their evaluation in image classification tasks.
zh

[CV-199] Uncertainty Quantification in Continual Open-World Learning NEURIPS2024

【速读】：该论文试图解决在实际部署中，AI系统在持续遇到未标记数据（包括已知类别的未见样本和未知类别的新样本）时，如何自主适应并持续学习的问题。解决方案的关键是提出了COUQ（Continual Open-world Uncertainty Quantification）方法，这是一种针对广义持续开放世界多类别设置的迭代不确定性估计算法。COUQ通过在持续新奇检测、不确定性引导的主动学习和不确定性引导的伪标签生成等关键子任务上进行严格应用和评估，展示了其在多个数据集和不同骨干网络上的优越性能。

链接: https://arxiv.org/abs/2412.16409
作者: Amanda S. Rios,Ibrahima J. Ndiour,Parual Datta,Jaroslaw Sydir,Omesh Tickoo,Nilesh Ahuja
机构: Intel(英特尔)
关键词: encountered after deployment, capable of autonomously, autonomously adapting, adapting to novelties, novelties encountered
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Manuscript Under Review (full-length); Related 4-page manuscripts accepted at Neurips 2024 Non-Archival Workshops this https URL and this https URL

点击查看摘要

Abstract:AI deployed in the real-world should be capable of autonomously adapting to novelties encountered after deployment. Yet, in the field of continual learning, the reliance on novelty and labeling oracles is commonplace albeit unrealistic. This paper addresses a challenging and under-explored problem: a deployed AI agent that continuously encounters unlabeled data - which may include both unseen samples of known classes and samples from novel (unknown) classes - and must adapt to it continuously. To tackle this challenge, we propose our method COUQ “Continual Open-world Uncertainty Quantification”, an iterative uncertainty estimation algorithm tailored for learning in generalized continual open-world multi-class settings. We rigorously apply and evaluate COUQ on key sub-tasks in the Continual Open-World: continual novelty detection, uncertainty guided active learning, and uncertainty guided pseudo-labeling for semi-supervised CL. We demonstrate the effectiveness of our method across multiple datasets, ablations, backbones and performance superior to state-of-the-art.
zh

[CV-200] VerSe: Integrating Multiple Queries as Prompts for Versatile Cardiac MRI Segmentation

【速读】：该论文试图解决心脏结构从磁共振成像（MRI）中的精确分割问题，特别是在复杂区域如心脏基底和顶端部分，现有的自动分割方法仍需大量人工修正。解决方案的关键在于提出了一个名为VerSe的多功能分割框架，通过联合学习对象查询（object queries）和点击查询（click queries）作为共享分割主干的提示，实现了自动分割与交互式掩码精炼的统一。VerSe不仅支持通过对象查询进行全自动分割，还允许在需要时通过点击查询进行交互式掩码精炼，从而显著提升了性能和效率。

链接: https://arxiv.org/abs/2412.16381
作者: Bangwei Guo,Meng Ye,Yunhe Gao,Bingyu Xin,Leon Axel,Dimitris Metaxas
机构: Rutgers University(罗格斯大学); NYU Grossman School of Medicine(纽约大学格罗斯曼医学院)
关键词: magnetic resonance imaging, image segmentation approach, remains a critical, critical challenge, learning-based image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Despite the advances in learning-based image segmentation approach, the accurate segmentation of cardiac structures from magnetic resonance imaging (MRI) remains a critical challenge. While existing automatic segmentation methods have shown promise, they still require extensive manual corrections of the segmentation results by human experts, particularly in complex regions such as the basal and apical parts of the heart. Recent efforts have been made on developing interactive image segmentation methods that enable human-in-the-loop learning. However, they are semi-automatic and inefficient, due to their reliance on click-based prompts, especially for 3D cardiac MRI volumes. To address these limitations, we propose VerSe, a Versatile Segmentation framework to unify automatic and interactive segmentation through mutiple queries. Our key innovation lies in the joint learning of object and click queries as prompts for a shared segmentation backbone. VerSe supports both fully automatic segmentation, through object queries, and interactive mask refinement, by providing click queries when needed. With the proposed integrated prompting scheme, VerSe demonstrates significant improvement in performance and efficiency over existing methods, on both cardiac MRI and out-of-distribution medical imaging datasets. The code is available at this https URL.
zh

[CV-201] LiRCDepth: Lightweight Radar-Camera Depth Estimation via Knowledge Distillation and Uncertainty Guidance ICASSP2025

【速读】：该论文试图解决雷达-相机深度估计算法中存在的计算效率问题，提出了一种轻量级的雷达-相机深度估计模型LiRCDepth。解决方案的关键在于引入知识蒸馏（knowledge distillation）技术，通过三个关键领域的信息传递来增强训练过程：首先，通过像素级和成对蒸馏传递低级和高级特征；其次，引入不确定性感知的中间深度蒸馏损失，以优化解码过程中的中间深度图。这些改进使得轻量级模型在nuScenes数据集上的MAE性能提升了6.6%，同时保持了较高的计算效率。

链接: https://arxiv.org/abs/2412.16380
作者: Huawei Sun,Nastassia Vysotskaya,Tobias Sukianto,Hao Feng,Julius Ott,Xiangyuan Peng,Lorenzo Servadei,Robert Wille
机构: Technical University of Munich (慕尼黑工业大学); Infineon Technologies AG (英飞凌科技公司)
关键词: gained significant attention, radar sensors provide, sensors provide geometric, provide geometric information, limitations of cameras
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Recently, radar-camera fusion algorithms have gained significant attention as radar sensors provide geometric information that complements the limitations of cameras. However, most existing radar-camera depth estimation algorithms focus solely on improving performance, often neglecting computational efficiency. To address this gap, we propose LiRCDepth, a lightweight radar-camera depth estimation model. We incorporate knowledge distillation to enhance the training process, transferring critical information from a complex teacher model to our lightweight student model in three key domains. Firstly, low-level and high-level features are transferred by incorporating pixel-wise and pair-wise distillation. Additionally, we introduce an uncertainty-aware inter-depth distillation loss to refine intermediate depth maps during decoding. Leveraging our proposed knowledge distillation scheme, the lightweight model achieves a 6.6% improvement in MAE on the nuScenes dataset compared to the model trained without distillation.
zh

[CV-202] FairREAD: Re-fusing Demographic Attributes after Disentanglement for Fair Medical Image Classification

【速读】：该论文试图解决医学影像领域中深度学习模型在不同人口统计子组之间表现差异带来的公平性问题。解决方案的关键在于提出了一种名为Fair Re-fusion After Disentanglement (FairREAD)的新框架，通过正交约束和对抗训练来分离敏感的人口统计信息，同时利用受控的再融合机制保留临床相关细节，确保模型在不同子组间的公平性表现。此外，子组特定的阈值调整进一步确保了各子组的公平性能。该方法在大型临床X光数据集上的综合评估表明，FairREAD在显著减少不公平性指标的同时，保持了诊断准确性。

链接: https://arxiv.org/abs/2412.16373
作者: Yicheng Gao,Jinkui Hao,Bo Zhou
机构: Northwestern University (西北大学)
关键词: shown transformative potential, Recent advancements, fairness persist due, advancements in deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Submitted to Medical Image Analysis, code will be available after review is complete

点击查看摘要

Abstract:Recent advancements in deep learning have shown transformative potential in medical imaging, yet concerns about fairness persist due to performance disparities across demographic subgroups. Existing methods aim to address these biases by mitigating sensitive attributes in image data; however, these attributes often carry clinically relevant information, and their removal can compromise model performance-a highly undesirable outcome. To address this challenge, we propose Fair Re-fusion After Disentanglement (FairREAD), a novel, simple, and efficient framework that mitigates unfairness by re-integrating sensitive demographic attributes into fair image representations. FairREAD employs orthogonality constraints and adversarial training to disentangle demographic information while using a controlled re-fusion mechanism to preserve clinically relevant details. Additionally, subgroup-specific threshold adjustments ensure equitable performance across demographic groups. Comprehensive evaluations on a large-scale clinical X-ray dataset demonstrate that FairREAD significantly reduces unfairness metrics while maintaining diagnostic accuracy, establishing a new benchmark for fairness and performance in medical image classification.
zh

[CV-203] oward Robust Neural Reconstruction from Sparse Point Sets

【速读】：该论文试图解决从稀疏且带有噪声的3D点云中学习有符号距离函数 (Signed Distance Functions, SDF) 的挑战性问题。与依赖平滑先验的近期方法不同，论文提出的解决方案基于分布式鲁棒优化 (Distributionally Robust Optimization, DRO) 框架，通过引入正则化项，利用模型不确定性区域的样本，从而提升学习到的SDF质量。该方法的关键在于其可处理的对偶形式，使得在缺乏真实监督的情况下，仍能实现稳定且高效的SDF优化。通过多种合成和真实数据的评估，证明了该DRO框架在SDF学习方面相较于基线和最先进方法的优越性。

链接: https://arxiv.org/abs/2412.16361
作者: Amine Ouasfi,Shubhendu Jena,Eric Marchand,Adnane Boukhayma
机构: Inria, Univ. Rennes, CNRS, IRISA (法国国家信息与自动化研究所，雷恩大学，法国国家科学研究中心，信息系统与随机性研究所)
关键词: Signed Distance Functions, learning Signed Distance, Distance Functions, Signed Distance, point clouds
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page : this https URL

点击查看摘要

Abstract:We consider the challenging problem of learning Signed Distance Functions (SDF) from sparse and noisy 3D point clouds. In contrast to recent methods that depend on smoothness priors, our method, rooted in a distributionally robust optimization (DRO) framework, incorporates a regularization term that leverages samples from the uncertainty regions of the model to improve the learned SDFs. Thanks to tractable dual formulations, we show that this framework enables a stable and efficient optimization of SDFs in the absence of ground truth supervision. Using a variety of synthetic and real data evaluations from different modalities, we show that our DRO based learning framework can improve SDF learning with respect to baselines and the state-of-the-art methods.
zh

[CV-204] xture- and Shape-based Adversarial Attacks for Vehicle Detection in Synthetic Overhead Imagery

【速读】：该论文试图解决在复杂背景下检测航空图像中车辆时，现有最先进的目标检测器（如 YOLO）易受对抗攻击 (Adversarial Attacks, AAs) 影响的问题。解决方案的关键在于提出针对对抗攻击的实际实施约束，包括像素化、遮罩、限制纹理的颜色调色板以及约束形状修改等。这些约束旨在平衡攻击性能与物理实现的可行性，并通过实验验证了其在三种常用目标检测器架构上的有效性，揭示了实用性与性能之间的权衡。此外，论文还引入了一个包含多种车辆类别的航空图像标注数据集，以支持进一步研究。

链接: https://arxiv.org/abs/2412.16358
作者: Mikael Yeghiazaryan,Sai Abhishek Siddhartha Namburu,Emily Kim,Stanislav Panev,Celso de Melo,Brent Lance,Fernando De la Torre,Jessica K. Hodgins
机构: Carnegie Mellon University(卡内基梅隆大学); Army Research Lab(陆军研究实验室)
关键词: small resolution, complex backgrounds, challenging due, due to complex, Detecting vehicles
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting vehicles in aerial images can be very challenging due to complex backgrounds, small resolution, shadows, and occlusions. Despite the effectiveness of SOTA detectors such as YOLO, they remain vulnerable to adversarial attacks (AAs), compromising their reliability. Traditional AA strategies often overlook the practical constraints of physical implementation, focusing solely on attack performance. Our work addresses this issue by proposing practical implementation constraints for AA in texture and/or shape. These constraints include pixelation, masking, limiting the color palette of the textures, and constraining the shape modifications. We evaluated the proposed constraints through extensive experiments using three widely used object detector architectures, and compared them to previous works. The results demonstrate the effectiveness of our solutions and reveal a trade-off between practicality and performance. Additionally, we introduce a labeled dataset of overhead images featuring vehicles of various categories. We will make the code/dataset public upon paper acceptance.
zh

[CV-205] SOUS VIDE: Cooking Visual Drone Navigation Policies in a Gaussian Splatting Vacuum

【速读】：该论文试图解决无人机端到端视觉导航问题，并提出了一个名为SOUS VIDE的解决方案，包括一个新的仿真器FiGS、训练方法和策略架构。解决方案的关键在于FiGS仿真器结合了计算简单的无人机动力学模型和高视觉保真度的场景重建，能够以高达130 fps的速度生成逼真的图像，用于收集大量观察-动作对。通过蒸馏专家模型MPC，生成轻量级的神经网络SV-Net，该网络能够处理多模态输入（颜色图像、光流和IMU数据）并输出低级别的机体速率和推力命令。SV-Net中的快速电机适应模块（RMA）能够在运行时动态适应无人机的动力学变化，从而实现零样本仿真到现实的迁移，并在多种复杂现实环境中表现出鲁棒性。

链接: https://arxiv.org/abs/2412.16346
作者: JunEn Low,Maximilian Adang,Javier Yu,Keiko Nagami,Mac Schwager
机构: Stanford University (斯坦福大学); Toyota Research Institute (丰田研究所)
关键词: training approach, collectively called SOUS, SOUS VIDE, visual drone navigation, called SOUS VIDE
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:We propose a new simulator, training approach, and policy architecture, collectively called SOUS VIDE, for end-to-end visual drone navigation. Our trained policies exhibit zero-shot sim-to-real transfer with robust real-world performance using only on-board perception and computation. Our simulator, called FiGS, couples a computationally simple drone dynamics model with a high visual fidelity Gaussian Splatting scene reconstruction. FiGS can quickly simulate drone flights producing photorealistic images at up to 130 fps. We use FiGS to collect 100k-300k observation-action pairs from an expert MPC with privileged state and dynamics information, randomized over dynamics parameters and spatial disturbances. We then distill this expert MPC into an end-to-end visuomotor policy with a lightweight neural architecture, called SV-Net. SV-Net processes color image, optical flow and IMU data streams into low-level body rate and thrust commands at 20Hz onboard a drone. Crucially, SV-Net includes a Rapid Motor Adaptation (RMA) module that adapts at runtime to variations in drone dynamics. In a campaign of 105 hardware experiments, we show SOUS VIDE policies to be robust to 30% mass variations, 40 m/s wind gusts, 60% changes in ambient brightness, shifting or removing objects from the scene, and people moving aggressively through the drone’s visual field. Code, data, and experiment videos can be found on our project page: this https URL.
zh

[CV-206] DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

【速读】：该论文试图解决自监督视觉基础模型（self-supervised visual foundation models）在开放词汇任务（open-vocabulary tasks）中的应用问题，因为这些模型的视觉特征与语言特征未对齐，限制了其在开放词汇任务中的表现。解决方案的关键在于采用LiT训练策略，并通过以下几个关键改进来提升性能：1) 将[CLS] token与patch平均值拼接以训练对齐；2) 使用文本和图像模态共同筛选数据。这些改进使得模型在零样本分类（zero-shot classification）和开放词汇语义分割（open-vocabulary semantic segmentation）任务中以较低的计算成本实现了最先进的性能。

链接: https://arxiv.org/abs/2412.16334
作者: Cijo Jose,Théo Moutakanni,Dahyun Kang,Federico Baldassarre,Timothée Darcet,Hu Xu,Daniel Li,Marc Szafraniec,Michaël Ramamonjisoa,Maxime Oquab,Oriane Siméoni,Huy V. Vo,Patrick Labatut,Piotr Bojanowski
机构: Meta FAIR; Université Paris-Saclay, CentraleSupélec, MICS; POSTECH; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP
关键词: produce powerful embeddings, Self-supervised visual foundation, achieve remarkable performance, foundation models produce, models produce powerful
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named this http URL, unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
zh

[CV-207] Improving Object Detection for Time-Lapse Imagery Using Temporal Features in Wildlife Monitoring

【速读】：该论文试图解决在时间序列图像中提高目标检测器（object detector）性能的问题。解决方案的关键在于利用时间序列中的时空特征（spatio-temporal features），通过引入两个额外的空间特征通道来捕捉场景中的静态和动态元素，从而提升场景理解能力并减少静态误报。具体来说，该方法通过整合前帧的时空信息，显著提高了平均精度（mAP@0.05:0.95），在热带海鸟繁殖数据集上相较于无时间特征的单帧检测器提升了24%。

链接: https://arxiv.org/abs/2412.16329
作者: Marcus Jenkins,Kirsty A. Franklin,Malcolm A. C. Nicoll,Nik C. Cole,Kevin Ruhomaun,Vikash Tatayah,Michal Mackiewicz
机构: 未知
关键词: health of ecosystems, crucial for assessing, assessing the health, Monitoring animal populations, object detector
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 13 figures

点击查看摘要

Abstract:Monitoring animal populations is crucial for assessing the health of ecosystems. Traditional methods, which require extensive fieldwork, are increasingly being supplemented by time-lapse camera-trap imagery combined with an automatic analysis of the image data. The latter usually involves some object detector aimed at detecting relevant targets (commonly animals) in each image, followed by some postprocessing to gather activity and population data. In this paper, we show that the performance of an object detector in a single frame of a time-lapse sequence can be improved by including spatio-temporal features from the prior frames. We propose a method that leverages temporal information by integrating two additional spatial feature channels which capture stationary and non-stationary elements of the scene and consequently improve scene understanding and reduce the number of stationary false positives. The proposed technique achieves a significant improvement of 24% in mean average precision (mAP@0.05:0.95) over the baseline (temporal feature-free, single frame) object detector on a large dataset of breeding tropical seabirds. We envisage our method will be widely applicable to other wildlife monitoring applications that use time-lapse imaging.
zh

[CV-208] When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization

【速读】：该论文试图解决当前图像生成方法中存在的压缩与生成建模能力之间的权衡问题。传统方法假设更好的重建性能必然带来更好的生成效果，但研究表明，更小的生成模型在更压缩的潜在空间（latent space）中反而能受益，即使重建性能下降。论文提出的解决方案是引入因果正则化标记化（Causally Regularized Tokenization, CRT），通过在第一阶段训练中嵌入与第二阶段生成建模过程相关的归纳偏置，优化这一权衡。尽管CRT降低了重建性能，但显著提升了生成性能，并实现了计算效率的2-3倍提升，同时以更少的标记数和模型参数达到了与当前最先进模型（如LlamaGen）相当的效果。

链接: https://arxiv.org/abs/2412.16326
作者: Vivek Ramanujan,Kushal Tirumala,Armen Aghajanyan,Luke Zettlemoyer,Ali Farhadi
机构: University of Washington(华盛顿大学); Meta FAIR(Meta FAIR)
关键词: two-stage training approach, Current image generation, Current image, stage, training approach
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current image generation methods, such as latent diffusion and discrete token-based generation, depend on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. Most work focuses on maximizing stage 1 performance independent of stage 2, assuming better reconstruction always leads to better generation. However, we show this is not strictly true. Smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, showing a fundamental trade-off between compression and generation modeling capacity. To better optimize this trade-off, we introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. This regularization makes stage 1 reconstruction performance worse, but makes stage 2 generation performance better by making the tokens easier to model: we are able to improve compute efficiency 2-3 \times over baseline and match state-of-the-art discrete autoregressive ImageNet generation (2.18 FID) with less than half the tokens per image (256 vs. 576) and a fourth the total model parameters (775M vs. 3.1B) as the previous SOTA (LlamaGen).
zh

[CV-209] Mapping the Mind of an Instruction-based Image Editing using SMILE

【速读】：该论文试图解决基于指令的图像编辑模型（Instruct-based Image Editing models）在透明度和用户信任方面的不足，这些模型通常被视为“黑箱”。解决方案的关键是提出了SMILE（Statistical Model-agnostic Interpretability with Local Explanations），这是一种模型无关的局部可解释性方法，通过提供视觉热图来阐明文本元素对图像生成模型的影响。该方法应用于多种图像编辑模型（如Pix2Pix、Image2Image-turbo和Diffusers-Inpaint），显著提升了模型的可解释性和可靠性，并通过稳定性、准确性、保真度和一致性等指标进行了评估。

链接: https://arxiv.org/abs/2412.16277
作者: Zeinab Dehghani,Koorosh Aslansefat,Adil Khan,Adín Ramírez Rivera,Franky George,Muhammad Khalid
机构: University of Hull(赫尔大学); University of Oslo(奥斯陆大学)
关键词: Instruct-based Image Editing, Image Editing models, generating high-quality images, advancements in Instruct-based, Image Editing
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Despite recent advancements in Instruct-based Image Editing models for generating high-quality images, they are known as black boxes and a significant barrier to transparency and user trust. To solve this issue, we introduce SMILE (Statistical Model-agnostic Interpretability with Local Explanations), a novel model-agnostic for localized interpretability that provides a visual heatmap to clarify the textual elements’ influence on image-generating models. We applied our method to various Instruction-based Image Editing models like Pix2Pix, Image2Image-turbo and Diffusers-Inpaint and showed how our model can improve interpretability and reliability. Also, we use stability, accuracy, fidelity, and consistency metrics to evaluate our method. These findings indicate the exciting potential of model-agnostic interpretability for reliability and trustworthiness in critical applications such as healthcare and autonomous driving while encouraging additional investigation into the significance of interpretability in enhancing dependable image editing models.
zh

[CV-210] LEARN: A Unified Framework for Multi-Task Domain Adapt Few-Shot Learning

【速读】：该论文试图解决在计算机视觉领域中，将少样本学习（few-shot learning）与领域自适应（domain adaptation）结合的统一框架问题。解决方案的关键在于提出了一种高度模块化的框架，能够支持少样本学习任务，并根据需要选择性地加入领域自适应功能。该框架的一个重要特性是能够动态配置增量n-shot任务，并可扩展至传统的多样本任务。此外，框架还支持多种自监督学习（Self-Supervised Learning, SSL）预训练配置，以增强少样本学习的效果。通过广泛的算法和数据集基准测试，验证了该框架的有效性和灵活性。

链接: https://arxiv.org/abs/2412.16275
作者: Bharadwaj Ravichandran,Alexander Lynch,Sarah Brockman,Brandon RichardWebster,Dawei Du,Anthony Hoogs,Christopher Funk
机构: Kitware Inc(Kitware公司)
关键词: Computer Vision, significant recent progress, few-shot learning, significant recent, recent progress
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Both few-shot learning and domain adaptation sub-fields in Computer Vision have seen significant recent progress in terms of the availability of state-of-the-art algorithms and datasets. Frameworks have been developed for each sub-field; however, building a common system or framework that combines both is something that has not been explored. As part of our research, we present the first unified framework that combines domain adaptation for the few-shot learning setting across 3 different tasks - image classification, object detection and video classification. Our framework is highly modular with the capability to support few-shot learning with/without the inclusion of domain adaptation depending on the algorithm. Furthermore, the most important configurable feature of our framework is the on-the-fly setup for incremental n -shot tasks with the optional capability to configure the system to scale to a traditional many-shot task. With more focus on Self-Supervised Learning (SSL) for current few-shot learning approaches, our system also supports multiple SSL pre-training configurations. To test our framework’s capabilities, we provide benchmarks on a wide range of algorithms and datasets across different task and problem settings. The code is open source has been made publicly available here: this https URL
zh

[CV-211] PromptLA: Towards Integrity Verification of Black-box Text-to-Image Diffusion Models

【速读】：该论文试图解决文本到图像（T2I）扩散模型在部署为黑箱服务时可能被恶意修改以产生有害社会影响的问题。解决方案的关键在于提出了一种基于学习自动机的新型提示选择算法，通过捕捉生成图像特征分布的差异来验证模型的完整性。该算法在实验中表现出高效、稳定、准确和广泛的适用性，相较于现有方法在检测模型完整性违规方面具有显著优势。这是首次针对T2I扩散模型的完整性验证进行的研究，为人工智能应用中的版权讨论和保护奠定了基础。

链接: https://arxiv.org/abs/2412.16257
作者: Zhuomeng Zhang,Fangqi Li,Chong Di,Shilin Wang
机构: Shanghai Jiao Tong University(上海交通大学); Qilu University of Technology(齐鲁工业大学)
关键词: harmful social impacts, produce high-quality images, diffusion models, social impacts, produce high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Current text-to-image (T2I) diffusion models can produce high-quality images, and malicious users who are authorized to use the model only for benign purposes might modify their models to generate images that result in harmful social impacts. Therefore, it is essential to verify the integrity of T2I diffusion models, especially when they are deployed as black-box services. To this end, considering the randomness within the outputs of generative models and the high costs in interacting with them, we capture modifications to the model through the differences in the distributions of the features of generated images. We propose a novel prompt selection algorithm based on learning automaton for efficient and accurate integrity verification of T2I diffusion models. Extensive experiments demonstrate the effectiveness, stability, accuracy and generalization of our algorithm against existing integrity violations compared with baselines. To the best of our knowledge, this paper is the first work addressing the integrity verification of T2I diffusion models, which paves the way to copyright discussions and protections for artificial intelligence applications in practice.
zh

[CV-212] Interactive Scene Authoring with Specialized Generative Primitives

【速读】：该论文试图解决非专家用户在生成高质量3D数字资产时面临的复杂设计工具使用难题。解决方案的关键在于引入**专用生成式基元 (Specialized Generative Primitives)**框架，该框架通过高效的生成模型捕捉现实世界中单个样本的分布，使得用户能够以无缝、轻量且可控的方式创建高质量的3D场景。核心技术包括使用3D高斯飞溅 (3D Gaussian Splatting) 将环境视频转化为高质量的显式外观模型，结合语义感知特征引导用户选择感兴趣区域，并通过生成式元胞自动机 (Generative Cellular Automata) 进行单样本训练和可控生成。此外，通过在稀疏体素上操作并采用稀疏补丁一致性步骤恢复高质量输出，实现了生成任务与外观模型的解耦。每个基元可在10分钟内完成训练，并支持交互式组合生成新场景。

链接: https://arxiv.org/abs/2412.16253
作者: Clément Jambon(1),Changwoon Choi(2),Dongsu Zhang(2),Olga Sorkine-Hornung(1),Young Min Kim(2) ((1) ETH Zurich, (2) Seoul National University)
机构: ETH Zurich(苏黎世联邦理工学院); Seoul National University(首尔国立大学)
关键词: complex design tools, requires expert knowledge, Generating high-quality, design tools, requires expert
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Generating high-quality 3D digital assets often requires expert knowledge of complex design tools. We introduce Specialized Generative Primitives, a generative framework that allows non-expert users to author high-quality 3D scenes in a seamless, lightweight, and controllable manner. Each primitive is an efficient generative model that captures the distribution of a single exemplar from the real world. With our framework, users capture a video of an environment, which we turn into a high-quality and explicit appearance model thanks to 3D Gaussian Splatting. Users then select regions of interest guided by semantically-aware features. To create a generative primitive, we adapt Generative Cellular Automata to single-exemplar training and controllable generation. We decouple the generative task from the appearance model by operating on sparse voxels and we recover a high-quality output with a subsequent sparse patch consistency step. Each primitive can be trained within 10 minutes and used to author new scenes interactively in a fully compositional manner. We showcase interactive sessions where various primitives are extracted from real-world scenes and controlled to create 3D assets and scenes in a few minutes. We also demonstrate additional capabilities of our primitives: handling various 3D representations to control generation, transferring appearances, and editing geometries.
zh

[CV-213] owards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models

【速读】：该论文试图解决如何从科学数据（如细胞图像）中发现未知概念的问题，以推动科学发现。解决方案的关键在于使用稀疏字典学习 (Sparse Dictionary Learning, DL) 算法从多细胞图像数据中提取生物学上有意义的特征，如细胞类型和遗传扰动类型。论文提出了一种新的迭代码本特征学习算法 (Iterative Codebook Feature Learning, ICFL)，并结合控制数据集的PCA白化预处理步骤，显著提高了特征提取的选择性，优于传统的TopK稀疏自编码器。

链接: https://arxiv.org/abs/2412.16247
作者: Konstantin Donhauser,Kristina Ulicna,Gemma Elyse Moran,Aditya Ravuri,Kian Kenyon-Dean,Cian Eastwood,Jason Hartford
机构: Cranberry-Lemon University(蔓越莓-柠檬大学); University of the Witwatersrand(金山大学)
关键词: powerful interpretability tool, large language models, Dictionary learning, powerful interpretability, interpretability tool
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Dictionary learning (DL) has emerged as a powerful interpretability tool for large language models. By extracting known concepts (e.g., Golden-Gate Bridge) from human-interpretable data (e.g., text), sparse DL can elucidate a model’s inner workings. In this work, we ask if DL can also be used to discover unknown concepts from less human-interpretable scientific data (e.g., cell images), ultimately enabling modern approaches to scientific discovery. As a first step, we use DL algorithms to study microscopy foundation models trained on multi-cell image data, where little prior knowledge exists regarding which high-level concepts should arise. We show that sparse dictionaries indeed extract biologically-meaningful concepts such as cell type and genetic perturbation type. We also propose a new DL algorithm, Iterative Codebook Feature Learning~(ICFL), and combine it with a pre-processing step that uses PCA whitening from a control dataset. In our experiments, we demonstrate that both ICFL and PCA improve the selectivity of extracted features compared to TopK sparse autoencoders.
zh

[CV-214] WiFi CSI Based Temporal Activity Detection Via Dual Pyramid Network AAAI25

【速读】：该论文试图解决基于WiFi的时间活动检测问题，并提出了一种高效的双金字塔网络 (Dual Pyramid Network)，该网络集成了时间信号语义编码器 (Temporal Signal Semantic Encoder) 和局部敏感响应编码器 (Local Sensitive Response Encoder)。解决方案的关键在于：1) 时间信号语义编码器通过有符号掩码注意力机制 (Signed Mask-Attention mechanism) 将特征学习分为高频和低频分量，并使用对比归一化 (ContraNorm) 融合特征，从而强调重要区域并弱化不重要区域；2) 局部敏感响应编码器在不进行学习的情况下捕捉波动；3) 通过一种新的交叉注意力融合机制 (cross-attention fusion mechanism) 将这些特征金字塔结合。此外，论文还引入了一个包含2,114个活动片段和553个WiFi信道状态信息 (CSI) 样本的数据集，实验结果表明该方法优于现有的挑战性基线。

链接: https://arxiv.org/abs/2412.16233
作者: Zhendong Liu,Le Zhang,Bing Li,Yingjie Zhou,Zhenghua Chen,Ce Zhu
机构: University of Electronic Science and Technology of China (电子科技大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Shenzhen University (深圳大学)
关键词: Temporal Signal Semantic, Dual Pyramid Network, Local Sensitive Response, Signal Semantic Encoders, integrates Temporal Signal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注: Accepted by AAAI25

点击查看摘要

Abstract:We address the challenge of WiFi-based temporal activity detection and propose an efficient Dual Pyramid Network that integrates Temporal Signal Semantic Encoders and Local Sensitive Response Encoders. The Temporal Signal Semantic Encoder splits feature learning into high and low-frequency components, using a novel Signed Mask-Attention mechanism to emphasize important areas and downplay unimportant ones, with the features fused using ContraNorm. The Local Sensitive Response Encoder captures fluctuations without learning. These feature pyramids are then combined using a new cross-attention fusion mechanism. We also introduce a dataset with over 2,114 activity segments across 553 WiFi CSI samples, each lasting around 85 seconds. Extensive experiments show our method outperforms challenging baselines. Code and dataset are available at this https URL.
zh

[CV-215] Defeasible Visual Entailment: Benchmark Evaluator and Reward-Driven Optimization AAAI2025

【速读】：该论文试图解决视觉蕴涵任务中基于额外更新修改蕴涵关系的问题，提出了一个新的任务称为“可废止视觉蕴涵 (Defeasible Visual Entailment, DVE)”。解决方案的关键在于引入了一种新的推理感知评估器 (inference-aware evaluator)，通过成对对比学习和分类信息学习来捕捉更新导致的蕴涵关系变化。此外，论文还提出了一种奖励驱动的更新优化方法 (reward-driven update optimization method)，以提高多模态模型生成更新的质量。实验结果验证了所提出评估器和优化方法的有效性。

链接: https://arxiv.org/abs/2412.16232
作者: Yue Zhang,Liqiang Jing,Vibhav Gogate
机构: 未知
关键词: task called Defeasible, called Defeasible Visual, Defeasible Visual Entailment, text hypothesis based, Natural Language Inference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:We introduce a new task called Defeasible Visual Entailment (DVE), where the goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. While this concept is well-established in Natural Language Inference, it remains unexplored in visual entailment. At a high level, DVE enables models to refine their initial interpretations, leading to improved accuracy and reliability in various applications such as detecting misleading information in images, enhancing visual question answering, and refining decision-making processes in autonomous systems. Existing metrics do not adequately capture the change in the entailment relationship brought by updates. To address this, we propose a novel inference-aware evaluator designed to capture changes in entailment strength induced by updates, using pairwise contrastive learning and categorical information learning. Additionally, we introduce a reward-driven update optimization method to further enhance the quality of updates generated by multimodal models. Experimental results demonstrate the effectiveness of our proposed evaluator and optimization method.
zh

[CV-216] opView: Vectorising road users in a birds eye view from uncalibrated street-level imagery with deep learning

【速读】：该论文试图解决从图像中生成鸟瞰图（bird’s eye view）的问题，特别是在没有相机内参和外参先验知识的情况下。解决方案的关键在于通过学习场景的消失点（vanishing point）来实现物体的正交投影，从而将不同视角的物体投影到鸟瞰图上。此外，利用学习到的消失点和轨迹线，将道路用户的2D边界框转换为3D边界信息。该框架不仅能够实时生成地图，还能在大规模城市环境中分析社交距离违规行为，并在未校准的相机中实现高精度的道路用户地理定位，为城市建模技术和基于代理的建模（Agent-Based Modelling）提供了新的适应性。

链接: https://arxiv.org/abs/2412.16229
作者: Mohamed R Ibrahim
机构: University of Leeds (利兹大学)
关键词: bird eye view, detecting agent conflicts, measuring space occupancy, Generating a bird, bird eye
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages

点击查看摘要

Abstract:Generating a bird’s eye view of road users is beneficial for a variety of applications, including navigation, detecting agent conflicts, and measuring space occupancy, as well as the ability to utilise the metric system to measure distances between different objects. In this research, we introduce a simple approach for estimating a bird’s eye view from images without prior knowledge of a given camera’s intrinsic and extrinsic parameters. The model is based on the orthogonal projection of objects from various fields of view to a bird’s eye view by learning the vanishing point of a given scene. Additionally, we utilised the learned vanishing point alongside the trajectory line to transform the 2D bounding boxes of road users into 3D bounding information. The introduced framework has been applied to several applications to generate a live Map from camera feeds and to analyse social distancing violations at the city scale. The introduced framework shows a high validation in geolocating road users in various uncalibrated cameras. It also paves the way for new adaptations in urban modelling techniques and simulating the built environment accurately, which could benefit Agent-Based Modelling by relying on deep learning and computer vision.
zh

[CV-217] GALOT: Generative Active Learning via Optimizable Zero-shot Text-to-image Generation

【速读】：该论文试图解决主动学习（Active Learning, AL）在依赖有限标注数据和数据分布时性能受限的问题。解决方案的关键在于通过结合零样本文本到图像生成（zero-shot text-to-image, T2I）和主动学习，设计了一种新型框架，能够仅使用文本描述高效训练机器学习（ML）模型。具体来说，该框架利用主动学习准则优化文本输入，生成更具信息量和多样性的数据样本，并通过从文本生成的伪标签进行标注，形成合成数据集用于主动学习。这种方法不仅降低了数据收集和标注的成本，还通过提供信息丰富的训练样本提高了模型训练的效率，从而实现从文本描述到视觉模型的端到端机器学习任务。

链接: https://arxiv.org/abs/2412.16227
作者: Hanbin Hong,Shenao Yan,Shuya Feng,Yan Yan,Yuan Hong
机构: University of Connecticut(康涅狄格大学); Illinois Institute of Technology(伊利诺伊理工学院)
关键词: Active Learning, represents a crucial, emphasizing the identification, crucial methodology, identification and utilization
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Active Learning (AL) represents a crucial methodology within machine learning, emphasizing the identification and utilization of the most informative samples for efficient model training. However, a significant challenge of AL is its dependence on the limited labeled data samples and data distribution, resulting in limited performance. To address this limitation, this paper integrates the zero-shot text-to-image (T2I) synthesis and active learning by designing a novel framework that can efficiently train a machine learning (ML) model sorely using the text description. Specifically, we leverage the AL criteria to optimize the text inputs for generating more informative and diverse data samples, annotated by the pseudo-label crafted from text, then served as a synthetic dataset for active learning. This approach reduces the cost of data collection and annotation while increasing the efficiency of model training by providing informative training samples, enabling a novel end-to-end ML task from text description to vision models. Through comprehensive evaluations, our framework demonstrates consistent and significant improvements over traditional AL methods.
zh

[CV-218] Adaptive Calibration: A Unified Conversion Framework of Spiking Neural Network

【速读】：该论文试图解决传统人工神经网络（ANNs）与脉冲神经网络（SNNs）之间的性能差距问题，尤其是在转换过程中保持能量效率的挑战。解决方案的关键在于提出了一种无需训练的统一转换框架，并引入了自适应发射神经元模型（AdaFire），该模型通过动态调整不同层的发放模式来显著减少不均匀误差（Unevenness Error），这是在有限推理时间步长内转换SNNs的主要误差来源。此外，论文还提出了两种提升效率的技术：敏感性脉冲压缩（SSC）技术和输入感知自适应时间步长（IAT）技术，分别用于减少脉冲操作和降低延迟。这些方法共同实现了在CIFAR-10、CIFAR-100和ImageNet数据集上达到最先进性能的同时，分别实现了高达70.1%、60.3%和43.1%的能耗节省。

链接: https://arxiv.org/abs/2412.16219
作者: Ziqing Wang,Yuetong Fang,Jiahang Cao,Hongwei Ren,Renjing Xu
机构: 未知
关键词: Spiking Neural Networks, Artificial Neural Networks, traditional Artificial Neural, Neural Networks, Spiking Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are seen as an energy-efficient alternative to traditional Artificial Neural Networks (ANNs), but the performance gap remains a challenge. While this gap is narrowing through ANN-to-SNN conversion, substantial computational resources are still needed, and the energy efficiency of converted SNNs cannot be ensured. To address this, we present a unified training-free conversion framework that significantly enhances both the performance and efficiency of converted SNNs. Inspired by the biological nervous system, we propose a novel Adaptive-Firing Neuron Model (AdaFire), which dynamically adjusts firing patterns across different layers to substantially reduce the Unevenness Error - the primary source of error of converted SNNs within limited inference timesteps. We further introduce two efficiency-enhancing techniques: the Sensitivity Spike Compression (SSC) technique for reducing spike operations, and the Input-aware Adaptive Timesteps (IAT) technique for decreasing latency. These methods collectively enable our approach to achieve state-of-the-art performance while delivering significant energy savings of up to 70.1%, 60.3%, and 43.1% on CIFAR-10, CIFAR-100, and ImageNet datasets, respectively. Extensive experiments across 2D, 3D, event-driven classification tasks, object detection, and segmentation tasks, demonstrate the effectiveness of our method in various domains. The code is available at: this https URL.
zh

[CV-219] Zero-Shot Image Moderation in Google Ads with LLM -Assisted Textual Descriptions and Cross-modal Co-embeddings

【速读】：该论文试图解决在Google广告图像内容审核中，面对海量且内容多样化的广告图像以及不断变化的审核政策所带来的挑战。解决方案的关键在于利用人工编制的文本描述和跨模态的文本-图像共嵌入（cross-modal text-image co-embeddings）技术，实现对违规广告图像的零样本分类（zero-shot classification）。通过借助大型语言模型（LLMs）和用户专业知识，系统生成并优化代表政策指南的文本描述集。在推理阶段，利用输入图像与文本描述之间的共嵌入相似度作为违规检测的可靠信号，从而实现高效且灵活的广告内容审核。

链接: https://arxiv.org/abs/2412.16215
作者: Enming Luo,Wei Qiao,Katie Warren,Jingxiang Li,Eric Xiao,Krishna Viswanathan,Yuan Wang,Yintao Liu,Jimin Li,Ariel Fuxman
机构: Google Research(谷歌研究院); Google(谷歌)
关键词: moderating massive volumes, addressing the challenges, evolving policies, present a scalable, scalable and agile
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We present a scalable and agile approach for ads image content moderation at Google, addressing the challenges of moderating massive volumes of ads with diverse content and evolving policies. The proposed method utilizes human-curated textual descriptions and cross-modal text-image co-embeddings to enable zero-shot classification of policy violating ads images, bypassing the need for extensive supervised training data and human labeling. By leveraging large language models (LLMs) and user expertise, the system generates and refines a comprehensive set of textual descriptions representing policy guidelines. During inference, co-embedding similarity between incoming images and the textual descriptions serves as a reliable signal for policy violation detection, enabling efficient and adaptable ads content moderation. Evaluation results demonstrate the efficacy of this framework in significantly boosting the detection of policy violating content.
zh

[CV-220] AdvIRL: Reinforcement Learning-Based Adversarial Attacks on 3D NeRF Models AAAI-25

【速读】：该论文旨在解决3D生成模型（如NeRF）在面对对抗攻击时的脆弱性问题。解决方案的关键在于提出了一个名为AdvIRL的新框架，该框架结合了即时神经图形基元（Instant-NGP）和强化学习，用于生成对抗性噪声。与以往方法不同，AdvIRL生成的对抗性噪声在多种3D变换（如旋转和缩放）下保持稳健，从而能够在现实场景中实现有效的黑箱攻击。该方法不仅展示了在不同场景下的攻击效果，还证明了其作为对抗训练数据的潜力，以增强视觉系统的鲁棒性。

链接: https://arxiv.org/abs/2412.16213
作者: Tommy Nguyen,Mehmet Ergezer,Christian Green
机构: Wentworth Institute of Technology(温特沃斯理工学院); Amazon(亚马逊)
关键词: Neural Radiance Fields, Neural Graphics Primitives, Instant Neural Graphics, Radiance Fields, increasing deployment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Graphics (cs.GR); Image and Video Processing (eess.IV)
备注: Accepted to The AAAI-25 Workshop on Artificial Intelligence for Cyber Security (AICS)

点击查看摘要

Abstract:The increasing deployment of AI models in critical applications has exposed them to significant risks from adversarial attacks. While adversarial vulnerabilities in 2D vision models have been extensively studied, the threat landscape for 3D generative models, such as Neural Radiance Fields (NeRF), remains underexplored. This work introduces \textitAdvIRL, a novel framework for crafting adversarial NeRF models using Instant Neural Graphics Primitives (Instant-NGP) and Reinforcement Learning. Unlike prior methods, \textitAdvIRL generates adversarial noise that remains robust under diverse 3D transformations, including rotations and scaling, enabling effective black-box attacks in real-world scenarios. Our approach is validated across a wide range of scenes, from small objects (e.g., bananas) to large environments (e.g., lighthouses). Notably, targeted attacks achieved high-confidence misclassifications, such as labeling a banana as a slug and a truck as a cannon, demonstrating the practical risks posed by adversarial NeRFs. Beyond attacking, \textitAdvIRL-generated adversarial models can serve as adversarial training data to enhance the robustness of vision systems. The implementation of \textitAdvIRL is publicly available at \urlthis https URL, ensuring reproducibility and facilitating future research.
zh

[CV-221] ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping

【速读】：该论文试图解决从给定的手和物体运动序列生成一致且时间上连贯的双手-物体操作视频的问题。解决方案的关键在于构建一种多层遮挡 (Multi-Layer Occlusion, MLO) 表示，该表示通过学习从无遮挡的法线图和遮挡置信度图中提取的3D遮挡关系，增强了手-物体操作的3D一致性。通过将MLO结构嵌入到UNet中，模型提升了手-物体操作视频的生成质量。此外，论文还引入了大规模3D物体数据集Objaverse，以解决视频数据稀缺的问题，并通过创新的训练策略整合多个数据集，支持以人为中心的手-物体操作视频生成等下游任务。实验结果表明，该方法在生成合理的手-物体交互视频和可泛化的物体方面优于现有的最先进方法。

链接: https://arxiv.org/abs/2412.16212
作者: Youxin Pang,Ruizhi Shao,Jiajun Zhang,Hanzhang Tu,Yun Liu,Boyao Zhou,Hongwen Zhang,Yebin Liu
机构: Tsinghua University(清华大学); Beijing University of Posts and Telecommunications(北京邮电大学); Beijing Normal University(北京师范大学)
关键词: temporally coherent bimanual, coherent bimanual hand-object, bimanual hand-object manipulation, generating consistent, consistent and temporally
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we introduce ManiVideo, a novel method for generating consistent and temporally coherent bimanual hand-object manipulation videos from given motion sequences of hands and objects. The core idea of ManiVideo is the construction of a multi-layer occlusion (MLO) representation that learns 3D occlusion relationships from occlusion-free normal maps and occlusion confidence maps. By embedding the MLO structure into the UNet in two forms, the model enhances the 3D consistency of dexterous hand-object manipulation. To further achieve the generalizable grasping of objects, we integrate Objaverse, a large-scale 3D object dataset, to address the scarcity of video data, thereby facilitating the learning of extensive object consistency. Additionally, we propose an innovative training strategy that effectively integrates multiple datasets, supporting downstream tasks such as human-centric hand-object manipulation video generation. Through extensive experiments, we demonstrate that our approach not only achieves video generation with plausible hand-object interaction and generalizable objects, but also outperforms existing SOTA methods.
zh

[CV-222] Saliency Methods are Encoders: Analysing Logical Relations Towards Interpretation

【速读】：该论文试图解决神经网络解释性方法（如saliency maps）在评估过程中缺乏统一标准和容易产生确认偏差的问题。解决方案的关键在于提出了一种基于简单逻辑数据集的受控实验方法，通过分析不同saliency方法在处理互补和冗余信息时的表现，引入新的评估指标来量化解释性方法的质量。这种方法旨在通过逻辑关系来理解saliency方法在不同分类场景中的信息处理方式，并基于非信息性归因得分基线来检测典型期望的偏差。

链接: https://arxiv.org/abs/2412.16204
作者: Leonid Schwenke,Martin Atzmueller
机构: Semantic Information Systems Group (SIS); German Research Center for Artificial Intelligence (DFKI)
关键词: neural network architectures, necessitating explainability, increase in performance, neural network, network architectures
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 main text pages, 2 pages references, 13 pages appendix

点击查看摘要

Abstract:With their increase in performance, neural network architectures also become more complex, necessitating explainability. Therefore, many new and improved methods are currently emerging, which often generate so-called saliency maps in order to improve interpretability. Those methods are often evaluated by visual expectations, yet this typically leads towards a confirmation bias. Due to a lack of a general metric for explanation quality, non-accessible ground truth data about the model’s reasoning and the large amount of involved assumptions, multiple works claim to find flaws in those methods. However, this often leads to unfair comparison metrics. Additionally, the complexity of most datasets (mostly images or text) is often so high, that approximating all possible explanations is not feasible. For those reasons, this paper introduces a test for saliency map evaluation: proposing controlled experiments based on all possible model reasonings over multiple simple logical datasets. Using the contained logical relationships, we aim to understand how different saliency methods treat information in different class discriminative scenarios (e.g. via complementary and redundant information). By introducing multiple new metrics, we analyse propositional logical patterns towards a non-informative attribution score baseline to find deviations of typical expectations. Our results show that saliency methods can encode classification relevant information into the ordering of saliency scores.
zh

[CV-223] Aspect-Based Few-Shot Learning

【速读】：该论文试图解决传统小样本学习（few-shot learning）中单一“真实”标签假设的局限性问题，提出了一种基于“方面”（aspect）的新型小样本学习框架。解决方案的关键在于引入“方面”概念，通过构建上下文（context）来实现对查询对象和支持集对象的比较，而不依赖于预定义的类别集。具体实现上，论文提出了一种新的架构和训练过程，能够根据查询和支持集动态生成上下文，从而实现基于方面的比较。实验结果表明，该方法在Geometric Shapes和Sprites数据集上的表现验证了其可行性，相比传统小样本学习具有更好的灵活性和适应性。

链接: https://arxiv.org/abs/2412.16202
作者: Tim van Engeland,Lu Yin,Vlado Menkovski
机构: 未知
关键词: few-shot learning, introducing the concept, support set, few-shot, set
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We generalize the formulation of few-shot learning by introducing the concept of an aspect. In the traditional formulation of few-shot learning, there is an underlying assumption that a single “true” label defines the content of each data point. This label serves as a basis for the comparison between the query object and the objects in the support set. However, when a human expert is asked to execute the same task without a predefined set of labels, they typically consider the rest of the data points in the support set as context. This context specifies the level of abstraction and the aspect from which the comparison can be made. In this work, we introduce a novel architecture and training procedure that develops a context given the query and support set and implements aspect-based few-shot learning that is not limited to a predetermined set of classes. We demonstrate that our method is capable of forming and using an aspect for few-shot learning on the Geometric Shapes and Sprites dataset. The results validate the feasibility of our approach compared to traditional few-shot learning.
zh

[CV-224] Robust Spectral Anomaly Detection in EELS Spectral Images via Three Dimensional Convolutional Variational Autoencoders

【速读】：该论文旨在解决电子能量损失谱成像 (EELS-SI) 数据中的自动化异常检测问题，提出了一种三维卷积变分自编码器 (3D-CVAE) 作为解决方案。其关键在于利用 EELS-SI 数据的三维结构，通过负对数似然损失训练模型，使其能够重建无缺陷材料的特征，从而检测出细微的谱异常。与主成分分析 (PCA) 相比，3D-CVAE 在异常检测性能上表现更优，尤其是在处理低浓度异常和噪声主导的谱区域时，仍能保持高质量的重建效果。该方法为复杂材料系统的无监督自动化异常检测提供了强大的框架。

链接: https://arxiv.org/abs/2412.16200
作者: Seyfal Sultanov,James P Buban,Robert F Klie
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Department of Computer Science (计算机科学系); Department of Physics (物理系)
关键词: Convolutional Variational Autoencoder, Spectroscopy Spectrum Imaging, Electron Energy Loss, Energy Loss Spectroscopy, Loss Spectroscopy Spectrum
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a Three-Dimensional Convolutional Variational Autoencoder (3D-CVAE) for automated anomaly detection in Electron Energy Loss Spectroscopy Spectrum Imaging (EELS-SI) data. Our approach leverages the full three-dimensional structure of EELS-SI data to detect subtle spectral anomalies while preserving both spatial and spectral correlations across the datacube. By employing negative log-likelihood loss and training on bulk spectra, the model learns to reconstruct bulk features characteristic of the defect-free material. In exploring methods for anomaly detection, we evaluated both our 3D-CVAE approach and Principal Component Analysis (PCA), testing their performance using Fe L-edge peak shifts designed to simulate material defects. Our results show that 3D-CVAE achieves superior anomaly detection and maintains consistent performance across various shift magnitudes. The method demonstrates clear bimodal separation between normal and anomalous spectra, enabling reliable classification. Further analysis verifies that lower dimensional representations are robust to anomalies in the data. While performance advantages over PCA diminish with decreasing anomaly concentration, our method maintains high reconstruction quality even in challenging, noise-dominated spectral regions. This approach provides a robust framework for unsupervised automated detection of spectral anomalies in EELS-SI data, particularly valuable for analyzing complex material systems.
zh

[CV-225] Machine Learning-Based Automated Assessment of Intracorporeal Suturing in Laparoscopic Fundoplication

【速读】：该论文试图解决手术技能自动评估中的工具跟踪问题，特别是减少对人工标注的依赖。解决方案的关键在于开发了一种基于Segment Anything Model (SAM)的AI工具跟踪模型，该模型能够自动识别和跟踪腹腔镜手术中的双工具运动，从而无需人工标注。通过在Nissen胃底折叠术的缝合任务中应用该模型，研究验证了其在自动评估中的有效性，并展示了在监督学习和无监督学习框架下，该模型能够实现高性能的手术技能分类，无需依赖传统的运动学特征计算。

链接: https://arxiv.org/abs/2412.16195
作者: Shekhar Madhav Khairnar,Huu Phong Nguyen,Alexis Desir,Carla Holcomb,Daniel J. Scott,Ganesh Sankaranarayanan
机构: 未知
关键词: automated tool tracking, tool tracking, artificial intelligence, instantaneous feedback, surgical skills
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages

点击查看摘要

Abstract:Automated assessment of surgical skills using artificial intelligence (AI) provides trainees with instantaneous feedback. After bimanual tool motions are captured, derived kinematic metrics are reliable predictors of performance in laparoscopic tasks. Implementing automated tool tracking requires time-intensive human annotation. We developed AI-based tool tracking using the Segment Anything Model (SAM) to eliminate the need for human annotators. Here, we describe a study evaluating the usefulness of our tool tracking model in automated assessment during a laparoscopic suturing task in the fundoplication procedure. An automated tool tracking model was applied to recorded videos of Nissen fundoplication on porcine bowel. Surgeons were grouped as novices (PGY1-2) and experts (PGY3-5, attendings). The beginning and end of each suturing step were segmented, and motions of the left and right tools were extracted. A low-pass filter with a 24 Hz cut-off frequency removed noise. Performance was assessed using supervised and unsupervised models, and an ablation study compared results. Kinematic features–RMS velocity, RMS acceleration, RMS jerk, total path length, and Bimanual Dexterity–were extracted and analyzed using Logistic Regression, Random Forest, Support Vector Classifier, and XGBoost. PCA was performed for feature reduction. For unsupervised learning, a Denoising Autoencoder (DAE) model with classifiers, such as a 1-D CNN and traditional models, was trained. Data were extracted for 28 participants (9 novices, 19 experts). Supervised learning with PCA and Random Forest achieved an accuracy of 0.795 and an F1 score of 0.778. The unsupervised 1-D CNN achieved superior results with an accuracy of 0.817 and an F1 score of 0.806, eliminating the need for kinematic feature computation. We demonstrated an AI model capable of automated performance classification, independent of human annotation.
zh

[CV-226] Superposition through Active Learning lens

【速读】：该论文试图解决的问题是如何通过主动学习 (Active Learning) 方法来解码神经元叠加 (Superposition) 现象，这一现象是解释性和理解机器学习黑箱中的重要障碍。解决方案的关键在于通过对比基线模型和主动学习模型在多个评估标准（如t-SNE可视化、余弦相似度直方图、轮廓分数 (Silhouette Scores) 和Davies-Bouldin指数）下的表现，来探究主动学习是否能有效减少或解码叠加现象。然而，实验结果表明，主动学习模型在特征分离和整体准确性上并未显著优于基线模型，这暗示了非信息性样本选择和可能的过拟合问题，提示需要更复杂的方法来应对叠加现象。

链接: https://arxiv.org/abs/2412.16168
作者: Akanksha Devkar
机构: 未知
关键词: Machine Learning black-box, Neuron Polysemanticity, intricately beautiful blockers, active learning model, Polysemanticity are important
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 Pages, 6 Figures

点击查看摘要

Abstract:Superposition or Neuron Polysemanticity are important concepts in the field of interpretability and one might say they are these most intricately beautiful blockers in our path of decoding the Machine Learning black-box. The idea behind this paper is to examine whether it is possible to decode Superposition using Active Learning methods. While it seems that Superposition is an attempt to arrange more features in smaller space to better utilize the limited resources, it might be worth inspecting if Superposition is dependent on any other factors. This paper uses CIFAR-10 and Tiny ImageNet image datasets and the ResNet18 model and compares Baseline and Active Learning models and the presence of Superposition in them is inspected across multiple criteria, including t-SNE visualizations, cosine similarity histograms, Silhouette Scores, and Davies-Bouldin Indexes. Contrary to our expectations, the active learning model did not significantly outperform the baseline in terms of feature separation and overall accuracy. This suggests that non-informative sample selection and potential overfitting to uncertain samples may have hindered the active learning model’s ability to generalize better suggesting more sophisticated approaches might be needed to decode superposition and potentially reduce it.
zh

[CV-227] MRANet: A Modified Residual Attention Networks for Lung and Colon Cancer Classification

【速读】：该论文试图解决肺癌和结肠癌的早期和准确诊断问题，解决方案的关键在于提出了一种基于改进的残差注意力网络架构的高效深度学习模型。该模型通过训练25,000张高分辨率组织病理学图像，在二分类、三分类和五分类任务中分别达到了99.30%、96.63%和97.56%的准确率，显著超越了现有的最先进架构，从而在医学AI应用中提供了高度准确的癌症分类能力。

链接: https://arxiv.org/abs/2412.17700
作者: Diponkor Bala,S M Rakib Ul Karim,Rownak Ara Rasul
机构: 未知
关键词: predominant contributors, Lung and colon, cancer, model, cancer mortality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Lung and colon cancers are predominant contributors to cancer mortality. Early and accurate diagnosis is crucial for effective treatment. By utilizing imaging technology in different image detection, learning models have shown promise in automating cancer classification from histopathological images. This includes the histopathological diagnosis, an important factor in cancer type identification. This research focuses on creating a high-efficiency deep-learning model for identifying lung and colon cancer from histopathological images. We proposed a novel approach based on a modified residual attention network architecture. The model was trained on a dataset of 25,000 high-resolution histopathological images across several classes. Our proposed model achieved an exceptional accuracy of 99.30%, 96.63%, and 97.56% for two, three, and five classes, respectively; those are outperforming other state-of-the-art architectures. This study presents a highly accurate deep learning model for lung and colon cancer classification. The superior performance of our proposed model addresses a critical need in medical AI applications.
zh

[CV-228] Enhancing Reconstruction-Based Out-of-Distribution Detection in Brain MRI with Model and Metric Ensembles

【速读】：该论文试图解决医学图像分析中的分布外（Out-of-distribution, OOD）检测问题，特别是针对脑部MRI图像中的合成伪影进行无监督检测。解决方案的关键在于：1) 探索基于重构的自编码器在OOD检测中的潜力；2) 优化深度学习策略以适应OOD检测需求；3) 选择合适的重构评价指标。研究通过评估不同训练阶段、重构指标及其组合的效果，发现SSIM的对比度分量和LPIPS在检测均匀圆形异常方面表现最佳，结合两个收敛良好的模型并使用LPIPS和对比度作为重构指标，实现了像素级Precision-Recall曲线下面积为0.66的检测性能。此外，研究还强调了针对不同伪影类型（如局部和全局伪影）的检测性能差异，表明需要根据具体需求定制模型和指标选择。

链接: https://arxiv.org/abs/2412.17586
作者: Evi M.C. Huijben,Sina Amirrajab,Josien P.W. Pluim
机构: Eindhoven University of Technology, Department of Biomedical Engineering, Medical Image Analysis Group, Eindhoven, The Netherlands; Maastricht University, GROW - Research Institute for Oncology and Reproduction, Department of Precision Medicine, The D-Lab, Maastricht, the Netherlands
关键词: image analysis systems, automated medical image, medical image analysis, safely deploying automated, deploying automated medical
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is crucial for safely deploying automated medical image analysis systems, as abnormal patterns in images could hamper their performance. However, OOD detection in medical imaging remains an open challenge, and we address three gaps: the underexplored potential of a simple OOD detection model, the lack of optimization of deep learning strategies specifically for OOD detection, and the selection of appropriate reconstruction metrics. In this study, we investigated the effectiveness of a reconstruction-based autoencoder for unsupervised detection of synthetic artifacts in brain MRI. We evaluated the general reconstruction capability of the model, analyzed the impact of the selected training epoch and reconstruction metrics, assessed the potential of model and/or metric ensembles, and tested the model on a dataset containing a diverse range of artifacts. Among the metrics assessed, the contrast component of SSIM and LPIPS consistently outperformed others in detecting homogeneous circular anomalies. By combining two well-converged models and using LPIPS and contrast as reconstruction metrics, we achieved a pixel-level area under the Precision-Recall curve of 0.66. Furthermore, with the more realistic OOD dataset, we observed that the detection performance varied between artifact types; local artifacts were more difficult to detect, while global artifacts showed better detection results. These findings underscore the importance of carefully selecting metrics and model configurations, and highlight the need for tailored approaches, as standard deep learning approaches do not always align with the unique needs of OOD detection.
zh

[CV-229] Diffusion-Based Approaches in Medical Image Generation and Analysis

【速读】：该论文试图解决医学影像数据稀缺问题，尤其是由于隐私问题导致的可用数据不足。解决方案的关键在于利用扩散模型 (Diffusion Models) 生成合成且逼真的医学图像，以训练卷积神经网络 (CNN)。通过生成高质量的合成数据，研究验证了在脑肿瘤MRI、急性淋巴细胞白血病 (ALL) 和SARS-CoV-2 CT扫描三个领域中，使用合成数据训练的CNN在真实数据上的分类性能与使用原始数据训练的CNN相当。这表明扩散模型能够有效减少对患者特定数据的依赖，为医学影像分析提供了一种可行的数据增强方法。

链接: https://arxiv.org/abs/2412.16860
作者: Abdullah al Nomaan Nafi,Md. Alamgir Hossain,Rakib Hossain Rifat,Md Mahabub Uz Zaman,Md Manjurul Ahsan,Shivakumar Raman
机构: Islamic University(伊斯兰大学); Texas Tech University(德克萨斯理工大学); University of Oklahoma(俄克拉荷马大学)
关键词: imaging poses significant, poses significant challenges, significant challenges due, medical imaging poses, privacy concerns
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data scarcity in medical imaging poses significant challenges due to privacy concerns. Diffusion models, a recent generative modeling technique, offer a potential solution by generating synthetic and realistic data. However, questions remain about the performance of convolutional neural network (CNN) models on original and synthetic datasets. If diffusion-generated samples can help CNN models perform comparably to those trained on original datasets, reliance on patient-specific data for training CNNs might be reduced. In this study, we investigated the effectiveness of diffusion models for generating synthetic medical images to train CNNs in three domains: Brain Tumor MRI, Acute Lymphoblastic Leukemia (ALL), and SARS-CoV-2 CT scans. A diffusion model was trained to generate synthetic datasets for each domain. Pre-trained CNN architectures were then trained on these synthetic datasets and evaluated on unseen real data. All three datasets achieved promising classification performance using CNNs trained on synthetic data. Local Interpretable Model-Agnostic Explanations (LIME) analysis revealed that the models focused on relevant image features for classification. This study demonstrates the potential of diffusion models to generate synthetic medical images for training CNNs in medical image analysis.
zh

[CV-230] chnical Report: Towards Spatial Feature Regularization in Deep-Learning-Based Array-SAR Reconstruction

【速读】：该论文试图解决基于深度学习（DL）的阵列合成孔径雷达（Array-SAR）重建中忽视空间特征（如建筑物结构）的问题，导致重建结果出现孔洞和边缘破碎等伪影。解决方案的关键在于将空间特征正则化（spatial feature regularization）引入DL网络，通过描述、建模和正则化城市区域中的关键空间特征（如锐利边缘和几何形状），并设计特征增强的网络结构。具体方法包括提出一种基于2D切片的重建策略，通过并行和串行融合将切片融合为3D场景，并设计了两种计算框架（迭代增强和轻量增强），最终构建了四个专门的重建网络。实验结果表明，空间特征正则化显著提高了重建精度、恢复了更完整的建筑物结构，并增强了系统的鲁棒性。

链接: https://arxiv.org/abs/2412.16828
作者: Yu Ren,Xu Zhan,Yunqiao Hu,Xiangdong Ma,Liang Liu,Mou Wang,Jun Shi,Shunjun Wei,Tianjiao Zeng,Xiaoling Zhang
机构: University of Electronic Science and Technology of China (电子科技大学)
关键词: URL deep learning, Array synthetic aperture, http URL deep, synthetic aperture radar, demonstrated significant potential
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Array synthetic aperture radar (Array-SAR), also known as tomographic SAR (TomoSAR), has demonstrated significant potential for high-quality 3D mapping, particularly in urban this http URL deep learning (DL) methods have recently shown strengths in reconstruction, most studies rely on pixel-by-pixel reconstruction, neglecting spatial features like building structures, leading to artifacts such as holes and fragmented edges. Spatial feature regularization, effective in traditional methods, remains underexplored in DL-based approaches. Our study integrates spatial feature regularization into DL-based Array-SAR reconstruction, addressing key questions: What spatial features are relevant in urban-area mapping? How can these features be effectively described, modeled, regularized, and incorporated into DL networks? The study comprises five phases: spatial feature description and modeling, regularization, feature-enhanced network design, evaluation, and discussions. Sharp edges and geometric shapes in urban scenes are analyzed as key features. An intra-slice and inter-slice strategy is proposed, using 2D slices as reconstruction units and fusing them into 3D scenes through parallel and serial fusion. Two computational frameworks-iterative reconstruction with enhancement and light reconstruction with enhancement-are designed, incorporating spatial feature modules into DL networks, leading to four specialized reconstruction networks. Using our urban building simulation dataset and two public datasets, six tests evaluate close-point resolution, structural integrity, and robustness in urban scenarios. Results show that spatial feature regularization significantly improves reconstruction accuracy, retrieves more complete building structures, and enhances robustness by reducing noise and outliers.
zh

[CV-231] Evaluation of radiomic feature harmonization techniques for benign and malignant pulmonary nodules

【速读】：该论文试图解决肺结节（PNs）的影像组学特征在不同成像条件下的变异性问题，特别是良性与恶性肺结节之间的差异如何影响这些特征的校正。解决方案的关键在于采用不同的校正策略：不区分良性与恶性肺结节进行统一校正、使用协变量保留子组间差异的校正，以及分别对良性与恶性肺结节进行校正。研究结果表明，分别校正或使用协变量校正能够显著提高校正后特征的独立性，并提升基于这些特征的恶性预测模型的ROC-AUC值。因此，针对良性与恶性肺结节的影像组学特征需要不同的校正变换，以恢复其成像条件无关的分布。

链接: https://arxiv.org/abs/2412.16758
作者: Claire Huchthausen,Menglin Shi,Gabriel L.A. de Sousa,Jonathan Colen,Emery Shelley,James Larner,Krishni Wijesooriya
机构: 未知
关键词: malignant PNs, medical image acquisition, image acquisition variability, BACKGROUND, malignant
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, plus supplemental material

点击查看摘要

Abstract:BACKGROUND: Radiomics provides quantitative features of pulmonary nodules (PNs) which could aid lung cancer diagnosis, but medical image acquisition variability is an obstacle to clinical application. Acquisition effects may differ between radiomic features from benign vs. malignant PNs. PURPOSE: We evaluated how to account for differences between benign and malignant PNs when correcting radiomic features’ acquisition dependency. METHODS: We used 567 chest CT scans grouped as benign, malignant, or lung cancer screening (mixed benign, malignant). ComBat harmonization was applied to extracted features for variation in 4 acquisition parameters. We compared: harmonizing without distinction, harmonizing with a covariate to preserve distinctions between subgroups, and harmonizing subgroups separately. Significant ( p\le0.05 ) Kruskal-Wallis tests showed whether harmonization removed acquisition dependency. A LASSO-SVM pipeline was trained on successfully harmonized features to predict malignancy. To evaluate predictive information in these features, the trained harmonization estimators and predictive model were applied to unseen test sets. Harmonization and predictive performance were assessed for 10 trials of 5-fold cross-validation. RESULTS: An average 2.1% of features (95% CI:1.9-2.4%) were acquisition-independent when harmonized without distinction, 27.3% (95% CI:25.7-28.9%) when harmonized with a covariate, and 90.9% (95% CI:90.4-91.5%) when harmonized separately. Data harmonized separately or with a covariate trained models with higher ROC-AUC for screening scans than data harmonized without distinction between benign and malignant PNs (Delong test, adjusted p\le0.05 ). CONCLUSIONS: Radiomic features of benign and malignant PNs need different corrective transformations to recover acquisition-independent distributions. This can be done by harmonizing separately or with a covariate.
zh

[CV-232] Patherea: Cell Detection and Classification for the 2020s

【速读】：该论文试图解决细胞检测与分类的问题，提出了一种名为 Patherea 的框架，旨在为开发和评估最先进的方法提供完整的解决方案。解决方案的关键在于引入了一种基于点的检测与分类方法，直接预测点位置，无需中间表示。该方法通过混合匈牙利匹配策略和灵活的架构，有效利用点候选，并支持多种骨干网络和（预）训练策略。此外，论文还提出了一个新的数据集 Patherea，用于模拟临床工作流程中的 Ki-67 增殖指数估计，并展示了该数据集对现有方法的挑战性。论文还指出了现有评估协议中的性能指标计算错误，并提供了新的基准工具以实现公平比较。

链接: https://arxiv.org/abs/2412.16425
作者: Dejan Štepec,Maja Jerše,Snežana Đokić,Jera Jeruc,Nina Zidar,Danijel Skočaj
机构: 未知
关键词: proposed Patherea dataset, newly proposed Patherea, developing and evaluating, paper presents, complete solution
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to Medical Image Analysis

点击查看摘要

Abstract:This paper presents a Patherea, a framework for point-based cell detection and classification that provides a complete solution for developing and evaluating state-of-the-art approaches. We introduce a large-scale dataset collected to directly replicate a clinical workflow for Ki-67 proliferation index estimation and use it to develop an efficient point-based approach that directly predicts point-based predictions, without the need for intermediate representations. The proposed approach effectively utilizes point proposal candidates with the hybrid Hungarian matching strategy and a flexible architecture that enables the usage of various backbones and (pre)training strategies. We report state-of-the-art results on existing public datasets - Lizard, BRCA-M2C, BCData, and the newly proposed Patherea dataset. We show that the performance on existing public datasets is saturated and that the newly proposed Patherea dataset represents a significantly harder challenge for the recently proposed approaches. We also demonstrate the effectiveness of recently proposed pathology foundational models that our proposed approach can natively utilize and benefit from. We also revisit the evaluation protocol that is used in the broader field of cell detection and classification and identify the erroneous calculation of performance metrics. Patherea provides a benchmarking utility that addresses the identified issues and enables a fair comparison of different approaches. The dataset and the code will be publicly released upon acceptance.
zh

[CV-233] Generalizable Representation Learning for fMRI-based Neurological Disorder Identification

【速读】：该论文试图解决功能性磁共振成像（fMRI）数据在神经功能障碍分析中的异质性和数据稀缺性问题，尤其是在罕见疾病临床数据匮乏的情况下。解决方案的关键在于引入了一种结合元学习（meta-learning）和自监督学习（self-supervised learning）的新型表示学习策略。该策略通过在健康控制数据集上进行自监督学习，提取不局限于特定监督任务的固有特征，并利用元学习提升跨领域的泛化能力，从而在训练数据稀缺的临床任务中实现更好的特征识别和分类效果。

链接: https://arxiv.org/abs/2412.16197
作者: Wenhui Cui,Haleh Akrami,Anand A. Joshi,Richard M. Leahy
机构: University of Southern California (南加州大学); Ming Hsieh Department of Electrical and Computer Engineering (明希谢工程与计算机科学系)
关键词: brain activity analysis, impressive advances achieved, Magnetic Resonance Imaging, functional brain activity, functional Magnetic Resonance
类目: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: arXiv admin note: substantial text overlap with arXiv:2312.14204

点击查看摘要

Abstract:Despite the impressive advances achieved using deep learning for functional brain activity analysis, the heterogeneity of functional patterns and the scarcity of imaging data still pose challenges in tasks such as identifying neurological disorders. For functional Magnetic Resonance Imaging (fMRI), while data may be abundantly available from healthy controls, clinical data is often scarce, especially for rare diseases, limiting the ability of models to identify clinically-relevant features. We overcome this limitation by introducing a novel representation learning strategy integrating meta-learning with self-supervised learning to improve the generalization from normal to clinical features. This approach enables generalization to challenging clinical tasks featuring scarce training data. We achieve this by leveraging self-supervised learning on the control dataset to focus on inherent features that are not limited to a particular supervised task and incorporating meta-learning to improve the generalization across domains. To explore the generalizability of the learned representations to unseen clinical applications, we apply the model to four distinct clinical datasets featuring scarce and heterogeneous data for neurological disorder classification. Results demonstrate the superiority of our representation learning strategy on diverse clinically-relevant tasks.
zh

人工智能

[AI-0] Automating the Search for Artificial Life with Foundation Models

链接: https://arxiv.org/abs/2412.17799
作者: Akarsh Kumar,Chris Lu,Louis Kirsch,Yujin Tang,Kenneth O. Stanley,Phillip Isola,David Ha
关键词: recent Nobel Prize, Nobel Prize awarded, Nobel Prize, exploring large combinatorial, recent Nobel
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 27 pages, 17 figures

点击查看摘要

Abstract:With the recent Nobel Prize awarded for radical advances in protein discovery, foundation models (FMs) for exploring large combinatorial spaces promise to revolutionize many scientific fields. Artificial Life (ALife) has not yet integrated FMs, thus presenting a major opportunity for the field to alleviate the historical burden of relying chiefly on manual design and trial-and-error to discover the configurations of lifelike simulations. This paper presents, for the first time, a successful realization of this opportunity using vision-language FMs. The proposed approach, called Automated Search for Artificial Life (ASAL), (1) finds simulations that produce target phenomena, (2) discovers simulations that generate temporally open-ended novelty, and (3) illuminates an entire space of interestingly diverse simulations. Because of the generality of FMs, ASAL works effectively across a diverse range of ALife substrates including Boids, Particle Life, Game of Life, Lenia, and Neural Cellular Automata. A major result highlighting the potential of this technique is the discovery of previously unseen Lenia and Boids lifeforms, as well as cellular automata that are open-ended like Conway’s Game of Life. Additionally, the use of FMs allows for the quantification of previously qualitative phenomena in a human-aligned way. This new paradigm promises to accelerate ALife research beyond what is possible through human ingenuity alone.

[AI-1] Observation Interference in Partially Observable Assistance Games

链接: https://arxiv.org/abs/2412.17797
作者: Scott Emmons,Caspar Oesterheld,Vincent Conitzer,Stuart Russell
关键词: observable assistance games, partially observable assistance, study partially observable, human, assistance games
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human’s observations? First, we prove that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Though this result seems to contradict the classic theorem from single-agent decision making that the value of perfect information is nonnegative, we resolve this seeming contradiction by developing a notion of interference defined on entire policies. This can be viewed as an extension of the classic result that the value of perfect information is nonnegative into the cooperative multiagent setting. Second, we prove that if the human is simply making decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human’s preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.

[AI-2] SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC

链接: https://arxiv.org/abs/2412.17707
作者: Yue Deng,Yan Yu,Weiyu Ma,Zirui Wang,Wenhui Zhu,Jian Zhao,Yin Zhang
关键词: Multi-Agent Reinforcement Learning, challenging simulation environments, Reinforcement Learning, Multi-Agent Reinforcement, MARL algorithms
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The availability of challenging simulation environments is pivotal for advancing the field of Multi-Agent Reinforcement Learning (MARL). In cooperative MARL settings, the StarCraft Multi-Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm. However, with continual advancements in SMAC, many algorithms now exhibit near-optimal performance, complicating the evaluation of their true effectiveness. To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies. To overcome these limitations, we propose SMAC-HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness. SMAC-HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self-play, enabling agents to generalize to varying opponent behaviors and improve model stability. Furthermore, we introduce a black-box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms. We conduct extensive evaluations of widely used and state-of-the-art algorithms on SMAC-HARD, revealing the substantial challenges posed by edited and mixed strategy opponents. Additionally, the black-box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries. We envision SMAC-HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self-play methods for multi-agent systems. Our code is available at this https URL.

[AI-3] FedTLU: Federated Learning with Targeted Layer Updates

链接: https://arxiv.org/abs/2412.17692
作者: Jong-Ik Park,Carlee Joe-Wong
关键词: addresses privacy concerns, enabling multiple clients, addresses privacy, privacy concerns, modeling by enabling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning (FL) addresses privacy concerns in language modeling by enabling multiple clients to contribute to training language models. However, non-IID (identically and independently distributed) data across clients often limits FL’s performance. This issue is especially challenging during model fine-tuning, as noise due to variations in clients’ data distributions can harm model convergence near the optimum. This paper proposes a targeted layer update strategy for fine-tuning in FL. Instead of randomly updating layers of the language model, as often done in practice, we use a scoring mechanism to identify and update the most critical layers, avoiding excessively noisy or even poisoned updates by freezing the parameters in other layers. We show in extensive experiments that our method improves convergence and performance in non-IID settings, offering a more efficient approach to fine-tuning federated language models.

[AI-4] Large Language Model Safety: A Holistic Survey

链接: https://arxiv.org/abs/2412.17686
作者: Dan Shi,Tianhao Shen,Yufei Huang,Zhigen Li,Yongqi Leng,Renren Jin,Chuang Liu,Xinwei Wu,Zishan Guo,Linhao Yu,Ling Shi,Bojian Jiang,Deyi Xiong
关键词: natural language understanding, large language models, LLM safety, LLM, large language
类目: Artificial Intelligence (cs.AI)
*备注: 158 pages, 18 figures

点击查看摘要

Abstract:The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academy researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers has been publicly available at this https URL. Comments: 158 pages, 18 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2412.17686 [cs.AI] (or arXiv:2412.17686v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.17686 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-5] Detecting anxiety and depression in dialogues: a multi-label and explainable approach

链接: https://arxiv.org/abs/2412.17651
作者: Francisco de Arriba-Pérez,Silvia García-Méndez
关键词: health issues worldwide, mental health issues, common mental health, issues worldwide, affecting a non-negligible
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Anxiety and depression are the most common mental health issues worldwide, affecting a non-negligible part of the population. Accordingly, stakeholders, including governments’ health systems, are developing new strategies to promote early detection and prevention from a holistic perspective (i.e., addressing several disorders simultaneously). In this work, an entirely novel system for the multi-label classification of anxiety and depression is proposed. The input data consists of dialogues from user interactions with an assistant chatbot. Another relevant contribution lies in using Large Language Models (LLMs) for feature extraction, provided the complexity and variability of language. The combination of LLMs, given their high capability for language understanding, and Machine Learning (ML) models, provided their contextual knowledge about the classification problem thanks to the labeled data, constitute a promising approach towards mental health assessment. To promote the solution’s trustworthiness, reliability, and accountability, explainability descriptions of the model’s decision are provided in a graphical dashboard. Experimental results on a real dataset attain 90 % accuracy, improving those in the prior literature. The ultimate objective is to contribute in an accessible and scalable way before formal treatment occurs in the healthcare systems.

[AI-6] An Adaptive Framework for Multi-View Clustering Leveraging Conditional Entropy Optimization

链接: https://arxiv.org/abs/2412.17647
作者: Lijian Li
关键词: extracting valuable insights, perspectives or modalities, powerful technique, technique for extracting, extracting valuable
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-view clustering (MVC) has emerged as a powerful technique for extracting valuable insights from data characterized by multiple perspectives or modalities. Despite significant advancements, existing MVC methods struggle with effectively quantifying the consistency and complementarity among views, and are particularly susceptible to the adverse effects of noisy views, known as the Noisy-View Drawback (NVD). To address these challenges, we propose CE-MVC, a novel framework that integrates an adaptive weighting algorithm with a parameter-decoupled deep model. Leveraging the concept of conditional entropy and normalized mutual information, CE-MVC quantitatively assesses and weights the informative contribution of each view, facilitating the construction of robust unified representations. The parameter-decoupled design enables independent processing of each view, effectively mitigating the influence of noise and enhancing overall clustering performance. Extensive experiments demonstrate that CE-MVC outperforms existing approaches, offering a more resilient and accurate solution for multi-view clustering tasks.

[AI-7] Advances in Machine Learning Research Using Knowledge Graphs

链接: https://arxiv.org/abs/2412.17643
作者: Jing Si,Jianfei Xu
关键词: National Knowledge Infrastructure, China National Knowledge, Knowledge Infrastructure, National Knowledge, China National
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The study uses CSSCI-indexed literature from the China National Knowledge Infrastructure (CNKI) database as the data source. It utilizes the CiteSpace visualization software to draw knowledge graphs on aspects such as institutional collaboration and keyword co-occurrence. This analysis provides insights into the current state of research and emerging trends in the field of machine learning in China. Additionally, it identifies the challenges faced in the field of machine learning research and offers suggestions that could serve as valuable references for future research.

[AI-8] Graph Neural Networks Are Evolutionary Algorithms

链接: https://arxiv.org/abs/2412.17629
作者: Kaichen Ouyang,Shengwei Fu
关键词: graph neural networks, Graph Neural Evolution, traditionally distinct fields, propose Graph Neural, graph neural
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 31 pages, 10 figures

点击查看摘要

Abstract:In this paper, we reveal the intrinsic duality between graph neural networks (GNNs) and evolutionary algorithms (EAs), bridging two traditionally distinct fields. Building on this insight, we propose Graph Neural Evolution (GNE), a novel evolutionary algorithm that models individuals as nodes in a graph and leverages designed frequency-domain filters to balance global exploration and local exploitation. Through the use of these filters, GNE aggregates high-frequency (diversity-enhancing) and low-frequency (stability-promoting) information, transforming EAs into interpretable and tunable mechanisms in the frequency domain. Extensive experiments on benchmark functions demonstrate that GNE consistently outperforms state-of-the-art algorithms such as GA, DE, CMA-ES, SDAES, and RL-SHADE, excelling in complex landscapes, optimal solution shifts, and noisy environments. Its robustness, adaptability, and superior convergence highlight its practical and theoretical value. Beyond optimization, GNE establishes a conceptual and mathematical foundation linking EAs and GNNs, offering new perspectives for both fields. Its framework encourages the development of task-adaptive filters and hybrid approaches for EAs, while its insights can inspire advances in GNNs, such as improved global information propagation and mitigation of oversmoothing. GNE’s versatility extends to solving challenges in machine learning, including hyperparameter tuning and neural architecture search, as well as real-world applications in engineering and operations research. By uniting the dynamics of EAs with the structural insights of GNNs, this work provides a foundation for interdisciplinary innovation, paving the way for scalable and interpretable solutions to complex optimization problems.

[AI-9] Facial Expression Analysis and Its Potentials in IoT Systems: A Contemporary Survey

链接: https://arxiv.org/abs/2412.17616
作者: Zixuan Shanggua,Yanjie Dong,Song Guo,Victor C. M. Leung,M. Jamal Deen,Xiping Hu
关键词: convey human emotions, expressions convey human, facial expression analysis, Facial expressions convey, categorized into macro-expressions
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Facial expressions convey human emotions and can be categorized into macro-expressions (MaEs) and micro-expressions (MiEs) based on duration and intensity. While MaEs are voluntary and easily recognized, MiEs are involuntary, rapid, and can reveal concealed emotions. The integration of facial expression analysis with Internet-of-Thing (IoT) systems has significant potential across diverse scenarios. IoT-enhanced MaE analysis enables real-time monitoring of patient emotions, facilitating improved mental health care in smart healthcare. Similarly, IoT-based MiE detection enhances surveillance accuracy and threat detection in smart security. This work aims at providing a comprehensive overview of research progress in facial expression analysis and explores its integration with IoT systems. We discuss the distinctions between our work and existing surveys, elaborate on advancements in MaE and MiE techniques across various learning paradigms, and examine their potential applications in IoT. We highlight challenges and future directions for the convergence of facial expression-based technologies and IoT systems, aiming to foster innovation in this domain. By presenting recent developments and practical applications, this study offers a systematic understanding of how facial expression analysis can enhance IoT systems in healthcare, security, and beyond.

[AI-10] Emerging Security Challenges of Large Language Models

链接: https://arxiv.org/abs/2412.17614
作者: Herve Debar,Sven Dietrich,Pavel Laskov,Emil C. Lupu,Eirini Ntoutsi
关键词: Large language models, sectors including high, including high importance, high importance areas, Large language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: A version of this appeared in the larger Dagstuhl seminar 23431 report ( this https URL )

点击查看摘要

Abstract:Large language models (LLMs) have achieved record adoption in a short period of time across many different sectors including high importance areas such as education [4] and healthcare [23]. LLMs are open-ended models trained on diverse data without being tailored for specific downstream tasks, enabling broad applicability across various domains. They are commonly used for text generation, but also widely used to assist with code generation [3], and even analysis of security information, as Microsoft Security Copilot demonstrates [18]. Traditional Machine Learning (ML) models are vulnerable to adversarial attacks [9]. So the concerns on the potential security implications of such wide scale adoption of LLMs have led to the creation of this working group on the security of LLMs. During the Dagstuhl seminar on “Network Attack Detection and Defense - AI-Powered Threats and Responses”, the working group discussions focused on the vulnerability of LLMs to adversarial attacks, rather than their potential use in generating malware or enabling cyberattacks. Although we note the potential threat represented by the latter, the role of the LLMs in such uses is mostly as an accelerator for development, similar to what it is in benign use. To make the analysis more specific, the working group employed ChatGPT as a concrete example of an LLM and addressed the following points, which also form the structure of this report: 1. How do LLMs differ in vulnerabilities from traditional ML models? 2. What are the attack objectives in LLMs? 3. How complex it is to assess the risks posed by the vulnerabilities of LLMs? 4. What is the supply chain in LLMs, how data flow in and out of systems and what are the security implications? We conclude with an overview of open challenges and outlook.

[AI-11] PC Agent : While You Sleep AI Works – A Cognitive Journey into Digital World

链接: https://arxiv.org/abs/2412.17589
作者: Yanheng He,Jiahe Jin,Shijie Xia,Jiadi Su,Runze Fan,Haoyang Zou,Xiangkun Hu,Pengfei Liu
关键词: Imagine a world, drafting a report, digital agents, Imagine, capable digital agents
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imagine a world where AI can handle your work while you sleep - organizing your research materials, drafting a report, or creating a presentation you need for tomorrow. However, while current digital agents can perform simple tasks, they are far from capable of handling the complex real-world work that humans routinely perform. We present PC Agent, an AI system that demonstrates a crucial step toward this vision through human cognition transfer. Our key insight is that the path from executing simple “tasks” to handling complex “work” lies in efficiently capturing and learning from human cognitive processes during computer use. To validate this hypothesis, we introduce three key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently collects high-quality human-computer interaction trajectories with complete cognitive context; (2) a two-stage cognition completion pipeline that transforms raw interaction data into rich cognitive trajectories by completing action semantics and thought processes; and (3) a multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding. Our preliminary experiments in PowerPoint presentation creation reveal that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data - PC Agent, trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications. This demonstrates the data efficiency of our approach, highlighting that the key to training capable digital agents lies in collecting human cognitive data. By open-sourcing our complete framework, including the data collection infrastructure and cognition completion methods, we aim to lower the barriers for the research community to develop truly capable digital agents.

[AI-12] Evaluation of Bio-Inspired Models under Different Learning Settings For Energy Efficiency in Network Traffic Prediction

链接: https://arxiv.org/abs/2412.17565
作者: Theodoros Tsiolakis,Nikolaos Pavlidis,Vasileios Perifanis,Pavlos Efraimidis
关键词: rapidly evolving environments, efficiently allocate resources, enables network operators, Cellular traffic forecasting, Spiking Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:Cellular traffic forecasting is a critical task that enables network operators to efficiently allocate resources and address anomalies in rapidly evolving environments. The exponential growth of data collected from base stations poses significant challenges to processing and analysis. While machine learning (ML) algorithms have emerged as powerful tools for handling these large datasets and providing accurate predictions, their environmental impact, particularly in terms of energy consumption, is often overlooked in favor of their predictive capabilities. This study investigates the potential of two bio-inspired models: Spiking Neural Networks (SNNs) and Reservoir Computing through Echo State Networks (ESNs) for cellular traffic forecasting. The evaluation focuses on both their predictive performance and energy efficiency. These models are implemented in both centralized and federated settings to analyze their effectiveness and energy consumption in decentralized systems. Additionally, we compare bio-inspired models with traditional architectures, such as Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs), to provide a comprehensive evaluation. Using data collected from three diverse locations in Barcelona, Spain, we examine the trade-offs between predictive accuracy and energy demands across these approaches. The results indicate that bio-inspired models, such as SNNs and ESNs, can achieve significant energy savings while maintaining predictive accuracy comparable to traditional architectures. Furthermore, federated implementations were tested to evaluate their energy efficiency in decentralized settings compared to centralized systems, particularly in combination with bio-inspired models. These findings offer valuable insights into the potential of bio-inspired models for sustainable and privacy-preserving cellular traffic forecasting.

[AI-13] Retention Score: Quantifying Jailbreak Risks for Vision Language Models AAAI2025

链接: https://arxiv.org/abs/2412.17544
作者: Zaitang Li,Pin-Yu Chen,Tsung-Yi Ho
关键词: Large Language Models, Large Language, machine learning capabilities, integrating computer vision, vision with Large
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figures, AAAI 2025

点击查看摘要

Abstract:The emergence of Vision-Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has also made VLMs vulnerable to sophisticated adversarial attacks, raising concerns about their reliability. The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM’s ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the \textbfRetention Score. Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using a conditional diffusion model. These pairs are then predicted for toxicity score by a VLM alongside a toxicity judgment classifier. By calculating the margin in toxicity scores, we can quantify the robustness of the VLM in an attack-agnostic manner. Our work has four main contributions. First, we prove that Retention Score can serve as a certified robustness metric. Second, we demonstrate that most VLMs with visual components are less robust against jailbreak attacks than the corresponding plain VLMs. Additionally, we evaluate black-box VLM APIs and find that the security settings in Google Gemini significantly affect the score and robustness. Moreover, the robustness of GPT4V is similar to the medium settings of Gemini. Finally, our approach offers a time-efficient alternative to existing adversarial attack methods and provides consistent model robustness rankings when evaluated on VLMs including MiniGPT-4, InstructBLIP, and LLaVA.

[AI-14] Double Landmines: Invisible Textual Backdoor Attacks based on Dual-Trigger

链接: https://arxiv.org/abs/2412.17531
作者: Yang Hou,Qiuling Yue,Lujia Chai,Guozhao Liao,Wenbao Han,Wei Ou
关键词: inserting specific content, final poisoning rate, abstract text features, attack performance, backdoor attack based
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:At present, all textual backdoor attack methods are based on single triggers: for example, inserting specific content into the text to activate the backdoor; or changing the abstract text features. The former is easier to be identified by existing defense strategies due to its obvious characteristics; the latter, although improved in invisibility, has certain shortcomings in terms of attack performance, construction of poisoned datasets, and selection of the final poisoning rate. On this basis, this paper innovatively proposes a Dual-Trigger backdoor attack based on syntax and mood, and optimizes the construction of the poisoned dataset and the selection strategy of the final poisoning rate. A large number of experimental results show that this method significantly outperforms the previous methods based on abstract features in attack performance, and achieves comparable attack performance (almost 100% attack success rate) with the insertion-based method. In addition, the two trigger mechanisms included in this method can be activated independently in the application phase of the model, which not only improves the flexibility of the trigger style, but also enhances its robustness against defense strategies. These results profoundly reveal that textual backdoor attacks are extremely harmful and provide a new perspective for security protection in this field.

[AI-15] Enhancing Cancer Diagnosis with Explainable Trustworthy Deep Learning Models

链接: https://arxiv.org/abs/2412.17527
作者: Badaru I. Olumuyiwa, TheAnh Han,Zia U. Shamszaman
关键词: explainable Artificial Intelligence, Artificial Intelligence, explainable Artificial, research presents, presents an innovative
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research presents an innovative approach to cancer diagnosis and prediction using explainable Artificial Intelligence (XAI) and deep learning techniques. With cancer causing nearly 10 million deaths globally in 2020, early and accurate diagnosis is crucial. Traditional methods often face challenges in cost, accuracy, and efficiency. Our study develops an AI model that provides precise outcomes and clear insights into its decision-making process, addressing the “black box” problem of deep learning models. By employing XAI techniques, we enhance interpretability and transparency, building trust among healthcare professionals and patients. Our approach leverages neural networks to analyse extensive datasets, identifying patterns for cancer detection. This model has the potential to revolutionise diagnosis by improving accuracy, accessibility, and clarity in medical decision-making, possibly leading to earlier detection and more personalised treatment strategies. Furthermore, it could democratise access to high-quality diagnostics, particularly in resource-limited settings, contributing to global health equity. The model’s applications extend beyond cancer diagnosis, potentially transforming various aspects of medical decision-making and saving millions of lives worldwide.

[AI-16] STAHGNet: Modeling Hybrid-grained Heterogenous Dependency Efficiently for Traffic Prediction

链接: https://arxiv.org/abs/2412.17524
作者: Jiyao Wang,Zehua Peng,Yijia Zhang,Dengbo He,Lei Chen
关键词: Traffic flow prediction, flow prediction plays, underlying complex Spatio-temporal, complex Spatio-temporal patterns, Traffic flow
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by Neural Computing and Applications

点击查看摘要

Abstract:Traffic flow prediction plays a critical role in the intelligent transportation system, and it is also a challenging task because of the underlying complex Spatio-temporal patterns and heterogeneities evolving across time. However, most present works mostly concentrate on solely capturing Spatial-temporal dependency or extracting implicit similarity graphs, but the hybrid-granularity evolution is ignored in their modeling process. In this paper, we proposed a novel data-driven end-to-end framework, named Spatio-Temporal Aware Hybrid Graph Network (STAHGNet), to couple the hybrid-grained heterogeneous correlations in series simultaneously through an elaborately Hybrid Graph Attention Module (HGAT) and Coarse-granularity Temporal Graph (CTG) generator. Furthermore, an automotive feature engineering with domain knowledge and a random neighbor sampling strategy is utilized to improve efficiency and reduce computational complexity. The MAE, RMSE, and MAPE are used for evaluation metrics. Tested on four real-life datasets, our proposal outperforms eight classical baselines and four state-of-the-art (SOTA) methods (e.g., MAE 14.82 on PeMSD3; MAE 18.92 on PeMSD4). Besides, extensive experiments and visualizations verify the effectiveness of each component in STAHGNet. In terms of computational cost, STAHGNet saves at least four times the space compared to the previous SOTA models. The proposed model will be beneficial for more efficient TFP as well as intelligent transport system construction.

[AI-17] A Toolkit for Virtual Reality Data Collection

链接: https://arxiv.org/abs/2412.17490
作者: Tim Rolff,Niklas Hypki,Markus Lappe,Frank Steinicke
关键词: reality datasets remains, multidimensional virtual reality, number of users, acquiring large-scale, significant challenge
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to the still relatively low number of users, acquiring large-scale and multidimensional virtual reality datasets remains a significant challenge. Consequently, VR datasets comparable in size to state-of-the-art collections in natural language processing or computer vision are rare or absent. However, the availability of such datasets could unlock groundbreaking advancements in deep-learning, psychological modeling, and data analysis in the context of VR. In this paper, we present a versatile data collection toolkit designed to facilitate the capturing of extensive VR datasets. Our toolkit seamlessly integrates with any device, either directly via OpenXR or through the use of a virtual device. Additionally, we introduce a robust data collection pipeline that emphasizes ethical practices (e.g., ensuring data protection and regulation) and ensures a standardized, reproducible methodology.

[AI-18] DeepMF: Deep Motion Factorization for Closed-Loop Safety-Critical Driving Scenario Simulation

链接: https://arxiv.org/abs/2412.17487
作者: Yizhe Li,Linrui Zhang,Xueqian Wang,Houde Liu,Bin Liang
关键词: great practical relevance, Safety-critical traffic scenarios, traffic scenario generation, traffic, Safety-critical traffic
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safety-critical traffic scenarios are of great practical relevance to evaluating the robustness of autonomous driving (AD) systems. Given that these long-tail events are extremely rare in real-world traffic data, there is a growing body of work dedicated to the automatic traffic scenario generation. However, nearly all existing algorithms for generating safety-critical scenarios rely on snippets of previously recorded traffic events, transforming normal traffic flow into accident-prone situations directly. In other words, safety-critical traffic scenario generation is hindsight and not applicable to newly encountered and open-ended traffic this http URL this paper, we propose the Deep Motion Factorization (DeepMF) framework, which extends static safety-critical driving scenario generation to closed-loop and interactive adversarial traffic simulation. DeepMF casts safety-critical traffic simulation as a Bayesian factorization that includes the assignment of hazardous traffic participants, the motion prediction of selected opponents, the reaction estimation of autonomous vehicle (AV) and the probability estimation of the accident occur. All the aforementioned terms are calculated using decoupled deep neural networks, with inputs limited to the current observation and historical states. Consequently, DeepMF can effectively and efficiently simulate safety-critical traffic scenarios at any triggered time and for any duration by maximizing the compounded posterior probability of traffic risk. Extensive experiments demonstrate that DeepMF excels in terms of risk management, flexibility, and diversity, showcasing outstanding performance in simulating a wide range of realistic, high-risk traffic scenarios.

[AI-19] Is ChatGPT Massively Used by Students Nowadays? A Survey on the Use of Large Language Models such as ChatGPT in Educational Settings

链接: https://arxiv.org/abs/2412.17486
作者: Jérémie Sublime,Ilaria Renna
关键词: Large Language Models, offering transformative opportunities, Language Models, Large Language, based on Large
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 33 pages + references

点击查看摘要

Abstract:The rapid adoption of Generative AI (GenAI) based on Large Language Models (LLMs) such as ChatGPT has recently and profoundly impacted education, offering transformative opportunities while raising significant concerns. In this study we present the results of a survey that investigates how 395 students aged 13 to 25 years old in France and Italy integrate LLMs into their educational routines. Key findings include the widespread use of these tools across all age groups and disciplines, with older students and male students demonstrating higher usage frequencies, particularly in scientific contexts. The results also show gender disparities, raising concerns about an emerging AI literacy and technological gender gap. Additionally, while most students utilise LLMs constructively, the lack of systematic proofreading and critical evaluation among younger users suggests potential risks to cognitive skills development, including critical thinking and foundational knowledge. The survey results underscore the need for educational institutions to adapt their curricula to integrate AI tools effectively, promoting ethical use, critical thinking, and awareness of AI limitations and environmental costs. This paper provides actionable recommendations for fostering equitable and effective cohabitation of LLMs and education while addressing emerging challenges. Comments: 33 pages + references Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2412.17486 [cs.CY] (or arXiv:2412.17486v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2412.17486 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-20] Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

链接: https://arxiv.org/abs/2412.17484
作者: Francesco Lettich,Emanuele Carlini,Franco Maria Nardini,Raffaele Perego,Salvatore Trani
关键词: Large Language Models, Artificial Intelligence, Intelligence and Large, Large Language, impacting operational costs
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.

[AI-21] Line Graph Vietoris-Rips Persistence Diagram for Topological Graph Representation Learning

链接: https://arxiv.org/abs/2412.17468
作者: Jaesun Shin,Eunjoo Jeon,Taewon Cho,Namkyeong Cho,Youngjune Gwon
关键词: informative node embeddings, Topological Edge Diagram, persistence diagram, node embedding information, result in informative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
*备注: 36 pages. Accepted to Journal of Machine Learning Research

点击查看摘要

Abstract:While message passing graph neural networks result in informative node embeddings, they may suffer from describing the topological properties of graphs. To this end, node filtration has been widely used as an attempt to obtain the topological information of a graph using persistence diagrams. However, these attempts have faced the problem of losing node embedding information, which in turn prevents them from providing a more expressive graph representation. To tackle this issue, we shift our focus to edge filtration and introduce a novel edge filtration-based persistence diagram, named Topological Edge Diagram (TED), which is mathematically proven to preserve node embedding information as well as contain additional topological information. To implement TED, we propose a neural network based algorithm, named Line Graph Vietoris-Rips (LGVR) Persistence Diagram, that extracts edge information by transforming a graph into its line graph. Through LGVR, we propose two model frameworks that can be applied to any message passing GNNs, and prove that they are strictly more powerful than Weisfeiler-Lehman type colorings. Finally we empirically validate superior performance of our models on several graph classification and regression benchmarks.

[AI-22] Applying LLM and Topic Modelling in Psychotherapeutic Contexts

链接: https://arxiv.org/abs/2412.17449
作者: Alexander Vanin,Vadim Bolshev,Anastasia Panfilova
关键词: Large language models, analyze therapist remarks, Large language, psychotherapeutic setting, models to analyze
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:This study explores the use of Large language models to analyze therapist remarks in a psychotherapeutic setting. The paper focuses on the application of BERTopic, a machine learning-based topic modeling tool, to the dialogue of two different groups of therapists (classical and modern), which makes it possible to identify and describe a set of topics that consistently emerge across these groups. The paper describes in detail the chosen algorithm for BERTopic, which included creating a vector space from a corpus of therapist remarks, reducing its dimensionality, clustering the space, and creating and optimizing topic representation. Along with the automatic topical modeling by the BERTopic, the research involved an expert assessment of the findings and manual topic structure optimization. The topic modeling results highlighted the most common and stable topics in therapists speech, offering insights into how language patterns in therapy develop and remain stable across different therapeutic styles. This work contributes to the growing field of machine learning in psychotherapy by demonstrating the potential of automated methods to improve both the practice and training of therapists. The study highlights the value of topic modeling as a tool for gaining a deeper understanding of therapeutic dialogue and offers new opportunities for improving therapeutic effectiveness and clinical supervision.

[AI-23] he Role of XAI in Transforming Aeronautics and Aerospace Systems

链接: https://arxiv.org/abs/2412.17440
作者: Francisco Javier Cantero Zorita,Mikel Galafate,Javier M. Moguerza,Isaac Martín de Diego,M. Teresa Gonzalez,Gema Gutierrez Peña
关键词: Artificial Intelligence, eXplainable Artificial Intelligence, Recent advancements, transformed decision-making, decision-making in aeronautics
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in Artificial Intelligence (AI) have transformed decision-making in aeronautics and aerospace. These advancements in AI have brought with them the need to understand the reasons behind the predictions generated by AI systems and models, particularly by professionals in these sectors. In this context, the emergence of eXplainable Artificial Intelligence (XAI) has helped bridge the gap between professionals in the aeronautical and aerospace sectors and the AI systems and models they work with. For this reason, this paper provides a review of the concept of XAI is carried out defining the term and the objectives it aims to achieve. Additionally, the paper discusses the types of models defined within it and the properties these models must fulfill to be considered transparent, as well as the post-hoc techniques used to understand AI systems and models after their training. Finally, various application areas within the aeronautical and aerospace sectors will be presented, highlighting how XAI is used in these fields to help professionals understand the functioning of AI systems and models.

[AI-24] Markov Process-Based Graph Convolutional Networks for Entity Classification in Knowledge Graphs

链接: https://arxiv.org/abs/2412.17438
作者: Johannes Mäkelburg,Yiwen Peng,Mehwish Alam,Tobias Weller,Maribel Acosta
关键词: class affiliation, affiliation of entities, Knowledge Graphs, Graph Convolutional Networks, encoded in Knowledge
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the vast amount of information encoded in Knowledge Graphs (KGs), information about the class affiliation of entities remains often incomplete. Graph Convolutional Networks (GCNs) have been shown to be effective predictors of complete information about the class affiliation of entities in KGs. However, these models do not learn the class affiliation of entities in KGs incorporating the complexity of the task, which negatively affects the models prediction capabilities. To address this problem, we introduce a Markov process-based architecture into well-known GCN architectures. This end-to-end network learns the prediction of class affiliation of entities in KGs within a Markov process. The number of computational steps is learned during training using a geometric distribution. At the same time, the loss function combines insights from the field of evidential learning. The experiments show a performance improvement over existing models in several studied architectures and datasets. Based on the chosen hyperparameters for the geometric distribution, the expected number of computation steps can be adjusted to improve efficiency and accuracy during training.

[AI-25] Neural Continuous-Time Supermartingale Certificates

链接: https://arxiv.org/abs/2412.17432
作者: Grigory Neustroev,Mirco Giacobbe,Anna Lukina
关键词: continuous-time stochastic dynamical, stochastic dynamical systems, neural Lyapunov certificates, dynamical systems, systems
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce for the first time a neural-certificate framework for continuous-time stochastic dynamical systems. Autonomous learning systems in the physical world demand continuous-time reasoning, yet existing learnable certificates for probabilistic verification assume discretization of the time continuum. Inspired by the success of training neural Lyapunov certificates for deterministic continuous-time systems and neural supermartingale certificates for stochastic discrete-time systems, we propose a framework that bridges the gap between continuous-time and probabilistic neural certification for dynamical systems under complex requirements. Our method combines machine learning and symbolic reasoning to produce formally certified bounds on the probabilities that a nonlinear system satisfies specifications of reachability, avoidance, and persistence. We present both the theoretical justification and the algorithmic implementation of our framework and showcase its efficacy on popular benchmarks.

[AI-26] Pretraining with random noise for uncertainty calibration

链接: https://arxiv.org/abs/2412.17411
作者: Jeonghwan Cheon,Se-Bum Paik
关键词: human intelligence, process of aligning, hallmark of human, Uncertainty calibration, random noise
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Uncertainty calibration, the process of aligning confidence with accuracy, is a hallmark of human intelligence. However, most machine learning models struggle to achieve this alignment, particularly when the training dataset is small relative to the network’s capacity. Here, we demonstrate that uncertainty calibration can be effectively achieved through a pretraining method inspired by developmental neuroscience. Specifically, training with random noise before data training allows neural networks to calibrate their uncertainty, ensuring that confidence levels are aligned with actual accuracy. We show that randomly initialized, untrained networks tend to exhibit erroneously high confidence, but pretraining with random noise effectively calibrates these networks, bringing their confidence down to chance levels across input spaces. As a result, networks pretrained with random noise exhibit optimal calibration, with confidence closely aligned with accuracy throughout subsequent data training. These pre-calibrated networks also perform better at identifying “unknown data” by exhibiting lower confidence for out-of-distribution samples. Our findings provide a fundamental solution for uncertainty calibration in both in-distribution and out-of-distribution contexts.

[AI-27] BrainMAP: Learning Multiple Activation Pathways in Brain Networks AAAI2025

链接: https://arxiv.org/abs/2412.17404
作者: Song Wang,Zhenyu Lei,Zhen Tan,Jiaqi Ding,Xinyu Zhao,Yushun Dong,Guorong Wu,Tianlong Chen,Chen Chen,Aiying Zhang,Jundong Li
关键词: Magnetic Resonance Image, Functional Magnetic Resonance, Resonance Image, Magnetic Resonance, Functional Magnetic
类目: Artificial Intelligence (cs.AI)
*备注: AAAI 2025

点击查看摘要

Abstract:Functional Magnetic Resonance Image (fMRI) is commonly employed to study human brain activity, since it offers insight into the relationship between functional fluctuations and human behavior. To enhance analysis and comprehension of brain activity, Graph Neural Networks (GNNs) have been widely applied to the analysis of functional connectivities (FC) derived from fMRI data, due to their ability to capture the synergistic interactions among brain regions. However, in the human brain, performing complex tasks typically involves the activation of certain pathways, which could be represented as paths across graphs. As such, conventional GNNs struggle to learn from these pathways due to the long-range dependencies of multiple pathways. To address these challenges, we introduce a novel framework BrainMAP to learn Multiple Activation Pathways in Brain networks. BrainMAP leverages sequential models to identify long-range correlations among sequentialized brain regions and incorporates an aggregation module based on Mixture of Experts (MoE) to learn from multiple pathways. Our comprehensive experiments highlight BrainMAP’s superior performance. Furthermore, our framework enables explanatory analyses of crucial brain regions involved in tasks. Our code is provided at this https URL.

[AI-28] FRTP: Federating Route Search Records to Enhance Long-term Traffic Prediction

链接: https://arxiv.org/abs/2412.17373
作者: Hangli Ge,Xiaojie Yang,Itsuki Matsunaga,Dizhi Huang,Noboru Koshizuka
关键词: Accurate traffic prediction, intelligent transportation systems, predicting traffic conditions, Accurate traffic, conditions several days
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by IEEE BigData 2024

点击查看摘要

Abstract:Accurate traffic prediction, especially predicting traffic conditions several days in advance is essential for intelligent transportation systems (ITS). Such predictions enable mid- and long-term traffic optimization, which is crucial for efficient transportation planning. However, the inclusion of diverse external features, alongside the complexities of spatial relationships and temporal uncertainties, significantly increases the complexity of forecasting models. Additionally, traditional approaches have handled data preprocessing separately from the learning model, leading to inefficiencies caused by repeated trials of preprocessing and training. In this study, we propose a federated architecture capable of learning directly from raw data with varying features and time granularities or lengths. The model adopts a unified design that accommodates different feature types, time scales, and temporal periods. Our experiments focus on federating route search records and begin by processing raw data within the model framework. Unlike traditional models, this approach integrates the data federation phase into the learning process, enabling compatibility with various time frequencies and input/output configurations. The accuracy of the proposed model is demonstrated through evaluations using diverse learning patterns and parameter settings. The results show that online search log data is useful for forecasting long-term traffic, highlighting the model’s adaptability and efficiency.

[AI-29] Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (clp)

链接: https://arxiv.org/abs/2412.17364
作者: Jeongsu Yu
关键词: Augmented Generation, natural language processing, Text embedding models, utilization of RAG, Text embedding
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Text embedding models play a crucial role in natural language processing, particularly in information retrieval, and their importance is further highlighted with the recent utilization of RAG (Retrieval- Augmented Generation). This study presents an efficient fine-tuning methodology encompassing data selection, loss function, and model architecture to enhance the information retrieval performance of pre-trained text embedding models. In particular, this study proposes a novel Contrastive Learning Penalty function that overcomes the limitations of existing Contrastive Learning. The proposed methodology achieves significant performance improvements over existing methods in document retrieval tasks. This study is expected to contribute to improving the performance of information retrieval systems through fine-tuning of text embedding models. The code for this study can be found at this https URL, and the best-performing model can be found at this https URL.

[AI-30] Enhancing Topic Interpretability for Neural Topic Modeling through Topic-wise Contrastive Learning

链接: https://arxiv.org/abs/2412.17338
作者: Xin Gao,Yang Lin,Ruiqing Li,Yasha Wang,Xu Chu,Xinyu Ma,Hailong Yu
关键词: extracting valuable insights, essential aspects, aspects of extracting, Data mining, mining and knowledge
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data mining and knowledge discovery are essential aspects of extracting valuable insights from vast datasets. Neural topic models (NTMs) have emerged as a valuable unsupervised tool in this field. However, the predominant objective in NTMs, which aims to discover topics maximizing data likelihood, often lacks alignment with the central goals of data mining and knowledge discovery which is to reveal interpretable insights from large data repositories. Overemphasizing likelihood maximization without incorporating topic regularization can lead to an overly expansive latent space for topic modeling. In this paper, we present an innovative approach to NTMs that addresses this misalignment by introducing contrastive learning measures to assess topic interpretability. We propose a novel NTM framework, named ContraTopic, that integrates a differentiable regularizer capable of evaluating multiple facets of topic interpretability throughout the training process. Our regularizer adopts a unique topic-wise contrastive methodology, fostering both internal coherence within topics and clear external distinctions among them. Comprehensive experiments conducted on three diverse datasets demonstrate that our approach consistently produces topics with superior interpretability compared to state-of-the-art NTMs.

[AI-31] APEX2: Adaptive and Extreme Summarization for Personalized Knowledge Graphs KDD2025

链接: https://arxiv.org/abs/2412.17336
作者: Zihao Li,Dongqi Fu,Mengting Ai,Jingrui He
关键词: personalized knowledge graphs, Knowledge graphs, serve various applications, store an extensive, extensive number
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Symbolic Computation (cs.SC)
*备注: Accepted by KDD 2025. 27 pages

点击查看摘要

Abstract:Knowledge graphs (KGs), which store an extensive number of relational facts, serve various applications. Recently, personalized knowledge graphs (PKGs) have emerged as a solution to optimize storage costs by customizing their content to align with users’ specific interests within particular domains. In the real world, on one hand, user queries and their underlying interests are inherently evolving, requiring PKGs to adapt continuously; on the other hand, the summarization is constantly expected to be as small as possible in terms of storage cost. However, the existing PKG summarization methods implicitly assume that the user’s interests are constant and do not shift. Furthermore, when the size constraint of PKG is extremely small, the existing methods cannot distinguish which facts are more of immediate interest and guarantee the utility of the summarized PKG. To address these limitations, we propose APEX ^2 , a highly scalable PKG summarization framework designed with robust theoretical guarantees to excel in adaptive summarization tasks with extremely small size constraints. To be specific, after constructing an initial PKG, APEX ^2 continuously tracks the interest shift and adjusts the previous summary. We evaluate APEX ^2 under an evolving query setting on benchmark KGs containing up to 12 million triples, summarizing with compression ratios \leq 0.1% . The experiments show that APEX outperforms state-of-the-art baselines in terms of both query-answering accuracy and efficiency.

[AI-32] Complete Implementation of WXF Chinese Chess Rules

链接: https://arxiv.org/abs/2412.17334
作者: Daniel Tan,Neftali Watkinson Medina
关键词: Chinese Chess application, facing Chinese Chess, Chinese Chess, Western Chess, Unlike repetitions
类目: Artificial Intelligence (cs.AI)
*备注: 19 pages, 8 figures

点击查看摘要

Abstract:Unlike repetitions in Western Chess where all repetitions are draws, repetitions in Chinese Chess could result in a win, draw, or loss depending on the kind of repetition being made by both players. One of the biggest hurdles facing Chinese Chess application development is a proper system for judging games correctly. This paper introduces a complete algorithm for ruling the WXF rules correctly in all 110 example cases found in the WXF manual. We introduce several novel optimizations for speeding up the repetition handling without compromising the program correctness. This algorithm is usable in engines, and we saw a total increase in playing strength by +10 point rating increase, or an increased 5% winrate when integrating this approach into our prototype engine.

[AI-33] Broadband Ground Motion Synthesis by Diffusion Model with Minimal Condition

链接: https://arxiv.org/abs/2412.17333
作者: Jaeheun Jung,Jaehyuk Lee,Chang-Hae Jung,Hanyoung Kim,Bosung Jung,Donghun Lee
关键词: earthquake, Latent Diffusion Model, Earthquake Data Center, earthquake data, Southern California Earthquake
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Earthquakes are rare. Hence there is a fundamental call for reliable methods to generate realistic ground motion data for data-driven approaches in seismology. Recent GAN-based methods fall short of the call, as the methods either require special information such as geological traits or generate subpar waveforms that fail to satisfy seismological constraints such as phase arrival times. We propose a specialized Latent Diffusion Model (LDM) that reliably generates realistic waveforms after learning from real earthquake data with minimal conditions: location and magnitude. We also design a domain-specific training method that exploits the traits of earthquake dataset: multiple observed waveforms time-aligned and paired to each earthquake source that are tagged with seismological metadata comprised of earthquake magnitude, depth of focus, and the locations of epicenter and seismometers. We construct the time-aligned earthquake dataset using Southern California Earthquake Data Center (SCEDC) API, and train our model with the dataset and our proposed training method for performance evaluation. Our model surpasses all comparable data-driven methods in various test criteria not only from waveform generation domain but also from seismology such as phase arrival time, GMPE analysis, and spectrum analysis. Our result opens new future research directions for deep learning applications in seismology.

[AI-34] EcoSearch: A Constant-Delay Best-First Search Algorithm for Program Synthesis AAAI2025

链接: https://arxiv.org/abs/2412.17330
作者: Théo Matricon,Nathanaël Fijalkow,Guillaume Lagarde
关键词: program synthesis perform, synthesis perform, cost function, combinatorial search, heuristic cost functions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: Extended version of AAAI 2025

点击查看摘要

Abstract:Many approaches to program synthesis perform a combinatorial search within a large space of programs to find one that satisfies a given specification. To tame the search space blowup, previous works introduced probabilistic and neural approaches to guide this combinatorial search by inducing heuristic cost functions. Best-first search algorithms ensure to search in the exact order induced by the cost function, significantly reducing the portion of the program space to be explored. We present a new best-first search algorithm called EcoSearch, which is the first constant-delay algorithm for pre-generation cost function: the amount of compute required between outputting two programs is constant, and in particular does not increase over time. This key property yields important speedups: we observe that EcoSearch outperforms its predecessors on two classic domains.

[AI-35] xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition

链接: https://arxiv.org/abs/2412.17323
作者: Artyom Stitsyuk,Jaesik Choi
关键词: received significant attention, recent years, application of transformer-based, received significant, significant attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, the application of transformer-based models in time-series forecasting has received significant attention. While often demonstrating promising results, the transformer architecture encounters challenges in fully exploiting the temporal relations within time series data due to its attention mechanism. In this work, we design eXponential Patch (xPatch for short), a novel dual-stream architecture that utilizes exponential decomposition. Inspired by the classical exponential smoothing approaches, xPatch introduces the innovative seasonal-trend exponential decomposition module. Additionally, we propose a dual-flow architecture that consists of an MLP-based linear stream and a CNN-based non-linear stream. This model investigates the benefits of employing patching and channel-independence techniques within a non-transformer model. Finally, we develop a robust arctangent loss function and a sigmoid learning rate adjustment scheme, which prevent overfitting and boost forecasting performance. The code is available at the following repository: this https URL.

[AI-36] Popularity Estimation and New Bundle Generation using Content and Context based Embeddings

链接: https://arxiv.org/abs/2412.17310
作者: Ashutosh Nayak,Prajwal NJ,Sameeksha Keshav,Kavitha S.N.,Roja Reddy,Rajasekhara Reddy Duvvuru Muni
关键词: systems create enormous, Recommender systems create, create enormous, bundle, Recommender systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recommender systems create enormous value for businesses and their consumers. They increase revenue for businesses while improving the consumer experience by recommending relevant products amidst huge product base. Product bundling is an exciting development in the field of product recommendations. It aims at generating new bundles and recommending exciting and relevant bundles to their consumers. Unlike traditional recommender systems that recommend single items to consumers, product bundling aims at targeting a bundle, or a set of items, to the consumers. While bundle recommendation has attracted significant research interest recently, extant literature on bundle generation is scarce. Moreover, metrics to identify if a bundle is popular or not is not well studied. In this work, we aim to fulfill this gap by introducing new bundle popularity metrics based on sales, consumer experience and item diversity in a bundle. We use these metrics in the methodology proposed in this paper to generate new bundles for mobile games using content aware and context aware embeddings. We use opensource Steam Games dataset for our analysis. Our experiments indicate that we can generate new bundles that can outperform the existing bundles on the popularity metrics by 32% - 44%. Our experiments are computationally efficient and the proposed methodology is generic that can be extended to other bundling problems e.g. product bundling, music bundling.

[AI-37] On the Feasibility of Vision-Language Models for Time-Series Classification

链接: https://arxiv.org/abs/2412.17304
作者: Vinay Prithyani,Mohsin Mohammed,Richa Gadgil,Ricardo Buitrago,Vinija Jain,Aman Chadha
关键词: Vision Language Models, Language Models, Vision Language, Models, Vision
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We build upon time-series classification by leveraging the capabilities of Vision Language Models (VLMs). We find that VLMs produce competitive results after two or less epochs of fine-tuning. We develop a novel approach that incorporates graphical data representations as images in conjunction with numerical data. This approach is rooted in the hypothesis that graphical representations can provide additional contextual information that numerical data alone may not capture. Additionally, providing a graphical representation can circumvent issues such as limited context length faced by LLMs. To further advance this work, we implemented a scalable end-to-end pipeline for training on different scenarios, allowing us to isolate the most effective strategies for transferring learning capabilities from LLMs to Time Series Classification (TSC) tasks. Our approach works with univariate and multivariate time-series data. In addition, we conduct extensive and practical experiments to show how this approach works for time-series classification and generative labels.

[AI-38] Dynamic Scheduling Strategies for Resource Optimization in Computing Environments

链接: https://arxiv.org/abs/2412.17301
作者: Xiaoye Wang
关键词: Google Cluster Data, container scheduling, face many challenges, promoted the widespread, widespread application
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid development of cloud-native architecture has promoted the widespread application of container technology, but the optimization problems in container scheduling and resource management still face many challenges. This paper proposes a container scheduling method based on multi-objective optimization, which aims to balance key performance indicators such as resource utilization, load balancing and task completion efficiency. By introducing optimization models and heuristic algorithms, the scheduling strategy is comprehensively improved, and experimental verification is carried out using the real Google Cluster Data dataset. The experimental results show that compared with traditional static rule algorithms and heuristic algorithms, the optimized scheduling scheme shows significant advantages in resource utilization, load balancing and burst task completion efficiency. This shows that the proposed method can effectively improve resource management efficiency and ensure service quality and system stability in complex dynamic cloud environments. At the same time, this paper also explores the future development direction of scheduling algorithms in multi-tenant environments, heterogeneous cloud computing, and cross-edge and cloud collaborative computing scenarios, and proposes research prospects for energy consumption optimization, adaptive scheduling and fairness. The research results not only provide a theoretical basis and practical reference for container scheduling under cloud-native architecture, but also lay a foundation for further realizing intelligent and efficient resource management.

[AI-39] Multi-Modal Grounded Planning and Efficient Replanning For Learning Embodied Agents with A Few Examples AAAI2025

链接: https://arxiv.org/abs/2412.17288
作者: Taewoong Kim,Byeonghwi Kim,Jonghyun Choi
关键词: requires large free-form, short high-level instructions, perform complex tasks, complex tasks based, free-form language annotations
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: AAAI 2025 (Project page: this https URL )

点击查看摘要

Abstract:Learning a perception and reasoning module for robotic assistants to plan steps to perform complex tasks based on natural language instructions often requires large free-form language annotations, especially for short high-level instructions. To reduce the cost of annotation, large language models (LLMs) are used as a planner with few data. However, when elaborating the steps, even the state-of-the-art planner that uses LLMs mostly relies on linguistic common sense, often neglecting the status of the environment at command reception, resulting in inappropriate plans. To generate plans grounded in the environment, we propose FLARE (Few-shot Language with environmental Adaptive Replanning Embodied agent), which improves task planning using both language command and environmental perception. As language instructions often contain ambiguities or incorrect expressions, we additionally propose to correct the mistakes using visual cues from the agent. The proposed scheme allows us to use a few language pairs thanks to the visual cues and outperforms state-of-the-art approaches. Our code is available at this https URL.

[AI-40] LLM 4AD: A Platform for Algorithm Design with Large Language Model

链接: https://arxiv.org/abs/2412.17287
作者: Fei Liu,Rui Zhang,Zhuoliang Xie,Rui Sun,Kai Li,Xi Lin,Zhenkun Wang,Zhichao Lu,Qingfu Zhang
关键词: large language models, unified Python platform, algorithm design tasks, algorithm design, unified Python
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce LLM4AD, a unified Python platform for algorithm design (AD) with large language models (LLMs). LLM4AD is a generic framework with modularized blocks for search methods, algorithm design tasks, and LLM interface. The platform integrates numerous key methods and supports a wide range of algorithm design tasks across various domains including optimization, machine learning, and scientific discovery. We have also designed a unified evaluation sandbox to ensure a secure and robust assessment of algorithms. Additionally, we have compiled a comprehensive suite of support resources, including tutorials, examples, a user manual, online resources, and a dedicated graphical user interface (GUI) to enhance the usage of LLM4AD. We believe this platform will serve as a valuable tool for fostering future development in the merging research direction of LLM-assisted algorithm design.

[AI-41] Enabling Time-series Foundation Model for Building Energy Forecasting via Contrastive Curriculum Learning

链接: https://arxiv.org/abs/2412.17285
作者: Rui Liang,Yang Deng,Donghua Xie,Fang He,Dan Wang
关键词: Advances in time-series, machine learning models, conventional machine learning, foundation models, generalized knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Advances in time-series forecasting are driving a shift from conventional machine learning models to foundation models (FMs) that are trained with generalized knowledge. However, existing FMs still perform poorly in the energy fields, such as building energy forecasting (BEF). This paper studies the adaptation of FM to BEF tasks. We demonstrate the shortcomings of fine-tuning FM straightforwardly from both the perspectives of FM and the data. To overcome these limitations, we propose a new \textitcontrastive curriculum learning-based training method. Our method optimizes the ordering of training data in the context of TSFM adaptation. Experiments show that our method can improve the zero/few-shot performance by 14.6% compared to the existing FMs. Our code and new TSFM will be available at Anonymous Github Repo.

[AI-42] Evaluating the Design Features of an Intelligent Tutoring System for Advanced Mathematics Learning

链接: https://arxiv.org/abs/2412.17265
作者: Ying Fang,Bo He,Zhi Liu,Sannyuya Liu,Zhonghua Yan,Jianwen Sun
关键词: Chinese college students, intelligent tutoring system, math entrance exam, learning advanced mathematics, graduate school math
类目: Mathematical Software (cs.MS); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); History and Overview (math.HO)
*备注:

点击查看摘要

Abstract:Xiaomai is an intelligent tutoring system (ITS) designed to help Chinese college students in learning advanced mathematics and preparing for the graduate school math entrance exam. This study investigates two distinctive features within Xiaomai: the incorporation of free-response questions with automatic feedback and the metacognitive element of reflecting on self-made errors.

[AI-43] “From Unseen Needs to Classroom Solutions”: Exploring AI Literacy Challenges Opportunities with Project-based Learning Toolkit in K-12 Education AAAI2025

链接: https://arxiv.org/abs/2412.17243
作者: Hanqi Li,Ruiwei Xiao,Hsuan Nieu,Ying-Jui Tseng,Guanze Liao
关键词: artificial intelligence, computer science, increasingly central, extend beyond computer, Art Lab
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: Accepted to AAAI2025

点击查看摘要

Abstract:As artificial intelligence (AI) becomes increasingly central to various fields, there is a growing need to equip K-12 students with AI literacy skills that extend beyond computer science. This paper explores the integration of a Project-Based Learning (PBL) AI toolkit into diverse subject areas, aimed at helping educators teach AI concepts more effectively. Through interviews and co-design sessions with K-12 teachers, we examined current AI literacy levels and how teachers adapt AI tools like the AI Art Lab, AI Music Studio, and AI Chatbot into their course designs. While teachers appreciated the potential of AI tools to foster creativity and critical thinking, they also expressed concerns about the accuracy, trustworthiness, and ethical implications of AI-generated content. Our findings reveal the challenges teachers face, including limited resources, varying student and instructor skill levels, and the need for scalable, adaptable AI tools. This research contributes insights that can inform the development of AI curricula tailored to diverse educational contexts.

[AI-44] Rethinking Cancer Gene Identification through Graph Anomaly Analysis AAAI2025

链接: https://arxiv.org/abs/2412.17240
作者: Yilong Zang,Lingfei Ren,Yue Li,Zhikang Wang,David Antony Selby,Zheng Wang,Sebastian Josef Vollmer,Hongzhi Yin,Jiangning Song,Junhang Wu
关键词: integrating protein-protein interaction, identifying cancer genes, cancer genes, Graph neural networks, PPI networks
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: It has been accepted by the AAAI 2025 conference

点击查看摘要

Abstract:Graph neural networks (GNNs) have shown promise in integrating protein-protein interaction (PPI) networks for identifying cancer genes in recent studies. However, due to the insufficient modeling of the biological information in PPI networks, more faithfully depiction of complex protein interaction patterns for cancer genes within the graph structure remains largely unexplored. This study takes a pioneering step toward bridging biological anomalies in protein interactions caused by cancer genes to statistical graph anomaly. We find a unique graph anomaly exhibited by cancer genes, namely weight heterogeneity, which manifests as significantly higher variance in edge weights of cancer gene nodes within the graph. Additionally, from the spectral perspective, we demonstrate that the weight heterogeneity could lead to the “flattening out” of spectral energy, with a concentration towards the extremes of the spectrum. Building on these insights, we propose the HIerarchical-Perspective Graph Neural Network (HIPGNN) that not only determines spectral energy distribution variations on the spectral perspective, but also perceives detailed protein interaction context on the spatial perspective. Extensive experiments are conducted on two reprocessed datasets STRINGdb and CPDB, and the experimental results demonstrate the superiority of HIPGNN.

[AI-45] MatchMiner-AI: An Open-Source Solution for Cancer Clinical Trial Matching

链接: https://arxiv.org/abs/2412.17228
作者: Ethan Cerami,Pavel Trukhanov,Morgan A. Paul,Michael J. Hassett,Irbaz B. Riaz,James Lindsay,Emily Mallaber,Harry Klein,Gufran Gungor,Matthew Galvin,Stephen C. Van Nostrand,Joyce Yu,Tali Mazor,Kenneth L. Kehl
关键词: trials drive improvements, https URL, treatments and outcomes, drive improvements, Clinical trials drive
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clinical trials drive improvements in cancer treatments and outcomes. However, most adults with cancer do not participate in trials, and trials often fail to enroll enough patients to answer their scientific questions. Artificial intelligence could accelerate matching of patients to appropriate clinical trials. Here, we describe the development and evaluation of the MatchMiner-AI pipeline for clinical trial searching and ranking. MatchMiner-AI focuses on matching patients to potential trials based on core criteria describing clinical “spaces,” or disease contexts, targeted by a trial. It aims to accelerate the human work of identifying potential matches, not to fully automate trial screening. The pipeline includes modules for extraction of key information from a patient’s longitudinal electronic health record; rapid ranking of candidate trial-patient matches based on embeddings in vector space; and classification of whether a candidate match represents a reasonable clinical consideration. Code and synthetic data are available at this https URL . Model weights based on synthetic data are available at this https URL and this https URL . A simple cancer clinical trial search engine to demonstrate pipeline components is available at this https URL .

[AI-46] Q-LIME pi: A Quantum-Inspired Extension to LIME

链接: https://arxiv.org/abs/2412.17197
作者: Nelson Colón Vargas
关键词: Machine learning models, offer powerful predictive, powerful predictive capabilities, Machine learning, learning models offer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning models offer powerful predictive capabilities but often lack transparency. Local Interpretable Model-agnostic Explanations (LIME) addresses this by perturbing features and measuring their impact on a model’s output. In text-based tasks, LIME typically removes present words (bits set to 1) to identify high-impact tokens. We propose \textbfQ-LIME \pi (Quantum LIME \pi ), a quantum-inspired extension of LIME that encodes a binary feature vector in a quantum state, leveraging superposition and interference to explore local neighborhoods more efficiently. Our method focuses on flipping bits from 1 \rightarrow 0 to emulate LIME’s ``removal’’ strategy, and can be extended to 0 \rightarrow 1 where adding features is relevant. Experiments on subsets of the IMDb dataset demonstrate that Q-LIME \pi often achieves near-identical top-feature rankings compared to classical LIME while exhibiting lower runtime in small- to moderate-dimensional feature spaces. This quantum-classical hybrid approach thus provides a new pathway for interpretable AI, suggesting that, with further improvements in quantum hardware and methods, quantum parallelism may facilitate more efficient local explanations for high-dimensional data.

[AI-47] Better Think with Tables: Leveraging Tables to Enhance Large Language Model Comprehension

链接: https://arxiv.org/abs/2412.17189
作者: Jio Oh,Geon Heo,Seungjun Oh,Jindong Wang,Xing Xie,Steven Euijong Whang
关键词: Large Langauge Models, Large Langauge, advancement of Large, involving multiple conditions, common in real-world
类目: Artificial Intelligence (cs.AI)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:Despite the recent advancement of Large Langauge Models (LLMs), they struggle with complex queries often involving multiple conditions, common in real-world scenarios. We propose Thinking with Tables, a technique that assists LLMs to leverage tables for intermediate thinking aligning with human cognitive behavior. By introducing a pre-instruction that triggers an LLM to organize information in tables, our approach achieves a 40.29% average relative performance increase, higher robustness, and show generalizability to different requests, conditions, or scenarios. We additionally show the influence of data structuredness for the model by comparing results from four distinct structuring levels that we introduce.

[AI-48] Hierarchically Gated Experts for Efficient Online Continual Learning

链接: https://arxiv.org/abs/2412.17188
作者: Kevin Luong,Michael Thielscher
关键词: Continual Learning models, Online Continual Learning, Learning models aim, Continual Learning, Continual Learning framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continual Learning models aim to learn a set of tasks under the constraint that the tasks arrive sequentially with no way to access data from previous tasks. The Online Continual Learning framework poses a further challenge where the tasks are unknown and instead the data arrives as a single stream. Building on existing work, we propose a method for identifying these underlying tasks: the Gated Experts (GE) algorithm, where a dynamically growing set of experts allows for new knowledge to be acquired without catastrophic forgetting. Furthermore, we extend GE to Hierarchically Gated Experts (HGE), a method which is able to efficiently select the best expert for each data sample by organising the experts into a hierarchical structure. On standard Continual Learning benchmarks, GE and HGE are able to achieve results comparable with current methods, with HGE doing so more efficiently.

[AI-49] DCC: Differentiable Cardinality Constraints for Partial Index Tracking AAAI2025

链接: https://arxiv.org/abs/2412.17175
作者: Wooyeon Jo,Hyunsouk Cho
关键词: high transaction costs, popular passive investment, passive investment strategy, investment strategy aimed, optimizing portfolios
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: 10 pages, 6 figures, AAAI 2025 (accepted, but not published)

点击查看摘要

Abstract:Index tracking is a popular passive investment strategy aimed at optimizing portfolios, but fully replicating an index can lead to high transaction costs. To address this, partial replication have been proposed. However, the cardinality constraint renders the problem non-convex, non-differentiable, and often NP-hard, leading to the use of heuristic or neural network-based methods, which can be non-interpretable or have NP-hard complexity. To overcome these limitations, we propose a Differentiable Cardinality Constraint ( \textbfDCC ) for index tracking and introduce a floating-point precision-aware method ( \textbfDCC_fpp ) to address implementation issues. We theoretically prove our methods calculate cardinality accurately and enforce actual cardinality with polynomial time complexity. We propose the range of the hyperparameter a ensures that \textbfDCC_fpp has no error in real implementations, based on theoretical proof and experiment. Our method applied to mathematical method outperforms baseline methods across various datasets, demonstrating the effectiveness of the identified hyperparameter a .

[AI-50] Survey on Abstractive Text Summarization: Dataset Models and Metrics

链接: https://arxiv.org/abs/2412.17165
作者: Gospel Ozioma Nnadi,Flavio Bertini
关键词: natural language processing, deep learning, language processing, advancements in deep, pivotal in enhancing
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancements in deep learning, particularly the introduction of transformers, have been pivotal in enhancing various natural language processing (NLP) tasks. These include text-to-text applications such as machine translation, text classification, and text summarization, as well as data-to-text tasks like response generation and image-to-text tasks such as captioning. Transformer models are distinguished by their attention mechanisms, pretraining on general knowledge, and fine-tuning for downstream tasks. This has led to significant improvements, particularly in abstractive summarization, where sections of a source document are paraphrased to produce summaries that closely resemble human expression. The effectiveness of these models is assessed using diverse metrics, encompassing techniques like semantic overlap and factual correctness. This survey examines the state of the art in text summarization models, with a specific focus on the abstractive summarization approach. It reviews various datasets and evaluation metrics used to measure model performance. Additionally, it includes the results of test cases using abstractive summarization models to underscore the advantages and limitations of contemporary transformer-based models. The source codes and the data are available at this https URL. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2412.17165 [cs.AI] (or arXiv:2412.17165v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.17165 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-51] Semantic Web: Past Present and Future

链接: https://arxiv.org/abs/2412.17159
作者: Ansgar Scherp,Gerd Groener,Petr Škoda,Katja Hose,Maria-Esther Vidal
关键词: Semantic Web, Semantic Web Layer, Semantic, Web, vision was formulated
类目: Artificial Intelligence (cs.AI)
*备注: Extended Version 2024-12-13 of TGDK 2(1): 3:1-3:37 (2024) If you like to contribute, please contact the first author and visit: this https URL Please cite this paper as, see this https URL

点击查看摘要

Abstract:Ever since the vision was formulated, the Semantic Web has inspired many generations of innovations. Semantic technologies have been used to share vast amounts of information on the Web, enhance them with semantics to give them meaning, and enable inference and reasoning on them. Throughout the years, semantic technologies, and in particular knowledge graphs, have been used in search engines, data integration, enterprise settings, and machine learning. In this paper, we recap the classical concepts and foundations of the Semantic Web as well as modern and recent concepts and applications, building upon these foundations. The classical topics we cover include knowledge representation, creating and validating knowledge on the Web, reasoning and linking, and distributed querying. We enhance this classical view of the so-called ``Semantic Web Layer Cake’’ with an update of recent concepts that include provenance, security and trust, as well as a discussion of practical impacts from industry-led contributions. We conclude with an outlook on the future directions of the Semantic Web. Comments: Extended Version 2024-12-13 of TGDK 2(1): 3:1-3:37 (2024) If you like to contribute, please contact the first author and visit: this https URL Please cite this paper as, see this https URL Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2412.17159 [cs.AI] (or arXiv:2412.17159v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.17159 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: TGDK 2(1): 3:1-3:37 (2024) Related DOI: https://doi.org/10.4230/TGDK.2.1.3 Focus to learn more DOI(s) linking to related resources

[AI-52] LLM Agent for Fire Dynamics Simulations NEURIPS2024

链接: https://arxiv.org/abs/2412.17146
作者: Leidong Xu,Danyal Mohaddes,Yi Wang
关键词: leveraging foundation models, large language models, foundation models, accelerate complex scientific, LLM agent
类目: Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*备注: NeurIPS 2024 Foundation Models for Science Workshop (38th Conference on Neural Information Processing Systems). 12 pages, 8 figures

点击查看摘要

Abstract:Significant advances have been achieved in leveraging foundation models, such as large language models (LLMs), to accelerate complex scientific workflows. In this work we introduce FoamPilot, a proof-of-concept LLM agent designed to enhance the usability of FireFOAM, a specialized solver for fire dynamics and fire suppression simulations built using OpenFOAM, a popular open-source toolbox for computational fluid dynamics (CFD). FoamPilot provides three core functionalities: code insight, case configuration and simulation evaluation. Code insight is an alternative to traditional keyword searching leveraging retrieval-augmented generation (RAG) and aims to enable efficient navigation and summarization of the FireFOAM source code for developers and experienced users. For case configuration, the agent interprets user requests in natural language and aims to modify existing simulation setups accordingly to support intermediate users. FoamPilot’s job execution functionality seeks to manage the submission and execution of simulations in high-performance computing (HPC) environments and provide preliminary analysis of simulation results to support less experienced users. Promising results were achieved for each functionality, particularly for simple tasks, and opportunities were identified for significant further improvement for more complex tasks. The integration of these functionalities into a single LLM agent is a step aimed at accelerating the simulation workflow for engineers and scientists employing FireFOAM for complex simulations critical for improving fire safety.

[AI-53] ASP-based Multi-shot Reasoning via DLV2 with Incremental Grounding

链接: https://arxiv.org/abs/2412.17143
作者: Francesco Calimeri,Giovambattista Ianni,Francesco Pacenza,Simona Perri,Jessica Zanfari
关键词: Knowledge Representation, logic-based declarative formalism, supports Answer Set, Answer Set Programming, tool for Knowledge
类目: Artificial Intelligence (cs.AI)
*备注: 22 pager, 4 figures

点击查看摘要

Abstract:DLV2 is an AI tool for Knowledge Representation and Reasoning which supports Answer Set Programming (ASP) - a logic-based declarative formalism, successfully used in both academic and industrial applications. Given a logic program modelling a computational problem, an execution of DLV2 produces the so-called answer sets that correspond one-to-one to the solutions to the problem at hand. The computational process of DLV2 relies on the typical Ground Solve approach where the grounding step transforms the input program into a new, equivalent ground program, and the subsequent solving step applies propositional algorithms to search for the answer sets. Recently, emerging applications in contexts such as stream reasoning and event processing created a demand for multi-shot reasoning: here, the system is expected to be reactive while repeatedly executed over rapidly changing data. In this work, we present a new incremental reasoner obtained from the evolution of DLV2 towards iterated reasoning. Rather than restarting the computation from scratch, the system remains alive across repeated shots, and it incrementally handles the internal grounding process. At each shot, the system reuses previous computations for building and maintaining a large, more general ground program, from which a smaller yet equivalent portion is determined and used for computing answer sets. Notably, the incremental process is performed in a completely transparent fashion for the user. We describe the system, its usage, its applicability and performance in some practically relevant domains. Under consideration in Theory and Practice of Logic Programming (TPLP).

[AI-54] AI-Based Teat Shape and Skin Condition Prediction for Dairy Management

链接: https://arxiv.org/abs/2412.17142
作者: Yuexing Hao,Tiancheng Yuan,Yuting Yang,Aarushi Gupta,Matthias Wieland,Ken Birman,Parminder S. Basran
关键词: owners spend significant, spend significant effort, Dairy owners spend, animals healthy, owners spend
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dairy owners spend significant effort to keep their animals healthy. There is good reason to hope that technologies such as computer vision and artificial intelligence (AI) could reduce these costs, yet obstacles arise when adapting advanced tools to farming environments. In this work, we adapt AI tools to dairy cow teat localization, teat shape, and teat skin condition classifications. We also curate a data collection and analysis methodology for a Machine Learning (ML) pipeline. The resulting teat shape prediction model achieves a mean Average Precision (mAP) of 0.783, and the teat skin condition model achieves a mean average precision of 0.828. Our work leverages existing ML vision models to facilitate the individualized identification of teat health and skin conditions, applying AI to the dairy management industry.

[AI-55] On the ETHOS of AI Agents : An Ethical Technology and Holistic Oversight System

链接: https://arxiv.org/abs/2412.17114
作者: Tomer Jordi Chaffer,Justin Goldston,Bayo Okusanya,Gemach D.A.T.A.I
关键词: world increasingly defined, Bipartisan House Task, Risk Management Report, House Task Force, machine intelligence
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:In a world increasingly defined by machine intelligence, the future depends on how we govern the development and integration of AI into society. Recent initiatives, such as the EU AI Act, EDPB opinion, U.S. Bipartisan House Task Force and NIST AI Risk Management Report, highlight the urgent need for robust governance frameworks to address the challenges posed by advancing AI technologies. However, existing frameworks fail to adequately address the rise of AI agents or the ongoing debate between centralized and decentralized governance models. To bridge these gaps, we propose the Ethical Technology and Holistic Oversight System framework, which leverages Web3 technologies, including blockchain, smart contracts, decentralized autonomous organizations, and soulbound tokens, to establish a decentralized global registry for AI agents. ETHOS incorporates the concept of AI specific legal entities, enabling these systems to assume limited liability and ensuring accountability through mechanisms like insurance and compliance monitoring. Additionally, the framework emphasizes the need for a collaborative, participatory approach to AI governance, engaging diverse stakeholders through public education, transparency, and international coordination. ETHOS balances innovation with ethical accountability, providing a forward looking strategy for the responsible integration of AI agents into society. Finally, this exploration reflects the emergence of a new interdisciplinary field we define as Systems Thinking at the Intersection of AI, Web3, and Society.

[AI-56] Grams: Gradient Descent with Adaptive Momentum Scaling

链接: https://arxiv.org/abs/2412.17107
作者: Yang Cao,Xiaoyu Li,Zhao Song
关键词: textbf, algorithm that decouples, adient Descent, Grams, parameter updates
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We introduce \textbfGradient Descent with \textbfAdaptive \textbfMomentum \textbfScaling (\textbfGrams), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers. We establish a global convergence guarantee for Grams and validate its effectiveness through extensive empirical evaluations. The results demonstrate Grams’ superior performance, including faster convergence and better generalization, compared to widely-used optimizers such as Adam, Lion, and their cautious variants. Our results highlight Grams’ potential as a transformative approach for efficient optimization in large-scale machine learning.

[AI-57] Analysis on LLM s Performance for Code Summarization

链接: https://arxiv.org/abs/2412.17094
作者: Md. Ahnaf Akib,Md. Muktadir Mazumder,Salman Ahsan
关键词: natural language descriptions, Large Language Models, generate concise natural, concise natural language, Language Models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Code summarization aims to generate concise natural language descriptions for source code. Deep learning has been used more and more recently in software engineering, particularly for tasks like code creation and summarization. Specifically, it appears that the most current Large Language Models with coding perform well on these tasks. Large Language Models (LLMs) have significantly advanced the field of code summarization, providing sophisticated methods for generating concise and accurate summaries of source code. This study aims to perform a comparative analysis of several open-source LLMs, namely LLaMA-3, Phi-3, Mistral, and Gemma. These models’ performance is assessed using important metrics such as BLEU\textsubscript3.1 and ROUGE\textsubscript3.2. Through this analysis, we seek to identify the strengths and weaknesses of each model, offering insights into their applicability and effectiveness in code summarization tasks. Our findings contribute to the ongoing development and refinement of LLMs, supporting their integration into tools that enhance software development and maintenance processes. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2412.17094 [cs.SE] (or arXiv:2412.17094v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2412.17094 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-58] Aligning Graphical and Functional Causal Abstractions

链接: https://arxiv.org/abs/2412.17080
作者: Wilem Schooltink,Fabio Massimo Zennaro
关键词: Cluster DAGs, relate causal models, Partial Cluster DAGs, Causal abstractions, abstractions
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Causal abstractions allow us to relate causal models on different levels of granularity. To ensure that the models agree on cause and effect, frameworks for causal abstractions define notions of consistency. Two distinct methods for causal abstraction are common in the literature: (i) graphical abstractions, such as Cluster DAGs, which relate models on a structural level, and (ii) functional abstractions, like \alpha -abstractions, which relate models by maps between variables and their ranges. In this paper we will align the notions of graphical and functional consistency and show an equivalence between the class of Cluster DAGs, consistent \alpha -abstractions, and constructive \tau -abstractions. Furthermore, we extend this alignment and the expressivity of graphical abstractions by introducing Partial Cluster DAGs. Our results provide a rigorous bridge between the functional and graphical frameworks and allow for adoption and transfer of results between them.

[AI-59] SubstationAI: Multimodal Large Model-Based Approaches for Analyzing Substation Equipment Faults

链接: https://arxiv.org/abs/2412.17077
作者: Jinzhi Wang,Qinfeng Song,Lidong Qian,Haozhou Li,Qinke Peng,Jiangbo Zhang
关键词: methods heavily rely, equipment fault analysis, analysis methods heavily, fault analysis, power systems
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The reliability of substation equipment is crucial to the stability of power systems, but traditional fault analysis methods heavily rely on manual expertise, limiting their effectiveness in handling complex and large-scale data. This paper proposes a substation equipment fault analysis method based on a multimodal large language model (MLLM). We developed a database containing 40,000 entries, including images, defect labels, and analysis reports, and used an image-to-video generation model for data augmentation. Detailed fault analysis reports were generated using GPT-4. Based on this database, we developed SubstationAI, the first model dedicated to substation fault analysis, and designed a fault diagnosis knowledge base along with knowledge enhancement methods. Experimental results show that SubstationAI significantly outperforms existing models, such as GPT-4, across various evaluation metrics, demonstrating higher accuracy and practicality in fault cause analysis, repair suggestions, and preventive measures, providing a more advanced solution for substation equipment fault analysis.

[AI-60] Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN)

链接: https://arxiv.org/abs/2412.17069
作者: Mohammadreza Sharifi
关键词: present significant challenges, deep learning models, modern deep learning, large datasets present, datasets present significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper was presented at Machine Learning Knowledge Discovery (MLKD2024) conference at Amirkabir University of Technology

点击查看摘要

Abstract:In modern deep learning models, long training times and large datasets present significant challenges to both efficiency and scalability. Effective data curation and sample selection are crucial for optimizing the training process of deep neural networks. This paper introduces SALN, a method designed to prioritize and select samples within each batch rather than from the entire dataset. By utilizing jointly selected batches, SALN enhances training efficiency compared to independent batch selection. The proposed method applies a spectral analysis-based heuristic to identify the most informative data points within each batch, improving both training speed and accuracy. The SALN algorithm significantly reduces training time and enhances accuracy when compared to traditional batch prioritization or standard training procedures. It demonstrates up to an 8x reduction in training time and up to a 5% increase in accuracy over standard training methods. Moreover, SALN achieves better performance and shorter training times compared to Google’s JEST method developed by DeepMind.

[AI-61] DR-Encoder: Encode Low-rank Gradients with Random Prior for Large Language Models Differentially Privately

链接: https://arxiv.org/abs/2412.17053
作者: Huiwen Wu,Deyi Zhang,Xiaohan Li,Xiaogang Xu,Jiafei Wu,Zhe Liu
关键词: Large Language Model, including language understanding, relational logic reasoning, Large Language, differential equations solving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The emergence of the Large Language Model (LLM) has shown their superiority in a wide range of disciplines, including language understanding and translation, relational logic reasoning, and even partial differential equations solving. The transformer is the pervasive backbone architecture for the foundation model construction. It is vital to research how to adjust the Transformer architecture to achieve an end-to-end privacy guarantee in LLM fine-tuning. In this paper, we investigate three potential information leakage during a federated fine-tuning procedure for LLM (FedLLM). Based on the potential information leakage, we provide an end-to-end privacy guarantee solution for FedLLM by inserting two-stage randomness. The first stage is to train a gradient auto-encoder with a Gaussian random prior based on the statistical information of the gradients generated by local clients. The second stage is to fine-tune the overall LLM with a differential privacy guarantee by adopting appropriate Gaussian noises. We show the efficiency and accuracy gains of our proposed method with several foundation models and two popular evaluation benchmarks. Furthermore, we present a comprehensive privacy analysis with Gaussian Differential Privacy (GDP) and Renyi Differential Privacy (RDP).

[AI-62] ViLBias: A Framework for Bias Detection using Linguistic and Visual Cues

链接: https://arxiv.org/abs/2412.17052
作者: Shaina Raza,Caesar Saleh,Emrul Hasan,Franklin Ogidi,Maximus Powers,Veronica Chatrath,Marcelo Lotif,Roya Javadi,Anam Zahid,Vahid Reza Khazaie
关键词: Large Language Models, addressing complex challenges, integration of Large, Small Language Models, Large Language
类目: Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) and Vision-Language Models (VLMs) opens new avenues for addressing complex challenges in multimodal content analysis, particularly in biased news detection. This study introduces ViLBias, a framework that leverages state of the art LLMs and VLMs to detect linguistic and visual biases in news content, addressing the limitations of traditional text-only approaches. Our contributions include a novel dataset pairing textual content with accompanying visuals from diverse news sources and a hybrid annotation framework, combining LLM-based annotations with human review to enhance quality while reducing costs and improving scalability. We evaluate the efficacy of LLMs and VLMs in identifying biases, revealing their strengths in detecting subtle framing and text-visual inconsistencies. Empirical analysis demonstrates that incorporating visual cues alongside text enhances bias detection accuracy by 3 to 5 %, showcasing the complementary strengths of LLMs in generative reasoning and Small Language Models (SLMs) in classification. This study offers a comprehensive exploration of LLMs and VLMs as tools for detecting multimodal biases in news content, highlighting both their potential and limitations. Our research paves the way for more robust, scalable, and nuanced approaches to media bias detection, contributing to the broader field of natural language processing and multimodal analysis. (The data and code will be made available for research purposes).

[AI-63] GraphAgent : Agent ic Graph Language Assistant

链接: https://arxiv.org/abs/2412.17029
作者: Yuhao Yang,Jiabin Tang,Lianghao Xia,Xingchen Zou,Yuxuan Liang,Chao Huang
关键词: include explicit links, encompassing complex relationships, Real-world data, Graph Generator Agent, Task Planning Agent
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-world data is represented in both structured (e.g., graph connections) and unstructured (e.g., textual, visual information) formats, encompassing complex relationships that include explicit links (such as social connections and user behaviors) and implicit interdependencies among semantic entities, often illustrated through knowledge graphs. In this work, we propose GraphAgent, an automated agent pipeline that addresses both explicit graph dependencies and implicit graph-enhanced semantic inter-dependencies, aligning with practical data scenarios for predictive tasks (e.g., node classification) and generative tasks (e.g., text generation). GraphAgent comprises three key components: (i) a Graph Generator Agent that builds knowledge graphs to reflect complex semantic dependencies; (ii) a Task Planning Agent that interprets diverse user queries and formulates corresponding tasks through agentic self-planning; and (iii) a Task Execution Agent that efficiently executes planned tasks while automating tool matching and invocation in response to user queries. These agents collaborate seamlessly, integrating language models with graph language models to uncover intricate relational information and data semantic dependencies. Through extensive experiments on various graph-related predictive and text generative tasks on diverse datasets, we demonstrate the effectiveness of our GraphAgent across various settings. We have made our proposed GraphAgent open-source at: this https URL.

[AI-64] GAS: Generative Auto-bidding with Post-training Search

链接: https://arxiv.org/abs/2412.17018
作者: Yewen Li,Shuai Mao,Jingtong Gao,Nan Jiang,Yunjian Xu,Qingpeng Cai,Fei Pan,Peng Jiang,Bo An
关键词: automatically placing bids, behalf of advertisers, essential in facilitating, automatically placing, Generative auto-bidding
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Auto-bidding is essential in facilitating online advertising by automatically placing bids on behalf of advertisers. Generative auto-bidding, which generates bids based on an adjustable condition using models like transformers and diffusers, has recently emerged as a new trend due to its potential to learn optimal strategies directly from data and adjust flexibly to preferences. However, generative models suffer from low-quality data leading to a mismatch between condition, return to go, and true action value, especially in long sequential decision-making. Besides, the majority preference in the dataset may hinder models’ generalization ability on minority advertisers’ preferences. While it is possible to collect high-quality data and retrain multiple models for different preferences, the high cost makes it unaffordable, hindering the advancement of auto-bidding into the era of large foundation models. To address this, we propose a flexible and practical Generative Auto-bidding scheme using post-training Search, termed GAS, to refine a base policy model’s output and adapt to various preferences. We use weak-to-strong search alignment by training small critics for different preferences and an MCTS-inspired search to refine the model’s output. Specifically, a novel voting mechanism with transformer-based critics trained with policy indications could enhance search alignment performance. Additionally, utilizing the search, we provide a fine-tuning method for high-frequency preference scenarios considering computational efficiency. Extensive experiments conducted on the real-world dataset and online A/B test on the Kuaishou advertising platform demonstrate the effectiveness of GAS, achieving significant improvements, e.g., 1.554% increment of target cost.

[AI-65] Data value estimation on private gradients

链接: https://arxiv.org/abs/2412.17008
作者: Zijian Zhou,Xinyi Xu,Daniela Rus,Bryan Kian Hsiang Low
关键词: facto differential privacy, gradient-based machine learning, stochastic gradient descent, random Gaussian noise, methods commonly adopted
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:For gradient-based machine learning (ML) methods commonly adopted in practice such as stochastic gradient descent, the de facto differential privacy (DP) technique is perturbing the gradients with random Gaussian noise. Data valuation attributes the ML performance to the training data and is widely used in privacy-aware applications that require enforcing DP such as data pricing, collaborative ML, and federated learning (FL). Can existing data valuation methods still be used when DP is enforced via gradient perturbations? We show that the answer is no with the default approach of injecting i.i.d.~random noise to the gradients because the estimation uncertainty of the data value estimation paradoxically linearly scales with more estimation budget, producing estimates almost like random guesses. To address this issue, we propose to instead inject carefully correlated noise to provably remove the linear scaling of estimation uncertainty w.r.t.~the budget. We also empirically demonstrate that our method gives better data value estimates on various ML tasks and is applicable to use cases including dataset valuation and~FL.

[AI-66] Solving Nonlinear Energy Supply and Demand System Using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2412.17001
作者: Van Truong Vo,Samad Noeiaghdam,Denis Sidorov,Aliona Dreglea,Liguo Wang
关键词: time-dependent factors exhibit, exhibit nonlinear characteristics, factors exhibit nonlinear, neural network, neural network model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注: Submitted to Computation J

点击查看摘要

Abstract:Nonlinear differential equations and systems play a crucial role in modeling systems where time-dependent factors exhibit nonlinear characteristics. Due to their nonlinear nature, solving such systems often presents significant difficulties and challenges. In this study, we propose a method utilizing Physics-Informed Neural Networks (PINNs) to solve the nonlinear energy supply-demand (ESD) system. We design a neural network with four outputs, where each output approximates a function that corresponds to one of the unknown functions in the nonlinear system of differential equations describing the four-dimensional ESD problem. The neural network model is then trained and the parameters are identified, optimized to achieve a more accurate solution. The solutions obtained from the neural network for this problem are equivalent when we compare and evaluate them against the Runge-Kutta numerical method of order 4/5 (RK45). However, the method utilizing neural networks is considered a modern and promising approach, as it effectively exploits the superior computational power of advanced computer systems, especially in solving complex problems. Another advantage is that the neural network model, after being trained, can solve the nonlinear system of differential equations across a continuous domain. In other words, neural networks are not only trained to approximate the solution functions for the nonlinear ESD system but can also represent the complex dynamic relationships between the system’s components. However, this approach requires significant time and computational power due to the need for model training.

[AI-67] LLM -Powered User Simulator for Recommender System

链接: https://arxiv.org/abs/2412.16984
作者: Zijian Zhang,Shuchang Liu,Ziru Liu,Rui Zhong,Qingpeng Cai,Xiangyu Zhao,Chunxu Zhang,Qidong Liu,Peng Jiang
关键词: reinforcement learning-based recommender, learning-based recommender systems, User, providing a testing, iteration and optimization
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:User simulators can rapidly generate a large volume of timely user behavior data, providing a testing platform for reinforcement learning-based recommender systems, thus accelerating their iteration and optimization. However, prevalent user simulators generally suffer from significant limitations, including the opacity of user preference modeling and the incapability of evaluating simulation accuracy. In this paper, we introduce an LLM-powered user simulator to simulate user engagement with items in an explicit manner, thereby enhancing the efficiency and effectiveness of reinforcement learning-based recommender systems training. Specifically, we identify the explicit logic of user preferences, leverage LLMs to analyze item characteristics and distill user sentiments, and design a logical model to imitate real human engagement. By integrating a statistical model, we further enhance the reliability of the simulation, proposing an ensemble model that synergizes logical and statistical insights for user interaction simulations. Capitalizing on the extensive knowledge and semantic generation capabilities of LLMs, our user simulator faithfully emulates user behaviors and preferences, yielding high-fidelity training data that enrich the training of recommendation algorithms. We establish quantifying and qualifying experiments on five datasets to validate the simulator’s effectiveness and stability across various recommendation scenarios.

[AI-68] Environment Descriptions for Usability and Generalisation in Reinforcement Learning

链接: https://arxiv.org/abs/2412.16970
作者: Dennis J.N.J. Soemers,Spyridon Samothrakis,Kurt Driessens,Mark H.M. Winands
关键词: current reinforcement learning, CUDA or JAX, research involves training, general-purpose programming languages, reinforcement learning
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted by ICAART 2025

点击查看摘要

Abstract:The majority of current reinforcement learning (RL) research involves training and deploying agents in environments that are implemented by engineers in general-purpose programming languages and more advanced frameworks such as CUDA or JAX. This makes the application of RL to novel problems of interest inaccessible to small organisations or private individuals with insufficient engineering expertise. This position paper argues that, to enable more widespread adoption of RL, it is important for the research community to shift focus towards methodologies where environments are described in user-friendly domain-specific or natural languages. Aside from improving the usability of RL, such language-based environment descriptions may also provide valuable context and boost the ability of trained agents to generalise to unseen environments within the set of all environments that can be described in any language of choice.

[AI-69] Efficiently Solving Turn-Taking Stochastic Games with Extensive-Form Correlation

链接: https://arxiv.org/abs/2412.16934
作者: Hanrui Zhang,Yu Cheng,Vincent Conitzer
关键词: two-player turn-taking stochastic, extensive-form correlated equilibrium, study equilibrium computation, Stackelberg extensive-form correlated, turn-taking stochastic games
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: EC 2023

点击查看摘要

Abstract:We study equilibrium computation with extensive-form correlation in two-player turn-taking stochastic games. Our main results are two-fold: (1) We give an algorithm for computing a Stackelberg extensive-form correlated equilibrium (SEFCE), which runs in time polynomial in the size of the game, as well as the number of bits required to encode each input number. (2) We give an efficient algorithm for approximately computing an optimal extensive-form correlated equilibrium (EFCE) up to machine precision, i.e., the algorithm achieves approximation error \varepsilon in time polynomial in the size of the game, as well as \log(1 / \varepsilon) . Our algorithm for SEFCE is the first polynomial-time algorithm for equilibrium computation with commitment in such a general class of stochastic games. Existing algorithms for SEFCE typically make stronger assumptions such as no chance moves, and are designed for extensive-form games in the less succinct tree form. Our algorithm for approximately optimal EFCE is, to our knowledge, the first algorithm that achieves 3 desiderata simultaneously: approximate optimality, polylogarithmic dependency on the approximation error, and compatibility with stochastic games in the more succinct graph form. Existing algorithms achieve at most 2 of these desiderata, often also relying on additional technical assumptions. Comments: EC 2023 Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2412.16934 [cs.GT] (or arXiv:2412.16934v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2412.16934 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-70] Enhancing Supply Chain Transparency in Emerging Economies Using Online Contents and LLM s

链接: https://arxiv.org/abs/2412.16922
作者: Bohan Jin,Qianyou Sun,Lihua Chen
关键词: supply chain transparency, current global economy, supply chain, monitor supplier performance, chain transparency plays
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:In the current global economy, supply chain transparency plays a pivotal role in ensuring this security by enabling companies to monitor supplier performance and fostering accountability and responsibility. Despite the advancements in supply chain relationship datasets like Bloomberg and FactSet, supply chain transparency remains a significant challenge in emerging economies due to issues such as information asymmetry and institutional gaps in regulation. This study proposes a novel approach to enhance supply chain transparency in emerging economies by leveraging online content and large language models (LLMs). We develop a Supply Chain Knowledge Graph Mining System that integrates advanced LLMs with web crawler technology to automatically collect and analyze supply chain information. The system’s effectiveness is validated through a case study focusing on the semiconductor supply chain, a domain that has recently gained significant attention due to supply chain risks. Our results demonstrate that the proposed system provides greater applicability for emerging economies, such as mainland China, complementing the data gaps in existing datasets. However, challenges including the accurate estimation of monetary and material flows, the handling of time series data, synonyms disambiguation, and mitigating biases from online contents still remains. Future research should focus on addressing these issues to further enhance the system’s capabilities and broaden its application to other emerging economies and industries.

[AI-71] Map Imagination Like Blind Humans: Group Diffusion Model for Robotic Map Generation

链接: https://arxiv.org/abs/2412.16908
作者: Qijin Song,Weibang Bai
关键词: http URL, blind people, maps, generate, URL
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Can robots imagine or generate maps like humans do, especially when only limited information can be perceived like blind people? To address this challenging task, we propose a novel group diffusion model (GDM) based architecture for robots to generate point cloud maps with very limited input this http URL from the blind humans’ natural capability of imagining or generating mental maps, the proposed method can generate maps without visual perception data or depth data. With additional limited super-sparse spatial positioning data, like the extra contact-based positioning information the blind individuals can obtain, the map generation quality can be improved even this http URL on public datasets are conducted, and the results indicate that our method can generate reasonable maps solely based on path data, and produce even more refined maps upon incorporating exiguous LiDAR this http URL to conventional mapping approaches, our novel method significantly mitigates sensor dependency, enabling the robots to imagine and generate elementary maps without heavy onboard sensory devices.

[AI-72] A Backdoor Attack Scheme with Invisible Triggers Based on Model Architecture Modification

链接: https://arxiv.org/abs/2412.16905
作者: Yuan Ma,Xu Ma,Jiankang Wei,Jinmeng Tang,Xiaoyu Zhang,Yilun Lyu,Kehao Chen,Jingtong Huang
关键词: Machine learning systems, attackers manipulate model, manipulate model behavior, Machine learning, backdoor attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning systems are vulnerable to backdoor attacks, where attackers manipulate model behavior through data tampering or architectural modifications. Traditional backdoor attacks involve injecting malicious samples with specific triggers into the training data, causing the model to produce targeted incorrect outputs in the presence of the corresponding triggers. More sophisticated attacks modify the model’s architecture directly, embedding backdoors that are harder to detect as they evade traditional data-based detection methods. However, the drawback of the architectural modification based backdoor attacks is that the trigger must be visible in order to activate the backdoor. To further strengthen the invisibility of the backdoor attacks, a novel backdoor attack method is presented in the paper. To be more specific, this method embeds the backdoor within the model’s architecture and has the capability to generate inconspicuous and stealthy triggers. The attack is implemented by modifying pre-trained models, which are then redistributed, thereby posing a potential threat to unsuspecting users. Comprehensive experiments conducted on standard computer vision benchmarks validate the effectiveness of this attack and highlight the stealthiness of its triggers, which remain undetectable through both manual visual inspection and advanced detection tools.

[AI-73] Preventing Non-intrusive Load Monitoring Privacy Invasion: A Precise Adversarial Attack Scheme for Networked Smart Meters

链接: https://arxiv.org/abs/2412.16893
作者: Jialing He,Jiacheng Wang,Ning Wang,Shangwei Guo,Liehuang Zhu,Dusit Niyato,Tao Xiang
关键词: networked smart meters, smart meters employing, non-intrusive load monitoring, Smart grid, networked smart
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Smart grid, through networked smart meters employing the non-intrusive load monitoring (NILM) technique, can considerably discern the usage patterns of residential appliances. However, this technique also incurs privacy leakage. To address this issue, we propose an innovative scheme based on adversarial attack in this paper. The scheme effectively prevents NILM models from violating appliance-level privacy, while also ensuring accurate billing calculation for users. To achieve this objective, we overcome two primary challenges. First, as NILM models fall under the category of time-series regression models, direct application of traditional adversarial attacks designed for classification tasks is not feasible. To tackle this issue, we formulate a novel adversarial attack problem tailored specifically for NILM and providing a theoretical foundation for utilizing the Jacobian of the NILM model to generate imperceptible perturbations. Leveraging the Jacobian, our scheme can produce perturbations, which effectively misleads the signal prediction of NILM models to safeguard users’ appliance-level privacy. The second challenge pertains to fundamental utility requirements, where existing adversarial attack schemes struggle to achieve accurate billing calculation for users. To handle this problem, we introduce an additional constraint, mandating that the sum of added perturbations within a billing period must be precisely zero. Experimental validation on real-world power datasets REDD and UK-DALE demonstrates the efficacy of our proposed solutions, which can significantly amplify the discrepancy between the output of the targeted NILM model and the actual power signal of appliances, and enable accurate billing at the same time. Additionally, our solutions exhibit transferability, making the generated perturbation signal from one target model applicable to other diverse NILM models.

[AI-74] Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model AAMAS25

链接: https://arxiv.org/abs/2412.16878
作者: Songjun Tu,Jingbo Sun,Qichao Zhang,Xiangyuan Lan,Dongbin Zhao
关键词: Preference-based reinforcement learning, Preference-based reinforcement, avoid meticulous reward, meticulous reward engineering, Large Language Model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, The 24th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS25)

点击查看摘要

Abstract:Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real-time human feedback is hard to obtain in online tasks. Most work suppose there is a “scripted teacher” that utilizes privileged predefined reward to provide preference feedback. In this paper, we propose a RL Self-augmented Large Language Model Feedback (RL-SaLLM-F) technique that does not rely on privileged information for online PbRL. RL-SaLLM-F leverages the reflective and discriminative capabilities of LLM to generate self-augmented trajectories and provide preference labels for reward learning. First, we identify an failure issue in LLM-based preference discrimination, specifically “query ambiguity”, in online PbRL. Then LLM is employed to provide preference labels and generate self-augmented imagined trajectories that better achieve the task goal, thereby enhancing the quality and efficiency of feedback. Additionally, a double-check mechanism is introduced to mitigate randomness in the preference labels, improving the reliability of LLM feedback. The experiment across multiple tasks in the MetaWorld benchmark demonstrates the specific contributions of each proposed module in RL-SaLLM-F, and shows that self-augmented LLM feedback can effectively replace the impractical “scripted teacher” feedback. In summary, RL-SaLLM-F introduces a new direction of feedback acquisition in online PbRL that does not rely on any online privileged information, offering an efficient and lightweight solution with LLM-driven feedback.

[AI-75] A Multi-modal Approach to Dysarthria Detection and Severity Assessment Using Speech and Text Information

链接: https://arxiv.org/abs/2412.16874
作者: Anuprabha M,Krishna Gurugubelli,Kesavaraj V,Anil Kumar Vuppala
关键词: delivering targeted therapeutic, targeted therapeutic interventions, Automatic detection, interventions to patients, dysarthria are crucial
类目: Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Automatic detection and severity assessment of dysarthria are crucial for delivering targeted therapeutic interventions to patients. While most existing research focuses primarily on speech modality, this study introduces a novel approach that leverages both speech and text modalities. By employing cross-attention mechanism, our method learns the acoustic and linguistic similarities between speech and text representations. This approach assesses specifically the pronunciation deviations across different severity levels, thereby enhancing the accuracy of dysarthric detection and severity assessment. All the experiments have been performed using UA-Speech dysarthric database. Improved accuracies of 99.53% and 93.20% in detection, and 98.12% and 51.97% for severity assessment have been achieved when speaker-dependent and speaker-independent, unseen and seen words settings are used. These findings suggest that by integrating text information, which provides a reference linguistic knowledge, a more robust framework has been developed for dysarthric detection and assessment, thereby potentially leading to more effective diagnoses.

[AI-76] OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning

链接: https://arxiv.org/abs/2412.16849
作者: Yuxiang Zhang,Yuqi Yang,Jiangming Shu,Yuhang Wang,Jinlin Xiao,Jitao Sang
关键词: OpenAI recent introduction, simple pattern imitation, Reinforcement Fine-Tuning, introduction of Reinforcement, OpenAI recent
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:OpenAI’s recent introduction of Reinforcement Fine-Tuning (RFT) showcases the potential of reasoning foundation model and offers a new paradigm for fine-tuning beyond simple pattern imitation. This technical report presents \emphOpenRFT, our attempt to fine-tune generalist reasoning models for domain-specific tasks under the same settings as RFT. OpenRFT addresses two key challenges of lacking reasoning step data and the limited quantity of training samples, by leveraging the domain-specific samples in three ways: question augmentation, synthesizing reasoning-process data, and few-shot ICL. The evaluation is conducted on SciKnowEval, where OpenRFT achieves notable performance gains with only 100 domain-specific samples for each task. More experimental results will be updated continuously in later versions. Source codes, datasets, and models are disclosed at: this https URL

[AI-77] ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2412.16848
作者: Kun Wu,Yinuo Zhao,Zhiyuan Xu,Zhengping Che,Chengxiang Yin,Chi Harold Liu,Qinru Qiu,Feiferi Feng,Jian Tang
关键词: Offline Reinforcement Learning, Reinforcement Learning, Conservative Level, promising control policy, Offline Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 19 pages, 4 figures, IEEE Transactions on Neural Networks and Learning Systems (2024)

点击查看摘要

Abstract:Offline Reinforcement Learning (RL), which operates solely on static datasets without further interactions with the environment, provides an appealing alternative to learning a safe and promising control policy. The prevailing methods typically learn a conservative policy to mitigate the problem of Q-value overestimation, but it is prone to overdo it, leading to an overly conservative policy. Moreover, they optimize all samples equally with fixed constraints, lacking the nuanced ability to control conservative levels in a fine-grained manner. Consequently, this limitation results in a performance decline. To address the above two challenges in a united way, we propose a framework, Adaptive Conservative Level in Q-Learning (ACL-QL), which limits the Q-values in a mild range and enables adaptive control on the conservative level over each state-action pair, i.e., lifting the Q-values more for good transitions and less for bad transitions. We theoretically analyze the conditions under which the conservative level of the learned Q-function can be limited in a mild range and how to optimize each transition adaptively. Motivated by the theoretical analysis, we propose a novel algorithm, ACL-QL, which uses two learnable adaptive weight functions to control the conservative level over each transition. Subsequently, we design a monotonicity loss and surrogate losses to train the adaptive weight functions, Q-function, and policy network alternatively. We evaluate ACL-QL on the commonly used D4RL benchmark and conduct extensive ablation studies to illustrate the effectiveness and state-of-the-art performance compared to existing offline DRL baselines.

[AI-78] Graph Learning-based Regional Heavy Rainfall Prediction Using Low-Cost Rain Gauges

链接: https://arxiv.org/abs/2412.16842
作者: Edwin Salcedo
关键词: flood risk management, effective flood risk, Accurate and timely, disaster preparedness, flood risk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted for publication in the proceedings of the 2024 Latin American Conference on Computational Intelligence (IEEE LA-CCI 2024)

点击查看摘要

Abstract:Accurate and timely prediction of heavy rainfall events is crucial for effective flood risk management and disaster preparedness. By monitoring, analysing, and evaluating rainfall data at a local level, it is not only possible to take effective actions to prevent any severe climate variation but also to improve the planning of surface and underground hydrological resources. However, developing countries often lack the weather stations to collect data continuously due to the high cost of installation and maintenance. In light of this, the contribution of the present paper is twofold: first, we propose a low-cost IoT system for automatic recording, monitoring, and prediction of rainfall in rural regions. Second, we propose a novel approach to regional heavy rainfall prediction by implementing graph neural networks (GNNs), which are particularly well-suited for capturing the complex spatial dependencies inherent in rainfall patterns. The proposed approach was tested using a historical dataset spanning 72 months, with daily measurements, and experimental results demonstrated the effectiveness of the proposed method in predicting heavy rainfall events, making this approach particularly attractive for regions with limited resources or where traditional weather radar or station coverage is sparse.

[AI-79] Online Learning from Strategic Human Feedback in LLM Fine-Tuning

链接: https://arxiv.org/abs/2412.16834
作者: Shugang Hao,Lingjie Duan
关键词: large language models, fine-tuning large language, Reinforcement learning, language models, essential step
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become an essential step in fine-tuning large language models (LLMs) to align them with human preferences. However, human labelers are selfish and have diverse preferences. They may strategically misreport their online feedback to influence the system’s aggregation towards their own preferences. Current practice simply averages labelers’ feedback per time and fails to identify the most accurate human labeler, leading to linear regret \mathcalO(T) for T time slots. To our best knowledge, we are the first to study online learning mechanisms against strategic human labelers in the LLM fine-tuning process. We formulate a new dynamic Bayesian game and dynamically adjust human labelers’ weights in the preference aggregation, ensuring their truthful feedback and sublinear regret \mathcalO(T^1/2) . Simulation results demonstrate our mechanism’s great advantages over the existing benchmark schemes.

[AI-80] KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis AAAI-25

链接: https://arxiv.org/abs/2412.16833
作者: Kaiwen Zuo,Yirui Jiang,Fan Mo,Pietro Lio
关键词: Large Language Models, Integrating Large Language, Integrating Large, Language Models, Large Language
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages,5 figures,published to AAAI-25 Bridge Program

点击查看摘要

Abstract:Integrating Large Language Models (LLMs) in healthcare diagnosis demands systematic frameworks that can handle complex medical scenarios while maintaining specialized expertise. We present KG4Diagnosis, a novel hierarchical multi-agent framework that combines LLMs with automated knowledge graph construction, encompassing 362 common diseases across medical specialties. Our framework mirrors real-world medical systems through a two-tier architecture: a general practitioner (GP) agent for initial assessment and triage, coordinating with specialized agents for in-depth diagnosis in specific domains. The core innovation lies in our end-to-end knowledge graph generation methodology, incorporating: (1) semantic-driven entity and relation extraction optimized for medical terminology, (2) multi-dimensional decision relationship reconstruction from unstructured medical texts, and (3) human-guided reasoning for knowledge expansion. KG4Diagnosis serves as an extensible foundation for specialized medical diagnosis systems, with capabilities to incorporate new diseases and medical knowledge. The framework’s modular design enables seamless integration of domain-specific enhancements, making it valuable for developing targeted medical diagnosis systems. We provide architectural guidelines and protocols to facilitate adoption across medical contexts.

[AI-81] Visual Prompting with Iterative Refinement for Design Critique Generation

链接: https://arxiv.org/abs/2412.16829
作者: Peitong Duan,Chin-Yi Chen,Bjoern Hartmann,Yang Li
关键词: Feedback is crucial, automating design critiques, user interface, design, significantly improve
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques – a complex task that requires producing detailed design comments that are visually grounded in a given design’s image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text output and bounding boxes using few-shot samples tailored for each step. We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline over those by the baseline, with the pipeline reducing the gap from human performance by 50% for one rating metric. To assess the generalizability of our approach to other multimodal tasks, we applied our pipeline to open-vocabulary object and attribute detection, and experiments showed that our method also outperformed the baseline.

[AI-82] Unsupervised Discovery of Formulas for Mathematical Constants

链接: https://arxiv.org/abs/2412.16818
作者: Michael Shalyt,Uri Seligmann,Itay Beit Halachmi,Ofir David,Rotem Elimelech,Ido Kaminer
关键词: Ongoing efforts, formulas, efforts that span, span over decades, decades show
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Number Theory (math.NT)
*备注: 8 figures, 5 tables, 28 pages including the supplementary information. For a 5-minute video abstract see this https URL . Code can be found at this https URL

点击查看摘要

Abstract:Ongoing efforts that span over decades show a rise of AI methods for accelerating scientific discovery, yet accelerating discovery in mathematics remains a persistent challenge for AI. Specifically, AI methods were not effective in creation of formulas for mathematical constants because each such formula must be correct for infinite digits of precision, with “near-true” formulas providing no insight toward the correct ones. Consequently, formula discovery lacks a clear distance metric needed to guide automated discovery in this realm. In this work, we propose a systematic methodology for categorization, characterization, and pattern identification of such formulas. The key to our methodology is introducing metrics based on the convergence dynamics of the formulas, rather than on the numerical value of the formula. These metrics enable the first automated clustering of mathematical formulas. We demonstrate this methodology on Polynomial Continued Fraction formulas, which are ubiquitous in their intrinsic connections to mathematical constants, and generalize many mathematical functions and structures. We test our methodology on a set of 1,768,900 such formulas, identifying many known formulas for mathematical constants, and discover previously unknown formulas for \pi , \ln(2) , Gauss’, and Lemniscate’s constants. The uncovered patterns enable a direct generalization of individual formulas to infinite families, unveiling rich mathematical structures. This success paves the way towards a generative model that creates formulas fulfilling specified mathematical properties, accelerating the rate of discovery of useful formulas. Comments: 8 figures, 5 tables, 28 pages including the supplementary information. For a 5-minute video abstract see this https URL . Code can be found at this https URL Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Number Theory (math.NT) Cite as: arXiv:2412.16818 [cs.AI] (or arXiv:2412.16818v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.16818 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Advances in Neural Information Processing Systems 37 (NeurIPS 2024)

[AI-83] An Exploration of Pattern Mining with ChatGPT

链接: https://arxiv.org/abs/2412.16814
作者: Michael Weiss
关键词: exploratory approach, approach to examine, Large Language Models, pattern, paper
类目: Artificial Intelligence (cs.AI)
*备注: This is the author’s version of the work. The definitive version of record was published in 29th European Conference on Pattern Languages of Programs, People, and Practices (EuroPLOP 2024), July 3-7, 2024, Irsee, Germany, ACM

点击查看摘要

Abstract:This paper takes an exploratory approach to examine the use of ChatGPT for pattern mining. It proposes an eight-step collaborative process that combines human insight with AI capabilities to extract patterns from known uses. The paper offers a practical demonstration of this process by creating a pattern language for integrating Large Language Models (LLMs) with data sources and tools. LLMs, such as ChatGPT, are a new class of AI models that have been trained on large amounts of text, and can create new content, including text, images, or video. The paper also argues for adding affordances of the underlying components as a new element of pattern descriptions. The primary audience of the paper includes pattern writers interested in pattern mining using LLMs.

[AI-84] Enhancing web traffic attacks identification through ensemble methods and feature selection

链接: https://arxiv.org/abs/2412.16791
作者: Daniel Urda,Branly Martínez,Nuño Basurto,Meelis Kull,Ángel Arroyo,Álvaro Herrero
关键词: essential digital assets, high traffic volume, digital assets, impact of breaches, essential digital
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Websites, as essential digital assets, are highly vulnerable to cyberattacks because of their high traffic volume and the significant impact of breaches. This study aims to enhance the identification of web traffic attacks by leveraging machine learning techniques. A methodology was proposed to extract relevant features from HTTP traces using the CSIC2010 v2 dataset, which simulates e-commerce web traffic. Ensemble methods, such as Random Forest and Extreme Gradient Boosting, were employed and compared against baseline classifiers, including k-nearest Neighbor, LASSO, and Support Vector Machines. The results demonstrate that the ensemble methods outperform baseline classifiers by approximately 20% in predictive accuracy, achieving an Area Under the ROC Curve (AUC) of 0.989. Feature selection methods such as Information Gain, LASSO, and Random Forest further enhance the robustness of these models. This study highlights the efficacy of ensemble models in improving attack detection while minimizing performance variability, offering a practical framework for securing web traffic in diverse application contexts.

[AI-85] DCOR: Anomaly Detection in Attributed Networks via Dual Contrastive Learning Reconstruction

链接: https://arxiv.org/abs/2412.16788
作者: Hossein Rafiee Zade,Hadi Zare,Mohsen Ghassemi Parsa,Hadi Davardoust,Meshkat Shariat Bagheri
关键词: identify abnormal events, security breaches, applied domains, identify abnormal, abnormal events
类目: Artificial Intelligence (cs.AI)
*备注: 12 pages, accepted at the Thirteenth International Conference on Complex Networks and Their Applications

点击查看摘要

Abstract:Anomaly detection using a network-based approach is one of the most efficient ways to identify abnormal events such as fraud, security breaches, and system faults in a variety of applied domains. While most of the earlier works address the complex nature of graph-structured data and predefined anomalies, the impact of data attributes and emerging anomalies are often neglected. This paper introduces DCOR, a novel approach on attributed networks that integrates reconstruction-based anomaly detection with Contrastive Learning. Utilizing a Graph Neural Network (GNN) framework, DCOR contrasts the reconstructed adjacency and feature matrices from both the original and augmented graphs to detect subtle anomalies. We employed comprehensive experimental studies on benchmark datasets through standard evaluation measures. The results show that DCOR significantly outperforms state-of-the-art methods. Obtained results demonstrate the efficacy of proposed approach in attributed networks with the potential of uncovering new patterns of anomalies.

[AI-86] Assessing Social Alignment: Do Personality-Prompted Large Language Models Behave Like Humans? NEURIPS2024

链接: https://arxiv.org/abs/2412.16772
作者: Ivan Zakazov,Mikolaj Boronski,Lorenzo Drudi,Robert West
关键词: large language models, language modelling, large language, ongoing revolution, modelling has led
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024 Workshop on Behavioral Machine Learning

点击查看摘要

Abstract:The ongoing revolution in language modelling has led to various novel applications, some of which rely on the emerging “social abilities” of large language models (LLMs). Already, many turn to the new “cyber friends” for advice during pivotal moments of their lives and trust them with their deepest secrets, implying that accurate shaping of LLMs’ “personalities” is paramount. Leveraging the vast diversity of data on which LLMs are pretrained, state-of-the-art approaches prompt them to adopt a particular personality. We ask (i) if personality-prompted models behave (i.e. “make” decisions when presented with a social situation) in line with the ascribed personality, and (ii) if their behavior can be finely controlled. We use classic psychological experiments - the Milgram Experiment and the Ultimatum Game - as social interaction testbeds and apply personality prompting to GPT-3.5/4/4o-mini/4o. Our experiments reveal failure modes of the prompt-based modulation of the models’ “behavior”, thus challenging the feasibility of personality prompting with today’s LLMs.

[AI-87] A Comparative Study on Machine Learning Models to Classify Diseases Based on Patient Behaviour and Habits

链接: https://arxiv.org/abs/2412.16768
作者: Elham Musaaed,Nabil Hewahi,Abdulla Alasaadi
关键词: potential application area, posed a potential, predicting diseases based, PRF, recent years
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, ML algorithms have been shown to be useful for predicting diseases based on health data and posed a potential application area for these algorithms such as modeling of diseases. The majority of these applications employ supervised rather than unsupervised ML algorithms. In addition, each year, the amount of data in medical science grows rapidly. Moreover, these data include clinical and Patient-Related Factors (PRF), such as height, weight, age, other physical characteristics, blood sugar, lipids, insulin, etc., all of which will change continually over time. Analysis of historical data can help identify disease risk factors and their interactions, which is useful for disease diagnosis and prediction. This wealth of valuable information in these data will help doctors diagnose accurately and people can become more aware of the risk factors and key indicators to act proactively. The purpose of this study is to use six supervised ML approaches to fill this gap by conducting a comprehensive experiment to investigate the correlation between PRF and Diabetes, Stroke, Heart Disease (HD), and Kidney Disease (KD). Moreover, it will investigate the link between Diabetes, Stroke, and KD and PRF with HD. Further, the research aims to compare and evaluate various ML algorithms for classifying diseases based on the PRF. Additionally, it aims to compare and evaluate ML algorithms for classifying HD based on PRF as well as Diabetes, Stroke, Asthma, Skin Cancer, and KD as attributes. Lastly, HD predictions will be provided through a Web-based application on the most accurate classifier, which allows the users to input their values and predict the output.

[AI-88] Apples to Apples: Establishing Comparability in Knowledge Generation Tasks Involving Users

链接: https://arxiv.org/abs/2412.16766
作者: Christophe Debruyne,Ademar Crotti Junior
关键词: Knowledge graph construction, issue frequently brought, facilitating user involvement, Knowledge graph, knowledge generation languages
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注: For associated repository, see this https URL

点击查看摘要

Abstract:Knowledge graph construction (KGC) from (semi-)structured data is challenging, and facilitating user involvement is an issue frequently brought up within this community. We cannot deny the progress we have made with respect to (declarative) knowledge generation languages and tools to help build such mappings. However, it is surprising that no two studies report on similar protocols. This heterogeneity does not allow for a comparison of KGC languages, techniques, and tools. This paper first analyses the various studies that report on studies involving users to identify the points of comparison. These gaps include a lack of systematic consistency in task design, participant selection, and evaluation metrics. Moreover, there needs to be a systematic way of analyzing the data and reporting the findings, which is also lacking. We thus propose and introduce a user protocol for KGC designed to address this challenge. Where possible, we draw and take elements from the literature we deem fit for such a protocol. The protocol, as such, allows for the comparison of languages and techniques for the RDF Mapping Languages core functionality, which is covered by most of the other state-of-the-art techniques and tools. We also propose how the protocol can be amended to compare extensions (of RML). This protocol provides an important step towards a more comparable evaluation of KGC user studies.

[AI-89] owards Selection and Transition Between Behavior-Based Neural Networks for Automated Driving

链接: https://arxiv.org/abs/2412.16764
作者: Iqra Aslam,Igor Anpilogov,Andreas Rausch
关键词: Autonomous driving technology, End systems based, complex End, Autonomous driving, End To End
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 8 figures

点击查看摘要

Abstract:Autonomous driving technology is progressing rapidly, largely due to complex End To End systems based on deep neural networks. While these systems are effective, their complexity can make it difficult to understand their behavior, raising safety concerns. This paper presents a new solution a Behavior Selector that uses multiple smaller artificial neural networks (ANNs) to manage different driving tasks, such as lane following and turning. Rather than relying on a single large network, which can be burdensome, require extensive training data, and is hard to understand, the developed approach allows the system to dynamically select the appropriate neural network for each specific behavior (e.g., turns) in real time. We focus on ensuring smooth transitions between behaviors while considering the vehicles current speed and orientation to improve stability and safety. The proposed system has been tested using the AirSim simulation environment, demonstrating its effectiveness.

[AI-90] A Method for the Runtime Validation of AI-based Environment Perception in Automated Driving System

链接: https://arxiv.org/abs/2412.16762
作者: Iqra Aslam,Abhishek Buragohain,Daniel Bamal,Adina Aniculaesei,Meng Zhang,Andreas Rausch
关键词: Autonomous Driving Systems, dynamic driving task, driving task executed, Autonomous Driving, executed by Autonomous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 9 pages, 8 figures

点击查看摘要

Abstract:Environment perception is a fundamental part of the dynamic driving task executed by Autonomous Driving Systems (ADS). Artificial Intelligence (AI)-based approaches have prevailed over classical techniques for realizing the environment perception. Current safety-relevant standards for automotive systems, International Organization for Standardization (ISO) 26262 and ISO 21448, assume the existence of comprehensive requirements specifications. These specifications serve as the basis on which the functionality of an automotive system can be rigorously tested and checked for compliance with safety regulations. However, AI-based perception systems do not have complete requirements specification. Instead, large datasets are used to train AI-based perception systems. This paper presents a function monitor for the functional runtime monitoring of a two-folded AI-based environment perception for ADS, based respectively on camera and LiDAR sensors. To evaluate the applicability of the function monitor, we conduct a qualitative scenario-based evaluation in a controlled laboratory environment using a model car. The evaluation results then are discussed to provide insights into the monitor’s performance and its suitability for real-world applications.

[AI-91] Unpacking Political Bias in Large Language Models : Insights Across Topic Polarization

链接: https://arxiv.org/abs/2412.16746
作者: Kaiqi Yang,Hang Li,Yucheng Chu,Yuping Lin,Tai-Quan Peng,Hui Liu
关键词: Large Language Models, Large Language, social topics due, political bias, Language Models
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been widely used to generate responses on social topics due to their world knowledge and generative capabilities. Beyond reasoning and generation performance, political bias is an essential issue that warrants attention. Political bias, as a universal phenomenon in human society, may be transferred to LLMs and distort LLMs’ behaviors of information acquisition and dissemination with humans, leading to unequal access among different groups of people. To prevent LLMs from reproducing and reinforcing political biases, and to encourage fairer LLM-human interactions, comprehensively examining political bias in popular LLMs becomes urgent and crucial. In this study, we systematically measure the political biases in a wide range of LLMs, using a curated set of questions addressing political bias in various contexts. Our findings reveal distinct patterns in how LLMs respond to political topics. For highly polarized topics, most LLMs exhibit a pronounced left-leaning bias. Conversely, less polarized topics elicit greater consensus, with similar response patterns across different LLMs. Additionally, we analyze how LLM characteristics, including release date, model scale, and region of origin affect political bias. The results indicate political biases evolve with model scale and release date, and are also influenced by regional factors of LLMs. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2412.16746 [cs.CY] (or arXiv:2412.16746v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2412.16746 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-92] Reasoning about Actual Causes in Nondeterministic Domains – Extended Version

链接: https://arxiv.org/abs/2412.16728
作者: Shakil M. Khan,Yves Lespérance,Maryam Rostamigiv
关键词: formalization of rationality, observations is crucial, actual, domains, Abstract
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reasoning about the causes behind observations is crucial to the formalization of rationality. While extensive research has been conducted on root cause analysis, most studies have predominantly focused on deterministic settings. In this paper, we investigate causation in more realistic nondeterministic domains, where the agent does not have any control on and may not know the choices that are made by the environment. We build on recent preliminary work on actual causation in the nondeterministic situation calculus to formalize more sophisticated forms of reasoning about actual causes in such domains. We investigate the notions of Certainly Causes'' and Possibly Causes’’ that enable the representation of actual cause for agent actions in these domains. We then show how regression in the situation calculus can be extended to reason about such notions of actual causes.

[AI-93] Argumentation Computation with Large Language Models : A Benchmark Study

链接: https://arxiv.org/abs/2412.16725
作者: Zhaoqun Li,Xiaotong Fang,Chen Chen,Mengze Li,Beishui Liao
关键词: made significant advancements, large language models, recent years, large language, made significant
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, large language models (LLMs) have made significant advancements in neuro-symbolic computing. However, the combination of LLM with argumentation computation remains an underexplored domain, despite its considerable potential for real-world applications requiring defeasible reasoning. In this paper, we aim to investigate the capability of LLMs in determining the extensions of various abstract argumentation semantics. To achieve this, we develop and curate a benchmark comprising diverse abstract argumentation frameworks, accompanied by detailed explanations of algorithms for computing extensions. Subsequently, we fine-tune LLMs on the proposed benchmark, focusing on two fundamental extension-solving tasks. As a comparative baseline, LLMs are evaluated using a chain-of-thought approach, where they struggle to accurately compute semantics. In the experiments, we demonstrate that the process explanation plays a crucial role in semantics computation learning. Models trained with explanations show superior generalization accuracy compared to those trained solely with question-answer pairs. Furthermore, by leveraging the self-explanation capabilities of LLMs, our approach provides detailed illustrations that mitigate the lack of transparency typically associated with neural networks. Our findings contribute to the broader understanding of LLMs’ potential in argumentation computation, offering promising avenues for further research in this domain.

[AI-94] Coupling Neural Networks and Physics Equations For Li-Ion Battery State-of-Charge Prediction

链接: https://arxiv.org/abs/2412.16724
作者: Giovanni Pollo,Alessio Burrello,Enrico Macii,Massimo Poncino,Sara Vinco,Daniele Jahier Pagliari
关键词: implementing effective power, effective power management, power management policies, State of Charge, Estimating the evolution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Estimating the evolution of the battery’s State of Charge (SoC) in response to its usage is critical for implementing effective power management policies and for ultimately improving the system’s lifetime. Most existing estimation methods are either physics-based digital twins of the battery or data-driven models such as Neural Networks (NNs). In this work, we propose two new contributions in this domain. First, we introduce a novel NN architecture formed by two cascaded branches: one to predict the current SoC based on sensor readings, and one to estimate the SoC at a future time as a function of the load behavior. Second, we integrate battery dynamics equations into the training of our NN, merging the physics-based and data-driven approaches, to improve the models’ generalization over variable prediction horizons. We validate our approach on two publicly accessible datasets, showing that our Physics-Informed Neural Networks (PINNs) outperform purely data-driven ones while also obtaining superior prediction accuracy with a smaller architecture with respect to the state-of-the-art.

[AI-95] OpenAI o1 System Card

链接: https://arxiv.org/abs/2412.16720
作者: OpenAI:Aaron Jaech,Adam Kalai,Adam Lerer,Adam Richardson,Ahmed El-Kishky,Aiden Low,Alec Helyar,Aleksander Madry,Alex Beutel,Alex Carney,Alex Iftimie,Alex Karpenko,Alex Tachard Passos,Alexander Neitz,Alexander Prokofiev,Alexander Wei,Allison Tam,Ally Bennett,Ananya Kumar,Andre Saraiva,Andrea Vallone,Andrew Duberstein,Andrew Kondrich,Andrey Mishchenko,Andy Applebaum,Angela Jiang,Ashvin Nair,Barret Zoph,Behrooz Ghorbani,Ben Rossen,Benjamin Sokolowsky,Boaz Barak,Bob McGrew,Borys Minaiev,Botao Hao,Bowen Baker,Brandon Houghton,Brandon McKinzie,Brydon Eastman,Camillo Lugaresi,Cary Bassin,Cary Hudson,Chak Ming Li,Charles de Bourcy,Chelsea Voss,Chen Shen,Chong Zhang,Chris Koch,Chris Orsinger,Christopher Hesse,Claudia Fischer,Clive Chan,Dan Roberts,Daniel Kappler,Daniel Levy,Daniel Selsam,David Dohan,David Farhi,David Mely,David Robinson,Dimitris Tsipras,Doug Li,Dragos Oprica,Eben Freeman,Eddie Zhang,Edmund Wong,Elizabeth Proehl,Enoch Cheung,Eric Mitchell,Eric Wallace,Erik Ritter,Evan Mays,Fan Wang,Felipe Petroski Such,Filippo Raso,Florencia Leoni,Foivos Tsimpourlas,Francis Song,Fred von Lohmann,Freddie Sulit,Geoff Salmon,Giambattista Parascandolo,Gildas Chabot,Grace Zhao,Greg Brockman,Guillaume Leclerc,Hadi Salman,Haiming Bao,Hao Sheng,Hart Andrin,Hessam Bagherinezhad,Hongyu Ren,Hunter Lightman,Hyung Won Chung,Ian Kivlichan,Ian O’Connell,Ian Osband,Ignasi Clavera Gilaberte,Ilge Akkaya
关键词: large-scale reinforcement learning, series is trained, trained with large-scale, large-scale reinforcement, reinforcement learning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

[AI-96] Large Language Models Compression via Low-Rank Feature Distillation

链接: https://arxiv.org/abs/2412.16719
作者: Yaya Sy,Christophe Cerisara,Irina Illina
关键词: Current LLM structured, LLM structured pruning, Current LLM, LLM structured, structured pruning methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 20 pages, 8 figures

点击查看摘要

Abstract:Current LLM structured pruning methods involve two steps: (1) compressing with calibration data and (2) continued pretraining on billions of tokens to recover the lost performance. This costly second step is needed as the first step significantly impacts performance. Previous studies have found that pretrained Transformer weights aren’t inherently low-rank, unlike their activations, which may explain this performance drop. Based on this observation, we introduce a one-shot compression method that locally distills low-rank weights. We accelerate convergence by initializing the low-rank weights with SVD and using a joint loss that combines teacher and student activations. We reduce memory requirements by applying local gradient updates only. Our approach can compress Mixtral-8x7B within minutes on a single A100 GPU, removing 10 billion parameters while maintaining over 95% of the original performance. Phi-2 3B can be compressed by 40% using only 13 million calibration tokens into a small model that competes with recent models of similar size. We show our method generalizes well to non-transformer architectures: Mamba-3B can be compressed by 20% while maintaining 99% of its performance.

[AI-97] FAP-CD: Fairness-Driven Age-Friendly Community Planning via Conditional Diffusion Generation

链接: https://arxiv.org/abs/2412.16699
作者: Jinlin Li,Xintong Li,Xiao Zhou
关键词: populations age rapidly, incorporating age-specific considerations, sustainable urban development, age-friendly built environments, global populations age
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As global populations age rapidly, incorporating age-specific considerations into urban planning has become essential to addressing the urgent demand for age-friendly built environments and ensuring sustainable urban development. However, current practices often overlook these considerations, resulting in inadequate and unevenly distributed elderly services in cities. There is a pressing need for equitable and optimized urban renewal strategies to support effective age-friendly planning. To address this challenge, we propose a novel framework, Fairness-driven Age-friendly community Planning via Conditional Diffusion generation (FAP-CD). FAP-CD leverages a conditioned graph denoising diffusion probabilistic model to learn the joint probability distribution of aging facilities and their spatial relationships at a fine-grained regional level. Our framework generates optimized facility distributions by iteratively refining noisy graphs, conditioned on the needs of the elderly during the diffusion process. Key innovations include a demand-fairness pre-training module that integrates community demand features and facility characteristics using an attention mechanism and min-max optimization, ensuring equitable service distribution across regions. Additionally, a discrete graph structure captures walkable accessibility within regional road networks, guiding model sampling. To enhance information integration, we design a graph denoising network with an attribute augmentation module and a hybrid graph message aggregation module, combining local and global node and edge information. Empirical results across multiple metrics demonstrate the effectiveness of FAP-CD in balancing age-friendly needs with regional equity, achieving an average improvement of 41% over competitive baseline models.

[AI-98] Formal Language Knowledge Corpus for Retrieval Augmented Generation

链接: https://arxiv.org/abs/2412.16689
作者: Majd Zayyad,Yossi Adi
关键词: integration of retrieval-augmented, retrieval-augmented techniques, shown promise, promise in improving, improving performance
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of retrieval-augmented techniques with LLMs has shown promise in improving performance across various domains. However, their utility in tasks requiring advanced reasoning, such as generating and evaluating mathematical statements and proofs, remains underexplored. This study explores the use of Lean, a programming language for writing mathematical proofs, to populate the knowledge corpus used by RAG systems. We hope for this to lay the foundation to exploring different methods of using RAGs to improve the performance of LLMs in advanced logical reasoning tasks.

[AI-99] Subgoal Discovery Using a Free Energy Paradigm and State Aggregations

链接: https://arxiv.org/abs/2412.16687
作者: Amirhossein Mesbah,Reshad Hosseini,Seyed Pooya Shariatpanahi,Majid Nili Ahmadabadi
关键词: Reinforcement learning, solving complex sequential, complex sequential decision-making, sequential decision-making tasks, role in solving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) plays a major role in solving complex sequential decision-making tasks. Hierarchical and goal-conditioned RL are promising methods for dealing with two major problems in RL, namely sample inefficiency and difficulties in reward shaping. These methods tackle the mentioned problems by decomposing a task into simpler subtasks and temporally abstracting a task in the action space. One of the key components for task decomposition of these methods is subgoal discovery. We can use the subgoal states to define hierarchies of actions and also use them in decomposing complex tasks. Under the assumption that subgoal states are more unpredictable, we propose a free energy paradigm to discover them. This is achieved by using free energy to select between two spaces, the main space and an aggregation space. The model ; changes from neighboring states to a given state shows the unpredictability of a given state, and therefore it is used in this paper for subgoal discovery. Our empirical results on navigation tasks like grid-world environments show that our proposed method can be applied for subgoal discovery without prior knowledge of the task. Our proposed method is also robust to the stochasticity of environments.

[AI-100] STAMPsy: Towards SpatioTemporal-Aware Mixed-Type Dialogues for Psychological Counseling

链接: https://arxiv.org/abs/2412.16674
作者: Jieyi Wang,Yue Huang,Zeming Liu,Dexuan Xu,Chuan Wang,Xiaoming Shi,Ruiyuan Guan,Hongxing Wang,Weihua Yue,Yu Huang
关键词: Online psychological counseling, traditional in-person therapy, Online psychological, counseling dialogue systems, dialogue
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Online psychological counseling dialogue systems are trending, offering a convenient and accessible alternative to traditional in-person therapy. However, existing psychological counseling dialogue systems mainly focus on basic empathetic dialogue or QA with minimal professional knowledge and without goal guidance. In many real-world counseling scenarios, clients often seek multi-type help, such as diagnosis, consultation, therapy, console, and common questions, but existing dialogue systems struggle to combine different dialogue types naturally. In this paper, we identify this challenge as how to construct mixed-type dialogue systems for psychological counseling that enable clients to clarify their goals before proceeding with counseling. To mitigate the challenge, we collect a mixed-type counseling dialogues corpus termed STAMPsy, covering five dialogue types, task-oriented dialogue for diagnosis, knowledge-grounded dialogue, conversational recommendation, empathetic dialogue, and question answering, over 5,000 conversations. Moreover, spatiotemporal-aware knowledge enables systems to have world awareness and has been proven to affect one’s mental health. Therefore, we link dialogues in STAMPsy to spatiotemporal state and propose a spatiotemporal-aware mixed-type psychological counseling dataset. Additionally, we build baselines on STAMPsy and develop an iterative self-feedback psychological dialogue generation framework, named Self-STAMPsy. Results indicate that clarifying dialogue goals in advance and utilizing spatiotemporal states are effective.

[AI-101] On Enhancing Network Throughput using Reinforcement Learning in Sliced Testbeds

链接: https://arxiv.org/abs/2412.16673
作者: Daniel Pereira Monteiro,Lucas Nardelli de Freitas Botelho Saar,Larissa Ferreira Rodrigues Moreira,Rodrigo Moreira
关键词: high reliability connectivity, pose significant challenges, slicing orchestration architectures, demand high throughput, applications demand high
类目: Artificial Intelligence (cs.AI)
*备注: Paper already published at Anais do XV Workshop de Pesquisa Experimental da Internet do Futuro (WPEIF)

点击查看摘要

Abstract:Novel applications demand high throughput, low latency, and high reliability connectivity and still pose significant challenges to slicing orchestration architectures. The literature explores network slicing techniques that employ canonical methods, artificial intelligence, and combinatorial optimization to address errors and ensure throughput for network slice data plane. This paper introduces the Enhanced Mobile Broadband (eMBB)-Agent as a new approach that uses Reinforcement Learning (RL) in a vertical application to enhance network slicing throughput to fit Service-Level Agreements (SLAs). The eMBB-Agent analyzes application transmission variables and proposes actions within a discrete space to adjust the reception window using a Deep Q-Network (DQN). This paper also presents experimental results that examine the impact of factors such as the channel error rate, DQN model layers, and learning rate on model convergence and achieved throughput, providing insights on embedding intelligence in network slicing.

[AI-102] Internalized Self-Correction for Large Language Models

链接: https://arxiv.org/abs/2412.16653
作者: Nishanth Upadhyaya,Raghavendra Sridharamurthy
关键词: large language models, Internalized Self-Correction, language models, large language, inference time
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this article, we introduce ‘Internalized Self-Correction’ (InSeC) for large language models (LLMs). While many approaches exist for self-reflection at inference time, we propose a novel method that combines ideas from negative sampling, self-reflection during training, and inference time. InSeC allows LLMs to correct themselves by introducing mistakes and their corresponding corrections during training, thereby converting the learning process into a true supervised learning task with both positive and negative examples. This approach can be extended to improve instruction following and correct hallucinations or incorrect sentences generated by LLMs.

[AI-103] meRAG: BOOSTING LLM Time Series Forecasting via Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2412.16643
作者: Silin Yang,Dong Wang,Haoqi Zheng,Ruochun Jin
关键词: time series forecasting, series forecasting LLMs, existing LLM-based solutions, LLM-based solutions require, solutions require excessive
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Although the rise of large language models (LLMs) has introduced new opportunities for time series forecasting, existing LLM-based solutions require excessive training and exhibit limited transferability. In view of these challenges, we propose TimeRAG, a framework that incorporates Retrieval-Augmented Generation (RAG) into time series forecasting LLMs, which constructs a time series knowledge base from historical sequences, retrieves reference sequences from the knowledge base that exhibit similar patterns to the query sequence measured by Dynamic Time Warping (DTW), and combines these reference sequences and the prediction query as a textual prompt to the time series forecasting LLM. Experiments on datasets from various domains show that the integration of RAG improved the prediction accuracy of the original model by 2.97% on average.

[AI-104] A Systems Thinking Approach to Algorithmic Fairness

链接: https://arxiv.org/abs/2412.16641
作者: Chris Lam
关键词: data generating process, encode prior knowledge, generating process, algorithmic fairness problem, encode prior
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: This paper will be submitted to the 2025 ACM FAccT conference for review

点击查看摘要

Abstract:Systems thinking provides us with a way to model the algorithmic fairness problem by allowing us to encode prior knowledge and assumptions about where we believe bias might exist in the data generating process. We can then model this using a series of causal graphs, enabling us to link AI/ML systems to politics and the law. By treating the fairness problem as a complex system, we can combine techniques from machine learning, causal inference, and system dynamics. Each of these analytical techniques is designed to capture different emergent aspects of fairness, allowing us to develop a deeper and more holistic view of the problem. This can help policymakers on both sides of the political aisle to understand the complex trade-offs that exist from different types of fairness policies, providing a blueprint for designing AI policy that is aligned to their political agendas.

[AI-105] POEX: Policy Executable Embodied AI Jailbreak Attacks

链接: https://arxiv.org/abs/2412.16633
作者: Xuancun Lu,Zhengxian Huang,Xinfeng Li,Xiaoyu ji,Wenyuan Xu
关键词: Embodied Artificial Intelligence, Artificial Intelligence, translate complex user, Embodied Artificial, Embodied
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Homepage: this https URL

点击查看摘要

Abstract:The integration of large language models (LLMs) into the planning module of Embodied Artificial Intelligence (Embodied AI) systems has greatly enhanced their ability to translate complex user instructions into executable policies. In this paper, we demystified how traditional LLM jailbreak attacks behave in the Embodied AI context. We conducted a comprehensive safety analysis of the LLM-based planning module of embodied AI systems against jailbreak attacks. Using the carefully crafted Harmful-RLbench, we accessed 20 open-source and proprietary LLMs under traditional jailbreak attacks, and highlighted two key challenges when adopting the prior jailbreak techniques to embodied AI contexts: (1) The harmful text output by LLMs does not necessarily induce harmful policies in Embodied AI context, and (2) even we can generate harmful policies, we have to guarantee they are executable in practice. To overcome those challenges, we propose Policy Executable (POEX) jailbreak attacks, where harmful instructions and optimized suffixes are injected into LLM-based planning modules, leading embodied AI to perform harmful actions in both simulated and physical environments. Our approach involves constraining adversarial suffixes to evade detection and fine-tuning a policy evaluater to improve the executability of harmful policies. We conducted extensive experiments on both a robotic arm embodied AI platform and simulators, to validate the attack and policy success rates on 136 harmful instructions from Harmful-RLbench. Our findings expose serious safety vulnerabilities in LLM-based planning modules, including the ability of POEX to be transferred across models. Finally, we propose mitigation strategies, such as safety-constrained prompts, pre- and post-planning checks, to address these vulnerabilities and ensure the safe deployment of embodied AI in real-world settings.

[AI-106] Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement

链接: https://arxiv.org/abs/2412.16626
作者: Junyu Wang,Zizhen Lin,Tianrui Wang,Meng Ge,Longbiao Wang,Jianwu Dang
关键词: recent speech enhancement, predominant methodologies, variants have emerged, low computational complexity, speech enhancement
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In recent speech enhancement (SE) research, transformer and its variants have emerged as the predominant methodologies. However, the quadratic complexity of the self-attention mechanism imposes certain limitations on practical deployment. Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision due to its strong capabilities in modeling long sequences and relatively low computational complexity. In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks. By leveraging bidirectional Mamba to model forward and backward dependencies of speech signals at different resolutions, and incorporating skip connections to capture multi-scale information, our approach achieves state-of-the-art (SOTA) performance. Experimental results on the VCTK+DEMAND dataset indicate that Mamba-SEUNet attains a PESQ score of 3.59, while maintaining low computational complexity. When combined with the Perceptual Contrast Stretching technique, Mamba-SEUNet further improves the PESQ score to 3.73.

[AI-107] Distributed Inference on Mobile Edge and Cloud: A Data-Cartography based Clustering Approach

链接: https://arxiv.org/abs/2412.16616
作者: Divya Jyoti Bajpai,Manjesh Kumar Hanawal
关键词: limited resources, IoT platforms, large size, poses a significant, significant challenge
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2410.05338

点击查看摘要

Abstract:The large size of DNNs poses a significant challenge for deployment on devices with limited resources, such as mobile, edge, and IoT platforms. To address this issue, a distributed inference framework can be utilized. In this framework, a small-scale DNN (initial layers) is deployed on mobile devices, a larger version on edge devices, and the full DNN on the cloud. Samples with low complexity (easy) can be processed on mobile, those with moderate complexity (medium) on edge devices, and high complexity (hard) samples on the cloud. Given that the complexity of each sample is unknown in advance, the crucial question in distributed inference is determining the sample complexity for appropriate DNN processing. We introduce a novel method named \our, which leverages the Data Cartography approach initially proposed for enhancing DNN generalization. By employing data cartography, we assess sample complexity. \our aims to boost accuracy while considering the offloading costs from mobile to edge/cloud. Our experimental results on GLUE datasets, covering a variety of NLP tasks, indicate that our approach significantly lowers inference costs by more than 43% while maintaining a minimal accuracy drop of less than 0.5% compared to performing all inferences on the cloud. The source code is available at this https URL.

[AI-108] Automated Classification of Cybercrime Complaints using Transformer-based Language Models for Hinglish Texts

链接: https://arxiv.org/abs/2412.16614
作者: Nanda Rani,Divyanshu Singh,Bikash Saha,Sandeep Kumar Shukla
关键词: complexity of multilingual, law enforcement, cybercrime complaint, Cybercrime Coordination Centre, cybercrime
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rise in cybercrime and the complexity of multilingual and code-mixed complaints present significant challenges for law enforcement and cybersecurity agencies. These organizations need automated, scalable methods to identify crime types, enabling efficient processing and prioritization of large complaint volumes. Manual triaging is inefficient, and traditional machine learning methods fail to capture the semantic and contextual nuances of textual cybercrime complaints. Moreover, the lack of publicly available datasets and privacy concerns hinder the research to present robust solutions. To address these challenges, we propose a framework for automated cybercrime complaint classification. The framework leverages Hinglish-adapted transformers, such as HingBERT and HingRoBERTa, to handle code-mixed inputs effectively. We employ the real-world dataset provided by Indian Cybercrime Coordination Centre (I4C) during CyberGuard AI Hackathon 2024. We employ GenAI open source model-based data augmentation method to address class imbalance. We also employ privacy-aware preprocessing to ensure compliance with ethical standards while maintaining data integrity. Our solution achieves significant performance improvements, with HingRoBERTa attaining an accuracy of 74.41% and an F1-score of 71.49%. We also develop ready-to-use tool by integrating Django REST backend with a modern frontend. The developed tool is scalable and ready for real-world deployment in platforms like the National Cyber Crime Reporting Portal. This work bridges critical gaps in cybercrime complaint management, offering a scalable, privacy-conscious, and adaptable solution for modern cybersecurity challenges.

[AI-109] Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning

链接: https://arxiv.org/abs/2412.16599
作者: Hang Yin,Zhifeng Lin,Xin Liu,Bin Sun,Kan Li
关键词: compass direction reasoning, Direction reasoning, compass direction, Direction, real world
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Direction reasoning is essential for intelligent systems to understand the real world. While existing work focuses primarily on spatial reasoning, compass direction reasoning remains underexplored. To address this, we propose the Compass Direction Reasoning (CDR) benchmark, designed to evaluate the direction reasoning capabilities of multimodal language models (MLMs). CDR includes three types images to test spatial (up, down, left, right) and compass (north, south, east, west) directions. Our evaluation reveals that most MLMs struggle with direction reasoning, often performing at random guessing levels. Experiments show that training directly with CDR data yields limited improvements, as it requires an understanding of real-world physical rules. We explore the impact of mixdata and CoT fine-tuning methods, which significantly enhance MLM performance in compass direction reasoning by incorporating diverse data and step-by-step reasoning, improving the model’s ability to understand direction relationships.

[AI-110] AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

链接: https://arxiv.org/abs/2412.16594
作者: Basak Demirok,Mucahid Kutlu
关键词: code, rapid advancement, generating code, problem, problem descriptions
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rapid advancement of LLM models, they have become widely useful in various fields. While these AI systems can be used for code generation, significantly simplifying and accelerating the tasks of developers, their use for students to do assignments has raised ethical questions in the field of education. In this context, determining the author of a particular code becomes important. In this study, we introduce AIGCodeSet, a dataset for AI-generated code detection tasks, specifically for the Python programming language. We obtain the problem descriptions and human-written codes from the CodeNet dataset. Using the problem descriptions, we generate AI-written codes with CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash models in three approaches: i) generating code from the problem description alone, ii) generating code using the description along with human-written source code containing runtime errors, and iii) generating code using the problem description and human-written code that resulted in wrong answers. Lastly, we conducted a post-processing step to eliminate LLM output irrelevant to code snippets. Overall, AIGCodeSet consists of 2,828 AI-generated and 4,755 human-written code snippets. We share our code with the research community to support studies on this important topic and provide performance results for baseline AI-generated code detection methods.

[AI-111] Effective and Efficient Representation Learning for Flight Trajectories AAAI2025

链接: https://arxiv.org/abs/2412.16581
作者: Shuo Liu,Wenbin Li,Di Yao,Jingping Bi
关键词: traffic management community, trajectory data plays, management community, data plays, plays a vital
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Flight trajectory data plays a vital role in the traffic management community, especially for downstream tasks such as trajectory prediction, flight recognition, and anomaly detection. Existing works often utilize handcrafted features and design models for different tasks individually, which heavily rely on domain expertise and are hard to extend. We argue that different flight analysis tasks share the same useful features of the trajectory. Jointly learning a unified representation for flight trajectories could be beneficial for improving the performance of various tasks. However, flight trajectory representation learning (TRL) faces two primary challenges, \ie unbalanced behavior density and 3D spatial continuity, which disable recent general TRL methods. In this paper, we propose Flight2Vec , a flight-specific representation learning method to address these challenges. Specifically, a behavior-adaptive patching mechanism is used to inspire the learned representation to pay more attention to behavior-dense segments. Moreover, we introduce a motion trend learning technique that guides the model to memorize not only the precise locations, but also the motion trend to generate better representations. Extensive experimental results demonstrate that Flight2Vec significantly improves performance in downstream tasks such as flight trajectory prediction, flight recognition, and anomaly detection.

[AI-112] Breaking the Context Bottleneck on Long Time Series Forecasting

链接: https://arxiv.org/abs/2412.16572
作者: Chao Ma,Yikai Hou,Xiang Li,Yinggang Sun,Haining Yu,Zhou Fang,Jiaxing Qu
关键词: decision-making in economics, essential for planning, planning and decision-making, Long-term time-series forecasting, long foresight
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Time series forecasting algorithm based on multi-scale analysis

点击查看摘要

Abstract:Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation, where long foresight is required. To obtain such long foresight, models must be both efficient and effective in processing long sequence. Recent advancements have enhanced the efficiency of these models; however, the challenge of effectively leveraging longer sequences persists. This is primarily due to the tendency of these models to overfit when presented with extended inputs, necessitating the use of shorter input lengths to maintain tolerable error margins. In this work, we investigate the multiscale modeling method and propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences. We demonstrate that by decoupling patterns at different scales in time series, we can enhance predictability by reducing non-stationarity, improve efficiency through a compact long input representation, and simplify the architecture by providing clear task assignments. Experimental results demonstrate that LDM not only outperforms all baselines in long-term forecasting benchmarks, but also reducing both training time and memory costs.

[AI-113] Learning for Cross-Layer Resource Allocation in MEC-Aided Cell-Free Networks

链接: https://arxiv.org/abs/2412.16565
作者: Chong Zheng,Shiwen He,Yongming Huang,Tony Q. S. Quek
关键词: mobile edge computing, Cross-layer resource allocation, aided cell-free networks, computing resources, Cross-layer resource
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cross-layer resource allocation over mobile edge computing (MEC)-aided cell-free networks can sufficiently exploit the transmitting and computing resources to promote the data rate. However, the technical bottlenecks of traditional methods pose significant challenges to cross-layer optimization. In this paper, joint subcarrier allocation and beamforming optimization are investigated for the MEC-aided cell-free network from the perspective of deep learning to maximize the weighted sum rate. Specifically, we convert the underlying problem into a joint multi-task optimization problem and then propose a centralized multi-task self-supervised learning algorithm to solve the problem so as to avoid costly manual labeling. Therein, two novel and general loss functions, i.e., negative fraction linear loss and exponential linear loss whose advantages in robustness and target domain have been proved and discussed, are designed to enable self-supervised learning. Moreover, we further design a MEC-enabled distributed multi-task self-supervised learning (DMTSSL) algorithm, with low complexity and high scalability to address the challenge of dimensional disaster. Finally, we develop the distance-aware transfer learning algorithm based on the DMTSSL algorithm to handle the dynamic scenario with negligible computation cost. Simulation results under 3 rd generation partnership project 38.901 urban-macrocell scenario demonstrate the superiority of the proposed algorithms over the baseline algorithms.

[AI-114] Predictive Monitoring of Black-Box Dynamical Systems

链接: https://arxiv.org/abs/2412.16564
作者: Thomas A. Henzinger,Fabian Kresse,Kaushik Mallik,Emily Yu,Đorđe Žikelić
关键词: quantitative safety properties, black-box dynamical systems, study the problem, predictive runtime monitoring, states
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注: Submitted to L4DC 2025

点击查看摘要

Abstract:We study the problem of predictive runtime monitoring of black-box dynamical systems with quantitative safety properties. The black-box setting stipulates that the exact semantics of the dynamical system and the controller are unknown, and that we are only able to observe the state of the controlled (aka, closed-loop) system at finitely many time points. We present a novel framework for predicting future states of the system based on the states observed in the past. The numbers of past states and of predicted future states are parameters provided by the user. Our method is based on a combination of Taylor’s expansion and the backward difference operator for numerical differentiation. We also derive an upper bound on the prediction error under the assumption that the system dynamics and the controller are smooth. The predicted states are then used to predict safety violations ahead in time. Our experiments demonstrate practical applicability of our method for complex black-box systems, showing that it is computationally lightweight and yet significantly more accurate than the state-of-the-art predictive safety monitoring techniques.

[AI-115] Metagoals Endowing Self-Modifying AGI Systems with Goal Stability or Moderated Goal Evolution: Toward a Formally Sound and Practical Approach

链接: https://arxiv.org/abs/2412.16559
作者: Ben Goertzel
关键词: Contraction Mapping Theorem, aimed to guide, creating AGI systems, Toggle, goal-stability metagoals aimed
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We articulate here a series of specific metagoals designed to address the challenge of creating AGI systems that possess the ability to flexibly self-modify yet also have the propensity to maintain key invariant properties of their goal systems 1) a series of goal-stability metagoals aimed to guide a system to a condition in which goal-stability is compatible with reasonably flexible self-modification 2) a series of moderated-goal-evolution metagoals aimed to guide a system to a condition in which control of the pace of goal evolution is compatible with reasonably flexible self-modification The formulation of the metagoals is founded on fixed-point theorems from functional analysis, e.g. the Contraction Mapping Theorem and constructive approximations to Schauder’s Theorem, applied to probabilistic models of system behavior We present an argument that the balancing of self-modification with maintenance of goal invariants will often have other interesting cognitive side-effects such as a high degree of self understanding Finally we argue for the practical value of a hybrid metagoal combining moderated-goal-evolution with pursuit of goal-stability – along with potentially other metagoals relating to goal-satisfaction, survival and ongoing development – in a flexible fashion depending on the situation Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2412.16559 [cs.AI] (or arXiv:2412.16559v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.16559 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Benjamin Goertzel [view email] [v1] Sat, 21 Dec 2024 09:57:13 UTC (35 KB) Full-text links: Access Paper: View a PDF of the paper titled Metagoals Endowing Self-Modifying AGI Systems with Goal Stability or Moderated Goal Evolution: Toward a Formally Sound and Practical Approach, by Ben GoertzelView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.AI prev | next new | recent | 2024-12 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[AI-116] CognTKE: A Cognitive Temporal Knowledge Extrapolation Framework AAAI2025

链接: https://arxiv.org/abs/2412.16557
作者: Wei Chen,Yuting Wu,Shuhan Wu,Zhiyu Zhang,Mengqi Liao,Youfang Lin,Huaiyu Wan
关键词: future unknowable facts, Reasoning future unknowable, holding significant academic, temporal knowledge graphs, challenging task
类目: Artificial Intelligence (cs.AI)
*备注: AAAI2025 Accept, 12 pages, 9 figures

点击查看摘要

Abstract:Reasoning future unknowable facts on temporal knowledge graphs (TKGs) is a challenging task, holding significant academic and practical values for various fields. Existing studies exploring explainable reasoning concentrate on modeling comprehensible temporal paths relevant to the query. Yet, these path-based methods primarily focus on local temporal paths appearing in recent times, failing to capture the complex temporal paths in TKG and resulting in the loss of longer historical relations related to the query. Motivated by the Dual Process Theory in cognitive science, we propose a \textbfCognitive \textbfTemporal \textbfKnowledge \textbfExtrapolation framework (CognTKE), which introduces a novel temporal cognitive relation directed graph (TCR-Digraph) and performs interpretable global shallow reasoning and local deep reasoning over the TCR-Digraph. Specifically, the proposed TCR-Digraph is constituted by retrieving significant local and global historical temporal relation paths associated with the query. In addition, CognTKE presents the global shallow reasoner and the local deep reasoner to perform global one-hop temporal relation reasoning (System 1) and local complex multi-hop path reasoning (System 2) over the TCR-Digraph, respectively. The experimental results on four benchmark datasets demonstrate that CognTKE achieves significant improvement in accuracy compared to the state-of-the-art baselines and delivers excellent zero-shot reasoning ability. \textitThe code is available at this https URL.

[AI-117] ActPC-Chem: Discrete Active Predictive Coding for Goal-Guided Algorithmic Chemistry as a Potential Cognitive Kernel for Hyperon PRIMUS-Based AGI

链接: https://arxiv.org/abs/2412.16547
作者: Ben Goertzel
关键词: goal-guided artificial intelligence, Active Predictive Coding, Discrete Active Predictive, Active Predictive, Discrete Active
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We explore a novel paradigm (labeled ActPC-Chem) for biologically inspired, goal-guided artificial intelligence (AI) centered on a form of Discrete Active Predictive Coding (ActPC) operating within an algorithmic chemistry of rewrite rules. ActPC-Chem is envisioned as a foundational “cognitive kernel” for advanced cognitive architectures, such as the OpenCog Hyperon system, incorporating essential elements of the PRIMUS cognitive architecture. The central thesis is that general-intelligence-capable cognitive structures and dynamics can emerge in a system where both data and models are represented as evolving patterns of metagraph rewrite rules, and where prediction errors, intrinsic and extrinsic rewards, and semantic constraints guide the continual reorganization and refinement of these rules. Using a virtual “robot bug” thought experiment, we illustrate how such a system might self-organize to handle challenging tasks involving delayed and context-dependent rewards, integrating causal rule inference (AIRIS) and probabilistic logical abstraction (PLN) to discover and exploit conceptual patterns and causal constraints. Next, we describe how continuous predictive coding neural networks, which excel at handling noisy sensory data and motor control signals, can be coherently merged with the discrete ActPC substrate. Finally, we outline how these ideas might be extended to create a transformer-like architecture that foregoes traditional backpropagation in favor of rule-based transformations guided by ActPC. This layered architecture, supplemented with AIRIS and PLN, promises structured, multi-modal, and logically consistent next-token predictions and narrative sequences.

[AI-118] Mathematics and Machine Creativity: A Survey on Bridging Mathematics with AI

链接: https://arxiv.org/abs/2412.16543
作者: Shizhe Liang,Wei Zhang,Tianyang Zhong
关键词: mathematical research, highlighting the transformative, presents a comprehensive, transformative role, begun to play
类目: Artificial Intelligence (cs.AI)
*备注: 26 pages, 3 figures

点击查看摘要

Abstract:This paper presents a comprehensive survey on the applications of artificial intelligence (AI) in mathematical research, highlighting the transformative role AI has begun to play in this domain. Traditionally, AI advancements have heavily relied on theoretical foundations from fields like mathematics and statistics. However, recent developments in AI, particularly in reinforcement learning (RL) and large language models (LLMs), have demonstrated the potential for AI to contribute back to mathematics, offering flexible algorithmic frameworks and powerful inductive reasoning capabilities that support various aspects of mathematical research. This survey aims to establish a bridge between AI and mathematics, providing insights into the mutual benefits and fostering deeper interdisciplinary understanding. In particular, we argue that while current AI and LLMs may struggle with complex deductive reasoning, their inherent creativity holds significant potential to support and inspire mathematical research. This creative capability, often overlooked, could be the key to unlocking new perspectives and methodologies in mathematics. Furthermore, we address the lack of cross-disciplinary communication: mathematicians may not fully comprehend the latest advances in AI, while AI researchers frequently prioritize benchmarks and standardized testing over AI’s applications in frontier mathematical research. This paper seeks to close that gap, offering a detailed exploration of AI’s basic knowledge, its strengths, and its emerging applications in the mathematical sciences. Comments: 26 pages, 3 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2412.16543 [cs.AI] (or arXiv:2412.16543v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.16543 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-119] owards Environmentally Equitable AI

链接: https://arxiv.org/abs/2412.16539
作者: Mohammad Hajiesmaili,Shaolei Ren,Ramesh K. Sitaraman,Adam Wierman
关键词: deployed power-hungry servers, globally deployed power-hungry, artificial intelligence, power-hungry servers, skyrocketing demand
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted by Communications of the ACM. All the authors contributed equally and are listed in alphabetical order of last name

点击查看摘要

Abstract:The skyrocketing demand for artificial intelligence (AI) has created an enormous appetite for globally deployed power-hungry servers. As a result, the environmental footprint of AI systems has come under increasing scrutiny. More crucially, the current way that we exploit AI workloads’ flexibility and manage AI systems can lead to wildly different environmental impacts across locations, increasingly raising environmental inequity concerns and creating unintended sociotechnical consequences. In this paper, we advocate environmental equity as a priority for the management of future AI systems, advancing the boundaries of existing resource management for sustainable AI and also adding a unique dimension to AI fairness. Concretely, we uncover the potential of equity-aware geographical load balancing to fairly re-distribute the environmental cost across different regions, followed by algorithmic challenges. We conclude by discussing a few future directions to exploit the full potential of system management approaches to mitigate AI’s environmental inequity.

[AI-120] From Creation to Curriculum: Examining the role of generative AI in Arts Universities

链接: https://arxiv.org/abs/2412.16531
作者: Atticus Sims(Faculty of Media Creation, Kyoto Seika University)
关键词: Artificial Intelligence, age of Artificial, prior iterations, Intelligence, Artificial
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 17 pages, 5 figures. Based on workshops conducted in July 2023 at Kyoto Seika University

点击查看摘要

Abstract:The age of Artificial Intelligence (AI) is marked by its transformative “generative” capabilities, distinguishing it from prior iterations. This burgeoning characteristic of AI has enabled it to produce new and original content, inherently showcasing its creative prowess. This shift challenges and requires a recalibration in the realm of arts education, urging a departure from established pedagogies centered on human-driven image creation. The paper meticulously addresses the integration of AI tools, with a spotlight on Stable Diffusion (SD), into university arts curricula. Drawing from practical insights gathered from workshops conducted in July 2023, which culminated in an exhibition of AI-driven artworks, the paper aims to provide a roadmap for seamlessly infusing these tools into academic settings. Given their recent emergence, the paper delves into a comprehensive overview of such tools, emphasizing the intricate dance between artists, developers, and researchers in the open-source AI art world. This discourse extends to the challenges and imperatives faced by educational institutions. It presents a compelling case for the swift adoption of these avant-garde tools, underscoring the paramount importance of equipping students with the competencies required to thrive in an AI-augmented artistic landscape.

[AI-121] VSFormer: Value and Shape-Aware Transformer with Prior-Enhanced Self-Attention for Multivariate Time Series Classification

链接: https://arxiv.org/abs/2412.16515
作者: Wenjie Xi,Rundong Zuo,Alejandro Alvarez,Jie Zhang,Byron Choi,Jessica Lin
关键词: attracting growing research, Multivariate time series, growing research interest, research interest due, Multivariate time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multivariate time series classification is a crucial task in data mining, attracting growing research interest due to its broad applications. While many existing methods focus on discovering discriminative patterns in time series, real-world data does not always present such patterns, and sometimes raw numerical values can also serve as discriminative features. Additionally, the recent success of Transformer models has inspired many studies. However, when applying to time series classification, the self-attention mechanisms in Transformer models could introduce classification-irrelevant features, thereby compromising accuracy. To address these challenges, we propose a novel method, VSFormer, that incorporates both discriminative patterns (shape) and numerical information (value). In addition, we extract class-specific prior information derived from supervised information to enrich the positional encoding and provide classification-oriented self-attention learning, thereby enhancing its effectiveness. Extensive experiments on all 30 UEA archived datasets demonstrate the superior performance of our method compared to SOTA models. Through ablation studies, we demonstrate the effectiveness of the improved encoding layer and the proposed self-attention mechanism. Finally, We provide a case study on a real-world time series dataset without discriminative patterns to interpret our model.

[AI-122] Privacy in Fine-tuning Large Language Models : Attacks Defenses and Future Directions

链接: https://arxiv.org/abs/2412.16504
作者: Hao Du,Shang Liu,Lele Zheng,Yang Cao,Atsuyoshi Nakamura,Lei Chen
关键词: leveraging Large Language, Large Language Models, Large Language, specific downstream tasks, leveraging Large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-tuning has emerged as a critical process in leveraging Large Language Models (LLMs) for specific downstream tasks, enabling these models to achieve state-of-the-art performance across various domains. However, the fine-tuning process often involves sensitive datasets, introducing privacy risks that exploit the unique characteristics of this stage. In this paper, we provide a comprehensive survey of privacy challenges associated with fine-tuning LLMs, highlighting vulnerabilities to various privacy attacks, including membership inference, data extraction, and backdoor attacks. We further review defense mechanisms designed to mitigate privacy risks in the fine-tuning phase, such as differential privacy, federated learning, and knowledge unlearning, discussing their effectiveness and limitations in addressing privacy risks and maintaining model utility. By identifying key gaps in existing research, we highlight challenges and propose directions to advance the development of privacy-preserving methods for fine-tuning LLMs, promoting their responsible use in diverse applications.

[AI-123] Deep Reinforcement Learning Based Systems for Safety Critical Applications in Aerospace

链接: https://arxiv.org/abs/2412.16489
作者: Abedin Sherifi
关键词: demonstrated substantial growth, High Performance Computing, Recent advancements, artificial intelligence, substantial growth
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in artificial intelligence (AI) applications within aerospace have demonstrated substantial growth, particularly in the context of control systems. As High Performance Computing (HPC) platforms continue to evolve, they are expected to replace current flight control or engine control computers, enabling increased computational capabilities. This shift will allow real-time AI applications, such as image processing and defect detection, to be seamlessly integrated into monitoring systems, providing real-time awareness and enhanced fault detection and accommodation. Furthermore, AI’s potential in aerospace extends to control systems, where its application can range from full autonomy to enhancing human control through assistive features. AI, particularly deep reinforcement learning (DRL), can offer significant improvements in control systems, whether for autonomous operation or as an augmentative tool.

[AI-124] When Can Proxies Improve the Sample Complexity of Preference Learning?

链接: https://arxiv.org/abs/2412.16475
作者: Yuchen Zhu,Daniel Augusto de Souza,Zhengyan Shi,Mengyue Yang,Pasquale Minervini,Alexander D’Amour,Matt J. Kusner
关键词: Large Language Models, address the problem, necessarily increase, Large Language, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We address the problem of reward hacking, where maximising a proxy reward does not necessarily increase the true reward. This is a key concern for Large Language Models (LLMs), as they are often fine-tuned on human preferences that may not accurately reflect a true objective. Existing work uses various tricks such as regularisation, tweaks to the reward model, and reward hacking detectors, to limit the influence that such proxy preferences have on a model. Luckily, in many contexts such as medicine, education, and law, a sparse amount of expert data is often available. In these cases, it is often unclear whether the addition of proxy data can improve policy learning. We outline a set of sufficient conditions on proxy feedback that, if satisfied, indicate that proxy data can provably improve the sample complexity of learning the ground truth policy. These conditions can inform the data collection process for specific tasks. The result implies a parameterisation for LLMs that achieves this improved sample complexity. We detail how one can adapt existing architectures to yield this improved sample complexity.

[AI-125] he Evolving Usage of GenAI by Computing Students

链接: https://arxiv.org/abs/2412.16453
作者: Irene Hou,Hannah Vy Nguyen,Owen Man,Stephen MacNeil
关键词: critical aspect, aspect of learning, learning and problem-solving, students, GenAI
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 2 pages, 1 figure, to be published in SIGCSE 2025

点击查看摘要

Abstract:Help-seeking is a critical aspect of learning and problem-solving for computing students. Recent research has shown that many students are aware of generative AI (GenAI) tools; however, there are gaps in the extent and effectiveness of how students use them. With over two years of widespread GenAI usage, it is crucial to understand whether students’ help-seeking behaviors with these tools have evolved and how. This paper presents findings from a repeated cross-sectional survey conducted among computing students across North American universities (n=95). Our results indicate shifts in GenAI usage patterns. In 2023, 34.1% of students (n=47) reported never using ChatGPT for help, ranking it fourth after online searches, peer support, and class forums. By 2024, this figure dropped sharply to 6.3% (n=48), with ChatGPT nearly matching online search as the most commonly used help resource. Despite this growing prevalence, there has been a decline in students’ hourly and daily usage of GenAI tools, which may be attributed to a common tendency to underestimate usage frequency. These findings offer new insights into the evolving role of GenAI in computing education, highlighting its increasing acceptance and solidifying its position as a key help resource.

[AI-126] A Generalizable Anomaly Detection Method in Dynamic Graphs

链接: https://arxiv.org/abs/2412.16447
作者: Xiao Yang,Xuejiao Zhao,Zhiqi Shen
关键词: Anomaly detection aims, aims to identify, identify deviations, deviations from normal, normal patterns
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Anomaly detection aims to identify deviations from normal patterns within data. This task is particularly crucial in dynamic graphs, which are common in applications like social networks and cybersecurity, due to their evolving structures and complex relationships. Although recent deep learning-based methods have shown promising results in anomaly detection on dynamic graphs, they often lack of generalizability. In this study, we propose GeneralDyG, a method that samples temporal ego-graphs and sequentially extracts structural and temporal features to address the three key challenges in achieving generalizability: Data Diversity, Dynamic Feature Capture, and Computational Cost. Extensive experimental results demonstrate that our proposed GeneralDyG significantly outperforms state-of-the-art methods on four real-world datasets.

[AI-127] Has LLM Reached the Scaling Ceiling Yet? Unified Insights into LLM Regularities and Constraints

链接: https://arxiv.org/abs/2412.16443
作者: Charles Luo
关键词: Large Language Models, demonstrated remarkable capabilities, Central Limit Theorem, Large Language, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their scalability raises a critical question: Have we reached the scaling ceiling? This paper addresses this pivotal question by developing a unified theoretical framework that integrates mathematical and statistical insights to explain the scaling dynamics of LLMs. We present: 1. Central Limit Theorem (CLT) for Hidden Representations: We show that noise in hidden representations scales inversely with context size, explaining stabilization effects and the limits of context length improvements. 2. Bias-Variance Decomposition: We decompose next-token prediction loss into irreducible entropy, capacity-driven bias, and finite sample variance, revealing trade-offs where scaling yields diminishing returns. 3. Emergent SNR Thresholds: By defining signal-to-noise ratio (SNR), we quantify how capabilities emerge abruptly once SNR surpasses a threshold, offering insights into when scaling becomes less effective. Through this framework, we conclude that while LLMs have not reached an absolute scaling ceiling, practical constraints are increasingly prominent: diminishing returns, resource inefficiencies, and data limitations. Future progress will require a shift from brute-force scaling to innovations in architecture, data quality, and training paradigms. This work provides a roadmap for guiding the efficient development of next-generation LLMs and advancing the field beyond traditional scaling strategies. Keywords: Large Language Models; Scaling Ceiling; Central Limit Theorem; Bias-Variance Trade-Off; Signal-to-Noise Ratio; Emergent Capabilities Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2412.16443 [cs.LG] (or arXiv:2412.16443v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2412.16443 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-128] Learning Cross-Task Generalities Across Graphs via Task-trees

链接: https://arxiv.org/abs/2412.16441
作者: Zehong Wang,Zheyuan Zhang,Tianyi Ma,Nitesh V Chawla,Chuxu Zhang,Yanfang Ye
关键词: capture shared patterns, Foundation models aim, patterns or concepts, edges in images, sentences in text
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Foundation models aim to create general, cross-task, and cross-domain machine learning models by pretraining on large-scale datasets to capture shared patterns or concepts (generalities), such as contours, colors, textures, and edges in images, or tokens, words, and sentences in text. However, discovering generalities across graphs remains challenging, which has hindered the development of graph foundation models. To tackle this challenge, in this paper, we propose a novel approach to learn generalities across graphs via task-trees. Specifically, we first define the basic learning instances in graphs as task-trees and assume that the generalities shared across graphs are, at least partially, preserved in the task-trees of the given graphs. To validate the assumption, we first perform a theoretical analysis of task-trees in terms of stability, transferability, and generalization. We find that if a graph neural network (GNN) model is pretrained on diverse task-trees through a reconstruction task, it can learn sufficient transferable knowledge for downstream tasks using an appropriate set of fine-tuning samples. To empirically validate the assumption, we further instantiate the theorems by developing a cross-task, cross-domain graph foundation model named Graph generality Identifier on task-Trees (GIT). The extensive experiments over 30 graphs from five domains demonstrate the effectiveness of GIT in fine-tuning, in-context learning, and zero-shot learning scenarios. Particularly, the general GIT model pretrained on large-scale datasets can be quickly adapted to specific domains, matching or even surpassing expert models designed for those domains. Our data and code are available at this https URL.

[AI-129] LearnLM: Improving Gemini for Learning

链接: https://arxiv.org/abs/2412.16429
作者: LearnLM Team Google:Abhinit Modi,Aditya Srikanth Veerubhotla,Aliya Rysbek,Andrea Huber,Brett Wiltshire,Brian Veprek,Daniel Gillick,Daniel Kasenberg,Derek Ahmed,Irina Jurenka,James Cohan,Jennifer She,Julia Wilkowski,Kaiz Alarakyia,Kevin McKee,Lisa Wang,Markus Kunesch,Mike Schaekermann,Miruna Pîslar,Nikhil Joshi,Parsa Mahmoudieh,Paul Jhun,Sara Wiltberger,Shakir Mohamed,Shashank Agarwal,Shubham Milind Phal,Sun Jae Lee,Theofilos Strinopoulos,Wei-Jen Ko,Amy Wang,Ankit Anand,Avishkar Bhoopchand,Dan Wild,Divya Pandya,Filip Bar,Garth Graham,Holger Winnemoeller,Mahvish Nagda,Prateek Kolhar,Renee Schneider,Shaojian Zhu,Stephanie Chan,Steve Yadlowsky,Viknesh Sounderajah,Yannis Assael
关键词: Today generative, information by default, engage users, users in service, human tutor
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Today’s generative AI systems are tuned to present information by default rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of \textitpedagogical instruction following, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning – by enabling the addition of our pedagogical data to post-training mixtures – alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that is preferred substantially by expert raters across a diverse set of learning scenarios, with average preference strengths of 31% over GPT-4o, 11% over Claude 3.5, and 13% over the Gemini 1.5 Pro model LearnLM was based on.

[AI-130] Knowledge as a Breaking of Ergodicity

链接: https://arxiv.org/abs/2412.16411
作者: Yang He,Vassiliy Lubchenko
关键词: generative model defined, generative model, multiple generative models, generative model computationally-manageable, construct a thermodynamic
类目: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computational Complexity (cs.CC); Machine Learning (stat.ML)
*备注: 51 pages, 12 figures, accepted to Neural Computation

点击查看摘要

Abstract:We construct a thermodynamic potential that can guide training of a generative model defined on a set of binary degrees of freedom. We argue that upon reduction in description, so as to make the generative model computationally-manageable, the potential develops multiple minima. This is mirrored by the emergence of multiple minima in the free energy proper of the generative model itself. The variety of training samples that employ N binary degrees of freedom is ordinarily much lower than the size 2^N of the full phase space. The non-represented configurations, we argue, should be thought of as comprising a high-temperature phase separated by an extensive energy gap from the configurations composing the training set. Thus, training amounts to sampling a free energy surface in the form of a library of distinct bound states, each of which breaks ergodicity. The ergodicity breaking prevents escape into the near continuum of states comprising the high-temperature phase; thus it is necessary for proper functionality. It may however have the side effect of limiting access to patterns that were underrepresented in the training set. At the same time, the ergodicity breaking within the library complicates both learning and retrieval. As a remedy, one may concurrently employ multiple generative models – up to one model per free energy minimum.

[AI-131] Learning Disease Progression Models That Capture Health Disparities

链接: https://arxiv.org/abs/2412.16406
作者: Erica Chiang,Divya Shanmugam,Ashley N. Beecy,Gabriel Sayer,Nir Uriel,Deborah Estrin,Nikhil Garg,Emma Pierson
关键词: Disease progression, Bayesian disease progression, inform the diagnosis, diagnosis and treatment, Disease progression models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Disease progression models are widely used to inform the diagnosis and treatment of many progressive diseases. However, a significant limitation of existing models is that they do not account for health disparities that can bias the observed data. To address this, we develop an interpretable Bayesian disease progression model that captures three key health disparities: certain patient populations may (1) start receiving care only when their disease is more severe, (2) experience faster disease progression even while receiving care, or (3) receive follow-up care less frequently conditional on disease severity. We show theoretically and empirically that failing to account for disparities produces biased estimates of severity (underestimating severity for disadvantaged groups, for example). On a dataset of heart failure patients, we show that our model can identify groups that face each type of health disparity, and that accounting for these disparities meaningfully shifts which patients are considered high-risk.

[AI-132] Autonomous Option Invention for Continual Hierarchical Reinforcement Learning and Planning

链接: https://arxiv.org/abs/2412.16395
作者: Rashmeet Kaur Nayyar,Siddharth Srivastava
关键词: scaling up reinforcement, reinforcement learning, options, autonomously learning abstract, learning abstract state
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Abstraction is key to scaling up reinforcement learning (RL). However, autonomously learning abstract state and action representations to enable transfer and generalization remains a challenging open problem. This paper presents a novel approach for inventing, representing, and utilizing options, which represent temporally extended behaviors, in continual RL settings. Our approach addresses streams of stochastic problems characterized by long horizons, sparse rewards, and unknown transition and reward functions. Our approach continually learns and maintains an interpretable state abstraction, and uses it to invent high-level options with abstract symbolic representations. These options meet three key desiderata: (1) composability for solving tasks effectively with lookahead planning, (2) reusability across problem instances for minimizing the need for relearning, and (3) mutual independence for reducing interference among options. Our main contributions are approaches for continually learning transferable, generalizable options with symbolic representations, and for integrating search techniques with RL to efficiently plan over these learned options to solve new problems. Empirical results demonstrate that the resulting approach effectively learns and transfers abstract knowledge across problem instances, achieving superior sample efficiency compared to state-of-the-art methods. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2412.16395 [cs.AI] (or arXiv:2412.16395v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2412.16395 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-133] Ethics and Technical Aspects of Generative AI Models in Digital Content Creation

链接: https://arxiv.org/abs/2412.16389
作者: Atahan Karagoz
关键词: offering industries tools, digital content creation, reshaping digital content, offering industries, reshaping digital
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative AI models like GPT-4o and DALL-E 3 are reshaping digital content creation, offering industries tools to generate diverse and sophisticated text and images with remarkable creativity and efficiency. This paper examines both the capabilities and challenges of these models within creative workflows. While they deliver high performance in generating content with creativity, diversity, and technical precision, they also raise significant ethical concerns. Our study addresses two key research questions: (a) how these models perform in terms of creativity, diversity, accuracy, and computational efficiency, and (b) the ethical risks they present, particularly concerning bias, authenticity, and potential misuse. Through a structured series of experiments, we analyze their technical performance and assess the ethical implications of their outputs, revealing that although generative models enhance creative processes, they often reflect biases from their training data and carry ethical vulnerabilities that require careful oversight. This research proposes ethical guidelines to support responsible AI integration into industry practices, fostering a balance between innovation and ethical integrity.

[AI-134] Collision-based Dynamics for Multi-Marginal Optimal Transport

链接: https://arxiv.org/abs/2412.16385
作者: Mohsen Sadr,Hossein Gorji
关键词: Monte Carlo solution, Carlo solution algorithm, multi-marginal optimal transport, optimal transport problem, randomized pairwise swapping
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Inspired by the Boltzmann kinetics, we propose a collision-based dynamics with a Monte Carlo solution algorithm that approximates the solution of the multi-marginal optimal transport problem via randomized pairwise swapping of sample indices. The computational complexity and memory usage of the proposed method scale linearly with the number of samples, making it highly attractive for high-dimensional settings. In several examples, we demonstrate the efficiency of the proposed method compared to the state-of-the-art methods.

[AI-135] Iterative Encoding-Decoding VAEs Anomaly Detection in NOAAs DART Time Series: A Machine Learning Approach for Enhancing Data Integrity for NASAs GRACE-FO Verification and Validation

链接: https://arxiv.org/abs/2412.16375
作者: Kevin Lee
关键词: NOAA Deep-ocean Assessment, NOAA Deep-ocean, Deep-ocean Assessment, Assessment and Reporting, Iterative Encoding-Decoding VAEs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
*备注: Preprint

点击查看摘要

Abstract:NOAA’s Deep-ocean Assessment and Reporting of Tsunamis (DART) data are critical for NASA-JPL’s tsunami detection, real-time operations, and oceanographic research. However, these time-series data often contain spikes, steps, and drifts that degrade data quality and obscure essential oceanographic features. To address these anomalies, the work introduces an Iterative Encoding-Decoding Variational Autoencoders (Iterative Encoding-Decoding VAEs) model to improve the quality of DART time series. Unlike traditional filtering and thresholding methods that risk distorting inherent signal characteristics, Iterative Encoding-Decoding VAEs progressively remove anomalies while preserving the data’s latent structure. A hybrid thresholding approach further retains genuine oceanographic features near boundaries. Applied to complex DART datasets, this approach yields reconstructions that better maintain key oceanic properties compared to classical statistical techniques, offering improved robustness against spike removal and subtle step changes. The resulting high-quality data supports critical verification and validation efforts for the GRACE-FO mission at NASA-JPL, where accurate surface measurements are essential to modeling Earth’s gravitational field and global water dynamics. Ultimately, this data processing method enhances tsunami detection and underpins future climate modeling with improved interpretability and reliability.

[AI-136] Social Science Is Necessary for Operationalizing Socially Responsible Foundation Models

链接: https://arxiv.org/abs/2412.16355
作者: Adam Davies,Elisa Nguyen,Michael Simeone,Erik Johnston,Martin Gubri
关键词: potential social impacts, social impacts, growing concern, Social science, social
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rise of foundation models, there is growing concern about their potential social impacts. Social science has a long history of studying the social impacts of transformative technologies in terms of pre-existing systems of power and how these systems are disrupted or reinforced by new technologies. In this position paper, we build on prior work studying the social impacts of earlier technologies to propose a conceptual framework studying foundation models as sociotechnical systems, incorporating social science expertise to better understand how these models affect systems of power, anticipate the impacts of deploying these models in various applications, and study the effectiveness of technical interventions intended to mitigate social harms. We advocate for an interdisciplinary and collaborative research paradigm between AI and social science across all stages of foundation model research and development to promote socially responsible research practices and use cases, and outline several strategies to facilitate such research.

[AI-137] Real Faults in Deep Learning Fault Benchmarks: How Real Are They?

链接: https://arxiv.org/abs/2412.16336
作者: Gunel Jahangirova,Nargiz Humbatova,Jinhan Kim,Shin Yoo,Paolo Tonella
关键词: Deep Learning, adoption of Deep, continues to rise, increasing number, number of approaches
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the adoption of Deep Learning (DL) systems continues to rise, an increasing number of approaches are being proposed to test these systems, localise faults within them, and repair those faults. The best attestation of effectiveness for such techniques is an evaluation that showcases their capability to detect, localise and fix real faults. To facilitate these evaluations, the research community has collected multiple benchmarks of real faults in DL systems. In this work, we perform a manual analysis of 490 faults from five different benchmarks and identify that 314 of them are eligible for our study. Our investigation focuses specifically on how well the bugs correspond to the sources they were extracted from, which fault types are represented, and whether the bugs are reproducible. Our findings indicate that only 18.5% of the faults satisfy our realism conditions. Our attempts to reproduce these faults were successful only in 52% of cases.

[AI-138] Optimizing Fintech Marketing: A Comparative Study of Logistic Regression and XGBoost

链接: https://arxiv.org/abs/2412.16333
作者: Sahar Yarmohammadtoosky Dinesh Chowdary Attota
关键词: financial services industry, predicting credit risk, studies have shown, scholarly interest, major concern
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:As several studies have shown, predicting credit risk is still a major concern for the financial services industry and is receiving a lot of scholarly interest. This area of study is crucial because it aids financial organizations in determining the probability that borrowers would default, which has a direct bearing on lending choices and risk management tactics. Despite the progress made in this domain, there is still a substantial knowledge gap concerning consumer actions that take place prior to the filing of credit card applications. The objective of this study is to predict customer responses to mail campaigns and assess the likelihood of default among those who engage. This research employs advanced machine learning techniques, specifically logistic regression and XGBoost, to analyze consumer behavior and predict responses to direct mail campaigns. By integrating different data preprocessing strategies, including imputation and binning, we enhance the robustness and accuracy of our predictive models. The results indicate that XGBoost consistently outperforms logistic regression across various metrics, particularly in scenarios using categorical binning and custom imputation. These findings suggest that XGBoost is particularly effective in handling complex data structures and provides a strong predictive capability in assessing credit risk.

[AI-139] owards Safe and Honest AI Agents with Neural Self-Other Overlap NEURIPS2024

链接: https://arxiv.org/abs/2412.16325
作者: Marc Carauleanu,Michael Vaiana,Judd Rosenblatt,Cameron Berg,Diogo Schwerz de Lucena
关键词: make critical decisions, systems increasingly make, increasingly make critical, critical decisions, systems increasingly
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: NeurIPS 2024 Safe Generative AI Workshop

点击查看摘要

Abstract:As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety that could substantially improve our ability to build honest artificial intelligence. Inspired by cognitive neuroscience research on empathy, SOO aims to align how AI models represent themselves and others. Our experiments on LLMs with 7B, 27B, and 78B parameters demonstrate SOO’s efficacy: deceptive responses of Mistral-7B-Instruct-v0.2 dropped from 73.6% to 17.2% with no observed reduction in general task performance, while in Gemma-2-27b-it and CalmeRys-78B-Orpo-v0.1 deceptive responses were reduced from 100% to 9.3% and 2.7%, respectively, with a small impact on capabilities. In reinforcement learning scenarios, SOO-trained agents showed significantly reduced deceptive behavior. SOO’s focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures. While current applications focus on language models and simple RL environments, SOO could pave the way for more trustworthy AI in broader domains. Ethical implications and long-term effects warrant further investigation, but SOO represents a significant step forward in AI safety research.

[AI-140] HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases

链接: https://arxiv.org/abs/2412.16311
作者: Meng-Chieh Lee,Qi Zhu,Costas Mavromatis,Zhen Han,Soji Adeshina,Vassilis N. Ioannidis,Huzefa Rangwala,Christos Faloutsos
关键词: effectively retrieve relevant, semi-structured knowledge base, answer user questions, interconnected by relations, retrieve relevant information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Given a semi-structured knowledge base (SKB), where text documents are interconnected by relations, how can we effectively retrieve relevant information to answer user questions? Retrieval-Augmented Generation (RAG) retrieves documents to assist large language models (LLMs) in question answering; while Graph RAG (GRAG) uses structured knowledge bases as its knowledge source. However, many questions require both textual and relational information from SKB - referred to as “hybrid” questions - which complicates the retrieval process and underscores the need for a hybrid retrieval method that leverages both information. In this paper, through our empirical analysis, we identify key insights that show why existing methods may struggle with hybrid question answering (HQA) over SKB. Based on these insights, we propose HybGRAG for HQA consisting of a retriever bank and a critic module, with the following advantages: (1) Agentic, it automatically refines the output by incorporating feedback from the critic module, (2) Adaptive, it solves hybrid questions requiring both textual and relational information with the retriever bank, (3) Interpretable, it justifies decision making with intuitive refinement path, and (4) Effective, it surpasses all baselines on HQA benchmarks. In experiments on the STaRK benchmark, HybGRAG achieves significant performance gains, with an average relative improvement in Hit@1 of 51%.

[AI-141] MetaScientist: A Human-AI Synergistic Framework for Automated Mechanical Metamaterial Design

链接: https://arxiv.org/abs/2412.16270
作者: Jingyuan Qi,Zian Jia,Minqian Liu,Wangzhi Zhan,Junkai Zhang,Xiaofei Wen,Jingru Gan,Jianpeng Chen,Qin Liu,Mingyu Derek Ma,Bangzheng Li,Haohui Wang,Adithya Kulkarni,Muhao Chen,Dawei Zhou,Ling Li,Wei Wang,Lifu Huang
关键词: chemical composition, resource-demanding process, knowledge-intensive and resource-demanding, engineered structures, mechanical metamaterial designs
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The discovery of novel mechanical metamaterials, whose properties are dominated by their engineered structures rather than chemical composition, is a knowledge-intensive and resource-demanding process. To accelerate the design of novel metamaterials, we present MetaScientist, a human-in-the-loop system that integrates advanced AI capabilities with expert oversight with two primary phases: (1) hypothesis generation, where the system performs complex reasoning to generate novel and scientifically sound hypotheses, supported with domain-specific foundation models and inductive biases retrieved from existing literature; (2) 3D structure synthesis, where a 3D structure is synthesized with a novel 3D diffusion model based on the textual hypothesis and refined it with a LLM-based refinement model to achieve better structure properties. At each phase, domain experts iteratively validate the system outputs, and provide feedback and supplementary materials to ensure the alignment of the outputs with scientific principles and human preferences. Through extensive evaluation from human scientists, MetaScientist is able to deliver novel and valid mechanical metamaterial designs that have the potential to be highly impactful in the metamaterial field.

[AI-142] Autoware.Flex: Human-Instructed Dynamically Reconfigurable Autonomous Driving Systems

链接: https://arxiv.org/abs/2412.16265
作者: Ziwei Song,Mingsong Lv,Tianchi Ren,Chun Jason Xue,Jen-Ming Wu,Nan Guan
关键词: Existing Autonomous Driving, Existing Autonomous, independently make driving, ADS, significant limitations
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注: 14 pages, 13 figures

点击查看摘要

Abstract:Existing Autonomous Driving Systems (ADS) independently make driving decisions, but they face two significant limitations. First, in complex scenarios, ADS may misinterpret the environment and make inappropriate driving decisions. Second, these systems are unable to incorporate human driving preferences in their decision-making processes. This paper proposes this http URL, a novel ADS system that incorporates human input into the driving process, allowing users to guide the ADS in making more appropriate decisions and ensuring their preferences are satisfied. Achieving this needs to address two key challenges: (1) translating human instructions, expressed in natural language, into a format the ADS can understand, and (2) ensuring these instructions are executed safely and consistently within the ADS’ s decision-making framework. For the first challenge, we employ a Large Language Model (LLM) assisted by an ADS-specialized knowledge base to enhance domain-specific translation. For the second challenge, we design a validation mechanism to ensure that human instructions result in safe and consistent driving behavior. Experiments conducted on both simulators and a real-world autonomous vehicle demonstrate that this http URL effectively interprets human instructions and executes them safely.

[AI-143] Continual Learning with Strategic Selection and Forgetting for Network Intrusion Detection

链接: https://arxiv.org/abs/2412.16264
作者: Xinchen Zhang,Running Zhao,Zhihan Jiang,Handi Chen,Yulong Ding,Edith C.H. Ngai,Shuang-Hua Yang
关键词: safeguarding digital infrastructure, Intrusion Detection Systems, Detection Systems, digital infrastructure, crucial for safeguarding
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by IEEE International Conference on Computer Communications (INFOCOM) 2025

点击查看摘要

Abstract:Intrusion Detection Systems (IDS) are crucial for safeguarding digital infrastructure. In dynamic network environments, both threat landscapes and normal operational behaviors are constantly changing, resulting in concept drift. While continuous learning mitigates the adverse effects of concept drift, insufficient attention to drift patterns and excessive preservation of outdated knowledge can still hinder the IDS’s adaptability. In this paper, we propose SSF (Strategic Selection and Forgetting), a novel continual learning method for IDS, providing continuous model updates with a constantly refreshed memory buffer. Our approach features a strategic sample selection algorithm to select representative new samples and a strategic forgetting mechanism to drop outdated samples. The proposed strategic sample selection algorithm prioritizes new samples that cause the `drifted’ pattern, enabling the model to better understand the evolving landscape. Additionally, we introduce strategic forgetting upon detecting significant drift by discarding outdated samples to free up memory, allowing the incorporation of more recent data. SSF captures evolving patterns effectively and ensures the model is aligned with the change of data patterns, significantly enhancing the IDS’s adaptability to concept drift. The state-of-the-art performance of SSF on NSL-KDD and UNSW-NB15 datasets demonstrates its superior adaptability to concept drift for network intrusion detection.

[AI-144] Aria-UI: Visual Grounding for GUI Instructions

链接: https://arxiv.org/abs/2412.16256
作者: Yuhao Yang,Yue Wang,Dongxu Li,Ziyang Luo,Bei Chen,Chao Huang,Junnan Li
关键词: Digital agents, increasingly important, platforms by directly, directly manipulating, Digital
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Digital agents for automating tasks across different platforms by directly manipulating the GUIs are increasingly important. For these agents, grounding from language instructions to target elements remains a significant challenge due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts in task performing, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research at this https URL.

[AI-145] Post-hoc Interpretability Illumination for Scientific Interaction Discovery

链接: https://arxiv.org/abs/2412.16252
作者: Ling Zhang,Zhichao Hou,Tingxiang Ji,Yuanyuan Xu,Runze Li
关键词: garnered substantial attention, Model interpretability, Iterative Kings’ Forests, recent years, decision-making applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Model interpretability and explainability have garnered substantial attention in recent years, particularly in decision-making applications. However, existing interpretability tools often fall short in delivering satisfactory performance due to limited capabilities or efficiency issues. To address these challenges, we propose a novel post-hoc method: Iterative Kings’ Forests (iKF), designed to uncover complex multi-order interactions among variables. iKF iteratively selects the next most important variable, the “King”, and constructs King’s Forests by placing it at the root node of each tree to identify variables that interact with the “King”. It then generates ranked short lists of important variables and interactions of varying orders. Additionally, iKF provides inference metrics to analyze the patterns of the selected interactions and classify them into one of three interaction types: Accompanied Interaction, Synergistic Interaction, and Hierarchical Interaction. Extensive experiments demonstrate the strong interpretive power of our proposed iKF, highlighting its great potential for explainable modeling and scientific discovery across diverse scientific fields.

[AI-146] Optimizing Low-Speed Autonomous Driving: A Reinforcement Learning Approach to Route Stability and Maximum Speed

链接: https://arxiv.org/abs/2412.16248
作者: Benny Bao-Sheng Li,Elena Wu,Hins Shao-Xuan Yang,Nicky Yao-Jin Liang
关键词: garnered significant attention, optimizing vehicle performance, recent years, varying conditions, garnered significant
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Autonomous driving has garnered significant attention in recent years, especially in optimizing vehicle performance under varying conditions. This paper addresses the challenge of maintaining maximum speed stability in low-speed autonomous driving while following a predefined route. Leveraging reinforcement learning (RL), we propose a novel approach to optimize driving policies that enable the vehicle to achieve near-maximum speed without compromising on safety or route accuracy, even in low-speed scenarios.

[AI-147] Neural diversity is key to collective artificial learning

链接: https://arxiv.org/abs/2412.16244
作者: Matteo Bettini,Ryan Kortvelesy,Amanda Prorok
关键词: require complex collective, complex collective problem-solving, collective artificial learning, pressing issues, global peace
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Many of the world’s most pressing issues, such as climate change and global peace, require complex collective problem-solving skills. Recent studies indicate that diversity in individuals’ behaviors is key to developing such skills and increasing collective performance. Yet behavioral diversity in collective artificial learning is understudied, with today’s machine learning paradigms commonly favoring homogeneous agent strategies over heterogeneous ones, mainly due to computational considerations. In this work, we employ novel diversity measurement and control paradigms to study the impact of behavioral heterogeneity in several facets of collective artificial learning. Through experiments in team play and other cooperative tasks, we show the emergence of unbiased behavioral roles that improve team outcomes; how neural diversity synergizes with morphological diversity; how diverse agents are more effective at finding cooperative solutions in sparse reward settings; and how behaviorally heterogeneous teams learn and retain latent skills to overcome repeated disruptions. Overall, our results indicate that, by controlling diversity, we can obtain non-trivial benefits over homogeneous training paradigms, demonstrating that diversity is a fundamental component of collective artificial learning, an insight thus far overlooked.

[AI-148] Agents Are Not Enough

链接: https://arxiv.org/abs/2412.16241
作者: Chirag Shah,Ryen W. White
关键词: Artificial Intelligence, integration of Artificial, experiencing a resurgence, growing integration, agents
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:In the midst of the growing integration of Artificial Intelligence (AI) into various aspects of our lives, agents are experiencing a resurgence. These autonomous programs that act on behalf of humans are neither new nor exclusive to the mainstream AI movement. By exploring past incarnations of agents, we can understand what has been done previously, what worked, and more importantly, what did not pan out and why. This understanding lets us to examine what distinguishes the current focus on agents. While generative AI is appealing, this technology alone is insufficient to make new generations of agents more successful. To make the current wave of agents effective and sustainable, we envision an ecosystem that includes not only agents but also Sims, which represent user preferences and behaviors, as well as Assistants, which directly interact with the user and coordinate the execution of user tasks with the help of the agents.

[AI-149] A jury evaluation theorem

链接: https://arxiv.org/abs/2412.16238
作者: Andrés Corrada-Emmanuel
关键词: Majority voting, crowd algorithm, Majority, American Community Survey, Condorcet
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages

点击查看摘要

Abstract:Majority voting (MV) is the prototypical ``wisdom of the crowd’’ algorithm. Theorems considering when MV is optimal for group decisions date back to Condorcet’s 1785 jury decision theorem. The same assumption of error independence used by Condorcet is used here to prove a jury evaluation theorem that does purely algebraic evaluation (AE). Three or more binary jurors are enough to obtain the only two possible statistics of their correctness on a joint test they took. AE is shown to be superior to MV since it allows one to choose the minority vote depending on how the jurors agree or disagree. In addition, AE is self-alarming about the failure of the error-independence assumption. Experiments labeling demographic datasets from the American Community Survey are carried out to compare MV and AE on nearly error-independent ensembles. In general, using algebraic evaluation leads to better classifier evaluations and group labeling decisions.

[AI-150] A Proposal for Extending the Common Model of Cognition to Emotion

链接: https://arxiv.org/abs/2412.16231
作者: Paul S. Rosenbloom,John E. Laird,Christian Lebiere,Andrea Stocco,Richard H. Granger,Christian Huyck
关键词: Common Model existing, Common Model, complete model, humanlike mind, Model existing modules
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: A version of this article was published in Proceedings of the 22nd International Conference on Cognitive Modeling (2024)

点击查看摘要

Abstract:Cognition and emotion must be partnered in any complete model of a humanlike mind. This article proposes an extension to the Common Model of Cognition – a developing consensus concerning what is required in such a mind – for emotion that includes a linked pair of modules for emotion and metacognitive assessment, plus pervasive connections between these two new modules and the Common Model’s existing modules and links.

[AI-151] AACKIT: Track Annotation and Analytics with Continuous Knowledge Integration Tool

链接: https://arxiv.org/abs/2412.16228
作者: Lily Lee,Julian Fontes,Andrew Weinert,Laura Schomacker,Daniel Stabile,Jonathan Hou
关键词: Machine learning, efficiently analyzing data, detecting patterns, efficiently analyzing, forecasting trends
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) is a powerful tool for efficiently analyzing data, detecting patterns, and forecasting trends across various domains such as text, audio, and images. The availability of annotation tools to generate reliably annotated data is crucial for advances in ML applications. In the domain of geospatial tracks, the lack of such tools to annotate and validate data impedes rapid and accessible ML application development. This paper presents Track Annotation and Analytics with Continuous Knowledge Integration Tool (TAACKIT) to serve the critically important functions of annotating geospatial track data and validating ML models. We demonstrate an ML application use case in the air traffic domain to illustrate its data annotation and model evaluation power and quantify the annotation effort reduction.

[AI-152] Quantified Linear and Polynomial Arithmetic Satisfiability via Template-based Skolemization AAAI2025

链接: https://arxiv.org/abs/2412.16226
作者: Krishnendu Chatterjee,Ehsan Kafshdar Goharshady,Mehrdad Karrabi,Harshit J Motwani,Maximilian Seeliger,Đorđe Žikelić
关键词: non-linear real arithmetic, real arithmetic, linear real arithmetic, checking satisfiability, NRA formulas
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: Accepted at AAAI 2025

点击查看摘要

Abstract:The problem of checking satisfiability of linear real arithmetic (LRA) and non-linear real arithmetic (NRA) formulas has broad applications, in particular, they are at the heart of logic-related applications such as logic for artificial intelligence, program analysis, etc. While there has been much work on checking satisfiability of unquantified LRA and NRA formulas, the problem of checking satisfiability of quantified LRA and NRA formulas remains a significant challenge. The main bottleneck in the existing methods is a computationally expensive quantifier elimination step. In this work, we propose a novel method for efficient quantifier elimination in quantified LRA and NRA formulas. We propose a template-based Skolemization approach, where we automatically synthesize linear/polynomial Skolem functions in order to eliminate quantifiers in the formula. The key technical ingredients in our approach are Positivstellensätze theorems from algebraic geometry, which allow for an efficient manipulation of polynomial inequalities. Our method offers a range of appealing theoretical properties combined with a strong practical performance. On the theory side, our method is sound, semi-complete, and runs in subexponential time and polynomial space, as opposed to existing sound and complete quantifier elimination methods that run in doubly-exponential time and at least exponential space. On the practical side, our experiments show superior performance compared to state-of-the-art SMT solvers in terms of the number of solved instances and runtime, both on LRA and on NRA benchmarks.

[AI-153] Bayesian Critique-Tune-Based Reinforcement Learning with Attention-Based Adaptive Pressure for Multi-Intersection Traffic Signal Control

链接: https://arxiv.org/abs/2412.16225
作者: Wenchang Duan,Zhenguo Gao. Jinguo Xian
关键词: significantly alleviate urban, urban traffic congestion, alleviate urban traffic, Traffic Signal Control, Adaptive Traffic Signal
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Adaptive Traffic Signal Control (ATSC) system is a critical component of intelligent transportation, with the capability to significantly alleviate urban traffic congestion. Although reinforcement learning (RL)-based methods have demonstrated promising performance in achieving ATSC, existing methods find it prone to convergence to local optima. Consequently, this paper proposes a novel Bayesian Critique-Tune-Based Reinforcement Learning with Attention-Based Adaptive Pressure (BCT-APRL) for multi-intersection signal control. In BCT-APRL, the Critique-Tune (CT) framework, a two-layer Bayesian structure is designed to refine the RL policies. Specifically, the Bayesian inference-based Critique layer provides effective evaluations of the credibility of policies; the Bayesian decision-based Tune layer fine-tunes policies by minimizing the posterior risks when the evaluations are negative. Furthermore, an attention-based Adaptive Pressure (AP) is designed to specify the traffic movement representation as an effective and efficient pressure of vehicle queues in the traffic network. Achieving enhances the reasonableness of RL policies. Extensive experiments conducted with a simulator across a range of intersection layouts show that BCT-APRL is superior to other state-of-the-art methods in seven real-world datasets. Codes are open-sourced.

[AI-154] Machine Learning-Based Estimation Of Wave Direction For Unmanned Surface Vehicles

链接: https://arxiv.org/abs/2412.16205
作者: Manele Ait Habouche,Mickaël Kerboeuf,Goulven Guillou,Jean-Philippe Babau
关键词: Unmanned Surface Vehicles, Unmanned Surface, Surface Vehicles, environmental monitoring, marine exploration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Unmanned Surface Vehicles (USVs) have become critical tools for marine exploration, environmental monitoring, and autonomous navigation. Accurate estimation of wave direction is essential for improving USV navigation and ensuring operational safety, but traditional methods often suffer from high costs and limited spatial resolution. This paper proposes a machine learning-based approach leveraging LSTM (Long Short-Term Memory) networks to predict wave direction using sensor data collected from USVs. Experimental results show the capability of the LSTM model to learn temporal dependencies and provide accurate predictions, outperforming simpler baselines.

[AI-155] CLIP-RLDrive: Human-Aligned Autonomous Driving via CLIP-Based Reward Shaping in Reinforcement Learning

链接: https://arxiv.org/abs/2412.16201
作者: Erfan Doroudian,Hamid Taghavifar
关键词: paper presents CLIP-RLDrive, complex urban driving, urban driving scenarios, Contrastive Language-Image Pretraining, presents CLIP-RLDrive
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This paper presents CLIP-RLDrive, a new reinforcement learning (RL)-based framework for improving the decision-making of autonomous vehicles (AVs) in complex urban driving scenarios, particularly in unsignalized intersections. To achieve this goal, the decisions for AVs are aligned with human-like preferences through Contrastive Language-Image Pretraining (CLIP)-based reward shaping. One of the primary difficulties in RL scheme is designing a suitable reward model, which can often be challenging to achieve manually due to the complexity of the interactions and the driving scenarios. To deal with this issue, this paper leverages Vision-Language Models (VLMs), particularly CLIP, to build an additional reward model based on visual and textual cues.

[AI-156] AgroXAI: Explainable AI-Driven Crop Recommendation System for Agriculture 4.0

链接: https://arxiv.org/abs/2412.16196
作者: Ozlem Turgut,Ibrahim Kok,Suat Ozdemir
关键词: improve food safety, food safety, safety and quality, meet the increasing, increasing demand
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: Accepted in 2024 IEEE International Conference on Big Data (IEEE BigData), 10 pages, 9 Figures, 5 Tables

点击查看摘要

Abstract:Today, crop diversification in agriculture is a critical issue to meet the increasing demand for food and improve food safety and quality. This issue is considered to be the most important challenge for the next generation of agriculture due to the diminishing natural resources, the limited arable land, and unpredictable climatic conditions caused by climate change. In this paper, we employ emerging technologies such as the Internet of Things (IoT), machine learning (ML), and explainable artificial intelligence (XAI) to improve operational efficiency and productivity in the agricultural sector. Specifically, we propose an edge computing-based explainable crop recommendation system, AgroXAI, which suggests suitable crops for a region based on weather and soil conditions. In this system, we provide local and global explanations of ML model decisions with methods such as ELI5, LIME, SHAP, which we integrate into ML models. More importantly, we provide regional alternative crop recommendations with the counterfactual explainability method. In this way, we envision that our proposed AgroXAI system will be a platform that provides regional crop diversity in the next generation agriculture.

[AI-157] A Decade of Deep Learning: A Survey on The Magnificent Seven

链接: https://arxiv.org/abs/2412.16188
作者: Dilshod Azizov,Muhammad Arslan Manzoor,Velibor Bojkovic,Yingxu Wang,Zixiao Wang,Zangir Iklassov,Kailong Zhao,Liang Li,Siwei Liu,Yu Zhong,Wei Liu,Shangsong Liang
关键词: enabling remarkable achievements, Graph Neural Networks, past decade, enabling remarkable, Generative Adversarial Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep learning has fundamentally reshaped the landscape of artificial intelligence over the past decade, enabling remarkable achievements across diverse domains. At the heart of these developments lie multi-layered neural network architectures that excel at automatic feature extraction, leading to significant improvements in machine learning tasks. To demystify these advances and offer accessible guidance, we present a comprehensive overview of the most influential deep learning algorithms selected through a broad-based survey of the field. Our discussion centers on pivotal architectures, including Residual Networks, Transformers, Generative Adversarial Networks, Variational Autoencoders, Graph Neural Networks, Contrastive Language-Image Pre-training, and Diffusion models. We detail their historical context, highlight their mathematical foundations and algorithmic principles, and examine subsequent variants, extensions, and practical considerations such as training methodologies, normalization techniques, and learning rate schedules. Beyond historical and technical insights, we also address their applications, challenges, and potential research directions. This survey aims to serve as a practical manual for both newcomers seeking an entry point into cutting-edge deep learning methods and experienced researchers transitioning into this rapidly evolving domain.

[AI-158] More complex environments may be required to discover benefits of lifetime learning in evolving robots

链接: https://arxiv.org/abs/2412.16184
作者: Ege de Bruin,Kyrre Glette,Kai Olav Ellefsen
关键词: controller optimization loop, additional controller optimization, evolving robot morphologies, optimization loop, morphologies for locomotion
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:It is well known that intra-life learning, defined as an additional controller optimization loop, is beneficial for evolving robot morphologies for locomotion. In this work, we investigate this further by comparing it in two different environments: an easy flat environment and a more challenging hills environment. We show that learning is significantly more beneficial in a hilly environment than in a flat environment and that it might be needed to evaluate robots in a more challenging environment to see the benefits of learning.

[AI-159] Minimum Weighted Feedback Arc Sets for Ranking from Pairwise Comparisons

链接: https://arxiv.org/abs/2412.16181
作者: Soroush Vahidi,Ioannis Koutis
关键词: Feedback Arc Set, Minimum Weighted Feedback, Weighted Feedback Arc, Arc Set, Minimum Weighted
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: This is a preliminary paper

点击查看摘要

Abstract:The Minimum Weighted Feedback Arc Set (MWFAS) problem is fundamentally connected to the Ranking Problem – the task of deriving global rankings from pairwise comparisons. Recent work [He et al. ICML2022] has advanced the state-of-the-art for the Ranking Problem using learning-based methods, improving upon multiple previous approaches. However, the connection to MWFAS remains underexplored. This paper investigates this relationship and presents efficient combinatorial algorithms for solving MWFAS, thus addressing the Ranking Problem. Our experimental results demonstrate that these simple, learning-free algorithms not only significantly outperform learning-based methods in terms of speed but also generally achieve superior ranking accuracy.

[AI-160] Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs

链接: https://arxiv.org/abs/2412.16178
作者: Michael Wornow,Suhana Bedi,Miguel Angel Fuentes Hernandez,Ethan Steinberg,Jason Alan Fries,Christopher Ré,Sanmi Koyejo,Nigam H. Shah
关键词: Electronic Health Records, Health Records, Electronic Health, trained on Electronic, EHR
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Foundation Models (FMs) trained on Electronic Health Records (EHRs) have achieved state-of-the-art results on numerous clinical prediction tasks. However, most existing EHR FMs have context windows of 1k tokens. This prevents them from modeling full patient EHRs which can exceed 10k’s of events. Recent advancements in subquadratic long-context architectures (e.g., Mamba) offer a promising solution. However, their application to EHR data has not been well-studied. We address this gap by presenting the first systematic evaluation of the effect of context length on modeling EHR data. We find that longer context models improve predictive performance – our Mamba-based model surpasses the prior state-of-the-art on 9/14 tasks on the EHRSHOT prediction benchmark. For clinical applications, however, model performance alone is insufficient – robustness to the unique properties of EHR is crucial. Thus, we also evaluate models across three previously underexplored properties of EHR data: (1) the prevalence of “copy-forwarded” diagnoses which creates artificial repetition of tokens within EHR sequences; (2) the irregular time intervals between EHR events which can lead to a wide range of timespans within a context window; and (3) the natural increase in disease complexity over time which makes later tokens in the EHR harder to predict than earlier ones. Stratifying our EHRSHOT results, we find that higher levels of each property correlate negatively with model performance, but that longer context models are more robust to more extreme levels of these properties. Our work highlights the potential for using long-context architectures to model EHR data, and offers a case study for identifying new challenges in modeling sequential data motivated by domains outside of natural language. We release our models and code at: this https URL

[AI-161] Mining Math Conjectures from LLM s: A Pruning Approach NEURIPS

链接: https://arxiv.org/abs/2412.16177
作者: Jake Chuharski,Elias Rojas Collins,Mark Meringolo
关键词: Large Language Models, Language Models, Large Language, generating mathematical conjectures, approach to generating
类目: Artificial Intelligence (cs.AI)
*备注: 23 pages, 10 figures, NeurIPS MathAI Workshop 2024

点击查看摘要

Abstract:We present a novel approach to generating mathematical conjectures using Large Language Models (LLMs). Focusing on the solubilizer, a relatively recent construct in group theory, we demonstrate how LLMs such as ChatGPT, Gemini, and Claude can be leveraged to generate conjectures. These conjectures are pruned by allowing the LLMs to generate counterexamples. Our results indicate that LLMs are capable of producing original conjectures that, while not groundbreaking, are either plausible or falsifiable via counterexamples, though they exhibit limitations in code execution.

[AI-162] Reconciling Spatial and Temporal Abstractions for Goal Representation

链接: https://arxiv.org/abs/2401.09870
作者: Mehdi Zadem,Sergio Mover,Sao Mai Nguyen
关键词: Hierarchical Reinforcement Learning, Hierarchical Reinforcement, performance of Hierarchical, Goal representation affects, Reinforcement Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Goal representation affects the performance of Hierarchical Reinforcement Learning (HRL) algorithms by decomposing the complex learning problem into easier subtasks. Recent studies show that representations that preserve temporally abstract environment dynamics are successful in solving difficult problems and provide theoretical guarantees for optimality. These methods however cannot scale to tasks where environment dynamics increase in complexity i.e. the temporally abstract transition relations depend on larger number of variables. On the other hand, other efforts have tried to use spatial abstraction to mitigate the previous issues. Their limitations include scalability to high dimensional environments and dependency on prior knowledge. In this paper, we propose a novel three-layer HRL algorithm that introduces, at different levels of the hierarchy, both a spatial and a temporal goal abstraction. We provide a theoretical study of the regret bounds of the learned policies. We evaluate the approach on complex continuous control tasks, demonstrating the effectiveness of spatial and temporal abstractions learned by this approach. Find open-source code at this https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO) Cite as: arXiv:2401.09870 [cs.LG] (or arXiv:2401.09870v2 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2401.09870 Focus to learn more arXiv-issued DOI via DataCite Journalreference: ICLR 2024

[AI-163] owards Human Haptic Gesture Interpretation for Robotic Systems IROS2021

链接: https://arxiv.org/abs/2012.01959
作者: Bibit Bianchini,Prateek Verma,Kenneth Salisbury
关键词: Physical human-robot interactions, Physical human-robot, human-robot interactions, human-human interactions, efficient and communicative
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 8 figures, Accepted at 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021) in Prague, Czech Republic

点击查看摘要

Abstract:Physical human-robot interactions (pHRI) are less efficient and communicative than human-human interactions, and a key reason is a lack of informative sense of touch in robotic systems. Interpreting human touch gestures is a nuanced, challenging task with extreme gaps between human and robot capability. Among prior works that demonstrate human touch recognition capability, differences in sensors, gesture classes, feature sets, and classification algorithms yield a conglomerate of non-transferable results and a glaring lack of a standard. To address this gap, this work presents 1) four proposed touch gesture classes that cover an important subset of the gesture characteristics identified in the literature, 2) the collection of an extensive force dataset on a common pHRI robotic arm with only its internal wrist force-torque sensor, and 3) an exhaustive performance comparison of combinations of feature sets and classification algorithms on this dataset. We demonstrate high classification accuracies among our proposed gesture definitions on a test set, emphasizing that neural net-work classifiers on the raw data outperform other combinations of feature sets and algorithms. The accompanying video is here: this https URL

[AI-164] Neural Style Transfer for Audio Spectograms NIPS2017

链接: https://arxiv.org/abs/1801.01589
作者: Prateek Verma,Julius O. Smith
关键词: creating artistic transformations, artistic transformations, images by Gatys, fascinating work, creating artistic
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Appeared in 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA at the workshop for Machine Learning for Creativity and Design

点击查看摘要

Abstract:There has been fascinating work on creating artistic transformations of images by Gatys. This was revolutionary in how we can in some sense alter the ‘style’ of an image while generally preserving its ‘content’. In our work, we present a method for creating new sounds using a similar approach, treating it as a style-transfer problem, starting from a random-noise input signal and iteratively using back-propagation to optimize the sound to conform to filter-outputs from a pre-trained neural architecture of interest. For demonstration, we investigate two different tasks, resulting in bandwidth expansion/compression, and timbral transfer from singing voice to musical instruments. A feature of our method is that a single architecture can generate these different audio-style-transfer types using the same set of parameters which otherwise require different complex hand-tuned diverse signal processing pipelines. Comments: Appeared in 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA at the workshop for Machine Learning for Creativity and Design Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS) Cite as: arXiv:1801.01589 [cs.SD] (or arXiv:1801.01589v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.1801.01589 Focus to learn more arXiv-issued DOI via DataCite

[AI-165] PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion

链接: https://arxiv.org/abs/2412.17780
作者: Sophia Tang,Yinuo Zhang,Pranam Chatterjee
关键词: receptor agonists revolutionizing, achieved remarkable success, diabetes and cancer, diabetes and obesity, class of medicines
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Peptide therapeutics, a major class of medicines, have achieved remarkable success across diseases such as diabetes and cancer, with landmark examples such as GLP-1 receptor agonists revolutionizing the treatment of type-2 diabetes and obesity. Despite their success, designing peptides that satisfy multiple conflicting objectives, such as target binding affinity, solubility, and membrane permeability, remains a major challenge. Classical drug development and structure-based design are ineffective for such tasks, as they fail to optimize global functional properties critical for therapeutic efficacy. Existing generative frameworks are largely limited to continuous spaces, unconditioned outputs, or single-objective guidance, making them unsuitable for discrete sequence optimization across multiple properties. To address this, we present PepTune, a multi-objective discrete diffusion model for the simultaneous generation and optimization of therapeutic peptide SMILES. Built on the Masked Discrete Language Model (MDLM) framework, PepTune ensures valid peptide structures with state-dependent masking schedules and penalty-based objectives. To guide the diffusion process, we propose a Monte Carlo Tree Search (MCTS)-based strategy that balances exploration and exploitation to iteratively refine Pareto-optimal sequences. MCTS integrates classifier-based rewards with search-tree expansion, overcoming gradient estimation challenges and data sparsity inherent to discrete spaces. Using PepTune, we generate diverse, chemically-modified peptides optimized for multiple therapeutic properties, including target binding affinity, membrane permeability, solubility, hemolysis, and non-fouling characteristics on various disease-relevant targets. In total, our results demonstrate that MCTS-guided discrete diffusion is a powerful and modular approach for multi-objective sequence design in discrete state spaces.

[AI-166] An Investigation on the Potential of KAN in Speech Enhancement

链接: https://arxiv.org/abs/2412.17778
作者: Haoyang Li,Yuchen Hu,Chen Chen,Eng Siong Chng
关键词: requires sophisticated modeling, High-fidelity speech enhancement, multiscale patterns, High-fidelity speech, capture intricate
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 2 figure, 4 tables

点击查看摘要

Abstract:High-fidelity speech enhancement often requires sophisticated modeling to capture intricate, multiscale patterns. Standard activation functions, while introducing nonlinearity, lack the flexibility to fully address this complexity. Kolmogorov-Arnold Networks (KAN), an emerging methodology that employs learnable activation functions on graph edges, present a promising alternative. This work investigates two novel KAN variants based on rational and radial basis functions for speech enhancement. We integrate the rational variant into the 1D CNN blocks of Demucs and the GRU-Transformer blocks of MP-SENet, while the radial variant is adapted to the 2D CNN-based decoders of MP-SENet. Experiments on the VoiceBank-DEMAND dataset show that replacing standard activations with KAN-based activations improves speech quality across both the time-domain and time-frequency domain methods with minimal impact on model size and FLOP, underscoring KAN’s potential to improve speech enhancement models.

[AI-167] Signal Transformation for Effective Multi-Channel Signal Processing

链接: https://arxiv.org/abs/2412.17478
作者: Sunil Kumar Kopparapu
关键词: EEG, EEG signals, signal, EEG signal processing, signal transformation
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 5 Figures

点击查看摘要

Abstract:Electroencephalography (EEG) is an non-invasive method to record the electrical activity of the brain. The EEG signals are low bandwidth and recorded from multiple electrodes simultaneously in a time synchronized manner. Typical EEG signal processing involves extracting features from all the individual channels separately and then fusing these features for downstream applications. In this paper, we propose a signal transformation, using basic signal processing, to combine the individual channels of a low-bandwidth signal, like the EEG into a single-channel high-bandwidth signal, like audio. Further this signal transformation is bi-directional, namely the high-bandwidth single-channel can be transformed to generate the individual low-bandwidth signals without any loss of information. Such a transformation when applied to EEG signals overcomes the need to process multiple signals and allows for a single-channel processing. The advantage of this signal transformation is that it allows the use of pre-trained single-channel pre-trained models, for multi-channel signal processing and analysis. We further show the utility of the signal transformation on publicly available EEG dataset.

[AI-168] VirusT5: Harnessing Large Language Models to Predicting SARS-CoV-2 Evolution

链接: https://arxiv.org/abs/2412.16262
作者: Vishwajeet Marathe,Deewan Bajracharya,Changhui Yan
关键词: DNA repair efficiency,these, repair efficiency,these constraints, efficiency,these constraints contribute, http URL, URL with factors
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: This is a preprint of a paper submitted to IEEE for consideration

点击查看摘要

Abstract:During a virus’s evolution,various regions of the genome are subjected to distinct levels of functional this http URL with factors like codon bias and DNA repair efficiency,these constraints contribute to unique mutation patterns within the genome or a specific gene. In this project, we harnessed the power of Large Language Models(LLMs) to predict the evolution of SARS-CoV-2. By treating the mutation process from one generation to the next as a translation task, we trained a transformer model, called VirusT5, to capture the mutation patterns underlying SARS-CoV-2 evolution. We evaluated the VirusT5’s ability to detect these mutation patterns including its ability to identify mutation hotspots and explored the potential of using VirusT5 to predict future virus variants. Our findings demonstrate the feasibility of using a large language model to model viral evolution as a translation process. This study establishes the groundbreaking concept of “mutation-as-translation,” paving the way for new methodologies and tools for combating virus threats

[AI-169] Cross-Attention Graph Neural Networks for Inferring Gene Regulatory Networks with Skewed Degree Distribution

链接: https://arxiv.org/abs/2412.16220
作者: Jiaqi Xiong,Nan Yin,Yifan Sun,Haoyang Li,Yingxu Wang,Duo Ai,Fang Pan,Shiyang Liang
关键词: Inferencing Gene Regulatory, Gene Regulatory Networks, innovative computational methods, graph embedding, Inferencing Gene
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures,1 tabels

点击查看摘要

Abstract:Inferencing Gene Regulatory Networks (GRNs) from gene expression data is a pivotal challenge in systems biology, and several innovative computational methods have been introduced. However, most of these studies have not considered the skewed degree distribution of this http URL, some genes may regulate multiple target genes while some genes may be regulated by multiple regulator this http URL a skewed degree distribution issue significantly complicates the application of directed graph embedding methods. To tackle this issue, we propose the Cross-Attention Complex Dual Graph Embedding Model (XATGRN). Our XATGRN employs a cross-attention mechanism to effectively capture intricate gene interactions from gene expression profiles. Additionally, it uses a Dual Complex Graph Embedding approach to manage the skewed degree distribution, thereby ensuring precise prediction of regulatory relationships and their directionality. Our model consistently outperforms existing state-of-the-art methods across various datasets, underscoring its efficacy in elucidating complex gene regulatory mechanisms. Our codes used in this paper are publicly available at: this https URL.

[AI-170] Unveiling the Role of Artificial Intelligence and Stock Market Growth in Achieving Carbon Neutrality in the United States: An ARDL Model Analysis

链接: https://arxiv.org/abs/2412.16166
作者: Azizul Hakim Rafi,Abdullah Al Abrar Chowdhury,Adita Sultana,Abdulla All Noman
关键词: mitigate climate change, climate change, mitigate climate, recent years, specialized research
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注: 26 pages, 8 tables

点击查看摘要

Abstract:Given the fact that climate change has become one of the most pressing problems in many countries in recent years, specialized research on how to mitigate climate change has been adopted by many countries. Within this discussion, the influence of advanced technologies in achieving carbon neutrality has been discussed. While several studies investigated how AI and Digital innovations could be used to reduce the environmental footprint, the actual influence of AI in reducing CO2 emissions (a proxy measuring carbon footprint) has yet to be investigated. This paper studies the role of advanced technologies in general, and Artificial Intelligence (AI) and ICT use in particular, in advancing carbon neutrality in the United States, between 2021. Secondly, this paper examines how Stock Market Growth, ICT use, Gross Domestic Product (GDP), and Population affect CO2 emissions using the STIRPAT model. After examining stationarity among the variables using a variety of unit root tests, this study concluded that there are no unit root problems across all the variables, with a mixed order of integration. The ARDL bounds test for cointegration revealed that variables in this study have a long-run relationship. Moreover, the estimates revealed from the ARDL model in the short- and long-run indicated that economic growth, stock market capitalization, and population significantly contributed to the carbon emissions in both the short-run and long-run. Conversely, AI and ICT use significantly reduced carbon emissions over both periods. Furthermore, findings were confirmed to be robust using FMOLS, DOLS, and CCR estimations. Furthermore, diagnostic tests indicated the absence of serial correlation, heteroscedasticity, and specification errors and, thus, the model was robust.

机器学习

[LG-0] oken Statistics Transformer: Linear-Time Attention via Variational Rate Reduction

链接: https://arxiv.org/abs/2412.17810
作者: Ziyang Wu,Tianjiao Ding,Yifu Lu,Druv Pai,Jingyuan Zhang,Weida Wang,Yaodong Yu,Yi Ma,Benjamin D. Haeffele
关键词: key distinguishing factor, Token Statistics Transformer, arguably the key, key distinguishing, distinguishing factor
类目: Machine Learning (cs.LG)
*备注: 24 pages, 11 figures

点击查看摘要

Abstract:The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant computational burden, with the computational complexity scaling quadratically with the number of tokens. In this work, we propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens. We derive our network architecture by extending prior work which has shown that a transformer style architecture naturally arises by “white-box” architecture design, where each layer of the network is designed to implement an incremental optimization step of a maximal coding rate reduction objective (MCR ^2 ). Specifically, we derive a novel variational form of the MCR ^2 objective and show that the architecture that results from unrolled gradient descent of this variational objective leads to a new attention module called Token Statistics Self-Attention (TSSA). TSSA has linear computational and memory complexity and radically departs from the typical attention architecture that computes pairwise similarities between tokens. Experiments on vision, language, and long sequence tasks show that simply swapping TSSA for standard self-attention, which we refer to as the Token Statistics Transformer (ToST), achieves competitive performance with conventional transformers while being significantly more computationally efficient and interpretable. Our results also somewhat call into question the conventional wisdom that pairwise similarity style attention mechanisms are critical to the success of transformer architectures. Code will be available at this https URL.

[LG-1] Examining Imbalance Effects on Performance and Demographic Fairness of Clinical Language Models

链接: https://arxiv.org/abs/2412.17803
作者: Precious Jones,Weisi Liu,I-Chan Huang,Xiaolei Huang
关键词: ICD code prediction, ICD code, Data imbalance, distributions are uneven, language models
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Data imbalance is a fundamental challenge in applying language models to biomedical applications, particularly in ICD code prediction tasks where label and demographic distributions are uneven. While state-of-the-art language models have been increasingly adopted in biomedical tasks, few studies have systematically examined how data imbalance affects model performance and fairness across demographic groups. This study fills the gap by statistically probing the relationship between data imbalance and model performance in ICD code prediction. We analyze imbalances in a standard benchmark data across gender, age, ethnicity, and social determinants of health by state-of-the-art biomedical language models. By deploying diverse performance metrics and statistical analyses, we explore the influence of data imbalance on performance variations and demographic fairness. Our study shows that data imbalance significantly impacts model performance and fairness, but feature similarity to the majority class may be a more critical factor. We believe this study provides valuable insights for developing more equitable and robust language models in healthcare applications.

[LG-2] Memory makes computation universal remember?

链接: https://arxiv.org/abs/2412.17794
作者: Erik Garrison
关键词: Recent breakthroughs, increasingly sophisticated architectures, memory makes computation, alignment techniques, attributed to increasingly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent breakthroughs in AI capability have been attributed to increasingly sophisticated architectures and alignment techniques, but a simpler principle may explain these advances: memory makes computation universal. Memory enables universal computation through two fundamental capabilities: recursive state maintenance and reliable history access. We formally prove these requirements are both necessary and sufficient for universal computation. This principle manifests across scales, from cellular computation to neural networks to language models. Complex behavior emerges not from sophisticated processing units but from maintaining and accessing state across time. We demonstrate how parallel systems like neural networks achieve universal computation despite limitations in their basic units by maintaining state across iterations. This theoretical framework reveals a universal pattern: computational advances consistently emerge from enhanced abilities to maintain and access state rather than from more complex basic operations. Our analysis unifies understanding of computation across biological systems, artificial intelligence, and human cognition, reminding us that humanity’s own computational capabilities have evolved in step with our technical ability to remember through oral traditions, writing, and now computing.

[LG-3] HyperQ-Opt: Q-learning for Hyperparameter Optimization

链接: https://arxiv.org/abs/2412.17765
作者: Md. Tarek Hasan
关键词: computationally intensive search, Model-based Bayesian Optimization, Sequential Model-based Bayesian, large parameter space, Random Search suffer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hyperparameter optimization (HPO) is critical for enhancing the performance of machine learning models, yet it often involves a computationally intensive search across a large parameter space. Traditional approaches such as Grid Search and Random Search suffer from inefficiency and limited scalability, while surrogate models like Sequential Model-based Bayesian Optimization (SMBO) rely heavily on heuristic predictions that can lead to suboptimal results. This paper presents a novel perspective on HPO by formulating it as a sequential decision-making problem and leveraging Q-learning, a reinforcement learning technique, to optimize hyperparameters. The study explores the works of H.S. Jomaa et al. and Qi et al., which model HPO as a Markov Decision Process (MDP) and utilize Q-learning to iteratively refine hyperparameter settings. The approaches are evaluated for their ability to find optimal or near-optimal configurations within a limited number of trials, demonstrating the potential of reinforcement learning to outperform conventional methods. Additionally, this paper identifies research gaps in existing formulations, including the limitations of discrete search spaces and reliance on heuristic policies, and suggests avenues for future exploration. By shifting the paradigm toward policy-based optimization, this work contributes to advancing HPO methods for scalable and efficient machine learning applications.

[LG-4] he Superposition of Diffusion Models Using the Ito Density Estimator

链接: https://arxiv.org/abs/2412.17762
作者: Marta Skreta,Lazar Atanackovic,Avishek Joey Bose,Alexander Tong,Kirill Neklyudov
关键词: pre-trained diffusion models, significant computational burden, easily accessible pre-trained, larger combined model, accessible pre-trained diffusion
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. SuperDiff leverages a new scalable Itô density estimator for the log likelihood of the diffusion SDE which incurs no additional overhead compared to the well-known Hutchinson’s estimator needed for divergence calculations. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, and improved unconditional de novo structure design of proteins. this https URL

[LG-5] Sensitivity Curve Maximization: Attacking Robust Aggregators in Distributed Learning

链接: https://arxiv.org/abs/2412.17740
作者: Christian A. Schroth,Stefan Vlaski,Abdelhak M. Zoubir
关键词: global learning problem, aim at collaboratively, collaboratively solving, solving a global, distributed learning agents
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In distributed learning agents aim at collaboratively solving a global learning problem. It becomes more and more likely that individual agents are malicious or faulty with an increasing size of the network. This leads to a degeneration or complete breakdown of the learning process. Classical aggregation schemes are prone to breakdown at small contamination rates, therefore robust aggregation schemes are sought for. While robust aggregation schemes can generally tolerate larger contamination rates, many have been shown to be susceptible to carefully crafted malicious attacks. In this work, we show how the sensitivity curve (SC), a classical tool from robust statistics, can be used to systematically derive optimal attack patterns against arbitrary robust aggregators, in most cases rendering them ineffective. We show the effectiveness of the proposed attack in multiple simulations.

[LG-6] Contextual Backpropagation Loops: Amplifying Deep Reasoning with Iterative Top-Down Feedback

链接: https://arxiv.org/abs/2412.17737
作者: Jacob Fein-Ashley
关键词: resolve ambiguous inputs, neural networks typically, networks typically rely, single forward pass, Deep neural networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks typically rely on a single forward pass for inference, which can limit their capacity to resolve ambiguous inputs. We introduce Contextual Backpropagation Loops (CBLs) as an iterative mechanism that incorporates top-down feedback to refine intermediate representations, thereby improving accuracy and robustness. This repeated process mirrors how humans continuously re-interpret sensory information in daily life-by checking and re-checking our perceptions using contextual cues. Our results suggest that CBLs can offer a straightforward yet powerful way to incorporate such contextual reasoning in modern deep learning architectures.

[LG-7] LASE: Learned Adjacency Spectral Embeddings

链接: https://arxiv.org/abs/2412.17734
作者: Sofía Pérez Casulo,Marcelo Fiori,Federico Larroca,Gonzalo Mateos
关键词: nodal Adjacency Spectral, Adjacency Spectral Embeddings, learn nodal Adjacency, nodal Adjacency, graph neural network
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We put forth a principled design of a neural architecture to learn nodal Adjacency Spectral Embeddings (ASE) from graph inputs. By bringing to bear the gradient descent (GD) method and leveraging the principle of algorithm unrolling, we truncate and re-interpret each GD iteration as a layer in a graph neural network (GNN) that is trained to approximate the ASE. Accordingly, we call the resulting embeddings and our parametric model Learned ASE (LASE), which is interpretable, parameter efficient, robust to inputs with unobserved edges, and offers controllable complexity during inference. LASE layers combine Graph Convolutional Network (GCN) and fully-connected Graph Attention Network (GAT) modules, which is intuitively pleasing since GCN-based local aggregations alone are insufficient to express the sought graph eigenvectors. We propose several refinements to the unrolled LASE architecture (such as sparse attention in the GAT module and decoupled layerwise parameters) that offer favorable approximation error versus computation tradeoffs; even outperforming heavily-optimized eigendecomposition routines from scientific computing libraries. Because LASE is a differentiable function with respect to its parameters as well as its graph input, we can seamlessly integrate it as a trainable module within a larger (semi-)supervised graph representation learning pipeline. The resulting end-to-end system effectively learns ``discriminative ASEs’’ that exhibit competitive performance in supervised link prediction and node classification tasks, outperforming a GNN even when the latter is endowed with open loop, meaning task-agnostic, precomputed spectral positional encodings.

[LG-8] Asynchronous Federated Learning: A Scalable Approach for Decentralized Machine Learning

链接: https://arxiv.org/abs/2412.17723
作者: Ali Forootani,Raffaele Iervolino
关键词: enabling collaborative model, sharing raw data, Asynchronous Federated Learning, Federated Learning, enabling collaborative
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a powerful paradigm for decentralized machine learning, enabling collaborative model training across diverse clients without sharing raw data. However, traditional FL approaches often face limitations in scalability and efficiency due to their reliance on synchronous client updates, which can result in significant delays and increased communication overhead, particularly in heterogeneous and dynamic environments. To address these challenges in this paper, we propose an Asynchronous Federated Learning (AFL) algorithm, which allows clients to update the global model independently and asynchronously. Our key contributions include a comprehensive convergence analysis of AFL in the presence of client delays and model staleness. By leveraging martingale difference sequence theory and variance bounds, we ensure robust convergence despite asynchronous updates. Assuming strongly convex local objective functions, we establish bounds on gradient variance under random client sampling and derive a recursion formula quantifying the impact of client delays on convergence. Furthermore, we demonstrate the practical applicability of AFL by training a decentralized Long Short-Term Memory (LSTM)-based deep learning model on the CMIP6 climate dataset, effectively handling non-IID and geographically distributed data. The proposed AFL algorithm addresses key limitations of traditional FL methods, such as inefficiency due to global synchronization and susceptibility to client drift. It enhances scalability, robustness, and efficiency in real-world settings with heterogeneous client populations and dynamic network conditions. Our results underscore the potential of AFL to drive advancements in distributed learning systems, particularly for large-scale, privacy-preserving applications in resource-constrained environments. Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY) Cite as: arXiv:2412.17723 [cs.LG] (or arXiv:2412.17723v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2412.17723 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-9] Fast Causal Discovery by Approximate Kernel-based Generalized Score Functions with Linear Computational Complexity

链接: https://arxiv.org/abs/2412.17717
作者: Yixin Ren,Haocheng Zhang,Yewei Xia,Hao Zhang,Jihong Guan,Shuigeng Zhou
关键词: evaluating candidate graphs, Score-based causal discovery, kernel-based generalized score, effectively identify causal, identify causal relationships
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Score-based causal discovery methods can effectively identify causal relationships by evaluating candidate graphs and selecting the one with the highest score. One popular class of scores is kernel-based generalized score functions, which can adapt to a wide range of scenarios and work well in practice because they circumvent assumptions about causal mechanisms and data distributions. Despite these advantages, kernel-based generalized score functions pose serious computational challenges in time and space, with a time complexity of \mathcalO(n^3) and a memory complexity of \mathcalO(n^2) , where n is the sample size. In this paper, we propose an approximate kernel-based generalized score function with \mathcalO(n) time and space complexities by using low-rank technique and designing a set of rules to handle the complex composite matrix operations required to calculate the score, as well as developing sampling algorithms for different data types to benefit the handling of diverse data types efficiently. Our extensive causal discovery experiments on both synthetic and real-world data demonstrate that compared to the state-of-the-art method, our method can not only significantly reduce computational costs, but also achieve comparable accuracy, especially for large datasets.

[LG-10] COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Learning

链接: https://arxiv.org/abs/2412.17684
作者: Arnav M. Das,Gantavya Bhatt,Lilly Kumari,Sahil Verma,Jeff Bilmes
关键词: large auxiliary pools, retrieving additional data, low-data regime, practice of retrieving, retrieving additional
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval augmentation, the practice of retrieving additional data from large auxiliary pools, has emerged as an effective technique for enhancing model performance in the low-data regime, e.g. few-shot learning. Prior approaches have employed only nearest-neighbor based strategies for data selection, which retrieve auxiliary samples with high similarity to instances in the target task. However, these approaches are prone to selecting highly redundant samples, since they fail to incorporate any notion of diversity. In our work, we first demonstrate that data selection strategies used in prior retrieval-augmented few-shot learning settings can be generalized using a class of functions known as Combinatorial Mutual Information (CMI) measures. We then propose COBRA (COmBinatorial Retrieval Augmentation), which employs an alternative CMI measure that considers both diversity and similarity to a target dataset. COBRA consistently outperforms previous retrieval approaches across image classification tasks and few-shot learning techniques when used to retrieve samples from LAION-2B. COBRA introduces negligible computational overhead to the cost of retrieval while providing significant gains in downstream model performance.

[LG-11] Benchmarking Generative AI Models for Deep Learning Test Input Generation

链接: https://arxiv.org/abs/2412.17652
作者: Maryam,Matteo Biagiola,Andrea Stocco,Vincenzo Riccio
关键词: provide correct predictions, Test Input Generators, ability of Deep, Deep Learning, Input Generators
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted at the 18th IEEE International Conference on Software Testing, Verification and Validation (ICST 2025)

点击查看摘要

Abstract:Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training. In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and quality of the generated test images, in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs. Comments: Accepted at the 18th IEEE International Conference on Software Testing, Verification and Validation (ICST 2025) Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE) ACMclasses: D.2.5 Cite as: arXiv:2412.17652 [cs.LG] (or arXiv:2412.17652v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2412.17652 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-12] Rate of Model Collapse in Recursive Training

链接: https://arxiv.org/abs/2412.17646
作者: Ananda Theertha Suresh,Andrew Thangaraj,Aditya Nanda Kishore Khandavally
关键词: creating synthetic data, synthetic data generated, synthetic data, machine learning models, creating synthetic
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 27 pages

点击查看摘要

Abstract:Given the ease of creating synthetic data from machine learning models, new models can be potentially trained on synthetic data generated by previous models. This recursive training process raises concerns about the long-term impact on model quality. As models are recursively trained on generated data from previous rounds, their ability to capture the nuances of the original human-generated data may degrade. This is often referred to as \emphmodel collapse. In this work, we ask how fast model collapse occurs for some well-studied distribution families under maximum likelihood (ML or near ML) estimation during recursive training. Surprisingly, even for fundamental distributions such as discrete and Gaussian distributions, the exact rate of model collapse is unknown. In this work, we theoretically characterize the rate of collapse in these fundamental settings and complement it with experimental evaluations. Our results show that for discrete distributions, the time to forget a word is approximately linearly dependent on the number of times it occurred in the original corpus, and for Gaussian models, the standard deviation reduces to zero roughly at n iterations, where n is the number of samples at each iteration. Both of these findings imply that model forgetting, at least in these simple distributions under near ML estimation with many samples, takes a long time.

[LG-13] Be More Diverse than the Most Diverse: Online Selection of Diverse Mixtures of Generative Models

链接: https://arxiv.org/abs/2412.17622
作者: Parham Rezaei,Farzan Farnia,Cheuk Ting Li
关键词: well-trained generation models, generative models requires, mechanism to form, form a single, group of well-trained
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The availability of multiple training algorithms and architectures for generative models requires a selection mechanism to form a single model over a group of well-trained generation models. The selection task is commonly addressed by identifying the model that maximizes an evaluation score based on the diversity and quality of the generated data. However, such a best-model identification approach overlooks the possibility that a mixture of available models can outperform each individual model. In this work, we explore the selection of a mixture of multiple generative models and formulate a quadratic optimization problem to find an optimal mixture model achieving the maximum of kernel-based evaluation scores including kernel inception distance (KID) and Rényi kernel entropy (RKE). To identify the optimal mixture of the models using the fewest possible sample queries, we propose an online learning approach called Mixture Upper Confidence Bound (Mixture-UCB). Specifically, our proposed online learning method can be extended to every convex quadratic function of the mixture weights, for which we prove a concentration bound to enable the application of the UCB approach. We prove a regret bound for the proposed Mixture-UCB algorithm and perform several numerical experiments to show the success of the proposed Mixture-UCB method in finding the optimal mixture of text-based and image-based generative models. The codebase is available at this https URL .

[LG-14] Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

链接: https://arxiv.org/abs/2412.17613
作者: Lawrence Wang,Stephen J. Roberts
关键词: training loss decreases, loss decreases monotonically, Traditional analyses, critical learning-rate threshold, descent optimization show
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian - often referred to as the sharpness - is below a critical learning-rate threshold, then training is ‘stable’ and training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotate. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth 1, unstable growth in parameters cause rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find these lead to excellent generalization performance on modern benchmark datasets.

[LG-15] owards Foundation Models on Graphs: An Analysis on Cross-Dataset Transfer of Pretrained GNNs NEURIPS2024

链接: https://arxiv.org/abs/2412.17609
作者: Fabrizio Frasca,Fabian Jogl,Moshe Eliasof,Matan Ostrovsky,Carola-Bibiane Schönlieb,Thomas Gärtner,Haggai Maron
关键词: Graph Neural Networks, Graph Foundation Models, pretrained Graph Neural, Neural Networks, Graph Foundation
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted and presented at the NeurIPS 2024 workshop “Symmetry and Geometry in Neural Representations” (NeuReps 2024)

点击查看摘要

Abstract:To develop a preliminary understanding towards Graph Foundation Models, we study the extent to which pretrained Graph Neural Networks can be applied across datasets, an effort requiring to be agnostic to dataset-specific features and their encodings. We build upon a purely structural pretraining approach and propose an extension to capture feature information while still being feature-agnostic. We evaluate pretrained models on downstream tasks for varying amounts of training samples and choices of pretraining datasets. Our preliminary results indicate that embeddings from pretrained models improve generalization only with enough downstream data points and in a degree which depends on the quantity and properties of pretraining data. Feature information can lead to improvements, but currently requires some similarities between pretraining and downstream feature spaces.

[LG-16] EasyTime: Time Series Forecasting Made Easy ICDE2025

链接: https://arxiv.org/abs/2412.17603
作者: Xiangfei Qiu,Xiuwen Li,Ruiyang Pang,Zhicheng Pan,Xingjian Wu,Liu Yang,Jilin Hu,Yang Shu,Xuesong Lu,Chengcheng Yang,Chenjuan Guo,Aoying Zhou,Christian S. Jensen,Bin Yang
关键词: Time series forecasting, Time series, forecasting, series forecasting, series
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ICDE2025

点击查看摘要

Abstract:Time series forecasting has important applications across diverse domains. EasyTime, the system we demonstrate, facilitates easy use of time-series forecasting methods by researchers and practitioners alike. First, EasyTime enables one-click evaluation, enabling researchers to evaluate new forecasting methods using the suite of diverse time series datasets collected in the preexisting time series forecasting benchmark (TFB). This is achieved by leveraging TFB’s flexible and consistent evaluation pipeline. Second, when practitioners must perform forecasting on a new dataset, a nontrivial first step is often to find an appropriate forecasting method. EasyTime provides an Automated Ensemble module that combines the promising forecasting methods to yield superior forecasting accuracy compared to individual methods. Third, EasyTime offers a natural language QA module leveraging large language models. Given a question like “Which method is best for long term forecasting on time series with strong seasonality?”, EasyTime converts the question into SQL queries on the database of results obtained by TFB and then returns an answer in natural language and charts. By demonstrating EasyTime, we intend to show how it is possible to simplify the use of time series forecasting and to offer better support for the development of new generations of time series forecasting methods.

[LG-17] Graph Size-imbalanced Learning with Energy-guided Structural Smoothing WSDM’25

链接: https://arxiv.org/abs/2412.17591
作者: Jiawen Qin,Pengfeng Huang,Qingyun Sun,Cheng Ji,Xingcheng Fu,Jianxin Li
关键词: simulate numerous systems, prevalent data structure, data structure employed, Graph Neural Networks, relationships between entities
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by the 18th ACM International Conference on Web Search and Data Mining (WSDM’25)

点击查看摘要

Abstract:Graph is a prevalent data structure employed to represent the relationships between entities, frequently serving as a tool to depict and simulate numerous systems, such as molecules and social networks. However, real-world graphs usually suffer from the size-imbalanced problem in the multi-graph classification, i.e., a long-tailed distribution with respect to the number of nodes. Recent studies find that off-the-shelf Graph Neural Networks (GNNs) would compromise model performance under the long-tailed settings. We investigate this phenomenon and discover that the long-tailed graph distribution greatly exacerbates the discrepancies in structural features. To alleviate this problem, we propose a novel energy-based size-imbalanced learning framework named \textbfSIMBA, which smooths the features between head and tail graphs and re-weights them based on the energy propagation. Specifically, we construct a higher-level graph abstraction named \textitGraphs-to-Graph according to the correlations between graphs to link independent graphs and smooths the structural discrepancies. We further devise an energy-based message-passing belief propagation method for re-weighting lower compatible graphs in the training process and further smooth local feature discrepancies. Extensive experimental results over five public size-imbalanced datasets demonstrate the superior effectiveness of the model for size-imbalanced graph classification tasks.

[LG-18] HPCNeuroNet: A Neuromorphic Approach Merging SNN Temporal Dynamics with Transformer Attention for FPGA-based Particle Physics

链接: https://arxiv.org/abs/2412.17571
作者: Murat Isik,Hiruna Vishwamith,Jonathan Naoukin,I. Can Dikmen
关键词: Spiking Neural Networks, Neural Networks, Spiking Neural, fusion of Spiking, paper presents
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:This paper presents the innovative HPCNeuroNet model, a pioneering fusion of Spiking Neural Networks (SNNs), Transformers, and high-performance computing tailored for particle physics, particularly in particle identification from detector responses. Our approach leverages SNNs’ intrinsic temporal dynamics and Transformers’ robust attention mechanisms to enhance performance when discerning intricate particle interactions. At the heart of HPCNeuroNet lies the integration of the sequential dynamism inherent in SNNs with the context-aware attention capabilities of Transformers, enabling the model to precisely decode and interpret complex detector data. HPCNeuroNet is realized through the HLS4ML framework and optimized for deployment in FPGA environments. The model accuracy and scalability are also enhanced by this architectural choice. Benchmarked against machine learning models, HPCNeuroNet showcases better performance metrics, underlining its transformative potential in high-energy physics. We demonstrate that the combination of SNNs, Transformers, and FPGA-based high-performance computing in particle physics signifies a significant step forward and provides a strong foundation for future research.

[LG-19] GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference

链接: https://arxiv.org/abs/2412.17560
作者: Chao Zeng,Songwei Liu,Shu Yang,Fangmin Chen,Xing Mei,Lean Fu
关键词: large language models, risen substantially, rapid growth, scale and complexity, complexity of large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid growth in the scale and complexity of large language models (LLMs), the costs of training and inference have risen substantially. Model compression has emerged as a mainstream solution to reduce memory usage and computational overhead. This paper presents Group Quantization and Sparse Acceleration (\textbfGQSA), a novel compression technique tailored for LLMs. Traditional methods typically focus exclusively on either quantization or sparsification, but relying on a single strategy often results in significant performance loss at high compression rates. In contrast, GQSA integrates quantization and sparsification in a tightly coupled manner, leveraging GPU-friendly structured group sparsity and quantization for efficient acceleration. The proposed method consists of three key steps. First, GQSA applies group structured pruning to adhere to GPU-friendly sparse pattern constraints. Second, a two-stage sparsity-aware training process is employed to maximize performance retention after compression. Finally, the framework adopts the Block Sparse Row (BSR) format to enable practical deployment and efficient execution. Experimental results on the LLaMA model family show that GQSA achieves an excellent balance between model speed and accuracy. Furthermore, on the latest LLaMA-3 and LLaMA-3.1 models, GQSA outperforms existing LLM compression techniques significantly.

[LG-20] Leveraging Cardiovascular Simulations for In-Vivo Prediction of Cardiac Biomarkers

链接: https://arxiv.org/abs/2412.17542
作者: Laura Manduchi,Antoine Wehenkel,Jens Behrmann,Luca Pegolotti,Andy C. Miller,Ozan Sener,Marco Cuturi,Guillermo Sapiro,Jörn-Henrik Jacobsen
关键词: Whole-body hemodynamics simulators, studying cardiovascular systems, Whole-body hemodynamics, arterial pressure waveforms, model blood flow
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Biological Physics (physics.bio-ph)
*备注:

点击查看摘要

Abstract:Whole-body hemodynamics simulators, which model blood flow and pressure waveforms as functions of physiological parameters, are now essential tools for studying cardiovascular systems. However, solving the corresponding inverse problem of mapping observations (e.g., arterial pressure waveforms at specific locations in the arterial network) back to plausible physiological parameters remains challenging. Leveraging recent advances in simulation-based inference, we cast this problem as statistical inference by training an amortized neural posterior estimator on a newly built large dataset of cardiac simulations that we publicly release. To better align simulated data with real-world measurements, we incorporate stochastic elements modeling exogenous effects. The proposed framework can further integrate in-vivo data sources to refine its predictive capabilities on real-world data. In silico, we demonstrate that the proposed framework enables finely quantifying uncertainty associated with individual measurements, allowing trustworthy prediction of four biomarkers of clinical interest–namely Heart Rate, Cardiac Output, Systemic Vascular Resistance, and Left Ventricular Ejection Time–from arterial pressure waveforms and photoplethysmograms. Furthermore, we validate the framework in vivo, where our method accurately captures temporal trends in CO and SVR monitoring on the VitalDB dataset. Finally, the predictive error made by the model monotonically increases with the predicted uncertainty, thereby directly supporting the automatic rejection of unusable measurements.

[LG-21] An efficient search-and-score algorithm for ancestral graphs using multivariate information scores

链接: https://arxiv.org/abs/2412.17508
作者: Nikita Lagrange,Herve Isambert
关键词: unobserved latent variables, propose a greedy, originating from unobserved, latent variables, include directed
类目: Machine Learning (cs.LG)
*备注: 22 pages, 4 figures

点击查看摘要

Abstract:We propose a greedy search-and-score algorithm for ancestral graphs, which include directed as well as bidirected edges, originating from unobserved latent variables. The normalized likelihood score of ancestral graphs is estimated in terms of multivariate information over relevant ``ac-connected subsets’’ of vertices, C, that are connected through collider paths confined to the ancestor set of C. For computational efficiency, the proposed two-step algorithm relies on local information scores limited to the close surrounding vertices of each node (step 1) and edge (step 2). This computational strategy, although restricted to information contributions from ac-connected subsets containing up to two-collider paths, is shown to outperform state-of-the-art causal discovery methods on challenging benchmark datasets.

[LG-22] Improving the Noise Estimation of Latent Neural Stochastic Differential Equations

链接: https://arxiv.org/abs/2412.17499
作者: Linus Heck,Maximilian Gelbrecht,Michael T. Schaub,Niklas Boers
关键词: time series data, stochastic differential equations, stochastic time series, learning generative models, differential equations
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Latent neural stochastic differential equations (SDEs) have recently emerged as a promising approach for learning generative models from stochastic time series data. However, they systematically underestimate the noise level inherent in such data, limiting their ability to capture stochastic dynamics accurately. We investigate this underestimation in detail and propose a straightforward solution: by including an explicit additional noise regularization in the loss function, we are able to learn a model that accurately captures the diffusion component of the data. We demonstrate our results on a conceptual model system that highlights the improved latent neural SDE’s capability to model stochastic bistable dynamics.

[LG-23] A Temporal Convolutional Network-based Approach for Network Intrusion Detection

链接: https://arxiv.org/abs/2412.17452
作者: Rukmini Nazre,Rujuta Budke,Omkar Oak,Suraj Sawant,Amit Joshi
关键词: poses significant challenges, securing modern networks, traffic poses significant, Temporal Convolutional Network, network traffic poses
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Paper presented at IEEE 2nd International Conference on Integrated Intelligence and Communication Systems (ICIICS) 2024

点击查看摘要

Abstract:Network intrusion detection is critical for securing modern networks, yet the complexity of network traffic poses significant challenges to traditional methods. This study proposes a Temporal Convolutional Network(TCN) model featuring a residual block architecture with dilated convolutions to capture dependencies in network traffic data while ensuring training stability. The TCN’s ability to process sequences in parallel enables faster, more accurate sequence modeling than Recurrent Neural Networks. Evaluated on the Edge-IIoTset dataset, which includes 15 classes with normal traffic and 14 cyberattack types, the proposed model achieved an accuracy of 96.72% and a loss of 0.0688, outperforming 1D CNN, CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-GRU-LSTM models. A class-wise classification report, encompassing metrics such as recall, precision, accuracy, and F1-score, demonstrated the TCN model’s superior performance across varied attack categories, including Malware, Injection, and DDoS. These results underscore the model’s potential in addressing the complexities of network intrusion detection effectively.

[LG-24] How Green Can AI Be? A Study of Trends in Machine Learning Environmental Impacts

链接: https://arxiv.org/abs/2412.17376
作者: Clément Morand(STL),Anne-Laure Ligozat(ENSIIE, LISN, STL),Aurélie Névéol(STL, LISN)
关键词: training Artificial Intelligence, Artificial Intelligence, environmental impacts, training Artificial, impacts
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The compute requirements associated with training Artificial Intelligence (AI) models have increased exponentially over time. Optimisation strategies aim to reduce the energy consumption and environmental impacts associated with AI, possibly shifting impacts from the use phase to the manufacturing phase in the life-cycle of hardware. This paper investigates the evolution of individual graphics cards production impacts and of the environmental impacts associated with training Machine Learning (ML) models over time. We collect information on graphics cards used to train ML models and released between 2013 and 2023. We assess the environmental impacts associated with the production of each card to visualize the trends on the same period. Then, using information on notable AI systems from the Epoch AI dataset we assess the environmental impacts associated with training each system. The environmental impacts of graphics cards production have increased continuously. The energy consumption and environmental impacts associated with training models have increased exponentially, even when considering reduction strategies such as location shifting to places with less carbon intensive electricity mixes. These results suggest that current impact reduction strategies cannot curb the growth in the environmental impacts of AI. This is consistent with rebound effect, where the efficiency increases fuel the creation of even larger models thereby cancelling the potential impact reduction. Furthermore, these results highlight the importance of considering the impacts of hardware over the entire life-cycle rather than the sole usage phase in order to avoid impact shifting. The environmental impact of AI cannot be reduced without reducing AI activities as well as increasing efficiency.

[LG-25] Bi-Directional Multi-Scale Graph Dataset Condensation via Information Bottleneck AAAI AAAI-2025

链接: https://arxiv.org/abs/2412.17355
作者: Xingcheng Fu,Yisen Gao,Beining Yang,Yuxuan Wu,Haodong Qian,Qingyun Sun,Xianxian Li
关键词: significantly improved model, computing power brings, multi-scale graph dataset, graph dataset condensation, model training efficiency
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Accepted by the Main Technical Track of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-2025)

点击查看摘要

Abstract:Dataset condensation has significantly improved model training efficiency, but its application on devices with different computing power brings new requirements for different data sizes. Thus, condensing multiple scale graphs simultaneously is the core of achieving efficient training in different on-device scenarios. Existing efficient works for multi-scale graph dataset condensation mainly perform efficient approximate computation in scale order (large-to-small or small-to-large scales). However, for non-Euclidean structures of sparse graph data, these two commonly used paradigms for multi-scale graph dataset condensation have serious scaling down degradation and scaling up collapse problems of a graph. The main bottleneck of the above paradigms is whether the effective information of the original graph is fully preserved when consenting to the primary sub-scale (the first of multiple scales), which determines the condensation effect and consistency of all scales. In this paper, we proposed a novel GNN-centric Bi-directional Multi-Scale Graph Dataset Condensation (BiMSGC) framework, to explore unifying paradigms by operating on both large-to-small and small-to-large for multi-scale graph condensation. Based on the mutual information theory, we estimate an optimal meso-scale'' to obtain the minimum necessary dense graph preserving the maximum utility information of the original graph, and then we achieve stable and consistent bi-directional’’ condensation learning by optimizing graph eigenbasis matching with information bottleneck on other scales. Encouraging empirical results on several datasets demonstrates the significant superiority of the proposed framework in graph condensation at different scales.

[LG-26] ORIGAMI: A generative transformer architecture for predictions from semi-structured data

链接: https://arxiv.org/abs/2412.17348
作者: Thomas Rückstieß,Alana Huang,Robin Vujanic
关键词: supervised learning applied, Generative Autoregressive ModellIng, learning applied directly, semi-structured data formats, popularity and widespread
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the popularity and widespread use of semi-structured data formats such as JSON, end-to-end supervised learning applied directly to such data remains underexplored. We present ORIGAMI (Object RepresentatIon via Generative Autoregressive ModellIng), a transformer-based architecture that directly processes nested key/value pairs while preserving their hierarchical semantics. Our key technical contributions include: (1) a structure-preserving tokenizer, (2) a novel key/value position encoding scheme, and (3) a grammar-constrained training and inference framework that ensures valid outputs and accelerates training convergence. These enhancements enable efficient end-to-end modeling of semi-structured data. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks without architectural modifications. Empirical evaluation across diverse domains demonstrates ORIGAMI’s effectiveness: On standard tabular benchmarks converted to JSON, ORIGAMI remains competitive with classical and state-of-the-art approaches. On native JSON datasets, we outperform baselines on multi-label classification and specialized models such as convolutional and graph neural networks on a code classification task. Through extensive ablation studies, we validate the impact of each architectural component and establish ORIGAMI as a robust framework for end-to-end learning on semi-structured data.

[LG-27] On the Power and Limitations of Examples for Description Logic Concepts IJCAI

链接: https://arxiv.org/abs/2412.17345
作者: Balder ten Cate,Raoul Koudijs,Ana Ozaki
关键词: communicating complex concepts, positive and negative, attractive medium, medium for communicating, communicating complex
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: Published in the Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI)

点击查看摘要

Abstract:Labeled examples (i.e., positive and negative examples) are an attractive medium for communicating complex concepts. They are useful for deriving concept expressions (such as in concept learning, interactive concept specification, and concept refinement) as well as for illustrating concept expressions to a user or domain expert. We investigate the power of labeled examples for describing description-logic concepts. Specifically, we systematically study the existence and efficient computability of finite characterisations, i.e. finite sets of labeled examples that uniquely characterize a single concept, for a wide variety of description logics between EL and ALCQI, both without an ontology and in the presence of a DL-Lite ontology. Finite characterisations are relevant for debugging purposes, and their existence is a necessary condition for exact learnability with membership queries.

[LG-28] Reinforcement Learning with a Focus on Adjusting Policies to Reach Targets

链接: https://arxiv.org/abs/2412.17344
作者: Akane Tsuboya,Yu Kono,Tatsuji Takahashi
关键词: discover better actions, reinforcement learning, exploration, learning, method
类目: Machine Learning (cs.LG)
*备注: Accepted by AROB-ISBC 2025

点击查看摘要

Abstract:The objective of a reinforcement learning agent is to discover better actions through exploration. However, typical exploration techniques aim to maximize rewards, often incurring high costs in both exploration and learning processes. We propose a novel deep reinforcement learning method, which prioritizes achieving an aspiration level over maximizing expected return. This method flexibly adjusts the degree of exploration based on the proportion of target achievement. Through experiments on a motion control task and a navigation task, this method achieved returns equal to or greater than other standard methods. The results of the analysis showed two things: our method flexibly adjusts the exploration scope, and it has the potential to enable the agent to adapt to non-stationary environments. These findings indicated that this method may have effectiveness in improving exploration efficiency in practical applications of reinforcement learning.

[LG-29] Better Knowledge Enhancement for Privacy-Preserving Cross-Project Defect Prediction

链接: https://arxiv.org/abs/2412.17317
作者: Yuying Wang,Yichen Li,Haozhao Wang,Lei Zhao,Xiaofang Zhang
关键词: reliable defect predictor, Cross-Project Defect Prediction, poses a non-trivial, Defect Prediction, non-trivial challenge
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Cross-Project Defect Prediction (CPDP) poses a non-trivial challenge to construct a reliable defect predictor by leveraging data from other projects, particularly when data owners are concerned about data privacy. In recent years, Federated Learning (FL) has become an emerging paradigm to guarantee privacy information by collaborative training a global model among multiple parties without sharing raw data. While the direct application of FL to the CPDP task offers a promising solution to address privacy concerns, the data heterogeneity arising from proprietary projects across different companies or organizations will bring troubles for model training. In this paper, we study the privacy-preserving cross-project defect prediction with data heterogeneity under the federated learning framework. To address this problem, we propose a novel knowledge enhancement approach named FedDP with two simple but effective solutions: 1. Local Heterogeneity Awareness and 2. Global Knowledge Distillation. Specifically, we employ open-source project data as the distillation dataset and optimize the global model with the heterogeneity-aware local model ensemble via knowledge distillation. Experimental results on 19 projects from two datasets demonstrate that our method significantly outperforms baselines.

[LG-30] Collaborative Optimization in Financial Data Mining Through Deep Learning and ResNeXt

链接: https://arxiv.org/abs/2412.17314
作者: Pengbin Feng,Yankaiqi Li,Yijiashun Qi,Xiaojun Guo,Zhenghao Lin
关键词: financial data mining, financial data, data mining, learning framework based, aiming to solve
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:This study proposes a multi-task learning framework based on ResNeXt, aiming to solve the problem of feature extraction and task collaborative optimization in financial data mining. Financial data usually has the complex characteristics of high dimensionality, nonlinearity, and time series, and is accompanied by potential correlations between multiple tasks, making it difficult for traditional methods to meet the needs of data mining. This study introduces the ResNeXt model into the multi-task learning framework and makes full use of its group convolution mechanism to achieve efficient extraction of local patterns and global features of financial data. At the same time, through the design of task sharing layers and dedicated layers, it is established between multiple related tasks. Deep collaborative optimization relationships. Through flexible multi-task loss weight design, the model can effectively balance the learning needs of different tasks and improve overall performance. Experiments are conducted on a real SP 500 financial data set, verifying the significant advantages of the proposed framework in classification and regression tasks. The results indicate that, when compared to other conventional deep learning models, the proposed method delivers superior performance in terms of accuracy, F1 score, root mean square error, and other metrics, highlighting its outstanding effectiveness and robustness in handling complex financial data. This research provides an efficient and adaptable solution for financial data mining, and at the same time opens up a new research direction for the combination of multi-task learning and deep learning, which has important theoretical significance and practical application value.

[LG-31] Improving Pareto Set Learning for Expensive Multi-objective Optimization via Stein Variational Hypernetworks AAAI-25

链接: https://arxiv.org/abs/2412.17312
作者: Minh-Duc Nguyen,Phuong Mai Dinh,Quang-Huy Nguyen,Long P. Hoang,Dung D. Le
关键词: evaluating objective functions, involves extensive computations, Pareto set learning, scenarios where evaluating, costly and involves
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: Accepted to AAAI-25

点击查看摘要

Abstract:Expensive multi-objective optimization problems (EMOPs) are common in real-world scenarios where evaluating objective functions is costly and involves extensive computations or physical experiments. Current Pareto set learning methods for such problems often rely on surrogate models like Gaussian processes to approximate the objective functions. These surrogate models can become fragmented, resulting in numerous small uncertain regions between explored solutions. When using acquisition functions such as the Lower Confidence Bound (LCB), these uncertain regions can turn into pseudo-local optima, complicating the search for globally optimal solutions. To address these challenges, we propose a novel approach called SVH-PSL, which integrates Stein Variational Gradient Descent (SVGD) with Hypernetworks for efficient Pareto set learning. Our method addresses the issues of fragmented surrogate models and pseudo-local optima by collectively moving particles in a manner that smooths out the solution space. The particles interact with each other through a kernel function, which helps maintain diversity and encourages the exploration of underexplored regions. This kernel-based interaction prevents particles from clustering around pseudo-local optima and promotes convergence towards globally optimal solutions. Our approach aims to establish robust relationships between trade-off reference vectors and their corresponding true Pareto solutions, overcoming the limitations of existing methods. Through extensive experiments across both synthetic and real-world MOO benchmarks, we demonstrate that SVH-PSL significantly improves the quality of the learned Pareto set, offering a promising solution for expensive multi-objective optimization problems.

[LG-32] Non-Convex Tensor Recovery from Local Measurements AAAI2025

链接: https://arxiv.org/abs/2412.17281
作者: Tongle Wu,Ying Sun,Jicong Fan
关键词: compressed sensing model, tensor compressed sensing, kappa, mutually independent matrices, sensing model
类目: Machine Learning (cs.LG)
*备注: The paper was accepted by AAAI 2025

点击查看摘要

Abstract:Motivated by the settings where sensing the entire tensor is infeasible, this paper proposes a novel tensor compressed sensing model, where measurements are only obtained from sensing each lateral slice via mutually independent matrices. Leveraging the low tubal rank structure, we reparameterize the unknown tensor \boldsymbol \mathcal X^\star using two compact tensor factors and formulate the recovery problem as a nonconvex minimization problem. To solve the problem, we first propose an alternating minimization algorithm, termed \textsfAlt-PGD-Min, that iteratively optimizes the two factors using a projected gradient descent and an exact minimization step, respectively. Despite nonconvexity, we prove that \textsfAlt-PGD-Min achieves \epsilon -accuracy recovery with \mathcal O\left( \kappa^2 \log \frac1\epsilon\right) iteration complexity and \mathcal O\left( \kappa^6rn_3\log n_3 \left( \kappa^2r\left(n_1 + n_2 \right) + n_1 \log \frac1\epsilon\right) \right) sample complexity, where \kappa denotes tensor condition number of \boldsymbol\mathcal X^\star . To further accelerate the convergence, especially when the tensor is ill-conditioned with large \kappa , we prove \textsfAlt-ScalePGD-Min that preconditions the gradient update using an approximate Hessian that can be computed efficiently. We show that \textsfAlt-ScalePGD-Min achieves \kappa independent iteration complexity \mathcal O(\log \frac1\epsilon) and improves the sample complexity to \mathcal O\left( \kappa^4 rn_3 \log n_3 \left( \kappa^4r(n_1+n_2) + n_1 \log \frac1\epsilon\right) \right) . Experiments validate the effectiveness of the proposed methods.

[LG-33] Multi-view Fuzzy Graph Attention Networks for Enhanced Graph Learning

链接: https://arxiv.org/abs/2412.17271
作者: Jinming Xing,Dongwen Luo,Qisen Cheng,Chang Xue,Ruilin Xing
关键词: Fuzzy Rough Sets, Graph Attention Network, combines Fuzzy Rough, Rough Sets, Attention Network
类目: Machine Learning (cs.LG)
*备注: ISMSI’25

点击查看摘要

Abstract:Fuzzy Graph Attention Network (FGAT), which combines Fuzzy Rough Sets and Graph Attention Networks, has shown promise in tasks requiring robust graph-based learning. However, existing models struggle to effectively capture dependencies from multiple perspectives, limiting their ability to model complex data. To address this gap, we propose the Multi-view Fuzzy Graph Attention Network (MFGAT), a novel framework that constructs and aggregates multi-view information using a specially designed Transformation Block. This block dynamically transforms data from multiple aspects and aggregates the resulting representations via a weighted sum mechanism, enabling comprehensive multi-view modeling. The aggregated information is fed into FGAT to enhance fuzzy graph convolutions. Additionally, we introduce a simple yet effective learnable global pooling mechanism for improved graph-level understanding. Extensive experiments on graph classification tasks demonstrate that MFGAT outperforms state-of-the-art baselines, underscoring its effectiveness and versatility.

[LG-34] A Coalition Game for On-demand Multi-modal 3D Automated Delivery System

链接: https://arxiv.org/abs/2412.17252
作者: Farzan Moosavi,Bilal Farooq
关键词: including high-density areas, real-world operational challenges, multi-modal autonomous delivery, autonomous delivery optimization, delivery optimization framework
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We introduce a multi-modal autonomous delivery optimization framework as a coalition game for a fleet of UAVs and ADRs operating in two overlaying networks to address last-mile delivery in urban environments, including high-density areas, road-based routing, and real-world operational challenges. The problem is defined as multiple depot pickup and delivery with time windows constrained over operational restrictions, such as vehicle battery limitation, precedence time window, and building obstruction. Subsequently, the coalition game theory is applied to investigate cooperation structures among the modes to capture how strategic collaboration among vehicles can improve overall routing efficiency. To do so, a generalized reinforcement learning model is designed to evaluate the cost-sharing and allocation to different coalitions for which sub-additive property and non-empty core exist. Our methodology leverages an end-to-end deep multi-agent policy gradient method augmented by a novel spatio-temporal adjacency neighbourhood graph attention network and transformer architecture using a heterogeneous edge-enhanced attention model. Conducting several numerical experiments on last-mile delivery applications, the result from the case study in the city of Mississauga shows that despite the incorporation of an extensive network in the graph for two modes and a complex training structure, the model addresses realistic operational constraints and achieves high-quality solutions compared with the existing transformer-based and heuristics methods and can perform well on non-homogeneous data distribution, generalizes well on the different scale and configuration, and demonstrate a robust performance under stochastic scenarios subject to wind speed and direction.

[LG-35] FedMeld: A Model-dispersal Federated Learning Framework for Space-ground Integrated Networks

链接: https://arxiv.org/abs/2412.17231
作者: Qian Chen,Xianhao Chen,Kaibin Huang
关键词: deliver artificial intelligence, space-ground integrated networks, mobile networks, digital divide, artificial intelligence
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
*备注: 13 pages, 7 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:To bridge the digital divide, the space-ground integrated networks (SGINs), which will be a key component of the six-generation (6G) mobile networks, are expected to deliver artificial intelligence (AI) services to every corner of the world. One mission of SGINs is to support federated learning (FL) at a global scale. However, existing space-ground integrated FL frameworks involve ground stations or costly inter-satellite links, entailing excessive training latency and communication costs. To overcome these limitations, we propose an infrastructure-free federated learning framework based on a model dispersal (FedMeld) strategy, which exploits periodic movement patterns and store-carry-forward capabilities of satellites to enable parameter mixing across large-scale geographical regions. We theoretically show that FedMeld leads to global model convergence and quantify the effects of round interval and mixing ratio between adjacent areas on its learning performance. Based on the theoretical results, we formulate a joint optimization problem to design the staleness control and mixing ratio (SC-MR) for minimizing the training loss. By decomposing the problem into sequential SC and MR subproblems without compromising the optimality, we derive the round interval solution in a closed form and the mixing ratio in a semi-closed form to achieve the \textitoptimal latency-accuracy tradeoff. Experiments using various datasets demonstrate that FedMeld achieves superior model accuracy while significantly reducing communication costs as compared with traditional FL schemes for SGINs.

[LG-36] Foundation Model for Lossy Compression of Spatiotemporal Scientific Data

链接: https://arxiv.org/abs/2412.17184
作者: Xiao Li,Jaemoon Lee,Anand Rangarajan,Sanjay Ranka
关键词: combining a variational, variational autoencoder, present a foundation, hyper-prior structure, VAE
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a foundation model (FM) for lossy scientific data compression, combining a variational autoencoder (VAE) with a hyper-prior structure and a super-resolution (SR) module. The VAE framework uses hyper-priors to model latent space dependencies, enhancing compression efficiency. The SR module refines low-resolution representations into high-resolution outputs, improving reconstruction quality. By alternating between 2D and 3D convolutions, the model efficiently captures spatiotemporal correlations in scientific data while maintaining low computational cost. Experimental results demonstrate that the FM generalizes well to unseen domains and varying data shapes, achieving up to 4 times higher compression ratios than state-of-the-art methods after domain-specific fine-tuning. The SR module improves compression ratio by 30 percent compared to simple upsampling techniques. This approach significantly reduces storage and transmission costs for large-scale scientific simulations while preserving data integrity and fidelity.

[LG-37] WPMixer: Efficient Multi-Resolution Mixing for Long-Term Time Series Forecasting AAAI-2025

链接: https://arxiv.org/abs/2412.17176
作者: Md Mahmuddun Nabi Murad,Mehmet Aktukmak,Yasin Yilmaz
关键词: Time series forecasting, power load forecasting, Time series, series forecasting, multi-resolution wavelet decomposition
类目: Machine Learning (cs.LG)
*备注: 12 pages, 3 Figures, AAAI-2025

点击查看摘要

Abstract:Time series forecasting is crucial for various applications, such as weather forecasting, power load forecasting, and financial analysis. In recent studies, MLP-mixer models for time series forecasting have been shown as a promising alternative to transformer-based models. However, the performance of these models is still yet to reach its potential. In this paper, we propose Wavelet Patch Mixer (WPMixer), a novel MLP-based model, for long-term time series forecasting, which leverages the benefits of patching, multi-resolution wavelet decomposition, and mixing. Our model is based on three key components: (i) multi-resolution wavelet decomposition, (ii) patching and embedding, and (iii) MLP mixing. Multi-resolution wavelet decomposition efficiently extracts information in both the frequency and time domains. Patching allows the model to capture an extended history with a look-back window and enhances capturing local information while MLP mixing incorporates global information. Our model significantly outperforms state-of-the-art MLP-based and transformer-based models for long-term time series forecasting in a computationally efficient way, demonstrating its efficacy and potential for practical applications.

[LG-38] Enhancing Item Tokenization for Generative Recommendation through Self-Improvement

链接: https://arxiv.org/abs/2412.17171
作者: Runjin Chen,Mingxuan Ju,Ngoc Bui,Dimosthenis Antypas,Stanley Cai,Xiaopeng Wu,Leonardo Neves,Zhangyang Wang,Neil Shah,Tong Zhao
关键词: predicting user preferences, large language models, driven by large, present an innovative, LLM
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Generative recommendation systems, driven by large language models (LLMs), present an innovative approach to predicting user preferences by modeling items as token sequences and generating recommendations in a generative manner. A critical challenge in this approach is the effective tokenization of items, ensuring that they are represented in a form compatible with LLMs. Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. While text-based representations integrate seamlessly with LLM tokenization, they are often too lengthy, leading to inefficiencies and complicating accurate generation. Numerical strings, while concise, lack semantic depth and fail to capture meaningful item relationships. Tokenizing items as sequences of newly defined tokens has gained traction, but it often requires external models or algorithms for token assignment. These external processes may not align with the LLM’s internal pretrained tokenization schema, leading to inconsistencies and reduced model performance. To address these limitations, we propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during training process. Our approach starts with item tokenizations generated by any external model and periodically adjusts these tokenizations based on the LLM’s learned patterns. Such alignment process ensures consistency between the tokenization and the LLM’s internal understanding of the items, leading to more accurate recommendations. Furthermore, our method is simple to implement and can be integrated as a plug-and-play enhancement into existing generative recommendation systems. Experimental results on multiple datasets and using various initial tokenization strategies demonstrate the effectiveness of our method, with an average improvement of 8% in recommendation performance.

[LG-39] Unifying Feature-Based Explanations with Functional ANOVA and Cooperative Game Theory

链接: https://arxiv.org/abs/2412.17152
作者: Fabian Fumagalli,Maximilian Muschalik,Eyke Hüllermeier,Barbara Hammer,Julia Herbinger
关键词: machine learning models, black box machine, box machine learning, perturbations or gradients, learning models
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Feature-based explanations, using perturbations or gradients, are a prevalent tool to understand decisions of black box machine learning models. Yet, differences between these methods still remain mostly unknown, which limits their applicability for practitioners. In this work, we introduce a unified framework for local and global feature-based explanations using two well-established concepts: functional ANOVA (fANOVA) from statistics, and the notion of value and interaction from cooperative game theory. We introduce three fANOVA decompositions that determine the influence of feature distributions, and use game-theoretic measures, such as the Shapley value and interactions, to specify the influence of higher-order interactions. Our framework combines these two dimensions to uncover similarities and differences between a wide range of explanation techniques for features and groups of features. We then empirically showcase the usefulness of our framework on synthetic and real-world datasets.

[LG-40] SplitFedZip: Learned Compression for Data Transfer Reduction in Split-Federated Learning AAAI2025

链接: https://arxiv.org/abs/2412.17150
作者: Chamani Shiranthika,Hadi Hadizadeh,Parvaneh Saeedi,Ivan V. Bajić
关键词: enables multiple clients, Federated Learning, enables multiple, train a collaborative, sharing their local
类目: Machine Learning (cs.LG)
*备注: Accepted for paper presentation at the the 1st Workshop on Federated Learning for Unbounded and Intelligent Decentralization (FLUID), in AAAI 2025

点击查看摘要

Abstract:Federated Learning (FL) enables multiple clients to train a collaborative model without sharing their local data. Split Learning (SL) allows a model to be trained in a split manner across different locations. Split-Federated (SplitFed) learning is a more recent approach that combines the strengths of FL and SL. SplitFed minimizes the computational burden of FL by balancing computation across clients and servers, while still preserving data privacy. This makes it an ideal learning framework across various domains, especially in healthcare, where data privacy is of utmost importance. However, SplitFed networks encounter numerous communication challenges, such as latency, bandwidth constraints, synchronization overhead, and a large amount of data that needs to be transferred during the learning process. In this paper, we propose SplitFedZip – a novel method that employs learned compression to reduce data transfer in SplitFed learning. Through experiments on medical image segmentation, we show that learned compression can provide a significant data communication reduction in SplitFed learning, while maintaining the accuracy of the final trained model. The implementation is available at: \urlthis https URL.

[LG-41] Empirical evaluation of normalizing flows in Markov Chain Monte Carlo

链接: https://arxiv.org/abs/2412.17136
作者: David Nabergoj,Erik Štrumbelj
关键词: Recent advances, MCMC, distant regions, enable jumps, jumps to distant
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 41 pages, 7 figures

点击查看摘要

Abstract:Recent advances in MCMC use normalizing flows to precondition target distributions and enable jumps to distant regions. However, there is currently no systematic comparison of different normalizing flow architectures for MCMC. As such, many works choose simple flow architectures that are readily available and do not consider other models. Guidelines for choosing an appropriate architecture would reduce analysis time for practitioners and motivate researchers to take the recommended models as foundations to be improved. We provide the first such guideline by extensively evaluating many normalizing flow architectures on various flow-based MCMC methods and target distributions. When the target density gradient is available, we show that flow-based MCMC outperforms classic MCMC for suitable NF architecture choices with minor hyperparameter tuning. When the gradient is unavailable, flow-based MCMC wins with off-the-shelf architectures. We find contractive residual flows to be the best general-purpose models with relatively low sensitivity to hyperparameter choice. We also provide various insights into normalizing flow behavior within MCMC when varying their hyperparameters, properties of target distributions, and the overall computational budget.

[LG-42] Fairness in Reinforcement Learning with Bisimulation Metrics

链接: https://arxiv.org/abs/2412.17123
作者: Sahand Rezaei-Shoshtari,Hanna Yurchyk,Scott Fujimoto,Doina Precup,David Meger
关键词: developing automated decision, Ensuring long-term fairness, decision making systems, crucial when developing, developing automated
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Ensuring long-term fairness is crucial when developing automated decision making systems, specifically in dynamic and sequential environments. By maximizing their reward without consideration of fairness, AI agents can introduce disparities in their treatment of groups or individuals. In this paper, we establish the connection between bisimulation metrics and group fairness in reinforcement learning. We propose a novel approach that leverages bisimulation metrics to learn reward functions and observation dynamics, ensuring that learners treat groups fairly while reflecting the original problem. We demonstrate the effectiveness of our method in addressing disparities in sequential decision making problems through empirical evaluation on a standard fairness benchmark consisting of lending and college admission scenarios.

[LG-43] Fair and Accurate Regression: Strong Formulations and Algorithms

链接: https://arxiv.org/abs/2412.17116
作者: Anna Deza,Andrés Gómez,Alper Atamtürk
关键词: incorporate fairness metrics, paper introduces mixed-integer, introduces mixed-integer optimization, mixed-integer optimization methods, fairness metrics
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces mixed-integer optimization methods to solve regression problems that incorporate fairness metrics. We propose an exact formulation for training fair regression models. To tackle this computationally hard problem, we study the polynomially-solvable single-factor and single-observation subproblems as building blocks and derive their closed convex hull descriptions. Strong formulations obtained for the general fair regression problem in this manner are utilized to solve the problem with a branch-and-bound algorithm exactly or as a relaxation to produce fair and accurate models rapidly. Moreover, to handle large-scale instances, we develop a coordinate descent algorithm motivated by the convex-hull representation of the single-factor fair regression problem to improve a given solution efficiently. Numerical experiments conducted on fair least squares and fair logistic regression problems show competitive statistical performance with state-of-the-art methods while significantly reducing training times.

[LG-44] Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps

链接: https://arxiv.org/abs/2412.17113
作者: Benjamin Ellis,Matthew T. Jackson,Andrei Lupu,Alexander D. Goldie,Mattie Fellows,Shimon Whiteson,Jakob Foerster
关键词: neural network function, network function approximators, momentum-based optimizers, common to apply, apply techniques
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In reinforcement learning (RL), it is common to apply techniques used broadly in machine learning such as neural network function approximators and momentum-based optimizers. However, such tools were largely developed for supervised learning rather than nonstationary RL, leading practitioners to adopt target networks, clipped policy updates, and other RL-specific implementation tricks to combat this mismatch, rather than directly adapting this toolchain for use in RL. In this paper, we take a different approach and instead address the effect of nonstationarity by adapting the widely used Adam optimiser. We first analyse the impact of nonstationary gradient magnitude – such as that caused by a change in target network – on Adam’s update size, demonstrating that such a change can lead to large updates and hence sub-optimal performance. To address this, we introduce Adam-Rel. Rather than using the global timestep in the Adam update, Adam-Rel uses the local timestep within an epoch, essentially resetting Adam’s timestep to 0 after target changes. We demonstrate that this avoids large updates and reduces to learning rate annealing in the absence of such increases in gradient magnitude. Evaluating Adam-Rel in both on-policy and off-policy RL, we demonstrate improved performance in both Atari and Craftax. We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.

[LG-45] MARINA-P: Superior Performance in Non-smooth Federated Optimization with Adaptive Stepsizes

链接: https://arxiv.org/abs/2412.17082
作者: Igor Sokolov,Peter Richtárik
关键词: machine learning applications, largely unexplored theoretically, remains largely unexplored, non-smooth convex setting, Non-smooth communication-efficient federated
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 49 Pages, 5 Algorithms, 4 Theorems, 6 Lemmas, 8 Figures

点击查看摘要

Abstract:Non-smooth communication-efficient federated optimization is crucial for many machine learning applications, yet remains largely unexplored theoretically. Recent advancements have primarily focused on smooth convex and non-convex regimes, leaving a significant gap in understanding the non-smooth convex setting. Additionally, existing literature often overlooks efficient server-to-worker communication (downlink), focusing primarily on worker-to-server communication (uplink). We consider a setup where uplink costs are negligible and focus on optimizing downlink communication by improving state-of-the-art schemes like EF21-P (arXiv:2209.15218) and MARINA-P (arXiv:2402.06412) in the non-smooth convex setting. We extend the non-smooth convex theory of EF21-P [Anonymous, 2024], originally developed for single-node scenarios, to the distributed setting, and extend MARINA-P to the non-smooth convex setting. For both algorithms, we prove an optimal O(1/\sqrtT) convergence rate and establish communication complexity bounds matching classical subgradient methods. We provide theoretical guarantees under constant, decreasing, and adaptive (Polyak-type) stepsizes. Our experiments demonstrate that MARINA-P with correlated compressors outperforms other methods in both smooth non-convex and non-smooth convex settings. This work presents the first theoretical results for distributed non-smooth optimization with server-to-worker compression, along with comprehensive analysis for various stepsize schemes.

[LG-46] Interactive Classification Metrics: A graphical application to build robust intuition for classification model evaluation

链接: https://arxiv.org/abs/2412.17066
作者: David H. Brown,Davide Chicco
关键词: Machine learning continues, popularity in academia, continues to grow, grow in popularity, Interactive Classification Metrics
类目: Machine Learning (cs.LG)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:Machine learning continues to grow in popularity in academia, in industry, and is increasingly used in other fields. However, most of the common metrics used to evaluate even simple binary classification models have shortcomings that are neither immediately obvious nor consistently taught to practitioners. Here we present Interactive Classification Metrics (ICM), an application to visualize and explore the relationships between different evaluation metrics. The user changes the distribution statistics and explores corresponding changes across a suite of evaluation metrics. The interactive, graphical nature of this tool emphasizes the tradeoffs of each metric without the overhead of data wrangling and model training. The goals of this application are: (1) to aid practitioners in the ever-expanding machine learning field to choose the most appropriate evaluation metrics for their classification problem; (2) to promote careful attention to interpretation that is required even in the simplest scenarios like binary classification. Our application is publicly available for free under the MIT license as a Python package on PyPI at this https URL and on GitHub at this https URL.

[LG-47] HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories

链接: https://arxiv.org/abs/2412.17040
作者: Eric Hedlin,Munawar Hayat,Fatih Porikli,Kwang Moo Yi,Shweta Mahajan
关键词: efficiently adapt large, adapt large models, drawn interest, efficiently adapt, adapt large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To efficiently adapt large models or to train generative models of neural representations, Hypernetworks have drawn interest. While hypernetworks work well, training them is cumbersome, and often requires ground truth optimized weights for each sample. However, obtaining each of these weights is a training problem of its own-one needs to train, e.g., adaptation weights or even an entire neural field for hypernetworks to regress to. In this work, we propose a method to train hypernetworks, without the need for any per-sample ground truth. Our key idea is to learn a Hypernetwork Field and estimate the entire trajectory of network weight training instead of simply its converged state. In other words, we introduce an additional input to the Hypernetwork, the convergence state, which then makes it act as a neural field that models the entire convergence pathway of a task network. A critical benefit in doing so is that the gradient of the estimated weights at any convergence state must then match the gradients of the original task – this constraint alone is sufficient to train the Hypernetwork Field. We demonstrate the effectiveness of our method through the task of personalized image generation and 3D shape reconstruction from images and point clouds, demonstrating competitive results without any per-sample ground truth.

[LG-48] Generate to Discriminate: Expert Routing for Continual Learning

链接: https://arxiv.org/abs/2412.17009
作者: Yewon Byun,Sanket V. Mehta,Saurabh Garg,Emma Strubell,Michael Oberst,Bryan Wilder,Zachary C. Lipton
关键词: economic incentives permit, real-world settings, regulations and economic, institutional boundaries, economic incentives
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many real-world settings, regulations and economic incentives permit the sharing of models but not data across institutional boundaries. In such scenarios, practitioners might hope to adapt models to new domains, without losing performance on previous domains (so-called catastrophic forgetting). While any single model may struggle to achieve this goal, learning an ensemble of domain-specific experts offers the potential to adapt more closely to each individual institution. However, a core challenge in this context is determining which expert to deploy at test time. In this paper, we propose Generate to Discriminate (G2D), a domain-incremental continual learning method that leverages synthetic data to train a domain-discriminator that routes samples at inference time to the appropriate expert. Surprisingly, we find that leveraging synthetic data in this capacity is more effective than using the samples to \textitdirectly train the downstream classifier (the more common approach to leveraging synthetic data in the lifelong learning literature). We observe that G2D outperforms competitive domain-incremental learning methods on tasks in both vision and language modalities, providing a new perspective on the use of synthetic data in the lifelong learning literature.

[LG-49] FedCross: Intertemporal Federated Learning Under Evolutionary Games

链接: https://arxiv.org/abs/2412.16968
作者: Jianfeng Lu,Ying Zhang,Riheng Jia,Shuqin Cao,Jing Liu,Hao Fu
关键词: decentralized machine learning, Federated Learning, train collaboratively locally, allowing multiple clients, machine learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) mitigates privacy leakage in decentralized machine learning by allowing multiple clients to train collaboratively locally. However, dynamic mobile networks with high mobility, intermittent connectivity, and bandwidth limitation severely hinder model updates to the cloud server. Although previous studies have typically addressed user mobility issue through task reassignment or predictive modeling, frequent migrations may result in high communication overhead. Overcoming this obstacle involves not only dealing with resource constraints, but also finding ways to mitigate the challenges posed by user migrations. We therefore propose an intertemporal incentive framework, FedCross, which ensures the continuity of FL tasks by migrating interrupted training tasks to feasible mobile devices. Specifically, FedCross comprises two distinct stages. In Stage 1, we address the task allocation problem across regions under resource constraints by employing a multi-objective migration algorithm to quantify the optimal task receivers. Moreover, we adopt evolutionary game theory to capture the dynamic decision-making of users, forecasting the evolution of user proportions across different regions to mitigate frequent migrations. In Stage 2, we utilize a procurement auction mechanism to allocate rewards among base stations, ensuring that those providing high-quality models receive optimal compensation. This approach incentivizes sustained user participation, thereby ensuring the overall feasibility of FedCross. Finally, experimental results validate the theoretical soundness of FedCross and demonstrate its significant reduction in communication overhead.

[LG-50] Rethinking Performance Analysis for Configurable Software Systems: A Case Study from a Fitness Landscape Perspective ISSTA2025

链接: https://arxiv.org/abs/2412.16888
作者: Mingyu Huang,Peili Mao,Ke Li
关键词: Modern software systems, tailor varied requirements, Modern software, diverse stakeholders, highly configurable
类目: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 23 pages, 8 figures, accepted as a conference paper at ISSTA 2025

点击查看摘要

Abstract:Modern software systems are often highly configurable to tailor varied requirements from diverse stakeholders. Understanding the mapping between configurations and the desired performance attributes plays a fundamental role in advancing the controllability and tuning of the underlying system, yet has long been a dark hole of knowledge due to its black-box nature. While there have been previous efforts in performance analysis for these systems, they analyze the configurations as isolated data points without considering their inherent spatial relationships. This renders them incapable of interrogating many important aspects of the configuration space like local optima. In this work, we advocate a novel perspective to rethink performance analysis – modeling the configuration space as a structured ``landscape’'. To support this proposition, we designed \our, an open-source, graph data mining empowered fitness landscape analysis (FLA) framework. By applying this framework to 86 M benchmarked configurations from 32 running workloads of 3 real-world systems, we arrived at 6 main findings, which together constitute a holistic picture of the landscape topography, with thorough discussions about their implications on both configuration tuning and performance modeling.

[LG-51] Algorithm Design for Continual Learning in IoT Networks

链接: https://arxiv.org/abs/2412.16830
作者: Shugang Hao,Lingjie Duan
关键词: online learning technique, sequentially generated streaming, Continual learning, small forgetting loss, generated streaming data
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Continual learning (CL) is a new online learning technique over sequentially generated streaming data from different tasks, aiming to maintain a small forgetting loss on previously-learned tasks. Existing work focuses on reducing the forgetting loss under a given task sequence. However, if similar tasks continuously appear to the end time, the forgetting loss is still huge on prior distinct tasks. In practical IoT networks, an autonomous vehicle to sample data and learn different tasks can route and alter the order of task pattern at increased travelling cost. To our best knowledge, we are the first to study how to opportunistically route the testing object and alter the task sequence in CL. We formulate a new optimization problem and prove it NP-hard. We propose a polynomial-time algorithm to achieve approximation ratios of \frac32 for underparameterized case and \frac32 + r^1-T for overparameterized case, respectively, where r:=1-\fracnm is a parameter of feature number m and sample number n and T is the task number. Simulation results verify our algorithm’s close-to-optimum performance.

[LG-52] Balls-and-Bins Sampling for DP-SGD

链接: https://arxiv.org/abs/2412.16802
作者: Lynn Chua,Badih Ghazi,Charlie Harrison,Ethan Leeman,Pritish Kamath,Ravi Kumar,Pasin Manurangsi,Amer Sinha,Chiyuan Zhang
关键词: differentially private, optimization methods, sampling, Poisson subsampling, DP-SGD
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce the Balls-and-Bins sampling for differentially private (DP) optimization methods such as DP-SGD. While it has been common practice to use some form of shuffling in DP-SGD implementations, privacy accounting algorithms have typically assumed that Poisson subsampling is used instead. Recent work by Chua et al. (ICML 2024) however pointed out that shuffling based DP-SGD can have a much larger privacy cost in practical regimes of parameters. We show that the Balls-and-Bins sampling achieves the “best-of-both” samplers, namely, the implementation of Balls-and-Bins sampling is similar to that of Shuffling and models trained using DP-SGD with Balls-and-Bins sampling achieve utility comparable to those trained using DP-SGD with Shuffling at the same noise multiplier, and yet, Balls-and-Bins sampling enjoys similar-or-better privacy amplification as compared to Poisson subsampling in practical regimes.

[LG-53] Symplectic Neural Flows for Modeling and Discovery

链接: https://arxiv.org/abs/2412.16787
作者: Priscilla Canizares,Davide Murari,Carola-Bibiane Schönlieb,Ferdia Sherry,Zakhar Shumaylov
关键词: reliable long-term simulations, modeling complex physical, preserving key properties, complex physical systems, long-term simulations
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注: 26 pages, 14 figures

点击查看摘要

Abstract:Hamilton’s equations are fundamental for modeling complex physical systems, where preserving key properties such as energy and momentum is crucial for reliable long-term simulations. Geometric integrators are widely used for this purpose, but neural network-based methods that incorporate these principles remain underexplored. This work introduces SympFlow, a time-dependent symplectic neural network designed using parameterized Hamiltonian flow maps. This design allows for backward error analysis and ensures the preservation of the symplectic structure. SympFlow allows for two key applications: (i) providing a time-continuous symplectic approximation of the exact flow of a Hamiltonian system–purely based on the differential equations it satisfies, and (ii) approximating the flow map of an unknown Hamiltonian system relying on trajectory data. We demonstrate the effectiveness of SympFlow on diverse problems, including chaotic and dissipative systems, showing improved energy conservation compared to general-purpose numerical methods and accurate

[LG-54] Fed-ZOE: Communication-Efficient Over-the-Air Federated Learning via Zeroth-Order Estimation

链接: https://arxiv.org/abs/2412.16779
作者: Jonggyu Jang,Hyeonsu Lyu,David J. Love,Hyun Jong Yang
关键词: grow increasingly complex, decentralized edge data, networks grow increasingly, efficiently leveraging decentralized, leveraging decentralized edge
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages

点击查看摘要

Abstract:As 6G and beyond networks grow increasingly complex and interconnected, federated learning (FL) emerges as an indispensable paradigm for securely and efficiently leveraging decentralized edge data for AI. By virtue of the superposition property of communication signals, over-the-air FL (OtA-FL) achieves constant communication overhead irrespective of the number of edge devices (EDs). However, training neural networks over the air still incurs substantial communication costs, as the number of transmitted symbols equals the number of trainable parameters. To alleviate this issue, the most straightforward approach is to reduce the number of transmitted symbols by 1) gradient compression and 2) gradient sparsification. Unfortunately, these methods are incompatible with OtA-FL due to the loss of its superposition property. In this work, we introduce federated zeroth-order estimation (Fed-ZOE), an efficient framework inspired by the randomized gradient estimator (RGE) commonly used in zeroth-order optimization (ZOO). In FedZOE, EDs perform local weight updates as in standard FL, but instead of transmitting full gradient vectors, they send compressed local model update vectors in the form of several scalar-valued inner products between the local model update vectors and random vectors. These scalar values enable the parameter server (PS) to reconstruct the gradient using the RGE trick with highly reduced overhead, as well as preserving the superposition property. Unlike conventional ZOO leveraging RGE for step-wise gradient descent, Fed-ZOE compresses local model update vectors before transmission, thereby achieving higher accuracy and computational efficiency. Numerical evaluations using ResNet-18 on datasets such as CIFAR-10, TinyImageNet, SVHN, CIFAR-100, and Brain-CT demonstrate that Fed-ZOE achieves performance comparable to Fed-OtA while drastically reducing communication costs.

[LG-55] Does calibration mean what they say it means; or the reference class problem rises again

链接: https://arxiv.org/abs/2412.16769
作者: Lily Hu
关键词: fairness commonly convey, Meaning picture, commonly convey, convey the normative, normative significance
类目: Machine Learning (cs.LG)
*备注: 27 pages, 4 figures

点击查看摘要

Abstract:Discussions of statistical criteria for fairness commonly convey the normative significance of calibration within groups by invoking what risk scores “mean.” On the Same Meaning picture, group-calibrated scores “mean the same thing” (on average) across individuals from different groups and accordingly, guard against disparate treatment of individuals based on group membership. My contention is that calibration guarantees no such thing. Since concrete actual people belong to many groups, calibration cannot ensure the kind of consistent score interpretation that the Same Meaning picture implies matters for fairness, unless calibration is met within every group to which an individual belongs. Alas only perfect predictors may meet this bar. The Same Meaning picture thus commits a reference class fallacy by inferring from calibration within some group to the “meaning” or evidential value of an individual’s score, because they are a member of that group. Furthermore, the reference class answer it presumes is almost surely wrong. I then show that the reference class problem besets not just calibration but all group statistical facts that claim a close connection to fairness. Reflecting on the origins of this error opens a wider lens onto the predominant methodology in algorithmic fairness based on stylized cases.

[LG-56] Optimization Insights into Deep Diagonal Linear Networks

链接: https://arxiv.org/abs/2412.16765
作者: Hippolyte Labarrière,Cesare Molinari,Lorenzo Rosasco,Silvia Villa,Cristian Vega
关键词: Overparameterized models trained, modern machine learning, Overparameterized models, machine learning, descent are ubiquitous
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Overparameterized models trained with (stochastic) gradient descent are ubiquitous in modern machine learning. These large models achieve unprecedented performance on test data, but their theoretical understanding is still limited. In this paper, we take a step towards filling this gap by adopting an optimization perspective. More precisely, we study the implicit regularization properties of the gradient flow “algorithm” for estimating the parameters of a deep diagonal neural network. Our main contribution is showing that this gradient flow induces a mirror flow dynamic on the model, meaning that it is biased towards a specific solution of the problem depending on the initialization of the network. Along the way, we prove several properties of the trajectory.

[LG-57] Paraformer: Parameterization of Sub-grid Scale Processes Using Transformers

链接: https://arxiv.org/abs/2412.16763
作者: Shuochen Wang,Nishant Yadav,Auroop R. Ganguly
关键词: Global Climate Models, generation of Global, scale physical processes, Global Climate, physical processes
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:One of the major sources of uncertainty in the current generation of Global Climate Models (GCMs) is the representation of sub-grid scale physical processes. Over the years, a series of deep-learning-based parameterization schemes have been developed and tested on both idealized and real-geography GCMs. However, datasets on which previous deep-learning models were trained either contain limited variables or have low spatial-temporal coverage, which can not fully simulate the parameterization process. Additionally, these schemes rely on classical architectures while the latest attention mechanism used in Transformer models remains unexplored in this field. In this paper, we propose Paraformer, a “memory-aware” Transformer-based model on ClimSim, the largest dataset ever created for climate parameterization. Our results demonstrate that the proposed model successfully captures the complex non-linear dependencies in the sub-grid scale variables and outperforms classical deep-learning architectures. This work highlights the applicability of the attenuation mechanism in this field and provides valuable insights for developing future deep-learning-based climate parameterization schemes.

[LG-58] Leveraging Highly Approximated Multipliers in DNN Inference

链接: https://arxiv.org/abs/2412.16757
作者: Georgios Zervakis,Fabio Frustaci,Ourania Spantidi,Iraklis Anagnostopoulos,Hussam Amrouch,Jörg Henkel
关键词: Deep Neural Network, Neural Network, Deep Neural, control variate approximation, highly approximate multipliers
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we present a control variate approximation technique that enables the exploitation of highly approximate multipliers in Deep Neural Network (DNN) accelerators. Our approach does not require retraining and significantly decreases the induced error due to approximate multiplications, improving the overall inference accuracy. As a result, our approach enables satisfying tight accuracy loss constraints while boosting the power savings. Our experimental evaluation, across six different DNNs and several approximate multipliers, demonstrates the versatility of our approach and shows that compared to the accurate design, our control variate approximation achieves the same performance, 45% power reduction, and less than 1% average accuracy loss. Compared to the corresponding approximate designs without using our technique, our approach improves the accuracy by 1.9x on average.

[LG-59] Gradient-based Trajectory Optimization with Parallelized Differentiable Traffic Simulation

链接: https://arxiv.org/abs/2412.16750
作者: Sanghyun Son,Laura Zheng,Brian Clipp,Connor Greenwell,Sujin Philip,Ming C. Lin
关键词: Intelligent Driver Model, incorporates driver behavior, Intelligent Driver, parallelized differentiable traffic, incorporates driver
类目: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, 2 tables

点击查看摘要

Abstract:We present a parallelized differentiable traffic simulator based on the Intelligent Driver Model (IDM), a car-following framework that incorporates driver behavior as key variables. Our simulator efficiently models vehicle motion, generating trajectories that can be supervised to fit real-world data. By leveraging its differentiable nature, IDM parameters are optimized using gradient-based methods. With the capability to simulate up to 2 million vehicles in real time, the system is scalable for large-scale trajectory optimization. We show that we can use the simulator to filter noise in the input trajectories (trajectory filtering), reconstruct dense trajectories from sparse ones (trajectory reconstruction), and predict future trajectories (trajectory prediction), with all generated trajectories adhering to physical laws. We validate our simulator and algorithm on several datasets including NGSIM and Waymo Open Dataset.

[LG-60] Solving Inverse Problems via Diffusion Optimal Control NEURIPS2024

链接: https://arxiv.org/abs/2412.16748
作者: Henry Li,Marcus Pereira
关键词: signal recovery task, Existing approaches, desired posterior distribution, probabilistic sampling episode, problem solvers frame
类目: Machine Learning (cs.LG)
*备注: Presented at NeurIPS 2024

点击查看摘要

Abstract:Existing approaches to diffusion-based inverse problem solvers frame the signal recovery task as a probabilistic sampling episode, where the solution is drawn from the desired posterior distribution. This framework suffers from several critical drawbacks, including the intractability of the conditional likelihood function, strict dependence on the score network approximation, and poor \mathbfx_0 prediction quality. We demonstrate that these limitations can be sidestepped by reframing the generative process as a discrete optimal control episode. We derive a diffusion-based optimal controller inspired by the iterative Linear Quadratic Regulator (iLQR) algorithm. This framework is fully general and able to handle any differentiable forward measurement operator, including super-resolution, inpainting, Gaussian deblurring, nonlinear deblurring, and even highly nonlinear neural classifiers. Furthermore, we show that the idealized posterior sampling equation can be recovered as a special case of our algorithm. We then evaluate our method against a selection of neural inverse problem solvers, and establish a new baseline in image reconstruction with inverse problems.

[LG-61] KKANs: Kurkova-Kolmogorov-Arnold Networks and Their Learning Dynamics

链接: https://arxiv.org/abs/2412.16738
作者: Juan Diego Toscano,Li-Lian Wang,George Em Karniadakis
关键词: robust multi-layer perceptron, combines robust multi-layer, flexible linear combinations, theorem and Kurkova, Kurkova principle
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: Kolmogorov-Arnold representation theorem; physics-informed neural networks; Kolmogorov-Arnold networks; optimization algorithms; self-adaptive weights; information bottleneck theory

点击查看摘要

Abstract:Inspired by the Kolmogorov-Arnold representation theorem and Kurkova’s principle of using approximate representations, we propose the Kurkova-Kolmogorov-Arnold Network (KKAN), a new two-block architecture that combines robust multi-layer perceptron (MLP) based inner functions with flexible linear combinations of basis functions as outer functions. We first prove that KKAN is a universal approximator, and then we demonstrate its versatility across scientific machine-learning applications, including function regression, physics-informed machine learning (PIML), and operator-learning frameworks. The benchmark results show that KKANs outperform MLPs and the original Kolmogorov-Arnold Networks (KANs) in function approximation and operator learning tasks and achieve performance comparable to fully optimized MLPs for PIML. To better understand the behavior of the new representation models, we analyze their geometric complexity and learning dynamics using information bottleneck theory, identifying three universal learning stages, fitting, transition, and diffusion, across all types of architectures. We find a strong correlation between geometric complexity and signal-to-noise ratio (SNR), with optimal generalization achieved during the diffusion stage. Additionally, we propose self-scaled residual-based attention weights to maintain high SNR dynamically, ensuring uniform convergence and prolonged learning.

[LG-62] A Unifying Family of Data-Adaptive Partitioning Algorithms

链接: https://arxiv.org/abs/2412.16713
作者: Guy B. Oldaker IV,Maria Emelianenko
关键词: remain valuable tools, algorithms remain valuable, Clustering algorithms remain, remain valuable, valuable tools
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:Clustering algorithms remain valuable tools for grouping and summarizing the most important aspects of data. Example areas where this is the case include image segmentation, dimension reduction, signals analysis, model order reduction, numerical analysis, and others. As a consequence, many clustering approaches have been developed to satisfy the unique needs of each particular field. In this article, we present a family of data-adaptive partitioning algorithms that unifies several well-known methods (e.g., k-means and k-subspaces). Indexed by a single parameter and employing a common minimization strategy, the algorithms are easy to use and interpret, and scale well to large, high-dimensional problems. In addition, we develop an adaptive mechanism that (a) exhibits skill at automatically uncovering data structures and problem parameters without any expert knowledge and, (b) can be used to augment other existing methods. By demonstrating the performance of our methods on examples from disparate fields including subspace clustering, model order reduction, and matrix approximation, we hope to highlight their versatility and potential for extending the boundaries of existing scientific domains. We believe our family’s parametrized structure represents a synergism of algorithms that will foster new developments and directions, not least within the data science community.

[LG-63] From Correlation to Causation: Understanding Climate Change through Causal Analysis and LLM Interpretations

链接: https://arxiv.org/abs/2412.16691
作者: Shan Shan
关键词: machine learning-based causality, learning-based causality discovery, identify socioeconomic factors, socioeconomic factors influencing, factors influencing carbon
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This research presents a three-step causal inference framework that integrates correlation analysis, machine learning-based causality discovery, and LLM-driven interpretations to identify socioeconomic factors influencing carbon emissions and contributing to climate change. The approach begins with identifying correlations, progresses to causal analysis, and enhances decision making through LLM-generated inquiries about the context of climate change. The proposed framework offers adaptable solutions that support data-driven policy-making and strategic decision-making in climate-related contexts, uncovering causal relationships within the climate change domain.

[LG-64] Label Privacy in Split Learning for Large Models with Parameter-Efficient Training

链接: https://arxiv.org/abs/2412.16669
作者: Philip Zmushko,Marat Mansurov,Ruslan Svirschevski,Denis Kuznedelev,Max Ryabinin,Aleksandr Beznosikov
关键词: practitioners turn, deep learning models, fine-tuning, deep learning, parameter-efficient fine-tuning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:As deep learning models become larger and more expensive, many practitioners turn to fine-tuning APIs. These web services allow fine-tuning a model between two parties: the client that provides the data, and the server that hosts the model. While convenient, these APIs raise a new concern: the data of the client is at risk of privacy breach during the training procedure. This challenge presents an important practical case of vertical federated learning, where the two parties perform parameter-efficient fine-tuning (PEFT) of a large model. In this study, we systematically search for a way to fine-tune models over an API while keeping the labels private. We analyze the privacy of LoRA, a popular approach for parameter-efficient fine-tuning when training over an API. Using this analysis, we propose P ^3 EFT, a multi-party split learning algorithm that takes advantage of existing PEFT properties to maintain privacy at a lower performance overhead. To validate our algorithm, we fine-tune DeBERTa-v2-XXLarge, Flan-T5 Large and LLaMA-2 7B using LoRA adapters on a range of NLP tasks. We find that P ^3 EFT is competitive with existing privacy-preserving methods in multi-party and two-party setups while having higher accuracy.

[LG-65] ransformer-based toxin-protein interaction analysis prioritizes airborne particulate matter components with potential adverse health effects

链接: https://arxiv.org/abs/2412.16664
作者: Yan Zhu,Shihao Wang,Yong Han,Yao Lu,Shulan Qiu,Ling Jin,Xiangdong Li,Weixiong Zhang
关键词: airborne particulate matter, public health globally, air pollution impacts, Air pollution, pollution impacts health
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Air pollution, particularly airborne particulate matter (PM), poses a significant threat to public health globally. It is crucial to comprehend the association between PM-associated toxic components and their cellular targets in humans to understand the mechanisms by which air pollution impacts health and to establish causal relationships between air pollution and public health consequences. Although many studies have explored the impact of PM on human health, the understanding of the association between toxins and the associated targets remain limited. Leveraging cutting-edge deep learning technologies, we developed tipFormer (toxin-protein interaction prediction based on transformer), a novel deep-learning tool for identifying toxic components capable of penetrating human cells and instigating pathogenic biological activities and signaling cascades. Experimental results show that tipFormer effectively captures interactions between proteins and toxic components. It incorporates dual pre-trained language models to encode protein sequences and chemicals. It employs a convolutional encoder to assimilate the sequential attributes of proteins and chemicals. It then introduces a learning module with a cross-attention mechanism to decode and elucidate the multifaceted interactions pivotal for the hotspots binding proteins and chemicals. Experimental results show that tipFormer effectively captures interactions between proteins and toxic components. This approach offers significant value to air quality and toxicology researchers by allowing high-throughput identification and prioritization of hazards. It supports more targeted laboratory studies and field measurements, ultimately enhancing our understanding of how air pollution impacts human health.

[LG-66] FedGA: Federated Learning with Gradient Alignment for Error Asymmetry Mitigation

链接: https://arxiv.org/abs/2412.16582
作者: Chenguang Xiao,Zheming Zuo,Shuo Wang
关键词: inter-client class imbalance, biased client updates, Federated learning, triggers intra-client, class imbalance
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated learning (FL) triggers intra-client and inter-client class imbalance, with the latter compared to the former leading to biased client updates and thus deteriorating the distributed models. Such a bias is exacerbated during the server aggregation phase and has yet to be effectively addressed by conventional re-balancing methods. To this end, different from the off-the-shelf label or loss-based approaches, we propose a gradient alignment (GA)-informed FL method, dubbed as FedGA, where the importance of error asymmetry (EA) in bias is observed and its linkage to the gradient of the loss to raw logits is explored. Concretely, GA, implemented by label calibration during the model backpropagation process, prevents catastrophic forgetting of rate and missing classes, hence boosting model convergence and accuracy. Experimental results on five benchmark datasets demonstrate that GA outperforms the pioneering counterpart FedAvg and its four variants in minimizing EA and updating bias, and accordingly yielding higher F1 score and accuracy margins when the Dirichlet distribution sampling factor \alpha increases. The code and more details are available at \urlthis https URL.

[LG-67] A Meta-Learning Approach to Bayesian Causal Discovery

链接: https://arxiv.org/abs/2412.16577
作者: Anish Dhir,Matthew Ashman,James Requeima,Mark van der Wilk
关键词: inherent identifiability issues, Discovering a unique, causal, unique causal structure, identifiability issues
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Discovering a unique causal structure is difficult due to both inherent identifiability issues, and the consequences of finite data. As such, uncertainty over causal structures, such as those obtained from a Bayesian posterior, are often necessary for downstream tasks. Finding an accurate approximation to this posterior is challenging, due to the large number of possible causal graphs, as well as the difficulty in the subproblem of finding posteriors over the functional relationships of the causal edges. Recent works have used meta-learning to view the problem of estimating the maximum a-posteriori causal graph as supervised learning. Yet, these methods are limited when estimating the full posterior as they fail to encode key properties of the posterior, such as correlation between edges and permutation equivariance with respect to nodes. Further, these methods also cannot reliably sample from the posterior over causal structures. To address these limitations, we propose a Bayesian meta learning model that allows for sampling causal structures from the posterior and encodes these key properties. We compare our meta-Bayesian causal discovery against existing Bayesian causal discovery methods, demonstrating the advantages of directly learning a posterior over causal structure.

[LG-68] High-Dimensional Bayesian Optimization via Random Projection of Manifold Subspaces

链接: https://arxiv.org/abs/2412.16554
作者: Quoc-Anh Hoang Nguyen, TheHung Tran
关键词: Bayesian Optimization, Bayesian, Optimization, black-box functions, popular approach
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian Optimization (BO) is a popular approach to optimizing expensive-to-evaluate black-box functions. Despite the success of BO, its performance may decrease exponentially as the dimensionality increases. A common framework to tackle this problem is to assume that the objective function depends on a limited set of features that lie on a low-dimensional manifold embedded in the high-dimensional ambient space. The latent space can be linear or more generally nonlinear. To learn feature mapping, existing works usually use an encode-decoder framework which is either computationally expensive or susceptible to overfittting when the labeled data is limited. This paper proposes a new approach for BO in high dimensions by exploiting a new representation of the objective function. Our approach combines a random linear projection to reduce the dimensionality, with a representation learning of the nonlinear manifold. When the geometry of the latent manifold is available, a solution to exploit this geometry is proposed for representation learning. In contrast, we use a neural network. To mitigate overfitting by using the neural network, we train the feature mapping in a geometry-aware semi-supervised manner. Our approach enables efficient optimizing of BO’s acquisition function in the low-dimensional space, with the advantage of projecting back to the original high-dimensional space compared to existing works in the same setting. Finally, we show empirically that our algorithm outperforms other high-dimensional BO baselines in various synthetic functions and real applications.

[LG-69] DOFEN: Deep Oblivious Forest ENsemble NEURIPS2024

链接: https://arxiv.org/abs/2412.16534
作者: Kuan-Yu Chen,Ping-Han Chiang,Hsin-Rung Chou,Chih-Sheng Chen,Tien-Hao Chang
关键词: Deep Neural Networks, Neural Networks, revolutionized artificial intelligence, Deep Neural, diverse data types
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024 (poster)

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have revolutionized artificial intelligence, achieving impressive results on diverse data types, including images, videos, and texts. However, DNNs still lag behind Gradient Boosting Decision Trees (GBDT) on tabular data, a format extensively utilized across various domains. This paper introduces DOFEN, which stands for Deep Oblivious Forest ENsemble. DOFEN is a novel DNN architecture inspired by oblivious decision trees and achieves on-off sparse selection of columns. DOFEN surpasses other DNNs on tabular data, achieving state-of-the-art performance on the well-recognized benchmark: Tabular Benchmark, which includes 73 total datasets spanning a wide array of domains. The code of DOFEN is available at: this https URL.

[LG-70] Physics-Guided Fair Graph Sampling for Water Temperature Prediction in River Networks

链接: https://arxiv.org/abs/2412.16523
作者: Erhu He,Declan Kutscher,Yiqun Xie,Jacob Zwart,Zhe Jiang,Huaxiu Yao,Xiaowei Jia
关键词: stream water temperature, predict stream water, education levels, graph neural networks, income and education
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Physics and Society (physics.soc-ph); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work introduces a novel graph neural networks (GNNs)-based method to predict stream water temperature and reduce model bias across locations of different income and education levels. Traditional physics-based models often have limited accuracy because they are necessarily approximations of reality. Recently, there has been an increasing interest of using GNNs in modeling complex water dynamics in stream networks. Despite their promise in improving the accuracy, GNNs can bring additional model bias through the aggregation process, where node features are updated by aggregating neighboring nodes. The bias can be especially pronounced when nodes with similar sensitive attributes are frequently connected. We introduce a new method that leverages physical knowledge to represent the node influence in GNNs, and then utilizes physics-based influence to refine the selection and weights over the neighbors. The objective is to facilitate equitable treatment over different sensitive groups in the graph aggregation, which helps reduce spatial bias over locations, especially for those in underprivileged groups. The results on the Delaware River Basin demonstrate the effectiveness of the proposed method in preserving equitable performance across locations in different sensitive groups.

[LG-71] Batch Selection for Multi-Label Classification Guided by Uncertainty and Dynamic Label Correlations

链接: https://arxiv.org/abs/2412.16521
作者: Ao Zhou,Bin Liu,Jin Wang,Grigorios Tsoumakas
关键词: deep neural networks, neural networks, networks is significantly, significantly influenced, mini-batch construction
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The accuracy of deep neural networks is significantly influenced by the effectiveness of mini-batch construction during training. In single-label scenarios, such as binary and multi-class classification tasks, it has been demonstrated that batch selection algorithms preferring samples with higher uncertainty achieve better performance than difficulty-based methods. Although there are two batch selection methods tailored for multi-label data, none of them leverage important uncertainty information. Adapting the concept of uncertainty to multi-label data is not a trivial task, since there are two issues that should be tackled. First, traditional variance or entropy-based uncertainty measures ignore fluctuations of predictions within sliding windows and the importance of the current model state. Second, existing multi-label methods do not explicitly exploit the label correlations, particularly the uncertainty-based label correlations that evolve during the training process. In this paper, we propose an uncertainty-based multi-label batch selection algorithm. It assesses uncertainty for each label by considering differences between successive predictions and the confidence of current outputs, and further leverages dynamic uncertainty-based label correlations to emphasize instances whose uncertainty is synergistically expressed across multiple labels. Empirical studies demonstrate the effectiveness of our method in improving the performance and accelerating the convergence of various multi-label deep learning models.

[LG-72] STKDRec: Spatial-Temporal Knowledge Distillation for Takeaway Recommendation AAAI2025

链接: https://arxiv.org/abs/2412.16502
作者: Shuyuan Zhao,Wei Chen,Boyan Shi,Liyong Zhou,Shuohao Lin,Huaiyu Wan
关键词: increasing merchant sales, recommend users’ future, historical purchase behaviors, improving user satisfaction, users’ future takeaway
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: AAAI2025

点击查看摘要

Abstract:The takeaway recommendation system is designed to recommend users’ future takeaway purchases based on their historical purchase behaviors, thereby improving user satisfaction and increasing merchant sales. Existing methods focus on incorporating auxiliary information or leveraging knowledge graphs to alleviate the sparsity issue of user purchase sequence data. However, two main challenges limit the performance of these approaches: (1) how to capture dynamic user preferences on complex geospatial information and (2) how to efficiently integrate spatial-temporal knowledge from graphs and sequence data with low calculation costs. In this paper, we propose a novel spatial-temporal knowledge distillation for takeaway recommendation model (STKDRec) based on the two-stage training process. Specifically, during the first pre-training stage, a spatial-temporal knowledge graph (STKG) encoder is pre-trained to extract the high-order spatial-temporal and collaborative associations within the STKG. During the second STKD stage, a spatial-temporal Transformer is employed to comprehensively model dynamic user preferences on various types of fine-grained geospatial information from a sequence perspective. Furthermore, the STKD strategy is introduced to adaptively fuse the rich spatial-temporal knowledge from the pre-trained STKG encoder and the spatial-temporal transformer while reducing the cost of model training. Extensive experiments on three real-world datasets show that our STKDRec significantly outperforms the state-of-the-art baselines. Our code is available at:this https URL.

[LG-73] MOL-Mamba: Enhancing Molecular Representation with Structural Electronic Insights AAAI2025

链接: https://arxiv.org/abs/2412.16483
作者: Jingjing Hu,Dan Guo,Zhan Si,Deguang Liu,Yunfeng Diao,Jing Zhang,Jinxing Zhou,Meng Wang
关键词: Graph Neural Networks, molecular property prediction, downstream tasks, drug design, Neural Networks
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注: Accepted by AAAI2025

点击查看摘要

Abstract:Molecular representation learning plays a crucial role in various downstream tasks, such as molecular property prediction and drug design. To accurately represent molecules, Graph Neural Networks (GNNs) and Graph Transformers (GTs) have shown potential in the realm of self-supervised pretraining. However, existing approaches often overlook the relationship between molecular structure and electronic information, as well as the internal semantic reasoning within molecules. This omission of fundamental chemical knowledge in graph semantics leads to incomplete molecular representations, missing the integration of structural and electronic data. To address these issues, we introduce MOL-Mamba, a framework that enhances molecular representation by combining structural and electronic insights. MOL-Mamba consists of an Atom Fragment Mamba-Graph (MG) for hierarchical structural reasoning and a Mamba-Transformer (MT) fuser for integrating molecular structure and electronic correlation learning. Additionally, we propose a Structural Distribution Collaborative Training and E-semantic Fusion Training framework to further enhance molecular representation learning. Extensive experiments demonstrate that MOL-Mamba outperforms state-of-the-art baselines across eleven chemical-biological molecular datasets. Code is available at this https URL.

[LG-74] Learn2Mix: Training Neural Networks Using Adaptive Data Integration

链接: https://arxiv.org/abs/2412.16482
作者: Shyam Venkatasubramanian,Vahid Tarokh
关键词: Accelerating model convergence, Accelerating model, resource-constrained environments, environments is essential, essential for fast
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Accelerating model convergence in resource-constrained environments is essential for fast and efficient neural network training. This work presents learn2mix, a new training strategy that adaptively adjusts class proportions within batches, focusing on classes with higher error rates. Unlike classical training methods that use static class proportions, learn2mix continually adapts class proportions during training, leading to faster convergence. Empirical evaluations on benchmark datasets show that neural networks trained with learn2mix converge faster than those trained with classical approaches, achieving improved results for classification, regression, and reconstruction tasks under limited training resources and with imbalanced classes. Our empirical findings are supported by theoretical analysis.

[LG-75] he Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

链接: https://arxiv.org/abs/2412.16468
作者: HyunJin Kim,Xiaoyuan Yi,Jing Yao,Jianxun Lian,Muhua Huang,Shitong Duan,JinYeong Bak,Xing Xie
关键词: Artificial Superintelligence, large language models, surpassing human intelligence, system surpassing human, language models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The emergence of large language models (LLMs) has sparked the possibility of about Artificial Superintelligence (ASI), a hypothetical AI system surpassing human intelligence. However, existing alignment paradigms struggle to guide such advanced AI systems. Superalignment, the alignment of AI systems with human values and safety requirements at superhuman levels of capability aims to addresses two primary goals – scalability in supervision to provide high-quality guidance signals and robust governance to ensure alignment with human values. In this survey, we examine scalable oversight methods and potential solutions for superalignment. Specifically, we explore the concept of ASI, the challenges it poses, and the limitations of current alignment paradigms in addressing the superalignment problem. Then we review scalable oversight methods for superalignment. Finally, we discuss the key challenges and propose pathways for the safe and continual improvement of ASI systems. By comprehensively reviewing the current literature, our goal is provide a systematical introduction of existing methods, analyze their strengths and limitations, and discuss potential future directions.

[LG-76] Condensed Stein Variational Gradient Descent for Uncertainty Quantification of Neural Networks

链接: https://arxiv.org/abs/2412.16462
作者: Govinda Anantha Padmanabha,Cosmin Safta,Nikolaos Bouklas,Reese E. Jones
关键词: complexly parameterized model, Stein variational gradient, concurrently sparsify, neural network, Stein variational
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 18 pages, 13 figures

点击查看摘要

Abstract:We propose a Stein variational gradient descent method to concurrently sparsify, train, and provide uncertainty quantification of a complexly parameterized model such as a neural network. It employs a graph reconciliation and condensation process to reduce complexity and increase similarity in the Stein ensemble of parameterizations. Therefore, the proposed condensed Stein variational gradient (cSVGD) method provides uncertainty quantification on parameters, not just outputs. Furthermore, the parameter reduction speeds up the convergence of the Stein gradient descent as it reduces the combinatorial complexity by aligning and differentiating the sensitivity to parameters. These properties are demonstrated with an illustrative example and an application to a representation problem in solid mechanics.

[LG-77] CBNN: 3-Party Secure Framework for Customized Binary Neural Networks Inference

链接: https://arxiv.org/abs/2412.16449
作者: Benchang Dong,Zhili Chen,Xin Chen,Shiwen Wei,Jie Fu,Huifa Li
关键词: Binarized Neural Networks, machine learning tasks, Privacy-Preserving Machine Learning, machine learning, facilitate Privacy-Preserving Machine
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Binarized Neural Networks (BNN) offer efficient implementations for machine learning tasks and facilitate Privacy-Preserving Machine Learning (PPML) by simplifying operations with binary values. Nevertheless, challenges persist in terms of communication and accuracy in their application scenarios. In this work, we introduce CBNN, a three-party secure computation framework tailored for efficient BNN inference. Leveraging knowledge distillation and separable convolutions, CBNN transforms standard BNNs into MPC-friendly customized BNNs, maintaining high utility. It performs secure inference using optimized protocols for basic operations. Specifically, CBNN enhances linear operations with replicated secret sharing and MPC-friendly convolutions, while introducing a novel secure activation function to optimize non-linear operations. We demonstrate the effectiveness of CBNN by transforming and securely implementing several typical BNN models. Experimental results indicate that CBNN maintains impressive performance even after customized binarization and security measures

[LG-78] Iterative Feature Exclusion Ranking for Deep Tabular Learning

链接: https://arxiv.org/abs/2412.16442
作者: Fathi Said Emhemed Shaninah,AbdulRahman M. A. Baraka,Mohd Halim Mohd Noor
关键词: Tabular data, feature, proposed module, common format, format for storing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data is a common format for storing information in rows and columns to represent data entries and their features. Although deep neural networks have become the main approach for modeling a wide range of domains including computer vision and NLP, many of them are not well-suited for tabular data. Recently, a few deep learning models have been proposed for deep tabular learning, featuring an internal feature selection mechanism with end-to-end gradient-based optimization. However, their feature selection mechanisms are unidimensional, and hence fail to account for the contextual dependence of feature importance, potentially overlooking crucial interactions that govern complex tasks. In addition, they overlook the bias of high-impact features and the risk associated with the limitations of attention generalization. To address this limitation, this study proposes a novel iterative feature exclusion module that enhances the feature importance ranking in tabular data. The proposed module iteratively excludes each feature from the input data and computes the attention scores, which represent the impact of the features on the prediction. By aggregating the attention scores from each iteration, the proposed module generates a refined representation of feature importance that captures both global and local interactions between features. The effectiveness of the proposed module is evaluated on four public datasets. The results demonstrate that the proposed module consistently outperforms state-of-the-art methods and baseline models in feature ranking and classification tasks. The code is publicly available at this https URL and this https URL

[LG-79] HeGCN: Temporal Heterophilic Graph Convolutional Network

链接: https://arxiv.org/abs/2412.16435
作者: Yuchen Yan,Yuzhong Chen,Huiyuan Chen,Xiaoting Li,Zhe Xu,Zhichen Zeng,Zhining Liu,Hanghang Tong
关键词: Graph Neural Networks, graph learning tasks, diverse graph learning, temporal heterophily issue, Neural Networks
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have exhibited remarkable efficacy in diverse graph learning tasks, particularly on static homophilic graphs. Recent attention has pivoted towards more intricate structures, encompassing (1) static heterophilic graphs encountering the edge heterophily issue in the spatial domain and (2) event-based continuous graphs in the temporal domain. State-of-the-art (SOTA) has been concurrently addressing these two lines of work but tends to overlook the presence of heterophily in the temporal domain, constituting the temporal heterophily issue. Furthermore, we highlight that the edge heterophily issue and the temporal heterophily issue often co-exist in event-based continuous graphs, giving rise to the temporal edge heterophily challenge. To tackle this challenge, this paper first introduces the temporal edge heterophily measurement. Subsequently, we propose the Temporal Heterophilic Graph Convolutional Network (THeGCN), an innovative model that incorporates the low/high-pass graph signal filtering technique to accurately capture both edge (spatial) heterophily and temporal heterophily. Specifically, the THeGCN model consists of two key components: a sampler and an aggregator. The sampler selects events relevant to a node at a given moment. Then, the aggregator executes message-passing, encoding temporal information, node attributes, and edge attributes into node embeddings. Extensive experiments conducted on 5 real-world datasets validate the efficacy of THeGCN.

[LG-80] GAT-RWOS: Graph Attention-Guided Random Walk Oversampling for Imbalanced Data Classification

链接: https://arxiv.org/abs/2412.16394
作者: Zahiriddin Rustamov,Abderrahmane Lakas,Nazar Zaki
关键词: Class imbalance poses, Graph Attention Networks, biased models favouring, machine learning, imbalance poses
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ICKG 2024. Code is available at this http URL

点击查看摘要

Abstract:Class imbalance poses a significant challenge in machine learning (ML), often leading to biased models favouring the majority class. In this paper, we propose GAT-RWOS, a novel graph-based oversampling method that combines the strengths of Graph Attention Networks (GATs) and random walk-based oversampling. GAT-RWOS leverages the attention mechanism of GATs to guide the random walk process, focusing on the most informative neighbourhoods for each minority node. By performing attention-guided random walks and interpolating features along the traversed paths, GAT-RWOS generates synthetic minority samples that expand class boundaries while preserving the original data distribution. Extensive experiments on a diverse set of imbalanced datasets demonstrate the effectiveness of GAT-RWOS in improving classification performance, outperforming state-of-the-art oversampling techniques. The proposed method has the potential to significantly improve the performance of ML models on imbalanced datasets and contribute to the development of more reliable classification systems.

[LG-81] Navigating AI to Unpack Youth Privacy Concerns: An In-Depth Exploration and Systematic Review

链接: https://arxiv.org/abs/2412.16369
作者: Ajay Kumar Shrestha,Ankur Barthwal,Molly Campbell,Austin Shouli,Saad Syed,Sandhya Joshi,Julita Vassileva
关键词: social media platforms, review investigates perceptions, literature review investigates, systematic literature review, gaming systems
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: To appear in the 2024 IEEE Annual Information Technology, Electronics and Mobile Communication Conference proceedings

点击查看摘要

Abstract:This systematic literature review investigates perceptions, concerns, and expectations of young digital citizens regarding privacy in artificial intelligence (AI) systems, focusing on social media platforms, educational technology, gaming systems, and recommendation algorithms. Using a rigorous methodology, the review started with 2,000 papers, narrowed down to 552 after initial screening, and finally refined to 108 for detailed analysis. Data extraction focused on privacy concerns, data-sharing practices, the balance between privacy and utility, trust factors in AI, transparency expectations, and strategies to enhance user control over personal data. Findings reveal significant privacy concerns among young users, including a perceived lack of control over personal information, potential misuse of data by AI, and fears of data breaches and unauthorized access. These issues are worsened by unclear data collection practices and insufficient transparency in AI applications. The intention to share data is closely associated with perceived benefits and data protection assurances. The study also highlights the role of parental mediation and the need for comprehensive education on data privacy. Balancing privacy and utility in AI applications is crucial, as young digital citizens value personalized services but remain wary of privacy risks. Trust in AI is significantly influenced by transparency, reliability, predictable behavior, and clear communication about data usage. Strategies to improve user control over personal data include access to and correction of data, clear consent mechanisms, and robust data protection assurances. The review identifies research gaps and suggests future directions, such as longitudinal studies, multicultural comparisons, and the development of ethical AI frameworks.

[LG-82] Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study

链接: https://arxiv.org/abs/2412.16335
作者: Daniel Smolyak,Arshana Welivita,Margrét V. Bjarnadóttir,Ritu Agarwal
关键词: data, synthetic data, Objective, machine learning, synthetic
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 26 pages, 4 figures

点击查看摘要

Abstract:Objective. Demographic groups are often represented at different rates in medical datasets. These differences can create bias in machine learning algorithms, with higher levels of performance for better-represented groups. One promising solution to this problem is to generate synthetic data to mitigate potential adverse effects of non-representative data sets. Methods. We build on recent advances in LLM-based synthetic data generation to create a pipeline where the synthetic data is generated separately for each demographic group. We conduct our study using MIMIC-IV and Framingham “Offspring and OMNI-1 Cohorts” datasets. We prompt GPT4-Turbo to create group-specific data, providing training examples and the dataset context. An exploratory analysis is conducted to ascertain the quality of the generated data. We then evaluate the utility of the synthetic data for augmentation of a training dataset in a downstream machine learning task, focusing specifically on model performance metrics across groups. Results. The performance of GPT4-Turbo augmentation is generally superior but not always. In the majority of experiments our method outperforms standard modeling baselines, however, prompting GPT-4-Turbo to produce data specific to a group provides little to no additional benefit over a prompt that does not specify the group. Conclusion. We developed a method for using LLMs out-of-the-box to synthesize group-specific data to address imbalances in demographic representation in medical datasets. As another “tool in the toolbox”, this method can improve model fairness and thus health equity. More research is needed to understand the conditions under which LLM generated synthetic data is useful for non-representative medical data sets. Comments: 26 pages, 4 figures Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY) Cite as: arXiv:2412.16335 [cs.LG] (or arXiv:2412.16335v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2412.16335 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Daniel Smolyak [view email] [v1] Fri, 20 Dec 2024 20:49:17 UTC (1,142 KB)

[LG-83] Principal-Agent Bandit Games with Self-Interested and Exploratory Learning Agents

链接: https://arxiv.org/abs/2412.16318
作者: Junyan Liu,Lillian J. Ratliff
关键词: principal indirectly interacts, principal-agent bandit game, repeated principal-agent bandit, unknown environment, self-interested learning agent
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the repeated principal-agent bandit game, where the principal indirectly interacts with the unknown environment by proposing incentives for the agent to play arms. Most existing work assumes the agent has full knowledge of the reward means and always behaves greedily, but in many online marketplaces, the agent needs to learn the unknown environment and sometimes explore. Motivated by such settings, we model a self-interested learning agent with exploration behaviors who iteratively updates reward estimates and either selects an arm that maximizes the estimated reward plus incentive or explores arbitrarily with a certain probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both i.i.d. and linear reward settings with bandit feedback in a finite horizon T , achieving regret bounds of \widetildeO(\sqrtT) and \widetildeO( T^2/3 ) , respectively. Specifically, these algorithms are established upon a novel elimination framework coupled with newly-developed search algorithms which accommodate the uncertainty arising from the learning behavior of the agent. We then extend the framework to handle the exploratory learning agent and develop an algorithm to achieve a \widetildeO(T^2/3) regret bound in i.i.d. reward setup by enhancing the robustness of our elimination framework to the potential agent exploration. Finally, when reducing our agent behaviors to the one studied in (Dogan et al., 2023a), we propose an algorithm based on our robust framework, which achieves a \widetildeO(\sqrtT) regret bound, significantly improving upon their \widetildeO(T^11/12) bound.

[LG-84] Long-Term Upper-Limb Prosthesis Myocontrol via High-Density sEMG and Incremental Learning

链接: https://arxiv.org/abs/2412.16271
作者: Dario Di Domenico,Nicolò Boccardo,Andrea Marinelli,Michele Canepa,Emanuele Gruppioni,Matteo Laffranchi,Raffaello Camoriano
关键词: controlling robotic prostheses, Noninvasive human-machine interfaces, Noninvasive human-machine, surface electromyography, robotic prostheses
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Pre-print version of published IEEE Robotics and Automation Letters paper (2024). 8 pages, 7 figures

点击查看摘要

Abstract:Noninvasive human-machine interfaces such as surface electromyography (sEMG) have long been employed for controlling robotic prostheses. However, classical controllers are limited to few degrees of freedom (DoF). More recently, machine learning methods have been proposed to learn personalized controllers from user data. While promising, they often suffer from distribution shift during long-term usage, requiring costly model re-training. Moreover, most prosthetic sEMG sensors have low spatial density, which limits accuracy and the number of controllable motions. In this work, we address both challenges by introducing a novel myoelectric prosthetic system integrating a high density-sEMG (HD-sEMG) setup and incremental learning methods to accurately control 7 motions of the Hannes prosthesis. First, we present a newly designed, compact HD-sEMG interface equipped with 64 dry electrodes positioned over the forearm. Then, we introduce an efficient incremental learning system enabling model adaptation on a stream of data. We thoroughly analyze multiple learning algorithms across 7 subjects, including one with limb absence, and 6 sessions held in different days covering an extended period of several months. The size and time span of the collected data represent a relevant contribution for studying long-term myocontrol performance. Therefore, we release the DELTA dataset together with our experimental code.

[LG-85] A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Speech

链接: https://arxiv.org/abs/2412.16267
作者: Mary Paterson,James Moor,Luisa Cutillo
关键词: coming years, laryngeal cancer, Cases of laryngeal, predicted to rise, rise significantly
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)
*备注: 24 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Cases of laryngeal cancer are predicted to rise significantly in the coming years. Current diagnostic pathways cause many patients to be incorrectly referred to urgent suspected cancer pathways, putting undue stress on both patients and the medical system. Artificial intelligence offers a promising solution by enabling non-invasive detection of laryngeal cancer from patient speech, which could help prioritise referrals more effectively and reduce inappropriate referrals of non-cancer patients. To realise this potential, open science is crucial. A major barrier in this field is the lack of open-source datasets and reproducible benchmarks, forcing researchers to start from scratch. Our work addresses this challenge by introducing a benchmark suite comprising 36 models trained and evaluated on open-source datasets. These models are accessible in a public repository, providing a foundation for future research. They evaluate three different algorithms and three audio feature sets, offering a comprehensive benchmarking framework. We propose standardised metrics and evaluation methodologies to ensure consistent and comparable results across future studies. The presented models include both audio-only inputs and multimodal inputs that incorporate demographic and symptom data, enabling their application to datasets with diverse patient information. By providing these benchmarks, future researchers can evaluate their datasets, refine the models, and use them as a foundation for more advanced approaches. This work aims to provide a baseline for establishing reproducible benchmarks, enabling researchers to compare new methods against these standards and ultimately advancing the development of AI tools for detecting laryngeal cancer. Comments: 24 pages, 6 figures, 7 tables Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM) Cite as: arXiv:2412.16267 [cs.SD] (or arXiv:2412.16267v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2412.16267 Focus to learn more arXiv-issued DOI via DataCite

[LG-86] Learned Compression of Nonlinear Time Series With Random Access ICDE2025

链接: https://arxiv.org/abs/2412.16266
作者: Andrea Guerra,Giorgio Vinciguerra,Antonio Boffa,Paolo Ferragina
关键词: Time series, including finance, Time series play, environmental monitoring, play a crucial
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
*备注: Accepted for publication in Proceedings of the 41st IEEE International Conference on Data Engineering (ICDE 2025)

点击查看摘要

Abstract:Time series play a crucial role in many fields, including finance, healthcare, industry, and environmental monitoring. The storage and retrieval of time series can be challenging due to their unstoppable growth. In fact, these applications often sacrifice precious historical data to make room for new data. General-purpose compressors can mitigate this problem with their good compression ratios, but they lack efficient random access on compressed data, thus preventing real-time analyses. Ad-hoc streaming solutions, instead, typically optimise only for compression and decompression speed, while giving up compression effectiveness and random access functionality. Furthermore, all these methods lack awareness of certain special regularities of time series, whose trends over time can often be described by some linear and nonlinear functions. To address these issues, we introduce NeaTS, a randomly-accessible compression scheme that approximates the time series with a sequence of nonlinear functions of different kinds and shapes, carefully selected and placed by a partitioning algorithm to minimise the space. The approximation residuals are bounded, which allows storing them in little space and thus recovering the original data losslessly, or simply discarding them to obtain a lossy time series representation with maximum error guarantees. Our experiments show that NeaTS improves the compression ratio of the state-of-the-art lossy compressors that use linear or nonlinear functions (or both) by up to 14%. Compared to lossless compressors, NeaTS emerges as the only approach to date providing, simultaneously, compression ratios close to or better than the best existing compressors, a much faster decompression speed, and orders of magnitude more efficient random access, thus enabling the storage and real-time analysis of massive and ever-growing amounts of (historical) time series data. Comments: Accepted for publication in Proceedings of the 41st IEEE International Conference on Data Engineering (ICDE 2025) Subjects: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR) ACMclasses: E.1; E.4; H.2.2; H.3.2; H.3.1 Cite as: arXiv:2412.16266 [cs.LG] (or arXiv:2412.16266v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2412.16266 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Giorgio Vinciguerra [view email] [v1] Fri, 20 Dec 2024 10:30:06 UTC (2,067 KB)

[LG-87] Multi-Source Unsupervised Domain Adaptation with Prototype Aggregation

链接: https://arxiv.org/abs/2412.16255
作者: Min Huang,Zifeng Xie,Bo Sun,Ning Wang
关键词: Multi-source domain adaptation, industrial model generalization, Multi-source domain, plays an important, important role
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-source domain adaptation (MSDA) plays an important role in industrial model generalization. Recent efforts on MSDA focus on enhancing multi-domain distributional alignment while omitting three issues, e.g., the class-level discrepancy quantification, the unavailability of noisy pseudo-label, and source transferability discrimination, potentially resulting in suboptimal adaption performance. Therefore, we address these issues by proposing a prototype aggregation method that models the discrepancy between source and target domains at the class and domain levels. Our method achieves domain adaptation based on a group of prototypes (i.e., representative feature embeddings). A similarity score-based strategy is designed to quantify the transferability of each domain. At the class level, our method quantifies class-specific cross-domain discrepancy according to reliable target pseudo-labels. At the domain level, our method establishes distributional alignment between noisy pseudo-labeled target samples and the source domain prototypes. Therefore, adaptation at the class and domain levels establishes a complementary mechanism to obtain accurate predictions. The results on three standard benchmarks demonstrate that our method outperforms most state-of-the-art methods. In addition, we provide further elaboration of the proposed method in light of the interpretable results obtained from the analysis experiments.

[LG-88] Know2Vec: A Black-Box Proxy for Neural Network Retrieval AAAI2025

链接: https://arxiv.org/abs/2412.16251
作者: Zhuoyi Shang,Yanwei Liu,Jinxia Liu,Xiaoyan Gu,Ying Ding,Xiangyang Ji
关键词: neural network, model, challenging and labor-intensive, neural network models, neural network retrieval
类目: Machine Learning (cs.LG)
*备注: AAAI2025 accepted

点击查看摘要

Abstract:For general users, training a neural network from scratch is usually challenging and labor-intensive. Fortunately, neural network zoos enable them to find a well-performing model for directly use or fine-tuning it in their local environments. Although current model retrieval solutions attempt to convert neural network models into vectors to avoid complex multiple inference processes required for model selection, it is still difficult to choose a suitable model due to inaccurate vectorization and biased correlation alignment between the query dataset and models. From the perspective of knowledge consistency, i.e., whether the knowledge possessed by the model can meet the needs of query tasks, we propose a model retrieval scheme, named Know2Vec, that acts as a black-box retrieval proxy for model zoo. Know2Vec first accesses to models via a black-box interface in advance, capturing vital decision knowledge from models while ensuring their privacy. Next, it employs an effective encoding technique to transform the knowledge into precise model vectors. Secondly, it maps the user’s query task to a knowledge vector by probing the semantic relationships within query samples. Furthermore, the proxy ensures the knowledge-consistency between query vector and model vectors within their alignment space, which is optimized through the supervised learning with diverse loss functions, and finally it can identify the most suitable model for a given task during the inference stage. Extensive experiments show that our Know2Vec achieves superior retrieval accuracy against the state-of-the-art methods in diverse neural network retrieval tasks.

[LG-89] raining-free Heterogeneous Graph Condensation via Data Selection

链接: https://arxiv.org/abs/2412.16250
作者: Yuxuan Liang,Wentao Zhang,Xinyi Gao,Ling Yang,Chong Chen,Hongzhi Yin,Yunhai Tong,Bin Cui
关键词: large-scale heterogeneous graphs, heterogeneous graphs, heterogeneous, Heterogeneous Graph Condensation, graphs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient training of large-scale heterogeneous graphs is of paramount importance in real-world applications. However, existing approaches typically explore simplified models to mitigate resource and time overhead, neglecting the crucial aspect of simplifying large-scale heterogeneous graphs from the data-centric perspective. Addressing this gap, HGCond introduces graph condensation (GC) in heterogeneous graphs and generates a small condensed graph for efficient model training. Despite its efficacy in graph generation, HGCond encounters two significant limitations. The first is low effectiveness, HGCond excessively relies on the simplest relay model for the condensation procedure, which restricts the ability to exert powerful Heterogeneous Graph Neural Networks (HGNNs) with flexible condensation ratio and limits the generalization ability. The second is low efficiency, HGCond follows the existing GC methods designed for homogeneous graphs and leverages the sophisticated optimization paradigm, resulting in a time-consuming condensing procedure. In light of these challenges, we present the first Training \underlineFree Heterogeneous Graph Condensation method, termed FreeHGC, facilitating both efficient and high-quality generation of heterogeneous condensed graphs. Specifically, we reformulate the heterogeneous graph condensation problem as a data selection issue, offering a new perspective for assessing and condensing representative nodes and edges in the heterogeneous graphs. By leveraging rich meta-paths, we introduce a new, high-quality heterogeneous data selection criterion to select target-type nodes. Furthermore, two training-free condensation strategies for heterogeneous graphs are designed to condense and synthesize other-types nodes effectively.

[LG-90] Decoding fairness: a reinforcement learning perspective

链接: https://arxiv.org/abs/2412.16249
作者: Guozhong Zheng,Jiqiang Zhang,Xin Ou,Shengfeng Deng,Li Chen
关键词: orthodox Economics, prefer fair acts, humans prefer fair, ultimatum game, humans prefer
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Adaptation and Self-Organizing Systems (nlin.AO); Physics and Society (physics.soc-ph); Populations and Evolution (q-bio.PE)
*备注: 12 pages, 13 figures. Comments are appreciated

点击查看摘要

Abstract:Behavioral experiments on the ultimatum game (UG) reveal that we humans prefer fair acts, which contradicts the prediction made in orthodox Economics. Existing explanations, however, are mostly attributed to exogenous factors within the imitation learning framework. Here, we adopt the reinforcement learning paradigm, where individuals make their moves aiming to maximize their accumulated rewards. Specifically, we apply Q-learning to UG, where each player is assigned two Q-tables to guide decisions for the roles of proposer and responder. In a two-player scenario, fairness emerges prominently when both experiences and future rewards are appreciated. In particular, the probability of successful deals increases with higher offers, which aligns with observations in behavioral experiments. Our mechanism analysis reveals that the system undergoes two phases, eventually stabilizing into fair or rational strategies. These results are robust when the rotating role assignment is replaced by a random or fixed manner, or the scenario is extended to a latticed population. Our findings thus conclude that the endogenous factor is sufficient to explain the emergence of fairness, exogenous factors are not needed.

[LG-91] Bag of Tricks for Multimodal AutoML with Image Text and Tabular Data

链接: https://arxiv.org/abs/2412.16243
作者: Zhiqiang Tang,Zihan Zhong,Tong He,Gerald Friedland
关键词: automatic machine learning, machine learning, paper studies, practices for automatic, automatic machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies the best practices for automatic machine learning (AutoML). While previous AutoML efforts have predominantly focused on unimodal data, the multimodal aspect remains under-explored. Our study delves into classification and regression problems involving flexible combinations of image, text, and tabular data. We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications, encompassing all 4 combinations of the 3 modalities. Across this benchmark, we scrutinize design choices related to multimodal fusion strategies, multimodal data augmentation, converting tabular data into text, cross-modal alignment, and handling missing modalities. Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline, achieving robust performance on diverse datasets.

[LG-92] Utilizing Causal Network Markers to Identify Tipping Points ahead of Critical Transition

链接: https://arxiv.org/abs/2412.16235
作者: Shirui Bian,Zezhou Wang,Siyang Leng,Wei Lin,Jifan Shi
关键词: introducing timely interventions, Early-warning signals, predict critical transitions, timely interventions, delicate design
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:Early-warning signals of delicate design are always used to predict critical transitions in complex systems, which makes it possible to render the systems far away from the catastrophic state by introducing timely interventions. Traditional signals including the dynamical network biomarker (DNB), based on statistical properties such as variance and autocorrelation of nodal dynamics, overlook directional interactions and thus have limitations in capturing underlying mechanisms and simultaneously sustaining robustness against noise perturbations. This paper therefore introduces a framework of causal network markers (CNMs) by incorporating causality indicators, which reflect the directional influence between variables. Actually, to detect and identify the tipping points ahead of critical transition, two markers are designed: CNM-GC for linear causality and CNM-TE for non-linear causality, as well as a functional representation of different causality indicators and a clustering technique to verify the system’s dominant group. Through demonstrations using benchmark models and real-world datasets of epileptic seizure, the framework of CNMs shows higher predictive power and accuracy than the traditional DNB indicator. It is believed that, due to the versatility and scalability, the CNMs are suitable for comprehensively evaluating the systems. The most possible direction for application includes the identification of tipping points in clinical disease.

[LG-93] Is AI Robust Enough for Scientific Research?

链接: https://arxiv.org/abs/2412.16234
作者: Jun-Jie Zhang,Jiahao Song,Xiu-Cheng Wang,Fu-Peng Li,Zehan Liu,Jian-Nan Chen,Haoning Dang,Shiyao Wang,Yiyan Zhang,Jianhui Xu,Chunxiang Shi,Fei Wang,Long-Gang Pang,Nan Cheng,Weiwei Zhang,Duo Zhang,Deyu Meng
关键词: phenomenon largely overlooked, exhibit high susceptibility, networks exhibit high, scientific community utilizing, neural networks exhibit
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 26 pages, 6 figures

点击查看摘要

Abstract:We uncover a phenomenon largely overlooked by the scientific community utilizing AI: neural networks exhibit high susceptibility to minute perturbations, resulting in significant deviations in their outputs. Through an analysis of five diverse application areas – weather forecasting, chemical energy and force calculations, fluid dynamics, quantum chromodynamics, and wireless communication – we demonstrate that this vulnerability is a broad and general characteristic of AI systems. This revelation exposes a hidden risk in relying on neural networks for essential scientific computations, calling further studies on their reliability and security.

[LG-94] GNN-Transformer Cooperative Architecture for Trustworthy Graph Contrastive Learning

链接: https://arxiv.org/abs/2412.16218
作者: Jianqing Liang,Xinkai Wei,Min Chen,Yuan Liu,Zhiqiang Wang,Jiye Liang
关键词: Graph contrastive learning, hot topic, Trustworthy Graph Contrastive, contrastive learning, Graph contrastive
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph contrastive learning (GCL) has become a hot topic in the field of graph representation learning. In contrast to traditional supervised learning relying on a large number of labels, GCL exploits augmentation strategies to generate multiple views and positive/negative pairs, both of which greatly influence the performance. Unfortunately, commonly used random augmentations may disturb the underlying semantics of graphs. Moreover, traditional GNNs, a type of widely employed encoders in GCL, are inevitably confronted with over-smoothing and over-squashing problems. To address these issues, we propose GNN-Transformer Cooperative Architecture for Trustworthy Graph Contrastive Learning (GTCA), which inherits the advantages of both GNN and Transformer, incorporating graph topology to obtain comprehensive graph representations. Theoretical analysis verifies the trustworthiness of the proposed method. Extensive experiments on benchmark datasets demonstrate state-of-the-art empirical performance.

[LG-95] FairTP: A Prolonged Fairness Framework for Traffic Prediction

链接: https://arxiv.org/abs/2412.16214
作者: Jiangnan Xia,Yu Yang,Jiaxing Shen,Senzhang Wang,Jiannong Cao
关键词: intelligent transportation systems, fairness, plays a crucial, crucial role, role in intelligent
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic prediction plays a crucial role in intelligent transportation systems. Existing approaches primarily focus on improving overall accuracy, often neglecting a critical issue: whether predictive models lead to biased decisions by transportation authorities. In practice, the uneven deployment of traffic sensors across urban areas results in imbalanced data, causing prediction models to perform poorly in certain regions and leading to unfair decision-making. This imbalance ultimately harms the equity and quality of life for residents. Moreover, current fairness-aware machine learning models only ensure fairness at specific time points, failing to maintain fairness over extended periods. As traffic conditions change, such static fairness approaches become ineffective. To address this gap, we propose FairTP, a framework for prolonged fair traffic prediction. We introduce two new fairness definitions tailored for dynamic traffic scenarios. Fairness in traffic prediction is not static; it varies over time and across regions. Each sensor or urban area can alternate between two states: “sacrifice” (low prediction accuracy) and “benefit” (high prediction accuracy). Prolonged fairness is achieved when the overall states of sensors remain similar over a given period. We define two types of fairness: region-based static fairness and sensor-based dynamic fairness. To implement this, FairTP incorporates a state identification module to classify sensors’ states as either “sacrifice” or “benefit,” enabling prolonged fairness-aware predictions. Additionally, we introduce a state-guided balanced sampling strategy to further enhance fairness, addressing performance disparities among regions with uneven sensor distributions. Extensive experiments on two real-world datasets demonstrate that FairTP significantly improves prediction fairness while minimizing accuracy degradation.

[LG-96] Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased

链接: https://arxiv.org/abs/2412.16209
作者: Nathan Phelps,Daniel J. Lizotte,Douglas G. Woolford
关键词: Imbalanced binary classification, binary classification problems, classification problems arise, Imbalanced binary, majority class
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Imbalanced binary classification problems arise in many fields of study. When using machine learning models for these problems, it is common to subsample the majority class (i.e., undersampling) to create a (more) balanced dataset for model training. This biases the model’s predictions because the model learns from a dataset that does not follow the same data generating process as new data. One way of accounting for this bias is to analytically map the resulting predictions to new values based on the sampling rate for the majority class, which was used to create the training dataset. While this approach may work well for some machine learning models, we have found that calibrating a random forest this way has unintended negative consequences, including prevalence estimates that can be upwardly biased. These prevalence estimates depend on both i) the number of predictors considered at each split in the random forest; and ii) the sampling rate used. We explain the former using known properties of random forests and analytical calibration. However, in investigating the latter issue, we made a surprising discovery - contrary to the widespread belief that decision trees are biased towards the majority class, they actually can be biased towards the minority class.

[LG-97] Algorithmic Strategies for Sustainable Reuse of Neural Network Accelerators with Permanent Faults

链接: https://arxiv.org/abs/2412.16208
作者: Youssef A. Ait Alama,Sampada Sakpal,Ke Wang,Razvan Bunescu,Avinash Karanth,Ahmed Louri
关键词: machine learning accelerators, permanent hardware failure, growing challenge, challenge for machine, machine learning
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Hardware failures are a growing challenge for machine learning accelerators, many of which are based on systolic arrays. When a permanent hardware failure occurs in a systolic array, existing solutions include localizing and isolating the faulty processing element (PE), using a redundant PE for re-execution, or in some extreme cases decommissioning the entire accelerator for further investigation. In this paper, we propose novel algorithmic approaches that mitigate permanent hardware faults in neural network (NN) accelerators by uniquely integrating the behavior of the faulty component instead of bypassing it. In doing so, we aim for a more sustainable use of the accelerator where faulty hardware is neither bypassed nor discarded, instead being given a second life. We first introduce a CUDA-accelerated systolic array simulator in PyTorch, which enabled us to quantify the impact of permanent faults appearing on links connecting two PEs or in weight registers, where one bit is stuck at 0 or 1 in the float32, float16, or bfloat16 representation. We then propose several algorithmic mitigation techniques for a subset of stuck-at faults, such as Invertible Scaling or Shifting of activations and weights, or fine tuning with the faulty behavior. Notably, the proposed techniques do not require any hardware modification, instead relying on existing components of widely used systolic array based accelerators, such as normalization, activation, and storage units. Extensive experimental evaluations using fully connected and convolutional NNs trained on MNIST, CIFAR-10 and ImageNet show that the proposed fault-tolerant approach matches or gets very close to the original fault-free accuracy.

[LG-98] Synthetic Time Series Data Generation for Healthcare Applications: A PCG Case Study

链接: https://arxiv.org/abs/2412.16207
作者: Ainaz Jamshidi,Muhammad Arif,Sabir Ali Kalhoro,Alexander Gelbukh
关键词: safeguarding patient privacy, advancing healthcare diagnostics, patient privacy, PCG, PCG data
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The generation of high-quality medical time series data is essential for advancing healthcare diagnostics and safeguarding patient privacy. Specifically, synthesizing realistic phonocardiogram (PCG) signals offers significant potential as a cost-effective and efficient tool for cardiac disease pre-screening. Despite its potential, the synthesis of PCG signals for this specific application received limited attention in research. In this study, we employ and compare three state-of-the-art generative models from different categories - WaveNet, DoppelGANger, and DiffWave - to generate high-quality PCG data. We use data from the George B. Moody PhysioNet Challenge 2022. Our methods are evaluated using various metrics widely used in the previous literature in the domain of time series data generation, such as mean absolute error and maximum mean discrepancy. Our results demonstrate that the generated PCG data closely resembles the original datasets, indicating the effectiveness of our generative models in producing realistic synthetic PCG data. In our future work, we plan to incorporate this method into a data augmentation pipeline to synthesize abnormal PCG signals with heart murmurs, in order to address the current scarcity of abnormal data. We hope to improve the robustness and accuracy of diagnostic tools in cardiology, enhancing their effectiveness in detecting heart murmurs.

[LG-99] Stabilizing Machine Learning for Reproducible and Explainable Results: A Novel Validation Approach to Subject-Specific Insights

链接: https://arxiv.org/abs/2412.16199
作者: Gideon Vos,Liza van Eijk,Zoltan Sarnyai,Mostafa Rahimi Azghadi
关键词: Machine Learning, Learning is transforming, improving diagnostic accuracy, feature importance, personalizing treatments
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 7 figures

点击查看摘要

Abstract:Machine Learning is transforming medical research by improving diagnostic accuracy and personalizing treatments. General ML models trained on large datasets identify broad patterns across populations, but their effectiveness is often limited by the diversity of human biology. This has led to interest in subject-specific models that use individual data for more precise predictions. However, these models are costly and challenging to develop. To address this, we propose a novel validation approach that uses a general ML model to ensure reproducible performance and robust feature importance analysis at both group and subject-specific levels. We tested a single Random Forest (RF) model on nine datasets varying in domain, sample size, and demographics. Different validation techniques were applied to evaluate accuracy and feature importance consistency. To introduce variability, we performed up to 400 trials per subject, randomly seeding the ML algorithm for each trial. This generated 400 feature sets per subject, from which we identified top subject-specific features. A group-specific feature importance set was then derived from all subject-specific results. We compared our approach to conventional validation methods in terms of performance and feature importance consistency. Our repeated trials approach, with random seed variation, consistently identified key features at the subject level and improved group-level feature importance analysis using a single general model. Subject-specific models address biological variability but are resource-intensive. Our novel validation technique provides consistent feature importance and improved accuracy within a general ML model, offering a practical and explainable alternative for clinical research.

[LG-100] Real-valued continued fraction of straight lines

链接: https://arxiv.org/abs/2412.16191
作者: Vijay Prakash S
关键词: continued fraction, straight lines, mathematical analysis, extensively for mathematical, continued
类目: Machine Learning (cs.LG)
*备注: 12 pages, 30 figures

点击查看摘要

Abstract:In an unbounded plane, straight lines are used extensively for mathematical analysis. They are tools of convenience. However, those with high slope values become unbounded at a faster rate than the independent variable. So, straight lines, in this work, are made to be bounded by introducing a parametric nonlinear term that is positive. The straight lines are transformed into bounded nonlinear curves that become unbounded at a much slower rate than the independent variable. This transforming equation can be expressed as a continued fraction of straight lines. The continued fraction is real-valued and converges to the solutions of the transforming equation. Following Euler’s method, the continued fraction has been reduced into an infinite series. The usefulness of the bounding nature of continued fraction is demonstrated by solving the problem of image classification. Parameters estimated on the Fashion-MNIST dataset of greyscale images using continued fraction of regression lines have less variance, converge quickly and are more accurate than the linear counterpart. Moreover, this multi-dimensional parametric estimation problem can be expressed on xy- plane using the parameters of the continued fraction and patterns emerge on planar plots.

[LG-101] Hierarchical Multi-Agent DRL Based Dynamic Cluster Reconfiguration for UAV Mobility Management

链接: https://arxiv.org/abs/2412.16167
作者: Irshad A. Meer,Karl-Ludwig Besser,Mustafa Ozger,Dominic Schupke,H. Vincent Poor,Cicek Cavdar
关键词: Multi-connectivity involves dynamic, efficient mobility management, mobility management strategies, coordinated resource allocation, involves dynamic cluster
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Multi-connectivity involves dynamic cluster formation among distributed access points (APs) and coordinated resource allocation from these APs, highlighting the need for efficient mobility management strategies for users with multi-connectivity. In this paper, we propose a novel mobility management scheme for unmanned aerial vehicles (UAVs) that uses dynamic cluster reconfiguration with energy-efficient power allocation in a wireless interference network. Our objective encompasses meeting stringent reliability demands, minimizing joint power consumption, and reducing the frequency of cluster reconfiguration. To achieve these objectives, we propose a hierarchical multi-agent deep reinforcement learning (H-MADRL) framework, specifically tailored for dynamic clustering and power allocation. The edge cloud connected with a set of APs through low latency optical back-haul links hosts the high-level agent responsible for the optimal clustering policy, while low-level agents reside in the APs and are responsible for the power allocation policy. To further improve the learning efficiency, we propose a novel action-observation transition-driven learning algorithm that allows the low-level agents to use the action space from the high-level agent as part of the local observation space. This allows the lower-level agents to share partial information about the clustering policy and allocate the power more efficiently. The simulation results demonstrate that our proposed distributed algorithm achieves comparable performance to the centralized algorithm. Additionally, it offers better scalability, as the decision time for clustering and power allocation increases by only 10% when doubling the number of APs, compared to a 90% increase observed with the centralized approach.

[LG-102] owards structure-preserving quantum encodings

链接: https://arxiv.org/abs/2412.17772
作者: Arthur J. Parzygnat,Tai-Danae Bradley,Andrew Vlasic,Anh Pham
关键词: potential computational advantage, Harnessing the potential, learning tasks relies, machine learning tasks, quantum computers
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Category Theory (math.CT)
*备注: 17 pages body, 10 pages back matter; Comments welcome!

点击查看摘要

Abstract:Harnessing the potential computational advantage of quantum computers for machine learning tasks relies on the uploading of classical data onto quantum computers through what are commonly referred to as quantum encodings. The choice of such encodings may vary substantially from one task to another, and there exist only a few cases where structure has provided insight into their design and implementation, such as symmetry in geometric quantum learning. Here, we propose the perspective that category theory offers a natural mathematical framework for analyzing encodings that respect structure inherent in datasets and learning tasks. We illustrate this with pedagogical examples, which include geometric quantum machine learning, quantum metric learning, topological data analysis, and more. Moreover, our perspective provides a language in which to ask meaningful and mathematically precise questions for the design of quantum encodings and circuits for quantum machine learning tasks.

[LG-103] Minimax Optimal Simple Regret in Two-Armed Best-Arm Identification

链接: https://arxiv.org/abs/2412.17753
作者: Masahiro Kato
关键词: fixed-budget best-arm identification, two-armed fixed-budget best-arm, minimax optimal algorithm, Neyman allocation, Neyman allocation asymptotically
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This study investigates an asymptotically minimax optimal algorithm in the two-armed fixed-budget best-arm identification (BAI) problem. Given two treatment arms, the objective is to identify the arm with the highest expected outcome through an adaptive experiment. We focus on the Neyman allocation, where treatment arms are allocated following the ratio of their outcome standard deviations. Our primary contribution is to prove the minimax optimality of the Neyman allocation for the simple regret, defined as the difference between the expected outcomes of the true best arm and the estimated best arm. Specifically, we first derive a minimax lower bound for the expected simple regret, which characterizes the worst-case performance achievable under the location-shift distributions, including Gaussian distributions. We then show that the simple regret of the Neyman allocation asymptotically matches this lower bound, including the constant term, not just the rate in terms of the sample size, under the worst-case distribution. Notably, our optimality result holds without imposing locality restrictions on the distribution, such as the local asymptotic normality. Furthermore, we demonstrate that the Neyman allocation reduces to the uniform allocation, i.e., the standard randomized controlled trial, under Bernoulli distributions.

[LG-104] owards An Unsupervised Learning Scheme for Efficiently Solving Parameterized Mixed-Integer Programs

链接: https://arxiv.org/abs/2412.17623
作者: Shiyuan Qu,Fenglian Dong,Zhiwei Wei,Chao Shang
关键词: unsupervised learning scheme, unsupervised learning fashion, unsupervised learning, scheme for accelerating, learning scheme
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we describe a novel unsupervised learning scheme for accelerating the solution of a family of mixed integer programming (MIP) problems. Distinct substantially from existing learning-to-optimize methods, our proposal seeks to train an autoencoder (AE) for binary variables in an unsupervised learning fashion, using data of optimal solutions to historical instances for a parametric family of this http URL a deliberate design of AE architecture and exploitation of its statistical implication, we present a simple and straightforward strategy to construct a class of cutting plane constraints from the decoder parameters of an offline-trained AE. These constraints reliably enclose the optimal binary solutions of new problem instances thanks to the representation strength of the AE. More importantly, their integration into the primal MIP problem leads to a tightened MIP with the reduced feasible region, which can be resolved at decision time using off-the-shelf solvers with much higher efficiency. Our method is applied to a benchmark batch process scheduling problem formulated as a mixed integer linear programming (MILP) problem. Comprehensive results demonstrate that our approach significantly reduces the computational cost of off-the-shelf MILP solvers while retaining a high solution quality. The codes of this work are open-sourced at this https URL.

[LG-105] Probability-density-aware Semi-supervised Learning

链接: https://arxiv.org/abs/2412.17547
作者: Shuyang Liu,Ruiqiu Zheng,Yunhang Shen,Ke Li,Xing Sun,Zhou Yu,Shaohui Lin
关键词: Semi-supervised learning, Label Propagation, neighbor points lie, cluster assumption, Measure Label Propagation
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) assumes that neighbor points lie in the same category (neighbor assumption), and points in different clusters belong to various categories (cluster assumption). Existing methods usually rely on similarity measures to retrieve the similar neighbor points, ignoring cluster assumption, which may not utilize unlabeled information sufficiently and effectively. This paper first provides a systematical investigation into the significant role of probability density in SSL and lays a solid theoretical foundation for cluster assumption. To this end, we introduce a Probability-Density-Aware Measure (PM) to discern the similarity between neighbor points. To further improve Label Propagation, we also design a Probability-Density-Aware Measure Label Propagation (PMLP) algorithm to fully consider the cluster assumption in label propagation. Last but not least, we prove that traditional pseudo-labeling could be viewed as a particular case of PMLP, which provides a comprehensive theoretical understanding of PMLP’s superior performance. Extensive experiments demonstrate that PMLP achieves outstanding performance compared with other recent methods.

[LG-106] Optimal Convergence Rates for Neural Operators

链接: https://arxiv.org/abs/2412.17518
作者: Mike Nguyen,Nicole Mücke
关键词: kernel Hilbert spaces, neural tangent kernel, reproducing kernel Hilbert, tangent kernel, generalization properties
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce the neural tangent kernel (NTK) regime for two-layer neural operators and analyze their generalization properties. For early-stopped gradient descent (GD), we derive fast convergence rates that are known to be minimax optimal within the framework of non-parametric regression in reproducing kernel Hilbert spaces (RKHS). We provide bounds on the number of hidden neurons and the number of second-stage samples necessary for generalization. To justify our NTK regime, we additionally show that any operator approximable by a neural operator can also be approximated by an operator from the RKHS. A key application of neural operators is learning surrogate maps for the solution operators of partial differential equations (PDEs). We consider the standard Poisson equation to illustrate our theoretical findings with simulations.

[LG-107] Uncertainties of Satellite-based Essential Climate Variables from Deep Learning

链接: https://arxiv.org/abs/2412.17506
作者: Junyang Gou,Arnt-Børre Salberg,Mostafa Kiani Shahvandi,Mohammad J. Tourian,Ulrich Meyer,Eva Boergens,Anders U. Waldeland,Isabella Velicogna,Fredrik Dahl,Adrian Jäggi,Konrad Schindler,Benedikt Soja
关键词: Accurate uncertainty information, essential climate variables, reliable climate modeling, Accurate uncertainty, deep learning
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate uncertainty information associated with essential climate variables (ECVs) is crucial for reliable climate modeling and understanding the spatiotemporal evolution of the Earth system. In recent years, geoscience and climate scientists have benefited from rapid progress in deep learning to advance the estimation of ECV products with improved accuracy. However, the quantification of uncertainties associated with the output of such deep learning models has yet to be thoroughly adopted. This survey explores the types of uncertainties associated with ECVs estimated from deep learning and the techniques to quantify them. The focus is on highlighting the importance of quantifying uncertainties inherent in ECV estimates, considering the dynamic and multifaceted nature of climate data. The survey starts by clarifying the definition of aleatoric and epistemic uncertainties and their roles in a typical satellite observation processing workflow, followed by bridging the gap between conventional statistical and deep learning views on uncertainties. Then, we comprehensively review the existing techniques for quantifying uncertainties associated with deep learning algorithms, focusing on their application in ECV studies. The specific need for modification to fit the requirements from both the Earth observation side and the deep learning side in such interdisciplinary tasks is discussed. Finally, we demonstrate our findings with two ECV examples, snow cover and terrestrial water storage, and provide our perspectives for future research.

[LG-108] More is Less? A Simulation-Based Approach to Dynamic Interactions between Biases in Multimodal Models

链接: https://arxiv.org/abs/2412.17505
作者: Mounia Drissi
关键词: including public safety, critical domains including, domains including public, Multimodal machine learning, machine learning models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 9 pages, 3 figures

点击查看摘要

Abstract:Multimodal machine learning models, such as those that combine text and image modalities, are increasingly used in critical domains including public safety, security, and healthcare. However, these systems inherit biases from their single modalities. This study proposes a systemic framework for analyzing dynamic multimodal bias interactions. Using the MMBias dataset, which encompasses categories prone to bias such as religion, nationality, and sexual orientation, this study adopts a simulation-based heuristic approach to compute bias scores for text-only, image-only, and multimodal embeddings. A framework is developed to classify bias interactions as amplification (multimodal bias exceeds both unimodal biases), mitigation (multimodal bias is lower than both), and neutrality (multimodal bias lies between unimodal biases), with proportional analyzes conducted to identify the dominant mode and dynamics in these interactions. The findings highlight that amplification (22%) occurs when text and image biases are comparable, while mitigation (11%) arises under the dominance of text bias, highlighting the stabilizing role of image bias. Neutral interactions (67%) are related to a higher text bias without divergence. Conditional probabilities highlight the text’s dominance in mitigation and mixed contributions in neutral and amplification cases, underscoring complex modality interplay. In doing so, the study encourages the use of this heuristic, systemic, and interpretable framework to analyze multimodal bias interactions, providing insight into how intermodal biases dynamically interact, with practical applications for multimodal modeling and transferability to context-based datasets, all essential for developing fair and equitable AI models.

[LG-109] Learning from Summarized Data: Gaussian Process Regression with Sample Quasi-Likelihood AAAI2025

链接: https://arxiv.org/abs/2412.17455
作者: Yuta Shikuri
关键词: powerful Bayesian nonlinear, Bayesian nonlinear regression, powerful Bayesian, Bayesian nonlinear, Gaussian process regression
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 19 pages, 4 figures, 5 tables, AAAI2025

点击查看摘要

Abstract:Gaussian process regression is a powerful Bayesian nonlinear regression method. Recent research has enabled the capture of many types of observations using non-Gaussian likelihoods. To deal with various tasks in spatial modeling, we benefit from this development. Difficulties still arise when we can only access summarized data consisting of representative features, summary statistics, and data point counts. Such situations frequently occur primarily due to concerns about confidentiality and management costs associated with spatial data. This study tackles learning and inference using only summarized data within the framework of Gaussian process regression. To address this challenge, we analyze the approximation errors in the marginal likelihood and posterior distribution that arise from utilizing representative features. We also introduce the concept of sample quasi-likelihood, which facilitates learning and inference using only summarized data. Non-Gaussian likelihoods satisfying certain assumptions can be captured by specifying a variance function that characterizes a sample quasi-likelihood function. Theoretical and experimental results demonstrate that the approximation performance is influenced by the granularity of summarized data relative to the length scale of covariance functions. Experiments on a real-world dataset highlight the practicality of our method for spatial modeling.

[LG-110] Emerging Microelectronic Materials by Design: Navigating Combinatorial Design Space with Scarce and Dispersed Data

链接: https://arxiv.org/abs/2412.17283
作者: Hengrui Zhang,Alexandru B. Georgescu,Suraj Yerramilli,Christopher Karpovich,Daniel W. Apley,Elsa A. Olivetti,James M. Rondinelli,Wei Chen
关键词: biomedical applications call, materials, sustainable energy, materials design, increasing demands
类目: Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:The increasing demands of sustainable energy, electronics, and biomedical applications call for next-generation functional materials with unprecedented properties. Of particular interest are emerging materials that display exceptional physical properties, making them promising candidates in energy-efficient microelectronic devices. As the conventional Edisonian approach becomes significantly outpaced by growing societal needs, emerging computational modeling and machine learning (ML) methods are employed for the rational design of materials. However, the complex physical mechanisms, cost of first-principles calculations, and the dispersity and scarcity of data pose challenges to both physics-based and data-driven materials modeling. Moreover, the combinatorial composition-structure design space is high-dimensional and often disjoint, making design optimization nontrivial. In this Account, we review a team effort toward establishing a framework that integrates data-driven and physics-based methods to address these challenges and accelerate materials design. We begin by presenting our integrated materials design framework and its three components in a general context. We then provide an example of applying this materials design framework to metal-insulator transition (MIT) materials, a specific type of emerging materials with practical importance in next-generation memory technologies. We identify multiple new materials which may display this property and propose pathways for their synthesis. Finally, we identify some outstanding challenges in data-driven materials design, such as materials data quality issues and property-performance mismatch. We seek to raise awareness of these overlooked issues hindering materials design, thus stimulating efforts toward developing methods to mitigate the gaps.

[LG-111] Machine learning and natural language processing models to predict the extent of food processing

链接: https://arxiv.org/abs/2412.17217
作者: Nalin Arora,Sumit Bhagat,Riya Dhama,Ganesh Bagler
关键词: adverse health effects, numerous adverse health, ultra-processed food consumption, ultra-processed food, dramatic increase
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 60 Pages (22 Pages of Main Manuscript + Supplementary Material), 2 Figures, 6 Tables

点击查看摘要

Abstract:The dramatic increase in consumption of ultra-processed food has been associated with numerous adverse health effects. Given the public health consequences linked to ultra-processed food consumption, it is highly relevant to build computational models to predict the processing of food products. We created a range of machine learning, deep learning, and NLP models to predict the extent of food processing by integrating the FNDDS dataset of food products and their nutrient profiles with their reported NOVA processing level. Starting with the full nutritional panel of 102 features, we further implemented coarse-graining of features to 65 and 13 nutrients by dropping flavonoids and then by considering the 13-nutrient panel of FDA, respectively. LGBM Classifier and Random Forest emerged as the best model for 102 and 65 nutrients, respectively, with an F1-score of 0.9411 and 0.9345 and MCC of 0.8691 and 0.8543. For the 13-nutrient panel, Gradient Boost achieved the best F1-score of 0.9284 and MCC of 0.8425. We also implemented NLP based models, which exhibited state-of-the-art performance. Besides distilling nutrients critical for model performance, we present a user-friendly web server for predicting processing level based on the nutrient panel of a food product: this https URL.

[LG-112] hermodynamic computing out of equilibrium

链接: https://arxiv.org/abs/2412.17183
作者: Stephen Whitelam,Corneel Casert
关键词: present the design, arbitrary nonlinear calculations, perform arbitrary nonlinear, thermodynamic neural networks, neural networks
类目: atistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:We present the design for a thermodynamic computer that can perform arbitrary nonlinear calculations in or out of equilibrium. Simple thermodynamic circuits, fluctuating degrees of freedom in contact with a thermal bath and confined by a quartic potential, display an activity that is a nonlinear function of their input. Such circuits can therefore be regarded as thermodynamic neurons, and can serve as the building blocks of networked structures that act as thermodynamic neural networks, universal function approximators whose operation is powered by thermal fluctuations. We simulate a digital model of a thermodynamic neural network, and show that its parameters can be adjusted by genetic algorithm to perform nonlinear calculations at specified observation times, regardless of whether the system has attained thermal equilibrium. This work expands the field of thermodynamic computing beyond the regime of thermal equilibrium, enabling fully nonlinear computations, analogous to those performed by classical neural networks, at specified observation times.

[LG-113] Scalable Speech Enhancement with Dynamic Channel Pruning ICASSP

链接: https://arxiv.org/abs/2412.17121
作者: Riccardo Miccini,Clement Laroche,Tobias Piechowiak,Luca Pezzarossa
关键词: Speech Enhancement, remote collaborative environments, collaborative environments, essential for improving, improving productivity
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted for publication at the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

点击查看摘要

Abstract:Speech Enhancement (SE) is essential for improving productivity in remote collaborative environments. Although deep learning models are highly effective at SE, their computational demands make them impractical for embedded systems. Furthermore, acoustic conditions can change significantly in terms of difficulty, whereas neural networks are usually static with regard to the amount of computation performed. To this end, we introduce Dynamic Channel Pruning to the audio domain for the first time and apply it to a custom convolutional architecture for SE. Our approach works by identifying unnecessary convolutional channels at runtime and saving computational resources by not computing the activations for these channels and retrieving their filters. When trained to only use 25% of channels, we save 29.6% of MACs while only causing a 0.75% drop in PESQ. Thus, DynCP offers a promising path toward deploying larger and more powerful SE solutions on resource-constrained devices.

[LG-114] Differentially Private Random Block Coordinate Descent

链接: https://arxiv.org/abs/2412.17054
作者: Artavazd Maranjyan,Abdurakhmon Sadiev,Peter Richtárik
关键词: complex optimization tasks, gained significant attention, machine learning due, solving high-dimensional problems, decompose complex optimization
类目: Optimization and Control (math.OC); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Coordinate Descent (CD) methods have gained significant attention in machine learning due to their effectiveness in solving high-dimensional problems and their ability to decompose complex optimization tasks. However, classical CD methods were neither designed nor analyzed with data privacy in mind, a critical concern when handling sensitive information. This has led to the development of differentially private CD methods, such as DP-CD (Differentially Private Coordinate Descent) proposed by Mangold et al. (ICML 2022), yet a disparity remains between non-private CD and DP-CD methods. In our work, we propose a differentially private random block coordinate descent method that selects multiple coordinates with varying probabilities in each iteration using sketch matrices. Our algorithm generalizes both DP-CD and the classical DP-SGD (Differentially Private Stochastic Gradient Descent), while preserving the same utility guarantees. Furthermore, we demonstrate that better utility can be achieved through importance sampling, as our method takes advantage of the heterogeneity in coordinate-wise smoothness constants, leading to improved convergence rates.

[LG-115] Integrating Random Effects in Variational Autoencoders for Dimensionality Reduction of Correlated Data

链接: https://arxiv.org/abs/2412.16899
作者: Giora Simchoni,Saharon Rosset
关键词: Variational Autoencoders, dimensionality reduction, reduction of large-scale, large-scale tabular, tabular and image
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages, 5 figures

点击查看摘要

Abstract:Variational Autoencoders (VAE) are widely used for dimensionality reduction of large-scale tabular and image datasets, under the assumption of independence between data observations. In practice, however, datasets are often correlated, with typical sources of correlation including spatial, temporal and clustering structures. Inspired by the literature on linear mixed models (LMM), we propose LMMVAE – a novel model which separates the classic VAE latent model into fixed and random parts. While the fixed part assumes the latent variables are independent as usual, the random part consists of latent variables which are correlated between similar clusters in the data such as nearby locations or successive measurements. The classic VAE architecture and loss are modified accordingly. LMMVAE is shown to improve squared reconstruction error and negative likelihood loss significantly on unseen data, with simulated as well as real datasets from various applications and correlation scenarios. It also shows improvement in the performance of downstream tasks such as supervised classification on the learned representations.

[LG-116] A Parameter-Efficient Quantum Anomaly Detection Method on a Superconducting Quantum Processor

链接: https://arxiv.org/abs/2412.16867
作者: Maida Wang,Jinyang Jiang,Peter V. Coveney
关键词: address computational challenges, Quantum machine learning, Vector Data Description, Support Vector Data, computational challenges
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 30 pages, 13 figures

点击查看摘要

Abstract:Quantum machine learning has gained attention for its potential to address computational challenges. However, whether those algorithms can effectively solve practical problems and outperform their classical counterparts, especially on current quantum hardware, remains a critical question. In this work, we propose a novel quantum machine learning method, called Quantum Support Vector Data Description (QSVDD), for practical anomaly detection, which aims to achieve both parameter efficiency and superior accuracy compared to classical models. Emulation results indicate that QSVDD demonstrates favourable recognition capabilities compared to classical baselines, achieving an average accuracy of over 90% on benchmarks with significantly fewer trainable parameters. Theoretical analysis confirms that QSVDD has a comparable expressivity to classical counterparts while requiring only a fraction of the parameters. Furthermore, we demonstrate the first implementation of a quantum machine learning method for anomaly detection on a superconducting quantum processor. Specifically, we achieve an accuracy of over 80% with only 16 parameters on the device, providing initial evidence of QSVDD’s practical viability in the noisy intermediate-scale quantum era and highlighting its significant reduction in parameter requirements.

[LG-117] Bi-Sparse Unsupervised Feature Selection

链接: https://arxiv.org/abs/2412.16819
作者: Xianchao Xiu,Chenyi Huang,Pan Shang,Wanquan Liu
关键词: dimension reduction, bi-sparse UFS method, efficiently deal, deal with high-dimensional, unsupervised feature selection
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To efficiently deal with high-dimensional datasets in many areas, unsupervised feature selection (UFS) has become a rising technique for dimension reduction. Even though there are many UFS methods, most of them only consider the global structure of datasets by embedding a single sparse regularization or constraint. In this paper, we introduce a novel bi-sparse UFS method, called BSUFS, to simultaneously characterize both global and local structures. The core idea of BSUFS is to incorporate \ell_2,p -norm and \ell_q -norm into the classical principal component analysis (PCA), which enables our proposed method to select relevant features and filter out irrelevant noise accurately. Here, the parameters p and q are within the range of [0,1). Therefore, BSUFS not only constructs a unified framework for bi-sparse optimization, but also includes some existing works as special cases. To solve the resulting non-convex model, we propose an efficient proximal alternating minimization (PAM) algorithm using Riemannian manifold optimization and sparse optimization techniques. Theoretically, PAM is proven to have global convergence, i.e., for any random initial point, the generated sequence converges to a critical point that satisfies the first-order optimality condition. Extensive numerical experiments on synthetic and real-world datasets demonstrate the effectiveness of our proposed BSUFS. Specifically, the average accuracy (ACC) is improved by at least 4.71% and the normalized mutual information (NMI) is improved by at least 3.14% on average compared to the existing UFS competitors. The results validate the advantages of bi-sparse optimization in feature selection and show its potential for other fields in image processing. Our code will be available at this https URL.

[LG-118] Gradient-Based Non-Linear Inverse Learning

链接: https://arxiv.org/abs/2412.16794
作者: Abhishake,Nicole Mücke,Tapio Helin
关键词: statistical inverse learning, study statistical inverse, nonlinear inverse problems, study statistical, gradient descent
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study statistical inverse learning in the context of nonlinear inverse problems under random design. Specifically, we address a class of nonlinear problems by employing gradient descent (GD) and stochastic gradient descent (SGD) with mini-batching, both using constant step sizes. Our analysis derives convergence rates for both algorithms under classical a priori assumptions on the smoothness of the target function. These assumptions are expressed in terms of the integral operator associated with the tangent kernel, as well as through a bound on the effective dimension. Additionally, we establish stopping times that yield minimax-optimal convergence rates within the classical reproducing kernel Hilbert space (RKHS) framework. These results demonstrate the efficacy of GD and SGD in achieving optimal rates for nonlinear inverse problems in random design.

[LG-119] Fast Multi-Group Gaussian Process Factor Models

链接: https://arxiv.org/abs/2412.16773
作者: Evren Gokcen,Anna I. Jasper,Adam Kohn,Christian K. Machens,Byron M. Yu
关键词: Gaussian process factor, dimensionality reduction approaches, reduction approaches tailored, high-dimensional neural activity, tailored to neuroscience
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Gaussian processes are now commonly used in dimensionality reduction approaches tailored to neuroscience, especially to describe changes in high-dimensional neural activity over time. As recording capabilities expand to include neuronal populations across multiple brain areas, cortical layers, and cell types, interest in extending Gaussian process factor models to characterize multi-population interactions has grown. However, the cubic runtime scaling of current methods with the length of experimental trials and the number of recorded populations (groups) precludes their application to large-scale multi-population recordings. Here, we improve this scaling from cubic to linear in both trial length and group number. We present two approximate approaches to fitting multi-group Gaussian process factor models based on (1) inducing variables and (2) the frequency domain. Empirically, both methods achieved orders of magnitude speed-up with minimal impact on statistical performance, in simulation and on neural recordings of hundreds of neurons across three brain areas. The frequency domain approach, in particular, consistently provided the greatest runtime benefits with the fewest trade-offs in statistical performance. We further characterize the estimation biases introduced by the frequency domain approach and demonstrate effective strategies to mitigate them. This work enables a powerful class of analysis techniques to keep pace with the growing scale of multi-population recordings, opening new avenues for exploring brain function.

[LG-120] An explainable operator approximation framework under the guideline of Greens function

链接: https://arxiv.org/abs/2412.16644
作者: Jianghang Gu,Ling Wen,Yuntian Chen,Shiyi Chen
关键词: adress partial differential, Traditional numerical methods, Traditional numerical, partial differential equations, finite element method
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: no comments

点击查看摘要

Abstract:Traditional numerical methods, such as the finite element method and finite volume method, adress partial differential equations (PDEs) by discretizing them into algebraic equations and solving these iteratively. However, this process is often computationally expensive and time-consuming. An alternative approach involves transforming PDEs into integral equations and solving them using Green’s functions, which provide analytical solutions. Nevertheless, deriving Green’s functions analytically is a challenging and non-trivial task, particularly for complex systems. In this study, we introduce a novel framework, termed GreensONet, which is constructed based on the strucutre of deep operator networks (DeepONet) to learn embedded Green’s functions and solve PDEs via Green’s integral formulation. Specifically, the Trunk Net within GreensONet is designed to approximate the unknown Green’s functions of the system, while the Branch Net are utilized to approximate the auxiliary gradients of the Green’s function. These outputs are subsequently employed to perform surface integrals and volume integrals, incorporating user-defined boundary conditions and source terms, respectively. The effectiveness of the proposed framework is demonstrated on three types of PDEs in bounded domains: 3D heat conduction equations, reaction-diffusion equations, and Stokes equations. Comparative results in these cases demonstrate that GreenONet’s accuracy and generalization ability surpass those of existing methods, including Physics-Informed Neural Networks (PINN), DeepONet, Physics-Informed DeepONet (PI-DeepONet), and Fourier Neural Operators (FNO).

[LG-121] A learning-based approach to stochastic optimal control under reach-avoid constraint

链接: https://arxiv.org/abs/2412.16561
作者: Tingting Ni,Maryam Kamgarpour
关键词: Markovian systems subject, optimally control stochastic, optimally control, develop a model-free, constrained stochastic control
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a model-free approach to optimally control stochastic, Markovian systems subject to a reach-avoid constraint. Specifically, the state trajectory must remain within a safe set while reaching a target set within a finite time horizon. Due to the time-dependent nature of these constraints, we show that, in general, the optimal policy for this constrained stochastic control problem is non-Markovian, which increases the computational complexity. To address this challenge, we apply the state-augmentation technique from arXiv:2402.19360, reformulating the problem as a constrained Markov decision process (CMDP) on an extended state space. This transformation allows us to search for a Markovian policy, avoiding the complexity of non-Markovian policies. To learn the optimal policy without a system model, and using only trajectory data, we develop a log-barrier policy gradient approach. We prove that under suitable assumptions, the policy parameters converge to the optimal parameters, while ensuring that the system trajectories satisfy the stochastic reach-avoid constraint with high probability.

[LG-122] Robust random graph matching in dense graphs via vector approximate message passing

链接: https://arxiv.org/abs/2412.16457
作者: Zhangsong Li
关键词: correlated Gaussian Wigner, Gaussian Wigner matrices, latent vertex correspondence, Gaussian Wigner, correlated Gaussian
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 37 pages

点击查看摘要

Abstract:In this paper, we focus on the matching recovery problem between a pair of correlated Gaussian Wigner matrices with a latent vertex correspondence. We are particularly interested in a robust version of this problem such that our observation is a perturbed input (A+E,B+F) where (A,B) is a pair of correlated Gaussian Wigner matrices and E,F are adversarially chosen matrices supported on an unknown \epsilon n * \epsilon n principle minor of A,B , respectively. We propose a vector-approximate message passing (vector-AMP) algorithm that succeeds in polynomial time as long as the correlation \rho between (A,B) is a non-vanishing constant and \epsilon = o\big( \tfrac1(\log n)^20 \big) . The main methodological inputs for our result are the iterative random graph matching algorithm proposed in \citeDL22+, DL23+ and the spectral cleaning procedure proposed in \citeIS24+. To the best of our knowledge, our algorithm is the first efficient random graph matching type algorithm that is robust under any adversarial perturbations of n^1-o(1) size. Comments: 37 pages Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST) MSC classes: 68Q87, 90C35 Cite as: arXiv:2412.16457 [stat.ML] (or arXiv:2412.16457v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2412.16457 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-123] Sharp Results for Hypothesis Testing with Risk-Sensitive Agents

链接: https://arxiv.org/abs/2412.16452
作者: Flora C. Shi,Stephen Bates,Martin J. Wainwright
关键词: involving multiple parties, decision-making involving multiple, Statistical protocols, multiple parties, decision-making involving
类目: Methodology (stat.ME); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Statistical protocols are often used for decision-making involving multiple parties, each with their own incentives, private information, and ability to influence the distributional properties of the data. We study a game-theoretic version of hypothesis testing in which a statistician, also known as a principal, interacts with strategic agents that can generate data. The statistician seeks to design a testing protocol with controlled error, while the data-generating agents, guided by their utility and prior information, choose whether or not to opt in based on expected utility maximization. This strategic behavior affects the data observed by the statistician and, consequently, the associated testing error. We analyze this problem for general concave and monotonic utility functions and prove an upper bound on the Bayes false discovery rate (FDR). Underlying this bound is a form of prior elicitation: we show how an agent’s choice to opt in implies a certain upper bound on their prior null probability. Our FDR bound is unimprovable in a strong sense, achieving equality at a single point for an individual agent and at any countable number of points for a population of agents. We also demonstrate that our testing protocols exhibit a desirable maximin property when the principal’s utility is considered. To illustrate the qualitative predictions of our theory, we examine the effects of risk aversion, reward stochasticity, and signal-to-noise ratio, as well as the implications for the Food and Drug Administration’s testing protocols.

[LG-124] Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity AAAI25

链接: https://arxiv.org/abs/2412.16414
作者: Dmitry Bylinkin,Aleksandr Beznosikov
关键词: training high-performance models, recent years, sizes have increased, high-performance models, essential tool
类目: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted at AAAI25, 31 pages, 108 figures, 9 appendices

点击查看摘要

Abstract:In recent years, as data and problem sizes have increased, distributed learning has become an essential tool for training high-performance models. However, the communication bottleneck, especially for high-dimensional data, is a challenge. Several techniques have been developed to overcome this problem. These include communication compression and implementation of local steps, which work particularly well when there is similarity of local data samples. In this paper, we study the synergy of these approaches for efficient distributed optimization. We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity, leveraging variance reduction and error feedback frameworks. Our results are of record and confirmed by experiments on different average losses and datasets.

[LG-125] Machine Learning Neutrino-Nucleus Cross Sections

链接: https://arxiv.org/abs/2412.16303
作者: Daniel C. Hackett,Joshua Isaacson,Shirley Weishi Li,Karla Tame-Narvaez,Michael L. Wagman
关键词: Neutrino-nucleus scattering cross, critical theoretical inputs, Neutrino-nucleus scattering, scattering cross sections, critical theoretical
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Theory (nucl-th)
*备注: 5 pages, 2 figures + 6 pages Supplemental Material

点击查看摘要

Abstract:Neutrino-nucleus scattering cross sections are critical theoretical inputs for long-baseline neutrino oscillation experiments. However, robust modeling of these cross sections remains challenging. For a simple but physically motivated toy model of the DUNE experiment, we demonstrate that an accurate neural-network model of the cross section – leveraging Standard Model symmetries – can be learned from near-detector data. We then perform a neutrino oscillation analysis with simulated far-detector events, finding that the modeled cross section achieves results consistent with what could be obtained if the true cross section were known exactly. This proof-of-principle study highlights the potential of future neutrino near-detector datasets and data-driven cross-section models.

[LG-126] SGAC: A Graph Neural Network Framework for Imbalanced and Structure-Aware AMP Classification

链接: https://arxiv.org/abs/2412.16276
作者: Yingxu Wang,Victor Liang,Nan Yin,Siwei Liu,Eran Segal
关键词: Classifying antimicrobial peptides, Classifying antimicrobial, metagenomic sequencing data, antibiotic resistance, vast array
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classifying antimicrobial peptides(AMPs) from the vast array of peptides mined from metagenomic sequencing data is a significant approach to addressing the issue of antibiotic resistance. However, current AMP classification methods, primarily relying on sequence-based data, neglect the spatial structure of peptides, thereby limiting the accurate classification of AMPs. Additionally, the number of known AMPs is significantly lower than that of non-AMPs, leading to imbalanced datasets that reduce predictive accuracy for AMPs. To alleviate these two limitations, we first employ Omegafold to predict the three-dimensional spatial structures of AMPs and non-AMPs, constructing peptide graphs based on the amino acids’ C _\alpha positions. Building upon this, we propose a novel classification model named Spatial GNN-based AMP Classifier (SGAC). Our SGAC model employs a graph encoder based on Graph Neural Networks (GNNs) to process peptide graphs, generating high-dimensional representations that capture essential features from the three-dimensional spatial structure of amino acids. Then, to address the inherent imbalanced datasets, SGAC first incorporates Weight-enhanced Contrastive Learning, which clusters similar peptides while ensuring separation between dissimilar ones, using weighted contributions to emphasize AMP-specific features. Furthermore, SGAC employs Weight-enhanced Pseudo-label Distillation to dynamically generate high-confidence pseudo labels for ambiguous peptides, further refining predictions and promoting balanced learning between AMPs and non-AMPs. Experiments on publicly available AMP and non-AMP datasets demonstrate that SGAC significantly outperforms traditional sequence-based methods and achieves state-of-the-art performance among graph-based models, validating its effectiveness in AMP classification.

[LG-127] Mean–Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms Regret Analysis and Empirical Study

链接: https://arxiv.org/abs/2412.16175
作者: Yilie Huang,Yanwei Jia,Xun Yu Zhou
关键词: diffusion processes driven, variance portfolio selection, diffusion processes, stock prices, driven by observable
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 76 pages, 5 figures, 7 tables

点击查看摘要

Abstract:We study continuous-time mean–variance portfolio selection in markets where stock prices are diffusion processes driven by observable factors that are also diffusion processes yet the coefficients of these processes are unknown. Based on the recently developed reinforcement learning (RL) theory for diffusion processes, we present a general data-driven RL algorithm that learns the pre-committed investment strategy directly without attempting to learn or estimate the market coefficients. For multi-stock Black–Scholes markets without factors, we further devise a baseline algorithm and prove its performance guarantee by deriving a sublinear regret bound in terms of Sharpe ratio. For performance enhancement and practical implementation, we modify the baseline algorithm into four variants, and carry out an extensive empirical study to compare their performance, in terms of a host of common metrics, with a large number of widely used portfolio allocation strategies on S\P 500 constituents. The results demonstrate that the continuous-time RL strategies are consistently among the best especially in a volatile bear market, and decisively outperform the model-based continuous-time counterparts by significant margins.

[LG-128] Online High-Frequency Trading Stock Forecasting with Automated Feature Clustering and Radial Basis Function Neural Networks

链接: https://arxiv.org/abs/2412.16160
作者: Adamantios Ntakaris,Gbenga Ibikunle
关键词: experimental machine learning, machine learning protocol, shallow neural network, neural network topology, feature importance mechanism
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: This paper was presented at the Economics of Financial Technology Conference, June 2023, in Edinburgh, UK

点击查看摘要

Abstract:This study presents an autonomous experimental machine learning protocol for high-frequency trading (HFT) stock price forecasting that involves a dual competitive feature importance mechanism and clustering via shallow neural network topology for fast training. By incorporating the k-means algorithm into the radial basis function neural network (RBFNN), the proposed method addresses the challenges of manual clustering and the reliance on potentially uninformative features. More specifically, our approach involves a dual competitive mechanism for feature importance, combining the mean-decrease impurity (MDI) method and a gradient descent (GD) based feature importance mechanism. This approach, tested on HFT Level 1 order book data for 20 S\P 500 stocks, enhances the forecasting ability of the RBFNN regressor. Our findings suggest that an autonomous approach to feature selection and clustering is crucial, as each stock requires a different input feature space. Overall, by automating the feature selection and clustering processes, we remove the need for manual topological grid search and provide a more efficient way to predict LOB’s mid-price.

信息检索

[IR-0] Leveraging Memory Retrieval to Enhance LLM -based Generative Recommendation

链接: https://arxiv.org/abs/2412.17593
作者: Chengbing Wang,Yang Zhang,Fengbin Zhu,Jizhi Zhang,Tianhao Shi,Fuli Feng
关键词: Leveraging Large Language, Large Language Models, Leveraging Large, Language Models, Large Language
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Leveraging Large Language Models (LLMs) to harness user-item interaction histories for item generation has emerged as a promising paradigm in generative recommendation. However, the limited context window of LLMs often restricts them to focusing on recent user interactions only, leading to the neglect of long-term interests involved in the longer histories. To address this challenge, we propose a novel Automatic Memory-Retrieval framework (AutoMR), which is capable of storing long-term interests in the memory and extracting relevant information from it for next-item generation within LLMs. Extensive experimental results on two real-world datasets demonstrate the effectiveness of our proposed AutoMR framework in utilizing long-term interests for generative recommendation.

[IR-1] Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark

链接: https://arxiv.org/abs/2412.17374
作者: Xiaopeng Li,Jingtong Gao,Pengyue Jia,Yichao Wang,Wanyu Wang,Yejing Wang,Yuhao Wang,Huifeng Guo,Ruiming Tang
关键词: Multi Scenario Recommendation, Multi Scenario, referring to building, gained much attention, building a unified
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multi Scenario Recommendation (MSR) tasks, referring to building a unified model to enhance performance across all recommendation scenarios, have recently gained much attention. However, current research in MSR faces two significant challenges that hinder the field’s development: the absence of uniform procedures for multi-scenario dataset processing, thus hindering fair comparisons, and most models being closed-sourced, which complicates comparisons with current SOTA models. Consequently, we introduce our benchmark, \textbfScenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline. Additionally, we validated the benchmark using an industrial advertising dataset, reinforcing its reliability and applicability in real-world scenarios. We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models based on our benchmark and thereby fostering a collaborative research ecosystem in MSR. Our source code is also publicly available.

[IR-2] SyNeg: LLM -Driven Synthetic Hard-Negatives for Dense Retrieval

链接: https://arxiv.org/abs/2412.17250
作者: Xiaopeng Li,Xiangyang Li,Hao Zhang,Zhaocheng Du,Pengyue Jia,Yichao Wang,Xiangyu Zhao,Huifeng Guo,Ruiming Tang
关键词: naive negative sampling, negative sampling, Dense retrieval, negative, significantly influenced
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The performance of Dense retrieval (DR) is significantly influenced by the quality of negative sampling. Traditional DR methods primarily depend on naive negative sampling techniques or on mining hard negatives through external retriever and meticulously crafted strategies. However, naive negative sampling often fails to adequately capture the accurate boundaries between positive and negative samples, whereas existing hard negative sampling methods are prone to false negatives, resulting in performance degradation and training instability. Recent advancements in large language models (LLMs) offer an innovative solution to these challenges by generating contextually rich and diverse negative samples. In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples. We first devise a \textitmulti-attribute self-reflection prompting strategy to direct LLMs in hard negative sample generation. Then, we implement a \textithybrid sampling strategy that integrates these synthetic negatives with traditionally retrieved negatives, thereby stabilizing the training process and improving retrieval performance. Extensive experiments on five benchmark datasets demonstrate the efficacy of our approach, and code is also publicly available.

[IR-3] GraphHash: Graph Clustering Enables Parameter Efficiency in Recommender Systems

链接: https://arxiv.org/abs/2412.17245
作者: Xinyi Wu,Donald Loveland,Runjin Chen,Yozen Liu,Xin Chen,Leonardo Neves,Ali Jadbabaie,Clark Mingxuan Ju,Neil Shah,Tong Zhao
关键词: handle high-cardinality categorical, high-cardinality categorical features, face significant memory, significant memory constraints, Deep recommender systems
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Deep recommender systems rely heavily on large embedding tables to handle high-cardinality categorical features such as user/item identifiers, and face significant memory constraints at scale. To tackle this challenge, hashing techniques are often employed to map multiple entities to the same embedding and thus reduce the size of the embedding tables. Concurrently, graph-based collaborative signals have emerged as powerful tools in recommender systems, yet their potential for optimizing embedding table reduction remains unexplored. This paper introduces GraphHash, the first graph-based approach that leverages modularity-based bipartite graph clustering on user-item interaction graphs to reduce embedding table sizes. We demonstrate that the modularity objective has a theoretical connection to message-passing, which provides a foundation for our method. By employing fast clustering algorithms, GraphHash serves as a computationally efficient proxy for message-passing during preprocessing and a plug-and-play graph-based alternative to traditional ID hashing. Extensive experiments show that GraphHash substantially outperforms diverse hashing baselines on both retrieval and click-through-rate prediction tasks. In particular, GraphHash achieves on average a 101.52% improvement in recall when reducing the embedding table size by more than 75%, highlighting the value of graph-based collaborative information for model reduction.

[IR-4] LLM -based relevance assessment still cant replace human relevance assessment

链接: https://arxiv.org/abs/2412.17156
作者: Charles L. A. Clarke,Laura Dietz
关键词: large language models, gained significant attention, recent studies suggesting, LLM-based relevance assessments, provide comparable evaluations
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the UMBRELA system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. really supports their claim, particularly if a test collection is used asa benchmark for future improvements. Second, through a submission deliberately intended to do so, we demonstrate the ease with which automatic evaluation metrics can be subverted, showing that systems designed to exploit these evaluations can achieve artificially high scores. Theoretical challenges – such as the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance – must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.

[IR-5] Multifaceted User Modeling in Recommendation: A Federated Foundation Models Approach AAAI25

链接: https://arxiv.org/abs/2412.16969
作者: Chunxu Zhang,Guodong Long,Hongkuan Guo,Zhaojie Liu,Guorui Zhou,Zijian Zhang,Yang Liu,Bo Yang
关键词: uncover fine-grained patterns, Multifaceted user modeling, revealing their diverse, uncover fine-grained, learn representations
类目: Information Retrieval (cs.IR)
*备注: Accepted as a regular paper of AAAI25

点击查看摘要

Abstract:Multifaceted user modeling aims to uncover fine-grained patterns and learn representations from user data, revealing their diverse interests and characteristics, such as profile, preference, and personality. Recent studies on foundation model-based recommendation have emphasized the Transformer architecture’s remarkable ability to capture complex, non-linear user-item interaction relationships. This paper aims to advance foundation model-based recommendersystems by introducing enhancements to multifaceted user modeling capabilities. We propose a novel Transformer layer designed specifically for recommendation, using the self-attention mechanism to capture sequential user-item interaction patterns. Specifically, we design a group gating network to identify user groups, enabling hierarchical discovery across different layers, thereby capturing the multifaceted nature of user interests through multiple Transformer layers. Furthermore, to broaden the data scope and further enhance multifaceted user modeling, we extend the framework to a federated setting, enabling the use of private datasets while ensuring privacy. Experimental validations on benchmark datasets demonstrate the superior performance of our proposed method. Code is available.

[IR-6] owards More Robust Retrieval-Augmented Generation: Evaluating RAG Under Adversarial Poisoning Attacks

链接: https://arxiv.org/abs/2412.16708
作者: Jinyan Su,Jin Peng Zhou,Zhengxin Zhang,Preslav Nakov,Claire Cardie
关键词: mitigate LLM hallucinations, LLM hallucinations, knowledge-intensive domains, mitigate LLM, promising solution
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems have emerged as a promising solution to mitigate LLM hallucinations and enhance their performance in knowledge-intensive domains. However, these systems are vulnerable to adversarial poisoning attacks, where malicious passages injected into retrieval databases can mislead the model into generating factually incorrect outputs. In this paper, we investigate both the retrieval and the generation components of RAG systems to understand how to enhance their robustness against such attacks. From the retrieval perspective, we analyze why and how the adversarial contexts are retrieved and assess how the quality of the retrieved passages impacts downstream generation. From a generation perspective, we evaluate whether LLMs’ advanced critical thinking and internal knowledge capabilities can be leveraged to mitigate the impact of adversarial contexts, i.e., using skeptical prompting as a self-defense mechanism. Our experiments and findings provide actionable insights into designing safer and more resilient retrieval-augmented frameworks, paving the way for their reliable deployment in real-world applications.

[IR-7] Improving FIM Code Completions via Context Curriculum Based Learning

链接: https://arxiv.org/abs/2412.16589
作者: Hitesh Sagtani,Rishabh Mehrotra,Beyang Liu
关键词: FIM code completion, contextually relevant suggestions, code completion, FIM code, Santa Coder FIM
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Fill-in-the-Middle (FIM) models play a vital role in code completion tasks, leveraging both prefix and suffix context to provide more accurate and contextually relevant suggestions. This paper presents approaches to improve FIM code completion while addressing the challenge of maintaining low latency for real-time coding assistance. We enhance FIM code completion by incorporating context and curriculum examples in the training process. We identify patterns where completion suggestions fail more frequently, revealing complexities that smaller language models struggle with. To address these challenges, we develop a curriculum dataset by extracting hard-to-complete patterns from code repositories and generate context examples using semantic and static analysis tools (e.g. TSC compiler). We fine-tune various sized models, including StarCoder and DeepSeek, on this enhanced dataset. Our evaluation encompasses three key dimensions: the Santa Coder FIM task, the Amazon CCEval benchmark, and a new Multi-Line Infilling evaluation benchmark derived from SWE-bench. Comprehensive ablation studies across multiple model sizes reveal that while all fine-tuned models show improvements, the performance gains are more pronounced for smaller parameter models and incorporating difficult-to-complete examples, as part of curriculum learning, improves the code completion performance. This finding is particularly significant given the latency constraints of code completion tasks. While larger models like GPT and Claude perform well in multi-line completions but are prohibitively challenging to use given high latency, and our fine-tuned models achieve a balance between performance and latency. Finally, we validate our approach through online A/B testing, demonstrating tangible improvements in Completion Acceptance Rate (CAR) and Completion Persistence Rate (CPR), with zero latency impact.

[IR-8] EMPRA: Embedding Perturbation Rank Attack against Neural Ranking Models

链接: https://arxiv.org/abs/2412.16382
作者: Amin Bigdeli,Negar Arabzadeh,Ebrahim Bagheri,Charles L. A. Clarke
关键词: information retrieval techniques, neural information retrieval, Recent research, adversarial attacks, Perturbation Rank Attack
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent research has shown that neural information retrieval techniques may be susceptible to adversarial attacks. Adversarial attacks seek to manipulate the ranking of documents, with the intention of exposing users to targeted content. In this paper, we introduce the Embedding Perturbation Rank Attack (EMPRA) method, a novel approach designed to perform adversarial attacks on black-box Neural Ranking Models (NRMs). EMPRA manipulates sentence-level embeddings, guiding them towards pertinent context related to the query while preserving semantic integrity. This process generates adversarial texts that seamlessly integrate with the original content and remain imperceptible to humans. Our extensive evaluation conducted on the widely-used MS MARCO V1 passage collection demonstrate the effectiveness of EMPRA against a wide range of state-of-the-art baselines in promoting a specific set of target documents within a given ranked results. Specifically, EMPRA successfully achieves a re-ranking of almost 96% of target documents originally ranked between 51-100 to rank within the top 10. Furthermore, EMPRA does not depend on surrogate models for adversarial text generation, enhancing its robustness against different NRMs in realistic settings.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2024-12-24

目录

概览 (2024-12-24)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载