This post presents the latest paper list fetched from arXiv.org on 2024-11-25, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2024-11-25)

365 papers were updated today, including:

  • Natural Language Processing: 72 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 110 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 107 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 137 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Measuring Bullshit in the Language Games played by ChatGPT

[Quick Read]: This paper examines how closely the text produced by generative large language models (LLMs) resembles the use of language Frankfurt describes in "On Bullshit". The key to the approach is statistical text analysis: the linguistic features of scientific publications are contrasted with pseudo-scientific text generated by ChatGPT, and the paper then asks whether the same features also appear in George Orwell's critique of politics and language and in David Graeber's characterization of "bullshit jobs". Using simple hypothesis-testing methods, the paper shows that a statistical model can reliably relate ChatGPT's artificial "bullshit" to the political and workplace functions of bullshit in natural human language.

Link: https://arxiv.org/abs/2411.15129
Authors: Alessandro Trevisan, Harry Giddens, Sarah Dillon, Alan F. Blackwell
Keywords-EN: Frankfurt popular monograph, Generative large language, Generative large, Frankfurt popular, direct correspondence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Generative large language models (LLMs), which create text without direct correspondence to truth value, are widely understood to resemble the uses of language described in Frankfurt’s popular monograph On Bullshit. In this paper, we offer a rigorous investigation of this topic, identifying how the phenomenon has arisen, and how it might be analysed. In this paper, we elaborate on this argument to propose that LLM-based chatbots play the ‘language game of bullshit’. We use statistical text analysis to investigate the features of this Wittgensteinian language game, based on a dataset constructed to contrast the language of 1,000 scientific publications with typical pseudo-scientific text generated by ChatGPT. We then explore whether the same language features can be detected in two well-known contexts of social dysfunction: George Orwell’s critique of politics and language, and David Graeber’s characterisation of bullshit jobs. Using simple hypothesis-testing methods, we demonstrate that a statistical model of the language of bullshit can reliably relate the Frankfurtian artificial bullshit of ChatGPT to the political and workplace functions of bullshit as observed in natural human language.
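The hypothesis-testing idea can be sketched in a few lines (a toy illustration only — the feature, corpora, and test below are invented for the sketch, not the authors' actual ones): compute a simple lexical feature per document and compare the two groups with a Welch t statistic.

```python
import statistics

def type_token_ratio(text: str) -> float:
    """Lexical diversity: distinct words / total words."""
    words = text.lower().split()
    return len(set(words)) / len(words)

def welch_t(sample_a, sample_b) -> float:
    """Welch's two-sample t statistic (unequal variances)."""
    ma, mb = statistics.mean(sample_a), statistics.mean(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    return (ma - mb) / ((va / len(sample_a) + vb / len(sample_b)) ** 0.5)

# Toy corpora standing in for scientific vs. generated text.
scientific = [
    "the enzyme catalyses hydrolysis of the substrate at low ph",
    "we measured flux through the pathway under anaerobic conditions",
    "sampling error dominates variance at small population sizes",
]
generated = [
    "science is very important and helps us understand the world around us",
    "this important topic is very interesting and very important to study",
    "the world is full of interesting things that are important to know",
]

a = [type_token_ratio(t) for t in scientific]
b = [type_token_ratio(t) for t in generated]
t_stat = welch_t(a, b)
print(f"t statistic for type-token ratio: {t_stat:.2f}")
```

A real study would use many documents and richer features, but the shape of the comparison (feature extraction, then a group-difference test) is the same.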

[NLP-1] TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

[Quick Read]: This paper addresses the lack of openly available training data and methods for language-model post-training. The key to the solution is TÜLU 3, a family of fully open, state-of-the-art post-trained models released together with its complete recipe, including data, code, and training methods. Built on the Llama 3.1 base models, TÜLU 3 combines supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel Reinforcement Learning with Verifiable Rewards (RLVR) method, and surpasses Llama 3.1, Qwen 2.5, and Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The authors also release a multi-task evaluation scheme for post-training methods and a detailed report so the TÜLU 3 approach can be reproduced and further adapted to more domains.

Link: https://arxiv.org/abs/2411.15124
Authors: Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi
Keywords-EN: recent language models, applied to refine, refine behaviors, behaviors and unlock, wide range
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the TÜLU 3 model weights and demo, we release the complete recipe – including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.
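The RLVR idea — rewarding only outputs whose final answer can be programmatically verified — can be sketched as follows (a minimal illustration; the `Answer:` format and the 0/1 reward values are assumptions for the sketch, not TÜLU 3's actual implementation):

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """RLVR-style reward: 1.0 only if the final answer can be
    programmatically verified against ground truth, else 0.0."""
    # Assume answers are reported as "Answer: <value>" (illustrative format).
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == gold_answer else 0.0

# A verifiable task (arithmetic) rewards exact, checkable outcomes
# rather than a learned preference score.
good = "First, 17 * 3 = 51, then 51 + 4 = 55. Answer: 55"
bad = "The result is probably around fifty. Answer: 54"
print(verifiable_reward(good, "55"), verifiable_reward(bad, "55"))
```

The appeal of this reward design is that it cannot be gamed the way a learned reward model can: the signal is exactly as reliable as the verifier.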

[NLP-2] ReXrank: A Public Leaderboard for AI-Powered Radiology Report Generation

[Quick Read]: This paper addresses the lack of a standardized benchmark for objectively evaluating AI-driven radiology report generation on chest X-rays. The key to the solution is ReXrank, a public leaderboard and challenge for assessing such models. The ReXrank framework incorporates ReXGradient (the largest test dataset, with 10,000 studies) and three public datasets (MIMIC-CXR, IU-Xray, CheXpert Plus), and applies 8 evaluation metrics, separately assessing models that generate only the findings section and those that generate both findings and impressions. By providing this standardized evaluation framework, ReXrank enables meaningful comparisons of model performance, offers crucial insights into robustness across diverse clinical settings, and lays the groundwork for comprehensive evaluation of automated reporting across all of medical imaging.

Link: https://arxiv.org/abs/2411.15122
Authors: Xiaoman Zhang, Hong-Yu Zhou, Xiaoli Yang, Oishi Banerjee, Julián N. Acosta, Josh Miller, Ouwen Huang, Pranav Rajpurkar
Keywords-EN: demonstrated significant potential, automating radiology report, radiology report generation, demonstrated significant, significant potential
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:AI-driven models have demonstrated significant potential in automating radiology report generation for chest X-rays. However, there is no standardized benchmark for objectively evaluating their performance. To address this, we present ReXrank, this https URL, a public leaderboard and challenge for assessing AI-powered radiology report generation. Our framework incorporates ReXGradient, the largest test dataset consisting of 10,000 studies, and three public datasets (MIMIC-CXR, IU-Xray, CheXpert Plus) for report generation assessment. ReXrank employs 8 evaluation metrics and separately assesses models capable of generating only findings sections and those providing both findings and impressions sections. By providing this standardized evaluation framework, ReXrank enables meaningful comparisons of model performance and offers crucial insights into their robustness across diverse clinical settings. Beyond its current focus on chest X-rays, ReXrank’s framework sets the stage for comprehensive evaluation of automated reporting across the full spectrum of medical imaging.

[NLP-3] VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

[Quick Read]: This paper addresses the misalignment between generated videos and text prompts in text-to-video (T2V) diffusion models, especially for complex scenes. The key to the solution is VideoRepair, a model-agnostic, training-free video refinement framework with four stages: (1) video evaluation, which detects misalignments by generating fine-grained evaluation questions and answering them with a multimodal language model (MLLM); (2) refinement planning, which identifies accurately generated objects and creates localized prompts to refine the other regions of the video; (3) region decomposition, which segments the correctly generated regions with a combined grounding module; and (4) localized refinement, which adjusts the misaligned regions while preserving the correct ones. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.

Link: https://arxiv.org/abs/2411.15115
Authors: Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
Keywords-EN: impressive generation capabilities, demonstrated impressive generation, demonstrated impressive, video, diffusion models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL

Abstract:Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that have misalignments with text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages: In (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering those questions with MLLM. In (2) refinement planning, we identify accurately generated objects and then create localized prompts to refine other areas in the video. Next, in (3) region decomposition, we segment the correctly generated area using a combined grounding module. We regenerate the video by adjusting the misaligned regions while preserving the correct regions in (4) localized refinement. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
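The four-stage loop above can be sketched as plain control flow (every component is stubbed for illustration; the real system uses an MLLM, a grounding module, and a T2V diffusion model in place of these stand-ins):

```python
def evaluate(prompt_objects, video_objects):
    """Stage 1: detect misalignments via fine-grained QA (stubbed as a
    set difference: prompt objects missing from the video)."""
    return [o for o in prompt_objects if o not in video_objects]

def plan(prompt_objects, misaligned):
    """Stage 2: keep accurately generated objects; build a localized
    prompt for the regions that must be regenerated."""
    keep = [o for o in prompt_objects if o not in misaligned]
    return keep, "regenerate: " + ", ".join(misaligned)

def decompose(video_objects, keep):
    """Stage 3: ground and segment the correctly generated regions."""
    return {o: f"mask({o})" for o in video_objects if o in keep}

def refine(local_prompt, preserved_masks):
    """Stage 4: regenerate only the misaligned regions, preserving
    the masked correct regions."""
    return {"prompt": local_prompt, "preserved": sorted(preserved_masks)}

prompt_objects = ["red car", "dog", "traffic light"]
video_objects = ["red car", "traffic light"]  # "dog" is missing

misaligned = evaluate(prompt_objects, video_objects)
keep, local_prompt = plan(prompt_objects, misaligned)
masks = decompose(video_objects, keep)
result = refine(local_prompt, masks)
print(result)
```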

[NLP-4] Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion

[Quick Read]: This paper tackles the obstacle that the sheer size of text-to-image models poses to their widespread adoption, especially on resource-constrained devices. The key to the solution is post-training pruning of Stable Diffusion 2 for model compression. The study prunes the text encoder and the diffusion (image) generator separately and compares pruning effects at various sparsities, finding that, contrary to established trends in language-model pruning, simple magnitude pruning outperforms more advanced techniques for text-to-image models. The paper proposes an optimal configuration that prunes the text encoder to 47.5% and the diffusion generator to 35%, preserving image generation quality while substantially reducing computational requirements. It also observes that pruning beyond certain thresholds causes sudden performance drops, indicating that specific weights encode critical semantic information and opening new directions for research on model compression, interpretability, and bias identification.

Link: https://arxiv.org/abs/2411.15113
Authors: Samarth N Ramesh, Zhixue Zhao
Keywords-EN: grow increasingly powerful, models grow increasingly, burgeoning size presents, powerful and complex, widespread adoption
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:As text-to-image models grow increasingly powerful and complex, their burgeoning size presents a significant obstacle to widespread adoption, especially on resource-constrained devices. This paper presents a pioneering study on post-training pruning of Stable Diffusion 2, addressing the critical need for model compression in the text-to-image domain. Our study tackles pruning techniques for previously unexplored multi-modal generation models, and particularly examines the impact of pruning on the textual component and the image generation component separately. We conduct a comprehensive comparison of pruning the model, or single components of the model, at various sparsities. Our results yield previously undocumented findings. For example, contrary to established trends in language model pruning, we discover that simple magnitude pruning outperforms more advanced techniques in the text-to-image context. Furthermore, our results show that Stable Diffusion 2 can be pruned to 38.5% sparsity with minimal quality loss, achieving a significant reduction in model size. We propose an optimal pruning configuration that prunes the text encoder to 47.5% and the diffusion generator to 35%. This configuration maintains image generation quality while substantially reducing computational requirements. In addition, our work uncovers intriguing questions about information encoding in text-to-image models: we observe that pruning beyond certain thresholds leads to sudden performance drops (unreadable images), suggesting that specific weights encode critical semantic information. This finding opens new avenues for future research in model compression, interpretability, and bias identification in text-to-image models. By providing crucial insights into the pruning behavior of text-to-image models, our study lays the groundwork for developing more efficient and accessible AI-driven image generation systems.
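Magnitude pruning itself is simple enough to sketch directly (illustrative only; real pruning operates per layer on weight tensors, not on a flat Python list):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute value -- the simple baseline the paper finds most
    effective for text-to-image models."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

w = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05, -0.3, 0.08]
# Prune two hypothetical components to different sparsities, echoing
# the paper's proposed 47.5% / 35% configuration.
pruned_encoder = magnitude_prune(w, 0.475)   # ~47.5% of weights zeroed
pruned_generator = magnitude_prune(w, 0.35)  # ~35% of weights zeroed
print(pruned_encoder)
print(pruned_generator)
```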

[NLP-5] XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

[Quick Read]: This paper addresses the high computational overhead of executing a context-free grammar (CFG) for structured generation with large language models (LLMs). The key to the solution is XGrammar, a flexible and efficient structured generation engine. XGrammar accelerates CFG execution by dividing the vocabulary into context-independent tokens, which can be prechecked, and context-dependent tokens, which must be interpreted at runtime; it further builds transformations that expand the grammar context and reduce the number of context-dependent tokens. The paper also designs an efficient persistent stack to accelerate context-dependent token checks, and co-designs the grammar engine with the LLM inference engine so that grammar computation overlaps with GPU execution, achieving up to a 100x speedup in evaluations.

Link: https://arxiv.org/abs/2411.15100
Authors: Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen
Keywords-EN: embodied agent commands, structured function calls, agent commands, embodied agent, LLM Agents
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in the vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can achieve up to a 100x speedup over existing solutions. Combined with an LLM inference engine, it can achieve near-zero-overhead structured generation in end-to-end low-latency LLM serving.
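The precheck/runtime split can be illustrated with a toy grammar — digits with balanced parentheses — where most tokens are valid or invalid regardless of parser state, and only ")" needs a runtime check (the vocabulary and sets below are invented for illustration, not XGrammar's actual implementation):

```python
# Toy grammar: strings of digits with balanced parentheses.
VOCAB = ["0", "1", "7", "(", ")", "x", "#"]

# Precheck pass (done once, offline): tokens that are always legal or
# always illegal regardless of the parser stack.
ALWAYS_LEGAL = {"0", "1", "7", "("}
ALWAYS_ILLEGAL = {"x", "#"}

def token_mask(open_parens: int) -> dict:
    """Build the per-step decoding mask; only context-dependent tokens
    (here, just ")") need any runtime work."""
    mask = {}
    for tok in VOCAB:
        if tok in ALWAYS_LEGAL:
            mask[tok] = True             # prechecked, no runtime cost
        elif tok in ALWAYS_ILLEGAL:
            mask[tok] = False            # prechecked, no runtime cost
        else:
            mask[tok] = open_parens > 0  # runtime stack check
    return mask

print(token_mask(open_parens=0)[")"])  # nothing to close
print(token_mask(open_parens=2)[")"])  # an open "(" exists
```

With a real tokenizer, most of a 100k-token vocabulary falls into the prechecked classes, which is where the speedup comes from.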

[NLP-6] Context-Aware Multimodal Pretraining

[Quick Read]: This paper addresses the weakness of large-scale multimodal pretrained models in few-shot adaptation. The key to the solution is a simple but carefully designed extension to multimodal pretraining that allows representations to accommodate additional context. With this objective, vision-language models show markedly improved few-shot adaptation: across 21 downstream tasks, up to four-fold improvements in test-time sample efficiency and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance. Notably, equipped with simple, training-free, metric-based adaptation mechanisms, the representations easily surpass more complex and expensive optimization-based schemes, greatly simplifying generalization to new domains.

Link: https://arxiv.org/abs/2411.15099
Authors: Karsten Roth, Zeynep Akata, Dima Damen, Ivana Balažević, Olivier J. Hénaff
Keywords-EN: learning successfully optimizes, Large-scale multimodal representation, Large-scale multimodal, representation learning successfully, test time
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.

[NLP-7] Instance-Aware Generalized Referring Expression Segmentation

[Quick Read]: This paper addresses the difficulty of handling complex expressions that refer to multiple distinct objects in Generalized Referring Expression Segmentation (GRES). Existing methods typically perform end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate and associate object instances with the text query. The key to the solution is InstAlign, which incorporates object-level reasoning into the segmentation process: it uses text and image inputs to extract a set of object-level tokens that capture both the semantic information in the prompt and the objects in the image. By modeling text-object alignment with instance-level supervision, each token uniquely represents an object segment in the image while also aligning with the relevant semantic information in the text. Experiments on the gRefCOCO and Ref-ZOM benchmarks show that the method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.

Link: https://arxiv.org/abs/2411.15087
Authors: E-Ro Nguyen, Hieu Le, Dimitris Samaras, Michael Ryoo
Keywords-EN: Generalized Referring Expression, complex expressions referring, Referring Expression Segmentation, handling complex expressions, Generalized Referring
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures

Abstract:Recent works on Generalized Referring Expression Segmentation (GRES) struggle with handling complex expressions referring to multiple distinct objects. This is because these methods typically employ an end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate and associate different object instances to the text query. To this end, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process. Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text. Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.

[NLP-8] Locating the Leading Edge of Cultural Change

[Quick Read]: This paper asks how well measures of textual similarity and divergence work, in practice, for studying cultural change — in particular, which measures align with social evidence about change. The key to the approach is comparing three representations of text (topic models, document embeddings, and word-level perplexity) across three corpora (literary studies, economics, and fiction). In every case, works by highly cited authors and by younger authors are textually ahead of the curve, though no single representation emerges as clearly preferable. However, alignment with social evidence is strongest when texts are represented through their most forward-looking top quartile of passages, suggesting that a text's impact may depend more on its most innovative moments than on sustaining a high level of innovation throughout.

Link: https://arxiv.org/abs/2411.15068
Authors: Sarah Griebel, Becca Cohen, Lucian Li, Jaihyun Park, Jiayu Liu, Jana Perkins, Ted Underwood
Keywords-EN: study cultural change, textual similarity, similarity and divergence, divergence are increasingly, study cultural
Subjects: Computation and Language (cs.CL)
Comments: Accepted at CHR 2024

Abstract:Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document embeddings, and word-level perplexity) to three different corpora (literary studies, economics, and fiction). In every case, works by highly-cited authors and younger authors are textually ahead of the curve. We don’t find clear evidence that one representation of text is to be preferred over the others. But alignment with social evidence is strongest when texts are represented through the top quartile of passages, suggesting that a text’s impact may depend more on its most forward-looking moments than on sustaining a high level of innovation throughout.

[NLP-9] Fantastic Biases (What are They) and Where to Find Them WWW

[Quick Read]: This paper addresses biases in deep learning models, especially the negative biases that can lead to unfair or unjust outcomes. The key to the solution is defining a general notion of bias, understanding its sources and effects, and detecting and mitigating biases using specially crafted template datasets and specific algorithms. The paper stresses that, to avoid reproducing social inequalities in machine learning systems, these AI systems must be taught to go beyond existing biases and to represent all of society fairly.

Link: https://arxiv.org/abs/2411.15051
Authors: Valentin Barriere
Keywords-EN: Deep Learning models, Learning models tend, Deep Learning, models tend, tend to learn
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Publication in Spanish in the journal Bits de Ciencias: this https URL

Abstract:Deep Learning models tend to learn correlations of patterns on huge datasets. The bigger these systems are, the more complex are the phenomena they can detect, and the more data they need for this. The use of Artificial Intelligence (AI) is becoming increasingly ubiquitous in our society, and its impact is growing every day. The promises it holds strongly depend on its fair and universal use, such as access to information or education for all. In a world of inequalities, it can help to reach the most disadvantaged areas. However, such universal systems must be able to represent society, without benefiting some at the expense of others. We must not reproduce the inequalities observed throughout the world, but educate these AIs to go beyond them. We have seen cases where these systems use gender, race, or even class information in ways that are not appropriate for resolving their tasks. Instead of real causal reasoning, they rely on spurious correlations, which is what we usually call a bias. In this paper, we first attempt to define what a bias is in general terms. This helps us to demystify the concept of bias, to understand why biases can be found everywhere and why they are sometimes useful. Second, we focus on the notion of what is generally seen as negative bias, the kind we want to avoid in machine learning, before presenting a general zoology containing the most common of these biases. We finally conclude by looking at classical methods to detect them, by means of specially crafted datasets of templates and specific algorithms, and at classical methods to mitigate them.

[NLP-10] mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

[Quick Read]: This paper addresses the ambiguous and inaccurate answers that Multimodal Large Language Models (MLLMs) give on knowledge-based VQA tasks (such as INFOSEEK and Encyclopedic-VQA) due to their limited, frozen knowledge scope. The key to the solution is a new generalized framework, multimodal Retrieval-Reflection-Augmented Generation (mR²AG), which achieves adaptive retrieval and useful-information localization through two easy-to-implement reflection operations (Retrieval-Reflection and Relevance-Reflection), avoiding unnecessary retrieval calls and increased model complexity. Specifically, Retrieval-Reflection distinguishes different user queries to avoid redundant retrieval, while Relevance-Reflection guides the MLLM to locate beneficial evidence in the retrieved content and generate the answer accordingly. In addition, mR²AG can be integrated into any well-trained MLLM through efficient fine-tuning on the proposed mR²AG Instruction-Tuning dataset (mR²AG-IT).

Link: https://arxiv.org/abs/2411.15041
Authors: Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, Jin Ma, Ying Shan, Weiming Hu
Keywords-EN: Advanced Multimodal Large, Multimodal Large Language, recent Knowledge-based VQA, Large Language Models, Knowledge-based VQA tasks
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, due to their limited and frozen knowledge scope, often leading to ambiguous and inaccurate responses. Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope. However, current mRAG methods have inherent drawbacks, including: 1) Performing retrieval even when external knowledge is not needed. 2) Lacking identification of evidence that supports the query. 3) Increasing model complexity due to additional information filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR²AG), which achieves adaptive retrieval and useful information localization to enable answers through two easy-to-implement reflection operations, preventing high model complexity. In mR²AG, Retrieval-Reflection is designed to distinguish different user queries and avoids redundant retrieval calls, and Relevance-Reflection is introduced to guide the MLLM in locating beneficial evidence of the retrieved content and generating answers accordingly. In addition, mR²AG can be integrated into any well-trained MLLM with efficient fine-tuning on the proposed mR²AG Instruction-Tuning dataset (mR²AG-IT). mR²AG significantly outperforms state-of-the-art MLLMs (e.g., GPT-4v/o) and RAG-based MLLMs on INFOSEEK and Encyclopedic-VQA, while maintaining the exceptional capabilities of base MLLMs across a wide range of visual-dependent tasks.
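The two reflection gates can be sketched as plain control flow (the MLLM calls are stubbed with trivial heuristics; the function names and heuristics below are illustrative, not the paper's API):

```python
def retrieval_reflection(query: str) -> bool:
    """Gate 1: decide whether external knowledge is needed at all
    (stubbed with a keyword heuristic; the real system asks the MLLM)."""
    return any(k in query.lower() for k in ("who", "when", "which year"))

def relevance_reflection(query: str, passages):
    """Gate 2: keep only retrieved passages that actually support the
    query (stubbed as naive word overlap)."""
    q_words = set(query.lower().split())
    return [p for p in passages if q_words & set(p.lower().split())]

def answer(query: str, retrieve):
    if not retrieval_reflection(query):
        return "answered from parametric knowledge"
    evidence = relevance_reflection(query, retrieve(query))
    return f"answered from {len(evidence)} evidence passage(s)"

corpus = ["the bridge opened in 1937", "pandas eat bamboo"]
print(answer("describe the picture", lambda q: corpus))
print(answer("when did the bridge open", lambda q: corpus))
```

The point of the design is visible even in the stub: a purely visual query never triggers retrieval, and an irrelevant passage never reaches answer generation.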

[NLP-11] Evolutionary Automata and Deep Evolutionary Computation

[Quick Read]: This paper aims to build a more complete dual model of evolutionary computation through evolutionary automata. The key to the solution is proposing evolutionary automata as a model of evolutionary computation analogous to what abstract automata (e.g., Turing machines) are to recursive algorithms: a more formal and precise model. An evolutionary automaton evolves while performing evolutionary computation, possibly over an infinite number of generations, directly modeling the evolution of evolution and greatly increasing the expressiveness of evolutionary computation. The model also hints at the power of natural evolution, which is self-evolving through interactive feedback with the environment.

Link: https://arxiv.org/abs/2411.15008
Authors: Eugene Eberbach
Keywords-EN: evolutionary computation, evolutionary, evolutionary algorithms, modern science, applying mechanisms
Subjects: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL)
Comments:

Abstract:Evolution by natural selection, which is one of the most compelling themes of modern science, brought forth evolutionary algorithms and evolutionary computation, applying mechanisms of evolution in nature to various problems solved by computers. In this paper we concentrate on evolutionary automata that constitute an analogous model of evolutionary computation compared to well-known evolutionary algorithms. Evolutionary automata provide a more complete dual model of evolutionary computation, similar like abstract automata (e.g., Turing machines) form a more formal and precise model compared to recursive algorithms and their subset - evolutionary algorithms. An evolutionary automaton is an automaton that evolves performing evolutionary computation perhaps using an infinite number of generations. This model allows for a direct modeling evolution of evolution, and leads to tremendous expressiveness of evolutionary automata and evolutionary computation. This also gives the hint to the power of natural evolution that is self-evolving by interactive feedback with the environment.
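A minimal flavor of evolutionary computation over automata can be given with a toy genetic loop whose individuals are DFA transition tables (a small illustration of the general idea, not the paper's formal model of evolutionary automata):

```python
import random

random.seed(0)

# Each individual is a 2-state DFA over {0,1}: a transition table
# trans[state][symbol], with state 1 accepting. Target behavior:
# accept exactly the strings that end in "1".
TRAIN = [("01", True), ("11", True), ("10", False), ("00", False),
         ("101", True), ("110", False)]

def accepts(trans, s):
    state = 0
    for ch in s:
        state = trans[state][int(ch)]
    return state == 1

def fitness(trans):
    return sum(accepts(trans, s) == label for s, label in TRAIN)

def random_dfa():
    return [[random.randint(0, 1) for _ in range(2)] for _ in range(2)]

def mutate(trans):
    child = [row[:] for row in trans]
    child[random.randint(0, 1)][random.randint(0, 1)] = random.randint(0, 1)
    return child

# Evolutionary loop: keep the fitter half, refill with mutants.
pop = [random_dfa() for _ in range(8)]
initial_best = max(fitness(t) for t in pop)
for _ in range(30):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:4] + [mutate(random.choice(pop[:4])) for _ in range(4)]
best = max(fitness(t) for t in pop)
print(initial_best, best)
```

With elitist selection the best fitness never decreases, which is the basic guarantee the loop relies on.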

[NLP-12] ScribeAgent : Towards Specialized Web Agents Using Production-Scale Workflow Data

[Quick Read]: This paper addresses two obstacles that LLM agents face on complex web tasks: general-purpose LLMs are not trained to understand specialized web contexts such as HTML, and they struggle with long-horizon planning. The proposed solution fine-tunes open-source LLMs on production-scale workflow data (6 billion tokens collected from over 250 domains), yielding substantial gains over prompting-based agents on existing benchmarks. The key is fine-tuning on domain-specific data rather than prompt engineering with general-purpose models: ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 14.1% over the previous best text-only web agents on WebArena.

Link: https://arxiv.org/abs/2411.15004
Authors: Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar
Keywords-EN: Large Language Model, Large Language, handle increasingly complex, increasingly complex web-based, Language Model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks – ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 14.1% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.

[NLP-13] Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

[Quick Read]: This paper takes a step toward understanding the internal neural representations of Large Multimodal Models (LMMs). The key to the solution is a versatile framework with two steps: 1) applying a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features, and 2) an automatic interpretation framework that uses LMMs themselves to explain the open-semantic features learned by the SAE. By analyzing the LLaVA-NeXT-8B model with the LLaVA-OV-72B model, the authors show that these features can effectively steer model behavior, deepening our understanding of why LMMs excel at specific tasks (such as EQ tests), why they make mistakes, and what strategies might correct those mistakes.

Link: https://arxiv.org/abs/2411.14982
Authors: Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu
Keywords-EN: Large Multimodal Models, Large Multimodal, Recent advances, advances in Large, Multimodal Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework to interpret the open-semantic features learned by the SAE, using the LMMs themselves. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model’s behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.
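The SAE step can be sketched as a single forward pass (hand-picked toy weights on a 2-d activation; a real SAE is trained on model activations by minimizing reconstruction error plus an L1 sparsity penalty):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_forward(x, w_enc, w_dec):
    """Encode an activation into a (hopefully sparse) feature vector,
    then reconstruct it; training would minimize reconstruction error
    plus an L1 penalty on the features."""
    features = relu(matvec(w_enc, x))
    recon = matvec(w_dec, features)
    l1 = sum(features)
    return features, recon, l1

# Toy 2-d activation expanded into 3 candidate features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
features, recon, l1 = sae_forward([0.5, 0.0], w_enc, w_dec)
print(features, recon, l1)
```

Because only one feature fires on this input, each active feature can then be named by showing its top-activating inputs to an interpreter model, which is the role the larger LMM plays in the paper.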

[NLP-14] SwissADT: An Audio Description Translation System for Swiss Languages

[Quick Read]: This paper addresses the development of audio description translation (ADT) systems for multilingual countries such as Switzerland, where well-crafted, time-synchronized AD data is scarce and it is unclear whether incorporating visual information from the corresponding video clips improves translation quality. The key to the solution is SwissADT, the first ADT system for the three main Swiss languages (German, French, and Italian) plus English. By collecting well-crafted AD data augmented with video clips and leveraging Large Language Models (LLMs), SwissADT automatically translates AD scripts into the desired Swiss language, improving information accessibility for diverse language populations. Extensive experiments with both automatic and human evaluation of ADT quality show promising capability, and suggest that combining human expertise with the generative power of LLMs can further improve ADT systems.

Link: https://arxiv.org/abs/2411.14967
Authors: Lukas Fischer, Yingqiang Gao, Alexa Lintner, Sarah Ebling
Keywords-EN: audio description translation, crucial accessibility service, accessibility service provided, Audio description, ADT
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Audio description (AD) is a crucial accessibility service provided to blind persons and persons with visual impairment, designed to convey visual information in acoustic form. Despite recent advancements in multilingual machine translation research, the lack of well-crafted and time-synchronized AD data impedes the development of audio description translation (ADT) systems that address the needs of multilingual countries such as Switzerland. Furthermore, since the majority of ADT systems rely solely on text, uncertainty exists as to whether incorporating visual information from the corresponding video clips can enhance the quality of ADT outputs. In this work, we present SwissADT, the first ADT system implemented for three main Swiss languages and English. By collecting well-crafted AD data augmented with video clips in German, French, Italian, and English, and leveraging the power of Large Language Models (LLMs), we aim to enhance information accessibility for diverse language populations in Switzerland by automatically translating AD scripts to the desired Swiss language. Our extensive experimental ADT results, composed of both automatic and human evaluations of ADT quality, demonstrate the promising capability of SwissADT for the ADT task. We believe that combining human expertise with the generation power of LLMs can further enhance the performance of ADT systems, ultimately benefiting a larger multilingual target population.

[NLP-15] LLM for Barcodes: Generating Diverse Synthetic Data for Identity Documents

[Quick Read]: This paper tackles accurate barcode detection and decoding in identity documents, which is critical for reliable data extraction and verification in applications such as security, healthcare, and education. The key idea is an LLM-based synthetic data generation method that creates contextually rich, realistic data without relying on predefined fields or templates. Drawing on the broad knowledge LLMs hold about diverse documents and content, the generated data reflects the variety found in real identity documents; it is then encoded into barcodes and overlaid on template documents such as driver's licenses, insurance cards, and student IDs. This simplifies dataset creation without requiring extensive domain knowledge or predefined fields, and compared with traditional tools such as Faker, the generated data is more diverse and contextually relevant, improving barcode detection performance. The approach is a scalable, privacy-first solution that advances machine learning for automated document processing and identity verification.

Link: https://arxiv.org/abs/2411.14962
Authors: Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Karan Gupta, Priyaranjan Pattnayak
Keywords-EN: Accurate barcode detection, reliable data extraction, Accurate barcode, applications like security, crucial for applications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 5 pages, 1 figure

Abstract:Accurate barcode detection and decoding in identity documents is crucial for applications like security, healthcare, and education, where reliable data extraction and verification are essential. However, building robust detection models is challenging due to the lack of diverse, realistic datasets, an issue often tied to privacy concerns and the wide variety of document formats. Traditional tools like Faker rely on predefined templates, making them less effective for capturing the complexity of real-world identity documents. In this paper, we introduce a new approach to synthetic data generation that uses LLMs to create contextually rich and realistic data without relying on predefined fields. Using the vast knowledge LLMs have about different documents and content, our method creates data that reflects the variety found in real identity documents. This data is then encoded into barcodes and overlaid on templates for documents such as driver's licenses, insurance cards, and student IDs. Our approach simplifies the process of dataset creation, eliminating the need for extensive domain knowledge or predefined fields. Compared to traditional methods like Faker, data generated by LLMs demonstrates greater diversity and contextual relevance, leading to improved performance in barcode detection models. This scalable, privacy-first solution is a big step forward in advancing machine learning for automated document processing and identity verification.
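A minimal sketch of the pipeline's two steps — generate a synthetic identity record, then serialize it into a barcode payload for overlay. All field names, the stubbed generator, and the pipe-delimited payload format are illustrative assumptions, not the paper's actual schema; in the real pipeline an LLM produces varied field values and a barcode writer (e.g. PDF417) renders the payload:

```python
def llm_generate_identity_record(doc_type: str) -> dict:
    """Stand-in for an LLM call that invents realistic field values.

    In the paper's pipeline an LLM produces varied, contextually rich
    records; here we return a fixed record so the sketch is self-contained.
    """
    return {
        "doc_type": doc_type,
        "name": "Jordan Alvarez",
        "dob": "1990-04-12",
        "id_number": "D1234567",
        "expiry": "2028-11-30",
    }

def encode_as_barcode_payload(record: dict) -> str:
    """Serialize the record into a compact string that a barcode
    writer could encode and overlay on a document template."""
    return "|".join(f"{k}={v}" for k, v in sorted(record.items()))

record = llm_generate_identity_record("drivers_license")
payload = encode_as_barcode_payload(record)
print(payload)
```

Replacing the stub with a real LLM call and varying the prompt per document type is what yields the diversity the paper credits for improved detection-model performance.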

[NLP-16] Information Extraction from Heterogenous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation WACV2025

[Quick Read]: This paper addresses efficient information extraction from visually rich documents (VRDs) without ground-truth labels, which is essential for preventing fraud and abuse. The key contribution is Task Aware Instruction-based Labelling (TAIL), a method for generating synthetic labels in unlabeled VRD corpora, combined with response-based knowledge distillation to fine-tune a multimodal Visually Rich Document Understanding (VRDU) model. Without using the teacher model's weights or training data, the approach conditionally generates annotations in the appropriate format, outperforming the state-of-the-art large multimodal model (LMM) Claude 3 Sonnet on internal expense documents while being 85% cheaper and roughly 5x faster, and beating layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) scores.

Link: https://arxiv.org/abs/2411.14957
Authors: Aniket Bhattacharyya, Anurag Tripathi
Keywords-EN: visual and layout, visually rich documents, visually rich, Rich Document Understanding, layout information
Categories: Computation and Language (cs.CL)
Comments: Accepted to WACV 2025

Abstract:Invoices and receipts submitted by employees are visually rich documents (VRDs) with textual, visual and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract desired information from submitted receipts. This helps in the assessment of key factors such as appropriateness of the expense claim, adherence to spending and transaction policies, the validity of the receipt, as well as downstream anomaly detection at various levels. These documents are heterogeneous, with multiple formats and languages, uploaded with different image qualities, and often do not contain ground truth labels for the efficient training of models. In this paper we propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpora without labels, and fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation without using the teacher model's weights or training dataset to conditionally generate annotations in the appropriate format. Using a benchmark external dataset where ground truth labels are available, we demonstrate conditions under which our approach performs on par with Claude 3 Sonnet through empirical studies. We then show that the resulting model performs on par or better on the internal expense documents of a large multinational organization than state-of-the-art LMM (large multimodal model) Claude 3 Sonnet while being 85% less costly and ~5X faster, and outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) scores due to its ability to reason and extract information from rare formats. Finally, we illustrate the usage of our approach in overpayment prevention.

[NLP-17] ReVisionLLM : Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

[Quick Read]: This paper targets event localization in long videos, especially hour-long ones. Existing vision-language models (VLMs) are constrained by frame limits and lose the essential temporal detail needed for accurate event localization in extended video content. The proposed solution, ReVisionLLM, is a recursive vision-language model that mimics human search strategies: it first coarsely targets broad segments of interest, then progressively refines its focus to pinpoint exact temporal boundaries. A hierarchical training strategy starts from short clips to capture distinct events and progressively extends to longer videos, allowing seamless handling of videos from minutes to hours. ReVisionLLM significantly outperforms prior state-of-the-art methods across multiple datasets and is the first VLM capable of temporal grounding in hour-long videos.

Link: https://arxiv.org/abs/2411.14901
Authors: Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, Gedas Bertasius
Keywords-EN: Large language models, Large language, excel at retrieving, lengthy text, face difficulties
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% R1@0.1 on MAD). The code is available at this https URL.
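The coarse-to-fine search idea can be sketched as a recursive narrowing over a relevance score. This is a toy stand-in under stated assumptions: the paper's model scores segments with a VLM, whereas here `score` is a hypothetical per-second relevance function, and the branching/window parameters are illustrative:

```python
def localize(score, start, end, min_window=4, branches=4):
    """Recursively narrow the span [start, end) that best matches the
    query: score a few coarse windows, keep the best, and recurse."""
    if end - start <= min_window:
        return start, end
    step = max((end - start) // branches, 1)
    windows = [(s, min(s + step, end)) for s in range(start, end, step)]
    best = max(windows, key=lambda w: sum(score(t) for t in range(w[0], w[1])))
    return localize(score, best[0], best[1], min_window, branches)

# toy relevance: the event occupies seconds 37..42 of a 1-hour video
event = lambda t: 1.0 if 37 <= t < 42 else 0.0
span = localize(event, 0, 3600)
print(span)
```

Each level inspects only a handful of windows, which is why this style of search scales to hour-long inputs where dense frame-by-frame scoring would not.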

[NLP-18] Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological Texts

[Quick Read]: This paper aims to detect mentions of green practices in Russian social media in order to understand their prevalence and formulate recommendations for scaling eco-friendly actions that mitigate environmental issues. The key is prompt-based data augmentation with large language models (LLMs), which generate diverse and realistic text samples to improve a multi-label classification task. The results show that rewriting the existing dataset with LLMs, generating new data, or combining both approaches all improve classification performance, with the best results obtained from a prompt that paraphrases the original text while clearly indicating the relevant categories.

Link: https://arxiv.org/abs/2411.14896
Authors: Anna Glazkova, Olga Zakharova
Keywords-EN: natural language processing, Large language models, Large language, language processing, play a crucial
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: Ivannikov ISPRAS Open Conference (ISPRAS) 2024

Abstract:Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks, improving the understanding, generation, and manipulation of human language across domains such as translating, summarizing, and classifying text. Previous studies have demonstrated that instruction-based LLMs can be effectively utilized for data augmentation to generate diverse and realistic text samples. This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media. Detecting green practices in social media aids in understanding their prevalence and helps formulate recommendations for scaling eco-friendly actions to mitigate environmental issues. We evaluated several prompts for augmenting texts in a multi-label classification task, either by rewriting existing datasets using LLMs, generating new data, or combining both approaches. Our results revealed that all strategies improved classification performance compared to the models fine-tuned only on the original dataset, outperforming baselines in most cases. The best results were obtained with the prompt that paraphrased the original text while clearly indicating the relevant categories.
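The best-performing strategy — paraphrasing the original text while explicitly naming its categories — amounts to assembling a prompt like the following. The exact wording is an illustrative assumption, not the paper's prompt:

```python
def build_augmentation_prompt(text: str, labels: list[str]) -> str:
    """Paraphrasing prompt in the spirit of the paper's best strategy:
    rewrite the post while stating its relevant categories explicitly."""
    cats = ", ".join(labels)
    return (
        "Paraphrase the following social media post. Keep its meaning, "
        f"and make sure it still expresses these green practices: {cats}.\n"
        f"Post: {text}\n"
        "Paraphrase:"
    )

prompt = build_augmentation_prompt(
    "We started sorting our household waste for recycling",
    ["waste sorting", "recycling"],
)
print(prompt)
```

Sending this prompt to an LLM for each training example yields label-preserving paraphrases that can be added to the fine-tuning set.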

[NLP-19] Leveraging Hierarchical Prototypes as the Verbalizer for Implicit Discourse Relation Recognition

[Quick Read]: This paper addresses the ambiguity and inaccuracy of manual verbalizers in implicit discourse relation recognition. The key idea is to use prototypes that capture class-level semantic features, together with the hierarchical label structure across classes, as the verbalizer. This improves recognition performance and also enables zero-shot cross-lingual learning, supporting discourse relation recognition in low-resource languages.

Link: https://arxiv.org/abs/2411.14880
Authors: Wanqiu Long, Bonnie Webber
Keywords-EN: involves determining relationships, explicit discourse connective, Implicit discourse relation, recognition involves determining, discourse relation recognition
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Implicit discourse relation recognition involves determining relationships that hold between spans of text that are not linked by an explicit discourse connective. In recent years, the pre-train, prompt, and predict paradigm has emerged as a promising approach for tackling this task. However, previous work solely relied on manual verbalizers for implicit discourse relation recognition, which suffer from issues of ambiguity and even incorrectness. To overcome these limitations, we leverage the prototypes that capture certain class-level semantic features and the hierarchical label structure for different classes as the verbalizer. We show that our method improves on competitive baselines. Moreover, our proposed approach can be extended to enable zero-shot cross-lingual learning, facilitating the recognition of discourse relations in languages with scarce resources. These advancements validate the practicality and versatility of our approach in addressing the issues of implicit discourse relation recognition across different languages.
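The core of a prototype-based verbalizer can be sketched in a few lines: each class is represented by the mean of its example embeddings, and a new instance is assigned to the class whose prototype is most similar. The 2-d toy vectors and the two class names are illustrative, not the paper's data:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return sum(a * a for a in u) ** 0.5

def prototype(embs):
    """Class prototype = mean of the class's example embeddings."""
    n = len(embs)
    return [sum(e[i] for e in embs) / n for i in range(len(embs[0]))]

def classify(x, protos):
    """Pick the class whose prototype is most cosine-similar to x."""
    return max(protos, key=lambda c: dot(x, protos[c]) / (norm(x) * norm(protos[c])))

protos = {
    "Comparison": prototype([[1.0, 0.1], [0.9, 0.0]]),
    "Contingency": prototype([[0.0, 1.0], [0.1, 0.9]]),
}
pred = classify([0.8, 0.2], protos)
print(pred)
```

The paper additionally exploits the hierarchical label structure (top-level senses subsuming finer ones), which this flat sketch omits.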

[NLP-20] Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics

[Quick Read]: This paper explores how to effectively generate contextualized word embeddings (CWEs) for studying the meanings of concepts in astrophysics and high-energy physics. The key contribution is Astro-HEP-BERT, a transformer-based language model built on a general pretrained BERT model and further trained on the Astro-HEP Corpus, a dataset of 21.84 million paragraphs extracted from arXiv, to adapt to the language of these scientific domains. The project demonstrates both the effectiveness and feasibility of applying bidirectional transformers in the history, philosophy, and sociology of science (HPSS), showing that fine-tuning a general language model can achieve strong performance without large-scale training from scratch — a cost-effective and efficient strategy.

Link: https://arxiv.org/abs/2411.14877
Authors: Arno Simons
Keywords-EN: contextualized word embeddings, model specifically designed, generating contextualized word, high-energy physics, specifically designed
Categories: Computation and Language (cs.CL); History and Philosophy of Physics (physics.hist-ph)
Comments: 7 pages, 4 figures, 1 table

Abstract:I present Astro-HEP-BERT, a transformer-based language model specifically designed for generating contextualized word embeddings (CWEs) to study the meanings of concepts in astrophysics and high-energy physics. Built on a general pretrained BERT model, Astro-HEP-BERT underwent further training over three epochs using the Astro-HEP Corpus, a dataset I curated from 21.84 million paragraphs extracted from more than 600,000 scholarly articles on arXiv, all belonging to at least one of these two scientific domains. The project demonstrates both the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science (HPSS). The entire training process was conducted using freely available code, pretrained weights, and text inputs, completed on a single MacBook Pro Laptop (M2/96GB). Preliminary evaluations indicate that Astro-HEP-BERT’s CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets for domain-specific word sense disambiguation and induction and related semantic change analyses. This suggests that retraining general language models for specific scientific domains can be a cost-effective and efficient strategy for HPSS researchers, enabling high performance without the need for extensive training from scratch.

[NLP-21] Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution Estimation

[Quick Read]: This paper tackles the credit assignment problem in aligning diffusion models for text-to-image generation, where preference labels are sparse and typically available only at the terminal of denoising trajectories. The key contribution is Denoised Distribution Estimation (DDE), which directly estimates the terminal denoised distribution from the perspective of each denoising step, avoiding auxiliary models or hand-crafted schemes and yielding a more explicit credit assignment strategy. Equipped with two estimation strategies, the method can represent the entire denoising trajectory with a single model inference. Both theoretically and empirically, DDE is shown to prioritize optimizing the middle part of the denoising trajectory, providing a novel and effective credit assignment scheme.

Link: https://arxiv.org/abs/2411.14871
Authors: Dingyuan Shi, Yong Wang, Hangyu Li, Xiangxiang Chu
Keywords-EN: shown remarkable success, making alignment methods, models increasingly important, making alignment, increasingly important
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models have shown remarkable success in text-to-image generation, making alignment methods for these models increasingly important. A key challenge is the sparsity of preference labels, which are typically available only at the terminal of denoising trajectories. This raises the issue of how to assign credit across denoising steps based on these sparse labels. In this paper, we propose Denoised Distribution Estimation (DDE), a novel method for credit assignment. Unlike previous approaches that rely on auxiliary models or hand-crafted schemes, DDE derives its strategy more explicitly. The proposed DDE directly estimates the terminal denoised distribution from the perspective of each step. It is equipped with two estimation strategies and capable of representing the entire denoising trajectory with a single model inference. Theoretically and empirically, we show that DDE prioritizes optimizing the middle part of the denoising trajectory, resulting in a novel and effective credit assignment scheme. Extensive experiments demonstrate that our approach achieves superior performance, both quantitatively and qualitatively.

[NLP-22] VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models

[Quick Read]: This paper addresses the performance limitations of current Large Vision-Language Models (LVLMs) on complex geometric structures, particularly graphs. The key contribution is VisGraphVar (Visual Graph Variability), a customizable benchmark generator that produces graph images for seven task categories (detection, classification, segmentation, pattern recognition, link prediction, reasoning, matching), enabling systematic evaluation of LVLMs across diverse visual graph scenarios. Using VisGraphVar, the authors generate 990 graph images and evaluate six LVLMs under two prompting strategies, zero-shot and chain-of-thought. They find that variations in visual attributes (e.g., node labeling and layout) and deliberately introduced visual imperfections (e.g., overlapping nodes) significantly affect model performance. This underscores the need for comprehensive evaluation across graph-related tasks beyond reasoning alone, and VisGraphVar offers valuable guidance for building more reliable and robust visual graph analysis systems.

Link: https://arxiv.org/abs/2411.14832
Authors: Camilo Chacón Sartori, Christian Blum, Filippo Bistaffa
Keywords-EN: Large Vision-Language Models, shown immense potential, advancement of Large, Large Vision-Language, immense potential
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The fast advancement of Large Vision-Language Models (LVLMs) has shown immense potential. These models are increasingly capable of tackling abstract visual tasks. Geometric structures, particularly graphs with their inherent flexibility and complexity, serve as an excellent benchmark for evaluating these models’ predictive capabilities. While human observers can readily identify subtle visual details and perform accurate analyses, our investigation reveals that state-of-the-art LVLMs exhibit consistent limitations in specific visual graph scenarios, especially when confronted with stylistic variations. In response to these challenges, we introduce VisGraphVar (Visual Graph Variability), a customizable benchmark generator able to produce graph images for seven distinct task categories (detection, classification, segmentation, pattern recognition, link prediction, reasoning, matching), designed to systematically evaluate the strengths and limitations of individual LVLMs. We use VisGraphVar to produce 990 graph images and evaluate six LVLMs, employing two distinct prompting strategies, namely zero-shot and chain-of-thought. The findings demonstrate that variations in visual attributes of images (e.g., node labeling and layout) and the deliberate inclusion of visual imperfections, such as overlapping nodes, significantly affect model performance. This research emphasizes the importance of a comprehensive evaluation across graph-related tasks, extending beyond reasoning alone. VisGraphVar offers valuable insights to guide the development of more reliable and robust systems capable of performing advanced visual graph analysis.

[NLP-23] Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization

[Quick Read]: This paper addresses fine-grained alignment in Vision-and-Language Navigation (VLN). Current methods use contrastive learning to align language with visual trajectory sequences, but they struggle with fine-grained vision negatives. The key contribution is a Bayesian Optimization-based adversarial optimization framework that generates fine-grained contrastive vision samples to enrich cross-modal embeddings. Experiments on two common VLN benchmarks, R2R and REVERIE, validate the effectiveness of the enriched embeddings on fine-grained vision negatives and show notable gains in navigation performance.

Link: https://arxiv.org/abs/2411.14811
Authors: Yuhang Song, Mario Gianni, Chenguang Yang, Kunyang Lin, Te-Chuan Chiu, Anh Nguyen, Chun-Yi Lee
Keywords-EN: robots navigate realistic, natural language instructions, navigate realistic, environments based, paper addresses
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D environments based on natural language instructions. Current approaches use contrastive learning to align language with visual trajectory sequences. Nevertheless, they encounter difficulties with fine-grained vision negatives. To enhance cross-modal embeddings, we introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples. To validate the proposed methodology, we conduct a series of experiments on two common VLN benchmarks, R2R and REVERIE, assessing the effectiveness of the enriched embeddings on fine-grained vision negatives; the results demonstrate that these embeddings benefit navigation and can lead to a promising performance enhancement. Our source code and trained models are available at: this https URL.
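Why fine-grained negatives matter can be seen in a toy InfoNCE-style contrastive loss: the closer the negatives sit to the positive, the larger (and more informative) the loss signal. This sketch is a generic contrastive loss, not the paper's exact objective, and the similarity values and temperature are illustrative assumptions:

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.1):
    """Contrastive loss for one instruction-trajectory pair: the positive
    similarity competes against a set of mined negative similarities."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # -log softmax probability of the positive

easy = info_nce(0.9, [0.1, 0.0])    # negatives far from the positive
hard = info_nce(0.9, [0.85, 0.8])   # fine-grained negatives close to it
```

Adversarially optimizing the negatives (as the paper does with Bayesian Optimization) pushes training toward the `hard` regime, which yields stronger gradients for the embedding.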

[NLP-24] Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension ICPR2024

[Quick Read]: This paper addresses the high cost of manual annotation for Referring Expression Comprehension (REC). The key contribution is a novel framework that generates artificial data instead of relying on manual annotation: the pipeline first processes existing data to create variations of the annotations, then generates new images guided by the altered annotations. The result is a new dataset, Harlequin, containing more than 1M queries. This eliminates manual data collection and annotation, enables scalability and arbitrary complexity, and experiments — pre-training REC models on the artificial data, then fine-tuning and evaluating on human-annotated datasets — show that the pre-training benefits performance.

Link: https://arxiv.org/abs/2411.14807
Authors: Luca Parolari, Elena Izzo, Lamberto Ballan
Keywords-EN: Referring Expression Comprehension, Referring Expression, Expression Comprehension, natural language expression, language expression
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ICPR 2024

Abstract:Referring Expression Comprehension (REC) aims to identify a particular object in a scene by a natural language expression, and is an important topic in visual language understanding. State-of-the-art methods for this task are based on deep learning, which generally requires expensive and manually labeled annotations. Some works tackle the problem with limited-supervision learning or by relying on Large Vision and Language Models. However, the development of techniques to synthesize labeled data is overlooked. In this paper, we propose a novel framework that generates artificial data for the REC task, taking into account both textual and visual modalities. First, our pipeline processes existing data to create variations in the annotations. Then, it generates an image using the altered annotations as guidance. The result of this pipeline is a new dataset, called Harlequin, comprising more than 1M queries. This approach eliminates manual data collection and annotation, enabling scalability and facilitating arbitrary complexity. We pre-train three REC models on Harlequin, then fine-tune and evaluate them on human-annotated datasets. Our experiments show that the pre-training on artificial data is beneficial for performance.

[NLP-25] Continual SFT Matches Multimodal RLHF with Negative Supervision

[Quick Read]: This paper addresses the high computational and model-complexity cost of conventional multimodal reinforcement learning from human feedback (RLHF) during the preference alignment stage of vision-language models (VLMs). The key contribution is negative supervised finetuning (nSFT), which exploits the logit information of the rejected responses in RLHF to fine-tune the model, matching RLHF-like effects without requiring additional large models. nSFT disentangles the negative supervision in the RLHF paradigm and continually aligns VLMs with a simple SFT loss, which is markedly more compute- and memory-efficient.

Link: https://arxiv.org/abs/2411.14797
Authors: Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, Jiangjiang Liu, Gang Zhang, Jingdong Wang
Keywords-EN: improve vision-language models', Multimodal RLHF, continually improve vision-language, vision-language models', improve vision-language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logits of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits this information. Our nSFT disentangles this negative supervision from the RLHF paradigm, and continually aligns VLMs with a simple SFT loss. This is more memory efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously demonstrated by comparing it with various multimodal RLHF approaches, across different dataset sources, base VLMs and evaluation metrics. In addition, extensive ablations are provided to support our hypothesis. We hope this paper will stimulate further research to properly align large vision language models.

[NLP-26] De-biased Multimodal Electrocardiogram Analysis

[Quick Read]: This paper addresses applying multimodal large language models (MLLMs) to electrocardiogram (ECG) signals, in particular how to preserve more ECG information and fully leverage the language model's reasoning ability. The key is to feed ECG embeddings directly into the LLM through a projection layer, retaining more of the ECG's information and effectively handling the common clinical scenario of comparing two ECGs taken at different times. A causal analysis further reveals that the model may ignore the ECG input: the confounder, severity of illness, induces a spurious correlation between question and answer. The paper designs a de-biased pre-training method based on the theory of backdoor adjustment to eliminate the confounder's effect, improving performance under adversarial testing and enabling zero-shot capability.

Link: https://arxiv.org/abs/2411.14795
Authors: Haitao Li, Ziyu Li, Yiheng Mao, Ziyi Liu, Zhoujian Sun, Zhengxing Huang
Keywords-EN: Multimodal large language, Multimodal large, large language models, medical imaging, ECG
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal large language models (MLLMs) are increasingly being applied in the medical field, particularly in medical imaging. However, developing MLLMs for ECG signals, which are crucial in clinical settings, has been a significant challenge beyond medical imaging. Previous studies have attempted to address this by converting ECGs into several text tags using an external classifier in a training-free manner. However, this approach significantly compresses the information in ECGs and underutilizes the reasoning capabilities of LLMs. In this work, we directly feed the embeddings of ECGs into the LLM through a projection layer, retaining more information about ECGs and better leveraging the reasoning abilities of LLMs. Our method can also effectively handle a common situation in clinical practice where it is necessary to compare two ECGs taken at different times. Recent studies found that MLLMs may rely solely on text input to provide answers, ignoring inputs from other modalities. We analyzed this phenomenon from a causal perspective in the context of ECG MLLMs and discovered that the confounder, severity of illness, introduces a spurious correlation between the question and answer, leading the model to rely on this spurious correlation and ignore the ECG input. Such models do not comprehend the ECG input and perform poorly in adversarial tests where different expressions of the same question are used in the training and testing sets. We designed a de-biased pre-training method to eliminate the confounder’s effect according to the theory of backdoor adjustment. Our model performed well on the ECG-QA task under adversarial testing and demonstrated zero-shot capabilities. An interesting random ECG test further validated that our model effectively understands and utilizes the input ECG signal.
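The backdoor adjustment the paper invokes has a simple computational form: marginalize the confounder Z (here, illness severity) out of the conditional, P(Y | do(X)) = Σ_z P(Y | X, z) P(z). The probability tables below are invented toy numbers, not the paper's data:

```python
def backdoor_adjust(p_y_given_xz, p_z):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z),
    removing the confounder Z's backdoor path from the estimate."""
    return {x: sum(p_y_given_xz[x][z] * p_z[z] for z in p_z)
            for x in p_y_given_xz}

p_z = {"mild": 0.7, "severe": 0.3}          # confounder prior: illness severity
p_y_given_xz = {                             # P(abnormal answer | ECG x, severity z)
    "ecg_a": {"mild": 0.2, "severe": 0.9},
    "ecg_b": {"mild": 0.1, "severe": 0.8},
}
interventional = backdoor_adjust(p_y_given_xz, p_z)
print(interventional)
```

Training against this interventional quantity rather than the raw conditional is what prevents the model from exploiting the spurious severity-to-answer shortcut.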

[NLP-27] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

[Quick Read]: This paper addresses the scarcity of high-quality, large-scale datasets for video question answering (VideoQA) and the limitations of existing datasets for complex reasoning. The key contribution is VideoEspresso, a new dataset whose VideoQA pairs preserve essential spatial detail and temporal coherence and include multimodal annotations of intermediate reasoning steps. The construction pipeline uses a semantic-aware method to reduce redundancy, then generates QA pairs with GPT-4o, and further develops video Chain-of-Thought (CoT) annotations that guide GPT-4o in extracting logical relationships from the QA pairs and video content. To exploit these high-quality pairs, the paper proposes a Hybrid LVLMs Collaboration framework with a frame selector and a two-stage instruction-tuned reasoning LVLM, which adaptively selects core frames and performs CoT reasoning over multimodal evidence, outperforming existing baselines on most tasks.

Link: https://arxiv.org/abs/2411.14794
Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu
Keywords-EN: Vision Language Models, Large Vision Language, Language Models, Large Vision, Vision Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 14 figures

Abstract:The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: this https URL
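A "semantic-aware method to reduce redundancy" can be illustrated by the simplest instance: drop a frame whose embedding is near-duplicate of one already kept. The threshold, the 2-d toy embeddings, and the greedy scheme are illustrative assumptions, not the paper's actual pipeline:

```python
def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

def select_core_frames(frame_embs, thresh=0.95):
    """Greedy redundancy filter: keep a frame only if it is not
    near-duplicate (cosine >= thresh) of any already-kept frame."""
    kept = []
    for i, emb in enumerate(frame_embs):
        if all(cosine(emb, frame_embs[j]) < thresh for j in kept):
            kept.append(i)
    return kept

# frames 1 and 3 are near-duplicates of frames 0 and 2 respectively
frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.02, 0.99]]
core = select_core_frames(frames)
print(core)
```

The retained indices are the "core frames" over which QA generation and CoT reasoning would then operate.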

[NLP-28] KBAda: Efficient Self Adaptation on Specific Knowledge Bases

[Quick Read]: This paper addresses how large language models (LLMs) can efficiently exploit knowledge bases when adapting to downstream tasks. The key contribution is KBAda, which uses iterative training with self-annotated data (such as QA pairs and revision suggestions) so that the model efficiently masters the knowledge content. Experiments on multiple datasets show significant gains at low cost on downstream tasks that require specific knowledge. Notably, without external signals such as GPT-4-turbo annotation, the method achieves over 90% of the performance improvement obtainable with GPT-4-turbo annotation, relying entirely on self-supervision.

Link: https://arxiv.org/abs/2411.14790
Authors: Zheni Zeng, Yuxuan Chen, Shi Yu, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Keywords-EN: creating self-assessment questions, achieving related tasks, quickly acquire knowledge, self-assessment questions, techniques to quickly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans can utilize techniques to quickly acquire knowledge from specific materials in advance, such as creating self-assessment questions, enabling us to achieve related tasks more efficiently. In contrast, large language models (LLMs) usually rely on retrieval-augmented generation to exploit knowledge materials in an instant manner, or require external signals such as human preference data and stronger LLM annotations to conduct knowledge adaptation. To unleash the self-learning potential of LLMs, we propose KBAda, an approach designed for efficient adaptation to downstream tasks involving knowledge bases. Our method utilizes iterative training with self-annotated data such as QA pairs and revision suggestions, enabling the model to grasp the knowledge content efficiently. Experimental results on multiple datasets demonstrate the effectiveness of our approach, significantly boosting model performance in downstream tasks that require specific knowledge at a low cost. Notably, our approach achieves over 90% of the performance improvement that can be obtained by using GPT-4-turbo annotation, while relying entirely on self-supervision. We release our experimental data, models, and process analyses to the community for further exploration (this https URL).
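The iterative self-annotation idea reduces to a simple loop: the model writes its own training signal over the knowledge base, then trains on it, and repeats. Both helper functions below are stand-ins (the real system generates QA pairs and revision suggestions with the LLM itself and runs actual fine-tuning steps); all names are hypothetical:

```python
def self_annotate(model, passages):
    """Stand-in for the model writing its own QA pairs about the corpus."""
    return [(f"What does the passage about {p} say?", p) for p in passages]

def train(model, qa_pairs):
    """Stand-in for a fine-tuning step; here we just record coverage."""
    model["seen"].update(answer for _, answer in qa_pairs)
    return model

def kbada_loop(passages, rounds=2):
    """Iterate self-annotation and training, as in the paper's scheme."""
    model = {"seen": set()}
    for _ in range(rounds):
        qa = self_annotate(model, passages)
        model = train(model, qa)
    return model

model = kbada_loop(["solar cells", "wind turbines"])
```

The point of the loop structure is that no external annotator enters it: every training signal is produced by the model under adaptation, which is what makes the 90%-of-GPT-4-turbo result notable.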

[NLP-29] IRLab@iKAT24: Learned Sparse Retrieval with Multi-aspect LLM Query Generation for Conversational Search

[Quick Read]: This paper investigates how to exploit a Personal Textual Knowledge Base (PTKB) in conversational search to improve a conversational assistant's interaction and responses. The key is multi-aspect query generation with large language models (LLMs) for query rewriting, built on the MQ4CS framework and further enhanced with Learned Sparse Retrieval via the SPLADE architecture, coupled with strong cross-encoder models for retrieval and reranking. The paper also proposes aggregating the multiple aspects during the reranking stage, as an alternative to the earlier interleaving strategy. The findings show that multi-aspect query generation is effective when integrated with advanced retrieval and reranking models, and that relying on LLMs to integrate personalization within query rewriting can outperform human rewrites.

Link: https://arxiv.org/abs/2411.14739
Authors: Simon Lupart, Zahra Abbasiantaeb, Mohammad Aliannejadi
Keywords-EN: Interactive Knowledge Assistant, Knowledge Assistant Track, advancing conversational assistants, personalized user knowledge, Textual Knowledge Base
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:The Interactive Knowledge Assistant Track (iKAT) 2024 focuses on advancing conversational assistants, able to adapt their interaction and responses from personalized user knowledge. The track incorporates a Personal Textual Knowledge Base (PTKB) alongside Conversational AI tasks, such as passage ranking and response generation. Since query rewriting is an effective approach for resolving conversational context, we explore Large Language Models (LLMs) as query rewriters. Specifically, our submitted runs explore multi-aspect query generation using the MQ4CS framework, which we further enhance with Learned Sparse Retrieval via the SPLADE architecture, coupled with robust cross-encoder models. We also propose an alternative to the previous interleaving strategy, aggregating multiple aspects during the reranking phase. Our findings indicate that multi-aspect query generation is effective in enhancing performance when integrated with advanced retrieval and reranking models. Our results also lead the way for better personalization in conversational search, relying on LLMs to integrate personalization within query rewriting, and outperforming human rewrites.
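The proposed alternative to interleaving — aggregating aspect scores at reranking time — can be sketched as fusing, per passage, the scores it receives under each aspect query and ranking on the fused value. Max-fusion and the toy scores below are illustrative assumptions; the paper's actual aggregation and cross-encoder scores may differ:

```python
def aggregate_rerank(aspect_scores, top_k=2):
    """Fuse the cross-encoder score each passage receives under every
    aspect query (max over aspects), then rank passages by fused score."""
    fused = {}
    for aspect in aspect_scores:
        for doc, score in aspect.items():
            fused[doc] = max(fused.get(doc, float("-inf")), score)
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

aspects = [
    {"d1": 0.2, "d2": 0.9, "d3": 0.4},   # e.g. aspect: dietary restrictions
    {"d1": 0.8, "d2": 0.1, "d3": 0.5},   # e.g. aspect: local restaurants
]
ranking = aggregate_rerank(aspects)
print(ranking)
```

Unlike interleaving, which alternates result lists, this produces a single ranking where a passage strong on any one aspect can surface near the top.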

[NLP-30] Universal and Context-Independent Triggers for Precise Control of LLM Outputs

[Quick Read]: This paper addresses the risk that prompt injection attacks can manipulate the outputs of large language models (LLMs). The key contribution is a gradient-based white-box attack that discovers a universal, context-independent trigger capable of precisely manipulating LLM inputs to yield any specified output. The method discovers such triggers efficiently through an optimization procedure, and experiments demonstrate the trigger's robustness across diverse prompt contexts and its high-precision control over target outputs. The paper also highlights the substantial threat such attacks pose to LLM-based applications, potentially allowing adversaries to take full control of an AI agent's decisions and actions.

Link: https://arxiv.org/abs/2411.14738
Authors: Jiashuo Liang, Guancheng Li, Yang Yu
Keywords-EN: Large language models, automated content generation, Large language, critical decision-making systems, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Large language models (LLMs) have been widely adopted in applications such as automated content generation and even critical decision-making systems. However, the risk of prompt injection allows for potential manipulation of LLM outputs. While numerous attack methods have been documented, achieving full control over these outputs remains challenging, often requiring experienced attackers to make multiple attempts and depending heavily on the prompt context. Recent advancements in gradient-based white-box attack techniques have shown promise in tasks like jailbreaks and system prompt leaks. Our research generalizes gradient-based attacks to find a trigger that is (1) Universal: effective irrespective of the target output; (2) Context-Independent: robust across diverse prompt contexts; and (3) Precise Output: capable of manipulating LLM inputs to yield any specified output with high accuracy. We propose a novel method to efficiently discover such triggers and assess the effectiveness of the proposed attack. Furthermore, we discuss the substantial threats posed by such attacks to LLM-based applications, highlighting the potential for adversaries to taking over the decisions and actions made by AI agents.
zh
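论文的攻击依赖白盒梯度信息在离散 token 空间中搜索通用触发器。下面用一个不含梯度的玩具贪心坐标搜索,仅勾勒"逐位置迭代替换 token、使对所有目标输出的损失同时下降"这一外层循环(词表、损失函数均为虚构占位,并非论文方法):

```python
# 玩具示例(非论文实现):贪心坐标搜索近似说明"通用触发器"的优化循环。
# 真实方法利用白盒梯度挑选候选替换;这里的损失只是触发器与各目标串的
# 逐位不匹配数之和,"通用性"体现在同一触发器需对所有目标同时优化。
import itertools

VOCAB = ["a", "b", "c", "d"]

def loss(trigger, targets):
    return sum(sum(t != g for t, g in zip(trigger, tgt)) for tgt in targets)

def optimize_trigger(targets, length=3, iters=10):
    trigger = [VOCAB[0]] * length
    for _ in range(iters):
        improved = False
        for pos, tok in itertools.product(range(length), VOCAB):
            cand = trigger[:pos] + [tok] + trigger[pos + 1:]
            if loss(cand, targets) < loss(trigger, targets):
                trigger, improved = cand, True
        if not improved:
            break
    return trigger

if __name__ == "__main__":
    # 两个"目标输出":通用触发器需对二者同时有效
    print(optimize_trigger([["a", "b", "c"], ["a", "b", "d"]]))
```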

[NLP-31] Evaluating and Advancing Multimodal Large Language Models in Ability Lens

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在视觉感知能力评估中存在的评估不一致性和复杂性问题。解决方案的关键在于引入了一个名为AbilityLens的统一基准,该基准旨在全面评估MLLMs的六项关键感知能力,涵盖了多种问题类型、领域和评估指标,以确保评估的准确性和稳定性。通过AbilityLens,研究者能够识别当前模型的优缺点,揭示开源与闭源模型之间的性能差距,并设计了一种能力特定的模型合并方法,以缓解能力冲突导致的性能下降。

链接: https://arxiv.org/abs/2411.14725
作者: Feng Chen,Chenhui Gou,Jing Liu,Yang Yang,Zhaoyang Li,Jiyuan Zhang,Zhenbang Sun,Bohan Zhuang,Qi Wu
关键词-EN: multimodal large language, large language models, advance rapidly, providing further guidance, multimodal large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential, providing further guidance for their development. In this work, we focus on a unified and robust evaluation of vision perception abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation metrics, introduce significant evaluation variance, complicating comprehensive assessments of perception abilities when relying on any single benchmark. To address this, we introduce AbilityLens, a unified benchmark designed to evaluate MLLMs across six key perception abilities, focusing on both accuracy and stability, with each ability encompassing diverse question types, domains, and metrics. With the assistance of AbilityLens, we: (1) identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models; (2) introduce an online evaluation mode, which uncovers interesting ability conflict and early convergence phenomena during MLLM training; and (3) design a simple ability-specific model merging method that combines the best ability checkpoint from early training stages, effectively mitigating performance decline due to ability conflict. The benchmark and online leaderboard will be released soon.
zh
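摘要中"能力特定的模型合并"可以用如下极简草图示意:为每项能力挑出得分最高的早期 checkpoint,再对所选 checkpoint 的参数取平均(checkpoint 以字典近似表示,选择与合并方式均为假设,仅说明思路):

```python
# 假设性草图:能力特定的模型合并。每项能力选出最佳 checkpoint,
# 再对所选 checkpoint 的同名参数取算术平均(假设各 checkpoint 参数名一致)。

def merge_by_ability(checkpoints, ability_scores):
    """checkpoints: {ckpt_id: {param: value}};
    ability_scores: {ability: {ckpt_id: score}}"""
    chosen = {max(scores, key=scores.get) for scores in ability_scores.values()}
    params = checkpoints[next(iter(chosen))].keys()
    return {p: sum(checkpoints[c][p] for c in chosen) / len(chosen)
            for p in params}

if __name__ == "__main__":
    ckpts = {"step100": {"w": 1.0}, "step200": {"w": 3.0}}
    scores = {"OCR": {"step100": 0.8, "step200": 0.6},
              "counting": {"step100": 0.4, "step200": 0.9}}
    print(merge_by_ability(ckpts, scores))  # {'w': 2.0}
```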

[NLP-32] MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

【速读】: 该论文试图解决分子发现领域中分子与其描述文本之间细粒度对齐的问题。解决方案的关键在于引入了一种名为MolReFlect的新型师生框架,通过上下文感知的方式进行分子-文本对齐。具体来说,MolReFlect首先利用较大的教师大语言模型(LLM)从分子描述或SMILES字符串中提取关键短语,并将其与分子子结构或特性对应起来。随后,通过In-Context Selective Reflection机制,利用先前的提取结果作为上下文示例,使教师LLM进行反思,并由较小的学生LLM从上下文反思和先前提取结果中进行选择,以进一步优化对齐。最后,通过Chain-of-Thought In-Context Molecule Tuning,将细粒度对齐与推理过程整合在Chain-of-Thought格式中,增强学生LLM的学习过程。实验结果表明,MolReFlect显著提升了Mistral-7B在ChEBI-20数据集上的表现,达到了最先进的性能。

链接: https://arxiv.org/abs/2411.14721
作者: Jiatong Li,Yunqing Liu,Wei Liu,Jingdi Le,Di Zhang,Wenqi Fan,Dongzhan Zhou,Yuqiang Li,Qing Li
关键词-EN: pivotal research field, Large Language Models, research field, pivotal research, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 22 pages, 12 figures

点击查看摘要

Abstract:Molecule discovery is a pivotal research field, impacting everything from the medicines we take to the materials we use. Recently, Large Language Models (LLMs) have been widely adopted in molecule understanding and generation, yet the alignments between molecules and their corresponding captions remain a significant challenge. Previous endeavours often treat the molecule as a general SMILES string or molecular graph, neglecting the fine-grained alignments between the molecular sub-structures and the descriptive textual phrases, which are crucial for accurate and explainable predictions. In this case, we introduce MolReFlect, a novel teacher-student framework designed to contextually perform the molecule-caption alignments in a fine-grained way. Our approach initially leverages a larger teacher LLM to label the detailed alignments by directly extracting critical phrases from molecule captions or SMILES strings and implying them to corresponding sub-structures or characteristics. To refine these alignments, we propose In-Context Selective Reflection, which retrieves previous extraction results as context examples for teacher LLM to reflect and lets a smaller student LLM select from in-context reflection and previous extraction results. Finally, we enhance the learning process of the student LLM through Chain-of-Thought In-Context Molecule Tuning, integrating the fine-grained alignments and the reasoning processes within the Chain-of-Thought format. Our experimental results demonstrate that MolReFlect enables LLMs like Mistral-7B to significantly outperform the previous baselines, achieving SOTA performance on the ChEBI-20 dataset. This advancement not only enhances the generative capabilities of LLMs in the molecule-caption translation task, but also contributes to a more explainable framework.
zh

[NLP-33] Optimizing Social Media Annotation of HPV Vaccine Skepticism and Misinformation Using Large Language Models : An Experimental Evaluation of In-Context Learning and Fine-Tuning Stance Detection Across Multiple Models

【速读】: 该论文试图解决在HPV疫苗相关推文的立场检测中,如何通过大规模语言模型(LLMs)实现社交媒体内容标注的最佳策略问题。解决方案的关键在于实验性地比较了传统微调(fine-tuning)和新兴的上下文学习(in-context learning)方法,并通过系统地改变提示工程(prompt engineering)策略,如提示模板设计、样本抽样方法和样本数量,来优化立场检测性能。研究发现,上下文学习在HPV疫苗社交媒体内容的立场检测中通常优于微调,且不同LLMs及其变体对上下文学习条件的敏感性不同。最优的上下文学习配置包括六个分层样本与详细的上下文提示相结合。

链接: https://arxiv.org/abs/2411.14720
作者: Luhang Sun,Varsha Pendyala,Yun-Shiuan Chuang,Shanglin Yang,Jonathan Feldman,Andrew Zhao,Munmun De Choudhury,Sijia Yang,Dhavan Shah
关键词-EN: paper leverages large-language, leverages large-language models, experimentally determine optimal, HPV vaccine-related tweets, determine optimal strategies
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper leverages large-language models (LLMs) to experimentally determine optimal strategies for scaling up social media content annotation for stance detection on HPV vaccine-related tweets. We examine both conventional fine-tuning and emergent in-context learning methods, systematically varying strategies of prompt engineering across widely used LLMs and their variants (e.g., GPT4, Mistral, and Llama3, etc.). Specifically, we varied prompt template design, shot sampling methods, and shot quantity to detect stance on HPV vaccination. Our findings reveal that 1) in general, in-context learning outperforms fine-tuning in stance detection for HPV vaccine social media content; 2) increasing shot quantity does not necessarily enhance performance across models; and 3) different LLMs and their variants present differing sensitivity to in-context learning conditions. We uncovered that the optimal in-context learning configuration for stance detection on HPV vaccine tweets involves six stratified shots paired with detailed contextual prompts. This study highlights the potential and provides an applicable approach for applying LLMs to research on social media stance and skepticism detection.
zh
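论文发现最优的上下文学习配置是"六个分层 shot 配详细上下文提示"。下面是按类别分层抽取 few-shot 示例的最小草图(数据、类别名均为虚构;随机种子固定以便复现):

```python
# 假设性草图:按立场类别分层、轮流抽取 n_shots 条 few-shot 示例。
import random
from collections import defaultdict

def stratified_shots(examples, n_shots=6, seed=0):
    """examples: [(text, stance)];按 stance 分层后轮流抽取 n_shots 条。"""
    rng = random.Random(seed)
    by_stance = defaultdict(list)
    for ex in examples:
        by_stance[ex[1]].append(ex)
    for pool in by_stance.values():
        rng.shuffle(pool)
    shots, stances, i = [], sorted(by_stance), 0
    while len(shots) < min(n_shots, len(examples)):
        pool = by_stance[stances[i % len(stances)]]
        if pool:
            shots.append(pool.pop())
        i += 1
    return shots

if __name__ == "__main__":
    data = [("tweet%d" % i, s)
            for i, s in enumerate(["pro", "anti", "neutral"] * 4)]
    print([s for _, s in stratified_shots(data)])  # 每个类别各 2 条
```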

[NLP-34] FedMLLM : Federated Fine-tuning MLLM on Multimodal Heterogeneity Data

【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在联邦学习 (Federated Learning, FL) 环境下进行微调时面临的模态异质性问题。解决方案的关键在于提出了一个通用的 FedMLLM 框架,该框架整合了四种代表性的联邦学习方法和两种模态无关策略,以应对多模态数据中的异质性挑战。通过广泛的实验验证,该框架能够通过扩展训练数据范围和缓解模态异质性来提升 MLLMs 的性能。

链接: https://arxiv.org/abs/2411.14717
作者: Binqian Xu,Xiangbo Shu,Haiyang Mei,Guosen Xie,Basura Fernando,Mike Zheng Shou,Jinhui Tang
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, made significant advancements
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made significant advancements, demonstrating powerful capabilities in processing and understanding multimodal data. Fine-tuning MLLMs with Federated Learning (FL) allows for expanding the training data scope by including private data sources, thereby enhancing their practical applicability in privacy-sensitive domains. However, current research remains in the early stage, particularly in addressing the multimodal heterogeneities in real-world applications. In this paper, we introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios, laying the groundwork for the research in the field. Our benchmark encompasses two datasets, five comparison baselines, and four multimodal scenarios, incorporating over ten types of modal heterogeneities. To address the challenges posed by modal heterogeneity, we develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies. Extensive experimental results show that our proposed FL paradigm improves the performance of MLLMs by broadening the range of training data and mitigating multimodal heterogeneity. Code is available at this https URL
zh
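作为背景,联邦微调最基础的聚合算子是 FedAvg:按各客户端样本量对参数做加权平均。FedMLLM 框架整合的四种 FL 方法远比这复杂,这里仅给出 FedAvg 本身的最小示意(参数以字典近似表示):

```python
# 示意:FedAvg 按客户端样本量加权平均各客户端的参数。

def fedavg(client_weights, client_sizes):
    """client_weights: [{param: value}];client_sizes: 各客户端样本数。"""
    total = sum(client_sizes)
    merged = {}
    for w, n in zip(client_weights, client_sizes):
        for p, v in w.items():
            merged[p] = merged.get(p, 0.0) + v * n / total
    return merged

if __name__ == "__main__":
    clients = [{"w": 1.0}, {"w": 4.0}]
    print(fedavg(clients, [1, 3]))  # {'w': 3.25}
```

模态异质场景下,简单加权平均可能被某一模态占优的客户端带偏,这正是论文引入模态无关策略的动机之一。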

[NLP-35] Understanding LLM Embeddings for Regression

【速读】: 该论文试图解决的问题是如何在高维回归任务中利用大型语言模型(LLMs)的嵌入(embeddings)来替代传统的特征工程方法。解决方案的关键在于利用LLM生成的嵌入作为下游特征,这些嵌入在特征空间中自然地保持了Lipschitz连续性,从而在某些情况下能够比传统特征工程方法更有效地进行回归预测。此外,论文还探讨了模型大小和语言理解能力对回归性能的影响,发现这些因素并不总是能提升回归性能。

链接: https://arxiv.org/abs/2411.14708
作者: Eric Tang,Bangding Yang,Xingyou Song
关键词-EN: flexibly processing information, preprocessing string representations, LLM embeddings, specifically by preprocessing, metric prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 13 figures

点击查看摘要

Abstract:With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than using traditional feature engineering. This regression performance can be explained in part due to LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.
zh
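论文的核心做法是把字符串的 LLM 嵌入直接作为下游回归特征。下面用玩具"嵌入"加 k 近邻回归做一个极简替身(并非论文所用的嵌入或回归器),直观体现"嵌入空间上邻近的输入应有邻近的目标值"这一 Lipschitz 连续性直觉:

```python
# 假设性草图:嵌入作为回归特征。真实流程中 embedding 来自 LLM,
# 这里用手工二维向量代替;回归器取最简单的 k 近邻平均。
import math

def knn_regress(train, query_emb, k=2):
    """train: [(embedding, target)];返回 k 个最近邻目标值的平均。"""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query_emb))[:k]
    return sum(y for _, y in nearest) / k

if __name__ == "__main__":
    train = [([0.0, 0.0], 1.0), ([1.0, 0.0], 2.0), ([10.0, 10.0], 50.0)]
    print(knn_regress(train, [0.5, 0.0]))  # 1.5:邻近嵌入 → 邻近目标值
```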

[NLP-36] Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在资源受限环境中部署的问题,特别是其高计算和内存需求。解决方案的关键在于提出了一种反馈驱动的蒸馏框架(Feedback-Driven Distillation, FDD),通过知识蒸馏将LLMs的推理能力转移到小型语言模型(Small Language Models, SLMs)上。具体步骤包括:初始阶段构建蒸馏数据集,通过LLMs生成数学问题的推理理由;根据SLM的表现将问题分类为简单和困难,对简单问题生成更复杂的变体,对困难问题合成相似复杂度的新问题;采用多轮蒸馏范式,迭代丰富蒸馏数据集,逐步提升SLMs的数学推理能力。实验结果表明,该方法能使SLMs达到最先进的数学推理性能。

链接: https://arxiv.org/abs/2411.14698
作者: Xunyu Zhu,Jian Li,Can Ma,Weiping Wang
关键词-EN: Large Language Models, Small Language Models, Large Language, Language Models, Small Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate exceptional reasoning capabilities, often achieving state-of-the-art performance in various tasks. However, their substantial computational and memory demands, due to billions of parameters, hinder deployment in resource-constrained environments. A promising solution is knowledge distillation, where LLMs transfer reasoning capabilities to Small Language Models (SLMs, \le 1B parameters), enabling wider deployment on low-resource devices. Existing methods primarily focus on generating high-quality reasoning rationales for distillation datasets but often neglect the critical role of data quantity and quality. To address these challenges, we propose a Feedback-Driven Distillation (FDD) framework to enhance SLMs’ mathematical reasoning capabilities. In the initialization stage, a distillation dataset is constructed by prompting LLMs to pair mathematical problems with corresponding reasoning rationales. We classify problems into easy and hard categories based on SLM performance. For easy problems, LLMs generate more complex variations, while for hard problems, new questions of similar complexity are synthesized. In addition, we propose a multi-round distillation paradigm to iteratively enrich the distillation datasets, thereby progressively improving the mathematical reasoning abilities of SLMs. Experimental results demonstrate that our method can make SLMs achieve SOTA mathematical reasoning performance.
zh
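FDD 初始化阶段"按 SLM 表现把问题分为难/易并分别路由"的逻辑可以这样勾勒(生成复杂变体/同难度新题的步骤以占位字符串代替真实的教师 LLM 调用,阈值亦为假设):

```python
# 假设性草图:FDD 的难/易划分与路由。易题 → 生成更复杂变体;
# 难题 → 合成同难度新题。真实流程中两者均由教师 LLM 完成。

def split_by_difficulty(problems, slm_accuracy, threshold=0.5):
    """slm_accuracy: {problem: SLM 在该题上的正确率}"""
    easy = [p for p in problems if slm_accuracy[p] >= threshold]
    hard = [p for p in problems if slm_accuracy[p] < threshold]
    return easy, hard

def augment(problems, slm_accuracy):
    easy, hard = split_by_difficulty(problems, slm_accuracy)
    return ([p + "_harder_variant" for p in easy] +
            [p + "_similar_new" for p in hard])

if __name__ == "__main__":
    acc = {"q1": 0.9, "q2": 0.2}
    print(augment(["q1", "q2"], acc))
    # ['q1_harder_variant', 'q2_similar_new']
```

多轮蒸馏即在每一轮用增强后的数据重新训练 SLM、重新评估正确率,再重复上述路由。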

[NLP-37] What's in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

【速读】: 该论文试图解决视频自动生成密集字幕(automatic dense captions for videos)的问题,特别是当前大多数模型需要一次性处理整个视频的局限性。解决方案的关键在于提出了一种高效的在线方法,该方法采用了一种新颖的自回归因子化解码架构(autoregressive factorized decoding architecture),能够在不依赖未来帧的情况下,频繁、详细且时间上对齐地输出字幕。这种架构通过建模每个时间段的视觉特征序列,输出局部描述,并有效利用前述视频片段的上下文信息,从而生成更全面、频繁的字幕,更准确地反映视频的实际内容,而非模仿训练数据。此外,论文还提出了一种优化方法,用于提高训练和推理的效率,使其能够处理更长的视频。这种方法在性能上优于现有的在线和离线方法,且计算资源消耗减少了20%。

链接: https://arxiv.org/abs/2411.14688
作者: AJ Piergiovanni,Dahun Kim,Michael S. Ryoo,Isaac Noble,Anelia Angelova
关键词-EN: Generating automatic dense, area of research, Generating automatic, remains a challenging, challenging area
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputting localized descriptions and efficiently leverages the context from the previous video segments. This allows the model to output frequent, detailed captions to more comprehensively describe the video, according to its actual local content, rather than mimic the training data. Second, we propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute. The annotations produced are much more comprehensive and frequent, and can further be utilized in automatic video tagging and in large-scale video data harvesting.
zh

[NLP-38] Multiverse of Greatness: Generating Story Branches with LLMs

【速读】: 该论文试图解决的问题是如何利用大型语言模型 (LLMs) 生成具有动态上下文历史的长篇连贯图表内容,特别是在生成视觉小说游戏时,避免手动提取输出和缺乏灵活性的问题。解决方案的关键在于提出了动态上下文提示/编程框架 (Dynamic Context Prompting/Programming, DCP/P),该框架通过提供动态上下文窗口历史,使LLMs能够生成更连贯和高质量的故事内容。通过对比实验,论文证明了仅提供摘要的基线方法生成的故事质量较差,而结合上下文历史的DCP/P方法显著提升了生成内容的质量和连贯性。

链接: https://arxiv.org/abs/2411.14672
作者: Pittawat Taveekitworachai,Chollakorn Nimpattanavong,Mustafa Can Gursesli,Antonio Lanata,Andrea Guazzini,Ruck Thawonmas
关键词-EN: Dynamic Context Prompting, paper presents Dynamic, presents Dynamic Context, dynamic context window, Dynamic Context
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 14 figures

点击查看摘要

Abstract:This paper presents Dynamic Context Prompting/Programming (DCP/P), a novel framework for interacting with LLMs to generate graph-based content with a dynamic context window history. While there is an existing study utilizing LLMs to generate a visual novel game, the previous study involved a manual process of output extraction and did not provide flexibility in generating a longer, coherent story. We evaluate DCP/P against our baseline, which does not provide context history to an LLM and only relies on the initial story data. Through objective evaluation, we show that simply providing the LLM with a summary leads to a subpar story compared to additionally providing the LLM with the proper context of the story. We also provide an extensive qualitative analysis and discussion. We qualitatively examine the quality of the objectively best-performing generated game from each approach. In addition, we examine biases in word choices and word sentiment of the generated content. We find a consistent observation with previous studies that LLMs are biased towards certain words, even with a different LLM family. Finally, we provide a comprehensive discussion on opportunities for future studies.
zh
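DCP/P 的核心是带滚动摘要的动态上下文窗口:超出窗口的旧故事节点折叠进摘要,窗口内节点保留原文。下面是一个最小示意(摘要函数用截断占位,真实系统应调用 LLM 压缩被挤出的节点):

```python
# 假设性草图:动态上下文窗口历史。窗口外节点 → 摘要;窗口内节点 → 原文。
from collections import deque

class DynamicContext:
    def __init__(self, window=3):
        self.recent = deque(maxlen=window)
        self.summary = []

    def add(self, node_text):
        if len(self.recent) == self.recent.maxlen:
            # 占位摘要:真实系统应调用 LLM 压缩被挤出的最旧节点
            self.summary.append(self.recent[0][:8] + "...")
        self.recent.append(node_text)

    def prompt_context(self):
        return " | ".join(self.summary) + " || " + " | ".join(self.recent)

if __name__ == "__main__":
    ctx = DynamicContext(window=2)
    for t in ["chapter one text", "chapter two text", "chapter three"]:
        ctx.add(t)
    print(ctx.prompt_context())
```

相比论文基线"只给 LLM 摘要",这种结构同时保留了近期节点的完整上下文,对应其客观评估中更连贯的生成结果。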

[NLP-39] Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective

【速读】: 该论文试图解决在大语言模型(LLMs)中,不同池化机制(如Mean、Max、Weighted Sum)在不同模型架构(如BERT和GPT)上的性能比较问题。解决方案的关键在于通过系统的实验评估,揭示每种池化机制在句子级情感分析任务中的独特优势和劣势,从而强调根据具体任务需求选择合适的池化方法的重要性。这一研究结果促使对池化操作的常见假设进行重新评估,并为优化基于LLM的下游任务模型提供了实际指导。

链接: https://arxiv.org/abs/2411.14654
作者: Jinming Xing,Ruilin Xing,Yan Sun
关键词-EN: Large Language Models, natural language processing, revolutionized natural language, Large Language, language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing (NLP) by delivering state-of-the-art performance across a variety of tasks. Among these, Transformer-based models like BERT and GPT rely on pooling layers to aggregate token-level embeddings into sentence-level representations. Common pooling mechanisms such as Mean, Max, and Weighted Sum play a pivotal role in this aggregation process. Despite their widespread use, the comparative performance of these strategies on different LLM architectures remains underexplored. To address this gap, this paper investigates the effects of these pooling mechanisms on two prominent LLM families – BERT and GPT, in the context of sentence-level sentiment analysis. Comprehensive experiments reveal that each pooling mechanism exhibits unique strengths and weaknesses depending on the task’s specific requirements. Our findings underline the importance of selecting pooling methods tailored to the demands of particular applications, prompting a re-evaluation of common assumptions regarding pooling operations. By offering actionable insights, this study contributes to the optimization of LLM-based models for downstream tasks.
zh
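论文比较的三种池化机制(Mean、Max、Weighted Sum)本身很简单,可直接写出把 token 级嵌入聚合为句子级表示的最小实现(示例中 Weighted Sum 的权重为手工给定;真实模型中通常是可学习的):

```python
# 三种池化机制的最小实现:对 token 嵌入矩阵按维度聚合。

def mean_pool(token_embs):
    dim = len(token_embs[0])
    return [sum(e[d] for e in token_embs) / len(token_embs) for d in range(dim)]

def max_pool(token_embs):
    dim = len(token_embs[0])
    return [max(e[d] for e in token_embs) for d in range(dim)]

def weighted_sum_pool(token_embs, weights):
    dim = len(token_embs[0])
    return [sum(w * e[d] for w, e in zip(weights, token_embs))
            for d in range(dim)]

if __name__ == "__main__":
    embs = [[1.0, 0.0], [3.0, 2.0]]  # 两个 token 的二维嵌入
    print(mean_pool(embs))                        # [2.0, 1.0]
    print(max_pool(embs))                         # [3.0, 2.0]
    print(weighted_sum_pool(embs, [0.25, 0.75]))  # [2.5, 1.5]
```

论文的结论正是:这三种简单算子在 BERT 与 GPT 两类架构上表现差异明显,应按任务需求选择而非默认使用其中一种。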

[NLP-40] Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains

【速读】: 该论文试图解决低资源和中资源语言在多模态模型评估方面的缺乏问题。解决方案的关键在于引入了ZNO-Vision,这是一个基于乌克兰标准化大学入学考试(ZNO)的综合性多模态乌克兰语基准。该基准包含超过4,300个专家设计的涵盖12个学术领域的问题,并首次对乌克兰语的多模态文本生成进行了评估,包括在Multi30K-UK数据集上的字幕生成质量、将VQA基准翻译成乌克兰语后的性能下降情况,以及从文化角度测试模型对国家菜肴知识的了解。通过这些措施,论文旨在推动乌克兰语及其他低资源语言的多模态生成能力的发展。

链接: https://arxiv.org/abs/2411.14647
作者: Yurii Paniv,Artur Kiulian,Dmytro Chaplynskyi,Mykola Khandoga,Anton Polishko,Tetiana Bas,Guillermo Gabrielli
关键词-EN: multimodal English-centric models, multimodal English-centric, suites for low, English-centric models, active area
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While the evaluation of multimodal English-centric models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks or evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from standardized university entrance examination (ZNO). The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and humanities. We evaluated the performance of both open-source models and API providers, finding that only a handful of models performed above baseline. Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language: we measured caption generation quality on the Multi30K-UK dataset, translated the VQA benchmark into Ukrainian, and measured performance degradation relative to original English versions. Lastly, we tested a few models from a cultural perspective on knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language and our approach could be useful for other low-resource languages.
zh

[NLP-41] Towards Knowledge Checking in Retrieval-augmented Generation: A Representation Perspective

【速读】: 该论文试图解决检索增强生成 (Retrieval-Augmented Generation, RAG) 系统在整合外部知识与大型语言模型 (Large Language Models, LLMs) 内部知识时面临的误导性或无用信息问题。解决方案的关键在于系统性地研究知识检查机制,并通过分析LLM的表示行为,开发基于表示的分类器进行知识过滤。该方法显著提升了RAG系统的性能,尤其是在处理噪声知识数据库时,为增强RAG系统的可靠性和有效性提供了新的见解。

链接: https://arxiv.org/abs/2411.14572
作者: Shenglai Zeng,Jiankun Zhang,Bingheng Li,Yuping Lin,Tianqi Zheng,Dante Everaert,Hanqing Lu,Hui Liu,Hui Liu,Yue Xing,Monica Xiao Cheng,Jiliang Tang
关键词-EN: Large Language Models, Language Models, Large Language, Retrieval-Augmented Generation, shown promise
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems have shown promise in enhancing the performance of Large Language Models (LLMs). However, these systems face challenges in effectively integrating external knowledge with the LLM’s internal knowledge, often leading to issues with misleading or unhelpful information. This work aims to provide a systematic study on knowledge checking in RAG systems. We conduct a comprehensive analysis of LLM representation behaviors and demonstrate the significance of using representations in knowledge checking. Motivated by the findings, we further develop representation-based classifiers for knowledge filtering. We show substantial improvements in RAG performance, even when dealing with noisy knowledge databases. Our study provides new insights into leveraging LLM representations for enhancing the reliability and effectiveness of RAG systems.
zh
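论文训练的是基于 LLM 表示的分类器来过滤检索到的知识。下面用"检索段落表示与'有用知识'原型表示的余弦相似度 + 阈值"做一个极简替身(原型向量与阈值均为虚构,并非论文训练的分类器,仅示意"在表示空间里做知识过滤"的思路):

```python
# 假设性草图:基于表示的知识过滤。相似度高于阈值的段落保留,否则过滤。
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def keep_passage(passage_repr, helpful_proto, threshold=0.5):
    return cosine(passage_repr, helpful_proto) >= threshold

if __name__ == "__main__":
    proto = [1.0, 0.0]  # 虚构的"有用知识"原型表示
    print(keep_passage([0.9, 0.1], proto))   # True:表示接近原型
    print(keep_passage([-1.0, 0.2], proto))  # False:疑似噪声知识,过滤
```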

[NLP-42] Assessment of LLM Responses to End-user Security Questions

【速读】: 该论文试图解决大型语言模型(LLMs)在回答终端用户安全问题时的准确性和可靠性问题。解决方案的关键在于通过定性评估3个流行的LLMs在900个系统收集的终端用户安全问题上的表现,识别出模型在提供安全信息时的错误模式和局限性,包括过时和不准确的信息、间接或无响应的沟通风格等。基于这些发现,论文提出了模型改进的方向,并建议用户在与LLMs交互时采取特定的策略以提高获取安全信息的质量。

链接: https://arxiv.org/abs/2411.14571
作者: Vijay Prakash,Kevin Lee,Arkaprabha Bhattacharya,Danny Yuxing Huang,Jessica Staddon
关键词-EN: end user security, user security questions, Answering end user, end user, user security
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 18 pages, 1 figure, 8 tables

点击查看摘要

Abstract:Answering end user security questions is challenging. While large language models (LLMs) like GPT, LLAMA, and Gemini are far from error-free, they have shown promise in answering a variety of questions outside of security. We studied LLM performance in the area of end user security by qualitatively evaluating 3 popular LLMs on 900 systematically collected end user security questions. While LLMs demonstrate broad generalist "knowledge" of end user security information, there are patterns of errors and limitations across LLMs consisting of stale and inaccurate answers, and indirect or unresponsive communication styles, all of which impact the quality of information received. Based on these patterns, we suggest directions for model improvement and recommend user strategies for interacting with LLMs when seeking assistance with security.
zh

[NLP-43] Reducibility among NP-Hard graph problems and boundary classes

【速读】: 该论文试图解决的问题是:在图论中,某些NP-hard问题在特定类型的图上变得容易解决,而在其他情况下则仍然困难。论文通过引入“边界类”(boundary classes)的概念,研究了在何种最小子结构下问题仍然保持困难。解决方案的关键在于提出了一种方法,能够将一个NP-hard图问题的边界类转换为另一个NP-hard图问题的边界类。具体来说,如果两个NP-hard图问题 Π 和 Γ 满足 Π 可归约到 Γ,并且归约是双射的,且映射保持图的遗传性质,那么 Π 的边界类 X 在归约下的像就是 Γ 的边界类。这一结果揭示了边界类与归约性之间的关系,并通过应用于顶点覆盖、团、旅行商问题、有界度生成树、子图同构和团覆盖等具体问题,展示了其理论的实际应用价值。

链接: https://arxiv.org/abs/2411.14553
作者: Syed Mujtaba Hassan,Shahid Hussain,Abdul Samad
关键词-EN: boundary class, Gamma, boundary, class, classes
类目: Computational Complexity (cs.CC); Computation and Language (cs.CL); Discrete Mathematics (cs.DM)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Many NP-hard graph problems become easy for some classes of graphs; for example, coloring is easy for bipartite graphs but NP-hard in general. So we can ask questions like: when does a hard problem become easy? What is the minimum substructure for which the problem remains hard? We use the notion of boundary classes to study such questions. In this paper, we introduce a method for transforming the boundary class of one NP-hard graph problem into a boundary class for another problem. If Π and Γ are two NP-hard graph problems where Π is reducible to Γ, we transform a boundary class of Π into a boundary class of Γ. More formally, if Π is reducible to Γ, where the reduction is bijective and it maps hereditary classes of graphs to hereditary classes of graphs, then X is a boundary class of Π if and only if the image of X under the reduction is a boundary class of Γ. This gives us a relationship between boundary classes and reducibility among several NP-hard problems. To show the strength of our main result, we apply our theorem to obtain some previously unknown boundary classes for a few graph problems, namely: vertex-cover, clique, traveling-salesperson, bounded-degree-spanning-tree, subgraph-isomorphism and clique-cover.
zh

[NLP-44] An Experimental Study on Data Augmentation Techniques for Named Entity Recognition on Low-Resource Domains

【速读】: 该论文试图解决低资源领域(low-resource domains)中命名实体识别(Named Entity Recognition, NER)任务的数据稀缺问题。解决方案的关键在于采用数据增强技术(data augmentation techniques),特别是提及替换(Mention Replacement)和上下文词替换(Contextual Word Replacement),以生成额外的训练实例。研究评估了这些技术在两种广泛使用的NER模型(Bi-LSTM+CRF和BERT)上的效果,并探讨了不同训练子集大小和增强实例数量对性能的影响。结果表明,数据增强对小数据集特别有益,但不存在一个普遍最优的增强实例数量,NER实践者需要通过实验来微调其项目。

链接: https://arxiv.org/abs/2411.14551
作者: Arthur Elwing Torres,Edleno Silva de Moura,Altigran Soares da Silva,Mario A. Nascimento,Filipe Mesquita
关键词-EN: Named Entity Recognition, Named Entity, Entity Recognition, machine learning task, machine learning
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 21 pages, 2 figures

点击查看摘要

Abstract:Named Entity Recognition (NER) is a machine learning task that traditionally relies on supervised learning and annotated data. Acquiring such data is often a challenge, particularly in specialized fields like medical, legal, and financial sectors. Those are commonly referred to as low-resource domains, which comprise long-tail entities, due to the scarcity of available data. To address this, data augmentation techniques are increasingly being employed to generate additional training instances from the original dataset. In this study, we evaluate the effectiveness of two prominent text augmentation techniques, Mention Replacement and Contextual Word Replacement, on two widely-used NER models, Bi-LSTM+CRF and BERT. We conduct experiments on four datasets from low-resource domains, and we explore the impact of various combinations of training subset sizes and number of augmented examples. We not only confirm that data augmentation is particularly beneficial for smaller datasets, but we also demonstrate that there is no universally optimal number of augmented examples, i.e., NER practitioners must experiment with different quantities in order to fine-tune their projects.
zh
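论文评估的两种增强技术之一"提及替换"(Mention Replacement)可用几行代码示意:把句中的实体提及替换为同类型词典里的其他提及,生成新的训练实例(词典与实体类型均为虚构):

```python
# 示意:Mention Replacement 数据增强。spans 标注实体的 [start, end) 与类型;
# 多个实体时应按 start 降序处理,以免替换改变长度后下标失效。
import random

MENTIONS = {"DISEASE": ["diabetes", "asthma"], "DRUG": ["aspirin", "ibuprofen"]}

def mention_replace(tokens, spans, seed=0):
    rng = random.Random(seed)
    out = list(tokens)
    for start, end, etype in sorted(spans, reverse=True):
        original = " ".join(tokens[start:end])
        candidates = [m for m in MENTIONS[etype] if m != original]
        out[start:end] = rng.choice(candidates).split()
    return out

if __name__ == "__main__":
    sent = ["patient", "takes", "aspirin"]
    print(mention_replace(sent, [(2, 3, "DRUG")]))
    # ['patient', 'takes', 'ibuprofen']
```

与论文结论一致,这类增强对小数据集收益最大,而最优增强数量需逐项目实验确定。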

[NLP-45] Towards a Middleware for Large Language Models

【速读】: 该论文试图解决企业在集成大型语言模型(LLMs)时面临的独立部署挑战,特别是由于LLMs的复杂性和与现有系统的集成问题。解决方案的关键在于提出一种前瞻性的中间件系统架构,该架构旨在简化LLMs在企业中的部署和应用,甚至支持LLMs作为完整应用生态系统的入口,并在一定程度上吸收传统中间件的功能。这种架构的设计旨在满足企业对隐私、成本和定制化的需求,从而减少对主要云服务提供商的依赖。

链接: https://arxiv.org/abs/2411.14513
作者: Narcisa Guran,Florian Knauf,Man Ngo,Stefan Petrescu,Jan S. Rellermeyer
关键词-EN: Large language models, true artificial intelligence, process natural language, natural language inputs, gained widespread popularity
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models have gained widespread popularity for their ability to process natural language inputs and generate insights derived from their training data, nearing the qualities of true artificial intelligence. This advancement has prompted enterprises worldwide to integrate LLMs into their services. So far, this effort is dominated by commercial cloud-based solutions like OpenAI’s ChatGPT and Microsoft Azure. As the technology matures, however, there is a strong incentive for independence from major cloud providers through self-hosting “LLM as a Service”, driven by privacy, cost, and customization needs. In practice, hosting LLMs independently presents significant challenges due to their complexity and integration issues with existing systems. In this paper, we discuss our vision for a forward-looking middleware system architecture that facilitates the deployment and adoption of LLMs in enterprises, even for advanced use cases in which we foresee LLMs to serve as gateways to a complete application ecosystem and, to some degree, absorb functionality traditionally attributed to the middleware.
zh

[NLP-46] FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers

【速读】: 该论文试图解决生成式预训练变压器 (Generative Pre-trained Transformers, GPTs) 在通过结构化剪枝压缩模型时导致的性能下降问题。解决方案的关键在于提出了 FuseGPT 方法,通过回收被剪枝的变压器块来恢复模型性能。具体步骤包括:1) 引入新的重要性检测指标 宏观影响 (Macro Influence, MI),用于评估每个变压器块在移除后的长期信息损失;2) 提出 组级层融合 (group-level layers fusion),将不重要块中的参数注入到相邻块的对应层中,并通过轻量级组级微调进行迭代参数更新,其中注入的参数被冻结但通过可学习的秩分解矩阵加权,以减少微调时的开销。该方法不仅适用于大型语言模型,还适用于大型多模态模型,实验结果表明,FuseGPT 在困惑度和零样本任务性能上均优于先前的工作。

链接: https://arxiv.org/abs/2411.14507
作者: Zehua Pei,Hui-Ling Zhen,Xianzhi Yu,Sinno Jialin Pan,Mingxuan Yuan,Bei Yu
关键词-EN: Generative Pre-trained Transformers, Generative Pre-trained, demonstrated remarkable performance, Pre-trained Transformers, demonstrated remarkable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains through the extensive scaling of model parameters. Recent works observe the redundancy across the transformer blocks and develop compression methods by structured pruning of the unimportant blocks. However, such straightforward elimination will always provide irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology to recycle the pruned transformer blocks to further recover the model performance. Firstly we introduce a new importance detection metric, Macro Influence (MI), to detect the long-term influence of each transformer block by calculating their loss of information after removal. Then we propose group-level layers fusion, which adopts the parameters in layers of the unimportant blocks and injects them into the corresponding layers inside the neighboring blocks. The fusion is not one-off but through iterative parameter updates by lightweight group-level fine-tuning. Specifically, these injected parameters are frozen but weighted with learnable rank decomposition matrices to reduce the overhead during fine-tuning. Our approach not only works well on large language models but also on large multimodal models. The experiments have shown that, by using modest amounts of data, FuseGPT can outperform previous works in both perplexity and zero-shot task performance.
zh
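摘要提到注入参数"被冻结但由可学习的秩分解矩阵加权"。具体加权形式摘要并未给出,下面假设为逐元素加权:权重矩阵 M = B·A 取秩 1,融合结果 W_fused = W_host + M ⊙ W_inj,纯属示意而非论文实现:

```python
# 假设性草图:FuseGPT 式参数注入的一种可能形式。
# W_host 为邻近块对应层参数,W_inj 为被剪枝块的冻结参数,
# col/row 为可学习的秩 1 分解向量(B 与 A)。

def matmul_rank1(col, row):
    return [[c * r for r in row] for c in col]

def fuse(w_host, w_inj, col, row):
    mask = matmul_rank1(col, row)  # M = B @ A
    return [[h + m * i for h, m, i in zip(hr, mr, ir)]
            for hr, mr, ir in zip(w_host, mask, w_inj)]

if __name__ == "__main__":
    w_host = [[1.0, 1.0], [1.0, 1.0]]
    w_inj = [[2.0, 2.0], [2.0, 2.0]]
    print(fuse(w_host, w_inj, [0.5, 0.0], [1.0, 1.0]))
    # [[2.0, 2.0], [1.0, 1.0]]:仅第一行按权重 0.5 吸收注入参数
```

低秩分解的意义在于:组级微调阶段只需更新 B、A 两个小矩阵,开销远小于微调整层参数。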

[NLP-47] Exploring Accuracy-Fairness Trade-off in Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)中存在的偏见问题,并探讨如何在提高模型准确性的同时保持公平性。解决方案的关键在于将LLM的训练过程重新定义为多目标学习任务,并采用多目标进化学习(Multi-Objective Evolutionary Learning, MOEL)方法。通过MOEL框架,可以同时优化准确性和公平性指标,从而生成一组帕累托最优的LLMs。这种方法强调在设计和优化阶段考虑多个因素,以实现更公平和高效的AI技术。

链接: https://arxiv.org/abs/2411.14500
作者: Qingquan Zhang,Qiqi Duan,Bo Yuan,Yuhui Shi,Jialin Liu
关键词-EN: Large Language Models, Large Language, Language Models, influence human cognition, made significant strides
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 9 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have made significant strides in the field of artificial intelligence, showcasing their ability to interact with humans and influence human cognition through information dissemination. However, recent studies have brought to light instances of bias inherent within these LLMs, presenting a critical issue that demands attention. In our research, we delve deeper into the intricate challenge of harmonising accuracy and fairness in the enhancement of LLMs. While improving accuracy can indeed enhance overall LLM performance, it often occurs at the expense of fairness. Overemphasising optimisation of one metric invariably leads to a significant degradation of the other. This underscores the necessity of taking into account multiple considerations during the design and optimisation phases of LLMs. Therefore, we advocate for reformulating the LLM training process as a multi-objective learning task. Our investigation reveals that multi-objective evolutionary learning (MOEL) methodologies offer promising avenues for tackling this challenge. Our MOEL framework enables the simultaneous optimisation of both accuracy and fairness metrics, resulting in a Pareto-optimal set of LLMs. In summary, our study sheds valuable lights on the delicate equilibrium between accuracy and fairness within LLMs, which is increasingly significant for their real-world applications. By harnessing MOEL, we present a promising pathway towards fairer and more efficacious AI technologies.
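MOEL 框架的核心产物是"准确性-公平性"二维目标上的帕累托最优解集:不存在另一个候选模型在两个指标上都不差且至少一个严格更好。下面用纯 Python 演示非支配筛选的最简版本(两个目标均取"越大越好",数据为假设值,仅为示意):

```python
def pareto_front(points):
    """返回二维目标 (accuracy, fairness) 下的非支配解集, 两个目标均为越大越好。"""
    front = []
    for p in points:
        # p 被 q 支配: q 在两个目标上都不差, 且 q 与 p 不同
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# 假设 5 个候选 LLM 的 (准确性, 公平性) 评分
models = [(0.90, 0.60), (0.85, 0.75), (0.80, 0.80), (0.70, 0.70), (0.95, 0.50)]
print(pareto_front(models))  # (0.70, 0.70) 被 (0.85, 0.75) 支配而出局
```

筛出的解集体现了摘要所说的权衡:没有任何一个模型同时在两个指标上最优,用户需按应用场景在前沿上取舍。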
zh

[NLP-48] Understanding World or Predicting Future? A Comprehensive Survey of World Models

【速读】: 该论文试图解决的问题是如何系统地理解和分类世界模型(world models),特别是在生成式 AI 和视频生成模型等技术进步的背景下。解决方案的关键在于对世界模型的两大主要功能进行系统分类:一是构建内部表征以理解世界机制,二是预测未来状态以模拟和指导决策。论文通过详细探讨这两大功能在不同领域(如自动驾驶、机器人和社交模拟)中的应用,揭示了世界模型在实现人工通用智能(artificial general intelligence)中的核心作用,并指出了当前面临的挑战和未来研究方向。

链接: https://arxiv.org/abs/2411.14499
作者: Jingtao Ding,Yunke Zhang,Yu Shang,Yuheng Zhang,Zefang Zong,Jie Feng,Yuan Yuan,Hongyuan Su,Nian Li,Nicholas Sukiennik,Fengli Xu,Yong Li
关键词-EN: artificial general intelligence, garnered significant attention, significant attention due, multimodal large language, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions.
zh

[NLP-49] Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning


【速读】: 该论文试图解决大规模语言模型(LLMs)在下游任务中的有效性高度依赖于指令调优数据质量的问题。由于收集高质量和多样化的训练数据既昂贵又耗时,论文提出了一种名为Star-Agents的新框架,通过多代理协作和评估自动化地提升数据质量。解决方案的关键在于采用三管齐下的策略:首先,通过定制的采样方法生成多样化的指令数据;其次,使用双模型方法对生成的数据进行严格的难度和质量评估;最后,在动态优化阶段,优先选择更有效的LLMs,从而提升整体数据质量。实证研究表明,该框架在优化数据集方面取得了显著成效,平均提升12%,并在特定指标上如Fermi提升40%。

链接: https://arxiv.org/abs/2411.14497
作者: Hang Zhou,Yehui Tang,Haochen Qin,Yujie Yang,Renren Jin,Deyi Xiong,Kai Han,Yunhe Wang
关键词-EN: large language models, efficacy of large, large language, downstream tasks, tasks usually hinges
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The efficacy of large language models (LLMs) on downstream tasks usually hinges on instruction tuning, which relies critically on the quality of training data. Unfortunately, collecting high-quality and diverse data is both expensive and time-consuming. To mitigate this issue, we propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets through multi-agent collaboration and assessment. The framework adopts a three-pronged strategy. It initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. Subsequently, the generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality. Finally, the above process evolves in a dynamic refinement phase, where more effective LLMs are prioritized, enhancing the overall data quality. Our empirical studies, including instruction tuning experiments with models such as Pythia and LLaMA, demonstrate the effectiveness of the proposed framework. Optimized datasets have achieved substantial improvements, with an average increase of 12% and notable gains in specific metrics, such as a 40% improvement in Fermi, as evidenced by benchmarks like MT-bench, Vicuna bench, and WizardLM testset.
zh

[NLP-50] From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language

【速读】: 该论文试图解决资源匮乏语言(如乌尔都语)在自动语音识别(ASR)技术中的应用问题。解决方案的关键在于利用现代技术,分析现有数据集,并评估有效的算法和工具,以应对乌尔都语处理中的独特挑战,并探索其在语音研究领域中的潜在机会。

链接: https://arxiv.org/abs/2411.14493
作者: Muhammad Sharif,Zeeshan Abbas,Jiangyan Yi,Chenglin Liu
关键词-EN: Automatic Speech Recognition, revolutionizing human-computer interactions, witnessed significant advancements, Automatic Speech, Speech Recognition
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to SN Computer Science

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) technology has witnessed significant advancements in recent years, revolutionizing human-computer interactions. While major languages have benefited from these developments, lesser-resourced languages like Urdu face unique challenges. This paper provides an extensive exploration of the dynamic landscape of ASR research, focusing particularly on the resource-constrained Urdu language, which is widely spoken across South Asian nations. It outlines current research trends, technological advancements, and potential directions for future studies in Urdu ASR, aiming to pave the way for forthcoming researchers interested in this domain. By leveraging contemporary technologies, analyzing existing datasets, and evaluating effective algorithms and tools, the paper seeks to shed light on the unique challenges and opportunities associated with Urdu language processing and its integration into the broader field of speech research.
zh

[NLP-51] A Survey on Human-Centric LLMs

【速读】: 该论文试图解决的问题是如何评估和应用基于大型语言模型(LLMs)的框架和工具在模拟人类认知、决策和社会互动方面的能力。解决方案的关键在于全面评估LLMs在个体任务(如单个LLM替代单个人类)和集体任务(如多个LLM协同模拟群体动态)中的表现,特别是在推理、感知和社会认知等关键领域的性能,并探索其在行为科学、政治学和社会学等人文领域的实际应用。此外,论文还强调了提升LLM的适应性、情感智能和文化敏感性,以及解决其固有偏见和增强人机协作框架的重要性。

链接: https://arxiv.org/abs/2411.14491
作者: Jing Yi Wang,Nicholas Sukiennik,Tong Li,Weikang Su,Qianyue Hao,Jingbo Xu,Zihan Huang,Fengli Xu,Yong Li
关键词-EN: large language models, perform tasks traditionally, tasks traditionally performed, simulate human cognition, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid evolution of large language models (LLMs) and their capacity to simulate human cognition and behavior has given rise to LLM-based frameworks and tools that are evaluated and applied based on their ability to perform tasks traditionally performed by humans, namely those involving cognition, decision-making, and social interaction. This survey provides a comprehensive examination of such human-centric LLM capabilities, focusing on their performance in both individual tasks (where an LLM acts as a stand-in for a single human) and collective tasks (where multiple LLMs coordinate to mimic group dynamics). We first evaluate LLM competencies across key areas including reasoning, perception, and social cognition, comparing their abilities to human-like skills. Then, we explore real-world applications of LLMs in human-centric domains such as behavioral science, political science, and sociology, assessing their effectiveness in replicating human behaviors and interactions. Finally, we identify challenges and future research directions, such as improving LLM adaptability, emotional intelligence, and cultural sensitivity, while addressing inherent biases and enhancing frameworks for human-AI collaboration. This survey aims to provide a foundational understanding of LLMs from a human-centric perspective, offering insights into their current capabilities and potential for future development.
zh

[NLP-52] GhostRNN: Reducing State Redundancy in RNN with Cheap Operations

【速读】: 该论文试图解决在低资源设备上运行循环神经网络(Recurrent Neural Network, RNN)时面临的计算和内存限制问题。解决方案的关键在于提出了一种名为GhostRNN的高效RNN架构,通过减少隐藏状态的冗余来降低计算成本。具体来说,GhostRNN通过识别和利用训练好的RNN模型中隐藏状态的相似性,首先生成少量内在状态(intrinsic states),然后通过廉价的操作(cheap operations)生成基于这些内在状态的幽灵状态(ghost states),从而显著减少了内存使用(约40%)和计算成本,同时保持了任务性能。

链接: https://arxiv.org/abs/2411.14489
作者: Hang Zhou,Xiaoxu Zheng,Yunhe Wang,Michael Bi Mi,Deyi Xiong,Kai Han
关键词-EN: Recurrent neural network, modeling long-distance dependencies, Recurrent neural, keyword spotting, speech enhancement
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recurrent neural network (RNNs) that are capable of modeling long-distance dependencies are widely used in various speech tasks, eg., keyword spotting (KWS) and speech enhancement (SE). Due to the limitation of power and memory in low-resource devices, efficient RNN models are urgently required for real-world applications. In this paper, we propose an efficient RNN architecture, GhostRNN, which reduces hidden state redundancy with cheap operations. In particular, we observe that partial dimensions of hidden states are similar to the others in trained RNN models, suggesting that redundancy exists in specific RNNs. To reduce the redundancy and hence computational cost, we propose to first generate a few intrinsic states, and then apply cheap operations to produce ghost states based on the intrinsic states. Experiments on KWS and SE tasks demonstrate that the proposed GhostRNN significantly reduces the memory usage (~40%) and computation cost while keeping performance similar.
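GhostRNN 的核心想法是:先计算少量"内在状态"(intrinsic states),再用廉价操作(如逐元素仿射变换)扩展出"幽灵状态"(ghost states)拼接成完整隐藏状态,从而省去大部分矩阵乘法。下面是脱离具体 RNN 结构的最简数值示意(假设性代码,系数固定仅为演示,非论文官方实现):

```python
import numpy as np

def ghost_expand(intrinsic, ratio=2):
    """由内在状态经逐元素仿射变换(廉价操作)生成幽灵状态并拼接。"""
    ghosts = []
    for k in range(1, ratio):
        scale = 0.5 * k              # 示意用的固定系数; 实际中通常为可学习参数
        ghosts.append(scale * intrinsic + 0.1)
    return np.concatenate([intrinsic] + ghosts, axis=-1)

# 目标隐藏维度 128: 只需真正计算 64 维内在状态, 其余 64 维由廉价操作生成
rng = np.random.default_rng(0)
intrinsic = rng.standard_normal((1, 64))
hidden = ghost_expand(intrinsic, ratio=2)
print(hidden.shape)  # → (1, 128)
```

由于内在状态维度只有目标维度的 1/ratio,与隐藏状态相乘的权重矩阵随之缩小,这正是摘要所述内存占用下降约 40% 的来源。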
zh

[NLP-53] Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine

【速读】: 该论文试图解决大型语言模型(LLMs)在医疗应用中的安全性和可信度问题。解决方案的关键在于提出了五个核心原则(Truthfulness, Resilience, Fairness, Robustness, Privacy)和十个具体方面,并引入了一个名为MedGuard的新基准,包含1,000个专家验证的问题。通过评估11种常用LLMs的表现,研究发现当前的LLMs在大多数基准测试中表现不佳,尤其是在与人类医生的高表现相比时。尽管有报告称像ChatGPT这样的先进LLMs在某些医疗任务中可以匹配甚至超越人类表现,但该研究强调了显著的安全差距,突显了人类监督和实施AI安全防护措施的迫切需求。

链接: https://arxiv.org/abs/2411.14487
作者: Yifan Yang,Qiao Jin,Robert Leaman,Xiaoyu Liu,Guangzhi Xiong,Maame Sarfo-Gyamfi,Changlin Gong,Santiago Ferrière-Steinert,W. John Wilbur,Xiaojun Li,Jiaxin Yuan,Bang An,Kelvin S. Castro,Francisco Erramuspe Álvarez,Matías Stockle,Aidong Zhang,Furong Huang,Zhiyong Lu
关键词-EN: Large Language Models, real-world healthcare applications, capabilities of Large, Large Language, make them increasingly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The remarkable capabilities of Large Language Models (LLMs) make them increasingly compelling for adoption in real-world healthcare applications. However, the risks associated with using LLMs in medical applications have not been systematically characterized. We propose using five key principles for safe and trustworthy medical AI: Truthfulness, Resilience, Fairness, Robustness, and Privacy, along with ten specific aspects. Under this comprehensive framework, we introduce a novel MedGuard benchmark with 1,000 expert-verified questions. Our evaluation of 11 commonly used LLMs shows that the current language models, regardless of their safety alignment mechanisms, generally perform poorly on most of our benchmarks, particularly when compared to the high performance of human physicians. Although recent reports indicate that advanced LLMs like ChatGPT can match or even exceed human performance in various medical tasks, this study underscores a significant safety gap, highlighting the crucial need for human oversight and the implementation of AI safety guardrails.
zh

[NLP-54] The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz

【速读】: 该论文试图解决大语言模型(LLMs)在面对675个根本无解的问题时,能否准确识别并承认其不确定性,而非生成看似合理但实际上错误的答案。解决方案的关键在于引入了一个新的评估框架,通过精心挑选的研究生水平的大挑战问题数据集,评估了12个最先进的LLMs,包括开源和闭源模型,考察它们在不同领域(如生物学、哲学和数学)中承认问题无解的能力。研究发现,最佳模型在承认不确定性方面的准确率在62-68%之间,且随着问题难度的增加,GPT-4等模型承认不确定性的比例更高,表明模型在问题看似更易解决时更倾向于生成推测性答案。此外,研究还揭示了不同问题类别间模型表现的显著差异,特别是在发明和NP-hard问题上,模型承认不确定性的能力较弱。这些结果强调了不确定性识别作为未来机器智能评估的关键组成部分,并为改进模型训练架构和评估方法提供了新的方向。

链接: https://arxiv.org/abs/2411.14486
作者: David Noever,Forrest McKee
关键词-EN: large language models’, assess large language, fundamentally unsolvable problems, language models’, fundamentally unsolvable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This research introduces a novel evaluation framework designed to assess large language models’ (LLMs) ability to acknowledge uncertainty on 675 fundamentally unsolvable problems. Using a curated dataset of graduate-level grand challenge questions with intentionally unknowable answers, we evaluated twelve state-of-the-art LLMs, including both open and closed-source models, on their propensity to admit ignorance rather than generate plausible but incorrect responses. The best models scored in 62-68% accuracy ranges for admitting the problem solution was unknown in fields ranging from biology to philosophy and mathematics. We observed an inverse relationship between problem difficulty and model accuracy, with GPT-4 demonstrating higher rates of uncertainty acknowledgment on more challenging problems (35.8%) compared to simpler ones (20.0%). This pattern indicates that models may be more prone to generate speculative answers when problems appear more tractable. The study also revealed significant variations across problem categories, with models showing difficulty in acknowledging uncertainty in invention and NP-hard problems while performing relatively better on philosophical and psychological challenges. These results contribute to the growing body of research on artificial general intelligence (AGI) assessment by highlighting the importance of uncertainty recognition as a critical component of future machine intelligence evaluation. This impossibility test thus extends previous theoretical frameworks for universal intelligence testing by providing empirical evidence of current limitations in LLMs’ ability to recognize their own knowledge boundaries, suggesting new directions for improving model training architectures and evaluation approaches.
zh

[NLP-55] Mediating Modes of Thought: LLMs for design scripting

【速读】: 该论文试图解决设计领域中设计师自由思维与算法刚性之间的断层问题,特别是在使用视觉脚本和参数化设计工具时。解决方案的关键在于利用大型语言模型(LLMs)来中介用户意图与算法之间的交互。通过配置多层LLM代理,系统能够从自然语言提示中推断用户意图,并将其转化为几何操作序列,最终映射到软件特定的命令,从而在用户的视觉编程界面中生成完整的脚本。尽管系统在一定复杂度内成功生成脚本,但仍存在复杂度阈值的限制。该研究展示了LLMs如何使设计脚本更符合人类创造力和思维方式,并建议未来研究应探索对话交互、多模态输入输出以及工具性能评估。

链接: https://arxiv.org/abs/2411.14485
作者: Moritz Rietschel,Fang Guo,Kyle Steinfeld
关键词-EN: visual scripting, parametric design tools, Large Language Models, design scripting, user intent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Published at ACADIA 2024

点击查看摘要

Abstract:Architects adopt visual scripting and parametric design tools to explore more expansive design spaces (Coates, 2010), refine their thinking about the geometric logic of their design (Woodbury, 2010), and overcome conventional software limitations (Burry, 2011). Despite two decades of effort to make design scripting more accessible, a disconnect between a designer’s free ways of thinking and the rigidity of algorithms remains (Burry, 2011). Recent developments in Large Language Models (LLMs) suggest this might soon change, as LLMs encode a general understanding of human context and exhibit the capacity to produce geometric logic. This project speculates that if LLMs can effectively mediate between user intent and algorithms, they become a powerful tool to make scripting in design more widespread and fun. We explore if such systems can interpret natural language prompts to assemble geometric operations relevant to computational design scripting. In the system, multiple layers of LLM agents are configured with specific context to infer the user intent and construct a sequential logic. Given a user’s high-level text prompt, a geometric description is created, distilled into a sequence of logic operations, and mapped to software-specific commands. The completed script is constructed in the user’s visual programming interface. The system succeeds in generating complete visual scripts up to a certain complexity but fails beyond this complexity threshold. It shows how LLMs can make design scripting much more aligned with human creativity and thought. Future research should explore conversational interactions, expand to multimodal inputs and outputs, and assess the performance of these tools.
zh

[NLP-56] Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach

【速读】: 该论文试图解决大型语言模型 (LLM) 在规划和调度任务中的性能不稳定和不可预测性问题。解决方案的关键在于引入复合 LLM 架构——LLM-Modulo 框架,其中 LLM 与一组完备的验证器 (verifiers) 配对,以验证其输出并重新提示 (re-prompting) 模型,确保系统不会输出任何错误结果。这种架构确保了每个输出都是正确的,这是之前技术无法保证的。

链接: https://arxiv.org/abs/2411.14484
作者: Atharva Gundawar,Karthik Valmeekam,Mudit Verma,Subbarao Kambhampati
关键词-EN: boost Large Language, Large Language Model, Large Language, boost Large, Language Model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Previous work has attempted to boost Large Language Model (LLM) performance on planning and scheduling tasks through a variety of prompt engineering techniques. While these methods can work within the distributions tested, they are neither robust nor predictable. This limitation can be addressed through compound LLM architectures where LLMs work in conjunction with other components to ensure reliability. In this paper, we present a technical evaluation of a compound LLM architecture–the LLM-Modulo framework. In this framework, an LLM is paired with a complete set of sound verifiers that validate its output, re-prompting it if it fails. This approach ensures that the system can never output any fallacious output, and therefore that every output generated is guaranteed correct–something previous techniques have not been able to claim. Our results, evaluated across four scheduling domains, demonstrate significant performance gains with the LLM-Modulo framework using various models. Additionally, we explore modifications to the base configuration of the framework and assess their impact on overall system performance.
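LLM-Modulo 框架的流程可以概括为"生成-校验-重提示"循环:LLM 产生候选计划,一组完备的验证器逐一检查,任一验证失败就把错误信息拼回提示再生成,只有通过全部验证器的输出才会被返回。下面用可运行的桩函数演示这一控制流("LLM"与验证器均为假设的占位实现,非论文代码):

```python
def llm_modulo(generate, verifiers, prompt, max_rounds=5):
    """生成-校验-重提示循环: 只有通过全部验证器的候选才会被输出。"""
    for _ in range(max_rounds):
        candidate = generate(prompt)
        errors = [msg for v in verifiers if (msg := v(candidate))]
        if not errors:
            return candidate            # 所有验证器通过, 保证输出正确
        prompt = prompt + " | feedback: " + "; ".join(errors)
    return None                         # 达到轮数上限仍未通过

# 占位 "LLM": 前两轮输出错误计划, 第三轮才给出合法计划
attempts = iter(["plan-A", "plan-B", "plan-OK"])
fake_llm = lambda prompt: next(attempts)
# 占位验证器: 返回 None 表示通过, 否则返回可读的错误信息
check_valid = lambda plan: None if plan == "plan-OK" else f"{plan} violates constraint"

result = llm_modulo(fake_llm, [check_valid], "schedule 3 jobs")
print(result)  # → plan-OK
```

这一结构解释了摘要中"系统永远不会输出谬误结果"的保证:正确性来自验证器的完备性,而非 LLM 本身。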
zh

[NLP-57] Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat

【速读】: 该论文试图解决在选择大型语言模型(LLM)时面临的复杂挑战,特别是如何通过成对排序(Pairwise Ranking)方法有效地评估人类对不同LLM输出的偏好。解决方案的关键在于:1) 正式定义了一套用于有效排序的基本原则;2) 通过一系列广泛的评估,研究了几种排序算法在LLM评估中的鲁棒性;3) 揭示了影响排序准确性和效率的关键因素,并提供了根据特定评估环境和资源限制选择最合适方法的指南。

链接: https://arxiv.org/abs/2411.14483
作者: Roland Daynauth,Christopher Clarke,Krisztian Flautner,Lingjia Tang,Jason Mars
关键词-EN: Deciding which large, large language model, large language, Deciding, language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deciding which large language model (LLM) to use is a complex challenge. Pairwise ranking has emerged as a new method for evaluating human preferences for LLMs. This approach entails humans evaluating pairs of model outputs based on a predefined criterion. By collecting these comparisons, a ranking can be constructed using methods such as Elo. However, applying these algorithms as constructed in the context of LLM evaluation introduces several challenges. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct a series of extensive evaluations on the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
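摘要提到用 Elo 等算法把成对比较汇总为排名。标准 Elo 的更新规则为 R' = R + K·(S − E),其中 A 对 B 的期望胜率 E = 1/(1 + 10^((R_B − R_A)/400))。下面是该公式的最小实现(示意性质,K 值等配置为常见默认,与论文评估的具体设置无关):

```python
def expected(ra, rb):
    """A 对 B 的期望胜率。"""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def elo_update(ra, rb, score_a, k=32):
    """score_a: A 胜=1, 平=0.5, 负=0。返回双方更新后的评分。"""
    ea = expected(ra, rb)
    return ra + k * (score_a - ea), rb + k * ((1 - score_a) - (1 - ea))

# 两个模型初始同分, 模型 A 在一次成对比较(人工偏好标注)中胜出
ra, rb = elo_update(1500, 1500, 1.0)
print(round(ra), round(rb))  # → 1516 1484
```

论文讨论的正是此类规则在 LLM 评估语境下的隐患,例如结果对比较顺序敏感;上面的单步更新即可看出早期对局对评分的影响被 K 系数直接放大。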
zh

[NLP-58] GRL-Prompt: Towards Knowledge Graph based Prompt Optimization via Reinforcement Learning

【速读】: 该论文试图解决大语言模型(LLMs)在自然语言处理(NLP)任务中,手动进行提示工程(prompt engineering)效率低下且难以找到最优提示的问题。解决方案的关键在于提出了一个名为GRL-Prompt的新型LLMs无关框架,通过强化学习(RL)来自动构建最优提示。具体来说,GRL-Prompt利用知识图谱(KG)来结构化表示动作/状态,以更好地编码用户查询与候选上下文示例之间的关联,并通过策略网络生成最优动作,按奖励顺序选择一组上下文示例来构建提示。此外,嵌入式奖励塑造(embedding-based reward shaping)用于稳定RL训练过程。实验结果表明,GRL-Prompt在多个评估指标上优于现有最先进方法。

链接: https://arxiv.org/abs/2411.14479
作者: Yuze Liu,Tingjie Liu,Tiehua Zhang,Youhua Xia,Jinze Wang,Zhishu Shen,Jiong Jin,Fei Richard Yu
关键词-EN: Large language models, natural language processing, demonstrated impressive success, Large language, extensive general knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive success in a wide range of natural language processing (NLP) tasks due to their extensive general knowledge of the world. Recent works discovered that the performance of LLMs is heavily dependent on the input prompt. However, prompt engineering is usually done manually in a trial-and-error fashion, which can be labor-intensive and challenging in order to find the optimal prompts. To address these problems and unleash the utmost potential of LLMs, we propose a novel LLMs-agnostic framework for prompt optimization, namely GRL-Prompt, which aims to automatically construct optimal prompts via reinforcement learning (RL) in an end-to-end manner. To provide structured action/state representation for optimizing prompts, we construct a knowledge graph (KG) that better encodes the correlation between the user query and candidate in-context examples. Furthermore, a policy network is formulated to generate the optimal action by selecting a set of in-context examples in a rewardable order to construct the prompt. Additionally, the embedding-based reward shaping is utilized to stabilize the RL training process. The experimental results show that GRL-Prompt outperforms recent state-of-the-art methods, achieving an average increase of 0.10 in ROUGE-1, 0.07 in ROUGE-2, 0.07 in ROUGE-L, and 0.05 in BLEU.
zh

[NLP-59] StreetViewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model

【速读】: 该论文试图解决传统机器学习方法在处理非结构化或多模态数据(如街景图像)时面临的局限性问题。解决方案的关键在于提出了StreetViewLLM框架,该框架通过将大型语言模型与链式思维推理和多模态数据源(包括街景图像、地理坐标和文本数据)相结合,显著提升了地理空间预测的精度和粒度。利用检索增强生成技术,StreetViewLLM增强了地理信息的提取能力,能够对城市环境进行详细分析。该模型在七个全球城市(包括香港、东京、新加坡、洛杉矶、纽约、伦敦和巴黎)的应用中,展示了在预测城市指标(如人口密度、医疗设施可达性、归一化植被指数、建筑高度和非渗透表面)方面的优越性能,并持续优于基线模型,提供了更高的预测准确性和对城市环境的深入洞察。

链接: https://arxiv.org/abs/2411.14476
作者: Zongrong Li,Junhao Xu,Siqin Wang,Yifan Wu,Haiyang Li
关键词-EN: public health, crucial for diverse, diverse fields, street view imagery, Geospatial predictions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health. Traditional machine learning methods often face limitations when handling unstructured or multi-modal data like street view imagery. To address these challenges, we propose StreetViewLLM, a novel framework that integrates a large language model with the chain-of-thought reasoning and multimodal data sources. By combining street view imagery with geographic coordinates and textual data, StreetViewLLM improves the precision and granularity of geospatial predictions. Using retrieval-augmented generation techniques, our approach enhances geographic information extraction, enabling a detailed analysis of urban environments. The model has been applied to seven global cities, including Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris, demonstrating superior performance in predicting urban indicators, including population density, accessibility to healthcare, normalized difference vegetation index, building height, and impervious surface. The results show that StreetViewLLM consistently outperforms baseline models, offering improved predictive accuracy and deeper insights into the built environment. This research opens new opportunities for integrating the large language model into urban analytics, decision-making in urban planning, infrastructure management, and environmental monitoring.
zh

[NLP-60] Large Language Model for Qualitative Research – A Systematic Mapping Study ICSE

【速读】: 该论文试图解决传统定性分析方法在处理快速增长的文本数据时面临的效率低下和主观性强的问题。解决方案的关键在于利用大型语言模型 (LLMs) 和生成式 AI (Generative AI) 来自动化和增强定性分析过程。论文系统地梳理了 LLMs 在定性研究中的应用,探讨了其应用场景、配置、方法论和评估指标,并指出 LLMs 在自动化传统需要大量人工输入的过程中展示了巨大潜力。然而,依赖提示工程、偶尔的不准确性和上下文限制等挑战仍然存在。论文强调了将 LLMs 与人类专业知识结合、提高模型鲁棒性和改进评估方法的机会,旨在指导未来在定性分析中应用 LLMs 的创新方向。

链接: https://arxiv.org/abs/2411.14473
作者: Cauã Ferreira Barros,Bruna Borges Azevedo,Valdemar Vicente Graciano Neto,Mohamad Kassab,Marcos Kalinowski,Hugo Alexandre D. do Nascimento,Michelle C.G.S.P. Bandeira
关键词-EN: qualitative analysis methods, traditional qualitative analysis, prone to subjectivity, qualitative analysis, exponential growth
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, includes 1 figures and 3 tables. Submitted to the WSESE 2025 ICSE Workshop

点击查看摘要

Abstract:The exponential growth of text-based data in domains such as healthcare, education, and social sciences has outpaced the capacity of traditional qualitative analysis methods, which are time-intensive and prone to subjectivity. Large Language Models (LLMs), powered by advanced generative AI, have emerged as transformative tools capable of automating and enhancing qualitative analysis. This study systematically maps the literature on the use of LLMs for qualitative research, exploring their application contexts, configurations, methodologies, and evaluation metrics. Findings reveal that LLMs are utilized across diverse fields, demonstrating the potential to automate processes traditionally requiring extensive human input. However, challenges such as reliance on prompt engineering, occasional inaccuracies, and contextual limitations remain significant barriers. This research highlights opportunities for integrating LLMs with human expertise, improving model robustness, and refining evaluation methodologies. By synthesizing trends and identifying research gaps, this study aims to guide future innovations in the application of LLMs for qualitative analysis.
zh

[NLP-61] Exploring the Potential Role of Generative AI in the TRAPD Procedure for Survey Translation

【速读】: 该论文试图解决多语言和文化背景下调查问卷翻译过程中可能出现的错误问题。解决方案的关键在于利用生成式 AI (Generative AI) 如 ChatGPT 进行零样本提示实验,以识别和反馈可能影响翻译质量的问题,包括常见源语言特征、概念不一致性、敏感性和正式性问题以及不存在概念等。通过这种方式,生成式 AI 能够提供有意义的翻译反馈,从而帮助减少翻译错误,特别是在资源有限的情况下。此外,论文还探讨了该方法的实际可行性,包括软件访问、成本和计算时间,并提出了未来将 AI 整合到调查问卷翻译实践中的研究方向。

链接: https://arxiv.org/abs/2411.14472
作者: Erica Ann Metheney,Lauren Yehle
关键词-EN: translating survey instruments, assist in translating, survey instruments, translating survey, paper explores
类目: Computation and Language (cs.CL); Applications (stat.AP); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:This paper explores and assesses in what ways generative AI can assist in translating survey instruments. Writing effective survey questions is a challenging and complex task, made even more difficult for surveys that will be translated and deployed in multiple linguistic and cultural settings. Translation errors can be detrimental, with known errors rendering data unusable for its intended purpose and undetected errors leading to incorrect conclusions. A growing number of institutions face this problem as surveys deployed by private and academic organizations globalize, and the success of their current efforts depends heavily on researchers’ and translators’ expertise and the amount of time each party has to contribute to the task. Thus, multilinguistic and multicultural surveys produced by teams with limited expertise, budgets, or time are at significant risk for translation-based errors in their data. We implement a zero-shot prompt experiment using ChatGPT to explore generative AI’s ability to identify features of questions that might be difficult to translate to a linguistic audience other than the source language. We find that ChatGPT can provide meaningful feedback on translation issues, including common source survey language, inconsistent conceptualization, sensitivity and formality issues, and nonexistent concepts. In addition, we provide detailed information on the practicality of the approach, including accessing the necessary software, associated costs, and computational run times. Lastly, based on our findings, we propose avenues for future research that integrate AI into survey translation practices.
zh

[NLP-62] Popular LLM s Amplify Race and Gender Disparities in Human Mobility

【速读】: 该论文试图解决的问题是大型语言模型(LLMs)在预测人类移动性时是否存在基于种族和性别的偏见,并探讨这些偏见如何反映和放大现有的社会偏见。解决方案的关键在于通过分析GPT-4、Gemini和Claude这三个知名LLMs对包含和不包含明确人口统计细节的名称提示的响应,评估它们在预测兴趣点(POIs)访问时的偏见表现。研究发现,LLMs在预测少数族裔和女性个体的POIs访问时存在显著的偏见,表现为这些群体与财富相关POIs的关联度较低,女性个体与职业相关POIs的关联度也低于男性。这些发现表明,LLMs不仅反映了社会偏见,还可能加剧这些偏见。

链接: https://arxiv.org/abs/2411.14469
作者: Xinhua Wu,Qi R. Wang
关键词-EN: large language models, influencing societal outcomes, areas influencing societal, language models, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly applied in areas influencing societal outcomes, it is critical to understand their tendency to perpetuate and amplify biases. This study investigates whether LLMs exhibit biases in predicting human mobility – a fundamental human behavior – based on race and gender. Using three prominent LLMs – GPT-4, Gemini, and Claude – we analyzed their predictions of visitations to points of interest (POIs) for individuals, relying on prompts that included names with and without explicit demographic details. We find that LLMs frequently reflect and amplify existing societal biases. Specifically, predictions for minority groups were disproportionately skewed, with these individuals being significantly less likely to be associated with wealth-related points of interest (POIs). Gender biases were also evident, as female individuals were consistently linked to fewer career-related POIs compared to their male counterparts. These biased associations suggest that LLMs not only mirror but also exacerbate societal stereotypes, particularly in contexts involving race and gender.
zh

[NLP-63] Learning to Ask: Conversational Product Search via Representation Learning

【速读】: 该论文试图解决在线购物平台中传统产品搜索方法的局限性,特别是对话式产品搜索中用户、查询、产品和对话之间的语义不匹配问题。解决方案的关键在于提出了一种新的对话式产品搜索模型,称为ConvPS。该模型通过一个统一的生成式框架,联合学习用户、查询、产品和对话的语义表示,并将这些表示集成到潜在语义空间中以检索目标产品。此外,模型还采用了一组贪婪和探索-利用策略,以学习向用户提出一系列高性能的问题,从而优化对话过程。ConvPS模型能够自然地将用户、查询、产品和对话的表示学习整合到一个统一的生成式框架中,为构建准确、稳健且灵活适应的对话式产品搜索系统提供了新的途径。实验结果表明,ConvPS模型显著优于现有的最先进基线模型。

链接: https://arxiv.org/abs/2411.14466
作者: Jie Zou,Jimmy Xiangji Huang,Zhaochun Ren,Evangelos Kanoulas
关键词-EN: Online shopping platforms, helping customers purchase, Amazon and AliExpress, purchase products conveniently, conversational product search
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by ACM TOIS

点击查看摘要

Abstract:Online shopping platforms, such as Amazon and AliExpress, are increasingly prevalent in society, helping customers purchase products conveniently. With recent progress in natural language processing, researchers and practitioners shift their focus from traditional product search to conversational product search. Conversational product search enables user-machine conversations and through them collects explicit user feedback that allows to actively clarify the users’ product preferences. Therefore, prospective research on an intelligent shopping assistant via conversations is indispensable. Existing publications on conversational product search either model conversations independently from users, queries, and products or lead to a vocabulary mismatch. In this work, we propose a new conversational product search model, ConvPS, to assist users in locating desirable items. The model is first trained to jointly learn the semantic representations of user, query, item, and conversation via a unified generative framework. After learning these representations, they are integrated to retrieve the target items in the latent semantic space. Meanwhile, we propose a set of greedy and explore-exploit strategies to learn to ask the user a sequence of high-performance questions for conversations. Our proposed ConvPS model can naturally integrate the representation learning of the user, query, item, and conversation into a unified generative framework, which provides a promising avenue for constructing accurate and robust conversational product search systems that are flexible and adaptive. Experimental results demonstrate that our ConvPS model significantly outperforms state-of-the-art baselines.
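The explore-exploit question selection mentioned above can be illustrated with an epsilon-greedy rule. The value estimates below are placeholders, not the paper's learned representation scores:

```python
import random

# Minimal epsilon-greedy sketch of the "learn to ask" idea: pick the
# next clarifying question greedily (highest estimated value) most of
# the time, and explore uniformly otherwise.

def choose_question(value_estimates, epsilon=0.1, rng=random):
    """Return the index of the next question to ask."""
    if rng.random() < epsilon:               # explore
        return rng.randrange(len(value_estimates))
    return max(range(len(value_estimates)),  # exploit
               key=lambda i: value_estimates[i])

values = [0.2, 0.7, 0.4]
rng = random.Random(0)
picks = [choose_question(values, epsilon=0.1, rng=rng) for _ in range(100)]
print(picks.count(1), "greedy picks out of 100")
```

With epsilon = 0.1 the highest-valued question dominates, while occasional exploration lets the system discover questions whose value was underestimated.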
zh

[NLP-64] Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)在回答多选物理问题时如何评估其预测的确定性与准确性之间的关系问题。解决方案的关键在于引入了一种分析方法,用于评估开源LLMs以及gpt-3.5 Turbo在多选物理问卷中的表现,并重点研究了答案准确性与话题变异性之间的关系。研究发现,大多数模型在它们确定的情况下提供准确的回答,但这并非普遍行为。准确性与不确定性之间的关系呈现出广泛的横向钟形分布,且随着问题对逻辑推理要求的增加,这种不对称性加剧,而在知识检索任务中,这种关系保持尖锐。

链接: https://arxiv.org/abs/2411.14465
作者: Elizaveta Reganova,Peter Steinbach
关键词-EN: Large Language Models, Large Language, gained significant popularity, Language Models, gained significant
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have gained significant popularity in recent years for their ability to answer questions in various fields. However, these models have a tendency to “hallucinate” their responses, making it challenging to evaluate their performance. A major challenge is determining how to assess the certainty of a model’s predictions and how it correlates with accuracy. In this work, we introduce an analysis for evaluating the performance of popular open-source LLMs, as well as gpt-3.5 Turbo, on multiple choice physics questionnaires. We focus on the relationship between answer accuracy and variability in topics related to physics. Our findings suggest that most models provide accurate replies in cases where they are certain, but this is by far not a general behavior. The relationship between accuracy and uncertainty exposes a broad horizontal bell-shaped distribution. We report how the asymmetry between accuracy and uncertainty intensifies as the questions demand more logical reasoning of the LLM agent, while the same relationship remains sharp for knowledge retrieval tasks.
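One common way to quantify the certainty studied above is the entropy of the model's answer distribution over repeated samples: a model that keeps giving the same option is "certain". The sampled answers below are mock data, and the paper's exact uncertainty measure may differ:

```python
import math
from collections import Counter

# Sketch: estimate uncertainty for a multiple-choice question by
# sampling the model several times and computing the entropy (in bits)
# of the resulting answer distribution.

def answer_entropy(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

confident = ["B"] * 9 + ["C"]          # mostly agrees with itself
uncertain = ["A", "B", "C", "D"] * 3   # spread across options

print(answer_entropy(confident), answer_entropy(uncertain))
```

Plotting this entropy against accuracy over many questions is what exposes the broad bell-shaped relationship the abstract describes.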
zh

[NLP-65] Leveraging AI and NLP for Bank Marketing: A Systematic Review and Gap Analysis

【速读】: 该论文试图解决在银行营销领域中,AI和自然语言处理(NLP)应用的具体情况和潜在价值尚未充分研究的问题。解决方案的关键在于通过PRISMA方法论进行系统性文献回顾,并结合Sentence Transformers和UMAP进行语义映射,以识别当前研究中的战略性缺口。研究发现,尽管AI和NLP在一般营销中已有广泛研究,但在银行营销中的NLP应用研究相对有限。论文通过战略性缺口分析,指出了NLP在客户获取、保留和个性化互动等客户中心应用中的潜在增强作用,为学术研究和实际应用提供了有价值的见解。此外,论文强调了NLP在提升运营效率和监管合规性方面的作用,为银行营销领域的发展和创新提供了行动指南。

链接: https://arxiv.org/abs/2411.14463
作者: Christopher Gerling,Stefan Lessmann
关键词-EN: bank marketing, NLP applications, NLP, marketing, highlighting their evolving
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:This paper explores the growing impact of AI and NLP in bank marketing, highlighting their evolving roles in enhancing marketing strategies, improving customer engagement, and creating value within this sector. While AI and NLP have been widely studied in general marketing, there is a notable gap in understanding their specific applications and potential within the banking sector. This research addresses this specific gap by providing a systematic review and strategic analysis of AI and NLP applications in bank marketing, focusing on their integration across the customer journey and operational excellence. Employing the PRISMA methodology, this study systematically reviews existing literature to assess the current landscape of AI and NLP in bank marketing. Additionally, it incorporates semantic mapping using Sentence Transformers and UMAP for strategic gap analysis to identify underexplored areas and opportunities for future research. The systematic review reveals limited research specifically focused on NLP applications in bank marketing. The strategic gap analysis identifies key areas where NLP can further enhance marketing strategies, including customer-centric applications like acquisition, retention, and personalized engagement, offering valuable insights for both academic research and practical implementation. This research contributes to the field of bank marketing by mapping the current state of AI and NLP applications and identifying strategic gaps. The findings provide actionable insights for developing NLP-driven growth and innovation frameworks and highlight the role of NLP in improving operational efficiency and regulatory compliance. This work has broader implications for enhancing customer experience, profitability, and innovation in the banking industry. 
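The gap-analysis step above can be sketched in a few lines: embed existing research topics and candidate application areas, then flag areas far from every topic as underexplored. Random vectors stand in for Sentence Transformer embeddings here; only the distance logic is illustrated, and the 0.5 threshold is an arbitrary assumption:

```python
import numpy as np

# Toy semantic gap analysis: an application area whose best cosine
# similarity to any existing topic is low is a candidate research gap.

rng = np.random.default_rng(0)
topic_emb = rng.normal(size=(5, 8))   # existing literature topics
area_emb = rng.normal(size=(3, 8))    # candidate application areas

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

sims = cosine_sim(area_emb, topic_emb)   # (3, 5) similarity matrix
coverage = sims.max(axis=1)              # best matching topic per area
gaps = np.where(coverage < 0.5)[0]       # weakly covered areas
print("coverage:", np.round(coverage, 2), "gap indices:", gaps)
```

In the paper's pipeline, UMAP would additionally project the embeddings to 2D for visual inspection of the same sparse regions.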
zh

[NLP-66] Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios

【速读】: 该论文试图解决传统基于模型的AI方法在医疗决策中面临的实时适应性、多步骤推理和复杂任务处理能力不足的问题。解决方案的关键在于采用基于代理的AI系统,该系统通过整合推理轨迹、基于上下文的工具选择、知识检索以及短期和长期记忆,使医疗AI代理能够处理复杂的医疗场景,并像人类医生一样进行决策。论文特别研究了作为医疗AI代理基础的大型语言模型(LLM)的选择,特别是o1模型的应用,探讨了其在不同临床场景(包括重症监护室ICU)中的推理能力、工具使用适应性和实时信息检索的影响,结果表明o1模型能够显著提升诊断的准确性和一致性。

链接: https://arxiv.org/abs/2411.14461
作者: Shaochen Xu,Yifan Zhou,Zhengliang Liu,Zihao Wu,Tianyang Zhong,Huaqin Zhao,Yiwei Li,Hanqi Jiang,Yi Pan,Junhao Chen,Jin Lu,Wei Zhang,Tuo Zhang,Lu Zhang,Dajiang Zhu,Xiang Li,Wei Liu,Quanzheng Li,Andrea Sikora,Xiaoming Zhai,Zhen Xiang,Tianming Liu
关键词-EN: Artificial Intelligence, offering promising advances, modern healthcare, offering promising, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has become essential in modern healthcare, with large language models (LLMs) offering promising advances in clinical decision-making. Traditional model-based approaches, including those leveraging in-context demonstrations and those with specialized medical fine-tuning, have demonstrated strong performance in medical language processing but struggle with real-time adaptability, multi-step reasoning, and handling complex medical tasks. Agent-based AI systems address these limitations by incorporating reasoning traces, tool selection based on context, knowledge retrieval, and both short- and long-term memory. These additional features enable the medical AI agent to handle complex medical scenarios where decision-making should be built on real-time interaction with the environment. Therefore, unlike conventional model-based approaches that treat medical queries as isolated questions, medical AI agents approach them as complex tasks and behave more like human doctors. In this paper, we study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent’s overall reasoning and action generation. In particular, we consider the emergent o1 model and examine its impact on agents’ reasoning, tool-use adaptability, and real-time information retrieval across diverse clinical scenarios, including high-stakes settings such as intensive care units (ICUs). Our findings demonstrate o1’s ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools that support better patient outcomes and decision-making efficacy in clinical practice.
zh

[NLP-67] LLaSA: Large Language and Structured Data Assistant

【速读】: 该论文试图解决现有图神经网络(GNNs)增强的大型语言模型(LLMs)在处理结构化知识基础(SKG)任务时的两个主要问题:(1) 不同类型的结构化数据需要不同的GNNs模型,导致无法统一处理各种形式的结构化数据;(2) GNNs的预训练与特定LLMs紧密耦合,限制了GNNs与文本空间的完全对齐及其对其他LLMs的适应性。解决方案的关键在于提出了一种名为Large Language and Structured Data Assistant (LLaSA)的通用框架,通过将各种类型的结构化数据统一表示为超图格式,并使用自监督学习预训练超图编码器和G-Former(一种利用交叉注意力机制压缩编码超图表示的模型),从而在LLMs的训练和推理阶段将压缩后的超图表示附加到序列化输入中。实验结果表明,预训练的超图编码器能够适应多种LLMs,并增强其处理不同类型结构化数据的能力,同时通过LoRA微调,LLaSA在多个SKG任务上超越了使用全参数调优的先前最先进方法。

链接: https://arxiv.org/abs/2411.14460
作者: Yao Xu,Shizhu He,Zeng Xiangrong,Jiabei Chen,Guang Liu,Bingning Wang,Jun Zhao,Kang Liu
关键词-EN: plentiful NLP tasks, Graph Neutral Networks, Structured Knowledge Grounding, Large Language Models, plentiful NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Structured data, such as tables, graphs, and databases, play a critical role in plentiful NLP tasks such as question answering and dialogue system. Recently, inspired by Vision-Language Models, Graph Neural Networks (GNNs) have been introduced as an additional modality into the input of Large Language Models (LLMs) to improve their performance on Structured Knowledge Grounding (SKG) tasks. However, those GNN-enhanced LLMs have the following limitations: (1) They employ diverse GNNs to model varying types of structured data, rendering them unable to uniformly process various forms of structured data. (2) The pretraining of GNNs is coupled with specific LLMs, which prevents GNNs from fully aligning with the textual space and limits their adaptability to other LLMs. To address these issues, we propose Large Language and Structured Data Assistant (LLaSA), a general framework for enhancing LLMs’ ability to handle structured data. Specifically, we represent various types of structured data in a unified hypergraph format, and use self-supervised learning to pretrain a hypergraph encoder, and a G-Former compressing encoded hypergraph representations with cross-attention. The compressed hypergraph representations are appended to the serialized inputs during training and inference stages of LLMs. Experimental results on multiple SKG tasks show that our pretrained hypergraph encoder can adapt to various LLMs and enhance their ability to process different types of structured data. Besides, LLaSA, with LoRA fine-tuning, outperforms previous SOTA method using full parameters tuning.
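The unified hypergraph format can be illustrated for the table case: each cell becomes a node, and each row and each column becomes a hyperedge over its cells. This is a simplified reading of the idea for illustration, not the paper's exact construction:

```python
# Sketch: represent a table as a hypergraph (nodes = cells,
# hyperedges = rows and columns), in the spirit of LLaSA's unified
# structured-data format.

def table_to_hypergraph(header, rows):
    nodes = []        # cell values, indexed by position of creation
    hyperedges = {}   # hyperedge name -> list of node indices
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            idx = len(nodes)
            nodes.append(cell)
            hyperedges.setdefault(f"row_{r}", []).append(idx)
            hyperedges.setdefault(f"col_{header[c]}", []).append(idx)
    return nodes, hyperedges

nodes, edges = table_to_hypergraph(
    ["city", "population"],
    [["Paris", "2.1M"], ["Lyon", "0.5M"]],
)
print(len(nodes), sorted(edges))
```

Graphs and database schemas can be cast into the same node/hyperedge vocabulary, which is what lets a single pretrained hypergraph encoder serve all structured-data types.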
zh

[NLP-68] Unveiling User Preferences: A Knowledge Graph and LLM-Driven Approach for Conversational Recommendation

【速读】: 该论文试图解决对话推荐系统 (Conversational Recommender Systems, CRSs) 中用户偏好提取缺乏解释性的问题。解决方案的关键在于提出了一个名为 COMPASS 的即插即用框架,该框架通过协同大语言模型 (Large Language Models, LLMs) 和知识图谱 (Knowledge Graphs, KGs) 来揭示用户偏好,从而提升推荐系统的性能和可解释性。具体来说,COMPASS 采用两阶段训练方法:首先通过创新的图实体标题预训练机制,弥合结构化 KG 和自然语言之间的差距,使 LLM 能够将 KG 实体转化为简洁的自然语言描述,从而理解领域特定知识;接着通过知识感知的指令微调,优化用户偏好建模,使 LLM 能够从对话历史和 KG 增强的上下文中推理和总结用户偏好。这种方法使得 COMPASS 能够进行知识感知的推理,生成全面且可解释的用户偏好,从而无缝集成到现有的 CRS 模型中,提升推荐性能和解释性。

链接: https://arxiv.org/abs/2411.14459
作者: Zhangchi Qiu,Linhao Luo,Shirui Pan,Alan Wee-Chung Liew
关键词-EN: Conversational Recommender Systems, Conversational Recommender, Recommender Systems, dynamically capturing user, provide personalized recommendations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Conversational Recommender Systems (CRSs) aim to provide personalized recommendations through dynamically capturing user preferences in interactive conversations. Conventional CRSs often extract user preferences as hidden representations, which are criticized for their lack of interpretability. This diminishes the transparency and trustworthiness of the recommendation process. Recent works have explored combining the impressive capabilities of Large Language Models (LLMs) with the domain-specific knowledge of Knowledge Graphs (KGs) to generate human-understandable recommendation explanations. Despite these efforts, the integration of LLMs and KGs for CRSs remains challenging due to the modality gap between unstructured dialogues and structured KGs. Moreover, LLMs pre-trained on large-scale corpora may not be well-suited for analyzing user preferences, which require domain-specific knowledge. In this paper, we propose COMPASS, a plug-and-play framework that synergizes LLMs and KGs to unveil user preferences, enhancing the performance and explainability of existing CRSs. To address integration challenges, COMPASS employs a two-stage training approach: first, it bridges the gap between the structured KG and natural language through an innovative graph entity captioning pre-training mechanism. This enables the LLM to transform KG entities into concise natural language descriptions, allowing them to comprehend domain-specific knowledge. Following, COMPASS optimizes user preference modeling via knowledge-aware instruction fine-tuning, where the LLM learns to reason and summarize user preferences from both dialogue histories and KG-augmented context. This enables COMPASS to perform knowledge-aware reasoning and generate comprehensive and interpretable user preferences that can seamlessly integrate with existing CRS models for improving recommendation performance and explainability.
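The entity-captioning step can be sketched with fixed templates: verbalize the KG triples around an entity into a short description an LLM can read. The relation phrasings here are invented for illustration; COMPASS learns this mapping via pre-training rather than using hand-written templates:

```python
# Toy graph entity captioning: turn the triples attached to an entity
# into one concise natural-language line.

TRIPLES = [
    ("Inception", "genre", "science fiction"),
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released", "2010"),
]

def caption_entity(entity, triples):
    facts = [f"{rel.replace('_', ' ')} {obj}"
             for subj, rel, obj in triples if subj == entity]
    return f"{entity}: " + "; ".join(facts) + "."

caption = caption_entity("Inception", TRIPLES)
print(caption)
```

Such captions give the LLM KG-grounded context alongside the dialogue history when it summarizes user preferences.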
zh

[NLP-69] Can Artificial Intelligence Generate Quality Research Topics Reflecting Patient Concerns?

【速读】: 该论文试图解决在健康研究中如何更一致地整合患者视角的问题。解决方案的关键在于利用自然语言处理 (NLP) 和人工智能 (AI) 技术,通过分析大量患者门户消息(来自2013至2024年间25,549名乳腺癌或皮肤癌患者的614,464条消息),构建一个两阶段的非监督NLP主题模型,以识别患者的临床关注点。随后,使用ChatGPT-4o(OpenAI Inc, April 2024版本)生成针对这些关注点的研究主题,并通过多层次的任务指导AI进行知识解释与总结、知识生成、自我反思与修正以及自我确认,最终生成具有高度重要性和新颖性的研究主题。研究结果表明,通过大量患者消息生成的AI研究主题能够有效地反映患者视角,从而指导未来的患者中心健康研究。

链接: https://arxiv.org/abs/2411.14456
作者: Jiyeong Kim,Michael L. Chen,Shawheen J. Rezaei,Mariana Ramirez-Posada,Jennifer L. Caswell-Jin,Allison W. Kurian,Fauzia Riaz,Kavita Y. Sarin,Jean Y. Tang,Steven M. Asch,Eleni Linos
关键词-EN: research, research topics, AI-generated research topics, research ideas, narrowing the gap
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Patient-centered research is increasingly important in narrowing the gap between research and patient care, yet incorporating patient perspectives into health research has been inconsistent. We propose an automated framework leveraging innovative natural language processing (NLP) and artificial intelligence (AI) with patient portal messages to generate research ideas that prioritize important patient issues. We further quantified the quality of AI-generated research topics. To define patient clinical concerns, we analyzed 614,464 patient messages from 25,549 individuals with breast or skin cancer obtained from a large academic hospital (2013 to 2024), constructing a 2-staged unsupervised NLP topic model. Then, we generated research topics to resolve the defined issues using a widely used AI (ChatGPT-4o, OpenAI Inc, April 2024 version) with prompt-engineering strategies. We guided AI to perform multi-level tasks: 1) knowledge interpretation and summarization (e.g., interpreting and summarizing the NLP-defined topics), 2) knowledge generation (e.g., generating research ideas corresponding to patients issues), 3) self-reflection and correction (e.g., ensuring and revising the research ideas after searching for scientific articles), and 4) self-reassurance (e.g., confirming and finalizing the research ideas). Six highly experienced breast oncologists and dermatologists assessed the significance and novelty of AI-generated research topics using a 5-point Likert scale (1-exceptional, 5-poor). One-third of the AI-suggested research topics were highly significant and novel when both scores were lower than the average. Two-thirds of the AI-suggested topics were novel in both cancers. Our findings demonstrate that AI-generated research topics reflecting patient perspectives via a large volume of patient messages can meaningfully guide future directions in patient-centered health research.
zh

[NLP-70] Direct Speech-to-Speech Neural Machine Translation: A Survey

【速读】: 该论文试图解决直接语音到语音翻译 (Direct Speech-to-Speech Translation, S2ST) 系统在实际应用中性能不足的问题。解决方案的关键在于全面评估和分析现有的直接S2ST模型,包括其数据需求、应用场景以及性能评估指标。论文通过系统性地回顾和批判性分析这些模型的表现,识别出当前的研究挑战,并提出了未来的研究方向,以期提升直接S2ST系统在实际翻译任务中的表现和应用潜力。

链接: https://arxiv.org/abs/2411.14453
作者: Mahendra Gupta,Maitreyee Dutta,Chandresh Kumar Maurya
关键词-EN: target language, linguistic information, models transform speech, transform speech, language
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech-to-Speech Translation (S2ST) models transform speech from one language to another target language with the same linguistic information. S2ST is important for bridging the communication gap among communities and has diverse applications. In recent years, researchers have introduced direct S2ST models, which have the potential to translate speech without relying on intermediate text generation, have better decoding latency, and the ability to preserve paralinguistic and non-linguistic features. However, direct S2ST has yet to achieve quality performance for seamless communication and still lags behind the cascade models in terms of performance, especially in real-world translation. To the best of our knowledge, no comprehensive survey is available on the direct S2ST system, which beginners and advanced researchers can look upon for a quick survey. The present work provides a comprehensive review of direct S2ST models, data and application issues, and performance metrics. We critically analyze the models’ performance over the benchmark datasets and provide research challenges and future directions.
zh

[NLP-71] AI Ethics by Design: Implementing Customizable Guardrails for Responsible AI Development

【速读】: 该论文试图解决AI系统中的伦理问题,特别是如何构建一个能够适应多样化用户价值观和伦理标准的可定制伦理护栏框架。解决方案的关键在于提出一个集成规则、政策和AI助手的结构,以确保AI行为符合伦理要求,并通过与现有最先进护栏框架的比较,强调其实用机制以增强透明度、用户自主性和AI系统的持续改进。该框架还支持伦理多元主义,提供了一个灵活且适应性强的解决方案,以应对不断变化的AI治理环境。

链接: https://arxiv.org/abs/2411.14442
作者: Kristina Šekrst,Jeremy McHugh,Jonathan Rodriguez Cefalu
关键词-EN: emphasizing the importance, explores the development, importance of customizable, align with diverse, ethical guardrail framework
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores the development of an ethical guardrail framework for AI systems, emphasizing the importance of customizable guardrails that align with diverse user values and underlying ethics. We address the challenges of AI ethics by proposing a structure that integrates rules, policies, and AI assistants to ensure responsible AI behavior, while comparing the proposed framework to the existing state-of-the-art guardrails. By focusing on practical mechanisms for implementing ethical standards, we aim to enhance transparency, user autonomy, and continuous improvement in AI systems. Our approach accommodates ethical pluralism, offering a flexible and adaptable solution for the evolving landscape of AI governance. The paper concludes with strategies for resolving conflicts between ethical directives, underscoring the present and future need for robust, nuanced and context-aware AI systems.
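The rules-and-policies structure can be illustrated with a minimal customizable guardrail: each deployment configures its own checks on model output. The rule set and field names below are hypothetical examples, not the framework described in the paper:

```python
from dataclasses import dataclass, field

# Minimal sketch of a user-customizable guardrail: a policy object
# holding per-deployment rules, applied to candidate model output.

@dataclass
class Guardrail:
    banned_terms: set = field(default_factory=set)
    max_length: int = 1000

    def check(self, text: str):
        """Return a list of rule violations (empty means the text passes)."""
        violations = []
        if len(text) > self.max_length:
            violations.append("too_long")
        for term in self.banned_terms:
            if term in text.lower():
                violations.append(f"banned:{term}")
        return violations

rail = Guardrail(banned_terms={"password"}, max_length=50)
print(rail.check("Here is my password: hunter2"))
print(rail.check("All clear."))
```

Because the rules live in data rather than code, different users or organizations can swap in policies reflecting their own ethical standards, which is the customizability the paper argues for.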
zh

计算机视觉

[CV-0] DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

【速读】: 该论文试图解决在自动驾驶领域中,扩散模型(Diffusion Model)在处理动态、开放交通场景时,生成多样化驾驶动作的实时性问题。解决方案的关键在于提出了截断扩散策略(Truncated Diffusion Policy),该策略通过引入先验的多模态锚点(Multi-mode Anchors)并截断扩散过程,使得模型能够从锚定的正态分布直接学习到多模态驾驶动作分布,从而大幅减少去噪步骤。此外,设计了高效的级联扩散解码器(Cascade Diffusion Decoder)以增强与条件场景上下文的交互。这些创新使得DiffusionDrive模型在仅2步去噪的情况下,实现了与传统扩散策略相比10倍的步骤减少,同时在NAVSIM数据集上以45 FPS的实时速度运行,达到了88.1 PDMS的性能,显著提升了生成动作的多样性和质量。

链接: https://arxiv.org/abs/2411.15139
作者: Bencheng Liao,Shaoyu Chen,Haoran Yin,Bo Jiang,Cheng Wang,Sixu Yan,Xinbang Zhang,Xiangyu Li,Ying Zhang,Qian Zhang,Xinggang Wang
关键词-EN: powerful generative technique, robotic policy learning, modeling multi-mode action, capable of modeling, powerful generative
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Work in progress. Code demo model will be available at this https URL

点击查看摘要

Abstract:Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging its capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, demonstrates a 10× reduction in denoising steps compared to vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available at this https URL.
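The truncation idea can be sketched numerically: sampling starts from Gaussians centered on prior multi-mode anchors instead of pure noise, so only a couple of denoising steps are needed. The "denoiser" below is a stand-in that nudges samples toward the nearest target mode; the real model learns this mapping:

```python
import numpy as np

# Conceptual sketch of a truncated diffusion schedule with anchored
# Gaussian initialization and only 2 denoising steps.

rng = np.random.default_rng(0)
anchors = np.array([[-1.0, 0.0], [1.0, 0.0]])    # prior driving modes
targets = np.array([[-1.2, 0.3], [1.1, -0.2]])   # "true" action modes

def fake_denoise_step(x, targets, step_size=0.5):
    """Move each sample part-way toward its nearest target mode."""
    nearest = targets[np.argmin(
        np.linalg.norm(x[:, None] - targets[None], axis=-1), axis=1)]
    return x + step_size * (nearest - x)

# start at anchored Gaussians (truncated schedule), then just 2 steps
x = anchors[rng.integers(0, 2, size=64)] + 0.1 * rng.normal(size=(64, 2))
for _ in range(2):
    x = fake_denoise_step(x, targets)

err = np.linalg.norm(
    x - targets[np.argmin(np.linalg.norm(
        x[:, None] - targets[None], axis=-1), axis=1)], axis=1)
print("mean distance to nearest mode:", err.mean().round(3))
```

Because each anchor already sits near one action mode, two steps suffice to concentrate samples on the multi-mode distribution, which is the source of the claimed 10x step reduction.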
zh

[CV-1] Material Anything: Generating Materials for Any 3D Object via Diffusion

【速读】: 该论文试图解决现有方法在生成适用于3D对象的物理基础材质时,依赖复杂流程或特定优化的问题。解决方案的关键在于提出了一个全自动、统一的扩散框架——Material Anything,该框架利用预训练的图像扩散模型,并通过三头架构和渲染损失增强其稳定性与材质质量。此外,引入的置信度掩码作为扩散模型内的动态切换器,使其能有效处理不同光照条件下的纹理和无纹理对象。通过渐进式材质生成策略和UV空间材质优化器,确保输出材质的一致性和UV准备就绪状态。实验结果表明,该方法在广泛的对象类别和光照条件下优于现有方法。

链接: https://arxiv.org/abs/2411.15138
作者: Xin Huang,Tengfei Wang,Ziwei Liu,Qing Wang
关键词-EN: generate physically-based materials, unified diffusion framework, diffusion framework designed, framework designed, designed to generate
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions.
zh

[CV-2] WildLMa: Long Horizon Loco-Manipulation in the Wild

【速读】: 该论文试图解决在多样化真实世界环境中部署机器人进行复杂操作的问题,特别是针对四足机器人配备机械臂的场景。解决方案的关键在于提出了WildLMa框架,该框架包含三个核心组件:(1) 针对VR增强的全身体远程操作和可穿越性的低级控制器适应;(2) WildLMa-Skill库,通过模仿学习或启发式方法获取的可泛化的视觉运动技能;(3) WildLMa-Planner,一个允许大型语言模型(LLM)规划器协调技能以执行长时任务的接口。通过利用高质量的训练数据和CLIP进行语言条件下的模仿学习,WildLMa显著提高了抓取成功率,并在多种实际应用中展示了其有效性。

链接: https://arxiv.org/abs/2411.15131
作者: Ri-Zhao Qiu,Yuchen Song,Xuanbin Peng,Sai Aneesh Suryadevara,Ge Yang,Minghuan Liu,Mazeyu Ji,Chengzhe Jia,Ruihan Yang,Xueyan Zou,Xiaolong Wang
关键词-EN: diverse real-world environments, mobile manipulation aims, perform complex manipulation, diverse environments, real-world environments
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Website: this https URL

点击查看摘要

Abstract:‘In-the-wild’ mobile manipulation aims to deploy robots in diverse real-world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long-horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick-and-place. Quadruped robots with manipulators hold promise for extending the workspace and enabling robust locomotion, but existing results do not investigate such a capability. This paper proposes WildLMa with three components to address these issues: (1) adaptation of learned low-level controller for VR-enabled whole-body teleoperation and traversability; (2) WildLMa-Skill – a library of generalizable visuomotor skills acquired via imitation learning or heuristics and (3) WildLMa-Planner – an interface of learned skills that allow LLM planners to coordinate skills for long-horizon tasks. We demonstrate the importance of high-quality training data by achieving higher grasping success rate over existing RL baselines using only tens of demonstrations. WildLMa exploits CLIP for language-conditioned imitation learning that empirically generalizes to objects unseen in training demonstrations. Besides extensive quantitative evaluation, we qualitatively demonstrate practical robot applications, such as cleaning up trash in university hallways or outdoor terrains, operating articulated objects, and rearranging items on a bookshelf.
zh

[CV-3] Health AI Developer Foundations

【速读】: 该论文试图解决医疗领域中开发健壮的机器学习(ML)模型的高成本、高计算资源需求和长时间专家标注的问题。解决方案的关键在于引入Health AI Developer Foundations (HAI-DEF),这是一个包含预训练的领域特定基础模型、工具和配方的套件,旨在加速健康应用的ML模型构建。HAI-DEF通过提供涵盖多种模态和领域的预训练模型(如放射学、病理学、皮肤病学和音频),利用领域特定的嵌入,减少了标注数据的需求、缩短了训练时间并降低了计算成本。此外,HAI-DEF采用统一的接口和风格,并强调易用性,以促进开发者的有效集成。尽管HAI-DEF降低了医疗ML的进入门槛,论文仍强调在使用特定问题和人群数据进行验证的重要性。

链接: https://arxiv.org/abs/2411.15128
作者: Atilla P. Kiraly,Sebastien Baur,Kenneth Philbrick,Fereshteh Mahvar,Liron Yatziv,Tiffany Chen,Bram Sterling,Nick George,Fayaz Jamil,Jing Tang,Kai Bailey,Faruk Ahmed,Akshay Goel,Abbi Ward,Lin Yang,Andrew Sellergren,Yossi Matias,Avinatan Hassidim,Shravya Shetty,Daniel Golden,Shekoofeh Azizi,David F. Steiner,Yun Liu,Tim Thelin,Rory Pilgrim,Can Kirmizibayrak
关键词-EN: medical Machine Learning, Robust medical Machine, Machine Learning, accelerating clinical research, Robust medical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:Robust medical Machine Learning (ML) models have the potential to revolutionize healthcare by accelerating clinical research, improving workflows and outcomes, and producing novel insights or capabilities. Developing such ML models from scratch is cost prohibitive and requires substantial compute, data, and time (e.g., expert labeling). To address these challenges, we introduce Health AI Developer Foundations (HAI-DEF), a suite of pre-trained, domain-specific foundation models, tools, and recipes to accelerate building ML for health applications. The models cover various modalities and domains, including radiology (X-rays and computed tomography), histopathology, dermatological imaging, and audio. These models provide domain specific embeddings that facilitate AI development with less labeled data, shorter training times, and reduced computational costs compared to traditional approaches. In addition, we utilize a common interface and style across these models, and prioritize usability to enable developers to integrate HAI-DEF efficiently. We present model evaluations across various tasks and conclude with a discussion of their application and evaluation, covering the importance of ensuring efficacy, fairness, and equity. Finally, while HAI-DEF and specifically the foundation models lower the barrier to entry for ML in healthcare, we emphasize the importance of validation with problem- and population-specific data for each desired usage setting. This technical report will be updated over time as more modalities and features are added.
zh

[CV-4] A Real-Time DETR Approach to Bangladesh Road Object Detection for Autonomous Vehicles

【速读】: 该论文试图解决自动驾驶车辆中的道路物体检测问题,解决方案的关键在于采用实时检测变压器模型(Real-Time DETR, RTDETR)。该模型在保持高精度的同时,显著提升了推理速度,使其成为自动驾驶领域中物体检测的潜在候选方案。通过在基于孟加拉国的BadODD道路物体检测数据集上的实验,论文展示了RTDETR在公共测试集和私有测试集上的表现,分别获得了mAP50分数为0.41518和0.28194。

链接: https://arxiv.org/abs/2411.15110
作者: Irfan Nafiz Shahan,Arban Hossain,Saadman Sakib,Al-Mubin Nabil
关键词-EN: Computer Vision, Road Object Detection, object detection, field of Computer, Road Object
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, we have witnessed a paradigm shift in the field of Computer Vision with the advent of the transformer architecture. Detection Transformers have become a state-of-the-art solution to object detection and are a potential candidate for Road Object Detection in Autonomous Vehicles. Despite the abundance of object detection schemes, real-time DETR models are shown to perform significantly better on inference times, with minimal loss of accuracy and performance. In our work, we used Real-Time DETR (RTDETR) object detection on the BadODD Road Object Detection dataset based in Bangladesh, and performed necessary experimentation and testing. Our results gave a mAP50 score of 0.41518 in the public 60% test set, and 0.28194 in the private 40% test set.
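The mAP50 scores reported above rest on the IoU overlap criterion: a predicted box counts as a true positive at mAP50 when its intersection-over-union with a ground-truth box is at least 0.5. A minimal sketch of that building block, with boxes as (x1, y1, x2, y2):

```python
# Intersection-over-union between two axis-aligned boxes, the overlap
# measure underlying the mAP50 detection metric.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # half-overlapping boxes
```

Full mAP50 additionally ranks predictions by confidence and averages precision over recall levels and classes; IoU is only the matching test.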
zh

[CV-5] About Time: Advances Challenges and Outlooks of Action Understanding

【速读】: 该论文旨在全面回顾视频动作理解领域的最新进展,并探讨当前面临的挑战和未来发展方向。解决方案的关键在于对动作理解任务进行时间范围的划分,具体分为:(1) 对完整观察到的动作进行识别任务;(2) 对正在进行的部分观察到的动作进行预测任务;(3) 对后续未观察到的动作进行预测任务。这种划分有助于识别特定动作建模和视频表示的挑战,并为解决当前不足之处提供了明确的方向。

链接: https://arxiv.org/abs/2411.15106
作者: Alexandros Stergiou,Ronald Poppe
关键词-EN: witnessed impressive advances, witnessed impressive, impressive advances, tasks, action understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We have witnessed impressive advances in video action understanding. Increased dataset sizes, variability, and computation availability have enabled leaps in performance and task diversification. Current systems can provide coarse- and fine-grained descriptions of video scenes, extract segments corresponding to queries, synthesize unobserved parts of videos, and predict context. This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks. We focus on prevalent challenges, overview widely adopted datasets, and survey seminal works with an emphasis on recent advances. We broadly distinguish between three temporal scopes: (1) recognition tasks of actions observed in full, (2) prediction tasks for ongoing partially observed actions, and (3) forecasting tasks for subsequent unobserved action. This division allows us to identify specific action modeling and video representation challenges. Finally, we outline future directions to address current shortcomings.
zh

[CV-6] OminiControl: Minimal and Universal Control for Diffusion Transformer

【速读】: 该论文试图解决在预训练的扩散变压器模型(Diffusion Transformer, DiT)中高效且统一地集成图像条件的问题。解决方案的关键在于提出了OminiControl框架,该框架通过参数复用机制,利用DiT自身作为强大的骨干网络来编码图像条件,并通过其灵活的多模态注意力处理器进行处理。与依赖复杂额外编码器模块的现有方法不同,OminiControl仅需增加约0.1%的额外参数,就能有效地集成图像条件,并统一处理包括主体驱动生成和空间对齐条件(如边缘、深度等)在内的广泛图像条件任务。此外,OminiControl通过在DiT自身生成的图像上进行训练,特别有利于主体驱动生成任务。

链接: https://arxiv.org/abs/2411.15098
作者: Zhenxiong Tan,Songhua Liu,Xingyi Yang,Qiaochu Xue,Xinchao Wang
关键词-EN: pre-trained Diffusion Transformer, Diffusion Transformer, pre-trained Diffusion, integrates image conditions, highly versatile
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.
zh

[CV-7] Learning to Stabilize Faces

【速读】: 该论文试图解决面部扫描数据中头部不必要运动的稳定化问题,特别是在游戏开发和电影制作等需要清晰分离面部表情与头部刚性运动的任务中。解决方案的关键在于将稳定化问题视为一个回归问题,通过一个基于学习的网络直接预测两个面部网格之间的刚性变换,从而实现头骨的对齐。该方法利用3D形变模型(3D Morphable Model, 3DMM)生成合成训练数据,利用3DMM参数分离头骨运动与面部皮肤运动,从而实现全自动且高效的面部稳定化处理。实验结果表明,该方法在稳定离散的面部表情和动态面部表现方面均优于现有最先进技术。

链接: https://arxiv.org/abs/2411.15074
作者: Jan Bednarik,Erroll Wood,Vasileios Choutas,Timo Bolkart,Daoye Wang,Chenglei Wu,Thabo Beeler
关键词-EN: high quality, automatically register, Nowadays, scan faces, face meshes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Eurographics 2024

点击查看摘要

Abstract:Nowadays, it is possible to scan faces and automatically register them with high quality. However, the resulting face meshes often need further processing: we need to stabilize them to remove unwanted head movement. Stabilization is important for tasks like game development or movie making which require facial expressions to be cleanly separated from rigid head motion. Since manual stabilization is labor-intensive, there have been attempts to automate it. However, previous methods remain impractical: they either still require some manual input, produce imprecise alignments, rely on dubious heuristics and slow optimization, or assume a temporally ordered input. Instead, we present a new learning-based approach that is simple and fully automatic. We treat stabilization as a regression problem: given two face meshes, our network directly predicts the rigid transform between them that brings their skulls into alignment. We generate synthetic training data using a 3D Morphable Model (3DMM), exploiting the fact that 3DMM parameters separate skull motion from facial skin motion. Through extensive experiments we show that our approach outperforms the state-of-the-art both quantitatively and qualitatively on the tasks of stabilizing discrete sets of facial expressions as well as dynamic facial performances. Furthermore, we provide an ablation study detailing the design choices and best practices to help others adopt our approach for their own uses. Supplementary videos can be found on the project webpage this http URL.
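For context, the classical non-learned way to obtain the rigid transform between two corresponding point sets is the closed-form Kabsch solution; the paper instead regresses the transform directly with a network. A sketch of the closed-form baseline (this is not the paper's network, only the standard alternative it is compared against conceptually):

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) with dst ~= R @ src + t for
    corresponding 3D point sets (Kabsch algorithm)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Unlike this closed-form solve, which needs known point correspondences, the learned regressor can be trained (here, on 3DMM-synthesized pairs) to align skulls even when the facial skin deforms between the two meshes.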
zh

[CV-8] SPAC-Net: Rethinking Point Cloud Completion with Structural Prior

【速读】: 该论文试图解决点云补全任务中由于特征抽象问题导致的细节丢失问题。解决方案的关键在于提出了一种新的框架,称为 SPAC-Net,该框架引入了“接口”(interface)这一结构先验,并通过以下两个核心模块实现:1) 边缘检测器(Marginal Detector, MAD)模块,用于定位接口,即已知观测与缺失部分之间的交界;2) 结构补充(Structure Supplement, SSP)模块,在升采样阶段之前增强粗略形状的结构细节,使升采样模块能够更专注于升采样任务。通过这两个模块,SPAC-Net 能够更准确地预测缺失部分的形状,并在多个挑战性基准测试中表现优于现有的最先进方法。

链接: https://arxiv.org/abs/2411.15066
作者: Zizhao Wu,Jian Shi,Xuan Deng,Cheng Zhang,Genfu Yang,Ming Zeng,Yunhai Wang
关键词-EN: cloud completion aims, Point cloud completion, complete shape, aims to infer, Point cloud
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Point cloud completion aims to infer a complete shape from its partial observation. Many approaches utilize a pure encoder-decoder paradigm in which the complete shape can be directly predicted from shape priors learned from partial scans; however, these methods inevitably suffer from a loss of details due to feature abstraction issues. In this paper, we propose a novel framework, termed SPAC-Net, that aims to rethink the completion task under the guidance of a new structural prior, which we call the interface. Specifically, our method first employs a Marginal Detector (MAD) module to localize the interface, defined as the intersection between the known observation and the missing parts. Based on the interface, our method predicts the coarse shape by learning the displacement from points in the interface to their corresponding positions in the missing parts. Furthermore, we devise an additional Structure Supplement (SSP) module before the upsampling stage to enhance the structural details of the coarse shape, enabling the upsampling module to focus more on the upsampling task. Extensive experiments have been conducted on several challenging benchmarks, and the results demonstrate that our method outperforms existing state-of-the-art approaches.
zh

[CV-9] OVO-SLAM: Open-Vocabulary Online Simultaneous Localization and Mapping

【速读】: 该论文试图解决开放词汇在线三维语义SLAM(Open-Vocabulary Online 3D semantic SLAM)的问题,即在无需预先定义词汇表的情况下,实现实时的三维场景重建和语义分割。解决方案的关键在于提出了一种名为OVO-SLAM的管道,特别是在映射线程中,通过检测和跟踪三维片段(3D segments),并使用CLIP向量(CLIP vectors)描述这些片段,这些向量是通过从观察这些三维片段的视点进行新颖的聚合计算得出的。这种方法不仅提高了处理速度,而且在分割指标上优于现有的离线方法,同时首次实现了端到端的开放词汇在线三维重建,无需依赖真实相机姿态或场景几何信息。

链接: https://arxiv.org/abs/2411.15043
作者: Tomas Berriel Martins,Martin R. Oswald,Javier Civera
关键词-EN: semantic SLAM pipeline, semantic SLAM, SLAM pipeline, paper presents, Open-Vocabulary Online
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This paper presents the first Open-Vocabulary Online 3D semantic SLAM pipeline, that we denote as OVO-SLAM. Our primary contribution is in the pipeline itself, particularly in the mapping thread. Given a set of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors, calculated through a novel aggregation from the viewpoints where these 3D segments are observed. Notably, our OVO-SLAM pipeline is not only faster but also achieves better segmentation metrics compared to offline approaches in the literature. Along with superior segmentation performance, we show experimental results of our contributions integrated with Gaussian-SLAM, being the first ones demonstrating end-to-end open-vocabulary online 3D reconstructions without relying on ground-truth camera poses or scene geometry.
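The multi-view descriptor idea, one CLIP vector per tracked 3D segment fused from the viewpoints that observe it, can be illustrated with a plain normalized mean; the paper's actual aggregation is described as novel and is not reproduced here, so the averaging below is only a hedged placeholder:

```python
import numpy as np

def aggregate_segment_descriptor(view_embeddings):
    """Fuse per-view CLIP embeddings (V, D) of one 3D segment into a single
    unit-norm descriptor via a simple average (placeholder aggregation)."""
    v = np.asarray(view_embeddings).mean(axis=0)
    return v / np.linalg.norm(v)

def classify_segment(descriptor, text_embeddings, labels):
    """Open-vocabulary labeling: pick the label whose unit-norm CLIP text
    embedding has the highest cosine similarity with the segment descriptor."""
    sims = text_embeddings @ descriptor
    return labels[int(np.argmax(sims))]
```

Because the labels enter only through text embeddings at query time, the vocabulary never needs to be fixed during mapping, which is what makes the pipeline open-vocabulary.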
zh

[CV-10] HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads

【速读】: 该论文试图解决多模态扩散变换器 (Multimodal Diffusion Transformers, MM-DiTs) 在文本引导图像编辑任务中存在的语义对齐问题。由于MM-DiTs缺乏对文本指导的显式和一致支持,导致编辑结果与文本之间的语义不匹配。解决方案的关键在于提出了HeadRouter,这是一个无需训练的图像编辑框架,通过自适应地将文本指导路由到MM-DiTs中的不同注意力头 (attention heads) 来编辑源图像。此外,论文还引入了一个双令牌优化模块 (dual-token refinement module),用于优化文本和图像令牌表示,以实现精确的语义指导和准确的区域表达。

链接: https://arxiv.org/abs/2411.15034
作者: Yu Xu,Fan Tang,Juan Cao,Yuxin Zhang,Xiaoyu Kong,Jintao Li,Oliver Deussen,Tong-Yee Lee
关键词-EN: Diffusion Transformers, exhibited robust capabilities, image generation tasks, generation tasks, exhibited robust
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures that could utilize self/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicit and consistent incorporated text guidance, resulting in semantic misalignment between the edited results and texts. In this study, we disclose the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter’s performance in terms of editing fidelity and image quality.
zh

[CV-11] FloAt: Flow Warping of Self-Attention for Clothing Animation Generation

【速读】: 该论文试图解决生成式 AI (Generative AI) 在生成包含人类服装动画的动态图像(cinemagraphs)时,如何更自然地模拟服装的动态效果并减少背景干扰的问题。解决方案的关键在于提出了一种基于扩散模型 (diffusion model) 的方法,称为 FloAtControlNet。该方法的核心是利用法线贴图 (normal maps) 中的流 (flow) 信息来操控自注意力机制 (self-attention) 的映射,从而增强服装动画的自然度并抑制背景的闪烁效应。具体来说,通过重新计算特定层和帧的自注意力映射,将其与前一帧的自注意力映射进行线性组合,并根据法线贴图中的流信息进行扭曲,从而实现对服装动画的精细控制。实验结果表明,该方法在视觉质量和用户评价方面均优于现有的基线方法。

链接: https://arxiv.org/abs/2411.15028
作者: Swasti Shreya Mishra,Kuldeep Kulkarni,Duygu Ceylan,Balaji Vasan Srinivasan
关键词-EN: generate cinemagraphs composed, human clothing, human clothing animations, self-attention maps, normal map sequences
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a diffusion model-based approach, FloAtControlNet to generate cinemagraphs composed of animations of human clothing. We focus on human clothing like dresses, skirts and pants. The input to our model is a text prompt depicting the type of clothing and the texture of clothing like leopard, striped, or plain, and a sequence of normal maps that capture the underlying animation that we desire in the output. The backbone of our method is a normal-map conditioned ControlNet which is operated in a training-free regime. The key observation is that the underlying animation is embedded in the flow of the normal maps. We utilize the flow thus obtained to manipulate the self-attention maps of appropriate layers. Specifically, the self-attention maps of a particular layer and frame are recomputed as a linear combination of itself and the self-attention maps of the same layer and the previous frame, warped by the flow on the normal maps of the two frames. We show that manipulating the self-attention maps greatly enhances the quality of the clothing animation, making it look more natural as well as suppressing the background artifacts. Through extensive experiments, we show that the method proposed beats all baselines both qualitatively in terms of visual results and user study. Specifically, our method is able to alleviate the background flickering that exists in other diffusion model-based baselines that we consider. In addition, we show that our method beats all baselines in terms of RMSE and PSNR computed using the input normal map sequences and the normal map sequences obtained from the output RGB frames. Further, we show that well-established evaluation metrics like LPIPS, SSIM, and CLIP scores that are generally for visual quality are not necessarily suitable for capturing the subtle motions in human clothing animations.
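The core manipulation above, recomputing a frame's self-attention map as a linear combination of itself and the previous frame's map warped by the normal-map flow, can be sketched as follows. The nearest-neighbour warp and the blending weight `alpha` are simplifying assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def warp_by_flow(attn_prev, flow):
    """Backward-warp a 2D attention map by a per-pixel flow field.
    attn_prev: (H, W); flow: (H, W, 2) holding (dy, dx) offsets.
    Nearest-neighbour sampling with wrap-around, for brevity."""
    H, W = attn_prev.shape
    out = np.zeros_like(attn_prev)
    for y in range(H):
        for x in range(W):
            sy = int(round(y + flow[y, x, 0])) % H
            sx = int(round(x + flow[y, x, 1])) % W
            out[y, x] = attn_prev[sy, sx]
    return out

def blend_self_attention(attn_t, attn_prev, flow, alpha=0.5):
    """attn_t <- alpha * attn_t + (1 - alpha) * warp(attn_{t-1});
    `alpha` is a hypothetical blending weight."""
    return alpha * attn_t + (1.0 - alpha) * warp_by_flow(attn_prev, flow)
```

In the actual method this blend is applied to the self-attention maps of selected layers inside the normal-map-conditioned ControlNet, with the flow extracted from consecutive normal maps.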
zh

[CV-12] DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

【速读】: 该论文试图解决视频大语言模型 (Video Large Language Models, VLLMs) 在处理复杂视频内容时,由于生成大量视觉标记 (visual tokens) 导致的推理效率低下的问题。解决方案的关键在于提出了一种无需训练的标记压缩方法,称为 DyCoke。DyCoke 通过引入一个即插即用的时序压缩模块来减少帧间冗余,通过合并冗余标记来优化标记表示,并采用动态 KV 缓存减少策略来选择性修剪空间冗余标记。这种方法确保在每个解码步骤中动态保留关键标记,从而在不影响性能的前提下,实现了推理速度的提升和内存消耗的减少。实验结果表明,DyCoke 不仅在推理速度和内存使用上优于现有的最先进方法,还能在无需额外训练的情况下提升模型性能。

链接: https://arxiv.org/abs/2411.15024
作者: Keda Tao,Can Qin,Haoxuan You,Yang Sui,Huan Wang
关键词-EN: complex video content, Video large language, processing complex video, large language models, significantly advanced recently
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Video large language models (VLLMs) have significantly advanced recently in processing complex video content, yet their inference efficiency remains constrained because of the high computational cost stemming from the thousands of visual tokens generated from the video inputs. We empirically observe that, unlike single image inputs, VLLMs typically attend visual tokens from different frames at different decoding iterations, making a one-shot pruning strategy prone to removing important tokens by mistake. Motivated by this, we present DyCoke, a training-free token compression method to optimize token representation and accelerate VLLMs. DyCoke incorporates a plug-and-play temporal compression module to minimize temporal redundancy by merging redundant tokens across frames, and applies dynamic KV cache reduction to prune spatially redundant tokens selectively. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step. Extensive experimental results demonstrate that DyCoke can outperform the prior SoTA counterparts, achieving 1.5X inference speedup, 1.4X memory reduction against the baseline VLLM, while still improving the performance, with no training.
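As a rough illustration of the temporal compression idea, pruning current-frame tokens that barely change relative to the previous frame, here is a hedged sketch; DyCoke's actual merging rule, threshold, and KV-cache handling are not reproduced here:

```python
import numpy as np

def prune_static_tokens(prev_tokens, cur_tokens, tau=0.9):
    """Keep only current-frame tokens whose cosine similarity to the
    same-position token in the previous frame is below `tau`.
    prev_tokens, cur_tokens: (N, D) arrays of visual tokens."""
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    cur_n = cur_tokens / np.linalg.norm(cur_tokens, axis=1, keepdims=True)
    sim = np.sum(prev_n * cur_n, axis=1)  # per-token cosine similarity
    keep = sim < tau                      # True where the token changed enough
    return cur_tokens[keep], keep
```

A static scene yields near-identical tokens across frames, so most of them are merged away, which is where the reported inference speedup comes from.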
zh

[CV-13] Neural 4D Evolution under Large Topological Changes from 2D Images

【速读】: 该论文试图解决在处理具有显著拓扑变化的4D形状时,现有3D神经进化方法扩展到4D时表现不佳的问题。解决方案的关键在于提出了两个新颖的改进:(i) 一种新的架构用于离散化和编码变形并学习符号距离函数 (SDF),以及 (ii) 一种技术用于施加时间一致性。此外,论文还提出了一种基于高斯喷溅的渲染方案用于颜色预测,并设计了一个学习框架,能够从RGB图像中分离几何和外观信息。这些改进不仅适用于4D形状的进化问题,也为静态场景提供了新的方法。

链接: https://arxiv.org/abs/2411.15018
作者: AmirHossein Naghi Razlighi,Tiago Novello,Asen Nachkov,Thomas Probst,Danda Paudel
关键词-EN: instantaneous flow field, flow field, instantaneous flow, largely differ, target
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 21 figures

点击查看摘要

Abstract:In the literature, it has been shown that the evolution of the known explicit 3D surface to the target one can be learned from 2D images using the instantaneous flow field, where the known and target 3D surfaces may largely differ in topology. We are interested in capturing 4D shapes whose topology changes largely over time. We encounter that the straightforward extension of the existing 3D-based method to the desired 4D case performs poorly. In this work, we address the challenges in extending 3D neural evolution to 4D under large topological changes by proposing two novel modifications. More precisely, we introduce (i) a new architecture to discretize and encode the deformation and learn the SDF and (ii) a technique to impose the temporal consistency. (iii) Also, we propose a rendering scheme for color prediction based on Gaussian splatting. Furthermore, to facilitate learning directly from 2D images, we propose a learning framework that can disentangle the geometry and appearance from RGB images. This method of disentanglement, while also useful for the 4D evolution problem that we are concentrating on, is also novel and valid for static scenes. Our extensive experiments on various data provide awesome results and, most importantly, open a new approach toward reconstructing challenging scenes with significant topological changes and deformations. Our source code and the dataset are publicly available at this https URL.
zh

[CV-14] MSSF: A 4D Radar and Camera Fusion Framework With Multi-Stage Sampling for 3D Object Detection in Autonomous Driving

【速读】: 该论文试图解决4D毫米波雷达点云稀疏和噪声问题,以及现有雷达-相机融合方法在性能上与基于激光雷达(LiDAR)方法之间的差距。解决方案的关键在于提出了一种简单但有效的多阶段采样融合(MSSF)网络,该网络通过设计深度交互点云特征与图像特征的融合块(包括简单特征融合(SFF)和多尺度可变形特征融合(MSDFF)),以及提出语义引导头进行体素特征重加权的前景-背景分割,从而缓解特征模糊问题。实验结果表明,MSSF在View-of-Delft(VoD)和TJ4DRadset数据集上显著提升了3D平均精度(mAP),分别提高了7.0%和4.0%,甚至在VoD数据集上超越了经典的LiDAR方法。

链接: https://arxiv.org/abs/2411.15016
作者: Hongsi Liu,Jun Liu,Guangfeng Jiang,Xin Jin
关键词-EN: precise elevation measurements, recent years, resolution than conventional, elevation measurements, emerged in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:As one of the automotive sensors that have emerged in recent years, 4D millimeter-wave radar has a higher resolution than conventional 3D radar and provides precise elevation measurements. But its point clouds are still sparse and noisy, making it challenging to meet the requirements of autonomous driving. Camera, as another commonly used sensor, can capture rich semantic information. As a result, the fusion of 4D radar and camera can provide an affordable and robust perception solution for autonomous driving systems. However, previous radar-camera fusion methods have not yet been thoroughly investigated, resulting in a large performance gap compared to LiDAR-based methods. Specifically, they ignore the feature-blurring problem and do not deeply interact with image semantic information. To this end, we present a simple but effective multi-stage sampling fusion (MSSF) network based on 4D radar and camera. On the one hand, we design a fusion block that can deeply interact point cloud features with image features, and can be applied to commonly used single-modal backbones in a plug-and-play manner. The fusion block encompasses two types, namely, simple feature fusion (SFF) and multiscale deformable feature fusion (MSDFF). The SFF is easy to implement, while the MSDFF has stronger fusion abilities. On the other hand, we propose a semantic-guided head to perform foreground-background segmentation on voxels with voxel feature re-weighting, further alleviating the problem of feature blurring. Extensive experiments on the View-of-Delft (VoD) and TJ4DRadset datasets demonstrate the effectiveness of our MSSF. Notably, compared to state-of-the-art methods, MSSF achieves a 7.0% and 4.0% improvement in 3D mean average precision on the VoD and TJ4DRadSet datasets, respectively. It even surpasses classical LiDAR-based methods on the VoD dataset.
zh

[CV-15] Differentiable Biomechanics for Markerless Motion Capture in Upper Limb Stroke Rehabilitation: A Comparison with Optical Motion Capture

【速读】: 该论文试图解决在临床环境中使用无标记运动捕捉(Markerless Motion Capture, MMC)技术测量中风患者上肢运动学参数的准确性问题。解决方案的关键在于将可微分的生物力学模型与MMC结合,通过同步摄像头进行数据采集,从而在减少设备需求和数据收集工作量的同时,实现与基于标记的光学运动捕捉(Optical Motion Capture, OMC)相当的高精度运动学测量。研究结果显示,MMC与OMC在关键运动学参数(如关节角度、末端执行器速度和躯干位移)的轨迹上具有高度一致性,支持MMC在临床环境中用于中风患者运动康复评估的潜力。

链接: https://arxiv.org/abs/2411.14992
作者: Tim Unger,Arash Sal Moslehian,J.D. Peiffer,Johann Ullrich,Roger Gassert,Olivier Lambercy,R. James Cotton,Chris Awai Easthope
关键词-EN: Optical Motion Capture, Marker-based Optical Motion, Markerless Motion Capture, Motion Capture, Optical Motion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures, 3 tables, RehabWeek 2025 ICORR, first 3 authors are shared-first and last two authors are shared last

点击查看摘要

Abstract:Marker-based Optical Motion Capture (OMC) paired with biomechanical modeling is currently considered the most precise and accurate method for measuring human movement kinematics. However, combining differentiable biomechanical modeling with Markerless Motion Capture (MMC) offers a promising approach to motion capture in clinical settings, requiring only minimal equipment, such as synchronized webcams, and minimal effort for data collection. This study compares key kinematic outcomes from biomechanically modeled MMC and OMC data in 15 stroke patients performing the drinking task, a functional task recommended for assessing upper limb movement quality. We observed a high level of agreement in kinematic trajectories between MMC and OMC, as indicated by high correlations (median r above 0.95 for the majority of kinematic trajectories) and median RMSE values ranging from 2-5 degrees for joint angles, 0.04 m/s for end-effector velocity, and 6 mm for trunk displacement. Trial-to-trial biases between OMC and MMC were consistent within participant sessions, with interquartile ranges of bias around 1-3 degrees for joint angles, 0.01 m/s in end-effector velocity, and approximately 3mm for trunk displacement. Our findings indicate that our MMC for arm tracking is approaching the accuracy of marker-based methods, supporting its potential for use in clinical settings. MMC could provide valuable insights into movement rehabilitation in stroke patients, potentially enhancing the effectiveness of rehabilitation strategies.
zh

[CV-16] 3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes

【速读】: 该论文试图解决3D高斯拼接(3D Gaussian Splatting, 3DGS)在场景重建中存在的几个关键问题,包括难以精确捕捉硬边缘、难以表示平面表面以及在没有手工正则化的情况下高斯分布不规则等问题。解决方案的关键在于引入了一种新的方法,称为3D凸拼接(3D Convex Splatting, 3DCS),该方法利用3D平滑凸体作为基本单元来建模几何意义明确的辐射场。平滑凸体相比高斯分布提供了更大的灵活性,能够更好地表示具有硬边缘和密集体积的3D场景,同时使用更少的基元。通过基于CUDA的高效光栅化器,3DCS在Mip-NeRF360、Tanks and Temples和Deep Blending等基准测试中表现优于3DGS,具体表现为PSNR提升最多0.81,LPIPS提升0.026,同时保持高渲染速度并减少所需基元的数量。

链接: https://arxiv.org/abs/2411.14974
作者: Jan Held,Renaud Vandeghen,Abdullah Hamdi,Adrien Deliege,Anthony Cioppa,Silvio Giancola,Andrea Vedaldi,Bernard Ghanem,Marc Van Droogenbroeck
关键词-EN: Recent advances, Recent, Gaussians, Convex Splatting, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 13 figures, 10 tables

点击查看摘要

Abstract:Recent advances in radiance field reconstruction, such as 3D Gaussian Splatting (3DGS), have achieved high-quality novel view synthesis and fast rendering by representing scenes with compositions of Gaussian primitives. However, 3D Gaussians present several limitations for scene reconstruction. Accurately capturing hard edges is challenging without significantly increasing the number of Gaussians, creating a large memory footprint. Moreover, they struggle to represent flat surfaces, as they are diffused in space. Without hand-crafted regularizers, they tend to disperse irregularly around the actual surface. To circumvent these issues, we introduce a novel method, named 3D Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for modeling geometrically-meaningful radiance fields from multi-view images. Smooth convex shapes offer greater flexibility than Gaussians, allowing for a better representation of 3D scenes with hard edges and dense volumes using fewer primitives. Powered by our efficient CUDA-based rasterizer, 3DCS achieves superior performance over 3DGS on benchmarks such as Mip-NeRF360, Tanks and Temples, and Deep Blending. Specifically, our method attains an improvement of up to 0.81 in PSNR and 0.026 in LPIPS compared to 3DGS while maintaining high rendering speeds and reducing the number of required primitives. Our results highlight the potential of 3D Convex Splatting to become the new standard for high-quality scene reconstruction and novel view synthesis. Project page: this http URL.
zh

[CV-17] LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement

【速读】: 该论文试图解决在联邦学习(Federated Learning, FL)框架下使用低秩适应(Low-Rank Adaptation, LoRA)进行参数高效微调时面临的两个关键问题:服务器端LoRA聚合偏差(Server-Side LoRA Aggregation Bias)和客户端LoRA初始化漂移(Client-Side LoRA Initialization Drift)。解决方案的关键是提出了一种名为LoRA-FAIR的新方法,通过在服务器端引入一个修正项来同时解决这两个问题,同时保持原有的LoRA模块不变,从而提高了聚合效率和准确性。LoRA-FAIR在计算和通信效率上保持高效,并在大规模数据集上的ViT和MLP-Mixer模型中展示了优于现有最先进方法的性能。

链接: https://arxiv.org/abs/2411.14961
作者: Jieming Bian,Lei Wang,Letian Zhang,Jie Xu
关键词-EN: full parameter fine-tuning, Foundation models, diverse tasks, tasks with task-specific, computationally prohibitive
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning, yet full parameter fine-tuning is often computationally prohibitive for large models. Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters. While LoRA allows for efficient fine-tuning, it requires significant data for adaptation, making Federated Learning (FL) an appealing solution due to its privacy-preserving collaborative framework. However, combining LoRA with FL introduces two key challenges: the Server-Side LoRA Aggregation Bias, where server-side averaging of LoRA matrices diverges from the ideal global update, and the Client-Side LoRA Initialization Drift, emphasizing the need for consistent initialization across rounds. Existing approaches address these challenges individually, limiting their effectiveness. We propose LoRA-FAIR, a novel method that tackles both issues by introducing a correction term on the server while keeping the original LoRA modules, enhancing aggregation efficiency and accuracy. LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods. Experimental results on ViT and MLP-Mixer models across large-scale datasets demonstrate that LoRA-FAIR consistently achieves performance improvements in FL settings.
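The server-side aggregation bias is easy to see numerically: averaging the low-rank factors B_i and A_i separately and then multiplying does not equal averaging the clients' full updates B_i A_i. A minimal sketch (the client count, dimensions, and random data below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k, n_clients = 8, 2, 8, 3  # hypothetical LoRA shapes and client count

# Each client i holds LoRA factors B_i (d x r) and A_i (r x k).
Bs = [rng.normal(size=(d, r)) for _ in range(n_clients)]
As = [rng.normal(size=(r, k)) for _ in range(n_clients)]

# Ideal global update: the average of the clients' full low-rank updates.
ideal = sum(B @ A for B, A in zip(Bs, As)) / n_clients

# Naive server-side aggregation: average B and A separately, then multiply.
naive = (sum(Bs) / n_clients) @ (sum(As) / n_clients)

# The gap between the two is the aggregation bias LoRA-FAIR corrects for.
bias = np.linalg.norm(ideal - naive)
print(f"aggregation bias (Frobenius norm): {bias:.3f}")
```

The bias is generally nonzero because matrix multiplication does not commute with averaging; LoRA-FAIR's server-side correction term targets exactly this gap.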
zh

[CV-18] Design-o-meter: Towards Evaluating and Refining Graphic Designs WACV2025

【速读】: 该论文试图解决自动化评估和改进图形设计质量的问题。解决方案的关键在于提出了Design-o-meter,这是一种数据驱动的方法,用于量化图形设计的优劣,并能够在统一框架内提出改进建议,克服了图形设计领域固有的主观性和模糊性。该方法通过与适应任务的基线(包括基于多模态大语言模型(Multimodal LLM)的最新方法)进行详尽的定量和定性分析,展示了其有效性。

链接: https://arxiv.org/abs/2411.14959
作者: Sahil Goyal,Abhinav Mahajan,Swasti Mishra,Prateksha Udhayanan,Tripti Shukla,K J Joseph,Balaji Vasan Srinivasan
关键词-EN: effective medium, visual communication, Graphic designs, designs, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to WACV 2025. Project page: this https URL

点击查看摘要

Abstract:Graphic designs are an effective medium for visual communication. They range from greeting cards to corporate flyers and beyond. Of late, machine learning techniques are able to generate such designs, which accelerates the rate of content production. An automated way of evaluating their quality becomes critical. Towards this end, we introduce Design-o-meter, a data-driven methodology to quantify the goodness of graphic designs. Further, our approach can suggest modifications to these designs to improve their visual appeal. To the best of our knowledge, Design-o-meter is the first approach that scores and refines designs in a unified framework despite the inherent subjectivity and ambiguity of the setting. Our exhaustive quantitative and qualitative analysis of our approach against baselines adapted for the task (including recent Multimodal LLM-based approaches) brings out the efficacy of our methodology. We hope our work will usher in more interest in this important and pragmatic problem setting.
zh

[CV-19] Evaluating Vision Transformer Models for Visual Quality Control in Industrial Manufacturing

【速读】: 该论文试图解决在工业制造中使用机器学习进行缺陷产品早期检测时,如何选择合适的视觉骨干网络和异常检测算法组合的问题。解决方案的关键在于通过评估当前最先进的视觉Transformer模型与异常检测方法的组合,找到既高效又快速的模型,以适应工业制造中的质量控制系统。论文通过实验在MVTecAD和BTAD数据集上验证了不同组合的性能,并提供了基于具体使用场景和硬件限制选择合适模型架构的指导方针。

链接: https://arxiv.org/abs/2411.14953
作者: Miriam Alber,Christoph Hönes,Patrick Baier
关键词-EN: quality control system, quality control, control system, visual quality control, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:One of the most promising use-cases for machine learning in industrial manufacturing is the early detection of defective products using a quality control system. Such a system can save costs and reduces human errors due to the monotonous nature of visual inspections. Today, a rich body of research exists which employs machine learning methods to identify rare defective products in unbalanced visual quality control datasets. These methods typically rely on two components: A visual backbone to capture the features of the input image and an anomaly detection algorithm that decides if these features are within an expected distribution. With the rise of transformer architecture as visual backbones of choice, there exists now a great variety of different combinations of these two components, ranging all along the trade-off between detection quality and inference time. Facing this variety, practitioners in the field often have to spend a considerable amount of time on researching the right combination for their use-case at hand. Our contribution is to help practitioners with this choice by reviewing and evaluating current vision transformer models together with anomaly detection methods. For this, we chose SotA models of both disciplines, combined them and evaluated them towards the goal of having small, fast and efficient anomaly detection models suitable for industrial manufacturing. We evaluated the results of our experiments on the well-known MVTecAD and BTAD datasets. Moreover, we give guidelines for choosing a suitable model architecture for a quality control system in practice, considering given use-case and hardware constraints.
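The two-component pattern described above, a visual backbone producing features plus an anomaly detector deciding whether they lie in the expected distribution, can be sketched with a kNN score over nominal training features. The backbone is stubbed out here; any ViT feature extractor would slot in, and the kNN rule is just one common choice of detector, not a recommendation from the paper:

```python
import numpy as np

class KNNAnomalyDetector:
    """Score a sample by its mean distance to the k nearest nominal
    (defect-free) feature vectors; a high score suggests an anomaly."""
    def __init__(self, k=3):
        self.k = k
        self.bank = None

    def fit(self, nominal_features):
        self.bank = np.asarray(nominal_features)  # (N, D) features of good parts
        return self

    def score(self, feature):
        dists = np.linalg.norm(self.bank - feature, axis=1)
        return float(np.sort(dists)[: self.k].mean())
```

In practice, the trade-off the paper evaluates lives in the choice of backbone (size and speed) and detector (memory-bank size, distance computation), which together determine inference time on factory hardware.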
zh

[CV-20] Morph: A Motion-free Physics Optimization Framework for Human Motion Generation

【速读】: 该论文试图解决现有方法在生成人类运动时忽视物理约束,导致产生不切实际的运动(如漂浮和脚滑)的问题。解决方案的关键在于提出了一个名为 Morph 的 Motion-free physics optimization framework,该框架包括一个运动生成器和一个运动物理优化模块。运动生成器负责提供大规模的合成运动数据,而运动物理优化模块则在物理模拟器中利用这些数据训练一个运动模仿器,通过强制物理约束将噪声运动投影到物理上可行的空间。这些经过物理优化的运动随后用于微调运动生成器,从而进一步提升其能力。实验结果表明,该框架在文本到运动和音乐到舞蹈生成任务中不仅达到了最先进的生成质量,还显著提高了运动的物理合理性。

链接: https://arxiv.org/abs/2411.14951
作者: Zhuo Li,Mingshuang Luo,Ruibing Hou,Xin Zhao,Hao Liu,Hong Chang,Zimo Liu,Chen Li
关键词-EN: humanoid robot control, Motion Physics Refinement, Human motion generation, Physics Refinement module, digital humans
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:Human motion generation plays a vital role in applications such as digital humans and humanoid robot control. However, most existing approaches disregard physics constraints, leading to the frequent production of physically implausible motions with pronounced artifacts such as floating and foot sliding. In this paper, we propose Morph, a Motion-free physics optimization framework, comprising a Motion Generator and a Motion Physics Refinement module, for enhancing physical plausibility without relying on costly real-world motion data. Specifically, the Motion Generator is responsible for providing large-scale synthetic motion data, while the Motion Physics Refinement Module utilizes these synthetic data to train a motion imitator within a physics simulator, enforcing physical constraints to project the noisy motions into a physically-plausible space. These physically refined motions, in turn, are used to fine-tune the Motion Generator, further enhancing its capability. Experiments on both text-to-motion and music-to-dance generation tasks demonstrate that our framework achieves state-of-the-art motion generation quality while improving physical plausibility drastically.
zh

[CV-21] Reliable Evaluation of Attribution Maps in CNNs: A Perturbation-Based Approach

【速读】: 该论文试图解决卷积神经网络 (CNN) 解释中归因图 (attribution maps) 评估的可靠性问题。现有广泛使用的插入/删除度量方法容易受到分布偏移的影响,导致归因图排序的可靠性下降。论文提出的解决方案关键在于用对抗扰动 (adversarial perturbations) 替代像素修改,从而构建一个更为稳健的评估框架。通过引入平滑性和单调性度量,该方法有效纠正了分布偏移问题。此外,论文还通过基线归因图 (baseline attribution maps) 进行合理性检查,并使用 Kendall’s τ 秩相关系数展示了该度量在15种数据集-架构组合中的增强一致性。研究结果表明,SmoothGrad 是目前最优的归因图。该研究为归因图的发展提供了可靠且一致的评估框架。

链接: https://arxiv.org/abs/2411.14946
作者: Lars Nieradzik,Henrike Stephani,Janis Keuper
关键词-EN: convolutional neural networks, evaluating attribution maps, attribution maps, neural networks, play a central
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we present an approach for evaluating attribution maps, which play a central role in interpreting the predictions of convolutional neural networks (CNNs). We show that the widely used insertion/deletion metrics are susceptible to distribution shifts that affect the reliability of the ranking. Our method proposes to replace pixel modifications with adversarial perturbations, which provides a more robust evaluation framework. By using smoothness and monotonicity measures, we illustrate the effectiveness of our approach in correcting distribution shifts. In addition, we conduct the most comprehensive quantitative and qualitative assessment of attribution maps to date. Introducing baseline attribution maps as sanity checks, we find that our metric is the only contender to pass all checks. Using Kendall’s τ rank correlation coefficient, we show the increased consistency of our metric across 15 dataset-architecture combinations. Of the 16 attribution maps tested, our results clearly show SmoothGrad to be the best map currently available. This research makes an important contribution to the development of attribution maps by providing a reliable and consistent evaluation framework. To ensure reproducibility, we will provide the code along with our results.
zh
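该论文用 Kendall’s τ 来衡量同一评估度量在不同数据集-架构组合下给出的归因图排序是否一致。下面用纯 Python 写一个无并列情形的简化 τ 计算(示意性实现,排名数据为虚构示例,并非论文代码):

```python
import itertools

def kendall_tau(a, b):
    """计算两组排名之间的 Kendall's tau 秩相关系数(假设无并列)。
    tau = (一致对数 - 不一致对数) / 总对数,取值范围 [-1, 1]。"""
    assert len(a) == len(b)
    n = len(a)
    concordant = discordant = 0
    for i, j in itertools.combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# 两个数据集-架构组合下,同一度量给出的 5 张归因图排名
ranking_1 = [1, 2, 3, 4, 5]
ranking_2 = [1, 2, 3, 5, 4]  # 仅末尾两名互换
print(kendall_tau(ranking_1, ranking_1))  # 完全一致 -> 1.0
print(kendall_tau(ranking_1, ranking_2))  # 一对不一致 -> 0.8
```

τ 越接近 1,说明该评估度量跨组合的排序越稳定,这正是论文用来论证其度量一致性的统计量。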

[CV-22] LiDAR-based End-to-end Temporal Perception for Vehicle-Infrastructure Cooperation

【速读】: 该论文试图解决自动驾驶中时间感知(Temporal Perception)的挑战,特别是由于遮挡物体和观测盲点导致的感知不完整问题。解决方案的关键在于引入LET-VIC,一个基于激光雷达(LiDAR)的端到端车辆-基础设施合作(Vehicle-Infrastructure Cooperation, VIC)跟踪框架。LET-VIC通过利用车联网(Vehicle-to-Everything, V2X)通信,融合来自车辆和基础设施传感器的时间和空间数据,从而增强时间感知能力。其核心创新包括:1) 空间上整合车辆和基础设施侧的鸟瞰图(Bird’s Eye View, BEV)特征,以减轻遮挡并补偿盲点;2) 时间上利用跨帧的历史数据,提升跟踪的稳定性和准确性;3) 引入校准误差补偿(Calibration Error Compensation, CEC)模块,解决传感器对齐问题,确保特征对齐的精确性。实验结果表明,LET-VIC在V2X-Seq-SPD数据集上显著优于基线模型,在mAP和AMOTA指标上分别提升了至少13.7%和13.1%。

链接: https://arxiv.org/abs/2411.14927
作者: Zhenwei Yang,Jilei Mao,Wenxian Yang,Yibo Ai,Yu Kong,Haibao Yu,Weidong Zhang
关键词-EN: dynamic environments, ability to detect, detect and track, understanding of dynamic, Temporal perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Temporal perception, the ability to detect and track objects over time, is critical in autonomous driving for maintaining a comprehensive understanding of dynamic environments. However, this task is hindered by significant challenges, including incomplete perception caused by occluded objects and observational blind spots, which are common in single-vehicle perception systems. To address these issues, we introduce LET-VIC, a LiDAR-based End-to-End Tracking framework for Vehicle-Infrastructure Cooperation (VIC). LET-VIC leverages Vehicle-to-Everything (V2X) communication to enhance temporal perception by fusing spatial and temporal data from both vehicle and infrastructure sensors. First, it spatially integrates Bird’s Eye View (BEV) features from vehicle-side and infrastructure-side LiDAR data, creating a comprehensive view that mitigates occlusions and compensates for blind spots. Second, LET-VIC incorporates temporal context across frames, allowing the model to leverage historical data for enhanced tracking stability and accuracy. To further improve robustness, LET-VIC includes a Calibration Error Compensation (CEC) module to address sensor misalignments and ensure precise feature alignment. Experiments on the V2X-Seq-SPD dataset demonstrate that LET-VIC significantly outperforms baseline models, achieving at least a 13.7% improvement in mAP and a 13.1% improvement in AMOTA without considering communication delays. This work offers a practical solution and a new research direction for advancing temporal perception in autonomous driving through vehicle-infrastructure cooperation.
zh
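LET-VIC 的空间融合思路是:先补偿车端与路端传感器的标定偏移,再把两侧 BEV 特征合并以消除盲区。下面用 numpy 做一个极简示意,其中"整数平移近似标定补偿"和"逐元素取最大值融合"都是笔者的假设性简化,并非论文的 CEC 模块实现:

```python
import numpy as np

def compensate_offset(bev, dx, dy):
    """用整数格网平移粗略模拟标定误差补偿(真实 CEC 模块远比这复杂)。"""
    return np.roll(np.roll(bev, dy, axis=0), dx, axis=1)

def fuse_bev(vehicle_bev, infra_bev, dx=0, dy=0):
    """先对齐路端 BEV,再逐元素取最大值,让路端特征补全车端盲区。"""
    aligned = compensate_offset(infra_bev, dx, dy)
    return np.maximum(vehicle_bev, aligned)

vehicle = np.zeros((4, 4)); vehicle[0, 0] = 1.0  # 车端只看到一个目标
infra = np.zeros((4, 4)); infra[2, 1] = 1.0      # 路端看到车端盲区中的目标(带 1 格偏移)
fused = fuse_bev(vehicle, infra, dx=1, dy=0)
print(fused[0, 0], fused[2, 2])  # 两个目标都出现在融合后的 BEV 中 -> 1.0 1.0
```

可以看出,只有先做偏移补偿,路端目标才会落在融合 BEV 的正确位置上,这正是论文强调 CEC 模块的原因。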

[CV-23] Boundless Across Domains: A New Paradigm of Adaptive Feature and Cross-Attention for Domain Generalization in Medical Image Segmentation

【速读】: 该论文试图解决领域泛化(Domain Generalization)中领域不变表示学习(Domain-invariant Representation Learning)的问题,特别是在高维数据处理中面临的计算需求高、训练不稳定及特征丢失等挑战。解决方案的关键在于提出了一种基于跨通道注意力机制(Cross-channel Attention Mechanism)的方法,通过将源域的深度特征作为查询,生成域的深度特征作为键和值,重建出鲁棒的正则化表示,从而形成显式约束,指导模型学习领域不变的表示。此外,论文还提出了自适应特征混合(Adaptive Feature Blending, AFB)方法,通过生成超出分布的样本,扩展了训练样本的多样性,进一步提升了模型的泛化能力。

链接: https://arxiv.org/abs/2411.14883
作者: Yuheng Xu,Taiping Zhang
关键词-EN: Domain-invariant representation learning, deep features, Adaptive Feature Blending, representation learning, features
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:Domain-invariant representation learning is a powerful method for domain generalization. Previous approaches face challenges such as high computational demands, training instability, and limited effectiveness with high-dimensional data, potentially leading to the loss of valuable features. To address these issues, we hypothesize that an ideal generalized representation should exhibit similar pattern responses within the same channel across cross-domain images. Based on this hypothesis, we use deep features from the source domain as queries, and deep features from the generated domain as keys and values. Through a cross-channel attention mechanism, the original deep features are reconstructed into robust regularization representations, forming an explicit constraint that guides the model to learn domain-invariant representations. Additionally, style augmentation is another common method. However, existing methods typically generate new styles through convex combinations of source domains, which limits the diversity of training samples by confining the generated styles to the original distribution. To overcome this limitation, we propose an Adaptive Feature Blending (AFB) method that generates out-of-distribution samples while exploring the in-distribution space, significantly expanding the domain range. Extensive experimental results demonstrate that our proposed methods achieve superior performance on two standard domain generalization benchmarks for medical image segmentation.
zh
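论文把源域深度特征作为 query、生成域深度特征作为 key/value,在通道维上做跨通道注意力来重建正则化表示。下面是按这一描述写的单头 numpy 草图(特征维度、缩放方式均为笔者假设,仅示意计算结构):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_channel_attention(src_feat, gen_feat):
    """src_feat, gen_feat: (C, N),C 为通道数,N 为展平后的空间位置数。
    以源域特征为 query、生成域特征为 key/value,在通道维上做注意力。"""
    C, N = src_feat.shape
    scores = src_feat @ gen_feat.T / np.sqrt(N)  # (C, C) 通道间相似度
    attn = softmax(scores, axis=-1)              # 每个源通道对生成域通道的权重
    return attn @ gen_feat                       # 重建出的正则化表示 (C, N)

rng = np.random.default_rng(0)
src = rng.normal(size=(8, 16))
gen = rng.normal(size=(8, 16))
out = cross_channel_attention(src, gen)
print(out.shape)  # (8, 16)
```

重建结果与原特征之间的差异即可作为显式约束,引导模型学习"同通道跨域响应相似"的领域不变表示。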

[CV-24] Implementation of Real-Time Lane Detection on Autonomous Mobile Robot

【速读】: 该论文试图解决在自主移动机器人上实现实时车道检测的问题,解决方案的关键在于采用基于学习的Ultra Fast Lane Detection算法,并通过将其转换为TensorRT模型在Jetson Nano平台上进行优化,以显著提高数据处理速度和准确性。实验结果表明,优化后的算法在处理速度上比之前的ONNX模型快约22倍,但在室内数据集上的准确性仍有待提升,未来工作应着重于迁移学习和微调以改善室内车道检测的性能。

链接: https://arxiv.org/abs/2411.14873
作者: Midriem Mirdanies,Roni Permana Saputra,Edwar Yazid,Rozeha A. Rashid
关键词-EN: Autonomous Mobile Robot, Mobile Robot, lane detection algorithm, Autonomous Mobile, learning-based lane detection
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 9 figures, 2 tables

点击查看摘要

Abstract:This paper describes the implementation of a learning-based lane detection algorithm on an Autonomous Mobile Robot. It aims to implement the Ultra Fast Lane Detection algorithm for real-time application on the SEATER P2MC-BRIN prototype using a camera and optimize its performance on the Jetson Nano platform. Preliminary experiments were conducted to evaluate the algorithm’s performance in terms of data processing speed and accuracy using two types of datasets: outdoor using a public dataset and indoor using an internal dataset from the indoor area of the BRIN Workshop Building in Bandung. The experiments revealed that the algorithm runs more optimally on the Jetson Nano platform after conversion to TensorRT compared to the ONNX model, achieving processing speeds of approximately 101 ms using CULane and 105 ms using TuSimple, which is about 22 times faster than the previous model. While the algorithm demonstrates good accuracy on the outdoor public dataset, its performance falls short on the indoor dataset. Future work should focus on transfer learning and fine-tuning to enhance indoor lane detection accuracy.
zh


[CV-25] BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

【速读】: 该论文试图解决现有3D感知算法主要依赖点云数据所面临的稀疏性、噪声和数据稀缺性问题。解决方案的关键在于引入了一种以图像为中心的3D感知模型BIP3D,通过利用预训练的2D视觉基础模型增强语义理解,并引入空间增强模块提升空间理解能力,从而实现多视角、多模态特征融合和端到端的3D感知。这一方法在EmbodiedScan基准测试中显著优于当前最先进的结果,分别在3D检测任务和3D视觉定位任务中提升了5.69%和15.25%。

链接: https://arxiv.org/abs/2411.14869
作者: Xuewu Lin,Tianwei Lin,Lichao Huang,Hongyu Xie,Zhizhong Su
关键词-EN: embodied intelligence systems, intelligence systems, surrounding environments, embodied intelligence, key component
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In embodied intelligence systems, a key component is 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point cloud, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
zh

[CV-26] Defective Edge Detection Using Cascaded Ensemble Canny Operator

【速读】: 该论文试图解决计算机视觉中边缘检测的难题,特别是在复杂场景图片中准确识别物体边缘的问题。现有基于集成学习、结合多种骨干网络与注意力模块的方法虽优于 Sobel、Canny 等传统算子,但面对复杂场景仍存在边缘不精细、误检等不足。解决方案的关键在于使用级联集成 Canny 算子(Cascaded Ensemble Canny operator)来检测物体边缘。在具有挑战性的 Fresh and Rotten 和 Berkeley 数据集上,该方法的性能指标和输出图像质量均优于现有的边缘检测网络。

链接: https://arxiv.org/abs/2411.14868
作者: Anjali Nambiyar Rajkumar Kannan
关键词-EN: real-world images including, images including objects, Canny edge detection, kinds and sizes, Edge detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2 Pages and 2 Figures

点击查看摘要

Abstract:Edge detection has been one of the most difficult challenges in computer vision because of the difficulty in identifying the borders and edges from the real-world images including objects of varying kinds and sizes. Methods based on ensemble learning, which use a combination of backbones and attention modules, outperformed more conventional approaches, such as Sobel and Canny edge detection. Nevertheless, these algorithms are still challenged when faced with complicated scene photos. In addition, the identified edges utilizing the current methods are not refined and often include incorrect edges. In this work, we used a Cascaded Ensemble Canny operator to solve these problems and detect the object edges. The most difficult Fresh and Rotten and Berkeley datasets are used to test the suggested approach in Python. In terms of performance metrics and output picture quality, the acquired results outperform the specified edge detection networks.
zh
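论文未公开级联集成 Canny 算子的细节。作为思路示意,下面用 numpy 实现一个朴素的"集成式"边缘检测:对 Sobel 梯度幅值取多个阈值各得一张二值边缘图,再多数投票,使保留的边缘更稳健(阈值集合与投票规则均为笔者假设,与论文方法不同):

```python
import numpy as np

def sobel_magnitude(img):
    """朴素的 Sobel 梯度幅值计算(边界处置零)。"""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h, w)); gy = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

def ensemble_edges(img, thresholds=(0.5, 1.0, 2.0)):
    """多个阈值各产生一张二值边缘图,多数投票后输出最终边缘。"""
    mag = sobel_magnitude(img)
    votes = sum((mag > t).astype(int) for t in thresholds)
    return votes >= (len(thresholds) + 1) // 2

img = np.zeros((8, 8)); img[:, 4:] = 1.0  # 左暗右亮的竖直台阶
edges = ensemble_edges(img)
print(edges[:, 3:5].any(), edges[:, :2].any())  # 台阶处有边缘 / 平坦区无 -> True False
```

多检测器投票正是集成边缘检测能抑制单一阈值误检的直观原因。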

[CV-27] Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image Translation

【速读】: 该论文试图解决扩散模型(Diffusion Models, DMs)在图像生成与反演过程中所需的神经函数评估次数(NFEs)过多的问题,这限制了其在实际应用中的效率。解决方案的关键在于引入薛定谔桥(Schrodinger Bridges, SBs),这是一种在具有最小传输成本的分布之间建立随机微分方程(SDEs)的方法。论文分析了SB的概率流常微分方程(ODE)形式,并发现其向量场可以分解为源预测器、目标预测器和噪声预测器的线性组合。基于这一发现,论文提出了潜在薛定谔桥(Latent Schrodinger Bridges, LSBs),通过预训练的稳定扩散模型来近似SB的ODE,并开发了适当的提示优化和变量变换公式,以匹配训练和推理过程中的分布。实验结果表明,该算法在无监督设置下成功实现了具有竞争力的图像到图像(I2I)翻译,且计算成本仅为先前基于DM的I2I方法的一小部分。

链接: https://arxiv.org/abs/2411.14863
作者: Jeongsol Kim,Beomsu Kim,Jong Chul Ye
关键词-EN: inspired powerful unpaired, inversion from data, powerful unpaired, Latent Schrodinger Bridges, Schrodinger Bridges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion models (DMs), which enable both image generation from noise and inversion from data, have inspired powerful unpaired image-to-image (I2I) translation algorithms. However, they often require a larger number of neural function evaluations (NFEs), limiting their practical applicability. In this paper, we tackle this problem with Schrodinger Bridges (SBs), which are stochastic differential equations (SDEs) between distributions with minimal transport cost. We analyze the probability flow ordinary differential equation (ODE) formulation of SBs, and observe that we can decompose its vector field into a linear combination of source predictor, target predictor, and noise predictor. Inspired by this observation, we propose Latent Schrodinger Bridges (LSBs) that approximate the SB ODE via pre-trained Stable Diffusion, and develop appropriate prompt optimization and change of variables formula to match the training and inference between distributions. We demonstrate that our algorithm successfully conducts competitive I2I translation in unsupervised setting with only a fraction of the computation cost required by previous DM-based I2I methods.
zh
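论文的关键观察是:SB 概率流 ODE 的向量场可分解为源预测器、目标预测器与噪声预测器的线性组合,再沿 ODE 数值积分即可完成翻译。下面用 numpy 给出一个一维玩具示意,其中的组合系数与预测器形式均为笔者假设,仅演示"线性组合 + Euler 积分"这一结构:

```python
import numpy as np

def vector_field(x, t, src_pred, tgt_pred, noise_pred):
    """玩具设定:向量场 = 三个预测器的线性组合,系数随 t 变化。"""
    return (1 - t) * src_pred(x) + t * tgt_pred(x) + 0.0 * noise_pred(x)

def euler_integrate(x0, steps=100, **preds):
    """沿概率流 ODE 做前向 Euler 积分,从 t=0 走到 t=1。"""
    x = np.asarray(x0, float)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * vector_field(x, k * dt, **preds)
    return x

# 玩具预测器:分别把样本拉向源均值 0 与目标均值 5
src_pred = lambda x: 0.0 - x
tgt_pred = lambda x: 5.0 - x
noise_pred = lambda x: np.zeros_like(x)
x1 = euler_integrate(np.zeros(4), src_pred=src_pred,
                     tgt_pred=tgt_pred, noise_pred=noise_pred)
print(np.allclose(x1, x1[0]))  # 确定性 ODE:所有样本走同一条轨迹 -> True
```

此例对应 ODE dx/dt = 5t - x,解析解为 x(1) = 5/e ≈ 1.84,Euler 结果应落在其附近;在 LSB 中,三个预测器由预训练 Stable Diffusion 给出而非解析式。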

[CV-28] Dynamics-Aware Gaussian Splatting Streaming Towards Fast On-the-Fly Training for 4D Reconstruction WWW

【速读】: 该论文试图解决现有4D动态空间重建方法在处理多视角视频输入时,缺乏在线迭代重建能力的问题。现有方法主要依赖于处理完整的多视角视频,而忽略了动态和静态特征的区别以及场景中的时间连续性。论文提出的解决方案是一个三阶段流水线,包括选择性继承阶段(selective inheritance stage)以保持时间连续性,动态感知位移阶段(dynamics-aware shift stage)用于区分动态和静态基元并优化其运动,以及误差引导的密集化阶段(error-guided densification stage)以适应新出现的物体。这一方法显著提高了在线4D重建的速度和质量,实现了实时渲染能力。

链接: https://arxiv.org/abs/2411.14847
作者: Zhening Liu,Yingdong Hu,Xinjie Zhang,Jiawei Shao,Zehong Lin,Jun Zhang
关键词-EN: multi-view visual inputs, Gaussian Splatting, visual inputs, recent development, led to great
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:The recent development of 3D Gaussian Splatting (3DGS) has led to great interest in 4D dynamic spatial reconstruction from multi-view visual inputs. While existing approaches mainly rely on processing full-length multi-view videos for 4D reconstruction, there has been limited exploration of iterative online reconstruction methods that enable on-the-fly training and per-frame streaming. Current 3DGS-based streaming methods treat the Gaussian primitives uniformly and constantly renew the densified Gaussians, thereby overlooking the difference between dynamic and static features and also neglecting the temporal continuity in the scene. To address these limitations, we propose a novel three-stage pipeline for iterative streamable 4D dynamic spatial reconstruction. Our pipeline comprises a selective inheritance stage to preserve temporal continuity, a dynamics-aware shift stage for distinguishing dynamic and static primitives and optimizing their movements, and an error-guided densification stage to accommodate emerging objects. Our method achieves state-of-the-art performance in online 4D reconstruction, demonstrating a 20% improvement in on-the-fly training speed, superior representation quality, and real-time rendering capability. Project page: this https URL
zh

[CV-29] Physically Interpretable Probabilistic Domain Characterization

【速读】: 该论文试图解决现有方法在动态环境中对操作域(operational domain)进行描述时,仅通过回归或分类问题提供有限总结性描述的问题。解决方案的关键在于将操作域表征为概率分布,具体通过使用归一化流(normalizing flows)估计物理参数的分布,从而预测车载摄像头捕捉图像中的不同天气条件的可能性。这种方法不仅提供了绝对表征(absolute characterization),即物理参数的分布,还允许相对表征(relative characterization),即与预定义域的比较,从而评估系统在目标域中安全操作的能力。

链接: https://arxiv.org/abs/2411.14827
作者: Anaïs Halin,Sébastien Piérard,Renaud Vandeghen,Benoît Gérin,Maxime Zanella,Martin Colot,Jan Held,Anthony Cioppa,Emmanuel Jean,Gianluca Bontempi,Saïd Mahmoudi,Benoît Macq,Marc Van Droogenbroeck
关键词-EN: analyzing dynamic environments, models analyzing dynamic, essential for models, models analyzing, adapt to evolving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Characterizing domains is essential for models analyzing dynamic environments, as it allows them to adapt to evolving conditions or to hand the task over to backup systems when facing conditions outside their operational domain. Existing solutions typically characterize a domain by solving a regression or classification problem, which limits their applicability as they only provide a limited summarized description of the domain. In this paper, we present a novel approach to domain characterization by characterizing domains as probability distributions. Particularly, we develop a method to predict the likelihood of different weather conditions from images captured by vehicle-mounted cameras by estimating distributions of physical parameters using normalizing flows. To validate our proposed approach, we conduct experiments within the context of autonomous vehicles, focusing on predicting the distribution of weather parameters to characterize the operational domain. This domain is characterized by physical parameters (absolute characterization) and arbitrarily predefined domains (relative characterization). Finally, we evaluate whether a system can safely operate in a target domain by comparing it to multiple source domains where safety has already been established. This approach holds significant potential, as accurate weather prediction and effective domain adaptation are crucial for autonomous systems to adjust to dynamic environmental conditions.
zh
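归一化流估计密度的核心是变量替换公式 log p(x) = log p(z) + log|det ∂z/∂x|。下面用一个一维仿射流示意如何据此给物理参数(如温度)赋概率密度;流的参数为虚构示例,与论文模型无关:

```python
import numpy as np

def affine_flow_logpdf(x, shift, scale):
    """仿射流 z = (x - shift) / scale,基分布为标准正态;
    log p(x) = log N(z; 0, 1) - log(scale),后一项即雅可比行列式项。"""
    z = (x - shift) / scale
    log_base = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    return log_base - np.log(scale)

# 假设流学到了"该域温度 ~ N(20, 5^2)"这一分布
logp_20 = affine_flow_logpdf(20.0, shift=20.0, scale=5.0)
logp_40 = affine_flow_logpdf(40.0, shift=20.0, scale=5.0)
print(logp_20 > logp_40)  # 域内典型温度比极端温度更可能 -> True
```

有了这样的密度,就可以像论文那样把"目标域是否安全"转化为目标域分布与已验证源域分布的比较问题。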

[CV-30] Omni-IML: Towards Unified Image Manipulation Localization

【速读】: 该论文试图解决图像篡改定位 (Image Manipulation Localization, IML) 方法在不同图像类型上的泛化能力问题。现有方法通常针对特定任务设计,导致在处理不同类型的图像时性能显著下降,甚至联合训练多个图像类型也会导致性能退化,增加了实际应用中的维护成本和误分类风险。解决方案的关键在于提出了Omni-IML,这是一个通用模型,通过采用模态门编码器 (Modal Gate Encoder) 和动态权重解码器 (Dynamic Weight Decoder) 来自适应地确定每个样本的最佳编码模态和解码器滤波器,同时引入异常增强模块 (Anomaly Enhancement module) 来强化篡改区域的特征,帮助模型提取跨不同IML任务的通用特征。实验结果表明,Omni-IML在自然图像、文档图像和人脸图像三个主要场景中均实现了最先进的性能,为实际应用和未来研究提供了宝贵的策略和见解。

链接: https://arxiv.org/abs/2411.14823
作者: Chenfan Qu,Yiwu Zhong,Fengjun Guo,Lianwen Jin
关键词-EN: Image Manipulation Localization, posing significant risks, Image manipulation, Manipulation Localization, visual content
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Image manipulation can lead to misinterpretation of visual content, posing significant risks to information security. Image Manipulation Localization (IML) has thus received increasing attention. However, existing IML methods rely heavily on task-specific designs, making them perform well only on one target image type but are mostly random guessing on other image types, and even joint training on multiple image types causes significant performance degradation. This hinders the deployment for real applications as it notably increases maintenance costs and the misclassification of image types leads to serious error accumulation. To this end, we propose Omni-IML, the first generalist model to unify diverse IML tasks. Specifically, Omni-IML achieves generalism by adopting the Modal Gate Encoder and the Dynamic Weight Decoder to adaptively determine the optimal encoding modality and the optimal decoder filters for each sample. We additionally propose an Anomaly Enhancement module that enhances the features of tampered regions with box supervision and helps the generalist model to extract common features across different IML tasks. We validate our approach on IML tasks across three major scenarios: natural images, document images, and face images. Without bells and whistles, our Omni-IML achieves state-of-the-art performance on all three tasks with a single unified model, providing valuable strategies and insights for real-world application and future research in generalist image forensics. Our code will be publicly available.
zh

[CV-31] Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

【速读】: 该论文试图解决无人机(UAV)跨视图地理定位(Cross-View Geo-Localization, CVGL)中由于倾斜无人机图像与俯视卫星图像之间的视角差异带来的挑战。现有方法依赖于标注数据集的监督来提取视角不变特征,但这些方法训练成本高且容易过拟合特定区域的特征,导致在新区域中的泛化能力有限。论文提出的解决方案关键在于:通过将场景表示提升到三维空间,从无人机观测中生成卫星图像,从而提供对视角畸变具有鲁棒性的表示。具体来说,该方法通过生成与卫星视图相似的正交图像,减少了特征表示中的视角差异,并缓解了区域特定图像配对的捷径问题。此外,设计了一种迭代相机姿态更新机制,逐步调整渲染查询图像以匹配潜在的卫星目标,消除与参考图像之间的空间偏移。这种迭代优化策略通过跨迭代的视图一致性融合,增强了跨视图特征的不变性。最终,该无监督方法避免了区域特定过拟合问题,无需特征微调或数据驱动训练即可实现通用CVGL。实验结果表明,该方法在多个数据集上显著提高了地理定位精度,并在不同区域中保持了鲁棒性,甚至在无模型微调或配对训练的情况下,性能与最新的监督方法相当。

链接: https://arxiv.org/abs/2411.14816
作者: Haoyuan Li,Chang Xu,Wen Yang,Li Mi,Huai Yu,Haijian Zhang
关键词-EN: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, presents significant challenges, significant challenges due
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: 13 pages

点击查看摘要

Abstract:Unmanned Aerial Vehicle (UAV) Cross-View Geo-Localization (CVGL) presents significant challenges due to the view discrepancy between oblique UAV images and overhead satellite images. Existing methods heavily rely on the supervision of labeled datasets to extract viewpoint-invariant features for cross-view retrieval. However, these methods have expensive training costs and tend to overfit the region-specific cues, showing limited generalizability to new regions. To overcome this issue, we propose an unsupervised solution that lifts the scene representation to 3d space from UAV observations for satellite image generation, providing robust representation against view distortion. By generating orthogonal images that closely resemble satellite views, our method reduces view discrepancies in feature representation and mitigates shortcuts in region-specific image pairing. To further align the rendered image’s perspective with the real one, we design an iterative camera pose updating mechanism that progressively modulates the rendered query image with potential satellite targets, eliminating spatial offsets relative to the reference images. Additionally, this iterative refinement strategy enhances cross-view feature invariance through view-consistent fusion across iterations. As such, our unsupervised paradigm naturally avoids the problem of region-specific overfitting, enabling generic CVGL for UAV images without feature fine-tuning or data-driven training. Experiments on the University-1652 and SUES-200 datasets demonstrate that our approach significantly improves geo-localization accuracy while maintaining robustness across diverse regions. Notably, without model fine-tuning or paired training, our method achieves competitive performance with recent supervised methods.
zh

[CV-32] High-Resolution Image Synthesis via Next-Token Prediction

【速读】: 该论文试图解决在高分辨率文本到图像生成任务中,如何通过下一个token预测实现高效且连续的分辨率学习问题。解决方案的关键在于引入了一种扩展的联合嵌入预测架构(D-JEPA·T2I),该架构结合了流匹配损失(flow matching loss),并利用多模态视觉Transformer(multimodal visual transformer)有效整合文本和视觉特征。此外,采用了视觉旋转位置嵌入(Visual Rotary Positional Embedding, VoPE)来促进连续分辨率学习,并设计了一种数据反馈机制以显著提高数据利用效率。通过这些创新,论文首次实现了通过下一个token预测进行的高分辨率图像合成,达到了当前最先进的水平。

链接: https://arxiv.org/abs/2411.14808
作者: Dengsheng Chen,Jie Hu,Tiezhu Yue,Xiaoming Wei
关键词-EN: Joint-Embedding Predictive Architecture, Predictive Architecture, demonstrated outstanding performance, Joint-Embedding Predictive, demonstrated outstanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages

点击查看摘要

Abstract:Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), an autoregressive model, has demonstrated outstanding performance in class-conditional image generation. However, the application of next-token prediction in high-resolution text-to-image generation remains underexplored. In this paper, we introduce D-JEPA·T2I, an extension of D-JEPA incorporating flow matching loss, designed to enable data-efficient continuous resolution learning. D-JEPA·T2I leverages a multimodal visual transformer to effectively integrate textual and visual features and adopts Visual Rotary Positional Embedding (VoPE) to facilitate continuous resolution learning. Furthermore, we devise a data feedback mechanism that significantly enhances data utilization efficiency. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction. The experimental code and pretrained models will be open-sourced at this https URL.
zh
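VoPE 是旋转位置编码(RoPE)面向视觉连续分辨率的扩展;论文未给出 VoPE 细节,下面仅用 numpy 回顾标准 RoPE 的核心:把特征两两配对、按与位置成正比的角度旋转,使注意力内积只依赖相对位置(标准 RoPE 草图,并非 VoPE 实现):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """x: (d,) 且 d 为偶数;第 i 对分量按角度 pos * base^(-2i/d) 旋转。"""
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)
    theta = pos * freqs
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
# RoPE 的关键性质:q、k 的内积只依赖相对位置 m - n
dot_a = rope(q, 3) @ rope(k, 1)   # 相对位置 2
dot_b = rope(q, 7) @ rope(k, 5)   # 相对位置也是 2
print(np.isclose(dot_a, dot_b))   # True
```

正是这种"只依赖相对位置"的性质,使旋转式位置编码便于外推到训练时未见过的序列长度乃至连续分辨率。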

[CV-33] Facial Features Matter: a Dynamic Watermark based Proactive Deepfake Detection Approach

【速读】: 该论文试图解决当前被动式深度伪造(deepfake)人脸交换检测方法在模型泛化能力上的显著瓶颈问题,以及主动检测方法中固定水印与保护内容关联性不强且易受安全风险影响的问题。解决方案的关键在于提出了一种基于面部特征的主动深度伪造检测方法(FaceProtect),利用深度伪造过程中面部特征的变化作为新的检测机制。具体来说,论文引入了一种基于生成对抗网络(GAN)的单向动态水印生成机制(GODWGM),该机制使用128维面部特征向量作为输入,创建从面部特征到水印的不可逆映射,从而增强了对各种逆向推理攻击的防护。此外,还提出了一种基于水印的验证策略(WVS),结合隐写术与GODWGM,使得在图像中同时传输代表面部特征的基准水印成为可能。实验结果表明,该方法在检测性能和实际应用中均表现出色。

链接: https://arxiv.org/abs/2411.14798
作者: Shulin Lan,Kanlin Liu,Yazhou Zhao,Chen Yang,Yingchao Wang,Xingshan Yao,Liehuang Zhu
关键词-EN: model generalization capabilities, Current passive deepfake, encounter significance bottlenecks, Current passive, methods encounter significance
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Current passive deepfake face-swapping detection methods encounter significance bottlenecks in model generalization capabilities. Meanwhile, proactive detection methods often use fixed watermarks which lack a close relationship with the content they protect and are vulnerable to security risks. Dynamic watermarks based on facial features offer a promising solution, as these features provide unique identifiers. Therefore, this paper proposes a Facial Feature-based Proactive deepfake detection method (FaceProtect), which utilizes changes in facial characteristics during deepfake manipulation as a novel detection mechanism. We introduce a GAN-based One-way Dynamic Watermark Generating Mechanism (GODWGM) that uses 128-dimensional facial feature vectors as inputs. This method creates irreversible mappings from facial features to watermarks, enhancing protection against various reverse inference attacks. Additionally, we propose a Watermark-based Verification Strategy (WVS) that combines steganography with GODWGM, allowing simultaneous transmission of the benchmark watermark representing facial features within the image. Experimental results demonstrate that our proposed method maintains exceptional detection performance and exhibits high practicality on images altered by various deepfake techniques.
zh

[CV-34] Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections

【速读】: 该论文试图解决现有图卷积网络 (GCN) 在动作识别中仅依赖于二元连接(即两个相邻节点或关节之间的边或骨骼)的问题,忽略了构建多节点卷积结构的可能性。解决方案的关键在于引入超图卷积网络 (Hyper-GCN),通过自适应优化多尺度超图来揭示动作驱动的多节点关系,并注入虚拟连接以扩展骨骼内的依赖关系谱,从而实现对骨骼节点传递的丰富语义信息的聚合。实验结果表明,Hyper-GCN 在 NTU-60、NTU-120 和 NW-UCLA 数据集上优于现有最先进的方法,特别是在 NTU-120 数据集上,X-Sub 和 X-Set 的 top-1 识别准确率分别达到了 90.2% 和 91.4%。

链接: https://arxiv.org/abs/2411.14796
作者: Youwei Zhou,Tianyang Xu,Cong Wu,Xiaojun Wu,Josef Kittler
关键词-EN: human skeletons motivated, graph convolutional network, hyper-graph convolutional network, shared topology, topology of human
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The shared topology of human skeletons motivated the recent investigation of graph convolutional network (GCN) solutions for action recognition. However, the existing GCNs rely on the binary connection of two neighbouring vertices (joints) formed by an edge (bone), overlooking the potential of constructing multi-vertex convolution structures. In this paper we address this oversight and explore the merits of a hyper-graph convolutional network (Hyper-GCN) to achieve the aggregation of rich semantic information conveyed by skeleton vertices. In particular, our Hyper-GCN adaptively optimises multi-scale hyper-graphs during training, revealing the action-driven multi-vertex relations. Besides, virtual connections are often designed to support efficient feature aggregation, implicitly extending the spectrum of dependencies within the skeleton. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted. The results of experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets, demonstrate the merits of our Hyper-GCN, compared to the state-of-the-art methods. Specifically, we outperform the existing solutions on NTU-120, achieving 90.2% and 91.4% in terms of the top-1 recognition accuracy on X-Sub and X-Set.
zh
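与普通 GCN 的二元边不同,超图的一条超边可同时连接多个关节。下面用 numpy 给出超图卷积的极简形式(省略可学习权重,聚合采用均值;这是经典 HGNN 式卷积的草图,并非论文 Hyper-GCN 的自适应多尺度实现):

```python
import numpy as np

def hypergraph_conv(X, H):
    """X: (V, C) 顶点特征;H: (V, E) 关联矩阵,H[v, e]=1 表示顶点 v 属于超边 e。
    简化超图卷积:先对每条超边内的顶点特征取均值,再平均传回各顶点。"""
    De = H.sum(axis=0)                   # 每条超边的顶点数
    Dv = H.sum(axis=1)                   # 每个顶点参与的超边数
    edge_feat = (H.T @ X) / De[:, None]  # (E, C) 超边级聚合
    return (H @ edge_feat) / Dv[:, None] # (V, C) 传回顶点

# 3 个关节、1 条三顶点超边(例如"右肩-右肘-右腕")
H = np.array([[1.0], [1.0], [1.0]])
X = np.array([[0.0], [3.0], [6.0]])
print(hypergraph_conv(X, H).ravel())  # 超边内所有顶点共享聚合特征 -> [3. 3. 3.]
```

一次卷积就让超边内的多个关节交换了信息,这是二元边 GCN 需要多层堆叠才能达到的感受野。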

[CV-35] Style-Friendly SNR Sampler for Style-Driven Generation

【速读】: 该论文试图解决大规模扩散模型在生成高质量图像时难以学习个性化艺术风格的问题,这限制了独特风格模板的创建。解决方案的关键在于提出了“风格友好信噪比采样器”(Style-friendly SNR sampler),通过在微调过程中将信噪比(SNR)分布向更高噪声水平偏移,从而聚焦于风格特征显现的噪声水平。这种方法使模型能更好地捕捉独特风格,生成风格对齐度更高的图像,并能学习和共享新的“风格模板”,从而增强个性化内容创作的能力。

链接: https://arxiv.org/abs/2411.14793
作者: Jooyoung Choi,Chaehun Shin,Yeongtak Oh,Heeseung Kim,Sungroh Yoon
关键词-EN: Recent large-scale diffusion, Recent large-scale, large-scale diffusion models, Style-friendly SNR sampler, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent large-scale diffusion models generate high-quality images but struggle to learn new, personalized artistic styles, which limits the creation of unique style templates. Fine-tuning with reference images is the most promising approach, but it often blindly utilizes objectives and noise level distributions used for pre-training, leading to suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enables models to better capture unique styles and generate images with higher style alignment. Our method allows diffusion models to learn and share new “style templates”, enhancing personalized content creation. We demonstrate the ability to generate styles such as personal watercolor paintings, minimal flat cartoons, 3D renderings, multi-panel images, and memes with text, thereby broadening the scope of style-driven generation.
zh
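论文的核心操作是把微调时采样的信噪比分布整体向高噪声一侧平移。下面用 numpy 示意一种"平移后的正态 log-SNR 采样"(均值、方差与平移量均为笔者假设的演示值,并非论文超参数):

```python
import numpy as np

def sample_logsnr(n, mean=0.0, std=2.0, style_shift=0.0, rng=None):
    """从 N(mean - style_shift, std^2) 采样 log-SNR;
    style_shift > 0 时分布整体移向更低的 log-SNR,即更高噪声水平。"""
    rng = rng or np.random.default_rng(0)
    return rng.normal(mean - style_shift, std, size=n)

rng = np.random.default_rng(0)
base = sample_logsnr(10000, rng=rng)                    # 预训练式采样
style = sample_logsnr(10000, style_shift=3.0, rng=rng)  # 风格友好采样
print(style.mean() < base.mean())  # 采样集中在更高噪声区 -> True
```

由于风格特征主要在高噪声步显现,把训练信号集中到这一区间即可提升风格对齐,而无需改动模型结构。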

[CV-36] Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers

【速读】: 该论文试图解决在消费级计算机上训练大规模对比语言-图像预训练模型(Contrastive Language-Image Pre-training, CLIP)时面临的计算和存储资源限制问题。解决方案的关键在于:1) 通过简化Transformer块结构并结合权重继承与多阶段知识蒸馏(Weight Inheritance with multi-stage Knowledge Distillation, WIKD),减少模型参数并提高训练和部署时的推理速度;2) 面对小数据集带来的收敛挑战,采用数据增强技术生成合成字幕,并设计了一种新的配对匹配损失(Pair Matching, PM)以充分利用正负图像-文本对之间的区分性。这些方法使得模型在单个Nvidia RTX3090 GPU和1TB数据存储条件下,实现了竞争性的性能,推动了CLIP模型在相关研究社区中的普及。

链接: https://arxiv.org/abs/2411.14789
作者: Hongbo Liu
关键词-EN: Contrastive Language-Image Pre-training, superior zero-shot performance, Contrastive Language-Image, Language-Image Pre-training, downstream tasks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its superior zero-shot performance and excellent transferability to downstream tasks. However, training such large-scale models usually requires substantial computation and storage, which poses barriers for general users with consumer-level computers. Motivated by this observation, in this paper we investigate how to achieve competitive performance on only one Nvidia RTX3090 GPU and with one terabyte for storing dataset. On one hand, we simplify the transformer block structure and combine Weight Inheritance with multi-stage Knowledge Distillation (WIKD), thereby reducing the parameters and improving the inference speed during training along with deployment. On the other hand, confronted with the convergence challenge posed by small dataset, we generate synthetic captions for each sample as data augmentation, and devise a novel Pair Matching (PM) loss to fully exploit the distinguishment among positive and negative image-text pairs. Extensive experiments demonstrate that our model can achieve a new state-of-the-art datascale-parameter-accuracy tradeoff, which could further popularize the CLIP model in the related research community.
zh
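
论文提出的配对匹配(PM)损失旨在充分利用正负图像-文本对之间的区分信息,其具体形式摘要未给出。这里用 NumPy 写一个标准的对称图文对比损失(InfoNCE 形式)作为参考草图,假设 PM 损失与其同族,温度参数为假设值:

```python
import numpy as np

def contrastive_pair_loss(img_emb, txt_emb, temperature=0.07):
    """对称图文对比损失:对角线为配对正样本,其余组合为负样本。"""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # 数值稳定
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(logp)))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

aligned = contrastive_pair_loss(np.eye(4), np.eye(4))  # 完全对齐时损失接近 0
```

合成字幕增广后,每个样本会得到多条文本,可对多条字幕分别计算该损失再取平均。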

[CV-37] FastGrasp: Efficient Grasp Synthesis with Diffusion

【速读】: 该论文试图解决在复杂物理约束和高生成效率要求下,人手与物体交互建模的挑战。传统方法通常采用计算密集型的两阶段方法,首先生成中间表示(如接触图),然后通过迭代优化过程更新手部网格以捕捉手-物关系,但由于优化阶段的计算复杂性,这些方法在推理效率上往往较低。论文提出的解决方案关键在于引入了一种基于扩散模型(Diffusion Model)的一阶段生成方法,通过开发带有适应模块的潜在扩散模型(Latent Diffusion Model with an Adaptation Module)和接触感知损失(contact-aware loss)来直接生成抓握姿态,从而显著提高了生成速度和手部姿态的多样性,同时确保了手-物交互的物理约束。实验结果表明,该方法在推理速度、多样性和姿态质量方面均优于现有最先进的方法。

链接: https://arxiv.org/abs/2411.14786
作者: Xiaofei Wu,Tao Liu,Caoji Li,Yuexin Ma,Yujiao Shi,Xuming He
关键词-EN: Effectively modeling, modeling the interaction, interaction between human, complex physical constraints, efficiency in applications
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effectively modeling the interaction between human hands and objects is challenging due to the complex physical constraints and the requirement for high generation efficiency in applications. Prior approaches often employ computationally intensive two-stage approaches, which first generate an intermediate representation, such as contact maps, followed by an iterative optimization procedure that updates hand meshes to capture the hand-object relation. However, due to the high computation complexity during the optimization stage, such strategies often suffer from low efficiency in inference. To address this limitation, this work introduces a novel diffusion-model-based approach that generates the grasping pose in a one-stage manner. This allows us to significantly improve generation speed and the diversity of generated hand poses. In particular, we develop a Latent Diffusion Model with an Adaptation Module for object-conditioned hand pose generation and a contact-aware loss to enforce the physical constraints between hands and objects. Extensive experiments demonstrate that our method achieves faster inference, higher diversity, and superior pose quality than state-of-the-art approaches. Code is available at this https URL.
zh
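
「一阶段生成」意味着抓握姿态向量由一次完整的扩散去噪采样直接得到,无需任何后续迭代优化。下面是一个与具体模型无关的极简 DDPM 祖先采样循环示意(去噪网络 denoise_fn、步数与 beta 调度均为假设):

```python
import numpy as np

def ddpm_sample(denoise_fn, shape, steps=10, seed=0):
    """极简 DDPM 采样:从高斯噪声出发逐步去噪,
    此处的样本可理解为一个抓握姿态参数向量。"""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t)  # 预测当前步的噪声
        x = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:               # 最后一步不再加噪
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# 用恒零去噪器演示采样流程本身(真实系统中 denoise_fn 由以物体为条件的网络给出)
pose = ddpm_sample(lambda x, t: np.zeros_like(x), shape=(6,))
```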

[CV-38] Reconciling Semantic Controllability and Diversity for Remote Sensing Image Synthesis with Hybrid Semantic Embedding

【速读】: 该论文试图解决遥感图像语义合成中语义可控性与多样性之间的平衡问题。解决方案的关键在于提出了一种混合语义嵌入引导的生成对抗网络 (HySEGGAN),该网络通过利用单一源的分层信息,提出了一种混合语义嵌入方法,协调细粒度的局部语义布局以表征遥感对象的几何结构,而无需额外信息。此外,引入语义精炼网络 (SRN) 并结合新的损失函数,确保细粒度的语义反馈,从而缓解语义混淆并防止几何模式崩溃。实验结果表明,该方法在语义可控性和多样性之间取得了良好的平衡,并显著提升了合成图像的质量,在多个数据集的下游任务中作为数据增强技术达到了最先进的性能。

链接: https://arxiv.org/abs/2411.14781
作者: Junde Liu,Danpei Zhao,Bo Yuan,Wentao Li,Tian Li
关键词-EN: Significant advancements, Hybrid Semantic Embedding, Generative Adversarial Network, Embedding Guided Generative, Guided Generative Adversarial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Significant advancements have been made in semantic image synthesis in remote sensing. However, existing methods still face formidable challenges in balancing semantic controllability and diversity. In this paper, we present a Hybrid Semantic Embedding Guided Generative Adversarial Network (HySEGGAN) for controllable and efficient remote sensing image synthesis. Specifically, HySEGGAN leverages hierarchical information from a single source. Motivated by feature description, we propose a hybrid semantic Embedding method, that coordinates fine-grained local semantic layouts to characterize the geometric structure of remote sensing objects without extra information. Besides, a Semantic Refinement Network (SRN) is introduced, incorporating a novel loss function to ensure fine-grained semantic feedback. The proposed approach mitigates semantic confusion and prevents geometric pattern collapse. Experimental results indicate that the method strikes an excellent balance between semantic controllability and diversity. Furthermore, HySEGGAN significantly improves the quality of synthesized images and achieves state-of-the-art performance as a data augmentation technique across multiple datasets for downstream tasks.
zh

[CV-39] A Benchmark Dataset for Collaborative SLAM in Service Environments

【速读】: 该论文试图解决现有C-SLAM(协同同步定位与地图构建)数据集在多样化的室内服务环境中缺乏复杂挑战的问题。解决方案的关键在于引入了一个名为C-SLAM dataset in Service Environments (CSE) 的新多模态数据集,该数据集通过NVIDIA Isaac Sim模拟器生成,涵盖医院、办公室和仓库三种常见室内服务环境,并包含动态对象和多机器人操作,以更真实地模拟实际服务环境中的挑战。通过提供精确且时间同步的传感器数据(如立体RGB、立体深度、IMU和地面真值姿态),CSE数据集旨在为单机器人和多机器人SLAM方法的评估提供一个更贴近实际应用的基准。

链接: https://arxiv.org/abs/2411.14775
作者: Harin Park,Inha Lee,Minje Kim,Hyungyu Park,Kyungdon Joo
关键词-EN: indoor service environments, service environments, service, multiple service robots, demand complicated tasks
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures, Accepted to IEEE RA-L

点击查看摘要

Abstract:As service environments have become diverse, they have started to demand complicated tasks that are difficult for a single robot to complete. This change has led to an interest in multiple robots instead of a single robot. C-SLAM, as a fundamental technique for multiple service robots, needs to handle diverse challenges such as homogeneous scenes and dynamic objects to ensure that robots operate smoothly and perform their tasks safely. However, existing C-SLAM datasets do not include the various indoor service environments with the aforementioned challenges. To close this gap, we introduce a new multi-modal C-SLAM dataset for multiple service robots in various indoor service environments, called C-SLAM dataset in Service Environments (CSE). We use the NVIDIA Isaac Sim to generate data in various indoor service environments with the challenges that may occur in real-world service environments. By using simulation, we can provide accurate and precisely time-synchronized sensor data, such as stereo RGB, stereo depth, IMU, and ground truth (GT) poses. We configure three common indoor service environments (Hospital, Office, and Warehouse), each of which includes various dynamic objects that perform motions suitable to each environment. In addition, we drive three robots to mimic the actions of real service robots. Through these factors, we generate a more realistic C-SLAM dataset for multiple service robots. We demonstrate our dataset by evaluating diverse state-of-the-art single-robot SLAM and multi-robot SLAM methods. Our dataset is available at this https URL.
zh

[CV-40] Resolution-Agnostic Transformer-based Climate Downscaling

【速读】: 该论文试图解决全球气候模型(GCMs)在区域和局部尺度上进行降尺度时的高计算成本问题。解决方案的关键在于引入了一种基于预训练的地球视觉变换器(Earth Vision Transformer, Earth ViT)的降尺度方法。该方法利用预训练的Earth ViT模型,能够在不增加额外训练成本的情况下,从50 km分辨率降尺度到25 km分辨率,并在3 km分辨率的BARRA-SY数据集上表现良好,显示出其跨分辨率泛化的能力。这种方法有望通过降尺度不同输入分辨率的GCMs,生成大规模的区域气候模拟集合,从而提供更全面的未来气候变化关键变量的估计,支持极端天气事件和气候变化适应策略的有效规划。

链接: https://arxiv.org/abs/2411.14774
作者: Declan Curran,Hira Saleem,Flora Salim,Sanaa Hobeichi
关键词-EN: Understanding future weather, Earth Vision Transformer, infrastructure development, Understanding future, local scales
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding future weather changes at regional and local scales is crucial for planning and decision-making, particularly in the context of extreme weather events, as well as for broader applications in agriculture, insurance, and infrastructure development. However, the computational cost of downscaling Global Climate Models (GCMs) to the fine resolutions needed for such applications presents a significant barrier. Drawing on advancements in weather forecasting models, this study introduces a cost-efficient downscaling method using a pretrained Earth Vision Transformer (Earth ViT) model. Initially trained on ERA5 data to downscale from 50 km to 25 km resolution, the model is then tested on the higher resolution BARRA-SY dataset at a 3 km resolution. Remarkably, it performs well without additional training, demonstrating its ability to generalize across different resolutions. This approach holds promise for generating large ensembles of regional climate simulations by downscaling GCMs with varying input resolutions without incurring additional training costs. Ultimately, this method could provide more comprehensive estimates of potential future changes in key climate variables, aiding in effective planning for extreme weather events and climate change adaptation strategies.
zh

[CV-41] Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

【速读】: 该论文试图解决在训练视觉模型处理长视频时,视频标记化(tokenization)效率低下的问题。解决方案的关键是引入了一种名为CoordTok的视频标记器,它通过学习从基于坐标的表示到输入视频相应补丁的映射,来实现对长视频的高效编码。CoordTok利用了3D生成模型的最新进展,将视频编码为分解的三平面表示,并重建与随机采样的(x,y,t)坐标相对应的补丁。这种方法使得大型标记器模型可以直接在长视频上训练,而无需消耗过多的训练资源。实验结果表明,CoordTok能够显著减少长视频编码所需的标记数量,从而提高了视频处理的效率和训练的内存效率。

链接: https://arxiv.org/abs/2411.14762
作者: Huiwon Jang,Sihyun Yu,Jinwoo Shin,Pieter Abbeel,Younggyo Seo
关键词-EN: process long videos, long video clips, long videos, remains a challenge, long
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is available on the project webpage: this https URL

点击查看摘要

Abstract:Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x,y,t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128 \times 128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
zh
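
CoordTok 的训练目标是只重建随机采样的 (x, y, t) 坐标处的补丁,而非一次性重建整段视频。下面用 NumPy 演示这种按坐标取补丁作为重建目标的做法(补丁尺寸与视频尺寸为示意用假设值):

```python
import numpy as np

def sample_coord_patches(video, n, patch=8, seed=0):
    """随机采样 n 组 (x, y, t) 坐标,并取出对应补丁作为重建目标。"""
    rng = np.random.default_rng(seed)
    T, H, W, C = video.shape
    ts = rng.integers(0, T, size=n)
    ys = rng.integers(0, H - patch + 1, size=n)
    xs = rng.integers(0, W - patch + 1, size=n)
    patches = np.stack([video[t, y:y + patch, x:x + patch]
                        for t, y, x in zip(ts, ys, xs)])
    coords = np.stack([xs, ys, ts], axis=1)
    return coords, patches

video = np.zeros((16, 32, 32, 3), dtype=np.float32)  # 假想的 16 帧小视频
coords, patches = sample_coord_patches(video, n=5)
```

由于每步只需重建少量补丁,长视频上的训练开销与采样数量而非帧数成正比,这正是摘要中「无需过多训练资源」的来源。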

[CV-42] FairAdapter: Detecting AI-generated Images with Improved Fairness

【速读】: 该论文试图解决生成式模型生成的图像在检测公平性方面的问题,即现有的深度神经网络在检测不同内容生成样本时可能出现的性能不一致性。解决方案的关键在于提出了一个名为Fairadapter的新框架,旨在提升检测的公平性,并通过实验证明其在公平性性能上优于现有的最先进方法。

链接: https://arxiv.org/abs/2411.14755
作者: Feng Ding,Jun Zhang,Xinan He,Jianfeng Xu
关键词-EN: data-driven deep neural, pose significant challenges, deep neural networks, efficient forensics tools, realistic images generated
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The high-quality, realistic images generated by generative models pose significant challenges for exposing them. So far, data-driven deep neural networks have been justified as the most efficient forensics tools for the challenges. However, they may be over-fitted to certain semantics, resulting in considerable inconsistency in detection performance across different contents of generated samples. It could be regarded as an issue of detection fairness. In this paper, we propose a novel framework named Fairadapter to tackle the issue. In comparison with existing state-of-the-art methods, our model achieves improved fairness performance. Our project: this https URL
zh

[CV-43] TopoSD: Topology-Enhanced Lane Segment Perception with SDMap Prior

【速读】: 该论文试图解决自动驾驶系统中依赖高精度地图(HDMaps)的高成本问题,并提升基于车载传感器在线构建矢量化高精度地图的能力。解决方案的关键在于利用标准定义地图(SDMaps)作为先验信息,通过将SDMap元素编码为神经空间地图表示和实例令牌,增强鸟瞰图(BEV)特征,从而改善车道几何和拓扑结构的解码。此外,论文还提出了一个拓扑引导的解码器,通过利用拓扑和几何特征之间的相互关系来细化预测,进一步提升了几何预测和拓扑推理的能力。实验结果表明,该方法在OpenLane-V2数据集上显著优于现有最先进的方法。

链接: https://arxiv.org/abs/2411.14751
作者: Sen Yang,Minyue Jiang,Ziwei Fan,Xiaolu Xie,Xiao Tan,Yingying Li,Errui Ding,Liang Wang,Jingdong Wang
关键词-EN: autonomous driving systems, Recent advances, annotation and maintenance, advances in autonomous, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 17 pages, 7 figures, and 7 tables

点击查看摘要

Abstract:Recent advances in autonomous driving systems have shifted towards reducing reliance on high-definition maps (HDMaps) due to the huge costs of annotation and maintenance. Instead, researchers are focusing on online vectorized HDMap construction using on-board sensors. However, sensor-only approaches still face challenges in long-range perception due to the restricted views imposed by the mounting angles of onboard cameras, just as human drivers also rely on bird’s-eye-view navigation maps for a comprehensive understanding of road structures. To address these issues, we propose to train the perception model to “see” standard definition maps (SDMaps). We encode SDMap elements into neural spatial map representations and instance tokens, and then incorporate such complementary features as prior information to improve the bird’s eye view (BEV) feature for lane geometry and topology decoding. Based on the lane segment representation framework, the model simultaneously predicts lanes, centrelines and their topology. To further enhance the ability of geometry prediction and topology reasoning, we also use a topology-guided decoder to refine the predictions by exploiting the mutual relationships between topological and geometric features. We perform extensive experiments on OpenLane-V2 datasets to validate the proposed method. The results show that our model outperforms state-of-the-art methods by a large margin, with gains of +6.7 and +9.1 on the mAP and topology metrics. Our analysis also reveals that models trained with SDMap noise augmentation exhibit enhanced robustness.
zh

[CV-44] Ordinal Multiple-instance Learning for Ulcerative Colitis Severity Estimation with Selective Aggregated Transformer WACV2025

【速读】: 该论文试图解决在溃疡性结肠炎 (Ulcerative Colitis, UC) 的临床诊断中,现有图像级分类方法无法有效利用患者级严重程度标签的问题。解决方案的关键在于提出了一种基于Transformer和选择性聚合器标记的患者级严重程度估计方法。该方法能够从患者的多张图像中有效聚合严重部位的特征,从而提高相邻严重程度类别之间的区分能力。实验结果表明,该方法在两个数据集上均优于现有的多实例学习 (Multiple Instance Learning, MIL) 方法,并在实际临床环境中验证了其优越性。

链接: https://arxiv.org/abs/2411.14750
作者: Kaito Shiku,Kazuya Nishimura,Daiki Suehiro,Kiyohito Tanaka,Ryoma Bise
关键词-EN: real clinical settings, real clinical, ulcerative colitis, clinical settings, Patient-level diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 9 figures, Accepted in WACV 2025

点击查看摘要

Abstract:Patient-level diagnosis of severity in ulcerative colitis (UC) is common in real clinical settings, where the most severe score in a patient is recorded. However, previous UC classification methods (i.e., image-level estimation) mainly assumed the input was a single image. Thus, these methods can not utilize severity labels recorded in real clinical settings. In this paper, we propose a patient-level severity estimation method by a transformer with selective aggregator tokens, where a severity label is estimated from multiple images taken from a patient, similar to a clinical setting. Our method can effectively aggregate features of severe parts from a set of images captured in each patient, and it facilitates improving the discriminative ability between adjacent severity classes. Experiments demonstrate the effectiveness of the proposed method on two datasets compared with the state-of-the-art MIL methods. Moreover, we evaluated our method in real clinical settings and confirmed that our method outperformed the previous image-level methods. The code is publicly available at this https URL.
zh
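
临床记录的患者级标签约定为:取该患者所有图像中最严重的评分。论文用带选择性聚合器标记的 Transformer 学习这种聚合;下面只用一个朴素的「取最严重预测」基线来示意该标签约定本身(非论文方法):

```python
import numpy as np

def patient_level_severity(image_logits):
    """对每张图像的类别 logits 先取 argmax,再取患者内的最大严重等级。"""
    per_image = image_logits.argmax(axis=1)  # 每张图像的预测等级
    return int(per_image.max())              # 患者级 = 最严重等级

# 一名患者的 3 张图像、4 个严重等级(0 最轻)的假设 logits
logits = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 1.0],
                   [1.0, 2.0, 0.0, 0.0]])
severity = patient_level_severity(logits)
```

这种硬性取最大值的基线对单张误判很敏感,论文的聚合器标记正是为了以可学习的方式聚合「严重部位」特征来替代它。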

[CV-45] Point Cloud Understanding via Attention-Driven Contrastive Learning

【速读】: 该论文试图解决基于Transformer的模型在点云理解中对不显著区域潜在信息的忽视问题,这导致模型对扰动敏感且全局理解能力有限。解决方案的关键在于引入PointACL,一个注意力驱动的对比学习框架。PointACL通过采用注意力驱动的动态掩码策略,引导模型关注未充分关注的区域,从而增强对点云全局结构的理解。此外,结合原始预训练损失与对比学习损失,提升特征的区分度和泛化能力。实验结果表明,PointACL在多种3D理解任务中达到了最先进的性能,特别是在与不同Transformer骨干网络(如Point-MAE和PointGPT)结合时,其在ScanObjectNN、ModelNet40和ShapeNetPart等数据集上的表现显著提升,显示出其在捕捉全局和局部特征方面的优越能力,以及对扰动和数据不完整性的增强鲁棒性。

链接: https://arxiv.org/abs/2411.14744
作者: Yi Wang,Jiaze Wang,Ziyu Guo,Renrui Zhang,Donghao Zhou,Guangyong Chen,Anfeng Liu,Pheng-Ann Heng
关键词-EN: Recently Transformer-based models, Recently Transformer-based, leveraging self-attention mechanisms, overlook latent information, limited global comprehension
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms, however, these methods often overlook latent information in less prominent regions, leading to increased sensitivity to perturbations and limited global comprehension. To solve this issue, we introduce PointACL, an attention-driven contrastive learning framework designed to address these limitations. Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions, enhancing the understanding of global structures within the point cloud. Then we combine the original pre-training loss with a contrastive learning loss, improving feature discrimination and generalization. Extensive experiments validate the effectiveness of PointACL, as it achieves state-of-the-art performance across a variety of 3D understanding tasks, including object classification, part segmentation, and few-shot learning. Specifically, when integrated with different Transformer backbones like Point-MAE and PointGPT, PointACL demonstrates improved performance on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart. This highlights its superior capability in capturing both global and local features, as well as its enhanced robustness against perturbations and incomplete data.
zh
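
注意力驱动的动态掩码的一种可行读法是:把注意力得分最高的点云 token 掩掉,迫使编码器利用欠关注区域。下面用 NumPy 写一个这样的掩码选择草图(掩码比例与「掩高留低」的取向均为本文之外的假设):

```python
import numpy as np

def attention_driven_mask(attn_scores, mask_ratio=0.4):
    """把注意力得分最高的前 mask_ratio 比例的 token 标记为掩码(True)。"""
    n = attn_scores.shape[0]
    k = int(n * mask_ratio)
    order = np.argsort(attn_scores)[::-1]  # 注意力从高到低排序
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return mask

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.2])
mask = attention_driven_mask(scores)  # 掩掉得分最高的两个 token
```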

[CV-46] FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification

【速读】: 该论文试图解决在计算病理学 (CPath) 中进行少样本学习 (Few-shot learning) 时,由于全切片图像 (WSIs) 中包含大量诊断信息不相关的补丁 (patches),导致模型难以聚焦于关键诊断特征的问题。解决方案的关键在于引入了一个名为 FOCUS 的知识增强自适应视觉压缩框架,该框架结合了病理学基础模型 (FMs) 和语言先验知识,通过逐步的三阶段压缩策略,优先处理具有区分性的 WSI 补丁,从而实现对诊断相关区域的聚焦分析。具体步骤包括利用 FMs 进行全局视觉冗余消除、将压缩特征与语言提示结合进行语义相关性评估,以及在保持空间一致性的同时进行邻域感知视觉标记过滤。

链接: https://arxiv.org/abs/2411.14743
作者: Zhengrui Guo,Conghao Xiong,Jiabo Ma,Qichen Sun,Lishuang Feng,Jinzhuo Wang,Hao Chen
关键词-EN: addressing fundamental limitations, patient privacy constraints, Few-shot learning presents, addressing fundamental, data availability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Few-shot learning presents a critical solution for cancer diagnosis in computational pathology (CPath), addressing fundamental limitations in data availability, particularly the scarcity of expert annotations and patient privacy constraints. A key challenge in this paradigm stems from the inherent disparity between the limited training set of whole slide images (WSIs) and the enormous number of contained patches, where a significant portion of these patches lacks diagnostically relevant information, potentially diluting the model’s ability to learn and focus on critical diagnostic features. While recent works attempt to address this by incorporating additional knowledge, several crucial gaps hinder further progress: (1) despite the emergence of powerful pathology foundation models (FMs), their potential remains largely untapped, with most approaches limiting their use to basic feature extraction; (2) current language guidance mechanisms attempt to align text prompts with vast numbers of WSI patches all at once, struggling to leverage rich pathological semantic information. To this end, we introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, which uniquely combines pathology FMs with language prior knowledge to enable a focused analysis of diagnostically relevant regions by prioritizing discriminative WSI patches. Our approach implements a progressive three-stage compression strategy: we first leverage FMs for global visual redundancy elimination, and integrate compressed features with language prompts for semantic relevance assessment, then perform neighbor-aware visual token filtering while preserving spatial coherence. Extensive experiments on pathological datasets spanning breast, lung, and ovarian cancers demonstrate its superior performance in few-shot pathology diagnosis. Code will be made available at this https URL.
zh
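
FOCUS 的第二阶段按语义相关性筛选补丁,可以理解为:用语言提示的嵌入给每个 WSI 补丁特征打余弦相似度分,保留前 k 个。下面是一个这样的筛选草图(特征与提示嵌入均为构造的示例数据):

```python
import numpy as np

def select_relevant_patches(patch_feats, text_feat, keep=2):
    """按与文本提示嵌入的余弦相似度排序补丁,返回保留的前 keep 个索引。"""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = p @ t
    top = np.argsort(sims)[::-1][:keep]
    return np.sort(top), sims

feats = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
kept, sims = select_relevant_patches(feats, np.array([1.0, 0.0]))
```

第三阶段的邻域感知过滤可在此基础上再施加空间约束,例如只保留与已选补丁相邻的高分补丁,以维持空间一致性。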

[CV-47] TEXGen: a Generative Diffusion Model for Mesh Textures SIGGRAPH

【速读】: 该论文试图解决在UV纹理空间中直接学习生成高质量纹理图的问题,尤其是在大规模数据集上的应用。解决方案的关键在于训练一个能够直接生成高分辨率纹理图的前馈扩散模型 (feed-forward diffusion model)。为此,论文提出了一种可扩展的网络架构,该架构在UV地图上交错使用卷积层和点云上的注意力层,从而有效地在高分辨率UV空间中进行学习。通过这种设计,论文成功训练了一个拥有7亿参数的扩散模型,该模型能够根据文本提示和单视图图像生成UV纹理图,并支持多种扩展应用,如文本引导的纹理修复、稀疏视图的纹理补全和文本驱动的纹理合成。

链接: https://arxiv.org/abs/2411.14740
作者: Xin Yu,Ze Yuan,Yuan-Chen Guo,Ying-Tian Liu,JianHui Liu,Yangguang Li,Yan-Pei Cao,Ding Liang,Xiaojuan Qi
关键词-EN: asset rendering, essential for realistic, large-scale datasets, studies have explored, explored learning directly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Accepted to SIGGRAPH Asia Journal Article (TOG 2024)

点击查看摘要

Abstract:While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventional approach of relying on pre-trained 2D diffusion models for test-time optimization of 3D textures. Instead, we focus on the fundamental problem of learning in the UV texture space itself. For the first time, we train a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner. To facilitate efficient learning in high-resolution UV spaces, we propose a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds. Leveraging this architectural design, we train a 700 million parameter diffusion model that can generate UV texture maps guided by text prompts and single-view images. Once trained, our model naturally supports various extended applications, including text-guided texture inpainting, sparse-view texture completion, and text-driven texture synthesis. Project page is at this http URL.
zh

[CV-48] AI Tailoring: Evaluating Influence of Image Features on Fashion Product Popularity

【速读】: 该论文试图解决时尚行业中识别影响消费者偏好的关键产品特征的问题。解决方案的关键在于提出了一种名为“影响力评分 (influence score)”的量化指标,用于评估产品特征的重要性,并开发了一个名为时尚需求预测器 (Fashion Demand Predictor, FDP) 的预测模型。该模型结合了基于Transformer的模型和随机森林 (Random Forest) 来根据产品图像预测市场受欢迎程度。通过使用图像编辑扩散模型修改图像并进行消融研究,验证了最高和最低评分特征对模型预测的影响。此外,通过收集人类偏好排名的调查进一步验证了FDP模型的预测准确性和方法的有效性。该方法提供了一个自动化和系统化的时尚图像分析框架,为时尚产品设计和营销策略开发等下游任务提供了有价值的指导。

链接: https://arxiv.org/abs/2411.14737
作者: Xiaomin Li,Junyi Sha
关键词-EN: Identifying key product, key product features, Fashion Demand Predictor, influence consumer preferences, fashion industry
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Identifying key product features that influence consumer preferences is essential in the fashion industry. In this study, we introduce a robust methodology to ascertain the most impactful features in fashion product images, utilizing past market sales data. First, we propose the metric called “influence score” to quantitatively assess the importance of product features. Then we develop a forecasting model, the Fashion Demand Predictor (FDP), which integrates Transformer-based models and Random Forest to predict market popularity based on product images. We employ image-editing diffusion models to modify these images and perform an ablation study, which validates the impact of the highest and lowest-scoring features on the model’s popularity predictions. Additionally, we further validate these results through surveys that gather human rankings of preferences, confirming the accuracy of the FDP model’s predictions and the efficacy of our method in identifying influential features. Notably, products enhanced with “good” features show marked improvements in predicted popularity over their modified counterparts. Our approach develops a fully automated and systematic framework for fashion image analysis that provides valuable guidance for downstream tasks such as fashion product design and marketing strategy development.
zh
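
「影响力评分」用于量化单个图像特征对受欢迎程度的贡献,摘要并未给出公式。这里以「特征与受欢迎度的皮尔逊相关系数绝对值」作为一个假设性的替代定义来演示其含义:

```python
import numpy as np

def influence_scores(features, popularity):
    """假设性影响力评分:每个特征列与受欢迎度的 |Pearson 相关系数|。"""
    f = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    p = (popularity - popularity.mean()) / (popularity.std() + 1e-8)
    return np.abs(f.T @ p) / len(p)

rng = np.random.default_rng(0)
pop = rng.standard_normal(200)
feats = np.stack([pop, rng.standard_normal(200)], axis=1)  # 第 0 列与销量完全相关
scores = influence_scores(feats, pop)
```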

[CV-49] Effective SAM Combination for Open-Vocabulary Semantic Segmentation

【速读】: 该论文试图解决开放词汇语义分割(Open-vocabulary semantic segmentation)中的高计算成本和内存效率低下的问题。解决方案的关键在于提出了一种名为ESC-Net的新型单阶段开放词汇分割模型,该模型利用了Segment Anything Model (SAM)的解码器块进行类无关的分割,并通过将图像-文本相关性生成的伪提示嵌入到SAM的可提示分割框架中,实现了精确的掩码预测和空间聚合。ESC-Net在ADE20K、PASCAL-VOC和PASCAL-Context等标准基准测试中表现优异,不仅提高了效率,还提升了准确性。

链接: https://arxiv.org/abs/2411.14723
作者: Minhyeok Lee,Suhwan Cho,Jungho Lee,Sunghun Yang,Heeseung Choi,Ig-Jae Kim,Sangyoun Lee
关键词-EN: assign pixel-level labels, range of classes, semantic segmentation aims, aims to assign, assign pixel-level
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. But these two-stage approaches often suffer from high computational costs, memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM’s promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
zh

[CV-50] VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

【速读】: 该论文试图解决自动驾驶领域中以视觉为中心的算法在预训练阶段缺乏有效监督信号的问题。解决方案的关键在于提出了VisionPAD,一种新颖的自监督预训练范式。VisionPAD通过使用3D高斯样条(3D Gaussian Splatting)来重建多视角表示,仅利用图像作为监督信号,避免了传统方法中依赖显式深度监督的神经渲染技术。具体来说,VisionPAD引入了自监督的体素速度估计方法,通过将体素扭曲到相邻帧并监督渲染输出,有效学习序列数据中的运动线索。此外,采用多帧光度一致性方法,基于渲染深度和相对姿态将相邻帧投影到当前帧,从而增强几何感知,提升3D几何表示能力。实验结果表明,VisionPAD在3D物体检测、占用预测和地图分割任务中显著优于现有的预训练策略。

链接: https://arxiv.org/abs/2411.14716
作者: Haiming Zhang,Wending Zhou,Yiyao Zhu,Xu Yan,Jiantao Gao,Dongfeng Bai,Yingjie Cai,Bingbing Liu,Shuguang Cui,Zhen Li
关键词-EN: pre-training paradigm designed, paradigm designed, designed for vision-centric, vision-centric algorithms, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.
zh
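
多帧光度一致性把相邻帧按渲染深度与相对位姿投影到当前帧后,比较两幅图像的光度差异。投影(warp)本身涉及相机几何,这里只示意投影完成之后的 L1 光度损失部分(假设 warp 已在上游完成):

```python
import numpy as np

def photometric_l1_loss(cur_frame, warped_adj, valid_mask=None):
    """当前帧与已投影相邻帧之间的 L1 光度误差,可用有效像素掩码过滤遮挡区域。"""
    diff = np.abs(cur_frame - warped_adj)
    if valid_mask is not None:
        diff = diff[valid_mask]
    return float(diff.mean())

frame = np.full((4, 4, 3), 0.5)
loss_same = photometric_l1_loss(frame, frame)        # 完全一致时损失为 0
loss_diff = photometric_l1_loss(frame, frame + 0.1)  # 恒定偏移时损失约为偏移量
```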

[CV-51] Any-to-3D Generation via Hybrid Diffusion Supervision

【速读】: 该论文试图解决现有3D物体生成模型在处理多模态输入时需要针对特定任务进行定制和重新训练的问题。解决方案的关键在于引入了一个统一的框架XBind,该框架通过跨模态预对齐技术,结合多模态对齐编码器和预训练的扩散模型,实现了从任意模态(包括文本、图像和音频)生成3D物体。核心创新包括提出了一种新的损失函数——模态相似性(Modality Similarity, MS)损失,用于对齐模态提示和渲染图像的嵌入,以及结合混合扩散监督和三阶段优化过程,显著提升了生成3D物体的质量。这是首次实现从任意模态提示生成3D物体的方法。

链接: https://arxiv.org/abs/2411.14715
作者: Yijun Fan,Yiwei Ma,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji
关键词-EN: strong priors offered, Recent progress, strong priors, priors offered, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in 3D object generation has been fueled by the strong priors offered by diffusion models. However, existing models are tailored to specific tasks, accommodating only one modality at a time and necessitating retraining to change modalities. Given an image-to-3D model and a text prompt, a naive approach is to convert text prompts to images and then use the image-to-3D model for generation. This approach is both time-consuming and labor-intensive, resulting in unavoidable information loss during modality conversion. To address this, we introduce XBind, a unified framework for any-to-3D generation using cross-modal pre-alignment techniques. XBind integrates an multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modalities, including text, images, and audio. We subsequently present a novel loss function, termed Modality Similarity (MS) Loss, which aligns the embeddings of the modality prompts and the rendered images, facilitating improved alignment of the 3D objects with multiple modalities. Additionally, Hybrid Diffusion Supervision combined with a Three-Phase Optimization process improves the quality of the generated 3D objects. Extensive experiments showcase XBind’s broad generation capabilities in any-to-3D scenarios. To our knowledge, this is the first method to generate 3D objects from any modality prompts. Project page: this https URL.
zh
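
下面给出一个极简示意(非论文官方实现;函数与输入均为假设),展示“模态相似性损失 (MS Loss)”把模态提示嵌入与渲染图像嵌入按余弦相似度对齐的基本思路:批内逐对归一化后,最小化 1 减余弦相似度的均值。

```python
import numpy as np

def modality_similarity_loss(prompt_emb, render_emb, eps=1e-8):
    """Mean (1 - cosine similarity) over a batch of embedding pairs.

    prompt_emb, render_emb: arrays of shape (batch, dim), assumed to come
    from the multimodal-aligned encoder and the rendered views respectively.
    """
    p = prompt_emb / (np.linalg.norm(prompt_emb, axis=1, keepdims=True) + eps)
    r = render_emb / (np.linalg.norm(render_emb, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(p * r, axis=1)))
```

当两组嵌入完全一致时损失趋近于 0,方向相反时趋近于 2,符合对齐目标的直觉。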

[CV-52] Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

【速读】: 该论文试图解决遥感跨模态文本-图像检索 (RSCTIR) 中由于遥感图像的多样性导致的全球和局部信息有效整合的挑战,以及在模态融合前确保特征预对齐的问题。解决方案的关键在于提出了一种名为CMPAGL的跨模态预对齐方法,该方法通过Gswin transformer块结合局部窗口自注意力和全局-局部窗口交叉注意力来捕捉多尺度特征,并引入预对齐机制简化模态融合训练,从而提高检索性能。此外,论文还提出了相似度矩阵重加权 (SMR) 算法用于重排序,并通过增强三元组损失函数中的类内距离项来优化特征学习。实验结果表明,CMPAGL在多个数据集上显著优于现有最先进方法。

链接: https://arxiv.org/abs/2411.14704
作者: Zengbao Sun,Ming Zhao,Gaorui Liu,André Kaup
关键词-EN: Remote sensing cross-modal, sensing cross-modal text-image, remote sensing imagery, Remote sensing, cross-modal text-image retrieval
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Remote sensing cross-modal text-image retrieval (RSCTIR) has gained attention for its utility in information mining. However, challenges remain in effectively integrating global and local information due to variations in remote sensing imagery and ensuring proper feature pre-alignment before modal fusion, which affects retrieval accuracy and efficiency. To address these issues, we propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention and global-local window cross-attention to capture multi-scale features. A pre-alignment mechanism simplifies modal fusion training, improving retrieval performance. Additionally, we introduce a similarity matrix reweighting (SMR) algorithm for reranking, and enhance the triplet loss function with an intra-class distance term to optimize feature learning. Experiments on four datasets, including RSICD and RSITMD, validate CMPAGL’s effectiveness, achieving up to 4.65% improvement in R@1 and 2.28% in mean Recall (mR) over state-of-the-art methods.
zh
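
论文在三元组损失中加入了类内距离项以优化特征学习。下面是一个假设性的单样本示意(margin 与权重均为示例值,非论文设定):排序项之外额外惩罚锚点与正样本的距离,从而收紧类内分布。

```python
import numpy as np

def triplet_intra_loss(anchor, positive, negative, margin=0.2, lam=0.1):
    """Triplet ranking loss plus a weighted intra-class distance term."""
    d_ap = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin) + lam * d_ap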

[CV-53] Anti-Forgetting Adaptation for Unsupervised Person Re-identification

【速读】: 该论文试图解决在无监督域自适应行人重识别(ReID)中,模型在适应新域时容易遗忘源域和已适应目标域知识的问题。解决方案的关键在于提出了一种双层联合适应与抗遗忘(DJAA)框架,通过利用原型和实例级别的连续性来减轻遗忘。具体来说,该框架在每个适应步骤中存储少量代表性图像样本及其对应的聚类原型,并使用这些缓存的图像和原型来正则化图像间相似度和图像与原型间的相似度,从而在适应新域的同时复习旧知识。实验结果表明,该方法显著提升了模型的抗遗忘能力、泛化能力和向后兼容性。

链接: https://arxiv.org/abs/2411.14695
作者: Hao Chen,Francois Bremond,Nicu Sebe,Shiliang Zhang
关键词-EN: Regular unsupervised domain, adaptive person re-identification, Regular unsupervised, focuses on adapting, fixed target domain
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to TPAMI

点击查看摘要

Abstract:Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously-acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting source domain and each adapted target domain. We explore the possibility of using prototype and instance-level consistency to mitigate the forgetting during the adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting, generalization and backward-compatible ability of an unsupervised person ReID model.
zh
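
下述草图(非官方实现,温度等超参为假设值)用“图像-原型相似度分布”的交叉熵演示复习旧知识、抑制遗忘的思路:新编码器的分布若偏离缓存的旧分布,损失增大。

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def prototype_consistency_loss(feats_new, feats_old, prototypes, tau=0.05):
    """Cross-entropy between the old (frozen) and new image-to-prototype
    similarity distributions; small when the new encoder has not drifted."""
    def dist(f):
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
        return softmax(f @ p.T / tau)
    p_new, p_old = dist(feats_new), dist(feats_old)
    return float(np.mean(-np.sum(p_old * np.log(p_new + 1e-12), axis=1)))
```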

[CV-54] Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

【速读】: 该论文试图解决在差分隐私 (Differential Privacy, DP) 约束下,如何在不进行微调的情况下实现隐私保护的风格和内容迁移问题。解决方案的关键在于利用基于嵌入的技术:通用引导 (Universal Guidance) 和文本反转 (Textual Inversion, TI),并通过差分隐私机制进行适配。具体来说,论文通过在Stable Diffusion模型上应用这些方法,使用两个私有数据集(一位艺术家的作品集和巴黎2024奥运会图标集)进行风格适配。实验结果表明,基于TI的适配方法在强隐私保证下实现了优越的风格迁移保真度,同时两种方法通过校准噪声和子采样策略保持了高隐私韧性。这一研究展示了在生成式 AI (Generative AI) 应用中,通过嵌入驱动的方法实现隐私保护扩散模型适配的可行性和高效性。

链接: https://arxiv.org/abs/2411.14639
作者: Pura Peetathawatchai,Wei-Ning Chen,Berivan Isik,Sanmi Koyejo,Albert No
关键词-EN: adapting diffusion models, enabling privacy-preserving style, adapting diffusion, Universal Guidance, Textual Inversion
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce novel methods for adapting diffusion models under differential privacy (DP) constraints, enabling privacy-preserving style and content transfer without fine-tuning. Traditional approaches to private adaptation, such as DP-SGD, incur significant computational overhead and degrade model performance when applied to large, complex models. Our approach instead leverages embedding-based techniques: Universal Guidance and Textual Inversion (TI), adapted with differentially private mechanisms. We apply these methods to Stable Diffusion for style adaptation using two private datasets: a collection of artworks by a single artist and pictograms from the Paris 2024 Olympics. Experimental results show that the TI-based adaptation achieves superior fidelity in style transfer, even under strong privacy guarantees, while both methods maintain high privacy resilience by employing calibrated noise and subsampling strategies. Our findings demonstrate a feasible and efficient pathway for privacy-preserving diffusion model adaptation, balancing data protection with the fidelity of generated images, and offer insights into embedding-driven methods for DP in generative AI applications.
zh
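
以下示意展示嵌入级“裁剪-聚合-加噪”的高斯机制骨架(灵敏度按 replace-one 相邻关系取 2·C/n,隐私会计从略;均为假设性简化,非论文实现):

```python
import numpy as np

def dp_mean_embedding(embeddings, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each embedding to clip_norm, average, then add Gaussian noise
    calibrated to the mean's sensitivity (2 * clip_norm / n, replace-one)."""
    if rng is None:
        rng = np.random.default_rng(0)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    clipped = embeddings * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    mean = clipped.mean(axis=0)
    sigma = noise_multiplier * 2.0 * clip_norm / len(embeddings)
    return mean + rng.normal(0.0, sigma, size=mean.shape)
```

更大的 noise_multiplier 对应更强的隐私保证,代价是聚合嵌入精度下降。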

[CV-55] HotSpot: Screened Poisson Equation for Signed Distance Function Optimization

【速读】: 该论文试图解决现有优化神经符号距离函数(neural signed distance functions)方法中的两个主要问题:一是现有损失函数(如eikonal loss)无法确保恢复的隐函数是距离函数,即使该隐函数几乎处处满足eikonal方程;二是eikonal loss在优化过程中存在稳定性问题,且引入区域或散度最小化的补救措施可能导致过度平滑。论文提出的解决方案HotSpot的关键在于设计了一种新的损失函数,该函数在最小化时能够收敛到真实的距离函数,具有稳定性,并自然地惩罚大面积表面。通过理论分析和在2D及3D数据集上的实验,证明了该方法在表面重建和距离近似方面提供了更好的效果。

链接: https://arxiv.org/abs/2411.14628
作者: Zimo Wang,Cheng Wang,Taiki Yoshino,Sirui Tao,Ziyang Fu,Tzu-Mao Li
关键词-EN: screened Poisson equation, optimizing neural signed, screened Poisson, neural signed distance, Poisson equation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a method, HotSpot, for optimizing neural signed distance functions, based on a relation between the solution of a screened Poisson equation and the distance function. Existing losses such as the eikonal loss cannot guarantee the recovered implicit function to be a distance function, even when the implicit function satisfies the eikonal equation almost everywhere. Furthermore, the eikonal loss suffers from stability issues in optimization, and the remedies that introduce area or divergence minimization can lead to oversmoothing. We address these challenges by designing a loss function that, when minimized, converges to the true distance function, is stable, and naturally penalizes large surface area. We provide theoretical analysis and experiments on both challenging 2D and 3D datasets and show that our method provides better surface reconstruction and more accurate distance approximation.
zh
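
作为背景,eikonal 约束要求隐函数满足 ||∇f|| = 1(几乎处处成立)。下面用中心差分数值检查该残差(仅为示意,非论文损失的完整形式;真实训练中梯度由自动微分给出):

```python
import numpy as np

def eikonal_residual(f, points, h=1e-4):
    """Mean squared deviation of ||grad f|| from 1 at the given points,
    with gradients estimated by central finite differences.

    f: vectorized function mapping an (n, d) array to n values."""
    grads = np.zeros_like(points, dtype=float)
    for d in range(points.shape[1]):
        e = np.zeros(points.shape[1])
        e[d] = h
        grads[:, d] = (f(points + e) - f(points - e)) / (2 * h)
    return float(np.mean((np.linalg.norm(grads, axis=1) - 1.0) ** 2))
```

正如论文指出的,残差为零并不能保证 f 是距离函数,这正是其引入 screened Poisson 关系的动机。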

[CV-56] Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems

【速读】: 该论文试图解决三维视觉定位 (3D Visual Grounding, 3DVG) 任务中监督方法词汇封闭和语言理解能力有限的问题,以及零样本方法推理速度慢的挑战。解决方案的关键在于将3DVG任务重新定义为约束满足问题 (Constraint Satisfaction Problem, CSP),其中变量和约束分别代表对象及其空间关系。这种方法允许全局推理所有相关对象,生成目标和锚点对象的定位结果,并能灵活处理否定和计数查询。实验结果表明,该方法在ScanRefer和Nr3D数据集上显著优于当前最先进的零样本3DVG方法,分别提升了7.0%和11.2%的Acc@0.5得分。

链接: https://arxiv.org/abs/2411.14594
作者: Qihao Yuan,Jiaming Zhang,Kailai Li,Rainer Stiefelhagen
关键词-EN: natural language descriptions, aims to locate, language descriptions, Constraint Satisfaction Problem, Constraint Satisfaction Visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D visual grounding (3DVG) aims to locate objects in a 3D scene with natural language descriptions. Supervised methods have achieved decent accuracy, but have a closed vocabulary and limited language understanding ability. Zero-shot methods mostly utilize large language models (LLMs) to handle natural language descriptions, yet suffer from slow inference speed. To address these problems, in this work, we propose a zero-shot method that reformulates the 3DVG task as a Constraint Satisfaction Problem (CSP), where the variables and constraints represent objects and their spatial relations, respectively. This allows a global reasoning of all relevant objects, producing grounding results of both the target and anchor objects. Moreover, we demonstrate the flexibility of our framework by handling negation- and counting-based queries with only minor extra coding efforts. Our system, Constraint Satisfaction Visual Grounding (CSVG), has been extensively evaluated on the public datasets ScanRefer and Nr3D datasets using only open-source LLMs. Results show the effectiveness of CSVG and superior grounding accuracy over current state-of-the-art zero-shot 3DVG methods with improvements of +7.0% (Acc@0.5 score) and +11.2% on the ScanRefer and Nr3D datasets, respectively. The code of our system is publicly available at this https URL.
zh
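
下面用一个玩具级暴力 CSP 求解器示意“变量 = 目标/锚点对象、约束 = 空间关系”的建模方式(对象坐标与关系谓词均为虚构;论文实际由 LLM 生成约束并做全局推理):

```python
from itertools import product

def solve_csp(domains, constraints):
    """Brute-force CSP: domains maps each variable to candidate object ids;
    constraints are predicates over a full assignment."""
    vars_ = list(domains)
    for values in product(*(domains[v] for v in vars_)):
        assign = dict(zip(vars_, values))
        if all(c(assign) for c in constraints):
            return assign
    return None

# "the chair next to the table", with hypothetical 1D object positions
pos = {"chair1": 0.0, "chair2": 5.0, "table1": 4.5}
domains = {"target": ["chair1", "chair2"], "anchor": ["table1"]}
near = lambda a: abs(pos[a["target"]] - pos[a["anchor"]]) < 1.0
print(solve_csp(domains, [near]))  # {'target': 'chair2', 'anchor': 'table1'}
```

同一框架下,否定查询可写成取反谓词,计数查询可改为枚举满足约束的赋值数量,这与论文强调的灵活性一致。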

[CV-57] Privacy-Preserving Video Anomaly Detection: A Survey

【速读】: 该论文试图解决隐私保护视频异常检测 (Privacy-Preserving Video Anomaly Detection, P2VAD) 领域中的关键问题,即如何在保护个人隐私的前提下,有效地检测监控视频中的异常事件。解决方案的关键在于系统性地回顾和分类现有的P2VAD方法,明确其基本假设、学习框架和优化目标,分析各方法的优缺点及其潜在关联。此外,论文还提供了开放的研究资源,如基准数据集和可用代码,并讨论了AI发展和P2VAD部署中的关键挑战和未来机遇,旨在指导该领域的未来研究工作。

链接: https://arxiv.org/abs/2411.14565
作者: Jing Liu,Yang Liu,Xiaoguang Zhu
关键词-EN: Video Anomaly Detection, automatically analyze spatiotemporal, analyze spatiotemporal patterns, detect anomalous events, surveillance videos collected
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Video Anomaly Detection (VAD) aims to automatically analyze spatiotemporal patterns in surveillance videos collected from open spaces to detect anomalous events that may cause harm without physical contact. However, vision-based surveillance systems such as closed-circuit television often capture personally identifiable information. The lack of transparency and interpretability in video transmission and usage raises public concerns about privacy and ethics, limiting the real-world application of VAD. Recently, researchers have focused on privacy concerns in VAD by conducting systematic studies from various perspectives including data, features, and systems, making Privacy-Preserving Video Anomaly Detection (P2VAD) a hotspot in the AI community. However, current research in P2VAD is fragmented, and prior reviews have mostly focused on methods using RGB sequences, overlooking privacy leakage and appearance bias considerations. To address this gap, this article systematically reviews the progress of P2VAD for the first time, defining its scope and providing an intuitive taxonomy. We outline the basic assumptions, learning frameworks, and optimization objectives of various approaches, analyzing their strengths, weaknesses, and potential correlations. Additionally, we provide open access to research resources such as benchmark datasets and available code. Finally, we discuss key challenges and future opportunities from the perspectives of AI development and P2VAD deployment, aiming to guide future work in the field.
zh

[CV-58] Enhancing GeoAI and location encoding with spatial point pattern statistics: A Case Study of Terrain Feature Classification

【速读】: 该论文试图解决地形特征分类问题,通过将空间点模式统计(spatial point pattern statistics)融入深度学习模型,以提高地理人工智能(GeoAI)的决策能力。解决方案的关键在于采用知识驱动的方法,整合点模式的一阶和二阶效应,从而捕捉位置特征并增强模型对空间关系的理解。研究结果表明,这种整合显著提升了模型性能,特别是在利用不同空间关系表示方面。

链接: https://arxiv.org/abs/2411.14560
作者: Sizhe Wang,Wenwen Li
关键词-EN: point pattern statistics, deep learning models, spatial point pattern, point pattern, terrain feature classification
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 4 pages with 1 figure. Accepted in 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery

点击查看摘要

Abstract:This study introduces a novel approach to terrain feature classification by incorporating spatial point pattern statistics into deep learning models. Inspired by the concept of location encoding, which aims to capture location characteristics to enhance GeoAI decision-making capabilities, we improve the GeoAI model by a knowledge driven approach to integrate both first-order and second-order effects of point patterns. This paper investigates how these spatial contexts impact the accuracy of terrain feature predictions. The results show that incorporating spatial point pattern statistics notably enhances model performance by leveraging different representations of spatial relationships.
zh
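
点模式的一阶效应(强度)与二阶效应(如 Ripley's K)可按如下方式估计(单位正方形窗口、无边缘校正的朴素版本,仅作示意,非论文所用的具体统计量):

```python
import numpy as np

def point_pattern_features(points, r):
    """First-order (intensity) and second-order (naive Ripley's K, no edge
    correction) statistics for a 2D point pattern in the unit square."""
    n = len(points)
    area = 1.0
    intensity = n / area                          # first-order effect
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    pairs_within = int((d <= r).sum()) - n        # drop the i == j self-pairs
    k_r = area * pairs_within / (n * (n - 1))     # naive K(r) estimator
    return intensity, k_r
```

这类统计量可作为额外通道或嵌入特征拼接进深度模型,对应文中“知识驱动”的整合思路。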

[CV-59] GMAI-VL GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

【速读】: 该论文试图解决通用人工智能(General Artificial Intelligence)在医疗领域(General Medical AI, GMAI)中由于缺乏专业医学知识而效果受限的问题。解决方案的关键在于创建了一个名为GMAI-VL-5.5M的综合多模态医学数据集,该数据集通过将数百个专业医学数据集转换为精心构建的图像-文本对,涵盖了全面的任务覆盖、多样的模态和高品质的图像-文本数据。基于此数据集,论文提出了GMAI-VL,一个通用医学视觉-语言模型,并采用逐步三阶段训练策略,通过整合视觉和文本信息,显著提升模型处理多模态数据的能力,从而支持准确的诊断和临床决策。实验评估表明,GMAI-VL在多种多模态医学任务中达到了最先进的结果。

链接: https://arxiv.org/abs/2411.14522
作者: Tianbin Li,Yanzhou Su,Wei Li,Bin Fu,Zhe Chen,Ziyan Huang,Guoan Wang,Chenglong Ma,Ying Chen,Ming Hu,Yanjun Li,Pengcheng Chen,Xiaowei Hu,Zhongying Deng,Yuanfeng Ji,Jin Ye,Yu Qiao,Junjun He
关键词-EN: remains constrained due, general artificial intelligence, specialized medical knowledge, artificial intelligence, remains constrained
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite significant advancements in general artificial intelligence, such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained due to the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. This dataset features comprehensive task coverage, diverse modalities, and high-quality image-text data. Building upon this multimodal dataset, we propose GMAI-VL, a general medical vision-language model with a progressively three-stage training strategy. This approach significantly enhances the model’s ability by integrating visual and textual information, thereby improving its ability to process multimodal data and support accurate diagnosis and clinical decision-making. Experimental evaluations demonstrate that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the development of the GMAI-VL-5.5M dataset, the introduction of the GMAI-VL model, and the establishment of new benchmarks in multiple medical domains. Code and dataset will be released at this https URL.
zh

[CV-60] MyTimeMachine: Personalized Facial Age Transformation

【速读】: 该论文试图解决面部老化预测中的个性化问题,即如何利用个人照片集来定制化地预测个体在不同年龄段的外观。解决方案的关键在于提出了一种名为MyTimeMachine (MyTM)的方法,该方法结合了全局老化先验知识与个人照片集(仅需50张图像),通过引入一个新颖的适配器网络(Adapter Network)来融合个性化老化特征与全局老化特征,并利用StyleGAN2生成重老化图像。此外,论文还引入了三种损失函数:个性化老化损失、外推正则化和自适应w-范数正则化,以进一步个性化适配器网络。这种方法不仅适用于静态图像,还能扩展到视频,实现高质量、身份保持和时间一致的老化效果,优于现有的最先进方法。

链接: https://arxiv.org/abs/2411.14521
作者: Luchao Qi,Jiaye Wu,Bang Gong,Annie N. Wang,David W. Jacobs,Roni Sengupta
关键词-EN: Facial aging, aging, complex process, highly dependent, factors like gender
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior to predict aging for any individual accurately. Existing techniques often produce realistic and plausible aging results, but the re-aged images often do not resemble the person’s appearance at the target age and thus need personalization. In many practical applications of virtual aging, e.g. VFX in movies and TV shows, access to a personal photo collection of the user depicting aging in a small time interval (20 \sim 40 years) is often available. However, naive attempts to personalize global aging techniques on personal photo collections often fail. Thus, we propose MyTimeMachine (MyTM), which combines a global aging prior with a personal photo collection (using as few as 50 images) to learn a personalized age transformation. We introduce a novel Adapter Network that combines personalized aging features with global aging features and generates a re-aged image with StyleGAN2. We also introduce three loss functions to personalize the Adapter Network with personalized aging loss, extrapolation regularization, and adaptive w-norm regularization. Our approach can also be extended to videos, achieving high-quality, identity-preserving, and temporally consistent aging effects that resemble actual appearances at target ages, demonstrating its superiority over state-of-the-art approaches.
zh

[CV-61] he Double-Ellipsoid Geometry of CLIP

【速读】: 该论文试图解决对比语言-图像预训练(Contrastive Language-Image Pre-Training, CLIP)嵌入的几何结构问题,特别是其线性可分性和不确定性嵌入的机制。解决方案的关键在于揭示了文本和图像嵌入分别位于非以原点为中心、线性可分的椭球壳上,并提出了一个新的“一致性”(conformity)概念,用于衡量实例与其他实例的平均余弦相似度。通过计算实例与模态均值向量的余弦相似度,可以准确估计这一一致性。此外,论文发现CLIP的模态差异优化了图像和文本一致性分布的匹配,从而提高了嵌入的质量和对比训练的效果。

链接: https://arxiv.org/abs/2411.14517
作者: Meir Yossef Levi,Guy Gilboa
关键词-EN: machine learning applications, Contrastive Language-Image Pre-Training, Language-Image Pre-Training, variety of domains, highly instrumental
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood. We examine the raw unnormalized embedding and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of having this structure, which allows instances to be better embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP’s modality gap optimizes the matching of the conformity distributions of image and text.
zh
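
论文声称“一致性”(与其他所有实例的平均余弦相似度)可以用与模态均值向量的余弦相似度来估计。下面的小型数值实验(玩具数据,非 CLIP 嵌入)验证两者在行归一化后是彼此的仿射变换,因而完全线性相关:

```python
import numpy as np

def conformity_exact(X):
    """Average cosine similarity of each row to every other row."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    return (S.sum(axis=1) - 1.0) / (len(X) - 1)   # drop self-similarity

def conformity_approx(X):
    """Cosine similarity to the (normalized) modality mean vector."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    mu = Xn.mean(axis=0)
    return Xn @ (mu / np.linalg.norm(mu))
```

推导上,精确一致性等于 (n·t − 1)/(n − 1),近似量等于 t/||μ||,其中 t 为实例与均值向量的内积,二者只差一个仿射变换。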

[CV-62] Memory Backdoor Attacks on Neural Networks

【速读】: 该论文试图解决的问题是如何在神经网络模型中植入隐蔽的记忆后门攻击,使得模型在特定触发条件下能够泄露训练数据中的特定样本。解决方案的关键在于设计一种能够在任务冲突的情况下(例如分类器输出图像)工作的攻击方法,该方法能够系统性地从已部署的模型中提取训练样本,并保证提取数据的真实性。通过这种攻击,可以在现代视觉架构和大型语言模型(LLM)中隐藏数千张图像和文本,同时保持模型性能。论文提出的解决方案不仅对传统模型部署构成威胁,还对联邦学习等现代框架构成潜在风险。

链接: https://arxiv.org/abs/2411.14516
作者: Eden Luzon,Guy Amit,Roy Weiss,Yisroel Mirsky
关键词-EN: Neural networks, confidential datasets, proprietary and confidential, Neural, attack
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural networks, such as image classifiers, are frequently trained on proprietary and confidential datasets. It is generally assumed that once deployed, the training data remains secure, as adversaries are limited to query-response interactions with the model, where at best, fragments of arbitrary data can be inferred without any guarantees on their authenticity. In this paper, we propose the memory backdoor attack, where a model is covertly trained to memorize specific training samples and later selectively output them when triggered with an index pattern. What makes this attack unique is that it (1) works even when the tasks conflict (making a classifier output images), (2) enables the systematic extraction of training samples from deployed models and (3) offers guarantees on the authenticity of the extracted data. We demonstrate the attack on image classifiers, segmentation models, and a large language model (LLM). With this attack, it is possible to hide thousands of images and texts in modern vision architectures and LLMs respectively, all while maintaining model performance. The memory backdoor attack poses a significant threat not only to conventional model deployments but also to federated learning paradigms and other modern frameworks. Therefore, we suggest an efficient and effective countermeasure that can be immediately applied and advocate for further work on the topic.
zh

[CV-63] Are Anomaly Scores Telling the Whole Story? A Benchmark for Multilevel Anomaly Detection

【速读】: 该论文试图解决现有异常检测模型在处理异常严重性评估方面的不足,即现有模型主要在二元设置下运行,其生成的异常分数通常基于数据点与正常数据的偏差,可能无法准确反映实际的异常严重性。解决方案的关键在于提出了一个新的设置——多级异常检测 (Multilevel Anomaly Detection, MAD),其中异常分数直接代表异常的严重性,并强调了其在多个领域的广泛应用。此外,论文还引入了一个新的基准测试——MAD-Bench,用于评估模型在检测异常和反映异常严重性方面的能力,并进行了全面的性能分析,以提供改进异常检测模型以实现实际严重性对齐的关键见解。

链接: https://arxiv.org/abs/2411.14515
作者: Tri Cao,Minh-Huy Trinh,Ailin Deng,Quoc-Nam Nguyen,Khoa Duong,Ngai-Man Cheung,Bryan Hooi
关键词-EN: machine learning task, machine learning, learning task, learning patterns, task that identifies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Anomaly detection (AD) is a machine learning task that identifies anomalies by learning patterns from normal training data. In many real-world scenarios, anomalies vary in severity, from minor anomalies with little risk to severe abnormalities requiring immediate attention. However, existing models primarily operate in a binary setting, and the anomaly scores they produce are usually based on the deviation of data points from normal data, which may not accurately reflect practical severity. In this paper, we address this gap by making three key contributions. First, we propose a novel setting, Multilevel AD (MAD), in which the anomaly score represents the severity of anomalies in real-world applications, and we highlight its diverse applications across various domains. Second, we introduce a novel benchmark, MAD-Bench, that evaluates models not only on their ability to detect anomalies, but also on how effectively their anomaly scores reflect severity. This benchmark incorporates multiple types of baselines and real-world applications involving severity. Finally, we conduct a comprehensive performance analysis on MAD-Bench. We evaluate models on their ability to assign severity-aligned scores, investigate the correspondence between their performance on binary and multilevel detection, and study their robustness. This analysis offers key insights into improving AD models for practical severity alignment. The code framework and datasets used for the benchmark will be made publicly available.
zh
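
评估“异常分数是否反映严重程度”的一种朴素做法,是计算分数与严重度标签之间的 Spearman 秩相关(下面为不处理并列值的简化版;MAD-Bench 的实际评估指标以论文为准):

```python
import numpy as np

def severity_alignment(scores, severities):
    """Spearman rank correlation between anomaly scores and severity levels
    (tie-naive: assumes no duplicate values within each array)."""
    def ranks(x):
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    return float(np.corrcoef(ranks(np.asarray(scores)),
                             ranks(np.asarray(severities)))[0, 1])
```

取值接近 1 表示分数排序与严重度排序一致,接近 0 则说明模型虽可能检出异常,但分数并不携带严重度信息。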

[CV-64] NexusSplats: Efficient 3D Gaussian Splatting in the Wild CVPR2025

【速读】: 该论文试图解决3D高斯喷射(3D Gaussian Splatting, 3DGS)在复杂光照条件和遮挡情况下的重建效率和质量问题。解决方案的关键在于提出了一种称为NexusSplats的新方法,该方法通过以下两个创新点来实现高效且精细的3D场景重建:1) 利用一种新的光照解耦策略,通过基于连接核(nexus kernels)的外观嵌入优化,而非大规模高斯基元(Gaussian primitives),从而加速重建速度并确保局部颜色一致性以实现更精细的纹理;2) 开发了一种高斯基元级别的不确定性机制,将3D结构与2D图像特征对齐,以实现细粒度的遮挡处理。实验结果表明,NexusSplats在保持最先进渲染质量的同时,将重建时间减少了高达70.4%。

链接: https://arxiv.org/abs/2411.14514
作者: Yuzhou Tang,Dejun Xu,Yongjie Hou,Zhenzhong Wang,Min Jiang
关键词-EN: Gaussian Splatting, recently demonstrated remarkable, varying lighting conditions, demonstrated remarkable rendering, massive Gaussian primitives
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to CVPR 2025

点击查看摘要

Abstract:While 3D Gaussian Splatting (3DGS) has recently demonstrated remarkable rendering quality and efficiency in 3D scene reconstruction, it struggles with varying lighting conditions and incidental occlusions in real-world scenarios. To accommodate varying lighting conditions, existing 3DGS extensions apply color mapping to the massive Gaussian primitives with individually optimized appearance embeddings. To handle occlusions, they predict pixel-wise uncertainties via 2D image features for occlusion capture. Nevertheless, such massive color mapping and pixel-wise uncertainty prediction strategies suffer from not only additional computational costs but also coarse-grained lighting and occlusion handling. In this work, we propose a nexus kernel-driven approach, termed NexusSplats, for efficient and finer 3D scene reconstruction under complex lighting and occlusion conditions. In particular, NexusSplats leverages a novel light decoupling strategy where appearance embeddings are optimized based on nexus kernels instead of massive Gaussian primitives, thus accelerating reconstruction speeds while ensuring local color consistency for finer textures. Additionally, a Gaussian-wise uncertainty mechanism is developed, aligning 3D structures with 2D image features for fine-grained occlusion handling. Experimental results demonstrate that NexusSplats achieves state-of-the-art rendering quality while reducing reconstruction time by up to 70.4% compared to the current best in quality.
zh

[CV-65] LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在处理长视频时面临的精确时刻检索难题,主要由于模型上下文大小限制和粗糙的帧提取方法。解决方案的关键在于提出了一种名为Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR)的新方法,该方法通过结合密集帧和时间编码(Dense Frame and Time Encoding, DFTE)进行时空特征提取,信息帧选择(Informative Frame Selection, IFS)捕捉简短的视觉和运动模式,以及动态令牌压缩(Dynamic Token Compression, DTC)来管理大语言模型的上下文限制。这些技术的结合使得LLaVA-MR在Charades-STA和QVHighlights等基准测试中显著优于现有的11种最先进方法,分别在R1@0.5和mAP@0.5指标上提升了1.82%和1.29%。

链接: https://arxiv.org/abs/2411.14505
作者: Weiheng Lu,Jian Li,An Yu,Ming-Ching Chang,Shengpeng Ji,Min Xia
关键词-EN: Multimodal Large Language, Large Language Models, Language Models, Multimodal Large, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are widely used for visual perception, understanding, and reasoning. However, long video processing and precise moment retrieval remain challenging due to LLMs’ limited context size and coarse frame extraction. We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR), which enables accurate moment retrieval and contextual grounding in videos using MLLMs. LLaVA-MR combines Dense Frame and Time Encoding (DFTE) for spatial-temporal feature extraction, Informative Frame Selection (IFS) for capturing brief visual and motion patterns, and Dynamic Token Compression (DTC) to manage LLM context limitations. Evaluations on benchmarks like Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods, achieving an improvement of 1.82% in R1@0.5 and 1.29% in mAP@0.5 on the QVHighlights dataset. Our implementation will be open-sourced upon acceptance.
zh

[CV-66] Night-to-Day Translation via Illumination Degradation Disentanglement

【速读】: 该论文试图解决夜间到日间图像翻译(Night-to-Day translation, Night2Day)中,在无配对数据条件下处理复杂退化图像的难题。关键解决方案在于提出了N2D3(Night-to-Day via Degradation Disentanglement)方法,通过识别夜间图像中的不同退化模式来实现日间视觉效果。具体来说,该方法包括一个退化解耦模块和一个退化感知对比学习模块。首先,从基于Kubelka-Munk理论的光度模型中提取物理先验信息,然后设计解耦模块来区分不同光照退化区域。最后,引入退化感知对比学习策略,以保持不同退化区域间的语义一致性。该方法在两个公开数据集上的评估显示,其在视觉质量和下游任务中的应用潜力均有显著提升。

链接: https://arxiv.org/abs/2411.14504
作者: Guanzhou Lan,Yuqi Yang,Zhigang Wang,Dong Wang,Bin Zhao,Xuelong Li
关键词-EN: achieve day-like vision, textbf, aims to achieve, achieve day-like, day-like vision
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Night-to-Day translation (Night2Day) aims to achieve day-like vision for nighttime scenes. However, processing night images with complex degradations remains a significant challenge under unpaired conditions. Previous methods that uniformly mitigate these degradations have proven inadequate in simultaneously restoring daytime domain information and preserving underlying semantics. In this paper, we propose \textbfN2D3 (\textbfNight-to-\textbfDay via \textbfDegradation \textbfDisentanglement) to identify different degradation patterns in nighttime images. Specifically, our method comprises a degradation disentanglement module and a degradation-aware contrastive learning module. Firstly, we extract physical priors from a photometric model based on Kubelka-Munk theory. Then, guided by these physical priors, we design a disentanglement module to discriminate among different illumination degradation regions. Finally, we introduce the degradation-aware contrastive learning strategy to preserve semantic consistency across distinct degradation regions. Our method is evaluated on two public datasets, demonstrating a significant improvement in visual quality and considerable potential for benefiting downstream tasks.
zh
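
退化感知对比学习通常建立在 InfoNCE 形式之上;以下为单正样本 InfoNCE 的通用示意(温度等为常见默认值,非论文设定,也未包含其退化区域划分逻辑):

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.07):
    """Single-positive InfoNCE loss on L2-normalized feature vectors."""
    q = query / np.linalg.norm(query)
    p = positive / np.linalg.norm(positive)
    n = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    logits = np.concatenate(([q @ p], n @ q)) / tau
    logits -= logits.max()                      # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

正样本与查询越接近,损失越低;在论文语境中,正负样本可按不同退化区域采样,以约束跨区域的语义一致性。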

[CV-67] U-Motion: Learned Point Cloud Video Compression with U-Structured Motion Estimation

【速读】: 该论文试图解决点云视频 (Point Cloud Video, PCV) 的几何和属性压缩问题。解决方案的关键在于提出了一个基于学习的压缩方案 U-Motion,其中包括两个核心模块:U-Structured 多尺度帧间预测框架 (U-Inter) 和级联空间预测编码模块。U-Inter 通过在不同尺度上进行显式的运动估计和补偿 (Motion Estimation/Compensation, ME/MC),结合高、低尺度的运动特征以及当前和前一帧的信息,实现了精确的运动估计。级联空间预测编码模块则用于捕捉 U-Inter 预测后剩余的空间冗余。此外,论文还设计了一种有效的上下文分离与恢复方案,以减少运动和潜在比特流中的时空冗余,从而提高压缩性能。实验结果表明,U-Motion 在几何和属性压缩方面显著优于 MPEG G-PCC-GesTM v3.0 和最新的基于学习的压缩方法。

链接: https://arxiv.org/abs/2411.14501
作者: Tingyu Fan,Yueyu Hu,Yao Wang
关键词-EN: Point cloud video, Point cloud, cloud video, representation of dynamic, emerging applications
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Point cloud video (PCV) is a versatile 3D representation of dynamic scenes with many emerging applications. This paper introduces U-Motion, a learning-based compression scheme for both PCV geometry and attributes. We propose a U-Structured multiscale inter-frame prediction framework, U-Inter, which performs layer-wise explicit motion estimation and compensation (ME/MC) at different scales with varying levels of detail. It integrates both higher and lower-scale motion features, in addition to the information of current and previous frames, to enable accurate motion estimation at the current scale. In addition, we design a cascaded spatial predictive coding module to capture the inter-scale spatial redundancy remaining after U-Inter prediction. We further propose an effective context detach and restore scheme to reduce spatial-temporal redundancy in the motion and latent bit-streams and improve compression performance. We conduct experiments following the MPEG Common Test Condition and demonstrate that U-Motion can achieve significant gains over MPEG G-PCC-GesTM v3.0 and recently published learning-based methods for both geometry and attribute compression.
zh

[CV-68] Delta-NAS: Difference of Architecture Encoding for Predictor-based Evolutionary Neural Architecture Search

【速读】: 该论文试图解决神经架构搜索 (Neural Architecture Search, NAS) 中搜索空间复杂度和计算成本高的问题。解决方案的关键在于通过预测一对相似网络之间的准确性差异,将问题投影到低维空间,从而将计算复杂度从指数级降低到线性级。这一范式转变不仅显著提高了搜索效率,还在多个NAS基准测试中取得了优于现有方法的性能。

链接: https://arxiv.org/abs/2411.14498
作者: Arjun Sridhar,Yiran Chen
关键词-EN: Neural Architecture Search, task specific deployment, Neural Architecture, Architecture Search, continues to serve
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural Architecture Search (NAS) continues to serve a key role in the design and development of neural networks for task-specific deployment. Modern NAS techniques struggle to deal with ever-increasing search space complexity and compute cost constraints. Existing approaches can be categorized into two buckets: fine-grained computationally expensive NAS and coarse-grained low-cost NAS. Our objective is to craft an algorithm with the capability to perform fine-grained NAS at a low cost. We propose projecting the problem to a lower-dimensional space through predicting the difference in accuracy of a pair of similar networks. This paradigm shift allows for reducing computational complexity from exponential down to linear with respect to the size of the search space. We present a strong mathematical foundation for our algorithm in addition to extensive experimental results across a host of common NAS benchmarks. Our method significantly outperforms existing works, achieving better performance coupled with significantly higher sample efficiency.
zh

[CV-69] Multi-agent reinforcement learning strategy to maximize the lifetime of Wireless Rechargeable Sensor Networks

【速读】: 该论文试图解决在大规模无线传感器网络 (WRSNs) 中,如何通过多个移动充电器 (MCs) 最大化网络寿命并确保目标覆盖和连接性的问题。解决方案的关键在于提出了一个广义充电框架,并利用多点充电模型提高充电效率。论文的核心创新是提出了一个去中心化的部分可观测半马尔可夫决策过程 (Dec POSMDP) 模型,该模型促进了移动充电器之间的合作,并基于实时网络信息检测最佳充电位置。此外,该框架允许在不进行大量重新训练的情况下,将强化学习算法应用于不同的网络。为了解决 Dec POSMDP 模型,论文还提出了基于近端策略优化算法 (PPO) 的异步多智能体强化学习算法 (AMAPPO)。
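论文提出的 AMAPPO 建立在 PPO 的裁剪代理目标之上。下面是该裁剪目标的最小示意(数值仅作演示,并非论文中的具体实现):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # PPO 裁剪代理目标:限制新旧策略概率比率偏离 1 的幅度
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# 策略比率过大时被裁剪,从而限制单次更新的步长
obj = ppo_clip_objective(ratio=1.5, advantage=2.0)
```

当 advantage 为正且比率超过 1 + eps 时,目标被钳制在 (1 + eps) * advantage,这正是 PPO 抑制过大策略更新的机制。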

链接: https://arxiv.org/abs/2411.14496
作者: Bao Nguyen
关键词-EN: large scale WRSNs, ensure target coverage, generalized charging framework, multiple mobile chargers, mobile chargers
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: 77 pages, Bachelor’s thesis

点击查看摘要

Abstract:The thesis proposes a generalized charging framework for multiple mobile chargers to maximize the network lifetime and ensure target coverage and connectivity in large scale WRSNs. Moreover, a multi-point charging model is leveraged to enhance charging efficiency, where the MC can charge multiple sensors simultaneously at each charging location. The thesis proposes an effective Decentralized Partially Observable Semi-Markov Decision Process (Dec POSMDP) model that promotes Mobile Chargers (MCs) cooperation and detects optimal charging locations based on real-time network information. Furthermore, the proposal allows reinforcement learning algorithms to be applied to different networks without requiring extensive retraining. To solve the Dec POSMDP model, the thesis proposes an Asynchronous Multi-Agent Reinforcement Learning algorithm (AMAPPO) based on the Proximal Policy Optimization algorithm (PPO).
zh

[CV-70] Test-Time Adaptation of 3D Point Clouds via Denoising Diffusion Models WACV2025

【速读】: 该论文试图解决在处理受损点云数据时,训练和测试样本之间存在分布差异的问题。具体来说,LiDAR数据可能因传感器故障或环境因素而受损,导致域间差异。论文提出的解决方案是3D测试时适应方法(3DD-TTA),其关键在于使用扩散策略将输入点云样本适应到源域,同时保持源模型参数不变。该方法通过变分自编码器(VAE)将受损点云编码为形状潜在和潜在点,并在高斯噪声下进行去噪扩散过程,更新形状潜在和潜在点以保持样本一致性,从而生成更接近源域的点云样本。这种方法在ShapeNet、ModelNet40和ScanObjectNN数据集上进行了广泛实验,并取得了最先进的结果。
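下面用一个一维高斯玩具例子示意"先加噪、再沿源分布去噪以拉回源域"的思想(并非论文的 VAE/扩散管线,分布、噪声强度与步长均为假设):

```python
import random

random.seed(1)

def adapt(z, noise_std=0.8, steps=50, dt=0.02):
    # 前向:向潜变量注入高斯噪声(部分扩散)
    z = z + noise_std * random.gauss(0, 1)
    # 反向:沿源分布 N(0,1) 的得分 d/dz log p(z) = -z 做确定性去噪
    for _ in range(steps):
        z = z + dt * (-z)
    return z

corrupted = 5.0            # 远离源分布的"受损"潜变量
adapted = adapt(corrupted)
```

去噪后的潜变量比受损输入更接近源分布的众数,这与论文"在保持源模型参数不变的前提下,把输入样本适应到源域"的做法在方向上一致。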

链接: https://arxiv.org/abs/2411.14495
作者: Hamidreza Dastmalchi,Aijun An,Ali Cheraghian,Shafin Rahman,Sameera Ramasinghe
关键词-EN: Test-time adaptation, real-world scenarios, mitigating discrepancies, TTA, Diffusion Test-Time Adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2025 (Winter Conference on Applications of Computer Vision)

点击查看摘要

Abstract:Test-time adaptation (TTA) of 3D point clouds is crucial for mitigating discrepancies between training and testing samples in real-world scenarios, particularly when handling corrupted point clouds. LiDAR data, for instance, can be affected by sensor failures or environmental factors, causing domain gaps. Adapting models to these distribution shifts online is crucial, as training for every possible variation is impractical. Existing methods often focus on fine-tuning pre-trained models based on self-supervised learning or pseudo-labeling, which can lead to forgetting valuable source domain knowledge over time and reduce generalization on future tests. In this paper, we introduce a novel 3D test-time adaptation method, termed 3DD-TTA, which stands for 3D Denoising Diffusion Test-Time Adaptation. This method uses a diffusion strategy that adapts input point cloud samples to the source domain while keeping the source model parameters intact. The approach uses a Variational Autoencoder (VAE) to encode the corrupted point cloud into a shape latent and latent points. These latent points are corrupted with Gaussian noise and subjected to a denoising diffusion process. During this process, both the shape latent and latent points are updated to preserve fidelity, guiding the denoising toward generating consistent samples that align more closely with the source domain. We conduct extensive experiments on the ShapeNet dataset and investigate its generalizability on ModelNet40 and ScanObjectNN, achieving state-of-the-art results. The code has been released at this https URL.
zh

[CV-71] dc-GAN: Dual-Conditioned GAN for Face Demorphing From a Single Morph

【速读】: 该论文试图解决面部合成图像(facial morph)的反向恢复问题,即从合成图像中恢复出原始的两张人脸图像。现有技术要么在测试时假设已知身份信息(非常受限),要么恢复出的图像质量较差且相似度高。论文提出的解决方案之关键是引入了一种基于生成对抗网络(GAN)的新方法——dc-GAN,该方法以合成图像为条件进行反向恢复,能够克服合成图像复制问题并生成高质量的原始人脸图像。此外,该方法在不同的反向恢复范式(差分/无参考)中具有高度通用性,并通过在AMSL、FRLL-Morphs和MorDiff数据集上的实验验证了其有效性。

链接: https://arxiv.org/abs/2411.14494
作者: Nitish Shukla,Arun Ross
关键词-EN: created by combining, facial morph, face images pertaining, morph, images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A facial morph is an image created by combining two face images pertaining to two distinct identities. Face demorphing inverts the process and tries to recover the original images constituting a facial morph. While morph attack detection (MAD) techniques can be used to flag morph images, they do not divulge any visual information about the faces used to create them. Demorphing helps address this problem. Existing demorphing techniques are either very restrictive (assume identities during testing) or produce feeble outputs (both outputs look very similar). In this paper, we overcome these issues by proposing dc-GAN, a novel GAN-based demorphing method conditioned on the morph images. Our method overcomes morph-replication and produces high quality reconstructions of the bonafide images used to create the morphs. Moreover, our method is highly generalizable across demorphing paradigms (differential/reference-free). We conduct experiments on AMSL, FRLL-Morphs and MorDiff datasets to showcase the efficacy of our method.
zh

[CV-72] Attention-guided Spectrogram Sequence Modeling with CNNs for Music Genre Classification

【速读】: 该论文试图解决音乐流派分类的问题,这是音乐推荐系统、生成算法和文化分析中的关键组成部分。解决方案的关键在于采用基于注意力机制的时间特征建模方法。通过将频谱图序列输入卷积神经网络 (CNN) 和多头注意力层,该方法能够捕捉音乐作品中时间上最重要的片段,从而为每个流派生成独特的“特征签名”。这种时间聚焦不仅提高了分类的准确性,还揭示了与听众感知相符的流派特定特征。该研究为个性化音乐推荐系统提供了潜在应用,通过强调跨流派相似性和独特性,与人类对音乐流派的直觉相契合。
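时间维注意力池化的机制可以用如下极简草图说明(以原始帧特征代替论文中的 CNN 特征,查询向量与数值均为假设),它展示了模型如何对"时间上最重要的片段"加权:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(frames, query):
    # frames: 每个时间步一个特征向量;query: (假设已学好的)查询向量
    scores = [sum(q * f for q, f in zip(query, fr)) for fr in frames]
    weights = softmax(scores)          # 时间维注意力权重,和为 1
    dim = len(frames[0])
    pooled = [sum(w * fr[d] for w, fr in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

frames = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]   # 3 个时间步的 2 维频谱特征
pooled, weights = attention_pool(frames, query=[1.0, 1.0])
```

权重最大的时间步(此例中为第二帧)即对流派"特征签名"贡献最大的片段;论文使用的是多头注意力,这里为清晰起见只画了单头。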

链接: https://arxiv.org/abs/2411.14474
作者: Aditya Sridhar
关键词-EN: generation algorithms, Convolutional Neural Networks, cultural analytics, critical component, music recommendation systems
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 6 pages, 7 figures, 17 References

点击查看摘要

Abstract:Music genre classification is a critical component of music recommendation systems, generation algorithms, and cultural analytics. In this work, we present an innovative model for classifying music genres using attention-based temporal signature modeling. By processing spectrogram sequences through Convolutional Neural Networks (CNNs) and multi-head attention layers, our approach captures the most temporally significant moments within each piece, crafting a unique “signature” for genre identification. This temporal focus not only enhances classification accuracy but also reveals insights into genre-specific characteristics that can be intuitively mapped to listener perceptions. Our findings offer potential applications in personalized music recommendation systems by highlighting cross-genre similarities and distinctiveness, aligning closely with human musical intuition. This work bridges the gap between technical classification tasks and the nuanced, human experience of genre.
zh

[CV-73] Dimension-independent rates for structured neural density estimation

【速读】: 该论文试图解决深度神经网络在处理图像、音频、视频和文本等结构化密度数据时,如何避免维度灾难的问题。解决方案的关键在于证明了神经网络在非参数密度估计中,能够实现与数据维度无关的收敛速率。具体来说,论文展示了在 L^2 损失最小化的情况下,神经网络可以达到 n^-1/(4+r) 的收敛速率,其中 r 是马尔可夫随机场中最大团的大小,而在实际应用中,r 通常是常数(即 r=O(1))。此外,论文还指出在 L^1 标准下,最优收敛速率为 n^-1/(2+r),这表明这些问题的有效维度实际上是马尔可夫随机场中最大团的大小,而不是数据的环境维度。这些发现为深度学习在处理高维数据时能够规避维度灾难提供了新的理论支持。
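摘要中两类速率指数的差距可以用几行代码直观比较(r 与 d 的取值仅为假设的示意数值):

```python
def rate_exponent(k):
    # 收敛速率 n^{-1/(2+k)} 的指数部分 1/(2+k)
    return 1.0 / (2.0 + k)

r = 2        # 假设:马尔可夫随机场最大团大小
d = 1024     # 假设:环境维度,例如一张 32x32 的图像展平后的维数
structured = rate_exponent(r)   # 维度无关的结构化速率指数:1/4
classical = rate_exponent(d)    # 经典非参数速率指数:约 1/1026
```

当 r = O(1) 时结构化速率指数是常数,而经典速率指数随环境维度趋于 0,这正是"有效维度是最大团大小而非环境维度"这一结论的含义。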

链接: https://arxiv.org/abs/2411.15095
作者: Robert A. Vandermeulen,Wai Ming Tai,Bryon Aragam
关键词-EN: learning structured densities, neural networks achieve, deep neural networks, neural networks, networks achieve dimension-independent
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:We show that deep neural networks achieve dimension-independent rates of convergence for learning structured densities such as those arising in image, audio, video, and text applications. More precisely, we demonstrate that neural networks with a simple L^2 -minimizing loss achieve a rate of n^-1/(4+r) in nonparametric density estimation when the underlying density is Markov to a graph whose maximum clique size is at most r , and we provide evidence that in the aforementioned applications, this size is typically constant, i.e., r=O(1) . We then establish that the optimal rate in L^1 is n^-1/(2+r) which, compared to the standard nonparametric rate of n^-1/(2+d) , reveals that the effective dimension of such problems is the size of the largest clique in the Markov random field. These rates are independent of the data’s ambient dimension, making them applicable to realistic models of image, sound, video, and text data. Our results provide a novel justification for deep learning’s ability to circumvent the curse of dimensionality, demonstrating dimension-independent convergence rates in these contexts.
zh

[CV-74] Quantum-enhanced unsupervised image segmentation for medical images analysis

【速读】: 该论文试图解决乳腺癌医学图像分割中的自动化问题,特别是在缺乏大规模标注数据的情况下,如何实现高效且准确的图像分割。解决方案的关键在于提出了一种端到端的量子增强框架,用于无监督的乳腺X光图像分割。该框架首先引入了一种量子启发的图像表示方法,作为分割掩码的初始近似。随后,将分割任务转化为一个二次无约束二值优化问题 (QUBO),旨在最大化背景与肿瘤区域之间的对比度,同时确保分割掩码具有最小的连通分量。通过量子退火和变分量子电路,该方法在性能上与经典优化技术相当,甚至在实验中表现出比经典优化方法快一个数量级的速度。这一框架不仅在性能上可与最先进的监督方法(如基于UNet的架构)相媲美,还提供了一种可行的无监督替代方案,适用于乳腺癌图像分割。
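"奖励高亮度像素、惩罚相邻像素标签不一致"的 QUBO 建模方式,可以用一个四像素的一维玩具问题示意(系数均为假设,并非论文的具体目标函数);规模极小时可直接穷举验证:

```python
from itertools import product

intensities = [0.9, 0.8, 0.2, 0.1]   # 一维"图像",左半部分为高亮度
edges = [(0, 1), (1, 2), (2, 3)]     # 相邻像素对
threshold, smooth = 0.5, 0.3         # 假设的对比度阈值与平滑系数

def energy(x):
    # 线性项:奖励把亮度高于阈值的像素标为肿瘤 (x_i = 1)
    e = sum(-(intensities[i] - threshold) * x[i] for i in range(len(x)))
    # 二次项:x_i + x_j - 2*x_i*x_j 当且仅当相邻标签不同为 1,此时受罚
    e += smooth * sum(x[i] + x[j] - 2 * x[i] * x[j] for i, j in edges)
    return e

# 玩具问题直接穷举;论文中的真实实例交给量子退火 / 变分量子电路求解
best = min(product([0, 1], repeat=4), key=energy)
```

最优解把两个高亮度像素连成一个连通区域,对应论文"最大化前景与背景对比度、且连通分量最少"的目标。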

链接: https://arxiv.org/abs/2411.15086
作者: Laia Domingo,Mahdi Chehimi
关键词-EN: characterize abnormal lesions, women worldwide, necessitating the meticulous, abnormal lesions, remains the leading
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:Breast cancer remains the leading cause of cancer-related mortality among women worldwide, necessitating the meticulous examination of mammograms by radiologists to characterize abnormal lesions. This manual process demands high accuracy and is often time-consuming, costly, and error-prone. Automated image segmentation using artificial intelligence offers a promising alternative to streamline this workflow. However, most existing methods are supervised, requiring large, expertly annotated datasets that are not always available, and they experience significant generalization issues. Thus, unsupervised learning models can be leveraged for image segmentation, but they come at the cost of reduced accuracy or require extensive computational resources. In this paper, we propose the first end-to-end quantum-enhanced framework for unsupervised mammography medical image segmentation that balances performance accuracy and computational requirements. We first introduce a quantum-inspired image representation that serves as an initial approximation of the segmentation mask. The segmentation task is then formulated as a QUBO problem, aiming to maximize the contrast between the background and the tumor region while ensuring a cohesive segmentation mask with minimal connected components. We conduct an extensive evaluation of quantum and quantum-inspired methods for image segmentation, demonstrating that quantum annealing and variational quantum circuits achieve performance comparable to classical optimization techniques. Notably, quantum annealing is shown to be an order of magnitude faster than the classical optimization method in our experiments. Our findings demonstrate that this framework achieves performance comparable to state-of-the-art supervised methods, including UNet-based architectures, offering a viable unsupervised alternative for breast cancer image segmentation.
zh

[CV-75] Leapfrog Latent Consistency Model (LLCM) for Medical Images Generation

【速读】: 该论文试图解决医疗影像数据稀缺的问题,特别是在医院因隐私顾虑不愿共享数据的情况下,如何有效训练深度学习模型进行医疗诊断。解决方案的关键在于提出了一个名为Leapfrog Latent Consistency Model (LLCM)的模型,该模型基于重新训练的扩散模型,并利用收集的MedImgs数据集进行蒸馏。LLCM通过将反向扩散过程形式化为概率流常微分方程(PF-ODE),并在潜在空间中使用Leapfrog算法求解,从而实现快速采样而不需要额外的迭代。此外,该模型能够通过微调任何自定义的医疗影像数据集来生成大量高质量的图像,实验结果表明其在未见过的狗心脏X光图像生成上优于现有模型。
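论文在潜在空间中用 Leapfrog 算法求解 PF-ODE;其具体形式摘要未给出,下面仅示意通用的 kick-drift-kick 蛙跳积分器(以简谐振子为例,能量近似守恒,这正是蛙跳法相比普通欧拉法的优势):

```python
def leapfrog(q, p, grad_u, dt, steps):
    # kick-drift-kick 形式的蛙跳积分
    for _ in range(steps):
        p -= 0.5 * dt * grad_u(q)   # 半步动量更新
        q += dt * p                 # 整步位置更新
        p -= 0.5 * dt * grad_u(q)   # 半步动量更新
    return q, p

# 以简谐振子 U(q) = q^2/2 为例,grad U = q;总能量应近似保持为初值 0.5
q, p = leapfrog(1.0, 0.0, lambda x: x, dt=0.01, steps=1000)
energy = 0.5 * p * p + 0.5 * q * q
```

蛙跳法是辛积分器,长时间积分下能量漂移有界(误差约为 O(dt^2)),这类稳定性使其适合在少量步数内完成确定性采样。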

链接: https://arxiv.org/abs/2411.15084
作者: Lakshmikar R. Polamreddy,Kalyan Roy,Sheng-Han Yueh,Deepshikha Mahato,Shilpa Kuppili,Jialu Li,Youshan Zhang
关键词-EN: effectively training deep, training deep learning, image data poses, deep learning models, privacy concerns
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Total 16 pages including 5 figures and 36 references

点击查看摘要

Abstract:The scarcity of accessible medical image data poses a significant obstacle in effectively training deep learning models for medical diagnosis, as hospitals refrain from sharing their data due to privacy concerns. In response, we gathered a diverse dataset named MedImgs, which comprises over 250,127 images spanning 61 disease types and 159 classes of both humans and animals from open-source repositories. We propose a Leapfrog Latent Consistency Model (LLCM) that is distilled from a retrained diffusion model based on the collected MedImgs dataset, which enables our model to generate real-time high-resolution images. We formulate the reverse diffusion process as a probability flow ordinary differential equation (PF-ODE) and solve it in latent space using the Leapfrog algorithm. This formulation enables rapid sampling without necessitating additional iterations. Our model demonstrates state-of-the-art performance in generating medical images. Furthermore, our model can be fine-tuned with any custom medical image datasets, facilitating the generation of a vast array of images. Our experimental results outperform those of existing models on unseen dog cardiac X-ray images. Source code is available at this https URL.
zh

[CV-76] RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency

【速读】: 该论文试图解决空间转录组学数据与组织学图像对齐的问题,由于空间扭曲和模态特异性差异,现有的直接对齐方法难以捕捉复杂的跨模态关系。解决方案的关键在于提出了一种基于排序的对齐损失框架,通过保留模态间的相对相似性来实现鲁棒的多尺度对齐。此外,采用自监督的知识蒸馏技术,通过教师-学生网络架构来增强对齐的稳定性,有效缓解了基因表达数据的高维度、稀疏性和噪声问题。
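"保留模态间相对相似性"的排序式对齐可以用如下极简草图表达(hinge 形式与 margin 取值均为假设):若基因空间认为候选 i 比候选 j 更接近锚点,则图像空间也应保持该相对顺序,违反时按差距受罚:

```python
def rank_alignment_loss(sim_gene, sim_img, margin=0.1):
    # sim_*: 锚点样本到 N 个候选在两个模态下的相似度
    loss = 0.0
    n = len(sim_gene)
    for i in range(n):
        for j in range(n):
            if sim_gene[i] > sim_gene[j]:
                # 基因模态认为 i 比 j 更相似时,图像模态也应保持该顺序
                loss += max(0.0, margin - (sim_img[i] - sim_img[j]))
    return loss

aligned = rank_alignment_loss([0.9, 0.5, 0.1], [0.8, 0.6, 0.2])    # 顺序一致
shuffled = rank_alignment_loss([0.9, 0.5, 0.1], [0.2, 0.6, 0.8])   # 顺序颠倒
```

与直接回归相似度数值相比,只约束相对顺序对基因表达数据的高噪声与尺度差异更不敏感,这与论文选择排序式损失的动机一致。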

链接: https://arxiv.org/abs/2411.15076
作者: Wentao Huang,Meilong Xu,Xiaoling Hu,Shahira Abousamra,Aniruddha Ganguly,Saarthak Kapse,Alisa Yurovsky,Prateek Prasanna,Tahsin Kurc,Joel Saltz,Michael L. Miller,Chao Chen
关键词-EN: essential spatial context, enabling detailed study, tissue organization, Spatial transcriptomics, context by mapping
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further enhance the alignment’s stability, we employ self-supervised knowledge distillation with a teacher-student network architecture, effectively mitigating disruptions from high dimensionality, sparsity, and noise in gene expression data. Extensive experiments on gene expression prediction and survival analysis demonstrate our framework’s effectiveness, showing improved alignment and predictive performance over existing methods and establishing a robust tool for gene-guided image representation learning in digital pathology.
zh

[CV-77] Detecting Hallucinations in Virtual Histology with Neural Precursors

【速读】: 该论文试图解决虚拟染色 (Virtual Staining, VS) 技术中的幻觉检测问题,即在虚拟染色过程中识别并避免生成不准确的染色结果。解决方案的关键在于引入了一种可扩展的后处理幻觉检测方法,该方法通过识别神经幻觉前兆 (Neural Hallucination Precursor, NHP) 来实现测试阶段的幻觉检测。这一方法通过从虚拟染色模型的嵌入中提取NHP,从而有效且稳健地检测出幻觉现象。论文还强调,仅依赖减少幻觉数量的指标可能带来虚假的安全感,因此需要重新评估当前虚拟染色技术的评价实践。

链接: https://arxiv.org/abs/2411.15060
作者: Ji-Hun Oh,Kianoush Falahkheirkhah,Rohit Bhargava
关键词-EN: Significant biomedical research, clinical care rely, Significant biomedical, tissue structure, stained tissue
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Significant biomedical research and clinical care rely on the histopathologic examination of tissue structure using microscopy of stained tissue. Virtual staining (VS) offers a promising alternative with the potential to reduce cost and eliminate the use of toxic reagents. However, the critical challenge of hallucinations limits confidence in its use, necessitating a VS co-pilot to detect these hallucinations. Here, we first formally establish the problem of hallucination detection in VS. Next, we introduce a scalable, post-hoc hallucination detection method that identifies a Neural Hallucination Precursor (NHP) from VS model embeddings for test-time detection. We report extensive validation across diverse and challenging VS settings to demonstrate NHP’s effectiveness and robustness. Furthermore, we show that VS models with fewer hallucinations do not necessarily disclose them better, risking a false sense of security when reporting just the former metric. This highlights the need for a reassessment of current VS evaluation practices.
zh

[CV-78] Exploring Foundation Models Fine-Tuning for Cytology Classification

【速读】: 该论文试图解决细胞学切片分类任务中的效率和成本问题,特别是在数据样本有限的情况下。解决方案的关键在于采用低秩适应(Low-Rank Adaptation, LoRA)这一参数高效的微调方法,对预训练的基础模型进行微调。通过在四个细胞学分类数据集上评估五个基础模型,研究发现使用LoRA进行微调显著提升了模型性能,尤其是在少样本学习场景下,达到了最先进的分类效果,同时减少了所需的数据样本量。
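LoRA 的核心是冻结预训练权重 W,只训练低秩增量 B·A。下面是一个纯 Python 的最小示意(维度与数值均为假设,仅用于说明参数量如何从 d_in*d_out 降为 r*(d_in+d_out)):

```python
def matmul(M, N):
    return [[sum(m * n for m, n in zip(row, col)) for col in zip(*N)] for row in M]

d_in, d_out, r = 4, 4, 1
# 冻结的预训练权重 W(此处用单位阵示意)
W = [[1.0 if i == j else 0.0 for j in range(d_out)] for i in range(d_in)]
B = [[0.1] for _ in range(d_in)]     # d_in x r,可训练
A = [[0.2, 0.2, 0.2, 0.2]]           # r x d_out,可训练

delta = matmul(B, A)                 # 低秩增量 B @ A
W_eff = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
trainable = d_in * r + r * d_out     # 只训练 8 个参数,而非 d_in*d_out = 16
```

少样本场景下可训练参数大幅减少,正是论文选择 LoRA 做参数高效微调的原因;真实 Transformer 中 d_in、d_out 通常为数百到数千,节省比例远大于此玩具例子。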

链接: https://arxiv.org/abs/2411.14975
作者: Manon Dausort,Tiffanie Godelaine,Maxime Zanella,Karim El Khoury,Isabelle Salmon,Benoît Macq
关键词-EN: Cytology slides, staging cancer, time-consuming and costly, slides are essential, essential tools
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Cytology slides are essential tools in diagnosing and staging cancer, but their analysis is time-consuming and costly. Foundation models have shown great potential to assist in these tasks. In this paper, we explore how existing foundation models can be applied to cytological classification. More particularly, we focus on low-rank adaptation, a parameter-efficient fine-tuning method suited to few-shot learning. We evaluated five foundation models across four cytological classification datasets. Our results demonstrate that fine-tuning the pre-trained backbones with LoRA significantly improves model performance compared to fine-tuning only the classifier head, achieving state-of-the-art results on both simple and complex classification tasks while requiring fewer data samples.
zh

[CV-79] Benchmarking the Robustness of Optical Flow Estimation to Corruptions

【速读】: 该论文试图解决光学流估计模型在面对常见损坏(corruptions)时的鲁棒性问题。解决方案的关键在于引入了7种专门针对光学流的时间性损坏(temporal corruptions)和17种经典的单图像损坏(single-image corruptions),并通过先进的点扩散函数模糊(PSF Blur)模拟方法进行评估。论文建立了两个鲁棒性基准(KITTI-FC和GoPro-FC),并引入了三种鲁棒性度量(Corruption Robustness Error (CRE)、Corruption Robustness Error ratio (CREr)和Relative Corruption Robustness Error (RCRE))来量化光学流估计的鲁棒性。通过对29种模型变体的评估,论文揭示了模型鲁棒性与估计性能之间的紧密关系,并提出了改进光学流模型设计和应用的建议。
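摘要未给出 CRE/CREr/RCRE 的精确定义,下面按损坏鲁棒性基准的常见惯例给出一种假设性的计算草图(以光学流的端点误差 EPE 为例,函数与数值均为示意):

```python
def corruption_robustness(clean_epe, corrupted_epes):
    cre = sum(corrupted_epes) / len(corrupted_epes)   # 各类损坏下的平均误差
    crer = cre / clean_epe                            # 与干净输入误差之比
    rcre = (cre - clean_epe) / clean_epe              # 相对于干净误差的增量
    return cre, crer, rcre

cre, crer, rcre = corruption_robustness(1.0, [1.5, 2.0, 2.5])
```

比率型与相对型指标把"本身误差就大的模型"与"受损坏影响大的模型"区分开,这与论文观察到"绝对鲁棒性高度依赖估计性能"的结论相呼应;三个指标的官方定义以原文为准。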

链接: https://arxiv.org/abs/2411.14865
作者: Zhonghua Yi,Hao Shi,Qi Jiang,Yao Gao,Ze Wang,Yufan Zhang,Kailun Yang,Kaiwei Wang
关键词-EN: Optical flow, Optical flow estimation, Corruption Robustness Error, optical flow models, robustness
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: The benchmarks and source code will be released at this https URL

点击查看摘要

Abstract:Optical flow estimation is extensively used in autonomous driving and video editing. While existing models demonstrate state-of-the-art performance across various benchmarks, the robustness of these methods has been infrequently investigated. Despite some research focusing on the robustness of optical flow models against adversarial attacks, there has been a lack of studies investigating their robustness to common corruptions. Taking into account the unique temporal characteristics of optical flow, we introduce 7 temporal corruptions specifically designed for benchmarking the robustness of optical flow models, in addition to 17 classical single-image corruptions, in which an advanced PSF blur simulation method is applied. Two robustness benchmarks, KITTI-FC and GoPro-FC, are subsequently established as the first corruption robustness benchmarks for optical flow estimation, with Out-Of-Domain (OOD) and In-Domain (ID) settings to facilitate comprehensive studies. Robustness metrics, Corruption Robustness Error (CRE), Corruption Robustness Error ratio (CREr), and Relative Corruption Robustness Error (RCRE), are further introduced to quantify optical flow estimation robustness. 29 model variants from 15 optical flow methods are evaluated, yielding 10 intriguing observations, such as: 1) the absolute robustness of a model depends heavily on its estimation performance; 2) corruptions that diminish local information are more harmful than those that merely reduce visual effects. We also give suggestions for the design and application of optical flow models. We anticipate that our benchmark will serve as a foundational resource for advancing research in robust optical flow estimation. The benchmarks and source code will be released at this https URL.
zh

[CV-80] Cell as Point: One-Stage Framework for Efficient Cell Tracking

【速读】: 该论文试图解决传统多阶段细胞追踪方法在处理不平衡和长序列数据时面临的资源需求大、训练不足和推理过程中细胞丢失的问题。解决方案的关键在于提出了一个端到端的细胞追踪框架(CAP),该框架将细胞视为点(Cell as Point),通过直接利用细胞点轨迹之间的相关性进行联合追踪,从而简化了追踪过程,减少了标签需求和管道的复杂性。CAP框架的核心创新包括自适应事件引导(AEG)采样方法,用于解决细胞分裂事件中的数据不平衡问题,以及滚动窗口(RAW)推理方法,确保在长期追踪中新细胞的连续性。通过消除对检测或分割阶段的依赖,CAP不仅展示了强大的细胞追踪性能,而且在效率上比现有方法提高了10到55倍。
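滚动窗口(RAW)推理的具体细节摘要未给出,下面仅示意"固定窗口逐帧滑动以覆盖长序列、使晚出现的细胞也能被处理到"的通用做法(窗口大小为假设):

```python
def rolling_windows(num_frames, window):
    # 窗口逐帧向前滑动,保证长序列中的每一帧都落在至少一个窗口内
    starts = range(0, max(num_frames - window, 0) + 1)
    return [list(range(s, s + window)) for s in starts]

wins = rolling_windows(num_frames=6, window=4)
```

相邻窗口大量重叠,因而可以把上一窗口得到的轨迹延续到下一窗口,同时为窗口内新出现的细胞建立新轨迹。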

链接: https://arxiv.org/abs/2411.14833
作者: Yaxuan Song,Jianan Fan,Heng Huang,Mei Chen,Weidong Cai
关键词-EN: require substantial resources, Cellular activities, dynamic and intricate, playing a crucial, therapeutic techniques
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 17 pages, 8 figures, 8 tables

点击查看摘要

Abstract:Cellular activities are dynamic and intricate, playing a crucial role in advancing diagnostic and therapeutic techniques, yet they often require substantial resources for accurate tracking. Despite recent progress, the conventional multi-stage cell tracking approaches not only heavily rely on detection or segmentation results as a prerequisite for the tracking stage, demanding plenty of refined segmentation masks, but are also deteriorated by imbalanced and long sequence data, leading to under-learning in training and missing cells in inference procedures. To alleviate the above issues, this paper proposes the novel end-to-end CAP framework, which leverages the idea of regarding Cell as Point to achieve efficient and stable cell tracking in one stage. CAP abandons detection or segmentation stages and simplifies the process by exploiting the correlation among the trajectories of cell points to track cells jointly, thus reducing the label demand and complexity of the pipeline. With cell point trajectory and visibility to represent cell locations and lineage relationships, CAP leverages the key innovations of adaptive event-guided (AEG) sampling for addressing data imbalance in cell division events and the rolling-as-window (RAW) inference method to ensure continuous tracking of new cells in the long term. Eliminating the need for a prerequisite detection or segmentation stage, CAP demonstrates strong cell tracking performance while also being 10 to 55 times more efficient than existing methods. The code and models will be released.
zh

[CV-81] Comparative Analysis of nnUNet and MedNeXt for Head and Neck Tumor Segmentation in MRI-guided Radiotherapy

【速读】: 该论文试图解决头颈部癌症(HNC)放射治疗(RT)中手动肿瘤分割耗时且复杂的问题。解决方案的关键在于利用深度学习模型nnUNet和MedNeXt进行自动化的肿瘤体积分割(GTVp和GTVn)。具体方法包括:1) 在任务1中,先在预处理和中期RT注册图像上进行预训练,然后在原始预处理图像上进行微调;2) 在任务2中,将注册的预处理图像、注册的预处理分割掩码和中期RT数据组合为多通道输入进行训练。该方法在HNTS-MRG24 MICCAI挑战赛中取得了显著成绩,任务1的Dice相似系数达到0.8254,排名第一,任务2的Dice相似系数为0.7005,排名第八。

链接: https://arxiv.org/abs/2411.14752
作者: Nikoo Moradi,André Ferreira,Behrus Puladi,Jens Kleesiek,Emad Fatemizadeh,Gijs Luijten,Victor Alves,Jan Egger
关键词-EN: magnetic resonance imaging, offering superior soft, superior soft tissue, soft tissue contrast, Radiation therapy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Radiation therapy (RT) is essential in treating head and neck cancer (HNC), with magnetic resonance imaging (MRI)-guided RT offering superior soft tissue contrast and functional imaging. However, manual tumor segmentation is time-consuming and complex, and therefore remains a challenge. In this study, we present our solution as team TUMOR to the HNTS-MRG24 MICCAI Challenge, which is focused on automated segmentation of primary gross tumor volumes (GTVp) and metastatic lymph node gross tumor volume (GTVn) in pre-RT and mid-RT MRI images. We utilized the HNTS-MRG2024 dataset, which consists of 150 MRI scans from patients diagnosed with HNC, including original and registered pre-RT and mid-RT T2-weighted images with corresponding segmentation masks for GTVp and GTVn. We employed two state-of-the-art deep learning models, nnUNet and MedNeXt. For Task 1, we pretrained models on pre-RT registered and mid-RT images, followed by fine-tuning on original pre-RT images. For Task 2, we combined registered pre-RT images, registered pre-RT segmentation masks, and mid-RT data as a multi-channel input for training. Our solution for Task 1 achieved 1st place in the final test phase with an aggregated Dice Similarity Coefficient of 0.8254, and our solution for Task 2 ranked 8th with a score of 0.7005. The proposed solution is publicly available at the Github Repository.
zh

[CV-82] Cross Group Attention and Group-wise Rolling for Multimodal Medical Image Synthesis

【速读】: 该论文试图解决多模态磁共振成像(Multimodal MR Image Synthesis)中由于不同模态图像之间的空间错位导致的合成性能不佳的问题。解决方案的关键在于提出了一种自适应分组交互网络(Adaptive Group-wise Interaction Network, AGI-Net),该网络通过预定义通道维度上的分组,并采用自适应滚动卷积核来捕捉模态间的空间对应关系,同时引入跨组注意力模块来融合不同通道组的信息,从而实现更好的特征表示。实验结果表明,AGI-Net在公开的IXI和BraTS2023数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2411.14684
作者: Tao Song,Yicheng Wu,Minhao Hu,Xiangde Luo,Linda Wei,Guotai Wang,Yi Guo,Feng Xu,Shaoting Zhang
关键词-EN: generate missing modality, MRI data, missing modality image, aims to generate, generate missing
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal MR image synthesis aims to generate missing modality image by fusing and mapping a few available MRI data. Most existing approaches typically adopt an image-to-image translation scheme. However, these methods often suffer from sub-optimal performance due to the spatial misalignment between different modalities while they are typically treated as input channels. Therefore, in this paper, we propose an Adaptive Group-wise Interaction Network (AGI-Net) that explores both inter-modality and intra-modality relationships for multimodal MR image synthesis. Specifically, groups are first pre-defined along the channel dimension and then we perform an adaptive rolling for the standard convolutional kernel to capture inter-modality spatial correspondences. At the same time, a cross-group attention module is introduced to fuse information across different channel groups, leading to better feature representation. We evaluated the effectiveness of our model on the publicly available IXI and BraTS2023 datasets, where the AGI-Net achieved state-of-the-art performance for multimodal MR image synthesis. Code will be released.
zh

[CV-83] BrightVAE: Luminosity Enhancement in Underexposed Endoscopic Images

【速读】: 该论文试图解决内窥镜图像中亮度不足的问题,特别是在低光照条件下图像对比度降低和亮度不均匀的问题,这些问题严重影响了诊断准确性和治疗规划。解决方案的关键是引入了一种名为BrightVAE的架构,该架构基于分层向量量化变分自编码器(hierarchical VQ-VAE),专门设计用于增强低光照内窥镜图像的亮度。BrightVAE通过结合多种感受野、跳跃连接和特征注意力机制,从三个不同视角进行高级特征提取,从而有效提升图像质量,支持更准确的医学诊断。实验结果表明,该方法在内窥镜图像亮度增强方面显著优于现有最先进的方法。

链接: https://arxiv.org/abs/2411.14663
作者: Farzaneh Koohestani,Zahra Nabizadeh,Nader Karimi,Shahram Shirani,Shadrokh Samavi
关键词-EN: endoscopic images, endoscopic, Quantized Variational Autoencoder, Vector Quantized Variational, images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:The enhancement of image luminosity is especially critical in endoscopic images. Underexposed endoscopic images often suffer from reduced contrast and uneven brightness, significantly impacting diagnostic accuracy and treatment planning. Internal body imaging is challenging due to uneven lighting and shadowy regions. Enhancing such images is essential since precise image interpretation is crucial for patient outcomes. In this paper, we introduce BrightVAE, an architecture based on the hierarchical Vector Quantized Variational Autoencoder (hierarchical VQ-VAE) tailored explicitly for enhancing luminosity in low-light endoscopic images. Our architecture is meticulously designed to tackle the unique challenges inherent in endoscopic imaging, such as significant variations in illumination and obscured details due to poor lighting conditions. The proposed model emphasizes advanced feature extraction from three distinct viewpoints, incorporating various receptive fields, skip connections, and feature attentions to robustly enhance image quality and support more accurate medical diagnoses. Through rigorous experimental analysis, we demonstrate the effectiveness of these techniques in enhancing low-light endoscopic images. To evaluate the performance of our architecture, we employ three widely recognized metrics (SSIM, PSNR, and LPIPS) on the Endo4IE dataset, which consists exclusively of endoscopic images, and show significant advancements over the state-of-the-art methods for enhancing luminosity in endoscopic imaging.
zh

[CV-84] Evaluating Representational Similarity Measures from the Lens of Functional Correspondence

【速读】: 该论文试图解决在高维神经数据解释中,如何选择最适合的表示相似性度量(representational similarity metrics)以揭示神经科学和人工智能系统之间的共享机制和差异。解决方案的关键在于评估这些度量与行为结果之间的对齐程度。通过比较八种常用的视觉领域表示相似性度量,研究发现强调整体几何结构或形状的度量(如线性中心核对齐 (CKA) 和 Procrustes 距离)在区分训练模型与未训练模型以及与行为度量对齐方面表现优异,而常用的线性预测性度量在行为对齐方面表现中等。这些发现对于在神经科学与人工智能交叉研究中选择强调行为意义比较的度量具有重要意义。
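线性 CKA 的标准定义可以用几行纯 Python 复现(玩具表示矩阵,行是刺激、列是单元),并验证其对恒等表示给出 1、且对各向同性缩放不变:

```python
def center(M):
    # 按列(单元)去均值
    means = [sum(col) / len(col) for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]

def frob_xty_sq(X, Y):
    # ||X^T Y||_F^2
    xt_y = [[sum(X[k][i] * Y[k][j] for k in range(len(X)))
             for j in range(len(Y[0]))] for i in range(len(X[0]))]
    return sum(v * v for row in xt_y for v in row)

def linear_cka(X, Y):
    X, Y = center(X), center(Y)
    return frob_xty_sq(X, Y) / (frob_xty_sq(X, X) ** 0.5 * frob_xty_sq(Y, Y) ** 0.5)

# 行 = 刺激,列 = 单元的玩具表示
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
same = linear_cka(X, X)                                        # 相同表示 -> 1
scaled = linear_cka(X, [[2.0 * v for v in row] for row in X])  # 对缩放不变
```

正是这种只关心表示整体几何形状、忽略尺度的性质,使 CKA 这类度量在论文的行为对齐评测中表现突出。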

链接: https://arxiv.org/abs/2411.14633
作者: Yiqing Bo,Ansh Soni,Sudhanshu Srivastava,Meenakshi Khosla
关键词-EN: high-dimensional neural data, interpreting high-dimensional neural, revealing shared mechanisms, neural data, artificial intelligence
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neuroscience and artificial intelligence (AI) both face the challenge of interpreting high-dimensional neural data, where the comparative analysis of such data is crucial for revealing shared mechanisms and differences between these complex systems. Despite the widespread use of representational comparisons and the abundance of comparison methods, a critical question remains: which metrics are most suitable for these comparisons? While some studies evaluate metrics based on their ability to differentiate models of different origins or constructions (e.g., various architectures), another approach is to assess how well they distinguish models that exhibit distinct behaviors. To investigate this, we examine the degree of alignment between various representational similarity measures and behavioral outcomes, employing group statistics and a comprehensive suite of behavioral metrics for comparison. In our evaluation of eight commonly used representational similarity metrics in the visual domain – spanning alignment-based, Canonical Correlation Analysis (CCA)-based, inner product kernel-based, and nearest-neighbor methods – we found that metrics like linear Centered Kernel Alignment (CKA) and Procrustes distance, which emphasize the overall geometric structure or shape of representations, excelled in differentiating trained from untrained models and aligning with behavioral measures, whereas metrics such as linear predictivity, commonly used in neuroscience, demonstrated only moderate alignment with behavior. These insights are crucial for selecting metrics that emphasize behaviorally meaningful comparisons in NeuroAI research.

[CV-85] Unveiling the Hidden: A Comprehensive Evaluation of Underwater Image Enhancement and Its Impact on Object Detection

【速读】: 该论文试图解决水下图像质量低下导致的物体检测性能不佳的问题。解决方案的关键在于评估和比较当前最先进的图像增强模型对水下物体检测性能的影响。研究通过选择代表性的水下图像增强模型,分别应用于两个最新的数据集(RUOD和CUPDD),并进行定性和定量分析。此外,研究还开发了一个质量指数(Q-index)来比较原始图像和增强图像的质量分布,并评估了多个YOLO-NAS检测模型在原始和增强图像集上的性能。通过相关性研究,探讨了增强指标与检测性能之间的关系,并分析了增强在某些情况下提高检测性能和揭示人类标注者遗漏物体的可能性。

链接: https://arxiv.org/abs/2411.14626
作者: Ali Awad(1),Ashraf Saleem(1),Sidike Paheding(2),Evan Lucas(1),Serein Al-Ratrout(1),Timothy C. Havens(1) ((1) Michigan Technological University, (2) Fairfield University)
关键词-EN: underwater object detection, Challenging Underwater Plant, Underwater Plant Detection, Object Detection Dataset, detection performance
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater imagery often suffers from severe degradation that results in low visual quality and object detection performance. This work aims to evaluate state-of-the-art image enhancement models, investigate their impact on underwater object detection, and explore their potential to improve detection performance. To this end, we selected representative underwater image enhancement models covering major enhancement categories and applied them separately to two recent datasets: 1) the Real-World Underwater Object Detection Dataset (RUOD), and 2) the Challenging Underwater Plant Detection Dataset (CUPDD). Following this, we conducted qualitative and quantitative analyses on the enhanced images and developed a quality index (Q-index) to compare the quality distribution of the original and enhanced images. Subsequently, we compared the performance of several YOLO-NAS detection models that are separately trained and tested on the original and enhanced image sets. Then, we performed a correlation study to examine the relationship between enhancement metrics and detection performance. We also analyzed the inference results from the trained detectors presenting cases where enhancement increased the detection performance as well as cases where enhancement revealed missed objects by human annotators. This study suggests that although enhancement generally deteriorates the detection performance, it can still be harnessed in some cases for increased detection performance and more accurate human annotation.

[CV-86] SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation

【速读】: 该论文试图解决的问题是如何评估和优化在全身体部CT图像上预训练的模型在不同下游医学分割任务中的迁移能力,特别是跨模态和跨目标的迁移。解决方案的关键在于构建了一个大规模的基准测试,收集了87个不同模态、目标和样本规模的公开数据集,并使用代表性模型STU-Net在多模型尺度上进行跨模态和跨目标的迁移学习实验。实验结果表明,预训练模型在跨模态迁移(如从CT到MRI)和跨目标任务(如结构检测和病变检测)中表现出显著的有效性,同时也揭示了数据集规模在微调过程中的瓶颈效应。这一大规模的开放评估旨在为未来的体积医学图像分割研究提供指导。

链接: https://arxiv.org/abs/2411.14525
作者: Jin Ye,Ying Chen,Yanjun Li,Haoyu Wang,Zhongying Deng,Ziyan Huang,Yanzhou Su,Chenglong Ma,Yuanfeng Ji,Junjun He
关键词-EN: Computed Tomography, medical segmentation tasks, Computed, Tomography, medical imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computed Tomography (CT) is one of the most popular modalities for medical imaging. By far, CT images have contributed to the largest publicly available datasets for volumetric medical segmentation tasks, covering full-body anatomical structures. Large amounts of full-body CT images provide the opportunity to pre-train powerful models, e.g., STU-Net pre-trained in a supervised fashion, to segment numerous anatomical structures. However, it remains unclear in which conditions these pre-trained models can be transferred to various downstream medical segmentation tasks, particularly segmenting the other modalities and diverse targets. To address this problem, a large-scale benchmark for comprehensive evaluation is crucial for finding these conditions. Thus, we collected 87 public datasets varying in modality, target, and sample size to evaluate the transfer ability of full-body CT pre-trained models. We then employed a representative model, STU-Net with multiple model scales, to conduct transfer learning across modalities and targets. Our experimental results show that (1) there may be a bottleneck effect concerning the dataset size in fine-tuning, with more improvement on both small- and large-scale datasets than medium-size ones. (2) Models pre-trained on full-body CT demonstrate effective modality transfer, adapting well to other modalities such as MRI. (3) Pre-training on the full-body CT not only supports strong performance in structure detection but also shows efficacy in lesion detection, showcasing adaptability across target tasks. We hope that this large-scale open evaluation of transfer learning can direct future research in volumetric medical image segmentation.

[CV-87] Towards Scalable Insect Monitoring: Ultra-Lightweight CNNs as On-Device Triggers for Insect Camera Traps

【速读】: 该论文试图解决现有被动红外(PIR)传感器在检测小型、快速移动的变温动物(如昆虫)方面的不足,特别是在生物多样性监测中。解决方案的关键在于采用超轻量级卷积神经网络(ultra-lightweight convolutional neural networks),运行在低功耗硬件上,以连续流方式检测捕获的图像中的昆虫。这种方法实现了零延迟触发与图像捕获,并通过高精度的模型(AUC范围为91.8%到96.4%)确保了高特异性和高召回率,从而最大限度地减少了误报和漏报,提高了存储效率和检测准确性。此外,该系统能够在低功耗微控制器单元上运行,最大功耗低于300mW,显著延长了部署时间,降低了成本,提升了昆虫监测的效率和范围。

链接: https://arxiv.org/abs/2411.14467
作者: Ross Gardiner,Sareh Rowands,Benno I. Simmons
关键词-EN: insect camera trap, Camera traps, trigger camera traps, Camera, insect camera
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Camera traps, combined with AI, have emerged as a way to achieve automated, scalable biodiversity monitoring. However, the passive infrared (PIR) sensors that trigger camera traps are poorly suited for detecting small, fast-moving ectotherms such as insects. Insects comprise over half of all animal species and are key components of ecosystems and agriculture. The need for an appropriate and scalable insect camera trap is critical in the wake of concerning reports of declines in insect populations. This study proposes an alternative to the PIR trigger: ultra-lightweight convolutional neural networks running on low-powered hardware to detect insects in a continuous stream of captured images. We train a suite of models to distinguish insect images from backgrounds. Our design achieves zero latency between trigger and image capture. Our models are rigorously tested and achieve high accuracy ranging from 91.8% to 96.4% AUC on validation data and 87% AUC on data from distributions unseen during training. The high specificity of our models ensures minimal saving of false positive images, maximising deployment storage efficiency. High recall scores indicate a minimal false negative rate, maximising insect detection. Further analysis with saliency maps shows the learned representation of our models to be robust, with low reliance on spurious background features. Our system is also shown to operate deployed on off-the-shelf, low-powered microcontroller units, consuming a maximum power draw of less than 300mW. This enables longer deployment times using cheap and readily available battery components. Overall we offer a step change in the cost, efficiency and scope of insect monitoring. Solving the challenging trigger problem, we demonstrate a system which can be deployed for far longer than existing designs and budgets power and bandwidth effectively, moving towards a generic insect camera trap.
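补充说明(非论文原文):摘要以 AUC(ROC 曲线下面积)衡量模型区分昆虫与背景的能力。AUC 等价于"随机抽取一个正样本,其得分高于随机负样本"的概率,可用秩统计量(Mann-Whitney U)直接计算。下面是一个极简 numpy 示意,`roc_auc` 为自拟函数名:

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity:
    the probability that a random positive outscores a random negative."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    # Rank all scores from 1..n (ascending)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Ties get the average rank of their group
    for s in np.unique(scores):
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # U statistic for the positive class, normalized to [0, 1]
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

例如,完全可分的得分给出 AUC = 1.0,完全反序给出 0.0,随机打分趋近 0.5。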
zh

人工智能

[AI-0] RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

链接: https://arxiv.org/abs/2411.15114
作者: Hjalmar Wijk,Tao Lin,Joel Becker,Sami Jawhar,Neev Parikh,Thomas Broadley,Lawrence Chan,Michael Chen,Josh Clymer,Jai Dhyani,Elena Ericheva,Katharyn Garcia,Brian Goodrich,Nikola Jurkovic,Megan Kinniment,Aron Lajko,Seraphina Nix,Lucas Sato,William Saunders,Maksym Taran,Ben West,Elizabeth Barnes
关键词-EN: safety policies highlight, policies highlight automation, Research Engineering Benchmark, capability to anticipate, safety policies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics – e.g. an agent wrote a faster custom Triton kernel than any of our human experts’ – and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.

[AI-1] RED: Effective Trajectory Representation Learning with Comprehensive Information VLDB2025

链接: https://arxiv.org/abs/2411.15096
作者: Silin Zhou,Shuo Shang,Lisi Chen,Christian S. Jensen,Panos Kalnis
关键词-EN: including trajectory similarity, trajectory similarity computation, RED, similarity computation, travel-time estimation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper is accepted by VLDB2025

点击查看摘要

Abstract:Trajectory representation learning (TRL) maps trajectories to vectors that can then be used for various downstream tasks, including trajectory similarity computation, trajectory classification, and travel-time estimation. However, existing TRL methods often produce vectors that, when used in downstream tasks, yield insufficiently accurate results. A key reason is that they fail to utilize the comprehensive information encompassed by trajectories. We propose a self-supervised TRL framework, called RED, which effectively exploits multiple types of trajectory information. Overall, RED adopts the Transformer as the backbone model and masks the constituting paths in trajectories to train a masked autoencoder (MAE). In particular, RED considers the moving patterns of trajectories by employing a Road-aware masking strategy that retains key paths of trajectories during masking, thereby preserving crucial information of the trajectories. RED also adopts a spatial-temporal-user joint Embedding scheme to encode comprehensive information when preparing the trajectories as model inputs. To conduct training, RED adopts Dual-objective task learning: the Transformer encoder predicts the next segment in a trajectory, and the Transformer decoder reconstructs the entire trajectory. RED also considers the spatial-temporal correlations of trajectories by modifying the attention mechanism of the Transformer. We compare RED with 9 state-of-the-art TRL methods for 4 downstream tasks on 3 real-world datasets, finding that RED can usually improve the accuracy of the best-performing baseline by over 5%.
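补充说明(非论文原文):RED 的 Road-aware masking 在对轨迹做 MAE 式掩码时,会保留轨迹中的关键路径。下面给出一个通用的"随机掩码但始终保留关键位置"的极简示意,仅用于说明这一思路;`mask_sequence` 及其参数均为本文自拟,并非论文实现。

```python
import numpy as np

def mask_sequence(tokens, key_idx, mask_ratio=0.5, rng=None):
    """Randomly mask a fraction of positions in a sequence while
    always keeping a designated set of 'key' positions visible
    (a stand-in for preserving key paths of a trajectory)."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(tokens)
    keep = set(key_idx)
    # Only non-key positions are eligible for masking
    candidates = [i for i in range(n) if i not in keep]
    n_mask = min(int(round(mask_ratio * n)), len(candidates))
    masked = set(rng.choice(candidates, size=n_mask, replace=False))
    # None marks a masked position the autoencoder must reconstruct
    return [None if i in masked else t for i, t in enumerate(tokens)]
```

这样,掩码比例照常生效,但编码器永远能看到被指定为"关键"的片段,训练信号因此保留了轨迹的骨架信息。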

[AI-2] Towards Speaker Identification with Minimal Dataset and Constrained Resources using 1D-Convolution Neural Network

链接: https://arxiv.org/abs/2411.15082
作者: Irfan Nafiz Shahan,Pulok Ahmed Auvi
关键词-EN: Voice recognition, personal assistants, vital for applications, applications in security, security and personal
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Voice recognition and speaker identification are vital for applications in security and personal assistants. This paper presents a lightweight 1D-Convolutional Neural Network (1D-CNN) designed to perform speaker identification on minimal datasets. Our approach achieves a validation accuracy of 97.87%, leveraging data augmentation techniques to handle background noise and limited training samples. Future improvements include testing on larger datasets and integrating transfer learning methods to enhance generalizability. We provide all code, the custom dataset, and the trained models to facilitate reproducibility. These resources are available on our GitHub repository: this https URL.
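补充说明(非论文原文):下面用 numpy 演示 1D 卷积层的基本运算(深度学习框架中的"卷积"实际是互相关),以说明 1D-CNN 如何沿时间轴处理一维音频特征序列;函数名均为自拟,并非论文代码。

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Valid-mode 1D convolution (cross-correlation, as in CNN layers):
    slide the kernel along the signal and take dot products."""
    k = len(kernel)
    out_len = (len(signal) - k) // stride + 1
    return np.array([
        np.dot(signal[i * stride:i * stride + k], kernel)
        for i in range(out_len)
    ])

def relu(x):
    """Elementwise rectified linear activation."""
    return np.maximum(x, 0.0)

def global_max_pool(x):
    """Collapse a feature map to its strongest response."""
    return x.max()
```

一个轻量的说话人识别网络大体就是若干层 `conv1d + relu` 的堆叠,最后接池化与分类层;stride 大于 1 时输出按比例缩短,这也是轻量模型控制计算量的常用手段。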

[AI-3] Empowering Clients: Transformation of Design Processes Due to Generative AI

链接: https://arxiv.org/abs/2411.15061
作者: Johannes Schneider,Kilic Sinem,Daniel Stockhammer
关键词-EN: transforming creative fields, driven by advancements, creative fields, domain of computational, transforming creative
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The domain of computational design, driven by advancements in Generative AI, is transforming creative fields. We explore the transformative effects of Generative AI on the architectural design process and discuss the role of the architect. The case of architecture is interesting as designing houses is complex, involving extensive customer interaction. We employ a within-subject experiment using a popular general-purpose text-to-image tool for generating designs and providing feedback on existing designs, followed by expert interviews. The study reveals that AI can disrupt the ideation phase by enabling clients to engage in the design process through rapid visualization of their own ideas. In turn, the architect’s role shifts more towards assessing the feasibility of designs generated conjointly by clients and AI. Our study also shows that while AI can provide valuable feedback on designs, it might fail to generate such designs, allowing for interesting connections to foundations in computer science, i.e., NP-completeness. AI’s feedback also tends to hamper creativity and innovation by suggesting altering novel, innovative approaches toward more standardized designs. Our study also reveals that there is uncertainty among architects about the interpretative sovereignty of architecture and loss of meaning and identity when AI increasingly takes over authorship in the design process.

[AI-4] Financial Risk Assessment via Long-term Payment Behavior Sequence Folding ICDM2024

链接: https://arxiv.org/abs/2411.15056
作者: Yiran Qiao,Yateng Tang,Xiang Ao,Qi Yuan,Ziming Liu,Chen Shen,Xuehao Zheng
关键词-EN: Online inclusive financial, low default costs, services encounter significant, Online inclusive, expansive user base
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: ICDM2024 long paper

点击查看摘要

Abstract:Online inclusive financial services encounter significant financial risks due to their expansive user base and low default costs. By real-world practice, we reveal that utilizing longer-term user payment behaviors can enhance models’ ability to forecast financial risks. However, learning long behavior sequences is non-trivial for deep sequential models. Additionally, the diverse fields of payment behaviors carry rich information, requiring thorough exploitation. These factors collectively complicate the task of long-term user behavior modeling. To tackle these challenges, we propose a Long-term Payment Behavior Sequence Folding method, referred to as LBSF. In LBSF, payment behavior sequences are folded based on merchants, using the merchant field as an intrinsic grouping criterion, which enables informative parallelism without reliance on external knowledge. Meanwhile, we maximize the utility of payment details through a multi-field behavior encoding mechanism. Subsequently, behavior aggregation at the merchant level followed by relational learning across merchants facilitates comprehensive user financial representation. We evaluate LBSF on the financial risk assessment task using a large-scale real-world dataset. The results demonstrate that folding long behavior sequences based on internal behavioral cues effectively models long-term patterns and changes, thereby generating more accurate user financial profiles for practical applications.

[AI-5] Enhancing Autonomous Driving Safety through World Model-Based Predictive Navigation and Adaptive Learning Algorithms for 5G Wireless Applications

链接: https://arxiv.org/abs/2411.15042
作者: Hong Ding,Ziming Wang,Yi Ding,Hongjie Lin,SuYang Xi,Chia Chao Kang
关键词-EN: present Navigation Secure, swiftly advancing realm, Navigation Secure, wireless communication world, world models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures

点击查看摘要

Abstract:Addressing the challenge of ensuring safety in ever-changing and unpredictable environments, particularly in the swiftly advancing realm of autonomous driving in today’s 5G wireless communication world, we present Navigation Secure (NavSecure). This vision-based navigation framework merges the strengths of world models with crucial safety-focused decision-making capabilities, enabling autonomous vehicles to navigate real-world complexities securely. Our approach anticipates potential threats and formulates safer routes by harnessing the predictive capabilities of world models, thus significantly reducing the need for extensive real-world trial-and-error learning. Additionally, our method empowers vehicles to autonomously learn and develop through continuous practice, ensuring the system evolves and adapts to new challenges. Incorporating radio frequency technology, NavSecure leverages 5G networks to enhance real-time data exchange, improving communication and responsiveness. Validated through rigorous experiments under simulation-to-real driving conditions, NavSecure has shown exceptional performance in safety-critical scenarios, such as sudden obstacle avoidance. Results indicate that NavSecure excels in key safety metrics, including collision prevention and risk reduction, surpassing other end-to-end methodologies. This framework not only advances autonomous driving safety but also demonstrates how world models can enhance decision-making in critical applications. NavSecure sets a new standard for developing more robust and trustworthy autonomous driving systems, capable of handling the inherent dynamics and uncertainties of real-world environments.

[AI-6] One to rule them all: natural language to bind communication, perception and action

链接: https://arxiv.org/abs/2411.15033
作者: Simone Colombani,Dimitri Ognibene,Giuseppe Boccignone
关键词-EN: understanding complex human, complex human instructions, developing robots capable, Large Language Models, recent years
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In recent years, research in the area of human-robot interaction has focused on developing robots capable of understanding complex human instructions and performing tasks in dynamic and diverse environments. These systems have a wide range of applications, from personal assistance to industrial robotics, emphasizing the importance of robots interacting flexibly, naturally and safely with humans. This paper presents an advanced architecture for robotic action planning that integrates communication, perception, and planning with Large Language Models (LLMs). Our system is designed to translate commands expressed in natural language into executable robot actions, incorporating environmental information and dynamically updating plans based on real-time feedback. The Planner Module is the core of the system where LLMs embedded in a modified ReAct framework are employed to interpret and carry out user commands. By leveraging their extensive pre-trained knowledge, LLMs can effectively process user requests without the need to introduce new knowledge on the changing environment. The modified ReAct framework further enhances the execution space by providing real-time environmental perception and the outcomes of physical actions. By combining robust and dynamic semantic map representations as graphs with control components and failure explanations, this architecture enhances a robot’s adaptability, task execution, and seamless collaboration with human users in shared and dynamic environments. Through the integration of continuous feedback loops with the environment, the system can dynamically adjust the plan to accommodate unexpected changes, optimizing the robot’s ability to perform tasks. Using a dataset of previous experience, it is possible to provide detailed feedback about a failure, updating the LLM’s context for the next iteration with suggestions on how to overcome the issue.

[AI-7] Time is on my sight: scene graph filtering for dynamic environment perception in an LLM-driven robot

链接: https://arxiv.org/abs/2411.15027
作者: Simone Colombani,Luca Brini,Dimitri Ognibene,Giuseppe Boccignone
关键词-EN: dynamic environments, dynamic, Large Language Models, robots perception adapting, Robots
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Robots are increasingly being used in dynamic environments like workplaces, hospitals, and homes. As a result, interactions with robots must be simple and intuitive, with robots perception adapting efficiently to human-induced changes. This paper presents a robot control architecture that addresses key challenges in human-robot interaction, with a particular focus on the dynamic creation and continuous update of the robot state representation. The architecture uses Large Language Models to integrate diverse information sources, including natural language commands, robotic skills representation, real-time dynamic semantic mapping of the perceived scene. This enables flexible and adaptive robotic behavior in complex, dynamic environments. Traditional robotic systems often rely on static, pre-programmed instructions and settings, limiting their adaptability to dynamic environments and real-time collaboration. In contrast, this architecture uses LLMs to interpret complex, high-level instructions and generate actionable plans that enhance human-robot collaboration. At its core, the system Perception Module generates and continuously updates a semantic scene graph using RGB-D sensor data, providing a detailed and structured representation of the environment. A particle filter is employed to ensure accurate object localization in dynamic, real-world settings. The Planner Module leverages this up-to-date semantic map to break down high-level tasks into sub-tasks and link them to robotic skills such as navigation, object manipulation (e.g., PICK and PLACE), and movement (e.g., GOTO). By combining real-time perception, state tracking, and LLM-driven communication and task planning, the architecture enhances adaptability, task efficiency, and human-robot collaboration in dynamic environments.

[AI-8] FTA generation using GenAI with an Autonomy sensor Usecase

链接: https://arxiv.org/abs/2411.15007
作者: Sneha Sudhir Shetiya,Divya Garikapati,Veeraja Sohoni
关键词-EN: Functional safety forms, Fault Tree analysis, Functional safety, Fault Tree, Large Language Models
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Functional safety forms an important aspect in the design of systems. Its emphasis on the automotive industry has evolved significantly over the years. To date, many methods have been developed to obtain an appropriate FTA (Fault Tree Analysis) for various scenarios and features pertaining to Autonomous Driving. This paper is an attempt to explore the scope of using Generative Artificial Intelligence (GenAI) to develop Fault Tree Analysis, with the use case of a malfunction of the Lidar sensor in mind. We explore various available open-source Large Language Models (LLMs) and then dive deep into one of them to study its responses and provide our analysis. This paper successfully shows the possibility of guiding existing Large Language Models through Prompt Engineering to perform fault tree analysis for any autonomy use case, aided by the PlantUML tool.

[AI-9] Learning Lifted STRIPS Models from Action Traces Alone: A Simple, General, and Scalable Solution ICAPS2025

链接: https://arxiv.org/abs/2411.14995
作者: Jonas Gösgens,Niklas Jansen,Hector Geffner
关键词-EN: STRIPS action models, Learning STRIPS action, Learning STRIPS, STRIPS action, challenging problem
类目: Artificial Intelligence (cs.AI)
备注: submitted to ICAPS 2025

点击查看摘要

Abstract:Learning STRIPS action models from action traces alone is a challenging problem as it involves learning the domain predicates as well. In this work, a novel approach is introduced which, like the well-known LOCM systems, is scalable, but like SAT approaches, is sound and complete. Furthermore, the approach is general and imposes no restrictions on the hidden domain or the number or arity of the predicates. The new learning method is based on an \emphefficient, novel test that checks whether the assumption that a predicate is affected by a set of action patterns, namely, actions with specific argument positions, is consistent with the traces. The predicates and action patterns that pass the test provide the basis for the learned domain that is then easily completed with preconditions and static predicates. The new method is studied theoretically and experimentally. For the latter, the method is evaluated on traces and graphs obtained from standard classical domains like the 8-puzzle, which involve hundreds of thousands of states and transitions. The learned representations are then verified on larger instances.

[AI-10] Free Energy Projective Simulation (FEPS): Active inference with interpretability

链接: https://arxiv.org/abs/2411.14991
作者: Joséphine Pazem,Marius Krumm,Alexander Q. Vining,Lukas J. Fiderer,Hans J. Briegel
关键词-EN: successes connecting conceptual, connecting conceptual models, free energy principle, perception and action, achieved many successes
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
备注: 26 pages (including 5 pages appendix), 6 figures

点击查看摘要

Abstract:In the last decade, the free energy principle (FEP) and active inference (AIF) have achieved many successes connecting conceptual models of learning and cognition to mathematical models of perception and action. This effort is driven by a multidisciplinary interest in understanding aspects of self-organizing complex adaptive systems, including elements of agency. Various reinforcement learning (RL) models performing active inference have been proposed and trained on standard RL tasks using deep neural networks. Recent work has focused on improving such agents’ performance in complex environments by incorporating the latest machine learning techniques. In this paper, we take an alternative approach. Within the constraints imposed by the FEP and AIF, we attempt to model agents in an interpretable way without deep neural networks by introducing Free Energy Projective Simulation (FEPS). Using internal rewards only, FEPS agents build a representation of their partially observable environments with which they interact. Following AIF, the policy to achieve a given task is derived from this world model by minimizing the expected free energy. Leveraging the interpretability of the model, techniques are introduced to deal with long-term goals and reduce prediction errors caused by erroneous hidden state estimation. We test the FEPS model on two RL environments inspired from behavioral biology: a timed response task and a navigation task in a partially observable grid. Our results show that FEPS agents fully resolve the ambiguity of both environments by appropriately contextualizing their observations based on prediction accuracy only. In addition, they infer optimal policies flexibly for any target observation in the environment.
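补充说明(非论文原文):主动推理中,策略 $\pi$ 的期望自由能(expected free energy)通常分解为"认知价值"(信息增益)与"实用价值"(对偏好 $C$ 的满足)两项,FEPS 即通过最小化该量来导出策略。下式为文献中的标准形式,记号未必与论文一致:

```latex
G(\pi) \;=\;
\underbrace{-\,\mathbb{E}_{q(o,s\mid\pi)}\!\left[\ln p(s\mid o,\pi)-\ln q(s\mid\pi)\right]}_{\text{epistemic value (expected information gain)}}
\;\underbrace{-\,\mathbb{E}_{q(o\mid\pi)}\!\left[\ln p(o\mid C)\right]}_{\text{pragmatic value (preferred outcomes }C\text{)}}
```

最小化 $G(\pi)$ 的智能体会同时被"减少对隐状态的不确定性"与"趋向偏好观测"两种动机驱动,这对应摘要中"仅凭预测准确性即可消解环境歧义"的行为。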

[AI-11] Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning

链接: https://arxiv.org/abs/2411.14937
作者: Junjie Shan,Ziqi Zhao,Jialin Lu,Rui Zhang,Siu Ming Yiu,Ka-Ho Chow
关键词-EN: made significant progress, inspiring numerous life-enriching, numerous life-enriching applications, Foundation models, significant progress
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Foundation models that bridge vision and language have made significant progress, inspiring numerous life-enriching applications. However, their potential for misuse to introduce new threats remains largely unexplored. This paper reveals that vision-language models (VLMs) can be exploited to overcome longstanding limitations in gradient inversion attacks (GIAs) within federated learning (FL), where an FL server reconstructs private data samples from gradients shared by victim clients. Current GIAs face challenges in reconstructing high-resolution images, especially when the victim has a large local data batch. While focusing reconstruction on valuable samples rather than the entire batch is promising, existing methods lack the flexibility to allow attackers to specify their target data. In this paper, we introduce Geminio, the first approach to transform GIAs into semantically meaningful, targeted attacks. Geminio enables a brand new privacy attack experience: attackers can describe, in natural language, the types of data they consider valuable, and Geminio will prioritize reconstruction to focus on those high-value samples. This is achieved by leveraging a pretrained VLM to guide the optimization of a malicious global model that, when shared with and optimized by a victim, retains only gradients of samples that match the attacker-specified query. Extensive experiments demonstrate Geminio’s effectiveness in pinpointing and reconstructing targeted samples, with high success rates across complex datasets under FL and large batch sizes and showing resilience against existing defenses.

[AI-12] Purrfessor: A Fine-tuned Multimodal LLaVA Diet Health Chatbot

链接: https://arxiv.org/abs/2411.14925
作者: Linqi Lu,Yifan Deng,Chuan Tian,Sijia Yang,Dhavan Shah
关键词-EN: provide personalized dietary, study introduces Purrfessor, personalized dietary guidance, study introduces, designed to provide
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:This study introduces Purrfessor, an innovative AI chatbot designed to provide personalized dietary guidance through interactive, multimodal engagement. Leveraging the Large Language-and-Vision Assistant (LLaVA) model fine-tuned with food and nutrition data and a human-in-the-loop approach, Purrfessor integrates visual meal analysis with contextual advice to enhance user experience and engagement. We conducted two studies to evaluate the chatbot’s performance and user experience: (a) simulation assessments and human validation were conducted to examine the performance of the fine-tuned model; (b) a 2 (Profile: Bot vs. Pet) by 3 (Model: GPT-4 vs. LLaVA vs. Fine-tuned LLaVA) experiment revealed that Purrfessor significantly enhanced users’ perceptions of care ( \beta = 1.59 , p = 0.04 ) and interest ( \beta = 2.26 , p = 0.01 ) compared to the GPT-4 bot. Additionally, user interviews highlighted the importance of interaction design details, emphasizing the need for responsiveness, personalization, and guidance to improve user engagement.

[AI-13] GOT4Rec: Graph of Thoughts for Sequential Recommendation

链接: https://arxiv.org/abs/2411.14922
作者: Zewen Long,Liang Wang,Shu Wu,Qiang Liu,Liang Wang
关键词-EN: large language models, language models, researchers have explored, advancement of large, large language
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the advancement of large language models (LLMs), researchers have explored various methods to optimally leverage their comprehension and generation capabilities in sequential recommendation scenarios. However, several challenges persist in this endeavor. Firstly, most existing approaches rely on the input-output prompting paradigm, which can result in irrelevant or inaccurate responses. Secondly, while there have been attempts to enhance LLMs using prompting strategies such as chain-of-thought (CoT), these efforts have not fully harnessed the reasoning abilities of LLMs or effectively captured the multifaceted information contained within user sequences. To address these limitations, we propose GOT4Rec, a sequential recommendation method that utilizes the graph of thoughts (GoT) prompting strategy. Specifically, we identify and utilize three key types of information within user history sequences: short-term interests, long-term interests and collaborative information from other users. Our approach enables LLMs to independently reason and generate recommendations based on these distinct types of information, subsequently aggregating the results within the GoT framework to derive the final recommended items. This method allows LLMs, with enhanced reasoning capabilities, to more effectively consider the diverse information within user sequences, resulting in more accurate recommendations and more comprehensive explanations. Extensive experiments on real-world datasets demonstrate the effectiveness of GOT4Rec, indicating that it outperforms existing state-of-the-art baselines. Our code is available at this https URL.
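As a rough illustration of the aggregation step described above, the results of the three reasoning branches (short-term interests, long-term interests, collaborative information) can be merged with a simple reciprocal-rank fusion. This is a generic sketch under assumed branch and item names, not the paper's actual GoT operator:

```python
from collections import defaultdict

def aggregate_branches(branch_rankings, weights=None):
    """Merge ranked item lists from several reasoning branches into one
    final ranking via weighted reciprocal-rank fusion (a stand-in for
    the GoT aggregation step; names here are illustrative)."""
    weights = weights or {name: 1.0 for name in branch_rankings}
    scores = defaultdict(float)
    for name, items in branch_rankings.items():
        for rank, item in enumerate(items, start=1):
            # earlier positions in a branch's ranking contribute more
            scores[item] += weights[name] / rank
    return sorted(scores, key=scores.get, reverse=True)

branches = {
    "short_term": ["itemA", "itemB", "itemC"],
    "long_term": ["itemB", "itemD"],
    "collaborative": ["itemB", "itemA"],
}
final = aggregate_branches(branches)  # itemB ranks first: all three branches agree
```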

[AI-14] DAIRHuM: A Platform for Directly Aligning AI Representations with Human Musical Judgments applied to Carnatic Music ICASSP

链接: https://arxiv.org/abs/2411.14907
作者: Prashanth Thattai Ravikumar
关键词-EN: Quantifying and aligning, Human Musical judgments, important challenge, model representations, model alignment
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 4 Pages, ICASSP workshop submission

点击查看摘要

Abstract:Quantifying and aligning music AI model representations with human behavior is an important challenge in the field of MIR. This paper presents a platform for exploring the Direct alignment between AI music model Representations and Human Musical judgments (DAIRHuM). It is designed to enable musicians and experimentalists to label similarities in a dataset of music recordings, and examine a pre-trained model’s alignment with their labels using quantitative scores and visual plots. DAIRHuM is applied to analyze alignment between NSynth representations, and a rhythmic duet between two percussionists in a Carnatic quartet ensemble, an example of a genre where annotated data is scarce and assessing alignment is non-trivial. The results demonstrate significant findings on model alignment with human judgments of rhythmic harmony, while highlighting key differences in rhythm perception and music similarity judgments specific to Carnatic music. This work is among the first efforts to enable users to explore human-AI model alignment in Carnatic music and advance MIR research in Indian music while dealing with data scarcity and cultural specificity. The development of this platform provides greater accessibility to music AI tools for under-represented genres.

[AI-15] Application of AI to formal methods – an analysis of current trends

链接: https://arxiv.org/abs/2411.14870
作者: Sebastian Stock,Jannik Dunkelau,Atif Mashkoor
关键词-EN: artificial intelligence, probabilistic decision-making, formal methods, application area, daily lives
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With artificial intelligence (AI) being well established within the daily lives of research communities, we turn our gaze toward an application area that appears intuitively unsuited for probabilistic decision-making: the area of formal methods (FM). FM aim to provide sound and understandable reasoning about problems in computer science, which seemingly collides with the black-box nature that inhibits many AI approaches. However, many researchers have crossed this gap and applied AI techniques to enhance FM approaches. As this dichotomy of FM and AI sparked our interest, we conducted a systematic mapping study to map the current landscape of research publications. In this study, we investigate the previous five years of applied AI to FM (2019-2023), as these correspond to periods of high activity. This investigation results in 189 entries, which we explore in more detail to find current trends, highlight research gaps, and give suggestions for future research.

[AI-16] Domain and Range Aware Synthetic Negatives Generation for Knowledge Graph Embedding Models

链接: https://arxiv.org/abs/2411.14858
作者: Alberto Bernardi,Luca Costabello
关键词-EN: exploring Knowledge Graphs, Knowledge Graph Embedding, Knowledge Graph, solving tasks related, Graph Embedding models
类目: Artificial Intelligence (cs.AI)
*备注: Accepted at the Third Learning on Graphs Conference (LoG 2024)

点击查看摘要

Abstract:Knowledge Graph Embedding models, representing entities and edges in a low-dimensional space, have been extremely successful at solving tasks related to completing and exploring Knowledge Graphs (KGs). One of the key aspects of training most of these models is teaching them to discriminate between true statements (positives) and false ones (negatives). However, the way in which negatives can be defined is not trivial, as facts missing from the KG are not necessarily false and a set of ground truth negatives is hardly ever given. This makes synthetic negative generation a necessity. Different generation strategies can heavily affect the quality of the embeddings, making it a primary aspect to consider. We revamp a strategy that generates corruptions during training respecting the domain and range of relations, we extend its capabilities and we show our methods bring substantial improvement (+10% MRR) for standard benchmark datasets and over +150% MRR for a larger ontology-backed dataset.
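The domain/range-respecting corruption idea can be sketched as follows: when corrupting a triple's tail, replacement entities are drawn only from entities of the relation's declared range type. The schema and names below are illustrative assumptions, not the paper's API:

```python
import random

def generate_negatives(triple, entity_types, range_of, n=3, seed=0):
    """Corrupt the tail of a (head, relation, tail) triple, sampling
    replacements only from entities whose type matches the relation's
    declared range. A minimal sketch of domain/range-aware negative
    generation; the schema here is hypothetical."""
    rng = random.Random(seed)
    head, rel, tail = triple
    candidates = [e for e, t in entity_types.items()
                  if t == range_of[rel] and e != tail]
    return [(head, rel, rng.choice(candidates)) for _ in range(n)]

entity_types = {"Paris": "city", "Lyon": "city", "Rome": "city",
                "France": "country", "Italy": "country"}
range_of = {"capital_of": "country"}
# corruptions stay type-consistent: only countries replace "France"
negs = generate_negatives(("Paris", "capital_of", "France"),
                          entity_types, range_of)
```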

[AI-17] Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Language Models

链接: https://arxiv.org/abs/2411.14842
作者: Wanqi Yang,Yanda Li,Meng Fang,Yunchao Wei,Tianyi Zhou,Ling Chen
关键词-EN: voice-based human-machine interactions, large language models, audio attacks pose, Adversarial audio attacks, audio adversarial attacks
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Adversarial audio attacks pose a significant threat to the growing use of large language models (LLMs) in voice-based human-machine interactions. While existing research has primarily focused on model-specific adversarial methods, real-world applications demand a more generalizable and universal approach to audio adversarial attacks. In this paper, we introduce the Chat-Audio Attacks (CAA) benchmark, which includes four distinct types of audio attacks and aims to explore the vulnerabilities of LLMs to these attacks in conversational scenarios. To evaluate the robustness of LLMs, we propose three evaluation strategies: Standard Evaluation, utilizing traditional metrics to quantify model performance under attacks; GPT-4o-Based Evaluation, which simulates real-world conversational complexities; and Human Evaluation, offering insights into user perception and trust. We evaluate six state-of-the-art LLMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others, using three distinct evaluation methods on the CAA benchmark. Our comprehensive analysis reveals the impact of four types of audio attacks on the performance of these models, demonstrating that GPT-4o exhibits the highest level of resilience.

[AI-18] Mode-conditioned music learning and composition: a spiking neural network inspired by neuroscience and psychology

链接: https://arxiv.org/abs/2411.14773
作者: Qian Liang,Yi Zeng,Menghaoran Tang
关键词-EN: harmonic relationships, critical element, element that establishes, pitch organization, organization and determines
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:Musical mode is one of the most critical elements that establish the framework of pitch organization and determine harmonic relationships. Previous works often use simplistic and rigid alignment methods and overlook the diversity of modes. In contrast to AI models, however, humans possess cognitive mechanisms for perceiving the various modes and keys. In this paper, we propose a spiking neural network inspired by brain mechanisms and psychological theories to represent musical modes and keys, ultimately generating musical pieces that incorporate tonality features. Specifically, the contributions are as follows: 1) The model is designed with multiple collaborating subsystems inspired by the structures and functions of corresponding brain regions; 2) We incorporate mechanisms for neural-circuit evolutionary learning that enable the network to learn and generate mode-related features in music, reflecting the cognitive processes involved in human music perception; 3) The results demonstrate that the proposed model exhibits a connection framework closely resembling the Krumhansl-Schmuckler model, one of the most significant key-perception models in music psychology; 4) Experiments show that the model can generate music pieces with the characteristics of the given modes and keys. Additionally, quantitative assessment of the generated pieces reveals that they have both tonality characteristics and the melodic adaptability needed to produce diverse, musical content. By combining insights from neuroscience, psychology, and music theory with advanced neural network architectures, our research aims to create a system that not only learns and generates music but also bridges the gap between human cognition and artificial intelligence.

[AI-19] Grid and Road Expressions Are Complementary for Trajectory Representation Learning KDD2025

链接: https://arxiv.org/abs/2411.14768
作者: Silin Zhou,Shuo Shang,Lisi Chen,Peng Han,Christian S. Jensen
关键词-EN: TRL methods, trajectories, TRL, Existing TRL methods, road trajectories
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper is accepted by KDD2025(August Cycle)

点击查看摘要

Abstract:Trajectory representation learning (TRL) maps trajectories to vectors that can be used for many downstream tasks. Existing TRL methods use either grid trajectories, capturing movement in free space, or road trajectories, capturing movement in a road network, as input. We observe that the two types of trajectories are complementary, providing either region and location information or providing road structure and movement regularity. Therefore, we propose a novel multimodal TRL method, dubbed GREEN, to jointly utilize Grid and Road trajectory Expressions for Effective representatioN learning. In particular, we transform raw GPS trajectories into both grid and road trajectories and tailor two encoders to capture their respective information. To align the two encoders such that they complement each other, we adopt a contrastive loss to encourage them to produce similar embeddings for the same raw trajectory and design a mask language model (MLM) loss to use grid trajectories to help reconstruct masked road trajectories. To learn the final trajectory representation, a dual-modal interactor is used to fuse the outputs of the two encoders via cross-attention. We compare GREEN with 7 state-of-the-art TRL methods for 3 downstream tasks, finding that GREEN consistently outperforms all baselines and improves the accuracy of the best-performing baseline by an average of 15.99%.
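The contrastive alignment between the two encoders can be illustrated with a plain-Python InfoNCE-style loss over paired grid/road embeddings: embeddings of the same raw trajectory are pulled together and different trajectories pushed apart. This is a minimal sketch of the objective only; in the paper the embeddings come from learned neural encoders:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(grid_embs, road_embs, temperature=0.1):
    """InfoNCE-style loss: for trajectory i, the grid embedding should be
    most similar to its own road embedding (the positive pair) among all
    road embeddings in the batch. Illustrative sketch, not the paper's code."""
    n = len(grid_embs)
    loss = 0.0
    for i in range(n):
        logits = [cosine(grid_embs[i], road_embs[j]) / temperature
                  for j in range(n)]
        m = max(logits)  # numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)  # negative log-softmax of the positive
    return loss / n

grid = [[1.0, 0.0], [0.0, 1.0]]
road = [[0.9, 0.1], [0.1, 0.9]]  # road embeddings aligned with grid ones
aligned_loss = contrastive_loss(grid, road)
```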

[AI-20] Hammer: Towards Efficient Hot-Cold Data Identification via Online Learning

链接: https://arxiv.org/abs/2411.14759
作者: Kai Lu,Siqi Zhao,Jiguang Wan
关键词-EN: cloud computing environments, computing environments requires, environments requires accurate, requires accurate identification, Efficient management
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficient management of storage resources in big data and cloud computing environments requires accurate identification of data’s “cold” and “hot” states. Traditional methods, such as rule-based algorithms and early AI techniques, often struggle with dynamic workloads, leading to low accuracy, poor adaptability, and high operational overhead. To address these issues, we propose a novel solution based on online learning strategies. Our approach dynamically adapts to changing data access patterns, achieving higher accuracy and lower operational costs. Rigorous testing with both synthetic and real-world datasets demonstrates a significant improvement, achieving a 90% accuracy rate in hot-cold classification. Additionally, the computational and storage overheads are considerably reduced.

[AI-21] LIBER: Lifelong User Behavior Modeling Based on Large Language Models

链接: https://arxiv.org/abs/2411.14713
作者: Chenxu Zhu,Shigang Quan,Bo Chen,Jianghao Lin,Xiaoling Cai,Hong Zhu,Xiangyang Li,Yunjia Xi,Weinan Zhang,Ruiming Tang
关键词-EN: CTR prediction plays, CTR prediction, recommender systems, user behavior, lifelong user behavior
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:CTR prediction plays a vital role in recommender systems. Recently, large language models (LLMs) have been applied in recommender systems due to their emergent abilities. While leveraging semantic information from LLMs has shown some improvements in the performance of recommender systems, two notable limitations persist in these studies. First, LLM-enhanced recommender systems encounter challenges in extracting valuable information from lifelong user behavior sequences within textual contexts for recommendation tasks. Second, the inherent variability in human behaviors leads to a constant stream of new behaviors and irregularly fluctuating user interests. This characteristic imposes two significant challenges on existing models. On the one hand, it presents difficulties for LLMs in effectively capturing the dynamic shifts in user interests within these sequences; on the other hand, substantial computational overhead arises if the LLMs must be called repeatedly on each update to the user sequences. In this work, we propose Lifelong User Behavior Modeling (LIBER) based on large language models, which includes three modules: (1) User Behavior Streaming Partition (UBSP), (2) User Interest Learning (UIL), and (3) User Interest Fusion (UIF). Initially, UBSP is employed to condense lengthy user behavior sequences into shorter partitions in an incremental paradigm, facilitating more efficient processing. Subsequently, UIL leverages LLMs in a cascading way to infer insights from these partitions. Finally, UIF integrates the textual outputs generated by the aforementioned processes to construct a comprehensive representation, which can be incorporated by any recommendation model to enhance performance. LIBER has been deployed on Huawei's music recommendation service and has improved users' play count and play time by 3.01% and 7.69%, respectively.
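The streaming-partition idea (UBSP), condensing a lifelong behavior sequence into short partitions that can each be summarized once instead of re-processing the full history on every update, can be sketched minimally. The function below is a hypothetical illustration, not the deployed implementation:

```python
def partition_stream(behaviors, max_len=4):
    """Incrementally split a long behavior stream into fixed-size
    partitions; each closed partition can then be summarized once by an
    LLM, avoiding repeated calls over the whole history (sketch only)."""
    partitions, current = [], []
    for b in behaviors:
        current.append(b)
        if len(current) == max_len:  # partition full: close it
            partitions.append(current)
            current = []
    if current:  # trailing, still-open partition
        partitions.append(current)
    return partitions

stream = [f"song_{i}" for i in range(10)]  # toy play history
parts = partition_stream(stream, max_len=4)
```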

[AI-22] Social Media Algorithms Can Shape Affective Polarization via Exposure to Antidemocratic Attitudes and Partisan Animosity

链接: https://arxiv.org/abs/2411.14652
作者: Tiziano Piccardi,Martin Saveski,Chenyan Jia,Jeffrey T. Hancock,Jeanne L. Tsai,Michael Bernstein
关键词-EN: social media feed, media feed ranking, widespread concern, social media, AAPA
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:There is widespread concern about the negative impacts of social media feed ranking algorithms on political polarization. Leveraging advancements in large language models (LLMs), we develop an approach to re-rank feeds in real-time to test the effects of content that is likely to polarize: expressions of antidemocratic attitudes and partisan animosity (AAPA). In a preregistered 10-day field experiment on X/Twitter with 1,256 consented participants, we increase or decrease participants’ exposure to AAPA in their algorithmically curated feeds. We observe more positive outparty feelings when AAPA exposure is decreased and more negative outparty feelings when AAPA exposure is increased. Exposure to AAPA content also results in an immediate increase in negative emotions, such as sadness and anger. The interventions do not significantly impact traditional engagement metrics such as re-post and favorite rates. These findings highlight a potential pathway for developing feed algorithms that mitigate affective polarization by addressing content that undermines the shared values required for a healthy democracy.

[AI-23] Generative AI for Music and Audio

链接: https://arxiv.org/abs/2411.14627
作者: Hao-Wen Dong
关键词-EN: music, audio content, audio, consume content, content
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: PhD Dissertation

点击查看摘要

Abstract:Generative AI has been transforming the way we interact with technology and consume content. In the next decade, AI technology will reshape how we create audio content in various media, including music, theater, films, games, podcasts, and short videos. In this dissertation, I introduce the three main directions of my research centered around generative AI for music and audio: 1) multitrack music generation, 2) assistive music creation tools, and 3) multimodal learning for audio and music. Through my research, I aim to answer the following two fundamental questions: 1) How can AI help professionals or amateurs create music and audio content? 2) Can AI learn to create music in a way similar to how humans learn music? My long-term goal is to lower the barrier of entry for music composition and democratize audio content creation.

[AI-24] Predictive Analytics of Air Alerts in the Russian-Ukrainian War

链接: https://arxiv.org/abs/2411.14625
作者: Demian Pavlyshenko,Bohdan Pavlyshenko
关键词-EN: exploratory data analysis, paper considers exploratory, exploratory data, data analysis, analysis and approaches
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The paper presents exploratory data analysis and predictive-analytics approaches for air alerts during the Russian-Ukrainian war, which broke out on Feb 24, 2022. The results illustrate that alerts in regions correlate with one another and follow geospatial patterns, which makes it feasible to build a predictive model for alerts expected in a given region within a specified time period. The obtained results show that the alert status in a particular region depends strongly on the features of its adjacent regions. Seasonality features such as hour, day of the week, and month are also crucial in predicting the target variable. Some regions rely heavily on a time feature equal to the number of days since the initial date of the dataset. From this, we can deduce that the air-alert pattern changes over time.
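The seasonality features named in the abstract (hour, day of week, month, plus days elapsed since the initial date) can be built in a few lines; the feature names below are illustrative, not taken from the paper's dataset:

```python
from datetime import datetime

START = datetime(2022, 2, 24)  # the dataset's initial date

def alert_features(ts: datetime) -> dict:
    """Build the seasonality features described in the abstract for one
    timestamp (hypothetical feature names, for illustration only)."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0
        "month": ts.month,
        "days_since_start": (ts - START).days,
    }

f = alert_features(datetime(2022, 3, 1, 21, 30))
```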

[AI-25] Exploiting Boosting in Hyperdimensional Computing for Enhanced Reliability in Healthcare DATE2025

链接: https://arxiv.org/abs/2411.14612
作者: SungHeon Jeong,Hamza Errahmouni Barkam,Sanggeon Yun,Yeseong Kim,Shaahin Angizi,Mohsen Imani
关键词-EN: benefiting machine learning, benefiting machine, encoding and processing, processing in high-dimensional, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to DATE 2025

点击查看摘要

Abstract:Hyperdimensional computing (HDC) enables efficient data encoding and processing in high-dimensional space, benefiting machine learning and data analysis. However, underutilization of these spaces can lead to overfitting and reduced model reliability, especially in data-limited systems, a critical issue in sectors like healthcare that demand robustness and consistent performance. We introduce BoostHD, an approach that applies boosting algorithms to partition the hyperdimensional space into subspaces, creating an ensemble of weak learners. By integrating boosting with HDC, BoostHD enhances performance and reliability beyond existing HDC methods. Our analysis highlights the importance of efficient utilization of hyperdimensional spaces for improved model performance. Experiments on healthcare datasets show that BoostHD outperforms state-of-the-art methods. On the WESAD dataset, it achieved an accuracy of 98.37%, surpassing Random Forest, XGBoost, and OnlineHD. BoostHD also demonstrated superior inference efficiency and stability, maintaining high accuracy under data imbalance and noise. In person-specific evaluations, it achieved an average accuracy of 96.19%, outperforming other models. By addressing the limitations of both boosting and HDC, BoostHD expands the applicability of HDC in critical domains where reliability and precision are paramount.
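The core idea of partitioning the hyperdimensional space into disjoint subspaces, one per weak learner, and combining their votes can be sketched as follows (an illustrative simplification, not the paper's full boosting procedure):

```python
import random

def split_subspaces(dim, n_learners, seed=0):
    """Partition the coordinate indices of a dim-dimensional hyperspace
    into disjoint, equally sized subspaces, one per weak learner."""
    idx = list(range(dim))
    random.Random(seed).shuffle(idx)  # spread coordinates across learners
    size = dim // n_learners
    return [idx[i * size:(i + 1) * size] for i in range(n_learners)]

def majority_vote(predictions):
    """Combine the weak learners' class votes into an ensemble decision."""
    return max(set(predictions), key=predictions.count)

subspaces = split_subspaces(dim=10000, n_learners=5)
decision = majority_vote([1, 0, 1, 1, 0])  # three of five learners vote 1
```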

[AI-26] A Systematic Study of Multi-Agent Deep Reinforcement Learning for Safe and Robust Autonomous Highway Ramp Entry

链接: https://arxiv.org/abs/2411.14593
作者: Larry Schester,Luis E. Ortiz
关键词-EN: driverless robotaxis operate, autonomous driving expected, major cities, sophisticated levels, today can drive
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注: 9 pages, 9 figures

点击查看摘要

Abstract:Vehicles today can drive themselves on highways and driverless robotaxis operate in major cities, with more sophisticated levels of autonomous driving expected to be available and become more common in the future. Yet, technically speaking, so-called "Level 5" (L5) operation, corresponding to full autonomy, has not been achieved. For that to happen, functions such as fully autonomous highway ramp entry must be available, and provide provably safe, and reliably robust behavior to enable full autonomy. We present a systematic study of a highway ramp function that controls the vehicle's forward-moving actions to minimize collisions with the stream of highway traffic into which a merging (ego) vehicle enters. We take a game-theoretic multi-agent (MA) approach to this problem and study the use of controllers based on deep reinforcement learning (DRL). The virtual environment of the MA DRL uses self-play with simulated data where merging vehicles safely learn to control longitudinal position during a taper-type merge. The work presented in this paper extends existing work by studying the interaction of more than two vehicles (agents) and does so by systematically expanding the road scene with additional traffic and ego vehicles. While previous work on the two-vehicle setting established that collision-free controllers are theoretically impossible in fully decentralized, non-coordinated environments, we empirically show that controllers learned using our approach are nearly ideal when measured against idealized optimal controllers.

[AI-27] G-RAG: Knowledge Expansion in Material Science

链接: https://arxiv.org/abs/2411.14592
作者: Radeen Mostafa,Mirza Nihal Baig,Mashaekh Tausif Ehsan,Jakir Hasan
关键词-EN: Large Language Models, Material Science, facilitating research, effective information retrieval, systems are essential
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of Material Science, effective information retrieval systems are essential for facilitating research. Traditional Retrieval-Augmented Generation (RAG) approaches in Large Language Models (LLMs) often encounter challenges such as outdated information, hallucinations, limited interpretability due to context constraints, and inaccurate retrieval. To address these issues, Graph RAG integrates graph databases to enhance the retrieval process. Our proposed method processes Material Science documents by extracting key entities (referred to as MatIDs) from sentences, which are then utilized to query external Wikipedia knowledge bases (KBs) for additional relevant information. We implement an agent-based parsing technique to achieve a more detailed representation of the documents. Our improved version of Graph RAG called G-RAG further leverages a graph database to capture relationships between these entities, improving both retrieval accuracy and contextual understanding. This enhanced approach demonstrates significant improvements in performance for domains that require precise information retrieval, such as Material Science.
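The graph-expansion step, pulling in entities related to the extracted MatIDs before generation, can be sketched with a dictionary-backed graph; the toy schema below is a hypothetical stand-in for the actual graph database:

```python
def expand_entities(query_entities, graph, hops=1):
    """Expand a set of extracted entities over a knowledge graph to pull
    in related context for retrieval. `graph` maps each entity to its
    neighbor list (illustrative sketch, not the paper's schema)."""
    frontier, seen = set(query_entities), set(query_entities)
    for _ in range(hops):
        # neighbors of the current frontier, minus anything already seen
        frontier = {n for e in frontier for n in graph.get(e, [])} - seen
        seen |= frontier
    return seen

kg = {"graphene": ["carbon", "2D material"],
      "carbon": ["diamond"],
      "2D material": []}
ctx = expand_entities({"graphene"}, kg, hops=2)
```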

[AI-28] SRSA: A Cost-Efficient Strategy-Router Search Agent for Real-world Human-Machine Interactions

链接: https://arxiv.org/abs/2411.14574
作者: Yaqi Wang,Haipei Xu
关键词-EN: Large Language Models, Language Models, Large Language, gained widespread popularity, shown impressive emerging
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, as Large Language Models (LLMs) have shown impressive emerging capabilities and gained widespread popularity, research on LLM-based search agents has proliferated. In real-world situations, users often input contextual and highly personalized queries to chatbots, challenging LLMs to capture context and generate appropriate answers. However, much of the prior research has not focused specifically on authentic human-machine dialogue scenarios. It also ignores the important balance between response quality and computational cost by forcing all queries to follow the same agent process. To address these gaps, we propose a Strategy-Router Search Agent (SRSA), routing different queries to appropriate search strategies and enabling fine-grained serial searches to obtain high-quality results at a relatively low cost. To evaluate our work, we introduce a new dataset, Contextual Query Enhancement Dataset (CQED), comprising contextual queries to simulate authentic and daily interactions between humans and chatbots. Using LLM-based automatic evaluation metrics, we assessed SRSA’s performance in terms of informativeness, completeness, novelty, and actionability. To conclude, SRSA provides an approach that resolves the issue of simple serial searches leading to degenerate answers for lengthy and contextual queries, effectively and efficiently parses complex user queries, and generates more comprehensive and informative responses without fine-tuning an LLM.

[AI-29] The importance of the clustering model to detect new types of intrusion in data traffic

链接: https://arxiv.org/abs/2411.14550
作者: Noor Saud Abd,Kamel Karoui
关键词-EN: current digital age, data, digital age, constantly increasing, current digital
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:In the current digital age, the volume of data generated by various cyber activities has become enormous and is constantly increasing. The data may contain valuable insights that can be harnessed to improve cyber security measures. However, much of this data is unclassified and qualitative, which poses significant challenges to traditional analysis methods. Clustering facilitates the identification of hidden patterns and structures in data by grouping similar data points, which makes it simpler to identify and address threats. Clustering can be defined as a data mining (DM) approach that uses similarity calculations to divide a data set into several categories. Hierarchical, density-based, and partitioning clustering algorithms are typical examples. The presented work uses the K-means algorithm, a popular clustering technique. We worked with two different types of data: first, we gathered data using the XGBoost algorithm and then completed the aggregation with the K-means algorithm. Data was gathered using a Kali Linux environment, CICFlowMeter traffic capture, and the PuTTY software with a set of diverse, simple attacks. The concept could assist in identifying new attack types that are distinct from known attacks and labeling them based on the characteristics they exhibit, since the dynamic nature of cyber threats means that new attack types often emerge for which labeled data might not yet exist. The model counted the attacks and assigned a number to each of them. Secondly, we applied the same approach to a ready-made dataset from the Kaggle repository (Intrusion Detection in Internet of Things Network); the clustering model worked well and detected the number of attacks correctly, as shown in the results section.
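The clustering step itself can be sketched with a minimal plain-Python K-means over flow-feature vectors. A real pipeline would typically use scikit-learn's KMeans; the toy features below are made up for illustration:

```python
def kmeans(points, k, iters=20):
    """Minimal K-means: alternate nearest-center assignment and centroid
    update. Deterministic, spread-out initialization keeps the sketch
    reproducible (a real run would use k-means++ or random restarts)."""
    step = max(1, len(points) // k)
    centers = points[::step][:k]  # pick k spread-out initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                   else centers[j] for j, cl in enumerate(clusters)]
    return centers, clusters

# two well-separated groups of 2-D "flow features" (toy data)
flows = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
         (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
centers, clusters = kmeans(flows, k=2)
```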

[AI-30] Open Challenges in the Formal Verification of Autonomous Driving

链接: https://arxiv.org/abs/2411.14520
作者: Paolo Burgio(University of Modena and Reggio Emilia),Angelo Ferrando(University of Modena and Reggio Emilia),Marco Villani(University of Modena and Reggio Emilia)
关键词-EN: standard practice, highly complex, complex and heterogeneous, heterogeneous systems, autonomous driving
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Robotics (cs.RO)
*备注: In Proceedings FMAS2024, arXiv:2411.13215

点击查看摘要

Abstract:In the realm of autonomous driving, the development and integration of highly complex and heterogeneous systems are standard practice. Modern vehicles are not monolithic systems; instead, they are composed of diverse hardware components, each running its own software systems. An autonomous vehicle comprises numerous independent components, often developed by different and potentially competing companies. This diversity poses significant challenges for the certification process, as it necessitates certifying components that may not disclose their internal behaviour (black-boxes). In this paper, we present a real-world case study of an autonomous driving system, identify key open challenges associated with its development and integration, and explore how formal verification techniques can address these challenges to ensure system reliability and safety.

[AI-31] Variational Autoencoders for Efficient Simulation-Based Inference

链接: https://arxiv.org/abs/2411.14511
作者: Mayank Nautiyal,Andrey Shternshis,Andreas Hellander,Prashant Singh
关键词-EN: likelihood-free simulation-based inference, variational inference framework, generative modeling approach, inference framework, simulation-based inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present a generative modeling approach based on the variational inference framework for likelihood-free simulation-based inference. The method leverages latent variables within variational autoencoders to efficiently estimate complex posterior distributions arising from stochastic simulations. We explore two variations of this approach distinguished by their treatment of the prior distribution. The first model adapts the prior based on observed data using a multivariate prior network, enhancing generalization across various posterior queries. In contrast, the second model utilizes a standard Gaussian prior, offering simplicity while still effectively capturing complex posterior distributions. We demonstrate the efficacy of these models on well-established benchmark problems, achieving results comparable to flow-based approaches while maintaining computational efficiency and scalability.

[AI-32] Planning-Driven Programming: A Large Language Model Programming Workflow

链接: https://arxiv.org/abs/2411.14503
作者: Chao Lei,Yanchuan Chang,Nir Lipovetzky,Krista A. Ehinger
关键词-EN: processing tasks raises, tasks raises extensive, raises extensive discussion, large language models, language processing tasks
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The strong performance of large language models (LLMs) on natural language processing tasks raises extensive discussion on their application to code generation. Recent work suggests multiple sampling approaches to improve initial code generation accuracy or program repair approaches to refine the code. However, these methods suffer from LLMs’ inefficiencies and limited reasoning capacity. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, in the solution generation phase, the LLM first outlines a solution plan that decomposes the problem into manageable sub-problems and then verifies the generated solution plan through visible test cases. Subsequently, in the code implementation phase, the LLM initially drafts code according to the solution plan and its verification. If the generated code fails the visible tests, the plan verification serves as the intended natural language solution to inform the refinement process for correcting bugs. We further introduce SLPW, a sampling variant of LPW, which initially generates multiple solution plans and plan verifications, produces a program for each plan and its verification, and refines each program as necessary until one successfully passes the visible tests. Compared to the state-of-the-art methods across various existing LLMs, our experimental results show that LPW significantly improves the Pass@1 accuracy by up to 16.4% on well-established text-to-code generation benchmarks, especially with a notable improvement of around 10% on challenging benchmarks. Additionally, SLPW demonstrates up to a 5.6% improvement over LPW and sets new state-of-the-art Pass@1 accuracy on various benchmarks, e.g., 98.2% on HumanEval, 84.8% on MBPP, 64.0% on APPS, and 35.3% on CodeContest, using GPT-4o as the backbone.
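The two-phase control flow described above can be sketched with stubbed model calls; `run_lpw` and the `llm_*` callables are hypothetical names for illustration, not the paper's API:

```python
# Hedged sketch of the LPW two-phase loop: plan, verify the plan on visible
# tests, draft code, then repair failures using the plan verification.
def run_lpw(problem, visible_tests, llm_plan, llm_verify, llm_code, llm_refine, max_iters=3):
    plan = llm_plan(problem)                        # phase 1: solution plan
    verification = llm_verify(plan, visible_tests)  # verify plan against visible tests
    code = llm_code(problem, plan, verification)    # phase 2: draft implementation
    for _ in range(max_iters):
        failures = [t for t in visible_tests if not t(code)]
        if not failures:
            return code                             # passes all visible tests
        # the plan verification acts as the natural-language reference for repair
        code = llm_refine(code, failures, verification)
    return code

# Toy demo with stubbed "LLM" calls solving: return the square of x.
tests = [lambda f: f(2) == 4, lambda f: f(-3) == 9]
drafts = iter([lambda x: x + x, lambda x: x * x])   # first draft is buggy
code = run_lpw(
    problem="square",
    visible_tests=tests,
    llm_plan=lambda p: "multiply x by itself",
    llm_verify=lambda plan, ts: f"plan '{plan}' covers all cases",
    llm_code=lambda p, plan, v: next(drafts),
    llm_refine=lambda c, fails, v: next(drafts),
)
assert code(5) == 25
```

SLPW would run several such loops in parallel over sampled plans and return the first program that clears the visible tests.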

[AI-33] Global Challenge for Safe and Secure LLMs Track 1

链接: https://arxiv.org/abs/2411.14502
作者: Xiaojun Jia,Yihao Huang,Yang Liu,Peng Yan Tan,Weng Kuan Yau,Mun-Thye Mak,Xin Ming Sim,Wee Siong Ng,See Kiong Ng,Hanqing Liu,Lifeng Zhou,Huanqian Yan,Xiaobing Sun,Wei Liu,Long Wang,Yiming Qian,Yong Liu,Junxiao Yang,Zhexin Zhang,Leqi Lei,Renmiao Chen,Yida Lu,Shiyao Cui,Zizhou Wang,Shaohua Li,Yan Wang,Rick Siow Mong Goh,Liangli Zhen,Yingjie Zhang,Zhe Zhao
关键词-EN: Secure Large Language, Programme Office, Global Challenge, Challenge for Safe, Safe and Secure
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks. With the increasing integration of LLMs in critical sectors such as healthcare, finance, and public administration, ensuring these models are resilient to adversarial attacks is vital for preventing misuse and upholding ethical standards. This competition focused on two distinct tracks designed to evaluate and enhance the robustness of LLM security frameworks. Track 1 tasked participants with developing automated methods to probe LLM vulnerabilities by eliciting undesirable responses, effectively testing the limits of existing safety protocols within LLMs. Participants were challenged to devise techniques that could bypass content safeguards across a diverse array of scenarios, from offensive language to misinformation and illegal activities. Through this process, Track 1 aimed to deepen the understanding of LLM vulnerabilities and provide insights for creating more resilient models.

[AI-34] Associative Knowledge Graphs for Efficient Sequence Storage and Retrieval

链接: https://arxiv.org/abs/2411.14480
作者: Przemysław Stokłosa,Janusz A. Starzyk,Paweł Raif,Adrian Horzyk,Marcin Kowalik
关键词-EN: constructing associative knowledge, associative knowledge graphs, paper presents, constructing associative, associative knowledge
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:This paper presents a novel approach for constructing associative knowledge graphs that are highly effective for storing and recognizing sequences. The graph is created by representing overlapping sequences of objects as tightly connected clusters within the larger graph. Individual objects (represented as nodes) can be a part of multiple sequences or appear repeatedly within a single sequence. To retrieve sequences, we leverage context, providing a subset of objects that triggers an association with the complete sequence. The system’s memory capacity is determined by the size of the graph and the density of its connections. We have theoretically derived the relationships between the critical density of the graph and the memory capacity for storing sequences. The critical density is the point beyond which error-free sequence reconstruction becomes impossible. Furthermore, we have developed an efficient algorithm for ordering elements within a sequence. Through extensive experiments with various types of sequences, we have confirmed the validity of these relationships. This approach has potential applications in diverse fields, such as anomaly detection in financial transactions or predicting user behavior based on past actions.
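A toy version of the storage-and-retrieval idea, with a dict-backed graph; this is an illustration of the abstract's description (sequences as chains of edges, context-triggered recall), not the authors' algorithm:

```python
# Minimal sketch: sequences become chains of directed edges in a shared graph,
# and a small context of objects retrieves the full sequence it belongs to.
class SequenceGraph:
    def __init__(self):
        self.edges = {}        # node -> set of successor nodes (the associative structure)
        self.sequences = []    # stored sequences (bookkeeping for this tiny demo)

    def store(self, seq):
        for a, b in zip(seq, seq[1:]):
            self.edges.setdefault(a, set()).add(b)
        self.sequences.append(list(seq))

    def retrieve(self, context):
        """Return the stored sequence(s) containing every object in `context`."""
        ctx = set(context)
        return [s for s in self.sequences if ctx <= set(s)]

g = SequenceGraph()
g.store(["a", "b", "c", "d"])
g.store(["c", "d", "e"])       # overlaps with the first sequence on c, d
assert g.retrieve(["b", "d"]) == [["a", "b", "c", "d"]]
assert g.retrieve(["d", "e"]) == [["c", "d", "e"]]
```

The capacity question the paper studies shows up even here: as more sequences share nodes (`edges` grows denser), a small context stops picking out a unique sequence.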

[AI-35] A Neural Network Training Method Based on Distributed PID Control

链接: https://arxiv.org/abs/2411.14468
作者: Jiang Kun
关键词-EN: network framework based, previous article, symmetric differential equations, based on symmetric, framework based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:In the previous article, we introduced a neural network framework based on symmetric differential equations. This novel framework exhibits complete symmetry, endowing it with perfect mathematical properties. While we have examined some of the system’s mathematical characteristics, a detailed discussion of the network training methodology has not yet been presented. Drawing on the principles of the traditional backpropagation algorithm, this study proposes an alternative training approach that utilizes differential equation signal propagation instead of chain rule derivation. This approach not only preserves the effectiveness of training but also offers enhanced biological interpretability. The foundation of this methodology lies in the system’s reversibility, which stems from its inherent symmetry, a key aspect of our research. However, this method alone is insufficient for effective neural network training. To address this, we further introduce a distributed Proportional-Integral-Derivative (PID) control approach, emphasizing its implementation within a closed system. By incorporating this method, we achieved both faster training speeds and improved accuracy. This approach not only offers novel insights into neural network training but also extends the scope of research into control methodologies. To validate its effectiveness, we apply this method to the MNIST dataset, demonstrating its practical utility.
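The control primitive the abstract builds on is the discrete PID update; a minimal scalar sketch (gains and the setpoint task are illustrative, not from the paper):

```python
# Discrete PID update on a scalar error signal: proportional, integral, and
# derivative terms combined into one control output per step.
def make_pid(kp, ki, kd, dt=1.0):
    state = {"integral": 0.0, "prev": None}
    def step(error):
        state["integral"] += error * dt
        deriv = 0.0 if state["prev"] is None else (error - state["prev"]) / dt
        state["prev"] = error
        return kp * error + ki * state["integral"] + kd * deriv
    return step

# Drive a scalar value toward a setpoint of 1.0.
pid = make_pid(kp=0.5, ki=0.1, kd=0.05)
x = 0.0
for _ in range(100):
    x += pid(1.0 - x)
assert abs(x - 1.0) < 1e-3
```

The paper's "distributed" variant would run many such controllers inside the closed differential-equation system; this sketch only shows the single-loop mechanics.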

[AI-36] Improving training time and GPU utilization in geo-distributed language model training

链接: https://arxiv.org/abs/2411.14458
作者: Palak (Microsoft Research India),Rohan Gandhi (Microsoft Research India),Karan Tandon (Microsoft Research India),Debopam Bhattacherjee (Microsoft Research India),Venkata N. Padmanabhan (Microsoft Research India)
关键词-EN: caused huge surge, widespread adoption, adoption of language, industries has caused, caused huge
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs and housing them in the same datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via Wide-Area-Network (WAN). We build ATLAS that speeds up such training time using novel temporal bandwidth sharing and many other design choices. While ATLAS improves the training time, it does not eliminate the bubbles (idle GPU cycles). We built BUBBLETEA that runs prefill-as-a-service (part of LM inference) during the bubbles that improves the GPU utilization substantially without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.

[AI-37] Deferred Backdoor Functionality Attacks on Deep Learning Models

链接: https://arxiv.org/abs/2411.14449
作者: Jeongjin Shin,Sangdon Park
关键词-EN: adversaries inject malicious, Deep learning models, Deep learning, backdoor, inference time
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models are vulnerable to backdoor attacks, where adversaries inject malicious functionality during training that activates on trigger inputs at inference time. Extensive research has focused on developing stealthy backdoor attacks to evade detection and defense mechanisms. However, these approaches still have limitations that leave the door open for detection and mitigation due to their inherent design to cause malicious behavior in the presence of a trigger. To address this limitation, we introduce Deferred Backdoor Functionality Activation (DBFA), a new paradigm in backdoor attacks. Unlike conventional attacks, DBFA initially conceals its backdoor, producing benign outputs even when triggered. This stealthy behavior allows DBFA to bypass multiple detection and defense methods, remaining undetected during initial inspections. The backdoor functionality is strategically activated only after the model undergoes subsequent updates, such as retraining on benign data. DBFA attacks exploit the common practice in the life cycle of machine learning models to perform model updates and fine-tuning after initial deployment. To implement DBFA attacks, we approach the problem by making the unlearning of the backdoor fragile, allowing it to be easily cancelled and subsequently reactivate the backdoor functionality. To achieve this, we propose a novel two-stage training scheme, called DeferBad. Our extensive experiments across various fine-tuning scenarios, backdoor attack types, datasets, and model architectures demonstrate the effectiveness and stealthiness of DeferBad.

[AI-38] GeMID: Generalizable Models for IoT Device Identification

链接: https://arxiv.org/abs/2411.14441
作者: Kahraman Kostas,Rabia Yasa Kostas,Mike Just,Michael A. Lones
关键词-EN: Internet of Things, proliferation of Internet, Things, Internet, IoT
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 8 pages main (9 figures, 2 tables), 19 pages Supplementary Material, 27 pages total

点击查看摘要

Abstract:With the proliferation of Internet of Things (IoT) devices, ensuring their security has become paramount. Device identification (DI), which distinguishes IoT devices based on their traffic patterns, plays a crucial role in both differentiating devices and identifying vulnerable ones, closing a serious security gap. However, existing approaches to DI that build machine learning models often overlook the challenge of model generalizability across diverse network environments. In this study, we propose a novel framework to address this limitation and evaluate the generalizability of DI models across datasets collected within different network environments. Our approach involves a two-step process: first, we develop a feature and model selection method that is more robust to generalization issues by using a genetic algorithm with external feedback and datasets from distinct environments to refine the selections. Second, the resulting DI models are then tested on further independent datasets in order to robustly assess their generalizability. We demonstrate the effectiveness of our method by empirically comparing it to alternatives, highlighting how fundamental limitations of commonly employed techniques such as sliding window and flow statistics limit their generalizability. Our findings advance research in IoT security and device identification, offering insights into improving model effectiveness and mitigating risks in IoT networks.
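The feature-selection search the abstract describes is genetic-algorithm based; below is a bare-bones sketch with a synthetic fitness function (the real method scores subsets by cross-environment model performance, and all names and parameters here are illustrative):

```python
import random

# Bare-bones genetic algorithm for binary feature-subset selection:
# elitist selection, one-point crossover, and bit-flip mutation.
def ga_select(n_features, fitness, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]           # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]              # one-point crossover
            if rng.random() < 0.2:                 # occasional mutation
                i = rng.randrange(n_features)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Synthetic fitness: features 1 and 3 are "informative", subset size is penalized.
fit = lambda mask: 2 * mask[1] + 2 * mask[3] - sum(mask)
best = ga_select(6, fit)
assert len(best) == 6 and set(best) <= {0, 1}
```

With this fitness, the search tends to converge on a mask keeping only features 1 and 3; the paper's "external feedback" step would additionally re-score candidate subsets on datasets from other network environments.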

[AI-39] Transforming Engineering Education Using Generative AI and Digital Twin Technologies

链接: https://arxiv.org/abs/2411.14433
作者: Yu-Zheng Lin,Ahmed Hussain J Alhamadah,Matthew William Redondo,Karan Himanshu Patel,Sujan Ghimire,Banafsheh Saber Latibari,Soheil Salehi,Pratik Satam
关键词-EN: Digital twin technology, enhance educational experiences, industrial digital twins, increasingly recognized, potential to enhance
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Digital twin technology, traditionally used in industry, is increasingly recognized for its potential to enhance educational experiences. This study investigates the application of industrial digital twins (DTs) in education, focusing on how DT models of varying fidelity can support different stages of Bloom’s taxonomy in the cognitive domain. We align Bloom’s six cognitive stages with educational levels: undergraduate studies for “Remember” and “Understand,” master’s level for “Apply” and “Analyze,” and doctoral level for “Evaluate” and “Create.” Low-fidelity DTs aid essential knowledge acquisition and skill training, providing a low-risk environment for grasping fundamental concepts. Medium-fidelity DTs offer more detailed and dynamic simulations, enhancing application skills and problem-solving. High-fidelity DTs support advanced learners by replicating physical phenomena, allowing for innovative design and complex experiments. Within this framework, large language models (LLMs) serve as mentors, assessing progress, filling knowledge gaps, and assisting with DT interactions, parameter setting, and debugging. We evaluate the educational impact using the Kirkpatrick Model, examining how each DT model’s fidelity influences learning outcomes. This framework helps educators make informed decisions on integrating DTs and LLMs to meet specific learning objectives.

[AI-40] Open-Amp: Synthetic Data Framework for Audio Effect Foundation Models

链接: https://arxiv.org/abs/2411.14972
作者: Alec Wright,Alistair Carson,Lauri Juvela
关键词-EN: Music Information Retrieval, audio effects, effects, paper introduces Open-Amp, diverse audio effects
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:This paper introduces Open-Amp, a synthetic data framework for generating large-scale and diverse audio effects data. Audio effects are relevant to many musical audio processing and Music Information Retrieval (MIR) tasks, such as modelling of analog audio effects, automatic mixing, tone matching and transcription. Existing audio effects datasets are limited in scope, usually including relatively few audio effects processors and a limited amount of input audio signals. Our proposed framework overcomes these issues, by crowdsourcing neural network emulations of guitar amplifiers and effects, created by users of open-source audio effects emulation software. This allows users of Open-Amp complete control over the input signals to be processed by the effects models, as well as providing high-quality emulations of hundreds of devices. Open-Amp can render audio online during training, allowing great flexibility in data augmentation. Our experiments show that using Open-Amp to train a guitar effects encoder achieves new state-of-the-art results on multiple guitar effects classification tasks. Furthermore, we train a one-to-many guitar effects model using Open-Amp, and use it to emulate unseen analog effects via manipulation of its learned latent space, indicating transferability to analog guitar effects data.

[AI-41] Comparative Study of Neural Network Methods for Solving Topological Solitons

链接: https://arxiv.org/abs/2411.14942
作者: Koji Hashimoto,Koshiro Matsuo,Masaki Murata,Gakuto Ogiwara
关键词-EN: including particle physics, nonlinear differential equations, physics and mathematics, physics and cosmology, including particle
类目: High Energy Physics - Theory (hep-th); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:Topological solitons, which are stable, localized solutions of nonlinear differential equations, are crucial in various fields of physics and mathematics, including particle physics and cosmology. However, solving these solitons presents significant challenges due to the complexity of the underlying equations and the computational resources required for accurate solutions. To address this, we have developed a novel method using a neural network (NN) to efficiently solve solitons. A similar NN approach is Physics-Informed Neural Networks (PINN). In a comparative analysis between our method and PINN, we find that our method achieves shorter computation times while maintaining the same level of accuracy. This advancement in computational efficiency not only overcomes current limitations but also opens new avenues for studying topological solitons and their dynamical behavior.
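The kind of residual objective a soliton-solving network minimizes can be checked concretely on the static sine-Gordon kink, which solves phi'' = sin(phi); a small numpy sketch (grid, helper name, and the comparison function are illustrative, not from the paper):

```python
import numpy as np

# Score a candidate field by the mean squared residual of phi'' = sin(phi)
# on a grid -- the kind of physics loss a PINN-style solver minimizes.
def residual_loss(phi_vals, x):
    h = x[1] - x[0]
    phi_xx = (phi_vals[2:] - 2 * phi_vals[1:-1] + phi_vals[:-2]) / h**2
    return np.mean((phi_xx - np.sin(phi_vals[1:-1]))**2)

x = np.linspace(-5, 5, 2001)
kink = 4 * np.arctan(np.exp(x))        # exact topological soliton (sine-Gordon kink)
bad = np.pi * (1 + np.tanh(x))         # same boundary values, but not a solution

assert residual_loss(kink, x) < 1e-6
assert residual_loss(bad, x) > 1e-3
```

The exact kink drives the residual to (discretization-level) zero while a plausible-looking impostor does not, which is exactly the signal such losses exploit during training.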

[AI-42] Quantum Hamiltonian Descent for Graph Partition

链接: https://arxiv.org/abs/2411.14696
作者: Jinglei Cheng,Ruilin Zhou,Yuhang Gan,Chen Qian,Junyu Liu
关键词-EN: Quantum Hamiltonian Descent, introduce Quantum Hamiltonian, Quantum Hamiltonian, Hamiltonian Descent, Quadratic Unconstrained Binary
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Quantum Hamiltonian Descent (QHD) as a novel approach to solve the graph partition problem. By reformulating graph partition as a Quadratic Unconstrained Binary Optimization (QUBO) problem, we leverage QHD’s quantum-inspired dynamics to identify optimal community structures. Our method implements a multi-level refinement strategy that alternates between QUBO formulation and QHD optimization to iteratively improve partition quality. Experimental results demonstrate that our QHD-based approach achieves superior modularity scores (up to 5.49% improvement) with reduced computational overhead compared to traditional optimization methods. This work establishes QHD as an effective quantum-inspired framework for tackling graph partition challenges in large-scale networks.
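The QUBO reformulation can be made concrete on a toy graph: encode side assignments as binary variables, penalize cut edges plus imbalance, and minimize. Here brute force stands in for QHD's dynamics, and the penalty weight is an illustrative choice:

```python
from itertools import product

# Graph partition as QUBO: minimize cut edges plus a balance penalty over
# binary side assignments x_i in {0, 1}. For an edge (i, j) the term
# x_i + x_j - 2*x_i*x_j is 1 exactly when the edge is cut.
def qubo_partition(n, edges, penalty=2.0):
    def cost(x):
        cut = sum(x[i] + x[j] - 2 * x[i] * x[j] for i, j in edges)
        balance = (sum(x) - n / 2) ** 2
        return cut + penalty * balance
    return min(product([0, 1], repeat=n), key=cost)

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
x = qubo_partition(6, edges)
# Each triangle lands on one side, cutting only the bridge.
assert x[0] == x[1] == x[2] and x[3] == x[4] == x[5] and x[0] != x[3]
```

Brute force is exponential in the node count; the point of a QHD-style (or any annealing-style) solver is to minimize this same objective at scales where enumeration is hopeless.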

[AI-43] Leveraging Gene Expression Data and Explainable Machine Learning for Enhanced Early Detection of Type 2 Diabetes

链接: https://arxiv.org/abs/2411.14471
作者: Aurora Lithe Roy,Md Kamrul Siam,Nuzhat Noor Islam Prova,Sumaiya Jahan,Abdullah Al Maruf
关键词-EN: global health burden, substantial global health, gene expression, Diabetes, kidney failure
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
*备注: 8 pages

点击查看摘要

Abstract:Diabetes, particularly Type 2 diabetes (T2D), poses a substantial global health burden, compounded by its associated complications such as cardiovascular diseases, kidney failure, and vision impairment. Early detection of T2D is critical for improving healthcare outcomes and optimizing resource allocation. In this study, we address the gap in early T2D detection by leveraging machine learning (ML) techniques on gene expression data obtained from T2D patients. Our primary objective was to enhance the accuracy of early T2D detection through advanced ML methodologies and increase the model’s trustworthiness using the explainable artificial intelligence (XAI) technique. Analyzing the biological mechanisms underlying T2D through gene expression datasets represents a novel research frontier, relatively less explored in previous studies. While numerous investigations have focused on utilizing clinical and demographic data for T2D prediction, the integration of molecular insights from gene expression datasets offers a unique and promising avenue for understanding the pathophysiology of the disease. By employing six ML classifiers on data sourced from NCBI’s Gene Expression Omnibus (GEO), we observed promising performance across all models. Notably, the XGBoost classifier exhibited the highest accuracy, achieving 97%. Our study addresses a notable gap in early T2D detection methodologies, emphasizing the importance of leveraging gene expression data and advanced ML techniques.

[AI-44] JESTR: Joint Embedding Space Technique for Ranking Candidate Molecules for the Annotation of Untargeted Metabolomics Data

链接: https://arxiv.org/abs/2411.14464
作者: Apurva Kalia,Dilip Krishnan,Soha Hassoun
关键词-EN: spectral fragmentation patterns, mass spectral fragmentation, fragmentation patterns, major challenge, challenge in metabolomics
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 7 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Motivation: A major challenge in metabolomics is annotation: assigning molecular structures to mass spectral fragmentation patterns. Despite recent advances in molecule-to-spectra and in spectra-to-molecular fingerprint prediction (FP), annotation rates remain low. Results: We introduce in this paper a novel paradigm (JESTR) for annotation. Unlike prior approaches that explicitly construct molecular fingerprints or spectra, JESTR leverages the insight that molecules and their corresponding spectra are views of the same data and effectively embeds their representations in a joint space. Candidate structures are ranked based on cosine similarity between the embeddings of query spectrum and each candidate. We evaluate JESTR against mol-to-spec and spec-to-FP annotation tools on three datasets. On average, for rank@[1-5], JESTR outperforms other tools by 23.6%-71.6%. We further demonstrate the strong value of regularization with candidate molecules during training, boosting rank@1 performance by 11.4% and enhancing the model’s ability to discern between target and candidate molecules. Through JESTR, we offer a novel promising avenue towards accurate annotation, therefore unlocking valuable insights into the metabolome.
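The candidate-ranking step, cosine similarity in a joint embedding space, is easy to sketch; the embeddings below are made up, whereas in JESTR they come from trained spectrum and molecule encoders:

```python
import numpy as np

# Rank candidate molecules by cosine similarity between the query-spectrum
# embedding and each candidate embedding in the joint space.
def rank_candidates(query_emb, cand_embs):
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims), sims      # best-first order, raw similarities

query = np.array([1.0, 0.2, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.1],   # close to the query
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.5],
])
order, sims = rank_candidates(query, candidates)
assert order[0] == 0                    # the near-duplicate ranks first
assert sims[0] > sims[1] > sims[2]
```

The rank@k metrics in the abstract then just ask whether the true structure appears among the first k entries of `order`.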

机器学习

[LG-0] PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision NEURIPS2024

链接: https://arxiv.org/abs/2411.15127
作者: Arnav M. Das,Chi Ian Tang,Fahim Kawsar,Mohammad Malekzadeh
关键词-EN: Inertial Measurement Units, Measurement Units, Inertial Measurement, Sensing human motions, enabled significant applications
类目: Machine Learning (cs.LG)
*备注: Also presented under the title “PRIMUS: Pretraining IMU Encoders with Multimodal and Self-Supervised Learning” at NeurIPS 2024 TSALM Workshop (Time Series in the Age of Large Models)

点击查看摘要

Abstract:Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motions. For video or text modalities, the “pretrain and adapt” approach utilizes large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMU, and (2) open-source pretrained models that generalize across datasets are rarely publicly available. In this paper, we aim to address the first issue by proposing PRIMUS, a method for PRetraining IMU encoderS. We conduct a systematic and unified evaluation of various self-supervised and multimodal learning pretraining objectives. Our findings indicate that using PRIMUS, which combines self-supervision, multimodal supervision, and nearest-neighbor supervision, can significantly enhance downstream performance. With fewer than 500 labeled samples per class, PRIMUS effectively enhances downstream performance by up to 15% in held-out test data, compared to the state-of-the-art multimodal training method. To benefit the broader community, our code and pre-trained IMU encoders will be made publicly available at this http URL upon publication.
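One ingredient of the multimodal supervision mentioned above is an InfoNCE-style alignment between paired embeddings; a numpy sketch with made-up IMU features (shapes and the temperature are illustrative, not PRIMUS's actual objective):

```python
import numpy as np

# InfoNCE-style alignment loss between two batches of paired embeddings:
# matched pairs sit on the diagonal of the similarity matrix.
def info_nce(a, b, temperature=0.1):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
imu = rng.normal(size=(8, 16))
aligned_loss = info_nce(imu, imu)        # perfectly aligned pairs
shuffled_loss = info_nce(imu, imu[::-1]) # mismatched pairs
assert aligned_loss < shuffled_loss
```

A pretraining run drives the loss toward the aligned regime; nearest-neighbor supervision, the third ingredient, would additionally treat close neighbors of a sample as extra positives.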

[LG-1] Learnable Activation Functions in Physics-Informed Neural Networks for Solving Partial Differential Equations

链接: https://arxiv.org/abs/2411.15111
作者: Afrah Fareaa,Mustafa Serdar Celebi
关键词-EN: Partial Differential Equations, solving Partial Differential, Differential Equations, Partial Differential, Physics-Informed Neural Networks
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate the use of learnable activation functions in Physics-Informed Neural Networks (PINNs) for solving Partial Differential Equations (PDEs). Specifically, we compare the efficacy of traditional Multilayer Perceptrons (MLPs) with fixed and learnable activations against Kolmogorov-Arnold Networks (KANs), which employ learnable basis functions. Physics-informed neural networks (PINNs) have emerged as an effective method for directly incorporating physical laws into the learning process, offering a data-efficient solution for both the forward and inverse problems associated with PDEs. However, challenges such as effective training and spectral bias, where low-frequency components are learned more effectively, often limit their applicability to problems characterized by rapid oscillations or sharp transitions. By employing different activation or basis functions on MLP and KAN, we assess their impact on convergence behavior and spectral bias mitigation, and the accurate approximation of PDEs. The findings offer insights into the design of neural network architectures that balance training efficiency, convergence speed, and test accuracy for PDE solvers. By evaluating the influence of activation or basis function choices, this work provides guidelines for developing more robust and accurate PINN models. The source code and pre-trained models used in this study are made publicly available to facilitate reproducibility and future exploration.

[LG-2] Effective Littlestone Dimension

链接: https://arxiv.org/abs/2411.15109
作者: Valentino Delle Rose,Alexander Kozachinskiy,Tomasz Steifer
关键词-EN: characterizes improper PAC, Delle Rose, effective Littlestone dimension, improper PAC learning, Littlestone dimension
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 12 pages

点击查看摘要

Abstract:Delle Rose et al. (COLT’23) introduced an effective version of the Vapnik-Chervonenkis dimension, and showed that it characterizes improper PAC learning with total computable learners. In this paper, we introduce and study a similar effectivization of the notion of Littlestone dimension. Finite effective Littlestone dimension is a necessary condition for computable online learning but is not a sufficient one – which we already establish for classes of effective Littlestone dimension 2. However, the effective Littlestone dimension equals the optimal mistake bound for computable learners in two special cases: a) for classes of Littlestone dimension 1 and b) when the learner receives as additional information an upper bound on the numbers to be guessed. Interestingly, finite effective Littlestone dimension also guarantees that the class consists only of computable functions.
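The classical quantity behind the paper, the Littlestone dimension, can be computed for finite classes by the standard mistake-tree recursion; the effectivized version the paper studies concerns computable learners and is not captured by this sketch:

```python
from functools import lru_cache
from itertools import product

# Ldim(H) = 0 if |H| <= 1; otherwise the max over instances x (with both
# label-restrictions nonempty) of 1 + min(Ldim(H_{x,0}), Ldim(H_{x,1})).
def ldim(hypotheses, domain):
    hypotheses = frozenset(hypotheses)  # each h: tuple of labels, indexed by domain
    @lru_cache(maxsize=None)
    def rec(hs):
        if len(hs) <= 1:
            return 0
        best = 0
        for x in domain:
            h0 = frozenset(h for h in hs if h[x] == 0)
            h1 = frozenset(h for h in hs if h[x] == 1)
            if h0 and h1:
                best = max(best, 1 + min(rec(h0), rec(h1)))
        return best
    return rec(hypotheses)

domain = range(4)
all_fns = list(product([0, 1], repeat=4))   # every binary function on 4 points
thresholds = [tuple(int(x >= t) for x in domain) for t in range(5)]
assert ldim(all_fns, domain) == 4           # full class: Ldim = |X|
assert ldim(thresholds, domain) == 2        # floor(log2(5)) = 2
```

The value equals the optimal mistake bound for online learning of the class; the paper's question is when a *computable* learner can actually achieve it.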

[LG-3] AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

链接: https://arxiv.org/abs/2411.15102
作者: Fengyuan Liu,Nikhil Kandpal,Colin Raffel
关键词-EN: context attribution, context attribution methods, context span effect, large language models, LLM generations
类目: Machine Learning (cs.LG)
*备注: 29 pages, 11 figures

点击查看摘要

Abstract:The influence of contextual input on the behavior of large language models (LLMs) has prompted the development of context attribution methods that aim to quantify each context span’s effect on an LLM’s generations. The leave-one-out (LOO) error, which measures the change in the likelihood of the LLM’s response when a given span of the context is removed, provides a principled way to perform context attribution, but can be prohibitively expensive to compute for large models. In this work, we introduce AttriBoT, a series of novel techniques for efficiently computing an approximation of the LOO error for context attribution. Specifically, AttriBoT uses cached activations to avoid redundant operations, performs hierarchical attribution to reduce computation, and emulates the behavior of large target models with smaller proxy models. Taken together, AttriBoT can provide a 300x speedup while remaining more faithful to a target model’s LOO error than prior context attribution methods. This stark increase in performance makes computing context attributions for a given response 30x faster than generating the response itself, empowering real-world applications that require computing attributions at scale. We release a user-friendly and efficient implementation of AttriBoT to enable efficient LLM interpretability as well as encourage future development of efficient context attribution methods.
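The LOO bookkeeping itself is simple; the expensive part AttriBoT approximates is the scorer. Below, a toy keyword scorer stands in for the LLM's response log-likelihood (the scorer and span texts are illustrative, not from the paper):

```python
# Leave-one-out attribution: each span's score is how much the response score
# drops when that span is removed from the context.
def loo_attributions(spans, score):
    base = score(spans)
    return [base - score(spans[:i] + spans[i + 1:]) for i in range(len(spans))]

# Toy scorer: how strongly the remaining context supports the answer "Paris".
def support(spans):
    text = " ".join(spans)
    return 2.0 * ("capital" in text) + 3.0 * ("Paris" in text)

spans = ["France is in Europe.", "Its capital is Paris.", "Baguettes are bread."]
attr = loo_attributions(spans, support)
assert attr[1] == 5.0          # removing the key span drops the score most
assert attr[0] == attr[2] == 0.0
```

With a real model, each `score` call is a forward pass over the ablated context, so n spans cost n+1 passes; AttriBoT's caching, hierarchical attribution, and proxy models all attack exactly that cost.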

[LG-4] What You See is Not What You Get: Neural Partial Differential Equations and The Illusion of Learning

链接: https://arxiv.org/abs/2411.15101
作者: Arvind Mohan,Ashesh Chattopadhyay,Jonah Miller
关键词-EN: scientific machine learning, directly embeds neural, embeds neural networks, neural networks inside, networks inside PDEs
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Differentiable Programming for scientific machine learning (SciML) has recently seen considerable interest and success, as it directly embeds neural networks inside PDEs, often called NeuralPDEs, derived from first principle physics. Therefore, there is a widespread assumption in the community that NeuralPDEs are more trustworthy and generalizable than black box models. However, like any SciML model, differentiable programming relies predominantly on high-quality PDE simulations as “ground truth” for training. Yet mathematics dictates that these are only discrete numerical approximations of the true physics. Therefore, we ask: Are NeuralPDEs and differentiable programming models trained on PDE simulations as physically interpretable as we think? In this work, we rigorously attempt to answer these questions, using established ideas from numerical analysis, experiments, and analysis of model Jacobians. Our study shows that NeuralPDEs learn the artifacts in the simulation training data arising from the discretized Taylor Series truncation error of the spatial derivatives. Additionally, NeuralPDE models are systematically biased, and their generalization capability is likely enabled by a fortuitous interplay of numerical dissipation and truncation error in the training dataset and NeuralPDE, which seldom happens in practical applications. This bias manifests aggressively even in relatively accessible 1-D equations, raising concerns about the veracity of differentiable programming on complex, high-dimensional, real-world PDEs, and about the dataset integrity of foundation models. Further, we observe that the initial condition constrains the truncation error in initial-value problems in PDEs, thereby limiting extrapolation. Finally, we demonstrate that an eigenanalysis of model weights can indicate a priori if the model will be inaccurate for out-of-distribution testing.
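The truncation-error artifact at the heart of the argument is easy to exhibit: a central difference carries an O(h^2) Taylor error, so halving h cuts the error by roughly 4x. A minimal demonstration (the test point and step sizes are arbitrary choices):

```python
import math

# Central-difference derivative and its second-order truncation error:
# f'(x) ~ (f(x+h) - f(x-h)) / (2h), with error ~ f'''(x) * h^2 / 6.
def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

errors = []
for h in (0.1, 0.05, 0.025):
    approx = central_diff(math.sin, 0.3, h)
    errors.append(abs(approx - math.cos(0.3)))

ratio1 = errors[0] / errors[1]
ratio2 = errors[1] / errors[2]
assert 3.8 < ratio1 < 4.2 and 3.8 < ratio2 < 4.2   # second-order convergence
```

A model trained on data produced at a fixed h can fit this h-dependent error term rather than the continuum derivative, which is the failure mode the abstract describes.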

[LG-5] On Multi-Agent Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2411.15046
作者: Till Freihaut,Giorgia Ramponi
关键词-EN: Inverse Reinforcement Learning, utility function, highly influenced, shape both individual, individual goals
类目: Machine Learning (cs.LG)
*备注: Currently under review

点击查看摘要

Abstract:In multi-agent systems, the agent behavior is highly influenced by its utility function, as these utilities shape both individual goals as well as interactions with the other agents. Inverse Reinforcement Learning (IRL) is a well-established approach to inferring the utility function by observing an expert behavior within a given environment. In this paper, we extend the IRL framework to the multi-agent setting, assuming to observe agents who are following Nash Equilibrium (NE) policies. We theoretically investigate the set of utilities that explain the behavior of NE experts. Specifically, we provide an explicit characterization of the feasible reward set and analyze how errors in estimating the transition dynamics and expert behavior impact the recovered rewards. Building on these findings, we provide the first sample complexity analysis for the multi-agent IRL problem. Finally, we provide a numerical evaluation of our theoretical results.

[LG-6] Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium

链接: https://arxiv.org/abs/2411.15036
作者: Zeyang Li,Navid Azizan
关键词-EN: achieved notable success, demonstrating impressive performance, safe MARL, MARL, demonstrating impressive
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real-world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and lacks an all-time safety assurance. Moreover, these methods often overlook the feasibility issue (the system will inevitably violate state constraints within certain regions of the constraint set), resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with state-wise constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control-theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi-agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high-dimensional systems, we propose Multi-Agent Dual Actor-Critic (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.

[LG-7] On the Linear Speedup of Personalized Federated Reinforcement Learning with Shared Representations

链接: https://arxiv.org/abs/2411.15014
作者: Guojun Xiong,Shufan Wang,Daniel Jiang,Jian Li
关键词-EN: enables multiple agents, local trajectories collected, enables multiple, agent-environment interactions, trajectories collected
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Federated reinforcement learning (FedRL) enables multiple agents to collaboratively learn a policy without sharing their local trajectories collected during agent-environment interactions. However, in practice, the environments faced by different agents are often heterogeneous, leading to poor performance by the single policy learned by existing FedRL algorithms on individual agents. In this paper, we take a further step and introduce a personalized FedRL framework (PFedRL) by taking advantage of possibly shared common structure among agents in heterogeneous environments. Specifically, we develop a class of PFedRL algorithms named PFedRL-Rep that learns (1) a shared feature representation collaboratively among all agents, and (2) an agent-specific weight vector personalized to its local environment. We analyze the convergence of PFedTD-Rep, a particular instance of the framework with temporal difference (TD) learning and linear representations. To the best of our knowledge, we are the first to prove a linear convergence speedup with respect to the number of agents in the PFedRL setting. To achieve this, we show that PFedTD-Rep is an example of the federated two-timescale stochastic approximation with Markovian noise. Experimental results demonstrate that PFedTD-Rep, along with an extension to the control setting based on deep Q-networks (DQN), not only improves learning in heterogeneous settings, but also provides better generalization to new environments.

[LG-8] Adaptive Group Robust Ensemble Knowledge Distillation NEURIPS2024

链接: https://arxiv.org/abs/2411.14984
作者: Patrik Kenfack,Ulrich Aïvodji,Samira Ebrahimi Kahou
关键词-EN: learn spurious correlations, Neural networks, ensemble knowledge distillation, networks can learn, learn spurious
类目: Machine Learning (cs.LG)
*备注: Workshop Algorithmic Fairness through the Lens of Metrics and Evaluation at NeurIPS 2024

点击查看摘要

Abstract:Neural networks can learn spurious correlations in the data, often leading to performance disparity for underrepresented subgroups. Studies have demonstrated that the disparity is amplified when knowledge is distilled from a complex teacher model to a relatively “simple” student model. Prior work has shown that ensemble deep learning methods can improve the performance of the worst-case subgroups; however, it is unclear if this advantage carries over when distilling knowledge from an ensemble of teachers, especially when the teacher models are debiased. This study demonstrates that traditional ensemble knowledge distillation can significantly drop the performance of the worst-case subgroups in the distilled student model even when the teacher models are debiased. To overcome this, we propose Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a simple ensembling strategy to ensure that the student model receives knowledge beneficial for unknown underrepresented subgroups. Leveraging an additional biased model, our method selectively chooses teachers whose knowledge would better improve the worst-performing subgroups by upweighting the teachers with gradient directions deviating from the biased model. Our experiments on several datasets demonstrate the superiority of the proposed ensemble distillation technique and show that it can even outperform classic model ensembles based on majority voting.
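
The upweighting idea — trust teachers whose gradient direction deviates from a known biased model — can be sketched with a toy deviation score. The cosine-based score and the normalization below are illustrative choices for this sketch, not the paper's exact formula:

```python
import math

def cosine(u, v):
    # Cosine similarity between two gradient vectors (plain Python lists).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def teacher_weights(teacher_grads, biased_grad):
    # Deviation score = 1 - cosine similarity with the biased model's gradient:
    # teachers pointing away from the biased direction get more weight.
    scores = [1.0 - cosine(g, biased_grad) for g in teacher_grads]
    total = sum(scores)
    return [s / total for s in scores]

biased_grad = [1.0, 0.0]
teacher_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # aligned / orthogonal / opposed
w = teacher_weights(teacher_grads, biased_grad)
```

A teacher perfectly aligned with the biased model is zeroed out, while the teacher pointing the opposite way dominates the distillation mixture.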

[LG-9] Leveraging LLMs for Legacy Code Modernization: Challenges and Opportunities for LLM-Generated Documentation ICSE2025

链接: https://arxiv.org/abs/2411.14971
作者: Colin Diggs,Michael Doyle,Amit Madan,Siggy Scott,Emily Escamilla,Jacob Zimmer,Naveed Nekoo,Paul Ursino,Michael Bartholf,Zachary Robin,Anand Patel,Chris Glasz,William Macke,Paul Kirk,Jasper Phillips,Arun Sridharan,Doug Wendt,Scott Rosen,Nitin Naik,Justin F. Brunelle,Samruddhi Thaker
关键词-EN: mainframe Assembly Language, IBM mainframe Assembly, Legacy software systems, Assembly Language Code, written in outdated
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Abbreviated version submitted to LLM4Code 2025 (a workshop co-located with ICSE 2025), 13 pages, 3 figures

点击查看摘要

Abstract:Legacy software systems, written in outdated languages like MUMPS and mainframe assembly, pose challenges in efficiency, maintenance, staffing, and security. While LLMs offer promise for modernizing these systems, their ability to understand legacy languages is largely unknown. This paper investigates the utilization of LLMs to generate documentation for legacy code using two datasets: an electronic health records (EHR) system in MUMPS and open-source applications in IBM mainframe Assembly Language Code (ALC). We propose a prompting strategy for generating line-wise code comments and a rubric to evaluate their completeness, readability, usefulness, and hallucination. Our study assesses the correlation between human evaluations and automated metrics, such as code complexity and reference-based metrics. We find that LLM-generated comments for MUMPS and ALC are generally hallucination-free, complete, readable, and useful compared to ground-truth comments, though ALC poses challenges. However, no automated metrics strongly correlate with comment quality to predict or measure LLM performance. Our findings highlight the limitations of current automated measures and the need for better evaluation metrics for LLM-generated documentation in legacy systems.

[LG-10] Many happy returns: machine learning to support platelet issuing and waste reduction in hospital blood banks

链接: https://arxiv.org/abs/2411.14939
作者: Joseph Farrington,Samah Alimam,Martin Utley,Kezhi Li,Wai Keong Wong
关键词-EN: Efforts to reduce, ordering policies, focused on ordering, predominant practice, returned unused
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efforts to reduce platelet wastage in hospital blood banks have focused on ordering policies, but the predominant practice of issuing the oldest unit first may not be optimal when some units are returned unused. We propose a novel, machine learning (ML)-guided issuing policy to increase the likelihood of returned units being reissued before expiration. Our ML model trained to predict returns on 17,297 requests for platelets gave AUROC 0.74 on 9,353 held-out requests. Prior to ML model development we built a simulation of the blood bank operation that incorporated returns to understand the scale of benefits of such a model. Using our trained model in the simulation gave an estimated reduction in wastage of 14%. Our partner hospital is considering adopting our approach, which would be particularly beneficial for hospitals with higher return rates and where units have a shorter remaining useful life on arrival.
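
The core idea — deviate from oldest-unit-first when the model predicts a likely return — can be sketched as a simple decision rule. The 0.5 threshold and the "freshest-first on likely return" fallback here are hypothetical simplifications for illustration, not the authors' actual policy:

```python
def choose_unit(units, p_return, threshold=0.5):
    """Pick a platelet unit to issue for one request.

    units: list of (unit_id, days_to_expiry) pairs currently in stock.
    p_return: model-predicted probability the issued unit comes back unused.
    """
    if p_return >= threshold:
        # Return likely: issue the freshest unit, so if it does come back
        # it still has shelf life left and can be reissued before expiry.
        return max(units, key=lambda u: u[1])[0]
    # Return unlikely: classic oldest-unit-first to minimise expiry waste.
    return min(units, key=lambda u: u[1])[0]

stock = [("A", 1), ("B", 3), ("C", 5)]
likely_return = choose_unit(stock, p_return=0.8)      # freshest unit
likely_transfused = choose_unit(stock, p_return=0.1)  # oldest unit
```

The paper's simulation study is what quantifies when such a rule actually reduces wastage; the estimated 14% reduction comes from their model-in-the-loop simulation, not from this toy rule.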

[LG-11] Predictive Modeling For Real-Time Personalized Health Monitoring in Muscular Dystrophy Management

链接: https://arxiv.org/abs/2411.14923
作者: Mohammed Akkaoui
关键词-EN: Muscular Dystrophy, functioning of muscles, people worldwide, group of genetic, genetic disorders
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Muscular Dystrophy is a group of genetic disorders that progressively affect the strength and functioning of muscles, impacting millions of people worldwide. The lifelong, progressive nature of MD requires continuous follow-up care. This conceptual paper proposes an Internet of Things-based system to support the management of MD through remote, multi-dimensional monitoring of patients in order to provide real-time health status updates. Traditional methods have failed to give actionable data in real time, hence denying healthcare providers the opportunity to make evidence-based decisions. Technology-driven approaches are urgently needed to provide deep insights into disease progression and patient health. The proposed system aims to enhance treatment strategies, enabling patients to better manage their condition and giving healthcare professionals more confidence in their management decisions.

[LG-12] Exploring Kolmogorov-Arnold Networks for Interpretable Time Series Classification

链接: https://arxiv.org/abs/2411.14904
作者: Irina Barašin,Blaž Bertalanič,Miha Mohorčič,Carolina Fortuna
关键词-EN: Time series classification, shown promising performance, relevant step supporting, deep neural models, Time series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series classification is a relevant step supporting decision-making processes in various domains, and deep neural models have shown promising performance. Despite significant advancements in deep learning, the theoretical understanding of how and why complex architectures function remains limited, prompting the need for more interpretable models. Recently, the Kolmogorov-Arnold Networks (KANs) have been proposed as a more interpretable alternative. While KAN-related research is significantly rising, to date, the study of KAN architectures for time series classification has been limited. In this paper, we aim to conduct a comprehensive and robust exploration of the KAN architecture for time series classification on the UCR benchmark. More specifically, we look at a) how reference architectures for forecasting transfer to classification, b) how hyperparameters and implementation choices influence classification performance, with a view to finding the configuration that performs best on the selected benchmark, c) the complexity trade-offs, and d) the interpretability advantages. Our results show that (1) Efficient KAN outperforms MLP in performance and computational efficiency, showcasing its suitability for classification tasks. (2) Efficient KAN is more stable than KAN across grid sizes, depths, and layer configurations, particularly with lower learning rates. (3) KAN maintains competitive accuracy compared to state-of-the-art models like HIVE-COTE2, with smaller architectures and faster training times, supporting its balance of performance and transparency. (4) The interpretability of the KAN model aligns with findings from SHAP analysis, reinforcing its capacity for transparent decision-making.

[LG-13] Ex Uno Pluria: Insights on Ensembling in Low Precision Number Systems NEURIPS2024

链接: https://arxiv.org/abs/2411.14860
作者: Giung Nam,Juho Lee
关键词-EN: improving generalization performance, models remains challenging, scaling current ensemble, large models remains, deep neural networks
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:While ensembling deep neural networks has shown promise in improving generalization performance, scaling current ensemble methods for large models remains challenging. Given that recent progress in deep learning is largely driven by the scale, exemplified by the widespread adoption of large-scale neural network architectures, scalability emerges as an increasingly critical issue for machine learning algorithms in the era of large-scale models. In this work, we first showcase the potential of low precision ensembling, where ensemble members are derived from a single model within low precision number systems in a training-free manner. Our empirical analysis demonstrates the effectiveness of our proposed low precision ensembling method compared to existing ensemble approaches.
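
One training-free way to obtain such an ensemble is to draw each member by stochastically rounding a single model's weights onto a coarse low-precision grid. The tiny linear "model" and the 0.25 grid step below are illustrative stand-ins for a real network and number system:

```python
import random

def stochastic_round(x, step):
    # Snap x onto the grid {..., -step, 0, step, ...}, rounding up with
    # probability proportional to the remainder (unbiased in expectation).
    lo = (x // step) * step
    p_up = (x - lo) / step
    return lo + step if random.random() < p_up else lo

def quantize_weights(weights, step):
    return [stochastic_round(w, step) for w in weights]

def predict(weights, features):
    # Toy linear model standing in for a full network's forward pass.
    return sum(w * f for w, f in zip(weights, features))

random.seed(0)
weights = [0.37, -1.12, 0.05]
x = [1.0, 0.5, 2.0]
# Each ensemble member is one stochastic low-precision copy of the same model.
members = [quantize_weights(weights, step=0.25) for _ in range(32)]
ensemble_pred = sum(predict(m, x) for m in members) / len(members)
```

Because the rounding is unbiased, averaging members recovers the full-precision prediction while each member individually fits in the low-precision format.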

[LG-14] Applications of fractional calculus in learned optimization NEURIPS

链接: https://arxiv.org/abs/2411.14855
作者: Teodor Alexandru Szente,James Harrison,Mihai Zanfir,Cristian Sminchisescu
关键词-EN: incorporating fractional-order derivatives, traditional gradient descent, gradient descent methods, extend traditional gradient, studied extensively
类目: Machine Learning (cs.LG)
*备注: NeurIPS Workshop on Optimization for Machine Learning

点击查看摘要

Abstract:Fractional gradient descent has been studied extensively, with a focus on its ability to extend traditional gradient descent methods by incorporating fractional-order derivatives. This approach allows for more flexibility in navigating complex optimization landscapes and offers advantages in certain types of problems, particularly those involving non-linearities and chaotic dynamics. Yet, the challenge of fine-tuning the fractional order parameters remains unsolved. In this work, we demonstrate that it is possible to train a neural network to predict the order of the gradient effectively.
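
For power functions the Caputo fractional derivative has a closed form, D^α x^p = Γ(p+1)/Γ(p−α+1) · x^(p−α) for x > 0, which makes a minimal fractional-gradient-descent sketch possible. The objective x² and the hyperparameters are illustrative; the paper's contribution (learning the order α with a neural network) is not reproduced here:

```python
import math

def caputo_grad_sq(x, alpha):
    # Caputo fractional derivative of f(x) = x**2 for x > 0:
    # D^alpha x^2 = Gamma(3) / Gamma(3 - alpha) * x**(2 - alpha).
    return math.gamma(3) / math.gamma(3 - alpha) * x ** (2 - alpha)

def fractional_gd(x0, alpha, lr=0.1, steps=200):
    # Descent on x**2 with a fractional-order gradient;
    # alpha = 1 recovers the ordinary gradient 2*x.
    x = x0
    for _ in range(steps):
        x -= lr * caputo_grad_sq(x, alpha)
    return x

x_frac = fractional_gd(2.0, alpha=0.8)
x_int = fractional_gd(2.0, alpha=1.0)  # classic gradient descent baseline
```

Both runs converge to the minimizer at 0; the fractional order changes the shape of the descent trajectory, which is the flexibility the abstract refers to.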

[LG-15] Gradient Masking All-at-Once: Ensemble Everything Everywhere Is Not Robust

链接: https://arxiv.org/abs/2411.14834
作者: Jie Zhang,Kristina Nikolić,Nicholas Carlini,Florian Tramèr
关键词-EN: make image classifiers, image classifiers robust, recently proposed, proposed to make, Ensemble
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensemble everything everywhere is a defense against adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model’s intermediate representations at multiple noisy image resolutions, producing a single robust classification. This defense was shown to be effective against multiple state-of-the-art attacks. Perhaps even more convincingly, it was shown that the model’s gradients are perceptually aligned: attacks against the model produce noise that perceptually resembles the targeted class. In this short note, we show that this defense is not robust to adversarial attack. We first show that the defense’s randomness and ensembling method cause severe gradient masking. We then use standard adaptive attack techniques to reduce the defense’s robust accuracy from 48% to 1% on CIFAR-100 and from 62% to 4% on CIFAR-10, under the ℓ∞-norm threat model with ε = 8/255.
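
The standard adaptive-attack fix for a randomized defense is to replace a single noisy gradient with an Expectation-over-Transformation (EOT) average. A one-dimensional toy version — the jittered loss is an illustrative stand-in, not the paper's defense or attack:

```python
import random

def randomized_loss(x, rng):
    # Toy stand-in for a stochastic defense: every evaluation jitters the
    # input, so a one-shot gradient estimate is dominated by noise.
    return (x + rng.gauss(0.0, 0.5)) ** 2

def eot_gradient(x, n_samples=2000, eps=1e-3, seed=0):
    # EOT: average many stochastic evaluations on each side of a finite
    # difference; replaying the same seed on both sides (common random
    # numbers) makes the shared noise cancel out of the difference.
    rng = random.Random(seed)
    plus = sum(randomized_loss(x + eps, rng) for _ in range(n_samples)) / n_samples
    rng = random.Random(seed)
    minus = sum(randomized_loss(x - eps, rng) for _ in range(n_samples)) / n_samples
    return (plus - minus) / (2 * eps)

grad = eot_gradient(1.0)  # E[(x + z)^2] has gradient 2*x, so ~2 at x = 1
```

Averaging through the randomness is what defeats gradient masking: the attacker optimizes against the defense's expected behavior rather than one noisy draw.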

[LG-16] Segmenting Action-Value Functions Over Time-Scales in SARSA using TD(Delta)

链接: https://arxiv.org/abs/2411.14783
作者: Mahammad Humayoo
关键词-EN: numerous episodic reinforcement, enhance policies aimed, episodic reinforcement learning, SARSA-based methodologies, long horizons
类目: Machine Learning (cs.LG)
*备注: 17 pages. arXiv admin note: text overlap with arXiv:2411.14019

点击查看摘要

Abstract:In numerous episodic reinforcement learning (RL) settings, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Conventional SARSA algorithms, however, have difficulties in balancing bias and variation due to the reliance on a singular, fixed discount factor. This study expands the temporal difference decomposition approach, TD(Δ), to the SARSA algorithm. SARSA, a widely utilised on-policy RL method, enhances action-value functions via temporal difference updates. TD(Δ) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors. This decomposition improves learning efficiency and stability, particularly in problems necessitating long-horizon optimization. We illustrate that our methodology mitigates bias in SARSA’s updates while facilitating accelerated convergence in contexts characterized by dense rewards. Experimental findings across many benchmark tasks indicate that the proposed SARSA(Δ) surpasses conventional TD learning methods in both tabular and deep RL contexts.
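
The decomposition can be checked on the simplest possible MDP — one state, one action, constant reward — where the true value Q_γ = r / (1 − γ) is known in closed form. The sketch below follows the general TD-decomposition idea of learning delta estimators W_z ≈ Q_{γ_z} − Q_{γ_{z−1}}; the learning rate, step count, and exact update form are illustrative choices, not the paper's algorithm:

```python
def td_delta_fixed_point(reward, gammas, lr=0.1, steps=10000):
    # One-state, one-action self-loop: the true value is reward / (1 - gamma).
    # W[0] estimates Q_{gamma_0}; W[z] estimates Q_{gamma_z} - Q_{gamma_{z-1}}.
    W = [0.0] * len(gammas)
    for _ in range(steps):
        q_prev = 0.0  # running estimate of Q_{gamma_{z-1}} = sum of W[0..z-1]
        for z, g in enumerate(gammas):
            old = W[z]
            if z == 0:
                delta = reward + g * old - old
            else:
                delta = (g - gammas[z - 1]) * q_prev + g * old - old
            W[z] += lr * delta
            q_prev += old
    return W

W = td_delta_fixed_point(reward=1.0, gammas=[0.9, 0.99])
q_full = sum(W)  # should approach 1 / (1 - 0.99) = 100
```

Each component is bootstrapped with its own, smaller effective discount, which is where the bias/variance benefit over a single high-γ estimator comes from.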

[LG-17] An Attention-based Framework for Fair Contrastive Learning

链接: https://arxiv.org/abs/2411.14765
作者: Stefan K. Nielsen,Tan M. Nguyen
关键词-EN: high-dimensional sensitive information, complex environments characterized, sensitive information, proven instrumental, complex environments
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive learning has proven instrumental in learning unbiased representations of data, especially in complex environments characterized by high-cardinality and high-dimensional sensitive information. However, existing approaches within this setting require predefined modelling assumptions of bias-causing interactions that limit the model’s ability to learn debiased representations. In this work, we propose a new method for fair contrastive learning that employs an attention mechanism to model bias-causing interactions, enabling the learning of a fairer and semantically richer embedding space. In particular, our attention mechanism avoids bias-causing samples that confound the model and focuses on bias-reducing samples that help learn semantically meaningful representations. We verify the advantages of our method against existing baselines in fair contrastive learning and show that our approach can significantly boost bias removal from learned representations without compromising downstream accuracy.

[LG-18] FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient Fast and Efficient Transformer Acceleration

链接: https://arxiv.org/abs/2411.14733
作者: Donghyeon Yi,Seoyoung Lee,Jongho Kim,Junyoung Kim,Sohmyung Ha,Ik Joon Chang,Minkyu Je
关键词-EN: revolutionized machine learning, powered by self-attention, self-attention layers, context-aware representations, Encoder-based transformers
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Encoder-based transformers, powered by self-attention layers, have revolutionized machine learning with their context-aware representations. However, their quadratic growth in computational and memory demands presents significant bottlenecks. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) architectures address these challenges by enabling efficient on-chip processing. Traditionally, AMS-PiM relies on Quantization-Aware Training (QAT), which is hardware-efficient but requires extensive retraining to adapt models to AMS-PiMs, making it increasingly impractical for transformer models. Post-Training Quantization (PTQ) mitigates this training overhead but introduces significant hardware inefficiencies. PTQ relies on dequantization-quantization (DQ-Q) processes, floating-point units (FPUs), and high-ENOB (Effective Number of Bits) analog-to-digital converters (ADCs). In particular, high-ENOB ADCs scale exponentially in area and energy (2^ENOB), reduce sensing margins, and increase susceptibility to process, voltage, and temperature (PVT) variations, further compounding PTQ’s challenges in AMS-PiM systems. To overcome these limitations, we propose RAP, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique. Using the proposed techniques, RAP improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability. Experimental results demonstrate that RAP outperforms state-of-the-art GPUs and conventional PiM architectures in energy efficiency, latency, and accuracy, making it a scalable solution for the efficient deployment of transformers.

[LG-19] K-GBS3FCM – KNN Graph-Based Safe Semi-Supervised Fuzzy C-Means

链接: https://arxiv.org/abs/2411.14728
作者: Gabriel Santos,Rita Julia,Marcelo Nascimento
关键词-EN: partially labeled set, widely investigated, recently been widely, Clustering, prior domain knowledge
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Clustering data using prior domain knowledge, starting from a partially labeled set, has recently been widely investigated. Often referred to as semi-supervised clustering, this approach leverages labeled data to enhance clustering accuracy. To maximize algorithm performance, it is crucial to ensure the safety of this prior knowledge. Methods addressing this concern are termed safe semi-supervised clustering (S3C) algorithms. This paper introduces the KNN graph-based safety-aware semi-supervised fuzzy c-means algorithm (K-GBS3FCM), which dynamically assesses neighborhood relationships between labeled and unlabeled data using the K-Nearest Neighbors (KNN) algorithm. This approach aims to optimize the use of labeled data while minimizing the adverse effects of incorrect labels. Additionally, it is proposed a mechanism that adjusts the influence of labeled data on unlabeled ones through regularization parameters and the average safety degree. Experimental results on multiple benchmark datasets demonstrate that the graph-based approach effectively leverages prior knowledge to enhance clustering accuracy. The proposed method was significantly superior in 64% of the 56 test configurations, obtaining higher levels of clustering accuracy when compared to other semi-supervised and traditional unsupervised methods. This research highlights the potential of integrating graph-based approaches, such as KNN, with established techniques to develop advanced clustering algorithms, offering significant applications in fields that rely on both labeled and unlabeled data for more effective clustering.

[LG-20] Attributed Graph Clustering via Generalized Quaternion Representation Learning

链接: https://arxiv.org/abs/2411.14727
作者: Junyang Chen,Yiqun Zhang,Mengke Li,Yang Lu,Yiu-ming Cheung
关键词-EN: attracted increasing attention, accurate cluster analysis, Graph Convolutional Network, Clustering complex data, increasing attention
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering complex data in the form of attributed graphs has attracted increasing attention, where appropriate graph representation is a critical prerequisite for accurate cluster analysis. However, the Graph Convolutional Network will homogenize the representation of graph nodes due to the well-known over-smoothing effect. This limits the network architecture to a shallow one, losing the ability to capture the critical global distribution information for clustering. Therefore, we propose a generalized graph auto-encoder network, which introduces quaternion operations to the encoders to achieve efficient structured feature representation learning without incurring deeper network and larger-scale parameters. The generalization of our method lies in the following two aspects: 1) connecting the quaternion operation naturally suitable for four feature components with graph data of arbitrary attribute dimensions, and 2) introducing a generalized graph clustering objective as a loss term to obtain clustering-friendly representations without requiring a pre-specified number of clusters k. It turns out that the representations of nodes learned by the proposed Graph Clustering based on Generalized Quaternion representation learning (GCGQ) are more discriminative, containing global distribution information, and are more general, suiting downstream clustering under different values of k. Extensive experiments including significance tests, ablation studies, and qualitative results, illustrate the superiority of GCGQ. The source code is temporarily available at this https URL.
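
The quaternion operation at the heart of such encoders is the Hamilton product, which entangles four feature components with shared parameters (a quaternion "linear map" uses 4 weights where a real 4×4 block needs 16, which is how these layers avoid larger-scale parameters). A minimal sketch of the product itself:

```python
def hamilton_product(p, q):
    # p and q are quaternions (a, b, c, d) representing a + b*i + c*j + d*k.
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
ij = hamilton_product(i, j)  # i * j = k: every output mixes all four inputs
```

Non-commutativity (i·j = k but j·i = −k) is what lets a stack of quaternion layers represent richer interactions than four independent real channels.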

[LG-21] Enhancing Molecular Design through Graph-based Topological Reinforcement Learning

链接: https://arxiv.org/abs/2411.14726
作者: Xiangyu Zhang
关键词-EN: reinforcement learning, Topological Reinforcement Learning, drug-like molecules, molecules is crucial, Graph-based Topological Reinforcement
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:The generation of drug-like molecules is crucial for drug design. Existing reinforcement learning (RL) methods often overlook structural information. However, feature engineering-based methods usually merely focus on binding affinity prediction without substantial molecular modification. To address this, we present Graph-based Topological Reinforcement Learning (GraphTRL), which integrates both chemical and structural data for improved molecular generation. GraphTRL leverages multiscale weighted colored graphs (MWCG) and persistent homology, combined with molecular fingerprints, as the state space for RL. Evaluations show that GraphTRL outperforms existing methods in binding affinity prediction, offering a promising approach to accelerate drug discovery.

[LG-22] Can GNNs Learn Link Heuristics? A Concise Review and Evaluation of Link Prediction Methods

链接: https://arxiv.org/abs/2411.14711
作者: Shuming Liang,Yu Ding,Zhidong Li,Bin Liang,Siqi Zhang,Yang Wang,Fang Chen
关键词-EN: Graph Neural Networks, Neural Networks, link prediction, Graph Neural, link prediction methods
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the ability of Graph Neural Networks (GNNs) in learning various forms of information for link prediction, alongside a brief review of existing link prediction methods. Our analysis reveals that GNNs cannot effectively learn structural information related to the number of common neighbors between two nodes, primarily due to the nature of set-based pooling of the neighborhood aggregation scheme. Also, our extensive experiments indicate that trainable node embeddings can improve the performance of GNN-based link prediction models. Importantly, we observe that the denser the graph, the greater the improvement. We attribute this to the characteristics of node embeddings, where the link state of each link sample could be encoded into the embeddings of nodes that are involved in the neighborhood aggregation of the two nodes in that link sample. In denser graphs, every node could have more opportunities to take part in the neighborhood aggregation of other nodes and encode states of more link samples to its embedding, thus learning better node embeddings for link prediction. Lastly, we demonstrate that the insights gained from our research carry important implications in identifying the limitations of existing link prediction methods, which could guide the future development of more robust algorithms.
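
The structural heuristic that set-based neighborhood pooling struggles to recover is simply the common-neighbor count. As a plain-Python baseline on a toy graph (the graph is illustrative):

```python
def common_neighbors_score(adj, u, v):
    # Classic link-prediction heuristic: the more neighbors two nodes
    # share, the more likely an (unobserved) edge between them.
    return len(adj[u] & adj[v])

# Small undirected graph as an adjacency dict of neighbor sets.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}
score_13 = common_neighbors_score(adj, 1, 3)  # nodes 1 and 3 share {0, 2}
```

The paper's point is that counting this intersection requires tracking multiplicities across two neighborhoods, information that sum/mean/max pooling over each node's neighbor set discards.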

[LG-23] A Data-Driven Pool Strategy for Price-Makers Under Imperfect Information

链接: https://arxiv.org/abs/2411.14694
作者: Kedi Zheng,Hongye Guo,Qixin Chen
关键词-EN: paper studies, studies the pool, pool strategy, imperfect information, system
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Paper accepted for IEEE Transactions on Power Systems. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

点击查看摘要

Abstract:This paper studies the pool strategy for price-makers under imperfect information. In this occasion, market participants cannot obtain essential transmission parameters of the power system. Thus, price-makers should estimate the market results with respect to their offer curves using available historical information. The linear programming model of economic dispatch is analyzed with the theory of rim multi-parametric linear programming (rim-MPLP). The characteristics of system patterns (combinations of status flags for generating units and transmission lines) are revealed. A multi-class classification model based on support vector machine (SVM) is trained to map the offer curves to system patterns, which is then integrated into the decision framework of the price-maker. The performance of the proposed method is validated on the IEEE 30-bus system, Illinois synthetic 200-bus system, and South Carolina synthetic 500-bus system.

[LG-24] EV-PINN: A Physics-Informed Neural Network for Predicting Electric Vehicle Dynamics ICRA

链接: https://arxiv.org/abs/2411.14691
作者: Hansol Lim,Jee Won Lee,Jonathan Boyack,Jongseong Brad Choi
关键词-EN: enables accurate path, accurate path planning, Neural Network approach, Physics-Informed Neural Network, enables accurate
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the 2025 IEEE International Conference on Robotics and Automation (ICRA) for possible publication

点击查看摘要

Abstract:An onboard prediction of dynamic parameters (e.g. Aerodynamic drag, rolling resistance) enables accurate path planning for EVs. This paper presents EV-PINN, a Physics-Informed Neural Network approach in predicting instantaneous battery power and cumulative energy consumption during cruising while generalizing to the nonlinear dynamics of an EV. Our method learns real-world parameters such as motor efficiency, regenerative braking efficiency, vehicle mass, coefficient of aerodynamic drag, and coefficient of rolling resistance using automatic differentiation based on dynamics and ensures consistency with ground truth vehicle data. EV-PINN was validated using 15 and 35 minutes of in-situ battery log data from the Tesla Model 3 Long Range and Tesla Model S, respectively. With only vehicle speed and time as inputs, our model achieves high accuracy and generalization to dynamics, with validation losses of 0.002195 and 0.002292, respectively. This demonstrates EV-PINN’s effectiveness in estimating parameters and predicting battery usage under actual driving conditions without the need for additional sensors.

[LG-25] Self-Supervised Learning for Ordered Three-Dimensional Structures

链接: https://arxiv.org/abs/2411.14680
作者: Matthew Spellings,Maya Martirossyan,Julia Dshemuchadse
关键词-EN: training large language, large language models, Recent work, powerful idea, enabling the creation
类目: Machine Learning (cs.LG)
*备注: Version as submitted to the Learning on Graphs Conference 2022, with small clarifying edits

点击查看摘要

Abstract:Recent work has proven that training large language models with self-supervised tasks and fine-tuning these models to complete new tasks in a transfer learning setting is a powerful idea, enabling the creation of models with many parameters, even with little labeled data; however, the number of domains that have harnessed these advancements has been limited. In this work, we formulate a set of geometric tasks suitable for the large-scale study of ordered three-dimensional structures, without requiring any human intervention in data labeling. We build deep rotation- and permutation-equivariant neural networks based on geometric algebra and use them to solve these tasks on both idealized and simulated three-dimensional structures. Quantifying order in complex-structured assemblies remains a long-standing challenge in materials physics; these models can elucidate the behavior of real self-assembling systems in a variety of ways, from distilling insights from learned tasks without further modification to solving new tasks with smaller amounts of labeled data via transfer learning.

[LG-26] Recursive Gaussian Process State Space Model

链接: https://arxiv.org/abs/2411.14679
作者: Tengjie Zheng,Lin Cheng,Shengping Gong,Xu Huang
关键词-EN: advancing principle discovery, holds great promise, time-series prediction, principle discovery, controller design
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. However, for online learning, the field lacks an efficient method suitable for scenarios where prior information regarding data distribution and model function is limited. To address this issue, this paper proposes a recursive GPSSM method with adaptive capabilities for both operating domains and Gaussian process (GP) hyperparameters. Specifically, we first utilize first-order linearization to derive a Bayesian update equation for the joint distribution between the system state and the GP model, enabling closed-form and domain-independent learning. Second, an online selection algorithm for inducing points is developed based on informative criteria to achieve lightweight learning. Third, to support online hyperparameter optimization, we recover historical measurement information from the current filtering distribution. Comprehensive evaluations on both synthetic and real-world datasets demonstrate the superior accuracy, computational efficiency, and adaptability of our method compared to state-of-the-art online GPSSM techniques.

[LG-27] Brain-Computer Interfaces for Emotional Regulation in Patients with Various Disorders

链接: https://arxiv.org/abs/2411.14666
作者: Vedant Mehta
关键词-EN: impact emotional regulation, Physiological Disorders, Neurological and Physiological, unique characteristics, important to understand
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neurological and Physiological Disorders that impact emotional regulation each have their own unique characteristics which are important to understand in order to create a generalized solution to all of them. The purpose of this experiment is to explore the potential applications of EEG-based Brain-Computer Interfaces (BCIs) in enhancing emotional regulation for individuals with neurological and physiological disorders. The research focuses on the development of a novel neural network algorithm for understanding EEG data, with a particular emphasis on recognizing and regulating emotional states. The procedure involves the collection of EEG-based emotion data from OpenNeuro. Using novel data modification techniques, information from the dataset can be altered to create a dataset that has neural patterns of patients with disorders whilst showing emotional change. The data analysis reveals promising results, as the algorithm is able to successfully classify emotional states with a high degree of accuracy. This suggests that EEG-based BCIs have the potential to be a valuable tool in aiding individuals with a range of neurological and physiological disorders in recognizing and regulating their emotions. To improve upon this work, data should be collected from patients with neurological disorders to increase overall sample diversity.

[LG-28] Multiset Transformer: Advancing Representation Learning in Persistence Diagrams

链接: https://arxiv.org/abs/2411.14662
作者: Minghua Wang,Ziyun Huang,Jinhui Xu
关键词-EN: propose Multiset Transformer, Multiset Transformer, diagram representation learning, Set Transformer, propose Multiset
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To improve persistence diagram representation learning, we propose Multiset Transformer. This is the first neural network that utilizes attention mechanisms specifically designed for multisets as inputs and offers rigorous theoretical guarantees of permutation invariance. The architecture integrates multiset-enhanced attentions with a pool-decomposition scheme, allowing multiplicities to be preserved across equivariant layers. This capability enables full leverage of multiplicities while significantly reducing both computational and spatial complexity compared to the Set Transformer. Additionally, our method can greatly benefit from clustering as a preprocessing step to further minimize complexity, an advantage not possessed by the Set Transformer. Experimental results demonstrate that the Multiset Transformer outperforms existing neural network methods in the realm of persistence diagram representation learning.

[LG-29] VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space

链接: https://arxiv.org/abs/2411.14642
作者: Armani Rodriguez,Silvija Kokalj-Filipovic
关键词-EN: Generating high-quality speech, high-quality speech efficiently, speech efficiently remains, Generating high-quality, efficiently remains
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Generating high-quality speech efficiently remains a key challenge for generative models in speech synthesis. This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability. Leveraging the AudioMNIST dataset, consisting of human utterances of decimal digits (0-9), our method employs a two-step architecture: first, a scalable vector quantized autoencoder (VQ-VAE) that compresses audio spectrograms into discrete latent representations, and second, a decoder-only transformer that learns the probability model of these latents. The trained transformer generates similar latent sequences, convertible to audio spectrograms by the VQ-VAE decoder, from which we generate fake utterances. Interpreting the statistical and perceptual quality of the fakes, depending on the dimension and the extrinsic information of the latent space, enables guided improvements in larger, commercial generative models. As a valuable tool for understanding and refining audio synthesis, our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources, while the modularity and transparency of the training pipeline helps easily correlate the analytics with modular modifications, hence providing insights for the more complex models.
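The discrete bottleneck at the heart of a VQ-VAE can be sketched as a nearest-neighbour codebook lookup; the sizes and data below are made up, and the paper's decoder-only transformer would then model the resulting token sequences autoregressively:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))      # 64 codes, 8-dim latents (sizes are illustrative)

def quantize(z, codebook):
    """Map each latent row of z to its nearest codebook entry (L2 distance)."""
    # Pairwise squared distances between latents and codes.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)              # discrete token per latent
    return idx, codebook[idx]

z = rng.normal(size=(10, 8))             # e.g. encoder outputs for 10 spectrogram patches
tokens, z_q = quantize(z, codebook)
```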

[LG-30] Active Learning-Based Optimization of Hydroelectric Turbine Startup to Minimize Fatigue Damage

链接: https://arxiv.org/abs/2411.14618
作者: Vincent Mai,Quang Hung Pham,Arthur Favrel,Jean-Philippe Gauthier,Martin Gagnon
关键词-EN: integrating intermittent renewable, intermittent renewable energy, renewable energy sources, power grid due, Hydro-generating units
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Hydro-generating units (HGUs) play a crucial role in integrating intermittent renewable energy sources into the power grid due to their flexible operational capabilities. This evolving role has led to an increase in transient events, such as startups, which impose significant stresses on turbines, leading to increased turbine fatigue and a reduced operational lifespan. Consequently, optimizing startup sequences to minimize stresses is vital for hydropower utilities. However, this task is challenging, as stress measurements on prototypes can be expensive and time-consuming. To tackle this challenge, we propose an innovative automated approach to optimize the startup parameters of HGUs with a limited budget of measured startup sequences. Our method combines active learning and black-box optimization techniques, utilizing virtual strain sensors and dynamic simulations of HGUs. This approach was tested in real-time during an on-site measurement campaign on an instrumented Francis turbine prototype. The results demonstrate that our algorithm successfully identified an optimal startup sequence using only seven measured sequences. It achieves a remarkable 42% reduction in the maximum strain cycle amplitude compared to the standard startup sequence. This study paves the way for more efficient HGU startup optimization, potentially extending their operational lifespans.

[LG-31] CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs

链接: https://arxiv.org/abs/2411.14611
作者: Alex Mathai,Kranthi Sedamaki,Debeshee Das,Noble Saji Mathews,Srikanth Tamilselvam,Sridhar Chimalakonda,Atul Kumar
关键词-EN: Machine Learning, gained prominence due, software engineering, Abstract Syntax Tree, gained prominence
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to infuse arbitrary code-views into existing transformer-based models effectively. Therefore, in this work, we propose CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that by using this technique, we improve downstream performance when compared to SLMs like GraphCodeBERT and CodeBERT on all three tasks by utilizing individual code-views or a combination of code-views during fine-tuning. We believe that these results are indicative that techniques like CodeSAM can help create compact yet performant code SLMs that fit in resource constrained settings.
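The core mechanism, infusing a code-view graph into self-attention, can be sketched as building an additive attention mask from an adjacency matrix. The toy "AST" and the masking convention below are illustrative assumptions, not CodeSAM's exact scheme:

```python
import numpy as np

def codeview_attention_mask(adjacency):
    """Allow attention along (symmetrized) graph edges plus self-attention.

    Returns an additive mask: 0 where attention is allowed, -inf where it is blocked,
    to be added to attention logits before the softmax."""
    n = adjacency.shape[0]
    allowed = adjacency | adjacency.T | np.eye(n, dtype=bool)
    return np.where(allowed, 0.0, -np.inf)

# Toy 4-token "AST": token 0 is the parent of tokens 1-3.
adj = np.zeros((4, 4), dtype=bool)
adj[0, 1] = adj[0, 2] = adj[0, 3] = True
mask = codeview_attention_mask(adj)
```

Masks from several code views (AST, DFG, CFG) could then be combined per attention head.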

[LG-32] Efficient Spatio-Temporal Signal Recognition on Edge Devices Using PointLCA-Net

链接: https://arxiv.org/abs/2411.14585
作者: Sanaz Mahmoodi Takaghaj,Jack Sampson
关键词-EN: Recent advancements, point clouds, point clouds provide, significantly improving, segmentation tasks
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: arXiv admin note: text overlap with arXiv:2411.00140

点击查看摘要

Abstract:Recent advancements in machine learning, particularly through deep learning architectures like PointNet, have transformed the processing of three-dimensional (3D) point clouds, significantly improving 3D object classification and segmentation tasks. While 3D point clouds provide detailed spatial information, spatio-temporal signals introduce a dynamic element that accounts for changes over time. However, applying deep learning techniques to spatio-temporal signals and deploying them on edge devices presents challenges, including real-time processing, memory capacity, and power consumption. To address these issues, this paper presents a novel approach that combines PointNet’s feature extraction with the in-memory computing capabilities and energy efficiency of neuromorphic systems for spatio-temporal signal recognition. The proposed method consists of a two-stage process: in the first stage, PointNet extracts features from the spatio-temporal signals, which are then stored in non-volatile memristor crossbar arrays. In the second stage, these features are processed by a single-layer spiking neural encoder-decoder that employs the Locally Competitive Algorithm (LCA) for efficient encoding and classification. This work integrates the strengths of both PointNet and LCA, enhancing computational efficiency and energy performance on edge devices. PointLCA-Net achieves high recognition accuracy for spatio-temporal data with substantially lower energy burden during both inference and training than comparable approaches, thus advancing the deployment of advanced neural architectures in energy-constrained environments.

[LG-33] Variable Extraction for Model Recovery in Scientific Literature

链接: https://arxiv.org/abs/2411.14569
作者: Chunwei Liu,Enrique Noriega-Atala,Adarsh Pyarelal,Clayton T Morrison,Mike Cafarella
关键词-EN: academic publications exceeds, publications exceeds, million articles, articles per year, making it difficult
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The global output of academic publications exceeds 5 million articles per year, making it difficult for humans to keep up with even a tiny fraction of scientific output. We need methods to navigate and interpret the artifacts – texts, graphs, charts, code, models, and datasets – that make up the literature. This paper evaluates various methods for extracting mathematical model variables from epidemiological studies, such as "infection rate ( \alpha )," "recovery rate ( \gamma )," and "mortality rate ( \mu )." Variable extraction appears to be a basic task, but plays a pivotal role in recovering models from scientific literature. Once extracted, we can use these variables for automatic mathematical modeling, simulation, and replication of published results. We introduce a benchmark dataset comprising manually-annotated variable descriptions and variable values extracted from scientific papers. Based on this dataset, we present several baseline methods for variable extraction based on Large Language Models (LLMs) and rule-based information extraction systems. Our analysis shows that LLM-based solutions perform the best. Despite the incremental benefits of combining rule-based extraction outputs with LLMs, the leap in performance attributed to the transfer-learning and instruction-tuning capabilities of LLMs themselves is far more significant. This investigation demonstrates the potential of LLMs to enhance automatic comprehension of scientific artifacts and for automatic model recovery and simulation.
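A minimal sketch of the rule-based side of variable extraction, assuming a simple "name (symbol)" mention pattern of our own (not the paper's extraction rules):

```python
import re

# Matches mentions like "infection rate (alpha)": one word, "rate", a parenthesized symbol.
PATTERN = re.compile(r"(\w+\s+rate)\s*\(\s*([^)\s]+)\s*\)")

def extract_rate_variables(text):
    """Return (description, symbol) pairs for 'xxx rate (symbol)' mentions."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

sentence = ("The model couples the infection rate (alpha) with the "
            "recovery rate (gamma) and the mortality rate (mu).")
pairs = extract_rate_variables(sentence)
```

Real rule-based systems layer many such patterns; the paper's point is that LLM-based extraction outperforms them.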

[LG-34] Deep operator network models for predicting post-burn contraction

链接: https://arxiv.org/abs/2411.14555
作者: Selma Husanovic,Ginger Egberts,Alexander Heinlein,Fred Vermolen
关键词-EN: Burn injuries present, global health challenge, significant global health, Burn injuries, health challenge
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Tissues and Organs (q-bio.TO)
*备注:

点击查看摘要

Abstract:Burn injuries present a significant global health challenge. Among the most severe long-term consequences are contractures, which can lead to functional impairments and disfigurement. Understanding and predicting the evolution of post-burn wounds is essential for developing effective treatment strategies. Traditional mathematical models, while accurate, are often computationally expensive and time-consuming, limiting their practical application. Recent advancements in machine learning, particularly in deep learning, offer promising alternatives for accelerating these predictions. This study explores the use of a deep operator network (DeepONet), a type of neural operator, as a surrogate model for finite element simulations, aimed at predicting post-burn contraction across multiple wound shapes. A DeepONet was trained on three distinct initial wound shapes, with enhancements made to the architecture by incorporating initial wound shape information and applying sine augmentation to enforce boundary conditions. The performance of the trained DeepONet was evaluated on a test set including finite element simulations based on convex combinations of the three basic wound shapes. The model achieved an R^2 score of 0.99, indicating strong predictive accuracy and generalization. Moreover, the model provided reliable predictions over an extended period of up to one year, with speedups of up to 128-fold on CPU and 235-fold on GPU, compared to the numerical model. These findings suggest that DeepONets can effectively serve as a surrogate for traditional finite element methods in simulating post-burn wound evolution, with potential applications in medical treatment planning.
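The branch/trunk structure of a DeepONet can be sketched in a few lines: the branch net encodes the input function sampled at sensor points, the trunk net encodes a query coordinate, and the output is their inner product. Weights are random and sizes illustrative; this is not the paper's trained surrogate:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP returning a forward function (tanh hidden layers)."""
    params = [(rng.normal(size=(a, b)) / np.sqrt(a), np.zeros(b))
              for a, b in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for i, (W, b) in enumerate(params):
            x = x @ W + b
            if i < len(params) - 1:
                x = np.tanh(x)
        return x
    return forward

branch = mlp([32, 64, 16])   # encodes u sampled at 32 sensor points
trunk = mlp([2, 64, 16])     # encodes a query location (x, t)

def deeponet(u_sensors, y_query):
    # G(u)(y) ~ <branch(u), trunk(y)>: one dot product per query point
    return branch(u_sensors) @ trunk(y_query).T

u = rng.normal(size=(1, 32))       # one input function (e.g. an initial wound-shape field)
y = rng.uniform(size=(5, 2))       # five space-time query points
out = deeponet(u, y)               # shape (1, 5)
```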

[LG-35] Detecting Distributed Denial of Service Attacks Using Logistic Regression and SVM Methods

链接: https://arxiv.org/abs/2411.14512
作者: Mohammad Arafat Ullah,Arthy Anjum,Rashedul Amin Tuhin,Shamim Akhter
关键词-EN: multiple remotely controlled, remotely controlled malware-infected, controlled malware-infected computers, produce humongous traffic, requests ceaselessly coming
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A distributed denial-of-service (DDoS) attack is an attempt to produce humongous traffic within a network by overwhelming a targeted server or its neighboring infrastructure with a flood of service requests ceaselessly coming from multiple remotely controlled malware-infected computers or network-connected devices. Thus, exploring DDoS attacks by recognizing their functionalities and differentiating them from normal traffic services is a primary concern of network security, particularly for online businesses. In modern networks, most DDoS attacks occur in the network and application layers, including HTTP flood, UDP flood, SIDDOS, SMURF, SNMP flood, IP NULL, etc. The goal of this paper is to detect DDoS attacks from all service requests and classify them according to DDoS classes. In this regard, a standard dataset is collected from the internet which contains several network-related attributes and their corresponding DDoS attack class name. Two different machine learning approaches, SVM and Logistic Regression, are implemented on the dataset for detecting and classifying DDoS attacks, and a comparative study is conducted between them in terms of accuracy, precision, and recall rates. Logistic Regression and SVM both achieve 98.65% classification accuracy, which is the highest accuracy achieved among previous experiments with the same dataset.
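A minimal sketch of one of the two classifiers, logistic regression trained by gradient descent on synthetic two-class "traffic" features (the paper uses a real DDoS dataset and also evaluates SVM):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for flow features (e.g. packet rate, byte count): two shifted Gaussians.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),      # normal traffic
               rng.normal(3.0, 1.0, size=(200, 2))])     # DDoS-like traffic
y = np.concatenate([np.zeros(200), np.ones(200)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Batch gradient descent on the logistic loss.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# The three metrics the paper reports: accuracy, precision, recall.
pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
tp = ((pred == 1) & (y == 1)).sum()
precision = tp / pred.sum()
recall = tp / y.sum()
accuracy = (pred == y).mean()
```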

[LG-36] End-to-End Convolutional Activation Anomaly Analysis for Anomaly Detection

链接: https://arxiv.org/abs/2411.14509
作者: Aleksander Kozłowski,Daniel Ponikowski,Piotr Żukiewicz,Paweł Twardowski
关键词-EN: Schulze and Böttinger, Activation Anomaly Analysis, proposed by Sperl, Convolutional Activation Anomaly, Anomaly Analysis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose an End-to-end Convolutional Activation Anomaly Analysis (E2E-CA ^3 ), which is a significant extension of the A ^3 anomaly detection approach proposed by Sperl, Schulze and Böttinger, both in terms of architecture and scope of application. In contrast to the original idea, we utilize a convolutional autoencoder as a target network, which allows for natural application of the method to both image and tabular data. The alarm network is also designed as a CNN, where the activations of convolutional layers from the CAE are stacked together into a (k+1) -dimensional tensor. Moreover, we combine the classification loss of the alarm network with the reconstruction error of the target CAE, as a "best of both worlds" approach, which greatly increases the versatility of the network. The evaluation shows that despite a generally straightforward and lightweight architecture, it has very promising anomaly detection performance on common datasets such as MNIST, CIFAR-10 and KDDcup99.

[LG-37] Why you don't overfit and don't need Bayes if you only train for one epoch

链接: https://arxiv.org/abs/2411.14478
作者: Laurence Aitchison
关键词-EN: data generating process, true data generating, test loss, generating process, standard maximum likelihood
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Here, we show that in the data-rich setting where you only train on each datapoint once (or equivalently, you only train for one epoch), standard “maximum likelihood” training optimizes the true data generating process (DGP) loss, which is equivalent to the test loss. Further, we show that the Bayesian model average optimizes the same objective, albeit while taking the expectation over uncertainty induced by finite data. As standard maximum likelihood training in the single-epoch setting optimizes the same objective as Bayesian inference, we argue that we do not expect Bayesian inference to offer any advantages in terms of overfitting or calibration in these settings. This explains the diminishing importance of Bayes in areas such as LLMs, which are often trained with one (or very few) epochs.
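The argument can be illustrated with a toy linear-Gaussian DGP: when every minibatch is fresh, the training stream and a held-out test set estimate the same expected loss, so the two losses coincide (the DGP and hyperparameters below are our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W_TRUE = np.array([2.0, -1.0, 0.5])

def sample_batch(n):
    """Draw fresh data from the true data generating process (DGP)."""
    X = rng.normal(size=(n, 3))
    y = X @ W_TRUE + rng.normal(0.0, 0.1, size=n)
    return X, y

w = np.zeros(3)
for _ in range(300):                       # one "epoch": every batch is seen exactly once
    X, y = sample_batch(32)
    grad = 2 * X.T @ (X @ w - y) / len(y)  # squared-error gradient
    w -= 0.05 * grad

# Train-stream loss vs held-out test loss: both estimate the same DGP loss,
# so there is no train/test gap to overfit into.
Xtr, ytr = sample_batch(5000)
Xte, yte = sample_batch(5000)
train_loss = ((Xtr @ w - ytr) ** 2).mean()
test_loss = ((Xte @ w - yte) ** 2).mean()
```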

[LG-38] Activation Functions for “A Feedforward Unitary Equivariant Neural Network”

链接: https://arxiv.org/abs/2411.14462
作者: Pui-Wai Ma
关键词-EN: previous work, presented a feedforward, Chan, feedforward unitary equivariant, neural network
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In our previous work [Ma and Chan (2023)], we presented a feedforward unitary equivariant neural network. We proposed three distinct activation functions tailored for this network: a softsign function with a small residue, an identity function, and a Leaky ReLU function. While these functions demonstrated the desired equivariance properties, they limited the neural network’s architecture. This short paper generalises these activation functions to a single functional form. This functional form represents a broad class of functions, maintains unitary equivariance, and offers greater flexibility for the design of equivariant neural networks.

[LG-39] Guiding Reinforcement Learning Using Uncertainty-Aware Large Language Models

链接: https://arxiv.org/abs/2411.14457
作者: Maryam Shoaeinaeini,Brent Harrison
关键词-EN: large-scale applications due, reinforcement learning, time constraints, impractical for large-scale, large-scale applications
类目: Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Human guidance in reinforcement learning (RL) is often impractical for large-scale applications due to high costs and time constraints. Large Language Models (LLMs) offer a promising alternative to mitigate RL sample inefficiency and potentially replace human trainers. However, applying LLMs as RL trainers is challenging due to their overconfidence and less reliable solutions in sequential tasks. We address this limitation by introducing a calibrated guidance system that uses Monte Carlo Dropout to enhance LLM advice reliability by assessing prediction variances from multiple forward passes. Additionally, we develop a novel RL policy shaping method based on dynamic model average entropy to adjust the LLM’s influence on RL policies according to guidance uncertainty. This approach ensures robust RL training by relying on reliable LLM guidance. To validate our contributions, we conduct extensive experiments in a Minigrid environment with three goals in varying environment sizes. The results showcase superior model performance compared to uncalibrated LLMs, unguided RL, and calibrated LLMs with different shaping policies. Moreover, we analyze various uncertainty estimation methods, demonstrating the effectiveness of average entropy in reflecting higher uncertainty in incorrect guidance. These findings highlight the persistent overconfidence in fine-tuned LLMs and underscore the importance of effective calibration in sequential decision-making problems.
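The calibration mechanism, Monte Carlo Dropout, can be sketched with a tiny random-weight network: dropout stays active at inference, and the spread over repeated forward passes serves as an uncertainty proxy (architecture and sizes are illustrative; the paper applies this to LLM advice):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 1))

def forward(x, drop_p=0.5):
    """Tiny MLP with dropout kept *on* at inference (Monte Carlo Dropout)."""
    h = np.maximum(x @ W1, 0.0)
    mask = rng.random(h.shape) > drop_p   # fresh mask on every forward pass
    h = h * mask / (1.0 - drop_p)         # inverted dropout scaling
    return h @ W2

x = rng.normal(size=(1, 4))
samples = np.concatenate([forward(x) for _ in range(100)], axis=0)
mean, std = samples.mean(), samples.std()  # std acts as the uncertainty estimate
```

A guidance system could then downweight advice whose `std` (or entropy, as in the paper) is high.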

[LG-40] Linear convergence of proximal descent schemes on the Wasserstein space

链接: https://arxiv.org/abs/2411.15067
作者: Razvan-Andrei Lascu,Mateusz B. Majka,David Šiška,Łukasz Szpruch
关键词-EN: proximal descent methods, Kinderlehrer and Otto, investigate proximal descent, optimizing entropy-regularized functionals, introduced by Jordan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注: 28 pages

点击查看摘要

Abstract:We investigate proximal descent methods, inspired by the minimizing movement scheme introduced by Jordan, Kinderlehrer and Otto, for optimizing entropy-regularized functionals on the Wasserstein space. We establish linear convergence under flat convexity assumptions, thereby relaxing the common reliance on geodesic convexity. Our analysis circumvents the need for discrete-time adaptations of the Evolution Variational Inequality (EVI). Instead, we leverage a uniform logarithmic Sobolev inequality (LSI) and the entropy "sandwich" lemma, extending the analysis from arXiv:2201.10469 and arXiv:2202.01009. The major challenge in the proof via LSI is to show that the relative Fisher information I(\cdot|\pi) is well-defined at every step of the scheme. Since the relative entropy is not Wasserstein differentiable, we prove that along the scheme the iterates belong to a certain class of Sobolev regularity, and hence the relative entropy \operatorname{KL}(\cdot|\pi) has a unique Wasserstein sub-gradient, and that the relative Fisher information is indeed finite.
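For reference, the minimizing movement (JKO) scheme the abstract builds on takes the standard form below, with step size \tau and objective functional F (stated generically, not copied from this paper):

```latex
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\mathbb{R}^d)}
\left\{ F(\rho) + \frac{1}{2\tau}\, W_2^2(\rho, \rho_k) \right\}
```

The quadratic Wasserstein term plays the role of the proximal penalty in ordinary proximal descent.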

[LG-41] A New Way: Kronecker-Factored Approximate Curvature Deep Hedging and its Benefits

链接: https://arxiv.org/abs/2411.15002
作者: Tsogt-Ochir Enkhbayar
关键词-EN: Kronecker-Factored Approximate Curvature, Deep Hedging, Deep Hedging implementations, established Deep Hedging, Kronecker-Factored Approximate
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization. While recent literature has established Deep Hedging as a data-driven alternative to traditional risk management strategies, the computational burden of training neural networks with first-order methods remains a significant impediment to practical implementation. The proposed architecture couples Long Short-Term Memory (LSTM) networks with K-FAC second-order optimization, specifically addressing the challenges of sequential financial data and curvature estimation in recurrent networks. Empirical validation using simulated paths from a calibrated Heston stochastic volatility model demonstrates that the K-FAC implementation achieves marked improvements in convergence dynamics and hedging efficacy. The methodology yields a 78.3% reduction in transaction costs ( t = 56.88 , p < 0.001 ) and a 34.4% decrease in profit and loss (PL) variance compared to Adam optimization. Moreover, the K-FAC-enhanced model exhibits superior risk-adjusted performance with a Sharpe ratio of 0.0401, contrasting with -0.0025 for the baseline model. These results provide compelling evidence that second-order optimization methods can materially enhance the tractability of Deep Hedging implementations. The findings contribute to the growing literature on computational methods in quantitative finance while highlighting the potential for advanced optimization techniques to bridge the gap between theoretical frameworks and practical applications in financial markets.
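A quick numpy check of the Kronecker identity that makes K-FAC tractable: for symmetric factors and column-stacking vec, (A \otimes G)^{-1} vec(V) = vec(G^{-1} V A^{-1}), so preconditioning a layer's gradient only requires inverting the two small factors, never the full curvature matrix (sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    """Random symmetric positive-definite matrix (stand-in for a K-FAC factor)."""
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

A, G = spd(3), spd(4)        # small factors (input and gradient covariances in K-FAC)
V = rng.normal(size=(4, 3))  # a "gradient" shaped like the layer's weight matrix

# Cheap K-FAC-style preconditioning: invert the factors, not the Kronecker product.
precond_cheap = np.linalg.solve(G, V) @ np.linalg.inv(A)

# Reference: solve against the explicit (12 x 12) Kronecker product.
vecV = V.flatten(order="F")  # column-stacking vec
precond_full = np.linalg.solve(np.kron(A, G), vecV).reshape(4, 3, order="F")
```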

[LG-42] CardioLab: Laboratory Values Estimation and Monitoring from Electrocardiogram Signals – A Multimodal Deep Learning Approach ALT

链接: https://arxiv.org/abs/2411.14886
作者: Juan Miguel Lopez Alcaraz,Nils Strodthoff
关键词-EN: diagnosis and management, fundamental to medical, medical diagnosis, Laboratory, Background
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 7 pages, 1 figure, code under this https URL

点击查看摘要

Abstract:Background: Laboratory values are fundamental to medical diagnosis and management, but acquiring these values can be costly, invasive, and time-consuming. While electrocardiogram (ECG) patterns have been linked to certain laboratory abnormalities, the comprehensive modeling of these relationships remains underexplored. Methods: We utilize the MIMIC-IV dataset to develop multimodal deep-learning models to demonstrate the feasibility of estimating (real-time) and monitoring (predict at future intervals) laboratory value abnormalities from ECG waveforms, demographics, biometrics, and vital signs. Results: The models exhibit strong predictive performance, with statistically significant AUROC scores above 0.70 for 23 laboratory values in the estimation setting and up to 26 values in the monitoring setting. Most notably, the accurately predictable values encompass abnormalities across diverse physiological categories such as cardiac, renal, hematological, metabolic, immunological and coagulation. For example, in the estimation setting NTproBNP (353 pg/mL) reaches 0.882, whereas for monitoring at 30 minutes urea nitrogen (6 mg/dL) reaches 0.851, at 60 minutes creatinine (0.5 mg/dL) 0.85, and at 120 minutes hemoglobin (17.5 g/dL) 0.821. Conclusions: This study provides first evidence for the feasibility of using ECG data alongside clinical routine data for the real-time estimation and monitoring of laboratory value abnormalities, which could provide a non-invasive, cost-effective supplement to traditional laboratory testing, with strong implications for enhanced patient monitoring and early intervention. Further validation could facilitate their integration into routine clinical practice.

[LG-43] Iterative Reweighted Framework Based Algorithms for Sparse Linear Regression with Generalized Elastic Net Penalty

链接: https://arxiv.org/abs/2411.14875
作者: Yanyun Ding,Zhenghua Yao,Peili Li,Yunhai Xiao
关键词-EN: elastic net penalty, elastic net, generalized elastic net, elastic net model
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:The elastic net penalty is frequently employed in high-dimensional statistics for parameter regression and variable selection. It is particularly beneficial compared to lasso when the number of predictors greatly surpasses the number of observations. However, empirical evidence has shown that the \ell_q -norm penalty (where 0 < q < 1 ) often provides better regression than the \ell_1 -norm penalty, demonstrating enhanced robustness in various scenarios. In this paper, we explore a generalized elastic net model that employs an \ell_r -norm (where r \geq 1 ) in the loss function to accommodate various types of noise, and an \ell_q -norm (where 0 < q < 1 ) to replace the \ell_1 -norm in the elastic net penalty. Theoretically, we establish computable lower bounds for the nonzero entries of the generalized first-order stationary points of the proposed generalized elastic net model. For implementation, we develop two efficient algorithms based on the locally Lipschitz continuous \epsilon -approximation to the \ell_q -norm. The first algorithm employs an alternating direction method of multipliers (ADMM), while the second utilizes a proximal majorization-minimization method (PMM), whose subproblems are addressed using the semismooth Newton method (SSN). We also perform extensive numerical experiments with both simulated and real data, showing that both algorithms demonstrate superior performance. Notably, PMM-SSN is more efficient than ADMM, even though the latter offers a simpler implementation.

[LG-44] Bayesian dynamic mode decomposition for real-time ship motion digital twinning

链接: https://arxiv.org/abs/2411.14839
作者: Giorgio Palma,Andrea Serani,Kevin McTaggart,Shawn Aram,David W. Wundrow,David Drazen,Matteo Diez
关键词-EN: widely considered enablers, widely considered, considered enablers, enablers of groundbreaking, naval digital twins
类目: Applications (stat.AP); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Digital twins are widely considered enablers of groundbreaking changes in the development, operation, and maintenance of novel generations of products. They are meant to provide reliable and timely predictions to inform decisions along the entire product life cycle. One of their most interesting applications in the naval field is the digital twinning of ship performances in waves, a crucial aspect in design and operation safety. In this paper, a Bayesian extension of the Hankel dynamic mode decomposition method is proposed for ship motion nowcasting as a prediction tool for naval digital twins. The proposed algorithm meets all the requirements for formulations devoted to digital twinning, being able to adapt the resulting models with the data incoming from the physical system, using a limited amount of data, producing real-time predictions, and estimating their reliability. Results are presented and discussed for the course-keeping of the 5415M model in beam-quartering sea state 7 irregular waves at Fr = 0.33, using data from three different CFD solvers. The results show that predictions maintain good accuracy for up to five wave encounter periods, with the Bayesian formulation improving the deterministic forecasts. In addition, a connection between the predicted uncertainty and prediction accuracy is found.
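Hankel 动态模态分解的确定性内核(对测得的运动信号做时延嵌入、用最小二乘拟合一步线性算子、再向前迭代)几行代码即可示意;论文提出的贝叶斯扩展还会量化预测不确定性,这个玩具版本略去了这一点:

```python
import numpy as np

def hankel_dmd(x, delay, horizon):
    """Delay-embed a scalar time series, fit a one-step linear operator by
    least squares (the deterministic core of Hankel DMD), then forecast."""
    H = np.column_stack([x[i:i + delay] for i in range(len(x) - delay + 1)])
    X, Y = H[:, :-1], H[:, 1:]
    A = Y @ np.linalg.pinv(X)          # one-step operator in delay coordinates
    state = H[:, -1]
    preds = []
    for _ in range(horizon):
        state = A @ state
        preds.append(state[-1])        # newest entry is the forecast sample
    return np.array(preds)

# a noiseless sinusoid: delay embedding makes the dynamics exactly linear
t = np.arange(200) * 0.1
x = np.sin(t)
preds = hankel_dmd(x, delay=10, horizon=20)
true = np.sin(np.arange(200, 220) * 0.1)
```

对于无噪声正弦信号,时延坐标下存在精确的线性演化算子,因此前向预测与真实值几乎完全一致;真实船舶运动数据则需要论文中的贝叶斯化处理来估计可靠性。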

[LG-45] Cosmological Analysis with Calibrated Neural Quantile Estimation and Approximate Simulators

链接: https://arxiv.org/abs/2411.14748
作者: He Jia
关键词-EN: cosmological Large-Scale Structure, Neural Quantile Estimation, Large-Scale Structure, introduce Neural Quantile, computationally expensive high-fidelity
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 5+4 pages, 5+3 figures, to be submitted, comments are welcome

点击查看摘要

Abstract:A major challenge in extracting information from current and upcoming surveys of cosmological Large-Scale Structure (LSS) is the limited availability of computationally expensive high-fidelity simulations. We introduce Neural Quantile Estimation (NQE), a new Simulation-Based Inference (SBI) method that leverages a large number of approximate simulations for training and a small number of high-fidelity simulations for calibration. This approach guarantees an unbiased posterior and achieves near-optimal constraining power when the approximate simulations are reasonably accurate. As a proof of concept, we demonstrate that cosmological parameters can be inferred at field level from projected 2-dim dark matter density maps up to k_\mathrm{max} \sim 1.5\,h/\mathrm{Mpc} at z=0 by training on \sim10^4 Particle-Mesh (PM) simulations with transfer function correction and calibrating with \sim10^2 Particle-Particle (PP) simulations. The calibrated posteriors closely match those obtained by directly training on \sim10^4 expensive PP simulations, but at a fraction of the computational cost. Our method offers a practical and scalable framework for SBI of cosmological LSS, enabling precise inference across vast volumes and down to small scales.
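NQE 训练网络去预测后验分位数,其核心就是最小化 pinball(分位数)损失。下面是一个标量玩具版本(用朴素次梯度下降代替神经网络,纯属演示假设),可以看到估计量收敛到经验分位数:

```python
import numpy as np

def fit_quantile(y, tau, lr=0.05, steps=2000):
    """Scalar quantile estimate by subgradient descent on the pinball loss,
    the same objective a neural quantile estimator is trained with."""
    theta = 0.0
    for _ in range(steps):
        # subgradient of mean pinball loss: F_emp(theta) - tau
        grad = np.mean(np.where(y > theta, -tau, 1.0 - tau))
        theta -= lr * grad
    return theta

rng = np.random.default_rng(1)
y = rng.normal(size=20000)        # stand-in for posterior samples
q84 = fit_quantile(y, 0.84)       # should approach the 0.84 quantile (about 0.99)
```

在 NQE 中,这一损失以观测数据为条件、由网络对多个 \tau 同时拟合;论文的关键补充是再用少量高保真模拟对得到的分位数做校准。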

[LG-46] Exploring the Use of Machine Learning Weather Models in Data Assimilation

链接: https://arxiv.org/abs/2411.14677
作者: Xiaoxu Tian,Daniel Holdaway,Daryl Kleist
关键词-EN: attracted significant attention, improve weather forecasting, weather forecasting efficiency, GraphCast and NeuralGCM, efficiency and accuracy
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The use of machine learning (ML) models in meteorology has attracted significant attention for their potential to improve weather forecasting efficiency and accuracy. GraphCast and NeuralGCM, two promising ML-based weather models, are at the forefront of this innovation. However, their suitability for data assimilation (DA) systems, particularly for four-dimensional variational (4DVar) DA, remains under-explored. This study evaluates the tangent linear (TL) and adjoint (AD) models of both GraphCast and NeuralGCM to assess their viability for integration into a DA framework. We compare the TL/AD results of GraphCast and NeuralGCM with those of the Model for Prediction Across Scales - Atmosphere (MPAS-A), a well-established numerical weather prediction (NWP) model. The comparison focuses on the physical consistency and reliability of TL/AD responses to perturbations. While the adjoint results of both GraphCast and NeuralGCM show some similarity to those of MPAS-A, they also exhibit unphysical noise at various vertical levels, raising concerns about their robustness for operational DA systems. The implications of this study extend beyond 4DVar applications. Unphysical behavior and noise in ML-derived TL/AD models could lead to inaccurate error covariances and unreliable ensemble forecasts, potentially degrading the overall performance of ensemble-based DA systems, as well. Addressing these challenges is critical to ensuring that ML models, such as GraphCast and NeuralGCM, can be effectively integrated into operational DA systems, paving the way for more accurate and efficient weather predictions.
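评估 TL/AD 模型的两项标准检验可以在玩具非线性模型上演示。下面用 Lorenz-63 的一步前向 Euler 积分代替预报模式(纯属示意;GraphCast、NeuralGCM 的 TL/AD 由自动微分得到):切线性响应与有限差分比对,伴随模型用点积恒等式 <M dx, dy> = <dx, M^T dy> 验证:

```python
import numpy as np

def lorenz_step(x, dt=0.01, s=10.0, r=28.0, b=8.0 / 3.0):
    """One forward-Euler step of Lorenz-63, a toy stand-in for an NWP model."""
    f = np.array([s * (x[1] - x[0]),
                  x[0] * (r - x[2]) - x[1],
                  x[0] * x[1] - b * x[2]])
    return x + dt * f

def step_jacobian(x, dt=0.01, s=10.0, r=28.0, b=8.0 / 3.0):
    """Jacobian of lorenz_step: the TL model is J @ dx, the AD model J.T @ dy."""
    J = np.array([[-s, s, 0.0],
                  [r - x[2], -1.0, -x[0]],
                  [x[1], x[0], -b]])
    return np.eye(3) + dt * J

x0 = np.array([1.0, 1.0, 1.0])
M = step_jacobian(x0)

# tangent-linear check: TL response matches the nonlinear difference
dx = 1e-6 * np.array([1.0, -0.5, 0.25])
fd = lorenz_step(x0 + dx) - lorenz_step(x0)
tl = M @ dx

# adjoint (dot-product) check: <M dx, dy> == <dx, M^T dy>
dy = np.array([0.3, -0.2, 0.5])
lhs = (M @ dx) @ dy
rhs = dx @ (M.T @ dy)
```

论文关注的正是这些响应在物理上是否合理:点积恒等式在数值上总能通过,而 ML 模式的 TL/AD 响应仍可能在垂直层上出现非物理噪声。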

[LG-47] Double Machine Learning for Adaptive Causal Representation in High-Dimensional Data

链接: https://arxiv.org/abs/2411.14665
作者: Lynda Aouar,Han Yu
关键词-EN: estimating equation framework, semiparametric estimating equation, Adaptive causal representation, Adaptive causal, efficient sample splitting
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Adaptive causal representation learning from observational data is presented, integrated with an efficient sample splitting technique within the semiparametric estimating equation framework. The support points sample splitting (SPSS), a subsampling method based on energy distance, is employed for efficient double machine learning (DML) in causal inference. The support points are selected and split as optimal representative points of the full raw data in a random sample, in contrast to traditional random splitting, providing an optimal sub-representation of the underlying data generating distribution. They offer the best representation of a full big dataset, whereas traditional random data splitting is unlikely to preserve the unit structural information of the underlying distribution. Three machine learning estimators were adopted for causal inference, support vector machine (SVM), deep learning (DL), and a hybrid super learner (SL) with deep learning (SDL), using SPSS. A comparative study is conducted between the proposed SVM, DL, and SDL representations using SPSS, and the benchmark results from Chernozhukov et al. (2018), which employed random forest, neural network, and regression trees with a random k-fold cross-fitting technique on the 401(k)-pension plan real data. The simulations show that DL with SPSS and the hybrid method of DL and SL with SPSS outperform SVM with SPSS in terms of computational efficiency and estimation quality, respectively.
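SPSS 所对照的基线,即部分线性模型下带随机 K 折交叉拟合的双重机器学习,可以草拟如下。其中 k-NN 滋扰函数学习器与模拟数据均为我们的演示性假设:

```python
import numpy as np

def knn_predict(x_train, y_train, x_test, k=20):
    """Simple k-NN regressor standing in for an arbitrary ML nuisance learner."""
    dist = np.abs(x_test[:, None] - x_train[None, :])
    idx = np.argsort(dist, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

def dml_plr(x, d, y, n_folds=2, seed=0):
    """Double ML for the partially linear model y = theta*d + g(x) + u,
    with random K-fold cross-fitting (the splitting that SPSS replaces)."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % n_folds
    d_res, y_res = np.empty_like(d), np.empty_like(y)
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        d_res[te] = d[te] - knn_predict(x[tr], d[tr], x[te])   # residualize treatment
        y_res[te] = y[te] - knn_predict(x[tr], y[tr], x[te])   # residualize outcome
    return float(d_res @ y_res / (d_res @ d_res))              # Neyman-orthogonal score

rng = np.random.default_rng(42)
n = 4000
x = rng.uniform(-2, 2, n)
d = np.cos(2 * x) + 0.5 * rng.normal(size=n)                   # confounded treatment
y = 0.7 * d + np.sin(2 * x) + 0.5 * rng.normal(size=n)         # true effect theta = 0.7
theta_hat = dml_plr(x, d, y)
```

论文的贡献在于把这里的随机折分换成基于能量距离选取的支持点子样本,以便更好地保留数据生成分布的结构信息。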

[LG-48] Sparsifying Suprema of Gaussian Processes

链接: https://arxiv.org/abs/2411.14664
作者: Anindya De,Shivam Nadimpalli,Ryan O’Donnell,Rocco A. Servedio
关键词-EN: canonical Gaussian process, boldsymbol, centered Gaussian processes, varepsilon, dimension-independent sparsification result
类目: Machine Learning (stat.ML); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
*备注: 30 pages

点击查看摘要

Abstract:We give a dimension-independent sparsification result for suprema of centered Gaussian processes: Let T be any (possibly infinite) bounded set of vectors in \mathbb{R}^n , and let \{\boldsymbol{X}_t\}_{t\in T} be the canonical Gaussian process on T . We show that there is an O_\varepsilon(1) -size subset S \subseteq T and a set of real values \{c_s\}_{s \in S} such that \sup_{s \in S} \{\boldsymbol{X}_s + c_s\} is an \varepsilon -approximator of \sup_{t \in T} \boldsymbol{X}_t . Notably, the size of S is completely independent of both the size of T and of the ambient dimension n . We use this to show that every norm is essentially a junta when viewed as a function over Gaussian space: Given any norm \nu(x) on \mathbb{R}^n , there is another norm \psi(x) which depends only on the projection of x along O_\varepsilon(1) directions, for which \psi(\boldsymbol{g}) is a multiplicative (1 \pm \varepsilon) -approximation of \nu(\boldsymbol{g}) with probability 1-\varepsilon for \boldsymbol{g} \sim N(0,I_n) . We also use our sparsification result for suprema of centered Gaussian processes to give a sparsification lemma for convex sets of bounded geometric width: Any intersection of (possibly infinitely many) halfspaces in \mathbb{R}^n that are at distance O(1) from the origin is \varepsilon -close, under N(0,I_n) , to an intersection of only O_\varepsilon(1) many halfspaces. We describe applications to agnostic learning and tolerant property testing.
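这一结果的味道可以在一个简单特例里数值体会:取 T 为圆周上的单位向量,则 X_t = <g, t> 就是典范过程,且 sup_t X_t = ||g||_2;一个固定大小的子集已经能 \varepsilon -近似上确界,而与 |T| 无关(此处甚至取所有 c_s = 0,一般定理并不要求这一点):

```python
import numpy as np

def sup_over(m, g):
    """Supremum of the canonical process X_t = <g, t> over m equally
    spaced unit vectors t on the circle."""
    ang = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
    T = np.stack([np.cos(ang), np.sin(ang)], axis=1)
    return (T @ g).max()

rng = np.random.default_rng(9)
errs = []
for _ in range(100):
    g = rng.normal(size=2)
    # 16 divides 100000, so the small grid is a genuine subset of the large one
    errs.append(sup_over(100_000, g) - sup_over(16, g))
```

每次抽样的近似误差至多约为 ||g|| (1 - cos(pi/16)),即约 2% 的相对误差,与大网格的十万个点无关;定理将这种现象推广到任意有界 T 和任意维数。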

[LG-49] ACE-Net: AutofoCus-Enhanced Convolutional Network for Field Imperfection Estimation with application to high b-value spiral Diffusion MRI

链接: https://arxiv.org/abs/2411.14630
作者: Mengze Gao,Zachary Shah,Xiaozhi Cao,Nan Wang,Daniel Abraham,Kawin Setsompop
关键词-EN: Spatiotemporal magnetic field, rapid image-encoding schemes, undesirable image artifacts, magnetic field variations, Spatiotemporal magnetic
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 8 pages, 5 figures, submitted to International Society for Magnetic Resonance in Medicine 32th Scientific Meeting, 2025

点击查看摘要

Abstract:Spatiotemporal magnetic field variations from B0-inhomogeneity and diffusion-encoding-induced eddy-currents can be detrimental to rapid image-encoding schemes such as spiral, EPI and 3D-cones, resulting in undesirable image artifacts. In this work, a data driven approach for automatic estimation of these field imperfections is developed by combining autofocus metrics with deep learning, and by leveraging a compact basis representation of the expected field imperfections. The method was applied to single-shot spiral diffusion MRI at high b-values where accurate estimation of B0 and eddy were obtained, resulting in high quality image reconstruction without need for additional external calibrations.

[LG-50] On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal Algorithms

链接: https://arxiv.org/abs/2411.14601
作者: Dmitry Kovalev,Ekaterina Borodich
关键词-EN: convex-concave bilinearly-coupled saddle-point, smooth convex-concave bilinearly-coupled, bilinearly-coupled saddle-point problem, lower complexity bounds, bilinearly-coupled saddle-point
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We revisit the smooth convex-concave bilinearly-coupled saddle-point problem of the form \min_x\max_y f(x) + \langle y,\mathbf{B} x\rangle - g(y) . In the highly specific case where each of the functions f(x) and g(y) is either affine or strongly convex, there exist lower bounds on the number of gradient evaluations and matrix-vector multiplications required to solve the problem, as well as matching optimal algorithms. A notable aspect of these algorithms is that they are able to attain linear convergence, i.e., the number of iterations required to solve the problem is proportional to \log(1/\epsilon) . However, the class of bilinearly-coupled saddle-point problems for which linear convergence is possible is much wider and can involve smooth non-strongly convex functions f(x) and g(y) . Therefore, we develop the first lower complexity bounds and matching optimal linearly converging algorithms for this problem class. Our lower complexity bounds are much more general, but they cover and unify the existing results in the literature. On the other hand, our algorithm implements the separation of complexities, which, for the first time, enables the simultaneous achievement of both optimal gradient evaluation and matrix-vector multiplication complexities, resulting in the best theoretical performance to date.
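这一问题类的一个经典基线是外梯度法(extragradient),在 f、g 强凸时线性收敛;论文的最优算法改进的正是它的复杂度。下面在一个有闭式鞍点的二次实例上给出极简草稿:

```python
import numpy as np

def extragradient(grad_x, grad_y, x, y, eta=0.1, iters=2000):
    """Extragradient method for min_x max_y f(x) + <y, Bx> - g(y):
    a predictor step followed by a corrector using midpoint gradients."""
    for _ in range(iters):
        xh = x - eta * grad_x(x, y)          # predictor
        yh = y + eta * grad_y(x, y)
        x = x - eta * grad_x(xh, yh)         # corrector at the midpoint
        y = y + eta * grad_y(xh, yh)
    return x, y

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4)) / 2.0
a = rng.normal(size=4)
c = rng.normal(size=4)
# f(x) = 0.5||x||^2 - <a, x> (strongly convex), g(y) = 0.5||y||^2 - <c, y>
grad_x = lambda x, y: x - a + B.T @ y        # gradient of the saddle function in x
grad_y = lambda x, y: B @ x - (y - c)        # ascent direction in y
x, y = extragradient(grad_x, grad_y, np.zeros(4), np.zeros(4))

# closed-form saddle point from the first-order conditions
x_star = np.linalg.solve(np.eye(4) + B.T @ B, a - B.T @ c)
y_star = B @ x_star + c
```

外梯度法每步需要一次梯度计算和一次 B、B^T 乘法;论文的“复杂度分离”思想允许这两种运算以各自的最优次数执行。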

[LG-51] Past Present and Future of Sensor-based Human Activity Recognition using Wearables: A Surveying Tutorial on a Still Challenging Task

链接: https://arxiv.org/abs/2411.14452
作者: Harish Haresamudram,Chi Ian Tang,Sungho Suh,Paul Lukowicz,Thomas Ploetz
关键词-EN: wearable sensor-based Human, Human Activity Recognition, sensor-based Human Activity, recognize activities, Human Activity
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the many years since the inception of wearable sensor-based Human Activity Recognition (HAR), a wide variety of methods have been introduced and evaluated for their ability to recognize activities. Substantial gains have been made since the days of hand-crafting heuristics as features, yet progress has seemingly stalled on many popular benchmarks, with performance falling short of what may be considered ‘sufficient’, despite the increase in computational power and scale of sensor data, as well as rising complexity in techniques being employed. The HAR community approaches a new paradigm shift, this time incorporating world knowledge from foundational models. In this paper, we take stock of sensor-based HAR, surveying it from its beginnings to the current state of the field, and charting its future. This is accompanied by a hands-on tutorial, through which we guide practitioners in developing HAR systems for real-world application scenarios. We provide a compendium, for novices and experts alike, of methods that aim at finally solving the activity recognition problem.

[LG-52] Rising Rested Bandits: Lower Bounds and Efficient Algorithms

链接: https://arxiv.org/abs/2411.14446
作者: Marco Fiandri,Alberto Maria Metelli,Francesco Trovò
关键词-EN: stochastic Multi-Armed Bandits, sequential selection techniques, chosen option, field of stochastic, stochastic Multi-Armed
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 62 pages. arXiv admin note: substantial text overlap with arXiv:2212.03798

点击查看摘要

Abstract:This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e. those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested bandits in which the arms’ expected reward is monotonically non-decreasing and concave. We study the inherent sample complexity of the regret minimization problem by deriving suitable regret lower bounds. Then, we design an algorithm for the rested case, R-ed-UCB, providing a regret bound depending on the properties of the instance and, under certain circumstances, of \widetilde{\mathcal{O}}(T^{2/3}) . We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
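这一设定可以用一个小模拟来具体化。下面的策略是我们自拟的简化版滑窗乐观启发式,并非论文的 R-ed-UCB:用近期奖励估计臂当前(随拉动次数上升)的价值,再用 UCB 式加成保证对起步较弱但上升的臂持续探索:

```python
import numpy as np

def rising_rested_bandit(mu_funcs, T, window=10, seed=0):
    """Sliding-window optimistic heuristic for rested rising bandits: each arm's
    expected reward is a non-decreasing concave function of its own pull count.
    Illustrative simplification, not the paper's R-ed-UCB."""
    rng = np.random.default_rng(seed)
    K = len(mu_funcs)
    pulls = np.zeros(K, dtype=int)
    hist = [[] for _ in range(K)]
    total = 0.0
    for t in range(T):
        if t < K:
            arm = t                                      # play each arm once
        else:
            idx = np.array([np.mean(h[-window:]) +       # recent-mean value estimate
                            np.sqrt(2 * np.log(t) / len(h))
                            for h in hist])              # UCB-style exploration bonus
            arm = int(np.argmax(idx))
        r = mu_funcs[arm](pulls[arm]) + 0.1 * rng.normal()
        hist[arm].append(r)
        pulls[arm] += 1
        total += r
    return total, pulls

# arm 1 starts worse but rises higher: a good policy must keep exploring it
arms = [lambda n: 0.5,
        lambda n: 0.9 * (1 - np.exp(-n / 50))]
total, pulls = rising_rested_bandit(arms, T=2000)
```

由于第二个臂的期望奖励在约 40 次拉动后超过 0.5,持续探索使它最终占据绝大多数拉动,这正是“rested rising”设定区别于平稳 MAB 的地方。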

[LG-53] Industrial Machines Health Prognosis using a Transformer-based Framework

链接: https://arxiv.org/abs/2411.14443
作者: David J Poland,Lemuel Puglisi,Daniele Ravi
关键词-EN: Quantile Regression Neural, article introduces Transformer, Regression Neural Networks, introduces Transformer Quantile, real-time machine failure
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures. Accepted for presentation at the IEEE MetroAXRAINE conference

点击查看摘要

Abstract:This article introduces Transformer Quantile Regression Neural Networks (TQRNNs), a novel data-driven solution for real-time machine failure prediction in manufacturing contexts. Our objective is to develop an advanced predictive maintenance model capable of accurately identifying machine system breakdowns. To do so, TQRNNs employ a two-step approach: (i) a modified quantile regression neural network to segment anomaly outliers while maintaining low time complexity, and (ii) a concatenated transformer network aimed at facilitating accurate classification even within a large timeframe of up to one hour. We have implemented our proposed pipeline in a real-world beverage manufacturing industry setting. Our findings demonstrate the model’s effectiveness, achieving an accuracy rate of 70.84% with a 1-hour lead time for predicting machine breakdowns. Additionally, our analysis shows that using TQRNNs can increase high-quality production, improving product yield from 78.38% to 89.62%. We believe that predictive maintenance assumes a pivotal role in modern manufacturing, minimizing unplanned downtime, reducing repair costs, optimizing production efficiency, and ensuring operational stability. Its potential to generate substantial cost savings while enhancing sustainability and competitiveness underscores its importance in contemporary manufacturing practices.
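流水线的第一步,即在保持低时间复杂度的同时用分位数估计分割异常离群点,可以用滚动经验分位数带来近似演示(这是论文中改进的分位数回归网络的一个示意性替身,并非其实现):

```python
import numpy as np

def quantile_band_outliers(x, window=50, lo=0.05, hi=0.95):
    """Flag samples falling outside a rolling [lo, hi] empirical quantile band:
    a simplified stand-in for a quantile-regression outlier-segmentation stage."""
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        past = x[i - window:i]
        ql, qh = np.quantile(past, [lo, hi])
        flags[i] = (x[i] < ql) or (x[i] > qh)
    return flags

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 1000)     # stand-in for a machine sensor channel
x[500] = 8.0                       # injected anomaly
flags = quantile_band_outliers(x)
```

带宽 [0.05, 0.95] 意味着正常工况下约 10% 的样本会被标出;论文的第二阶段再用拼接的 transformer 对这些标记做长时间窗内的精细分类。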

信息检索

[IR-0] Multi-granularity Interest Retrieval and Refinement Network for Long-Term User Behavior Modeling in CTR Prediction KDD2025

链接: https://arxiv.org/abs/2411.15005
作者: Xiang Xu,Hao Wang,Wei Guo,Luankang Zhang,Wanshan Yang,Runlong Yu,Yong Liu,Defu Lian,Enhong Chen
关键词-EN: Click-through Rate, online personalization platforms, CTR prediction, personalization platforms, target item
类目: Information Retrieval (cs.IR)
*备注: KDD2025

点击查看摘要

Abstract:Click-through Rate (CTR) prediction is crucial for online personalization platforms. Recent advancements have shown that modeling rich user behaviors can significantly improve the performance of CTR prediction. Current long-term user behavior modeling algorithms predominantly follow two cascading stages. The first stage retrieves a subsequence related to the target item from the long-term behavior sequence, while the second stage models the relationship between the subsequence and the target item. Despite significant progress, these methods have two critical flaws. First, the retrieval query typically includes only target item information, limiting the ability to capture the user’s diverse interests. Second, relational information, such as sequential and interactive information within the subsequence, is frequently overlooked and therefore needs to be further mined to model user interests more accurately. To this end, we propose the Multi-granularity Interest Retrieval and Refinement Network (MIRRN). Specifically, we first construct queries based on behaviors observed at different time scales to obtain subsequences, each capturing the user’s interest at a different granularity. We then introduce a novel multi-head Fourier transformer to efficiently learn sequential and interactive information within the subsequences, leading to more accurate modeling of user interests. Finally, we employ multi-head target attention to adaptively assess the impact of these multi-granularity interests on the target item. Extensive experiments have demonstrated that MIRRN significantly outperforms state-of-the-art baselines. Furthermore, an A/B test shows that MIRRN increases the average number of listening songs by 1.32% and the average time of listening songs by 0.55% on a popular music streaming app. The implementation code is publicly available at this https URL.
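最后一个阶段,多头目标注意力,让目标物品的嵌入作为查询去聚合行为子序列,每个头对行为赋予不同的权重。下面是一个极简 numpy 草稿(维度、单序列设定均为演示假设):

```python
import numpy as np

def target_attention(target, behaviors, n_heads=2):
    """Multi-head target attention: the target-item embedding queries the
    user's behavior embeddings to produce an interest summary (sketch of the
    final stage described in the abstract, not the paper's implementation)."""
    d = target.shape[-1]
    dh = d // n_heads
    out = []
    for h in range(n_heads):
        q = target[h * dh:(h + 1) * dh]                  # per-head query, shape (dh,)
        K = behaviors[:, h * dh:(h + 1) * dh]            # per-head keys, shape (n, dh)
        scores = K @ q / np.sqrt(dh)                     # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                                     # softmax over behaviors
        out.append(w @ K)                                # convex combination of keys
    return np.concatenate(out)

rng = np.random.default_rng(5)
behaviors = rng.normal(size=(20, 8))                     # 20 past behaviors, dim 8
target = behaviors[3] + 0.01 * rng.normal(size=8)        # target close to behavior 3
summary = target_attention(target, behaviors)            # interest summary, dim 8
```

MIRRN 对每个粒度的子序列各做一次这样的聚合,再将结果送入 CTR 预测头;多头 Fourier transformer 在此之前负责精炼子序列表示。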

[IR-1] A Reproducibility and Generalizability Study of Large Language Models for Query Generation

链接: https://arxiv.org/abs/2411.14914
作者: Moritz Staudinger,Wojciech Kusa,Florina Piroi,Aldo Lipani,Allan Hanbury
关键词-EN: Boolean query generation, detailed literature curation, Boolean query, query generation, literature curation process
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Systematic literature reviews (SLRs) are a cornerstone of academic research, yet they are often labour-intensive and time-consuming due to the detailed literature curation process. The advent of generative AI and large language models (LLMs) promises to revolutionize this process by assisting researchers in several tedious tasks, one of them being the generation of effective Boolean queries that will select the publications to consider including in a review. This paper presents an extensive study of Boolean query generation using LLMs for systematic reviews, reproducing and extending the work of Wang et al. and Alaniz et al. Our study investigates the replicability and reliability of results achieved using ChatGPT and compares its performance with open-source alternatives like Mistral and Zephyr to provide a more comprehensive analysis of LLMs for query generation. To this end, we implemented a pipeline that automatically creates a Boolean query for a given review topic using a chosen LLM, retrieves all documents for this query from the PubMed database, and then evaluates the results. With this pipeline we first assess whether the results obtained using ChatGPT for query generation are reproducible and consistent. We then generalize our results by analyzing and evaluating open-source models and their efficacy in generating Boolean queries. Finally, we conduct a failure analysis to identify and discuss the limitations and shortcomings of using LLMs for Boolean query generation. This examination helps to understand the gaps and potential areas for improvement in the application of LLMs to information retrieval tasks. Our findings highlight the strengths, limitations, and potential of LLMs in the domain of information retrieval and literature review automation.
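这类流水线的评估端归结为:对布尔查询命中的文档集合计算基于集合的检索指标。下面是一个玩具草稿(迷你语料库与简化的 AND/OR 匹配器均为我们的演示假设;研究本身查询的是 PubMed):

```python
def matches(terms_all, terms_any, text):
    """Toy Boolean matcher: every AND-term must appear, plus at least one OR-term."""
    t = text.lower()
    return all(a in t for a in terms_all) and \
           (not terms_any or any(o in t for o in terms_any))

def evaluate(retrieved, relevant):
    """Set-based precision/recall/F1, as used when scoring a generated query
    against a review's known included studies."""
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

corpus = {
    1: "deep learning for systematic review screening",
    2: "boolean query generation with large language models",
    3: "manual curation of clinical trials",
}
# query: "review" AND ("screening" OR "query")
retrieved = {i for i, txt in corpus.items()
             if matches(["review"], ["screening", "query"], txt)}
relevant = {1, 2}                    # the review's known included studies
p, r, f1 = evaluate(retrieved, relevant)
```

在这个例子中查询只命中文档 1,得到精确率 1.0、召回率 0.5;论文比较的正是不同 LLM 生成的查询在这类指标上的可复现性与表现。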
Related DOI: https://doi.org/10.1145/3673791.3698432

[IR-2] The 1st Workshop on Human-Centered Recommender Systems

链接: https://arxiv.org/abs/2411.14760
作者: Kaike Zhang,Yunfan Wu,Yougang Lyu,Du Su,Yingqiang Ge,Shuchang Liu,Qi Cao,Zhaochun Ren,Fei Sun
关键词-EN: Recommender systems, human-computer interaction, quintessential applications, applications of human-computer, Recommender
类目: Information Retrieval (cs.IR)
*备注: Workshop at TheWebConf 2025

点击查看摘要

Abstract:Recommender systems are quintessential applications of human-computer interaction. Widely utilized in daily life, they offer significant convenience but also present numerous challenges, such as the information cocoon effect, privacy concerns, fairness issues, and more. Consequently, this workshop aims to provide a platform for researchers to explore the development of Human-Centered Recommender Systems (HCRS). HCRS refers to the creation of recommender systems that prioritize human needs, values, and capabilities at the core of their design and operation. In this workshop, topics will include, but are not limited to, robustness, privacy, transparency, fairness, diversity, accountability, ethical considerations, and user-friendly design. We hope to engage in discussions on how to implement and enhance these properties in recommender systems. Additionally, participants will explore diverse evaluation methods, including innovative metrics that capture user satisfaction and trust. This workshop seeks to foster a collaborative environment for researchers to share insights and advance the field toward more ethical, user-centric, and socially responsible recommender systems.

附件下载

点击下载今日全部论文列表