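This post lists the latest papers retrieved from arXiv.org on 2025-02-03. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 every morning.
Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-02-03)
455 papers were updated today, including:
- Natural Language Processing: 60 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 112 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 105 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 186 papers (Machine Learning (cs.LG))
Natural Language Processing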
[NLP-0] Scalable-Softmax Is Superior for Attention
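[Quick read]: This paper addresses the problem that the maximum element of the Softmax output approaches zero as the input vector size grows. In Transformer-based language models, this flattens the attention distribution as the context grows, weakening the model's ability to prioritize key information and potentially limiting length generalization. The paper proposes Scalable-Softmax (SSMax), a replacement for Softmax when the input vector size varies, which integrates seamlessly into existing Transformer-based architectures. The key result is that SSMax not only reduces loss faster during pretraining but also significantly improves performance on long contexts and key information retrieval. An analysis of attention scores further shows that SSMax lets the model attend to key information even in long contexts.
Link: https://arxiv.org/abs/2501.19399
Authors: Ken M. Nakanishi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 8 figures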
Abstract:The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model’s ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.
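The flattening effect described in this abstract is easy to reproduce. The following is a minimal sketch, not the authors' implementation: it computes the largest Softmax probability for a vector with one logit of value `peak` and `n - 1` zeros. The `scaled_softmax_max` variant multiplies the logits by `s * log(n)` before the Softmax; that form and the parameter `s` are assumptions made for illustration, since the abstract does not state the SSMax formula.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_max(peak, n):
    """Largest softmax probability for one logit `peak` and n - 1 zeros."""
    return max(softmax([peak] + [0.0] * (n - 1)))

def scaled_softmax_max(peak, n, s=1.0):
    """Same input, but logits are multiplied by s * log(n) first.

    ASSUMPTION: this length-aware scaling is illustrative only; the
    abstract does not give the exact SSMax definition.
    """
    scale = s * math.log(n)
    return max(softmax([peak * scale] + [0.0] * (n - 1)))

# The plain softmax peak shrinks as n grows; the scaled variant does not.
for n in (10, 1_000, 100_000):
    print(n, round(softmax_max(5.0, n), 4), round(scaled_softmax_max(5.0, n), 4))
```

With a fixed `peak`, the plain Softmax maximum equals `e^peak / (e^peak + n - 1)`, which tends to zero as `n` grows. This is the behavior the abstract attributes to standard attention.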
[NLP-1] s1: Simple test-time scaling
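[Quick read]: This paper studies how to improve language model performance by spending extra compute at test time. The key technique is "budget forcing", which controls test-time compute by either forcefully terminating the model's thinking process or lengthening it by appending "Wait" to the generation when the model tries to end. This can lead the model to double-check its answer and often fixes incorrect reasoning steps, improving accuracy.
Link: https://arxiv.org/abs/2501.19393
Authors: Niklas Muennighoff,Zitong Yang,Weijia Shi,Xiang Lisa Li,Li Fei-Fei,Hannaneh Hajishirzi,Luke Zettlemoyer,Percy Liang,Emmanuel Candès,Tatsunori Hashimoto
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 46 pages (9 main), 10 figures, 14 tables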
Abstract:Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at this https URL.
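The budget-forcing loop described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the s1 code: `next_token` stands in for a real model call, `END` is a hypothetical end-of-thinking delimiter, and `toy_model` is a stub that tries to stop after two reasoning steps.

```python
END = "<end_of_thinking>"  # hypothetical delimiter; the real token is model-specific

def budget_forced_thinking(next_token, prompt, max_tokens, num_waits):
    """Generate a thinking trace under a token budget.

    - If the model tries to end early and waits remain, append "Wait" instead.
    - If the trace reaches max_tokens, thinking is terminated forcefully.
    """
    trace = []
    waits_left = num_waits
    while len(trace) < max_tokens:
        token = next_token(prompt + trace)
        if token == END:
            if waits_left == 0:
                break  # the model is allowed to stop thinking
            token = "Wait"  # suppress the end and nudge the model to continue
            waits_left -= 1
        trace.append(token)
    return trace

def toy_model(tokens):
    """Stub model: emits END once two "step" tokens follow the latest "Wait"."""
    steps_since_wait = 0
    for t in reversed(tokens):
        if t == "Wait":
            break
        if t == "step":
            steps_since_wait += 1
    return END if steps_since_wait >= 2 else "step"

print(budget_forced_thinking(toy_model, ["Q"], max_tokens=10, num_waits=1))
print(budget_forced_thinking(toy_model, ["Q"], max_tokens=3, num_waits=5))
```

The first call lengthens the trace with one "Wait"; the second shows the forced cutoff when the token budget runs out first.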
[NLP-2] Decoding-based Regression
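[Quick read]: This paper investigates the utility of causal autoregressive sequence models applied to arbitrary feature representations and provides theoretical grounds for the ability of language models to perform regression. The key finding is that, even when trained in the usual way with next-token prediction, decoding-based regression performs on par with traditional approaches while remaining flexible enough to capture arbitrary distributions, as in density estimation.
Link: https://arxiv.org/abs/2501.19383
Authors: Xingyou Song,Dara Bahri
Affiliations: Google DeepMind
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Google DeepMind Technical Report, 25 pages. Code can be found at this https URL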
Abstract:Language models have recently been shown capable of performing regression tasks wherein numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal auto-regressive sequence models when they are applied to any feature representation. We find that, despite being trained in the usual way - for next-token prediction via cross-entropy loss - decoding-based regression is as performant as traditional approaches for tabular regression tasks, while being flexible enough to capture arbitrary distributions, such as in the task of density estimation.
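[NLP-3] TableMaster: A Recipe to Advance Table Understanding with Language Models
[Quick read]: This paper aims to improve how well language models (LMs) understand tabular data. It identifies four key challenges: 1) difficulty in locating target data, 2) deficiency in table semantics, 3) numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in symbolic reasoning. To address them, it proposes TableMaster, a comprehensive framework that integrates multiple solutions. The key is that TableMaster first extracts the relevant table content and verbalizes it with enriched semantic context, then applies adaptive reasoning that dynamically balances textual and symbolic reasoning according to each query.
Link: https://arxiv.org/abs/2501.19378
Authors: Lang Cao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: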
Abstract:Tables serve as a fundamental format for representing structured relational data. While current language models (LMs) excel at many text-based tasks, they still face challenges in table understanding due to the complex characteristics of tabular data, such as their structured nature. In this paper, we aim to enhance LMs for improved table understanding. We identify four key challenges: 1) difficulty in locating target data, 2) deficiency in table semantics, 3) numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in symbolic reasoning. To address these issues, we propose TableMaster, a recipe and comprehensive framework that integrates multiple solutions to overcome these obstacles. TableMaster first extracts relevant table content and verbalizes it with enriched semantic context. Additionally, we introduce adaptive reasoning, a flexible approach that dynamically adjusts between textual and symbolic reasoning, tailoring the reasoning process to each query. Extensive analyses and experiments demonstrate our findings and the effectiveness of TableMaster. On the WikiTQ dataset, TableMaster achieves an accuracy of 78.13% using GPT-4o-mini, surpassing existing baselines.
[NLP-4] SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions ICASSP2025
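[Quick read]: This paper addresses the joint handling of audio and text inputs in virtual assistant interactions. The key solution is SELMA (Speech-Enabled Language Model for virtual Assistant interactions), which integrates audio and text as inputs to a large language model. SELMA uses low-rank adaptation modules for parameter-efficient training and a feature pooling strategy to recognize global patterns, improving task performance. Experiments show that SELMA simplifies the input processing pipeline of virtual assistants and significantly improves performance on tasks such as voice trigger detection and device-directed speech detection.
Link: https://arxiv.org/abs/2501.19377
Authors: Dominik Wagner,Alexander Churchill,Siddarth Sigtia,Erik Marchi
Affiliations: Apple
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at ICASSP 2025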
Abstract:In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy enabling the system to recognize global patterns and improve accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR), demonstrate that our approach both simplifies the typical input processing pipeline of virtual assistants significantly and also improves performance compared to dedicated models for each individual task. SELMA yields relative Equal-Error Rate improvements of 64% on the VT detection task, and 22% on DDSD, while also achieving word error rates close to the baseline.
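[NLP-5] We're Different, We're the Same: Creative Homogeneity Across LLMs
[Quick read]: This paper examines whether using large language models (LLMs) as creative partners narrows creativity. Using standardized creativity tests, it compares the population-level diversity of creative outputs from humans and from a broad set of widely used LLMs. After controlling for response structure and other key variables, it finds that LLM responses are much more similar to one another than human responses are to each other. This suggests that, regardless of which LLM is used, relying on LLMs as creative partners may drive users toward a limited set of "creative" outputs.
Link: https://arxiv.org/abs/2501.19361
Authors: Emily Wenger,Yoed Kenett
Affiliations: Duke University; Technion - Israel Institute of Technology
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: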
Abstract:Numerous powerful large language models (LLMs) are now available for use as writing support tools, idea generators, and beyond. Although these LLMs are marketed as helpful creative assistants, several works have shown that using an LLM as a creative partner results in a narrower set of creative outputs. However, these studies only consider the effects of interacting with a single LLM, begging the question of whether such narrowed creativity stems from using a particular LLM – which arguably has a limited range of outputs – or from using LLMs in general as creative assistants. To study this question, we elicit creative responses from humans and a broad set of LLMs using standardized creativity tests and compare the population-level diversity of responses. We find that LLM responses are much more similar to other LLM responses than human responses are to each other, even after controlling for response structure and other key variables. This finding of significant homogeneity in creative outputs across the LLMs we evaluate adds a new dimension to the ongoing conversation about creativity and LLMs. If today’s LLMs behave similarly, using them as creative partners – regardless of the model used – may drive all users towards a limited set of “creative” outputs.
[NLP-6] Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023 ACL2025
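[Quick read]: This paper evaluates how effective advanced large multimodal models (LMMs) are at generating captions for scientific figures. The key finding is that professional editors preferred captions generated by GPT-4V over those from all other models and even over the original author-written captions. Building on this finding, the paper conducts further analyses to answer whether advanced LMMs have solved the task of captioning scientific figures.
Link: https://arxiv.org/abs/2501.19353
Authors: Ting-Yao E. Hsu,Yi-Li Hsu,Shaurya Rohatgi,Chieh-Yang Huang,Ho Yin Sam Ng,Ryan Rossi,Sungchul Kim,Tong Yu,Lun-Wei Ku,C. Lee Giles,Ting-Hao K. Huang
Affiliations: Pennsylvania State University; National Tsing Hua University; AllSci; MetaMetrics Inc.; Adobe Research; Institute of Information Science, Academia Sinica
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to TACL 2025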
Abstract:Since the SCICAP dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SCICAP Challenge took place, inviting global teams to use an expanded SCICAP dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SCICAP Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?
[NLP-7] PixelWorld: Towards Perceiving Everything as Pixels
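[Quick read]: This paper addresses the fact that existing foundation models take different input forms for different modalities (pixels for vision, tokens for text), unlike human perception, which processes both in a unified way. The key solution is the "Perceive Everything as Pixels" (PEAP) framework, together with PixelWorld, an evaluation suite that maps all modalities into pixel space to assess model performance. The study shows that PEAP outperforms token-based input on multimodal datasets, that reasoning and coding capabilities decline under pixel input, that larger models hold up better than smaller ones on non-reasoning tasks, and that PEAP attention patterns align closely with those of text-token input.
Link: https://arxiv.org/abs/2501.19339
Authors: Zhiheng Lyu,Xueguang Ma,Wenhu Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: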
Abstract:Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc) as pixel inputs, i.e. “Perceive Everything as Pixels” (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all the mentioned modalities into pixel space to gauge the existing models’ performance. Our findings show that (1) PEAP outperforms baseline with token-based input in multimodal datasets, benefiting from unified input for better disambiguation, (2) significant declines in reasoning and coding capabilities across all models when processing pixel-based input, underscoring the need to enhance foundation models’ perceptual abilities, (3) larger models can maintain strong performance on non-reasoning tasks under PEAP, while smaller models like Phi-3.5-V suffer significant performance degradation, (4) the attention pattern of PEAP is highly aligned with text token input, (5) PEAP can be accelerated significantly by exploiting the spatial sparsity. We conclude that the existing frontier models are competent in pixel perception, however, there is still headroom for improvement. Our code, dataset will be released upon acceptance.
[NLP-8] Homogeneity Bias as Differential Sampling Uncertainty in Language Models
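[Quick read]: This paper investigates why large language models (LLMs) and vision-language models (VLMs) show homogeneity bias toward marginalized groups (such as Black Americans and women) when generating text. The key is to analyze uncertainty measures of the probability distributions from which tokens are sampled at inference time: entropy, perplexity, and probability of differentiation. In some models, specifically GPT-4 Turbo and Llama-3.2, tokens are sampled more deterministically when generating text about marginalized groups than about their dominant-group counterparts (White Americans and men). This more deterministic sampling may help explain homogeneity bias in those models, but the pattern did not replicate across all VLMs tested, suggesting that multiple mechanisms may contribute to homogeneity bias in AI.
Link: https://arxiv.org/abs/2501.19337
Authors: Messi H.J. Lee,Soyeon Jeon
Affiliations: Washington University in St. Louis
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: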
Abstract:Prior research shows that Large Language Models (LLMs) and Vision-Language Models (VLMs) represent marginalized groups more homogeneously than dominant groups. However, the mechanisms underlying this homogeneity bias remain relatively unexplored. We propose that this bias emerges from systematic differences in the probability distributions from which tokens are sampled at inference-time. Analyzing three measures of uncertainty in token sampling distributions (entropy, perplexity, and probability of differentiation), we find that in some models, specifically GPT-4 Turbo and Llama-3.2, tokens are sampled more deterministically when generating texts about marginalized groups (i.e., Black Americans and women) compared to their dominant group counterparts (i.e., White Americans and men). While these findings may help explain homogeneity bias in certain models, the patterns did not replicate across all VLMs tested, suggesting multiple mechanisms may contribute to homogeneity bias in AI.
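The three uncertainty measures named in this abstract can be computed from a single next-token distribution. This is a minimal sketch using common textbook definitions, which are assumptions here because the abstract does not spell out its formulas: entropy as `-sum(p * log p)`, perplexity as `exp(entropy)`, and probability of differentiation as `1 - sum(p^2)` (the chance that two independent draws differ).

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def perplexity(probs):
    """Exponentiated entropy: the effective number of equally likely tokens."""
    return math.exp(entropy(probs))

def prob_differentiation(probs):
    """Probability that two independent samples are different tokens."""
    return 1.0 - sum(p * p for p in probs)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over four tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic sampling

for name, dist in (("uniform", uniform), ("peaked", peaked)):
    print(name, entropy(dist), perplexity(dist), prob_differentiation(dist))
```

Lower values on all three measures correspond to the "more deterministic" sampling that the abstract reports for some models.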
[NLP-9] Reward-Guided Speculative Decoding for Efficient LLM Reasoning
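[Quick read]: This paper targets the efficiency of inference in large language models (LLMs). The key is a new framework, Reward-Guided Speculative Decoding (RSD), which combines a lightweight draft model with a more powerful target model and introduces a controlled bias to prioritize high-reward outputs, optimizing the trade-off between computational cost and output quality. RSD uses a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, balancing resource utilization against performance.
Link: https://arxiv.org/abs/2501.19324
Authors: Baohao Liao,Yuhui Xu,Hanze Dong,Junnan Li,Christof Monz,Silvio Savarese,Doyen Sahoo,Caiming Xiong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages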
Abstract:We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains against decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than the parallel decoding method on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
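The threshold-based decision described in this abstract can be illustrated with a short loop. This is a hedged sketch, not the RSD implementation: `draft_step`, `target_step`, and `reward` are hypothetical callables standing in for the draft model, the target model, and the process reward model, and the toy versions below exist only to make the example runnable.

```python
def reward_guided_decode(draft_step, target_step, reward, prompt, num_steps, threshold):
    """Accept the cheap draft step when its reward clears the threshold;
    otherwise fall back to the expensive target model for that step."""
    steps = []
    target_calls = 0
    for _ in range(num_steps):
        context = prompt + steps
        candidate = draft_step(context)
        if reward(context, candidate) < threshold:
            candidate = target_step(context)  # only invoked on low-reward drafts
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls

# Toy stand-ins (assumptions): the reward is high on even positions only.
def toy_draft(context):
    return f"draft-{len(context)}"

def toy_target(context):
    return f"target-{len(context)}"

def toy_reward(context, step):
    return 0.9 if len(context) % 2 == 0 else 0.2

steps, calls = reward_guided_decode(toy_draft, toy_target, toy_reward, [], 4, 0.5)
print(steps, calls)
```

Raising `threshold` sends more steps to the target model, which trades extra compute for quality; lowering it does the opposite.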
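[NLP-10] LLM-based Affective Text Generation Quality Based on Different Quantization Values
[Quick read]: This paper addresses the trade-off between resource consumption and text quality when large language models generate affective text. The key is an evaluation of GPU memory utilization and text quality across quantization values (precision bits), comparing five open-weight language models at 8-, 16-, and 32-bit precision. Reducing precision saves up to 76% of memory, at the cost of a drop of up to 10 pp in F1 score for larger models and roughly double the inference time. Larger models at lower quantization levels generally produce higher-quality affective text than smaller, higher-precision models while requiring similar memory.
Link: https://arxiv.org/abs/2501.19317
Authors: Yarik Menchaca Resendiz,Roman Klinger
Affiliations: Institut für Maschinelle Sprachverarbeitung, University of Stuttgart; Fundamentals of Natural Language Processing, University of Bamberg
Categories: Computation and Language (cs.CL)
Comments: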
Abstract:Large language models exhibit a remarkable capacity in language generation and comprehension. These advances enable AI systems to produce more human-like and emotionally engaging text. However, these models rely on a large number of parameters, requiring significant computational resources for training and inference. In some scenarios, accessing these resources can be challenging (e.g., budget or hardware limitations). Techniques like reducing precision bits can make models more memory-efficient, reducing the computational resources needed, at the cost of reduced accuracy. This paper addresses the trade-off between different quantization values, GPU RAM utilization, and text quality in affective text generation (e.g., “I really enjoy running in the snow-covered forest”). To evaluate, we use an emotion classifier and ten seed prompts to generate affective text. We test three setups of precision bits (8, 16, and 32) across five open-weight language models from two different families. Our findings demonstrate that bit reductions lead to memory savings, achieving a reduction of 76%. However, this optimization comes with a trade-off, leading to a decrease of up to 10 pp in F1 score for larger models and an increase of 10 pp for smaller models, along with roughly double the inference time. In terms of text quality, larger models at lower quantization levels generally outperform smaller, higher-precision models – while requiring similar memory.
[NLP-11] Reverse Probing: Evaluating Knowledge Transfer via Finetuned Task Embeddings for Coreference Resolution
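[Quick read]: This paper evaluates knowledge transfer from simple source tasks to a more complex target task. Whereas classical probing takes frozen representations from a complex source task and evaluates them on diverse simple target tasks, this work explores how useful embeddings from multiple simple source tasks are on a single target task. The key findings, from systematic experiments, are that embeddings from semantic similarity tasks (e.g., paraphrase detection) are the most beneficial for coreference resolution, and that representations from intermediate layers of fine-tuned models often outperform those from final layers. Combining embeddings from multiple tasks consistently improves performance, with attention-based aggregation yielding substantial gains. These findings shed light on the relationship between task-specific representations and their adaptability to complex downstream tasks, encouraging further exploration of embedding-level task transfer.
Link: https://arxiv.org/abs/2501.19316
Authors: Tatiana Anikina,Arne Binder,David Harbecke,Stalin Varanasi,Leonhard Hennig,Simon Ostermann,Sebastian Möller,Josef van Genabith
Affiliations: German Research Centre for Artificial Intelligence; Saarland Informatics Campus; Technische Universität Berlin
Categories: Computation and Language (cs.CL)
Comments: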
Abstract:In this work, we reimagine classical probing to evaluate knowledge transfer from simple source to more complex target tasks. Instead of probing frozen representations from a complex source task on diverse simple target probing tasks (as usually done in probing), we explore the effectiveness of embeddings from multiple simple source tasks on a single target task. We select coreference resolution, a linguistically complex problem requiring contextual understanding, as focus target task, and test the usefulness of embeddings from comparably simpler tasks such as paraphrase detection, named entity recognition, and relation extraction. Through systematic experiments, we evaluate the impact of individual and combined task embeddings. Our findings reveal that task embeddings vary significantly in utility for coreference resolution, with semantic similarity tasks (e.g., paraphrase detection) proving most beneficial. Additionally, representations from intermediate layers of fine-tuned models often outperform those from final layers. Combining embeddings from multiple tasks consistently improves performance, with attention-based aggregation yielding substantial gains. These insights shed light on relationships between task-specific representations and their adaptability to complex downstream tasks, encouraging further exploration of embedding-level task transfer.
[NLP-12] An Efficient Approach for Machine Translation on Low-resource Languages: A Case Study in Vietnamese-Chinese
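[Quick read]: This paper addresses insufficient training data in machine translation for low-resource language pairs such as Vietnamese-Chinese. The key is to combine a multilingual pre-trained language model (mBART) with bilingual and monolingual corpora: build an early-bird machine translation model on the bilingual data, use TF-IDF to select the monolingual sentences most related to the domains of the parallel dataset, and use the first model to synthesize augmented training data from the selected sentences. The approach outperforms the transformer model by 8%.
Link: https://arxiv.org/abs/2501.19314
Authors: Tran Ngoc Son,Nguyen Anh Tu,Nguyen Minh Tri
Affiliations: Samsung SDS R&D Center
Categories: Computation and Language (cs.CL)
Comments: Technical report of VLSP 2022 NMT; The first two authors contributed equally to this work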
Abstract:Despite the rise of recent neural networks in machine translation, those networks do not work well if the training data is insufficient. In this paper, we proposed an approach for machine translation in low-resource languages such as Vietnamese-Chinese. Our proposed method leveraged the power of the multilingual pre-trained language model (mBART) and both Vietnamese and Chinese monolingual corpus. Firstly, we built an early bird machine translation model using the bilingual training dataset. Secondly, we used TF-IDF technique to select sentences from the monolingual corpus which are the most related to domains of the parallel dataset. Finally, the first model was used to synthesize the augmented training data from the selected monolingual corpus for the translation model. Our proposed scheme showed that it outperformed 8% compared to the transformer model. The augmented dataset also pushed the model performance.
[NLP-13] Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
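[Quick read]: This paper addresses the limited speedup of speculative decoding for large language models (LLMs), which arises because many high-quality draft tokens are rejected. The key is to train a compact module on top of the target model's embeddings to produce "judgements" of the current continuation, adapting verification so that it recognizes correct but non-aligned replies. On the Llama-3.1 family, 8b/405B-Judge achieves a 9x speedup over Llama-405B while maintaining output quality.
Link: https://arxiv.org/abs/2501.19309
Authors: Gregor Bachmann,Sotiris Anagnostidis,Albert Pumarola,Markos Georgopoulos,Artsiom Sanakoyeu,Yuming Du,Edgar Schönfeld,Ali Thabet,Jonas Kohler
Affiliations: Meta GenAI; ETH Zürich
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: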
Abstract:The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive generation, leveraging a fast draft model to propose candidate tokens, which are then verified in parallel based on their likelihood under the target model. While this approach guarantees to reproduce the target output, it incurs a substantial penalty: many high-quality draft tokens are rejected, even when they represent objectively valid continuations. Indeed, we show that even powerful draft models such as GPT-4o, as well as human text cannot achieve high acceptance rates under the standard verification scheme. This severely limits the speedup potential of current speculative decoding methods, as an early rejection becomes overwhelmingly likely when solely relying on alignment of draft and target. We thus ask the following question: Can we adapt verification to recognize correct, but non-aligned replies? To this end, we draw inspiration from the LLM-as-a-judge framework, which demonstrated that LLMs are able to rate answers in a versatile way. We carefully design a dataset to elicit the same capability in the target model by training a compact module on top of the embeddings to produce “judgements” of the current continuation. We showcase our strategy on the Llama-3.1 family, where our 8b/405B-Judge achieves a speedup of 9x over Llama-405B, while maintaining its quality on a large range of benchmarks. These benefits remain present even in optimized inference frameworks, where our method reaches up to 141 tokens/s for 8B/70B-Judge and 129 tokens/s for 8B/405B on 2 and 8 H100s respectively.
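[NLP-14] SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
[Quick read]: This paper addresses two limitations of conventional test-time approaches to complex reasoning with large language models (LLMs), such as repeated sampling with majority voting or reward model scoring: diminishing returns as test-time compute scales, and the need for costly task-specific reward model training. The key solution is Self-Enhanced Test-Time Scaling (SETS), which leverages the self-verification and self-correction capabilities of recent advanced LLMs and integrates sampling, self-verification, and self-correction into a unified framework, enabling efficient and scalable test-time computation on complex tasks.
Link: https://arxiv.org/abs/2501.19306
Authors: Jiefeng Chen,Jie Ren,Xinyun Chen,Chengrun Yang,Ruoxi Sun,Sercan Ö Arık
Affiliations: Google Cloud AI Research; Google DeepMind
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: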
Abstract:Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, conventional approaches such as repeated sampling with majority voting or reward model scoring, often face diminishing returns as test-time compute scales, in addition to requiring costly task-specific reward model training. In this paper, we present Self-Enhanced Test-Time Scaling (SETS), a novel method that leverages the self-verification and self-correction capabilities of recent advanced LLMs to overcome these limitations. SETS integrates sampling, self-verification, and self-correction into a unified framework, enabling efficient and scalable test-time computation for improved capabilities at complex tasks. Through extensive experiments on challenging planning and reasoning benchmarks, compared to the alternatives, we demonstrate that SETS achieves significant performance improvements and more favorable test-time scaling laws.
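One way to picture how sampling, self-verification, and self-correction might fit together is the loop below. It is a rough sketch under stated assumptions, not the SETS algorithm as published: `sample`, `verify`, and `correct` are hypothetical callables that would each be LLM calls in practice, and the final majority vote over refined answers is an assumed aggregation step that the abstract does not specify.

```python
from collections import Counter

def sets_answer(sample, verify, correct, num_samples, max_rounds):
    """Draw several answers, refine each with verify/correct rounds,
    then return the most common refined answer and the full list."""
    finals = []
    for i in range(num_samples):
        answer = sample(i)
        for _ in range(max_rounds):
            if verify(answer):
                break  # self-verification passed; stop refining
            answer = correct(answer)  # self-correction proposes a revision
        finals.append(answer)
    best = Counter(finals).most_common(1)[0][0]
    return best, finals

# Toy stand-ins (assumptions): answers are integers and 42 is "correct".
initial_answers = [41, 42, 40]
best, finals = sets_answer(
    sample=lambda i: initial_answers[i],
    verify=lambda a: a == 42,
    correct=lambda a: a + 1,
    num_samples=3,
    max_rounds=2,
)
print(best, finals)
```

Both `num_samples` and `max_rounds` increase test-time compute, which is the scaling axis the abstract refers to.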
[NLP-15] Beyond checkmate: exploring the creative chokepoints in AI text
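[Quick read]: This paper explores the nuanced distinctions between human-written text and text generated by large language models (LLMs) across text segments. Drawing an analogy with the opening, middle game, and endgame of chess, it analyzes three segments (introduction, body, and conclusion) to determine where human and AI texts differ most. Although AI texts approximate the body segment better, closer examination still reveals a pronounced disparity, highlighting the importance of this segment for AI text detection. In addition, human texts vary more across segments than AI texts do. The key contribution is to expose the intricacies of human-AI text distinctions, offering new insights for text detection and understanding.
Link: https://arxiv.org/abs/2501.19301
Authors: Nafis Irtiza Tripto,Saranya Venkatraman,Mahjabin Nahar,Dongwon Lee
Affiliations: Penn State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, single columns, under review at Nature Machine Intelligence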
Abstract:Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) and Artificial Intelligence (AI), unlocking unprecedented capabilities. This rapid advancement has spurred research into various aspects of LLMs, their text generation reasoning capability, and potential misuse, fueling the necessity for robust detection methods. While numerous prior research has focused on detecting LLM-generated text (AI text) and thus checkmating them, our study investigates a relatively unexplored territory: portraying the nuanced distinctions between human and AI texts across text segments. Whether LLMs struggle with or excel at incorporating linguistic ingenuity across different text segments carries substantial implications for determining their potential as effective creative assistants to humans. Through an analogy with the structure of chess games (comprising opening, middle, and end games), we analyze text segments (introduction, body, and conclusion) to determine where the most significant distinctions between human and AI texts exist. While AI texts can approximate the body segment better due to its increased length, a closer examination reveals a pronounced disparity, highlighting the importance of this segment in AI text detection. Additionally, human texts exhibit higher cross-segment differences compared to AI texts. Overall, our research can shed light on the intricacies of human-AI text distinctions, offering novel insights for text detection and understanding.
[NLP-16] Pheromone-based Learning of Optimal Reasoning Paths
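[Quick read]: This paper addresses how to efficiently discover effective reasoning paths for complex problems, given that large language models (LLMs) already show strong reasoning ability through chain-of-thought prompting. The key is a new algorithm, Ant Colony Optimization-guided Tree of Thought (ACO-ToT), which combines Ant Colony Optimization (ACO) with LLMs. A collection of distinctly fine-tuned LLM "ants" traverses a centralized tree of thought and lays pheromone trails, with each ant's movement governed by a weighted combination of existing pheromone trails and its own specialized expertise. Complete reasoning paths are evaluated with a mixture-of-experts-based scoring function, and pheromones reinforce productive paths across iterations.
Link: https://arxiv.org/abs/2501.19278
Authors: Anirudh Chari,Aditya Tiwari,Richard Lian,Suraj Reddy,Brian Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: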
Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities through chain-of-thought prompting, yet discovering effective reasoning methods for complex problems remains challenging due to the vast space of possible intermediate steps. We introduce Ant Colony Optimization-guided Tree of Thought (ACO-ToT), a novel algorithm that combines ACO with LLMs to discover optimal reasoning paths for complex problems efficiently. Drawing inspiration from Hebbian learning in neurological systems, our method employs a collection of distinctly fine-tuned LLM “ants” to traverse and lay pheromone trails through a centralized tree of thought, with each ant’s movement governed by a weighted combination of existing pheromone trails and its own specialized expertise. The algorithm evaluates complete reasoning paths using a mixture-of-experts-based scoring function, with pheromones reinforcing productive reasoning paths across iterations. Experiments on three challenging reasoning tasks (GSM8K, ARC-Challenge, and MATH) demonstrate that ACO-ToT performs significantly better than existing chain-of-thought optimization approaches, suggesting that incorporating biologically inspired collective search mechanisms into LLM inference can substantially enhance reasoning capabilities.
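The pheromone mechanism described in this abstract follows the general shape of ant colony optimization, which can be sketched on a tiny hand-written tree. This is an illustration under heavy assumptions, not ACO-ToT itself: the tree, the `score` function, and the single shared `expertise` table are hypothetical stand-ins for LLM-generated thoughts, the mixture-of-experts scorer, and per-ant fine-tuned preferences, and the evaporation step is a standard ACO ingredient that the abstract does not mention.

```python
import random

def choose(options, pheromone, expertise, alpha, rng):
    """Pick a child with weight = alpha * pheromone + (1 - alpha) * expertise."""
    weights = [alpha * pheromone[o] + (1 - alpha) * expertise[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]

def aco_search(tree, score, expertise, iterations, num_ants,
               alpha=0.5, evaporation=0.1, seed=0):
    """Walk ants from "root" to a leaf, then reinforce visited nodes by path score."""
    rng = random.Random(seed)
    pheromone = {node: 1.0 for children in tree.values() for node in children}
    best_path, best_score = None, float("-inf")
    for _ in range(iterations):
        paths = []
        for _ in range(num_ants):
            node, path = "root", []
            while node in tree:  # leaves are nodes without an entry in `tree`
                node = choose(tree[node], pheromone, expertise, alpha, rng)
                path.append(node)
            paths.append(path)
        for node in pheromone:  # evaporation weakens stale trails
            pheromone[node] *= 1 - evaporation
        for path in paths:  # deposit pheromone proportional to path quality
            s = score(path)
            for node in path:
                pheromone[node] += s
            if s > best_score:
                best_path, best_score = path, s
    return best_path, pheromone

tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
expertise = {node: 1.0 for node in ("a", "b", "a1", "a2", "b1")}
score = lambda path: 1.0 if path[-1] == "a2" else 0.1  # toy path evaluator
best_path, pheromone = aco_search(tree, score, expertise, iterations=20, num_ants=5)
print(best_path, pheromone)
```

The parameter `alpha` sets how strongly ants follow existing trails versus their own preference, mirroring the "weighted combination" in the abstract.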
[NLP-17] mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval ECIR2025
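[Quick read]: This paper examines how well retrieval models follow complex instructions outside English. The key contribution is mFollowIR, a multilingual benchmark built on the TREC NeuCLIR narratives (instructions) in three languages (Russian, Chinese, Persian). The narratives are altered slightly to isolate how well retrieval models follow these nuanced changes, and results are reported for both cross-lingual (En-XX) and multilingual (XX-XX) settings. English-based retrievers trained with instructions perform strongly in the cross-lingual setting but drop notably in the multilingual setting, indicating that more data is needed for instruction-based multilingual retrievers.
Link: https://arxiv.org/abs/2501.19264
Authors: Orion Weller, Benjamin Chang, Eugene Yang, Mahsa Yarmohammadi, Sam Barham, Sean MacAvaney, Arman Cohan, Luca Soldaini, Benjamin Van Durme, Dawn Lawrie
Affiliations: Johns Hopkins University; Allen Institute for AI; University of Glasgow; Yale University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ECIR 2025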
Abstract:Retrieval systems generally focus on web-style queries that are short and underspecified. However, advances in language models have facilitated the nascent rise of retrieval models that can understand more complex queries with diverse intents. However, these efforts have focused exclusively on English; therefore, we do not yet understand how they work across languages. We introduce mFollowIR, a multilingual benchmark for measuring instruction-following ability in retrieval models. mFollowIR builds upon the TREC NeuCLIR narratives (or instructions) that span three diverse languages (Russian, Chinese, Persian) giving both query and instruction to the retrieval models. We make small changes to the narratives and isolate how well retrieval models can follow these nuanced changes. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that trained using instructions, but find a notable drop in performance in the multilingual setting, indicating that more work is needed in developing data for instruction-based multilingual retrievers.
[NLP-18] VisualSpeech: Enhance Prosody with Visual Context in TTS
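[Quick read]: This paper addresses an inherent challenge in text-to-speech (TTS) synthesis: a single text input can correspond to multiple speech outputs with varying prosody. The key to the solution is visual context: the proposed model, VisualSpeech, combines visual and textual information to improve prosody prediction. Experiments show that visual features provide useful prosodic cues beyond the text, significantly improving the naturalness and accuracy of the synthesized speech.
Link: https://arxiv.org/abs/2501.19258
Authors: Shumin Que, Anton Ragni
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: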
Abstract:Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody from a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as visual features, remains underutilized. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates both visual and textual information for improved prosody generation. Empirical results demonstrate that visual features provide valuable prosodic cues beyond the textual input, significantly enhancing the naturalness and accuracy of the synthesized speech. Audio samples are available at this https URL.
[NLP-19] Improving the Robustness of Representation Misdirection for Large Language Model Unlearning
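[Quick read]: This paper addresses the loss of robustness that Representation Misdirection (RM) methods cause during large language model (LLM) unlearning. It shows that a single non-adversarial forget-token appearing in a retain-query is enough to make RM models misbehave. To mitigate this vulnerability, the paper proposes Random Noise Augmentation (RNA), a model- and method-agnostic approach with theoretical guarantees for improving the robustness of RM methods. Extensive experiments show that RNA significantly improves the robustness of RM models while also enhancing unlearning performance.
Link: https://arxiv.org/abs/2501.19202
Authors: Dang Huu-Tien, Hoang Thanh-Tung, Le-Minh Nguyen, Naoya Inoue
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures, 1 table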
Abstract:Representation Misdirection (RM) and variants are established large language model (LLM) unlearning methods with state-of-the-art performance. In this paper, we show that RM methods inherently reduce models’ robustness, causing them to misbehave even when a single non-adversarial forget-token is in the retain-query. Toward understanding underlying causes, we reframe the unlearning process as backdoor attacks and defenses: forget-tokens act as backdoor triggers that, when activated in retain-queries, cause disruptions in RM models’ behaviors, similar to successful backdoor attacks. To mitigate this vulnerability, we propose Random Noise Augmentation – a model and method agnostic approach with theoretical guarantees for improving the robustness of RM methods. Extensive experiments demonstrate that RNA significantly improves the robustness of RM models while enhancing the unlearning performances.
[NLP-20] Efficient Reasoning with Hidden Thinking
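[Quick read]: This paper addresses the inefficiency that verbose textual descriptions introduce into Chain-of-Thought (CoT) reasoning in Multimodal Large Language Models (MLLMs). The key to the solution is Heima (as hidden llama), a framework that performs CoT reasoning in a hidden latent space. The Heima Encoder condenses each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, which reduces verbosity and the total number of tokens needed during reasoning. A corresponding Heima Decoder, built with traditional Large Language Models (LLMs), adaptively interprets the hidden representations into variable-length text, reconstructing reasoning processes that closely resemble the original CoTs. Experiments show that Heima improves generation efficiency while maintaining or even improving zero-shot task accuracy.
Link: https://arxiv.org/abs/2501.19201
Authors: Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint version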
Abstract:Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as hidden llama), an efficient reasoning framework that leverages reasoning CoTs at hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design corresponding Heima Decoder with traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequence, reconstructing reasoning processes that closely resemble the original CoTs. Experimental results across diverse reasoning MLLM benchmarks demonstrate that Heima model achieves higher generation efficiency while maintaining or even better zero-shot task accuracy. Moreover, the effective reconstruction of multimodal reasoning processes with Heima Decoder validates both the robustness and interpretability of our approach.
[NLP-21] Mixed Feelings: Cross-Domain Sentiment Classification of Patient Feedback ALT
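[Quick read]: This paper addresses data scarcity in sentiment analysis of patient feedback from the public health domain, with the aim of helping decision makers evaluate the services provided. The key to the solution is to leverage general-domain review data to alleviate the scarcity; the paper compares in-domain and out-of-domain effects across several architectures, as well as the effects of training joint multi-domain models.
Link: https://arxiv.org/abs/2501.19134
Authors: Egil Rønningstad, Lilja Charlotte Storset, Petter Mæhlum, Lilja Øvrelid, Erik Velldal
Affiliations: Department of Informatics, University of Oslo, Norway
Categories: Computation and Language (cs.CL)
Comments: Accepted for NoDaLiDa / Baltic-HLT 2025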
Abstract:Sentiment analysis of patient feedback from the public health domain can aid decision makers in evaluating the provided services. The current paper focuses on free-text comments in patient surveys about general practitioners and psychiatric healthcare, annotated with four sentence-level polarity classes – positive, negative, mixed and neutral – while also attempting to alleviate data scarcity by leveraging general-domain sources in the form of reviews. For several different architectures, we compare in-domain and out-of-domain effects, as well as the effects of training joint multi-domain models.
[NLP-22] Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations
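[Quick read]: This paper addresses sequence labeling in low-resource, domain-specific scenarios, particularly for character-dense languages such as Chinese. The key to the solution is a framework that combines an LLM-based knowledge enhancement workflow with a span-based Knowledge Fusion for Rich and Efficient Extraction (KnowFREE) model. The workflow uses explanation prompts to generate precise contextual interpretations of target entities, which mitigates semantic biases and enriches the model's contextual understanding. The KnowFREE model further integrates extension label features, enabling efficient nested entity extraction without relying on external knowledge during inference.
Link: https://arxiv.org/abs/2501.19093
Authors: Peichao Lai, Jiaxin Gan, Feiyang Ye, Yilei Wang, Bin Cui
Affiliations: School of Computer Science, Peking University; College of Computer and Data Science, Fuzhou University
Categories: Computation and Language (cs.CL)
Comments: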
Abstract:Sequence labeling remains a significant challenge in low-resource, domain-specific scenarios, particularly for character-dense languages like Chinese. Existing methods primarily focus on enhancing model comprehension and improving data diversity to boost performance. However, these approaches still struggle with inadequate model applicability and semantic distribution biases in domain-specific contexts. To overcome these limitations, we propose a novel framework that combines an LLM-based knowledge enhancement workflow with a span-based Knowledge Fusion for Rich and Efficient Extraction (KnowFREE) model. Our workflow employs explanation prompts to generate precise contextual interpretations of target entities, effectively mitigating semantic biases and enriching the model’s contextual understanding. The KnowFREE model further integrates extension label features, enabling efficient nested entity extraction without relying on external knowledge during inference. Experiments on multiple Chinese domain-specific sequence labeling datasets demonstrate that our approach achieves state-of-the-art performance, effectively addressing the challenges posed by low-resource settings.
[NLP-23] Enabling Autonomic Microservice Management through Self-Learning Agents
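[Quick read]: This paper addresses the tension between the growing need for self-management in increasingly complex software systems and the difficulty Large Language Models (LLMs) have in adapting their general knowledge to specific service contexts. The key to the solution is ServiceOdyssey, a self-learning agent system that uses curriculum learning principles and iterative exploration to manage microservices autonomously, without prior knowledge of service-specific configurations. It progressively deepens its understanding of the operational environment, reducing dependence on human input or static documentation.
Link: https://arxiv.org/abs/2501.19056
Authors: Fenglin Yu, Fangkai Yang, Xiaoting Qin, Zhiyang Zhang, Jue Zhang, Qingwei Lin, Hongyu Zhang, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Affiliations: Wuhan University; Microsoft; Nanjing University; Chongqing University
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: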
Abstract:The increasing complexity of modern software systems necessitates robust autonomic self-management capabilities. While Large Language Models (LLMs) demonstrate potential in this domain, they often face challenges in adapting their general knowledge to specific service contexts. To address this limitation, we propose ServiceOdyssey, a self-learning agent system that autonomously manages microservices without requiring prior knowledge of service-specific configurations. By leveraging curriculum learning principles and iterative exploration, ServiceOdyssey progressively develops a deep understanding of operational environments, reducing dependence on human input or static documentation. A prototype built with the Sock Shop microservice demonstrates the potential of this approach for autonomic microservice management.
[NLP-24] On the Impact of Noise in Differentially Private Text Rewriting NAACL2025
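[Quick read]: This paper examines the use of Differential Privacy (DP) in text privatization, focusing on the utility loss caused by noise addition. The key to the approach is a new sentence infilling privatization technique, which is used to study the effect of noise in DP text rewriting. The results show that non-DP privatization techniques excel at preserving utility and can reach an acceptable empirical privacy-utility trade-off, but cannot outperform DP methods in empirical privacy protection. The findings highlight the significant impact of noise in current DP rewriting mechanisms and lead to a discussion of the merits and challenges of DP in natural language processing (NLP), as well as the opportunities that non-DP methods present.
Link: https://arxiv.org/abs/2501.19022
Authors: Stephen Meisenbacher, Maulik Chevli, Florian Matthes
Affiliations: Technical University of Munich; School of Computation, Information and Technology; Department of Computer Science
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 3 figures, 9 tables. Accepted to NAACL 2025 (Findings)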
Abstract:The field of text privatization often leverages the notion of Differential Privacy (DP) to provide formal guarantees in the rewriting or obfuscation of sensitive textual data. A common and nearly ubiquitous form of DP application necessitates the addition of calibrated noise to vector representations of text, either at the data- or model-level, which is governed by the privacy parameter ε. However, noise addition almost undoubtedly leads to considerable utility loss, thereby highlighting one major drawback of DP in NLP. In this work, we introduce a new sentence infilling privatization technique, and we use this method to explore the effect of noise in DP text rewriting. We empirically demonstrate that non-DP privatization techniques excel in utility preservation and can find an acceptable empirical privacy-utility trade-off, yet cannot outperform DP methods in empirical privacy protections. Our results highlight the significant impact of noise in current DP rewriting mechanisms, leading to a discussion of the merits and challenges of DP in NLP, as well as the opportunities that non-DP methods present.
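To make "calibrated noise governed by ε" concrete, here is a minimal sketch of the standard Laplace mechanism applied to a vector. It is not the mechanism evaluated in the paper. It assumes the vector has a known L1 sensitivity bound (the hypothetical `sensitivity` parameter), and the function names `laplace_noise` and `privatize` are illustrative.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential samples with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(vector, epsilon, sensitivity=1.0, seed=None):
    """Add per-coordinate Laplace noise with scale = sensitivity / epsilon.

    Assumption: `sensitivity` bounds the L1 sensitivity of the vector.
    A smaller epsilon means stronger privacy and larger noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in vector]

# Toy usage: the same embedding under a strict and a loose privacy budget.
embedding = [0.2, -0.5, 0.1]
strict = privatize(embedding, epsilon=0.1, seed=0)
loose = privatize(embedding, epsilon=10.0, seed=0)
```

Because the noise scale is inversely proportional to ε, tightening the privacy budget directly increases the distortion of the representation, which is the utility loss the abstract refers to.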
[NLP-25] Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses
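[Quick read]: This paper addresses the scalability challenges the Tsetlin Machine (TM) faces when processing large input sequences. The key to the solution is a two-phase training approach that discovers contextual embeddings of input sequences while keeping the model both scalable and interpretable. The method first encapsulates the knowledge for each input word in the dataset's vocabulary and then uses the extracted knowledge to construct embeddings for sequences of input words.
Link: https://arxiv.org/abs/2501.19018
Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo, Bimal Bhattarai
Affiliations: Department of ICT, University of Agder; School of Engineering, Newcastle University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: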
Abstract:The Tsetlin Machine (TM) architecture has recently demonstrated effectiveness in Machine Learning (ML), particularly within Natural Language Processing (NLP). It has been utilized to construct word embedding using conjunctive propositional clauses, thereby significantly enhancing our understanding and interpretation of machine-derived decisions. The previous approach performed the word embedding over a sequence of input words to consolidate the information into a cohesive and unified representation. However, that approach encounters scalability challenges as the input size increases. In this study, we introduce a novel approach incorporating two-phase training to discover contextual embeddings of input sequences. Specifically, this method encapsulates the knowledge for each input word within the dataset’s vocabulary, subsequently constructing embeddings for a sequence of input words utilizing the extracted knowledge. This technique not only facilitates the design of a scalable model but also preserves interpretability. Our experimental findings revealed that the proposed method yields competitive performance compared to the previous approaches, demonstrating promising results in contrast to human-generated benchmarks. Furthermore, we applied the proposed approach to sentiment analysis on the IMDB dataset, where the TM embedding and the TM classifier, along with other interpretable classifiers, offered a transparent end-to-end solution with competitive performance.
[NLP-26] Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation
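[Quick read]: This paper addresses the performance drop of Multimodal Large Language Models (MLLMs) when they are confronted with negation arguments. It systematically evaluates state-of-the-art MLLMs across diverse benchmarks, reveals critical vulnerabilities in their reasoning and alignment mechanisms, and finds that proprietary models are more resilient than open-source ones. Even so, all evaluated MLLMs struggle to maintain logical consistency during conversation. The findings offer insights for improving the robustness of MLLMs against adversarial inputs, contributing to more reliable and trustworthy multimodal AI systems.
Link: https://arxiv.org/abs/2501.19017
Authors: Bin Zhu, Hui yan Qi, Yinxuan Gui, Jingjing Chen, Chong-Wah Ngo, Ee Peng Lim
Affiliations: Singapore Management University; Fudan University
Categories: Computation and Language (cs.CL)
Comments: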
Abstract:Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs, particularly negation arguments. This paper systematically evaluates state-of-the-art MLLMs across diverse benchmarks, revealing significant performance drops when negation arguments are introduced to initially correct responses. We show critical vulnerabilities in the reasoning and alignment mechanisms of these models. Proprietary models such as GPT-4o and Claude-3.5-Sonnet demonstrate better resilience compared to open-source counterparts like Qwen2-VL and LLaVA. However, all evaluated MLLMs struggle to maintain logical consistency under negation arguments during conversation. This paper aims to offer valuable insights for improving the robustness of MLLMs against adversarial inputs, contributing to the development of more reliable and trustworthy multimodal AI systems.
[NLP-27] Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities
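[Quick read]: This paper addresses package hallucination, the tendency of Large Language Models (LLMs) to reference fictional dependencies during code generation, which malicious actors can exploit to introduce vulnerabilities across the software supply chain. The paper analyzes how programming language, model size, and the specificity of the coding task request affect package hallucination, and suggests defensive strategies against the resulting attacks. It finds that the hallucination rate depends not only on model choice but also on programming language, model size, and task specificity. By offering a heuristic for evaluating a model's propensity to hallucinate packages, the paper provides a basis for improving future models and securing AI-assisted software development workflows.
Link: https://arxiv.org/abs/2501.19012
Authors: Arjun Krishna, Erick Galinkin, Leon Derczynski, Jeffrey Martin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: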
Abstract:Large Language Models (LLMs) have become an essential tool in the programmer’s toolkit, but their tendency to hallucinate code can be used by malicious actors to introduce vulnerabilities to broad swathes of the software supply chain. In this work, we analyze package hallucination behaviour in LLMs across popular programming languages examining both existing package references and fictional dependencies. By analyzing this package hallucination behaviour we find potential attacks and suggest defensive strategies to defend against these attacks. We discover that package hallucination rate is predicated not only on model choice, but also programming language, model size, and specificity of the coding task request. The Pareto optimality boundary between code generation performance and package hallucination is sparsely populated, suggesting that coding models are not being optimized for secure code. Additionally, we find an inverse correlation between package hallucination rate and the HumanEval coding benchmark, offering a heuristic for evaluating the propensity of a model to hallucinate packages. Our metrics, findings and analyses provide a base for future models, securing AI-assisted software development workflows against package supply chain attacks.
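One simple way to estimate a package hallucination rate for generated Python code is sketched below. This is an illustration under stated assumptions, not the paper's metric: `known_packages` is a hypothetical allow-list standing in for a real registry lookup, and import names are treated as package names, which is not always true in practice (for example, the `yaml` module is distributed as `PyYAML`).

```python
import ast

def imported_top_level_modules(source):
    """Return the set of top-level module names imported by `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports; they refer to local code, not packages.
            if node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules

def hallucination_rate(snippets, known_packages):
    """Fraction of imported modules that are absent from `known_packages`."""
    total = 0
    unknown = 0
    for source in snippets:
        for module in imported_top_level_modules(source):
            total += 1
            if module not in known_packages:
                unknown += 1
    return unknown / total if total else 0.0

# Toy usage with a hypothetical allow-list.
generated = [
    "import numpy as np\nfrom totally_fake_pkg import helper",
    "import os.path",
]
rate = hallucination_rate(generated, known_packages={"numpy", "os"})
```

A defensive workflow could run a check like this before installing anything an assistant suggests, and flag unknown names for manual review instead of installing them.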
[NLP-28] DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition NAACL2025
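[Quick read]: This paper addresses performance degradation in dysarthric speech recognition, which stems from the intrinsic diversity of dysarthric severity and the extrinsic disparity from normal speech. The key to the solution is Dynamic Phoneme-level Contrastive Learning (DyPCL), which decomposes speech into phoneme segments for contrastive learning, using dynamic connectionist temporal classification alignment. It also introduces dynamic curriculum learning based on the phonetic similarity of phonemes, progressing from easy negative samples to hard-to-distinguish ones, which alleviates the inherent variability of speakers and helps identify challenging speech.
Link: https://arxiv.org/abs/2501.19010
Authors: Wonjun Lee, Solee Im, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Affiliations: Department of Computer Science and Engineering, POSTECH, Republic of Korea; Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; aiXplain Inc., Los Gatos, CA, USA
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: NAACL 2025, 9 pages, 1 page appendix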
Abstract:Dysarthric speech recognition often suffers from performance degradation due to the intrinsic diversity of dysarthric severity and extrinsic disparity from normal speech. To bridge these gaps, we propose a Dynamic Phoneme-level Contrastive Learning (DyPCL) method, which leads to obtaining invariant representations across diverse speakers. We decompose the speech utterance into phoneme segments for phoneme-level contrastive learning, leveraging dynamic connectionist temporal classification alignment. Unlike prior studies focusing on utterance-level embeddings, our granular learning allows discrimination of subtle parts of speech. In addition, we introduce dynamic curriculum learning, which progressively transitions from easy negative samples to difficult-to-distinguishable negative samples based on phonetic similarity of phoneme. Our approach to training by difficulty levels alleviates the inherent variability of speakers, better identifying challenging speeches. Evaluated on the UASpeech dataset, DyPCL outperforms baseline models, achieving an average 22.10% relative reduction in word error rate (WER) across the overall dysarthria group.
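The general idea of contrastive learning with an easy-to-hard ordering of negatives can be sketched as follows. This uses a generic InfoNCE-style loss over cosine similarities, which is an assumption and may differ from the loss used in DyPCL. The vectors stand in for phoneme-segment embeddings, and `info_nce`, `easy_to_hard`, and `temperature` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer to the positive
    than to every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def easy_to_hard(anchor, negatives):
    """Order negatives from least to most similar to the anchor,
    a simple stand-in for a difficulty-based curriculum."""
    return sorted(negatives, key=lambda n: cosine(anchor, n))

# Toy usage: a dissimilar (easy) and a similar (hard) negative.
anchor = [1.0, 0.0]
easy_loss = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0]])
hard_loss = info_nce(anchor, [1.0, 0.0], [[1.0, 0.1]])
```

Hard negatives produce a larger loss, so presenting them later in training gives the model a chance to learn coarse distinctions before fine ones.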
[NLP-29] Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings
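[Quick read]: This paper concerns the detection of AI-generated text, which plagiarism detection services increasingly need to identify. It proposes a novel textual adversarial attack on detection models such as Fast-DetectGPT. The method perturbs the data using embedding models, including the interpretable Tsetlin Machine (TM), and combines synonyms with embedding similarity vectors. It substantially lowers Fast-DetectGPT's detection scores on the XSum and SQuAD datasets, making the AI-generated text harder to detect.
Link: https://arxiv.org/abs/2501.18998
Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo
Affiliations: Department of ICT, University of Agder; School of Engineering, Newcastle University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: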
Abstract:In recent years, text generation tools utilizing Artificial Intelligence (AI) have occasionally been misused across various domains, such as generating student reports or creative writings. This issue prompts plagiarism detection services to enhance their capabilities in identifying AI-generated content. Adversarial attacks are often used to test the robustness of AI-text generated detectors. This work proposes a novel textual adversarial attack on the detection models such as Fast-DetectGPT. The method employs embedding models for data perturbation, aiming at reconstructing the AI generated texts to reduce the likelihood of detection of the true origin of the texts. Specifically, we employ different embedding techniques, including the Tsetlin Machine (TM), an interpretable approach in machine learning for this purpose. By combining synonyms and embedding similarity vectors, we demonstrate the state-of-the-art reduction in detection scores against Fast-DetectGPT. Particularly, in the XSum dataset, the detection score decreased from 0.4431 to 0.2744 AUROC, and in the SQuAD dataset, it dropped from 0.5068 to 0.3532 AUROC.
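For readers unfamiliar with the metric, AUROC is the probability that a detector gives a randomly chosen AI-written text a higher score than a randomly chosen human-written text, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auroc` and the toy scores are illustrative and not taken from the paper.

```python
def auroc(positive_scores, negative_scores):
    """AUROC via pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive scores higher, counting ties as 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy usage: detector scores for AI-written (positive) and human (negative) texts.
before_attack = auroc([0.9, 0.8, 0.4], [0.3, 0.5, 0.2])
after_attack = auroc([0.3, 0.2, 0.4], [0.3, 0.5, 0.2])
```

A value of 0.5 corresponds to chance, so scores below 0.5, such as those reported after the attack, mean the detector ranks AI-generated text as less likely to be AI-generated than human text more often than not.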
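[NLP-30] Intrinsic Tensor Field Propagation in Large Language Models: A Novel Approach to Contextual Information Flow
[Quick read]: This paper addresses context propagation, a central challenge in language model architectures, particularly in tasks that require retaining long-range dependencies. The key to the solution is Intrinsic Tensor Field Propagation (ITFP), which models contextual relationships as continuous tensor fields distributed across token embeddings and governs the structured flow of contextual information through differential equations, augmenting the standard attention mechanism to enhance coherence and recall.
Link: https://arxiv.org/abs/2501.18957
Authors: Alfred Bexley, Lukas Radcliffe, Giles Weatherstone, Joseph Sakau
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: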
Abstract:Context propagation remains a central challenge in language model architectures, particularly in tasks requiring the retention of long-range dependencies. Conventional attention mechanisms, while effective in many applications, exhibit limitations in maintaining coherent contextual representations over extended sequences due to their reliance on discrete token interactions. A novel approach is introduced through the formulation of Intrinsic Tensor Field Propagation (ITFP), which models contextual relationships as continuous tensor fields distributed across token embeddings. The propagation dynamics are governed through differential equations that enable a structured flow of contextual information, augmenting the standard attention mechanism to enhance coherence and recall. A series of experiments conducted on an open-source transformer-based model demonstrate that ITFP provides measurable improvements in contextual retention, dependency resolution, and inference stability across various linguistic structures. Comparisons with baseline models reveal a reduction in syntactic inconsistencies and factual errors, while ablation studies indicate that the choice of propagation depth and integration strength significantly impacts model performance. Additional evaluations assessing domain generalization suggest that ITFP effectively adapts across different text genres, reinforcing its applicability beyond conventional language modeling tasks. Although computational trade-offs are introduced through the inclusion of tensor field computations, empirical findings suggest that the benefits in accuracy and coherence outweigh the increased processing demands.
[NLP-31] Language Games as the Pathway to Artificial Superhuman Intelligence
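[Quick read]: This paper addresses the data reproduction trap that large language models (LLMs) risk falling into on the way to artificial superhuman intelligence (ASI). Current methods optimize within fixed human-generated data distributions, so models merely recombine existing knowledge rather than explore new frontiers, which leads to stagnation. The proposed solution is to break this cycle through three mechanisms of language games: (1) role fluidity, in which multi-agent systems dynamically shift roles across tasks to enhance data diversity and coverage; (2) reward variety, which embeds multiple feedback criteria to drive complex intelligent behaviors; and (3) rule plasticity, which iteratively evolves interaction constraints to foster learnability and inject continual novelty. Together these mechanisms expand data reproduction, so that human-AI co-evolution generates unbounded data streams that drive open-ended exploration.
Link: https://arxiv.org/abs/2501.18924
Authors: Ying Wen, Ziyu Wan, Shao Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: This position paper argues that language games provide robust mechanism for achieving superhuman intelligence in large language models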
Abstract:The evolution of large language models (LLMs) toward artificial superhuman intelligence (ASI) hinges on data reproduction, a cyclical process in which models generate, curate and retrain on novel data to refine capabilities. Current methods, however, risk getting stuck in a data reproduction trap: optimizing outputs within fixed human-generated distributions in a closed loop leads to stagnation, as models merely recombine existing knowledge rather than explore new frontiers. In this paper, we propose language games as a pathway to expanded data reproduction, breaking this cycle through three mechanisms: (1) role fluidity, which enhances data diversity and coverage by enabling multi-agent systems to dynamically shift roles across tasks; (2) reward variety, embedding multiple feedback criteria that can drive complex intelligent behaviors; and (3) rule plasticity, iteratively evolving interaction constraints to foster learnability, thereby injecting continual novelty. By scaling language games into global sociotechnical ecosystems, human-AI co-evolution generates unbounded data streams that drive open-ended exploration. This framework redefines data reproduction not as a closed loop but as an engine for superhuman intelligence.
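[NLP-32] KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search
[Quick read]: This paper addresses three challenges in Knowledge Base Question Answering (KBQA): weak knowledge base (KB) awareness, imbalance between effectiveness and efficiency, and heavy reliance on annotated data. The key to the solution is KBQA-o1, a novel agentic KBQA method that combines Monte Carlo Tree Search (MCTS) with a ReAct-based agent process for stepwise logical form generation and KB environment exploration. MCTS, a heuristic search method, balances the performance of agentic exploration against the size of the search space, and the high-quality annotations it generates are used for incremental fine-tuning. With limited annotated data, KBQA-o1 outperforms previous low-resource KBQA methods, raising the GrailQA F1 score of the Llama-3.1-8B model to 78.5%, compared with 48.5% for the previous state-of-the-art method.
Link: https://arxiv.org/abs/2501.18922
Authors: Haoran Luo, Haihong E, Yikai Guo, Qika Lin, Xiaobao Wu, Xinyu Mu, Wenhao Liu, Meina Song, Yifan Zhu, Luu Anh Tuan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Preprint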
Abstract:Knowledge Base Question Answering (KBQA) aims to answer natural language questions with a large-scale structured knowledge base (KB). Despite advancements with large language models (LLMs), KBQA still faces challenges in weak KB awareness, imbalance between effectiveness and efficiency, and high reliance on annotated data. To address these challenges, we propose KBQA-o1, a novel agentic KBQA method with Monte Carlo Tree Search (MCTS). It introduces a ReAct-based agent process for stepwise logical form generation with KB environment exploration. Moreover, it employs MCTS, a heuristic search method driven by policy and reward models, to balance agentic exploration’s performance and search space. With heuristic exploration, KBQA-o1 generates high-quality annotations for further improvement by incremental fine-tuning. Experimental results show that KBQA-o1 outperforms previous low-resource KBQA methods with limited annotated data, boosting Llama-3.1-8B model’s GrailQA F1 performance to 78.5% compared to 48.5% of the previous sota method with GPT-3.5-turbo.
[NLP-33] Efficient Supernet Training with Orthogonal Softmax for Scalable ASR Model Compression ICASSP2025
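[Quick read]: This paper addresses how to fit automatic speech recognition (ASR) systems to the hardware constraints of different deployment environments. The key to the solution is supernet training, which jointly trains multiple encoders of varying sizes so that model size can be adjusted dynamically to fit specific hardware constraints without redundant training. The paper also introduces OrthoSoftmax, which applies multiple orthogonal softmax functions to efficiently identify optimal subnets within the supernet, avoiding resource-intensive search. This also enables more flexible and precise subnet selection based on various criteria and levels of granularity, with word error rates comparable to or slightly better than those of individually trained models.
Link: https://arxiv.org/abs/2501.18895
Authors: Jingjing Xu, Eugen Beck, Zijian Yang, Ralf Schlüter
Affiliations: RWTH Aachen University; AppTek GmbH
Categories: Computation and Language (cs.CL)
Comments: Accepted by ICASSP 2025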
Abstract:ASR systems are deployed across diverse environments, each with specific hardware constraints. We use supernet training to jointly train multiple encoders of varying sizes, enabling dynamic model size adjustment to fit hardware constraints without redundant training. Moreover, we introduce a novel method called OrthoSoftmax, which applies multiple orthogonal softmax functions to efficiently identify optimal subnets within the supernet, avoiding resource-intensive search. This approach also enables more flexible and precise subnet selection by allowing selection based on various criteria and levels of granularity. Our results with CTC on Librispeech and TED-LIUM-v2 show that FLOPs-aware component-wise selection achieves the best overall performance. With the same number of training updates from one single job, WERs for all model sizes are comparable to or slightly better than those of individually trained models. Furthermore, we analyze patterns in the selected components and reveal interesting insights.
[NLP-34] BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning
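[Quick read]: This paper addresses the challenge of generating reliable reasoning processes with large language models (LLMs) in complex reasoning tasks. The key to the solution is the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which approximates the optimal thinking process through reinforcement learning and uses a novel reward shaping mechanism to generate high-quality rationales. BRiTE then enhances the base LLM by maximizing the joint probability of rationale generation, yielding consistent improvements over existing methods across benchmarks.
Link: https://arxiv.org/abs/2501.18858
Authors: Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang
Affiliations: Peking University; Northwestern University; Bytedance Inc.; University of Minnesota
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: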
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Within this framework, we introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps. First, it generates high-quality rationales by approximating the optimal thinking process through reinforcement learning, using a novel reward shaping mechanism. Second, it enhances the base LLM by maximizing the joint probability of rationale generation with respect to the model’s parameters. Theoretically, we demonstrate BRiTE’s convergence at a rate of 1/T with T representing the number of iterations. Empirical evaluations on math and coding benchmarks demonstrate that our approach consistently improves performance across different base models without requiring human-annotated thinking processes. In addition, BRiTE demonstrates superior performance compared to existing algorithms that bootstrap thinking processes use alternative methods such as rejection sampling, and can even match or exceed the results achieved through supervised fine-tuning with human-annotated data.
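[NLP-35] Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities
[Quick read]: This paper analyzes the use of Large Language Models (LLMs) in data augmentation and systematically classifies the relevant techniques. It surveys four main categories: Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation, and Hybrid Augmentation. It emphasizes the role of external knowledge and personalized prompt templates in improving the quality and faithfulness of generated data, which helps counter the overfitting caused by insufficient training data. The paper also summarizes post-processing approaches that further refine the augmented data and filter out unfaithful content.
Link: https://arxiv.org/abs/2501.18845
Authors: Yaping Chai, Haoran Xie, Joe S. Qin
Affiliations: Lingnan University
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 4 figures, 4 tables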
Abstract:The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could unexpectedly make the model overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are given in personalised tasks to guide LLMs in generating the required content. Recent promising retrieval-based techniques further improve the expressive performance of LLMs in data augmentation by introducing external knowledge to enable them to produce more grounded-truth data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation and Hybrid Augmentation. We summarise the post-processing approaches in data augmentation, which contributes significantly to refining the augmented data and enabling the model to filter out unfaithful content. Then, we provide the common tasks and evaluation metrics. Finally, we introduce existing challenges and future opportunities that could bring further improvement to data augmentation.
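[NLP-36] Partially Rewriting a Transformer in Natural Language
[Quick read]: This paper attempts to partially rewrite a large language model (LLM) in a simpler form that is easier for humans to understand while preserving the network's behavior and performance. The approach approximates one feedforward network with a wider multilayer perceptron (MLP) whose neurons activate sparsely, called a transcoder, and uses an automated interpretability pipeline to generate explanations for those neurons. The first layer of this sparse MLP is then replaced with an LLM-based simulator that predicts each neuron's activation from its explanation and the surrounding context. The key idea is to combine automatically generated explanations with explanation-based prediction, with the aim of improving interpretability while preserving model performance.
Link: https://arxiv.org/abs/2501.18838
Authors: Gonçalo Paulo, Nora Belrose
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: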
Abstract:The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper, we attempt to partially rewrite a large language model using simple natural language explanations. We first approximate one of the feedforward networks in the LLM with a wider MLP with sparsely activating neurons - a transcoder - and use an automated interpretability pipeline to generate explanations for these neurons. We then replace the first layer of this sparse MLP with an LLM-based simulator, which predicts the activation of each neuron given its explanation and the surrounding context. Finally, we measure the degree to which these modifications distort the model’s final output. With our pipeline, the model’s increase in loss is statistically similar to entirely replacing the sparse MLP output with the zero vector. We employ the same protocol, this time using a sparse autoencoder, on the residual stream of the same layer and obtain similar results. These results suggest that more detailed explanations are needed to improve performance substantially above the zero ablation baseline.
[NLP-37] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
[Quick read]: This paper addresses the vulnerability of large language models (LLMs) to universal jailbreaks: attack strategies that systematically bypass model safeguards and let users carry out harmful processes that require many model interactions. The key solution is Constitutional Classifiers, safeguards trained on synthetic data generated from natural-language rules (a "constitution") that specify permitted and restricted content. These classifiers substantially strengthen the model's defense against universal jailbreaks while remaining practical to deploy.
Link: https://arxiv.org/abs/2501.18837
Authors: Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O’Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
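[NLP-38] Structural Embedding Projection for Contextual Large Language Model Inference
[Quick read]: This paper addresses the trade-off between inference efficiency and semantic coherence in language models. The key solution is Structural Embedding Projection (SEP), which refines token representations through projection matrices that integrate hierarchical and relational dependencies. The mathematical formulation of SEP lets the embedding space capture structured contextual relationships, improving semantic fidelity without significantly increasing computational overhead. Experimental evaluation indicates that SEP reduces perplexity and enhances contextual coherence, and improves the narrative consistency and topic alignment of generated text.
Link: https://arxiv.org/abs/2501.18826
Authors: Vincent Enoasmo, Cedric Featherstonehaugh, Xavier Konstantinopoulos, Zacharias Huntington
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: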
Abstract:Structured embedding transformations offer a promising approach for enhancing the efficiency and coherence of language model inference. The introduction of Structural Embedding Projection (SEP) provides a mechanism for refining token representations through projection matrices that integrate hierarchical and relational dependencies. The mathematical formulation of SEP enables embedding spaces to capture structured contextual relationships, thereby improving semantic fidelity without significantly increasing computational overhead. Experimental evaluations conducted on a range of linguistic datasets revealed that SEP contributed to reductions in perplexity and enhanced contextual coherence, demonstrating its potential to refine language model outputs. Computational efficiency assessments highlighted variations across different datasets, suggesting that the integration of structured embeddings introduced dataset-dependent trade-offs between inference speed and representational richness. The qualitative analysis of generated responses indicated that SEP enhanced narrative consistency and topic alignment, leading to improved fluency in multi-sentence text generation. The modifications to embedding layers required precise optimization to ensure stable training dynamics, as the introduction of structured transformations altered the traditional representation-learning process. The architectural adjustments necessary for SEP implementation influenced inference latency and memory consumption, requiring a balance between efficiency gains and additional processing demands. The impact of SEP on lexical diversity suggested that embedding modifications influenced the model’s vocabulary usage, reflecting a more context-aware selection of generated tokens.
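[NLP-39] Memory-Efficient Fine-Tuning of Transformers via Token Selection EMNLP2024
[Quick read]: This paper addresses the high memory overhead of fine-tuning large Transformer models such as LLMs. The key is TokenTune, which approximates the gradient computation by backpropagating through only a subset of the input tokens, reducing the memory needed to store intermediate activations. As a result, only a subset of intermediate activations is cached during the forward pass, and the method can be combined with existing memory-efficient methods such as LoRA to reduce memory cost further.
Link: https://arxiv.org/abs/2501.18824
Authors: Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
Affiliations: Meta
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024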
Abstract:Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is available at this https URL.
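[NLP-40] Bridging the Reasoning Gap: Small LLMs Can Plan with Generalised Strategies IJCAI2025
[Quick read]: This paper addresses the high financial and computational cost that accompanies improvements in the reasoning ability of Large Language Models (LLMs). It proposes two ways to strengthen the reasoning of less resource-intensive LLMs: (1) provide them with a generalised strategy, generated by a more resource-intensive LLM, for solving tasks within a given domain; and (2) exploit their cost-effectiveness by iteratively prompting them to correct errors in their proposed solutions. Empirical results show that, on planning and mathematical reasoning tasks, these methods bring less resource-intensive LLMs to levels comparable with more resource-intensive ones at a fraction of the cost. The experiments also show that using generalised strategies reduced the cost of the less resource-intensive model by nearly 30 percent on average.
Link: https://arxiv.org/abs/2501.18817
Authors: Andrey Borro, Patricia J Riddle, Michael W Barley, Michael J Witbrock
Affiliations: University of Auckland
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 page body, 2 page references, 16 page appendix (25 pages total); 2 figures; submitted to IJCAI2025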
Abstract:Recent advancements in the reasoning skills of Large Language Models (LLMs) demonstrate an increase in the ability of LLMs to solve simple planning tasks. However, as long as the driving force behind improved reasoning capability is the size and complexity of the model, the financial and computational costs associated with running them will also increase. This trend raises questions about continued accessibility and whether these improvements will increase at the same pace as models continue to grow in size and expense. We propose two approaches to enhance the reasoning ability of less resource-intensive LLMs. (1) Provide them with a generalised strategy for solving tasks within a given domain, generated by a more resource-intensive LLM. (2) Exploit their cost-effectiveness by iteratively prompting these models to correct errors in their proposed solutions. Our empirical results from planning and mathematical reasoning tasks demonstrate that these methods improve the performance of less resource-intensive LLMs to levels comparable with their more resource-intensive counterparts, at a fraction of the cost. Additionally, we show that the utilisation of generalised strategies in our experiments reduced the cost of the less resource-intensive model by nearly 30 percent on average.
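The second approach in the abstract above (iteratively prompting a cheaper model to correct its own errors) can be pictured as a simple repair loop. The sketch below is only an illustration of that control flow under stated assumptions: `solve_with_repair`, `toy_propose`, and `toy_check` are hypothetical names, the "model" is a hard-coded stand-in rather than a real LLM call, and the prompt wording is not taken from the paper.

```python
from typing import Callable, Optional, Tuple


def solve_with_repair(
    task: str,
    strategy: str,
    propose: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask a (small) model for a solution, then re-prompt it with the
    checker's error message until the solution passes or rounds run out.

    Returns (solution or None, number of rounds used).
    """
    # The generalised strategy is simply prepended to the task description.
    prompt = f"Strategy:\n{strategy}\n\nTask:\n{task}"
    for round_no in range(1, max_rounds + 1):
        candidate = propose(prompt)
        error = check(candidate)
        if error is None:
            return candidate, round_no
        # Feed the failed attempt and the error back into the next prompt.
        prompt += (
            f"\n\nPrevious attempt:\n{candidate}\nError:\n{error}\nPlease fix it."
        )
    return None, max_rounds


# Hypothetical stand-ins: a "model" that gets it right only after feedback,
# and a checker that knows the expected answer.
def toy_propose(prompt: str) -> str:
    return "42" if "Error:" in prompt else "41"


def toy_check(candidate: str) -> Optional[str]:
    return None if candidate == "42" else "the sum is wrong"


solution, rounds = solve_with_repair(
    task="Compute 17 + 25.",
    strategy="Add the tens first, then the ones.",
    propose=toy_propose,
    check=toy_check,
)
print(solution, rounds)
```

In a real setting `propose` would wrap a call to the cheaper model and `check` would be a plan validator or answer checker; both are left abstract here because the digest does not say which ones the authors used.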
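[NLP-41] Large Language Models as Common-Sense Heuristics IJCAI2025
[Quick read]: This paper addresses the challenge of generating correct and executable plans. Current research relies on large language models (LLMs) to output solutions in an intermediate language, which requires an additional translation step. The key is a novel planning method that leverages the parametrised knowledge of LLMs by using their output as the heuristic for Hill-Climbing Search, and further guides the search by prompting the LLM to generate a solution estimate. In a common household environment the method achieves a task success rate 22 percentage points higher than similar systems, and all actions are encoded in their original representation, so no intermediate-language translation step is needed.
Link: https://arxiv.org/abs/2501.18816
Authors: Andrey Borro, Patricia J Riddle, Michael W Barley, Michael J Witbrock
Affiliations: University of Auckland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 page body, 2 page references, 5 page appendix (14 page total); 1 figure; Submitted to IJCAI2025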
Abstract:While systems designed for solving planning tasks vastly outperform Large Language Models (LLMs) in this domain, they usually discard the rich semantic information embedded within task descriptions. In contrast, LLMs possess parametrised knowledge across a wide range of topics, enabling them to leverage the natural language descriptions of planning tasks in their solutions. However, current research in this direction faces challenges in generating correct and executable plans. Furthermore, these approaches depend on the LLM to output solutions in an intermediate language, which must be translated into the representation language of the planning task. We introduce a novel planning method, which leverages the parametrised knowledge of LLMs by using their output as a heuristic for Hill-Climbing Search. This approach is further enhanced by prompting the LLM to generate a solution estimate to guide the search. Our method outperforms the task success rate of similar systems within a common household environment by 22 percentage points, with consistently executable plans. All actions are encoded in their original representation, demonstrating that strong results can be achieved without an intermediate language, thus eliminating the need for a translation step.
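To make the idea of "LLM output as a heuristic for Hill-Climbing Search" concrete, here is a minimal sketch of greedy hill climbing with a pluggable heuristic. It is not the authors' implementation: the state graph, the action names, and `toy_heuristic` (a lookup table standing in for an LLM's estimate of remaining steps) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def hill_climb(
    start: str,
    is_goal: Callable[[str], bool],
    successors: Callable[[str], Iterable[Tuple[str, str]]],
    heuristic: Callable[[str], float],
    max_steps: int = 100,
) -> Tuple[List[str], str]:
    """Greedy hill climbing: always move to the successor with the lowest
    heuristic value; stop at the goal, at a local minimum, or after
    max_steps. Returns (actions taken, final state)."""
    plan: List[str] = []
    state = start
    for _ in range(max_steps):
        if is_goal(state):
            break
        options = list(successors(state))  # (action, next_state) pairs
        if not options:
            break
        action, nxt = min(options, key=lambda o: heuristic(o[1]))
        if heuristic(nxt) >= heuristic(state):
            break  # no successor looks better: local minimum
        plan.append(action)
        state = nxt
    return plan, state


# Hypothetical household task: put a cup in the sink.
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "start": [("go_kitchen", "at_kitchen"), ("go_bedroom", "at_bedroom")],
    "at_bedroom": [("go_kitchen", "at_kitchen")],
    "at_kitchen": [("pick_cup", "holding_cup")],
    "holding_cup": [("put_cup_in_sink", "cup_in_sink")],
    "cup_in_sink": [],
}

# Stand-in for an LLM's estimate of how many steps remain from each state.
ESTIMATE = {"start": 3, "at_bedroom": 4, "at_kitchen": 2,
            "holding_cup": 1, "cup_in_sink": 0}


def toy_heuristic(state: str) -> float:
    return ESTIMATE[state]


plan, final_state = hill_climb(
    start="start",
    is_goal=lambda s: s == "cup_in_sink",
    successors=lambda s: GRAPH[s],
    heuristic=toy_heuristic,
)
print(plan, final_state)
```

Because the actions stay in the task's own representation (here, plain action names from the graph), nothing needs to be translated back from an intermediate language, which is the property the abstract emphasises.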
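[NLP-42] Rope to Nope and Back Again: A New Hybrid Attention Strategy
[Quick read]: This paper addresses the performance limitations of long-context large language models (LLMs) on longer input sequences. The key solution is a new architecture based on a hybrid attention mechanism that not only surpasses conventional Transformer models based on Rotary Position Embedding (RoPE) on long-context tasks but also achieves competitive performance on benchmarks requiring shorter context lengths. Through a comprehensive analysis of attention mechanisms including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), the paper identifies the strengths and shortcomings of these methods in long-context modeling and designs the new architecture accordingly.
Link: https://arxiv.org/abs/2501.18795
Authors: Bowen Yang, Bharat Venkitesh, Dwarak Talupuru, Hangyu Lin, David Cairuz, Phil Blunsom, Acyr Locatelli
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: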
Abstract:Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et al., 2023). By adjusting RoPE parameters and incorporating training data with extended contexts, we can train performant models with considerably longer input sequences. However, existing RoPE-based methods exhibit performance limitations when applied to extended context lengths. This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying their strengths and shortcomings in long-context modeling. Our investigation identifies distinctive attention patterns in these methods and highlights their impact on long-context performance, providing valuable insights for architectural design. Building on these findings, we propose a novel architecture based on a hybrid attention mechanism that not only surpasses conventional RoPE-based transformer models in long context tasks but also achieves competitive performance on benchmarks requiring shorter context lengths.
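[NLP-43] Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation
[Quick read]: This paper examines how data contamination affects the validity of language model evaluation benchmarks. Starting from a carefully decontaminated train-test split, the experiments systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. The results show that contamination with both source and target substantially inflates BLEU scores, and that this inflation is 2.5 times larger for the 8B model than for the 1B model.
Link: https://arxiv.org/abs/2501.18771
Authors: Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch, Jiaming Luo, Colin Cherry, Markus Freitag
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: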
Abstract:Data contamination – the accidental consumption of evaluation examples within the pre-training data – can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
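[NLP-44] Breaking the Fake News Barrier: Deep Learning Approaches in Bangla Language
[Quick read]: This paper addresses the widespread dissemination of misinformation in the digital era, and in particular the erosion of certainty and judgment it causes among the Bengali-speaking community. The key solution uses deep learning, specifically the Gated Recurrent Unit (GRU), to identify fake news in Bangla. The method involves thorough data preprocessing, including tokenization, lemmatization, and oversampling to address class imbalance, resulting in a dataset of 58,478 passages. The GRU-based model shows strong performance, with a precision of 94%.
Link: https://arxiv.org/abs/2501.18766
Authors: Pronoy Kumar Mondal, Sadman Sadik Khan, Md. Masud Rana, Shahriar Sultan Ramit, Abdus Sattar, Md. Sadekur Rahman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, THE 15th INTERNATIONAL IEEE CONFERENCE ON COMPUTING, COMMUNICATION AND NETWORKING TECHNOLOGIES (ICCCNT)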
Abstract:The rapid growth of digital platforms has greatly amplified the spread of false information, eroding certainty and judgment in society, especially among the Bengali-speaking community. Our study addresses this critical issue by presenting an approach that uses deep learning, specifically the Gated Recurrent Unit (GRU), to recognize fake news in the Bangla language. The proposed method involves thorough data preprocessing, including lemmatization, tokenization, and addressing class imbalance through oversampling. This results in a dataset containing 58,478 passages. We build a GRU-based model that shows strong performance, with a precision rate of 94%. This study explains in detail the steps involved in preparing the data, selecting the model, training it, and evaluating its performance. The performance of the model is assessed with reliable metrics such as precision, recall, F1 score, and accuracy. The contributions of this work include creating a large fake news dataset in Bangla and a model that has outperformed other Bangla fake news detection models.
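[NLP-45] Revisiting Projection-based Data Transfer for Cross-Lingual Named Entity Recognition in Low-Resource Languages ALT
[Quick read]: This paper addresses the challenges of cross-lingual Named Entity Recognition (NER) in low-resource languages. The key solution is two improvements to the annotation projection step: first, refining word alignments via back-translation to improve accuracy; second, a novel formalized projection approach that matches source-language entities with candidates extracted from the target language. Experimental results show that the proposed approach surpasses existing projection-based methods in low-resource settings. These findings highlight the robustness of projection-based data transfer as an alternative to model-based methods for cross-lingual NER in low-resource languages.
Link: https://arxiv.org/abs/2501.18750
Authors: Andrei Politov, Oleh Shkalikov, René Jäkel, Michael Färber
Affiliations: Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), TU Dresden, Dresden/Leipzig, Germany
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at NoDaLiDa/Baltic-HLT 2025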
Abstract:Cross-lingual Named Entity Recognition (NER) leverages knowledge transfer between languages to identify and classify named entities, making it particularly useful for low-resource languages. We show that the data-based cross-lingual transfer method is an effective technique for cross-lingual NER and can outperform multilingual language models for low-resource languages. This paper introduces two key enhancements to the annotation projection step in cross-lingual NER for low-resource languages. First, we explore refining word alignments using back-translation to improve accuracy. Second, we present a novel formalized projection approach of matching source entities with extracted target candidates. Through extensive experiments on two datasets spanning 57 languages, we demonstrate that our approach surpasses existing projection-based methods in low-resource settings. These findings highlight the robustness of projection-based data transfer as an alternative to model-based methods for cross-lingual named entity recognition in low-resource languages.
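For readers unfamiliar with annotation projection, the sketch below shows the basic heuristic that such pipelines build on: each labelled source span is mapped to the range of target tokens it is aligned to. This is a generic baseline written for illustration, not the paper's formalized matching approach or its back-translation refinement; the function name, the toy sentence pair, and the hand-written alignment are all assumptions.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label) over token indices


def project_entities(
    src_entities: List[Span],
    alignment: Set[Tuple[int, int]],
    tgt_len: int,
) -> List[Span]:
    """Project labelled source spans onto target tokens.

    For each source span, collect the target indices aligned to any token
    inside it and return the smallest contiguous target span covering
    them. Spans with no aligned target token are dropped.
    """
    projected: List[Span] = []
    for start, end, label in src_entities:
        tgt = sorted(
            t for s, t in alignment
            if start <= s < end and 0 <= t < tgt_len
        )
        if tgt:
            projected.append((tgt[0], tgt[-1] + 1, label))
    return projected


# Toy example (hand-made alignment, not produced by a real aligner).
src_tokens = ["Merkel", "visited", "Paris"]
tgt_tokens = ["Merkel", "hat", "Paris", "besucht"]
alignment = {(0, 0), (1, 1), (1, 3), (2, 2)}
src_entities = [(0, 1, "PER"), (2, 3, "LOC")]

tgt_entities = project_entities(src_entities, alignment, len(tgt_tokens))
for start, end, label in tgt_entities:
    print(label, tgt_tokens[start:end])
```

The min-to-max covering rule is the simplest possible choice and is sensitive to noisy alignments, which is one reason refinements to the alignment and matching steps, such as those the abstract describes, can matter in practice.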
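[NLP-46] Examining the Robustness of Large Language Models across Language Complexity
[Quick read]: This paper investigates how reliable and robust large language models (LLMs) are when processing text of differing linguistic complexity. In learning contexts, students may have different language backgrounds and levels of writing skill, so it is critical that these models evaluate text consistently across levels of language complexity. The key is to test robustness by comparing how LLM-based student models perform on texts with high versus low lexical, syntactic, and semantic complexity.
Link: https://arxiv.org/abs/2501.18738
Authors: Jiayi Zhang
Affiliations: University of Pennsylvania
Categories: Computation and Language (cs.CL)
Comments: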
Abstract:With the advancement of large language models (LLMs), an increasing number of student models have leveraged LLMs to analyze textual artifacts generated by students to understand and evaluate their learning. These student models typically employ pre-trained LLMs to vectorize text inputs into embeddings and then use the embeddings to train models to detect the presence or absence of a construct of interest. However, how reliable and robust are these models at processing language with different levels of complexity? In the context of learning where students may have different language backgrounds with various levels of writing skills, it is critical to examine the robustness of such models to ensure that these models work equally well for text with varying levels of language complexity. Coincidentally, a few (but limited) research studies show that the use of language can indeed impact the performance of LLMs. As such, in the current study, we examined the robustness of several LLM-based student models that detect student self-regulated learning (SRL) in math problem-solving. Specifically, we compared how the performance of these models varies using texts with high and low lexical, syntactic, and semantic complexity measured by three linguistic measures.
[NLP-47] Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment
[Quick read]: This paper addresses the need for timely and accurate assessment of cognitive impairment, particularly in populations at risk of Alzheimer’s disease and related dementias (ADRD). The key solution combines voice biomarkers with machine learning (ML), analysing features of speech and language to automate ADRD screening and severity prediction. The paper highlights a Random Forest applied to lexical features, which achieved a mean sensitivity of 69.4% and specificity of 83.3% for ADRD classification, with the goal of improving generalisability and clinical utility. A linguistic feature importance analysis additionally enhances the interpretability of the predictions.
Link: https://arxiv.org/abs/2501.18731
Authors: Maria R. Lima, Alexander Capstick, Fatemeh Geranmayeh, Ramin Nilforooshan, Maja Matarić, Ravi Vaidyanathan, Payam Barnaghi
Affiliations: Imperial College London; UK Dementia Research Institute, Care Research and Technology Centre; Imperial College Healthcare NHS Trust; Great Ormond Street Hospital NHS Foundation Trust; Surrey and Borders Partnership NHS Foundation Trust; University of Southern California
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Timely and accurate assessment of cognitive impairment is a major unmet need in populations at risk. Alterations in speech and language can be early predictors of Alzheimer’s disease and related dementias (ADRD) before clinical signs of neurodegeneration. Voice biomarkers offer a scalable and non-invasive solution for automated screening. However, the clinical applicability of machine learning (ML) remains limited by challenges in generalisability, interpretability, and access to patient data to train clinically applicable predictive models. Using DementiaBank recordings (N=291, 64% female), we evaluated ML techniques for ADRD screening and severity prediction from spoken language. We validated model generalisability with pilot data collected in-residence from older adults (N=22, 59% female). Risk stratification and linguistic feature importance analysis enhanced the interpretability and clinical utility of predictions. For ADRD classification, a Random Forest applied to lexical features achieved a mean sensitivity of 69.4% (95% confidence interval (CI) = 66.4-72.5) and specificity of 83.3% (78.0-88.7). On real-world pilot data, this model achieved a mean sensitivity of 70.0% (58.0-82.0) and specificity of 52.5% (39.3-65.7). For severity prediction using Mini-Mental State Examination (MMSE) scores, a Random Forest Regressor achieved a mean absolute MMSE error of 3.7 (3.7-3.8), with comparable performance of 3.3 (3.1-3.5) on pilot data. Linguistic features associated with higher ADRD risk included increased use of pronouns and adverbs, greater disfluency, reduced analytical thinking, lower lexical diversity and fewer words reflecting a psychological state of completion. Our interpretable predictive modelling offers a novel approach for in-home integration with conversational AI to monitor cognitive health and triage higher-risk individuals, enabling earlier detection and intervention.
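Since the results above are reported as sensitivity and specificity, a short sketch of how those two numbers are computed from binary predictions may help. The labels below are made up for illustration (1 = ADRD, 0 = control is an assumed encoding) and have no connection to the DementiaBank or pilot data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN): share of true positives that are caught.
    specificity = TN / (TN + FP): share of true negatives correctly cleared.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up screening outcome: 4 positives (3 caught), 6 negatives (5 cleared).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

For a screening tool the two numbers trade off against each other: the drop in specificity on the pilot data reported in the abstract means more healthy participants would be flagged, even though sensitivity stayed at a similar level.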
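[NLP-48] Zero-shot Large Language Models for Long Clinical Text Summarization with Temporal Reasoning
[Quick read]: This paper evaluates the effectiveness of zero-shot large language models (LLMs) at summarizing long clinical texts that require temporal reasoning. The central question is whether these models can integrate and accurately reflect temporal dynamics without task-specific training. Although the models identified key temporal events effectively, they struggled with chronological coherence over prolonged narratives. The paper therefore highlights both the strengths and limitations of zero-shot LLMs in clinical text summarization and notes that, while promising, these models need further refinement to support clinical decision-making, underscoring the need for enhanced training approaches that better capture the nuances of temporal information in long medical documents.
Link: https://arxiv.org/abs/2501.18724
Authors: Maya Kruse, Shiyue Hu, Nicholas Derby, Yifu Wu, Samantha Stonbraker, Bingsheng Yao, Dakuo Wang, Elizabeth Goldberg, Yanjun Gao
Affiliations: University of Colorado Anschutz Medical Campus; University of Colorado Boulder; Northeastern University
Categories: Computation and Language (cs.CL)
Comments: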
Abstract:Recent advancements in large language models (LLMs) have shown potential for transforming data processing in healthcare, particularly in understanding complex clinical narratives. This study evaluates the efficacy of zero-shot LLMs in summarizing long clinical texts that require temporal reasoning, a critical aspect for comprehensively capturing patient histories and treatment trajectories. We applied a series of advanced zero-shot LLMs to extensive clinical documents, assessing their ability to integrate and accurately reflect temporal dynamics without prior task-specific training. While the models efficiently identified key temporal events, they struggled with chronological coherence over prolonged narratives. The evaluation, combining quantitative and qualitative methods, highlights the strengths and limitations of zero-shot LLMs in clinical text summarization. The results suggest that while promising, zero-shot LLMs require further refinement to effectively support clinical decision-making processes, underscoring the need for enhanced model training approaches that better capture the nuances of temporal information in long context medical documents.
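[NLP-49] Fake News Detection After LLM Laundering: Measurement and Explanation
[Quick read]: This paper addresses the difficulty of detecting fake news generated by Large Language Models (LLMs), in particular by analysing how adding a paraphrase step affects detection. The key is to evaluate how well existing detectors identify fake news that has been paraphrased by an LLM; LIME explanations point to sentiment shift as one possible reason for detection failures. The paper also reports a worrisome trend for paraphrase quality measurement: some samples exhibit sentiment shift despite a high BERTSCORE. The study provides new datasets containing paraphrase outputs and scores for further research.
Link: https://arxiv.org/abs/2501.18649
Authors: Rupak Kumar Das, Jonathan Dodge
Affiliations: Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: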
Abstract:With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes: (1) Detectors struggle to detect LLM-paraphrased fake news more than human-written text, (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift. (4) We discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE. (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub
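[NLP-50] Layered Chain-of-Thought Prompting for Multi-Agent LLM Systems: A Comprehensive Approach to Explainable Large Language Models
[Quick read]: This paper addresses the tendency of conventional chain-of-thought (CoT) prompting to leave intermediate inferences unverified and to produce misleading explanations on complex tasks. The key solution is Layered Chain-of-Thought (Layered-CoT) prompting, a framework that systematically segments the reasoning process into multiple layers and subjects each layer to external checks and optional user feedback to improve its transparency and correctness.
Link: https://arxiv.org/abs/2501.18645
Authors: Manish Sanwal
Affiliations: News Corporation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: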
Abstract:Large Language Models (LLMs) leverage chain-of-thought (CoT) prompting to provide step-by-step rationales, improving performance on complex tasks. Despite its benefits, vanilla CoT often fails to fully verify intermediate inferences and can produce misleading explanations. In this work, we propose Layered Chain-of-Thought (Layered-CoT) Prompting, a novel framework that systematically segments the reasoning process into multiple layers, each subjected to external checks and optional user feedback. We expand on the key concepts, present three scenarios – medical triage, financial risk assessment, and agile engineering – and demonstrate how Layered-CoT surpasses vanilla CoT in terms of transparency, correctness, and user engagement. By integrating references from recent arXiv papers on interactive explainability, multi-agent frameworks, and agent-based collaboration, we illustrate how Layered-CoT paves the way for more reliable and grounded explanations in high-stakes domains.
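[NLP-51] Prompt-oriented Output of Culture-Specific Items in Translated African Poetry by Large Language Model: An Initial Multi-layered Tabular Review
[Quick read]: This paper evaluates the effect of different culture-oriented prompts when Chat Generative PreTrained Transformer Pro (ChatGPT) translates African poetry. It designs three structured prompts (a broad prompt, a prompt focused on poetic structure, and a prompt emphasizing cultural specificity) and compares the cultural items each produces. The key is to categorize outputs using the Proper nouns and Common expressions classes of Aixelá's framework, and to build four comparative tables that systematically present the differences between ChatGPT, human translators, and a custom translation engine. The study finds that, despite the culture-oriented prompts, ChatGPT did not significantly improve the rendering of cultural items when translating African poetry from English to French.
Link: https://arxiv.org/abs/2501.18644
Authors: Adeyola Opaluwah
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 4 tables. arXiv admin note: text overlap with arXiv:2406.03450, arXiv:2312.15304 by other authors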
Abstract:This paper examines the output of cultural items generated by Chat Generative PreTrained Transformer Pro in response to three structured prompts to translate three anthologies of African poetry. The first prompt was broad, the second focused on poetic structure, and the third prompt emphasized cultural specificity. To support this analysis, four comparative tables were created. The first table presents the results of the cultural items produced after the three prompts, the second categorizes these outputs based on Aixelá's framework of Proper nouns and Common expressions, and the third table summarizes the cultural items generated by human translators, a custom translation engine, and a Large Language Model. The final table outlines the strategies employed by Chat Generative PreTrained Transformer Pro following the culture-specific prompt. Compared to the outputs of cultural items from reference human translation and the custom translation engine in prior studies, the findings indicate that the culture-oriented prompts used with Chat Generative PreTrained Transformer Pro did not yield significant enhancements of cultural items during the translation of African poetry from English to French. Among the fifty-four cultural items, the human translation produced thirty-three cultural items in repetition, the custom translation engine generated thirty-eight cultural items in repetition, while Chat Generative PreTrained Transformer Pro produced forty-one cultural items in repetition. The untranslated cultural items revealed inconsistencies in large language models' approach to translating cultural items in African poetry from English to French.
[NLP-52] Divergent Emotional Patterns in Disinformation on Social Media? An Analysis of Tweets and TikToks about the DANA in Valencia
[Quick Read]: This paper studies how disinformation spreads and can be identified on social media, specifically during the DANA (isolated high-altitude depression) event that brought extreme rainfall and severe flooding to Valencia, Spain, on October 29, 2024. The authors build a dataset of 650 TikTok and X (i.e., Twitter) posts, annotated manually; a Few-Shot annotation approach with GPT-4o reaches substantial agreement with the manual labels (Cohen's kappa of 0.684). Among detection models, SVM+TF-IDF achieves the highest F1 score, and adding audio features to roberta-large-bne improves both accuracy and F1 over its text-only counterpart. The findings indicate that combining textual and audio features matters for detecting disinformation on multimodal platforms such as TikTok.
Link: https://arxiv.org/abs/2501.18640
Authors: Iván Arcos, Paolo Rosso, Ramón Salaverría
Affiliations: PRHLT Research Center, Universitat Politècnica de València, Spain; ValgrAI Valencian Graduate School and Research Network of Artificial Intelligence, Spain; School of Communication, Universidad de Navarra, Spain
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:
Abstract:This study investigates the dissemination of disinformation on social media platforms during the DANA event (DANA is a Spanish acronym for Depresion Aislada en Niveles Altos, translating to high-altitude isolated depression) that resulted in extremely heavy rainfall and devastating floods in Valencia, Spain, on October 29, 2024. We created a novel dataset of 650 TikTok and X posts, which was manually annotated to differentiate between disinformation and trustworthy content. Additionally, a Few-Shot annotation approach with GPT-4o achieved substantial agreement (Cohen’s kappa of 0.684) with manual labels. Emotion analysis revealed that disinformation on X is mainly associated with increased sadness and fear, while on TikTok, it correlates with higher levels of anger and disgust. Linguistic analysis using the LIWC dictionary showed that trustworthy content utilizes more articulate and factual language, whereas disinformation employs negations, perceptual words, and personal anecdotes to appear credible. Audio analysis of TikTok posts highlighted distinct patterns: trustworthy audios featured brighter tones and robotic or monotone narration, promoting clarity and credibility, while disinformation audios leveraged tonal variation, emotional depth, and manipulative musical elements to amplify engagement. In detection models, SVM+TF-IDF achieved the highest F1-Score, excelling with limited data. Incorporating audio features into roberta-large-bne improved both Accuracy and F1-Score, surpassing its text-only counterpart and SVM in Accuracy. GPT-4o Few-Shot also performed well, showcasing the potential of large language models for automated disinformation detection. These findings demonstrate the importance of leveraging both textual and audio features for improved disinformation detection on multimodal platforms like TikTok.
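The abstract reports annotator agreement as Cohen's kappa (0.684 between GPT-4o and the manual labels). As a minimal sketch of how that statistic is computed, the snippet below implements kappa from two label lists in plain Python. The function name and the toy labels are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy labels (not taken from the paper's dataset).
human = ["disinfo", "trust", "trust", "disinfo", "trust", "trust"]
model = ["disinfo", "trust", "disinfo", "disinfo", "trust", "trust"]
kappa = cohen_kappa(human, model)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over plain accuracy when comparing an automatic annotator with human labels.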
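[NLP-53] Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
[Quick Read]: This paper targets query inefficiency and the cold-start problem in large language model (LLM) content moderation while raising attack success rates. The key components are Graph of Attacks with Pruning (GAP), which draws on strategies from prior jailbreaks to reach a 92% attack success rate on GPT-3.5 using only 54% of the queries of the prior algorithm, and automatic generation of seed prompts from high-level content policies using LLMs, which addresses the cold-start issue. Fine-tuning PromptGuard on the generated jailbreak prompts raises its accuracy on the Toxic-Chat dataset from 5.1% to 93.89%.
Link: https://arxiv.org/abs/2501.18638
Authors: Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi
Affiliations: Amazon
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 7 figures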
Abstract:We present a modular pipeline that automates the generation of stealthy jailbreak prompts derived from high-level content policies, enhancing LLM content moderation. First, we address query inefficiency and jailbreak strength by developing Graph of Attacks with Pruning (GAP), a method that utilizes strategies from prior jailbreaks, resulting in 92% attack success rate on GPT-3.5 using only 54% of the queries of the prior algorithm. Second, we address the cold-start issue by automatically generating seed prompts from the high-level policy using LLMs. Finally, we demonstrate the utility of these generated jailbreak prompts of improving content moderation by fine-tuning PromptGuard, a model trained to detect jailbreaks, increasing its accuracy on the Toxic-Chat dataset from 5.1% to 93.89%.
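[NLP-54] Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study
[Quick Read]: This paper addresses the lack of Music Information Retrieval (MIR) and Music Emotion Recognition (MER) research on Sinhala songs. The key to the approach is a corpus of 93,116 comments collected from YouTube videos, refined through advanced filtering and transliteration into 63,471 Sinhala comments, from which 964 Sinhala-specific stop-words are derived. The curated dataset and the derived stop-words are resources for future MIR and MER research, and suggest that computational techniques can be applied to complex musical experiences across cultural traditions.
Link: https://arxiv.org/abs/2501.18633
Authors: W. M. Yomal De Mel, Nisansa de Silva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: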
Abstract:This research investigates the area of Music Information Retrieval (MIR) and Music Emotion Recognition (MER) in relation to Sinhala songs, an underexplored field in music studies. The purpose of this study is to analyze the behavior of Sinhala comments on YouTube Sinhala song videos using social media comments as primary data sources. These included comments from 27 YouTube videos containing 20 different Sinhala songs, which were carefully selected so that strict linguistic reliability would be maintained and relevancy ensured. This process led to a total of 93,116 comments being gathered upon which the dataset was refined further by advanced filtering methods and transliteration mechanisms resulting into 63,471 Sinhala comments. Additionally, 964 stop-words specific for the Sinhala language were algorithmically derived out of which 182 matched exactly with English stop-words from NLTK corpus once translated. Also, comparisons were made between general domain corpora in Sinhala against the YouTube Comment Corpus in Sinhala confirming latter as good representation of general domain. The meticulously curated data set as well as the derived stop-words form important resources for future research in the fields of MIR and MER, since they could be used and demonstrate that there are possibilities with computational techniques to solve complex musical experiences across varied cultural traditions
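[NLP-55] Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
[Quick Read]: This paper assesses the safety vulnerabilities of large language models (LLMs) in healthcare settings and explores defenses against medical adversarial attacks. The key to the approach is an automated, domain-adapted agentic evaluation pipeline that quantifies the effect of three advanced black-box jailbreaking techniques on six LLMs, followed by a study of how effectively Continual Fine-Tuning (CFT) defends against medical adversarial attacks. The results underscore the need for evolving evaluation of attack methods, domain-specific safety alignment, and balancing LLM safety against utility.
Link: https://arxiv.org/abs/2501.18632
Authors: Hang Zhang, Qian Lou, Yanshan Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: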
Abstract:Large language models (LLMs) are increasingly utilized in healthcare applications. However, their deployment in clinical practice raises significant safety concerns, including the potential spread of harmful information. This study systematically assesses the vulnerabilities of six LLMs to three advanced black-box jailbreaking techniques within medical contexts. To quantify the effectiveness of these techniques, we propose an automated and domain-adapted agentic evaluation pipeline. Experiment results indicate that leading commercial and open-source LLMs are highly vulnerable to medical jailbreaking attacks. To bolster model safety and reliability, we further investigate the effectiveness of Continual Fine-Tuning (CFT) in defending against medical adversarial attacks. Our findings underscore the necessity for evolving attack methods evaluation, domain-specific safety alignment, and LLM safety-utility balancing. This research offers actionable insights for advancing the safety and reliability of AI clinicians, contributing to ethical and effective AI deployment in healthcare.
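[NLP-56] Indiana Jones: There Are Always Some Useful Ancient Relics
[Quick Read]: This paper addresses content-safety vulnerabilities in Large Language Models (LLMs). Its key contribution is a method named Indiana Jones, which uses inter-model dialogues and keyword-driven prompts to orchestrate interactions among three specialised LLMs, bypassing the content safeguards of both white-box and black-box LLMs with near-perfect success rates. The method exposes systemic vulnerabilities in contemporary models, in particular their tendency to produce harmful or unethical outputs when guided by seemingly innocuous prompts framed in historical or contextual settings.
Link: https://arxiv.org/abs/2501.18628
Authors: Junchen Ding, Jiahao Zhang, Yi Liu, Ziqi Ding, Gelei Deng, Yuekang Li
Affiliations: University of New South Wales, Nanyang Technological University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: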
Abstract:This paper introduces Indiana Jones, an innovative approach to jailbreaking Large Language Models (LLMs) by leveraging inter-model dialogues and keyword-driven prompts. Through orchestrating interactions among three specialised LLMs, the method achieves near-perfect success rates in bypassing content safeguards in both white-box and black-box LLMs. The research exposes systemic vulnerabilities within contemporary models, particularly their susceptibility to producing harmful or unethical outputs when guided by ostensibly innocuous prompts framed in historical or contextual contexts. Experimental evaluations highlight the efficacy and adaptability of Indiana Jones, demonstrating its superiority over existing jailbreak methods. These findings emphasise the urgent need for enhanced ethical safeguards and robust security measures in the development of LLMs. Moreover, this work provides a critical foundation for future studies aimed at fortifying LLMs against adversarial exploitation while preserving their utility and flexibility.
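[NLP-57] The TIP of the Iceberg: Revealing a Hidden Class of Task-In-Prompt Adversarial Attacks on LLMs
[Quick Read]: This paper addresses safety vulnerabilities in large language models (LLMs), specifically adversarial attacks that get past existing defenses. Its key contribution is a new class of attack, Task-in-Prompt (TIP), which embeds sequence-to-sequence tasks (such as cipher decoding, riddles, and code execution) in the model's prompt to indirectly generate prohibited inputs. To assess these attacks systematically, the authors introduce the PHRYGE benchmark. Experiments show that TIP attacks circumvent the safeguards of six state-of-the-art language models, including GPT-4o and LLaMA 3.2. The study highlights critical weaknesses in current LLM safety alignment and the urgent need for more sophisticated defense strategies.
Link: https://arxiv.org/abs/2501.18626
Authors: Sergey Berezin, Reza Farahbakhsh, Noel Crespi
Affiliations: SAMOVAR; Télécom SudParis; Institut Polytechnique de Paris
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: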
Abstract:We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model’s prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA 3.2. Our findings highlight critical weaknesses in current LLM safety alignments and underscore the urgent need for more sophisticated defence strategies. Warning: this paper contains examples of unethical inquiries used solely for research purposes.
[NLP-58] Survey: Understand the challenges of Machine Learning Experts using Named Entity Recognition Tools
[Quick Read]: This paper uses Kasunic's survey research methodology to identify the criteria that Machine Learning (ML) experts use to evaluate Named Entity Recognition (NER) tools and frameworks. Its core is the design and implementation of a survey that examines the main challenges ML experts face when choosing suitable NER tools and frameworks, followed by an evaluation of the insights the survey results provide.
Link: https://arxiv.org/abs/2501.16112
Authors: Florian Freund, Philippe Tamla, Matthias Hemmje
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 20 Pages, 13 Figures, 6th International Conference on Natural Language Processing, Information Retrieval and AI (NIAI 2025) January 25 ~ 26, 2025, Copenhagen, Denmark
Abstract:This paper presents a survey based on Kasunic’s survey research methodology to identify the criteria used by Machine Learning (ML) experts to evaluate Named Entity Recognition (NER) tools and frameworks. Comparison and selection of NER tools and frameworks is a critical step in leveraging NER for Information Retrieval to support the development of Clinical Practice Guidelines. In addition, this study examines the main challenges faced by ML experts when choosing suitable NER tools and frameworks. Using Nunamaker’s methodology, the article begins with an introduction to the topic, contextualizes the research, reviews the state-of-the-art in science and technology, and identifies challenges for an expert survey on NER tools and frameworks. This is followed by a description of the survey’s design and implementation. The paper concludes with an evaluation of the survey results and the insights gained, ending with a summary and conclusions.
[NLP-59] Language Bias in Self-Supervised Learning For Automatic Speech Recognition
[Quick Read]: This paper investigates language bias in large Automatic Speech Recognition (ASR) models that are pretrained multilingually with self-supervised learning (SSL). The key to the approach is using the Lottery Ticket Hypothesis (LTH) to identify language-specific subnetworks within XLS-R and testing them on a variety of languages, showing that during fine-tuning XLS-R relies mainly on weights learned from the languages that contribute the most pretraining data rather than on traditional linguistic knowledge.
Link: https://arxiv.org/abs/2501.19321
Authors: Edward Storey, Naomi Harte, Peter Bell
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted to Speech and Language Technology Workshop (SLT) 2024 accessible on IEEE Xplore
Abstract:Self-supervised learning (SSL) is used in deep learning to train on large datasets without the need for expensive labelling of the data. Recently, large Automatic Speech Recognition (ASR) models such as XLS-R have utilised SSL to train on over one hundred different languages simultaneously. However, deeper investigation shows that the bulk of the training data for XLS-R comes from a small number of languages. Biases learned through SSL have been shown to exist in multiple domains, but language bias in multilingual SSL ASR has not been thoroughly examined. In this paper, we utilise the Lottery Ticket Hypothesis (LTH) to identify language-specific subnetworks within XLS-R and test the performance of these subnetworks on a variety of different languages. We are able to show that when fine-tuning, XLS-R bypasses traditional linguistic knowledge and builds only on weights learned from the languages with the largest data contribution to the pretraining data.
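The abstract does not spell out how the language-specific subnetworks are extracted. Lottery Ticket style analyses commonly use magnitude pruning, so the sketch below assumes that: it builds a keep/prune mask from weight magnitudes and measures how much two masks overlap. The function names, the one-shot pruning, and the toy weights are illustrative assumptions, not the authors' procedure.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that keeps the largest-magnitude weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

def mask_overlap(mask_a, mask_b):
    """Intersection-over-union of the weights kept by two masks."""
    both = sum(a and b for a, b in zip(mask_a, mask_b))
    either = sum(a or b for a, b in zip(mask_a, mask_b))
    return both / either if either else 1.0

# Hypothetical flattened weights after fine-tuning on two languages.
w_lang_a = [0.9, -0.1, 0.05, -0.8, 0.3, 0.02]
w_lang_b = [0.7, 0.2, -0.01, -0.9, 0.05, 0.4]
mask_a = magnitude_mask(w_lang_a, 0.5)
mask_b = magnitude_mask(w_lang_b, 0.5)
overlap = mask_overlap(mask_a, mask_b)
```

A high overlap between masks obtained for different languages would suggest that the languages share one subnetwork, which is the kind of evidence a language-bias analysis can build on.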
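Computer Vision
[CV-0] LiDAR Loop Closure Detection using Semantic Graphs with Graph Attention Networks
[Quick Read]: This paper addresses loop closure detection for mobile robots building maps of their environment. The key contribution is a loop closure detection algorithm that encodes semantic graphs with graph attention neural networks for place recognition and uses semantic registration to estimate the 6 Degrees of Freedom (6 DoF) relative pose constraint. The place recognition algorithm has two core modules: a semantic graph encoder and a graph comparison module. The encoder uses graph attention networks to efficiently encode spatial, semantic, and geometric information from the semantic graph of the input point cloud, with self-attention in both the node-embedding and graph-embedding steps producing distinctive graph vectors. The comparison module compares the graph vectors of the current scan and a keyframe scan to identify possible loop closures. Using the difference of the two graph vectors significantly improves performance, and the model achieves a 13% improvement in maximum F1 score on the SemanticKITTI dataset over the baseline semantic graph algorithm.
Link: https://arxiv.org/abs/2501.19382
Authors: Liudi Yang, Ruben Mascaro, Ignacio Alzugaray, Sai Manoj Prakhya, Marco Karrer, Ziyuan Liu, Margarita Chli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: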
Abstract:In this paper, we propose a novel loop closure detection algorithm that uses graph attention neural networks to encode semantic graphs to perform place recognition and then use semantic registration to estimate the 6 DoF relative pose constraint. Our place recognition algorithm has two key modules, namely, a semantic graph encoder module and a graph comparison module. The semantic graph encoder employs graph attention networks to efficiently encode spatial, semantic and geometric information from the semantic graph of the input point cloud. We then use self-attention mechanism in both node-embedding and graph-embedding steps to create distinctive graph vectors. The graph vectors of the current scan and a keyframe scan are then compared in the graph comparison module to identify a possible loop closure. Specifically, employing the difference of the two graph vectors showed a significant improvement in performance, as shown in ablation studies. Lastly, we implemented a semantic registration algorithm that takes in loop closure candidate scans and estimates the relative 6 DoF pose constraint for the LiDAR SLAM system. Extensive evaluation on public datasets shows that our model is more accurate and robust, achieving 13% improvement in maximum F1 score on the SemanticKITTI dataset, when compared to the baseline semantic graph algorithm. For the benefit of the community, we open-source the complete implementation of our proposed algorithm and custom implementation of semantic registration at this https URL
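[CV-1] Consistent Video Colorization via Palette Guidance
[Quick Read]: This paper addresses unsaturated color and temporal inconsistency in video colorization. The key is to treat colorization as a generative task and adopt Stable Video Diffusion (SVD) as the base model. A palette-based color guider is added to help generate vivid, consistent colors, with the unified color context improving color stability.
Link: https://arxiv.org/abs/2501.19331
Authors: Han Wang, Yuang Zhang, Yuhong Zhang, Lingxiao Lu, Li Song
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: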
Abstract:Colorization is a traditional computer vision task and it plays an important role in many time-consuming tasks, such as old film restoration. Existing methods suffer from unsaturated color and temporally inconsistency. In this paper, we propose a novel pipeline to overcome the challenges. We regard the colorization task as a generative task and introduce Stable Video Diffusion (SVD) as our base model. We design a palette-based color guider to assist the model in generating vivid and consistent colors. The color context introduced by the palette not only provides guidance for color generation, but also enhances the stability of the generated colors through a unified color context across multiple sequences. Experiments demonstrate that the proposed method can provide vivid and stable colors for videos, surpassing previous methods.
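[CV-2] Let Human Sketches Help: Empowering Challenging Image Segmentation Task with Freehand Sketches
[Quick Read]: This paper aims to improve segmentation performance on challenging tasks such as Camouflaged Object Detection (COD). The key contribution is a sketch-guided interactive segmentation framework that lets users annotate objects with freehand sketches instead of the traditional bounding boxes or points. The paper also introduces key modifications to network architectures and a novel sketch augmentation technique to make full use of sketch input and further raise segmentation accuracy. These improvements substantially reduce annotation time while yielding model-training results comparable to pixel-level annotation.
Link: https://arxiv.org/abs/2501.19329
Authors: Ying Zang, Runlong Cao, Jianqi Zhang, Yidong Han, Ziyue Cao, Wenjun Hu, Didi Zhu, Lanyun Zhu, Zejian Li, Deyi Ji, Tianrun Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: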
Abstract:Sketches, with their expressive potential, allow humans to convey the essence of an object through even a rough contour. For the first time, we harness this expressive potential to improve segmentation performance in challenging tasks like camouflaged object detection (COD). Our approach introduces an innovative sketch-guided interactive segmentation framework, allowing users to intuitively annotate objects with freehand sketches (drawing a rough contour of the object) instead of the traditional bounding boxes or points used in classic interactive segmentation models like SAM. We demonstrate that sketch input can significantly improve performance in existing iterative segmentation methods, outperforming text or bounding box annotations. Additionally, we introduce key modifications to network architectures and a novel sketch augmentation technique to fully harness the power of sketch input and further boost segmentation accuracy. Remarkably, our model’s output can be directly used to train other neural networks, achieving results comparable to pixel-by-pixel annotations–while reducing annotation time by up to 120 times, which shows great potential in democratizing the annotation process and enabling model training with less reliance on resource-intensive, laborious pixel-level annotations. We also present KOSCamo+, the first freehand sketch dataset for camouflaged object detection. The dataset, code, and the labeling tool will be open sourced.
[CV-3] Capturing Temporal Dynamics in Large-Scale Canopy Tree Height Estimation
[Quick Read]: This paper uses Sentinel-2 time series satellite data to generate large-scale, high-resolution canopy height maps, motivated by the rise in global greenhouse gas emissions. The key is training the model with GEDI LiDAR data as ground truth, which yields the first 10 m resolution temporal canopy height map of the European continent for 2019-2022.
Link: https://arxiv.org/abs/2501.19328
Authors: Jan Pauls, Max Zimmer, Berkant Turan, Sassan Saatchi, Philippe Ciais, Sebastian Pokutta, Fabian Gieseke
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages main paper, 5 pages references and appendix, 8 figures, 5 tables
Abstract:With the rise in global greenhouse gas emissions, accurate large-scale tree canopy height maps are essential for understanding forest structure, estimating above-ground biomass, and monitoring ecological disruptions. To this end, we present a novel approach to generate large-scale, high-resolution canopy height maps over time. Our model accurately predicts canopy height over multiple years given Sentinel-2 time series satellite data. Using GEDI LiDAR data as the ground truth for training the model, we present the first 10m resolution temporal canopy height map of the European continent for the period 2019-2022. As part of this product, we also offer a detailed canopy height map for 2020, providing more precise estimates than previous studies. Our pipeline and the resulting temporal height map are publicly available, enabling comprehensive large-scale monitoring of forests and, hence, facilitating future research and ecological analyses. For an interactive viewer, see this https URL.
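[CV-4] A Generic Hybrid Framework for 2D Visual Reconstruction
[Quick Read]: This paper addresses 2D real-world reconstruction tasks formulated as jigsaw puzzle problems (JPPs) with square, non-overlapping pieces. The key contribution is a hybrid framework that pairs a deep learning (DL)-based compatibility measure (CM) model with an optimized genetic algorithm (GA). The DL-based CM evaluates pairs of puzzle pieces holistically rather than looking only at their adjacent edges; the optimized GA iteratively searches for a globally optimal arrangement, guided by these CM scores. The hybrid approach achieves state-of-the-art results in reconstructing Portuguese tile panels and large degraded puzzles with eroded boundaries.
Link: https://arxiv.org/abs/2501.19325
Authors: Daniel Rika, Dror Sholomon, Eli David, Alexandre Pais, Nathan S. Netanyahu
Affiliations: Department of Computer Science, Bar-Ilan University; National Tile Museum; Data Science and AI Institute, Bar-Ilan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: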
Abstract:This paper presents a versatile hybrid framework for addressing 2D real-world reconstruction tasks formulated as jigsaw puzzle problems (JPPs) with square, non-overlapping pieces. Our approach integrates a deep learning (DL)-based compatibility measure (CM) model that evaluates pairs of puzzle pieces holistically, rather than focusing solely on their adjacent edges as traditionally done. This DL-based CM is paired with an optimized genetic algorithm (GA)-based solver, which iteratively searches for a global optimal arrangement using the pairwise CM scores of the puzzle pieces. Extensive experimental results highlight the framework’s adaptability and robustness across multiple real-world domains. Notably, our unique hybrid methodology achieves state-of-the-art (SOTA) results in reconstructing Portuguese tile panels and large degraded puzzles with eroded boundaries.
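To make the role of the pairwise CM scores concrete, here is a minimal sketch of a fitness function that a GA-based solver could maximize: it sums the compatibility of every horizontally and vertically adjacent pair in a candidate arrangement. Storing the scores in two lookup tables (one per direction), the function name, and the toy values are assumptions for illustration; the paper's actual scoring and GA operators are not described in the abstract.

```python
def arrangement_fitness(grid, cm_right, cm_below):
    """Sum pairwise compatibility over all adjacent pieces in a grid.

    grid: list of rows, each a list of piece ids.
    cm_right[(a, b)]: score for piece b placed directly right of a.
    cm_below[(a, b)]: score for piece b placed directly below a.
    Pairs missing from a table score 0.0.
    """
    total = 0.0
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                total += cm_right.get((grid[r][c], grid[r][c + 1]), 0.0)
            if r + 1 < rows:
                total += cm_below.get((grid[r][c], grid[r + 1][c]), 0.0)
    return total

# Hypothetical 2x2 puzzle whose correct layout is [[0, 1], [2, 3]].
cm_right = {(0, 1): 0.9, (2, 3): 0.8, (1, 0): 0.1}
cm_below = {(0, 2): 0.7, (1, 3): 0.6}
correct = arrangement_fitness([[0, 1], [2, 3]], cm_right, cm_below)
shuffled = arrangement_fitness([[1, 0], [3, 2]], cm_right, cm_below)
```

In this toy setup the correct layout scores higher than the shuffled one, which is the property a solver relies on when ranking candidate arrangements.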
[CV-5] Advancing Dense Endoscopic Reconstruction with Gaussian Splatting-driven Surface Normal-aware Tracking and Mapping ICRA2025
[Quick Read]: This paper addresses multi-view inconsistency in depth and surface reconstruction for real-time endoscopic simultaneous localization and mapping (SLAM) during minimally invasive surgery. The key contribution is Endo-2DTAM, a real-time endoscopic SLAM system built on 2D Gaussian Splatting (2DGS), with a surface normal-aware pipeline comprising tracking, mapping, and bundle adjustment modules for geometrically accurate reconstruction. The tracking module combines point-to-point and point-to-plane distance metrics for robustness, and the mapping module uses normal consistency and depth distortion to improve surface reconstruction quality. A pose-consistent strategy provides efficient, geometrically coherent keyframe sampling.
Link: https://arxiv.org/abs/2501.19319
Authors: Yiming Huang, Beilei Cui, Long Bai, Zhen Chen, Jinlin Wu, Zhen Li, Hongbin Liu, Hongliang Ren
Affiliations: Dept. of Electronic Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, China; CUHK Shenzhen Research Institute, Shenzhen, China; Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong, China; Qilu Hospital of Shandong University, Jinan, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICRA 2025
Abstract:Simultaneous Localization and Mapping (SLAM) is essential for precise surgical interventions and robotic tasks in minimally invasive procedures. While recent advancements in 3D Gaussian Splatting (3DGS) have improved SLAM with high-quality novel view synthesis and fast rendering, these systems struggle with accurate depth and surface reconstruction due to multi-view inconsistencies. Simply incorporating SLAM and 3DGS leads to mismatches between the reconstructed frames. In this work, we present Endo-2DTAM, a real-time endoscopic SLAM system with 2D Gaussian Splatting (2DGS) to address these challenges. Endo-2DTAM incorporates a surface normal-aware pipeline, which consists of tracking, mapping, and bundle adjustment modules for geometrically accurate reconstruction. Our robust tracking module combines point-to-point and point-to-plane distance metrics, while the mapping module utilizes normal consistency and depth distortion to enhance surface reconstruction quality. We also introduce a pose-consistent strategy for efficient and geometrically coherent keyframe sampling. Extensive experiments on public endoscopic datasets demonstrate that Endo-2DTAM achieves an RMSE of 1.87 ± 0.63 mm for depth reconstruction of surgical scenes while maintaining computationally efficient tracking, high-quality visual appearance, and real-time rendering. Our code will be released at this http URL.
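The tracking module is described as combining point-to-point and point-to-plane distance metrics. The sketch below shows the two standard definitions for a single correspondence, plus one simple way to blend them. The weighted sum and the `w_plane` parameter are assumptions for illustration; the abstract does not say how Endo-2DTAM combines the two terms.

```python
import math

def point_to_point(p, q):
    """Euclidean distance between source point p and target point q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def point_to_plane(p, q, n):
    """Distance from p to the plane through q with unit normal n."""
    return abs(sum((a - b) * c for a, b, c in zip(p, q, n)))

def combined_residual(p, q, n, w_plane=0.5):
    """Weighted blend of both metrics; w_plane is an illustrative knob."""
    d_point = point_to_point(p, q)
    d_plane = point_to_plane(p, q, n)
    return (1 - w_plane) * d_point + w_plane * d_plane

# Hypothetical correspondence: target q lies on the plane z = 0.
p = (1.0, 2.0, 3.0)
q = (1.0, 0.0, 0.0)
n = (0.0, 0.0, 1.0)
d_point = point_to_point(p, q)
d_plane = point_to_plane(p, q, n)
residual = combined_residual(p, q, n)
```

The point-to-plane term ignores displacement along the surface (here the offset in y), which is why it tends to behave well on smooth surfaces, while the point-to-point term still penalizes sliding.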
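[CV-6] Application of Generative Adversarial Network (GAN) for Synthetic Training Data Creation to improve performance of ANN Classifier for extracting Built-Up pixels from Landsat Satellite Imagery
[Quick Read]: This paper addresses the difficulty of reaching the expected accuracy with neural networks for pixel-based classification of low-resolution Landsat images when training data is scarce. The key is a simple generative adversarial network (GAN) architecture that generates synthetic training data with the same distribution as the original samples, improving an artificial neural network (ANN) classifier's ability to identify built-up pixels in Landsat 7 imagery. To check that the generated and original pixels are indistinguishable, the marginal distributions of the bands are compared with the non-parametric Kolmogorov-Smirnov test and the joint distribution with a Ball Divergence based equality-of-distributions test. As generated built-up pixels are added to the original set, the ANN's overall accuracy rises from 0.9331 to 0.9983 and its kappa coefficient from 0.8277 to 0.9958.
Link: https://arxiv.org/abs/2501.19283
Authors: Amritendu Mukherjee, Dipanwita Sinha Mukherjee, Parthasarathy Ramachandran
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: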
Abstract:Training a neural network for pixel based classification task using low resolution Landsat images is difficult as the size of the training data is usually small due to less number of available pixels that represent a single class without any mixing with other classes. Due to this scarcity of training data, neural network may not be able to attain expected level of accuracy. This limitation could be overcome using a generative network that aims to generate synthetic data having the same distribution as the sample data with which it is trained. In this work, we have proposed a methodology for improving the performance of ANN classifier to identify built-up pixels in the Landsat 7 image with the help of developing a simple GAN architecture that could generate synthetic training pixels when trained using original set of sample built-up pixels. To ensure that the marginal and joint distributions of all the bands corresponding to the generated and original set of pixels are indistinguishable, non-parametric Kolmogorov Smirnov Test and Ball Divergence based Equality of Distributions Test have been performed respectively. It has been observed that the overall accuracy and kappa coefficient of the ANN model for built-up classification have continuously improved from 0.9331 to 0.9983 and 0.8277 to 0.9958 respectively, with the inclusion of generated sets of built-up pixels to the original one.
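The marginal-distribution check relies on the two-sample Kolmogorov-Smirnov test. As a rough sketch of the statistic behind that test, the function below computes the largest gap between the empirical CDFs of two samples, for example one band's values in the original and generated pixels. Only the statistic is shown, not the p-value, and the band values are made-up numbers rather than data from the paper.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between ECDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x so ties are handled together.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical reflectance values for one band.
original_band = [1, 2, 3, 4]
generated_band = [3, 4, 5, 6]
d_stat = ks_statistic(original_band, generated_band)
```

A statistic near 0 means the two empirical distributions nearly coincide; a value near 1 means they barely overlap. A full test would convert the statistic into a p-value using the sample sizes.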
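[CV-7] Imagine with the Teacher: Complete Shape in a Multi-View Distillation Way
[Quick Read]: This paper addresses point cloud completion: recovering an object's complete 3D shape from a partial observation caused by occlusion, sensor limitations, noise, and similar factors. The key contribution is the View Distillation Point Completion Network (VD-PCN), which solves completion through multi-view distillation. The design takes full advantage of the orderliness of 2D pixels, the flexibility of 2D processing, and the power of 2D networks.
Link: https://arxiv.org/abs/2501.19270
Authors: Zhanpeng Luo, Linna Wang, Guangwu Qian, Li Lu
Affiliations: Computer College, Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures, 4 tables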
Abstract:Point cloud completion aims to recover the completed 3D shape of an object from its partial observation caused by occlusion, sensor’s limitation, noise, etc. When some key semantic information is lost in the incomplete point cloud, the neural network needs to infer the missing part based on the input information. Intuitively we would apply an autoencoder architecture to solve this kind of problem, which take the incomplete point cloud as input and is supervised by the ground truth. This process that develops model’s imagination from incomplete shape to complete shape is done automatically in the latent space. But the knowledge for mapping from incomplete to complete still remains dark and could be further explored. Motivated by the knowledge distillation’s teacher-student learning strategy, we design a knowledge transfer way for completing 3d shape. In this work, we propose a novel View Distillation Point Completion Network (VD-PCN), which solve the completion problem by a multi-view distillation way. The design methodology fully leverages the orderliness of 2d pixels, flexibleness of 2d processing and powerfulness of 2d network. Extensive evaluations on PCN, ShapeNet55/34, and MVP datasets confirm the effectiveness of our design and knowledge transfer strategy, both quantitatively and qualitatively. Committed to facilitate ongoing research, we will make our code publicly available.
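[CV-8] Medical Semantic Segmentation with Diffusion Pretrain
[Quick Read]: This paper addresses the underexplored use of pretext tasks for pretraining in 3D medical imaging, in particular the challenge of learning generalizable feature representations. The key contribution is a pretraining strategy that uses diffusion models with anatomical guidance, tailored to 3D medical image data. An auxiliary diffusion process pretrains a model to produce generalizable feature representations for a variety of downstream segmentation tasks, and an additional model predicts 3D universal body-part coordinates, guiding the diffusion process and improving the spatial awareness of the generated representations. The approach helps resolve localization inaccuracies and strengthens the model's ability to understand complex anatomical structures.
Link: https://arxiv.org/abs/2501.19265
Authors: David Li, Anvar Kurmukov, Mikhail Goncharov, Roman Sokolov, Mikhail Belyaev
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: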
Abstract:Recent advances in deep learning have shown that learning robust feature representations is critical for the success of many computer vision tasks, including medical image segmentation. In particular, both transformer and convolutional-based architectures have benefit from leveraging pretext tasks for pretraining. However, the adoption of pretext tasks in 3D medical imaging has been less explored and remains a challenge, especially in the context of learning generalizable feature representations. We propose a novel pretraining strategy using diffusion models with anatomical guidance, tailored to the intricacies of 3D medical image data. We introduce an auxiliary diffusion process to pretrain a model that produce generalizable feature representations, useful for a variety of downstream segmentation tasks. We employ an additional model that predicts 3D universal body-part coordinates, providing guidance during the diffusion process and improving spatial awareness in generated representations. This approach not only aids in resolving localization inaccuracies but also enriches the model’s ability to understand complex anatomical structures. Empirical validation on a 13-class organ segmentation task demonstrate the effectiveness of our pretraining technique. It surpasses existing restorative pretraining methods in 3D medical image segmentation by 7.5%, and is competitive with the state-of-the-art contrastive pretraining approach, achieving an average Dice coefficient of 67.8 in a non-linear evaluation scenario.
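The headline metric here is the Dice coefficient. As a minimal sketch of how it is computed for one class, the function below takes two binary masks flattened to 0/1 lists. The function name, the empty-mask convention, and the toy masks are illustrative assumptions. The result lies in [0, 1], so the reported 67.8 presumably corresponds to this value multiplied by 100.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p and t for p, t in zip(pred, target))
    size_sum = sum(pred) + sum(target)
    # Convention used here: two empty masks count as a perfect match.
    return 2 * intersection / size_sum if size_sum else 1.0

# Hypothetical flattened masks for a single organ class.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dice = dice_coefficient(pred_mask, true_mask)
```

For a multi-class task such as the 13-class organ segmentation mentioned above, the score is typically computed per class and then averaged.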
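[CV-9] Neuro-LIFT: A Neuromorphic LLM-based Interactive Framework for Autonomous Drone FlighT at the Edge
[Quick Read]: This paper addresses the limits of traditional natural language processing systems in understanding context and intent, and the difficulty existing AI-based navigation algorithms have with latency-critical tasks. The key is to use a Large Language Model (LLM) for natural language processing, translating human speech into high-level planning commands that are then executed autonomously with event-based neuromorphic vision and physics-driven planning. This enables energy-efficient, low-latency navigation suited to real-time obstacle avoidance and adaptation to human instructions in dynamic environments.
Link: https://arxiv.org/abs/2501.19259
Authors: Amogh Joshi, Sourav Sanyal, Kaushik Roy
Affiliations: Purdue University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Comments: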
Abstract:The integration of human-intuitive interactions into autonomous systems has been limited. Traditional Natural Language Processing (NLP) systems struggle with context and intent understanding, severely restricting human-robot interaction. Recent advancements in Large Language Models (LLMs) have transformed this dynamic, allowing for intuitive and high-level communication through speech and text, and bridging the gap between human commands and robotic actions. Additionally, autonomous navigation has emerged as a central focus in robotics research, with artificial intelligence (AI) increasingly being leveraged to enhance these systems. However, existing AI-based navigation algorithms face significant challenges in latency-critical tasks where rapid decision-making is critical. Traditional frame-based vision systems, while effective for high-level decision-making, suffer from high energy consumption and latency, limiting their applicability in real-time scenarios. Neuromorphic vision systems, combining event-based cameras and spiking neural networks (SNNs), offer a promising alternative by enabling energy-efficient, low-latency navigation. Despite their potential, real-world implementations of these systems, particularly on physical platforms such as drones, remain scarce. In this work, we present Neuro-LIFT, a real-time neuromorphic navigation framework implemented on a Parrot Bebop2 quadrotor. Leveraging an LLM for natural language processing, Neuro-LIFT translates human speech into high-level planning commands which are then autonomously executed using event-based neuromorphic vision and physics-driven planning. Our framework demonstrates its capabilities in navigating in a dynamic environment, avoiding obstacles, and adapting to human instructions in real-time.
[CV-10] ContextFormer: Redefining Efficiency in Semantic Segmentation
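[Quick read]: This paper addresses the difficulty of balancing efficiency and performance when processing high-resolution images in real time for semantic segmentation. Existing methods mostly optimize the encoder architecture and neglect the bottleneck. The key is the proposed ContextFormer framework, which combines the strengths of convolutional neural networks (CNNs) and Vision Transformers (ViTs) in the bottleneck to balance efficiency, accuracy, and robustness. ContextFormer uses three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi-scale representation, the Transformer and Modulating DepthwiseConv (Trans-MDC) block for dynamic scale-aware feature modeling, and the Feature Merging Module (FMM) for integration with enhanced spatial and contextual consistency. According to the abstract, these components yield state-of-the-art mIoU scores on the ADE20K, Pascal Context, CityScapes, and COCO-Stuff datasets.
Link: https://arxiv.org/abs/2501.19255
Authors: Mian Muhammad Naeem Abid, Nancy Mehta, Zongwei Wu, Fayaz Ali Dharejo, Radu Timofte
Affiliations: Computer Vision Lab, CAIDAS, University of Würzburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: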
Abstract:Semantic segmentation assigns labels to pixels in images, a critical yet challenging task in computer vision. Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands, especially for high-resolution inputs. Most research optimizes the encoder architecture, leaving the bottleneck underexplored - a key area for enhancing performance and efficiency. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation. The framework’s efficiency is driven by three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi-scale representation, the Transformer and Modulating DepthwiseConv (Trans-MDC) block for dynamic scale-aware feature modeling, and the Feature Merging Module (FMM) for robust integration with enhanced spatial and contextual consistency. Extensive experiments on ADE20K, Pascal Context, CityScapes, and COCO-Stuff datasets show ContextFormer significantly outperforms existing models, achieving state-of-the-art mIoU scores, setting a new benchmark for efficiency and performance. The codes will be made publicly available.
[CV-11] Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
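[Quick read]: This paper addresses the insufficient perceptual quality of videos generated by text-to-video diffusion models, in particular inter-frame consistency and alignment. The key is a diffusion latent beam search algorithm with a lookahead estimator, which selects better diffusion latents by maximizing an alignment reward at inference time. The paper also points out that achieving better perceptual quality requires reward calibration by weighting existing metrics. The method improves the naturalness of generated videos without updating model parameters and outperforms greedy search and best-of-N sampling.
Link: https://arxiv.org/abs/2501.19252
Authors: Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL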
Abstract:The remarkable progress in text-to-video diffusion models enables photorealistic generations, although the contents of the generated video often include unnatural movement or deformation, reverse playback, and motionless scenes. Recently, the alignment problem, in which the output of diffusion models is steered by some measure of the goodness of the content, has attracted huge attention. Because there is much room for improving perceptual quality along the frame direction, we should address which metrics to optimize and how to optimize them in video generation. In this paper, we propose diffusion latent beam search with a lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality while considering alignment to prompts requires reward calibration by weighting existing metrics. When outputs are evaluated using vision language models as a proxy for humans, many previous metrics that quantify the naturalness of video do not always correlate with the evaluation and also depend on the degree of dynamic descriptions in the evaluation prompts. We demonstrate that our method improves perceptual quality based on the calibrated reward, without model parameter updates, and outputs the best generation compared to greedy search and best-of-N sampling. We provide practical guidelines on which axes, among search budget, lookahead steps for reward estimation, and denoising steps in the reverse diffusion process, the inference-time computation should be allocated to.
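The search procedure named in this abstract can be pictured with a generic beam search over latents. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `propose_fn` (returns candidate next latents, e.g. one per noise sample), `lookahead_fn` (a cheap estimate of the final sample from an intermediate latent), and `reward_fn` (the calibrated alignment reward) are hypothetical callables introduced for illustration.

```python
def latent_beam_search(init_latent, num_steps, beam_width,
                       propose_fn, lookahead_fn, reward_fn):
    """Generic beam search: at every denoising step, keep the `beam_width`
    candidate latents whose lookahead estimates receive the highest reward.
    All three callables are hypothetical stand-ins, not a real library API."""
    beams = [init_latent]
    for t in range(num_steps):
        scored = []
        for latent in beams:
            for candidate in propose_fn(latent, t):
                estimate = lookahead_fn(candidate, t)  # guess of the final sample
                scored.append((reward_fn(estimate), candidate))
        # Highest reward first; Python's sort is stable, so ties keep proposal order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        beams = [candidate for _, candidate in scored[:beam_width]]
    return beams[0]  # best-scoring latent after the last step
```

With `beam_width=1` this reduces to greedy search, which is one of the baselines the abstract compares against; the search budget, lookahead depth, and number of denoising steps are the three knobs the abstract says must share the inference-time compute.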
[CV-12] Accelerating Diffusion Transformer via Error-Optimized Cache
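[Quick read]: This paper addresses the accumulation of caching errors that arises when caching is used to speed up the slow sampling of the Diffusion Transformer (DiT) during content generation. Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping the subsequent computation, but they tend to focus on locating and caching low-error modules and neglect reducing the errors that caching causes, so the quality of the generated content drops sharply. To solve this, the paper proposes the Error-Optimized Cache (EOC). Its key improvements are: (1) extracting and processing prior knowledge of caching differences; (2) a judgment method that decides whether the cache needs to be optimized; (3) an optimization strategy that reduces caching errors. Experiments show that the algorithm significantly reduces the error accumulation caused by over-caching and improves the quality of generated images on the ImageNet dataset.
Link: https://arxiv.org/abs/2501.19243
Authors: Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, Yanbin Hao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: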
Abstract:Diffusion Transformer (DiT) is a crucial method for content generation. However, it needs a lot of time to sample. Many studies have attempted to use caching to reduce the time consumption of sampling. Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping calculations in the next, but they tend to locate and cache low-error modules without focusing on reducing caching-induced errors, resulting in a sharp decline in generated content quality when increasing caching intensity. To solve this problem, we propose the Error-Optimized Cache (EOC). This method introduces three key improvements: (1) Prior knowledge extraction: Extract and process the caching differences; (2) A judgment method for cache optimization: Determine whether certain caching steps need to be optimized; (3) Cache optimization: reduce caching errors. Experiments show that this algorithm significantly reduces the error accumulation caused by caching (especially over-caching). On the ImageNet dataset, without significantly increasing the computational burden, this method improves the quality of the generated images under the over-caching, rule-based, and training-based methods. Specifically, the Fréchet Inception Distance (FID) values are improved as follows: from 6.857 to 5.821, from 3.870 to 3.692 and from 3.539 to 3.451 respectively.
[CV-13] Integrating Semi-Supervised and Active Learning for Semantic Segmentation
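[Quick read]: This paper aims to reduce the cost of manual annotation and improve model performance by proposing an active learning approach integrated with an improved semi-supervised learning framework. The key is to select labelled data through active learning and to use a pseudo-label auto-refinement (PLAR) module to correct pixels whose pseudo-labels are inconsistent with their feature representations. Based on the cluster assumption, the method operates without increasing the labelling budget, and manual labelling is applied only to the most difficult and uncertain regions of the unlabelled data.
Link: https://arxiv.org/abs/2501.19227
Authors: Wanli Ma, Oktay Karakus, Paul L. Rosin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: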
Abstract:In this paper, we propose a novel active learning approach integrated with an improved semi-supervised learning framework to reduce the cost of manual annotation and enhance model performance. Our proposed approach effectively leverages both the labelled data selected through active learning and the unlabelled data excluded from the selection process. The proposed active learning approach pinpoints areas where the pseudo-labels are likely to be inaccurate. Then, an automatic and efficient pseudo-label auto-refinement (PLAR) module is proposed to correct pixels with potentially erroneous pseudo-labels by comparing their feature representations with those of labelled regions. This approach operates without increasing the labelling budget and is based on the cluster assumption, which states that pixels belonging to the same class should exhibit similar representations in feature space. Furthermore, manual labelling is only applied to the most difficult and uncertain areas in unlabelled data, where insufficient information prevents the PLAR module from making a decision. We evaluated the proposed hybrid semi-supervised active learning framework on two benchmark datasets, one from natural and the other from remote sensing imagery domains. In both cases, it outperformed state-of-the-art methods in the semantic segmentation task.
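The cluster assumption behind the PLAR module can be illustrated with a nearest-prototype rule. This is a hedged sketch, not the paper's algorithm: the abstract only says that suspect pixels are compared with the features of labelled regions, so the use of per-class mean features, Euclidean distance, and a `margin` threshold are all assumptions made for illustration, as are the function names.

```python
import math

def class_prototypes(features, labels):
    """Mean feature vector per class, computed from labelled pixels."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def refine_pseudo_label(feature, prototypes, margin=0.5):
    """Return the class of the nearest prototype, or None when the two
    closest prototypes are nearly equidistant (an assumed stand-in for the
    'insufficient information' case that is left to manual labelling)."""
    dists = sorted((math.dist(feature, proto), label)
                   for label, proto in prototypes.items())
    if len(dists) > 1 and dists[1][0] - dists[0][0] < margin:
        return None
    return dists[0][1]
```

For example, with prototypes built from two well-separated classes, a pixel feature close to one prototype is relabelled automatically, while a feature midway between the two returns `None` and would be a candidate for the manual labelling budget.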
[CV-14] RaySplats: Ray Tracing based Gaussian Splatting
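[Quick read]: This paper addresses the difficulty of incorporating light and shadow effects when rendering with 3D Gaussian Splatting (3DGS). To tackle this, it proposes a new model called RaySplats, which performs ray-tracing based Gaussian Splatting. The key is a ray-tracing mechanism that operates directly on Gaussian primitives represented as confidence ellipses with RGB colors, computing the intersections between ellipses and rays to build the ray-tracing algorithm. This makes it possible to combine Gaussian Splatting models with meshes and to add lights, shadows, and related effects.
Link: https://arxiv.org/abs/2501.19196
Authors: Krzysztof Byrski, Marcin Mazur, Jacek Tabor, Tadeusz Dziarmaga, Marcin Kądziołka, Dawid Baran, Przemysław Spurek
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: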
Abstract:3D Gaussian Splatting (3DGS) is a process that enables the direct creation of 3D objects from 2D images. This representation offers numerous advantages, including rapid training and rendering. However, a significant limitation of 3DGS is the challenge of incorporating light and shadow reflections, primarily due to the utilization of rasterization rather than ray tracing for rendering. This paper introduces RaySplats, a model that employs ray-tracing based Gaussian Splatting. Rather than utilizing the projection of Gaussians, our method employs a ray-tracing mechanism, operating directly on Gaussian primitives represented by confidence ellipses with RGB colors. In practice, we compute the intersection between ellipses and rays to construct ray-tracing algorithms, facilitating the incorporation of meshes with Gaussian Splatting models and the addition of lights, shadows, and other related effects.
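The geometric core mentioned here, intersecting a ray with a confidence ellipsoid, reduces to a quadratic equation. The sketch below assumes an axis-aligned ellipsoid given by a center and three semi-axis lengths; a rotated Gaussian would first require transforming the ray into the ellipsoid's local frame. The function name and parameters are illustrative and are not taken from the RaySplats code.

```python
import math

def ray_ellipsoid_intersection(origin, direction, center, radii):
    """Return (t_near, t_far) such that origin + t * direction lies on the
    axis-aligned ellipsoid, or None if the ray misses it."""
    # Rescale space so that the ellipsoid becomes the unit sphere at the origin.
    o = [(origin[i] - center[i]) / radii[i] for i in range(3)]
    d = [direction[i] / radii[i] for i in range(3)]
    # Solve |o + t d|^2 = 1, i.e. a t^2 + b t + c = 0.
    a = sum(x * x for x in d)
    b = 2.0 * sum(o[i] * d[i] for i in range(3))
    c = sum(x * x for x in o) - 1.0
    disc = b * b - 4.0 * a * c
    if a == 0.0 or disc < 0.0:
        return None  # degenerate direction or no real intersection
    root = math.sqrt(disc)
    t_near = (-b - root) / (2.0 * a)
    t_far = (-b + root) / (2.0 * a)
    if t_far < 0.0:
        return None  # the ellipsoid lies entirely behind the ray origin
    return t_near, t_far
```

A ray starting at (0, 0, -5) and pointing along +z enters an ellipsoid with semi-axes (1, 2, 3) centered at the origin at t = 2 and leaves it at t = 8, which matches the surface points z = -3 and z = 3.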
[CV-15] A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches
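[Quick read]: This paper addresses class-agnostic counting (CAC), that is, counting objects of arbitrary categories without prior knowledge of them. The key contribution is a review that classifies CAC methods into three paradigms: reference-based, reference-less, and open-world text-guided. These rely, respectively, on exemplar-guided mechanisms, inherent image patterns, and vision-language models to count objects of classes never seen during training, typically in a few-shot setting, overcoming the reliance of traditional methods on large labeled datasets.
Link: https://arxiv.org/abs/2501.19184
Authors: Luca Ciampi, Ali Azmoudeh, Elif Ecem Akbaba, Erdi Sarıtaş, Ziya Ata Yazıcı, Hazım Kemal Ekenel, Giuseppe Amato, Fabrizio Falchi
Affiliations: CNR-ISTI, Pisa, Italy; Istanbul Technical University, Türkiye; Division of Engineering, NYU Abu Dhabi, UAE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: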
Abstract:Object counting has recently shifted towards class-agnostic counting (CAC), which addresses the challenge of counting objects across arbitrary categories, tackling a critical need in versatile counting systems. While humans effortlessly identify and count objects from diverse categories without prior knowledge, most counting methods remain restricted to enumerating instances of known classes, requiring extensive labeled datasets for training, and struggling under open-vocabulary settings. Conversely, CAC aims to count objects belonging to classes never seen during training, typically operating in a few-shot setting. In this paper, for the first time, we review advancements in CAC methodologies, categorizing them into three paradigms based on how target object classes can be specified: reference-based, reference-less, and open-world text-guided. Reference-based approaches have set performance benchmarks using exemplar-guided mechanisms. Reference-less methods eliminate exemplar dependency by leveraging inherent image patterns. Finally, open-world text-guided methods utilize vision-language models, enabling object class descriptions through textual prompts, representing a flexible and appealing solution. We analyze state-of-the-art techniques and we report their results on existing gold standard benchmarks, comparing their performance and identifying and discussing their strengths and limitations. Persistent challenges – such as annotation dependency, scalability, and generalization – are discussed, alongside future directions. We believe this survey serves as a valuable resource for researchers to understand the progressive developments and contributions over time and the current state-of-the-art of CAC, suggesting insights for future directions and challenges to be addressed.
[CV-16] Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs
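[Quick read]: This paper addresses object hallucination in large vision-language models (LVMs), where the model generates plausible but factually inaccurate information when processing and interpreting visual input. The key is a novel visual adversarial perturbation (VAP) method that mitigates hallucination by applying optimized visual noise without altering the base model. The method formulates hallucination suppression as an optimization problem and uses adversarial strategies to generate beneficial visual perturbations that strengthen the model’s factual grounding and reduce parametric knowledge bias.
Link: https://arxiv.org/abs/2501.19164
Authors: Kejia Zhang, Keda Tao, Jiasheng Tang, Huan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: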
Abstract:Large vision-language models (LVMs) extend large language models (LLMs) with visual perception capabilities, enabling them to process and interpret visual information. A major challenge compromising their reliability is object hallucination, in which LVMs may generate plausible but factually inaccurate information. We propose a novel visual adversarial perturbation (VAP) method to mitigate this hallucination issue. VAP alleviates LVM hallucination by applying strategically optimized visual noise without altering the base model. Our approach formulates hallucination suppression as an optimization problem, leveraging adversarial strategies to generate beneficial visual perturbations that enhance the model’s factual grounding and reduce parametric knowledge bias. Extensive experimental results demonstrate that our method consistently reduces object hallucinations across 8 state-of-the-art LVMs, validating its efficacy across diverse evaluations.
[CV-17] RMDM: Radio Map Diffusion Model with Physics Informed
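[Quick read]: This paper addresses the complex signal propagation and data sparsity that hinder radio map reconstruction in wireless communication environments. The key is the Radio Map Diffusion Model (RMDM), a physics-informed framework that integrates Physics-Informed Neural Networks (PINNs) to incorporate physical constraints such as the Helmholtz equation. RMDM uses a dual U-Net architecture: the first U-Net ensures physical consistency by minimizing PDE residuals, boundary conditions, and source constraints, and the second refines predictions through diffusion-based denoising. The approach significantly improves reconstruction accuracy, robustness, and generalization.
Link: https://arxiv.org/abs/2501.19160
Authors: Haozhe Jia, Wenshuo Chen, Zhihui Huang, Hongru Xiao, Nanqian Jia, Keming Wu, Songning Lai, Yutao Yue
Affiliations: HKUST(GZ); Shandong University; Deep Interdisciplinary Intelligence Lab; Tongji University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: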
Abstract:With the rapid development of wireless communication technology, the efficient utilization of spectrum resources, optimization of communication quality, and intelligent communication have become critical. Radio map reconstruction is essential for enabling advanced applications, yet challenges such as complex signal propagation and sparse data hinder accurate reconstruction. To address these issues, we propose the Radio Map Diffusion Model (RMDM), a physics-informed framework that integrates Physics-Informed Neural Networks (PINNs) to incorporate constraints like the Helmholtz equation. RMDM employs a dual U-Net architecture: the first ensures physical consistency by minimizing PDE residuals, boundary conditions, and source constraints, while the second refines predictions via diffusion-based denoising. By leveraging physical laws, RMDM significantly enhances accuracy, robustness, and generalization. Experiments demonstrate that RMDM outperforms state-of-the-art methods, achieving NMSE of 0.0031 and RMSE of 0.0125 under the Static RM (SRM) setting, and NMSE of 0.0047 and RMSE of 0.0146 under the Dynamic RM (DRM) setting. These results establish a novel paradigm for integrating physics-informed and data-driven approaches in radio map reconstruction, particularly under sparse data conditions.
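To make the "PDE residual" idea concrete, here is a small sketch of a Helmholtz residual on a regular 2D grid. It is an illustration under assumptions, not RMDM's loss: it uses a 5-point finite-difference Laplacian where a PINN would typically use automatic differentiation, it defines the residual as laplacian(u) + k^2 u - s (sign conventions for the source term s vary), and the function names are hypothetical.

```python
import math

def helmholtz_residual(u, k, h, source=None):
    """Residual laplacian(u) + k^2 * u - s at the interior points of a 2D grid
    with spacing h, using the standard 5-point stencil."""
    rows, cols = len(u), len(u[0])
    residual = [[0.0] * (cols - 2) for _ in range(rows - 2)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            laplacian = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                         - 4.0 * u[i][j]) / (h * h)
            s = 0.0 if source is None else source[i][j]
            residual[i - 1][j - 1] = laplacian + k * k * u[i][j] - s
    return residual

def pde_loss(u, k, h, source=None):
    """Mean squared residual, the kind of term a physics-informed loss penalizes."""
    values = [r * r for row in helmholtz_residual(u, k, h, source) for r in row]
    return sum(values) / len(values)

# A plane wave cos(k x) solves the source-free equation, so its loss is near zero.
k, h, n = 2.0, 0.01, 20
plane_wave = [[math.cos(k * j * h) for j in range(n)] for _ in range(n)]
```

Evaluating `pde_loss(plane_wave, k, h)` gives a value close to zero (only discretization error remains), whereas a constant field of ones gives exactly k^4 = 16, which is the kind of physically inconsistent prediction such a term is meant to discourage.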
[CV-18] GDO: Gradual Domain Osmosis ICML2025
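[Quick read]: This paper addresses smooth knowledge transfer in Gradual Domain Adaptation (GDA), where knowledge migration from the source domain to the target domain can be discontinuous or inefficient. The key is a method called Gradual Domain Osmosis, which designs an optimization framework based on a hyperparameter λ that dynamically balances the loss weights of the source and target domains, so that the model progressively adjusts the strength of knowledge migration during training (λ increases from 0 to 1) and achieves cross-domain generalization more efficiently. The method combines self-training to generate pseudo-labels and iteratively updates the model by minimizing a weighted loss function, ensuring stability and robustness during gradual adaptation across intermediate domains.
Link: https://arxiv.org/abs/2501.19159
Authors: Zixi Wang, Yubo Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICML 2025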
Abstract:In this paper, we propose a new method called Gradual Domain Osmosis, which aims to solve the problem of smooth knowledge migration from source domain to target domain in Gradual Domain Adaptation (GDA). Traditional Gradual Domain Adaptation methods mitigate domain bias by introducing intermediate domains and self-training strategies, but often face the challenges of inefficient knowledge migration or missing data in intermediate domains. In this paper, we design an optimisation framework based on the hyperparameter λ by dynamically balancing the loss weights of the source and target domains, which enables the model to progressively adjust the strength of knowledge migration (λ incrementing from 0 to 1) during the training process, thus achieving cross-domain generalisation more efficiently. Specifically, the method incorporates self-training to generate pseudo-labels and iteratively updates the model by minimising a weighted loss function to ensure stability and robustness during progressive adaptation in the intermediate domain. The experimental part validates the effectiveness of the method on rotated MNIST, colour-shifted MNIST, portrait dataset and forest cover type dataset, and the results show that it outperforms existing baseline methods. The paper further analyses the impact of the dynamic tuning strategy of the hyperparameter λ on the performance through ablation experiments, confirming the advantages of progressive domain penetration in mitigating the domain bias and enhancing the model generalisation capability. The study provides theoretical support and a practical framework for asymptotic domain adaptation and expands its application potential in dynamic environments.
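The weighting scheme described in this abstract can be sketched in a few lines. Two points are assumptions rather than facts from the paper: that the schedule for the hyperparameter is a linear ramp, and that (1 - lam) weights the source loss while lam weights the target (pseudo-label) loss. The abstract only states that the weight moves from 0 to 1 during training; the function names are illustrative.

```python
def lambda_schedule(step, total_steps):
    """Assumed linear ramp of the balancing weight from 0 to 1 over training."""
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))

def osmosis_loss(source_loss, target_loss, lam):
    """Convex combination of the two losses (assumed form): early in training
    the source loss dominates, late in training the target loss dominates."""
    return (1.0 - lam) * source_loss + lam * target_loss
```

For instance, with a source loss of 2.0 and a target loss of 4.0, a weight of 0.25 gives a combined loss of 2.5; at the first step the objective equals the source loss and at the last step it equals the target loss.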
[CV-19] SWAT: Sliding Window Adversarial Training for Gradual Domain Adaptation ICML2025
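[Quick read]: This paper addresses steep domain shifts in domain adaptation, which harm the performance of machine learning models. Existing Unsupervised Domain Adaptation (UDA) methods perform poorly when the shift is drastic; Gradual Domain Adaptation (GDA) alleviates this by moving from the source domain to the target domain step by step through multiple intermediate domains. The key is Sliding Window Adversarial Training (SWAT), which constructs adversarial streams to connect the feature spaces of the source and target domains and designs a sliding window paradigm that moves along the stream to gradually narrow the small gaps between adjacent intermediate domains. When the window reaches the end of the stream, i.e., the target domain, the domain shift is drastically reduced.
Link: https://arxiv.org/abs/2501.19155
Authors: Zixi Wang, Yubo Huang, Wenwei Luo, Tonglan Xie, Mengmeng Jing, Lin Zuo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to ICML 2025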
Abstract:Domain shifts are critical issues that harm the performance of machine learning. Unsupervised Domain Adaptation (UDA) mitigates this issue but suffers when the domain shifts are steep and drastic. Gradual Domain Adaptation (GDA) alleviates this problem in a mild way by gradually adapting from the source to the target domain using multiple intermediate domains. In this paper, we propose Sliding Window Adversarial Training (SWAT) for Gradual Domain Adaptation. SWAT uses the construction of adversarial streams to connect the feature spaces of the source and target domains. In order to gradually narrow the small gap between adjacent intermediate domains, a sliding window paradigm is designed that moves along the adversarial stream. When the window moves to the end of the stream, i.e., the target domain, the domain shift is drastically reduced. Extensive experiments are conducted on public GDA benchmarks, and the results demonstrate that the proposed SWAT significantly outperforms the state-of-the-art approaches. The implementation is available at: this https URL.
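[CV-20] Improving Multi-Label Contrastive Learning by Leveraging Label Distribution
[Quick read]: This paper addresses a key challenge in using contrastive learning to learn better representations for multi-label learning: selecting positive and negative samples and making effective use of label information. It proposes a new method that improves multi-label contrastive learning through label distribution. The key is that positive and negative samples are selected simply by checking whether the label sets intersect, and that label distributions are recovered from logical labels by two methods, based on the Radial Basis Function (RBF) and contrastive loss respectively, to model the relationships between labels. Experiments show that the method outperforms state-of-the-art methods on six evaluation metrics.
Link: https://arxiv.org/abs/2501.19145
Authors: Ning Chen, Shen-Huan Lyu, Tian-Shuang Wu, Yanyan Wang, Bin Tang
Affiliations: Key Laboratory of Water Big Data Technology of Ministry of Water Resources, College of Computer Science and Software Engineering, Hohai University, Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: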
Abstract:In multi-label learning, leveraging contrastive learning to learn better representations faces a key challenge: selecting positive and negative samples and effectively utilizing label information. Previous studies selected positive and negative samples based on the overlap between labels and used them for label-wise loss balancing. However, these methods suffer from a complex selection process and fail to account for the varying importance of different labels. To address these problems, we propose a novel method that improves multi-label contrastive learning through label distribution. Specifically, when selecting positive and negative samples, we only need to consider whether there is an intersection between labels. To model the relationships between labels, we introduce two methods to recover label distributions from logical labels, based on Radial Basis Function (RBF) and contrastive loss, respectively. We evaluate our method on nine widely used multi-label datasets, including image and vector datasets. The results demonstrate that our method outperforms state-of-the-art methods in six evaluation metrics.
[CV-21] Imitation Game for Adversarial Disillusion with Multimodal Generative Chain-of-Thought Role-Play
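[Quick read]: This paper addresses the fundamental threat that adversarial illusions pose to machine perception. Adversarial illusions take two main forms, deductive and inductive, which affect model behaviour by exploiting the model’s decision boundaries and by embedding a backdoor during the learning phase, respectively. The key is a disillusion paradigm based on the concept of an imitation game, centred on a multimodal generative agent steered by chain-of-thought reasoning that observes, internalises, and reconstructs the semantic essence of a sample rather than trying to restore the sample to its original state. The method is evaluated through experimental simulations under a variety of attack scenarios.
Link: https://arxiv.org/abs/2501.19143
Authors: Ching-Chun Chang, Fan-Yun Chen, Shih-Hong Gu, Kai Gao, Hanrui Wang, Isao Echizen
Affiliations: Information and Society Research Division, National Institute of Informatics, Tokyo, Japan; Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: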
Abstract:As the cornerstone of artificial intelligence, machine perception confronts a fundamental threat posed by adversarial illusions. These adversarial attacks manifest in two primary forms: deductive illusion, where specific stimuli are crafted based on the victim model’s general decision logic, and inductive illusion, where the victim model’s general decision logic is shaped by specific stimuli. The former exploits the model’s decision boundaries to create a stimulus that, when applied, interferes with its decision-making process. The latter reinforces a conditioned reflex in the model, embedding a backdoor during its learning phase that, when triggered by a stimulus, causes aberrant behaviours. The multifaceted nature of adversarial illusions calls for a unified defence framework, addressing vulnerabilities across various forms of attack. In this study, we propose a disillusion paradigm based on the concept of an imitation game. At the heart of the imitation game lies a multimodal generative agent, steered by chain-of-thought reasoning, which observes, internalises and reconstructs the semantic essence of a sample, liberated from the classic pursuit of reversing the sample to its original state. As a proof of concept, we conduct experimental simulations using a multimodal generative dialogue agent and evaluate the methodology under a variety of attack scenarios.
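[CV-22] Transformation trees – documentation of multimodal image registration
[Quick read]: This paper addresses the documentation of transformations obtained by registering multimodal medical images acquired in different coordinate systems, and proposes organizing these transformations in a tree structure. The key is the introduction of a dedicated file format, .dpw (digital patient workspace), and the use of the dpVision software to illustrate registration examples from orthodontic analysis that show the main aspects of using the tree structure.
Link: https://arxiv.org/abs/2501.19140
Authors: Agnieszka Anna Tomaka, Dariusz Pojda, Michał Tarnawski, Leszek Luchowski
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures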
Abstract:The paper presents proposals for the application of a tree structure to the documentation of a set of transformations obtained as a result of various registrations of multimodal images obtained in coordinate systems associated with acquisition devices and being registered in one patient-specific coordinate system. A special file format .dpw (digital patient workspace) is introduced. Examples of different registrations yielded from orthodontic analysis and showing main aspects of the usage of tree structure are illustrated in dpVision software.
[CV-23] RGB-Event ISP: The Dataset and Benchmark ICLR2025
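[Quick read]: This paper addresses the use of event-guided imaging within the image signal processor (ISP). Existing methods mainly enhance RGB images in post-processing, overlooking both the challenges an ISP faces with an event sensor and the benefits that events offer. The key is the first study of event-guided ISP, which builds a new event-RAW paired dataset, designs a conventional ISP pipeline to generate reference RGB frames, classifies existing learnable ISP methods and trains and evaluates them, and proposes a simple event-guided ISP method.
Link: https://arxiv.org/abs/2501.19129
Authors: Yunfan Lu, Yanlin Qian, Ziyang Rao, Junren Xiao, Liming Chen, Hui Xiong
Affiliations: AI Thrust, HKUST(GZ); AlpsenTek
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by ICLR 2025; 14 pages, 8 figures, 4 tables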
Abstract:Event-guided imaging has received significant attention due to its potential to revolutionize instant imaging systems. However, the prior methods primarily focus on enhancing RGB images in a post-processing manner, neglecting the challenges of image signal processor (ISP) dealing with event sensor and the benefits events provide for reforming the ISP process. To achieve this, we conduct the first research on event-guided ISP. First, we present a new event-RAW paired dataset, collected with a novel but still confidential sensor that records pixel-level aligned events and RAW images. This dataset includes 3373 RAW images with 2248 x 3264 resolution and their corresponding events, spanning 24 scenes with 3 exposure modes and 3 lenses. Second, we propose a conventional ISP pipeline to generate good RGB frames as reference. This conventional ISP pipeline performs basic ISP operations, this http URL, white balancing, denoising and color space transforming, with a ColorChecker as reference. Third, we classify the existing learnable ISP methods into 3 classes, and select multiple methods to train and evaluate on our new dataset. Lastly, since there is no prior work for reference, we propose a simple event-guided ISP method and test it on our dataset. We further put forward key technical challenges and future directions in RGB-Event ISP. In summary, to the best of our knowledge, this is the very first research focusing on event-guided ISP, and we hope it will inspire the community. The code and dataset are available at: this https URL.
[CV-24] A Benchmark for Incremental Micro-expression Recognition
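[Quick read]: This paper addresses the need for micro-expression recognition to adapt to continuously evolving data streams in real-world scenarios, that is, to learn from new training data while retaining previously learned knowledge. The key is the first benchmark designed specifically for incremental micro-expression recognition, which formulates the incremental learning setting, organizes sequential datasets with carefully curated learning orders, defines two cross-evaluation-based testing protocols, and provides six baseline methods with their evaluation results. The benchmark lays the groundwork for further research on incremental micro-expression recognition.
Link: https://arxiv.org/abs/2501.19111
Authors: Zhengqin Lai, Xiaopeng Hong, Yabin Wang, Xiaobai Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: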
Abstract:Micro-expression recognition plays a pivotal role in understanding hidden emotions and has applications across various fields. Traditional recognition methods assume access to all training data at once, but real-world scenarios involve continuously evolving data streams. To respond to the requirement of adapting to new data while retaining previously learned knowledge, we introduce the first benchmark specifically designed for incremental micro-expression recognition. Our contributions include: Firstly, we formulate the incremental learning setting tailored for micro-expression recognition. Secondly, we organize sequential datasets with carefully curated learning orders to reflect real-world scenarios. Thirdly, we define two cross-evaluation-based testing protocols, each targeting distinct evaluation objectives. Finally, we provide six baseline methods and their corresponding evaluation results. This benchmark lays the groundwork for advancing incremental micro-expression recognition research. All code used in this study will be made publicly available.
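[CV-25] ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
[Quick read]: This paper addresses the information loss that current video-language models suffer in long-video understanding because of limited context lengths and sparse frame subsampling. The key is the ∞-Video framework, which uses a continuous-time long-term memory (LTM) consolidation mechanism to let video Q-formers process videos of arbitrary length efficiently and without additional training. Through continuous attention, the method dynamically allocates higher granularity to the most relevant video segments, forming “sticky” memories that evolve over time.
Link: https://arxiv.org/abs/2501.19098
Authors: Saul Santos, António Farinhas, Daniel C. McNamee, André F. T. Martins
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures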
Abstract:Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, often leading to information loss. This paper introduces ∞-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming “sticky” memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.
[CV-26] Ambient Denoising Diffusion Generative Adversarial Networks for Establishing Stochastic Object Models from Noisy Image Data
[Quick read]: This paper addresses the problem of establishing stochastic object models (SOMs) from noisy medical image data. The key is an augmented denoising diffusion GAN architecture, Ambient DDGAN (ADDGAN), which can learn realistic SOMs from noisy image data. Compared with earlier approaches based on AmbientGAN, an augmented generative adversarial network, ADDGAN permits fast image generation while maintaining high image quality, and shows a clear advantage when synthesizing high-resolution medical images with complex textures.
Link: https://arxiv.org/abs/2501.19094
Authors: Xichen Xu, Wentao Chen, Weimin Zhou
Affiliations: Shanghai Jiao Tong University; University of Michigan-Shanghai Jiao Tong University Joint Institute; Wyant College of Optical Sciences, University of Arizona; Department of Medical Imaging, University of Arizona
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: SPIE Medical Imaging 2025
Abstract:It is widely accepted that medical imaging systems should be objectively assessed via task-based image quality (IQ) measures that ideally account for all sources of randomness in the measured image data, including the variation in the ensemble of objects to be imaged. Stochastic object models (SOMs) that can randomly draw samples from the object distribution can be employed to characterize object variability. To establish realistic SOMs for task-based IQ analysis, it is desirable to employ experimental image data. However, experimental image data acquired from medical imaging systems are subject to measurement noise. Previous work investigated the ability of deep generative models (DGMs) that employ an augmented generative adversarial network (GAN), AmbientGAN, for establishing SOMs from noisy measured image data. Recently, denoising diffusion models (DDMs) have emerged as a leading DGM for image synthesis and can produce higher image quality than GANs. However, original DDMs possess a slow image-generation process because of the Gaussian assumption in the denoising steps. More recently, denoising diffusion GAN (DDGAN) was proposed to permit fast image generation while maintaining high generated image quality that is comparable to the original DDMs. In this work, we propose an augmented DDGAN architecture, Ambient DDGAN (ADDGAN), for learning SOMs from noisy image data. Numerical studies that consider clinical computed tomography (CT) images and digital breast tomosynthesis (DBT) images are conducted. The ability of the proposed ADDGAN to learn realistic SOMs from noisy image data is demonstrated. It has been shown that the ADDGAN significantly outperforms the advanced AmbientGAN models for synthesizing high resolution medical images with complex textures.
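[CV-27] JGHand: Joint-Driven Animatable Hand Avater via 3D Gaussian Splatting
[Quick read]: This paper addresses real-time rendering and driving of high-fidelity digital hand models, in particular keeping rendering quality high across poses and characters while remaining interactive and real-time. The key to the solution is Jointly 3D Gaussian Hand (JGHand), a new representation based on 3D Gaussian Splatting (3DGS). A differentiable spatial-transformation process driven by 3D key points deforms the canonical template to arbitrary bone lengths and poses. A real-time shadow simulation method based on per-pixel depth models the self-occlusion shadows caused by finger movements. Together these yield an animatable 3DGS hand representation driven solely by 3D key points, achieving real-time rendering without sacrificing quality.
Link: https://arxiv.org/abs/2501.19088
Authors: Zhoutao Sun, Xukun Shen, Yong Hu, Yuyou Zhong, Xueyang Zhou
Affiliations: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: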
Abstract:Since hands are the primary interface in daily interactions, modeling high-quality digital human hands and rendering realistic images is a critical research problem. Furthermore, considering the requirements of interactive and rendering applications, it is essential to achieve real-time rendering and driveability of the digital model without compromising rendering quality. Thus, we propose Jointly 3D Gaussian Hand (JGHand), a novel joint-driven 3D Gaussian Splatting (3DGS)-based hand representation that renders high-fidelity hand images in real-time for various poses and characters. Distinct from existing articulated neural rendering techniques, we introduce a differentiable process for spatial transformations based on 3D key points. This process supports deformations from the canonical template to a mesh with arbitrary bone lengths and poses. Additionally, we propose a real-time shadow simulation method based on per-pixel depth to simulate self-occlusion shadows caused by finger movements. Finally, we embed the hand prior and propose an animatable 3DGS representation of the hand driven solely by 3D key points. We validate the effectiveness of each component of our approach through comprehensive ablation studies. Experimental results on public datasets demonstrate that JGHand achieves real-time rendering speeds with enhanced quality, surpassing state-of-the-art methods.
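[CV-28] Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification
[Quick read]: This paper examines the fairness of Contrastive Language-Image Pretraining (CLIP) style models for X-ray image classification, in particular performance differences across patient demographics and disease categories. The study evaluates performance and fairness using zero-shot inference and several fine-tuning techniques: Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. The results show that fine-tuning improves accuracy but fairness concerns persist, underscoring the need for further fairness interventions in these foundation models.
Link: https://arxiv.org/abs/2501.19086
Authors: Xiangyu Sun, Xiaoguang Zou, Yuanquan Wu, Guotai Wang, Shaoting Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation at the 2025 IEEE International Symposium on Biomedical Imaging (ISBI 2025)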
Abstract:X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models, such as the Contrastive Language-Image Pretraining (CLIP) model, have demonstrated potential in improving diagnostic accuracy by leveraging large-scale image-text datasets. However, since CLIP was not initially designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness - particularly regarding demographic attributes - remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundational models.
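As a rough illustration of what a group-wise fairness analysis measures, the sketch below computes per-group accuracy and the largest accuracy gap between demographic groups. The function names, the toy records, and the choice of accuracy gap as the metric are assumptions made for illustration; the abstract does not state which fairness metrics the paper reports.

```python
from collections import defaultdict


def group_accuracy(records):
    """Return accuracy per demographic group.

    records: iterable of (group, y_true, y_pred) tuples (hypothetical format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(records):
    """Largest difference in accuracy between any two groups."""
    acc = group_accuracy(records)
    return max(acc.values()) - min(acc.values())


# Toy predictions, invented for this sketch only.
toy = [
    ("female", 1, 1), ("female", 0, 0), ("female", 1, 0), ("female", 0, 0),
    ("male", 1, 1), ("male", 0, 0), ("male", 1, 1), ("male", 0, 0),
]

if __name__ == "__main__":
    print(group_accuracy(toy))
    print(accuracy_gap(toy))
```

A gap of zero would mean the classifier is equally accurate for every group in the toy data; a fine-tuned model can raise every group's accuracy and still leave this gap unchanged, which is the kind of outcome the abstract describes.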
[CV-29] Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields
[Quick read]: This paper addresses efficient and precise text-guided segmentation of 3D scenes. The key is an adapter module combined with a self-cross-training strategy that mitigates noise in dense CLIP feature distillation. A low-rank transient query attention mechanism improves the accuracy of segmentation edges, and recasting segmentation as a classification task improves consistency in color-similar regions. Finally, a simplified text augmentation strategy alleviates ambiguity in the correspondence between CLIP features and text. Together these improve training speed and performance beyond the current state of the art.
Link: https://arxiv.org/abs/2501.19084
Authors: Xingyu Miao, Haoran Duan, Yang Bai, Tejal Shah, Jun Song, Yang Long, Rajiv Ranjan, Ling Shao
Affiliations: Durham University, UK; Institute of High Performance Computing (IHPC), A*STAR, Singapore; School of Computer Science, China University of Geosciences, Wuhan, 430074, P. R. China; School of Computing, Newcastle University, UK; UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, Beijing 100049, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract:In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segmentation of 3D scenes using text. To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to enhance the accuracy of segmentation edges, this work presents a low-rank transient query attention mechanism. To ensure the consistency of segmentation for similar colors under different viewpoints, we convert the segmentation task into a classification task through label volume, which significantly improves the consistency of segmentation in color-similar areas. We also propose a simplified text augmentation strategy to alleviate the issue of ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art technologies in both training speed and performance. Our code is available on: this https URL.
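[CV-30] MotionPCM: Real-Time Motion Synthesis with Phased Consistency Model
[Quick read]: This paper addresses the challenge of real-time text-conditioned human motion synthesis in latent space. The key to the solution is MotionPCM, an approach based on the phased consistency model that aims to improve the quality and efficiency of real-time motion synthesis in latent space. By cutting the number of sampling steps, it substantially accelerates diffusion-based synthesis, addressing the high computational complexity and many sampling steps of conventional methods.
Link: https://arxiv.org/abs/2501.19083
Authors: Lei Jiang, Ye Wei, Hao Ni
Affiliations: University College London; University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: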
Abstract:Diffusion models have become a popular choice for human motion synthesis due to their powerful generative capabilities. However, their high computational complexity and large sampling steps pose challenges for real-time applications. Fortunately, the Consistency Model (CM) provides a solution to greatly reduce the number of sampling steps from hundreds to a few, typically fewer than four, significantly accelerating the synthesis of diffusion models. However, its application to text-conditioned human motion synthesis in latent space remains challenging. In this paper, we introduce MotionPCM, a phased consistency model-based approach designed to improve the quality and efficiency of real-time motion synthesis in latent space.
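[CV-31] Improving vision-language alignment with graph spiking hybrid Networks
[Quick read]: This paper addresses the semantic gap between vision and language (VL), in particular how to capture the complex contextual relations among objects more fully. The key is a comprehensive visual semantic representation module that uses panoptic segmentation to generate coherent fine-grained semantic features, together with a novel Graph Spiking Hybrid Network (GSHN) that combines the strengths of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information. GSHN encodes both the discrete and continuous latent variables of instances and captures local and global contextual features, enriching the semantic representation. Exploiting the spatiotemporal properties of SNNs, the method also applies contrastive learning (CL) to refine the similarity-based representation of embeddings.
Link: https://arxiv.org/abs/2501.19069
Authors: Siyu Zhang, Heming Zheng, Yiming Wu, Yeming Chen
Affiliations: Tongji University; Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: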
Abstract:To bridge the semantic gap between vision and language (VL), it is necessary to develop a good alignment strategy, which includes handling semantic diversity, abstract representation of visual information, and generalization ability of models. Recent works use detector-based bounding boxes or patches with regular partitions to represent visual semantics. While current paradigms have made strides, they are still insufficient for fully capturing the nuanced contextual relations among various objects. This paper proposes a comprehensive visual semantic representation module, necessitating the utilization of panoptic segmentation to generate coherent fine-grained semantic features. Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information. Intriguingly, the model not only encodes the discrete and continuous latent variables of instances but also adeptly captures both local and global contextual features, thereby significantly enhancing the richness and diversity of semantic representations. Leveraging the spatiotemporal properties inherent in SNNs, we employ contrastive learning (CL) to enhance the similarity-based representation of embeddings. This strategy alleviates the computational overhead of the model and enriches meaningful visual representations by constructing positive and negative sample pairs. We design an innovative pre-training method, Spiked Text Learning (STL), which uses text features to improve the encoding ability of discrete semantics. Experiments show that the proposed GSHN exhibits promising results on multiple VL downstream tasks.
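[CV-32] Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations
[Quick read]: This paper addresses the vulnerability of text-to-image generative models to adversarial attacks and their inadvertent generation of unsafe, unethical content. The key is a novel framework that uses k-sparse autoencoders (k-SAEs) for efficient and interpretable concept manipulation in diffusion models. The method identifies interpretable monosemantic concepts in the latent space of text embeddings and uses them to steer generation away from or towards a given concept (e.g., nudity), or to introduce a new concept (e.g., photographic style). It requires neither retraining the base model nor LoRA adapters, preserves generation quality, and is robust to adversarial prompt manipulations.
Link: https://arxiv.org/abs/2501.19066
Authors: Dahye Kim, Deepti Ghadiyaram
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 16 figures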
Abstract:Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we demonstrate that our approach is very simple, requires no retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of 20.01% in unsafe concept removal, is effective in style manipulation, and is ~5x faster than current state-of-the-art.
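[CV-33] EgoMe: Follow Me via Egocentric View in Real World
[Quick read]: This paper addresses how robots can make better use of human cognitive behavior in imitation learning. Current research either records the same behavior simultaneously with multiple cameras from different views or deals with unpaired egocentric and exocentric views, and so does not fully exploit human cognitive behavior in the real world. To fill this gap, the paper introduces EgoMe, a new large-scale egocentric dataset with 7902 video pairs (15804 videos) of everyday behaviors in real-world scenarios. In each pair, one video records the imitator observing the demonstrator's actions from an exocentric view, and the other records the imitator then performing those actions from an egocentric view. The dataset also includes exo-ego eye gaze, angular velocity, acceleration, magnetic strength, and other multimodal sensor data to help link the observing and following processes. The paper further proposes eight challenging benchmark tasks to exploit this resource and advance research on robot imitation learning.
Link: https://arxiv.org/abs/2501.19061
Authors: Heqian Qiu, Zhaofeng Shi, Lanxiao Wang, Huiyu Xiong, Xiang Li, Hongliang Li
Affiliations: University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: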
Abstract:When interacting with the real world, humans often take the egocentric (first-person) view as a benchmark, naturally transferring behaviors observed from an exocentric (third-person) view to their own. This cognitive theory provides a foundation for researching how robots can more effectively imitate human behavior. However, current research either employs multiple cameras with different views focusing on the same individual’s behavior simultaneously or encounters unpaired ego-exo view scenarios; there is no effort to fully exploit human cognitive behavior in the real world. To fill this gap, in this paper, we introduce a novel large-scale egocentric dataset, called EgoMe, which aims to follow the process of human imitation learning via the egocentric view in the real world. Our dataset includes 7902 pairs of videos (15804 videos) for diverse daily behaviors in real-world scenarios. For a pair of videos, one video captures an exocentric view of the imitator observing the demonstrator’s actions, while the other captures an egocentric view of the imitator subsequently following those actions. Notably, our dataset also contains exo-ego eye gaze, angular velocity, acceleration, magnetic strength and other multi-modal sensor data to assist in establishing correlations between the observing and following processes. In addition, we also propose eight challenging benchmark tasks for fully leveraging this data resource and promoting the research of robot imitation learning ability. Extensive statistical analysis demonstrates significant advantages compared to existing datasets. The proposed EgoMe dataset and benchmark will be released soon.
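[CV-34] Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment
[Quick read]: This paper addresses the mismatch between confidence scores and actual accuracy on unseen classes that fine-tuned vision-language models (VLMs) exhibit in open-vocabulary classification. Existing confidence calibration methods usually require training parameters or analyzing features of the training dataset, which limits their ability to generalize to unseen classes that have no corresponding training data. VLM-specific calibration methods, in turn, rely only on text features from the training classes as calibration indicators, which limits their ability to calibrate the training classes themselves. The key to the solution is Contrast-Aware Calibration (CAC), which builds on CLIP's zero-shot adaptability and derives calibration weights from the contrastive difference between the original and fine-tuned CLIP. The method calibrates unseen classes and also overcomes the inability of earlier VLM calibration methods to calibrate the training classes.
Link: https://arxiv.org/abs/2501.19060
Authors: Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2402.04655 by other authors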
Abstract:Vision-language models (VLMs), such as CLIP, have demonstrated exceptional generalization capabilities and can quickly adapt to downstream tasks through prompt fine-tuning. Unfortunately, in classification tasks involving non-training classes, known as open-vocabulary setting, fine-tuned VLMs often overfit to train classes, resulting in a misalignment between confidence scores and actual accuracy on unseen classes, which significantly undermines their reliability in real-world deployments. Existing confidence calibration methods typically require training parameters or analyzing features from the training dataset, restricting their ability to generalize unseen classes without corresponding train data. Moreover, VLM-specific calibration methods rely solely on text features from train classes as calibration indicators, which inherently limits their ability to calibrate train classes. To address these challenges, we propose an effective multimodal calibration method Contrast-Aware Calibration (CAC). Building on the original CLIP’s zero-shot adaptability and the conclusion from empirical analysis that poor intra-class and inter-class discriminative ability on unseen classes is the root cause, we calculate calibration weights based on the contrastive difference between the original and fine-tuned CLIP. This method not only adapts to calibrating unseen classes but also overcomes the limitations of previous VLM calibration methods that could not calibrate train classes. In experiments involving 11 datasets with 5 fine-tuning methods, CAC consistently achieved the best calibration effect on both train and unseen classes without sacrificing accuracy and inference speed.
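[CV-35] Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models
[Quick read]: This paper addresses shortcomings in converting textual descriptions into CAD parametric sequences. Current methods are supervised mainly by a single sequential signal (ground-truth parametric sequences) and do not exploit the inherently multimodal nature of CAD models, namely the relation between parametric sequences and the visual objects they render. The key is CADFusion, a framework that alternates between a sequential learning (SL) stage and a visual feedback (VF) stage, so that the model learns to generate logically coherent parametric sequences while also learning to produce visually preferred rendered objects. Alternating the two stages balances the benefits of both signals and significantly improves Text-to-CAD performance.
Link: https://arxiv.org/abs/2501.19054
Authors: Ruiyu Wang, Yu Yuan, Shizhao Sun, Jiang Bian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: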
Abstract:Creating Computer-Aided Design (CAD) models requires significant expertise and effort. Text-to-CAD, which converts textual descriptions into CAD parametric sequences, is crucial in streamlining this process. Recent studies have utilized ground-truth parametric sequences, known as sequential signals, as supervision to achieve this goal. However, CAD models are inherently multimodal, comprising parametric sequences and corresponding rendered visual objects. Besides, the rendering process from parametric sequences to visual objects is many-to-one. Therefore, both sequential and visual signals are critical for effective training. In this work, we introduce CADFusion, a framework that uses Large Language Models (LLMs) as the backbone and alternates between two training stages: the sequential learning (SL) stage and the visual feedback (VF) stage. In the SL stage, we train LLMs using ground-truth parametric sequences, enabling the generation of logically coherent parametric sequences. In the VF stage, we reward parametric sequences that render into visually preferred objects and penalize those that do not, allowing LLMs to learn how rendered visual objects are perceived and evaluated. These two stages alternate throughout the training, ensuring balanced learning and preserving benefits of both signals. Experiments demonstrate that CADFusion significantly improves performance, both qualitatively and quantitatively.
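[CV-36] Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing
[Quick read]: This paper addresses the limitation that image time series retrieval (ITSR) methods in remote sensing are designed for unimodal retrieval, which restricts their usability and versatility. It introduces, for the first time in remote sensing, a self-supervised cross-modal text-ITSR method that retrieves image time series using text sentences as queries, and vice versa. The method has two key components: 1) modality-specific encoders that model the semantic content of bitemporal images and text sentences with discriminative features; and 2) modality-specific projection heads that align textual and image representations in a shared embedding space. To model the temporal information in bitemporal images, it proposes two fusion strategies: i) a global feature fusion (GFF) strategy that combines global image features through simple yet effective operators; and ii) a transformer-based feature fusion (TFF) strategy that uses transformers for fine-grained temporal integration.
Link: https://arxiv.org/abs/2501.19043
Authors: Genc Hoxha, Olivér Angyal, Begüm Demir
Affiliations: Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin; Berlin Institute for the Foundations of Learning and Data (BIFOLD)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: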
Abstract:The development of image time series retrieval (ITSR) methods is a growing research interest in remote sensing (RS). Given a user-defined image time series (i.e., the query time series), the ITSR methods search and retrieve from large archives the image time series that have similar content to the query time series. The existing ITSR methods in RS are designed for unimodal retrieval problems, limiting their usability and versatility. To overcome this issue, as a first time in RS we introduce the task of cross-modal text-ITSR. In particular, we present a self-supervised cross-modal text-image time series retrieval (text-ITSR) method that enables the retrieval of image time series using text sentences as queries, and vice versa. In detail, we focus our attention on text-ITSR in pairs of images (i.e., bitemporal images). The proposed text-ITSR method consists of two key components: 1) modality-specific encoders to model the semantic content of bitemporal images and text sentences with discriminative features; and 2) modality-specific projection heads to align textual and image representations in a shared embedding space. To effectively model the temporal information within the bitemporal images, we introduce two fusion strategies: i) global feature fusion (GFF) strategy that combines global image features through simple yet effective operators; and ii) transformer-based feature fusion (TFF) strategy that leverages transformers for fine-grained temporal integration. Extensive experiments conducted on two benchmark RS archives demonstrate the effectiveness of the proposed method in accurately retrieving semantically relevant bitemporal images (or text sentences) to a query text sentence (or bitemporal image). The code of this work is publicly available at this https URL.
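The global feature fusion idea and the shared-embedding retrieval step can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the operator names ("concat", "diff", "sum"), the function names, and the toy vectors are assumptions, and the encoders and projection heads that would produce the embeddings are left out entirely.

```python
import math


def fuse_global(feat_t1, feat_t2, op="concat"):
    """Fuse the global features of two acquisition dates with a simple operator.

    The available operators are illustrative guesses at "simple operators".
    """
    if len(feat_t1) != len(feat_t2):
        raise ValueError("both feature vectors must have the same length")
    if op == "concat":
        return list(feat_t1) + list(feat_t2)
    if op == "diff":
        return [b - a for a, b in zip(feat_t1, feat_t2)]
    if op == "sum":
        return [a + b for a, b in zip(feat_t1, feat_t2)]
    raise ValueError(f"unknown operator: {op}")


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query_emb, archive, k=1):
    """Rank archive entries (name -> embedding) by similarity to the query."""
    ranked = sorted(archive, key=lambda name: cosine(query_emb, archive[name]),
                    reverse=True)
    return ranked[:k]


# Toy archive: embeddings are assumed to live in the shared space already.
archive = {
    "pair_a": fuse_global([1.0, 0.0], [0.0, 1.0], op="diff"),  # scene changed
    "pair_b": fuse_global([1.0, 1.0], [1.0, 1.0], op="diff"),  # no change
}
text_query = [-0.9, 1.1]  # stand-in for an embedded text sentence

if __name__ == "__main__":
    print(retrieve(text_query, archive, k=1))
```

Swapping the roles (an image-pair embedding as the query, text embeddings in the archive) gives the reverse retrieval direction with the same ranking function.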
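[CV-37] Beyond Token Compression: A Training-Free Reduction Framework for Efficient Visual Processing in MLLMs
[Quick read]: This paper addresses the high computational cost of decoder-only multimodal large language models (MLLMs), which stems from extensive self-attention and feed-forward network (FFN) operations on visual tokens. The key innovations are Hollow Attention, which limits visual token interactions to local attention while preserving visual-text associations, and Probe-Activated Dynamic FFN, which selectively activates FFN parameters for visual tokens. Neither requires fine-tuning, which makes the analysis considerably more efficient. Experiments show that applying these reductions to roughly half of the layers maintains and sometimes improves model performance, indicating substantial computational redundancy in current architectures.
Link: https://arxiv.org/abs/2501.19036
Authors: Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, Lianwen Jin
Affiliations: South China University of Technology; Intsig Information Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: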
Abstract:Multimodal Large Language Models (MLLMs) are typically based on decoder-only or cross-attention architectures. While decoder-only MLLMs outperform their cross-attention counterparts, they require significantly higher computational resources due to extensive self-attention and FFN operations on visual tokens. This raises the question: can we eliminate these expensive operations while maintaining the performance? To this end, we present a novel analysis framework to investigate the necessity of these costly operations in decoder-only MLLMs. Our framework introduces two key innovations: (1) Hollow Attention, which limits visual token interactions to local attention while maintaining visual-text associations, and (2) Probe-Activated Dynamic FFN, which selectively activates FFN parameters for visual tokens. Both methods do not require fine-tuning, which significantly enhances analysis efficiency. To assess the impact of applying these reductions across different proportions of layers, we developed a greedy search method that significantly narrows the search space. Experiments on state-of-the-art MLLMs reveal that applying our reductions to approximately half of the layers not only maintains but sometimes improves model performance, indicating significant computational redundancy in current architectures. Additionally, our method is orthogonal to existing token compression techniques, allowing for further combination to achieve greater computational reduction. Our findings may provide valuable insights for the design of more efficient future MLLMs. Our code will be publicly available at this https URL.
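One way to picture an attention pattern that keeps visual-to-visual interactions local while leaving visual-text links intact is a boolean mask like the one below. The abstract does not specify the exact masking rule, so the window definition, the token layout, and the name `hollow_mask` are assumptions for illustration only.

```python
def hollow_mask(is_visual, window):
    """Build a causal attention mask (True = query may attend to key).

    is_visual: list of booleans, one per token, True for visual tokens.
    window:    a visual query sees a visual key only if it lies fewer than
               `window` positions back (assumed notion of "local").
    Any pair that involves a text token keeps ordinary causal attention.
    """
    n = len(is_visual)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys at or before the query
            if is_visual[q] and is_visual[k]:
                mask[q][k] = (q - k) < window
            else:
                mask[q][k] = True
    return mask


# Three visual tokens followed by one text token (toy layout).
layout = [True, True, True, False]
mask = hollow_mask(layout, window=1)

if __name__ == "__main__":
    for row in mask:
        print(["x" if allowed else "." for allowed in row])
```

With `window=1` each visual token attends only to itself among the visual tokens, while the trailing text token still attends to every earlier token, which is where the saving in visual-to-visual attention would come from under these assumptions.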
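[CV-38] SynthmanticLiDAR: A Synthetic Dataset for Semantic Segmentation on LiDAR Imaging
[Quick read]: This paper addresses the difficulty of obtaining data for LiDAR semantic segmentation, since collecting and labeling real data is expensive and time-consuming. The key is a modified version of the CARLA simulator built to generate LiDAR data with semantic segmentation labels, which is used to produce a synthetic dataset named SynthmanticLiDAR. The dataset's class definitions are aligned with existing real datasets such as SemanticKITTI, and the object class distribution can be adjusted. Using a naive transfer learning approach, the results show that adding SynthmanticLiDAR to training improves the overall performance of several semantic segmentation algorithms, supporting the usefulness of both the dataset and the simulator.
Link: https://arxiv.org/abs/2501.19035
Authors: Javier Montalvo, Pablo Carballeira, Álvaro García-Martín
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2024 IEEE International Conference on Image Processing (ICIP)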
Abstract:Semantic segmentation on LiDAR imaging is increasingly gaining attention, as it can provide useful knowledge for perception systems and potential for autonomous driving. However, collecting and labeling real LiDAR data is an expensive and time-consuming task. While datasets such as SemanticKITTI have been manually collected and labeled, the introduction of simulation tools such as CARLA, has enabled the creation of synthetic datasets on demand. In this work, we present a modified CARLA simulator designed with LiDAR semantic segmentation in mind, with new classes, more consistent object labeling with their counterparts from real datasets such as SemanticKITTI, and the possibility to adjust the object class distribution. Using this tool, we have generated SynthmanticLiDAR, a synthetic dataset for semantic segmentation on LiDAR imaging, designed to be similar to SemanticKITTI, and we evaluate its contribution to the training process of different semantic segmentation algorithms by using a naive transfer learning approach. Our results show that incorporating SynthmanticLiDAR into the training process improves the overall performance of tested algorithms, proving the usefulness of our dataset, and therefore, our adapted CARLA simulator. The dataset and simulator are available in this https URL. Journal reference: 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024, pp. 137-143. Related DOI: https://doi.org/10.1109/ICIP51287.2024.10648055
[CV-39] XRF V2: A Dataset for Action Summarization with Wi-Fi Signals and IMUs in Phones, Watches, Earbuds, and Glasses
[Quick read]: This paper addresses Temporal Action Localization (TAL) and action summarization for indoor daily activities. It introduces the XRF V2 dataset and proposes a neural network named XRFMamba, which captures long-term dependencies in untrimmed sensory sequences and outperforms state-of-the-art methods such as ActionFormer and WiFiTAD on TAL and action summarization. The key lies in the design of XRFMamba and its application to multimodal data.
Link: https://arxiv.org/abs/2501.19034
Authors: Bo Lan, Pei Li, Jiaxi Yin, Yunpeng Song, Ge Wang, Han Ding, Jinsong Han, Fei Wang
Affiliations: School of Software Engineering, Xi’an Jiaotong University; MOE KLINNS Lab, Xi’an Jiaotong University; School of Computer Science and Technology, Xi’an Jiaotong University; College of Computer Science and Technology, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 11 figures, 8 tables
Abstract:Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and outperforms state-of-the-art methods, such as ActionFormer and WiFiTAD. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation models pre-training, synthetic data generation, and more.
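[CV-40] Virtual airways heatmaps to optimize point of entry location in lung biopsy planning systems
[Quick read]: This paper addresses optimization of the point of entry (POE) in lung biopsy planning systems. The key is a virtual model that generates heatmaps to assess the quality of the biopsy sample obtainable from each candidate POE. The model accounts for the error that arises from orientation discrepancies between the planning simulation and the actual operation, and it determines the extractable amount of tissue by intersecting the lesion with a cone representing the uncertainty area. The approach helps to visually assess the optimal POE, identify whether several optimal POEs exist, and study which variables affect the success of the operation as well as the maximum allowable navigation error.
Link: https://arxiv.org/abs/2501.19003
Authors: Debora Gil, Pere Lloret, Marta Diez-Ferrer, Carles Sanchez
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: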
Abstract:Purpose: We present a virtual model to optimize point of entry (POE) in lung biopsy planning systems. Our model allows computing the quality of a biopsy sample taken from a potential POE, taking into account the margin of error that arises from discrepancies between the orientation in the planning simulation and the actual orientation during the operation. Additionally, the study examines the impact of the characteristics of the lesion. Methods: The quality of the biopsy is given by a heatmap projected onto the skeleton of a patient-specific model of airways. The skeleton provides a 3D representation of airways structure, while the heatmap intensity represents the potential amount of tissue that could be extracted from each POE. This amount of tissue is determined by the intersection of the lesion with a cone that represents the uncertainty area in the introduction of biopsy instruments. The cone, lesion, and skeleton are modelled as graphical objects that define a 3D scene of the intervention. Results: We have simulated different settings of the intervention scene from a single anatomy extracted from a CT scan and two lesions with regular and irregular shapes. The different scenarios are simulated by systematic rotation of each lesion placed at different distances from airways. Analysis of the heatmaps for the different settings shows a strong impact of lesion orientation for irregular shape and the distance for both shapes. Conclusion: The proposed heatmaps help to visually assess the optimal POE and identify whether multiple optimal POEs exist in different zones of the bronchi. They also allow us to model the maximum allowable error in navigation systems and study which variables have the greatest influence on the success of the operation. Additionally, they help determine at what point this influence could potentially jeopardize the operation.
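The heat value of a single POE, described above as the intersection of the lesion with an uncertainty cone, can be approximated by counting lesion sample points that fall inside the cone. The sketch below is a simplified stand-in under stated assumptions: the lesion is a hypothetical point cloud, the cone is defined by an apex, an axis, a half-angle, and a maximum depth, and the names `inside_cone` and `heat_value` are invented for this example.

```python
import math


def inside_cone(point, apex, axis, half_angle_deg, max_depth):
    """Return True if `point` lies inside a finite cone.

    The cone starts at `apex`, opens along `axis`, and is cut off at
    `max_depth` measured along the axis.
    """
    v = [p - a for p, a in zip(point, apex)]
    axis_norm = math.sqrt(sum(c * c for c in axis))
    unit = [c / axis_norm for c in axis]
    depth = sum(a * b for a, b in zip(v, unit))  # projection onto the axis
    if depth <= 0 or depth > max_depth:
        return False
    dist = math.sqrt(sum(c * c for c in v))
    # Angle between v and the axis must not exceed the half-angle.
    return depth / dist >= math.cos(math.radians(half_angle_deg))


def heat_value(lesion_points, apex, axis, half_angle_deg, max_depth):
    """Number of lesion sample points reachable from this point of entry."""
    return sum(
        inside_cone(p, apex, axis, half_angle_deg, max_depth)
        for p in lesion_points
    )


# Toy lesion: a flat 3x3 grid of sample points, 5 units from the POE.
lesion = [(x, y, 5.0) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)]
poe = (0.0, 0.0, 0.0)

if __name__ == "__main__":
    for angle in (10, 15):
        print(angle, heat_value(lesion, poe, (0.0, 0.0, 1.0), angle, 10.0))
```

Evaluating `heat_value` at every skeleton point, with the axis set to the local instrument direction, would give one intensity per candidate POE; widening the half-angle models a larger orientation error.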
[CV-41] VKFPos: A Learning-Based Monocular Positioning with Variational Bayesian Extended Kalman Filter Integration
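Quick read: This paper addresses the challenges of learning-based monocular positioning by proposing VKFPos, which integrates Absolute Pose Regression (APR) and Relative Pose Regression (RPR) through an Extended Kalman Filter (EKF) within a variational Bayesian inference framework. The key idea is to decompose the essential posterior probability of the monocular positioning problem into APR and RPR components and to predict covariances in both branches; these covariances enhance the loss functions and make EKF integration effective. On indoor and outdoor datasets, the single-shot APR branch of VKFPos reaches accuracy on par with state-of-the-art methods, and the model performs strongly on temporal positioning tasks.
Link: https://arxiv.org/abs/2501.18994
Authors: Jian-Yu Chen, Yi-Ru Chen, Yin-Qiao Chang, Che-Ming Li, Jann-Long Chern, Chih-Wei Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: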
Abstract:This paper addresses the challenges in learning-based monocular positioning by proposing VKFPos, a novel approach that integrates Absolute Pose Regression (APR) and Relative Pose Regression (RPR) via an Extended Kalman Filter (EKF) within a variational Bayesian inference framework. Our method shows that the essential posterior probability of the monocular positioning problem can be decomposed into APR and RPR components. This decomposition is embedded in the deep learning model by predicting covariances in both APR and RPR branches, allowing them to account for associated uncertainties. These covariances enhance the loss functions and facilitate EKF integration. Experimental evaluations on both indoor and outdoor datasets show that the single-shot APR branch achieves accuracy on par with state-of-the-art methods. Furthermore, for temporal positioning, where consecutive images allow for RPR and EKF integration, VKFPos outperforms temporal APR and model-based integration methods, achieving superior accuracy.
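The role of the predicted covariances in filter-based fusion can be illustrated with a deliberately simplified sketch. It is a one-dimensional linear Kalman step, not the paper's EKF over full camera poses: a relative displacement (standing in for the RPR output) drives the prediction, and an absolute position (standing in for the APR output) drives the update. The function name `fuse_step` and all numbers are hypothetical.

```python
def fuse_step(x, p, rel_delta, rel_var, abs_meas, abs_var):
    """One predict/update cycle of a scalar Kalman filter (illustrative only)."""
    # Predict: apply the relative displacement; uncertainty grows by its variance.
    x_pred = x + rel_delta
    p_pred = p + rel_var
    # Update: blend in the absolute measurement, weighted by the Kalman gain.
    gain = p_pred / (p_pred + abs_var)
    x_new = x_pred + gain * (abs_meas - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new

x, p = fuse_step(x=0.0, p=1.0, rel_delta=1.0, rel_var=1.0, abs_meas=2.0, abs_var=2.0)
print(x, p)
```

A smaller `abs_var` pulls the estimate toward the absolute measurement, while a smaller `rel_var` makes the filter trust the motion prediction more. This is why per-branch uncertainty estimates matter for the fusion.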
[CV-42] Visual Autoregressive Modeling for Image Super-Resolution
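Quick read: This paper targets two problems in image super-resolution (ISR): the trade-off between fidelity and realism, and computational complexity. It proposes VARSR, a visual autoregressive model built on a next-scale prediction framework that uses prefix tokens to incorporate the conditioning information. The key innovations are scale-aligned rotary positional encodings that capture spatial structure, a diffusion refiner that models the quantization residual loss for pixel-level fidelity, and image-based classifier-free guidance that steers generation toward more realistic images. The authors also collect large-scale data and design a training process to obtain robust generative priors. Together, these components improve both the efficiency and the quality of ISR.
Link: https://arxiv.org/abs/2501.18993
Authors: Yunpeng Qu, Kun Yuan, Jinhua Hao, Kai Zhao, Qizhi Xie, Ming Sun, Chao Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages; 17 figures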
Abstract:Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off issues between fidelity and realism, as well as computational complexity, have also posed limitations on their application. Building upon the tremendous success of autoregressive models in the language domain, we propose VARSR, a novel visual autoregressive modeling framework for ISR in the form of next-scale prediction. To effectively integrate and preserve semantic information in low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures and the diffusion refiner is utilized for modeling quantization residual loss to achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our codes will be released at this https URL.
[CV-43] Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images
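Quick read: This paper addresses the heavy computational cost of whole slide image (WSI) analysis, which stems from the massive number of patches in each image. Its key solution is Querent, a query-aware long-context dynamic modeling framework that retains the expressive power of full self-attention while remaining practically efficient. For each patch, Querent adaptively predicts the most relevant surrounding regions and computes attention only over those potentially important contexts, which sharply reduces computational overhead while preserving the global perception needed to model fine-grained patch correlations.
Link: https://arxiv.org/abs/2501.18984
Authors: Zhengrui Guo, Qichen Sun, Jiabo Ma, Lishuang Feng, Jinzhuo Wang, Hao Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 6 figures, 3 tables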
Abstract:Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images. While transformer architectures excel at modeling long-range correlations through self-attention, their quadratic computational complexity makes them impractical for computational pathology applications. Existing solutions like local-global or linear self-attention reduce computational costs but compromise the strong modeling capabilities of full self-attention. In this work, we propose Querent, i.e., the query-aware long contextual dynamic modeling framework, which maintains the expressive power of full self-attention while achieving practical efficiency. Our method adaptively predicts which surrounding regions are most relevant for each patch, enabling focused yet unrestricted attention computation only with potentially important contexts. By using efficient region-wise metadata computation and importance estimation, our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations. Through comprehensive experiments on biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across over 10 WSI datasets, our method demonstrates superior performance compared to the state-of-the-art approaches. Code will be made available at this https URL.
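The general pattern of attending only to the regions that look relevant to a query can be sketched as below. This is an assumption-laden toy, not Querent itself: it uses the mean patch embedding as the region summary, a dot product as the importance score, and reuses the keys as values. All function names and data are hypothetical; the paper's actual metadata and importance estimation may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def region_mean(region):
    """Summarize a region by the mean of its patch embeddings (assumed summary)."""
    dim = len(region[0])
    return [sum(p[d] for p in region) / len(region) for d in range(dim)]

def select_regions(query, regions, k):
    """Return the indices of the k regions whose summary best matches the query."""
    scores = [dot(query, region_mean(r)) for r in regions]
    order = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(query, regions, k):
    """Softmax attention restricted to patches in the selected regions."""
    chosen = select_regions(query, regions, k)
    keys = [p for i in chosen for p in regions[i]]
    logits = [dot(query, key) for key in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # numerically stable softmax
    total = sum(weights)
    out = [sum(w * key[d] for w, key in zip(weights, keys)) / total
           for d in range(len(query))]
    return chosen, out

regions = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
    [[-1.0, 0.0], [-0.9, -0.1]],
]
chosen, out = sparse_attention([1.0, 0.0], regions, k=1)
print(chosen, out)
```

With `k` much smaller than the number of regions, the per-query cost scales with the number of selected patches rather than with all patches, which is the efficiency argument the abstract makes.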
[CV-44] OmniPhysGS: 3D Constitutive Gaussians for General Physics-Based Dynamics Generation ICLR2025
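Quick read: This paper addresses a limitation of existing methods for recovering the physical properties of 3D assets: they typically assume that all materials belong to one predefined category (for example, elastic), which ignores the complex composition of multiple heterogeneous objects in real scenes and yields less physically plausible simulations across a wider range of objects. The key solution is OmniPhysGS, which treats each 3D asset as a collection of constitutive 3D Gaussians. The physical material of each Gaussian is represented by an ensemble of 12 physical domain-expert sub-models (rubber, metal, honey, water, etc.), which greatly increases the model's flexibility. Scenes are defined by user-specified prompts, and a pretrained video diffusion model supervises the estimation of the material weighting factors, producing more realistic and general physical dynamics across a broader range of materials.
Link: https://arxiv.org/abs/2501.18982
Authors: Yuchen Lin, Chenguo Lin, Jianjin Xu, Yadong Mu
Affiliations: Peking University; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025; Project page: this https URL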
Abstract:Recently, significant advancements have been made in the reconstruction and generation of 3D assets, including static cases and those with physical interactions. To recover the physical properties of 3D assets, existing methods typically assume that all materials belong to a specific predefined category (e.g., elasticity). However, such assumptions ignore the complex composition of multiple heterogeneous objects in real scenarios and tend to render less physically plausible animation given a wider range of objects. We propose OmniPhysGS for synthesizing a physics-based 3D dynamic scene composed of more general objects. A key design of OmniPhysGS is treating each 3D asset as a collection of constitutive 3D Gaussians. For each Gaussian, its physical material is represented by an ensemble of 12 physical domain-expert sub-models (rubber, metal, honey, water, etc.), which greatly enhances the flexibility of the proposed model. In the implementation, we define a scene by user-specified prompts and supervise the estimation of material weighting factors via a pretrained video diffusion model. Comprehensive experiments demonstrate that OmniPhysGS achieves more general and realistic physical dynamics across a broader spectrum of materials, including elastic, viscoelastic, plastic, and fluid substances, as well as interactions between different materials. Our method surpasses existing methods by approximately 3% to 16% in metrics of visual quality and text alignment.
[CV-45] LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Quick read: This paper aims to further improve open-vocabulary detectors, which already perform well when abundant region-level annotated data is available. The key idea is to co-train the detector with a large language model that generates a detailed image-level caption for each image. Specifically, the authors collect a new dataset, GroundingCap-1M, in which each image comes with grounding labels and a detailed image-level caption, and fine-tune an open-vocabulary detector on it with a standard grounding loss and a caption generation loss. The large language model provides supervision through region-level short captions and image-level long captions, and the resulting detector, LLMDet, shows clearly stronger open-vocabulary ability.
Link: https://arxiv.org/abs/2501.18954
Authors: Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Tongyi Lab, Alibaba Group; Peng Cheng Laboratory, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China; Pazhou Laboratory (Huangpu), China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model, by generating image-level detailed captions for each image, can further improve performance. To achieve the goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at this https URL.
[CV-46] Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them
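Quick read: This paper addresses a limitation of concept erasure as a way to reduce the risk of harmful content generation in diffusion models. The conventional fixed-target strategy maps the concept to be erased onto a fixed generic concept, such as a neutral concept or an empty text prompt, but it does not account for how erasing one concept affects the others. The key contribution is Adaptive Guided Erasure (AGE), which models the concept space as a graph and dynamically selects the optimal target concept for each undesirable concept, minimizing unintended side effects. Experiments show that AGE significantly outperforms state-of-the-art erasure methods at preserving unrelated concepts while maintaining effective erasure.
Link: https://arxiv.org/abs/2501.18950
Authors: Anh Bui, Trang Vu, Long Vuong, Trung Le, Paul Montague, Tamas Abraham, Junae Kim, Dinh Phung
Affiliations: Monash University; Defence Science and Technology Group, Australia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: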
Abstract:Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. The common principle of previous works to remove a specific concept is to map it to a fixed generic concept, such as a neutral concept or just an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which dynamically selects optimal target concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance. Our code is published at this https URL.
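[CV-47] TV-Dialogue: Crafting Theme-Aware Video Dialogues with Immersive Interaction
Quick read: This paper addresses video dialogue generation, in particular how to generate new dialogues that match both the video content and a user-specified theme. It introduces the Theme-aware Video Dialogue Crafting (TVDC) task and a multi-modal agent framework named TV-Dialogue. By enabling real-time immersive interaction among the characters in the video, the framework ensures theme alignment and visual consistency, so that it understands the video content accurately and generates new dialogue that fits the specified theme.
Link: https://arxiv.org/abs/2501.18940
Authors: Sai Wang, Fan Ma, Xinyi Li, Hehe Fan, Yu Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: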
Abstract:Recent advancements in LLMs have accelerated the development of dialogue generation across text and images, yet video-based dialogue generation remains underexplored and presents unique challenges. In this paper, we introduce Theme-aware Video Dialogue Crafting (TVDC), a novel task aimed at generating new dialogues that align with video content and adhere to user-specified themes. We propose TV-Dialogue, a novel multi-modal agent framework that ensures both theme alignment (i.e., the dialogue revolves around the theme) and visual consistency (i.e., the dialogue matches the emotions and behaviors of characters in the video) by enabling real-time immersive interactions among video characters, thereby accurately understanding the video content and generating new dialogue that aligns with the given themes. To assess the generated dialogues, we present a multi-granularity evaluation benchmark with high accuracy, interpretability and reliability, demonstrating the effectiveness of TV-Dialogue on self-collected dataset over directly using existing LLMs. Extensive experiments reveal that TV-Dialogue can generate dialogues for videos of any length and any theme in a zero-shot manner without training. Our findings underscore the potential of TV-Dialogue for various applications, such as video re-creation, film dubbing and its use in downstream multimodal tasks.
[CV-48] Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning
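Quick read: This paper addresses the restricted functional expressiveness of Visual Prompt Tuning (VPT) when adapting pre-trained vision models to downstream tasks. The key solution is Visual Adaptive Prompt Tuning (VAPT), which redefines prompts as adaptive functions of the input and thereby achieves better sample efficiency. On VTAB-1K and FGVC, VAPT improves performance by 7.34% and 1.04%, respectively, over full fine-tuning baselines, while using fewer parameters.
Link: https://arxiv.org/abs/2501.18936
Authors: Minh Le, Anh Nguyen, Huy Nguyen, Chau Nguyen, Nhat Ho
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 55 pages, 10 figures, 18 tables. arXiv admin note: text overlap with arXiv:2410.02200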
Abstract:Visual Prompt Tuning (VPT) has recently emerged as a powerful method for adapting pre-trained vision models to downstream tasks. By introducing learnable prompt tokens as task-specific instructions, VPT effectively guides pre-trained transformer models with minimal overhead. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on recent insights into the connection between mixture of experts and prompt-based approaches, we identify a key limitation in VPT: the restricted functional expressiveness in prompt formulation. To address this limitation, we propose Visual Adaptive Prompt Tuning (VAPT), a new generation of prompts that redefines prompts as adaptive functions of the input. Our theoretical analysis shows that this simple yet intuitive approach achieves optimal sample efficiency. Empirical results on VTAB-1K and FGVC further demonstrate VAPT’s effectiveness, with performance gains of 7.34% and 1.04% over fully fine-tuning baselines, respectively. Notably, VAPT also surpasses VPT by a substantial margin while using fewer parameters. These results highlight both the effectiveness and efficiency of our method and pave the way for future research to explore the potential of adaptive prompts.
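[CV-49] Training-free Quantum-Inspired Image Edge Extraction Method
Quick read: This paper addresses two shortcomings of existing edge detectors: classical methods perform poorly in complex or noisy scenes, while deep learning methods need large training datasets and fine-tuning. The key solution is a training-free, quantum-inspired edge detection model that combines classical Sobel edge detection, a refinement step based on the Schrödinger wave equation, and a hybrid framework that combines the Canny and Laplacian operators. The iterative diffusion based on the Schrödinger wave equation markedly improves edge precision, and the hybrid framework makes the model more robust by jointly exploiting local and global features, which suits it to a wide range of applications.
Link: https://arxiv.org/abs/2501.18929
Authors: Arti Jain, Pradeep Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures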
Abstract:Edge detection is a cornerstone of image processing, yet existing methods often face critical limitations. Traditional deep learning edge detection methods require extensive training datasets and fine-tuning, while classical techniques often fail in complex or noisy scenarios, limiting their real-world applicability. To address these limitations, we propose a training-free, quantum-inspired edge detection model. Our approach integrates classical Sobel edge detection, the Schrödinger wave equation refinement, and a hybrid framework combining Canny and Laplacian operators. By eliminating the need for training, the model is lightweight and adaptable to diverse applications. The Schrödinger wave equation refines gradient-based edge maps through iterative diffusion, significantly enhancing edge precision. The hybrid framework further strengthens the model by synergistically combining local and global features, ensuring robustness even under challenging conditions. Extensive evaluations on datasets like BIPED, Multicue, and NYUD demonstrate superior performance of the proposed model, achieving state-of-the-art metrics, including ODS, OIS, AP, and F-measure. Noise robustness experiments highlight its reliability, showcasing its practicality for real-world scenarios. Due to its versatile and adaptable nature, our model is well-suited for applications such as medical imaging, autonomous systems, and environmental monitoring, setting a new benchmark for edge detection.
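Of the stages listed in the abstract, the classical Sobel step is the one with a standard definition, so the sketch below covers that stage only. The Schrödinger-based refinement and the Canny/Laplacian hybrid are not specified in the abstract and are not reproduced here. The function name `sobel_magnitude` and the toy image are hypothetical, and border pixels are simply left at zero.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img):
    """Gradient magnitude of a 2D grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Cross-correlate each kernel with the 3x3 neighborhood.
            gx = sum(KX[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(KY[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out

# A 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(image)
print(edges[1])
```

The interior pixels next to the step edge get a large response while the horizontal-gradient-free direction contributes nothing, which is the gradient-based edge map that a later refinement stage would take as input.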
[CV-50] Rethinking Diffusion Posterior Sampling: From Conditional Score Estimator to Maximizing a Posterior ICLR2025
Quick read: This paper examines how effective Diffusion Posterior Sampling (DPS) really is for inverse problems. Its analysis finds that the conditional score approximation used by DPS is less effective than expected and that DPS behaves more like maximum a posteriori (MAP) estimation. The key improvements are: 1) explicitly maximizing the posterior through multi-step gradient ascent and projection; and 2) using a lightweight conditional score estimator trained with only 100 images and 8 GPU hours. These changes significantly improve the performance of DPS.
Link: https://arxiv.org/abs/2501.18913
Authors: Tongda Xu, Xiyan Cai, Xinjie Zhang, Xingtong Ge, Dailan He, Ming Sun, Jingjing Liu, Ya-Qin Zhang, Jian Li, Yan Wang
Affiliations: Institute for AI Industry Research, Tsinghua University; Department of Computer Science and Technology, Tsinghua University; Bloomberg; Hong Kong University of Science and Technology; SenseTime Research; School of Vehicle and Mobility, Tsinghua University; Kuaishou Technology; Institute for Interdisciplinary Information Sciences, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025
Abstract:Recent advancements in diffusion models have been leveraged to address inverse problems without additional training, and Diffusion Posterior Sampling (DPS) (Chung et al., 2022a) is among the most popular approaches. Previous analyses suggest that DPS accomplishes posterior sampling by approximating the conditional score. While in this paper, we demonstrate that the conditional score approximation employed by DPS is not as effective as previously assumed, but rather aligns more closely with the principle of maximizing a posterior (MAP). This assertion is substantiated through an examination of DPS on 512x512 ImageNet images, revealing that: 1) DPS’s conditional score estimation significantly diverges from the score of a well-trained conditional diffusion model and is even inferior to the unconditional score; 2) The mean of DPS’s conditional score estimation deviates significantly from zero, rendering it an invalid score estimation; 3) DPS generates high-quality samples with significantly lower diversity. In light of the above findings, we posit that DPS more closely resembles MAP than a conditional score estimator, and accordingly propose the following enhancements to DPS: 1) we explicitly maximize the posterior through multi-step gradient ascent and projection; 2) we utilize a light-weighted conditional score estimator trained with only 100 images and 8 GPU hours. Extensive experimental results indicate that these proposed improvements significantly enhance DPS’s performance. The source code for these improvements is provided in this https URL.
[CV-51] GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
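Quick read: This paper addresses the challenge of controlling human gestures from speech signals, in particular the weak modeling of spatial interactions in existing methods and their slow generation speed. It proposes GestureLSM, a latent-shortcut-based approach to co-speech gesture generation with spatial-temporal modeling, which explicitly models the individual body regions and their interactions using spatial and temporal attention. By studying the denoising patterns and designing an effective time distribution to speed up sampling, it achieves real-time gesture generation while preserving generation quality.
Link: https://arxiv.org/abs/2501.18898
Authors: Pinxin Liu, Luchuan Song, Junhua Huang, Chenliang Xu
Affiliations: University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: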
Abstract:Controlling human gestures based on speech signals presents a significant challenge in computer vision. While existing works did preliminary studies of generating holistic co-speech gestures from speech, the spatial interaction of each body region during the speech remains barely explored. This leads to weird body part interactions given the speech signal. Furthermore, the slow generation speed limits the construction of real-world digital avatars. To resolve these problems, we propose GestureLSM, a Latent Shortcut based approach for Co-Speech Gesture Generation with spatial-temporal modeling. We tokenize various body regions and explicitly model their interactions with spatial and temporal attention. To achieve real-time gesture generation, we examine the denoising patterns and design an effective time distribution to speed up sampling while improving the generation quality for the shortcut model. Extensive quantitative and qualitative experiments demonstrate the effectiveness of GestureLSM, showcasing its potential for various applications in the development of digital humans and embodied agents. Project Page: this https URL
[CV-52] RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception
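Quick read: This paper addresses the performance limits that vision-language models (VLMs) face in application-specific, instruction-driven visual grounding tasks because of insufficient and imbalanced data. The key solution is a new framework that integrates a reinforcement learning (RL) agent to generate synthetic data for VLM fine-tuning. Specifically, the RL agent manipulates objects in an indoor environment to create synthetic data that helps overcome particular weaknesses of the VLM. The RL agent thus acts as an informative data sampling tool: guided by feedback from the VLM's performance, it generates efficient, targeted data for tasks such as spatial reasoning, improving model performance and addressing task-specific vulnerabilities.
Link: https://arxiv.org/abs/2501.18880
Authors: Joshua R. Waite, Md. Zahid Hasan, Qisai Liu, Zhanhong Jiang, Chinmay Hegde, Soumik Sarkar
Affiliations: Iowa State University; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICCPS 2025 accepted paper, 10 pages, 9 figures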
Abstract:Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method utilizes the RL agent to manipulate objects within an indoor setting to create synthetic data for fine-tuning to address certain vulnerabilities of the VLM. Specifically, we use the performance of the VLM to provide feedback to the RL agent to generate informative data that efficiently fine-tune the VLM over the targeted task (e.g. spatial reasoning). The key contribution of this work is developing a framework where the RL agent serves as an informative data sampling tool and assists the VLM in order to enhance performance and address task-specific vulnerabilities. By targeting the data sampling process to address the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data allows us to have precise control over each scene and generate granular ground truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, which demonstrates the benefits of using RL-guided data generation in vision-language tasks.
[CV-53] Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models
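Quick read: This paper addresses the risk that text-to-image diffusion models generate Not Safe For Work (NSFW) content from unsafe prompts. Existing approaches such as prompt filtering or concept unlearning cannot defend against adversarial attacks while preserving benign image quality. The proposed solution, Distorting Embedding Space (DES), is a text-encoder-based defense that controls the embedding space. DES transforms unsafe embeddings extracted from the text encoder toward carefully calculated safe embedding regions, which prevents unsafe content generation while preserving the original safe embeddings. DES also aligns the nudity embedding, extracted with the prompt "nudity", with a neutral embedding to strengthen robustness against adversarial attacks. Together these steps provide strong defense and high-quality image generation.
Link: https://arxiv.org/abs/2501.18877
Authors: Jaesin Ahn, Heechul Jung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: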
Abstract:Text-to-image diffusion models show remarkable generation performance following text prompts, but risk generating Not Safe For Work (NSFW) contents from unsafe prompts. Existing approaches, such as prompt filtering or concept unlearning, fail to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose a novel approach called Distorting Embedding Space (DES), a text encoder-based defense mechanism that effectively tackles these issues through innovative embedding space control. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe contents generation, while reproducing the original safe embeddings. DES also neutralizes the nudity embedding, extracted using prompt "nudity", by aligning it with neutral embedding to enhance robustness against adversarial attacks. These methods ensure both robust defense and high-quality image generation. Additionally, DES can be adopted in a plug-and-play manner and requires zero inference overhead, facilitating its deployment. Extensive experiments on diverse attack types, including black-box and white-box scenarios, demonstrate DES’s state-of-the-art performance in both defense capability and benign image generation quality. Our model is available at this https URL.
[CV-54] Self-Supervised Learning Using Nonlinear Dependence
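Quick read: This paper addresses the difficulty self-supervised learning (SSL) has in capturing nonlinear dependencies between samples in complex data. Existing SSL methods focus mainly on feature variance and linear correlations, and neglect the intricate relations between samples and the nonlinear dependencies in the data. The key solution is Correlation-Dependence Self-Supervised Learning (CDSSL), a framework that strengthens representation learning by integrating linear correlations with nonlinear dependencies, covering both sample-wise and feature-wise interactions. CDSSL uses the Hilbert-Schmidt Independence Criterion (HSIC) to robustly capture nonlinear dependencies in a Reproducing Kernel Hilbert Space, which improves representation quality.
Link: https://arxiv.org/abs/2501.18875
Authors: M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: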
Abstract:Self-supervised learning has gained significant attention in contemporary applications, particularly due to the scarcity of labeled data. While existing SSL methodologies primarily address feature variance and linear correlations, they often neglect the intricate relations between samples and the nonlinear dependencies inherent in complex data. In this paper, we introduce Correlation-Dependence Self-Supervised Learning (CDSSL), a novel framework that unifies and extends existing SSL paradigms by integrating both linear correlations and nonlinear dependencies, encapsulating sample-wise and feature-wise interactions. Our approach incorporates the Hilbert-Schmidt Independence Criterion (HSIC) to robustly capture nonlinear dependencies within a Reproducing Kernel Hilbert Space, enriching representation learning. Experimental evaluations on diverse benchmarks demonstrate the efficacy of CDSSL in improving representation quality.
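To make the HSIC ingredient concrete, here is a minimal sketch of the standard biased empirical estimator, tr(KHLH)/(n-1)^2, on scalar samples with a Gaussian kernel. It shows only the dependence measure, not the CDSSL loss, whose exact form the abstract does not give. The function names, the kernel bandwidth, and the toy data are assumptions made for illustration.

```python
import math

def rbf_gram(xs, sigma=1.0):
    """Gaussian-kernel Gram matrix for a list of scalars."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]

def center(k):
    """Double-center a symmetric Gram matrix, i.e. compute H K H."""
    n = len(k)
    row = [sum(r) / n for r in k]
    total = sum(row) / n
    return [[k[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    kc = center(rbf_gram(xs, sigma))
    l = rbf_gram(ys, sigma)
    # tr(HKH L) equals the elementwise sum because both matrices are symmetric.
    return sum(kc[i][j] * l[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]          # depends on x, but with zero linear correlation
dependent = hsic(xs, ys)
constant = hsic(xs, [1.0] * 5)    # a constant carries no information about x
print(dependent, constant)
```

The quadratic relationship has zero Pearson correlation yet a clearly positive HSIC, which is the kind of nonlinear dependence that correlation-only objectives cannot see.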
[CV-55] UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
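Quick read: This paper addresses a weakness of existing Vision-Language-Action (VLA) models: they focus on high-level semantic content and neglect low-level features, which limits their ability to capture detailed spatial information and understand physical dynamics. These details are crucial for embodied control tasks but remain underexplored in existing pre-training paradigms. The key solution is UP-VLA, a unified VLA training approach that combines multi-modal Understanding and future Prediction objectives, strengthening both high-level semantic comprehension and low-level spatial understanding. Experiments show that UP-VLA improves on the previous state-of-the-art method by 33% on the Calvin ABC-D benchmark and achieves higher success rates in real-world manipulation tasks that require precise spatial information.
Link: https://arxiv.org/abs/2501.18867
Authors: Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, Jianyu Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: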
Abstract:Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve the generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs, and introduce UP-VLA, a Unified VLA model training with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
[CV-56] REG: Rectified Gradient Guidance for Conditional Diffusion Models
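Quick read: This paper addresses the mismatch between how guidance for conditional generation in diffusion models is implemented in practice and how it is motivated in theory. The key step is to replace the scaled marginal distribution target, which the authors prove to be theoretically invalid, with a valid scaled joint distribution objective. Building on this insight, the authors propose Rectified Gradient Guidance (REG), an enhancement that approximates the optimal solution more closely. Experiments confirm its effectiveness, in particular improved FID and Inception/CLIP scores.
Link: https://arxiv.org/abs/2501.18865
Authors: Zhengqi Gao, Kaiwen Zha, Tianyuan Zhang, Zihui Xue, Duane S. Boning
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures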
Abstract:Guidance techniques are simple yet effective for improving conditional generation in diffusion models. Albeit their empirical success, the practical implementation of guidance diverges significantly from its theoretical motivation. In this paper, we reconcile this discrepancy by replacing the scaled marginal distribution target, which we prove theoretically invalid, with a valid scaled joint distribution objective. Additionally, we show that the established guidance implementations are approximations to the intractable optimal solution under no future foresight constraint. Building on these theoretical insights, we propose rectified gradient guidance (REG), a versatile enhancement designed to boost the performance of existing guidance methods. Experiments on 1D and 2D demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence.
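For readers unfamiliar with the guidance implementations that the abstract refers to, the sketch below shows one common convention for classifier-free guidance, in which the unconditional and conditional noise predictions are linearly extrapolated. It illustrates the kind of existing method REG is meant to enhance, not REG itself, whose correction term is not given in the abstract. The function name `cfg_combine` and the numbers are hypothetical, and other papers parameterize the scale differently.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance under the convention eps = eps_u + s * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]   # toy unconditional noise prediction
eps_c = [1.0, 3.0]   # toy conditional noise prediction
guided = cfg_combine(eps_u, eps_c, scale=2.0)
print(guided)
```

Under this convention a scale of 1 recovers the conditional prediction and a scale of 0 recovers the unconditional one; scales above 1 push the sample further in the conditional direction.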
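[CV-57] Test-time Loss Landscape Adaptation for Zero-Shot Generalization in Vision-Language Models
Quick read: This paper addresses the performance drop caused by distribution shift at test time and proposes a way to avoid the high computational cost of existing methods. From a loss landscape perspective, it shows that the backpropagation used by existing methods is unnecessary, and it proposes a framework called Test-time Loss Landscape Adaptation (TLLA). TLLA uses the relative position between the training minimum and the test loss landscapes to guide adaptation, so no model parameters are updated at test time. It has two main stages: Sharpness-Aware Prompt Tuning (SAPT), which identifies a flat minimum during training, and Sharpness-based Test Sample Selection (STSS), which ensures that the flat minimum of the training loss landscape aligns with the loss landscape of each augmented test sample.
Link: https://arxiv.org/abs/2501.18864
Authors: Aodi Li, Liansheng Zhuang, Xiao Long, Minghong Yao, Shafei Wang
Affiliations: University of Science and Technology of China; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: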
Abstract:Test-time adaptation of pre-trained vision-language models has emerged as a technique for tackling distribution shifts during the test time. Although existing methods, especially those based on Test-time Prompt Tuning (TPT), have shown promising results, their high computational cost associated with parameter optimization presents challenges for scalability and practical application. This paper unveils the unnecessary nature of backpropagation in existing methods from a loss landscape perspective. Building on this insight, this paper proposes a simple yet effective framework called Test-time Loss Landscape Adaptation (TLLA). TLLA leverages the relative position between the training minimum and test loss landscapes to guide the adaptation process, avoiding the update of model parameters at test time. Specifically, it mainly consists of two main stages: In the prompt tuning stage, a Sharpness-Aware Prompt Tuning (SAPT) method is introduced to identify the training flat minimum, setting the foundation for the subsequent test-time adaptation; In the test stage, a Sharpness-based Test Sample Selection (STSS) approach is utilized to ensure the alignment of flat minima within the training loss landscape and each augmented test sample’s loss landscape. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that TLLA achieves state-of-the-art performances while significantly reducing computational overhead. Notably, TLLA surpasses TPT by an average of 5.32% and 6.98% on four ImageNet variant datasets when employing ResNet50 and ViT-B/16 image encoders, respectively. The code will be available soon.
[CV-58] FlexiCrackNet: A Flexible Pipeline for Enhanced Crack Segmentation with General Features Transfered from SAM
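Quick read: This paper addresses the limited adaptability of automatic crack segmentation in resource-constrained environments and its inadequate scalability across diverse data domains. The key to the solution is FlexiCrackNet, a new method that combines traditional deep learning paradigms with the strengths of large-scale pre-trained models. It uses an encoder-decoder architecture to extract task-specific features and introduces an information-interaction gated attention mechanism (IGAM) that adaptively fuses multi-level features, improving segmentation performance while suppressing irrelevant noise and keeping the model adaptable and efficient across input resolutions and in resource-constrained environments.
Link: https://arxiv.org/abs/2501.18855
Authors: Xinlong Wan, Xiaoyan Jiang, Guangsheng Luo, Ferdous Sohel, Jenqneng Hwang
Affiliations: Shanghai University of Engineering Science; Leiden University; Murdoch University; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: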
Abstract:Automatic crack segmentation is a cornerstone technology for intelligent visual perception modules in road safety maintenance and structural integrity systems. Existing deep learning models and “pre-training + fine-tuning” paradigms often face challenges of limited adaptability in resource-constrained environments and inadequate scalability across diverse data domains. To overcome these limitations, we propose FlexiCrackNet, a novel pipeline that seamlessly integrates traditional deep learning paradigms with the strengths of large-scale pre-trained models. At its core, FlexiCrackNet employs an encoder-decoder architecture to extract task-specific features. The lightweight EdgeSAM’s CNN-based encoder is exclusively used as a generic feature extractor, decoupled from the fixed input size requirements of EdgeSAM. To harmonize general and domain-specific features, we introduce the information-Interaction gated attention mechanism (IGAM), which adaptively fuses multi-level features to enhance segmentation performance while mitigating irrelevant noise. This design enables the efficient transfer of general knowledge to crack segmentation tasks while ensuring adaptability to diverse input resolutions and resource-constrained environments. Experiments show that FlexiCrackNet outperforms state-of-the-art methods, excels in zero-shot generalization, computational efficiency, and segmentation robustness under challenging scenarios such as blurry inputs, complex backgrounds, and visually ambiguous artifacts. These advancements underscore the potential of FlexiCrackNet for real-world applications in automated crack detection and comprehensive structural health monitoring systems.
[CV-59] Project-and-Fuse: Improving RGB-D Semantic Segmentation via Graph Convolution Networks
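Quick read: This paper addresses the misalignment that can arise during feature fusion in existing RGB-D semantic segmentation methods, and the counter-intuitive patches that appear in their segmentation results. The key to the solution is to fuse the features of the two modalities in a late-fusion style, with a texture feature prior guiding the injection of geometric features, and then to process the fused features with Graph Neural Networks (GNNs), which infer relationships between patches and so reduce irregular patches. To address inefficient feature extraction from depth maps, the paper encodes the depth map as a normal map so that conventional convolutional neural networks (CNNs) can extract object surface information more effectively. In the projection matrix generation stage, it also adds a Kullback-Leibler loss so that important pixel features are not missed, and it assigns larger edge weights between neighboring regions to account for location information, which mitigates the Biased-Assignment and Ambiguous-Locality issues.
Link: https://arxiv.org/abs/2501.18851
Authors: Xiaoyan Jiang, Bohan Wang, Xinlong Wan, Zhi Zhou, Hamido Fujita
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: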
Abstract:Most existing RGB-D semantic segmentation methods focus on the feature level fusion, including complex cross-modality and cross-scale fusion modules. However, these methods may cause misalignment problems in the feature fusion process and counter-intuitive patches in the segmentation results. Inspired by the popular pixel-node-pixel pipeline, we propose to 1) fuse features from two modalities in a late fusion style, during which the geometric feature injection is guided by texture feature prior; 2) employ Graph Neural Networks (GNNs) on the fused feature to alleviate the emergence of irregular patches by inferring patch relationship. At the 3D feature extraction stage, we argue that traditional CNNs are not efficient enough for depth maps. So, we encode depth map into normal map, after which CNNs can easily extract object surface information. At the projection matrix generation stage, we find the existence of Biased-Assignment and Ambiguous-Locality issues in the original pipeline. Therefore, we propose to 1) adopt the Kullback-Leibler Loss to ensure no missing important pixel features, which can be viewed as hard pixel mining process; 2) connect regions that are close to each other in the Euclidean space as well as in the semantic space with larger edge weights so that location information can be considered. Extensive experiments on two public datasets, NYU-DepthV2 and SUN RGB-D, have shown that our approach can consistently boost the performance of RGB-D semantic segmentation task.
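The depth-to-normal encoding mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `depth_to_normals`, the finite-difference scheme, and the simplification that the normal is proportional to (-dz/dx, -dz/dy, 1) in pixel units are all assumptions made for illustration.

```python
import math

def depth_to_normals(depth):
    """Estimate a unit surface normal per pixel from a 2D depth grid.

    Illustrative assumption: normal ~ (-dz/dx, -dz/dy, 1), with gradients
    taken by central differences (one-sided at the image borders).
    """
    h, w = len(depth), len(depth[0])
    normals = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp neighbor indices so border pixels use one-sided differences.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            nx, ny, nz = -dzdx, -dzdy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            normals[y][x] = (nx / norm, ny / norm, nz / norm)
    return normals
```

A flat depth map yields normals pointing straight at the camera, while a ramp tilts them, which is the kind of surface cue a 2D CNN can pick up from a three-channel normal image.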
[CV-60] Early Diagnosis and Severity Assessment of Weligama Coconut Leaf Wilt Disease and Coconut Caterpillar Infestation using Deep Learning-based Image Processing Techniques
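Quick read: This paper addresses the tree damage and yield loss that Weligama Coconut Leaf Wilt Disease (WCWLD) and Coconut Caterpillar Infestation (CCI) cause in Sri Lanka and nearby coconut-producing countries. Both are currently detected through on-field human observation, which is time-consuming and makes early detection difficult. The paper proposes using a transfer-learning-based Convolutional Neural Network (CNN) and Mask R-CNN to identify WCWLD and CCI at an early stage and to assess disease progression, and it uses the You Only Look Once (YOLO) object detection model to count the caterpillars on leaves. The key is applying advanced machine learning algorithms to make early detection more efficient and accurate. In the experiments, the proposed methods identify WCWLD and CCI with accuracies of 90% and 95%, respectively, and classify WCWLD severity with an accuracy of 97%.
Link: https://arxiv.org/abs/2501.18835
Authors: Samitha Vidhanaarachchi, Janaka L. Wijekoon, W. A. Shanaka P. Abeysiriwardhana, Malitha Wijesundara
Affiliations: Sri Lanka Institute of Information Technology; Victorian Institute of Technology; Department of System Design Engineering, Keio University; ACSL Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: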
Abstract:Global Coconut (Cocos nucifera (L.)) cultivation faces significant challenges, including yield loss, due to pest and disease outbreaks. In particular, Weligama Coconut Leaf Wilt Disease (WCWLD) and Coconut Caterpillar Infestation (CCI) damage coconut trees, causing severe coconut production loss in Sri Lanka and nearby coconut-producing countries. Currently, both WCWLD and CCI are detected through on-field human observations, a process that is not only time-consuming but also limits the early detection of infections. This paper presents a study conducted in Sri Lanka, demonstrating the effectiveness of employing transfer learning-based Convolutional Neural Network (CNN) and Mask Region-based-CNN (Mask R-CNN) to identify WCWLD and CCI at their early stages and to assess disease progression. Further, this paper presents the use of the You Only Look Once (YOLO) object detection model to count the number of caterpillars distributed on leaves with CCI. The introduced methods were tested and validated using datasets collected from Matara, Puttalam, and Makandura, Sri Lanka. The results show that the proposed methods identify WCWLD and CCI with an accuracy of 90% and 95%, respectively. In addition, the proposed WCWLD disease severity identification method classifies the severity with an accuracy of 97%. Furthermore, the accuracies of the object detection models for calculating the number of caterpillars in the leaflets were: YOLOv5-96.87%, YOLOv8-96.1%, and YOLO11-95.9%.
[CV-61] Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
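Quick read: This paper addresses 3D scene reconstruction from sparse posed images. The key solution is MVGD, a diffusion-based architecture that generates images and depth maps directly at the pixel level from an arbitrary number of input views. The method uses raymap conditioning both to augment visual features from different viewpoints and to guide the generation of images and depth maps from novel views. A key aspect is the multi-task generation of images and depth maps, with learnable task embeddings steering the diffusion process toward a specific modality. The model is trained on a large-scale multi-view dataset with techniques for efficient and consistent learning, and an incremental fine-tuning strategy supports the training of larger models.
Link: https://arxiv.org/abs/2501.18804
Authors: Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus
Affiliations: Toyota Research Institute (TRI); Toyota Technological Institute at Chicago (TTIC)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL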
Abstract:Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry. In this paper we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning to both augment visual features with spatial information from different viewpoints, as well as to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as multi-view stereo and video depth estimation.
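To make the idea of a raymap concrete, here is a minimal sketch that assigns each pixel a ray origin and a unit direction in world coordinates. The abstract does not specify MVGD's exact parameterization, so the pinhole model, the half-pixel offset, the camera-to-world convention, and the function name `raymap` are assumptions for illustration only.

```python
import math

def raymap(width, height, fx, fy, cx, cy, rotation, translation):
    """Return a height x width grid of (origin, unit_direction) pairs.

    Assumptions (illustrative): pinhole camera, pixel centers at +0.5,
    `rotation` is a 3x3 camera-to-world matrix, and `translation` is the
    camera center in world coordinates.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel center to a direction in the camera frame.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the direction into the world frame.
            d_world = tuple(
                sum(rotation[i][k] * d_cam[k] for k in range(3)) for i in range(3)
            )
            norm = math.sqrt(sum(c * c for c in d_world))
            row.append((tuple(translation), tuple(c / norm for c in d_world)))
        rays.append(row)
    return rays
```

A grid like this, computed for both the input views and the target view, is one way spatial information could be attached to image features and to the pixels being generated.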
[CV-62] Every Image Listens Every Image Dances: Music-Driven Image Animation
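Quick read: This paper addresses music-driven image animation, specifically generating personalized dance videos from both text and music inputs. The key solution is the MuseDance model, which removes the need for complex motion guidance inputs such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels.
Link: https://arxiv.org/abs/2501.18801
Authors: Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak
Affiliations: Stony Brook University; Bytedance; Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: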
Abstract:Image animation has become a promising area in multimodal research, with a focus on generating videos from reference images. While prior work has largely emphasized generic video generation guided by text, music-driven dance video generation remains underexplored. In this paper, we introduce MuseDance, an innovative end-to-end model that animates reference images using both music and text inputs. This dual input enables MuseDance to generate personalized videos that follow text descriptions and synchronize character movements with the music. Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels. To advance research in this field, we present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new baseline for the music-driven image animation task.
[CV-63] Tuning Event Camera Biases Heuristic for Object Detection Applications in Staring Scenarios
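Quick read: This paper addresses the difficulty of tuning the parameters of neuromorphic cameras for a specific task, here the detection of small objects in staring scenarios. The key contribution is a heuristic for adjusting the bias parameters that reduces the multi-variable optimization problem to a two-parameter problem, which is then solved experimentally. The study shows that for certain signal sources, such as an incandescent lamp powered by the electrical grid, the optimal parameter values are very far from the defaults recommended by the manufacturer.
Link: https://arxiv.org/abs/2501.18788
Authors: David El-Chai Ben-Ezra, Daniel Brisk
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: 17 pages, 2 figures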
Abstract:One of the main challenges in unlocking the potential of neuromorphic cameras, also called ‘event cameras’, is the development of novel methods that solve the multi-parameter problem of adjusting their bias parameters to accommodate a desired task. Actually, it is very difficult to find in the literature a systematic heuristic that solves the problem for any desired application. In this paper we present a parameter-tuning heuristic for the biases of event cameras, for tasks that require small object detection in staring scenarios. The main purpose of the heuristic is to squeeze the camera’s potential, optimize its performance, and expand its detection capabilities as much as possible. In the presentation, we translate the experimental properties of event cameras and systemic constraints into mathematical terms, and show, under certain assumptions, how the multi-variable problem collapses into a two-parameter problem that can be solved experimentally. A main conclusion that will be demonstrated is that for certain desired signals, such as the one provided by an incandescent lamp powered by the periodic electrical grid, the optimal values of the camera are very far from the default values recommended by the manufacturer.
[CV-64] Multispectral 3D mapping on a Roman sculpture to study ancient polychromy
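Quick read: This paper addresses the limitations of research into the original polychromy of ancient Greek and Roman sculptures, in particular the risk that existing imaging techniques underestimate how much of a sculpture's surface was originally painted. The key is a methodology for analysis based on reality-based 3D models, using Visible Reflected Imaging (VIS) and Ultraviolet-induced Fluorescence Imaging (UVF), with the 3D model built by photogrammetry. Images taken under different light sources are processed, aligned, and integrated into the same two-dimensional space, so that multiple textures are mapped onto one model. This lets a classification algorithm be applied directly to the 3D model surface, helping conservators understand the state of preservation of an artifact, observe material distribution in detail, and correlate it with 3D geometric data.
Link: https://arxiv.org/abs/2501.18786
Authors: Francesca Uccheddu (1), Umair Shafqat Malik (2), Emanuela Massa (3), Anna Pelagotti (4), Maria Emilia Masci (5), Gabriele Guidi (2)
Affiliations: (1) University of Padova, Padua, Italy; (2) Indiana University, Bloomington, IN, USA; (3) Art-Test, Florence, Italy; (4) Istituto Nazionale di Ottica (INO), Florence, Italy; (5) Opificio delle Pietre Dure, Florence, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 14 pages, 5 figures, to be published in the proceedings of “Heri-Tech - The Future of Heritage Science And Technologies” Conference by Springer, 29-30 April 2024, Florence, Italy ( this https URL )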
Abstract:Research into the polychromy of Greek and Roman sculptures has surged to explore the hypothesis that ancient sculptures were originally not pristine white but adorned with colors. Multispectral and multimodal imaging techniques have been crucial in studying painted surfaces, revealing polychromies even in traces. In fact, imaging techniques, such as reflectance and fluorescence, can identify different materials and map inhomogeneities, guiding further investigations such as Raman, X-Ray Fluorescence, and Fourier Transform InfraRed Spectroscopy (FTIR) to investigate residual colors. However, this approach may underestimate the original polychromies’ extent over the complex articulation of a sculptured surface. This study proposes a methodology to analyze the original appearance of ancient sculptures using reality-based 3D models with textures not limited to those visible to the naked eye. We employ Visible Reflected Imaging (VIS) and Ultraviolet-induced Fluorescence Imaging (UVF). From the UVF and VIS datasets, the underlying 3D model is built by means of photogrammetry. Through raw data processing, images taken with different illuminating sources are successfully aligned and processed, creating a single 3D model with multiple textures mapped onto the same bi-dimensional space. The pixel-to-pixel correspondence of different textures allows for the implementation of a classification algorithm that can directly map its outcome onto the 3D model surface. This enables conservators to deepen their understanding of artifact preservation, observe material distribution in detail, and correlate this with 3D geometrical data. In this study, we experiment with this approach on an ancient Roman sculpture of Artemis, conserved at the Archeological and Art Museum of Maremma (MAAM) in Grosseto, Italy.
[CV-65] RUN: Reversible Unfolding Network for Concealed Object Segmentation
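Quick read: This paper addresses the fact that existing concealed object segmentation (COS) methods are confined to the mask domain and leave the RGB domain underexplored. To overcome this limitation, it proposes the Reversible Unfolding Network (RUN). RUN formulates a COS model with an extra residual sparsity constraint that minimizes segmentation uncertainty, and unfolds the model's iterative optimization steps into a multistage network. Each stage contains two reversible modules: the Segmentation-Oriented Foreground Separation (SOFS) module and the Reconstruction-Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level, while ROBE extends it to the RGB domain to resolve the conflicting foreground and background regions that arise when independent modules estimate them separately. RUN thereby progressively models foreground and background reversibly in both the mask and RGB domains, improving segmentation accuracy and reducing false positives and false negatives.
Link: https://arxiv.org/abs/2501.18783
Authors: Chunming He, Rihan Zhang, Fengyang Xiao, Chenyu Fang, Longxiang Tang, Yulun Zhang, Linghe Kong, Deng-Ping Fan, Kai Li, Sina Farsiu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 tables, 8 figures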
Abstract:Existing concealed object segmentation (COS) methods frequently utilize reversible strategies to address uncertain regions. However, these approaches are typically restricted to the mask domain, leaving the potential of the RGB domain underexplored. To address this, we propose the Reversible Unfolding Network (RUN), which applies reversible strategies across both mask and RGB domains through a theoretically grounded framework, enabling accurate segmentation. RUN first formulates a novel COS model by incorporating an extra residual sparsity constraint to minimize segmentation uncertainties. The iterative optimization steps of the proposed model are then unfolded into a multistage network, with each step corresponding to a stage. Each stage of RUN consists of two reversible modules: the Segmentation-Oriented Foreground Separation (SOFS) module and the Reconstruction-Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level and introduces Reversible State Space to capture non-local information. ROBE extends this to the RGB domain, employing a reconstruction network to address conflicting foreground and background regions identified as distortion-prone areas, which arise from their separate estimation by independent modules. As the stages progress, RUN gradually facilitates reversible modeling of foreground and background in both the mask and RGB domains, directing the network’s attention to uncertain regions and mitigating false-positive and false-negative results. Extensive experiments demonstrate the superior performance of RUN and highlight the potential of unfolding-based frameworks for COS and other high-level vision tasks. We will release the code and models.
[CV-66] A New Statistical Approach to the Performance Analysis of Vision-based Localization
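Quick read: This paper addresses how a wireless device can use vision sensor data to determine its precise location when wireless-based positioning is inaccurate or unavailable. The key is to use geometric constraints to narrow down the candidate landmark combinations, together with the result that, without noise, three range measurements are sufficient to uniquely determine the correct landmark combination in a two-dimensional plane, even when individual landmarks are visually indistinguishable. For noisy measurements, the paper gives a mathematical characterization of the probability of correctly identifying the landmark combination, based on a novel joint distribution of key random variables.
Link: https://arxiv.org/abs/2501.18758
Authors: Haozhou Hu, Harpreet S. Dhillon, R. Michael Buehrer
Affiliations: Wireless@VT, Bradley Department of Electrical and Computer Engineering, Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Image and Video Processing (eess.IV); Statistics Theory (math.ST); Applications (stat.AP)
Comments: 14 pages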
Abstract:Many modern wireless devices with accurate positioning needs also have access to vision sensors, such as a camera, radar, and Light Detection and Ranging (LiDAR). In scenarios where wireless-based positioning is either inaccurate or unavailable, using information from vision sensors becomes highly desirable for determining the precise location of the wireless device. Specifically, vision data can be used to estimate distances between the target (where the sensors are mounted) and nearby landmarks. However, a significant challenge in positioning using these measurements is the inability to uniquely identify which specific landmark is visible in the data. For instance, when the target is located close to a lamppost, it becomes challenging to precisely identify the specific lamppost (among several in the region) that is near the target. This work proposes a new framework for target localization using range measurements to multiple proximate landmarks. The geometric constraints introduced by these measurements are utilized to narrow down candidate landmark combinations corresponding to the range measurements and, consequently, the target’s location on a map. By modeling landmarks as a marked Poisson point process (PPP), we show that three noise-free range measurements are sufficient to uniquely determine the correct combination of landmarks in a two-dimensional plane. For noisy measurements, we provide a mathematical characterization of the probability of correctly identifying the observed landmark combination based on a novel joint distribution of key random variables. Our results demonstrate that the landmark combination can be identified using ranges, even when individual landmarks are visually indistinguishable.
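The geometric idea in this abstract, that three noise-free ranges constrain which landmarks were observed, can be sketched as a brute-force search. This is only an illustration under stated assumptions, not the paper's framework: the helper names `trilaterate` and `consistent_combinations`, the tolerance, and the exhaustive search over ordered triples are all hypothetical choices.

```python
import itertools
import math

def trilaterate(p1, p2, p3, r1, r2, r3):
    """Solve for the 2D point at distances r1, r2, r3 from p1, p2, p3.

    Subtracting the first circle equation from the other two gives a
    2x2 linear system; returns None when the landmarks are collinear.
    """
    a11, a12 = 2 * (p2[0] - p1[0]), 2 * (p2[1] - p1[1])
    a21, a22 = 2 * (p3[0] - p1[0]), 2 * (p3[1] - p1[1])
    b1 = r1**2 - r2**2 + p2[0]**2 + p2[1]**2 - p1[0]**2 - p1[1]**2
    b2 = r1**2 - r3**2 + p3[0]**2 + p3[1]**2 - p1[0]**2 - p1[1]**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        return None
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)

def consistent_combinations(landmarks, ranges, tol=1e-6):
    """List (index_triple, position) pairs whose geometry matches all three ranges."""
    matches = []
    for combo in itertools.permutations(range(len(landmarks)), 3):
        pts = [landmarks[i] for i in combo]
        pos = trilaterate(*pts, *ranges)
        if pos is None:
            continue
        # The linear solve uses only differences of circle equations, so the
        # candidate must still be checked against every measured range.
        if all(abs(math.dist(pos, p) - r) <= tol for p, r in zip(pts, ranges)):
            matches.append((combo, pos))
    return matches
```

For landmarks in general position, the consistency check is expected to reject most wrong triples, so the surviving combination also fixes the target position on the map; with noisy ranges the tolerance would have to grow and ambiguity becomes a matter of probability, which is the case the paper analyzes.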
[CV-67] INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation
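Quick read: This paper addresses the insufficient generalization of Vision-Language Models (VLMs) to some image instances when a single task-generic prompt is used to segment diverse samples. The key to the solution is Instance-specific Negative Mining for Task-Generic Promptable Segmentation (INT). INT adaptively reduces the influence of irrelevant prior knowledge and increases the use of the most plausible prior knowledge, selected with higher contrast, to improve the generation of instance-specific prompts. INT has two components: (1) instance-specific prompt generation, which progressively filters out incorrect information during prompt generation; and (2) semantic mask generation, which ensures that the segmentation of each image instance matches the semantics of its instance-specific prompt.
Link: https://arxiv.org/abs/2501.18753
Authors: Jian Hu, Zixu Cheng, Shaogang Gong
Affiliations: Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A new task-generic promptable segmentation approach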
Abstract:Task-generic promptable image segmentation aims to achieve segmentation of diverse samples under a single task description by utilizing only one task-generic prompt. Current methods leverage the generalization capabilities of Vision-Language Models (VLMs) to infer instance-specific prompts from these task-generic prompts in order to guide the segmentation process. However, when VLMs struggle to generalise to some image instances, predicting instance-specific prompts becomes poor. To solve this problem, we introduce Instance-specific Negative Mining for Task-Generic Promptable Segmentation (INT). The key idea of INT is to adaptively reduce the influence of irrelevant (negative) prior knowledge whilst increasing the use of the most plausible prior knowledge, selected by negative mining with higher contrast, in order to optimise instance-specific prompt generation. Specifically, INT consists of two components: (1) instance-specific prompt generation, which progressively filters out incorrect information in prompt generation; (2) semantic mask generation, which ensures each image instance segmentation matches correctly the semantics of the instance-specific prompts. INT is validated on six datasets, including camouflaged objects and medical images, demonstrating its effectiveness, robustness and scalability.
[CV-68] Motion Diffusion Autoencoders: Enabling Attribute Manipulation in Human Motion Demonstrated on Karate Techniques
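Quick read: This paper addresses attribute manipulation in human motion data, that is, changing individual attributes of a data point or time series while leaving all other aspects unchanged. The key is a novel rotation-based pose representation that disentangles the human skeleton from the motion trajectory while still allowing accurate reconstruction of the original anatomy. The core of the approach is a transformer encoder that discovers high-level semantics and a diffusion probabilistic model that captures the remaining stochastic variation. The study shows that the embedding space obtained from the transformer encoder is semantically meaningful and linear, so high-level attributes can be manipulated by finding their linear direction of change in that space.
Link: https://arxiv.org/abs/2501.18729
Authors: Anthony Mendil, Felix Putze
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures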
Abstract:Attribute manipulation deals with the problem of changing individual attributes of a data point or a time series, while leaving all other aspects unaffected. This work focuses on the domain of human motion, more precisely karate movement patterns. To the best of our knowledge, it presents the first success at manipulating attributes of human motion data. One of the key requirements for achieving attribute manipulation on human motion is a suitable pose representation. Therefore, we design a novel rotation-based pose representation that enables the disentanglement of the human skeleton and the motion trajectory, while still allowing an accurate reconstruction of the original anatomy. The core idea of the manipulation approach is to use a transformer encoder for discovering high-level semantics, and a diffusion probabilistic model for modeling the remaining stochastic variations. We show that the embedding space obtained from the transformer encoder is semantically meaningful and linear. This enables the manipulation of high-level attributes, by discovering their linear direction of change in the semantic embedding space and moving the embedding along said direction. The code and data are available at this https URL.
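The last step of this abstract, moving an embedding along a linear direction of change, can be sketched in a few lines. How the paper actually finds the direction is not stated here, so the difference of class means used below is a stand-in assumption, and the names `attribute_direction` and `manipulate` are hypothetical.

```python
import math

def attribute_direction(embeddings, labels):
    """Unit vector from the mean of negative examples to the mean of positives.

    Assumption: the attribute is binary and a difference of class means is
    an adequate proxy for its linear direction in the embedding space.
    """
    dim = len(embeddings[0])
    pos = [e for e, flag in zip(embeddings, labels) if flag]
    neg = [e for e, flag in zip(embeddings, labels) if not flag]

    def mean(vectors):
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    diff = [p - n for p, n in zip(mean(pos), mean(neg))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def manipulate(embedding, direction, strength):
    """Shift an embedding along the attribute direction by `strength`."""
    return [e + strength * d for e, d in zip(embedding, direction)]
```

In a diffusion autoencoder setup, the shifted embedding would then condition the decoder, with the stochastic part of the code left untouched so that everything except the chosen attribute stays as close as possible to the original motion.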
[CV-69] Strong and Controllable 3D Motion Generation
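Quick read: This paper addresses two major challenges of generative AI for human motion generation: (1) generation is slow, which hinders real-time applications; and (2) text-based generation methods struggle to achieve precise joint-level control. It proposes two key components. First, it optimizes transformer-based diffusion models with a customized flash linear attention mechanism to improve hardware efficiency and reduce computational complexity, and customizes a consistency model to accelerate motion generation. Second, it introduces Motion ControlNet for more precise joint-level control of human motion than existing methods. These contributions significantly advance text-to-motion generation and bring it closer to real-world applications.
Link: https://arxiv.org/abs/2501.18726
Authors: Canxuan Gang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report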
Abstract:Human motion generation is a significant pursuit in generative computer vision with widespread applications in film-making, video games, AR/VR, and human-robot interaction. Current methods mainly utilize either diffusion-based generative models or autoregressive models for text-to-motion generation. However, they face two significant challenges: (1) The generation process is time-consuming, posing a major obstacle for real-time applications such as gaming, robot manipulation, and other online settings. (2) These methods typically learn a relative motion representation guided by text, making it difficult to generate motion sequences with precise joint-level control. These challenges significantly hinder progress and limit the real-world application of human motion generation techniques. To address this gap, we propose a simple yet effective architecture consisting of two key components. Firstly, we aim to improve hardware efficiency and computational complexity in transformer-based diffusion models for human motion generation. By customizing flash linear attention, we can optimize these models specifically for generating human motion efficiently. Furthermore, we will customize the consistency model in the motion latent space to further accelerate motion generation. Secondly, we introduce Motion ControlNet, which enables more precise joint-level control of human motion compared to previous text-to-motion generation methods. These contributions represent a significant advancement for text-to-motion generation, bringing it closer to real-world applications.
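For readers unfamiliar with linear attention, the sketch below shows the general mathematical form that makes its cost linear in sequence length. It is not the paper's customized flash linear attention, which concerns a hardware-efficient implementation; the feature map `elu(x) + 1`, the non-causal setting, and the function names are assumptions chosen for illustration.

```python
import math

def elu_plus_one(x):
    # Positive feature map phi(x) = elu(x) + 1 (an illustrative choice).
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(queries, keys, values):
    """Non-causal linear attention over lists of equal-length vectors.

    out_i = phi(q_i)^T S / (phi(q_i)^T z), where S = sum_j phi(k_j) v_j^T
    and z = sum_j phi(k_j). S and z are built in one pass over the keys,
    so the cost grows linearly with sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(keys, values):
        phi_k = [elu_plus_one(x) for x in k]
        for a in range(d_k):
            z[a] += phi_k[a]
            for b in range(d_v):
                S[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = [elu_plus_one(x) for x in q]
        denom = sum(phi_q[a] * z[a] for a in range(d_k))
        outputs.append([
            sum(phi_q[a] * S[a][b] for a in range(d_k)) / denom
            for b in range(d_v)
        ])
    return outputs
```

Because the key-value summary `S` has a fixed size, it never materializes the full attention matrix, which is the property that hardware-oriented variants build on.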
[CV-70] Full-Head Segmentation of MRI with Abnormal Brain Anatomy: Model and Data Release
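Quick read: This paper aims to develop a deep network for whole-head segmentation, particularly for clinical MRIs with abnormal anatomy, and to compile the first public benchmark dataset for this purpose. The key is a MultiAxial network consisting of three 2D U-Net models that operate independently in the sagittal, axial, and coronal planes and are then combined into a single 3D segmentation. The network achieves a test-set Dice score of 0.88 (median ± 0.04), significantly outperforms existing brain segmentation methods, and is especially robust in regions with abnormal anatomy and on de-identified images.
Link: https://arxiv.org/abs/2501.18716
Authors: Andrew M Birnbaum, Adam Buchwald, Peter Turkeltaub, Adam Jacks, Yu Huang, Abhisheck Datta, Lucas C Parra, Lukas A Hirsch
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: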
Abstract:The goal of this work was to develop a deep network for whole-head segmentation, including clinical MRIs with abnormal anatomy, and compile the first public benchmark dataset for this purpose. We collected 91 MRIs with volumetric segmentation labels for a diverse set of human subjects (4 normal, 32 traumatic brain injuries, and 57 strokes). These clinical cases are characterized by extended cerebrospinal fluid (CSF) in regions normally containing the brain. Training labels were generated by manually correcting initial automated segmentations for skin/scalp, skull, CSF, gray matter, white matter, air cavity, and extracephalic air. We developed a MultiAxial network consisting of three 2D U-Net models that operate independently in sagittal, axial, and coronal planes and are then combined to produce a single 3D segmentation. The MultiAxial network achieved test-set Dice scores of 0.88 (median plus-minus 0.04). For brain tissue, it significantly outperforms existing brain segmentation methods (MultiAxial: 0.898 plus-minus 0.041, SynthSeg: 0.758 plus-minus 0.054, BrainChop: 0.757 plus-minus 0.125). The MultiAxial network gains in robustness by avoiding the need for coregistration with an atlas. It performed well in regions with abnormal anatomy and on images that have been de-identified. It enables more robust current flow modeling when incorporated into ROAST, a widely-used modeling toolbox for transcranial electric stimulation. We are releasing a state-of-the-art model for whole-head MRI segmentation, along with a dataset of 61 clinical MRIs and training labels, including non-brain structures. Together, the model and data may serve as a benchmark for future efforts.
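Two ingredients of this abstract lend themselves to a small sketch: combining per-plane predictions into one labeling, and scoring it with the Dice coefficient. The abstract does not say how the three U-Net outputs are combined, so the probability averaging below is an assumption, as are the function names and the flat per-voxel list layout.

```python
def fuse_multiaxial(prob_sagittal, prob_axial, prob_coronal):
    """Average per-voxel class probabilities from three views, then argmax.

    Each input is a list with one entry per voxel, and each entry is a list
    of class probabilities. Averaging is an illustrative fusion rule.
    """
    fused = []
    for ps, pa, pc in zip(prob_sagittal, prob_axial, prob_coronal):
        avg = [(s + a + c) / 3.0 for s, a, c in zip(ps, pa, pc)]
        fused.append(max(range(len(avg)), key=avg.__getitem__))
    return fused

def dice(pred, target, label):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for one class label."""
    p = [x == label for x in pred]
    t = [x == label for x in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    # Convention: a class absent from both volumes counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom
```

Reporting Dice per tissue class, as the abstract does for brain tissue, means calling `dice` once per label on the fused volume.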
[CV-71] Human Re-ID Meets LVLMs: What can we expect?
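Quick read: This paper evaluates the performance of current leading Large Vision-Language Models (LVLMs) on the human re-identification (Re-ID) task and compares them with state-of-the-art AI models designed specifically for that task. The key is a comprehensive evaluation pipeline, covering dataset curation, prompt engineering, and metric selection, that analyzes the models from several angles: similarity scores, classification accuracy, and classification metrics such as precision, recall, F1 score, and area under the curve (AUC). The study shows that LVLMs have certain strengths but also significant limitations that can lead to severe errors and call for further research.
Link: https://arxiv.org/abs/2501.18698
Authors: Kailash Hambarde, Pranita Samale, Hugo Proença
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: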
Abstract:Large vision-language models (LVLMs) have been regarded as a breakthrough advance in an astounding variety of tasks, from content generation to virtual assistants and multimodal search or retrieval. However, for many of these applications, the performance of these methods has been widely criticized, particularly when compared with state-of-the-art methods and technologies in each specific domain. In this work, we compare the performance of the leading large vision-language models in the human re-identification task, using as baseline the performance attained by state-of-the-art AI models specifically designed for this problem. We compare the results due to ChatGPT-4o, Gemini-2.0-Flash, Claude 3.5 Sonnet, and Qwen-VL-Max to a baseline ReID PersonViT model, using the well-known Market1501 dataset. Our evaluation pipeline includes the dataset curation, prompt engineering, and metric selection to assess the models’ performance. Results are analyzed from many different perspectives: similarity scores, classification accuracy, and classification metrics, including precision, recall, F1 score, and area under curve (AUC). Our results confirm the strengths of LVLMs, but also their severe limitations that often lead to catastrophic answers and should be the scope of further research. As a concluding remark, we speculate about some further research that should fuse traditional and LVLMs to combine the strengths from both families of techniques and achieve solid improvements in performance.
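The classification metrics named in this abstract follow standard definitions, sketched below for pairwise same-identity decisions. The function name `binary_metrics` and the framing of Re-ID as a binary "same person / different person" decision per image pair are assumptions for illustration; the paper's exact protocol is not described here.

```python
def binary_metrics(predictions, labels):
    """Return (precision, recall, f1) for binary same-identity decisions.

    `predictions` and `labels` are equal-length sequences of truthy/falsy
    values, where truthy means "same person".
    """
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Reporting precision and recall separately matters for Re-ID, because a model that answers "same person" too readily can look accurate on balanced pairs while producing many false matches.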
[CV-72] Unpaired Translation of Point Clouds for Modeling Detector Response NEURIPS
Quick read: This paper addresses the key challenge of modeling detector response, specifically in Time Projection Chambers (TPCs). It casts the problem as an unpaired point cloud translation task between simulated data and data collected from experimental runs. The key is a new framework based on diffusion probabilistic models that performs this mapping, which can help with noise rejection and the construction of high-fidelity simulators.
Link: https://arxiv.org/abs/2501.18674
Authors: Mingyang Li, Michelle Kuchera, Raghuram Ramanujan, Adam Anthony, Curtis Hunt, Yassid Ayyad
Affiliations: Davidson College; High Point University; Michigan State University; University of Santiago de Compostela
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex)
Comments: NeurIPS Machine Learning and the Physical Sciences Workshop 2025
Abstract:Modeling detector response is a key challenge in time projection chambers. We cast this problem as an unpaired point cloud translation task, between data collected from simulations and from experimental runs. Effective translation can assist with both noise rejection and the construction of high-fidelity simulators. Building on recent work in diffusion probabilistic models, we present a novel framework for performing this mapping. We demonstrate the success of our approach in both synthetic domains and in data sourced from the Active-Target Time Projection Chamber.
[CV-73] Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting
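Quick read: This paper addresses the limitations of existing text-guided 3D scene editing methods, such as those for 3D Gaussian Splatting (3DGS), which struggle with geometric changes and lack precise control over the spatial position of the edited result. To overcome these limitations, it proposes DYG, a new drag-based editing method for 3D Gaussian Splatting. With DYG, users specify the editing region and the drag direction through 3D masks and pairs of control points, which gives precise control over the extent of the edit. DYG also uses an implicit triplane representation to establish the geometric scaffold of the editing result, overcoming the suboptimal results caused by the sparsity of 3DGS. In addition, a drag-based Latent Diffusion Model, incorporated through the proposed Drag-SDS loss function, enables flexible, multi-view consistent, and fine-grained editing.
Link: https://arxiv.org/abs/2501.18672
Authors: Yansong Qu, Dian Chen, Xinyang Li, Xiaofan Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Affiliations: Xiamen University; Baidu Inc.
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Visit our project page at this https URL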
Abstract:Recent advancements in 3D scene editing have been propelled by the rapid development of generative models. Existing methods typically utilize generative models to perform text-guided editing on 3D representations, such as 3D Gaussian Splatting (3DGS). However, these methods are often limited to texture modifications and fail when addressing geometric changes, such as editing a character’s head to turn around. Moreover, such methods lack accurate control over the spatial position of editing results, as language struggles to precisely describe the extent of edits. To overcome these limitations, we introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting. It enables users to conveniently specify the desired editing region and the desired dragging direction through the input of 3D masks and pairs of control points, thereby enabling precise control over the extent of editing. DYG integrates the strengths of the implicit triplane representation to establish the geometric scaffold of the editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model into our method through the proposed Drag-SDS loss function, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG conducts effective drag-based editing guided by control point prompts, surpassing other baselines in terms of editing effect and quality, both qualitatively and quantitatively. Visit our project page at this https URL.
[CV-74] High-Accuracy ECG Image Interpretation using Parameter-Efficient LoRA Fine-Tuning with Multimodal LLaMA 3.2
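Quick read: This paper aims to improve the accuracy and efficiency of electrocardiogram (ECG) image interpretation. The key is a parameter-efficient fine-tuning strategy, Low-Rank Adaptation (LoRA), combined with ECGInstruct, a large-scale instruction dataset of ECG images synthesized from trusted open-source repositories such as MIMIC-IV ECG and PTB-XL, each paired with expert-written questions and detailed answers. By updating only a small set of parameters, specifically ignoring the lm_head and embed_tokens layers, the method substantially improves the multimodal LLaMA 3.2 model's understanding of ECG images across a range of cardiac conditions. It reaches accuracy comparable to or exceeding traditional CNN-based methods and covers more than 70 cardiac conditions from PTB-XL.
Link: https://arxiv.org/abs/2501.18670
Authors: Nandakishor M, Anjali M
Affiliations: Convai Innovations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: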
Abstract:Electrocardiogram (ECG) interpretation is a cornerstone of cardiac diagnostics. This paper explores a practical approach to enhance ECG image interpretation using the multimodal LLaMA 3.2 model. We used a parameter-efficient fine-tuning strategy, Low-Rank Adaptation (LoRA), specifically designed to boost the model’s ability to understand ECG images and achieve better outcomes across a wide range of cardiac conditions. Our method is tailored for ECG analysis and leverages ECGInstruct, a large-scale instruction dataset with 1 million samples. This dataset is a rich collection of synthesized ECG images, generated from raw ECG data from trusted open-source repositories like MIMIC-IV ECG and PTB-XL. Each ECG image in ECGInstruct comes with expert-written questions and detailed answers, covering diverse ECG interpretation scenarios, including complex cardiac conditions like Myocardial Infarction and Conduction Disturbances. Our fine-tuning approach efficiently adapts the LLaMA 3.2 model (built upon LLaMA 3) by integrating low-rank adaptation techniques, focusing on efficiency by updating only a small set of parameters, specifically ignoring the lm_head and embed_tokens layers. This paper details the model setup, our efficient fine-tuning method, and implementation specifics. We provide a thorough evaluation through extensive experiments, demonstrating the effectiveness of our method across various ECG interpretation tasks. The results convincingly show that our parameter-efficient LoRA fine-tuning achieves excellent performance in ECG image interpretation, significantly outperforming baseline models and reaching accuracy comparable to or exceeding traditional CNN-based methods in identifying a wide range of cardiac abnormalities, including over 70 conditions from the PTB-XL dataset.
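The low-rank update that LoRA relies on can be illustrated with a small sketch. This is a pure-Python toy, not the authors' training code: the helper names (`matmul`, `lora_effective_weight`), the tiny shapes, and the `alpha / r` scaling convention are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A.

    W is the frozen pretrained weight; only the small factors A (r x d_in)
    and B (d_out x r) would be trained.
    """
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical frozen 2x3 weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]   # r x d_in
B_zero = [[0.0], [0.0]]  # d_out x r, zero-initialised: the adapted layer starts equal to W
B_trained = [[1.0], [2.0]]

start = lora_effective_weight(W, A, B_zero, alpha=2, r=1)
adapted = lora_effective_weight(W, A, B_trained, alpha=2, r=1)
```

With rank r much smaller than the layer dimensions, A and B together hold far fewer values than W, which is where the parameter savings come from.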
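[CV-75] Image, Text, and Speech Data Augmentation using Multimodal LLMs for Deep Learning: A Survey
Quick read: This survey addresses a gap in existing surveys on using large language models (LLMs) for multimodal data augmentation. It reviews recent literature that uses multimodal LLMs to augment image, text, and audio data, and it discusses the limitations of current approaches. It also identifies potential solutions from the literature for improving the effectiveness of LLM-based augmentation. The survey is intended as a foundation for future work on refining and expanding the use of multimodal LLMs to improve dataset quality and diversity for deep learning applications.
Link: https://arxiv.org/abs/2501.18648
Authors: Ranjan Sapkota, Shaina Raza, Maged Shoman, Achyut Paudel, Manoj Karkee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: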
Abstract:In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodality, for data augmentation to enhance generalization, and combat overfitting in training deep convolutional neural networks. However, while existing surveys predominantly focus on ML and DL techniques or limited modalities (text or images), a gap remains in addressing the latest advancements and multi-modal applications of LLM-based methods. This survey fills that gap by exploring recent literature utilizing multimodal LLMs to augment image, text, and audio data, offering a comprehensive understanding of these processes. We outlined various methods employed in the LLM-based image, text and speech augmentation, and discussed the limitations identified in current approaches. Additionally, we identified potential solutions to these limitations from the literature to enhance the efficacy of data augmentation practices using multimodal LLMs. This survey serves as a foundation for future research, aiming to refine and expand the use of multimodal LLMs in enhancing dataset quality and diversity for deep learning applications. (Surveyed Paper GitHub Repo: this https URL. Keywords: LLM data augmentation, LLM text data augmentation, LLM image data augmentation, LLM speech data augmentation, audio augmentation, voice augmentation, chatGPT for data augmentation, DeepSeek R1 text data augmentation, DeepSeek R1 image augmentation, Image Augmentation using LLM, Text Augmentation using LLM, LLM data augmentation for deep learning applications)
[CV-76] 3D Reconstruction of Shoes for Augmented Reality
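Quick read: This paper addresses the limited visual experience of online shoe shopping, which relies on static 2D images. The key solution is a framework based on 3D Gaussian Splatting that generates realistic 3D shoe models from 2D images, with a reported average Peak Signal-to-Noise Ratio (PSNR) of 0.32, and that delivers immersive augmented reality (AR) interaction on smartphones. The method relies on a custom shoe segmentation dataset of 3120 images, on which the best-performing segmentation model reaches an Intersection over Union (IoU) of 0.95.
Link: https://arxiv.org/abs/2501.18643
Authors: Pratik Shrestha, Sujan Kapali, Swikar Gautam, Vishal Pokharel, Santosh Giri
Affiliations: Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering, Lalitpur, Nepal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: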
Abstract:This paper introduces a mobile-based solution that enhances online shoe shopping through 3D modeling and Augmented Reality (AR), leveraging the efficiency of 3D Gaussian Splatting. Addressing the limitations of static 2D images, the framework generates realistic 3D shoe models from 2D images, achieving an average Peak Signal-to-Noise Ratio (PSNR) of 0.32, and enables immersive AR interactions via smartphones. A custom shoe segmentation dataset of 3120 images was created, with the best-performing segmentation model achieving an Intersection over Union (IoU) score of 0.95. This paper demonstrates the potential of 3D modeling and AR to revolutionize online shopping by offering realistic virtual interactions, with applicability across broader fashion categories.
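For readers unfamiliar with the two metrics quoted above, the following sketch shows their standard definitions. It is a generic illustration, not the authors' evaluation code: the function names, the flat-list image format, and the `max_value=1.0` intensity range are assumptions. PSNR as defined here is expressed in decibels.

```python
import math


def psnr(reference, test, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks given as flat lists of 0/1."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return intersection / union if union else 1.0


# Hypothetical toy inputs.
example_psnr = psnr([0.0, 0.0], [0.1, 0.1])    # uniform error of 0.1 on a [0, 1] scale
example_iou = iou([1, 1, 0, 0], [1, 0, 1, 0])  # one shared pixel out of three covered
```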
[CV-77] DebiasPI: Inference-time Debiasing by Prompt Iteration of a Text-to-Image Generative Model ECCV-2024 ECCV
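Quick read: This paper addresses racial and gender bias in text-to-image generative models. Existing methods either require retraining the model or struggle to produce images that reflect the desired distributions of gender and race. The paper proposes DebiasPI, an inference-time process that debiases through prompt iteration. The key is that the user controls the distribution of individuals' demographic attributes during generation: DebiasPI tracks which attributes have already been generated and steers the model toward attributes that are not yet sufficiently represented, yielding more balanced image sets.
Link: https://arxiv.org/abs/2501.18642
Authors: Sarah Bonna, Yu-Cheng Huang, Ekaterina Novozhilova, Sejin Paik, Zhengyang Shan, Michelle Yilin Feng, Ge Gao, Yonish Tayal, Rushil Kulkarni, Jialin Yu, Nupur Divekar, Deepti Ghadiyaram, Derry Wijaya, Margrit Betke
Affiliations: Boston University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: This work was presented at The European Conference on Computer Vision (ECCV) 2024 Workshop “Fairness and ethics towards transparent AI: facing the chalLEnge through model Debiasing” (FAILED), Milano, Italy, on September 29, 2024, this https URL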
Abstract:Ethical intervention prompting has emerged as a tool to counter demographic biases of text-to-image generative AI models. Existing solutions either require retraining the model or struggle to generate images that reflect desired distributions on gender and race. We propose an inference-time process called DebiasPI for Debiasing-by-Prompt-Iteration that provides prompt intervention by enabling the user to control the distributions of individuals’ demographic attributes in image generation. DebiasPI keeps track of which attributes have been generated either by probing the internal state of the model or by using external attribute classifiers. Its control loop guides the text-to-image model to select not yet sufficiently represented attributes. With DebiasPI, we were able to create images with equal representations of race and gender that visualize challenging concepts of news headlines. We also experimented with the attributes age, body type, profession, and skin tone, and measured how attributes change when our intervention prompt targets the distribution of an unrelated attribute type. We found, for example, if the text-to-image model is asked to balance racial representation, gender representation improves but the skin tone becomes less diverse. Attempts to cover a wide range of skin colors with various intervention prompts showed that the model struggles to generate the palest skin tones. We conducted various ablation studies, in which we removed DebiasPI’s attribute control, that reveal the model’s propensity to generate young, male characters. It sometimes visualized career success by generating two-panel images with a pre-success dark-skinned person becoming light-skinned with success, or switching gender from pre-success female to post-success male, thus further motivating ethical intervention prompting with DebiasPI.
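The control loop described in the abstract (track generated attributes, then steer toward under-represented ones) can be sketched as below. This is only one plausible reading, not the DebiasPI implementation: the selection rule, the prompt template, and the stub `generate` and `classify` functions are all hypothetical stand-ins for a real text-to-image model and attribute classifier.

```python
from collections import Counter


def next_attribute(generated, target):
    """Pick the attribute whose observed share lags its target share the most."""
    counts = Counter(generated)
    total = max(len(generated), 1)
    return max(target, key=lambda attr: target[attr] - counts[attr] / total)


def run_loop(target, generate, classify, steps):
    """Generate `steps` images, steering each prompt toward the largest deficit."""
    history = []
    for _ in range(steps):
        attr = next_attribute(history, target)
        image = generate(f"a photo of a {attr} person")  # hypothetical prompt template
        history.append(classify(image))                 # record what was actually produced
    return Counter(history)


# Stubs for illustration only: the "image" is just the prompt text,
# and the "classifier" reads the attribute word back out of it.
def fake_generate(prompt):
    return prompt


def fake_classify(image):
    return image.split()[4]


result = run_loop({"female": 0.5, "male": 0.5}, fake_generate, fake_classify, steps=10)
```

Because the loop records the classifier's output rather than the requested attribute, a model that ignores the prompt would show up as a persistent deficit instead of being silently counted as balanced.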
[CV-78] Image Velocimetry using Direct Displacement Field estimation with Neural Networks for Fluids
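Quick read: This paper addresses the need for higher spatial resolution in Particle Image Velocimetry (PIV). The key is a new method that uses neural networks and the optical flow equation to predict displacement vectors between sequential images. It yields a continuous representation of the displacement that can be evaluated at the full spatial resolution of the image, requires no prior training, and can be applied directly to any pair of images.
Link: https://arxiv.org/abs/2501.18641
Authors: Efraín Magaña, Francisco Sahli Costabal, Wernher Brevis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Fluid Dynamics (physics.flu-dyn)
Comments: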
Abstract:An important tool for experimental fluid mechanics research is Particle Image Velocimetry (PIV). Several robust methodologies have been proposed to estimate the velocity field from the images; however, alternative methods are still needed to increase the spatial resolution of the results. This work presents a novel approach for estimating fluid flow fields using neural networks and the optical flow equation to predict displacement vectors between sequential images. The result is a continuous representation of the displacement that can be evaluated at the full spatial resolution of the image. The methodology was validated on synthetic and experimental images. Accurate results were obtained for the estimation of instantaneous velocity fields and for the time-averaged turbulence quantities and power spectral density. The proposed methodology differs from previous attempts at using machine learning for this task: it does not require any prior training and can be used directly on any pair of images.
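The abstract does not spell out the loss, so the sketch below assumes the classical brightness-constancy form of the optical flow equation, I_x u + I_y v + I_t = 0, evaluated with central finite differences. The function name, the nested-list image format, and the toy images are illustrative assumptions; in a neural approach, a network mapping pixel coordinates to (u, v) would presumably be fitted by minimising the squared residual over the image pair.

```python
def flow_residual(img0, img1, u, v, x, y):
    """Residual of I_x * u + I_y * v + I_t at interior pixel (x, y).

    img0 and img1 are consecutive frames stored as lists of rows;
    (u, v) is the candidate displacement in pixels per frame.
    """
    i_x = (img0[y][x + 1] - img0[y][x - 1]) / 2.0  # horizontal intensity gradient
    i_y = (img0[y + 1][x] - img0[y - 1][x]) / 2.0  # vertical intensity gradient
    i_t = img1[y][x] - img0[y][x]                  # temporal intensity change
    return i_x * u + i_y * v + i_t


# Toy frames: a horizontal intensity ramp that moves one pixel to the right.
frame0 = [[float(x) for x in range(5)] for _ in range(5)]
frame1 = [[float(x - 1) for x in range(5)] for _ in range(5)]

correct = flow_residual(frame0, frame1, u=1.0, v=0.0, x=2, y=2)
wrong = flow_residual(frame0, frame1, u=0.0, v=0.0, x=2, y=2)
```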
[CV-79] Machine learning of microstructure–property relationships in materials with robust features from foundational vision transformers
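Quick read: This paper addresses the practice in computational materials science of developing a task-specific model for each microstructure–property relationship. The key is to use pre-trained foundational vision transformers to extract task-agnostic microstructure features, followed by lightweight machine learning to predict microstructure-dependent properties. This avoids expensive task-specific training or fine-tuning of bespoke deep learning models.
Link: https://arxiv.org/abs/2501.18637
Authors: Sheila E. Whitman, Marat I. Latypov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: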
Abstract:Machine learning of microstructure–property relationships from data is an emerging approach in computational materials science. Most existing machine learning efforts focus on the development of task-specific models for each microstructure–property relationship. We propose utilizing pre-trained foundational vision transformers for the extraction of task-agnostic microstructure features and subsequent light-weight machine learning of a microstructure-dependent property. We demonstrate our approach with pre-trained state-of-the-art vision transformers (CLIP, DINOV2, SAM) in two case studies on machine-learning: (i) elastic modulus of two-phase microstructures based on simulations data; and (ii) Vicker’s hardness of Ni-base and Co-base superalloys based on experimental data published in literature. Our results show the potential of foundational vision transformers for robust microstructure representation and efficient machine learning of microstructure–property relationships without the need for expensive task-specific training or fine-tuning of bespoke deep learning models.
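[CV-80] Towards Understanding Depth Perception in Foveated Rendering
Quick read: This paper studies how peripheral blur affects stereoscopic depth perception, to determine whether foveated rendering in real-time virtual and augmented reality systems distorts perceived depth. The key is a psychovisual experiment that quantifies the effect of peripheral blur on depth perception, from which the authors derive a simple perceptual model that determines the amount of foveation that does not affect stereoacuity. The results show that stereoacuity remains unaffected, or even improves, under high levels of peripheral blur, which suggests that foveated rendering can achieve higher rendering efficiency without harming stereoscopic depth perception.
Link: https://arxiv.org/abs/2501.18635
Authors: Sophie Kergaßner, Taimoor Tariq, Piotr Didyk
Affiliations: Università della Svizzera italiana
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 8 pages including references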
Abstract:The true vision for real-time virtual and augmented reality is reproducing our visual reality in its entirety on immersive displays. To this end, foveated rendering leverages the limitations of spatial acuity in human peripheral vision to allocate computational resources to the fovea while reducing quality in the periphery. Such methods are often derived from studies on the spatial resolution of the human visual system and its ability to perceive blur in the periphery, enabling the potential for high spatial quality in real-time. However, the effects of blur on other visual cues that depend on luminance contrast, such as depth, remain largely unexplored. It is critical to understand this interplay, as accurate depth representation is a fundamental aspect of visual realism. In this paper, we present the first evaluation exploring the effects of foveated rendering on stereoscopic depth perception. We design a psychovisual experiment to quantitatively study the effects of peripheral blur on depth perception. Our analysis demonstrates that stereoscopic acuity remains unaffected (or even improves) by high levels of peripheral blur. Based on our studies, we derive a simple perceptual model that determines the amount of foveation that does not affect stereoacuity. Furthermore, we analyze the model in the context of common foveation practices reported in literature. The findings indicate that foveated rendering does not impact stereoscopic depth perception, and stereoacuity remains unaffected up to 2x stronger foveation than commonly used. Finally, we conduct a validation experiment and show that our findings hold for complex natural stimuli.
[CV-81] Deformable Beta Splatting
Quick read: This paper addresses the limited ability of existing 3D Gaussian Splatting (3DGS) methods to capture complex geometry and diverse colors. The key is Deformable Beta Splatting (DBS), which replaces Gaussian kernels with deformable Beta kernels to improve the fidelity of the geometric representation and extends the Beta kernel to color encoding. The authors also prove that adjusting regularized opacity alone ensures distribution-preserved Markov chain Monte Carlo (MCMC) densification, independent of the splatting kernel type. Experiments show that DBS uses only 45% of the parameters and renders 1.5x faster than 3DGS-based methods, and that, for the first time, a splatting-based method outperforms state-of-the-art Neural Radiance Fields.
Link: https://arxiv.org/abs/2501.18630
Authors: Rong Liu, Dylan Sun, Meida Chen, Yue Wang, Andrew Feng
Affiliations: University of Southern California, Institute for Creative Technologies; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Abstract:3D Gaussian Splatting (3DGS) has advanced radiance field reconstruction by enabling real-time rendering. However, its reliance on Gaussian kernels for geometry and low-order Spherical Harmonics (SH) for color encoding limits its ability to capture complex geometries and diverse colors. We introduce Deformable Beta Splatting (DBS), a deformable and compact approach that enhances both geometry and color representation. DBS replaces Gaussian kernels with deformable Beta Kernels, which offer bounded support and adaptive frequency control to capture fine geometric details with higher fidelity while achieving better memory efficiency. In addition, we extended the Beta Kernel to color encoding, which facilitates improved representation of diffuse and specular components, yielding superior results compared to SH-based methods. Furthermore, unlike prior densification techniques that depend on Gaussian properties, we mathematically prove that adjusting regularized opacity alone ensures distribution-preserved Markov chain Monte Carlo (MCMC), independent of the splatting kernel type. Experimental results demonstrate that DBS achieves state-of-the-art visual quality while utilizing only 45% of the parameters and rendering 1.5x faster than 3DGS-based methods. Notably, for the first time, splatting-based methods outperform state-of-the-art Neural Radiance Fields, highlighting the superior performance and efficiency of DBS for real-time radiance field rendering.
[CV-82] A Radiance Field Loss for Fast and Simple Emissive Surface Reconstruction
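Quick read: This paper addresses the problem of converting images into an emissive surface-based scene representation. The key is a modification of the loss function: the training images are projected into the scene to directly supervise the spatio-directional radiance field, which removes alpha blending and ray marching from the image formation model and moves them into the loss computation. Besides promoting convergence to surfaces, this assigns explicit semantic meaning to 2D subsets of the radiance field, turning them into well-defined emissive surfaces from which a high-quality emissive surface model is extracted.
Link: https://arxiv.org/abs/2501.18627
Authors: Ziyi Zhang, Nicolas Roussel, Thomas Müller, Tizian Zeltner, Merlin Nimier-David, Fabrice Rousselle, Wenzel Jakob
Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); NVIDIA
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: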
Abstract:We present a fast and simple technique to convert images into an emissive surface-based scene representation. Building on existing emissive volume reconstruction algorithms, we introduce a subtle yet impactful modification of the loss function requiring changes to only a few lines of code: instead of integrating the radiance field along rays and supervising the resulting images, we project the training images into the scene to directly supervise the spatio-directional radiance field. The primary outcome of this change is the complete removal of alpha blending and ray marching from the image formation model, instead moving these steps into the loss computation. In addition to promoting convergence to surfaces, this formulation assigns explicit semantic meaning to 2D subsets of the radiance field, turning them into well-defined emissive surfaces. We finally extract a level set from this representation, which results in a high-quality emissive surface model. Our method retains much of the speed and quality of the baseline algorithm. For instance, a suitably modified variant of Instant NGP maintains comparable computational efficiency, while achieving an average PSNR that is only 0.1 dB lower. Most importantly, our method generates explicit surfaces in place of an exponential volume, doing so with a level of simplicity not seen in prior work.
[CV-83] VLMaterial: Procedural Material Generation with Large Vision-Language Models
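Quick read: This paper addresses the difficulty of generating a procedural material from an input image. Procedural materials, usually represented as functional node graphs, are widely used in computer graphics for photorealistic material appearance design, but creating one from an image requires professional knowledge and significant effort. The key solution is to convert procedural materials into standard Python programs and fine-tune a large pre-trained vision-language model (VLM) to generate such programs from input images. To make fine-tuning effective, the paper also contributes an open-source procedural material dataset and proposes program-level augmentation by prompting another pre-trained large language model (LLM). Extensive evaluation shows that the method outperforms previous methods on both synthetic and real-world examples.
Link: https://arxiv.org/abs/2501.18623
Authors: Beichen Li, Rundi Wu, Armando Solar-Lezama, Changxi Zheng, Liang Shi, Bernd Bickel, Wojciech Matusik
Affiliations: MIT CSAIL; Columbia University; ETH Zürich; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: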
Abstract:Procedural materials, represented as functional node graphs, are ubiquitous in computer graphics for photorealistic material appearance design. They allow users to perform intuitive and precise editing to achieve desired visual appearances. However, creating a procedural material given an input image requires professional knowledge and significant effort. In this work, we leverage the ability to convert procedural materials into standard Python programs and fine-tune a large pre-trained vision-language model (VLM) to generate such programs from input images. To enable effective fine-tuning, we also contribute an open-source procedural material dataset and propose to perform program-level augmentation by prompting another pre-trained large language model (LLM). Through extensive evaluation, we show that our method outperforms previous methods on both synthetic and real-world examples.
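[CV-84] Three Laws of Statistical Linguistics Emerging in images
Quick read: This paper investigates whether images also obey the laws of statistical linguistics. Starting from the relationship between human thought and visual perception, the authors use a pre-trained deep convolutional neural network (DCNN) to define "words" in images. The key is to use VGG-19, define a word via each kernel, and count the pixels with grayscale values greater than 90%. By ranking word frequencies, randomizing the order of kernel appearances, and summing word counts layer by layer, they find that Zipf's, Heaps', and Benford's laws also hold for the words that make up the text representing different images.
Link: https://arxiv.org/abs/2501.18620
Authors: Ping-Rui Tsai, Chi-hsiang Wang, Yu-Cheng Liao, Tzay-Ming Hong
Affiliations: National Tsing Hua University; National Yang Ming Chiao Tung University; National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
Comments: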
Abstract:Images, as a product evolving alongside civilization, develop similarly to natural languages with the advancement of civilization. Not only are images abundant in daily life, but they are also influenced by technology in shaping their forms, embodying various characteristics as they evolve in time. Language is a sequence of symbols that represents thoughts. While a written language is typically associated with the close integration of text and sound, as a combination of visual symbols and perception, the communicative power of image is no less significant. This is especially notable since 60% of the sensory input received by our central nervous system comes from vision. Given the symbolic system inherent in images, we are curious whether images can also exhibit the laws of statistical linguistics. To explore this, we begin with the relationship between human thought and visual perception to decode how images are formed by the latter mechanism. Building upon previous studies that established the high correlation between pre-trained deep convolutional neural networks and the human visual system, we use the VGG-19 to define words via each kernel and calculate the number of pixels with grayscale values greater than 90%. By (a) ranking word frequencies, (b) randomizing the order of kernel appearances and performing the same word count accumulation, and (c) summing the word counts layer by layer, we are surprised to find that Zipf’s, Heaps’, and Benford’s laws of statistical linguistics also exist in the words that comprise the text representing different images.
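As a rough illustration of step (a), Zipf's law says that frequency falls off as a power of rank, so a log-log plot of frequency against rank is close to a straight line with slope near -1. The sketch below estimates that slope by ordinary least squares. It is a generic check rather than the authors' pipeline: the function name and the toy token list are assumptions, and in the paper the "words" would be kernel-derived counts rather than strings.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct tokens. A slope close to -1 is the
    classical Zipf signature.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy corpus whose frequencies are exactly 12 / rank: 12, 6, 4, 3.
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
slope = zipf_slope(toy_tokens)
```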
[CV-85] FAAGC: Feature Augmentation on Adaptive Geodesic Curve Based on the shape space theory IJCAI2025
Quick read: This paper addresses the difficulty of applying deep learning models in fields where data are limited. The key solution is a feature augmentation method in the pre-shape space, Feature Augmentation on Adaptive Geodesic Curve (FAAGC). It projects deep model representations into the pre-shape space, constructs a geodesic curve (an arc of a great circle) for each class, and performs feature augmentation by sampling along these geodesic paths. Experiments show that FAAGC improves classification accuracy under data-scarce conditions and generalizes well across feature types.
Link: https://arxiv.org/abs/2501.18619
Authors: Yuexing Han, Ruijie Li
Affiliations: School of Computer Engineering and Science, Shanghai University; Key Laboratory of Silicate Cultural Relics Conservation (Shanghai University), Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, submitted to IJCAI 2025
Abstract:Deep learning models have been widely applied across various domains and industries. However, many fields still face challenges due to limited and insufficient data. This paper proposes a Feature Augmentation on Adaptive Geodesic Curve (FAAGC) method in the pre-shape space to increase data. In the pre-shape space, objects with identical shapes lie on a great circle. Thus, we project deep model representations into the pre-shape space and construct a geodesic curve, i.e., an arc of a great circle, for each class. Feature augmentation is then performed by sampling along these geodesic paths. Extensive experiments demonstrate that FAAGC improves classification accuracy under data-scarce conditions and generalizes well across various feature types.
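Sampling along an arc of a great circle can be sketched with spherical linear interpolation between two unit vectors. This is a simplified illustration, not the FAAGC algorithm: it interpolates between two same-class features instead of fitting an adaptive per-class curve, and it stands in for the pre-shape projection with plain unit-norm scaling (the centering step is assumed to have been applied already). All names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def geodesic_point(p, q, t):
    """Point at fraction t in [0, 1] along the great-circle arc from p to q.

    Not well defined for antipodal inputs, where the arc is not unique.
    """
    p, q = normalize(p), normalize(q)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    theta = math.acos(dot)  # angle between the two features
    if theta < 1e-12:
        return p            # the points coincide
    sin_theta = math.sin(theta)
    w_p = math.sin((1 - t) * theta) / sin_theta
    w_q = math.sin(t * theta) / sin_theta
    return [w_p * a + w_q * b for a, b in zip(p, q)]


# Two hypothetical same-class features and one augmented sample between them.
midpoint = geodesic_point([2.0, 0.0], [0.0, 3.0], t=0.5)
```

Unlike straight-line interpolation, every sample produced this way keeps unit norm, so augmented features stay on the same sphere as the originals.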
[CV-86] Vision Aided Channel Prediction for Vehicular Communications: A Case Study of Received Power Prediction Using RGB Images
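Quick read: This paper addresses the complexity of channel prediction for millimeter-wave vehicular communications in 6G scenarios, where conventional methods struggle to balance accuracy, practicality, and generalizability and often fail to exploit environmental features. The key solution is a vision-aided two-stage model that predicts received power from RGB images alone. Original images of the propagation environment are captured with an RGB camera; stage 1 extracts environmental information using three typical computer vision methods (object detection, instance segmentation, and binary masks); stage 2 predicts received power from the processed images. Pre-trained YOLOv8 and ResNets are used in stages 1 and 2, respectively, and fine-tuned. Experiments demonstrate the model's feasibility, accuracy, and generalization capability.
Link: https://arxiv.org/abs/2501.18618
Authors: Xuejian Zhang, Ruisi He, Mi Yang, Zhengyu Zhang, Ziyi Qi, Bo Ai
Affiliations: School of Electronics and Information Engineering and the Frontiers Science Center for Smart High-speed Railway System, Beijing Jiaotong University; Henan High-Speed Railway Operation and Maintenance Engineering Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 11 figures, submitted to IEEE Transactions on Vehicular Technology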
Abstract:The communication scenarios and channel characteristics of 6G will be more complex and difficult to characterize. Conventional methods for channel prediction face challenges in achieving an optimal balance between accuracy, practicality, and generalizability. Additionally, they often fail to effectively leverage environmental features. Within the framework of integration communication and artificial intelligence as a pivotal development vision for 6G, it is imperative to achieve intelligent prediction of channel characteristics. Vision-aided methods have been employed in various wireless communication tasks, excluding channel prediction, and have demonstrated enhanced efficiency and performance. In this paper, we propose a vision-aided two-stage model for channel prediction in millimeter wave vehicular communication scenarios, realizing accurate received power prediction utilizing solely RGB images. Firstly, we obtain original images of propagation environment through an RGB camera. Secondly, three typical computer vision methods including object detection, instance segmentation and binary mask are employed for environmental information extraction from original images in stage 1, and prediction of received power based on processed images is implemented in stage 2. Pre-trained YOLOv8 and ResNets are used in stages 1 and 2, respectively, and fine-tuned on datasets. Finally, we conduct five experiments to evaluate the performance of proposed model, demonstrating its feasibility, accuracy and generalization capabilities. The model proposed in this paper offers novel solutions for achieving intelligent channel prediction in vehicular communications.
[CV-87] STAMP: Scalable Task And Model-agnostic Collaborative Perception ICLR2025
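Quick read: This paper addresses the limitations of single-agent perception in autonomous driving, where sensors' physical limits degrade performance under severe occlusion, in adverse weather, and when detecting distant objects. It proposes STAMP, a scalable, task- and model-agnostic collaborative perception framework for heterogeneous agents. The key is the use of lightweight adapter-reverter pairs that transform Bird's Eye View (BEV) features between agent-specific and shared protocol domains, enabling efficient feature sharing and fusion while minimizing computational overhead, improving scalability, and preserving model security.
Link: https://arxiv.org/abs/2501.18616
Authors: Xiangbo Gao, Runsheng Xu, Jiachen Li, Ziran Wang, Zhiwen Fan, Zhengzhong Tu
Affiliations: Texas A&M University; UCLA; UC Riverside; Purdue University; UT Austin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Paper is accepted by ICLR 2025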
Abstract:Perception is crucial for autonomous driving, but single-agent perception is often constrained by sensors’ physical limitations, leading to degraded performance under severe occlusion, adverse weather conditions, and when detecting distant objects. Multi-agent collaborative perception offers a solution, yet challenges arise when integrating heterogeneous agents with varying model architectures. To address these challenges, we propose STAMP, a scalable task- and model-agnostic, collaborative perception pipeline for heterogeneous agents. STAMP utilizes lightweight adapter-reverter pairs to transform Bird’s Eye View (BEV) features between agent-specific and shared protocol domains, enabling efficient feature sharing and fusion. This approach minimizes computational overhead, enhances scalability, and preserves model security. Experiments on simulated and real-world datasets demonstrate STAMP’s comparable or superior accuracy to state-of-the-art models with significantly reduced computational costs. As a first-of-its-kind task- and model-agnostic framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy. Our project page is at this https URL and the code is available at this https URL.
[CV-88] Multi-Frame Blind Manifold Deconvolution for Rotating Synthetic Aperture Imaging
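Quick read: This paper addresses deblurring of images acquired by rotating synthetic aperture (RSA) imaging systems, in particular by exploiting low-dimensional manifold structure through manifold fitting and penalisation within multi-frame blind deconvolution. The key is to use the low-dimensional manifold structure of the latent image within the high-dimensional ambient space to obtain a sharper estimate of the latent image, in terms of both pixel intensities and preserved structural detail.
Link: https://arxiv.org/abs/2501.19386
Authors: Dao Lin, Jian Zhang, Martin Benning
Affiliations: School of Mathematics, Statistics and Actuarial Science, University of Kent; Department of Computer Science, University College London
Categories: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 39 pages, 9 figures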
Abstract:A rotating synthetic aperture (RSA) imaging system captures images of the target scene at different rotation angles by rotating a rectangular aperture. Deblurring acquired RSA images plays a critical role in reconstructing a latent sharp image underlying the scene. In the past decade, the emergence of blind deconvolution technology has revolutionised this field by its ability to model complex features from acquired images. Most of the existing methods attempt to solve the above ill-posed inverse problem through maximising a posterior. Despite this progress, researchers have paid limited attention to exploring low-dimensional manifold structures of the latent image within a high-dimensional ambient-space. Here, we propose a novel method to process RSA images using manifold fitting and penalisation in the context of multi-frame blind deconvolution. We develop fast algorithms for implementing the proposed procedure. Simulation studies demonstrate that manifold-based deconvolution can outperform conventional deconvolution algorithms in the sense that it can generate a sharper estimate of the latent image in terms of estimating pixel intensities and preserving structural details.
[CV-89] Using gradient of Lagrangian function to compute efficient channels for the ideal observer
[Quick read]: This paper addresses the difficulty of evaluating and computing the ideal observer (IO) and the Hotelling observer (HO) on high-dimensional medical image data. The key is a new method that generates efficient channels, called Lagrangian-gradient (L-grad) channels, from the gradient of a Lagrangian-based loss function, reducing the dimensionality of the image data while improving signal detection performance and lowering computational cost. In binary signal detection tasks, the channelized Hotelling observer (CHO) using L-grad channels significantly outperforms the CHO using partial least squares (PLS) channels and takes far less time to compute.
Link: https://arxiv.org/abs/2501.19381
Authors: Weimin Zhou
Affiliations: Wyant College of Optical Sciences, University of Arizona; Department of Medical Imaging, University of Arizona
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
Comments: SPIE Medical Imaging 2025
Abstract:It is widely accepted that the Bayesian ideal observer (IO) should be used to guide the objective assessment and optimization of medical imaging systems. The IO employs complete task-specific information to compute test statistics for making inference decisions and performs optimally in signal detection tasks. However, the IO test statistic typically depends non-linearly on the image data and cannot be analytically determined. The ideal linear observer, known as the Hotelling observer (HO), can sometimes be used as a surrogate for the IO. However, when image data are high dimensional, HO computation can be difficult. Efficient channels that can extract task-relevant features have been investigated to reduce the dimensionality of image data to approximate IO and HO performance. This work proposes a novel method for generating efficient channels by use of the gradient of a Lagrangian-based loss function that was designed to learn the HO. The generated channels are referred to as the Lagrangian-gradient (L-grad) channels. Numerical studies are conducted that consider binary signal detection tasks involving various backgrounds and signals. It is demonstrated that channelized HO (CHO) using L-grad channels can produce significantly better signal detection performance compared to the CHO using PLS channels. Moreover, it is shown that the proposed L-grad method can achieve significantly lower computation time compared to the PLS method.
[CV-90] Pathological MRI Segmentation by Synthetic Pathological Data Generation in Fetuses and Neonates
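[Quick read]: This paper addresses the limited development of automated analysis methods for clinical fetal and neonatal MRI, caused mainly by the scarcity of annotated pathological datasets and by privacy restrictions on data sharing, both of which hinder deep learning models. The proposed solution has two parts: first, FetalNeonatal-DDPM, a new diffusion model framework that generates high-quality synthetic pathological fetal and neonatal MRIs from semantic label images; second, augmenting the training data by modifying healthy label images to simulate conditions such as ventriculomegaly, cerebellar and pontocerebellar hypoplasia, and microcephaly. Radiologists rated the synthetic MRIs significantly higher than real MRIs in quality and diagnostic value, and the synthetic data improved state-of-the-art nnUNet segmentation, particularly for severe ventriculomegaly cases, with the largest gain in ventricle segmentation (Dice scores: 0.9253 vs. 0.7317). These results show the potential of generative AI as a data augmentation tool for improving segmentation in pathological cases.
Link: https://arxiv.org/abs/2501.19338
Authors: Misha P.T Kaandorp, Damola Agbelese, Hosna Asma-ull, Hyun-Gi Kim, Kelly Payette, Patrice Grehten, Gennari Antonio Giulio, Levente István Lánczi, Andras Jakab
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 4 figures, 5 tables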
Abstract:Developing new methods for the automated analysis of clinical fetal and neonatal MRI data is limited by the scarcity of annotated pathological datasets and privacy concerns that often restrict data sharing, hindering the effectiveness of deep learning models. We address this in two ways. First, we introduce FetalNeonatal-DDPM, a novel diffusion model framework designed to generate high-quality synthetic pathological fetal and neonatal MRIs from semantic label images. Second, we enhance training data by modifying healthy label images through morphological alterations to simulate conditions such as ventriculomegaly, cerebellar and pontocerebellar hypoplasia, and microcephaly. By leveraging FetalNeonatal-DDPM, we synthesize realistic pathological MRIs from these modified pathological label images. Radiologists rated the synthetic MRIs as significantly (p < 0.05) superior in quality and diagnostic value compared to real MRIs, demonstrating features such as blood vessels and choroid plexus, and improved alignment with label annotations. Synthetic pathological data enhanced state-of-the-art nnUNet segmentation performance, particularly for severe ventriculomegaly cases, with the greatest improvements achieved in ventricle segmentation (Dice scores: 0.9253 vs. 0.7317). This study underscores the potential of generative AI as a transformative tool for data augmentation, offering improved segmentation performance in pathological cases. This development represents a significant step towards improving analysis and segmentation accuracy in prenatal imaging, and also offers new ways for data anonymization through the generation of pathologic image data.
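For readers unfamiliar with the Dice score quoted in this abstract, the following is a minimal sketch of how it is commonly computed for two binary segmentation masks. The function name `dice_score`, the flattened toy masks, and the empty-mask convention are illustrative assumptions; the paper's own evaluation code is not shown in the abstract.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened masks (hypothetical data, not from the paper).
prediction = [0, 1, 1, 1, 0, 0]
reference = [0, 1, 1, 0, 1, 0]
print(dice_score(prediction, reference))
```

A score of 1.0 means the predicted and reference masks overlap perfectly, and 0.0 means they share no foreground voxels.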
[CV-91] Single cell resolution 3D imaging and segmentation within intact live tissues
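[Quick read]: This paper addresses accurate quantification of fluorescently labelled individual cells in 3D within live tissues. The key is a detailed step-by-step protocol covering sample preparation, imaging, and deep-learning-based cell segmentation, enabling high-resolution 3D quantification of cellular properties in complex tissue structures. It also covers the choice of microscopy modality and settings (such as objective and sample mounting) and of suitable segmentation methods.
Link: https://arxiv.org/abs/2501.19203
Authors: G. Paci, P. Vicente-Munuera, I. Fernandez-Mosquera, A. Miranda, K. Lau, Q. Zhang, R. Barrientos, Y. Mao
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Tissues and Organs (q-bio.TO)
Comments: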
Abstract:Epithelial cells form diverse structures from squamous spherical organoids to densely packed pseudostratified tissues. Quantification of cellular properties in these contexts requires high-resolution deep imaging and computational techniques to achieve truthful three-dimensional (3D) structural features. Here, we describe a detailed step-by-step protocol for sample preparation, imaging and deep-learning-assisted cell segmentation to achieve accurate quantification of fluorescently labelled individual cells in 3D within live tissues. We share the lessons learned through troubleshooting 3D imaging of Drosophila wing discs, including considerations on the choice of microscopy modality and settings (objective, sample mounting) and available segmentation methods. In addition, we include a computational pipeline alongside custom code to assist replication of the protocol. While we focus on the segmentation of cell outlines from membrane labelling, this protocol applies to a wide variety of samples, and we believe it will be valuable for studying other tissues that demand complex analysis in 3D.
[CV-92] Augmented Intelligence for Multimodal Virtual Biopsy in Breast Cancer Using Generative Artificial Intelligence
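[Quick read]: This paper addresses the limited effectiveness of full-field digital mammography (FFDM) in breast cancer screening for patients with dense breast tissue or fibrocystic conditions, and aims to improve the accuracy of virtual biopsy in distinguishing benign from malignant lesions. The key is a multimodal, multi-view deep learning approach that combines FFDM with contrast-enhanced spectral mammography (CESM) for lesion classification. To handle missing CESM data, generative AI is used to synthesize CESM images from FFDM scans. Experiments show that, when real CESM data are unavailable, synthetic CESM images improve virtual biopsy performance over FFDM alone, especially in multimodal configurations that combine FFDM and CESM.
Link: https://arxiv.org/abs/2501.19176
Authors: Aurora Rofena, Claudia Lucia Piccolo, Bruno Beomonte Zobel, Paolo Soda, Valerio Guarrasi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: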
Abstract:Full-Field Digital Mammography (FFDM) is the primary imaging modality for routine breast cancer screening; however, its effectiveness is limited in patients with dense breast tissue or fibrocystic conditions. Contrast-Enhanced Spectral Mammography (CESM), a second-level imaging technique, offers enhanced accuracy in tumor detection. Nonetheless, its application is restricted due to higher radiation exposure, the use of contrast agents, and limited accessibility. As a result, CESM is typically reserved for select cases, leaving many patients to rely solely on FFDM despite the superior diagnostic performance of CESM. While biopsy remains the gold standard for definitive diagnosis, it is an invasive procedure that can cause discomfort for patients. We introduce a multimodal, multi-view deep learning approach for virtual biopsy, integrating FFDM and CESM modalities in craniocaudal and mediolateral oblique views to classify lesions as malignant or benign. To address the challenge of missing CESM data, we leverage generative artificial intelligence to impute CESM images from FFDM scans. Experimental results demonstrate that incorporating the CESM modality is crucial to enhance the performance of virtual biopsy. When real CESM data is missing, synthetic CESM images proved effective, outperforming the use of FFDM alone, particularly in multimodal configurations that combine FFDM and CESM modalities. The proposed approach has the potential to improve diagnostic workflows, providing clinicians with augmented intelligence tools to improve diagnostic accuracy and patient care. Additionally, as a contribution to the research community, we publicly release the dataset used in our experiments, facilitating further advancements in this field.
[CV-93] The Role of Graph-based MIL and Interventional Training in the Generalization of WSI Classifiers ML4H2024
[Quick read]: This paper addresses the challenges that whole slide imaging (WSI) poses for deep learning in cancer diagnosis: gigapixel-resolution images and scarce annotated data. Conventional multiple instance learning (MIL) methods ignore the spatial relationships between patches, which are crucial for cancer grading and diagnosis. To address this, the paper introduces a new framework, Graph-based Multiple Instance Learning with Interventional Training (GMIL-IT), which integrates spatial information through graph construction to improve generalization. The key is to combine graph methods, which capture spatial dependencies between patches, with interventional training, which reduces reliance on irrelevant factors such as color variation and improves robustness.
Link: https://arxiv.org/abs/2501.19048
Authors: Rita Pereira, M. Rita Verdelho, Catarina Barata, Carlos Santiago
Affiliations: Institute for Systems and Robotics; LARSyS; Instituto Superior Técnico; University of Lisbon; Lisbon; Portugal
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at ML4H 2024 - Findings Track
Abstract:Whole Slide Imaging (WSI), which involves high-resolution digital scans of pathology slides, has become the gold standard for cancer diagnosis, but its gigapixel resolution and the scarcity of annotated datasets present challenges for deep learning models. Multiple Instance Learning (MIL), a widely-used weakly supervised approach, bypasses the need for patch-level annotations. However, conventional MIL methods overlook the spatial relationships between patches, which are crucial for tasks such as cancer grading and diagnosis. To address this, graph-based approaches have gained prominence by incorporating spatial information through node connections. Despite their potential, both MIL and graph-based models are vulnerable to learning spurious associations, like color variations in WSIs, affecting their robustness. In this dissertation, we conduct an extensive comparison of multiple graph construction techniques, MIL models, graph-MIL approaches, and interventional training, introducing a new framework, Graph-based Multiple Instance Learning with Interventional Training (GMIL-IT), for WSI classification. We evaluate their impact on model generalization through domain shift analysis and demonstrate that graph-based models alone achieve the generalization initially anticipated from interventional training. Our code is available here: this http URL
[CV-94] Understanding Model Calibration – A gentle introduction and visual exploration of calibration and the expected calibration error (ECE)
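[Quick read]: This paper explores different notions of model calibration and how they are evaluated, and highlights problems with an evaluation measure that is still widely used. The key is to point out the limitations of the existing calibration measure and to show why additional notions of calibration, with their own evaluation measures, are needed.
Link: https://arxiv.org/abs/2501.19047
Authors: Maja Pavlovic
Affiliations: Queen Mary University London
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: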
Abstract:To be considered reliable, a model must be calibrated so that its confidence in each decision closely reflects its true outcome. In this blogpost we’ll take a look at the most commonly used definition for calibration and then dive into a frequently used evaluation measure for model calibration. We’ll then cover some of the drawbacks of this measure and how these surfaced the need for additional notions of calibration, which require their own new evaluation measures. This post is not intended to be an in-depth dissection of all works on calibration, nor does it focus on how to calibrate models. Instead, it is meant to provide a gentle introduction to the different notions and their evaluation measures as well as to re-highlight some issues with a measure that is still widely used to evaluate calibration.
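As a rough illustration of the expected calibration error (ECE) named in this entry's title, here is a minimal sketch of the common binned estimator: group predictions by confidence, then take the sample-weighted average of the gap between accuracy and mean confidence in each bin. The function name, the equal-width left-closed binning, and the toy data are assumptions for illustration; the post itself may use a different binning convention.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes to the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hits[b] += 1 if ok else 0
        count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy predictions (hypothetical): top-class confidence and whether the prediction was right.
toy_conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
toy_correct = [True, True, True, False, True, False]
print(expected_calibration_error(toy_conf, toy_correct))
```

Because the estimate depends on the number and placement of bins, the same model can receive different ECE values under different binning choices, which is one of the known drawbacks of this measure.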
[CV-95] Full-scale Representation Guided Network for Retinal Vessel Segmentation
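[Quick read]: This paper aims to improve retinal vessel segmentation and to address the parameter-size and scalability limitations of existing U-Net architectures and their variants. The key is the Full Scale Guided Network (FSG-Net), in which modernized convolution blocks extract full-scale information and a guided convolution block refines it using an attention-guided filter. This design produces improved attention maps and thus better segmentation. The architecture is also scalable: the structure preceding the guided convolution block can be replaced by any U-Net variant. Experiments show results competitive with current state-of-the-art models on several public datasets, even at much smaller parameter sizes.
Link: https://arxiv.org/abs/2501.18921
Authors: Sunyong Seo, Huisu Yoon, Semin Kim, Jongha Lee
Affiliations: lululab Inc., AI R&D center
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures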
Abstract:The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full Scale Guided Network (FSG-Net), where the feature representation network with modernized convolution blocks extracts full-scale information and the guided convolution block refines that information. Attention-guided filter is introduced to the guided convolution block under the interpretation that the filter behaves like the unsharp mask filter. Passing full-scale information to the attention block allows for the generation of improved attention maps, which are then passed to the attention-guided filter, resulting in performance enhancement of the segmentation network. The structure preceding the guided convolution block can be replaced by any U-Net variant, which enhances the scalability of the proposed approach. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments also show that the proposed network demonstrates competitive results compared to current SOTA models on various public datasets. Ablation studies demonstrate that the proposed model is competitive with much smaller parameter sizes. Lastly, by applying the proposed model to facial wrinkle segmentation, we confirmed the potential for scalability to similar tasks in other domains. Our code is available on this https URL.
[CV-96] Pitfalls of defacing whole-head MRI: re-identification risk with diffusion models and compromised research potential
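[Quick read]: This paper evaluates how well defacing protects privacy and how it affects downstream tasks. The key is a refacing pipeline based on cascaded diffusion probabilistic models (DPMs) that recovers faces from defaced head MRI. The trained models are tested on unseen datasets, and the paper further analyzes whether the voxels altered by defacing are useful for predicting skeletal muscle radiodensity. The results show that DPMs can generate faces that closely resemble the originals, and that defacing weakens skeletal muscle radiodensity prediction, suggesting that defacing may both fail to protect privacy and discard valuable information.
Link: https://arxiv.org/abs/2501.18834
Authors: Chenyu Gao, Kaiwen Xu, Michael E. Kim, Lianrui Zuo, Zhiyuan Li, Derek B. Archer, Timothy J. Hohman, Ann Zenobia Moore, Luigi Ferrucci, Lori L. Beason-Held, Susan M. Resnick, Christos Davatzikos, Jerry L. Prince, Bennett A. Landman
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: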
Abstract:Defacing is often applied to head magnetic resonance image (MRI) datasets prior to public release to address privacy concerns. The alteration of facial and nearby voxels has provoked discussions about the true capability of these techniques to ensure privacy as well as their impact on downstream tasks. With advancements in deep generative models, the extent to which defacing can protect privacy is uncertain. Additionally, while the altered voxels are known to contain valuable anatomical information, their potential to support research beyond the anatomical regions directly affected by defacing remains uncertain. To evaluate these considerations, we develop a refacing pipeline that recovers faces in defaced head MRIs using cascaded diffusion probabilistic models (DPMs). The DPMs are trained on images from 180 subjects and tested on images from 484 unseen subjects, 469 of whom are from a different dataset. To assess whether the altered voxels in defacing contain universally useful information, we also predict computed tomography (CT)-derived skeletal muscle radiodensity from facial voxels in both defaced and original MRIs. The results show that DPMs can generate high-fidelity faces that resemble the original faces from defaced images, with surface distances to the original faces significantly smaller than those of a population average face (p < 0.05). This performance also generalizes well to previously unseen datasets. For skeletal muscle radiodensity predictions, using defaced images results in significantly weaker Spearman’s rank correlation coefficients compared to using original images (p < 10^-4). For shin muscle, the correlation is statistically significant (p < 0.05) when using original images but not statistically significant (p ≥ 0.05) when any defacing method is applied, suggesting that defacing might not only fail to protect privacy but also eliminate valuable information.
[CV-97] An Adversarial Approach to Register Extreme Resolution Tissue Cleared 3D Brain Images
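[Quick read]: This paper addresses registration of extremely high-resolution tissue-cleared images, on which traditional image registration methods perform poorly. It proposes InvGAN, a patch-based generative network. InvGAN registers such images efficiently and accurately, taking about 7 minutes at 25% resolution and only 10 minutes at 100% resolution, far faster than traditional methods.
Link: https://arxiv.org/abs/2501.18815
Authors: Abdullah Naziba, Clinton Fookes, Dimitri Perrin
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: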
Abstract:We developed a generative patch-based 3D image registration model that can register very high resolution images obtained from a biochemical process named tissue clearing. The tissue clearing process removes lipids and fats from the tissue and makes the tissue transparent. When cleared tissues are imaged with Light-sheet fluorescent microscopy, the resulting images give a clear window to the cellular activities and dynamics inside the this http URL the images obtained are very rich with cellular information and hence their resolution is extremely high (e.g., 2560x2160x676). Analyzing images with such high resolution is a difficult task for any image analysis this http URL registration is a common step in image analysis pipelines when comparisons between images are required. Traditional image registration methods fail to register images of such extent. In this paper we address this very high resolution image registration issue by proposing a patch-based generative network named InvGAN. Our proposed network can register very high resolution tissue-cleared images. The tissue-cleared dataset used in this paper was obtained from a tissue clearing protocol named CUBIC. We compared our method both with traditional and deep-learning based registration this http URL different versions of CUBIC dataset are used, representing two different resolutions, 25% and 100% respectively. Experiments on two different resolutions clearly show the impact of resolution on the registration quality. At 25% resolution, our method achieves comparable registration accuracy in a very short time (approximately 7 minutes). At 100% resolution, most of the traditional registration methods fail except Elastix registration this http URL takes 28 hours to register where the proposed InvGAN takes only 10 minutes.
[CV-98] PSO-Net: Development of an automated psoriasis assessment system using attention-based interpretable deep neural networks
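[Quick read]: This paper addresses several problems in the clinical assessment of psoriasis: the burden on patients of in-person clinic visits, the time investigators need for scoring, and inter- and intra-rater scoring variability. It proposes PSO-Net, a novel and interpretable deep learning architecture that maps digital images from different anatomical regions to attention-based scores, then combines the regional scores to estimate an absolute Psoriasis Area and Severity Index (PASI) score. It also introduces a new regression activation map that provides interpretability by ranking attention scores. The approach achieves intra-class correlation scores of 82.2% [95% CI: 77%-87%] and 87.8% [95% CI: 84%-91%] with two different clinician raters.
Link: https://arxiv.org/abs/2501.18782
Authors: Sharif A. Kamran, Molly V. Lucas, Brendon Lutnick, Chaitanya Parmar, Basudha Pal, Asha Patel Shah, David Apfel, Steven Fakharzadeh, Lloyd Miller, Stephen Yip, Kristopher Standish, Gabriela Oana Cula
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE ISBI 2025. 5 Pages, 3 figures, 2 tables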
Abstract:Psoriasis is a chronic skin condition that requires long-term treatment and monitoring. Although the Psoriasis Area and Severity Index (PASI) is utilized as a standard measurement to assess psoriasis severity in clinical trials, it has many drawbacks such as (1) patient burden for in-person clinic visits for assessment of psoriasis, (2) time required for investigator scoring and (3) variability of inter- and intra-rater scoring. To address these drawbacks, we propose a novel and interpretable deep learning architecture called PSO-Net, which maps digital images from different anatomical regions to derive attention-based scores. Regional scores are further combined to estimate an absolute PASI score. Moreover, we devise a novel regression activation map for interpretability through ranking attention scores. Using this approach, we achieved inter-class correlation scores of 82.2% [95% CI: 77-87%] and 87.8% [95% CI: 84-91%] with two different clinician raters, respectively.
[CV-99] Distillation-Driven Diffusion Model for Multi-Scale MRI Super-Resolution: Make 1.5T MRI Great Again
[Quick read]: This paper addresses the limited spatial resolution of standard 1.5T MRI systems in clinical use, and the high cost and limited availability that keep 7T MRI from wide adoption. The key is a novel super-resolution (SR) model that generates 7T-like images from standard 1.5T MRI. The method uses a diffusion-based architecture guided by gradient nonlinearity correction and bias field correction data from 7T imaging. A progressive distillation strategy refines a student model step by step so that it achieves strong 7T SR performance at a small model size. Experiments show that the baseline teacher model reaches state-of-the-art SR performance, while the lightweight student model sacrifices little performance and accepts MRI inputs at varying resolutions without retraining, which improves deployment flexibility.
Link: https://arxiv.org/abs/2501.18736
Authors: Zhe Wang, Yuhua Ru, Fabian Bauer, Aladine Chetouani, Fang Chen, Liping Zhang, Didier Hans, Rachid Jennane, Mohamed Jarraya, Yung Hsin Chen
Affiliations: Department of Radiology, Massachusetts General Hospital, Harvard Medical School; L2TI Laboratory, University Sorbonne Paris Nord; Jiangsu Institute of Hematology, The First Affiliated Hospital of Soochow University; Department of Medical School, Henan University of Chinese Medicine; Division of Radiology, German Cancer Research Center; Athinoula A. Martinos Centre for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School; Nuclear Medicine Division, Geneva University Hospital; IDP Institute, UMR CNRS 7013, University of Orleans
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Magnetic Resonance Imaging (MRI) offers critical insights into microstructural details; however, the spatial resolution of standard 1.5T imaging systems is often limited. In contrast, 7T MRI provides significantly enhanced spatial resolution, enabling finer visualization of anatomical structures. Despite this, the high cost and limited availability of 7T MRI hinder its widespread use in clinical settings. To address this challenge, a novel Super-Resolution (SR) model is proposed to generate 7T-like MRI from standard 1.5T MRI scans. Our approach leverages a diffusion-based architecture, incorporating gradient nonlinearity correction and bias field correction data from 7T imaging as guidance. Moreover, to improve deployability, a progressive distillation strategy is introduced. Specifically, the student model refines the 7T SR task in steps, leveraging feature maps from the inference phase of the teacher model as guidance, so that the student model progressively achieves 7T SR performance with a smaller, deployable model size. Experimental results demonstrate that our baseline teacher model achieves state-of-the-art SR performance. The student model, while lightweight, sacrifices minimal performance. Furthermore, the student model is capable of accepting MRI inputs at varying resolutions without the need for retraining, further enhancing deployment flexibility. The clinical relevance of our proposed method is validated using clinical data from Massachusetts General Hospital. Our code is available at this https URL.
[CV-100] Rethinking the Upsampling Layer in Hyperspectral Image Super Resolution
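[Quick read]: This paper addresses the heavy computational burden of single hyperspectral image super-resolution (SHSR), especially for deployment in real-time scenarios. It proposes LKCA-Net, a novel lightweight network that uses channel attention to calibrate multi-scale channel features. The paper also shows, for the first time, that the low-rank property of the learnable upsampling layer is a key bottleneck in lightweight SHSR methods, and applies a low-rank approximation strategy to reduce that layer's parameter redundancy. A knowledge-distillation-based feature alignment technique ensures that the low-rank approximated network retains the feature representation capacity of the original.
Link: https://arxiv.org/abs/2501.18664
Authors: Haohan Shi, Fei Zhou, Xin Sun, Jungong Han
Affiliations: Faculty of Data Science, City University of Macau; College of Oceanography and Space Informatics, China University of Petroleum (East China); Department of Automation, Tsinghua University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: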
Abstract:Deep learning has achieved significant success in single hyperspectral image super-resolution (SHSR); however, the high spectral dimensionality leads to a heavy computational burden, thus making it difficult to deploy in real-time scenarios. To address this issue, this paper proposes a novel lightweight SHSR network, i.e., LKCA-Net, that incorporates channel attention to calibrate multi-scale channel features of hyperspectral images. Furthermore, we demonstrate, for the first time, that the low-rank property of the learnable upsampling layer is a key bottleneck in lightweight SHSR methods. To address this, we employ the low-rank approximation strategy to optimize the parameter redundancy of the learnable upsampling layer. Additionally, we introduce a knowledge distillation-based feature alignment technique to ensure the low-rank approximated network retains the same feature representation capacity as the original. We conducted extensive experiments on the Chikusei, Houston 2018, and Pavia Center datasets compared to some SOTAs. The results demonstrate that our method is competitive in performance while achieving speedups of several dozen to even hundreds of times compared to other well-performing SHSR methods.
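To make the parameter-redundancy argument concrete, the sketch below counts the weights of a dense upsampling convolution and of a rank-r factorization of the same weight matrix. Everything here is an illustrative assumption: the abstract does not describe the layer shapes or the exact factorization used by LKCA-Net, so the pixel-shuffle-style layout, the channel counts, and the rank are hypothetical.

```python
def full_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a dense kernel x kernel convolution (bias ignored)."""
    return c_out * c_in * kernel * kernel


def low_rank_params(c_in: int, c_out: int, kernel: int, rank: int) -> int:
    """Weights after factorizing the c_out x (c_in * kernel^2) matrix as U @ V,
    with U of shape c_out x rank and V of shape rank x (c_in * kernel^2)."""
    return rank * (c_out + c_in * kernel * kernel)


# Hypothetical sizes: 64 input channels, 4x upsampling via pixel shuffle
# (so the convolution emits 64 * 4 * 4 channels), 3x3 kernel, rank 16.
c_in, scale, kernel, rank = 64, 4, 3, 16
c_out = c_in * scale * scale

dense = full_params(c_in, c_out, kernel)
factored = low_rank_params(c_in, c_out, kernel, rank)
print(dense, factored, dense / factored)
```

Under these assumed sizes the factorized layer holds 25,600 weights instead of 589,824, about a 23x reduction; whether accuracy survives such a reduction is what the paper's distillation-based feature alignment is meant to address.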
[CV-101] Review and Recommendations for using Artificial Intelligence in Intracoronary Optical Coherence Tomography Analysis
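[Quick read]: This paper assesses the clinical potential of AI models, published between January 2015 and February 2023, that diagnose coronary artery disease (CAD) from intravascular optical coherent tomography (IVOCT) images, and identifies their methodological flaws and potential biases. Its key contribution is a set of recommendations for improving model quality and research practices to support the development of clinically useful AI products.
Link: https://arxiv.org/abs/2501.18614
Authors: Xu Chen, Yuan Huang, Benn Jessney, Jason Sangha, Sophie Gu, Carola-Bibiane Schönlieb, Martin Bennett, Michael Roberts
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: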
Abstract:Artificial intelligence (AI) methodologies hold great promise for the rapid and accurate diagnosis of coronary artery disease (CAD) from intravascular optical coherent tomography (IVOCT) images. Numerous papers have been published describing AI-based models for different diagnostic tasks, yet it remains unclear which models have potential clinical utility and have been properly validated. This systematic review considered published literature between January 2015 and February 2023 describing AI-based diagnosis of CAD using IVOCT. Our search identified 5,576 studies, with 513 included after initial screening and 35 studies included in the final systematic review after quality screening. Our findings indicate that most of the identified models are not currently suitable for clinical use, primarily due to methodological flaws and underlying biases. To address these issues, we provide recommendations to improve model quality and research practices to enhance the development of clinically useful AI products.
Artificial Intelligence
[AI-0] AI Biases Towards Rich and Powerful Surnames
Link: https://arxiv.org/abs/2501.19407
Authors: Pat Pataranutaporn, Nattavudh Powdthavee, Pattie Maes
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 54 pages, 5 figures, 1 table
Abstract:Surnames often convey implicit markers of social status, wealth, and lineage, shaping perceptions in ways that can perpetuate systemic biases. This study investigates whether and how surnames influence AI-driven decision-making, focusing on their effects across key areas such as hiring recommendations, leadership appointments, and loan approvals. Drawing on 600 surnames from the United States and Thailand, countries with differing sociohistorical dynamics and surname conventions, we categorize names into Rich, Legacy, Normal, and phonetically similar Variant groups. Our findings reveal that elite surnames consistently predict AI-generated perceptions of power, intelligence, and wealth, leading to significant consequences for decisions in high-stakes situations. Mediation analysis highlights perceived intelligence as a crucial pathway through which surname biases operate. Providing objective qualifications alongside the surnames reduces, but does not eliminate, these biases, especially in contexts with uniformly low credentials. These results call for fairness-aware algorithms and robust policy interventions to mitigate the reinforcement of inherited inequalities by AI systems. Our work also urges a reexamination of algorithmic accountability and its societal impact, particularly in systems designed for meritocratic outcomes.
[AI-1] Redefining Machine Unlearning: A Conformal Prediction-Motivated Approach
Link: https://arxiv.org/abs/2501.19403
Authors: Yingdan Shi, Ren Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Machine unlearning seeks to systematically remove specified data from a trained model, effectively achieving a state as though the data had never been encountered during training. While metrics such as Unlearning Accuracy (UA) and Membership Inference Attack (MIA) provide a baseline for assessing unlearning performance, they fall short of evaluating the completeness and reliability of forgetting. This is because the ground truth labels remain potential candidates within the scope of uncertainty quantification, leaving gaps in the evaluation of true forgetting. In this paper, we identify critical limitations in existing unlearning metrics and propose enhanced evaluation metrics inspired by conformal prediction. Our metrics can effectively capture the extent to which ground truth labels are excluded from the prediction set. Furthermore, we observe that many existing machine unlearning methods do not achieve satisfactory forgetting performance when evaluated with our new metrics. To address this, we propose an unlearning framework that integrates conformal prediction insights into the Carlini-Wagner adversarial attack loss. Extensive experiments on the image classification task demonstrate that our enhanced metrics offer deeper insights into unlearning effectiveness, and that our unlearning framework significantly improves the forgetting quality of unlearning methods.
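To illustrate what it means for a ground truth label to be "excluded from the prediction set", here is a minimal sketch of standard split conformal prediction for classification. It is not the paper's metric, whose definition the abstract does not give. The function names, the score 1 - p(true label), and the toy probabilities are assumptions for illustration.

```python
import math
from typing import List, Sequence


def conformal_threshold(
    cal_probs: Sequence[Sequence[float]], cal_labels: Sequence[int], alpha: float
) -> float:
    """Split conformal quantile of the nonconformity scores 1 - p(true label)."""
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    # With too few calibration points the set must contain every label.
    return float("inf") if rank > n else scores[rank - 1]


def prediction_set(probs: Sequence[float], qhat: float) -> List[int]:
    """All labels whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


# Hypothetical calibration data: 5 samples, 3 classes, true label listed per sample.
cal_probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.3, 0.4],
]
cal_labels = [0, 1, 2, 0, 0]

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.4)
forget_sample_probs = [0.7, 0.2, 0.1]  # model output on a sample that should be forgotten
forget_sample_label = 1                # its ground truth label
pred_set = prediction_set(forget_sample_probs, qhat)
print(pred_set, forget_sample_label not in pred_set)
```

In this toy case the true label 1 falls outside the prediction set, which is the kind of event a conformal-style forgetting metric would count. A model that still places the true label inside the set could pass an accuracy-based check while retaining the label as a plausible candidate.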
[AI-2] Vintix: Action Model via In-Context Reinforcement Learning
Link: https://arxiv.org/abs/2501.19400
Authors: Andrey Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, Vladislav Kurenkov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Preprint. In review
Abstract:In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems. Code to be released at this https URL
[AI-3] Do LLMs Strategically Reveal, Conceal, and Infer Information? A Theoretical and Empirical Analysis in The Chameleon Game
Link: https://arxiv.org/abs/2501.19398
Authors: Mustafa O. Karabag, Ufuk Topcu
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:Large language model-based (LLM-based) agents have become common in settings that include non-cooperative parties. In such settings, agents’ decision-making needs to conceal information from their adversaries, reveal information to their cooperators, and infer information to identify the other agents’ characteristics. To investigate whether LLMs have these information control and decision-making capabilities, we make LLM agents play the language-based hidden-identity game, The Chameleon. In the game, a group of non-chameleon agents who do not know each other aim to identify the chameleon agent without revealing a secret. The game requires the aforementioned information control capabilities both as a chameleon and a non-chameleon. The empirical results show that while non-chameleon LLM agents identify the chameleon, they fail to conceal the secret from the chameleon, and their winning probability is far from the levels of even trivial strategies. To formally explain this behavior, we give a theoretical analysis for a spectrum of strategies, from concealing to revealing, and provide bounds on the non-chameleons’ winning probability. Based on the empirical results and theoretical analysis of different strategies, we deduce that LLM-based non-chameleon agents reveal excessive information to agents of unknown identities. Our results point to a weakness of contemporary LLMs, including GPT-4, GPT-4o, Gemini 1.5, and Claude 3.5 Sonnet, in strategic interactions.
[AI-4] CoSTI: Consistency Models for (a faster) Spatio-Temporal Imputation
Link: https://arxiv.org/abs/2501.19364
Authors: Javier Solís-García, Belén Vega-Márquez, Juan A. Nepomuceno, Isabel A. Nepomuceno-Chamorro
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures, 13 tables
Abstract:Multivariate Time Series Imputation (MTSI) is crucial for many applications, such as healthcare monitoring and traffic management, where incomplete data can compromise decision-making. Existing state-of-the-art methods, like Denoising Diffusion Probabilistic Models (DDPMs), achieve high imputation accuracy; however, they suffer from significant computational costs and are notably time-consuming due to their iterative nature. In this work, we propose CoSTI, an innovative adaptation of Consistency Models (CMs) for the MTSI domain. CoSTI employs Consistency Training to achieve comparable imputation quality to DDPMs while drastically reducing inference times, making it more suitable for real-time applications. We evaluate CoSTI across multiple datasets and missing data scenarios, demonstrating up to a 98% reduction in imputation time with performance on par with diffusion-based models. This work bridges the gap between efficiency and accuracy in generative imputation tasks, providing a scalable solution for handling missing data in critical spatio-temporal systems.
[AI-5] MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems
Link: https://arxiv.org/abs/2501.19318
Authors: Anirudh Chari, Suraj Reddy, Aditya Tiwari, Richard Lian, Brian Zhou
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:While large language models (LLMs) have shown promising capabilities as zero-shot planners for embodied agents, their inability to learn from experience and build persistent mental models limits their robustness in complex open-world environments like Minecraft. We introduce MINDSTORES, an experience-augmented planning framework that enables embodied agents to build and leverage mental models through natural interaction with their environment. Drawing inspiration from how humans construct and refine cognitive mental models, our approach extends existing zero-shot LLM planning by maintaining a database of past experiences that informs future planning iterations. The key innovation is representing accumulated experiences as natural language embeddings of (state, task, plan, outcome) tuples, which can then be efficiently retrieved and reasoned over by an LLM planner to generate insights and guide plan refinement for novel states and tasks. Through extensive experiments in the MineDojo environment, a simulation environment for agents in Minecraft that provides low-level controls for Minecraft, we find that MINDSTORES learns and applies its knowledge significantly better than existing memory-based LLM planners while maintaining the flexibility and generalization benefits of zero-shot approaches, representing an important step toward more capable embodied AI systems that can learn continuously through natural experience.
[AI-6] Ontological analysis of proactive life event services
Link: https://arxiv.org/abs/2501.19308
Authors: Kuldar Taveter
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Life event service is a direct digital public service provided jointly by several governmental institutions so that a person can fulfill all the obligations and use all the rights that arise due to a particular event or situation in personal life. Life event service consolidates several public services related to the same life event into one service for the service consumer. This paper presents an ontological analysis of life event services, which is based on the works by Guarino, Guizzardi, Nardi, Wagner, and others. The purpose of the ontological analysis is to understand the meanings of life event, proactive public service based on life event, and other related notions. This kind of ontological analysis is crucial because for implementing the hardware and software architectures of e-government and digital public services, it is essential to agree upon the precise meanings of the underlying terms.
[AI-7] Synthetic User Behavior Sequence Generation with Large Language Models for Smart Homes
Link: https://arxiv.org/abs/2501.19298
Authors: Zhiyao Xu, Dan Zhao, Qingsong Zou, Jingyu Xiao, Yong Jiang, Zhenhui Yuan, Qing Li
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:In recent years, as smart home systems have become more widespread, security concerns within these environments have become a growing threat. Currently, most smart home security solutions, such as anomaly detection and behavior prediction models, are trained using fixed datasets that are precollected. However, the process of dataset collection is time-consuming and lacks the flexibility needed to adapt to the constantly evolving smart home environment. Additionally, the collection of personal data raises significant privacy concerns for users. Lately, large language models (LLMs) have emerged as a powerful tool for a wide range of tasks across diverse application domains, thanks to their strong capabilities in natural language processing, reasoning, and problem-solving. In this paper, we propose an LLM-based synthetic dataset generation IoTGen framework to enhance the generalization of downstream smart home intelligent models. By generating new synthetic datasets that reflect changes in the environment, smart home intelligent models can be retrained to overcome the limitations of fixed and outdated data, allowing them to better align with the dynamic nature of real-world home environments. Specifically, we first propose a Structure Pattern Perception Compression (SPPC) method tailored for IoT behavior data, which preserves the most informative content in the data while significantly reducing token consumption. Then, we propose a systematic approach to create prompts and implement data generation to automatically generate IoT synthetic data with normative and reasonable properties, assisting task models in adaptive training to improve generalization and real-world performance.
[AI-8] Analysis of LLMs vs Human Experts in Requirements Engineering
Link: https://arxiv.org/abs/2501.19297
Authors: Cory Hymel, Hiroe Johnson
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 8 pages, 15 figures
Abstract:The majority of research around Large Language Models (LLM) application to software development has been on the subject of code generation. There is little literature on LLMs’ impact on requirements engineering (RE), which deals with the process of developing and verifying the system requirements. Within RE, there is a subdiscipline of requirements elicitation, which is the practice of discovering and documenting requirements for a system from users, customers, and other stakeholders. In this analysis, we compare LLM’s ability to elicit requirements of a software system, as compared to that of a human expert in a time-boxed and prompt-boxed study. We found LLM-generated requirements were evaluated as more aligned (+1.12) than human-generated requirements with a trend of being more complete (+10.2%). Conversely, we found users tended to believe that solutions they perceived as more aligned had been generated by human experts. Furthermore, while LLM-generated documents scored higher and performed at 720x the speed, their cost was, on average, only 0.06% that of a human expert. Overall, these findings indicate that LLMs will play an increasingly important role in requirements engineering by improving requirements definitions, enabling more efficient resource allocation, and reducing overall project timelines.
[AI-9] Concept-Based Explainable Artificial Intelligence: Metrics and Benchmarks
Link: https://arxiv.org/abs/2501.19271
Authors: Halil Ibrahim Aysel, Xiaohao Cai, Adam Prugel-Bennett
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages in total, 8 main pages
Abstract:Concept-based explanation methods, such as concept bottleneck models (CBMs), aim to improve the interpretability of machine learning models by linking their decisions to human-understandable concepts, under the critical assumption that such concepts can be accurately attributed to the network’s feature space. However, this foundational assumption has not been rigorously validated, mainly because the field lacks standardised metrics and benchmarks to assess the existence and spatial alignment of such concepts. To address this, we propose three metrics: the concept global importance metric, the concept existence metric, and the concept location metric, including a technique for visualising concept activations, i.e., concept activation mapping. We benchmark post-hoc CBMs to illustrate their capabilities and challenges. Through qualitative and quantitative experiments, we demonstrate that, in many cases, even the most important concepts determined by post-hoc CBMs are not present in input images; moreover, when they are present, their saliency maps fail to align with the expected regions by either activating across an entire object or misidentifying relevant concept-specific regions. We analyse the root causes of these limitations, such as the natural correlation of concepts. Our findings underscore the need for more careful application of concept-based explanation techniques especially in settings where spatial interpretability is critical.
[AI-10] Jackpot! Alignment as a Maximal Lottery
Link: https://arxiv.org/abs/2501.19266
Authors: Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, Kate Larson
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
Comments:
Abstract:Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the majority (Ge et al., 2024). To overcome these issues, we propose the use of a probabilistic Social Choice rule called maximal lotteries as a replacement for RLHF. We show that a family of alignment techniques, namely Nash Learning from Human Feedback (NLHF) (Munos et al., 2023) and variants, approximate maximal lottery outcomes and thus inherit its beneficial properties. We confirm experimentally that our proposed methodology handles situations that arise when working with preferences more robustly than standard RLHF, including supporting the preferences of the majority, providing principled ways of handling non-transitivities in the preference data, and robustness to irrelevant alternatives. This results in systems that better incorporate human values and respect human intentions.
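For readers unfamiliar with the term, the standard social-choice definition of a maximal lottery can be written as follows. The notation (M, p, c) is ours and is not taken from the paper. The worked example shows how a preference cycle, one of the non-transitive cases the abstract mentions, resolves to a uniform lottery.

```latex
% Pairwise margins among alternatives x_1, ..., x_m (skew-symmetric)
M_{ij} = \Pr[x_i \succ x_j] - \Pr[x_j \succ x_i], \qquad M = -M^{\top}.

% p is a maximal lottery if no alternative beats it in expectation
p \in \Delta_m \quad \text{with} \quad p^{\top} M \ge 0 \ \text{(componentwise)}.

% Condorcet cycle x_1 \succ x_2 \succ x_3 \succ x_1 with equal margin c > 0
M = c \begin{pmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix},
\qquad p = \left( \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3} \right)
\;\Rightarrow\; p^{\top} M = (0, 0, 0).
```

Equivalently, p is a maximin strategy of the symmetric zero-sum game with payoff matrix M, which is why game-theoretic methods such as NLHF are a natural fit.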
[AI-11] Objective Metrics for Human-Subjects Evaluation in Explainable Reinforcement Learning
Link: https://arxiv.org/abs/2501.19256
Authors: Balint Gyevnar, Mark Towers
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments:
Abstract:Explanation is a fundamentally human process. Understanding the goal and audience of the explanation is vital, yet existing work on explainable reinforcement learning (XRL) routinely does not consult humans in their evaluations. Even when they do, they routinely resort to subjective metrics, such as confidence or understanding, that can only inform researchers of users’ opinions, not their practical effectiveness for a given problem. This paper calls on researchers to use objective human metrics for explanation evaluations based on observable and actionable behaviour to build more reproducible, comparable, and epistemically grounded research. To this end, we curate, describe, and compare several objective evaluation methodologies for applying explanations to debugging agent behaviour and supporting human-agent teaming, illustrating our proposed methods using a novel grid-based environment. We discuss how subjective and objective metrics complement each other to provide holistic validation and how future work needs to utilise standardised benchmarks for testing to enable greater comparisons between research.
[AI-12] Linear Q-Learning Does Not Diverge: Convergence Rates to a Bounded Set
Link: https://arxiv.org/abs/2501.19254
Authors: Xinyu Liu, Zixuan Xie, Shangtong Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Q-learning is one of the most fundamental reinforcement learning algorithms. Previously, it was widely believed that Q-learning with linear function approximation (i.e., linear Q-learning) suffers from possible divergence. This paper instead establishes the first L^2 convergence rate of linear Q-learning to a bounded set. Notably, we do not make any modification to the original linear Q-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an ε-softmax behavior policy with an adaptive temperature. The key to our analysis is the general result of stochastic approximations under Markovian noise with fast-changing transition functions. As a side product, we also use this general result to establish the L^2 convergence rate of tabular Q-learning with an ε-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.
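As a rough illustration of the objects named in the abstract, the sketch below shows one semi-gradient step of linear Q-learning and an ε-softmax behavior policy. It assumes ε-softmax means a mixture of a uniform distribution (weight ε) and a softmax over the current Q-values. The paper's adaptive temperature schedule is not given in the abstract, so `temperature` is a plain argument here.

```python
import math

def epsilon_softmax(q_values, epsilon, temperature):
    """Mix a uniform distribution (weight epsilon) with a softmax over q_values."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    n = len(q_values)
    return [epsilon / n + (1 - epsilon) * e / z for e in exps]

def q_value(w, phi):
    """Linear action value: inner product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def linear_q_update(w, phi_sa, reward, next_phis, alpha, gamma):
    """One semi-gradient step; next_phis lists phi(s', a') for every action a'."""
    target = reward + gamma * max(q_value(w, p) for p in next_phis)
    td_error = target - q_value(w, phi_sa)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi_sa)]

# Toy usage with two-dimensional one-hot features (hypothetical values).
probs = epsilon_softmax([1.0, 1.0], epsilon=0.2, temperature=0.5)
w_new = linear_q_update(
    w=[0.0, 0.0], phi_sa=[1.0, 0.0], reward=1.0,
    next_phis=[[0.0, 1.0], [1.0, 0.0]], alpha=0.5, gamma=0.9,
)
```

The update itself is the unmodified textbook rule, which matches the abstract's claim that the algorithm is left untouched and only the behavior policy is constrained.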
[AI-13] SHARPIE: A Modular Framework for Reinforcement Learning and Human-AI Interaction Experiments
Link: https://arxiv.org/abs/2501.19245
Authors: Hüseyin Aydın, Kevin Dubois-Godin, Libio Goncalvez Braz, Floris den Hengst, Kim Baraka, Mustafa Mert Çelikok, Andreas Sauter, Shihan Wang, Frans A. Oliehoek
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Reinforcement learning (RL) offers a general approach for modeling and training AI agents, including human-AI interaction scenarios. In this paper, we propose SHARPIE (Shared Human-AI Reinforcement Learning Platform for Interactive Experiments) to address the need for a generic framework to support experiments with RL agents and humans. Its modular design consists of a versatile wrapper for RL environments and algorithm libraries, a participant-facing web interface, logging utilities, deployment on popular cloud and participant recruitment platforms. It empowers researchers to study a wide variety of research questions related to the interaction between humans and RL agents, including those related to interactive reward specification and learning, learning from human feedback, action delegation, preference elicitation, user-modeling, and human-AI teaming. The platform is based on a generic interface for human-RL interactions that aims to standardize the field of study on RL in human contexts.
[AI-14] A Zero-Shot Generalization Framework for LLM-Driven Cross-Domain Sequential Recommendation
Link: https://arxiv.org/abs/2501.19232
Authors: Yunzhe Li, Junting Wang, Hari Sundaram, Zhining Liu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 11 pages
Abstract:Zero-shot cross-domain sequential recommendation (ZCDSR) enables predictions in unseen domains without the need for additional training or fine-tuning, making it particularly valuable in data-sparse environments where traditional models struggle. Recent advancements in large language models (LLMs) have greatly improved ZCDSR by leveraging rich pretrained representations to facilitate cross-domain knowledge transfer. However, a key challenge persists: domain semantic bias, which arises from variations in vocabulary and content focus across domains. This misalignment leads to inconsistencies in item embeddings and hinders generalization. To address this issue, we propose a novel framework designed to enhance LLM-based ZCDSR by improving cross-domain alignment at both the item and sequential levels. At the item level, we introduce a generalization loss that promotes inter-domain compactness by aligning embeddings of similar items across domains while maintaining intra-domain diversity to preserve unique item characteristics. This prevents embeddings from becoming overly generic while ensuring effective transferability. At the sequential level, we develop a method for transferring user behavioral patterns by clustering user sequences in the source domain and applying attention-based aggregation for target domain inference. This dynamic adaptation of user embeddings allows effective zero-shot recommendations without requiring target-domain interactions. Comprehensive experiments across multiple datasets and domains demonstrate that our framework significantly improves sequential recommendation performance in the ZCDSR setting. By mitigating domain bias and enhancing the transferability of sequential patterns, our method provides a scalable and robust approach for achieving more effective zero-shot recommendations across domains. 
[AI-15] Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method
Link: https://arxiv.org/abs/2501.19215
Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germán Pizarro, Matías Fuentes, Francisco Meza, Cristian Buc, Cristóbal Rojas
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose a novel method to evaluate the theoretical limits of Transformers, allowing us to prove the first lower bounds against one-layer softmax Transformers with infinite precision. We establish those bounds for three tasks that require advanced reasoning. The first task, Match3 (Sanford et al., 2023), requires looking at all triples of positions. The second and third tasks address compositionality-based reasoning: one is composition of functions (Peng et al., 2024) and the other is composition of binary relations. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. In an attempt to overcome these limitations, we introduce Strassen attention and prove that with this mechanism a one-layer Transformer can in principle solve all these tasks. We also show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard (Vaswani et al, 2017), higher-order attention (Sanford et al., 2023) and triangular attention (Bergen et al. 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.
[AI-16] An Empirical Game-Theoretic Analysis of Autonomous Cyber-Defence Agents
Link: https://arxiv.org/abs/2501.19206
Authors: Gregory Palmer, Luke Swaby, Daniel J.B. Harrold, Matthew Stewart, Alex Hiles, Chris Willis, Ian Miles, Sara Farmer
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
Comments: 21 pages, 17 figures, 10 tables
Abstract:The recent rise in increasingly sophisticated cyber-attacks raises the need for robust and resilient autonomous cyber-defence (ACD) agents. Given the variety of cyber-attack tactics, techniques and procedures (TTPs) employed, learning approaches that can return generalisable policies are desirable. Meanwhile, the assurance of ACD agents remains an open challenge. We address both challenges via an empirical game-theoretic analysis of deep reinforcement learning (DRL) approaches for ACD using the principled double oracle (DO) algorithm. This algorithm relies on adversaries iteratively learning (approximate) best responses against each others’ policies; a computationally expensive endeavour for autonomous cyber operations agents. In this work we introduce and evaluate a theoretically-sound, potential-based reward shaping approach to expedite this process. In addition, given the increasing number of open-source ACD-DRL approaches, we extend the DO formulation to allow for multiple response oracles (MRO), providing a framework for a holistic evaluation of ACD approaches.
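The abstract relies on potential-based reward shaping, whose textbook form adds γΦ(s') − Φ(s) to each reward. The sketch below shows why this form is considered safe: along a trajectory the added terms telescope, so the shaped return differs from the original only by a term that depends on the start and end potentials. The potential values are made up for illustration; the paper's actual potential function is not described in the abstract.

```python
def shaped_reward(reward, phi_s, phi_next, gamma):
    """Textbook potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical three-state trajectory s0 -> s1 -> s2.
gamma = 0.5
potentials = [0.0, 1.0, 3.0]   # Phi(s0), Phi(s1), Phi(s2)
rewards = [0.0, 1.0]           # r0, r1

shaped = [
    shaped_reward(r, potentials[t], potentials[t + 1], gamma)
    for t, r in enumerate(rewards)
]
gap = discounted_return(shaped, gamma) - discounted_return(rewards, gamma)
# The gap equals gamma**T * Phi(s_T) - Phi(s_0), independent of the actions taken.
telescoped = gamma ** len(rewards) * potentials[-1] - potentials[0]
```

Because the gap does not depend on which actions were taken in between, shaping of this form can speed up best-response learning without changing which policies are optimal.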
[AI-17] Rethinking Early Stopping: Refine Then Calibrate
Link: https://arxiv.org/abs/2501.19195
Authors: Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Machine learning classifiers often produce probabilistic predictions that are critical for accurate and interpretable decision-making in various domains. The quality of these predictions is generally evaluated with proper losses like cross-entropy, which decompose into two components: calibration error assesses general under/overconfidence, while refinement error measures the ability to distinguish different classes. In this paper, we provide theoretical and empirical evidence that these two errors are not minimized simultaneously during training. Selecting the best training epoch based on validation loss thus leads to a compromise point that is suboptimal for both calibration error and, most importantly, refinement error. To address this, we introduce a new metric for early stopping and hyperparameter tuning that makes it possible to minimize refinement error during training. The calibration error is minimized after training, using standard techniques. Our method integrates seamlessly with any architecture and consistently improves performance across diverse classification tasks.
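One way to picture the "refine, then calibrate" idea is to score each checkpoint by its validation loss after a post-hoc recalibration step, so that miscalibration no longer influences which epoch is selected. The sketch below uses temperature scaling with a simple grid search as the recalibration step. This is an assumption on our part: the abstract only says "standard techniques" and does not spell out the new metric.

```python
import math

def cross_entropy(logits, labels, temperature=1.0):
    """Mean negative log-likelihood of softmax(logits / temperature)."""
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [z / temperature for z in row]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[y]
    return total / len(labels)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a fixed grid that minimises validation loss."""
    grid = grid or [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(grid, key=lambda t: cross_entropy(logits, labels, t))

def calibrated_loss(logits, labels):
    """Validation loss after temperature scaling (hypothetical stopping score)."""
    return cross_entropy(logits, labels, fit_temperature(logits, labels))

# Toy validation set: an overconfident model that is right 3 times out of 4.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]
best_t = fit_temperature(val_logits, val_labels)
raw, recal = cross_entropy(val_logits, val_labels), calibrated_loss(val_logits, val_labels)
```

In a training loop, one would compare `calibrated_loss` across epochs rather than the raw loss, then apply the fitted temperature to the selected checkpoint.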
[AI-18] Secured Communication Schemes for UAVs in 5G: CRYSTALS-Kyber and IDS
Link: https://arxiv.org/abs/2501.19191
Authors: Taneya Sharma, Seyed Ahmad Soleymani, Mohammad Shojafar, Rahim Tafazolli
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures, Paper accepted at IEEE FNWF’25 conference (References number: 1571070613)
Abstract:This paper introduces a secure communication architecture for Unmanned Aerial Vehicles (UAVs) and ground stations in 5G networks, addressing critical challenges in network security. The proposed solution integrates the Advanced Encryption Standard (AES) with Elliptic Curve Cryptography (ECC) and CRYSTALS-Kyber for key encapsulation, offering a hybrid cryptographic approach. By incorporating CRYSTALS-Kyber, the framework mitigates vulnerabilities in ECC against quantum attacks, positioning it as a quantum-resistant alternative. The architecture is based on a server-client model, with UAVs functioning as clients and the ground station acting as the server. The system was rigorously evaluated in both VPN and 5G environments. Experimental results confirm that CRYSTALS-Kyber delivers strong protection against quantum threats with minimal performance overhead, making it highly suitable for UAVs with resource constraints. Moreover, the proposed architecture integrates an Artificial Intelligence (AI)-based Intrusion Detection System (IDS) to further enhance security. In performance evaluations, the IDS demonstrated strong results across multiple models with XGBoost, particularly in more demanding scenarios, outperforming other models with an accuracy of 97.33% and an AUC of 0.94. These findings underscore the potential of combining quantum-resistant encryption mechanisms with AI-driven IDS to create a robust, scalable, and secure communication framework for UAV networks, particularly within the high-performance requirements of 5G environments.
[AI-19] Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
Link: https://arxiv.org/abs/2501.19180
Authors: Xianglin Yang, Gelei Deng, Jieming Shi, Tianwei Zhang, Jin Song Dong
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats, which could lead to the generation of inappropriate responses. Conventional defenses, such as refusal and adversarial training, often fail to cover corner cases or rare domains, leaving LLMs still vulnerable to more sophisticated attacks. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced reasoning capabilities of LLMs for proactive assessment of harmful inputs, rather than simply blocking them. SCoT augments any refusal training datasets to critically analyze the intent behind each request before generating answers. By employing proactive reasoning, SCoT enhances the generalization of LLMs across varied harmful queries and scenarios not covered in the safety alignment corpus. Additionally, it generates detailed refusals specifying the rules violated. Comparative evaluations show that SCoT significantly surpasses existing defenses, reducing vulnerability to out-of-distribution issues and adversarial manipulations while maintaining strong general capabilities.
[AI-20] On the inductive bias of infinite-depth ResNets and the bottleneck rank
Link: https://arxiv.org/abs/2501.19149
Authors: Enric Boix-Adsera
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages
Abstract:We compute the minimum-norm weights of a deep linear ResNet, and find that the inductive bias of this architecture lies between minimizing nuclear norm and rank. This implies that, with appropriate hyperparameters, deep nonlinear ResNets have an inductive bias towards minimizing bottleneck rank.
[AI-21] A Metric for the Balance of Information in Graph Learning AAAI
Link: https://arxiv.org/abs/2501.19137
Authors: Alex O. Davies, Nirav S. Ajmeri, Telmo de Menezes e Silva Filho
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: In proceedings of the 4th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE)
Abstract:Graph learning on molecules makes use of information from both the molecular structure and the features attached to that structure. Much work has been conducted on biasing either towards structure or features, with the aim that bias bolsters performance. Identifying which information source a dataset favours, and therefore how to approach learning that dataset, is an open issue. Here we propose Noise-Noise Ratio Difference (NNRD), a quantitative metric for whether there is more useful information in structure or features. By employing iterative noising on features and structure independently, leaving the other intact, NNRD measures the degradation of information in each. We employ NNRD over a range of molecular tasks, and show that it corresponds well to a loss of information, with intuitive results that are more expressive than simple performance aggregates. Our future work will focus on expanding data domains, tasks and types, as well as refining our choice of baseline model.
[AI-22] Decorrelated Soft Actor-Critic for Efficient Deep Reinforcement Learning
Link: https://arxiv.org/abs/2501.19133
Authors: Burcu Küçükoğlu, Sander Dalm, Marcel van Gerven
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The effectiveness of credit assignment in reinforcement learning (RL) when dealing with high-dimensional data is influenced by the success of representation learning via deep neural networks, and has implications for the sample efficiency of deep RL algorithms. Input decorrelation has been previously introduced as a method to speed up optimization in neural networks, and has proven impactful in both efficient deep learning and as a method for effective representation learning for deep RL algorithms. We propose a novel approach to online decorrelation in deep RL based on the decorrelated backpropagation algorithm that seamlessly integrates the decorrelation process into the RL training pipeline. Decorrelation matrices are added to each layer, which are updated using a separate decorrelation learning rule that minimizes the total decorrelation loss across all layers, in parallel to minimizing the usual RL loss. We used our approach in combination with the soft actor-critic (SAC) method, which we refer to as decorrelated soft actor-critic (DSAC). Experiments on the Atari 100k benchmark with DSAC show, compared to the regular SAC baseline, faster training in five out of the seven games tested and improved reward performance in two games with around 50% reduction in wall-clock time, while maintaining performance levels on the other games. These results demonstrate the positive impact of network-wide decorrelation in deep RL for speeding up its sample efficiency through more effective credit assignment.
[AI-23] Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach
Link: https://arxiv.org/abs/2501.19128
Authors: Wenyun Li, Wenjie Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, our approach performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the Semi-Supervised Learning (SSL) technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method effectively generalizes reward shaping to sparse reward scenarios, achieving up to four times better performance in reaching higher best scores compared to curiosity-driven methods. The proposed double entropy data augmentation enhances performance, showcasing a 15.8% increase in best score over other augmentation methods.
[AI-24] FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling
Link: https://arxiv.org/abs/2501.19122
Authors: Hong Huang, Hai Yang, Yuan Chen, Jiaxun Ye, Dapeng Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Federated Learning (FL) enables collaborative model training across distributed clients without data sharing, but its high computational and communication demands strain resource-constrained devices. While existing methods use dynamic pruning to improve efficiency by periodically adjusting sparse model topologies while maintaining sparsity, these approaches suffer from issues such as greedy adjustments, unstable topologies, and communication inefficiency, resulting in less robust models and suboptimal performance under data heterogeneity and partial client availability. To address these challenges, we propose Federated Robust pruning via combinatorial Thompson Sampling (FedRTS), a novel framework designed to develop robust sparse models. FedRTS enhances robustness and performance through its Thompson Sampling-based Adjustment (TSAdj) mechanism, which uses probabilistic decisions informed by stable, farsighted information instead of deterministic decisions reliant on unstable and myopic information in previous methods. Extensive experiments demonstrate that FedRTS achieves state-of-the-art performance in computer vision and natural language processing tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation. Our codes are available at: this https URL
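The contrast between probabilistic and deterministic topology adjustment can be made concrete with a toy Thompson Sampling step. This is only a minimal sketch and not the paper's TSAdj mechanism, whose details the abstract does not give: the Beta posterior per weight, the `votes` signal, and both function names are assumptions made for illustration.

```python
import random


def sample_mask(alpha, beta, k, rng):
    """Draw one Beta sample per weight and keep the k highest draws.

    alpha/beta are per-weight Beta posterior parameters (an assumed
    representation of how useful each weight has been so far).
    """
    draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    order = sorted(range(len(draws)), key=lambda i: draws[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(draws))]


def update_posterior(alpha, beta, votes):
    """Fold a new round of evidence into the posteriors, in place.

    votes[i] in [0, 1] is a hypothetical signal, e.g. the fraction of
    participating clients whose local update favoured keeping weight i.
    """
    for i, v in enumerate(votes):
        alpha[i] += v
        beta[i] += 1.0 - v


rng = random.Random(0)
alpha = [1.0] * 6
beta = [1.0] * 6
update_posterior(alpha, beta, [1, 1, 1, 0, 0, 0])
mask = sample_mask(alpha, beta, k=3, rng=rng)  # sparsity stays fixed at 3 of 6
```

Because the mask comes from posterior samples rather than from the latest round alone, one noisy round shifts the kept set only gradually, which is the intuition behind "stable, farsighted information" in the abstract.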
[AI-25] Principal Components for Neural Network Initialization
Link: https://arxiv.org/abs/2501.19114
Authors: Nhan Phan, Thu Nguyen, Pål Halvorsen, Michael A. Riegler
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Principal Component Analysis (PCA) is a commonly used tool for dimension reduction and denoising. Therefore, it is also widely used on the data prior to training a neural network. However, this approach can complicate the explanation of explainable AI (XAI) methods for the decision of the model. In this work, we analyze the potential issues with this approach and propose Principal Components-based Initialization (PCsInit), a strategy to incorporate PCA into the first layer of a neural network via initialization of the first layer in the network with the principal components, and its two variants PCsInit-Act and PCsInit-Sub. Explanations using these strategies are as direct and straightforward as for neural networks and are simpler than using PCA prior to training a neural network on the principal components. Moreover, as will be illustrated in the experiments, such training strategies can also allow further improvement of training via backpropagation.
[AI-26] Logical Modalities within the European AI Act: An Analysis
Link: https://arxiv.org/abs/2501.19112
Authors: Lara Lawniczak, Christoph Benzmüller
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Comments: 16 pages, 19 figures
Abstract:The paper presents a comprehensive analysis of the European AI Act in terms of its logical modalities, with the aim of preparing its formal representation, for example, within the logic-pluralistic Knowledge Engineering Framework and Methodology (LogiKEy). LogiKEy develops computational tools for normative reasoning based on formal methods, employing Higher-Order Logic (HOL) as a unifying meta-logic to integrate diverse logics through shallow semantic embeddings. This integration is facilitated by Isabelle/HOL, a proof assistant tool equipped with several automated theorem provers. The modalities within the AI Act and the logics suitable for their representation are discussed. For a selection of these logics, embeddings in HOL are created, which are then used to encode sample paragraphs. Initial experiments evaluate the suitability of these embeddings for automated reasoning, and highlight key challenges on the way to more robust reasoning capabilities.
[AI-27] PathE: Leveraging Entity-Agnostic Paths for Parameter-Efficient Knowledge Graph Embeddings
Link: https://arxiv.org/abs/2501.19095
Authors: Ioannis Reklos, Jacopo de Berardinis, Elena Simperl, Albert Meroño-Peñuela
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Knowledge Graphs (KGs) store human knowledge in the form of entities (nodes) and relations, and are used extensively in various applications. KG embeddings are an effective approach to addressing tasks like knowledge discovery, link prediction, and reasoning. This is often done by allocating and learning embedding tables for all or a subset of the entities. As this scales linearly with the number of entities, learning embedding models in real-world KGs with millions of nodes can be computationally intractable. To address this scalability problem, our model, PathE, only allocates embedding tables for relations (which are typically orders of magnitude fewer than the entities) and requires less than 25% of the parameters of previous parameter efficient methods. Rather than storing entity embeddings, we learn to compute them by leveraging multiple entity-relation paths to contextualise individual entities within triples. Evaluated on four benchmarks, PathE achieves state-of-the-art performance in relation prediction, and remains competitive in link prediction on path-rich KGs while training on consumer-grade hardware. We perform ablation experiments to test our design choices and analyse the sensitivity of the model to key hyper-parameters. PathE is efficient and cost-effective for relationally diverse and well-connected KGs commonly found in real-world applications.
[AI-28] BEAT: Balanced Frequency Adaptive Tuning for Long-Term Time-Series Forecasting
Link: https://arxiv.org/abs/2501.19065
Authors: Zhixuan Li, Naipeng Chen, Seonghwa Choi, Sanghoon Lee, Weisi Lin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures
Abstract:Time-series forecasting is crucial for numerous real-world applications including weather prediction and financial market modeling. While temporal-domain methods remain prevalent, frequency-domain approaches can effectively capture multi-scale periodic patterns, reduce sequence dependencies, and naturally denoise signals. However, existing approaches typically train model components for all frequencies under a unified training objective, often leading to mismatched learning speeds: high-frequency components converge faster and risk overfitting, while low-frequency components underfit due to insufficient training time. To deal with this challenge, we propose BEAT (Balanced frEquency Adaptive Tuning), a novel framework that dynamically monitors the training status for each frequency and adaptively adjusts their gradient updates. By recognizing convergence, overfitting, or underfitting for each frequency, BEAT dynamically reallocates learning priorities, moderating gradients for rapid learners and increasing those for slower ones, alleviating the tension between competing objectives across frequencies and synchronizing the overall learning process. Extensive experiments on seven real-world datasets demonstrate that BEAT consistently outperforms state-of-the-art approaches.
[AI-29] Towards Physiologically Sensible Predictions via the Rule-based Reinforcement Learning Layer
Link: https://arxiv.org/abs/2501.19055
Authors: Lingwei Zhu, Zheng Chen, Yukie Nagai, Jimeng Sun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:This paper adds to the growing literature of reinforcement learning (RL) for healthcare by proposing a novel paradigm: augmenting any predictor with Rule-based RL Layer (RRLL) that corrects the model’s physiologically impossible predictions. Specifically, RRLL takes as input states predicted labels and outputs corrected labels as actions. The reward of the state-action pair is evaluated by a set of general rules. RRLL is efficient, general and lightweight: it does not require heavy expert knowledge like prior work but only a set of impossible transitions. This set is much smaller than all possible transitions; yet it can effectively reduce physiologically impossible mistakes made by the state-of-the-art predictor models. We verify the utility of RRLL on a variety of important healthcare classification problems and observe significant improvements using the same setup, with only the domain-specific set of impossibility changed. In-depth analysis shows that RRLL indeed improves accuracy by effectively reducing the presence of physiologically impossible predictions.
[AI-30] Swarm-Gen: Fast Generation of Diverse Feasible Swarm Behaviors
Link: https://arxiv.org/abs/2501.19042
Authors: Simon Idoko, B. Bhanu Teja, K. Madhava Krishna, Arun Kumar Singh
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to RAL
Abstract:Coordination behavior in robot swarms is inherently multi-modal in nature. That is, there are numerous ways in which a swarm of robots can avoid inter-agent collisions and reach their respective goals. However, the problem of generating diverse and feasible swarm behaviors in a scalable manner remains largely unaddressed. In this paper, we fill this gap by combining generative models with a safety-filter (SF). Specifically, we sample diverse trajectories from a learned generative model which is subsequently projected onto the feasible set using the SF. We experiment with two choices for generative models, namely: Conditional Variational Autoencoder (CVAE) and Vector-Quantized Variational Autoencoder (VQ-VAE). We highlight the trade-offs these two models provide in terms of computation time and trajectory diversity. We develop a custom solver for our SF and equip it with a neural network that predicts context-specific initialization. The initialization network is trained in a self-supervised manner, taking advantage of the differentiability of the SF solver. We provide two sets of empirical results. First, we demonstrate that we can generate a large set of multi-modal, feasible trajectories, simulating diverse swarm behaviors, within a few tens of milliseconds. Second, we show that our initialization network provides faster convergence of our SF solver vis-a-vis other alternative heuristics.
[AI-31] Symmetric Pruning of Large Language Models
Link: https://arxiv.org/abs/2501.18980
Authors: Kai Yi, Peter Richtárik
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Popular post-training pruning methods such as Wanda and RIA are known for their simple, yet effective, designs that have shown exceptional empirical performance. Wanda optimizes performance through calibrated activations during pruning, while RIA emphasizes the relative, rather than absolute, importance of weight elements. Despite their practical success, a thorough theoretical foundation explaining these outcomes has been lacking. This paper introduces new theoretical insights that redefine the standard minimization objective for pruning, offering a deeper understanding of the factors contributing to their success. Our study extends beyond these insights by proposing complementary strategies that consider both input activations and weight significance. We validate these approaches through rigorous experiments, demonstrating substantial enhancements over existing methods. Furthermore, we introduce a novel training-free fine-tuning approach R^2-DSnoT that incorporates relative weight importance and a regularized decision boundary within a dynamic pruning-and-growing framework, significantly outperforming strong baselines and establishing a new state of the art.
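For readers unfamiliar with the Wanda baseline mentioned above, the sketch below shows the commonly described Wanda criterion: score each weight by its magnitude times the L2 norm of the matching input feature over a calibration set, then prune the lowest scores within each output row. It illustrates the baseline only, not the new strategies or R^2-DSnoT from this paper; the function names and toy numbers are made up for the example.

```python
import math


def wanda_scores(W, X):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2.

    W: weights as a list of output rows, each of length n_in.
    X: calibration activations as a list of token rows, each of length n_in.
    """
    n_in = len(W[0])
    norms = [math.sqrt(sum(tok[j] ** 2 for tok in X)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in W]


def prune_rows(W, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for w_row, s_row in zip(W, scores):
        n_drop = int(len(w_row) * sparsity)
        order = sorted(range(len(w_row)), key=lambda j: s_row[j])
        drop = set(order[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned


# Toy layer: the largest weight (1.0) reads from a nearly silent input.
W = [[1.0, -0.5, 0.2, 0.3]]
X = [[0.1, 4.0, 1.0, 1.0],
     [0.1, 3.0, 1.0, 1.0]]
pruned = prune_rows(W, wanda_scores(W, X), sparsity=0.5)
```

In this toy case plain magnitude pruning would keep the weight 1.0, whereas the activation-aware score drops it because its input feature is almost always near zero.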
[AI-32] GPO-VAE: Modeling Explainable Gene Perturbation Responses utilizing GRN-Aligned Parameter Optimization
Link: https://arxiv.org/abs/2501.18973
Authors: Seungheun Baek, Soyon Park, Yan Ting Chok, Mogan Gim, Jaewoo Kang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Motivation: Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Nevertheless, model explainability is one of the most important aspects in the realm of biological AI. One of the most effective ways to achieve explainability is incorporating the concept of gene regulatory networks (GRNs) in designing deep learning models such as VAEs. GRNs elicit the underlying causal relationships between genes and are capable of explaining the transcriptional responses caused by genetic perturbation treatments. Results: We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on evaluating the GRN inference task reveal our model’s ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways. GPO-VAE is available at this https URL
[AI-33] Enhancing Neural Function Approximation: The XNet Outperforming KAN
Link: https://arxiv.org/abs/2501.18959
Authors: Xin Li, Xiaotao Zheng, Zhihong Xia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2410.02033
Abstract:XNet is a single-layer neural network architecture that leverages Cauchy integral-based activation functions for high-order function approximation. Through theoretical analysis, we show that the Cauchy activation functions used in XNet can achieve arbitrary-order polynomial convergence, fundamentally outperforming traditional MLPs and Kolmogorov-Arnold Networks (KANs) that rely on increased depth or B-spline activations. Our extensive experiments on function approximation, PDE solving, and reinforcement learning demonstrate XNet’s superior performance - reducing approximation error by up to 50000 times and accelerating training by up to 10 times compared to existing approaches. These results establish XNet as a highly efficient architecture for both scientific computing and AI applications.
[AI-34] Deep Learning based Quasi-consciousness Training for Robot Intelligent Model
Link: https://arxiv.org/abs/2501.18955
Authors: Yuchun Li, Fang Zhang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:This paper explores a deep learning-based robot intelligent model that enables robots to learn and reason about complex tasks. First, a network of environmental factor matrices is constructed to stimulate the learning process of the robot intelligent model, and the model parameters must be subjected to coarse fine tuning to minimize the loss function. Meanwhile, the robot intelligent model can fuse all previously known concepts to represent things never experienced before, which requires that the model generalize extensively. Second, in order to progressively develop a robot intelligent model with primary consciousness, every robot must be subjected to at least 1 to 3 years of special schooling to train anthropomorphic behaviour patterns, so that it can understand and process complex environmental information and make rational decisions. This work explores the potential application of deep learning-based quasi-consciousness training in the field of robot intelligent models.
[AI-35] Deepfake Detection of Singing Voices With Whisper Encodings ICASSP
Link: https://arxiv.org/abs/2501.18919
Authors: Falguni Sharma, Priyanka Gupta
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted in ICASSP, 2025
Abstract:The deepfake generation of singing vocals is a concerning issue for artists in the music industry. In this work, we propose a singing voice deepfake detection (SVDD) system, which uses noise-variant encodings of OpenAI’s Whisper model. As counter-intuitive as it may sound, even though the Whisper model is known to be noise-robust, the encodings are rich in non-speech information, and are noise-variant. This leads us to evaluate Whisper encodings as feature representations for the SVDD task. Therefore, in this work, the SVDD task is performed on vocals and mixtures, and the performance is evaluated in %EER over varying Whisper model sizes and two classifiers, CNN and ResNet34, under different testing conditions.
[AI-36] Lightspeed Geometric Dataset Distance via Sliced Optimal Transport
Link: https://arxiv.org/abs/2501.18901
Authors: Khai Nguyen, Hai Nguyen, Tuan Pham, Nhat Ho
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 23 pages, 9 figures
Abstract:We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
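The efficiency claim rests on the closed form of one-dimensional optimal transport: for two equal-size samples, the Wasserstein distance is obtained by sorting both and matching them in order. The sketch below shows that building block and a plain sliced average over random directions. It is an illustration under stated assumptions, not s-OTDD itself: it projects feature vectors only, leaves out the Moment Transform Projection of labels, and the choice of the order-1 distance, the function names, and `n_proj` are assumptions.

```python
import math
import random


def wasserstein_1d(a, b):
    """Order-1 Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling matches sorted values in order,
    so the cost is a sort plus a linear pass.
    """
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(A, B, n_proj=64, seed=0):
    """Average 1-D Wasserstein distance over random unit directions."""
    rng = random.Random(seed)
    dim = len(A[0])
    total = 0.0
    for _ in range(n_proj):
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        proj_a = [sum(x * v for x, v in zip(p, d)) for p in A]
        proj_b = [sum(x * v for x, v in zip(p, d)) for p in B]
        total += wasserstein_1d(proj_a, proj_b)
    return total / n_proj


A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
B = [(x + 3.0, y) for x, y in A]  # the same points shifted along the first axis
distance = sliced_wasserstein(A, B)
```

Each projection costs one sort, so the total cost grows as O(n log n) in the number of points per projection, which matches the "(near-)linear" complexity described in the abstract.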
[AI-37] Building Bridges Not Walls – Advancing Interpretability by Unifying Feature Data and Model Component Attribution
Link: https://arxiv.org/abs/2501.18887
Authors: Shichang Zhang, Tessa Han, Usha Bhalla, Hima Lakkaraju
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:The increasing complexity of AI systems has made understanding their behavior a critical challenge. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of approaches and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across three domains and present a unified view to demonstrate that these seemingly distinct methods employ similar approaches, such as perturbations, gradients, and linear approximations, differing primarily in their perspectives rather than core techniques. Our unified perspective enhances understanding of existing attribution methods, identifies shared concepts and challenges, makes this field more accessible to newcomers, and highlights new directions not only for attribution and interpretability but also for broader AI research, including model editing, steering, and regulation.
[AI-38] An Optimal Cascade Feature-Level Spatiotemporal Fusion Strategy for Anomaly Detection in CAN Bus
Link: https://arxiv.org/abs/2501.18821
Authors: Mohammad Fatahi, Danial Sadrian Zadeh, Benyamin Ghojogh, Behzad Moshiri, Otman Basir
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract:Autonomous vehicles represent a revolutionary advancement driven by the integration of artificial intelligence within intelligent transportation systems. However, they remain vulnerable due to the absence of robust security mechanisms in the Controller Area Network (CAN) bus. In order to mitigate the security issue, many machine learning models and strategies have been proposed, which primarily focus on a subset of dominant patterns of anomalies and lack rigorous evaluation in terms of reliability and robustness. Therefore, to address the limitations of previous works and mitigate the security vulnerability in CAN bus, the current study develops a model based on the intrinsic nature of the problem to cover all dominant patterns of anomalies. To achieve this, a cascade feature-level fusion strategy optimized by a two-parameter genetic algorithm is proposed to combine temporal and spatial information. Subsequently, the model is evaluated using a paired t-test to ensure reliability and robustness. Finally, a comprehensive comparative analysis conducted on two widely used datasets advocates that the proposed model outperforms other models and achieves superior accuracy and F1-score, demonstrating the best performance among all models presented to date.
[AI-39] Compositional Generalization Requires More Than Disentangled Representations
Link: https://arxiv.org/abs/2501.18797
Authors: Qiyao Liang, Daoyuan Qian, Liu Ziyin, Ila Fiete
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 8 pages, 4 figures, plus appendix
Abstract:Composition, the ability to generate myriad variations from finite means, is believed to underlie powerful generalization. However, compositional generalization remains a key challenge for deep learning. A widely held assumption is that learning disentangled (factorized) representations naturally supports this kind of extrapolation. Yet, empirical results are mixed, with many generative models failing to recognize and compose factors to generate out-of-distribution (OOD) samples. In this work, we investigate a controlled 2D Gaussian “bump” generation task, demonstrating that standard generative architectures fail in OOD regions when training with partial data, even when supplied with fully disentangled (x, y) coordinates, re-entangling them through subsequent layers. By examining the model’s learned kernels and manifold geometry, we show that this failure reflects a “memorization” strategy for generation through the superposition of training data rather than by combining the true factorized features. We show that models forced (through architectural modifications with regularization or curated training data) to create disentangled representations in the full-dimensional representational (pixel) space can be highly data-efficient and effective at learning to compose in OOD regions. These findings underscore that bottlenecks with factorized/disentangled representations in an abstract representation are insufficient: the model must actively maintain or induce factorization directly in the representational space in order to achieve robust compositional generalization.
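To make the task setup concrete, here is a minimal sketch of a 2D Gaussian bump dataset with a held-out region of centres. The image size, bump width, and the square hole used as the OOD region are assumptions chosen for illustration; the abstract does not state the paper's actual configuration.

```python
import math


def gaussian_bump(cx, cy, size=16, sigma=1.5):
    """Render a size x size image with one Gaussian bump centred at (cx, cy)."""
    return [
        [math.exp(-((c - cx) ** 2 + (r - cy) ** 2) / (2 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]


def training_centres(size=16, hole=(5, 11)):
    """All integer centres except a held-out square reserved for OOD tests."""
    lo, hi = hole
    return [
        (x, y)
        for x in range(size)
        for y in range(size)
        if not (lo <= x < hi and lo <= y < hi)
    ]


# Each training pair maps disentangled (x, y) coordinates to a pixel image.
train = [((x, y), gaussian_bump(x, y)) for x, y in training_centres()]
ood_example = gaussian_bump(8, 8)  # this centre never appears in training
```

Every x value and every y value in the hole still occurs somewhere in training, only never together, so generating `ood_example` requires composing the two factors rather than recalling a seen image.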
[AI-40] OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization
Link: https://arxiv.org/abs/2501.18793
Authors: Kelvin Kan, Xingjian Li, Stanley Osher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Transformers have achieved state-of-the-art performance in numerous tasks. In this paper, we propose a continuous-time formulation of transformers. Specifically, we consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model. Moreover, we demonstrate in theory that this regularization is necessary as it promotes uniqueness and regularity of solutions. Our model is flexible in that almost any existing transformer architectures can be adopted to construct the dynamical system with only slight modifications to the existing code. We perform extensive numerical experiments on tasks motivated by natural language processing, image classification, and point cloud classification. Our experimental results show that the proposed method improves the performance of its discrete counterpart and outperforms relevant comparing models.
[AI-41] LLM-Generated Heuristics for AI Planning: Do We Even Need Domain-Independence Anymore?
Link: https://arxiv.org/abs/2501.18784
Authors: Alexander Tuisov, Yonatan Vernik, Alexander Shleyfman
Categories: Artificial Intelligence (cs.AI)
Abstract:Domain-independent heuristics have long been a cornerstone of AI planning, offering general solutions applicable across a wide range of tasks without requiring domain-specific engineering. However, the advent of large language models (LLMs) presents an opportunity to generate heuristics tailored to specific planning problems, potentially challenging the necessity of domain independence as a strict design principle. In this paper, we explore the use of LLMs to automatically derive planning heuristics from task descriptions represented as successor generators and goal tests written in a general-purpose programming language. We investigate the trade-offs between domain-specific LLM-generated heuristics and traditional domain-independent methods in terms of computational efficiency and explainability. Our experiments demonstrate that LLMs can create heuristics that achieve state-of-the-art performance on some standard IPC domains, as well as their ability to solve problems that lack an adequate Planning Domain Definition Language (PDDL) representation. We discuss whether these results signify a paradigm shift and how they can complement existing approaches.
[AI-42] Diversity By Design: Leveraging Distribution Matching for Offline Model-Based Optimization
Link: https://arxiv.org/abs/2501.18768
Authors: Michael S. Yao, James C. Gee, Osbert Bastani
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:The goal of offline model-based optimization (MBO) is to propose new designs that maximize a reward function given only an offline dataset. However, an important desideratum is also to propose a diverse set of final candidates that capture many optimal and near-optimal design configurations. We propose Diversity in Adversarial Model-based Optimization (DynAMO) as a novel method to introduce design diversity as an explicit objective into any MBO problem. Our key insight is to formulate diversity as a distribution matching problem where the distribution of generated designs captures the inherent diversity contained within the offline dataset. Extensive experiments spanning multiple scientific domains show that DynAMO can be used with common optimization methods to significantly improve the diversity of proposed designs while still discovering high-quality candidates.
[AI-43] Synthetic Data Generation for Augmenting Small Samples
Link: https://arxiv.org/abs/2501.18741
Authors: Dan Liu, Samer El Kababji, Nicholas Mitsakakis, Lisa Pilgram, Thomas Walters, Mark Clemons, Greg Pond, Alaa El-Hussuna, Khaled El Emam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract:Small datasets are common in health research. However, the generalization performance of machine learning models is suboptimal when the training datasets are small. To address this, data augmentation is one solution. Augmentation increases sample size and is seen as a form of regularization that increases the diversity of small datasets, leading them to perform better on unseen data. We found that augmentation improves prognostic performance for datasets that: have fewer observations, with smaller baseline AUC, have higher cardinality categorical variables, and have more balanced outcome variables. No specific generative model consistently outperformed the others. We developed a decision support model that can be used to inform analysts if augmentation would be useful. For seven small application datasets, augmenting the existing data results in an increase in AUC between 4.31% (AUC from 0.71 to 0.75) and 43.23% (AUC from 0.51 to 0.73), with an average 15.55% relative improvement, demonstrating the nontrivial impact of augmentation on small datasets (p=0.0078). Augmentation AUC was higher than resampling only AUC (p=0.016). The diversity of augmented datasets was higher than the diversity of resampled datasets (p=0.046).
[AI-44] Neural Graph Pattern Machine
Link: https://arxiv.org/abs/2501.18739
Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Nitesh V Chawla, Chuxu Zhang, Yanfang Ye
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract:Graph learning tasks require models to comprehend essential substructure patterns relevant to downstream tasks, such as triadic closures in social networks and benzene rings in molecular graphs. Due to the non-Euclidean nature of graphs, existing graph neural networks (GNNs) rely on message passing to iteratively aggregate information from local neighborhoods. Despite their empirical success, message passing struggles to identify fundamental substructures, such as triangles, limiting its expressiveness. To overcome this limitation, we propose the Neural Graph Pattern Machine (GPM), a framework designed to learn directly from graph patterns. GPM efficiently extracts and encodes substructures while identifying the most relevant ones for downstream tasks. We also demonstrate that GPM offers superior expressivity and improved long-range information modeling compared to message passing. Empirical evaluations on node classification, link prediction, graph classification, and regression show the superiority of GPM over state-of-the-art baselines. Further analysis reveals its desirable out-of-distribution robustness, scalability, and interpretability. We consider GPM to be a step toward going beyond message passing.
[AI-45] Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation
Link: https://arxiv.org/abs/2501.18733
Authors: Yuelei Li, Ge Yan, Annabella Macaluso, Mazeyu Ji, Xueyan Zou, Xiaolong Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:The recent advancements in visual reasoning capabilities of large multimodal models (LMMs) and the semantic enrichment of 3D feature fields have expanded the horizons of robotic capabilities. These developments hold significant potential for bridging the gap between high-level reasoning from LMMs and low-level control policies utilizing 3D feature fields. In this work, we introduce LMM-3DP, a framework that can integrate LMM planners and 3D skill Policies. Our approach consists of three key perspectives: high-level planning, low-level control, and effective integration. For high-level planning, LMM-3DP supports dynamic scene understanding for environment disturbances, a critic agent with self-feedback, history policy memorization, and reattempts after failures. For low-level control, LMM-3DP utilizes a semantic-aware 3D feature field for accurate manipulation. In aligning high-level and low-level control for robot actions, language embeddings representing the high-level policy are jointly attended with the 3D feature field in the 3D transformer for seamless integration. We extensively evaluate our approach across multiple skills and long-horizon tasks in a real-world kitchen environment. Our results show a significant 1.45x success rate increase in low-level control and an approximate 1.5x improvement in high-level planning accuracy compared to LLM-based baselines. Demo videos and an overview of LMM-3DP are available at this https URL.
[AI-46] Exploring Audio Editing Features as User-Centric Privacy Defenses Against Emotion Inference Attacks AAAI
Link: https://arxiv.org/abs/2501.18727
Authors: Mohd. Farhan Israk Soumik, W.K.M. Mithsara, Abdur R. Shahid, Ahmed Imteaj
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted for presentation (Poster) at PPAI-25: The 6th AAAI Workshop on Privacy-Preserving Artificial Intelligence
Abstract:The rapid proliferation of speech-enabled technologies, including virtual assistants, video conferencing platforms, and wearable devices, has raised significant privacy concerns, particularly regarding the inference of sensitive emotional information from audio data. Existing privacy-preserving methods often compromise usability and security, limiting their adoption in practical scenarios. This paper introduces a novel, user-centric approach that leverages familiar audio editing techniques, specifically pitch and tempo manipulation, to protect emotional privacy without sacrificing usability. By analyzing popular audio editing applications on Android and iOS platforms, we identified these features as both widely available and usable. We rigorously evaluated their effectiveness against a threat model, considering adversarial attacks from diverse sources, including Deep Neural Networks (DNNs), Large Language Models (LLMs), and reversibility testing. Our experiments, conducted on three distinct datasets, demonstrate that pitch and tempo manipulation effectively obfuscates emotional data. Additionally, we explore the design principles for lightweight, on-device implementation to ensure broad applicability across various devices and platforms.
[AI-47] Scaling Policy Gradient Quality-Diversity with Massive Parallelization via Behavioral Variations
Link: https://arxiv.org/abs/2501.18723
Authors: Konstantinos Mitsides, Maxence Faldor, Antoine Cully
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract:Quality-Diversity optimization comprises a family of evolutionary algorithms aimed at generating a collection of diverse and high-performing solutions. MAP-Elites (ME), a notable example, is used effectively in fields like evolutionary robotics. However, the reliance of ME on random mutations from Genetic Algorithms limits its ability to evolve high-dimensional solutions. Methods proposed to overcome this include using gradient-based operators like policy gradients or natural evolution strategies. While successful at scaling ME for neuroevolution, these methods often suffer from slow training speeds, or difficulties in scaling with massive parallelization due to high computational demands or reliance on centralized actor-critic training. In this work, we introduce a fast, sample-efficient ME based algorithm capable of scaling up with massive parallelization, significantly reducing runtimes without compromising performance. Our method, ASCII-ME, unlike existing policy gradient quality-diversity methods, does not rely on centralized actor-critic training. It performs behavioral variations based on time step performance metrics and maps these variations to solutions using policy gradients. Our experiments show that ASCII-ME can generate a diverse collection of high-performing deep neural network policies in less than 250 seconds on a single GPU. Additionally, it operates on average, five times faster than state-of-the-art algorithms while still maintaining competitive sample efficiency.
[AI-48] The Pitfalls of “Security by Obscurity” And What They Mean for Transparent AI AAAI2025
Link: https://arxiv.org/abs/2501.18669
Authors: Peter Hall, Olivia Mundahl, Sunoo Park
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 27 pages, abbreviated version in AAAI 2025
Abstract:Calls for transparency in AI systems are growing in number and urgency from diverse stakeholders ranging from regulators to researchers to users (with a comparative absence of companies developing AI). Notions of transparency for AI abound, each addressing distinct interests and concerns. In computer security, transparency is likewise regarded as a key concept. The security community has for decades pushed back against so-called security by obscurity – the idea that hiding how a system works protects it from attack – against significant pressure from industry and other stakeholders. Over the decades, in a community process that is imperfect and ongoing, security researchers and practitioners have gradually built up some norms and practices around how to balance transparency interests with possible negative side effects. This paper asks: What insights can the AI community take from the security community’s experience with transparency? We identify three key themes in the security community’s perspective on the benefits of transparency and their approach to balancing transparency against countervailing interests. For each, we investigate parallels and insights relevant to transparency in AI. We then provide a case study discussion on how transparency has shaped the research subfield of anonymization. Finally, shifting our focus from similarities to differences, we highlight key transparency issues where modern AI systems present challenges different from other kinds of security-critical systems, raising interesting open questions for the security and AI communities alike.
[AI-49] Simulation Streams: A Programming Paradigm for Controlling Large Language Models and Building Complex Systems with Generative AI
Link: https://arxiv.org/abs/2501.18668
Authors: Peter Sunehag, Joel Z. Leibo
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Technical report accompanying the release of code on GitHub
Abstract:We introduce Simulation Streams, a programming paradigm designed to efficiently control and leverage Large Language Models (LLMs) for complex, dynamic simulations and agentic workflows. Our primary goal is to create a minimally interfering framework that harnesses the agentic abilities of LLMs while addressing their limitations in maintaining consistency, selectively ignoring/including information, and enforcing strict world rules. Simulation Streams achieves this through a state-based approach where variables are modified in sequential steps by “operators,” producing output in a recurring format and adhering to consistent rules for state variables. This approach focuses the LLMs on defined tasks, while aiming to have the context stream remain “in-distribution”. The approach incorporates an Entity-Component-System (ECS) architecture to write programs in a more intuitive manner, facilitating reuse of workflows across different components and entities. This ECS approach enhances the modularity of the output stream, allowing for complex, multi-entity simulations while maintaining format consistency, information control, and rule enforcement. It is supported by a custom editor that aids in creating, running, and analyzing simulations. We demonstrate the versatility of simulation streams through an illustrative example of an ongoing market economy simulation, a social simulation of three characters playing a game of catch in a park, and a suite of classical reinforcement learning benchmark tasks. These examples showcase Simulation Streams’ ability to handle complex, evolving scenarios over 100s-1000s of iterations, facilitate comparisons between different agent workflows and models, and maintain consistency and continued interesting developments in LLM-driven simulations.
[AI-50] Structure Development in List-Sorting Transformers
Link: https://arxiv.org/abs/2501.18666
Authors: Einar Urdshals, Jasmina Urdshals
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 15+19 pages, 6+13 figures
Abstract:We study how a one-layer attention-only transformer develops relevant structures while learning to sort lists of numbers. At the end of training, the model organizes its attention heads in two main modes that we refer to as vocabulary-splitting and copy-suppression. Both represent simpler modes than having multiple heads handle overlapping ranges of numbers. Interestingly, vocabulary-splitting is present regardless of whether we use weight decay, a common regularization technique thought to drive simplification, supporting the thesis that neural networks naturally prefer simpler solutions. We relate copy-suppression to a mechanism in GPT-2 and investigate its functional role in our model. Guided by insights from a developmental analysis of the model, we identify features in the training data that drive the model’s final acquired solution. This provides a concrete example of how the training data shape the internal organization of transformers, paving the way for future studies that could help us better understand how LLMs develop their internal structures.
[AI-51] BARNN: A Bayesian Autoregressive and Recurrent Neural Network
Link: https://arxiv.org/abs/2501.18665
Authors: Dario Coscia, Max Welling, Nicola Demo, Gianluigi Rozza
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Autoregressive and recurrent networks have achieved remarkable progress across various fields, from weather forecasting to molecular generation and Large Language Models. Despite their strong predictive capabilities, these models lack a rigorous framework for addressing uncertainty, which is key in scientific applications such as PDE solving, molecular generation and Machine Learning Force Fields. To address this shortcoming we present BARNN: a variational Bayesian Autoregressive and Recurrent Neural Network. BARNNs aim to provide a principled way to turn any autoregressive or recurrent model into its Bayesian version. BARNN is based on the variational dropout method, which allows it to be applied to large recurrent neural networks as well. We also introduce a temporal version of the “Variational Mixtures of Posteriors” prior (tVAMP-prior) to make Bayesian inference efficient and well-calibrated. Extensive experiments on PDE modelling and molecular generation demonstrate that BARNN not only achieves comparable or superior accuracy compared to existing methods, but also excels in uncertainty quantification and modelling long-range dependencies.
[AI-52] Joint Optimization of Prompt Security and System Performance in Edge-Cloud LLM Systems
Link: https://arxiv.org/abs/2501.18663
Authors: Haiyang Huang, Tianhui Meng, Weijia Jia
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:Large language models (LLMs) have significantly facilitated human life, and prompt engineering has improved the efficiency of these models. However, recent years have witnessed a rise in prompt engineering-empowered attacks, leading to issues such as privacy leaks, increased latency, and system resource wastage. Though safety fine-tuning based methods with Reinforcement Learning from Human Feedback (RLHF) are proposed to align the LLMs, existing security mechanisms fail to cope with fickle prompt attacks, highlighting the necessity of performing security detection on prompts. In this paper, we jointly consider prompt security, service latency, and system resource optimization in Edge-Cloud LLM (EC-LLM) systems under various prompt attacks. To enhance prompt security, a vector-database-enabled lightweight attack detector is proposed. We formalize the problem of joint prompt detection, latency, and resource optimization into a multi-stage dynamic Bayesian game model. The equilibrium strategy is determined by predicting the number of malicious tasks and updating beliefs at each stage through Bayesian updates. The proposed scheme is evaluated on a real implemented EC-LLM system, and the results demonstrate that our approach offers enhanced security, reduces the service latency for benign users, and decreases system resource consumption compared to state-of-the-art algorithms.
[AI-53] Enhancing Large Language Model Efficiency via Symbolic Compression: A Formal Approach Towards Interpretability
Link: https://arxiv.org/abs/2501.18657
Authors: Lumen AI, Tengzhou No. 1 Middle School, Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Tianhao Xu
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract:Large language models (LLMs) face significant token efficiency bottlenecks in code generation and logical reasoning tasks, a challenge that directly impacts inference cost and model interpretability. This paper proposes a formal framework based on symbolic compression, integrating combinatory logic, information-theoretic optimal encoding, and context-aware inference techniques to achieve a step-change improvement in token efficiency while preserving semantic integrity. We establish a mathematical framework within a functional programming paradigm, derive the quantitative relationship between symbolic density and model interpretability, and propose a differentiable compression factor metric to evaluate encoding efficiency. Furthermore, we leverage parameter-efficient fine-tuning (PEFT) techniques to achieve a low-cost application of the GAEL language. Experimental results show that this method achieves a 78.3% token compression rate in code generation tasks while improving logical traceability by 62% through structural explicitness. This research provides new theoretical tools for efficient inference in LLMs and opens a symbolic path for model interpretability research.
[AI-54] Cogito ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation
Link: https://arxiv.org/abs/2501.18653
Authors: Yanlong Li, Jindong Li, Qi Wang, Menglin Yang, He Kong, Shengsheng Wang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Abstract:Large language model-based Multi-Agent Systems (MAS) have demonstrated promising performance for enhancing the efficiency and accuracy of code generation tasks. However, most existing methods follow a conventional sequence of planning, coding, and debugging, which contradicts the growth-driven nature of the human learning process. Additionally, the frequent information interaction between multiple agents inevitably involves high computational costs. In this paper, we propose Cogito, a neurobiologically inspired multi-agent framework to enhance the problem-solving capabilities in code generation tasks with lower cost. Specifically, Cogito adopts a reverse sequence: it first undergoes debugging, then coding, and finally planning. This approach mimics human learning and development, where knowledge is acquired progressively. Accordingly, a hippocampus-like memory module with different functions is designed to work with the pipeline to provide quick retrieval in similar tasks. Through this growth-based learning model, Cogito accumulates knowledge and cognitive skills at each stage, ultimately forming a Super Role, an all-capable agent, to perform the code generation task. Extensive experiments against representative baselines demonstrate the superior performance and efficiency of Cogito. The code is publicly available at this https URL.
[AI-55] SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
Link: https://arxiv.org/abs/2501.18636
Authors: Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate the RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct RAG security evaluation dataset (i.e., SafeRAG dataset) primarily manually for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks and even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: this https URL.
[AI-56] Membership Inference Attacks Against Vision-Language Models
Link: https://arxiv.org/abs/2501.18624
Authors: Yuke Hu, Zheng Li, Zhihao Liu, Yang Zhang, Zhan Qin, Kui Ren, Chun Chen
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by USENIX’25; 22 pages, 28 figures
Abstract:Vision-Language Models (VLMs), built on pre-trained vision encoders and large language models (LLMs), have shown exceptional multi-modal understanding and dialog capabilities, positioning them as catalysts for the next technological revolution. However, while most VLM research focuses on enhancing multi-modal interaction, the risks of data misuse and leakage have been largely unexplored. This prompts the need for a comprehensive investigation of such risks in VLMs. In this paper, we conduct the first analysis of misuse and leakage detection in VLMs through the lens of membership inference attack (MIA). Specifically, we focus on the instruction tuning data of VLMs, which is more likely to contain sensitive or unauthorized information. To address the limitations of existing MIA methods, we introduce a novel approach that infers membership based on a set of samples and their sensitivity to temperature, a unique parameter in VLMs. Based on this, we propose four membership inference methods, each tailored to different levels of background knowledge, ultimately arriving at the most challenging scenario. Our comprehensive evaluations show that these methods can accurately determine membership status, e.g., achieving an AUC greater than 0.8 targeting a small set consisting of only 5 samples on LLaVA.
[AI-57] Deeply Optimizing the SAT Solver for the IC3 Algorithm
Link: https://arxiv.org/abs/2501.18612
Authors: Yuheng Su, Qiusong Yang, Yiwei Ci, Yingcheng Li, Tianjun Bu, Ziyu Huang
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Abstract:The IC3 algorithm, also known as PDR, is a SAT-based model checking algorithm that has significantly influenced the field in recent years due to its efficiency, scalability, and completeness. It utilizes SAT solvers to solve a series of SAT queries associated with relative induction. Based on our observations of the unique characteristics of SAT queries in IC3, this paper introduces GipSAT, a lightweight SAT solver specifically optimized for IC3. By observing that SAT queries do not necessarily require decisions on all variables, GipSAT calculates a subset of variables that need to be decided before each solving, while ensuring that the result remains unaffected. By observing that the overhead of binary heap operations in VSIDS is not negligible, GipSAT utilizes buckets instead of binary heap to achieve constant-time operations. GipSAT supports temporary clauses without the need to allocate a new activation variable before each solving, thus eliminating the need to reset solvers. The comprehensive evaluation demonstrates a significant performance improvement achieved by GipSAT. When compared to Minisat, GipSAT achieves an average speedup of 3.61 times in solving time.
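The bucket idea mentioned in the GipSAT abstract above can be illustrated with a generic bucket queue. This is a minimal sketch under stated assumptions, not GipSAT's implementation: the class name `BucketQueue`, the use of a small fixed number of integer priority levels, and the lazy-deletion scheme for bumped variables are all hypothetical choices made for illustration. Pushing is a constant-time list append, in contrast to the logarithmic sift of a binary heap.

```python
class BucketQueue:
    """Max-priority queue over integer levels 0 .. num_levels - 1 (hypothetical sketch)."""

    def __init__(self, num_levels):
        self.buckets = [[] for _ in range(num_levels)]
        self.level = {}   # variable -> its current (valid) level
        self.top = -1     # highest level that may still hold entries

    def push(self, var, level):
        """Insert `var`, or bump it to a new level; older entries become stale."""
        self.buckets[level].append(var)   # O(1) append, no heap sift
        self.level[var] = level
        if level > self.top:
            self.top = level

    def pop_max(self):
        """Return a variable from the highest non-empty level, or None if empty."""
        while self.top >= 0:
            bucket = self.buckets[self.top]
            while bucket:
                var = bucket.pop()
                # Lazy deletion: skip entries left behind by an earlier bump or pop.
                if self.level.get(var) == self.top:
                    del self.level[var]
                    return var
            self.top -= 1
        return None


queue = BucketQueue(num_levels=4)
queue.push("a", 1)
queue.push("b", 3)
queue.push("c", 2)
queue.push("a", 3)   # bump "a"; its level-1 entry is now stale
order = [queue.pop_max() for _ in range(4)]
print(order)
```

The trade-off in this toy version is that priorities must be discretized into levels, whereas a heap orders arbitrary floating-point activities exactly.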
[AI-58] Faster Configuration Performance Bug Testing with Neural Dual-level Prioritization ICSE2025
Link: https://arxiv.org/abs/2501.15392
Authors: Youpeng Ma, Tao Chen, Ke Li
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by ICSE 2025
Abstract:As software systems become more complex and configurable, more performance problems tend to arise from the configuration designs. This has caused some configuration options to unexpectedly degrade performance, deviating from the expectations originally set by the developers. Such discrepancies, namely configuration performance bugs (CPBugs), are devastating and can be deeply hidden in the source code. Yet, efficiently testing CPBugs is difficult, not only because the test oracle is hard to set, but also because the configuration measurement is expensive and there are simply too many possible configurations to test. As such, existing testing tools suffer from lengthy runtime or have been ineffective in detecting CPBugs when the budget is limited, compounded by inaccurate test oracles. In this paper, we seek to achieve significantly faster CPBug testing by neurally prioritizing the testing at both the configuration option and value range levels with automated oracle estimation. Our proposed tool, dubbed NDP, is a general framework that works with different heuristic generators. The idea is to leverage two neural language models: one to estimate the CPBug types that serve as the oracle while, more vitally, the other to infer the probabilities of an option being CPBug-related, based on which the options and the value ranges to be searched can be prioritized. Experiments on several widely-used systems of different versions reveal that NDP can, in general, better predict CPBug type in 87% of cases and find more CPBugs with up to 88.88x testing efficiency speedup over the state-of-the-art tools.
[AI-59] What is causal about causal models and representations?
Link: https://arxiv.org/abs/2501.19335
Authors: Frederik Hytting Jørgensen, Luigi Gresele, Sebastian Weichwald
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 50 pages
Abstract:Causal Bayesian networks are ‘causal’ models since they make predictions about interventional distributions. To connect such causal model predictions to real-world outcomes, we must determine which actions in the world correspond to which interventions in the model. For example, to interpret an action as an intervention on a treatment variable, the action will presumably have to a) change the distribution of treatment in a way that corresponds to the intervention, and b) not change other aspects, such as how the outcome depends on the treatment; while the marginal distributions of some variables may change as an effect. We introduce a formal framework to make such requirements for different interpretations of actions as interventions precise. We prove that the seemingly natural interpretation of actions as interventions is circular: Under this interpretation, every causal Bayesian network that correctly models the observational distribution is trivially also interventionally valid, and no action yields empirical data that could possibly falsify such a model. We prove an impossibility result: No interpretation exists that is non-circular and simultaneously satisfies a set of natural desiderata. Instead, we examine non-circular interpretations that may violate some desiderata and show how this may in turn enable the falsification of causal models. By rigorously examining how a causal Bayesian network could be a ‘causal’ model of the world instead of merely a mathematical object, our formal framework contributes to the conceptual foundations of causal representation learning, causal discovery, and causal abstraction, while also highlighting some limitations of existing approaches.
[AI-60] Survey and Improvement Strategies for Gene Prioritization with Large Language Models
Link: https://arxiv.org/abs/2501.18794
Authors: Matthew Neeley, Guantong Qi, Guanchu Wang, Ruixiang Tang, Dongxue Mao, Chaozhong Liu, Sasidhar Pasupuleti, Bo Yuan, Fan Xia, Pengfei Liu, Zhandong Liu, Xia Hu
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 10 pages of supplementary figures
Abstract:Rare diseases are challenging to diagnose due to limited patient data and genetic diversity. Despite advances in variant prioritization, many cases remain undiagnosed. While large language models (LLMs) have performed well in medical exams, their effectiveness in diagnosing rare genetic diseases has not been assessed. To identify causal genes, we benchmarked various LLMs for gene prioritization. Using multi-agent and Human Phenotype Ontology (HPO) classification, we categorized patients based on phenotypes and solvability levels. As gene set size increased, LLM performance deteriorated, so we used a divide-and-conquer strategy to break the task into smaller subsets. At baseline, GPT-4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly. The multi-agent and HPO approaches helped distinguish confidently solved cases from challenging ones, highlighting the importance of known gene-phenotype associations and phenotype specificity. We found that cases with specific phenotypes or clear associations were more accurately solved. However, we observed biases toward well-studied genes and input order sensitivity, which hindered gene prioritization. Our divide-and-conquer strategy improved accuracy by overcoming these biases. By utilizing HPO classification, novel multi-agent techniques, and our LLM strategy, we improved causal gene identification accuracy compared to our baseline evaluation. This approach streamlines rare disease diagnosis, facilitates reanalysis of unsolved cases, and accelerates gene discovery, supporting the development of targeted diagnostics and therapies.
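The divide-and-conquer strategy described in the abstract above (ranking small gene subsets instead of one large list) might look roughly like the following. This is only a sketch under assumptions: `rank_fn` is a hypothetical stand-in for whatever ranker is used (an LLM call in the paper's setting, a toy score lookup here), and the tournament-style merging with `chunk_size` and `keep` is an illustrative choice, not the authors' documented procedure.

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def divide_and_conquer_rank(genes, rank_fn, chunk_size, keep):
    """Rank small chunks, keep the top `keep` of each, repeat until one chunk remains.

    `rank_fn` (hypothetical) takes a short list and returns it ordered best-first.
    """
    if not 0 < keep < chunk_size:
        raise ValueError("keep must satisfy 0 < keep < chunk_size")
    pool = list(genes)
    while len(pool) > chunk_size:
        survivors = []
        for chunk in chunked(pool, chunk_size):
            survivors.extend(rank_fn(chunk)[:keep])
        pool = survivors   # strictly smaller, because every full chunk drops a gene
    return rank_fn(pool)


# Toy ranker: sorts by a made-up relevance score (a stand-in for an LLM call).
toy_scores = {f"g{i}": score for i, score in enumerate([3, 9, 1, 4, 7, 2, 8, 10, 5, 6])}


def toy_rank(chunk):
    return sorted(chunk, key=toy_scores.get, reverse=True)


ranking = divide_and_conquer_rank(list(toy_scores), toy_rank, chunk_size=4, keep=2)
print(ranking)
```

Because each call only sees `chunk_size` candidates, the ranker never has to order the full list at once, which is the property the abstract relies on when it reports that performance deteriorates as gene set size grows.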
Machine Learning
[LG-0] Low-Rank Adapting Models for Sparse Autoencoders ATC
Link: https://arxiv.org/abs/2501.19406
Authors: Matthew Chen, Joshua Engels, Max Tegmark
Categories: Machine Learning (cs.LG)
Comments: Code available at this https URL
Abstract:Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we improve on these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3× to 20× faster on Gemma-2-2B and 2× to 10× faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
[LG-1] Detection Is All You Need: A Feasible Optimal Prior-Free Black-Box Approach For Piecewise Stationary Bandits
Link: https://arxiv.org/abs/2501.19401
Authors: Argyrios Gerogiannis, Yu-Han Huang, Subhonmesh Bose, Venugopal V. Veeravalli
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages, 5 figures
Abstract:We study the problem of piecewise stationary bandits without prior knowledge of the underlying non-stationarity. We propose the first feasible black-box algorithm applicable to most common parametric bandit variants. Our procedure, termed Detection Augmented Bandit (DAB), is modular, accepting any stationary bandit algorithm as input and augmenting it with a change detector. DAB achieves optimal regret in the piecewise stationary setting under mild assumptions. Specifically, we prove that DAB attains the order-optimal regret bound of $\tilde{\mathcal{O}}(\sqrt{N_T T})$, where $N_T$ denotes the number of changes over the horizon $T$, if its input stationary bandit algorithm has order-optimal stationary regret guarantees. Applying DAB to different parametric bandit settings, we recover recent state-of-the-art results. Notably, for self-concordant bandits, DAB achieves optimal dynamic regret, while previous works obtain suboptimal bounds and require knowledge of the non-stationarity. In simulations on piecewise stationary environments, DAB outperforms existing approaches across varying numbers of changes. Interestingly, despite being theoretically designed for piecewise stationary environments, DAB is also effective in simulations in drifting environments, outperforming existing methods designed specifically for this scenario.
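The modular pattern described in the DAB abstract above, a stationary bandit wrapped with a change detector that triggers a restart, can be sketched as follows. Everything here is an illustrative assumption rather than the paper's algorithm: `GreedyBandit`, `WindowDetector`, and `DetectionAugmented` are hypothetical names, the detector is a naive two-window mean comparison, rewards are deterministic, and the sketch omits whatever exploration schedule the actual method uses to guarantee its regret bound.

```python
from collections import deque


class GreedyBandit:
    """Toy stationary bandit: try each arm once, then play the best empirical mean."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self):
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        means = [s / c for s, c in zip(self.sums, self.counts)]
        return max(range(len(means)), key=means.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


class WindowDetector:
    """Flags a change when the means of two adjacent windows differ by > threshold."""

    def __init__(self, window, threshold):
        self.window = window
        self.threshold = threshold
        self.buf = deque(maxlen=2 * window)

    def update(self, x):
        self.buf.append(x)
        if len(self.buf) < 2 * self.window:
            return False
        vals = list(self.buf)
        old = sum(vals[:self.window]) / self.window
        new = sum(vals[self.window:]) / self.window
        return abs(new - old) > self.threshold


class DetectionAugmented:
    """Wraps the toy bandit; restarts bandit and detectors when a change is flagged."""

    def __init__(self, n_arms, window=5, threshold=0.5):
        self.n_arms, self.window, self.threshold = n_arms, window, threshold
        self.restarts = 0
        self._reset()

    def _reset(self):
        self.bandit = GreedyBandit(self.n_arms)
        self.detectors = [WindowDetector(self.window, self.threshold)
                          for _ in range(self.n_arms)]

    def select(self):
        return self.bandit.select()

    def update(self, arm, reward):
        self.bandit.update(arm, reward)
        if self.detectors[arm].update(reward):
            self.restarts += 1
            self._reset()


def simulate(steps, change_at):
    """Two arms with deterministic rewards that swap at step `change_at`."""
    agent = DetectionAugmented(n_arms=2)
    choices = []
    for t in range(steps):
        rewards = [1.0, 0.0] if t < change_at else [0.0, 1.0]
        arm = agent.select()
        agent.update(arm, rewards[arm])
        choices.append(arm)
    return choices, agent


choices, agent = simulate(steps=60, change_at=30)
print(agent.restarts, choices[-5:])
```

In this toy run the plain greedy learner would stay on arm 0 forever after the swap, since its stale empirical mean never drops to zero; the restart is what lets it rediscover arm 1.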
[LG-2] Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
Link: https://arxiv.org/abs/2501.19392
Authors: Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
Categories: Machine Learning (cs.LG)
Comments: Preprint, under review
Abstract:Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to “optimally” compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under 1% relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
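The abstract above describes predicting part of the cache from existing dependencies and compressing only "the information that cannot be predicted". A toy, scalar version of that predict-then-quantize-the-residual idea is sketched below. It is an assumption-heavy illustration, not AQUA-KV: the helper names are hypothetical, the "adapter" is a single least-squares scale factor, the keys are assumed to be available exactly, and the quantizer is plain min-max uniform quantization.

```python
def quantize(xs, bits):
    """Uniform min-max quantization to 2**bits levels; returns dequantized values."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return list(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in xs]


def fit_scale(keys, values):
    """Least-squares scalar a minimizing sum((v - a * k) ** 2)."""
    return sum(k * v for k, v in zip(keys, values)) / sum(k * k for k in keys)


def compress_values(keys, values, bits):
    """Predict values from keys, quantize only the residual, then reconstruct."""
    a = fit_scale(keys, values)
    residual = [v - a * k for k, v in zip(keys, values)]
    residual_q = quantize(residual, bits)
    return [a * k + r for k, r in zip(keys, residual_q)]


def max_abs_error(reference, approx):
    return max(abs(r - x) for r, x in zip(reference, approx))


# Synthetic data: values are roughly 2 * key plus a small deterministic perturbation.
keys = [0.1 * (i - 10) for i in range(21)]
values = [2.0 * k + 0.05 * ((i % 3) - 1) for i, k in enumerate(keys)]

direct = quantize(values, bits=2)
predicted = compress_values(keys, values, bits=2)
print(max_abs_error(values, direct), max_abs_error(values, predicted))
```

With the same 2-bit budget, the residual spans a much narrower range than the raw values, so the quantization step (and hence the reconstruction error) is smaller on this synthetic data.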
[LG-3] Federated Sketching LoRA: On-Device Collaborative Fine-Tuning of Large Language Models
Link: https://arxiv.org/abs/2501.19389
Authors: Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton
Categories: Machine Learning (cs.LG)
Comments: 23 pages
Abstract:Fine-tuning large language models (LLMs) on devices is attracting increasing interest. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with device model sizes and data scarcity. Still, the heterogeneity of computational resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying device capabilities constrain LoRA’s feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for an efficient and theoretically-grounded solution. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable devices to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the devices, FSLoRA flexibly adapts to device-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through comprehensive experiments on multiple datasets and LLM models, we demonstrate FSLoRA’s superior performance compared to various baselines.
[LG-4] Fixing the Double Penalty in Data-Driven Weather Forecasting Through a Modified Spherical Harmonic Loss Function
链接: https://arxiv.org/abs/2501.19374
作者: Christopher Subich,Syed Zahid Husain,Leo Separovic,Jing Yang
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Recent advancements in data-driven weather forecasting models have delivered deterministic models that outperform the leading operational forecast systems based on traditional, physics-based models. However, these data-driven models are typically trained with a mean squared error loss function, which causes smoothing of fine scales through a “double penalty” effect. We develop a simple, parameter-free modification to this loss function that avoids this problem by separating the loss attributable to decorrelation from the loss attributable to spectral amplitude errors. Fine-tuning the GraphCast model with this new loss function results in sharp deterministic weather forecasts, an increase of the model’s effective resolution from 1,250km to 160km, improvements to ensemble spread, and improvements to predictions of tropical cyclone strength and surface wind extremes.
[LG-5] The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
链接: https://arxiv.org/abs/2501.19358
作者: Yuchun Miao,Sen Zhang,Liang Ding,Yuqi Zhang,Lefei Zhang,Dacheng Tao
类目: Machine Learning (cs.LG)
备注: 28 pages, 21 figures
Abstract:This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the LLM’s final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
[LG-6] Neural Implicit Solution Formula for Efficiently Solving Hamilton-Jacobi Equations
链接: https://arxiv.org/abs/2501.19351
作者: Yesom Park,Stanley Osher
类目: Machine Learning (cs.LG)
备注:
Abstract:This paper presents an implicit solution formula for the Hamilton-Jacobi partial differential equation (HJ PDE). The formula is derived using the method of characteristics and is shown to coincide with the Hopf and Lax formulas in the case where either the Hamiltonian or the initial function is convex. It provides a simple and efficient numerical approach for computing the viscosity solution of HJ PDEs, bypassing the need for the Legendre transform of the Hamiltonian or the initial condition, and the explicit computation of individual characteristic trajectories. A deep learning-based methodology is proposed to learn this implicit solution formula, leveraging the mesh-free nature of deep learning to ensure scalability for high-dimensional problems. Building upon this framework, an algorithm is developed that approximates the characteristic curves piecewise linearly for state-dependent Hamiltonians. Extensive experimental results demonstrate that the proposed method delivers highly accurate solutions, even for nonconvex Hamiltonians, and exhibits remarkable scalability, achieving computational efficiency for problems up to 40 dimensions.
[LG-7] An All-digital 65-nm Tsetlin Machine Image Classification Accelerator with 8.6 nJ per MNIST Frame at 60.3k Frames per Second
链接: https://arxiv.org/abs/2501.19347
作者: Svein Anders Tunheim,Yujin Zheng,Lei Jiao,Rishad Shafik,Alex Yakovlev,Ole-Christoffer Granmo
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
备注: 10 pages, 6 figures. This work has been submitted to the IEEE for possible publication
Abstract:We present an all-digital programmable machine learning accelerator chip for image classification, based on Tsetlin machine (TM) principles. The TM is a machine learning algorithm founded on propositional logic, utilizing sub-pattern recognition expressions called clauses. The accelerator implements the coalesced TM version with convolution, and classifies booleanized images of 28 × 28 pixels with 10 categories. A configuration with 128 clauses is used in a highly parallel architecture. Fast clause evaluation is obtained by keeping all clause weights and Tsetlin automata (TA) action signals in registers. The chip is implemented in a 65 nm low-leakage CMOS technology, and occupies an active area of 2.7 mm². At a clock frequency of 27.8 MHz, the accelerator achieves 60.3k classifications per second, and consumes 8.6 nJ per classification. The latency for classifying a single image is 25.4 μs, which includes system timing overhead. The accelerator achieves 97.42%, 84.54% and 82.55% test accuracies for the datasets MNIST, Fashion-MNIST and Kuzushiji-MNIST, respectively, matching the TM software models.
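The figures quoted in this abstract can be combined with simple arithmetic. The sketch below derives an implied average power, the number of clock cycles per classification, and the frame period. These derived values are not stated in the abstract; they assume the energy figure applies to continuous operation at the quoted throughput.

```python
# Figures quoted in the abstract.
energy_per_frame_j = 8.6e-9    # 8.6 nJ per classification
frames_per_second = 60.3e3     # 60.3k classifications per second
clock_hz = 27.8e6              # 27.8 MHz clock
latency_s = 25.4e-6            # 25.4 microseconds single-image latency

# Derived quantities (assumptions, see the paragraph above).
average_power_w = energy_per_frame_j * frames_per_second
cycles_per_frame = clock_hz / frames_per_second
frame_period_s = 1.0 / frames_per_second

print(f"implied average power: {average_power_w * 1e3:.3f} mW")
print(f"clock cycles per frame: {cycles_per_frame:.1f}")
print(f"frame period: {frame_period_s * 1e6:.2f} us "
      f"(quoted latency: {latency_s * 1e6:.1f} us)")
```

The implied power is roughly half a milliwatt, and each classification takes roughly 461 clock cycles at the quoted throughput.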
[LG-8] PUATE: Semiparametric Efficient Average Treatment Effect Estimation from Treated (Positive) and Unlabeled Units
链接: https://arxiv.org/abs/2501.19345
作者: Masahiro Kato,Fumiaki Kozai,Ryo Inokuchi
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:The estimation of average treatment effects (ATEs), defined as the difference in expected outcomes between treatment and control groups, is a central topic in causal inference. This study develops semiparametric efficient estimators for ATE estimation in a setting where only a treatment group and an unknown group, comprising units for which it is unclear whether they received the treatment or control, are observable. This scenario represents a variant of learning from positive and unlabeled data (PU learning) and can be regarded as a special case of ATE estimation with missing data. For this setting, we derive semiparametric efficiency bounds, which provide lower bounds on the asymptotic variance of regular estimators. We then propose semiparametric efficient ATE estimators whose asymptotic variance aligns with these efficiency bounds. Our findings contribute to causal inference with missing data and weakly supervised learning.
[LG-9] Covering Multiple Objectives with a Small Set of Solutions Using Bayesian Optimization
链接: https://arxiv.org/abs/2501.19342
作者: Natalie Maus,Kyurae Kim,Yimeng Zeng,Haydn Thomas Jones,Fangping Wan,Marcelo Der Torossian Torres,Cesar de la Fuente-Nunez,Jacob R. Gardner
类目: Machine Learning (cs.LG)
备注:
Abstract:In multi-objective black-box optimization, the goal is typically to find solutions that optimize a set of T black-box objective functions, f_1, …, f_T, simultaneously. Traditional approaches often seek a single Pareto-optimal set that balances trade-offs among all objectives. In this work, we introduce a novel problem setting that departs from this paradigm: finding a smaller set of K solutions, where K < T, that collectively “covers” the T objectives. A set of solutions is defined as “covering” if, for each objective f_1, …, f_T, there is at least one good solution. A motivating example for this problem setting occurs in drug design. For example, we may have T pathogens and aim to identify a set of K < T antibiotics such that at least one antibiotic can be used to treat each pathogen. To address this problem, we propose Multi-Objective Coverage Bayesian Optimization (MOCOBO), a principled algorithm designed to efficiently find a covering set. We validate our approach through extensive experiments on challenging high-dimensional tasks, including applications in peptide and molecular design. Experiments demonstrate MOCOBO’s ability to find high-performing covering sets of solutions. Additionally, we show that the small sets of K < T solutions found by MOCOBO can match or nearly match the performance of T individually optimized solutions for the same objectives. Our results highlight MOCOBO’s potential to tackle complex multi-objective problems in domains where finding at least one high-performing solution for each objective is critical.
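The notion of a "covering" set can be illustrated with a small sketch. The abstract describes coverage only qualitatively, so the scoring rule below (sum over objectives of the best value among the chosen solutions) is an assumed formalization. The exhaustive search stands in for the Bayesian optimization loop purely for illustration and is not the MOCOBO algorithm.

```python
from itertools import combinations


def coverage_score(scores, chosen):
    """Sum over objectives of the best score among the chosen solutions.

    scores[t][j] is the value of objective t at candidate solution j
    (higher is better). This scoring rule is an assumption.
    """
    return sum(max(row[j] for j in chosen) for row in scores)


def best_covering_set(scores, k):
    """Brute-force search over all size-k subsets of candidate solutions."""
    n_candidates = len(scores[0])
    return max(combinations(range(n_candidates), k),
               key=lambda chosen: coverage_score(scores, chosen))


# T = 3 objectives (rows), 3 candidate solutions (columns); made-up values.
scores = [
    [1, 5, 2],
    [4, 0, 3],
    [2, 2, 6],
]
chosen = best_covering_set(scores, k=2)
print(chosen, coverage_score(scores, chosen))
```

With K = 2, the pair of candidates 1 and 2 gives each of the three objectives at least one reasonably good solution, even though neither candidate alone does.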
[LG-10] The Value of Prediction in Identifying the Worst-Off
链接: https://arxiv.org/abs/2501.19334
作者: Unai Fischer-Abaigar,Christoph Kern,Juan Carlos Perdomo
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Machine learning is increasingly used in government programs to identify and support the most vulnerable individuals, prioritizing assistance for those at greatest risk over optimizing aggregate outcomes. This paper examines the welfare impacts of prediction in equity-driven contexts, and how they compare to other policy levers, such as expanding bureaucratic capacity. Through mathematical models and a real-world case study on long-term unemployment amongst German residents, we develop a comprehensive understanding of the relative effectiveness of prediction in surfacing the worst-off. Our findings provide clear analytical frameworks and practical, data-driven tools that empower policymakers to make principled decisions when designing these systems.
[LG-11] Offline Learning for Combinatorial Multi-armed Bandits
链接: https://arxiv.org/abs/2501.19300
作者: Xutong Liu,Xiangxiang Dai,Jinhang Zuo,Siwei Wang,Carlee-Joe Wong,John C.S. Lui,Wei Chen
类目: Machine Learning (cs.LG)
备注:
Abstract:The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial costs of online interactions and the readily available offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
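The idea of combining pessimistic reward estimates with a combinatorial solver can be sketched as follows. This is a minimal illustration, not the paper's CLCB algorithm: it assumes rewards in [0, 1], uses a generic Hoeffding-style confidence radius, and takes "pick the top-k arms" as the combinatorial solver. The function names and the `delta` parameter are hypothetical.

```python
import math


def lower_confidence_bounds(means, counts, delta=0.05):
    """Pessimistic estimate per arm: empirical mean minus a confidence radius.

    The Hoeffding-style radius sqrt(log(1/delta) / (2 n)) is an assumption;
    every arm must appear at least once in the offline data (n > 0).
    """
    return [m - math.sqrt(math.log(1.0 / delta) / (2.0 * n))
            for m, n in zip(means, counts)]


def select_top_k(means, counts, k, delta=0.05):
    """Toy combinatorial solver: choose the k arms with the highest LCB."""
    lcb = lower_confidence_bounds(means, counts, delta)
    ranked = sorted(range(len(lcb)), key=lambda i: lcb[i], reverse=True)
    return sorted(ranked[:k])


# Arm 0 looks best on average but was observed only once in the dataset.
means = [0.9, 0.6, 0.5]
counts = [1, 100, 100]
print(select_top_k(means, counts, k=1))
```

Pessimism makes the learner prefer arm 1, whose estimate is well supported by the data, over arm 0, whose high mean rests on a single sample.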
[LG-12] Differentially Private In-context Learning via Sampling Few-shot Mixed with Zero-shot Outputs
链接: https://arxiv.org/abs/2501.19287
作者: James Flemings,Haosheng Gan,Hongyi Li,Meisam Razaviyayn,Murali Annavaram
类目: Machine Learning (cs.LG)
备注:
Abstract:In-context learning (ICL) has shown promising improvement in downstream task adaptation of LLMs by augmenting prompts with relevant input-output examples (demonstrations). However, the ICL demonstrations can contain privacy-sensitive information, which can be leaked and/or regurgitated by the LLM output. Differential Privacy (DP), a widely adopted privacy safeguard, has emerged to mitigate this privacy leakage, with recent work demonstrating strong privacy-utility tradeoffs in classification tasks for ICL. However, generation tasks for ICL are challenging due to the high-dimensional output space of open-ended generation. To this end, we propose dps-mozo, Differentially Private Sampling by Mixing One-shot with Zero-shot Outputs, a decoding framework that generates DP text by sampling from the product of multiple one-shot outputs mixed with a zero-shot output. This mixing effectively reduces the amount of information that can be leaked by each demonstration. By utilizing the inherent randomness in sampling from the mixed distributions, we can achieve DP without adding noise, thereby improving the privacy-utility tradeoff. Our experimental evaluations show dps-mozo can achieve a strong privacy guarantee, ε = 2, with minimal utility degradation compared to non-private few-shot learning, 0.3% ROUGE-L F1 score decrease on the SAMSum dataset with Gemma 2 2B.
[LG-13] OneBatchPAM: A Fast and Frugal K-Medoids Algorithm AAAI2025
链接: https://arxiv.org/abs/2501.19285
作者: Antoine de Mathelin,Nicolas Enrique Cecchi,François Deheeger,Mathilde Mougeot,Nicolas Vayatis
类目: Machine Learning (cs.LG)
备注: Paper accepted by AAAI 2025
Abstract:This paper proposes a novel k-medoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on the estimation of the k-medoids objective. A single batch of size m ≪ n provides the estimation, which reduces the required memory size and the number of pairwise dissimilarities computations to O(mn), instead of O(n^2) compared to most k-medoids baselines. We obtain theoretical results highlighting that a batch of size m = O(log(n)) is sufficient to guarantee, with strong probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm provides similar performances as state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.
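The batch-based estimate of the k-medoids objective can be sketched in a few lines. This is only an illustration of the estimation step under simple assumptions (uniform sampling without replacement, one-dimensional points, absolute-difference dissimilarity); it is not the OneBatchPAM local-search procedure, and the function names are hypothetical.

```python
import random


def kmedoids_objective(points, medoids, eval_indices, dist):
    """Mean dissimilarity from each evaluated point to its nearest medoid."""
    total = sum(min(dist(points[i], points[j]) for j in medoids)
                for i in eval_indices)
    return total / len(eval_indices)


def batch_estimate(points, medoids, m, dist, seed=0):
    """Estimate the objective from a single random batch of m points.

    Evaluating every one of the n candidate medoids against the same batch
    needs n * m dissimilarities rather than n * n.
    """
    rng = random.Random(seed)
    batch = rng.sample(range(len(points)), m)
    return kmedoids_objective(points, medoids, batch, dist)


def abs_diff(a, b):
    return abs(a - b)


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
medoids = [1, 4]  # indices of the points 1.0 and 11.0

full = kmedoids_objective(points, medoids, range(len(points)), abs_diff)
approx = batch_estimate(points, medoids, m=3, dist=abs_diff)
print(full, approx)
```

When the batch covers the whole dataset, the estimate coincides with the exact objective; smaller batches trade accuracy for fewer dissimilarity computations.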
[LG-14] S-VOTE: Similarity-based Voting for Client Selection in Decentralized Federated Learning IJCNN
链接: https://arxiv.org/abs/2501.19279
作者: Pedro Miguel Sánchez Sánchez,Enrique Tomás Martínez Beltrán,Chao Feng,Gérôme Bovet,Gregorio Martínez Pérez,Alberto Huertas Celdrán
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Submitted to IJCNN
Abstract:Decentralized Federated Learning (DFL) enables collaborative, privacy-preserving model training without relying on a central server. This decentralized approach reduces bottlenecks and eliminates single points of failure, enhancing scalability and resilience. However, DFL also introduces challenges such as suboptimal models with non-IID data distributions, increased communication overhead, and resource usage. Thus, this work proposes S-VOTE, a voting-based client selection mechanism that optimizes resource usage and enhances model performance in federations with non-IID data conditions. S-VOTE considers an adaptive strategy for spontaneous local training that addresses participation imbalance, allowing underutilized clients to contribute without significantly increasing resource costs. Extensive experiments on benchmark datasets demonstrate the effectiveness of S-VOTE. In more detail, it achieves lower communication costs by up to 21%, 4-6% faster convergence, and improves local performance by 9-17% compared to baseline methods in some configurations, all while achieving a 14-24% energy consumption reduction. These results highlight the potential of S-VOTE to address DFL challenges in heterogeneous environments.
[LG-15] Clustering in hyperbolic balls
链接: https://arxiv.org/abs/2501.19247
作者: Vladimir Jaćimović,Aladin Crnkić
类目: Machine Learning (cs.LG)
备注:
Abstract:The idea of representations of the data in negatively curved manifolds recently attracted a lot of attention and gave rise to a new research direction named hyperbolic machine learning (ML). In order to unveil the full potential of this new paradigm, efficient techniques for data analysis and statistical modeling in hyperbolic spaces are necessary. In the present paper, a rigorous mathematical framework for clustering in hyperbolic spaces is established. First, we introduce the k-means clustering in hyperbolic balls, based on the novel definition of barycenter. Second, we present the expectation-maximization (EM) algorithm for learning mixtures of novel probability distributions in hyperbolic balls. In such a way we lay the foundation of unsupervised learning in hyperbolic spaces.
[LG-16] Multi-agent Multi-armed Bandit with Fully Heavy-tailed Dynamics
链接: https://arxiv.org/abs/2501.19239
作者: Xingyu Wang,Mengfan Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 40 pages
Abstract:We study decentralized multi-agent multi-armed bandits in fully heavy-tailed settings, where clients communicate over sparse random graphs with heavy-tailed degree distributions and observe heavy-tailed (homogeneous or heterogeneous) reward distributions with potentially infinite variance. The objective is to maximize system performance by pulling the globally optimal arm with the highest global reward mean across all clients. We are the first to address such fully heavy-tailed scenarios, which capture the dynamics and challenges in communication and inference among multiple clients in real-world systems. In homogeneous settings, our algorithmic framework exploits hub-like structures unique to heavy-tailed graphs, allowing clients to aggregate rewards and reduce noises via hub estimators when constructing UCB indices; under M clients and degree distributions with power-law index α > 1, our algorithm attains a regret bound (almost) of order O(M^{1 - 1/α} log T). Under heterogeneous rewards, clients synchronize by communicating with neighbors, aggregating exchanged estimators in UCB indices; with our newly established information delay bounds on sparse random graphs, we prove a regret bound of O(M log T). Our results improve upon existing work, which only addresses time-invariant connected graphs, or light-tailed dynamics in dense graphs and rewards.
[LG-17] Hourly Short Term Load Forecasting for Residential Buildings and Energy Communities
链接: https://arxiv.org/abs/2501.19234
作者: Aleksei Kychkin,Georgios C. Chasparis
类目: Machine Learning (cs.LG)
备注:
Abstract:Electricity load consumption may be extremely complex in terms of profile patterns, as it depends on a wide range of human factors, and it is often correlated with several exogenous factors, such as the availability of renewable energy and the weather conditions. The first goal of this paper is to investigate the performance of a large selection of different types of forecasting models in predicting the electricity load consumption within the short time horizon of a day or a few hours ahead. Such forecasts may be rather useful for the energy management of individual residential buildings or small energy communities. In particular, we introduce persistence models, standard auto-regressive-based machine learning models, and more advanced deep learning models. The second goal of this paper is to introduce two alternative modeling approaches that are simpler in structure while they take into account domain specific knowledge, as compared to the previously mentioned black-box modeling techniques. In particular, we consider the persistence-based auto-regressive model (PAR) and the seasonal persistence-based regressive model (SPR), previously introduced by the authors. In this paper, we specifically tailor these models to accommodate the generation of hourly forecasts. The introduced models and the induced comparative analysis extend prior work of the authors which was restricted to day-ahead forecasts. We observed a 15-30% increase in the prediction accuracy of the newly introduced hourly-based forecasting models over existing approaches.
[LG-18] Through the Looking Glass: LLM-Based Analysis of AR/VR Android Applications Privacy Policies ICML
链接: https://arxiv.org/abs/2501.19223
作者: Abdulaziz Alghamdi,David Mohaisen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 7 pages; appeared in ICMLA 2024
Abstract:This paper comprehensively analyzes privacy policies in AR/VR applications, leveraging BERT, a state-of-the-art text classification model, to evaluate the clarity and thoroughness of these policies. By comparing the privacy policies of AR/VR applications with those of free and premium websites, this study provides a broad perspective on the current state of privacy practices within the AR/VR industry. Our findings indicate that AR/VR applications generally offer a higher percentage of positive segments than free content but lower than premium websites. The analysis of highlighted segments and words revealed that AR/VR applications strategically emphasize critical privacy practices and key terms. This enhances privacy policies’ clarity and effectiveness.
[LG-19] E2Former: A Linear-time Efficient and Equivariant Transformer for Scalable Molecular Modeling
链接: https://arxiv.org/abs/2501.19216
作者: Yunyang Li,Lin Huang,Zhihao Ding,Chu Wang,Xinran Wei,Han Yang,Zun Wang,Chang Liu,Yu Shi,Peiran Jin,Jia Zhang,Mark Gerstein,Tao Qin
类目: Machine Learning (cs.LG)
备注:
Abstract:Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner 6j convolution (Wigner 6j Conv). By shifting the computational burden from edges to nodes, the Wigner 6j Conv reduces the complexity from O(|E|) to O(|V|) while preserving both the model’s expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional SO(3) convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.
[LG-20] RIGNO: A Graph-based framework for robust and accurate operator learning for PDEs on arbitrary domains
链接: https://arxiv.org/abs/2501.19205
作者: Sepehr Mousavi,Shizheng Wen,Levi Lingsch,Maximilian Herde,Bogdan Raonić,Siddhartha Mishra
类目: Machine Learning (cs.LG)
备注:
Abstract:Learning the solution operators of PDEs on arbitrary domains is challenging due to the diversity of possible domain shapes, in addition to the often intricate underlying physics. We propose an end-to-end graph neural network (GNN) based neural operator to learn PDE solution operators from data on point clouds in arbitrary domains. Our multi-scale model maps data between input/output point clouds by passing it through a downsampled regional mesh. Many novel elements are also incorporated to ensure resolution invariance and temporal continuity. Our model, termed RIGNO, is tested on a challenging suite of benchmarks, composed of various time-dependent and steady PDEs defined on a diverse set of domains. We demonstrate that RIGNO is significantly more accurate than neural operator baselines and robustly generalizes to unseen spatial resolutions and time instances.
[LG-21] A Variational Perspective on Generative Protein Fitness Optimization
链接: https://arxiv.org/abs/2501.19200
作者: Lea Bogensperger,Dominik Narnhofer,Ahmed Allam,Konrad Schindler,Michael Krauthammer
类目: Machine Learning (cs.LG)
备注:
Abstract:The goal of protein fitness optimization is to discover new protein variants with enhanced fitness for a given use. The vast search space and the sparsely populated fitness landscape, along with the discrete nature of protein sequences, pose significant challenges when trying to determine the gradient towards configurations with higher fitness. We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization. Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution and combines a (learned) flow matching prior over sequence mutations with a fitness predictor to guide optimization towards sequences with high fitness. VLGPO achieves state-of-the-art results on two different protein benchmarks of varying complexity. Moreover, the variational design with explicit prior and likelihood functions offers a flexible plug-and-play framework that can be easily customized to suit various protein design tasks.
[LG-22] Position: Curvature Matrices Should Be Democratized via Linear Operators
链接: https://arxiv.org/abs/2501.19183
作者: Felix Dangel,Runa Eschenhagen,Weronika Ormaniec,Andres Fernandez,Lukas Tatzel,Agustinus Kristiadi
类目: Machine Learning (cs.LG)
备注: 8 pages, 2 figures
Abstract:Structured large matrices are prevalent in machine learning. A particularly important class is curvature matrices like the Hessian, which are central to understanding the loss landscape of neural nets (NNs), and enable second-order optimization, uncertainty quantification, model pruning, data attribution, and more. However, curvature computations can be challenging due to the complexity of automatic differentiation, and the variety and structural assumptions of curvature proxies, like sparsity and Kronecker factorization. In this position paper, we argue that linear operators – an interface for performing matrix-vector products – provide a general, scalable, and user-friendly abstraction to handle curvature matrices. To support this position, we developed curvlinops, a library that provides curvature matrices through a unified linear operator interface. We demonstrate with curvlinops how this interface can hide complexity, simplify applications, be extensible and interoperable with other libraries, and scale to large NNs.
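The linear-operator idea, exposing a matrix only through matrix-vector products, can be illustrated without any library. The sketch below is not the curvlinops API: `HessianOperator`, `quadratic_grad` and `top_eigenvalue` are hypothetical names, the Hessian-vector product uses a central finite difference of the gradient rather than automatic differentiation, and the test function is a small quadratic chosen so that the result can be checked by hand.

```python
import math


class HessianOperator:
    """Matrix-free Hessian: only matvec is available, no matrix is stored."""

    def __init__(self, grad_fn, x, eps=1e-4):
        self.grad_fn = grad_fn
        self.x = list(x)
        self.eps = eps

    def matvec(self, v):
        # Central finite difference of the gradient along direction v.
        x_plus = [xi + self.eps * vi for xi, vi in zip(self.x, v)]
        x_minus = [xi - self.eps * vi for xi, vi in zip(self.x, v)]
        g_plus = self.grad_fn(x_plus)
        g_minus = self.grad_fn(x_minus)
        return [(gp - gm) / (2.0 * self.eps)
                for gp, gm in zip(g_plus, g_minus)]


def quadratic_grad(x):
    # Gradient of f(x) = 0.5 * x^T A x with A = [[2, 1], [1, 3]].
    return [2.0 * x[0] + x[1], x[0] + 3.0 * x[1]]


def top_eigenvalue(op, dim, n_iter=200):
    """Power iteration that touches the matrix only through op.matvec."""
    v = [1.0] * dim
    for _ in range(n_iter):
        w = op.matvec(v)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    w = op.matvec(v)
    return sum(vi * wi for vi, wi in zip(v, w))  # Rayleigh quotient


op = HessianOperator(quadratic_grad, x=[0.3, -0.7])
print(op.matvec([1.0, 0.0]))
print(top_eigenvalue(op, dim=2))
```

The downstream routine (here, power iteration) never needs to know how the products are computed, which is the kind of separation the abstract argues for.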
[LG-23] A Communication Framework for Compositional Generation
链接: https://arxiv.org/abs/2501.19182
作者: Rafael Elberg,Mircea Petrache,Denis Parra
类目: Machine Learning (cs.LG)
备注:
Abstract:Compositionality and compositional generalization–the ability to understand novel combinations of known concepts–are central characteristics of human language and are hypothesized to be essential for human cognition. In machine learning, the emergence of this property has been studied in a communication game setting, where independent agents (a sender and a receiver) converge to a shared encoding policy from a set of states to a space of discrete messages, where the receiver can correctly reconstruct the states observed by the sender using only the sender’s messages. The use of communication games in generation tasks is still largely unexplored, with recent methods for compositional generation focusing mainly on the use of supervised guidance (either through class labels or text). In this work, we take the first steps to fill this gap, and we present a self-supervised generative communication game-based framework for creating compositional encodings in learned representations from pre-trained encoder-decoder models. In an Iterated Learning (IL) protocol involving a sender and a receiver, we apply alternating pressures for compression and diversity of encoded discrete messages, so that the protocol converges to an efficient but unambiguous encoding. Approximate message entropy regularization is used to favor compositional encodings. Our framework is based on rigorous justifications and proofs of defining and balancing the concepts of Efficiency, Unambiguity and Non-Holisticity in encoding. We test our method on the compositional image dataset Shapes3D, demonstrating robust performance in both reconstruction and compositionality metrics, surpassing other tested discrete message frameworks.
[LG-24] No Foundations without Foundations – Why semi-mechanistic models are essential for regulatory biology
链接: https://arxiv.org/abs/2501.19178
作者: Luka Kovačević,Thomas Gaudelet,James Opzoomer,Hagen Triendl,John Whittaker,Caroline Uhler,Lindsay Edwards,Jake P. Taylor-King
类目: Machine Learning (cs.LG)
备注: 19 pages, 8 figures
Abstract:Despite substantial efforts, deep learning has not yet delivered a transformative impact on elucidating regulatory biology, particularly in the realm of predicting gene expression profiles. Here, we argue that genuine “foundation models” of regulatory biology will remain out of reach unless guided by frameworks that integrate mechanistic insight with principled experimental design. We present one such ground-up, semi-mechanistic framework that unifies perturbation-based experimental designs across both in vitro and in vivo CRISPR screens, accounting for differentiating and non-differentiating cellular systems. By revealing previously unrecognised assumptions in published machine learning methods, our approach clarifies links with popular techniques such as variational autoencoders and structural causal models. In practice, this framework suggests a modified loss function that we demonstrate can improve predictive performance, and further suggests an error analysis that informs batching strategies. Ultimately, since cellular regulation emerges from innumerable interactions amongst largely uncharted molecular components, we contend that systems-level understanding cannot be achieved through structural biology alone. Instead, we argue that real progress will require a first-principles perspective on how experiments capture biological phenomena, how data are generated, and how these processes can be reflected in more faithful modelling architectures.
[LG-25] PSyDUCK: Training-Free Steganography for Latent Diffusion
链接: https://arxiv.org/abs/2501.19172
作者: Georgia Channing,Aqib Mahfuz,Mark van der Wilk,Philip Torr,Fabio Pizzati,Christian Schroeder de Witt
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
备注:
Abstract:Recent advances in AI-generated steganography highlight its potential for safeguarding the privacy of vulnerable democratic actors, including aid workers, journalists, and whistleblowers operating in oppressive regimes. In this work, we address current limitations and establish the foundations for large-throughput generative steganography. We introduce a novel approach that enables secure and efficient steganography within latent diffusion models. We show empirically that our methods perform well across a variety of open-source latent diffusion models, particularly in generative image and video tasks.
[LG-26] Locality-aware Surrogates for Gradient-based Black-box Optimization
链接: https://arxiv.org/abs/2501.19161
作者: Ali Momeni,Stefan Uhlich,Arun Venkitaraman,Chia-Yu Hsieh,Andrea Bonetti,Ryoga Matsuo,Eisaku Ohbuchi,Lorenzo Servadei
类目: Machine Learning (cs.LG)
备注:
Abstract:In physics and engineering, many processes are modeled using non-differentiable black-box simulators, making the optimization of such functions particularly challenging. To address such cases, inspired by the Gradient Theorem, we propose locality-aware surrogate models for active model-based black-box optimization. We first establish a theoretical connection between gradient alignment and the minimization of a Gradient Path Integral Equation (GradPIE) loss, which enforces consistency of the surrogate’s gradients in local regions of the design space. Leveraging this theoretical insight, we develop a scalable training algorithm that minimizes the GradPIE loss, enabling both offline and online learning while maintaining computational efficiency. We evaluate our approach on three real-world tasks - spanning automated in silico experiments such as coupled nonlinear oscillators, analog circuits, and optical systems - and demonstrate consistent improvements in optimization efficiency under limited query budgets. Our results offer dependable solutions for both offline and online optimization tasks where reliable gradient estimation is needed.
[LG-27] A theoretical framework for overfitting in energy-based modeling
Link: https://arxiv.org/abs/2501.19158
Authors: Giovanni Catania, Aurélien Decelle, Cyril Furtlehner, Beatriz Seoane
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
Comments: 23 pages, 13 figures (including appendix)
Abstract:We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as a testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy-based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models.
[LG-28] Test-Time Training Scaling for Chemical Exploration in Drug Design
Link: https://arxiv.org/abs/2501.19153
Authors: Morgan Thomas, Albert Bou, Gianni De Fabritiis
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Chemical language models for molecular design have the potential to find solutions to multi-parameter optimization problems in drug discovery via reinforcement learning (RL). A key requirement to achieve this is the capacity to “search” chemical space to identify all molecules of interest. Here, we propose a challenging new benchmark to discover dissimilar molecules that possess similar bioactivity, a common scenario in drug discovery, but a hard problem to optimize. We show that a population of RL agents can solve the benchmark, while a single agent cannot. We also find that cooperative strategies are not significantly better than independent agents. Moreover, the performance on the benchmark scales log-linearly with the number of independent agents, showing a test-time training scaling law for chemical language models.
[LG-29] A Theoretical Justification for Asymmetric Actor-Critic Algorithms
Link: https://arxiv.org/abs/2501.19116
Authors: Gaspard Lambrechts, Damien Ernst, Aditya Mahajan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 7 pages, 29 pages total
Abstract:In reinforcement learning for partially observable environments, many successful algorithms were developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates an error term arising from aliasing in the agent state.
[LG-30] Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected
Link: https://arxiv.org/abs/2501.19107
Authors: Yingtao Zhang, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This study aims to extend current knowledge on the application of brain-inspired network science principles to training artificial neural networks (ANNs) with sparse connectivity. Dynamic sparse training (DST) can reduce the computational demands of ANNs, but struggles to maintain peak performance at high sparsity levels. Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages gradient-free, topology-driven link regrowth, which has shown an advantage in the ultra-sparse regime (1% connectivity or lower) across various tasks compared to fully connected networks. Yet CHT suffers from two main drawbacks: (i) its time complexity is O(Nd^3), where N is the number of nodes in the network and d is the node degree, so it can be applied only to ultra-sparse networks; (ii) it selects the top link prediction scores, which is inappropriate in the early training epochs, when the network contains unreliable connections. We propose a GPU-friendly approximation of the CH link predictor, which reduces the computational complexity to O(N^3), enabling a fast implementation of CHT in large-scale models. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. To improve performance, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that, using 1% of connections, CHTs outperforms fully connected networks in MLP on visual classification tasks, compressing some networks to 30% of the nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Using 30% of the connections, CHTss achieves superior performance compared to other dynamic sparse training methods in language modeling, and it surpasses the fully connected counterpart in zero-shot evaluations.
[LG-31] Relating Misfit to Gain in Weak-to-Strong Generalization Beyond the Squared Loss
Link: https://arxiv.org/abs/2501.19105
Authors: Abhijeet Mulgund, Chirag Pabbaraju
Categories: Machine Learning (cs.LG); Probability (math.PR)
Comments: 22 pages, 4 figures
Abstract:The paradigm of weak-to-strong generalization constitutes the training of a strong AI model on data labeled by a weak AI model, with the goal that the strong model nevertheless outperforms its weak supervisor on the target task of interest. For the setting of real-valued regression with the squared loss, recent work quantitatively characterizes the gain in performance of the strong model over the weak model in terms of the misfit between the strong and weak model. We generalize such a characterization to learning tasks whose loss functions correspond to arbitrary Bregman divergences when the strong class is convex. This extends the misfit-based characterization of performance gain in weak-to-strong generalization to classification tasks, as the cross-entropy loss can be expressed in terms of a Bregman divergence. In most practical scenarios, however, the strong model class may not be convex. We therefore weaken this assumption and study weak-to-strong generalization for convex combinations of k strong models in the strong class, in the concrete setting of classification. This allows us to obtain a similar misfit-based characterization of performance gain, up to an additional error term that vanishes as k gets large. Our theoretical findings are supported by thorough experiments on synthetic as well as real-world datasets.
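The remark that cross-entropy can be expressed through a Bregman divergence can be checked numerically. The sketch below is an illustration written for this listing, not code from the paper: it uses the standard definition D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q> with phi chosen as the negative entropy, and compares the result with the KL divergence on two arbitrary probability vectors. All function and variable names are hypothetical.

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad_phi(q), p - q>."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

def neg_entropy(p):
    """Negative Shannon entropy, a convex function on the probability simplex."""
    return np.sum(p * np.log(p))

def grad_neg_entropy(p):
    """Gradient of the negative entropy."""
    return np.log(p) + 1.0

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive vectors."""
    return np.sum(p * np.log(p / q))

# Two arbitrary probability vectors (illustrative values only).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

d_bregman = bregman(neg_entropy, grad_neg_entropy, p, q)
d_kl = kl(p, q)
print(d_bregman, d_kl)  # the two values agree up to floating-point error
```

Since cross-entropy equals KL(p || q) plus the entropy of p, and the entropy of p does not depend on the model output q, minimizing cross-entropy over q is the same as minimizing this Bregman divergence.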
[LG-32] Neural Collapse Beyond the Unconstrained Features Model: Landscape Dynamics and Generalization in the Mean-Field Regime
Link: https://arxiv.org/abs/2501.19104
Authors: Diyuan Wu, Marco Mondelli
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Neural Collapse is a phenomenon where the last-layer representations of a well-trained neural network converge to a highly structured geometry. In this paper, we focus on its first (and most basic) property, known as NC1: the within-class variability vanishes. While prior theoretical studies establish the occurrence of NC1 via the data-agnostic unconstrained features model, our work adopts a data-specific perspective, analyzing NC1 in a three-layer neural network, with the first two layers operating in the mean-field regime and followed by a linear layer. In particular, we establish a fundamental connection between NC1 and the loss landscape: we prove that points with small empirical loss and gradient norm (thus, close to being stationary) approximately satisfy NC1, and the closeness to NC1 is controlled by the residual loss and gradient norm. We then show that (i) gradient flow on the mean squared error converges to NC1 solutions with small empirical loss, and (ii) for well-separated data distributions, both NC1 and vanishing test loss are achieved simultaneously. This aligns with the empirical observation that NC1 emerges during training while models attain near-zero test error. Overall, our results demonstrate that NC1 arises from gradient training due to the properties of the loss landscape, and they show the co-occurrence of NC1 and small test error for certain data distributions.
[LG-33] Reinforcement Learning on Reconfigurable Hardware: Overcoming Material Variability in Laser Material Processing ICRA
Link: https://arxiv.org/abs/2501.19102
Authors: Giulio Masinelli, Chang Rajani, Patrik Hoffmann, Kilian Wasmer, David Atienza
Categories: Machine Learning (cs.LG)
Comments: Accepted for the 2025 IEEE International Conference on Robotics and Automation (ICRA), May 19-23, 2025, Atlanta, USA
Abstract:Ensuring consistent processing quality is challenging in laser processes due to varying material properties and surface conditions. Although some approaches have shown promise in solving this problem via automation, they often rely on predetermined targets or are limited to simulated environments. To address these shortcomings, we propose a novel real-time reinforcement learning approach for laser process control, implemented on a Field Programmable Gate Array to achieve real-time execution. Our experimental results from laser welding tests on stainless steel samples with a range of surface roughnesses validated the method’s ability to adapt autonomously, without relying on reward engineering or prior setup information. Specifically, the algorithm learned the correct power profile for each unique surface characteristic, demonstrating significant improvements over hand-engineered optimal constant power strategies – up to 23% better performance on rougher surfaces and 7% on mixed surfaces. This approach represents a significant advancement in automating and optimizing laser processes, with potential applications across multiple industries.
[LG-34] Unraveling Zeroth-Order Optimization through the Lens of Low-Dimensional Structured Perturbations
Link: https://arxiv.org/abs/2501.19099
Authors: Sihwan Park, Jihun Yun, SungYub Kim, Souvik Kundu, Eunho Yang
Categories: Machine Learning (cs.LG)
Comments: 35 pages, 5 figures
Abstract:Zeroth-order (ZO) optimization has emerged as a promising alternative to gradient-based backpropagation methods, particularly for black-box optimization and large language model (LLM) fine-tuning. However, ZO methods suffer from slow convergence due to high-variance stochastic gradient estimators. While structured perturbations, such as sparsity and low-rank constraints, have been explored to mitigate these issues, their effectiveness remains highly under-explored. In this work, we develop a unified theoretical framework that analyzes both the convergence and generalization properties of ZO optimization under structured perturbations. We show that high dimensionality is the primary bottleneck and introduce the notions of stable rank and effective overlap to explain how structured perturbations reduce gradient noise and accelerate convergence. Using the uniform stability under our framework, we then provide the first theoretical justification for why these perturbations enhance generalization. Additionally, through empirical analysis, we identify block coordinate descent (BCD) as an effective structured perturbation method. Extensive experiments show that, compared to existing alternatives, memory-efficient ZO (MeZO) with BCD (MeZO-BCD) can provide improved convergence with a wall-clock time per iteration that is up to 2.09 times faster, while yielding similar or better accuracy.
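To make the idea of a structured perturbation concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimate in which only one block of coordinates is perturbed per step. It is not the MeZO-BCD implementation: the function name `zo_gradient`, the toy quadratic objective, the round-robin block schedule, and the step sizes are all assumptions chosen for illustration.

```python
import numpy as np

def zo_gradient(f, x, rng, eps=1e-3, block=None):
    """Two-point zeroth-order estimate of the gradient of f at x.

    Perturbs x along a random Gaussian direction u and returns
    (f(x + eps*u) - f(x - eps*u)) / (2*eps) * u. If `block` (an index
    array) is given, only those coordinates are perturbed, so the
    estimate is zero outside the block.
    """
    u = np.zeros_like(x)
    idx = np.arange(x.size) if block is None else block
    u[idx] = rng.standard_normal(len(idx))
    scale = (f(x + eps * u) - f(x - eps * u)) / (2.0 * eps)
    return scale * u

def f(x):
    """Toy objective (assumption): a simple quadratic bowl."""
    return 0.5 * np.sum(x ** 2)

rng = np.random.default_rng(0)
x = np.ones(8)
blocks = np.array_split(np.arange(x.size), 4)  # four blocks of two coordinates
f_start = f(x)
for step in range(400):
    # Round-robin block selection; only function values of f are used.
    g = zo_gradient(f, x, rng, block=blocks[step % len(blocks)])
    x = x - 0.1 * g
f_end = f(x)
print(f_start, f_end)
```

Each step needs two function evaluations and no backpropagation, and the perturbation lives in a 2-dimensional block rather than the full 8-dimensional space, which is the kind of dimensionality reduction the abstract argues lowers estimator noise.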
[LG-35] FL-APU: A Software Architecture to Ease Practical Implementation of Cross-Silo Federated Learning
Link: https://arxiv.org/abs/2501.19091
Authors: F. Stricker, J. A. Peregrina, D. Bermbach, C. Zirpins
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Federated Learning (FL) is an upcoming technology that is increasingly applied in real-world applications. Early applications focused on cross-device scenarios, where many participants with limited resources train machine learning (ML) models together, e.g., in the case of Google’s GBoard. In contrast, cross-silo scenarios have only a few participants, each with many resources, e.g., in the healthcare domain. Despite such early efforts, FL is still rarely used in practice and best practices are, hence, missing. For new applications, in our case inter-organizational cross-silo applications, overcoming this lack of role models is a significant challenge. In order to ease the use of FL in real-world cross-silo applications, we here propose a scenario-based architecture for the practical use of FL in the context of multiple companies collaborating to improve the quality of their ML models. The architecture emphasizes the collaboration between the participants and the FL server and extends basic interactions with domain-specific features. First, it combines governance with authentication, creating an environment where only trusted participants can join. Second, it offers traceability of governance decisions and tracking of training processes, which are also crucial in a production environment. Beyond presenting the architectural design, we analyze requirements for the real-world use of FL and evaluate the architecture with a scenario-based analysis method.
Journal reference: 2024 2nd International Conference on Federated Learning Technologies and Applications (FLTA), Valencia, Spain, 2024, pp. 63-70. Related DOI: https://doi.org/10.1109/FLTA63145.2024.10839980
[LG-36] Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
Link: https://arxiv.org/abs/2501.19090
Authors: Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its tensor coherence and GPU compatibility across all densities. However, low-rank pruning has struggled to match the performance of semi-structured pruning, often doubling perplexity (PPL) at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that unsupervisedly learns a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving an additional 24.2% memory savings and 24.6% faster inference over low-rank layers at r/d = 0.5, thereby significantly enhancing performance at the same density. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility.
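The pivot-row idea described in this abstract can be illustrated on a small dense matrix. The sketch below is not the PIFA algorithm itself: it uses a simple greedy selection of linearly independent rows followed by a least-squares solve, and the function name `pivot_factorize` and the 8 x 8 rank-4 test matrix are assumptions made for the example.

```python
import numpy as np

def pivot_factorize(W, rank, tol=1e-10):
    """Greedily pick `rank` linearly independent rows of W (the pivot rows)
    and express every other row as a linear combination of them."""
    residual = W.astype(float).copy()
    pivots = []
    for _ in range(rank):
        norms = np.linalg.norm(residual, axis=1)
        i = int(np.argmax(norms))
        if norms[i] < tol:
            break  # W has lower rank than requested
        pivots.append(i)
        direction = residual[i] / norms[i]
        # Remove the component along the chosen row from every row.
        residual -= np.outer(residual @ direction, direction)
    pivots = np.array(sorted(pivots))
    others = np.setdiff1d(np.arange(W.shape[0]), pivots)
    # Solve coeffs @ W[pivots] = W[others] in the least-squares sense.
    coeffs = np.linalg.lstsq(W[pivots].T, W[others].T, rcond=None)[0].T
    return pivots, others, coeffs

rng = np.random.default_rng(0)
d, r = 8, 4
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-r matrix

pivots, others, coeffs = pivot_factorize(W, r)
W_rebuilt = np.empty_like(W)
W_rebuilt[pivots] = W[pivots]
W_rebuilt[others] = coeffs @ W[pivots]

stored = W[pivots].size + coeffs.size   # r*d + (d - r)*r numbers
low_rank_stored = d * r + r * d         # two factors of a plain low-rank layer
print(stored, low_rank_stored)
```

In this toy setting with r/d = 0.5 the pivot form stores 48 numbers against 64 for the two plain low-rank factors, a 25% reduction in stored values, which is in the same range as the memory saving quoted in the abstract.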
[LG-37] Understanding Oversmoothing in GNNs as Consensus in Opinion Dynamics
Link: https://arxiv.org/abs/2501.19089
Authors: Keqin Wang, Yulong Yang, Ishan Saha, Christine Allen-Blanchette
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 3 figures
Abstract:In contrast to classes of neural networks where the learned representations become increasingly expressive with network depth, the learned representations in graph neural networks (GNNs) tend to become increasingly similar. This phenomenon, known as oversmoothing, is characterized by learned representations that cannot be reliably differentiated, leading to reduced predictive performance. In this paper, we propose an analogy between oversmoothing in GNNs and consensus or agreement in opinion dynamics. Through this analogy, we show that the message passing structure of recent continuous-depth GNNs is equivalent to a special case of opinion dynamics (i.e., linear consensus models) which has been theoretically proven to converge to consensus (i.e., oversmoothing) for all inputs. Using the understanding developed through this analogy, we design a new continuous-depth GNN model based on nonlinear opinion dynamics and prove that our model, which we call behavior-inspired message passing neural network (BIMP), circumvents oversmoothing for general inputs. Through extensive experiments, we show that BIMP is robust to oversmoothing and adversarial attack, and consistently outperforms competitive baselines on numerous benchmarks.
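The link between linear consensus and oversmoothing can be seen in a few lines. The sketch below simulates the standard linear consensus dynamics dX/dt = -L X on a small graph with forward-Euler steps; the path graph, the feature dimension, and the step size are assumptions chosen for illustration, and this is the generic linear model, not the BIMP architecture.

```python
import numpy as np

# Undirected path graph on 5 nodes (an arbitrary example graph).
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # one 3-dimensional feature per node
mean_start = X.mean(axis=0)
# Largest per-feature gap between any two nodes.
spread_start = (X.max(axis=0) - X.min(axis=0)).max()

dt = 0.1                                 # small enough for a stable Euler scheme here
for _ in range(2000):                    # forward-Euler steps of dX/dt = -L X
    X = X - dt * (L @ X)

spread_end = (X.max(axis=0) - X.min(axis=0)).max()
print(spread_start, spread_end)
```

All node features collapse onto the initial average, so after enough "depth" the nodes can no longer be told apart, which is the behavior the abstract identifies with oversmoothing.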
[LG-38] A Bias-Correction Decentralized Stochastic Gradient Algorithm with Momentum Acceleration
Link: https://arxiv.org/abs/2501.19082
Authors: Yuchen Hu, Xi Chen, Weidong Liu, Xiaojun Mao
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:Distributed stochastic optimization algorithms can handle large-scale data simultaneously and accelerate model training. However, the sparsity of distributed networks and the heterogeneity of data limit these advantages. This paper proposes a momentum-accelerated distributed stochastic gradient algorithm, referred to as Exact-Diffusion with Momentum (EDM), which corrects the bias caused by data heterogeneity and incorporates the momentum method commonly used in deep learning to accelerate convergence. We theoretically demonstrate that this algorithm converges to a neighborhood of the optimum at a sublinear rate that does not depend on data heterogeneity when applied to non-convex objective functions, and at a linear rate under the Polyak-Łojasiewicz condition (a weaker assumption than $\mu$-strong convexity). Finally, we evaluate the performance of the proposed algorithm by simulation, comparing it with a range of existing decentralized optimization algorithms to demonstrate its effectiveness in addressing data heterogeneity and network sparsity.
[LG-39] Differentially Private Policy Gradient
Link: https://arxiv.org/abs/2501.19080
Authors: Alexandre Rio, Merwan Barlier, Igor Colin
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Motivated by the increasing deployment of reinforcement learning in the real world, involving a large consumption of personal data, we introduce a differentially private (DP) policy gradient algorithm. We show that, in this setting, the introduction of Differential Privacy can be reduced to the computation of appropriate trust regions, thus avoiding the sacrifice of theoretical properties of the DP-less methods. Therefore, we show that it is possible to find the right trade-off between privacy noise and trust-region size to obtain a performant differentially private policy gradient algorithm. We then outline its performance empirically on various benchmarks. Our results and the complexity of the tasks addressed represent a significant improvement over existing DP algorithms in online RL.
[LG-40] Temperature-Annealed Boltzmann Generators
Link: https://arxiv.org/abs/2501.19077
Authors: Henrik Schopmans, Pascal Friederich
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a longstanding challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collapse and often do not learn to sample the full configurational space. Here, we present temperature-annealed Boltzmann generators (TA-BG) to address this challenge. First, we demonstrate that training a normalizing flow with the reverse Kullback-Leibler divergence at high temperatures is possible without mode collapse. Furthermore, we introduce a reweighting-based training objective to anneal the distribution to lower target temperatures. We apply this methodology to three molecular systems of increasing complexity and, compared to the baseline, achieve better results in almost all metrics while requiring up to three times fewer target energy evaluations. For the largest system, our approach is the only method that accurately resolves the metastable states of the system.
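The step from a high temperature to a lower one rests on importance reweighting between Boltzmann distributions. The sketch below shows only that generic reweighting on a one-dimensional harmonic potential, where the exact answer is known; it is not the TA-BG training objective, and the function name `reweight`, the potential, and the temperatures are assumptions for illustration (units with k_B = 1).

```python
import numpy as np

def reweight(energies, t_source, t_target):
    """Self-normalized importance weights that move samples drawn from
    exp(-E / t_source) towards exp(-E / t_target)."""
    log_w = -(1.0 / t_target - 1.0 / t_source) * energies
    log_w -= log_w.max()                  # stabilize the exponential
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(0)
t_high, t_low = 2.0, 1.0
# For E(x) = x^2 / 2 the Boltzmann distribution at temperature T is N(0, T).
x = rng.normal(0.0, np.sqrt(t_high), size=200_000)
w = reweight(0.5 * x ** 2, t_high, t_low)

var_low = np.sum(w * x ** 2)              # weighted variance, expected near t_low
ess = 1.0 / np.sum(w ** 2)                # effective sample size
print(var_low, ess)
```

The effective sample size drops as the gap between the two temperatures grows, which is one reason to lower the temperature in several small steps rather than in a single jump.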
[LG-41] Pareto-frontier Entropy Search with Variational Lower Bound Maximization
Link: https://arxiv.org/abs/2501.19073
Authors: Masanori Ishikura, Masayuki Karasuyama
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:This study considers multi-objective Bayesian optimization (MOBO) through the information gain of the Pareto-frontier. To calculate the information gain, a predictive distribution conditioned on the Pareto-frontier plays a key role, which is defined as a distribution truncated by the Pareto-frontier. However, it is usually impossible to obtain the entire Pareto-frontier in a continuous domain, and therefore, the complete truncation cannot be known. We consider an approximation of the truncated distribution by using a mixture distribution consisting of two possible approximate truncations obtainable from a subset of the Pareto-frontier, which we call over- and under-truncation. Since the optimal balance of the mixture is unknown beforehand, we propose optimizing the balancing coefficient through the variational lower bound maximization framework, by which the approximation error of the information gain can be minimized. Our empirical evaluation demonstrates the effectiveness of the proposed method, particularly when the number of objective functions is large.
[LG-42] SpikingSoft: A Spiking Neuron Controller for Bio-inspired Locomotion with Soft Snake Robots
Link: https://arxiv.org/abs/2501.19072
Authors: Chuhan Zhang, Cong Wang, Wei Pan, Cosimo Della Santina
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8th IEEE-RAS International Conference on Soft Robotics
Abstract:Inspired by the dynamic coupling of moto-neurons and physical elasticity in animals, this work explores the possibility of generating locomotion gaits by utilizing physical oscillations in a soft snake by means of a low-level spiking neural mechanism. To achieve this goal, we introduce the Double Threshold Spiking neuron model with adjustable thresholds to generate varied output patterns. This neuron model can excite the natural dynamics of soft robotic snakes, and it enables distinct movements, such as turning or moving forward, by simply altering the neural thresholds. Finally, we demonstrate that our approach, termed SpikingSoft, naturally pairs and integrates with reinforcement learning. The high-level agent only needs to adjust the two thresholds to generate complex movement patterns, thus strongly simplifying the learning of reactive locomotion. Simulation results demonstrate that the proposed architecture significantly enhances the performance of the soft snake robot, enabling it to achieve target objectives with a 21.6% increase in success rate, a 29% reduction in time to reach the target, and smoother movements compared to the vanilla reinforcement learning controllers or Central Pattern Generator controller acting in torque space.
[LG-43] Deep Multi-Task Learning Has Low Amortized Intrinsic Dimensionality
Link: https://arxiv.org/abs/2501.19067
Authors: Hossein Zakerinia, Dorsa Ghobadi, Christoph H. Lampert
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Deep learning methods are known to generalize well from training to future data, even in an overparametrized regime, where they could easily overfit. One explanation for this phenomenon is that even when their ambient dimensionality (i.e. the number of parameters) is large, the models’ intrinsic dimensionality is small, i.e. their learning takes place in a small subspace of all possible weight configurations. In this work, we confirm this phenomenon in the setting of deep multi-task learning. We introduce a method to parametrize multi-task networks directly in the low-dimensional space, facilitated by the use of random expansion techniques. We then show that high-accuracy multi-task solutions can be found with much smaller intrinsic dimensionality (fewer free parameters) than what single-task learning requires. Subsequently, we show that the low-dimensional representations in combination with weight compression and PAC-Bayesian reasoning lead to the first non-vacuous generalization bounds for deep multi-task networks.
[LG-44] Optimizing Job Allocation using Reinforcement Learning with Graph Neural Networks
Link: https://arxiv.org/abs/2501.19063
Authors: Lars C.P.M. Quaedvlieg
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 5 figures
Abstract:Efficient job allocation in complex scheduling problems poses significant challenges in real-world applications. In this report, we propose a novel approach that leverages the power of Reinforcement Learning (RL) and Graph Neural Networks (GNNs) to tackle the Job Allocation Problem (JAP). The JAP involves allocating a maximum set of jobs to available resources while considering several constraints. Our approach enables learning of adaptive policies through trial-and-error interactions with the environment while exploiting the graph-structured data of the problem. By leveraging RL, we eliminate the need for manual annotation, a major bottleneck in supervised learning approaches. Experimental evaluations on synthetic and real-world data demonstrate the effectiveness and generalizability of our proposed approach, outperforming baseline algorithms and showcasing its potential for optimizing job allocation in complex scheduling problems.
[LG-45] TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs
Link: https://arxiv.org/abs/2501.19057
Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout the training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider the two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimensions. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the training cost. TeZO can also be easily extended to the Adam variant while consuming less memory than MeZO-SGD and requiring only about 35% of the memory of MeZO-Adam. Both comprehensive theoretical analysis and extensive experiments have validated its efficiency, achieving SOTA-comparable results with lower time and memory overhead.
[LG-46] Norm-Bounded Low-Rank Adaptation
Link: https://arxiv.org/abs/2501.19050
Authors: Ruigang Wang, Krishnamurthy Dvijotham, Ian R. Manchester
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we propose norm-bounded low-rank adaptation (NB-LoRA) for parameter-efficient fine-tuning. We introduce two parameterizations that allow explicit bounds on each singular value of the weight adaptation matrix, which can therefore satisfy any prescribed unitarily invariant norm bound, including the Schatten norms (e.g., nuclear, Frobenius, spectral norm). The proposed parameterizations are unconstrained and complete, i.e. they cover all matrices satisfying the prescribed rank and norm constraints. Experiments on vision fine-tuning benchmarks show that the proposed approach can achieve good adaptation performance while avoiding catastrophic forgetting, and that it substantially improves robustness to a wide range of hyper-parameters, including adaptation rank, learning rate and number of training epochs. We also explore applications in privacy-preserving model merging and low-rank matrix completion.
[LG-47] Towards the Worst-case Robustness of Large Language Models
Link: https://arxiv.org/abs/2501.19040
Authors: Huanran Chen, Yinpeng Dong, Zeming Wei, Hang Su, Jun Zhu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent studies have revealed the vulnerability of Large Language Models (LLMs) to adversarial attacks, where the adversary crafts specific input sequences to induce harmful, violent, private, or incorrect outputs. Although various defenses have been proposed, they have not been evaluated by strong adaptive attacks, leaving the worst-case robustness of LLMs still intractable. By developing a stronger white-box attack, our evaluation results indicate that most typical defenses achieve nearly 0% robustness. To solve this, we propose DiffTextPure, a general defense that diffuses the (adversarial) input prompt using any pre-defined smoothing distribution, and purifies the diffused input using a pre-trained language model. Theoretically, we derive tight robustness lower bounds for all smoothing distributions using Fractal Knapsack or 0-1 Knapsack solvers. Under this framework, we certify the robustness of a specific case (smoothing LLMs using a uniform kernel) against any possible attack with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.
[LG-48] Error Slice Discovery via Manifold Compactness
Link: https://arxiv.org/abs/2501.19032
Authors: Han Yu, Jiashuo Liu, Hao Zou, Renzhe Xu, Yue He, Xingxuan Zhang, Peng Cui
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Despite the great performance of deep learning models in many areas, they still make mistakes and underperform on certain subsets of data, i.e. error slices. Given a trained model, it is important to identify its semantically coherent error slices that are easy to interpret, which is referred to as the error slice discovery problem. However, there is no proper metric of slice coherence without relying on extra information like predefined slice labels. Current evaluation of slice coherence requires access to predefined slices formulated by metadata like attributes or subclasses. Its validity heavily relies on the quality and abundance of metadata, where some possible patterns could be ignored. Besides, current algorithms cannot directly incorporate the constraint of coherence into their optimization objective due to the absence of an explicit coherence metric, which could potentially hinder their effectiveness. In this paper, we propose manifold compactness, a coherence metric without reliance on extra information by incorporating the data geometry property into its design, and experiments on typical datasets empirically validate the rationality of the metric. Then we develop Manifold Compactness based error Slice Discovery (MCSD), a novel algorithm that directly treats risk and coherence as the optimization objective, and is flexible to be applied to models of various tasks. Extensive experiments on the benchmark and case studies on other typical datasets demonstrate the superiority of MCSD.
[LG-49] True Online TD-Replan(lambda) Achieving Planning through Replaying
Link: https://arxiv.org/abs/2501.19027
Authors: Abdulrahman Altahhan
Categories: Machine Learning (cs.LG)
Abstract:In this paper, we develop a new planning method that extends the capabilities of the true online TD to allow an agent to efficiently replay all or part of its past experience, online in the sequence that they appear with, either in each step or sparsely according to the usual $\lambda$ parameter. In this new method that we call True Online TD-Replan($\lambda$), the $\lambda$ parameter plays a new role in specifying the density of the replay process in addition to the usual role of specifying the depth of the target’s updates. We demonstrate that, for problems that benefit from experience replay, our new method outperforms true online TD($\lambda$), albeit quadratic in complexity due to its replay capabilities. In addition, we demonstrate that our method outperforms other methods with similar quadratic complexity such as Dyna Planning and TD($\lambda$)-Replan algorithms. We test our method on two benchmarking environments, a random walk problem that uses simple binary features and a myoelectric control domain that uses both simple sEMG features and deeply extracted features to showcase its capabilities.
[LG-50] Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data
Link: https://arxiv.org/abs/2501.18990
Authors: Xinshuai Dong, Ignavier Ng, Boyang Sun, Haoyue Dai, Guang-Yuan Hao, Shunxing Fan, Peter Spirtes, Yumou Qiu, Kun Zhang
Categories: Machine Learning (cs.LG)
Abstract:Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.
[LG-51] Meta-learning of shared linear representations beyond well-specified linear regression
Link: https://arxiv.org/abs/2501.18975
Authors: Mathieu Even, Laurent Massoulié
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Motivated by multi-task and meta-learning approaches, we consider the problem of learning structure shared by tasks or users, such as shared low-rank representations or clustered structures. While all previous works focus on well-specified linear regression, we consider more general convex objectives, where the structural low-rank and cluster assumptions are expressed on the optima of each function. We show that under mild assumptions such as Hessian concentration and noise concentration at the optimum, rank and clustered regularized estimators recover such structure, provided the number of samples per task and the number of tasks are large enough. We then study the problem of recovering the subspace in which all the solutions lie, in the setting where there is only a single sample per task: we show that in that case, the rank-constrained estimator can recover the subspace, but that the number of tasks needs to scale exponentially large with the dimension of the subspace. Finally, we provide a polynomial-time algorithm via nuclear norm constraints for learning a shared linear representation in the context of convex learning objectives.
[LG-52] BCAT: A Block Causal Transformer for PDE Foundation Models for Fluid Dynamics
Link: https://arxiv.org/abs/2501.18972
Authors: Yuxuan Liu, Jingmin Sun, Hayden Schaeffer
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract:We introduce BCAT, a PDE foundation model designed for autoregressive prediction of solutions to two dimensional fluid dynamics problems. Our approach uses a block causal transformer architecture to model next frame predictions, leveraging previous frames as contextual priors rather than relying solely on sub-frames or pixel-based inputs commonly used in image generation methods. This block causal framework more effectively captures the spatial dependencies inherent in nonlinear spatiotemporal dynamics and physical phenomena. In an ablation study, next frame prediction demonstrated a 2.9x accuracy improvement over next token prediction. BCAT is trained on a diverse range of fluid dynamics datasets, including incompressible and compressible Navier-Stokes equations across various geometries and parameter regimes, as well as the shallow-water equations. The model’s performance was evaluated on 6 distinct downstream prediction tasks and tested on about 8K trajectories to measure robustness on a variety of fluid dynamics simulations. BCAT achieved an average relative error of 1.92% across all evaluation tasks, outperforming prior approaches on standard benchmarks.
[LG-53] The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Link: https://arxiv.org/abs/2501.18965
Authors: Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract:We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with optimal learning-rate, and (ii) transferring the optimal learning-rate across schedules.
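To make the schedule shape concrete, here is a minimal Python sketch of a constant learning rate followed by a linear cooldown. The function name, the `cooldown_frac` parameter, and the choice to decay toward zero at the end of training are illustrative assumptions, not details taken from the paper.

```python
def constant_with_linear_cooldown(step, total_steps, base_lr, cooldown_frac=0.2):
    """Learning rate at a 0-indexed `step` for a constant-then-linear-cooldown schedule.

    `cooldown_frac` (hypothetical name) is the fraction of training spent cooling down.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    cooldown_start = int(round(total_steps * (1.0 - cooldown_frac)))
    if step < cooldown_start:
        # Constant phase: keep the base learning rate.
        return base_lr
    # Cooldown phase: decay linearly so the rate would reach 0 at total_steps.
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - cooldown_start)


if __name__ == "__main__":
    for s in (0, 79, 80, 90, 99):
        print(s, constant_with_linear_cooldown(s, total_steps=100, base_lr=1.0))
```

With `total_steps=100` and `cooldown_frac=0.2`, the rate stays at `base_lr` through step 80 and then shrinks linearly over the last 20 steps.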
[LG-54] Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping
Link: https://arxiv.org/abs/2501.18962
Authors: Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li
Categories: Machine Learning (cs.LG)
Abstract:Modern foundation models often undergo iterative “bootstrapping” in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model’s performance improves, raising a crucial question: how should the total budget on generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework to analyze budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies, particularly exponential growth policies, exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.
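The difference between a constant policy and an exponential growth policy can be illustrated with a few lines of Python. This is only a sketch of how a fixed total budget might be split across iterations; the function name and the `growth` parameter are assumptions made for illustration, and the paper's actual policies may be parameterized differently.

```python
def allocate_budget(total_budget, num_iterations, growth=1.0):
    """Split `total_budget` across iterations with per-iteration weights growth**t.

    growth == 1.0 gives a constant policy; growth > 1.0 gives exponential growth.
    """
    if num_iterations <= 0:
        raise ValueError("num_iterations must be positive")
    weights = [growth ** t for t in range(num_iterations)]
    scale = total_budget / sum(weights)
    # Rescale so the per-iteration budgets sum to the total budget.
    return [w * scale for w in weights]


if __name__ == "__main__":
    print(allocate_budget(70, 3))              # constant policy
    print(allocate_budget(70, 3, growth=2.0))  # exponential growth policy
```

Under the exponential policy most of the budget is spent in later iterations, when the generating model is presumably stronger.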
[LG-55] Solving Inverse Problem for Multi-armed Bandits via Convex Optimization
Link: https://arxiv.org/abs/2501.18945
Authors: Hao Zhu, Joschka Boedecker
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC); Neurons and Cognition (q-bio.NC)
Abstract:We consider the inverse problem of multi-armed bandits (IMAB), which are widely used in neuroscience and psychology research for behavior modelling. We first show that the IMAB problem is not convex in general, but can be relaxed to a convex problem via variable transformation. Based on this result, we propose a two-step sequential heuristic for (approximately) solving the IMAB problem. We discuss a condition under which our method provides a global solution to the IMAB problem with a certificate, as well as approximations to further save computing time. Numerical experiments indicate that our heuristic method is more robust than directly solving the IMAB problem via repeated local optimization, and can achieve the performance of Monte Carlo methods within a significantly decreased running time. We provide the implementation of our method based on CVXPY, which allows straightforward application by users not well versed in convex optimization.
[LG-56] O-MAPL: Offline Multi-agent Preference Learning
Link: https://arxiv.org/abs/2501.18944
Authors: TheViet Bui, Tien Mai, Hong Thanh Nguyen
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract:Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL), where large joint state-action spaces and complex inter-agent interactions complicate the task. While prior single-agent studies have explored recovering reward functions and policies from human preferences, similar work in MARL is limited. Existing methods often involve separate stages of supervised reward learning and MARL algorithms, leading to unstable training. In this work, we introduce a novel end-to-end preference-based learning framework for cooperative MARL, leveraging the underlying connection between reward functions and soft Q-functions. Our approach uses a carefully-designed multi-agent value decomposition strategy to improve training efficiency. Extensive experiments on SMAC and MAMuJoCo benchmarks show that our algorithm outperforms existing methods across various tasks.
[LG-57] TabFSBench: Tabular Benchmark for Feature Shifts in Open Environment
Link: https://arxiv.org/abs/2501.18935
Authors: Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Categories: Machine Learning (cs.LG)
Abstract:Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution and feature shifts occur, leading to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates the impact of four distinct feature-shift scenarios on four tabular model categories across various datasets and assesses the performance of large language models (LLMs) and tabular LLMs in the tabular benchmark for the first time. Our study demonstrates three main observations: (1) most tabular models have limited applicability in feature-shift scenarios; (2) the shifted feature set importance has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research directions are also explored for each observation. TabFSBench is released for public access and can be used with a few lines of Python code at this https URL.
[LG-58] LLM Program Optimization via Retrieval Augmented Search
Link: https://arxiv.org/abs/2501.18916
Authors: Sagnik Anupam, Alexander Shypula, Osbert Bastani
Categories: Machine Learning (cs.LG)
Abstract:With the advent of large language models (LLMs), there has been a great deal of interest in applying them to solve difficult programming tasks. Recent work has demonstrated their potential at program optimization, a key challenge in programming languages research. We propose a blackbox adaptation method called Retrieval Augmented Search (RAS) that performs beam search over candidate optimizations; at each step, it retrieves in-context examples from a given training dataset of slow-fast program pairs to guide the LLM. Critically, we find that performing contextual retrieval based on an LLM-generated natural language description significantly outperforms retrieval based on the source code. In addition, we propose a method called AEGIS for improving interpretability by decomposing training examples into “atomic edits” that are significantly more incremental in nature. We show that RAS performs $1.8\times$ better than prior state-of-the-art blackbox adaptation strategies, and that AEGIS performs $1.37\times$ better while performing significantly smaller edits.
[LG-59] An Invitation to Neuroalgebraic Geometry
Link: https://arxiv.org/abs/2501.18915
Authors: Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathlén Kohn
Categories: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
Abstract:In this expository work, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, such as neural networks with polynomial activations, whose associated function spaces are semi-algebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.
[LG-60] Scaling Laws for Differentially Private Language Models
Link: https://arxiv.org/abs/2501.18914
Authors: Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, George Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyper-parameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility tradeoffs and the optimal training configurations in many settings.
[LG-61] A machine learning approach for Premature Coronary Artery Disease Diagnosis according to Different Ethnicities in Iran
Link: https://arxiv.org/abs/2501.18893
Authors: Mohamad Roshanzamir, Roohallah Alizadehsani, Ehsan Zarepur, Noushin Mohammadifard, Fatemeh Nouri, Mahdi Roshanzamir, Alireza Khosravi, Fereidoon Nouhi, Nizal Sarrafzadegan
Categories: Machine Learning (cs.LG)
Abstract:Premature coronary artery disease (PCAD) refers to the early onset of the disease, usually before the age of 55 for men and 65 for women. Coronary Artery Disease (CAD) develops when coronary arteries, the major blood vessels supplying the heart with blood, oxygen, and nutrients, become clogged or diseased. This is often due to many risk factors, including lifestyle and cardiometabolic ones, but few studies were done on ethnicity as one of these risk factors, especially in PCAD. In this study, we tested the rank of ethnicity among the major risk factors of PCAD, including age, gender, body mass index (BMI), visceral obesity presented as waist circumference (WC), diabetes mellitus (DM), high blood pressure (HBP), high low-density lipoprotein cholesterol (LDL-C), and smoking in a large national sample of patients with PCAD from different ethnicities. All patients who met the age criteria underwent coronary angiography to confirm CAD diagnosis. The weight of ethnicity was compared to the other eight features using feature weighting algorithms in PCAD diagnosis. In addition, we conducted an experiment where we ran predictive models (classification algorithms) to predict PCAD. We compared the performance of these models under two conditions: we trained the classification algorithms, including or excluding ethnicity. This study analyzed various factors to determine their predictive power influencing PCAD prediction. Among these factors, gender and age were the most significant predictors, with ethnicity being the third most important. The results also showed that if ethnicity is used as one of the input risk factors for classification algorithms, it can improve their efficiency. Our results show that ethnicity ranks as an influential factor in predicting PCAD. Therefore, it needs to be addressed in the PCAD diagnostic and preventive measures.
[LG-62] CAAT-EHR: Cross-Attentional Autoregressive Transformer for Multimodal Electronic Health Record Embeddings
Link: https://arxiv.org/abs/2501.18891
Authors: Mohammad Al Olaimat, Serdar Bozdag
Categories: Machine Learning (cs.LG)
Abstract:Electronic health records (EHRs) provide a comprehensive source of longitudinal patient data, encompassing structured modalities such as laboratory results, imaging data, and vital signs, and unstructured clinical notes. These datasets, after necessary preprocessing to clean and format the data for analysis, often remain in their raw EHR form, representing numerical or categorical values without further transformation into task-agnostic embeddings. While such raw EHR data enables predictive modeling, its reliance on manual feature engineering or downstream task-specific optimization limits its utility for general-purpose applications. Deep learning (DL) techniques, such as recurrent neural networks (RNNs) and Transformers, have facilitated predictive tasks like disease progression and diagnosis prediction. However, these methods often struggle to fully exploit the temporal and multimodal dependencies inherent in EHR data due to their reliance on pre-processed but untransformed raw EHR inputs. In this study, we introduce CAAT-EHR, a novel architecture designed to bridge this gap by generating robust, task-agnostic longitudinal embeddings from raw EHR data. CAAT-EHR leverages self- and cross-attention mechanisms in its encoder to integrate temporal and contextual relationships across multiple modalities, transforming the data into enriched embeddings that capture complex dependencies. An autoregressive decoder complements the encoder by predicting future time points data during pre-training, ensuring that the resulting embeddings maintain temporal consistency and alignment. CAAT-EHR eliminates the need for manual feature engineering and enables seamless transferability across diverse downstream tasks. Extensive evaluations on benchmark datasets, demonstrate the superiority of CAAT-EHR-generated embeddings over pre-processed raw EHR data and other baseline approaches.
[LG-63] Can We Predict the Effect of Prompts?
Link: https://arxiv.org/abs/2501.18883
Authors: Jae Yong Lee, Sungmin Kang, Shin Yoo
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures
Abstract:Large Language Models (LLMs) are machine learning models that have seen widespread adoption due to their capability of handling previously difficult tasks. LLMs, due to their training, are sensitive to how exactly a question is presented, also known as prompting. However, prompting well is challenging, as it has been difficult to uncover principles behind prompting – generally, trial-and-error is the most common way of improving prompts, despite its significant computational cost. In this context, we argue it would be useful to perform ‘predictive prompt analysis’, in which an automated technique would perform a quick analysis of a prompt and predict how the LLM would react to it, relative to a goal provided by the user. As a demonstration of the concept, we present Syntactic Prevalence Analyzer (SPA), a predictive prompt analysis approach based on sparse autoencoders (SAEs). SPA accurately predicted how often an LLM would generate target syntactic structures during code synthesis, with up to 0.994 Pearson correlation between the predicted and actual prevalence of the target structure. At the same time, SPA requires only 0.4% of the time it takes to run the LLM on a benchmark. As LLMs are increasingly used during and integrated into modern software development, our proposed predictive prompt analysis concept has the potential to significantly ease the use of LLMs for both practitioners and researchers.
[LG-64] Understanding Generalization in Physics Informed Models through Affine Variety Dimensions
Link: https://arxiv.org/abs/2501.18879
Authors: Takeshi Koshizuka, Issei Sato
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Abstract:In recent years, physics-informed machine learning has gained significant attention for its ability to enhance statistical performance and sample efficiency by integrating physical structures into machine learning models. These structures, such as differential equations, conservation laws, and symmetries, serve as inductive biases that can improve the generalization capacity of the hybrid model. However, the mechanisms by which these physical structures enhance generalization capacity are not fully understood, limiting the ability to guarantee the performance of the models. In this study, we show that the generalization performance of linear regressors incorporating differential equation structures is determined by the dimension of the associated affine variety, rather than the number of parameters. This finding enables a unified analysis of various equations, including nonlinear ones. We introduce a method to approximate the dimension of the affine variety and provide experimental evidence to validate our theoretical insights.
[LG-65] Best Policy Learning from Trajectory Preference Feedback
Link: https://arxiv.org/abs/2501.18873
Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
Categories: Machine Learning (cs.LG)
Abstract:We address the problem of best policy identification in preference-based reinforcement learning (PbRL), where learning occurs from noisy binary preferences over trajectory pairs rather than explicit numerical rewards. This approach is useful for post-training optimization of generative AI models during multi-turn user interactions, where preference feedback is more robust than handcrafted reward models. In this setting, learning is driven by both an offline preference dataset – collected from a rater of unknown ‘competence’ – and online data collected with pure exploration. Since offline datasets may exhibit out-of-distribution (OOD) biases, principled online data collection is necessary. To address this, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling, that maintains independent posteriors over the true reward model and transition dynamics. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret of $\mathsf{PSPL}$. Since the exact algorithm can be computationally impractical, we also provide an approximate version that outperforms existing baselines.
[LG-66] Neural SDEs as a Unified Approach to Continuous-Domain Sequence Modeling
Link: https://arxiv.org/abs/2501.18871
Authors: Macheng Shen, Chen Cheng
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Inspired by the ubiquitous use of differential equations to model continuous dynamics across diverse scientific and engineering domains, we propose a novel and intuitive approach to continuous sequence modeling. Our method interprets time-series data as discrete samples from an underlying continuous dynamical system, and models its time evolution using a Neural Stochastic Differential Equation (Neural SDE), where both the flow (drift) and diffusion terms are parameterized by neural networks. We derive a principled maximum likelihood objective and a simulation-free scheme for efficient training of our Neural SDE model. We demonstrate the versatility of our approach through experiments on sequence modeling tasks across both embodied and generative AI. Notably, to the best of our knowledge, this is the first work to show that SDE-based continuous-time modeling also excels in such complex scenarios, and we hope that our work opens up new avenues for research of SDE models in high-dimensional and temporally intricate domains.
[LG-67] Continuous-Time Analysis of Federated Averaging
Link: https://arxiv.org/abs/2501.18870
Authors: Tom Overman, Diego Klabjan
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Comments: Under review
Abstract:Federated averaging (FedAvg) is a popular algorithm for horizontal federated learning (FL), where samples are gathered across different clients and are not shared with each other or a central server. Extensive convergence analysis of FedAvg exists for the discrete iteration setting, guaranteeing convergence for a range of loss functions and varying levels of data heterogeneity. We extend this analysis to the continuous-time setting where the global weights evolve according to a multivariate stochastic differential equation (SDE), which is the first time FedAvg has been studied from the continuous-time perspective. We use techniques from stochastic processes to establish convergence guarantees under different loss functions, some of which are more general than existing work in the discrete setting. We also provide conditions for which FedAvg updates to the server weights can be approximated as normal random variables. Finally, we use the continuous-time formulation to reveal generalization properties of FedAvg.
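For readers unfamiliar with the discrete-iteration setting that the paper generalizes, the following is a minimal Python sketch of one FedAvg round: each client runs a few local gradient steps from the shared weights, and the server averages the results weighted by local sample counts. All function names, the step size, and the number of local steps are illustrative assumptions; the paper's continuous-time SDE analysis is not reproduced here.

```python
def local_sgd(weights, grad_fn, lr, num_steps):
    """Run `num_steps` plain gradient steps on one client, starting from `weights`."""
    w = list(weights)
    for _ in range(num_steps):
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def fedavg_aggregate(client_weights, client_sizes):
    """Average client weight vectors, weighted by each client's sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(n * w[j] for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def fedavg_round(global_weights, client_grad_fns, client_sizes, lr=0.1, num_steps=5):
    """One communication round: local training on every client, then server averaging."""
    local = [local_sgd(global_weights, g, lr, num_steps) for g in client_grad_fns]
    return fedavg_aggregate(local, client_sizes)


if __name__ == "__main__":
    # Two clients with quadratic losses 0.5 * (w - c)^2 centered at c = 0 and c = 4.
    grads = [lambda w: [w[0] - 0.0], lambda w: [w[0] - 4.0]]
    print(fedavg_round([2.0], grads, client_sizes=[10, 10]))
```

Only weight vectors leave the clients in this sketch; the raw samples are never shared, which matches the horizontal federated learning setting described in the abstract.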
[LG-68] A Deep Spatio-Temporal Architecture for Dynamic Effective Connectivity Network Analysis Based on Dynamic Causal Discovery
Link: https://arxiv.org/abs/2501.18859
Authors: Faming Xu, Yiding Wang, Chen Qiao, Gang Qu, Vince D. Calhoun, Julia M. Stephen, Tony W. Wilson, Yu-Ping Wang
Categories: Machine Learning (cs.LG)
Abstract:Dynamic effective connectivity networks (dECNs) reveal the changing directed brain activity and the dynamic causal influences among brain regions, which facilitate the identification of individual differences and enhance the understanding of human brain. Although the existing causal discovery methods have shown promising results in effective connectivity network analysis, they often overlook the dynamics of causality, in addition to the incorporation of spatio-temporal information in brain activity data. To address these issues, we propose a deep spatio-temporal fusion architecture, which employs a dynamic causal deep encoder to incorporate spatio-temporal information into dynamic causality modeling, and a dynamic causal deep decoder to verify the discovered causality. The effectiveness of the proposed method is first illustrated with simulated data. Then, experimental results from Philadelphia Neurodevelopmental Cohort (PNC) demonstrate the superiority of the proposed method in inferring dECNs, which reveal the dynamic evolution of directed flow between brain regions. The analysis shows the difference of dECNs between young adults and children. Specifically, the directed brain functional networks transit from fluctuating undifferentiated systems to more stable specialized networks as one grows. This observation provides further evidence on the modularization and adaptation of brain networks during development, leading to higher cognitive abilities observed in young adults.
[LG-69] Trading Inference-Time Compute for Adversarial Robustness
Link: https://arxiv.org/abs/2501.18841
Authors: Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
[LG-70] Transfer Learning for Nonparametric Contextual Dynamic Pricing
Link: https://arxiv.org/abs/2501.18836
Authors: Fan Wang, Feiyu Jiang, Zifeng Zhao, Yi Yu
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Abstract:Dynamic pricing strategies are crucial for firms to maximize revenue by adjusting prices based on market conditions and customer characteristics. However, designing optimal pricing strategies becomes challenging when historical data are limited, as is often the case when launching new products or entering new markets. One promising approach to overcome this limitation is to leverage information from related products or markets to inform the focal pricing decisions. In this paper, we explore transfer learning for nonparametric contextual dynamic pricing under a covariate shift model, where the marginal distributions of covariates differ between source and target domains while the reward functions remain the same. We propose a novel Transfer Learning for Dynamic Pricing (TLDP) algorithm that can effectively leverage pre-collected data from a source domain to enhance pricing decisions in the target domain. The regret upper bound of TLDP is established under a simple Lipschitz condition on the reward function. To establish the optimality of TLDP, we further derive a matching minimax lower bound, which includes the target-only scenario as a special case and is presented for the first time in the literature. Extensive numerical experiments validate our approach, demonstrating its superiority over existing methods and highlighting its practical utility in real-world applications.
[LG-71] Transcoders Beat Sparse Autoencoders for Interpretability
Link: https://arxiv.org/abs/2501.18823
Authors: Gonçalo Paulo, Stepan Shabalin, Nora Belrose
Categories: Machine Learning (cs.LG)
Abstract:Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
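The distinction between the three architectures can be sketched in plain Python. The forward pass below is shared; what changes is the training target (the component's input for an SAE, its output for a transcoder) and whether a linear skip path from the input is added. The ReLU encoder, the parameter names, and the mean-squared-error loss are assumptions made for illustration and may differ from the paper's setup.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def forward(x, w_enc, b_enc, w_dec, b_dec, w_skip=None):
    """Encode x into non-negative latents, decode, and optionally add a skip path.

    With w_skip=None this is the SAE / transcoder forward pass; passing w_skip
    gives a skip transcoder (the bias of the affine skip is absorbed into b_dec).
    """
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    latents = [max(0.0, p) for p in pre]  # ReLU keeps latents non-negative
    out = [a + b for a, b in zip(matvec(w_dec, latents), b_dec)]
    if w_skip is not None:
        out = [o + s for o, s in zip(out, matvec(w_skip, x))]
    return latents, out


def mse(prediction, target):
    """Mean squared reconstruction error."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def sae_loss(x, params):
    """SAE objective: reconstruct the activation x itself."""
    _, out = forward(x, *params)
    return mse(out, x)


def transcoder_loss(x, y, params, w_skip=None):
    """Transcoder objective: predict the component's output y from its input x."""
    _, out = forward(x, *params, w_skip=w_skip)
    return mse(out, y)
```

A sparsity penalty on `latents` would normally be added to either loss; it is omitted here to keep the sketch short.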
[LG-72] Estimating the Probability of Sampling a Trained Neural Network at Random
链接: https://arxiv.org/abs/2501.18812
作者: Adam Scherlis,Nora Belrose
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present an algorithm for estimating the probability mass, under a Gaussian or uniform prior, of a region in neural network parameter space corresponding to a particular behavior, such as achieving test loss below some threshold. When the prior is uniform, this problem is equivalent to measuring the volume of a region. We show empirically and theoretically that existing algorithms for estimating volumes in parameter space underestimate the true volume by millions of orders of magnitude. We find that this error can be dramatically reduced, but not entirely eliminated, with an importance sampling method using gradient information that is already provided by popular optimizers. The negative logarithm of this probability can be interpreted as a measure of a network’s information content, in accordance with minimum description length (MDL) principles and rate-distortion theory. As expected, this quantity increases during language model training. We also find that badly-generalizing behavioral regions are smaller, and therefore less likely to be sampled at random, demonstrating an inductive bias towards well-generalizing functions.
[LG-73] Learning Hamiltonian Dynamics with Bayesian Data Assimilation
链接: https://arxiv.org/abs/2501.18808
作者: Taehyeun Kim,Tae-Geun Kim,Anouck Girard,Ilya Kolmanovsky
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 8 pages, 12 figures
Abstract:In this paper, we develop a neural network-based approach for time-series prediction in unknown Hamiltonian dynamical systems. Our approach leverages a surrogate model and learns the system dynamics using generalized coordinates (positions) and their conjugate momenta while preserving a constant Hamiltonian. To further enhance long-term prediction accuracy, we introduce an Autoregressive Hamiltonian Neural Network, which incorporates autoregressive prediction errors into the training objective. Additionally, we employ Bayesian data assimilation to refine predictions in real-time using online measurement data. Numerical experiments on a spring-mass system and highly elliptic orbits under gravitational perturbations demonstrate the effectiveness of the proposed method, highlighting its potential for accurate and robust long-term predictions.
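To make "preserving a constant Hamiltonian" concrete for the spring-mass system mentioned above, here is a small sketch using a classical leapfrog integrator. This is a swapped-in textbook technique, not the paper's neural network or its data assimilation step; the unit mass, unit spring constant, and step size are assumptions.

```python
def hamiltonian(q, p, m=1.0, k=1.0):
    """Total energy of a spring-mass system: kinetic plus potential."""
    return p * p / (2.0 * m) + 0.5 * k * q * q


def leapfrog(q, p, dt, steps, m=1.0, k=1.0):
    """Integrate dq/dt = p/m, dp/dt = -k*q with the leapfrog scheme."""
    for _ in range(steps):
        p -= 0.5 * dt * k * q  # half step in momentum
        q += dt * p / m        # full step in position
        p -= 0.5 * dt * k * q  # half step in momentum
    return q, p


q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, dt=0.01, steps=1000)
energy_drift = abs(hamiltonian(q1, p1) - hamiltonian(q0, p0))
```

With these settings the energy drift stays small over the whole trajectory, which is the property a Hamiltonian-preserving learned model is meant to reproduce.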
[LG-74] Deceptive Sequential Decision-Making via Regularized Policy Optimization
链接: https://arxiv.org/abs/2501.18803
作者: Yerin Kim,Alexander Benvenuti,Bo Chen,Mustafa Karabag,Abhishek Kulkarni,Nathaniel D. Bastian,Ufuk Topcu,Matthew Hale
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 21 pages, 5 figures
Abstract:Autonomous systems are increasingly expected to operate in the presence of adversaries, though an adversary may infer sensitive information simply by observing a system, without even needing to interact with it. Therefore, in this work we present a deceptive decision-making framework that not only conceals sensitive information, but in fact actively misleads adversaries about it. We model autonomous systems as Markov decision processes, and we consider adversaries that attempt to infer their reward functions using inverse reinforcement learning. To counter such efforts, we present two regularization strategies for policy synthesis problems that actively deceive an adversary about a system’s underlying rewards. The first form of deception is “diversionary”, and it leads an adversary to draw any false conclusion about what the system’s reward function is. The second form of deception is “targeted”, and it leads an adversary to draw a specific false conclusion about what the system’s reward function is. We then show how each form of deception can be implemented in policy optimization problems, and we analytically bound the loss in total accumulated reward that is induced by deception. Next, we evaluate these developments in a multi-agent sequential decision-making problem with one real agent and multiple decoys. We show that diversionary deception can cause the adversary to believe that the most important agent is the least important, while attaining a total accumulated reward that is 98.83% of its optimal, non-deceptive value. Similarly, we show that targeted deception can make any decoy appear to be the most important agent, while still attaining a total accumulated reward that is 99.25% of its optimal, non-deceptive value.
[LG-75] Bayesian Optimization with Preference Exploration by Monotonic Neural Network Ensemble
链接: https://arxiv.org/abs/2501.18792
作者: Hanyang Wang,Juergen Branke,Matthias Poloczek
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning allows the search to focus on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.
[LG-76] Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models
链接: https://arxiv.org/abs/2501.18790
作者: Alessio Russo,Alberto Maria Metelli,Marcello Restelli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques providing theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods that are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed Action-wise OAS-UCRL algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \log T})$ when compared against the optimal policy, thus improving over state-of-the-art techniques. Finally, theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.
[LG-77] Navigating the Fragrance Space via Graph Generative Models and Predicting Odors
链接: https://arxiv.org/abs/2501.18777
作者: Mrityunjay Sharma,Sarabeshwar Balaji,Pinaki Saha,Ritesh Kumar
类目: Machine Learning (cs.LG)
*备注:
Abstract:We explore a suite of generative modelling techniques to efficiently navigate and explore the complex landscapes of odor and the broader chemical space. Unlike traditional approaches, we not only generate molecules but also predict the odor likeliness with a ROC AUC score of 0.97 and assign probable odor labels. We correlate odor likeliness with physicochemical features of molecules using machine learning techniques and leverage SHAP (SHapley Additive exPlanations) to demonstrate the interpretability of the function. The whole process involves four key stages: molecule generation, stringent sanitization checks for molecular validity, fragrance likeliness screening and odor prediction of the generated molecules. By making our code and trained models publicly accessible, we aim to facilitate broader adoption of our research across applications in fragrance discovery and olfactory research.
[LG-78] Probabilistic Joint Recovery Method for CO$_2$ Plume Monitoring
链接: https://arxiv.org/abs/2501.18761
作者: Zijun Deng,Rafael Orozco,Abhinav Prakash Gahlot,Felix J. Herrmann
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:Reducing CO$_2$ emissions is crucial to mitigating climate change. Carbon Capture and Storage (CCS) is one of the few technologies capable of achieving net-negative CO$_2$ emissions. However, predicting fluid flow patterns in CCS remains challenging due to uncertainties in CO$_2$ plume dynamics and reservoir properties. Building on existing seismic imaging methods like the Joint Recovery Method (JRM), which lacks uncertainty quantification, we propose the Probabilistic Joint Recovery Method (pJRM). By estimating posterior distributions across surveys using a shared generative model, pJRM provides uncertainty information to improve risk assessment in CCS projects.
[LG-79] STaleX: A Spatiotemporal-Aware Adaptive Auto-scaling Framework for Microservices
链接: https://arxiv.org/abs/2501.18734
作者: Majid Dashtbani,Ladan Tahvildari
类目: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:While cloud environments and auto-scaling solutions have been widely applied to traditional monolithic applications, they face significant limitations when it comes to microservices-based architectures. Microservices introduce additional challenges due to their dynamic and spatiotemporal characteristics, which require more efficient and specialized auto-scaling strategies. Centralized auto-scaling for the entire microservice application is insufficient, as each service within a chain has distinct specifications and performance requirements. Therefore, each service requires its own dedicated auto-scaler to address its unique scaling needs effectively, while also considering the dependencies with other services in the chain and the overall application. This paper presents a combination of control theory, machine learning, and heuristics to address these challenges. We propose an adaptive auto-scaling framework, STaleX, for microservices that integrates spatiotemporal features, enabling real-time resource adjustments to minimize SLO violations. STaleX employs a set of weighted Proportional-Integral-Derivative (PID) controllers for each service, where weights are dynamically adjusted based on a supervisory unit that integrates spatiotemporal features. This supervisory unit continuously monitors and adjusts both the weights and the resources allocated to each service. Our framework accounts for spatial features, including service specifications and dependencies among services, as well as temporal variations in workload, ensuring that resource allocation is continuously optimized. Through experiments on a microservice-based demo application deployed on a Kubernetes cluster, we demonstrate the effectiveness of our framework in improving performance and reducing costs compared to traditional scaling methods like Kubernetes Horizontal Pod Autoscaler (HPA) with a 26.9% reduction in resource usage.
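The idea of blending several PID controllers with supervisor-provided weights can be sketched as follows. This is an illustrative guess at the general pattern, not the STaleX implementation: the gains, the two error signals, the weights, and the normalisation by the weight sum are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float = 1.0) -> float:
        """Return the control output for one sampling interval."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def weighted_control(controllers, weights, errors, dt=1.0):
    """Blend several PID outputs using weights chosen by a supervisory unit."""
    total = sum(weights)
    blended = sum(w * c.step(e, dt) for c, w, e in zip(controllers, weights, errors))
    return blended / total


# Hypothetical signals for one service: CPU-utilisation error and latency error.
controllers = [PID(kp=1.0, ki=0.1, kd=0.0), PID(kp=0.5, ki=0.0, kd=0.2)]
signal = weighted_control(controllers, weights=[0.7, 0.3], errors=[0.2, 0.4])
```

In a real autoscaler the blended signal would be mapped to a replica count or resource request, and the supervisor would update `weights` as workload and service dependencies change.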
[LG-80] chebgreen: Learning and Interpolating Continuous Empirical Green’s Functions from Data
链接: https://arxiv.org/abs/2501.18715
作者: Harshwardhan Praveen,Jacob Brown,Christopher Earls
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Code is available at this https URL
Abstract:In this work, we present a mesh-independent, data-driven library, chebgreen, to mathematically model one-dimensional systems, possessing an associated control parameter, and whose governing partial differential equation is unknown. The proposed method learns an Empirical Green’s Function for the associated, but hidden, boundary value problem, in the form of a Rational Neural Network from which we subsequently construct a bivariate representation in a Chebyshev basis. We uncover the Green’s function, at an unseen control parameter value, by interpolating the left and right singular functions within a suitable library, expressed as points on a manifold of Quasimatrices, while the associated singular values are interpolated with Lagrange polynomials.
[LG-81] Invisible Traces: Using Hybrid Fingerprinting to identify underlying LLMs in GenAI Apps
链接: https://arxiv.org/abs/2501.18712
作者: Devansh Bhardwaj,Naman Mishra
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Fingerprinting refers to the process of identifying underlying Machine Learning (ML) models of AI systems, such as Large Language Models (LLMs), by analyzing their unique characteristics or patterns, much like a human fingerprint. The fingerprinting of LLMs has become essential for ensuring the security and transparency of AI-integrated applications. While existing methods primarily rely on access to direct interactions with the application to infer model identity, they often fail in real-world scenarios involving multi-agent systems, frequent model updates, and restricted access to model internals. In this paper, we introduce a novel fingerprinting framework designed to address these challenges by integrating static and dynamic fingerprinting techniques. Our approach identifies architectural features and behavioral traits, enabling accurate and robust fingerprinting of LLMs in dynamic environments. We also highlight new threat scenarios where traditional fingerprinting methods are ineffective, bridging the gap between theoretical techniques and practical application. To validate our framework, we present an extensive evaluation setup that simulates real-world conditions and demonstrate the effectiveness of our methods in identifying and monitoring LLMs in Gen-AI applications. Our results highlight the framework’s adaptability to diverse and evolving deployment contexts.
[LG-82] Combining physics-based and data-driven models: advancing the frontiers of research with Scientific Machine Learning
链接: https://arxiv.org/abs/2501.18708
作者: Alfio Quarteroni,Paola Gervasio,Francesco Regazzoni
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 127 pages. Accepted for publication in Mathematical Models and Methods in Applied Sciences (M3AS)
Abstract:Scientific Machine Learning (SciML) is a recently emerged research field which combines physics-based and data-driven models for the numerical approximation of differential problems. Physics-based models rely on the physical understanding of the problem at hand, subsequent mathematical formulation, and numerical approximation. Data-driven models instead aim to extract relations between input and output data without arguing any causality principle underlying the available data distribution. In recent years, data-driven models have been rapidly developed and popularized. Such a diffusion has been triggered by a huge availability of data (the so-called big data), an increasingly cheap computing power, and the development of powerful machine learning algorithms. SciML leverages the physical awareness of physics-based models and, at the same time, the efficiency of data-driven algorithms. With SciML, we can inject physics and mathematical knowledge into machine learning algorithms. Yet, we can rely on data-driven algorithms’ capability to discover complex and non-linear patterns from data and improve the descriptive capacity of physics-based models. After recalling the mathematical foundations of digital modelling and machine learning algorithms, and presenting the most popular machine learning architectures, we discuss the great potential of a broad variety of SciML strategies in solving complex problems governed by partial differential equations. Finally, we illustrate the successful application of SciML to the simulation of the human cardiac function, a field of significant socio-economic importance that poses numerous challenges on both the mathematical and computational fronts. The corresponding mathematical model is a complex system of non-linear ordinary and partial differential equations describing the electromechanics, valve dynamics, blood circulation, perfusion in the coronary tree, and torso potential. Despite the robustness and accuracy of physics-based models, certain aspects, such as unveiling constitutive laws for cardiac cells and myocardial material properties, as well as devising efficient reduced order models to dominate the extraordinary computational complexity, have been successfully tackled by leveraging data-driven models.
[LG-83] STAN: Smooth Transition Autoregressive Networks
链接: https://arxiv.org/abs/2501.18699
作者: Hugo Inzirillo,Remi Genet
类目: Machine Learning (cs.LG)
*备注:
Abstract:Traditional Smooth Transition Autoregressive (STAR) models offer an effective way to model regime-dependent dynamics through smooth regime changes based on specific transition variables. In this paper, we propose a novel approach by drawing an analogy between STAR models and a multilayer neural network architecture. Our proposed neural network architecture mimics the STAR framework, employing multiple layers to simulate the smooth transition between regimes and capturing complex, nonlinear relationships. The network’s hidden layers and activation functions are structured to replicate the gradual switching behavior typical of STAR models, allowing for a more flexible and scalable approach to regime-dependent modeling. This research suggests that neural networks can provide a powerful alternative to STAR models, with the potential to enhance predictive accuracy in economic and financial forecasting.
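The analogy drawn in this abstract is easier to see with the classical model written out. Below is a minimal sketch of a two-regime logistic STAR, i.e. LSTAR(1), one-step forecast; it shows the textbook model only, not the proposed network, and the coefficients in the usage line are made-up values. The transition function is a sigmoid, which is the same shape as a standard neural network activation.

```python
import numpy as np


def logistic_transition(s, gamma, c):
    """Smooth weight in (0, 1) that shifts from regime 1 to regime 2 as s passes c."""
    return 1.0 / (1.0 + np.exp(-gamma * (s - c)))


def lstar_predict(y_prev, phi1, phi2, gamma, c):
    """One-step LSTAR(1) forecast using y_prev as the transition variable."""
    g = logistic_transition(y_prev, gamma, c)
    regime1 = phi1[0] + phi1[1] * y_prev  # AR(1) dynamics in regime 1
    regime2 = phi2[0] + phi2[1] * y_prev  # AR(1) dynamics in regime 2
    return (1.0 - g) * regime1 + g * regime2


# Hypothetical coefficients: at the threshold the two regimes get equal weight.
forecast = lstar_predict(0.0, phi1=(0.0, 0.9), phi2=(1.0, -0.5), gamma=5.0, c=0.0)
```

Here `gamma` controls how sharp the switch is and `c` is the threshold; a large `gamma` approaches a hard threshold model.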
[LG-84] Regularized second-order optimization of tensor-network Born machines
链接: https://arxiv.org/abs/2501.18691
作者: Matan Ben-Dov,Jing Chen
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 11 pages, 5 figures
Abstract:Tensor-network Born machines (TNBMs) are quantum-inspired generative models for learning data distributions. Using tensor-network contraction and optimization techniques, the model learns an efficient representation of the target distribution, capable of capturing complex correlations with a compact parameterization. Despite their promise, the optimization of TNBMs presents several challenges. A key bottleneck of TNBMs is the logarithmic nature of the loss function that is commonly used for this problem. The single-tensor logarithmic optimization problem cannot be solved analytically, necessitating an iterative approach that slows down convergence and increases the risk of getting trapped in one of many non-optimal local minima. In this paper, we present an improved second-order optimization technique for TNBM training, which significantly enhances convergence rates and the quality of the optimized model. Our method employs a modified Newton’s method on the manifold of normalized states, incorporating regularization of the loss landscape to mitigate local minima issues. We demonstrate the effectiveness of our approach by training a one-dimensional matrix product state (MPS) on both discrete and continuous datasets, showcasing its advantages in terms of stability, efficiency, and generalization.
[LG-85] Machine Learning Strategies for Parkinson Tremor Classification Using Wearable Sensor Data
链接: https://arxiv.org/abs/2501.18671
作者: Jesus Paucar-Escalante,Matheus Alves da Silva,Bruno De Lima Sanches,Aurea Soriano-Vargas,Laura Silveira Moriyama,Esther Luna Colombini
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 28 pages, 9 figures, 3 tables, Journal Artificial Intelligence In Medicine
Abstract:Parkinson’s disease (PD) is a neurological disorder requiring early and accurate diagnosis for effective management. Machine learning (ML) has emerged as a powerful tool to enhance PD classification and diagnostic accuracy, particularly by leveraging wearable sensor data. This survey comprehensively reviews current ML methodologies used in classifying Parkinsonian tremors, evaluating various tremor data acquisition methodologies, signal preprocessing techniques, and feature selection methods across time and frequency domains, highlighting practical approaches for tremor classification. The survey explores ML models utilized in existing studies, ranging from traditional methods such as Support Vector Machines (SVM) and Random Forests to advanced deep learning architectures like Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM). We assess the efficacy of these models in classifying tremor patterns associated with PD, considering their strengths and limitations. Furthermore, we discuss challenges and discrepancies in current research and broader challenges in applying ML to PD diagnosis using wearable sensor data. We also outline future research directions to advance ML applications in PD diagnostics, providing insights for researchers and practitioners.
[LG-86] SAFL: Structure-Aware Personalized Federated Learning via Client-Specific Clustering and SCSI-Guided Model Pruning
链接: https://arxiv.org/abs/2501.18659
作者: Nan Li,Xiaolu Wang,Xiao Du,Puyu Cai,Ting Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated Learning (FL) enables clients to collaboratively train machine learning models without sharing local data, preserving privacy in diverse environments. While traditional FL approaches preserve privacy, they often struggle with high computational and communication overhead. To address these issues, model pruning is introduced as a strategy to streamline computations. However, existing pruning methods, when applied solely based on local data, often produce sub-models that inadequately reflect clients’ specific tasks due to data insufficiency. To overcome these challenges, this paper introduces SAFL (Structure-Aware Federated Learning), a novel framework that enhances personalized federated learning through client-specific clustering and Similar Client Structure Information (SCSI)-guided model pruning. SAFL employs a two-stage process: initially, it groups clients based on data similarities and uses aggregated pruning criteria to guide the pruning process, facilitating the identification of optimal sub-models. Subsequently, clients train these pruned models and engage in server-based aggregation, ensuring tailored and efficient models for each client. This method significantly reduces computational overhead while improving inference accuracy. Extensive experiments demonstrate that SAFL markedly diminishes model size and improves performance, making it highly effective in federated environments characterized by heterogeneous data.
[LG-87] The Relationship Between Network Similarity and Transferability of Adversarial Attacks
链接: https://arxiv.org/abs/2501.18629
作者: Gerrit Klause,Niklas Bunzel
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Neural networks are vulnerable to adversarial attacks, and several defenses have been proposed. Designing a robust network is a challenging task given the wide range of attacks that have been developed. Therefore, we aim to provide insight into the influence of network similarity on the success rate of transferred adversarial attacks. Network designers can then compare their new network with existing ones to estimate its vulnerability. To achieve this, we investigate the complex relationship between network similarity and the success rate of transferred adversarial attacks. We applied the Centered Kernel Alignment (CKA) network similarity score and used various methods to find a correlation between a large number of Convolutional Neural Networks (CNNs) and adversarial attacks. Network similarity was found to be moderate across different CNN architectures, with more complex models such as DenseNet showing lower similarity scores due to their architectural complexity. Layer similarity was highest for consistent, basic layers such as DataParallel, Dropout and Conv2d, while specialized layers showed greater variability. Adversarial attack success rates were generally consistent for non-transferred attacks, but varied significantly for some transferred attacks, with complex networks being more vulnerable. We found that a DecisionTreeRegressor can predict the success rate of transferred attacks for all black-box and Carlini Wagner attacks with an accuracy of over 90%, suggesting that predictive models may be viable under certain conditions. However, the variability of results across different data subsets underscores the complexity of these relationships and suggests that further research is needed to generalize these findings across different attack scenarios and network architectures.
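For readers unfamiliar with the CKA similarity score used here, the linear variant can be computed in a few lines. This is a minimal NumPy sketch of the standard linear CKA formula on two activation matrices; whether the paper uses the linear or a kernel variant is not stated in the abstract, and the random inputs below are placeholders rather than real network activations.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
acts_a = rng.normal(size=(100, 16))  # placeholder activations from network A
acts_b = rng.normal(size=(100, 24))  # placeholder activations from network B
score = linear_cka(acts_a, acts_b)
```

The score lies between 0 and 1, equals 1 when a representation is compared with itself, and is unchanged by rotating or uniformly rescaling either representation.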
[LG-88] DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs
链接: https://arxiv.org/abs/2501.18617
作者: Zhen Guo,Reza Tourani
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 21 pages, 9 figures, 13 tables
Abstract:With the growing demand for personalized AI solutions, customized LLMs have become a preferred choice for businesses and individuals, driving the deployment of millions of AI agents across various platforms; for example, the GPT Store hosts over 3 million customized GPTs. Their popularity is partly driven by advanced reasoning capabilities, such as Chain-of-Thought, which enhance their ability to tackle complex tasks. However, their rapid proliferation introduces new vulnerabilities, particularly in reasoning processes that remain largely unexplored. We introduce DarkMind, a novel backdoor attack that exploits the reasoning capabilities of customized LLMs. Designed to remain latent, DarkMind activates within the reasoning chain to covertly alter the final outcome. Unlike existing attacks, it operates without injecting triggers into user queries, making it a more potent threat. We evaluate DarkMind across eight datasets covering arithmetic, commonsense, and symbolic reasoning domains, using five state-of-the-art LLMs with five distinct trigger implementations. Our results demonstrate DarkMind’s effectiveness across all scenarios, underscoring its impact. Finally, we explore potential defense mechanisms to mitigate its risks, emphasizing the need for stronger security measures.
[LG-89] Beyond Fixed Horizons: A Theoretical Framework for Adaptive Denoising Diffusions
链接: https://arxiv.org/abs/2501.19373
作者: Sören Christensen,Claudia Strauch,Lukas Trottner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We introduce a new class of generative diffusion models that, unlike conventional denoising diffusion models, achieve a time-homogeneous structure for both the noising and denoising processes, allowing the number of steps to adaptively adjust based on the noise level. This is accomplished by conditioning the forward process using Doob’s $h$-transform, which terminates the process at a suitable sampling distribution at a random time. The model is particularly well suited for generating data with lower intrinsic dimensions, as the termination criterion simplifies to a first-hitting rule. A key feature of the model is its adaptability to the target data, enabling a variety of downstream tasks using a pre-trained unconditional generative model. These tasks include natural conditioning through appropriate initialization of the denoising process and classification of noisy data.
[LG-90] Statistical Physics of Deep Neural Networks: Generalization Capability, Beyond the Infinite Width, and Feature Learning
链接: https://arxiv.org/abs/2501.19281
作者: Sebastiano Ariosto
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: PhD thesis (200 pages), divided into four separate chapters, each of which can be read independently. Some of the material presented has previously appeared in works available on arXiv under the following identifiers: 2209.04882 and 2201.11022
Abstract:Deep Neural Networks (DNNs) excel at many tasks, often rivaling or surpassing human performance. Yet their internal processes remain elusive, frequently described as “black boxes.” While performance can be refined experimentally, achieving a fundamental grasp of their inner workings is still a challenge. Statistical Mechanics has long tackled computational problems, and this thesis applies physics-based insights to understand DNNs via three complementary approaches. First, by averaging over data, we derive an asymptotic bound on generalization that depends solely on the size of the last layer, rather than on the total number of parameters – revealing how deep architectures process information differently across layers. Second, adopting a data-dependent viewpoint, we explore a finite-width thermodynamic limit beyond the infinite-width regime. This leads to: (i) a closed-form expression for the generalization error in a finite-width one-hidden-layer network (regression task); (ii) an approximate partition function for deeper architectures; and (iii) a link between deep networks in this thermodynamic limit and Student’s t-processes. Finally, from a task-explicit perspective, we present a preliminary analysis of how DNNs interact with a controlled dataset, investigating whether they truly internalize its structure – collapsing to the teacher – or merely memorize it. By understanding when a network must learn data structure rather than just memorize, it sheds light on fostering meaningful internal representations. In essence, this thesis leverages the synergy between Statistical Physics and Machine Learning to illuminate the inner behavior of DNNs.
[LG-91] On Pareto Optimality for the Multinomial Logistic Bandit
Link: https://arxiv.org/abs/2501.19277
Authors: Jierui Zuo, Hanzhang Qin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We provide a new online learning algorithm for tackling the Multinomial Logit Bandit (MNL-Bandit) problem. Despite the challenges posed by the combinatorial nature of the MNL model, we develop a novel Upper Confidence Bound (UCB)-based method that achieves Pareto optimality by balancing regret minimization and estimation error of the assortment revenues and the MNL parameters. We develop theoretical guarantees characterizing the tradeoff between regret and estimation error for the MNL-Bandit problem through information-theoretic bounds, and propose a modified UCB algorithm that incorporates forced exploration to improve parameter estimation accuracy while maintaining low regret. Our analysis sheds critical insights into how to optimally balance the collected revenues and the treatment estimation in dynamic assortment optimization.
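The idea of adding forced exploration to a UCB rule can be illustrated on a much simpler problem than the MNL-Bandit. The sketch below uses plain Bernoulli arms and the textbook UCB1 index; it is not the paper's algorithm. The function name `ucb_with_forced_exploration`, the `forced_every` schedule, and the arm means are all assumptions made for illustration.

```python
import math
import random


def ucb_with_forced_exploration(means, horizon, forced_every=10, seed=0):
    """UCB1 on Bernoulli arms with periodic forced exploration.

    means: hypothetical true success probabilities, used only to simulate rewards.
    forced_every: every this many rounds, an arm is pulled in round-robin order
    regardless of its index, which keeps all estimates improving.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        elif t % forced_every == 0:
            arm = (t // forced_every) % k  # forced exploration round
        else:
            # Standard UCB1 index: empirical mean plus confidence radius.
            arm = max(
                range(k),
                key=lambda a: totals[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    estimates = [totals[a] / pulls[a] for a in range(k)]
    return pulls, estimates


pulls, estimates = ucb_with_forced_exploration([0.2, 0.5, 0.8], horizon=2000)
print(pulls, estimates)
```

A smaller `forced_every` gives every arm more samples, and hence better estimates, at the price of more pulls spent on arms that look suboptimal. That is the regret-versus-estimation tradeoff in miniature.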
[LG-92] DINAMO: Dynamic and INterpretable Anomaly MOnitoring for Large-Scale Particle Physics Experiments
Link: https://arxiv.org/abs/2501.19237
Authors: Arsenii Gavrikov, Julián García Pardiñas, Alberto Garfagnini
Subjects: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
Comments:
Abstract:Ensuring reliable data collection in large-scale particle physics experiments demands Data Quality Monitoring (DQM) procedures to detect possible detector malfunctions and preserve data integrity. Traditionally, this resource-intensive task has been handled by human shifters that struggle with frequent changes in operational conditions. We present novel, interpretable, robust, and scalable DQM algorithms designed to automate anomaly detection in time-dependent settings. Our approach constructs evolving histogram templates with built-in uncertainties, featuring both a statistical variant - extending the classical Exponentially Weighted Moving Average (EWMA) - and a machine learning (ML)-enhanced version that leverages a transformer encoder for improved adaptability. Experimental validations on synthetic datasets demonstrate the high accuracy, adaptability, and interpretability of these methods, with the statistical variant being commissioned in the LHCb experiment at the Large Hadron Collider, underscoring its real-world impact. The code used in this study is available at this https URL.
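The classical EWMA that the statistical variant extends can be sketched in a few lines. This is a plain exponentially weighted moving average over normalized histograms with a simple Pearson-style distance, not the DINAMO algorithm (which also tracks uncertainties). The names `ewma_update` and `chi2_distance`, the smoothing factor `alpha`, and the toy histograms are assumptions.

```python
def ewma_update(template, histogram, alpha=0.2):
    """Blend a new histogram (raw counts) into the running normalized template.

    alpha is a hypothetical smoothing factor: larger values adapt faster.
    """
    total = sum(histogram)
    normalized = [count / total for count in histogram]
    return [(1 - alpha) * t + alpha * h for t, h in zip(template, normalized)]


def chi2_distance(template, histogram):
    """Pearson-style distance between a normalized histogram and the template."""
    total = sum(histogram)
    return sum(
        (count / total - t) ** 2 / t
        for count, t in zip(histogram, template)
        if t > 0
    )


# Toy run: update the template with three well-behaved histograms,
# then score one nominal and one visibly shifted histogram against it.
template = [0.25, 0.25, 0.25, 0.25]
for run in ([24, 26, 25, 25], [26, 24, 25, 25], [25, 25, 24, 26]):
    template = ewma_update(template, run)

good_score = chi2_distance(template, [25, 25, 25, 25])
bad_score = chi2_distance(template, [70, 10, 10, 10])
print(good_score, bad_score)
```

A monitoring loop along these lines would flag a run whose score exceeds some threshold and would update the template only with runs judged good; how to choose that threshold is left open here.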
[LG-93] Fast exact recovery of noisy matrix from few entries: the infinity norm approach
Link: https://arxiv.org/abs/2501.19224
Authors: BaoLinh Tran, Van Vu
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Combinatorics (math.CO); Probability (math.PR); Applications (stat.AP)
Comments: 56 pages, 1 figure
Abstract:The matrix recovery (completion) problem, a central problem in data science and theoretical computer science, is to recover a matrix A from a relatively small sample of entries. While such a task is impossible in general, it has been shown that one can recover A exactly in polynomial time, with high probability, from a random subset of entries, under three (basic and necessary) assumptions: (1) the rank of A is very small compared to its dimensions (low rank), (2) A has delocalized singular vectors (incoherence), and (3) the sample size is sufficiently large. There are many different algorithms for the task, including convex optimization by Candes, Tao and Recht (2009), alternating projection by Hardt and Wooters (2014) and low rank approximation with gradient descent by Keshavan, Montanari and Oh (2009, 2010). In applications, it is more realistic to assume that data is noisy. In this case, these approaches provide an approximate recovery with small root mean square error. However, it is hard to transform such approximate recovery to an exact one. Recently, results by Abbe et al. (2017) and Bhardwaj et al. (2023) concerning approximation in the infinity norm showed that we can achieve exact recovery even in the noisy case, given that the ground matrix has bounded precision. Beyond the three basic assumptions above, they required either the condition number of A is small (Abbe et al.) or the gap between consecutive singular values is large (Bhardwaj et al.). In this paper, we remove these extra spectral assumptions. As a result, we obtain a simple algorithm for exact recovery in the noisy case, under only three basic assumptions. This is the first such algorithm. To analyse the algorithm, we introduce a contour integration argument which is totally different from all previous methods and may be of independent interest. 
MSC classes: 60B20, 05C50, 65F99, 65C20, 60C05, 15A83, 68T09. ACM classes: F.2.1; G.1.2; G.1.3; G.2.1; G.3; I.5.4. Submitted by BaoLinh Tran on Fri, 31 Jan 2025 15:31:01 UTC (v1).
[LG-94] A single-loop SPIDER-type stochastic subgradient method for expectation-constrained nonconvex nonsmooth optimization STOC
Link: https://arxiv.org/abs/2501.19214
Authors: Wei Liu, Yangyang Xu
Subjects: Optimization and Control (math.OC); Computational Complexity (cs.CC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Keywords: stochastic, subgradient, expectation constraints, weakly convex, fairness constrained classification
Abstract:Many real-world problems, such as those with fairness constraints, involve complex expectation constraints and large datasets, necessitating the design of efficient stochastic methods to solve them. Most existing research focuses on cases with no constraint or easy-to-project constraints or deterministic constraints. In this paper, we consider nonconvex nonsmooth stochastic optimization problems with expectation constraints, for which we build a novel exact penalty model. We first show the relationship between the penalty model and the original problem. Then on solving the penalty problem, we present a single-loop SPIDER-type stochastic subgradient method, which utilizes the subgradients of both the objective and constraint functions, as well as the constraint function value at each iteration. Under certain regularity conditions (weaker than Slater-type constraint qualification or strong feasibility assumed in existing works), we establish an iteration complexity result of $O(\epsilon^{-4})$ to reach a near-$\epsilon$ stationary point of the penalized problem in expectation, matching the lower bound for such tasks. Building on the exact penalization, an $(\epsilon,\epsilon)$-KKT point of the original problem is obtained. For a few scenarios, our complexity of either the objective sample subgradient or the constraint sample function values can be lower than the state-of-the-art results by a factor of $\epsilon^{-2}$. Moreover, on solving two fairness-constrained problems, our method is significantly (up to 466 times) faster than the state-of-the-art algorithms, including switching subgradient method and inexact proximal point methods.
[LG-95] Learning While Repositioning in On-Demand Vehicle Sharing Networks
Link: https://arxiv.org/abs/2501.19208
Authors: Hansheng Jiang, Chunlin Sun, Zuo-Jun Max Shen, Shunan Jiang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We consider a network inventory problem motivated by one-way, on-demand vehicle sharing services. Due to uncertainties in both demand and returns, as well as a fixed number of rental units across an $n$-location network, the service provider must periodically reposition vehicles to match supply with demand spatially while minimizing costs. The optimal repositioning policy under a general $n$-location network is intractable without knowing the optimal value function. We introduce the best base-stock repositioning policy as a generalization of the classical inventory control policy to $n$ dimensions, and establish its asymptotic optimality in two distinct limiting regimes under general network structures. We present reformulations to efficiently compute this best base-stock policy in an offline setting with pre-collected data. In the online setting, we show that a natural Lipschitz-bandit approach achieves a regret guarantee of $\widetilde{O}(T^{\frac{n}{n+1}})$, which suffers from the exponential dependence on $n$. We illustrate the challenges of learning with censored data in networked systems through a regret lower bound analysis and by demonstrating the suboptimality of alternative algorithmic approaches. Motivated by these challenges, we propose an Online Gradient Repositioning algorithm that relies solely on censored demand. Under a mild cost-structure assumption, we prove that it attains an optimal regret of $O(n^{2.5}\sqrt{T})$, which matches the regret lower bound in $T$ and achieves only polynomial dependence on $n$. The key algorithmic innovation involves proposing surrogate costs to disentangle intertemporal dependencies and leveraging dual solutions to find the gradient of policy change. Numerical experiments demonstrate the effectiveness of our proposed methods.
[LG-96] Learning Sheaf Laplacian Optimizing Restriction Maps
Link: https://arxiv.org/abs/2501.19207
Authors: Leonardo Di Nino, Sergio Barbarossa, Paolo Di Lorenzo
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Proc. 58th Annual Asilomar Conference on Signals, Systems, and Computers (Asilomar), Pacific Grove, CA, Oct. 27 - Oct. 30, 2024
Abstract:The aim of this paper is to propose a novel framework to infer the sheaf Laplacian, including the topology of a graph and the restriction maps, from a set of data observed over the nodes of a graph. The proposed method is based on sheaf theory, which represents an important generalization of graph signal processing. The learning problem aims to find the sheaf Laplacian that minimizes the total variation of the observed data, where the variation over each edge is also locally minimized by optimizing the associated restriction maps. Compared to alternative methods based on semidefinite programming, our solution is significantly more numerically efficient, as all its fundamental steps are resolved in closed form. The method is numerically tested on data consisting of vectors defined over subspaces of varying dimensions at each node. We demonstrate how the resulting graph is influenced by two key factors: the cross-correlation and the dimensionality difference of the data residing on the graph’s nodes.
[LG-97] Learning Non-Local Molecular Interactions via Equivariant Local Representations and Charge Equilibration
Link: https://arxiv.org/abs/2501.19179
Authors: Paul Fuchs, Michał Sanocki, Julija Zavadlav
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:Graph Neural Network (GNN) potentials relying on chemical locality offer near-quantum mechanical accuracy at significantly reduced computational costs. By propagating local information to distant particles, message-passing neural networks (MPNNs) extend the locality concept to model interactions beyond their local neighborhood. Still, this locality precludes modeling long-range effects, such as charge transfer, electrostatic interactions, and dispersion effects, which are critical to adequately describe many real-world systems. In this work, we propose the Charge Equilibration Layer for Long-range Interactions (CELLI) to address the challenging modeling of non-local interactions and the high computational cost of MPNNs. This novel architecture generalizes the fourth-generation high-dimensional neural network (4GHDNN) concept, integrating the charge equilibration (Qeq) method into a model-agnostic building block for modern equivariant GNN potentials. A series of benchmarks show that CELLI can extend the strictly local Allegro architecture to model highly non-local interactions and charge transfer. Our architecture generalizes to diverse datasets and large structures, achieving an accuracy comparable to MPNNs at about twice the computational efficiency.
[LG-98] Machine Learning in Gamma Astronomy
Link: https://arxiv.org/abs/2501.19064
Authors: A.P.Kryukov, A.P.Demichev, V.A.Ilyin
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
Comments: 20 pages, 5 figures, Proceedings of The 8th International Conference on Deep Learning in Computational Physics, June 19-21, 2024, Moscow, Russia
Abstract:The purpose of this paper is to review the most popular deep learning methods used to analyze astroparticle data obtained with Imaging Atmospheric Cherenkov Telescopes and provide references to the original papers.
[LG-99] Conformal Prediction in Hierarchical Classification
Link: https://arxiv.org/abs/2501.19038
Authors: Thomas Mortier, Alireza Javanmardi, Yusuf Sale, Eyke Hüllermeier, Willem Waegeman
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Conformal prediction has emerged as a widely used framework for constructing valid prediction sets in classification and regression tasks. In this work, we extend the split conformal prediction framework to hierarchical classification, where prediction sets are commonly restricted to internal nodes of a predefined hierarchy, and propose two computationally efficient inference algorithms. The first algorithm returns internal nodes as prediction sets, while the second relaxes this restriction, using the notion of representation complexity, yielding a more general and combinatorial inference problem, but smaller set sizes. Empirical evaluations on several benchmark datasets demonstrate the effectiveness of the proposed algorithms in achieving nominal coverage.
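For orientation, the standard split conformal procedure for flat (non-hierarchical) classification can be sketched as below. It shows only the baseline that the paper extends, not its hierarchical inference algorithms. The score `1 - p(true class)`, the function names, and the toy calibration data are assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.25):
    """Return the split-conformal score threshold from a calibration set.

    cal_probs: per-class probability lists produced by any classifier.
    cal_labels: true class indices for the calibration points.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every class
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration set: seven points whose true class is always class 0.
true_class_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
cal_probs = [[p, (1 - p) / 2, (1 - p) / 2] for p in true_class_probs]
cal_labels = [0] * len(cal_probs)

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
print(prediction_set([0.5, 0.45, 0.05], threshold))
```

In the hierarchical setting described in the abstract, the returned set would additionally be constrained to correspond to nodes of the hierarchy rather than an arbitrary subset of classes.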
[LG-100] Quantum SMOTE with Angular Outliers: Redefining Minority Class Handling
Link: https://arxiv.org/abs/2501.19001
Authors: Nishikanta Mohanty, Bikash K. Behera, Christopher Ferrie
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:
Abstract:This paper introduces Quantum-SMOTEV2, an advanced variant of the Quantum-SMOTE method, leveraging quantum computing to address class imbalance in machine learning datasets without K-Means clustering. Quantum-SMOTEV2 synthesizes data samples using swap tests and quantum rotation centered around a single data centroid, concentrating on the angular distribution of minority data points and the concept of angular outliers (AOL). Experimental results show significant enhancements in model performance metrics at moderate SMOTE levels (30-36%), which previously required up to 50% with the original method. Quantum-SMOTEV2 maintains essential features of its predecessor (arXiv:2402.17398), such as rotation angle, minority percentage, and splitting factor, allowing for tailored adaptation to specific dataset needs. The method is scalable, utilizing compact swap tests and low depth quantum circuits to accommodate a large number of features. Evaluation on the public Cell-to-Cell Telecom dataset with Random Forest (RF), K-Nearest Neighbours (KNN) Classifier, and Neural Network (NN) illustrates that integrating Angular Outliers modestly boosts classification metrics like accuracy, F1 Score, AUC-ROC, and AUC-PR across different proportions of synthetic data, highlighting the effectiveness of Quantum-SMOTEV2 in enhancing model performance for edge cases.
[LG-101] Optimal Transport-based Conformal Prediction
Link: https://arxiv.org/abs/2501.18991
Authors: Gauthier Thurin (CNRS), Kimia Nadjahi (CNRS), Claire Boyer (LMO)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Conformal Prediction (CP) is a principled framework for quantifying uncertainty in black-box learning models, by constructing prediction sets with finite-sample coverage guarantees. Traditional approaches rely on scalar nonconformity scores, which fail to fully exploit the geometric structure of multivariate outputs, such as in multi-output regression or multiclass classification. Recent methods addressing this limitation impose predefined convex shapes for the prediction sets, potentially misaligning with the intrinsic data geometry. We introduce a novel CP procedure handling multivariate score functions through the lens of optimal transport. Specifically, we leverage Monge-Kantorovich vector ranks and quantiles to construct prediction regions with flexible, potentially non-convex shapes, better suited to the complex uncertainty patterns encountered in multivariate learning tasks. We prove that our approach ensures finite-sample, distribution-free coverage properties, similar to typical CP methods. We then adapt our method for multi-output regression and multiclass classification, and also propose simple adjustments to generate adaptive prediction regions with asymptotic conditional coverage guarantees. Finally, we evaluate our method on practical regression and classification problems, illustrating its advantages in terms of (conditional) coverage and efficiency.
[LG-102] Optimizing Through Change: Bounds and Recommendations for Time-Varying Bayesian Optimization Algorithms
Link: https://arxiv.org/abs/2501.18963
Authors: Anthony Bardou, Patrick Thiran
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Time-Varying Bayesian Optimization (TVBO) is the go-to framework for optimizing a time-varying, expensive, noisy black-box function. However, most of the solutions proposed so far either rely on unrealistic assumptions on the nature of the objective function or do not offer any theoretical guarantees. We propose the first analysis that asymptotically bounds the cumulative regret of TVBO algorithms under mild and realistic assumptions only. In particular, we provide an algorithm-independent lower regret bound and an upper regret bound that holds for a large class of TVBO algorithms. Based on this analysis, we formulate recommendations for TVBO algorithms and show how an algorithm (BOLT) that follows them performs better than the state-of-the-art of TVBO through experiments on synthetic and real-world problems.
[LG-103] Trustworthy Evaluation of Generative AI Models
Link: https://arxiv.org/abs/2501.18897
Authors: Zijun Gao, Yan Sun
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 5 figures, 1 table, 15 pages
Abstract:Generative AI (GenAI) models have recently achieved remarkable empirical performance in various applications; however, their evaluations still lack uncertainty quantification. In this paper, we propose a method to compare two generative models based on an unbiased estimator of their relative performance gap. Statistically, our estimator achieves parametric convergence rate and asymptotic normality, which enables valid inference. Computationally, our method is efficient and can be accelerated by parallel computing and by pre-storing intermediate results. On simulated datasets with known ground truth, we show our approach effectively controls type I error and achieves power comparable with commonly used metrics. Furthermore, we demonstrate the performance of our method in evaluating diffusion models on real image datasets with statistical confidence.
[LG-104] QMe14S A Comprehensive and Efficient Spectral Dataset for Small Organic Molecules
Link: https://arxiv.org/abs/2501.18876
Authors: Mingzhi Yuan, Zihan Zou, Wei Hu
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures
Abstract:Developing machine learning protocols for molecular simulations requires comprehensive and efficient datasets. Here we introduce the QMe14S dataset, comprising 186,102 small organic molecules featuring 14 elements (H, B, C, N, O, F, Al, Si, P, S, Cl, As, Se, Br) and 47 functional groups. Using density functional theory at the B3LYP/TZVP level, we optimized the geometries and calculated properties including energy, atomic charge, atomic force, dipole moment, quadrupole moment, polarizability, octupole moment, first hyperpolarizability, and Hessian. At the same level, we obtained the harmonic IR, Raman and NMR spectra. Furthermore, we conducted ab initio molecular dynamics simulations to generate dynamic configurations and extract nonequilibrium properties, including energy, forces, and Hessians. By leveraging our E(3)-equivariant message-passing neural network (DetaNet), we demonstrated that models trained on QMe14S outperform those trained on the previously developed QM9S dataset in simulating molecular spectra. The QMe14S dataset thus serves as a comprehensive benchmark for molecular simulations, offering valuable insights into structure-property relationships.
[LG-105] Adaptivity and Convergence of Probability Flow ODEs in Diffusion Generative Models
Link: https://arxiv.org/abs/2501.18863
Authors: Jiaqi Tang, Yuling Yan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Score-based generative models, which transform noise into data by learning to reverse a diffusion process, have become a cornerstone of modern generative AI. This paper contributes to establishing theoretical guarantees for the probability flow ODE, a widely used diffusion-based sampler known for its practical efficiency. While a number of prior works address its general convergence theory, it remains unclear whether the probability flow ODE sampler can adapt to the low-dimensional structures commonly present in natural image data. We demonstrate that, with accurate score function estimation, the probability flow ODE sampler achieves a convergence rate of O(k/T) in total variation distance (ignoring logarithmic factors), where k is the intrinsic dimension of the target distribution and T is the number of iterations. This dimension-free convergence rate improves upon existing results that scale with the typically much larger ambient dimension, highlighting the ability of the probability flow ODE sampler to exploit intrinsic low-dimensional structures in the target distribution for faster sampling.
[LG-106] Beyond Short Steps in Frank-Wolfe Algorithms
Link: https://arxiv.org/abs/2501.18773
Authors: David Martínez-Rubio, Sebastian Pokutta
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:We introduce novel techniques to enhance Frank-Wolfe algorithms by leveraging function smoothness beyond traditional short steps. Our study focuses on Frank-Wolfe algorithms with step sizes that incorporate primal-dual guarantees, offering practical stopping criteria. We present a new Frank-Wolfe algorithm utilizing an optimistic framework and provide a primal-dual convergence proof. Additionally, we propose a generalized short-step strategy aimed at optimizing a computable primal-dual gap. Interestingly, this new generalized short-step strategy is also applicable to gradient descent algorithms beyond Frank-Wolfe methods. As a byproduct, our work revisits and refines primal-dual techniques for analyzing Frank-Wolfe algorithms, achieving tighter primal-dual convergence rates. Empirical results demonstrate that our optimistic algorithm outperforms existing methods, highlighting its practical advantages.
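As background, the traditional short step that the paper goes beyond can be sketched on a toy problem: projecting a point onto the probability simplex by minimizing $f(x) = \frac{1}{2}\|x - b\|^2$. This is the textbook Frank-Wolfe method with the classical short-step rule, not the paper's optimistic or generalized variants. The function name, the `target` vector, and the smoothness constant of 1 (exact for this objective) are assumptions of the example.

```python
def frank_wolfe_short_step(b, iterations=200, smoothness=1.0, tol=1e-12):
    """Minimise 0.5 * ||x - b||^2 over the probability simplex.

    Returns the final iterate and the Frank-Wolfe gap computed at the start of
    the last iteration. Short steps never increase the objective, so that gap
    still upper-bounds the primal error of the returned point.
    """
    n = len(b)
    x = [1.0 / n] * n  # start at the barycentre of the simplex
    gap = float("inf")
    for _ in range(iterations):
        grad = [xi - bi for xi, bi in zip(x, b)]
        # Linear minimisation oracle: the best simplex vertex for this gradient.
        i_min = min(range(n), key=lambda i: grad[i])
        direction = [(1.0 if i == i_min else 0.0) - xi for i, xi in enumerate(x)]
        gap = -sum(g * d for g, d in zip(grad, direction))  # <grad, x - v>
        norm_sq = sum(d * d for d in direction)
        if gap <= tol or norm_sq == 0.0:
            break
        # Short step: minimise the quadratic upper bound along the direction.
        step = min(1.0, gap / (smoothness * norm_sq))
        x = [xi + step * d for xi, d in zip(x, direction)]
    return x, gap


target = [0.2, 0.5, 0.3]
solution, final_gap = frank_wolfe_short_step(target)
primal_error = 0.5 * sum((s - t) ** 2 for s, t in zip(solution, target))
print(solution, final_gap, primal_error)
```

Because the Frank-Wolfe gap bounds the primal error from above, checking `gap <= tol` is a stopping criterion that can be evaluated without knowing the optimal value. The abstract's reference to practical stopping criteria concerns certificates of this kind.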
[LG-107] A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization
Link: https://arxiv.org/abs/2501.18756
Authors: Nuojin Cheng, Leonard Papenmeier, Stephen Becker, Luigi Nardi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Bayesian optimization is a widely used method for optimizing expensive black-box functions, with Expected Improvement being one of the most commonly used acquisition functions. In contrast, information-theoretic acquisition functions aim to reduce uncertainty about the function’s optimum and are often considered fundamentally distinct from EI. In this work, we challenge this prevailing perspective by introducing a unified theoretical framework, Variational Entropy Search, which reveals that EI and information-theoretic acquisition functions are more closely related than previously recognized. We demonstrate that EI can be interpreted as a variational inference approximation of the popular information-theoretic acquisition function, named Max-value Entropy Search. Building on this insight, we propose VES-Gamma, a novel acquisition function that balances the strengths of EI and MES. Extensive empirical evaluations across both low- and high-dimensional synthetic and real-world benchmarks demonstrate that VES-Gamma is competitive with state-of-the-art acquisition functions and in many cases outperforms EI and MES.
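For readers unfamiliar with Expected Improvement, its standard closed form under a Gaussian posterior is easy to write down. The sketch below assumes a maximization convention and a posterior with mean `mean` and standard deviation `std` at the candidate point. It shows only classical EI, not VES-Gamma or Max-value Entropy Search; the function and argument names are illustrative.

```python
import math


def expected_improvement(mean, std, best):
    """Closed-form EI for maximisation under a Gaussian posterior.

    mean, std: posterior mean and standard deviation at the candidate point.
    best: best objective value observed so far.
    """
    if std <= 0.0:
        return max(mean - best, 0.0)  # no uncertainty: improvement is deterministic
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


# Two candidates with the same posterior mean but different uncertainty.
ei_certain = expected_improvement(mean=1.0, std=0.1, best=1.0)
ei_uncertain = expected_improvement(mean=1.0, std=1.0, best=1.0)
print(ei_certain, ei_uncertain)
```

When the posterior mean equals the incumbent, EI reduces to `std / sqrt(2 * pi)`, so the more uncertain candidate is preferred. This is the exploration behaviour that information-theoretic acquisition functions also aim for, by a different route.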
[LG-108] Constructing Cell-type Taxonomy by Optimal Transport with Relaxed Marginal Constraints
Link: https://arxiv.org/abs/2501.18650
Authors: Sebastian Pena, Lin Lin, Jia Li
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:The rapid emergence of single-cell data has facilitated the study of many different biological conditions at the cellular level. Cluster analysis has been widely applied to identify cell types, capturing the essential patterns of the original data in a much more concise form. One challenge in the cluster analysis of cells is matching clusters extracted from datasets of different origins or conditions. Many existing algorithms cannot recognize new cell types present in only one of the two samples when establishing a correspondence between clusters obtained from two samples. Additionally, when there are more than two samples, it is advantageous to align clusters across all samples simultaneously rather than performing pairwise alignment. Our approach aims to construct a taxonomy for cell clusters across all samples to better annotate these clusters and effectively extract features for downstream analysis. A new system for constructing cell-type taxonomy has been developed by combining the technique of Optimal Transport with Relaxed Marginal Constraints (OT-RMC) and the simultaneous alignment of clusters across multiple samples. OT-RMC allows us to address challenges that arise when the proportions of clusters vary substantially between samples or when some clusters do not appear in all the samples. Experiments on more than twenty datasets demonstrate that the taxonomy constructed by this new system can yield highly accurate annotation of cell types. Additionally, sample-level features extracted based on the taxonomy result in accurate classification of samples.
Information Retrieval
[IR-0] Characterizing User Behavior: The Interplay Between Mobility Patterns and Mobile Traffic
Link: https://arxiv.org/abs/2501.19348
Authors: Anne Josiane Kouam, Aline Carneiro Viana, Mariano G. Beiró, Leo Ferres, Luca Pappalardo
Subjects: Networking and Internet Architecture (cs.NI); Information Retrieval (cs.IR)
Comments:
Abstract:Mobile devices have become essential for capturing human activity, and eXtended Data Records (XDRs) offer rich opportunities for detailed user behavior modeling, which is useful for designing personalized digital services. Previous studies have primarily focused on aggregated mobile traffic and mobility analyses, often neglecting individual-level insights. This paper introduces a novel approach that explores the dependency between traffic and mobility behaviors at the user level. By analyzing 13 individual features that encompass traffic patterns and various mobility aspects, we enhance the understanding of how these behaviors interact. Our advanced user modeling framework integrates traffic and mobility behaviors over time, allowing for fine-grained dependencies while maintaining population heterogeneity through user-specific signatures. Furthermore, we develop a Markov model that infers traffic behavior from mobility and vice versa, prioritizing significant dependencies while addressing privacy concerns. Using a week-long XDR dataset from 1,337,719 users across several provinces in Chile, we validate our approach, demonstrating its robustness and applicability in accurately inferring user behavior and matching mobility and traffic profiles across diverse urban contexts.
[IR-1] Emancipatory Information Retrieval
Link: https://arxiv.org/abs/2501.19241
Authors: Bhaskar Mitra
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Our world today is facing a confluence of several mutually reinforcing crises, each of which intersects with concerns of social justice and emancipation. This paper is a provocation for the role of computer-mediated information access in our emancipatory struggles. We define emancipatory information retrieval as the study and development of information access methods that challenge various forms of human oppression, and situate its activities within broader collective emancipatory praxis. The term “emancipatory” here signifies the moral concerns of universal humanization of all peoples and the elimination of oppression to create the conditions under which we can collectively flourish. To develop an emancipatory research agenda for IR, in this paper we speculate about the practices that the community can adopt, enumerate some of the projects that the field should undertake, and discuss provocations to spark new ideas and directions for research. We challenge the field of information retrieval (IR) research to embrace humanistic values and commit to universal emancipation and social justice as part of our research.
[IR-2] Collaborative Diffusion Model for Recommender System WWW’25
Link: https://arxiv.org/abs/2501.18997
Authors: Gyuseok Lee, Yaochen Zhu, Hwanjo Yu, Yao Zhou, Jundong Li
Categories: Information Retrieval (cs.IR)
Comments: WWW’25 short
Abstract:Diffusion-based recommender systems (DR) have gained increasing attention for their advanced generative and denoising capabilities. However, existing DR face two central limitations: (i) a trade-off between enhancing generative capacity via noise injection and retaining personalized information; and (ii) the underutilization of rich item-side information. To address these challenges, we present a Collaborative Diffusion model for Recommender System (CDiff4Rec). Specifically, CDiff4Rec generates pseudo-users from item features and leverages collaborative signals from both real and pseudo personalized neighbors identified through behavioral similarity, thereby effectively reconstructing nuanced user preferences. Experimental results on three public datasets show that CDiff4Rec outperforms competitors by effectively mitigating the loss of personalized information through the integration of item content and collaborative signals.
[IR-3] Are Representation Disentanglement and Interpretability Linked in Recommendation Models? A Critical Review and Reproducibility Study ECIR2025
Link: https://arxiv.org/abs/2501.18805
Authors: Ervin Dervishaj, Tuukka Ruotsalo, Maria Maistro, Christina Lioma
Categories: Information Retrieval (cs.IR)
Comments: Accepted at the 47th European Conference on Information Retrieval (ECIR 2025)
Abstract:Unsupervised learning of disentangled representations has been closely tied to enhancing the representation interpretability of Recommender Systems (RSs). This has been achieved by making the representation of individual features more distinctly separated, so that it is easier to attribute the contribution of features to the model’s predictions. However, such advantages in interpretability and feature attribution have mainly been explored qualitatively. Moreover, the effect of disentanglement on the model’s recommendation performance has been largely overlooked. In this work, we reproduce the recommendation performance, representation disentanglement and representation interpretability of five well-known recommendation models on four RS datasets. We quantify disentanglement and investigate the link of disentanglement with recommendation effectiveness and representation interpretability. While several existing works in RSs have proposed disentangled representations as a gateway to improved effectiveness and interpretability, our findings show that disentanglement is not necessarily related to effectiveness but is closely related to representation interpretability. Our code and results are publicly available at this https URL.
[IR-4] Hierarchical Multi-field Representations for Two-Stage E-commerce Retrieval
Link: https://arxiv.org/abs/2501.18707
Authors: Niklas Freymuth, Dong Liu, Thomas Ricatte, Saab Mansour
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Dense retrieval methods typically target unstructured text data represented as flat strings. However, e-commerce catalogs often include structured information across multiple fields, such as brand, title, and description, which contain information with important potential for retrieval systems. We present Cascading Hierarchical Attention Retrieval Model (CHARM), a novel framework designed to encode structured product data into hierarchical field-level representations with progressively finer detail. Utilizing a novel block-triangular attention mechanism, our method captures the interdependencies between product fields in a specified hierarchy, yielding field-level representations and aggregated vectors suitable for fast and efficient retrieval. Combining both representations enables a two-stage retrieval pipeline, in which the aggregated vectors support initial candidate selection, while more expressive field-level representations facilitate precise fine-tuning for downstream ranking. Experiments on publicly available large-scale e-commerce datasets demonstrate that CHARM matches or outperforms state-of-the-art baselines. Our analysis highlights the framework’s ability to align different queries with appropriate product fields, enhancing retrieval accuracy and explainability.
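Editor's illustration (not from the paper): the abstract does not spell out the block-triangular attention pattern, so the sketch below shows only one plausible reading. It assumes that fields are ordered by the hierarchy (for example brand, then title, then description) and that a token may attend to every token in its own field and in earlier fields. The function name `block_triangular_mask` and the field lengths are hypothetical.

```python
def block_triangular_mask(field_lengths):
    """Build a boolean attention mask over the concatenated field tokens.

    Assumption (not confirmed by the abstract): token i may attend to
    token j only if j's field comes no later than i's field.
    """
    # Field index of every token, e.g. lengths [1, 2] -> [0, 1, 1]
    field_of = [f for f, n in enumerate(field_lengths) for _ in range(n)]
    # mask[i][j] is True when token i is allowed to attend to token j
    return [[fj <= fi for fj in field_of] for fi in field_of]


# Hypothetical product with a 1-token brand and a 2-token title.
mask = block_triangular_mask([1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# x..
# xxx
# xxx
```

Under this assumption, the first field is encoded in isolation while later fields see everything before them, which is one way field-level representations could become progressively more detailed.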