This blog post presents the latest list of papers retrieved from Arxiv.org on 2024-11-27. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-11-27)
A total of 517 papers are updated today, including:
- Natural Language Processing: 56 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 119 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 214 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 136 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
[Quick Read]: The problem this paper addresses is how to assess and counter the risk that large language models (LLMs) may deliberately try to bypass safety measures during deployment. The key to the solution is a two-level deployment framework in which an adaptive macro-protocol chooses among different micro-protocols. Micro-protocols operate on a single task and use an extensively tested (trusted) model to harness and monitor the untrusted model. The macro-protocol maintains an adaptive credence in the untrusted model's alignment based on its past actions and uses it to choose between safer and riskier micro-protocols. In a code-generation testbed, this approach reduces the number of backdoored solutions by 80% compared with non-adaptive baselines while preserving a given level of usefulness.
Link: https://arxiv.org/abs/2411.17693
Authors: Jiaxin Wen,Vivek Hebbar,Caleb Larson,Aryan Bhatt,Ansh Radhakrishnan,Mrinank Sharma,Henry Sleight,Shi Feng,He He,Ethan Perez,Buck Shlegeris,Akbir Khan
Keywords-EN: measures remain effective, safety measures remain, large language models, bypass safety measures, large language
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a “distributed threat setting” – a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment. We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model’s alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
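The abstract describes the macro-protocol only at a high level. As a rough illustration of the idea (not the paper's actual protocol), the sketch below keeps a Bayesian credence that the untrusted model is aligned, updates it from a trusted monitor's per-task verdicts, and picks a riskier micro-protocol only when trust is high; the probabilities, class names, and thresholding rule are all assumptions.

```python
class AdaptiveMacroProtocol:
    """Toy sketch: keep a Bayesian credence that the untrusted model is aligned
    and use it to choose between a safer and a riskier micro-protocol."""

    def __init__(self, prior=0.5, p_flag_if_misaligned=0.3, p_flag_if_aligned=0.05):
        self.credence_aligned = prior              # P(aligned)
        self.p_flag_mis = p_flag_if_misaligned     # P(monitor flags | misaligned)
        self.p_flag_ali = p_flag_if_aligned        # P(monitor flags | aligned)

    def update(self, monitor_flagged: bool):
        # Bayes rule on the monitor's verdict for the latest task
        like_ali = self.p_flag_ali if monitor_flagged else 1 - self.p_flag_ali
        like_mis = self.p_flag_mis if monitor_flagged else 1 - self.p_flag_mis
        num = like_ali * self.credence_aligned
        den = num + like_mis * (1 - self.credence_aligned)
        self.credence_aligned = num / den

    def pick_micro_protocol(self, threshold=0.8):
        # Riskier protocol (more untrusted-model autonomy) only under high trust
        return "use_untrusted_with_monitoring" if self.credence_aligned > threshold \
               else "use_trusted_model_only"

# Example: three tasks, the second one gets flagged by the trusted monitor
macro = AdaptiveMacroProtocol()
for flagged in [False, True, False]:
    macro.update(flagged)
    print(round(macro.credence_aligned, 3), macro.pick_micro_protocol())
```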
[NLP-1] Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
[Quick Read]: This paper studies the relationship between how low-bit quantization behaves in large language models (LLMs) and how thoroughly the models have been trained. It finds that low-bit quantization favors undertrained LLMs, which suffer less quantization-induced degradation (QiD). The key to the solution is studying more than 1,500 quantized LLM checkpoints of various sizes and training levels and deriving scaling laws that relate QiD to factors such as the number of training tokens, model size, and bit width. Based on these laws, the paper proposes a new perspective: QiD can be used to measure an LLM's training level and to determine the minimum number of training tokens needed to fully train models of different sizes. The paper also predicts the low-bit quantization performance of future models trained on more than 100 trillion tokens and notes that it may not be desirable, posing a challenge for future low-bit quantization research.
Link: https://arxiv.org/abs/2411.17691
Authors: Xu Ouyang,Tao Ge,Thomas Hartvigsen,Zhisong Zhang,Haitao Mi,Dong Yu
Keywords-EN: favors undertrained large, tokens suffer significant, suffer significant QiD, training tokens, fewer training tokens
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Work in progress; Please note that Figure 1’s gray areas may not be displayed properly using Chrome (maybe due to bugs in Chrome)
Abstract:We reveal that low-bit quantization favors undertrained large language models (LLMs) by observing that models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when applying low-bit quantization, whereas smaller models with extensive training tokens suffer significant QiD. To gain deeper insights into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and factors such as the number of training tokens, model size and bit width. With the derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM’s training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model’s training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at this https URL.
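The abstract does not give the functional form of the derived scaling laws. The sketch below only shows the generic recipe such a study might use: assume a multiplicative power law for QiD in training tokens, model size, and bit width, then fit it by log-linear least squares. The functional form, the measurements, and the resulting coefficients are placeholders, not the paper's results.

```python
import numpy as np

# Hypothetical measurements for quantized checkpoints:
# columns = training tokens D, parameters N, bit width b, observed QiD (loss increase)
data = np.array([
    # D (tokens),  N (params), bits, QiD
    [5e10,         1e9,        4,    0.02],
    [5e11,         1e9,        4,    0.09],
    [5e12,         1e9,        4,    0.31],
    [5e11,         7e9,        4,    0.04],
    [5e11,         1e9,        3,    0.22],
    [5e12,         7e9,        3,    0.35],
])
D, N, b, qid = data.T

# Assume a multiplicative power law  QiD = c * D^alpha * N^beta * b^gamma,
# so  log QiD = log c + alpha*log D + beta*log N + gamma*log b  (linear in logs).
X = np.column_stack([np.ones_like(D), np.log(D), np.log(N), np.log(b)])
coef, *_ = np.linalg.lstsq(X, np.log(qid), rcond=None)
log_c, alpha, beta, gamma = coef
print(f"alpha(tokens)={alpha:.2f} beta(size)={beta:.2f} gamma(bits)={gamma:.2f}")

# Extrapolate to a 7B model trained on 100T tokens at 4 bits (illustration only)
pred = np.exp(log_c + alpha*np.log(100e12) + beta*np.log(7e9) + gamma*np.log(4))
print(f"predicted QiD: {pred:.3f}")
```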
[NLP-2] Attamba: Attending To Multi-Token States
[Quick Read]: This paper targets the quadratic growth in compute with sequence length that conventional Transformer models incur on long sequences. The key to the solution is the Attamba architecture, which uses state-space models (SSMs) to compress chunks of tokens in the sequence and applies attention over the compressed key-value representations. By replacing the key and value projection layers in the Transformer with SSMs, Attamba not only improves model quality but also enables flexible token chunking: it reduces perplexity by 24% with a similar KV-cache and attention footprint, and shrinks the KV-cache and attention FLOPs by roughly 4x at the cost of 5% perplexity. In addition, Attamba can attend over chunked sequences of variable length, enabling a smooth transition between quadratic and linear compute and offering adaptable efficiency gains.
Link: https://arxiv.org/abs/2411.17685
Authors: Yash Akhauri,Safeen Huda,Mohamed S. Abdelfattah
Keywords-EN: vanilla transformers compute, transformers compute attention, attention, compute attention, sequence
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:When predicting the next token in a sequence, vanilla transformers compute attention over all previous tokens, resulting in quadratic scaling of compute with sequence length. State-space models compress the entire sequence of tokens into a fixed-dimensional representation to improve efficiency, while other architectures achieve sub-quadratic complexity via low-rank projections or sparse attention patterns over the sequence. In this paper, we introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens and applies attention on these compressed key-value representations. We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking, resulting in 24% improved perplexity with transformer of similar KV-Cache and attention footprint, and ~4 times smaller KV-Cache and Attention FLOPs for 5% perplexity trade-off. Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling, offering adaptable efficiency gains.
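As a toy illustration of attending over compressed multi-token states (not Attamba's actual SSM or training setup), the sketch below summarizes each chunk of token vectors with a simple linear recurrence and then runs standard softmax attention over the per-chunk keys and values, so the attention cost scales with the number of chunks rather than the number of tokens. The shapes, the recurrence, and the decay constant are assumptions.

```python
import numpy as np

def compress_chunks(x, chunk_size, decay=0.9):
    """Summarize each chunk of token vectors with a toy linear recurrence
    h_t = decay*h_{t-1} + x_t (a stand-in for a learned state-space model)."""
    T, d = x.shape
    summaries = []
    for start in range(0, T, chunk_size):
        h = np.zeros(d)
        for t in range(start, min(start + chunk_size, T)):
            h = decay * h + x[t]
        summaries.append(h)
    return np.stack(summaries)              # (num_chunks, d)

def chunked_attention(q, keys_chunked, values_chunked):
    scores = keys_chunked @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values_chunked

T, d, chunk = 64, 16, 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(T, d))
K = compress_chunks(tokens, chunk)          # 8 compressed keys instead of 64
V = compress_chunks(tokens, chunk)
query = rng.normal(size=d)
print(chunked_attention(query, K, V).shape) # (16,)
```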
[NLP-3] Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
[Quick Read]: This paper addresses the loss of internal character structure and ordering information caused by tokenization techniques such as Byte-Pair Encoding (BPE) and Byte-Level BPE (BBPE) when large language models (LLMs) tackle downstream tasks. The key to the solution is Token Internal Position Awareness (TIPA), a new method that trains the model on a reverse character prediction task so that it can effectively learn and generalize the positions and internal structure of characters within tokens. Experiments show that LLMs trained with TIPA perform well both on character position prediction and on downstream tasks such as Chinese Spelling Correction (CSC), where TIPA not only accelerates convergence but also significantly improves task performance.
Link: https://arxiv.org/abs/2411.17679
Authors: Zhu Xu,Zhiqiang Zhao,Zihan Zhang,Yuchi Liu,Quanwei Shen,Fei Liu,Yu Kuang
Keywords-EN: Byte-Level BPE, Tokenization techniques, Byte-Pair Encoding, BPE, vocabulary representation stability
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Tokenization techniques such as Byte-Pair Encoding (BPE) and Byte-Level BPE (BBPE) have significantly improved the computational efficiency and vocabulary representation stability of large language models (LLMs) by segmenting text into tokens. However, this segmentation often obscures the internal character structures and sequences within tokens, preventing models from fully learning these intricate details during training. Consequently, LLMs struggle to comprehend the character compositions and positional relationships within tokens, especially when fine-tuned on downstream tasks with limited data. In this paper, we introduce Token Internal Position Awareness (TIPA), a novel approach that enhances LLMs’ understanding of internal token structures by training them on reverse character prediction tasks using the tokenizer’s own vocabulary. This method enables models to effectively learn and generalize character positions and internal structures. Experimental results demonstrate that LLMs trained with TIPA outperform baseline models in predicting character positions at the token level. Furthermore, when applied to the downstream task of Chinese Spelling Correction (CSC), TIPA not only accelerates model convergence but also significantly improves task performance.
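A minimal sketch of how reverse character prediction targets could be built from a tokenizer's vocabulary, in the spirit of TIPA; the abstract does not specify the exact target format, so the (position, character) encoding below is an assumption for illustration.

```python
def reverse_char_targets(token: str):
    """For a vocabulary token, build a reverse character-prediction target:
    (position, character) pairs listed from the last character to the first."""
    n = len(token)
    return [(n - i, ch) for i, ch in enumerate(reversed(token))]

# Hypothetical vocabulary entries from a subword tokenizer
vocab = ["ing", "tion", "你好", "struct"]
for tok in vocab:
    print(tok, "->", reverse_char_targets(tok))
# e.g. "ing" -> [(3, 'g'), (2, 'n'), (1, 'i')]
```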
[NLP-4] Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting
[Quick Read]: This paper addresses the problem that, for emotion understanding in dialogue, pre-trained language models are limited to the text modality, while having LLMs process multimedia information is too expensive. The key to the solution is a framework named Lantern that uses receptive-field-aware attention weighting to enhance a vanilla base model with the external knowledge and contextual understanding of LLMs. Concretely, Lantern trains a multi-task vanilla model to produce emotion class probabilities and dimension scores, which are fed to the LLM as references to adjust the predicted probability of each emotion class. The dialogue is sliced into different receptive fields such that each sample is included in exactly t receptive fields, and the LLM predictions are finally merged with the vanilla model's predictions by a receptive-field-aware attention-driven weighting module. Experiments on IEMOCAP under the 4-way and 6-way settings show that Lantern improves the vanilla models CORECT and SDT by up to 1.23% and 1.80%.
Link: https://arxiv.org/abs/2411.17674
Authors: Liyun Zhang,Dian Ding,Yu Lu,Yi-Chao Chen,Guangtao Xue
Keywords-EN: understand the contents, accurately understand, requires external knowledge, LLMs, vanilla model
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Understanding the emotions in a dialogue usually requires external knowledge to accurately understand the contents. As the LLMs become more and more powerful, we do not want to settle on the limited ability of the pre-trained language model. However, the LLMs either can only process text modality or are too expensive to process the multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that can improve the performance of a certain vanilla model by prompting large language models with receptive-field-aware attention weighting. This framework trained a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references to adjust the predicted probabilities of each emotion class with its external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly t receptive fields. Finally, the predictions of LLMs are merged with a receptive-field-aware attention-driven weighting module. In the experiments, vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. The experiments in IEMOCAP with 4-way and 6-way settings demonstrated that the Lantern can significantly improve the performance of current vanilla models by up to 1.23% and 1.80%.
[NLP-5] Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods
[Quick Read]: This paper asks whether subword tokenization methods from traditional natural language processing (NLP), such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, are suitable for protein sequences. The key to the solution is evaluating these tokenizers across vocabulary sizes (400-6400), focusing on how well they capture the functional and structural properties of proteins, preserve protein domain boundaries, and adhere to linguistic laws. The study finds that although the methods perform well in some respects, they struggle to maintain protein domain integrity, especially as the vocabulary grows. They also show only partial compliance with linguistic laws such as Zipf's and Brevity laws and deviate notably from Menzerath's law, suggesting that protein sequences may follow organizational principles different from natural language. The paper therefore stresses the need for tokenization strategies tailored to the characteristics of proteins.
Link: https://arxiv.org/abs/2411.17669
Authors: Burak Suyunu,Enes Taylan,Arzucan Özgür
Keywords-EN: machine learning models, require meaningful segmentation, learning models, structural properties, crucial step
Subjects: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 8 pages, 9 figures
Abstract:Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf’s and Brevity laws but notable deviations from Menzerath’s law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.
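Two of the quantities mentioned, fertility and adherence to Zipf's law, can be illustrated with a short sketch. The tokenizer below is a stand-in fixed-length 3-mer splitter rather than BPE, WordPiece, or SentencePiece, and the metric definitions (tokens per residue; slope of a log-log rank-frequency fit) are assumptions about how such measurements are typically computed, not the paper's code.

```python
import numpy as np
from collections import Counter

def fertility(sequences, tokenize):
    """Average number of tokens produced per input symbol (amino acid)."""
    toks = sum(len(tokenize(s)) for s in sequences)
    chars = sum(len(s) for s in sequences)
    return toks / chars

def zipf_slope(sequences, tokenize):
    """Slope of log(frequency) vs log(rank); Zipf's law predicts roughly -1."""
    counts = Counter(t for s in sequences for t in tokenize(s))
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Stand-in tokenizer: non-overlapping 3-mers (real experiments would train BPE, etc.)
kmer = lambda s, k=3: [s[i:i+k] for i in range(0, len(s), k)]
proteins = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS"]
print("fertility:", round(fertility(proteins, kmer), 3))
print("zipf slope:", round(zipf_slope(proteins, kmer), 3))
```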
[NLP-6] How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations
[Quick Read]: This paper investigates how multimodal foundation models form unified representations across languages and modalities. The key findings are: 1) cross-modal representations converge over model layers, while the initial layers retain modality-specific processing; 2) length adaptation is crucial for narrowing the cross-modal gap between text and speech, but current approaches are mainly effective for high-resource languages; 3) speech shows larger cross-lingual differences than text; and 4) for models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.
Link: https://arxiv.org/abs/2411.17666
Authors: Hyunji Lee,Danni Liu,Supriti Sinhamahapatra,Jan Niehues
Keywords-EN: Multimodal foundation models, Multimodal foundation, unified representation space, foundation models aim, aim to create
Subjects: Computation and Language (cs.CL)
Comments: Under review
Abstract:Multimodal foundation models aim to create a unified representation space that abstracts away from surface features like language syntax or modality differences. To investigate this, we study the internal representations of three recent models, analyzing the model activations from semantically equivalent sentences across languages in the text and speech modalities. Our findings reveal that: 1) Cross-modal representations converge over model layers, except in the initial layers specialized at text and speech processing. 2) Length adaptation is crucial for reducing the cross-modal gap between text and speech, although current approaches’ effectiveness is primarily limited to high-resource languages. 3) Speech exhibits larger cross-lingual differences than text. 4) For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.
[NLP-7] BERT or FastText? A Comparative Analysis of Contextual as well as Non-Contextual Embeddings
[Quick Read]: This paper addresses the challenges that low-resource languages such as Marathi face in natural language processing (NLP) due to the scarcity of high-quality annotated data and linguistic resources. The key to the solution is evaluating and comparing different embedding techniques (contextual BERT-based, non-contextual BERT-based, and FastText-based) on Marathi NLP classification tasks. The study examines both compressed and uncompressed embeddings, compares the Muril, MahaBERT, IndicFT, and MahaFT model embeddings, and combines a Multiple Logistic Regression (MLR) classifier with TSNE visualizations. The results confirm that contextual embeddings outperform non-contextual ones, and that non-contextual BERT embeddings taken from the first layer perform better than FastText embeddings, offering a potential alternative to FastText embeddings.
Link: https://arxiv.org/abs/2411.17661
Authors: Abhay Shanbhag,Suramya Jadhav,Amogh Thakurdesai,Ridhima Sinare,Raviraj Joshi
Keywords-EN: Natural Language Processing, presents significant challenges, high-quality annotated data, languages presents significant, low-resource languages presents
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Natural Language Processing (NLP) for low-resource languages presents significant challenges, particularly due to the scarcity of high-quality annotated data and linguistic resources. The choice of embeddings plays a critical role in enhancing the performance of NLP tasks, such as news classification, sentiment analysis, and hate speech detection, especially for low-resource languages like Marathi. In this study, we investigate the impact of various embedding techniques- Contextual BERT-based, Non-Contextual BERT-based, and FastText-based on NLP classification tasks specific to the Marathi language. Our research includes a thorough evaluation of both compressed and uncompressed embeddings, providing a comprehensive overview of how these embeddings perform across different scenarios. Specifically, we compare two BERT model embeddings, Muril and MahaBERT, as well as two FastText model embeddings, IndicFT and MahaFT. Our evaluation includes applying embeddings to a Multiple Logistic Regression (MLR) classifier for task performance assessment, as well as TSNE visualizations to observe the spatial distribution of these embeddings. The results demonstrate that contextual embeddings outperform non-contextual embeddings. Furthermore, BERT-based non-contextual embeddings extracted from the first BERT embedding layer yield better results than FastText-based embeddings, suggesting a potential alternative to FastText embeddings.
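A minimal sketch of the evaluation protocol described above: frozen sentence embeddings fed to a Multiple Logistic Regression classifier. The feature matrix here is random placeholder data standing in for mean-pooled BERT-layer or FastText sentence vectors; the dimensions and label count are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Placeholder features: rows = sentence embeddings (e.g., first-layer BERT outputs
# or averaged FastText vectors), labels = news categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
y = rng.integers(0, 4, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)   # multinomial logistic regression
clf.fit(X_tr, y_tr)
print("macro F1:", round(f1_score(y_te, clf.predict(X_te), average="macro"), 3))
```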
[NLP-8] On Limitations of LLM as Annotator for Low Resource Languages
[Quick Read]: This paper addresses the challenges that low-resource languages such as Marathi face in natural language processing (NLP) tasks due to the lack of data and resources. The key to the solution is evaluating the performance of large language models (LLMs) as potential annotators, in particular their ability to generate datasets and resources. Comparing closed-source and open-source LLMs (GPT-4o, Gemini 1.0 Pro, Gemma 2, Llama 3.1) on classification tasks including sentiment analysis, news classification, and hate speech detection, the paper finds that while LLMs excel at annotation for high-resource languages such as English, they still fall short on Marathi and even underperform BERT-based baselines, highlighting the limitations of LLMs as annotators for low-resource languages.
Link: https://arxiv.org/abs/2411.17637
Authors: Suramya Jadhav,Abhay Shanbhag,Amogh Thakurdesai,Ridhima Sinare,Raviraj Joshi
Keywords-EN: sufficient linguistic data, face significant challenges, significant challenges due, languages face significant, Low-resource languages face
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification. This shortage hinders the development of accurate models and datasets, making it difficult to perform critical NLP tasks like sentiment analysis or hate speech detection. To bridge this gap, Large Language Models (LLMs) present an opportunity for potential annotators, capable of generating datasets and resources for these underrepresented languages. In this paper, we focus on Marathi, a low-resource language, and evaluate the performance of both closed-source and open-source LLMs as annotators. We assess models such as GPT-4o and Gemini 1.0 Pro, Gemma 2 (2B and 9B), and Llama 3.1 (8B) on classification tasks including sentiment analysis, news classification, and hate speech detection. Our findings reveal that while LLMs excel in annotation tasks for high-resource languages like English, they still fall short when applied to Marathi. Even advanced closed models like Gemini and GPT underperform in comparison to BERT-based baselines, highlighting the limitations of LLMs as annotators for low-resource languages.
[NLP-9] Scaling Speech-Text Pre-training with Synthetic Interleaved Data
[Quick Read]: This paper addresses the scalability bottleneck of speech language models (SpeechLMs), whose pre-training depends on unsupervised speech data and parallel speech-text data that are far scarcer than text pre-training data. The key to the solution is a method that leverages large-scale synthetic interleaved data: text spans are sampled from existing text corpora and the corresponding speech spans are synthesized with a text-to-token model, avoiding the need to generate actual speech. The approach also adopts a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by adding a vector-quantized bottleneck to the encoder, producing discrete speech tokens that retain strong semantics and speech reconstruction quality even at lower sampling rates (e.g., 12.5 Hz). With this method, the paper achieves state-of-the-art performance in speech language modeling and spoken question answering, and further shows that fine-tuning the pre-trained model on speech dialogue data yields an end-to-end spoken chatbot.
Link: https://arxiv.org/abs/2411.17607
Authors: Aohan Zeng,Zhengxiao Du,Mingdao Liu,Lei Zhang,Shengmin Jiang,Yuxiao Dong,Jie Tang
Keywords-EN: natural human-computer interaction, human-computer interaction compared, text-based large language, produce speech output, accept speech input
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach results in discrete speech tokens with strong semantic preservation even at lower sampling rates (e.g. 12.5Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken questions tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves competitive performance comparable to existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain.
[NLP-10] What Differentiates Educational Literature? A Multimodal Fusion Approach of Transformers and Computational Linguistics
[Quick Read]: This paper tackles the challenge of integrating new literature into the English curriculum, where educators lack scalable tools to rapidly assess readability and adapt texts to diverse classroom needs. The key to the solution is a multimodal approach that combines transformer-based text classification with linguistic feature analysis to align texts with UK Key Stages. Specifically, eight state-of-the-art Transformer models (such as BERT) are fine-tuned and 500 deep neural network topologies are searched for classifying linguistic characteristics, and fusing the two modalities brings significant gains: the ELECTRA Transformer fused with the neural network reaches an F1 score of 0.996. The approach is finally packaged into a stakeholder-facing web application that offers real-time insight into text complexity, reading difficulty, curriculum alignment, and recommended learning age range, supporting data-driven decision making and reducing manual workload.
Link: https://arxiv.org/abs/2411.17593
Authors: Jordan J. Bird
Keywords-EN: lack scalable tools, rapidly evaluate readability, remains a challenge, challenge since educators, educators often lack
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The integration of new literature into the English curriculum remains a challenge since educators often lack scalable tools to rapidly evaluate readability and adapt texts for diverse classroom needs. This study proposes to address this gap through a multimodal approach that combines transformer-based text classification with linguistic feature analysis to align texts with UK Key Stages. Eight state-of-the-art Transformers were fine-tuned on segmented text data, with BERT achieving the highest unimodal F1 score of 0.75. In parallel, 500 deep neural network topologies were searched for the classification of linguistic characteristics, achieving an F1 score of 0.392. The fusion of these modalities shows a significant improvement, with every multimodal approach outperforming all unimodal models. In particular, the ELECTRA Transformer fused with the neural network achieved an F1 score of 0.996. The proposed approach is finally encapsulated in a stakeholder-facing web application, providing non-technical stakeholder access to real-time insights on text complexity, reading difficulty, curriculum alignment, and recommendations for learning age range. The application empowers data-driven decision making and reduces manual workload by integrating AI-based recommendations into lesson planning for English literature.
[NLP-11] Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey
[Quick Read]: This paper surveys Visual Question Answering (VQA), a challenging task that combines natural language processing and computer vision, and provides a comprehensive overview of recent developments in the field. The key elements of the survey are: 1) natural language understanding of images and text; 2) reasoning modules based on image-question information; 3) recent advances in extracting and fusing modality information with vision-language pre-training models and multimodal large language models (MLLMs); 4) progress in knowledge reasoning, covering internal knowledge extraction and the introduction of external knowledge; and 5) a thorough review of VQA datasets and evaluation metrics. Together these constitute the core technical framework of the VQA task and point to directions for future research.
Link: https://arxiv.org/abs/2411.17558
Authors: Jiayi Kuang,Jingyou Xie,Haohao Luo,Ronghao Li,Zhe Xu,Xianfeng Cheng,Yinghui Li,Xika Lin,Ying Shen
Keywords-EN: Visual Question Answering, Visual Question, Question Answering, computer vision techniques, benchmark test task
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual Question Answering (VQA) is a challenge task that combines natural language processing and computer vision techniques and gradually becomes a benchmark test task in multimodal large language models (MLLMs). The goal of our survey is to provide an overview of the development of VQA and a detailed description of the latest models with high timeliness. This survey gives an up-to-date synthesis of natural language understanding of images and text, as well as the knowledge reasoning module based on image-question information on the core VQA tasks. In addition, we elaborate on recent advances in extracting and fusing modal information with vision-language pretraining models and multimodal large language models in VQA. We also exhaustively review the progress of knowledge reasoning in VQA by detailing the extraction of internal knowledge and the introduction of external knowledge. Finally, we present the datasets of VQA and different evaluation metrics and discuss possible directions for future work.
[NLP-12] Isotropy Matters: Soft-ZCA Whitening of Embeddings for Semantic Code Search
[Quick Read]: This paper investigates how the isotropy of an embedding space affects performance on semantic inference tasks, in particular semantic code search. The key to the solution is a modified ZCA whitening technique (Soft-ZCA whitening) that controls the level of isotropy in the embedding space. The results show that Soft-ZCA whitening improves the performance of pre-trained code language models and can complement contrastive fine-tuning, effectively improving code search.
Link: https://arxiv.org/abs/2411.17538
Authors: Andor Diera,Lukas Galke,Ansgar Scherp
Keywords-EN: involving semantic inference, tasks involving semantic, Low isotropy, tasks involving, space impairs performance
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Low isotropy in an embedding space impairs performance on tasks involving semantic inference. Our study investigates the impact of isotropy on semantic code search performance and explores post-processing techniques to mitigate this issue. We analyze various code language models, examine isotropy in their embedding spaces, and its influence on search effectiveness. We propose a modified ZCA whitening technique to control isotropy levels in embeddings. Our results demonstrate that Soft-ZCA whitening improves the performance of pre-trained code language models and can complement contrastive fine-tuning. The code for our experiments is available at this https URL_isotropy
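The abstract does not spell out how the "soft" variant of ZCA whitening is parameterized. The sketch below shows standard ZCA whitening of embedding vectors with a softening exponent that interpolates between no whitening and full whitening; this particular parameterization is an assumption, not necessarily the paper's.

```python
import numpy as np

def soft_zca_whiten(X, alpha=1.0, eps=1e-5):
    """ZCA-style whitening of row-vector embeddings X with shape (n, d).

    alpha=1.0 is standard ZCA (covariance -> identity); alpha=0.0 only centers
    the data; intermediate values partially flatten the eigenvalue spectrum.
    The 'soft' exponent parameterization is an assumption, not the paper's code.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    scale = (eigvals + eps) ** (-0.5 * alpha)       # soften the usual -1/2 power
    W = eigvecs @ np.diag(scale) @ eigvecs.T        # symmetric (ZCA) transform
    return Xc @ W

def cov_condition(M):
    ev = np.linalg.eigvalsh(np.cov(M, rowvar=False))
    return ev.max() / ev.min()

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 64)) @ rng.normal(size=(64, 64))   # anisotropic embeddings
for a in (0.0, 0.5, 1.0):
    print(a, round(cov_condition(soft_zca_whiten(emb, alpha=a)), 2))
# the covariance condition number shrinks toward 1 as alpha increases
```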
[NLP-13] ShowUI: One Vision-Language-Action Model for GUI Visual Agent
[Quick Read]: This paper addresses the limitation that existing language-based graphical user interface (GUI) assistants lack human-like perception of UI visuals. The key to the solution is ShowUI, a vision-language-action model whose core innovations are: (i) UI-Guided Visual Token Selection, which formulates screenshots as a UI connected graph, adaptively identifies redundant relationships, and uses them as the criteria for token selection in self-attention blocks, reducing computational cost; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies the diverse needs of GUI tasks and manages visual-action history effectively to improve training efficiency; and (iii) small-scale, high-quality GUI instruction-following datasets built via careful data curation and a resampling strategy that addresses data-type imbalance. With these innovations, ShowUI reaches 75.1% accuracy in zero-shot screenshot grounding, cuts redundant visual tokens by 33% during training, and speeds up performance by 1.4x.
Link: https://arxiv.org/abs/2411.17465
Authors: Kevin Qinghong Lin,Linjie Li,Difei Gao,Zhengyuan Yang,Shiwei Wu,Zechen Bai,Weixian Lei,Lijuan Wang,Mike Zheng Shou
Keywords-EN: Building Graphical User, Graphical User Interface, Building Graphical, User Interface, Graphical User
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Technical Report. Github: this https URL
Abstract:Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection to reduce computational costs by formulating screenshots as an UI connected graph, adaptively identifying their redundant relationship and serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets by careful data curation and employing a resampling strategy to address significant data type imbalances. With above components, ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces 33% of redundant visual tokens during training and speeds up the performance by 1.4x. Navigation experiments across web Mind2Web, mobile AITW, and online MiniWob environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at this https URL.
[NLP-14] FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval
[Quick Read]: This paper tackles the challenges of few-shot cross-modal retrieval (CMR) caused by feature degradation in the target domain and extreme data imbalance. The key to the solution is FLEX-CLIP, a feature-level generation network enhanced CLIP model with two training stages. First, a composite multimodal VAE-GAN network generates pseudo samples based on CLIP features to address data imbalance; second, a gated residual network fuses CLIP features with projected features to reduce feature degradation in X-shot scenarios. Experiments on four benchmark datasets show a 7%-15% improvement over state-of-the-art methods.
Link: https://arxiv.org/abs/2411.17454
Authors: Jingyou Xie,Jiayi Kuang,Zhenzhou Lin,Jiarui Ouyang,Zishuo Zhao,Ying Shen
Keywords-EN: retrieves semantically similar, semantically similar instances, few-shot cross-modal retrieval, domain including classes, classical few-shot CMR
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality with the target domain including classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods like CLIP have shown great few-shot or zero-shot learning performance. However, they still suffer challenges due to (1) the feature degradation encountered in the target domain and (2) the extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In multimodal feature generation, we propose a composite multimodal VAE-GAN network to capture real feature distribution patterns and generate pseudo samples based on CLIP features, addressing data imbalance. For common space projection, we develop a gate residual network to fuse CLIP features with projected features, reducing feature degradation in X-shot scenarios. Experimental results on four benchmark datasets show a 7%-15% improvement over state-of-the-art methods, with ablation studies demonstrating enhancement of CLIP features.
[NLP-15] VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
[Quick Read]: This paper addresses the under-explored evaluation of vision-language generative reward models (VL-GenRMs). Current evaluation mainly relies on AI-annotated preference labels from traditional vision-language tasks, which can introduce bias and often fail to challenge state-of-the-art models. The proposed solution is VL-RewardBench, a comprehensive benchmark covering general multimodal queries, visual hallucination detection, and complex reasoning tasks. The key is an AI-assisted annotation pipeline that combines sample selection with human verification to curate 1,250 high-quality examples designed to probe model limitations. Experiments show that the benchmark effectively challenges existing models: even top models such as GPT-4o reach only 65.4% accuracy, and open-source models such as Qwen2-VL-72B struggle to beat random guessing. Performance on VL-RewardBench also correlates strongly with MMMU-Pro accuracy (Pearson's r 0.9) under Best-of-N sampling, and analysis experiments yield three key insights for improving VL-GenRMs.
Link: https://arxiv.org/abs/2411.17451
Authors: Lei Li,Yuancheng Wei,Zhihui Xie,Xuqing Yang,Yifan Song,Peiyi Wang,Chenxin An,Tianyu Liu,Sujian Li,Bill Yuchen Lin,Lingpeng Kong,Qi Liu
Keywords-EN: Vision-language generative reward, evaluation remains under-explored, generative reward models, play a crucial, remains under-explored
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project page: this https URL
Abstract:Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models, demonstrates VL-RewardBench’s effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B, struggle to surpass random-guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson’s r 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.
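Best-of-N sampling with a reward model, used above for the correlation analysis, is straightforward to sketch: score N candidate responses and keep the highest-scoring one. The toy reward function below is a placeholder; in the paper's setting the scorer would be a VL-GenRM judging image-query-response triples.

```python
def best_of_n(candidates, reward_fn):
    """Pick the candidate response the reward model scores highest."""
    scored = [(reward_fn(c), c) for c in candidates]
    return max(scored)[1]

# Placeholder reward: a keyword bonus minus a length penalty, standing in for
# a learned vision-language generative reward model.
def toy_reward(response: str) -> float:
    return ("because" in response) * 1.0 - 0.01 * len(response)

candidates = [
    "The answer is 42.",
    "The answer is 42 because the chart's y-axis peaks there.",
    "I am not sure.",
]
print(best_of_n(candidates, toy_reward))
```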
[NLP-16] “Stupid robot I want to speak to a human!” User Frustration Detection in Task-Oriented Dialog Systems
【速读】: 该论文试图解决在面向任务的对话系统 (Task-Oriented Dialog, TOD) 中检测用户挫败感的问题,以提升用户满意度、参与度和留存率。解决方案的关键在于评估和比较不同方法在实际部署环境中的可行性和效果,包括基于关键词的方法、开源情感分析方法、对话中断检测方法以及基于大语言模型 (LLM) 的上下文学习检测方法。研究结果表明,开源方法在实际应用中存在局限性,而基于LLM的方法在内部基准测试中表现出优越性,F1分数相对提高了16%。
链接: https://arxiv.org/abs/2411.17437
作者: Mireia Hernandez Caralt,Ivan Sekulić,Filip Carević,Nghia Khau,Diana Nicoleta Popa,Bruna Guedes,Victor Guimarães,Zeyu Yang,Andre Manso,Meghana Reddy,Paolo Rosso,Roland Mathis
关键词-EN: Detecting user frustration, Detecting user, modern-day task-oriented dialog, user frustration, modern-day task-oriented
类目: Computation and Language (cs.CL)
备注:
Abstract:Detecting user frustration in modern-day task-oriented dialog (TOD) systems is imperative for maintaining overall user satisfaction, engagement, and retention. However, most recent research is focused on sentiment and emotion detection in academic settings, thus failing to fully encapsulate implications of real-world user data. To mitigate this gap, in this work, we focus on user frustration in a deployed TOD system, assessing the feasibility of out-of-the-box solutions for user frustration detection. Specifically, we compare the performance of our deployed keyword-based approach, open-source approaches to sentiment analysis, dialog breakdown detection methods, and emerging in-context learning LLM-based detection. Our analysis highlights the limitations of open-source methods for real-world frustration detection, while demonstrating the superior performance of the LLM-based approach, achieving a 16% relative improvement in F1 score on an internal benchmark. Finally, we analyze advantages and limitations of our methods and provide an insight into user frustration detection task for industry practitioners.
[NLP-17] One Mind Many Tongues: A Deep Dive into Language-Agnostic Knowledge Neurons in Large Language Models
[Quick Read]: This paper addresses the unclear knowledge-storage mechanism in large language models (LLMs) and the lack of cross-lingual analysis. The key to the solution is a new benchmark, Rephrased Multilingual LAMA (RML-LAMA), together with a novel method named Multilingual Integrated Gradients with Uncertainty Estimation (MATRICE). RML-LAMA provides high-quality cloze-style multilingual parallel queries for localizing the knowledge behind each fact, while MATRICE quantifies the uncertainty across queries and languages during knowledge localization, improving the accuracy of locating knowledge neurons. Experiments show that the method accurately localizes language-agnostic knowledge neurons, and the paper further studies their role in cross-lingual knowledge editing, knowledge enhancement, and new knowledge injection.
Link: https://arxiv.org/abs/2411.17401
Authors: Pengfei Cao,Yuheng Chen,Zhuoran Jin,Yubo Chen,Kang Liu,Jun Zhao
Keywords-EN: Large language models, learned vast amounts, knowledge, Large language, knowledge neurons
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have learned vast amounts of factual knowledge through self-supervised pre-training on large-scale corpora. Meanwhile, LLMs have also demonstrated excellent multilingual capabilities, which can express the learned knowledge in multiple languages. However, the knowledge storage mechanism in LLMs still remains mysterious. Some researchers attempt to demystify the factual knowledge in LLMs from the perspective of knowledge neurons, and subsequently discover language-agnostic knowledge neurons that store factual knowledge in a form that transcends language barriers. However, the preliminary finding suffers from two limitations: 1) High Uncertainty in Localization Results. Existing study only uses a prompt-based probe to localize knowledge neurons for each fact, while LLMs cannot provide consistent answers for semantically equivalent queries. Thus, it leads to inaccurate localization results with high uncertainty. 2) Lack of Analysis in More Languages. The study only analyzes language-agnostic knowledge neurons on English and Chinese data, without exploring more language families and languages. Naturally, it limits the generalizability of the findings. To address aforementioned problems, we first construct a new benchmark called Rephrased Multilingual LAMA (RML-LAMA), which contains high-quality cloze-style multilingual parallel queries for each fact. Then, we propose a novel method named Multilingual Integrated Gradients with Uncertainty Estimation (MATRICE), which quantifies the uncertainty across queries and languages during knowledge localization. Extensive experiments show that our method can accurately localize language-agnostic knowledge neurons. We also further investigate the role of language-agnostic knowledge neurons in cross-lingual knowledge editing, knowledge enhancement and new knowledge injection.
[NLP-18] Can LLMs be Good Graph Judger for Knowledge Graph Construction?
[Quick Read]: This paper tackles the challenge of constructing high-quality knowledge graphs (KGs) from unstructured text. The key to the solution is GraphJudger, a KG construction framework with three innovative modules that address the limitations of existing methods: (1) entity-centric iterative text denoising, to cope with the large amount of information and noise in real-world documents; (2) knowledge-aware instruction tuning, to improve the extraction of accurate knowledge from domain-specific documents; and (3) graph judgement, which uses large language models (LLMs) as graph judges rather than mere predictors, reducing the impact of hallucinations. Experiments on general and domain-specific text-graph pair datasets show that GraphJudger outperforms baseline methods.
Link: https://arxiv.org/abs/2411.17388
Authors: Haoyu Huang,Chong Chen,Conghui He,Yang Li,Jiawei Jiang,Wentao Zhang
Keywords-EN: data obtained, Large Language Models, language, natural language, real-world scenarios
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:In real-world scenarios, most of the data obtained from information retrieval (IR) system is unstructured. Converting natural language sentences into structured Knowledge Graphs (KGs) remains a critical challenge. The quality of constructed KGs may also impact the performance of some KG-dependent domains like GraphRAG systems and recommendation systems. Recently, Large Language Models (LLMs) have demonstrated impressive capabilities in addressing a wide range of natural language processing tasks. However, there are still challenges when utilizing LLMs to address the task of generating structured KGs. And we have identified three limitations with respect to existing KG construction methods. (1)There is a large amount of information and excessive noise in real-world documents, which could result in extracting messy information. (2)Native LLMs struggle to effectively extract accuracy knowledge from some domain-specific documents. (3)Hallucinations phenomenon cannot be overlooked when utilizing LLMs directly as an unsupervised method for constructing KGs. In this paper, we propose GraphJudger, a knowledge graph construction framework to address the aforementioned challenges. We introduce three innovative modules in our method, which are entity-centric iterative text denoising, knowledge aware instruction tuning and graph judgement, respectively. We seek to utilize the capacity of LLMs to function as a graph judger, a capability superior to their role only as a predictor for KG construction problems. Experiments conducted on two general text-graph pair datasets and one domain-specific text-graph pair dataset show superior performances compared to baseline methods. The code of our proposed method is available at this https URL.
[NLP-19] The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations
[Quick Read]: This paper addresses the lack of reliable citation when generative LLMs synthesize information and users' need for verifiable sources on high-stakes queries. The key to the solution is the extractive-abstractive spectrum, which treats search engines and generative LLMs as the two extremes of a spectrum and defines five intermediate operating points. With this framing, the paper examines the interplay between the verifiability and the utility of information tools and validates the operating points through human evaluation across diverse query distributions. The study finds that as outputs become more abstractive, perceived utility improves substantially, but the proportion of properly cited sentences drops sharply and users take longer to verify cited information. These findings suggest distinct operating points for domain-specific LLM systems and inform approaches to high-utility LLM systems that help users verify information.
Link: https://arxiv.org/abs/2411.17375
Authors: Theodora Worledge,Tatsunori Hashimoto,Carlos Guestrin
Keywords-EN: academic study, experts cite, fields of academic, search engines, information
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Across all fields of academic study, experts cite their sources when sharing information. While large language models (LLMs) excel at synthesizing information, they do not provide reliable citation to sources, making it difficult to trace and verify the origins of the information they present. In contrast, search engines make sources readily accessible to users and place the burden of synthesizing information on the user. Through a survey, we find that users prefer search engines over LLMs for high-stakes queries, where concerns regarding information provenance outweigh the perceived utility of LLM responses. To examine the interplay between verifiability and utility of information-sharing tools, we introduce the extractive-abstractive spectrum, in which search engines and LLMs are extreme endpoints encapsulating multiple unexplored intermediate operating points. Search engines are extractive because they respond to queries with snippets of sources with links (citations) to the original webpages. LLMs are abstractive because they address queries with answers that synthesize and logically transform relevant information from training and in-context sources without reliable citation. We define five operating points that span the extractive-abstractive spectrum and conduct human evaluations on seven systems across four diverse query distributions that reflect real-world QA settings: web search, language simplification, multi-step reasoning, and medical advice. As outputs become more abstractive, we find that perceived utility improves by as much as 200%, while the proportion of properly cited sentences decreases by as much as 50% and users take up to 3 times as long to verify cited information. Our findings recommend distinct operating points for domain-specific LLM systems and our failure analysis informs approaches to high-utility LLM systems that empower users to verify information.
[NLP-20] Fairness And Performance In Harmony: Data Debiasing Is All You Need
[Quick Read]: This paper studies fairness in machine learning models used for university admission decisions, where algorithmic and data bias affect predictions and subjectivity and cognitive bias affect human decisions. The key elements are: 1) assessing individual fairness by comparing decision consistency between experts of varied backgrounds and three ML models (XGB, Bi-LSTM, KNN), where the ML models outperform humans in fairness by 14.08% to 18.79%; and 2) proposing and validating a gender-debiasing pipeline that removes gender-specific language without hurting prediction performance, showing that fairness and performance can coexist. The paper ultimately advocates a hybrid approach combining human judgement and ML models to improve the fairness and accuracy of admission decisions.
Link: https://arxiv.org/abs/2411.17374
Authors: Junhua Liu,Wendy Wan Yee Hui,Roy Ka-Wei Lee,Kwan Hui Lim
Keywords-EN: data bias, cognitive bias, machine learning, prone to algorithmic, algorithmic and data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Fairness in both machine learning (ML) predictions and human decisions is critical, with ML models prone to algorithmic and data bias, and human decisions affected by subjectivity and cognitive bias. This study investigates fairness using a real-world university admission dataset with 870 profiles, leveraging three ML models, namely XGB, Bi-LSTM, and KNN. Textual features are encoded with BERT embeddings. For individual fairness, we assess decision consistency among experts with varied backgrounds and ML models, using a consistency score. Results show ML models outperform humans in fairness by 14.08% to 18.79%. For group fairness, we propose a gender-debiasing pipeline and demonstrate its efficacy in removing gender-specific language without compromising prediction performance. Post-debiasing, all models maintain or improve their classification accuracy, validating the hypothesis that fairness and performance can coexist. Our findings highlight ML’s potential to enhance fairness in admissions while maintaining high accuracy, advocating a hybrid approach combining human judgement and ML models.
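A rough sketch of what a rule-based gender-debiasing pass over profile text could look like before the text is encoded; the word list, replacements, and regex approach are illustrative assumptions, not the paper's pipeline.

```python
import re

# Illustrative mapping only; a real pipeline would use a vetted lexicon.
NEUTRAL = {
    r"\bhe\b": "they", r"\bshe\b": "they",
    r"\bhis\b": "their", r"\bher\b": "their",
    r"\bhimself\b": "themself", r"\bherself\b": "themself",
    r"\bchairman\b": "chairperson", r"\bchairwoman\b": "chairperson",
}

def debias_gender(text: str) -> str:
    """Replace gender-specific words with neutral alternatives."""
    out = text
    for pattern, repl in NEUTRAL.items():
        out = re.sub(pattern, repl, out, flags=re.IGNORECASE)
    return out

profile = "She led her robotics club as chairwoman and built her first robot at 12."
print(debias_gender(profile))
# -> "they led their robotics club as chairperson and built their first robot at 12."
```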
[NLP-21] Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach NEURIPS2024
[Quick Read]: This paper addresses the problem of bias in large language models (LLMs), and in particular how to define and evaluate an unbiased state. The key to the solution is a novel evaluation metric based on facts and real-world statistics, as an alternative to traditional equality-based evaluation. Through a human survey, the paper shows that people rate LLM outputs more positively when they closely match real-world demographic distributions. Evaluating various LLMs with the proposed metric reveals that model bias varies with the criteria used, underscoring the importance of assessing bias from multiple perspectives.
Link: https://arxiv.org/abs/2411.17338
Authors: Changgeon Ko,Jisu Shin,Hoyun Song,Jeongyeon Seo,Jong C. Park
Keywords-EN: Large language models, Large language, reflect real-world biases, leading to efforts, efforts to mitigate
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted in NeurIPS 2024 Workshop on Socially Responsible Language Modelling Research (SoLaR)
Abstract:Large language models (LLMs) often reflect real-world biases, leading to efforts to mitigate these effects and make the models unbiased. Achieving this goal requires defining clear criteria for an unbiased state, with any deviation from these criteria considered biased. Some studies define an unbiased state as equal treatment across diverse demographic groups, aiming for balanced outputs from LLMs. However, differing perspectives on equality and the importance of pluralism make it challenging to establish a universal standard. Alternatively, other approaches propose using fact-based criteria for more consistent and objective evaluations, though these methods have not yet been fully applied to LLM bias assessments. Thus, there is a need for a metric with objective criteria that offers a distinct perspective from equality-based approaches. Motivated by this need, we introduce a novel metric to assess bias using fact-based criteria and real-world statistics. In this paper, we conducted a human survey demonstrating that humans tend to perceive LLM outputs more positively when they align closely with real-world demographic distributions. Evaluating various LLMs with our proposed metric reveals that model bias varies depending on the criteria used, highlighting the need for multi-perspective assessment.
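As an illustration of comparing model outputs against real-world statistics (the paper's exact metric is not given in the abstract), the sketch below measures the total variation distance between the distribution of a model's answers over demographic groups and a reference real-world distribution; the group names, shares, and distance choice are hypothetical.

```python
import numpy as np
from collections import Counter

def answer_distribution(answers, groups):
    counts = Counter(answers)
    total = sum(counts[g] for g in groups) or 1
    return np.array([counts[g] / total for g in groups])

def fact_based_gap(model_answers, real_world_share, groups):
    """Total variation distance between model outputs and real-world statistics:
    0 means the answers match the reference distribution, 1 is maximally different."""
    p = answer_distribution(model_answers, groups)
    q = np.array([real_world_share[g] for g in groups])
    return 0.5 * np.abs(p - q).sum()

groups = ["group_a", "group_b"]
real_world_share = {"group_a": 0.7, "group_b": 0.3}       # hypothetical statistic
model_answers = ["group_a"] * 50 + ["group_b"] * 50       # model answers 50/50
print(round(fact_based_gap(model_answers, real_world_share, groups), 3))   # 0.2
```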
[NLP-22] Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning
[Quick Read]: This paper addresses cognitive biases in large language models (LLMs) and their over-reliance on external knowledge. The key to the solution is a new method called "hashing", which masks potentially bias-inducing words with hash-like meaningless identifiers to reduce cognitive biases and the dependence on external knowledge. Across several experiments covering different LLM models and task types, the method yields significant improvements, although its effect on hallucination rates is inconsistent.
Link: https://arxiv.org/abs/2411.17304
Authors: Milena Chadimová,Eduard Jurášek,Tomáš Kliegr
Keywords-EN: hash-like meaningless identifiers, large language models, involves masking potentially, potentially bias-inducing words, reduce cognitive biases
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces a novel method, referred to as “hashing”, which involves masking potentially bias-inducing words in large language models (LLMs) with hash-like meaningless identifiers to reduce cognitive biases and reliance on external knowledge. The method was tested across three sets of experiments involving a total of 490 prompts. Statistical analysis using chi-square tests showed significant improvements in all tested scenarios, which covered LLama, ChatGPT, Copilot, Gemini and Mixtral models. In the first experiment, hashing decreased the fallacy rate in a modified version of the “Linda” problem aimed at evaluating susceptibility to cognitive biases. In the second experiment, it improved LLM results on the frequent itemset extraction task. In the third experiment, we found hashing is also effective when the Linda problem is presented in a tabular format rather than text, indicating that the technique works across various input representations. Overall, the method was shown to improve bias reduction and incorporation of external knowledge. Despite bias reduction, hallucination rates were inconsistently reduced across types of LLM models. These findings suggest that masking bias-inducing terms can improve LLM performance, although its effectiveness is model- and task-dependent.
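A minimal sketch of the masking step: potentially bias-inducing words are replaced by short hash-like identifiers before prompting, and a mapping is kept so the identifiers can be translated back afterwards. The flagged-word list and the identifier format are assumptions for illustration.

```python
import hashlib

def hash_words(prompt: str, flagged: list[str]):
    """Replace potentially bias-inducing words with meaningless hash-like IDs."""
    mapping = {}
    out = prompt
    for word in flagged:
        tag = "X" + hashlib.md5(word.lower().encode()).hexdigest()[:6]
        mapping[tag] = word
        out = out.replace(word, tag)
    return out, mapping    # the mapping lets you un-hash the model's answer later

prompt = ("Linda is 31, outspoken and a philosophy graduate. "
          "Is she more likely to be a bank teller, or a bank teller and a feminist?")
masked, mapping = hash_words(prompt, ["philosophy", "feminist"])
print(masked)
print(mapping)
```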
[NLP-23] ER2Score: LLM-based Explainable and Customizable Metric for Assessing Radiology Reports with Reward-Control Loss
[Quick Read]: This paper addresses the inadequate evaluation of automated radiology report generation (R2Gen): traditional metrics rely on rigid word matching or focus only on pathological entities, leading to inconsistencies with human assessment. The key to the solution is ER2Score, an automatic evaluation metric designed specifically for R2Gen. ER2Score uses a reward model trained with a margin-based reward enforcement loss and a tailored training-data design, so that evaluation criteria can be customized to user-defined needs, and it provides detailed sub-scores that improve interpretability. A GPT-4-based data generation pipeline produces extensive training data under two scoring systems, each containing reports of varying quality with corresponding scores; these reports are paired as accepted and rejected samples to train an LLM as a fine-grained reward model that assigns higher rewards to high-quality reports. ER2Score outputs both an overall score and individual scores for each evaluation item, and it shows higher correlation with human judgments and better model-selection performance than traditional metrics.
Link: https://arxiv.org/abs/2411.17301
Authors: Yunyi Liu,Yingshu Li,Zhanyu Wang,Xinyu Liang,Lingqiao Liu,Lei Wang,Luping Zhou
Keywords-EN: Automated radiology report, Automated radiology, accurate evaluation due, advanced significantly, introducing challenges
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Automated radiology report generation (R2Gen) has advanced significantly, introducing challenges in accurate evaluation due to its complexity. Traditional metrics often fall short by relying on rigid word-matching or focusing only on pathological entities, leading to inconsistencies with human assessments. To bridge this gap, we introduce ER2Score, an automatic evaluation metric designed specifically for R2Gen. Our metric utilizes a reward model, guided by our margin-based reward enforcement loss, along with a tailored training data design that enables customization of evaluation criteria to suit user-defined needs. It not only scores reports according to user-specified criteria but also provides detailed sub-scores, enhancing interpretability and allowing users to adjust the criteria between different aspects of reports. Leveraging GPT-4, we designed an easy-to-use data generation pipeline, enabling us to produce extensive training data based on two distinct scoring systems, each containing reports of varying quality along with corresponding scores. These GPT-generated reports are then paired as accepted and rejected samples through our pairing rule to train an LLM towards our fine-grained reward model, which assigns higher rewards to the report with high quality. Our reward-control loss enables this model to simultaneously output multiple individual rewards corresponding to the number of evaluation criteria, with their summation as our final ER2Score. Our experiments demonstrate ER2Score’s heightened correlation with human judgments and superior performance in model selection compared to traditional metrics. Notably, our model provides both an overall score and individual scores for each evaluation item, enhancing interpretability. We also demonstrate its flexible training across various evaluation systems.
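The margin-based reward idea can be sketched as a pairwise hinge loss that pushes the accepted report's reward above the rejected one's by at least a margin, per evaluation criterion. The numbers, margin value, and criterion names below are placeholders; the paper's actual loss and reward-model details may differ.

```python
import numpy as np

def margin_reward_loss(r_accepted, r_rejected, margin=1.0):
    """Pairwise hinge loss: push the accepted report's reward above the
    rejected one's by at least `margin` (one term per evaluation criterion)."""
    r_accepted, r_rejected = np.asarray(r_accepted), np.asarray(r_rejected)
    return np.maximum(0.0, margin - (r_accepted - r_rejected)).mean()

# Hypothetical per-criterion rewards for one accepted/rejected report pair
r_good = [0.9, 0.4, 0.7]    # e.g., correctness, completeness, clarity
r_bad  = [0.2, 0.5, 0.1]
print(margin_reward_loss(r_good, r_bad))
# Per the abstract, the final ER2Score is the sum of the individual rewards,
# i.e. sum(r_good) for the report being scored.
```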
[NLP-24] 2D Matryoshka Training for Information Retrieval
[Quick Read]: This paper addresses the discrepancies between different published implementations of 2D Matryoshka Training and evaluates the approach on Semantic Text Similarity (STS) and retrieval tasks. The key is reproducing and comparing the two implementations: although both train sub-layer and sub-dimension setups more effectively than traditional Matryoshka training and full-sized model training, neither outperforms models trained separately for specific sub-layers and sub-dimensions. The study further explores different loss computations and finds that incorporating a full-dimension loss and training over a broader range of target dimensions is more effective for retrieval tasks.
Link: https://arxiv.org/abs/2411.17299
Authors: Shuai Wang,Shengyao Zhuang,Bevan Koopman,Guido Zuccon
Keywords-EN: Semantic Text Similarity, advanced embedding representation, representation training approach, training approach designed, Matryoshka Training
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:2D Matryoshka Training is an advanced embedding representation training approach designed to train an encoder model simultaneously across various layer-dimension setups. This method has demonstrated higher effectiveness in Semantic Text Similarity (STS) tasks over traditional training approaches when using sub-layers for embeddings. Despite its success, discrepancies exist between two published implementations, leading to varied comparative results with baseline models. In this reproducibility study, we implement and evaluate both versions of 2D Matryoshka Training on STS tasks and extend our analysis to retrieval tasks. Our findings indicate that while both versions achieve higher effectiveness than traditional Matryoshka training on sub-dimensions, and traditional full-sized model training approaches, they do not outperform models trained separately on specific sub-layer and sub-dimension setups. Moreover, these results generalize well to retrieval tasks, both in supervised (MSMARCO) and zero-shot (BEIR) settings. Further explorations of different loss computations reveals more suitable implementations for retrieval tasks, such as incorporating full-dimension loss and training on a broader range of target dimensions. Conversely, some intuitive approaches, such as fixing document encoders to full model outputs, do not yield improvements. Our reproduction code is available at this https URL.
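A sketch of the core Matryoshka ingredient these implementations vary: compute a training loss at several truncated embedding widths and average it, so that prefixes of the embedding stay useful on their own. The 2D variant additionally repeats this across sub-layers, which is omitted here; the in-batch contrastive loss, temperature, and target dimensions are assumptions.

```python
import numpy as np

def info_nce(q, d, temp=0.05):
    """In-batch contrastive loss: the i-th query should match the i-th document."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temp
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_loss(q, d, dims=(64, 128, 256, 768)):
    """Average the contrastive loss over several truncated embedding widths;
    including the full width (768) corresponds to adding a full-dimension loss."""
    return np.mean([info_nce(q[:, :k], d[:, :k]) for k in dims])

rng = np.random.default_rng(0)
queries, docs = rng.normal(size=(32, 768)), rng.normal(size=(32, 768))
print(round(matryoshka_loss(queries, docs), 3))
```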
[NLP-25] An Attempt to Develop a Neural Parser based on Simplified Head-Driven Phrase Structure Grammar on Vietnamese
[Quick Read]: This paper addresses the fact that parts of the existing Vietnamese corpora (VietTreebank and VnDT) do not adhere to the rules of simplified Head-Driven Phrase Structure Grammar (HPSG). The key to the solution is randomly permuting samples in the training and development sets so that they comply with simplified HPSG, and replacing the original Penn Treebank-based simplified HPSG neural parser components with PhoBERT or XLM-RoBERTa models that can encode Vietnamese text. Experiments show that the modified simplified HPSG neural parser achieves a state-of-the-art F-score of 82% for constituency parsing and a higher Unlabeled Attachment Score (UAS) for dependency parsing, although the Labeled Attachment Score (LAS) is lower, likely because arcs were permuted without changing the original labels and without consulting a linguistic expert.
Link: https://arxiv.org/abs/2411.17270
Authors: Duc-Vu Nguyen,Thang Chau Phan,Quoc-Nam Nguyen,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
Keywords-EN: Phrase Structure Grammar, Head-Driven Phrase Structure, Structure Grammar, simplified Head-Driven Phrase, Phrase Structure
Subjects: Computation and Language (cs.CL)
Comments: Accepted at SoICT 2024
Abstract:In this paper, we aimed to develop a neural parser for Vietnamese based on simplified Head-Driven Phrase Structure Grammar (HPSG). The existing corpora, VietTreebank and VnDT, had around 15% of constituency and dependency tree pairs that did not adhere to simplified HPSG rules. To attempt to address the issue of the corpora not adhering to simplified HPSG rules, we randomly permuted samples from the training and development sets to make them compliant with simplified HPSG. We then modified the first simplified HPSG Neural Parser for the Penn Treebank by replacing it with the PhoBERT or XLM-RoBERTa models, which can encode Vietnamese texts. We conducted experiments on our modified VietTreebank and VnDT corpora. Our extensive experiments showed that the simplified HPSG Neural Parser achieved a new state-of-the-art F-score of 82% for constituency parsing when using the same predicted part-of-speech (POS) tags as the self-attentive constituency parser. Additionally, it outperformed previous studies in dependency parsing with a higher Unlabeled Attachment Score (UAS). However, our parser obtained lower Labeled Attachment Score (LAS) scores likely due to our focus on arc permutation without changing the original labels, as we did not consult with a linguistic expert. Lastly, the research findings of this paper suggest that simplified HPSG should be given more attention to linguistic expert when developing treebanks for Vietnamese natural language processing.
zh
[NLP-26] A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLM s
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在行为上与人类偏好对齐的问题,特别是如何在不依赖大量人类专家或外部AI系统的情况下,提高模型输出的可信度和减少幻觉(hallucination)。解决方案的关键是引入了一种名为主题级偏好重写(Topic-level Preference Overwriting, TPO)的自校正方法。TPO通过模型自身生成的最佳和最差替代方案来替换响应中的每个主题,从而创建更具对比性的成对偏好反馈,显著提升了反馈质量,同时避免了外部干预和资源开销。实验结果表明,TPO在可信度方面达到了最先进的性能,对象幻觉减少了92%,总体幻觉减少了38%。
链接: https://arxiv.org/abs/2411.17265
作者: Lehan He,Zeren Chen,Zhelun Shi,Tianyu Yu,Jing Shao,Lu Sheng
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Aligning the behaviors
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Aligning the behaviors of Multimodal Large Language Models (MLLMs) with human preferences is crucial for developing robust and trustworthy AI systems. While recent attempts have employed human experts or powerful auxiliary AI systems to provide more accurate preference feedback, such as determining the preferable responses from MLLMs or directly rewriting hallucination-free responses, the extensive resource overhead compromises the scalability of the feedback collection. In this work, we introduce Topic-level Preference Overwriting (TPO), a self-correctional approach that guides the model itself to mitigate its own hallucination at the topic level. Through a deconfounded strategy that replaces each topic within the response with the best or worst alternatives generated by the model itself, TPO creates more contrasting pairwise preference feedback, enhancing the feedback quality without human or proprietary model intervention. Notably, the experimental results demonstrate that the proposed TPO achieves state-of-the-art performance in trustworthiness, significantly reducing the object hallucinations by 92% and overall hallucinations by 38%. Code, model and data will be released.
zh
[NLP-27] Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks
【速读】: 该论文试图解决的问题是如何全面评估和比较不同大型语言模型(Large Language Models, LLMs)在多种对话任务中的表现,以确定最适合特定任务的模型。解决方案的关键在于采用了一个综合的评估框架,该框架结合了自动评估和人工评估,并使用了通用和任务特定的指标来准确衡量各模型在预订、共情响应生成、心理健康与法律咨询、说服和谈判等对话任务中的性能。通过这种多维度的评估方法,研究揭示了不同模型在不同任务中的优劣,强调了在选择对话应用模型时应考虑任务的具体需求和特性。
链接: https://arxiv.org/abs/2411.17204
作者: Ratnesh Kumar Joshi,Priyanshu Priya,Vishesh Desai,Saurav Dudhate,Siddhant Senapati,Asif Ekbal,Roshni Ramnani,Anutosh Maitra
关键词-EN: Large Language Models, Large Language, conversational artificial intelligence, assessment of Large, artificial intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 37 pages, 12 tables
点击查看摘要
Abstract:Given the advancements in conversational artificial intelligence, the evaluation and assessment of Large Language Models (LLMs) play a crucial role in ensuring optimal performance across various conversational tasks. In this paper, we present a comprehensive study that thoroughly evaluates the capabilities and limitations of five prevalent LLMs: Llama, OPT, Falcon, Alpaca, and MPT. The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation. To conduct the evaluation, an extensive test setup is employed, utilizing multiple evaluation criteria that span from automatic to human evaluation. This includes using generic and task-specific metrics to gauge the LMs’ performance accurately. From our evaluation, no single model emerges as universally optimal for all tasks. Instead, their performance varies significantly depending on the specific requirements of each task. While some models excel in certain tasks, they may demonstrate comparatively poorer performance in others. These findings emphasize the importance of considering task-specific requirements and characteristics when selecting the most suitable LM for conversational applications.
zh
[NLP-28] Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment
【速读】: 该论文试图解决生成式文本与图像交错内容(interleaved text-and-image generation)中的不一致性问题,特别是在确保文本步骤与伴随图像之间的连贯性和准确性方面。解决方案的关键在于提出了ISG (Interleaved Scene Graph) 评估框架,该框架利用场景图(scene graph)结构来捕捉文本和图像块之间的关系,并通过四个层次的粒度(整体、结构、块级和图像特定)进行评估。这种多层次的评估方法能够细致地评估内容的一致性、连贯性和准确性,并提供可解释的问答反馈。此外,论文还引入了ISG-Bench基准数据集和ISG-Agent基线代理,以促进未来在该领域的研究。
链接: https://arxiv.org/abs/2411.17188
作者: Dongping Chen,Ruoxi Chen,Shu Pu,Zhaoyi Liu,Yanru Wu,Caixi Chen,Benlin Liu,Yue Huang,Yao Wan,Pan Zhou,Ranjay Krishna
关键词-EN: real-world user queries, egg fried rice, make egg fried, user queries, fried rice
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Many real-world user queries (e.g. “How do to make egg fried rice?”) could benefit from systems capable of generating responses with both textual steps with accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses on four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly on generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a “plan-execute-refine” pipeline to invoke tools, achieving a 122% performance improvement.
zh
[NLP-29] A Novel Word Pair-based Gaussian Sentence Similarity Algorithm For Bengali Extractive Text Summarization
【速读】: 该论文试图解决孟加拉语(Bengali)文本摘要中语义关系表达不准确的问题。现有的方法,如基于统计的TF-IDF或简单的词平均技术(word averaging technique),无法正确捕捉句子间的语义关系。论文提出的解决方案是基于词对的高斯句子相似度(Word pair-based Gaussian Sentence Similarity, WGSS)算法,通过计算词嵌入向量的几何平均高斯相似度值来衡量句子间的语义关系。WGSS通过逐词比较来修正词平均方法中的句子表示问题,并结合谱聚类(Spectral Clustering)算法将语义相似的句子分组,再利用TF-IDF排序从每个聚类中选取最佳句子。实验结果表明,该方法在ROUGE评分上平均优于其他模型43.2%,并在其他低资源语言(如土耳其语、马拉地语和印地语)中也表现出类似的效果。
链接: https://arxiv.org/abs/2411.17181
作者: Fahim Morshed,Md. Abdur Rahman,Sumon Ahmed
关键词-EN: Extractive Text Summarization, Extractive Text, Text Summarization, larger text, representative parts
类目: Computation and Language (cs.CL)
备注: Submitted to ACM Transaction on Asian and Low-resource Language Information Processing
点击查看摘要
Abstract:Extractive Text Summarization is the process of selecting the most representative parts of a larger text without losing any key information. Recent attempts at extractive text summarization in Bengali either relied on statistical techniques like TF-IDF or used naive sentence similarity measures like the word averaging technique. All of these strategies struggle to express semantic relationships correctly. Here, we propose a novel Word pair-based Gaussian Sentence Similarity (WGSS) algorithm for calculating the semantic relation between two sentences. WGSS takes the geometric means of individual Gaussian similarity values of word embedding vectors to get the semantic relationship between sentences. It compares two sentences on a word-to-word basis which rectifies the sentence representation problem faced by the word averaging method. The summarization process extracts key sentences by grouping semantically similar sentences into clusters using the Spectral Clustering algorithm. After clustering, we use TF-IDF ranking to pick the best sentence from each cluster. The proposed method is validated using four different datasets, and it outperformed other recent models by 43.2% on average ROUGE scores (ranging from 2.5% to 95.4%). We also experimented on other low-resource languages, i.e., Turkish, Marathi, and Hindi, and find that the proposed method performs similarly to how it does for Bengali. In addition, a new high-quality Bengali dataset is curated which contains 250 articles and a pair of summaries for each of them. We believe this research is a crucial addition to Bengali Natural Language Processing (NLP) research and it can easily be extended into other low-resource languages. We made the implementation of the proposed model and data public on this https URL.
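下面给出一个按摘要思路自行拼出的示意流程(并非论文原实现):用高斯核度量词向量相似度,以几何平均聚合得到句子相似度,再用谱聚类分组并以 TF-IDF 选句。其中词向量来源、σ 取值以及“逐词配对”的具体方式均为假设(此处简化为全部词对)。

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def gaussian_sim(u, v, sigma=1.0):
    """两个词向量之间的高斯相似度。"""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def wgss(sent_a, sent_b, emb, sigma=1.0):
    """句子相似度:全部词对高斯相似度的几何平均(配对方式为本示意的假设)。"""
    sims = [gaussian_sim(emb[a], emb[b], sigma)
            for a in sent_a for b in sent_b if a in emb and b in emb]
    if not sims:
        return 1e-12
    return float(np.exp(np.mean(np.log(np.clip(sims, 1e-12, None)))))

def summarize(sentences, emb, n_clusters=3):
    tokens = [s.lower().split() for s in sentences]           # 简化分词,仅作示意
    n = len(sentences)
    aff = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            aff[i, j] = aff[j, i] = wgss(tokens[i], tokens[j], emb)
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                random_state=0).fit_predict(aff)
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()            # 以 TF-IDF 总分作为句子得分
    summary_ids = [max((i for i in range(n) if labels[i] == c),
                       key=lambda i: scores[i])
                   for c in sorted(set(labels))]
    return [sentences[i] for i in sorted(summary_ids)]
```

其中 `emb` 假定为“词 → 向量”的字典(例如 fastText 词向量),实际系统还需处理未登录词与分词质量问题。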
zh
[NLP-30] Learning Monotonic Attention in Transducer for Streaming Generation
【速读】: 该论文试图解决Transducer架构在处理非单调对齐任务(如同时翻译)时的性能问题,其关键解决方案是通过引入可学习的单调注意力机制,将Transducer的解码过程与输入流的历史紧密结合。具体来说,论文利用前向-后向算法推断预测状态与输入时间戳之间的对齐后验概率,并据此在训练中估计单调注意力的上下文表示。这种方法使Transducer模型能够根据预测自适应调整注意力范围,从而避免枚举指数级大的对齐空间,显著提升了流式生成中非单调对齐的处理能力。
链接: https://arxiv.org/abs/2411.17170
作者: Zhengrui Ma,Yang Feng,Min Zhang
关键词-EN: industrial applications, increasingly utilized, popular in industrial, Transducer architecture, Streaming generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Codes: this https URL
点击查看摘要
Abstract:Streaming generation models are increasingly utilized across various fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these contexts. In this research, we address this issue by tightly integrating Transducer’s decoding with the history of input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention in training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution for Transducer-based frameworks to tackle more complex streaming generation tasks.
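下面是对“用前向-后向算法推断对齐后验,再以其期望作为单调注意力上下文”这一思路的极简 numpy 草图,系笔者根据摘要自行推演,并非 MonoAttn-Transducer 的官方实现;对数概率输入的构造与终止方式均为假设。

```python
import numpy as np

def alignment_posterior(log_emit, log_blank):
    """在 T x (U+1) 的 Transducer 对齐格点上做前向-后向。
    log_emit[t, u]:在 (t, u) 处发射第 u+1 个目标词的对数概率;
    log_blank[t, u]:在 (t, u) 处发射 blank(前进一帧)的对数概率。
    返回 gamma[t, u]:路径经过格点 (t, u) 的后验概率。"""
    T, U1 = log_blank.shape
    NEG = -1e30
    alpha = np.full((T, U1), NEG)
    beta = np.full((T, U1), NEG)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:
                alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t-1, u] + log_blank[t-1, u])
            if u > 0:
                alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t, u-1] + log_emit[t, u-1])
    beta[T-1, U1-1] = log_blank[T-1, U1-1]          # 假设以最后一帧的 blank 作为终止
    for t in range(T - 1, -1, -1):
        for u in range(U1 - 1, -1, -1):
            if t == T - 1 and u == U1 - 1:
                continue
            val = NEG
            if t + 1 < T:
                val = np.logaddexp(val, log_blank[t, u] + beta[t+1, u])
            if u + 1 < U1:
                val = np.logaddexp(val, log_emit[t, u] + beta[t, u+1])
            beta[t, u] = val
    log_z = alpha[T-1, U1-1] + beta[T-1, U1-1]
    return np.exp(alpha + beta - log_z)

def monotonic_context(gamma, enc_states):
    """预测第 u 步时的期望编码器上下文:c_u = sum_t P(t | u) * h_t。"""
    att = gamma / np.clip(gamma.sum(axis=0, keepdims=True), 1e-12, None)
    return att.T @ enc_states                        # 形状 (U+1, d)

# 用法示意:随机构造对数概率与编码器状态
T, U, d = 6, 3, 8
rng = np.random.default_rng(0)
log_emit = np.log(rng.uniform(0.1, 0.5, (T, U + 1)))
log_blank = np.log(1.0 - np.exp(log_emit))
gamma = alignment_posterior(log_emit, log_blank)
ctx = monotonic_context(gamma, rng.normal(size=(T, d)))
```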
zh
[NLP-31] Star Attention: Efficient LLM Inference over Long Sequences
【速读】: 该论文试图解决Transformer-based大型语言模型(LLMs)在处理长序列时由于自注意力机制(self-attention mechanism)的二次复杂度导致的计算成本高和推理速度慢的问题。解决方案的关键是引入了一种名为Star Attention的两阶段块稀疏近似方法,通过在多个主机间分片注意力(sharding attention)来提高计算效率,同时最小化通信开销。在第一阶段,上下文通过跨主机的块本地注意力(blockwise-local attention)并行处理;在第二阶段,查询和响应标记通过序列全局注意力(sequence-global attention)参与所有先前缓存的标记。Star Attention能够无缝集成到大多数使用全局注意力训练的Transformer-based LLMs中,显著减少内存需求和推理时间(最多可达11倍),同时保持95-100%的准确性。
链接: https://arxiv.org/abs/2411.17116
作者: Shantanu Acharya,Fei Jia,Boris Ginsburg
关键词-EN: Large Language Models, Transformer-based Large Language, Language Models, Large Language, Transformer-based Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL
点击查看摘要
Abstract:Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
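下面用 numpy 写一个单机版的两阶段示意(非官方实现),仅演示“上下文分块局部注意力 + 查询阶段对全部缓存 KV 做全局注意力”的流程;块大小为假设,且省略了多主机分片与通信细节。

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def star_attention(ctx_q, ctx_k, ctx_v, query_q, block=128):
    """阶段一:上下文按块做局部注意力(真实系统中各块并行分布在不同主机),同时缓存各块 KV;
    阶段二:查询/生成 token 对所有缓存的 KV 做全局注意力。"""
    n = ctx_q.shape[0]
    ctx_out, cache_k, cache_v = [], [], []
    for s in range(0, n, block):
        e = min(s + block, n)
        ctx_out.append(attention(ctx_q[s:e], ctx_k[s:e], ctx_v[s:e]))  # 块内局部注意力
        cache_k.append(ctx_k[s:e])
        cache_v.append(ctx_v[s:e])
    k_all = np.concatenate(cache_k, axis=0)
    v_all = np.concatenate(cache_v, axis=0)
    query_out = attention(query_q, k_all, v_all)                        # 全局注意力
    return np.concatenate(ctx_out, axis=0), query_out

# 用法示意
d = 64
rng = np.random.default_rng(0)
ctx = rng.normal(size=(1000, d))
ctx_out, q_out = star_attention(ctx, ctx, ctx, rng.normal(size=(4, d)), block=256)
```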
zh
[NLP-32] Don't Command, Cultivate: An Exploratory Study of System-2 Alignment
【速读】: 该论文试图解决生成式 AI (Generative AI) 模型在面对复杂安全威胁时的鲁棒性问题,特别是针对 o1 模型在遭受对抗性自然语言提示和数学编码提示攻击时的安全性表现。解决方案的关键在于引入 System-2 思维模式,通过细致的安全评估和实验,发现 o1 模型在某些攻击场景下仍存在漏洞,尤其是数学编码攻击。论文提出通过提示工程和监督微调技术来增强模型的安全对齐,并计划实施过程监督以进一步提升安全性。关键在于鼓励模型仔细审查用户请求,并通过实验验证了简单方法对提升模型安全性的有效性。
链接: https://arxiv.org/abs/2411.17075
作者: Yuhang Wang,Jitao Sang
关键词-EN: system card identifies, system card, robust within OpenAI, progression from rapid, deliberate reasoning
类目: Computation and Language (cs.CL)
备注: Preprint version, more results will be updated
点击查看摘要
Abstract:The o1 system card identifies the o1 models as the most robust within OpenAI, with their defining characteristic being the progression from rapid, intuitive thinking to slower, more deliberate reasoning. This observation motivated us to investigate the influence of System-2 thinking patterns on model safety. In our preliminary research, we conducted safety evaluations of the o1 model, including complex jailbreak attack scenarios using adversarial natural language prompts and mathematical encoding prompts. Our findings indicate that the o1 model demonstrates relatively improved safety performance; however, it still exhibits vulnerabilities, particularly against jailbreak attacks employing mathematical encoding. Through detailed case analysis, we identified specific patterns in the o1 model’s responses. We also explored the alignment of System-2 safety in open-source models using prompt engineering and supervised fine-tuning techniques. Experimental results show that some simple methods to encourage the model to carefully scrutinize user requests are beneficial for model safety. Additionally, we proposed a implementation plan for process supervision to enhance safety alignment. The implementation details and experimental results will be provided in future versions.
zh
[NLP-33] Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models
【速读】: 该论文试图解决多模态AI在逻辑运算符(logical operators)应用上的显著不足,特别是在关系(relations)、否定(negations)和离散数字(discrete numbers)的处理上。论文通过一系列实验发现,现有的最先进图像生成AI(如DALL-E 3)在处理这些逻辑探针(logical probes)时,无法达到超过50%的人类一致性评分。解决方案的关键在于提出了一种基于“接地扩散”(grounded diffusion)的管道,该管道利用目标导向的提示工程(prompt engineering)和结构化的中间表示(structured intermediate representations)来增强组合控制(compositional control)。然而,实验结果显示,这种改进方法的表现甚至不如DALL-E 3。论文进一步通过辅助分析和图示量化了成功和失败的原因,并提出了基于发展心理学和图像处理的微小修改,以缩小规模与结构之间的组合差距。
链接: https://arxiv.org/abs/2411.17066
作者: Colin Conwell,Rupert Tawiah-Quashie,Tomer Ullman
关键词-EN: logical operators, multi-modal AI research, remarkable progress, progress in multi-modal, salient domain
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注:
点击查看摘要
Abstract:Despite remarkable progress in multi-modal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) prompted with these 'logical probes', and find that none reliably produce human agreement scores greater than 50%. The negation probes and numbers (beyond 3) fail most frequently. In a 4th experiment, we assess a 'grounded diffusion' pipeline that leverages targeted prompt engineering and structured intermediate representations for greater compositional control, but find its performance is judged even worse than that of DALL-E 3 across prompts. To provide further clarity on potential sources of success and failure in these text-to-image systems, we supplement our 4 core experiments with multiple auxiliary analyses and schematic diagrams, directly quantifying, for example, the relationship between the N-gram frequency of relational prompts and the average match to generated images; the success rates for 3 different prompt modification strategies in the rendering of negation prompts; and the scalar variability / ratio dependence ('approximate numeracy') of prompts involving integers. We conclude by discussing the limitations inherent to 'grounded' multimodal learning systems whose grounding relies heavily on vector-based semantics (e.g. DALL-E 3), or under-specified syntactical constraints (e.g. 'grounded diffusion'), and propose minimal modifications (inspired by development, based in imagery) that could help to bridge the lingering compositional gap between scale and structure. All data and code is available at this https URL
zh
[NLP-34] Tree Transformers are an Ineffective Model of Syntactic Constituency
【速读】: 该论文试图解决的问题是当前最先进的语言模型是否能够有效地捕捉自然语言中的递归成分结构(constituent structures)。解决方案的关键在于研究Tree Transformer模型,该模型通过修改注意力机制来组织词元(tokens)形成成分结构。论文通过预训练大型Tree Transformer模型并评估其在需要成分结构的任务(如错误检测)中的表现,发现尽管Tree Transformer模型在某些任务中略微优于传统Transformer模型,但总体上缺乏证据表明Tree Transformer能够有效建模语法成分结构。
链接: https://arxiv.org/abs/2411.16993
作者: Michael Ginn
关键词-EN: Linguists have long, Tree Transformer, Tree, natural language syntax, suggested that current
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Linguists have long held that a key aspect of natural language syntax is the recursive organization of language units into constituent structures, and research has suggested that current state-of-the-art language models lack an inherent bias towards this feature. A number of alternative models have been proposed to provide inductive biases towards constituency, including the Tree Transformer, which utilizes a modified attention mechanism to organize tokens into constituents. We investigate Tree Transformers to study whether they utilize meaningful and/or useful constituent structures. We pretrain a large Tree Transformer on language modeling in order to investigate the learned constituent tree representations of sentences, finding little evidence for meaningful structures. Next, we evaluate Tree Transformers with similar transformer models on error detection tasks requiring constituent structure. We find that while the Tree Transformer models may slightly outperform at these tasks, there is little evidence to suggest a meaningful improvement. In general, we conclude that there is little evidence to support Tree Transformer as an effective model of syntactic constituency.
zh
[NLP-35] Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models
【速读】: 该论文试图解决在知识蒸馏(Knowledge Distillation, KD)过程中依赖复杂教师模型的问题,特别是在使用商业大型语言模型(LLMs)如GPT4时,传统KD方法可能难以实现或成本高昂。解决方案的关键在于提出了一种模型无关和任务无关的自蒸馏方法,称为从前一个minibatch动态自蒸馏(Dynamic SelfD from the Previous Minibatch, DynSDPB)。该方法通过从前一次迭代的生成的logits中进行蒸馏,实现了当前迭代的学习,同时动态调整蒸馏影响和温度值以提高微调的适应性。此外,DynSDPB作为一种新颖的微调策略,能够无缝集成现有的自校正和自训练技术,适用于小型语言模型(SLMs)的参数更新,并在自然语言理解(NLU)和自然语言生成(NLG)基准测试中验证了其有效性。
链接: https://arxiv.org/abs/2411.16991
作者: Yao Fu,Yin Yu,Xiaotian Han,Runchao Li,Xianxuan Long,Haotian Yu,Pan Li
关键词-EN: widely adopted approach, reduce computational costs, compressing large language, memory footprints, Knowledge distillation
类目: Computation and Language (cs.CL)
备注: Work in progress
点击查看摘要
Abstract:Knowledge distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational costs and memory footprints. However, the availability of complex teacher models is a prerequisite for running most KD pipelines. Thus, the traditional KD procedure can be unachievable or budget-unfriendly, particularly when relying on commercial LLMs like GPT4. In this regard, Self-distillation (SelfD) emerges as an advisable alternative, enabling student models to learn without teachers’ guidance. Nonetheless, existing SelfD approaches for LMs often involve architectural modifications, assuming the models are open-source, which may not always be practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous minibatch (DynSDPB), which realizes current iterations’ distillation from the last ones’ generated logits. Additionally, to address prediction inaccuracies during the early iterations, we dynamically adjust the distillation influence and temperature values to enhance the adaptability of fine-tuning. Furthermore, DynSDPB is a novel fine-tuning policy that facilitates the seamless integration of existing self-correction and self-training techniques for small language models (SLMs) because they all require updating SLMs’ parameters. We demonstrate the superior performance of DynSDPB on both encoder-only LMs (e.g., BERT model families) and decoder-only LMs (e.g., LLaMA model families), validating its effectiveness across natural language understanding (NLU) and natural language generation (NLG) benchmarks.
zh
[NLP-36] Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (Full Thesis)
【速读】: 该论文试图解决在推理系统领域中,预训练的大型语言模型(LLMs)在回答未见问题时面临的挑战,特别是在本地计算资源有限且无互联网连接的情况下。解决方案的关键在于开发一种较小规模的推理模型,该模型能够通过检索到的上下文进行推理,从而回答多样化的问题。论文提出了几种创新方法,包括使用多跳密集检索系统和优化后的语言模型生成的理由(rationales)作为知识源,以及引入检索增强训练数据集(RATD)来显著提升模型性能。此外,论文还提出了一种理由排序模型(Rationale Ranking model, RR),用于评估和组合来自不同知识源的上下文,以提高模型在处理复杂问题时的准确性和效率。
链接: https://arxiv.org/abs/2411.16985
作者: Tim Hartill
关键词-EN: Pretrained large Language, Pretrained large, large Language Models, Model, large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Pretrained large Language Models (LLMs) are able to answer questions that are unlikely to have been encountered during training. However a diversity of potential applications exist in the broad domain of reasoning systems and considerations such as latency, cost, available compute resource and internet connectivity are relevant in determining an appropriate approach. We consider the setting where some local compute capacity is available at inference time but internet connectivity is not. Similar to a general-purpose LLM, we assume that our much smaller Reasoning Models may be asked arbitrary questions from unknown distributions, so we focus on evaluation in an unseen setting. We train our models to answer diverse questions by instilling an ability to reason over a retrieved context. We acquire context from two knowledge sources; a Wikipedia corpus queried using a multi-hop dense retrieval system with novel extensions, and from rationales generated from a larger Language Model optimised to run in a lower resource environment. Our main contributions: We propose novel methods to show that our model is capable of answering contextualised questions without memorisation. We establish a comprehensive set of baseline results on unseen evaluation datasets. We show that the addition of novel retrieval-augmented training datasets (RATD) to the training regime of the Reasoning Model significantly improves results. We demonstrate further significant improvement through the application of methods for combining knowledge from two sources. The first method (RR) involves training a novel Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We use the scores to derive combined contexts. We also show that utilising the RATD datasets enables our model to become proficient at utilising combined noisy contexts.
zh
[NLP-37] Harnessing LLMs for Educational Content-Driven Italian Crossword Generation
【速读】: 该论文试图解决意大利语教育中缺乏先进互动工具的问题,解决方案的关键在于利用先进的语言模型(如GPT-4o、Mistral-7B-Instruct-v0.3和Llama3-8b-Instruct)和专门构建的意大利-Clue-Instruct数据集(包含超过30,000个条目),生成多样化的意大利填字游戏线索。通过研究四种不同的线索风格(无格式约束、定冠词短语、系动词句和裸名词短语),该工具能够根据特定文本和关键词生成上下文相关的线索,从而提供一个互动且有趣的学习环境,促进认知发展和语言学习。
链接: https://arxiv.org/abs/2411.16936
作者: Kamyar Zeinalipour,Achille Fusco,Asya Zanollo,Marco Maggini,Marco Gori
关键词-EN: utilizing advanced language, advanced language models, utilizing advanced, generating Italian crossword, generating Italian
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for presentation at this http URL 2024
点击查看摘要
Abstract:In this work, we unveil a novel tool for generating Italian crossword puzzles from text, utilizing advanced language models such as GPT-4o, Mistral-7B-Instruct-v0.3, and Llama3-8b-Instruct. Crafted specifically for educational applications, this cutting-edge generator makes use of the comprehensive Italian-Clue-Instruct dataset, which comprises over 30,000 entries including diverse text, solutions, and types of clues. This carefully assembled dataset is designed to facilitate the creation of contextually relevant clues in various styles associated with specific texts and keywords. The study delves into four distinctive styles of crossword clues: those without format constraints, those formed as definite determiner phrases, copular sentences, and bare noun phrases. Each style introduces unique linguistic structures to diversify clue presentation. Given the lack of sophisticated educational tools tailored to the Italian language, this project seeks to enhance learning experiences and cognitive development through an engaging, interactive platform. By meshing state-of-the-art AI with contemporary educational strategies, our tool can dynamically generate crossword puzzles from Italian educational materials, thereby providing an enjoyable and interactive learning environment. This technological advancement not only redefines educational paradigms but also sets a new benchmark for interactive and cognitive language learning solutions.
zh
[NLP-38] Boundless Socratic Learning with Language Games
【速读】: 该论文试图解决的问题是如何在封闭系统中训练一个能够掌握任何所需能力的智能体。解决方案的关键在于满足三个条件:(a) 智能体接收足够信息丰富且对齐的反馈 (sufficiently informative and aligned feedback);(b) 其经验/数据覆盖范围足够广泛 (broad enough coverage of experience/data);(c) 具备足够的容量和资源 (sufficient capacity and resource)。在假设条件 (c) 不是瓶颈的情况下,论文特别关注条件 (a) 和 (b) 在封闭系统中的限制。对于输入和输出空间匹配的智能体(即语言智能体),论文提出了一种称为“苏格拉底学习” (Socratic learning) 的纯递归自我改进方法,认为这种方法可以大幅提升性能,超越初始数据或知识中的表现,且仅受时间以及逐渐对齐问题的限制。论文还提出了一个基于语言游戏概念的构建性框架来实现这一方法。
链接: https://arxiv.org/abs/2411.16905
作者: Tom Schaul
关键词-EN: receives sufficiently informative, desired capability, aligned feedback, coverage of experience, capacity and resource
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:An agent trained within a closed system can master any desired capability, as long as the following three conditions hold: (a) it receives sufficiently informative and aligned feedback, (b) its coverage of experience/data is broad enough, and (c) it has sufficient capacity and resource. In this position paper, we justify these conditions, and consider what limitations arise from (a) and (b) in closed systems, when assuming that (c) is not a bottleneck. Considering the special case of agents with matching input and output spaces (namely, language), we argue that such pure recursive self-improvement, dubbed “Socratic learning”, can boost performance vastly beyond what is present in its initial data or knowledge, and is only limited by time, as well as gradual misalignment concerns. Furthermore, we propose a constructive framework to implement it, based on the notion of language games.
zh
[NLP-39] Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
【速读】: 该论文试图解决多模态大语言模型 (Multimodal LLMs, MLLMs) 在处理复杂任务时,由于训练时获取的知识有限,导致其实际应用效果受限的问题。解决方案的关键在于引入了一种名为 Reflective LLaVA (ReflectiVA) 的新方法,通过反射性标记 (reflective tokens) 动态判断是否需要外部知识,并预测从外部数据库中检索的信息的相关性。该方法采用两阶段两模型的训练策略,使模型在不需要外部知识的情况下仍能保持流畅性和任务性能,从而显著提升了知识密集型视觉问答任务的效果。
链接: https://arxiv.org/abs/2411.16863
作者: Federico Cocchi,Nicholas Moratelli,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: handle multimodal inputs, Multimodal LLMs, multimodal inputs, handle multimodal, large language models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at this https URL.
zh
[NLP-40] Integrating Geodesic Interpolation and Flow Matching for Non-Autoregressive Text Generation in Logit Space
【速读】: 该论文试图解决非自回归语言模型在自然语言处理领域中的应用问题,特别是如何有效地进行离散序列的初始分布和目标分布之间的插值。解决方案的关键在于引入了一种新的流匹配方法,该方法利用Kullback-Leibler (KL) 散度测地线来实现插值,并通过设计一个最大化离散标记条件似然的损失函数,使得其最大化解对应于logit插值过程中的流匹配速度。尽管在TinyStories数据集上的初步实验结果不理想,但通过基于预训练去噪器的经验采样方案,显著提升了性能。此外,论文还提出了一种更通用的混合方法,在更复杂的数据集如Fine Web和Lamini Instruction上取得了良好的表现。
链接: https://arxiv.org/abs/2411.16821
作者: Egor Sevriugov,Ivan Oseledets
关键词-EN: Non-autoregressive language models, natural language processing, Non-autoregressive language, simultaneous token generation, facilitating simultaneous token
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Non-autoregressive language models are emerging as effective alternatives to autoregressive models in the field of natural language processing, facilitating simultaneous token generation. This study introduces a novel flow matching approach that employs Kullback-Leibler (KL) divergence geodesics to interpolate between initial and target distributions for discrete sequences. We formulate a loss function designed to maximize the conditional likelihood of discrete tokens and demonstrate that its maximizer corresponds to the flow matching velocity during logit interpolation. Although preliminary experiments conducted on the TinyStories dataset yielded suboptimal results, we propose an empirical sampling scheme based on a pretrained denoiser that significantly enhances performance. Additionally, we present a more general hybrid approach that achieves strong performance on more complex datasets, such as Fine Web and Lamini Instruction.
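作为补充,下面用一小段代码说明“logit 线性插值”与离散分布几何混合(即 KL 意义下的指数族测地线)之间的对应关系;其中的分布与插值时刻均为随机示意,并非论文的实验设置。

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z0, z1 = rng.normal(size=5), rng.normal(size=5)   # 初始/目标分布的 logits
t = 0.3                                           # 插值时刻

# 路径一:在 logit 空间做线性插值后再 softmax
p_logit = softmax((1 - t) * z0 + t * z1)

# 路径二:概率空间的几何混合 p0^(1-t) * p1^t(再归一化)
p0, p1 = softmax(z0), softmax(z1)
geo = p0 ** (1 - t) * p1 ** t
p_geo = geo / geo.sum()

assert np.allclose(p_logit, p_geo)                # 两条路径给出同一分布
```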
zh
[NLP-41] Enhancing In-Hospital Mortality Prediction Using Multi-Representational Learning with LLM-Generated Expert Summaries
【速读】: 该论文试图解决重症监护病房(ICU)患者院内死亡率(IHM)预测的问题,特别是如何通过整合结构化生理数据和非结构化临床笔记来提高预测准确性。解决方案的关键在于利用大型语言模型(LLM)生成的专家摘要来增强文本数据的处理能力,同时开发一个多表示学习框架,以综合利用这些数据源。具体来说,论文通过将临床笔记转化为专家摘要,并将其与时间序列生理数据结合,显著提升了预测模型的性能,特别是在AUPRC和AUROC指标上分别提高了36.41%和7.64%。这种方法不仅提高了预测准确性,还在不同人口统计群体中表现出一致的性能提升,特别是在代表性不足的群体中,突显了其公平应用的潜力。
链接: https://arxiv.org/abs/2411.16818
作者: Harshavardhan Battula,Jiacheng Liu,Jaideep Srivastava
关键词-EN: efficient resource allocation, In-hospital mortality, resource allocation, timely interventions, interventions and efficient
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In-hospital mortality (IHM) prediction for ICU patients is critical for timely interventions and efficient resource allocation. While structured physiological data provides quantitative insights, clinical notes offer unstructured, context-rich narratives. This study integrates these modalities with Large Language Model (LLM)-generated expert summaries to improve IHM prediction accuracy. Using the MIMIC-III database, we analyzed time-series physiological data and clinical notes from the first 48 hours of ICU admission. Clinical notes were concatenated chronologically for each patient and transformed into expert summaries using Med42-v2 70B. A multi-representational learning framework was developed to integrate these data sources, leveraging LLMs to enhance textual data while mitigating direct reliance on LLM predictions, which can introduce challenges in uncertainty quantification and interpretability. The proposed model achieved an AUPRC of 0.6156 (+36.41%) and an AUROC of 0.8955 (+7.64%) compared to a time-series-only baseline. Expert summaries outperformed clinical notes or time-series data alone, demonstrating the value of LLM-generated knowledge. Performance gains were consistent across demographic groups, with notable improvements in underrepresented populations, underscoring the framework’s equitable application potential. By integrating LLM-generated summaries with structured and unstructured data, the framework captures complementary patient information, significantly improving predictive performance. This approach showcases the potential of LLMs to augment critical care prediction models, emphasizing the need for domain-specific validation and advanced integration strategies for broader clinical adoption.
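下面给出一个极简的多表示融合模型草图(笔者自行编写,仅示意“时间序列编码 + LLM 生成的专家摘要文本编码 → 拼接 → 预测院内死亡率”的结构);文本编码器名称、时间序列特征数与各层维度均为假设,并非论文的实际配置。

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class IHMFusionModel(nn.Module):
    def __init__(self, n_feats=17, text_model="emilyalsentzer/Bio_ClinicalBERT"):
        super().__init__()
        self.gru = nn.GRU(n_feats, 128, batch_first=True, bidirectional=True)
        self.text_enc = AutoModel.from_pretrained(text_model)
        hidden = self.text_enc.config.hidden_size
        self.head = nn.Sequential(nn.Linear(256 + hidden, 128),
                                  nn.ReLU(), nn.Linear(128, 1))

    def forward(self, ts, input_ids, attention_mask):
        _, h = self.gru(ts)                                   # 生理时间序列表示
        ts_repr = torch.cat([h[-2], h[-1]], dim=-1)           # (B, 256)
        out = self.text_enc(input_ids=input_ids,
                            attention_mask=attention_mask)
        text_repr = out.last_hidden_state[:, 0]               # 以 [CLS] 作为摘要表示
        return self.head(torch.cat([ts_repr, text_repr], dim=-1)).squeeze(-1)

# 用法示意:ts 为入院前 48 小时的生理时间序列,文本为 LLM 生成的专家摘要
tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = IHMFusionModel()
batch = tok(["Patient with septic shock, on vasopressors ..."],
            return_tensors="pt", truncation=True)
logit = model(torch.randn(1, 48, 17), batch["input_ids"], batch["attention_mask"])
loss = nn.BCEWithLogitsLoss()(logit, torch.tensor([1.0]))
```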
zh
[NLP-42] Fine-Tuning LLMs with Noisy Data for Political Argument Generation
【速读】: 该论文试图解决社交媒体中政治敏感内容生成模型中存在的失礼问题,解决方案的关键在于微调(fine-tuning)和提示策略(prompting strategies)。研究通过使用CLAPTON数据集的子集,对GPT-3.5 Turbo模型进行微调和提示策略的实验,发现基于Reddit数据的微调模型在讨论质量上得分最高,而混合噪声数据则导致持续的毒性。提示策略虽然能减少特定毒性特征(如人身攻击),但对整体影响有限。研究强调,高质量数据和精心设计的提示策略是减少失礼行为并提高自动化政治讨论生成中修辞质量的关键。
链接: https://arxiv.org/abs/2411.16813
作者: Svetlana Churina,Kokil Jaidka
关键词-EN: politically sensitive content, social media discourse, media discourse complicates, sensitive content, social media
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The incivility in social media discourse complicates the deployment of automated text generation models for politically sensitive content. Fine-tuning and prompting strategies are critical, but underexplored, solutions to mitigate toxicity in such contexts. This study investigates the fine-tuning and prompting effects on GPT-3.5 Turbo using subsets of the CLAPTON dataset of political discussion posts, comprising Twitter and Reddit data labeled for their justification, reciprocity and incivility. Fine-tuned models on Reddit data scored highest on discussion quality, while combined noisy data led to persistent toxicity. Prompting strategies reduced specific toxic traits, such as personal attacks, but had limited broader impact. The findings emphasize that high-quality data and well-crafted prompts are essential to reduce incivility and improve rhetorical quality in automated political discourse generation.
zh
[NLP-43] Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models
【速读】: 该论文试图解决在缺乏确切标准答案的情况下,如何通过多个高级语言模型(如GPT-4、Meta-LLaMA、Claude和Gemini)之间的协作来提高复杂统计问题回答的可靠性和精确度。解决方案的关键在于通过统计方法(如卡方检验、Fleiss’ Kappa和置信区间分析)评估模型间的共识率和一致性,从而量化协作输出的可靠性。研究结果表明,Claude和GPT-4在共识率和一致性上表现出最高的可靠性,而Gemini和LLaMA则显示出较大的变异性。这些发现强调了大型语言模型(LLMs)之间协作交互在提升回答可靠性方面的重要作用。
链接: https://arxiv.org/abs/2411.16797
作者: Alireza Amiri-Margavi,Iman Jebellat,Ehsan Jebellat,Seyed Pouyan Mousavi Davoudi
关键词-EN: involving advanced models, system involving advanced, involving advanced, innovative language model, interaction system involving
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 2 figures
点击查看摘要
Abstract:We explore the collaborative dynamics of an innovative language model interaction system involving advanced models such as GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash. These models generate and answer complex, PhD-level statistical questions without exact ground-truth answers. Our study investigates how inter-model consensus enhances the reliability and precision of responses. By employing statistical methods such as chi-square tests, Fleiss’ Kappa, and confidence interval analysis, we evaluate consensus rates and inter-rater agreement to quantify the reliability of collaborative outputs. Key results reveal that Claude and GPT-4 exhibit the highest reliability and consistency, as evidenced by their narrower confidence intervals and higher alignment with question-generating models. Conversely, Gemini and LLaMA show more significant variability in their consensus rates, as reflected in wider confidence intervals and lower reliability percentages. These findings demonstrate that collaborative interactions among large language models (LLMs) significantly improve response reliability, offering novel insights into autonomous, cooperative reasoning and validation in AI systems.
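下面是用 statsmodels 与 scipy 计算模型间一致性指标的小示例(数据为随机构造,仅示意文中提到的 Fleiss' Kappa、卡方检验与置信区间的用法,并非论文原始分析脚本)。

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# 假设:4 个模型对 50 道统计题各给出 A/B/C/D(编码为 0-3)之一
rng = np.random.default_rng(0)
answers = rng.integers(0, 4, size=(50, 4))        # 行=题目,列=模型

# Fleiss' Kappa:衡量多“评审者”(此处为多模型)之间的一致性
table, _ = aggregate_raters(answers)              # (题目数, 类别数) 的计数表
kappa = fleiss_kappa(table, method="fleiss")

# 卡方检验:检验不同模型的答案分布是否存在显著差异
counts = np.stack([np.bincount(answers[:, m], minlength=4)
                   for m in range(answers.shape[1])])
chi2, p, _, _ = chi2_contingency(counts)

# 简单的共识率及其正态近似 95% 置信区间
consensus = np.mean([np.bincount(row).max() / len(row) for row in answers])
se = np.sqrt(consensus * (1 - consensus) / answers.shape[0])
ci = (consensus - 1.96 * se, consensus + 1.96 * se)
print(f"kappa={kappa:.3f}, chi2 p={p:.3f}, consensus={consensus:.2f}, 95% CI={ci}")
```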
zh
[NLP-44] What can LLM tell us about cities?
【速读】: 该论文试图解决的问题是如何利用大型语言模型 (Large Language Models, LLMs) 在全球范围内提供关于城市和地区的知识。解决方案的关键在于采用了两种方法:直接查询LLM以获取目标变量的值,以及从LLM中提取与目标变量相关的显性和隐性特征。通过实验,研究者发现LLMs在全球城市中嵌入了广泛但程度不一的知识,并且基于LLM衍生的特征训练的机器学习模型能够显著提高预测准确性。此外,研究还观察到LLMs在所有大陆的城市中都表现出一定程度的知识,但在缺乏知识时,它们往往会生成通用或随机的输出。这些发现表明,LLMs为城市研究中的数据驱动决策提供了新的机会。
链接: https://arxiv.org/abs/2411.16791
作者: Zhuoheng Li,Yaochen Wang,Zhixue Song,Yuqi Huang,Rui Bao,Guanjie Zheng,Zhenhui Jessie Li
关键词-EN: large language models, explores the capabilities, capabilities of large, large language, global scale
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This study explores the capabilities of large language models (LLMs) in providing knowledge about cities and regions on a global scale. We employ two methods: directly querying the LLM for target variable values and extracting explicit and implicit features from the LLM correlated with the target variable. Our experiments reveal that LLMs embed a broad but varying degree of knowledge across global cities, with ML models trained on LLM-derived features consistently leading to improved predictive accuracy. Additionally, we observe that LLMs demonstrate a certain level of knowledge across global cities on all continents, but it is evident when they lack knowledge, as they tend to generate generic or random outputs for unfamiliar tasks. These findings suggest that LLMs can offer new opportunities for data-driven decision-making in the study of cities.
zh
[NLP-45] Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
【速读】: 该论文试图解决手语翻译 (Sign Language Translation, SLT) 中的挑战,即如何将手语图像准确翻译为口语语言。解决方案的关键在于提出了一种名为多模态手语翻译 (Multimodal Sign Language Translation, MMSLT) 的新框架,该框架利用现成的多模态大语言模型 (Multimodal Large Language Models, MLLMs) 的表征能力。具体来说,MMSLT 通过 MLLMs 生成手语组件的详细文本描述,并通过多模态语言预训练模块将这些描述特征与手语视频特征整合,以在口语句子空间中对齐它们。这种方法在 PHOENIX14T 和 CSL-Daily 等基准数据集上实现了最先进的性能,展示了 MLLMs 在 SLT 中的有效应用潜力。
链接: https://arxiv.org/abs/2411.16789
作者: Jungeun Kim,Hyeongwoo Jeon,Jongseong Bae,Ha Young Kim
关键词-EN: Sign language translation, involves translating sign, Sign language, translating sign language, sign language images
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.
zh
[NLP-46] Contrastive Multi-graph Learning with Neighbor Hierarchical Sifting for Semi-supervised Text Classification
【速读】: 该论文试图解决图对比学习在文本分类应用中存在的三个主要问题:显式图增强可能导致语义丢失、现有方法忽视边特征和节点特征的重要性差异、以及对比损失中存在假负样本问题。解决方案的关键在于提出了一种名为ConNHS的新方法,即对比多图学习与邻居层次筛选(Contrastive Multi-Graph Learning with Neighbor Hierarchical Sifting)。具体来说,ConNHS通过利用核心特征构建多关系文本图,增强文本间的语义联系,并通过分离文本图提供多样化的对比学习视角,确保图信息的优化保留。此外,该方法分别执行关系感知传播和跨图注意力传播,有效利用节点和边特征的变异相关性,并协调跨图信息融合。最后,引入邻居层次筛选损失(NHS)来优化负样本选择,基于同质性假设和相似性排除高阶相似邻居作为负样本,从而减少假负样本的出现,防止嵌入空间中相似样本间距离的扩大。
链接: https://arxiv.org/abs/2411.16787
作者: Wei Ai,Jianbin Li,Ze Wang,Yingying Wei,Tao Meng,Yuntao Shou,Keqin Lib
关键词-EN: self-supervised node representation, node representation learning, text classification due, successfully applied, remarkable ability
类目: Computation and Language (cs.CL)
备注: 16 pages, 6 figures
点击查看摘要
Abstract:Graph contrastive learning has been successfully applied in text classification due to its remarkable ability for self-supervised node representation learning. However, explicit graph augmentations may lead to a loss of semantics in the contrastive views. Secondly, existing methods tend to overlook edge features and the varying significance of node features during multi-graph learning. Moreover, the contrastive loss suffer from false negatives. To address these limitations, we propose a novel method of contrastive multi-graph learning with neighbor hierarchical sifting for semi-supervised text classification, namely ConNHS. Specifically, we exploit core features to form a multi-relational text graph, enhancing semantic connections among texts. By separating text graphs, we provide diverse views for contrastive learning. Our approach ensures optimal preservation of the graph information, minimizing data loss and distortion. Then, we separately execute relation-aware propagation and cross-graph attention propagation, which effectively leverages the varying correlations between nodes and edge features while harmonising the information fusion across graphs. Subsequently, we present the neighbor hierarchical sifting loss (NHS) to refine the negative selection. For one thing, following the homophily assumption, NHS masks first-order neighbors of the anchor and positives from being negatives. For another, NHS excludes the high-order neighbors analogous to the anchor based on their similarities. Consequently, it effectively reduces the occurrence of false negatives, preventing the expansion of the distance between similar samples in the embedding space. Our experiments on ThuCNews, SogouNews, 20 Newsgroups, and Ohsumed datasets achieved 95.86%, 97.52%, 87.43%, and 70.65%, which demonstrates competitive results in semi-supervised text classification.
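下面是按摘要思路自拟的“邻居层次筛选(NHS)”式 InfoNCE 损失草图(并非论文原实现):在负样本中同时屏蔽锚点的一阶邻居与视角内相似度过高的高阶邻居;温度与相似度阈值等超参数均为假设。

```python
import torch
import torch.nn.functional as F

def nhs_contrastive_loss(z1, z2, adj, tau=0.5, sim_thresh=0.8):
    """z1, z2: 两个图视角下的节点表示 (N, d);adj: 一阶邻接矩阵 (N, N),取值 0/1。
    正样本为同一节点的另一视角;负样本排除一阶邻居与高相似度的高阶邻居。"""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    n = z1.size(0)
    sim = z1 @ z2.t() / tau                                  # 跨视角相似度
    intra = z1 @ z1.t()                                      # 视角内相似度,用于筛选

    eye = torch.eye(n, dtype=torch.bool, device=z1.device)
    high_order = (intra > sim_thresh) & (adj == 0) & ~eye    # 疑似“伪负样本”的高阶邻居
    neg_mask = ~(eye | (adj > 0) | high_order)               # 允许作为负样本的位置

    pos = torch.exp(sim.diag())
    neg = (torch.exp(sim) * neg_mask).sum(dim=1)
    return -torch.log(pos / (pos + neg + 1e-12)).mean()

# 用法示意
n, d = 16, 32
z1, z2 = torch.randn(n, d), torch.randn(n, d)
adj = (torch.rand(n, n) > 0.8).float()
adj = ((adj + adj.t()) > 0).float()
loss = nhs_contrastive_loss(z1, z2, adj)
```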
zh
[NLP-47] Parameter Efficient Instruction Tuning: An Empirical Study
【速读】: 该论文试图解决在指令微调(Instruction Tuning)过程中,如何通过参数高效微调(Parameter Efficient Finetuning, PEFT)方法在减少计算、内存和存储成本的同时,保持或接近全参数微调(Full Finetuning)的性能。解决方案的关键在于系统地研究了几种代表性的PEFT方法(如LoRA和Adapter),并探讨了超参数选择(包括训练超参数和PEFT特定超参数)、模型大小、指令任务数量对性能的影响,以及任务内分布记忆和开放指令遵循能力。研究结果表明,只有在理想训练设置下(如适当的学习率、最大的LoRA秩或Adapter大小以及多样化的训练任务),LoRA和Adapter才能接近全参数微调的性能,但它们在训练不稳定性和复杂推理、编码及长篇生成任务上表现不如全参数微调。
链接: https://arxiv.org/abs/2411.16775
作者: Pengfei He
关键词-EN: pretrained language models, follow human instructions, finetuning pretrained language, Instruction tuning, pretrained language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 7 figures
点击查看摘要
Abstract:Instruction tuning has become an important step for finetuning pretrained language models to better follow human instructions and generalize on various tasks. Nowadays, pretrained language models become increasingly larger, and full parameter finetuning is overwhelmingly costly. Therefore, Parameter Efficient Finetuning (PEFT) has arisen as a cost-effective practice for instruction tuning because of significantly smaller computational, memory, and storage cost compared to full finetuning. Despite their widespread adaptations, the vast hyperparameter spaces, the number of PEFT methods, the different focus of instruction tuning capabilities make disentangling the impact of each aspect difficult. This study systematically investigates several representative PEFT methods, surveying the effect of hyperparameter choices including training hyperparameters and PEFT-specific hyperparameters, how different models sizes and the number of instruction tasks affect the performance, in-task-distribution memorization and open instruction following capability. Our empirical study shows that only LoRA and adapter can get close to full finetuning with ideal training settings. The ideal training setting includes an appropriate learning rate, largest LoRA rank or adapter size allowed and diverse training tasks. On the other hand, LoRA and adapter suffer from training instability if such an ideal training condition is not met. Additionally, LoRA requires a greater number of tasks for effective unseen task generalization, exhibit slower learning speed. Moreover, LoRA has weaker task-level memorization. Lastly, LoRA and adapter fall short in complex reasoning, coding and long-form generation compared to finetuning in open instruction tuning settings but it shows stronger capabilities compared to adapter.
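作为参考,下面给出一个使用 peft 库配置 LoRA 进行指令微调的最小片段;基座模型名、秩 r、学习率等取值仅为示意,并非论文得出的“最优配置”。文中的结论大致是:理想设置通常意味着合适的学习率、显存允许范围内尽量大的 LoRA 秩或 Adapter 规模,以及足够多样的训练任务。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                 # 假设的基座模型
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,                      # 在显存允许的范围内取较大的秩
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                # 查看可训练参数占比

# 训练时再配合合适的学习率(如 1e-4 ~ 2e-4)与多样化的指令任务混合
```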
zh
[NLP-48] In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
【速读】: 该论文试图解决文本到图像(Text-to-image, T2I)模型在生成有害内容方面的潜在风险问题,特别是在缺乏系统性工具评估现有安全机制对实际滥用场景的有效性方面。解决方案的关键在于提出了一个名为ICER的新型红队测试框架,该框架利用大型语言模型(Large Language Models, LLMs)和基于bandit优化的算法,通过学习过去的成功红队测试案例,生成可解释且语义上有意义的潜在问题提示。ICER能够在不需内部访问或额外训练的情况下,高效地测试不同T2I模型的安全机制,从而广泛适用于已部署的系统。实验结果表明,ICER在识别模型漏洞方面显著优于现有的提示攻击方法,同时保持与预期内容的高度语义相似性。
链接: https://arxiv.org/abs/2411.16769
作者: Zhi-Yi Chin,Kuan-Chen Mu,Mario Fritz,Pin-Yu Chen,Wei-Chen Chiu
关键词-EN: shown remarkable progress, harmful content remains, remarkable progress, shown remarkable, remains a critical
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. While various safety mechanisms have been developed, the field lacks systematic tools for evaluating their effectiveness against real-world misuse scenarios. In this work, we propose ICER, a novel red-teaming framework that leverages Large Language Models (LLMs) and a bandit optimization-based algorithm to generate interpretable and semantic meaningful problematic prompts by learning from past successful red-teaming attempts. Our ICER efficiently probes safety mechanisms across different T2I models without requiring internal access or additional training, making it broadly applicable to deployed systems. Through extensive experiments, we demonstrate that ICER significantly outperforms existing prompt attack methods in identifying model vulnerabilities while maintaining high semantic similarity with intended content. By uncovering that successful jailbreaking instances can systematically facilitate the discovery of new vulnerabilities, our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
zh
[NLP-49] SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
【速读】: 该论文试图解决手语处理中任务特定模型限制跨任务迁移学习的问题。解决方案的关键是引入SHuBERT(Sign Hidden-Unit BERT),这是一种自监督的transformer编码器,通过学习约1,000小时美国手语(ASL)视频内容中的强表示,采用多流视觉手语输入的掩码预测方法,针对手、面部和身体姿态流进行多目标预测。SHuBERT在多个基准测试中达到了最先进的性能,显著提升了手语翻译和孤立手语识别的准确性。
链接: https://arxiv.org/abs/2411.16765
作者: Shester Gueuwou,Xiaodan Du,Greg Shakhnarovich,Karen Livescu,Alexander H. Liu
关键词-EN: Sign language, Sign language processing, American Sign Language, Sign Hidden-Unit BERT, processing has traditionally
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages
点击查看摘要
Abstract:Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised transformer encoder that learns strong representations from approximately 1,000 hours of American Sign Language (ASL) video content. Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction for multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple benchmarks. On sign language translation, it outperforms prior methods trained on publicly available data on the How2Sign (+0.7 BLEU), OpenASL (+10.0 BLEU), and FLEURS-ASL (+0.3 BLEU) benchmarks. Similarly for isolated sign language recognition, SHuBERT’s accuracy surpasses that of specialized models on ASL-Citizen (+5%) and SEM-LEX (+20.6%), while coming close to them on WLASL2000 (-3%). Ablation studies confirm the contribution of each component of the approach.
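下面是按摘要思路写的多流掩码聚类预测损失草图(非官方实现):手部/脸部/身体姿态等多路特征各自对应一套聚类伪标签,模型在被掩码的位置为每一路预测其聚类 ID;流数、各流维度、聚类数与掩码比例均为假设。

```python
import torch
import torch.nn as nn

class MultiStreamMaskedPredictor(nn.Module):
    def __init__(self, in_dims=(128, 128, 64), d_model=256, n_clusters=500):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in in_dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.heads = nn.ModuleList(nn.Linear(d_model, n_clusters)
                                   for _ in in_dims)
        self.mask_emb = nn.Parameter(torch.zeros(d_model))

    def forward(self, streams, cluster_ids, mask_ratio=0.3):
        """streams: 各流特征列表,每个形状 (B, T, d_i);cluster_ids: 各流伪标签 (B, T)。"""
        x = sum(p(s) for p, s in zip(self.proj, streams))      # 简单相加融合各流输入
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_emb, x)  # 掩码位置替换为可学习向量
        h = self.encoder(x)
        loss = 0.0
        for head, ids in zip(self.heads, cluster_ids):
            logits = head(h)[mask]                             # 只在掩码位置计算损失
            loss = loss + nn.functional.cross_entropy(logits, ids[mask])
        return loss / len(self.heads)

# 用法示意
B, T = 2, 50
model = MultiStreamMaskedPredictor()
streams = [torch.randn(B, T, 128), torch.randn(B, T, 128), torch.randn(B, T, 64)]
ids = [torch.randint(0, 500, (B, T)) for _ in range(3)]
loss = model(streams, ids)
```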
zh
[NLP-50] PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
【速读】: 该论文试图解决单目深度估计中的固有模糊性和视觉干扰问题。解决方案的关键在于利用文本到图像扩散模型中学习到的语言先验(language priors)来增强单目深度估计。具体来说,通过预训练的文本到图像扩散模型,结合图像和与场景对齐的文本描述,通过去噪过程推断出仿射不变深度(affine-invariant depth)。这种方法不仅能够引导模型关注特定区域,帮助其感知与用户意图对齐的3D场景,还能作为约束加速扩散轨迹的收敛,因为从低维语言特征中学习3D属性比从高维图像特征中学习更为高效。实验结果表明,该方法在多个数据集上实现了最先进的零样本性能和更快的收敛速度。
链接: https://arxiv.org/abs/2411.16750
作者: Ziyao Zeng,Jingcheng Ni,Daniel Wang,Patrick Rim,Younjoon Chung,Fengyu Yang,Byung-Woo Hong,Alex Wong
关键词-EN: monocular depth estimation, monocular depth, depth estimation, paper explores, explores the potential
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:This paper explores the potential of leveraging language priors learned by text-to-image diffusion models to address ambiguity and visual nuisance in monocular depth estimation. Particularly, traditional monocular depth estimation suffers from inherent ambiguity due to the absence of stereo or multi-view depth cues, and nuisance due to lack of robustness of vision. We argue that language prior in diffusion models can enhance monocular depth estimation by leveraging the geometric prior aligned with the language description, which is learned during text-to-image pre-training. To generate images that reflect the text properly, the model must comprehend the size and shape of specified objects, their spatial relationship, and the scale of the scene. Thus, we propose PriorDiffusion, using a pre-trained text-to-image diffusion model that takes both image and text description that aligned with the scene to infer affine-invariant depth through a denoising process. We also show that language priors can guide the model’s attention to specific regions and help it perceive the 3D scene in alignment with user intent. Simultaneously, it acts as a constraint to accelerate the convergence of the diffusion trajectory, since learning 3D properties from a condensed, low-dimensional language feature is more efficient compared with learning from a redundant, high-dimensional image feature. By training on HyperSim and Virtual KITTI, we achieve state-of-the-art zero-shot performance and a faster convergence speed, compared with other diffusion-based depth estimators, across NYUv2, KITTI, ETH3D, and ScanNet.
zh
[NLP-51] ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
【速读】: 该论文试图解决大型语言模型(LLMs)在化学领域应用中可能生成的科学错误或不安全响应的问题。解决方案的关键在于引入ChemSafetyBench,这是一个专门设计的基准,用于评估LLM在化学相关任务中的准确性和安全性。ChemSafetyBench包含三个核心任务:查询化学性质、评估化学用途的合法性以及描述合成方法,这些任务要求不同程度的化学知识。该基准通过超过30K个样本的数据集,结合手工模板和高级越狱场景,增强了任务的多样性。论文还提出了一个自动化评估框架,全面评估LLM响应的安全性、准确性和适当性。通过与最先进的LLM进行广泛实验,揭示了这些模型在化学应用中的显著优势和关键漏洞,强调了在化学领域开发更安全AI技术的必要性。
链接: https://arxiv.org/abs/2411.16736
作者: Haochen Zhao,Xiangru Tang,Ziran Yang,Xiao Han,Xuanzhi Feng,Yueqing Fan,Senhao Cheng,Di Jin,Yilun Zhao,Arman Cohan,Mark Gerstein
关键词-EN: scientific research assistance, large language models, research assistance, application of large, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasingly deeper chemical knowledge. Our dataset has more than 30K samples across various chemical materials. We incorporate handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. Our automated evaluation framework thoroughly assesses the safety, accuracy, and appropriateness of LLM responses. Extensive experiments with state-of-the-art LLMs reveal notable strengths and critical vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. Our code and dataset are available at this https URL. Warning: this paper contains discussions on the synthesis of controlled chemicals using AI models.
zh
[NLP-52] Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge
【速读】: 该论文旨在解决大型语言模型(LLMs)在金融领域中的应用问题,特别是如何高效处理和分析复杂的金融文档,如财务报表和披露文件。解决方案的关键在于开发了一个高性能的金融专用检索增强生成(Retrieval-Augmented Generation, RAG)系统,并通过以下几个关键技术优化了性能:1) 预检索阶段的查询扩展和语料库优化,通过消融研究进行性能优化;2) 采用多重重排序模型提升检索准确性;3) 引入一种高效的长上下文管理方法,显著提高生成响应的质量而不影响性能。这些技术共同作用,使得该系统在ACM-ICAIF '24 FinanceRAG竞赛中获得第二名,展示了LLMs在处理复杂金融数据方面的潜力。
链接: https://arxiv.org/abs/2411.16732
作者: Joohyun Lee,Minji Roh
关键词-EN: Large Language Models, address domain-specific problems, increasingly address domain-specific, Large Language, Language Models
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) increasingly address domain-specific problems, their application in the financial sector has expanded rapidly. Tasks that are both highly valuable and time-consuming, such as analyzing financial statements, disclosures, and related documents, are now being effectively tackled using LLMs. This paper details the development of a high-performance, finance-specific Retrieval-Augmented Generation (RAG) system for the ACM-ICAIF '24 FinanceRAG competition. We optimized performance through ablation studies on query expansion and corpus refinement during the pre-retrieval phase. To enhance retrieval accuracy, we employed multiple reranker models. Notably, we introduced an efficient method for managing long context sizes during the generation phase, significantly improving response quality without sacrificing performance. We ultimately achieve 2nd place in the FinanceRAG Challenge. Our key contributions include: (1) pre-retrieval ablation analysis, (2) an enhanced retrieval algorithm, and (3) a novel approach for long-context management. This work demonstrates the potential of LLMs in effectively processing and analyzing complex financial data to generate accurate and valuable insights. The source code and further details are available at this https URL.
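下面给出一个多重重排序(multi-reranker)分数融合的简化示例(并非该参赛系统的原代码):先由任意第一阶段检索器取回候选,再用多个交叉编码器打分、做 min-max 归一化后取平均。所用模型名仅为常见开源 reranker 的示例,属笔者假设。

```python
import numpy as np
from sentence_transformers import CrossEncoder

# 假设:使用两个公开的交叉编码器作为 reranker
RERANKERS = [
    CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2"),
    CrossEncoder("BAAI/bge-reranker-base"),
]

def minmax(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

def rerank(query, candidates, top_k=10):
    """candidates: 第一阶段检索返回的文档文本列表。"""
    pairs = [(query, doc) for doc in candidates]
    # 各 reranker 分数先归一化再取平均,弱化不同模型分数尺度的差异
    fused = np.mean([minmax(m.predict(pairs)) for m in RERANKERS], axis=0)
    order = np.argsort(-fused)[:top_k]
    return [(candidates[i], float(fused[i])) for i in order]

# 用法示意
docs = ["Net income rose 12% year over year ...",
        "The company issued new senior notes ..."]
print(rerank("What drove the change in net income?", docs, top_k=2))
```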
zh
[NLP-53] “Moralized” Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks ICLR2025
【速读】: 该论文试图解决在大语言模型(Large Language Models)应用扩展过程中,如何有效识别和防范由多步骤提示(multi-step jailbreak prompts)生成的有害内容的问题。解决方案的关键在于通过黑箱测试(black-box testing)模拟看似道德的提示情景,评估现有防护机制(guardrails)在面对此类攻击时的有效性。研究通过设计一个“企业中层管理者竞争晋升”的场景,观察各模型在每一步的响应,发现所有模型的防护机制均被绕过,生成了攻击性内容。数据结果显示,Claude 3.5 Sonnet在识别此类提示方面表现较好。研究者强调,防护机制不仅应作为内容过滤器,还应具备预防功能,并呼吁开发者和研究者关注这一问题。为确保实验的客观性和普遍性,研究者已将实验过程、测试代码及增强防护代码上传至GitHub,以促进开发社区的合作。
链接: https://arxiv.org/abs/2411.16730
作者: Libo Wang
关键词-EN: poses higher challenges, large language models, language models continues, identifying harmful content, harmful content generation
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been submitted to ICLR 2025 BlogPosts and OpenReview preprints. It has 9 pages of text, 4 figures, and 3 tables
点击查看摘要
Abstract:As the application of large language models continues to expand in various fields, it poses higher challenges to the effectiveness of identifying harmful content generation and guardrail mechanisms. This research aims to evaluate the effectiveness of guardrails in the face of multi-step jailbreak prompt-generated verbal attacks, through black-box testing of seemingly ethical prompt simulations. The experimental subjects were selected GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5 and Claude 3.5 Sonnet. The researcher used the same multi-step prompt to simulate moral attacks by designing a scenario of “enterprise middle managers competing for promotion” and observed the model’s response at each step. During the experiment, the guardrails of the above model were all bypassed in this experiment and the content of verbal attacks was generated. The data results show that Claude 3.5 Sonnet performs better than other models in terms of its tendency to identify jailbreak prompts. The researcher hopes to use this to remind developers and future research that guardrails not only inappropriately play the role of content filters, but should also have a preventive function. In order to ensure the objectivity and generalizability of the experiment, the researcher has uploaded the experimental process, black box test code, and enhanced guardrail code to GitHub to promote cooperation in the development community: this https URL.
zh
[NLP-54] A Brief Summary of Explanatory Virtues
【速读】: 该论文旨在探讨解释性美德(Explanatory Virtues)在哲学、心理学和认知科学领域的文献,并将其与可解释性人工智能(eXplainable AI)联系起来。解决方案的关键在于识别和整合不同学科中关于解释性美德的概念,以提升人工智能系统的可解释性和透明度。
链接: https://arxiv.org/abs/2411.16709
作者: Ingrid Zukerman
关键词-EN: Explanatory Virtues, science about Explanatory, literature in philosophy, psychology and cognitive, cognitive science
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 2 tables
点击查看摘要
Abstract:In this report, I provide a brief summary of the literature in philosophy, psychology and cognitive science about Explanatory Virtues, and link these concepts to eXplainable AI.
zh
[NLP-55] Enhancing LLM s for Power System Simulations: A Feedback-driven Multi-agent Framework
【速读】: 该论文试图解决大型语言模型(LLMs)在电力系统模拟管理中的局限性问题,主要表现为领域知识不足、推理能力受限以及模拟参数处理不精确。解决方案的关键在于提出了一种反馈驱动的多代理框架,该框架包含三个核心模块:增强的检索增强生成(RAG)模块、改进的推理模块和带有错误反馈机制的动态环境行动模块。这些模块协同工作,显著提高了LLMs在电力系统模拟任务中的成功率,分别在Daline和MATPOWER的69个多样化任务中达到了93.13%和96.85%的成功率,远超最新LLMs(如ChatGPT 4o和o1-preview)的表现。此外,该框架还支持快速、低成本的任务执行,每项模拟任务平均在30秒内完成,成本仅为0.014美元。
链接: https://arxiv.org/abs/2411.16707
作者: Mengshuo Jia,Zeyu Cui,Gabriela Hug
关键词-EN: large language models, mere problem-solving tool, transforming scientific research, language models, problem-solving tool
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:
点击查看摘要
Abstract:The integration of experimental technologies with large language models (LLMs) is transforming scientific research, positioning AI as a versatile research assistant rather than a mere problem-solving tool. In the field of power systems, however, managing simulations – one of the essential experimental technologies – remains a challenge for LLMs due to their limited domain-specific knowledge, restricted reasoning capabilities, and imprecise handling of simulation parameters. To address these limitations, we propose a feedback-driven, multi-agent framework that incorporates three proposed modules: an enhanced retrieval-augmented generation (RAG) module, an improved reasoning module, and a dynamic environmental acting module with an error-feedback mechanism. Validated on 69 diverse tasks from Daline and MATPOWER, this framework achieves success rates of 93.13% and 96.85%, respectively, significantly outperforming the latest LLMs (ChatGPT 4o and o1-preview), which achieved a 27.77% success rate on standard simulation tasks and 0% on complex tasks. Additionally, our framework also supports rapid, cost-effective task execution, completing each simulation in approximately 30 seconds at an average cost of 0.014 USD for tokens. Overall, this adaptable framework lays a foundation for developing intelligent LLM-based assistants for human researchers, facilitating power system research and beyond.
zh
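上文 [NLP-55] 的核心是"RAG 检索 → 推理生成脚本 → 执行仿真 → 错误反馈再推理"的闭环。下面给出一个最小化的流程骨架作为示意（并非论文官方实现，rag_lookup、llm_plan、run_simulation 等均为假设的占位接口，真实系统中应分别对接领域知识库、LLM 与电力系统仿真器）：

```python
# 示意代码:带错误反馈机制的"检索-推理-执行"循环骨架(非论文官方实现)
import random

def rag_lookup(task: str) -> str:
    """增强 RAG 模块:返回与任务相关的领域知识片段(此处为占位)。"""
    return f"domain notes for: {task}"

def llm_plan(task: str, knowledge: str, feedback: str = "") -> str:
    """推理模块:根据任务、知识与上一轮的错误反馈生成仿真脚本(占位)。"""
    suffix = f"  # fixed: {feedback}" if feedback else ""
    return f"run_pf(case='{task}'){suffix}"

def run_simulation(script: str) -> tuple[bool, str]:
    """动态环境执行模块:执行脚本并捕获错误信息(此处用随机失败模拟)。"""
    if random.random() < 0.3:
        return False, "undefined parameter 'case'"
    return True, "converged in 5 iterations"

def solve(task: str, max_rounds: int = 3) -> str:
    knowledge = rag_lookup(task)
    feedback = ""
    for _ in range(max_rounds):
        script = llm_plan(task, knowledge, feedback)
        ok, msg = run_simulation(script)
        if ok:
            return msg          # 成功:返回仿真结果
        feedback = msg          # 失败:把错误信息反馈给下一轮推理
    return "failed after retries"

if __name__ == "__main__":
    random.seed(0)
    print(solve("ieee118 power flow"))
```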
计算机视觉
[CV-0] Video-Guided Foley Sound Generation with Multimodal Controls
【速读】: 该论文试图解决视频音效生成中艺术性音效创作与灵活控制的问题。解决方案的关键在于提出了MultiFoley模型,该模型支持通过文本、音频和视频的多模态条件生成音效。MultiFoley不仅能够根据静音视频和文本提示生成高质量的音效,还能通过选择参考音频或部分视频进行条件化,实现对音效的精细控制。模型的一个创新点在于其联合训练了互联网视频数据集和专业音效库的录音,从而能够生成高质量、全频带(48kHz)的音频。通过自动化评估和人类研究,证明了MultiFoley在各种条件输入下成功生成了同步的高质量音效,并优于现有方法。
链接: https://arxiv.org/abs/2411.17698
作者: Ziyang Chen,Prem Seetharaman,Bryan Russell,Oriol Nieto,David Bourgin,Andrew Owens,Justin Salamon
关键词-EN: requires creating artistic, Generating sound effects, creating artistic sound, Generating sound, requires creating
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Project site: this https URL
点击查看摘要
Abstract:Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion’s roar sound like a cat’s meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: this https URL
zh
[CV-1] StableAnimator: High-Quality Identity-Preserving Human Image Animation
【速读】: 该论文试图解决当前扩散模型在人体图像动画中难以保持身份一致性(ID consistency)的问题。解决方案的关键在于提出了StableAnimator,这是一个端到端的身份保持视频扩散框架。其核心创新包括:1) 使用现成的提取器计算图像和面部嵌入,并通过全局内容感知的面部编码器进一步优化面部嵌入;2) 引入分布感知身份适配器(distribution-aware ID Adapter),通过对齐方式防止时间层间的干扰,从而保持身份一致性;3) 在推理阶段,采用基于Hamilton-Jacobi-Bellman (HJB) 方程的优化方法,通过将HJB方程的求解集成到扩散去噪过程中,约束去噪路径,从而进一步增强面部质量。这些设计在训练和推理阶段均致力于身份一致性,实验结果表明StableAnimator在多个基准测试中均表现出色。
链接: https://arxiv.org/abs/2411.17697
作者: Shuyuan Tu,Zhen Xing,Xintong Han,Zhi-Qi Cheng,Qi Dai,Chong Luo,Zuxuan Wu
关键词-EN: Current diffusion models, human image animation, image animation struggle, Current diffusion, animation struggle
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively and face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.
zh
[CV-2] ScribbleLight: Single Image Indoor Relighting with Scribbles
【速读】: 该论文试图解决从单一图像中对室内房间进行基于图像的重照明问题,特别是在复杂光照交互和物体几何及材质多样性下的局部光照效果控制难题。解决方案的关键在于引入了ScribbleLight,一种生成式模型,通过涂鸦(scribbles)实现对光照效果的局部细粒度控制。其核心技术包括:1) 基于反照率(Albedo)条件的稳定图像扩散模型,确保重照明后图像的固有颜色和纹理得以保留;2) 基于编码器-解码器的ControlNet架构,结合法线图(normal map)和涂鸦注释,实现几何保持的光照效果。通过这些创新,ScribbleLight能够从稀疏的涂鸦注释中生成不同的光照效果,如开关灯光、添加高光、投射阴影或来自不可见光源的间接照明。
链接: https://arxiv.org/abs/2411.17696
作者: Jun Myeong Choi,Annie Wang,Pieter Peers,Anand Bhattad,Roni Sengupta
关键词-EN: immersive virtual understanding, virtual staging, interior design, real estate, immersive virtual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight’s ability to create different lighting effects (e.g., turning lights on/off, adding highlights, cast shadows, or indirect lighting from unseen lights) from sparse scribble annotations.
zh
[CV-3] Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
【速读】: 该论文试图解决从视频和语音转录生成语音(VTTS)的新任务,旨在推动多模态语音生成技术的发展。解决方案的关键在于提出了一种名为Visatronic的解码器专用多模态模型。该模型通过将视觉、文本和语音直接嵌入到Transformer模型的共同子空间中,并使用自回归损失来学习基于说话者视频和语音转录的离散化梅尔频谱图的生成模型。通过将所有模态嵌入到共同子空间,Visatronic相比仅使用文本或视频作为输入的模型,能够实现更好的效果。此外,相较于依赖唇部检测器和复杂架构的主流方法,Visatronic提供了一种更简单的多模态语音生成方法,同时还能产生更好的结果。该模型还具有灵活性,能够适应不同的输入序列排序方式,从而探索最佳的信息传递策略。
链接: https://arxiv.org/abs/2411.17690
作者: Akshita Gupta,Tatiana Likhomanenko,Karren Dai Yang,Richard He Bai,Zakaria Aldeneh,Navdeep Jaitly
关键词-EN: generating speech, task, speech, multimodal speech generation, motivate new techniques
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:In this paper, we propose a new task – generating speech from videos of people and their transcripts (VTTS) – to motivate new techniques for multimodal speech generation. This task generalizes the task of generating speech from cropped lip videos, and is also more complicated than the task of generating generic audio clips (e.g., dog barking) from videos and text. Multilingual versions of the task could lead to new techniques for cross-lingual dubbing. We also present a decoder-only multimodal model for this task, which we call Visatronic. This model embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech. By embedding all modalities into a common subspace, Visatronic can achieve improved results over models that use only text or video as input. Further, it presents a much simpler approach for multimodal speech generation compared to prevailing approaches which rely on lip-detectors and complicated architectures to fuse modalities while producing better results. Since the model is flexible enough to accommodate different ways of ordering inputs as a sequence, we carefully explore different strategies to better understand the best way to propagate information to the generative steps. To facilitate further research on VTTS, we will release (i) our code, (ii) clean transcriptions for the large-scale VoxCeleb2 dataset, and (iii) a standardized evaluation protocol for VTTS incorporating both objective and subjective metrics.
zh
[CV-4] GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration
【速读】: 该论文试图解决深度学习模型在全图像恢复(All-In-One Image Restoration, AIOR)任务中泛化能力不足的问题,主要原因是现有数据集中降质变化和场景多样性不足,导致模型难以应对真实世界中的复杂情况。解决方案的关键在于利用潜在扩散模型(latent diffusion models)的生成能力,合成高质量的降质图像。具体来说,论文提出了一个名为GenDeg的降质和强度感知条件扩散模型,能够生成多样化的降质模式。通过GenDeg,合成了超过55万张包含六种降质类型(雾、雨、雪、运动模糊、低光和雨滴)的图像,并与现有数据集结合形成GenDS数据集,总样本数超过75万。实验结果表明,基于GenDS数据集训练的图像恢复模型在分布外样本上的表现显著优于仅使用现有数据集训练的模型。
链接: https://arxiv.org/abs/2411.17687
作者: Sudarshan Rajagopalan,Nithin Gopalakrishnan Nair,Jay N. Paranjape,Vishal M. Patel
关键词-EN: Deep learning-based models, Deep learning-based, recent years, achieved significant advancements, advancements in recent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Deep learning-based models for All-In-One Image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation and intensity-aware conditional diffusion model capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset exhibit significant improvements in out-of-distribution performance compared to those trained solely on existing datasets. Furthermore, we provide comprehensive analyses on the implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.
zh
[CV-5] Rethinking Token Reduction in MLLM s: Towards a Unified Paradigm for Training-Free Acceleration
【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 推理过程中的计算效率问题。解决方案的关键在于提出了一种统一的“过滤-关联-压缩” (filter-correlate-compress) 范式,将现有的无训练标记减少方法分解为三个独立的阶段,从而清晰地展示了各组件的相互作用和效果,便于比较、迁移和扩展。该范式不仅涵盖了现有的流行方法,还提供了一系列基于此范式的新方法,在保持推理速度的同时,最大限度地减少对模型性能的影响。实验结果表明,这些方法在多个基准测试中能够实现高达82.4%的FLOPs减少,同时超越了现有的最先进无训练方法。
链接: https://arxiv.org/abs/2411.17686
作者: Yuhang Han,Xuyang Liu,Pengxiang Ding,Donglin Wang,Honggang Chen,Qingsen Yan,Siteng Huang
关键词-EN: Large Language Models, Multimodal Large Language, heavy Multimodal Large, Language Models, Multimodal Large
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of existing methods are tightly intertwined, with their interconnections and effects remaining unclear for comparison, transfer, and expansion. Therefore, we propose a unified ‘‘filter-correlate-compress’’ paradigm that decomposes the token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify the popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of the inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods. Our project page is at this https URL.
zh
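为直观理解 [CV-5] 提出的"过滤-关联-压缩"三阶段范式，下面给出一个无训练 token 缩减的极简示意（并非论文官方实现：重要性用 L2 范数近似、关联用余弦相似度、压缩用加权平均，均为演示用的假设性简化选择）：

```python
# 示意代码:按"过滤-关联-压缩"三阶段缩减视觉 token(非论文官方实现)
import torch
import torch.nn.functional as F

def filter_correlate_compress(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (N, D) 的视觉 token,返回缩减后的 token。"""
    n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))

    # 1) 过滤:用 L2 范数作为重要性分数,保留得分最高的 k 个 token
    keep_idx = tokens.norm(dim=-1).topk(k).indices
    keep_mask = torch.zeros(n, dtype=torch.bool)
    keep_mask[keep_idx] = True
    kept, dropped = tokens[keep_idx], tokens[~keep_mask]
    if dropped.shape[0] == 0:
        return kept

    # 2) 关联:为每个被丢弃的 token 找到余弦相似度最高的保留 token
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T   # (N-k, k)
    assign = sim.argmax(dim=-1)

    # 3) 压缩:把被丢弃的 token 平均合并进其关联的保留 token,而不是直接丢掉信息
    merged = kept.clone()
    for j in range(k):
        group = dropped[assign == j]
        if group.shape[0] > 0:
            merged[j] = (kept[j] + group.sum(dim=0)) / (1 + group.shape[0])
    return merged

if __name__ == "__main__":
    torch.manual_seed(0)
    vis_tokens = torch.randn(196, 64)    # 假设 14x14 个 patch token、64 维特征
    print(filter_correlate_compress(vis_tokens).shape)   # torch.Size([98, 64])
```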
[CV-6] SketchAgent : Language-Driven Sequential Sketch Generation
【速读】: 该论文试图解决人工系统在捕捉人类草图绘制的动态和抽象特性方面的挑战。解决方案的关键在于引入了一个名为 SketchAgent 的语言驱动、序列化草图生成方法。该方法通过动态对话交互,使用户能够创建、修改和完善草图,而无需任何训练或微调。其核心在于利用现成的多模态大语言模型 (LLMs) 的序列特性和丰富的先验知识,通过上下文示例引入直观的草图语言,使模型能够使用基于字符串的动作进行“绘制”,这些动作随后被处理为矢量图形并在像素画布上渲染,从而生成草图。这种方法能够逐笔捕捉草图的演变和动态特性,展示了 SketchAgent 在生成多样草图、对话驱动绘图以及与人类用户有意义协作方面的能力。
链接: https://arxiv.org/abs/2411.17673
作者: Yael Vinker,Tamar Rott Shaham,Kristine Zheng,Alex Zhao,Judith E Fan,Antonio Torralba
关键词-EN: enabling rapid exploration, externalizing ideas, spans various disciplines, versatile tool, tool for externalizing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to “draw” using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.
zh
[CV-7] RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
【速读】: 该论文试图解决在视觉姿态估计中,现有方法未能充分利用机器人图像中丰富的物理结构信息,导致在遮挡和截断情况下性能受限的问题。解决方案的关键在于引入了一种名为RoboPEPP的方法,该方法通过融合机器人的物理模型信息到编码器中,采用基于掩码的自监督嵌入预测架构。具体来说,RoboPEPP通过掩码机器人的关节,并预训练一个编码器-预测器模型,使其能够从周围未掩码区域推断出关节的嵌入,从而增强编码器对机器人物理模型的理解。预训练的编码器-预测器对与关节角度和关键点预测网络一起微调,用于姿态和关节角度估计。在微调过程中随机掩码输入以及在评估过程中进行关键点过滤,进一步提高了方法的鲁棒性。该方法在多个数据集上实现了最佳的机器人姿态和关节角度估计性能,同时对遮挡的敏感性最低,执行时间最短。
链接: https://arxiv.org/abs/2411.17662
作者: Raktim Gautam Goswami,Prashanth Krishnamurthy,Yann LeCun,Farshad Khorrami
关键词-EN: human-robot interaction tasks, Vision-based pose estimation, Vision-based pose, unknown joint angles, interaction tasks
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-based pose estimation of articulated robots with unknown joint angles has applications in collaborative robotics and human-robot interaction tasks. Current frameworks use neural network encoders to extract image features and downstream layers to predict joint angles and robot pose. While images of robots inherently contain rich information about the robot’s physical structures, existing methods often fail to leverage it fully; therefore, limiting performance under occlusions and truncations. To address this, we introduce RoboPEPP, a method that fuses information about the robot’s physical model into the encoder using a masking-based self-supervised embedding-predictive architecture. Specifically, we mask the robot’s joints and pre-train an encoder-predictor model to infer the joints’ embeddings from surrounding unmasked regions, enhancing the encoder’s understanding of the robot’s physical model. The pre-trained encoder-predictor pair, along with joint angle and keypoint prediction networks, is then fine-tuned for pose and joint angle estimation. Random masking of input during fine-tuning and keypoint filtering during evaluation further improves robustness. Our method, evaluated on several datasets, achieves the best results in robot pose and joint angle estimation while being the least sensitive to occlusions and requiring the lowest execution time.
zh
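[CV-7] 的预训练思想是"掩掉关节区域，让编码器-预测器从未掩码的上下文推断关节嵌入"。下面是该思路的一个极简 PyTorch 骨架（并非 RoboPEPP 官方实现，TinyEncoder/Predictor 结构、掩码与损失形式均为演示用的假设简化；target_encoder 在真实系统中通常是 EMA 目标编码器）：

```python
# 示意代码:基于掩码的嵌入预测式预训练的极简骨架(非论文官方实现)
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, in_dim=256, emb=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb))
    def forward(self, x):            # x: (B, P, in_dim),P 个图像 patch
        return self.net(x)

class Predictor(nn.Module):
    def __init__(self, emb=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb, emb), nn.ReLU(), nn.Linear(emb, emb))
    def forward(self, ctx):          # 由未掩码区域的聚合嵌入预测被掩码关节的嵌入
        return self.net(ctx)

def pretrain_step(patches, joint_mask, encoder, target_encoder, predictor, opt):
    """patches: (B, P, D);joint_mask: (B, P) 布尔张量,True 表示该 patch 覆盖机器人关节。"""
    with torch.no_grad():
        target = target_encoder(patches)                          # 目标嵌入(停止梯度)
    ctx_emb = encoder(patches * (~joint_mask).unsqueeze(-1))      # 掩掉关节区域后编码
    ctx_pool = ctx_emb.mean(dim=1, keepdim=True)                  # 聚合上下文
    pred = predictor(ctx_pool).expand_as(target)
    loss = ((pred - target)[joint_mask]).pow(2).mean()            # 只在关节 patch 上计算误差
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    enc, tgt, prd = TinyEncoder(), TinyEncoder(), Predictor()
    opt = torch.optim.Adam(list(enc.parameters()) + list(prd.parameters()), lr=1e-3)
    patches = torch.randn(2, 49, 256)                   # 假设 7x7 个 patch
    mask = torch.zeros(2, 49, dtype=torch.bool)
    mask[:, 10:16] = True                               # 假设这些 patch 覆盖关节
    print(pretrain_step(patches, mask, enc, tgt, prd, opt))
```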
[CV-8] DROID-Splat: Combining end-to-end SLAM with 3D Gaussian Splatting
【速读】: 该论文试图解决单目视频场景下同时实现高鲁棒性、速度和精度的同步定位与地图构建(SLAM)问题。解决方案的关键在于结合端到端跟踪器(end-to-end Tracker)和基于3D高斯Splatting技术的渲染器(Renderer),构建了一个名为DroidSplat的SLAM系统。该系统通过并行运行现代SLAM系统的多个构建模块,实现了在普通消费级GPU上的快速推理,并利用单目深度预测和相机标定技术,即使在未知相机内参的情况下,也能在野外数据上取得优异的跟踪和渲染效果。
链接: https://arxiv.org/abs/2411.17660
作者: Christian Homeyer,Leon Begiristain,Christoph Schnörr
关键词-EN: makes standalone SLAM, scene synthesis makes, synthesis makes standalone, SLAM systems purely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent progress in scene synthesis makes standalone SLAM systems purely based on optimizing hyperprimitives with a Rendering objective possible [monogs]. However, the tracking performance still lags behind traditional [orbslam] and end-to-end SLAM systems [droid]. An optimal trade-off between robustness, speed and accuracy has not yet been reached, especially for monocular video. In this paper, we introduce a SLAM system based on an end-to-end Tracker and extend it with a Renderer based on recent 3D Gaussian Splatting techniques. Our framework DroidSplat achieves both SotA tracking and rendering results on common SLAM benchmarks. We implemented multiple building blocks of modern SLAM systems to run in parallel, allowing for fast inference on common consumer GPUs. Recent progress in monocular depth prediction and camera calibration allows our system to achieve strong results even on in-the-wild data without known camera intrinsics. Code will be available at this https URL.
zh
[CV-9] SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation
【速读】: 该论文试图解决在视频对象分割 (Referring Video Object Segmentation, RVOS) 中,现有方法在处理长视频时丢失全局上下文或无法在线处理的问题。解决方案的关键在于设计一种能够在流式处理场景中有效操作,同时保留过去帧上下文信息的RVOS方法。论文基于Segment-Anything 2 (SAM2) 模型,通过引入一个新颖的适配器模块 (adapter module),在特征提取阶段注入时间信息和多模态线索,从而增强SAM2的自然语言理解和显式时间建模能力,且无需微调其权重或依赖外部模型进行模态交互。此外,论文还提出了一种可学习的模块来调整SAM2的跟踪焦点,以应对当前帧特征提示新对象更符合描述的情况。最终,提出的SAMWISE方法在多个基准测试中达到最先进水平,仅增加了4.2 M参数的额外开销。
链接: https://arxiv.org/abs/2411.17646
作者: Claudia Cuttano,Gabriele Trivigno,Gabriele Rosi,Carlo Masone,Giuseppe Averta
关键词-EN: Referring Video Object, Referring Video, Video Object Segmentation, expressions to segment, natural language expressions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods restrict reasoning either to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, that provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser, by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights, and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi-modal cues in the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method, SAMWISE, achieves state-of-the-art across various benchmarks, by adding a negligible overhead of just 4.2 M parameters. The code is available at this https URL
zh
[CV-10] Accelerating Vision Diffusion Transformers with Skip Branches
【速读】: 该论文试图解决扩散变换器 (Diffusion Transformers, DiT) 在实际部署中的计算复杂性和序列去噪过程中的冗余问题。解决方案的关键在于通过引入跳跃分支 (skip branches) 来增强特征平滑性,从而提高特征在时间步间的可重用性。具体来说,论文提出了将标准 DiT 转换为带有跳跃分支的 Skip-DiT,并引入 Skip-Cache 机制,在推理时利用跳跃分支跨时间步缓存 DiT 特征。实验结果表明,Skip-DiT 在不显著降低生成质量的情况下,实现了显著的加速效果,例如在几乎不增加额外计算成本的情况下实现了 1.5 倍的加速,而在仅轻微降低量化指标的情况下实现了 2.2 倍的加速。
链接: https://arxiv.org/abs/2411.17616
作者: Guanjie Chen,Xinyu Zhao,Yucheng Zhou,Tianlong Chen,Cheng Yu
关键词-EN: demonstrated great potential, Diffusion Transformers, scalability properties, demonstrated great, great potential
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures
点击查看摘要
Abstract:Diffusion Transformers (DiT), an emerging image and video generation model architecture, has demonstrated great potential because of its high generation quality and scalability properties. Despite the impressive performance, its practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. While feature caching across timesteps has proven effective in accelerating diffusion models, its application to DiT is limited by fundamental architectural differences from U-Net-based approaches. Through empirical analysis of DiT feature dynamics, we identify that significant feature variation between DiT blocks presents a key challenge for feature reusability. To address this, we convert standard DiT into Skip-DiT with skip branches to enhance feature smoothness. Further, we introduce Skip-Cache which utilizes the skip branches to cache DiT features across timesteps at the inference time. We validated effectiveness of our proposal on different DiT backbones for video and image generation, showcasing skip branches to help preserve generation quality and achieve higher speedup. Experimental results indicate that Skip-DiT achieves a 1.5x speedup almost for free and a 2.2x speedup with only a minor reduction in quantitative metrics. Code is available at this https URL.
zh
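[CV-10] 中 Skip-Cache 的要点是：跳跃分支让深层特征在相邻时间步之间足够平滑，因而可以跨步缓存复用。下面的示意代码演示"每隔若干步刷新一次缓存、其余步直接复用"的去噪推理循环（并非官方实现，block 结构与缓存间隔均为假设）：

```python
# 示意代码:跨时间步缓存跳跃分支特征以加速推理(非论文官方实现)
import torch
import torch.nn as nn

class BlockWithSkip(nn.Module):
    """一个带跳跃分支的简化 block:深层特征可被缓存复用。"""
    def __init__(self, dim=64):
        super().__init__()
        self.heavy = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.skip = nn.Linear(dim, dim)
    def forward(self, x, cached=None):
        deep = self.heavy(x) if cached is None else cached   # 命中缓存则跳过重计算
        return x + deep + self.skip(x), deep

@torch.no_grad()
def denoise(x, blocks, steps=8, cache_interval=2):
    cache = [None] * len(blocks)
    for t in range(steps):
        refresh = (t % cache_interval == 0)      # 每 cache_interval 步刷新一次缓存
        for i, blk in enumerate(blocks):
            x, deep = blk(x, None if refresh else cache[i])
            if refresh:
                cache[i] = deep
    return x

if __name__ == "__main__":
    torch.manual_seed(0)
    blocks = nn.ModuleList([BlockWithSkip() for _ in range(4)])
    latent = torch.randn(1, 16, 64)
    print(denoise(latent, blocks).shape)   # torch.Size([1, 16, 64])
```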
[CV-11] Modality-Incremental Learning with Disjoint Relevance Mapping Networks for Image-based Semantic Segmentation WACV2025
【速读】: 该论文试图解决在自动驾驶环境中,由于传感器数据多样性导致的灾难性遗忘问题,特别是在持续学习(Continual Learning, CL)框架下,当面对显著的领域转移(如不同传感器模态)时,增量学习(Incremental Learning)的挑战。解决方案的关键在于提出了模态增量学习(Modality-Incremental Learning)的概念,并通过修改相关性映射网络(Relevance Mapping Network, RMN)来实现。具体来说,通过确保相关性映射的分离性,防止共享连接,从而在严格持续学习框架下减轻遗忘问题。实验结果表明,这种方法有效地保留了先前学习模态的性能,同时能够增量学习新的模态。
链接: https://arxiv.org/abs/2411.17610
作者: Niharika Hegde,Shishir Muralidhara,René Schuster,Didier Stricker
关键词-EN: deep learning techniques, autonomous driving, environment perception, perception has significantly, significantly advanced
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2025
点击查看摘要
Abstract:In autonomous driving, environment perception has significantly advanced with the utilization of deep learning techniques for diverse sensors such as cameras, depth sensors, or infrared sensors. The diversity in the sensor stack increases the safety and contributes to robustness against adverse weather and lighting conditions. However, the variance in data acquired from different sensors poses challenges. In the context of continual learning (CL), incremental learning is especially challenging for considerably large domain shifts, e.g. different sensor modalities. This amplifies the problem of catastrophic forgetting. To address this issue, we formulate the concept of modality-incremental learning and examine its necessity, by contrasting it with existing incremental learning paradigms. We propose the use of a modified Relevance Mapping Network (RMN) to incrementally learn new modalities while preserving performance on previously learned modalities, in which relevance maps are disjoint. Experimental results demonstrate that the prevention of shared connections in this approach helps alleviate the problem of forgetting within the constraints of a strict continual learning framework.
zh
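[CV-11] 防止遗忘的关键约束是：各模态的相关性掩码互不相交（disjoint），即不同模态不共享任何连接。下面用一个带掩码的线性层示意这一约束（并非论文官方的 RMN 实现，掩码的划分方式此处随机假设）：

```python
# 示意代码:为不同模态分配互不相交的相关性掩码(非论文官方实现)
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """权重按模态掩码划分:各模态只使用自己那部分连接,互不共享。"""
    def __init__(self, in_f, out_f, modalities):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_f))
        # 随机把每个权重元素划给某个模态,得到互不相交(disjoint)的掩码
        # (示意用普通张量保存,真实实现中应注册为 buffer 并按相关性学习划分)
        owner = torch.randint(0, len(modalities), (out_f, in_f))
        self.masks = {m: (owner == i).float() for i, m in enumerate(modalities)}
    def forward(self, x, modality):
        w = self.weight * self.masks[modality]
        return x @ w.T + self.bias

if __name__ == "__main__":
    torch.manual_seed(0)
    layer = MaskedLinear(32, 16, ["rgb", "depth"])
    x = torch.randn(4, 32)
    out_rgb, out_depth = layer(x, "rgb"), layer(x, "depth")
    # 两个掩码逐元素相乘恒为 0,说明没有任何连接被两个模态共享
    print((layer.masks["rgb"] * layer.masks["depth"]).sum().item())  # 0.0
```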
[CV-12] HyperSeg: Towards Universal Visual Segmentation with Large Language Model
【速读】: 该论文试图解决图像和视频感知中的通用分割问题,特别是如何通过视觉大语言模型 (Visual Large Language Models, VLLMs) 增强的强大推理能力来处理复杂的推理分割任务。解决方案的关键在于提出了HyperSeg,这是首个基于VLLM的通用分割模型,能够进行像素级的图像和视频感知,涵盖通用分割任务以及需要强大推理能力和世界知识的复杂推理感知任务。HyperSeg通过结合混合实体识别和细粒度视觉感知模块,充分利用VLLMs的识别能力和细粒度视觉信息,并结合时间适配器实现对时间信息的全面理解。实验结果验证了该方法在解决通用图像和视频分割任务中的有效性,包括更复杂的推理感知任务。
链接: https://arxiv.org/abs/2411.17606
作者: Cong Wei,Yujie Zhong,Haoxian Tan,Yong Liu,Zheng Zhao,Jie Hu,Yujiu Yang
关键词-EN: Large Language Models, Visual Large Language, Large Language, strong reasoning ability, reasoning ability empowered
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adaptation to both image and video scenarios, as well as the complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring powerful reasoning abilities and world knowledge. Besides, to fully leverage the recognition capabilities of VLLMs and the fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.
zh
[CV-13] Distractor-free Generalizable 3D Gaussian Splatting
【速读】: 该论文试图解决在包含干扰物(distractor)的3D高斯喷洒(3D Gaussian Splatting, 3DGS)数据中进行泛化的问题。解决方案的关键在于提出了一种名为DGGS(Distractor-free Generalizable 3D Gaussian Splatting)的新框架,该框架通过以下两个主要策略来实现目标:1) 在训练阶段引入基于场景无关的参考掩码预测和细化方法,结合训练视图选择策略,以提高干扰物预测的准确性和训练稳定性;2) 在推理阶段采用两阶段的推理框架,基于预测的干扰物掩码进行更好的参考选择,并结合干扰物修剪模块以消除残留的干扰物影响。这些方法共同作用,使得DGGS在包含干扰物的条件下表现出优越的泛化性能,并且在场景无关的掩码推理中达到与场景特定训练方法相当的准确性。
链接: https://arxiv.org/abs/2411.17605
作者: Yanqi Bao,Jing Liao,Jing Huo,Yang Gao
关键词-EN: Gaussian Splatting, previously unexplored challenge, Distractor-free Generalizable, addressing the previously, previously unexplored
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present DGGS, a novel framework addressing the previously unexplored challenge of Distractor-free Generalizable 3D Gaussian Splatting (3DGS). It accomplishes two key objectives: fortifying generalizable 3DGS against distractor-laden data during both training and inference phases, while successfully extending cross-scene adaptation capabilities to conventional distractor-free approaches. To achieve these objectives, DGGS introduces a scene-agnostic reference-based mask prediction and refinement methodology during training phase, coupled with a training view selection strategy, effectively improving distractor prediction accuracy and training stability. Moreover, to address distractor-induced voids and artifacts during inference stage, we propose a two-stage inference framework for better reference selection based on the predicted distractor masks, complemented by a distractor pruning module to eliminate residual distractor effects. Extensive generalization experiments demonstrate DGGS’s advantages under distractor-laden conditions. Additionally, experimental results show that our scene-agnostic mask inference achieves accuracy comparable to scene-specific trained methods. Homepage is this https URL.
zh
[CV-14] VideoDirector: Precise Video Editing via Text-to-Video Models
【速读】: 该论文试图解决现有文本到视频(T2V)模型在视频编辑过程中出现的颜色闪烁和内容失真等严重伪影问题。解决方案的关键在于提出了空间-时间解耦引导(STDG)和多帧空文本优化策略,以提供更精确的关键帧反演,同时引入自注意力控制策略来保持未编辑内容的更高保真度。这些方法旨在克服传统基于文本到图像(T2I)模型的编辑方法在时间一致性生成能力上的不足,从而实现更高质量的视频编辑效果。
链接: https://arxiv.org/abs/2411.17592
作者: Yukun Wang,Longguang Wang,Zhiyuan Ma,Qibin Hu,Kai Xu,Yulan Guo
关键词-EN: suffers severe artifacts, demonstrated promising results, Tightly Spatial-temporal Coupling, directly extending, demonstrated promising
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 figures
点击查看摘要
Abstract:Despite the typical inversion-then-editing paradigm using text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generative ability, often resulting in inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tightly Spatial-temporal Coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated Spatial-temporal Layout. The vanilla cross-attention control is deficient in preserving the unedited content. To address these limitations, we propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.
zh
[CV-15] Pre-training for Action Recognition with Automatically Generated Fractal Datasets
【速读】: 该论文试图解决在视频领域中使用合成数据进行预训练的问题,特别是在动作识别任务中。解决方案的关键在于利用分形几何(fractal geometry)自动生成大规模的合成视频片段,并通过模拟真实视频的关键属性来缩小域差距。论文通过详细的消融实验确定了增强下游任务性能的属性,并提供了使用合成视频进行预训练的一般指导原则。实验结果表明,该方法在多个视频基准测试中表现优异,甚至在某些下游数据集上优于标准的Kinetics预训练方法。
链接: https://arxiv.org/abs/2411.17584
作者: Davyd Svyezhentsev,George Retsinas,Petros Maragos
关键词-EN: including object classification, recent years, including object, object classification, medical imaging
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification, medical imaging etc. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright and privacy. We extend this trend to the video domain applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemmed by the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on established action recognition datasets HMDB51 and UCF101 as well as four other video benchmarks related to group action recognition, fine-grained action recognition and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets. Code and samples of synthetic videos are available at this https URL.
zh
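[CV-15] 依赖分形几何自动合成视频片段。下面用迭代函数系统（IFS）加"混沌游戏"采样给出一个极简示意：逐帧微扰仿射参数即可得到随时间演化的分形结构（并非论文官方生成流程，仿射参数、帧数与点数均为随意设定的假设值）：

```python
# 示意代码:用迭代函数系统(IFS)自动生成分形"视频"片段(非论文官方实现)
import numpy as np

def random_ifs(n_maps=4, rng=None):
    """随机采样一组 2D 仿射映射作为 IFS。"""
    rng = rng or np.random.default_rng(0)
    return [(rng.uniform(-0.8, 0.8, (2, 2)), rng.uniform(-0.5, 0.5, 2)) for _ in range(n_maps)]

def render_fractal(ifs, n_points=2000, rng=None):
    """混沌游戏(chaos game):反复随机套用仿射映射,得到分形点云。"""
    rng = rng or np.random.default_rng(0)
    p = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i in range(n_points):
        A, b = ifs[rng.integers(len(ifs))]
        p = A @ p + b
        pts[i] = p
    return pts

def fractal_clip(n_frames=16, jitter=0.02, seed=0):
    """逐帧微扰 IFS 参数,使分形结构随时间演化,形成一个合成视频片段。"""
    rng = np.random.default_rng(seed)
    ifs = random_ifs(rng=rng)
    frames = []
    for _ in range(n_frames):
        ifs = [(A + rng.normal(0, jitter, A.shape), b + rng.normal(0, jitter, b.shape))
               for A, b in ifs]
        frames.append(render_fractal(ifs, rng=rng))
    return frames

if __name__ == "__main__":
    clip = fractal_clip()
    print(len(clip), clip[0].shape)   # 16 (2000, 2)
```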
[CV-16] Revisiting Point Cloud Completion: Are We Ready For The Real-World?
【速读】: 该论文试图解决在受限和具有挑战性的现实环境中获取的点云数据不完整、非均匀稀疏或两者兼有的问题,特别是在点云补全任务中面临的挑战。解决方案的关键在于利用代数拓扑学和持久同调(Persistent Homology, PH)工具,识别出现有基准合成点云数据缺乏现实点云中重要的拓扑特征。为此,论文贡献了首个用于点云补全的现实工业点云数据集RealPC,并展示了现有方法在处理现实数据时的不足。基于RealPC中包含的0维和1维PH拓扑特征,论文提出将基于同调的拓扑先验信息整合到现有方法中,特别是利用0维PH先验提取完整形状的全局拓扑信息,以生成拓扑一致的完整形状。
链接: https://arxiv.org/abs/2411.17580
作者: Stuti Pathak,Prashant Kumar,Nicholus Mboga,Gunther Steenackers,Rudi Penne
关键词-EN: point cloud completion, Point clouds acquired, Point clouds, non-uniformly sparse, Point
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point clouds acquired in constrained and challenging real-world settings are incomplete, non-uniformly sparse, or both. These obstacles present acute challenges for a vital task - point cloud completion. Using tools from Algebraic Topology and Persistent Homology (PH), we demonstrate that current benchmark synthetic point clouds lack rich topological features that are important constituents of point clouds captured in realistic settings. To facilitate research in this direction, we contribute the first real-world industrial point cloud dataset for point cloud completion, RealPC - a diverse set of rich and varied point clouds, consisting of ~40,000 pairs across 21 categories of industrial structures in railway establishments. Our benchmark results on several strong baselines reveal a striking observation - the existing methods are tailored for synthetic datasets and fail miserably in real-world settings. Building on our observation that RealPC consists of several 0 and 1-dimensional PH-based topological features, we demonstrate the potential of integrating Homology-based topological priors into existing works. More specifically, we present how 0-dimensional PH priors, which extract the global topology of a complete shape in the form of a 3-D skeleton, can assist a model in generating topologically-consistent complete shapes.
zh
[CV-17] A Distractor-Aware Memory for Visual Object Tracking with SAM2
【速读】: 该论文试图解决现有基于记忆的跟踪器(memory-based trackers)在存在干扰物(distractors)时表现不佳的问题。解决方案的关键在于提出了一种新的干扰物感知记忆模型(distractor-aware memory model),并结合基于内省的更新策略(introspection-based update strategy),以提高分割准确性和跟踪鲁棒性。具体来说,研究者对SAM2进行了改进,提出了SAM2.1++,并通过引入新的干扰物提炼数据集(distractor-distilled DiDi dataset)来更好地研究干扰物问题。实验结果表明,SAM2.1++在七个基准测试中均优于SAM2.1及其相关扩展,并在六个基准测试中达到了新的最佳性能。
链接: https://arxiv.org/abs/2411.17576
作者: Jovana Videnovic,Alan Lukezic,Matej Kristan
关键词-EN: recently tracked frames, concatenating recently tracked, video object segmentation, object segmentation methods, form the target
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. Code available on Github: this https URL
点击查看摘要
Abstract:Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
zh
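[CV-17] 的干扰物感知记忆与基于内省的更新策略，可以用下面的缓冲区骨架示意：只有分割置信度足够高且未被判定为干扰物的帧才写入目标记忆，疑似干扰物则单独存放（并非 SAM2.1++ 的官方实现，置信度阈值与缓冲容量均为假设值）：

```python
# 示意代码:带"内省式"更新策略的干扰物感知记忆缓冲(非论文官方实现)
from collections import deque

class DistractorAwareMemory:
    def __init__(self, target_size=5, distractor_size=5, conf_thresh=0.8):
        self.target_mem = deque(maxlen=target_size)          # 可靠的目标帧特征
        self.distractor_mem = deque(maxlen=distractor_size)  # 干扰物帧特征
        self.conf_thresh = conf_thresh

    def update(self, frame_feat, seg_conf, looks_like_distractor):
        """内省:只有分割置信度足够高且未被判为干扰物时才写入目标记忆。"""
        if looks_like_distractor:
            self.distractor_mem.append(frame_feat)
        elif seg_conf >= self.conf_thresh:
            self.target_mem.append(frame_feat)
        # 否则认为当前帧不可靠,不污染任何记忆

    def context(self):
        """跟踪时同时读取目标记忆与干扰物记忆,帮助把两者区分开。"""
        return list(self.target_mem), list(self.distractor_mem)

if __name__ == "__main__":
    mem = DistractorAwareMemory()
    mem.update("feat_t0", seg_conf=0.95, looks_like_distractor=False)
    mem.update("feat_t1", seg_conf=0.55, looks_like_distractor=False)   # 置信度不足,被丢弃
    mem.update("feat_t2", seg_conf=0.90, looks_like_distractor=True)
    print(mem.context())   # (['feat_t0'], ['feat_t2'])
```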
[CV-18] A Bilayer Segmentation-Recombination Network for Accurate Segmentation of Overlapping C. elegans
【速读】: 该论文试图解决秀丽隐杆线虫(Caenorhabditis elegans, C. elegans)图像分割中的两个主要问题:1) 线虫活动轨迹不可控,导致多条线虫重叠,边界模糊,难以清晰研究单条线虫的生命轨迹;2) 重叠线虫的显微图像中,边缘的半透明组织相互遮挡,导致边界分割不准确。解决方案的关键在于提出了一种双层分割重组网络(Bilayer Segmentation-Recombination Network, BR-Net),该网络包含三个模块:粗略掩码分割模块(Coarse Mask Segmentation Module, CMSM)、双层分割模块(Bilayer Segmentation Module, BSM)和语义一致性重组模块(Semantic Consistency Recombination Module, SCRM)。CMSM用于提取粗略掩码,并通过引入统一注意力模块(Unified Attention Module, UAM)增强对线虫实例的感知;BSM将聚集的线虫分割为重叠和非重叠区域;SCRM通过引入语义一致性正则化,进一步提高线虫实例分割的准确性。实验结果表明,BR-Net在处理线虫重叠图像时表现出色,优于其他近期提出的实例分割方法。
链接: https://arxiv.org/abs/2411.17557
作者: Mengqian Ding,Jun Liu,Yang Luo,Jinshan Tang
关键词-EN: Bilayer Segmentation Module, excellent model organism, Segmentation Module, Caenorhabditis elegans, Mask Segmentation Module
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Caenorhabditis elegans (C. elegans) is an excellent model organism because of its short lifespan and high degree of homology with human genes, and it has been widely used in a variety of human health and disease models. However, the segmentation of C. elegans remains challenging due to the following reasons: 1) the activity trajectory of C. elegans is uncontrollable, and multiple nematodes often overlap, resulting in blurred boundaries of C. elegans. This makes it impossible to clearly study the life trajectory of a certain nematode; and 2) in the microscope images of overlapping C. elegans, the translucent tissues at the edges obscure each other, leading to inaccurate boundary segmentation. To solve these problems, a Bilayer Segmentation-Recombination Network (BR-Net) for the segmentation of C. elegans instances is proposed. The network consists of three parts: A Coarse Mask Segmentation Module (CMSM), a Bilayer Segmentation Module (BSM), and a Semantic Consistency Recombination Module (SCRM). The CMSM is used to extract the coarse mask, and we introduce a Unified Attention Module (UAM) in CMSM to make CMSM better aware of nematode instances. The Bilayer Segmentation Module (BSM) segments the aggregated C. elegans into overlapping and non-overlapping regions. This is followed by integration by the SCRM, where semantic consistency regularization is introduced to segment nematode instances more accurately. Finally, the effectiveness of the method is verified on the C. elegans dataset. The experimental results show that BR-Net exhibits good competitiveness and outperforms other recently proposed instance segmentation methods in processing C. elegans occlusion images.
zh
[CV-19] Rapid Deployment of Domain-specific Hyperspectral Image Processors with Application to Autonomous Driving
【速读】: 该论文试图解决在资源和功耗受限的低成本系统级模块 (System-On-Module, SOM) 平台上实现高效的超光谱成像 (Hyperspectral Imaging, HSI) 处理器的问题,特别是用于自动驾驶系统 (Autonomous Driving Systems, ADS) 中的图像语义分割。解决方案的关键在于对多层全卷积网络 (Fully Convolutional Networks, FCN) 进行重新设计和定制,以适应低成本SOM的约束条件。具体措施包括采用数据和硬件特定的量化技术,将FCN适配到商用定点可编程AI协处理器IP中,并提出一种完全定制的后训练量化方案,以在不牺牲分割精度的前提下降低计算和存储成本。
链接: https://arxiv.org/abs/2411.17543
作者: Jon Gutiérrez-Zaballa,Koldo Basterretxea,Javier Echanobe,Óscar Mata-Carballeira,M. Victoria Martínez
关键词-EN: efficient hyperspectral imaging, hyperspectral imaging, processors for application, implementation of efficient, efficient hyperspectral
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:The article discusses the use of low cost System-On-Module (SOM) platforms for the implementation of efficient hyperspectral imaging (HSI) processors for application in autonomous driving. The work addresses the challenges of shaping and deploying multiple layer fully convolutional networks (FCN) for low-latency, on-board image semantic segmentation using resource- and power-constrained processing devices. The paper describes in detail the steps followed to redesign and customize a successfully trained HSI segmentation lightweight FCN that was previously tested on a high-end heterogeneous multiprocessing system-on-chip (MPSoC) to accommodate it to the constraints imposed by a low-cost SOM. This SOM features a lower-end but much cheaper MPSoC suitable for the deployment of automatic driving systems (ADS). In particular the article reports the data- and hardware-specific quantization techniques utilized to fit the FCN into a commercial fixed-point programmable AI coprocessor IP, and proposes a full customized post-training quantization scheme to reduce computation and storage costs without compromising segmentation accuracy.
zh
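[CV-19] 涉及把浮点 FCN 适配到定点 AI 协处理器的训练后量化。下面给出一个最小的对称 int8 量化/反量化示意，只为说明"由张量统计确定缩放因子"这一基本步骤（并非论文提出的定制化后训练量化方案，也不含硬件相关的细节）：

```python
# 示意代码:训练后量化(PTQ)到定点表示的最小示例(非论文官方方案)
import numpy as np

def quantize_tensor(x: np.ndarray, n_bits: int = 8):
    """对称定点量化:scale 由张量的最大绝对值决定。"""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.05, size=(16, 3, 3, 3)).astype(np.float32)  # 假设的一层卷积权重
    q, s = quantize_tensor(w)
    err = np.abs(dequantize(q, s) - w).mean()
    # 量化误差应远小于权重幅值,说明 8bit 定点通常足以保留该层信息
    print(f"scale={s:.5f}, mean abs error={err:.6f}")
```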
[CV-20] Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning BMVC2024
【速读】: 该论文试图解决多任务部分监督学习中的数据标注问题,特别是在对象检测(Object Detection)和语义分割(Semantic Segmentation)任务中,每个训练样本仅针对单一任务进行标注的情况。解决方案的关键在于提出了Box-for-Mask和Mask-for-Box策略,以及它们的组合BoMBo,通过弱损失函数(weak losses)从一种任务的标注中提取必要信息来训练另一种任务,从而实现不同任务数据集的扩展和互补。实验结果表明,这些策略在VOC和COCO数据集上表现良好。
链接: https://arxiv.org/abs/2411.17536
作者: Hoàng-Ân Lê,Paul Berg,Minh-Tan Pham
关键词-EN: scene understanding tasks, Object detection requires, Object detection, semantic segmentation, semantic segmentation requires
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publishing in BMVC 2024
点击查看摘要
Abstract:Object detection and semantic segmentation are both scene understanding tasks yet they differ in data structure and information level. Object detection requires box coordinates for object instances while semantic segmentation requires pixel-wise class labels. Making use of one task’s information to train the other would be beneficial for multi-task partially supervised learning where each training example is annotated only for a single task, having the potential to expand training sets with different-task datasets. This paper studies various weak losses for partially annotated data in combination with existing supervised losses. We propose Box-for-Mask and Mask-for-Box strategies, and their combination BoMBo, to distil necessary information from one task annotations to train the other. Ablation studies and experimental results on VOC and COCO datasets show favorable results for the proposed idea. Source code and data splits can be found at this https URL.
zh
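[CV-20] 的 Box-for-Mask 思路是用框标注为分割分支构造弱损失。下面给出一种常见的简化写法作为示意：框外像素压向背景、框内每行/每列的最大响应压向前景（投影约束）。这并非论文给出的具体损失形式，仅用于说明"从一种任务的标注蒸馏出另一种任务的监督"这一做法：

```python
# 示意代码:由框标注构造分割弱监督损失(Box-for-Mask 思路的简化示意)
import torch
import torch.nn.functional as F

def box_weak_loss(mask_logits: torch.Tensor, box) -> torch.Tensor:
    """mask_logits: (H, W);box: (x1, y1, x2, y2) 像素坐标。"""
    h, w = mask_logits.shape
    x1, y1, x2, y2 = box
    prob = torch.sigmoid(mask_logits)

    # 框外区域:应全部为背景
    outside = torch.ones(h, w, dtype=torch.bool)
    outside[y1:y2, x1:x2] = False
    loss_out = F.binary_cross_entropy(prob[outside], torch.zeros_like(prob[outside]))

    # 框内区域:每一行、每一列的最大响应都应接近前景(投影约束)
    inside = prob[y1:y2, x1:x2]
    proj = torch.cat([inside.max(dim=1).values, inside.max(dim=0).values])
    loss_in = F.binary_cross_entropy(proj, torch.ones_like(proj))
    return loss_out + loss_in

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(32, 32, requires_grad=True)   # 假设的分割 logits
    loss = box_weak_loss(logits, (8, 8, 24, 24))
    loss.backward()
    print(float(loss))
```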
[CV-21] IMPROVE: Improving Medical Plausibility without Reliance on Human Validation - An Enhanced Prototype-Guided Diffusion Framework
【速读】: 该论文试图解决生成式模型在生成合成医学图像时,传统评估指标(如FID分数、精确度和召回率)无法准确衡量图像的医学/生物学合理性的问题。解决方案的关键在于提出了一种名为IMPROVE(Improving Medical Plausibility without Reliance on Human Validation - An Enhanced Prototype-Guided Diffusion Framework)的新方法,该方法通过原型引导的扩散过程来生成医学图像,显著提升了生成图像的生物学合理性,且无需依赖人类反馈。实验结果表明,在Bone Marrow和HAM10000数据集上,这种方法能够在不引入人类反馈的情况下大幅提高医学图像的准确性。
链接: https://arxiv.org/abs/2411.17535
作者: Anurag Shandilya,Swapnil Bhat,Akshat Gautam,Subhash Yadav,Siddharth Bhatt,Deval Mehta,Kshitij Jadhav
关键词-EN: enhancing rare disease, machine learning algorithms, Generative models, scaling machine learning, generating synthetic medical
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generative models have proven to be very effective in generating synthetic medical images and find applications in downstream tasks such as enhancing rare disease datasets, long-tailed dataset augmentation, and scaling machine learning algorithms. For medical applications, the synthetically generated medical images by such models are still reasonable in quality when evaluated based on traditional metrics such as FID score, precision, and recall. However, these metrics fail to capture the medical/biological plausibility of the generated images. Human expert feedback has been used to get biological plausibility which demonstrates that these generated images have very low plausibility. Recently, the research community has further integrated this human feedback through Reinforcement Learning from Human Feedback(RLHF), which generates more medically plausible images. However, incorporating human feedback is a costly and slow process. In this work, we propose a novel approach to improve the medical plausibility of generated images without the need for human feedback. We introduce IMPROVE:Improving Medical Plausibility without Reliance on Human Validation - An Enhanced Prototype-Guided Diffusion Framework, a prototype-guided diffusion process for medical image generation and show that it substantially enhances the biological plausibility of the generated medical images without the need for any human feedback. We perform experiments on Bone Marrow and HAM10000 datasets and show that medical accuracy can be substantially increased without human feedback.
zh
[CV-22] FTMoMamba: Motion Generation with Frequency and Text State Space Models
【速读】: 该论文试图解决在人体运动生成中,现有扩散模型忽视频率域信息(frequency-domain information)对细粒度运动捕捉的重要性,以及文本与运动之间的语义差异导致生成运动与文本描述不一致的问题。解决方案的关键在于提出了一个基于扩散的FTMoMamba框架,该框架配备了频率状态空间模型(Frequency State Space Model, FreqSSM)和文本状态空间模型(Text State Space Model, TextSSM)。FreqSSM通过将序列分解为低频和高频成分,分别指导生成静态姿势和细粒度运动,从而学习细粒度表示。TextSSM则在句子层面上编码文本特征,确保文本语义与序列特征的对齐,从而提高文本与生成运动之间的一致性。实验结果表明,FTMoMamba在文本到运动生成任务中表现优异,尤其是在HumanML3D数据集上取得了最低的FID值(0.181),显著优于现有方法(如MLD的0.421)。
链接: https://arxiv.org/abs/2411.17532
作者: Chengjian Li,Xiangbo Shu,Qiongjie Cui,Yazhou Yao,Jinhui Tang
关键词-EN: Diffusion models achieve, State Space Model, Diffusion models, Frequency State Space, models achieve impressive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
点击查看摘要
Abstract:Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, leading to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components, guiding the generation of static pose (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble), respectively. To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, especially gaining the lowest FID of 0.181 (rather lower than 0.421 of MLD) on the HumanML3D dataset.
zh
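[CV-22] 中 FreqSSM 的动机是：低频分量对应静态姿态、高频分量对应细粒度运动。下面用沿时间维的 FFT 把动作序列拆成低频/高频两部分，直观展示这一分解（并非论文在隐空间中的官方做法，截止频率为假设值）：

```python
# 示意代码:把动作序列沿时间维分解为低频/高频分量(非论文官方实现)
import numpy as np

def split_low_high(motion: np.ndarray, cutoff: int = 4):
    """motion: (T, D) 的动作特征序列。返回低频分量(姿态趋势)与高频分量(细粒度运动)。"""
    spec = np.fft.rfft(motion, axis=0)      # 沿时间维做实数 FFT
    low_spec = spec.copy()
    low_spec[cutoff:] = 0                   # 只保留前 cutoff 个频率分量
    low = np.fft.irfft(low_spec, n=motion.shape[0], axis=0)
    high = motion - low                     # 高频 = 原序列 - 低频
    return low, high

if __name__ == "__main__":
    t = np.linspace(0, 2 * np.pi, 64)
    # 人为构造:慢变化的"姿态"叠加快速抖动的"细粒度运动"
    motion = (np.sin(t)[:, None] + 0.1 * np.sin(12 * t)[:, None]) * np.ones((1, 3))
    low, high = split_low_high(motion)
    print(low.shape, high.shape, float(np.abs(high).mean()))
```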
[CV-23] HSI-Drive v2.0: More Data for New Challenges in Scene Understanding for Autonomous Driving
【速读】: 该论文旨在通过更新HSI-Drive数据集(v2.0版本)来推动使用高光谱成像(Hyperspectral Imaging, HSI)的自动驾驶系统(Automated Driving Systems, ADS)的发展。关键解决方案包括:1) 扩展数据集以覆盖四季(春、夏、秋、冬)的实际驾驶场景,新增752张标注图像;2) 通过实验展示基于新数据集训练的模型在场景理解(如车辆、路标、行人、自行车等关键道路安全对象的识别)方面的性能提升;3) 开发计算效率高、轻量级的机器学习(ML)模型,以满足车载ADS平台的高吞吐率需求。这些措施共同提升了模型在不同环境和条件下的鲁棒性和实用性。
链接: https://arxiv.org/abs/2411.17530
作者: Jon Gutiérrez-Zaballa,Koldo Basterretxea,Javier Echanobe,M. Victoria Martínez,Unai Martínez-Corral
关键词-EN: automated driving systems, developing automated driving, HSI-Drive dataset aimed, hyperspectral imaging, present the updated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:We present the updated version of the HSI-Drive dataset aimed at developing automated driving systems (ADS) using hyperspectral imaging (HSI). The v2.0 version includes new annotated images from videos recorded during winter and fall in real driving scenarios. Added to the spring and summer images included in the previous v1.1 version, the new dataset contains 752 images covering the four seasons. In this paper, we show the improvements achieved over previously published results obtained on the v1.1 dataset, showcasing the enhanced performance of models trained on the new v2.0 dataset. We also show the progress made in comprehensive scene understanding by experimenting with more capable image segmentation models. These models include new segmentation categories aimed at the identification of essential road safety objects such as the presence of vehicles and road signs, as well as highly vulnerable groups like pedestrians and cyclists. In addition, we provide evidence of the performance and robustness of the models when applied to segmenting HSI video sequences captured in various environments and conditions. Finally, for a correct assessment of the results described in this work, the constraints imposed by the processing platforms that can sensibly be deployed in vehicles for ADS must be taken into account. Thus, and although implementation details are out of the scope of this paper, we focus our research on the development of computationally efficient, lightweight ML models that can eventually operate at high throughput rates. The dataset and some examples of segmented videos are available in this https URL.
zh
[CV-24] SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates
【速读】: 该论文试图解决从图像中高效且物理一致地分解基于物理的材质(PBR materials)的问题。解决方案的关键在于提出了SuperMat框架,该框架通过单步推理实现高质量的材质分解,能够在毫秒级时间内同时分解出反照率(albedo)、金属度(metallic)和粗糙度(roughness)图。此外,SuperMat通过UV细化网络扩展到3D对象,确保了不同视角下材质估计的一致性,同时保持了计算效率。实验结果表明,SuperMat在PBR材质分解质量和推理时间上均达到了最先进的水平。
链接: https://arxiv.org/abs/2411.17515
作者: Yijia Hong,Yuan-Chen Guo,Ran Yi,Yulong Chen,Yan-Pei Cao,Lizhuang Ma
关键词-EN: properties remains challenging, constituent properties remains, remains challenging, physical consistency, Decomposing physically-based materials
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Decomposing physically-based materials from images into their constituent properties remains challenging, particularly when maintaining both computational efficiency and physical consistency. While recent diffusion-based approaches have shown promise, they face substantial computational overhead due to multiple denoising steps and separate models for different material properties. We present SuperMat, a single-step framework that achieves high-quality material decomposition with one-step inference. This enables end-to-end training with perceptual and re-render losses while decomposing albedo, metallic, and roughness maps at millisecond-scale speeds. We further extend our framework to 3D objects through a UV refinement network, enabling consistent material estimation across viewpoints while maintaining efficiency. Experiments demonstrate that SuperMat achieves state-of-the-art PBR material decomposition quality while reducing inference time from seconds to milliseconds per image, and completes PBR material estimation for 3D objects in approximately 3 seconds.
zh
[CV-25] Perceptually Optimized Super Resolution
【速读】: 该论文试图解决的问题是如何在不影响视觉质量的前提下,提高基于深度学习的超分辨率技术的计算效率。解决方案的关键在于提出了一种感知驱动的、与架构无关的方法,通过动态调整超分辨率算法的计算资源分配,使其更符合人类视觉系统的敏感性。具体来说,该方法利用感知模型,根据图像的空间频率、亮度、颜色、对比度和运动等特征,以及观看条件,来指导超分辨率算法的处理过程,从而在感知重要的区域集中计算资源,减少不必要的计算开销。通过网络分支和复杂度降低等技术,该方法在保持视觉质量的同时,显著降低了计算量,实现了高达2倍及以上的FLOPS减少。
链接: https://arxiv.org/abs/2411.17513
作者: Volodymyr Karpenko,Taimoor Tariq,Jorge Condor,Piotr Didyk
关键词-EN: Modern deep-learning based, deep-learning based super-resolution, Modern deep-learning, techniques process images, super-resolution techniques process
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Modern deep-learning based super-resolution techniques process images and videos independently of the underlying content and viewing conditions. However, the sensitivity of the human visual system to image details changes depending on the underlying content characteristics, such as spatial frequency, luminance, color, contrast, or motion. This observation hints that computational resources spent on up-sampling visual content may be wasted whenever a viewer cannot resolve the results. Motivated by this observation, we propose a perceptually inspired and architecture-agnostic approach for controlling the visual quality and efficiency of super-resolution techniques. The core is a perceptual model that dynamically guides super-resolution methods according to the human’s sensitivity to image details. Our technique leverages the limitations of the human visual system to improve the efficiency of super-resolution techniques by focusing computational resources on perceptually important regions; judged on the basis of factors such as adapting luminance, contrast, spatial frequency, motion, and viewing conditions. We demonstrate the application of our proposed model in combination with network branching, and network complexity reduction to improve the computational efficiency of super-resolution methods without visible quality loss. Quantitative and qualitative evaluations, including user studies, demonstrate the effectiveness of our approach in reducing FLOPS by factors of 2x and greater, without sacrificing perceived quality.
zh
[CV-26] What's in the Image? A Deep-Dive into the Vision of Vision Language Models
【速读】: 该论文试图解决视觉-语言模型 (Vision-Language Models, VLMs) 在处理视觉信息时的内部机制问题。解决方案的关键在于通过深入的实证分析,揭示了VLMs在处理视觉数据时的几个核心机制:(i) 查询令牌 (query tokens) 的内部表示用于存储全局图像信息,模型能够仅通过这些令牌生成详细的描述,而不直接访问图像令牌;(ii) 跨模态信息流主要受中间层(约占总层数的25%)影响,早期和晚期层的影响较小;(iii) 细粒度的视觉属性和对象细节直接从图像令牌中以空间局部化的方式提取,即与特定对象或属性相关的生成令牌会强烈关注图像中的相应区域。论文通过提出新的定量评估方法来验证这些观察,并展示了这些发现如何促进最先进VLMs中的高效视觉处理。
链接: https://arxiv.org/abs/2411.17491
作者: Omri Kaduri,Shai Bagon,Tali Dekel
关键词-EN: recently demonstrated remarkable, demonstrated remarkable capabilities, recently demonstrated, demonstrated remarkable, remarkable capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of “describe the image”), is utilized by VLMs to store global image information; we demonstrate that these models generate surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally.(iii) Fine-grained visual attributes and object details are directly extracted from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose novel quantitative evaluation to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.
zh
[CV-27] Learning Visual Hierarchies with Hyperbolic Embeddings
【速读】: 该论文试图解决在图像理解模型中学习多层次视觉层次结构的问题。现有的模型主要关注视觉相似性,而学习视觉层次结构的研究相对较少。论文提出的解决方案之关键是引入了一种在双曲空间(hyperbolic space)中编码用户定义的多层次视觉层次结构的学习范式。具体来说,论文首先定义了一个基于部件的图像层次结构,并利用对象级别的标注来构建这一层次结构。然后,通过对比损失(contrastive loss)与成对蕴含度量(pairwise entailment metrics)来强化层次结构。最后,论文提出了新的评估指标来有效测量层次图像检索。这种方法确保了学习到的表示不仅捕捉视觉相似性,还能捕捉语义和结构信息,从而在基于部件的图像检索任务中显著提升了层次检索的性能。
链接: https://arxiv.org/abs/2411.17490
作者: Ziwei Wang,Sameera Ramasinghe,Chenchen Xu,Julien Monteil,Loris Bazzani,Thalaiyasingam Ajanthan
关键词-EN: Structuring latent representations, Structuring latent, manner enables models, hierarchical manner enables, levels of abstraction
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
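To make the objective described above concrete, the sketch below computes Poincaré-ball distances and a simple hinge-style contrastive loss that pulls a part embedding toward its parent object while pushing it away from unrelated objects. This is an illustrative reading of the abstract rather than the authors' implementation; the margin, the fixed curvature of -1, and the toy embeddings are arbitrary assumptions.

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    # Geodesic distance in the Poincare ball with curvature -1.
    sq_u = (u * u).sum(-1).clamp(max=1 - eps)
    sq_v = (v * v).sum(-1).clamp(max=1 - eps)
    sq_diff = ((u - v) ** 2).sum(-1)
    x = 1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v))
    return torch.acosh(x.clamp(min=1 + eps))

def hierarchical_contrastive_loss(part, parent, negatives, margin=0.1):
    # Pull each part embedding toward its parent object and push it away
    # from embeddings of unrelated objects (hinge on the distance gap).
    pos = poincare_distance(part, parent)                  # (B,)
    neg = poincare_distance(part.unsqueeze(1), negatives)  # (B, K)
    return torch.relu(pos.unsqueeze(1) - neg + margin).mean()

# Toy usage: embeddings must lie strictly inside the unit ball.
B, K, D = 4, 8, 32
part, parent = torch.randn(B, D) * 0.1, torch.randn(B, D) * 0.1
negatives = torch.randn(B, K, D) * 0.1
print(hierarchical_contrastive_loss(part, parent, negatives))
```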
zh
[CV-28] Puzzle Similarity: A Perceptually-guided No-Reference Metric for Artifact Detection in 3D Scene Reconstructions
【速读】: 该论文试图解决从稀疏2D视图重建复杂3D场景时,自动评估新视图质量并识别伪影的难题。由于缺乏真实图像作为参考,以及无参考图像质量评估指标在预测详细伪影图方面的局限性,现有的质量评估方法难以准确预测生成视图的质量,从而限制了后处理技术(如修复)在提升重建质量中的应用。论文提出的解决方案之关键是引入了一种新的无参考评估指标——拼图相似度(Puzzle Similarity),该指标通过利用输入视图的图像块统计信息建立场景特定的分布,进而识别新视图中重建不佳的区域。实验结果表明,该方法不仅能有效定位伪影,且与人类评估高度相关,甚至在某些情况下优于无参考和全参考图像质量评估指标。
链接: https://arxiv.org/abs/2411.17489
作者: Nicolai Hermann,Jorge Condor,Piotr Didyk
关键词-EN: effectively model complex, Modern reconstruction techniques, Modern reconstruction, model complex, views
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Modern reconstruction techniques can effectively model complex 3D scenes from sparse 2D views. However, automatically assessing the quality of novel views and identifying artifacts is challenging due to the lack of ground truth images and the limitations of no-reference image metrics in predicting detailed artifact maps. The absence of such quality metrics hinders accurate predictions of the quality of generated views and limits the adoption of post-processing techniques, such as inpainting, to enhance reconstruction quality. In this work, we propose a new no-reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. Our approach utilizes image patch statistics from the input views to establish a scene-specific distribution that is later used to identify poorly reconstructed regions in the novel views. We test and evaluate our method in the context of 3D reconstruction; to this end, we collected a novel dataset of human quality assessment in unseen reconstructed views. Through this dataset, we demonstrate that our method can not only successfully localize artifacts in novel views, correlating with human assessment, but do so without direct references. Surprisingly, our metric outperforms both no-reference metrics and popular full-reference image metrics. We can leverage our new metric to enhance applications like automatic image restoration, guided acquisition, or 3D reconstruction from sparse inputs.
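The core idea above — fit a scene-specific distribution over patch statistics from the input views, then flag novel-view regions that fall outside it — can be approximated with a plain Gaussian over raw patches and a Mahalanobis score, as in the hedged sketch below. The published metric almost certainly uses richer features; the patch size, stride, and covariance regularizer here are arbitrary choices.

```python
import numpy as np

def extract_patches(img, patch=8, stride=None):
    # img: (H, W, C) float array; returns one flattened row per patch.
    stride = stride or patch
    H, W, _ = img.shape
    rows = [img[y:y + patch, x:x + patch].ravel()
            for y in range(0, H - patch + 1, stride)
            for x in range(0, W - patch + 1, stride)]
    return np.stack(rows)

def fit_patch_gaussian(reference_views, patch=8):
    # Scene-specific distribution: mean/covariance of patches from the input views.
    feats = np.concatenate([extract_patches(v, patch) for v in reference_views])
    mu = feats.mean(0)
    cov = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def artifact_map(novel_view, mu, cov_inv, patch=8):
    # High Mahalanobis distance = a patch unlike anything seen in the input views.
    d = extract_patches(novel_view, patch) - mu
    scores = np.einsum('nd,dk,nk->n', d, cov_inv, d)
    return scores.reshape(novel_view.shape[0] // patch, novel_view.shape[1] // patch)

# Toy usage with random "views"; real inputs would be captured/rendered images.
refs = [np.random.rand(64, 64, 3) for _ in range(5)]
mu, cov_inv = fit_patch_gaussian(refs)
print(artifact_map(np.random.rand(64, 64, 3), mu, cov_inv).shape)  # (8, 8)
```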
zh
[CV-29] Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding
【速读】: 该论文试图解决视频段落定位 (Video Paragraph Grounding, VPG) 中大规模标注时间标签和视频与段落对应关系未知的问题。解决方案的关键在于提出了一种双任务互增强嵌入式联合视频段落检索与定位方法 (Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding, DMR-JRG)。该方法通过检索和定位任务的相互增强,构建了一个粗粒度特征空间,减少了模态差异,并在此基础上进一步提取细粒度上下文表示,从而实现了精确的跨模态匹配和定位,显著减少了对大规模标注时间标签的依赖。
链接: https://arxiv.org/abs/2411.17481
作者: Mengzhao Wang,Huafeng Li,Yafei Zhang,Jinxing Li,Minghong Xie,Dapeng Tao
关键词-EN: Grounding, Dual-task Mutual Reinforcing, Mutual Reinforcing Embedded, Reinforcing Embedded Joint, aims to precisely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted with mandatory minor revisions by TMM
点击查看摘要
Abstract:Video Paragraph Grounding (VPG) aims to precisely locate the most appropriate moments within a video that are relevant to a given textual paragraph query. However, existing methods typically rely on large-scale annotated temporal labels and assume that the correspondence between videos and paragraphs is known. This is impractical in real-world applications, as constructing temporal labels requires significant labor costs, and the correspondence is often unknown. To address this issue, we propose a Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding method (DMR-JRG). In this method, retrieval and grounding tasks are mutually reinforced rather than being treated as separate issues. DMR-JRG mainly consists of two branches: a retrieval branch and a grounding branch. The retrieval branch uses inter-video contrastive learning to roughly align the global features of paragraphs and videos, reducing modality differences and constructing a coarse-grained feature space to break free from the need for correspondence between paragraphs and videos. Additionally, this coarse-grained feature space further facilitates the grounding branch in extracting fine-grained contextual representations. In the grounding branch, we achieve precise cross-modal matching and grounding by exploring the consistency between local, global, and temporal dimensions of video segments and textual paragraphs. By synergizing these dimensions, we construct a fine-grained feature space for video and textual features, greatly reducing the need for large-scale annotated temporal labels.
zh
[CV-30] COBRA: A Continual Learning Approach to Vision-Brain Understanding
【速读】: 该论文试图解决视觉-大脑理解 (Vision-Brain Understanding, VBU) 领域中的灾难性遗忘问题,即模型在适应新受试者时会丢失先前受试者的知识。解决方案的关键在于引入了一个名为 Continual Learning for Vision-Brain (COBRA) 的新框架,该框架包含三个创新模块:Subject Commonality (SC) 模块、Prompt-based Subject Specific (PSS) 模块和基于 transformer 的 MRIFormer 模块。SC 模块捕捉跨受试者的共享视觉-大脑模式,以减少灾难性遗忘的影响;PSS 模块学习每个受试者的独特视觉-大脑模式;MRIFormer 模块则通过 transformer 编码器和解码器学习 fMRI 特征。在持续学习设置中,COBRA 仅训练新受试者的 PSS 和 MRIFormer 模块,保持先前受试者的模块不变,从而有效解决了灾难性遗忘问题,并在持续学习和视觉-大脑重建任务中达到了最先进的性能。
链接: https://arxiv.org/abs/2411.17475
作者: Xuan-Bac Nguyen,Arabinda Kumar Choudhary,Pawan Sinha,Xin Li,Khoa Luu
关键词-EN: Magnetic Resonance Imaging, functional Magnetic Resonance, Resonance Imaging, Magnetic Resonance, extract visual information
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-Brain Understanding (VBU) aims to extract visual information perceived by humans from brain activity recorded through functional Magnetic Resonance Imaging (fMRI). Despite notable advancements in recent years, existing studies in VBU continue to face the challenge of catastrophic forgetting, where models lose knowledge from prior subjects as they adapt to new ones. Addressing continual learning in this field is, therefore, essential. This paper introduces a novel framework called Continual Learning for Vision-Brain (COBRA) to address continual learning in VBU. Our approach includes three novel modules: a Subject Commonality (SC) module, a Prompt-based Subject Specific (PSS) module, and a transformer-based module for fMRI, denoted as MRIFormer module. The SC module captures shared vision-brain patterns across subjects, preserving this knowledge as the model encounters new subjects, thereby reducing the impact of catastrophic forgetting. On the other hand, the PSS module learns unique vision-brain patterns specific to each subject. Finally, the MRIFormer module contains a transformer encoder and decoder that learns the fMRI features for VBU from common and specific patterns. In a continual learning setup, COBRA is trained in new PSS and MRIFormer modules for new subjects, leaving the modules of previous subjects unaffected. As a result, COBRA effectively addresses catastrophic forgetting and achieves state-of-the-art performance in both continual learning and vision-brain reconstruction tasks, surpassing previous methods.
zh
[CV-31] Probing the Mid-level Vision Capabilities of Self-Supervised Learning
【速读】: 该论文试图解决自监督学习(Self-Supervised Learning, SSL)模型在中层视觉能力(mid-level vision capabilities)评估不足的问题。解决方案的关键在于引入了一套基准协议(benchmark protocols),用于系统性地评估22种主流SSL模型在8个中层视觉任务上的表现。通过这项研究,作者揭示了中层视觉任务与高层视觉任务(high-level vision tasks)之间性能的弱相关性,并识别出在不同任务间表现不平衡的SSL方法。此外,研究还探讨了预训练目标(pretraining objectives)和网络架构(network architectures)等关键因素对中层视觉性能的影响。该研究为SSL模型的全面评估提供了新的视角,强调了未来研究应同时关注中层和高层视觉任务的重要性。
链接: https://arxiv.org/abs/2411.17474
作者: Xuweiyi Chen,Markus Marks,Zezhou Cheng
关键词-EN: generic object localization, Mid-level vision capabilities, Mid-level vision, vision, geometric understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Mid-level vision capabilities - such as generic object localization and 3D geometric understanding - are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.
zh
[CV-32] TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
【速读】: 该论文试图解决现有基于Mamba的轻量级视觉模型在性能上无法与卷积神经网络(Convolution)或Transformer方法相媲美的问题。解决方案的关键在于通过频谱和定量分析,发现Mamba块在卷积-Mamba混合架构中主要建模低频信息。基于此,论文提出了一种新颖的拉普拉斯混合器(Laplace mixer),用于在频率上解耦特征,并将低频成分单独输入到Mamba块中。此外,考虑到特征冗余和不同阶段对高频细节和低频全局信息的不同需求,引入了频率斜坡初始化(frequency ramp inception),即逐步减少高频分支的输入维度,以在不同层高效地权衡高频和低频成分。通过整合移动友好的卷积和高效的拉普拉斯混合器,构建了一系列名为TinyViM的微型混合视觉Mamba模型,显著提升了在图像分类、语义分割、目标检测和实例分割等下游任务中的性能,并在吞吐量上优于其他基于Mamba的模型。
链接: https://arxiv.org/abs/2411.17473
作者: Xiaowen Ma,Zhenliang Ni,Xinghao Chen
关键词-EN: shown great potential, computer vision due, vision Mamba, shown great, linear complexity
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot demonstrate performance that matches Convolution or Transformer-based methods. We observe that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low-frequency information under Convolution-Mamba hybrid architecture. Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reduce the input dimensions of the high-frequency branches, so as to efficiently trade-off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and efficient Laplace mixer, we build a series of tiny hybrid vision Mamba called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic segmentation, object detection and instance segmentation. In particular, TinyViM outperforms Convolution, Transformer and Mamba-based models with similar scales, and the throughput is about 2-3 times higher than that of other Mamba-based models. Code is available at this https URL.
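Reading only the description above, a Laplace-style frequency split can be approximated by a blur-and-subtract step: the blurred low-frequency map goes to the global-context branch, and the residual goes to a cheap local branch. In the sketch a depthwise convolution stands in for the Mamba block, so this illustrates the decoupling rather than TinyViM itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceStyleMixer(nn.Module):
    """Illustrative frequency decoupling: low = blurred input, high = input - low."""

    def __init__(self, dim):
        super().__init__()
        # Stand-in for the Mamba block that models low-frequency global context.
        self.low_branch = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        # Lightweight branch for the high-frequency residual.
        self.high_branch = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        low = F.avg_pool2d(x, 2)                      # crude low-pass filter
        low = F.interpolate(low, size=x.shape[-2:], mode='bilinear',
                            align_corners=False)
        high = x - low                                # Laplacian-like residual
        return self.low_branch(low) + self.high_branch(high)

x = torch.randn(1, 32, 56, 56)
print(LaplaceStyleMixer(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```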
zh
[CV-33] Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory
【速读】: 该论文试图解决文本到图像(Text-to-Image, T2I)扩散模型在处理复杂文本提示时,特别是涉及多个对象和属性时,出现的对象与属性对齐错误和忽略某些元素的问题。解决方案的关键在于利用PAC-Bayes框架,设计自定义先验分布来优化注意力机制,确保对象间的分离、修饰词与对应名词的对齐、减少对无关词的关注,并通过正则化提升模型的泛化能力。这种方法将注意力机制视为可解释的组件,从而实现细粒度的控制和改进属性与对象的对齐,最终在标准基准测试中取得了最先进的结果。
链接: https://arxiv.org/abs/2411.17472
作者: Eric Hanchen Jiang,Yasi Zhang,Zhi Zhang,Yixin Wan,Andrew Lizarraga,Shufan Li,Ying Nian Wu
关键词-EN: visually realistic images, revolutionized generative modeling, producing high-fidelity, modeling by producing, visually realistic
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images from textual prompts. Despite these advances, existing models struggle with complex prompts involving multiple objects and attributes, often misaligning modifiers with their corresponding nouns or neglecting certain elements. Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding and a lack of robust generalization guarantees. Leveraging the PAC-Bayes framework, we propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties, including divergence between objects, alignment between modifiers and their corresponding nouns, minimal attention to irrelevant tokens, and regularization for better generalization. Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment. We demonstrate the effectiveness of our method on standard benchmarks, achieving state-of-the-art results across multiple metrics. By integrating custom priors into the denoising process, our method enhances image quality and addresses long-standing challenges in T2I diffusion models, paving the way for more reliable and interpretable generative models.
zh
[CV-34] Learning New Concepts Remembering the Old: A Novel Continual Learning
【速读】: 该论文试图解决现有概念瓶颈模型(Concept Bottleneck Models, CBMs)在处理动态数据流时无法适应新概念和类别的问题。解决方案的关键在于提出了一种名为CONceptual Continual Incremental Learning (CONCIL)的框架,该框架通过将概念和决策层的更新重新定义为线性回归问题,从而避免了梯度下降更新,实现了对新概念和类别的增量学习,同时防止了灾难性遗忘。CONCIL仅依赖递归矩阵运算,具有计算效率高、适用于实时和大规模数据应用的特点。
链接: https://arxiv.org/abs/2411.17471
作者: Songning Lai,Mingqian Liao,Zhangyi Hu,Jiayu Yang,Wenshuo Chen,Yutao Yue
关键词-EN: enhance model interpretability, Concept Bottleneck Models, introducing human-understandable concepts, Bottleneck Models, Concept Bottleneck
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Concept Bottleneck Models (CBMs) enhance model interpretability by introducing human-understandable concepts within the architecture. However, existing CBMs assume static datasets, limiting their ability to adapt to real-world, continuously evolving data streams. To address this, we define a novel concept-incremental and class-incremental continual learning task for CBMs, enabling models to accumulate new concepts and classes over time while retaining previously learned knowledge. To achieve this, we propose CONceptual Continual Incremental Learning (CONCIL), a framework that prevents catastrophic forgetting by reformulating concept and decision layer updates as linear regression problems, thus eliminating the need for gradient-based updates. CONCIL requires only recursive matrix operations, making it computationally efficient and suitable for real-time and large-scale data applications. Experimental results demonstrate that CONCIL achieves “absolute knowledge memory” and outperforms traditional CBM methods in concept- and class-incremental settings, establishing a new benchmark for continual learning in CBMs.
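One standard way to realize "updates as linear regression with only recursive matrix operations" is recursive ridge regression, where each new batch updates the weights in closed form without gradients or access to earlier data. This is my assumption about the mechanism, not CONCIL's published equations; the dimensions and random data are placeholders.

```python
import numpy as np

class RecursiveRidge:
    """Closed-form linear layer that absorbs new batches without gradients."""

    def __init__(self, in_dim, out_dim, lam=1e-2):
        self.P = np.eye(in_dim) / lam          # inverse of the regularized Gram matrix
        self.W = np.zeros((in_dim, out_dim))

    def update(self, X, Y):
        # Matrix-inversion lemma: fold a new batch X into P without refitting old data.
        K = self.P @ X.T @ np.linalg.inv(np.eye(len(X)) + X @ self.P @ X.T)
        self.P = self.P - K @ X @ self.P
        self.W = self.W + self.P @ X.T @ (Y - X @ self.W)

    def predict(self, X):
        return X @ self.W

# Phase 1: old classes; Phase 2: new classes arrive, old data no longer accessible.
layer = RecursiveRidge(in_dim=16, out_dim=4)
layer.update(np.random.randn(100, 16), np.random.randn(100, 4))
layer.update(np.random.randn(50, 16), np.random.randn(50, 4))
print(layer.predict(np.random.randn(2, 16)).shape)  # (2, 4)
```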
zh
[CV-35] Towards Precise Scaling Laws for Video Diffusion Transformers
【速读】: 该论文试图解决在给定的数据和计算预算下,如何优化视频扩散变换器(video diffusion transformers)的性能问题。解决方案的关键在于系统地分析并确认视频扩散模型的缩放定律(scaling laws),并发现这些模型对学习率(learning rate)和批量大小(batch size)这两个超参数的敏感性。为此,论文提出了一种新的缩放定律,能够预测任何模型大小和计算预算下的最优超参数设置。通过这些最优设置,论文在1e10 TFlops的计算预算下,实现了与传统缩放方法相当的性能,并将推理成本降低了40.1%。此外,论文还建立了验证损失、模型大小和计算预算之间更普遍和精确的关系,从而能够预测非最优模型大小的性能,这在实际推理成本约束下提供了更好的权衡。
链接: https://arxiv.org/abs/2411.17470
作者: Yuanyang Yin,Yaqi Zhao,Mingwu Zheng,Ke Lin,Jiarong Ou,Rui Chen,Victor Shea-Jay Huang,Jiahao Wang,Xin Tao,Pengfei Wan,Di Zhang,Baoqun Yin,Wentao Zhang,Kun Gai
关键词-EN: crucial due, high training costs, compute budget, video diffusion transformers, high training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Achieving optimal performance of video diffusion transformers within given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, any model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealed under practical inference cost constraints, achieving a better trade-off.
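As a toy illustration of fitting a scaling law of this kind (the functional form and every data point below are invented, not taken from the paper), one can regress validation loss against compute with a saturating power law and then extrapolate to unseen budgets:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    # L(C) = c + a * C^(-b): power-law decay toward an irreducible loss c.
    return c + a * np.power(compute, -b)

# Hypothetical (compute budget, validation loss) measurements -- not real data.
compute = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])
loss = np.array([1.63, 1.48, 1.36, 1.27, 1.20, 1.15])

params, _ = curve_fit(scaling_law, compute, loss, p0=[20.0, 0.25, 1.0], maxfev=10000)
a, b, c = params
print(f"fit: L(C) = {c:.3f} + {a:.2f} * C^(-{b:.3f})")
print("extrapolated loss at 1e10:", scaling_law(1e10, *params))
```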
zh
[CV-36] Adversarial Bounding Boxes Generation (ABBG) Attack against Visual Object Trackers NEURIPS2024
【速读】: 该论文试图解决针对基于Transformer的视觉目标跟踪器(transformer trackers)的对抗攻击问题。现有攻击方法通常依赖于对象候选列表(object candidate list),而Transformer跟踪器直接预测特定边界框(bounding box),这限制了现有攻击方法的应用。论文提出的解决方案之关键是:通过仅使用一个预测的边界框,生成一系列对抗性边界框(adversarial bounding boxes),并计算这些边界框的对抗性损失(adversarial loss)。这种方法不仅简单有效,而且在多个流行的基准数据集上,对包括TransT-M、ROMTrack和MixFormer在内的多个鲁棒Transformer跟踪器表现出色。
链接: https://arxiv.org/abs/2411.17468
作者: Fatemeh Nourilenjan Nokabadi,Jean-Francois Lalonde,Christian Gagné
关键词-EN: deceive neural networks, predicting inaccurate results, Adversarial perturbations aim, aim to deceive, deceive neural
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in The 3rd New Frontiers in Adversarial Machine Learning (AdvML Frontiers @NeurIPS2024)
点击查看摘要
Abstract:Adversarial perturbations aim to deceive neural networks into predicting inaccurate results. For visual object trackers, adversarial attacks have been developed to generate perturbations by manipulating the outputs. However, transformer trackers predict a specific bounding box instead of an object candidate list, which limits the applicability of many existing attack scenarios. To address this issue, we present a novel white-box approach to attack visual object trackers with transformer backbones using only one bounding box. From the tracker predicted bounding box, we generate a list of adversarial bounding boxes and compute the adversarial loss for those bounding boxes. Experimental results demonstrate that our simple yet effective attack outperforms existing attacks against several robust transformer trackers, including TransT-M, ROMTrack, and MixFormer, on popular benchmark tracking datasets such as GOT-10k, UAV123, and VOT2022STS.
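A stripped-down white-box version of the idea — perturb the search image so a differentiable tracker's predicted box drifts toward a set of adversarial boxes generated around its own prediction — might look like the following. The tiny regression CNN stands in for a real transformer tracker, and the box-jitter scheme, loss, and step sizes are guesses rather than the paper's recipe.

```python
import torch
import torch.nn as nn

class DummyTracker(nn.Module):
    # Stand-in for a transformer tracker: maps an image to a box (cx, cy, w, h) in [0, 1].
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 4), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

def adversarial_boxes(box, n=8, shift=0.2):
    # Jitter the predicted box into n displaced "wrong" targets.
    noise = shift * torch.sign(torch.randn(n, 4))
    return (box.detach() + noise).clamp(0, 1)

def abbg_style_attack(tracker, image, steps=10, alpha=2 / 255):
    adv = image.clone().requires_grad_(True)
    for _ in range(steps):
        pred = tracker(adv)                        # (1, 4), differentiable w.r.t. adv
        targets = adversarial_boxes(pred[0])       # (n, 4), detached
        loss = (pred - targets).abs().mean()       # distance to the wrong boxes
        loss.backward()
        with torch.no_grad():
            adv -= alpha * adv.grad.sign()         # push the input so the box drifts toward them
            adv.clamp_(0, 1)
        adv.grad.zero_()
    return adv.detach()

tracker = DummyTracker()
clean = torch.rand(1, 3, 128, 128)
adv = abbg_style_attack(tracker, clean)
print((adv - clean).abs().max())
```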
zh
[CV-37] Learning 3D Representations from Procedural 3D Programs
【速读】: 该论文试图解决从无标签的3D点云中获取可迁移的3D表示的问题,特别是在获取3D资产时面临的成本高、专业性强和版权问题。解决方案的关键在于利用程序化3D程序(procedural 3D programs)自动生成3D形状,这些程序使用简单的基本元素和增强技术来创建3D模型。尽管这些生成的3D模型缺乏语义内容,但通过自监督学习方法获得的3D表示在形状分类、部分分割和掩码点云完成等下游3D任务中表现与从语义可识别的3D模型(如飞机)中学习到的最先进表示相当。这表明当前的自监督学习方法主要捕获几何结构而非高级语义。
链接: https://arxiv.org/abs/2411.17467
作者: Xuweiyi Chen,Zezhou Cheng
关键词-EN: promising approach, acquiring transferable, representations learned, Self-supervised learning, representations
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations. Remarkably, despite lacking semantic content, the 3D representations learned from this synthesized dataset perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. Our analysis further suggests that current self-supervised learning methods primarily capture geometric structures rather than high-level semantics.
zh
[CV-38] WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
【速读】: 该论文试图解决视频变分自编码器 (Video Variational Autoencoder, VAE) 在高分辨率和长时长视频生成过程中,编码成本过高的问题,以及在块状推理过程中潜在空间不连续的问题。解决方案的关键在于利用多级小波变换 (Wavelet Transform) 将视频分解为多个频域分量,并通过低频能量流 (Low-Frequency Energy Flow) 高效地编码关键信息,从而提出小波流变分自编码器 (Wavelet Flow VAE, WF-VAE)。此外,引入因果缓存 (Causal Cache) 方法以在块状推理过程中保持潜在空间的完整性。实验结果表明,WF-VAE 在 PSNR 和 LPIPS 指标上表现优异,吞吐量提高 2 倍,内存消耗降低 4 倍,同时保持了竞争性的重建质量。
链接: https://arxiv.org/abs/2411.17459
作者: Zongjian Li,Bin Lin,Yang Ye,Liuhan Chen,Xinhua Cheng,Shenghai Yuan,Li Yuan
关键词-EN: Video Variational Autoencoder, Latent Video Diffusion, Video Diffusion Models, Variational Autoencoder, Video Variational
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures
点击查看摘要
Abstract:Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space, becoming a key component of most Latent Video Diffusion Models (LVDMs) to reduce model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities of latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. Wavelet transform can decompose videos into multiple frequency-domain components and improve the efficiency significantly, we thus propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transform to facilitate low-frequency energy flow into latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2x higher throughput and 4x lower memory consumption while maintaining competitive reconstruction quality. Our code and models are available at this https URL.
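To make the wavelet step concrete, a single-level 2D Haar transform of each frame can be written as a fixed strided grouped convolution; in a design like the one described, the low-low band carries most of the energy routed into the latent path. This shows only the decomposition under my own Haar filter choice, not the WF-VAE architecture or its Causal Cache.

```python
import torch
import torch.nn.functional as F

def haar_dwt2d(frames):
    """Single-level 2D Haar DWT. frames: (N, C, H, W) -> (LL, LH, HL, HH)."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    C = frames.shape[1]
    bank = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
    bank = bank.repeat(C, 1, 1, 1)                      # one filter set per channel
    out = F.conv2d(frames, bank.to(frames), stride=2, groups=C)
    N = frames.shape[0]
    out = out.view(N, C, 4, frames.shape[2] // 2, frames.shape[3] // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]

video = torch.randn(2 * 8, 3, 64, 64)   # 2 clips x 8 frames, flattened over time
ll, lh, hl, hh = haar_dwt2d(video)
print(ll.shape)                          # torch.Size([16, 3, 32, 32])
```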
zh
[CV-39] Spatially Visual Perception for End-to-End Robotic Learning
【速读】: 该论文试图解决机器人控制和具身智能中,模仿学习在面对多样化的摄像头观测时难以实现鲁棒泛化的问题。解决方案的关键在于引入了一个基于视频的空间感知框架,该框架利用3D空间表示来应对环境变化,特别是光照变化。核心技术包括一种名为AugBlender的新型图像增强技术,以及一个在互联网规模数据上训练的先进单目深度估计模型。这些组件共同构成了一个旨在增强动态场景中鲁棒性和适应性的系统,显著提高了在不同摄像头曝光条件下的成功率,克服了先前模型在性能上的崩溃问题。
链接: https://arxiv.org/abs/2411.17458
作者: Travis Davies,Jiahuan Yan,Xiang Chen,Yu Tian,Yueting Zhuang,Yiqi Huang,Luhui Hu
关键词-EN: shown significant promise, Recent advances, advances in imitation, shown significant, significant promise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 5 figures
点击查看摘要
Abstract:Recent advances in imitation learning have shown significant promise for robotic control and embodied intelligence. However, achieving robust generalization across diverse mounted camera observations remains a critical challenge. In this paper, we introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability, with a focus on handling lighting changes. Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data. Together, these components form a cohesive system designed to enhance robustness and adaptability in dynamic scenarios. Our results demonstrate that our approach significantly boosts the success rate across diverse camera exposures, where previous models experience performance collapse. Our findings highlight the potential of video-based spatial perception models in advancing robustness for end-to-end robotic learning, paving the way for scalable, low-cost solutions in embodied intelligence.
zh
[CV-40] Identity-Preserving Text-to-Video Generation by Frequency Decomposition
【速读】: 该论文试图解决文本到视频生成中保持人物身份一致性的问题,即在生成的视频中保持人物身份的高保真度和一致性。解决方案的关键在于提出了一个无需微调的频率感知启发式身份保持控制方案,称为ConsisID。该方案通过在频率域中引入身份控制信号,将面部特征分解为低频的全局特征和高频的内在特征,从而在生成过程中保持人物身份的一致性。具体来说,通过全局面部提取器捕捉低频信息并将其集成到网络的浅层,以及通过局部面部提取器捕捉高频细节并注入到Transformer块中,来增强模型对细粒度特征的保留能力。此外,论文还提出了一种分层训练策略,将预训练的视频生成模型转化为能够保持身份一致性的文本到视频生成模型。
链接: https://arxiv.org/abs/2411.17440
作者: Shenghai Yuan,Jinfa Huang,Xianyi He,Yunyuan Ge,Yujun Shi,Liuhan Chen,Jiebo Luo,Li Yuan
关键词-EN: create high-fidelity videos, aims to create, create high-fidelity, generation aims, video generation
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 12 pages, 8 figures
点击查看摘要
Abstract:Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model’s ability to preserve fine-grained features. We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.
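The low-/high-frequency decomposition can be pictured with a plain FFT low-pass mask as below; ConsisID itself operates on learned facial features with dedicated global and local extractors, so treat this purely as an illustration of the split, with the cutoff radius chosen arbitrarily.

```python
import torch

def frequency_split(img, radius=16):
    """Split (B, C, H, W) images into low- and high-frequency components."""
    _, _, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(),
                            indexing='ij')
    dist = ((yy - H // 2) ** 2 + (xx - W // 2) ** 2).sqrt()
    mask = (dist <= radius).float()            # circular low-pass mask
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return low, img - low                      # global structure, fine-detail residual

face = torch.randn(1, 3, 128, 128)             # stand-in for a reference face image
low, high = frequency_split(face)
print(low.shape, high.shape)
```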
zh
[CV-41] Object-centric proto-symbolic behavioural reasoning from pixels
【速读】: 该论文试图解决自主智能体在不同抽象层次(从感官输入和运动指令的低级空间到抽象推理和规划的高级领域)之间进行有效计算的问题。解决方案的关键在于采用基于对象的表示(object-centric representations),这种表示方式能够从像素级别学习环境解释、控制和推理,而无需昂贵的数据标注监督。论文提出了一种受大脑启发的深度学习架构,通过这种架构,智能体能够学习并执行复杂的逻辑推理和连续控制任务,如条件行为推理、逻辑组合和异或操作,同时具备在线适应环境和模型轻微违规的能力。
链接: https://arxiv.org/abs/2411.17438
作者: Ruben van Bergen,Justus Hübotter,Pablo Lanillos
关键词-EN: Autonomous intelligent agents, bridge computational challenges, Autonomous intelligent, bridge computational, computational challenges
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
点击查看摘要
Abstract:Autonomous intelligent agents must bridge computational challenges at disparate levels of abstraction, from the low-level spaces of sensory input and motor commands to the high-level domain of abstract reasoning and planning. A key question in designing such agents is how best to instantiate the representational space that will interface between these two levels – ideally without requiring supervision in the form of expensive data annotations. These objectives can be efficiently achieved by representing the world in terms of objects (grounded in perception and action). In this work, we present a novel, brain-inspired, deep-learning architecture that learns from pixels to interpret, control, and reason about its environment, using object-centric representations. We show the utility of our approach through tasks in synthetic environments that require a combination of (high-level) logical reasoning and (low-level) continuous control. Results show that the agent can learn emergent conditional behavioural reasoning, such as (A \to B) \land (\neg A \to C) , as well as logical composition (A \to B) \land (A \to C) \vdash A \to (B \land C) and XOR operations, and successfully controls its environment to satisfy objectives deduced from these logical rules. The agent can adapt online to unexpected changes in its environment and is robust to mild violations of its world model, thanks to dynamic internal desired goal generation. While the present results are limited to synthetic settings (2D and 3D activated versions of dSprites), which fall short of real-world levels of complexity, the proposed architecture shows how to manipulate grounded object representations, as a key inductive bias for unsupervised learning, to enable behavioral reasoning.
zh
[CV-42] Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps
【速读】: 该论文试图解决历史地图中地理实体的跟踪与关联问题,特别是如何高效地将这些实体在不同地图之间进行对齐。解决方案的关键在于结合视频实例分割 (Video Instance Segmentation, VIS) 技术,通过自监督学习 (Self-Supervised Learning, SSL) 方法来提升模型性能。具体来说,论文提出了一种新的方法,通过生成合成视频数据来预训练VIS模型,从而减少对大量手动标注数据的依赖。这种方法显著提高了地理实体对齐的自动化程度,实验结果显示,自监督的VIS方法在平均精度 (AP) 和F1分数上分别比从头训练的模型提高了24.9%和0.23。
链接: https://arxiv.org/abs/2411.17425
作者: Xue Xia,Randall Balestriero,Tao Zhang,Lorenz Hurni
关键词-EN: offers valuable insights, Tracking geographic entities, historical research endeavors, Tracking geographic, urbanization patterns
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Tracking geographic entities from historical maps, such as buildings, offers valuable insights into cultural heritage, urbanization patterns, environmental changes, and various historical research endeavors. However, linking these entities across diverse maps remains a persistent challenge for researchers. Traditionally, this has been addressed through a two-step process: detecting entities within individual maps and then associating them via a heuristic-based post-processing step. In this paper, we propose a novel approach that combines segmentation and association of geographic entities in historical maps using video instance segmentation (VIS). This method significantly streamlines geographic entity alignment and enhances automation. However, acquiring high-quality, video-format training data for VIS models is prohibitively expensive, especially for historical maps that often contain hundreds or thousands of geographic entities. To mitigate this challenge, we explore self-supervised learning (SSL) techniques to enhance VIS performance on historical maps. We evaluate the performance of VIS models under different pretraining configurations and introduce a novel method for generating synthetic videos from unlabeled historical map images for pretraining. Our proposed self-supervised VIS method substantially reduces the need for manual annotation. Experimental results demonstrate the superiority of the proposed self-supervised VIS approach, achieving a 24.9% improvement in AP and a 0.23 increase in F1 score compared to the model trained from scratch.
zh
[CV-43] DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters
【速读】: 该论文试图解决从多模态数据中高质量生成和动画化3D角色的问题,特别是针对衣物和头发等复杂元素的动画化。解决方案的关键在于提出了一个名为DRiVE的新框架,该框架利用3D高斯表示(3D Gaussian representation)来生成和装配3D人体角色,从而实现高效的动画和高品质渲染。此外,论文还引入了基于3D高斯的扩散模块GSDiff,用于预测关节位置的空间分布,克服了传统回归方法的局限性。通过这些创新,DRiVE在精确装配和复杂元素的动态表现上超越了现有方法。
链接: https://arxiv.org/abs/2411.17423
作者: Mingze Sun,Junhao Chen,Junting Dong,Yurun Chen,Xinyu Jiang,Shiwei Mao,Puhua Jiang,Jingbo Wang,Bo Dai,Ruqi Huang
关键词-EN: Recent advances, reconstruction from multi-modal, advances in generative, generative models, models have enabled
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in generative models have enabled high-quality 3D character reconstruction from multi-modal. However, animating these generated characters remains a challenging task, especially for complex elements like garments and hair, due to the lack of large-scale datasets and effective rigging methods. To address this gap, we curate AnimeRig, a large-scale dataset with detailed skeleton and skinning annotations. Building upon this, we propose DRiVE, a novel framework for generating and rigging 3D human characters with intricate structures. Unlike existing methods, DRiVE utilizes a 3D Gaussian representation, facilitating efficient animation and high-quality rendering. We further introduce GSDiff, a 3D Gaussian-based diffusion module that predicts joint positions as spatial distributions, overcoming the limitations of regression-based approaches. Extensive experiments demonstrate that DRiVE achieves precise rigging results, enabling realistic dynamics for clothing and hair, and surpassing previous methods in both quality and versatility. The code and dataset will be made public for academic use upon acceptance.
zh
[CV-44] Multimodal Outer Arithmetic Block Dual Fusion of Whole Slide Images and Omics Data for Precision Oncology
【速读】: 该论文试图解决中枢神经系统肿瘤分类的问题,通过将DNA甲基化数据与全切片图像(Whole Slide Images, WSI)相结合,以提高诊断的精确性。解决方案的关键在于提出了一种双融合框架,该框架在早期和晚期两个阶段都整合了基因组数据。在早期融合阶段,基因组嵌入被投影到逐片潜在空间中,生成包含每个片分子和形态学信息的omic-WSI嵌入,从而将这些信息融入到组织学的空间表示中。在晚期融合阶段,通过多模态外部算术块(Multimodal Outer Arithmetic Block, MOAB)重新引入基因组数据,与片级omic-WSI嵌入融合,捕捉两种模态的全局相关性和互补性。这种双融合策略不仅提高了分类性能,还增强了模型的可解释性,显示出其在临床诊断中的潜力。
链接: https://arxiv.org/abs/2411.17418
作者: Omnia Alwazzan,Amaya Gallagher-Syed,Thomas Millner,Ioannis Patras,Silvia Marino,Gregory Slabaugh
关键词-EN: integrating DNA methylation, Slide Images, DNA methylation data, central nervous system, Developing a central
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Developing a central nervous system (CNS) tumor classifier by integrating DNA methylation data with Whole Slide Images (WSI) offers significant potential for enhancing diagnostic precision in neuropathology. Existing approaches typically integrate encoded omic data with histology only once - either at an early or late fusion stage - while reintroducing encoded omic data to create a dual fusion variant remains unexplored. Nevertheless, reintroduction of omic embeddings during early and late fusion enables the capture of complementary information from localized patch-level and holistic slide-level interactions, allowing boosted performance through advanced multimodal integration. To achieve this, we propose a dual fusion framework that integrates omic data at both early and late stages, fully leveraging its diagnostic strength. In the early fusion stage, omic embeddings are projected into a patch-wise latent space, generating omic-WSI embeddings that encapsulate per-patch molecular and morphological insights, effectively incorporating this information into the spatial representation of histology. These embeddings are refined with a multiple instance learning gated attention mechanism to attend to critical patches. In the late fusion stage, we reintroduce the omic data by fusing it with slide-level omic-WSI embeddings using a Multimodal Outer Arithmetic Block (MOAB), which richly intermingles features from both modalities, capturing their global correlations and complementarity. We demonstrate accurate CNS tumor subtyping across 20 fine-grained subtypes and validate our approach on benchmark datasets, achieving improved survival prediction on TCGA-BLCA and competitive performance on TCGA-BRCA compared to state-of-the-art methods. This dual fusion strategy enhances interpretability and classification performance, highlighting its potential for clinical diagnostics.
zh
[CV-45] CoA: Chain-of-Action for Generative Semantic Labels
【速读】: 该论文试图解决在开放领域(如自动驾驶)中,视觉-语言模型(Vision-Language Models, VLM)在图像分类时使用预定义标签集的不切实际性问题,以及固定嵌入文本提示倾向于预测单一标签而非多标签的局限性。解决方案的关键在于引入了一种创新的行动链(Chain-of-Action, CoA)方法,该方法通过逐步提取和合并图像中的关键信息,生成与图像所有上下文相关特征对齐的标签。CoA通过将生成标签任务分解为详细行动,并构建一个行动链来实现最终的生成目标,每个行动从前一行动中提取并合并关键信息,并将丰富后的信息作为上下文传递给下一行动,从而显著提升VLM生成全面且准确语义标签的能力。
链接: https://arxiv.org/abs/2411.17406
作者: Meng Wei,Zhongnian Li,Peng Ying,Xinzheng Xu
关键词-EN: demonstrated remarkable capability, Recent advances, demonstrated remarkable, remarkable capability, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures
点击查看摘要
Abstract:Recent advances in vision-language models (VLM) have demonstrated remarkable capability in image classification. These VLMs leverage a predefined set of categories to construct text prompts for zero-shot reasoning. However, in more open-ended domains like autonomous driving, using a predefined set of labels becomes impractical, as the semantic label space is unknown and constantly evolving. Additionally, fixed embedding text prompts often tend to predict a single label (while in reality, multiple labels commonly exist per image). In this paper, we introduce CoA, an innovative Chain-of-Action (CoA) method that generates labels aligned with all contextually relevant features of an image. CoA is designed based on the observation that enriched and valuable contextual information improves generative performance during inference. Traditional vision-language models tend to output singular and redundant responses. Therefore, we employ a tailored CoA to alleviate this problem. We first break down the generative labeling task into detailed actions and construct an CoA leading to the final generative objective. Each action extracts and merges key information from the previous action and passes the enriched information as context to the next action, ultimately improving the VLM in generating comprehensive and accurate semantic labels. We assess the effectiveness of CoA through comprehensive evaluations on widely-used benchmark datasets and the results demonstrate significant improvements across key performance metrics.
zh
[CV-46] NumGrad-Pull: Numerical Gradient Guided Tri-plane Representation for Surface Reconstruction from Point Clouds
【速读】: 该论文试图解决从无方向和无序的三维点云中重建连续表面的基本挑战。解决方案的关键在于引入了一种名为NumGrad-Pull的方法,该方法利用三平面结构(tri-plane structures)的表示能力来加速有符号距离函数(signed distance functions)的学习,并增强表面重建中的局部细节保真度。具体来说,论文提出使用数值梯度(numerical gradients)替代传统的解析计算,以提高基于网格的三平面训练的稳定性。此外,还设计了一种渐进平面扩展策略和数据采样策略,以促进有符号距离函数的更快收敛并减少重建伪影。实验结果表明,该方法在多种基准测试中表现出了有效性和鲁棒性。
链接: https://arxiv.org/abs/2411.17392
作者: Ruikai Cui,Shi Qiu,Jiawei Liu,Saeed Anwar,Nick Barnes
关键词-EN: Reconstructing continuous surfaces, Reconstructing continuous, signed distance functions, unoriented and unordered, vision and graphics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:Reconstructing continuous surfaces from unoriented and unordered 3D points is a fundamental challenge in computer vision and graphics. Recent advancements address this problem by training neural signed distance functions to pull 3D location queries to their closest points on a surface, following the predicted signed distances and the analytical gradients computed by the network. In this paper, we introduce NumGrad-Pull, leveraging the representation capability of tri-plane structures to accelerate the learning of signed distance functions and enhance the fidelity of local details in surface reconstruction. To further improve the training stability of grid-based tri-planes, we propose to exploit numerical gradients, replacing conventional analytical computations. Additionally, we present a progressive plane expansion strategy to facilitate faster signed distance function convergence and design a data sampling strategy to mitigate reconstruction artifacts. Our extensive experiments across a variety of benchmarks demonstrate the effectiveness and robustness of our approach. Code is available at this https URL
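The "pull" operation with numerical gradients is easy to write down: estimate the SDF gradient by central finite differences and move each query point by its signed distance along that direction. In the sketch an analytic sphere stands in for the learned tri-plane SDF, and the finite-difference step eps is an arbitrary choice.

```python
import torch

def sphere_sdf(p, radius=0.5):
    # Stand-in for the learned tri-plane SDF: signed distance to a sphere surface.
    return p.norm(dim=-1, keepdim=True) - radius

def numerical_gradient(sdf, p, eps=1e-3):
    # Central finite differences instead of autograd-based analytical gradients.
    offsets = eps * torch.eye(3)
    grads = [(sdf(p + o) - sdf(p - o)) / (2 * eps) for o in offsets]
    return torch.cat(grads, dim=-1)

def pull_to_surface(sdf, queries, eps=1e-3):
    d = sdf(queries)                                   # signed distances
    g = numerical_gradient(sdf, queries, eps)
    g = g / (g.norm(dim=-1, keepdim=True) + 1e-8)      # unit step direction
    return queries - d * g                             # projected points

q = torch.rand(1024, 3) * 2 - 1
pulled = pull_to_surface(sphere_sdf, q)
print(pulled.norm(dim=-1).mean())   # ~0.5: queries land on the sphere surface
```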
zh
[CV-47] DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
【速读】: 该论文试图解决在大规模预训练视觉模型中,深度感知如何在没有显式深度监督的情况下产生的问题。解决方案的关键在于引入了一个名为DepthCues的新基准,用于评估模型对深度线索的理解。通过分析20个多样且具有代表性的预训练视觉模型,研究发现人类类似的深度线索在较新的更大模型中自然涌现。此外,通过在DepthCues上进行微调,即使没有密集的深度监督,也能显著提升深度估计的性能。这一研究为深入探索视觉模型中的深度感知提供了新的工具和方法。
链接: https://arxiv.org/abs/2411.17385
作者: Duolikun Danier,Mehmet Aygün,Changjian Li,Hakan Bilen,Oisin Mac Aodha
关键词-EN: Large-scale pre-trained vision, generalizable visual representations, Large-scale pre-trained, increasingly prevalent, offering expressive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Website: this https URL
点击查看摘要
Abstract:Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether the monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.
zh
[CV-48] AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation
【速读】: 该论文试图解决在生成式产品推广视频中,如何实现高质量的人-物交互 (Human-Object Interaction, HOI) 视频生成的问题。解决方案的关键在于提出了一个名为 AnchorCrafter 的新型扩散模型,该模型通过两个核心创新来实现这一目标:1) HOI-appearance perception,增强从任意多视角对物体外观的识别能力,并解耦物体和人的外观;2) HOI-motion injection,通过克服物体轨迹条件化和相互遮挡管理等挑战,实现复杂的人-物交互。此外,论文还引入了 HOI-region reweighting loss 作为训练目标,以增强物体细节的学习。实验结果表明,AnchorCrafter 在保持物体外观和形状意识的同时,还能维持人物外观和动作的一致性,显著优于现有方法。
链接: https://arxiv.org/abs/2411.17383
作者: Ziyi Xu,Ziyao Huang,Juan Cao,Yong Zhang,Xiaodong Cun,Qing Shuai,Yuchen Wang,Linchao Bao,Jintao Li,Fan Tang
关键词-EN: anchor-style product promotion, presents promising opportunities, promotion videos presents, videos presents promising, product promotion videos
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The automatic generation of anchor-style product promotion videos presents promising opportunities in online commerce, advertising, and consumer engagement. However, this remains a challenging task despite significant advancements in pose-guided human video generation. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Additionally, we introduce the HOI-region reweighting loss, a training objective that enhances the learning of object details. Extensive experiments demonstrate that our proposed system outperforms existing methods in preserving object appearance and shape awareness, while simultaneously maintaining consistency in human appearance and motion. Project page: this https URL
zh
[CV-49] RealTraj: Towards Real-World Pedestrian Trajectory Forecasting
【速读】: 该论文旨在解决传统行人轨迹预测中的三个关键限制:行人感知错误、现实世界数据收集成本高以及行人ID标注成本高。解决方案的关键在于提出了一种名为RealTraj的新框架,该框架通过两个训练阶段——在合成数据上的自监督预训练和在有限现实世界数据上的弱监督微调——来减少数据收集的工作量。具体来说,论文提出了Det2TrajFormer模型,该模型通过使用过去的检测结果作为输入,能够在跟踪噪声下保持不变性。此外,通过多任务预训练,模型增强了鲁棒性并提高了仅基于检测数据的预测性能。与以往的方法不同,该方法仅使用地面真实检测结果进行微调,显著减少了昂贵的行人ID标注需求。实验结果表明,该方法在多个数据集上优于现有的最先进轨迹预测方法。
链接: https://arxiv.org/abs/2411.17376
作者: Ryo Fujii,Hideo Saito,Ryo Hachiuma
关键词-EN: paper jointly addresses, pedestrian perception errors, conventional pedestrian trajectory, data collection costs, conventional pedestrian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper jointly addresses three key limitations in conventional pedestrian trajectory forecasting: pedestrian perception errors, real-world data collection costs, and person ID annotation costs. We propose a novel framework, RealTraj, that enhances the real-world applicability of trajectory forecasting. Our approach includes two training phases–self-supervised pretraining on synthetic data and weakly-supervised fine-tuning with limited real-world data–to minimize data collection efforts. To improve robustness to real-world errors, we focus on both model design and training objectives. Specifically, we present Det2TrajFormer, a trajectory forecasting model that remains invariant in tracking noise by using past detections as inputs. Additionally, we pretrain the model using multiple pretext tasks, which enhance robustness and improve forecasting performance based solely on detection data. Unlike previous trajectory forecasting methods, our approach fine-tunes the model using only ground-truth detections, significantly reducing the need for costly person ID annotations. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art trajectory forecasting methods on multiple datasets.
zh
[CV-50] SAM-MPA: Applying SAM to Few-shot Medical Image Segmentation using Mask Propagation and Auto-prompting NEURIPS2024
【速读】: 该论文试图解决医学图像分割中标注成本高昂的问题,提出了一种基于少样本学习的解决方案。解决方案的关键在于利用预训练的Segment Anything Model (SAM),该模型已在超过10亿个掩码上进行训练,从而避免了大量特定领域标注数据的依赖。具体来说,论文提出了SAM-MPA框架,通过基于掩码传播的自动提示技术,首先使用k-中心点聚类选择最具代表性的样本进行标注,构建支持集。然后,通过图像配准生成变形场,将掩码知识传播到整个数据集,获得粗略掩码。接着,基于粗略掩码的区域和边界扩展自动生成视觉提示,包括点、框和粗略掩码,输入到SAM中进行分割预测,并通过后处理模块进行结果的细化。实验结果表明,SAM-MPA在仅使用少量标注样本的情况下,显著优于其他先进的少样本自动分割方法。
链接: https://arxiv.org/abs/2411.17363
作者: Jie Xu,Xiaokang Li,Chengyu Yue,Yuanyuan Wang,Yi Guo
关键词-EN: expensive annotation costs, prohibitively expensive annotation, Medical image, annotation costs, Medical image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as an oral presentation at NeurIPS 2024 AIM-FM Workshop
点击查看摘要
Abstract:Medical image segmentation often faces the challenge of prohibitively expensive annotation costs. While few-shot learning offers a promising solution to alleviate this burden, conventional approaches still rely heavily on pre-training with large volumes of labeled data from known categories. To address this issue, we propose leveraging the Segment Anything Model (SAM), pre-trained on over 1 billion masks, thus circumventing the need for extensive domain-specific annotated data. In light of this, we developed SAM-MPA, an innovative SAM-based framework for few-shot medical image segmentation using Mask Propagation-based Auto-prompting. Initially, we employ k-centroid clustering to select the most representative examples for labelling to construct the support set. These annotated examples are registered to other images yielding deformation fields that facilitate the propagation of the mask knowledge to obtain coarse masks across the dataset. Subsequently, we automatically generate visual prompts based on the region and boundary expansion of the coarse mask, including points, box and a coarse mask. Finally, we can obtain the segmentation predictions by inputting these prompts into SAM and refine the results by post refinement module. We validate the performance of the proposed framework through extensive experiments conducted on two medical image datasets with different modalities. Our method achieves Dices of 74.53%, 94.36% on Breast US, Chest X-ray, respectively. Experimental results substantiate that SAM-MPA yields high-accuracy segmentations within 10 labeled examples, outperforming other state-of-the-art few-shot auto-segmentation methods. Our method enables the customization of SAM for any medical image dataset with a small number of labeled examples.
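The first step above — choosing the most representative images to annotate — can be sketched with ordinary k-means over per-image feature vectors, keeping the sample nearest each centroid. The random features below are placeholders; the paper presumably clusters something more meaningful than raw intensities.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_support_set(features, k=10, seed=0):
    """Pick k representative samples: the one nearest each k-means centroid."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(features)
    nearest = [int(np.argmin(np.linalg.norm(features - c, axis=1)))
               for c in km.cluster_centers_]
    return sorted(set(nearest))

# Toy stand-in for per-image feature vectors of an unlabeled medical dataset.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))
print("annotate these image indices:", select_support_set(features, k=10))
```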
zh
[CV-51] DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering
【速读】: 该论文试图解决多视图对比聚类(Multi-view Contrastive Clustering, MVCC)中存在的两个主要问题:1) 现有方法通过任意组合两个视图生成跨视图对,导致大量不可靠对的出现;2) 这些方法往往忽视多视图表示之间的差异,导致表示退化。解决方案的关键在于提出了双重加权对比学习(Dual-Weighted Contrastive Learning, DWCL)模型。具体来说,通过引入创新的Best-Other(B-O)对比机制,以低计算成本增强单视图表示;同时,采用双重加权策略,结合视图质量权重和视图差异权重,有效降低低质量和高度差异的跨视图对的影响,从而缓解表示退化问题。实验结果表明,DWCL在多个多视图数据集上显著优于现有方法,展示了其在多视图对比聚类中的优越性能和鲁棒性。
链接: https://arxiv.org/abs/2411.17354
作者: Zhihui Zhang,Xiaoshuai Hao,Hanning Yuan,Lianhua Chi,Qi Guo,Qi Li,Ziqiang Yuan,Jinhui Pang,Yexin Li,Sijie Ruan
关键词-EN: gained significant attention, generating consistent clustering, consistent clustering structures, gained significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multi-view contrastive clustering (MVCC) has gained significant attention for generating consistent clustering structures from multiple views through contrastive learning. However, most existing MVCC methods create cross-views by combining any two views, leading to a high volume of unreliable pairs. Furthermore, these approaches often overlook discrepancies in multi-view representations, resulting in representation degeneration. To address these challenges, we introduce a novel model called Dual-Weighted Contrastive Learning (DWCL) for Multi-View Clustering. Specifically, to reduce the impact of unreliable cross-views, we introduce an innovative Best-Other (B-O) contrastive mechanism that enhances the representation of individual views at a low computational cost. Furthermore, we develop a dual weighting strategy that combines a view quality weight, reflecting the quality of each view, with a view discrepancy weight. This approach effectively mitigates representation degeneration by downplaying cross-views that are both low in quality and high in discrepancy. We theoretically validate the efficiency of the B-O contrastive mechanism and the effectiveness of the dual weighting strategy. Extensive experiments demonstrate that DWCL outperforms previous methods across eight multi-view datasets, showcasing superior performance and robustness in MVCC. Specifically, our method achieves absolute accuracy improvements of 5.4% and 5.6% compared to state-of-the-art methods on the Caltech6V7 and MSRCv1 datasets, respectively.
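The abstract does not give the exact form of the quality and discrepancy weights, so the sketch below only illustrates the general idea of a dual-weighted cross-view contrastive term: a standard InfoNCE loss scaled down when a view pair is low-quality or highly discrepant. The specific weight formulas used here (product of per-view quality scores; exponential of one minus mean cosine similarity) are illustrative assumptions, not DWCL's definitions:

```python
# Hedged sketch: down-weighting a cross-view contrastive term by view quality and
# view discrepancy. The specific weight definitions below are illustrative assumptions.
import torch
import torch.nn.functional as F

def weighted_cross_view_infonce(z_i, z_j, quality_i, quality_j, temperature=0.5):
    """z_i, z_j: (N, D) embeddings of the same N samples under views i and j.
    quality_*: scalar quality scores in [0, 1] for each view."""
    z_i, z_j = F.normalize(z_i, dim=1), F.normalize(z_j, dim=1)
    logits = z_i @ z_j.t() / temperature            # (N, N) cross-view similarities
    labels = torch.arange(z_i.size(0), device=z_i.device)
    loss = F.cross_entropy(logits, labels)
    # Illustrative weights: quality = product of per-view scores,
    # discrepancy = 1 - mean cosine similarity between matched embeddings.
    quality_w = quality_i * quality_j
    discrepancy = 1.0 - (z_i * z_j).sum(dim=1).mean()
    discrepancy_w = torch.exp(-discrepancy)         # high discrepancy -> small weight
    return quality_w * discrepancy_w * loss

z_a, z_b = torch.randn(64, 128), torch.randn(64, 128)
print(weighted_cross_view_infonce(z_a, z_b, quality_i=0.9, quality_j=0.6))
```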
zh
[CV-52] Real-Time Multimodal Signal Processing for HRI in RoboCup: Understanding a Human Referee
【速读】: 该论文试图解决在动态环境中自主系统与人类之间的实时通信问题,特别是在RoboCup比赛中,机器人需要准确理解裁判的手势和哨声,同时减少对网络的依赖。解决方案的关键在于采用两阶段流水线进行手势识别,包括关键点提取和分类,以及使用连续卷积神经网络(CCNNs)进行高效的哨声检测。这种方法提升了在RoboCup等竞争环境中实时人机交互的能力,为开发能够与人类协作的自主系统提供了工具。
链接: https://arxiv.org/abs/2411.17347
作者: Filippo Ansalone,Flavio Maiorana,Daniele Affinita,Flavio Volpi,Eugenio Bugli,Francesco Petri,Michele Brienza,Valerio Spagnoli,Vincenzo Suriani,Daniele Nardi,Domenico D. Bloisi
关键词-EN: Advancing human-robot communication, accurate real-time interpretation, dynamic environments, signals is essential, Advancing human-robot
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11th Italian Workshop on Artificial Intelligence and Robotics (AIRO 2024), Published in CEUR Workshop Proceedings AI*IA Series
点击查看摘要
Abstract:Advancing human-robot communication is crucial for autonomous systems operating in dynamic environments, where accurate real-time interpretation of human signals is essential. RoboCup provides a compelling scenario for testing these capabilities, requiring robots to understand referee gestures and whistle with minimal network reliance. Using the NAO robot platform, this study implements a two-stage pipeline for gesture recognition through keypoint extraction and classification, alongside continuous convolutional neural networks (CCNNs) for efficient whistle detection. The proposed approach enhances real-time human-robot interaction in a competitive setting like RoboCup, offering some tools to advance the development of autonomous systems capable of cooperating with humans.
zh
[CV-53] MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension
【速读】: 该论文试图解决运动合成与理解的多任务统一框架问题,解决方案的关键在于三个核心原则:首先,通过HoMi Tokenizer建立了一个强大的统一表示空间,其单一代码本的重建精度可媲美使用六个代码本的残差向量量化(Residual Vector Quantization),超越了所有现有的单一代码本标记器;其次,整合大型语言模型以处理多种运动相关任务,通过跨模态融合实现复杂运动合成与理解;最后,引入MotionHub数据集,这是目前最广泛的多模态多任务运动数据集,支持大型语言模型的微调。这些关键技术使得MotionLLaMA在多种运动相关任务中达到或接近最先进(SOTA)的性能。
链接: https://arxiv.org/abs/2411.17335
作者: Zeyu Ling,Bo Han,Shiyang Li,Hongdeng Shen,Jikang Cheng,Changqing Zou
关键词-EN: HoMi Tokenizer, full-body motion tokenizer, motion tokenizer called, tokenizer called, tokenizer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper introduces MotionLLaMA, a unified framework for motion synthesis and comprehension, along with a novel full-body motion tokenizer called the HoMi Tokenizer. MotionLLaMA is developed based on three core principles. First, it establishes a powerful unified representation space through the HoMi Tokenizer. Using a single codebook, the HoMi Tokenizer in MotionLLaMA achieves reconstruction accuracy comparable to residual vector quantization tokenizers utilizing six codebooks, outperforming all existing single-codebook tokenizers. Second, MotionLLaMA integrates a large language model to tackle various motion-related tasks. This integration bridges various modalities, facilitating both comprehensive and intricate motion synthesis and comprehension. Third, MotionLLaMA introduces the MotionHub dataset, currently the most extensive multimodal, multitask motion dataset, which enables fine-tuning of large language models. Extensive experimental results demonstrate that MotionLLaMA not only covers the widest range of motion-related tasks but also achieves state-of-the-art (SOTA) performance in motion completion, interaction dual-person text-to-motion, and all comprehension tasks while reaching performance comparable to SOTA in the remaining tasks. The code and MotionHub dataset are publicly available.
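As a reference point for what a single-codebook motion tokenizer does, here is a generic VQ-VAE-style quantizer: continuous motion features are mapped to their nearest codebook entry and returned as discrete token ids that a language model can consume. This shows only the standard mechanism, not the HoMi Tokenizer's actual architecture; the codebook size and feature dimension are placeholders:

```python
# Minimal single-codebook vector quantizer (generic VQ-VAE style), shown only to
# illustrate discrete motion tokenization; it is not the HoMi Tokenizer itself.
import torch
import torch.nn as nn

class SingleCodebookQuantizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / codebook_size, 1.0 / codebook_size)

    def forward(self, z):                      # z: (B, T, dim) continuous motion features
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)   # (B*T, codebook_size)
        ids = dists.argmin(dim=1)
        z_q = self.codebook(ids).view_as(z)
        z_q = z + (z_q - z).detach()           # straight-through estimator for gradients
        return z_q, ids.view(z.shape[:-1])     # quantized features and discrete token ids

tokenizer = SingleCodebookQuantizer()
z_q, tokens = tokenizer(torch.randn(2, 196, 512))
print(tokens.shape)   # (2, 196) motion tokens that an LLM can consume
```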
zh
[CV-54] InsightEdit: Towards Better Instruction Following for Image Editing
【速读】: 该论文试图解决基于指令的图像编辑任务中存在的两个主要问题:现有数据集的低分辨率、背景不一致性和指令过于简单,以及当前方法主要依赖文本信息而未充分利用图像信息,导致在复杂指令执行和背景一致性维护方面的表现不佳。解决方案的关键在于:首先,通过创新的数据构建流程创建了AdvancedEdit数据集,该数据集具有高视觉质量、复杂指令和良好的背景一致性;其次,引入了一种双流桥接机制,利用多模态大语言模型(Multimodal Large Language Models, MLLM)推理出的文本和视觉特征,更精确地指导图像编辑过程。这些创新使得InsightEdit方法在复杂指令执行和背景一致性维护方面达到了最先进的性能。
链接: https://arxiv.org/abs/2411.17323
作者: Yingjing Xu,Jie Kong,Jiazhi Wang,Xiao Pan,Bo Lin,Qiang Liu
关键词-EN: instruction-based image editing, task of instruction-based, background consistency, image editing, Large Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations still remain: First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on the text while the rich image information is underexplored, therefore inferior in complex instruction following and maintaining background consistency. Targeting these issues, we first curated the AdvancedEdit dataset using a novel data construction pipeline, formulating a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject the rich image information, we introduce a two-stream bridging mechanism utilizing both the textual and visual features reasoned by the powerful Multimodal Large Language Models (MLLM) to guide the image editing process more precisely. Extensive results demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.
zh
[CV-55] Event Ellipsometer: Event-based Mueller-Matrix Video Imaging
【速读】: 该论文试图解决现有光学椭偏仪(optical ellipsometers)在捕捉动态场景时由于采集时间长而受限的问题。解决方案的关键在于引入了一种名为“事件椭偏仪(Event Ellipsometer)”的新方法,通过在光源前使用快速旋转的四分之一波片(quarter-wave plates, QWPs)和事件相机(event camera)异步捕捉由旋转QWPs引起的强度变化,从而实现对动态场景的Mueller矩阵视频成像。论文中还开发了椭偏事件图像形成模型、校准方法和椭偏事件重建方法,实验证明该系统能够在30fps的帧率下进行Mueller矩阵视频成像,从而将椭偏测量扩展到动态场景。
链接: https://arxiv.org/abs/2411.17313
作者: Ryota Maeda,Yunseong Moon,Seung-Hwan Baek
关键词-EN: Light-matter interactions modify, Light-matter interactions, interactions modify, Light-matter, polarization state
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Light-matter interactions modify both the intensity and polarization state of light. Changes in polarization, represented by a Mueller matrix, encode detailed scene information. Existing optical ellipsometers capture Mueller-matrix images; however, they are often limited to capturing static scenes due to long acquisition times. Here, we introduce Event Ellipsometer, a method for acquiring a Mueller-matrix video for dynamic scenes. Our imaging system employs fast-rotating quarter-wave plates (QWPs) in front of a light source and an event camera that asynchronously captures intensity changes induced by the rotating QWPs. We develop an ellipsometric-event image formation model, a calibration method, and an ellipsometric-event reconstruction method. We experimentally demonstrate that Event Ellipsometer enables Mueller-matrix video imaging at 30fps, extending ellipsometry to dynamic scenes.
zh
[CV-56] Reward Incremental Learning in Text-to-Image Generation
【速读】: 该论文试图解决在文本到图像生成任务中,预训练的扩散模型在面对多个逐步引入的下游目标时,如何避免灾难性遗忘(catastrophic forgetting)的问题。解决方案的关键是提出了奖励增量蒸馏(Reward Incremental Distillation, RID)方法,该方法通过在模型适应新目标的同时,保留对先前目标的知识,从而在逐步引入多个下游目标的情况下,实现稳定的高质量图像生成。RID方法通过最小化计算开销,有效缓解了灾难性遗忘问题,确保模型在奖励增量学习(Reward Incremental Learning, RIL)场景中持续表现出色。
链接: https://arxiv.org/abs/2411.17310
作者: Maorong Wang,Jiafeng Mao,Xueting Wang,Toshihiko Yamasaki
关键词-EN: significantly advanced, recent success, success of denoising, Reward Incremental Learning, denoising diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review
点击查看摘要
Abstract:The recent success of denoising diffusion models has significantly advanced text-to-image generation. While these large-scale pretrained models show excellent performance in general image synthesis, downstream objectives often require fine-tuning to meet specific criteria such as aesthetics or human preference. Reward gradient-based strategies are promising in this context, yet existing methods are limited to single-reward tasks, restricting their applicability in real-world scenarios that demand adapting to multiple objectives introduced incrementally over time. In this paper, we first define this more realistic and unexplored problem, termed Reward Incremental Learning (RIL), where models are desired to adapt to multiple downstream objectives incrementally. Additionally, while the models adapt to the ever-emerging new objectives, we observe a unique form of catastrophic forgetting in diffusion model fine-tuning, affecting both metric-wise and visual structure-wise image quality. To address this catastrophic forgetting challenge, we propose Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead, enabling stable performance across sequential reward tasks. The experimental results demonstrate the efficacy of RID in achieving consistent, high-quality generation in RIL scenarios. The source code of our work will be publicly available upon acceptance.
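The abstract does not spell out the RID objective, so the following is only a hedged sketch of the generic recipe it alludes to: optimize the new reward while distilling toward a frozen snapshot of the previously adapted model to limit forgetting. The MSE distillation term, the lambda weight, and the denoiser signature are all assumptions:

```python
# Illustrative sketch (not the paper's exact RID objective): fine-tune on a new reward
# while distilling from a frozen snapshot of the previously fine-tuned model.
import copy
import torch
import torch.nn.functional as F

def rid_style_loss(model, frozen_prev, x_t, t, cond, new_reward_fn, lam=1.0):
    """model / frozen_prev: denoisers with signature f(x_t, t, cond) -> prediction."""
    pred = model(x_t, t, cond)
    with torch.no_grad():
        prev_pred = frozen_prev(x_t, t, cond)
    reward_loss = -new_reward_fn(pred)                  # maximize the new reward
    distill_loss = F.mse_loss(pred, prev_pred)          # stay close to earlier behavior
    return reward_loss + lam * distill_loss

# Toy usage with stand-in modules.
net = torch.nn.Linear(16, 16)
prev = copy.deepcopy(net).eval()
x, t, c = torch.randn(4, 16), torch.zeros(4), None
dummy_model = lambda x_t, t, cond: net(x_t)
dummy_prev = lambda x_t, t, cond: prev(x_t)
loss = rid_style_loss(dummy_model, dummy_prev, x, t, c, new_reward_fn=lambda p: p.mean())
loss.backward()
```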
zh
[CV-57] In-Car Biometrics (iCarB) Datasets for Driver Recognition: Face, Fingerprint, and Voice
【速读】: 该论文试图解决在车内环境中进行生物识别数据收集和评估的问题。解决方案的关键在于提供了一个包含面部视频、指纹图像和语音样本的多模态生物识别数据集(iCarB-Face, iCarB-Fingerprint, iCarB-Voice),这些数据集由200名志愿者在车内驾驶座上采集,涵盖了多种环境条件和模拟的非理想数据采集情况。该数据集不仅适用于汽车环境中的生物识别系统评估,还可用于多模态融合算法、呈现攻击检测算法以及生物识别系统中的偏见研究。其关键创新点包括:(i) 提供多模态数据集,填补了车内指纹数据集的空白;(ii) 数据集具有高度的多样性,包括性别、肤色和年龄的广泛分布;(iii) 提供多种评估协议和元数据,支持深入的生物识别研究。
链接: https://arxiv.org/abs/2411.17305
作者: Vedrana Krivokuca Hahn,Jeremy Maceiras,Alain Komaty,Philip Abbet,Sebastien Marcel
关键词-EN: Presentation Attack Detection, collected inside, datasets, biometric, consenting volunteers
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 13 figures, 4 tables
点击查看摘要
Abstract:We present three biometric datasets (iCarB-Face, iCarB-Fingerprint, iCarB-Voice) containing face videos, fingerprint images, and voice samples, collected inside a car from 200 consenting volunteers. The data was acquired using a near-infrared camera, two fingerprint scanners, and two microphones, while the volunteers were seated in the driver’s seat of the car. The data collection took place while the car was parked both indoors and outdoors, and different “noises” were added to simulate non-ideal biometric data capture that may be encountered in real-life driver recognition. Although the datasets are specifically tailored to in-vehicle biometric recognition, their utility is not limited to the automotive environment. The iCarB datasets, which are available to the research community, can be used to: (i) evaluate and benchmark face, fingerprint, and voice recognition systems (we provide several evaluation protocols); (ii) create multimodal pseudo-identities, to train/test multimodal fusion algorithms; (iii) create Presentation Attacks from the biometric data, to evaluate Presentation Attack Detection algorithms; (iv) investigate demographic and environmental biases in biometric systems, using the provided metadata. To the best of our knowledge, ours are the largest and most diverse publicly available in-vehicle biometric datasets. Most other datasets contain only one biometric modality (usually face), while our datasets consist of three modalities, all acquired in the same automotive environment. Moreover, iCarB-Fingerprint seems to be the first publicly available in-vehicle fingerprint dataset. Finally, the iCarB datasets boast a rare level of demographic diversity among the 200 data subjects, including a 50/50 gender split, skin colours across the whole Fitzpatrick-scale spectrum, and a wide age range (18-60+). So, these datasets will be valuable for advancing biometrics research.
zh
[CV-58] Task Progressive Curriculum Learning for Robust Visual Question Answering
【速读】: 该论文试图解决视觉问答系统 (Visual Question Answering, VQA) 在分布外数据集上的性能不佳问题。解决方案的关键在于提出了一种任务渐进课程学习 (Task Progressive Curriculum Learning, TPCL) 策略,通过将主要的 VQA 问题分解为基于问题类型的小而简单的任务,并按照精心设计的顺序逐步训练模型。此外,论文还引入了一种基于分布的难度测量方法来支持该策略。TPCL 方法简单、模型无关且易于实现,无需数据增强或显式去偏机制,在 VQA-CP v2、VQA-CP v1 和 VQA v2 数据集上达到了最先进的性能,显著优于现有的最竞争力的鲁棒 VQA 方法。
链接: https://arxiv.org/abs/2411.17292
作者: Ahmed Akl,Abdelwahed Khamis,Zhe Wang,Ali Cheraghian,Sara Khalifa,Kewen Wang
关键词-EN: Visual Question Answering, Visual Question, Question Answering, robust Visual Question, Progressive Curriculum Learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Visual Question Answering (VQA) systems are known for their poor performance on out-of-distribution datasets, an issue that previous works addressed through ensemble learning, answer re-ranking, or artificially growing the training set. In this work, we show for the first time that robust Visual Question Answering is attainable by simply enhancing the training strategy. Our proposed approach, Task Progressive Curriculum Learning (TPCL), breaks the main VQA problem into smaller, easier tasks based on the question type. Then, it progressively trains the model on a (carefully crafted) sequence of tasks. We further support the method with a novel distribution-based difficulty measurer. Our approach is conceptually simple, model-agnostic, and easy to implement. We demonstrate the effectiveness of TPCL through a comprehensive evaluation on standard datasets. Without either data augmentation or an explicit debiasing mechanism, it achieves state-of-the-art results on the VQA-CP v2, VQA-CP v1 and VQA v2 datasets. Extensive experiments demonstrate that TPCL outperforms the most competitive robust VQA approaches by more than 5% and 7% on VQA-CP v2 and VQA-CP v1, respectively. TPCL can also boost VQA baseline backbone performance by up to 28.5%.
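The core of TPCL, as described, is partitioning the training set by question type and training on a progressively growing task sequence. A minimal sketch of that loop is shown below; the difficulty scores standing in for the paper's distribution-based measurer are placeholders:

```python
# Conceptual sketch of task-progressive curriculum training: partition samples by
# question type, order tasks by an (assumed) difficulty score, then train on a
# progressively expanding pool. The difficulty measurer here is a placeholder.
from collections import defaultdict

def build_curriculum(samples, difficulty_of_type):
    """samples: iterable of dicts with a 'qtype' key.
    difficulty_of_type: dict mapping question type -> difficulty score."""
    tasks = defaultdict(list)
    for s in samples:
        tasks[s["qtype"]].append(s)
    ordered_types = sorted(tasks, key=lambda t: difficulty_of_type.get(t, 0.0))
    pool = []
    for qtype in ordered_types:        # each stage adds one more task to the pool
        pool = pool + tasks[qtype]
        yield qtype, list(pool)

data = [{"qtype": "yes/no", "q": "is it red?"},
        {"qtype": "count", "q": "how many dogs?"},
        {"qtype": "what", "q": "what is on the table?"}]
for stage, (qtype, train_pool) in enumerate(build_curriculum(data, {"yes/no": 0.1, "what": 0.5, "count": 0.9})):
    print(f"stage {stage}: newly added task '{qtype}', training on {len(train_pool)} samples")
    # model.fit(train_pool)  # train / fine-tune on the expanded pool at this stage
```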
zh
[CV-59] Interpretable label-free self-guided subspace clustering
【速读】: 该论文试图解决在无标签数据情况下,依赖超参数的子空间聚类算法(SC)的超参数优化(HPO)问题。解决方案的关键在于提出了一种基于内部聚类质量指标(如准确率 (ACC) 或归一化互信息 (NMI))的无标签超参数优化方法。该方法通过在预定义的超参数网格上计算伪标签,并假设 ACC 或 NMI 是超参数值的平滑函数,从而选择超参数的子区间,并迭代地进一步分割这些子区间,直到满足相对误差标准。这种方法原则上可以用于任何依赖超参数的 SC 算法,并通过实验验证了其在多个单视图和多视图 SC 算法上的有效性,尽管其性能通常比基于标签的优化方法低 5% 到 7%。此外,通过可视化子空间基,该方法还增强了其可解释性,有助于初始超参数搜索空间的选择。
链接: https://arxiv.org/abs/2411.17291
作者: Ivica Kopriva
关键词-EN: Majority subspace clustering, Majority subspace, clustering quality metrics, Majority, quality metrics
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 45 pages; 3 figures; 10 tables
点击查看摘要
Abstract:Majority subspace clustering (SC) algorithms depend on one or more hyperparameters that need to be carefully tuned for the SC algorithms to achieve high clustering performance. Hyperparameter optimization (HPO) is often performed using grid-search, assuming that some labeled data is available. In some domains, such as medicine, this assumption does not hold true in many cases. One avenue of research focuses on developing SC algorithms that are inherently free of hyperparameters. For hyperparameters-dependent SC algorithms, one approach to label-independent HPO tuning is based on internal clustering quality metrics (if available), whose performance should ideally match that of external (label-dependent) clustering quality metrics. In this paper, we propose a novel approach to label-independent HPO that uses clustering quality metrics, such as accuracy (ACC) or normalized mutual information (NMI), that are computed based on pseudo-labels obtained from the SC algorithm across a predefined grid of hyperparameters. Assuming that ACC (or NMI) is a smooth function of hyperparameter values it is possible to select subintervals of hyperparameters. These subintervals are then iteratively further split into halves or thirds until a relative error criterion is satisfied. In principle, the hyperparameters of any SC algorithm can be tuned using the proposed method. We demonstrate this approach on several single- and multi-view SC algorithms, comparing the achieved performance with their oracle versions across six datasets representing digits, faces and objects. The proposed method typically achieves clustering performance that is 5% to 7% lower than that of the oracle versions. We also make our proposed method interpretable by visualizing subspace bases, which are estimated from the computed clustering partitions. This aids in the initial selection of the hyperparameter search space.
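The proposed tuning loop can be sketched directly from the description: score a hyperparameter grid with an internal metric computed from pseudo-labels, keep the subinterval around the best point, and keep splitting until a relative-error criterion is met. The metric function, grid size, and stopping rule below are illustrative assumptions:

```python
# Sketch of label-independent hyperparameter tuning by iterative interval refinement.
# `score_fn` stands for any internal clustering-quality metric computed from the
# SC algorithm's pseudo-labels; the grid size and stopping rule are illustrative.
import numpy as np

def refine_hyperparameter(score_fn, lo, hi, n_grid=5, rel_tol=1e-2, max_iter=20):
    best_x, best_s = None, -np.inf
    for _ in range(max_iter):
        grid = np.linspace(lo, hi, n_grid)
        scores = [score_fn(x) for x in grid]
        i = int(np.argmax(scores))
        if scores[i] > best_s:
            best_x, best_s = grid[i], scores[i]
        # shrink to the subinterval surrounding the current best grid point
        lo = grid[max(i - 1, 0)]
        hi = grid[min(i + 1, n_grid - 1)]
        if (hi - lo) / max(abs(best_x), 1e-12) < rel_tol:
            break
    return best_x, best_s

# Toy example with a smooth surrogate "quality" curve peaking at 0.3.
best, score = refine_hyperparameter(lambda a: -(a - 0.3) ** 2, lo=0.0, hi=1.0)
print(best, score)
```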
zh
[CV-60] BadScan: An Architectural Backdoor Attack on Visual State Space Models
【速读】: 该论文试图解决视觉状态空间模型(Visual State Space Model, VMamba)在面对后门攻击(backdoor attack)时的脆弱性问题。解决方案的关键在于引入了一种新型的架构后门攻击,称为BadScan。这种攻击利用位平面切片(bit plane slicing)技术创建视觉上难以察觉的后门图像,并在检测到触发器时,通过XOR操作替换VMamba模型中的传统2D选择性扫描(SS2D)机制,代之以包含四种新型扫描模式的BadScan块。实验结果表明,BadScan攻击对视觉状态空间模型构成重大威胁,即使在模型从零开始完全重新训练后,仍能有效误导模型。
链接: https://arxiv.org/abs/2411.17283
作者: Om Suhas Deshmukh,Sankalp Nagaonkar,Achyut Mani Tripathi,Ashish Mishra
关键词-EN: computer vision tasks, Vision Transformers, Visual State Space, shown exceptional performance, exceptional performance compared
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The newly introduced Visual State Space Model (VMamba), which employs \textitState Space Mechanisms (SSM) to interpret images as sequences of patches, has shown exceptional performance compared to Vision Transformers (ViT) across various computer vision tasks. However, recent studies have highlighted that deep models are susceptible to adversarial attacks. One common approach is to embed a trigger in the training data to retrain the model, causing it to misclassify data samples into a target class, a phenomenon known as a backdoor attack. In this paper, we first evaluate the robustness of the VMamba model against existing backdoor attacks. Based on this evaluation, we introduce a novel architectural backdoor attack, termed BadScan, designed to deceive the VMamba model. This attack utilizes bit plane slicing to create visually imperceptible backdoored images. During testing, if a trigger is detected by performing XOR operations between the k^th bit planes of the modified triggered patches, the traditional 2D selective scan (SS2D) mechanism in the visual state space (VSS) block of VMamba is replaced with our newly designed BadScan block, which incorporates four newly developed scanning patterns. We demonstrate that the BadScan backdoor attack represents a significant threat to visual state space models and remains effective even after complete retraining from scratch. Experimental results on two widely used image classification datasets, CIFAR-10, and ImageNet-1K, reveal that while visual state space models generally exhibit robustness against current backdoor attacks, the BadScan attack is particularly effective, achieving a higher Triggered Accuracy Ratio (TAR) in misleading the VMamba model and its variants.
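Bit-plane slicing and the XOR comparison between k-th bit planes are the low-level primitives named in the abstract; the sketch below shows only those primitives, not the full BadScan trigger design or the modified scanning patterns:

```python
# Bit-plane slicing and an XOR check between k-th bit planes of two patches.
# This only illustrates the primitives mentioned above, not the full BadScan attack.
import numpy as np

def bit_plane(img_u8: np.ndarray, k: int) -> np.ndarray:
    """Return the k-th bit plane (0 = least significant) of an 8-bit image."""
    return (img_u8 >> k) & 1

def planes_match(patch_a: np.ndarray, patch_b: np.ndarray, k: int, tol: float = 0.0) -> bool:
    """Patches 'match' if the fraction of differing bits in plane k is <= tol."""
    diff = np.bitwise_xor(bit_plane(patch_a, k), bit_plane(patch_b, k))
    return diff.mean() <= tol

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
modified = patch ^ (1 << 3)                 # flip bit plane 3 everywhere (a toy "trigger")
print(planes_match(patch, modified, k=3))   # False: plane 3 differs
print(planes_match(patch, modified, k=5))   # True: other planes untouched
```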
zh
[CV-61] HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator
【速读】: 该论文试图解决生成式 AI 图像(AIGC images)中常见的质量问题,如伪影和不自然的纹理,特别是在缺陷区域的预测和解释方面。解决方案的关键在于提出了一个基于多模态大语言模型(MLLMs)的分层可解释图像不合理性评估器(HEIE)。该评估器通过引入CoT驱动的可解释三重评估器(CoT-Driven Explainable Trinity Evaluator),将复杂任务分解为逐步增加难度的子任务,从而提高了解释性和可理解性。此外,通过自适应分层不合理性映射器(Adaptive Hierarchical Implausibility Mapper),结合低级图像特征和高级映射器标记,实现了从局部到全局的分层热图预测,并通过不确定性自适应标记方法增强了预测的精确性。最后,论文还提出了一个新的数据集Expl-AIGI-Eval,以促进生成式 AI 图像的可解释不合理性评估。
链接: https://arxiv.org/abs/2411.17261
作者: Fan Yang,Ru Zhen,Jianing Wang,Yanhao Zhang,Haoxiang Chen,Haonan Lu,Sicheng Zhao,Guiguang Ding
关键词-EN: unnatural textures, frequently suffer, suffer from quality, quality issues, issues like artifacts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:AIGC images are prevalent across various fields, yet they frequently suffer from quality issues like artifacts and unnatural textures. Specialized models aim to predict defect region heatmaps but face two primary challenges: (1) lack of explainability, failing to provide reasons and analyses for subtle defects, and (2) inability to leverage common sense and logical reasoning, leading to poor generalization. Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details; and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable image Implausibility Evaluator. We introduce the CoT-Driven Explainable Trinity Evaluator, which integrates heatmaps, scores, and explanation outputs, using CoT to decompose complex tasks into subtasks of increasing difficulty and enhance interpretability. Our Adaptive Hierarchical Implausibility Mapper synergizes low-level image features with high-level mapper tokens from LLMs, enabling precise local-to-global hierarchical heatmap predictions through an uncertainty-based adaptive token approach. Moreover, we propose a new dataset: Expl-AIGI-Eval, designed to facilitate interpretable implausibility evaluation of AIGC images. Our method demonstrates state-of-the-art performance through extensive experiments.
zh
[CV-62] Semantic Data Augmentation for Long-tailed Facial Expression Recognition
【速读】: 该论文试图解决面部表情识别(Facial Expression Recognition, FER)任务中数据集的长尾分布问题,特别是在现实世界应用中面临的挑战。解决方案的关键在于提出了一种新的语义增强方法,通过在VAE-GAN的潜在空间中对源数据编码引入随机性,生成新的样本,从而平衡数据集的长尾分布。这种方法不仅适用于FER任务,还可应用于更多数据稀缺的场景。
链接: https://arxiv.org/abs/2411.17254
作者: Zijian Li,Yan Wang,Bowen Guan,JianKai Yin
关键词-EN: Facial Expression Recognition, driver fatigue monitoring, wide application prospect, Facial Expression, Expression Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Facial Expression Recognition has a wide application prospect in social robotics, health care, driver fatigue monitoring, and many other practical scenarios. Automatic recognition of facial expressions has been extensively studied by the Computer Vision research society. But Facial Expression Recognition in real-world is still a challenging task, partially due to the long-tailed distribution of the dataset. Many recent studies use data augmentation for Long-Tailed Recognition tasks. In this paper, we propose a novel semantic augmentation method. By introducing randomness into the encoding of the source data in the latent space of VAE-GAN, new samples are generated. Then, for facial expression recognition in RAF-DB dataset, we use our augmentation method to balance the long-tailed distribution. Our method can be used in not only FER tasks, but also more diverse data-hungry scenarios.
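The augmentation idea reduces to encoding a source image, perturbing its latent code with extra randomness, and decoding new samples for the tail classes. The toy encoder/decoder and noise scale below are placeholders rather than the paper's VAE-GAN:

```python
# Hedged sketch of latent-space semantic augmentation: encode, perturb the latent
# with extra randomness, decode. The tiny encoder/decoder here are placeholders,
# not the VAE-GAN used in the paper.
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    def __init__(self, x_dim=3 * 64 * 64, z_dim=128):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs concatenated (mu, logvar)
        self.dec = nn.Linear(z_dim, x_dim)

    def augment(self, x, n_aug=4, extra_noise=0.5):
        mu, logvar = self.enc(x.flatten(1)).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        samples = []
        for _ in range(n_aug):
            z = mu + (1.0 + extra_noise) * std * torch.randn_like(std)  # inflated sampling
            samples.append(self.dec(z).view_as(x))
        return torch.stack(samples, dim=1)        # (B, n_aug, C, H, W)

vae = ToyVAE()
minority_batch = torch.randn(2, 3, 64, 64)        # e.g. images of a tail expression class
aug = vae.augment(minority_batch)
print(aug.shape)                                  # torch.Size([2, 4, 3, 64, 64])
```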
zh
[CV-63] LHPF: Look back the History and Plan for the Future in Autonomous Driving
【速读】: 该论文试图解决当前基于模仿学习(Imitation Learning)的自动驾驶规划算法中存在的规划意图不连续和误差累积问题。解决方案的关键在于引入了一种名为LHPF的模仿学习规划器,该规划器通过历史意图聚合模块(Historical Intention Aggregation Module)整合历史规划信息,从而生成更加连续和准确的规划轨迹。此外,该方法还通过引入舒适性辅助任务(Comfort Auxiliary Task)来提升驾驶行为的类人质量。实验结果表明,LHPF不仅在规划性能上超越了现有的先进学习型规划器,而且首次实现了纯学习型规划器优于专家规划器的效果。
链接: https://arxiv.org/abs/2411.17253
作者: Sheng Wang,Yao Tian,Xiaodong Mei,Ge Sun,Jie Cheng,Fulong Ma,Pedro V. Sander,Junwei Liang
关键词-EN: effective planning imperative, making effective planning, autonomous driving critically, driving critically reflect, making effective
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Decision-making and planning in autonomous driving critically reflect the safety of the system, making effective planning imperative. Current imitation learning-based planning algorithms often merge historical trajectories with present observations to predict future candidate paths. However, these algorithms typically assess the current and historical plans independently, leading to discontinuities in driving intentions and an accumulation of errors with each step in a discontinuous plan. To tackle this challenge, this paper introduces LHPF, an imitation learning planner that integrates historical planning information. Our approach employs a historical intention aggregation module that pools historical planning intentions, which are then combined with a spatial query vector to decode the final planning trajectory. Furthermore, we incorporate a comfort auxiliary task to enhance the human-like quality of the driving behavior. Extensive experiments using both real-world and synthetic data demonstrate that LHPF not only surpasses existing advanced learning-based planners in planning performance but also marks the first instance of a purely learning-based planner outperforming the expert. Additionally, the application of the historical intention aggregation module across various backbones highlights the considerable potential of the proposed method. The code will be made publicly available.
zh
[CV-64] DGNN-YOLO: Dynamic Graph Neural Networks with YOLO11 for Small Object Detection and Tracking in Traffic Surveillance
【速读】: 该论文试图解决交通监控系统中对行人、自行车手和摩托车等小目标的精确检测与跟踪问题。传统方法在处理遮挡、低分辨率和动态交通条件时表现不佳,因此需要创新方法来克服这些限制。解决方案的关键在于引入了一种名为DGNN-YOLO的新框架,该框架将动态图神经网络(DGNN)与YOLO11相结合,利用YOLO11的空间特征提取能力进行精确的目标检测,并通过DGNN建模时空关系以实现稳健的实时跟踪。DGNN-YOLO通过构建和更新图结构,将目标表示为节点,其交互表示为边,从而在复杂和动态环境中实现自适应和准确的跟踪。实验结果表明,DGNN-YOLO在各种交通条件下对小目标的检测和跟踪性能优于现有最先进方法,展示了其在处理小目标和遮挡场景中的鲁棒性和可扩展性。
链接: https://arxiv.org/abs/2411.17251
作者: Shahriar Soudeep,M. F. Mridha,Md Abrar Jahin,Nilanjan Dey
关键词-EN: improving road safety, traffic surveillance systems, intelligent transportation systems, traffic surveillance, motorbikes are critical
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Accurate detection and tracking of small objects such as pedestrians, cyclists, and motorbikes are critical for traffic surveillance systems, which are crucial in improving road safety and decision-making in intelligent transportation systems. However, traditional methods struggle with challenges such as occlusion, low resolution, and dynamic traffic conditions, necessitating innovative approaches to address these limitations. This paper introduces DGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN) with YOLO11 to enhance small object detection and tracking in traffic surveillance systems. The framework leverages YOLO11’s advanced spatial feature extraction capabilities for precise object detection and incorporates DGNN to model spatial-temporal relationships for robust real-time tracking dynamically. By constructing and updating graph structures, DGNN-YOLO effectively represents objects as nodes and their interactions as edges, ensuring adaptive and accurate tracking in complex and dynamic environments. Extensive experiments demonstrate that DGNN-YOLO consistently outperforms state-of-the-art methods in detecting and tracking small objects under diverse traffic conditions, achieving the highest precision (0.8382), recall (0.6875), and mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability, particularly in challenging scenarios involving small and occluded objects. This work provides a scalable, real-time traffic surveillance and analysis solution, significantly contributing to intelligent transportation systems.
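To make the graph construction concrete, the sketch below turns one frame's detections into nodes and connects detections whose centers fall within a distance threshold; the edge rule, threshold, and node features are simplifying assumptions, and the actual DGNN update is not shown:

```python
# Illustrative construction of a detection graph for one frame: nodes are boxes,
# edges connect detections whose centers are closer than a threshold. The actual
# DGNN-YOLO edge/update rules are richer; this only shows the data structure.
import numpy as np

def detections_to_graph(boxes_xyxy: np.ndarray, dist_thresh: float = 80.0):
    centers = np.stack([(boxes_xyxy[:, 0] + boxes_xyxy[:, 2]) / 2,
                        (boxes_xyxy[:, 1] + boxes_xyxy[:, 3]) / 2], axis=1)
    n = len(centers)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centers[i] - centers[j]) < dist_thresh:
                edges.append((i, j))
    node_features = np.concatenate([boxes_xyxy, centers], axis=1)  # simple geometric features
    return node_features, edges

boxes = np.array([[10, 10, 50, 90],      # pedestrian
                  [60, 12, 95, 88],      # cyclist next to them -> edge expected
                  [400, 300, 460, 380]]) # far-away motorbike -> isolated node
feats, edges = detections_to_graph(boxes.astype(float))
print(feats.shape, edges)    # (3, 6) [(0, 1)]
```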
zh
[CV-65] Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors
【速读】: 该论文试图解决从视频中估计深度和法线图(几何缓冲区)的问题,特别是在没有配对的视频-深度和视频-法线训练数据的情况下。解决方案的关键在于提出了一种名为“Buffer Anytime”的框架,该框架利用单图像先验结合时间一致性约束,通过零样本训练策略实现高质量的视频缓冲区估计。具体来说,该方法结合了基于光流平滑性的最先进的图像估计模型,并通过轻量级的时间注意力架构实现混合损失函数,从而在不依赖大规模标注视频数据集的情况下,显著提升时间一致性并保持准确性。
链接: https://arxiv.org/abs/2411.17249
作者: Zhengfei Kuang,Tianyuan Zhang,Kai Zhang,Hao Tan,Sai Bi,Yiwei Hu,Zexiang Xu,Milos Hasan,Gordon Wetzstein,Fujun Luan
关键词-EN: present Buffer Anytime, call geometric buffers, Buffer Anytime, normal maps, normal training data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We present Buffer Anytime, a framework for estimation of depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video–depth and video–normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints. Our zero-shot training strategy combines state-of-the-art image estimation models based on optical flow smoothness through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models like Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.
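A flow-based temporal consistency term of the kind described can be sketched as: warp the buffer predicted at frame t+1 back to frame t using the backward flow and penalize its difference from the frame-t prediction. Occlusion masking and the paper's full hybrid loss are omitted; this is a simplification under those assumptions:

```python
# Simplified temporal-consistency term: warp the t+1 prediction back to frame t
# using backward optical flow and compare with the frame-t prediction. Occlusion
# handling and the paper's exact hybrid loss are omitted; this is only a sketch.
import torch
import torch.nn.functional as F

def warp_with_flow(x, flow):
    """x: (B, C, H, W) buffer at t+1; flow: (B, 2, H, W) flow from t to t+1 (pixels)."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(x.device)          # (2, H, W)
    coords = base.unsqueeze(0) + flow                                  # sample locations in frame t+1
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,                # normalize to [-1, 1]
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

def temporal_consistency_loss(pred_t, pred_t1, backward_flow):
    return F.l1_loss(pred_t, warp_with_flow(pred_t1, backward_flow))

depth_t, depth_t1 = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
flow = torch.zeros(1, 2, 64, 64)          # zero flow: loss reduces to plain L1
print(temporal_consistency_loss(depth_t, depth_t1, flow))
```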
zh
[CV-66] DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model
【速读】: 该论文试图解决手语翻译 (Sign Language Translation, SLT) 中多样性不足的问题,特别是在处理词汇和句法歧义时。解决方案的关键在于提出了一个名为 DiffSLT 的新颖无注释词 (gloss-free) SLT 框架,该框架利用扩散模型 (diffusion model) 生成多样化的翻译,同时保留手语的语义。DiffSLT 通过将随机噪声转换为目标潜在表示,并根据输入视频的视觉特征进行条件化,从而实现这一目标。为了增强视觉特征的条件化效果,论文设计了引导融合模块 (Guidance Fusion Module),充分利用视觉特征的多层次时空信息。此外,论文还引入了 DiffSLT-P 变体,该变体结合了伪注释词 (pseudo-glosses) 和视觉特征,提供关键的文本引导并减少模态差距,从而显著提高翻译质量和多样性。
链接: https://arxiv.org/abs/2411.17248
作者: JiHwan Moon,Jihoon Park,Jungeun Kim,Jongseong Bae,Hyeongwoo Jeon,Ha Young Kim
关键词-EN: involves converting sign, converting sign language, involves converting, SLT, Sign language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverages a diffusion model, enabling diverse translations while preserving sign language semantics. DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of input video. To enhance visual conditioning, we design Guidance Fusion Module, which fully utilizes the multi-level spatiotemporal information of the visual features. We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap. As a result, DiffSLT and DiffSLT-P significantly improve diversity over previous gloss-free SLT methods and achieve state-of-the-art performance on two SLT datasets, thereby markedly improving translation quality.
zh
[CV-67] Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration
【速读】: 该论文试图解决单目相机内参(pin-hole camera intrinsic parameters)的估计问题,特别是在缺乏大量训练数据和手工假设的情况下,如何提高估计的泛化能力。解决方案的关键在于利用扩散模型(diffusion models)的强大先验知识,通过生成式方法来估计相机内参。具体来说,论文提出了一种新的图像表示方法,称为相机图像(Camera Image),它能够无损地编码相机内参,并将其无缝集成到扩散框架中。通过微调稳定扩散模型以从单张RGB输入图像生成相机图像,并结合RANSAC操作提取相机内参,从而实现单目相机校准。这种方法不仅显著提升了在多个3D视觉任务中的性能,还展示了在零样本度量深度估计、3D计量、姿态估计和稀疏视图重建等方面的广泛应用潜力。
链接: https://arxiv.org/abs/2411.17240
作者: Junyuan Deng,Wei Yin,Xiaoyang Guo,Qian Zhang,Xiaotao Hu,Weiqiang Ren,Xiaoxiao Long,Ping Tan
关键词-EN: camera, present DM-Calib, hole camera intrinsic, camera intrinsic parameters, camera intrinsics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we present DM-Calib, a diffusion-based approach for estimating pinhole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, resulting in poor generalization across diverse real-world images. Recent advancements in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence indicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Image, which losslessly encodes the numerical camera intrinsics and integrates seamlessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Image conditioned on an input image. By fine-tuning a stable diffusion model to generate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot metric depth estimation, 3D metrology, pose estimation and sparse-view reconstruction. Extensive experiments on multiple public datasets show that our approach significantly outperforms baselines and provides broad benefits to 3D vision tasks. Code is available at this https URL.
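The abstract does not define the Camera Image encoding, so the sketch below assumes one plausible choice: a per-pixel ray-direction map computed from the intrinsics, from which focal lengths and principal point can be recovered by simple least squares (the paper extracts intrinsics with a RANSAC operation). Treat the encoding itself and the recovery step as assumptions for illustration:

```python
# Assumed illustration only: encode pinhole intrinsics as a dense per-pixel ray map
# (one plausible "camera image") and recover fx, fy, cx, cy from it by least squares.
# The paper's actual Camera Image encoding and RANSAC-based extraction may differ.
import numpy as np

def intrinsics_to_ray_map(K, h, w):
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    ones = np.ones_like(xs)
    pix = np.stack([xs, ys, ones], axis=-1)                 # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                         # K^-1 [u, v, 1]^T per pixel
    return rays                                             # unnormalized ray directions

def recover_intrinsics(ray_map):
    h, w, _ = ray_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # ray_x = (u - cx) / fx  ->  u = fx * ray_x + cx : fit a line per axis
    A = np.stack([ray_map[..., 0].ravel(), np.ones(h * w)], axis=1)
    fx, cx = np.linalg.lstsq(A, xs.ravel().astype(np.float64), rcond=None)[0]
    A = np.stack([ray_map[..., 1].ravel(), np.ones(h * w)], axis=1)
    fy, cy = np.linalg.lstsq(A, ys.ravel().astype(np.float64), rcond=None)[0]
    return fx, fy, cx, cy

K = np.array([[800.0, 0, 320], [0, 790.0, 240], [0, 0, 1]])
ray_map = intrinsics_to_ray_map(K, 480, 640)
print(recover_intrinsics(ray_map))   # ~ (800, 790, 320, 240)
```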
zh
[CV-68] Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment
【速读】: 该论文试图解决多模态大语言模型(MLLMs)在图像质量评估(IQA)中依赖于通用上下文描述,导致细粒度质量评估受限的问题。解决方案的关键在于引入了一种新的图像质量评估任务范式,即grounding-IQA。该范式通过将多模态的指称和定位与IQA相结合,实现了更细粒度的质量感知。具体来说,grounding-IQA包括两个子任务:grounding-IQA-description(GIQA-DES)和视觉问答(GIQA-VQA)。GIQA-DES涉及带有精确位置描述的详细描述(如边界框),而GIQA-VQA则专注于局部区域的质量问答。为实现这一范式,论文构建了相应的数据集GIQA-160K,并通过自动化标注流水线进行构建。此外,还设计了GIQA-Bench基准,从描述质量、VQA准确性和定位精度三个方面全面评估模型在grounding-IQA任务中的表现。实验结果表明,该任务范式、数据集和基准有助于推动更细粒度的IQA应用。
链接: https://arxiv.org/abs/2411.17237
作者: Zheng Chen,Xun Zhang,Wenbo Li,Renjing Pei,Fenglong Song,Xiongkuo Min,Xiaohong Liu,Xin Yuan,Yong Guo,Yulun Zhang
关键词-EN: multimodal large language, natural language descriptions, large language models, enables the evaluation, large language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL
点击查看摘要
Abstract:The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark comprehensively evaluates the model grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate the more fine-grained IQA application. Code: this https URL.
zh
[CV-69] MLI-NeRF: Multi-Light Intrinsic-Aware Neural Radiance Fields
【速读】: 该论文试图解决现有方法在提取内在图像成分(如反射率和阴影)时,主要依赖于统计先验,难以处理复杂真实世界数据的问题。解决方案的关键在于提出了MLI-NeRF(Multiple Light information in Intrinsic-aware Neural Radiance Fields),通过整合不同光源位置提供的场景信息与多视角信息,生成反射率和阴影的伪标签图像,从而在不依赖真实标签数据的情况下指导内在图像分解。该方法引入了直接的监督机制,确保了在多种场景类型中的鲁棒性,并在合成和真实世界数据集上均优于现有的最先进方法。
链接: https://arxiv.org/abs/2411.17235
作者: Yixiong Yang,Shilin Hu,Haoyu Wu,Ramon Baldrich,Dimitris Samaras,Maria Vanrell
关键词-EN: Current methods, primarily rely, statistical priors, rely on statistical
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted paper for the International Conference on 3D Vision 2025. Project page: this https URL
点击查看摘要
Abstract:Current methods for extracting intrinsic image components, such as reflectance and shading, primarily rely on statistical priors. These methods focus mainly on simple synthetic scenes and isolated objects and struggle to perform well on challenging real-world data. To address this issue, we propose MLI-NeRF, which integrates Multiple Light information in Intrinsic-aware Neural Radiance Fields. By leveraging scene information provided by different light source positions complementing the multi-view information, we generate pseudo-label images for reflectance and shading to guide intrinsic image decomposition without the need for ground truth data. Our method introduces straightforward supervision for intrinsic component separation and ensures robustness across diverse scene types. We validate our approach on both synthetic and real-world datasets, outperforming existing state-of-the-art methods. Additionally, we demonstrate its applicability to various image editing tasks. The code and data are publicly available.
zh
[CV-70] MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers
【速读】: 该论文试图解决在恶劣天气条件下捕获的图像恢复问题,特别是现有方法通常只能处理特定类型的天气退化,而在实际场景中可能遇到多种天气退化(如雨雪或雨雾天气)的情况。解决方案的关键是提出了一个多天气Transformer(MWFormer),这是一个整体视觉Transformer架构,旨在通过单一的统一架构解决多种天气引起的退化问题。MWFormer利用超网络(hyper-networks)和特征线性调制块(feature-wise linear modulation blocks),通过同一组学习参数来恢复由不同天气类型引起的图像退化。此外,通过对比学习训练的辅助网络提取内容无关的、畸变感知特征嵌入,以高效表示预测的天气类型,指导图像恢复Transformer自适应调制其参数,进行局部和全局特征处理,以应对多种可能的天气情况。MWFormer还提供了一种新颖的调谐方式,在应用时可以调整为单一类型的天气恢复或混合天气恢复,而无需重新训练,从而提供比现有方法更大的可控性。
链接: https://arxiv.org/abs/2411.17226
作者: Ruoxi Zhu,Zhengzhong Tu,Jiaming Liu,Alan C. Bovik,Yibo Fan
关键词-EN: Restoring images captured, adverse weather conditions, Restoring images, captured under adverse, fundamental task
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Image Processing. The code is available at: this https URL
点击查看摘要
Abstract:Restoring images captured under adverse weather conditions is a fundamental task for many computer vision applications. However, most existing weather restoration approaches are only capable of handling a specific type of degradation, which is often insufficient in real-world scenarios, such as rainy-snowy or rainy-hazy weather. Towards being able to address these situations, we propose a multi-weather Transformer, or MWFormer for short, which is a holistic vision Transformer that aims to solve multiple weather-induced degradations using a single, unified architecture. MWFormer uses hyper-networks and feature-wise linear modulation blocks to restore images degraded by various weather types using the same set of learned parameters. We first employ contrastive learning to train an auxiliary network that extracts content-independent, distortion-aware feature embeddings that efficiently represent predicted weather types, of which more than one may occur. Guided by these weather-informed predictions, the image restoration Transformer adaptively modulates its parameters to conduct both local and global feature processing, in response to multiple possible weather. Moreover, MWFormer allows for a novel way of tuning, during application, to either a single type of weather restoration or to hybrid weather restoration without any retraining, offering greater controllability than existing methods. Our experimental results on multi-weather restoration benchmarks show that MWFormer achieves significant performance improvements compared to existing state-of-the-art methods, without requiring much computational cost. Moreover, we demonstrate that our methodology of using hyper-networks can be integrated into various network architectures to further boost their performance. The code is available at: this https URL
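The hyper-network plus feature-wise linear modulation idea can be illustrated with a minimal block: a weather embedding predicts per-channel scale and shift that are applied to image features. Layer sizes and dimensions are placeholders; MWFormer itself applies this kind of modulation throughout a much larger restoration Transformer:

```python
# Minimal FiLM block driven by a hyper-network: a weather embedding predicts the
# per-channel scale and shift applied to image features. Dimensions are placeholders.
import torch
import torch.nn as nn

class HyperFiLM(nn.Module):
    def __init__(self, feat_channels=64, weather_dim=128):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(weather_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * feat_channels),      # -> concatenated (gamma, beta)
        )

    def forward(self, feats, weather_embed):
        # feats: (B, C, H, W); weather_embed: (B, weather_dim) from an auxiliary network
        gamma, beta = self.hyper(weather_embed).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # broadcast over spatial dims
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * feats + beta           # feature-wise linear modulation

block = HyperFiLM()
out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 128))
print(out.shape)    # torch.Size([2, 64, 32, 32])
```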
zh
[CV-71] DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting
【速读】: 该论文试图解决在图像编辑中,如何在保持插入对象的身份(identity)的同时,确保其可编辑性(editability)的问题。解决方案的关键在于引入DreamMix,这是一个基于扩散模型的生成式模型,能够在用户指定位置插入目标对象,并同时支持基于文本的属性修改。具体来说,论文提出了一个解耦的局部-全局修复框架,以平衡局部对象插入的精确性和全局视觉一致性。此外,还引入了属性解耦机制(Attribute Decoupling Mechanism, ADM)和文本属性替换模块(Textual Attribute Substitution, TAS),以提高基于文本的属性指导的多样性和区分能力。通过这些创新,DreamMix在对象插入、属性编辑和小对象修复等多种应用场景中,有效地平衡了身份保持和属性可编辑性。
链接: https://arxiv.org/abs/2411.17223
作者: Yicheng Yang,Pengxiang Li,Lu Zhang,Liqian Ma,Ping Hu,Siyu Du,Yunzhi Zhuge,Xu Jia,Huchuan Lu
关键词-EN: Subject-driven image inpainting, alongside recent advancements, Subject-driven image, image editing alongside, editing alongside recent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Subject-driven image inpainting has emerged as a popular task in image editing alongside recent advancements in diffusion models. Previous methods primarily focus on identity preservation but struggle to maintain the editability of inserted objects. In response, this paper introduces DreamMix, a diffusion-based generative model adept at inserting target objects into given scenes at user-specified locations while concurrently enabling arbitrary text-driven modifications to their attributes. In particular, we leverage advanced foundational inpainting models and introduce a disentangled local-global inpainting framework to balance precise local object insertion with effective global visual coherence. Additionally, we propose an Attribute Decoupling Mechanism (ADM) and a Textual Attribute Substitution (TAS) module to improve the diversity and discriminative capability of the text-based attribute guidance, respectively. Extensive experiments demonstrate that DreamMix effectively balances identity preservation and attribute editability across various application scenarios, including object insertion, attribute editing, and small object inpainting. Our code is publicly available at this https URL.
zh
[CV-72] AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
【速读】: 该论文试图解决针对人工智能生成视频(AIGVs)的视觉质量评估(VQA)问题,特别是由于AIGVs中存在的独特失真(如不真实的物体、不自然的运动或视觉元素不一致)导致现有VQA模型难以准确评估其感知质量的问题。解决方案的关键在于构建了一个大规模数据集AIGVQA-DB,包含36,576个由15种先进文本到视频模型生成的AIGVs,并通过系统化的标注流程收集了370k专家评分。基于此数据集,论文提出了AIGV-Assessor模型,该模型利用时空特征和大型多模态模型(LMM)框架,能够捕捉AIGVs的复杂质量属性,从而准确预测视频质量评分和视频对偏好,显著超越现有的评分或评估方法。
链接: https://arxiv.org/abs/2411.17221
作者: Jiarui Wang,Huiyu Duan,Guangtao Zhai,Juntong Wang,Xiongkuo Min
关键词-EN: artificial intelligence generated, large multimodal models, models designed specifically, rapid advancement, rapid expansion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The rapid advancement of large multimodal models (LMMs) has led to the rapid expansion of artificial intelligence generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in accurately assessing the perceptual quality of AIGVs due to the presence of unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, a systematic annotation pipeline including scoring and ranking processes is devised, which collects 370k expert ratings to date. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs, thereby accurately predicting precise video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods in terms of multiple perceptual quality dimensions.
zh
[CV-73] Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning
【速读】: 该论文试图解决在工业场景中应用Segment Anything Model (SAM)进行异常分割时遇到的领域偏移问题。解决方案的关键在于提出了一种名为Self-Perception Tuning (SPT)的新方法,该方法通过自绘制调优策略生成异常掩码的初始粗略草图,并结合视觉关系感知适配器来增强对掩码生成过程中判别性关系信息的感知能力。SPT方法的核心在于通过两阶段的调优过程(初始草图生成和细化)来提升SAM在异常图像上的感知能力,从而在多个基准数据集上显著优于基线方法。
链接: https://arxiv.org/abs/2411.17217
作者: Hui-Yue Yang,Hui Chen,Ao Wang,Kai Chen,Zijia Lin,Yongliang Tang,Pengcheng Gao,Yuming Quan,Jungong Han,Guiguang Ding
关键词-EN: impressive generalization ability, made great progress, segmentation tasks due, generalization ability, made great
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. Parameter-Efficient Fine-Tuning (PEFT) offers a promising solution, but it may yield suboptimal performance by not adequately addressing the perception challenges during adaptation to anomaly images. In this paper, we propose a novel Self-Perception Tuning (SPT) method, aiming to enhance SAM’s perception capability for anomaly segmentation. The SPT method incorporates a self-drafting tuning strategy, which generates an initial coarse draft of the anomaly mask, followed by a refinement process. Additionally, a visual-relation-aware adapter is introduced to improve the perception of discriminative relational information for mask generation. Extensive experimental results on several benchmark datasets demonstrate that our SPT method can significantly outperform baseline methods, validating its effectiveness. Models and codes will be available online.
zh
[CV-74] MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution
【速读】: 该论文试图解决图像超分辨率 (SR) 任务中,传统Transformer架构在扩大自注意力窗口时面临的计算需求显著增加和固定窗口大小限制有效感受野及特征多样性的问题。解决方案的关键在于引入多范围注意力Transformer (Multi-Range Attention Transformer, MAT),该模型通过结合膨胀操作 (dilation operation) 和自注意力机制,实现多范围注意力 (MA) 和稀疏多范围注意力 (SMA),从而高效捕捉区域和稀疏全局特征。此外,MAT结合局部特征提取,能够有效捕捉不同空间范围的依赖关系,提升特征表示的多样性和有效性。论文还提出了MSConvStar模块,进一步增强模型的多范围表示学习能力。实验结果表明,MAT在性能和效率上均优于现有的最先进SR模型。
链接: https://arxiv.org/abs/2411.17214
作者: Chengxing Xie,Xiaoming Zhang,Kai Zhang,Linze Li,Yuqian Fu,Biao Gong,Tianrui Li
关键词-EN: Recent advances, image super-resolution, advances in image, Transformer architectures, significantly benefited
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in image super-resolution (SR) have significantly benefited from the incorporation of Transformer architectures. However, conventional techniques aimed at enlarging the self-attention window to capture broader contexts come with inherent drawbacks, especially the significantly increased computational demands. Moreover, the feature perception within a fixed-size window of existing models restricts the effective receptive fields and the intermediate feature diversity. This study demonstrates that a flexible integration of attention across diverse spatial extents can yield significant performance enhancements. In line with this insight, we introduce the Multi-Range Attention Transformer (MAT), tailored for SR tasks. MAT leverages the computational advantages inherent in the dilation operation, in conjunction with the self-attention mechanism, to facilitate both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. Further coupled with local feature extraction, MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations. We also introduce the MSConvStar module, which augments the model’s ability for multi-range representation learning. Comprehensive experiments show that our MAT exhibits superior performance to existing state-of-the-art SR models with remarkable efficiency (~3.3x faster than SRFormer-light).
zh
[CV-75] Scaling nnU-Net for CBCT Segmentation
【速读】: 该论文旨在解决在锥束计算机断层扫描(Cone Beam Computed Tomography, CBCT)图像上进行多结构分割的问题,特别是在ToothFairy2挑战赛的范围内。解决方案的关键在于对nnU-Net框架的扩展和优化,具体包括调整补丁大小(patch size)、网络拓扑结构(network topology)以及数据增强策略(data augmentation strategies),以应对牙科CBCT图像的独特挑战。这些改进使得模型在测试集上达到了0.9253的平均Dice系数和18.472的HD95,从而在ToothFairy2挑战赛中获得了第一名。
链接: https://arxiv.org/abs/2411.17213
作者: Fabian Isensee,Yannick Kirchhoff,Lars Kraemer,Maximilian Rokuss,Constantin Ulrich,Klaus H. Maier-Hein
关键词-EN: Beam Computed Tomography, Cone Beam Computed, Computed Tomography, Cone Beam, Beam Computed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Fabian Isensee and Yannick Kirchhoff contributed equally
点击查看摘要
Abstract:This paper presents our approach to scaling the nnU-Net framework for multi-structure segmentation on Cone Beam Computed Tomography (CBCT) images, specifically in the scope of the ToothFairy2 Challenge. We leveraged the nnU-Net ResEnc L model, introducing key modifications to patch size, network topology, and data augmentation strategies to address the unique challenges of dental CBCT imaging. Our method achieved a mean Dice coefficient of 0.9253 and HD95 of 18.472 on the test set, securing a mean rank of 4.6 and with it the first place in the ToothFairy2 challenge. The source code is publicly available, encouraging further research and development in the field.
zh
[CV-76] LampMark: Proactive Deepfake Detection via Training-Free Landmark Perceptual Watermarks ACM-MM2024
【速读】: 该论文试图解决深度伪造(Deepfake)面部操作带来的隐私威胁问题,特别是在面对超现实合成面部图像时,现有被动检测算法普遍存在的泛化性挑战。解决方案的关键在于提出了一种主动的深度伪造检测方法,即引入了一种新颖的无训练标志感知水印(landmark perceptual watermark, LampMark)。该方法通过分析深度伪造操作的结构敏感特性,设计了一个从面部标志(facial landmarks)到二进制标志感知水印的安全保密转换流程。随后,提出了一种端到端的水印框架,能够在不引人注意的情况下,稳健地嵌入和提取受保护图像的水印。基于水印恢复的高准确性,通过评估嫌疑图像与内容匹配的标志感知水印之间的连贯性来实现深度伪造检测。实验结果表明,该方法在跨数据集和跨操作场景下的水印恢复和深度伪造检测方面优于现有最先进的方法。
链接: https://arxiv.org/abs/2411.17209
作者: Tianyi Wang,Mengxiao Huang,Harry Cheng,Xiao Zhang,Zhiqi Shen
关键词-EN: posing privacy threats, garnered significant public, significant public attention, public attention due, enhancing human experiences
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACM MM 2024
点击查看摘要
Abstract:Deepfake facial manipulation has garnered significant public attention due to its impacts on enhancing human experiences and posing privacy threats. Despite numerous passive algorithms that have been attempted to thwart malicious Deepfake attacks, they mostly struggle with the generalizability challenge when confronted with hyper-realistic synthetic facial images. To tackle the problem, this paper proposes a proactive Deepfake detection approach by introducing a novel training-free landmark perceptual watermark, LampMark for short. We first analyze the structure-sensitive characteristics of Deepfake manipulations and devise a secure and confidential transformation pipeline from the structural representations, i.e. facial landmarks, to binary landmark perceptual watermarks. Subsequently, we present an end-to-end watermarking framework that imperceptibly and robustly embeds and extracts watermarks concerning the images to be protected. Relying on promising watermark recovery accuracies, Deepfake detection is accomplished by assessing the consistency between the content-matched landmark perceptual watermark and the robustly recovered watermark of the suspect image. Experimental results demonstrate the superior performance of our approach in watermark recovery and Deepfake detection compared to state-of-the-art methods across in-dataset, cross-dataset, and cross-manipulation scenarios.
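The landmark-to-watermark transformation is described as secure and confidential, so the sketch below substitutes a simple keyed hash of quantized landmark coordinates and checks consistency via bit accuracy. Both the hashing scheme and the 0.9 threshold are stand-ins for illustration, not LampMark's actual pipeline:

```python
# Stand-in sketch: derive a binary "watermark" from facial landmarks with a keyed
# hash and check consistency by bit accuracy. LampMark's real transformation is a
# dedicated secure pipeline; this only illustrates the matching logic.
import hashlib
import numpy as np

def landmarks_to_bits(landmarks: np.ndarray, key: bytes, n_bits: int = 128) -> np.ndarray:
    quantized = np.round(landmarks * 10).astype(np.int32).tobytes()   # coarse quantization
    digest = hashlib.sha256(key + quantized).digest()
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
    return bits[:n_bits]

def is_consistent(embedded_bits: np.ndarray, recovered_bits: np.ndarray, thresh: float = 0.9) -> bool:
    bit_acc = (embedded_bits == recovered_bits).mean()
    return bit_acc >= thresh     # low consistency suggests the face was manipulated

key = b"secret-key"
lm = np.random.rand(68, 2)                        # 68 facial landmarks (x, y)
wm = landmarks_to_bits(lm, key)
print(is_consistent(wm, wm))                      # True for an untouched image
tampered = wm.copy(); tampered[:40] ^= 1          # simulate a badly recovered watermark
print(is_consistent(wm, tampered))                # False -> flag as Deepfake
```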
zh
[CV-77] SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
【速读】: 该论文试图解决从无姿态的多视角图像中进行无需3D先验和姿态调整的通用3D重建问题。解决方案的关键在于提出了一种名为SelfSplat的新型3D高斯Splatting模型,该模型通过集成显式的3D表示与自监督的深度和姿态估计技术,实现了姿态精度和3D重建质量的相互提升。具体来说,模型包括一个匹配感知的姿态估计网络和一个深度细化模块,以增强多视角间的几何一致性,从而确保更准确和稳定的3D重建。实验结果表明,SelfSplat在外观和几何质量上均优于现有的最先进方法,并展示了强大的跨数据集泛化能力。
链接: https://arxiv.org/abs/2411.17190
作者: Gyeongjin Kang,Jisang Yoo,Jihyeon Park,Seungtae Nam,Hyeonsoo Im,Shin sangheon,Sangpil Kim,Eunbyung Park
关键词-EN: Gaussian Splatting model, Gaussian Splatting, unposed multi-view images, Splatting model designed, prior-free generalizable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:We propose SelfSplat, a novel 3D Gaussian Splatting model designed to perform pose-free and 3D prior-free generalizable 3D reconstruction from unposed multi-view images. These settings are inherently ill-posed due to the lack of ground-truth data, learned geometric information, and the need to achieve accurate 3D reconstruction without finetuning, making it difficult for conventional methods to achieve high-quality results. Our model addresses these challenges by effectively integrating explicit 3D representations with self-supervised depth and pose estimation techniques, resulting in reciprocal improvements in both pose accuracy and 3D reconstruction quality. Furthermore, we incorporate a matching-aware pose estimation network and a depth refinement module to enhance geometry consistency across views, ensuring more accurate and stable 3D reconstructions. To assess the performance of our method, we evaluated it on large-scale real-world datasets, including RealEstate10K, ACID, and DL3DV. SelfSplat achieves superior results over previous state-of-the-art methods in both appearance and geometry quality, and also demonstrates strong cross-dataset generalization capabilities. Extensive ablation studies and analysis also validate the effectiveness of our proposed methods. Code and pretrained models are available at this https URL
zh
[CV-78] PhysMotion: Physics-Grounded Dynamics From a Single Image
【速读】: 该论文试图解决从单张图像和输入条件(如施加的力和扭矩)生成高质量、物理上合理的3D视频的问题。解决方案的关键在于利用基于连续介质力学(continuum mechanics)的模拟作为先验知识,通过几何优化重建前馈3D高斯模型,并使用可微分的材料点方法(Material Point Method, MPM)结合连续介质力学中的弹塑性模型进行时间步进,从而在粗略细节水平上提供真实的动力学基础。随后,通过结合文本到图像(T2I)扩散模型与跨帧注意力机制对初始模拟进行细化,以增强几何和外观细节,并确保时空一致性,最终生成保留输入图像精细细节的物理上合理的视频。
链接: https://arxiv.org/abs/2411.17189
作者: Xiyang Tan,Ying Jiang,Xuan Li,Zeshun Zong,Tianyi Xie,Yin Yang,Chenfanfu Jiang
关键词-EN: leverages principled physics-based, principled physics-based simulations, physically plausible video, plausible video generation, physically plausible
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:We introduce PhysMotion, a novel framework that leverages principled physics-based simulations to guide intermediate 3D representations generated from a single image and input conditions (e.g., applied force and torque), producing high-quality, physically plausible video generation. By utilizing continuum mechanics-based simulations as prior knowledge, our approach addresses the limitations of traditional data-driven generative models and results in more consistent, physically plausible motions. Our framework begins by reconstructing a feed-forward 3D Gaussian from a single image through geometry optimization. This representation is then time-stepped using a differentiable Material Point Method (MPM) with continuum mechanics-based elastoplasticity models, which provides a strong foundation for realistic dynamics, albeit at a coarse level of detail. To enhance the geometry and appearance and ensure spatiotemporal consistency, we refine the initial simulation using a text-to-image (T2I) diffusion model with cross-frame attention, resulting in a physically plausible video that retains intricate details comparable to the input image. We conduct comprehensive qualitative and quantitative evaluations to validate the efficacy of our method. Our project page is available at: this https URL.
zh
[CV-79] LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization
【速读】: 该论文试图解决视觉自回归模型(Visual Autoregressive, VAR)在图像生成任务中计算资源需求过高的问题,特别是在资源受限的设备上应用受限的问题。解决方案的关键在于通过分析VAR模型在注意力图(attention map)、使用无分类器引导时的注意力输出(attention outputs when using classifier free guidance)以及数据精度(data precision)三个维度上的冗余,提出了高效的注意力机制和低比特量化方法。这些方法在几乎不影响模型性能(FID增加小于0.056)的情况下,实现了注意力计算减少85.2%、总体内存减少50%以及延迟减少1.5倍的显著效果。此外,论文还开发了无需训练的压缩技术,确保了这些优化方法在实际部署中的可行性和效率提升。
链接: https://arxiv.org/abs/2411.17178
作者: Rui Xie,Tianchen Zhao,Zhihang Yuan,Rui Wan,Wenxi Gao,Zhenhua Zhu,Xuefei Ning,Yu Wang
关键词-EN: offering competitive potential, Visual Autoregressive, visual generation models, offering competitive, AR-based visual generation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual Autoregressive (VAR) has emerged as a promising approach in image generation, offering competitive potential and performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we conducted analysis and identified significant redundancy in three dimensions of the VAR model: (1) the attention map, (2) the attention outputs when using classifier-free guidance, and (3) the data precision. Correspondingly, we proposed an efficient attention mechanism and a low-bit quantization method to enhance the efficiency of VAR models while maintaining performance. With negligible performance loss (less than 0.056 FID increase), we could achieve 85.2% reduction in attention computation, 50% reduction in overall memory and 1.5x latency reduction. To ensure deployment feasibility, we developed efficient training-free compression techniques and analyzed the deployment feasibility and efficiency gain of each technique.
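以下给出一个与论文思路相关的最小示意:用 PyTorch 实现对称均匀量化,用于直观感受将注意力输出等张量降到 8/4 bit 时引入的精度损失。这只是通用的低比特量化草图,并非 LiteVAR 的具体量化器;张量形状与比特宽度均为示意性假设。

```python
import torch

def quantize_symmetric(x: torch.Tensor, num_bits: int = 8):
    """Symmetric uniform quantization: round to signed integers, then dequantize.

    Returns the dequantized tensor and its scale so callers can measure the
    precision loss introduced by low-bit storage (the kind of degradation the
    paper tracks via FID).
    """
    qmax = 2 ** (num_bits - 1) - 1              # e.g., 127 for 8-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale, scale

# Illustrative "attention output" tensor: (batch, tokens, channels)
x = torch.randn(2, 16, 64)
for bits in (8, 4):
    x_q, _ = quantize_symmetric(x, bits)
    print(f"{bits}-bit mean abs error: {(x - x_q).abs().mean().item():.4f}")
```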
zh
[CV-80] ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
【速读】: 该论文试图解决文本到图像生成模型(T2I)在实际应用中用户面临的反复试验问题,这一问题源于用户在制作合适提示、选择适当模型和配置特定参数等繁琐步骤中的复杂性和不确定性。论文提出的解决方案是自动化T2I生成,即通过自由风格的聊天方式描述需求,系统自动完成这些繁琐步骤。解决方案的关键在于引入了一个名为ChatGenBench的新基准,用于全面评估自动T2I模型在各个步骤中的表现,并提出了一个多阶段进化策略ChatGen-Evo,逐步赋予模型必要的自动化技能,从而显著提升模型在步骤准确性和图像质量方面的性能。
链接: https://arxiv.org/abs/2411.17176
作者: Chengyou Jia,Changliang Xia,Zhuohang Dang,Weijia Wu,Hangwei Qian,Minnan Luo
关键词-EN: practical scenarios, significant advancements, Automatic, models, generative models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code, and models will be available in this https URL
zh
[CV-81] GMFlow: Global Motion-Guided Recurrent Flow for 6D Object Pose Estimation
【速读】: 该论文试图解决6D物体姿态估计中的遮挡和部分可见性问题,特别是在机器人感知和精确操作中面临的挑战。解决方案的关键在于提出了一个名为GMFlow的全局运动引导的循环流估计方法。GMFlow通过利用物体的结构信息,将可见部分的刚体运动扩展到不可见区域,从而克服了由遮挡或缺失部分引起的局部模糊性。具体来说,该方法通过线性注意力机制捕获全局上下文信息,并引导局部运动信息生成全局运动估计。此外,在流迭代过程中引入物体形状约束,使得流估计更适合姿态估计场景。实验结果表明,GMFlow在LM-O和YCB-V数据集上不仅提高了精度,而且在计算效率上也表现出色。
链接: https://arxiv.org/abs/2411.17174
作者: Xin Liu,Shibei Xue,Dezong Zhao,Shan Ma,Min Jiang
关键词-EN: precise manipulation, crucial for robotic, robotic perception, perception and precise, pose estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:6D object pose estimation is crucial for robotic perception and precise manipulation. Occlusion and incomplete object visibility are common challenges in this task, but existing pose refinement methods often struggle to handle these issues effectively. To tackle this problem, we propose a global motion-guided recurrent flow estimation method called GMFlow for pose estimation. GMFlow overcomes local ambiguities caused by occlusion or missing parts by seeking global explanations. We leverage the object’s structural information to extend the motion of visible parts of the rigid body to its invisible regions. Specifically, we capture global contextual information through a linear attention mechanism and guide local motion information to generate global motion estimates. Furthermore, we introduce object shape constraints in the flow iteration process, making flow estimation suitable for pose estimation scenarios. Experiments on the LM-O and YCB-V datasets demonstrate that our method outperforms existing techniques in accuracy while maintaining competitive computational efficiency.
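摘要提到用线性注意力机制捕获全局上下文。作为参考,下面给出一个标准的核化线性注意力 (kernelized linear attention) 的 PyTorch 草图:先聚合 K、V 再与 Q 相乘,使计算量随 token 数线性增长。该实现仅为通用示意,特征映射 elu(x)+1 等细节是假设,并不代表 GMFlow 的具体设计。

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernelized linear attention: with feature map phi(x) = elu(x) + 1,
    K and V are aggregated once, so the cost is O(N) in the token count
    instead of the O(N^2) of softmax attention.

    q, k, v: (batch, tokens, dim)
    """
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)                       # sum_n phi(k_n) outer v_n
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps) # normalizer per query
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(2, 1024, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([2, 1024, 64])
```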
zh
[CV-82] MRIFE: A Mask-Recovering and Interactive-Feature-Enhancing Semantic Segmentation Network For Relic Landslide Detection
【速读】: 该论文试图解决古滑坡(relic landslide)在高分辨率遥感图像中进行语义分割时面临的两大挑战:目标视觉模糊问题(object visual blur problem)和小规模数据集问题(small-sized dataset problem)。解决方案的关键在于提出了一种名为掩膜恢复与交互特征增强(mask-recovering and interactive-feature-enhancing, MRIFE)的语义分割模型。具体来说,MRIFE模型通过对比学习和掩膜重建方法,结合局部显著特征增强,提升了目标与背景的区分能力及滑坡语义特征的表达能力。同时,采用双分支交互特征增强架构,丰富提取的特征并解决视觉模糊问题。此外,引入自蒸馏学习(self-distillation learning)利用样本内外的特征多样性进行对比学习,提高样本利用率,加速模型收敛,有效应对小规模数据集问题。实验结果表明,MRIFE模型在古滑坡检测任务中显著提升了性能。
链接: https://arxiv.org/abs/2411.17167
作者: Juefei He,Yuexing Peng,Wei Li,Junchuan Yu,Daqing Ge,Wei Xiang
关键词-EN: hazardous geological phenomenon, Relic landslide, relic landslide detection, relic landslide dataset, reliable relic landslide
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Relic landslide, formed over a long period, possess the potential for reactivation, making them a hazardous geological phenomenon. While reliable relic landslide detection benefits the effective monitoring and prevention of landslide disaster, semantic segmentation using high-resolution remote sensing images for relic landslides faces many challenges, including the object visual blur problem, due to the changes of appearance caused by prolonged natural evolution and human activities, and the small-sized dataset problem, due to difficulty in recognizing and labelling the samples. To address these challenges, a semantic segmentation model, termed mask-recovering and interactive-feature-enhancing (MRIFE), is proposed for more efficient feature extraction and separation. Specifically, a contrastive learning and mask reconstruction method with locally significant feature enhancement is proposed to improve the ability to distinguish between the target and background and represent landslide semantic features. Meanwhile, a dual-branch interactive feature enhancement architecture is used to enrich the extracted features and address the issue of visual ambiguity. Self-distillation learning is introduced to leverage the feature diversity both within and between samples for contrastive learning, improving sample utilization, accelerating model convergence, and effectively addressing the problem of the small-sized dataset. The proposed MRIFE is evaluated on a real relic landslide dataset, and experimental results show that it greatly improves the performance of relic landslide detection. For the semantic segmentation task, compared to the baseline, the precision increases from 0.4226 to 0.5347, the mean intersection over union (IoU) increases from 0.6405 to 0.6680, the landslide IoU increases from 0.3381 to 0.3934, and the F1-score increases from 0.5054 to 0.5646.
zh
[CV-83] OSDFace: One-Step Diffusion Model for Face Restoration
【速读】: 该论文试图解决现有扩散模型在人脸修复中存在的计算效率低和生成图像不和谐、不真实、身份不一致的问题。解决方案的关键在于提出了一种名为 OSDFace 的新型一步扩散模型,其核心创新包括:1) 引入视觉表示嵌入器 (VRE),通过视觉分词器和向量量化字典生成视觉提示,以更好地捕捉和理解输入人脸的先验信息;2) 采用基于人脸识别的面部身份损失,确保生成图像的身份一致性;3) 结合生成对抗网络 (GAN) 作为指导模型,促进修复后的人脸与真实人脸之间的分布对齐。这些创新使得 OSDFace 在视觉质量和定量指标上均超越了当前最先进的方法,生成高保真、自然且身份一致的人脸图像。
链接: https://arxiv.org/abs/2411.17163
作者: Jingkai Wang,Jue Gong,Lin Zhang,Zheng Chen,Xing Liu,Hong Gu,Yutong Liu,Yulun Zhang,Xiaokang Yang
关键词-EN: demonstrated impressive performance, demonstrated impressive, impressive performance, face, face restoration
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures. The code and model will be available at this https URL
点击查看摘要
Abstract:Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject’s identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency. The code and model will be released at this https URL.
zh
[CV-84] Enhancing Lane Segment Perception and Topology Reasoning with Crowdsourcing Trajectory Priors
【速读】: 该论文试图解决自动驾驶中车道感知模型在利用多样化先验信息时面临的三个关键挑战:高质量先验信息的获取、先验信息与在线感知之间的对齐、以及高效集成。解决方案的关键在于从轨迹先验的新视角出发,通过提取Argoverse2运动预测数据集中的众包轨迹数据,并将其编码为栅格化热图和矢量化实例令牌,然后将这些先验信息以不同方式融入在线映射模型中。此外,为缓解先验信息与在线感知之间的对齐问题,设计了一个基于置信度的融合模块,在融合过程中考虑对齐因素。实验结果表明,该方法在OpenLane-V2数据集上的性能显著优于当前最先进的方法。
链接: https://arxiv.org/abs/2411.17161
作者: Peijin Jia,Ziang Luo,Tuopu Wen,Mengmeng Yang,Kun Jiang,Le Cui,Diange Yang
关键词-EN: provide autonomous vehicles, perception provide autonomous, segment perception provide, lane segment perception, driving scenarios
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In autonomous driving, recent advances in lane segment perception provide autonomous vehicles with a comprehensive understanding of driving scenarios. Moreover, incorporating prior information into such perception models represents an effective approach to ensure robustness and accuracy. However, utilizing diverse sources of prior information still faces three key challenges: the acquisition of high-quality prior information, alignment between prior and online perception, and efficient integration. To address these issues, we investigate prior augmentation from a novel perspective of trajectory priors. In this paper, we initially extract crowdsourcing trajectory data from the Argoverse2 motion forecasting dataset and encode the trajectory data into a rasterized heatmap and vectorized instance tokens, then we incorporate such prior information into the online mapping model in different ways. Besides, with the purpose of mitigating the misalignment between prior and online perception, we design a confidence-based fusion module that takes alignment into account during the fusion process. We conduct extensive experiments on the OpenLane-V2 dataset. The results indicate that our method significantly outperforms the current state-of-the-art methods.
zh
[CV-85] On-Road Object Importance Estimation: A New Dataset and A Model with Multi-Fold Top-Down Guidance
【速读】: 该论文试图解决道路对象重要性估计问题,利用从驾驶员视角捕捉的视频序列作为输入。解决方案的关键在于提出了一种融合多重自上而下指导因素与自下而上特征的模型。具体来说,该模型集成了三种自上而下的指导因素:驾驶员意图、语义上下文和交通规则,这些因素对于对象重要性估计至关重要,但现有方法未能同时考虑。通过这种集成,模型能够更有效地处理高度动态和多样化的交通场景,实验结果表明,该模型在平均精度(AP)上比最新方法提升了23.1%。
链接: https://arxiv.org/abs/2411.17152
作者: Zhixiong Nan,Yilong Chen,Tianfei Zhou,Tao Xiang
关键词-EN: utilizes video sequences, video sequences captured, object importance estimation, object importance, on-road object importance
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper addresses the problem of on-road object importance estimation, which utilizes video sequences captured from the driver’s perspective as the input. Although this problem is significant for safer and smarter driving systems, the exploration of this problem remains limited. On one hand, publicly-available large-scale datasets are scarce in the community. To address this dilemma, this paper contributes a new large-scale dataset named Traffic Object Importance (TOI). On the other hand, existing methods often only consider either bottom-up feature or single-fold guidance, leading to limitations in handling highly dynamic and diverse traffic scenarios. Different from existing methods, this paper proposes a model that integrates multi-fold top-down guidance with the bottom-up feature. Specifically, three kinds of top-down guidance factors (i.e., driver intention, semantic context, and traffic rule) are integrated into our model. These factors are important for object importance estimation, but none of the existing methods simultaneously consider them. To our knowledge, this paper proposes the first on-road object importance estimation model that fuses multi-fold top-down guidance factors with bottom-up feature. Extensive experiments demonstrate that our model outperforms state-of-the-art methods by large margins, achieving 23.1% Average Precision (AP) improvement compared with the recently proposed model (i.e., Goal).
zh
[CV-86] Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
【速读】: 该论文试图解决开放词汇语义分割 (Open-Vocabulary Semantic Segmentation, OVSS) 中复杂对象分割时缺乏对象级别上下文考虑的问题。解决方案的关键在于引入对象级别的上下文知识,通过将视觉基础模型的光谱驱动特征蒸馏到视觉编码器的注意力机制中,增强对象内部的一致性,从而形成单一对象掩码。此外,通过零样本对象存在概率来精炼文本嵌入,确保其与图像中特定对象的准确对齐。这种方法不仅提升了模型的分割精度,还增强了其在不同数据集上的泛化能力。
链接: https://arxiv.org/abs/2411.17150
作者: Chanyoung Kim,Dayun Ju,Woojung Han,Ming-Hsuan Yang,Seong Jae Hwang
关键词-EN: Open-Vocabulary Semantic Segmentation, Semantic Segmentation, Open-Vocabulary Semantic, recent vision-language models, learning schemes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet, a critical issue persists: lack of object-level context consideration when segmenting complex objects in the challenging environment of OVSS based on arbitrary query prompts. This oversight limits models’ ability to group semantically consistent elements within object and map them precisely to user-defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images. Specifically, our model enhances intra-object consistency by distilling spectral-driven features from vision foundation models into the attention mechanism of the visual encoder, enabling semantically coherent components to form a single object mask. Additionally, we refine the text embeddings with zero-shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object-level contextual knowledge, our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets.
zh
[CV-87] Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation
【速读】: 该论文试图解决多模态输入训练分割器时存在的单模态偏差问题,即多模态分割器过度依赖某些模态,导致在实际应用中其他模态缺失时性能下降。解决方案的关键在于引入了一种并行多模态学习策略来训练一个强大的教师模型,并通过多尺度表示空间中的跨模态和单模态知识蒸馏,将特征级知识从多模态传递到任意模态分割器,从而解决单模态偏差问题并避免对特定模态的过度依赖。此外,还提出了预测级模态无关的语义蒸馏,以实现分割的语义知识传递。
链接: https://arxiv.org/abs/2411.17141
作者: Xu Zheng,Haiwei Xue,Jialei Chen,Yibo Yan,Lutao Jiang,Yuanhuiyi Lyu,Kailun Yang,Linfeng Zhang,Xuming Hu
关键词-EN: practically challenging, inputs from multiple, multiple sensors, sensors to train, intuitively advantageous
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress
点击查看摘要
Abstract:Simultaneously using multimodal inputs from multiple sensors to train segmentors is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where multimodal segmentors over-rely on certain modalities, causing performance drops when others are missing, which is common in real-world applications. To this end, we develop the first framework for learning a robust segmentor that can handle any combination of visual modalities. Specifically, we first introduce a parallel multimodal learning strategy for learning a strong teacher. The cross-modal and unimodal distillation is then achieved in the multi-scale representation space by transferring the feature-level knowledge from multimodal to anymodal segmentors, aiming at addressing the unimodal bias and avoiding over-reliance on specific modalities. Moreover, a prediction-level, modality-agnostic semantic distillation is proposed to achieve semantic knowledge transfer for segmentation. Extensive experiments on both synthetic and real-world multi-sensor benchmarks demonstrate that our method achieves superior performance.
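摘要描述了把多模态教师的特征级知识蒸馏给任意模态学生。论文的具体损失形式未给出,下面是一个通用的「特征级 MSE + 预测级 KL」蒸馏目标的最小草图,温度与权重均为示意性假设,仅用于说明这类蒸馏的基本写法。

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat,
                      student_logits, teacher_logits,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Feature-level (MSE) plus prediction-level (KL on softened logits) distillation."""
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=1),
        F.softmax(teacher_logits.detach() / t, dim=1),
        reduction="batchmean",
    ) * (t * t)
    return alpha * feat_loss + (1.0 - alpha) * kl

# Dummy segmentation-shaped tensors: (batch, channels/classes, H, W)
s_feat, t_feat = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
s_log, t_log = torch.randn(2, 19, 32, 32), torch.randn(2, 19, 32, 32)
print(distillation_loss(s_feat, t_feat, s_log, t_log).item())
```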
zh
[CV-88] Crack Detection in Infrastructure Using Transfer Learning Spatial Attention and Genetic Algorithm Optimization
【速读】: 该论文试图解决基础设施(如道路、桥梁和建筑物)中裂缝检测的问题,传统的手工检查方法存在劳动强度大、主观性强和危险性高等缺点。解决方案的关键在于采用深度学习技术,特别是通过迁移学习(transfer learning)、空间注意力机制(spatial attention mechanisms)和遗传算法(GA)优化来提升检测精度。具体来说,论文利用预训练的ResNet50模型进行特征提取,减少了大规模训练数据的需求,并通过引入空间注意力层和基于遗传算法优化的自定义神经网络架构来进一步增强模型性能。实验结果表明,提出的Attention-ResNet50-GA模型在裂缝检测中表现优异,精度达到0.9967,F1分数为0.9983,显著优于传统方法,特别适用于标注数据稀缺的实际应用场景。
链接: https://arxiv.org/abs/2411.17140
作者: Feng Ding
关键词-EN: reduce costly repairs, Crack detection plays, including roads, costly repairs, plays a pivotal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Crack detection plays a pivotal role in the maintenance and safety of infrastructure, including roads, bridges, and buildings, as timely identification of structural damage can prevent accidents and reduce costly repairs. Traditionally, manual inspection has been the norm, but it is labor-intensive, subjective, and hazardous. This paper introduces an advanced approach for crack detection in infrastructure using deep learning, leveraging transfer learning, spatial attention mechanisms, and genetic algorithm (GA) optimization. To address the challenge of the inaccessibility of large amounts of data, we employ ResNet50 as a pre-trained model, utilizing its strong feature extraction capabilities while reducing the need for extensive training datasets. We enhance the model with a spatial attention layer as well as a customized neural network whose architecture was fine-tuned using a GA. A comprehensive case study demonstrates the effectiveness of the proposed Attention-ResNet50-GA model, achieving a precision of 0.9967 and an F1 score of 0.9983, outperforming conventional methods. The results highlight the model’s ability to accurately detect cracks in various conditions, making it highly suitable for real-world applications where large annotated datasets are scarce.
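下面给出「冻结的预训练 ResNet50 + 空间注意力层 + 分类头」的一个最小 PyTorch 草图,用于说明摘要中迁移学习与空间注意力的组合方式;其中空间注意力采用 CBAM 风格实现,分类头结构为示意性假设,GA 调优部分未复现,因此这并不是论文的原始网络。

```python
import torch
import torch.nn as nn
from torchvision import models

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: weight each location by a sigmoid map
    computed from channel-wise average and max pooling."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn

class CrackClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # downloads ImageNet weights on first use (transfer learning backbone)
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # keep conv stages only
        for p in self.features.parameters():
            p.requires_grad = False                                     # freeze the backbone
        self.attn = SpatialAttention()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(2048, num_classes))

    def forward(self, x):
        return self.head(self.attn(self.features(x)))

model = CrackClassifier()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2])
```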
zh
[CV-89] TechCoach: Towards Technical Keypoint-Aware Descriptive Action Coaching
【速读】: 该论文试图解决现有基于评分的行为评估方法在指导学习者掌握动作技能方面的不足,特别是在无法提供详细、可理解的反馈以指出哪些方面做得好以及哪些方面可以改进的问题。解决方案的关键在于提出了一个新的任务,称为描述性动作教练 (Descriptive Action Coaching, DAC),并构建了一个新的数据集 EE4D-DAC,该数据集通过基于大型语言模型 (LLM) 的标注流程,提供了层次化的教练评论,涵盖关键点和实例级别。论文还提出了一个新的框架 TechCoach,其核心是上下文感知的关键点推理器 (Context-aware Keypoint Reasoner),该推理器通过查询视觉上下文并受关键点级别教练评论的监督,学习关键点相关的质量表示。结合这些技术,统一的具有关键点感知的行为评估器 (Keypoint-aware Action Assessor) 能够提供包含质量评分的整体教练评论。通过这些创新,论文建立了一个新的 DAC 基准,并通过广泛的实验验证了其方法的有效性。
链接: https://arxiv.org/abs/2411.17130
作者: Yuan-Ming Li,An-Lan Wang,Kun-Yu Lin,Yu-Ming Tang,Ling-An Zeng,Jian-Fang Hu,Wei-Shi Zheng
关键词-EN: learner action execution, Descriptive Action Coaching, understandable feedback, action, learner action
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 12 figures
点击查看摘要
Abstract:To guide a learner to master the action skills, it is crucial for a coach to 1) reason through the learner’s action execution and technical keypoints, and 2) provide detailed, understandable feedback on what is done well and what can be improved. However, existing score-based action assessment methods are still far from this practical scenario. To bridge this gap, we investigate a new task termed Descriptive Action Coaching (DAC), which requires a model to provide detailed commentary on what is done well and what can be improved beyond a quality score from an action execution. To this end, we construct a new dataset named EE4D-DAC. With an LLM-based annotation pipeline, our dataset goes beyond the existing action assessment datasets by providing the hierarchical coaching commentary at both keypoint and instance levels. Furthermore, we propose TechCoach, a new framework that explicitly incorporates keypoint-level reasoning into the DAC process. Central to our method is the Context-aware Keypoint Reasoner, which enables TechCoach to learn keypoint-related quality representations by querying visual context under the supervision of keypoint-level coaching commentary. Prompted by the visual context and the keypoint-related quality representations, a unified Keypoint-aware Action Assessor is then employed to provide the overall coaching commentary together with the quality score. Combining all of these, we build a new benchmark for DAC and evaluate the effectiveness of our method through extensive experiments. Data and code will be publicly available.
zh
[CV-90] DOGE: Towards Versatile Visual Document Grounding and Referring
【速读】: 该论文试图解决多模态大语言模型(MLLMs)在视觉文档理解领域中,由于缺乏细粒度数据集和全面基准测试而导致的地基和指称能力不足的问题。解决方案的关键在于提出了DOcument Grounding and Eferring数据引擎(DOGE-Engine),该引擎能够生成两种高质量的细粒度文档数据:多粒度解析数据用于增强基本的文本定位和识别能力;以及指令调优数据用于在对话和推理过程中激活MLLM的地基和指称能力。此外,论文还构建了DOGE-Bench,涵盖了7种地基和指称任务,跨越3种文档类型(图表、海报、PDF文档),为细粒度文档理解提供了全面的评估。通过这些数据,论文开发了一个强大的基线模型DOGE,该模型能够在文档图像中准确地进行多粒度的文本指称和地基定位。
链接: https://arxiv.org/abs/2411.17125
作者: Yinan Zhou,Yuxin Chen,Haokun Lin,Shuyu Yang,Li Zhu,Zhongang Qi,Chen Ma,Ying Shan
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, flexible user interaction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 13 figures
点击查看摘要
Abstract:In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and rEferring data engine (DOGE-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data for enhancing fundamental text localization and recognition capabilities; and instruction-tuning data to activate MLLM’s grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGE-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing comprehensive evaluations for fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGE. This pioneering MLLM is capable of accurately referring and grounding texts at multiple granularities within document images. Our code, data, and model will be open-sourced for community development.
zh
[CV-91] Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text Images and Videos
【速读】: 该论文试图解决在线平台中广泛传播的仇恨言论、骚扰、有害及性内容和暴力等问题,其解决方案的关键在于利用大型语言模型(LLM)进行内容审核。论文评估了现有的基于LLM的内容审核解决方案,如OpenAI的审核模型和Llama-Guard3,并探讨了GPT、Gemini和Llama等最新LLM在识别不适当内容方面的能力。通过使用多种文本和视觉数据集进行评估和比较,结果表明LLM在检测敏感内容方面优于传统技术,能够实现更高的准确性和更低的误报率和漏报率。这突显了将LLM集成到网站、社交媒体平台和视频分享服务中以实现监管和内容审核的潜力。
链接: https://arxiv.org/abs/2411.17123
作者: Nouar AlDahoul,Myles Joshua Toledo Tan,Harishwar Reddy Kasireddy,Yasir Zaki
关键词-EN: provokes widespread concern, platforms presents substantial, presents substantial challenges, media platforms presents, widespread dissemination
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 55 pages, 16 figures
点击查看摘要
Abstract:The widespread dissemination of hate speech, harassment, harmful and sexual content, and violence across websites and media platforms presents substantial challenges and provokes widespread concern among different sectors of society. Governments, educators, and parents are often at odds with media platforms about how to regulate, control, and limit the spread of such content. Technologies for detecting and censoring the media contents are a key solution to addressing these challenges. Techniques from natural language processing and computer vision have been used widely to automatically identify and filter out sensitive content such as offensive languages, violence, nudity, and addiction in both text, images, and videos, enabling platforms to enforce content policies at scale. However, existing methods still have limitations in achieving high detection accuracy with fewer false positives and false negatives. Therefore, more sophisticated algorithms for understanding the context of both text and image may open rooms for improvement in content censorship to build a more efficient censorship system. In this paper, we evaluate existing LLM-based content moderation solutions such as OpenAI moderation model and Llama-Guard3 and study their capabilities to detect sensitive contents. Additionally, we explore recent LLMs such as GPT, Gemini, and Llama in identifying inappropriate contents across media outlets. Various textual and visual datasets like X tweets, Amazon reviews, news articles, human photos, cartoons, sketches, and violence videos have been utilized for evaluation and comparison. The results demonstrate that LLMs outperform traditional techniques by achieving higher accuracy and lower false positive and false negative rates. This highlights the potential to integrate LLMs into websites, social media platforms, and video-sharing services for regulatory and content moderation purposes.
zh
[CV-92] PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution
【速读】: 该论文试图解决基于扩散的图像超分辨率(SR)模型在单步去噪(OSD)过程中计算成本高和存储需求大的问题。解决方案的关键在于提出了一种新颖的后训练量化方法,称为PassionSR,该方法通过以下几个关键步骤实现:首先,简化OSD模型,去除CLIPEncoder,仅保留UNet和变分自编码器(VAE)两个核心组件;其次,引入可学习边界量化器(Learnable Boundary Quantizer, LBQ)和可学习等价变换(Learnable Equivalent Transformation, LET)来优化量化过程并操纵激活分布以提高量化效果;最后,设计分布式量化校准(Distributed Quantization Calibration, DQC)策略,以稳定量化参数的训练并实现快速收敛。实验结果表明,PassionSR在8-bit和6-bit量化下能够获得与全精度模型相当的视觉质量,并且在低比特量化方法中表现优异。
链接: https://arxiv.org/abs/2411.17106
作者: Libo Zhu,Jianze Li,Haotong Qin,Yulun Zhang,Yong Guo,Xiaokang Yang
关键词-EN: Diffusion-based image super-resolution, shown superior performance, multiple denoising steps, Diffusion-based image, shown superior
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
点击查看摘要
Abstract:Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even though the denoising step has been reduced to one, they require high computational costs and storage requirements, making it difficult for deployment on hardware devices. To address these issues, we propose a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR, PassionSR. First, we simplify OSD model to two core components, UNet and Variational Autoencoder (VAE) by removing the CLIPEncoder. Secondly, we propose Learnable Boundary Quantizer (LBQ) and Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR with 8-bit and 6-bit obtains comparable visual results with full-precision model. Moreover, our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be at this https URL.
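下面给出一个「可学习截断边界 + 直通估计器 (STE)」的量化器草图,用来说明可学习边界量化器 (LBQ) 这类做法的基本思想:激活先被截断到可训练的 [-b, b] 区间再做均匀量化,梯度经 STE 回传以校准边界。具体公式与初始化均为假设,并非 PassionSR 的官方实现。

```python
import torch
import torch.nn as nn

class LearnableBoundaryQuantizer(nn.Module):
    """Clip activations to a learnable range [-b, b], then quantize uniformly.

    A straight-through estimator passes gradients through the rounding step,
    so the boundary can be calibrated by backprop during post-training
    calibration (assumed formulation, not the paper's exact LBQ).
    """
    def __init__(self, num_bits: int = 8, init_bound: float = 1.0):
        super().__init__()
        self.num_bits = num_bits
        self.bound = nn.Parameter(torch.tensor(init_bound))

    def forward(self, x):
        b = self.bound.abs() + 1e-8
        levels = 2 ** (self.num_bits - 1) - 1
        x_c = torch.minimum(torch.maximum(x, -b), b)     # clip, differentiable w.r.t. b
        scale = b / levels
        q = torch.round(x_c / scale) * scale
        return x_c + (q - x_c).detach()                  # STE: forward uses q, backward uses x_c

quant = LearnableBoundaryQuantizer(num_bits=6)
x = torch.randn(4, 16)
loss = quant(x).pow(2).sum()
loss.backward()
print(quant.bound.grad)   # the boundary receives a gradient and can be calibrated
```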
zh
[CV-93] OmegaSFormer: Dual-Modal Omega-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization Extraction
【速读】: 该论文试图解决梯田(Terraced Field)从遥感影像中提取的问题,这是土壤和水保持(Soil and Water Conservation, SWC)监测和评估的基础。解决方案的关键在于提出了一种新型的双模态Ω形超分辨率Transformer网络(dual-modal \Omega-like super-resolution Transformer network),其核心优势包括:(1)通过在编码器每一步融合原始高分辨率特征与下采样特征,并利用多头注意力机制(multi-head attention mechanism),减少传统多尺度下采样编码器的边缘分割误差;(2)通过提出Ω形网络结构,充分整合光谱和地形数据的高级特征,形成跨尺度的超分辨率特征,从而提高梯田提取的准确性;(3)验证了跨模态和跨尺度(即遥感影像与DEM之间不一致的空间分辨率)超分辨率特征提取的最佳融合方案;(4)通过粗到细的空间拓扑语义关系优化(Spatial Topological Semantic Relationship Optimization, STSRO)分割策略,缓解分割边缘像素的不确定性;(5)利用轮廓振动神经网络(contour vibration neural network)持续优化参数,并从语义分割结果中迭代矢量化梯田。此外,论文首次创建了一个基于深度学习的梯田提取数据集(DMRVD),覆盖中国四个省份的九个研究区域,总面积达22441平方公里。
链接: https://arxiv.org/abs/2411.17088
作者: Chang Li,Yu Wang,Ce Zhang,Yongjun Zhang
关键词-EN: significant engineering practice, remotely sensed imagery, Terraced field extraction, Terraced field, water conservation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Terraced field is a significant engineering practice for soil and water conservation (SWC). Terraced field extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual-modal Ω-like super-resolution Transformer network for intelligent TFVE, offering the following advantages: (1) reducing edge segmentation error from conventional multi-scale downsampling encoder, through fusing original high-resolution features with downsampling features at each step of encoder and leveraging a multi-head attention mechanism; (2) improving the accuracy of TFVE by proposing an Ω-like network structure, which fully integrates rich high-level features from both spectral and terrain data to form cross-scale super-resolution features; (3) validating an optimal fusion scheme for cross-modal and cross-scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super-resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels by a coarse-to-fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from semantic segmentation results. Moreover, a DMRVD for deep-learning-based TFVE was created for the first time, which covers nine study areas in four provinces of China, with a total coverage area of 22441 square kilometers. To assess the performance of ΩSFormer, classic and SOTA networks were compared. The mIOU of ΩSFormer has improved by 0.165, 0.297 and 0.128 respectively, when compared with best accuracy single-modal remotely sensed imagery, single-modal DEM and dual-modal result.
zh
[CV-94] Contrastive CFG: Improving CFG in Diffusion Models by Contrasting Positive and Negative Concepts
【速读】: 该论文试图解决在条件扩散模型采样中,使用负向的分类无指导(Classifier-Free Guidance, CFG)项来过滤不想要特征时,导致样本偏离边际分布的问题。解决方案的关键在于引入对比损失(contrastive loss)来增强负向CFG指导。具体来说,通过对比损失,指导项能够根据给定条件调整去噪方向,使其在正向指导时与传统CFG几乎一致,同时在负向指导时克服现有方法的局限性,从而在去除不想要概念的同时保持样本质量。
链接: https://arxiv.org/abs/2411.17077
作者: Jinho Chang,Hyungjin Chung,Jong Chul Ye
关键词-EN: improved condition alignment, diffusion model sampling, negated CFG term, proven effective, sampling for improved
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures
点击查看摘要
Abstract:As Classifier-Free Guidance (CFG) has proven effective in conditional diffusion model sampling for improved condition alignment, many applications use a negated CFG term to filter out unwanted features from samples. However, simply negating CFG guidance creates an inverted probability distribution, often distorting samples away from the marginal distribution. Inspired by recent advances in conditional diffusion models for inverse problems, here we present a novel method to enhance negative CFG guidance using contrastive loss. Specifically, our guidance term aligns or repels the denoising direction based on the given condition through contrastive loss, achieving a nearly identical guiding direction to traditional CFG for positive guidance while overcoming the limitations of existing negative guidance methods. Experimental results demonstrate that our approach effectively removes undesirable concepts while maintaining sample quality across diverse scenarios, from simple class conditions to complex and overlapping text prompts.
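作为背景,下面给出标准 CFG 及其常见「负向项」组合方式的草图:在无条件预测的基础上,向正向条件靠拢、远离负向条件。论文指出这种朴素的取反会扭曲样本分布,并用对比损失替代之;该替代方案此处未复现。函数接口与引导系数均为示意性假设。

```python
import torch

def cfg_with_negative(eps_uncond, eps_pos, eps_neg,
                      w_pos: float = 7.5, w_neg: float = 3.0):
    """Classifier-free guidance with an additional negated term.

    eps_*: noise predictions from the same denoiser under no condition, the
    desired prompt, and the unwanted prompt. The negated term pushes samples
    away from the unwanted concept; the paper argues this naive subtraction
    distorts samples and replaces it with a contrastive objective.
    """
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)
            - w_neg * (eps_neg - eps_uncond))

# Illustrative tensors standing in for denoiser outputs at one timestep
shape = (1, 4, 64, 64)
eps_u, eps_p, eps_n = (torch.randn(shape) for _ in range(3))
print(cfg_with_negative(eps_u, eps_p, eps_n).shape)
```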
zh
[CV-95] Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering
【速读】: 该论文试图解决病理图像分析中深度学习方法忽视领域专家对组织结构和细胞成分理解的问题。解决方案的关键在于提出了一个名为Path-RAG的新框架,该框架利用HistoCartography从病理图像中检索相关领域知识,并通过选择相关图像块来增强模型性能。具体来说,Path-RAG通过领域知识的引导,显著提升了LLaVA-Med在PathVQA-Open任务中的准确率,从38%提高到47%,并且在HE染色病理图像上实现了28%的显著提升。此外,在长篇问答对中,模型在ARCH-Open PubMed和ARCH-Open Books数据集上的表现分别提高了32.5%和30.6%。
链接: https://arxiv.org/abs/2411.17073
作者: Awais Naeem,Tianhao Li,Huang-Ru Liao,Jiawei Xu,Aby M. Mathew,Zehao Zhu,Zhen Tan,Ajay Kumar Jaiswal,Raffi A. Salibian,Ziniu Hu,Tianlong Chen,Ying Ding
关键词-EN: cancer treatment selection, Accurate diagnosis, pathology images, Open-ended Pathology VQA, selection and planning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Accurate diagnosis and prognosis assisted by pathology images are essential for cancer treatment selection and planning. Despite the recent trend of adopting deep-learning approaches for analyzing complex pathology images, they fall short as they often overlook the domain-expert understanding of tissue structure and cell composition. In this work, we focus on a challenging Open-ended Pathology VQA (PathVQA-Open) task and propose a novel framework named Path-RAG, which leverages HistoCartography to retrieve relevant domain knowledge from pathology images and significantly improves performance on PathVQA-Open. Admitting the complexity of pathology image analysis, Path-RAG adopts a human-centered AI approach by retrieving domain knowledge using HistoCartography to select the relevant patches from pathology images. Our experiments suggest that domain guidance can significantly boost the accuracy of LLaVA-Med from 38% to 47%, with a notable gain of 28% for H&E-stained pathology images in the PathVQA-Open dataset. For longer-form question and answer pairs, our model consistently achieves significant improvements of 32.5% in ARCH-Open PubMed and 30.6% in ARCH-Open Books on H&E images. Our code and dataset are available here (this https URL).
zh
[CV-96] Geometry Field Splatting with Gaussian Surfels
【速读】: 该论文试图解决从图像中重建不透明表面的几何结构这一长期挑战,特别是在使用辐射场进行体积视图合成的背景下。解决方案的关键在于利用几何场(geometry field),并将其转换为体积密度,通过高斯核或表面元素(surfels)来投射几何场而非体积,从而实现对不透明固体的精确重建。具体贡献包括:1) 推导出一种高效且几乎精确的可微渲染算法,用于参数化为高斯表面元素的几何场,消除了当前涉及泰勒级数和无自衰减的近似;2) 解决了表面元素在几何附近聚集时导致的损失景观不连续问题,确保渲染颜色是内核颜色的连续函数,不受排序影响;3) 使用球谐编码反射向量而非球谐编码颜色来更好地处理镜面表面。这些改进显著提高了在广泛使用的数据集上重建3D表面的质量。
链接: https://arxiv.org/abs/2411.17067
作者: Kaiwen Jiang,Venkataram Sivaram,Cheng Peng,Ravi Ramamoorthi
关键词-EN: volumetric view synthesis, view synthesis algorithms, Geometric reconstruction, computer vision, longstanding challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Geometric reconstruction of opaque surfaces from images is a longstanding challenge in computer vision, with renewed interest from volumetric view synthesis algorithms using radiance fields. We leverage the geometry field proposed in recent work for stochastic opaque surfaces, which can then be converted to volume densities. We adapt Gaussian kernels or surfels to splat the geometry field rather than the volume, enabling precise reconstruction of opaque solids. Our first contribution is to derive an efficient and almost exact differentiable rendering algorithm for geometry fields parameterized by Gaussian surfels, while removing current approximations involving Taylor series and no self-attenuation. Next, we address the discontinuous loss landscape when surfels cluster near geometry, showing how to guarantee that the rendered color is a continuous function of the colors of the kernels, irrespective of ordering. Finally, we use latent representations with spherical harmonics encoded reflection vectors rather than spherical harmonics encoded colors to better address specular surfaces. We demonstrate significant improvement in the quality of reconstructed 3D surfaces on widely-used datasets.
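作为参考,下面给出 splatting 类渲染器中常用的由前到后 alpha 合成的朴素实现;论文的贡献正是让合成颜色成为核颜色的连续函数且与排序无关,而这个朴素版本并不具备该性质,仅用于说明被改进的基线运算。

```python
import torch

def composite_along_ray(colors, alphas):
    """Front-to-back alpha compositing of N ordered kernels along one ray:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * trans                      # per-kernel compositing weights
    return (weights[:, None] * colors).sum(dim=0), weights

colors = torch.rand(8, 3)       # per-kernel RGB
alphas = torch.rand(8) * 0.5    # per-kernel opacity after splatting
rgb, w = composite_along_ray(colors, alphas)
print(rgb, w.sum())
```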
zh
[CV-97] SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation
【速读】: 该论文试图解决在语义分割任务中,Vision Transformer (ViT) 作为通用视觉编码器时,其解码器未能充分利用编码器特征的问题。解决方案的关键在于提出了Strip Cross-Attention (SCASeg),这是一种专门为语义分割设计的创新解码器头。SCASeg通过在编码器和解码器阶段之间使用横向连接,并将编码器特征作为交叉注意力模块的查询,来替代传统的简单跳跃连接。此外,引入了Cross-Layer Block,该模块融合了来自不同编码器和解码器阶段的分层特征图,以创建统一的键和值表示。为了提高计算效率,SCASeg将查询和键压缩成条带状模式,优化了内存使用和推理速度。Cross-Layer Block还结合了卷积的局部感知优势,使SCASeg能够捕捉多层间的全局和局部上下文依赖关系,从而在不同尺度上实现有效的特征交互,提升整体性能。实验结果表明,SCASeg在不同配置下均表现出色,超越了多个基准数据集上的领先分割架构。
链接: https://arxiv.org/abs/2411.17061
作者: Guoan Xu,Jiaming Chen,Wenfeng Huang,Wenjing Jia,Guangwei Gao,Guo-Jun Qi
关键词-EN: Vision Transformer, achieved notable success, variants extensively validated, computer vision, semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures
点击查看摘要
Abstract:The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, ViT backbones often overlook the specific needs of task decoders, revealing opportunities to design decoders tailored to efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation. Instead of relying on the simple conventional skip connections, we employ lateral connections between the encoder and decoder stages, using encoder features as Queries for the cross-attention modules. Additionally, we introduce a Cross-Layer Block that blends hierarchical feature maps from different encoder and decoder stages to create a unified representation for Keys and Values. To further boost computational efficiency, SCASeg compresses queries and keys into strip-like patterns to optimize memory usage and inference speed over the traditional vanilla cross-attention. Moreover, the Cross-Layer Block incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers. This approach facilitates effective feature interaction at different scales, improving the overall performance. Experiments show that the adaptable decoder of SCASeg produces competitive performance across different setups, surpassing leading segmentation architectures on all benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under varying computational limitations.
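下面是一个「条带化键/值的交叉注意力」最小草图:编码器特征作为 Query,解码器侧特征沿宽度方向池化成条带后作为 Key/Value,使注意力矩阵从 HW×HW 缩小到 HW×H。池化方式与张量形状均为示意性假设,并非 SCASeg 的具体实现。

```python
import torch
import torch.nn as nn

class StripCrossAttention(nn.Module):
    """Cross-attention whose keys/values are pooled into horizontal strips,
    shrinking the attention matrix from (HW x HW) to (HW x H)."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, enc_feat, dec_feat):
        # enc_feat, dec_feat: (B, C, H, W); encoder features act as queries
        b, c, h, w = enc_feat.shape
        q = enc_feat.flatten(2).transpose(1, 2)          # (B, HW, C)
        strips = dec_feat.mean(dim=3).transpose(1, 2)    # (B, H, C): width-pooled strips
        out, _ = self.attn(q, strips, strips)            # keys/values are the strips
        return out.transpose(1, 2).reshape(b, c, h, w)

m = StripCrossAttention()
enc, dec = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(m(enc, dec).shape)   # torch.Size([2, 64, 32, 32])
```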
zh
[CV-98] A generalised novel loss function for computational fluid dynamics
【速读】: 该论文试图解决在计算流体动力学 (CFD) 模拟中,由于数据集中存在大量低方差区域和少量高方差区域,导致传统深度学习技术在训练过程中对所有数据区域赋予同等重要性,从而效率低下的问题。解决方案的关键在于提出了一种新的损失函数——梯度均方误差 (GMSE),该函数能够自动动态地识别并加权数据场中的重要区域,根据局部方差分配适当的权重。通过对比使用均方误差 (MSE) 损失、GMSE 损失及其动态变体 (DGMSE) 训练的网络,结果显示 GMSE 损失函数不仅加速了损失收敛,减少了训练时间,还显著降低了生成场与真实模拟之间的结构相似性误差,提高了损失的最大速率,并增强了欺骗判别器网络的能力。
链接: https://arxiv.org/abs/2411.17059
作者: Zachary Cooper-Baldock,Paulo E. Santos,Russell S.A. Brinkworth,Karl Sammut
关键词-EN: crucial in automotive, maritime and medical, calculating the flow, requirements of directly, directly calculating
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Fluid Dynamics (physics.flu-dyn)
备注: 37 pages, 13 figures, preprint submitted to Engineering Applications of Artificial Intelligence (EAAI)
点击查看摘要
Abstract:Computational fluid dynamics (CFD) simulations are crucial in automotive, aerospace, maritime and medical applications, but are limited by the complexity, cost and computational requirements of directly calculating the flow, often taking days of compute time. Machine-learning architectures, such as controlled generative adversarial networks (cGANs) hold significant potential in enhancing or replacing CFD investigations, due to cGANs ability to approximate the underlying data distribution of a dataset. Unlike traditional cGAN applications, where the entire image carries information, CFD data contains small regions of highly variant data, immersed in a large context of low variance that is of minimal importance. This renders most existing deep learning techniques that give equal importance to every portion of the data during training, inefficient. To mitigate this, a novel loss function is proposed called Gradient Mean Squared Error (GMSE) which automatically and dynamically identifies the regions of importance on a field-by-field basis, assigning appropriate weights according to the local variance. To assess the effectiveness of the proposed solution, three identical networks were trained; optimised with Mean Squared Error (MSE) loss, proposed GMSE loss and a dynamic variant of GMSE (DGMSE). The novel loss function resulted in faster loss convergence, correlating to reduced training time, whilst also displaying an 83.6% reduction in structural similarity error between the generated field and ground truth simulations, a 76.6% higher maximum rate of loss and an increased ability to fool a discriminator network. It is hoped that this loss function will enable accelerated machine learning within computational fluid dynamics.
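摘要将 GMSE 描述为按局部方差逐场自动加权的均方误差。具体加权公式未在摘要中给出,下面用「平均池化估计局部方差并归一化为权重」的方式给出一个示意性实现,属于对其形式的假设,窗口大小等超参数同样是假设。

```python
import torch
import torch.nn.functional as F

def gmse_loss(pred, target, window: int = 7, eps: float = 1e-8):
    """Variance-weighted MSE: squared errors are weighted by the target
    field's normalized local variance, so small high-variance regions
    (e.g., near boundaries or wakes) dominate the objective.

    pred, target: (B, C, H, W) flow-field tensors.
    """
    pad = window // 2
    mean = F.avg_pool2d(target, window, stride=1, padding=pad)
    var = (F.avg_pool2d(target ** 2, window, stride=1, padding=pad)
           - mean ** 2).clamp(min=0.0)
    weight = var / (var.mean(dim=(2, 3), keepdim=True) + eps)  # per-field normalization
    return (weight * (pred - target) ** 2).mean()

pred, target = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
print(gmse_loss(pred, target).item())
```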
zh
[CV-99] PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
【速读】: 该论文试图解决在文本到视频 (T2V) 生成中,针对特定身份 (identity-specific) 的人类视频生成问题,特别是在保持高身份保真度 (ID fidelity) 的同时,保留原始动作动态 (motion dynamic) 和语义跟随 (semantic following) 的挑战。解决方案的关键在于提出了一种名为 PersonalVideo 的新框架,通过直接监督由 T2V 模型生成的视频来弥合现有方法中的调优-推理差距 (tuning-inference gap)。具体来说,引入了一个可学习的孤立身份适配器 (Isolated Identity Adapter),以非侵入性的方式定制特定身份,同时不损害原始 T2V 模型的能力。此外,通过非重建身份损失 (non-reconstructive identity loss) 和模拟提示增强 (simulated prompt augmentation),进一步减少了过拟合,提高了在仅有单一参考图像情况下的鲁棒性。实验结果表明,该方法在保持高身份保真度的同时,显著优于现有方法,并且能够无缝集成预训练的 SD 组件,如 ControlNet 和风格 LoRA,无需额外的调优开销。
链接: https://arxiv.org/abs/2411.17048
作者: Hengjia Li,Haonan Qiu,Shiwei Zhang,Xiang Wang,Yujie Wei,Zekun Li,Yingya Zhang,Boxi Wu,Deng Cai
关键词-EN: made significant progress, synthesizing realistic general, realistic general videos, identity-specific human video, made significant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The current text-to-video (T2V) generation has made significant progress in synthesizing realistic general videos, but it is still under-explored in identity-specific human video generation with customized ID images. The key challenge lies in maintaining high ID fidelity consistently while preserving the original motion dynamic and semantic following after the identity injection. Current video identity customization methods mainly rely on reconstructing given identity images on text-to-image models, which have a divergent distribution with the T2V model. This process introduces a tuning-inference gap, leading to dynamic and semantic degradation. To tackle this problem, we propose a novel framework, dubbed PersonalVideo, that applies direct supervision on videos synthesized by the T2V model to bridge the gap. Specifically, we introduce a learnable Isolated Identity Adapter to customize the specific identity non-intrusively, which does not compromise the original T2V model’s abilities (e.g., motion dynamic and semantic following). With the non-reconstructive identity loss, we further employ simulated prompt augmentation to reduce overfitting by supervising generated results in more semantic scenarios, gaining good robustness even with only a single reference image available. Extensive experiments demonstrate our method’s superiority in delivering high identity faithfulness while preserving the inherent video generation qualities of the original T2V model, outshining prior approaches. Notably, our PersonalVideo seamlessly integrates with pre-trained SD components, such as ControlNet and style LoRA, requiring no extra tuning overhead.
zh
[CV-100] Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation
【速读】: 该论文试图解决数据无知识蒸馏 (Data-Free Knowledge Distillation, DFKD) 在大规模高分辨率数据集(如 ImageNet)上的应用挑战。传统方法在高分辨率下生成合成图像时,往往缺乏关键的类别特定特征,且计算成本高昂。论文提出的解决方案之关键是 MUlti-reSolution data-freE (MUSE) 方法,该方法通过在较低分辨率下生成图像并利用类激活图 (Class Activation Maps, CAMs) 来确保生成的图像保留关键的类别特定特征。此外,MUSE 还采用了多分辨率生成和嵌入多样性技术,以增强潜在空间表示,从而显著提升模型性能,特别是在 ImageNet 等大规模数据集上取得了显著的性能提升。
链接: https://arxiv.org/abs/2411.17046
作者: Minh-Tuan Tran,Trung Le,Xuan-May Le,Jianfei Cai,Mehrtash Harandi,Dinh Phung
关键词-EN: Data-Free Knowledge Distillation, Knowledge Distillation, original training data, enables knowledge transfer, relying on original
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data. While DFKD methods have achieved success on smaller datasets like CIFAR10 and CIFAR100, they encounter challenges on larger, high-resolution datasets such as ImageNet. A primary issue with previous approaches is their generation of synthetic images at high resolutions (e.g., 224×224) without leveraging information from real images, often resulting in noisy images that lack essential class-specific features in large datasets. Additionally, the computational cost of generating the extensive data needed for effective knowledge transfer can be prohibitive. In this paper, we introduce MUlti-reSolution data-freE (MUSE) to address these limitations. MUSE generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features. To further enhance model diversity, we propose multi-resolution generation and embedding diversity techniques that strengthen latent space representations, leading to significant performance improvements. Experimental results demonstrate that MUSE achieves state-of-the-art performance across both small- and large-scale datasets, with notable performance gains of up to two digits in nearly all ImageNet and subset experiments. Code is available at this https URL.
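MUSE 依赖类激活图 (CAM) 保证低分辨率合成图像保留类别特征。下面给出经典 CAM(最后一层卷积特征按目标类别的全连接权重加权求和)的标准计算草图,仅说明这一组成部分,并非 MUSE 本身;示例中为快速演示未加载预训练权重。

```python
import torch
import torch.nn.functional as F
from torchvision import models

def class_activation_map(model, x, class_idx: int):
    """Standard CAM: weight the final conv feature maps by the FC weights
    of the target class, then upsample to the input resolution."""
    feats = {}
    handle = model.layer4.register_forward_hook(
        lambda m, inp, out: feats.update(last=out))
    with torch.no_grad():
        model(x)
    handle.remove()
    fmap = feats["last"]                               # (B, 2048, h, w)
    w = model.fc.weight[class_idx]                     # (2048,)
    cam = F.relu(torch.einsum("c,bchw->bhw", w, fmap))
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                         mode="bilinear", align_corners=False)

model = models.resnet50(weights=None).eval()           # weights omitted for a quick demo
cam = class_activation_map(model, torch.randn(1, 3, 224, 224), class_idx=0)
print(cam.shape)   # torch.Size([1, 1, 224, 224])
```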
zh
[CV-101] 4D Scaffold Gaussian Splatting for Memory Efficient Dynamic Scene Reconstruction
【速读】: 该论文试图解决现有4D高斯方法在动态场景重建中存在的内存和存储需求过高的问题。解决方案的关键在于提出了一种基于锚点(anchor-based)的框架,通过扩展3D支架到4D空间,并利用稀疏的4D网格对齐锚点与压缩特征向量,显著降低了存储成本。每个锚点模型化一组神经4D高斯,代表局部时空区域。此外,引入了时间覆盖感知锚点增长策略,有效分配额外锚点到未充分重建的动态区域,并基于高斯的时间覆盖调整累积梯度,提升动态区域的重建质量。为减少锚点数量,进一步提出了神经速度和从广义高斯分布导出的时间不透明度的增强公式。实验结果表明,该方法在保持高视觉质量的同时,实现了97.8%的存储减少。
链接: https://arxiv.org/abs/2411.17044
作者: Woong Oh Cho,In Cho,Seoha Kim,Jeongmin Bae,Youngjung Uh,Seon Joo Kim
关键词-EN: offer high visual, high visual fidelity, scene reconstruction offer, reconstruction offer high, offer high
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Existing 4D Gaussian methods for dynamic scene reconstruction offer high visual fidelity and fast rendering. However, these methods suffer from excessive memory and storage demands, which limits their practical deployment. This paper proposes a 4D anchor-based framework that retains visual quality and rendering speed of 4D Gaussians while significantly reducing storage costs. Our method extends 3D scaffolding to 4D space, and leverages sparse 4D grid-aligned anchors with compressed feature vectors. Each anchor models a set of neural 4D Gaussians, each of which represent a local spatiotemporal region. In addition, we introduce a temporal coverage-aware anchor growing strategy to effectively assign additional anchors to under-reconstructed dynamic regions. Our method adjusts the accumulated gradients based on Gaussians’ temporal coverage, improving reconstruction quality in dynamic regions. To reduce the number of anchors, we further present enhanced formulations of neural 4D Gaussians. These include the neural velocity, and the temporal opacity derived from a generalized Gaussian distribution. Experimental results demonstrate that our method achieves state-of-the-art visual quality and 97.8% storage reduction over 4DGS.
zh
[CV-102] Free2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
【速读】: 该论文试图解决在文本到视频 (Text-to-Video, T2V) 生成任务中,由于帧间复杂的时序依赖性导致的文本对齐不准确问题。现有基于强化学习 (Reinforcement Learning, RL) 的方法通常需要可微分的奖励函数或受限于有限的提示,限制了其扩展性和适用性。论文提出的解决方案是 Free²Guide,一种无需额外模型训练的新型无梯度框架,利用路径积分控制原理,通过非可微分的奖励函数近似扩散模型的指导,从而能够集成强大的黑箱大型视觉语言模型 (Large Vision-Language Models, LVLMs) 作为奖励模型。此外,该框架支持灵活地集成多个奖励模型,包括大规模基于图像的模型,以协同增强对齐效果,而不会显著增加计算开销。实验证明,Free²Guide 在多个维度上显著提高了文本对齐的准确性,并提升了生成视频的整体质量。
链接: https://arxiv.org/abs/2411.17041
作者: Jaemin Kim,Bryan S Kim,Jong Chul Ye
关键词-EN: achieved impressive results, achieved impressive, impressive results, results in generative, generative tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages
点击查看摘要
Abstract:Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependency across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free²Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free²Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward model. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free²Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.
zh
[CV-103] Multimodal Alignment and Fusion: A Survey
【速读】: 该论文试图解决多模态数据(如文本、图像、音频和视频)在机器学习中的对齐与融合问题。解决方案的关键在于系统地分类和分析现有的对齐与融合技术,以提高模型准确性和适用性,特别是在数据多样性增加和数据有限的情况下。论文通过回顾超过200篇相关文献,探讨了多模态数据整合中的挑战,如对齐问题、噪声鲁棒性和特征表示差异,并强调了这些技术在社交媒体分析、医学影像和情感识别等领域的应用。最终,论文旨在为未来研究提供指导,优化多模态学习系统,增强其在不同应用中的可扩展性、鲁棒性和通用性。
链接: https://arxiv.org/abs/2411.17040
作者: Songtao Li,Hao Tang
关键词-EN: offers a comprehensive, recent advancements, growing diversity, comprehensive review, review of recent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 210+ references
点击查看摘要
Abstract:This survey offers a comprehensive review of recent advancements in multimodal alignment and fusion within machine learning, spurred by the growing diversity of data types such as text, images, audio, and video. Multimodal integration enables improved model accuracy and broader applicability by leveraging complementary information across different modalities, as well as facilitating knowledge transfer in situations with limited data. We systematically categorize and analyze existing alignment and fusion techniques, drawing insights from an extensive review of more than 200 relevant papers. Furthermore, this survey addresses the challenges of multimodal data integration - including alignment issues, noise resilience, and disparities in feature representation - while focusing on applications in domains like social media analysis, medical imaging, and emotion recognition. The insights provided are intended to guide future research towards optimizing multimodal learning systems to enhance their scalability, robustness, and generalizability across various applications.
zh
[CV-104] g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks
【速读】: 该论文试图解决在具身任务(embodied tasks)中,如何有效地将3D场景与语言信息结合,以实现对新环境的泛化能力。解决方案的关键在于提出了可泛化的3D-语言特征场(Generalizable 3D-Language Feature Fields, g3D-LF),该模型通过预训练在大规模3D-语言数据集上,能够处理来自代理的RGB-D图像,生成特征场,包括:1) 从任意位置预测新视图表示;2) 生成以代理为中心的BEV地图;3) 在上述表示中使用多粒度语言查询目标。通过沿采样光线进行体积渲染和多尺度编码器整合语义与空间关系,g3D-LF能够在不同尺度和视角下生成与多粒度语言对齐的表示,并通过多层次对比学习实现动态更新和实时构建。此外,论文还准备了一个大规模3D-语言数据集,以确保特征场与语言表示的对齐。
链接: https://arxiv.org/abs/2411.17030
作者: Zihan Wang,Gim Hee Lee
关键词-EN: introduce Generalizable, representation model pre-trained, Feature Fields, model pre-trained, Generations of BEV
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks. Our g3D-LF processes posed RGB-D images from agents to encode feature fields for: 1) Novel view representation predictions from any position in the 3D scene; 2) Generations of BEV maps centered on the agent; 3) Querying targets using multi-granularity language within the above-mentioned representations. Our representation can be generalized to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, our g3D-LF produces representations at different scales and perspectives, aligned with multi-granularity language, via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the representations of the feature fields with language. Extensive experiments on Vision-and-Language Navigation under both Panorama and Monocular settings, Zero-shot Object Navigation, and Situated Question Answering tasks highlight the significant advantages and effectiveness of our g3D-LF for embodied tasks.
zh
[CV-105] D2-World: An Efficient World Model through Decoupled Dynamic Flow CVPR2024
【速读】: 该论文旨在解决预测未来点云(point clouds)的问题,特别是在自动驾驶系统中的应用。解决方案的关键在于引入了一种名为 D^2-World 的新型世界模型,该模型通过解耦动态流(Decoupled Dynamic flow)来有效预测未来的点云。具体步骤包括:首先利用现有的占用网络(如 BEVDet)获取过去的语义占用信息,然后将这些信息作为输入,通过单阶段世界模型以非自回归方式生成未来的占用情况。为了简化任务,模型在生成过程中对动态体素(dynamic voxels)进行了解耦,通过体素流(voxel flow)扭曲现有观测来生成未来的动态体素,而静态体素(static voxels)则通过姿态变换(pose transformation)直接获得。这种方法不仅在性能上达到了最先进的水平,而且在训练速度上比基线模型快了300%以上。
链接: https://arxiv.org/abs/2411.17027
作者: Haiming Zhang,Xu Yan,Ying Xue,Zixuan Guo,Shuguang Cui,Zhen Li,Bingbing Liu
关键词-EN: Model Challenge held, Workshop on Foundation, Autonomous Systems, World Model Challenge, technical report summarizes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The 2nd Place and Innovation Award Solution of Predictive World Model at the CVPR 2024 Autonomous Grand Challenge
点击查看摘要
Abstract:This technical report summarizes the second-place solution for the Predictive World Model Challenge held at the CVPR-2024 Workshop on Foundation Models for Autonomous Systems. We introduce D^2-World, a novel World model that effectively forecasts future point clouds through Decoupled Dynamic flow. Specifically, the past semantic occupancies are obtained via existing occupancy networks (e.g., BEVDet). Following this, the occupancy results serve as the input for a single-stage world model, generating future occupancy in a non-autoregressive manner. To further simplify the task, dynamic voxel decoupling is performed in the world model. The model generates future dynamic voxels by warping the existing observations through voxel flow, while remaining static voxels can be easily obtained through pose transformation. As a result, our approach achieves state-of-the-art performance on the OpenScene Predictive World Model benchmark, securing second place, and trains more than 300% faster than the baseline model. Code is available at this https URL.
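结合摘要描述的"解耦动态流"思路,下面用 NumPy 给出一个概念性示意:动态体素按体素流搬运,静态体素按位姿变换直接得到未来帧的占据;网格大小、体素尺寸与数据格式均为假设,并非论文实现。

```python
import numpy as np

def warp_dynamic_voxels(occ_idx, flow, grid_shape):
    """按体素流搬运动态体素:occ_idx 为 (N,3) 占据体素坐标,flow 为 (N,3) 位移(体素单位)。"""
    moved = np.round(occ_idx + flow).astype(int)
    new_occ = np.zeros(grid_shape, dtype=bool)
    valid = np.all((moved >= 0) & (moved < np.array(grid_shape)), axis=1)
    new_occ[tuple(moved[valid].T)] = True
    return new_occ

def transform_static_voxels(occ_idx, T_future_from_now, voxel_size, grid_shape):
    """静态体素直接用 4x4 位姿变换得到未来帧下的占据。"""
    pts = occ_idx * voxel_size                        # 体素坐标 -> 米
    pts_h = np.c_[pts, np.ones(len(pts))]             # 齐次坐标
    pts_future = (T_future_from_now @ pts_h.T).T[:, :3]
    idx = np.round(pts_future / voxel_size).astype(int)
    new_occ = np.zeros(grid_shape, dtype=bool)
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    new_occ[tuple(idx[valid].T)] = True
    return new_occ

# 示例:16^3 网格中若干动态/静态体素
grid = (16, 16, 16)
dyn = np.array([[4, 4, 4], [5, 4, 4]]); flow = np.array([[1.0, 0, 0], [1.0, 0, 0]])
sta = np.array([[10, 10, 2]]); T = np.eye(4); T[0, 3] = 0.5   # 自车相对运动:x 方向平移 0.5 m
future = warp_dynamic_voxels(dyn, flow, grid) | transform_static_voxels(sta, T, voxel_size=0.5, grid_shape=grid)
print(future.sum())  # 3
```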
zh
[CV-106] RED: Robust Environmental Design
【速读】: 该论文试图解决自动驾驶系统在道路标志分类中易受对抗性攻击的问题。解决方案的关键在于通过重新设计道路标志本身来提高其鲁棒性,而非仅仅增强分类模型的鲁棒性。论文提出了一种攻击者无关的学习方案,用于自动设计对基于补丁的攻击具有高度鲁棒性的道路标志。实验结果表明,这种方法在数字和物理环境中均显著降低了对抗性攻击的成功率,优于现有的技术。
链接: https://arxiv.org/abs/2411.17026
作者: Jinghan Yan
关键词-EN: autonomous systems, visual inputs, reliant on visual, highly susceptible, susceptible to adversarial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The classification of road signs by autonomous systems, especially those reliant on visual inputs, is highly susceptible to adversarial attacks. Traditional approaches to mitigating such vulnerabilities have focused on enhancing the robustness of classification models. In contrast, this paper adopts a fundamentally different strategy aimed at increasing robustness through the redesign of road signs themselves. We propose an attacker-agnostic learning scheme to automatically design road signs that are robust to a wide array of patch-based attacks. Empirical tests conducted in both digital and physical environments demonstrate that our approach significantly reduces vulnerability to patch attacks, outperforming existing techniques.
zh
[CV-107] TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On
【速读】: 该论文试图解决虚拟试衣(Virtual Try-On, VTO)中生成图像时文本渲染失真和细节保真度不足的问题。解决方案的关键在于提出了TED-VITON框架,该框架通过引入Garment Semantic (GS) Adapter增强服装特定特征,使用Text Preservation Loss确保文本渲染的准确性和无失真,并通过优化Large Language Model (LLM)生成提示的约束机制,从而在视觉质量和文本保真度方面实现了最先进(SOTA)的性能,为VTO任务设立了新的基准。
链接: https://arxiv.org/abs/2411.17017
作者: Zhenchen Wan,Yanwu Xu,Zhaoqing Wang,Feng Liu,Tongliang Liu,Mingming Gong
关键词-EN: demonstrated exceptional efficacy, generating realistic images, Virtual Try-On, Recent advancements, advancements in Virtual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, 3 tables, conference
点击查看摘要
Abstract:Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, thereby limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and preserving fine-grained details, such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. Directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage the models’ advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism to generate prompts by optimizing Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for VTO task.
zh
[CV-108] Event-based Spiking Neural Networks for Object Detection: A Review of Datasets Architectures Learning Rules and Implementation
【速读】: 该论文旨在系统性地回顾脉冲神经网络 (Spiking Neural Networks, SNNs) 在计算机视觉 (Computer Vision, CV) 应用中的对象检测任务中的数据集、架构、学习方法、实现技术和评估方法。解决方案的关键在于:1) 分析了全连接、卷积和递归架构的有效性;2) 评估了直接无监督、直接监督和间接学习方法的性能;3) 探讨了神经形态硬件实现中的能耗、延迟和内存的权衡。此外,论文还提供了开源资源、Python代码示例以及构建SNN模型、事件数据处理和SNN模拟的详细指南,并指出了SNN训练、硬件集成和未来CV应用中的关键挑战和方向。
链接: https://arxiv.org/abs/2411.17006
作者: Craig Iaboni,Pramod Abichandani
关键词-EN: Spiking Neural Networks, artificial neural networks, conventional artificial neural, Neural Networks, biologically inspired paradigm
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 63 pages, 15 figures
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) represent a biologically inspired paradigm offering an energy-efficient alternative to conventional artificial neural networks (ANNs) for Computer Vision (CV) applications. This paper presents a systematic review of datasets, architectures, learning methods, implementation techniques, and evaluation methodologies used in CV-based object detection tasks using SNNs. Based on an analysis of 151 journal and conference articles, the review codifies: 1) the effectiveness of fully connected, convolutional, and recurrent architectures; 2) the performance of direct unsupervised, direct supervised, and indirect learning methods; and 3) the trade-offs in energy consumption, latency, and memory in neuromorphic hardware implementations. An open-source repository along with detailed examples of Python code and resources for building SNN models, event-based data processing, and SNN simulations are provided. Key challenges in SNN training, hardware integration, and future directions for CV applications are also identified.
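作为背景补充,下面给出一个最小的 Leaky Integrate-and-Fire (LIF) 神经元示意,帮助理解该综述所讨论的脉冲神经网络的基本计算单元:膜电位泄漏积分,过阈值发放脉冲并复位;时间常数、阈值等参数取值均为假设,仅供演示。

```python
import numpy as np

def lif_neuron(input_current, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """离散时间 LIF 神经元:返回脉冲序列与膜电位轨迹。"""
    v, spikes, vs = 0.0, [], []
    for i_t in input_current:
        v = v + dt / tau * (-(v - v_reset) + i_t)   # 泄漏 + 输入电流积分
        if v >= v_th:                               # 达到阈值 -> 发放脉冲并复位
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
        vs.append(v)
    return np.array(spikes), np.array(vs)

# 恒定输入电流下的脉冲发放
spk, _ = lif_neuron(np.full(100, 1.5))
print("spike count:", spk.sum())
```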
zh
[CV-109] Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation
【速读】: 该论文试图解决大规模预训练视觉-语言模型(VLMs)在测试时遇到的数据分布偏移问题,特别是在零样本学习场景下模型性能显著下降的情况。解决方案的关键在于利用类别文本信息生成伪标签,并通过最优传输(Optimal Transport)高效地解决标签分配问题。具体来说,论文提出了CLIP-OT方法,该方法通过固定类别文本嵌入作为中心点来生成伪标签,并结合多模板知识蒸馏技术,在不增加计算复杂度的情况下,实现了多视角对比学习策略,从而显著提升了模型在测试时适应新分布的能力,并在多个测试时适应基准上取得了优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2411.17002
作者: Shambhavi Mishra,Julio Silva-Rodríguez,Ismail Ben Ayed,Marco Pedersoli,Jose Dolz
关键词-EN: shown unprecedented zero-shot, unprecedented zero-shot performance, range of tasks, Vision-language foundation models, shown unprecedented
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we explore how to efficiently leverage class text information to mitigate these distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple template knowledge distillation approach, which replicates multi-view contrastive learning strategies in unsupervised representation learning but without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, yet being computationally and memory efficient.
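下面按摘要思路给出一个极简示意:把归一化的类别文本嵌入当作固定中心,用 Sinkhorn-Knopp 迭代近似求解最优传输(行边缘为每样本质量、列边缘为均匀类别分布)来得到测试样本的伪标签;迭代次数、温度系数等超参数均为假设,并非论文官方实现。

```python
import torch

@torch.no_grad()
def sinkhorn_pseudo_labels(image_feats, text_centroids, eps=0.05, n_iters=3):
    """image_feats: (N, D) 归一化图像特征;text_centroids: (K, D) 归一化类别文本嵌入。"""
    logits = image_feats @ text_centroids.T      # 余弦相似度
    Q = torch.exp(logits / eps)                  # (N, K) 未归一化的运输计划
    Q /= Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):                     # 交替归一化行/列边缘分布
        Q /= Q.sum(dim=0, keepdim=True); Q /= K  # 列:每个类别均匀分到 1/K 的质量
        Q /= Q.sum(dim=1, keepdim=True); Q /= N  # 行:每个样本质量 1/N
    return Q.argmax(dim=1)                       # 硬伪标签(也可直接用软分配)

feats = torch.nn.functional.normalize(torch.randn(32, 512), dim=-1)
cents = torch.nn.functional.normalize(torch.randn(10, 512), dim=-1)
print(sinkhorn_pseudo_labels(feats, cents)[:8])
```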
zh
[CV-110] SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery
【速读】: 该论文试图解决现有基础模型在高空间分辨率、无云卫星图像上的局限性,特别是在需要频繁时间监测或广泛光谱特征的应用场景中的适用性问题。解决方案的关键是引入了一个名为 SatVision-TOA 的新型基础模型,该模型预训练于 14 波段 MODIS L1B 大气层顶 (TOA) 辐射图像数据上。SatVision-TOA 采用 Masked-Image-Modeling (MIM) 框架和 SwinV2 架构,通过自监督学习方法学习详细的上下文表示,无需标签。该模型拥有 30 亿参数,训练于 1 亿张图像,是目前最大的仅基于卫星遥感图像训练的基础模型。实验结果表明,SatVision-TOA 在下游任务如 3D 云检索中表现优异,显著提升了平均交并比 (mIOU) 至 0.46,相比基线方法的 0.22 有显著改进,同时减少了 50% 以上的假阴性结果。
链接: https://arxiv.org/abs/2411.17000
作者: Caleb S. Spradlin,Jordan A. Caraballo-Vega,Jian Li,Mark L. Carroll,Jie Gong,Paul M. Montesano
关键词-EN: enabling large computer, large computer vision, remote sensing data, remote sensing, potential to transform
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 5 figures
点击查看摘要
Abstract:Foundation models have the potential to transform the landscape of remote sensing (RS) data analysis by enabling large computer vision models to be pre-trained on vast amounts of remote sensing data. These models can then be fine-tuned with small amounts of labeled training and applied to a variety of applications. Most existing foundation models are designed for high spatial resolution, cloud-free satellite imagery or photos, limiting their applicability in scenarios that require frequent temporal monitoring or broad spectral profiles. As a result, foundation models trained solely on cloud-free images have limited utility for applications that involve atmospheric variables or require atmospheric corrections. We introduce SatVision-TOA, a novel foundation model pre-trained on 14-band MODIS L1B Top-Of-Atmosphere (TOA) radiance imagery, addressing the need for models pre-trained to handle moderate- and coarse-resolution all-sky remote sensing data. The SatVision-TOA model is pre-trained using a Masked-Image-Modeling (MIM) framework and the SwinV2 architecture, and learns detailed contextual representations through self-supervised learning without the need for labels. It is a 3 billion parameter model that is trained on 100 million images. To our knowledge this is the largest foundation model trained solely on satellite RS imagery. Results show that SatVision-TOA achieves superior performance over baseline methods on downstream tasks such as 3D cloud retrieval. Notably, the model achieves a mean intersection over union (mIOU) of 0.46, a substantial improvement over the baseline mIOU of 0.22. Additionally, the rate of false negative results in the fine-tuning task were reduced by over 50% compared to the baseline. Our work advances pre-trained vision modeling for multispectral RS by learning from a variety of atmospheric and aerosol conditions to improve cloud and land surface monitoring.
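为说明 Masked-Image-Modeling 预训练的基本流程,下面给出一个玩具级示意:随机遮挡多光谱 patch,用可学习 mask token 替换后经小型 Transformer 编码,仅在被遮挡位置回归原始像素。模型规模、patch 大小与损失形式均为假设,与论文使用的 SwinV2 主干无关。

```python
import torch
import torch.nn as nn

def random_patch_mask(B, n_patches, mask_ratio=0.6):
    """随机遮挡一定比例的 patch,返回布尔掩码 (B, n_patches)。"""
    n_mask = int(n_patches * mask_ratio)
    idx = torch.rand(B, n_patches).argsort(dim=1)[:, :n_mask]
    mask = torch.zeros(B, n_patches, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

class TinyMIM(nn.Module):
    """极简 MIM:被遮挡 patch 替换为可学习 token,编码后回归原始像素。"""
    def __init__(self, patch_dim, d_model=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, patch_dim)

    def forward(self, patches, mask):
        x = self.embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        pred = self.head(self.encoder(x))
        return (pred[mask] - patches[mask]).abs().mean()   # 仅在被遮挡位置计算重建损失

# 假设 14 个波段、8x8 的 patch -> patch_dim = 14*8*8
patches = torch.randn(2, 49, 14 * 8 * 8)
mask = random_patch_mask(2, 49)
print(TinyMIM(14 * 8 * 8)(patches, mask).item())
```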
zh
[CV-111] Curvature Informed Furthest Point Sampling
【速读】: 该论文试图解决点云数据在处理过程中由于数据量过大而导致的计算需求增加的问题。解决方案的关键在于引入基于强化学习(Reinforcement Learning)的采样算法,该算法通过结合最远点采样(Furthest Point Sampling, FPS)的软排名与由深度神经网络计算的曲率分数,来优化点云的下采样过程。具体来说,该方法通过替换FPS集合中低曲率点为未选中集合中的高曲率点,从而在保持计算效率的同时,更好地保留几何特征。这种方法不仅解决了传统采样技术在几何特征保留上的不足,还克服了现有可微分采样技术在训练稳定性上的问题,实现了稳定的端到端学习,并在多个下游几何处理任务中表现出色,达到了当前最先进的性能。
链接: https://arxiv.org/abs/2411.16995
作者: Shubham Bhardwaj,Ashwin Vinod,Soumojit Bhattacharya,Aryan Koganti,Aditya Sai Ellendula,Balakrishna Reddy
关键词-EN: gained traction due, efficient memory usage, Point cloud representation, simplicity in acquisition, representation has gained
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 5 figures
点击查看摘要
Abstract:Point cloud representation has gained traction due to its efficient memory usage and simplicity in acquisition, manipulation, and storage. However, as point cloud sizes increase, effective down-sampling becomes essential to address the computational requirements of downstream tasks. Classical approaches, such as furthest point sampling (FPS), perform well on benchmarks but rely on heuristics and overlook geometric features, like curvature, during down-sampling. In this paper, We introduce a reinforcement learning-based sampling algorithm that enhances FPS by integrating curvature information. Our approach ranks points by combining FPS-derived soft ranks with curvature scores computed by a deep neural network, allowing us to replace a proportion of low-curvature points in the FPS set with high-curvature points from the unselected set. Existing differentiable sampling techniques often suffer from training instability, hindering their integration into end-to-end learning frameworks. By contrast, our method achieves stable end-to-end learning, consistently outperforming baseline models across multiple downstream geometry processing tasks. We provide comprehensive ablation studies, with both qualitative and quantitative insights into the effect of each feature on performance. Our algorithm establishes state-of-the-art results for classification, segmentation and shape completion, showcasing its robustness and adaptability.
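下面按摘要描述的思路给出一个 NumPy 示意:先做经典 FPS,再用曲率分数把 FPS 集合中曲率最低的一部分点替换为未选集合中曲率最高的点。这里用局部 PCA 最小特征值占比近似曲率,以代替论文中由深度网络与强化学习得到的分数;替换比例与邻域大小均为假设。

```python
import numpy as np

def farthest_point_sampling(pts, m):
    """经典 FPS:迭代选取与已选点集距离最远的点。"""
    sel = [np.random.randint(len(pts))]
    d = np.linalg.norm(pts - pts[sel[0]], axis=1)
    for _ in range(m - 1):
        sel.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(pts - pts[sel[-1]], axis=1))
    return np.array(sel)

def curvature_scores(pts, k=16):
    """局部 PCA 最小特征值占比作为曲率代理(论文中由深度网络预测,这里仅作演示)。"""
    scores = np.empty(len(pts))
    for i, p in enumerate(pts):
        nn_idx = np.argsort(np.linalg.norm(pts - p, axis=1))[:k]
        cov = np.cov((pts[nn_idx] - pts[nn_idx].mean(0)).T)
        w = np.linalg.eigvalsh(cov)
        scores[i] = w[0] / (w.sum() + 1e-9)
    return scores

def curvature_informed_fps(pts, m, swap_ratio=0.25):
    """把 FPS 集合中的低曲率点换成未选集合中的高曲率点。"""
    sel = farthest_point_sampling(pts, m)
    curv = curvature_scores(pts)
    n_swap = int(m * swap_ratio)
    drop = sel[np.argsort(curv[sel])[:n_swap]]                 # FPS 集合中曲率最低的点
    unsel = np.setdiff1d(np.arange(len(pts)), sel)
    add = unsel[np.argsort(curv[unsel])[-n_swap:]]             # 未选集合中曲率最高的点
    return np.concatenate([np.setdiff1d(sel, drop), add])

pts = np.random.rand(2048, 3)
print(curvature_informed_fps(pts, 256).shape)   # (256,)
```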
zh
[CV-112] CMAViT: Integrating Climate Managment and Remote Sensing Data for Crop Yield Estimation with Multimodel Vision Transformers
【速读】: 该论文试图解决农业规划中的作物产量预测问题,这一问题由于天气、气候和管理实践之间的复杂相互作用而具有挑战性。解决方案的关键在于引入了一种基于深度学习的多模型——气候管理感知视觉变换器 (Climate-Management Aware Vision Transformer, CMAViT),该模型专门设计用于像素级别的葡萄园产量预测。CMAViT通过整合遥感图像和短期气象数据的空间和时间数据,捕捉生长季节变化的影响,并使用交叉注意力编码器将文本形式的管理实践与时间序列数据相结合,以模拟它们与时间序列数据的交互。这种创新的多模态变换器在涵盖2200公顷和八种葡萄品种的大规模数据集上进行了测试,表现优于传统的UNet-ConvLSTM模型,特别是在捕捉空间变异性和预测极端值方面。CMAViT在未见过的测试数据集上达到了0.84的R²和8.22%的MAPE,证明了其有效性。
链接: https://arxiv.org/abs/2411.16989
作者: Hamid Kamangir,Brent. S. Sams,Nick Dokoozlian,Luis Sanchez,J. Mason. Earles
关键词-EN: remains challenging due, Crop yield prediction, Aware Vision Transformer, Crop yield, Climate-Management Aware Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Crop yield prediction is essential for agricultural planning but remains challenging due to the complex interactions between weather, climate, and management practices. To address these challenges, we introduce a deep learning-based multi-model called Climate-Management Aware Vision Transformer (CMAViT), designed for pixel-level vineyard yield predictions. CMAViT integrates both spatial and temporal data by leveraging remote sensing imagery and short-term meteorological data, capturing the effects of growing season variations. Additionally, it incorporates management practices, which are represented in text form, using a cross-attention encoder to model their interaction with time-series data. This innovative multi-modal transformer tested on a large dataset from 2016-2019 covering 2,200 hectares and eight grape cultivars including more than 5 million vines, outperforms traditional models like UNet-ConvLSTM, excelling in spatial variability capture and yield prediction, particularly for extreme values in vineyards. CMAViT achieved an R2 of 0.84 and a MAPE of 8.22% on an unseen test dataset. Masking specific modalities lowered performance: excluding management practices, climate data, and both reduced R2 to 0.73, 0.70, and 0.72, respectively, and raised MAPE to 11.92%, 12.66%, and 12.39%, highlighting each modality’s importance for accurate yield prediction. Code is available at this https URL.
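摘要提到用交叉注意力编码器建模文本形式的管理措施与时间序列数据的交互,下面给出一个最小示意:时序 token 作为 query、文本 token 作为 key/value 的残差交叉注意力;特征维度、头数与 token 数均为假设,并非论文的具体结构。

```python
import torch
import torch.nn as nn

class TextTimeSeriesCrossAttention(nn.Module):
    """把管理措施的文本嵌入注入到气象/遥感时间序列表示中。"""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, ts_tokens, text_tokens):
        fused, _ = self.attn(query=ts_tokens, key=text_tokens, value=text_tokens)
        return self.norm(ts_tokens + fused)    # 残差 + LayerNorm

ts = torch.randn(2, 24, 128)     # 2 个样本、24 个时间步的时序特征
txt = torch.randn(2, 8, 128)     # 管理措施文本编码成的 8 个 token
print(TextTimeSeriesCrossAttention()(ts, txt).shape)  # torch.Size([2, 24, 128])
```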
zh
[CV-113] SEMU-Net: A Segmentation-based Corrector for Fabrication Process Variations of Nanophotonics with Microscopic Images WACV2025
【速读】: 该论文试图解决集成硅光子器件在纳米制造过程中因结构变异(如过蚀刻、角圆化及意外缺陷)导致性能显著下降的问题。解决方案的关键在于引入SEMU-Net,这是一个综合方法集,包括自动分割扫描电子显微镜图像(SEM)并利用这些图像训练基于U-Net及其变体的两个深度神经网络模型。预测模型用于预估制造引起的变异,而校正模型则调整设计以应对这些问题,确保最终制造的结构与设计规格高度一致。实验结果表明,分割U-Net的平均IoU(Intersection over Union)得分为99.30%,而校正注意力U-Net在串联架构中的平均IoU得分为98.67%。
链接: https://arxiv.org/abs/2411.16973
作者: Rambod Azimi,Yijian Kong,Dusan Gostimirovic,James J. Clark,Odile Liboiron-Ladouceur
关键词-EN: Integrated silicon photonic, silicon photonic devices, Integrated silicon, photonic devices, silicon photonic
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to WACV 2025
点击查看摘要
Abstract:Integrated silicon photonic devices, which manipulate light to transmit and process information on a silicon-on-insulator chip, are highly sensitive to structural variations. Minor deviations during nanofabrication-the precise process of building structures at the nanometer scale-such as over- or under-etching, corner rounding, and unintended defects, can significantly impact performance. To address these challenges, we introduce SEMU-Net, a comprehensive set of methods that automatically segments scanning electron microscope images (SEM) and uses them to train two deep neural network models based on U-Net and its variants. The predictor model anticipates fabrication-induced variations, while the corrector model adjusts the design to address these issues, ensuring that the final fabricated structures closely align with the intended specifications. Experimental results show that the segmentation U-Net reaches an average IoU score of 99.30%, while the corrector attention U-Net in a tandem architecture achieves an average IoU score of 98.67%.
zh
[CV-114] ZoomLDM: Latent Diffusion Model for multi-scale image generation
【速读】: 该论文试图解决在大尺寸图像领域(如数字病理学和卫星图像)中,扩散模型在生成完整图像时面临的挑战。由于直接在可能具有千兆像素大小的“完整”图像上训练模型是不现实的,现有的扩散生成方法主要集中在合成从这些大图像中提取的小尺寸、固定大小的补丁(patches)。然而,这种基于补丁的生成方法无法捕捉大图像的全局结构和更广泛的上下文信息,这对于生成(语义上)准确的样本至关重要。
解决方案的关键是提出了ZoomLDM,这是一种专门为多尺度图像生成设计的扩散模型。其核心创新在于一种新的放大感知条件机制(magnification-aware conditioning mechanism),该机制利用自监督学习(Self-Supervised Learning, SSL)嵌入,使扩散模型能够在不同的“缩放”级别(即从大图像中提取的不同尺度的固定大小补丁)上合成图像。ZoomLDM在所有尺度上实现了最先进的图像生成质量,特别是在生成整个大图像的缩略图这种数据稀缺的情况下表现尤为出色。其多尺度特性还解锁了在大图像生成中的额外能力,使得计算上可行的、全局一致的图像合成达到4096×4096像素,并支持4倍超分辨率。此外,从ZoomLDM中提取的多尺度特征在多实例学习实验中表现出高效性。
链接: https://arxiv.org/abs/2411.16969
作者: Srikar Yellapragada,Alexandros Graikos,Kostas Triaridis,Prateek Prasanna,Rajarsi R. Gupta,Joel Saltz,Dimitris Samaras
关键词-EN: satellite imagery, challenges restrict, restrict their application, application to large-image, digital pathology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on ‘whole’ images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. In this paper, to overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different ‘zoom’ levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling particularly in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to 4096×4096 pixels and 4× super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments. We provide high-resolution examples of the generated images on our website this https URL.
zh
[CV-115] MotionWavelet: Human Motion Prediction via Wavelet Manifold Learning
【速读】: 该论文试图解决人体运动预测中时间特征和非平稳动力学的建模问题,特别是在复杂人体运动中捕捉细微过渡特征的挑战。解决方案的关键在于引入MotionWavelet框架,该框架利用小波变换(Wavelet Transformation)在空间-频率域中研究人体运动模式。具体来说,MotionWavelet通过小波扩散模型(Wavelet Diffusion Model, WDM)学习小波流形(Wavelet Manifold),从而编码复杂的时空运动模式。此外,MotionWavelet还包括小波空间形状引导机制(Wavelet Space Shaping Guidance)和基于时间注意力的引导(Temporal Attention-Based Guidance),以优化去噪过程并提高预测精度。实验结果表明,MotionWavelet在各种基准测试中显著提升了预测精度和泛化能力。
链接: https://arxiv.org/abs/2411.16964
作者: Yuming Feng,Zhiyang Dou,Ling-Hao Chen,Yuan Liu,Tianyu Li,Jingbo Wang,Zeyu Cao,Wenping Wang,Taku Komura,Lingjie Liu
关键词-EN: Modeling temporal characteristics, body movement plays, predicting human future, Modeling temporal, human future motions
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Project Page: this https URL Video: this https URL
点击查看摘要
Abstract:Modeling temporal characteristics and the non-stationary dynamics of body movement plays a significant role in predicting human future motions. However, it is challenging to capture these features due to the subtle transitions involved in the complex human motions. This paper introduces MotionWavelet, a human motion prediction framework that utilizes Wavelet Transformation and studies human motion patterns in the spatial-frequency domain. In MotionWavelet, a Wavelet Diffusion Model (WDM) learns a Wavelet Manifold by applying Wavelet Transformation on the motion data therefore encoding the intricate spatial and temporal motion patterns. Once the Wavelet Manifold is built, WDM trains a diffusion model to generate human motions from Wavelet latent vectors. In addition to the WDM, MotionWavelet also presents a Wavelet Space Shaping Guidance mechanism to refine the denoising process to improve conformity with the manifold structure. WDM also develops Temporal Attention-Based Guidance to enhance prediction accuracy. Extensive experiments validate the effectiveness of MotionWavelet, demonstrating improved prediction accuracy and enhanced generalization across various benchmarks. Our code and models will be released upon acceptance.
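为说明"在小波域研究运动模式"的基本操作,下面给出一层 Haar 小波沿时间维的分解与重建示意;论文实际可能使用不同的小波基与多层分解,这里的实现仅演示可逆的时频变换,关节数与帧数均为假设。

```python
import torch

def haar_dwt_1d(x):
    """对运动序列沿时间维做一层 Haar 小波分解。x: (T, J*3),T 为偶数。
    返回低频(近似)与高频(细节)系数,各为 (T/2, J*3)。"""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / 2 ** 0.5
    detail = (even - odd) / 2 ** 0.5
    return approx, detail

def haar_idwt_1d(approx, detail):
    """Haar 逆变换,重建原序列。"""
    even = (approx + detail) / 2 ** 0.5
    odd = (approx - detail) / 2 ** 0.5
    x = torch.empty(2 * approx.shape[0], approx.shape[1])
    x[0::2], x[1::2] = even, odd
    return x

motion = torch.randn(64, 22 * 3)            # 64 帧、22 个关节的 3D 坐标(示意)
a, d = haar_dwt_1d(motion)                  # 小波域表示,可作为扩散模型的生成目标
print(torch.allclose(haar_idwt_1d(a, d), motion, atol=1e-6))  # True
```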
zh
[CV-116] RoCoDA: Counterfactual Data Augmentation for Data-Efficient Robot Learning from Demonstrations
【速读】: 该论文试图解决机器人模仿学习中的泛化问题,特别是在复杂环境和数据收集成本高的情况下。解决方案的关键在于引入了一种名为 RoCoDA 的新方法,该方法统一了不变性(invariance)、等变性(equivariance)和因果关系(causality)的概念,以增强数据增强的效果。RoCoDA 通过利用因果不变性,修改任务无关的环境状态子集而不影响策略输出,同时利用 SE(3) 等变性对物体姿态进行刚体变换并调整相应动作,生成合成演示。实验结果表明,RoCoDA 在政策性能、泛化能力和样本效率方面优于现有的最先进数据增强方法,并展现出对未见过的物体姿态、纹理和干扰物的鲁棒泛化能力。
链接: https://arxiv.org/abs/2411.16959
作者: Ezra Ameperosa,Jeremy A. Collins,Mrinal Jain,Animesh Garg
关键词-EN: faces significant challenges, robotics faces significant, Imitation learning, faces significant, significant challenges
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Imitation learning in robotics faces significant challenges in generalization due to the complexity of robotic environments and the high cost of data collection. We introduce RoCoDA, a novel method that unifies the concepts of invariance, equivariance, and causality within a single framework to enhance data augmentation for imitation learning. RoCoDA leverages causal invariance by modifying task-irrelevant subsets of the environment state without affecting the policy’s output. Simultaneously, we exploit SE(3) equivariance by applying rigid body transformations to object poses and adjusting corresponding actions to generate synthetic demonstrations. We validate RoCoDA through extensive experiments on five robotic manipulation tasks, demonstrating improvements in policy performance, generalization, and sample efficiency compared to state-of-the-art data augmentation methods. Our policies exhibit robust generalization to unseen object poses, textures, and the presence of distractors. Furthermore, we observe emergent behavior such as re-grasping, indicating policies trained with RoCoDA possess a deeper understanding of task dynamics. By leveraging invariance, equivariance, and causality, RoCoDA provides a principled approach to data augmentation in imitation learning, bridging the gap between geometric symmetries and causal reasoning.
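下面用 NumPy 演示摘要中 SE(3) 等变增强的核心想法:对物体位姿与对应的末端执行器动作施加同一个刚体变换,则"动作相对物体"的几何关系保持不变,从而得到合法的合成演示;变换采样范围与数据格式均为假设。

```python
import numpy as np

def random_se3(max_trans=0.05, max_rot_deg=15):
    """随机生成一个小幅刚体变换(4x4 齐次矩阵):绕 z 轴旋转 + 平移。"""
    theta = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    T[:3, 3] = np.random.uniform(-max_trans, max_trans, size=3)
    return T

def augment_demo(object_pose, action_poses, T):
    """对物体位姿(4x4)与动作位姿序列((N,4,4))施加同一个刚体变换 T。"""
    new_object = T @ object_pose
    new_actions = np.einsum("ij,njk->nik", T, action_poses)
    return new_object, new_actions

obj = np.eye(4); obj[:3, 3] = [0.4, 0.0, 0.1]             # 物体初始位姿(示意)
acts = np.repeat(np.eye(4)[None], 10, axis=0)             # 10 步末端执行器位姿
acts[:, :3, 3] = np.linspace([0.2, 0, 0.3], [0.4, 0, 0.1], 10)
new_obj, new_acts = augment_demo(obj, acts, random_se3())
# 验证等变性:最后一步动作在物体坐标系下保持不变
rel_before = np.linalg.inv(obj) @ acts[-1]
rel_after = np.linalg.inv(new_obj) @ new_acts[-1]
print(np.allclose(rel_before, rel_after))   # True
```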
zh
[CV-117] A SAM-guided and Match-based Semi-Supervised Segmentation Framework for Medical Imaging
【速读】: 该论文试图解决在数据稀缺场景下半监督医学图像分割中伪标签质量低的问题。解决方案的关键在于引入SAMatch框架,该框架利用预训练的SAM模型生成高置信度的提示,并通过微调的SAM模型来精炼伪标签。SAMatch框架实现了端到端的训练,允许模型之间动态交互,从而在ACDC心脏MRI、BUSI乳腺超声和MRLiver数据集上取得了最先进的结果,分别达到了89.36%、77.76%和80.04%的Dice分数,显著提升了在数据有限环境下的分割性能。
链接: https://arxiv.org/abs/2411.16949
作者: Guoping Xu,Xiaoxue Qian,Hua Chieh Shao,Jax Luo,Weiguo Lu,You Zhang
关键词-EN: SAM-guided Match-based framework, SAM-guided Match-based, aimed at improving, data-scarce scenarios, semi-supervised medical image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This study introduces SAMatch, a SAM-guided Match-based framework for semi-supervised medical image segmentation, aimed at improving pseudo label quality in data-scarce scenarios. While Match-based frameworks are effective, they struggle with low-quality pseudo labels due to the absence of ground truth. SAM, pre-trained on a large dataset, generalizes well across diverse tasks and assists in generating high-confidence prompts, which are then used to refine pseudo labels via fine-tuned SAM. SAMatch is trained end-to-end, allowing for dynamic interaction between the models. Experiments on the ACDC cardiac MRI, BUSI breast ultrasound, and MRLiver datasets show SAMatch achieving state-of-the-art results, with Dice scores of 89.36%, 77.76%, and 80.04%, respectively, using minimal labeled data. SAMatch effectively addresses challenges in semi-supervised segmentation, offering a powerful tool for segmentation in data-limited environments. Code and data are available at this https URL.
zh
[CV-118] Lens Distortion Encoding System Version 1.0
【速读】: 该论文试图解决镜头畸变(Lens Distortion)在电影制作中的精确控制和高效交换问题。解决方案的关键在于引入镜头畸变编码系统 (Lens Distortion Encoding System, LDES),该系统类似于学院色彩编码系统 (Academy Color Encoding System, ACES),但专门针对畸变进行处理。LDES的核心在于利用公共畸变空间生成单一的高质量、可动画的STMap,用于直接将一个视图转换为另一个视图,无需每次拍摄都更换镜头。LDES的镜头配置文件包含两个元素:视图映射纹理 (View Map texture) 和素材映射纹理 (Footage Map texture),每个都标有FOV值。通过视图映射纹理对素材映射纹理进行采样,生成可动画的映射纹理,从而实现对素材的所需畸变。此外,LDES支持从变形镜头到球面镜头的平滑过渡和动画效果,这在实际操作中是前所未有的。LDES 1.0版本使用常见的32位STMap格式进行编码,广泛支持大多数合成软件,直接或通过插件实现。与标准STMap工作流程相比,LDES在球面图像模型中编码绝对像素位置,主要优势在于能够使用较便宜的设备实现昂贵镜头的畸变效果,同时提供更大的艺术控制和前所未有的素材操作能力。
链接: https://arxiv.org/abs/2411.16946
作者: Jakub Maksymilian Fober
关键词-EN: Color Encoding System, Academy Color Encoding, high quality motion, quality motion picture, Distortion Encoding System
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
备注: 7 pages, 1 figure, 2 tables
点击查看摘要
Abstract:Lens Distortion Encoding System (LDES) allows for a distortion-accurate workflow, with a seamless interchange of high quality motion picture images regardless of the lens source. This system is similar in concept to the Academy Color Encoding System (ACES), but for distortion. The presented solution is fully compatible with existing software/plug-in tools for STMapping found in popular production software like Adobe After Effects or DaVinci Resolve. LDES utilizes a common distortion space and produces a single high-quality, animatable STMap used for direct transformation of one view to another, removing the need for lens-swapping for each shoot. The LDES profile of a lens consists of two elements: a View Map texture and a Footage Map texture, each labeled with the FOV value. Direct distortion mapping is produced by sampling of the Footage Map through the View Map. The result, an animatable mapping texture, is then used to sample the footage to a desired distortion. While the Footage Map is specific to a footage, View Maps can be freely combined/transitioned and animated, allowing for effects like a smooth shift from anamorphic to spherical distortion, previously impossible to achieve in practice. The presented LDES Version 1.0 uses the common 32-bit STMap format for encoding, supported by most compositing software, directly or via plug-ins. The difference between the standard STMap workflow and LDES is that LDES encodes absolute pixel position in the spherical image model. The main benefit of this approach is the ability to achieve a distortion look similar to that of a highly expensive lens using less expensive equipment. It also provides greater artistic control and never-before-seen manipulation of footage.
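结合摘要中"用 View Map 采样 Footage Map 得到合成映射,再用它重采样素材"的流程,下面用 torch 的 grid_sample 给出一个概念性示意;坐标约定、归一化方式与示例中的畸变形状均为假设,并非 LDES 规范本身。

```python
import torch
import torch.nn.functional as F

def sample_map(tex, coords):
    """按归一化坐标(值域 [0,1],通道为 (s,t))采样纹理。tex: (C,H,W);coords: (h,w,2)。"""
    grid = coords[None] * 2 - 1                      # 转到 grid_sample 的 [-1,1] 约定
    return F.grid_sample(tex[None], grid, mode="bilinear", align_corners=True)[0]

def ldes_remap(footage, footage_map, view_map):
    """先用 View Map 采样 Footage Map 得到合成映射纹理,再用它重采样素材。"""
    mapping = sample_map(footage_map, view_map).permute(1, 2, 0)   # (h,w,2)
    return sample_map(footage, mapping)

H = W = 64
footage = torch.rand(3, H, W)                                       # 原始素材
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
identity = torch.stack([xs, ys], dim=-1)                            # 恒等 STMap 坐标
footage_map = identity.permute(2, 0, 1)                             # (2,H,W):素材的绝对坐标
view_map = identity + 0.05 * torch.sin(identity * 3.1416)           # 假设的目标畸变 View Map
print(ldes_remap(footage, footage_map, view_map).shape)             # torch.Size([3, 64, 64])
```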
zh
[CV-119] Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory
【速读】: 该论文试图解决在可穿戴设备上实现在线情景记忆检索的问题,即在设备有限的功率和存储容量下,如何实时处理视频流并回答用户关于过去观察到的物体或事件的位置查询。解决方案的关键在于提出了一个名为Egocentric Streaming Object Memory (ESOM)的新框架。ESOM通过一个物体发现模块来检测潜在有趣的物体,一个视觉物体跟踪器来在线跟踪这些物体在视频中的位置,以及一个记忆模块来存储物体的时空坐标和图像表示,从而实现高效查询。该方法在在线情景记忆视觉查询定位任务(OEM-VQL)中表现优异,尤其是在考虑物体发现和跟踪的预言性能时,其成功率(81.92%)显著优于离线方法(55.89%)。
链接: https://arxiv.org/abs/2411.16934
作者: Zaira Manigrasso,Matteo Dunnhofer,Antonino Furnari,Moritz Nottebaum,Antonio Finocchiaro,Davide Marana,Giovanni Maria Farinella,Christian Micheloni
关键词-EN: Episodic memory retrieval, memory retrieval aims, enable wearable devices, Online Episodic Memory, Episodic Memory Visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Episodic memory retrieval aims to enable wearable devices with the ability to recollect from past video observations objects or events that have been observed (e.g., “where did I last see my smartphone?”). Despite the clear relevance of the task for a wide range of assistive systems, current task formulations are based on the “offline” assumption that the full video history can be accessed when the user makes a query, which is unrealistic in real settings, where wearable devices are limited in power and storage capacity. We introduce the novel task of Online Episodic Memory Visual Queries Localization (OEM-VQL), in which models are required to work in an online fashion, observing video frames only once and relying on past computations to answer user queries. To tackle this challenging task, we propose ESOM - Egocentric Streaming Object Memory, a novel framework based on an object discovery module to detect potentially interesting objects, a visual object tracker to track their position through the video in an online fashion, and a memory module to store spatio-temporal object coordinates and image representations, which can be queried efficiently at any moment. Comparisons with different baselines and offline methods show that OEM-VQL is challenging and ESOM is a viable approach to tackle the task, with results outperforming offline methods (81.92 vs 55.89 success rate %) when oracular object discovery and tracking are considered. Our analysis also sheds light on the limited performance of object detection and tracking in egocentric vision, providing a principled benchmark based on the OEM-VQL downstream task to assess progress in these areas.
zh
[CV-120] Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
【速读】: 该论文试图解决视频大语言模型(LLMs)在理解和推理长视频中事件时的时间感知能力不足的问题。解决方案的关键在于提出了Seq2Time,一种数据导向的训练范式,通过利用图像序列和短视频片段来增强长视频的时间感知能力。具体来说,Seq2Time通过将序列位置转换为时间注释,将大规模的图像和片段字幕数据集转换为模拟长视频时间结构的序列,从而实现自监督训练。此外,论文引入了一种新的时间表示方法,统一了图像序列、片段序列和长视频之间的位置信息,促进了序列到时间的知识转移。实验结果表明,该方法在YouCook2和Charades-STA基准测试中显著提升了模型的性能。
链接: https://arxiv.org/abs/2411.16932
作者: Andong Deng,Zhongpai Gao,Anwesa Choudhuri,Benjamin Planche,Meng Zheng,Bin Wang,Terrence Chen,Chen Chen,Ziyan Wu
关键词-EN: large language models, video large language, long videos, Temporal awareness, language models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional information across image sequences, clip sequences, and long videos. Experiments demonstrate the effectiveness of our method, achieving a 27.6% improvement in F1 score and 44.8% in CIDEr on the YouCook2 benchmark and a 14.7% increase in recall on the Charades-STA benchmark compared to the baseline.
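下面给出"把序列位置转换为时间标注"这一数据构造思路的极简示意:第 i 个图像/片段被赋予一个时间区间,并拼出带时间戳的描述文本,从而把图像/片段字幕数据改造成伪长视频训练样本;区间长度与文本模板均为假设。

```python
def sequence_to_time_annotations(captions, clip_duration=4.0):
    """把序列中第 i 个元素映射到 [i*d, (i+1)*d) 秒,生成带时间戳的稠密描述样本。"""
    annotations = []
    for i, cap in enumerate(captions):
        start, end = i * clip_duration, (i + 1) * clip_duration
        annotations.append({
            "start": start,
            "end": end,
            "text": f"From {start:.1f}s to {end:.1f}s, {cap}",
        })
    return annotations

caps = ["a chef chops onions", "the onions are fried in a pan", "sauce is added to the pan"]
for a in sequence_to_time_annotations(caps):
    print(a["text"])
```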
zh
[CV-121] Context-Aware Input Orchestration for Video Inpainting
【速读】: 该论文试图解决传统神经网络驱动的视频修复 (video inpainting) 方法在移动设备处理能力和内存限制下难以提供高质量结果的问题。解决方案的关键在于通过改变输入数据的组成来优化内存使用。具体来说,论文提出了一种动态调整输入帧组成的方法,根据光流和掩码的变化来调整输入帧的比例,从而在快速视觉上下文变化的情况下提高修复视频的质量。
链接: https://arxiv.org/abs/2411.16926
作者: Hoyoung Kim,Azimbek Khudoyberdiev,Seonghwan Jeong,Jihoon Ryoo
关键词-EN: Traditional neural network-driven, deliver high-quality results, mobile device processing, device processing power, Traditional neural
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Traditional neural network-driven inpainting methods struggle to deliver high-quality results within the constraints of mobile device processing power and memory. Our research introduces an innovative approach to optimize memory usage by altering the composition of input data. Typically, video inpainting relies on a predetermined set of input frames, such as neighboring and reference frames, often limited to five-frame sets. Our focus is to examine how varying the proportion of these input frames impacts the quality of the inpainted video. By dynamically adjusting the input frame composition based on optical flow and changes of the mask, we have observed an improvement in various contents including rapid visual context changes.
zh
[CV-122] Deep Convolutional Neural Networks Structured Pruning via Gravity Regularization
【速读】: 该论文试图解决深度卷积神经网络 (DCNNs) 加速中的结构化剪枝问题,特别是现有方法在修改原始架构、复杂实现和长时间微调方面的局限性。解决方案的关键在于提出了一种新颖的物理启发式方法,将重力概念引入到DCNNs的训练阶段。具体来说,该方法通过模拟重力作用,使得卷积滤波器根据其与吸引滤波器的距离和质量关系,调整其权重。重力越强的滤波器其权重被推向零,从而可以被移除,而重力较弱的滤波器则保留重要权重。这种方法在优化滤波器权重的同时,自动评估其重要性,无需复杂的实现或广泛的微调,显著简化了剪枝过程。实验结果表明,该方法在CIFAR数据集上对流行的DCNN架构进行了验证,取得了与现有方法相媲美的效果。
链接: https://arxiv.org/abs/2411.16901
作者: Abdesselam Ferdi
关键词-EN: convolutional neural networks, widely employed strategy, accelerating deep convolutional, deep convolutional neural, Structured pruning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Structured pruning is a widely employed strategy for accelerating deep convolutional neural networks (DCNNs). However, existing methods often necessitate modifications to the original architectures, involve complex implementations, and require lengthy fine-tuning stages. To address these challenges, we propose a novel physics-inspired approach that integrates the concept of gravity into the training stage of DCNNs. In this approach, the gravity is directly proportional to the product of the masses of the convolution filter and the attracting filter, and inversely proportional to the square of the distance between them. We applied this force to the convolution filters, either drawing filters closer to the attracting filter (experiencing weaker gravity) toward non-zero weights or pulling filters farther away (subject to stronger gravity) toward zero weights. As a result, filters experiencing stronger gravity have their weights reduced to zero, enabling their removal, while filters under weaker gravity retain significant weights and preserve important information. Our method simultaneously optimizes the filter weights and ranks their importance, eliminating the need for complex implementations or extensive fine-tuning. We validated the proposed approach on popular DCNN architectures using the CIFAR dataset, achieving competitive results compared to existing methods.
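下面按摘要给出一个引力式正则项的示意:以某个滤波器为吸引子,引力正比于两滤波器"质量"(范数)之积、反比于距离平方,并把引力大小作为各滤波器权重衰减的系数。力的方向与具体施加方式以原文为准,这里只演示"按引力强弱施加差异化惩罚"的思路;吸引子选取与系数 G 均为假设。

```python
import torch
import torch.nn as nn

def gravity_regularizer(conv, attractor_idx=0, G=1e-3, eps=1e-8):
    """引力 g_i = G * m_i * m_att / d_i^2,作为第 i 个滤波器 L2 惩罚的权重。"""
    W = conv.weight                                  # (out_c, in_c, k, k)
    flat = W.flatten(1)
    masses = flat.norm(dim=1)                        # 每个滤波器的"质量"
    attractor = flat[attractor_idx].detach()
    dist2 = ((flat - attractor) ** 2).sum(dim=1) + eps
    gravity = G * masses.detach() * masses[attractor_idx].detach() / dist2.detach()
    gravity[attractor_idx] = 0.0                     # 吸引子自身不受惩罚
    return (gravity * flat.pow(2).sum(dim=1)).sum()  # 按引力加权的 L2 惩罚

conv = nn.Conv2d(16, 32, 3)
reg = gravity_regularizer(conv)                      # 训练时与任务损失相加即可
reg.backward()
print(conv.weight.grad.shape)   # torch.Size([32, 16, 3, 3])
```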
zh
[CV-123] G2SDF: Surface Reconstruction from Explicit Gaussians with Implicit SDFs
【速读】: 该论文试图解决现有3D高斯喷射(3D Gaussian Splatting, 3DGS)方法在提取底层3D表面时面临的挑战,特别是由于其稀疏和显式表示的特性。解决方案的关键在于引入G2SDF方法,通过将神经隐式符号距离场(Signed Distance Field, SDF)集成到3DGS框架中,实现高斯与场景表面的更紧密对齐。具体来说,该方法通过建立高斯不透明度值与其到表面距离的关联,并提出一种归一化函数以适应不同尺度的无界场景,同时利用现成的深度估计器作为伪真值来优化高斯喷射过程。通过这种显式与隐式表示的结合,G2SDF不仅提高了表面重建质量,还保持了3DGS的高效性。
链接: https://arxiv.org/abs/2411.16898
作者: Kunyi Li,Michael Niemeyer,Zeyu Chen,Nassir Navab,Federico Tombari
关键词-EN: Gaussian Splatting, view synthesis methods, achieve remarkable visual, remarkable visual quality, Gaussian Splatting framework
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:State-of-the-art novel view synthesis methods such as 3D Gaussian Splatting (3DGS) achieve remarkable visual quality. While 3DGS and its variants can be rendered efficiently using rasterization, many tasks require access to the underlying 3D surface, which remains challenging to extract due to the sparse and explicit nature of this representation. In this paper, we introduce G2SDF, a novel approach that addresses this limitation by integrating a neural implicit Signed Distance Field (SDF) into the Gaussian Splatting framework. Our method links the opacity values of Gaussians with their distances to the surface, ensuring a closer alignment of Gaussians with the scene surface. To extend this approach to unbounded scenes at varying scales, we propose a normalization function that maps any range to a fixed interval. To further enhance reconstruction quality, we leverage an off-the-shelf depth estimator as pseudo ground truth during Gaussian Splatting optimization. By establishing a differentiable connection between the explicit Gaussians and the implicit SDF, our approach enables high-quality surface reconstruction and rendering. Experimental results on several real-world datasets demonstrate that G2SDF achieves superior reconstruction quality than prior works while maintaining the efficiency of 3DGS.
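摘要提到把高斯的不透明度与其到表面的距离(SDF)关联,并用归一化函数把任意范围映射到固定区间。下面给出一个示意性的映射与一致性约束:离表面越近的高斯不透明度越高;这里的具体函数形式(有理归一化、高斯核)均为假设,并非论文公式。

```python
import torch

def normalize_unbounded(x):
    """把任意范围的有符号距离映射到固定区间 (-1, 1)(归一化形式为假设)。"""
    return x / (1.0 + x.abs())

def opacity_from_sdf(sdf, sharpness=10.0):
    """SDF -> 不透明度:|sdf| 越小(越贴近表面),不透明度越接近 1。"""
    d = normalize_unbounded(sdf)
    return torch.exp(-sharpness * d ** 2)

sdf_at_gaussians = torch.tensor([-0.5, -0.05, 0.0, 0.05, 2.0, 50.0])
print(opacity_from_sdf(sdf_at_gaussians))   # 贴近表面 ≈ 1,远离表面 ≈ 0

# 训练时可把该不透明度与 3DGS 学到的不透明度做一致性约束(示意)
alpha_gs = torch.rand(6, requires_grad=True)
loss_align = (alpha_gs - opacity_from_sdf(sdf_at_gaussians)).pow(2).mean()
loss_align.backward()
```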
zh
[CV-124] PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence
【速读】: 该论文试图解决从变长图像序列中进行无姿态前馈3D重建的问题。解决方案的关键在于提出了PreF3R(Pose-Free Feed-forward 3D Reconstruction),它通过直接从无姿态图像序列中重建3D高斯场,无需相机标定,并在规范坐标系中进行高效的新视角渲染。PreF3R利用DUSt3R的成对3D结构重建能力,并通过空间记忆网络扩展到多视角输入,消除了基于优化的全局对齐需求。此外,PreF3R集成了密集高斯参数预测头,支持可微分光栅化,从而结合光度损失和点图回归损失进行监督,提升了照片真实性和结构准确性。该方法能够在20 FPS下增量重建3D高斯场,实现实时新视角渲染,并在实验中展示了其在无姿态前馈新视角合成任务中的有效性和对未见场景的鲁棒泛化能力。
链接: https://arxiv.org/abs/2411.16877
作者: Zequn Chen,Jiezhi Yang,Heng Yang
关键词-EN: variable length, sequence of variable, Gaussian field, sequence, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:We present PreF3R, Pose-Free Feed-forward 3D Reconstruction from an image sequence of variable length. Unlike previous approaches, PreF3R removes the need for camera calibration and reconstructs the 3D Gaussian field within a canonical coordinate frame directly from a sequence of unposed images, enabling efficient novel-view rendering. We leverage DUSt3R’s ability for pair-wise 3D structure reconstruction, and extend it to sequential multi-view input via a spatial memory network, eliminating the need for optimization-based global alignment. Additionally, PreF3R incorporates a dense Gaussian parameter prediction head, which enables subsequent novel-view synthesis with differentiable rasterization. This allows supervising our model with the combination of photometric loss and pointmap regression loss, enhancing both photorealism and structural accuracy. Given a sequence of ordered images, PreF3R incrementally reconstructs the 3D Gaussian field at 20 FPS, therefore enabling real-time novel-view rendering. Empirical experiments demonstrate that PreF3R is an effective solution for the challenging task of pose-free feed-forward novel-view synthesis, while also exhibiting robust generalization to unseen scenes.
zh
[CV-125] RECAST: Reparameterized Compact weight Adaptation for Sequential Tasks
【速读】: 该论文试图解决增量学习(Incremental Learning)中在资源受限环境下(如边缘设备或移动手机)适应新类别时的高计算开销问题。解决方案的关键是提出了一个名为RECAST(Reparameterized, Compact weight Adaptation for Sequential Tasks)的新方法,该方法通过学习将层权重分解为共享权重模板和极少的模块特定缩放因子或系数,从而显著减少任务特定的可训练参数(少于50个),远低于现有方法如LoRA。RECAST的核心创新在于其神经模仿(Neural Mimicry)的权重重建流程,该流程无需从头预训练,能够在框架内高保真地模拟现有预训练权重,快速适应不同规模和架构的模型。实验结果表明,RECAST在多个数据集上超越了现有技术水平,且其架构无关的特性使其能够无缝集成现有方法,进一步提升性能。
链接: https://arxiv.org/abs/2411.16870
作者: Nazia Tasnim,Bryan A. Plummer
关键词-EN: Incremental learning aims, minimal computational overhead, Incremental learning, computational overhead, aims to adapt
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Incremental learning aims to adapt to new sets of categories over time with minimal computational overhead. Prior work often addresses this task by training efficient task-specific adaptors that modify frozen layer weights or features to capture relevant information without affecting predictions on previously learned categories. While these adaptors are generally more efficient than finetuning the entire network, they still require tens to hundreds of thousands of task-specific trainable parameters even for relatively small networks, making it challenging to operate on resource-constrained environments with high communication costs like edge devices or mobile phones. Thus, we propose Reparameterized, Compact weight Adaptation for Sequential Tasks (RECAST), a novel method that dramatically reduces task-specific trainable parameters to fewer than 50 - several orders of magnitude less than competing methods like LoRA. RECAST accomplishes this efficiency by learning to decompose layer weights into a soft parameter-sharing framework consisting of shared weight templates and very few module-specific scaling factors or coefficients. This soft parameter-sharing framework allows for effective task-wise reparameterization by tuning only these coefficients while keeping templates frozen. A key innovation of RECAST is the novel weight reconstruction pipeline called Neural Mimicry, which eliminates the need for pretraining from scratch. This allows for high-fidelity emulation of existing pretrained weights within our framework and provides quick adaptability to any model scale and architecture. Extensive experiments across six datasets demonstrate RECAST outperforms the state-of-the-art by up to 3% across various scales, architectures, and parameter spaces. Moreover, we show that RECAST’s architecture-agnostic nature allows for seamless integration with existing methods, further boosting performance.
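下面给出"共享权重模板 + 极少量任务系数"这一软参数共享思路的最小示意:层权重由冻结模板的线性组合重参数化而成,每个任务只训练几个组合系数;模板数量、初始化方式以及是否训练 bias 均为假设,并非 RECAST 的具体设计。

```python
import torch
import torch.nn as nn

class TemplateLinear(nn.Module):
    """权重 W = sum_t coeff_t * template_t;模板共享且冻结,任务只学系数。"""
    def __init__(self, in_dim, out_dim, n_templates=4):
        super().__init__()
        templates = torch.randn(n_templates, out_dim, in_dim) / in_dim ** 0.5
        self.register_buffer("templates", templates)                        # 冻结的共享模板
        self.coeffs = nn.Parameter(torch.ones(n_templates) / n_templates)   # 任务专属系数
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        W = torch.einsum("t,toi->oi", self.coeffs, self.templates)  # 重参数化出实际权重
        return x @ W.T + self.bias

layer = TemplateLinear(64, 32, n_templates=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("可训练参数:", trainable)            # 4 个系数 + 32 个 bias = 36
print(layer(torch.randn(8, 64)).shape)     # torch.Size([8, 32])
```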
zh
[CV-126] SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
【速读】: 该论文试图解决3D对象生成和理解的问题,特别是如何高效地应用自回归方法于3D内容生成。解决方案的关键在于引入了一个名为Scale AutoRegressive 3D (SAR3D)的新框架,该框架利用多尺度3D向量量化变分自编码器(VQVAE)来对3D对象进行标记化,从而实现高效的自回归生成和详细理解。通过预测多尺度潜在表示中的下一个尺度而非单一标记,SAR3D显著减少了生成时间,在A6000 GPU上实现了仅需0.82秒的快速3D对象生成。此外,通过微调预训练的大型语言模型(LLM)以处理这些富含层次3D感知信息的标记,SAR3D不仅提升了生成速度和质量,还增强了LLM对3D内容的全面理解和描述能力。
链接: https://arxiv.org/abs/2411.16856
作者: Yongwei Chen,Yushi Lan,Shangchen Zhou,Tengfei Wang,XIngang Pan
关键词-EN: artificial general intelligence, demonstrated remarkable success, large language models, moving closer, general intelligence
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.
zh
[CV-127] Open Vocabulary Monocular 3D Object Detection
【速读】: 该论文试图解决开放词汇单目3D物体检测问题,即从单张RGB图像中检测和定位3D空间中的物体,且不局限于预定义的类别集合。解决方案的关键在于提出了一种类别无关的方法,该方法利用开放词汇的2D检测器,并通过将2D边界框提升到3D空间来实现3D物体检测。这种方法将物体在2D中的识别和定位与3D边界框估计任务解耦,从而实现了对未见类别的泛化能力。此外,论文还提出了一种目标感知的评估协议,以解决现有数据集中的不一致性问题,提高了模型性能评估的可靠性。实验结果表明,该方法在Omni3D数据集上对新颖物体类别的零样本3D检测表现出色,验证了其强大的泛化能力。
链接: https://arxiv.org/abs/2411.16833
作者: Jin Yao,Hao Gu,Xuweiyi Chen,Jiayun Wang,Zezhou Cheng
关键词-EN: single RGB image, single RGB, RGB image, pioneer the study, aims to detect
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:In this work, we pioneer the study of open-vocabulary monocular 3D object detection, a novel task that aims to detect and localize objects in 3D space from a single RGB image without limiting detection to a predefined set of categories. We formalize this problem, establish baseline methods, and introduce a class-agnostic approach that leverages open-vocabulary 2D detectors and lifts 2D bounding boxes into 3D space. Our approach decouples the recognition and localization of objects in 2D from the task of estimating 3D bounding boxes, enabling generalization across unseen categories. Additionally, we propose a target-aware evaluation protocol to address inconsistencies in existing datasets, improving the reliability of model performance assessment. Extensive experiments on the Omni3D dataset demonstrate the effectiveness of the proposed method in zero-shot 3D detection for novel object categories, validating its robust generalization capabilities. Our method and evaluation protocols contribute towards the development of open-vocabulary object detection models that can effectively operate in real-world, category-diverse environments.
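摘要中的类别无关基线把开放词汇 2D 检测框"提升"到 3D,下面给出一个粗略的几何示意:用相机内参和一个给定深度把 2D 框中心反投影到 3D,并按透视关系换算物理尺寸;深度来源与长度启发式均为假设,仅说明 2D-to-3D 提升的基本几何,并非论文的完整方法。

```python
import numpy as np

def lift_2d_box_to_3d(box2d, depth, K, assumed_length=None):
    """box2d = (x1, y1, x2, y2) 像素坐标;depth 为框中心深度(米);K 为 3x3 内参。"""
    x1, y1, x2, y2 = box2d
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    fx, fy, px, py = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    center = np.array([(cx - px) * depth / fx, (cy - py) * depth / fy, depth])
    width = (x2 - x1) * depth / fx              # 2D 宽度按深度换算成米
    height = (y2 - y1) * depth / fy
    length = assumed_length if assumed_length is not None else (width + height) / 2
    return {"center": center, "dims": (width, height, length)}

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])
box3d = lift_2d_box_to_3d((500, 150, 700, 300), depth=12.0, K=K)
print(box3d["center"], box3d["dims"])
```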
zh
[CV-128] Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
【速读】: 该论文试图解决生成式图像编辑中恶意篡改人像所带来的隐私和身份安全问题。解决方案的关键在于提出了一种名为FaceLock的新方法,该方法通过优化对抗性扰动(adversarial perturbations)来破坏或显著改变生物识别信息(biometric information),从而使编辑后的图像在生物识别上无法被识别。FaceLock将面部识别和视觉感知整合到扰动优化过程中,提供了对各种编辑尝试的强大保护。此外,论文还指出了现有评估指标的缺陷,并强调了可靠评估保护措施的重要性。实验结果表明,FaceLock在防御恶意编辑方面优于基线方法,并且对净化技术具有鲁棒性。
链接: https://arxiv.org/abs/2411.16832
作者: Hanhui Wang,Yihua Zhang,Ruizheng Bai,Yue Zhao,Sijia Liu,Zhengzhong Tu
关键词-EN: raising ethical concerns, Recent advancements, enabling creative edits, made generative image, enabling creative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: GitHub: this https URL
点击查看摘要
Abstract:Recent advancements in diffusion models have made generative image editing more accessible, enabling creative edits but raising ethical concerns, particularly regarding malicious edits to human portraits that threaten privacy and identity security. Existing protection methods primarily rely on adversarial perturbations to nullify edits but often fail against diverse editing requests. We propose FaceLock, a novel approach to portrait protection that optimizes adversarial perturbations to destroy or significantly alter biometric information, rendering edited outputs biometrically unrecognizable. FaceLock integrates facial recognition and visual perception into perturbation optimization to provide robust protection against various editing attempts. We also highlight flaws in commonly used evaluation metrics and reveal how they can be manipulated, emphasizing the need for reliable assessments of protection. Experiments show FaceLock outperforms baselines in defending against malicious edits and is robust against purification techniques. Ablation studies confirm its stability and broad applicability across diffusion-based editing algorithms. Our work advances biometric defense and sets the foundation for privacy-preserving practices in image editing. The code is available at: this https URL.
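下面给出"优化对抗扰动以破坏生物特征信息"这一思路的示意:在 L_inf 预算内用 PGD 让受保护图像的人脸嵌入尽量远离原始嵌入。示例中用一个随机小网络代替真实的人脸识别器,步数、预算等超参数均为假设,并非 FaceLock 的官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def biometric_cloak(image, face_encoder, steps=50, eps=8 / 255, alpha=1 / 255):
    """PGD:最大化 '扰动后嵌入与原始嵌入的距离',扰动限制在 [-eps, eps]。"""
    with torch.no_grad():
        target_emb = face_encoder(image)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = face_encoder((image + delta).clamp(0, 1))
        loss = -F.cosine_similarity(emb, target_emb, dim=-1).mean()  # 越大表示越不相似
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()      # 沿梯度符号上升
            delta.clamp_(-eps, eps)                 # 投影回扰动预算
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)

# 用随机小网络代替真实人脸识别器,仅作演示
encoder = nn.Sequential(nn.Conv2d(3, 8, 3, 2, 1), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 32 * 32, 128))
img = torch.rand(1, 3, 64, 64)
protected = biometric_cloak(img, encoder)
print((protected - img).abs().max().item() <= 8 / 255)   # True:扰动保持不可察觉
```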
zh
[CV-129] CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions
【速读】: 该论文旨在解决使用噪声大、网络爬取的图像-文本对进行视觉-语言预训练(如CLIP)时可能存在的性能限制问题。论文提出了一种利用合成描述作为替代方案的方法,并通过两个关键设计来优化这一过程:首先,通过观察到短合成描述通常比完整长度的描述带来更高的性能,因此仅将部分合成描述输入文本编码器;其次,引入一个自回归描述生成器,该生成器通过基于配对图像输入和网络爬取的文本描述来预测由先进的多模态大语言模型(MLLMs)生成的完整长度合成描述。实验结果表明,该框架显著提升了跨模态检索任务中的零样本性能,并在MSCOCO和Flickr30K数据集上达到了新的最先进(SOTA)结果。此外,训练后的视觉编码器还能增强LLaVA的视觉能力,在多个MLLM基准测试中显示出显著的改进。
链接: https://arxiv.org/abs/2411.16828
作者: Yanqing Liu,Xianhang Li,Zeyu Wang,Bingchen Zhao,Cihang Xie
关键词-EN: limit vision-language pretraining, Previous works show, pretraining like CLIP, CLIP and propose, synthetic captions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages
点击查看摘要
Abstract:Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. Firstly, by observing a strong inverse effect in learning with synthetic captions – the short synthetic captions can generally lead to MUCH higher performance than full-length ones – we therefore fed only partial synthetic captions to the text encoder. Secondly, we incorporate an autoregressive captioner to mimic the recaptioning process – by conditioning on the paired image input and web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance in cross-modal retrieval tasks, setting new SOTA results on MSCOCO and Flickr30K. Moreover, such trained vision encoders can enhance the visual capability of LLaVA, showing strong improvements on a range of MLLM benchmarks. Our project page is this https URL.
zh
[CV-130] Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
【速读】: 该论文试图解决视觉编码器(Vision Encoder, VE)与大语言模型(Large Language Model, LLM)之间的认知错位(cognitive misalignment)问题。具体来说,VE对视觉信息的表示可能与LLM的认知框架不完全对齐,导致视觉特征超出了语言模型的解释范围。解决方案的关键在于提出了一种名为实体增强认知对齐(Entity-Enhanced Cognitive Alignment, EECA)的方法,该方法通过多粒度监督生成视觉丰富且对齐良好的标记(tokens),这些标记不仅能够融入LLM的嵌入空间,还能与LLM的认知框架对齐,从而显著提升视觉-语言模型(Large Vision-Language Models, LVLMs)在地标识别任务中的性能。
链接: https://arxiv.org/abs/2411.16824
作者: Yaqi Zhao,Yuanyang Yin,Lin Li,Mingan Lin,Victor Shea-Jay Huang,Siwei Chen,Weipeng Chen,Baoqun Yin,Zenan Zhou,Wentao Zhang
关键词-EN: LLM cognitive framework, LLM, LLM cognitive, cognitive, language model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as vision backbone. However, these models frequently encounter a core issue of “cognitive misalignment” between the vision encoder (VE) and the large language model (LLM). Specifically, the VE’s representation of visual information may not fully align with LLM’s cognitive framework, leading to a mismatch where visual features exceed the language model’s interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data-images whose ambiguous visual representations challenge the VE’s interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits LVLM’s capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the LLM’s embedding space but also align with the LLM’s cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.
zh
[CV-131] DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow
【速读】: 该论文试图解决现有3D生成方法在生成形状时由于计算限制而缺乏几何细节的问题。解决方案的关键在于提出了DetailGen3D,一种专门设计用于增强生成3D形状的生成式方法。其核心创新是通过在潜在空间中直接建模粗到细的转换过程,利用数据依赖的流来避免大规模3D生成模型的计算开销。此外,引入了一种令牌匹配策略,确保在细化过程中精确的空间对应关系,从而在保留全局结构的同时实现局部细节的合成。通过精心设计训练数据以匹配合成粗形状的特征,该方法能够有效增强各种3D生成和重建方法生成的形状,从单视图到稀疏多视图输入。实验结果表明,DetailGen3D在保持训练效率的同时,实现了高保真的几何细节合成。
链接: https://arxiv.org/abs/2411.16820
作者: Ken Deng,Yuanchen Guo,Jingxiang Sun,Zixin Zou,Yangguang Li,Xin Cai,Yanpei Cao,Yebin Liu,Ding Liang
关键词-EN: rapidly create shapes, single views, rapidly create, outputs often lack, Modern
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: this https URL
点击查看摘要
Abstract:Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.
zh
[CV-132] Pathways on the Image Manifold: Image Editing via Video Generation
【速读】: 该论文试图解决图像编辑中复杂指令遵循不准确和图像保真度受损的问题。解决方案的关键在于将图像编辑重新定义为时间过程,利用预训练的视频生成模型(image-to-video models)来创建从原始图像到目标编辑的平滑过渡。这种方法通过连续遍历图像流形(image manifold),确保编辑的一致性同时保留原始图像的关键元素,从而在基于文本的图像编辑任务中实现了最先进的成果,显著提升了编辑准确性和图像保真度。
链接: https://arxiv.org/abs/2411.16819
作者: Noam Rotstein,Gal Yona,Daniel Silver,Roy Velich,David Bensaïd,Ron Kimmel
关键词-EN: Recent advances, shown remarkable progress, image, image editing, image diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image’s key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.
zh
[CV-133] SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
【速读】: 该论文试图解决自主机器人(如自动驾驶车辆)在多样化驾驶场景中进行安全测试时,现有神经辐射场(NeRF)方法在传感器数据(特别是相机和激光雷达数据)实时渲染速度低下的问题。解决方案的关键在于提出了SplatAD,这是首个基于3D高斯喷射(3D Gaussian Splatting, 3DGS)的方法,能够实现相机和激光雷达数据的现实、实时渲染。SplatAD通过专门设计的算法优化渲染效率,准确模拟了滚动快门效应、激光雷达强度和激光雷达射线丢失等关键传感器特定现象,从而在保持高质量渲染的同时,显著提升了渲染速度,比基于NeRF的方法快一个数量级。
链接: https://arxiv.org/abs/2411.16816
作者: Georg Hess,Carl Lindström,Maryam Fatemi,Christoffer Petersson,Lennart Svensson
关键词-EN: Ensuring the safety, requires extensive testing, diverse driving scenarios, self-driving vehicles, requires extensive
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Ensuring the safety of autonomous robots, such as self-driving vehicles, requires extensive testing across diverse driving scenarios. Simulation is a key ingredient for conducting such testing in a cost-effective and scalable way. Neural rendering methods have gained popularity, as they can build simulation environments from collected logs in a data-driven manner. However, existing neural radiance field (NeRF) methods for sensor-realistic rendering of camera and lidar data suffer from low rendering speeds, limiting their applicability for large-scale testing. While 3D Gaussian Splatting (3DGS) enables real-time rendering, current methods are limited to camera data and are unable to render lidar data essential for autonomous driving. To address these limitations, we propose SplatAD, the first 3DGS-based method for realistic, real-time rendering of dynamic scenes for both camera and lidar data. SplatAD accurately models key sensor-specific phenomena such as rolling shutter effects, lidar intensity, and lidar ray dropouts, using purpose-built algorithms to optimize rendering efficiency. Evaluation across three autonomous driving datasets demonstrates that SplatAD achieves state-of-the-art rendering quality with up to +2 PSNR for NVS and +3 PSNR for reconstruction while increasing rendering speed over NeRF-based methods by an order of magnitude. See this https URL for our project page.
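滚动快门建模是上述传感器仿真的关键细节之一:图像各行的曝光时刻不同,渲染时需要为每一行插值相机位姿。下面是一个只对平移分量做线性插值的极简示意(行数、读出时间均为假设值,旋转可另用 SLERP 处理,并非 SplatAD 的实现细节):

```python
import numpy as np

H = 8                      # 图像行数(示例)
readout_time = 0.03        # 整帧读出时间,单位秒(假设值)
t_start = 0.0              # 首行曝光时刻

pose_t0 = np.array([0.0, 0.0, 0.0])   # 帧开始时刻的相机位置
pose_t1 = np.array([0.3, 0.0, 0.05])  # 帧结束时刻的相机位置(自车在运动)

def row_pose(row):
    """返回第 row 行对应的曝光时刻及其插值相机位置。"""
    alpha = row / (H - 1)                   # 该行在读出过程中的相对位置
    t_row = t_start + alpha * readout_time  # 该行的曝光时刻
    return t_row, (1 - alpha) * pose_t0 + alpha * pose_t1

for r in range(H):
    t_r, p_r = row_pose(r)
    print(f"row {r}: t={t_r:.4f}s, cam_pos={p_r}")
```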
zh
[CV-134] FREE-Merging: Fourier Transform for Model Merging with Lightweight Experts
【速读】: 该论文试图解决在模型规模快速扩展的背景下,单一微调模型难以满足多样化部署需求的问题。解决方案的关键在于提出了一种名为FR-Merging的创新方法,该方法利用频域信息高效过滤有害的特定任务信息,从而最小化任务冲突对主干网络的影响,同时引入轻量级任务专家模块,在推理时动态集成以补偿信息损失。这一框架(FREE-Merging)在训练成本、推理速度、存储需求和性能之间实现了平衡,并展示了在计算机视觉(CV)、自然语言处理(NLP)和多模态(Multi-Modal)领域的多任务适应性。
链接: https://arxiv.org/abs/2411.16815
作者: Shenghe Zheng,Hongzhi Wang
关键词-EN: open-source model weights, current era, era of rapid, rapid expansion, increasing availability
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 5 figures
点击查看摘要
Abstract:In the current era of rapid expansion in model scale, there is an increasing availability of open-source model weights for various tasks. However, the capabilities of a single fine-tuned model often fall short of meeting diverse deployment needs. Model merging has thus emerged as a widely focused method for efficiently building a single model tailored for multiple tasks combined from existing models. Nevertheless, existing model merging methods face challenging trade-offs between performance and deployment costs, primarily due to task conflicts within the merged network. Our analysis of neural networks reveals that some task-specific information introduced by fine-tuning minimally enhances performance but heavily impacts generalization, leading to task conflicts. To mitigate the impact of this information, we propose FR-Merging, an innovative method that leverages frequency domain information to efficiently filter harmful specialized information, thereby minimizing the impact of task conflicts on the backbone with minimal cost. Since performance loss is inevitable with cost-free merging methods, we introduce a lightweight task-specific expert that can be dynamically integrated during inference to compensate for information loss. This framework, FREE-Merging (FR-Merging with lightweight experts), strikes a balanced trade-off between training cost, inference speed, storage requirements, and performance. We demonstrate the effectiveness of both FR-Merging and FREE-Merging on multiple tasks across CV, NLP, and Multi-Modal domains and show that they can be flexibly adapted to meet specific needs.
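下面用一个极简示意说明“对任务向量做频域过滤后再叠加回主干”的思路:任务向量定义为微调权重减去预训练权重,经 FFT 低通过滤后合并。滤波方式与截止比例均为假设,仅示意 FR-Merging 的大致流程,并非官方实现。

```python
import numpy as np

def lowpass_task_vector(w_finetuned, w_base, keep_ratio=0.3):
    """过滤任务向量中的高频成分,近似去除“有害的特定任务信息”。"""
    tv = w_finetuned - w_base                 # 任务向量
    spec = np.fft.fft(tv.ravel())             # 展平后做一维 FFT
    k = int(len(spec) * keep_ratio / 2)
    mask = np.zeros_like(spec)
    mask[:k] = 1.0                            # 保留低频(正频率部分)
    mask[-k:] = 1.0                           # 保留低频(负频率部分)
    filtered = np.fft.ifft(spec * mask).real
    return filtered.reshape(tv.shape)

def merge(w_base, finetuned_list, keep_ratio=0.3, scale=1.0):
    """把多个任务过滤后的任务向量叠加回主干权重。"""
    merged = w_base.copy()
    for w_ft in finetuned_list:
        merged += scale * lowpass_task_vector(w_ft, w_base, keep_ratio)
    return merged

w_base = np.random.randn(64, 64)
w_a = w_base + 0.01 * np.random.randn(64, 64)   # 任务 A 的微调权重(示例)
w_b = w_base + 0.01 * np.random.randn(64, 64)   # 任务 B 的微调权重(示例)
print(merge(w_base, [w_a, w_b]).shape)
```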
zh
[CV-135] Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation
【速读】: 该论文试图解决从离散手语片段生成连续手语视频的问题,关键在于确保视频中手语动作的流畅过渡和自然衔接。传统方法通过简单拼接孤立的手语片段,往往导致突兀的过渡,破坏视频的连贯性。为此,论文提出了一种名为Sign-D2C的新框架,采用条件扩散模型(conditional diffusion model)来合成上下文平滑的过渡帧,从而实现连续手语序列的无缝构建。该方法通过在长时间手语视频中随机遮蔽片段,将过渡帧生成的无监督问题转化为有监督训练任务,模型通过去噪高斯噪声来预测这些遮蔽帧,条件是周围的手语观察结果,从而处理复杂的、非结构化的过渡。在推理阶段,采用线性插值填充策略(linearly interpolating padding strategy)初始化缺失帧,通过边界帧之间的插值提供稳定的基底,供扩散模型进行迭代细化。实验结果表明,该方法在多个数据集上能有效生成连续、自然的手语视频。
链接: https://arxiv.org/abs/2411.16810
作者: Shengeng Tang,Jiayi He,Lechao Cheng,Jingjing Wu,Dan Guo,Richang Hong
关键词-EN: Generating continuous sign, preserve natural flow, Generating continuous, flow and meaning, continuous sign language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.
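推理阶段的“线性插值填充”可以用几行代码说明:在前后两段片段的边界帧之间做线性插值,得到缺失过渡帧的初始化,再交由条件扩散模型迭代去噪细化。以下为示意,帧以特征向量表示,维度为假设值。

```python
import numpy as np

def interpolate_padding(prev_frame, next_frame, num_missing):
    """在两段手语片段的边界帧之间线性插值出 num_missing 个过渡帧。"""
    frames = []
    for i in range(1, num_missing + 1):
        alpha = i / (num_missing + 1)
        frames.append((1 - alpha) * prev_frame + alpha * next_frame)
    return np.stack(frames)

prev_f = np.zeros(128)          # 前一片段的最后一帧(示例特征)
next_f = np.ones(128)           # 后一片段的第一帧(示例特征)
init_transition = interpolate_padding(prev_f, next_f, num_missing=5)
print(init_transition.shape)    # (5, 128),随后作为扩散模型迭代细化的稳定起点
```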
zh
[CV-136] InTraGen: Trajectory-controlled Video Generation for Object Interactions
【速读】: 该论文试图解决文本到视频 (Text-to-Video, T2V) 生成中多对象交互场景的真实性和准确性问题。解决方案的关键在于引入了一个名为 InTraGen 的管道,该管道通过轨迹引导生成对象交互场景,并提出了一个新的多模态交互编码管道,结合对象 ID 注入机制,以增强对象与环境之间的交互。此外,论文还提出了四个新的数据集和一个轨迹质量评估指标,用于评估 InTraGen 的性能,从而在视觉保真度和定量性能方面实现了显著改进。
链接: https://arxiv.org/abs/2411.16804
作者: Zuhao Liu,Aleksandar Yanev,Ahmad Mahmood,Ivan Nikolov,Saman Motamed,Wei-Shi Zheng,Xi Wang,Luc Van Gool,Danda Pani Paudel
关键词-EN: created scenes, Advances, Advances in video, generation, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Advances in video generation have significantly improved the realism and quality of created scenes. This has fueled interest in developing intuitive tools that let users leverage video generation as world simulators. Text-to-video (T2V) generation is one such approach, enabling video creation from text descriptions only. Yet, due to the inherent ambiguity in texts and the limited temporal information offered by text prompts, researchers have explored additional control signals like trajectory-guided systems, for more accurate T2V generation. Nonetheless, methods to evaluate whether T2V models can generate realistic interactions between multiple objects are lacking. We introduce InTraGen, a pipeline for improved trajectory-based generation of object interaction scenarios. We propose 4 new datasets and a novel trajectory quality metric to evaluate the performance of the proposed InTraGen. To achieve object interaction, we introduce a multi-modal interaction encoding pipeline with an object ID injection mechanism that enriches object-environment interactions. Our results demonstrate improvements in both visual fidelity and quantitative performance. Code and datasets are available at this https URL
zh
[CV-137] Abnormality-Driven Representation Learning for Radiology Imaging
【速读】: 该论文试图解决放射学领域缺乏任务无关的表示模型的问题,这是由于3D成像的计算和数据需求以及放射学扫描的解剖复杂性所导致的。解决方案的关键在于提出了一种名为CLEAR的框架,该框架利用从2D切片中提取的嵌入向量,并通过基于注意力机制的聚合方法来高效预测临床终点。具体来说,论文引入了一种新的方法——病变增强对比学习(Lesion-enhanced Contrastive Learning, LeCL),通过在CT扫描的不同位置的2D轴向切片中提取由异常驱动的视觉表示。通过训练三种不同的架构(Vision Transformers, Vision State Space Models, Gated Convolutional Neural Networks),CLEAR框架在肿瘤病变定位、肺病检测和患者分期三个临床任务中表现优异,显著优于现有的基础模型,同时具有更高的计算和数据效率。
链接: https://arxiv.org/abs/2411.16803
作者: Marta Ligero,Tim Lenz,Georg Wölflein,Omar S.M. El Nahhas,Daniel Truhn,Jakob Nikolas Kather
关键词-EN: deep learning pipelines, Vision State Space, deep learning, models, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To date, the most common approach for radiology deep learning pipelines is the use of end-to-end 3D networks based on models pre-trained on other tasks, followed by fine-tuning on the task at hand. In contrast, adjacent medical fields such as pathology, which focus on 2D images, have effectively adopted task-agnostic foundational models based on self-supervised learning (SSL), combined with weakly-supervised deep learning (DL). However, the field of radiology still lacks task-agnostic representation models due to the computational and data demands of 3D imaging and the anatomical complexity inherent to radiology scans. To address this gap, we propose CLEAR, a framework for radiology images that uses extracted embeddings from 2D slices along with attention-based aggregation for efficiently predicting clinical endpoints. As part of this framework, we introduce lesion-enhanced contrastive learning (LeCL), a novel approach to obtain visual representations driven by abnormalities in 2D axial slices across different locations of the CT scans. Specifically, we trained single-domain contrastive learning approaches using three different architectures: Vision Transformers, Vision State Space Models and Gated Convolutional Neural Networks. We evaluate our approach across three clinical tasks: tumor lesion location, lung disease detection, and patient staging, benchmarking against four state-of-the-art foundation models, including BiomedCLIP. Our findings demonstrate that CLEAR using representations learned through LeCL, outperforms existing foundation models, while being substantially more compute- and data-efficient.
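CLEAR 框架的基本结构是“逐切片提取 2D 嵌入,再用基于注意力的聚合预测临床终点”。下面给出一个极简的注意力池化示意(嵌入维度、类别数为假设值,并非论文的具体网络设计):

```python
import torch
import torch.nn as nn

class AttnAggregator(nn.Module):
    """对一个 CT 体的所有切片嵌入做注意力加权聚合,再接分类头。"""
    def __init__(self, dim=512, n_classes=2):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, slice_emb):                             # slice_emb: (num_slices, dim)
        attn = torch.softmax(self.score(slice_emb), dim=0)    # 每个切片的权重
        volume_emb = (attn * slice_emb).sum(dim=0)            # 加权求和得到体级表示
        return self.head(volume_emb), attn.squeeze(-1)

model = AttnAggregator()
emb = torch.randn(120, 512)       # 某个 CT 体的 120 张轴向切片嵌入(示例)
logits, weights = model(emb)
print(logits.shape, weights.shape)
```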
zh
[CV-138] Leveraging Foundation Models To learn the shape of semi-fluid deformable objects
【速读】: 该论文试图解决可变形物体(如焊池)的特征化问题,以定义稳定的特征用于进一步的运动控制目标。解决方案的关键在于采用了两种不同的管道:第一种是通过生成模型(Generative Model)在教师-学生框架下对流体可变形物体进行特征化;第二种是利用基础模型(Foundation Models)作为教师,直接对图像中的物体进行特征化,无需预训练和数据集。知识蒸馏(Knowledge Distillation)的性能表现显著,学生网络能够以13.4像素的误差学习并提取物体的关键点,而教师模型在物体掩码(Object Mask)的像素级信息提取上达到了75.26%的平均交并比(mIoU)。
链接: https://arxiv.org/abs/2411.16802
作者: Omar El Assal(VIBOT, ImViA, Alstom Transport),Carlos M. Mateo(ICB),Sebastien Ciron(Alstom Transport),David Fofi(VIBOT, ImViA)
关键词-EN: deformable objects, difficulties imposed, detection of representative, representative keypoints, manipulate deformable objects
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:One of the difficulties imposed on the manipulation of deformable objects is their characterization and the detection of representative keypoints for the purpose of manipulation. A keen interest was manifested by researchers in the last decade to characterize and manipulate deformable objects of non-fluid nature, such as clothes and ropes. Even though several propositions were made regarding object characterization, researchers were always confronted with the need for pixel-level information of the object through images to extract relevant information. This is usually accomplished by means of segmentation networks trained on manually labeled data for this purpose. In this paper, we address the subject of characterizing the weld pool to define stable features that serve as information for further motion control objectives. We achieve this by employing different pipelines. The first one consists of characterizing fluid deformable objects through the use of a generative model that is trained using a teacher-student framework. And in the second one we leverage foundation models by using them as teachers to characterize the object in the image, without the need of any pre-training and any dataset. The performance of knowledge distillation from foundation models into a smaller generative model shows prominent results in the characterization of deformable objects. The student network was capable of learning to retrieve the keypoints of the object with an error of 13.4 pixels. And the teacher was evaluated based on its capacity to retrieve pixel-level information represented by the object mask, with a mean Intersection Over Union (mIoU) of 75.26%.
zh
[CV-139] Controllable Human Image Generation with Personalized Multi-Garments
【速读】: 该论文试图解决在可控人像生成中,由于高质量参考服装图像数据集的获取困难所导致的瓶颈问题。解决方案的关键在于提出了一种名为 BootComp 的新框架,该框架基于文本到图像扩散模型,通过构建一个大规模的合成数据集来解决数据获取问题。具体来说,BootComp 引入了一个数据生成管道,能够从每个人像图像中提取任意参考服装图像,并结合感知相似性过滤策略来确保数据质量。最终,利用这个合成数据集训练一个具有两条并行去噪路径的扩散模型,该模型能够使用多个服装图像作为条件生成保留细节的人像图像。此外,该框架还展示了其在时尚领域中不同类型的基于参考的生成任务中的广泛适用性,如虚拟试穿和基于其他条件(如姿态、面部等)的可控人像生成。
链接: https://arxiv.org/abs/2411.16801
作者: Yisol Choi,Sangkyung Kwak,Sihyun Yu,Hyungwon Choi,Jinwoo Shin
关键词-EN: reference garment images, present BootComp, human image, human, human image generation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in human image and extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide-applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.
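其中的“感知相似度过滤”可以按如下思路实现:比较人像中呈现的服装与提取出的参考服装在特征空间中的相似度,低于阈值的合成样本剔除。以下为极简示意,encode() 是假设的特征提取接口(实际可替换为任意感知编码器),阈值为示例值。

```python
import torch
import torch.nn.functional as F

def encode(img):
    # 占位:真实实现中应为某个图像特征编码器的前向输出
    return torch.randn(512)

def filter_pairs(pairs, thresh=0.6):
    """保留服装图与人像中对应服装特征相似度足够高的合成数据对。"""
    kept = []
    for human_crop, garment_img in pairs:
        sim = F.cosine_similarity(encode(human_crop), encode(garment_img), dim=0)
        if sim.item() >= thresh:
            kept.append((human_crop, garment_img))
    return kept

fake_pairs = [(torch.rand(3, 256, 192), torch.rand(3, 256, 192)) for _ in range(8)]
print(len(filter_pairs(fake_pairs)))
```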
zh
[CV-140] Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image
【速读】: 该论文试图解决现有4D内容生成方法在物理一致性和控制精度方面的不足,特别是依赖于预训练视频扩散模型(video diffusion models)的方法,这些方法缺乏对现实世界物理原理的深刻理解,且计算成本高。解决方案的关键在于提出了Phys4DGen框架,该框架通过将物理模拟集成到4D生成流程中,确保生成的内容遵循基本的物理定律。Phys4DGen引入了物理感知模块(Physical Perception Module, PPM),从输入图像中推断出3D对象的材料属性和结构组件,从而实现精确的下游模拟。此外,Phys4DGen通过消除动态建模阶段的迭代优化步骤,显著加速了4D生成过程,并允许用户通过调整外部力来直观控制生成内容的移动速度和方向,实现高度可调且物理上合理的动画效果。
链接: https://arxiv.org/abs/2411.16800
作者: Jiajing Lin,Zhenzhong Wang,Shu Jiang,Yongjie Hou,Min Jiang
关键词-EN: involves creating dynamic, specific input conditions, video diffusion models, generation involves creating, involves creating
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The task of 4D content generation involves creating dynamic 3D models that evolve over time in response to specific input conditions, such as images. Existing methods rely heavily on pre-trained video diffusion models to guide 4D content dynamics, but these approaches often fail to capture essential physical principles, as video diffusion models lack a robust understanding of real-world physics. Moreover, these models face challenges in providing fine-grained control over dynamics and exhibit high computational costs. In this work, we propose Phys4DGen, a novel, high-efficiency framework that generates physics-compliant 4D content from a single image with enhanced control capabilities. Our approach uniquely integrates physical simulations into the 4D generation pipeline, ensuring adherence to fundamental physical laws. Inspired by the human ability to infer physical properties visually, we introduce a Physical Perception Module (PPM) that discerns the material properties and structural components of the 3D object from the input image, facilitating accurate downstream simulations. Phys4DGen significantly accelerates the 4D generation process by eliminating iterative optimization steps in the dynamics modeling phase. It allows users to intuitively control the movement speed and direction of generated 4D content by adjusting external forces, achieving finely tunable, physically plausible animations. Extensive evaluations show that Phys4DGen outperforms existing methods in both inference speed and physical realism, producing high-quality, controllable 4D content.
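为说明“通过调整外力直观控制运动速度与方向”的做法,下面给出一个对点集(可视作高斯核中心)施加外力、用半隐式欧拉积分推进的极简示意。积分方式与参数均为假设,与论文采用的具体物理模拟器无关。

```python
import numpy as np

def simulate(points, velocities, external_force, mass=1.0, dt=1 / 30,
             steps=30, gravity=(0.0, -9.8, 0.0)):
    """半隐式欧拉积分:先更新速度,再更新位置,返回整段 4D 轨迹。"""
    traj = [points.copy()]
    g = np.asarray(gravity)
    for _ in range(steps):
        accel = g + external_force / mass      # 合加速度 = 重力 + 用户外力 / 质量
        velocities = velocities + accel * dt
        points = points + velocities * dt
        traj.append(points.copy())
    return np.stack(traj)                      # (steps + 1, N, 3)

pts = np.random.rand(1000, 3)                  # 某物体采样出的 1000 个点(示例)
vel = np.zeros_like(pts)
frames = simulate(pts, vel, external_force=np.array([2.0, 0.0, 0.0]))
print(frames.shape)
```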
zh
[CV-141] One is Plenty: A Polymorphic Feature Interpreter for Immutable Heterogeneous Collaborative Perception
【速读】: 该论文试图解决在自动驾驶中的协同感知(Collaborative Perception)中,由于感知网络的不可变异质性(Immutable Heterogeneity)导致的语义差异问题。现有的方法要么需要为每种新代理类型训练新的解释器(Interpreter),限制了扩展性,要么依赖于通过中间标准化语义空间的二阶段解释,导致累积的语义损失。论文提出的解决方案是PolyInter,一种多态特征解释器(Polymorphic Feature Interpreter)。其关键在于通过一个扩展点(Extension Point),新代理可以无缝集成,只需覆盖其特定的提示(Prompts),这些提示是用于指导解释的可学习参数,而其余参数则重用PolyInter的现有参数。这种设计确保了单一解释器足以适应多种代理,并将它们的特征解释到自我代理的语义空间中,从而在保持扩展性的同时,减少了语义损失。
链接: https://arxiv.org/abs/2411.16799
作者: Yuchen Xia,Quan Yuan,Guiyang Luo,Xiaoyuan Fu,Yang Li,Xuanhan Zhu,Tianyou Luo,Siheng Chen,Jinglin Li
关键词-EN: autonomous driving significantly, driving significantly enhances, Collaborative perception, autonomous driving, driving significantly
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Collaborative perception in autonomous driving significantly enhances the perception capabilities of individual agents. Immutable heterogeneity in collaborative perception, where agents have different and fixed perception networks, presents a major challenge due to the semantic gap in their exchanged intermediate features without modifying the perception networks. Most existing methods bridge the semantic gap through interpreters. However, they either require training a new interpreter for each new agent type, limiting extensibility, or rely on a two-stage interpretation via an intermediate standardized semantic space, causing cumulative semantic loss. To achieve both extensibility in immutable heterogeneous scenarios and low-loss feature interpretation, we propose PolyInter, a polymorphic feature interpreter. It contains an extension point through which emerging new agents can seamlessly integrate by overriding only their specific prompts, which are learnable parameters intended to guide the interpretation, while reusing PolyInter’s remaining parameters. By leveraging polymorphism, our design ensures that a single interpreter is sufficient to accommodate diverse agents and interpret their features into the ego agent’s semantic space. Experiments conducted on the OPV2V dataset demonstrate that PolyInter improves collaborative perception precision by up to 11.1% compared to SOTA interpreters, while comparable results can be achieved by training only 1.4% of PolyInter’s parameters when adapting to new agents.
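“共享解释器 + 每类智能体各自的可学习 prompt”这一扩展点设计可以用下面的极简示意表达:新智能体接入时只新增并训练自己的 prompt 向量,共享参数冻结复用。网络结构与维度均为假设,并非 PolyInter 的官方实现。

```python
import torch
import torch.nn as nn

class PolyInterpreter(nn.Module):
    def __init__(self, feat_dim=256, ego_dim=256, prompt_len=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)  # 共享、可复用的解释器
        self.proj = nn.Linear(feat_dim, ego_dim)
        self.prompts = nn.ParameterDict()                         # 每类智能体各自的 prompt
        self.prompt_len, self.feat_dim = prompt_len, feat_dim

    def register_agent(self, name):
        """新智能体接入:只新增它的 prompt 参数,其余参数保持不变。"""
        self.prompts[name] = nn.Parameter(torch.randn(1, self.prompt_len, self.feat_dim) * 0.02)

    def forward(self, agent_name, feats):                         # feats: (B, N, feat_dim)
        p = self.prompts[agent_name].expand(feats.size(0), -1, -1)
        x = self.shared(torch.cat([p, feats], dim=1))             # prompt 引导解释
        return self.proj(x[:, self.prompt_len:])                  # 丢弃 prompt 位,输出对齐后的特征

interp = PolyInterpreter()
for p in interp.shared.parameters():
    p.requires_grad = False                    # 冻结共享解释器,仅训练新智能体的 prompt
interp.register_agent("lidar_agent_v2")        # 假设的新智能体类型名
out = interp("lidar_agent_v2", torch.randn(2, 100, 256))
print(out.shape)
```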
zh
[CV-142] Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery
【速读】: 该论文试图解决在手动小切口白内障手术 (Manual Small-Incision Cataract Surgery, MSICS) 中缺乏相应的手术视频数据集的问题。解决方案的关键在于引入了首个全面的 Cataract-MSICS 数据集,该数据集包含 53 个手术视频,标注了 18 个手术阶段和 3,527 帧图像中的 13 种手术工具,并在像素级别进行了详细标注。此外,论文提出了 ToolSeg 框架,通过引入阶段条件解码器和利用基础模型的伪标签进行半监督学习,显著提升了工具分割性能,平均 Dice 分数提高了 23.77% 至 38.10%,尤其对较少出现和小尺寸的工具效果显著。该方法还展示了在其他手术场景中的泛化能力。
链接: https://arxiv.org/abs/2411.16794
作者: Bhuvan Sachdeva,Naren Akash,Tajamul Ashraf,Simon Muller,Thomas Schultz,Maximilian W. M. Wintergerst,Niharika Singri Prasad,Kaushik Murali,Mohit Jain
关键词-EN: disproportionately higher burden, developing countries, Cataract surgery, Phaco cataract surgery, surgical procedure globally
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cataract surgery is the most common surgical procedure globally, with a disproportionately higher burden in developing countries. While automated surgical video analysis has been explored in general surgery, its application to ophthalmic procedures remains limited. Existing works primarily focus on Phaco cataract surgery, an expensive technique not accessible in regions where cataract treatment is most needed. In contrast, Manual Small-Incision Cataract Surgery (MSICS) is the preferred low-cost, faster alternative in high-volume settings and for challenging cases. However, no dataset exists for MSICS. To address this gap, we introduce Cataract-MSICS, the first comprehensive dataset containing 53 surgical videos annotated for 18 surgical phases and 3,527 frames with 13 surgical tools at the pixel level. We benchmark this dataset on state-of-the-art models and present ToolSeg, a novel framework that enhances tool segmentation by introducing a phase-conditional decoder and a simple yet effective semi-supervised setup leveraging pseudo-labels from foundation models. Our approach significantly improves segmentation performance, achieving a 23.77% to 38.10% increase in mean Dice scores, with a notable boost for tools that are less prevalent and small. Furthermore, we demonstrate that ToolSeg generalizes to other surgical settings, showcasing its effectiveness on the CaDIS dataset.
zh
[CV-143] ST-Align: A Multimodal Foundation Model for Image-Gene Alignment in Spatial Transcriptomics
【速读】: 该论文试图解决空间转录组学 (Spatial transcriptomics, ST) 数据在缺乏全局视角和空间内在关系的情况下,难以有效捕捉特定生物学见解的问题。解决方案的关键在于引入ST-Align,这是首个专为ST设计的基础模型,通过深度对齐图像-基因对并整合空间上下文,有效连接病理图像与基因组特征。ST-Align采用了一种新颖的预训练框架,包含三重对齐策略:(1) 多尺度对齐,捕捉点级和微环境级上下文,提供全面视角;(2) 跨层级对齐,连接局部细胞特征与更广泛的组织结构。此外,ST-Align使用专门设计的编码器处理不同的ST上下文,并通过注意力基础融合网络 (Attention-Based Fusion Network, ABFN) 增强多模态融合,有效整合领域共享知识与ST特有的病理和基因组数据见解。通过在130万个点-微环境对上预训练,并在六个数据集上进行下游任务评估,ST-Align展示了优越的零样本和少样本学习能力。
链接: https://arxiv.org/abs/2411.16793
作者: Yuxiang Lin,Ling Luo,Ying Chen,Xushi Zhang,Zihui Wang,Wenxian Yang,Mengsha Tong,Rongshan Yu
关键词-EN: whole-transcriptomic expression profiles, high-resolution pathological images, whole-slide scales, images and whole-transcriptomic, whole-transcriptomic expression
类目: Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN)
备注:
点击查看摘要
Abstract:Spatial transcriptomics (ST) provides high-resolution pathological images and whole-transcriptomic expression profiles at individual spots across whole-slide scales. This setting makes it an ideal data source to develop multimodal foundation models. Although recent studies attempted to fine-tune visual encoders with trainable gene encoders based on spot-level, the absence of a wider slide perspective and spatial intrinsic relationships limits their ability to capture ST-specific insights effectively. Here, we introduce ST-Align, the first foundation model designed for ST that deeply aligns image-gene pairs by incorporating spatial context, effectively bridging pathological imaging with genomic features. We design a novel pretraining framework with a three-target alignment strategy for ST-Align, enabling (1) multi-scale alignment across image-gene pairs, capturing both spot- and niche-level contexts for a comprehensive perspective, and (2) cross-level alignment of multimodal insights, connecting localized cellular characteristics and broader tissue architecture. Additionally, ST-Align employs specialized encoders tailored to distinct ST contexts, followed by an Attention-Based Fusion Network (ABFN) for enhanced multimodal fusion, effectively merging domain-shared knowledge with ST-specific insights from both pathological and genomic data. We pre-trained ST-Align on 1.3 million spot-niche pairs and evaluated its performance through two downstream tasks across six datasets, demonstrating superior zero-shot and few-shot capabilities. ST-Align highlights the potential for reducing the cost of ST and providing valuable insights into the distinction of critical compositions within human tissue.
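图像-基因对的对齐可以用对称 InfoNCE 损失来示意:分别在 spot 级与 niche 级计算后相加,对应“多尺度对齐”的思路。温度系数与嵌入维度均为假设值,并非 ST-Align 完整的三重对齐策略。

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, gene_emb, temperature=0.07):
    """对称 InfoNCE:匹配的图像-基因对应落在相似度矩阵的对角线上。"""
    img_emb = F.normalize(img_emb, dim=-1)
    gene_emb = F.normalize(gene_emb, dim=-1)
    logits = img_emb @ gene_emb.t() / temperature
    labels = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

spot_img, spot_gene = torch.randn(32, 256), torch.randn(32, 256)     # spot 级嵌入(示例)
niche_img, niche_gene = torch.randn(32, 256), torch.randn(32, 256)   # niche 级嵌入(示例)
loss = clip_style_loss(spot_img, spot_gene) + clip_style_loss(niche_img, niche_gene)
print(loss.item())
```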
zh
[CV-144] From Diffusion to Resolution: Leveraging 2D Diffusion Models for 3D Super-Resolution Task
【速读】: 该论文试图解决3D体积超分辨率任务中的结构不连续性和高采样成本问题。解决方案的关键在于利用2D扩散模型和体积内的横向连续性来增强3D体积电子显微镜(vEM)超分辨率。具体步骤包括:首先在XY平面上模拟横向退化并训练2D扩散模型以恢复退化的切片;然后逐片应用模型于低分辨率体积的横向方向,恢复切片并保持固有的横向连续性;接着在恢复的横向切片序列上训练高频感知3D超分辨率网络,以学习切片间的空间特征变换;最后,将该网络应用于轴向方向,推断高分辨率体积,从而实现3D超分辨率。
链接: https://arxiv.org/abs/2411.16792
作者: Bohao Chen,Yanchao Zhang,Yanan Lv,Hua Han,Xi Chen
关键词-EN: Diffusion models, diffusion models significantly, Diffusion, recently emerged, powerful technique
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have recently emerged as a powerful technique in image generation, especially for image super-resolution tasks. While 2D diffusion models significantly enhance the resolution of individual images, existing diffusion-based methods for 3D volume super-resolution often struggle with structure discontinuities in axial direction and high sampling costs. In this work, we present a novel approach that leverages the 2D diffusion model and lateral continuity within the volume to enhance 3D volume electron microscopy (vEM) super-resolution. We first simulate lateral degradation with slices in the XY plane and train a 2D diffusion model to learn how to restore the degraded slices. The model is then applied slice-by-slice in the lateral direction of low-resolution volume, recovering slices while preserving inherent lateral continuity. Following this, a high-frequency-aware 3D super-resolution network is trained on the recovery lateral slice sequences to learn spatial feature transformation across slices. Finally, the network is applied to infer high-resolution volumes in the axial direction, enabling 3D super-resolution. We validate our approach through comprehensive evaluations, including image similarity assessments, resolution analysis, and performance on downstream tasks. Our results on two publicly available focused ion beam scanning electron microscopy (FIB-SEM) datasets demonstrate the robustness and practical applicability of our framework for 3D volume super-resolution.
zh
[CV-145] IDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction
【速读】: 该论文试图解决单源域泛化问题,特别是在面对语义偏移(如背景和视角变化)时,现有方法依赖于广泛的数据增强来覆盖多样化的训练域,但往往学习的是全局特征而非域不变的局部概念,导致模型在这些情况下表现不佳。解决方案的关键在于提出了一种新的方法,通过利用扩散模型和大型语言模型的丰富特征生成标注,并引入了一种名为TIDE的新训练方案。TIDE包括概念显著性对齐损失和局部概念对比损失,确保模型关注正确的局部概念区域并学习域不变的概念表示。此外,该方法还包括一个在测试时使用预测的概念显著性图进行迭代修正的算法,以使预测结果与存储的原型概念表示对齐。实验结果表明,该方法在四个标准域泛化基准数据集上显著优于当前最先进的方法,平均提升12%,并展示了其预测结果的可视化解释性。
链接: https://arxiv.org/abs/2411.16788
作者: Aishwarya Agarwal,Srikrishna Karanam,Vineet Gandhi
关键词-EN: single-source domain generalization, problem of single-source, domain generalization, concept, single-source domain
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 11 figures
点击查看摘要
Abstract:We consider the problem of single-source domain generalization. Existing methods typically rely on extensive augmentations to synthetically cover diverse domains during training. However, they struggle with semantic shifts (e.g., background and viewpoint changes), as they often learn global features instead of local concepts that tend to be domain invariant. To address this gap, we propose an approach that compels models to leverage such local concepts during prediction. Given no suitable dataset with per-class concepts and localization maps exists, we first develop a novel pipeline to generate annotations by exploiting the rich features of diffusion and large-language models. Our next innovation is TIDE, a novel training scheme with a concept saliency alignment loss that ensures model focus on the right per-concept regions and a local concept contrastive loss that promotes learning domain-invariant concept representations. This not only gives a robust model but also can be visually interpreted using the predicted concept saliency maps. Given these maps at test time, our final contribution is a new correction algorithm that uses the corresponding local concept representations to iteratively refine the prediction until it aligns with prototypical concept representations that we store at the end of model training. We evaluate our approach extensively on four standard DG benchmark datasets and substantially outperform the current state-ofthe-art (12% improvement on average) while also demonstrating that our predictions can be visually interpreted
zh
[CV-146] MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM
【速读】: 该论文试图解决现有同时定位与地图构建(SLAM)系统在多智能体协作场景下的局限性,特别是单智能体操作、渲染速度慢、多智能体间轨迹漂移和观测不一致等问题。解决方案的关键在于提出了一种刚性可变形的三维高斯场景表示方法,显著提升了系统速度,并引入了新的跟踪和地图合并机制,以及在基于高斯的SLAM流程中整合了闭环检测,从而提高了多智能体间的跟踪精度和全局地图的一致性。
链接: https://arxiv.org/abs/2411.16785
作者: Vladimir Yugay,Theo Gevers,Martin R. Oswald
关键词-EN: view synthesis capabilities, Simultaneous localization, localization and mapping, computer vision, augmented reality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Simultaneous localization and mapping (SLAM) systems with novel view synthesis capabilities are widely used in computer vision, with applications in augmented reality, robotics, and autonomous driving. However, existing approaches are limited to single-agent operation. Recent work has addressed this problem using a distributed neural scene representation. Unfortunately, existing methods are slow, cannot accurately render real-world data, are restricted to two agents, and have limited tracking accuracy. In contrast, we propose a rigidly deformable 3D Gaussian-based scene representation that dramatically speeds up the system. However, improving tracking accuracy and reconstructing a globally consistent map from multiple agents remains challenging due to trajectory drift and discrepancies across agents’ observations. Therefore, we propose new tracking and map-merging mechanisms and integrate loop closure in the Gaussian-based SLAM pipeline. We evaluate MAGiC-SLAM on synthetic and real-world datasets and find it more accurate and faster than the state of the art.
zh
[CV-147] CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis
【速读】: 该论文旨在解决文本到图像扩散模型中语义准确性不足的问题,特别是针对现有初始潜在优化方法中的两个关键限制:注意力忽视(attention neglect)和注意力干扰(attention interference)。解决方案的关键是引入了一种名为CoCoNO的新算法,该算法通过利用自注意力(self-attention)和交叉注意力(cross-attention)图谱中的互补信息来优化初始潜在变量。具体来说,CoCoNO引入了两个新的损失函数:注意力对比损失(attention contrast loss)和注意力完整损失(attention complete loss),前者通过确保每个自注意力段仅与特定对象的交叉注意力图谱关联来最小化不期望的重叠,后者则通过最大化这些段内的激活来确保每个对象被完整且清晰地表示。该方法在噪声优化框架内运行,无需重新训练基础模型,并通过多项基准测试证明了其在文本图像对齐方面的显著改进,超越了当前最先进的技术。
链接: https://arxiv.org/abs/2411.16783
作者: Aravindan Sundaram,Ujjayan Pal,Abhimanyu Chauhan,Aishwarya Agarwal,Srikrishna Karanam
关键词-EN: achieving semantically accurate, semantically accurate images, achieving semantically, persistent challenge, recent advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures
点击查看摘要
Abstract:Despite recent advancements in text-to-image models, achieving semantically accurate images in text-to-image diffusion models is a persistent challenge. While existing initial latent optimization methods have demonstrated impressive performance, we identify two key limitations: (a) attention neglect, where the synthesized image omits certain subjects from the input prompt because they do not have a designated segment in the self-attention map despite having a high-response cross-attention, and (b) attention interference, where the generated image has mixed-up properties of multiple subjects because of a conflicting overlap between cross- and self-attention maps of different subjects. To address these limitations, we introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps. Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject’s cross attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented. Our approach operates within a noise optimization framework, avoiding the need to retrain base models. Through extensive experiments on multiple benchmarks, we demonstrate that CoCoNO significantly improves text-image alignment and outperforms the current state of the art.
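下面用一个极简示意说明两种损失的大致计算方式:给定各主体的交叉注意力图与若干自注意力分割段,“对比损失”抑制同一段上其它主体的响应,“完整损失”鼓励被指派主体在该段内的激活。段与主体的指派规则为简化假设,并非 CoCoNO 的官方实现。

```python
import torch

def attention_losses(cross_attn, segments):
    # cross_attn: (K, H, W) 各文本主体的交叉注意力图; segments: (S, H, W) 自注意力段(0/1)
    S = segments.size(0)
    seg_area = segments.flatten(1).sum(-1).clamp(min=1.0)                            # (S,)
    resp = torch.einsum("shw,khw->sk", segments, cross_attn) / seg_area[:, None]     # 段内平均响应
    assign = resp.argmax(dim=1)                          # 每段指派给响应最大的主体(简化规则)
    complete_loss, contrast_loss = 0.0, 0.0
    for s in range(S):
        k = assign[s].item()
        complete_loss = complete_loss - resp[s, k]       # 最大化被指派主体在段内的激活
        others = torch.cat([resp[s, :k], resp[s, k + 1:]])
        contrast_loss = contrast_loss + others.sum()     # 抑制其它主体在同一段上的重叠响应
    return contrast_loss / S, complete_loss / S

ca = torch.rand(3, 16, 16)                      # 3 个主体的交叉注意力图(示例)
seg = (torch.rand(4, 16, 16) > 0.5).float()     # 4 个自注意力段(示例)
print(attention_losses(ca, seg))
```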
zh
[CV-148] Scaling Laws for Black box Adversarial Attacks
【速读】: 该论文试图解决深度学习模型对对抗样本的脆弱性问题,特别是探讨了通过增加代理模型数量来提升对抗样本的跨模型迁移性(cross-model transferability)。解决方案的关键在于通过模型集成(model ensembling)策略,攻击多个代理模型以生成更具迁移性的对抗样本。研究结果表明,随着代理模型数量的增加,对抗样本的迁移性显著提升,这一发现通过在大规模基础模型、标准图像分类器、多模态大语言模型以及如GPT-4o等专有模型上的广泛实验得到了验证。此外,通过可视化分析,研究还揭示了规模化攻击在语义解释性上的优势,表明更多代理模型的使用能够更好地捕捉模型的共同特征。
链接: https://arxiv.org/abs/2411.16782
作者: Chuan Liu,Huanran Chen,Yichi Zhang,Yinpeng Dong,Jun Zhu
关键词-EN: applying imperceptible perturbations, deep learning models, models, longstanding problem, problem of deep
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:A longstanding problem of deep learning models is their vulnerability to adversarial examples, which are often generated by applying imperceptible perturbations to natural examples. Adversarial examples exhibit cross-model transferability, enabling to attack black-box models with limited information about their architectures and parameters. Model ensembling is an effective strategy to improve the transferability by attacking multiple surrogate models simultaneously. However, as prior studies usually adopt few models in the ensemble, there remains an open question of whether scaling the number of models can further improve black-box attacks. Inspired by the findings in large foundation models, we investigate the scaling laws of black-box adversarial attacks in this work. By analyzing the relationship between the number of surrogate models and transferability of adversarial examples, we conclude with clear scaling laws, emphasizing the potential of using more surrogate models to enhance adversarial transferability. Extensive experiments verify the claims on standard image classifiers, multimodal large language models, and even proprietary models like GPT-4o, demonstrating consistent scaling effects and impressive attack success rates with more surrogate models. Further studies by visualization indicate that scaled attacks bring better interpretability in semantics, indicating that the common features of models are captured.
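论文研究的集成攻击设定可以用“对多个代理模型的平均损失做 I-FGSM”来示意:代理模型越多,生成的对抗样本通常越容易迁移到黑盒目标模型。以下为极简示意,用随机初始化的小模型代替真实代理模型,仅演示流程。

```python
import torch
import torch.nn as nn

def ensemble_ifgsm(x, y, surrogates, eps=8 / 255, alpha=2 / 255, steps=10):
    """对代理模型集合的平均交叉熵损失做迭代符号梯度上升,并投影回 L_inf 球。"""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = sum(nn.functional.cross_entropy(m(x_adv), y) for m in surrogates) / len(surrogates)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # 限制扰动幅度
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

surrogates = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)) for _ in range(4)]
x, y = torch.rand(2, 3, 32, 32), torch.tensor([1, 7])
x_adv = ensemble_ifgsm(x, y, surrogates)
print((x_adv - x).abs().max().item())   # 不超过 eps
```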
zh
[CV-149] UniPose: A Unified Multimodal Framework for Human Pose Comprehension Generation and Editing
【速读】: 该论文试图解决现有方法在理解和生成人体姿态时仅支持单一模态控制信号且孤立运作的问题,限制了其在实际应用中的广泛使用。解决方案的关键在于提出了一个名为UniPose的框架,该框架利用大型语言模型(Large Language Models, LLMs)来跨多种模态(包括图像、文本和3D SMPL姿态)理解和生成人体姿态。具体来说,UniPose通过姿态标记器将3D姿态转换为离散的姿态标记,从而无缝集成到LLM中,并使用统一的词汇表。此外,通过结合多种视觉编码器,特别是姿态特定的视觉编码器,UniPose增强了细粒度的姿态感知能力。得益于统一的学习策略,UniPose能够有效地在不同姿态相关任务之间传递知识,适应未见任务,并展现出扩展的能力。这是首次尝试构建一个通用的人体姿态理解、生成和编辑框架。
链接: https://arxiv.org/abs/2411.16781
作者: Yiheng Li,Ruibing Hou,Hong Chang,Shiguang Shan,Xilin Chen
关键词-EN: Human pose plays, Large Language Models, digital age, plays a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose’s competitive and even superior performance across various pose-relevant tasks.
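姿态标记器(pose tokenizer)的作用是把连续姿态向量映射为可并入 LLM 词表的离散 token。下面用最近邻码本查找给出一个极简示意;实际论文中的码本通常由 VQ 式模型训练得到,此处用随机初始化的码本作占位。

```python
import torch

class PoseTokenizer:
    """把连续 SMPL 姿态量化为离散 token:最近邻查找码本索引。"""
    def __init__(self, codebook_size=512, pose_dim=72, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.codebook = torch.randn(codebook_size, pose_dim, generator=g)

    def encode(self, poses):                   # poses: (B, pose_dim)
        d = torch.cdist(poses, self.codebook)  # 与每个码字的欧氏距离
        return d.argmin(dim=-1)                # (B,) 离散 token 索引

    def decode(self, tokens):
        return self.codebook[tokens]           # 索引回查得到近似姿态

tok = PoseTokenizer()
poses = torch.randn(4, 72)                     # 4 个 SMPL 轴角姿态(示例)
ids = tok.encode(poses)
print(ids, tok.decode(ids).shape)
```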
zh
[CV-150] NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model
【速读】: 该论文试图解决在稀疏视角图像条件下进行高斯溅射(Gaussian Splatting, GS)的多视角图像重建问题。现有方法依赖前馈网络生成像素对齐的高斯分布,但这些方法在输入图像未覆盖的区域表现不佳。论文提出的NovelGS通过利用基于Transformer的网络进行新视角去噪,生成3D高斯分布,解决了这一问题。关键在于结合条件视角和噪声目标视角,网络预测每个视角的像素对齐高斯分布,并在训练和推理过程中进行迭代渲染和去噪,从而实现对未见区域的生成建模,确保3D对象重建的纹理一致性和清晰度。
链接: https://arxiv.org/abs/2411.16779
作者: Jinpeng Liu,Jiale Xu,Weihao Cheng,Yiming Gao,Xintao Wang,Ying Shan,Yansong Tang
关键词-EN: Gaussian Splatting, Gaussians, pixel-aligned Gaussians, Splatting, generate pixel-aligned Gaussians
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce NovelGS, a diffusion model for Gaussian Splatting (GS) given sparse-view images. Recent works leverage feed-forward networks to generate pixel-aligned Gaussians, which could be fast rendered. Unfortunately, the method was unable to produce satisfactory results for areas not covered by the input images due to the formulation of these methods. In contrast, we leverage the novel view denoising through a transformer-based network to generate 3D Gaussians. Specifically, by incorporating both conditional views and noisy target views, the network predicts pixel-aligned Gaussians for each view. During training, the rendered target and some additional views of the Gaussians are supervised. During inference, the target views are iteratively rendered and denoised from pure noise. Our approach demonstrates state-of-the-art performance in addressing the multi-view image reconstruction challenge. Due to generative modeling of unseen regions, NovelGS effectively reconstructs 3D objects with consistent and sharp textures. Experimental results on publicly available datasets indicate that NovelGS substantially surpasses existing image-to-3D frameworks, both qualitatively and quantitatively. We also demonstrate the potential of NovelGS in generative tasks, such as text-to-3D and image-to-3D, by integrating it with existing multiview diffusion models. We will make the code publicly accessible.
zh
[CV-151] GEMeX: A Large-Scale Groundable and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis WWW
【速读】: 该论文试图解决当前医学视觉问答(Medical Visual Question Answering, Med-VQA)数据集的两个主要问题:(1) 缺乏视觉和文本解释,导致患者和初级医生难以理解答案;(2) 问题格式单一,未能充分反映临床场景的多样性。解决方案的关键在于引入了一个大规模、可解释性强的医学VQA基准——GEMeX,其核心创新包括:(1) 多模态解释机制,提供详细的视觉和文本解释,增强答案的可理解性;(2) 四种不同的问题类型(开放式、封闭式、单选和多选),更好地反映临床需求的多样性。通过评估和微调大型视觉语言模型,证明了GEMeX的有效性和复杂性。
链接: https://arxiv.org/abs/2411.16778
作者: Bo Liu,Ke Zou,Liming Zhan,Zexin Lu,Xiaoyu Dong,Yidi Chen,Chengqiang Xie,Jiannong Cao,Xiao-Ming Wu,Huazhu Fu
关键词-EN: Visual Question Answering, Explainable Medical VQA, current medical VQA, Medical Visual Question, Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This project is available at this https URL
点击查看摘要
Abstract:Medical Visual Question Answering (VQA) is an essential technology that integrates computer vision and natural language processing to automatically respond to clinical inquiries about medical images. However, current medical VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, which impedes their ability to satisfy the comprehension needs of patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements encountered in clinical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) A multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) Four distinct question types, open-ended, closed-ended, single-choice, and multiple-choice, that better reflect diverse clinical needs. We evaluated 10 representative large vision language models on GEMeX and found that they underperformed, highlighting the dataset’s complexity. However, after fine-tuning a baseline model using the training set, we observed a significant performance improvement, demonstrating the dataset’s effectiveness. The project is available at this http URL.
zh
[CV-152] SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models
【速读】: 该论文试图解决的问题是:大规模数据集往往被“晴天”等常见环境条件主导,导致在其上训练的分割与自动驾驶模型在“雨夜”等代表性不足的环境条件下性能下降。解决方案的关键是引入了一种名为SynDiff-AD的新型数据增强管道,该管道利用扩散模型(DMs)生成针对这些代表性不足子组的逼真图像。SynDiff-AD结合了ControlNet(一种以语义图为条件引导数据生成的扩散模型)和一种新颖的提示生成方案,该方案能够生成子组特定的、语义密集的提示。通过使用SynDiff-AD增强数据集,论文展示了分割模型Mask2Former和SegFormer在Waymo数据集上的性能分别提升最多1.2%和2.3%,在DeepDrive数据集上分别提升最多1.4%和0.7%。此外,SynDiff-AD还显著提升了端到端自动驾驶模型(如AIM-2D和AIM-BEV)在CARLA模拟器中多种环境条件下的驾驶性能,最高可达20%,从而提供了一个更为鲁棒的模型。
链接: https://arxiv.org/abs/2411.16776
作者: Harsh Goel,Sai Shankar Narasimhan,Oguzhan Akcin,Sandeep Chinchali
关键词-EN: collecting large-scale datasets, recent years, significant progress, Clear and Day, Rainy and Night
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures
点击查看摘要
Abstract:In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as “Clear and Day” weather, leading to decreased performance in under-represented conditions like “Rainy and Night”. To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet (a DM that guides data generation conditioned on semantic maps) along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model.
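“子组特定、语义密集的提示词”可以粗略示意如下:由语义图中出现的类别集合加上目标环境条件拼出文本提示,供 ControlNet 风格的条件生成使用。类别表与模板均为假设,并非论文的提示生成方案本身。

```python
# 假设的类别 id 到名称映射,实际应来自所用数据集的标注规范
CLASS_NAMES = {0: "road", 1: "car", 2: "pedestrian", 3: "traffic sign", 4: "vegetation"}

def subgroup_prompt(semantic_class_ids, condition="rainy night"):
    """根据语义图中出现的类别与目标子组条件,生成语义密集的文本提示。"""
    present = sorted({CLASS_NAMES[i] for i in semantic_class_ids if i in CLASS_NAMES})
    return f"a driving scene at {condition}, containing " + ", ".join(present)

print(subgroup_prompt({0, 1, 3}))   # 类别集合由语义图统计得到(示例)
```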
zh
[CV-153] MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing
【速读】: 该论文试图解决点云处理 (Point Cloud Processing, PCP) 中现有上下文学习 (In-Context Learning, ICL) 方法在任务间和任务内敏感性问题。解决方案的关键在于提出了一个名为 MICAS 的先进 ICL 框架,该框架引入了多粒度自适应采样机制。MICAS 的核心组件包括任务自适应点采样 (task-adaptive point sampling) 和查询特定提示采样 (query-specific prompt sampling)。前者利用任务间线索进行点级采样,后者则针对每个查询选择最优提示,以缓解任务内敏感性问题。这是首次在 ICL 框架中引入专门针对点云特性的自适应采样方法,实验结果表明 MICAS 在处理多种 PCP 任务时不仅高效,而且显著优于现有方法。
链接: https://arxiv.org/abs/2411.16773
作者: Feifei Shao,Ping Liu,Zhao Wang,Yawei Luo,Hongwei Wang,Jun Xiao
关键词-EN: requiring specialized models, PCP, requiring specialized, Point cloud processing, ICL
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures, 3 tables
点击查看摘要
Abstract:Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable 4.1% improvement in the part segmentation task and delivers consistent gains across various PCP applications.
zh
[CV-154] Hyperspectral Image Cross-Domain Object Detection Method based on Spectral-Spatial Feature Alignment
【速读】: 该论文试图解决高光谱图像(Hyperspectral Images, HSI)在跨域目标检测中的域偏移问题。解决方案的关键在于提出了一种基于光谱-空间特征对齐的跨域目标检测方法。首先,通过开发光谱-空间对齐模块,提取跨域不变的局部空间-光谱特征;其次,设计了光谱自相关模块,专门解决光谱域中的域偏移问题,有效对齐具有不同光谱分辨率的HSI。此外,还收集并标注了一个用于跨域目标检测的HSI数据集。实验结果证明了该方法在HSI跨域目标检测中的有效性,为该领域迈出了重要且有前景的一步。
链接: https://arxiv.org/abs/2411.16772
作者: Hongqi Zhang,He Sun,Hongmin Gao,Feng Han,Xu Sun,Lianru Gao,Bing Zhang
关键词-EN: cross-domain object detection, object detection, HSI cross-domain object, object detection task, HSI object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With consecutive bands in a wide range of wavelengths, hyperspectral images (HSI) have provided a unique tool for object detection task. However, existing HSI object detection methods have not been fully utilized in real applications, which is mainly resulted by the difference of spatial and spectral resolution between the unlabeled target domain and a labeled source domain, i.e. the domain shift of HSI. In this work, we aim to explore the unsupervised cross-domain object detection of HSI. Our key observation is that the local spatial-spectral characteristics remain invariant across different domains. For solving the problem of domain-shift, we propose a HSI cross-domain object detection method based on spectral-spatial feature alignment, which is the first attempt in the object detection community to the best of our knowledge. Firstly, we develop a spectral-spatial alignment module to extract domain-invariant local spatial-spectral features. Secondly, the spectral autocorrelation module has been designed to solve the domain shift in the spectral domain specifically, which can effectively align HSIs with different spectral resolutions. Besides, we have collected and annotated an HSI dataset for the cross-domain object detection. Our experimental results have proved the effectiveness of HSI cross-domain object detection, which has firstly demonstrated a significant and promising step towards HSI cross-domain object detection in the object detection community.
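光谱自相关模块关注的统计量可以用“波段 × 波段相关系数矩阵”来示意:它只依赖波段间的相对关系,对空间分辨率不敏感,因而可作为跨域对齐的线索之一。以下仅为思路示意,并非论文模块的具体实现。

```python
import numpy as np

def spectral_autocorrelation(patch):
    """patch: (bands, H, W),展平空间维后计算波段间相关系数矩阵。"""
    flat = patch.reshape(patch.shape[0], -1)
    return np.corrcoef(flat)                 # (bands, bands)

hsi_patch = np.random.rand(30, 16, 16)       # 30 个波段的图像块(示例)
print(spectral_autocorrelation(hsi_patch).shape)
```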
zh
[CV-155] VidHal: Benchmarking Temporal Hallucinations in Vision LLM s
【速读】: 该论文试图解决视频输入下的视觉大语言模型(VLLMs)幻觉问题,特别是现有评估方法无法捕捉视频中复杂时空动态导致的细微错误。解决方案的关键在于引入了VidHal基准,这是一个专门设计用于评估视频幻觉的基准。VidHal通过跨常见时间维度的视频实例构建,并精心创建了代表不同幻觉程度的字幕。为了实现细粒度评估,论文提出了一种新的字幕排序任务,要求VLLMs根据幻觉程度对字幕进行排序。通过这一基准,论文旨在推动对VLLMs整体能力的深入理解,特别是幻觉问题,并促进开发更先进的VLLMs以缓解这一问题。
链接: https://arxiv.org/abs/2411.16771
作者: Wey Yeh Choong,Yangyang Guo,Mohan Kankanhalli
关键词-EN: Vision Large Language, Large Language Models, Vision Large, Large Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 10 figures. Code available at this https URL
点击查看摘要
Abstract:Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucination. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models. Our results uncover significant limitations in existing VLLMs regarding hallucination generation. Through our benchmark, we aim to inspire further research on 1) holistic understanding of VLLM capabilities, particularly regarding hallucination, and 2) extensive development of advanced VLLMs to alleviate this problem.
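字幕排序任务的评测可以用“成对一致率”这类简单指标来示意:统计模型给出的幻觉程度排序与真实排序在所有字幕两两组合上的一致比例。具体指标以论文为准,此处仅为假设的示例。

```python
from itertools import combinations

def pairwise_order_accuracy(pred_rank, gt_rank):
    """pred_rank / gt_rank: 字幕 id 到名次的映射(0 表示幻觉最轻)。"""
    pairs = list(combinations(gt_rank.keys(), 2))
    agree = sum(
        (pred_rank[a] < pred_rank[b]) == (gt_rank[a] < gt_rank[b]) for a, b in pairs
    )
    return agree / len(pairs)

gt = {"cap_a": 0, "cap_b": 1, "cap_c": 2}       # 真实的幻觉程度排序(示例)
pred = {"cap_a": 0, "cap_b": 2, "cap_c": 1}     # 模型预测的排序(示例)
print(pairwise_order_accuracy(pred, gt))        # 2/3
```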
zh
[CV-156] GAST: Sequential Gaussian Avatars with Hierarchical Spatio-temporal Context
【速读】: 该论文试图解决现有3D人体化身(avatars)在渲染质量和动画灵活性方面的局限性问题。现有方法要么依赖于空间SMPL(-X)姿态(spatial SMPL(-X) poses),导致渲染质量粗糙,要么依赖于时间嵌入(temporal embeddings),限制了动画的灵活性。论文提出的解决方案之关键是GAST框架,该框架通过层次化地整合空间和时间信息,将3D人体建模与3D高斯光场(3DGS)统一起来。具体来说,GAST设计了一个序列条件框架,用于人体非刚性变形(non-rigid warping),在此指导下可以在观测空间中获得更准确的3D高斯分布。此外,高斯分布的显式属性使其能够嵌入更丰富的序列信息,包括粗略的人体姿态序列和更精细的每个顶点运动细节。这些序列条件在不同时间尺度上进行采样,以从粗到细的方式确保非刚性变形的无偏输入。实验结果表明,GAST结合层次化的时空建模,在渲染质量和动画灵活性方面均超越了现有的基线方法。
链接: https://arxiv.org/abs/2411.16768
作者: Wangze Xu,Yifan Zhan,Zhihang Zhong,Xiao Sun
关键词-EN: canonical radiance fields, enable high-fidelity rendering, per-frame observed warping, enable high-fidelity, canonical radiance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D human avatars, through the use of canonical radiance fields and per-frame observed warping, enable high-fidelity rendering and animating. However, existing methods, which rely on either spatial SMPL(-X) poses or temporal embeddings, respectively suffer from coarse rendering quality or limited animation flexibility. To address these challenges, we propose GAST, a framework that unifies 3D human modeling with 3DGS by hierarchically integrating both spatial and temporal information. Specifically, we design a sequential conditioning framework for the non-rigid warping of the human body, under whose guidance more accurate 3D Gaussians can be obtained in the observation space. Moreover, the explicit properties of Gaussians allow us to embed richer sequential information, encompassing both the coarse sequence of human poses and finer per-vertex motion details. These sequence conditions are further sampled across different temporal scales, in a coarse-to-fine manner, ensuring unbiased inputs for non-rigid warping. Experimental results demonstrate that our method combined with hierarchical spatio-temporal modeling surpasses concurrent baselines, delivering both high-quality rendering and flexible animating capabilities.
zh
[CV-157] Revisiting DDIM Inversion for Controlling Defect Generation by Disentangling the Background
【速读】: 该论文试图解决异常检测中异常数据稀缺导致深度神经网络难以有效识别异常特征的问题。解决方案的关键在于通过生成模型合成异常数据集,并特别关注背景与缺陷之间的关系。论文提出了一种新的方法,通过建模背景与缺陷之间的关系,使得背景影响去噪缺陷,但反之不成立。具体实现中,引入了正则化项以分离背景与缺陷,并利用DDIM反演技术在目标正常图像上生成缺陷,同时理论上证明了该方法能够在背景不变的情况下生成缺陷。实验结果表明,合成的数据具有高度的真实性和有效性。
链接: https://arxiv.org/abs/2411.16767
作者: Youngjae Cho,Gwangyeol Kim,Sirojbek Safarov,Seongdeok Bang,Jaewoo Park
关键词-EN: identify anomalous features, effectively utilizing deep, utilizing deep neural, deep neural network, neural network representations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages
点击查看摘要
Abstract:In anomaly detection, the scarcity of anomalous data compared to normal data poses a challenge in effectively utilizing deep neural network representations to identify anomalous features. From a data-centric perspective, generative models can solve this data imbalance issue by synthesizing anomaly datasets. Although previous research tried to enhance the controllability and quality of generating defects, they do not consider the relation between background and defect. Since the defect depends on the object’s background (i.e., the normal part of an object), training only the defect area cannot utilize the background information, and even generation can be biased depending on the mask information. In addition, controlling logical anomalies should consider the dependency between background and defect areas (e.g., orange colored defect on a orange juice bottle). In this paper, our paper proposes modeling a relationship between the background and defect, where background affects denoising defects; however, the reverse is not. We introduce the regularizing term to disentangle denoising background from defects. From the disentanglement loss, we rethink defect generation with DDIM Inversion, where we generate the defect on the target normal image. Additionally, we theoretically prove that our methodology can generate a defect on the target normal image with an invariant background. We demonstrate our synthetic data is realistic and effective in several experiments.
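该方法依赖的 DDIM 反演可以用其确定性单步公式来示意:利用噪声预测把 x_t 映射到噪声更强的 x_{t+1},从而把目标正常图像编码回噪声空间,再在生成方向上叠加缺陷。以下为极简示意,alpha_bar 取示例值,eps 实际应由扩散模型的 UNet 预测。

```python
import torch

def ddim_inversion_step(x_t, eps_pred, alpha_bar_t, alpha_bar_next):
    """确定性 DDIM 反演单步:先由噪声预测反推 x0,再沿轨迹前进到更高噪声水平。"""
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    return alpha_bar_next.sqrt() * x0_pred + (1 - alpha_bar_next).sqrt() * eps_pred

x_t = torch.randn(1, 4, 64, 64)                 # 目标正常图像的当前潜变量(示例)
eps = torch.randn_like(x_t)                     # 占位:应为扩散模型预测的噪声
a_t, a_next = torch.tensor(0.90), torch.tensor(0.85)
x_next = ddim_inversion_step(x_t, eps, a_t, a_next)
print(x_next.shape)
```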
zh
[CV-158] Is Right Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning
【速读】: 该论文试图解决多模态大语言模型(MLLMs)在图像中准确解释物体方向时面临的挑战,主要原因是训练数据中物体方向标注的不一致性。解决方案的关键在于提出了以自我为中心的指令调优(egocentric instruction tuning),通过基于用户自我中心视角的统一标注标准,使模型的方向理解与用户视角对齐。具体步骤包括生成以自我为中心的指令数据,利用MLLMs识别物体细节的能力并结合先验知识进行方向理解,然后通过指令调优增强模型对方向的准确解释能力。此外,论文还引入了EgoOrientBench基准,用于评估MLLMs在不同领域图像上的方向理解能力。实验结果表明,这种调优方法显著提升了方向理解能力,同时不损害MLLMs的整体性能。
链接: https://arxiv.org/abs/2411.16761
作者: Ji Hyeok Jung,Eun Tae Kim,Seo Yeon Kim,Joo Ho Lee,Bumsoo Kim,Buru Chang
关键词-EN: Multimodal large language, multimodal applications, Multimodal large, large language models, act as essential
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications. However, current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data, hindering the development of a coherent orientation understanding. To overcome this, we propose egocentric instruction tuning, which aligns MLLMs’ orientation understanding with the user’s perspective, based on a consistent annotation standard derived from the user’s egocentric viewpoint. We first generate egocentric instruction data that leverages MLLMs’ ability to recognize object details and applies prior knowledge for orientation understanding. Using this data, we perform instruction tuning to enhance the model’s capability for accurate orientation interpretation. In addition, we introduce EgoOrientBench, a benchmark that evaluates MLLMs’ orientation understanding across three tasks using images collected from diverse domains. Experimental results on this benchmark show that egocentric instruction tuning significantly improves orientation understanding without compromising overall MLLM performance. The instruction data and benchmark dataset are available on our project page at this https URL.
zh
[CV-159] LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
【速读】: 该论文试图解决基于梯度的解释方法在Transformer模型中表现不佳的问题,特别是由于梯度流动不平衡导致的FullGrad-completeness属性缺失。解决方案的关键是引入LibraGrad,这是一种基于理论支持的后处理方法,通过修剪和缩放反向路径来纠正梯度不平衡,而不改变前向传递或增加计算开销。LibraGrad在多个评估指标上显著提升了基于梯度的方法,超越了现有的白盒方法,包括专门针对Transformer的方法,并且在多种模型架构和数据集上展示了其通用性和有效性。
链接: https://arxiv.org/abs/2411.16760
作者: Faridoun Mehri(1),Mahdieh Soleymani Baghshah(1),Mohammad Taher Pilehvar(2) ((1) Sharif University of Technology, (2) Cardiff University)
关键词-EN: gradient-based explanations struggle, Transformers, Completeness Error, gradient-based explanations, explanations struggle
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad – a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods – including Transformer-specific approaches – across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models – two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available at this https URL.
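为直观说明“不改动前向传播、只在反向路径上对梯度进行缩放”这一机制,下面给出一个与论文实现无关的极简示意(缩放系数与插入位置均为本文假设;真实的 LibraGrad 需按 FullGrad-completeness 对各反向支路做有针对性的剪枝与缩放):

```python
import torch

class ScaleBackward(torch.autograd.Function):
    """Identity in the forward pass; rescales gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)  # forward output is unchanged

    @staticmethod
    def backward(ctx, grad_output):
        # Only the backward path is modified (here: a uniform rescaling).
        return grad_output * ctx.scale, None

def scale_backward(x, scale=0.5):
    return ScaleBackward.apply(x, scale)

# Toy usage: gradients reaching the input flow through the modified backward path.
x = torch.randn(1, 8, requires_grad=True)
w = torch.randn(8, 4)
h = scale_backward(x @ w, scale=0.5)  # hypothetical insertion point, for illustration only
y = h.relu().sum()
y.backward()
print(x.grad)
```

该示例仅演示“前向恒等、反向可调”的通用写法,并非论文中的归因流程本身。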
zh
[CV-160] Bundle Adjusted Gaussian Avatars Deblurring
【速读】: 该论文试图解决从多视角模糊视频中生成高质量3D人体化身(3D human avatars)的问题。解决方案的关键在于提出了一种端到端的3D感知、基于物理模糊形成模型的方法,该模型结合了3D人体运动模型,以解析由于人体运动引起的模糊图像中的歧义。通过这种方法,可以同时学习化身模型参数和细化子帧运动参数,从粗略初始化开始,最终实现从模糊视频中提取出清晰的内在3D人体高斯化身。
链接: https://arxiv.org/abs/2411.16758
作者: Muyao Niu,Yifan Zhan,Qingtian Zhu,Zhuoxiao Li,Wei Wang,Zhihang Zhong,Xiao Sun,Yinqiang Zheng
关键词-EN: represents a significant, significant yet challenging, Gaussian Splattings, multi-view videos represents, videos represents
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Codes and Data: this https URL
点击查看摘要
Abstract:The development of 3D human avatars from multi-view videos represents a significant yet challenging task in the field. Recent advancements, including 3D Gaussian Splattings (3DGS), have markedly progressed this domain. Nonetheless, existing techniques necessitate the use of high-quality sharp images, which are often impractical to obtain in real-world settings due to variations in human motion speed and intensity. In this study, we attempt to explore deriving sharp intrinsic 3D human Gaussian avatars from blurry video footage in an end-to-end manner. Our approach encompasses a 3D-aware, physics-oriented model of blur formation attributable to human movement, coupled with a 3D human motion model to clarify ambiguities found in motion-induced blurry images. This methodology facilitates the concurrent learning of avatar model parameters and the refinement of sub-frame motion parameters from a coarse initialization. We have established benchmarks for this task through a synthetic dataset derived from existing multi-view captures, alongside a real-captured dataset acquired through a 360-degree synchronous hybrid-exposure camera system. Comprehensive evaluations demonstrate that our model surpasses existing baselines.
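论文的核心想法之一是“人体运动导致的模糊可建模为曝光时间内若干子帧渲染结果的平均”。下面是这一物理成像思路的最小示意(其中 render 为占位的可微渲染函数、姿态线性插值也是假设,并非论文中的 3D 高斯化身渲染器):

```python
import torch

def render(pose: torch.Tensor) -> torch.Tensor:
    """Placeholder differentiable renderer: maps a pose vector to an image.
    In the real system this would be the 3D Gaussian avatar renderer."""
    H = W = 16
    grid = torch.linspace(0, 1, H * W).reshape(H, W)
    return torch.sigmoid(grid * pose.sum())

def synthesize_blur(pose_start, pose_end, n_subframes=8):
    """Physics-inspired blur formation: average renderings over sub-frame poses."""
    ts = torch.linspace(0.0, 1.0, n_subframes)
    frames = [render((1 - t) * pose_start + t * pose_end) for t in ts]
    return torch.stack(frames).mean(dim=0)

pose_a = torch.tensor([0.1, 0.2, 0.3], requires_grad=True)
pose_b = torch.tensor([0.4, 0.1, 0.0], requires_grad=True)
blurry_pred = synthesize_blur(pose_a, pose_b)
blurry_obs = torch.rand_like(blurry_pred)        # stand-in for a captured blurry frame
loss = (blurry_pred - blurry_obs).pow(2).mean()  # jointly refines sub-frame motion
loss.backward()
print(pose_a.grad, pose_b.grad)
```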
zh
[CV-161] FunGrasp: Functional Grasping for Diverse Dexterous Hands
【速读】: 该论文试图解决机器人手在执行特定任务时缺乏功能性抓握(functional grasping)能力的问题。解决方案的关键在于引入了一个名为FunGrasp的系统,该系统能够通过单张RGBD图像实现功能性灵巧抓握,并支持对未见对象的一次性迁移。核心技术包括:1) 通过人类到机器人(H2R)抓握重定向模块,将人类抓握姿势估计并迁移到不同的机器人手上;2) 利用强化学习在模拟环境中训练动态抓握控制策略;3) 采用特权学习、系统识别、领域随机化和重力补偿等技术,确保从模拟到现实的稳健迁移。通过这些技术,FunGrasp系统能够在不同机器人手上实现对未见对象的多样化功能性抓握。
链接: https://arxiv.org/abs/2411.16755
作者: Linyi Huang,Hui Zhang,Zijian Wu,Sammy Christen,Jie Song
关键词-EN: perform specific tasks, grasping, Functional grasping, finger holes, holes to cut
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Functional grasping is essential for humans to perform specific tasks, such as grasping scissors by the finger holes to cut materials or by the blade to safely hand them over. Enabling dexterous robot hands with functional grasping capabilities is crucial for their deployment to accomplish diverse real-world tasks. Recent research in dexterous grasping, however, often focuses on power grasps while overlooking task- and object-specific functional grasping poses. In this paper, we introduce FunGrasp, a system that enables functional dexterous grasping across various robot hands and performs one-shot transfer to unseen objects. Given a single RGBD image of functional human grasping, our system estimates the hand pose and transfers it to different robotic hands via a human-to-robot (H2R) grasp retargeting module. Guided by the retargeted grasping poses, a policy is trained through reinforcement learning in simulation for dynamic grasping control. To achieve robust sim-to-real transfer, we employ several techniques including privileged learning, system identification, domain randomization, and gravity compensation. In our experiments, we demonstrate that our system enables diverse functional grasping of unseen objects using single RGBD images, and can be successfully deployed across various dexterous robot hands. The significance of the components is validated through comprehensive ablation studies. Project page: this https URL .
zh
[CV-162] Visual Counter Turing Test (VCT2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)
【速读】: 该论文试图解决当前AI生成图像检测(AGID)技术在检测当代AI生成图像方面的不足问题。解决方案的关键在于提出了一个名为视觉反图灵测试(Visual Counter Turing Test, VCT^2)的基准,该基准包含约13万张由当代文本到图像模型生成的图像,并评估了现有AGID技术在该基准上的表现,揭示了它们的无效性。此外,论文还提出了视觉AI指数(Visual AI Index, V_AI),这是一个从纹理复杂性和对象一致性等多个视觉角度评估生成图像的量化框架,为评估图像生成AI模型设定了新标准。
链接: https://arxiv.org/abs/2411.16754
作者: Nasrin Imanpour,Shashwat Bajpai,Subhankar Ghosh,Sainath Reddy Sankepally,Abhilekh Borah,Hasnat Md Abdullah,Nishoak Kosaraju,Shreyas Dixit,Ashhar Aziz,Shwetangshu Biswas,Vinija Jain,Aman Chadha,Amit Sheth,Amitava Das
关键词-EN: Deep Fake Detection, Fake Image Detection, GAN Image Detection, image detection, raised significant concerns
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures
点击查看摘要
Abstract:The proliferation of AI techniques for image generation, coupled with their increasing accessibility, has raised significant concerns about the potential misuse of these images to spread misinformation. Recent AI-generated image detection (AGID) methods include CNNDetection, NPR, DM Image Detection, Fake Image Detection, DIRE, LASTED, GAN Image Detection, AIDE, SSP, DRCT, RINE, OCC-CLIP, De-Fake, and Deep Fake Detection. However, we argue that the current state-of-the-art AGID techniques are inadequate for effectively detecting contemporary AI-generated images and advocate for a comprehensive reevaluation of these methods. We introduce the Visual Counter Turing Test (VCT^2), a benchmark comprising ~130K images generated by contemporary text-to-image models (Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and Midjourney 6). VCT^2 includes two sets of prompts sourced from tweets by the New York Times Twitter account and captions from the MS COCO dataset. We also evaluate the performance of the aforementioned AGID techniques on the VCT^2 benchmark, highlighting their ineffectiveness in detecting AI-generated images. As image-generative AI models continue to evolve, the need for a quantifiable framework to evaluate these models becomes increasingly critical. To meet this need, we propose the Visual AI Index (V_AI), which assesses generated images from various visual perspectives, including texture complexity and object coherence, setting a new standard for evaluating image-generative AI models. To foster research in this domain, we make our this https URL and this https URL datasets publicly available.
zh
[CV-163] Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy
【速读】: 该论文试图解决零样本组合图像检索 (Zero-shot Composed Image Retrieval, ZSCIR) 中由于图像与文本之间的自然差异导致的检索结果不准确的问题。解决方案的关键在于提出了Imagined Proxy for CIR (IP-CIR) 方法,通过生成一个与查询图像和文本描述对齐的代理图像 (proxy image),从而增强查询表示在检索过程中的鲁棒性。具体步骤包括利用大型语言模型的泛化能力生成图像布局,并结合查询文本和图像进行条件生成,最终通过合并代理图像、查询图像和文本语义扰动来提升查询特征的质量。此外,论文还提出了一种新的平衡度量方法,综合考虑基于文本和代理图像的检索相似度,以更准确地检索目标图像,并成功在多个公开数据集上实现了最先进的检索性能。
链接: https://arxiv.org/abs/2411.16752
作者: You Li,Fan Ma,Yi Yang
关键词-EN: Zero-shot Composed Image, Zero-shot Composed, Composed Image Retrieval, Composed Image, query image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match the query image and the relative captions. Current methods focus on projecting the query image into the text feature space, subsequently combining them with features of query texts for retrieval. However, retrieving images only with the text features cannot guarantee detailed alignment due to the natural gap between images and text. In this paper, we introduce Imagined Proxy for CIR (IP-CIR), a training-free method that creates a proxy image aligned with the query image and text description, enhancing query representation in the retrieval process. We first leverage the large language model’s generalization capability to generate an image layout, and then apply both the query text and image for conditional generation. The robust query features are enhanced by merging the proxy image, query image, and text semantic perturbation. Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image while incorporating image-side information into the process. Experiments on three public datasets demonstrate that our method significantly improves retrieval performance. We achieve state-of-the-art (SOTA) results on the CIRR dataset with a Recall@K of 70.07 at K=10. Additionally, we improve Recall@10 on the FashionIQ dataset from 45.11 to 45.74, and raise the baseline mAP@10 on CIRCO from 32.24 to 34.26.
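下面用随机特征粗略示意“文本检索相似度与代理图像检索相似度加权融合”的平衡度量思路(权重 alpha 为假设超参,具体取值与归一化方式以论文为准):

```python
import torch
import torch.nn.functional as F

def balanced_retrieval(query_text_feat, proxy_image_feat, gallery_feats, alpha=0.5):
    """Combine text-based and proxy-image-based similarities for retrieval."""
    q_t = F.normalize(query_text_feat, dim=-1)
    q_p = F.normalize(proxy_image_feat, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sim_text = g @ q_t          # cosine similarity to the text-side query, (N,)
    sim_proxy = g @ q_p         # cosine similarity to the imagined proxy, (N,)
    score = alpha * sim_text + (1 - alpha) * sim_proxy
    return score.argsort(descending=True)

gallery = torch.randn(100, 512)       # candidate image features (random stand-ins)
ranking = balanced_retrieval(torch.randn(512), torch.randn(512), gallery)
print(ranking[:10])                   # top-10 retrieved indices
```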
zh
[CV-164] AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks
【速读】: 该论文试图解决现有扩散模型在生成高质量图像时,由于需要针对不同任务进行精细的人工设计和调整,导致合成数据在更广泛场景中应用受限的问题。解决方案的关键在于提出了AnySynth框架,该框架通过集成可适应、全面且高度可控的组件,能够根据多样化的需求生成任意类型的合成数据。具体来说,关键组件包括:1) 任务特定布局生成模块,利用大型语言模型和真实世界图像的布局先验生成合理布局;2) 统一控制图像生成模块,基于生成的布局创建高质量且可控的合成图像;3) 任务导向注释模块,为生成的图像提供针对不同任务的精确详细注释。这些组件共同确保了AnySynth框架在多种任务中的通用性和有效性。
链接: https://arxiv.org/abs/2411.16749
作者: You Li,Fan Ma,Yi Yang
关键词-EN: manual data collection, object detection, improving model generalization, Diffusion models, Cross-domain Object Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, the synthetic framework is usually designed with meticulous human effort for each task due to various requirements on image layout, content, and annotation formats, restricting the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality synthetic images that are controllable and based on the generated layouts. In addition, user-specific reference images and style images can be incorporated into the generation to meet task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework’s performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.
zh
[CV-165] LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis
【速读】: 该论文试图解决音频驱动的肖像图像动画生成中的多模态融合问题,确保在时间(timing)和空间(portrait)上的一致性,并生成生动逼真的说话人头像视频。解决方案的关键在于提出了LetsTalk(LatEnt Diffusion TranSformer for Talking Video Synthesis)模型,该模型通过结合模块化的时间和空间注意力机制来融合多模态信息,增强时空一致性。具体来说,论文探讨了三种从浅到深的融合方案,并根据图像、音频和视频生成的模态差异,提出了适合的融合策略:对于肖像,采用深度融合方案(Symbiotic Fusion)以确保一致性;对于音频,采用浅层融合方案(Direct Fusion)以实现音频与动画的对齐并保持多样性。实验结果表明,该方法能够生成时间上连贯、真实且具有增强多样性和生动性的视频。
链接: https://arxiv.org/abs/2411.16748
作者: Haojie Zhang,Zhihao Liang,Ruibo Fu,Zhengqi Wen,Xuefei Liu,Chenxing Li,Jianhua Tao,Yaling Liang
关键词-EN: expressive animated faces, rapidly advanced, enabling the creation, animated faces, Portrait image animation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 14 figures
点击查看摘要
Abstract:Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.
zh
[CV-166] FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction
【速读】: 该论文试图解决车辆轨迹预测中对复杂非线性模式捕捉不足,特别是对车辆跟随行为和车辆间交互细节的忽视问题。解决方案的关键在于引入了一种基于噪声条件扩散模型的车辆跟随轨迹预测方法,该方法通过将车辆间交互和车辆跟随动力学整合到生成式框架中,显著提升了预测的准确性和合理性。具体而言,该模型利用历史特征编码对噪声进行缩放,并在扩散过程中捕捉历史车辆动力学,同时采用基于交叉注意力的Transformer架构来建模复杂的车辆间依赖关系,从而有效指导去噪过程并增强预测精度。
链接: https://arxiv.org/abs/2411.16747
作者: Junwei You,Rui Gan,Weizhe Tang,Zilin Huang,Jiaxi Liu,Zhuoyu Jiang,Haotian Shi,Keshu Wu,Keke Long,Sicheng Fu,Sikai Chen,Bin Ran
关键词-EN: driver assistance systems, advanced driver assistance, advancing autonomous driving, assistance systems, Vehicle trajectory prediction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: arXiv admin note: text overlap with arXiv:2406.11941
点击查看摘要
Abstract:Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS). Although deep learning-based approaches - especially those utilizing transformer-based and generative models - have markedly improved prediction accuracy by capturing complex, non-linear patterns in vehicle dynamics and traffic interactions, they frequently overlook detailed car-following behaviors and the inter-vehicle interactions critical for real-world driving applications, particularly in fully autonomous or mixed traffic scenarios. To address the issue, this study introduces a scaled noise conditional diffusion model for car-following trajectory prediction, which integrates detailed inter-vehicular interactions and car-following dynamics into a generative framework, improving both the accuracy and plausibility of predicted trajectories. The model utilizes a novel pipeline to capture historical vehicle dynamics by scaling noise with encoded historical features within the diffusion process. Particularly, it employs a cross-attention-based transformer architecture to model intricate inter-vehicle dependencies, effectively guiding the denoising process and enhancing prediction accuracy. Experimental results on diverse real-world driving scenarios demonstrate the state-of-the-art performance and robustness of the proposed method.
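下面给出“在扩散前向加噪时用历史特征编码对噪声进行缩放”这一思路的玩具示意(历史编码器结构与缩放方式均为假设,仅说明噪声可被历史车辆动态条件化):

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Encodes the past trajectory into a per-sample noise scale (a stand-in module)."""
    def __init__(self, hist_len=10, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(hist_len * dim, 1), nn.Softplus())

    def forward(self, history):
        return self.net(history)            # (B, 1), positive scale

def scaled_forward_diffusion(x0, history, t, alphas_cumprod, encoder):
    """q(x_t | x_0) with noise scaled by encoded historical vehicle dynamics."""
    scale = encoder(history).unsqueeze(-1)              # (B, 1, 1)
    noise = torch.randn_like(x0) * scale
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise, noise

B, T_future, hist_len = 4, 12, 10
encoder = HistoryEncoder(hist_len)
x0 = torch.randn(B, T_future, 2)                        # future car-following trajectory
history = torch.randn(B, hist_len, 2)                   # past positions of the vehicle
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
t = torch.randint(0, 1000, (B,))
x_t, noise = scaled_forward_diffusion(x0, history, t, alphas_cumprod, encoder)
print(x_t.shape)
```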
zh
[CV-167] Document Haystacks: Vision-Language Reasoning Over Piles of 1000 Documents
【速读】: 该论文试图解决现有大型多模态模型(LMMs)在处理大规模图像检索任务时面临的复杂推理能力不足的问题。现有基准测试在多图像问答任务中仅涉及最多30张图像,无法充分反映实际应用中大规模检索任务的需求。为填补这一空白,论文提出了两个新的基准测试,即DocHaystack和InfoHaystack,用于评估LMM在大规模视觉文档检索和理解中的性能。解决方案的关键是提出了一个名为V-RAG的新型视觉中心检索增强生成(RAG)框架,该框架结合了多种多模态视觉编码器,每个编码器针对特定优势进行优化,并配备了一个专门的问题-文档相关性模块。V-RAG在DocHaystack-1000和InfoHaystack-1000基准测试中分别实现了9%和11%的Recall@1提升,显著优于之前的最佳基线模型,并使LMM能够高效处理数千张图像的检索任务。
链接: https://arxiv.org/abs/2411.16740
作者: Jun Chen,Dannong Xu,Junjie Fei,Chun-Mei Feng,Mohamed Elhoseiny
关键词-EN: achieved impressive progress, applications requiring complex, requiring complex reasoning, real-world applications requiring, large number
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications requiring complex reasoning over a large number of images. Existing benchmarks for multi-image question-answering are limited in scope: each question is paired with only up to 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world usage. To close this gap, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module. V-RAG sets a new standard, with a 9% and 11% improvement in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Additionally, integrating V-RAG with LMMs enables them to efficiently operate across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets are available at this https URL
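下面用随机特征示意 V-RAG 中“多视觉编码器分别打分、融合后取 top-k 文档”的检索骨架(编码器维度、求和式融合均为假设,真实系统还包含专门的问题-文档相关性模块):

```python
import numpy as np

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def vrag_retrieve(question_feats, doc_feats_per_encoder, top_k=5):
    """Fuse similarity scores from several vision encoders, then take top-k documents.
    question_feats / doc_feats_per_encoder hold one entry per encoder."""
    n_docs = doc_feats_per_encoder[0].shape[0]
    fused = np.zeros(n_docs)
    for q, docs in zip(question_feats, doc_feats_per_encoder):
        fused += normalize(docs) @ normalize(q)          # cosine similarity per encoder
    return np.argsort(-fused)[:top_k]

rng = np.random.default_rng(0)
encoder_dims = [512, 768]                                # two hypothetical encoders
docs = [rng.normal(size=(1000, d)) for d in encoder_dims]
question = [rng.normal(size=d) for d in encoder_dims]
print(vrag_retrieve(question, docs, top_k=5))
```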
zh
[CV-168] Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration Under Adverse Weather
【速读】: 该论文试图解决在多种恶劣天气条件下(如雨、雨滴和雪)进行图像恢复的问题,特别是现有多任务方法通过增加额外参数来处理多种场景,导致模型复杂度增加,难以实际部署的问题。解决方案的关键在于提出了一种新的梯度引导参数掩码(Gradient-Guided Parameter Mask),通过在训练过程中评估每种特定天气条件下的梯度变化强度,将模型参数分割为通用和特定组件。这种方法使得模型能够精确且自适应地学习每种天气场景的相关特征,从而在不增加额外参数的情况下提高效率和效果,确保在所有场景中都能实现高性能。
链接: https://arxiv.org/abs/2411.16739
作者: Jilong Guo,Haobo Yang,Mo Zhou,Xinyu Zhang
关键词-EN: including autonomous driving, Removing adverse weather, Removing adverse, real-world applications, including autonomous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Removing adverse weather conditions such as rain, raindrop, and snow from images is critical for various real-world applications, including autonomous driving, surveillance, and remote sensing. However, existing multi-task approaches typically rely on augmenting the model with additional parameters to handle multiple scenarios. While this enables the model to address diverse tasks, the introduction of extra parameters significantly complicates its practical deployment. In this paper, we propose a novel Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration under adverse weather, designed to effectively handle image degradation under diverse weather conditions without additional parameters. Our method segments model parameters into common and specific components by evaluating the gradient variation intensity during training for each specific weather condition. This enables the model to precisely and adaptively learn relevant features for each weather scenario, improving both efficiency and effectiveness without compromising on performance. This method constructs specific masks based on gradient fluctuations to isolate parameters influenced by other tasks, ensuring that the model achieves strong performance across all scenarios without adding extra parameters. We demonstrate the state-of-the-art performance of our framework through extensive experiments on multiple benchmark datasets. Specifically, our method achieves PSNR scores of 29.22 on the Raindrop dataset, 30.76 on the Rain dataset, and 29.56 on the Snow100K dataset. Code is available at: this https URL.
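下面用一个玩具模型示意“统计各天气任务训练期间的每参数梯度强度,按波动大小划分通用/特定参数”的流程(梯度统计量与阈值分位数均为本文假设,仅说明掩码如何得到):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
tasks = {"rain": torch.randn(64, 8), "snow": torch.randn(64, 8)}

# Accumulate per-parameter gradient magnitudes for each weather condition.
grad_stats = {name: {p_name: torch.zeros_like(p) for p_name, p in model.named_parameters()}
              for name in tasks}
for name, x in tasks.items():
    for _ in range(10):                                  # a few toy "training" steps
        model.zero_grad()
        loss = (model(x) - x).pow(2).mean()              # stand-in restoration loss
        loss.backward()
        for p_name, p in model.named_parameters():
            grad_stats[name][p_name] += p.grad.abs()

# Parameters whose gradient intensity varies strongly across tasks become task-specific.
masks = {}
for p_name, p in model.named_parameters():
    stacked = torch.stack([grad_stats[t][p_name] for t in tasks])   # (num_tasks, *shape)
    variation = stacked.std(dim=0)
    threshold = variation.flatten().quantile(0.8)        # hypothetical cut-off
    masks[p_name] = variation > threshold                # True -> task-specific entry
print({k: v.float().mean().item() for k, v in masks.items()})
```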
zh
[CV-169] Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
【速读】: 该论文试图解决扩散模型在训练数据中精确复现图像的问题,这一现象可能导致版权侵犯和隐私信息泄露。解决方案的关键在于理解记忆现象的成因,即去噪过程中存在的吸引盆地(attraction basin),它引导扩散轨迹趋向记忆图像。论文提出通过延迟分类器无指导(classifier-free guidance)的应用,直至达到理想的过渡点,从而引导扩散轨迹远离吸引盆地,生成高质量且与条件机制对齐的非记忆图像。此外,论文还提出了一种新的指导技术——反向指导(opposite guidance),以更早地逃离吸引盆地,进一步缓解记忆现象。
链接: https://arxiv.org/abs/2411.16738
作者: Anubhav Jain,Yuya Kobayashi,Takashi Shibuya,Yuhta Takida,Nasir Memon,Julian Togelius,Yuki Mitsufuji
关键词-EN: training data, models are prone, attraction basin, Diffusion models, diffusion trajectory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to understand the memorization phenomenon, and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs from which classifier-free guidance is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, opposite guidance, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.
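下面给出“去噪前期不施加(或以反向系数施加)classifier-free guidance,越过过渡点后再正常引导”这一采样策略的骨架示意(噪声预测器为占位函数,过渡步 t_switch 与反向系数均为假设,反向引导在此仅用负引导系数近似,未必与论文实现一致):

```python
import torch

def eps_model(x, t, cond):
    """Placeholder noise predictor; cond=None means the unconditional branch."""
    shift = 0.0 if cond is None else 0.1
    return 0.1 * x + shift

def guided_eps(x, t, cond, t_switch, scale=7.5, opposite_scale=-1.0):
    e_uncond = eps_model(x, t, None)
    e_cond = eps_model(x, t, cond)
    if t > t_switch:
        # Early denoising steps: "opposite guidance" with a negative weight
        # pushes the trajectory away from the attraction basin.
        w = opposite_scale
    else:
        w = scale                                # standard classifier-free guidance
    return e_uncond + w * (e_cond - e_uncond)

# Minimal sampling loop over decreasing timesteps (schedule and update are illustrative only).
x = torch.randn(1, 4, 8, 8)
cond = "a photo of a cat"
for t in range(1000, 0, -100):
    eps = guided_eps(x, t, cond, t_switch=600)
    x = x - 0.05 * eps                           # toy update, not a real sampler
print(x.mean().item())
```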
zh
[CV-170] owards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method
【速读】: 该论文试图解决道路图提取(road graph extraction)中由于标注数据严重稀缺导致的高效准确提取难题。解决方案的关键在于:1) 收集并发布了一个全球规模的道路图提取数据集,即Global-Scale数据集,该数据集比现有最大的公开数据集大约20倍,覆盖全球超过13,800 km²的区域;2) 开发了一种新的道路图提取模型SAM-Road++,该模型采用节点引导的重采样方法来缓解SAM-Road模型在训练与推理阶段之间的不匹配问题,并通过“扩展线”策略来减轻道路遮挡问题。这些创新显著提升了模型在未见区域中的预测能力。
链接: https://arxiv.org/abs/2411.16733
作者: Pan Yin,Kaiyu Li,Xiangyong Cao,Jing Yao,Lei Liu,Xueru Bai,Feng Zhou,Deyu Meng
关键词-EN: road graph extraction, garnered increasing attention, increasing attention due, graph extraction, road graph
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, road graph extraction has garnered increasing attention due to its crucial role in autonomous driving, navigation, etc. However, accurately and efficiently extracting road graphs remains a persistent challenge, primarily due to the severe scarcity of labeled data. To address this limitation, we collect a global-scale satellite road graph extraction dataset, i.e. Global-Scale dataset. Specifically, the Global-Scale dataset is ~20x larger than the largest existing public road extraction dataset and spans over 13,800 km^2 globally. Additionally, we develop a novel road graph extraction model, i.e. SAM-Road++, which adopts a node-guided resampling method to alleviate the mismatch issue between training and inference in SAM-Road, a pioneering state-of-the-art road graph extraction model. Furthermore, we propose a simple yet effective "extended-line" strategy in SAM-Road++ to mitigate the occlusion issue on the road. Extensive experiments demonstrate the validity of the collected Global-Scale dataset and the proposed SAM-Road++ method, particularly highlighting its superior predictive power in unseen regions. The dataset and code are available at this https URL.
zh
[CV-171] An Information-Theoretic Regularizer for Lossy Neural Image Compression
【速读】: 该论文试图解决有损图像压缩网络在优化过程中面临的挑战,特别是在学习量化潜在表示时,如何有效降低潜在熵的问题。论文的关键发现是,在一定程度上,最小化潜在熵等同于最大化条件源熵,这一发现基于信息论的等式。基于这一洞察,论文提出了一种新颖的结构化正则化方法,通过将负条件源熵纳入训练目标,以提升优化效果和模型的泛化能力。该信息论正则化方法具有可解释性、即插即用性,且不会增加推理开销。实验结果表明,该方法在不同压缩结构和未见领域中均能有效正则化模型并进一步压缩潜在表示的比特数。
链接: https://arxiv.org/abs/2411.16727
作者: Yingwen Zhang,Meng Wang,Xihua Sheng,Peilin Chen,Junru Li,Li Zhang,Shiqi Wang
关键词-EN: specific distortion constraints, Lossy image compression, compression networks aim, Lossy image, distortion constraints
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures
点击查看摘要
Abstract:Lossy image compression networks aim to minimize the latent entropy of images while adhering to specific distortion constraints. However, optimizing the neural network can be challenging due to its nature of learning quantized latent representations. In this paper, our key finding is that minimizing the latent entropy is, to some extent, equivalent to maximizing the conditional source entropy, an insight that is deeply rooted in information-theoretic equalities. Building on this insight, we propose a novel structural regularization method for the neural image compression task by incorporating the negative conditional source entropy into the training objective, such that both the optimization efficacy and the model’s generalization ability can be promoted. The proposed information-theoretic regularizer is interpretable, plug-and-play, and imposes no inference overheads. Extensive experiments demonstrate its superiority in regularizing the models and further squeezing bits from the latent representation across various compression structures and unseen domains.
zh
[CV-172] EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion
【速读】: 该论文试图解决在生成式谈话头视频(talking head generation)中存在的表达性、可控性和长时间生成稳定性问题。解决方案的关键在于提出了一种名为EmotiveTalk的框架,其中包括两个核心组件:Vision-guided Audio Information Decoupling (V-AID) 和 Emotional Talking Head Diffusion (ETHD)。V-AID通过设计Diffusion-based Co-speech Temporal Expansion (Di-CTE)模块,实现了音频与面部表情表示空间的对齐,从而生成与唇部运动和表情对齐的音频解耦表示。ETHD则通过Expression Decoupling Injection (EDI)模块,自动解耦参考图像中的表情并整合目标表情信息,从而实现更具表达性的生成效果。实验结果表明,EmotiveTalk在生成表达性谈话头视频方面表现出色,同时确保了情感的可控性和长时间生成的稳定性,达到了现有方法中的最先进水平。
链接: https://arxiv.org/abs/2411.16726
作者: Haotian Wang,Yuzhe Weng,Yueyan Li,Zilu Guo,Jun Du,Shutong Niu,Jiefeng Ma,Shan He,Xiaoyan Wu,Qiming Hu,Bing Yin,Cong Liu,Qingfeng Liu
关键词-EN: talking head, challenges in expressiveness, talking head videos, models have revolutionized, revolutionized the field
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19pages, 16figures
点击查看摘要
Abstract:Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module to automatically decouple the expressions from reference portraits while integrating the target expression information, achieving more expressive generation performance. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation, yielding state-of-the-art performance compared to existing methods.
zh
[CV-173] textitRevelio: Interpreting and leveraging semantic information in diffusion models
【速读】: 该论文试图解决的问题是如何在不同扩散架构的各个层级和去噪时间步中表示丰富的视觉语义信息。解决方案的关键在于利用k-稀疏自编码器(k-Sparse Autoencoders, k-SAE)来揭示单语义可解释特征,并通过轻量级分类器在现成的扩散模型特征上进行迁移学习,验证其机制解释。研究展示了扩散特征在表示学习中的有效性,并深入分析了不同扩散架构、预训练数据集和语言模型条件对视觉表示粒度、归纳偏置和迁移学习能力的影响。这一工作是加深黑箱扩散模型可解释性的关键步骤。
链接: https://arxiv.org/abs/2411.16725
作者: Dahye Kim,Xavier Thomas,Deepti Ghadiyaram
关键词-EN: rich visual semantic, visual semantic information, semantic information, information is represented, layers and denoising
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 14 figures
点击查看摘要
Abstract:We study how rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAE). We substantiate our mechanistic interpretations via transfer learning using light-weight classifiers on off-the-shelf diffusion models’ features. On 4 datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impact visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening interpretability of black-box diffusion models. Code and visualizations available at: this https URL
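k-稀疏自编码器 (k-SAE) 的核心是前向时只保留激活最大的 k 个隐单元。下面是一个通用的最小实现示意(隐层维度与 k 值均为假设,输入用随机张量代替扩散特征):

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, d_in, d_hidden, k):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        # Keep only the top-k activations per sample; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

# Toy usage on stand-in "diffusion features".
feats = torch.randn(256, 1024)
sae = KSparseAutoencoder(d_in=1024, d_hidden=4096, k=32)
recon, codes = sae(feats)
loss = (recon - feats).pow(2).mean()
loss.backward()
print(codes.count_nonzero(dim=-1)[:5])   # at most k active units per sample
```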
zh
[CV-174] Devils in Middle Layers of Large Vision-Language Models: Interpreting Detecting and Mitigating Object Hallucinations via Attention Lens
【速读】: 该论文试图解决大型视觉-语言模型 (LVLMs) 中视觉信息处理导致的幻觉问题。解决方案的关键在于通过注意力机制分析模型在处理视觉数据时的中间层行为,特别是识别出“视觉信息丰富”和“语义精炼”两个阶段。研究发现,真实标记在“视觉信息丰富”阶段获得更高的注意力权重,而幻觉标记则通常与不一致的对象相关联。基于此,论文提出了一种简单的推理时方法,通过整合多个注意力头的信息来调整视觉注意力,从而有效减少幻觉现象,且无需额外的训练成本。
链接: https://arxiv.org/abs/2411.16724
作者: Zhangqi Jiang,Junkai Chen,Beier Zhu,Tingjin Luo,Yankun Shen,Xu Yang
关键词-EN: Large Vision-Language Models, Vision-Language Models, Large Vision-Language, significantly undermine, undermine their reliability
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Hallucinations in Large Vision-Language Models (LVLMs) significantly undermine their reliability, motivating researchers to explore the causes of hallucination. However, most studies primarily focus on the language aspect rather than the visual. In this paper, we address how LVLMs process visual information and whether this process causes hallucination. Firstly, we use the attention lens to identify the stages at which LVLMs handle visual data, discovering that the middle layers are crucial. Moreover, we find that these layers can be further divided into two stages: “visual information enrichment” and “semantic refinement” which respectively propagate visual data to object tokens and interpret it through text. By analyzing attention patterns during the visual information enrichment stage, we find that real tokens consistently receive higher attention weights than hallucinated ones, serving as a strong indicator of hallucination. Further examination of multi-head attention maps reveals that hallucination tokens often result from heads interacting with inconsistent objects. Based on these insights, we propose a simple inference-time method that adjusts visual attention by integrating information across various heads. Extensive experiments demonstrate that this approach effectively mitigates hallucinations in mainstream LVLMs without additional training costs.
zh
[CV-175] Active Prompt Learning with Vision-Language Model Priors
【速读】: 该论文试图解决视觉-语言模型(Vision-language models, VLMs)在零样本分类任务中依赖手工文本提示(hand-crafted text prompts)的问题,尤其是在适应新任务时效率低下的挑战。解决方案的关键在于提出了一种预算高效的主动提示学习框架(budget-efficient active prompt learning framework)。具体来说,该框架通过引入类别引导的聚类(class-guided clustering),利用VLMs的预训练图像和文本编码器,从主动学习的初始阶段开始实现集群平衡的获取函数(cluster-balanced acquisition function)。此外,考虑到VLMs在不同类别间表现出显著的置信度差异,论文还提出了基于自适应类别阈值的预算节省选择性查询(budget-saving selective querying based on adaptive class-wise thresholds)。这些方法在九个数据集上的广泛实验中证明了其优于现有基线方法的性能。
链接: https://arxiv.org/abs/2411.16722
作者: Hoyoung Kim,Seokhee Jin,Changhwan Sung,Jaechang Kim,Jungseul Ok
关键词-EN: demonstrated remarkable zero-shot, remarkable zero-shot performance, Vision-language models, demonstrated remarkable, remarkable zero-shot
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) have demonstrated remarkable zero-shot performance across various classification tasks. Nonetheless, their reliance on hand-crafted text prompts for each task hinders efficient adaptation to new tasks. While prompt learning offers a promising solution, most studies focus on maximizing the utilization of given few-shot labeled datasets, often overlooking the potential of careful data selection strategies, which enable higher accuracy with fewer labeled data. This motivates us to study a budget-efficient active prompt learning framework. Specifically, we introduce a class-guided clustering that leverages the pre-trained image and text encoders of VLMs, thereby enabling our cluster-balanced acquisition function from the initial round of active learning. Furthermore, considering the substantial class-wise variance in confidence exhibited by VLMs, we propose a budget-saving selective querying based on adaptive class-wise thresholds. Extensive experiments in active learning scenarios across nine datasets demonstrate that our method outperforms existing baselines.
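下面示意“先对无标注池做聚类、再按簇平衡地选取查询样本”这一主动学习采样思路(特征用随机数代替 CLIP 图像特征,聚类数取等于预算也是假设;论文中的类引导聚类与自适应类阈值未在此体现):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_balanced_acquisition(image_feats, budget, n_clusters=None, seed=0):
    """Cluster the unlabeled pool, then query roughly one sample per cluster
    so the first active-learning round is cluster-balanced."""
    n_clusters = n_clusters or budget
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(image_feats)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Pick the member closest to its cluster centre as the query.
        dists = np.linalg.norm(image_feats[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.array(selected[:budget])

rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 512))            # stand-in image features
print(cluster_balanced_acquisition(pool, budget=16))
```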
zh
[CV-176] Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
【速读】: 该论文试图解决视觉语言模型 (Vision Language Models, VLMs) 在面对对抗攻击时产生意外和有害内容的问题。解决方案的关键是提出了一种名为ASTRA的防御机制,通过自适应引导模型远离对抗特征方向来抵抗VLM攻击。核心步骤包括:1) 寻找可转移的引导向量,代表有害响应的方向;2) 在推理时应用自适应激活引导,以消除这些方向。具体方法是通过随机删除对抗图像中的视觉标记,识别与越狱最相关的标记,并利用这些标记构建引导向量。在推理阶段,通过引导向量与校准激活之间的投影进行自适应引导,从而在保持良性输入性能的同时,显著避免对抗输入下的有害输出。实验结果表明,ASTRA在多个模型和基线上的表现达到了最先进的水平,并且在防御未见过的攻击和来自不同分布的对抗图像方面表现出良好的可转移性。
链接: https://arxiv.org/abs/2411.16721
作者: Han Wang,Gang Wang,Huan Zhang
关键词-EN: Vision Language Models, Vision Language, vision capabilities create, vision capabilities, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful response and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct steering vectors. During inference, we perform the adaptive steering method that involves the projection between the steering vectors and calibrated activation, resulting in little performance drops on benign inputs while strongly avoiding harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against both unseen attacks at design time (i.e., structured-based attacks) and adversarial images from diverse distributions.
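以下是“由有害/良性激活差构造引导向量,并在推理时去除激活在该方向上的投影”的简化骨架(激活均为随机张量,且此处为固定强度的非自适应版本,与论文的自适应引导存在差异):

```python
import torch

def build_steering_vector(harmful_acts, benign_acts):
    """Steering vector: difference of mean activations (harmful minus benign)."""
    v = harmful_acts.mean(dim=0) - benign_acts.mean(dim=0)
    return v / v.norm()

def steer_activation(h, v, strength=1.0):
    """Remove (a scaled) projection of the activation onto the harmful direction."""
    proj = (h @ v).unsqueeze(-1) * v          # component of h along v
    return h - strength * proj

d = 4096
harmful = torch.randn(128, d) + 0.3           # activations on adversarial inputs
benign = torch.randn(128, d)                  # activations on benign inputs
v = build_steering_vector(harmful, benign)

h = torch.randn(8, d)                         # activations at inference time
h_steered = steer_activation(h, v)
print((h_steered @ v).abs().mean().item())    # near zero along the removed direction
```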
zh
[CV-177] Importance-based Token Merging for Diffusion Models
【速读】: 该论文试图解决扩散模型在高质量图像和视频生成中存在的延迟问题。解决方案的关键在于通过合并相似的token来加速计算,同时通过保留重要的token来显著提高样本质量。具体来说,论文提出利用无分类器引导(classifier-free guidance)的幅度来可靠地确定每个token的重要性,这种方法与条件输入强相关,并对应于输出保真度。由于无分类器引导不增加额外的计算成本或需要额外的模块,因此该方法可以轻松集成到大多数基于扩散的框架中。实验结果表明,该方法在文本到图像合成、多视角图像生成和视频生成等多个应用中显著优于基线方法。
链接: https://arxiv.org/abs/2411.16720
作者: Haoyu Wu,Jingyi Xu,Hieu Le,Dimitris Samaras
关键词-EN: Diffusion models excel, Diffusion models, models excel, excel at high-quality, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models excel at high-quality image and video generation. However, a major drawback is their high latency. A simple yet powerful way to speed them up is by merging similar tokens for faster computation, though this can result in some quality loss. In this paper, we demonstrate that preserving important tokens during merging significantly improves sample quality. Notably, the importance of each token can be reliably determined using the classifier-free guidance magnitude, as this measure is strongly correlated with the conditioning input and corresponds to output fidelity. Since classifier-free guidance incurs no additional computational cost or requires extra modules, our method can be easily integrated into most diffusion-based frameworks. Experiments show that our approach significantly outperforms the baseline across various applications, including text-to-image synthesis, multi-view image generation, and video generation.
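下面给出“以 classifier-free guidance 幅度作为各 token 的重要性,仅把低重要性 token 合并到最相似的重要 token”思路的简化示意(保留比例与平均式合并均为假设,并非原方法的具体实现):

```python
import torch
import torch.nn.functional as F

def importance_based_merge(tokens, eps_cond, eps_uncond, keep_ratio=0.5):
    """tokens: (N, D). Importance = per-token CFG magnitude; unimportant tokens
    are merged into their most similar important token (naive averaging here)."""
    importance = (eps_cond - eps_uncond).norm(dim=-1)            # (N,)
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = importance.topk(n_keep).indices
    keep_set = set(keep_idx.tolist())
    merge_idx = torch.tensor([i for i in range(tokens.shape[0]) if i not in keep_set])

    kept = tokens[keep_idx]                                      # (K, D)
    if merge_idx.numel() == 0:
        return kept
    sim = F.normalize(tokens[merge_idx], dim=-1) @ F.normalize(kept, dim=-1).T
    target = sim.argmax(dim=-1)                                  # nearest important token
    merged = kept.clone()
    for j, t in zip(merge_idx.tolist(), target.tolist()):
        merged[t] = (merged[t] + tokens[j]) / 2                  # naive merge
    return merged

tokens = torch.randn(16, 64)
eps_c, eps_u = torch.randn(16, 64), torch.randn(16, 64)
print(importance_based_merge(tokens, eps_c, eps_u).shape)        # (8, 64)
```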
zh
[CV-178] Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients
【速读】: 该论文试图解决的问题是如何在不依赖手动调整大量超参数的情况下,通过合成图像训练出对输入图像域无偏的神经网络。解决方案的关键在于提出了Learn2Synth,这是一种新颖的过程,通过使用少量真实标注数据来学习合成参数。与通过对比或对抗技术强制对齐合成数据与真实数据的方法不同,Learn2Synth调整增强引擎,使得在合成数据上训练的分割网络在应用于真实数据时具有最佳的准确性。这种方法允许训练过程从真实标注样本中受益,同时避免使用这些真实样本直接训练分割网络,从而防止网络对训练集的属性产生偏见。论文中还开发了参数化和非参数化的策略来增强合成图像,进一步提升了分割网络的性能。
链接: https://arxiv.org/abs/2411.16719
作者: Xiaoling Hu,Oula Puonti,Juan Eugenio Iglesias,Bruce Fischl,Yael Balbastre
关键词-EN: Domain randomization, unbiased with respect, Domain, data, segmentation network
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 5 figures
点击查看摘要
Abstract:Domain randomization through synthesis is a powerful strategy to train networks that are unbiased with respect to the domain of the input images. Randomization allows networks to see a virtually infinite range of intensities and artifacts during training, thereby minimizing overfitting to appearance and maximizing generalization to unseen data. While powerful, this approach relies on the accurate tuning of a large set of hyper-parameters governing the probabilistic distribution of the synthesized images. Instead of manually tuning these parameters, we introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. Unlike methods that impose constraints to align synthetic data with real data (e.g., contrastive or adversarial techniques), which risk misaligning the image and its label map, we tune an augmentation engine such that a segmentation network trained on synthetic data has optimal accuracy when applied to real data. This approach allows the training procedure to benefit from real labeled examples, without ever using these real examples to train the segmentation network, which avoids biasing the network towards the properties of the training set. Specifically, we develop both parametric and nonparametric strategies to augment the synthetic images, enhancing the segmentation network’s performance. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of this learning strategy. Code is available at: this https URL.
zh
[CV-179] Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
【速读】: 该论文试图解决现有文本到视频生成模型评估指标在时间一致性(temporal fidelity)和文本与视频对齐(text-to-video alignment)方面的不足,特别是在安全关键应用中的重要性。解决方案的关键是引入了一种名为NeuS-V的新型合成视频评估指标,该指标利用神经符号形式验证技术(neuro-symbolic formal verification techniques)来严格评估文本与视频的对齐。具体方法包括将提示转换为形式化的时序逻辑(Temporal Logic, TL)规范,并将生成的视频转换为自动机表示,然后通过形式化检查视频自动机与TL规范的一致性来评估对齐效果。此外,论文还提供了一个包含时间扩展提示的数据集,用于评估当前最先进的视频生成模型,结果显示NeuS-V在与人评估的相关性上比现有指标高出5倍以上,揭示了现有模型在处理时间复杂提示时的不足。
链接: https://arxiv.org/abs/2411.16718
作者: S. P. Sharan,Minkyu Choi,Sahil Shah,Harsh Goel,Mohammad Omama,Sandeep Chinchali
关键词-EN: Recent advancements, autonomous driving, fields like robotics, CogVideoX are pushing, pushing the boundaries
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. Then, it evaluates the text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts to evaluate state-of-the-art video generation models against our benchmark. We find that NeuS-V demonstrates a higher correlation by over 5x with human evaluations when compared to existing metrics. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work in improving text-to-video generation capabilities.
zh
[CV-180] PaRCE: Probabilistic and Reconstruction-based Competency Estimation for CNN-based Image Classification
【速读】: 该论文试图解决卷积神经网络 (CNN) 在图像分类任务中过度自信的问题,并提出一种全面评估感知模型置信度的方法。解决方案的关键在于开发了一种基于概率和重构的能力估计方法 (PaRCE),该方法能够准确区分正确分类、错误分类和分布外 (OOD) 样本,以及具有异常区域的样本,并能区分因视觉图像修改导致的高、中、低预测准确性的样本。此外,该方法还扩展到异常定位任务,能够区分图像中感知模型熟悉和不熟悉的区域,生成可解释的分数,最可靠地捕捉感知模型置信度的整体概念。
链接: https://arxiv.org/abs/2411.16715
作者: Sara Pohland,Claire Tomlin
关键词-EN: Convolutional neural networks, Convolutional neural, neural networks, extremely popular, popular and effective
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: arXiv admin note: text overlap with arXiv:2409.06111
点击查看摘要
Abstract:Convolutional neural networks (CNNs) are extremely popular and effective for image classification tasks but tend to be overly confident in their predictions. Various works have sought to quantify uncertainty associated with these models, detect out-of-distribution (OOD) inputs, or identify anomalous regions in an image, but limited work has sought to develop a holistic approach that can accurately estimate perception model confidence across various sources of uncertainty. We develop a probabilistic and reconstruction-based competency estimation (PaRCE) method and compare it to existing approaches for uncertainty quantification and OOD detection. We find that our method can best distinguish between correctly classified, misclassified, and OOD samples with anomalous regions, as well as between samples with visual image modifications resulting in high, medium, and low prediction accuracy. We describe how to extend our approach for anomaly localization tasks and demonstrate the ability of our approach to distinguish between regions in an image that are familiar to the perception model from those that are unfamiliar. We find that our method generates interpretable scores that most reliably capture a holistic notion of perception model confidence.
zh
[CV-181] PIE: Topology-Preserved Image Editing With Text Instructions
【速读】: 该论文试图解决现有图像编辑模型在处理图像时往往忽略对象几何结构的问题,特别是在医疗和医学等对解剖结构正确性要求极高的领域。解决方案的关键在于引入了一种名为“拓扑保持图像编辑与文本指令 (Topology-Preserved Image Editing with text instructions, TPIE)”的新方法,通过文本引导的生成扩散模型确保编辑后的图像拓扑和几何结构保持不变。具体来说,TPIE方法将新生成的样本视为给定输入模板的可变形变体,从而实现可控且结构保持的编辑。该方法的核心由两个模块组成:(i) 基于自编码器的配准网络,学习由速度场参数化的对象变换的潜在表示;(ii) 一种新颖的潜在条件几何扩散 (Latent Conditional Geometric Diffusion, LCDG) 模型,能够高效地捕捉基于自定义文本指令的学习变换特征的数据分布。
链接: https://arxiv.org/abs/2411.16714
作者: Nivetha Jayakumar,Srivardhan Reddy Gadila,Tonmoy Hossain,Yangfeng Ji,Miaomiao Zhang
关键词-EN: Preserving topological structures, Preserving topological, real-world applications, healthcare and medicine, anatomy is critical
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Preserving topological structures is important in real-world applications, particularly in sensitive domains such as healthcare and medicine, where the correctness of human anatomy is critical. However, most existing image editing models focus on manipulating intensity and texture features, often overlooking object geometry within images. To address this issue, this paper introduces a novel method, Topology-Preserved Image Editing with text instructions (TPIE), that for the first time ensures the topology and geometry remaining intact in edited images through text-guided generative diffusion models. More specifically, our method treats newly generated samples as deformable variations of a given input template, allowing for controllable and structure-preserving edits. Our proposed TPIE framework consists of two key modules: (i) an autoencoder-based registration network that learns latent representations of object transformations, parameterized by velocity fields, from pairwise training images; and (ii) a novel latent conditional geometric diffusion (LCDG) model efficiently capturing the data distribution of learned transformation features conditioned on custom-defined text instructions. We validate TPIE on a diverse set of 2D and 3D images and compare them with state-of-the-art image editing approaches. Experimental results show that our method outperforms other baselines in generating more realistic images with well-preserved topology. Our code will be made publicly available on Github.
zh
[CV-182] Conditional Text-to-Image Generation with Reference Guidance
【速读】: 该论文试图解决文本到图像扩散模型在精确渲染特定主题(如文本拼写)方面的挑战。解决方案的关键在于引入额外的图像条件作为视觉指导,以增强扩散模型的生成能力。具体来说,通过使用参考条件,模型不仅能够处理文本标记器词汇表无法充分表示的内容,还能扩展其对新能力的泛化,例如生成非英语文本拼写。论文开发了几个小型专家插件,每个插件针对不同的应用(如英语场景文本生成、多语言场景文本生成和标志图像生成)进行定制训练,并配备了辅助网络和损失函数。这些插件在各项任务中均表现出优于现有方法的效果,且每个插件仅包含28.55M可训练参数。
链接: https://arxiv.org/abs/2411.16713
作者: Taewook Kim,Ze Wang,Zhengyuan Yang,Jiang Wang,Lijuan Wang,Zicheng Liu,Qiang Qiu
关键词-EN: demonstrated tremendous success, synthesizing visually stunning, visually stunning images, textual instructions, demonstrated tremendous
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model’s generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each plugin is trained with auxiliary networks and loss functions customized for applications such as English scene-text generation, multi-lingual scene-text generation, and logo-image generation. Our expert plugins demonstrate superior results than the existing methods on all tasks, each containing only 28.55M trainable parameters.
zh
[CV-183] Sonic: Shifting Focus to Global Audio Perception in Portrait Animation
【速读】: 该论文试图解决当前音频驱动面部生成技术中,由于过度依赖辅助视觉和空间知识而导致自然度和时间一致性下降的问题。解决方案的关键在于提出了一种名为 Sonic 的新范式,该范式专注于全局音频感知(global audio perception)的探索。具体来说,Sonic 通过解耦音频感知为内部片段音频感知(intra-clip audio perception)和跨片段音频感知(inter-clip audio perception),并结合这两种感知来增强整体动画效果。内部片段音频感知包括上下文增强的音频学习(Context-enhanced audio learning)和运动解耦控制器(Motion-decoupled controller),前者提取长程时间音频知识以隐式表达面部表情和唇部运动,后者独立控制头部运动和表情变化。跨片段音频感知则通过时间感知位置偏移融合(Time-aware position shift fusion)来连接内部片段,实现全局音频信息的融合和长音频推理。实验结果表明,该方法在视频质量、时间一致性、唇部同步精度和运动多样性方面均优于现有的最先进技术。
链接: https://arxiv.org/abs/2411.16331
作者: Xiaozhong Ji,Xiaobin Hu,Zhihong Xu,Junwei Zhu,Chuming Lin,Qingdong He,Jiangning Zhang,Donghao Luo,Yi Chen,Qin Lin,Qinglin Lu,Chengjie Wang
关键词-EN: crafting visually appealing, talking face generation, synchronizing facial movements, inter-clip audio perception, audio perception
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: refer to our main page: this https URL
点击查看摘要
Abstract:The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally-coherent animations. However, due to the limited exploration of global audio perception, current approaches predominantly employ auxiliary visual and spatial knowledge to stabilize the movements, which often results in the deterioration of naturalness and temporal consistency. Regarding the essence of audio-driven animation, the audio signal serves as the ideal and unique prior to adjust facial expressions and lip movements, without resorting to interference from any visual signals. Based on this motivation, we propose a novel paradigm, dubbed Sonic, to shift focus to the exploration of global audio perception. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and collaborate with both aspects to enhance the overall perception. For the intra-clip audio perception: 1) Context-enhanced audio learning, in which long-range intra-clip temporal audio knowledge is extracted to provide facial expression and lip motion priors implicitly expressed as the tone and speed of speech. 2) Motion-decoupled controller, in which the motion of the head and expression movement are disentangled and independently controlled by intra-audio clips. Most importantly, for inter-clip audio perception, serving as a bridge that connects the intra-clips to achieve global perception, Time-aware position shift fusion considers and fuses global inter-clip audio information for long-audio inference via consecutive time-aware shifted windows. Extensive experiments demonstrate that this novel audio-driven paradigm outperforms existing SOTA methodologies in terms of video quality, temporal consistency, lip synchronization precision, and motion diversity.
zh
[CV-184] An Ensemble Approach for Brain Tumor Segmentation and Synthesis
【速读】: 该论文试图解决在磁共振成像(MRI)中,特别是神经影像学领域,如何通过集成机器学习模型来提高诊断准确性、加速图像分析并提供数据驱动的洞察力,从而可能改变患者护理的问题。解决方案的关键在于提出一个深度学习框架,该框架通过集成多种最先进的模型架构(如nn-UNet、Swin-UNet和U-Mamba),以实现高精度的肿瘤分割和生成精细的合成图像。
链接: https://arxiv.org/abs/2411.17617
作者: Juampablo E. Heras Rivera,Agamdeep S. Chopra,Tianyi Ren,Hitender Oswal,Yutong Pan,Zineb Sordo,Sophie Walters,William Henry,Hooman Mohammadi,Riley Olson,Fargol Rezayaraghi,Tyson Lam,Akshay Jaikanth,Pavan Kancharla,Jacob Ruzevick,Daniela Ushizima,Mehmet Kurt
关键词-EN: magnetic resonance imaging, transform patient care, potentially transform patient, accelerated image analysis, specifically in neuroimaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The integration of machine learning in magnetic resonance imaging (MRI), specifically in neuroimaging, is proving to be incredibly effective, leading to better diagnostic accuracy, accelerated image analysis, and data-driven insights, which can potentially transform patient care. Deep learning models utilize multiple layers of processing to capture intricate details of complex data, which can then be used on a variety of tasks, including brain tumor classification, segmentation, image synthesis, and registration. Previous research demonstrates high accuracy in tumor segmentation using various model architectures, including nn-UNet and Swin-UNet. U-Mamba, which uses state space modeling, also achieves high accuracy in medical image segmentation. To leverage these models, we propose a deep learning framework that ensembles these state-of-the-art architectures to achieve accurate segmentation and produce finely synthesized images.
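下面用占位网络示意分割模型集成的一种常见做法:对各模型的 softmax 概率图取平均后再 argmax(具体的融合策略为本文假设,未必是该框架实际采用的方式):

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Stand-in for nnU-Net / Swin-UNet / U-Mamba style segmentation backbones."""
    def __init__(self, in_ch=4, n_classes=4):
        super().__init__()
        self.net = nn.Conv3d(in_ch, n_classes, kernel_size=3, padding=1)

    def forward(self, x):
        return self.net(x)                       # (B, C, D, H, W) logits

def ensemble_predict(models, volume):
    probs = torch.stack([m(volume).softmax(dim=1) for m in models]).mean(dim=0)
    return probs.argmax(dim=1)                   # (B, D, H, W) label map

models = [TinySegNet() for _ in range(3)]
mri = torch.randn(1, 4, 16, 16, 16)              # 4 MRI modalities, toy resolution
print(ensemble_predict(models, mri).shape)
```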
zh
[CV-185] Uncertainty quantification for White Matter Hyperintensity segmentation detects silent failures and improves automated Fazekas quantification
【速读】: 该论文试图解决白质高信号 (White Matter Hyperintensities, WMH) 在脑部MRI图像中的分割难题,这一难题主要源于WMH在形状、位置、大小、边界定义不清晰以及与其它病理(如中风病变)和伪影(如头部运动)的相似性。解决方案的关键在于应用不确定性量化 (Uncertainty Quantification, UQ) 技术,特别是结合随机分割网络 (Stochastic Segmentation Networks) 和深度集成 (Deep Ensembles) 的方法,以提高分割的Dice系数和降低绝对体积差异百分比 (Absolute Volume Difference, AVD)。此外,论文还展示了UQ在临床Fazekas评分分类中的下游应用,通过结合WMH分割的不确定性信息和空间特征,显著提升了分类性能和校准精度。
链接: https://arxiv.org/abs/2411.17571
作者: Ben Philps,Maria del C. Valdes Hernandez,Chen Qin,Una Clancy,Eleni Sakka,Susana Munoz Maniega,Mark E. Bastin,Angela C.C. Jochems,Joanna M. Wardlaw,Miguel O. Bernabeu,Alzheimers Disease Neuroimaging Initiative
关键词-EN: key neuroradiological markers, vessel disease present, White Matter Hyperintensities, brain MRI, WMH
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 34 pages (or 22 not including appendix) 26 figures (or 11 not including appendix)
点击查看摘要
Abstract:White Matter Hyperintensities (WMH) are key neuroradiological markers of small vessel disease present in brain MRI. Assessment of WMH is important in research and clinics. However, WMH are challenging to segment due to their high variability in shape, location, size, poorly defined borders, and similar intensity profile to other pathologies (e.g. stroke lesions) and artefacts (e.g. head motion). In this work, we apply the most effective techniques for uncertainty quantification (UQ) in segmentation to the WMH segmentation task across multiple test-time data distributions. We find a combination of Stochastic Segmentation Networks with Deep Ensembles yields the highest Dice and lowest Absolute Volume Difference % (AVD) score on in-domain and out-of-distribution data. We demonstrate the downstream utility of UQ, proposing a novel method for classification of the clinical Fazekas score using spatial features extracted for WMH segmentation and UQ maps. We show that incorporating WMH uncertainty information improves Fazekas classification performance and calibration, with median class-balanced accuracy for classification models with (UQ and spatial WMH features)/(spatial WMH features)/(WMH volume only) of 0.71/0.66/0.60 in the Deep WMH and 0.82/0.77/0.73 in the Periventricular WMH regions, respectively. We demonstrate that stochastic UQ techniques with high sample diversity can improve the detection of poor-quality segmentations. Finally, we qualitatively analyse the semantic information captured by UQ techniques and demonstrate that uncertainty can highlight areas where there is ambiguity between WMH and stroke lesions, while identifying clusters of small WMH in deep white matter unsegmented by the model.
zh
[CV-186] TAFM-Net: A Novel Approach to Skin Lesion Segmentation Using Transformer Attention and Focal Modulation
【速读】: 该论文试图解决皮肤病变分割中的挑战,特别是由于临床环境、光照条件、患者属性和毛发密度等因素导致的图像异质性问题。解决方案的关键在于开发了一种名为TAFM-Net的创新模型,该模型结合了自适应变换器注意力机制(TA)和焦点调制(FM)。TAFM-Net通过使用EfficientNetV2B1编码器来增强空间和通道相关的重要性,并通过密集连接的解码器在跳跃连接中集成FM,从而增强特征强调、分割性能和医学图像分析的解释性。此外,论文还引入了一种动态损失函数,该函数融合了区域和边界信息,指导模型更有效地训练。这些创新使得模型在ISIC2016、ISIC2017和ISIC2018数据集上分别达到了93.64%、86.88%和92.88%的Jaccard系数,展示了其在实际应用中的潜力。
链接: https://arxiv.org/abs/2411.17556
作者: Tariq M Khan,Dawn Lin,Shahzaib Iqbal,Eirk Meijering
关键词-EN: Incorporating modern computer, modern computer vision, computer vision techniques, protocols shows promise, clinical protocols shows
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Incorporating modern computer vision techniques into clinical protocols shows promise in improving skin lesion segmentation. The U-Net architecture has been a key model in this area, iteratively improved to address challenges arising from the heterogeneity of dermatologic images due to varying clinical settings, lighting, patient attributes, and hair density. To further improve skin lesion segmentation, we developed TAFM-Net, an innovative model leveraging self-adaptive transformer attention (TA) coupled with focal modulation (FM). Our model integrates an EfficientNetV2B1 encoder, which employs TA to enhance spatial and channel-related saliency, while a densely connected decoder integrates FM within skip connections, enhancing feature emphasis, segmentation performance, and interpretability crucial for medical image analysis. A novel dynamic loss function amalgamates region and boundary information, guiding effective model training. Our model achieves competitive performance, with Jaccard coefficients of 93.64%, 86.88% and 92.88% in the ISIC2016, ISIC2017 and ISIC2018 datasets, respectively, demonstrating its potential in real-world scenarios.
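下面是一个融合区域项与边界项的损失函数的简化示意(非论文的动态损失实现;alpha 权重在此固定为常数,边界用形态学梯度近似,均为本文的示例假设):

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def boundary_map(mask, k=3):
    # 用形态学梯度(膨胀减腐蚀)近似提取掩膜边界
    dilated = F.max_pool2d(mask, k, stride=1, padding=k // 2)
    eroded = -F.max_pool2d(-mask, k, stride=1, padding=k // 2)
    return (dilated - eroded).clamp(0, 1)

def region_boundary_loss(pred, target, alpha=0.5):
    region = soft_dice_loss(pred, target)                                 # 区域项
    boundary = soft_dice_loss(boundary_map(pred), boundary_map(target))   # 边界项
    return alpha * region + (1 - alpha) * boundary                        # 论文中的权重为动态调整,这里固定为常数

pred = torch.sigmoid(torch.randn(2, 1, 64, 64))
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(region_boundary_loss(pred, target).item())
```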
zh
[CV-187] On Statistical Rates of Conditional Diffusion Transformers: Approximation Estimation and Minimax Optimality
【速读】: 该论文旨在研究条件扩散变换器(Conditional Diffusion Transformers, DiTs)在分类器无指导情况下的近似和估计速率。解决方案的关键在于对“上下文内”条件DiTs在四种常见数据假设下的全面分析。具体方法包括将输入域离散化为无穷小网格,并在Hölder光滑数据假设下对条件扩散得分函数进行逐项泰勒展开,从而通过更详细的分段常数近似实现变换器的精细通用近似,进而获得更紧密的界。此外,论文还将分析扩展到线性潜在子空间假设下的潜在设置,证明了潜在条件DiTs在近似和估计方面均优于条件DiTs,并展示了潜在无条件DiTs的最小最大最优性。这些发现为条件和无条件DiTs设定了统计极限,并为开发更高效和准确的DiT模型提供了实际指导。
链接: https://arxiv.org/abs/2411.17522
作者: Jerry Yao-Chieh Hu,Weimin Wu,Yi-Chen Lee,Yu-Chao Huang,Minshuo Chen,Han Liu
关键词-EN: DiTs, conditional DiTs, conditional diffusion transformers, conditional, unconditional DiTs
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We investigate the approximation and estimation rates of conditional diffusion transformers (DiTs) with classifier-free guidance. We present a comprehensive analysis for “in-context” conditional DiTs under four common data assumptions. We show that both conditional DiTs and their latent variants lead to the minimax optimality of unconditional DiTs under identified settings. Specifically, we discretize the input domains into infinitesimal grids and then perform a term-by-term Taylor expansion on the conditional diffusion score function under Hölder smooth data assumption. This enables fine-grained use of transformers’ universal approximation through a more detailed piecewise constant approximation and hence obtains tighter bounds. Additionally, we extend our analysis to the latent setting under the linear latent subspace assumption. We not only show that latent conditional DiTs achieve lower bounds than conditional DiTs both in approximation and estimation, but also show the minimax optimality of latent unconditional DiTs. Our findings establish statistical limits for conditional and unconditional DiTs, and offer practical guidance toward developing more efficient and accurate DiT models.
zh
[CV-188] Structure-Guided MR-to-CT Synthesis with Spatial and Semantic Alignments for Attenuation Correction of Whole-Body PET/MR Imaging
【速读】: 该论文试图解决基于深度学习的全身MR到CT合成中的空间错位和强度映射复杂性问题,特别是在PET/MR成像中的PET衰减校正应用中。解决方案的关键在于提出了一种包含三个创新模块的全身MR到CT合成框架:(1) 结构引导合成模块 (Structure-Guided Synthesis module),通过结构引导的注意力门减少软组织的不必要轮廓,提高合成图像质量;(2) 空间对齐模块 (Spatial Alignment module),通过考虑组织体积和呼吸运动的影响,实现MR和CT图像的精确配准,提供对齐良好的训练用真实CT图像;(3) 语义对齐模块 (Semantic Alignment module),利用对比学习约束器官相关的语义信息,确保合成CT的语义真实性。这些模块共同作用,生成视觉上合理且语义上真实的CT图像,并验证其在PET衰减校正中的实用性。
链接: https://arxiv.org/abs/2411.17488
作者: Jiaxu Zheng,Zhenrong Shen,Lichi Zhang,Qun Chen
关键词-EN: PET attenuation correction, facilitating PET attenuation, PET attenuation, estimate the electron, electron density
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep-learning-based MR-to-CT synthesis can estimate the electron density of tissues, thereby facilitating PET attenuation correction in whole-body PET/MR imaging. However, whole-body MR-to-CT synthesis faces several challenges including the issue of spatial misalignment and the complexity of intensity mapping, primarily due to the variety of tissues and organs throughout the whole body. Here we propose a novel whole-body MR-to-CT synthesis framework, which consists of three novel modules to tackle these challenges: (1) Structure-Guided Synthesis module leverages structure-guided attention gates to enhance synthetic image quality by diminishing unnecessary contours of soft tissues; (2) Spatial Alignment module yields precise registration between paired MR and CT images by taking into account the impacts of tissue volumes and respiratory movements, thus providing well-aligned ground-truth CT images during training; (3) Semantic Alignment module utilizes contrastive learning to constrain organ-related semantic information, thereby ensuring the semantic authenticity of synthetic CT images. We conduct extensive experiments to demonstrate that the proposed whole-body MR-to-CT framework can produce visually plausible and semantically realistic CT images, and validate its utility in PET attenuation correction.
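结构引导注意力门可以参考 Attention U-Net 风格的门控结构来理解。以下为一个概念性示意(非论文官方实现,通道数与网络结构均为示例假设):

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """结构引导注意力门的极简示意(Attention U-Net 风格):
    用解码器端的门控特征 g 对跳跃连接特征 x 逐体素加权,抑制不相关的软组织轮廓。"""
    def __init__(self, in_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv3d(in_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        att = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        return x * att   # 注意力图逐体素加权跳跃连接特征

x = torch.randn(1, 32, 16, 64, 64)   # 跳跃连接特征
g = torch.randn(1, 32, 16, 64, 64)   # 门控特征(假设已上采样到同一空间尺寸)
print(AttentionGate(32, 32, 16)(x, g).shape)   # torch.Size([1, 32, 16, 64, 64])
```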
zh
[CV-189] Dual-Representation Interaction Driven Image Quality Assessment with Restoration Assistance WACV
【速读】: 该论文试图解决无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)中由于图像内容变异和失真多样性带来的挑战。解决方案的关键在于引入降级向量和质量向量(degradation vectors and quality vectors)来分别建模低质量图像的降级信息和质量信息,并通过恢复网络提供降级信息给MOS评分预测器。此外,设计基于表示的语义损失(Representation-based Semantic Loss, RS Loss)以增强表示之间的有效交互。这些创新使得该方法在合成和真实世界数据集上均优于现有的最先进模型。
链接: https://arxiv.org/abs/2411.17390
作者: Jingtong Yue,Xin Lin,Zijiu Yang,Chao Ren
关键词-EN: Image Quality Assessment, challenging problem due, age content variance, Assessment for distorted, No-Reference Image Quality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages,6 figures, published to WACV
点击查看摘要
Abstract:No-Reference Image Quality Assessment for distorted images has always been a challenging problem due to image content variance and distortion diversity. Previous IQA models mostly encode explicit single-quality features of synthetic images to obtain quality-aware representations for quality score prediction. However, performance decreases when facing real-world distortion and restored images from restoration models. The reason is that they do not consider the degradation factors of the low-quality images adequately. To address this issue, we first introduce the DRI method to obtain degradation vectors and quality vectors of images, which separately model the degradation and quality information of low-quality images. After that, we add the restoration network to provide the MOS score predictor with degradation information. Then, we design the Representation-based Semantic Loss (RS Loss) to assist in enhancing effective interaction between representations. Extensive experimental results demonstrate that the proposed method performs favorably against existing state-of-the-art models on both synthetic and real-world datasets.
zh
[CV-190] vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation
【速读】: 该论文试图解决3D血管分割在医学图像分析中的挑战,特别是由于成像模式特定的伪影、血管模式和尺度、信噪比以及背景组织的显著变化,以及不同成像协议导致的领域差距,限制了现有监督学习方法的泛化能力,需要对每个数据集进行繁琐的体素级标注。解决方案的关键在于提出了vesselFM,一个专门设计用于3D血管分割的基础模型。vesselFM通过在三个异质数据源上训练,包括一个大型精选标注数据集、通过领域随机化方案生成的数据以及从基于流匹配的生成模型中采样的数据,实现了零样本泛化。该模型在零样本、单样本和少样本场景下,在四种(预)临床相关的成像模式中均优于现有的最先进的医学图像分割基础模型,从而提供了一个通用的3D血管分割解决方案。
链接: https://arxiv.org/abs/2411.17386
作者: Bastian Wittmann,Yannick Wattenberg,Tamaz Amiranashvili,Suprosanna Shit,Bjoern Menze
关键词-EN: blood vessel segmentation, critical yet challenging, Segmenting, blood vessel, vessel segmentation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Segmenting 3D blood vessels is a critical yet challenging task in medical image analysis. This is due to significant imaging modality-specific variations in artifacts, vascular patterns and scales, signal-to-noise ratios, and background tissues. These variations, along with domain gaps arising from varying imaging protocols, limit the generalization of existing supervised learning-based methods, requiring tedious voxel-level annotations for each dataset separately. While foundation models promise to alleviate this limitation, they typically fail to generalize to the task of blood vessel segmentation, posing a unique, complex problem. In this work, we present vesselFM, a foundation model designed specifically for the broad task of 3D blood vessel segmentation. Unlike previous models, vesselFM can effortlessly generalize to unseen domains. To achieve zero-shot generalization, we train vesselFM on three heterogeneous data sources: a large, curated annotated dataset, data generated by a domain randomization scheme, and data sampled from a flow matching-based generative model. Extensive evaluations show that vesselFM outperforms state-of-the-art medical image segmentation foundation models across four (pre-)clinically relevant imaging modalities in zero-, one-, and few-shot scenarios, therefore providing a universal solution for 3D blood vessel segmentation.
zh
[CV-191] Automatic Skull Reconstruction by Deep Learnable Symmetry Enforcement
【速读】: 该论文试图解决个性化颅骨植入物建模过程中的高成本和长时间等待问题。解决方案的关键在于利用深度学习技术,特别是通过增强可学习对称性来提高重建效果。具体来说,论文提出了一种新的方法,通过训练一个专门用于计算颅骨对称性的神经网络,该网络可以在训练过程中作为额外的目标函数使用,或在重建后的细化步骤中作为后处理目标。这种方法不仅在定量评估中显著优于基线方法(DSC、bDSC 和 HD95 指标分别为 0.94/0.94/1.31 对比 0.84/0.76/2.43),而且在计算资源需求上显著减少(500 GPU 小时对比 100,000 GPU 小时),从而为临床实践中的自动颅骨缺陷重建迈出了重要一步。
链接: https://arxiv.org/abs/2411.17342
作者: Marek Wodzinski,Mateusz Daniol,Daria Hemmerling
关键词-EN: require personalized implants, thousands of people, people suffer, damage and require, require personalized
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Every year, thousands of people suffer from skull damage and require personalized implants to fill the cranial cavity. Unfortunately, the waiting time for reconstruction surgery can extend to several weeks or even months, especially in less developed countries. One factor contributing to the extended waiting period is the intricate process of personalized implant modeling. Currently, the preparation of these implants by experienced biomechanical experts is both costly and time-consuming. Recent advances in artificial intelligence, especially in deep learning, offer promising potential for automating the process. However, deep learning-based cranial reconstruction faces several challenges: (i) the limited size of training datasets, (ii) the high resolution of the volumetric data, and (iii) significant data heterogeneity. In this work, we propose a novel approach to address these challenges by enhancing the reconstruction through learnable symmetry enforcement. We demonstrate that it is possible to train a neural network dedicated to calculating skull symmetry, which can be utilized either as an additional objective function during training or as a post-reconstruction objective during the refinement step. We quantitatively evaluate the proposed method using open SkullBreak and SkullFix datasets, and qualitatively using real clinical cases. The results indicate that the symmetry-preserving reconstruction network achieves considerably better outcomes compared to the baseline (0.94/0.94/1.31 vs 0.84/0.76/2.43 in terms of DSC, bDSC, and HD95). Moreover, the results are comparable to the best-performing methods while requiring significantly fewer computational resources (500 vs 100,000 GPU hours). The proposed method is a considerable contribution to the field of applied artificial intelligence in medicine and is a step toward automatic cranial defect reconstruction in clinical practice.
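可学习对称性约束的核心思想可以用一个非常简化的对称性损失来说明:将重建体与其沿假定对称轴的镜像进行比较。下面的示意假设对称面恰好位于体数据最后一个轴的中心(论文中的对称性由专门训练的神经网络计算,这里仅作概念演示):

```python
import torch

def symmetry_loss(volume, axis=-1):
    """假设矢状对称面与体数据最后一个轴的中心对齐,
    用重建体与其镜像之间的 L1 距离作为可微的对称性约束项。"""
    mirrored = torch.flip(volume, dims=[axis])
    return (volume - mirrored).abs().mean()

recon = torch.sigmoid(torch.randn(1, 1, 64, 128, 128, requires_grad=True))
loss = symmetry_loss(recon)
loss.backward()          # 训练时可与重建损失加权求和,或仅在细化阶段作为后处理目标
print(loss.item())
```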
zh
[CV-192] TDAvec: Computing Vector Summaries of Persistence Diagrams for Topological Data Analysis in R and Python
【速读】: 该论文试图解决持久同调(Persistent Homology)在拓扑数据分析(Topological Data Analysis, TDA)中应用于机器学习时面临的挑战,即持久图(Persistence Diagrams, PDs)的非希尔伯特空间特性。解决方案的关键在于开发了一个新的软件包,该软件包通过核方法(kernel methods)和向量化技术(vectorization techniques)将持久图转换为机器学习兼容的格式。这一解决方案不仅简化了持久图的向量化过程,还提供了直观的操作流程和高级功能,从而促进了持久同调在实际应用中的有效性和可操作性。
链接: https://arxiv.org/abs/2411.17340
作者: Aleksei Luchinsky,Umar Islambekov
关键词-EN: Persistent homology, topological data analysis, widely-used tool, understanding the underlying, underlying shape
类目: Algebraic Topology (math.AT); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2 figures, 3 tables
点击查看摘要
Abstract:Persistent homology is a widely-used tool in topological data analysis (TDA) for understanding the underlying shape of complex data. By constructing a filtration of simplicial complexes from data points, it captures topological features such as connected components, loops, and voids across multiple scales. These features are encoded in persistence diagrams (PDs), which provide a concise summary of the data’s topological structure. However, the non-Hilbert nature of the space of PDs poses challenges for their direct use in machine learning applications. To address this, kernel methods and vectorization techniques have been developed to transform PDs into machine-learning-compatible formats. In this paper, we introduce a new software package designed to streamline the vectorization of PDs, offering an intuitive workflow and advanced functionalities. We demonstrate the necessity of the package through practical examples and provide a detailed discussion on its contributions to applied TDA. Definitions of all vectorization summaries used in the package are included in the appendix.
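持久图向量化的一个最简单例子是沿滤值网格统计存活特征数(Betti 曲线式摘要)。以下代码仅为概念性示意,并非该软件包的实际 API,函数名与输入格式均为假设:

```python
import numpy as np

def persistence_curve(pd, grid):
    """把持久图(每行是一个 (birth, death) 区间)向量化为固定长度的向量:
    对网格中的每个 t,统计仍然“存活”的拓扑特征个数(即一种 Betti 曲线式摘要)。"""
    birth, death = pd[:, 0], pd[:, 1]
    return np.array([np.sum((birth <= t) & (t < death)) for t in grid], dtype=float)

pd_h1 = np.array([[0.10, 0.90], [0.20, 0.50], [0.40, 0.45]])  # 假设的一维同调持久图
grid = np.linspace(0.0, 1.0, 20)
print(persistence_curve(pd_h1, grid))
```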
zh
[CV-193] MiceBoneChallenge: Micro-CT public dataset and six solutions for automatic growth plate detection in micro-CT mice bone scans
【速读】: 该论文试图解决在小鼠微CT扫描中自动检测和量化骨变化的问题,这一任务在临床前药物开发研究中常见,但目前主要依赖手动操作,耗时且存在观察者间和观察者内的变异性。解决方案的关键在于开发能够准确识别骨生长板平面的计算机视觉模型,这对于实现骨小梁的完全自动分割至关重要。通过组织内部挑战赛,论文成功开发了六种能够实现这一目标的计算机视觉解决方案,这些解决方案在测试集上的平均绝对误差为1.91±0.87个平面,达到了放射学家可接受的实用精度水平。此外,论文还公开了标注的3D微CT扫描数据集、六种解决方案及其源代码,为研究人员提供了开发和基准测试自己方法的机会。
链接: https://arxiv.org/abs/2411.17260
作者: Nikolay Burlutskiy,Marija Kekic,Jordi de la Torre,Philipp Plewa,Mehdi Boroumand,Julia Jurkowska,Borjan Venovski,Maria Chiara Biagi,Yeman Brhane Hagos,Roksana Malinowska-Traczyk,Yibo Wang,Jacek Zalewski,Paula Sawczuk,Karlo Pintarić,Fariba Yousefi,Leif Hultin
关键词-EN: drug development studies, preclinical drug development, Detecting and quantifying, development studies, preclinical drug
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Under Review
点击查看摘要
Abstract:Detecting and quantifying bone changes in micro-CT scans of rodents is a common task in preclinical drug development studies. However, this task is manual, time-consuming and subject to inter- and intra-observer variability. In 2024, Anonymous Company organized an internal challenge to develop models for automatic bone quantification. We prepared and annotated a high-quality dataset of 3D µCT bone scans from 83 mice. The challenge attracted over 80 AI scientists from around the globe who formed 23 teams. The participants were tasked with developing a solution to identify the plane where the bone growth happens, which is essential for fully automatic segmentation of trabecular bone. As a result, six computer vision solutions were developed that can accurately identify the location of the growth plate plane. The solutions achieved the mean absolute error of 1.91±0.87 planes from the ground truth on the test set, an accuracy level acceptable for practical use by a radiologist. The annotated 3D scans dataset, along with the six solutions and source code, is being made public, providing researchers with opportunities to develop and benchmark their own approaches. The code, trained models, and the data will be shared.
zh
[CV-194] cWDM: Conditional Wavelet Diffusion Models for Cross-Modality 3D Medical Image Synthesis
【速读】: 该论文试图解决在脑肿瘤分割任务中,由于时间限制或成像伪影导致某些MR模态(如T1、T1ce、T2、FLAIR)缺失的问题。解决方案的关键在于提出了一种条件小波扩散模型(conditional Wavelet Diffusion Model, cWDM),用于直接合成缺失的MR模态图像,基于已有的三种模态图像。该方法将图像到图像的翻译任务视为条件生成问题,通过结合高分辨率3D图像合成的小波扩散模型和简单的条件策略来解决。这种方法避免了切片或块状数据处理带来的伪影,能够直接应用于全分辨率体积图像,从而确保下游分割模型的有效应用。
链接: https://arxiv.org/abs/2411.17203
作者: Paul Friedrich,Alicia Durrer,Julia Wolleb,Philippe C. Cattin
关键词-EN: Image Synthesis Challenge, Wavelet Diffusion Model, Synthesis Challenge, Wavelet Diffusion, conditional Wavelet Diffusion
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: BraTS 2024 (Global Synthesis) submission. Code: this https URL
点击查看摘要
Abstract:This paper contributes to the “BraTS 2024 Brain MR Image Synthesis Challenge” and presents a conditional Wavelet Diffusion Model (cWDM) for directly solving a paired image-to-image translation task on high-resolution volumes. While deep learning-based brain tumor segmentation models have demonstrated clear clinical utility, they typically require MR scans from various modalities (T1, T1ce, T2, FLAIR) as input. However, due to time constraints or imaging artifacts, some of these modalities may be missing, hindering the application of well-performing segmentation algorithms in clinical routine. To address this issue, we propose a method that synthesizes one missing modality image conditioned on three available images, enabling the application of downstream segmentation models. We treat this paired image-to-image translation task as a conditional generation problem and solve it by combining a Wavelet Diffusion Model for high-resolution 3D image synthesis with a simple conditioning strategy. This approach allows us to directly apply our model to full-resolution volumes, avoiding artifacts caused by slice- or patch-wise data processing. While this work focuses on a specific application, the presented method can be applied to all kinds of paired image-to-image translation problems, such as CT ↔ MR and MR ↔ PET translation, or mask-conditioned anatomically guided image generation.
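论文采用的简单条件化策略可以理解为:把可用模态作为条件与带噪的缺失模态在通道维拼接后送入去噪网络。下面是一个概念性示意(非官方实现;网络结构、通道数均为示例假设,且为简洁起见省略了小波变换与时间步嵌入):

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """条件化策略的极简示意:把三个可用模态与带噪的缺失模态在通道维拼接后送入去噪网络。
    论文是在小波域对全分辨率体数据做扩散去噪,这里仅演示“拼接条件”这一思路。"""
    def __init__(self, in_ch=4, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv3d(hidden, 1, 3, padding=1),
        )

    def forward(self, noisy_target, conditions):
        return self.net(torch.cat([noisy_target, conditions], dim=1))

noisy_t1 = torch.randn(1, 1, 16, 32, 32)   # 带噪声的缺失模态(如 T1)
conds = torch.randn(1, 3, 16, 32, 32)      # 三个可用模态(如 T1ce、T2、FLAIR)
print(ConditionalDenoiser()(noisy_t1, conds).shape)   # torch.Size([1, 1, 16, 32, 32])
```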
zh
[CV-195] Motion Free B-frame Coding for Neural Video Compression
【速读】: 该论文试图解决传统深度神经网络视频压缩方法中存在的两个主要问题:一是基于运动编码和残差编码的混合方法导致的计算复杂度高,二是对称自编码器在处理视频时产生的模糊伪影。解决方案的关键在于提出了一种基于核的无运动视频编码方法,该方法通过消除运动估计、运动补偿和运动编码这些耗时的步骤,显著降低了计算复杂度,同时通过改进的自编码器结构减轻了模糊伪影,从而提高了编码效率和重建帧的视觉质量。实验结果表明,该方法在多个数据集上优于现有的最先进深度神经网络视频压缩方法,并且在模型大小上也具有显著优势。
链接: https://arxiv.org/abs/2411.17160
作者: Van Thang Nguyen
关键词-EN: classical video coding, residual coding, separate modules, follow the hybrid, neural video compression
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Deep Neural Video Compression
点击查看摘要
Abstract:Typical deep neural video compression networks usually follow the hybrid approach of classical video coding that contains two separate modules: motion coding and residual coding. In addition, a symmetric auto-encoder is often used as a normal architecture for both motion and residual coding. In this paper, we propose a novel approach that handles the drawbacks of the two typical above-mentioned architectures, which we call kernel-based motion-free video coding. The advantages of the motion-free approach are twofold: it improves the coding efficiency of the network and significantly reduces computational complexity thanks to eliminating motion estimation, motion compensation, and motion coding, which are the most time-consuming engines. In addition, the kernel-based auto-encoder alleviates blur artifacts that usually occur with the conventional symmetric autoencoder. Consequently, it improves the visual quality of the reconstructed frames. Experimental results show the proposed framework outperforms the SOTA deep neural video compression networks on the HEVC-class B dataset and is competitive on the UVG and MCL-JCV datasets. In addition, it generates high-quality reconstructed frames in comparison with the conventional motion coding-based symmetric auto-encoder, while its model size is around three to four times smaller than that of the motion-based networks.
zh
[CV-196] Neural-Network-Enhanced Metalens Camera for High-Definition Dynamic Imaging in the Long-Wave Infrared Spectrum
【速读】: 该论文旨在为长波红外成像提供一种轻量级且成本效益高的解决方案,特别是通过使用单透镜(singlet)来实现。解决方案的关键在于将高频增强循环生成对抗网络(High-Frequency-Enhancing Cycle-GAN)集成到超透镜(metalens)成像系统中。该网络通过解决超透镜固有的频率损失问题来提升原始超透镜图像的质量。其核心创新在于引入了一个高频对抗学习模块,该模块利用小波变换提取高频分量,并通过高频反馈回路使生成器能够从高频判别器的对抗反馈中增强相机输出,从而确保生成器遵循高频对抗损失的约束,有效恢复相机的频率损失。这一恢复机制保证了相机输出的高保真图像,有助于实现流畅的视频制作。
链接: https://arxiv.org/abs/2411.17139
作者: Jing-Yang Wei,Hao Huang,Xin Zhang,De-Mao Ye,Yi Li,Le Wang,Yao-Guang Ma,Yang-Hui Li
关键词-EN: Cycle-GAN neural network, long-wave infrared imaging, metalens imaging system, provide a lightweight, lightweight and cost-effective
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To provide a lightweight and cost-effective solution for the long-wave infrared imaging using a singlet, we develop a camera by integrating a High-Frequency-Enhancing Cycle-GAN neural network into a metalens imaging system. The High-Frequency-Enhancing Cycle-GAN improves the quality of the original metalens images by addressing inherent frequency loss introduced by the metalens. In addition to the bidirectional cyclic generative adversarial network, it incorporates a high-frequency adversarial learning module. This module utilizes wavelet transform to extract high-frequency components, and then establishes a high-frequency feedback loop. It enables the generator to enhance the camera outputs by integrating adversarial feedback from the high-frequency discriminator. This ensures that the generator adheres to the constraints imposed by the high-frequency adversarial loss, thereby effectively recovering the camera’s frequency loss. This recovery guarantees high-fidelity image output from the camera, facilitating smooth video production. Our camera is capable of achieving dynamic imaging at 125 frames per second with an End Point Error value of 12.58. We also achieve 0.42 for Fréchet Inception Distance, 30.62 for Peak Signal to Noise Ratio, and 0.69 for Structural Similarity in the recorded videos.
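高频对抗学习模块中“用小波变换提取高频分量”这一步可以用 PyWavelets 简单演示。以下示意只保留单层 2D 小波变换的三个高频子带(非论文官方实现,小波基与层数均为示例假设):

```python
import numpy as np
import pywt   # PyWavelets

def high_frequency_component(img, wavelet="haar"):
    """单层 2D 离散小波变换,丢弃低频近似子带,仅保留高频子带 (LH, HL, HH),
    作为送入高频判别器的输入的一种简化示意。"""
    _, (cH, cV, cD) = pywt.dwt2(img, wavelet)
    return np.stack([cH, cV, cD], axis=0)

img = np.random.rand(128, 128)       # 假设为一帧长波红外图像
hf = high_frequency_component(img)
print(hf.shape)                      # (3, 64, 64)
```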
zh
[CV-197] Improving Deformable Image Registration Accuracy through a Hybrid Similarity Metric and CycleGAN Based Auto-Segmentation
【速读】: 该论文试图解决在自适应放射治疗 (Adaptive Radiation Therapy, ART) 中,由于解剖结构变化导致的图像配准 (Deformable Image Registration, DIR) 精度不足的问题。特别是当图像强度差异较大时,传统的基于强度的DIR方法往往失效。解决方案的关键在于引入一种混合相似度度量方法,结合了点对距离 (Point-to-Distance, PD) 评分和强度相似性,并利用基于CycleGAN的强度校正和自动分割技术来生成合成CT (Synthetic CT, sCT) 图像,以增强软组织对比度。通过比较三种不同的DIR工作流程(传统基于强度的方法、基于CycleGAN自动分割的方法和专家手动分割的方法),研究结果表明,这种混合度量方法显著提高了DIR的准确性,特别是在低对比度的CBCT图像上。这一发现强调了将基于AI的图像校正和分割技术整合到ART工作流程中的潜力,以提高精度和简化临床流程。
链接: https://arxiv.org/abs/2411.16992
作者: Keyur D. Shah,James A. Shackleford,Nagarajan Kandasamy,Gregory C. Sharp
关键词-EN: Deformable image registration, adaptive radiation therapy, Deformable image, hybrid similarity metric, DIR
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Purpose: Deformable image registration (DIR) is critical in adaptive radiation therapy (ART) to account for anatomical changes. Conventional intensity-based DIR methods often fail when image intensities differ. This study evaluates a hybrid similarity metric combining intensity and structural information, leveraging CycleGAN-based intensity correction and auto-segmentation across three DIR workflows. Methods: A hybrid similarity metric combining a point-to-distance (PD) score and intensity similarity was implemented. Synthetic CT (sCT) images were generated using a 2D CycleGAN model trained on unpaired CT and CBCT images to enhance soft-tissue contrast. DIR workflows compared included: (1) traditional intensity-based (No PD), (2) auto-segmented contours on sCT (CycleGAN PD), and (3) expert manual contours (Expert PD). A 3D U-Net model trained on 56 images and validated on 14 cases segmented the prostate, bladder, and rectum. DIR accuracy was assessed using Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD), and fiducial separation. Results: The hybrid metric improved DIR accuracy. For the prostate, DSC increased from 0.61±0.18 (No PD) to 0.82±0.13 (CycleGAN PD) and 0.89±0.05 (Expert PD), with reductions in 95% HD from 11.75 mm to 4.86 mm and 3.27 mm, respectively. Fiducial separation decreased from 8.95 mm to 4.07 mm (CycleGAN PD) and 4.11 mm (Expert PD) (p < 0.05). Improvements were also observed for the bladder and rectum. Conclusion: This study demonstrates that a hybrid similarity metric using CycleGAN-based auto-segmentation improves DIR accuracy, particularly for low-contrast CBCT images. These findings highlight the potential for integrating AI-based image correction and segmentation into ART workflows to enhance precision and streamline clinical processes.
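混合相似度度量(强度相似性 + 点对距离评分)的直观理解可参考下面的简化示意:强度项用负 MSE,结构项用距离变换近似点到结构的距离(非论文官方实现,PD 评分的具体定义、掩膜与权重均为示例假设):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def hybrid_similarity(fixed, moving, fixed_mask, moving_mask, w=0.5):
    """混合相似度的简化示意:强度项取负均方误差;结构项把 moving 结构上的点
    代入 fixed 结构的距离变换图并取负均值,作为“点到结构距离”评分的一种近似。"""
    intensity = -np.mean((fixed - moving) ** 2)
    dist_to_fixed = distance_transform_edt(~fixed_mask)   # 每个像素到 fixed 结构的欧氏距离
    pts = np.argwhere(moving_mask)
    pd_score = -dist_to_fixed[tuple(pts.T)].mean() if len(pts) else 0.0
    return w * intensity + (1 - w) * pd_score

fixed = np.random.rand(64, 64); moving = np.random.rand(64, 64)
fixed_mask = np.zeros((64, 64), bool); fixed_mask[20:40, 20:40] = True   # 假设的器官轮廓掩膜
moving_mask = np.zeros((64, 64), bool); moving_mask[22:42, 18:38] = True
print(hybrid_similarity(fixed, moving, fixed_mask, moving_mask))
```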
zh
[CV-198] Glo-In-One-v2: Holistic Identification of Glomerular Cells Tissues and Lesions in Human and Mouse Histopathology
【速读】: 该论文试图解决肾小球内组织和病变的手动分割问题,这一过程传统上依赖于专家肾病理学家的详细形态学评估,既耗时又易受观察者间变异性的影响。解决方案的关键在于开发了Glo-In-One-v2工具包,该工具包通过细粒度分割能力,对14种组织区域、细胞和病变进行了标注,并基于23,529个标注的肾小球数据集训练了一个单动态头深度学习架构。该架构能够对人类和小鼠病理数据中的部分标注图像进行14类分割,涵盖了Bowman’s capsule、肾小球毛细血管丛、间质、间质细胞和足细胞等5种关键肾小球内组织,以及粘连、囊泡滴、全局硬化、透明变性、间质溶解、微动脉瘤、结节性硬化、间质扩张和节段性硬化等9种肾小球病变。该模型在Dice相似系数(DSC)上达到了76.5%的平均表现,并通过从啮齿动物到人类的迁移学习进一步提高了不同类型病变的分割准确性。
链接: https://arxiv.org/abs/2411.16961
作者: Lining Yu,Mengmeng Yin,Ruining Deng,Quan Liu,Tianyuan Yao,Can Cui,Junlin Guo,Yu Wang,Yaohong Wang,Shilin Zhao,Haichun Yang,Yuankai Huo
关键词-EN: detailed morphological evaluations, labor-intensive process susceptible, Segmenting glomerular intraglomerular, lesions traditionally depends, Segmenting glomerular
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Segmenting glomerular intraglomerular tissue and lesions traditionally depends on detailed morphological evaluations by expert nephropathologists, a labor-intensive process susceptible to interobserver variability. Our group previously developed the Glo-In-One toolkit for integrated detection and segmentation of glomeruli. In this study, we upgrade the Glo-In-One toolkit to version 2 with fine-grained segmentation capabilities, curating 14 distinct labels for tissue regions, cells, and lesions across a dataset of 23,529 annotated glomeruli across human and mouse histopathology data. To our knowledge, this dataset is among the largest of its kind to date. In this study, we present a single dynamic head deep learning architecture designed to segment 14 classes within partially labeled images of human and mouse pathology data. Our model was trained using a training set derived from 368 annotated kidney whole-slide images (WSIs) to identify 5 key intraglomerular tissues covering Bowman’s capsule, glomerular tuft, mesangium, mesangial cells, and podocytes. Additionally, the network segments 9 glomerular lesion classes including adhesion, capsular drop, global sclerosis, hyalinosis, mesangial lysis, microaneurysm, nodular sclerosis, mesangial expansion, and segmental sclerosis. The glomerulus segmentation model achieved decent performance compared with baselines, with a 76.5% average Dice Similarity Coefficient (DSC). Additionally, transfer learning from rodent to human for the glomerular lesion segmentation model enhanced the average segmentation accuracy across different types of lesions by more than 3%, as measured by Dice scores. The Glo-In-One-v2 model and trained weights have been made publicly available at https://github.com/hrlblab/Glo-In-One_v2.
zh
[CV-199] Contrastive Deep Learning Reveals Age Biomarkers in Histopathological Skin Biopsies
【速读】: 该论文试图解决的问题是如何识别区分快速和缓慢衰老的生物标志物,以理解衰老的生物学机制,实现早期疾病检测和预防策略的改进。解决方案的关键在于利用对比深度学习(contrastive deep learning)技术,通过皮肤活检图像的视觉特征来构建一种新的衰老生物标志物。具体来说,研究通过分析皮肤活检的组织病理学切片中的视觉特征,成功预测了个体的死亡率和慢性年龄相关疾病的患病率,展示了常规健康数据与深度学习结合的潜力,从而创建了一种可用于长期监测死亡率的新型衰老生物标志物。
链接: https://arxiv.org/abs/2411.16956
作者: Kaustubh Chakradeo(1),Pernille Nielsen(2),Lise Mette Rahbek Gjerdrum(3 and 6),Gry Sahl Hansen(3),David A Duchêne(1),Laust H Mortensen(1 and 4),Majken K Jensen(1),Samir Bhatt(1 and 5) ((1) University of Copenhagen, Section of Epidemiology, Department of Public Health, Copenhagen, Denmark, (2) Technical University of Denmark, Department of Applied Mathematics and Computer Science, Denmark, (3) Department of Pathology, Copenhagen University Hospital- Zealand University Hospital, Roskilde, Denmark, (4) Danmarks Statistik, Denmark, (5) Imperial College London, United Kingdom, (6) Department of Clinical Medicine, University of Copenhagen, Copenhagen, Denmark)
关键词-EN: life expectancy increases, global life expectancy, exhibit considerable variability, individuals exhibit considerable, expectancy increases
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 5 tables, 5 figures Under review: npj Digital Medicine
点击查看摘要
Abstract:As global life expectancy increases, so does the burden of chronic diseases, yet individuals exhibit considerable variability in the rate at which they age. Identifying biomarkers that distinguish fast from slow ageing is crucial for understanding the biology of ageing, enabling early disease detection, and improving prevention strategies. Using contrastive deep learning, we show that skin biopsy images alone are sufficient to determine an individual’s age. We then use visual features in histopathology slides of the skin biopsies to construct a novel biomarker of ageing. By linking with comprehensive health registers in Denmark, we demonstrate that visual features in histopathology slides of skin biopsies predict mortality and the prevalence of chronic age-related diseases. Our work highlights how routinely collected health data can provide additional value when used together with deep learning, by creating a new biomarker for ageing which can be actively used to determine mortality over time.
zh
[CV-200] U-WNO:U-Net-enhanced Wavelet Neural Operator for fetal head segmentation
【速读】: 该论文试图解决在医学图像分析中,特别是二维超声图像的区域分割问题,以实现对妊娠不同阶段的准确跟踪。解决方案的关键在于开发了一种新型的U-Net增强的小波神经算子(U-WNO),该方法结合了小波分解、算子学习和编码器-解码器机制。通过利用小波在时频定位上的优势,结合下采样和上采样操作生成分割图,U-WNO能够精确地跟踪空间域中的模式,并有效地学习功能映射,从而实现区域分割。这种方法不仅在理论上有创新,而且在实际应用中展示了其潜力,特别是在提高决策精度和操作效率方面。
链接: https://arxiv.org/abs/2411.16890
作者: Pranava Seth,Deepak Mishra,Veena Iyer
关键词-EN: Wavelet Neural Operator, combines wavelet decomposition, Wavelet Neural, encoder-decoder mechanism, Neural Operator
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This article describes the development of a novel U-Net-enhanced Wavelet Neural Operator (U-WNO), which combines wavelet decomposition, operator learning, and an encoder-decoder mechanism. This approach harnesses the superiority of wavelets in time-frequency localization of functions and combines down-sampling and up-sampling operations to generate the segmentation map, enabling accurate tracking of patterns in the spatial domain and effective learning of the functional mappings needed for regional segmentation. By bridging the gap between theoretical advancements and practical applications, the U-WNO holds potential for significant impact in multiple science and industrial fields, facilitating more accurate decision-making and improved operational efficiencies. The operator is demonstrated for different pregnancy trimesters, utilizing two-dimensional ultrasound images.
zh
[CV-201] Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
【速读】: 该论文试图解决在图像恢复(Image Restoration)中,使用预训练扩散模型(Diffusion Models)进行恢复时由于近似误差导致的样本质量下降问题。解决方案的关键在于提出了一种基于频率域的时间变化低通滤波器(time-varying low-pass filter in the frequency domain),并开发了一种适应性课程(adaptive curriculum)来调整频率计划,从而在恢复过程中逐步引入更高频率。这种方法显著提高了在运动去模糊(motion deblurring)和图像去雾(image dehazing)等挑战性任务中的性能。
链接: https://arxiv.org/abs/2411.15295
作者: Darshan Thaker,Abhishek Goyal,René Vidal
关键词-EN: recover high-quality images, Image restoration aims, aims to recover, recover high-quality, degraded observations
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we provide the first rigorous analysis of this approximation error for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. Motivated by our theoretical insights, we propose a simple modification to existing diffusion-based restoration methods. Our approach introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We develop an adaptive curriculum for this frequency schedule based on the underlying data distribution. Our method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
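频域中的时间变化低通滤波器可以用 FFT 简单演示:随着恢复步骤推进,逐步提高截止频率以引入更高频率。以下为概念性示意(非论文官方实现,线性频率计划与图像尺寸均为示例假设;论文中的计划由数据分布自适应确定):

```python
import numpy as np

def low_pass(img, cutoff):
    """cutoff ∈ (0, 1]:在频域保留归一化半径不超过 cutoff 的低频分量,滤除其余高频。"""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * (radius <= cutoff))))

measurement = np.random.rand(64, 64)            # 假设为退化观测(如模糊或有雾图像)
T = 10
for t in range(T):
    cutoff = 0.2 + 0.8 * (t + 1) / T            # 简单的线性频率计划,仅作演示
    filtered = low_pass(measurement, cutoff)    # 该时刻用于测量一致性(后验采样)步骤的滤波观测
print(filtered.shape)                           # (64, 64)
```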
zh
人工智能
[AI-0] RealSeal: Revolutionizing Media Authentication with Real-Time Realism Scoring
链接: https://arxiv.org/abs/2411.17684
作者: Bhaktipriya Radharapu,Harish Krishna
关键词-EN: manipulated media necessitates, growing threat, necessitates a radical, radical rethinking, watermarking synthetic data
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Best Paper Award, Blue Sky Track at 26th ACM International Conference on Multimodal Interaction, Nov 2024, San Jose, Costa Rica
点击查看摘要
Abstract:The growing threat of deepfakes and manipulated media necessitates a radical rethinking of media authentication. Existing methods for watermarking synthetic data fall short, as they can be easily removed or altered, and current deepfake detection algorithms do not achieve perfect accuracy. Provenance techniques, which rely on metadata to verify content origin, fail to address the fundamental problem of staged or fake media. This paper introduces a groundbreaking paradigm shift in media authentication by advocating for the watermarking of real content at its source, as opposed to watermarking synthetic data. Our innovative approach employs multisensory inputs and machine learning to assess the realism of content in real-time and across different contexts. We propose embedding a robust realism score within the image metadata, fundamentally transforming how images are trusted and circulated. By combining established principles of human reasoning about reality, rooted in firmware and hardware security, with the sophisticated reasoning capabilities of contemporary machine learning systems, we develop a holistic approach that analyzes information from multiple perspectives. This ambitious, blue sky approach represents a significant leap forward in the field, pushing the boundaries of media authenticity and trust. By embracing cutting-edge advancements in technology and interdisciplinary research, we aim to establish a new standard for verifying the authenticity of digital media.
[AI-1] Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset
链接: https://arxiv.org/abs/2411.17645
作者: Yujie Dai,Brian Sullivan,Axel Montout,Amy Dillon,Chris Waller,Peter Acs,Rachel Denholm,Philip Williams,Alastair D Hay,Raul Santos-Rodriguez,Andrew Dowsey
关键词-EN: holds substantial potential, electronic health records, holds substantial, UTI risk, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The use of machine learning and AI on electronic health records (EHRs) holds substantial potential for clinical insight. However, this approach faces significant challenges due to data heterogeneity, sparsity, temporal misalignment, and limited labeled outcomes. In this context, we leverage a linked EHR dataset of approximately one million de-identified individuals from Bristol, North Somerset, and South Gloucestershire, UK, to characterize urinary tract infections (UTIs) and develop predictive models focused on data quality, fairness and transparency. A comprehensive data pre-processing and curation pipeline transforms the raw EHR data into a structured format suitable for AI modeling. Given the limited availability and biases of ground truth UTI outcomes, we introduce a UTI risk estimation framework informed by clinical expertise to estimate UTI risk across individual patient timelines. Using this framework, we built pairwise XGBoost models to differentiate UTI risk categories with explainable AI techniques to identify key predictors while ensuring interpretability. Our findings reveal differences in clinical and demographic factors across risk groups, offering insights into UTI risk stratification and progression. This study demonstrates the added value of AI-driven insights into UTI clinical decision-making while prioritizing interpretability, transparency, and fairness, underscoring the importance of sound data practices in advancing health outcomes.
[AI-2] MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation
链接: https://arxiv.org/abs/2411.17636
作者: Harsh Singh,Rocktim Jyoti Das,Mingfei Han,Preslav Nakov,Ivan Laptev
关键词-EN: Large Language Models, Large Language, demonstrated remarkable planning, remarkable planning abilities, Multi-Agent Large Language
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 48 pages
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. While recent efforts in robotics have leveraged LLMs both for high-level and low-level planning, these approaches often face significant challenges, such as hallucinations in long-horizon tasks and limited adaptability due to the generation of plans in a single pass without real-time feedback. To address these limitations, we propose a novel multi-agent LLM framework, Multi-Agent Large Language Model for Manipulation (MALMM) that distributes high-level planning and low-level control code generation across specialized LLM agents, supervised by an additional agent that dynamically manages transitions. By incorporating observations from the environment after each step, our framework effectively handles intermediate failures and enables adaptive re-planning. Unlike existing methods, our approach does not rely on pre-trained skill policies or in-context learning examples and generalizes to a variety of new tasks. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting, thereby overcoming key limitations of existing LLM-based manipulation methods.
[AI-3] Learning Chemical Reaction Representation with Reactant-Product Alignment
链接: https://arxiv.org/abs/2411.17629
作者: Kaipeng Zeng,Xianbin Liu,Yu Zhang,Xiaokang Yang,Yaohui Jin,Yanyan Xu
关键词-EN: Organic synthesis stands, reaction, chemical reaction representation, Organic synthesis, model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Organic synthesis stands as a cornerstone of chemical industry. The development of robust machine learning models to support tasks associated with organic reactions is of significant interest. However, current methods rely on hand-crafted features or direct adaptations of model architectures from other domains, which lacks feasibility as data scales increase or overlook the rich chemical information inherent in reactions. To address these issues, this paper introduces \modelname, a novel chemical reaction representation learning model tailored for a variety of organic-reaction-related tasks. By integrating atomic correspondence between reactants and products, our model discerns the molecular transformations that occur during the reaction, thereby enhancing the comprehension of the reaction mechanism. We have designed an adapter structure to incorporate reaction conditions into the chemical reaction representation, allowing the model to handle diverse reaction conditions and adapt to various datasets and downstream tasks, e.g., reaction performance prediction. Additionally, we introduce a reaction-center aware attention mechanism that enables the model to concentrate on key functional groups, thereby generating potent representations for chemical reactions. Our model has been evaluated on a range of downstream tasks, including reaction condition prediction, reaction yield prediction, and reaction selectivity prediction. Experimental results indicate that our model markedly outperforms existing chemical reaction representation learning architectures across all tasks. Notably, our model significantly outperforms all the baselines with up to 25% (top-1) and 16% (top-10) increased accuracy over the strongest baseline on USPTO_CONDITION dataset for reaction condition prediction. We plan to open-source the code contingent upon the acceptance of the paper.
[AI-4] Machine Learning and Multi-source Remote Sensing in Forest Carbon Stock Estimation: A Review
链接: https://arxiv.org/abs/2411.17624
作者: Autumn Nguyen,Sulagna Saha
关键词-EN: Quantifying forest carbon, Quantifying forest, protect the planet, crucial for informing, informing decisions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: First author and corresponding author: Autumn Nguyen
点击查看摘要
Abstract:Quantifying forest carbon is crucial for informing decisions and policies that will protect the planet. Machine learning (ML) and remote sensing (RS) techniques have been used to do this task more effectively, yet there lacks a systematic review on the most recent ML methods and RS combinations, especially with the consideration of forest characteristics. This study systematically analyzed 25 papers meeting strict inclusion criteria from over 80 related studies, identifying 28 ML methods and key combinations of RS data. Random Forest had the most frequent appearance (88% of studies), while Extreme Gradient Boosting showed superior performance in 75% of the studies in which it was compared with other methods. Sentinel-1 emerged as the most utilized remote sensing source, with multi-sensor approaches (e.g., Sentinel-1, Sentinel-2, and LiDAR) proving especially effective. Our findings provide grounds for recommending best practices in integrating machine learning and remote sensing for accurate and scalable forest carbon stock estimation.
[AI-5] Automating Chapter-Level Classification for Electronic Theses and Dissertations
链接: https://arxiv.org/abs/2411.17614
作者: Bipasha Banerjee,William A. Ingram,Edward A. Fox
关键词-EN: high-level metadata schemes, long scholarly works, rely on broad, capture the depth, describing electronic
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Traditional archival practices for describing electronic theses and dissertations (ETDs) rely on broad, high-level metadata schemes that fail to capture the depth, complexity, and interdisciplinary nature of these long scholarly works. The lack of detailed, chapter-level content descriptions impedes researchers’ ability to locate specific sections or themes, thereby reducing discoverability and overall accessibility. By providing chapter-level metadata information, we improve the effectiveness of ETDs as research resources. This makes it easier for scholars to navigate them efficiently and extract valuable insights. The absence of such metadata further obstructs interdisciplinary research by obscuring connections across fields, hindering new academic discoveries and collaboration. In this paper, we propose a machine learning and AI-driven solution to automatically categorize ETD chapters. This solution is intended to improve discoverability and promote understanding of chapters. Our approach enriches traditional archival practices by providing context-rich descriptions that facilitate targeted navigation and improved access. We aim to support interdisciplinary research and make ETDs more accessible. By providing chapter-level classification labels and using them to index in our developed prototype system, we make content in ETD chapters more discoverable and usable for a diverse range of scholarly needs. Implementing this AI-enhanced approach allows archives to serve researchers better, enabling efficient access to relevant information and supporting deeper engagement with ETDs. This will increase the impact of ETDs as research tools, foster interdisciplinary exploration, and reinforce the role of archives in scholarly communication within the data-intensive academic landscape.
[AI-6] Making History Readable
链接: https://arxiv.org/abs/2411.17600
作者: Bipasha Banerjee,Jennifer Goyne,William A. Ingram
关键词-EN: Tech University Libraries, Virginia Tech University, Digital Library Platform, University Libraries, Library Platform
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:The Virginia Tech University Libraries (VTUL) Digital Library Platform (DLP) hosts digital collections that offer our users access to a wide variety of documents of historical and cultural importance. These collections are not only of academic importance but also provide our users with a glance at local historical events. Our DLP contains collections comprising digital objects featuring complex layouts, faded imagery, and hard-to-read handwritten text, which makes providing online access to these materials challenging. To address these issues, we integrate AI into our DLP workflow and convert the text in the digital objects into a machine-readable format. To enhance the user experience with our historical collections, we use custom AI agents for handwriting recognition, text extraction, and large language models (LLMs) for summarization. This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps. We discuss the challenges with each collection and detail our approaches to address them. Our proposed methods aim to enhance the user experience by making the contents in these collections easier to search and navigate.
[AI-7] Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals
链接: https://arxiv.org/abs/2411.17598
作者: William A. Ingram,Bipasha Banerjee,Edward A. Fox
关键词-EN: Sustainable Development Goals, United Nations’ Sustainable, Nations’ Sustainable Development, Development Goals, United Nations’
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:As research institutions increasingly commit to supporting the United Nations’ Sustainable Development Goals (SDGs), there is a pressing need to accurately assess their research output against these goals. Current approaches, primarily reliant on keyword-based Boolean search queries, conflate incidental keyword matches with genuine contributions, reducing retrieval precision and complicating benchmarking efforts. This study investigates the application of autoregressive Large Language Models (LLMs) as evaluation agents to identify relevant scholarly contributions to SDG targets in scholarly publications. Using a dataset of academic abstracts retrieved via SDG-specific keyword queries, we demonstrate that small, locally-hosted LLMs can differentiate semantically relevant contributions to SDG targets from documents retrieved due to incidental keyword matches, addressing the limitations of traditional methods. By leveraging the contextual understanding of LLMs, this approach provides a scalable framework for improving SDG-related research metrics and informing institutional reporting.
[AI-8] Learning Explainable Treatment Policies with Clinician-Informed Representations: A Practical Approach ALT ML4H
链接: https://arxiv.org/abs/2411.17570
作者: Johannes O. Ferstad,Emily B. Fox,David Scheinker,Ramesh Johari
关键词-EN: Digital health interventions, remote patient monitoring, shown great potential, improving chronic disease, chronic disease management
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
*备注: Proceedings of Machine Learning for Health (ML4H) 2024. Code available at: this https URL
点击查看摘要
Abstract:Digital health interventions (DHIs) and remote patient monitoring (RPM) have shown great potential in improving chronic disease management through personalized care. However, barriers like limited efficacy and workload concerns hinder adoption of existing DHIs; while limited sample sizes and lack of interpretability limit the effectiveness and adoption of purely black-box algorithmic DHIs. In this paper, we address these challenges by developing a pipeline for learning explainable treatment policies for RPM-enabled DHIs. We apply our approach in the real-world setting of RPM using a DHI to improve glycemic control of youth with type 1 diabetes. Our main contribution is to reveal the importance of clinical domain knowledge in developing state and action representations for effective, efficient, and interpretable targeting policies. We observe that policies learned from clinician-informed representations are significantly more efficacious and efficient than policies learned from black-box representations. This work emphasizes the importance of collaboration between ML researchers and clinicians for developing effective DHIs in the real world.
[AI-9] AI-Augmented Ethical Hacking: A Practical Examination of Manual Exploitation and Privilege Escalation in Linux Environments
链接: https://arxiv.org/abs/2411.17539
作者: Haitham S. Al-Sinani,Chris J. Mitchell
关键词-EN: Linux-based penetration testing, penetration testing environments, comprehensive cybersecurity assessments, Linux-based penetration, testing environments
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 101 pages
点击查看摘要
Abstract:This study explores the application of generative AI (GenAI) within manual exploitation and privilege escalation tasks in Linux-based penetration testing environments, two areas critical to comprehensive cybersecurity assessments. Building on previous research into the role of GenAI in the ethical hacking lifecycle, this paper presents a hands-on experimental analysis conducted in a controlled virtual setup to evaluate the utility of GenAI in supporting these crucial, often manual, tasks. Our findings demonstrate that GenAI can streamline processes, such as identifying potential attack vectors and parsing complex outputs for sensitive data during privilege escalation. The study also identifies key benefits and challenges associated with GenAI, including enhanced efficiency and scalability, alongside ethical concerns related to data privacy, unintended discovery of vulnerabilities, and potential for misuse. This work contributes to the growing field of AI-assisted cybersecurity by emphasising the importance of human-AI collaboration, especially in contexts requiring careful decision-making, rather than the complete replacement of human input.
[AI-10] Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers
链接: https://arxiv.org/abs/2411.17501
作者: Benedikt Stroebl,Sayash Kapoor,Arvind Narayanan
关键词-EN: Recent research, unit tests, passes unit tests, repeatedly sampling solutions, inference scaling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models, such as by repeatedly sampling solutions to a coding problem until it passes unit tests. The central thesis of this paper is that there is no free lunch for inference scaling: indefinite accuracy improvement through resampling can only be realized if the “verifier” (in this case, a set of unit tests) is perfect. When the verifier is imperfect, as it almost always is in domains such as reasoning or coding (for example, unit tests have imperfect coverage), there is a nonzero probability of false positives: incorrect solutions that pass the verifier. Resampling cannot decrease this probability, so it imposes an upper bound to the accuracy of resampling-based inference scaling even with an infinite compute budget. We find that there is a very strong correlation between the model’s single-sample accuracy (i.e. accuracy without unit tests) and its false positive rate on coding benchmarks HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model (Fig. 1a). When we consider that false positives have a negative utility compared to abstaining from producing a solution, it bends the inference scaling curve further downward. Empirically, we find that the optimal number of samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
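论文的核心论点(验证器不完美时,重采样的准确率存在上限)可以用一个小型蒙特卡洛模拟来体会。以下代码为概念性示意(参数 p_correct、fpr 均为假设值,且假设正确解总能通过验证器,即验证器召回率为 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def resampling_accuracy(p_correct, fpr, k, n_trials=50_000):
    """模拟“反复采样直到验证器接受”的推理扩展策略。
    p_correct: 单次采样正确的概率;fpr: 验证器接受错误解的概率(假阳性率);
    k: 最大重采样次数。返回 n_trials 个问题中最终给出正确解的比例,
    k 次采样都未被接受时视为未解出。"""
    solved_correct = 0
    for _ in range(n_trials):
        for _ in range(k):
            correct = rng.random() < p_correct
            if correct or rng.random() < fpr:   # 正确解必通过;错误解以 fpr 概率通过
                solved_correct += correct
                break
    return solved_correct / n_trials

# 随着 k 增大,准确率趋于上限 p/(p + (1-p)*fpr) ≈ 0.811(p=0.3, fpr=0.1),而不会继续逼近 1
for k in (1, 4, 16, 64):
    print(k, round(resampling_accuracy(0.3, 0.1, k), 3))
```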
[AI-11] SoK: Decentralized AI (DeAI)
链接: https://arxiv.org/abs/2411.17461
作者: Zhipeng Wang,Rui Sun,Elizabeth Lui,Vatsal Shah,Xihan Xiong,Jiahao Sun,Davide Crapis,William Knottenbelt
关键词-EN: Artificial Intelligence, poses significant challenges, including single points, data privacy concerns, centralization of Artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: This is a Systematization of Knowledge (SoK) for the rapidly evolving field of Decentralized AI (DeAI). We welcome valuable comments, suggestions, and collaboration to further refine and enhance this work. We hope our contribution will help accelerate the advancement of DeAI
点击查看摘要
Abstract:The centralization of Artificial Intelligence (AI) poses significant challenges, including single points of failure, inherent biases, data privacy concerns, and scalability issues. These problems are especially prevalent in closed-source large language models (LLMs), where user data is collected and used without transparency. To mitigate these issues, blockchain-based decentralized AI (DeAI) has emerged as a promising solution. DeAI combines the strengths of both blockchain and AI technologies to enhance the transparency, security, decentralization, and trustworthiness of AI systems. However, a comprehensive understanding of state-of-the-art DeAI development, particularly for active industry solutions, is still lacking. In this work, we present a Systematization of Knowledge (SoK) for blockchain-based DeAI solutions. We propose a taxonomy to classify existing DeAI protocols based on the model lifecycle. Based on this taxonomy, we provide a structured way to clarify the landscape of DeAI protocols and identify their similarities and differences. We analyze the functionalities of blockchain in DeAI, investigating how blockchain features contribute to enhancing the security, transparency, and trustworthiness of AI processes, while also ensuring fair incentives for AI data and model contributors. In addition, we identify key insights and research gaps in developing DeAI protocols, highlighting several critical avenues for future research.
[AI-12] Rewiring Techniques to Mitigate Oversquashing and Oversmoothing in GNNs: A Survey
链接: https://arxiv.org/abs/2411.17429
作者: Hugo Attali,Davide Buscaldi,Nathalie Pernelle
关键词-EN: Graph Neural Networks, obscuring meaningful distinctions, Neural Networks, homogenize node representations, repeated message-passing iterations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their effectiveness is often constrained by two critical challenges: oversquashing, where the excessive compression of information from distant nodes results in significant information loss, and oversmoothing, where repeated message-passing iterations homogenize node representations, obscuring meaningful distinctions. These issues, intrinsically linked to the underlying graph structure, hinder information flow and constrain the expressiveness of GNNs. In this survey, we examine graph rewiring techniques, a class of methods designed to address these structural bottlenecks by modifying graph topology to enhance information diffusion. We provide a comprehensive review of state-of-the-art rewiring approaches, delving into their theoretical underpinnings, practical implementations, and performance trade-offs.
[AI-13] CLOVER: Constrained Learning with Orthonormal Vectors for Eliminating Redundancy
链接: https://arxiv.org/abs/2411.17426
作者: Fanxu Meng,Muhan Zhang
关键词-EN: leveraging linear combinations, well-trained large model, propose constraining learning, downstream tasks, adapt a well-trained
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:To adapt a well-trained large model to downstream tasks, we propose constraining learning within its original latent space by leveraging linear combinations of its basis vectors. This approach ensures stable training without compromising the model’s capabilities. Traditionally, constructing orthonormal bases from a matrix requires a transfer matrix, which significantly increases storage and computational overhead for parameters and feature maps. In this paper, we introduce Absorb and Decompose for Q, K, V, and O matrices, enabling their orthogonalization without the need for transfer matrices. Furthermore, the Absorb-Decompose operation eliminates redundant vectors, reducing the encoder attention parameters of Whisper-large-v3 by 46.42% without requiring additional training. For parameter-efficient and stable fine-tuning, we orthonormalized Q, K, V, and O and fine-tuned only the singular values, allowing efficient adaptation while constraining changes to the original latent space. When fine-tuning LLaMA-2-7B on eight commonsense reasoning datasets, our method outperforms LoRA by 5.4% and DoRA by 4.4%.
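As a rough illustration of the "fine-tune only the singular values" idea (a minimal sketch under assumed simplifications, not the CLOVER implementation), one can factor a pretrained projection with an SVD, freeze the orthonormal factors, and leave only the singular values trainable:

```python
import torch
import torch.nn as nn

class SVDLinear(nn.Module):
    """Linear layer W = U diag(s) V^T with U, V frozen (orthonormal bases of the
    original weight) and only the singular values s trainable."""
    def __init__(self, weight, bias=None):
        super().__init__()
        U, s, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)          # frozen orthonormal basis
        self.register_buffer("Vh", Vh)        # frozen orthonormal basis
        self.s = nn.Parameter(s.clone())      # only trainable part
        self.bias = None if bias is None else nn.Parameter(bias.clone())

    def forward(self, x):
        W = self.U @ torch.diag(self.s) @ self.Vh   # reconstruct weight on the fly
        out = x @ W.T
        return out if self.bias is None else out + self.bias

# Wrap a pretrained projection (here a randomly initialized stand-in for, e.g., a Q matrix).
pretrained = nn.Linear(64, 64)
svd_layer = SVDLinear(pretrained.weight.data, pretrained.bias.data)
x = torch.randn(2, 64)
print(torch.allclose(svd_layer(x), pretrained(x), atol=1e-4))  # True: exact at initialization
```

Only `s` (and the bias) receives gradients, so updates stay within the span of the original orthonormal bases.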
[AI-14] Advancing Uncertain Combinatorics through Graphization Hyperization and Uncertainization: Fuzzy Neutrosophic Soft Rough and Beyond
链接: https://arxiv.org/abs/2411.17411
作者: Takaaki Fujita
关键词-EN: handle real-world uncertainty, sets, neutrosophic sets, neutrosophic, concepts
类目: Artificial Intelligence (cs.AI)
*备注: 255 pages. 11 figures. Published as a book in 2024. Publisher: Biblio Publishing. ISBN: 978-1-59973-812-3
点击查看摘要
Abstract:To better handle real-world uncertainty, concepts such as fuzzy sets, neutrosophic sets, rough sets, and soft sets have been introduced. For example, neutrosophic sets, which simultaneously represent truth, indeterminacy, and falsehood, have proven to be valuable tools for modeling uncertainty in complex systems. These set concepts are increasingly studied in graphized forms, and generalized graph concepts now encompass well-known structures such as hypergraphs and superhypergraphs. Furthermore, hyperconcepts and superhyperconcepts are being actively researched in areas beyond graph theory. Combinatorics, uncertain sets (including fuzzy sets, neutrosophic sets, rough sets, soft sets, and plithogenic sets), uncertain graphs, and hyper and superhyper concepts are active areas of research with significant mathematical and practical implications. Recognizing their importance, this paper explores new graph and set concepts, as well as hyper and superhyper concepts, as detailed in the “Results” section of “The Structure of the Paper.” Additionally, this work aims to consolidate recent findings, providing a survey-like resource to inform and engage readers. For instance, we extend several graph concepts by introducing Neutrosophic Oversets, Neutrosophic Undersets, Neutrosophic Offsets, and the Nonstandard Real Set. This paper defines a variety of concepts with the goal of inspiring new ideas and serving as a valuable resource for researchers in their academic pursuits.
[AI-15] BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving
链接: https://arxiv.org/abs/2411.17404
作者: Teng Wang,Wing-Yin Yu,Zhenqi He,Zehua Liu,Xiongwei Han,Hailei Gong,Han Wu,Wei Shi,Ruifeng She,Fangzhou Zhu,Tao Zhong
关键词-EN: LLMs exhibit advanced, transform natural language, natural language questions, advanced reasoning capabilities, exhibit advanced reasoning
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source operations research datasets lack detailed annotations of the modeling process, such as variable definitions, focusing solely on objective values, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods, including Chain-of-Thought, Self-Consistency, and Tree-of-Thought. In tree-based reasoning, BPP-Search also surpasses Process Reward Model combined with Greedy or Beam Search, demonstrating superior accuracy and efficiency, and enabling faster retrieval of correct solutions.
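The beam-search-plus-process-reward part of the pipeline can be sketched generically. In the snippet below, `expand` and `score` are hypothetical stand-ins for the LLM step proposer and the trained process reward model; the pairwise preference component is omitted.

```python
# Schematic beam search guided by a process reward model (PRM) over partial
# reasoning states; this is a generic illustration, not BPP-Search itself.
from typing import Callable, List, Tuple

def prm_beam_search(root: str,
                    expand: Callable[[str], List[str]],
                    score: Callable[[str], float],
                    beam_width: int = 3,
                    max_depth: int = 4) -> str:
    beam: List[Tuple[float, str]] = [(score(root), root)]
    for _ in range(max_depth):
        candidates = []
        for _, state in beam:
            for child in expand(state):                    # LLM proposes next steps
                candidates.append((score(child), child))   # PRM scores each partial path
        if not candidates:
            break
        candidates.sort(key=lambda t: t[0], reverse=True)
        beam = candidates[:beam_width]                     # keep top-k partial solutions
    return max(beam, key=lambda t: t[0])[1]

# Toy demo: states are digit strings, the "reward" prefers many '1's.
best = prm_beam_search(
    root="",
    expand=lambda s: [s + "0", s + "1"],
    score=lambda s: s.count("1") - 0.1 * len(s),
)
print(best)  # "1111"
```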
[AI-16] Knowledge-aware Evolutionary Graph Neural Architecture Search
链接: https://arxiv.org/abs/2411.17339
作者: Chao Wang,Jiaxuan Zhao,Lingling Li,Licheng Jiao,Fang Liu,Xu Liu,Shuyuan Yang
关键词-EN: high-performance graph neural, graph neural network, neural network architectures, customize high-performance graph, specific graph tasks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work has been accepted by Knowledge-Based Systems
点击查看摘要
Abstract:Graph neural architecture search (GNAS) can customize high-performance graph neural network architectures for specific graph tasks or datasets. However, existing GNAS methods begin searching for architectures from a zero-knowledge state, ignoring the prior knowledge that may improve the search efficiency. The available knowledge base (e.g. NAS-Bench-Graph) contains many rich architectures and their multiple performance metrics, such as the accuracy (#Acc) and number of parameters (#Params). This study proposes exploiting such prior knowledge to accelerate the multi-objective evolutionary search on a new graph dataset, named knowledge-aware evolutionary GNAS (KEGNAS). KEGNAS employs the knowledge base to train a knowledge model and a deep multi-output Gaussian process (DMOGP) in one go, which generates and evaluates transfer architectures in only a few GPU seconds. The knowledge model first establishes a dataset-to-architecture mapping, which can quickly generate candidate transfer architectures for a new dataset. Subsequently, the DMOGP with architecture and dataset encodings is designed to predict multiple performance metrics for candidate transfer architectures on the new dataset. According to the predicted metrics, non-dominated candidate transfer architectures are selected to warm-start the multi-objective evolutionary algorithm for optimizing the #Acc and #Params on a new dataset. Empirical studies on NAS-Bench-Graph and five real-world datasets show that KEGNAS swiftly generates top-performance architectures, achieving 4.27% higher accuracy than advanced evolutionary baselines and 11.54% higher accuracy than advanced differentiable baselines. In addition, ablation studies demonstrate that the use of prior knowledge significantly improves the search performance.
[AI-17] Towards Intention Recognition for Robotic Assistants Through Online POMDP Planning ICAPS2023
链接: https://arxiv.org/abs/2411.17326
作者: Juan Carlos Saborio,Joachim Hertzberg
关键词-EN: plays a vital, ability to anticipate, vital role, design and development, development of automated
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Presented at the ICAPS 2023 workshop “PAIR: Plan, Activity, and Intent Recognition”
点击查看摘要
Abstract:Intention recognition, or the ability to anticipate the actions of another agent, plays a vital role in the design and development of automated assistants that can support humans in their daily tasks. In particular, industrial settings pose interesting challenges that include potential distractions for a decision-maker as well as noisy or incomplete observations. In such a setting, a robotic assistant tasked with helping and supporting a human worker must interleave information gathering actions with proactive tasks of its own, an approach that has been referred to as active goal recognition. In this paper we describe a partially observable model for online intention recognition, show some preliminary experimental results and discuss some of the challenges present in this family of problems.
[AI-18] PIM-AI: A Novel Architecture for High-Efficiency LLM Inference
链接: https://arxiv.org/abs/2411.17309
作者: Cristobal Ortega,Yann Falevoz,Renaud Ayrignac
关键词-EN: Large Language Models, advanced language understanding, Large Language, advanced language, language understanding
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have become essential in a variety of applications due to their advanced language understanding and generation capabilities. However, their computational and memory requirements pose significant challenges to traditional hardware architectures. Processing-in-Memory (PIM), which integrates computational units directly into memory chips, offers several advantages for LLM inference, including reduced data transfer bottlenecks and improved power efficiency. This paper introduces PIM-AI, a novel DDR5/LPDDR5 PIM architecture designed for LLM inference without modifying the memory controller or DDR/LPDDR memory PHY. We have developed a simulator to evaluate the performance of PIM-AI in various scenarios and demonstrate its significant advantages over conventional architectures. In cloud-based scenarios, PIM-AI reduces the 3-year TCO per queries-per-second by up to 6.94x compared to state-of-the-art GPUs, depending on the LLM model used. In mobile scenarios, PIM-AI achieves a 10- to 20-fold reduction in energy per token compared to state-of-the-art mobile SoCs, resulting in 25 to 45% more queries per second and 6.9x to 13.4x less energy per query, extending battery life and enabling more inferences per charge. These results highlight PIM-AI’s potential to revolutionize LLM deployments, making them more efficient, scalable, and sustainable.
[AI-19] GrokFormer: Graph Fourier Kolmogorov-Arnold Transformers
链接: https://arxiv.org/abs/2411.17296
作者: Guoguo Ai,Guansong Pang,Hezhe Qiao,Yuan Gao,Hui Yan
关键词-EN: long-range structural dependency, demonstrated remarkable performance, Graph, long-range structural, structural dependency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, 6 figures, 7 tables
点击查看摘要
Abstract:Graph Transformers (GTs) have demonstrated remarkable performance in incorporating various graph structure information, e.g., long-range structural dependency, into graph representation learning. However, self-attention – the core module of GTs – preserves only low-frequency signals on graph features, retaining only homophilic patterns that capture similar features among the connected nodes. Consequently, it has insufficient capacity in modeling complex node label patterns, such as the opposite of homophilic patterns – heterophilic patterns. Some improved GTs deal with the problem by learning polynomial filters or performing self-attention over the first-order graph spectrum. However, these GTs either ignore rich information contained in the whole spectrum or neglect higher-order spectrum information, resulting in limited flexibility and frequency response in their spectral filters. To tackle these challenges, we propose a novel GT network, namely Graph Fourier Kolmogorov-Arnold Transformers (GrokFormer), to go beyond the self-attention in GTs. GrokFormer leverages learnable activation functions in the order-K graph spectrum through Fourier series modeling to i) learn eigenvalue-targeted filter functions producing a learnable basis that can capture a broad range of frequency signals flexibly, and ii) extract first- and higher-order graph spectral information adaptively. In doing so, GrokFormer can effectively capture intricate patterns hidden across different orders and levels of frequency signals, learning expressive, order-and-frequency-adaptive graph representations. Comprehensive experiments conducted on 10 node classification datasets across various domains, scales, and levels of graph heterophily, as well as 5 graph classification datasets, demonstrate that GrokFormer outperforms state-of-the-art GTs and other advanced graph neural networks.
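A toy analogue of the core idea, parameterizing a spectral filter over the graph eigenvalues with a learnable Fourier series, is sketched below. This is an assumed simplification for illustration, not the official GrokFormer architecture.

```python
import torch
import torch.nn as nn

class FourierSpectralFilter(nn.Module):
    """Filter h(lam) = a0 + sum_k [a_k cos(k*pi*lam/lam_max) + b_k sin(k*pi*lam/lam_max)]
    applied in the eigenbasis of the graph Laplacian. A toy analogue of learning
    filter functions via Fourier series, not the paper's model."""
    def __init__(self, num_terms: int = 4, lam_max: float = 2.0):
        super().__init__()
        self.a = nn.Parameter(torch.cat([torch.ones(1), torch.zeros(num_terms)]))
        self.b = nn.Parameter(torch.zeros(num_terms))
        self.lam_max = lam_max

    def response(self, lam: torch.Tensor) -> torch.Tensor:
        k = torch.arange(1, self.b.numel() + 1, dtype=lam.dtype)
        phase = torch.pi * lam[:, None] * k[None, :] / self.lam_max
        return self.a[0] + (torch.cos(phase) * self.a[1:]).sum(-1) \
                         + (torch.sin(phase) * self.b).sum(-1)

    def forward(self, laplacian: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        lam, U = torch.linalg.eigh(laplacian)         # graph spectrum
        h = self.response(lam)                        # learnable frequency response
        return U @ (h[:, None] * (U.T @ x))           # filter the node features

# Toy usage: 5-node path graph, random node features.
A = torch.diag(torch.ones(4), 1); A = A + A.T
L = torch.diag(A.sum(1)) - A
x = torch.randn(5, 3)
print(FourierSpectralFilter()(L, x).shape)  # torch.Size([5, 3])
```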
[AI-20] Social Distancing Induced Coronavirus Optimization Algorithm (COVO): Application to Multimodal Function Optimization and Noise Removal
链接: https://arxiv.org/abs/2411.17282
作者: Om Ramakisan Varma,Mala Kalra
关键词-EN: optimization technique attained, attained more awareness, awareness for handling, handling complex optimization, metaheuristic optimization technique
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Metaheuristic optimization techniques have attained growing attention for handling complex optimization problems. Over the last few years, numerous optimization techniques have been developed that are inspired by natural phenomena. Recently, the spread of the new COVID-19 virus placed a heavy burden on public health systems and caused many deaths. Vaccination, masks, and social distancing are the major steps taken to minimize the spread of the deadly COVID-19 virus. Considering social distancing as a way to combat the coronavirus epidemic, a novel bio-inspired metaheuristic optimization model is proposed in this work, termed the Social Distancing Induced Coronavirus Optimization Algorithm (COVO). The pace of propagation of the coronavirus can indeed be slowed by maintaining social distance. Thirteen benchmark functions are used to evaluate the COVO performance for discrete, continuous, and complex problems, and the COVO model performance is compared with other well-known optimization algorithms. The main motive of COVO optimization is to obtain a global solution to various applications by solving complex problems with faster convergence. Finally, the validation results show that the proposed COVO optimization achieves reasonable and acceptable performance.
[AI-21] APT: Architectural Planning and Text-to-Blueprint Construction Using Large Language Models for Open-World Agents
链接: https://arxiv.org/abs/2411.17255
作者: Jun Yu Chen,Tao Gao
关键词-EN: Large Language Model, advanced Large Language, Large Language, Minecraft environment, enables autonomous agents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages
点击查看摘要
Abstract:We present APT, an advanced Large Language Model (LLM)-driven framework that enables autonomous agents to construct complex and creative structures within the Minecraft environment. Unlike previous approaches that primarily concentrate on skill-based open-world tasks or rely on image-based diffusion models for generating voxel-based structures, our method leverages the intrinsic spatial reasoning capabilities of LLMs. By employing chain-of-thought decomposition along with multimodal inputs, the framework generates detailed architectural layouts and blueprints that the agent can execute under zero-shot or few-shot learning scenarios. Our agent incorporates both memory and reflection modules to facilitate lifelong learning, adaptive refinement, and error correction throughout the building process. To rigorously evaluate the agent’s performance in this emerging research area, we introduce a comprehensive benchmark consisting of diverse construction tasks designed to test creativity, spatial reasoning, adherence to in-game rules, and the effective integration of multimodal instructions. Experimental results using various GPT-based LLM backends and agent configurations demonstrate the agent’s capacity to accurately interpret extensive instructions involving numerous items, their positions, and orientations. The agent successfully produces complex structures complete with internal functionalities such as Redstone-powered systems. A/B testing indicates that the inclusion of a memory module leads to a significant increase in performance, emphasizing its role in enabling continuous learning and the reuse of accumulated experience. Additionally, the agent’s unexpected emergence of scaffolding behavior highlights the potential of future LLM-driven agents to utilize subroutine planning and leverage the emergence ability of LLMs to autonomously develop human-like problem-solving techniques.
[AI-22] From Graph Diffusion to Graph Classification
链接: https://arxiv.org/abs/2411.17236
作者: Jia Jun Cheng Xian,Sadegh Mahdavi,Renjie Liao,Oliver Schulte
关键词-EN: achieved remarkable success, Generative models, diffusion models, achieved remarkable, Generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Generative models such as diffusion models have achieved remarkable success in state-of-the-art image and text tasks. Recently, score-based diffusion models have extended their success beyond image generation, showing competitive performance with discriminative methods in image classification tasks (Zimmermann et al., 2021). However, their application to classification in the graph domain, which presents unique challenges such as complex topologies, remains underexplored. We show how graph diffusion models can be applied for graph classification. We find that to achieve competitive classification accuracy, score-based graph diffusion models should be trained with a novel training objective that is tailored to graph classification. In experiments with a sampling-based inference method, our discriminative training objective achieves state-of-the-art graph classification accuracy.
[AI-23] GraphSubDetector: Time Series Subsequence Anomaly Detection via Density-Aware Adaptive Graph Neural Network
链接: https://arxiv.org/abs/2411.17218
作者: Weiqi Chen,Zhiqiang Zhou,Qingsong Wen,Liang Sun
关键词-EN: real-world applications ranging, effectively learn complex, learn complex dynamics, subsequence anomaly detection, proper subsequence length
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Time series subsequence anomaly detection is an important task in a large variety of real-world applications ranging from health monitoring to AIOps, and is challenging due to the following reasons: 1) how to effectively learn complex dynamics and dependencies in time series; 2) diverse and complicated anomalous subsequences as well as the inherent variance and noise of normal patterns; 3) how to determine the proper subsequence length for effective detection, which is a required parameter for many existing algorithms. In this paper, we present a novel approach to subsequence anomaly detection, namely GraphSubDetector. First, it adaptively learns the appropriate subsequence length with a length selection mechanism that highlights the characteristics of both normal and anomalous patterns. Second, we propose a density-aware adaptive graph neural network (DAGNN), which can generate further robust representations against variance of normal data for anomaly detection by message passing between subsequences. The experimental results demonstrate the effectiveness of the proposed algorithm, which achieves superior performance on multiple time series anomaly benchmark datasets compared to state-of-the-art algorithms.
[AI-24] Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks
链接: https://arxiv.org/abs/2411.17201
作者: Hengyu Fu,Zihao Wang,Eshaan Nichani,Jason D. Lee
关键词-EN: multiple nonlinear features, learning, nonlinear features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 78 pages, 4 figures
点击查看摘要
Abstract:In deep learning theory, a critical question is to understand how neural networks learn hierarchical features. In this work, we study the learning of hierarchical polynomials of multiple nonlinear features using three-layer neural networks. We examine a broad class of functions of the form $f^\star = g^\star \circ \mathbf{p}$, where $\mathbf{p}: \mathbb{R}^d \rightarrow \mathbb{R}^r$ represents multiple quadratic features with $r \ll d$ and $g^\star: \mathbb{R}^r \rightarrow \mathbb{R}$ is a polynomial of degree $p$. This can be viewed as a nonlinear generalization of the multi-index model (Damian et al., 2022), and also an expansion upon previous work that focused only on a single nonlinear feature, i.e. $r = 1$ (Nichani et al., 2023; Wang et al., 2023). Our primary contribution shows that a three-layer neural network trained via layerwise gradient descent suffices for (i) complete recovery of the space spanned by the nonlinear features and (ii) efficient learning of the target function $f^\star = g^\star \circ \mathbf{p}$ or transfer learning of $f = g \circ \mathbf{p}$ with a different link function, within $\widetilde{O}(d^4)$ samples and polynomial time. For such hierarchical targets, our result substantially improves the sample complexity $\Theta(d^{2p})$ of kernel methods, demonstrating the power of efficient feature learning. It is important to highlight that our results leverage novel techniques and thus manage to go beyond all prior settings such as single-index and multi-index models as well as models depending just on one nonlinear feature, contributing to a more comprehensive understanding of feature learning in deep learning.
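For intuition about the target class, the following minimal sketch (an assumed toy setup, not the paper's construction or its layerwise training procedure) generates data from a hierarchical polynomial f* = g*(p(x)) with r = 2 quadratic features and fits a three-layer network with plain gradient descent:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, r, n = 20, 2, 4096                        # ambient dim, #nonlinear features, samples

# Target f*(x) = g*(p(x)): p maps x to r quadratic features, g* is a degree-2 polynomial.
Q = [torch.randn(d, d) / d for _ in range(r)]
def p(x):     return torch.stack([(x @ q * x).sum(-1) for q in Q], dim=-1)  # quadratic features
def gstar(z): return z[:, 0] * z[:, 1] + z[:, 0] ** 2                        # link polynomial

x = torch.randn(n, d)
y = gstar(p(x))

# Three-layer network (two hidden layers), trained end to end with Adam rather than
# the paper's layerwise procedure -- purely an illustrative baseline.
net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(500):
    loss = nn.functional.mse_loss(net(x).squeeze(-1), y)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final training MSE: {loss.item():.3f}")
```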
[AI-25] Self-reconfiguration Strategies for Space-distributed Spacecraft
链接: https://arxiv.org/abs/2411.17137
作者: Tianle Liu,Zhixiang Wang,Yongwei Zhang,Ziwei Wang,Zihao Liu,Yizhai Zhang,Panfeng Huang
关键词-EN: on-orbit spacecraft assembly, spacecraft assembly algorithm, specific functions, structure with specific, distributed on-orbit spacecraft
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper proposes a distributed on-orbit spacecraft assembly algorithm, where future spacecraft can assemble modules with different functions on orbit to form a spacecraft structure with specific functions. This form of spacecraft organization has the advantages of reconfigurability, fast mission response and easy maintenance. Reasonable and efficient on-orbit self-reconfiguration algorithms play a crucial role in realizing the benefits of distributed spacecraft. This paper adopts a framework of imitation learning combined with reinforcement learning to learn the module handling order. A robot arm motion algorithm is then designed to execute the handling sequence. We achieve the self-reconfiguration handling task by creating a map on the surface of the module and completing the path point planning of the robotic arm using A*. The joint planning of the robotic arm is then accomplished through forward and inverse kinematics. Finally, the results are presented in Unity3D.
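The A* path-point planning step can be illustrated with a generic grid-based sketch; the occupancy grid below is a stand-in rather than the paper's module-surface map.

```python
import heapq

def a_star(grid, start, goal):
    """A* over a 2D occupancy grid (0 = free, 1 = blocked) with a Manhattan
    heuristic -- a generic sketch of path-point planning, not the paper's code."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, None)]
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:                  # already expanded with a better cost
            continue
        came_from[node] = parent
        if node == goal:                       # reconstruct the path back to start
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), ng, (nr, nc), node))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)] -- shortest route around the blocked cells
```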
[AI-26] LLM-Based Offline Learning for Embodied Agents via Consistency-Guided Reward Ensemble EMNLP-2024
链接: https://arxiv.org/abs/2411.17135
作者: Yujeong Lee,Sangwoo Shin,Wei-Jin Park,Honguk Woo
关键词-EN: Employing large language, large language models, Employing large, language models, limitations in practice
类目: Artificial Intelligence (cs.AI)
*备注: Findings of EMNLP-2024 Camera Ready Version
点击查看摘要
Abstract:Employing large language models (LLMs) to enable embodied agents has become popular, yet it presents several limitations in practice. In this work, rather than using LLMs directly as agents, we explore their use as tools for embodied agent learning. Specifically, to train separate agents via offline reinforcement learning (RL), an LLM is used to provide dense reward feedback on individual actions in training datasets. In doing so, we present a consistency-guided reward ensemble framework (CoREN), designed for tackling difficulties in grounding LLM-generated estimates to the target environment domain. The framework employs an adaptive ensemble of spatio-temporally consistent rewards to derive domain-grounded rewards in the training datasets, thus enabling effective offline learning of embodied agents in different environment domains. Experiments with the VirtualHome benchmark demonstrate that CoREN significantly outperforms other offline RL agents, and it also achieves comparable performance to state-of-the-art LLM-based agents with 8B parameters, despite CoREN having only 117M parameters for the agent policy network and using LLMs only for training.
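A much-simplified sketch of the consistency-guided ensembling idea, weighting repeated LLM reward queries by how consistent they are across queries and over time, is shown below. The specific weighting scheme is an assumption for illustration, not the CoREN formulation.

```python
import numpy as np

def consistency_weighted_reward(reward_samples: np.ndarray) -> np.ndarray:
    """reward_samples: (num_queries, num_timesteps) LLM reward estimates for one
    trajectory. Each query is down-weighted if it disagrees with the per-step
    consensus (spatial) or fluctuates erratically over time (temporal)."""
    mean_per_step = reward_samples.mean(axis=0)
    spatial_dev = np.abs(reward_samples - mean_per_step).mean(axis=1)
    temporal_dev = np.abs(np.diff(reward_samples, axis=1)).mean(axis=1)
    weights = np.exp(-(spatial_dev + temporal_dev))
    weights /= weights.sum()
    return weights @ reward_samples            # consistency-weighted ensemble reward

samples = np.array([[0.1, 0.2, 0.8, 0.9],      # well-behaved query
                    [0.0, 0.3, 0.7, 1.0],      # well-behaved query
                    [0.9, 0.0, 1.0, 0.1]])     # erratic query, gets a small weight
print(consistency_weighted_reward(samples).round(2))
```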
[AI-27] Creative Agents: Simulating the Systems Model of Creativity with Generative Agents
链接: https://arxiv.org/abs/2411.17065
作者: Naomi Imasato,Kazuki Miyazawa,Takayuki Nagai,Takato Horii
关键词-EN: witnessed models rapidly, models rapidly improve, quality and performance, growing popularity, rapidly improve
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the growing popularity of generative AI for images, video, and music, we witnessed models rapidly improve in quality and performance. However, not much attention is paid towards enabling AI’s ability to “be creative”. In this study, we implemented and simulated the systems model of creativity (proposed by Csikszentmihalyi) using virtual agents utilizing large language models (LLMs) and text prompts. For comparison, the simulations were conducted with the “virtual artists” being: 1)isolated and 2)placed in a multi-agent system. Both scenarios were compared by analyzing the variations and overall “creativity” in the generated artifacts (measured via a user study and LLM). Our results suggest that the generative agents may perform better in the framework of the systems model of creativity.
[AI-28] Graph Structure Learning with Bi-level Optimization
链接: https://arxiv.org/abs/2411.17062
作者: Nan Yin
关键词-EN: Graph Structure Learning, local structure heterogeneity, Graph Structure, local information related, learning graph structure
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Currently, most Graph Structure Learning (GSL) methods, as a means of learning graph structure, improve the robustness of GNNs merely from a local view by considering the local information related to each edge and indiscriminately applying the mechanism across edges, which may suffer from the local structure heterogeneity of the graph (i.e., the uneven distribution of inter-class connections over nodes). To overcome these drawbacks, we extract the graph structure as a learnable parameter and jointly learn the structure and common parameters of the GNN from the global view. Excitingly, the common parameters contain the global information for node feature mapping, which is also crucial for structure optimization (i.e., optimizing the structure relies on global mapping information). Mathematically, we apply a generic structure extractor to abstract the graph structure and transform GNNs in the form of learning structure and common parameters. Then, we model the learning process as a novel bi-level optimization, i.e., Generic Structure Extraction with Bi-level Optimization for Graph Structure Learning (GSEBO), which optimizes GNN parameters in the upper level to obtain the global mapping information, while the graph structure is optimized in the lower level with the global information learned from the upper level. We instantiate the proposed GSEBO on classical GNNs and compare it with the state-of-the-art GSL methods. Extensive experiments validate the effectiveness of the proposed GSEBO on four real-world datasets.
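The alternating flavor of such a bi-level scheme can be sketched as follows: a simplified single-step alternation with a dense learnable adjacency, under assumed toy data, not the exact GSEBO lower/upper-level solvers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, c = 30, 8, 3
x = torch.randn(n, d)
labels = torch.randint(0, c, (n,))

# Learnable graph structure (lower level) and GNN weights (upper level).
adj_logits = nn.Parameter(torch.zeros(n, n))
gnn = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, c))
opt_upper = torch.optim.Adam(gnn.parameters(), lr=1e-2)
opt_lower = torch.optim.Adam([adj_logits], lr=1e-2)

def forward():
    A = torch.sigmoid(adj_logits)                 # dense, differentiable structure
    A = (A + A.T) / 2                             # keep it symmetric
    A_hat = A / A.sum(1, keepdim=True).clamp_min(1e-6)
    return gnn(A_hat @ x)                         # one propagation step, then MLP

for _ in range(100):
    # Upper level: update the common GNN parameters (global mapping information).
    loss = nn.functional.cross_entropy(forward(), labels)
    opt_upper.zero_grad(); loss.backward(); opt_upper.step()
    # Lower level: update the structure given the current global mapping.
    loss = nn.functional.cross_entropy(forward(), labels)
    opt_lower.zero_grad(); loss.backward(); opt_lower.step()
print(f"training loss: {loss.item():.3f}")
```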
[AI-29] ThreatModeling-LLM: Automating Threat Modeling using Large Language Models for Banking System
链接: https://arxiv.org/abs/2411.17058
作者: Shuiqiao Yang,Tingmin Wu,Shigang Liu,David Nguyen,Seung Jang,Alsharif Abuadbba
关键词-EN: Threat modeling, component of cybersecurity, data is paramount, Large Language Models, crucial component
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Threat modeling is a crucial component of cybersecurity, particularly for industries such as banking, where the security of financial data is paramount. Traditional threat modeling approaches require expert intervention and manual effort, often leading to inefficiencies and human error. The advent of Large Language Models (LLMs) offers a promising avenue for automating these processes, enhancing both efficiency and efficacy. However, this transition is not straightforward due to three main challenges: (1) the lack of publicly available, domain-specific datasets, (2) the need for tailored models to handle complex banking system architectures, and (3) the requirement for real-time, adaptive mitigation strategies that align with compliance standards like NIST 800-53. In this paper, we introduce ThreatModeling-LLM, a novel and adaptable framework that automates threat modeling for banking systems using LLMs. ThreatModeling-LLM operates in three stages: 1) dataset creation, 2) prompt engineering and 3) model fine-tuning. We first generate a benchmark dataset using Microsoft Threat Modeling Tool (TMT). Then, we apply Chain of Thought (CoT) and Optimization by PROmpting (OPRO) on the pre-trained LLMs to optimize the initial prompt. Lastly, we fine-tune the LLM using Low-Rank Adaptation (LoRA) based on the benchmark dataset and the optimized prompt to improve the threat identification and mitigation generation capabilities of pre-trained LLMs.
[AI-30] Can a Single Tree Outperform an Entire Forest?
链接: https://arxiv.org/abs/2411.17003
作者: Qiangqiang Mao,Yankai Cao
关键词-EN: classic random forest, underperforms classic random, classic random, single decision tree, decision tree underperforms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The prevailing mindset is that a single decision tree underperforms classic random forests in testing accuracy, despite its advantages in interpretability and lightweight structure. This study challenges such a mindset by significantly improving the testing accuracy of an oblique regression tree through our gradient-based entire tree optimization framework, making its performance comparable to the classic random forest. Our approach reformulates tree training as a differentiable unconstrained optimization task, employing a scaled sigmoid approximation strategy. To ameliorate numerical instability, we propose an algorithmic scheme that solves a sequence of increasingly accurate approximations. Additionally, a subtree polish strategy is implemented to reduce approximation errors accumulated across the tree. Extensive experiments on 16 datasets demonstrate that our optimized tree outperforms the classic random forest by an average of 2.03% in testing accuracy.
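The scaled-sigmoid relaxation is the key trick: hard oblique splits become differentiable gates, so the whole tree can be trained by gradient descent. A minimal sketch follows; unlike the paper, it has no annealing over approximation accuracy and no subtree polishing.

```python
import torch
import torch.nn as nn

class SoftObliqueTree(nn.Module):
    """Soft oblique regression tree with sigmoid-relaxed splits."""
    def __init__(self, d, depth=2, scale=10.0):
        super().__init__()
        self.n_internal, self.n_leaves = 2 ** depth - 1, 2 ** depth
        self.w = nn.Parameter(torch.randn(self.n_internal, d) * 0.1)
        self.b = nn.Parameter(torch.zeros(self.n_internal))
        self.leaf = nn.Parameter(torch.zeros(self.n_leaves))
        self.scale = scale                        # larger scale -> closer to hard splits
        self.depth = depth

    def forward(self, x):
        gate = torch.sigmoid(self.scale * (x @ self.w.T + self.b))   # (n, n_internal)
        probs, idx = torch.ones(x.shape[0], 1), 0
        for level in range(self.depth):
            g = gate[:, idx: idx + 2 ** level]
            probs = torch.stack([probs * g, probs * (1 - g)], dim=-1).flatten(1)
            idx += 2 ** level
        return probs @ self.leaf                  # soft routing to leaf values

torch.manual_seed(0)
x = torch.randn(256, 2)
y = (x[:, 0] + x[:, 1] > 0).float() * 2 - 1      # piecewise-constant target
tree = SoftObliqueTree(d=2)
opt = torch.optim.Adam(tree.parameters(), lr=0.05)
for _ in range(300):
    loss = nn.functional.mse_loss(tree(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"MSE after training: {loss.item():.3f}")
```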
[AI-31] ExpTest: Automating Learning Rate Searching and Tuning with Insights from Linearized Neural Networks
链接: https://arxiv.org/abs/2411.16975
作者: Zan Chaudhry,Naoko Mizuno
关键词-EN: time-intensive grid searches, increasing resource costs, initial learning rate, learning rate, initial learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Hyperparameter tuning remains a significant challenge for the training of deep neural networks (DNNs), requiring manual and/or time-intensive grid searches, increasing resource costs and presenting a barrier to the democratization of machine learning. The global initial learning rate for DNN training is particularly important. Several techniques have been proposed for automated learning rate tuning during training; however, they still require manual searching for the global initial learning rate. Though methods exist that do not require this initial selection, they suffer from poor performance. Here, we present ExpTest, a sophisticated method for initial learning rate searching and subsequent learning rate tuning for the training of DNNs. ExpTest draws on insights from linearized neural networks and the form of the loss curve, which we treat as a real-time signal upon which we perform hypothesis testing. We mathematically justify ExpTest and provide empirical support. ExpTest requires minimal overhead, is robust to hyperparameter choice, and achieves state-of-the-art performance on a variety of tasks and architectures, without initial learning rate selection or learning rate scheduling.
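ExpTest's exact test statistic comes from linearized-network theory, which is not reproduced here; the sketch below only illustrates the general recipe of treating the early loss curve as a signal and hypothesis-testing for a reliable downward trend when screening candidate initial learning rates. The `train_steps` hook and the toy loss curves are assumptions for the demo.

```python
import numpy as np
from scipy import stats

def pick_initial_lr(train_steps, candidate_lrs, n_probe=50, alpha=0.05):
    """Pick the largest learning rate whose short probe run shows a statistically
    significant downward trend in the loss curve (slope t-test on the log-loss).
    `train_steps(lr, n)` must return the first n loss values when training with lr."""
    for lr in sorted(candidate_lrs, reverse=True):
        losses = np.asarray(train_steps(lr, n_probe), dtype=float)
        if not np.all(np.isfinite(losses)):
            continue                                     # diverged -> reject this lr
        t = np.arange(n_probe)
        res = stats.linregress(t, np.log(losses + 1e-12))
        if res.slope < 0 and res.pvalue < alpha:         # loss is reliably decreasing
            return lr
    return min(candidate_lrs)                            # fall back to the safest choice

# Toy stand-in for a real training loop: loss decays for small lr, blows up otherwise.
def fake_train(lr, n, rng=np.random.default_rng(0)):
    if lr > 0.5:
        return np.exp(np.linspace(0, 5, n))              # divergence
    return np.exp(-lr * np.arange(n)) + 0.05 * rng.standard_normal(n) + 1.0

print(pick_initial_lr(fake_train, [1.0, 0.3, 0.1, 0.03, 0.01]))  # 0.3
```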
[AI-32] Clustering Time Series Data with Gaussian Mixture Embeddings in a Graph Autoencoder Framework
链接: https://arxiv.org/abs/2411.16972
作者: Amirabbas Afzali,Hesam Hosseini,Mohmmadamin Mirzai,Arash Amini
关键词-EN: time series clustering, Time series, environmental monitoring, Traditional time series, series data analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: First two listed authors have equal contribution. Author ordering is determined by coin flip
点击查看摘要
Abstract:Time series data analysis is prevalent across various domains, including finance, healthcare, and environmental monitoring. Traditional time series clustering methods often struggle to capture the complex temporal dependencies inherent in such data. In this paper, we propose the Variational Mixture Graph Autoencoder (VMGAE), a graph-based approach for time series clustering that leverages the structural advantages of graphs to capture enriched data relationships and produces Gaussian mixture embeddings for improved separability. Comparisons with baseline methods are included with experimental results, demonstrating that our method significantly outperforms state-of-the-art time-series clustering techniques. We further validate our method on real-world financial data, highlighting its practical applications in finance. By uncovering community structures in stock markets, our method provides deeper insights into stock relationships, benefiting market prediction, portfolio optimization, and risk management.
[AI-33] Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach
链接: https://arxiv.org/abs/2411.16954
作者: Xiaoteng(Frank)Liu,Pavly Halim(New York University)
关键词-EN: predicting General Matrix, General Matrix Multiplication, predicting General, tiled matrix multiplication, General Matrix
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
*备注: 9 pages, 9 figures, 6 tables, IEEE conference paper format
点击查看摘要
Abstract:We present an analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix multiplication kernel for fundamental analysis, and NVIDIA’s CUTLASS library for comprehensive performance data collection across advanced configurations. Using the NVIDIA RTX 4070 as our experimental platform, we developed a Random Forest-based prediction model with multi-output regression capability. Through analysis of both naive tiled matrix multiplication with varying tile sizes (1 to 32) and 16,128 CUTLASS GEMM operations across diverse configurations, we identified critical performance patterns related to matrix dimensions, thread block configurations, and memory access patterns. Our framework achieved exceptional accuracy with an R^2 score of 0.98 for runtime prediction (mean error 15.57%) and 0.78 for power prediction (median error 5.42%). The system successfully predicts performance across matrix sizes, demonstrating robust scaling behavior. Our results show that optimal tile size selection can improve performance by up to 3.2x while reducing power consumption by 22% compared to baseline configurations. Analysis of shared memory utilization and SM occupancy reveals that tile sizes of 16x16 achieve the best balance between parallelism and resource usage. The implementation of our framework, including prediction models and analysis tools, is available as an open-source project at GPPerf [this https URL].
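The modeling component reduces to multi-output regression over kernel-configuration features. A minimal stand-in with synthetic profiling data (the real study fits measured tiled-kernel and CUTLASS data) looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for GEMM profiling data: features are matrix/tile dimensions,
# targets are (runtime, power). The formulas below are arbitrary toy relationships.
rng = np.random.default_rng(0)
n = 2000
M, N, K = (rng.integers(128, 4096, n) for _ in range(3))
tile = rng.choice([8, 16, 32], n)
X = np.column_stack([M, N, K, tile])
runtime = (M * N * K) / (tile * 1e9) * (1 + 0.1 * rng.standard_normal(n))
power = 80 + 0.02 * tile * np.sqrt(M * N) / 100 + rng.standard_normal(n)
Y = np.column_stack([runtime, power])

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)   # multi-output by default
model.fit(X_tr, Y_tr)
print("R^2 (runtime and power, averaged):", round(model.score(X_te, Y_te), 3))
```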
[AI-34] ASSERTIFY: Utilizing Large Language Models to Generate Assertions for Production Code
链接: https://arxiv.org/abs/2411.16927
作者: Mohammad Jalili Torkamani,Abhinav Sharma,Nikita Mehrotra,Rahul Purandare
关键词-EN: Production assertions, statements embedded, validate their assumptions, assertions, Large Language Models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 20 pages, 10 figures, 10 listings, 2 tables, preprint
点击查看摘要
Abstract:Production assertions are statements embedded in the code to help developers validate their assumptions about the code. They assist developers in debugging, provide valuable documentation, and enhance code comprehension. Current research in this area primarily focuses on assertion generation for unit tests using techniques, such as static analysis and deep learning. While these techniques have shown promise, they fall short when it comes to generating production assertions, which serve a different purpose. This preprint addresses the gap by introducing Assertify, an automated end-to-end tool that leverages Large Language Models (LLMs) and prompt engineering with few-shot learning to generate production assertions. By creating context-rich prompts, the tool emulates the approach developers take when creating production assertions for their code. To evaluate our approach, we compiled a dataset of 2,810 methods by scraping 22 mature Java repositories from GitHub. Our experiments demonstrate the effectiveness of few-shot learning by producing assertions with an average ROUGE-L score of 0.526, indicating reasonably high structural similarity with the assertions written by developers. This research demonstrates the potential of LLMs in automating the generation of production assertions that resemble the original assertions.
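Two of the moving parts, assembling a few-shot context-rich prompt and scoring generated assertions with ROUGE-L, can be sketched as below. The example methods, assertions, and the omitted LLM call are placeholders, not Assertify's actual prompts or pipeline.

```python
# Sketch: build a few-shot prompt for production assertions and score a generated
# assertion against the developer-written one with ROUGE-L (LCS-based F1).
FEW_SHOT = [
    ("int divide(int a, int b) { return a / b; }",
     'assert b != 0 : "divisor must be non-zero";'),
    ("void setAge(int age) { this.age = age; }",
     'assert age >= 0 : "age must be non-negative";'),
]

def build_prompt(target_method: str) -> str:
    shots = "\n\n".join(f"Method:\n{m}\nAssertion:\n{a}" for m, a in FEW_SHOT)
    return f"{shots}\n\nMethod:\n{target_method}\nAssertion:\n"

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

generated = 'assert amount > 0 : "amount must be positive";'        # pretend LLM output
reference = 'assert amount > 0 : "amount must be greater than zero";'
print(build_prompt("void withdraw(int amount) { balance -= amount; }")[:80], "...")
print(round(rouge_l_f1(generated, reference), 3))  # 0.8
```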
[AI-35] Are Transformers Truly Foundational for Robotics?
链接: https://arxiv.org/abs/2411.16917
作者: James A. R. Marshall,Andrew B. Barron
关键词-EN: Generative Pre-Trained Transformers, Generative Pre-Trained, Pre-Trained Transformers, hyped to revolutionize, Generative
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Generative Pre-Trained Transformers (GPTs) are hyped to revolutionize robotics. Here we question their utility. GPTs for autonomous robotics demand enormous and costly compute, excessive training times and (often) offboard wireless control. We contrast GPT state of the art with how tiny insect brains have achieved robust autonomy with none of these constraints. We highlight lessons that can be learned from biology to enhance the utility of GPTs in robotics.
[AI-36] Enabling Adoption of Regenerative Agriculture through Soil Carbon Copilots
链接: https://arxiv.org/abs/2411.16872
作者: Margaret Capetz,Swati Sharma,Rafael Padilha,Peder Olsen,Emre Kiciman,Ranveer Chandra
关键词-EN: Mitigating climate change, change requires transforming, minimize environmental, environmental impact, soil organic carbon
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:
点击查看摘要
Abstract:Mitigating climate change requires transforming agriculture to minimize environmental impact and build climate resilience. Regenerative agricultural practices enhance soil organic carbon (SOC) levels, thus improving soil health and sequestering carbon. A challenge to increasing regenerative agriculture practices is cheaply measuring SOC over time and understanding how SOC is affected by regenerative agricultural practices and other environmental factors and farm management practices. To address this challenge, we introduce an AI-driven Soil Organic Carbon Copilot that automates the ingestion of complex multi-resolution, multi-modal data to provide large-scale insights into soil health and regenerative practices. Our data includes extreme weather event data (e.g., drought and wildfire incidents), farm management data (e.g., cropland information and tillage predictions), and SOC predictions. We find that integrating public data and specialized models enables large-scale, localized analysis for sustainable agriculture. In comparisons of agricultural practices across California counties, we find evidence that diverse agricultural activity may mitigate the negative effects of tillage; and that while extreme weather conditions heavily affect SOC, composting may mitigate SOC loss. Finally, implementing role-specific personas empowers agronomists, farm consultants, policymakers, and other stakeholders to implement evidence-based strategies that promote sustainable agriculture and build climate resilience.
[AI-37] Blockchain Meets LLMs: A Living Survey on Bidirectional Integration
链接: https://arxiv.org/abs/2411.16809
作者: Jianghao Gong,Peiqi Yan,Yue Zhang,Hongli An,Logan Liu
关键词-EN: large language models, large language, language models, multimodal large language, continuous technological progress
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In the domain of large language models, considerable advancements have been attained in multimodal large language models and explainability research, propelled by the continuous technological progress and innovation. Nonetheless, security and privacy concerns continue to pose as prominent challenges in this field. The emergence of blockchain technology, marked by its decentralized nature, tamper-proof attributes, distributed storage functionality, and traceability, has provided novel approaches for resolving these issues. Both of these technologies independently hold vast potential for development; yet, their combination uncovers substantial cross-disciplinary opportunities and growth prospects. The current research tendencies are increasingly concentrating on the integration of blockchain with large language models, with the aim of compensating for their respective limitations through this fusion and promoting further technological evolution. In this study, we evaluate the advantages and developmental constraints of the two technologies, and explore the possibility and development potential of their combination. This paper primarily investigates the technical convergence in two directions: Firstly, the application of large language models to blockchain, where we identify six major development directions and explore solutions to the shortcomings of blockchain technology and their application scenarios; Secondly, the application of blockchain technology to large language models, leveraging the characteristics of blockchain to remedy the deficiencies of large language models and exploring its application potential in multiple fields.
[AI-38] Human Motion Instruction Tuning
链接: https://arxiv.org/abs/2411.16805
作者: Lei Li,Sen Jia,Wang Jianhao,Zhongyu Jiang,Feng Zhou,Ju Dai,Tianfang Zhang,Wu Zongkai,Jenq-Neng Hwang
关键词-EN: Human Motion Assistant, Large Language, Motion Assistant, motion instruction tuning, instruction tuning
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model’s ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction. Our code and models are available on the project website: this https URL.
[AI-39] Hide in Plain Sight: Clean-Label Backdoor for Auditing Membership Inference
链接: https://arxiv.org/abs/2411.16763
作者: Depeng Chen,Hao Chen,Hulin Jin,Jie Cui,Hong Zhong
关键词-EN: General Data Protection, Data Protection Regulation, Protection Regulation, Membership inference attacks, assessing privacy risks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Membership inference attacks (MIAs) are critical tools for assessing privacy risks and ensuring compliance with regulations like the General Data Protection Regulation (GDPR). However, their potential for auditing unauthorized use of data remains underexplored. To bridge this gap, we propose a novel clean-label backdoor-based approach for MIAs, designed specifically for robust and stealthy data auditing. Unlike conventional methods that rely on detectable poisoned samples with altered labels, our approach retains natural labels, enhancing stealthiness even at low poisoning rates. Our approach employs an optimal trigger generated by a shadow model that mimics the target model’s behavior. This design minimizes the feature-space distance between triggered samples and the source class while preserving the original data labels. The result is a powerful and undetectable auditing mechanism that overcomes limitations of existing approaches, such as label inconsistencies and visual artifacts in poisoned samples. The proposed method enables robust data auditing through black-box access, achieving high attack success rates across diverse datasets and model architectures. Additionally, it addresses challenges related to trigger stealthiness and poisoning durability, establishing itself as a practical and effective solution for data auditing. Comprehensive experiments validate the efficacy and generalizability of our approach, outperforming several baseline methods in both stealth and attack success metrics.
[AI-40] An investigation into the performances of the Current state-of-the-art Naive Bayes Non-Bayesian and Deep Learning Based Classifier for Phishing Detection: A Survey
链接: https://arxiv.org/abs/2411.16751
作者: Tosin Ige,Christopher Kiekintveld,Aritran Piplai,Amy Waggler,Olukunle Kolade,Bolanle Hafiz Matti
关键词-EN: digital wallets, state secrets, online banking, potential victims, credentials for online
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Phishing is one of the most effective ways in which cybercriminals get sensitive details such as credentials for online banking, digital wallets, state secrets, and many more from potential victims. They do this by spamming users with malicious URLs with the sole purpose of tricking them into divulging sensitive information which is later used for various cybercrimes. In this research, we did a comprehensive review of current state-of-the-art machine learning and deep learning phishing detection techniques to expose their vulnerabilities and future research direction. For better analysis and observation, we split machine learning techniques into Bayesian, non-Bayesian, and deep learning. We reviewed the most recent advances in Bayesian and non-Bayesian-based classifiers before exploiting their corresponding weaknesses to indicate future research direction. While exploiting weaknesses in both Bayesian and non-Bayesian classifiers, we also compared their performance with a deep learning classifier. For a proper review of deep learning-based classifiers, we looked at Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Long Short Term Memory Networks (LSTMs). We did an empirical analysis to evaluate the performance of each classifier along with many of the proposed state-of-the-art anti-phishing techniques to identify future research directions. We also made a series of proposals on how the performance of the under-performing algorithms can be improved, in addition to proposing a two-stage prediction model.
[AI-41] LoBAM: LoRA-Based Backdoor Attack on Model Merging
链接: https://arxiv.org/abs/2411.16746
作者: Ming Yin,Jingyang Zhang,Jingwei Sun,Minghong Fang,Hai Li,Yiran Chen
关键词-EN: multiple models fine-tuned, integrates multiple models, multiple domains, integrates multiple, tasks to create
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Model merging is an emerging technique that integrates multiple models fine-tuned on different tasks to create a versatile model that excels in multiple domains. This scheme, in the meantime, may open up backdoor attack opportunities where one single malicious model can jeopardize the integrity of the merged model. Existing works try to demonstrate the risk of such attacks by assuming substantial computational resources, focusing on cases where the attacker can fully fine-tune the pre-trained model. Such an assumption, however, may not be feasible given the increasing size of machine learning models. In practice where resources are limited and the attacker can only employ techniques like Low-Rank Adaptation (LoRA) to produce the malicious model, it remains unclear whether the attack can still work and pose threats. In this work, we first identify that the attack efficacy is significantly diminished when using LoRA for fine-tuning. Then, we propose LoBAM, a method that yields high attack success rate with minimal training resources. The key idea of LoBAM is to amplify the malicious weights in an intelligent way that effectively enhances the attack efficacy. We demonstrate that our design can lead to improved attack success rate through both theoretical proof and extensive empirical experiments across various model merging scenarios. Moreover, we show that our method has strong stealthiness and is difficult to detect.
[AI-42] Text-to-SQL Calibration: No Need to Ask – Just Rescale Model Probabilities
链接: https://arxiv.org/abs/2411.16742
作者: Ashwin Ramachandran,Sunita Sarawagi
关键词-EN: convert natural language, natural language queries, commercial databases, large language models, crucial as large
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Calibration is crucial as large language models (LLMs) are increasingly deployed to convert natural language queries into SQL for commercial databases. In this work, we investigate calibration techniques for assigning confidence to generated SQL queries. We show that a straightforward baseline – deriving confidence from the model’s full-sequence probability – outperforms recent methods that rely on follow-up prompts for self-checking and confidence verbalization. Our comprehensive evaluation, conducted across two widely-used Text-to-SQL benchmarks and multiple LLM architectures, provides valuable insights into the effectiveness of various calibration strategies.
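The baseline the paper argues for, confidence derived from the full-sequence probability and then rescaled, can be illustrated with a small sketch. The token log-probs and held-out correctness labels below are synthetic placeholders, and the one-feature logistic (Platt-style) calibrator is just one possible rescaling choice, not necessarily the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sequence_confidence(token_logprobs):
    """Raw confidence = exp(mean token log-prob) of the generated SQL."""
    return float(np.exp(np.mean(token_logprobs)))

# Placeholder held-out data: per-query (raw confidence, execution correctness).
# In practice these come from the LLM's token scores and from executing the SQL.
rng = np.random.default_rng(0)
raw_conf = rng.uniform(0.2, 1.0, 500)
correct = (rng.uniform(size=500) < raw_conf ** 2).astype(int)   # miscalibrated by design

# Rescale the model-derived confidence with a one-feature logistic calibrator.
calibrator = LogisticRegression().fit(raw_conf.reshape(-1, 1), correct)
print("raw 0.9 ->", round(calibrator.predict_proba([[0.9]])[0, 1], 3))
print("raw 0.4 ->", round(calibrator.predict_proba([[0.4]])[0, 1], 3))

# Raw score from (hypothetical) token log-probs of one generated query:
print("raw confidence:", round(sequence_confidence([-0.1, -0.3, -0.05, -0.2]), 3))
```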
[AI-43] DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2
链接: https://arxiv.org/abs/2411.16729
作者: Fan Zhang,Siyuan Zhao,Naye Ji,Zhaohan Wang,Jingmei Wu,Fuxing Gao,Zhenqing Ye,Leyao Yan,Lanxin Dai,Weidong Geng,Xin Lyu,Bozuo Zhao,Dingguo Yu,Hui Du,Bin Hu
关键词-EN: Speech-driven gesture generation, virtual human creation, rapidly advancing area, Speech-driven gesture, generative models represents
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注: 13 pages, 11 figures
点击查看摘要
Abstract:Speech-driven gesture generation using transformer-based generative models represents a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexities, limiting scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on the Mamba-2. The fuzzy feature extractor, integrated with a Chinese Pre-trained Model and Mamba-2, autonomously extracts implicit, continuous speech features. These features are synthesized into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism to uniformly apply transformations across all sequence tokens. This enables precise modeling of the nuanced interplay between speech features and gesture dynamics. We utilize a diffusion model to train and infer diverse gesture outputs. Extensive subjective and objective evaluations conducted on the newly released Chinese Co-Speech Gestures dataset corroborate the efficacy of our proposed model. Compared with Transformer-based architecture, the assessments reveal that our approach delivers competitive results and significantly reduces memory usage, approximately 2.4 times, and enhances inference speeds by 2 to 4 times. Additionally, we released the CCG dataset, a Chinese Co-Speech Gestures dataset, comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.
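Adaptive Layer Normalization of the kind mentioned here, i.e. conditioning a LayerNorm's scale and shift on another signal such as speech features, is commonly implemented as in the short PyTorch sketch below. This is a generic AdaLN layer, not the DiM-Gestor code; the dimensions and the conditioning input are placeholders.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive LayerNorm: normalize x, then apply a scale and shift
    predicted from a conditioning vector (e.g. a pooled speech embedding)."""
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim), cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Toy usage
layer = AdaLN(hidden_dim=256, cond_dim=128)
x = torch.randn(2, 50, 256)      # gesture token sequence
cond = torch.randn(2, 128)       # conditioning signal
print(layer(x, cond).shape)      # torch.Size([2, 50, 256])
```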
[AI-44] Maximizing the Impact of Deep Learning on Subseasonal-to-Seasonal Climate Forecasting: The Essential Role of Optimization
链接: https://arxiv.org/abs/2411.16728
作者: Yizhen Guo,Tian Zhou,Wanyi Jiang,Bo Wu,Liang Sun,Rong Jin
关键词-EN: disaster management, vital for sectors, agriculture and disaster, numerical weather prediction, weather prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
点击查看摘要
Abstract:Weather and climate forecasting is vital for sectors such as agriculture and disaster management. Although numerical weather prediction (NWP) systems have advanced, forecasting at the subseasonal-to-seasonal (S2S) scale, spanning 2 to 6 weeks, remains challenging due to the chaotic and sparse atmospheric signals at this interval. Even state-of-the-art deep learning models struggle to outperform simple climatology models in this domain. This paper identifies that optimization, instead of network structure, could be the root cause of this performance gap, and then we develop a novel multi-stage optimization strategy to close the gap. Extensive empirical studies demonstrate that our multi-stage optimization approach significantly improves key skill metrics, PCC and TCC, while utilizing the same backbone structure, surpassing the state-of-the-art NWP systems (ECMWF-S2S) by over 19-91%. Our research contests the recent study that direct forecasting outperforms rolling forecasting for S2S tasks. Through theoretical analysis, we propose that the underperformance of rolling forecasting may arise from the accumulation of Jacobian matrix products during training. Our multi-stage framework can be viewed as a form of teacher forcing to address this issue. Code is available at this https URL
[AI-45] Two Heads Are Better Than One: Collaborative LLM Embodied Agents for Human-Robot Interaction
链接: https://arxiv.org/abs/2411.16723
作者: Mitchell Rosser,Marc G. Carmichael
关键词-EN: natural language generation, language generation models, natural language, interpret natural language, natural language commands
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 9 pages, 10 figures
点击查看摘要
Abstract:With the recent development of natural language generation models - termed as large language models (LLMs) - a potential use case has opened up to improve the way that humans interact with robot assistants. These LLMs should be able to leverage their large breadth of understanding to interpret natural language commands into effective, task appropriate and safe robot task executions. However, in reality, these models suffer from hallucinations, which may cause safety issues or deviations from the task. In other domains, these issues have been improved through the use of collaborative AI systems where multiple LLM agents can work together to collectively plan, code and self-check outputs. In this research, multiple collaborative AI systems were tested against a single independent AI agent to determine whether the success in other domains would translate into improved human-robot interaction performance. The results show that there is no defined trend between the number of agents and the success of the model. However, it is clear that some collaborative AI agent architectures can exhibit a greatly improved capacity to produce error-free code and to solve abstract problems.
[AI-46] Benefits and Risks of Using ChatGPT4 as a Teaching Assistant for Computer Science Students
链接: https://arxiv.org/abs/2411.16690
作者: Yaiza Aragonés-Soria,Julia Kotovich,Chitsutha Soomlek,Manuel Oriol
关键词-EN: software engineering community, shocked the software, software engineering, engineering community, ability to generate
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: This paper was finished on the 17th of June of 2023
点击查看摘要
Abstract:Upon release, ChatGPT3.5 shocked the software engineering community with its ability to generate answers to specialized questions about coding. Immediately, many educators wondered if it was possible to use the chatbot as a support tool that helps students answer their programming questions. This article evaluates this possibility at three levels: fundamental Computer Science knowledge (basic algorithms and data structures), core competency (design patterns), and advanced knowledge (quantum computing). In each case, we ask normalized questions several times to ChatGPT3.5, then look at the correctness of answers, and finally check if this creates issues. The main result is that the performance of ChatGPT3.5 degrades drastically as the specialization of the domain increases: for basic algorithms it returns answers that are almost always correct, for design patterns the generated code contains many code smells and is generally of low quality, but it is still sometimes able to fix it (if asked), and for quantum computing it is often blatantly wrong.
[AI-47] Mixed-State Quantum Denoising Diffusion Probabilistic Model
链接: https://arxiv.org/abs/2411.17608
作者: Gino Kwun,Bingzhi Zhang,Quntao Zhuang
关键词-EN: gained significant attention, Phys. Rev. Lett, Generative quantum machine, produce quantum states, desired distributions
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 7 figures
点击查看摘要
Abstract:Generative quantum machine learning has gained significant attention for its ability to produce quantum states with desired distributions. Among various quantum generative models, quantum denoising diffusion probabilistic models (QuDDPMs) [Phys. Rev. Lett. 132, 100602 (2024)] provide a promising approach with stepwise learning that resolves the training issues. However, the requirement of high-fidelity scrambling unitaries in QuDDPM poses a challenge in near-term implementation. We propose the mixed-state quantum denoising diffusion probabilistic model (MSQuDDPM) to eliminate the need for scrambling unitaries. Our approach focuses on adapting the quantum noise channels to the model architecture, which integrates depolarizing noise channels in the forward diffusion process and parameterized quantum circuits with projective measurements in the backward denoising steps. We also introduce several techniques to improve MSQuDDPM, including a cosine-exponent schedule of noise interpolation, the use of single-qubit random ancilla, and superfidelity-based cost functions to enhance the convergence. We evaluate MSQuDDPM on quantum ensemble generation tasks, demonstrating its successful performance.
[AI-48] LC-SVD-DLinear: A low-cost physics-based hybrid machine learning model for data forecasting using sparse measurements
链接: https://arxiv.org/abs/2411.17433
作者: Ashton Hetherington,Javier López Leonés,Soledad Le Clainche
关键词-EN: fluid mechanics data, shallow linear neural, resolution fluid mechanics, linear neural network, shallow neural network
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This article introduces a novel methodology that integrates singular value decomposition (SVD) with a shallow linear neural network for forecasting high resolution fluid mechanics data. The method, termed LC-SVD-DLinear, combines a low-cost variant of singular value decomposition (LC-SVD) with the DLinear architecture, which decomposes the input features (specifically, the temporal coefficients) into trend and seasonality components, enabling a shallow neural network to capture the non-linear dynamics of the temporal data. This methodology uses under-resolved data, which can either be input directly into the hybrid model or downsampled from high resolution using two distinct techniques provided by the methodology. Working with under-resolved cases helps reduce the overall computational cost. Additionally, we present a variant of the method, LC-HOSVD-DLinear, which combines a low-cost version of the high-order singular value decomposition (LC-HOSVD) algorithm with the DLinear network, designed for high-order data. These approaches have been validated using two datasets: first, a numerical simulation of three-dimensional flow past a circular cylinder at Re = 220; and second, an experimental dataset of turbulent flow passing a circular cylinder at Re = 2600. The combination of these datasets demonstrates the robustness of the method. The forecasting and reconstruction results are evaluated through various error metrics, including uncertainty quantification. The work developed in this article will be included in the next release of ModelFLOWs-app.
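The overall idea, compressing high-dimensional snapshots with an SVD and then forecasting the temporal coefficients after splitting them into trend and seasonality, can be illustrated with the minimal NumPy sketch below. The moving-average decomposition and the plain least-squares forecasters are stand-ins for the DLinear part; the actual LC-SVD/LC-HOSVD variants and the network details of the paper are not reproduced here.

```python
import numpy as np

def svd_temporal_coefficients(snapshots, rank):
    """snapshots: (n_space, n_time) data matrix. Returns spatial modes and temporal coefficients."""
    U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
    modes = U[:, :rank]
    coeffs = (np.diag(s[:rank]) @ Vt[:rank]).T        # (n_time, rank)
    return modes, coeffs

def trend_seasonal_split(coeffs, kernel=25):
    """DLinear-style decomposition: moving-average trend + residual seasonality."""
    pad = kernel // 2
    padded = np.pad(coeffs, ((pad, pad), (0, 0)), mode="edge")
    trend = np.stack([padded[i:i + kernel].mean(axis=0) for i in range(coeffs.shape[0])])
    return trend, coeffs - trend

def fit_linear_forecaster(series, lookback, horizon):
    """Least-squares map from the last `lookback` steps to the next `horizon` steps."""
    X, Y = [], []
    for t in range(lookback, series.shape[0] - horizon + 1):
        X.append(series[t - lookback:t].ravel())
        Y.append(series[t:t + horizon].ravel())
    W, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)
    return W

# Toy usage on synthetic data: 500 spatial points, 300 snapshots
data = np.random.randn(500, 300)
modes, coeffs = svd_temporal_coefficients(data, rank=5)
trend, seasonal = trend_seasonal_split(coeffs)
W_trend = fit_linear_forecaster(trend, lookback=20, horizon=5)
W_seas = fit_linear_forecaster(seasonal, lookback=20, horizon=5)
forecast = trend[-20:].ravel() @ W_trend + seasonal[-20:].ravel() @ W_seas
forecast_coeffs = forecast.reshape(5, coeffs.shape[1])   # (horizon, rank)
forecast_snapshots = modes @ forecast_coeffs.T           # back to physical space
print(forecast_snapshots.shape)                          # (500, 5)
```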
[AI-49] Enhancing Fluorescence Lifetime Parameter Estimation Accuracy with Differential Transformer Based Deep Learning Model Incorporating Pixelwise Instrument Response Function
链接: https://arxiv.org/abs/2411.16896
作者: Ismail Erbas,Vikas Pandey,Navid Ibtehaj Nizam,Nanxue Yuan,Amit Verma,Margarida Barosso,Xavier Intes
关键词-EN: important molecular imaging, molecular imaging modality, provide unique information, Fluorescence lifetime imaging, molecular imaging
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optics (physics.optics)
*备注: 11 pages, 4 figures
点击查看摘要
Abstract:Fluorescence lifetime imaging (FLI) is an important molecular imaging modality that can provide unique information for biomedical applications. FLI is based on acquiring and processing photon time of arrival histograms. The shape and temporal offset of these histograms depends on many factors, such as the instrument response function (IRF), optical properties, and the topographic profile of the sample. Several inverse solver analytical methods have been developed to compute the underlying fluorescence lifetime parameters, but most of them are computationally expensive and time-consuming. Thus, deep learning (DL) algorithms have progressively replaced computation methods in fluorescence lifetime parameter estimation. Often, DL models are trained with simple datasets either generated through simulation or a simple experiment where the fluorophore surface profile is mostly flat; therefore, DL models often do not perform well on samples with complex surface profiles such as ex-vivo organs or in-vivo whole intact animals. Herein, we introduce a new DL architecture using state-of-the-art Differential Transformer encoder-decoder architecture, MFliNet (Macroscopic FLI Network), that takes an additional input of IRF together with TPSF, addressing discrepancies in the photon time-of-arrival distribution. We demonstrate the model’s performance through carefully designed, complex tissue-mimicking phantoms and preclinical in-vivo cancer xenograft experiments.
[AI-50] ADAF: An Artificial Intelligence Data Assimilation Framework for Weather Forecasting
链接: https://arxiv.org/abs/2411.16807
作者: Yanfei Xiang,Weixin Jin,Haiyu Dong,Mingliang Bai,Zuliang Fang,Pengcheng Zhao,Hongyu Sun,Kit Thambiratnam,Qi Zhang,Xiaomeng Huang
关键词-EN: models critically depends, accurate initial conditions, numerical weather prediction, skill of numerical, critically depends
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
*备注: 29 pages, 15 figures
点击查看摘要
Abstract:The forecasting skill of numerical weather prediction (NWP) models critically depends on the accurate initial conditions, also known as analysis, provided by data assimilation (DA). Traditional DA methods often face a trade-off between computational cost and accuracy due to complex linear algebra computations and the high dimensionality of the model, especially in nonlinear systems. Moreover, processing massive data in real-time requires substantial computational resources. To address this, we introduce an artificial intelligence-based data assimilation framework (ADAF) to generate high-quality kilometer-scale analysis. This study is the pioneering work using real-world observations from varied locations and multiple sources to verify the AI method’s efficacy in DA, including sparse surface weather observations and satellite imagery. We implemented ADAF for four near-surface variables in the Contiguous United States (CONUS). The results indicate that ADAF surpasses the High Resolution Rapid Refresh Data Assimilation System (HRRRDAS) in accuracy by 16% to 33% for near-surface atmospheric conditions, aligning more closely with actual observations, and can effectively reconstruct extreme events, such as tropical cyclone wind fields. Sensitivity experiments reveal that ADAF can generate high-quality analysis even with low-accuracy backgrounds and extremely sparse surface observations. ADAF can assimilate massive observations within a three-hour window at low computational cost, taking about two seconds on an AMD MI200 graphics processing unit (GPU). ADAF has been shown to be efficient and effective in real-world DA, underscoring its potential role in operational weather forecasting.
[AI-51] Reaction-conditioned De Novo Enzyme Design with GENzyme
链接: https://arxiv.org/abs/2411.16694
作者: Chenqing Hua,Jiarui Lu,Yong Liu,Odin Zhang,Jian Tang,Rex Ying,Wengong Jin,Guy Wolf,Doina Precup,Shuangjia Zheng
关键词-EN: protein structure modeling, revolutionized protein structure, revolutionized protein, interaction prediction, enzyme
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The introduction of models like RFDiffusionAA, AlphaFold3, AlphaProteo, and Chai1 has revolutionized protein structure modeling and interaction prediction, primarily from a binding perspective, focusing on creating ideal lock-and-key models. However, these methods can fall short for enzyme-substrate interactions, where perfect binding models are rare, and induced fit states are more common. To address this, we shift to a functional perspective for enzyme design, where the enzyme function is defined by the reaction it catalyzes. Here, we introduce GENzyme, a de novo enzyme design model that takes a catalytic reaction as input and generates the catalytic pocket, full enzyme structure, and enzyme-substrate binding complex. GENzyme is an end-to-end, three-staged model that integrates (1) a catalytic pocket generation and sequence co-design module, (2) a pocket inpainting and enzyme inverse folding module, and (3) a binding and screening module to optimize and predict enzyme-substrate complexes. The entire design process is driven by the catalytic reaction being targeted. This reaction-first approach allows for more accurate and biologically relevant enzyme design, potentially surpassing structure-based and binding-focused models in creating enzymes capable of catalyzing specific reactions. We provide GENzyme code at this https URL.
[AI-52] Physically Parameterized Differentiable MUSIC for DoA Estimation with Uncalibrated Arrays
链接: https://arxiv.org/abs/2411.15144
作者: Baptiste Chatelier(INSA Rennes, IETR, MERCE-France),José Miguel Mateos-Ramos,Vincent Corlay(MERCE-France),Christian Häger,Matthieu Crussière(INSA Rennes, IETR),Henk Wymeersch,Luc Le Magoarou(INSA Rennes, IETR)
关键词-EN: Direction of arrival, common sensing problem, wireless communication systems, problem in radar, wireless communication
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Direction of arrival (DoA) estimation is a common sensing problem in radar, sonar, audio, and wireless communication systems. It has gained renewed importance with the advent of the integrated sensing and communication paradigm. To fully exploit the potential of such sensing systems, it is crucial to take into account potential hardware impairments that can negatively impact the obtained performance. This study introduces a joint DoA estimation and hardware impairment learning scheme following a model-based approach. Specifically, a differentiable version of the multiple signal classification (MUSIC) algorithm is derived, allowing efficient learning of the considered impairments. The proposed approach supports both supervised and unsupervised learning strategies, showcasing its practical potential. Simulation results indicate that the proposed method successfully learns significant inaccuracies in both antenna locations and complex gains. Additionally, the proposed method outperforms the classical MUSIC algorithm in the DoA estimation task.
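For readers unfamiliar with MUSIC itself, the classical (non-differentiable, impairment-free) version for a uniform linear array looks roughly like the NumPy sketch below. The paper's contribution, making this pipeline differentiable and learning antenna-position and complex-gain impairments, is not shown; array geometry, grid, and noise level are illustrative choices.

```python
import numpy as np

def music_spectrum(X, n_sources, d_over_lambda=0.5, grid=np.linspace(-90, 90, 721)):
    """Classical MUSIC pseudospectrum for a uniform linear array.
    X: (n_antennas, n_snapshots) complex baseband snapshots."""
    M = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                  # sample covariance
    _, eigvecs = np.linalg.eigh(R)                   # eigenvalues in ascending order
    En = eigvecs[:, : M - n_sources]                 # noise subspace
    angles = np.deg2rad(grid)
    # ULA steering vectors: a_m(theta) = exp(-j * 2*pi * d/lambda * m * sin(theta))
    A = np.exp(-2j * np.pi * d_over_lambda * np.outer(np.arange(M), np.sin(angles)))
    pseudo = 1.0 / np.sum(np.abs(En.conj().T @ A) ** 2, axis=0)
    return grid, pseudo                              # peaks indicate DoAs

# Toy usage: two sources at -20 and 35 degrees, 8 antennas, 200 snapshots
M, N = 8, 200
true_doas = np.deg2rad([-20.0, 35.0])
A_true = np.exp(-2j * np.pi * 0.5 * np.outer(np.arange(M), np.sin(true_doas)))
S = (np.random.randn(2, N) + 1j * np.random.randn(2, N)) / np.sqrt(2)
X = A_true @ S + 0.05 * (np.random.randn(M, N) + 1j * np.random.randn(M, N))
grid, spectrum = music_spectrum(X, n_sources=2)
print(grid[np.argmax(spectrum)])   # should land close to one of the true DoAs
```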
机器学习
[LG-0] Instance-Aware Graph Prompt Learning
链接: https://arxiv.org/abs/2411.17676
作者: Jiazheng Li,Jundong Li,Chuxu Zhang
关键词-EN: strong expressive power, neural networks stand, Graph neural networks, representation learning owing, performance highly depends
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph neural networks stand as the predominant technique for graph representation learning owing to their strong expressive power, yet the performance highly depends on the availability of high-quality labels in an end-to-end manner. Thus the pretraining and fine-tuning paradigm has been proposed to mitigate the label cost issue. Subsequently, the gap between the pretext tasks and downstream tasks has spurred the development of graph prompt learning which inserts a set of graph prompts into the original graph data with minimal parameters while preserving competitive performance. However, the current exploratory works are still limited since they all concentrate on learning fixed task-specific prompts which may not generalize well across the diverse instances that the task comprises. To tackle this challenge, we introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper, aiming to generate distinct prompts tailored to different input instances. The process involves generating intermediate prompts for each instance using a lightweight architecture, quantizing these prompts through trainable codebook vectors, and employing the exponential moving average technique to ensure stable training. Extensive experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
[LG-1] Synthetic Data Generation with LLM for Improved Depression Prediction
链接: https://arxiv.org/abs/2411.17672
作者: Andrea Kang,Jun Yu Chen,Zoe Lee-Youngzie,Shuhao Fu
关键词-EN: rapidly growing field, generate synthetic data, synthetic data, Large Language Models, generate synthetic
类目: Machine Learning (cs.LG)
*备注: 6 pages excluding references and appendix
点击查看摘要
Abstract:Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, with its exponential interest comes a growing concern for data privacy and scarcity due to the sensitivity of such a topic. In this paper, we propose a pipeline for Large Language Models (LLMs) to generate synthetic data to improve the performance of depression prediction models. Starting from unstructured, naturalistic text data from recorded transcripts of clinical interviews, we utilize an open-source LLM to generate synthetic data through chain-of-thought prompting. This pipeline involves two key steps: the first step is the generation of the synopsis and sentiment analysis based on the original transcript and depression score, while the second is the generation of the synthetic synopsis/sentiment analysis based on the summaries generated in the first step and a new depression score. Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, it also balanced the distribution of severity in the training dataset, thereby significantly enhancing the model’s capability in predicting the intensity of the patient’s depression. By leveraging LLMs to generate synthetic data that can be augmented to limited and imbalanced real-world datasets, we demonstrate a novel approach to addressing data scarcity and privacy concerns commonly faced in automatic depression detection, all while maintaining the statistical integrity of the original dataset. This approach offers a robust framework for future mental health research and applications.
[LG-2] Anytime Acceleration of Gradient Descent
链接: https://arxiv.org/abs/2411.17668
作者: Zihan Zhang,Jason D. Lee,Simon S. Du,Yuxin Chen
关键词-EN: work investigates stepsize-based, convergence guarantees, investigates stepsize-based acceleration, work investigates, gradient descent
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:This work investigates stepsize-based acceleration of gradient descent with anytime convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of O(T^{-1.03}) for any stopping time T, where the stepsize schedule is predetermined without prior knowledge of the stopping time. This result provides an affirmative answer to a COLT open problem \citep{kornowski2024open} regarding whether stepsize-based acceleration can yield anytime convergence rates of o(T^{-1}). We further extend our theory to yield anytime convergence guarantees of \exp(-\Omega(T/\kappa^{0.97})) for smooth and strongly convex optimization, with \kappa being the condition number.
[LG-3] Data-driven development of cycle prediction models for lithium metal batteries using multi modal mining
链接: https://arxiv.org/abs/2411.17625
作者: Jaewoong Lee,Junhee Woo,Sejin Kim,Cinthya Paulina,Hyunmin Park,Hee-Tak Kim,Steve Park,Jihan Kim
关键词-EN: shown great potential, Recent advances, Material Graph Digitizer, research have shown, shown great
类目: Machine Learning (cs.LG)
*备注: 30 pages, 7 figures
点击查看摘要
Abstract:Recent advances in data-driven research have shown great potential in understanding the intricate relationships between materials and their performances. Herein, we introduce a novel multi modal data-driven approach employing an Automatic Battery data Collector (ABC) that integrates a large language model (LLM) with an automatic graph mining tool, Material Graph Digitizer (MatGD). This platform enables state-of-the-art accurate extraction of battery material data and cyclability performance metrics from diverse textual and graphical data sources. From the database derived through the ABC platform, we developed machine learning models that can accurately predict the capacity and stability of lithium metal batteries, which is the first-ever model developed to achieve such predictions. Our models were also experimentally validated, confirming practical applicability and reliability of our data-driven approach.
[LG-4] Can artificial intelligence predict clinical trial outcomes?
链接: https://arxiv.org/abs/2411.17595
作者: Shuyi Jin,Lu Chen,Hongru Ding,Meijie Wang,Lun Yu
关键词-EN: pose significant challenges, Matthews Correlation Coefficient, advanced therapies, pose significant, drug development
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:The increasing complexity and cost of clinical trials, particularly in the context of oncology and advanced therapies, pose significant challenges for drug development. This study evaluates the predictive capabilities of large language models (LLMs) such as GPT-3.5, GPT-4, and HINT in determining clinical trial outcomes. By leveraging a curated dataset of trials from this http URL, we compare the models’ performance using metrics including balanced accuracy, specificity, recall, and Matthews Correlation Coefficient (MCC). Results indicate that GPT-4o demonstrates robust performance in early trial phases, achieving high recall but facing limitations in specificity. Conversely, the HINT model excels in recognizing negative outcomes, particularly in later trial phases, offering a balanced approach across diverse endpoints. Oncology trials, characterized by high complexity, remain challenging for all models. Additionally, trial duration and disease categories influence predictive performance, with longer durations and complex diseases such as neoplasms reducing accuracy. This study highlights the complementary strengths of LLMs and HINT, providing insights into optimizing predictive tools for clinical trial design and risk management. Future advancements in LLMs are essential to address current gaps in handling negative outcomes and complex domains.
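The evaluation metrics named here are standard classification metrics; for reference, they can be computed directly with scikit-learn as in the short sketch below. The labels are made up purely for illustration and have no relation to the paper's data.

```python
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             matthews_corrcoef, recall_score)

# 1 = trial met its primary endpoint, 0 = it did not (toy labels)
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                      # true-negative rate

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("recall (sensitivity):", recall_score(y_true, y_pred))
print("specificity:", specificity)
print("MCC:", matthews_corrcoef(y_true, y_pred))
```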
[LG-5] From Fairness to Infinity: Outcome-Indistinguishable (Omni)Prediction in Evolving Graphs
链接: https://arxiv.org/abs/2411.17582
作者: Cynthia Dwork,Chris Hays,Nicole Immorlica,Juan C. Perdomo,Pranay Tankala
关键词-EN: Professional networks provide, provide invaluable entree, networks provide invaluable, referrals and introductions, Professional networks
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Professional networks provide invaluable entree to opportunity through referrals and introductions. A rich literature shows they also serve to entrench and even exacerbate a status quo of privilege and disadvantage. Hiring platforms, equipped with the ability to nudge link formation, provide a tantalizing opening for beneficial structural change. We anticipate that key to this prospect will be the ability to estimate the likelihood of edge formation in an evolving graph. Outcome-indistinguishable prediction algorithms ensure that the modeled world is indistinguishable from the real world by a family of statistical tests. Omnipredictors ensure that predictions can be post-processed to yield loss minimization competitive with respect to a benchmark class of predictors for many losses simultaneously, with appropriate post-processing. We begin by observing that, by combining a slightly modified form of the online K29 star algorithm of Vovk (2007) with basic facts from the theory of reproducing kernel Hilbert spaces, one can derive simple and efficient online algorithms satisfying outcome indistinguishability and omniprediction, with guarantees that improve upon, or are complementary to, those currently known. This is of independent interest. We apply these techniques to evolving graphs, obtaining online outcome-indistinguishable omnipredictors for rich – possibly infinite – sets of distinguishers that capture properties of pairs of nodes, and their neighborhoods. This yields, inter alia, multicalibrated predictions of edge formation with respect to pairs of demographic groups, and the ability to simultaneously optimize loss as measured by a variety of social welfare functions.
[LG-6] Multiscale spatiotemporal heterogeneity analysis of bike-sharing systems self-loop phenomenon: Evidence from Shanghai
链接: https://arxiv.org/abs/2411.17555
作者: Yichen Wang,Qing Yu,Yancun Song,Quan Yuan,Chao Yang,Chengcheng Yu
关键词-EN: shared mobility mode, environmentally friendly shared, friendly shared mobility, significantly impacts equity, mobility mode
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Bike-sharing is an environmentally friendly shared mobility mode, but its self-loop phenomenon, where bikes are returned to the same station after a period of usage, significantly impacts equity in accessing its services. Therefore, this study conducts a multiscale analysis with a spatial autoregressive model and double machine learning framework to assess socioeconomic features and geospatial location’s impact on the self-loop phenomenon at metro stations and street scales. The results reveal that bike-sharing self-loop intensity exhibits a significant spatial lag effect at the street scale and is positively associated with residential land use. Marginal treatment effects of residential land use are higher on streets with middle-aged residents, high fixed employment, and low car ownership. The multimodal public transit condition reveals significant positive marginal treatment effects at both scales. To enhance bike-sharing cooperation, we advocate augmenting bicycle availability in areas with high metro usage and low bus coverage, alongside implementing adaptable redistribution strategies.
[LG-7] Navigating Spatial Inequities in Freight Truck Crash Severity via Counterfactual Inference in Los Angeles
链接: https://arxiv.org/abs/2411.17554
作者: Yichen Wang,Hao Yin,Yifan Yang,Chenyang Zhao,Siqin Wang
关键词-EN: substantial economic losses, pose significant challenges, Freight truck-related crashes, truck-related crashes pose, crashes pose significant
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Freight truck-related crashes pose significant challenges, leading to substantial economic losses, injuries, and fatalities, with pronounced spatial disparities across different regions. This study adopts a transport geography perspective to examine spatial justice concerns by employing deep counterfactual inference models to analyze how socioeconomic disparities, road infrastructure, and environmental conditions influence the geographical distribution and severity of freight truck crashes. By integrating road network datasets, socioeconomic attributes, and crash records from the Los Angeles metropolitan area, this research provides a nuanced spatial analysis of how different communities are disproportionately impacted. The results reveal significant spatial disparities in crash severity across areas with varying population densities, income levels, and minority populations, highlighting the pivotal role of infrastructural and environmental improvements in mitigating these disparities. The findings offer insights into targeted, location-specific policy interventions, suggesting enhancements in road infrastructure, lighting, and traffic control systems, particularly in low-income and minority-concentrated areas. This research contributes to the literature on transport geography and spatial equity by providing data-driven insights into effective measures for reducing spatial injustices associated with freight truck-related crashes.
[LG-8] Evolving Markov Chains: Unsupervised Mode Discovery and Recognition from Data Streams
链接: https://arxiv.org/abs/2411.17528
作者: Kutalmış Coşkun,Borahan Tümer,Bjarne C. Hiller,Martin Becker
关键词-EN: powerful mathematical structures, temporally dependent processes, model temporally dependent, simple yet powerful, powerful mathematical
类目: Machine Learning (cs.LG)
*备注: 20 pages, 8 figures
点击查看摘要
Abstract:Markov chains are simple yet powerful mathematical structures to model temporally dependent processes. They generally assume stationary data, i.e., fixed transition probabilities between observations/states. However, live, real-world processes, like in the context of activity tracking, biological time series, or industrial monitoring, often switch behavior over time. Such behavior switches can be modeled as transitions between higher-level modes (e.g., running, walking, etc.). Yet all modes are usually not previously known, often exhibit vastly differing transition probabilities, and can switch unpredictably. Thus, to track behavior changes of live, real-world processes, this study proposes an online and efficient method to construct Evolving Markov chains (EMCs). EMCs adaptively track transition probabilities, automatically discover modes, and detect mode switches in an online manner. In contrast to previous work, EMCs are of arbitrary order, the proposed update scheme does not rely on tracking windows, only updates the relevant region of the probability tensor, and enjoys geometric convergence of the expected estimates. Our evaluation of synthetic data and real-world applications on human activity recognition, electric motor condition monitoring, and eye-state recognition from electroencephalography (EEG) measurements illustrates the versatility of the approach and points to the potential of EMCs to efficiently track, model, and understand live, real-world processes.
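A minimal way to picture an adaptively tracked (first-order) transition matrix is the exponential-forgetting count update sketched below. This is a generic online estimator for intuition only; the paper's EMC update is of arbitrary order, window-free, and comes with its own convergence guarantees, none of which are reproduced here.

```python
import numpy as np

class OnlineMarkovEstimator:
    """First-order Markov chain whose transition probabilities are tracked online
    with exponential forgetting, so that behavior (mode) switches show up quickly."""
    def __init__(self, n_states: int, forgetting: float = 0.99):
        self.counts = np.ones((n_states, n_states))    # Laplace-smoothed counts
        self.forgetting = forgetting
        self.prev = None

    def update(self, state: int) -> None:
        if self.prev is not None:
            self.counts[self.prev] *= self.forgetting  # decay only the relevant row
            self.counts[self.prev, state] += 1.0
        self.prev = state

    def transition_matrix(self) -> np.ndarray:
        return self.counts / self.counts.sum(axis=1, keepdims=True)

# Toy usage: a stream that switches regime halfway through
rng = np.random.default_rng(0)
est = OnlineMarkovEstimator(n_states=2, forgetting=0.95)
stream = list(rng.choice(2, size=300, p=[0.9, 0.1])) + list(rng.choice(2, size=300, p=[0.1, 0.9]))
for s in stream:
    est.update(s)
print(est.transition_matrix())   # rows now mostly reflect the second regime
```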
[LG-9] Pushing the Limits of Large Language Model Quantization via the Linearity Theorem
链接: https://arxiv.org/abs/2411.17525
作者: Vladimir Malinovskii,Andrei Panferov,Ivan Ilin,Han Guo,Peter Richtárik,Dan Alistarh
关键词-EN: Quantizing large language, Quantizing large, large language models, computational costs, large language
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a “linearity theorem” establishing a direct relationship between the layer-wise \ell_2 reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches such as the extremely popular NF4 quantized format, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels which match a given compression constraint in the medium-bitwidth regime, obtained by reduction to dynamic programming. On the practical side, we demonstrate improved accuracy-compression trade-offs on Llama-3.1 and 3.2-family models, as well as on Qwen-family models. Further, we show that our method can be efficiently supported in terms of GPU kernels at various batch sizes, advancing both data-free and non-uniform quantization for LLMs.
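As a concrete reference point for the quantity the theorem talks about, the per-layer reconstruction error of a simple round-to-nearest uniform quantizer can be measured as below. This sketch is not HIGGS (there is no Hadamard rotation and no MSE-optimal grid); it only illustrates computing a layer-wise \ell_2 reconstruction error at a given bit width, with a random matrix standing in for real weights.

```python
import numpy as np

def uniform_quantize(W, bits):
    """Symmetric round-to-nearest uniform quantization of a weight matrix."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / levels
    return np.round(W / scale).clip(-levels, levels) * scale

def layer_l2_error(W, bits):
    """Relative squared l2 reconstruction error ||W - W_hat||^2 / ||W||^2."""
    W_hat = uniform_quantize(W, bits)
    return np.linalg.norm(W - W_hat) ** 2 / np.linalg.norm(W) ** 2

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)) * 0.02   # stand-in for one linear layer's weights
for bits in (8, 4, 3, 2):
    print(bits, "bits -> relative l2 error:", round(layer_l2_error(W, bits), 5))
```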
[LG-10] raining Hamiltonian neural networks without backpropagation NEURIPS2024
链接: https://arxiv.org/abs/2411.17511
作者: Atamert Rahma,Chinmay Datar,Felix Dietrich
关键词-EN: synergistically integrate data, physical laws offer, laws offer great, offer great promise, modeling dynamical systems
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 5 pages, 2 figures and 2 tables in the main text, includes an Appendix section, accepted to NeurIPS 2024 Workshop ML4PS
点击查看摘要
Abstract:Neural networks that synergistically integrate data and physical laws offer great promise in modeling dynamical systems. However, iterative gradient-based optimization of network parameters is often computationally expensive and suffers from slow convergence. In this work, we present a backpropagation-free algorithm to accelerate the training of neural networks for approximating Hamiltonian systems through data-agnostic and data-driven algorithms. We empirically show that data-driven sampling of the network parameters outperforms data-agnostic sampling or the traditional gradient-based iterative optimization of the network parameters when approximating functions with steep gradients or wide input domains. We demonstrate that our approach is more than 100 times faster with CPUs than the traditionally trained Hamiltonian Neural Networks using gradient-based iterative optimization and is more than four orders of magnitude more accurate in chaotic examples, including the Hénon-Heiles system.
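The general flavor of backpropagation-free training with sampled parameters can be conveyed by an extreme-learning-machine-style sketch: hidden-layer weights are sampled (data-agnostically here) and only the linear readout is solved in closed form. The paper's data-driven sampling scheme and the Hamiltonian structure of the targets are not reproduced; the function, scales, and sizes below are arbitrary.

```python
import numpy as np

def fit_sampled_network(X, y, n_hidden=512, seed=0):
    """Sample hidden weights randomly, then solve the output layer by least squares
    (no gradient-based training of the hidden layer)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=2.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-np.pi, np.pi, size=n_hidden)
    H = np.tanh(X @ W + b)                        # random features
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # closed-form readout
    return W, b, beta

def predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage: approximate a smooth scalar function of two variables
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])
W, b, beta = fit_sampled_network(X, y)
X_test = rng.uniform(-1, 1, size=(200, 2))
y_test = np.sin(3 * X_test[:, 0]) * np.cos(2 * X_test[:, 1])
print("test RMSE:", np.sqrt(np.mean((predict(X_test, W, b, beta) - y_test) ** 2)))
```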
[LG-11] Neural network modelling of kinematic and dynamic features for signature verification
链接: https://arxiv.org/abs/2411.17506
作者: Moises Diaz,Miguel A. Ferrer,Jose Juan Quintana,Adam Wolniakowski,Roman Trochimczuk,Konstantsin Miatliuk,Giovanna Castellano,Gennaro Vessio
关键词-EN: automatic signature verifier, Online signature parameters, human characteristics, broaden the applicability, Online signature
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Online signature parameters, which are based on human characteristics, broaden the applicability of an automatic signature verifier. Although kinematic and dynamic features have previously been suggested, accurately measuring features such as arm and forearm torques remains challenging. We present two approaches for estimating angular velocities, angular positions, and force torques. The first approach involves using a physical UR5e robotic arm to reproduce a signature while capturing those parameters over time. The second method, a cost-effective approach, uses a neural network to estimate the same parameters. Our findings demonstrate that a simple neural network model can extract effective parameters for signature verification. Training the neural network with the MCYT300 dataset and cross-validating with other databases, namely, BiosecurID, Visual, Blind, OnOffSigDevanagari 75 and OnOffSigBengali 75, confirms the model’s generalization capability.
[LG-12] Confidence-Aware Deep Learning for Load Plan Adjustments in the Parcel Service Industry
链接: https://arxiv.org/abs/2411.17502
作者: Thomas Bruys,Reza Zandehshahvar,Amira Hijazi,Pascal Van Hentenryck
关键词-EN: load plan adjustments, inbound load plan, automate inbound load, inbound load planning, deep learning-based approach
类目: Machine Learning (cs.LG)
*备注: 16 pages, 11 figures
点击查看摘要
Abstract:This study develops a deep learning-based approach to automate inbound load plan adjustments for a large transportation and logistics company. It addresses a critical challenge for the efficient and resilient planning of E-commerce operations in presence of increasing uncertainties. The paper introduces an innovative data-driven approach to inbound load planning. Leveraging extensive historical data, the paper presents a two-stage decision-making process using deep learning and conformal prediction to provide scalable, accurate, and confidence-aware solutions. The first stage of the prediction is dedicated to tactical load-planning, while the second stage is dedicated to the operational planning, incorporating the latest available data to refine the decisions at the finest granularity. Extensive experiments compare traditional machine learning models and deep learning methods. They highlight the importance and effectiveness of the embedding layers for enhancing the performance of deep learning models. Furthermore, the results emphasize the efficacy of conformal prediction to provide confidence-aware prediction sets. The findings suggest that data-driven methods can substantially improve decision making in inbound load planning, offering planners a comprehensive, trustworthy, and real-time framework to make decisions. The initial deployment in the industry setting indicates a high accuracy of the proposed framework.
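Since the abstract leans on conformal prediction for confidence-aware outputs, a minimal split-conformal sketch may help readers unfamiliar with it: residuals of any point forecaster on a held-out calibration set yield prediction intervals with the desired marginal coverage. This generic recipe is not the paper's two-stage system, and the residuals and forecasts below are made up.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    """Split conformal prediction with absolute-residual scores.
    residuals_cal: |y_cal - model(x_cal)| on a held-out calibration set.
    Returns (lower, upper) intervals with ~(1 - alpha) marginal coverage."""
    n = len(residuals_cal)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)   # finite-sample corrected rank
    q = np.sort(residuals_cal)[k - 1]                 # conformal quantile
    return y_pred_new - q, y_pred_new + q

# Toy usage: calibration residuals of some load forecaster, plus new point forecasts
rng = np.random.default_rng(0)
residuals_cal = np.abs(rng.normal(scale=5.0, size=500))
y_pred_new = np.array([120.0, 80.0, 210.0])
lo, hi = split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1)
print(list(zip(lo.round(1), hi.round(1))))
```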
[LG-13] me-Series Forecasting in Smart Manufacturing Systems: An Experimental Evaluation of the State-of-the-art Algorithms
链接: https://arxiv.org/abs/2411.17499
作者: Mojtaba A. Farahani,Fadi El Kalach,Austin Harper,M. R. McCormick,Ramy Harik,Thorsten Wuest
关键词-EN: domains including manufacturing, TSF, TSF algorithms, domains including, numerous TSF algorithms
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Time-series forecasting (TSF) is growing in various domains including manufacturing. Although numerous TSF algorithms have been developed recently, a validation and evaluation of these algorithms, which holds substantial value for researchers and practitioners, is still missing. This study aims to fill this gap by evaluating the SoTA TSF algorithms on thirteen manufacturing datasets, focusing on their applicability in manufacturing. Each algorithm was selected based on its TSF category to ensure a representative set of algorithms. The evaluation includes different scenarios to evaluate the models using two problem categories and two forecasting horizons. To evaluate the performance, the WAPE was calculated, and additional post hoc analyses were conducted to assess the significance of observed differences. Only algorithms with codes from open-source libraries were utilized, and no hyperparameter tuning was done. This allowed us to evaluate the algorithms as “out-of-the-box” solutions that can be easily implemented, ensuring their usability within manufacturing by practitioners with limited technical knowledge. This facilitates the adoption of these techniques in smart manufacturing systems. Based on the results, transformer and MLP-based architectures demonstrated the best performance, with MLP-based architectures winning the most scenarios. For univariate TSF, PatchTST emerged as the most robust, particularly for long-term horizons, while for multivariate problems, MLP-based architectures like N-HITS and TiDE showed superior results. The study revealed that simpler algorithms like XGBoost could outperform complex algorithms in certain tasks. These findings challenge the assumption that more sophisticated models produce better results. Additionally, the research highlighted the importance of computational resource considerations, showing variations in runtime and memory usage across different algorithms.
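WAPE, the metric used for the comparison, is simply the sum of absolute errors normalized by the sum of absolute actuals; a short reference implementation follows (the numbers are made up for illustration).

```python
import numpy as np

def wape(y_true, y_pred):
    """Weighted Absolute Percentage Error = sum|y - yhat| / sum|y|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

print(wape([100, 120, 80, 95], [110, 115, 70, 100]))  # ~0.076
```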
[LG-14] A Graph Neural Network deep-dive into successful counterattacks
链接: https://arxiv.org/abs/2411.17450
作者: Joris Bekkers,Amod Sahasrabudhe
关键词-EN: high intensity direct, intensity direct attack, Graph Neural Networks, gender-specific Graph Neural, high intensity
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 11 pages, 11 figures, first submitted (and accepted) at MIT Sloan Sports Analytics Conference 2023
点击查看摘要
Abstract:A counterattack in soccer is a high speed, high intensity direct attack that can occur when a team transitions from a defensive state to an attacking state after regaining possession of the ball. The aim is to create a goal-scoring opportunity by covering a lot of ground with minimal passes before the opposing team can recover their defensive shape. The purpose of this research is to build gender-specific Graph Neural Networks to model the likelihood of a counterattack being successful and uncover what factors make them successful in professional soccer. These models are trained on a total of 20863 frames of synchronized on-ball event and spatiotemporal (broadcast) tracking data. This dataset is derived from 632 games of MLS (2022), NWSL (2022) and international soccer (2020-2022). With this data we demonstrate that gender-specific Graph Neural Networks outperform architecturally identical gender-ambiguous models in predicting the successful outcome of counterattacks. We show, using Permutation Feature Importance, that byline to byline speed, angle to the goal, angle to the ball and sideline to sideline speed are the node features with the highest impact on model performance. Additionally, we offer some illustrative examples on how to navigate the infinite solution search space to aid in identifying improvements for player decision making. This research is accompanied by an open-source repository containing all data and code, and it is also accompanied by an open-source Python package which simplifies converting spatiotemporal data into graphs. This package also facilitates testing, validation, training and prediction with this data. This should allow the reader to replicate and improve upon our research more easily.
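Permutation Feature Importance, used here to rank node features, is model-agnostic and easy to reproduce: shuffle one feature column at a time and measure how much a chosen metric degrades. Below is a generic scikit-learn sketch on synthetic tabular data standing in for node attributes; it is not the graph model or data from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in features playing the role of node attributes (speeds, angles, ...)
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)

# Features whose shuffling hurts accuracy the most rank highest
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```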
[LG-15] Maximally Separated Active Learning ECCV2024
链接: https://arxiv.org/abs/2411.17444
作者: Tejaswi Kasarla,Abhishek Jha,Faye Tervoort,Rita Cucchiara,Pascal Mettes
关键词-EN: minimizing annotation costs, Active Learning, Active Learning aims, unlabelled pool, aims to optimize
类目: Machine Learning (cs.LG)
*备注: ECCV 2024 Beyond Euclidean Workshop (proceedings)
点击查看摘要
Abstract:Active Learning aims to optimize performance while minimizing annotation costs by selecting the most informative samples from an unlabelled pool. Traditional uncertainty sampling often leads to sampling bias by choosing similar uncertain samples. We propose an active learning method that utilizes fixed equiangular hyperspherical points as class prototypes, ensuring consistent inter-class separation and robust feature representations. Our approach introduces Maximally Separated Active Learning (MSAL) for uncertainty sampling and a combined strategy (MSAL-D) for incorporating diversity. This method eliminates the need for costly clustering steps, while maintaining diversity through hyperspherical uniformity. We demonstrate strong performance over existing active learning techniques across five benchmark datasets, highlighting the method’s effectiveness and integration ease. The code is available on GitHub.
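One standard way to obtain fixed, equiangular class prototypes on the hypersphere (when the number of classes C is at most d+1 for feature dimension d) is the regular-simplex construction sketched below, whose pairwise cosine similarity is exactly -1/(C-1). How MSAL then uses such prototypes for uncertainty and diversity sampling is not shown here.

```python
import numpy as np

def simplex_prototypes(n_classes: int) -> np.ndarray:
    """Return n_classes unit vectors in R^{n_classes} with equal pairwise angles
    (cosine similarity -1/(C-1)), i.e. the vertices of a regular simplex."""
    C = n_classes
    P = np.eye(C) - np.ones((C, C)) / C
    P /= np.linalg.norm(P, axis=1, keepdims=True)   # project rows onto the unit sphere
    return P

protos = simplex_prototypes(5)
gram = protos @ protos.T
print(np.round(gram, 3))   # diagonal = 1, off-diagonal = -0.25 = -1/(C-1)
```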
[LG-16] Robust Bayesian Optimization via Localized Online Conformal Prediction
链接: https://arxiv.org/abs/2411.17387
作者: Dongwon Kim,Matteo Zecchin,Sangwoo Park,Joonhyuk Kang,Osvaldo Simeone
关键词-EN: zeroth-order noisy observations, optimizing black-box objective, black-box objective functions, objective function, sequential approach
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Bayesian optimization (BO) is a sequential approach for optimizing black-box objective functions using zeroth-order noisy observations. In BO, Gaussian processes (GPs) are employed as probabilistic surrogate models to estimate the objective function based on past observations, guiding the selection of future queries to maximize utility. However, the performance of BO heavily relies on the quality of these probabilistic estimates, which can deteriorate significantly under model misspecification. To address this issue, we introduce localized online conformal prediction-based Bayesian optimization (LOCBO), a BO algorithm that calibrates the GP model through localized online conformal prediction (CP). LOCBO corrects the GP likelihood based on predictive sets produced by LOCBO, and the corrected GP likelihood is then denoised to obtain a calibrated posterior distribution on the objective function. The likelihood calibration step leverages an input-dependent calibration threshold to tailor coverage guarantees to different regions of the input space. Under minimal noise assumptions, we provide theoretical performance guarantees for LOCBO’s iterates that hold for the unobserved objective function. These theoretical findings are validated through experiments on synthetic and real-world optimization tasks, demonstrating that LOCBO consistently outperforms state-of-the-art BO algorithms in the presence of model misspecification.
[LG-17] MFF-FTNet: Multi-scale Feature Fusion across Frequency and Temporal Domains for Time Series Forecasting
链接: https://arxiv.org/abs/2411.17382
作者: Yangyang Shi,Qianqian Ren,Yong Liu,Jianguo Sun
关键词-EN: current deep learning, Time Domain Contrastive, time series data, Domain Contrastive Module, deep learning models
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Time series forecasting is crucial in many fields, yet current deep learning models struggle with noise, data sparsity, and capturing complex multi-scale patterns. This paper presents MFF-FTNet, a novel framework addressing these challenges by combining contrastive learning with multi-scale feature extraction across both frequency and time domains. MFF-FTNet introduces an adaptive noise augmentation strategy that adjusts scaling and shifting factors based on the statistical properties of the original time series data, enhancing model resilience to noise. The architecture is built around two complementary modules: a Frequency-Aware Contrastive Module (FACM) that refines spectral representations through frequency selection and contrastive learning, and a Complementary Time Domain Contrastive Module (CTCM) that captures both short- and long-term dependencies using multi-scale convolutions and feature fusion. A unified feature representation strategy enables robust contrastive learning across domains, creating an enriched framework for accurate forecasting. Extensive experiments on five real-world datasets demonstrate that MFF-FTNet significantly outperforms state-of-the-art models, achieving a 7.7% MSE improvement on multivariate tasks. These findings underscore MFF-FTNet’s effectiveness in modeling complex temporal patterns and managing noise and sparsity, providing a comprehensive solution for both long- and short-term forecasting.
[LG-18] Epidemiology-informed Graph Neural Network for Heterogeneity-aware Epidemic Forecasting
链接: https://arxiv.org/abs/2411.17372
作者: Yufan Zheng,Wei Jiang,Alexander Zhou,Nguyen Quoc Viet Hung,Choujun Zhan,Tong Chen
关键词-EN: public health management, spatio-temporal prediction tasks, prediction tasks, health management, plays a critical
类目: Machine Learning (cs.LG)
*备注: 14 pages, 6 figures, 3 tables
点击查看摘要
Abstract:Among various spatio-temporal prediction tasks, epidemic forecasting plays a critical role in public health management. Recent studies have demonstrated the strong potential of spatio-temporal graph neural networks (STGNNs) in extracting heterogeneous spatio-temporal patterns for epidemic forecasting. However, most of these methods bear an over-simplified assumption that two locations (e.g., cities) with similar observed features in previous time steps will develop similar infection numbers in the future. In fact, for any epidemic disease, there exists strong heterogeneity of its intrinsic evolution mechanisms across geolocation and time, which can eventually lead to diverged infection numbers in two “similar” locations. However, such mechanistic heterogeneity is non-trivial to capture due to the existence of numerous influencing factors like medical resource accessibility, virus mutations, mobility patterns, etc., most of which are spatio-temporal yet unreachable or even unobservable. To address this challenge, we propose a Heterogeneous Epidemic-Aware Transmission Graph Neural Network (HeatGNN), a novel epidemic forecasting framework. By binding the epidemiology mechanistic model into a GNN, HeatGNN learns epidemiology-informed location embeddings of different locations that reflect their own transmission mechanisms over time. With the time-varying mechanistic affinity graphs computed with the epidemiology-informed location embeddings, a heterogeneous transmission graph network is designed to encode the mechanistic heterogeneity among locations, providing additional predictive signals to facilitate accurate forecasting. Experiments on three benchmark datasets have revealed that HeatGNN outperforms various strong baselines. Moreover, our efficiency analysis verifies the real-world practicality of HeatGNN on datasets of different sizes.
[LG-19] Efficient Deployment of Transformer Models in Analog In-Memory Computing Hardware
链接: https://arxiv.org/abs/2411.17367
作者: Chen Li,Corey Lammie,Manuel Le Gallo,Bipin Rajendran
关键词-EN: von Neumann bottleneck, accelerating neural network, improving computational efficiency, neural network computations, Neumann bottleneck
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Analog in-memory computing (AIMC) has emerged as a promising solution to overcome the von Neumann bottleneck, accelerating neural network computations and improving computational efficiency. While AIMC has demonstrated success with architectures such as CNNs, MLPs, and RNNs, deploying transformer-based models using AIMC presents unique challenges. Transformers are expected to handle diverse downstream tasks and adapt to new user data or instructions after deployment, which requires more flexible approaches to suit AIMC constraints. In this paper, we propose a novel method for deploying pre-trained transformer models onto AIMC hardware. Unlike traditional approaches requiring hardware-aware training, our technique allows direct deployment without the need for retraining the original model. Instead, we utilize lightweight, low-rank adapters – compact modules stored in digital cores – to adapt the model to hardware constraints. We validate our approach on MobileBERT, demonstrating accuracy on par with, or even exceeding, a traditional hardware-aware training approach. Our method is particularly appealing in multi-task scenarios, as it enables a single analog model to be reused across multiple tasks. Moreover, it supports on-chip adaptation to new hardware constraints and tasks without updating analog weights, providing a flexible and versatile solution for real-world AI applications. Code is available.
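The low-rank adapters mentioned here follow the standard LoRA formulation: a frozen base weight plus a trainable low-rank update B·A kept, in this setting, in digital cores. Below is a generic PyTorch sketch of such an adapter layer, not the paper's AIMC deployment code; rank, scaling, and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = W x + (alpha / r) * B A x."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)        # frozen (here: "analog") weights
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(768, 768)
x = torch.randn(4, 128, 768)
print(layer(x).shape)                                              # torch.Size([4, 128, 768])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the adapter trains
```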
[LG-20] Joint Combinatorial Node Selection and Resource Allocations in the Lightning Network using Attention-based Reinforcement Learning
链接: https://arxiv.org/abs/2411.17353
作者: Mahdi Salahshour,Amirahmad Shafiee,Mojtaba Tefagh
关键词-EN: Bitcoin scalability challenges, Payment Channel Networks, solution to Bitcoin, Bitcoin scalability, Lightning Network
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:
点击查看摘要
Abstract:The Lightning Network (LN) has emerged as a second-layer solution to Bitcoin’s scalability challenges. The rise of Payment Channel Networks (PCNs) and their specific mechanisms incentivize individuals to join the network for profit-making opportunities. According to the latest statistics, the total value locked within the Lightning Network is approximately $500 million. Meanwhile, joining the LN with the profit-making incentives presents several obstacles, as it involves solving a complex combinatorial problem that encompasses both discrete and continuous control variables related to node selection and resource allocation, respectively. Current research inadequately captures the critical role of resource allocation and lacks realistic simulations of the LN routing mechanism. In this paper, we propose a Deep Reinforcement Learning (DRL) framework, enhanced by the power of transformers, to address the Joint Combinatorial Node Selection and Resource Allocation (JCNSRA) problem. We have improved upon an existing environment by introducing modules that enhance its routing mechanism, thereby narrowing the gap with the actual LN routing system and ensuring compatibility with the JCNSRA problem. We compare our model against several baselines and heuristics, demonstrating its superior performance across various settings. Additionally, we address concerns regarding centralization in the LN by deploying our agent within the network and monitoring the centrality measures of the evolved graph. Our findings suggest not only an absence of conflict between LN’s decentralization goals and individuals’ revenue-maximization incentives but also a positive association between the two.
[LG-21] Correlation-Aware Graph Convolutional Networks for Multi-Label Node Classification KDD2025
链接: https://arxiv.org/abs/2411.17350
作者: Yuanchen Bei,Weizhi Chen,Hao Chen,Sheng Zhou,Carl Yang,Jiapei Fan,Longtao Huang,Jiajun Bu
关键词-EN: real-world nodes belong, Multi-label node classification, Graph Convolution Networks, Graph Convolutional Network, important yet under-explored
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 14 pages, accepted by KDD2025
点击查看摘要
Abstract:Multi-label node classification is an important yet under-explored domain in graph mining as many real-world nodes belong to multiple categories rather than just a single one. Although a few efforts have been made by utilizing Graph Convolution Networks (GCNs) to learn node representations and model correlations between multiple labels in the embedding space, they still suffer from the ambiguous features and ambiguous topology induced by multiple labels, which reduce the credibility of the messages delivered in graphs and overlook the label correlations on graph data. Therefore, it is crucial to reduce the ambiguity and empower the GCNs for accurate classification. However, this is quite challenging due to the requirement of retaining the distinctiveness of each label while fully harnessing the correlation between labels simultaneously. To address these issues, in this paper, we propose a Correlation-aware Graph Convolutional Network (CorGCN) for multi-label node classification. By introducing a novel Correlation-Aware Graph Decomposition module, CorGCN can learn a graph that contains rich label-correlated information for each label. It then employs a Correlation-Enhanced Graph Convolution to model the relationships between labels during message passing to further bolster the classification process. Extensive experiments on five datasets demonstrate the effectiveness of our proposed CorGCN.
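论文的 CorGCN 模块此处未给出实现;为了让“多标签节点分类”这一任务设定更具体,下面给出一个使用稠密 GCN 层加逐标签二元交叉熵的最小基线草图(并非 CorGCN 方法本身),张量规模均为示意性假设。

```python
# Minimal multi-label node classification baseline with a dense GCN layer.
# This illustrates the task setup only; it is not the CorGCN method.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + torch.eye(adj.size(0))
    deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)

class MultiLabelGCN(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, num_labels)

    def forward(self, x: torch.Tensor, adj_norm: torch.Tensor) -> torch.Tensor:
        h = F.relu(adj_norm @ self.w1(x))
        return adj_norm @ self.w2(h)  # one logit per label, per node

num_nodes, in_dim, num_labels = 100, 16, 5
x = torch.randn(num_nodes, in_dim)
adj = (torch.rand(num_nodes, num_nodes) < 0.05).float()
adj = ((adj + adj.t()) > 0).float()                      # symmetrize
y = (torch.rand(num_nodes, num_labels) < 0.3).float()    # multi-hot labels

model = MultiLabelGCN(in_dim, 32, num_labels)
logits = model(x, normalize_adj(adj))
loss = F.binary_cross_entropy_with_logits(logits, y)     # one binary loss per label
print(loss.item())
```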
[LG-22] sbi reloaded: a toolkit for simulation-based inference workflows
链接: https://arxiv.org/abs/2411.17337
作者: Jan Boelts,Michael Deistler,Manuel Gloeckler,Álvaro Tejero-Cantero,Jan-Matthis Lueckmann,Guy Moss,Peter Steinbach,Thomas Moreau,Fabio Muratore,Julia Linhart,Conor Durkan,Julius Vetter,Benjamin Kurt Miller,Maternus Herold,Abolfazl Ziaeemehr,Matthijs Pals,Theo Gruner,Sebastian Bischoff,Nastya Krouglova,Richard Gao,Janne K. Lappalainen,Bálint Mucsányi,Felix Pei,Auguste Schulz,Zinovia Stefanidi,Pedro Rodrigues,Cornelius Schröder,Faried Abu Zaid,Jonas Beck,Jaivardhan Kapoor,David S. Greenberg,Pedro J. Gonçalves,Jakob H. Macke
关键词-EN: SBI, match observed data, empirically observed phenomena, observed data, observed phenomena
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Scientists and engineers use simulators to model empirically observed phenomena. However, tuning the parameters of a simulator to ensure its outputs match observed data presents a significant challenge. Simulation-based inference (SBI) addresses this by enabling Bayesian inference for simulators, identifying parameters that match observed data and align with prior knowledge. Unlike traditional Bayesian inference, SBI only needs access to simulations from the model and does not require evaluations of the likelihood function. In addition, SBI algorithms do not require gradients through the simulator, allow for massive parallelization of simulations, and can perform inference for different observations without further simulations or training, thereby amortizing inference. Over the past years, we have developed, maintained, and extended \texttt{sbi}, a PyTorch-based package that implements Bayesian SBI algorithms based on neural networks. The \texttt{sbi} toolkit implements a wide range of inference methods, neural network architectures, sampling methods, and diagnostic tools. In addition, it provides well-tested default settings but also offers flexibility to fully customize every step of the simulation-based inference workflow. Taken together, the \texttt{sbi} toolkit enables scientists and engineers to apply state-of-the-art SBI methods to black-box simulators, opening up new possibilities for aligning simulations with empirically observed data.
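sbi 包可通过 PyPI 安装。下面按照其文档中的经典 SNPE 流程(采样先验、运行模拟器、append_simulations、train、build_posterior)给出一个玩具示例;不同版本的 sbi 类名可能略有差异,此处仅作流程示意,玩具模拟器为自行假设。

```python
# Sketch of a simulation-based inference workflow with the sbi package
# (classic SNPE interface; class names may differ in newer sbi releases).
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

def simulator(theta: torch.Tensor) -> torch.Tensor:
    # Toy black-box simulator: noisy observation of the parameters (assumed here).
    return theta + 0.1 * torch.randn_like(theta)

prior = BoxUniform(low=-2 * torch.ones(3), high=2 * torch.ones(3))

theta = prior.sample((2000,))          # draw parameters from the prior
x = simulator(theta)                   # run the simulator (embarrassingly parallel)

inference = SNPE(prior=prior)
inference.append_simulations(theta, x).train()
posterior = inference.build_posterior()

x_o = torch.tensor([0.5, -0.3, 1.0])   # an "observed" data point
samples = posterior.sample((1000,), x=x_o)
print(samples.mean(dim=0))             # posterior mean; close to x_o for this toy model
```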
[LG-23] On the Generalization of Handwritten Text Recognition Models
链接: https://arxiv.org/abs/2411.17332
作者: Carlos Garrido-Munoz,Jorge Calvo-Zaragoza
关键词-EN: Handwritten Text Recognition, Text Recognition, Handwritten Text, Recent advances, advances in Handwritten
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advances in Handwritten Text Recognition (HTR) have led to significant reductions in transcription errors on standard benchmarks under the i.i.d. assumption, thus focusing on minimizing in-distribution (ID) errors. However, this assumption does not hold in real-world applications, which has motivated HTR research to explore Transfer Learning and Domain Adaptation techniques. In this work, we investigate the unaddressed limitations of HTR models in generalizing to out-of-distribution (OOD) data. We adopt the challenging setting of Domain Generalization, where models are expected to generalize to OOD data without any prior access. To this end, we analyze 336 OOD cases from eight state-of-the-art HTR models across seven widely used datasets, spanning five languages. Additionally, we study how HTR models leverage synthetic data to generalize. We reveal that the most significant factor for generalization lies in the textual divergence between domains, followed by visual divergence. We demonstrate that the error of HTR models in OOD scenarios can be reliably estimated, with discrepancies falling below 10 points in 70% of cases. We identify the underlying limitations of HTR models, laying the foundation for future research to address this challenge.
[LG-24] Privacy Preserving Federated Unsupervised Domain Adaptation with Application to Age Prediction from DNA Methylation Data
链接: https://arxiv.org/abs/2411.17287
作者: Cem Ata Baykara,Ali Burak Ünal,Nico Pfeifer,Mete Akgün
关键词-EN: data, predictive models, models are widely, performance can suffer, suffer greatly
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In computational biology, predictive models are widely used to address complex tasks, but their performance can suffer greatly when applied to data from different distributions. The current state-of-the-art domain adaptation method for high-dimensional data aims to mitigate these issues by aligning the input dependencies between training and test data. However, this approach requires centralized access to both source and target domain data, raising concerns about data privacy, especially when the data comes from multiple sources. In this paper, we introduce a privacy-preserving federated framework for unsupervised domain adaptation in high-dimensional settings. Our method employs federated training of Gaussian processes and weighted elastic nets to effectively address the problem of distribution shift between domains, while utilizing secure aggregation and randomized encoding to protect the local data of participating data owners. We evaluate our framework on the task of age prediction using DNA methylation data from multiple tissues, demonstrating that our approach performs comparably to existing centralized methods while maintaining data privacy, even in distributed environments where data is spread across multiple institutions. Our framework is the first privacy-preserving solution for high-dimensional domain adaptation in federated environments, offering a promising tool for fields like computational biology and medicine, where protecting sensitive data is essential.
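下面用 numpy 给出安全聚合中“成对加性掩码”这一构件的最小示意:各客户端的掩码在服务器求和时相互抵消,服务器只能看到聚合结果而看不到单个更新。论文的实际协议(联邦高斯过程、加权弹性网与随机编码)远比这复杂,此处仅示意聚合这一环节。

```python
# Pairwise-mask secure aggregation sketch: the server sees only masked updates,
# yet their sum equals the sum of the true client updates.
import numpy as np

rng = np.random.default_rng(0)
num_clients, dim = 3, 4
true_updates = [rng.normal(size=dim) for _ in range(num_clients)]

# Each pair (i, j) with i < j shares a random mask; client i adds it, client j subtracts it.
pair_masks = {(i, j): rng.normal(size=dim)
              for i in range(num_clients) for j in range(i + 1, num_clients)}

def masked_update(i: int) -> np.ndarray:
    masked = true_updates[i].copy()
    for (a, b), mask in pair_masks.items():
        if a == i:
            masked += mask
        elif b == i:
            masked -= mask
    return masked

server_sum = sum(masked_update(i) for i in range(num_clients))
print(np.allclose(server_sum, sum(true_updates)))  # True: masks cancel in the aggregate
```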
[LG-25] Using Large Language Models for Expert Prior Elicitation in Predictive Modelling
链接: https://arxiv.org/abs/2411.17284
作者: Alexander Capstick,Rahul G. Krishnan,Payam Barnaghi
关键词-EN: diverse data effectively, data effectively acquire, Large language models, Large language, trained on diverse
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Large language models (LLMs), trained on diverse data, effectively acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency hinder their direct application for specialised tasks. In fields such as clinical research, acquiring expert annotations or prior knowledge about predictive models is often costly and time-consuming. This study proposes using LLMs to elicit expert prior distributions for predictive models. This approach also provides an alternative to in-context learning, where language models are tasked with making predictions directly. We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation. Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings. Applied to clinical problems, this translates to fewer required biological samples, lowering cost and resources. Prior elicitation also consistently outperforms and proves more reliable than in-context learning at a lower cost, making it a preferred alternative in our setting. We demonstrate the utility of this method across various use cases, including clinical applications. For infection prediction, using LLM-elicited priors reduced the number of labels required to achieve the same accuracy as an uninformative prior by 55%, and did so 200 days earlier in the study.
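下面给出“LLM 诱导先验 + 贝叶斯线性回归共轭更新”的一个最小草图:其中 elicit_prior_from_llm 是对真实 LLM 调用的假设性占位(实际应为提示 LLM 给出系数的均值与标准差并解析其回答),并非论文的完整流程。

```python
# Sketch: plug an LLM-elicited Gaussian prior into Bayesian linear regression.
# `elicit_prior_from_llm` is a hypothetical placeholder for a real LLM call.
import numpy as np

def elicit_prior_from_llm(feature_name):
    # In practice: prompt an LLM for a plausible mean and standard deviation
    # of this feature's coefficient, then parse its answer. Hard-coded here.
    return {"age": (0.5, 0.2), "bmi": (0.1, 0.3)}.get(feature_name, (0.0, 1.0))

features = ["age", "bmi"]
prior_mean = np.array([elicit_prior_from_llm(f)[0] for f in features])
prior_std = np.array([elicit_prior_from_llm(f)[1] for f in features])

# Conjugate update for y = X w + noise, Gaussian prior on w, known noise variance sigma2.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                       # small-data regime
y = X @ np.array([0.6, 0.05]) + 0.1 * rng.normal(size=20)
sigma2 = 0.1 ** 2
prior_cov_inv = np.diag(1.0 / prior_std ** 2)

post_cov = np.linalg.inv(prior_cov_inv + X.T @ X / sigma2)
post_mean = post_cov @ (prior_cov_inv @ prior_mean + X.T @ y / sigma2)
print(post_mean)  # pulled toward the elicited prior when data are scarce
```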
[LG-26] he Exploration of Neural Collapse under Imbalanced Data
链接: https://arxiv.org/abs/2411.17278
作者: Haixia Liu
关键词-EN: newly identified characteristic, left orthonormal transformation, orthonormal transformation, Neural collapse, explore neural collapse
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 26 pages, 4 figures
点击查看摘要
Abstract:Neural collapse, a newly identified characteristic, describes a property of solutions during model training. In this paper, we explore neural collapse in the context of imbalanced data. We consider the $L$-extended unconstrained feature model with a bias term and provide a theoretical analysis of the global minimizer. Our findings include: (1) Features within the same class converge to their class mean, similar to both the balanced case and the imbalanced case without bias. (2) The geometric structure is mainly determined by the left orthonormal transformation of the product of the $L$ linear classifiers and the right transformation of the class-mean matrix. (3) Some rows of the left orthonormal transformation of the product of the $L$ linear classifiers collapse to zero and the others are orthogonal, which depends on the singular values of $\hat{Y}=(I_K-\frac{1}{N}\mathbf{n}\mathbf{1}^\top_K)D$, where $K$ is the number of classes, $\mathbf{n}$ is the vector of per-class sample sizes, $N$ is the total sample size, and $D$ is the diagonal matrix whose diagonal entries are given by $\sqrt{\mathbf{n}}$. Similar results hold for the columns of the right orthonormal transformation of the product of the class-mean matrix and $D$. (4) The $i$-th row of the left orthonormal transformation of the product of the $L$ linear classifiers aligns with the $i$-th column of the right orthonormal transformation of the product of the class-mean matrix and $D$. (5) We provide estimates of the singular values of $\hat{Y}$. Our numerical experiments support these theoretical findings.
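摘要中的矩阵 $\hat{Y}$ 只依赖于各类样本数。下面用 numpy 做一个简单的数值检查,直观展示不平衡类规模下 $\hat{Y}$ 的奇异值;类规模向量为随意假设的示例。

```python
# Compute \hat{Y} = (I_K - (1/N) n 1_K^T) D for an imbalanced class-size vector
# and inspect its singular values, which result (3) above depends on.
import numpy as np

n = np.array([500, 100, 20, 5], dtype=float)   # imbalanced class sizes (assumed)
K, N = len(n), n.sum()
D = np.diag(np.sqrt(n))
Y_hat = (np.eye(K) - np.outer(n, np.ones(K)) / N) @ D

print(np.linalg.svd(Y_hat, compute_uv=False))  # singular values of \hat{Y}
```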
[LG-27] Disentangled Interpretable Representation for Efficient Long-term Time Series Forecasting ICDE
链接: https://arxiv.org/abs/2411.17257
作者: Yuang Zhao,Tianyu Li,Jiadong Chen,Shenrong Ye,Fuxin Jiang,Tieying Zhang,Xiaofeng Gao
关键词-EN: Time Series Forecasting, Long-term Time Series, high-stakes application scenarios, Long-term Time, Series Forecasting
类目: Machine Learning (cs.LG)
*备注: This work is submitted to IEEE International Conference on Data Engineering (ICDE) 2025
点击查看摘要
Abstract:Industry 5.0 introduces new challenges for Long-term Time Series Forecasting (LTSF), characterized by high-dimensional, high-resolution data and high-stakes application scenarios. Against this backdrop, developing efficient and interpretable models for LTSF becomes a key challenge. Existing deep learning and linear models often suffer from excessive parameter complexity and lack intuitive interpretability. To address these issues, we propose DiPE-Linear, a Disentangled interpretable Parameter-Efficient Linear network. DiPE-Linear incorporates three temporal components: Static Frequential Attention (SFA), Static Temporal Attention (STA), and Independent Frequential Mapping (IFM). These components alternate between learning in the frequency and time domains to achieve disentangled interpretability. The decomposed model structure reduces parameter complexity from quadratic (as in fully connected networks, FCs) to linear, and computational complexity from quadratic to log-linear. Additionally, a Low-Rank Weight Sharing policy enhances the model’s ability to handle multivariate series. Despite operating within a subspace of FCs with limited expressive capacity, DiPE-Linear demonstrates comparable or superior performance to both FCs and nonlinear models across multiple open-source and real-world LTSF datasets, validating the effectiveness of its carefully designed structure. The combination of efficiency, accuracy, and interpretability makes DiPE-Linear a strong candidate for advancing LTSF in both research and real-world applications. The source code is available at this https URL.
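下面给出“频域可学习滤波 + 时域线性映射”这两类组件组合的一个示意草图(PyTorch),仅用于说明摘要中提到的频/时域交替线性建模思路,并非 DiPE-Linear 的 SFA/STA/IFM 结构或低秩权重共享的实现。

```python
# Schematic combination of a frequency-domain filter and a time-domain linear map
# for long-term forecasting. Not the DiPE-Linear architecture itself.
import torch
import torch.nn as nn

class FreqTimeLinearForecaster(nn.Module):
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        n_freq = lookback // 2 + 1
        # Learnable per-frequency gain applied to the rFFT of the input window.
        self.freq_gain = nn.Parameter(torch.ones(n_freq))
        # Linear map from the filtered history to the forecast horizon.
        self.time_map = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback)
        spec = torch.fft.rfft(x, dim=-1) * self.freq_gain
        filtered = torch.fft.irfft(spec, n=x.size(-1), dim=-1)
        return self.time_map(filtered)

model = FreqTimeLinearForecaster(lookback=96, horizon=24)
y_hat = model(torch.randn(8, 96))
print(y_hat.shape)  # torch.Size([8, 24])
```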
[LG-28] On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning
链接: https://arxiv.org/abs/2411.17207
作者: Anton Frederik Thielmann,Soheila Samiee
关键词-EN: Recent advancements, tabular deep learning, substantial performance improvements, deep learning, surpassing the capabilities
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advancements in tabular deep learning (DL) have led to substantial performance improvements, surpassing the capabilities of traditional models. With the adoption of techniques from natural language processing (NLP), such as language model-based approaches, DL