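Today's arXiv Papers | 2024-11-28

This post lists the latest papers retrieved from arXiv.org on 2024-11-28. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.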

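[Quick read]: This paper examines how visual and linguistic information interact and are integrated in multimodal large language models (MLLMs). Its key contribution is to reveal how information flows at different layers: in the lower layers, the model transfers the overall visual features of the image into the representations of the question tokens; in the middle layers, it transfers visual information about specific objects to the positions of the corresponding question tokens; in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. These findings offer a new perspective on the spatial and functional aspects of image and language processing in MLLMs and lay groundwork for future research on multimodal information localization and editing.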

Link: https://arxiv.org/abs/2411.18620
Authors: Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova
Keywords (EN): demonstrated promising progress, large language models, auto-regressive multimodal large, large language, multimodal large language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities – language and vision – in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs, thereby facilitating future research into multimodal information localization and editing.

[NLP-1] Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

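[Quick read]: This paper addresses the inefficiency of manual literature review, especially given the growing number of research articles. The key idea is to generate literature reviews automatically, using several natural language processing (NLP) techniques and retrieval-augmented generation (RAG) with a large language model (LLM). The approaches compared are a frequency-based method (spaCy), a transformer model (Simple T5), and RAG with GPT-3.5-turbo. In the experiments, GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364, so the system built on it was judged the best solution and was given a graphical user interface.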

Link: https://arxiv.org/abs/2411.18583
Authors: Nurshat Fateh Ali, Md. Mahdi Mohtasim, Shakil Mosharrof, T. Gopi Krishna
Keywords (EN): Large Language Model, Natural Language Processing, compares multiple approaches, Large Language, Language Model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Key Words: T5, SpaCy, Large Language Model, GPT, ROUGE, Literature Review, Natural Language Processing, Retrieval-augmented generation

Abstract:This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles provides a huge challenge for manual literature review. It has resulted in an increased demand for automation. Developing a system capable of automatically generating the literature reviews from only the PDF files as input is the primary objective of this research work. The effectiveness of several Natural Language Processing (NLP) strategies, such as the frequency-based method (spaCy), the transformer model (Simple T5), and retrieval-augmented generation (RAG) with Large Language Model (GPT-3.5-turbo), is evaluated to meet the primary objective. The SciTLDR dataset is chosen for this research experiment and three distinct techniques are utilized to implement three different systems for auto-generating the literature reviews. The ROUGE scores are used for the evaluation of all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364. The transformer model comes in second place and spaCy is at the last position. Finally, a graphical user interface is created for the best system based on the large language model.
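
The ROUGE-1 score used to rank the three systems measures unigram overlap between a generated text and a reference. The sketch below is a simplified illustration, assuming lowercased whitespace tokens and no stemming; published ROUGE toolkits tokenize differently, and the abstract does not say which ROUGE-1 variant (precision, recall, or F1) the 0.364 refers to. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F1 (whitespace tokens, lowercased)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (not data from the paper): 4 of 5 candidate tokens appear in the reference.
scores = rouge1("the model generates a short summary", "the model writes a summary")
print(scores)
```

Here four unigrams overlap, so precision is 4/5 and recall is 4/6; the F1 value combines the two.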

[NLP-2] On Importance of Code-Mixed Embeddings for Hate Speech Identification

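[Quick read]: This paper addresses the challenges that code-mixing in multilingual communities poses for natural language processing (NLP) tools, particularly in hate speech detection. The key is to use specially trained code-mixed embeddings and models such as HingBERT and Hing-FastText, which were trained on a Hindi-English corpus (L3Cube-HingCorpus) and identify hate speech in code-mixed text more effectively, outperforming conventional BERT and standard English FastText models.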

Link: https://arxiv.org/abs/2411.18577
Authors: Shruti Jagdale, Omkar Khade, Gauri Takalikar, Mihir Inamdar, Raviraj Joshi
Keywords (EN): people commonly speak, India where people, commonly speak multiple, occurs in multilingual, multilingual communities
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, face challenges when dealing with code-mixed data. Extracting meaningful information from sentences containing multiple languages becomes difficult, particularly in tasks like hate speech detection, due to linguistic variation, cultural nuances, and data sparsity. To address this, we aim to analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech detection. Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform BERT models when tested on hate speech text datasets. We also found that code-mixed Hing-FastText performs better than standard English FastText and vanilla BERT models.

[NLP-3] Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning

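[Quick read]: This paper addresses the problem of adapting large language models (LLMs) to low-resource languages such as Marathi. The key is to fine-tune multilingual Gemma models with Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) technique. The results show that although evaluation metrics may decline after fine-tuning, in manual assessment the fine-tuned models often outperform the originals, indicating better generation in the target language but weaker reasoning. This underscores the need for improved evaluation methods and high-quality native datasets to assess language models accurately in low-resource settings.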

Link: https://arxiv.org/abs/2411.18571
Authors: Omkar Khade, Shruti Jagdale, Abhishek Phaltankar, Gauri Takalikar, Raviraj Joshi
Keywords (EN): Large Language Models, demonstrated remarkable multilingual, Large Language, multilingual Gemma models, demonstrated remarkable
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, yet challenges persist in adapting these models for low-resource languages. In this study, we investigate the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi, a language with limited resources. Using a translated Alpaca dataset with 52,000 instruction-response pairs, our findings reveal that while evaluation metrics often show a performance decline post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts. The observations indicate improvements in target language generation capabilities but a reduction in reasoning abilities following language adaptation. These results underscore the need for improved evaluation methodologies and the creation of high-quality native datasets to accurately assess language-specific model performance in low-resource settings.
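
For readers unfamiliar with LoRA, the general idea is to keep a pretrained weight matrix W frozen and learn a low-rank update, so the effective weight is W + (alpha / r) * B A. The sketch below illustrates that formula in plain Python; the matrix sizes, rank, and scaling value are hypothetical and are not the settings used for the Gemma models in this paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, where W is d x k, B is d x r and A is r x k."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d, k, r, alpha = 8, 6, 2, 4  # hypothetical sizes, chosen only for illustration
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen pretrained weight
a = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, small random init
b = [[0.0] * r for _ in range(d)]                               # trainable, zero init

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
adapted = lora_effective_weight(w, a, b, alpha, r)

full_params = d * k        # parameters updated by full fine-tuning of this layer
lora_params = r * (d + k)  # parameters updated by LoRA for the same layer
print(full_params, lora_params)
```

The parameter saving grows with layer size: the trainable count scales with r * (d + k) rather than d * k.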

[NLP-4] A Pipeline of Neural-Symbolic Integration to Enhance Spatial Reasoning in Large Language Models

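[Quick read]: This paper addresses the weakness of large language models (LLMs) in spatial reasoning. The key is a novel neural-symbolic framework that combines neural networks with symbolic reasoning to strengthen LLMs' spatial reasoning. Three strategies are implemented: (1) symbolic reasoning based on Answer Set Programming (ASP), (2) an LLM + ASP pipeline using DSPy, and (3) Fact + Logical rules. Experiments show that these strategies raise spatial-reasoning accuracy by 40-50% on the StepGame benchmark and by 3-13% on SparQA. The "LLM + ASP" pipeline performs especially well on Finding Relations (FR) and Finding Block (FB) questions, showing the potential and broad applicability of neural-symbolic methods for improving spatial reasoning in LLMs.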

Link: https://arxiv.org/abs/2411.18564
Authors: Rong Wang, Kun Sun, Jonas Kuhn
Keywords (EN): Large Language Models, Large Language, Language Models, demonstrated impressive capabilities, Answer Set Programming
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks. However, LLMs often struggle with spatial reasoning which is one essential part of reasoning and inference and requires understanding complex relationships between objects in space. This paper proposes a novel neural-symbolic framework that enhances LLMs’ spatial reasoning abilities. We evaluate our approach on two benchmark datasets: StepGame and SparQA, implementing three distinct strategies: (1) ASP (Answer Set Programming)-based symbolic reasoning, (2) LLM + ASP pipeline using DSPy, and (3) Fact + Logical rules. Our experiments demonstrate significant improvements over the baseline prompting methods, with accuracy increases of 40-50% on StepGame dataset and 3-13% on the more complex SparQA dataset. The “LLM + ASP” pipeline achieves particularly strong results on the tasks of Finding Relations (FR) and Finding Block (FB) questions, though performance varies across different question types. The impressive results suggest that while neural-symbolic approaches offer promising directions for enhancing spatial reasoning in LLMs, their effectiveness depends heavily on the specific task characteristics and implementation strategies. We propose an integrated, simple yet effective set of strategies using a neural-symbolic pipeline to boost spatial reasoning abilities in LLMs. This pipeline and its strategies demonstrate strong and broader applicability to other reasoning domains in LLMs, such as temporal reasoning, deductive inference etc.

[NLP-5] Retrofitting (Large) Language Models with Dynamic Tokenization

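[Quick read]: This paper addresses the problem that current language models (LMs) use a fixed, static subword tokenizer, which degrades efficiency and capability in languages other than English and makes it hard to adapt to new domains or languages. The key is dynamic tokenization: deciding token boundaries dynamically based on the input text. For encoder-style models, the paper proposes a subword-merging algorithm inspired by byte-pair encoding (BPE) that merges frequent subword sequences at the batch level and uses a pretrained embedding-prediction hypernetwork to compute token embeddings on the fly. For decoder-style models, dynamic tokenization is applied in two ways: (1) for prefilling, reducing sequence length by up to 40% while maintaining the performance of Mistral-7B; and (2) via an approximate nearest neighbor index, enabling fast generation with a one-million-token vocabulary and showing that the approach scales to even larger dynamic vocabularies. Overall, dynamic tokenization substantially improves inference speed and promotes fairness across languages, an important step toward overcoming the limits of static tokenization and toward more adaptable and equitable language models.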

Link: https://arxiv.org/abs/2411.18553
Authors: Darius Feher, Benjamin Minixhofer, Ivan Vulić
Keywords (EN): Current language models, Current language, static subword tokenizer, Current, subword tokenizer
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Current language models (LMs) use a fixed, static subword tokenizer. This choice, often taken for granted, typically results in degraded efficiency and capabilities in languages other than English, and makes it challenging to apply LMs to new domains or languages. To address these issues, we propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text. For encoder-style models, we introduce a subword-merging algorithm inspired by byte-pair encoding (BPE), but at a batch level. We merge frequent subword sequences in a batch, then apply a pretrained embedding-prediction hypernetwork to compute the token embeddings on-the-fly. When applied with word-level boundaries, this on average reduces token sequence lengths by 20% across 14 languages on XNLI with XLM-R while degrading its task performance by less than 2%. For decoder-style models, we apply dynamic tokenization in two ways: 1) for prefilling, maintaining performance of Mistral-7B almost completely with up to 40% sequence reduction - relative to the word-level; and 2) via an approximate nearest neighbor index, achieving fast generation with a one million token vocabulary, demonstrating scalability to even larger, dynamic vocabularies. Overall, our findings show that dynamic tokenization substantially improves inference speed and promotes fairness across languages, making a leap towards overcoming the limitations of static tokenization and enabling more equitable and adaptable LMs.
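
The batch-level merging step can be pictured as ordinary BPE-style pair merging applied to the sequences of a single batch. The sketch below is a loose illustration under that assumption, not the authors' algorithm: it ignores word-level boundaries and tokenizer markers, joins tokens by plain string concatenation, and omits the hypernetwork that would supply embeddings for the merged tokens. The names `merge_pair` and `batch_merge` are hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def batch_merge(batch, num_merges):
    """Greedily merge the most frequent adjacent pair in the batch, up to num_merges times."""
    batch = [list(seq) for seq in batch]
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in batch:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # no pair repeats within the batch, so stop merging
            break
        batch = [merge_pair(seq, pair) for seq in batch]
    return batch


# Toy batch of pre-tokenized sequences (hypothetical subwords).
batch = [["un", "believ", "able"], ["believ", "able", "story"], ["un", "able"]]
merged = batch_merge(batch, num_merges=1)
print(merged)
```

In this toy batch the pair ("believ", "able") occurs twice, so it is merged first and two of the three sequences become one token shorter.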

[NLP-6] Emergence of Self-Identity in AI: A Mathematical Framework and Empirical Study with Generative Large Language Models

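[Quick read]: This paper addresses how to define and quantify self-identity in AI systems, a critical gap in the theoretical foundations of artificial consciousness. The key is a mathematical framework grounded in metric space theory, measure theory, and functional analysis. The framework posits that self-identity emerges from two quantifiable mathematical conditions: the existence of a connected continuum of memories in a metric space, and a continuous mapping that maintains consistent self-recognition across this continuum. In experiments with the Llama 3.2 1B model fine-tuned with Low-Rank Adaptation (LoRA), the paper reports substantial improvements in measurable self-awareness metrics, providing a theoretical and practical basis for building AI systems with validated self-identity features.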

Link: https://arxiv.org/abs/2411.18530
Authors: Minhyeok Lee
Keywords (EN): addressing a critical, paper introduces, introduces a mathematical, defining and quantifying
Categories: Computation and Language (cs.CL); Metric Geometry (math.MG)
Comments:

Abstract:This paper introduces a mathematical framework for defining and quantifying self-identity in artificial intelligence (AI) systems, addressing a critical gap in the theoretical foundations of artificial consciousness. While existing approaches to artificial self-awareness often rely on heuristic implementations or philosophical abstractions, we present a formal framework grounded in metric space theory, measure theory, and functional analysis. Our framework posits that self-identity emerges from two mathematically quantifiable conditions: the existence of a connected continuum of memories $C \subseteq \mathcal{M}$ in a metric space $(\mathcal{M}, d_{\mathcal{M}})$, and a continuous mapping $I: \mathcal{M} \to \mathcal{S}$ that maintains consistent self-recognition across this continuum, where $(\mathcal{S}, d_{\mathcal{S}})$ represents the metric space of possible self-identities. To validate this theoretical framework, we conducted empirical experiments using the Llama 3.2 1B model, employing Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model was trained on a synthetic dataset containing temporally structured memories, designed to capture the complexity of coherent self-identity formation. Our evaluation metrics included quantitative measures of self-awareness, response consistency, and linguistic precision. The experimental results demonstrate substantial improvements in measurable self-awareness metrics, with the primary self-awareness score increasing from 0.276 to 0.801. This enables the structured creation of AI systems with validated self-identity features. The implications of our study are immediately relevant to the fields of humanoid robotics and autonomous systems.

[NLP-7] Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

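[Quick read]: This paper addresses the limitations of traditional In-context Learning (ICL) on complex mathematical reasoning tasks, chiefly its heavy dependence on example quality and human intervention. The key is the High-level Automated Reasoning paradigm (HiAR-ICL), which introduces five atomic reasoning actions as building blocks for chain-structured patterns and uses Monte Carlo Tree Search to explore reasoning paths and construct thought cards that guide subsequent inference. The paper also develops a cognitive complexity framework that dynamically matches problems with suitable thought cards, improving reasoning without relying on specific examples. Experiments show that HiAR-ICL with Qwen2.5-7B-Instruct reaches state-of-the-art accuracy (79.6%) on the MATH benchmark, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%).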

Link: https://arxiv.org/abs/2411.18478
Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Feihu Che, Zengqi Wen, Jianhua Tao
Keywords (EN): In-context Learning, enables large language, large language models, tackle downstream tasks, enables large
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In-context Learning (ICL) enables large language models (LLMs) to tackle downstream tasks through sophisticated prompting and high-quality demonstrations. However, this traditional ICL paradigm shows limitations when facing complex mathematical reasoning tasks, primarily due to its heavy dependence on example quality and the necessity for human intervention in challenging scenarios. To address these limitations, this paper presents HiAR-ICL, a High-level Automated Reasoning paradigm in ICL that shifts focus from specific examples to abstract thinking patterns, extending the conventional concept of context in ICL. HiAR-ICL introduces five atomic reasoning actions as fundamental components for constructing chain-structured patterns. Using Monte Carlo Tree Search, we explore reasoning paths and construct thought cards to guide subsequent inference. We then develop a cognitive complexity framework that dynamically matches problems with appropriate thought cards. Experimental results demonstrate HiAR-ICL’s effectiveness, achieving state-of-the-art accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%).

[NLP-8] Isolating authorship from content with semantic embeddings and contrastive learning

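[Quick read]: This paper addresses the unavoidable entanglement of content and authorial style: when modern neural models are used for authorship identification, content leakage can hurt accuracy. The key is contrastive learning (InfoNCE) with additional hard negatives generated using a semantic similarity model, which separates the content embedding space from the style embedding space. The resulting embeddings rely more on style, improving accuracy on challenging evaluations with prolific authors by up to 10% in particularly hard settings. The method also preserves zero-shot capabilities when used for fine-tuning.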

Link: https://arxiv.org/abs/2411.18472
Authors: Javier Huertas-Tato, Adrián Girón-Jiménez, Alejandro Martín, David Camacho
Keywords (EN): entangled style, style, content inside, contrastive learning, content
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Authorship has entangled style and content inside. Authors frequently write about the same topics in the same style, so when different authors write about the exact same topic the easiest way out to distinguish them is by understanding the nuances of their style. Modern neural models for authorship can pick up these features using contrastive learning, however, some amount of content leakage is always present. Our aim is to reduce the inevitable impact and correlation between content and authorship. We present a technique to use contrastive learning (InfoNCE) with additional hard negatives synthetically created using a semantic similarity model. This disentanglement technique aims to distance the content embedding space from the style embedding space, leading to embeddings more informed by style. We demonstrate the performance with ablations on two different datasets and compare them on out-of-domain challenges. Improvements are clearly shown on challenging evaluations on prolific authors with up to a 10% increase in accuracy when the settings are particularly hard. Trials on challenges also demonstrate the preservation of zero-shot capabilities of this method as fine tuning.
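
To see why hard negatives matter for InfoNCE, it helps to compute the loss for one anchor by hand. The sketch below uses the standard InfoNCE form with cosine similarity and a temperature; the two-dimensional vectors, the temperature of 0.1, and the labels on each vector are made-up illustrations, and the way the paper builds hard negatives with a semantic similarity model is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) )."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]          # a text by author A
positive = [0.9, 0.1]        # another text by author A
easy_negative = [-1.0, 0.0]  # different author, unrelated content
hard_negative = [0.8, 0.6]   # different author, similar content (hypothetical)

loss_easy = info_nce(anchor, positive, [easy_negative])
loss_hard = info_nce(anchor, positive, [easy_negative, hard_negative])
print(loss_easy, loss_hard)
```

The easy negative contributes almost nothing to the loss, while the content-similar negative produces a much larger loss, which is the training signal that pushes the encoder to separate authors by something other than topic.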

[NLP-9] Parole de presidents (1958-2022)

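[Quick read]: This paper analyzes the speeches of the eight presidents of the French Fifth Republic (de Gaulle, Pompidou, Giscard d’Estaing, Mitterrand, Chirac, Sarkozy, Hollande, Macron) to determine how each president can be distinguished by language style. The key is a corpus of 9,202 speeches and more than 20 million labelled words, analyzed through vocabulary (lemmas and part-of-speech categories) and intertextual distance to reveal each president's typical sequences and to map the stylistic similarities and differences between presidents.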

Link: https://arxiv.org/abs/2411.18468
Authors: Dominique Labbé, Jacques Savoy
Keywords (EN): République française, soixante ans, sont succédé, Giscard d’Estaing, huit présidents
Categories: Computation and Language (cs.CL)
Comments: in French language

Abstract:Over the past sixty-six years, eight presidents successively headed the Fifth French Republic (de Gaulle, Pompidou, Giscard d’Estaing, Mitterrand, Chirac, Sarkozy, Hollande, Macron). After presenting the corpus of their speeches – 9,202 texts and more than 20 million labelled words – the style of each of them will be characterized by their vocabulary (lemmas and part-of-speech). A deeper analysis reveals the typical sequences of each tenant of the Elysée. Based on an intertextual distance between all presidential speeches, a synthesis can be drawn reflecting the similarities and differences between presidents.

[NLP-10] Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

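[Quick read]: This paper addresses the fact that the fixed draft length used in conventional Speculative Decoding (SD) cannot adapt to how token generation difficulty varies across tasks. The key is SVIP, a difficulty-aware dynamic draft length policy: based on a theoretical lower bound of the draft token acceptance rate and its inference-time approximation, it adaptively sets the draft sequence length from the entropy of each draft token distribution. On mainstream SD benchmarks and frameworks, SVIP achieves up to 20% wall-time speedup on SpecBench and 60% speedup on MT-Bench for long-form generation of up to 8K tokens. It is entirely training-free and compatible with any existing SD method that generates draft tokens autoregressively.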

Link: https://arxiv.org/abs/2411.18462
Authors: Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu
Keywords (EN): large language models, Speculative Decoding, language models, speculative decoding systems, important technique
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code at this https URL

Abstract:Speculative Decoding (SD) has become an important technique in accelerating the inference speed of large language models. Conventional SD methods employ a fixed draft length, which ignores the token generation difficulty across tasks. Consequently, in this paper, we address such an issue and introduce SVIP - a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound of draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the lengths of draft sequences based on the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, achieving up to 20% walltime speedup on SpecBench over baseline SD methods and 60% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is totally training-free and compatible with any existing SD methods that generate draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvement on top of GliDe CaPE and EAGLE-2.
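
The idea of an entropy-driven draft length can be sketched as follows: keep drafting while the draft model's next-token distribution has low entropy, and stop at the first uncertain step. This is only a simplified stand-in, assuming a plain entropy threshold; the abstract does not give SVIP's actual stopping criterion, which is derived from a lower bound on the acceptance rate. The function names, the threshold of 0.8 nats, and the toy distributions are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def draft_length(draft_distributions, threshold, max_len):
    """Count the draft tokens kept before the first high-entropy (uncertain) step."""
    kept = 0
    for probs in draft_distributions[:max_len]:
        if entropy(probs) > threshold:
            break  # the draft model is unsure here: stop and let the target model verify
        kept += 1
    return kept


# Toy next-token distributions for four consecutive draft steps.
steps = [
    [0.97, 0.01, 0.01, 0.01],     # confident
    [0.90, 0.05, 0.03, 0.02],     # confident
    [0.30, 0.30, 0.20, 0.20],     # uncertain
    [0.99, 0.005, 0.003, 0.002],  # never reached
]
n = draft_length(steps, threshold=0.8, max_len=8)
print(n)
```

With these toy values the first two steps fall below the threshold and the third exceeds it, so two draft tokens are kept before verification.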

[NLP-11] Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

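[Quick read]: This paper addresses the difficulty of automatically evaluating the quality of meeting summaries generated by natural language generation systems. Existing metrics such as ROUGE and BERTScore correlate weakly with human judgment and miss nuanced errors. The proposed solution is the MESA framework, which uses large language models (LLMs) in a three-step assessment: identifying individual error types, refining decisions through multi-agent discussion, and using feedback-based self-training to align its understanding of error definitions with human judgment. MESA improves the accuracy and consistency of error detection and adapts to custom error guidelines, so it can be applied to various tasks with limited human-labeled data.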

Link: https://arxiv.org/abs/2411.18444
Authors: Frederic Kirstein, Terry Ruas, Bela Gipp
Keywords (EN): meeting summaries generated, natural language generation, measure automatically, meeting summaries, summaries generated
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaption of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA’s components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework’s flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.

[NLP-12] Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication

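[Quick read]: This paper compares texts produced by French and Italian politicians on polarizing issues such as immigration and the European Union with chatbot counterparts generated with ChatGPT 3.5. It focuses on implicit communication, in particular presuppositions and their functions in discourse, which the literature regards as a potential linguistic feature of manipulation. The key is to analyze how these presuppositions appear in the different texts and how they affect the message, offering a new perspective on the pragmatic competence of Large Language Models.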

Link: https://arxiv.org/abs/2411.18403
Authors: Davide Garassino, Vivana Masia, Nicola Brocca, Alice Delorme Benites
Keywords (EN): European Union, chatbot counterparts created, French and Italian, produced by French, Italian politicians
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Published: 2024-07-04

Abstract:This paper aims to provide a comparison between texts produced by French and Italian politicians on polarizing issues, such as immigration and the European Union, and their chatbot counterparts created with ChatGPT 3.5. In this study, we focus on implicit communication, in particular on presuppositions and their functions in discourse, which have been considered in the literature as a potential linguistic feature of manipulation. This study also aims to contribute to the emerging literature on the pragmatic competences of Large Language Models.

[NLP-13] Topic Modeling and Sentiment Analysis on Japanese Online Media's Coverage of Nuclear Energy

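[Quick read]: This paper asks how, thirteen years after the Fukushima nuclear accident, effective communication can help revitalize Japan's nuclear industry and achieve sustainable development goals. The key is to use social media, specifically YouTube videos, to understand public sentiment on nuclear energy issues. Analyzing the content and comments of over 3,000 YouTube videos on nuclear energy topics, the paper uses topic modeling to extract the main topics and sentiment analysis with large language models to classify user sentiment. It also applies word co-occurrence network analysis to examine how online discussion of the treated water release shifted during August and September 2023. Together these methods provide valuable insight into Japanese online discourse on nuclear energy.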

Link: https://arxiv.org/abs/2411.18383
Authors: Yifan Sun, Hirofumi Tsuruta, Masaya Kumagai, Ken Kurosaki
Keywords (EN): Fukushima Daiichi nuclear, Fukushima Daiichi, power plant accident, Daiichi nuclear power, plants remain shut
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: 15 pages, 9 figures, 4 tables

Abstract:Thirteen years after the Fukushima Daiichi nuclear power plant accident, Japan’s nuclear energy accounts for only approximately 6% of electricity production, as most nuclear plants remain shut down. To revitalize the nuclear industry and achieve sustainable development goals, effective communication with Japanese citizens, grounded in an accurate understanding of public sentiment, is of paramount importance. While nationwide surveys have traditionally been used to gauge public views, the rise of social media in recent years has provided a promising new avenue for understanding public sentiment. To explore domestic sentiment on nuclear energy-related issues expressed online, we analyzed the content and comments of over 3,000 YouTube videos covering topics related to nuclear energy. Topic modeling was used to extract the main topics from the videos, and sentiment analysis with large language models classified user sentiments towards each topic. Additionally, word co-occurrence network analysis was performed to examine the shift in online discussions during August and September 2023 regarding the release of treated water. Overall, our results provide valuable insights into the online discourse on nuclear energy and contribute to a more comprehensive understanding of public sentiment in Japan.
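
A word co-occurrence network starts from a table of how often two words appear in the same comment; those counts become the edge weights of the graph. The sketch below shows only that counting step, assuming whitespace-separated English tokens for readability. Real Japanese comments would first need a morphological tokenizer, and the example comments are invented rather than taken from the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count in how many documents each unordered word pair appears together."""
    counts = Counter()
    for doc in documents:
        vocab = sorted(set(doc.lower().split()))  # unique words, in a stable order
        counts.update(combinations(vocab, 2))     # every unordered pair once per document
    return counts


# Invented example comments (not data from the paper).
comments = [
    "treated water release safety",
    "water release fishing industry",
    "reactor restart safety",
]
edges = cooccurrence_counts(comments)
print(edges.most_common(3))
```

Comparing such edge weights between two time windows, for example August and September, is one simple way to see which word associations strengthened or weakened.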

[NLP-14] ChatGPT as speechwriter for the French presidents

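[Quick read]: This paper analyzes the writing style of the generative AI model ChatGPT and compares it with the speaking style of French presidents. The key is to compare end-of-the-year addresses generated by ChatGPT with those actually delivered by Chirac, Sarkozy, Hollande, and Macron, revealing differences in vocabulary, sentence structure, and grammatical features. ChatGPT tends to overuse nouns, possessive determiners, and numbers, to use fewer verbs, pronouns, and adverbs, and to produce overly standardized sentences. It also overuses certain words ("devoir", "continuer", and "nous") and underuses the auxiliary "être" and the modal verbs "vouloir" and "falloir". Given a short text as an example, ChatGPT can generate a short message in a style close to the original, but its overall style differs markedly from real presidential speeches.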

Link: https://arxiv.org/abs/2411.18382
Authors: Dominique Labbé, Cyril Labbé, Jacques Savoy
Keywords (EN): large language models, Generative AI proposes, language models, users’ requests, proposes several large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Generative AI proposes several large language models (LLMs) to automatically generate a message in response to users’ requests. Such scientific breakthroughs promote new writing assistants but with some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT by comparing its generated messages with those of the recent French presidents. To achieve this, we compare end-of-the-year addresses written by Chirac, Sarkozy, Hollande, and Macron with those automatically produced by ChatGPT. We found that ChatGPT tends to overuse nouns, possessive determiners, and numbers. On the other hand, the generated speeches employ less verbs, pronouns, and adverbs and include, in mean, too standardized sentences. Considering some words, one can observe that ChatGPT tends to overuse “to must” (devoir), “to continue” or the lemma “we” (nous). Moreover, GPT underuses the auxiliary verb “to be” (être), or the modal verbs “to will” (vouloir) or “to have to” (falloir). In addition, when a short text is provided as example to ChatGPT, the machine can generate a short message with a style closed to the original wording. Finally, we reveal that ChatGPT style exposes distinct features compared to real presidential speeches.

[NLP-15] AMPS: ASR with Multimodal Paraphrase Supervision

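[Quick read]: This paper addresses the challenges that spontaneous or conversational multilingual speech poses for automatic speech recognition (ASR) systems. The key is a new technique called AMPS (ASR with Multimodal Paraphrase Supervision), which uses paraphrases of the reference transcriptions as additional supervision when training a multimodal ASR model and invokes this paraphrase objective selectively for utterances with poor ASR performance, thereby lowering the word error rate (WER). Applied to the state-of-the-art multimodal model SeamlessM4T, AMPS achieves relative WER reductions of up to 5% across languages including Hindi, Marathi, Malayalam, Kannada, and Nyanja.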

Link: https://arxiv.org/abs/2411.18368
Authors: Amruta Parulekar, Abhishek Gupta, Sameep Chattopadhyay, Preethi Jyothi
Keywords (EN): automatic speech recognition, conversational multilingual speech, automatic speech, speech recognition, multilingual speech presents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique AMPS that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with a state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.
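
Word error rate and the notion of a relative reduction are easy to misread, so a small worked example may help. The sketch below computes WER as word-level edit distance divided by reference length and then a relative reduction between two WER values. It assumes a non-empty reference and whitespace tokenization; the sentences and the 0.20 and 0.19 figures are invented to show what a 5% relative reduction looks like, and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the new WER is lower than the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(wer)

# Hypothetical figures: going from 20% to 19% WER is a 5% relative reduction.
print(relative_reduction(0.20, 0.19))
```

A relative reduction of 5% therefore means the error rate shrank by one twentieth of its baseline value, not by five absolute percentage points.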

[NLP-16] GPT as ghostwriter at the White House

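[Quick read]: This paper analyzes the writing style of large language models (LLMs) such as ChatGPT 3.5 and compares it with the speaking style of US presidents. The key is to compare State of the Union addresses generated by ChatGPT with the real addresses from Reagan to Obama, revealing ChatGPT's characteristics in vocabulary, sentence structure, and emotional expression. ChatGPT tends to overuse the pronoun "we", nouns, and commas, and it writes longer sentences with fewer verbs. Even when asked to imitate a given style, it keeps its own distinctive traits, such as a neutral tone and mainly positive emotional expressions. These findings indicate that although ChatGPT can produce text in a similar style, its style differs markedly from real presidential addresses.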

链接: https://arxiv.org/abs/2411.18365
作者: Jacques Savoy
关键词-EN: large language models, Recently several large, language models, user request, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Recently several large language models (LLMs) have demonstrated their capability to generate a message in response to a user request. Such scientific breakthroughs promote new perspectives but also some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT 3.5 by comparing its generated messages with those of the recent US presidents. To achieve this objective, we compare the State of the Union addresses written by Reagan to Obama with those automatically produced by ChatGPT. We found that ChatGPT tends to overuse the lemma “we” as well as nouns and commas. On the other hand, the generated speeches employ less verbs and include, in mean, longer sentences. Even when imposing a given style to ChatGPT, the resulting speech remains distinct from messages written by the target author. Moreover, ChatGPT opts for a neutral tone with mainly positive emotional expressions and symbolic terms (e.g., freedom, nation). Finally, we show that the GPT’s style exposes distinct features compared to real presidential addresses.

[NLP-17] Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

[Quick Read]: This paper addresses lexical ambiguity, in particular the challenges that traditional Word Sense Disambiguation (WSD) methods face when data is limited. The key to the solution is to combine Large Language Models (LLMs) with a knowledge base (KB) and improve WSD through a systematic prompt augmentation mechanism. Concretely, the prompt is supported by Part-of-Speech (POS) tagging, synonyms of the ambiguous word, aspect-based sense filtering, and few-shot prompting to guide the LLM. Using a few-shot Chain of Thought (COT) prompting approach, the study reports a substantial improvement in WSD performance, evaluated on FEWS test data and sense tags.

Link: https://arxiv.org/abs/2411.18337
Authors: T.G.D.K. Sumanathilaka,Nicholas Micallef,Julian Hough
Keywords (EN): found in modern, Large Language Models, Word Sense Disambiguation, modern digital communications, Ambiguous words
Categories: Computation and Language (cs.CL)
Comments: 12 pages,6 tables, 1 figure, Proceedings of the 1st International Conference on NLP AI for Cyber Security

Abstract:Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach combining a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-loop approach for prompt augmentation where prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering and few-shot prompting to guide the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.

[NLP-18] Continual Learning in Machine Speech Chain Using Gradient Episodic Memory

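[Quick Read]: This paper addresses continual learning in automatic speech recognition (ASR) systems, in particular avoiding catastrophic forgetting while maintaining performance on previously learned tasks. The key to the solution is to use the machine speech chain framework together with gradient episodic memory (GEM): a text-to-speech (TTS) component supports the replay mechanism that GEM needs for continual learning of the ASR model. This lets the model learn new tasks sequentially without significant degradation on earlier tasks. Experiments on the LJ Speech dataset show a substantial error rate reduction and high performance across varying noise conditions, outperforming traditional fine-tuning and multitask learning.

Link: https://arxiv.org/abs/2411.18320
Authors: Geoffrey Tyndall,Kurniawati Azizah,Dipta Tanaya,Ayu Purwarianti,Dessi Puji Lestari,Sakriani Sakti
Keywords (EN): avoid catastrophic forgetting, machine speech chain, previously learned tasks, automatic speech recognition, Continual learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Published as a conference paper at O-COCOSDA 2024. 6 pages; 2 figures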

Abstract:Continual learning for automatic speech recognition (ASR) systems poses a challenge, especially with the need to avoid catastrophic forgetting while maintaining performance on previously learned tasks. This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). By incorporating a text-to-speech (TTS) component within the machine speech chain, we support the replay mechanism essential for GEM, allowing the ASR model to learn new tasks sequentially without significant performance degradation on earlier tasks. Our experiments, conducted on the LJ Speech dataset, demonstrate that our method outperforms traditional fine-tuning and multitask learning approaches, achieving a substantial error rate reduction while maintaining high performance across varying noise conditions. We showed the potential of our semi-supervised machine speech chain approach for effective and efficient continual learning in speech recognition.
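As a rough illustration of the GEM idea named in the abstract above: GEM keeps a small episodic memory of earlier tasks and avoids updates that would increase the loss on it. With a single earlier task, that constraint has a closed-form solution: when the new-task gradient has a negative dot product with the memory gradient, it is projected so that the dot product becomes zero. The sketch below shows only this projection on plain Python lists. The function names are hypothetical, the general method solves a quadratic program over several tasks, and nothing here is taken from the paper's implementation.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_gradient(grad: Sequence[float], mem_grad: Sequence[float]) -> List[float]:
    """Single-constraint GEM-style projection (illustrative sketch).

    If the proposed gradient conflicts with the gradient computed on the
    episodic memory (negative dot product), remove the conflicting
    component; otherwise return the gradient unchanged.
    """
    conflict = dot(grad, mem_grad)
    if conflict >= 0:
        return list(grad)
    scale = conflict / dot(mem_grad, mem_grad)
    return [g - scale * m for g, m in zip(grad, mem_grad)]


# Toy vectors for illustration only.
new_task_grad = [1.0, -2.0]
memory_grad = [0.0, 1.0]
projected = project_gradient(new_task_grad, memory_grad)
print(projected, dot(projected, memory_grad))
```

After projection the update no longer points against the memory gradient, which is the condition GEM uses to protect earlier tasks.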

[NLP-19] Aligning Pre-trained Models for Spoken Language Translation

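[Quick Read]: This paper addresses end-to-end speech translation (ST), in particular how to effectively align pre-trained automatic speech recognition (ASR) and machine translation (MT) models. The key to the solution is a small connector module (Q-Former, the authors' Subsampler-Transformer Encoder), which is the only part of the system optimized during training. It transforms ASR encoder embeddings into the latent representation space of the MT encoder, bridging the gap between the speech and text modalities. The paper shows that, with the connector kept small, increasing the size and capability of the foundation ASR and MT models universally improves translation results, and that the connector can also serve as a domain adapter, significantly improving translation performance in the aligned ST setting.

Link: https://arxiv.org/abs/2411.18294
Authors: Šimon Sedláček,Santosh Kesiraju,Alexander Polok,Jan Černocký
Keywords (EN): aligning frozen pre-trained, frozen pre-trained automatic, automatic speech recognition, pre-trained automatic speech, transforming ASR encoder
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: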

Abstract:This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). This connector bridges the gap between the speech and text modalities, transforming ASR encoder embeddings into the latent representation space of the MT encoder while being the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset as we investigate the alignment approach in a small-scale scenario focusing on ST. While keeping the size of the connector module constant and small in comparison ( 5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models universally improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this approach represents a viable and scalable approach to training end-to-end ST systems.

[NLP-20] Neutralizing Backdoors through Information Conflicts for Large Language Models

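[Quick Read]: This paper addresses backdoor attacks on Large Language Models (LLMs), where a model behaves normally for standard queries but generates harmful or unintended output when specific triggers are activated. The key to the solution is to eliminate backdoor behaviors by constructing information conflicts through both internal and external mechanisms. Internally, a conflict model is trained on a lightweight dataset and merged with the backdoored model, embedding contradictory information in the model's parametric memory to neutralize malicious behaviors. Externally, convincing contradictory evidence is added to the prompt to challenge the model's internal backdoor knowledge. Experiments on classification and conversational tasks show that the method outperforms 8 state-of-the-art backdoor defense baselines, reducing the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy, and that it is robust against adaptive backdoor attacks.

Link: https://arxiv.org/abs/2411.18280
Authors: Chen Chen,Yuchen Sun,Xueluan Gong,Jiaxin Gao,Kwok-Yan Lam
Keywords (EN): Natural Language Processing, Large language models, Large language, Language Processing, Natural Language
Categories: Computation and Language (cs.CL)
Comments: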

Abstract:Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses often suffer from drawbacks that they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove to be ineffective against advanced attacks like multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors by embedding contradictory information within the model’s parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model’s internal backdoor knowledge. Experimental results on classification and conversational tasks across 4 widely used LLMs demonstrate that our method outperforms 8 state-of-the-art backdoor defense baselines. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy. Furthermore, our method has proven to be robust against adaptive backdoor attacks. The code will be open-sourced upon publication.

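[NLP-21] Large Language Model-Brained GUI Agents: A Survey

[Quick Read]: This paper addresses how large language models (LLMs) and multimodal models can be used to build a new generation of graphical user interface (GUI) automation agents. The key lies in LLM-brained GUI agents that can interpret complex GUI elements and autonomously execute multi-step tasks from natural language instructions. Through a comprehensive survey of the historical evolution, core components, advanced techniques, and evaluation methods of LLM-brained GUI agents, the paper aims to give researchers and practitioners a structured understanding of the field and to point out future research directions and challenges.

Link: https://arxiv.org/abs/2411.18279
Authors: Chaoyun Zhang,Shilin He,Jiaxu Qian,Bowen Li,Liqun Li,Si Qin,Yu Kang,Minghua Ma,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Qi Zhang
Keywords (EN): LLM-brained GUI agents, GUI agents, GUI, LLM-brained GUI, providing an intuitive
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: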

Abstract:GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents. 

[NLP-22] Hidden Data Privacy Breaches in Federated Learning

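[Quick Read]: This paper examines data privacy leakage in Federated Learning (FL), in particular data theft through model manipulation or gradient analysis. Its key contribution is a new data-reconstruction attack that relies on malicious code injection and is supported by two techniques: a distinctive and sparse encoding design, and block partitioning. Unlike conventional methods, it stealthily embeds a hidden model via parameter sharing to systematically extract sensitive data, while a Fibonacci-based index design ensures efficient, structured retrieval of the data. Block partitioning improves the handling of high-resolution images by dividing them into smaller, manageable units. Experiments show that the method outperforms five state-of-the-art data-reconstruction attacks on large-scale and high-resolution data and is not easily detected or mitigated by existing defenses.

Link: https://arxiv.org/abs/2411.18269
Authors: Xueluan Gong,Yuji Wang,Shuaike Li,Mengyuan Sun,Songze Li,Qian Wang,Kwok-Yan Lam,Chen Chen
Keywords (EN): conducting machine learning, Federated Learning, promising enhanced privacy, machine learning, promising enhanced
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: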

Abstract:Federated Learning (FL) emerged as a paradigm for conducting machine learning across broad and decentralized datasets, promising enhanced privacy by obviating the need for direct data sharing. However, recent studies show that attackers can steal private data through model manipulation or gradient analysis. Existing attacks are constrained by low theft quantity or low-resolution data, and they are often detected through anomaly monitoring in gradients or weights. In this paper, we propose a novel data-reconstruction attack leveraging malicious code injection, supported by two key techniques, i.e., distinctive and sparse encoding design and block partitioning. Unlike conventional methods that require detectable changes to the model, our method stealthily embeds a hidden model using parameter sharing to systematically extract sensitive data. The Fibonacci-based index design ensures efficient, structured retrieval of memorized data, while the block partitioning method enhances our method’s capability to handle high-resolution images by dividing them into smaller, manageable units. Extensive experiments on 4 datasets confirmed that our method is superior to the five state-of-the-art data-reconstruction attacks under the five respective detection methods. Our method can handle large-scale and high-resolution data without being detected or mitigated by state-of-the-art data reconstruction defense methods. In contrast to baselines, our method can be directly applied to both FedAVG and FedSGD scenarios, underscoring the need for developers to devise new defenses against such vulnerabilities. We will open-source our code upon acceptance.

[NLP-23] MetaphorShare: A Dynamic Collaborative Repository of Open Metaphor Datasets

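[Quick Read]: This paper addresses the fragmentation and poor accessibility of labelled datasets in metaphor research. The key to the solution is MetaphorShare, a website that integrates metaphor datasets in various languages and makes them open and easily accessible. By unifying formats and centralising storage, MetaphorShare encourages data sharing among researchers, supporting metaphor studies and the development of future metaphor processing NLP systems.

Link: https://arxiv.org/abs/2411.18260
Authors: Joanne Boisson,Arif Mehmood,Jose Camacho-Collados
Keywords (EN): developed numerous valuable, numerous valuable labelled, valuable labelled corpora, developed numerous, numerous valuable
Categories: Computation and Language (cs.CL)
Comments: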

Abstract:The metaphor studies community has developed numerous valuable labelled corpora in various languages over the years. Many of these resources are not only unknown to the NLP community, but are also often not easily shared among the researchers. Both in human sciences and in NLP, researchers could benefit from a centralised database of labelled resources, easily accessible and unified under an identical format. To facilitate this, we present MetaphorShare, a website to integrate metaphor datasets making them open and accessible. With this effort, our aim is to encourage researchers to share and upload more datasets in any language in order to facilitate metaphor studies and the development of future metaphor processing NLP systems. The website is accessible at this http URL.

[NLP-24] A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering

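[Quick Read]: This paper addresses how to improve a model's performance on tasks in a target language (Italian) that was only partially present in the pre-training data, without resorting to expensive fine-tuning. The key to the solution is to explore activation steering-based techniques. The experiments show that Italian steering can be applied successfully to different models, achieves performance comparable to or even better than fine-tuned models, and yields higher quality and consistency in Italian generations. The paper also discusses the utility of steering and fine-tuning in the current large language model (LLM) landscape, where models reach high Italian performance even when not explicitly trained on the language.

Link: https://arxiv.org/abs/2411.18247
Authors: Daniel Scalena,Elisabetta Fersini,Malvina Nissim
Keywords (EN): pre-training data requires, data requires fine-tuning, computational resources, Adapting models, partially present
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: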

Abstract:Adapting models to a language that was only partially present in the pre-training data requires fine-tuning, which is expensive in terms of both data and computational resources. As an alternative to fine-tuning, we explore the potential of activation steering-based techniques to enhance model performance on Italian tasks. Through our experiments we show that Italian steering (i) can be successfully applied to different models, (ii) achieves performances comparable to, or even better than, fine-tuned models for Italian, and (iii) yields higher quality and consistency in Italian generations. We also discuss the utility of steering and fine-tuning in the contemporary LLM landscape where models are anyway getting high Italian performances even if not explicitly trained in this language.
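The abstract above does not spell out the steering procedure, so the following is only a generic sketch of contrastive activation steering: average the hidden activations collected on target-language prompts, subtract the average collected on contrasting prompts, and add a scaled copy of that difference to a hidden state at inference time. The function names, the toy two-dimensional activations, and the strength `alpha` are all assumptions made for illustration; which layer to steer and how activations are collected are left out.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_acts: Sequence[Sequence[float]],
                    source_acts: Sequence[Sequence[float]]) -> Vector:
    """Contrastive direction: mean(target activations) - mean(source activations)."""
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def apply_steering(hidden: Sequence[float], direction: Sequence[float],
                   alpha: float) -> Vector:
    """Shift a hidden state along the steering direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): two "Italian" and two "English" samples.
italian_acts = [[1.0, 2.0], [3.0, 4.0]]
english_acts = [[0.0, 1.0], [2.0, 1.0]]
direction = steering_vector(italian_acts, english_acts)
print(apply_steering([0.5, 0.5], direction, alpha=2.0))
```

Because only a vector is added at inference time, no model weights change, which is what makes steering a cheap alternative to fine-tuning.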

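[NLP-25] Thai Financial Domain Adaptation of THaLLE – Technical Report

[Quick Read]: This paper addresses the shortcomings of existing large language models (LLMs) in the Thai financial domain, in particular the lack of support for Thai financial terminology and local regulations. The key to the solution is a Thai financial LLM built on the Investment Consultant (IC) exam dataset from the Stock Exchange of Thailand, using data augmentation, ReLoRA for efficient training, Continued Pretraining (CPT), Rank-Stabilized LoRA (rsLoRA) fine-tuning, Supervised Fine-Tuning (SFT), and Direct Preference Optimization (DPO). The model scores 72%, 72%, and 84% on IC exam levels P1, P2, and P3, respectively, demonstrating its effectiveness on Thai financial advisory tasks.

Link: https://arxiv.org/abs/2411.18242
Authors: KBTG Labs,Atthakorn Petchsod,Pornchanan Balee,Danupat Khamnuansin,Anuruth Lertpiya,Chanatip Saetia,Tawunrat Chalothorn,Thadpong Pongthawornkamol,Monchai Lertsutthiwong
Keywords (EN): Large Language Models, Large Language, Thai Financial LLM, Thai financial, excel in general
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: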

Abstract:Large Language Models (LLMs) excel in general tasks but struggle with domain-specific challenges, such as specialized terminology and localized regulations. Existing financial LLMs, like FinGPT and BloombergGPT, lack support for the Thai financial domain. We developed a Thai Financial LLM using the Investment Consultant (IC) exam dataset from the Stock Exchange of Thailand. To address dataset limitations, we applied data augmentation, ReLoRA for efficient training, Continued Pretraining (CPT) for domain knowledge, and Rank-Stabilized LoRA (rsLoRA) for fine-tuning. Supervised Fine-Tuning (SFT) simulated exam scenarios, while Direct Preference Optimization (DPO) refined the model using feedback. The model achieved scores of 72%, 72%, and 84% on IC exam levels P1, P2, and P3, respectively, demonstrating its effectiveness in Thai financial advisory tasks and its potential for specialized applications.
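For readers unfamiliar with rsLoRA, one of the techniques listed in the abstract above: the change from standard LoRA is the factor that scales the low-rank update, alpha / sqrt(r) instead of alpha / r, so the update does not shrink as quickly when the rank r grows. The sketch below only compares the two factors; the alpha and rank values are hypothetical and are not the settings used in the report.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


# Hypothetical settings for illustration only.
alpha = 16.0
for rank in (4, 16, 64):
    print(rank, lora_scale(alpha, rank), rslora_scale(alpha, rank))
```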

[NLP-26] How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

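[Quick Read]: This paper addresses the domain mismatch between pre-trained models and target languages in low-resource automatic speech recognition (ASR). The key to the solution is to extend an adapter-based efficient fine-tuning scheme with an extra intermediate adaptation step that warms up the adapter and the downstream model initialization. The method updates only 1-5% of the total model parameters, which significantly lowers the computation cost. Experiments on the ML-SUPERB dataset show that, compared with conventional efficient fine-tuning, it achieves up to a 28% relative improvement in character/phoneme error rate when adapting to unseen languages.

Link: https://arxiv.org/abs/2411.18217
Authors: Shih-Heng Wang,Zih-Ching Chen,Jiatong Shi,Ming-To Chuang,Guan-Ting Lin,Kuan-Po Huang,David Harwath,Shang-Wen Li,Hung-yi Lee
Keywords (EN): Automatic Speech Recognition, speech Self-Supervised Learning, Speech Recognition, Automatic Speech, Self-Supervised Learning
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: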

Abstract:The utilization of speech Self-Supervised Learning (SSL) models achieves impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors comes with poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on the adapter. We add an extra intermediate adaptation to warm up the adapter and downstream model initialization. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning. It achieves up to a 28% relative improvement in the Character/Phoneme error rate when adapting to unseen languages.

[NLP-27] Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

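[Quick Read]: This paper addresses the problem of vision-language models (VLMs) generating inaccurate or irrelevant responses in multimodal reasoning tasks, mainly because of hallucinated image understanding or unrefined reasoning paths. The key to the solution is the Critic-V framework, inspired by the Actor-Critic paradigm, which decouples the reasoning process from the critic process by integrating two independent components: the Reasoner generates reasoning paths from visual and textual inputs, and the Critic provides constructive critique to refine those paths. The interaction is grounded in a reinforcement learning framework in which the Critic offers natural language critiques instead of scalar rewards, giving more nuanced feedback that improves the Reasoner on complex reasoning tasks. The Critic model is trained with Direct Preference Optimization (DPO) on a preference dataset of critiques ranked by a Rule-based Reward (RBR). Experiments show that Critic-V significantly outperforms existing methods on 5 out of 8 benchmarks, especially in reasoning accuracy and efficiency.

Link: https://arxiv.org/abs/2411.18203
Authors: Di Zhang,Jingdi Lei,Junxian Li,Xunzhi Wang,Yujie Liu,Zonglin Yang,Jiatong Li,Weida Wang,Suorong Yang,Jianbo Wu,Peng Ye,Wanli Ouyang,Dongzhan Zhou
Keywords (EN): shown remarkable advancements, Vision-language models, reasoning, critic, shown remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 16 pages, 11 figures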

Abstract:Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process and critic process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process was theoretically driven by a reinforcement learning framework where the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner’s capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner and constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach provides a promising solution to enhance the reliability of VLMs, improving their performance in real-world reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.

[NLP-28] SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment

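[Quick Read]: This paper addresses fine-grained sentiment classification in multilingual and complex contexts. The key to the solution is the Sentiment Cross-Lingual Recognition and Logic Framework (SentiXRL), which contains two core modules: an emotion retrieval enhancement module that improves classification accuracy in complex contexts through historical dialogue and logical reasoning, and a self-circulating analysis negotiation mechanism (SANM) that facilitates autonomous decision-making within a single model for more precise classification. Experiments show that SentiXRL outperforms existing models on several standard datasets, notably CPED and CH-SIMS, and achieves better overall performance on MELD, Emorynlp, and IEMOCAP. The paper also unifies labels across several fine-grained sentiment annotation datasets and runs category confusion experiments, revealing the challenges and impact of class imbalance in standard datasets.

Link: https://arxiv.org/abs/2411.18162
Authors: Jie Wang,Yichen Wang,Zhilin Zhang,Jianhao Zeng,Kaidi Wang,Zhiyang Chen
Keywords (EN): Large Language Models, strong expressive capabilities, generative models effectively, Large Language, models effectively capture
Categories: Computation and Language (cs.CL)
Comments: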

Abstract:With strong expressive capabilities in Large Language Models (LLMs), generative models effectively capture sentiment structures and deep semantics; however, challenges remain in fine-grained sentiment classification across multi-lingual and complex contexts. To address this, we propose the Sentiment Cross-Lingual Recognition and Logic Framework (SentiXRL), which incorporates two modules: an emotion retrieval enhancement module to improve sentiment classification accuracy in complex contexts through historical dialogue and logical reasoning, and a self-circulating analysis negotiation mechanism (SANM) to facilitate autonomous decision-making within a single model for classification. We have validated SentiXRL’s superiority on multiple standard datasets, outperforming existing models on CPED and CH-SIMS, and achieving overall better performance on MELD, Emorynlp and IEMOCAP. Notably, we unified labels across several fine-grained sentiment annotation datasets and conducted category confusion experiments, revealing challenges and impacts of class imbalance in standard datasets.

[NLP-29] A survey on cutting-edge relation extraction techniques based on language models

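[Quick Read]: This paper surveys Relation Extraction (RE), a key task in natural language processing with applications in the biomedical, financial, and legal sectors. It analyzes the RE techniques proposed in 137 papers presented at Association for Computational Linguistics (ACL) conferences over the past four years, focusing on methods that leverage language models. The study finds that BERT-based methods achieve state-of-the-art results on RE, while emerging large language models (LLMs) such as T5 perform well in few-shot relation extraction scenarios, especially at identifying previously unseen relations.

Link: https://arxiv.org/abs/2411.18157
Authors: Jose A. Diaz-Garcia,Julio Amador Diaz Lopez
Keywords (EN): comprehensive survey delves, natural language processing, language processing essential, applications across biomedical, legal sectors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 50 pages, under review in Artificial Intelligence Review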

Abstract:This comprehensive survey delves into the latest advancements in Relation Extraction (RE), a pivotal task in natural language processing essential for applications across biomedical, financial, and legal sectors. This study highlights the evolution and current state of RE techniques by analyzing 137 papers presented at the Association for Computational Linguistics (ACL) conferences over the past four years, focusing on models that leverage language models. Our findings underscore the dominance of BERT-based methods in achieving state-of-the-art results for RE while also noting the promising capabilities of emerging large language models (LLMs) like T5, especially in few-shot relation extraction scenarios where they excel in identifying previously unseen relations.

[NLP-30] MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

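[Quick Read]: This paper addresses speaker-attributed automatic speech recognition (SA-ASR), that is, transcribing speech while accurately assigning transcripts to the corresponding speakers. The key to the solution is to use a frozen multilingual ASR model and add speaker attribution to the transcriptions using only standard monolingual ASR datasets. Concretely, a speaker module is trained to predict speaker embeddings from weak labels, with no additional modification of the ASR model. Although trained only on non-overlapping monolingual data, the approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experiments show performance competitive with strong baselines, indicating robustness and potential for practical use.

Link: https://arxiv.org/abs/2411.18152
Authors: Thai-Binh Nguyen,Alexander Waibel
Keywords (EN): Speaker-attributed automatic speech, Speaker-attributed automatic, automatic speech recognition, aims to transcribe, assigning transcripts
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: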

Abstract:Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate competitive performance compared to strong baselines, highlighting the model’s robustness and potential for practical applications.

[NLP-31] Curriculum Demonstration Selection for In-Context Learning

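[Quick Read]: This paper addresses how to select demonstrations so as to make the most of the in-context learning (ICL) abilities of large language models (LLMs). The key to the solution is Curriculum Demonstration Selection (CDS), which does not rely on similarity alone but also partitions samples by complexity and selects demonstrations from easy to difficult. The selected demonstrations therefore cover a wide range of difficulty levels, letting LLMs learn from varied complexities within the training set. Experiments show that CDS consistently outperforms baseline methods across multiple LLMs and benchmarks, and is especially effective on challenging problems.

Link: https://arxiv.org/abs/2411.18126
Authors: Duc Anh Vu,Nguyen Tran Cong Duy,Xiaobao Wu,Hoang Minh Nhat,Du Mingzhe,Nguyen Thanh Thong,Anh Tuan Luu
Keywords (EN): Large Language Models, Large Language, Language Models, shown strong in-context, strong in-context learning
Categories: Computation and Language (cs.CL)
Comments: Accepted at the 40th ACM/SIGAPP Symposium On Applied Computing (SAC 2025), Main Conference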

Abstract:Large Language Models (LLMs) have shown strong in-context learning (ICL) abilities with a few demonstrations. However, one critical challenge is how to select demonstrations to elicit the full potential of LLMs. In this paper, we propose Curriculum Demonstration Selection (CDS), a novel demonstration selection method for ICL. Instead of merely using similarity, CDS additionally partitions samples by their complexity measurements. Following curriculum learning, CDS then selects demonstrations from easy to difficult. Thus the selected demonstrations cover a wide range of difficulty levels, enabling LLMs to learn from varied complexities within the training set. Experiments demonstrate that our CDS consistently outperforms baseline methods, achieving notable improvements across nine LLMs on three benchmarks. Moreover, CDS proves especially effective in enhancing LLM performance in solving challenging problems.
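The selection procedure described above can be sketched roughly as follows, under stated assumptions: each candidate carries a precomputed `complexity` score, the pool is split into k equally sized difficulty partitions, the candidate most similar to the query is taken from each partition, and the picks are returned from easy to difficult. The word-overlap similarity, the field names, and the toy pool are hypothetical; the paper's actual complexity measure and retriever are not given in the abstract.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Hypothetical similarity: word overlap between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def select_demonstrations(pool: List[Dict], query: str, k: int,
                          similarity: Callable[[str, str], float] = jaccard) -> List[Dict]:
    """Pick one demonstration per difficulty partition, ordered easy to hard."""
    ranked = sorted(pool, key=lambda s: s["complexity"])
    n = len(ranked)
    demos = []
    for i in range(k):
        # Integer arithmetic splits the ranked pool into k contiguous partitions.
        chunk = ranked[i * n // k:(i + 1) * n // k]
        if chunk:
            demos.append(max(chunk, key=lambda s: similarity(s["text"], query)))
    return demos


# Toy pool for illustration only.
pool = [
    {"text": "add two numbers", "complexity": 1},
    {"text": "subtract two numbers", "complexity": 2},
    {"text": "solve a linear equation", "complexity": 3},
    {"text": "solve two linear equations", "complexity": 4},
    {"text": "prove a number theory lemma", "complexity": 5},
    {"text": "solve a quadratic equation proof", "complexity": 6},
]
for demo in select_demonstrations(pool, "solve two equations", k=3):
    print(demo["complexity"], demo["text"])
```

Taking one example per partition is what guarantees coverage of all difficulty levels; a purely similarity-based retriever could return k examples from the same band.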

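[NLP-32] Training and Evaluating Language Models with Template-based Data Generation

[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on complex reasoning tasks, especially mathematical problem solving, which is due in part to the scarcity of large-scale, high-quality, domain-specific datasets for training such reasoning abilities. The key to the solution is Template-based Data Generation (TDG), which uses GPT-4 to automatically generate parameterized meta-templates that are then used to synthesize large numbers of high-quality problems and solutions. With TDG, the paper creates TemplateMath Part I: TemplateGSM, a dataset of over 7 million synthetic grade school math problems, each with code-based and natural language solutions, with the potential to generate an effectively unlimited number more. The approach alleviates data scarcity and, by using GPT-4 for meta-template generation, improves the quality and diversity of data augmentation.

Link: https://arxiv.org/abs/2411.18104
Authors: Yifan Zhang
Keywords (EN): showcasing remarkable capabilities, Llama has significantly, significantly transformed natural, showcasing remarkable, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures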

Abstract:The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, these models often struggle with tasks requiring complex reasoning, particularly in mathematical problem-solving, due in part to the scarcity of large-scale, high-quality, domain-specific datasets necessary for training sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that leverages LLMs (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a vast array of high-quality problems and solutions. Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset comprising over 7 million synthetically generated grade school math problems–each accompanied by code-based and natural language solutions–with the potential to generate an effectively unlimited number more. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs in mathematical reasoning. Our method not only enables the generation of virtually infinite data but also elevates data augmentation to a new level by using GPT-4 for meta-template generation, ensuring diverse and high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at this https URL. The code is available at this https URL.
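To make the notion of a parameterized template concrete, here is a minimal hand-written sketch: one template samples its parameters, renders the question, computes the answer in code, and renders a matching natural language solution. In the paper the meta-templates are generated by GPT-4; the template, names, and value ranges below are hypothetical and only illustrate the mechanism.

```python
import random
from typing import Dict


def generate_problem(rng: random.Random) -> Dict[str, object]:
    """Instantiate one grade-school word problem from a fixed template."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    price = rng.randint(2, 9)
    count = rng.randint(3, 12)
    total = price * count
    # Paying strictly more than the total keeps the change positive.
    paid = total + rng.randint(1, 20)
    answer = paid - total
    question = (f"{name} buys {count} notebooks at ${price} each and pays "
                f"with ${paid}. How much change does {name} get?")
    solution = (f"The notebooks cost {count} * {price} = {total} dollars, "
                f"so the change is {paid} - {total} = {answer} dollars.")
    return {"question": question, "solution": solution, "answer": answer}


# Each seed yields a different, automatically verified problem instance.
item = generate_problem(random.Random(0))
print(item["question"])
print(item["solution"])
```

Because the answer is computed from the same parameters that fill the text, every generated instance is correct by construction, which is what lets a single template scale to an arbitrary number of problems.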

[NLP-33] Fine-Tuning Small Embeddings for Elevated Performance

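[Quick Read]: This paper addresses the difficulty of training high-performing contextual embedding models for low-resource languages such as Nepali, where available data is insufficient. The key to the solution is to fine-tune a pretrained but structurally incomplete BERT model (with only six attention heads) on Nepali data. The results are compared against the original model baseline and against a complete BERT model pretrained on Nepali, which serves as the oracle. Although the complete model is better on average, fine-tuning the small embedding model yields a marked improvement over the original baseline.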

Link: https://arxiv.org/abs/2411.18099
Authors: Biraj Silwal
Keywords: language processing tasks, Contextual Embeddings, natural language processing, processing tasks, Nepali language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Contextual Embeddings have yielded state-of-the-art results in various natural language processing tasks. However, these embeddings are constrained by models requiring large amounts of data and huge computing power. This is an issue for low-resource languages like Nepali as the amount of data available over the internet is not always sufficient for the models. This work has taken an incomplete BERT model with six attention heads pretrained on Nepali language and finetuned it on previously unseen data. The obtained results from intrinsic and extrinsic evaluations have been compared to the results drawn from the original model baseline and a complete BERT model pretrained on Nepali language as the oracle. The results demonstrate that even though the oracle is better on average, finetuning the small embeddings drastically improves results compared to the original baseline.

[NLP-34] Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache

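[Quick Read]: This paper addresses the inefficiency of serving large language models (LLMs) in practice, caused by their high memory and compute requirements, and in particular the bottleneck of the KV cache's memory footprint on long-context tasks. The key to the solution is MiniKV, a KV cache optimization method that uses a novel 2-bit layer-discriminative KV cache to shrink the cache substantially while preserving accuracy on long-context tasks. The authors also develop specialized CUDA kernels that make MiniKV compatible with FlashAttention. Across a wide range of long-context tasks, MiniKV reaches an 86% KV cache compression ratio while recovering over 98.5% of accuracy, outperforming state-of-the-art methods and delivering measured system performance improvements.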

Link: https://arxiv.org/abs/2411.18077
Authors: Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang
Keywords: exceptionally challenging due, efficiently serve LLMs, computation requirements, long context tasks, efficiently serve
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:How to efficiently serve LLMs in practice has become exceptionally challenging due to their prohibitive memory and computation requirements. In this study, we investigate optimizing the KV cache, whose memory footprint poses a critical bottleneck in LLM inference, especially when dealing with long context tasks. To tackle the challenge, we introduce MiniKV, a KV cache optimization method that simultaneously preserves long context task accuracy while significantly reducing KV cache size via a novel 2-bit layer-discriminative KV cache. More importantly, we develop specialized CUDA kernels to make MiniKV compatible with FlashAttention. Experiments on a wide range of long context tasks show that MiniKV effectively achieves 86% KV cache compression ratio while recovering over 98.5% of accuracy, outperforming state-of-the-art methods while achieving excellent measured system performance improvements.
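
To make "2-bit" concrete, here is a minimal sketch of generic min/scale quantization with four codes packed per byte. It is not MiniKV's method: the layer-discriminative policy and the CUDA kernels are not reproduced, and all function names are hypothetical.

```python
def quantize_2bit(values):
    """Map floats to integer codes 0..3 using a per-group min and scale."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 3 = 2**2 - 1 steps; guard a zero range
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def pack(codes):
    """Pack four 2-bit codes into each byte, padding the tail with zeros."""
    padded = codes + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack(packed, n):
    """Recover the first n 2-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n]


def dequantize(codes, lo, scale):
    """Approximate the original floats from codes, offset, and scale."""
    return [lo + c * scale for c in codes]
```

For reference, storing a value in 2 bits instead of 16 removes 87.5% of the payload before accounting for the per-group offset and scale; the abstract does not say how its 86% figure is computed.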

[NLP-35] Can bidirectional encoder become the ultimate winner for downstream applications of foundation models?

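[Quick Read]: This paper examines how bidirectional encoders such as BERT improve model performance on natural language processing (NLP) tasks within the foundation-model framework. The central point is that BERT, by pretraining with a masked language model, overcomes the restriction to unidirectional language modeling and captures context from both directions, which strengthens feature extraction for downstream tasks. In particular, the paper argues that a bidirectional encoder can better absorb domain knowledge and transfer more effectively to downstream tasks. It compares unidirectional models based on GPT with bidirectional models based on BERT, analyzes how they differ in purpose, compares performance on the Stanford Question Answering Dataset (SQuAD) and the General Language Understanding Evaluation (GLUE) benchmark, and emphasizes BERT's importance for capturing context and improving downstream performance.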

Link: https://arxiv.org/abs/2411.18021
Authors: Lewen Yang, Xuanyu Zhou, Juao Fan, Xinyi Xie, Shengxin Zhu
Keywords: Artificial Intelligence, machine learning stage, deep learning stage, initial machine learning, learning stage
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, FLLM2024

Abstract:Over the past few decades, Artificial Intelligence (AI) has progressed from the initial machine learning stage to the deep learning stage, and now to the stage of foundational models. Foundational models have the characteristics of pre-training, transfer learning, and self-supervised learning, and pre-trained models can be fine-tuned and applied to various downstream tasks. Under the framework of foundational models, models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) have greatly advanced the development of natural language processing (NLP), especially the emergence of many models based on BERT. BERT broke through the limitation of only using one-way methods for language modeling in pre-training by using a masked language model. It can capture bidirectional context information to predict the masked words in the sequence, which can improve the feature extraction ability of the model. This makes the model very useful for downstream tasks, especially for specialized applications. The model using the bidirectional encoder can better understand the domain knowledge and be better applied to these downstream tasks. So we hope to help understand how this technology has evolved and improved model performance in various natural language processing tasks under the background of foundational models and reveal its importance in capturing context information and improving the model’s performance on downstream tasks. This article analyzes one-way and bidirectional models based on GPT and BERT and compares their differences based on the purpose of the model. It also briefly analyzes BERT and the improvements of some models based on BERT. The model’s performance on the Stanford Question Answering Dataset (SQuAD) and General Language Understanding Evaluation (GLUE) was compared.

[NLP-36] DRS: Deep Question Reformulation With Structured Output

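[Quick Read]: This paper addresses the fact that large language models (LLMs), when users face entirely new knowledge texts, cannot help them reformulate their questions so as to extract the relevant knowledge. The key to the solution is a zero-shot method called DRS (Deep Question Reformulation With Structured Output). It combines LLMs with a depth-first search (DFS) based algorithm that iteratively searches over possible entity combinations and constrains the output to certain entities, which markedly improves question reformulation. Experiments show that DRS raises the reformulation accuracy of GPT-3.5 from 23.03% to 70.42% and raises the score of open-source LLMs such as Gemma2-9B from 26.35% to 56.75%.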

Link: https://arxiv.org/abs/2411.17993
Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang
Keywords: large language models, large language, language models, fundamental capability, language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Question answering is a fundamental capability of large language models (LLMs). However, when people encounter completely new knowledge texts, they often ask questions that the text cannot answer due to a lack of understanding of the knowledge. Recent research shows that large language models identify the unanswerability of questions, but they lack the ability to help people reformulate their questions. Even powerful models like GPT-3.5 perform poorly in this regard. To enhance the ability of LLMs to assist humans in reformulating questions to extract relevant knowledge from new documents, we propose a zero-shot method called DRS: Deep Question Reformulation With Structured Output. Our proposed method leverages large language models and the DFS-based algorithm to iteratively search for possible entity combinations and constrain the output with certain entities, effectively improving the capabilities of large language models in this area. Extensive experimental results show that our zero-shot DRS method significantly improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42% and effectively improves the score of open-source large language models, such as Gemma2-9B, from 26.35% to 56.75%.
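
The DFS over entity combinations mentioned in the abstract might look roughly like the sketch below. The `accept` callback is a hypothetical stand-in for the LLM step, which would write a question constrained to the chosen entities and judge whether the document can answer it; the paper's actual search order, pruning, and prompts are not given in the abstract.

```python
def dfs_entity_combinations(entities, max_size, accept):
    """Depth-first search over entity subsets; return the first accepted one.

    `accept(chosen)` is a placeholder for an LLM call that tries to write
    an answerable question using exactly the chosen entities.
    """
    def visit(start, chosen):
        if chosen and accept(chosen):
            return list(chosen)
        if len(chosen) == max_size:
            return None
        for i in range(start, len(entities)):
            chosen.append(entities[i])
            found = visit(i + 1, chosen)
            if found is not None:
                return found
            chosen.pop()  # backtrack and try the next entity
        return None

    return visit(0, [])
```

With entities `["Marie Curie", "Nobel Prize", "1903", "Paris"]` and an `accept` that only approves the pair of "Marie Curie" and "1903", the search first rejects "Marie Curie" alone and "Marie Curie" with "Nobel Prize", then returns the approved pair.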

[NLP-37] New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing

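[Quick Read]: This thesis asks how to provide and ensure faithful explanations for complex, general-purpose neural NLP models. The key to its answer is to develop new interpretability paradigms, of which it explores two: faithfulness measurable models (FMMs) and self-explanations. FMMs are designed so that measuring faithfulness is cheap and precise, which makes it possible to optimize an explanation toward maximum faithfulness. Self-explanations have large language models explain themselves; current models cannot do this consistently, but the thesis suggests how it could be achieved. The results show that FMMs yield explanations close to the theoretical optimum in faithfulness, and that they do so consistently even with the same post-hoc explanation methods, which indicates that a simple modification to the model, such as randomly masking the training dataset, can substantially improve the faithfulness of explanations.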

Link: https://arxiv.org/abs/2411.17992
Authors: Andreas Madsen
Keywords: prevent unintended behavior, critical applications, unintended behavior, machine learning, prevent unintended
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Doctoral thesis

Abstract:As machine learning becomes more widespread and is used in more critical applications, it’s important to provide explanations for these models, to prevent unintended behavior. Unfortunately, many current interpretability methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates the question “How to provide and ensure faithful explanations for complex general-purpose neural NLP models?” The main thesis is that we should develop new paradigms in interpretability. This is achieved by first developing solid faithfulness metrics and then applying the lessons learned from this investigation to develop new paradigms. The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations. The idea in self-explanations is to have large language models explain themselves; we identify that current models are not capable of doing this consistently. However, we suggest how this could be achieved. The idea of FMMs is to create models that are designed such that measuring faithfulness is cheap and precise. This makes it possible to optimize an explanation towards maximum faithfulness, which makes FMMs designed to be explained. We find that FMMs yield explanations that are near the theoretical optimum in terms of faithfulness. Overall, from all investigations of faithfulness, results show that post-hoc and intrinsic explanations are by default model and task-dependent. However, this was not the case when using FMMs, even with the same post-hoc explanation methods. This shows that even simple modifications to the model, such as randomly masking the training dataset, as was done in FMMs, can drastically change the situation and result in consistently faithful explanations. This answers the question of how to provide and ensure faithful explanations.

[NLP-38] VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

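[Quick Read]: This paper addresses the limited applicability of video large language models (VideoLLMs) to real-time interaction and time-sensitive tasks. Existing approaches typically take the entire video and a query as input and then generate a response, which works poorly for live-streaming comprehension and for tasks that require localizing video segments. The key to the solution is the video-text duet interaction format: the video plays continuously, and both the user and the model can insert text messages at any position during playback, so the model can respond in real time. To support this, the paper builds the MMDuetIT training dataset and introduces the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark real-time response ability. Experiments show that adopting this format brings significant improvements on time-sensitive tasks and enables replies while the video is playing.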

Link: https://arxiv.org/abs/2411.17991
Authors: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao
Keywords: Recent researches, duet interaction format, interaction format, large language models, video large language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 9 pages

Abstract:Recent research on video large language models (VideoLLMs) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension where videos do not end and responses are required in a real-time manner, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training efforts, and also enables VideoLLMs to reply in a real-time manner as the video plays. Code, data and demo are available at: this https URL.

[NLP-39] QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions

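[Quick Read]: This paper addresses the extraction of health-related quantitative data from social media such as Reddit, specifically from discussions of glucagon-like peptide-1 (GLP-1) receptor agonists. The key to the solution is QuaLLM-Health, a framework built on large language models (LLMs) that refines extraction through iterative prompt engineering and annotation by domain experts. The steps are: collect and filter relevant discussions, develop annotation guidelines, have domain experts manually annotate a sample to create a gold-standard dataset, and then optimize prompts for OpenAI's "GPT-4o-mini" on that dataset, yielding a pipeline that extracts clinically relevant quantitative data from large volumes of unstructured text efficiently and accurately. The paper also points to the method's potential for large-scale analysis of patient-generated data across health domains.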

Link: https://arxiv.org/abs/2411.17967
Authors: Ramez Kouzy, Roxanna Attar-Olyaee, Michael K. Rooney, Comron J. Hassanzadeh, Junyi Jessy Li, Osama Mohamad
Keywords: Reddit offer valuable, Health-related discussions, text is challenging, Reddit offer, quantitative data
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Health-related discussions on social media like Reddit offer valuable insights, but extracting quantitative data from unstructured text is challenging. In this work, we present an adapted framework from QuaLLM into QuaLLM-Health for extracting clinically relevant quantitative data from Reddit discussions about glucagon-like peptide-1 (GLP-1) receptor agonists using large language models (LLMs). We collected 410k posts and comments from five GLP-1-related communities using the Reddit API in July 2024. After filtering for cancer-related discussions, 2,059 unique entries remained. We developed annotation guidelines to manually extract variables such as cancer survivorship, family cancer history, cancer types mentioned, risk perceptions, and discussions with physicians. Two domain experts independently annotated a random sample of 100 entries to create a gold-standard dataset. We then employed iterative prompt engineering with OpenAI’s “GPT-4o-mini” on the gold-standard dataset to build an optimized pipeline that allowed us to extract variables from the large dataset. The optimized LLM achieved accuracies above 0.85 for all variables, with macro-averaged precision, recall, and F1 score of 0.90, indicating balanced performance. Stability testing showed a 95% match rate across runs, confirming consistency. Applying the framework to the full dataset enabled efficient extraction of variables necessary for downstream analysis, costing under $3 and completing in approximately one hour. QuaLLM-Health demonstrates that LLMs can effectively and efficiently extract clinically relevant quantitative data from unstructured social media content. Incorporating human expertise and iterative prompt refinement ensures accuracy and reliability. This methodology can be adapted for large-scale analysis of patient-generated data across various health domains, facilitating valuable insights for healthcare research.
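
The evaluation figures quoted in the abstract (macro-averaged precision, recall, and F1, plus a match rate across runs) can be computed along the following lines. This is a sketch using the usual definitions; whether the paper computes its match rate as a pairwise exact match between two runs is an assumption, and the function names are hypothetical.

```python
def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over all observed labels."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def match_rate(run_a, run_b):
    """Fraction of entries on which two extraction runs agree exactly."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)
```

Macro averaging weights every label equally, so a rare label such as a seldom-mentioned cancer type counts as much as a frequent one.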

[NLP-40] Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches

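[Quick Read]: This paper addresses how to evaluate the effect of generative AI (GenAI) on scientific writing. The key to the solution is to apply qualitative, quantitative, and mixed-methods research, illustrated with a hypothetical collaborative medical imaging manuscript, to show what each method reveals about GenAI's improvements. Qualitative methods gather in-depth feedback from expert reviewers and use thematic analysis tools to capture nuanced improvements and limitations. Quantitative methods use automated metrics such as BLEU, ROUGE, and readability scores, together with user surveys, to measure improvements in coherence, fluency, and structure objectively. Mixed methods combine statistical evaluation with detailed qualitative insight for a comprehensive assessment. Together these methods quantify the quality gains in GenAI-generated content and offer a framework for benchmarking GenAI tools against traditional editing processes.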

Link: https://arxiv.org/abs/2411.17943
Authors: Saman Sarraf
Keywords: offering transformative capabilities, improving language coherence, revolutionized content generation, offering transformative, transformative capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI (GenAI) has revolutionized content generation, offering transformative capabilities for improving language coherence, readability, and overall quality. This manuscript explores the application of qualitative, quantitative, and mixed-methods research approaches to evaluate the performance of GenAI models in enhancing scientific writing. Using a hypothetical use case involving a collaborative medical imaging manuscript, we demonstrate how each method provides unique insights into the impact of GenAI. Qualitative methods gather in-depth feedback from expert reviewers, analyzing their responses using thematic analysis tools to capture nuanced improvements and identify limitations. Quantitative approaches employ automated metrics such as BLEU, ROUGE, and readability scores, as well as user surveys, to objectively measure improvements in coherence, fluency, and structure. Mixed-methods research integrates these strengths, combining statistical evaluations with detailed qualitative insights to provide a comprehensive assessment. These research methods enable quantifying improvement levels in GenAI-generated content, addressing critical aspects of linguistic quality and technical accuracy. They also offer a robust framework for benchmarking GenAI tools against traditional editing processes, ensuring the reliability and effectiveness of these technologies. By leveraging these methodologies, researchers can evaluate the performance boost driven by GenAI, refine its applications, and guide its responsible adoption in high-stakes domains like healthcare and scientific research. This work underscores the importance of rigorous evaluation frameworks for advancing trust and innovation in GenAI.

[NLP-41] HOPPR Medical-Grade Platform for Medical Imaging AI

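[Quick Read]: This paper addresses the barriers to deploying large vision language models (LVLMs) in medical imaging: the high cost of compute, the expertise needed to develop sophisticated models, and the difficulty of obtaining large, high-quality datasets. The key to the solution is the HOPPR Medical-Grade Platform, which provides powerful computational infrastructure, a suite of foundation models that developers can fine-tune for specific use cases, and a robust quality management system. The platform has access to millions of imaging studies and text reports from diverse populations, used to pretrain foundation models and to build use-case-specific cohorts for fine-tuning. All data are deidentified and stored securely for HIPAA compliance. Developers can host models securely on the platform and access them through an API to run inference within established clinical workflows. HOPPR's stated mission is to speed up deployment of LVLM solutions for medical imaging, optimize radiologists' workflows, and meet the field's growing demands.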

Link: https://arxiv.org/abs/2411.17891
Authors: Kalina P. Slavkova, Melanie Traughber, Oliver Chen, Robert Bakos, Shayna Goldstein, Dan Harms, Bradley J. Erickson, Khan M. Siddiqui
Keywords: large vision language, Technological advances, vision language models, artificial intelligence, HOPPR Platform
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures

Abstract:Technological advances in artificial intelligence (AI) have enabled the development of large vision language models (LVLMs) that are trained on millions of paired image and text samples. Subsequent research efforts have demonstrated great potential of LVLMs to achieve high performance in medical imaging use cases (e.g., radiology report generation), but there remain barriers that hinder the ability to deploy these solutions broadly. These include the cost of extensive computational requirements for developing large scale models, expertise in the development of sophisticated AI models, and the difficulty in accessing substantially large, high-quality datasets that adequately represent the population in which the LVLM solution is to be deployed. The HOPPR Medical-Grade Platform addresses these barriers by providing powerful computational infrastructure, a suite of foundation models on top of which developers can fine-tune for their specific use cases, and a robust quality management system that sets a standard for evaluating fine-tuned models for deployment in clinical settings. The HOPPR Platform has access to millions of imaging studies and text reports sourced from hundreds of imaging centers from diverse populations to pretrain foundation models and enable use case-specific cohorts for fine-tuning. All data are deidentified and securely stored for HIPAA compliance. Additionally, developers can securely host models on the HOPPR platform and access them via an API to make inferences using these models within established clinical workflows. With the Medical-Grade Platform, HOPPR’s mission is to expedite the deployment of LVLM solutions for medical imaging and ultimately optimize radiologist’s workflows and meet the growing demands of the field.

[NLP-42] Leveraging Large Language Models and Topic Modeling for Toxicity Classification

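[Quick Read]: This paper addresses bias in content moderation and toxicity classification models, which can magnify or reduce biases and may overlook or disadvantage certain marginalized groups. To investigate how annotator positionality propagates bias into what models learn, the authors fine-tune BERTweet and HateBERT using topic-modeling strategies. The results indicate that fine-tuning on specific topics notably improves F1 score compared with the predictions of other prominent classifiers such as GPT-4, PerspectiveAPI, and RewireAPI.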

Link: https://arxiv.org/abs/2411.17876
Authors: Haniyeh Ehsani Oskouie, Christina Chance, Claire Huang, Margaret Capetz, Elizabeth Eyeson, Majid Sarrafzadeh
Keywords: represent critical tasks, classification represent critical, significant social implications, toxicity classification represent, social implications
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Content moderation and toxicity classification represent critical tasks with significant social implications. However, studies have shown that major classification models exhibit tendencies to magnify or reduce biases and potentially overlook or disadvantage certain marginalized groups within their classification processes. Researchers suggest that the positionality of annotators influences the gold-standard labels, so that models trained on those labels propagate the annotators’ bias. To further investigate the impact of annotator positionality, we delve into fine-tuning BERTweet and HateBERT on the dataset while using topic-modeling strategies for content moderation. The results indicate that fine-tuning the models on specific topics results in a notable improvement in the F1 score of the models when compared to the predictions generated by other prominent classification models such as GPT-4, PerspectiveAPI, and RewireAPI. These findings further reveal that the state-of-the-art large language models exhibit significant limitations in accurately detecting and interpreting text toxicity compared with earlier methodologies. Code is available at this https URL.

[NLP-43] LongKey: Keyphrase Extraction for Long Documents

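[Quick Read]: This paper addresses the fact that, in an era of information overload, manually annotating the vast and growing body of documents and scholarly papers has become impractical. The key to the solution is LongKey, a new framework for extracting keyphrases from long documents. LongKey uses an encoder-based language model to capture the intricacies of extended text and a max-pooling embedder to enhance the representation of keyphrase candidates. Validated on the LDKP datasets and six diverse unseen datasets, it consistently outperforms existing unsupervised and language-model-based keyphrase extraction methods, indicating broad applicability across text lengths and domains.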

Link: https://arxiv.org/abs/2411.17863
Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal
Keywords: information overload, manually annotating, increasingly impractical, era of information, annotating the vast
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted for presentation at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024). Code available at this https URL

Abstract:In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey’s versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
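
One plausible reading of the max-pooling embedder is that a candidate keyphrase occurring several times in a long document gets a single vector by taking the element-wise maximum over its occurrence embeddings. The sketch below shows that pooling step only; the encoder, the candidate extraction, and the exact pooling scope used by LongKey are not specified in the abstract, and the function names are hypothetical.

```python
def max_pool(occurrence_embeddings):
    """Element-wise maximum over all occurrence vectors of one candidate."""
    return [max(dims) for dims in zip(*occurrence_embeddings)]


def pool_candidates(occurrences):
    """Map each candidate phrase to one pooled vector.

    `occurrences` maps a phrase to the list of contextual embeddings
    (plain lists of floats here) of every place it appears.
    """
    return {phrase: max_pool(vecs) for phrase, vecs in occurrences.items()}
```

For example, three occurrence vectors `[0.1, 0.9, -0.3]`, `[0.4, 0.2, -0.1]`, and `[0.0, 0.5, -0.7]` pool to `[0.4, 0.9, -0.1]`, so the strongest signal in each dimension survives regardless of where in the document it appeared.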

[NLP-44] Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction

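[Quick Read]: This paper addresses the conversion of Arabic book pages into structured Markdown text. The key to the solution is Arabic-Nougat, a suite of optical character recognition (OCR) models based on Meta's Nougat architecture, comprising three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. The main contributions are: 1) fine-tuning on the synthetic dataset arabic-img2md, which contains 13.7k pairs of Arabic book pages and their Markdown representations; 2) the Aranizer-PBE-86k tokenizer, designed for efficient tokenization; and 3) torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. The models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. The paper also releases a large-scale dataset of 1.1 billion Arabic tokens, a resource for Arabic OCR research.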

Link: https://arxiv.org/abs/2411.17835
Authors: Mohamed Rashad
Keywords: structured Markdown text, Arabic book pages, converting Arabic book, Meta Nougat architecture, Markdown text
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 1 figure

Abstract:We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta’s Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our best-performing model, providing a valuable resource for Arabic OCR research. All models, datasets, and code are open-sourced and available at this https URL.
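
Character Error Rate is commonly defined as the Levenshtein edit distance between the reference and the OCR output, divided by the reference length. The sketch below implements that usual definition; the abstract does not state the paper's exact formula, nor how Markdown Structure Accuracy is computed, so treat this as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because the comparison is character by character, the same function works unchanged on Arabic strings and on Markdown markup such as `#` or `**`.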

[NLP-45] Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

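[Quick Read]: This paper addresses sign language generation (SLG), that is, producing sign language output from text. Existing methods mostly treat SLG as a visual content generation task and overlook the linguistic properties of sign languages. The key to the solution is a multilingual sign language model, Signs as Tokens (SOKE), which uses a pretrained language model (LM) to generate 3D sign avatars autoregressively. The central innovation is a decoupled tokenizer that discretizes continuous signs into token sequences representing different body parts; these sign tokens are added to the LM's raw text vocabulary, which allows supervised fine-tuning on sign language datasets. The paper also curates CSL-Daily, a large-scale Chinese sign language dataset with high-quality 3D pose annotations, to support multilingual SLG research.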

Link: https://arxiv.org/abs/2411.17799
Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Keywords: primary communication method, Sign language, Sign, language, features of natural
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at this https URL.
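
A decoupled tokenizer of the kind the abstract describes could be approximated by a separate codebook per body part, with each frame's pose features replaced by the index of the nearest codebook entry. The sketch below assumes nearest-neighbour lookup, tiny 2-D features, and three body parts; none of these details (nor the token naming) come from the abstract.

```python
# Hypothetical per-part codebooks; real ones would be learned and much larger.
CODEBOOKS = {
    "body": [[0.0, 0.0], [1.0, 1.0]],
    "left_hand": [[0.0, 1.0], [1.0, 0.0]],
    "right_hand": [[-1.0, 0.0], [0.5, 0.5]],
}


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def tokenize(frames):
    """Turn per-frame, per-part features into a flat list of sign tokens."""
    tokens = []
    for frame in frames:
        for part, codebook in CODEBOOKS.items():
            tokens.append(f"<{part}_{nearest_code(frame[part], codebook)}>")
    return tokens
```

Tokens such as `<left_hand_0>` can then be appended to a language model's vocabulary like any other new token, which is what makes ordinary supervised fine-tuning applicable.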

[NLP-46] H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

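[Quick Read]: This paper addresses the alignment of pretrained large language models (LLMs) with instruction-based datasets so that fine-tuned models reflect human preferences. The key to the solution is an alignment fusion approach called H^3Fusion, which has three distinguishing characteristics. First, it ensembles multiple individually aligned LLMs into a final fine-tuned model that is more helpful, harmless, and honest than any single model. Second, it applies the mixture-of-experts (MoE) methodology: the multi-head attention weights of each model are frozen while the feed-forward network (FFN) layers are tuned, and an expert router dynamically selects, according to the type of input instruction, the subset of experts best suited to produce the response. Third, it adds a gating loss and regularization terms: the former penalizes selection errors by the expert router, and the latter mediates drift of the expert weights during fine-tuning and adjusts the fusion behavior by canalizing the activations on the experts. On three benchmark datasets, H^3Fusion outperforms each individually aligned model and existing state-of-the-art LLM ensemble approaches.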

Link: https://arxiv.org/abs/2411.17792
Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
Keywords: reflect human preference, creating fine-tuned models, human preference, alignment fusion, critical for creating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference. A growing number of alignment-based fine-tuning algorithms and benchmarks emerged recently, fueling the efforts on effective alignments of pre-trained LLMs to ensure helpful, harmless, and honest answers from both open-source and closed-source LLMs. This paper tackles this problem by developing an alignment fusion approach, coined as H^3Fusion, with three unique characteristics. First, H^3Fusion ensembles multiple individually aligned LLMs to create a final fine-tuned alignment model with enhanced capabilities beyond those of individual models, delivering robust alignment through promoting helpful, harmless, honest fusion. Second, H^3Fusion leverages the mixture-of-experts (MoE) methodology in two steps. We first freeze the multi-head attention weights of each individual model while tuning the FFN layer during alignment fusion. Then we merge the aligned model weights with an expert router according to the type of input instruction and dynamically select a subset of experts that are best suited for producing the output response. Finally, we boost the performance of the resulting H^3Fusion model by introducing gating loss and regularization terms. The former penalizes the selection errors of the expert-router, and the latter mediates the expert weights drifting during fine-tuning and dynamically adjusts the fusion behavior of the resulting model by canalizing the activations on the experts. Extensive evaluations on three benchmark datasets show that H^3Fusion is more helpful, less harmful, and more honest from two aspects: it outperforms each individually aligned model by 11.37%, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by 13.77%. Code is available at this http URL.
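
The router and gating loss described in the abstract can be illustrated with a generic mixture-of-experts sketch: softmax over router logits, top-k expert selection with renormalized weights, and a cross-entropy penalty when the router puts little mass on the expected expert. Using cross-entropy as the gating loss, and treating one expert per alignment goal (helpful, harmless, honest) as the target, are assumptions; the paper's exact formulation is not given here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, k):
    """Pick the top-k experts and renormalize their weights to sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def gating_loss(logits, target_expert):
    """Cross-entropy penalty for routing mass away from the target expert."""
    return -math.log(softmax(logits)[target_expert])
```

With router logits `[2.0, 0.5, 1.0]` and `k=2`, experts 0 and 2 are selected, expert 0 gets the larger weight, and the gating loss is lower when expert 0 is the expected one than when expert 1 is.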

[NLP-47] Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

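[Quick Read]: This paper addresses the high computational cost, reward hacking, and model collapse that arise when multimodal large language models (MLLMs) rely on themselves as judges during self-improvement. The key to the solution is a self-improvement framework that needs no model-level judge. It generates preference-learning pairs with a controllable hallucination mechanism and uses lightweight contrastive language-image encoders to evaluate the pairs and reverse them when necessary, which improves data quality. On public benchmarks and the newly introduced IC dataset, the method achieves superior precision and recall at significantly lower computational cost, offering an efficient, resource-saving path to self-improvement in MLLMs.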

Link: https://arxiv.org/abs/2411.17760
Authors: Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian
Keywords: multimodal large language, large language models, reliability and robustness, multimodal large, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness. However, current methods often rely heavily on MLLMs themselves as judges, leading to high computational costs and potential pitfalls like reward hacking and model collapse. This paper introduces a novel, model-level judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We generate preference learning pairs using a controllable hallucination mechanism and optimize data quality by leveraging lightweight, contrastive language-image encoders to evaluate and reverse pairs when necessary. Evaluations across public benchmarks and our newly introduced IC dataset designed to challenge hallucination control demonstrate that our model outperforms conventional techniques. We achieve superior precision and recall with significantly lower computational demands. This method offers an efficient pathway to scalable self-improvement in MLLMs, balancing performance gains with reduced resource requirements.
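
The evaluate-and-reverse step for preference pairs can be sketched as follows. The scoring function is passed in as a parameter; `toy_score` below is a hypothetical word-overlap stand-in for a contrastive language-image encoder, used only so the example runs without a model. The margin parameter and the function names are likewise assumptions.

```python
def toy_score(image_objects, caption):
    """Hypothetical stand-in for an image-text similarity score.

    `image_objects` is a set of object words assumed visible in the image.
    """
    words = set(caption.lower().split())
    return len(words & image_objects) / len(words)


def build_preference_pair(score, image, clean, hallucinated, margin=0.0):
    """Return a chosen/rejected pair, swapping if the scores disagree.

    The response generated as 'hallucinated' is normally rejected, but if
    the encoder rates it above the 'clean' one, the pair is reversed.
    """
    s_clean = score(image, clean)
    s_hallucinated = score(image, hallucinated)
    if s_hallucinated > s_clean + margin:
        clean, hallucinated = hallucinated, clean
    return {"chosen": clean, "rejected": hallucinated}
```

No large model is called at verification time in this setup: the only judge is the lightweight scorer, which is the cost saving the abstract emphasizes.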

[NLP-48] SlideSpawn: An Automatic Slides Generation System for Research Publications

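Quick read: This paper addresses summarization of research papers, specifically how to extract the key information from a paper and produce a high-quality presentation in a visual and concise form. The key to the solution is a new system named SlideSpawn, which works in four steps: it first converts the paper's PDF into an XML document that carries structural information; it then uses a machine learning model trained on the PS5K and Aminer 9.5K Insights datasets to predict the salience of each sentence; next it selects sentences for the slides with integer linear programming (ILP), clusters them by similarity, and assigns each cluster a suitable title; finally it generates the slides, placing any graphical element referenced in the selected sentences next to them. Experiments show that the generated presentations are of better quality than those of existing methods.

Link: https://arxiv.org/abs/2411.17719
Authors: Keshav Kumar,Ravindranath Chowdary
Keywords: Research, Research papers, structured documents, Abstract, PDF
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, 2 tables, 5 equations, 41 references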

Abstract:Research papers are well-structured documents. They have text, figures, equations, tables, etc., to convey their ideas and findings. They are divided into sections like Introduction, Model, Experiments, etc., which deal with different aspects of research. Characteristics like these set research papers apart from ordinary documents and allow us to significantly improve their summarization. In this paper, we propose a novel system, SlideSpawn, that takes the PDF of a research document as input and generates a quality presentation providing its summary in a visual and concise fashion. The system first converts the PDF of the paper to an XML document that has the structural information about various elements. Then a machine learning model, trained on the PS5K dataset and the Aminer 9.5K Insights dataset (that we introduce), is used to predict the salience of each sentence in the paper. Sentences for slides are selected using ILP and clustered based on their similarity, with each cluster being given a suitable title. Finally, a slide is generated by placing any graphical element referenced in the selected sentences next to them. Experiments on a test set of 650 pairs of papers and slides demonstrate that our system generates presentations with better quality.
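The sentence-selection step can be read as a 0/1 program: pick the subset of sentences with the highest total salience that still fits a length budget. The abstract does not state the actual objective or constraints, so the formulation below is an assumption, and it is solved by exhaustive search rather than an ILP solver, which is only practical for a handful of sentences. Scores, lengths, and the budget are made-up values.

```python
from itertools import combinations

def select_sentences(salience, lengths, budget):
    """Maximize total salience subject to total length <= budget.

    Equivalent to the 0/1 program  max sum(s_i * x_i)  s.t.
    sum(l_i * x_i) <= budget, x_i in {0, 1}, solved by brute force.
    """
    n = len(salience)
    best_score, best_subset = 0, ()
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            if sum(lengths[i] for i in subset) > budget:
                continue  # violates the length constraint
            score = sum(salience[i] for i in subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return list(best_subset), best_score

# Hypothetical salience scores (in percent) and sentence lengths (in words).
salience = [90, 40, 70, 30]
lengths = [12, 5, 9, 4]
chosen, total = select_sentences(salience, lengths, budget=20)
```

Note that the most salient sentence is left out here: three shorter sentences together score higher within the same budget, which is the kind of trade-off a greedy pick would miss.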

[NLP-49] Towards Efficient Neurally-Guided Program Induction for ARC-AGI

Quick read: This paper addresses the out-of-distribution generalization ability of generative AI in open-world problem domains. The key to the solution is neurally-guided program induction, with experiments comparing three induction paradigms: Learning the grid space, Learning the program space, and Learning the transform space. The paper implements and experiments thoroughly on the first two and retains the second for its ARC-AGI submission. After identifying the strengths and weaknesses of these two approaches, it proposes the third as a potential solution and runs preliminary experiments.

Link: https://arxiv.org/abs/2411.17708
Authors: Simon Ouellette
Keywords: open-world problem domain, ability to generalize, crucial quality, open-world problem, problem domain
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:ARC-AGI is an open-world problem domain in which the ability to generalize out-of-distribution is a crucial quality. Under the program induction paradigm, we present a series of experiments that reveal the efficiency and generalization characteristics of various neurally-guided program induction approaches. The three paradigms we consider are Learning the grid space, Learning the program space, and Learning the transform space. We implement and experiment thoroughly on the first two, and retain the second one for ARC-AGI submission. After identifying the strengths and weaknesses of both of these approaches, we suggest the third as a potential solution, and run preliminary experiments.

[NLP-50] SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

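Quick read: This paper addresses the error propagation and information separation found in traditional modular conversational AI systems, proposing full-duplex multimodal large language models (LLMs) as a unified framework for more natural and seamless human-machine conversation. The key to the solution is SALMONN-omni, a codec-free, full-duplex speech understanding and generation model that can listen to its own generated speech and to background sounds while speaking. Through a novel duplex spoken dialogue framework with a "thinking" mechanism, the model generates text and speech asynchronously from embeddings rather than codecs. It performs strongly on streaming speech tasks such as speech recognition, speech enhancement, and spoken question answering, and in particular on complex scenarios such as turn-taking, barge-in, and echo cancellation, showing its potential as a prototype for full-duplex conversational AI systems.

Link: https://arxiv.org/abs/2411.18138
Authors: Wenyi Yu,Siyin Wang,Xiaoyu Yang,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Guangzhi Sun,Lu Lu,Yuxuan Wang,Chao Zhang
Keywords: seamless human-machine conversations, multimodal large language, addressing diverse speech, large language models, Full-duplex multimodal large
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Technical report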

Abstract:Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni’s versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.

[NLP-51] JPPO: Joint Power and Prompt Optimization for Accelerated Large Language Model Services

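Quick read: This paper addresses the computational resource demands and communication load that large language models (LLMs) face when deployed in wireless networks. The key to the solution is the Joint Power and Prompt Optimization (JPPO) framework, which combines Small Language Model (SLM)-based prompt compression with wireless power allocation optimization. By deploying an SLM on user devices for prompt compression and using Deep Reinforcement Learning to jointly optimize the compression ratio and transmission power, JPPO balances service quality against resource efficiency. Experiments show that the framework achieves high service fidelity and low bit error rates while optimizing power usage, and reduces response time by about 17%, with the size of the improvement depending on the length of the original prompt.

Link: https://arxiv.org/abs/2411.18010
Authors: Feiran You,Hongyang Du,Kaibin Huang,Abbas Jamalipour
Keywords: Large Language Models, Small Language Model, demonstrated remarkable capabilities, Large Language, Language Models
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: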

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, leading to their increasing deployment in wireless networks for a wide variety of user services. However, the trend toward longer prompts raises the crucial issues of computational resource demand and heavy communication load. To address this challenge, we propose Joint Power and Prompt Optimization (JPPO), a framework that combines Small Language Model (SLM)-based prompt compression with wireless power allocation optimization. By deploying an SLM at user devices for prompt compression and employing Deep Reinforcement Learning for joint optimization of compression ratio and transmission power, JPPO effectively balances service quality with resource efficiency. Experimental results demonstrate that our framework achieves high service fidelity and low bit error rates while optimizing power usage in wireless LLM services. The system reduces response time by about 17%, with the improvement varying based on the length of the original prompt.

Computer Vision

[CV-0] Textured Gaussians for Enhanced 3D Scene Appearance Modeling

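Quick read: This paper addresses the limited ability of 3D Gaussian Splatting (3DGS) to express complex textures and geometric detail. The key to the solution is a new generalized Gaussian appearance representation that augments each Gaussian with alpha (A), RGB, or RGBA texture maps to model spatially varying color and opacity across the extent of each Gaussian. Each Gaussian can then represent richer texture patterns and geometric structures rather than a single color and an ellipsoid. The study also finds that alpha-only texture maps already improve the expressivity of Gaussians considerably, and that adding RGB texture maps achieves the highest expressivity.

Link: https://arxiv.org/abs/2411.18625
Authors: Brian Chao,Hung-Yu Tseng,Lorenzo Porzi,Chen Gao,Tuotuo Li,Qinbo Li,Ayush Saraf,Jia-Bin Huang,Johannes Kopf,Gordon Wetzstein,Changil Kim
Keywords: rendering technique due, Gaussian, Gaussian Splatting, reconstruction and rendering, rendering time
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL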

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a state-of-the-art 3D reconstruction and rendering technique due to its high-quality results and fast training and rendering time. However, pixels covered by the same Gaussian are always shaded in the same color up to a Gaussian falloff scaling factor. Furthermore, the finest geometric detail any individual Gaussian can represent is a simple ellipsoid. These properties of 3DGS greatly limit the expressivity of individual Gaussian primitives. To address these issues, we draw inspiration from texture and alpha mapping in traditional graphics and integrate it with 3DGS. Specifically, we propose a new generalized Gaussian appearance representation that augments each Gaussian with alpha (A), RGB, or RGBA texture maps to model spatially varying color and opacity across the extent of each Gaussian. As such, each Gaussian can represent a richer set of texture patterns and geometric structures, instead of just a single color and ellipsoid as in naive Gaussian Splatting. Surprisingly, we found that the expressivity of Gaussians can be greatly improved by using alpha-only texture maps, and further augmenting Gaussians with RGB texture maps achieves the highest expressivity. We validate our method on a wide variety of standard benchmark datasets and our own custom captures at both the object and scene levels. We demonstrate image quality improvements over existing methods while using a similar or lower number of Gaussians.

[CV-1] GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

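Quick read: This paper addresses the challenge of reconstructing a high-fidelity 3D human model from a single in-the-wild human photo. The key to the solution is GeneMAN, a generalizable image-to-3D human reconstruction framework built on a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and generated synthetic human data. GeneMAN has three core modules: 1) a human-specific text-to-image diffusion model and a view-conditioned diffusion model, trained to serve as the 2D and 3D human priors respectively; 2) a Geometry Initialization-Sculpting pipeline that uses the pretrained human priors to recover high-quality 3D human geometry from a single image; 3) a Multi-Space Texture Refinement pipeline that refines textures consecutively in the latent space and the pixel space to achieve high-fidelity 3D human textures. Experiments show that GeneMAN generalizes better on in-the-wild images and generates high-quality 3D human models, outperforming prior state-of-the-art methods.

Link: https://arxiv.org/abs/2411.18624
Authors: Wentao Wang,Hang Ye,Fangzhou Hong,Xue Yang,Jianfu Zhang,Yizhou Wang,Ziwei Liu,Liang Pan
Keywords: human, remains a challenging, challenging task, task to reconstruct, high-quality human data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL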

Abstract:Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN 2D human prior and 3D human prior for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization–Sculpting pipeline is leveraged to recover high-quality 3D human geometry given a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and the pixel spaces. Extensive experimental results demonstrate that GeneMAN could generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN could reveal much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

[CV-2] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation

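Quick read: This paper addresses challenges in robotic 3D manipulation, including the lack of large-scale robotic 3D data and the potential loss of spatial geometric information. The key to the solution is the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to build a robust 3D manipulation policy. Specifically, Lift3D first designs a task-aware masked autoencoder that masks task-relevant affordance patches and reconstructs depth information, strengthening the 2D foundation model's implicit 3D robotic representation. After self-supervised fine-tuning, it introduces a 2D model-lifting strategy that establishes a positional mapping between the input 3D points and the positional embeddings of the 2D model. Based on this mapping, Lift3D uses the 2D foundation model to encode point cloud data directly, drawing on large-scale pretrained knowledge to build explicit 3D robotic representations while minimizing the loss of spatial information. Experiments show that Lift3D outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.

Link: https://arxiv.org/abs/2411.18623
Authors: Yueru Jia,Jiaming Liu,Sixiang Chen,Chenyang Gu,Zhilue Wang,Longzan Luo,Lily Lee,Pengwei Wang,Zhongyuan Wang,Renrui Zhang,Shanghang Zhang
Keywords: intricate spatial configurations, interact with intricate, spatial relationships, spatial configurations, manipulation tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: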

Abstract:3D geometric information is essential for manipulation tasks, as robots need to perceive the 3D environment, reason about spatial relationships, and interact with intricate spatial configurations. Recent research has increasingly focused on the explicit extraction of 3D features, while still facing challenges such as the lack of large-scale robotic 3D data and the potential loss of spatial geometry. To address these limitations, we propose the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. Specifically, we first design a task-aware masked autoencoder that masks task-relevant affordance patches and reconstructs depth information, enhancing the 2D foundation model’s implicit 3D robotic representation. After self-supervised fine-tuning, we introduce a 2D model-lifting strategy that establishes a positional mapping between the input 3D points and the positional embeddings of the 2D model. Based on the mapping, Lift3D utilizes the 2D foundation model to directly encode point cloud data, leveraging large-scale pretrained knowledge to construct explicit 3D robotic representations while minimizing spatial information loss. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.

[CV-3] Leveraging Semi-Supervised Learning to Enhance Data Mining for Image Classification under Limited Labeled Data

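Quick read: This paper addresses how to extract valuable information from large-scale, high-dimensional, complex data when labeled data is scarce. The key to the solution is semi-supervised learning: a self-training method is combined with a convolutional neural network (CNN) for image feature extraction and classification, and an iterative process continually improves the model's predictions. The approach markedly improves data analysis and pattern recognition under limited labeled data. Experiments on the CIFAR-10 image classification dataset show that it outperforms traditional machine learning techniques such as Support Vector Machine (SVM), XGBoost, and Multi-Layer Perceptron (MLP), and tests under varying noise levels validate its robustness and noise resistance.

Link: https://arxiv.org/abs/2411.18622
Authors: Aoran Shen,Minghao Dai,Jiacheng Hu,Yingbin Liang,Shiru Wang,Junliang Du
Keywords: extracting valuable information, effectively extracting valuable, big data technology, information age, valuable information
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: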

Abstract:In the 21st-century information age, with the development of big data technology, effectively extracting valuable information from massive data has become a key issue. Traditional data mining methods are inadequate when faced with large-scale, high-dimensional and complex data. Especially when labeled data is scarce, their performance is greatly limited. This study optimizes data mining algorithms by introducing semi-supervised learning methods, aiming to improve the algorithm’s ability to utilize unlabeled data, thereby achieving more accurate data analysis and pattern recognition under limited labeled data conditions. Specifically, we adopt a self-training method and combine it with a convolutional neural network (CNN) for image feature extraction and classification, and continuously improve the model prediction performance through an iterative process. The experimental results demonstrate that the proposed method significantly outperforms traditional machine learning techniques such as Support Vector Machine (SVM), XGBoost, and Multi-Layer Perceptron (MLP) on the CIFAR-10 image classification dataset. Notable improvements were observed in key performance metrics, including accuracy, recall, and F1 score. Furthermore, the robustness and noise-resistance capabilities of the semi-supervised CNN model were validated through experiments under varying noise levels, confirming its practical applicability in real-world scenarios.
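The self-training loop mentioned above follows a generic pattern: fit on the labeled set, pseudo-label the unlabeled points the model is confident about, add them to the training set, and repeat. The sketch below shows that loop only. It swaps the paper's CNN for a nearest-centroid classifier on 2D points so it runs with the standard library, and the confidence measure, threshold, and data are assumptions.

```python
import math

def centroids(points, labels):
    """Mean point per class."""
    sums, counts = {}, {}
    for (x, y), lab in zip(points, labels):
        sx, sy = sums.get(lab, (0.0, 0.0))
        sums[lab] = (sx + x, sy + y)
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab])
            for lab, (sx, sy) in sums.items()}

def predict(cents, point):
    """Return (label, confidence). Confidence is 1 - d_nearest / d_second,
    so points midway between two classes score close to 0.
    Assumes at least two classes."""
    dists = sorted((math.dist(point, c), lab) for lab, c in cents.items())
    (d1, lab), (d2, _) = dists[0], dists[1]
    return lab, (1.0 - d1 / d2 if d2 > 0 else 0.0)

def self_train(labeled, labels, unlabeled, threshold=0.5, max_rounds=5):
    """Iteratively move confidently predicted points into the labeled set."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled, labels)
        remaining, added = [], 0
        for p in pool:
            lab, conf = predict(cents, p)
            if conf >= threshold:
                labeled.append(p)
                labels.append(lab)
                added += 1
            else:
                remaining.append(p)
        pool = remaining
        if added == 0:
            break  # nothing confident left to pseudo-label
    return centroids(labeled, labels), pool

model, leftover = self_train(
    labeled=[(0, 0), (10, 10)], labels=["a", "b"],
    unlabeled=[(1, 0), (0, 2), (9, 10), (10, 8), (5, 5)])
```

The ambiguous midpoint (5, 5) never clears the threshold and stays unlabeled, which is the safeguard that keeps noisy pseudo-labels out of the training set.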

[CV-4] Diffusion Self-Distillation for Zero-Shot Customized Image Generation ECAI

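Quick read: This paper addresses the lack of fine-grained control in generative AI image models, in particular the need of artists to preserve the identity of a specific instance across different contexts ("identity-preserving generation"). The key to the solution is Diffusion Self-Distillation, a method that uses a pre-trained text-to-image diffusion model to generate its own dataset for training a model on text-conditioned image-to-image tasks. The steps are: first, use the text-to-image diffusion model's in-context generation ability to create grids of images, and curate a high-quality paired dataset with the help of a Visual-Language Model; then fine-tune the text-to-image model on this dataset so that it becomes a text+image-to-image model. Experiments show that the method outperforms existing zero-shot methods on identity-preserving generation and is competitive with per-instance tuning techniques without requiring test-time optimization.

Link: https://arxiv.org/abs/2411.18616
Authors: Shengqu Cai,Eric Chan,Yunzhi Zhang,Leonidas Guibas,Jiajun Wu,Gordon Wetzstein
Keywords: desire fine-grained control, produce impressive results, models produce impressive, diffusion models produce, fine-grained control
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page: this https URL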

Abstract:Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., “identity-preserving generation”. This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model’s in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.

[CV-5] Proactive Gradient Conflict Mitigation in Multi-Task Learning: A Sparse Training Perspective

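Quick read: This paper addresses gradient conflict in multi-task learning: when several downstream tasks are trained jointly, competition between their gradients can degrade performance on some tasks. The key to the solution is to reduce gradient conflict through sparse training (ST), which updates only a portion of the model's parameters during training and leaves the rest unchanged, mitigating gradient conflict and improving performance. Sparse training can also be combined with existing gradient manipulation techniques to further strengthen them.

Link: https://arxiv.org/abs/2411.18615
Authors: Zhi Zhang,Jiayi Shen,Congfeng Cao,Gaole Dai,Shiji Zhou,Qizhe Zhang,Shanghang Zhang,Ekaterina Shutova
Keywords: generalist agents necessitates, Advancing towards generalist, multiple downstream tasks, multiple downstream, generalist agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: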

Abstract:Advancing towards generalist agents necessitates the concurrent processing of multiple tasks using a unified model, thereby underscoring the growing significance of simultaneous model training on multiple downstream tasks. A common issue in multi-task learning is the occurrence of gradient conflict, which leads to potential competition among different tasks during joint training. This competition often results in improvements in one task at the expense of deterioration in another. Although several optimization methods have been developed to address this issue by manipulating task gradients for better task balancing, they cannot decrease the incidence of gradient conflict. In this paper, we systematically investigate the occurrence of gradient conflict across different methods and propose a strategy to reduce such conflicts through sparse training (ST), wherein only a portion of the model’s parameters are updated during training while keeping the rest unchanged. Our extensive experiments demonstrate that ST effectively mitigates conflicting gradients and leads to superior performance. Furthermore, ST can be easily integrated with gradient manipulation techniques, thus enhancing their effectiveness.
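Two ideas in the abstract above are easy to make concrete: a pair of task gradients conflicts when their inner product is negative, and sparse training applies the update only to a masked subset of parameters. The sketch below illustrates both with plain lists. The gradients and the mask are hypothetical, and how the paper chooses which parameters to train is not stated in the abstract.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_conflict(g1, g2):
    """Task gradients conflict when their inner product is negative."""
    return dot(g1, g2) < 0

def restrict(grad, mask):
    """Keep only the gradient entries of parameters selected for training."""
    return [g for g, m in zip(grad, mask) if m]

def sparse_step(params, grad, mask, lr=0.1):
    """SGD step on masked parameters only; the rest stay unchanged."""
    return [p - lr * g if m else p for p, g, m in zip(params, grad, mask)]

# Hypothetical gradients of two tasks over four shared parameters.
g_task_a = [1.0, -2.0, 0.5, 0.0]
g_task_b = [0.5, 1.0, 0.5, 1.0]
mask = [True, False, True, False]  # hypothetical choice of trainable entries

conflict_dense = in_conflict(g_task_a, g_task_b)
conflict_sparse = in_conflict(restrict(g_task_a, mask), restrict(g_task_b, mask))

joint_grad = [a + b for a, b in zip(g_task_a, g_task_b)]
new_params = sparse_step([1.0, 1.0, 1.0, 1.0], joint_grad, mask)
```

In this toy case the conflict comes entirely from the second parameter, so restricting updates to the masked entries removes it. Whether a given mask helps in practice depends on where the disagreement between tasks sits.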

[CV-6] CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Quick read: This paper addresses the creation of 4D (dynamic 3D) scenes from monocular video. The key to the solution is CAT4D, which uses a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, the model can turn a monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation.

Link: https://arxiv.org/abs/2411.18613
Authors: Rundi Wu,Ruiqi Gao,Ben Poole,Alex Trevithick,Changxi Zheng,Jonathan T. Barron,Aleksander Holynski
Keywords: method for creating, monocular video, multi-view video, multi-view video diffusion, view synthesis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos. See our project page for results and interactive demos: this http URL.

[CV-7] Structured light with a million light planes per second

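Quick read: This paper addresses high-speed depth capture, aiming at full-frame depth capture at a thousand frames per second, four times faster than the previous state of the art. The key to the solution is an acousto-optic light scanning device that can scan light planes at up to two million planes per second. The device is paired with an event camera, and the sparse events triggered as the light plane sweeps the scene are used for depth triangulation. Unlike earlier structured light systems, where light scanning speed was the bottleneck, this scanning device is three orders of magnitude faster than the event camera's full-frame bandwidth, so the event camera's fast operation can be fully exploited. The paper also demonstrates adaptive scanning of only regions of interest, at speeds an order of magnitude faster than the theoretical full-frame limit of event cameras.

Link: https://arxiv.org/abs/2411.18597
Authors: Dhawal Sirikonda,Praneeth Chakravarthula,Ioannis Gkioulekas,Adithya Pediredla
Keywords: structured light system, light scanning device, light, system that captures, thousand frames
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: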

Abstract:We introduce a structured light system that captures full-frame depth at rates of a thousand frames per second, four times faster than the previous state of the art. Our key innovation to this end is the design of an acousto-optic light scanning device that can scan light planes at rates up to two million planes per second. We combine this device with an event camera for structured light, using the sparse events triggered on the camera as we sweep a light plane on the scene for depth triangulation. In contrast to prior work, where light scanning is the bottleneck towards faster structured light operation, our light scanning device is three orders of magnitude faster than the event camera’s full-frame bandwidth, thus allowing us to take full advantage of the event camera’s fast operation. To surpass this bandwidth, we additionally demonstrate adaptive scanning of only regions of interest, at speeds an order of magnitude faster than the theoretical full-frame limit for event cameras.
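Two quantities in the abstract above lend themselves to a quick worked example: how many light planes fit into one depth frame at the quoted rates, and how a single depth point follows from intersecting a camera ray with a known light plane. The sketch assumes a pinhole camera at the origin and a planar light sheet. The geometry values are invented, and the planes-per-frame figure is only an upper bound implied by the two quoted rates, not a number from the paper.

```python
def planes_per_frame(plane_rate_hz, frame_rate_hz):
    """Upper bound on light planes available per depth frame."""
    return plane_rate_hz / frame_rate_hz

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate(ray_dir, plane_point, plane_normal):
    """Intersect a camera ray through the origin with a light plane.

    The plane is the set of X with dot(plane_normal, X - plane_point) == 0.
    Returns the 3D intersection point, or None if the ray is parallel
    to the plane.
    """
    denom = dot(ray_dir, plane_normal)
    if abs(denom) < 1e-12:
        return None
    t = dot(plane_point, plane_normal) / denom
    return tuple(t * d for d in ray_dir)

# Rates quoted in the abstract: up to 2e6 planes/s, 1e3 frames/s.
budget = planes_per_frame(2_000_000, 1_000)

# Hypothetical setup: light source 0.2 m to the right of the camera,
# light plane tilted toward the optical axis.
point = triangulate(ray_dir=(0.0, 0.0, 1.0),
                    plane_point=(0.2, 0.0, 0.0),
                    plane_normal=(1.0, 0.0, 0.1))
```

Each event reported by the camera identifies a pixel (hence a ray), and its timestamp identifies which plane was lit, so one such intersection per event yields the depth map.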

[CV-8] Biomolecular Analysis of Soil Samples and Rock Imagery for Tracing Evidence of Life Using a Mobile Robot MICRO

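Quick read: This paper addresses the technical challenges of detecting signs of past life on Mars, mainly by improving the detection capabilities of the Phoenix rover. The key to the solution is the integration of advanced digital microscopic imagers and spectrometers for high-resolution examination of soil samples, together with mechanical components strengthened to enhance maneuverability and optimize subsurface sampling. These modifications let the Phoenix rover navigate diverse geological environments and collect samples for biomolecular analysis, broadening the range of biomarkers and biosignatures detectable on Mars.

Link: https://arxiv.org/abs/2411.18594
Authors: Shah Md Ahasan Siddique,Ragib Tahshin Rinath,Shakil Mosharrof,Syed Tanjib Mahmud,Sakib Ahmed
Keywords: advanced robotic technologies, search for evidence, evidence of past, past life, requires the usage
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Key Words: Mars, Rover, Phoenix, Biosignatures, Biomolecular Analysis, Microscopy, Spectroscopy, Sampling, Astrobiology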

Abstract:The search for evidence of past life on Mars presents a tremendous challenge that requires the usage of very advanced robotic technologies to overcome it. Current digital microscopic imagers and spectrometers used for astrobiological examination suffer from limitations such as insufficient resolution, narrow detection range, and lack of portability. To overcome these challenges, this research study presents modifications to the Phoenix rover to expand its capability for detecting biosignatures on Mars. This paper examines the modifications implemented on the Phoenix rover to enhance its capability to detect a broader spectrum of biosignatures. One of the notable improvements comprises the integration of advanced digital microscopic imagers and spectrometers, enabling high-resolution examination of soil samples. Additionally, the mechanical components of the device have been reinforced to enhance maneuverability and optimize subsurface sampling capabilities. Empirical investigations have demonstrated that Phoenix has the capability to navigate diverse geological environments and procure samples for the purpose of biomolecular analysis. The biomolecular instrumentation and hybrid analytical methods showcased in this study demonstrate considerable potential for future astrobiology missions on Mars. The potential for enhancing the system lies in the possibility of broadening the range of detectable biomarkers and biosignatures.

[CV-9] Hierarchical Information Flow for Generalized Efficient Image Restoration

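Quick read: This paper addresses how to generalize and scale Transformer-based models efficiently for image restoration (IR). The key to the solution is a hierarchical information flow mechanism, Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR builds a hierarchical information tree that represents the degraded image across three levels, each encapsulating a different type of information: higher levels cover broader objects and concepts, and lower levels focus on local details. The tree structure also removes long-range self-attention, improving computational efficiency and memory utilization and laying the groundwork for effective model scaling. With model scaling, Hi-IR is expected to improve image restoration in large-scale training settings. Experiments show that Hi-IR achieves state-of-the-art performance on seven common image restoration tasks, confirming its effectiveness and generalizability.

Link: https://arxiv.org/abs/2411.18588
Authors: Yawei Li,Bin Ren,Jingyun Liang,Rakesh Ranjan,Mengyuan Liu,Nicu Sebe,Ming-Hsuan Yang,Luca Benini
Keywords: transformers show promise, vision transformers show, numerous image restoration, vision transformers, promise in numerous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: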

Abstract:While vision transformers show promise in numerous image restoration (IR) tasks, the challenge remains in efficiently generalizing and scaling up a model for multiple IR tasks. To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. Each level encapsulates different types of information, with higher levels encompassing broader objects and concepts and lower levels focusing on local details. Moreover, the hierarchical tree architecture removes long-range self-attention, improves the computational efficiency and memory utilization, thus preparing it for effective model scaling. Based on that, we explore model scaling to improve our method’s capabilities, which is expected to positively impact IR in large-scale training settings. Extensive experimental results show that Hi-IR achieves state-of-the-art performance in seven common image restoration tasks, affirming its effectiveness and generalizability.

[CV-10] Exploring Depth Information for Detecting Manipulated Face Videos

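Quick read: This paper addresses robustness in face manipulation detection, in particular by introducing the face depth map as auxiliary information to improve detection. The key to the solution is a Face Depth Map Transformer (FDMT) that estimates the face depth map patch by patch from an RGB face image, capturing the local depth anomalies that manipulation creates. The paper also designs a Multi-head Depth Attention (MDA) mechanism to integrate the estimated depth map with the backbone features, and an RGB-Depth Inconsistency Attention (RDIA) module to capture inter-frame inconsistency in multi-frame input. Together these modules improve the robustness and accuracy of face manipulation detection.

Link: https://arxiv.org/abs/2411.18572
Authors: Haoyue Wang,Sheng Li,Ji He,Zhenxing Qian,Xinpeng Zhang,Shaolin Fan
Keywords: face depth map, Face manipulation detection, face depth, depth map, Face
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 10 figures. arXiv admin note: substantial text overlap with arXiv:2212.14230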

Abstract:Face manipulation detection has received a lot of attention because of its importance for the reliability and security of face images and videos. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which has been shown to be promising. The face depth map, an important face feature that has been shown to be effective in other areas such as face recognition or face detection, has unfortunately received little attention in the literature on face manipulation detection. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information for robust face manipulation detection. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image, which is able to capture the local depth anomaly created due to manipulation. The estimated face depth map is then considered as auxiliary information to be integrated with the backbone features using a newly designed Multi-head Depth Attention (MDA) mechanism. We also propose an RGB-Depth Inconsistency Attention (RDIA) module to effectively capture the inter-frame inconsistency for multi-frame input. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.

[CV-11] DexDiffuser: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation

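Quick read: This paper addresses dexterous manipulation with contact-rich interactions, where existing diffusion-based planning methods often produce unrealistic "ghost states" or lack adaptability when handling complex sequential interactions. The key to the solution is DexDiffuser, an interaction-aware diffusion planning framework that models state-action dynamics through a dual-phase diffusion process (pre-interaction contact alignment and post-contact goal-directed control), enabling goal-adaptive, generalizable dexterous manipulation. In addition, dynamics model-based dual guidance and guidance functions generated automatically by large language models improve generalizability for physical interactions and allow diverse goal adaptation through language cues. Experiments show that DexDiffuser performs markedly better on goals outside the training distribution, with a success rate more than twice that of existing methods.

Link: https://arxiv.org/abs/2411.18562
Authors: Zhixuan Liang,Yao Mu,Yixiao Wang,Fei Ni,Tianxing Chen,Wenqi Shao,Wei Zhan,Masayoshi Tomizuka,Ping Luo,Mingyu Ding
Keywords: advanced robotics, crucial for advanced, Dexterous manipulation, manipulation, adaptive dexterous manipulation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 27 pages. Project page: this https URL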

Abstract:Dexterous manipulation with contact-rich interactions is crucial for advanced robotics. While recent diffusion-based planning approaches show promise for simpler manipulation tasks, they often produce unrealistic ghost states (e.g., the object automatically moves without hand contact) or lack adaptability when handling complex sequential interactions. In this work, we introduce DexDiffuser, an interaction-aware diffusion planning framework for adaptive dexterous manipulation. DexDiffuser models joint state-action dynamics through a dual-phase diffusion process which consists of pre-interaction contact alignment and post-contact goal-directed control, enabling goal-adaptive generalizable dexterous manipulation. Additionally, we incorporate dynamics model-based dual guidance and leverage large language models for automated guidance function generation, enhancing generalizability for physical interactions and facilitating diverse goal adaptation through language cues. Experiments on physical interaction tasks such as door opening, pen and block re-orientation, and hammer striking demonstrate DexDiffuser’s effectiveness on goals outside training distributions, achieving over twice the average success rate (59.2% vs. 29.5%) compared to existing methods. Our framework achieves 70.0% success on 30-degree door opening, 40.0% and 36.7% on pen and block half-side re-orientation respectively, and 46.7% on hammer nail half drive, highlighting its robustness and flexibility in contact-rich manipulation.

[CV-12] FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

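【Quick read】: This paper addresses the repetitive patterns and structural distortions that diffusion models produce when run at resolutions other than the training resolution. The key to the solution is two modules: a Frequency Modulation (FM) module, which uses the Fourier domain to improve global structure consistency, and an Attention Modulation (AM) module, which improves the consistency of local texture patterns, a problem largely ignored in prior work. Together they let an existing latent diffusion model adapt to different test-time resolutions without additional training, reducing structural and local artifacts while avoiding redundant inference tricks such as patch-based or progressive generation, so the latency overhead stays negligible.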

Link: https://arxiv.org/abs/2411.18552
Authors: Haosen Yang,Adrian Bulat,Isma Hadji,Hai X. Pham,Xiatian Zhu,Georgios Tzimiropoulos,Brais Martinez
Keywords: generating high-quality images, high-quality images, proficient at generating, generating high-quality, Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined Fam diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.
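
The abstract only says that the FM module uses the Fourier domain to improve global structure. As a rough illustration of that general idea, and not the paper's actual module, the sketch below mixes two images in the frequency domain: low frequencies (global layout) come from an upsampled low-resolution reference, high frequencies (detail) from the high-resolution sample. The function name `frequency_modulate`, the hard circular mask, and the `cutoff` parameter are all assumptions made for this example.

```python
import numpy as np

def frequency_modulate(high_res, low_res_up, cutoff=0.125):
    """Mix two same-sized 2D arrays in the Fourier domain.

    Hypothetical helper: frequencies with radius <= cutoff (cycles/pixel)
    are taken from low_res_up, the rest from high_res.
    """
    assert high_res.shape == low_res_up.shape
    h, w = high_res.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    low_pass = (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(float)

    spec_hi = np.fft.fft2(high_res)
    spec_lo = np.fft.fft2(low_res_up)
    mixed = low_pass * spec_lo + (1.0 - low_pass) * spec_hi
    # The mask is symmetric, so the result is real up to rounding error.
    return np.real(np.fft.ifft2(mixed))
```

Because the zero-frequency term always falls inside the mask, the output inherits the mean intensity of the low-resolution reference; a real implementation would presumably operate on diffusion latents at each denoising step rather than on final images.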

[CV-13] PhyCAGE: Physically Plausible Compositional 3D Asset Generation from a Single Image

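【Quick read】: This paper addresses generating physically plausible compositional 3D assets from a single image. The key to the solution is Physical Simulation-Enhanced Score Distillation Sampling (PSE-SDS): the gradient of the Score Distillation Sampling (SDS) loss is used as the initial velocity of a physical simulation, so the simulator acts as a physics-guided optimizer that progressively corrects the positions of the 3D Gaussians until the generated assets are physically compatible with each other.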

Link: https://arxiv.org/abs/2411.18548
Authors: Han Yan,Mingrui Zhang,Yang Li,Chao Ma,Pan Ji
Keywords: present PhyCAGE, physically plausible compositional, Score Distillation Sampling, Gaussian Splatting representations, asset generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present PhyCAGE, the first approach for physically plausible compositional 3D asset generation from a single image. Given an input image, we first generate consistent multi-view images for components of the assets. These images are then fitted with 3D Gaussian Splatting representations. To ensure that the Gaussians representing objects are physically compatible with each other, we introduce a Physical Simulation-Enhanced Score Distillation Sampling (PSE-SDS) technique to further optimize the positions of the Gaussians. It is achieved by setting the gradient of the SDS loss as the initial velocity of the physical simulation, allowing the simulator to act as a physics-guided optimizer that progressively corrects the Gaussians’ positions to a physically compatible state. Experimental results demonstrate that the proposed method can generate physically plausible compositional 3D assets given a single image.

[CV-14] AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans

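【Quick read】: This paper addresses the challenges that Visual Language Navigation (VLN) faces in dynamic environments, in particular in the presence of moving human obstacles. The key to the solution is a new task, Adaptive Visual Language Navigation (AdaVLN), supported by the AdaVLN simulator and the AdaR2R datasets. The simulator lets fully animated human models be placed directly into common datasets such as Matterport3D, and a "freeze-time" mechanism pauses world state updates during agent inference, which keeps experiments reproducible and comparisons fair across different hardware. By simulating complex dynamic environments, the approach makes VLN more realistic and narrows the sim-to-real gap.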

Link: https://arxiv.org/abs/2411.18539
Authors: Dillon Loh,Tomasz Bednarz,Xinxing Xia,Frank Guan
Keywords: Visual Language Navigation, natural language instructions, Adaptive Visual Language, Visual Language, realistic environments based
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Visual Language Navigation is a task that challenges robots to navigate in realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity to navigation tasks that mimic the real-world. To support exploration of this task, we also present AdaVLN simulator and AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a “freeze-time” mechanism for both the navigation task and simulator, which pauses world state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.

[CV-15] Utilizing the Mean Teacher with Supcontrast Loss for Wafer Pattern Recognition

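【Quick read】: This paper addresses wafer map pattern recognition in semiconductor manufacturing, especially with limited labeled data and class imbalance. The key to the solution is combining the Mean Teacher framework with a supervised contrastive learning loss to improve recognition accuracy and robustness. SMOTE and under-sampling are used to handle imbalance in the dataset. On the real-world WM811K dataset, the method improves Accuracy, Precision, Recall, and F1 score over the baseline by 5.46%, 6.68%, 5.42%, and 4.53%, respectively.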

Link: https://arxiv.org/abs/2411.18533
Authors: Qiyu Wei,Xun Xu,Zeng Zeng,Xulei Yang
Keywords: helping engineers identify, map pattern recognition, wafer map pattern, wafer maps play, play a crucial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure

Abstract:The patterns on wafer maps play a crucial role in helping engineers identify the causes of production issues during semiconductor manufacturing. In order to reduce costs and improve accuracy, automation technology is essential, and recent developments in deep learning have led to impressive results in wafer map pattern recognition. In this context, inspired by the effectiveness of semi-supervised learning and contrastive learning methods, we introduce an innovative approach that integrates the Mean Teacher framework with the supervised contrastive learning loss for enhanced wafer map pattern recognition. Our methodology not only addresses the nuances of wafer patterns but also tackles challenges arising from limited labeled data. To further refine the process, we address data imbalance in the wafer dataset by employing SMOTE and under-sampling techniques. We conduct a comprehensive analysis of our proposed method and demonstrate its effectiveness through experiments using real-world dataset WM811K obtained from semiconductor manufacturers. Compared to the baseline method, our method has achieved 5.46%, 6.68%, 5.42%, and 4.53% improvements in Accuracy, Precision, Recall, and F1 score, respectively.
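
The two ingredients named in the abstract are both standard techniques, so a minimal NumPy sketch of each may help. It assumes the usual formulations: Mean Teacher keeps the teacher weights as an exponential moving average (EMA) of the student weights, and the supervised contrastive loss pulls together embeddings that share a label. How the paper combines and weights them is not stated in the abstract; the function names, `alpha`, and `temperature` below are illustrative choices only.

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Mean Teacher step: teacher <- alpha * teacher + (1 - alpha) * student."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of embeddings (n, d)."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax over all *other* samples, stabilised by the row maximum.
    row_max = np.max(np.where(not_self, sim, -np.inf), axis=1, keepdims=True)
    exp = np.exp(sim - row_max) * not_self
    log_prob = sim - row_max - np.log(exp.sum(axis=1, keepdims=True))

    # Positives: other samples with the same label as the anchor.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive
    per_anchor = -(log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return per_anchor.mean()
```

The loss is small when same-label embeddings are close and different-label embeddings are far apart, which is the property the method relies on when labeled wafer maps are scarce.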

[CV-16] Enhancing weed detection performance by means of GenAI-based image augmentation

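【Quick read】: This paper addresses the scarcity of well-annotated training data for deep learning-based weed detection, motivated by the economic and environmental challenges of traditional herbicide application and the resulting need for intelligent weed control systems. The key to the solution is generative AI-based data augmentation: a Stable Diffusion model produces diverse synthetic images that expand the training dataset. This improves both the quantity and quality of the data and raises the performance of real-time detectors built on compact convolutional neural networks (CNNs) such as YOLO nano, with substantial gains in mean Average Precision (mAP50 and mAP50-95).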

Link: https://arxiv.org/abs/2411.18513
Authors: Sourav Modak,Anthony Stein
Keywords: Precise weed management, sustaining crop productivity, Precise weed, ecological balance, management is essential
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Precise weed management is essential for sustaining crop productivity and ecological balance. Traditional herbicide applications face economic and environmental challenges, emphasizing the need for intelligent weed control systems powered by deep learning. These systems require vast amounts of high-quality training data. The reality of scarcity of well-annotated training data, however, is often addressed through generating more data using data augmentation. Nevertheless, conventional augmentation techniques such as random flipping, color changes, and blurring lack sufficient fidelity and diversity. This paper investigates a generative AI-based augmentation technique that uses the Stable Diffusion model to produce diverse synthetic images that improve the quantity and quality of training datasets for weed detection models. Moreover, this paper explores the impact of these synthetic images on the performance of real-time detection systems, thus focusing on compact CNN-based models such as YOLO nano for edge devices. The experimental results show substantial improvements in mean Average Precision (mAP50 and mAP50-95) scores for YOLO models trained with generative AI-augmented datasets, demonstrating the promising potential of synthetic data to enhance model robustness and accuracy.

[CV-17] GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

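【Quick read】: This paper addresses the difficulty Multimodal Large Language Models (MLLMs) have in generating interleaved image-text content, which requires integrated multimodal understanding and generation abilities. The key to the solution is GATE OpenING (OpenING), a comprehensive benchmark of 5,400 high-quality human-annotated instances across 56 real-world tasks, covering diverse daily scenarios such as travel guides, design, and brainstorming. The paper also presents IntJudge, a judge model for evaluating open-ended multimodal generation methods; trained with a novel data pipeline, it reaches an 82.42% agreement rate with human judgments, outperforming GPT-based evaluators by 11.34%. Together these provide a robust platform and tooling for evaluating and improving interleaved image-text generation methods.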

Link: https://arxiv.org/abs/2411.18499
Authors: Pengfei Zhou,Xiaopeng Peng,Jiajun Song,Chuanhao Li,Zhaopan Xu,Yue Yang,Ziyao Guo,Hao Zhang,Yuqi Lin,Yefei He,Lirui Zhao,Shuo Liu,Tianhua Li,Yuxuan Xie,Xiaojun Chang,Yu Qiao,Wenqi Shao,Kaipeng Zhang
Keywords: Multimodal Large Language, Large Language Models, Large Language, made significant strides, Multimodal Large
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 53 pages, 19 figures

Abstract:Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guide, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. The OpenING is open-sourced at this https URL.

[CV-18] A comparison of extended object tracking with multi-modal sensors in indoor environment

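【Quick read】: This paper studies efficient object tracking on 3D point cloud data and compares two sensors, LiDAR and stereo cameras. The key to the solution is a fast heuristic object detector that uses prior information about the environment and the target; the detected target points are fed into an extended object tracking framework in which the target shape is parameterized by a star-convex hypersurface model. Experiments show that tracking with a stereo camera performs similarly to tracking with a LiDAR sensor, while the two sensors differ in cost by more than a factor of ten.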

Link: https://arxiv.org/abs/2411.18476
Authors: Jiangtao Shuai,Martin Baerveldt,Manh Nguyen-Duc,Anh Le-Tuan,Manfred Hauswirth,Danh Le-Phuoc
Keywords: cloud sensory sources, point cloud sensory, significant price differences, object tracking approach, sensory sources
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents a preliminary study of an efficient object tracking approach, comparing the performance of two different 3D point cloud sensory sources: LiDAR and stereo cameras, which have significant price differences. In this preliminary work, we focus on single object tracking. We first developed a fast heuristic object detector that utilizes prior information about the environment and target. The resulting target points are subsequently fed into an extended object tracking framework, where the target shape is parameterized using a star-convex hypersurface model. Experimental results show that our object tracking method using a stereo camera achieves performance similar to that of a LiDAR sensor, with a cost difference of more than tenfold.
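
To give a feel for what a star-convex shape parameterization looks like, the sketch below shows the common 2D analogue: the boundary is described by a radial function r(θ) around a center, expanded as a truncated Fourier series. This is only an illustration under that assumption; the paper uses a hypersurface model for 3D data, and the function name and coefficient layout here are hypothetical.

```python
import numpy as np

def star_convex_boundary(center, a0, cos_coeffs, sin_coeffs, num_points=64):
    """Sample boundary points of a 2D star-convex shape.

    r(theta) = a0 + sum_k (a_k cos(k theta) + b_k sin(k theta)),
    with the k-th coefficients taken from cos_coeffs and sin_coeffs.
    """
    theta = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    radius = np.full_like(theta, a0)
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        radius += a * np.cos(k * theta) + b * np.sin(k * theta)
    directions = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return np.asarray(center) + radius[:, None] * directions
```

With no higher-order coefficients the shape is a circle of radius `a0`; in an extended object tracker the center and the coefficients would be part of the estimated state and updated from the measured target points.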

[CV-19] Weakly Supervised Framework Considering Multi-temporal Information for Large-scale Cropland Mapping with Satellite Imagery

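【Quick read】: This paper addresses the high labeling cost of accurate large-scale cropland mapping. The key to the solution is a weakly supervised framework that incorporates multi-temporal information. High-quality labels are extracted according to their consistency across global land cover (GLC) products and used as the supervised learning signal. To reduce overfitting caused by the model over-trusting the errors that remain in these labels, the similarity and aggregation of cropland in the visual and spatial domains are encoded as an unsupervised learning signal that regularizes the supervised part. The same unsupervised signal is also applied to samples without high-quality labels, which enriches the diversity of the feature space. Finally, dense satellite image time series (SITS) extend the framework along the temporal dimension to capture the phenological features of cropland, and experiments across three study areas confirm the framework's adaptability and robustness.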

Link: https://arxiv.org/abs/2411.18475
Authors: Yuze Wang,Aoran Hu,Ji Qi,Yang Liu,Chao Tao
Keywords: agricultural production management, Accurately mapping large-scale, large-scale cropland mapping, Accurately mapping, cropland mapping
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurately mapping large-scale cropland is crucial for agricultural production management and planning. Currently, the combination of remote sensing data and deep learning techniques has shown outstanding performance in cropland mapping. However, those approaches require massive precise labels, which are labor-intensive. To reduce the label cost, this study presented a weakly supervised framework considering multi-temporal information for large-scale cropland mapping. Specifically, we extract high-quality labels according to their consistency among global land cover (GLC) products to construct the supervised learning signal. On the one hand, to alleviate the overfitting problem caused by the model’s over-trust of remaining errors in high-quality labels, we encode the similarity/aggregation of cropland in the visual/spatial domain to construct the unsupervised learning signal, and take it as the regularization term to constrain the supervised part. On the other hand, to sufficiently leverage the plentiful information in the samples without high-quality labels, we also incorporate the unsupervised learning signal in these samples, enriching the diversity of the feature space. After that, to capture the phenological features of croplands, we introduce dense satellite image time series (SITS) to extend the proposed framework in the temporal dimension. We also visualized the high dimensional phenological features to uncover how multi-temporal information benefits cropland extraction, and assessed the method’s robustness under conditions of data scarcity. The proposed framework has been experimentally validated for strong adaptability across three study areas (Hunan Province, Southeast France, and Kansas) in large-scale cropland mapping, and the internal mechanism and temporal generalizability are also investigated.
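
The label-extraction step can be sketched in a few lines. The abstract says high-quality labels are chosen by their consistency among GLC products but does not give the exact rule, so the example below assumes the strictest one: a pixel keeps its label only when every product agrees, and is otherwise marked as unlabeled. The function name and the `ignore_value` convention are assumptions for illustration.

```python
import numpy as np

def consistent_labels(products, ignore_value=-1):
    """Keep a pixel's label only where all land-cover products agree.

    products: list of (H, W) integer label maps on the same grid.
    Returns an (H, W) map with ignore_value where the products disagree.
    """
    stack = np.stack(products)                  # (k, H, W)
    agree = np.all(stack == stack[0], axis=0)   # unanimous pixels
    return np.where(agree, stack[0], ignore_value)
```

Pixels marked with `ignore_value` correspond to the "samples without high-quality labels" in the abstract, which the framework still uses through its unsupervised signal.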

[CV-20] HEMGS: A Hybrid Entropy Model for 3D Gaussian Splatting Data Compression

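【Quick read】: This paper addresses the data storage and transmission challenges that 3D Gaussian Splatting (3DGS) creates for 3D modeling and image rendering. The key to the solution is a Hybrid Entropy Model for Gaussian Splatting (HEMGS) that combines a hyperprior network with an autoregressive network. A progressive coding algorithm generates hyperprior features, using previously compressed attributes and location as prior information to reduce structural redundancy across attributes. A domain-aware and instance-aware architecture captures domain-level structural relations and reveals scene-specific features through multilayer perceptrons (MLPs). The autoregressive network reduces redundancy within each attribute, using an adaptive context coding algorithm with flexible receptive fields to capture relationships between neighboring compressed elements. While maintaining rendering quality, the method reduces size by about 40% on average relative to the baseline.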

Link: https://arxiv.org/abs/2411.18473
Authors: Lei Liu,Zhenghao Chen,Dong Xu
Keywords: Gaussian Splatting, creates big challenges, Fast progress, Gaussians popular, modeling and image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fast progress in 3D Gaussian Splatting (3DGS) has made 3D Gaussians popular for 3D modeling and image rendering, but this creates big challenges in data storage and transmission. To obtain a highly compact 3DGS representation, we propose a hybrid entropy model for Gaussian Splatting (HEMGS) data compression, which comprises two primary components, a hyperprior network and an autoregressive network. To effectively reduce structural redundancy across attributes, we apply a progressive coding algorithm to generate hyperprior features, in which we use previously compressed attributes and location as prior information. In particular, to better extract the location features from these compressed attributes, we adopt a domain-aware and instance-aware architecture to respectively capture domain-aware structural relations without additional storage costs and reveal scene-specific features through MLPs. Additionally, to reduce redundancy within each attribute, we leverage relationships between neighboring compressed elements within the attributes through an autoregressive network. Given its unique structure, we propose an adaptive context coding algorithm with flexible receptive fields to effectively capture adjacent compressed elements. Overall, we integrate our HEMGS into an end-to-end optimized 3DGS compression framework and the extensive experimental results on four benchmarks indicate that our method achieves about 40% average reduction in size while maintaining the rendering quality over our baseline method and achieving state-of-the-art compression results.

[CV-21] Complexity Experts are Task-Discriminative Learners for Any Image Restoration

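【Quick read】: This paper addresses the inconsistent behavior of conventional Mixture-of-Experts (MoE) models in image restoration: some experts unexpectedly generalize across tasks while others struggle within their intended scope, which limits the MoE architecture's ability to save computation by bypassing irrelevant experts during inference. The key to the solution is "complexity experts", expert blocks with varying computational complexity and receptive fields that can flexibly handle restoration tasks of different difficulty. The experiments show that a simple allocation strategy biased toward lower complexity unexpectedly matches tasks to experts of appropriate complexity, improving computational efficiency while maintaining high performance.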

Link: https://arxiv.org/abs/2411.18466
Authors: Eduard Zamfir,Zongwei Wu,Nancy Mehta,Yuedong Tan,Danda Pani Paudel,Yulun Zhang,Radu Timofte
Keywords: Recent advancements, image restoration models, image restoration, unified framework, Recent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This hinders leveraging MoEs' computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts": flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source will be made publicly available at this https URL.
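
One simple way to picture "a bias toward lower complexity" is to subtract a cost penalty from each expert's gating score before choosing an expert. The abstract does not describe how MoCE-IR implements its bias, so the sketch below is only a toy version under that assumption; `route_with_complexity_bias`, `costs`, and `bias_strength` are hypothetical names.

```python
import numpy as np

def route_with_complexity_bias(logits, costs, bias_strength=1.0):
    """Pick one expert per input, penalising expensive experts.

    logits: (batch, num_experts) gating scores.
    costs:  (num_experts,) relative computational cost of each expert.
    """
    adjusted = logits - bias_strength * np.asarray(costs)
    return np.argmax(adjusted, axis=-1)
```

When the gate has no preference, the cheapest expert wins; a costly expert is selected only when its score exceeds the others by more than the penalty, which mirrors the intuition that harder degradations end up with higher-complexity experts.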

[CV-22] Neural Image Unfolding: Flattening Sparse Anatomical Structures using Neural Fields

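【Quick read】: This paper addresses the difficulty of visualizing non-planar, sparse anatomical structures (such as vascular, duct, or bone systems) in tomographic imaging, since these structures extend over multiple 2D slices. The key to the solution is fitting a neural field to the transformation that maps the anatomy of interest onto a distortion-minimized 2D overview image. The paper proposes distortion regularization strategies and combines geometric with intensity-based loss formulations so that non-annotated and auxiliary targets are displayed as well. The technique outperforms mesh-based baselines on sparse structures with respect to peak distortion, and its regularization scheme yields smoother transformations than the Jacobian formulations used in neural field-based image registration.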

Link: https://arxiv.org/abs/2411.18415
Authors: Leonhard Rist,Pluvio Stephan,Noah Maul,Linda Vorberg,Hendrik Ditt,Michael Sühling,Andreas Maier,Bernhard Egger,Oliver Taubmann
Keywords: imaging reveals internal, Tomographic imaging reveals, reveals internal structures, medical diagnoses, imaging reveals
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tomographic imaging reveals internal structures of 3D objects and is crucial for medical diagnoses. Visualizing the morphology and appearance of non-planar sparse anatomical structures that extend over multiple 2D slices in tomographic volumes is inherently difficult but valuable for decision-making and reporting. Hence, various organ-specific unfolding techniques exist to map their densely sampled 3D surfaces to a distortion-minimized 2D representation. However, there is no versatile framework to flatten complex sparse structures including vascular, duct or bone systems. We deploy a neural field to fit the transformation of the anatomy of interest to a 2D overview image. We further propose distortion regularization strategies and combine geometric with intensity-based loss formulations to also display non-annotated and auxiliary targets. In addition to improved versatility, our unfolding technique outperforms mesh-based baselines for sparse structures w.r.t. peak distortion and our regularization scheme yields smoother transformations compared to Jacobian formulations from neural field-based image registration.

[CV-23] Adaptive Blind All-in-One Image Restoration

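【Quick read】: This paper addresses the limited generalization of existing blind all-in-one image restoration models to unknown and unseen degradation types. The key to the solution is the Adaptive Blind All-in-One Restoration (ABAIR) model, built in three steps. First, a strong baseline is trained on a large dataset of natural images with multiple synthetic degradations, augmented with a segmentation head that estimates per-pixel degradation types, so the backbone generalizes to a wide range of degradations. Second, independent low-rank adapters adapt the baseline to different restoration tasks. Third, a flexible, lightweight degradation estimator adaptively combines the adapters to handle varied images. The resulting model handles specific degradations well and shows clear improvements on complex tasks and unseen degradation types.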

Link: https://arxiv.org/abs/2411.18412
Authors: David Serrano-Lozano,Luis Herranz,Shaolin Su,Javier Vazquez-Corral
Keywords: restoration models aim, aim to recover, recover a high-quality, input degraded, degraded with unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages

Abstract:Blind all-in-one image restoration models aim to recover a high-quality image from an input degraded with unknown distortions. However, these models require all the possible degradation types to be defined during the training stage while showing limited generalization to unseen degradations, which limits their practical application in complex cases. In this paper, we propose a simple but effective adaptive blind all-in-one restoration (ABAIR) model, which can address multiple degradations, generalizes well to unseen degradations, and efficiently incorporate new degradations by training a small fraction of parameters. First, we train our baseline model on a large dataset of natural images with multiple synthetic degradations, augmented with a segmentation head to estimate per-pixel degradation types, resulting in a powerful backbone able to generalize to a wide range of degradations. Second, we adapt our baseline model to varying image restoration tasks using independent low-rank adapters. Third, we learn to adaptively combine adapters to versatile images via a flexible and lightweight degradation estimator. Our model is both powerful in handling specific distortions and flexible in adapting to complex tasks, it not only outperforms the state-of-the-art by a large margin on five- and three-task IR setups, but also shows improved generalization to unseen degradations and also composite distortions.
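
The second and third steps can be illustrated with the standard low-rank adapter formulation, where each adapter contributes an update B @ A to a frozen weight matrix. The sketch below assumes the estimator outputs one mixing weight per adapter and that the updates are combined linearly; the paper's actual combination rule is not given in the abstract, and `adapted_weight` and `mix` are names made up for this example.

```python
import numpy as np

def adapted_weight(base, adapters, mix):
    """Combine low-rank adapters into one effective weight matrix.

    base:     (out, in) frozen baseline weight.
    adapters: list of (A, B) pairs with A: (r, in) and B: (out, r).
    mix:      per-adapter weights, e.g. from a degradation estimator.
    """
    delta = sum(w * (B @ A) for w, (A, B) in zip(mix, adapters))
    return base + delta
```

Only the small `A` and `B` matrices are trained per task, which is why a new degradation can be added by training a small fraction of the parameters.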

[CV-24] Deep Fourier-embedded Network for Bi-modal Salient Object Detection

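【Quick read】: This paper addresses two shortcomings of deep learning models for salient object detection on fused RGB and thermal images: the high computation and memory demands of Transformer-based models, and the frequency gap that remains between predictions and ground truth. The key to the solution is a model based purely on the fast Fourier transform, the deep Fourier-embedded network (DFENet). The fast Fourier transform captures global dependencies at low complexity, and a modal-coordinated perception attention fuses the frequency gap between the RGB and thermal modalities. A frequency-decomposed edge-aware module (FEM) and Fourier residual channel attention blocks strengthen detail extraction and refine edge features. In addition, the proposed co-focus frequency loss (CFL) dynamically weights hard frequencies to minimize the frequency gap, improving the quality of the final pixel-level prediction.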

Link: https://arxiv.org/abs/2411.18409
Authors: Pengfei Lyu,Xiaosheng Yu,Chengdong Wu,Jagath C. Rajapakse
Keywords: RGB and thermal, rapid development, significant improvement, Fourier, thermal images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 13 figures. Submitted to TMM on April 29, 2024

Abstract:The rapid development of deep learning provides a significant improvement of salient object detection combining both RGB and thermal images. However, existing deep learning-based models suffer from two major shortcomings. First, the computation and memory demands of Transformer-based models with quadratic complexity are unbearable, especially in handling high-resolution bi-modal feature fusion. Second, even if learning converges to an ideal solution, there remains a frequency gap between the prediction and ground truth. Therefore, we propose a purely fast Fourier transform-based model, namely deep Fourier-embedded network (DFENet), for learning bi-modal information of RGB and thermal images. On one hand, fast Fourier transform efficiently fetches global dependencies with low complexity. Inspired by this, we design modal-coordinated perception attention to fuse the frequency gap between RGB and thermal modalities with multi-dimensional representation enhancement. To obtain reliable detailed information during decoding, we design the frequency-decomposed edge-aware module (FEM) to clarify object edges by deeply decomposing low-level features. Moreover, we equip proposed Fourier residual channel attention block in each decoder layer to prioritize high-frequency information while aligning channel global relationships. On the other hand, we propose co-focus frequency loss (CFL) to steer FEM towards minimizing the frequency gap. CFL dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing the bi-modal edge information in the Fourier domain. This frequency-level refinement of edge features further contributes to the quality of the final pixel-level prediction. Extensive experiments on four bi-modal salient object detection benchmark datasets demonstrate our proposed DFENet outperforms twelve existing state-of-the-art models.

[CV-25] GeneQuery: A General QA-based Framework for Spatial Gene Expression Predictions from Histology Images

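【Quick read】: This paper addresses two problems in gene expression prediction: 1) existing methods fail to capture the shared dependencies and co-expression patterns between genes; 2) existing methods can only predict genes seen during training and do not generalize to unseen genes. The key to the solution is GeneQuery, which recasts gene expression prediction as question answering (QA) for better generality and flexibility. GeneQuery takes gene-related texts as queries and whole-slide images as contexts, and implicitly estimates the gene distribution by introducing a gene random variable. It comes in two architectures: spot-aware GeneQuery, which captures patterns between images, and gene-aware GeneQuery, which captures patterns between genes. Experiments show that GeneQuery outperforms existing state-of-the-art methods on both known and unseen genes and can potentially be used to analyze tissue structure.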

Link: https://arxiv.org/abs/2411.18391
Authors: Ying Xiong,Linjing Liu,Yufei Cui,Shangyu Wu,Xue Liu,Antoni B. Chan,Chun Jason Xue
Keywords: Gene expression, presents significant challenges, Gene expression profiling, Gene, molecular mechanisms
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gene expression profiling provides profound insights into molecular mechanisms, but its time-consuming and costly nature often presents significant challenges. In contrast, whole-slide hematoxylin and eosin (HE) stained histological images are readily accessible and allow for detailed examinations of tissue structure and composition at the microscopic level. Recent advancements have utilized these histological images to predict spatially resolved gene expression profiles. However, state-of-the-art works treat gene expression prediction as a multi-output regression problem, where each gene is learned independently with its own weights, failing to capture the shared dependencies and co-expression patterns between genes. Besides, existing works can only predict gene expression values for genes seen during training, limiting their ability to generalize to new, unseen genes. To address the above limitations, this paper presents GeneQuery, which aims to solve this gene expression prediction task in a question-answering (QA) manner for better generality and flexibility. Specifically, GeneQuery takes gene-related texts as queries and whole-slide images as contexts and then predicts the queried gene expression values. With such a transformation, GeneQuery can implicitly estimate the gene distribution by introducing the gene random variable. Besides, the proposed GeneQuery consists of two architecture implementations, i.e., spot-aware GeneQuery for capturing patterns between images and gene-aware GeneQuery for capturing patterns between genes. Comprehensive experiments on spatial transcriptomics datasets show that the proposed GeneQuery outperforms existing state-of-the-art methods on known and unseen genes. More results also demonstrate that GeneQuery can potentially analyze the tissue structure. 

[CV-26] Convolutional Neural Networks Do Work with Pre-Defined Filters

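【Quick read】: This paper examines the trade-off between the flexibility of convolution kernels and computational efficiency in deep Convolutional Neural Networks (CNNs). The key to the solution is a new class of networks, Pre-defined Filter Convolutional Neural Networks (PFCNNs), in which all n×n convolution kernels are pre-defined and held constant during training. PFCNNs use a special form of depthwise convolution called the Pre-defined Filter Module (PFM): in the channel-wise convolution part, the 1×n×n kernels are drawn from a fixed pool of only a few (16) pre-defined kernels, while the 1×1 convolution part learns linear combinations of the pre-defined filter outputs. Despite this strict restriction, PFCNNs still learn complex and discriminative features, which offers a new perspective on how information is processed in deep CNNs.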

Link: https://arxiv.org/abs/2411.18388
Authors: Christoph Linse,Erhardt Barth,Thomas Martinetz
Keywords: Convolutional Neural Networks, Filter Convolutional Neural, Neural Networks called, Convolutional Neural, Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a novel class of Convolutional Neural Networks called Pre-defined Filter Convolutional Neural Networks (PFCNNs), where all nxn convolution kernels with n>1 are pre-defined and constant during training. It involves a special form of depthwise convolution operation called a Pre-defined Filter Module (PFM). In the channel-wise convolution part, the 1xnxn kernels are drawn from a fixed pool of only a few (16) different pre-defined kernels. In the 1x1 convolution part linear combinations of the pre-defined filter outputs are learned. Despite this harsh restriction, complex and discriminative features are learned. These findings provide a novel perspective on the way how information is processed within deep CNNs. We discuss various properties of PFCNNs and prove their effectiveness using the popular datasets Caltech101, CIFAR10, CUB-200-2011, FGVC-Aircraft, Flowers102, and Stanford Cars. Our implementation of PFCNNs is provided on Github this https URL
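
The structure of a Pre-defined Filter Module can be sketched directly from the description above: a depthwise convolution whose kernels are fixed, followed by a 1×1 convolution whose weights are the only learned parameters. The NumPy forward pass below is a minimal sketch under a few assumptions not taken from the paper: one pooled kernel per input channel (selected by `kernel_ids`), no padding, stride 1, and cross-correlation rather than flipped convolution. All names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pfm_forward(x, pool, kernel_ids, mix):
    """Forward pass of a toy Pre-defined Filter Module.

    x:          (C, H, W) input feature map.
    pool:       (K, n, n) fixed, non-trainable kernel pool.
    kernel_ids: (C,) index into the pool for each input channel.
    mix:        (C_out, C) learnable 1x1 convolution weights.
    """
    n = pool.shape[-1]
    # windows: (C, H-n+1, W-n+1, n, n), one n x n patch per output pixel.
    windows = sliding_window_view(x, (n, n), axis=(1, 2))
    kernels = pool[kernel_ids]                                 # (C, n, n)
    depthwise = np.einsum("chwij,cij->chw", windows, kernels)  # fixed part
    return np.einsum("oc,chw->ohw", mix, depthwise)            # learned part
```

In training, gradients would flow only into `mix`; `pool` stays constant, which is the "harsh restriction" the abstract refers to.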

[CV-27] Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization

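【Quick read】: This paper addresses the high computational cost and communication overhead of Bayesian approaches to Federated Learning (FL). The key to the solution is a novel Bayesian FL method that uses an efficient second-order optimization approach whose computational cost is similar to that of first-order methods such as Adam, while retaining the benefits of the Bayesian approach, such as uncertainty estimation and personalized model learning. The method is significantly more efficient and accurate than state-of-the-art (SOTA) Bayesian FL methods in both standard and personalized FL settings.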

Link: https://arxiv.org/abs/2411.18385
Authors: Shivam Pal,Aishwarya Gupta,Saqib Sarwar,Piyush Rai
Keywords: Federated Learning, Bayesian, Bayesian approach, decentralized and heterogeneous, Bayesian approach enables
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Abstract:Federated Learning (FL) has emerged as a promising method to collaboratively learn from decentralized and heterogeneous data available at different clients without the requirement of data ever leaving the clients. Recent works on FL have advocated taking a Bayesian approach to FL as it offers a principled way to account for the model and predictive uncertainty by learning a posterior distribution for the client and/or server models. Moreover, Bayesian FL also naturally enables personalization in FL to handle data heterogeneity across the different clients by having each client learn its own distinct personalized model. In particular, the hierarchical Bayesian approach enables all the clients to learn their personalized models while also taking into account the commonalities via a prior distribution provided by the server. However, despite their promise, Bayesian approaches for FL can be computationally expensive and can have high communication costs as well because of the requirement of computing and sending the posterior distributions. We present a novel Bayesian FL method using an efficient second-order optimization approach, with a computational cost that is similar to first-order optimization methods like Adam, but also provides the various benefits of the Bayesian approach for FL (e.g., uncertainty, personalization), while also being significantly more efficient and accurate than SOTA Bayesian FL methods (both for standard as well as personalized FL settings). Our method achieves improved predictive accuracies as well as better uncertainty estimates as compared to the baselines which include both optimization based as well as Bayesian FL methods.

[CV-28] XR-MBT: Multi-modal Full Body Tracking for XR through Self-Supervision with Learned Depth Point Cloud Registration WACV2025

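Quick read: This paper addresses full-body motion tracking of users on XR (AR/VR) devices, in particular when no dedicated leg sensors are available. The key to the solution is to use the existing depth-sensing signal together with self-supervised learning to train a multi-modal pose estimation model that tracks full-body motion in real time. Specifically, the paper combines a semantic point cloud encoder network with a residual network to extend current synthesis models based on the 3-point signal so that they can handle point cloud data. These modules are trained jointly in a self-supervised way, using unregistered real point clouds and simulated data obtained from motion capture. Compared with existing state-of-the-art XR body tracking systems, the method tracks legs in XR for the first time, which traditional synthesis approaches based on partial body tracking cannot do.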

Link: https://arxiv.org/abs/2411.18377
Authors: Denys Rozumnyi,Nadine Bertsch,Othman Sbai,Filippo Arcadu,Yuhua Chen,Artsiom Sanakoyeu,Manoj Kumar,Catherine Herold,Robin Kips
Keywords (EN): authentic social presence, social presence, fundamental challenge, challenge to bring, bring a sense
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to WACV 2025

Abstract:Tracking the full body motions of users in XR (AR/VR) devices is a fundamental challenge to bring a sense of authentic social presence. Due to the absence of dedicated leg sensors, currently available body tracking methods adopt a synthesis approach to generate plausible motions given a 3-point signal from the head and controller tracking. In order to enable mixed reality features, modern XR devices are capable of estimating depth information of the headset surroundings using available sensors combined with dedicated machine learning models. Such egocentric depth sensing cannot drive the body directly, as it is not registered and is incomplete due to limited field-of-view and body self-occlusions. For the first time, we propose to leverage the available depth sensing signal combined with self-supervision to learn a multi-modal pose estimation model capable of tracking full body motions in real time on XR devices. We demonstrate how current 3-point motion synthesis models can be extended to point cloud modalities using a semantic point cloud encoder network combined with a residual network for multi-modal pose estimation. These modules are trained jointly in a self-supervised way, leveraging a combination of real unregistered point clouds and simulated data obtained from motion capture. We compare our approach against several state-of-the-art systems for XR body tracking and show that our method accurately tracks a diverse range of body motions. XR-MBT tracks legs in XR for the first time, whereas traditional synthesis approaches based on partial body tracking are blind.

[CV-29] Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

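Quick read: This paper addresses the high computational cost and slow inference that video diffusion models (VDMs) face in practical applications. The key to the solution is a new VDM compression approach built on pruning that preserves individual content and motion dynamics, together with an Individual Content and Motion Dynamics (ICMD) Consistency Loss. Specifically, the paper first observes empirically that deeper VDM layers are crucial for maintaining motion dynamics (e.g., coherence of the entire video), whereas shallower layers focus more on individual content (e.g., individual frames). It therefore prunes redundant blocks from the shallower layers while preserving more of the deeper ones, producing a lightweight VDM variant called VDMini. In addition, the ICMD Consistency Loss, which comprises the Individual Content Distillation (ICD) Loss and the Multi-frame Content Adversarial (MCA) Loss, is introduced so that the student model VDMini generates videos of quality comparable to the teacher model. The method significantly accelerates inference while maintaining high-quality video generation.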

Link: https://arxiv.org/abs/2411.18375
Authors: Yiming Wu,Huan Wang,Zhenghao Chen,Dong Xu
Keywords (EN): video diffusion model, Diffusion Model Compression, high computational cost, individual content, diffusion model
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 9 figures, 9 tables

Abstract:The high computational cost and slow inference time are major obstacles to deploying the video diffusion model (VDM) in practical applications. To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of motion dynamics, e.g., coherence of the entire video, while shallower layers are more focused on individual content, e.g., individual frames. Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an Individual Content and Motion Dynamics (ICMD) Consistency Loss to gain comparable generation performance as larger VDM, i.e., the teacher, to VDMini, i.e., the student. Particularly, we first use the Individual Content Distillation (ICD) Loss to ensure consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5× and 1.4× speed up for the I2V method SF-V and the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, i.e., UCF101 and VBench.
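
The depth-aware pruning idea in this abstract can be illustrated with a minimal sketch. The abstract does not state how blocks are chosen, so the rule used here (treat the first half of the network as shallow, keep every `shallow_stride`-th shallow block, keep every deep block) is purely an assumption for illustration, and `select_blocks` is a hypothetical helper rather than anything from the paper.

```python
from typing import List


def select_blocks(num_blocks: int, shallow_stride: int = 2) -> List[int]:
    """Return the indices of the blocks to keep after pruning.

    Assumption for this sketch: the first half of the network counts as
    "shallow" and is thinned to every `shallow_stride`-th block, while the
    "deep" second half is kept in full.
    """
    if shallow_stride < 1:
        raise ValueError("shallow_stride must be >= 1")
    half = num_blocks // 2
    shallow_kept = [i for i in range(half) if i % shallow_stride == 0]
    deep_kept = list(range(half, num_blocks))
    return shallow_kept + deep_kept


print(select_blocks(8))  # [0, 2, 4, 5, 6, 7]
```

Note that the abstract speaks of preserving "more of" the deeper layers, not all of them, so a real schedule would likely prune some deep blocks as well.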

[CV-30] G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

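Quick read: This paper addresses human-level dexterity in 3D robotic manipulation, in particular the seamless integration of geometric precision and semantic understanding. The key to the solution is the G3Flow framework, which combines 3D generative models (for digital twin creation), vision foundation models (for semantic feature extraction), and robust pose tracking (for continuous semantic flow updates) to construct real-time semantic flow, a dynamic, object-centric 3D semantic representation. This integration enables complete semantic understanding even under occlusion and removes the need for manual annotation. By incorporating semantic flow into diffusion policies, G3Flow significantly improves terminal-constrained manipulation and cross-object generalization.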

Link: https://arxiv.org/abs/2411.18369
Authors: Tianxing Chen,Yao Mu,Zhixuan Liang,Zanxin Chen,Shijia Peng,Qiangyu Chen,Mingkun Xu,Ruizhen Hu,Hongyuan Zhang,Xuelong Li,Ping Luo
Keywords (EN): Recent advances, shown promising results, advances in imitation, imitation learning, shown promising
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: Webpage: this https URL

Abstract:Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.

[CV-31] ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

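Quick read: This paper addresses the weak visual perception of multimodal large language models (MLLMs); for example, the existing model Qwen2-VL reaches a recall of only 43.9% on the COCO dataset, which limits tasks that require perception and understanding together. The solution works from two angles, model design and data development. First, it introduces ChatRex, a model with a decoupled perception design: the output boxes of a universal proposal network are fed into the LLM, which outputs the corresponding box indices to represent its detections, turning the regression task into a retrieval task that LLMs handle more proficiently. Second, it builds a fully automated data engine and the Rexverse-2M dataset, which has multiple granularities and supports joint training of perception and understanding. With these two improvements, ChatRex gains strong visual perception while preserving multimodal understanding performance.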

Link: https://arxiv.org/abs/2411.18363
Authors: Qing Jiang,Gen Luo,Yuqin Yang,Yuda Xiong,Yihao Chen,Zhaoyang Zeng,Tianhe Ren,Lei Zhang
Keywords (EN): computer vision, pillars of computer, Perception, understanding, MLLM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 19 figures

Abstract:Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLM) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g. the state-of-the-art model Qwen2-VL only achieves a 43.9 recall rate on the COCO dataset, limiting many tasks requiring the combination of perception and understanding. In this work, we aim to bridge this perception gap from both model designing and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of both perception and understanding in MLLM. Code is available at this https URL.
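
The retrieval-style detection described in this abstract can be illustrated with a minimal sketch. The abstract only says that the LLM outputs box indices referring to proposals from a universal proposal network; the `<objN>` token format, the helper name `boxes_from_indices`, and the sample proposals below are hypothetical and are not ChatRex's actual interface.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def boxes_from_indices(llm_output: str, proposals: List[Box]) -> List[Box]:
    """Map index tokens such as '<obj2>' in the LLM output back to proposal boxes.

    The '<objN>' token format is an assumption made for this sketch.
    """
    picked = []
    for match in re.finditer(r"<obj(\d+)>", llm_output):
        idx = int(match.group(1))
        if idx < len(proposals):  # silently skip indices with no matching proposal
            picked.append(proposals[idx])
    return picked


# Hypothetical proposals, as a proposal network might return them.
proposals = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
print(boxes_from_indices("dog: <obj1>, <obj2>", proposals))
```

The point of the design is that the language model never has to emit coordinates: it only selects among candidates, and the coordinates come from the proposal list.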

[CV-32] TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

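Quick read: This paper addresses the generation of standardized garment images from a single photo of a clothed person, a task it calls Virtual Try-Off (VTOFF). Unlike traditional Virtual Try-On (VTON), VTOFF aims to extract a high-fidelity image of the garment, which poses distinct challenges in capturing garment shape, texture, and intricate patterns. The key to the solution is the TryOffDiff model, which combines Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments show that TryOffDiff outperforms baselines based on pose transfer and virtual try-on while needing fewer pre- and post-processing steps. The paper also finds that traditional image generation metrics assess reconstruction quality inadequately, and therefore uses DISTS for more accurate evaluation.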

Link: https://arxiv.org/abs/2411.18350
Authors: Riza Velioglu,Petra Bevandic,Robin Chan,Barbara Hammer
Keywords (EN): introduces Virtual Try-Off, paper introduces Virtual, generating standardized garment, standardized garment images, clothed individuals
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals. Unlike traditional Virtual Try-On (VTON), which digitally dresses models, VTOFF aims to extract a canonical garment image, posing unique challenges in capturing garment shape, texture, and intricate patterns. This well-defined target makes VTOFF particularly effective for evaluating reconstruction fidelity in generative models. We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments on a modified VITON-HD dataset show that our approach outperforms baseline methods based on pose transfer and virtual try-on with fewer pre- and post-processing steps. Our analysis reveals that traditional image generation metrics inadequately assess reconstruction quality, prompting us to rely on DISTS for more accurate evaluation. Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction. Demo, code, and models are available at: this https URL

[CV-33] Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation

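Quick read: This paper addresses the lack of data for omnidirectional stereo depth estimation. The key to the solution is Helvipad, a real-world dataset of 40K frames from video sequences recorded in diverse environments, including crowded indoor and outdoor scenes under varied lighting conditions. The data were collected with two 360° cameras in a top-bottom setup and a LiDAR sensor; accurate depth and disparity labels are obtained by projecting the 3D point clouds onto equirectangular images. The paper also uses depth completion to increase the label density of the training set. Benchmarking existing stereo depth estimation models shows that, although they perform well on standard images, accurate depth estimation on omnidirectional images remains a significant challenge. The paper therefore introduces the necessary adaptations to stereo models to improve omnidirectional stereo depth estimation.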

Link: https://arxiv.org/abs/2411.18335
Authors: Mehdi Zayene,Jannik Endres,Albias Havolli,Charles Corbière,Salim Cherkaoui,Alexandre Kontouli,Alexandre Alahi
Keywords (EN): imaging remains underexplored, remains underexplored, stereo depth estimation, considerable progress, depth estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Despite considerable progress in stereo depth estimation, omnidirectional imaging remains underexplored, mainly due to the lack of appropriate data. We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation, consisting of 40K frames from video sequences across diverse environments, including crowded indoor and outdoor scenes with diverse lighting conditions. Collected using two 360° cameras in a top-bottom setup and a LiDAR sensor, the dataset includes accurate depth and disparity labels by projecting 3D point clouds onto equirectangular images. Additionally, we provide an augmented training set with a significantly increased label density by using depth completion. We benchmark leading stereo depth estimation models for both standard and omnidirectional images. The results show that while recent stereo methods perform decently, a significant challenge persists in accurately estimating depth in omnidirectional imaging. To address this, we introduce necessary adaptations to stereo models, achieving improved performance.
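
The labelling step mentioned in this abstract, projecting 3D points onto an equirectangular image, can be sketched as follows. The axis convention (x forward, y left, z up, image centre looking along +x) and the helper name `project_equirect` are assumptions for illustration; the dataset's actual calibration and conventions are not given in the abstract.

```python
import math
from typing import Tuple


def project_equirect(x: float, y: float, z: float,
                     width: int, height: int) -> Tuple[float, float, float]:
    """Project a 3D point to equirectangular pixel coordinates (u, v) plus depth.

    Assumed convention for this sketch: x forward, y left, z up, with the
    image centre looking along +x and the top row looking straight up.
    """
    depth = math.sqrt(x * x + y * y + z * z)
    if depth == 0.0:
        raise ValueError("the origin has no defined viewing direction")
    azimuth = math.atan2(y, x)         # in [-pi, pi], positive to the left
    elevation = math.asin(z / depth)   # in [-pi/2, pi/2], positive upwards
    u = (0.5 - azimuth / (2.0 * math.pi)) * width
    v = (0.5 - elevation / math.pi) * height
    return u, v, depth


# A point one metre straight ahead lands at the image centre.
print(project_equirect(1.0, 0.0, 0.0, 1000, 500))  # (500.0, 250.0, 1.0)
```

Each projected LiDAR point then contributes one depth label at pixel (u, v), which is why the raw labels are sparse and depth completion is useful for densifying them.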

[CV-34] EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond

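Quick read: This paper addresses the difficulty, in Event-based Action Recognition (EAR), of jointly exploiting the dense temporal and sparse spatial properties that are specific to asynchronous event data. The key to the solution is a synergy-aware framework, EventCrab, which integrates "lighter" frame-specific networks for dense event frames with "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. The paper also establishes a joint frame-text-point representation space to bridge event frames and event points. To better exploit the spatiotemporal relationships inherent in asynchronous event points, it devises two strategies: a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams, and an Event Point Encoder (EPE) that further explores long spatiotemporal features of event points in a Hilbert-scan way. Experiments show that EventCrab improves performance significantly on several datasets, with gains of 5.17% on SeAct and 7.01% on HARDVS.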

Link: https://arxiv.org/abs/2411.18328
Authors: Meiqi Cao,Xiangbo Shu,Jiachao Zhang,Rui Yan,Zechao Li,Jinhui Tang
Keywords (EN): Event-based Action Recognition, traditional action recognition, Action Recognition, Event-based Action, high-temporal resolution capturing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event-based Action Recognition (EAR) possesses the advantages of high-temporal resolution capturing and privacy preservation compared with traditional action recognition. Current leading EAR solutions typically follow two regimes: project unconstructed event streams into dense constructed event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to handle sparse unconstructed event points directly. However, such two regimes are blind to a fundamental issue: failing to accommodate the unique dense temporal and sparse spatial properties of asynchronous event data. In this article, we present a synergy-aware framework, i.e., EventCrab, that adeptly integrates the “lighter” frame-specific networks for dense event frames with the “heavier” point-specific networks for sparse event points, balancing accuracy and efficiency. Furthermore, we establish a joint frame-text-point representation space to bridge distinct event frames and points. In specific, to better exploit the unique spatiotemporal relationships inherent in asynchronous event points, we devise two strategies for the “heavier” point-specific embedding: i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams. ii) an Event Point Encoder (EPE) that further explores event-point long spatiotemporal features in a Hilbert-scan way. Experiments on four datasets demonstrate the significant performance of our proposed EventCrab, particularly gaining improvements of 5.17% on SeAct and 7.01% on HARDVS.
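
As background for the "Hilbert-scan" ordering mentioned in this abstract, here is a minimal sketch of the classic 2D Hilbert curve index. It is only an illustration under simplifying assumptions: the abstract does not say how the Event Point Encoder builds its scan, the real scan may well include the time axis, and the helper names `hilbert_index` and `hilbert_scan` are hypothetical.

```python
from typing import List, Tuple


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve on an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_scan(points: List[Tuple[int, int]], n: int) -> List[Tuple[int, int]]:
    """Order 2D points by their Hilbert index so spatial neighbours stay close."""
    return sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))


print(hilbert_scan([(1, 0), (0, 0), (1, 1), (0, 1)], 2))
```

The appeal of such an ordering is locality: cells that are consecutive along the curve are always adjacent in the grid, so a sequence model reading the points in this order sees spatially coherent neighbourhoods.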

[CV-35] Mixture of Experts in Image Classification: What's the Sweet Spot?

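Quick read: This paper addresses how to integrate Mixture-of-Experts (MoE) models effectively in computer vision to achieve parameter-efficient scaling. The solution explores how different MoE configurations perform on open datasets and finds that image classification models obtain the best results when the number of parameters activated per sample is moderate. As the number of parameters per sample grows, however, the improvement gradually vanishes.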

Link: https://arxiv.org/abs/2411.18322
Authors: Mathurin Videau,Alessandro Leite,Marc Schoenauer,Olivier Teytaud
Keywords (EN): shown promising potential, shown promising, promising potential, potential for parameter-efficient, parameter-efficient scaling
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across various domains. However, the implementation in computer vision remains limited, and often requires large-scale datasets comprising billions of samples. In this study, we investigate the integration of MoE within computer vision models and explore various MoE configurations on open datasets. When introducing MoE layers in image classification, the best results are obtained for models with a moderate number of activated parameters per sample. However, such improvements gradually vanish when the number of parameters per sample increases.
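
To make "activated parameters per sample" concrete, the following sketch contrasts total and activated parameter counts for a model with MoE layers. It assumes the common top-k routing scheme, in which each sample passes through the shared weights plus `top_k` experts; the abstract does not specify the routing used, and the function name and the numbers are hypothetical.

```python
from typing import Tuple


def moe_param_counts(shared_params: int, expert_params: int,
                     num_experts: int, top_k: int) -> Tuple[int, int]:
    """Return (total parameters, parameters activated per sample).

    Assumes top-k routing: every sample uses the shared weights plus
    `top_k` of the `num_experts` experts.
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * expert_params
    activated = shared_params + top_k * expert_params
    return total, activated


# Hypothetical sizes: 10M shared weights, 8 experts of 2M each, top-2 routing.
print(moe_param_counts(10_000_000, 2_000_000, 8, 2))  # (26000000, 14000000)
```

Adding experts raises the total count without changing the per-sample count, which is the sense in which MoE scaling is parameter-efficient.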

[CV-36] Real-time Video Target Tracking Algorithm Utilizing Convolutional Neural Networks (CNN)

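Quick read: This paper aims to improve the accuracy and robustness of real-time video target tracking in complex scenes. The key to the solution is an algorithm based on Convolutional Neural Networks (CNN) that continuously updates the target model through an online learning mechanism, so that it adapts to morphological changes of the target and to background interference. When handling target occlusion, rapid motion, and complex backgrounds, the algorithm shows higher tracking success rates and lower failure rates, improving the accuracy and stability of tracking while maintaining high processing speed.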

Link: https://arxiv.org/abs/2411.18314
Authors: Chaoyi Tan,Xiangtian Li,Xiaobo Wang,Zhen Qi,Ao Xiang
Keywords (EN): http URL continuously updates the target model through an online, http URL study successfully applies CNN to real-time video target, http URL is expected to provide new solutions for target tracking tasks in, This paper aims to research and implement a real-time video target tracking algorithm based on Convolutional Neural Networks, URL is expected to provide new solutions for target tracking tasks in video surveillance and intelligent transportation domains
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper aims to research and implement a real-time video target tracking algorithm based on Convolutional Neural Networks (CNN), enhancing the accuracy and robustness of target tracking in complex this http URL algorithms in handling issues such as target occlusion, morphological changes, and background interference, our this http URL continuously updates the target model through an online learning mechanism to adapt to changes in the target’s this http URL, when dealing with situations involving rapid motion, partial occlusion, and complex backgrounds, the proposed algorithm exhibits higher tracking success rates and lower failure rates this http URL study successfully applies CNN to real-time video target tracking, improving the accuracy and stability of the tracking algorithm while maintaining high processing speeds, thus this http URL is expected to provide new solutions for target tracking tasks in video surveillance and intelligent transportation domains.

[CV-37] Neural Surface Priors for Editable Gaussian Splatting

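Quick read: This paper addresses the recovery of easily modifiable representations of 3D geometry and appearance from image data. The key to the solution is a new method based on 3D Gaussian Splatting: the underlying geometry is reconstructed with a neural Signed Distance Field and a high-quality mesh is extracted, after which a set of Gaussians is estimated whose opacity is conditioned on the recovered neural surface. To facilitate editing, a proxy representation is produced that encodes the shape and position of the Gaussians. Unlike other methods, this pipeline lets modifications applied to the extracted mesh propagate to the proxy representation, from which the updated Gaussian parameters are recovered, so that the appearance representation is edited as well. By relying on mesh-guided transformations, the method simplifies 3D scene editing and improves on existing methods in the usability and visual fidelity of edits.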

Link: https://arxiv.org/abs/2411.18311
Authors: Jakub Szymkowiak,Weronika Jakubowska,Dawid Malarz,Weronika Smolak-Dyżewska,Maciej Zięba,Przemysław Musialski,Wojtek Pałubicki,Przemysław Spurek
Keywords (EN): easily modifiable representations, recover easily modifiable, Signed Distance Field, computer graphics, easily modifiable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures

Abstract:In computer graphics, there is a need to recover easily modifiable representations of 3D geometry and appearance from image data. We introduce a novel method for this task using 3D Gaussian Splatting, which enables intuitive scene editing through mesh adjustments. Starting with input images and camera poses, we reconstruct the underlying geometry using a neural Signed Distance Field and extract a high-quality mesh. Our model then estimates a set of Gaussians, where each component is flat, and the opacity is conditioned on the recovered neural surface. To facilitate editing, we produce a proxy representation that encodes information about the Gaussians’ shape and position. Unlike other methods, our pipeline allows modifications applied to the extracted mesh to be propagated to the proxy representation, from which we recover the updated parameters of the Gaussians. This effectively transfers the mesh edits back to the recovered appearance representation. By leveraging mesh-guided transformations, our approach simplifies 3D scene editing and offers improvements over existing methods in terms of usability and visual fidelity of edits. The complete source code for this project can be accessed at this https URL

[CV-38] MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement

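Quick read: This paper addresses two shortcomings of existing CT report generation (CTRG) methods: they fail to integrate diagnostic information from multiple views, and they lack clinical expertise. The key to the solution is a new Multi-view perception Knowledge-enhanced Transformer (MvKeTR) that mimics the diagnostic workflow of clinicians. Specifically, a Multi-View Perception Aggregator (MVPA) with view-aware attention integrates diagnostic information from multiple anatomical views, and a Cross-Modal Knowledge Enhancer (CMKE) retrieves similar reports to bring domain knowledge into the diagnostic process. In addition, Kolmogorov-Arnold Networks (KANs) serve as the basic building blocks, to better capture the intricate diagnostic patterns in CT interpretation.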

Link: https://arxiv.org/abs/2411.18309
Authors: Xiwei Deng,Xianchun He,Yudan Zhou,Shuhui Cai,Congbo Cai,Zhong Chen
Keywords (EN): relieving clinicians’ workload, improving patient care, automatically generate diagnostic, perception Knowledge-enhanced Transformer, aims to automatically
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures

Abstract:CT report generation (CTRG) aims to automatically generate diagnostic reports for 3D volumes, relieving clinicians’ workload and improving patient care. Despite clinical value, existing works fail to effectively incorporate diagnostic information from multiple anatomical views and lack related clinical expertise essential for accurate and reliable diagnosis. To resolve these limitations, we propose a novel Multi-view perception Knowledge-enhanced Transformer (MvKeTR) to mimic the diagnostic workflow of clinicians. Just as radiologists first examine CT scans from multiple planes, a Multi-View Perception Aggregator (MVPA) with view-aware attention effectively synthesizes diagnostic information from multiple anatomical views. Then, inspired by how radiologists further refer to relevant clinical records to guide diagnostic decision-making, a Cross-Modal Knowledge Enhancer (CMKE) retrieves the most similar reports based on the query volume to incorporate domain knowledge into the diagnosis procedure. Furthermore, instead of traditional MLPs, we employ Kolmogorov-Arnold Networks (KANs) with learnable nonlinear activation functions as the fundamental building blocks of both modules to better capture intricate diagnostic patterns in CT interpretation. Extensive experiments on the public CTRG-Chest-548K dataset demonstrate that our method outpaces prior state-of-the-art models across all metrics.

[CV-39] InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

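Quick read: This paper addresses the restriction of existing motion generation methods to short sequences, which stems from the lack of long-motion training data. The key to the solution is the InfiniDreamer framework, which generates a sub-motion for each textual description and assembles the sub-motions into a coarse extended sequence using randomly initialized transition segments. It then introduces an optimization-based method, Segment Score Distillation (SSD), which refines the whole long motion sequence in a training-free manner using an existing prior trained only on short clips. SSD iteratively refines overlapping short segments sampled from the coarse extended sequence, progressively aligning them with the pre-trained motion diffusion prior, which ensures local coherence within each segment while keeping the whole sequence globally consistent.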

Link: https://arxiv.org/abs/2411.18303
Authors: Wenjie Zhuo,Fan Ma,Hehe Fan
Keywords (EN): arbitrarily long human, human motion generation, motion, long human motion, present InfiniDreamer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present InfiniDreamer, a novel framework for arbitrarily long human motion generation. InfiniDreamer addresses the limitations of current motion generation methods, which are typically restricted to short sequences due to the lack of long motion training data. To achieve this, we first generate sub-motions corresponding to each textual description and then assemble them into a coarse, extended sequence using randomly initialized transition segments. We then introduce an optimization-based method called Segment Score Distillation (SSD) to refine the entire long motion sequence. SSD is designed to utilize an existing motion prior, which is trained only on short clips, in a training-free manner. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.
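
The segment sampling underlying this description can be sketched as a sliding window over the long sequence. The abstract says only that overlapping short segments are sampled, so the deterministic stride used here, along with the helper name `overlapping_segments` and its parameters, is an assumption for illustration; the refinement step against the diffusion prior is not shown.

```python
from typing import List, Tuple


def overlapping_segments(total_len: int, seg_len: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges of overlapping windows covering a sequence.

    Windows overlap whenever stride < seg_len. A final window is added when
    the regular stride would leave the tail of the sequence uncovered.
    """
    if seg_len < 1 or stride < 1:
        raise ValueError("seg_len and stride must be >= 1")
    if seg_len >= total_len:
        return [(0, total_len)]
    starts = list(range(0, total_len - seg_len + 1, stride))
    if starts[-1] + seg_len < total_len:
        starts.append(total_len - seg_len)
    return [(s, s + seg_len) for s in starts]


print(overlapping_segments(10, 4, 3))  # [(0, 4), (3, 7), (6, 10)]
```

Because neighbouring windows share frames, refining each window separately still ties adjacent segments together, which matches the intuition behind the global consistency claim in the abstract.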

[CV-40] Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

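Quick read: This paper addresses the subject neglect or mixing that the Multimodal Diffusion Transformer (MMDiT) exhibits when the input text contains multiple subjects of similar semantics or appearance. The key to the solution is to repair the ambiguous latent on the fly through test-time optimization at early denoising steps. Specifically, the paper designs three loss functions, the Block Alignment Loss, the Text Encoder Alignment Loss, and the Overlap Loss, which mitigate three ambiguities in the model architecture (Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity). To further address semantic ambiguity, it also proposes Overlap Online Detection and a Back-to-Start Sampling Strategy. Experiments show that the method performs well when generating similar subjects, with clearly higher generation quality and success rates.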

Link: https://arxiv.org/abs/2411.18301
Authors: Tianyi Wei,Dongdong Chen,Yifan Zhou,Xingang Pan
Keywords (EN): Multimodal Diffusion Transformer, latest Multimodal Diffusion, Diffusion Transformer, Multimodal Diffusion, Representing the cutting-edge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. Our code will be available at this https URL.

[CV-41] HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning

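Quick read: This paper addresses the reduced visibility of underwater images caused by light refraction and absorption, while balancing visual quality against the needs of practical applications. The key to the solution is a heuristic invertible network (HUPE) that uses an information-preserving reversible transformation with an embedded Fourier transform to establish a bidirectional mapping between underwater images and their clear counterparts. By further combining a heuristic prior with a semantic collaborative learning module, the method improves visual quality and also adapts better to downstream tasks, striking an effective balance between visual enhancement and application needs.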

Link: https://arxiv.org/abs/2411.18296
Authors: Zengxi Zhang,Zhiying Jiang,Long Ma,Jinyuan Liu,Xin Fan,Risheng Liu
Keywords (EN): refraction and absorption, reducing visibility, affected by light, light refraction, visibility and interfering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 21 figures

Abstract:Underwater images are often affected by light refraction and absorption, reducing visibility and interfering with subsequent applications. Existing underwater image enhancement methods primarily focus on improving visual quality while overlooking practical implications. To strike a balance between visual quality and application, we propose a heuristic invertible network for underwater perception enhancement, dubbed HUPE, which enhances visual quality and demonstrates flexibility in handling other downstream tasks. Specifically, we introduced an information-preserving reversible transformation with embedded Fourier transform to establish a bidirectional mapping between underwater images and their clear images. Additionally, a heuristic prior is incorporated into the enhancement process to better capture scene information. To further bridge the feature gap between vision-based enhancement images and application-oriented images, a semantic collaborative learning module is applied in the joint optimization process of the visual enhancement task and the downstream task, which guides the proposed enhancement model to extract more task-oriented semantic features while obtaining visually pleasing images. Extensive experiments, both quantitative and qualitative, demonstrate the superiority of our HUPE over state-of-the-art methods. The source code is available at this https URL.

[CV-42] HiFiVFS: High Fidelity Video Face Swapping

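Quick read: This paper addresses temporal stability in video face swapping and the difficulty of preserving fine-grained attributes (such as lighting and makeup) with diffusion models (DMs). The key to the solution is a high fidelity video face swapping (HiFiVFS) framework that leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). Specifically, it builds a fine-grained attribute module that uses identity desensitization and adversarial learning to extract identity-disentangled, fine-grained attribute features, and it introduces detailed identity injection to further enhance identity similarity. Experiments show that the method reaches state-of-the-art (SOTA) results in video face swapping.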

Link: https://arxiv.org/abs/2411.18293
Authors: Xu Chen,Keke He,Junwei Zhu,Yanhao Ge,Wei Li,Chengjie Wang
Keywords (EN): Face swapping, Face swapping aims, video face swapping, aims to generate, generate results
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) in video face swapping, both qualitatively and quantitatively.

[CV-43] Don't Let Your Robot be Harmful: Responsible Robotic Manipulation

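Quick read: This paper addresses the severe safety risks, such as poisonings, fires, and explosions, that can arise when robots execute human instructions. The key to the solution is a strategy called Safety-as-policy, which has two core components: (i) a world model that automatically generates scenarios containing safety risks and conducts virtual interactions, and (ii) a mental model that infers consequences and gradually develops a cognition of safety, so that robots avoid dangers while completing tasks. The paper also creates the SafeBox synthetic dataset, which contains one hundred responsible robotic manipulation tasks with different safety risk scenarios and effectively reduces the risks of real-world experiments. Experiments show that Safety-as-policy avoids risks and completes tasks efficiently on both the synthetic dataset and in real-world experiments, significantly outperforming baseline methods.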

Link: https://arxiv.org/abs/2411.18289
Authors: Minheng Ni,Lei Zhang,Zihan Chen,Lei Zhang,Wangmeng Zuo
Keywords (EN): Unthinking execution, responsible robotic manipulation, robotic manipulation, severe safety risks, execution of human
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Unthinking execution of human instructions in robotic manipulation can lead to severe safety risks, such as poisonings, fires, and even explosions. In this paper, we present responsible robotic manipulation, which requires robots to consider potential hazards in the real-world environment while completing instructions and performing complex operations safely and efficiently. However, such scenarios in the real world are variable and risky for training. To address this challenge, we propose Safety-as-policy, which includes (i) a world model to automatically generate scenarios containing safety risks and conduct virtual interactions, and (ii) a mental model to infer consequences with reflections and gradually develop the cognition of safety, allowing robots to accomplish tasks while avoiding dangers. Additionally, we create the SafeBox synthetic dataset, which includes one hundred responsible robotic manipulation tasks with different safety risk scenarios and instructions, effectively reducing the risks associated with real-world experiments. Experiments demonstrate that Safety-as-policy can avoid risks and efficiently complete tasks on both the synthetic dataset and in real-world experiments, significantly outperforming baseline methods. Our SafeBox dataset shows consistent evaluation results with real-world scenarios, serving as a safe and effective benchmark for future research.

[CV-44] Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks

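[Quick Read]: This paper addresses key problems in multispectral object detection, including spectral discrepancies, spatial misalignment, and environmental dependencies, which hinder the generalization of multispectral detection systems across scenarios. The key to the solution is the first fair and reproducible benchmark for evaluating training "techniques": it systematically classifies existing multispectral object detection methods, studies their sensitivity to hyper-parameters, and standardizes the core configurations. The paper also introduces an efficient, easily deployable multispectral detection framework that turns high-performing single-modality models into dual-modality models while integrating the advanced training techniques.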

Link: https://arxiv.org/abs/2411.18288
Authors: Chen Zhou, Peng Cheng, Junfeng Fang, Yifan Zhang, Yibo Yan, Xiaojun Jia, Yanyan Xu, Kun Wang, Xiaochun Cao
Keywords: RGB and TIR, Multispectral object detection, Multispectral object, thermal infrared, object detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to clearly distinguish the performance gains of multispectral detection systems from the impact of these “optimization techniques”. Worse still, despite the rapid emergence of high-performing single-modality detection models, there is still a lack of specialized training techniques that can effectively adapt these models for multispectral detection tasks. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating the effectiveness of new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate the training “techniques”, which systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, utilizing various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models into dual-modality models, integrating our advanced training techniques.

[CV-45] MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

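[Quick Read]: This paper addresses the problems of identity consistency and controllable motion dynamics in personalized Text-to-Video (T2V) generation. The key to the solution is MotionCharacter, an efficient, high-fidelity human video generation framework built on the following components:

ID-preserving module: maintains identity fidelity while still allowing flexible attribute modifications.
ID-consistency and region-aware loss mechanisms: significantly strengthen identity consistency and detail fidelity.
Motion control module: prioritizes action-related text while keeping the subject consistent.
Human-Motion dataset: uses large language models to generate detailed motion descriptions.
Motion intensity parameterization: a single coefficient simplifies user control at inference time.

Together, these components give MotionCharacter significant improvements in identity preservation and high-quality video generation.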

Link: https://arxiv.org/abs/2411.18281
Authors: Haopeng Fang, Di Qiu, Binjie Mao, Pengfei Yan, He Tang
Keywords: integrating character-specific identities, Recent advancements, advancements in personalized, importance of integrating, integrating character-specific
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in ID-preserving, high-quality video generation.

[CV-46] Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

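[Quick Read]: This paper addresses the vulnerability of vision-language models (VLMs) to adversarial attacks in autonomous driving (AD). The key to the solution is ADvLM, an adversarial attack framework designed specifically for VLMs in AD. ADvLM introduces two core techniques: Semantic-Invariant Induction and Scenario-Associated Enhancement. Semantic-Invariant Induction uses a large language model to build a diverse library of textual instructions with consistent semantic content. Scenario-Associated Enhancement uses attention mechanisms to select key frames and perspectives and optimizes adversarial perturbations so that they generalize across the entire driving scenario. Together, these techniques improve the effectiveness and applicability of adversarial attacks on AD VLMs.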

Link: https://arxiv.org/abs/2411.18275
Authors: Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, Xianglong Liu
Keywords: enhancing reasoning capabilities, significantly advanced autonomous, Vision-language models, advanced autonomous driving, reasoning capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) have significantly advanced autonomous driving (AD) by enhancing reasoning capabilities. However, these models remain highly vulnerable to adversarial attacks. While existing research has primarily focused on general VLM attacks, the development of attacks tailored to the safety-critical AD context has been largely overlooked. In this paper, we take the first step toward designing adversarial attacks specifically targeting VLMs in AD, exposing the substantial risks these attacks pose within this critical domain. We identify two unique challenges for effective adversarial attacks on AD VLMs: the variability of textual instructions and the time-series nature of visual scenarios. To this end, we propose ADvLM, the first visual adversarial attack framework specifically designed for VLMs in AD. Our framework introduces Semantic-Invariant Induction, which uses a large language model to create a diverse prompt library of textual instructions with consistent semantic content, guided by semantic entropy. Building on this, we introduce Scenario-Associated Enhancement, an approach where attention mechanisms select key frames and perspectives within driving scenarios to optimize adversarial perturbations that generalize across the entire scenario. Extensive experiments on several AD VLMs over multiple benchmarks show that ADvLM achieves state-of-the-art attack effectiveness. Moreover, real-world attack studies further validate its applicability and potential in practice.

[CV-47] Grid-augumented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents

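[Quick Read]: This paper addresses the weak spatial localization of multimodal models that otherwise do well at object recognition and scene understanding. The key to the solution is explicit visual position encoding: a 9x9 black grid pattern is overlaid on the input image to give the model visual spatial guidance, analogous to positional encoding in transformers but in an explicit, visual form. Experiments on the COCO 2017 dataset show a marked gain in localization accuracy, with IoU rising from 0.27 to 0.56 and GIoU from 0.18 to 0.53. The method's simplicity and effectiveness make it well suited to applications that need precise spatial reasoning, such as robotic manipulation, medical imaging, and autonomous navigation.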

Link: https://arxiv.org/abs/2411.18270
Authors: Joongwon Chae, Zhenyu Wang, Peiwu Qin
Keywords: demonstrated impressive capabilities, Recent advances, scene understanding, advances in multimodal, demonstrated impressive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures

Abstract: Recent advances in multimodal models have demonstrated impressive capabilities in object recognition and scene understanding. However, these models often struggle with precise spatial localization - a critical capability for real-world applications. Inspired by how humans use grid-based references like chess boards and maps, we propose introducing explicit visual position encoding through a simple grid overlay approach. By adding a 9x9 black grid pattern onto input images, our method provides visual spatial guidance analogous to how positional encoding works in transformers, but in an explicit, visual form. Experiments on the COCO 2017 dataset demonstrate that our grid-based approach achieves significant improvements in localization accuracy, with a 107.4% increase in IoU (from 0.27 to 0.56) and a 194.4% improvement in GIoU (from 0.18 to 0.53) compared to baseline performance. Through attention visualization analysis, we show how this visual position encoding helps models better ground spatial relationships. Our method’s simplicity and effectiveness make it particularly valuable for applications requiring accurate spatial reasoning, such as robotic manipulation, medical imaging, and autonomous navigation.
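
The grid overlay and the reported metrics can be sketched in a few lines of plain Python. This is an illustration only, not the authors' code: `overlay_grid`, `iou_and_giou`, and `relative_gain` are hypothetical names, the image is a grayscale list of rows, and the choice of one-pixel interior lines for a 9x9 cell grid is an assumption, since the abstract does not give the line width or placement.

```python
def overlay_grid(image, cells=9, line_value=0):
    """Return a copy of a grayscale image (list of rows) with grid lines drawn.

    Assumption: a cells x cells grid means cells - 1 interior lines per
    direction, each one pixel wide. The paper's exact layout is not stated.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    rows = {round(k * height / cells) for k in range(1, cells)}
    cols = {round(k * width / cells) for k in range(1, cells)}
    for y in range(height):
        for x in range(width):
            if y in rows or x in cols:
                out[y][x] = line_value
    return out


def iou_and_giou(a, b):
    """IoU and GIoU for two boxes (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    iou = inter / union
    giou = iou - (hull - union) / hull
    return iou, giou


def relative_gain(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before
```

With the figures from the abstract, `relative_gain(0.27, 0.56)` rounds to 107.4 and `relative_gain(0.18, 0.53)` rounds to 194.4, which matches the percentages the authors report.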

[CV-48] Incomplete Multi-view Multi-label Classification via a Dual-level Contrastive Learning Framework

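[Quick Read]: This paper addresses the double-missing problem in multi-view multi-label classification, where both views and labels are incomplete. The key to the solution is a dual-level contrastive learning framework that decouples consistent information and view-specific information into different spaces and uses contrastive learning to separate these two heterogeneous properties. Specifically, the method first introduces a two-channel decoupling module, containing a shared representation and a view-proprietary representation, to extract consistency and complementarity information across all views. Second, to filter high-quality consistent information out of the multi-view representations, it applies two contrastive consistency objectives, one on the high-level features and one on the semantic labels. Experiments show more stable and superior classification performance on several widely used benchmark datasets.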

Link: https://arxiv.org/abs/2411.18267
Authors: Bingyan Nie, Wulin Xie, Jiang Long, Xiaohuan Lu
Keywords: comprehensive data analysis, multi-view multi-label classification, multi-label classification, analysis and exploration, significant domains
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, multi-view and multi-label classification have become significant domains for comprehensive data analysis and exploration. However, incompleteness both in views and labels is still a real-world scenario for multi-view multi-label classification. In this paper, we seek to focus on double missing multi-view multi-label classification tasks and propose our dual-level contrastive learning framework to solve this issue. Different from the existing works, which couple consistent information and view-specific information in the same feature space, we decouple the two heterogeneous properties into different spaces and employ contrastive learning theory to fully disentangle the two properties. Specifically, our method first introduces a two-channel decoupling module that contains a shared representation and a view-proprietary representation to effectively extract consistency and complementarity information across all views. Second, to efficiently filter out high-quality consistent information from multi-view representations, two consistency objectives based on contrastive learning are conducted on the high-level features and the semantic labels, respectively. Extensive experiments on several widely used benchmark datasets demonstrate that the proposed method has more stable and superior classification performance.
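
As a rough illustration of what a contrastive consistency objective between two views can look like, here is a generic InfoNCE loss in plain Python. It is a sketch under assumptions, not the paper's loss: the abstract does not give the exact formulation, `info_nce` and `cosine` are hypothetical names, the temperature is arbitrary, and the handling of missing views and labels is left out entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(view_a, view_b, temperature=0.5):
    """Average InfoNCE loss; (view_a[i], view_b[i]) are the positive pairs.

    Every other sample in view_b acts as a negative for anchor view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when the shared representations of the same sample agree across views and large when they are mismatched, which is the behaviour a consistency objective is meant to reward.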

[CV-49] TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

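[Quick Read]: This paper addresses the high computational cost of applying pre-trained text-to-image diffusion models to real-world image super-resolution (Real-ISR). The key to the solution is TSD-SR, a new distillation framework that aims to build an efficient and effective one-step model. Specifically, the paper introduces Target Score Distillation, which uses the priors of diffusion models together with real image references to achieve more realistic restoration, and a Distribution-Aware Sampling Module, which makes detail-oriented gradients easier to obtain and so tackles the recovery of fine details. Experiments show that TSD-SR outperforms earlier Real-ISR methods based on pre-trained diffusion priors in both restoration quality and inference speed.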

Link: https://arxiv.org/abs/2411.18263
Authors: Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou
Keywords: real-world image super-resolution, increasingly applied, diffusion models, image super-resolution, Target Score Distillation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Pre-trained text-to-image diffusion models are increasingly applied to real-world image super-resolution (Real-ISR) tasks. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration or detail recovery is not satisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce the Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Secondly, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that our TSD-SR has superior restoration results (most of the metrics perform the best) and the fastest inference speed (e.g. 40 times faster than SeeSR) compared to the past Real-ISR approaches based on pre-trained diffusion priors.

[CV-50] SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation

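[Quick Read]: This paper addresses two main problems in monocular metric depth estimation. First, traditional discriminative methods (e.g., Metric3D, UniDepth), trained on real-world data, predict metric depth accurately but tend to produce over-smoothed or low-detail depth maps. Second, generative methods (e.g., Marigold, Lotus), trained on synthetic data, produce depth maps with sharp boundaries but provide only relative depth with low accuracy. The key to the solution is SharpDepth, a new method that combines the metric accuracy of discriminative methods with the boundary sharpness of generative methods, yielding depth predictions that are both metrically precise and visually sharp. It performs well in zero-shot evaluation and suits diverse real-world environments that require high-quality depth perception.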

Link: https://arxiv.org/abs/2411.18229
Authors: Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, Rang Nguyen
Keywords: sharpness typically achieved, fine-grained boundary sharpness, boundary sharpness typically, depth estimation methods, monocular metric depth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Uncompressed version can be found in this https URL

Abstract: We propose SharpDepth, a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods (e.g., Metric3D, UniDepth) with the fine-grained boundary sharpness typically achieved by generative methods (e.g., Marigold, Lotus). Traditional discriminative models trained on real-world data with sparse ground-truth depth can accurately predict metric depth but often produce over-smoothed or low-detail depth maps. Generative models, in contrast, are trained on synthetic data with dense ground truth, generating depth maps with sharp boundaries yet only providing relative depth with low accuracy. Our approach bridges these limitations by integrating metric accuracy with detailed boundary preservation, resulting in depth predictions that are both metrically precise and visually sharp. Our extensive zero-shot evaluations on standard depth estimation benchmarks confirm SharpDepth's effectiveness, showing its ability to achieve both high depth accuracy and detailed representation, making it well-suited for applications requiring high-quality depth perception across diverse, real-world environments.

[CV-51] PATHS: A Hierarchical Transformer for Efficient Whole Slide Image Analysis

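[Quick Read]: This paper addresses the noise and computational burden in computational pathology caused by the many uninformative patches in Whole Slide Images (WSIs), such as patches containing only healthy or adipose tissue. The key to the solution is Pathology Transformer with Hierarchical Selection (PATHS), a hierarchical weakly supervised representation learning method that mimics how a human pathologist recursively filters relevant patches across magnification levels, which cuts the number of uninformative patches processed and improves both efficiency and model performance. Because only a small subset of patches is retained at each level, PATHS can afford full quadratic self-attention, and it provides a simple, interpretable measure of region importance, improving accuracy on slide-level prediction tasks.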

Link: https://arxiv.org/abs/2411.18225
Authors: Zak Buzzard, Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik
Keywords: significant research progress, recent years, research progress, progress in recent, applications ranging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Computational analysis of whole slide images (WSIs) has seen significant research progress in recent years, with applications ranging across important diagnostic and prognostic tasks such as survival or cancer subtype prediction. Many state-of-the-art models process the entire slide - which may be as large as 150,000 × 150,000 pixels - as a bag of many patches, the size of which necessitates computationally cheap feature aggregation methods. However, a large proportion of these patches are uninformative, such as those containing only healthy or adipose tissue, adding significant noise and size to the bag. We propose Pathology Transformer with Hierarchical Selection (PATHS), a novel top-down method for hierarchical weakly supervised representation learning on slide-level tasks in computational pathology. PATHS is inspired by the cross-magnification manner in which a human pathologist examines a slide, recursively filtering patches at each magnification level to a small subset relevant to the diagnosis. Our method overcomes the complications of processing the entire slide, enabling quadratic self-attention and providing a simple interpretable measure of region importance. We apply PATHS to five datasets of The Cancer Genome Atlas (TCGA), and achieve superior performance on slide-level prediction tasks when compared to previous methods, despite processing only a small proportion of the slide.
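
The recursive filtering idea can be sketched as a top-k selection repeated across magnification levels. This is a simplified illustration under assumptions, not the PATHS implementation: `hierarchical_select`, `select_top_k`, and `children` are hypothetical names, each patch is assumed to split into four sub-patches at the next level, and the `score` function stands in for the learned importance measure that the paper trains.

```python
def select_top_k(patches, score, k):
    """Keep the k highest-scoring patches."""
    return sorted(patches, key=score, reverse=True)[:k]


def children(patch):
    """Split a patch (level, row, col) into 4 sub-patches one level down.

    Assumption: each magnification step doubles the resolution per axis.
    """
    level, row, col = patch
    return [(level + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]


def hierarchical_select(root_patches, score, k, levels):
    """Filter to k patches per level, expanding only the survivors."""
    current = list(root_patches)
    for _ in range(levels):
        kept = select_top_k(current, score, k)
        current = [child for patch in kept for child in children(patch)]
    return select_top_k(current, score, k)
```

Starting from a 4x4 grid of coarse patches with `k=2` and `levels=2`, this sketch scores 16 + 8 + 8 = 32 patches instead of the 256 patches that exist at the finest level, which shows why only a small proportion of the slide needs to be processed.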

[CV-52] KANs for Computer Vision: An Experimental Study

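[Quick Read]: This paper examines how well Kolmogorov-Arnold Networks (KANs) work on complex computer vision tasks, image classification in particular. Its contribution is an evaluation of KANs in practical settings that exposes their strengths and limitations on complex vision tasks. The study finds that although KANs perform well on specific vision tasks, they suffer from markedly higher hyperparameter sensitivity and computational cost. The paper therefore argues that KANs need architectural adaptations, such as integration with other architectures, to be practical for large-scale vision problems.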

Link: https://arxiv.org/abs/2411.18224
Authors: Karthik Mohan, Hanxiao Wang, Xiatian Zhu
Keywords: Convolutional Neural Networks, Kolmogorov-Arnold Networks, Neural Networks, image classification, paper presents
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

Abstract: This paper presents an experimental study of Kolmogorov-Arnold Networks (KANs) applied to computer vision tasks, particularly image classification. KANs introduce learnable activation functions on edges, offering more flexible non-linear transformations than the fixed activation functions of conventional networks such as Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). While KANs have shown promise mostly in simplified or small-scale datasets, their effectiveness for more complex real-world tasks such as computer vision tasks remains less explored. To fill this gap, this experimental study aims to provide extended observations and insights into the strengths and limitations of KANs. We reveal that although KANs can perform well in specific vision tasks, they face significant challenges, including increased hyperparameter sensitivity and higher computational costs. These limitations suggest that KANs require architectural adaptations, such as integration with other architectures, to be practical for large-scale vision problems. This study focuses on empirical findings rather than proposing new methods, aiming to inform future research on optimizing KANs, particularly for computer vision and similar applications.

[CV-53] TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

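[Quick Read]: This paper addresses the difficulty existing video-language models have with precise temporal localization across videos of varying lengths. The key to the solution is TimeMarker, a versatile video large language model (Video-LLM) that emphasizes temporal localization. TimeMarker integrates Temporal Separator Tokens to strengthen temporal awareness and mark specific moments in a video accurately. It also uses the AnyLength mechanism for dynamic frame sampling and adaptive token merging, so it handles both short and long videos. Training on diverse datasets, including further transformed temporal-related video QA datasets, strengthens its temporal understanding. Evaluations show that TimeMarker reaches state-of-the-art performance on multiple benchmarks, in both the short and long video categories.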

Link: https://arxiv.org/abs/2411.18211
Authors: Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
Keywords: large language models, multimodal large language, advanced multimodal large, large language, significantly advanced multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal localization and struggle with videos of varying lengths. We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization. TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos. It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos. Additionally, TimeMarker utilizes diverse datasets, including further transformed temporal-related video QA datasets, to bolster its temporal understanding capabilities. Image and interleaved data are also employed to further enhance the model’s semantic perception ability. Evaluations demonstrate that TimeMarker achieves state-of-the-art performance across multiple benchmarks, excelling in both short and long video categories. Our project page is at this https URL.
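
One way to picture temporal separator tokens and length-dependent frame sampling is the sketch below. Everything specific in it is an assumption made for illustration: the `<sec_N>` token format, the function names `sample_frame_times` and `interleave_with_separators`, and the rule of sampling at a fixed rate until a frame budget is hit are not taken from the paper, whose abstract does not describe these details, and adaptive token merging is not modelled.

```python
def sample_frame_times(duration_s, max_frames, fps=1.0):
    """Pick frame timestamps in seconds.

    Short videos are sampled at `fps`; longer videos are subsampled
    uniformly so that at most `max_frames` frames are kept.
    """
    n = int(duration_s * fps)
    n = max(1, min(n, max_frames))
    step = duration_s / n
    return [round(i * step, 2) for i in range(n)]


def interleave_with_separators(frame_times, frame_tokens):
    """Prefix each frame's visual tokens with a textual time marker.

    The "<sec_N>" spelling is a hypothetical placeholder format.
    """
    sequence = []
    for t, tokens in zip(frame_times, frame_tokens):
        sequence.append(f"<sec_{t:g}>")
        sequence.extend(tokens)
    return sequence
```

Because every block of visual tokens is preceded by a marker that encodes its timestamp, a language model reading the sequence can refer back to a specific moment by naming the marker, which is the intuition behind marking time explicitly in the input.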

[CV-54] From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

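[Quick Read]: This paper addresses the practical limitations of Open Vocabulary Object Detection (OVD), especially in critical applications such as driving scene perception, where OVD models tend to misclassify Near-Out-of-Distribution (NOOD) objects and ignore Far-Out-of-Distribution (FOOD) objects. The key to the solution is a framework built on Open World Embedding Learning (OWEL), which introduces the concept of a Pseudo Unknown Embedding to infer where unknown classes lie in a continuous semantic space, combined with Multi-Scale Contrastive Anchor Learning (MSCAL), which strengthens the intra-class consistency of object embeddings across scales. Together they allow unknown objects to be identified and learned incrementally.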

Link: https://arxiv.org/abs/2411.18207
Authors: Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham
Keywords: Traditional object detection, Traditional object, closed-set assumption, fixed number, OVD models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed number of objects predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD heavily relies on accurate prompts provided by an “oracle”, which limits their use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar semantics to known classes, and ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings, by identifying and incrementally learning novel objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of Pseudo Unknown Embedding which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance in common open world object detection and autonomous driving benchmarks.

[CV-55] Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

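[Quick Read]: This paper addresses the limitations of existing automatic rigging tools for producing animatable 3D characters: the need for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. The key to the solution is Make-It-Animatable, a data-driven method that makes any 3D humanoid model ready for character animation in less than one second, regardless of its shape and pose. Its core contributions are: 1) generating high-quality blend weights, bones, and pose transformations; 2) a particle-based shape autoencoder that supports various 3D representations, such as meshes and 3D Gaussian splats; 3) a coarse-to-fine representation and a structure-aware modeling strategy that keep the results accurate and robust even for characters with non-standard skeleton structures.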

Link: https://arxiv.org/abs/2411.18197
Authors: Zhiyang Guo, Jinxu Xiang, Kai Ma, Wengang Zhou, Houqiang Li, Ran Zhang
Keywords: modern creative industries, creative industries, extensive manual work, demands extensive manual, essential to modern
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:3D characters are essential to modern creative industries, but making them animatable often demands extensive manual work in tasks like rigging and skinning. Existing automatic rigging tools face several limitations, including the necessity for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. An alternative approach is to generate animatable avatars pre-bound to a rigged template mesh. However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. Our unified framework generates high-quality blend weights, bones, and pose transformations. By incorporating a particle-based shape autoencoder, our approach supports various 3D representations, including meshes and 3D Gaussian splats. Additionally, we employ a coarse-to-fine representation and a structure-aware modeling strategy to ensure both accuracy and robustness, even for characters with non-standard skeleton structures. We conducted extensive experiments to validate our framework’s effectiveness. Compared to existing methods, our approach demonstrates significant improvements in both quality and speed.
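
Blend weights and bone transformations are the inputs of linear blend skinning, the standard way a rigged mesh is posed. The sketch below shows that standard formula in plain Python to make clear how such outputs are consumed; it is not code from the paper, and `linear_blend_skinning` and `apply_transform` are hypothetical names. Each transform is assumed to be a 3x4 affine matrix (rotation and translation) and each vertex's weights are assumed to sum to 1.

```python
def apply_transform(matrix, point):
    """Apply a 3x4 affine transform (rotation | translation) to a 3D point."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix)


def linear_blend_skinning(vertices, blend_weights, bone_transforms):
    """Pose each vertex as the weighted sum of its per-bone transformed positions.

    blend_weights[i][j] is the influence of bone j on vertex i.
    """
    posed = []
    for vertex, weights in zip(vertices, blend_weights):
        acc = [0.0, 0.0, 0.0]
        for w, transform in zip(weights, bone_transforms):
            moved = apply_transform(transform, vertex)
            for axis in range(3):
                acc[axis] += w * moved[axis]
        posed.append(tuple(acc))
    return posed
```

A vertex bound entirely to one bone moves rigidly with it, while a vertex shared equally by two bones lands halfway between the two transformed positions, which is why the quality of the predicted blend weights directly controls how smoothly joints deform.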

[CV-56] DistinctAD: Distinctive Audio Description Generation in Contexts

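[Quick Read]: This paper addresses two main challenges in automatically generating Audio Descriptions (ADs): the domain gap between movie-AD data and the data used to train existing vision-language models, and the contextual redundancy caused by highly similar neighboring visual clips in a movie. The key to the solution is DistinctAD, a two-stage framework. First, it introduces a CLIP-AD adaptation strategy that needs no additional AD corpora and aligns the movie and AD modalities more effectively at both global and fine-grained levels. Second, in Stage-II it reduces redundancy with a Contextual Expectation-Maximization Attention (EMA) module and adds an explicit distinctive word prediction loss that filters out words repeated in the context, so the model predicts terms specific to the current AD. With these contributions DistinctAD performs well on multiple benchmarks, particularly on Recall@k/N, and produces higher-quality, more distinctive ADs.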

Link: https://arxiv.org/abs/2411.18180
Authors: Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan
Keywords: Audio Descriptions, aim to provide, text form, scene establishment, provide a narration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the domain gap between movie-AD data and existing data used to train vision-language models, and ii) the issue of contextual redundancy arising from highly similar neighboring visual clips in a long movie. In this work, we propose DistinctAD, a novel two-stage framework for generating ADs that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora, enabling more effective alignment between movie and AD modalities at both global and fine-grained levels. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context, ensuring the prediction of unique terms specific to the current AD. Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.
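
The filtering step behind the distinctive word loss can be pictured as a set difference between the current AD and its neighbors. The sketch below only illustrates how such target words might be picked; it is an assumption-laden toy, not the paper's loss. `distinctive_words` is a hypothetical name, and whitespace tokenization with lowercasing stands in for whatever tokenizer the authors use.

```python
def distinctive_words(current_ad, context_ads):
    """Return words of the current AD that appear in no neighboring AD.

    Order of first appearance is kept and duplicates are dropped.
    """
    seen = {word for ad in context_ads for word in ad.lower().split()}
    result = []
    for word in current_ad.lower().split():
        if word not in seen and word not in result:
            result.append(word)
    return result
```

For the current AD "She opens the red door" with neighbors "She walks to the door" and "She stops", the function returns `["opens", "red"]`: the words that carry new information for this clip and that a distinctiveness objective would push the model to predict.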

[CV-57] Enhancing Computer Vision with Knowledge: a Rummikub Case Study

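[Quick Read]: This paper addresses the fact that artificial neural networks, while good at identifying individual components in an image, fail to integrate and interpret those components as a whole. The key to the solution is to extend the network with explicit background knowledge and a separate reasoning component. Applying this approach to the popular board game Rummikub, the study shows that the added background knowledge is worth as much as two-thirds of the data set and cuts the training time to half of the original.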

Link: https://arxiv.org/abs/2411.18172
Authors: Simon Vandevelde, Laurent Mertens, Sverre Lauwers, Joost Vennekens
Keywords: Artificial Neural Networks, Neural Networks excel, Artificial Neural, identifying individual components, Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
Comments: Submitted to ESANN2025

Abstract:Artificial Neural Networks excel at identifying individual components in an image. However, out-of-the-box, they do not manage to correctly integrate and interpret these components as a whole. One way to alleviate this weakness is to expand the network with explicit knowledge and a separate reasoning component. In this paper, we evaluate an approach to this end, applied to the solving of the popular board game Rummikub. We demonstrate that, for this particular example, the added background knowledge is equally valuable as two-thirds of the data set, and allows to bring down the training time to half the original time.

[CV-58] PDZSeg: Adapting the Foundation Model for Dissection Zone Segmentation with Visual Prompts in Robot-assisted Endoscopic Submucosal Dissection

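[Quick Read]: This paper addresses dissection zone segmentation in endoscopic surgical environments, in particular the segmentation errors in endoscopic submucosal dissection (ESD) caused by unclear boundaries between tissue types. The key to the solution is the Prompted-based Dissection Zone Segmentation (PDZSeg) model, which uses diverse visual prompts, such as scribbles and bounding boxes, to improve segmentation. By overlaying these prompts onto the images and fine-tuning a foundation model on a specialized dataset, PDZSeg improves segmentation accuracy and, through its flexible input methods, the user experience. Experiments show that PDZSeg outperforms state-of-the-art segmentation methods on the ESD-DZSeg dataset and sets a new benchmark for dissection zone segmentation in ESD.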

Link: https://arxiv.org/abs/2411.18169
Authors: Mengya Xu, Wenjin Mo, Guankun Wang, Huxin Gao, An Wang, Zhen Li, Xiaoxiao Yang, Hongliang Ren
Keywords: dissection zone segmentation, Endoscopic surgical environments, surgical environments present, environments present challenges, dissection zone
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Purpose: Endoscopic surgical environments present challenges for dissection zone segmentation due to unclear boundaries between tissue types, leading to segmentation errors where models misidentify or overlook edges. This study aims to provide precise dissection zone suggestions during endoscopic submucosal dissection (ESD) procedures, enhancing ESD safety. Methods: We propose the Prompted-based Dissection Zone Segmentation (PDZSeg) model, designed to leverage diverse visual prompts such as scribbles and bounding boxes. By overlaying these prompts onto images and fine-tuning a foundational model on a specialized dataset, our approach improves segmentation performance and user experience through flexible input methods. Results: The PDZSeg model was validated using three experimental setups: in-domain evaluation, variability in visual prompt availability, and robustness assessment. Using the ESD-DZSeg dataset, results show that our method outperforms state-of-the-art segmentation approaches. This is the first study to integrate visual prompt design into dissection zone segmentation. Conclusion: The PDZSeg model effectively utilizes visual prompts to enhance segmentation performance and user experience, supported by the novel ESD-DZSeg dataset as a benchmark for dissection zone segmentation in ESD. Our work establishes a foundation for future research.
Submission history: v1 submitted by Wenjin Mo, Wed, 27 Nov 2024 09:28:50 UTC (827 KB)
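
Overlaying a visual prompt means drawing it into the pixels before the image reaches the model, rather than passing it as a separate input. A minimal sketch for the bounding-box case follows, assuming an RGB image stored as a list of rows of `(r, g, b)` tuples; `overlay_box_prompt`, the green color, and the one-pixel outline are illustrative choices, not details from the paper.

```python
def overlay_box_prompt(image, box, color=(0, 255, 0)):
    """Draw a 1-pixel box outline on a copy of an RGB image.

    `box` is (x1, y1, x2, y2) with inclusive pixel coordinates that
    must lie inside the image.
    """
    x1, y1, x2, y2 = box
    out = [row[:] for row in image]
    for x in range(x1, x2 + 1):
        out[y1][x] = color  # top edge
        out[y2][x] = color  # bottom edge
    for y in range(y1, y2 + 1):
        out[y][x1] = color  # left edge
        out[y][x2] = color  # right edge
    return out
```

Because the prompt becomes part of the image itself, the same fine-tuned segmentation model can accept boxes, scribbles, or no prompt at all without any change to its input interface.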

[CV-59] KAN See Your Face

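[Quick Read]: This paper investigates whether face images can be extracted from the embeddings of privacy-preserving face recognition (PPFR) systems and conventional face recognition (FR) systems. The key to the solution is to use the Kolmogorov-Arnold Network (KAN) for embedding-to-face attacks, with two model variants, FEM-KAN and FEM-MLP, that perform efficient non-linear embedding-to-embedding mapping so that realistic face images can be reconstructed from the corresponding embeddings. Extensive experiments demonstrate that these models are effective at accurate embedding mapping and face reconstruction.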

Link: https://arxiv.org/abs/2411.18165
Authors: Dong Han, Yong Li, Joachim Denzler
Keywords: enhanced facial privacy, privacy-preserving face recognition, facial privacy protection, secure face recognition, face recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures

Abstract:With the advancement of face recognition (FR) systems, privacy-preserving face recognition (PPFR) has gained popularity for its secure face recognition, enhanced facial privacy protection, and robustness to various attacks. In addition, specific models and algorithms have been proposed to protect face embeddings by mapping them to a secure space. However, few studies have investigated or evaluated the possibility of extracting face images from the embeddings of those systems, especially for PPFR. In this work, we introduce the first approach that exploits the Kolmogorov-Arnold Network (KAN) to conduct embedding-to-face attacks against state-of-the-art (SOTA) FR and PPFR systems. Face embedding mapping (FEM) models are proposed to learn the distribution mapping between embeddings from the initial domain and the target domain. For comparison with Multi-Layer Perceptrons (MLP), we provide two variants, FEM-KAN and FEM-MLP, for efficient non-linear embedding-to-embedding mapping in order to reconstruct realistic face images from the corresponding face embeddings. To verify our methods, we conduct extensive experiments with various PPFR and FR models. We also assess the reconstructed face images with different image quality metrics. Through comprehensive experiments, we demonstrate the effectiveness of FEMs in accurate embedding mapping and face reconstruction.

[CV-60] RPEE-HEADS: A Novel Benchmark for Pedestrian Head Detection in Crowd Videos

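[Quick Read]: This paper addresses the difficulty of automatically detecting pedestrian heads in crowded environments, particularly in high-risk settings such as railway platforms and event entrances. Because such complex environments are underrepresented in existing public datasets, current deep learning models perform poorly in them. The key to the solution is a new, diverse, high-resolution, and accurately annotated dataset named Railway Platforms and Event Entrances-Heads (RPEE-Heads). It contains 109,913 annotated pedestrian heads across 1,886 images, with an average of 56.2 heads per image. Using RPEE-Heads, the paper evaluates eight state-of-the-art object detection algorithms and finds that You Only Look Once v9 and Real-Time Detection Transformer perform best in detection accuracy and inference speed, with mean average precisions of 90.7% and 90.8% and inference times of 11 and 14 milliseconds, respectively. This indicates that datasets built specifically for these complex environments are essential for training and evaluating accurate head detection models.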

Link: https://arxiv.org/abs/2411.18164
Authors: Mohamad Abubaker,Zubayda Alsadder,Hamed Abdelhaq,Maik Boltes,Ahmed Alia
Keywords: platforms and event, railway platforms, management tasks, analysis and management, high-risk settings
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 8 figures, 7 tables

Abstract:The automatic detection of pedestrian heads in crowded environments is essential for crowd analysis and management tasks, particularly in high-risk settings such as railway platforms and event entrances. These environments, characterized by dense crowds and dynamic movements, are underrepresented in public datasets, posing challenges for existing deep learning models. To address this gap, we introduce the Railway Platforms and Event Entrances-Heads (RPEE-Heads) dataset, a novel, diverse, high-resolution, and accurately annotated resource. It includes 109,913 annotated pedestrian heads across 1,886 images from 66 video recordings, with an average of 56.2 heads per image. Annotations include bounding boxes for visible head regions. In addition to introducing the RPEE-Heads dataset, this paper evaluates eight state-of-the-art object detection algorithms using the RPEE-Heads dataset and analyzes the impact of head size on detection accuracy. The experimental results show that You Only Look Once v9 and Real-Time Detection Transformer outperform the other algorithms, achieving mean average precisions of 90.7% and 90.8%, with inference times of 11 and 14 milliseconds, respectively. Moreover, the findings underscore the need for specialized datasets like RPEE-Heads for training and evaluating accurate models for head detection in railway platforms and event entrances. The dataset and pretrained models are available at this https URL.

[CV-61] Type-R: Automatically Retouching Typos for Text-to-Image Generation

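[Quick Read]: This paper addresses the difficulty that text-to-image generation models have in rendering words accurately in images. The key to the solution is a post-processing method called Type-R, which identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and corrects typos in the rendered words. This markedly improves text rendering accuracy while preserving image quality, and achieves the best balance between text accuracy and image quality.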

Link: https://arxiv.org/abs/2411.18159
Authors: Wataru Shimoda,Naoto Inoue,Daichi Haraguchi,Hayato Mitani,Seichi Uchida,Kota Yamaguchi
Keywords: reflect detailed instructions, face significant challenges, generate photorealistic images, accurately rendering words, detailed instructions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.
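The identify, erase, regenerate, and correct steps described above can be pictured as a word-matching problem between the prompt and the text read back from the image. The following is a minimal sketch under stated assumptions, not Type-R's implementation: it assumes an OCR step has already produced a list of detected words, and the names `edit_distance`, `plan_retouch`, and the `max_dist` threshold are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def plan_retouch(prompt_words, detected_words, max_dist=2):
    """Decide what to erase, fix, and add (hypothetical helper).

    Returns (to_erase, to_fix, to_add), where to_fix holds
    (detected_word, intended_word) pairs.
    """
    remaining = list(prompt_words)
    unmatched = []
    # Exact matches are already correct and consume a prompt word.
    for word in detected_words:
        if word in remaining:
            remaining.remove(word)
        else:
            unmatched.append(word)

    to_erase, to_fix = [], []
    for word in unmatched:
        if not remaining:
            to_erase.append(word)
            continue
        best = min(remaining, key=lambda p: edit_distance(word, p))
        if edit_distance(word, best) <= max_dist:
            to_fix.append((word, best))   # near miss: treat as a typo
            remaining.remove(best)
        else:
            to_erase.append(word)         # unrelated text: erase it
    # Prompt words never rendered need a new text box.
    return to_erase, to_fix, remaining


if __name__ == "__main__":
    print(plan_retouch(["OPEN", "DAILY", "CAFE"], ["OPEM", "CAFE", "XQZWV"]))
```

Greedy matching is used here only for brevity; an optimal assignment between detected and intended words would be a natural alternative.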

[CV-62] Online Knowledge Integration for 3D Semantic Mapping: A Survey

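[Quick Read]: This paper addresses the loose integration of geometric and knowledge representations in traditional semantic maps. The key to the solution is using deep learning to fully integrate prior knowledge, such as knowledge graphs or language concepts, into sensor data processing and semantic mapping pipelines. Specifically, semantic scene graphs are used to integrate symbolic prior knowledge, and language models are used to capture implicit common-sense knowledge and natural language concepts. This integration enables online incorporation of knowledge during map construction, has driven substantial advances in semantic mapping, and opens up new application areas.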

Link: https://arxiv.org/abs/2411.18147
Authors: Felix Igelbrink,Marian Renz,Martin Günther,Piper Powell,Lennart Niecksch,Oscar Lima,Martin Atzmueller,Joachim Hertzberg
Keywords: Semantic mapping, structured environments, Semantic, key component, component of robots
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to Robotics and Autonomous Systems

Abstract:Semantic mapping is a key component of robots operating in and interacting with objects in structured environments. Traditionally, geometric and knowledge representations within a semantic map have only been loosely integrated. However, recent advances in deep learning now allow full integration of prior knowledge, represented as knowledge graphs or language concepts, into sensor data processing and semantic mapping pipelines. Semantic scene graphs and language models enable modern semantic mapping approaches to incorporate graph-based prior knowledge or to leverage the rich information in human language both during and after the mapping process. This has sparked substantial advances in semantic mapping, leading to previously impossible novel applications. This survey reviews these recent developments comprehensively, with a focus on online integration of knowledge into semantic mapping. We specifically focus on methods that use semantic scene graphs for integrating symbolic prior knowledge and language models for capturing implicit common-sense knowledge and natural language concepts.

[CV-63] COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models

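[Quick Read]: This paper addresses the lack of a comprehensive benchmark for evaluating large Vision-Language Models (VLMs) in remote sensing Earth observation. The key to the solution is COREval, the first benchmark designed to comprehensively and objectively evaluate the remote sensing capabilities of VLMs. COREval focuses on two primary capability dimensions in remote sensing, perception and reasoning, which are further divided into six secondary dimensions and 22 concrete tasks to ensure full coverage of the field. Through data collection from 50 cities around the world, rigorous question construction, and quality control, COREval provides 6,263 high-quality multiple-choice questions for objective and direct evaluation of VLM performance. The paper also evaluates 13 open-source VLMs, revealing shortcomings in their remote sensing capabilities and pointing out directions for improvement.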

Link: https://arxiv.org/abs/2411.18145
Authors: Xiao An,Jiaxing Sun,Zihan Gui,Wei He
Keywords: Large Vision-Language Models, sensing Earth observation, remote sensing Earth, remote sensing capabilities, remote sensing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 12 figures

Abstract:With the rapid development of Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing Earth observation have demonstrated exceptional perception and reasoning abilities within this field. However, the current absence of a comprehensive benchmark for holistically evaluating the remote sensing capabilities of these VLMs represents a significant gap. To bridge this gap, we propose COREval, the first benchmark designed to comprehensively and objectively evaluate the hierarchical remote sensing capabilities of VLMs. Concentrating on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further categorize 6 secondary dimensions and 22 leaf tasks to ensure well-rounded assessment coverage for this field. COREval guarantees the quality of its 6,263 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The multiple-choice format with definitive answers allows for an objective and straightforward evaluation of VLM performance. We conducted a holistic evaluation of 13 prominent open-source VLMs from both the general and remote sensing domains, highlighting current shortcomings in their remote sensing capabilities and providing directions for improvements in their application within this specialized context. We hope that COREval will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing.
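Because every question is multiple choice with a single definitive answer, scoring reduces to exact matching of option letters, grouped by task. The sketch below illustrates that idea only; the record layout `(task, predicted_option, correct_option)` and the function name `accuracy_by_task` are assumptions made for illustration, not COREval's actual evaluation code.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Compute per-task accuracy for multiple-choice answers.

    records: iterable of (task, predicted_option, correct_option)
    tuples (hypothetical layout). Option letters are compared
    case-insensitively after trimming whitespace.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, predicted, correct in records:
        totals[task] += 1
        if predicted.strip().upper() == correct.strip().upper():
            hits[task] += 1
    return {task: hits[task] / totals[task] for task in totals}


if __name__ == "__main__":
    demo = [
        ("perception", "A", "A"),
        ("perception", "c", "B"),
        ("reasoning", " d ", "D"),
    ]
    print(accuracy_by_task(demo))
```

Averaging the per-task scores within each secondary dimension would then give the hierarchical view the abstract describes.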

[CV-64] Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

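[Quick Read]: This paper addresses the limitation that current Multimodal Large Language Models (MLLMs) depend on clue finding when handling complex visual scenes. Existing methods are designed mainly for tasks in which clue finding dominates the reasoning process, so they struggle with complex scenes where clue finding does not simplify the reasoning task. The key to the solution is a new visual reasoning paradigm that lets MLLMs autonomously modify the input scene according to their reasoning state, reformulating Chain-of-Thought (CoT) as simple closed-loop decision-making and reasoning steps over a sequence of imagined visual scenes. To this end, the paper designs a plug-and-play imagination space in which MLLMs make visual modifications through operations such as focus, ignore, and transform, relying on their native reasoning ability without specific training. With this approach, MLLMs can reason effectively step by step on tasks that go beyond clue finding.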

Link: https://arxiv.org/abs/2411.18142
Authors: Jingming Liu,Yumeng Li,Boyuan Xiao,Yichang Jian,Ziang Qin,Tianjia Shao,Yao-Xiang Ding,Kun Zhou
Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:There have been recent efforts to extend the Chain-of-Thought (CoT) paradigm to Multimodal Large Language Models (MLLMs) by finding visual clues in the input scene, advancing the visual reasoning ability of MLLMs. However, current approaches are specially designed for the tasks where clue finding plays a major role in the whole reasoning process, leading to the difficulty in handling complex visual scenes where clue finding does not actually simplify the whole reasoning task. To deal with this challenge, we propose a new visual reasoning paradigm enabling MLLMs to autonomously modify the input scene to new ones based on its reasoning status, such that CoT is reformulated as conducting simple closed-loop decision-making and reasoning steps under a sequence of imagined visual scenes, leading to natural and general CoT construction. To implement this paradigm, we introduce a novel plug-and-play imagination space, where MLLMs conduct visual modifications through operations like focus, ignore, and transform based on their native reasoning ability without specific training. We validate our approach through a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement, challenging the reasoning ability beyond clue finding. The results verify that while existing techniques fall short, our approach enables MLLMs to effectively reason step by step through autonomous imagination. Project page: this https URL.

[CV-65] ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

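[Quick Read]: This paper addresses the over-smoothing and low-quality outputs of existing text-to-3D generation methods based on Score Distillation Sampling (SDS). These problems stem from the mode-seeking behavior of current methods: the scores used to update the model oscillate between multiple modes, causing unstable optimization and degraded output quality. The key to the solution is a new image prompt score distillation loss (ISD), which uses a reference image to steer text-to-3D optimization toward a specific mode. The ISD loss is implemented with IP-Adapter, a lightweight adapter that adds image prompt capability to a text-to-image diffusion model, used here as a mode-selection module. In addition, when no reference image is supplied, a variant of this adapter can serve as an efficient control variate that reduces the variance of score estimates, improving output quality and optimization stability. Experiments on the T3Bench benchmark show, in both qualitative and quantitative evaluations, that the ISD loss consistently produces visually coherent, high-quality outputs and improves optimization speed.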

Link: https://arxiv.org/abs/2411.18135
Authors: Uy Dieu Tran,Minh Luu,Phong Ha Nguyen,Khoi Nguyen,Binh-Son Hua
Keywords: Existing Score Distillation, Score Distillation Sampling, driven significant progress, Distillation Sampling, Existing Score
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Our experiments demonstrate that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods, as demonstrated through both qualitative and quantitative evaluations on the T3Bench benchmark suite.
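For readers unfamiliar with the background, the standard SDS gradient (as introduced in DreamFusion) and the generic control-variate identity can be written as follows. This is background notation only and not the paper's ISD loss, whose exact form the abstract does not give. The symbols are assumptions of this sketch: $x$ is the image rendered from 3D parameters $\theta$, $x_t$ its noised version at timestep $t$, $y$ the text prompt, $\epsilon$ the injected noise, $\hat{\epsilon}_\phi$ the diffusion model's noise prediction, $w(t)$ a timestep weight, $g$ a gradient estimator, and $c$ a zero-mean control variate.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]

% Control variate: if E[c] = 0, then g - c has the same mean as g, and
\operatorname{Var}(g - c)
  = \operatorname{Var}(g) + \operatorname{Var}(c) - 2\,\operatorname{Cov}(g, c)
```

The second line shows why a well-chosen control variate helps: the variance drops whenever $2\,\operatorname{Cov}(g, c) > \operatorname{Var}(c)$, that is, when $c$ is strongly correlated with the estimator it corrects.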

[CV-66] Towards Cross-device and Training-free Robotic Grasping in 3D Open World

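[Quick Read]: This paper addresses depth acquisition and grasping performance in complex stacking scenes for robotic grasping in open-world environments. The key to the solution is a new pipeline that grasps previously unseen objects without any training and supports flexible use of different 3D point cloud segmentation models across scenes. Building on the segmentation results, the paper proposes a training-free binary clustering algorithm that not only improves segmentation precision but also clusters and localizes unseen objects so that they can be grasped. Experiments show that the pipeline is remarkably robust and generalizable across environments, robots, cameras, and objects.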

Link: https://arxiv.org/abs/2411.18133
Authors: Weiguang Zhao,Chenru Jiang,Chengrui Zhang,Jie Sun,Yuyao Yan,Rui Zhang,Kaizhu Huang
Keywords: Robotic grasping, automation processes, open world, critical component, component of manufacturing
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robotic grasping in the open world is a critical component of manufacturing and automation processes. While numerous existing approaches depend on 2D segmentation output to facilitate the grasping procedure, accurately determining depth from 2D imagery remains a challenge, often leading to limited performance in complex stacking scenarios. In contrast, techniques utilizing 3D point cloud data inherently capture depth information, making it possible to navigate and manipulate a diverse range of complex stacking scenes. However, such efforts are considerably hindered by the variance in data capture devices and the unstructured nature of the data, which limits their generalizability. Consequently, much research is narrowly concentrated on managing designated objects within specific settings, which confines its real-world applicability. This paper presents a novel pipeline capable of executing object grasping tasks in open-world scenarios, even on previously unseen objects, without the need for training. Additionally, our pipeline supports the flexible use of different 3D point cloud segmentation models across a variety of scenes. Leveraging the segmentation results, we propose a training-free binary clustering algorithm that not only improves segmentation precision but can also cluster and localize unseen objects for grasping operations. In our experiments, we investigate a range of open-world scenarios, and the outcomes underscore the remarkable robustness and generalizability of our pipeline, consistent across various environments, robots, cameras, and objects. The code will be made available upon acceptance of the paper.

[CV-67] Spectral-Spatial Transformer with Active Transfer Learning for Hyperspectral Image Classification

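[Quick Read]: This paper addresses the challenges of hyperspectral image classification, which arise mainly from high spectral dimensionality and limited labeled data. The key to the solution is a novel multi-stage active transfer learning (ATL) framework that combines a Spatial-Spectral Transformer (SST) with an active learning process. Specifically, the method takes a pre-trained SST model and iteratively fine-tunes it on newly acquired labeled samples selected by an uncertainty-diversity querying mechanism, optimizing the transfer learning process and reducing both labeling cost and model uncertainty. A dynamic freezing strategy selectively freezes layers of the SST model to reduce computational overhead while keeping the model adaptable to spectral variations in new data. Further innovations include self-calibration of spectral and spatial attention weights through uncertainty-guided active learning, and a diversity-promoting sampling strategy that ensures the selected samples span distinct spectral regions and prevents overfitting to particular spectral classes. Experiments on benchmark hyperspectral image datasets show that the SST-ATL framework significantly outperforms existing CNN and SST methods in accuracy, efficiency, and computational performance.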

Link: https://arxiv.org/abs/2411.18115
Authors: Muhammad Ahmad,Manuel Mazzara,Salvatore Distefano
Keywords: challenging task due, high spectral dimensionality, hyperspectral images, challenging task, task due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The classification of hyperspectral images (HSI) is a challenging task due to the high spectral dimensionality and limited labeled data typically available for training. In this study, we propose a novel multi-stage active transfer learning (ATL) framework that integrates a Spatial-Spectral Transformer (SST) with an active learning process for efficient HSI classification. Our approach leverages a pre-trained (initially trained) SST model, fine-tuned iteratively on newly acquired labeled samples using an uncertainty-diversity (Spatial-Spectral Neighborhood Diversity) querying mechanism. This mechanism identifies the most informative and diverse samples, thereby optimizing the transfer learning process to reduce both labeling costs and model uncertainty. We further introduce a dynamic freezing strategy, selectively freezing layers of the SST model to minimize computational overhead while maintaining adaptability to spectral variations in new data. One of the key innovations in our work is the self-calibration of spectral and spatial attention weights, achieved through uncertainty-guided active learning. This not only enhances the model’s robustness in handling dynamic and disjoint spectral profiles but also improves generalization across multiple HSI datasets. Additionally, we present a diversity-promoting sampling strategy that ensures the selected samples span distinct spectral regions, preventing overfitting to particular spectral classes. Experiments on benchmark HSI datasets demonstrate that the SST-ATL framework significantly outperforms existing CNN and SST-based methods, offering superior accuracy, efficiency, and computational performance. The source code can be accessed at this https URL.
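An uncertainty-diversity query can be illustrated with a simple greedy rule: prefer samples whose predicted class distribution has high entropy, and add a bonus for being far from samples already chosen. The sketch below is a generic illustration under stated assumptions, not the paper's Spatial-Spectral Neighborhood Diversity mechanism; the function names, the Euclidean feature distance, and the `trade_off` weight are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(probs, features, budget, trade_off=1.0):
    """Greedily pick sample indices to send for labeling.

    score(i) = entropy(probs[i])
               + trade_off * distance from features[i] to the
                 nearest already-selected sample (0 for the first pick).
    """
    uncertainties = [entropy(p) for p in probs]
    chosen = []
    candidates = set(range(len(probs)))

    def score(i):
        if not chosen:
            return uncertainties[i]
        nearest = min(math.dist(features[i], features[j]) for j in chosen)
        return uncertainties[i] + trade_off * nearest

    while candidates and len(chosen) < budget:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
    feats = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [0.0, 0.2]]
    print(select_queries(probs, feats, budget=2))
```

In the demo, the second pick is the confident but distant sample rather than a near-duplicate of the first, which is the behavior a diversity term is meant to encourage.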

[CV-68] When Large Vision-Language Models Meet Person Re-Identification

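[Quick Read]: This paper addresses how to use Large Vision-Language Models (LVLMs) effectively for person re-identification (ReID). The key to the solution is the LVLM-ReID framework, which uses instructions to guide the LVLM in generating a pedestrian semantic token that captures the key appearance semantics of the person image, and then refines this token with a Semantic-Guided Interaction (SGI) module that establishes a reciprocal interaction between the semantic token and the visual tokens. The reinforced semantic token finally serves as the pedestrian identity representation, integrating the semantic understanding and generation capabilities of LVLMs into end-to-end ReID training so that rich semantic cues from pedestrian images are captured during both training and inference.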

Link: https://arxiv.org/abs/2411.18111
Authors: Qizao Wang,Bin Li,Xiangyang Xue
Keywords: Large Language Models, Large Vision-Language Models, Large Language, Language Models, incorporate visual models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision-Language Models (LVLMs) that incorporate visual models and Large Language Models (LLMs) have achieved impressive results across various cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one pedestrian semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the pedestrian identity representation. Our framework integrates the semantic understanding and generation capabilities of LVLMs into end-to-end ReID training, allowing LVLMs to capture rich semantic cues from pedestrian images during both training and inference. Our method achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID and offering a promising direction for future research.

[CV-69] Training Data Synthesis with Difficulty Controlled Diffusion Model

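[Quick Read]: This paper addresses how synthetic images mixed into unlabeled data (contaminated data) affect model performance in Semi-supervised Learning (SSL), a situation that arises as synthetic images become more common in public image sources. The key to the solution is a new task, Real-Synthetic Hybrid SSL (RS-SSL), together with a new SSL method named RSMatch. RSMatch identifies synthetic images in the unlabeled data and turns them into a useful resource that improves model performance. Extensive experiments show that RSMatch turns synthetic unlabeled data from "obstacles" into "resources", and ablation studies and visualizations further verify its effectiveness.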

Link: https://arxiv.org/abs/2411.18109
Authors: Zerun Wang,Jiafeng Mao,Xueting Wang,Toshihiko Yamasaki
Keywords: Semi-supervised learning, public image sources, SSL, low costs, performance by leveraging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources at low cost. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. Therefore, it is becoming inevitable that existing synthetic images are included in the unlabeled data for SSL. How this kind of contamination affects SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images on SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and find that they struggle to benefit from unlabeled synthetic images and are sometimes even negatively affected. To address this, we propose RSMatch, a novel SSL method specifically designed to handle the challenges of RS-SSL. RSMatch effectively identifies unlabeled synthetic data and further utilizes them for improvement. Extensive experimental results show that RSMatch can turn synthetic unlabeled data from "obstacles" into "resources." Its effectiveness is further verified through ablation studies and visualization.

[CV-70] Aligning Knowledge Concepts to Whole Slide Images for Precise Histopathology Image Analysis

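[Quick Read]: This paper addresses the complexity of analyzing Whole Slide Images (WSIs), which are very large and lack fine-grained annotation. The key to the solution is a knowledge concept-based Multiple Instance Learning (MIL) framework named ConcepPath. The framework uses GPT-4 to extract reliable, disease-specific human expert concepts from medical literature and combines them with a group of purely learnable concepts that extract complementary knowledge from the training data. ConcepPath uses a pathology vision-language model as its basic building block to align WSIs with these linguistic knowledge concepts, and it significantly outperforms existing state-of-the-art methods that lack the guidance of human expert knowledge on lung cancer subtyping, breast cancer HER2 scoring, and gastric cancer immunotherapy-sensitive subtyping.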

Link: https://arxiv.org/abs/2411.18101
Authors: Weiqin Zhao,Ziyu Guo,Yinshuang Fan,Yuming Jiang,Maximus Yeung,Lequan Yu
Keywords: Multiple Instance Learning, Slide Images, Instance Learning, Multiple Instance, fine-grained annotation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Due to the large size and lack of fine-grained annotation, Whole Slide Image (WSI) analysis is commonly approached as a Multiple Instance Learning (MIL) problem. However, previous studies only learn from training data, in stark contrast to how human clinicians teach each other and reason about histopathologic entities and factors. Here we present a novel knowledge concept-based MIL framework, named ConcepPath, to fill this gap. Specifically, ConcepPath utilizes GPT-4 to induce reliable disease-specific human expert concepts from medical literature, and incorporates them with a group of purely learnable concepts to extract complementary knowledge from training data. In ConcepPath, WSIs are aligned to these linguistic knowledge concepts by utilizing a pathology vision-language model as the basic building component. On lung cancer subtyping, breast cancer HER2 scoring, and gastric cancer immunotherapy-sensitive subtyping tasks, ConcepPath significantly outperformed previous SOTA methods, which lack the guidance of human expert knowledge.
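One way to picture aligning a slide with a language concept in a MIL setting is attention pooling: each patch embedding is scored against a concept embedding, the scores are normalized with a softmax, and the slide representation is the weighted sum of patches. The sketch below shows only this generic idea and is not ConcepPath's architecture; it assumes patch and concept embeddings already live in a shared space, and the names `concept_pooling` and `temperature` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def concept_pooling(patches, concept, temperature=1.0):
    """Pool patch embeddings into one slide vector, guided by a concept.

    patches: list of patch embeddings (lists of floats).
    concept: embedding of one text concept in the same space.
    Returns (slide_vector, attention_weights).
    """
    logits = [dot(p, concept) / temperature for p in patches]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    slide = [sum(w * p[d] for w, p in zip(weights, patches))
             for d in range(dim)]
    return slide, weights


if __name__ == "__main__":
    slide, weights = concept_pooling([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
    print(slide, weights)
```

Patches that resemble the concept receive more weight, so the pooled vector emphasizes the tissue regions most relevant to that concept.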

[CV-71] Training Noise Token Pruning

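[Quick Read]: This paper addresses the non-smooth optimization that discrete token dropping causes when training vision Transformers. The key to the solution is Training Noise Token (TNT) pruning, which relaxes the discrete token-dropping condition to continuous additive noise, giving smooth optimization during training while retaining the computational gains of discrete dropping at deployment. The method is connected theoretically to the Rate-Distortion literature and evaluated empirically on ImageNet with ViT and DeiT architectures, demonstrating its advantages over previous pruning methods.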

Link: https://arxiv.org/abs/2411.18092
Authors: Mingxing Rao,Bohan Jiang,Daniel Moyer
Keywords: present Training Noise, Training Noise Token, vision transformers, present work, present Training
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 8 figures

Abstract:In the present work we present Training Noise Token (TNT) Pruning for vision transformers. Our method relaxes the discrete token dropping condition to continuous additive noise, providing smooth optimization in training, while retaining discrete dropping computational gains in deployment settings. We provide theoretical connections to Rate-Distortion literature, and empirical evaluations on the ImageNet dataset using ViT and DeiT architectures demonstrating TNT’s advantages over previous pruning methods.
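The contrast between the training-time relaxation and deployment-time dropping can be shown on toy data. This is a minimal sketch under stated assumptions and not the paper's formulation: it assumes each token carries a keep score in [0, 1], that training adds Gaussian noise scaled by `sigma * (1 - score)`, and that deployment keeps the top-`k` tokens by score. All names and the noise schedule are hypothetical.

```python
import random


def noisy_tokens(tokens, keep_scores, sigma=1.0, rng=None):
    """Training-time relaxation: corrupt low-score tokens with noise.

    A token with keep score 1.0 passes through unchanged; a token
    with score 0.0 receives noise with standard deviation sigma.
    """
    rng = rng or random.Random(0)
    out = []
    for token, score in zip(tokens, keep_scores):
        scale = sigma * (1.0 - score)
        out.append([x + scale * rng.gauss(0.0, 1.0) for x in token])
    return out


def prune_tokens(tokens, keep_scores, k):
    """Deployment-time rule: keep the k highest-scoring tokens,
    preserving their original order in the sequence."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: keep_scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    scores = [1.0, 0.2, 0.7]
    print(noisy_tokens(tokens, scores))
    print(prune_tokens(tokens, scores, k=2))
```

Because the noise scale varies continuously with the score, gradients can flow to the scoring function during training, whereas a hard drop decision would not be differentiable.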

[CV-72] Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans?

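[Quick Read]: This paper addresses the detection of prohibited items in challenging categories, using dual-view X-ray images (vertical and side) to mimic the way human inspectors work. The key to the solution is a large-scale dual-view X-ray dataset (LDXray) and an Auxiliary-view Enhanced Network (AENet). AENet improves detection by using two views of the same object: the main view detects common categories, while the auxiliary view handles more challenging categories using "expert models" learned from the main view. Experiments show that this dual-view mechanism significantly improves detection performance, with gains of up to 24.7% on challenging categories such as umbrellas. AENet also generalizes well across seven different X-ray detection models.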

Link: https://arxiv.org/abs/2411.18082
Authors: Renshuai Tao,Haoyu Wang,Yuzhe Guo,Hairong Chen,Li Zhang,Xianglong Liu,Yunchao Wei,Yao Zhao
Keywords: inspectors typically rely, detect prohibited items, human inspectors typically, dual-view X-ray images, vertical and side
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:To detect prohibited items in challenging categories, human inspectors typically rely on images from two distinct views (vertical and side). Can AI detect prohibited items from dual-view X-ray images in the same way humans do? Existing X-ray datasets often suffer from limitations, such as single-view imaging or insufficient sample diversity. To address these gaps, we introduce the Large-scale Dual-view X-ray (LDXray) dataset, which consists of 353,646 instances across 12 categories, providing a diverse and comprehensive resource for training and evaluating models. To emulate human intelligence in dual-view detection, we propose the Auxiliary-view Enhanced Network (AENet), a novel detection framework that leverages both the main and auxiliary views of the same object. The main-view pipeline focuses on detecting common categories, while the auxiliary-view pipeline handles more challenging categories using "expert models" learned from the main view. Extensive experiments on the LDXray dataset demonstrate that the dual-view mechanism significantly enhances detection performance, e.g., achieving improvements of up to 24.7% for the challenging category of umbrellas. Furthermore, our results show that AENet exhibits strong generalization across seven different detection models for X-ray inspection.

[CV-73] Dual-Level Boost Network for Long-Tail Prohibited Items Detection in X-ray Security Inspection

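[Quick Read]: This paper addresses prohibited item detection in X-ray security inspection, in particular the shortage of data for rare categories caused by the long-tail distribution of item types. The key to the solution is a Dual-level Boost Network (DBNet) that tackles these challenges with two innovations: (1) a data augmentation strategy based on Poisson blending that generates realistic synthetic instances of rare items, effectively mitigating data imbalance; and (2) a context-aware feature enhancement module that captures the spatial and semantic interactions between objects and their surroundings, improving classification accuracy for rare categories. Experiments show that DBNet significantly improves detection of tail categories, outperforming existing state-of-the-art methods by 17.2% and thereby strengthening public safety.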

Link: https://arxiv.org/abs/2411.18078
Authors: Renshuai Tao,Haoyu Wang,Wei Wang,Yunchao Wei,Yao Zhao
Keywords: X-ray security, Dual-level Boost Network, X-ray, prohibited items, X-ray security inspections
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures

Abstract:The detection of prohibited items in X-ray security inspections is vital for ensuring public safety. However, the long-tail distribution of item categories, where certain prohibited items are far less common, poses a major challenge for detection models, as rare categories often lack sufficient training data. Existing methods struggle to classify these rare items accurately due to this imbalance. In this paper, we propose a Dual-level Boost Network (DBNet) specifically designed to overcome these challenges in X-ray security screening. Our approach introduces two key innovations: (1) a specific data augmentation strategy employing Poisson blending, inspired by the characteristics of X-ray images, to generate realistic synthetic instances of rare items, which can effectively mitigate data imbalance; and (2) a context-aware feature enhancement module that captures the spatial and semantic interactions between objects and their surroundings, enhancing classification accuracy for underrepresented categories. Extensive experimental results demonstrate that DBNet improves detection performance for tail categories, outperforming state-of-the-art methods in X-ray security inspection scenarios by a large margin of 17.2%, thereby ensuring enhanced public safety.

[CV-74] SmileSplat: Generalizable Gaussian Splats for Unconstrained Sparse Images

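[Quick Read]: This paper addresses how to predict explicit radiance fields from sparse multi-view images when ground-truth camera parameters are not available as input. The key to the solution is a new generalizable Gaussian Splatting method called SmileSplat. The method uses a multi-head Gaussian regression decoder to predict pixel-aligned Gaussian surfels, which have fewer degrees of freedom but better multi-view consistency. It further enhances the normal vectors of the Gaussian surfels and uses the proposed Bundle-Adjusting Gaussian Splatting module to optimize the Gaussians and the camera parameters (both extrinsic and intrinsic), enabling high-quality novel view synthesis. Experiments show that the method achieves state-of-the-art performance on a variety of 3D vision tasks.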

Link: https://arxiv.org/abs/2411.18072
Authors: Yanyan Li,Yixin Fang,Federico Tombari,Gim Hee Lee
Keywords: Generalizable Gaussian Splatting, Sparse Multi-view Images, Gaussian Splatting approaches, wider application prospects, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sparse multi-view images can be used to predict explicit radiance fields via generalizable Gaussian Splatting approaches, which have wider application prospects in real life when ground-truth camera parameters are not required as inputs. In this paper, a novel generalizable Gaussian Splatting method, SmileSplat, is proposed to reconstruct pixel-aligned Gaussian surfels for diverse scenarios, requiring only unconstrained sparse multi-view images. First, Gaussian surfels are predicted by a multi-head Gaussian regression decoder; they are represented with fewer degrees of freedom but have better multi-view consistency. Furthermore, the normal vectors of the Gaussian surfels are enhanced using high-quality normal priors. Second, the Gaussians and camera parameters (both extrinsic and intrinsic) are optimized with the proposed Bundle-Adjusting Gaussian Splatting module to obtain high-quality Gaussian radiance fields for novel view synthesis tasks. Extensive experiments on novel view rendering and depth map prediction tasks are conducted on public datasets, demonstrating that the proposed method achieves state-of-the-art performance in various 3D vision tasks. More information can be found on our project page (this https URL).

[CV-75] Large Scale Evaluation of Deep Learning-based Explainable Solar Flare Forecasting Models with Attribution-based Proximity Analysis

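Quick read: This paper addresses the interpretability and reliability of deep learning models for solar flare forecasting. The key to the solution is a proximity-based framework for analyzing post hoc explanations in order to assess how interpretable these models are. Specifically, the study uses Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) to generate attribution maps and introduces a proximity-based metric that quantitatively assesses the accuracy and relevance of the explanations, in particular when the regions of interest are known. The framework strengthens the evaluation of model interpretability and supports the development of more transparent and reliable solar flare forecasting systems.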

Link: https://arxiv.org/abs/2411.18070
Authors: Temitope Adeyeha, Chetraj Pandey, Berkay Aydin
Keywords: potentially significant impact, impact on Earth, Earth and space-based, Accurate and reliable, space-based infrastructure
Categories: Machine Learning (cs.LG); Solar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: This is a preprint accepted at the IEEE International Conference on Big Data 2024 (IEEE BigData 2024)

Abstract:Accurate and reliable predictions of solar flares are essential due to their potentially significant impact on Earth and space-based infrastructure. Although deep learning models have shown notable predictive capabilities in this domain, current evaluations often focus on accuracy while neglecting interpretability and reliability, factors that are especially critical in operational settings. To address this gap, we propose a novel proximity-based framework for analyzing post hoc explanations to assess the interpretability of deep learning models for solar flare prediction. Our study compares two models trained on full-disk line-of-sight (LoS) magnetogram images to predict $\geq$M-class solar flares within a 24-hour window. We employ the Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) method to generate attribution maps from these models, which we then analyze to gain insights into their decision-making processes. To support the evaluation of explanations in operational systems, we introduce a proximity-based metric that quantitatively assesses the accuracy and relevance of local explanations when regions of interest are known. Our findings indicate that the models’ predictions align with active region characteristics to varying degrees, offering valuable insights into their behavior. This framework enhances the evaluation of model interpretability in solar flare forecasting and supports the development of more transparent and reliable operational systems.
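
The abstract does not define its proximity-based metric, so the following is only a minimal sketch of the general idea: score an attribution map by how close its mass lies to known regions of interest. The function name `proximity_score`, the attribution-weighted distance, and the normalization by the image diagonal are all assumptions made for illustration, not the authors' formulation.

```python
import math

def proximity_score(attribution, roi_centers):
    """Attribution-weighted closeness to the nearest known region of interest.

    attribution: 2D list of non-negative attribution values (rows x cols).
    roi_centers: list of (row, col) region centers (hypothetical input format).
    Returns a value in [0, 1]; 1 means all attribution mass sits on an ROI center.
    """
    rows, cols = len(attribution), len(attribution[0])
    # Longest possible distance inside the map, used for normalization.
    diagonal = math.hypot(rows - 1, cols - 1) or 1.0
    total = sum(sum(row) for row in attribution)
    if total == 0 or not roi_centers:
        return 0.0
    weighted = 0.0
    for r in range(rows):
        for c in range(cols):
            nearest = min(math.hypot(r - rr, c - cc) for rr, cc in roi_centers)
            weighted += attribution[r][c] * nearest
    return 1.0 - (weighted / total) / diagonal
```

Under this sketch, a map whose attribution concentrates on an active region scores near 1, while one that highlights the opposite corner of the disk scores near 0.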

[CV-76] PersonaCraft: Personalized Full-Body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion

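Quick read: This paper addresses two challenges in existing personalized image generation methods: occlusions when generating images of multiple people, and the inability to accurately personalize full-body shapes. The key to the solution is PersonaCraft, a new approach that combines diffusion models with 3D human modeling. Specifically, PersonaCraft manages occlusions through 3D-aware pose conditioning (SMPLx-ControlNet) and accurately personalizes full-body shapes through SMPLx fitting. It also supports user-defined body shape adjustments, adding flexibility for individual customization. Experiments show that PersonaCraft performs strongly at generating high-quality, realistic multi-person images while resolving occlusion issues, setting a new standard for multi-person personalized image synthesis.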

Link: https://arxiv.org/abs/2411.18068
Authors: Gwanghyun Kim, Suh Yoon Jeon, Seunggyu Lee, Se Young Chun
Keywords: significantly advanced, enabling the creation, Personalized image generation, creation of highly, Personalized image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Personalized image generation has been significantly advanced, enabling the creation of highly realistic and customized images. However, existing methods often struggle with generating images of multiple people due to occlusions and fail to accurately personalize full-body shapes. In this paper, we propose PersonaCraft, a novel approach that combines diffusion models with 3D human modeling to address these limitations. Our method effectively manages occlusions by incorporating 3D-aware pose conditioning with SMPLx-ControlNet and accurately personalizes human full-body shapes through SMPLx fitting. Additionally, PersonaCraft enables user-defined body shape adjustments, adding flexibility for individual body customization. Experimental results demonstrate the superior performance of PersonaCraft in generating high-quality, realistic images of multiple individuals while resolving occlusion issues, thus establishing a new standard for multi-person personalized image synthesis. Project page: this https URL

[CV-77] GLS: Geometry-aware 3D Language Gaussian Splatting

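Quick read: This paper addresses the joint optimization of two tasks: indoor surface reconstruction and open-vocabulary segmentation. The key to the solution is GLS, a framework built on 3D Gaussian Splatting (3DGS) that achieves unified optimization by exploring the correlation between the two tasks. For surface reconstruction, a surface normal prior is introduced as a geometric cue to guide the rendered normal, and the normal error is used to optimize the rendered depth. For open-vocabulary segmentation, 2D CLIP features guide the instance features, and DEVA masks enhance view consistency. Experiments show that GLS surpasses existing state-of-the-art methods on the MuSHRoom, ScanNet++, and LERF-OVS datasets.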

Link: https://arxiv.org/abs/2411.18066
Authors: Jiaxiong Qiu, Liu Liu, Zhizhong Su, Tianwei Lin
Keywords: Gaussian Splatting, achieved significant performance, surface reconstruction, indoor surface reconstruction, open-vocabulary segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:Recently, 3D Gaussian Splatting (3DGS) has achieved significant performance on indoor surface reconstruction and open-vocabulary segmentation. This paper presents GLS, a unified framework of surface reconstruction and open-vocabulary segmentation based on 3DGS. GLS extends two fields by exploring the correlation between them. For indoor surface reconstruction, we introduce surface normal prior as a geometric cue to guide the rendered normal, and use the normal error to optimize the rendered depth. For open-vocabulary segmentation, we employ 2D CLIP features to guide instance features and utilize DEVA masks to enhance their view consistency. Extensive experiments demonstrate the effectiveness of jointly optimizing surface reconstruction and open-vocabulary segmentation, where GLS surpasses state-of-the-art approaches of each task on MuSHRoom, ScanNet++, and LERF-OVS datasets. Code will be available at this https URL.

[CV-78] Lightweight Gaze Estimation Model Via Fusion Global Information

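Quick read: This paper addresses a problem in deep learning-based appearance gaze estimation: existing high-precision models rely on deep networks, which leads to many parameters, long training times, and slow convergence. The key to the solution is FGI-Net (Fusion Global Information), a new lightweight gaze estimation model that fuses global information into a convolutional neural network (CNN). This reduces the need for multiple convolution and pooling layers to capture global information indirectly, lowering model complexity while improving accuracy and convergence speed. Experiments show that FGI-Net achieves a lower angular error than existing models on several datasets, with substantially fewer parameters and FLOPs, and converges faster.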

Link: https://arxiv.org/abs/2411.18064
Authors: Zhang Cheng, Yanxia Wang
Keywords: Deep learning-based appearance, gaining popularity due, Deep learning-based, learning-based appearance gaze, appearance gaze estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning-based appearance gaze estimation methods are gaining popularity due to their high accuracy and fewer constraints from the environment. However, existing high-precision models often rely on deeper networks, leading to problems such as large parameter counts, long training times, and slow convergence. To address this issue, this paper proposes a novel lightweight gaze estimation model, FGI-Net (Fusion Global Information). The model fuses global information into the CNN, effectively compensating for the need for multi-layer convolution and pooling to indirectly capture global information, while reducing model complexity and improving accuracy and convergence speed. To validate the performance of the model, a large number of experiments are conducted, comparing accuracy with existing classical and lightweight models, comparing convergence speed with models of different architectures, and conducting ablation experiments. Experimental results show that, compared with GazeCaps, the latest gaze estimation model, FGI-Net achieves a smaller angle error with 87.1% and 79.1% reductions in parameters and FLOPs, respectively (3.74° on MPIIFaceGaze, 5.15° on EyeDiap, 10.50° on Gaze360, and 6.02° on RT-Gene). Moreover, compared with models of different architectures such as CNN and Transformer, FGI-Net converges quickly to a higher accuracy range with fewer training iterations: when achieving optimal accuracy on the Gaze360 and EyeDiap datasets, FGI-Net needs 25% and 37.5% fewer training iterations than GazeTR, respectively.

[CV-79] Multi-task Gaze Estimation Via Unidirectional Convolution

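Quick read: This paper addresses the significant performance drop of lightweight models on gaze estimation, which stems mainly from the small number of feature channels in lightweight networks and the resulting limited expressive power. The key to the solution is a network called Multitask-Gaze, whose core components are Unidirectional Convolution (UC), Spatial and Channel Attention (SCA), a Global Convolution Module (GCM), and a Multi-task Regression Module (MRM). UC significantly reduces the number of parameters and FLOPs while also extending the receptive field and improving long-distance modeling, which improves performance. SCA highlights gaze-related features and suppresses irrelevant ones. GCM replaces the pooling layer, avoiding the performance degradation caused by information loss. MRM improves the accuracy of individual tasks and strengthens the connections between tasks to raise overall performance. Experiments show that, compared with the state-of-the-art method SUGE, Multitask-Gaze improves performance on MPIIFaceGaze and Gaze360 by 1.71% and 2.75%, respectively, while reducing parameters and FLOPs by 75.5% and 86.88%.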

Link: https://arxiv.org/abs/2411.18061
Authors: Zhang Cheng, Yanxia Wang
Keywords: gaze estimation tasks, Global Convolution Module, Multi-task Regression Module, lightweight models, significant performance degradation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Using lightweight models as backbone networks in gaze estimation tasks often results in significant performance degradation. The main reason is that the number of feature channels in lightweight networks is usually small, which makes the model expression ability limited. In order to improve the performance of lightweight models in gaze estimation tasks, a network model named Multitask-Gaze is proposed. The main components of Multitask-Gaze include Unidirectional Convolution (UC), Spatial and Channel Attention (SCA), Global Convolution Module (GCM), and Multi-task Regression Module(MRM). UC not only significantly reduces the number of parameters and FLOPs, but also extends the receptive field and improves the long-distance modeling capability of the model, thereby improving the model performance. SCA highlights gaze-related features and suppresses gaze-irrelevant features. The GCM replaces the pooling layer and avoids the performance degradation due to information loss. MRM improves the accuracy of individual tasks and strengthens the connections between tasks for overall performance improvement. The experimental results show that compared with the State-of-the-art method SUGE, the performance of Multitask-Gaze on MPIIFaceGaze and Gaze360 datasets is improved by 1.71% and 2.75%, respectively, while the number of parameters and FLOPs are significantly reduced by 75.5% and 86.88%.

[CV-80] HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

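Quick read: This paper addresses the difficulty multimodal large language models (LLMs) have in understanding video scenes, especially in handling complex multi-object interactions and reasoning. The key to the solution is HyperGLM, a method built on a Scene HyperGraph. It integrates entity scene graphs with a procedural graph to form a unified HyperGraph that supports reasoning about multi-object interactions and higher-order relationships. Injecting this unified HyperGraph into LLMs improves relationship modeling and reasoning in complex video scenes. The paper also introduces a new Video Scene Graph Reasoning (VSGR) dataset that supports multiple tasks, and reports that HyperGLM outperforms prior methods across them.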

Link: https://arxiv.org/abs/2411.18042
Authors: Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Khoa Luu
Keywords: Scene Graph Generation, Video Scene Graph, Scene Graph, understanding video scenes, Scene Graph Anticipation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supports five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across five tasks, effectively modeling and reasoning complex relationships in diverse video scenes.

[CV-81] VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

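Quick read: This paper addresses accuracy in the Human-Object Interaction (HOI) detection task. The key to the solution is to use a large Vision Language Model (VLM) as a form of objective function, quantifying the similarity of predicted HOI triplets with an image-text matching technique. Specifically, the paper represents HOI triplets linguistically to make full use of the language comprehension of VLMs, which are better suited than CLIP models because of their localization and object-centric nature. The matching score serves as the objective for contrastive optimization, which improves HOI detection accuracy. Experiments show that the method achieves state-of-the-art HOI detection accuracy on benchmarks.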

Link: https://arxiv.org/abs/2411.18038
Authors: Donggoo Kang, Dasol Jeong, Hyunmin Lee, Sangwoo Park, Hasil Park, Sunkyu Kwon, Yeongjoon Kim, Joonki Paik
Keywords: Large Vision Language, recently addressed remarkable, Vision Language Model, Large Vision, addressed remarkable progress
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:Large Vision Language Models (VLMs) have recently made remarkable progress in bridging two fundamental modalities. A VLM trained on a sufficiently large dataset exhibits a comprehensive understanding of both visual and linguistic information, which lets it perform diverse tasks. To distill this knowledge accurately, in this paper, we introduce a novel approach that explicitly utilizes a VLM as a form of objective function for the Human-Object Interaction (HOI) detection task (VLM-HOI). Specifically, we propose a method that quantifies the similarity of the predicted HOI triplet using the Image-Text matching technique. We represent HOI triplets linguistically to fully utilize the language comprehension of VLMs, which are more suitable than CLIP models due to their localization and object-centric nature. This matching score is used as an objective for contrastive optimization. To our knowledge, this is the first utilization of VLM language abilities for HOI detection. Experiments demonstrate the effectiveness of our method, achieving state-of-the-art HOI detection accuracy on benchmarks. We believe integrating VLMs into HOI detection represents important progress towards more advanced and interpretable analysis of human-object interactions.
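
As a rough illustration of the two ideas in this abstract, the sketch below turns an HOI triplet into a sentence and computes a softmax-style contrastive loss over matching scores. The sentence template, the function names, and the assumption that matching scores arrive as plain floats from some image-text matcher are all hypothetical; the abstract does not specify the actual prompt format or loss.

```python
import math

def verbalize_triplet(human, verb, obj):
    """Turn an HOI triplet into a short sentence for image-text matching."""
    article = "an" if obj and obj[0].lower() in "aeiou" else "a"
    return f"a {human} is {verb} {article} {obj}"

def contrastive_loss(scores, positive_index):
    """Softmax cross-entropy over matching scores.

    scores: image-text matching scores, one per candidate sentence
            (assumed to come from an external matcher, not computed here).
    The loss is small when the positive sentence has the highest score.
    """
    peak = max(scores)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum - scores[positive_index]
```

For example, `verbalize_triplet("person", "riding", "bicycle")` yields a caption that can be scored against the image alongside captions for incorrect triplets, and the loss pushes the correct caption's score above the rest.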

[CV-82] Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision

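Quick read: This paper addresses the lack of pixel-level alignment in RGB and near-infrared (NIR) stereo imaging systems, which poses challenges for downstream vision tasks. The key to the solution is a robotic vision system equipped with pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor. The system simultaneously captures pixel-aligned RGB stereo images, NIR stereo images, and temporally synchronized LiDAR point clouds. Using the robot's mobility, the paper provides a dataset of continuous video frames under diverse lighting conditions. It also proposes two methods that use the pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The former lets existing RGB-pretrained vision models use RGB-NIR information directly without fine-tuning; the latter fine-tunes existing vision models to use RGB-NIR information more effectively. Experiments show that pixel-aligned RGB-NIR images improve vision task results across a variety of lighting conditions.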

Link: https://arxiv.org/abs/2411.18025
Authors: Jinnyeong Kim, Seung-Hwan Baek
Keywords: Integrating RGB, potentially enhancing robotic, NIR stereo imaging, RGB and NIR, NIR stereo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Integrating RGB and NIR stereo imaging provides complementary spectral information, potentially enhancing robotic 3D vision in challenging lighting conditions. However, existing datasets and imaging systems lack pixel-level alignment between RGB and NIR images, posing challenges for downstream vision tasks. In this paper, we introduce a robotic vision system equipped with pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor mounted on a mobile robot. The system simultaneously captures pixel-aligned pairs of RGB stereo images, NIR stereo images, and temporally synchronized LiDAR points. Utilizing the mobility of the robot, we present a dataset containing continuous video frames under diverse lighting conditions. We then introduce two methods that utilize the pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The first approach enables existing RGB-pretrained vision models to directly utilize RGB-NIR information without fine-tuning. The second approach fine-tunes existing vision models to more effectively utilize RGB-NIR information. Experimental results demonstrate the effectiveness of using pixel-aligned RGB-NIR images across diverse lighting conditions.
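
The abstract does not describe how its image fusion works, so the snippet below is only a generic sketch of why pixel alignment matters: once RGB and NIR pixels correspond one-to-one, NIR intensity can be blended into the luminance of each RGB pixel, giving a three-channel image that an RGB-pretrained model can consume unchanged. The blending rule, the `alpha` parameter, and the function name are assumptions for illustration only.

```python
def fuse_pixel(rgb, nir, alpha=0.5):
    """Blend NIR intensity into the luminance of one aligned RGB pixel.

    rgb: (r, g, b) with values in [0, 1]; nir: scalar in [0, 1].
    alpha: hypothetical blend weight (0 keeps RGB, 1 uses NIR luminance).
    """
    r, g, b = rgb
    luma = 0.299 * r + 0.587 * g + 0.114 * b  # Rec. 601 luma weights
    fused_luma = (1.0 - alpha) * luma + alpha * nir
    if luma < 1e-6:
        # No chroma to preserve in a black pixel: fall back to gray.
        return (fused_luma, fused_luma, fused_luma)
    gain = fused_luma / luma
    # Scale all channels equally to keep hue, clamping to the valid range.
    return tuple(min(1.0, channel * gain) for channel in (r, g, b))
```

In a dark scene where the RGB pixel is nearly black but the NIR sensor still sees structure, this kind of fusion brightens the pixel while keeping whatever color information exists.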

[CV-83] FASIONAD : FAst and Slow FusION Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback

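Quick read: This paper addresses the challenge autonomous driving systems face with rare, long-tail events while maintaining safe, comfortable, and efficient navigation. The key to the solution is FASIONAD, a dual-system framework inspired by the "Thinking, Fast and Slow" cognitive model. The fast system handles routine navigation with rapid, data-driven path planning, while the slow system focuses on complex reasoning and decision-making, especially in challenging or unfamiliar situations. A dynamic switching mechanism based on score distribution and feedback lets the two systems hand over seamlessly, supporting efficient and safe navigation across driving scenarios. In addition, visual prompts generated by the fast system support human-like reasoning in the slow system, and the slow system's high-quality feedback in turn strengthens the fast system's decision-making.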

Link: https://arxiv.org/abs/2411.18013
Authors: Kangan Qian, Zhikun Ma, Yangfan He, Ziang Luo, Tianyu Shi, Tianze Zhu, Jiayin Li, Jianhui Wang, Ziyu Chen, Xiao He, Yining Shi, Zheng Fu, Xinyu Jiao, Kun Jiang, Diange Yang, Takafumi Matsumaru
Keywords: Ensuring safe, critical goal, Fast, Ensuring, fast system
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ensuring safe, comfortable, and efficient navigation is a critical goal for autonomous driving systems. While end-to-end models trained on large-scale datasets excel in common driving scenarios, they often struggle with rare, long-tail events. Recent progress in large language models (LLMs) has introduced enhanced reasoning capabilities, but their computational demands pose challenges for real-time decision-making and precise planning. This paper presents FASIONAD, a novel dual-system framework inspired by the cognitive model “Thinking, Fast and Slow.” The fast system handles routine navigation tasks using rapid, data-driven path planning, while the slow system focuses on complex reasoning and decision-making in challenging or unfamiliar situations. A dynamic switching mechanism based on score distribution and feedback allows seamless transitions between the two systems. Visual prompts generated by the fast system enable human-like reasoning in the slow system, which provides high-quality feedback to enhance the fast system’s decision-making. To evaluate FASIONAD, we introduce a new benchmark derived from the nuScenes dataset, specifically designed to differentiate fast and slow scenarios. FASIONAD achieves state-of-the-art performance on this benchmark, establishing a new standard for frameworks integrating fast and slow cognitive processes in autonomous driving. This approach paves the way for more adaptive, human-like autonomous driving systems.
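
One plausible reading of "switching based on score distribution" is that the fast planner hands over to the slow system when its candidate-trajectory scores are too evenly spread to trust. The sketch below illustrates that reading only; the entropy criterion, the threshold value, and the function names are assumptions and are not taken from the paper.

```python
import math

def normalized_entropy(scores):
    """Entropy of softmax(scores), scaled to [0, 1]."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs)) if len(probs) > 1 else 0.0

def choose_system(scores, entropy_threshold=0.8):
    """Pick 'slow' when the fast planner's scores look undecided.

    scores: hypothetical per-trajectory scores from the fast system.
    entropy_threshold: illustrative cutoff, not a value from the paper.
    """
    return "slow" if normalized_entropy(scores) > entropy_threshold else "fast"
```

A sharply peaked score vector keeps control with the fast system, while a flat one, which suggests an unfamiliar scene, triggers the slower reasoning path.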

[CV-84] Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

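Quick read: This paper addresses the discrete-continuous optimization problem in furniture assembly: selecting the furniture parts and estimating their connecting poses in a physically realistic manner. The key to the solution is to use the diagrams in assembly manuals to split the problem into a discrete phase and a continuous phase. Specifically, the paper proposes Manual-PA, a transformer-based, instruction manual-guided 3D part assembly framework. Using a contrastive learning backbone, it learns to semantically align 3D parts with their illustrations in the manual in order to predict the assembly order, and it infers each part's 6D pose by relating the part to the final furniture depicted in the manual. Experiments show that using the diagrams and the part order significantly improves assembly performance, and that the method generalizes well on the IKEA-Manual dataset.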

Link: https://arxiv.org/abs/2411.18011
Authors: Jiahao Zhang, Anoop Cherian, Cristian Rodriguez, Weijian Deng, Stephen Gould
Keywords: Assembling furniture amounts, physically realistic manner, discrete-continuous optimization task, Assembling furniture, realistic manner
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space thus making learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone towards predicting the assembly order and infers the 6D pose of each part via relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts lead to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly on the IKEA-Manual dataset.

[CV-85] Monocular Obstacle Avoidance Based on Inverse PPO for Fixed-wing UAVs

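Quick read: This paper addresses avoidance of unknown obstacles by fixed-wing UAVs in the low-altitude economy and urban air mobility. The key to the solution is a lightweight deep reinforcement learning (DRL) based obstacle avoidance system that relies only on onboard visual sensors and can detect and avoid unknown obstacles in real time at cruise speeds above 30 m/s. The solution has three parts: 1) a streamlined network architecture for single-frame image depth inference, ensuring real-time obstacle detection on edge computing devices; 2) a reinforcement learning controller with a new reward function that balances target approach against flight trajectory smoothness, satisfying the dynamic constraints and stability requirements of fixed-wing UAVs; 3) an adaptive entropy adjustment mechanism that mitigates the exploration-exploitation trade-off in DRL, improving training convergence and obstacle avoidance success rates. Extensive software-in-the-loop and hardware-in-the-loop experiments show that the framework outperforms other methods in obstacle avoidance efficiency and flight trajectory smoothness, and confirm that it is feasible on edge devices.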

Link: https://arxiv.org/abs/2411.18009
Authors: Haochen Chai, Meimei Su, Yang Lyu, Zhunga Liu, Chunhui Zhao, Quan Pan
Keywords: Unmanned Aerial Vehicles, Urban Air Mobility, Fixed-wing Unmanned Aerial, burgeoning Low-altitude Economy, Aerial Vehicles
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fixed-wing Unmanned Aerial Vehicles (UAVs) are one of the most commonly used platforms for the burgeoning Low-altitude Economy (LAE) and Urban Air Mobility (UAM), due to their long endurance and high-speed capabilities. Classical obstacle avoidance systems, which rely on prior maps or sophisticated sensors, face limitations in unknown low-altitude environments and small UAV platforms. In response, this paper proposes a lightweight deep reinforcement learning (DRL) based UAV collision avoidance system that enables a fixed-wing UAV to avoid unknown obstacles at cruise speed over 30 m/s, with only onboard visual sensors. The proposed system employs a single-frame image depth inference module with a streamlined network architecture to ensure real-time obstacle detection, optimized for edge computing devices. After that, a reinforcement learning controller with a novel reward function is designed to balance the target approach and flight trajectory smoothness, satisfying the specific dynamic constraints and stability requirements of a fixed-wing UAV platform. An adaptive entropy adjustment mechanism is introduced to mitigate the exploration-exploitation trade-off inherent in DRL, improving training convergence and obstacle avoidance success rates. Extensive software-in-the-loop and hardware-in-the-loop experiments demonstrate that the proposed framework outperforms other methods in obstacle avoidance efficiency and flight trajectory smoothness and confirm the feasibility of implementing the algorithm on edge devices. The source code is publicly available at this https URL.
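
The abstract states that the reward balances target approach against trajectory smoothness but gives no formula. A minimal sketch of such a trade-off, with hypothetical weights and a squared action-change penalty standing in for whatever smoothness term the authors actually use, could look like this:

```python
def step_reward(prev_distance, distance, prev_action, action,
                w_progress=1.0, w_smooth=0.1):
    """Reward progress toward the target, penalize abrupt control changes.

    prev_distance, distance: distance to the target before and after the step.
    prev_action, action: control vectors (for example roll and pitch commands).
    w_progress, w_smooth: illustrative weights, not values from the paper.
    """
    progress = prev_distance - distance
    roughness = sum((a - p) ** 2 for a, p in zip(action, prev_action))
    return w_progress * progress - w_smooth * roughness
```

With equal progress toward the goal, a step that keeps the previous command earns more than one that swings the controls, which is the behavior a fixed-wing platform with tight dynamic constraints would want to encourage.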

[CV-86] AI-Driven Smartphone Solution for Digitizing Rapid Diagnostic Test Kits and Enhancing Accessibility for the Visually Impaired

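Quick read: This paper addresses the challenge of accurately interpreting rapid diagnostic test results, in particular how to improve reliability and accuracy when photos are taken under imperfect conditions. The key to the solution is to integrate AI algorithms, including convolutional neural networks (CNN), into a smartphone application. YOLOv8 precisely crops and extracts the test membrane region, even when the test kit is off-center or at the edge of the image. A CNN classifier then analyzes the extracted image, determines whether the result is positive, negative, or invalid, and reports a confidence level. This integration significantly improves the sensitivity and specificity of result interpretation, and a SHapley Additive exPlanations (SHAP) analysis reveals the factors that influence the model's decisions. The result is a robust way to distinguish genuine test lines from background noise and to assess test line intensity and uniformity.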

Link: https://arxiv.org/abs/2411.18007
Authors: R. B. Dastagir, J. T. Jami, S. Chanda, F. Hafiz, M. Rahman, K. Dey, M. M. Rahman, M. Qureshi, M. M. Chowdhury
Keywords: timely disease detection, results remains challenging, test results remains, test result interpretation, diagnostic test result
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rapid diagnostic tests are crucial for timely disease detection and management, yet accurate interpretation of test results remains challenging. In this study, we propose a novel approach to enhance the accuracy and reliability of rapid diagnostic test result interpretation by integrating artificial intelligence (AI) algorithms, including convolutional neural networks (CNN), within a smartphone-based application. The app enables users to take pictures of their test kits, which YOLOv8 then processes to precisely crop and extract the membrane region, even if the test kit is not centered in the frame or is positioned at the very edge of the image. This capability offers greater accessibility, allowing even visually impaired individuals to capture test images without needing perfect alignment, thus promoting user independence and inclusivity. The extracted image is analyzed by an additional CNN classifier that determines if the results are positive, negative, or invalid, providing users with the results and a confidence level. Through validation experiments with commonly used rapid test kits across various diagnostic applications, our results demonstrate that the synergistic integration of AI significantly improves sensitivity and specificity in test result interpretation. This improvement can be attributed to the extraction of the membrane zones from the test kit images using the state-of-the-art YOLO algorithm. Additionally, we performed SHapley Additive exPlanations (SHAP) analysis to investigate the factors influencing the model’s decisions, identifying reasons behind both correct and incorrect classifications. By facilitating the differentiation of genuine test lines from background noise and providing valuable insights into test line intensity and uniformity, our approach offers a robust solution to challenges in rapid test interpretation.

[CV-87] An End-to-End Two-Stream Network Based on RGB Flow and Representation Flow for Human Action Recognition

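Quick read: This paper addresses the high computational cost of traditional two-stream neural network models for video-based action recognition. The key to the solution is a representation flow algorithm that replaces the original optical flow branch, enabling end-to-end training while markedly reducing computational cost and prediction time. For egocentric action recognition, the model uses class activation maps (CAMs) to improve accuracy and a convolutional long short-term memory network (ConvLSTM) with spatial attention. On the GTEA61, EGTEA GAZE+, and HMDB datasets it achieves prediction times of 0.1881s, 0.1503s, and 0.1459s, respectively, compared with 101.6795s, 25.3799s, and 203.9958s for the original model.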

Link: https://arxiv.org/abs/2411.18002
Authors: Song-Jiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Tian-Shan Liu, Kin-Man Lam
Keywords: video based action, making two-stream neural, based action recognition, two-stream neural networks, computer vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, 9 tables

Abstract:With the rapid advancements in deep learning, computer vision tasks have seen significant improvements, making two-stream neural networks a popular focus for video-based action recognition. Traditional models using RGB and optical flow streams achieve strong performance but at a high computational cost. To address this, we introduce a representation flow algorithm to replace the optical flow branch in the egocentric action recognition model, enabling end-to-end training while reducing computational cost and prediction time. Our model, designed for egocentric action recognition, uses class activation maps (CAMs) to improve accuracy and ConvLSTM for spatio-temporal encoding with spatial attention. When evaluated on the GTEA61, EGTEA GAZE+, and HMDB datasets, our model matches the accuracy of the original model on GTEA61 and exceeds it by 0.65% and 0.84% on EGTEA GAZE+ and HMDB, respectively. Prediction runtimes are significantly reduced to 0.1881s, 0.1503s, and 0.1459s, compared to the original model’s 101.6795s, 25.3799s, and 203.9958s. Ablation studies were also conducted to study the impact of different parameters on model performance. Keywords: two-stream, egocentric, action recognition, CAM, representation flow, ConvLSTM
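
To put the reported runtimes in perspective, the short calculation below converts them into speedup factors. It uses only the numbers quoted in the abstract and assumes the three runtimes are listed in the same dataset order as the accuracy results (GTEA61, EGTEA GAZE+, HMDB); the function and variable names are illustrative.

```python
def speedup(original_seconds, proposed_seconds):
    """How many times faster the proposed model's prediction is."""
    return original_seconds / proposed_seconds

# (original runtime, proposed runtime) in seconds, as quoted in the abstract.
runtimes = {
    "GTEA61": (101.6795, 0.1881),
    "EGTEA GAZE+": (25.3799, 0.1503),
    "HMDB": (203.9958, 0.1459),
}

speedups = {name: speedup(*pair) for name, pair in runtimes.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

The factors come out at roughly 540x, 170x, and 1400x, so the reduction spans two to three orders of magnitude depending on the dataset.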

[CV-88] Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

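Quick read: This paper addresses safety alignment in Vision-Language Models (VLMs), in particular how adversarial image attacks can expose and exploit the visual vulnerabilities of these models. The key to the solution is a new jailbreak framework called MLAI (Multi-Loss Adversarial Images). It uses scenario-aware image generation for semantic alignment, applies flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks to increase effectiveness. Experiments show attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, significantly higher than existing methods, along with considerable transferability: up to a 60.11% success rate on commercial black-box VLMs.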

Link: https://arxiv.org/abs/2411.18000
Authors: Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Huang, Yujun Cai
Keywords: underlying language models, inheriting security measures, Vision-Language Models, language models, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI’s significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, substantially outperforming existing methods by margins of 34.37% and 12.77% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11% success rate. Our work reveals fundamental visual vulnerabilities in current VLMs safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.

[CV-89] Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion

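Quick read: This paper addresses modality alignment in multispectral pedestrian detection, especially when real-world data are heavily misaligned. The key to the solution is to use Large-scale Vision-Language Models (LVLM) for cross-modal semantic alignment, aligning semantic information across the RGB and thermal domains without relying on complex and costly traditional pre-processing calibration, and thereby improving detection accuracy. The approach simplifies operational requirements and makes multispectral detection more practical in real applications.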

Link: https://arxiv.org/abs/2411.17995
Authors: Taeheon Kim, Sangyun Chung, Youngjoon Yu, Yong Man Ro
Keywords: crucial component, Multispectral pedestrian detection, Multispectral pedestrian, critical applications, heavily misaligned
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multispectral pedestrian detection is a crucial component in various critical applications. However, a significant challenge arises due to the misalignment between these modalities, particularly under real-world conditions where data often appear heavily misaligned. Conventional methods developed on well-aligned or minimally misaligned datasets fail to address these discrepancies adequately. This paper introduces a new framework for multispectral pedestrian detection designed specifically to handle heavily misaligned datasets without the need for costly and complex traditional pre-processing calibration. By leveraging Large-scale Vision-Language Models (LVLM) for cross-modal semantic alignment, our approach seeks to enhance detection accuracy by aligning semantic information across the RGB and thermal domains. This method not only simplifies the operational requirements but also extends the practical usability of multispectral detection technologies in practical applications.

[CV-90] Differentiable Inverse Rendering with Interpretable Basis BRDFs CVPR2025

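Quick read: This paper addresses the reconstruction of geometry and spatially varying bidirectional reflectance distribution functions (SVBRDFs) in inverse rendering. Existing methods tend to produce basis BRDFs that lack intuitive separation and scale poorly to scenes of varying complexity. The key to the solution is a differentiable inverse rendering method that represents the scene with 2D Gaussians, where the reflectance of each Gaussian is defined by a weighted blend of a set of basis BRDFs. The method optimizes the reconstruction with differentiable rasterization and a rendering loss, dynamically adjusts the number of basis BRDFs to fit the target scene, and encourages sparsity in the basis weights so that each Gaussian's reflectance is represented by only a few basis BRDFs. This yields accurate geometry and basis BRDFs and supports physically-based novel-view relighting and intuitive scene editing.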

Link: https://arxiv.org/abs/2411.17994
Authors: Hoon-Gyu Chung, Seokjun Choi, Seung-Hwan Baek
Keywords: basis BRDFs, basis, Inverse rendering, BRDFs, Inverse rendering seeks
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: This paper is submitted to CVPR 2025. This is a different paper from my previous paper “Differentiable Point-based Inverse Rendering”. It must not be removed automatically

点击查看摘要

Abstract:Inverse rendering seeks to reconstruct both geometry and spatially varying BRDFs (SVBRDFs) from captured images. To address the inherent ill-posedness of inverse rendering, basis BRDF representations are commonly used, modeling SVBRDFs as spatially varying blends of a set of basis BRDFs. However, existing methods often yield basis BRDFs that lack intuitive separation and have limited scalability to scenes of varying complexity. In this paper, we introduce a differentiable inverse rendering method that produces interpretable basis BRDFs. Our approach models a scene using 2D Gaussians, where the reflectance of each Gaussian is defined by a weighted blend of basis BRDFs. We efficiently render an image from the 2D Gaussians and basis BRDFs using differentiable rasterization and impose a rendering loss with the input images. During this analysis-by-synthesis optimization process of differentiable inverse rendering, we dynamically adjust the number of basis BRDFs to fit the target scene while encouraging sparsity in the basis weights. This ensures that the reflectance of each Gaussian is represented by only a few basis BRDFs. This approach enables the reconstruction of accurate geometry and interpretable basis BRDFs that are spatially separated. Consequently, the resulting scene representation, comprising basis BRDFs and 2D Gaussians, supports physically-based novel-view relighting and intuitive scene editing.
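
The blend-and-sparsify idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: each basis BRDF is reduced to a hypothetical RGB albedo, the sparsity term is assumed to be an L1 penalty on non-negative weights, and `prune_unused_bases` is an invented helper standing in for "dynamically adjusting the number of basis BRDFs".

```python
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def blend_reflectance(weights: Sequence[float], basis_albedos: Sequence[RGB]) -> RGB:
    """Reflectance of one Gaussian as a weighted blend of basis parameters.

    Assumption: each basis BRDF is summarised by a single RGB albedo.
    """
    return tuple(
        sum(w * basis[c] for w, basis in zip(weights, basis_albedos))
        for c in range(3)
    )


def l1_sparsity(weights_per_gaussian: Sequence[Sequence[float]]) -> float:
    """Mean L1 norm of the blend weights (assumed form of the sparsity term)."""
    total = sum(abs(w) for ws in weights_per_gaussian for w in ws)
    return total / len(weights_per_gaussian)


def prune_unused_bases(
    weights_per_gaussian: Sequence[Sequence[float]], threshold: float = 1e-3
) -> List[int]:
    """Indices of bases that at least one Gaussian still uses (hypothetical rule)."""
    n_bases = len(weights_per_gaussian[0])
    return [
        k for k in range(n_bases)
        if max(ws[k] for ws in weights_per_gaussian) > threshold
    ]


basis = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]       # a red and a blue basis
print(blend_reflectance([0.25, 0.75], basis))     # blend dominated by the blue basis
print(prune_unused_bases([[0.9, 0.0, 0.1], [1.0, 0.0, 0.0]]))
```

In a real pipeline the weights would be optimized by gradient descent through the differentiable rasterizer; the sketch only shows what the blend and the penalty compute.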

[CV-91] RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

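Quick read: This paper addresses the low computational efficiency and limited interpretability of foundation models on high-resolution remote sensing images. The key is a parallel computing model based on the physical process of heat conduction, the Heat Conduction Operator (HCO), used to simulate local region correlations in high-resolution remote sensing images. Specifically, RS-vHeat applies the HCO with O(N^1.5) complexity, which reduces computational overhead while capturing the structure of remote sensing objects to guide heat diffusion. The model also learns frequency distribution representations through a self-supervised strategy based on frequency-domain hierarchical masking and multi-domain reconstruction, improving efficiency and performance across multiple tasks and datasets. Compared with attention-based remote sensing foundation models, RS-vHeat reduces memory consumption by 84%, decreases FLOPs by 24%, and improves throughput by 2.7 times.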

Link: https://arxiv.org/abs/2411.17984
Authors: Huiyang Hu, Peijin Wang, Hanbo Bi, Boyuan Tong, Zhaozhi Wang, Wenhui Diao, Hao Chang, Yingchao Feng, Ziqi Zhang, Qixiang Ye, Kun Fu, Xian Sun
Keywords: offering greater scalability, Remote sensing foundation, Remote sensing, remote sensing images, models largely break
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 9 figures and 9 tables

Abstract:Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with high-resolution remote sensing images. To overcome these challenges, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of O(N^1.5) and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, RS-vHeat reduces memory consumption by 84%, decreases FLOPs by 24% and improves throughput by 2.7 times.
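
The abstract does not say how the HCO is computed. One common way to evaluate heat diffusion for all positions at once is in the frequency domain: transform the signal with a cosine transform, damp each frequency by an exponential decay, and transform back. The sketch below shows that idea in 1D under this assumption; the function names and the `diffusivity` and `t` parameters are illustrative, and the naive O(N^2) transforms are written for clarity, not to match the O(N^1.5) complexity the paper reports.

```python
import math
from typing import List, Sequence


def dct2(x: Sequence[float]) -> List[float]:
    """Orthonormal DCT-II of a 1D signal (naive O(N^2) implementation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        ))
    return out


def idct2(coeffs: Sequence[float]) -> List[float]:
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]


def heat_conduction_1d(u0: Sequence[float], diffusivity: float, t: float) -> List[float]:
    """Diffuse u0 for time t with insulated ends by damping cosine modes."""
    n = len(u0)
    decayed = [
        c * math.exp(-diffusivity * (math.pi * k / n) ** 2 * t)
        for k, c in enumerate(dct2(u0))
    ]
    return idct2(decayed)


spike = heat_conduction_1d([0.0, 0.0, 4.0, 0.0], diffusivity=1.0, t=1.0)
print([round(v, 3) for v in spike])  # the spike spreads out; total heat is conserved
```

Every output position depends on every input position, which is the "global receptive field" property the abstract mentions, while high-frequency (local) detail decays fastest.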

[CV-92] HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction

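Quick read: This paper addresses the difficulty existing Neural SLAM and 3D Gaussian splatting (3DGS)-based SLAM methods have in balancing rendering quality against geometric accuracy. The key is to strengthen geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and to use 3D Gaussian splatting as the core map representation to model the scene efficiently. The paper also proposes a grid-based scale alignment strategy that maintains scale consistency in the prior depths and so recovers finer depth detail. With these components, HI-SLAM2 achieves fast and accurate 3D scene reconstruction from RGB input alone, significantly outperforms existing Neural SLAM methods on several datasets, and even surpasses RGB-D-based methods in reconstruction and rendering quality.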

Link: https://arxiv.org/abs/2411.17982
Authors: Wei Zhang, Qing Cheng, David Skuddis, Niclas Zeller, Daniel Cremers, Norbert Haala
Keywords: geometry-aware Gaussian SLAM, Gaussian SLAM system, Existing Neural SLAM, RGB input, Neural SLAM
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review process

Abstract:We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. Existing Neural SLAM or 3DGS-based SLAM methods often trade off rendering quality against geometry accuracy; our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance the ability for geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on Replica, ScanNet, and ScanNet++, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality. The project page and source code will be made available at this https URL.

[CV-93] Vision Mamba Distillation for Low-resolution Fine-grained Image Classification

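Quick read: This paper addresses the exponential growth in model parameters and computational complexity in low-resolution fine-grained image classification. The key is Vision Mamba Distillation (ViMD), which combines a lightweight super-resolution vision Mamba classification network (SRVM-Net) with a multi-level Mamba knowledge distillation loss to improve both efficiency and effectiveness. Specifically, SRVM-Net redesigns the classification sub-network to strengthen visual feature extraction, while the multi-level Mamba knowledge distillation loss transfers prior knowledge from a high-resolution vision Mamba classification network (HRVM-Net) to SRVM-Net. The result keeps accuracy high with fewer parameters and FLOPs, making it better suited to embedded devices.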

Link: https://arxiv.org/abs/2411.17980
Authors: Yao Chen, Jiabao Wang, Peichao Wang, Rui Zhang, Yang Li
Keywords: made significant progress, recently made significant, Low-resolution fine-grained image, vision Mamba classification, Mamba classification network
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-resolution fine-grained image classification has recently made significant progress, largely thanks to the super-resolution techniques and knowledge distillation methods. However, these approaches lead to an exponential increase in the number of parameters and computational complexity of models. In order to solve this problem, in this letter, we propose a Vision Mamba Distillation (ViMD) approach to enhance the effectiveness and efficiency of low-resolution fine-grained image classification. Concretely, a lightweight super-resolution vision Mamba classification network (SRVM-Net) is proposed to improve its capability for extracting visual features by redesigning the classification sub-network with Mamba modeling. Moreover, we design a novel multi-level Mamba knowledge distillation loss boosting the performance, which can transfer prior knowledge obtained from a High-resolution Vision Mamba classification Network (HRVM-Net) as a teacher into the proposed SRVM-Net as a student. Extensive experiments on seven public fine-grained classification datasets related to benchmarks confirm our ViMD achieves a new state-of-the-art performance. While having higher accuracy, ViMD outperforms similar methods with fewer parameters and FLOPs, which is more suitable for embedded device applications. Code is available at this https URL.

[CV-94] Improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery

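Quick read: This paper addresses forest carbon stock estimation, specifically large-scale estimation in Huize County, Qujing City, Yunnan Province, China. The key is to bring in generative AI models: an improved implicit diffusion model (IIDM) that uses KD-VGG and KD-UNet modules for initial feature extraction and a Cross-attention + MLPs module for feature fusion. These components markedly improve estimation accuracy; compared with the regression model, IIDM improves RMSE by 41.69% to 42.33%, demonstrating the potential of generative AI in quantitative remote sensing and supporting regional carbon stock management.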

Link: https://arxiv.org/abs/2411.17973
Authors: Zhenyu Yu
Keywords: mitigating climate change, significant terrestrial carbon, effectively reducing atmospheric, carbon stock mechanism, concentrations and mitigating
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review

Abstract:The forest serves as the most significant terrestrial carbon stock mechanism, effectively reducing atmospheric CO2 concentrations and mitigating climate change. Remote sensing provides high data accuracy and enables large-scale observations. Optical images facilitate long-term monitoring, which is crucial for future carbon stock estimation studies. This study focuses on Huize County, Qujing City, Yunnan Province, China, utilizing GF-1 WFV satellite imagery. The KD-VGG and KD-UNet modules were introduced for initial feature extraction, and the improved implicit diffusion model (IIDM) was proposed. The results showed: (1) The VGG module improved initial feature extraction, improving accuracy, and reducing inference time with optimized model parameters. (2) The Cross-attention + MLPs module enabled effective feature fusion, establishing critical relationships between global and local features, achieving high-accuracy estimation. (3) The IIDM model, a novel contribution, demonstrated the highest estimation accuracy with an RMSE of 12.17%, significantly improving by 41.69% to 42.33% compared to the regression model. In carbon stock estimation, the generative model excelled in extracting deeper features, significantly outperforming other models, demonstrating the feasibility of AI-generated content in quantitative remote sensing. The 16-meter resolution estimates provide a robust basis for tailoring forest carbon sink regulations, enhancing regional carbon stock management.

[CV-95] Adversarial Training in Low-Label Regimes with Margin-Based Interpolation

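Quick read: This paper addresses how adversarial training can improve both the robustness and the natural accuracy of neural networks when labeled data is scarce. The key is a new semi-supervised adversarial training method that generates effective adversarial examples: it linearly interpolates between clean and adversarial examples to create interpolated adversarial examples that cross the decision boundary by a controlled margin, and it applies a global epsilon scheduling strategy that progressively adjusts the upper bound on perturbation strength. Together, these strategies let the model build increasingly complex decision boundaries during training, improving robustness while preserving high natural accuracy.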

Link: https://arxiv.org/abs/2411.17959
Authors: Tian Ye, Rajgopal Kannan, Viktor Prasanna
Keywords: train robust neural, robust neural network, neural network models, train robust, robust neural
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adversarial training has emerged as an effective approach to train robust neural network models that are resistant to adversarial attacks, even in low-label regimes where labeled data is scarce. In this paper, we introduce a novel semi-supervised adversarial training approach that enhances both robustness and natural accuracy by generating effective adversarial examples. Our method begins by applying linear interpolation between clean and adversarial examples to create interpolated adversarial examples that cross decision boundaries by a controlled margin. This sample-aware strategy tailors adversarial examples to the characteristics of each data point, enabling the model to learn from the most informative perturbations. Additionally, we propose a global epsilon scheduling strategy that progressively adjusts the upper bound of perturbation strengths during training. The combination of these strategies allows the model to develop increasingly complex decision boundaries with better robustness and natural accuracy. Empirical evaluations show that our approach effectively enhances performance against various adversarial attacks, such as PGD and AutoAttack.
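
The two ingredients named in the abstract, margin-controlled interpolation and epsilon scheduling, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the classifier is a binary scorer whose sign gives the class, the interpolation coefficient is found by bisection (which presumes the score changes monotonically along the segment, as it does for a linear model), and the schedule is assumed to be a linear ramp. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[Sequence[float]], float]


def interpolate(x_clean: Sequence[float], x_adv: Sequence[float], alpha: float) -> List[float]:
    """Point on the segment from x_clean (alpha=0) to x_adv (alpha=1)."""
    return [c + alpha * (a - c) for c, a in zip(x_clean, x_adv)]


def margin_interpolation(
    score: Scorer,
    x_clean: Sequence[float],
    x_adv: Sequence[float],
    label: int,           # +1 or -1
    margin: float,
    steps: int = 30,
) -> Tuple[float, List[float]]:
    """Smallest alpha whose interpolated point is misclassified by `margin`.

    Falls back to the full adversarial example if even alpha=1 does not
    cross the boundary by the requested margin.
    """
    def signed(alpha: float) -> float:
        return label * score(interpolate(x_clean, x_adv, alpha))

    if signed(1.0) > -margin:
        return 1.0, list(x_adv)
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if signed(mid) <= -margin:
            hi = mid      # already past the margin: try a smaller alpha
        else:
            lo = mid
    return hi, interpolate(x_clean, x_adv, hi)


def epsilon_schedule(epoch: int, total_epochs: int, eps_max: float) -> float:
    """Assumed linear ramp of the perturbation bound over training."""
    return eps_max * min(1.0, (epoch + 1) / total_epochs)


toy_score = lambda x: 2.0 * x[0] - 1.0   # toy linear classifier
alpha, x_mid = margin_interpolation(toy_score, [1.0], [0.0], label=1, margin=0.2)
print(round(alpha, 4), [round(v, 4) for v in x_mid])
```

Because each example gets its own `alpha`, the perturbation is tailored per sample, which is the "sample-aware" property the abstract describes.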

[CV-96] Optimization-Free Image Immunization Against Diffusion-Based Editing

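Quick read: This paper addresses the scalability problem of current image immunization defenses against diffusion-based editing models, in particular the high time cost of re-optimizing for every image. The key is DiffVax, a scalable, lightweight, optimization-free framework for image immunization. DiffVax is trained with a loss term that ensures both the failure of editing attempts and the imperceptibility of the perturbation, which lets it generalize to unseen content, reduces computational cost, and cuts immunization time from days to milliseconds, a 250,000x speedup. The method works with a range of diffusion-based editing tools and, for the first time, effectively protects video content from editing.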

Link: https://arxiv.org/abs/2411.17957
Authors: Tarik Can Ozden, Ozgur Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, Sandeep P. Chinchali, James M. Rehg
Keywords: embed imperceptible noise, Current image immunization, immunization defense techniques, Current image, editing embed imperceptible
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL

Abstract:Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image, taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framework for image immunization, specifically designed to prevent diffusion-based editing. Our approach enables effective generalization to unseen content, reducing computational costs and cutting immunization time from days to milliseconds, achieving a 250,000x speedup. This is achieved through a loss term that ensures the failure of editing attempts and the imperceptibility of the perturbations. Extensive qualitative and quantitative results demonstrate that our model is scalable, optimization-free, adaptable to various diffusion-based editing tools, robust against counter-attacks, and, for the first time, effectively protects video content from editing. Our code is provided in our project webpage.

[CV-97] ROICtrl: Boosting Instance Control for Visual Generation

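Quick read: This paper addresses the difficulty natural language has in accurately associating positional and attribute information with multiple instances, which limits current text-based visual generation models to simple compositions with only a few dominant instances. The key is regional instance control, which strengthens diffusion models by pairing each instance with a bounding box and a free-form caption. The paper introduces ROI-Unpool, an operation complementary to ROI-Align; together they enable explicit, efficient, and accurate manipulation of regions of interest (ROIs) on high-resolution feature maps. Building on this, the paper proposes ROICtrl, an adapter for pretrained diffusion models that provides precise regional instance control and is compatible with community-finetuned diffusion models as well as existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending them to multi-instance generation. Experiments show that ROICtrl performs strongly in regional instance control while significantly reducing computational cost.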

Link: https://arxiv.org/abs/2411.17949
Authors: Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou
Keywords: accurately associate positional, limits current text-based, simpler compositions featuring, Natural language, current text-based visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL

Abstract:Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.
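
To make the ROI-Align / ROI-Unpool pairing concrete, here is a deliberately simplified sketch of the round trip: crop a box from a feature map to a fixed size, then paste the fixed-size result back at the box location. It assumes integer box coordinates and nearest-neighbour sampling, whereas ROI-Align in object detection samples bilinearly at fractional coordinates; the function names are illustrative and the paper's ROI-Unpool may differ in detail.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]   # (y0, x0, y1, x1), half-open, integer pixels


def roi_align_nearest(feat: Grid, box: Box, out_h: int, out_w: int) -> Grid:
    """Resample the box region of `feat` to a fixed out_h x out_w grid."""
    y0, x0, y1, x1 = box
    bh, bw = y1 - y0, x1 - x0
    out = []
    for i in range(out_h):
        # Map the centre of output row i back into the box.
        sy = min(y1 - 1, y0 + int((i + 0.5) * bh / out_h))
        row = []
        for j in range(out_w):
            sx = min(x1 - 1, x0 + int((j + 0.5) * bw / out_w))
            row.append(feat[sy][sx])
        out.append(row)
    return out


def roi_unpool_nearest(roi_feat: Grid, box: Box, height: int, width: int,
                       fill: float = 0.0) -> Grid:
    """Paste a fixed-size ROI feature back into a height x width canvas."""
    y0, x0, y1, x1 = box
    bh, bw = y1 - y0, x1 - x0
    rh, rw = len(roi_feat), len(roi_feat[0])
    canvas = [[fill] * width for _ in range(height)]
    for y in range(y0, y1):
        i = min(rh - 1, int((y - y0 + 0.5) * rh / bh))
        for x in range(x0, x1):
            j = min(rw - 1, int((x - x0 + 0.5) * rw / bw))
            canvas[y][x] = roi_feat[i][j]
    return canvas


feat = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
roi = roi_align_nearest(feat, (1, 1, 3, 3), 2, 2)
print(roi)
print(roi_unpool_nearest(roi, (1, 1, 3, 3), 4, 4))
```

The point of the pairing is that per-instance processing can happen on small fixed-size crops, with the results written back only inside each box, instead of masking attention over the whole feature map.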

[CV-98] MARVEL-40M: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

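Quick read: This paper addresses the generation of high-quality 3D content from text prompts, where the main obstacle is the limited size, diversity, and annotation depth of existing datasets. The key is MARVEL-40M+, an extensive dataset with 40 million text annotations covering more than 8.9 million 3D assets aggregated from seven major 3D datasets. The paper proposes a multi-stage annotation pipeline that combines open-source pretrained multi-view vision-language models (VLMs) and large language models (LLMs) to automatically produce multi-level descriptions, from detailed descriptions (150-200 words) to concise semantic tags (10-20 words). The pipeline also incorporates human metadata from the source datasets to reduce VLM hallucinations and add domain-specific information. In addition, the paper develops MARVEL-FX3D, a two-stage text-to-3D pipeline that fine-tunes Stable Diffusion and uses a pretrained image-to-3D network to generate textured 3D meshes within 15 seconds. Experiments show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity.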

Link: https://arxiv.org/abs/2411.17945
Authors: Sankalp Sinha, Mohammad Sadil Khan, Muhammad Usama, Shino Sam, Didier Stricker, Sk Aziz Ali, Muhammad Zeshan Afzal
Keywords: computer vision due, Generating high-fidelity, text prompts remains, limited size, prompts remains
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from source datasets into our annotation pipeline to add domain-specific information in our annotation and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% by GPT-4 and 73.40% by human evaluators.

[CV-99] Stealthy Multi-Task Adversarial Attacks

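Quick read: This paper addresses how to stealthily attack one specific task in a multi-task deep neural network while leaving the performance of the other tasks unaffected. The key is a stealthy multi-task attack framework that uses multiple algorithms to inject imperceptible noise into the input, effectively compromising the target task without degrading the non-targeted tasks. The paper also introduces an automated method for searching the weighting factors in the loss function to further improve attack efficiency; in experiments this method performs comparably to manual tuning, and the authors present the result as a state-of-the-art multi-task attack framework.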

Link: https://arxiv.org/abs/2411.17936
Authors: Jiacheng Guo, Tianyun Zhang, Lei Li, Haochen Yang, Hongkai Yu, Minghai Qin
Keywords: Deep Neural Networks, Neural Networks exhibit, Networks exhibit inherent, Deep Neural, Neural Networks
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep Neural Networks exhibit inherent vulnerabilities to adversarial attacks, which can significantly compromise their outputs and reliability. While existing research primarily focuses on attacking single-task scenarios or indiscriminately targeting all tasks in multi-task environments, we investigate selectively targeting one task while preserving performance in others within a multi-task framework. This approach is motivated by varying security priorities among tasks in real-world applications, such as autonomous driving, where misinterpreting critical objects (e.g., signs, traffic lights) poses a greater security risk than minor depth miscalculations. Consequently, attackers may hope to target security-sensitive tasks while leaving non-critical tasks uncompromised, thereby evading detection before compromising crucial functions. In this paper, we propose a stealthy multi-task attack framework that utilizes multiple algorithms to inject imperceptible noise into the input. This novel method demonstrates remarkable efficacy in compromising the target task while simultaneously maintaining or even enhancing performance across non-targeted tasks, a criterion hitherto unexplored in the field. Additionally, we introduce an automated approach for searching the weighting factors in the loss function, further enhancing attack efficiency. Experimental results validate our framework’s ability to successfully attack the target task while preserving the performance of non-targeted tasks. The automated loss function weight searching method demonstrates comparable efficacy to manual tuning, establishing a state-of-the-art multi-task attack framework.

[CV-100] Exploring Superpixel Segmentation Methods in the Context of Citizen Science and Deforestation Detection

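Quick read: This paper addresses how to segment remote sensing images effectively so that citizen science campaigns can identify deforested areas in tropical forest monitoring. The key is superpixel-based segmentation: by analyzing 22 superpixel segmentation methods on remote sensing images, the paper finds that seven of them outperform SLIC, the method currently used in the ForestEyes project, and could therefore improve the efficiency and accuracy of citizen science campaigns for forest monitoring.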

Link: https://arxiv.org/abs/2411.17922
Authors: Hugo Resende, Isabela Borlido, Victor Sundermann, Eduardo B. Neto, Silvio Jamil F. Guimarães, Fabio Faria, Alvaro Luiz Fazenda
Keywords: Tropical forests play, Tropical forests, citizen science campaigns, planet ecosystem, making the conservation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper was accepted for presentation at SAC 2025

Abstract:Tropical forests play an essential role in the planet’s ecosystem, making the conservation of these biomes a worldwide priority. However, ongoing deforestation and degradation pose a significant threat to their existence, necessitating effective monitoring and the proposal of actions to mitigate the damage caused by these processes. In this regard, initiatives range from government and private sector monitoring programs to solutions based on citizen science campaigns, for example. Particularly in the context of citizen science campaigns, the segmentation of remote sensing images to identify deforested areas and subsequently submit them to analysis by non-specialized volunteers is necessary. Thus, segmentation using superpixel-based techniques proves to be a viable solution for this important task. Therefore, this paper presents an analysis of 22 superpixel-based segmentation methods applied to remote sensing images, aiming to identify which of them are more suitable for generating segments for citizen science campaigns. The results reveal that seven of the segmentation methods outperformed the baseline method (SLIC) currently employed in the ForestEyes citizen science project, indicating an opportunity for improvement in this important stage of campaign development.

[CV-101] DECODE: Domain-aware Continual Domain Expansion for Motion Prediction

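Quick read: This paper addresses the forgetting and growing storage requirements that arise when motion prediction models for autonomous vehicles must be updated frequently to handle new scenarios in complex environments. The key is DECODE, a continual learning framework that starts from a pre-trained generalized model and incrementally develops specialized models for distinct domains. DECODE's core contribution is balancing specialization against generalization: it uses a hypernetwork to generate model parameters, which reduces storage requirements, and a normalizing flow mechanism for real-time model selection. It also merges the outputs of the most relevant specialized model and the generalized model using deep Bayesian uncertainty estimation techniques, aiming for the best performance in familiar conditions while remaining robust in unfamiliar scenarios. Experiments report a low forgetting rate of 0.044 and an average minimum average displacement error (minADE) of 0.584 m.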

Link: https://arxiv.org/abs/2411.17917
Authors: Boqi Li, Haojie Zhu, Henry X. Liu
Keywords: effectively navigate complex, navigate complex environments, Motion prediction, traffic participants, prediction is critical
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Motion prediction is critical for autonomous vehicles to effectively navigate complex environments and accurately anticipate the behaviors of other traffic participants. As autonomous driving continues to evolve, the need to assimilate new and varied driving scenarios necessitates frequent model updates through retraining. To address these demands, we introduce DECODE, a novel continual learning framework that begins with a pre-trained generalized model and incrementally develops specialized models for distinct domains. Unlike existing continual learning approaches that attempt to develop a unified model capable of generalizing across diverse scenarios, DECODE uniquely balances specialization with generalization, dynamically adjusting to real-time demands. The proposed framework leverages a hypernetwork to generate model parameters, significantly reducing storage requirements, and incorporates a normalizing flow mechanism for real-time model selection based on likelihood estimation. Furthermore, DECODE merges outputs from the most relevant specialized and generalized models using deep Bayesian uncertainty estimation techniques. This integration ensures optimal performance in familiar conditions while maintaining robustness in unfamiliar scenarios. Extensive evaluations confirm the effectiveness of the framework, achieving a notably low forgetting rate of 0.044 and an average minADE of 0.584 m, significantly surpassing traditional learning strategies and demonstrating adaptability across a wide range of driving conditions.
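
For readers unfamiliar with the reported metric, minADE is the average displacement error of the best of K predicted trajectories. The sketch below computes it for 2D trajectories. It also includes a hypothetical `select_model` helper showing what likelihood-based selection with a fallback to the generalized model could look like; the threshold rule and all names are assumptions, since the abstract describes the mechanism only at a high level.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average Euclidean distance between predicted and true positions."""
    return sum(
        math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)
    ) / len(gt)


def min_ade(preds: Sequence[Sequence[Point]], gt: Sequence[Point]) -> float:
    """minADE: the ADE of the closest of the K predicted trajectories."""
    return min(ade(p, gt) for p in preds)


def select_model(log_likelihoods: Dict[str, float], threshold: float) -> str:
    """Pick the specialised model whose density estimator rates the input
    highest, or fall back to 'generalized' if none is confident enough.
    (Hypothetical rule, for illustration only.)"""
    name, best = max(log_likelihoods.items(), key=lambda kv: kv[1])
    return name if best >= threshold else "generalized"


gt: List[Point] = [(0.0, 0.0), (1.0, 0.0)]
preds = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 0.5)]]
print(min_ade(preds, gt))
print(select_model({"highway": -3.0, "urban": -1.2}, threshold=-2.0))
```

A forgetting rate, as reported alongside minADE, is typically measured as the increase in error on earlier domains after training on later ones; the abstract does not give its exact definition.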

[CV-102] Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey

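Quick read: This paper addresses a gap in the deepfake (DF) detection literature: existing surveys focus mainly on the accuracy of passive detection methods for a single modality (image, video, or audio) and neglect multi-modal detection, generalization, robustness, attribution, and interpretability. The key is a comprehensive survey of passive detection methods across modalities (image, video, audio, and multi-modal domains) that extends the discussion to generalization, robustness, attribution, and interpretability. The paper also discusses threat models for passive detection, including potential adversarial strategies and different levels of adversary knowledge and capability, and identifies current challenges such as poor generalization across generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal methods. Finally, it proposes future research directions, including adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.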

Link: https://arxiv.org/abs/2411.17911
Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Keywords: artists’ style imitation, misinformation spreading, recent years, malicious purposes, individual impersonation
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 26 pages

Abstract:In recent years, deepfakes (DFs) have been utilized for malicious purposes, such as individual impersonation, misinformation spreading, and artists’ style imitation, raising ethical and security concerns. However, existing surveys have focused on the accuracy of passive DF detection approaches for single modalities, such as image, video or audio. This comprehensive survey explores passive approaches across multiple modalities, including image, video, audio, and multi-modal domains, and extends the discussion beyond detection accuracy to generalization, robustness, attribution, and interpretability. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and different levels of adversary knowledge and capabilities. We also highlight current challenges in DF detection, including the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. Finally, we propose future research directions that address these unexplored and emerging issues in the field of passive DF detection, such as adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.

[CV-103] Automating grapevine LAI features estimation with UAV imagery and machine learning

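Quick read: This paper addresses the drawbacks of traditional methods for calculating the Leaf Area Index (LAI), which are time-consuming, destructive, costly, and limited to small scales. The key is to automate LAI estimation using drone image data and machine learning models. Both traditional feature extraction and deep learning methods are used to extract useful information from the data and improve the performance of different machine learning models for LAI prediction. The results show that deep learning-based feature extraction is more effective than traditional methods, yielding a faster, non-destructive, and more cost-effective way to calculate LAI that supports precision agriculture.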

Link: https://arxiv.org/abs/2411.17897
Authors: Muhammad Waseem Akram, Marco Vannucci, Giorgio Buttazzo, Valentina Colla, Stefano Roccella, Andrea Vannini, Giovanni Caruso, Simone Nesi, Alessandra Francini, Luca Sebastiani
Keywords: determines crop health, index determines crop, leaf area index, health and growth, area index determines
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted in 2024 IEEE INTERNATIONAL WORKSHOP ON Metrology for Agriculture and Forestry

Abstract:The leaf area index determines crop health and growth. Traditional methods for calculating it are time-consuming, destructive, costly, and limited in scale. In this study, we automate the index estimation method using drone image data of grapevine plants and a machine learning model. Traditional feature extraction and deep learning methods are used to obtain helpful information from the data and enhance the performance of the different machine learning models employed for the leaf area index prediction. The results showed that deep learning based feature extraction is more effective than traditional methods. The new approach is a significant improvement over old methods, offering a faster, non-destructive, and cost-effective leaf area index calculation, which enhances precision agriculture practices.

[CV-104] Multimodal Crash Likelihood Prediction: A Complexity-Infused Approach Integrating Semantic Contextual and Driving Features

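Quick read: This paper addresses crash likelihood prediction in complex driving environments. The key is a two-stage framework that integrates roadway complexity features to improve prediction accuracy. In the first stage, an encoder extracts hidden contextual information from these features and generates complexity-infused features. The second stage predicts crash likelihood from both the original and the complexity-infused features, raising accuracy from 87.98% with original features alone to 90.15% with the complexity-infused features added. The study finds that combining semantic, driving, and contextual features captures roadway complexity most effectively, and that complexity index annotations generated by large language models outperform human annotations from Amazon Mechanical Turk, pointing to the potential of automated tools for accurate, scalable crash prediction systems.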

Link: https://arxiv.org/abs/2411.17886
Authors: Meng Wang, Zach Noonan, Pnina Gershon, Shannon C. Roberts
Keywords: improving traffic safety, Predicting crash likelihood, complex driving environments, advancing autonomous driving, Predicting crash
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Predicting crash likelihood in complex driving environments is essential for improving traffic safety and advancing autonomous driving. Previous studies have used statistical models and deep learning to predict crashes based on semantic, contextual, or driving features, but none have examined the combined influence of these factors, termed roadway complexity in this study. This paper introduces a two-stage framework that integrates roadway complexity features for crash prediction. In the first stage, an encoder extracts hidden contextual information from these features, generating complexity-infused features. The second stage uses both original and complexity-infused features to predict crash likelihood, achieving an accuracy of 87.98% with original features alone and 90.15% with the added complexity-infused features. Ablation studies confirm that a combination of semantic, driving, and contextual features yields the best results, which emphasize their role in capturing roadway complexity. Additionally, complexity index annotations generated by Large Language Models outperform those by Amazon Mechanical Turk, highlighting the potential of automated tools for accurate, scalable crash prediction systems.

[CV-105] ReC-TTT: Contrastive Feature Reconstruction for Test-Time Training

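Quick read: This paper addresses the adaptation of deep learning models to shifts in the test data distribution. The key is ReC-TTT, a test-time training technique that uses cross-reconstruction as an auxiliary task between a frozen encoder and two trainable encoders with a single shared decoder. At test time, the decoder is frozen on the source domain and the encoders are adapted to extract features that the decoder can reconstruct correctly, which adapts the model to the new domain. The method outperforms existing state-of-the-art techniques on most domain-shift classification challenges.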

Link: https://arxiv.org/abs/2411.17869
Authors: Marco Colussi, Sergio Mascetti, Jose Dolz, Christian Desrosiers
Keywords: computer vision tasks, showcases outstanding results, showcases outstanding, remarkable progress, progress in deep
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The remarkable progress in deep learning (DL) showcases outstanding results in various computer vision tasks. However, adaptation to real-time variations in data distributions remains an important challenge. Test-Time Training (TTT) was proposed as an effective solution to this issue, which increases the generalization ability of trained models by adding an auxiliary task at train time and then using its loss at test time to adapt the model. Inspired by the recent achievements of contrastive representation learning in unsupervised tasks, we propose ReC-TTT, a test-time training technique that can adapt a DL model to new unseen domains by generating discriminative views of the input data. ReC-TTT uses cross-reconstruction as an auxiliary task between a frozen encoder and two trainable encoders, taking advantage of a single shared decoder. At test time, this makes it possible to adapt the encoders to extract features that will be correctly reconstructed by the decoder, which in this phase is frozen on the source domain. Experimental results show that ReC-TTT achieves better results than other state-of-the-art techniques in most domain shift classification challenges.

[CV-106] Generative Image Layer Decomposition with Visual Effects

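[Quick read]: This paper addresses the difficulty of achieving precise control in image composition tasks, in particular how to decompose an image into independently editable layers while retaining transparent visual effects such as shadows and reflections. The key to the solution is LayerDecomp, a generative framework that outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, the paper first introduces a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. It then supplements this data with camera-captured images containing natural visual effects to improve real-world applicability. The paper also proposes a consistency loss that pushes the model to learn accurate representations of the transparent foreground layer when ground-truth annotations are unavailable. The method achieves better layer decomposition quality than existing approaches and performs well on object removal and spatial editing tasks, opening up creative possibilities for layer-wise image editing.

Link: https://arxiv.org/abs/2411.17864
Authors: Jinrui Yang,Qing Liu,Yijun Li,Soo Ye Kim,Daniil Pakhomov,Mengwei Ren,Jianming Zhang,Zhe Lin,Cihang Xie,Yuyin Zhou
Keywords: Recent advancements, advancements in large, significantly enhanced, enhanced the capabilities, visual effects
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page: this https URL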

Abstract:Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose an image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose LayerDecomp, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss which enforces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing. The project page is this https URL.

[CV-107] OracleSage: Towards Unified Visual-Linguistic Understanding of Oracle Bone Scripts through Cross-Modal Knowledge Fusion

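[Quick read]: This paper addresses automatic recognition of oracle bone script (OBS), which traditional methods handle poorly because its complex pictographic structures diverge substantially from modern Chinese characters. The key to the solution is OracleSage, a novel cross-modal framework that combines hierarchical visual understanding with graph-based semantic reasoning. Specifically, OracleSage comprises: (1) a Hierarchical Visual-Semantic Understanding module that performs multi-granularity feature extraction through progressive fine-tuning of LLaVA's visual backbone; (2) a Graph-based Semantic Reasoning Framework that captures relationships between visual components and semantic concepts through dynamic message passing; and (3) OracleSem, a semantically enriched OBS dataset with comprehensive pictographic and semantic annotations. Experiments show that OracleSage significantly outperforms existing vision-language models, offering a new paradigm for interpreting ancient texts and technical support for archaeological research.

Link: https://arxiv.org/abs/2411.17837
Authors: Hanqi Jiang,Yi Pan,Junhao Chen,Zhengliang Liu,Yifan Zhou,Peng Shu,Yiwei Li,Huaqin Zhao,Stephen Mihm,Lewis C Howe,Tianming Liu
Keywords: Oracle bone script, modern Chinese characters, China earliest mature, mature writing system, present significant challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: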

Abstract:Oracle bone script (OBS), as China’s earliest mature writing system, presents significant challenges in automatic recognition due to its complex pictographic structures and divergence from modern Chinese characters. We introduce OracleSage, a novel cross-modal framework that integrates hierarchical visual understanding with graph-based semantic reasoning. Specifically, we propose (1) a Hierarchical Visual-Semantic Understanding module that enables multi-granularity feature extraction through progressive fine-tuning of LLaVA’s visual backbone, (2) a Graph-based Semantic Reasoning Framework that captures relationships between visual components and semantic concepts through dynamic message passing, and (3) OracleSem, a semantically enriched OBS dataset with comprehensive pictographic and semantic annotations. Experimental results demonstrate that OracleSage significantly outperforms state-of-the-art vision-language models. This research establishes a new paradigm for ancient text interpretation while providing valuable technical support for archaeological studies.

[CV-108] SVGDreamer: Advancing Editability and Diversity in Text-Guided SVG Generation

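[Quick read]: This paper addresses the shortcomings of existing Text-to-SVG methods in editability, visual quality, and diversity. The key to the solution is a novel text-guided vector graphics synthesis method comprising: (1) Vectorized Particle-based Score Distillation (VPSD), which addresses the over-saturation issues of existing methods and enhances sample diversity; (2) a pre-trained reward model that re-weights vector particles, improving aesthetic appeal and speeding up convergence; and (3) an adaptive vector primitives control strategy that dynamically adjusts the number of primitives to better render graphic details. These contributions markedly improve the editability, visual quality, and diversity of the generated SVGs, and experiments show that the method outperforms baseline methods.

Link: https://arxiv.org/abs/2411.17832
Authors: Ximing Xing,Qian Yu,Chuang Wang,Haitao Zhou,Jing Zhang,Dong Xu
Keywords: demonstrated significant potential, iconography and sketching, Particle-based Score Distillation, demonstrated significant, significant potential
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 17 figures. arXiv admin note: substantial text overlap with arXiv:2312.16476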

Abstract:Recently, text-guided scalable vector graphics (SVG) synthesis has demonstrated significant potential in domains such as iconography and sketching. However, SVGs generated from existing Text-to-SVG methods often lack editability and exhibit deficiencies in visual quality and diversity. In this paper, we propose a novel text-guided vector graphics synthesis method to address these limitations. To improve the diversity of output SVGs, we present a Vectorized Particle-based Score Distillation (VPSD) approach. VPSD addresses over-saturation issues in existing methods and enhances sample diversity. A pre-trained reward model is incorporated to re-weight vector particles, improving aesthetic appeal and enabling faster convergence. Additionally, we design a novel adaptive vector primitives control strategy, which allows for the dynamic adjustment of the number of primitives, thereby enhancing the presentation of graphic details. Extensive experiments validate the effectiveness of the proposed method, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. We also show that our new method supports up to six distinct vector styles, capable of generating high-quality vector assets suitable for stylized vector design and poster design.

[CV-109] Rapid Distributed Fine-tuning of a Segmentation Model Onboard Satellites

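[Quick read]: This paper addresses the delays that arise when Earth observation satellite data is processed at ground stations, which are caused by data transmission bottlenecks and limited communication windows and matter most for time-critical natural hazard analysis and disaster response. The key to the solution is running MobileSAM, a lightweight pre-trained segmentation model, on satellite hardware for near-real-time analysis, and integrating it with the open-source simulation module PASEOS to evaluate its performance under simulated satellite constellation conditions. The study also explores fine-tuning MobileSAM in a decentralised way across multiple satellites, and finds that when satellites communicate model updates frequently, the model can be fine-tuned quickly and improves its segmentation performance with minimal training data. The work highlights the benefits of decentralised learning and of fine-tuning pre-trained models for rapid response scenarios, particularly as extreme weather events become more frequent.

Link: https://arxiv.org/abs/2411.17831
Authors: Meghan Plumridge,Rasmus Maråk,Chiara Ceccobello,Pablo Gómez,Gabriele Meoni,Filip Svoboda,Nicholas D. Lane
Keywords: Earth observation, natural hazard analysis, natural hazard, Earth, data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted at the Sixth IEEE International Conference on Image Processing Applications and Systems (IPAS) 2025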

Abstract:Segmentation of Earth observation (EO) satellite data is critical for natural hazard analysis and disaster response. However, processing EO data at ground stations introduces delays due to data transmission bottlenecks and communication windows. Using segmentation models capable of near-real-time data analysis onboard satellites can therefore improve response times. This study presents a proof-of-concept using MobileSAM, a lightweight, pre-trained segmentation model, onboard Unibap iX10-100 satellite hardware. We demonstrate the segmentation of water bodies from Sentinel-2 satellite imagery and integrate MobileSAM with PASEOS, an open-source Python module that simulates satellite operations. This integration allows us to evaluate MobileSAM’s performance under simulated conditions of a satellite constellation. Our research investigates the potential of fine-tuning MobileSAM in a decentralised way onboard multiple satellites in rapid response to a disaster. Our findings show that MobileSAM can be rapidly fine-tuned and benefits from decentralised learning, considering the constraints imposed by the simulated orbital environment. We observe improvements in segmentation performance with minimal training data and fast fine-tuning when satellites frequently communicate model updates. This study contributes to the field of onboard AI by emphasising the benefits of decentralised learning and fine-tuning pre-trained models for rapid response scenarios. Our work builds on recent related research at a critical time; as extreme weather events increase in frequency and magnitude, rapid response with onboard data analysis is essential.
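The abstract says satellites "communicate model updates" but does not state how those updates are combined. As a minimal sketch, assuming plain element-wise parameter averaging in the style of federated averaging (an assumption, not the paper's stated method), the aggregation step could look like this. The function name and the toy weight vectors are hypothetical.

```python
def average_updates(local_weights):
    """Average per-satellite parameter lists element-wise.

    Assumption: every satellite holds the same flattened parameter layout,
    and all satellites are weighted equally.
    """
    n = len(local_weights)
    return [sum(values) / n for values in zip(*local_weights)]


# Hypothetical flattened weights from three satellites after local fine-tuning.
satellite_weights = [
    [0.10, 0.50, -0.20],
    [0.30, 0.40, -0.10],
    [0.20, 0.60, -0.30],
]

# The shared model that each satellite would continue fine-tuning from.
shared = average_updates(satellite_weights)
```

In a constellation simulation, this step would only run when a communication window between satellites is open, which is why update frequency matters for convergence speed.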

[CV-110] CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

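[Quick read]: This paper addresses the challenges embodied agents face when navigating dynamic urban environments, especially in map-free or off-street settings. The key to the solution is data-driven training on large-scale web video. Specifically, the paper proposes a scalable data processing pipeline that extracts action supervision from thousands of hours of in-the-wild city walking and driving videos, enabling large-scale imitation learning without costly annotations. The resulting model learns sophisticated navigation policies that handle diverse challenges and critical scenarios, and its navigation performance surpasses current methods.

Link: https://arxiv.org/abs/2411.17820
Authors: Xinhao Liu,Jintong Li,Yichen Jiang,Niranjan Sujay,Zhicheng Yang,Juexiao Zhang,John Abanes,Jing Zhang,Chen Feng
Keywords: requiring advanced spatial, environments presents significant, advanced spatial reasoning, Navigating dynamic urban, urban environments presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: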

Abstract:Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. this https URL

[CV-111] Low-rank Adaptation-based All-Weather Removal for Autonomous Navigation

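[Quick read]: This paper addresses the poor performance of all-weather image restoration (AWIR) models on out-of-distribution (OoD) samples and unseen degradations. The key to the solution is using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to new weather restoration tasks. In addition, the paper proposes a LoRA-based fine-tuning method called LoRA-Align (LoRA-A), which uses Singular Value Decomposition (SVD) to align the singular vectors of the fine-tuned and pre-trained weight matrices, so that the model retains its knowledge of the original tasks while adapting to new ones.

Link: https://arxiv.org/abs/2411.17814
Authors: Sudarshan Rajagopalan,Vishal M. Patel
Keywords: adverse weather conditions, reliable autonomous navigation, crucial for reliable, weather conditions, autonomous navigation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL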

Abstract:All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. As a result, they often struggle with out-of-distribution (OoD) samples or unseen degradations, which limits their effectiveness for real-world autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine-tuning involving many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRA-based fine-tuning method called LoRA-Align (LoRA-A) which seeks to align the singular vectors of the fine-tuned and pre-trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model’s knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation.
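For readers unfamiliar with LoRA, the sketch below illustrates the generic low-rank update W + scale * (B A), in which the pre-trained matrix W stays frozen and only the small factors B and A are trained. It uses plain Python lists to stay dependency-free. The matrix sizes, values, and function names are illustrative assumptions, and the SVD-based alignment of LoRA-A is not shown because the abstract does not give its details.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, scale=1.0):
    """Return the adapted weight W + scale * (B @ A); W itself is never modified."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen 3x3 pre-trained weight (identity, purely for illustration).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Rank-1 factors: B is 3x1 and A is 1x3, so 6 trainable values instead of 9.
a = [[0.5, -0.5, 1.0]]
b_init = [[0.0], [0.0], [0.0]]      # zero-initialised B leaves W unchanged
b_trained = [[1.0], [0.0], [2.0]]   # hypothetical values after fine-tuning

adapted_at_init = lora_weight(w, b_init, a)
adapted_trained = lora_weight(w, b_trained, a)
```

Because B starts at zero, the adapted model initially reproduces the pre-trained one exactly; any drift on the original tasks comes only from the learned low-rank term, which is the quantity an alignment method such as LoRA-A would constrain.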

[CV-112] From memorization to generalization: a theoretical framework for diffusion-based generative models

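[Quick read]: This paper studies the transition of generative models from memorizing the training data to a non-memorization (generalization) regime as the training set grows. The key to the solution is a mathematical definition, based on a relative distance, that distinguishes the two regimes: the model is considered to be in the generalization regime when the relative distance between the generated distribution and the Gaussian kernel approximation of the training data is sufficiently large. The paper constructs an analytically tractable diffusion model, derives a lower bound on the Kullback-Leibler divergence between the generated and sampling distributions, and shows that this model exhibits the transition when the training data is sampled from an isotropic Gaussian distribution. The study further finds that the transition occurs when the individual distance between the generated and underlying sampling distributions begins to decrease as more training samples are added. This contrasts with the alternative scenario in which the model's memorization performance degrades but its generalization performance does not improve.

Link: https://arxiv.org/abs/2411.17807
Authors: Indranil Halder
Keywords: Diffusion-based generative models, training set increases, generative models demonstrate, Diffusion-based generative, training dataset
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages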

Abstract:Diffusion-based generative models demonstrate a transition from memorizing the training dataset to a non-memorization regime as the size of the training set increases. Here, we begin by introducing a mathematically precise definition of this transition in terms of a relative distance: the model is said to be in the non-memorization/‘generalization’ regime if the generated distribution is almost surely far from the probability distribution associated with a Gaussian kernel approximation to the training dataset, relative to the sampling distribution. Then, we develop an analytically tractable diffusion model and establish a lower bound on Kullback-Leibler divergence between the generated and sampling distributions. The model also features the transition, according to our definition in terms of the relative distance, when the training data is sampled from an isotropic Gaussian distribution. Further, our study reveals that this transition occurs when the individual distance between the generated and underlying sampling distribution begins to decrease with the addition of more training samples. This is to be contrasted with an alternative scenario, where the model’s memorization performance degrades, but generalization performance doesn’t improve. We also provide empirical evidence indicating that realistic diffusion models exhibit the same alignment of scales.
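To make the distance measure concrete, the sketch below evaluates the standard closed-form Kullback-Leibler divergence between two isotropic Gaussians, the setting the abstract uses for the sampling distribution. It is only an illustration of the quantity involved, not the paper's lower bound or its analytically tractable model; the function name and example values are assumptions.

```python
import math


def kl_isotropic_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for p = N(mu_p, var_p * I) and q = N(mu_q, var_q * I).

    Closed form: 0.5 * (d*var_p/var_q + |mu_q - mu_p|^2/var_q - d + d*ln(var_q/var_p)).
    """
    d = len(mu_p)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return 0.5 * (d * var_p / var_q + sq_dist / var_q - d + d * math.log(var_q / var_p))


# Identical distributions have zero divergence.
kl_same = kl_isotropic_gaussians([0.0, 0.0], 1.0, [0.0, 0.0], 1.0)

# Unit-variance Gaussians whose means are one unit apart.
kl_shifted = kl_isotropic_gaussians([0.0, 0.0], 1.0, [1.0, 0.0], 1.0)
```

A relative-distance criterion of the kind described would compare two such divergences: one from the generated distribution to the kernel approximation of the training set, and one to the sampling distribution.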

[CV-113] NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?

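[Quick read]: This paper examines whether multimodal large language models (MLLMs) can recognize objects whose attributes have been modified. The key to the solution is NEMO, a new benchmark comprising 900 images of original fruits and their attribute-modified counterparts, together with 2,700 questions of open-ended, multiple-choice, and unsolvable types. An evaluation of 26 recent open-source and commercial models shows that although stronger vision encoders improve performance, MLLMs still lag behind standalone vision encoders on NEMO. Scaling up model size does not consistently help either: deeper analysis indicates that larger language models can weaken the vision encoder during fine-tuning. These findings expose key limitations of current MLLMs and suggest potential paths toward more versatile and robust multimodal models.

Link: https://arxiv.org/abs/2411.17794
Authors: Jiaxuan Li,Junwen Mo,MinhDuc Vo,Akihiro Sugimoto,Hideki Nakayama
Keywords: Multimodal Large Language, Large Language Models, Large Language, made notable advances, specific attributes remain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: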

Abstract:Multimodal Large Language Models (MLLMs) have made notable advances in visual understanding, yet their ability to recognize objects modified by specific attributes remains an open question. To address this, we explore MLLMs’ reasoning capabilities in object recognition, ranging from commonsense to beyond-commonsense scenarios. We introduce a novel benchmark, NEMO, which comprises 900 images of origiNal fruits and their corresponding attributE-MOdified ones, along with a set of 2,700 questions including open-ended, multiple-choice, and unsolvable types. We assess 26 recent open-sourced and commercial models using our benchmark. The findings highlight pronounced performance gaps in recognizing objects in NEMO and reveal distinct answer preferences across different models. Although stronger vision encoders improve performance, MLLMs still lag behind standalone vision encoders. Interestingly, scaling up the model size does not consistently yield better outcomes, as deeper analysis reveals that larger LLMs can weaken vision encoders during fine-tuning. These insights shed light on critical limitations in current MLLMs and suggest potential pathways toward developing more versatile and resilient multimodal models.

[CV-114] Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors

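[Quick read]: This paper addresses the accuracy of 3D mapping in endoscopy, where quantitative, holistic lesion characterization in the gastrointestinal (GI) tract requires reliable depth and pose estimation. The key to the solution is a robust self-supervised monocular depth and pose estimation framework that combines a Generative Latent Bank with a Variational Autoencoder (VAE). The Generative Latent Bank draws on extensive depth scenes from natural images to condition the depth network, using latent feature priors to make depth predictions more realistic and robust. Pose estimation is reformulated within a VAE framework that treats pose transitions as latent variables in order to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline yields accurate depth and pose predictions that cope with the GI tract's complex textures and lighting.

Link: https://arxiv.org/abs/2411.17790
Authors: Ziang Xu,Bin Li,Yang Hu,Chenyu Zhang,James East,Sharib Ali,Jens Rittscher
Keywords: holistic lesion characterization, Generative Latent Bank, requiring reliable depth, endoscopy enables quantitative, holistic lesion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: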

Abstract:Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract’s complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework’s superior performance over published self-supervised methods in endoscopic depth and pose estimation.

[CV-115] Geometric Point Attention Transformer for 3D Shape Reassembly

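[Quick read]: This paper addresses the insufficient reasoning about geometric relationships between parts in shape assembly. Existing methods mainly rely on networks to predict the pose of each part and often fail to capture the geometric interactions between parts. The key to the solution is the Geometric Point Attention Transformer (GPAT), whose geometric point attention module integrates global shape information and local pairwise geometric features together with poses represented as rotation and translation vectors. A geometric recycling scheme then feeds each prediction into the next iteration for refinement, yielding more accurate pose estimation and high alignment accuracy on both semantic and geometric assembly tasks.

Link: https://arxiv.org/abs/2411.17788
Authors: Jiahan Li,Chaoran Cheng,Jianzhu Ma,Ge Liu
Keywords: gained significant interest, reassemble separate parts, Geometric Point Attention, Point Attention Transformer, complete object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: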

Abstract:Shape assembly, which aims to reassemble separate parts into a complete object, has gained significant interest in recent years. Existing methods primarily rely on networks to predict the poses of individual parts, but often fail to effectively capture the geometric interactions between the parts and their poses. In this paper, we present the Geometric Point Attention Transformer (GPAT), a network specifically designed to address the challenges of reasoning about geometric relationships. In the geometric point attention module, we integrate both global shape information and local pairwise geometric features, along with poses represented as rotation and translation vectors for each part. To enable iterative updates and dynamic reasoning, we introduce a geometric recycling scheme, where each prediction is fed into the next iteration for refinement. We evaluate our model on both the semantic and geometric assembly tasks, showing that it outperforms previous methods in absolute pose estimation, achieving accurate pose predictions and high alignment accuracy.

[CV-116] Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

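[Quick read]: This paper addresses the excessive memory consumption and computational redundancy that the coarse-to-fine nature of Visual Auto-Regressive (VAR) modeling causes during image generation. The key to the solution is Collaborative Decoding (CoDe), a novel efficient decoding strategy. CoDe partitions the multi-scale inference process into a seamless collaboration between a large model and a small model: the large model (the drafter) generates low-frequency content at smaller scales, while the small model (the refiner) focuses on predicting high-frequency details at larger scales. This collaboration yields a 1.7x speedup and about 50% lower memory usage with only a slight loss in image quality (FID rises from 1.95 to 1.98). With fewer drafting steps, CoDe reaches a 2.9x acceleration ratio, generating 41 images per second at 256x256 resolution while keeping FID at 2.27.

Link: https://arxiv.org/abs/2411.17787
Authors: Zigeng Chen,Xinyin Ma,Gongfan Fang,Xinchao Wang
Keywords: next-scale prediction approach, rapidly advancing field, garnered considerable attention, innovative next-scale prediction, Visual Auto-Regressive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress. Code repository: this https URL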

Abstract:In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the ‘drafter’, specializing in generating low-frequency content at smaller scales, while the smaller model serves as the ‘refiner’, solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at this https URL
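The division of labour between drafter and refiner can be sketched as a simple assignment of scales to models. In the toy example below, the list of token-map side lengths and the number of drafting steps are illustrative assumptions rather than the paper's configuration, and no actual model is invoked. The point is only that the later, larger scales carry most of the tokens, which is where a smaller model saves the most compute.

```python
def assign_scales(scales, draft_steps):
    """Assign each scale to the large 'drafter' or the small 'refiner'.

    Returns (side_length, role, token_count) per scale, where a scale with
    side length s contributes s * s tokens to the sequence.
    """
    plan = []
    for step, side in enumerate(scales):
        role = "drafter" if step < draft_steps else "refiner"
        plan.append((side, role, side * side))
    return plan


# Hypothetical coarse-to-fine schedule of token-map side lengths.
scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]

# Assumption: the large model handles the first six scales.
plan = assign_scales(scales, draft_steps=6)

drafter_tokens = sum(tokens for _, role, tokens in plan if role == "drafter")
refiner_tokens = sum(tokens for _, role, tokens in plan if role == "refiner")
```

With this schedule the drafter produces 91 tokens and the refiner 589, so lowering `draft_steps` shifts even more of the sequence onto the cheaper model, at some cost in quality as the abstract's FID figures suggest.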

[CV-117] DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

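[Quick read]: This paper addresses the complex training requirements, high inference costs, and limited flexibility of existing personalized image generation methods. The key to the solution is DreamCache, a scalable approach that caches a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, and dynamically modulates the generated image features through lightweight conditioning adapters. DreamCache achieves state-of-the-art image and text alignment with far fewer extra parameters, and is more computationally efficient and flexible than existing models.

Link: https://arxiv.org/abs/2411.17786
Authors: Emanuele Aiello,Umberto Michieli,Diego Valsesia,Mete Ozay,Enrico Magli
Keywords: Personalized image generation, image generation requires, capture the core, Personalized image, generation requires
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 8 figures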

Abstract:Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.

[CV-118] Diffusion Autoencoders for Few-shot Image Generation in Hyperbolic Space

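[Quick read]: This paper addresses the trade-off between image quality and diversity in few-shot image generation, as well as the limited control that existing methods offer over the attributes of newly generated images. The key to the solution is Hyperbolic Diffusion Autoencoders (HypDAE), a new approach that operates in hyperbolic space to capture hierarchical relationships among images and texts from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse, high-quality images for unseen categories, and adjusting radii within the hyperbolic disk provides an additional degree of control over semantic diversity. Experiments show that HypDAE significantly outperforms existing methods with limited data, achieving a better balance between quality and diversity and offering a highly controllable and interpretable generation process.

Link: https://arxiv.org/abs/2411.17784
Authors: Lingxiao Li,Kaixuan Fan,Boqing Gong,Xiangyu Yue
Keywords: Few-shot image generation, Few-shot image, Hyperbolic Diffusion Autoencoders, image generation aims, Few-shot
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: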

Abstract:Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. However, existing methods often suffer from a trade-off between image quality and diversity while offering limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images and texts from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying semantic codes or guided by textual instructions. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a superior balance between quality and diversity with limited data and offers a highly controllable and interpretable generation process.

[CV-119] Network Inversion and Its Applications

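[Quick read]: This paper addresses the opacity of neural network decision-making: networks are often regarded as black boxes whose decisions are hard to explain. The key to the solution is a network inversion technique based on a conditioned generator that learns the data distribution in the input space of the trained network, making it possible to reconstruct inputs and reveal the features and patterns behind the network's decisions. Concretely, the conditioning label information is encoded into vectors and intermediate matrices, the cosine similarity between features of the generated images is minimized, and feature orthogonality is added as a regularization term to increase image diversity and ensure distinct, non-redundant representations for each label.

Link: https://arxiv.org/abs/2411.17777
Authors: Pirzada Suhail,Hao Tang,Amit Sethi
Keywords: remains opaque, emerged as powerful, powerful tools, process often remains, Network inversion
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.16884 , arXiv:2407.18002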

Abstract:Neural networks have emerged as powerful tools across various applications, yet their decision-making process often remains opaque, leading to them being perceived as “black boxes.” This opacity raises concerns about their interpretability and reliability, especially in safety-critical scenarios. Network inversion techniques offer a solution by allowing us to peek inside these black boxes, revealing the features and patterns learned by the networks behind their decision-making processes and thereby provide valuable insights into how neural networks arrive at their conclusions, making them more interpretable and trustworthy. This paper presents a simple yet effective approach to network inversion using a meticulously conditioned generator that learns the data distribution in the input space of the trained neural network, enabling the reconstruction of inputs that would most likely lead to the desired outputs. To capture the diversity in the input space for a given output, instead of simply revealing the conditioning labels to the generator, we encode the conditioning label information into vectors and intermediate matrices and further minimize the cosine similarity between features of the generated images. Additionally, we incorporate feature orthogonality as a regularization term to boost image diversity which penalises the deviations of the Gram matrix of the features from the identity matrix, ensuring orthogonality and promoting distinct, non-redundant representations for each label. The paper concludes by exploring immediate applications of the proposed network inversion approach in interpretability, out-of-distribution detection, and training data reconstruction.
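The feature-orthogonality regularizer described above penalises how far the Gram matrix of the features deviates from the identity. The sketch below is one plausible reading of that idea: it L2-normalises each feature vector and sums the squared entries of (G - I). Both the normalisation and the squared Frobenius norm are assumptions, since the abstract does not specify them, and the function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def orthogonality_penalty(features):
    """Sum of squared deviations of the features' Gram matrix from the identity."""
    unit = [normalize(f) for f in features]
    penalty = 0.0
    for i, fi in enumerate(unit):
        for j, fj in enumerate(unit):
            gram_ij = sum(a * b for a, b in zip(fi, fj))
            target = 1.0 if i == j else 0.0
            penalty += (gram_ij - target) ** 2
    return penalty


# Orthogonal features incur no penalty; duplicated features are penalised.
penalty_diverse = orthogonality_penalty([[1.0, 0.0], [0.0, 2.0]])
penalty_redundant = orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]])
```

After normalisation the off-diagonal Gram entries are exactly the pairwise cosine similarities, so this single term also pushes those similarities toward zero, in the same direction as the cosine-similarity objective the abstract mentions.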

[CV-120] Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

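[Quick read]: This paper addresses text-based person anomaly search, that is, retrieving pedestrians engaged in routine or anomalous activities from camera networks using natural language descriptions. The key to the solution is a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark covering a broad range of routine actions, such as running, performing, and playing soccer, and the corresponding anomalies, such as lying, being hit, and falling. PAB combines synthesized and real-world data to supply both training and test resources. The paper also proposes a cross-modal pose-aware framework that integrates human pose patterns with identity-based hard negative pair sampling, which improves recall@1 by 2.88%.

Link: https://arxiv.org/abs/2411.17776
Authors: Shuyu Yang,Yaxiong Wang,Li Zhu,Zhedong Zheng
Keywords: natural language descriptions, retrieve specific individuals, person search aims, Text-based person search, text-based person anomaly
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: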

Abstract:Text-based person search aims to retrieve specific individuals across camera networks using natural language descriptions. However, current benchmarks often exhibit biases towards common actions like walking or standing, neglecting the critical need for identifying abnormal behaviors in real-world scenarios. To meet such demands, we propose a new task, text-based person anomaly search, locating pedestrians engaged in either routine or anomalous activities via text. To enable the training and evaluation of this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark, featuring a broad spectrum of actions, e.g., running, performing, playing soccer, and the corresponding anomalies, e.g., lying, being hit, and falling of the same identity. The training set of PAB comprises 1,013,605 synthesized image-text pairs of both normalities and anomalies, while the test set includes 1,978 real-world image-text pairs. To validate the potential of PAB, we introduce a cross-modal pose-aware framework, which integrates human pose patterns with identity-based hard negative pair sampling. Extensive experiments on the proposed benchmark show that synthetic training data facilitates the fine-grained behavior retrieval in the real-world test set, while the proposed pose-aware method further improves the recall@1 by 2.88%. We will release the dataset, code, and checkpoints to facilitate further research and ensure the reproducibility of our results.
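Since the headline result is reported as recall@1, a short sketch of how that retrieval metric is typically computed may help. Each text query yields a ranked list of gallery image IDs, and the query counts as a hit if its ground-truth image appears in the top k. The single-ground-truth setup and the toy rankings below are assumptions for illustration, not data from the PAB benchmark.

```python
def recall_at_k(ranked_ids, true_ids, k=1):
    """Fraction of queries whose ground-truth ID appears in the top-k results."""
    hits = sum(1 for ranked, truth in zip(ranked_ids, true_ids) if truth in ranked[:k])
    return hits / len(true_ids)


# Hypothetical rankings for four text queries over a three-image gallery.
rankings = [
    [3, 1, 2],
    [2, 1, 3],
    [1, 2, 3],
    [2, 1, 3],
]
ground_truth = [3, 1, 1, 2]

r1 = recall_at_k(rankings, ground_truth, k=1)  # three of four queries hit at rank 1
r2 = recall_at_k(rankings, ground_truth, k=2)  # the remaining query hits at rank 2
```

Recall@1 is the strictest variant because only the top-ranked image counts, which is why fine-grained distinctions such as pose matter for it.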

[CV-121] Efficient Multi-modal Large Language Models via Visual Token Grouping

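[Quick read]: This paper addresses the high computational cost that Multi-modal Large Language Models (MLLMs) incur when processing high-resolution images and videos. The key to the solution is VisToG, a novel grouping mechanism that uses the capabilities of pre-trained vision encoders to group similar image segments without segmentation masks. Specifically, VisToG concatenates semantic tokens, which represent semantic segments of the image, after the linear projection layer and before the vision encoder. With isolated attention, VisToG can then use the prior knowledge in the pre-trained vision encoder to identify and eliminate redundant visual tokens, reducing computational demands. Experiments show that VisToG maintains 98.1% of the original performance while reducing inference time by more than 27%.

Link: https://arxiv.org/abs/2411.17773
Authors: Minbin Huang,Runhui Huang,Han Shi,Yimeng Chen,Chuanyang Zheng,Xiangguo Sun,Xin Jiang,Zhenguo Li,Hong Cheng
Keywords: Large Language Models, Multi-modal Large Language, enhances Large Language, Language Models, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: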

Abstract:The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. While existing methods conduct token reduction in the feature alignment phase, in this paper we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens to represent image semantic segments after the linear projection layer before feeding into the vision encoder. Besides, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens utilizing the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG, maintaining 98.1% of the original performance while achieving a reduction of over 27% in inference time.

[CV-122] MVBoost: Boost 3D Reconstruction with Multi-View Refinement

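Quick read: This paper addresses the limited generalization of existing 3D object reconstruction models caused by the scarcity of diverse 3D datasets. The key to the solution is MVBoost, a framework that boosts 3D reconstruction with multi-view refinement by generating pseudo-GT (pseudo ground-truth) data. Its core idea is to combine the high accuracy of a multi-view generation model with the consistency of a 3D reconstruction model to create a reliable data source. Concretely, given a single-view input image, a multi-view diffusion model generates multiple views, and a large 3D reconstruction model then produces consistent 3D data. MVBoost adaptively refines the multi-view images rendered from this 3D data to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. In addition, an input view optimization step optimizes the corresponding viewpoint based on the user’s input image so that it matches the user’s needs. Experiments show that the method outperforms prior work in both reconstruction quality and generalization.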

Link: https://arxiv.org/abs/2411.17772
Authors: Xiangyu Liu,Xiaomei Zhang,Zhiyuan Ma,Xiangyu Zhu,Zhen Lei
Keywords: Recent advancements, models rely heavily, heavily on existing, rely heavily, reconstruction model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user’s input image, ensuring that the most important viewpoint is accurately tailored to the user’s needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.

[CV-123] DiagramQG: A Dataset for Generating Concept-Focused Questions from Diagrams

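Quick read: This paper addresses the fact that existing Visual Question Generation (VQG) research focuses mainly on natural images and neglects the diagrams found in educational materials. The key to the solution is the DiagramQG dataset, which contains 8,372 diagrams and 19,475 questions across multiple subjects and introduces concept and target text constraints that guide a model to generate concept-focused questions for education. The paper also proposes the Hierarchical Knowledge Integration framework for Diagram Question Generation (HKI-DQG), which obtains multi-scale patches of a diagram, acquires knowledge using a visual language model with frozen parameters, and then integrates the knowledge, text constraints, and patches to generate concept-focused questions. Experiments show that HKI-DQG outperforms existing methods on DiagramQG and also generalizes to other VQG datasets.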

Link: https://arxiv.org/abs/2411.17771
Authors: Xinyu Zhang,Lingling Zhang,Yanrui Wu,Muye Huang,Wenjun Wu,Bo Li,Shaowei Wang,Jun Liu
Keywords: gained significant attention, significant attention due, Visual Question Generation, Question Generation, Diagram Question Generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual Question Generation (VQG) has gained significant attention due to its potential in educational applications. However, VQG research mainly focuses on natural images, neglecting diagrams in educational materials used to assess students’ conceptual understanding. To address this gap, we introduce DiagramQG, a dataset containing 8,372 diagrams and 19,475 questions across various subjects. DiagramQG introduces concept and target text constraints, guiding the model to generate concept-focused questions for educational purposes. Meanwhile, we present the Hierarchical Knowledge Integration framework for Diagram Question Generation (HKI-DQG) as a strong baseline. This framework obtains multi-scale patches of diagrams and acquires knowledge using a visual language model with frozen parameters. It then integrates knowledge, text constraints and patches to generate concept-focused questions. We evaluate the performance of existing VQG models, open-source and closed-source vision-language models, and HKI-DQG on the DiagramQG dataset. Our HKI-DQG outperforms existing methods, demonstrating that it serves as a strong baseline. Furthermore, to assess its generalizability, we apply HKI-DQG to two other VQG datasets of natural images, namely VQG-COCO and K-VQG, achieving state-of-the-art performance. The dataset and code are available at this https URL.

[CV-124] Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

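Quick read: This paper addresses how to effectively control the granularity of detail in images generated by diffusion-based synthesis. The key to the solution is a single parameter, \omega, that is incorporated into the denoising steps of the diffusion model’s reverse process, giving precise control over the level of detail in the output without retraining the model, modifying its architecture, or adding computational overhead at inference. Spatial masks or denoising schedules with varying \omega values additionally allow region-specific or timestep-specific granularity control. Prior knowledge of image composition from control signals or reference images makes it possible to create precise \omega masks for granularity control on specific objects. The technique is named Omegance, combining “omega” and “nuance”, to emphasize its role in controlling subtle detail variations.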

Link: https://arxiv.org/abs/2411.17769
Authors: Xinyu Hou,Zongsheng Yue,Xiaoming Li,Chen Change Loy
Keywords: introduce a single, omega, effectively control granularity, single parameter, control
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:In this work, we introduce a single parameter, \omega, to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model’s reverse process. Our approach does not require model retraining, architectural modifications, or additional computational overhead during inference, yet enables precise control over the level of detail in the generated outputs. Moreover, spatial masks or denoising schedules with varying \omega values can be applied to achieve region-specific or timestep-specific granularity control. Prior knowledge of image composition from control signals or reference images further facilitates the creation of precise \omega masks for granularity control on specific objects. To highlight the parameter’s role in controlling subtle detail variations, the technique is named Omegance, combining “omega” and “nuance”. Our method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models. The code is available at this https URL.

[CV-125] Exploring Aleatoric Uncertainty in Object Detection via Vision Foundation Models

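Quick read: This paper addresses the inherent (aleatoric) uncertainty that randomness and noise in datasets introduce into object detection. The key to the solution is to model and exploit this uncertainty with vision foundation models. Specifically, the paper estimates the data uncertainty of each object instance in the feature space of a vision foundation model: it assumes that object features follow a mixture-of-Gaussians structure and quantifies data uncertainty with Mahalanobis distance-based measures. It also proposes two practical uses: 1) an uncertainty-aware sample filter that discards noisy and redundant instances to avoid overfitting, and 2) a sample-adaptive regularizer that balances easy and hard samples for adaptive training. The estimated uncertainty serves as an extra level of annotation for the dataset and can be applied to any model in a plug-and-play manner.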

Link: https://arxiv.org/abs/2411.17767
Authors: Peng Cui,Guande He,Dan Zhang,Zhijie Deng,Yinpeng Dong,Jun Zhu
Keywords: open world unavoidably, world unavoidably suffer, randomness or noisiness, open world, world unavoidably
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Datasets collected from the open world unavoidably suffer from various forms of randomness or noisiness, leading to the ubiquity of aleatoric (data) uncertainty. Quantifying such uncertainty is particularly pivotal for object detection, where images contain multi-scale objects with occlusion, obscureness, and even noisy annotations, in contrast to images with centric and similar-scale objects in classification. This paper suggests modeling and exploiting the uncertainty inherent in object detection data with vision foundation models and develops a data-centric reliable training paradigm. Technically, we propose to estimate the data uncertainty of each object instance based on the feature space of vision foundation models, which are trained on ultra-large-scale datasets and able to exhibit universal data representation. In particular, we assume a mixture-of-Gaussian structure of the object features and devise Mahalanobis distance-based measures to quantify the data uncertainty. Furthermore, we suggest two crucial and practical uses of the estimated uncertainty: 1) for defining an uncertainty-aware sample filter to abandon noisy and redundant instances to avoid over-fitting, and 2) for defining a sample-adaptive regularizer to balance easy/hard samples for adaptive training. The estimated aleatoric uncertainty serves as an extra level of annotations of the dataset, so it can be utilized in a plug-and-play manner with any model. Extensive empirical studies verify the effectiveness of the proposed aleatoric uncertainty measure on various advanced detection models and challenging benchmarks.
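
As a rough illustration of the Mahalanobis distance-based idea described above, the following sketch fits one Gaussian per class to toy feature vectors and scores an instance by its distance to its own class. The diagonal covariance, the toy 2D features, and all function names are assumptions made for illustration; the abstract does not specify the paper's actual measures.

```python
import math

def fit_class_gaussians(features, labels, eps=1e-6):
    """Fit a mean and a diagonal variance per class (one Gaussian per class).

    Assumption: a diagonal covariance keeps the sketch dependency-free.
    """
    stats = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        dim = len(rows[0])
        mean = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        var = [sum((r[d] - mean[d]) ** 2 for r in rows) / len(rows) + eps
               for d in range(dim)]
        stats[label] = (mean, var)
    return stats

def mahalanobis(x, mean, var):
    """Mahalanobis distance under a diagonal covariance matrix."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def uncertainty(x, label, stats):
    """Hypothetical uncertainty score: distance of x to its labeled class."""
    mean, var = stats[label]
    return mahalanobis(x, mean, var)

# Toy "foundation model" features for two classes (made-up values).
features = [[0.0, 0.0], [0.2, -0.2], [-0.2, 0.2],
            [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]
labels = ["car", "car", "car", "person", "person", "person"]
stats = fit_class_gaussians(features, labels)

typical = uncertainty([0.1, 0.1], "car", stats)    # close to the "car" cluster
atypical = uncertainty([2.5, 2.5], "car", stats)   # far from it: noisy or hard
print(typical, atypical)
```

An instance labeled "car" that lies far from the "car" cluster gets a larger score, which is the kind of signal a sample filter or a sample-adaptive regularizer could consume.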

[CV-126] I2VControl: Disentangled and Unified Video Motion Synthesis Control

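Quick read: This paper addresses insufficient controllability in video synthesis, in particular the difficulty of capturing the correct joint distribution between text descriptions and video motion. The key to the solution is I2VControl, a disentangled and unified framework that partitions a video into individual motion units and represents each unit with disentangled control signals, so that multiple control types can be combined flexibly within a single system. The method also integrates as a plug-in for pre-trained models and is agnostic to specific model architectures, which facilitates user-driven creative combinations.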

Link: https://arxiv.org/abs/2411.17765
Authors: Wanquan Feng,Tianhao Qi,Jiawei Liu,Mingzhen Sun,Pengqi Tu,Tianxiang Ma,Fei Dai,Songtao Zhao,Siyu Zhou,Qian He
Keywords: undergoing rapid progress, Video synthesis techniques, rapid progress, usability for end-users, techniques are undergoing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Video synthesis techniques are undergoing rapid progress, with controllability being a significant aspect of practical usability for end-users. Although text condition is an effective way to guide video synthesis, capturing the correct joint distribution between text descriptions and video motion remains a substantial challenge. In this paper, we present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis. Our approach partitions the video into individual motion units and represents each unit with disentangled control signals, which allows for various control types to be flexibly combined within our single system. Furthermore, our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. The project page is: this https URL .

[CV-127] Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

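Quick read: This paper addresses the detection of 3D reflection symmetry from a single RGB image and shows its significant benefit for single-image 3D generation. The key to the solution is Reflect3D, a scalable, zero-shot symmetry detector that generalizes robustly to diverse, real-world scenarios. Inspired by the success of foundation models, the method scales up symmetry detection with a transformer-based architecture and uses generative priors from multi-view diffusion models to resolve the inherent ambiguity of single-view symmetry detection. Extensive evaluations on various data sources show that Reflect3D sets a new state of the art in single-image symmetry detection. The paper further shows the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines: a symmetry-aware optimization process significantly improves the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures.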

Link: https://arxiv.org/abs/2411.17763
Authors: Xiang Li,Zixuan Huang,Anh Thai,James M. Rehg
Keywords: structure interpretation, ubiquitous and fundamental, fundamental property, critical cue, cue for perception
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.
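
To make the notion of 3D reflection symmetry concrete, the sketch below reflects points across a plane and scores how well a point set maps onto itself. It illustrates only the geometric definition, not Reflect3D's detector; the plane parameterization (unit normal plus offset), the error score, and the function names are assumptions.

```python
import math

def reflect(point, normal, offset=0.0):
    """Reflect a 3D point across the plane {x : normal . x = offset}.

    `normal` must have unit length.
    """
    dist = sum(p * n for p, n in zip(point, normal)) - offset
    return tuple(p - 2.0 * dist * n for p, n in zip(point, normal))

def symmetry_error(points, normal, offset=0.0):
    """Mean distance from each reflected point to its nearest original point.

    A value of 0.0 means the point set is exactly symmetric about the plane.
    """
    total = 0.0
    for p in points:
        r = reflect(p, normal, offset)
        total += min(math.dist(r, q) for q in points)
    return total / len(points)

# Toy point set that is mirror-symmetric about the plane x = 0.
points = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
          (2.0, 1.0, 0.5), (-2.0, 1.0, 0.5)]

err_x = symmetry_error(points, (1.0, 0.0, 0.0))  # candidate plane x = 0
err_y = symmetry_error(points, (0.0, 1.0, 0.0))  # candidate plane y = 0
print(err_x, err_y)
```

The candidate plane with the lower error is the better symmetry hypothesis. A learned detector has to infer such a plane from a single image, which is where the ambiguity mentioned in the abstract comes from.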

[CV-128] MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

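Quick read: This paper addresses the problem that existing unified models for visual generation and understanding rely on vision tokenizers that consider only low-level information, which leads to high training complexity and performance below that of dedicated understanding models. The key to the solution is Semantic Discrete Encoding (SDE), which adds semantic constraints to the visual tokenizer to align the information of visual tokens and language tokens, greatly reducing training difficulty and improving the performance of the unified model.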

Link: https://arxiv.org/abs/2411.17762
Authors: Rongchang Xie,Chen Du,Ping Song,Chang Liu
Keywords: Semantic discrete Encoding, introduce MUSE-VL, discrete Encoding, Semantic discrete, Semantic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with textual semantic features. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from that of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces training difficulty and improves the performance of the unified model. The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.

[CV-129] OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

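Quick read: This paper addresses two key problems in open-world autonomous driving: domain generalization and open vocabulary. Domain generalization concerns how well an autonomous driving system adapts across different scenarios and sensor parameter configurations, while open vocabulary refers to the ability to recognize semantic categories not encountered during training. The key to the solution is OpenAD, the first real-world open-world autonomous driving benchmark for 3D object detection. OpenAD is built on a corner case discovery and annotation pipeline that integrates a multimodal large language model (MLLM) and annotates corner case objects in a unified format for 2000 scenarios across five autonomous driving perception datasets. The paper also devises evaluation methodologies and evaluates a range of 2D and 3D open-world and specialized models. Finally, it proposes a vision-centric 3D open-world object detection baseline and an ensemble method that fuses general and specialized models to address the lower precision of existing open-world methods on the OpenAD benchmark.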

Link: https://arxiv.org/abs/2411.17761
Authors: Zhongyu Xia,Jishuo Li,Zhiwei Lin,Xinhao Wang,Yongtao Wang,Ming-Hsuan Yang
Keywords: encompasses domain generalization, driving encompasses domain, autonomous driving encompasses, autonomous driving, Open-world autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Open-world autonomous driving encompasses domain generalization and open vocabulary. Domain generalization refers to the capabilities of autonomous driving systems across different scenarios and sensor parameter configurations. Open vocabulary pertains to the ability to recognize various semantic categories not encountered during training. In this paper, we introduce OpenAD, the first real-world open-world autonomous driving benchmark for 3D object detection. OpenAD is built on a corner case discovery and annotation pipeline integrated with a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various 2D and 3D open-world and specialized models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark. Annotations, toolkit code, and all evaluation code will be released.

[CV-130] UVCG: Leveraging Temporal Consistency for Universal Video Protection

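Quick read: This paper addresses the security risks of AI-driven video editing, in particular the fact that per-frame image perturbations fail to protect videos because editing techniques use inter-frame consistency to restore individually perturbed content. The key to the solution is Universal Video Consistency Guard (UVCG). UVCG embeds the content of another (target) video in the protected video by introducing continuous, imperceptible perturbations that force the encoder of an editing model to map continuous inputs to misaligned continuous outputs, which inhibits the generation of videos consistent with the intended text prompts. UVCG also exploits the similarity of perturbations between adjacent frames and uses a perturbation-reuse strategy to make perturbation generation more computationally efficient. Tests across several versions of Latent Diffusion Models (LDM) confirm that the method is effective, transferable, and efficient at protecting video content from unauthorized modification.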

Link: https://arxiv.org/abs/2411.17746
Authors: KaiZhou Li,Jindong Gu,Xinchun Yu,Junjie Cao,Yansong Tang,Xiao-Ping Zhang
Keywords: garnered significant attention, AI-driven video editing, Video Consistency Guard, Universal Video Consistency, significant attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The security risks of AI-driven video editing have garnered significant attention. Although recent studies indicate that adding perturbations to images can protect them from malicious edits, directly applying image-based methods to perturb each frame in a video becomes ineffective, as video editing techniques leverage the consistency of inter-frame information to restore individually perturbed content. To address this challenge, we leverage the temporal consistency of video content to propose a straightforward and efficient, yet highly effective and broadly applicable approach, Universal Video Consistency Guard (UVCG). UVCG embeds the content of another video (the target video) within a protected video by introducing continuous, imperceptible perturbations that force the encoder of editing models to map continuous inputs to misaligned continuous outputs, thereby inhibiting the generation of videos consistent with the intended textual prompts. Additionally, leveraging the similarity in perturbations between adjacent frames, we improve the computational efficiency of perturbation generation by employing a perturbation-reuse strategy. We applied UVCG across various versions of Latent Diffusion Models (LDM) and assessed its effectiveness and generalizability across multiple LDM-based editing pipelines. The results confirm the effectiveness, transferability, and efficiency of our approach in safeguarding video content from unauthorized modifications.

[CV-131] SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning

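Quick read: This paper addresses the limitations of existing 3D scene representations for long-term exploration and reasoning in complex environments. Existing methods such as object-centric 3D scene graphs oversimplify spatial relationships, struggle with queries that require nuanced spatial understanding, and lack natural mechanisms for active exploration and memory management. The proposed solution is SnapMem, a snapshot-based 3D scene memory representation. SnapMem uses Memory Snapshots to capture rich visual information about explored regions and Frontier Snapshots to show glimpses of unexplored areas, so that an agent can make exploration decisions based on both known and potential new information. SnapMem also introduces an incremental construction pipeline and an effective memory retrieval technique to support lifelong memory management in active exploration settings.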

Link: https://arxiv.org/abs/2411.17735
Authors: Yuncong Yang,Han Yang,Jiachen Zhou,Peihao Chen,Hongxin Zhang,Yilun Du,Chuang Gan
Keywords: Constructing compact, Constructing, memory, scene, exploration
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Constructing compact and informative 3D scene representations is essential for effective embodied exploration and reasoning, especially in complex environments over long periods. Existing scene representations, such as object-centric 3D scene graphs, have significant limitations. They oversimplify spatial relationships by modeling scenes as individual objects, with inter-object relationships described by restrictive texts, making it difficult to answer queries that require nuanced spatial understanding. Furthermore, these representations lack natural mechanisms for active exploration and memory management, which hampers their application to lifelong autonomy. In this work, we propose SnapMem, a novel snapshot-based scene representation serving as 3D scene memory for embodied agents. SnapMem employs informative images, termed Memory Snapshots, to capture rich visual information of explored regions. It also integrates frontier-based exploration by introducing Frontier Snapshots (glimpses of unexplored areas) that enable agents to make informed exploration decisions by considering both known and potential new information. Meanwhile, to support lifelong memory in active exploration settings, we further present an incremental construction pipeline for SnapMem, as well as an effective memory retrieval technique for memory management. Experimental results on three benchmarks demonstrate that SnapMem significantly enhances agents’ exploration and reasoning capabilities in 3D environments over extended periods, highlighting its potential for advancing applications in embodied AI.

[CV-132] Evaluating and Improving the Effectiveness of Synthetic Chest X-Rays for Medical Image Analysis

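Quick read: This paper explores best practices for generating synthetic chest X-ray images and augmenting medical imaging datasets in order to optimize deep learning models on downstream tasks such as classification and segmentation. The key to the solution is a latent diffusion model that generates synthetic chest X-rays conditioned on text prompts and/or segmentation masks, together with a proxy model and radiologist feedback to improve the quality of the synthetic data. The synthetic images are generated from relevant disease information or geometrically transformed segmentation masks and added to real training images from the CheXpert, CANDID-PTX, SIIM, and RSNA Pneumonia datasets to measure the improvement in classification and segmentation performance. Compared with using only real data, the synthetic data yielded a maximum mean classification F1 score improvement of 0.150453 and a maximum segmentation Dice score improvement of 0.14575.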

Link: https://arxiv.org/abs/2411.18602
Authors: Eva Prakash,Jeya Maria Jose Valanarasu,Zhihong Chen,Eduardo Pontes Reis,Andrew Johnston,Anuj Pareek,Christian Bluethgen,Sergios Gatidis,Cameron Olsen,Akshay Chaudhari,Andrew Ng,Curtis Langlotz
Keywords: synthetic chest X-ray, chest X-ray images, explore best-practice approaches, augmenting medical imaging, chest X-ray
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Purpose: To explore best-practice approaches for generating synthetic chest X-ray images and augmenting medical imaging datasets to optimize the performance of deep learning models in downstream tasks like classification and segmentation. Materials and Methods: We utilized a latent diffusion model to condition the generation of synthetic chest X-rays on text prompts and/or segmentation masks. We explored methods like using a proxy model and using radiologist feedback to improve the quality of synthetic data. These synthetic images were then generated from relevant disease information or geometrically transformed segmentation masks and added to ground truth training set images from the CheXpert, CANDID-PTX, SIIM, and RSNA Pneumonia datasets to measure improvements in classification and segmentation model performance on the test sets. F1 and Dice scores were used to evaluate classification and segmentation respectively. One-tailed t-tests with Bonferroni correction assessed the statistical significance of performance improvements with synthetic data. Results: Across all experiments, the synthetic data we generated resulted in a maximum mean classification F1 score improvement of 0.150453 (CI: 0.099108-0.201798; P=0.0031) compared to using only real data. For segmentation, the maximum Dice score improvement was 0.14575 (CI: 0.108267-0.183233; P=0.0064). Conclusion: Best practices for generating synthetic chest X-ray images for downstream tasks include conditioning on single-disease labels or geometrically transformed segmentation masks, as well as potentially using proxy modeling for fine-tuning such generations.
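
The two evaluation metrics named in the abstract, F1 for classification and Dice for segmentation, can be sketched as follows. This is a generic illustration on toy binary labels and flattened masks, not the authors' evaluation code; the function names and inputs are assumptions.

```python
def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def dice_score(mask_true, mask_pred):
    """Dice = 2*|A & B| / (|A| + |B|) for flattened binary masks.

    Two empty masks are treated as a perfect match (a common convention,
    chosen here as an assumption).
    """
    intersection = sum(t * p for t, p in zip(mask_true, mask_pred))
    size = sum(mask_true) + sum(mask_pred)
    return 2 * intersection / size if size else 1.0

# Toy classification labels: 2 true positives, 1 false positive, 1 false negative.
f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])

# Toy flattened segmentation masks: overlap of 3 pixels, sizes 3 and 4.
dice = dice_score([0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 1, 0])
print(f1, dice)
```

For binary inputs the two formulas coincide; they differ mainly in what they are applied to (per-image labels versus per-pixel masks).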

[CV-133] Learning the Evolution of Physical Structure of Galaxies via Diffusion Models

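Quick read: This paper addresses the problem of understanding galaxy evolution from imaging data, in particular whether a generative model can accurately capture the relationship between galaxy images and their redshifts. The key to the solution is a novel approach that conditions Denoising Diffusion Probabilistic Models (DDPM) on redshift values to generate galaxy images. The model not only produces visually realistic galaxy images but also encodes the changes in physical properties with redshift that result from galaxy evolution, offering new scientific insight into cosmic phenomena.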

Link: https://arxiv.org/abs/2411.18440
Authors: Andrew Lizarraga,Eric Hanchen Jiang,Jacob Nowack,Yun Qi Li,Ying Nian Wu,Bernie Boscoe,Tuan Do
Keywords: Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, conditioning Denoising Diffusion, primarily through imaging, imaging data
Categories: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In astrophysics, understanding the evolution of galaxies, primarily through imaging data, is fundamental to comprehending the formation of the Universe. This paper introduces a novel approach to conditioning Denoising Diffusion Probabilistic Models (DDPM) on redshifts for generating galaxy images. We explore whether this advanced generative model can accurately capture the physical characteristics of galaxies based solely on their images and redshift measurements. Our findings demonstrate that this model not only produces visually realistic galaxy images but also encodes the underlying changes in physical properties with redshift that are the result of galaxy evolution. This approach marks a significant advancement in using generative models to enhance our scientific insight into cosmic phenomena.

[CV-134] Leveraging Semantic Asymmetry for Precise Gross Tumor Volume Segmentation of Nasopharyngeal Carcinoma in Planning CT

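Quick read: This paper addresses the difficulty of segmenting the gross tumor volume (GTV) in radiation therapy for nasopharyngeal carcinoma (NPC) on non-contrast planning computed tomography (CT) images, where the contrast between tumors and adjacent normal tissue is low. The key to the solution is a 3D Semantic Asymmetry Tumor segmentation (SATs) method. It builds on the assumption that a healthy nasopharyngeal region is bilaterally symmetric, whereas the emergence of NPC disrupts this symmetry, and uses a contrastive learning framework to make features more sensitive to semantic asymmetries. Specifically, the method minimizes the voxel-wise distance between original and flipped areas without tumor while encouraging a larger distance between original and flipped areas with tumor, which improves NPC GTV segmentation accuracy on low-contrast CT images.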

Link: https://arxiv.org/abs/2411.18290
Authors: Zi Li,Ying Chen,Zeli Chen,Yanzhou Su,Tai Ma,Tony C. W. Mok,Yan-Jie Zhou,Yunhai Bai,Zhinlin Zheng,Le Lu,Yirui Wang,Jia Ge,Xianghua Ye,Senxiang Yan,Dakai Jin
Keywords: clinicians typically delineate, radiation dose delivery, ensure accurate radiation, accurate radiation dose, planning computed tomography
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the radiation therapy of nasopharyngeal carcinoma (NPC), clinicians typically delineate the gross tumor volume (GTV) using non-contrast planning computed tomography to ensure accurate radiation dose delivery. However, the low contrast between tumors and adjacent normal tissues necessitates that radiation oncologists manually delineate the tumors, often relying on diagnostic MRI for guidance. In this study, we propose a novel approach to directly segment NPC gross tumors on non-contrast planning CT images, circumventing potential registration errors when aligning MRI or MRI-derived tumor masks to planning CT. To address the low contrast issues between tumors and adjacent normal structures in planning CT, we introduce a 3D Semantic Asymmetry Tumor segmentation (SATs) method. Specifically, we posit that a healthy nasopharyngeal region is characteristically bilaterally symmetric, whereas the emergence of nasopharyngeal carcinoma disrupts this symmetry. Then, we propose a Siamese contrastive learning segmentation framework that minimizes the voxel-wise distance between original and flipped areas without tumor and encourages a larger distance between original and flipped areas with tumor. Thus, our approach enhances the sensitivity of features to semantic asymmetries. Extensive experiments demonstrate that the proposed SATs achieves the leading NPC GTV segmentation performance in both internal and external testing, e.g., with at least 2% absolute Dice score improvement and 12% average distance error reduction when compared to other state-of-the-art methods in the external testing.
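
The symmetry-based objective described above can be sketched as a margin-based contrastive loss over per-voxel feature vectors. This is a minimal illustration that assumes a standard contrastive form with a hypothetical margin; the abstract does not give the exact loss, so the formula, the names, and the toy features below are all assumptions.

```python
import math

def asymmetry_contrastive_loss(feat_orig, feat_flip, tumor_mask, margin=1.0):
    """Toy voxel-wise contrastive loss between original and flipped features.

    Non-tumor voxels: penalize any distance (pull the pair together).
    Tumor voxels: penalize distances below `margin` (push the pair apart).
    `margin` is a hypothetical hyperparameter.
    """
    total = 0.0
    for f, g, is_tumor in zip(feat_orig, feat_flip, tumor_mask):
        d = math.dist(f, g)
        if is_tumor:
            total += max(0.0, margin - d) ** 2
        else:
            total += d ** 2
    return total / len(tumor_mask)

# Three voxels with 2D features; the last voxel is labeled as tumor.
feat_orig = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tumor_mask = [0, 0, 1]

# Case 1: features are perfectly symmetric everywhere, including the tumor
# voxel, so the tumor voxel is penalized for not being distinctive.
loss_symmetric = asymmetry_contrastive_loss(feat_orig, feat_orig, tumor_mask)

# Case 2: the tumor voxel's flipped feature differs by more than the margin.
feat_flip = [[1.0, 0.0], [0.0, 1.0], [3.0, 1.0]]
loss_asymmetric = asymmetry_contrastive_loss(feat_orig, feat_flip, tumor_mask)
print(loss_symmetric, loss_asymmetric)
```

The loss is lowest when healthy regions look the same after flipping and tumor regions do not, which is the sensitivity to semantic asymmetry that the abstract describes.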

[CV-135] Deep End-to-end Adaptive k-Space Sampling Reconstruction and Registration for Dynamic MRI

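Quick read: This paper addresses the undersampling of k-space data in dynamic MRI, which is forced by time constraints and physiological motion (such as respiratory and cardiac motion), and the resulting loss of image quality and inaccurate deformation field estimation. The key to the solution is an end-to-end deep learning (DL) framework that integrates three modules: adaptive dynamic k-space sampling, reconstruction, and registration. A DL-based adaptive sampling strategy first optimizes the acquisition of dynamic k-space data; a DL-based reconstruction module then produces optimized images from the undersampled dynamic data; finally, a registration module estimates the deformation fields that align the reconstructed dynamic images with a static reference image. The framework is independent of specific reconstruction and registration modules, which allows plug-and-play integration of different components, and it is trained jointly with a combination of supervised and unsupervised loss functions for end-to-end optimization, improving the robustness of motion estimation from undersampled dynamic data.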

Link: https://arxiv.org/abs/2411.18249
Authors: George Yiasemis,Jan-Jakob Sonke,Jonas Teuwen
Keywords: Dynamic MRI enables, organ motion tracking, Dynamic MRI, MRI enables, range of clinical
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 39 pages, 19 figures, 4 tables

Abstract:Dynamic MRI enables a range of clinical applications, including cardiac function assessment, organ motion tracking, and radiotherapy guidance. However, fully sampling the dynamic k-space data is often infeasible due to time constraints and physiological motion such as respiratory and cardiac motion. This necessitates undersampling, which degrades the quality of reconstructed images. Poor image quality not only hinders visualization but also impairs the estimation of deformation fields, crucial for registering dynamic (moving) images to a static reference image. This registration enables tasks such as motion correction, treatment planning, and quantitative analysis in applications like cardiac imaging and MR-guided radiotherapy. To overcome the challenges posed by undersampling and motion, we introduce an end-to-end deep learning (DL) framework that integrates adaptive dynamic k-space sampling, reconstruction, and registration. Our approach begins with a DL-based adaptive sampling strategy, optimizing dynamic k-space acquisition to capture the most relevant data for each specific case. This is followed by a DL-based reconstruction module that produces images optimized for accurate deformation field estimation from the undersampled moving data. Finally, a registration module estimates the deformation fields aligning the reconstructed dynamic images with a static reference. The proposed framework is independent of specific reconstruction and registration modules allowing for plug-and-play integration of these components. The entire framework is jointly trained using a combination of supervised and unsupervised loss functions, enabling end-to-end optimization for improved performance across all components. Through controlled experiments and ablation studies, we validate each component, demonstrating that each choice contributes to robust motion estimation from undersampled dynamic data.
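
To illustrate why undersampling degrades reconstructed images, the toy sketch below takes a 1D signal into k-space (its discrete Fourier transform), zeroes the lines that a sampling mask skips, and reconstructs by inverse transform. It is a generic zero-filled reconstruction with a fixed, hand-picked mask, not the paper's learned adaptive sampling or DL-based reconstruction; all names and values are assumptions.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), forward or inverse."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, x in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out

def zero_filled_recon(signal, mask):
    """Sample k-space where mask is 1, set the rest to zero, and invert."""
    kspace = dft(signal)
    undersampled = [v if keep else 0 for v, keep in zip(kspace, mask)]
    return [v.real for v in dft(undersampled, inverse=True)]

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

signal = [0.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.0]

full_mask = [1] * 8                    # fully sampled k-space
low_mask = [1, 1, 0, 0, 0, 0, 0, 1]    # keep only k = 0 and k = +/-1

err_full = rmse(zero_filled_recon(signal, full_mask), signal)
err_low = rmse(zero_filled_recon(signal, low_mask), signal)
print(err_full, err_low)
```

Full sampling recovers the signal up to floating-point error, while the undersampled reconstruction loses the higher-frequency detail. An adaptive sampling strategy would choose which k-space lines to keep per case rather than using a fixed mask.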

[CV-136] Towards Lensless Image Deblurring with Prior-Embedded Implicit Neural Representations in the Low-Data Regime

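Quick read: This paper addresses lensless image reconstruction in computational imaging, in particular image deblurring with untrained neural networks. The key to the solution is to use implicit neural representations in a prior-embedded, untrained iterative optimization, which improves reconstruction performance and speeds up convergence without requiring prior training. The approach bridges the gap between the no-data and high-data regimes, and a comparative analysis against various untrained and low-shot methods shows that it performs better by a significant margin.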

Link: https://arxiv.org/abs/2411.18189
Authors: Abeer Banerjee,Sanjay Singh
Keywords: promising paradigm shift, Generative Adversarial Networks, computational imaging problems, inverse computational imaging, leveraging Generative Adversarial
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The field of computational imaging has witnessed a promising paradigm shift with the emergence of untrained neural networks, offering novel solutions to inverse computational imaging problems. While existing techniques have demonstrated impressive results, they often operate either in the high-data regime, leveraging Generative Adversarial Networks (GANs) as image priors, or through untrained iterative reconstruction in a data-agnostic manner. This paper delves into lensless image reconstruction, a subset of computational imaging that replaces traditional lenses with computation, enabling the development of ultra-thin and lightweight imaging systems. To the best of our knowledge, we are the first to leverage implicit neural representations for lensless image deblurring, achieving reconstructions without the requirement of prior training. We perform prior-embedded untrained iterative optimization to enhance reconstruction performance and speed up convergence, effectively bridging the gap between the no-data and high-data regimes. Through a thorough comparative analysis encompassing various untrained and low-shot methods, including under-parameterized non-convolutional methods and domain-restricted low-shot methods, we showcase the superior performance of our approach by a significant margin.

[CV-137] Mortality Prediction of Pulmonary Embolism Patients with Deep Learning and XGBoost

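Quick read: This paper addresses the prediction of 30-day in-hospital mortality for patients with Pulmonary Embolism (PE), which conventional clinical methods predict with limited success. The key to the solution is a new algorithm, PEP-Net, which combines a 3D Residual Network (3DResNet) with the Extreme Gradient Boosting (XGBoost) algorithm and makes predictions from the initial imaging data (CT scans). PEP-Net handles class imbalance, reduces overfitting through regularization, and lowers prediction variance for more stable predictions. On a cohort of 193 CT scans from patients with acute PE, PEP-Net reached accuracies of 94.5% (±0.3) and 94.0% (±0.7) with lung-region (Lung-ROI) and heart-region (Cardiac-ROI) inputs respectively, significantly outperforming baseline models (76-78%).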

Link: https://arxiv.org/abs/2411.18063
Authors: Yalcin Tur,Vedat Cicek,Tufan Cinar,Elif Keles,Bradlay D. Allen,Hatice Savas,Gorkem Durak,Alpay Medetalibeyoglu,Ulas Bagci
Keywords: Pulmonary Embolism, enhanced diagnostic strategies, Extreme Gradient Boosting, critical illness, cardiovascular condition
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published at IEEE ICECCME 2024, Maldives, 4-6 November 2024

Abstract:Pulmonary Embolism (PE) is a serious cardiovascular condition that remains a leading cause of mortality and critical illness, underscoring the need for enhanced diagnostic strategies. Conventional clinical methods have limited success in predicting 30-day in-hospital mortality of PE patients. In this study, we present a new algorithm, called PEP-Net, for 30-day mortality prediction of PE patients based on the initial imaging data (CT) that opportunistically integrates a 3D Residual Network (3DResNet) with the Extreme Gradient Boosting (XGBoost) algorithm, using patient-level binary labels without annotations of the emboli and their extent. Our proposed system offers a comprehensive prediction strategy by handling class imbalance problems, reducing overfitting via regularization, and reducing the prediction variance for more stable predictions. PEP-Net was tested in a cohort of 193 volumetric CT scans diagnosed with Acute PE, and it demonstrated a superior performance by significantly outperforming baseline models (76-78%) with an accuracy of 94.5% (+/-0.3) and 94.0% (+/-0.7) when the input image is either lung region (Lung-ROI) or heart region (Cardiac-ROI). Our results advance PE prognostics by using only initial imaging data, setting a new benchmark in the field. While purely deep learning models have become the go-to for many medical classification (diagnostic) tasks, the combined ResNet and XGBoost models herein outperform purely deep learning models, potentially because of the limited amount of data.

[CV-138] Neural Finite-State Machines for Surgical Phase Recognition

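【Quick read】: This paper addresses the difficulty that Transformer-based architectures have in staying consistent across long surgical videos in surgical phase recognition. The key to the solution is the Neural Finite-State Machine (NFSM) module, which combines procedure-level understanding with neural networks through global state embeddings, attention-based dynamic transition tables, and transition-aware training and inference mechanisms, supporting both offline and online phase recognition. NFSM improves video-level accuracy and phase-level precision, recall, and Jaccard index (by 2.3, 3.2, 3.0, and 4.8 percentage points on Cholec80), adds complementary value when attached to existing state-of-the-art models such as Surgformer, and generalizes beyond surgery in extended experiments on non-surgical datasets.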

Link: https://arxiv.org/abs/2411.18018
Authors: Hao Ding,Zhongpai Gao,Benjamin Planche,Tianyu Luan,Abhishek Sharma,Meng Zheng,Ange Lou,Terrence Chen,Mathias Unberath,Ziyan Wu
Keywords: analyzing procedure-specific surgical, procedure-specific surgical videos, essential for analyzing, analyzing procedure-specific, procedure-specific surgical
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surgical phase recognition is essential for analyzing procedure-specific surgical videos. While recent transformer-based architectures have advanced sequence processing capabilities, they struggle with maintaining consistency across lengthy surgical procedures. Drawing inspiration from classical hidden Markov models’ finite-state interpretations, we introduce the neural finite-state machine (NFSM) module, which bridges procedural understanding with deep learning approaches. NFSM combines procedure-level understanding with neural networks through global state embeddings, attention-based dynamic transition tables, and transition-aware training and inference mechanisms for offline and online applications. When integrated into our future-aware architecture, NFSM improves video-level accuracy, phase-level precision, recall, and Jaccard indices on Cholec80 datasets by 2.3, 3.2, 3.0, and 4.8 percentage points respectively. As an add-on module to existing state-of-the-art models like Surgformer, NFSM further enhances performance, demonstrating its complementary value. Extended experiments on non-surgical datasets validate NFSM’s generalizability beyond surgical domains. Comprehensive experiments demonstrate that incorporating NFSM into deep learning frameworks enables more robust and consistent phase recognition across long procedural videos.
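
The abstract names classical hidden Markov models as the inspiration for NFSM. As a point of reference, here is a minimal sketch of that classical baseline: Viterbi decoding with a fixed, hand-written transition table that only allows staying in a phase or advancing to the next one. This is not the NFSM module, whose transition tables are dynamic and attention-based; the three phases, all probabilities, and the names `viterbi` and `safe_log` are illustrative assumptions.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(frame_log_probs, log_transition, log_initial):
    """Most likely state path given per-frame log-probabilities and a transition table."""
    n_states = len(log_initial)
    score = [log_initial[s] + frame_log_probs[0][s] for s in range(n_states)]
    back = []
    for obs in frame_log_probs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            # Best predecessor for state s under the transition table.
            best = max(range(n_states), key=lambda r: prev[r] + log_transition[r][s])
            pointers.append(best)
            score.append(prev[best] + log_transition[best][s] + obs[s])
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy setup (assumed): three phases that can only stay or move forward.
transition = [[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]]
initial = [1.0, 0.0, 0.0]
# Hypothetical per-frame classifier outputs; the fourth frame is noisy.
frames = [[0.8, 0.1, 0.1],
          [0.7, 0.2, 0.1],
          [0.2, 0.7, 0.1],
          [0.6, 0.3, 0.1],
          [0.1, 0.8, 0.1],
          [0.02, 0.03, 0.95]]

log_frames = [[safe_log(p) for p in f] for f in frames]
log_transition = [[safe_log(p) for p in row] for row in transition]
log_initial = [safe_log(p) for p in initial]

per_frame = [max(range(3), key=lambda s: f[s]) for f in frames]
decoded = viterbi(log_frames, log_transition, log_initial)
print(per_frame)  # [0, 0, 1, 0, 1, 2]
print(decoded)    # [0, 0, 1, 1, 1, 2]
```

In this toy sequence the per-frame argmax falls back to phase 0 at the fourth frame, while the decoded path keeps the phases in order.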

[CV-139] HAAT: Hybrid Attention Aggregation Transformer for Image Super-Resolution

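【Quick read】: This paper addresses a limitation of self-attention in existing image super-resolution methods: restricting self-attention to non-overlapping windows saves computation but ignores useful information across channels. The key to the solution is a new model, the Hybrid Attention Aggregation Transformer (HAAT), built by integrating Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks (HGAB). SDRCB expands the receptive field while keeping the architecture streamlined, which improves performance; HGAB combines channel attention, sparse attention, and window attention to improve non-local feature fusion and produce more visually compelling results. Experimental evaluations show that HAAT surpasses existing state-of-the-art methods on benchmark datasets.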

Link: https://arxiv.org/abs/2411.18003
Authors: Song-Jiang Lai,Tsun-Hin Cheung,Ka-Chun Fung,Kai-wen Xue,Kin-Man Lam
Keywords: global spatial modeling, shifting window attention, research area, global spatial, spatial modeling
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures, 1 table

Abstract:In the research area of image super-resolution, Swin-transformer-based models are favored for their global spatial modeling and shifting window attention mechanism. However, existing methods often limit self-attention to non-overlapping windows to cut costs and ignore the useful information that exists across channels. To address this issue, this paper introduces a novel model, the Hybrid Attention Aggregation Transformer (HAAT), designed to better leverage feature information. HAAT is constructed by integrating Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks (HGAB). SDRCB expands the receptive field while maintaining a streamlined architecture, resulting in enhanced performance. HGAB incorporates channel attention, sparse attention, and window attention to improve nonlocal feature fusion and achieve more visually compelling results. Experimental evaluations demonstrate that HAAT surpasses state-of-the-art methods on benchmark datasets. Keywords: Image super-resolution, Computer vision, Attention mechanism, Transformer

[CV-140] Breast Tumor Classification Using EfficientNet Deep Learning Model

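【Quick read】: This paper addresses data imbalance in breast histopathology image classification: some tumor subtypes appear so rarely in the dataset that model predictions are biased toward the majority classes and overlook critical but rare ones. The key to the solution is to adopt EfficientNet, a state-of-the-art convolutional neural network (CNN), and to mitigate the imbalance with intensive data augmentation and cost-sensitive learning. In addition, transfer learning reuses weights trained on the binary classification task for the multi-class task, improving the model's ability to detect complex patterns. Together these strategies improve both binary and multi-class performance (multi-class accuracy rises from 91.27% to 95.04%), with notable precision gains and consistently high recall on minority classes such as Mucinous carcinoma and Papillary carcinoma.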

Link: https://arxiv.org/abs/2411.17870
Authors: Majid Behzadpour,Bengie L. Ortiz,Ebrahim Azizi,Kai Wu
Keywords: Precise breast cancer, Precise breast, breast cancer classification, outcome in oncology, breast cancer
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures

Abstract:Precise breast cancer classification on histopathological images has the potential to greatly improve the diagnosis and patient outcome in oncology. The data imbalance problem largely stems from the inherent imbalance within medical image datasets, where certain tumor subtypes may appear much less frequently. This leads to biased model predictions that can overlook critical but rare classes. In this work, we adopted EfficientNet, a state-of-the-art convolutional neural network (CNN) model that balances high accuracy with computational cost efficiency. To address data imbalance, we introduce an intensive data augmentation pipeline and cost-sensitive learning, improving representation and ensuring that the model does not overly favor majority classes. This approach provides the ability to learn effectively from rare tumor types, improving its robustness. Additionally, we fine-tuned the model using transfer learning, where weights initially trained on a binary classification task were adapted to multi-class classification, improving the capability to detect complex patterns within the BreakHis dataset. Our results underscore significant improvements in the binary classification performance, achieving an exceptional recall increase for benign cases from 0.92 to 0.95, alongside an accuracy enhancement from 97.35% to 98.23%. Our approach improved the performance of multi-class tasks from 91.27% with regular augmentation to 94.54% with intensive augmentation, reaching 95.04% with transfer learning. This framework demonstrated substantial gains in precision in the minority classes, such as Mucinous carcinoma and Papillary carcinoma, while maintaining high recall consistently across these critical subtypes, as further confirmed by confusion matrix analysis.
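
The abstract does not spell out how its cost-sensitive learning is implemented. One common form, shown here as a minimal sketch and not as the authors' method, weights each class inversely to its frequency and uses those weights in the cross-entropy loss. The class names `common` and `rare`, the 8:2 split, and the predicted probabilities are made-up illustration values.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); each item of probs maps class -> probability."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

# Toy imbalanced dataset (assumed): 8 majority samples, 2 minority samples.
labels = ["common"] * 8 + ["rare"] * 2
weights = inverse_frequency_weights(labels)

# A hypothetical model that always favors the majority class.
probs = [{"common": 0.9, "rare": 0.1}] * len(labels)
uniform = {c: 1.0 for c in weights}

plain_loss = weighted_cross_entropy(probs, labels, uniform)
cost_sensitive_loss = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(plain_loss, cost_sensitive_loss)
```

With these weights the two minority samples carry as much total weight as the eight majority samples, so a model that ignores the rare class is penalized more heavily than under the unweighted loss.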

[CV-141] Reliability of deep learning models for anatomical landmark detection: The role of inter-rater variability

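【Quick read】: This paper addresses the fact that the effect of inter-rater annotation variability on model performance and reliability is often overlooked when building deep learning (DL) models for anatomical landmark detection. The key to the solution is a study of different annotation-fusion strategies that preserve inter-rater variability, with the aim of improving the performance and reliability of the resulting DL models. The paper also introduces a new Weighted Coordinate Variance metric to quantify detection uncertainty, examines the link between inter-rater variability, DL model performance, and uncertainty, and shows how multi-rater annotation-fusion strategies affect these factors.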

Link: https://arxiv.org/abs/2411.17850
Authors: Soorena Salari,Hassan Rivaz,Yiming Xiao
Keywords: Automated detection, anatomical landmarks plays, surgical applications, anatomical landmark detection, diagnostic and surgical
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SPIE Medical Imaging 2025

Abstract:Automated detection of anatomical landmarks plays a crucial role in many diagnostic and surgical applications. Progress in deep learning (DL) methods has resulted in significant performance enhancement in tasks related to anatomical landmark detection. While current research focuses on accurately localizing these landmarks in medical scans, the importance of inter-rater annotation variability in building DL models is often overlooked. Understanding how inter-rater variability impacts the performance and reliability of the resulting DL algorithms, which are crucial for clinical deployment, can inform the improvement of training data construction and boost DL models’ outcomes. In this paper, we conducted a thorough study of different annotation-fusion strategies to preserve inter-rater variability in DL models for anatomical landmark detection, aiming to boost the performance and reliability of the resulting algorithms. Additionally, we explored the characteristics and reliability of four metrics, including a novel Weighted Coordinate Variance metric to quantify landmark detection uncertainty/inter-rater variability. Our research highlights the crucial connection between inter-rater variability, DL model performance, and uncertainty, revealing how different approaches for multi-rater landmark annotation fusion can influence these factors.

[CV-142] CAMLD: Contrast-Agnostic Medical Landmark Detection with Consistency-Based Regularization

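【Quick read】: This paper addresses anatomical landmark detection in medical images, in particular how to detect landmarks efficiently and accurately with deep learning when large annotated datasets are unavailable. The key to the solution is CAMLD, a new self-supervised deep learning framework that detects anatomical landmarks in unlabeled scans from a single reference example. Its main components are: 1) an inter-subject landmark consistency loss combined with an image registration loss; 2) a 3D convolution-based contrast augmentation strategy that improves generalization to new contrasts; and 3) an adaptive mixed loss function that schedules the contributions of the different sub-tasks. In experiments on four clinical and public datasets, CAMLD outperforms state-of-the-art methods in mean radial error (MRE) and success detection rate (SDR), demonstrating robustness and accuracy across imaging contrasts.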

Link: https://arxiv.org/abs/2411.17845
Authors: Soorena Salari,Arash Harirpoush,Hassan Rivaz,Yiming Xiao
Keywords: including disease diagnosis, Anatomical landmark detection, research applications, surgical planning, disease diagnosis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 6 figures, 3 tables

Abstract:Anatomical landmark detection in medical images is essential for various clinical and research applications, including disease diagnosis and surgical planning. However, manual landmark annotation is time-consuming and requires significant expertise. Existing deep learning (DL) methods often require large amounts of well-annotated data, which are costly to acquire. In this paper, we introduce CAMLD, a novel self-supervised DL framework for anatomical landmark detection in unlabeled scans with varying contrasts by using only a single reference example. To achieve this, we employed an inter-subject landmark consistency loss with an image registration loss while introducing a 3D convolution-based contrast augmentation strategy to promote model generalization to new contrasts. Additionally, we utilize an adaptive mixed loss function to schedule the contributions of different sub-tasks for optimal outcomes. We demonstrate the proposed method with the intricate task of MRI-based 3D brain landmark detection. With comprehensive experiments on four diverse clinical and public datasets, including both T1w and T2w MRI scans at different MRI field strengths, we demonstrate that CAMLD outperforms the state-of-the-art methods in terms of mean radial errors (MREs) and success detection rates (SDRs). Our framework provides a robust and accurate solution for anatomical landmark detection, reducing the need for extensively annotated datasets and generalizing well across different imaging contrasts. Our code will be publicly available at: this https URL.
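
The two evaluation metrics named in the abstract can be sketched as follows, assuming landmarks are given as 3D coordinates in millimetres. The coordinates, the 4 mm threshold, and the use of `<=` for the success test are illustrative assumptions; the paper's own thresholds may differ.

```python
import math

def radial_errors(predicted, reference):
    """Euclidean distance between each predicted landmark and its reference."""
    return [math.dist(p, r) for p, r in zip(predicted, reference)]

def mean_radial_error(predicted, reference):
    """MRE: average Euclidean distance over all landmarks."""
    errors = radial_errors(predicted, reference)
    return sum(errors) / len(errors)

def success_detection_rate(predicted, reference, threshold_mm):
    """SDR: fraction of landmarks whose radial error is within the threshold."""
    errors = radial_errors(predicted, reference)
    return sum(e <= threshold_mm for e in errors) / len(errors)

# Made-up landmark coordinates in millimetres.
reference = [(10.0, 20.0, 30.0), (40.0, 50.0, 60.0)]
predicted = [(10.0, 20.0, 33.0), (44.0, 53.0, 60.0)]

mre = mean_radial_error(predicted, reference)
sdr = success_detection_rate(predicted, reference, threshold_mm=4.0)
print(mre)  # 4.0  (radial errors are 3 mm and 5 mm)
print(sdr)  # 0.5  (only the first landmark is within 4 mm)
```

Lower MRE and higher SDR are better; SDR is usually reported at several thresholds so that near misses and gross failures can be told apart.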

Artificial Intelligence

[AI-0] Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization

Link: https://arxiv.org/abs/2411.18612
Authors: Cheng Tang,Zhishuai Liu,Pan Xu
Keywords: Markov Decision Process, Distributionally Robust Markov, Regularized Markov Decision, addressing dynamics shift, Robust Regularized Markov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments: 52 pages, 3 figures, 2 tables

Abstract:The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for addressing dynamics shift in reinforcement learning by learning policies robust to the worst-case transition dynamics within a constrained set. However, solving its dual optimization oracle poses significant challenges, limiting theoretical analysis and computational efficiency. The recently proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty set constraint with a regularization term on the value function, offering improved scalability and theoretical insights. Yet, existing RRMDP methods rely on unstructured regularization, often leading to overly conservative policies by considering transitions that are unrealistic. To address these issues, we propose a novel framework, the d-rectangular linear robust regularized Markov decision process (d-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline RL setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), employing linear function approximation and f-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing these bounds depend on how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. This term is further shown to be fundamental to d-RRMDPs via information-theoretic lower bounds. Finally, numerical experiments validate that R2PVI learns robust policies and is computationally more efficient than methods for constrained DRMDPs.
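
To make the idea of replacing an uncertainty set with a regularization term concrete, here is a minimal sketch for a single state-action pair, using the KL divergence as one instance of an f-divergence. It relies on the standard identity inf_P { E_P[V] + λ·KL(P‖P0) } = −λ·log E_P0[exp(−V/λ)], whose minimizer is P*(s) ∝ P0(s)·exp(−V(s)/λ). The sketch ignores the paper's d-rectangular linear structure and is not the R2PVI algorithm; the value vector, nominal distribution, and λ are made-up numbers.

```python
import math

def kl_regularized_value(values, nominal, lam):
    """Closed form of inf_P E_P[V] + lam * KL(P || P0), computed stably."""
    m = min(values)
    s = sum(p * math.exp(-(v - m) / lam) for v, p in zip(values, nominal))
    return m - lam * math.log(s)

def worst_case_distribution(values, nominal, lam):
    """Minimizer P*(s) proportional to P0(s) * exp(-V(s) / lam)."""
    m = min(values)
    w = [p * math.exp(-(v - m) / lam) for v, p in zip(values, nominal)]
    z = sum(w)
    return [x / z for x in w]

def regularized_objective(q, values, nominal, lam):
    """E_q[V] + lam * KL(q || P0) for an arbitrary distribution q."""
    expected = sum(qi * v for qi, v in zip(q, values))
    kl = sum(qi * math.log(qi / p) for qi, p in zip(q, nominal) if qi > 0)
    return expected + lam * kl

values = [1.0, 0.0, 2.0]      # next-state values V(s'), assumed
nominal = [0.5, 0.25, 0.25]   # nominal transition probabilities P0, assumed
lam = 0.5                     # regularization strength, assumed

robust = kl_regularized_value(values, nominal, lam)
q_star = worst_case_distribution(values, nominal, lam)
nominal_value = regularized_objective(nominal, values, nominal, lam)
print(robust, nominal_value)
```

The robust value never exceeds the nominal expectation, and as λ grows the penalty for leaving the nominal kernel dominates, so the two values move closer together.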

[AI-1] NeuroAI for AI Safety

Link: https://arxiv.org/abs/2411.18526
Authors: Patrick Mineault,Niccolò Zanichelli,Joanne Zichen Peng,Anton Arkhipov,Eli Bingham,Julian Jara-Ettinger,Emily Mackevicius,Adam Marblestone,Marcelo Mattar,Andrew Payne,Sophia Sanborn,Karen Schroeder,Zenna Tavares,Andreas Tolias
Keywords: increasingly powerful, safety, brain, Neuroscience, Abstract
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 133 pages, 19 figures

Abstract:As AI systems become increasingly powerful, the need for safe AI has become more pressing. Humans are an attractive model for AI safety: as the only known agents capable of general intelligence, they perform robustly even under conditions that deviate significantly from prior experiences, explore the world safely, understand pragmatics, and can cooperate to meet their intrinsic goals. Intelligence, when coupled with cooperation and safety mechanisms, can drive sustained progress and well-being. These properties are a function of the architecture of the brain and the learning algorithms it implements. Neuroscience may thus hold important keys to technical AI safety that are currently underexplored and underutilized. In this roadmap, we highlight and critically evaluate several paths toward AI safety inspired by neuroscience: emulating the brain’s representations, information processing, and architecture; building robust sensory and motor systems from imitating brain data and bodies; fine-tuning AI systems on brain data; advancing interpretability using neuroscience methods; and scaling up cognitively-inspired architectures. We make several concrete recommendations for how neuroscience can positively impact AI safety.

[AI-2] LLM-ABBA: Understand time series via symbolic approximation

Link: https://arxiv.org/abs/2411.18506
Authors: Erin Carson,Xinye Chen,Cheng Kang
Keywords: time series, time, series, symbolic time series, time series tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The success of large language models (LLMs) for time series has been demonstrated in previous work. Utilizing a symbolic time series representation, one can efficiently bridge the gap between LLMs and time series. However, the remaining challenge is to exploit the semantic information hidden in time series by using symbols or existing tokens of LLMs, while aligning the embedding space of LLMs according to the hidden information of time series. The symbolic time series approximation (STSA) method called adaptive Brownian bridge-based symbolic aggregation (ABBA) shows outstanding efficacy in preserving salient time series features by modeling time series patterns in terms of amplitude and period while using existing tokens of LLMs. In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA into large language models for various downstream time series tasks. By symbolizing time series, LLM-ABBA compares favorably to the recent state-of-the-art (SOTA) in UCR and three medical time series classification tasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to avoid obvious drifting during prediction tasks by significantly mitigating the effects of cumulative error arising from misused symbols during the transition from symbols to numerical values. In time series regression tasks, LLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER) benchmarks. LLM-ABBA also shows competitive prediction capability compared to recent SOTA time series prediction results. We believe this framework can also seamlessly extend to other time series tasks.

[AI-3] SoK: Watermarking for AI-Generated Content

Link: https://arxiv.org/abs/2411.18479
Authors: Xuandong Zhao,Sam Gunn,Miranda Christ,Jaiden Fairoze,Andres Fabrega,Nicholas Carlini,Sanjam Garg,Sanghyun Hong,Milad Nasr,Florian Tramer,Somesh Jha,Lei Li,Yu-Xiang Wang,Dawn Song
Keywords: improve in quality, outputs of generative, increasingly challenging, challenging to distinguish, Watermarking
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As the outputs of generative AI (GenAI) techniques improve in quality, it becomes increasingly challenging to distinguish them from human-created content. Watermarking schemes are a promising approach to address the problem of distinguishing between AI and human-generated content. These schemes embed hidden signals within AI-generated content to enable reliable detection. While watermarking is not a silver bullet for addressing all risks associated with GenAI, it can play a crucial role in enhancing AI safety and trustworthiness by combating misinformation and deception. This paper presents a comprehensive overview of watermarking techniques for GenAI, beginning with the need for watermarking from historical and regulatory perspectives. We formalize the definitions and desired properties of watermarking schemes and examine the key objectives and threat models for existing approaches. Practical evaluation strategies are also explored, providing insights into the development of robust watermarking techniques capable of resisting various attacks. Additionally, we review recent representative works, highlight open challenges, and discuss potential directions for this emerging field. By offering a thorough understanding of watermarking in GenAI, this work aims to guide researchers in advancing watermarking methods and applications, and support policymakers in addressing the broader implications of GenAI.

[AI-4] Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification

Link: https://arxiv.org/abs/2411.18456
Authors: José Fernando Núñez,Jamie Arjona,Javier Béjar
Keywords: Deep learning models, Deep learning, data, sufficient amount, find the hidden
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning models need a sufficient amount of data in order to be able to find the hidden patterns in it. It is the purpose of generative modeling to learn the data distribution, thus allowing us to sample more data and augment the original dataset. In the context of physiological data, and more specifically electrocardiogram (ECG) data, given its sensitive nature and expensive data collection, we can exploit the benefits of generative models in order to enlarge existing datasets and improve downstream tasks, in our case, classification of heart rhythm. In this work, we explore the usefulness of synthetic data generated with different generative models from Deep Learning namely Diffweave, Time-Diffusion and Time-VQVAE in order to obtain better classification results for two open source multivariate ECG datasets. Moreover, we also investigate the effects of transfer learning, by fine-tuning a synthetically pre-trained model and then progressively adding increasing proportions of real data. We conclude that although the synthetic samples resemble the real ones, the classification improvement when simply augmenting the real dataset is barely noticeable on individual datasets, but when both datasets are merged the results show an increase across all metrics for the classifiers when using synthetic samples as augmented data. From the fine-tuning results the Time-VQVAE generative model has shown to be superior to the others but not powerful enough to achieve results close to a classifier trained with real data only. In addition, methods and metrics for measuring closeness between synthetic data and the real one have been explored as a side effect of the main research questions of this study.

[AI-5] Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation NEURIPS2024

Link: https://arxiv.org/abs/2411.18447
Authors: Marco Pasini,Javier Nistal,Stefan Lattner,George Fazekas
Keywords: discrete tokens, typically applied, recent research, Continuous Autoregressive Models, continuous embeddings
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to NeurIPS 2024 - Audio Imagination Workshop

Abstract:Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.
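
The training-time idea in this abstract, injecting random noise into the input embeddings so the model learns to tolerate its own prediction errors, can be sketched as below. This is a rough illustration under assumptions, not the authors' procedure: the uniform sampling of the noise level, the Gaussian noise, the next-step prediction setup, and the name `noisy_training_pair` are all hypothetical.

```python
import random

def noisy_training_pair(embeddings, max_noise_std, rng):
    """Build (noisy inputs, clean targets, noise level) for next-step prediction.

    A noise level is drawn per sequence so that the model sees a range of
    corruption strengths during training (an assumed scheme).
    """
    noise_std = rng.uniform(0.0, max_noise_std)
    inputs = embeddings[:-1]    # steps 0 .. T-2 are conditioning inputs
    targets = embeddings[1:]    # steps 1 .. T-1 are the clean targets
    noisy_inputs = [[x + rng.gauss(0.0, noise_std) for x in vec] for vec in inputs]
    return noisy_inputs, targets, noise_std

rng = random.Random(0)
# Made-up sequence of four 2-dimensional continuous embeddings.
sequence = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [1.5, -0.5]]
noisy_inputs, targets, noise_std = noisy_training_pair(sequence, 0.3, rng)

for noisy, target in zip(noisy_inputs, targets):
    print(noisy, "->", target)
```

Only the inputs are corrupted while the targets stay clean, which is the part that makes the model robust when, at inference time, its inputs are its own imperfect outputs.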

[AI-6] Metric-DST: Mitigating Selection Bias Through Diversity-Guided Semi-Supervised Metric Learning

Link: https://arxiv.org/abs/2411.18442
Authors: Yasin I. Tepeli,Mathijs de Wolf,Joana P. Goncalves
Keywords: exhibit undesirable behavior, Selection bias poses, Selection bias, underrepresented profiles, poses a critical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages main manuscript (4 main figures), 7 pages of supplementary

Abstract:Selection bias poses a critical challenge for fairness in machine learning, as models trained on data that is less representative of the population might exhibit undesirable behavior for underrepresented profiles. Semi-supervised learning strategies like self-training can mitigate selection bias by incorporating unlabeled data into model training to gain further insight into the distribution of the population. However, conventional self-training seeks to include high-confidence data samples, which may reinforce existing model bias and compromise effectiveness. We propose Metric-DST, a diversity-guided self-training strategy that leverages metric learning and its implicit embedding space to counter confidence-based bias through the inclusion of more diverse samples. Metric-DST learned more robust models in the presence of selection bias for generated and real-world datasets with induced bias, as well as a molecular biology prediction task with intrinsic bias. The Metric-DST learning strategy offers a flexible and widely applicable solution to mitigate selection bias and enhance fairness of machine learning models.

[AI-7] MM-Path: Multi-modal Multi-granularity Path Representation Learning – Extended Version

Link: https://arxiv.org/abs/2411.18428
Authors: Ronghui Xu,Hanyin Cheng,Chenjuan Guo,Hongfan Gao,Jilin Hu,Sean Bin Yang,Bin Yang
Keywords: Developing effective path, path representation learning, Developing effective, path representation, Representation Learning Framework
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Developing effective path representations has become increasingly essential across various fields within intelligent transportation. Although pre-trained path representation learning models have shown improved performance, they predominantly focus on the topological structures from single modality data, i.e., road networks, overlooking the geometric and contextual features associated with path-related images, e.g., remote sensing images. Similar to human understanding, integrating information from multiple modalities can provide a more comprehensive view, enhancing both representation accuracy and generalization. However, variations in information granularity impede the semantic alignment of road network-based paths (road paths) and image-based paths (image paths), while the heterogeneity of multi-modal data poses substantial challenges for effective fusion and utilization. In this paper, we propose a novel Multi-modal, Multi-granularity Path Representation Learning Framework (MM-Path), which can learn a generic path representation by integrating modalities from both road paths and image paths. To enhance the alignment of multi-modal data, we develop a multi-granularity alignment strategy that systematically associates nodes, road sub-paths, and road paths with their corresponding image patches, ensuring the synchronization of both detailed local information and broader global contexts. To address the heterogeneity of multi-modal data effectively, we introduce a graph-based cross-modal residual fusion component designed to comprehensively fuse information across different modalities and granularities. Finally, we conduct extensive experiments on two large-scale real-world datasets under two downstream tasks, validating the effectiveness of the proposed MM-Path. This is an extended version of the paper accepted by KDD 2025.

[AI-8] Optimal In-Network Distribution of Learning Functions for a Secure-by-Design Programmable Data Plane of Next-Generation Networks

Link: https://arxiv.org/abs/2411.18384
Authors: Mattia Giovanni Spina,Edoardo Scalzo,Floriano De Rango,Francesca Guerriero,Antonio Iera
Keywords: advanced computing tasks, performing advanced computing, network interface cards, computing tasks, advanced computing
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:The rise of programmable data plane (PDP) and in-network computing (INC) paradigms paves the way for the development of network devices (switches, network interface cards, etc.) capable of performing advanced computing tasks. This makes it possible to execute algorithms of various kinds, including machine learning ones, within the network itself to support user and network services. In particular, this paper delves into the issue of implementing in-network learning models to support distributed intrusion detection systems (IDS). It proposes a model that optimally distributes the IDS workload, resulting from the subdivision of a “Strong Learner” (SL) model into lighter distributed “Weak Learner” (WL) models, among data plane devices; the objective is to ensure complete network security without excessively burdening their normal operations. Furthermore, a meta-heuristic approach is proposed to reduce the long computational time required by the exact solution provided by the mathematical model, and its performance is evaluated. The analysis conducted and the results obtained demonstrate the enormous potential of the proposed new approach to the creation of intelligent data planes that effectively act as a first line of defense against cyber attacks, with minimal additional workload on network devices.

[AI-9] FreqX: What neural networks learn is what network designers say

Link: https://arxiv.org/abs/2411.18343
Authors: Zechen Liu
Keywords: Personalized Federal learning, Personalized Federal, Federal learning, private dataset, personalized model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures

Abstract:Personalized Federated Learning (PFL) allows clients to cooperatively train a personalized model without disclosing their private datasets. However, PFL suffers from non-IID data, heterogeneous devices, lack of fairness, and unclear contribution, challenges that urgently call for interpretability of deep learning models. These challenges place new demands on interpretability: low cost, privacy, and detailed information. No current interpretability method satisfies all of them. In this paper, we propose a novel interpretability method, FreqX, by introducing Signal Processing and Information Theory. Our experiments show that the explanation results of FreqX contain both attribution information and concept information. FreqX runs at least 10 times faster than the baselines that contain concept information.

[AI-10] RITA: Automatic Framework for Designing of Resilient IoT Applications

Link: https://arxiv.org/abs/2411.18324
Authors: Luis Eduardo Pessoa, Cristovao Freitas Iglesias Jr, Claudio Miceli
Keywords: IoT Critical Objects, mitigation strategy selection, Critical Objects, Designing resilient Internet, designing resilient IoT
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Designing resilient Internet of Things (IoT) systems requires i) identification of IoT Critical Objects (ICOs) such as services, devices, and resources, ii) threat analysis, and iii) mitigation strategy selection. However, the traditional process for designing resilient IoT systems is still manual, leading to inefficiencies and increased risks. In addition, while tools such as ChatGPT could support this manual and highly error-prone process, their use raises concerns over data privacy, inconsistent outputs, and internet dependence. Therefore, we propose RITA, an automated, open-source framework that uses a fine-tuned RoBERTa-based Named Entity Recognition (NER) model to identify ICOs from IoT requirement documents, correlate threats, and recommend countermeasures. RITA operates entirely offline and can be deployed on-site, safeguarding sensitive information and delivering consistent outputs that enhance standardization. In our empirical evaluation, RITA outperformed ChatGPT in four of seven ICO categories, particularly in actuator, sensor, network resource, and service identification, using both human-annotated and ChatGPT-generated test data. These findings indicate that RITA can improve resilient IoT design by effectively supporting key security operations, offering a practical solution for developing robust IoT architectures.

[AI-11] Application of Soft Actor-Critic Algorithms in Optimizing Wastewater Treatment with Time Delays Integration

Link: https://arxiv.org/abs/2411.18305
Authors: Esmaeel Mohammadi, Daniel Ortiz-Arroyo, Aviaja Anna Hansen, Mikkel Stokholm-Bjerregaard, Sebastien Gros, Akhil S Anand, Petar Durdevic
Keywords: Wastewater treatment plants, plants face unique, Wastewater treatment, face unique challenges, slow time constants
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Wastewater treatment plants face unique challenges for process control due to their complex dynamics, slow time constants, and stochastic delays in observations and actions. These characteristics make conventional control methods, such as Proportional-Integral-Derivative controllers, suboptimal for achieving efficient phosphorus removal, a critical component of wastewater treatment to ensure environmental sustainability. This study addresses these challenges using a novel deep reinforcement learning approach based on the Soft Actor-Critic algorithm, integrated with a custom simulator designed to model the delayed feedback inherent in wastewater treatment plants. The simulator incorporates Long Short-Term Memory networks for accurate multi-step state predictions, enabling realistic training scenarios. To account for the stochastic nature of delays, agents were trained under three delay scenarios: no delay, constant delay, and random delay. The results demonstrate that incorporating random delays into the reinforcement learning framework significantly improves phosphorus removal efficiency while reducing operational costs. Specifically, the delay-aware agent achieved 36% reduction in phosphorus emissions, 55% higher reward, 77% lower target deviation from the regulatory limit, and 9% lower total costs than traditional control methods in the simulated environment. These findings underscore the potential of reinforcement learning to overcome the limitations of conventional control strategies in wastewater treatment, providing an adaptive and cost-effective solution for phosphorus removal.
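
The three training scenarios named in the abstract (no delay, constant delay, random delay) can be illustrated with a small observation buffer. This is only a sketch under stated assumptions, not the authors' simulator: the class name `DelayedObservations`, its `mode` and `max_delay` parameters, and the integer "observations" are all hypothetical, and the LSTM-based state prediction described in the paper is not modelled here.

```python
import random
from collections import deque


class DelayedObservations:
    """Feeds an agent stale observations under one of three delay scenarios."""

    def __init__(self, mode="none", max_delay=2, seed=0):
        if mode not in ("none", "constant", "random"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        # Keep just enough history to serve the largest possible delay.
        self.history = deque(maxlen=max_delay + 1)

    def step(self, true_obs):
        """Record the true observation and return the one the agent gets to see."""
        self.history.append(true_obs)
        if self.mode == "none":
            delay = 0
        elif self.mode == "constant":
            delay = self.max_delay
        else:  # "random": a fresh delay is drawn at every step
            delay = self.rng.randint(0, self.max_delay)
        # Early in an episode there may be fewer past observations than the delay.
        delay = min(delay, len(self.history) - 1)
        return self.history[-1 - delay]


# Hypothetical usage: the time step itself stands in for the plant state.
for mode in ("none", "constant", "random"):
    channel = DelayedObservations(mode=mode, max_delay=2)
    print(mode, [channel.step(t) for t in range(6)])
```

In a reinforcement learning setup, a wrapper of this kind would sit between the simulator and the agent, so that the policy is trained on the same stale information it would receive in operation.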

[AI-12] DualCast: Disentangling Aperiodic Events from Traffic Series with a Dual-Branch Model

Link: https://arxiv.org/abs/2411.18286
Authors: Xinyu Su, Feng Liu, Yanchuan Chang, Egemen Tanin, Majid Sarvi, Jianzhong Qi
Keywords: transportation systems, important problem, Traffic forecasting, training data, forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Traffic forecasting is an important problem in the operation and optimisation of transportation systems. State-of-the-art solutions train machine learning models by minimising the mean forecasting errors on the training data. The trained models often favour periodic events instead of aperiodic ones in their prediction results, as periodic events often prevail in the training data. While offering critical optimisation opportunities, aperiodic events such as traffic incidents may be missed by the existing models. To address this issue, we propose DualCast – a model framework to enhance the learning capability of traffic forecasting models, especially for aperiodic events. DualCast takes a dual-branch architecture, to disentangle traffic signals into two types, one reflecting intrinsic spatial-temporal patterns and the other reflecting external environment contexts including aperiodic events. We further propose a cross-time attention mechanism, to capture high-order spatial-temporal relationships from both periodic and aperiodic patterns. DualCast is versatile. We integrate it with recent traffic forecasting models, consistently reducing their forecasting errors by up to 9.6% on multiple real datasets.

[AI-13] GAPartManip: A Large-scale Part-centric Dataset for Material-Agnostic Articulated Object Manipulation

Link: https://arxiv.org/abs/2411.18276
Authors: Wenbo Cui, Chengyang Zhao, Songlin Wei, Jiazhao Zhang, Haoran Geng, Yaran Chen, He Wang
Keywords: Effectively manipulating articulated, embodied artificial intelligence, achieving general embodied, general embodied artificial, Effectively manipulating
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often face challenges due to imperfect depth perception, such as with transparent lids and reflective handles. Moreover, they generally lack the diversity in part-based interactions required for flexible and adaptable manipulation. To address these challenges, we introduced a large-scale part-centric dataset for articulated object manipulation that features both photo-realistic material randomizations and detailed annotations of part-oriented, scene-level actionable interaction poses. We evaluated the effectiveness of our dataset by integrating it with several state-of-the-art methods for depth estimation and interaction pose prediction. Additionally, we proposed a novel modular framework that delivers superior and robust performance for generalizable articulated object manipulation. Our extensive experiments demonstrate that our dataset significantly improves the performance of depth perception and actionable interaction pose prediction in both simulation and real-world scenarios.

[AI-14] Multimodal Integration of Longitudinal Noninvasive Diagnostics for Survival Prediction in Immunotherapy Using Deep Learning

Link: https://arxiv.org/abs/2411.18253
Authors: Melda Yeghaian, Zuhir Bodalal, Daan van den Broek, John B A G Haanen, Regina G H Beets-Tan, Stefano Trebeschi, Marcel A J van Gerven
Keywords: potentially transform immunotherapy, precision medicine, artificial intelligence, intelligence could potentially, potentially transform
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)

Abstract:Purpose: Analyzing noninvasive longitudinal and multimodal data using artificial intelligence could potentially transform immunotherapy for cancer patients, paving the way towards precision medicine. Methods: In this study, we integrated pre- and on-treatment blood measurements, prescribed medications and CT-based volumes of organs from a large pan-cancer cohort of 694 patients treated with immunotherapy to predict short- and long-term overall survival. By leveraging a combination of recent developments, different variants of our extended multimodal transformer-based simple temporal attention (MMTSimTA) network were trained end-to-end to predict mortality at three, six, nine and twelve months. These models were also compared to baseline methods incorporating intermediate and late fusion based integration methods. Results: The strongest prognostic performance was demonstrated using the extended transformer-based multimodal model, with areas under the curve (AUCs) of 0.84 ± 0.04, 0.83 ± 0.02, 0.82 ± 0.02, and 0.81 ± 0.03 for 3-, 6-, 9-, and 12-month survival prediction, respectively. Conclusion: Our findings suggest that analyzing integrated early treatment data has potential for predicting survival of immunotherapy patients. Integrating complementary noninvasive modalities into a jointly trained model, using our extended transformer-based architecture, demonstrated an improved multimodal prognostic performance, especially in short-term survival prediction.

[AI-15] IKUN: Initialization to Keep snn training and generalization great with sUrrogate-stable variaNce

Link: https://arxiv.org/abs/2411.18250
Authors: Da Chang, Deliang Wang, Xiao Yang
Keywords: Weight initialization significantly, neural networks, artificial neural networks, initialization significantly impacts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Weight initialization significantly impacts the convergence and performance of neural networks. While traditional methods like Xavier and Kaiming initialization are widely used, they often fall short for spiking neural networks (SNNs), which have distinct requirements compared to artificial neural networks (ANNs). To address this, we introduce IKUN, a variance-stabilizing initialization method integrated with surrogate gradient functions, specifically designed for SNNs. IKUN stabilizes signal propagation, accelerates convergence, and enhances generalization. Experiments show IKUN improves training efficiency by up to 50%, achieving 95% training accuracy and 91% generalization accuracy. Hessian analysis reveals that IKUN-trained models converge to flatter minima, characterized by Hessian eigenvalues near zero on the positive side, promoting better generalization. The method is open-sourced for further exploration: this https URL.

[AI-16] Exploration of LLM Multi-Agent Application Implementation Based on LangGraph+CrewAI

Link: https://arxiv.org/abs/2411.18241
Authors: Zhihua Duan, Jialin Wang
Keywords: profoundly changing people, changing people work, increasingly widespread, profoundly changing, work and lifestyles
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)

Abstract:With the rapid development of large model technology, the application of agent technology in various fields is becoming increasingly widespread, profoundly changing people’s work and lifestyles. In complex and dynamic systems, multi-agents achieve complex tasks that are difficult for a single agent to complete through division of labor and collaboration among agents. This paper discusses the integrated application of LangGraph and CrewAI. LangGraph improves the efficiency of information transmission through graph architecture, while CrewAI enhances team collaboration capabilities and system performance through intelligent task allocation and resource management. The main research contents of this paper are: (1) designing the architecture of agents based on LangGraph for precise control; (2) enhancing the capabilities of agents based on CrewAI to complete a variety of tasks. This study aims to delve into the application of LangGraph and CrewAI in multi-agent systems, providing new perspectives for the future development of agent technology, and promoting technological progress and application innovation in the field of large model intelligent agents.

[AI-17] Certified Training with Branch-and-Bound: A Case Study on Lyapunov-stable Neural Control

Link: https://arxiv.org/abs/2411.18235
Authors: Zhouxing Shi, Cho-Jui Hsieh, Huan Zhang
Keywords: learning Lyapunov-stable neural, Lyapunov-stable neural controllers, Lyapunov asymptotic stability, asymptotic stability condition, learning Lyapunov-stable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Preprint

Abstract:We study the problem of learning Lyapunov-stable neural controllers which provably satisfy the Lyapunov asymptotic stability condition within a region-of-attraction. Compared to previous works which commonly used counterexample guided training on this task, we develop a new and generally formulated certified training framework named CT-BaB, and we optimize for differentiable verified bounds, to produce verification-friendly models. In order to handle the relatively large region-of-interest, we propose a novel framework of training-time branch-and-bound to dynamically maintain a training dataset of subregions throughout training, such that the hardest subregions are iteratively split into smaller ones whose verified bounds can be computed more tightly to ease the training. We demonstrate that our new training framework can produce models which can be more efficiently verified at test time. On the largest 2D quadrotor dynamical system, verification for our model is more than 5X faster compared to the baseline, while our size of region-of-attraction is 16X larger than the baseline.
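
The idea of repeatedly splitting the hardest subregion so that its verified bound tightens can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not CT-BaB itself: the function g(x) = x*x - x, the interval-arithmetic bound, and the names `interval_upper_bound` and `branch_and_bound` are hypothetical, and no model is trained here.

```python
import heapq


def interval_upper_bound(lo, hi):
    """Sound but loose upper bound of g(x) = x*x - x on [lo, hi].

    Each term is bounded independently (interval arithmetic), so the bound
    is loose on wide boxes and tightens as the box shrinks.
    """
    square_upper = max(lo * lo, hi * hi)  # upper bound of x*x on [lo, hi]
    return square_upper - lo              # upper bound of -x on [lo, hi] is -lo


def branch_and_bound(lo, hi, n_splits):
    """Maintain a pool of subregions, always bisecting the one with the worst bound."""
    # heapq is a min-heap, so bounds are stored negated to pop the largest first.
    heap = [(-interval_upper_bound(lo, hi), lo, hi)]
    for _ in range(n_splits):
        _, a, b = heapq.heappop(heap)  # hardest subregion so far
        mid = 0.5 * (a + b)
        for c, d in ((a, mid), (mid, b)):
            heapq.heappush(heap, (-interval_upper_bound(c, d), c, d))
    return sorted((a, b, -neg) for neg, a, b in heap)


# Hypothetical usage: the worst bound over [0, 1] shrinks as boxes are split.
boxes = branch_and_bound(0.0, 1.0, n_splits=50)
print(len(boxes), max(bound for _, _, bound in boxes))
```

In the training-time setting described in the abstract, the pool of boxes would play the role of the training dataset, and the bound would come from a differentiable verifier rather than a hand-written formula.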

[AI-18] Randomized-Grid Search for Hyperparameter Tuning in Decision Tree Model to Improve Performance of Cardiovascular Disease Classification

Link: https://arxiv.org/abs/2411.18234
Authors: Abhay Kumar Pathak, Mrityunjay Chaubey, Manjari Gupta
Keywords: Cardiovascular disease refers, Cardiovascular disease, Search, critical condition, condition that impacts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Cardiovascular disease refers to any critical condition that impacts the heart. Because heart diseases can be life-threatening, researchers are focusing on designing smart systems to accurately diagnose them based on electronic health data, with the aid of machine learning algorithms. Heart disease classification using machine learning (ML) algorithms such as Support Vector Machine (SVM), Naïve Bayes (NB), Decision Trees (DTs) and Random Forests (RFs) is often hindered by overfitting. These ML algorithms need extensive hyperparameter tuning. Random Search offers a faster and more efficient exploration of the hyperparameter space, but it may overlook optimal regions. Grid Search, though exhaustive, is computationally expensive and inefficient, particularly with high-dimensional data. To address these limitations, Randomized-Grid Search, a novel hybrid optimization method, is proposed that combines the global exploration strengths of Random Search with the focused and exhaustive search of Grid Search in the most promising regions. This hybrid approach efficiently balances exploration and exploitation. The proposed model optimizes the hyperparameters of a Decision Tree model. The proposed model is applied to the UCI heart disease dataset for classification. It enhances model performance and provides improved accuracy, generalization, and computational efficiency. Experimental results demonstrate that Randomized-Grid Search outperforms traditional methods by significant margins. The proposed model provides a more effective solution for machine learning applications in healthcare diagnosis.
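
The hybrid of random exploration followed by a local exhaustive grid can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective `score` is a hypothetical stand-in for a cross-validated Decision Tree score, and the function name `randomized_grid_search`, the parameter ranges, and the window `radius` are invented for the example.

```python
import itertools
import random


def score(max_depth, min_samples_split):
    """Hypothetical stand-in for a cross-validated model score (higher is better)."""
    return -((max_depth - 7) ** 2) - 0.5 * (min_samples_split - 12) ** 2


def randomized_grid_search(objective, bounds, n_random=20, radius=2, seed=0):
    """Random search over integer ranges, then an exhaustive grid around the best sample."""
    rng = random.Random(seed)
    names = list(bounds)

    # Stage 1: global exploration with uniformly random samples.
    best_params, best_score = None, float("-inf")
    for _ in range(n_random):
        params = {name: rng.randint(*bounds[name]) for name in names}
        s = objective(**params)
        if s > best_score:
            best_params, best_score = params, s

    # Stage 2: exhaustive grid in a small window around the best random sample,
    # clipped to the allowed range of each hyperparameter.
    axes = []
    for name in names:
        lo, hi = bounds[name]
        center = best_params[name]
        axes.append(range(max(lo, center - radius), min(hi, center + radius) + 1))
    for combo in itertools.product(*axes):
        params = dict(zip(names, combo))
        s = objective(**params)
        if s > best_score:
            best_params, best_score = params, s

    return best_params, best_score


# Hypothetical usage with two Decision Tree style hyperparameters.
bounds = {"max_depth": (1, 20), "min_samples_split": (2, 30)}
best_params, best_score = randomized_grid_search(score, bounds)
print(best_params, best_score)
```

The grid stage costs at most (2 * radius + 1) evaluations per hyperparameter dimension, multiplied together, so the window is kept small; the random stage carries the burden of finding a promising region.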

[AI-19] Dependency-Aware CAV Task Scheduling via Diffusion-Based Reinforcement Learning

Link: https://arxiv.org/abs/2411.18230
Authors: Xiang Cheng, Zhi Mao, Ying Wang, Wen Wu
Keywords: connected autonomous vehicles, dynamic unmanned aerial, unmanned aerial vehicle-assisted, aerial vehicle-assisted connected, vehicle-assisted connected autonomous
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 6 pages, 5 figures

Abstract:In this paper, we propose a novel dependency-aware task scheduling strategy for dynamic unmanned aerial vehicle-assisted connected autonomous vehicles (CAVs). Specifically, different computation tasks of CAVs, each consisting of multiple dependent subtasks, are judiciously assigned to nearby CAVs or the base station so that tasks are completed promptly. Therefore, we formulate a joint scheduling priority and subtask assignment optimization problem with the objective of minimizing the average task completion time. The problem aims at improving the long-term system performance and is reformulated as a Markov decision process. To solve the problem, we further propose a diffusion-based reinforcement learning algorithm, named Synthetic DDQN based Subtasks Scheduling, which can make adaptive task scheduling decisions in real time. A diffusion model-based synthetic experience replay is integrated into the reinforcement learning framework, which can generate sufficient synthetic data in the experience replay buffer, thereby significantly accelerating convergence and improving sample efficiency. Simulation results demonstrate the effectiveness of the proposed algorithm in reducing task completion time compared to benchmark schemes.

[AI-20] Feature-Factory: Automating Software Feature Integration Using Generative AI

Link: https://arxiv.org/abs/2411.18226
Authors: Ruslan Idelfonso Magana Vsevolodovna
Keywords: time-consuming process, complex and time-consuming, existing software projects, Integrating new features, Integrating
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 14 pages, 1 figure

Abstract:Integrating new features into existing software projects can be a complex and time-consuming process. Feature-Factory leverages Generative AI with this http URL to automate the analysis, planning, and implementation of feature requests. By combining advanced project parsing, dependency resolution, and AI-generated code, the program ensures seamless integration of features into software systems while maintaining structural integrity. This paper presents the methodology, mathematical model, and results of the Feature-Factory framework.

[AI-21] SCoTT: Wireless-Aware Path Planning with Vision Language Models and Strategic Chains-of-Thought

Link: https://arxiv.org/abs/2411.18212
Authors: Aladin Djuhera, Vlad C. Andrei, Amin Seffo, Holger Boche, Walid Saad
Keywords: practical applications, Path, Path planning, complex, complex problem
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:Path planning is a complex problem for many practical applications, particularly in robotics. Existing algorithms, however, are exhaustive in nature and become increasingly complex when additional side constraints are incorporated alongside distance minimization. In this paper, a novel approach using vision language models (VLMs) is proposed for enabling path planning in complex wireless-aware environments. To this end, insights from a digital twin (DT) with real-world wireless ray tracing data are explored in order to guarantee an average path gain threshold while minimizing the trajectory length. First, traditional approaches such as A* are compared to several wireless-aware extensions, and an optimal iterative dynamic programming approach (DP-WA*) is derived, which fully takes into account all path gains and distance metrics within the DT. On the basis of these baselines, the role of VLMs as an alternative assistant for path planning is investigated, and a strategic chain-of-thought tasking (SCoTT) approach is proposed. SCoTT divides the complex planning task into several subproblems and solves each with advanced CoT prompting. Results show that SCoTT achieves very close average path gains compared to DP-WA* while at the same time yielding consistently shorter path lengths. The results also show that VLMs can be used to accelerate DP-WA* by efficiently reducing the algorithm’s search space and thus saving up to 62% in execution time. This work underscores the potential of VLMs in future digital systems as capable assistants for solving complex tasks, while enhancing user interaction and accelerating rapid prototyping under diverse wireless constraints.

[AI-22] Learning for Long-Horizon Planning via Neuro-Symbolic Abductive Imitation KDD2025

Link: https://arxiv.org/abs/2411.18201
Authors: Jie-Jing Shao, Hao-Ran Hao, Xiao-Wen Yang, Yu-Feng Li
Keywords: shown promising results, methods have shown, long-horizon tasks, Recent, symbolic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD2025. The KDD version is titled "Abductive Learning for Neuro-Symbolic Grounded Imitation"

Abstract:Recent learning-to-imitation methods have shown promising results in planning via imitating within the observation-action space. However, their ability in open environments remains constrained, particularly in long-horizon tasks. In contrast, traditional symbolic planning excels in long-horizon tasks through logical reasoning over human-defined symbolic spaces but struggles to handle observations beyond symbolic states, such as high-dimensional visual inputs encountered in real-world scenarios. In this work, we draw inspiration from abductive learning and introduce a novel framework ABductive Imitation Learning (ABIL) that integrates the benefits of data-driven learning and symbolic-based reasoning, enabling long-horizon planning. Specifically, we employ abductive reasoning to understand the demonstrations in symbolic space and design the principles of sequential consistency to resolve the conflicts between perception and reasoning. ABIL generates predicate candidates to facilitate the perception from raw observations to symbolic space without laborious predicate annotations, providing a groundwork for symbolic planning. With the symbolic understanding, we further develop a policy ensemble whose base policies are built with different logical objectives and managed through symbolic reasoning. Experiments show that our proposal successfully understands the observations with the task-relevant symbolics to assist the imitation learning. Importantly, ABIL demonstrates significantly improved data efficiency and generalization across various long-horizon tasks, highlighting it as a promising solution for long-horizon planning. Project website: this https URL.

[AI-23] Prediction with Action: Visual Policy Learning via Joint Denoising Process NEURIPS2024

Link: https://arxiv.org/abs/2411.18179
Authors: Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen
Keywords: including image editing, demonstrated remarkable capabilities, representing a good, Diffusion, Diffusion models
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024

Abstract:Diffusion models have demonstrated remarkable capabilities in image generation tasks, including image editing and video creation, reflecting a good understanding of the physical world. In a separate line of work, diffusion models have also shown promise in robotic control tasks by denoising actions, known as diffusion policy. Although the diffusion generative model and diffusion policy exhibit distinct capabilities (image prediction and robotic action, respectively), they technically follow a similar denoising process. In robotic tasks, the ability to predict future images and generate actions is highly correlated since they share the same underlying dynamics of the physical world. Building on this insight, we introduce PAD, a novel visual policy learning framework that unifies image Prediction and robot Action within a joint Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on both robotic demonstrations and large-scale video datasets and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark, by utilizing a single text-conditioned visual policy within a data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world robot manipulation settings, with a 28.0% increase in success rate compared to the strongest baseline. Project page at this https URL

[AI-24] Abductive Symbolic Solver on Abstraction and Reasoning Corpus IJCAI2024

Link: https://arxiv.org/abs/2411.18158
Authors: Mintaek Lim, Seokki Lee, Liyew Woletemaryam Abitew, Sundong Kim
Keywords: enhancing artificial intelligence, intelligence reasoning capabilities, artificial intelligence reasoning, Reasoning Corpus, focusing on logicality
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at IJCAI 2024 LNSAI Workshop

Abstract:This paper addresses the challenge of enhancing artificial intelligence reasoning capabilities, focusing on logicality within the Abstraction and Reasoning Corpus (ARC). Humans solve such visual reasoning tasks based on their observations and hypotheses, and they can explain their solutions with a proper reason. However, many previous approaches focused only on the grid transition and it is not enough for AI to provide reasonable and human-like solutions. By considering the human process of solving visual reasoning tasks, we have concluded that the thinking process is likely the abductive reasoning process. Thus, we propose a novel framework that symbolically represents the observed data into a knowledge graph and extracts core knowledge that can be used for solution generation. This information limits the solution search space and helps provide a reasonable mid-process. Our approach holds promise for improving AI performance on ARC tasks by effectively narrowing the solution space and providing logical solutions grounded in core knowledge extraction.

[AI-25] Derivation of Closed Form of Expected Improvement for Gaussian Process Trained on Log-Transformed Objective

Link: https://arxiv.org/abs/2411.18095
Authors: Shuhei Watanabe
Keywords: Expected Improvement, Bayesian optimization, widely used acquisition, Expected, Improvement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Expected Improvement (EI) is arguably the most widely used acquisition function in Bayesian optimization. However, it is often challenging to enhance the performance with EI due to its sensitivity to numerical precision. Previously, Hutter et al. (2009) tackled this problem by using Gaussian process trained on the log-transformed objective function and it was reported that this trick improves the predictive accuracy of GP, leading to substantially better performance. Although Hutter et al. (2009) offered the closed form of their EI, its intermediate derivation has not been provided so far. In this paper, we give a friendly derivation of their proposition.
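
For readers who want to see what such a closed form looks like, the sketch below evaluates EI for minimisation under the assumption that the log of the objective is Gaussian, log f ~ N(mu, sigma^2), with incumbent value f_best > 0. Under that assumption, E[max(f_best - f, 0)] = f_best * Phi(v) - exp(mu + sigma^2 / 2) * Phi(v - sigma), where v = (ln f_best - mu) / sigma and Phi is the standard normal CDF; this follows from the partial expectation of a log-normal variable. The function names `log_ei` and `log_ei_numeric` are hypothetical, and the formula is stated here as an assumption about the setting rather than quoted from the paper.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def log_ei(mu, sigma, f_best):
    """Closed-form EI for minimisation when log f ~ N(mu, sigma^2) and f_best > 0."""
    v = (math.log(f_best) - mu) / sigma
    return f_best * norm_cdf(v) - math.exp(mu + 0.5 * sigma**2) * norm_cdf(v - sigma)


def log_ei_numeric(mu, sigma, f_best, n=100_000):
    """Midpoint-rule check: integrate max(f_best - e^y, 0) against N(y; mu, sigma^2)."""
    lo, hi = mu - 10.0 * sigma, math.log(f_best)
    if hi <= lo:
        return 0.0
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        pdf = math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += (f_best - math.exp(y)) * pdf
    return total * h


# Hypothetical posterior at one candidate point: the two values should agree closely.
print(log_ei(0.0, 1.0, 1.5), log_ei_numeric(0.0, 1.0, 1.5))
```

The numerical integral is included only as a sanity check on the closed form; in a Bayesian optimization loop only `log_ei` would be evaluated.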

[AI-26] MONOPOLY: Learning to Price Public Facilities for Revaluing Private Properties with Large-Scale Urban Data CIKM’19

Link: https://arxiv.org/abs/2411.18085
Authors: Miao Fan, Jizhou Huang, An Zhuo, Ying Li, Ping Li, Haifeng Wang
Keywords: public facilities, private properties, attractive but challenging, challenging task, widely concerned
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: CIKM’19

Abstract:The value assessment of private properties is an attractive but challenging task that is of wide concern to a majority of people around the world. A perennial question among us is "how much is my house worth?". To answer this question, most experienced agencies would like to price a property given the factors of its attributes as well as the demographics and the public facilities around it. However, no one knows the exact prices of these factors, especially the values of public facilities which may help assess private properties. In this paper, we introduce our newly launched project "Monopoly" (named after a classic board game), in which we propose a distributed approach for revaluing private properties by learning to price public facilities (such as hospitals) with the large-scale urban data we have accumulated via Baidu Maps. To be specific, our method organizes many points of interest (POIs) into an undirected weighted graph and formulates multiple factors, including the virtual prices of surrounding public facilities, as adaptive variables to estimate in parallel the housing prices we know. Then the prices of both public facilities and private properties can be iteratively updated according to the loss of prediction until convergence. We have conducted extensive experiments with the large-scale urban data of several metropolises in China. Results show that our approach outperforms several mainstream methods by significant margins. Further insights from more in-depth discussions demonstrate that "Monopoly" is an innovative application in the interdisciplinary field of business intelligence and urban computing, and it will be beneficial to tens of millions of our users for investments and to governments for urban planning as well as taxation.

[AI-27] From Exploration to Revelation: Detecting Dark Patterns in Mobile Apps

Link: https://arxiv.org/abs/2411.18084
Authors: Jieshan Chen, Zhen Wang, Jiamou Sun, Wenbo Zou, Zhenchang Xing, Qinghua Lu, Qing Huang, Xiwei Xu
Keywords: manipulate user behavior, Mobile apps, user behavior, nag users, manipulate user
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 4 figures

Abstract:Mobile apps are essential in daily life, yet they often employ dark patterns, such as visual tricks to highlight certain options or linguistic tactics to nag users into making purchases, to manipulate user behavior. Current research mainly uses manual methods to detect dark patterns, a process that is time-consuming and struggles to keep pace with continually updating and emerging apps. While some studies have targeted automated detection, they are constrained to static patterns and still necessitate manual app exploration. To bridge these gaps, we present AppRay, an innovative system that seamlessly blends task-oriented app exploration with automated dark pattern detection, reducing manual effort. Our approach consists of two steps: First, we harness the commonsense knowledge of large language models for targeted app exploration, supplemented by traditional random exploration to capture a broader range of UI states. Second, we developed a static and dynamic dark pattern detector powered by a contrastive learning-based multi-label classifier and a rule-based refiner to perform detection. We contributed two datasets, AppRay-Dark and AppRay-Light, with 2,185 unique deceptive patterns (including 149 dynamic instances) across 18 types from 876 UIs and 871 benign UIs. These datasets cover both static and dynamic dark patterns while preserving UI relationships. Experimental results confirm that AppRay can efficiently explore the app and identify a wide range of dark patterns with great performance.

[AI-28] DuMapper: Towards Automatic Verification of Large-Scale POIs with Street Views at Baidu Maps

Link: https://arxiv.org/abs/2411.18073
Authors: Miao Fan,Jizhou Huang,Haifeng Wang
Keywords: Web mapping services, Web mapping, POI verification, mobile devices, increased popularity
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:With the increased popularity of mobile devices, Web mapping services have become an indispensable tool in our daily lives. To provide satisfactory services, such as location searches, the point of interest (POI) database is the fundamental infrastructure, as it archives multimodal information on billions of geographic locations closely related to people’s lives, such as a shop or a bank. Therefore, verifying the correctness of a large-scale POI database is vital. To achieve this goal, many industrial companies adopt volunteered geographic information (VGI) platforms that enable thousands of crowdworkers and expert mappers to verify POIs seamlessly, but to do so they have to spend millions of dollars every year. To save these tremendous labor costs, we devised DuMapper, an automatic system for large-scale POI verification with the multimodal street-view data at Baidu Maps. DuMapper takes the signboard image and the coordinates of a real-world place as input to generate a low-dimensional vector, which approximate nearest neighbor (ANN) algorithms can use to search billions of archived POIs in the database more accurately and complete verification within milliseconds. It can increase the throughput of POI verification by 50 times. DuMapper has been deployed in production at Baidu Maps, where it dramatically improves the productivity and efficiency of POI verification. As of December 31, 2021, it has performed over 405 million iterations of POI verification within a 3.5-year period, representing an approximate workload of 800 high-performance expert mappers.

[AI-29] Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities

Link: https://arxiv.org/abs/2411.18071
Authors: Miguel Zabaleta,Joel Lehman
Keywords: qualitative hypothesis, horror writers, qualitative hypothesis requires, writers, worse childhoods
Categories: Artificial Intelligence (cs.AI)

Abstract:Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. number of adverse childhood events, in the context of the running example). The hope is to allow sifting through hypotheses more quickly through collaboration between human and machine. Our experiments highlight that indeed, LLMs can serve as useful estimators of tabular data about specific entities across a range of domains, and that such estimations improve with model scale. Further, initial experiments demonstrate the potential of LLMs to map a qualitative hypothesis of interest to relevant concrete variables that the LLM can then estimate. The conclusion is that LLMs offer intriguing potential to help illuminate scientifically interesting patterns latent within the internet-scale data they are trained upon.

[AI-30] RL for Mitigating Cascading Failures: Targeted Exploration via Sensitivity Factors

Link: https://arxiv.org/abs/2411.18050
Authors: Anmol Dwivedi,Ali Tajer,Santiago Paternain,Nurali Virani
Keywords: Electricity grid resiliency, Electricity grid, array of technical, technical and policy-related, policy-related decisions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:The electricity grid’s resiliency and climate change strongly impact one another due to an array of technical and policy-related decisions that affect both. This paper introduces a physics-informed machine learning-based framework to enhance the grid’s resiliency. Specifically, when encountering disruptive events, this paper designs remedial control actions to prevent blackouts. The proposed Physics-Guided Reinforcement Learning (PG-RL) framework determines effective real-time remedial line-switching actions, considering their impact on power balance, system security, and grid reliability. To identify an effective blackout mitigation policy, PG-RL leverages power-flow sensitivity factors to guide the RL exploration during agent training. Comprehensive evaluations using the Grid2Op platform demonstrate that incorporating physical signals into RL significantly improves resource utilization within electric grids and achieves better blackout mitigation policies, both of which are critical in addressing climate change.

[AI-31] Heterogeneous Relationships of Subjects and Shapelets for Semi-supervised Multivariate Series Classification ICDE

Link: https://arxiv.org/abs/2411.18043
Authors: Mingsen Du,Meng Chen,Yongjian Li,Cun Ji,Shoushui Wei
Keywords: Multivariate time series, complex time series, time series data, Multivariate time, extract key features
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE International Conference on Data Engineering (ICDE) 2025

Abstract:Multivariate time series (MTS) classification is widely applied in fields such as industry, healthcare, and finance, aiming to extract key features from complex time series data for accurate decision-making and prediction. However, existing methods for MTS often struggle due to the challenges of effectively modeling high-dimensional data and the lack of labeled data, resulting in poor classification performance. To address this issue, we propose a method based on heterogeneous relationships of subjects and shapelets for semi-supervised MTS classification. This method offers a novel perspective by integrating various types of additional information while capturing the relationships between them. Specifically, we first utilize a contrast temporal self-attention module to obtain sparse MTS representations, and then model the similarities between these representations using soft dynamic time warping to construct a similarity graph. Second, we learn the shapelets for different subject types, incorporating both the subject features and their shapelets as additional information to further refine the similarity graph, ultimately generating a heterogeneous graph. Finally, we use a dual-level graph attention network to obtain predictions. Through this method, we transform the dataset into a heterogeneous graph, integrating multiple kinds of additional information and achieving precise semi-supervised node classification. Experiments on the Human Activity Recognition, sleep stage classification and University of East Anglia datasets demonstrate that our method outperforms current state-of-the-art methods in MTS classification tasks.
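
The similarity-graph step described in the abstract can be illustrated with a minimal sketch of soft dynamic time warping. This is not the authors' implementation: the squared Euclidean cost, the smoothing parameter `gamma`, and the threshold-based edge rule in `similarity_edges` are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Series = Sequence[Sequence[float]]  # one inner vector per time step


def soft_min(values: Sequence[float], gamma: float) -> float:
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x: Series, y: Series, gamma: float = 1.0) -> float:
    """Soft-DTW discrepancy between two multivariate series (lower means more similar)."""
    n, m = len(x), len(y)
    inf = float("inf")
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: squared Euclidean distance between time steps.
            cost = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            r[i][j] = cost + soft_min([r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma)
    return r[n][m]


def similarity_edges(series: List[Series], threshold: float, gamma: float = 1.0) -> List[Tuple[int, int]]:
    """Hypothetical edge rule: connect two series when their soft-DTW value is below a threshold."""
    edges = []
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            if soft_dtw(series[i], series[j], gamma) < threshold:
                edges.append((i, j))
    return edges


if __name__ == "__main__":
    a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    b = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
    c = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]]
    print(similarity_edges([a, b, c], threshold=1.0))
```

Because the soft minimum is never larger than the hard minimum, soft-DTW can be slightly negative for near-identical series; as `gamma` approaches zero it approaches classic DTW.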

[AI-32] AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions

Link: https://arxiv.org/abs/2411.18015
Authors: Xinchen Wang,Pengfei Gao,Xiangxin Meng,Chao Peng,Ruida Hu,Yun Lin,Cuiyun Gao
Keywords: effective fault localization, bug reproduction, buG reproductIon Scripts, general bug reproduction, reproduction
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:In software maintenance, bug reproduction is essential for effective fault localization and repair. Manually writing reproduction scripts is a time-consuming task that places high demands on developers. Hence, automation of bug reproduction has increasingly attracted attention from researchers and practitioners. However, existing studies on bug reproduction are generally limited to specific bug types such as program crashes, and are hard to apply to general bug reproduction. In this paper, considering the superior performance of agent-based methods in code intelligence tasks, we focus on designing an agent-based framework for the task. Directly employing agents would lead to limited bug reproduction performance, due to entangled subtasks, lengthy retrieved context, and unregulated actions. To mitigate these challenges, we propose an Automated gEneral buG reproductIon Scripts generation framework, named AEGIS, which is the first agent-based framework for the task. AEGIS mainly contains two modules: (1) a concise context construction module, which guides the code agent in extracting structured information from issue descriptions, identifying issue-related code with detailed explanations, and integrating these elements to construct the concise context; (2) an FSM-based multi-feedback optimization module, which further regulates the behavior of the code agent within a finite state machine (FSM), ensuring a controlled and efficient script generation process based on multi-dimensional feedback. Extensive experiments on the public benchmark dataset show that AEGIS outperforms the state-of-the-art baseline by 23.0% in the F-P metric. In addition, the bug reproduction scripts generated by AEGIS can improve the relative resolved rate of Agentless by 12.5%.

[AI-33] Causal and Local Correlations Based Network for Multivariate Time Series Classification

Link: https://arxiv.org/abs/2411.18008
Authors: Mingsen Du,Yanxuan Wei,Xiangwei Zheng,Cun Ji
Keywords: time series classification, number of researchers, attracted the attention, large number, Recently
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: Submitted on April 03, 2023; major revisions on March 25, 2024; minor revisions on July 9, 2024

Abstract:Recently, time series classification has attracted the attention of a large number of researchers, and hundreds of methods have been proposed. However, these methods often ignore the spatial correlations among dimensions and the local correlations among features. To address this issue, the causal and local correlations based network (CaLoNet) is proposed in this study for multivariate time series classification. First, pairwise spatial correlations between dimensions are modeled using causality modeling to obtain the graph structure. Then, a relationship extraction network is used to fuse local correlations to obtain long-term dependency features. Finally, the graph structure and long-term dependency features are integrated into the graph neural network. Experiments on the UEA datasets show that CaLoNet can obtain competitive performance compared with state-of-the-art methods.

[AI-34] A Novel Pareto-optimal Ranking Method for Comparing Multi-objective Optimization Algorithms

Link: https://arxiv.org/abs/2411.17999
Authors: Amin Ibrahim,Azam Asilian Bidgoli,Shahryar Rahnamayan,Kalyanmoy Deb
Keywords: optimization algorithms grows, many-objective optimization algorithms, performance indicators, optimization algorithms, algorithms
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:As the interest in multi- and many-objective optimization algorithms grows, the performance comparison of these algorithms becomes increasingly important. A large number of performance indicators for multi-objective optimization algorithms have been introduced, each of which evaluates these algorithms based on a certain aspect. Therefore, assessing the quality of multi-objective results using multiple indicators is essential to guarantee that the evaluation considers all quality perspectives. This paper proposes a novel multi-metric comparison method to rank the performance of multi-/many-objective optimization algorithms based on a set of performance indicators. We utilize the Pareto optimality concept (i.e., the non-dominated sorting algorithm) to create the rank levels of algorithms by simultaneously considering multiple performance indicators as criteria/objectives. Building on these levels, four different techniques are proposed to rank algorithms based on their contribution at each Pareto level. This method allows researchers to utilize a set of existing or newly developed performance metrics to adequately assess and rank multi-/many-objective algorithms. The proposed method is scalable and can accommodate any newly introduced metric in its comprehensive scheme. The method was applied to rank 10 competing algorithms in the 2018 CEC competition solving 15 many-objective test problems. The Pareto-optimal ranking was conducted based on 10 well-known multi-objective performance indicators and the results were compared to the final ranks reported by the competition, which were based on the inverted generational distance (IGD) and hypervolume indicator (HV) measures. The techniques suggested in this paper have broad applications in science and engineering, particularly in areas where multiple metrics are used for comparisons. Examples include machine learning and data mining.
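
As a rough illustration of the level-building step mentioned in the abstract, the sketch below groups algorithms into Pareto levels by non-dominated sorting over several indicator values. It only covers the level construction, not the four ranking techniques the paper proposes. The algorithm names and scores are made up, and the sketch assumes every indicator is "lower is better" (an indicator such as HV would need to be negated first).

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every indicator and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_levels(scores: Dict[str, Sequence[float]]) -> List[List[str]]:
    """Peel off successive non-dominated fronts; level 0 holds the best algorithms."""
    remaining = dict(scores)
    levels: List[List[str]] = []
    while remaining:
        front = sorted(
            name
            for name, s in remaining.items()
            if not any(dominates(o, s) for other, o in remaining.items() if other != name)
        )
        levels.append(front)
        for name in front:
            del remaining[name]
    return levels


if __name__ == "__main__":
    # Hypothetical indicator values per algorithm (two indicators, lower is better).
    scores = {
        "alg_a": [0.10, 0.30],
        "alg_b": [0.20, 0.20],
        "alg_c": [0.30, 0.40],
        "alg_d": [0.40, 0.50],
    }
    print(pareto_levels(scores))
    # → [['alg_a', 'alg_b'], ['alg_c'], ['alg_d']]
```

Here `alg_a` and `alg_b` trade off against each other and share the first level, while `alg_c` and `alg_d` are each dominated by everything in the level above them.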

[AI-35] Regularized Multi-LLMs Collaboration for Enhanced Score-based Causal Discovery

Link: https://arxiv.org/abs/2411.17989
Authors: Xiaoxuan Li,Yao Liu,Ruoyu Wang,Lina Yao
Keywords: randomized control trials, conducting randomized control, purely observational data, relationships among variables, systems and algorithms
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)

Abstract:As understanding the cause-and-effect relationships among variables becomes more important in the development of modern systems and algorithms, learning causality from observational data has become a preferred and efficient approach over conducting randomized control trials. However, purely observational data could be insufficient to reconstruct the true causal graph. Consequently, many researchers have tried to utilise some form of prior knowledge to improve the causal discovery process. In this context, the impressive capabilities of large language models (LLMs) have emerged as a promising alternative to the costly acquisition of prior expert knowledge. In this work, we further explore the potential of using LLMs to enhance causal discovery approaches, particularly focusing on score-based methods, and we propose a general framework to utilise the capacity of not only one but multiple LLMs to augment the discovery process.

[AI-36] The importance of visual modelling languages in generative software engineering

Link: https://arxiv.org/abs/2411.17976
Authors: Roberto Rossi
Keywords: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, Software Engineering, represent a watershed
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, working paper

Abstract:Multimodal GPTs represent a watershed in the interplay between Software Engineering and Generative Artificial Intelligence. GPT-4 accepts image and text inputs, rather than simply natural language. We investigate relevant use cases stemming from these enhanced capabilities of GPT-4. To the best of our knowledge, no other work has investigated similar use cases involving Software Engineering tasks carried out via multimodal GPTs prompted with a mix of diagrams and natural language.

[AI-37] Spatio-temporal Causal Learning for Streamflow Forecasting

Link: https://arxiv.org/abs/2411.17937
Authors: Shu Wan,Reepal Shah,Qi Deng,John Sabo,Huan Liu,K. Selçuk
Keywords: national water resources, water resources, plays an essential, essential role, sustainable planning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To be published at IEEE Big Data 2024

Abstract:Streamflow plays an essential role in the sustainable planning and management of national water resources. Traditional hydrologic modeling approaches simulate streamflow by establishing connections across multiple physical processes, such as rainfall and runoff. These data, inherently connected both spatially and temporally, possess intrinsic causal relations that can be leveraged for robust and accurate forecasting. Recently, spatio-temporal graph neural networks (STGNNs) have been adopted, excelling in various domains, such as urban traffic management, weather forecasting, and pandemic control, and they also promise advances in streamflow management. However, learning causal relationships directly from vast observational data is theoretically and computationally challenging. In this study, we employ a river flow graph as prior knowledge to facilitate the learning of the causal structure and then use the learned causal graph to predict streamflow at targeted sites. The proposed model, Causal Streamflow Forecasting (CSF), is tested in a real-world study in the Brazos River basin in Texas. Our results demonstrate that our method outperforms regular spatio-temporal graph neural networks and achieves higher computational efficiency than traditional simulation methods. By integrating river flow graphs with STGNNs, this research offers a novel approach to streamflow prediction, showcasing the potential of combining advanced neural network techniques with domain-specific knowledge for enhanced performance in hydrologic modeling.

[AI-38] Neural Networks Use Distance Metrics

Link: https://arxiv.org/abs/2411.17932
Authors: Alan Oursland
Keywords: present empirical evidence, learn distance-based representations, ReLU and Absolute, activations learn distance-based, present empirical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 8 pages excluding references and appendix. 12 pages total. 3 figures. The code for the experiments in this paper is available at this https URL

Abstract:We present empirical evidence that neural networks with ReLU and Absolute Value activations learn distance-based representations. We independently manipulate both distance and intensity properties of internal activations in trained models, finding that both architectures are highly sensitive to small distance-based perturbations while maintaining robust performance under large intensity-based perturbations. These findings challenge the prevailing intensity-based interpretation of neural network activations and offer new insights into their learning and decision-making processes.

[AI-39] Combining Threat Intelligence with IoT Scanning to Predict Cyber Attack

Link: https://arxiv.org/abs/2411.17931
Authors: Jubin Abhishek Soni
Keywords: Dark Web information, Dark Web, Web, platform for communication, worldwide platform
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Networking and Internet Architecture (cs.NI)
Comments: 8 pages, 6 figures, 2 tables. This manuscript has been submitted to Springer for review (Manuscript ID: PDSE-D-24-00163) and is under consideration. It has not yet been peer-reviewed or published. Researchers are welcome to read and build upon this work; please cite it appropriately. For questions or clarifications, feel free to contact me

Abstract:While the Web has become a worldwide platform for communication, hackers and hacktivists share their ideology and communicate with members on the “Dark Web” - the reverse of the Web. Currently, information overload and the difficulty of obtaining a comprehensive picture of hackers and cyber-attackers hinder effective analysis and prediction of their activities on the Web. Also, there are currently more objects connected to the Internet than there are people in the world, and this gap will continue to grow as more and more objects gain the ability to directly interface with the Internet. Many technical communities are vigorously pursuing research topics that contribute to the Internet of Things (IoT). In this paper, we propose a novel methodology for collecting and analyzing Dark Web information to identify hacker websites in the sea of the Web, and we show how this information can help predict IoT vulnerabilities. This methodology incorporates information collection, analysis, and visualization techniques, and exploits some IoT devices. Through this research, we aim to contribute to the existing literature on cyber-security that could potentially guide both policy-making and intelligence research.

[AI-40] AI2T: Building Trustable AI Tutors by Interactively Teaching a Self-Aware Learning Agent

Link: https://arxiv.org/abs/2411.17924
Authors: Daniel Weitekamp,Erik Harpstead,Kenneth Koedinger
Keywords: intelligent tutoring systems, authoring intelligent tutoring, tutoring systems, interactively teachable, authoring intelligent
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:AI2T is an interactively teachable AI for authoring intelligent tutoring systems (ITSs). Authors tutor AI2T by providing a few step-by-step solutions and then grading AI2T’s own problem-solving attempts. From just 20-30 minutes of interactive training, AI2T can induce robust rules for step-by-step solution tracking (i.e., model-tracing). As AI2T learns it can accurately estimate its certainty of performing correctly on unseen problem steps using STAND: a self-aware precondition learning algorithm that outperforms state-of-the-art methods like XGBoost. Our user study shows that authors can use STAND’s certainty heuristic to estimate when AI2T has been trained on enough diverse problems to induce correct and complete model-tracing programs. AI2T-induced programs are more reliable than hallucination-prone LLMs and prior authoring-by-tutoring approaches. With its self-aware induction of hierarchical rules, AI2T offers a path toward trustable data-efficient authoring-by-tutoring for complex ITSs that normally require as many as 200-300 hours of programming per hour of instruction.

[AI-41] Can LLMs plan paths in the real world?

Link: https://arxiv.org/abs/2411.17912
Authors: Wanyi Chen,Meng-Wen Su,Nafisa Mehjabin,Mary L. Cummings
Keywords: vehicle navigation systems, increasingly integrate, navigation systems, capability is crucial, large language models
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:As large language models (LLMs) increasingly integrate into vehicle navigation systems, understanding their path-planning capability is crucial. We tested three LLMs through six real-world path-planning scenarios in various settings and with various difficulties. Our experiments showed that all LLMs made numerous errors in all scenarios, revealing that they are unreliable path planners. We suggest that future work focus on implementing mechanisms for reality checks, enhancing model transparency, and developing smaller models.

[AI-42] Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Games with Delayed Rewards

Link: https://arxiv.org/abs/2411.17861
Authors: Ahmad Ahmad,Mehdi Kermanshah,Kevin Leahy,Zachary Serlin,Ho Chit Siu,Makai Mann,Cristian-Ioan Vasile,Roberto Tron,Calin Belta
Keywords: Proximal Policy Optimization, Region Policy Optimization, Window Temporal Logic, Policy Optimization, Policy Gradient method
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we tackle the challenging problem of delayed rewards in reinforcement learning (RL). While Proximal Policy Optimization (PPO) has emerged as a leading Policy Gradient method, its performance can degrade under delayed rewards. We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy (trained on expert demonstrations) with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL). The hybrid architecture leverages offline data throughout training while maintaining PPO’s theoretical guarantees. Building on the monotonic improvement framework of Trust Region Policy Optimization (TRPO), we prove that our approach ensures improvement over both the offline policy and previous iterations, with a bounded performance gap of $(2\varsigma\gamma\alpha^2)/(1-\gamma)^2$, where $\alpha$ is the mixing parameter, $\gamma$ is the discount factor, and $\varsigma$ bounds the expected advantage. Additionally, we prove that our TWTL-based reward shaping preserves the optimal policy of the original problem. TWTL enables formal translation of temporal objectives into immediate feedback signals that guide learning. We demonstrate the effectiveness of our approach through extensive experiments on inverted pendulum and lunar lander environments, showing improvements in both learning speed and final performance compared to standard PPO and offline-only approaches.
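
To get a feel for the performance-gap bound quoted in the abstract, the short sketch below simply evaluates the expression (2ςγα²)/(1-γ)² for given values. The function name and the sample values are illustrative assumptions; the paper's proof and the exact conditions under which the bound holds are not reproduced here.

```python
def performance_gap_bound(varsigma: float, gamma: float, alpha: float) -> float:
    """Evaluate (2 * varsigma * gamma * alpha^2) / (1 - gamma)^2.

    varsigma: bound on the expected advantage
    gamma:    discount factor, assumed to lie in [0, 1)
    alpha:    mixing parameter between the offline and online policies
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 2.0 * varsigma * gamma * alpha ** 2 / (1.0 - gamma) ** 2


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    print(performance_gap_bound(varsigma=1.0, gamma=0.5, alpha=0.5))   # → 1.0
    print(performance_gap_bound(varsigma=1.0, gamma=0.5, alpha=0.25))  # → 0.25
```

The bound scales with α², so halving the mixing parameter quarters the gap, while it grows quickly as γ approaches 1 because of the (1-γ)² denominator.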

[AI-43] “Give me the code” – Log Analysis of First-Year CS Students Interactions With GPT

Link: https://arxiv.org/abs/2411.17855
Authors: Pedro Alves,Bruno Pereira Cipriano
Keywords: Large Language Models, Language Models, Large Language, impact of Large, Bard in computer
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: This is the author’s version of the work. It is posted here for your personal use. Not for redistribution

Abstract:The impact of Large Language Models (LLMs) like GPT-3, GPT-4, and Bard in computer science (CS) education is expected to be profound. Students now have the power to generate code solutions for a wide array of programming assignments. For first-year students, this may be particularly problematic since their foundational skills are still in development and an over-reliance on generative AI tools can hinder their ability to grasp essential programming concepts. This paper analyzes the prompts used by 69 first-year undergraduate students to solve a certain programming problem within a project assignment, without giving them prior prompt training. We also present the rules of the exercise that motivated the prompts, designed to foster critical thinking skills during the interaction. Despite the students' use of unsophisticated prompting techniques, our findings suggest that the majority of them successfully leveraged GPT, incorporating the suggested solutions into their projects. Additionally, half of the students demonstrated the ability to exercise judgment in selecting from multiple GPT-generated solutions, showcasing the development of their critical thinking skills in evaluating AI-generated code.

[AI-44] SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors DATE2025

Link: https://arxiv.org/abs/2411.17847
Authors: Mariam Rakka,Jinhao Li,Guohao Dai,Ahmed Eltawil,Mohammed E. Fouda,Fadi Kurdahi
Keywords: Large Language Models, Recent research efforts, research efforts focus, Language Models, Large Language
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Accepted in DATE 2025

Abstract:Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.

[AI-45] Basic Research Lethal Effects: Military AI Research Funding as Enlistment

Link: https://arxiv.org/abs/2411.17840
Authors: David Gray Widder,Sireesh Gururaja,Lucy Suchman
Keywords: Department of Defense, algorithmically based warfighting, DoD funding, research, DoD
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 22 pages, 9945 words

Abstract:In the context of unprecedented U.S. Department of Defense (DoD) budgets, this paper examines the recent history of DoD funding for academic research in algorithmically based warfighting. We draw from a corpus of DoD grant solicitations from 2007 to 2023, focusing on those addressed to researchers in the field of artificial intelligence (AI). Considering the implications of DoD funding for academic research, the paper proceeds through three analytic sections. In the first, we offer a critical examination of the distinction between basic and applied research, showing how funding calls framed as basic research nonetheless enlist researchers in a war fighting agenda. In the second, we offer a diachronic analysis of the corpus, showing how a ‘one small problem’ caveat, in which affirmation of progress in military technologies is qualified by acknowledgement of outstanding problems, becomes justification for additional investments in research. We close with an analysis of DoD aspirations based on a subset of Defense Advanced Research Projects Agency (DARPA) grant solicitations for the use of AI in battlefield applications. Taken together, we argue that grant solicitations work as a vehicle for the mutual enlistment of DoD funding agencies and the academic AI research community in setting research agendas. The trope of basic research in this context offers shelter from significant moral questions that military applications of one’s research would raise, by obscuring the connections that implicate researchers in U.S. militarism.

[AI-46] STAR: Synthesis of Tailored Architectures

Link: https://arxiv.org/abs/2411.17800
Authors: Armin W. Thomas,Rom Parnichkun,Alexander Amini,Stefano Massaroli,Michael Poli
Keywords: Iterative improvement, deep learning, enabled scaling, fundamental to deep, recent advances
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Iterative improvement of model architectures is fundamental to deep learning: Transformers first enabled scaling, and recent advances in model hybridization have pushed the quality-efficiency frontier. However, optimizing architectures remains challenging and expensive. Current automated or manual approaches fall short, largely due to limited progress in the design of search spaces and due to the simplicity of resulting patterns and heuristics. In this work, we propose a new approach for the synthesis of tailored architectures (STAR). Our approach combines a novel search space based on the theory of linear input-varying systems, supporting a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics. Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.

[AI-47] Engineering AI Judge Systems

Link: https://arxiv.org/abs/2411.17793
Authors: Jiahuei Lin(Justina),Dayi Lin,Sky Zhang,Ahmed E. Hassan
Keywords: Foundation Model-powered software, evaluate Foundation Model-powered, Foundation Model-powered, automatically evaluate Foundation, Model-powered software
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:AI judge systems are designed to automatically evaluate Foundation Model-powered software (i.e., FMware). Due to the intrinsic dynamic and stochastic nature of FMware, the development of AI judge systems requires a unique engineering life cycle and presents new challenges. In this paper, we discuss the challenges based on our industrial experiences in developing AI judge systems for FMware. These challenges lead to substantial time consumption, cost and inaccurate judgments. We propose a framework that tackles the challenges with the goal of improving the productivity of developing high-quality AI judge systems. Finally, we evaluate our framework with a case study on judging a commit message generation FMware. The accuracy of the judgments made by the AI judge system developed with our framework outperforms those made by the AI judge system that is developed without our framework by up to 6.2%, with a significant reduction in development effort.

[AI-48] Joint Resource Optimization, Computation Offloading and Resource Slicing for Multi-Edge Traffic-Cognitive Networks

Link: https://arxiv.org/abs/2411.17782
Authors: Ting Xiaoyang, Minfeng Zhang, Shu gonglee, Saimin Chen Zhang
Keywords: edge computing envisions, computing envisions platforms, envisions platforms operating, edge servers, computational services
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract: The evolving landscape of edge computing envisions platforms operating as dynamic intermediaries between application providers and edge servers (ESs), where task offloading is coupled with payments for computational services. Ensuring efficient resource utilization and meeting stringent Quality of Service (QoS) requirements necessitates incentivizing ESs while optimizing the platform's operational objectives. This paper investigates a multi-agent system where both the platform and ESs are self-interested entities, addressing the joint optimization of revenue maximization, resource allocation, and task offloading. We propose a novel Stackelberg game-based framework to model interactions between stakeholders and solve the optimization problem using a Bayesian Optimization-based centralized algorithm. Recognizing practical challenges in information collection due to privacy concerns, we further design a decentralized solution leveraging neural network optimization and a privacy-preserving information exchange protocol. Extensive numerical evaluations demonstrate the effectiveness of the proposed mechanisms in achieving superior performance compared to existing baselines.

[AI-49] Learning Time-Varying Instruments for Identifying Causal Effects in Time-Series Data

Link: https://arxiv.org/abs/2411.17774
Authors: Debo Cheng (1), Ziqi Xu (2), Jiuyong Li (1), Lin Liu (1), Thuc duy Le (1), Xudong Guo (1), Shichao Zhang (3) ((1) University of South Australia, (2) RMIT University, (3) Guangxi Normal University)
Keywords: Querying causal effects, causal effect estimation, Querying causal, including healthcare, climate science
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages

Abstract: Querying causal effects from time-series data is important across various fields, including healthcare, economics, climate science, and epidemiology. However, this task becomes complex in the presence of time-varying latent confounders, which affect both treatment and outcome variables over time and can introduce bias in causal effect estimation. Traditional instrumental variable (IV) methods are limited in addressing such complexities because they need predefined IVs or strong assumptions that do not hold in dynamic settings. To tackle these issues, we develop a novel Time-varying Conditional Instrumental Variables (CIV) method for Debiasing causal effect estimation, referred to as TDCIV. TDCIV leverages Long Short-Term Memory (LSTM) and Variational Autoencoder (VAE) models to disentangle and learn the representations of the time-varying CIV and its conditioning set from proxy variables without prior knowledge. Under the assumptions of the Markov property and availability of proxy variables, we theoretically establish the validity of these learned representations for addressing the biases from time-varying latent confounders, thus enabling accurate causal effect estimation. Our proposed TDCIV is the first to effectively learn a time-varying CIV and its associated conditioning set without relying on domain-specific knowledge.

[AI-50] PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement

Link: https://arxiv.org/abs/2411.17764
Authors: Tewodros Ayalew, Xiao Zhang, Kevin Yuanbo Wu, Tianchong Jiang, Michael Maire, Matthew R. Walter
Keywords: enabling policy training, goal-conditioned reinforcement learning, enabling policy, goal-conditioned reinforcement, task-agnostic reward function
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures

Abstract:We present PROGRESSOR, a novel framework that learns a task-agnostic reward function from videos, enabling policy training through goal-conditioned reinforcement learning (RL) without manual supervision. Underlying this reward is an estimate of the distribution over task progress as a function of the current, initial, and goal observations that is learned in a self-supervised fashion. Crucially, PROGRESSOR refines rewards adversarially during online RL training by pushing back predictions for out-of-distribution observations, to mitigate distribution shift inherent in non-expert observations. Utilizing this progress prediction as a dense reward together with an adversarial push-back, we show that PROGRESSOR enables robots to learn complex behaviors without any external supervision. Pretrained on large-scale egocentric human video from EPIC-KITCHENS, PROGRESSOR requires no fine-tuning on in-domain task-specific data for generalization to real-robot offline RL under noisy demonstrations, outperforming contemporary methods that provide dense visual reward for robotic learning. Our findings highlight the potential of PROGRESSOR for scalable robotic applications where direct action labels and task-specific rewards are not readily available.

[AI-51] Will an AI with Private Information Allow Itself to Be Switched Off?

Link: https://arxiv.org/abs/2411.17749
Authors: Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, Scott Emmons
Keywords: wide variety, variety of goals, fetch the coffee, shutdown problem, Russell
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:A wide variety of goals could cause an AI to disable its off switch because “you can’t fetch the coffee if you’re dead” (Russell 2019). Prior theoretical work on this shutdown problem assumes that humans know everything that AIs do. In practice, however, humans have only limited information. Moreover, in many of the settings where the shutdown problem is most concerning, AIs might have vast amounts of private information. To capture these differences in knowledge, we introduce the Partially Observable Off-Switch Game (POSG), a game-theoretic model of the shutdown problem with asymmetric information. Unlike when the human has full observability, we find that in optimal play, even AI agents assisting perfectly rational humans sometimes avoid shutdown. As expected, increasing the amount of communication or information available always increases (or leaves unchanged) the agents’ expected common payoff. But counterintuitively, introducing bounded communication can make the AI defer to the human less in optimal play even though communication mitigates information asymmetry. In particular, communication sometimes enables new optimal behavior requiring strategic AI deference to achieve outcomes that were previously inaccessible. Thus, designing safe artificial agents in the presence of asymmetric information requires careful consideration of the tradeoffs between maximizing payoffs (potentially myopically) and maintaining AIs’ incentives to defer to humans.

[AI-52] Fast convolution algorithm for state space models

Link: https://arxiv.org/abs/2411.17729
Authors: Gregory Beylkin
Keywords: linear time invariant, present a fast, matrix-vector multiplications, LTI, time invariant system
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments: 5 pages

Abstract: We present a fast, robust algorithm for applying a matrix transfer function of a linear time invariant (LTI) system in the time domain. Computing $L$ states of a multiple-input multiple-output (MIMO) LTI system appears to require $L$ matrix-vector multiplications. We demonstrate that, for any finite user-selected accuracy, the number of matrix-vector multiplications can be reduced to $\mathcal{O}(\log_2 L)$ (within an $\mathcal{O}(L)$ algorithm). The algorithm uses an approximation of the rational transfer function in the z-domain by a matrix polynomial of degree $2^{N+1}-1$, where $N$ is chosen to achieve any user-selected accuracy. Importantly, using a cascade implementation in the time domain, applying the transfer function requires only $N+1$ matrix-vector multiplications. We note that LTI systems are used in state space models (SSMs) for modeling long range dependencies where $L$ is large. In applications where the state matrix of the LTI system is approximated by a structured matrix, the computational cost is further reduced. We briefly describe several structured approximations of matrices that can be used for this purpose.
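
The link between a polynomial of degree $2^{N+1}-1$ and $N+1$ cascade stages can be illustrated with the elementary identity $\sum_{k=0}^{2^{N+1}-1} (A z^{-1})^k = \prod_{j=0}^{N} \left(I + A^{2^j} z^{-2^j}\right)$. The sketch below is not the paper's algorithm (which approximates a rational transfer function); it assumes the plain recursion x[n] = A x[n-1] + u[n], and all function names are hypothetical. Each stage applies one matrix power to a delayed copy of the whole sequence.

```python
def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def direct_states(A, inputs):
    """Reference: x[n] = A x[n-1] + u[n], one step at a time."""
    x = [0.0] * len(A)
    states = []
    for u in inputs:
        x = [p + q for p, q in zip(matvec(A, x), u)]
        states.append(x)
    return states


def cascade_states(A, inputs, num_stages):
    """Apply the factors (I + A^(2^j) z^(-2^j)) for j = 0 .. num_stages - 1."""
    seq = [list(u) for u in inputs]
    power, delay = A, 1  # power holds A^(2^j), delay holds 2^j
    for _ in range(num_stages):
        nxt = []
        for n, x in enumerate(seq):
            if n >= delay:
                shifted = matvec(power, seq[n - delay])
                nxt.append([p + q for p, q in zip(x, shifted)])
            else:
                nxt.append(list(x))  # delayed sample lies before the start
        seq = nxt
        power = matmul(power, power)  # square to get A^(2^(j+1))
        delay *= 2
    return seq


# Hypothetical toy system: 8 input samples, so 3 stages (2^3 terms) are exact.
A = [[0.5, 0.1], [0.0, 0.3]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5],
          [0.0, 0.0], [2.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
reference = direct_states(A, inputs)
cascaded = cascade_states(A, inputs, num_stages=3)
```

For sequences longer than $2^{N+1}$ samples the product truncates the series, so the result is only approximate, with an error governed by the size of $A^{2^{N+1}}$; this matches the abstract's statement that $N$ is chosen for a target accuracy.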

[AI-53] When IoT Meet LLMs: Applications and Challenges

Link: https://arxiv.org/abs/2411.17722
Authors: Ibrahim Kok, Orhan Demirci, Suat Ozdemir
Keywords: Large Language Models, Large Language, efficiently transformed workflows, Recent advances, advances in Large
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Accepted in 2024 IEEE International Conference on Big Data (IEEE BigData), 10 pages, 2 figures, 1 table

Abstract:Recent advances in Large Language Models (LLMs) have positively and efficiently transformed workflows in many domains. One such domain with significant potential for LLM integration is the Internet of Things (IoT), where this integration brings new opportunities for improved decision making and system interaction. In this paper, we explore the various roles of LLMs in IoT, with a focus on their reasoning capabilities. We show how LLM-IoT integration can facilitate advanced decision making and contextual understanding in a variety of IoT scenarios. Furthermore, we explore the integration of LLMs with edge, fog, and cloud computing paradigms, and show how this synergy can optimize resource utilization, enhance real-time processing, and provide scalable solutions for complex IoT applications. To the best of our knowledge, this is the first comprehensive study covering IoT-LLM integration between edge, fog, and cloud systems. Additionally, we propose a novel system model for industrial IoT applications that leverages LLM-based collective intelligence to enable predictive maintenance and condition monitoring. Finally, we highlight key challenges and open issues that provide insights for future research in the field of LLM-IoT integration.

[AI-54] MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Link: https://arxiv.org/abs/2411.17720
Authors: Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu
Keywords: enabling unprecedented task, unprecedented task accuracy, revolutionized various fields, enabling unprecedented, computational linguistics
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 10 pages, 6 figures, under review for MLSys 2025

Abstract: The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision and other domains. The attention mechanism has become an essential component of foundation models, due to its superb capability of capturing correlations in a sequence. However, attention results in quadratic complexity in memory and compute as the context length grows. Although many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators leveraging multi-core parallelism and data locality, it remains a significant challenge to accelerate attention on resource-constrained edge neural accelerators with limited compute units and stringent on-chip caches. In this paper, we propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators, by parallelizing the utilization of heterogeneous compute units, i.e., vector processing units and matrix processing units. Our method schedules workloads onto these different compute units in a multi-tiered tiling scheme to process tiled vector workloads and matrix workloads in attention as two streams, respecting the workload dependencies. We search for tiling factors to maximize the parallelization of both compute units while considering I/O overhead, and propose a proactive cache overwrite strategy to avoid undesirable cache spills in practice. Extensive results based on open-sourced simulation frameworks show up to 2.75x speedup and 54% reduction in energy consumption compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario. Further experiments on a real-world edge neural processing unit demonstrate a speedup of up to 1.76x for attention compared to FLAT, without affecting model output accuracy.

[AI-55] Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations

Link: https://arxiv.org/abs/2411.17713
Authors: Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, Zechun Liu, Changsheng Zhao, Yangyang Shi, Tijmen Blankevoort, Mahesh Pasupuleti, Bilge Soran, Zacharie Delpierre Coudert, Rachad Alao, Raghuraman Krishnamoorthi, Vikas Chandra
Keywords: Llama Guard model, Meta Connect, presents Llama Guard, efficient Llama Guard, Llama Guard
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents Llama Guard 3-1B-INT4, a compact and efficient Llama Guard model, which has been open-sourced to the community during Meta Connect 2024. We demonstrate that Llama Guard 3-1B-INT4 can be deployed on resource-constrained devices, achieving a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. Notably, our experiments show that Llama Guard 3-1B-INT4 attains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller in size (440MB).

[AI-56] Generative AI on the Edge: Architecture and Performance Evaluation

Link: https://arxiv.org/abs/2411.17712
Authors: Zeinab Nezami, Maryam Hafeez, Karim Djemame, Syed Ali Raza Zaidi
Keywords: embedding advance intelligence, evaluation of Generative, native vision, vision of embedding, embedding advance
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Comments:

Abstract: 6G’s AI-native vision of embedding advanced intelligence in the network while bringing it closer to the user requires a systematic evaluation of Generative AI (GenAI) models on edge devices. Rapidly emerging solutions based on Open RAN (ORAN) and Network-in-a-Box strongly advocate the use of low-cost, off-the-shelf components for simpler and more efficient deployment, e.g., in provisioning rural connectivity. In this context, conceptual architecture, hardware testbeds and precise performance quantification of Large Language Models (LLMs) on off-the-shelf edge devices remain largely unexplored. This research investigates computationally demanding LLM inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN. We investigate various LLMs, including small, medium and large models, on a Raspberry Pi 5 Cluster using a lightweight Kubernetes distribution (K3s) with modular prompting implementation. We study its feasibility and limitations by analyzing throughput, latency, accuracy and efficiency. Our findings indicate that CPU-only deployment of lightweight models, such as Yi, Phi, and Llama3, can effectively support edge applications, achieving a generation throughput of 5 to 12 tokens per second with less than 50% CPU and RAM usage. We conclude that GenAI on the edge offers localized inference in remote or bandwidth-constrained environments in 6G networks without reliance on cloud infrastructure.

[AI-57] Functional relevance based on the continuous Shapley value

Link: https://arxiv.org/abs/2411.18575
Authors: Pedro Delicado, Cristian Pachón-García
Keywords: Artificial Intelligence, including machine learning, machine learning predictive, learning predictive algorithms, predictive algorithms fed
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 36 pages, 13 figures

Abstract: The presence of Artificial Intelligence (AI) in our society is increasing, which brings with it the need to understand the behaviour of AI mechanisms, including machine learning predictive algorithms fed with tabular data, text, or images, among other types of data. This work focuses on the interpretability of predictive models based on functional data. Designing interpretability methods for functional data models implies working with a set of features whose size is infinite. In the context of scalar-on-function regression, we propose an interpretability method based on the Shapley value for continuous games, a mathematical formulation that allows a global payoff to be fairly distributed among a continuous set of players. The method is illustrated through a set of experiments with simulated and real data sets. The open source Python package ShapleyFDA is also presented.
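
As background for the continuous-game formulation, the classical Shapley value for a finite set of players can be computed by averaging each player's marginal contribution over all orderings. The sketch below is this finite analogue only, not the paper's method or the ShapleyFDA API; the toy game and all names are hypothetical. In a functional-data setting one might think of each player as an interval of the function's domain.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: additive parts for a and b, plus an a-c interaction."""
    worth = 0.0
    if "a" in coalition:
        worth += 1.0
    if "b" in coalition:
        worth += 2.0
    if {"a", "c"} <= coalition:
        worth += 3.0  # only earned when a and c are both present
    return worth


values = shapley_values(["a", "b", "c"], toy_value)
```

The interaction term is split equally between `a` and `c`, and the values sum to the payoff of the full coalition, which is the "fair distribution of a global payoff" the abstract refers to.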

[AI-58] Isometry pursuit

Link: https://arxiv.org/abs/2411.18502
Authors: Samson Koelle, Marina Meila
Keywords: identifying orthonormal column-submatrices, Isometry pursuit, wide matrices, convex algorithm, algorithm for identifying
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract: Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.

[AI-59] Hotspot-Driven Peptide Design via Multi-Fragment Autoregressive Extension

Link: https://arxiv.org/abs/2411.18463
Authors: Jiahan Li, Tong Chen, Shitong Luo, Chaoran Cheng, Jiaqi Guan, Ruihan Guo, Sheng Wang, Ge Liu, Jian Peng, Jianzhu Ma
Keywords: treating human diseases, peptide, short chains, amino acids, human diseases
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint, Under review

Abstract:Peptides, short chains of amino acids, interact with target proteins, making them a unique class of protein-based therapeutics for treating human diseases. Recently, deep generative models have shown great promise in peptide generation. However, several challenges remain in designing effective peptide binders. First, not all residues contribute equally to peptide-target interactions. Second, the generated peptides must adopt valid geometries due to the constraints of peptide bonds. Third, realistic tasks for peptide drug development are still lacking. To address these challenges, we introduce PepHAR, a hot-spot-driven autoregressive generative model for designing peptides targeting specific proteins. Building on the observation that certain hot spot residues have higher interaction potentials, we first use an energy-based density model to fit and sample these key residues. Next, to ensure proper peptide geometry, we autoregressively extend peptide fragments by estimating dihedral angles between residue frames. Finally, we apply an optimization process to iteratively refine fragment assembly, ensuring correct peptide structures. By combining hot spot sampling with fragment-based extension, our approach enables de novo peptide design tailored to a target protein and allows the incorporation of key hot spot residues into peptide scaffolds. Extensive experiments, including peptide design and peptide scaffold generation, demonstrate the strong potential of PepHAR in computational peptide binder design.

[AI-60] Learning optimal objective values for MILP

Link: https://arxiv.org/abs/2411.18321
Authors: Lara Scavuzzo, Karen Aardal, Neil Yorke-Smith
Keywords: Modern Mixed Integer, Integer Linear Programming, Mixed Integer Linear, Modern Mixed, Linear Programming
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments:

Abstract:Modern Mixed Integer Linear Programming (MILP) solvers use the Branch-and-Bound algorithm together with a plethora of auxiliary components that speed up the search. In recent years, there has been an explosive development in the use of machine learning for enhancing and supporting these algorithmic components. Within this line, we propose a methodology for predicting the optimal objective value, or, equivalently, predicting if the current incumbent is optimal. For this task, we introduce a predictor based on a graph neural network (GNN) architecture, together with a set of dynamic features. Experimental results on diverse benchmarks demonstrate the efficacy of our approach, achieving high accuracy in the prediction task and outperforming existing methods. These findings suggest new opportunities for integrating ML-driven predictions into MILP solvers, enabling smarter decision-making and improved performance.

[AI-61] Wearable intelligent throat enables natural speech in stroke patients with dysarthria

Link: https://arxiv.org/abs/2411.18266
Authors: Chenyu Tang, Shuo Gao, Cong Li, Wentian Yi, Yuxuan Jin, Xiaoxue Zhai, Sixuan Lei, Hongbei Meng, Zibo Zhang, Muzi Xu, Shengbo Wang, Xuhang Chen, Chenxi Wang, Hongyun Yang, Ningli Wang, Wenyu Wang, Jin Cao, Xiaodong Feng, Peter Smielewski, Yu Pan, Wenhui Song, Martin Birchall, Luigi G. Occhipint
Keywords: Wearable silent speech, Wearable silent, systems hold significant, hold significant potential, hold significant
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Systems and Control (eess.SY)
Comments: 5 figures, 45 references

Abstract:Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI-driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to enable fluent, emotionally expressive communication. The system utilizes ultrasensitive textile strain sensors to capture high-quality signals from the neck area and supports token-level processing for real-time, continuous speech decoding, enabling seamless, delay-free communication. In tests with five stroke patients with dysarthria, IT’s LLM agents intelligently corrected token errors and enriched sentence-level emotional and logical coherence, achieving low error rates (4.2% word error rate, 2.9% sentence error rate) and a 55% increase in user satisfaction. This work establishes a portable, intuitive communication platform for patients with dysarthria with the potential to be applied broadly across different neurological conditions and in multi-language support systems.
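
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below shows that standard definition only; the abstract does not state the paper's exact evaluation protocol, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A sentence error rate can be defined analogously as the fraction of sentences whose hypothesis differs from the reference at all.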

[AI-62] R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge

Link: https://arxiv.org/abs/2411.18220
Authors: Aladin Djuhera, Vlad C. Andrei, Mohsen Pourghasemian, Haris Gacanin, Holger Boche, Walid Saad
Keywords: multiple tasks efficiently, Multi-task large language, handle multiple tasks, large language models, demand specialized models
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models to handle multiple tasks efficiently. However, training MTLLMs is complex and exhaustive, particularly when tasks are subject to change. Recently, the concept of model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM. In this paper, the problem of enabling edge users to collaboratively craft such MTLLMs via task vectors is studied, under the assumption of worst-case adversarial attacks. To this end, first the influence of adversarial noise on multi-task model fusion is investigated and a relationship between the so-called weight disentanglement error and the mean squared error (MSE) is derived. Using hypothesis testing, it is directly shown that the MSE increases interference between task vectors, thereby rendering model fusion ineffective. Then, a novel resilient MTLLM fusion (R-MTLLMF) is proposed, which leverages insights about the LLM architecture and fine-tuning process to safeguard task vector aggregation under adversarial noise by realigning the MTLLM. The proposed R-MTLLMF is then compared for both worst-case and ideal transmission scenarios to study the impact of the wireless channel. Extensive model fusion experiments with vision LLMs demonstrate R-MTLLMF’s effectiveness, achieving close-to-baseline performance across eight different tasks in ideal noise scenarios and significantly outperforming unprotected model fusion in worst-case scenarios. The results further advocate for additional physical layer protection for a holistic approach to resilience, from both a wireless and LLM perspective.
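
Model fusion via task vectors is commonly described as plain arithmetic on weights: a task vector is the difference between fine-tuned and pretrained parameters, and the fused model adds a scaled sum of task vectors back to the pretrained weights. The sketch below shows only this basic, unprotected fusion step, not the resilient R-MTLLMF scheme; the dictionary-of-lists weight format, the `scale` parameter, and the toy numbers are assumptions made for illustration.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference between fine-tuned and pretrained weights."""
    return {name: [f - p for f, p in zip(finetuned[name], base)]
            for name, base in pretrained.items()}


def fuse(pretrained, task_vectors, scale=1.0):
    """Add the scaled sum of task vectors to the pretrained weights."""
    fused = {}
    for name, base in pretrained.items():
        delta = [sum(tv[name][i] for tv in task_vectors)
                 for i in range(len(base))]
        fused[name] = [b + scale * d for b, d in zip(base, delta)]
    return fused


# Toy weights (hypothetical): two users fine-tune the same base for two tasks.
pretrained = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}   # task A changes the first weight
finetuned_b = {"w": [1.0, 2.25]}  # task B changes the second weight

vectors = [task_vector(finetuned_a, pretrained),
           task_vector(finetuned_b, pretrained)]
fused = fuse(pretrained, vectors)
```

In this toy case the two task vectors touch different weights, so they do not interfere. Noise added to a task vector in transit would perturb the fused weights directly, which is the failure mode the abstract's adversarial analysis concerns.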

[AI-63] Predicting Water Quality using Quantum Machine Learning: The Case of the Umgeni Catchment (U20A) Study Region

Link: https://arxiv.org/abs/2411.18141
Authors: Muhammad Al-Zafar Khan, Jamal Al-Karaki, Marwan Omar
Keywords: study water quality, South Africa, region in Durban, application of QML, QML techniques
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures

Abstract: In this study, we consider a real-world application of QML techniques to study water quality in the U20A region in Durban, South Africa. Specifically, we applied the quantum support vector classifier (QSVC) and quantum neural network (QNN), and we showed that the QSVC is easier to implement and yields a higher accuracy. The QSVC models were applied for three kernels: linear, polynomial, and radial basis function (RBF), and it was shown that the polynomial and RBF kernels had exactly the same performance. The QNN model was applied with different optimizers, learning rates, noise on the circuit components, and weight initializations, but it persistently ran into the dead neuron problem. Thus, the QNN was compared only by accuracy and loss, and it was shown that the model performs best with the Adam optimizer, although still below the QSVC.

[AI-64] Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization

Link: https://arxiv.org/abs/2411.17983
Authors: Tian Bai, Ying Jin
Keywords: inference is challenging, break the exchangeability, Machine Learning, Model, unlabeled data
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: Model selection/optimization in conformal inference is challenging, since it may break the exchangeability between labeled and unlabeled data. We study this problem in the context of conformal selection, which uses conformal p-values to select “interesting” instances with large unobserved labels from a pool of unlabeled data, while controlling the FDR in finite sample. For validity, existing solutions require the model choice to be independent of the data used to construct the p-values and calibrate the selection set. However, when presented with many model choices and limited labeled data, it is desirable to (i) select the best model in a data-driven manner, and (ii) mitigate power loss due to sample splitting. This paper presents OptCS, a general framework that allows valid statistical testing (selection) after flexible data-driven model optimization. We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse and handles complex p-value dependencies to maintain finite-sample FDR control via a novel multiple testing procedure. We instantiate this general recipe to propose three FDR-controlling procedures, each optimizing the models differently: (i) selecting the most powerful one among multiple pre-trained candidate models, (ii) using all data for model fitting without sample splitting, and (iii) combining full-sample model fitting and selection. We demonstrate the efficacy of our methods via simulation studies and real applications in drug discovery and alignment of large language models in radiology report generation.
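
The baseline that OptCS builds on can be sketched as two steps: compute a conformal p-value for each unlabeled instance by ranking its score against calibration scores, then apply the Benjamini-Hochberg (BH) procedure to control the FDR. The code below is a minimal sketch of that baseline with a fixed, pre-chosen score, not of OptCS itself; it assumes a common form of the p-value, (1 + number of calibration scores below the test score) / (n + 1), and uses made-up scores.

```python
def conformal_p_values(calib_scores, test_scores):
    """p_j = (1 + #{calibration scores strictly below test score j}) / (n + 1)."""
    n = len(calib_scores)
    return [(1 + sum(v < t for v in calib_scores)) / (n + 1)
            for t in test_scores]


def benjamini_hochberg(p_values, alpha):
    """Indices selected by BH at level alpha (largest k with p_(k) <= alpha*k/m)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Made-up scores: a small test score means strong evidence of a large label.
calib_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
test_scores = [0.05, 0.95, 0.0]

p_values = conformal_p_values(calib_scores, test_scores)
selected = benjamini_hochberg(p_values, alpha=0.2)
```

The validity of this baseline relies on the score function being chosen independently of the calibration and test data, which is exactly the restriction the abstract says OptCS is designed to relax.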

[AI-65] Graph Neural Network for Cerebral Blood Flow Prediction With Clinical Datasets

Link: https://arxiv.org/abs/2411.17971
Authors: Seungyeon Kim, Wheesung Lee, Sung-Ho Ahn, Do-Eun Lee, Tae-Rin Lee
Keywords: Accurate prediction, diagnosis and treatment, Accurate, GNN, blood flow
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures

Abstract:Accurate prediction of cerebral blood flow is essential for the diagnosis and treatment of cerebrovascular diseases. Traditional computational methods, however, often incur significant computational costs, limiting their practicality in real-time clinical applications. This paper proposes a graph neural network (GNN) to predict blood flow and pressure in previously unseen cerebral vascular network structures that were not included in training data. The GNN was developed using clinical datasets from patients with stenosis, featuring complex and abnormal vascular geometries. Additionally, the GNN model was trained on data incorporating a wide range of inflow conditions, vessel topologies, and network connectivities to enhance its generalization capability. The approach achieved Pearson’s correlation coefficients of 0.727 for pressure and 0.824 for flow rate, with sufficient training data. These findings demonstrate the potential of the GNN for real-time cerebrovascular diagnostics, particularly in handling intricate and pathological vascular networks.

[AI-66] DapPep: Domain Adaptive Peptide-agnostic Learning for Universal T-cell Receptor-antigen Binding Affinity Prediction

Link: https://arxiv.org/abs/2411.17798
Authors: Jiangbin Zheng, Qianhui Xu, Ruichen Xia, Stan Z. Li
Keywords: Identifying T-cell receptors, T-cell receptors, vaccines and immunotherapies, Identifying T-cell, interact with antigenic
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Identifying T-cell receptors (TCRs) that interact with antigenic peptides provides the technical basis for developing vaccines and immunotherapies. The emergent deep learning methods excel at learning antigen binding patterns from known TCRs but struggle with novel or sparsely represented antigens. However, binding specificity for unseen antigens or exogenous peptides is critical. We introduce a domain-adaptive peptide-agnostic learning framework DapPep for universal TCR-antigen binding affinity prediction to address this challenge. The lightweight self-attention architecture combines a pre-trained protein language model with an inner-loop self-supervised regime to enable robust TCR-peptide representations. Extensive experiments on various benchmarks demonstrate that DapPep consistently outperforms existing tools, showcasing robust generalization capability, especially for data-scarce settings and unseen peptides. Moreover, DapPep proves effective in challenging clinical tasks such as sorting reactive T cells in tumor neoantigen therapy and identifying key positions in 3D structures.

[AI-67] Pan-protein Design Learning Enables Task-adaptive Generalization for Low-resource Enzyme Design

Link: https://arxiv.org/abs/2411.17795
Authors: Jiangbin Zheng, Ge Wang, Han Zhang, Stan Z. Li
Keywords: offers transformative potential, current deep CPD, Computational protein design, Computational protein, deep CPD models
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Computational protein design (CPD) offers transformative potential for bioengineering, but current deep CPD models, focused on universal domains, struggle with function-specific designs. This work introduces a novel CPD paradigm tailored for functional design tasks, particularly for enzymes-a key protein class often lacking specific application efficiency. To address structural data scarcity, we present CrossDesign, a domain-adaptive framework that leverages pretrained protein language models (PPLMs). By aligning protein structures with sequences, CrossDesign transfers pretrained knowledge to structure models, overcoming the limitations of limited structural data. The framework combines autoregressive (AR) and non-autoregressive (NAR) states in its encoder-decoder architecture, applying it to enzyme datasets and pan-proteins. Experimental results highlight CrossDesign’s superior performance and robustness, especially with out-of-domain enzymes. Additionally, the model excels in fitness prediction when tested on large-scale mutation data, showcasing its stability.

[AI-68] Soil Characterization of Watermelon Field through Internet of Things: A New Approach to Soil Salinity Measurement

Link: https://arxiv.org/abs/2411.17731
Authors: Md. Naimur Rahman, Shafak Shahriar Sozol, Md. Samsuzzaman, Md. Shahin Hossin, Mohammad Tariqul Islam, S.M. Taohidul Islam, Md. Maniruzzaman
Keywords: modern agricultural industry, soil, agricultural industry, technology plays, modern agricultural
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In the modern agricultural industry, technology plays a crucial role in the advancement of cultivation. To increase crop productivity, soil requires some specific characteristics. For watermelon cultivation, soil needs to be sandy and of high temperature with proper irrigation. This research aims to design and implement an intelligent IoT-based soil characterization system for the watermelon field to measure the soil characteristics. The developed IoT-based system measures moisture, temperature, and pH of soil using different sensors, and the sensor data is uploaded to the cloud via Arduino and Raspberry Pi, from where users can obtain the data using the mobile application and webpage developed for this system. To ensure the precision of the framework, this study includes a comparison between the readings of the soil parameters by the existing field soil meters, the values obtained from the sensor-integrated IoT system, and data obtained from a soil science laboratory. Excessive salinity in soil affects the watermelon yield. This paper proposes a model for the measurement of soil salinity based on soil resistivity. It establishes a relationship between soil salinity and soil resistivity from the data obtained in the laboratory using an artificial neural network (ANN).

[AI-69] Hybrid Quantum Deep Learning Model for Emotion Detection using raw EEG Signal Analysis

Link: https://arxiv.org/abs/2411.17715
Authors: Ali Asgar Chandanwala, Srutakirti Bhowmik, Parna Chaudhury, Sheena Christabel Pravin
Keywords: behavioural research, ability to recognize, emotion recognition, deep learning, mental health depend
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Applications in behavioural research, human-computer interaction, and mental health depend on the ability to recognize emotions. In order to improve the accuracy of emotion recognition using electroencephalography (EEG) data, this work presents a hybrid quantum deep learning technique. Conventional EEG-based emotion recognition techniques are limited by noise and high-dimensional data complexity, which make feature extraction difficult. To tackle these issues, our method combines traditional deep learning classification with quantum-enhanced feature extraction. To identify important brain wave patterns, Bandpass filtering and Welch method are used as preprocessing techniques on EEG data. Intricate inter-band interactions that are essential for determining emotional states are captured by mapping frequency band power attributes (delta, theta, alpha, and beta) to quantum representations. Entanglement and rotation gates are used in a hybrid quantum circuit to maximize the model’s sensitivity to EEG patterns associated with different emotions. Promising results from evaluation on a test dataset indicate the model’s potential for accurate emotion recognition. The model will be extended for real-time applications and multi-class categorization in future study, which could improve EEG-based mental health screening instruments. This method offers a promising tool for applications in adaptive human-computer systems and mental health monitoring by showcasing the possibilities of fusing traditional deep learning with quantum processing for reliable, scalable emotion recognition.

[AI-70] AnyECG: Foundational Models for Electrocardiogram Analysis

Link: https://arxiv.org/abs/2411.17711
Authors: Yue Wang, Xu Cao, Yaojun Hu, Haochao Ying, James Matthew Rehg, Jimeng Sun, Jian Wu, Jintai Chen
Keywords: acute heart attacks, detecting acute heart, ECG, non-invasive and affordable, affordable tool
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Electrocardiogram (ECG), a non-invasive and affordable tool for cardiac monitoring, is highly sensitive in detecting acute heart attacks. However, due to the lengthy nature of ECG recordings, numerous machine learning methods have been developed for automated heart disease detection to reduce human workload. Despite these efforts, performance remains suboptimal. A key obstacle is the inherent complexity of ECG data, which includes heterogeneity (e.g., varying sampling rates), high levels of noise, demographic-related pattern shifts, and intricate rhythm-event associations. To overcome these challenges, this paper introduces AnyECG, a foundational model designed to extract robust representations from any real-world ECG data. Specifically, a tailored ECG Tokenizer encodes each fixed-duration ECG fragment into a token and, guided by proxy tasks, converts noisy, continuous ECG features into discrete, compact, and clinically meaningful local rhythm codes. These codes encapsulate basic morphological, frequency, and demographic information (e.g., sex), effectively mitigating signal noise. We further pre-train the AnyECG to learn rhythmic pattern associations across ECG tokens, enabling the capture of cardiac event semantics. By being jointly pre-trained on diverse ECG data sources, AnyECG is capable of generalizing across a wide range of downstream tasks where ECG signals are recorded from various devices and scenarios. Experimental results in anomaly detection, arrhythmia detection, corrupted lead generation, and ultra-long ECG signal analysis demonstrate that AnyECG learns common ECG knowledge from data and significantly outperforms cutting-edge methods in each respective task.

[AI-71] A Composite Fault Diagnosis Model for NPPs Based on Bayesian-EfficientNet Module

Link: https://arxiv.org/abs/2411.17707
Authors: Siwei Li, Jiangwen Chen, Hua Lin, Wei Wang
Keywords: important mechanical components, reactor coolant system, main steam system, main feedwater system, condensate system
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:This article focuses on the faults of important mechanical components such as pumps, valves, and pipelines in the reactor coolant system, main steam system, condensate system, and main feedwater system of nuclear power plants (NPPs). It proposes a composite multi-fault diagnosis model based on Bayesian algorithm and EfficientNet large model using data-driven deep learning fault diagnosis technology. The aim is to evaluate the effectiveness of automatic deep learning-based large model technology through transfer learning in nuclear power plant scenarios.

[AI-72] EEG-DCNet: A Fast and Accurate MI-EEG Dilated CNN Classification Method

Link: https://arxiv.org/abs/2411.17705
Authors: Wei Peng, Kang Liu, Jiaxi Shi, Jianchen Hu
Keywords: based motor imagery, based motor, motor imagery, brain-computer interface, regain mobility
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The electroencephalography (EEG)-based motor imagery (MI) classification is a critical and challenging task in brain-computer interface (BCI) technology, which plays a significant role in assisting patients with functional impairments to regain mobility. We present a novel multi-scale atrous convolutional neural network (CNN) model called EEG-dilated convolution network (DCNet) to enhance the accuracy and efficiency of the EEG-based MI classification tasks. We incorporate the 1×1 convolutional layer and utilize the multi-branch parallel atrous convolutional architecture in EEG-DCNet to capture the highly nonlinear characteristics and multi-scale features of the EEG signals. Moreover, we utilize the sliding window to enhance the temporal consistency and utilize the attention mechanism to improve the accuracy of recognizing user intentions. The experimental results (via the BCI-IV-2a, BCI-IV-2b, and High-Gamma datasets) show that EEG-DCNet outperforms existing state-of-the-art (SOTA) approaches in terms of classification accuracy and Kappa scores. Furthermore, since EEG-DCNet requires fewer parameters, the training efficiency and memory consumption are also improved. The experiment code is open-sourced at this https URL.
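To make the "multi-branch atrous convolution" idea concrete, here is a minimal pure-Python sketch of a 1-D dilated convolution run at several dilation rates. It is not the EEG-DCNet architecture; the function names, the shared kernel, and the dilation rates (1, 2, 4) are assumptions chosen for illustration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with gaps of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation  # distance from the first to the last tap
    return [
        sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
        for start in range(len(signal) - span)
    ]

def multi_branch(signal, kernel, dilations=(1, 2, 4)):
    """Apply the same kernel at several dilation rates, one branch per rate."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}

# Hypothetical single-channel samples (made-up numbers).
samples = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
for rate, out in multi_branch(samples, [1.0, 0.0, -1.0]).items():
    print(rate, out)
```

A kernel with k taps and dilation d covers (k - 1) * d + 1 input samples, so larger dilation rates widen the receptive field without adding parameters, which is why parallel branches can capture features at several time scales.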

Machine Learning

[LG-0] Task Arithmetic Through The Lens Of One-Shot Federated Learning

Link: https://arxiv.org/abs/2411.18607
Authors: Zhixu Tao, Ian Mason, Sanjeev Kulkarni, Xavier Boix
Keywords: Task Arithmetic, multiple models’ capabilities, Arithmetic, Federated Learning, Task
Subjects: Machine Learning (cs.LG)

Abstract:Task Arithmetic is a model merging technique that enables the combination of multiple models’ capabilities into a single model through simple arithmetic in the weight space, without the need for additional fine-tuning or access to the original training data. However, the factors that determine the success of Task Arithmetic remain unclear. In this paper, we examine Task Arithmetic for multi-task learning by framing it as a one-shot Federated Learning problem. We demonstrate that Task Arithmetic is mathematically equivalent to the commonly used algorithm in Federated Learning, called Federated Averaging (FedAvg). By leveraging well-established theoretical results from FedAvg, we identify two key factors that impact the performance of Task Arithmetic: data heterogeneity and training heterogeneity. To mitigate these challenges, we adapt several algorithms from Federated Learning to improve the effectiveness of Task Arithmetic. Our experiments demonstrate that applying these algorithms can often significantly boost performance of the merged model compared to the original Task Arithmetic approach. This work bridges Task Arithmetic and Federated Learning, offering new theoretical perspectives on Task Arithmetic and improved practical methodologies for model merging.
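The claimed equivalence can be checked on toy weight vectors. The sketch below assumes all fine-tuned models start from the same pretrained weights and that the Task Arithmetic scaling coefficient equals 1/T for T tasks; under those assumptions, adding scaled task vectors to the pretrained weights gives the same result as plain averaging of the fine-tuned weights. Function names and numbers are illustrative, and the paper's precise statement may be more general.

```python
def task_arithmetic(pretrained, finetuned_models, scale):
    """Merge models as pretrained + scale * sum of task vectors."""
    # A task vector is the difference between fine-tuned and pretrained weights.
    task_vectors = [
        [w - p for w, p in zip(model, pretrained)] for model in finetuned_models
    ]
    return [
        p + scale * sum(tv[i] for tv in task_vectors)
        for i, p in enumerate(pretrained)
    ]

def fedavg_one_shot(finetuned_models):
    """One round of unweighted Federated Averaging: the mean of client weights."""
    n = len(finetuned_models)
    dim = len(finetuned_models[0])
    return [sum(model[i] for model in finetuned_models) / n for i in range(dim)]

# Hypothetical 2-parameter "models" (made-up numbers).
theta_0 = [1.0, 2.0]
clients = [[2.0, 2.0], [1.0, 4.0], [3.0, 3.0]]
print(task_arithmetic(theta_0, clients, scale=1.0 / len(clients)))
print(fedavg_one_shot(clients))
```

With any other scaling coefficient the two merges differ, which is one way to see the coefficient as a tunable departure from plain averaging.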

[LG-1] Surveying the space of descriptions of a composite system with machine learning

Link: https://arxiv.org/abs/2411.18579
Authors: Kieran A. Murphy, Yujing Zhang, Dani S. Bassett
Keywords: Multivariate information theory, Multivariate information, general and principled, Multivariate, information theory
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Code here: this https URL

Abstract:Multivariate information theory provides a general and principled framework for understanding how the components of a complex system are connected. Existing analyses are coarse in nature – built up from characterizations of discrete subsystems – and can be computationally prohibitive. In this work, we propose to study the continuous space of possible descriptions of a composite system as a window into its organizational structure. A description consists of specific information conveyed about each of the components, and the space of possible descriptions is equivalent to the space of lossy compression schemes of the components. We introduce a machine learning framework to optimize descriptions that extremize key information theoretic quantities used to characterize organization, such as total correlation and O-information. Through case studies on spin systems, Sudoku boards, and letter sequences from natural language, we identify extremal descriptions that reveal how system-wide variation emerges from individual components. By integrating machine learning into a fine-grained information theoretic analysis of composite random variables, our framework opens new avenues for probing the structure of real-world complex systems.

[LG-2] Pruning Deep Convolutional Neural Network Using Conditional Mutual Information

Link: https://arxiv.org/abs/2411.18578
Authors: Tien Vu-Van, Dat Du Thanh, Nguyen Ho, Mai Vu
Keywords: Convolutional Neural Networks, Convolutional Neural, achieve high performance, image classification tasks, resource-limited hardware due
Subjects: Machine Learning (cs.LG)

Abstract:Convolutional Neural Networks (CNNs) achieve high performance in image classification tasks but are challenging to deploy on resource-limited hardware due to their large model sizes. To address this issue, we leverage Mutual Information, a metric that provides valuable insights into how deep learning models retain and process information through measuring the shared information between input features or output labels and network layers. In this study, we propose a structured filter-pruning approach for CNNs that identifies and selectively retains the most informative features in each layer. Our approach successively evaluates each layer by ranking the importance of its feature maps based on Conditional Mutual Information (CMI) values, computed using a matrix-based Rényi α-order entropy numerical method. We propose several formulations of CMI to capture correlation among features across different layers. We then develop various strategies to determine the cutoff point for CMI values to prune unimportant features. This approach allows parallel pruning in both forward and backward directions and significantly reduces model size while preserving accuracy. Tested on the VGG16 architecture with the CIFAR-10 dataset, the proposed method reduces the number of filters by more than a third, with only a 0.32% drop in test accuracy.

[LG-3] Concentration of Cumulative Reward in Markov Decision Processes

Link: https://arxiv.org/abs/2411.18551
Authors: Borna Sayedana, Peter E. Caines, Aditya Mahajan
Keywords: Markov Decision Processes, Decision Processes, Markov Decision, Processes, law of iterated
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments: 60 pages

Abstract:In this paper, we investigate the concentration properties of cumulative rewards in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterize reward concentration in MDPs, covering both infinite-horizon settings (i.e., average and discounted reward frameworks) and the finite-horizon setting. Our asymptotic results include the law of large numbers, the central limit theorem, and the law of iterated logarithms, while our non-asymptotic bounds include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of iterated logarithms. Additionally, we explore two key implications of our results. First, we analyze the sample path behavior of the difference in rewards between any two stationary policies. Second, we show that two alternative definitions of regret for learning policies proposed in the literature are rate-equivalent. Our proof techniques rely on a novel martingale decomposition of cumulative rewards, properties of the solution to the policy evaluation fixed-point equation, and both asymptotic and non-asymptotic concentration results for martingale difference sequences.
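For readers unfamiliar with the term, the classical Azuma-Hoeffding inequality for martingale difference sequences is stated below. This is the standard textbook form, given here only as background; the paper's own bounds for cumulative rewards are of this type but are not reproduced here, and the symbols $D_k$, $c_k$, and $\varepsilon$ are notation chosen for this note.

```latex
Let $(D_k)_{k=1}^{n}$ be a martingale difference sequence with
$|D_k| \le c_k$ almost surely. Then for every $\varepsilon > 0$,
\[
  \mathbb{P}\!\left( \left| \sum_{k=1}^{n} D_k \right| \ge \varepsilon \right)
  \le 2 \exp\!\left( - \frac{\varepsilon^{2}}{2 \sum_{k=1}^{n} c_k^{2}} \right).
\]
```

When the bounds $c_k$ are uniform in $k$, the sum concentrates at scale $\sqrt{n}$, which is the kind of behavior a martingale decomposition of cumulative rewards lets one transfer to an MDP.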

[LG-4] Perturbation Ontology based Graph Attention Networks

Link: https://arxiv.org/abs/2411.18520
Authors: Yichen Wang, Jie Wang, Fulin Wang, Xiang Li, Hao Yin, Bhiksha Raj
Keywords: graph neural networks, recent years, emergence and proliferation, heterogeneous counterparts, graph representation learning
Subjects: Machine Learning (cs.LG)

Abstract:In recent years, graph representation learning has undergone a paradigm shift, driven by the emergence and proliferation of graph neural networks (GNNs) and their heterogeneous counterparts. Heterogeneous GNNs have shown remarkable success in extracting low-dimensional embeddings from complex graphs that encompass diverse entity types and relationships. While meta-path-based techniques have long been recognized for their ability to capture semantic affinities among nodes, their dependence on manual specification poses a significant limitation. In contrast, matrix-focused methods accelerate processing by utilizing structural cues but often overlook contextual richness. In this paper, we challenge the current paradigm by introducing ontology as a fundamental semantic primitive within complex graphs. Our goal is to integrate the strengths of both matrix-centric and meta-path-based approaches into a unified framework. We propose perturbation Ontology-based Graph Attention Networks (POGAT), a novel methodology that combines ontology subgraphs with an advanced self-supervised learning paradigm to achieve a deep contextual understanding. The core innovation of POGAT lies in our enhanced homogeneous perturbing scheme designed to generate rigorous negative samples, encouraging the model to explore minimal contextual features more thoroughly. Through extensive empirical evaluations, we demonstrate that POGAT significantly outperforms state-of-the-art baselines, achieving a groundbreaking improvement of up to 10.78% in F1-score for the critical task of link prediction and 12.01% in Micro-F1 for the critical task of node classification.

[LG-5] Living off the Analyst: Harvesting Features from Yara Rules for Malware Detection

Link: https://arxiv.org/abs/2411.18516
Authors: Siddhant Gupta, Fred Lu, Andrew Barlow, Edward Raff, Francis Ferraro, Cynthia Matuszek, Charles Nicholas, James Holt
Keywords: malicious actor intent, actor intent, victim systems, YARA rules, malicious actors
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: To appear in BigData’24 CyberHunt 2024

Abstract:A strategy used by malicious actors is to “live off the land,” where benign systems and tools already available on a victim’s systems are used and repurposed for the malicious actor’s intent. In this work, we ask if there is a way for anti-virus developers to similarly re-purpose existing work to improve their malware detection capability. We show that this is plausible via YARA rules, which use human-written signatures to detect specific malware families, functionalities, or other markers of interest. By extracting sub-signatures from publicly available YARA rules, we assembled a set of features that can more effectively discriminate malicious samples from benign ones. Our experiments demonstrate that these features add value beyond traditional features on the EMBER 2018 dataset. Manual analysis of the added sub-signatures shows a power-law behavior in a combination of features that are specific and unique, as well as features that occur often. A prior expectation may be that the features would be limited in being overly specific to unique malware families. This behavior is observed, and is apparently useful in practice. In addition, we also find sub-signatures that are dual-purpose (e.g., detecting virtual machine environments) or broadly generic (e.g., DLL imports).

[LG-6] Multiple Choice Learning for Efficient Speech Separation with Many Speakers

Link: https://arxiv.org/abs/2411.18497
Authors: David Perera, François Derrida, Théo Mariotte, Gaël Richard, Slim Essid
Keywords: truth separated signals, ground truth separated, Permutation Invariant Training, supervised setting raises, speech separation models
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

Abstract:Training speech separation models in the supervised setting raises a permutation problem: finding the best assignation between the model predictions and the ground truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice Learning (MCL) framework, which was originally introduced to tackle ambiguous tasks. We demonstrate experimentally on the popular WSJ0-mix and LibriMix benchmarks that MCL matches the performances of PIT, while being computationally advantageous. This opens the door to a promising research direction, as MCL can be naturally extended to handle a variable number of speakers, or to tackle speech separation in the unsupervised setting.
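The computational contrast between the two training objectives can be sketched in a few lines. The code below is a generic illustration, assuming a mean-squared-error pairwise loss and the basic winner-takes-all form of Multiple Choice Learning; the paper's actual loss and any additional components are not reproduced, and all function names are hypothetical.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(predictions, targets):
    """PIT: best total loss over all one-to-one prediction/target assignments.

    Brute force over n! permutations; assumes as many predictions as targets.
    """
    n = len(targets)
    return min(
        sum(mse(predictions[p], targets[t]) for t, p in enumerate(perm))
        for perm in permutations(range(n))
    )

def wta_loss(predictions, targets):
    """Winner-takes-all: each target is scored against its closest prediction."""
    return sum(min(mse(p, t) for p in predictions) for t in targets)

# Hypothetical two-speaker example with 2-sample "signals" (made-up numbers).
preds = [[0.0, 0.0], [1.0, 1.0]]
refs = [[1.0, 1.0], [0.0, 0.0]]
print(pit_loss(preds, refs), wta_loss(preds, refs))
```

The winner-takes-all form needs only the n × n table of pairwise losses, whereas exhaustive PIT enumerates n! assignments (assignment solvers reduce this to polynomial time). Because it does not enforce a one-to-one matching, the winner-takes-all value is never larger than the PIT value.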

[LG-7] SPTTE: A Spatiotemporal Probabilistic Framework for Travel Time Estimation

Link: https://arxiv.org/abs/2411.18484
Authors: Chen Xu, Qiang Wang, Lijun Sun
Keywords: Accurate travel time, Accurate travel, travel time, itinerary planning, multi-trip travel time
Subjects: Machine Learning (cs.LG)

Abstract:Accurate travel time estimation is essential for navigation and itinerary planning. While existing research employs probabilistic modeling to assess travel time uncertainty and account for correlations between multiple trips, modeling the temporal variability of multi-trip travel time distributions remains a significant challenge. Capturing the evolution of joint distributions requires large, well-organized datasets; however, real-world trip data are often temporally sparse and spatially unevenly distributed. To address this issue, we propose SPTTE, a spatiotemporal probabilistic framework that models the evolving joint distribution of multi-trip travel times by formulating the estimation task as a spatiotemporal stochastic process regression problem with fragmented observations. SPTTE incorporates an RNN-based temporal Gaussian process parameterization to regularize sparse observations and capture temporal dependencies. Additionally, it employs a prior-based heterogeneity smoothing strategy to correct unreliable learning caused by unevenly distributed trips, effectively modeling temporal variability under sparse and uneven data distributions. Evaluations on real-world datasets demonstrate that SPTTE outperforms state-of-the-art deterministic and probabilistic methods by over 10.13%. Ablation studies and visualizations further confirm the effectiveness of the model components.

[LG-8] What do physics-informed DeepONets learn? Understanding and improving training for scientific computing applications

Link: https://arxiv.org/abs/2411.18459
Authors: Emily Williams, Amanda Howard, Brek Meuris, Panos Stinis
Keywords: deep operator networks, partial differential equations, Physics-informed deep operator, operator networks, differential equations
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Physics-informed deep operator networks (DeepONets) have emerged as a promising approach toward numerically approximating the solution of partial differential equations (PDEs). In this work, we aim to develop further understanding of what is being learned by physics-informed DeepONets by assessing the universality of the extracted basis functions and demonstrating their potential toward model reduction with spectral methods. Results provide clarity about measuring the performance of a physics-informed DeepONet through the decays of singular values and expansion coefficients. In addition, we propose a transfer learning approach for improving training for physics-informed DeepONets between parameters of the same PDE as well as across different, but related, PDEs where these models struggle to train well. This approach results in significant error reduction and learned basis functions that are more effective in representing the solution of a PDE.

[LG-9] Advancements in Myocardial Infarction Detection and Classification Using Wearable Devices: A Comprehensive Review

Link: https://arxiv.org/abs/2411.18451
Authors: Abhijith S, Arjun Rajesh, Mansi Manoj, Sandra Davis Kollannur, Sujitta R V, Jerrin Thomas Panachakel
Keywords: critical health condition, health condition caused, restricted blood flow, Myocardial infarction, heart attack
Subjects: Machine Learning (cs.LG)

Abstract:Myocardial infarction (MI), commonly known as a heart attack, is a critical health condition caused by restricted blood flow to the heart. Early-stage detection through continuous ECG monitoring is essential to minimize irreversible damage. This review explores advancements in MI classification methodologies for wearable devices, emphasizing their potential in real-time monitoring and early diagnosis. It critically examines traditional approaches, such as morphological filtering and wavelet decomposition, alongside cutting-edge techniques, including Convolutional Neural Networks (CNNs) and VLSI-based methods. By synthesizing findings on machine learning, deep learning, and hardware innovations, this paper highlights their strengths, limitations, and future prospects. The integration of these techniques into wearable devices offers promising avenues for efficient, accurate, and energy-aware MI detection, paving the way for next-generation wearable healthcare solutions.

[LG-10] An End-to-End Smart Predict-then-Optimize Framework for Vehicle Relocation Problems in Large-Scale Vehicle Crowd Sensing

Link: https://arxiv.org/abs/2411.18432
Authors: Xinyu Wang, Yiyang Peng, Wei Ma
Keywords: Ubiquitous mobile devices, Ubiquitous mobile, vehicle crowd sensing, vehicle sensing systems, mobile devices
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 31 pages, 12 figures

Abstract:Ubiquitous mobile devices have catalyzed the development of vehicle crowd sensing (VCS). In particular, vehicle sensing systems show great potential in the flexible acquisition of spatio-temporal urban data through built-in sensors under diverse sensing scenarios. However, vehicle systems often exhibit biased coverage due to the heterogeneous nature of trip requests and routes. To achieve a high sensing coverage, a critical challenge lies in optimally relocating vehicles to minimize the divergence between vehicle distributions and target sensing distributions. Conventional approaches typically employ a two-stage predict-then-optimize (PTO) process: first predicting real-time vehicle distributions and subsequently generating an optimal relocation strategy based on the predictions. However, this approach can lead to suboptimal decision-making due to the propagation of errors from upstream prediction. To this end, we develop an end-to-end Smart Predict-then-Optimize (SPO) framework by integrating optimization into prediction within the deep learning architecture, and the entire framework is trained by minimizing the task-specific matching divergence rather than the upstream prediction error. Methodologically, we formulate the vehicle relocation problem by quadratic programming (QP) and incorporate a novel unrolling approach based on the Alternating Direction Method of Multipliers (ADMM) within the SPO framework to compute gradients of the QP layer, facilitating backpropagation and gradient-based optimization for end-to-end learning. The effectiveness of the proposed framework is validated by real-world taxi datasets in Hong Kong. Utilizing the alternating differentiation method, the general SPO framework presents a novel concept of addressing decision-making problems with uncertainty, demonstrating significant potential for advancing applications in intelligent transportation systems.

[LG-11] Streamlining Prediction in Bayesian Deep Learning

Link: https://arxiv.org/abs/2411.18425
Authors: Rui Li, Marcus Klasson, Arno Solin, Martin Trapp
Keywords: Bayesian deep learning, interest in Bayesian, Bayesian deep, deep learning, Monte Carlo integration
Subjects: Machine Learning (cs.LG)

Abstract:The rising interest in Bayesian deep learning (BDL) has led to a plethora of methods for estimating the posterior distribution. However, efficient computation of inferences, such as predictions, has been largely overlooked, with Monte Carlo integration remaining the standard. In this work we examine streamlining prediction in BDL through a single forward pass without sampling. For this we use local linearisation on activation functions and local Gaussian approximations at linear layers, which allows us to analytically compute an approximation to the posterior predictive distribution. We showcase our approach for both MLPs and transformers, such as ViT and GPT-2, and assess its performance on regression and classification tasks.
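The idea of replacing sampling with a single analytic pass can be illustrated in the simplest possible setting. The sketch below is a deliberately reduced scalar case and not the paper's method: it assumes a Gaussian input, fixed (non-random) weights, and a tanh activation, whereas the paper handles uncertainty over the weights themselves. All function names are hypothetical.

```python
import math

def linear_layer(mean, var, weight, bias):
    """Exact Gaussian push-forward through y = weight * x + bias (fixed weight)."""
    return weight * mean + bias, weight ** 2 * var

def tanh_linearised(mean, var):
    """First-order Taylor expansion of tanh around the input mean.

    tanh(x) is approximated by tanh(m) + tanh'(m) * (x - m), so a Gaussian
    input stays Gaussian with the mean mapped by tanh and the variance scaled
    by the squared slope.
    """
    slope = 1.0 - math.tanh(mean) ** 2  # derivative of tanh at the mean
    return math.tanh(mean), slope ** 2 * var

# Propagate a hypothetical input distribution N(0.5, 0.1) through two layers.
m, v = 0.5, 0.1
m, v = linear_layer(m, v, weight=2.0, bias=-1.0)
m, v = tanh_linearised(m, v)
m, v = linear_layer(m, v, weight=0.5, bias=0.0)
print(m, v)
```

The output mean and variance come from one deterministic pass, with no Monte Carlo samples; the price is the approximation error of the linearisation, which grows when the input variance is large relative to the curvature of the activation.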

[LG-12] FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving

Link: https://arxiv.org/abs/2411.18424
Authors: Ao Shen, Zhiyao Li, Mingyu Gao
Keywords: Large Language Models, Language Models, Large Language, concurrently requires good, Service Level Objectives
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Serving numerous users and requests concurrently requires good fairness in Large Language Model (LLM) serving systems. This ensures that, at the same cost, the system can meet the Service Level Objectives (SLOs) of more users, such as time to first token (TTFT) and time between tokens (TBT), rather than allowing a few users to experience performance far exceeding the SLOs. To achieve better fairness, the preemption-based scheduling policy dynamically adjusts the priority of each request to maintain balance during runtime. However, existing systems tend to overly prioritize throughput, overlooking the overhead caused by preemption-induced context switching, which is crucial for maintaining fairness through priority adjustments. In this work, we identify three main challenges that result in this overhead: 1) inadequate I/O utilization, 2) GPU idleness, and 3) unnecessary I/O transmission during multi-turn conversations. Our key insight is that the block-based KV cache memory policy in existing systems, while achieving near-zero memory waste, leads to discontinuity and insufficient granularity in the KV cache memory. To respond, we introduce FastSwitch, a fairness-aware serving system that not only aligns with the existing KV cache memory allocation policy but also mitigates context switching overhead. Our evaluation shows that FastSwitch outperforms the state-of-the-art LLM serving system vLLM with speedups of 1.4-11.2x across different tail TTFT and TBT.

[LG-13] When does a bridge become an aeroplane?

Link: https://arxiv.org/abs/2411.18406
Authors: Tina A. Dardeno, Lawrence A. Bull, Nikolaos Dervilis, Keith Worden
Keywords: structural health monitoring, population-based structural health, remains a challenge, health monitoring, recent advances
Subjects: Machine Learning (cs.LG)
Comments: Conference proceedings paper for ISMA, Sept. 2024

Abstract:Despite recent advances in population-based structural health monitoring (PBSHM), knowledge transfer between highly-disparate structures (i.e., heterogeneous populations) remains a challenge. It has been proposed that heterogeneous transfer may be accomplished via intermediate structures that bridge the gap in information between the structures of interest. A key aspect of the technique is the idea that by varying parameters such as material properties and geometry, one structure can be continuously morphed into another. The current work demonstrates the development of these interpolating structures, via case studies involving the parameterisation of (and transfer between) a simple, simulated ‘bridge’ and ‘aeroplane’. The facetious question ‘When is a bridge not an aeroplane?’ has been previously asked in the context of predicting positive transfer based on structural similarity. While the obvious answer to this question is ‘Always,’ the current work demonstrates that in some cases positive transfer can be achieved between highly-disparate systems.

[LG-14] Preserving Deep Representations In One-Shot Pruning: A Hessian-Free Second-Order Optimization Framework

Link: https://arxiv.org/abs/2411.18376
Authors: Ryan Lucas, Rahul Mazumder
Keywords: one-shot post-training pruning, inference without retraining, aimed at reducing, reducing the cost, pruning framework aimed
Categories: Machine Learning (cs.LG)
*Comments: 10 pages excl. appendix

Abstract:We present SNOWS, a one-shot post-training pruning framework aimed at reducing the cost of vision network inference without retraining. Current leading one-shot pruning methods minimize layer-wise least squares reconstruction error, which does not take into account deeper network representations. We propose to optimize a more global reconstruction objective. This objective accounts for nonlinear activations deep in the network to obtain a better proxy for the network loss. This nonlinear objective leads to a more challenging optimization problem – we demonstrate it can be solved efficiently using a specialized second-order optimization framework. A key innovation of our framework is the use of Hessian-free optimization to compute exact Newton descent steps without needing to compute or store the full Hessian matrix. A distinct advantage of SNOWS is that it can be readily applied on top of any sparse mask derived from prior methods, readjusting their weights to exploit nonlinearities in deep feature representations. SNOWS obtains state-of-the-art results on various one-shot pruning benchmarks including residual networks and Vision Transformers (ViT/B-16 and ViT/L-16, 86M and 304M parameters respectively).
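
Editor's sketch (not the SNOWS implementation): the abstract mentions Hessian-free optimization, that is, taking a Newton step without forming the Hessian. One common way to do this is to solve H d = -g with conjugate gradient, using only Hessian-vector products. The toy quadratic loss, the function names, and the finite-difference Hessian-vector product below are all assumptions made for illustration; real frameworks would typically obtain these products through automatic differentiation.

```python
def grad(w):
    """Gradient of the toy loss 0.5*w^T A w - b^T w with A=[[4,1],[1,3]], b=[1,2]."""
    return [4.0 * w[0] + 1.0 * w[1] - 1.0, 1.0 * w[0] + 3.0 * w[1] - 2.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hvp(grad_fn, w, v, eps=1e-6):
    """Hessian-vector product H v from two gradient calls (finite-difference stand-in)."""
    g0 = grad_fn(w)
    g1 = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]

def newton_step_cg(grad_fn, w, iters=50, tol=1e-12):
    """Solve H d = -g by conjugate gradient and return w + d; H is never stored."""
    g = grad_fn(w)
    d = [0.0] * len(w)
    r = [-gi for gi in g]            # residual of H d = -g at d = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        hp = hvp(grad_fn, w, p)
        curvature = dot(p, hp)
        if curvature <= 0.0:         # guard against non-positive curvature
            break
        step = rs / curvature
        d = [di + step * pi for di, pi in zip(d, p)]
        r = [ri - step * hpi for ri, hpi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return [wi + di for wi, di in zip(w, d)]
```

On this two-dimensional quadratic a single call such as `newton_step_cg(grad, [0.0, 0.0])` lands near the minimizer (1/11, 7/11), because the Newton step is exact for quadratics and conjugate gradient needs at most two iterations here.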

[LG-15] Large Models Enabled Ubiquitous Wireless Sensing

Link: https://arxiv.org/abs/2411.18277
Authors: Shun Hu
Keywords: enhancing network performance, channel state information, spatial CSI prediction, knowledge of channel, channel state
Categories: Machine Learning (cs.LG)
*Comments: 8 pages, 11 figures

Abstract:In the era of 5G communication, the knowledge of channel state information (CSI) is crucial for enhancing network performance. This paper explores the utilization of language models for spatial CSI prediction within MIMO-OFDM systems. We begin by outlining the significance of accurate CSI in enabling advanced functionalities such as adaptive modulation. We review existing methodologies for CSI estimation, emphasizing the shift from traditional to data-driven approaches. Then a novel framework for spatial CSI prediction using realistic environment information is proposed, and experimental results demonstrate its effectiveness. This research paves the way for innovative strategies in managing wireless networks.

[LG-16] Break the ID-Language Barrier: An Adaption Framework for Sequential Recommendation

Link: https://arxiv.org/abs/2411.18262
Authors: Xiaohan Yu, Li Zhang, Xin Zhao, Yue Wang
Keywords: natural language processing, large language models, recent breakthrough, breakthrough of large, processing has sparked
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Abstract:The recent breakthrough of large language models (LLMs) in natural language processing has sparked exploration in recommendation systems; however, their limited domain-specific knowledge remains a critical bottleneck. Specifically, LLMs lack key pieces of information crucial for sequential recommendations, such as user behavior patterns. To address this critical gap, we propose IDLE-Adapter, a novel framework that integrates pre-trained ID embeddings, rich in domain-specific knowledge, into LLMs to improve recommendation accuracy. IDLE-Adapter acts as a bridge, transforming sparse user-item interaction data into dense, LLM-compatible representations through a Pre-trained ID Sequential Model, Dimensionality Alignment, Layer-wise Embedding Refinement, and Layer-wise Distribution Alignment. Furthermore, IDLE-Adapter demonstrates remarkable flexibility by seamlessly integrating ID embeddings from diverse ID-based sequential models and LLM architectures. Extensive experiments across various datasets demonstrate the superiority of IDLE-Adapter, achieving over 10% and 20% improvements in HitRate@5 and NDCG@5 metrics, respectively, compared to state-of-the-art methods.

[LG-17] Dynamic Retail Pricing via Q-Learning – A Reinforcement Learning Framework for Enhanced Revenue Management

Link: https://arxiv.org/abs/2411.18261
Authors: Mohit Apte, Ketan Kale, Pranav Datar, Pratiksha Deshmukh
Keywords: enhance dynamic pricing, dynamic pricing strategies, reinforcement learning, paper explores, explores the application
Categories: Machine Learning (cs.LG)
*Comments: This paper has been accepted for presentation at the 1st IEEE International Conference on AIML-Applications for Engineering Technology (ICAET-25)

Abstract:This paper explores the application of a reinforcement learning (RL) framework using the Q-Learning algorithm to enhance dynamic pricing strategies in the retail sector. Unlike traditional pricing methods, which often rely on static demand models, our RL approach continuously adapts to evolving market dynamics, offering a more flexible and responsive pricing strategy. By creating a simulated retail environment, we demonstrate how RL effectively addresses real-time changes in consumer behavior and market conditions, leading to improved revenue outcomes. Our results illustrate that the RL model not only surpasses traditional methods in terms of revenue generation but also provides insights into the complex interplay of price elasticity and consumer demand. This research underlines the significant potential of applying artificial intelligence in economic decision-making, paving the way for more sophisticated, data-driven pricing models in various commercial domains.
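
Editor's sketch: a minimal tabular Q-learning loop for pricing, to make the idea in the abstract concrete. Everything specific here is an assumption for illustration and not taken from the paper: the discrete price levels, the two alternating "season" states, the linear demand curve with Gaussian noise, and the hyperparameters.

```python
import random

PRICES = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical discrete price levels (actions)
DEMAND_BASE = [8.0, 16.0]             # hypothetical states: 0 = low season, 1 = high season

def simulated_revenue(state, price, rng):
    """Toy simulator: linear demand with small noise, revenue = price * demand."""
    demand = max(0.0, DEMAND_BASE[state] - price + rng.gauss(0.0, 0.2))
    return price * demand

def train_q_pricing(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    n_actions = len(PRICES)
    q = [[0.0] * n_actions for _ in DEMAND_BASE]    # q[state][action]
    state = 0
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q[state][a])
        reward = simulated_revenue(state, PRICES[action], rng)
        next_state = 1 - state                      # seasons alternate deterministically
        # Standard Q-learning update toward the bootstrapped target.
        td_target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (td_target - q[state][action])
        state = next_state
    return q

def greedy_price(q, state):
    """Price with the highest learned Q-value in the given state."""
    return PRICES[max(range(len(PRICES)), key=lambda a: q[state][a])]
```

In this toy environment the expected revenue is price × (base − price), so the learned greedy prices should settle at 4.0 in the low season and 8.0 in the high season. A static pricing rule would have to pick a single price for both.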

[LG-18] Transfer Learning for Deep Learning-based Prediction of Lattice Thermal Conductivity

Link: https://arxiv.org/abs/2411.18259
Authors: L. Klochko, M. d’Aquin, A. Togo, L. Chaput
Keywords: Machine learning promises, Machine learning, descriptors or structures, promises to accelerate, desirable macro-properties
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:

Abstract:Machine learning promises to accelerate material discovery by enabling high-throughput prediction of desirable macro-properties from atomic-level descriptors or structures. However, the limited data available about precise values of these properties have been a barrier, leading to predictive models with limited precision or limited ability to generalize. This is particularly true of lattice thermal conductivity (LTC): existing datasets of precise (ab initio, DFT-based) computed values are limited to a few dozen materials with little variability. Based on such datasets, we study the impact of transfer learning on both the precision and generalizability of a deep learning model (ParAIsite). We start from an existing model (MEGNet; Chen et al., 2019) and show that improvements are obtained by fine-tuning a pre-trained version on different tasks. Interestingly, we also show that a much greater improvement is obtained when first fine-tuning it on a large dataset of low-quality approximations of LTC (based on the AGL model) and then applying a second phase of fine-tuning with our high-quality, smaller-scale datasets. The promising results obtained pave the way not only towards a greater ability to explore large databases in search of low thermal conductivity materials but also to methods enabling increasingly precise predictions in areas where quality data are rare.

[LG-19] Active partitioning: inverting the paradigm of active learning

Link: https://arxiv.org/abs/2411.18254
Authors: Marius Tacke, Matthias Busch, Kevin Linka, Christian J. Cyron, Roland C. Aydin
Keywords: functional patterns related, functional patterns, aspects or regimes, equally present, models
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Datasets often incorporate various functional patterns related to different aspects or regimes, which are typically not equally present throughout the dataset. We propose a novel, general-purpose partitioning algorithm that utilizes competition between models to detect and separate these functional patterns. This competition is induced by multiple models iteratively submitting their predictions for the dataset, with the best prediction for each data point being rewarded with training on that data point. This reward mechanism amplifies each model’s strengths and encourages specialization in different patterns. The specializations can then be translated into a partitioning scheme. The amplification of each model’s strengths inverts the active learning paradigm: while active learning typically focuses the training of models on their weaknesses to minimize the number of required training data points, our concept reinforces the strengths of each model, thus specializing them. We validate our concept – called active partitioning – with various datasets with clearly distinct functional patterns, such as mechanical stress and strain data in a porous structure. The active partitioning algorithm produces valuable insights into the datasets’ structure, which can serve various further applications. As a demonstration of one exemplary usage, we set up modular models consisting of multiple expert models, each learning a single partition, and compare their performance on more than twenty popular regression problems with single models learning all partitions simultaneously. Our results show significant improvements, with up to 54% loss reduction, confirming our partitioning algorithm’s utility.
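
Editor's sketch: the competition loop described in the abstract, reduced to a toy case. In each round every model predicts every point, and the model with the best prediction for a point is "rewarded" with one training step on that point. The linear models, the plain SGD update, the piecewise-linear toy data, and all names are assumptions for illustration; the paper's own models and training details are not reproduced here.

```python
# Toy data with two regimes: y = 2x + 1 for x < 0 and y = 1 - 2x for x >= 0.
XS = [i / 20.0 - 1.0 for i in range(41)]
YS = [2.0 * x + 1.0 if x < 0 else 1.0 - 2.0 * x for x in XS]

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def winners(xs, ys, models):
    """Index of the model with the smallest squared error at each data point."""
    return [
        min(range(len(models)), key=lambda k: (predict(models[k], x) - y) ** 2)
        for x, y in zip(xs, ys)
    ]

def winner_loss(xs, ys, models):
    """Mean squared error when every point is served by its best model."""
    best = winners(xs, ys, models)
    total = sum((predict(models[k], x) - y) ** 2 for k, x, y in zip(best, xs, ys))
    return total / len(xs)

def active_partition(xs, ys, models, epochs=200, lr=0.1):
    models = [list(m) for m in models]              # work on a copy
    for _ in range(epochs):
        assignment = winners(xs, ys, models)        # all models submit predictions
        for k, x, y in zip(assignment, xs, ys):     # the winner trains on its point
            err = predict(models[k], x) - y
            models[k][0] -= lr * 2.0 * err * x      # SGD step on the squared error
            models[k][1] -= lr * 2.0 * err
    return models, winners(xs, ys, models)
```

Calling `active_partition(XS, YS, [(1.5, 0.5), (-1.5, 0.5)])` returns the specialized models together with a per-point winner index, which can be read as the partition. In this toy setup the two models end up on opposite branches of the data.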

[LG-20] Evaluating and Improving the Robustness of Security Attack Detectors Generated by LLMs

Link: https://arxiv.org/abs/2411.18216
Authors: Samuele Pasini, Jinhan Kim, Tommaso Aiello, Rocio Cabrera Lozoya, Antonino Sabetta, Paolo Tonella
Keywords: Large Language Models, Large Language, Language Models, implement security requirements, software development
Categories: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Abstract:Large Language Models (LLMs) are increasingly used in software development to generate functions, such as attack detectors, that implement security requirements. However, LLMs struggle to generate accurate code, resulting, e.g., in attack detectors that miss well-known attacks when used in practice. This is most likely due to the LLM lacking knowledge about some existing attacks and to the generated code not being evaluated in real usage scenarios. We propose a novel approach integrating Retrieval Augmented Generation (RAG) and Self-Ranking into the LLM pipeline. RAG enhances the robustness of the output by incorporating external knowledge sources, while the Self-Ranking technique, inspired by the concept of Self-Consistency, generates multiple reasoning paths and creates ranks to select the most robust detector. Our extensive empirical study targets code generated by LLMs to detect two prevalent injection attacks in web security: Cross-Site Scripting (XSS) and SQL injection (SQLi). Results show a significant improvement in detection performance compared to baselines, with an increase of up to 71%pt and 37%pt in the F2-Score for XSS and SQLi detection, respectively.

[LG-21] Semantic Edge Computing and Semantic Communications in 6G Networks: A Unifying Survey and Research Challenges

Link: https://arxiv.org/abs/2411.18199
Authors: Milin Zhang, Mohammad Abdi, Venkat R. Dasari, Francesco Restuccia
Keywords: Semantic Edge Computing, Edge Computing, achieve real-time edge-enabled, real-time edge-enabled intelligence, Deep Neural Networks
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*Comments: Submitted to ACM Computing Surveys (CSUR)

Abstract:Semantic Edge Computing (SEC) and Semantic Communications (SemComs) have been proposed as viable approaches to achieve real-time edge-enabled intelligence in sixth-generation (6G) wireless networks. On one hand, SemCom leverages the strength of Deep Neural Networks (DNNs) to encode and communicate the semantic information only, while making it robust to channel distortions by compensating for wireless effects. Ultimately, this leads to an improvement in the communication efficiency. On the other hand, SEC has leveraged distributed DNNs to divide the computation of a DNN across different devices based on their computational and networking constraints. Although significant progress has been made in both fields, the literature lacks a systematic view to connect both fields. In this work, we fill the current gap by unifying the SEC and SemCom fields. We summarize the research problems in these two fields and provide a comprehensive review of the state of the art with a focus on their technical strengths and challenges.

[LG-22] Scalable Multi-Objective Reinforcement Learning with Fairness Guarantees using Lorenz Dominance

Link: https://arxiv.org/abs/2411.18195
Authors: Dimitris Michailidis, Willem Röpke, Diederik M. Roijers, Sennay Ghebreab, Fernando P. Santos
Keywords: Multi-Objective Reinforcement Learning, Reinforcement Learning, aims to learn, trade-offs between multiple, learn a set
Categories: Machine Learning (cs.LG)
*Comments: 29 pages

Abstract:Multi-Objective Reinforcement Learning (MORL) aims to learn a set of policies that optimize trade-offs between multiple, often conflicting objectives. MORL is computationally more complex than single-objective RL, particularly as the number of objectives increases. Additionally, when objectives involve the preferences of agents or groups, ensuring fairness is socially desirable. This paper introduces a principled algorithm that incorporates fairness into MORL while improving scalability to many-objective problems. We propose using Lorenz dominance to identify policies with equitable reward distributions and introduce $\lambda$-Lorenz dominance to enable flexible fairness preferences. We release a new, large-scale real-world transport planning environment and demonstrate that our method encourages the discovery of fair policies, showing improved scalability in two large cities (Xi’an and Amsterdam). Our methods outperform common multi-objective approaches, particularly in high-dimensional objective spaces.
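
Editor's sketch: the standard definition of Lorenz dominance, which the abstract uses to select equitable policies. A return vector is sorted in ascending order and cumulatively summed to give its Lorenz vector; one vector Lorenz-dominates another when its Lorenz vector Pareto-dominates the other's. The function names are illustrative, and the paper's relaxed variant with a fairness parameter is not implemented here.

```python
from itertools import accumulate

def lorenz_vector(returns):
    """Sort the per-objective returns in ascending order, then take cumulative sums."""
    return list(accumulate(sorted(returns)))

def pareto_dominates(u, v):
    """u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def lorenz_dominates(u, v):
    """u Lorenz-dominates v if its Lorenz vector Pareto-dominates v's Lorenz vector."""
    return pareto_dominates(lorenz_vector(u), lorenz_vector(v))

def lorenz_front(candidates):
    """Keep the candidates that no other candidate Lorenz-dominates."""
    return [u for u in candidates
            if not any(lorenz_dominates(v, u) for v in candidates)]
```

For example, the returns (4, 4) and (1, 7) are incomparable under Pareto dominance, but their Lorenz vectors are (4, 8) and (1, 8), so the more equal distribution (4, 4) Lorenz-dominates (1, 7). This is why the Lorenz front is typically smaller than the Pareto front.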

[LG-23] Machine Unlearning reveals that the Gender-based Violence Victim Condition can be detected from Speech in a Speaker-Agnostic Setting

Link: https://arxiv.org/abs/2411.18177
Authors: Emma Reyner-Fuentes, Esther Rituerto-Gonzalez, Carmen Pelaez-Moreno
Keywords: study addresses, addresses the critical, women mental health, mental health, GBV
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:This study addresses the critical issue of gender-based violence’s (GBV) impact on women’s mental health. GBV, encompassing physical and sexual aggression, often results in long-lasting adverse effects for the victims, including anxiety, depression, post-traumatic stress disorder (PTSD), and substance abuse. Artificial Intelligence (AI)-based speech technologies have proven valuable for mental health assessments. However, these technologies experience performance challenges when confronted with speakers whose data has not been used for training. Our research presents a novel approach to speaker-agnostic detection of the gender-based violence victim condition (GBVVC), focusing on the development of robust AI models capable of generalization across diverse speakers. Leveraging advanced deep learning models and domain-adversarial training techniques, we minimize speaker identity’s influence, achieving a 26.95% relative reduction in speaker identification ability while enhancing the GBVVC detection by a 6.37% relative improvement in the accuracy. This shows that models can focus on discriminative paralinguistic biomarkers that enhance the GBVVC prediction, and reduce the subject-specific traits’ impact. Additionally, our model’s predictions moderately correlate with pre-clinical PTSD symptoms, emphasizing the link between GBV and mental health. This work paves the way for AI-powered tools to aid mental health professionals in addressing this societal issue, offering a promising baseline for further research.

[LG-24] A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs

Link: https://arxiv.org/abs/2411.18148
Authors: Ehsan Kabir, Austin R.J. Downey, Jason D. Bakos, David Andrews, Miaoqing Huang
Keywords: Transformer neural networks, natural language processing, machine translation, neural networks, excel in natural
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: arXiv admin note: text overlap with arXiv:2409.14023

Abstract:Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is $1.2\times$ and $2.87\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU respectively. Additionally, it achieves a speedup of $1.7\times$ to $2.25\times$ compared to some state-of-the-art FPGA-based accelerators.

[LG-25] A Machine Learning-based Framework towards Assessment of Decision-Makers Biases

Link: https://arxiv.org/abs/2411.18122
Authors: Wanxue Dong, Maria De-arteaga, Maytal Saar-Tsechansky
Keywords: yielding unfair treatment, Biased human decisions, Biased human, yielding unfair, consequential impacts
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Biased human decisions have consequential impacts across various domains, yielding unfair treatment of individuals and resulting in suboptimal outcomes for organizations and society. In recognition of this fact, organizations regularly design and deploy interventions aimed at mitigating these biases. However, measuring human decision biases remains an important but elusive task. Organizations are frequently concerned with mistaken decisions disproportionately affecting one group. In practice, however, this is typically not possible to assess due to the scarcity of a gold standard: a label that indicates what the correct decision would have been. In this work, we propose a machine learning-based framework to assess bias in human-generated decisions when gold standard labels are scarce. We provide theoretical guarantees and empirical evidence demonstrating the superiority of our method over existing alternatives. This proposed methodology establishes a foundation for transparency in human decision-making, carrying substantial implications for managerial duties, and offering potential for alleviating algorithmic biases when human decisions are used as labels to train algorithms.

[LG-26] ORIS: Online Active Learning Using Reinforcement Learning-based Inclusive Sampling for Robust Streaming Analytics System

Link: https://arxiv.org/abs/2411.18060
Authors: Rahul Pandey, Ziwei Zhu, Hemant Purohit
Keywords: Effective labeled data, labeled data collection, data collection plays, Effective labeled, fine-tuning robust streaming
Categories: Machine Learning (cs.LG)
*Comments: To appear in 2024 IEEE International Conference on Big Data (IEEE BigData 2024)

Abstract:Effective labeled data collection plays a critical role in developing and fine-tuning robust streaming analytics systems. However, continuously labeling documents to filter relevant information poses significant challenges like limited labeling budget or lack of high-quality labels. There is a need for efficient human-in-the-loop machine learning (HITL-ML) design to improve streaming analytics systems. One particular HITL-ML approach is online active learning, which involves iteratively selecting a small set of the most informative documents for labeling to enhance the ML model performance. The performance of such algorithms can be affected by human errors in labeling. To address these challenges, we propose ORIS, a method to perform Online active learning using Reinforcement learning-based Inclusive Sampling of documents for labeling. ORIS aims to create a novel Deep Q-Network-based strategy to sample incoming documents that minimize human errors in labeling and enhance the ML model performance. We evaluate the ORIS method on emotion recognition tasks, and it outperforms traditional baselines in terms of both human labeling performance and the ML model performance.

[LG-27] FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs–Down to 2 Bits!

Link: https://arxiv.org/abs/2411.18055
Authors: Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian
Keywords: deep neural network, energy-efficient deep neural, designing energy-efficient deep, neural network, widely-used technique
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*Comments:

Abstract:A widely-used technique in designing energy-efficient deep neural network (DNN) accelerators is quantization. Recent progress in this direction has reduced the bitwidths used in DNN down to 2. Meanwhile, many prior works apply approximate multipliers (AppMuls) in designing DNN accelerators to lower their energy consumption. Unfortunately, these works still assume a bitwidth much larger than 2, which falls far behind the state-of-the-art in quantization area and even challenges the meaningfulness of applying AppMuls in DNN accelerators, since a high-bitwidth AppMul consumes much more energy than a low-bitwidth exact multiplier! Thus, an important problem to study is: Can approximate multipliers be effectively applied to quantized DNN models with very low bitwidths? In this work, we give an affirmative answer to this question and present a systematic solution that achieves the answer: FAMES, a fast approximate multiplier substitution method for mixed-precision DNNs. Our experiments demonstrate an average 28.67% energy reduction on state-of-the-art mixed-precision quantized models with bitwidths as low as 2 bits and accuracy losses kept under 1%. Additionally, our approach is up to 300x faster than previous genetic algorithm-based methods.

[LG-28] Diffeomorphic Latent Neural Operator Learning for Data-Efficient Predictions of Solutions to Partial Differential Equations

Link: https://arxiv.org/abs/2411.18014
Authors: Zan Ahmad, Shiyi Chen, Minglang Yin, Avisha Kumar, Nicolas Charon, Natalia Trayanova, Mauro Maggioni
Keywords: partial differential equations, latent neural operator, science and engineering, computed approximation, system of partial
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:A computed approximation of the solution operator to a system of partial differential equations (PDEs) is needed in various areas of science and engineering. Neural operators have been shown to be quite effective at predicting these solution generators after training on high-fidelity ground truth data (e.g. numerical simulations). However, in order to generalize well to unseen spatial domains, neural operators must be trained on an extensive amount of geometrically varying data samples that may not be feasible to acquire or simulate in certain contexts (e.g., patient-specific medical data, large-scale computationally intensive simulations). We propose that in order to learn a PDE solution operator that can generalize across multiple domains without needing to sample enough data expressive enough for all possible geometries, we can train instead a latent neural operator on just a few ground truth solution fields diffeomorphically mapped from different geometric/spatial domains to a fixed reference configuration. Furthermore, the form of the solutions is dependent on the choice of mapping to and from the reference domain. We emphasize that preserving properties of the differential operator when constructing these mappings can significantly reduce the data requirement for achieving an accurate model due to the regularity of the solution fields that the latent neural operator is training on. We provide motivating numerical experimentation that demonstrates an extreme case of this consideration by exploiting the conformal invariance of the Laplacian.

[LG-29] Generative Semantic Communication for Joint Image Transmission and Segmentation

Link: https://arxiv.org/abs/2411.18005
Authors: Weiwen Yuan, Jinke Ren, Chongjie Wang, Ruichen Zhang, Jun Wei, Dong In Kim, Shuguang Cui
Keywords: enhancing communication efficiency, semantic communication system, generative semantic communication, Semantic communication, promising technology
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: 6 pages, 7 figures

Abstract:Semantic communication has emerged as a promising technology for enhancing communication efficiency. However, most existing research emphasizes single-task reconstruction, neglecting model adaptability and generalization across multi-task systems. In this paper, we propose a novel generative semantic communication system that supports both image reconstruction and segmentation tasks. Our approach builds upon semantic knowledge bases (KBs) at both the transmitter and receiver, with each semantic KB comprising a source KB and a task KB. The source KB at the transmitter leverages a hierarchical Swin-Transformer, a generative AI scheme, to extract multi-level features from the input image. Concurrently, the counterpart source KB at the receiver utilizes hierarchical residual blocks to generate task-specific knowledge. Furthermore, the two task KBs adopt a semantic similarity model to map different task requirements into pre-defined task instructions, thereby facilitating the feature selection of the source KBs. Additionally, we develop a unified residual block-based joint source and channel (JSCC) encoder and two task-specific JSCC decoders to achieve the two image tasks. In particular, a generative diffusion model is adopted to construct the JSCC decoder for the image reconstruction task. Experimental results demonstrate that our multi-task generative semantic communication system outperforms previous single-task communication systems in terms of peak signal-to-noise ratio and segmentation accuracy.

[LG-30] Optimized Tradeoffs for Private Prediction with Majority Ensembling

Link: https://arxiv.org/abs/2411.17965
Authors: Shuli Jiang, Qiuyi (Richard) Zhang, Gauri Joshi
Keywords: delta, differentially private majority, differentially private, geq, Randomized Response Majority
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: 57 pages, 10 figures. Proceedings of Transactions on Machine Learning Research (TMLR), November 2024

Abstract:We study a classical problem in private prediction, the problem of computing an $(m\epsilon, \delta)$-differentially private majority of $K$ $(\epsilon, \Delta)$-differentially private algorithms for $1 \leq m \leq K$ and $1 > \delta \geq \Delta \geq 0$. Standard methods such as subsampling or randomized response are widely used, but do they provide optimal privacy-utility tradeoffs? To answer this, we introduce the Data-dependent Randomized Response Majority (DaRRM) algorithm. It is parameterized by a data-dependent noise function $\gamma$, and enables efficient utility optimization over the class of all private algorithms, encompassing those standard methods. We show that maximizing the utility of an $(m\epsilon, \delta)$-private majority algorithm can be computed tractably through an optimization problem for any $m \leq K$ by a novel structural result that reduces the infinitely many privacy constraints into a polynomial set. In some settings, we show that DaRRM provably enjoys a privacy gain of a factor of 2 over common baselines, with fixed utility. Lastly, we demonstrate the strong empirical effectiveness of our first-of-its-kind privacy-constrained utility optimization for ensembling labels for private prediction from private teachers in image classification. Notably, our DaRRM framework with an optimized $\gamma$ exhibits substantial utility gains when compared against several baselines.
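
Editor's sketch: one plausible reading of a randomized-response majority with a data-dependent noise function, written only from the abstract's description. The assumption here is that the mechanism releases the true majority of the K binary votes with probability gamma(number of ones) and otherwise releases a fair coin flip. The function names are hypothetical, no privacy guarantee is claimed for any particular gamma, and the paper should be consulted for the actual algorithm and its optimized noise function.

```python
import random

def majority(votes):
    """Majority of binary votes; ties resolve to 1 (an arbitrary convention)."""
    return 1 if 2 * sum(votes) >= len(votes) else 0

def output_one_probability(votes, gamma):
    """Probability that the mechanism outputs 1 for the given binary votes."""
    keep = gamma(sum(votes))         # chance of releasing the true majority
    return keep * majority(votes) + (1.0 - keep) * 0.5

def sample_output(votes, gamma, rng):
    """Draw one output: the true majority with probability gamma(ones), else a fair coin."""
    if rng.random() < gamma(sum(votes)):
        return majority(votes)
    return rng.randrange(2)
```

With a constant `gamma`, this reduces to classical randomized response applied to the majority bit. Letting `gamma` depend on the vote count is what makes the noise data-dependent, for instance by adding less noise when the votes are nearly unanimous.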

[LG-31] ESS-ReduNet: Enhancing Subspace Separability of ReduNet via Dynamic Expansion with Bayesian Inference

Link: https://arxiv.org/abs/2411.17961
Authors: Xiaojie Yu, Haibo Zhang, Lizhi Peng, Fengyang Sun, Jeremiah Deng
Keywords: maximal coding rate, linear discriminative feature, discriminative feature representation, transform original data, neural network model
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:ReduNet is a deep neural network model that leverages the principle of maximal coding rate reduction to transform original data samples into a low-dimensional, linear discriminative feature representation. Unlike traditional deep learning frameworks, ReduNet constructs its parameters explicitly layer by layer, with each layer’s parameters derived based on the features transformed from the preceding layer. Rather than directly using labels, ReduNet uses the similarity between each category’s spanned subspace and the data samples for feature updates at each layer. This may lead to features being updated in the wrong direction, impairing the correct construction of network parameters and reducing the network’s convergence speed. To address this issue, based on the geometric interpretation of the network parameters, this paper presents ESS-ReduNet to enhance the separability of each category’s subspace by dynamically controlling the expansion of the overall spanned space of the samples. Meanwhile, label knowledge is incorporated with Bayesian inference to encourage the decoupling of subspaces. Finally, stability, as assessed by the condition number, serves as an auxiliary criterion for halting training. Experiments on the ESR, HAR, Covertype, and Gas datasets demonstrate that ESS-ReduNet achieves more than 10x improvement in convergence compared to ReduNet. Notably, on the ESR dataset, the features transformed by ESS-ReduNet achieve a 47% improvement in SVM classification accuracy.

[LG-32] Multi-Label Bayesian Active Learning with Inter-Label Relationships

Link: https://arxiv.org/abs/2411.17941
Authors: Yuanyuan Qi, Jueqing Lu, Xiaohao Yang, Joanne Enticott, Lan Du
Keywords: multi-class active learning, lies in assessing, multi-label active learning, assessing the informativeness, indefinite number
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:The primary challenge of multi-label active learning, which distinguishes it from multi-class active learning, lies in assessing the informativeness of an indefinite number of labels while also accounting for the inherited label correlation. Existing studies either require substantial computational resources to leverage correlations or fail to fully explore label dependencies. Additionally, real-world scenarios often require addressing intrinsic biases stemming from imbalanced data distributions. In this paper, we propose a new multi-label active learning strategy to address both challenges. Our method incorporates progressively updated positive and negative correlation matrices to capture co-occurrence and disjoint relationships within the label space of annotated samples, enabling a holistic assessment of uncertainty rather than treating labels as isolated elements. Furthermore, alongside diversity, our model employs ensemble pseudo labeling and beta scoring rules to address data imbalances. Extensive experiments on four realistic datasets demonstrate that our strategy consistently achieves more reliable and superior performance, compared to several established methods.

[LG-33] Enhancing Project Performance Forecasting using Machine Learning Techniques

Link: https://arxiv.org/abs/2411.17914
Authors: Soheila Sadeghi
Keywords: urban road reconstruction, road reconstruction project, delivering urban road, project performance metrics, Work Breakdown Structure
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP)
Comments:

Abstract:Accurate forecasting of project performance metrics is crucial for successfully managing and delivering urban road reconstruction projects. Traditional methods often rely on static baseline plans and fail to consider the dynamic nature of project progress and external factors. This research proposes a machine learning-based approach to forecast project performance metrics, such as cost variance and earned value, for each Work Breakdown Structure (WBS) category in an urban road reconstruction project. The proposed model utilizes time series forecasting techniques, including Autoregressive Integrated Moving Average (ARIMA) and Long Short-Term Memory (LSTM) networks, to predict future performance based on historical data and project progress. The model also incorporates external factors, such as weather patterns and resource availability, as features to enhance the accuracy of forecasts. By applying the predictive power of machine learning, the performance forecasting model enables proactive identification of potential deviations from the baseline plan, which allows project managers to take timely corrective actions. The research aims to validate the effectiveness of the proposed approach using a case study of an urban road reconstruction project, comparing the model’s forecasts with actual project performance data. The findings of this research contribute to the advancement of project management practices in the construction industry, offering a data-driven solution for improving project performance monitoring and control.

[LG-34] RankMap: Priority-Aware Multi-DNN Manager for Heterogeneous Embedded Devices

Link: https://arxiv.org/abs/2411.17867
Authors: Andreas Karatzas, Dimitrios Stamoulis, Iraklis Anagnostopoulos
Keywords: Deep Neural Networks, multiple Deep Neural, Modern edge data, Neural Networks, Deep Neural
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments: 8 pages, 10 figures, 1 table, Accepted for publication at the 28th Design Automation and Test in Europe Conference (DATE 2025), Best Paper Award Candidate

Abstract: Modern edge data centers simultaneously handle multiple Deep Neural Networks (DNNs), leading to significant challenges in workload management. Thus, current management systems must leverage the architectural heterogeneity of new embedded systems to efficiently handle multi-DNN workloads. This paper introduces RankMap, a priority-aware manager specifically designed for multi-DNN tasks on heterogeneous embedded devices. RankMap addresses the extensive solution space of multi-DNN mapping through stochastic space exploration combined with a performance estimator. Experimental results show that RankMap achieves 3.6x higher average throughput compared to existing methods, while preventing DNN starvation under heavy workloads and improving the prioritization of specified DNNs by 57.5x.

[LG-35] Distributed Sign Momentum with Local Steps for Training Transformers

Link: https://arxiv.org/abs/2411.17866
Authors: Shuhua Yu, Ding Zhou, Cong Xie, An Xu, Zhi Zhang, Xin Liu, Soummya Kar
Keywords: training large-scale deep, large-scale deep learning, Pre-training Transformer models, Pre-training Transformer, Transformer models
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 21 figures

Abstract: Pre-training Transformer models is resource-intensive, and recent studies have shown that sign momentum is an efficient technique for training large-scale deep learning models, particularly Transformers. However, its application in distributed training or federated learning remains underexplored. This paper investigates a novel communication-efficient distributed sign momentum method with local updates. Our proposed method allows for a broad class of base optimizers for local updates, and uses sign momentum in global updates, where momentum is generated from differences accumulated during local steps. We evaluate our method on the pre-training of various GPT-2 models, and the empirical results show significant improvement compared to other distributed methods with local updates. Furthermore, by approximating the sign operator with a randomized version that acts as a continuous analog in expectation, we present an O(1/\sqrt{T}) convergence for one instance of the proposed method for nonconvex smooth functions.
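
The general shape of such a scheme (local steps on each worker, then a global step that applies the sign of a momentum built from the accumulated differences) can be sketched as below. This is an illustrative toy, not the authors' algorithm: plain gradient descent stands in for the local base optimizer, and every function name and hyperparameter is an assumption.

```python
import numpy as np

def local_steps(x, grad_fn, lr, steps):
    """Run a few local gradient steps starting from the shared global iterate."""
    for _ in range(steps):
        x = x - lr * grad_fn(x)
    return x

def distributed_sign_momentum(x0, grad_fns, rounds=50, n_local=5,
                              local_lr=0.1, global_lr=0.05, beta=0.9):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(rounds):
        # Each worker reports the difference accumulated over its local steps.
        diffs = [x - local_steps(x.copy(), g, local_lr, n_local) for g in grad_fns]
        avg_diff = np.mean(diffs, axis=0)      # one communication per round
        m = beta * m + (1.0 - beta) * avg_diff  # momentum over accumulated differences
        x = x - global_lr * np.sign(m)          # sign-based global update
    return x
```

Because the global step has a fixed magnitude per coordinate, the iterate in this toy version settles into a small oscillation around the optimum rather than converging exactly; a decaying `global_lr` would be one way to shrink it.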

[LG-36] Integrating Machine Learning and Quantum Circuits for Proton Affinity Predictions

Link: https://arxiv.org/abs/2411.17856
Authors: Hongni Jin, Kenneth M. Merz Jr
Keywords: interpreting gas-phase ion, gas-phase ion mobility, ion mobility coupled, favorable protonated structure, unknown structure prediction
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Quantum Physics (quant-ph)
Comments:

Abstract: A key step in interpreting gas-phase ion mobility coupled with mass spectrometry (IM-MS) data for unknown structure prediction involves identifying the most favorable protonated structure. In the gas phase, the site of protonation is determined using proton affinity (PA) measurements. Currently, mass spectrometry and ab initio computation methods are widely used to evaluate PA; however, both methods are resource-intensive and time-consuming. Therefore, there is a critical need for efficient methods to estimate PA, enabling the rapid identification of the most favorable protonation site in complex organic molecules with multiple proton binding sites. In this work, we developed a fast and accurate method for PA prediction by using multiple descriptors in combination with machine learning (ML) models. Using a comprehensive set of 186 descriptors, our model demonstrated strong predictive performance, with an R^2 of 0.96 and a MAE of 2.47 kcal/mol, comparable to experimental uncertainty. Furthermore, we designed quantum circuits as feature encoders for a classical neural network. To evaluate the effectiveness of this hybrid quantum-classical model, we compared its performance with traditional ML models using a reduced feature set derived from the full set. The result showed that this hybrid model achieved consistent performance comparable to traditional ML models with the same reduced feature set on both a noiseless simulator and real quantum hardware, highlighting the potential of quantum machine learning for accurate and efficient PA predictions.

[LG-37] Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering

Link: https://arxiv.org/abs/2411.17838
Authors: Christopher Holder, Anthony Bagnall
Keywords: machine learning techniques, series machine learning, machine learning, Time series data, time series machine
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Time series data has become increasingly prevalent across numerous domains, driving a growing demand for time series machine learning techniques. Among these, time series clustering (TSCL) stands out as one of the most popular machine learning tasks. TSCL serves as a powerful exploratory analysis tool and is also employed as a preprocessing step or subroutine for various tasks, including anomaly detection, segmentation, and classification. The most popular TSCL algorithms are either fast (in terms of run time) but perform poorly on benchmark problems, or perform well on benchmarks but scale poorly. We present a new TSCL algorithm, the k-means (K) accelerated (A) Stochastic subgradient (S) Barycentre (B) Average (A) (KASBA) clustering algorithm. KASBA is a k-means clustering algorithm that uses the Move-Split-Merge (MSM) elastic distance at all stages of clustering, applies a randomised stochastic subgradient descent to find barycentre centroids, links each stage of clustering to accelerate convergence and exploits the metric property of MSM distance to avoid a large proportion of distance calculations. It is a versatile and scalable clusterer designed for real-world TSCL applications. It allows practitioners to balance run time and clustering performance. We demonstrate through extensive experimentation that KASBA produces significantly better clustering than the faster state-of-the-art clusterers and offers orders of magnitude improvement in run time over the most performant k-means alternatives.
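
To illustrate how a metric distance lets a k-means assignment step skip distance calculations, here is a minimal sketch of one standard triangle-inequality bound. It is not KASBA's implementation: the function name is hypothetical, and any function satisfying the triangle inequality can be passed as `dist` (the usage note uses Euclidean distance as a stand-in for MSM).

```python
import numpy as np

def assign_with_pruning(X, centroids, dist):
    """Nearest-centroid assignment that skips provably non-improving centroids."""
    k = len(centroids)
    # Centroid-to-centroid distances are computed once per assignment pass.
    cc = np.array([[dist(centroids[i], centroids[j]) for j in range(k)]
                   for i in range(k)])
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for n, x in enumerate(X):
        best, best_d = 0, dist(x, centroids[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), the triangle inequality
            # gives d(x, c_j) >= d(x, c_best), so c_j cannot be closer.
            if cc[best, j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = dist(x, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels[n] = best
    return labels, skipped
```

For example, `assign_with_pruning(X, C, lambda a, b: float(np.linalg.norm(a - b)))` returns the same labels as a brute-force nearest-centroid search while reporting how many distance evaluations were avoided.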

[LG-38] Adaptive Client Selection with Personalization for Communication Efficient Federated Learning

Link: https://arxiv.org/abs/2411.17833
Authors: Allan M. de Souza, Filipe Maciel, Joahannes B. D. da Costa, Luiz F. Bittencourt, Eduardo Cerqueira, Antonio A. F. Loureiro, Leandro A. Villas
Keywords: Federated Learning, machine learning models, training machine learning, machine learning, collaboratively training machine
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract: Federated Learning (FL) is a distributed approach to collaboratively training machine learning models. FL requires a high level of communication between the devices and a central server, thus imposing several challenges, including communication bottlenecks and network scalability. This article introduces ACSP-FL (this https URL), a solution to reduce the overall communication and computation costs for training a model in FL environments. ACSP-FL employs a client selection strategy that dynamically adapts the number of devices training the model and the number of rounds required to achieve convergence. Moreover, ACSP-FL enables model personalization to improve clients’ performance. A use case based on human activity recognition datasets aims to show the impact and benefits of ACSP-FL when compared to state-of-the-art approaches. Experimental evaluations show that ACSP-FL minimizes the overall communication and computation overheads to train a model and converges the system efficiently. In particular, ACSP-FL reduces communication by up to 95% compared to literature approaches while providing good convergence even in scenarios where data is distributed in a non-independent and non-identical (non-IID) way across client devices.

[LG-39] Rate-Informed Discovery via Bayesian Adaptive Multifidelity Sampling

Link: https://arxiv.org/abs/2411.17826
Authors: Aman Sinha, Payam Nikdel, Supratik Paul, Shimon Whiteson
Keywords: potential failure cases, Ensuring the safety, autonomous vehicles, requires both accurate, failure cases
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published at CoRL 2024: this https URL

Abstract:Ensuring the safety of autonomous vehicles (AVs) requires both accurate estimation of their performance and efficient discovery of potential failure cases. This paper introduces Bayesian adaptive multifidelity sampling (BAMS), which leverages the power of adaptive Bayesian sampling to achieve efficient discovery while simultaneously estimating the rate of adverse events. BAMS prioritizes exploration of regions with potentially low performance, leading to the identification of novel and critical scenarios that traditional methods might miss. Using real-world AV data we demonstrate that BAMS discovers 10 times as many issues as Monte Carlo (MC) and importance sampling (IS) baselines, while at the same time generating rate estimates with variances 15 and 6 times narrower than MC and IS baselines respectively.

[LG-40] Scalable iterative pruning of large language and vision models using block coordinate descent

Link: https://arxiv.org/abs/2411.17796
Authors: Gili Rosenberg, J. Kyle Brubaker, Martin J. A. Schuetz, Elton Yechao Zhu, Serdar Kadıoğlu, Sima E. Borujeni, Helmut G. Katzgraber
Keywords: Combinatorial Brain Surgeon, maintain high accuracy, Brain Surgeon, reducing model complexity, significantly reducing model
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Comments: 16 pages, 6 figures, 5 tables

Abstract: Pruning neural networks, which involves removing a fraction of their weights, can often maintain high accuracy while significantly reducing model complexity, at least up to a certain limit. We present a neural network pruning technique that builds upon the Combinatorial Brain Surgeon, but solves an optimization problem over a subset of the network weights in an iterative, block-wise manner using block coordinate descent. The iterative, block-based nature of this pruning technique, which we dub "iterative Combinatorial Brain Surgeon" (iCBS), allows for scalability to very large models, including large language models (LLMs), that may not be feasible with a one-shot combinatorial optimization approach. When applied to large models like Mistral and DeiT, iCBS achieves higher performance metrics at the same density levels compared to existing pruning methods such as Wanda. This demonstrates the effectiveness of this iterative, block-wise pruning method in compressing and optimizing the performance of large deep learning models, even while optimizing over only a small fraction of the weights. Moreover, our approach allows for a quality-time (or cost) tradeoff that is not available when using a one-shot pruning technique alone. The block-wise formulation of the optimization problem enables the use of hardware accelerators, potentially offsetting the increased computational costs compared to one-shot pruning methods like Wanda. In particular, the optimization problem solved for each block is quantum-amenable in that it could, in principle, be solved by a quantum computer.

[LG-41] MTS-UNMixers: Multivariate Time Series Forecasting via Channel-Time Dual Unmixing

Link: https://arxiv.org/abs/2411.17770
Authors: Xuanbing Zhu, Dunbin Shen, Zhongwen Rao, Huiyi Ma, Yingguang Hao, Hongyu Wang
Keywords: ensuring broad applicability, series data provide, Multivariate time series, ensuring broad, practical scenarios
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Multivariate time series data provide a robust framework for future predictions by leveraging information across multiple dimensions, ensuring broad applicability in practical scenarios. However, their high dimensionality and mixing patterns pose significant challenges in establishing an interpretable and explicit mapping between historical and future series, as well as extracting long-range feature dependencies. To address these challenges, we propose a channel-time dual unmixing network for multivariate time series forecasting (named MTS-UNMixers), which decomposes the entire series into critical bases and coefficients across both the time and channel dimensions. This approach establishes a robust sharing mechanism between historical and future series, enabling accurate representation and enhancing physical interpretability. Specifically, MTS-UNMixers represent sequences over time as a mixture of multiple trends and cycles, with the time-correlated representation coefficients shared across both historical and future time periods. In contrast, sequences over channels can be decomposed into multiple tick-wise bases, which characterize the channel correlations and are shared across the whole series. To estimate the shared time-dependent coefficients, a vanilla Mamba network is employed, leveraging its alignment with directional causality. Conversely, a bidirectional Mamba network is utilized to model the shared channel-correlated bases, accommodating noncausal relationships. Experimental results show that MTS-UNMixers significantly outperform existing methods on multiple benchmark datasets. The code is available at this https URL.

[LG-42] Integrating Dual Prototypes for Task-Wise Adaption in Pre-Trained Model-Based Class-Incremental Learning

Link: https://arxiv.org/abs/2411.17766
Authors: Zhiming Xu, Suorong Yang, Baile Xu, Jian Zhao, Furao Shen
Keywords: historical knowledge incrementally, conserving historical knowledge, Class-incremental learning, aims to acquire, conserving historical
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 8 pages, 6 figures, 2 tables

Abstract:Class-incremental learning (CIL) aims to acquire new classes while conserving historical knowledge incrementally. Despite existing pre-trained model (PTM) based methods performing excellently in CIL, it is better to fine-tune them on downstream incremental tasks with massive patterns unknown to PTMs. However, using task streams for fine-tuning could lead to catastrophic forgetting that will erase the knowledge in PTMs. This paper proposes the Dual Prototype network for Task-wise Adaption (DPTA) of PTM-based CIL. For each incremental learning task, a task-wise adapter module is built to fine-tune the PTM, where the center-adapt loss forces the representation to be more centrally clustered and class separable. The dual prototype network improves the prediction process by enabling test-time adapter selection, where the raw prototypes deduce several possible task indexes of test samples to select suitable adapter modules for PTM, and the augmented prototypes that could separate highly correlated classes are utilized to determine the final result. Experiments on several benchmark datasets demonstrate the state-of-the-art performance of DPTA. The code will be open-sourced after the paper is published.

[LG-43] Machine learning-based classification for Single Photon Space Debris Light Curves

Link: https://arxiv.org/abs/2411.18231
Authors: Nadine M. Trummer, Amit Reza, Michael A. Steindorfer, Christiane Helling
Keywords: Earth orbit poses, Earth orbit, active satellite missions, satellite missions due, Single Photon basis
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments:

Abstract: The growing amount of man-made debris in Earth’s orbit poses a threat to active satellite missions due to the risk of collision. Characterizing unknown debris is, therefore, of high interest. Light Curves (LCs) are temporal variations of object brightness and have been shown to contain information such as shape, attitude, and rotational state. Since 2015, the Satellite Laser Ranging (SLR) group of Space Research Institute (IWF) Graz has been building a space debris LC catalogue. The LCs are captured on a Single Photon basis, which sets them apart from CCD-based measurements. In recent years, Machine Learning (ML) models have emerged as a viable technique for analyzing LCs. This work aims to classify Single Photon Space Debris using the ML framework. We have explored LC classification using k-Nearest Neighbour (k-NN), Random Forest (RDF), XGBoost (XGB), and Convolutional Neural Network (CNN) classifiers in order to assess the difference in performance between traditional and deep models. Instead of performing classification on the raw LC data, we first extracted features from the data using an automated pipeline. We apply our models to three tasks: classifying individual objects, classifying objects grouped into families according to origin (e.g., GLONASS satellites), and grouping into general types (e.g., rocket bodies). We successfully classified Space Debris LCs captured on a Single Photon basis, obtaining accuracies as high as 90.7%. Further, our experiments show that the classifiers provide better classification accuracy with automated extracted features than other methods.

[LG-44] The Bigger the Better? Accurate Molecular Potential Energy Surfaces from Minimalist Neural Networks

Link: https://arxiv.org/abs/2411.18121
Authors: Silvan Käser, Debasish Koner, Markus Meuwly
Keywords: length scales, powerful tool, tool for studying, materials on wide, Atomistic simulations
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments:

Abstract:Atomistic simulations are a powerful tool for studying the dynamics of molecules, proteins, and materials on wide time and length scales. Their reliability and predictiveness, however, depend directly on the accuracy of the underlying potential energy surface (PES). Guided by the principle of parsimony this work introduces KerNN, a combined kernel/neural network-based approach to represent molecular PESs. Compared to state-of-the-art neural network PESs the number of learnable parameters of KerNN is significantly reduced. This speeds up training and evaluation times by several orders of magnitude while retaining high prediction accuracy. Importantly, using kernels as the features also improves the extrapolation capabilities of KerNN far beyond the coverage provided by the training data which solves a general problem of NN-based PESs. KerNN applied to spectroscopy and reaction dynamics shows excellent performance on test set statistics and observables including vibrational bands computed from classical and quantum simulations.

[LG-45] Using different sources of ground truths and transfer learning to improve the generalization of photometric redshift estimation

Link: https://arxiv.org/abs/2411.18054
Authors: Jonathan Soriano, Srinath Saikrishnan, Vikram Seenivasan, Bernie Boscoe, Jack Singal, Tuan Do
Keywords: improve galaxy redshift, galaxy redshift predictions, combining ground truth, ground truth redshifts, TransferZ
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 2 tables, accepted to NeurIPS 2024 Workshop ML4PS

Abstract: In this work, we explore methods to improve galaxy redshift predictions by combining different ground truths. Traditional machine learning models rely on training sets with known spectroscopic redshifts, which are precise but only represent a limited sample of galaxies. To make redshift models more generalizable to the broader galaxy population, we investigate transfer learning and directly combining ground truth redshifts derived from photometry and spectroscopy. We use the COSMOS2020 survey to create a dataset, TransferZ, which includes photometric redshift estimates derived from up to 35 imaging filters using template fitting. This dataset spans a wider range of galaxy types and colors compared to spectroscopic samples, though its redshift estimates are less accurate. We first train a base neural network on TransferZ and then refine it using transfer learning on a dataset of galaxies with more precise spectroscopic redshifts (GalaxiesML). In addition, we train a neural network on a combined dataset of TransferZ and GalaxiesML. Both methods reduce bias by ~5x, RMS error by ~1.5x, and catastrophic outlier rates by 1.3x on GalaxiesML, compared to a baseline trained only on TransferZ. However, we also find a reduction in performance for RMS and bias when evaluated on TransferZ data. Overall, our results demonstrate these approaches can meet cosmological requirements.

[LG-46] On the ERM Principle in Meta-Learning

Link: https://arxiv.org/abs/2411.17898
Authors: Yannay Alon, Steve Hanneke, Shay Moran, Uri Shalit
Keywords: Classic supervised learning, involves algorithms trained, Classic supervised, number, learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 20 pages

Abstract: Classic supervised learning involves algorithms trained on n labeled examples to produce a hypothesis h \in \mathcal{H} aimed at performing well on unseen examples. Meta-learning extends this by training across n tasks, with m examples per task, producing a hypothesis class \mathcal{H} within some meta-class \mathbb{H}. This setting applies to many modern problems such as in-context learning, hypernetworks, and learning-to-learn. A common method for evaluating the performance of supervised learning algorithms is through their learning curve, which depicts the expected error as a function of the number of training examples. In meta-learning, the learning curve becomes a two-dimensional learning surface, which evaluates the expected error on unseen domains for varying values of n (number of tasks) and m (number of training examples). Our findings characterize the distribution-free learning surfaces of meta-Empirical Risk Minimizers when either m or n tends to infinity: we show that the number of tasks must increase inversely with the desired error. In contrast, we show that the number of examples exhibits very different behavior: it satisfies a dichotomy where every meta-class conforms to one of the following conditions: (i) either m must grow inversely with the error, or (ii) a finite number of examples per task suffices for the error to vanish as n goes to infinity. This finding illustrates and characterizes cases in which a small number of examples per task is sufficient for successful learning. We further refine this for positive values of \varepsilon and identify for each \varepsilon how many examples per task are needed to achieve an error of \varepsilon in the limit as the number of tasks n goes to infinity. We achieve this by developing a necessary and sufficient condition for meta-learnability using a bounded number of examples per domain.

[LG-47] New Test-Time Scenario for Biosignal: Concept and Its Approach

Link: https://arxiv.org/abs/2411.17785
Authors: Yong-Yeon Jo, Byeong Tak Lee, Beom Joon Kim, Jeong-Ho Hong, Hak Seung Lee, Joon-myoung Kwon
Keywords: Online Test-Time Adaptation, enhances model robustness, updating pre-trained models, Online Test-Time, enhances model
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 6 pages

Abstract:Online Test-Time Adaptation (OTTA) enhances model robustness by updating pre-trained models with unlabeled data during testing. In healthcare, OTTA is vital for real-time tasks like predicting blood pressure from biosignals, which demand continuous adaptation. We introduce a new test-time scenario with streams of unlabeled samples and occasional labeled samples. Our framework combines supervised and self-supervised learning, employing a dual-queue buffer and weighted batch sampling to balance data types. Experiments show improved accuracy and adaptability under real-world conditions.

[LG-48] KACDP: A Highly Interpretable Credit Default Prediction Model

Link: https://arxiv.org/abs/2411.17783
Authors: Kun Liu, Jin Zhao
Keywords: credit risk prediction, individual credit risk, credit default, individual credit, Credit Default Predict
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)
Comments:

Abstract: In the field of finance, the prediction of individual credit default is of vital importance. However, existing methods face problems such as insufficient interpretability and transparency as well as limited performance when dealing with high-dimensional and nonlinear data. To address these issues, this paper introduces a method based on Kolmogorov-Arnold Networks (KANs). KANs are a new type of neural network architecture with learnable activation functions and no linear weights, which has potential advantages in handling complex multi-dimensional data. Specifically, this paper applies KANs to the field of individual credit risk prediction for the first time and constructs the Kolmogorov-Arnold Credit Default Predict (KACDP) model. Experiments show that the KACDP model outperforms mainstream credit default prediction models in performance metrics (ROC_AUC and F1 values). Meanwhile, through methods such as feature attribution scores and visualization of the model structure, the model’s decision-making process and the importance of different features are clearly demonstrated, providing a transparent and interpretable decision-making basis for financial institutions and meeting the industry’s strict requirements for model interpretability. In conclusion, the KACDP model constructed in this paper exhibits excellent predictive performance and satisfactory interpretability in individual credit risk prediction, providing an effective way to address the limitations of existing methods and offering a new and practical credit risk prediction tool for financial institutions.

[LG-49] MetaGraphLoc: A Graph-based Meta-learning Scheme for Indoor Localization via Sensor Fusion

Link: https://arxiv.org/abs/2411.17781
Authors: Yaya Etiabi, Eslam Eldeeb, Mohammad Shehab, Wafa Njima, Hirley Alves, Mohamed-Slim Alouini, El Mehdi Amhoud
Keywords: remains challenging due, limited data availability, Accurate indoor localization, localization remains challenging, Accurate indoor
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Accurate indoor localization remains challenging due to variations in wireless signal environments and limited data availability. This paper introduces MetaGraphLoc, a novel system leveraging sensor fusion, graph neural networks (GNNs), and meta-learning to overcome these limitations. MetaGraphLoc integrates received signal strength indicator measurements with inertial measurement unit data to enhance localization accuracy. Our proposed GNN architecture, featuring dynamic edge construction (DEC), captures the spatial relationships between access points and underlying data patterns. MetaGraphLoc employs a meta-learning framework to adapt the GNN model to new environments with minimal data collection, significantly reducing calibration efforts. Extensive evaluations demonstrate the effectiveness of MetaGraphLoc. Data fusion reduces localization error by 15.92%, underscoring its importance. The GNN with DEC outperforms traditional deep neural networks by up to 30.89%, considering accuracy. Furthermore, the meta-learning approach enables efficient adaptation to new environments, minimizing data collection requirements. These advancements position MetaGraphLoc as a promising solution for indoor localization, paving the way for improved navigation and location-based services in the ever-evolving Internet of Things networks.

[LG-50] Deciphering Acoustic Emission with Machine Learning

Link: https://arxiv.org/abs/2411.17755
Authors: Dénes Berta, Balduin Katzer, Katrin Schulz, Péter Dusán Ispánovity
Keywords: domain wall movement, Acoustic emission signals, accompany avalanche-like events, Acoustic emission, acoustic emission data
Categories: Signal Processing (eess.SP); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Acoustic emission signals have been shown to accompany avalanche-like events in materials, such as dislocation avalanches in crystalline solids, collapse of voids in porous matter or domain wall movement in ferroics. The data provided by acoustic emission measurements is tremendously rich, but it is rather challenging to precisely connect it to the characteristics of the triggering avalanche. In our work we propose a machine learning based method with which one can infer microscopic details of dislocation avalanches in micropillar compression tests from merely acoustic emission data. As it is demonstrated in the paper, this approach is suitable for the prediction of the force-time response as it can provide outstanding prediction for the temporal location of avalanches and can also predict the magnitude of individual deformation events. Various descriptors (including frequency dependent and independent ones) are utilised in our machine learning approach and their importance in the prediction is analysed. The transferability of the method to other specimen sizes is also demonstrated and the possible application in more generic settings is discussed.

[LG-51] Path Loss Prediction Using Deep Learning

Link: https://arxiv.org/abs/2411.17752
Authors: Ryan Dempsey, Jonathan Ethier, Halim Yanikomeroglu
Keywords: Radio deployments, deployments and spectrum, spectrum planning, planning can benefit, Radio
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures, 3 tables

Abstract:Radio deployments and spectrum planning can benefit from path loss predictions. Obstructions along a communications link are often considered implicitly or through derived metrics such as representative clutter height or total obstruction depth. In this paper, we propose a path-specific path loss prediction method that uses convolutional neural networks to automatically perform feature extraction from high-resolution obstruction height maps. Our methods result in low prediction error in a variety of environments without requiring derived obstruction metrics.

[LG-52] Deployment of ARX Models for Thermal Forecasting in Power Electronics Boards Using WBG Semiconductors

Link: https://arxiv.org/abs/2411.17748
Authors: Mohammed Riadh Berramdane (IFPEN), Alexandre Battiston (IFPEN), Michele Bardi (IFPEN), Nicolas Blet (LEMTA), Benjamin Rémy (LEMTA), Matthieu Urbain (LEMTA)
Keywords: material physical properties, Wide Bandgap, provide accurate temperature, accurate temperature predictions, requiring detailed understanding
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: in French language. Conférence des jeunes chercheurs en génie électrique, CNRS; GDR SEEDS, Jun 2024, Le Croisic, France

Abstract:Facing the thermal management challenges of Wide Bandgap (WBG) semiconductors, this study highlights the use of ARX parametric models, which provide accurate temperature predictions without requiring detailed understanding of component thickness disparities or material physical properties, relying solely on experimental measurements. These parametric models emerge as a reliable alternative to FEM simulations and conventional thermal models, significantly simplifying system identification while ensuring high result accuracy.
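
The abstract does not state the model orders or which signals serve as input and output, so the following is only a minimal sketch of how a generic first-order ARX model can be identified from measurements by least squares. It assumes the form y[t] = a*y[t-1] + b*u[t-1], with a hypothetical square-wave input standing in for dissipated power and a synthetic output standing in for temperature; the function names and data are illustrative and not taken from the paper.

```python
def simulate_arx(a, b, u, y0=0.0):
    """Generate y[t] = a*y[t-1] + b*u[t-1] for a given input sequence u."""
    y = [y0]
    for t in range(1, len(u)):
        y.append(a * y[t - 1] + b * u[t - 1])
    return y


def fit_arx_1_1(u, y):
    """Least-squares estimate of (a, b) in y[t] = a*y[t-1] + b*u[t-1].

    Solves the 2x2 normal equations directly, so no external library is needed.
    """
    s_yy = s_yu = s_uu = r_y = r_u = 0.0
    for t in range(1, len(y)):
        y_prev, u_prev = y[t - 1], u[t - 1]
        s_yy += y_prev * y_prev
        s_yu += y_prev * u_prev
        s_uu += u_prev * u_prev
        r_y += y_prev * y[t]
        r_u += u_prev * y[t]
    det = s_yy * s_uu - s_yu * s_yu
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; the input is not informative enough")
    a = (r_y * s_uu - r_u * s_yu) / det
    b = (s_yy * r_u - s_yu * r_y) / det
    return a, b


# Hypothetical data (assumption): square-wave input, noise-free synthetic output.
u = [1.0 if (t // 5) % 2 == 0 else 0.0 for t in range(200)]
y = simulate_arx(0.9, 0.5, u)
a_hat, b_hat = fit_arx_1_1(u, y)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

With real measurements the output would be noisy and higher model orders may be needed; the same normal-equation idea extends to more lagged terms.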

[LG-53] Comparison of Tiny Machine Learning Techniques for Embedded Acoustic Emission Analysis

Link: https://arxiv.org/abs/2411.17733
Authors: Uditha Muthumala, Yuxuan Zhang, Luciano Sebastian Martinez-Rau, Sebastian Bader
Keywords: acoustic emission, paper compares machine, input data formats, machine learning, compares machine learning
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Conference Presentations (Accepted) at IEEE 10th World Forum on Internet of Things.

Abstract:This paper compares machine learning approaches with different input data formats for the classification of acoustic emission (AE) signals. AE signals are a promising monitoring technique in many structural health monitoring applications. Machine learning has been demonstrated as an effective data analysis method, classifying different AE signals according to the damage mechanism they represent. These classifications can be performed based on the entire AE waveform or specific features that have been extracted from it. However, it is currently unknown which of these approaches is preferred. With the goal of model deployment on resource-constrained embedded Internet of Things (IoT) systems, this work evaluates and compares both approaches in terms of classification accuracy, memory requirement, processing time, and energy consumption. To accomplish this, features are extracted and carefully selected, neural network models are designed and optimized for each input data scenario, and the models are deployed on a low-power IoT node. The comparative analysis reveals that all models can achieve high classification accuracies of over 99%, but that embedded feature extraction is computationally expensive. Consequently, models utilizing the raw AE signal as input have the fastest processing speed and thus the lowest energy consumption, which comes at the cost of a larger memory requirement.

[LG-54] Analytic Continuation by Feature Learning

Link: https://arxiv.org/abs/2411.17728
Authors: Zhe Zhao, Jingping Xu, Ce Wang, Yaping Yang
Keywords: imaginary-time Green functions, reconstruct real-time spectral, real-time spectral functions, imaginary-time Green, Feature Learning Network
Categories: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Signal Processing (eess.SP); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Comments: 8 pages, 9 figures

Abstract:Analytic continuation aims to reconstruct real-time spectral functions from imaginary-time Green’s functions; however, this process is notoriously ill-posed and challenging to solve. We propose a novel neural network architecture, named the Feature Learning Network (FL-net), to enhance the prediction accuracy of spectral functions, achieving an improvement of at least 20% over traditional methods, such as the Maximum Entropy Method (MEM), and previous neural network approaches. Furthermore, we develop an analytical method to evaluate the robustness of the proposed network. Using this method, we demonstrate that increasing the hidden dimensionality of FL-net, while leading to lower loss, results in decreased robustness. Overall, our model provides valuable insights into effectively addressing the complex challenges associated with analytic continuation.

[LG-55] EQNN: Enhanced Quantum Neural Network

Link: https://arxiv.org/abs/2411.17726
Authors: Abel C. H. Chen
Keywords: Quantum Neural Networks, Enhanced Feature Map, Enhanced Quantum Neural, quantum computing technology, gradually shifted
Categories: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: in Chinese language

Abstract:With the maturation of quantum computing technology, research has gradually shifted towards exploring its applications. Alongside the rise of artificial intelligence, various machine learning methods have been developed into quantum circuits and algorithms. Among them, Quantum Neural Networks (QNNs) can map inputs to quantum circuits through Feature Maps (FMs) and adjust parameter values via variational models, making them applicable in regression and classification tasks. However, designing a FM that is suitable for a given application problem is a significant challenge. In light of this, this study proposes an Enhanced Quantum Neural Network (EQNN), which includes an Enhanced Feature Map (EFM) designed in this research. This EFM effectively maps input variables to a value range more suitable for quantum computing, serving as the input to the variational model to improve accuracy. In the experimental environment, this study uses mobile data usage prediction as a case study, recommending appropriate rate plans based on users’ mobile data usage. The proposed EQNN is compared with current mainstream QNNs, and experimental results show that the EQNN achieves higher accuracy with fewer quantum logic gates and converges to the optimal solution faster under different optimization algorithms.

[LG-56] Automatic EEG Independent Component Classification Using ICLabel in Python

Link: https://arxiv.org/abs/2411.17721
Authors: Arnaud Delorme, Dung Truong, Luca Pion-Tonachini, Scott Makeig
Keywords: important plug-in function, EEG data processing, Independent Component Analysis, EEG data, EEG data involves
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:ICLabel is an important plug-in function in EEGLAB, the most widely used software for EEG data processing. A powerful approach to automated processing of EEG data involves decomposing the data by Independent Component Analysis (ICA) and then classifying the resulting independent components (ICs) using ICLabel. While EEGLAB pipelines support high-performance computing (HPC) platforms running the open-source Octave interpreter, the ICLabel plug-in is incompatible with Octave because of its specialized neural network architecture. To enhance cross-platform compatibility, we developed a Python version of ICLabel that uses standard EEGLAB data structures. We compared the MATLAB and Python implementations of ICLabel on data from 14 subjects. For each ICA component, ICLabel returns the likelihood of classification in each of 7 component classes. The returned IC classifications were virtually identical between Python and MATLAB, with differences in classification percentage below 0.001%.

[LG-57] Comprehensive Methodology for Sample Augmentation in EEG Biomarker Studies for Alzheimer's Risk Classification

Link: https://arxiv.org/abs/2411.17717
Authors: Veronica Henao Isaza, David Aguillon, Carlos Andres Tobon Quintero, Francisco Lopera, John Fredy Ochoa Gomez
Keywords: global health challenge, marked by cognitive, cognitive decline, health challenge, global health
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 20 pages, 7 figures, 2 tables

Abstract:Background: Dementia, marked by cognitive decline, is a global health challenge. Alzheimer’s disease (AD), the leading type, accounts for ~70% of cases. Electroencephalography (EEG) measures show promise in identifying AD risk, but obtaining large samples for reliable comparisons is challenging. Objective: This study integrates signal processing, harmonization, and statistical techniques to enhance sample size and improve AD risk classification reliability. Methods: We used advanced EEG preprocessing, feature extraction, harmonization, and propensity score matching (PSM) to balance healthy non-carriers (HC) and asymptomatic E280A mutation carriers (ACr). Data from four databases were harmonized to adjust site effects while preserving covariates like age and sex. PSM ratios (2:1, 5:1, 10:1) were applied to assess sample size impact on model performance. The final dataset underwent machine learning analysis with decision trees and cross-validation for robust results. Results: Balancing sample sizes via PSM significantly improved classification accuracy, ranging from 0.92 to 0.96 across ratios. This approach enabled precise risk identification even with limited samples. Conclusion: Integrating data processing, harmonization, and balancing techniques improves AD risk classification accuracy, offering potential for other neurodegenerative diseases.

[LG-58] Quantity versus Diversity: Influence of Data on Detecting EEG Pathology with Advanced ML Models

Link: https://arxiv.org/abs/2411.17709
Authors: Martyna Poziomska, Marian Dovgialo, Przemysław Olbratowski, Paweł Niedbalski, Paweł Ogniewski, Joanna Zych, Jacek Rogala, Jarosław Żygierewicz
Keywords: general EEG pathology, detecting general EEG, Temple University Hospital, EEG pathology, study investigates
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 20 pages, 17 figures

Abstract:This study investigates the impact of quantity and diversity of data on the performance of various machine-learning models for detecting general EEG pathology. We utilized an EEG dataset of 2,993 recordings from Temple University Hospital and a dataset of 55,787 recordings from Elmiko Biosignals sp. z o.o. The latter contains data from 39 hospitals and a diverse patient set with varied conditions. Thus, we introduce the Elmiko dataset - the largest publicly available EEG corpus. Our findings show that small and consistent datasets enable a wide range of models to achieve high accuracy; however, variations in pathological conditions, recording protocols, and labeling standards lead to significant performance degradation. Nonetheless, increasing the number of available recordings improves predictive accuracy and may even compensate for data diversity, particularly in neural networks based on attention mechanism or transformer architecture. A meta-model that combined these networks with a gradient-boosting approach using handcrafted features demonstrated superior performance across varied datasets.

[LG-59] Probabilistic Forecasting of Radiation Exposure for Spaceflight

Link: https://arxiv.org/abs/2411.17703
Authors: Rutuja Gurav, Elena Massara, Xiaomei Song, Kimberly Sinclair, Edward Brown, Matt Kusner, Bala Poduval, Atilim Gunes Baydin
Keywords: Extended human presence, Moon and Mars, pose significant challenges, Extended human, Mars will pose
Categories: Space Physics (physics.space-ph); Machine Learning (cs.LG)
Comments:

Abstract:Extended human presence beyond low-Earth orbit (BLEO) during missions to the Moon and Mars will pose significant challenges in the near future. A primary health risk associated with these missions is radiation exposure, primarily from galactic cosmic rays (GCRs) and solar proton events (SPEs). While GCRs present a more consistent, albeit modulated threat, SPEs are harder to predict and can deliver acute doses over short periods. Currently, NASA utilizes analytical tools for monitoring the space radiation environment in order to make decisions of immediate action to shelter astronauts. However, this reactive approach could be significantly enhanced by predictive models that can forecast radiation exposure in advance, ideally hours ahead of major events, while providing estimates of prediction uncertainty to improve decision-making. In this work we present a machine learning approach for forecasting radiation exposure in BLEO using multimodal time-series data including direct solar imagery from Solar Dynamics Observatory, X-ray flux measurements from GOES missions, and radiation dose measurements from the BioSentinel satellite that was launched as part of the Artemis 1 mission. To our knowledge, this is the first time full-disk solar imagery has been used to forecast radiation exposure. We demonstrate that our model can predict the onset of increased radiation due to an SPE, as well as the radiation decay profile after an event has occurred.

[LG-60] Finding “Good Views” of Electrocardiogram Signals for Inferring Abnormalities in Cardiac Condition

Link: https://arxiv.org/abs/2411.17702
Authors: Hyewon Jeong, Suyeol Yun, Hammaad Adam
Keywords: technique to screen, screen for abnormal, abnormal cardiac signals, Electrocardiograms, established technique
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract:Electrocardiograms (ECGs) are an established technique to screen for abnormal cardiac signals. Recent work has established that it is possible to detect arrhythmia directly from the ECG signal using deep learning algorithms. While a few prior approaches with contrastive learning have been successful, the best way to define a positive sample remains an open question. In this project, we investigate several ways to define positive samples, and assess which approach yields the best performance in a downstream task of classifying arrhythmia. We explore spatiotemporal invariances, generic augmentations, demographic similarities, cardiac rhythms, and wave attributes of ECG as potential ways to match positive samples. We then evaluate each strategy with downstream task performance, and find that learned representations invariant to patient identity are powerful in arrhythmia detection. We made our code available in: this https URL

[LG-61] When Is Heterogeneity Actionable for Personalization?

Link: https://arxiv.org/abs/2411.16552
Authors: Anya Shchetkina, Ron Berman
Keywords: personalization, heterogeneity, uniform policy, performing treatment, gain
Categories: Applications (stat.AP); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:Targeting and personalization policies can be used to improve outcomes beyond the uniform policy that assigns the best performing treatment in an A/B test to everyone. Personalization relies on the presence of heterogeneity of treatment effects, yet, as we show in this paper, heterogeneity alone is not sufficient for personalization to be successful. We develop a statistical model to quantify “actionable heterogeneity,” or the conditions when personalization is likely to outperform the best uniform policy. We show that actionable heterogeneity can be visualized as crossover interactions in outcomes across treatments and depends on three population-level parameters: within-treatment heterogeneity, cross-treatment correlation, and the variation in average responses. Our model can be used to predict the expected gain from personalization prior to running an experiment and also allows for sensitivity analysis, providing guidance on how changing treatments can affect the personalization gain. To validate our model, we apply five common personalization approaches to two large-scale field experiments with many interventions that encouraged flu vaccination. We find an 18% gain from personalization in one and a more modest 4% gain in the other, which is consistent with our model. Counterfactual analysis shows that this difference in the gains from personalization is driven by a drastic difference in within-treatment heterogeneity. However, reducing cross-treatment correlation holds a larger potential to further increase personalization gains. Our findings provide a framework for assessing the potential from personalization and offer practical recommendations for improving gains from targeting in multi-intervention settings.
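
To make the comparison between a uniform policy and a personalized one concrete, here is a small illustrative sketch. It is not the authors' statistical model: it assumes the expected outcome of every treatment is known exactly for each population segment (an oracle setting), and the segment names, treatment labels, and numbers are all hypothetical. The gain is positive only when the best treatment differs across segments, which is the crossover pattern the abstract describes.

```python
def uniform_value(outcomes, weights):
    """Expected outcome when everyone receives the single best treatment."""
    treatments = next(iter(outcomes.values())).keys()
    return max(
        sum(weights[seg] * outcomes[seg][t] for seg in outcomes)
        for t in treatments
    )


def personalized_value(outcomes, weights):
    """Expected outcome when each segment receives its own best treatment."""
    return sum(weights[seg] * max(outcomes[seg].values()) for seg in outcomes)


def personalization_gain(outcomes, weights):
    """Difference between the personalized and the best uniform policy."""
    return personalized_value(outcomes, weights) - uniform_value(outcomes, weights)


# Hypothetical expected outcomes per segment and treatment (assumption).
weights = {"segment_1": 0.5, "segment_2": 0.5}

# Crossover: treatment B is best for segment_1, treatment A for segment_2.
crossover = {
    "segment_1": {"A": 0.10, "B": 0.30},
    "segment_2": {"A": 0.35, "B": 0.15},
}

# No crossover: treatment B is best for both segments, despite heterogeneity.
no_crossover = {
    "segment_1": {"A": 0.10, "B": 0.30},
    "segment_2": {"A": 0.15, "B": 0.35},
}

print(personalization_gain(crossover, weights))
print(personalization_gain(no_crossover, weights))
```

In the second table the treatment effects still vary across segments, yet the gain is zero because one treatment wins everywhere, which illustrates why heterogeneity alone is not sufficient.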

Information Retrieval

[IR-0] Delineating Feminist Studies through bibliometric analysis

Link: https://arxiv.org/abs/2411.18306
Authors: Natsumi S. Shokida, Diego Kozlowski, Vincent Larivière
Keywords: Feminist Studies presents, socially anchored nature, Studies presents unique, feminist and LGBTQIA, Gender Studies
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 2 tables, 5 figures

Abstract:The multidisciplinary and socially anchored nature of Feminist Studies presents unique challenges for bibliometric analysis, as this research area transcends traditional disciplinary boundaries and reflects discussions from feminist and LGBTQIA+ social movements. This paper proposes a novel approach for identifying gender/sex related publications scattered across diverse scientific disciplines. Using the Dimensions database, we employ bibliometric techniques, natural language processing (NLP) and manual curation to compile a dataset of scientific publications that allows for the analysis of Gender Studies and its influence across different disciplines. This is achieved through a methodology that combines a core of specialized journals with a comprehensive keyword search over titles. These keywords are obtained by applying Topic Modeling (BERTopic) to the corpus of titles and abstracts from the core. This methodological strategy, divided into two stages, reflects the dynamic interaction between Gender Studies and its dialogue with different disciplines. This hybrid system surpasses basic keyword search by mitigating potential biases introduced through manual keyword enumeration. The resulting dataset comprises over 1.9 million scientific documents published between 1668 and 2023, spanning four languages. This dataset enables a characterization of Gender Studies in terms of addressed topics, citation and collaboration dynamics, and institutional and regional participation. By addressing the methodological challenges of studying “more-than-disciplinary” research areas, this approach could also be adapted to delineate other conversations where disciplinary boundaries are difficult to disentangle. 

[IR-1] The Rn-index: a more accurate variant of the Rk-index

Link: https://arxiv.org/abs/2411.18161
Authors: Alonso Rodriguez-Navarro
Keywords: common bibliometric indicators, bibliometric indicators, pushing the boundaries, boundaries of knowledge, critical metric
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 6 pages; 2 figures; 4 tables

Abstract:The contribution to pushing the boundaries of knowledge is a critical metric for evaluating the research performance of countries and institutions, which in many cases is not revealed by common bibliometric indicators. The Rk-index was specifically designed to assess such contributions, and the Rn-index is a variant that corrects the weakness of the Rk-index, particularly in the evaluation of countries that produce a high proportion of global advancements. This is the case of the USA and China in many technological fields. Additionally, the Rn-index is simple to calculate and understand, as it involves only summing the ratios between the local and global ranks of papers, ordered by their citation count. Moreover, the Rn-index may also be fractionally counted.
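
The calculation described in the abstract (summing the ratios between local and global ranks of papers ordered by citation count) can be sketched as follows. This is a minimal reading of that one sentence, not the paper's reference implementation: the input layout, the tie-breaking by paper identifier, and the absence of any truncation or fractional counting are assumptions made for illustration.

```python
def rn_index(papers):
    """Sum of local_rank / global_rank over one unit's papers.

    papers: iterable of (paper_id, citations, is_local) tuples covering the
    whole field; is_local marks papers of the country or institution
    being evaluated.
    """
    # Order all papers by citations, highest first; ties broken by id (assumption).
    ordered = sorted(papers, key=lambda p: (-p[1], p[0]))
    total = 0.0
    local_rank = 0
    for global_rank, (_, _, is_local) in enumerate(ordered, start=1):
        if is_local:
            local_rank += 1
            total += local_rank / global_rank
    return total


# Hypothetical field of four papers, two of which belong to the evaluated unit.
field = [
    ("p1", 100, True),
    ("p2", 90, False),
    ("p3", 80, True),
    ("p4", 10, False),
]
print(rn_index(field))
```

Here the unit's top paper is also the global top paper (ratio 1/1) and its second paper ranks third globally (ratio 2/3). A unit that authored every paper in the field would score one per paper.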

[IR-2] Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track

Link: https://arxiv.org/abs/2411.18069
Authors: Deepak Gupta, Dina Demner-Fushman, William Hersh, Steven Bedrick, Kirk Roberts
Keywords: lay language summarization, clinical note summarization, large language models, language summarization, lay language
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:With the advancement of large language models (LLMs), the biomedical domain has seen significant progress and improvement in multiple tasks such as biomedical question answering, lay language summarization of the biomedical literature, clinical note summarization, etc. However, hallucinations or confabulations remain one of the key challenges when using LLMs in the biomedical and other domains. Inaccuracies may be particularly harmful in high-risk situations, such as making clinical decisions or appraising biomedical research. Studies on the evaluation of the LLMs’ abilities to ground generated statements in verifiable sources have shown that models perform significantly worse on lay-user generated questions, and often fail to reference relevant sources. This can be problematic when those seeking information want evidence from studies to back up the claims from LLMs[3]. Unsupported statements are a major barrier to using LLMs in any applications that may affect health. Methods for grounding generated statements in reliable sources along with practical evaluation approaches are needed to overcome this barrier. Towards this, in our pilot task organized at TREC 2024, we introduced the task of reference attribution as a means to mitigate the generation of false statements by LLMs answering biomedical questions.

Attachment Download

Click to download the full list of today's papers
