链接: https://arxiv.org/abs/2409.18110 作者: Hung-Ting Chen,Eunsol Choi 关键词-EN: harm than good, contentious question, Subjective questions, question, Retrieval Diversity 类目: Computation and Language (cs.CL); Information Retrieval (cs.IR) 备注:
点击查看摘要
Abstract:We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., will ChatGPT do more harm than good?). We curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and diverse perspectives associated with the question, sourced from survey questions and debate websites. On this data, retrievers paired with a corpus are evaluated to surface a document set that contains diverse perspectives. Our framing diverges from most retrieval tasks in that document relevancy cannot be decided by simple string matches to references. Instead, we build a language model based automatic evaluator that decides whether each retrieved document contains a perspective. This allows us to evaluate the performance of three different types of corpus (Wikipedia, web snapshot, and corpus constructed on the fly with retrieved pages from the search engine) paired with retrievers. Retrieving diverse documents remains challenging, with the outputs from existing retrievers covering all perspectives on only 33.74% of the examples. We further study the impact of query expansion and diversity-focused reranking approaches and analyze retriever sycophancy. Together, we lay the foundation for future studies in retrieval diversity handling complex queries.
摘要:我们研究如何检索一组涵盖复杂且有争议问题的各种观点的文档(例如,ChatGPT 是否会带来更多危害而非益处?)。我们构建了一个主观问题检索多样性基准 (BERDS),其中每个示例包含一个问题及其相关的多样化观点,这些观点来源于调查问卷和辩论网站。在此数据集上,结合语料库的检索器被评估以呈现包含多样化观点的文档集。我们的框架与大多数检索任务不同,因为文档的相关性不能通过简单的字符串匹配来确定。相反,我们构建了一个基于语言模型的自动评估器,用于判断每个检索到的文档是否包含某种观点。这使我们能够评估三种不同类型的语料库(维基百科、网页快照以及通过搜索引擎检索页面即时构建的语料库)与检索器配对时的性能。检索多样化文档仍然具有挑战性,现有检索器的输出仅在 33.74% 的示例中涵盖了所有观点。我们进一步研究了查询扩展和专注于多样性的重排序方法的影响,并分析了检索器的盲从性。综上所述,我们为未来处理复杂查询的检索多样性研究奠定了基础。
[NLP-1] Infer Humans Intentions Before Following Natural Language Instructions
【速读】: 该论文试图解决AI代理在遵循自然语言指令完成日常协作任务时,由于人类指令固有的模糊性而导致的执行失败问题。解决方案的关键在于提出了一个新的框架——Follow Instructions with Social and Embodied Reasoning (FISER),该框架通过显式推理人类目标和意图作为中间推理步骤,从而更好地处理指令中的模糊性。通过使用基于Transformer的模型,FISER在HandMeThat基准测试中表现优异,显著超越了纯粹的端到端方法和现有的强基线模型,达到了该领域的最新技术水平。
Abstract:For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.
摘要:为了使 AI 智能体对人类有所帮助,它们应当能够遵循自然语言指令,在人类环境中完成日常的合作任务。然而,真实的人类指令本身具有模糊性,因为说话者假设听者对其隐藏的目标和意图有足够的先验知识。标准的语言接地和规划方法无法解决这种模糊性,因为它们没有将人类的内在目标建模为环境中的额外部分可观察因素。我们提出了一种新的框架,即遵循指令与社会和具身推理 (Follow Instructions with Social and Embodied Reasoning, FISER),旨在更好地遵循合作具身任务中的自然语言指令。我们的框架明确地将人类目标和意图作为中间推理步骤进行推断。我们实现了一系列基于 Transformer 的模型,并在一个具有挑战性的基准测试 HandMeThat 上进行了评估。实证结果表明,在制定行动计划之前,使用社会推理明确推断人类意图的方法优于纯粹的端到端方法。我们还将其与强大的基线方法进行了比较,包括在最大可用预训练大语言模型上进行的思维链提示 (Chain of Thought prompting),发现 FISER 在所研究的具身社会推理任务中表现更佳,达到了 HandMeThat 上的最新技术水平。
[NLP-2] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning EMNLP2024
链接: https://arxiv.org/abs/2409.18046 作者: Soeun Lee,Si-Woo Kim,Taewhan Kim,Dong-Jin Kim 关键词-EN: Recent advancements, paired image-text data, explored text-only training, text-only training, overcome the limitations 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: Accepted to EMNLP 2024
点击查看摘要
Abstract:Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ( \textbfI mage-like Retrieval and \textbfF requency-based Entity Filtering for Zero-shot \textbfCap tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.
摘要:近年来,图像描述生成领域的进展探索了仅使用文本数据的训练方法,以克服配对图像-文本数据的局限性。然而,现有的仅使用文本数据的训练方法往往忽略了训练过程中使用文本数据与推理过程中使用图像之间的模态差异。为了解决这一问题,我们提出了一种名为“类图像检索”的新方法,该方法通过将文本特征与视觉相关特征对齐来缓解模态差异。我们的方法通过设计一个融合模块,将检索到的描述与输入特征相结合,进一步提高了生成描述的准确性。此外,我们还引入了一种基于频率的实体过滤技术,显著提升了描述质量。我们将这些方法整合到一个统一的框架中,称之为IFCap(Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning)。通过广泛的实验验证,我们这种简单而强大的方法展示了其有效性,在图像描述生成和视频描述生成方面,相比仅基于文本训练的零样本描述生成方法,显著优于当前最先进的方法。
[NLP-3] Unveiling the Role of Pretraining in Direct Speech Translation EMNLP2024
链接: https://arxiv.org/abs/2409.18044 作者: Belen Alastruey,Gerard I. Gállego,Marta R. Costa-jussà 关键词-EN: data scarcity, translation systems encounter, encounter an important, important drawback, drawback in data 类目: Computation and Language (cs.CL) 备注: EMNLP 2024
点击查看摘要
Abstract:Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists on pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.
摘要:直接语音到文本翻译系统面临数据稀缺的重要缺陷。常见的解决方案是在自动语音识别上预训练编码器,从而在训练过程中损失效率。在本研究中,我们比较了使用预训练编码器的系统、传统方法以及从头开始训练的系统的训练动态。我们观察到,在整个训练过程中,随机初始化的模型在预测时难以整合语音输入的信息。因此,我们假设这一问题源于有效训练直接语音翻译编码器的困难。虽然从头开始训练的模型需要同时学习声学和语义建模,但预训练的模型只需专注于后者。基于这些发现,我们提出了一种微妙的解码器交叉注意力变化,以在训练的早期步骤中整合源信息。我们展示了通过这一变化,从头开始训练的模型可以达到与预训练模型相当的性能,同时减少训练时间。
[NLP-4] EMOVA: Empowering Language Models to See Hear and Speak with Vivid Emotions
链接: https://arxiv.org/abs/2409.18042 作者: Kai Chen,Yunhao Gou,Runhui Huang,Zhili Liu,Daxin Tan,Jing Xu,Chunwei Wang,Yi Zhu,Yihan Zeng,Kuo Yang,Dingdong Wang,Kun Xiang,Haoyuan Li,Haoli Bai,Jianhua Han,Xiaohui Li,Weike Jin,Nian Xie,Yu Zhang,James T. Kwok,Hengshuang Zhao,Xiaodan Liang,Dit-Yan Yeung,Xiao Chen,Zhenguo Li,Wei Zhang,Qun Liu,Lanqing Hong,Lu Hou,Hang Xu 关键词-EN: Large Language Models, enables vocal conversations, Large Language, empowering Large Language, enable Large Language 类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL) 备注: Project Page: this https URL
点击查看摘要
Abstract:GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.
摘要:GPT-4o,一种能够实现带有多种情感和语调的语音对话的全模态模型,标志着全模态基础模型的一个重要里程碑。然而,在开源社区中,使大语言模型能够感知和生成图像、文本和语音的端到端能力,并利用公开数据仍然是一个挑战。现有的视觉-语言模型依赖于外部工具进行语音处理,而语音-语言模型仍然面临视觉理解能力有限甚至缺失的问题。为了填补这一空白,我们提出了EMOVA(EMotionally Omni-present Voice Assistant),以赋予大语言模型端到端的语音能力,同时保持领先的视觉-语言性能。通过一种语义-声学解耦的语音Token化器,我们发现全模态对齐可以进一步增强视觉-语言和语音能力,相比于相应的双模态对齐模型。此外,我们还提出了一种轻量级的风格模块,用于灵活的语音风格控制(例如,情感和音调)。首次,EMOVA在视觉-语言和语音基准测试中均达到了最先进的性能,同时支持带有生动情感的全模态语音对话。
[NLP-5] Automated Detection and Analysis of Power Words in Persuasive Text Using Natural Language Processing
链接: https://arxiv.org/abs/2409.18033 作者: Sahil Garje 关键词-EN: influence readers’ behavior, evoke strong emotional, strong emotional responses, significantly influence readers’, Power words 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Power words are terms that evoke strong emotional responses and significantly influence readers’ behavior, playing a crucial role in fields like marketing, politics, and motivational writing. This study proposes a methodology for the automated detection and analysis of power words in persuasive text using a custom lexicon and the TextBlob library in Python. By identifying the presence and frequency of power words within a given text, we aim to classify and analyze their impact on sentiment and reader engagement. This research examines diverse datasets across various domains to provide insights into the effectiveness of power words, offering practical applications for content creators, advertisers, and policymakers.
摘要:影响力词汇是指能够引发强烈情感反应并显著影响读者行为的术语,在营销、政治和励志写作等领域中发挥着至关重要的作用。本研究提出了一种利用自定义词典和 Python 中的 TextBlob 库来自动检测和分析说服性文本中影响力词汇的方法。通过识别文本中影响力词汇的存在及其频率,我们旨在对其对情感和读者参与度的影响进行分类和分析。本研究考察了跨多个领域的多样化数据集,以提供关于影响力词汇有效性的见解,为内容创作者、广告商和政策制定者提供实际应用。
[NLP-6] Compositional Hardness of Code in Large Language Models – A Probabilistic Perspective
【速读】: 该论文试图解决大型语言模型(LLM)在处理复杂分析任务(如代码生成)时,由于上下文窗口限制导致的组合难题(in-context hardness of composition)。解决方案的关键在于通过多智能体系统(multi-agent system)将分解后的子任务分布到多个LLM中,从而降低生成复杂度。论文通过理论证明和实证研究,展示了在同一上下文中解决组合问题的生成复杂度与分布式多智能体系统之间的指数级差距。
链接: https://arxiv.org/abs/2409.18028 作者: Yotam Wolf,Binyamin Rothberg,Dorin Shteyman,Amnon Shashua 关键词-EN: large language model, complex analytical tasks, model context window, model context, usage for complex 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:A common practice in large language model (LLM) usage for complex analytical tasks such as code generation, is to sample a solution for the entire task within the model’s context window. Previous works have shown that subtask decomposition within the model’s context (chain of thought), is beneficial for solving such tasks. In this work, we point a limitation of LLMs’ ability to perform several sub-tasks within the same context window - an in-context hardness of composition, pointing to an advantage for distributing a decomposed problem in a multi-agent system of LLMs. The hardness of composition is quantified by a generation complexity metric, i.e., the number of LLM generations required to sample at least one correct solution. We find a gap between the generation complexity of solving a compositional problem within the same context relative to distributing it among multiple agents, that increases exponentially with the solution’s length. We prove our results theoretically and demonstrate them empirically.
摘要:在大语言模型 (LLM) 用于代码生成等复杂分析任务时,常见做法是在模型的上下文窗口内采样整个任务的解决方案。先前的工作表明,在模型的上下文内进行子任务分解(即思维链),对解决此类任务是有益的。在本研究中,我们指出 LLM 在同一上下文窗口内执行多个子任务的能力存在局限性——即上下文内的组合难度,这表明在多智能体系统中分解问题具有优势。组合难度通过生成复杂度指标量化,即采样至少一个正确解决方案所需的大语言模型生成次数。我们发现,相对于在多个智能体之间分配问题,在同一上下文中解决组合问题的生成复杂度存在差距,并且随着解决方案长度的增加,这一差距呈指数级增长。我们通过理论证明和实证验证了这些结果。
[NLP-7] An Adversarial Perspective on Machine Unlearning for AI Safety
链接: https://arxiv.org/abs/2409.18025 作者: Jakub Łucki,Boyi Wei,Yangsibo Huang,Peter Henderson,Florian Tramèr,Javier Rando 关键词-EN: Large language models, Large language, finetuned to refuse, Large, hazardous knowledge 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR) 备注:
点击查看摘要
Abstract:Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
摘要:大语言模型经过微调以拒绝关于危险知识的提问,但这些保护措施往往可以被绕过。遗忘方法旨在完全移除模型的危险能力,使其无法被对手访问。本文从对抗角度探讨了遗忘与传统安全后训练之间的根本差异。我们证明,先前被认为对遗忘无效的现有越狱方法,在谨慎应用时可以成功。此外,我们开发了多种自适应方法,恢复了大部分被认为已遗忘的能力。例如,我们展示了在10个不相关的示例上进行微调或在激活空间中移除特定方向,可以恢复使用RMU(一种最先进的遗忘方法)编辑的模型的大部分危险能力。我们的研究结果挑战了当前遗忘方法的鲁棒性,并质疑其在安全训练上的优势。
[NLP-8] DARE: Diverse Visual Question Answering with Robustness Evaluation
链接: https://arxiv.org/abs/2409.18023 作者: Hannah Sterz,Jonas Pfeiffer,Ivan Vulić 关键词-EN: Vision Language Models, text-only large language, Vision Language, large language models, extend remarkable capabilities 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Vision Language Models (VLMs) extend remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on the variations of: prompts, the subsets of answer options, the output format and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of the open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match the closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.
摘要:视觉语言模型 (Vision Language Models, VLMs) 扩展了仅文本大语言模型和仅视觉模型的显著能力,能够从多模态的视觉文本输入中学习和处理信息。尽管现代 VLMs 在许多标准图像分类和图文匹配任务中表现出色,但它们在计数和空间推理等关键视觉语言 (Vision-Language, VL) 推理能力方面仍显不足。此外,尽管它们对指令和/或评估协议的小变化可能非常脆弱,但现有基准未能评估其鲁棒性(或更确切地说,缺乏鲁棒性)。为了将具有挑战性的 VL 场景与全面的鲁棒性评估相结合,我们引入了 DARE,即多样化的视觉问答与鲁棒性评估 (Diverse Visual Question Answering with Robustness Evaluation),这是一个精心创建和策划的多项选择 VQA 基准。DARE 评估 VLM 在五个不同类别上的表现,并包含四个基于提示、答案选项子集、输出格式和正确答案数量变化的鲁棒性评估。在众多其他发现中,我们报告称,最先进的 VLMs 在大多数类别的问题上仍显吃力,并且在测试的鲁棒性评估中无法持续展现其峰值性能。在选项子集中的最差表现比标准情况下的表现低达 34%。开源 VLMs 如 LLaVA 1.6 和 Idefics2 的鲁棒性无法与 GPT-4 和 Gemini 等闭源模型相媲美,但即使是后者,对不同变化的鲁棒性也非常脆弱。
[NLP-9] Multilingual Evaluation of Long Context Retrieval and Reasoning
链接: https://arxiv.org/abs/2409.18006 作者: Ameeta Agrawal,Andy Dang,Sina Bagheri Nezhad,Rhitabrat Pokharel,Russell Scheinberg 关键词-EN: demonstrate impressive capabilities, Recent large language, exhibiting near-perfect recall, Recent large, handling long contexts 类目: Computation and Language (cs.CL) 备注: Under review
点击查看摘要
Abstract:Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We comprehensively evaluate several long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. The best-performing models such as Gemini-1.5 and GPT-4o, achieve around 96% accuracy in English to around 36% in Somali with a single target sentence. However, this accuracy drops to 40% in English and 0% in Somali when dealing with three target sentences. Our findings highlight the challenges long-context LLMs face when processing longer contexts, an increase in the number of target sentences, or languages of lower resource levels.
摘要:近期的大语言模型 (LLMs) 在处理长上下文方面展示了令人印象深刻的能力,其中一些在合成检索任务上表现出近乎完美的召回率。然而,这些评估主要集中在英文文本上,并且涉及长上下文中的单一目标句子。我们的研究探讨了 LLM 性能在多语言环境中如何泛化,特别是在存在多个隐藏目标句子的情况下。我们全面评估了几种长上下文 LLM 在检索和推理任务上的表现,涵盖了五种语言:英语、越南语、印度尼西亚语、斯瓦希里语和索马里语。这些语言虽然共享拉丁字母,但属于不同的语言家族和资源级别。我们的分析揭示了语言之间的显著性能差距。表现最佳的模型如 Gemini-1.5 和 GPT-4o,在英语中达到约 96% 的准确率,而在索马里语中仅为约 36%,且仅涉及单一目标句子。然而,当处理三个目标句子时,英语中的准确率降至 40%,而在索马里语中降至 0%。我们的研究结果突显了长上下文 LLM 在处理更长上下文、增加目标句子数量或资源较低语言时所面临的挑战。
[NLP-10] Extracting Affect Aggregates from Longitudinal Social Media Data with Temporal Adapters for Large Language Models
链接: https://arxiv.org/abs/2409.17990 作者: Georg Ahnert,Max Pellert,David Garcia,Markus Strohmaier 关键词-EN: aligned Large Language, Large Language Models, temporally aligned Large, Large Language, paper proposes temporally 类目: Computers and Society (cs.CY); Computation and Language (cs.CL) 备注: Code available at this https URL
点击查看摘要
Abstract:This paper proposes temporally aligned Large Language Models (LLMs) as a tool for longitudinal analysis of social media data. We fine-tune Temporal Adapters for Llama 3 8B on full timelines from a panel of British Twitter users, and extract longitudinal aggregates of emotions and attitudes with established questionnaires. We validate our estimates against representative British survey data and find strong positive, significant correlations for several collective emotions. The obtained estimates are robust across multiple training seeds and prompt formulations, and in line with collective emotions extracted using a traditional classification model trained on labeled data. To the best of our knowledge, this is the first work to extend the analysis of affect in LLMs to a longitudinal setting through Temporal Adapters. Our work enables new approaches towards the longitudinal analysis of social media data. 摘要: 本文提出将时间对齐的大语言模型 (LLMs) 作为社交媒体数据纵向分析的工具。我们对 Llama 3 8B 模型的时间适配器 (Temporal Adapters) 进行了微调,以处理来自英国 Twitter 用户面板的完整时间线数据,并使用既定问卷提取情感和态度的纵向聚合数据。我们通过与代表性英国调查数据进行对比验证了我们的估计,发现多个集体情感指标存在显著的正相关关系。所获得的估计值在多个训练种子和提示语句组合下均表现出稳健性,并与使用传统分类模型(基于标注数据训练)提取的集体情感相一致。据我们所知,这是首次通过时间适配器将大语言模型中的情感分析扩展到纵向分析领域。我们的工作为社交媒体数据的纵向分析开辟了新的途径。
[NLP-11] BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search
链接: https://arxiv.org/abs/2409.17972 作者: Linzhuang Sun,Hao Liang,Wentao Zhang 关键词-EN: Large Language Models, Large Language, exhibited exceptional performance, Language Models, tasks and domains 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains. However, they still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics. Previous studies have employed techniques such as supervised fine-tuning (SFT), prompt engineering, and search-based methods to improve the mathematical problem-solving abilities of LLMs. Despite these efforts, their performance remains suboptimal and demands substantial computational resources. To address this issue, we propose a novel approach, BEATS, to enhance mathematical problem-solving abilities. Our method leverages newly designed prompts that guide the model to iteratively rewrite, advance by one step, and generate answers based on previous steps. Additionally, we introduce a new back-verification technique that uses LLMs to validate the correctness of the generated answers. Furthermore, we employ a pruning tree search to optimize search time while achieving strong performance. Notably, our method improves Qwen2-7b-Instruct’s score from 36.94 to 61.52, outperforming GPT4’s 42.5 on the MATH benchmark.
摘要:大语言模型 (LLMs) 在广泛的任务和领域中展现了卓越的性能。然而,由于数学的严谨性和逻辑性,它们在解决数学问题时仍面临困难。先前的研究采用了监督微调 (SFT)、提示工程和基于搜索的方法来提升大语言模型的数学问题解决能力。尽管如此,这些方法的性能仍不尽如人意,并且需要大量的计算资源。为了解决这一问题,我们提出了一种新颖的方法,BEATS,以增强数学问题解决能力。我们的方法利用了新设计的提示,指导模型通过迭代重写、逐步推进并基于先前步骤生成答案。此外,我们引入了一种新的后验证技术,使用大语言模型来验证生成答案的正确性。同时,我们采用了一种剪枝树搜索来优化搜索时间,同时实现强大的性能。值得注意的是,我们的方法将 Qwen2-7b-Instruct 的分数从 36.94 提升至 61.52,超过了 GPT4 在 MATH 基准测试中的 42.5 分。
[NLP-12] he Hard Positive Truth about Vision-Language Compositionality ECCV2024
链接: https://arxiv.org/abs/2409.17958 作者: Amita Kamath,Cheng-Yu Hsieh,Kai-Wei Chang,Ranjay Krishna 关键词-EN: hard, CLIP, hard positives, hard negatives, vision-language models 类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV) 备注: ECCV 2024
点击查看摘要
Abstract:Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model’s ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated – because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP’s performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related “positive” concepts.
摘要:多项基准测试得出结论,我们最先进的视觉语言模型(例如 CLIP)在组合性方面存在不足。给定一张图像,这些基准测试考察模型在一组组合性干扰项中识别其相关描述的能力。为此,近期涌现出一系列通过使用干扰项作为硬负样本对 CLIP 进行微调以提升性能的方案。我们的研究揭示,这些改进实际上被显著夸大了——因为现有基准测试并未探究微调后的视觉语言模型是否对硬正样本保持不变性。通过精心构建包含 112,382 个硬负样本和硬正样本的评估数据集,我们发现引入硬正样本会使 CLIP 的性能下降 12.9%,而人类在此任务上的表现则轻松达到 99%。采用硬负样本微调的 CLIP 性能下降更为严重,高达 38.7%。基于这一发现,我们随后生成了一个包含 1,775,259 个图像-文本对的大型训练集,其中同时涵盖了硬负样本和硬正样本的描述。通过同时训练这两种样本,我们不仅在现有基准测试中看到了性能提升,而且在处理硬正样本时也表现更佳,这表明组合性方面的改进更为稳健。我们的工作表明,未来研究需要严格测试并提升 CLIP 对相关“正”概念间语义关系的理解能力。
[NLP-13] Weak-To-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation
链接: https://arxiv.org/abs/2409.17946 作者: Shuai Zhao,Leilei Gan,Zhongliang Guo,Xiaobao Wu,Luwei Xiao,Xiaoyu Xu,Cong-Duy Nguyen,Luu Anh Tuan 关键词-EN: widely applied due, Large Language Models, Large Language, backdoor attacks, backdoor 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from weak to strong based on contrastive knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through contrastive knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.
摘要:尽管大语言模型 (Large Language Models, LLMs) 因其卓越的能力而被广泛应用,但已被证明容易受到后门攻击。这些攻击通过毒化训练样本和全参数微调引入目标漏洞。然而,这种后门攻击受限于其需要大量计算资源,尤其是随着 LLMs 规模的增加。此外,参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT) 提供了一种替代方案,但其受限的参数更新可能阻碍触发器与目标标签的对齐。在本研究中,我们首先验证了使用 PEFT 的后门攻击可能在实现可行性能方面遇到挑战。为了解决这些问题并提高使用 PEFT 的后门攻击的有效性,我们提出了一种基于对比知识蒸馏 (Contrastive Knowledge Distillation) 的由弱到强的新型后门攻击算法 (W2SAttack)。具体而言,我们通过全参数微调毒化小规模语言模型,作为教师模型。然后,教师模型通过对比知识蒸馏,使用 PEFT 将后门秘密转移给大规模学生模型。理论分析表明,W2SAttack 具有增强后门攻击效果的潜力。我们在四个语言模型、四种后门攻击算法和两种不同架构的教师模型上展示了 W2SAttack 在分类任务中的优越性能。实验结果显示,针对 PEFT 的后门攻击成功率接近 100%。
[NLP-14] On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms
链接: https://arxiv.org/abs/2409.17943 作者: Richard Yue,John E. Ortega,Kenneth Ward Church 关键词-EN: natural language processing, professional translator, models in natural, Google Translate, BLEU and COMET 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users
点击查看摘要
Abstract:The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do - predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near human parity using common metrics for measurement such as BLEU and COMET, we find that an important step is being missed: the translation of technical terms, specifically acronyms. Some state-of-the art machine translation systems like Google Translate which are publicly available can be erroneous when dealing with acronyms - as much as 50% in our findings. This article addresses acronym disambiguation for MT systems by proposing an additional step to the SL-TL (FR-EN) translation workflow where we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm that achieves nearly 10% increase when compared to Google Translate and OpusMT.
摘要:专业翻译人员将文档从源语言 (Source Language, SL) 翻译为目标语言 (Target Language, TL) 的典型工作流程,并不总是专注于自然语言处理 (Natural Language Processing, NLP) 中的许多语言模型所做的——预测一系列单词中的下一个单词。尽管像英语和法语这样的高资源语言在使用 BLEU 和 COMET 等常见度量标准进行测量时,已报告达到接近人类的水平,但我们发现一个重要的步骤被忽略了:技术术语,特别是缩略词的翻译。一些公开可用的最先进的机器翻译系统,如 Google Translate,在处理缩略词时可能会出现错误,根据我们的研究,错误率高达 50%。本文通过在 SL-TL (FR-EN) 翻译工作流程中增加一个步骤来解决机器翻译 (Machine Translation, MT) 系统的缩略词歧义问题,首先提供一个新的缩略词语料库供公众使用,然后实验一种基于搜索的阈值算法,与 Google Translate 和 OpusMT 相比,该算法实现了近 10% 的提升。
[NLP-15] Predicting Anchored Text from Translation Memories for Machine Translation Using Deep Learning Methods
链接: https://arxiv.org/abs/2409.17939 作者: Richard Yue,John E. Ortega 关键词-EN: tools called computer-aided, called computer-aided translation, CAT tool, CAT tools, CAT 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users
点击查看摘要
Abstract:Translation memories (TMs) are the backbone for professional translation tools called computer-aided translation (CAT) tools. In order to perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the desired segment to translate (s’). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the TM that are close in distance to s’. After locating two similar segments, the CAT tool will present parallel segments (s, t) that contain one segment in the source language along with its translation in the target language. Additionally, CAT tools contain fuzzy-match repair (FMR) techniques that will automatically use the parallel segments from the TM to create new TM entries containing a modified version of the original with the idea in mind that it will be the translation of s’. Most FMR techniques use machine translation as a way of “repairing” those words that have to be modified. In this article, we show that for a large part of those words which are anchored, we can use other techniques that are based on machine learning approaches such as Word2Vec. BERT, and even ChatGPT. Specifically, we show that for anchored words that follow the continuous bag-of-words (CBOW) paradigm, Word2Vec, BERT, and GPT-4 can be used to achieve similar and, for some cases, better results than neural machine translation for translating anchored words from French to English.
摘要:翻译记忆库 (Translation Memories, TMs) 是专业翻译工具——计算机辅助翻译 (Computer-Aided Translation, CAT) 工具的核心。在使用 CAT 工具进行翻译时,译者利用 TM 来收集与目标翻译片段 (s’) 相似的翻译。许多 CAT 工具提供模糊匹配算法,以定位 TM 中与 s’ 距离相近的片段 (s)。在找到两个相似片段后,CAT 工具会展示包含源语言片段及其目标语言翻译的平行片段 (s, t)。此外,CAT 工具还包含模糊匹配修复 (Fuzzy-Match Repair, FMR) 技术,这些技术会自动使用 TM 中的平行片段来创建新的 TM 条目,这些条目包含原始片段的修改版本,旨在作为 s’ 的翻译。大多数 FMR 技术使用机器翻译来“修复”那些需要修改的词汇。在本文中,我们展示了对于那些锚定的词汇,我们可以使用基于机器学习的方法,如 Word2Vec、BERT 和 ChatGPT,来替代大部分词汇的修复工作。具体而言,我们展示了对于遵循连续词袋 (Continuous Bag-of-Words, CBOW) 范式的锚定词汇,Word2Vec、BERT 和 GPT-4 可以用于实现与神经机器翻译相似,甚至在某些情况下更好的结果,以将法语中的锚定词汇翻译为英语。
[NLP-16] he Lou Dataset – Exploring the Impact of Gender-Fair Language in German Text Classification
链接: https://arxiv.org/abs/2409.17929 作者: Andreas Waldis,Joel Birrer,Anne Lauscher,Iryna Gurevych 关键词-EN: evolving German linguistic, German linguistic variation, fosters inclusion, neutral forms, inclusion by addressing 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Gender-fair language, an evolving German linguistic variation, fosters inclusion by addressing all genders or using neutral forms. Nevertheless, there is a significant lack of resources to assess the impact of this linguistic shift on classification using language models (LMs), which are probably not trained on such variations. To address this gap, we present Lou, the first dataset featuring high-quality reformulations for German text classification covering seven tasks, like stance detection and toxicity classification. Evaluating 16 mono- and multi-lingual LMs on Lou shows that gender-fair language substantially impacts predictions by flipping labels, reducing certainty, and altering attention patterns. However, existing evaluations remain valid, as LM rankings of original and reformulated instances do not significantly differ. While we offer initial insights on the effect on German text classification, the findings likely apply to other languages, as consistent patterns were observed in multi-lingual and English LMs.
摘要:性别公平语言是一种不断发展的德语语言变体,通过涵盖所有性别或使用中性形式来促进包容性。然而,目前缺乏资源来评估这种语言变化对使用语言模型 (LMs) 进行分类的影响,这些模型可能并未针对此类变体进行训练。为了填补这一空白,我们推出了 Lou,这是首个包含高质量重构文本的德语文本分类数据集,涵盖了立场检测和毒性分类等七项任务。通过对 Lou 上的 16 个单语和多语言 LMs 进行评估,我们发现性别公平语言显著影响了预测结果,包括标签翻转、确定性降低以及注意力模式的变化。然而,现有的评估仍然有效,因为原始实例和重构实例的 LM 排名没有显著差异。尽管我们提供了关于性别公平语言对德语文本分类影响的初步见解,但这些发现很可能适用于其他语言,因为在多语言和英语 LMs 中观察到了一致的模式。
[NLP-17] Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion EMNLP24
Abstract:During pre-training, the Text-to-Image (T2I) diffusion models encode factual knowledge into their parameters. These parameterized facts enable realistic image generation, but they may become obsolete over time, thereby misrepresenting the current state of the world. Knowledge editing techniques aim to update model knowledge in a targeted way. However, facing the dual challenges posed by inadequate editing datasets and unreliable evaluation criterion, the development of T2I knowledge editing encounter difficulties in effectively generalizing injected knowledge. In this work, we design a T2I knowledge editing framework by comprehensively spanning on three phases: First, we curate a dataset \textbfCAKE, comprising paraphrase and multi-object test, to enable more fine-grained assessment on knowledge generalization. Second, we propose a novel criterion, \textbfadaptive CLIP threshold, to effectively filter out false successful images under the current criterion and achieve reliable editing evaluation. Finally, we introduce \textbfMPE, a simple but effective approach for T2I knowledge editing. Instead of tuning parameters, MPE precisely recognizes and edits the outdated part of the conditioning text-prompt to accommodate the up-to-date knowledge. A straightforward implementation of MPE (Based on in-context learning) exhibits better overall performance than previous model editors. We hope these efforts can further promote faithful evaluation of T2I knowledge editing methods.
摘要:在预训练阶段,文本到图像 (Text-to-Image, T2I) 扩散模型将其参数化的事实知识编码到模型参数中。这些参数化的事实使得模型能够生成逼真的图像,但随着时间的推移,这些知识可能会变得过时,从而导致对当前世界状态的错误描述。知识编辑技术旨在以有针对性的方式更新模型知识。然而,面对编辑数据集不足和评估标准不可靠的双重挑战,T2I 知识编辑的发展在有效泛化注入知识方面遇到了困难。在本研究中,我们设计了一个 T2I 知识编辑框架,全面涵盖了三个阶段:首先,我们精心构建了一个数据集 \textbf{CAKE},该数据集包含释义和多对象测试,以实现对知识泛化的更精细评估。其次,我们提出了一种新的标准,即 \textbf{自适应 CLIP 阈值},以有效过滤当前标准下的虚假成功图像,并实现可靠的编辑评估。最后,我们引入了 \textbf{MPE},这是一种简单但有效的 T2I 知识编辑方法。MPE 不是调整参数,而是精确识别并编辑条件文本提示中过时的部分,以适应最新的知识。基于上下文学习的 MPE 直接实现展示了比先前模型编辑器更好的整体性能。我们希望这些努力能够进一步促进 T2I 知识编辑方法的忠实评估。
[NLP-18] Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
链接: https://arxiv.org/abs/2409.17912 作者: Guokan Shang,Hadi Abdine,Yousef Khoubrane,Amr Mohamed,Yassine Abbahaddou,Sofiane Ennadir,Imane Momayiz,Xuguang Ren,Eric Moulines,Preslav Nakov,Michalis Vazirgiannis,Eric Xing 关键词-EN: models specifically developed, dialectal Arabic, Moroccan Arabic, introduce Atlas-Chat, large language models 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:We introduce Atlas-Chat, the first-ever collection of large language models specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., achieving a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource language variants, which are often neglected in favor of data-rich languages by contemporary LLMs.
摘要:我们介绍了 Atlas-Chat,这是首个专门为阿拉伯方言开发的大语言模型集合。聚焦于摩洛哥阿拉伯语(也称为 Darija),我们通过整合现有的 Darija 语言资源、手动和合成创建新数据集,以及严格质量控制下的英语指令翻译,构建了我们的指令数据集。经过数据集微调的 Atlas-Chat-9B 和 2B 模型,在遵循 Darija 指令和执行标准自然语言处理任务方面表现出卓越能力。值得注意的是,我们的模型在 DarijaMMLU 上,相较于更大的 13B 模型,实现了 13% 的性能提升,超越了包括 LLaMa、Jais 和 AceGPT 在内的最先进和专门针对阿拉伯语的 LLM。此外,我们对多种微调策略和基础模型选择进行了实验分析,以确定最佳配置。所有资源均为公开可访问,我们相信我们的工作为低资源语言变体的指令微调提供了全面的设计方法,这些变体在当代 LLM 中往往被数据丰富的语言所忽视。
[NLP-19] EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
链接: https://arxiv.org/abs/2409.17892 作者: Shaoxiong Ji,Zihao Li,Indraneil Paul,Jaakko Paavola,Peiqin Lin,Pinzhen Chen,Dayyán O’Brien,Hengyu Luo,Hinrich Schütze,Jörg Tiedemann,Barry Haddow 关键词-EN: improving language coverage, focusing on improving, enhanced multilingual performance, continue-trained on texts, designed for enhanced 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models’ language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.
摘要:在本研究中,我们介绍了 EMMA-500,这是一个大规模的多语言语言模型,针对 546 种语言进行了持续训练,旨在提升多语言性能,特别是改善低资源语言的覆盖率。为了支持持续预训练,我们编纂了 MaLA 语料库,这是一个综合性的多语言数据集,涵盖了多个领域的精选数据集。利用这一语料库,我们对 Llama 2 7B 模型进行了广泛的持续预训练,从而生成了 EMMA-500,该模型在包括多语言任务和本研究开发的开放式生成基准 PolyWrite 在内的一系列基准测试中表现出色。我们的研究结果突显了持续预训练在扩展大语言模型语言能力方面的有效性,特别是在代表性不足的语言方面,显著提升了跨语言迁移、任务泛化和语言适应性。
[NLP-20] Implementing a Nordic-Baltic Federated Health Data Network: a case report
链接: https://arxiv.org/abs/2409.17865 作者: Taridzo Chomutare,Aleksandar Babic,Laura-Maria Peltonen,Silja Elunurm,Peter Lundberg,Arne Jönsson,Emma Eneling,Ciprian-Virgil Gerstenberger,Troels Siggaard,Raivo Kolde,Oskar Jerdhaf,Martin Hansson,Alexandra Makhlysheva,Miroslav Muzny,Erik Ylipää,Søren Brunak,Hercules Dalianis 关键词-EN: including privacy concerns, national borders pose, borders pose significant, pose significant challenges, including privacy 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: 24 pages (including appendices), 1 figure
点击查看摘要
Abstract:Background: Centralized collection and processing of healthcare data across national borders pose significant challenges, including privacy concerns, data heterogeneity and legal barriers. To address some of these challenges, we formed an interdisciplinary consortium to develop a feder-ated health data network, comprised of six institutions across five countries, to facilitate Nordic-Baltic cooperation on secondary use of health data. The objective of this report is to offer early insights into our experiences developing this network. Methods: We used a mixed-method ap-proach, combining both experimental design and implementation science to evaluate the factors affecting the implementation of our network. Results: Technically, our experiments indicate that the network functions without significant performance degradation compared to centralized simu-lation. Conclusion: While use of interdisciplinary approaches holds a potential to solve challeng-es associated with establishing such collaborative networks, our findings turn the spotlight on the uncertain regulatory landscape playing catch up and the significant operational costs. 摘要:
背景: 跨国集中收集和处理医疗数据面临重大挑战,包括隐私问题、数据异质性和法律障碍。为了应对其中一些挑战,我们组建了一个跨学科联盟,旨在开发一个联邦健康数据网络,该网络由五个国家的六个机构组成,以促进北欧-波罗的海地区在医疗数据二次使用方面的合作。本报告旨在提供我们在开发这一网络过程中的早期见解。方法: 我们采用了混合方法,结合实验设计和实施科学来评估影响我们网络实施的因素。结果: 从技术角度来看,我们的实验表明,与集中式模拟相比,该网络在功能上没有显著的性能下降。结论: 尽管跨学科方法具有解决建立此类协作网络所面临挑战的潜力,但我们的发现突显了监管环境的不确定性以及显著的运营成本。
[NLP-21] PEDRO: Parameter-Efficient Fine-tuning with Prompt DEpenDent Representation MOdification
链接: https://arxiv.org/abs/2409.17834 作者: Tianfang Xie,Tianjing Li,Wei Zhu,Wei Han,Yi Zhao 关键词-EN: large language models, substantial sizes, large language, typically deployed, underline 类目: Computation and Language (cs.CL) 备注: arXiv admin note: text overlap with arXiv:2405.18203
点击查看摘要
Abstract:Due to their substantial sizes, large language models (LLMs) are typically deployed within a single-backbone multi-tenant framework. In this setup, a single instance of an LLM backbone must cater to multiple users or tasks through the application of various parameter-efficient fine-tuning (PEFT) models. Despite the availability of numerous effective PEFT techniques such as LoRA, there remains a need for a PEFT approach that achieves both high efficiency during inference and competitive performance on downstream tasks. In this research, we introduce a new and straightforward PEFT methodology named \underlinePrompt D\underlineEpen\underlineDent \underlineRepresentation M\underlineOdification (PEDRO). The proposed method involves integrating a lightweight vector generator into each Transformer layer, which generates vectors contingent upon the input prompts. These vectors then modify the hidden representations created by the LLM through a dot product operation, thereby influencing the semantic output and generated content of the model. Extensive experimentation across a variety of tasks indicates that: (a) PEDRO surpasses recent PEFT benchmarks when using a similar number of tunable parameters. (b) Under the single-backbone multi-tenant deployment model, PEDRO exhibits superior efficiency compared to LoRA, indicating significant industrial potential.
摘要:由于大语言模型 (Large Language Model, LLM) 的规模庞大,它们通常部署在单一主干多租户框架中。在这种架构下,LLM 主干的一个实例必须通过应用各种参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT) 模型来服务于多个用户或任务。尽管存在许多有效的 PEFT 技术,如 LoRA,但仍然需要一种在推理过程中实现高效性并在下游任务中表现出色的 PEFT 方法。在本研究中,我们引入了一种新的、直接的 PEFT 方法,名为 \underline{Prompt D\underlineEpen\underlineDent \underlineRepresentation M\underlineOdification} (PEDRO)。该方法涉及将一个轻量级向量生成器集成到每个 Transformer 层中,该生成器根据输入提示生成向量。这些向量随后通过点积操作修改由 LLM 生成的隐藏表示,从而影响模型的语义输出和生成内容。在多种任务上的广泛实验表明:(a) 在使用相似数量的可调参数时,PEDRO 超越了最近的 PEFT 基准。(b) 在单一主干多租户部署模型下,PEDRO 相比 LoRA 表现出更高的效率,显示出显著的工业应用潜力。
[NLP-22] BeanCounter: A low-toxicity large-scale and open dataset of business-oriented text
链接: https://arxiv.org/abs/2409.17827 作者: Siyan Wang,Bradford Levy 关键词-EN: breakthroughs in language, language modeling, modeling have resulted, resulted from scaling, scaling effectively 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Many of the recent breakthroughs in language modeling have resulted from scaling effectively the same model architecture to larger datasets. In this vein, recent work has highlighted performance gains from increasing training dataset size and quality, suggesting a need for novel sources of large-scale datasets. In this work, we introduce BeanCounter, a public dataset consisting of more than 159B tokens extracted from businesses’ disclosures. We show that this data is indeed novel: less than 0.1% of BeanCounter appears in Common Crawl-based datasets and it is an order of magnitude larger than datasets relying on similar sources. Given the data’s provenance, we hypothesize that BeanCounter is comparatively more factual and less toxic than web-based datasets. Exploring this hypothesis, we find that many demographic identities occur with similar prevalence in BeanCounter but with significantly less toxic context relative to other datasets. To demonstrate the utility of BeanCounter, we evaluate and compare two LLMs continually pre-trained on BeanCounter with their base models. We find an 18-33% reduction in toxic generation and improved performance within the finance domain for the continually pretrained models. Collectively, our work suggests that BeanCounter is a novel source of low-toxicity and high-quality domain-specific data with sufficient scale to train multi-billion parameter LLMs.
摘要:近年来,语言模型领域的许多突破性进展源于将相同的模型架构有效地扩展到更大的数据集上。在此背景下,最近的研究强调了增加训练数据集规模和质量带来的性能提升,这表明需要寻找新的、大规模的数据集来源。本文中,我们引入了 BeanCounter,这是一个包含超过 1590 亿 Token 的公开数据集,这些 Token 提取自企业的披露信息。我们证明,这些数据确实是新颖的:BeanCounter 中不到 0.1% 的内容出现在基于 Common Crawl 的数据集中,并且其规模比依赖类似来源的数据集大一个数量级。鉴于数据的来源,我们假设 BeanCounter 相对于基于网络的数据集,具有更高的真实性和更低的毒性。通过探索这一假设,我们发现许多人口统计身份在 BeanCounter 中出现的频率与其他数据集相似,但在相对较少的毒性环境中出现。为了展示 BeanCounter 的实用性,我们评估并比较了两个在 BeanCounter 上持续预训练的大语言模型与其基础模型。我们发现,持续预训练的模型在生成毒性内容方面减少了 18-33%,并且在金融领域的表现有所提升。总体而言,我们的研究表明,BeanCounter 是一个新颖的、低毒性且高质量的领域特定数据源,其规模足以训练拥有数十亿参数的大语言模型。
[NLP-23] Inference-Time Language Model Alignment via Integrated Value Guidance EMNLP2024
【速读】: 该论文试图解决大规模语言模型在微调过程中计算复杂且耗时的问题。解决方案的关键在于引入了一种名为**Integrated Value Guidance (IVG)**的方法,该方法通过隐式和显式的价值函数分别在token和chunk级别上指导语言模型的解码过程,从而在推理阶段高效地对齐大规模语言模型,避免了直接微调的复杂性,并在多个任务中显著提升了模型的对齐效果。
链接: https://arxiv.org/abs/2409.17819 作者: Zhixuan Liu,Zhanhui Zhou,Yuanfu Wang,Chao Yang,Yu Qiao 关键词-EN: Large language models, human preferences, intensive and complex, tuning large models, Large language 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: EMNLP 2024 Findings
点击查看摘要
Abstract:Large language models are typically fine-tuned to align with human preferences, but tuning large models is computationally intensive and complex. In this work, we introduce \textitIntegrated Value Guidance (IVG), a method that uses implicit and explicit value functions to guide language model decoding at token and chunk-level respectively, efficiently aligning large language models purely at inference time. This approach circumvents the complexities of direct fine-tuning and outperforms traditional methods. Empirically, we demonstrate the versatility of IVG across various tasks. In controlled sentiment generation and summarization tasks, our method significantly improves the alignment of large models using inference-time guidance from \textttgpt2 -based value functions. Moreover, in a more challenging instruction-following benchmark AlpacaEval 2.0, we show that both specifically tuned and off-the-shelf value functions greatly improve the length-controlled win rates of large models against \textttgpt-4-turbo (e.g., 19.51% \rightarrow 26.51% for \textttMistral-7B-Instruct-v0.2 and 25.58% \rightarrow 33.75% for \textttMixtral-8x7B-Instruct-v0.1 with Tulu guidance).
摘要:大语言模型通常经过微调以符合人类偏好,但微调大型模型在计算上既耗费资源又复杂。在本研究中,我们引入了 综合价值引导 (Integrated Value Guidance, IVG),这是一种利用隐式和显式价值函数分别在 Token 和块级别引导语言模型解码的方法,从而在推理时高效地对齐大语言模型。这种方法避免了直接微调的复杂性,并优于传统方法。通过实验,我们展示了 IVG 在各种任务中的广泛适用性。在受控情感生成和摘要任务中,我们的方法显著提升了大模型在推理时通过基于 gpt2 的价值函数引导下的对齐效果。此外,在更具挑战性的指令遵循基准测试 AlpacaEval 2.0 中,我们展示了专门调优和现成的价值函数都能大幅提高大模型在长度控制下的胜率,例如,Mistral-7B-Instruct-v0.2 的胜率从 19.51% 提升至 26.51%,而 Mixtral-8x7B-Instruct-v0.1 的胜率从 25.58% 提升至 33.75%(使用 Tulu 引导)。
[NLP-24] Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness EMNLP2024
链接: https://arxiv.org/abs/2409.17791 作者: Jian Li,Haojing Huang,Yujia Zhang,Pengfei Xu,Xi Chen,Rui Song,Lida Shi,Jingwen Wang,Hao Xu 关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Direct Preference Optimization, Language Models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: Accepted at EMNLP 2024 Findings
点击查看摘要
Abstract:Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly use a binary cross-entropy mechanism on pairwise samples, i.e., minimizing and maximizing the loss based on preferred or dis-preferred responses, respectively. However, while this training strategy omits the reward model, it also overlooks the varying preference degrees within different responses. We hypothesize that this is a key factor hindering LLMs from sufficiently understanding human preferences. To address this problem, we propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference. Extensive experiments are conducted on two widely used datasets of different tasks. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods and significantly boost their performance to achieve state-of-the-art performance. We also conduct detailed analyses to offer comprehensive insights into SPO, which verifies its effectiveness. The code is available at this https URL.
摘要:最近,对于大语言模型 (LLM),如直接偏好优化 (DPO) 及其变体,在强化学习与人类反馈 (RLHF) 方法中替换奖励模型的兴趣显著增加。这些方法通常使用成对样本上的二元交叉熵机制,即分别基于偏好或非偏好的响应来最小化和最大化损失。然而,尽管这种训练策略省略了奖励模型,但它也忽略了不同响应中偏好的不同程度。我们假设这是阻碍 LLM 充分理解人类偏好的关键因素。为了解决这个问题,我们提出了一种新的自监督偏好优化 (SPO) 框架,该框架构建了一个自监督的偏好程度损失,结合对齐损失,从而帮助 LLM 提高理解偏好程度的能力。我们在两个广泛使用的不同任务的数据集上进行了大量实验。结果表明,SPO 可以无缝集成到现有的偏好优化方法中,并显著提升其性能,达到最先进的水平。我们还进行了详细的分析,以提供对 SPO 的全面见解,验证了其有效性。代码可在以下链接获取:https URL。
[NLP-25] Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations EMNLP2024
链接: https://arxiv.org/abs/2409.17774 作者: Supriya Manna,Niladri Sett 关键词-EN: critical metric, metric to assess, assess the reliability, reliability of explainable, Faithfulness 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: Accepted as a Full Paper at EMNLP 2024 Workshop BlackBoxNLP
点击查看摘要
Abstract:Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer’s response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
摘要:忠实度无疑是评估可解释 AI 可靠性的最关键指标。在自然语言处理 (NLP) 领域,当前的忠实度评估方法存在诸多差异和偏见,往往无法捕捉模型的真实推理过程。我们提出了一种新颖的忠实度评估方法——对抗敏感性 (Adversarial Sensitivity),重点关注解释器在模型遭受对抗攻击时的响应。我们的方法通过捕捉对抗输入变化的敏感性来评估解释器的忠实度。这项工作解决了现有评估技术的重大局限性,并进一步从一种关键但未充分探索的范式中量化了忠实度。
[NLP-26] Integrating Hierarchical Semantic into Iterative Generation Model for Entailment Tree Explanation
链接: https://arxiv.org/abs/2409.17757 作者: Qin Wang,Jianzhou Feng,Yiming Xu 关键词-EN: explainable question answering, Manifestly and logically, question answering, logically displaying, reasoning from evidence 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Manifestly and logically displaying the line of reasoning from evidence to answer is significant to explainable question answering (QA). The entailment tree exhibits the lines structurally, which is different from the self-explanation principle in large-scale language models. Existing methods rarely consider the semantic association of sentences between and within hierarchies within the tree structure, which is prone to apparent mistakes in combinations. In this work, we propose an architecture of integrating the Hierarchical Semantics of sentences under the framework of Controller-Generator (HiSCG) to explain answers. The HiSCG designs a hierarchical mapping between hypotheses and facts, discriminates the facts involved in tree constructions, and optimizes single-step entailments. To the best of our knowledge, We are the first to notice hierarchical semantics of sentences between the same layer and adjacent layers to yield improvements. The proposed method achieves comparable performance on all three settings of the EntailmentBank dataset. The generalization results on two out-of-domain datasets also demonstrate the effectiveness of our method.
摘要:在可解释问答 (QA) 中,从证据到答案的推理路径的显式和逻辑展示至关重要。蕴涵树以结构化的方式展示这些路径,这与大规模语言模型中的自我解释原则不同。现有方法很少考虑树结构中层次之间和层次内部的句子语义关联,这容易导致组合中的明显错误。在本研究中,我们提出了一种在控制器-生成器 (Controller-Generator) 框架下整合句子层次语义 (Hierarchical Semantics of sentences) 的架构 (HiSCG) 来解释答案。HiSCG 设计了假设与事实之间的层次映射,区分了参与树构建的事实,并优化了单步蕴涵。据我们所知,我们是第一个注意到同一层和相邻层之间句子层次语义以实现改进的。所提出的方法在 EntailmentBank 数据集的所有三种设置中均取得了可比的表现。在两个域外数据集上的泛化结果也证明了我们方法的有效性。
[NLP-27] SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning
链接: https://arxiv.org/abs/2409.17755 作者: Rimvydas Rubavicius,Peter David Fagan,Alex Lascarides,Subramanian Ramamoorthy 关键词-EN: interactive task learning, challenging interactive task, task learning scenario, paper addresses, addresses a challenging 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: 10 pages,4 figures, 2 tables
点击查看摘要
Abstract:This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the robot is unaware of a concept that’s key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems by fixing a deficient domain model using embodied conversation. Through dialogue, the robot discovers and then learns to exploit unforeseen possibilities. Using SECURE, the robot not only learns from the user’s corrective feedback when it makes a mistake, but it also learns to make strategic dialogue decisions for revealing useful evidence about novel concepts for solving the instructed task. Together, these abilities allow the robot to generalise to subsequent tasks using newly acquired knowledge. We demonstrate that a robot that is semantics-aware – that is, it exploits the logical consequences of both sentence and discourse semantics in the learning and inference process – learns to solve rearrangement under unawareness more effectively than a robot that lacks such capabilities.
摘要:本文探讨了一种具有挑战性的交互式任务学习场景,我们称之为“无意识重排”:在机器人对解决指令任务的关键概念一无所知的情况下,操控刚体环境。我们提出了 SECURE,这是一个交互式任务学习框架,旨在通过实体对话修复缺陷领域模型来解决此类问题。通过对话,机器人能够发现并学习利用未预见的可能性。使用 SECURE,机器人不仅在出错时从用户的纠正反馈中学习,还能学会做出战略性对话决策,以揭示解决指令任务的新概念的有用证据。这些能力共同使机器人能够利用新获得的知识推广到后续任务中。我们证明,一个具有语义意识的机器人——即在学习与推理过程中利用句子和话语语义的逻辑结果——比缺乏此类能力的机器人更有效地解决无意识重排问题。
[NLP-28] Few-shot Pairwise Rank Prompting: An Effective Non-Parametric Retrieval Model EMNLP2024
链接: https://arxiv.org/abs/2409.17745 作者: Nilanjan Sinhababu,Andrew Parry,Debasis Ganguly,Debasis Samanta,Pabitra Mitra 关键词-EN: typically multiple stages, involves complex processing, typically multiple, pre-training and fine-tuning, multiple stages 类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: Accepted to EMNLP 2024
点击查看摘要
Abstract:A supervised ranking model, despite its advantage of being effective, usually involves complex processing - typically multiple stages of task-specific pre-training and fine-tuning. This has motivated researchers to explore simpler pipelines leveraging large language models (LLMs) that are capable of working in a zero-shot manner. However, since zero-shot inference does not make use of a training set of pairs of queries and their relevant documents, its performance is mostly worse than that of supervised models, which are trained on such example pairs. Motivated by the existing findings that training examples generally improve zero-shot performance, in our work, we explore if this also applies to ranking models. More specifically, given a query and a pair of documents, the preference prediction task is improved by augmenting examples of preferences for similar queries from a training set. Our proposed pairwise few-shot ranker demonstrates consistent improvements over the zero-shot baseline on both in-domain (TREC DL) and out-domain (BEIR subset) retrieval benchmarks. Our method also achieves a close performance to that of a supervised model without requiring any complex training pipeline.
摘要:尽管监督排序模型具有有效性的优势,但其通常涉及复杂的处理流程——通常包括多个阶段的任务特定预训练和微调。这促使研究人员探索利用大语言模型 (LLM) 的更简单流程,这些模型能够在零样本 (zero-shot) 模式下工作。然而,由于零样本推理不使用查询及其相关文档对的训练集,其性能通常不如在类似示例对上训练的监督模型。受现有研究结果的启发,即训练示例通常能提升零样本性能,我们在工作中探讨了这一现象是否也适用于排序模型。更具体地说,给定一个查询和一对文档,通过增加训练集中相似查询的偏好示例,偏好预测任务得到了改进。我们提出的成对少样本 (few-shot) 排序器在域内 (TREC DL) 和域外 (BEIR 子集) 检索基准测试中均显示出对零样本基线的持续改进。我们的方法还实现了与监督模型相近的性能,而无需任何复杂的训练流程。
[NLP-29] MIO: A Foundation Model on Multimodal Tokens
链接: https://arxiv.org/abs/2409.17692 作者: Zekun Wang,King Zhu,Chunpu Xu,Wangchunshu Zhou,Jiaheng Liu,Yibo Zhang,Jiashuo Wang,Ning Shi,Siyu Li,Yizhi Li,Haoran Que,Zhaoxiang Zhang,Yuanxing Zhang,Ge Zhang,Ke Xu,Jie Fu,Wenhao Huang 关键词-EN: foundation model built, large language models, autoregressive manner, understanding and generating, language models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: Technical Report. Codes and models will be available soon
点击查看摘要
Abstract:In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
摘要:本文介绍了一种名为 MIO 的新型基础模型,该模型基于多模态 Token,能够以端到端、自回归的方式理解和生成语音、文本、图像和视频。尽管大语言模型 (LLM) 和多模态大语言模型 (MM-LLM) 通过其多功能性推动了通用人工智能 (AGI) 的发展,但它们仍然缺乏真正的任意到任意理解和生成能力。最近,GPT-4o 的发布展示了任意到任意大语言模型在复杂现实任务中的显著潜力,实现了图像、语音和文本之间的全方位输入和输出。然而,它是闭源的,并且不支持多模态交错序列的生成。为了填补这一空白,我们提出了 MIO,该模型通过因果多模态建模在四种模态的混合离散 Token 上进行训练。MIO 经历了四个阶段的训练过程:(1) 对齐预训练,(2) 交错预训练,(3) 语音增强预训练,以及 (4) 在多样化的文本、视觉和语音任务上的综合监督微调。我们的实验结果表明,MIO 在性能上与之前的双模态基线、任意到任意模型基线以及特定模态基线相比具有竞争力,甚至在某些情况下表现更优。此外,MIO 展示了其任意到任意特性所固有的高级能力,例如交错视频-文本生成、视觉思维链推理、视觉指南生成、指导性图像编辑等。
[NLP-30] Zero- and Few-shot Named Entity Recognition and Text Expansion in Medication Prescriptions using ChatGPT
链接: https://arxiv.org/abs/2409.17683 作者: Natthanaphop Isaradech,Andrea Riedel,Wachiranun Sirikul,Markus Kreuzthaler,Stefan Schulz 关键词-EN: local brand, formats and abbreviations, include a mix, wide range, range of idiosyncratic 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Introduction: Medication prescriptions are often in free text and include a mix of two languages, local brand names, and a wide range of idiosyncratic formats and abbreviations. Large language models (LLMs) have shown promising ability to generate text in response to input prompts. We use ChatGPT 3.5 to automatically structure and expand medication statements in discharge summaries and thus make them easier to interpret for people and machines. Methods: Named-entity Recognition (NER) and Text Expansion (EX) are used in a zero- and few-shot setting with different prompt strategies. 100 medication statements were manually annotated and curated. NER performance was measured by using strict and partial matching. For the task EX, two experts interpreted the results by assessing semantic equivalence between original and expanded statements. The model performance was measured by precision, recall, and F1 score. Results: For NER, the best-performing prompt reached an average F1 score of 0.94 in the test set. For EX, the few-shot prompt showed superior performance among other prompts, with an average F1 score of 0.87. Conclusion: Our study demonstrates good performance for NER and EX tasks in free-text medication statements using ChatGPT. Compared to a zero-shot baseline, a few-shot approach prevented the system from hallucinating, which would be unacceptable when processing safety-relevant medication data.
1 2 3 4 5 6 7 8 9
**摘要:**
**引言:** 药物处方通常以自由文本形式存在,并包含两种语言的混合、本地品牌名称以及各种独特的格式和缩写。大语言模型 (Large Language Models, LLMs) 在根据输入提示生成文本方面展示了令人鼓舞的能力。我们使用 ChatGPT 3.5 来自动结构化和扩展出院总结中的药物声明,从而使其更易于人类和机器解读。
链接: https://arxiv.org/abs/2409.17673 作者: Kaden Uhlig,Joern Wuebker,Raphael Reinauer,John DeNero 关键词-EN: Reinforcement Learning, Direct Preference Optimization, Direct Quality Optimization, Human Feedback, repurpose general 类目: Computation and Language (cs.CL) 备注: 17 pages, 1 figure
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general, foundational models for specific tasks. We show that applying task-alignment to neural machine translation (NMT) addresses an existing task–data mismatch in NMT, leading to improvements across all languages of a multilingual model, even when task-alignment is only applied to a subset of those languages. We do so by introducing Direct Quality Optimization (DQO), a variant of DPO leveraging a pre-trained translation quality estimation model as a proxy for human preferences, and verify the improvements with both automatic metrics and human evaluation.
摘要:基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 及其衍生技术,如直接偏好优化 (Direct Preference Optimization, DPO),是用于将通用基础模型重新定位到特定任务的任务对齐算法。我们展示了将任务对齐应用于神经机器翻译 (Neural Machine Translation, NMT) 可以解决 NMT 中现有的任务与数据不匹配问题,从而在多语言模型的所有语言中实现改进,即使任务对齐仅应用于这些语言的一个子集。我们通过引入直接质量优化 (Direct Quality Optimization, DQO) 来实现这一点,DQO 是 DPO 的一个变体,利用预训练的翻译质量评估模型作为人类偏好的代理,并通过自动指标和人工评估验证了这些改进。
[NLP-32] Digital Twin Ecosystem for Oncology Clinical Operations
链接: https://arxiv.org/abs/2409.17650 作者: Himanshu Pandey,Akhil Amod,Shivang,Kshitij Jaggi,Ruchi Garg,Abheet Jain,Vinayak Tantia 关键词-EN: Large Language Models, Artificial Intelligence, Large Language, hold significant promise, Language Models 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: Pre Print
点击查看摘要
Abstract:Artificial Intelligence (AI) and Large Language Models (LLMs) hold significant promise in revolutionizing healthcare, especially in clinical applications. Simultaneously, Digital Twin technology, which models and simulates complex systems, has gained traction in enhancing patient care. However, despite the advances in experimental clinical settings, the potential of AI and digital twins to streamline clinical operations remains largely untapped. This paper introduces a novel digital twin framework specifically designed to enhance oncology clinical operations. We propose the integration of multiple specialized digital twins, such as the Medical Necessity Twin, Care Navigator Twin, and Clinical History Twin, to enhance workflow efficiency and personalize care for each patient based on their unique data. Furthermore, by synthesizing multiple data sources and aligning them with the National Comprehensive Cancer Network (NCCN) guidelines, we create a dynamic Cancer Care Path, a continuously evolving knowledge base that enables these digital twins to provide precise, tailored clinical recommendations.
摘要:人工智能 (AI) 和大语言模型 (LLM) 在革新医疗领域,特别是在临床应用方面,具有巨大的潜力。同时,数字孪生技术,通过建模和模拟复杂系统,在提升患者护理方面也逐渐受到重视。然而,尽管在实验临床环境中取得了进展,AI 和数字孪生在优化临床操作方面的潜力仍未得到充分开发。本文介绍了一种专为提升肿瘤临床操作而设计的新型数字孪生框架。我们提出整合多种专业数字孪生,如医疗必需性孪生、护理导航孪生和临床历史孪生,以提高工作流程效率并根据每位患者的独特数据个性化护理。此外,通过综合多个数据源并与国家综合癌症网络 (NCCN) 指南对齐,我们创建了一个动态的癌症护理路径,这是一个持续演进的知识库,使这些数字孪生能够提供精确、定制化的临床建议。
[NLP-33] Efficient In-Domain Question Answering for Resource-Constrained Environments
链接: https://arxiv.org/abs/2409.17648 作者: Isaac Chung,Phat Vo,Arman Kizilkale,Aaron Reite 关键词-EN: pretrained Large Language, Retrieval Augmented Generation, Large Language Models, Large Language, integrating external knowledge 类目: Computation and Language (cs.CL) 备注: 6 pages, 2 tables
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) is a common method for integrating external knowledge into pretrained Large Language Models (LLMs) to enhance accuracy and relevancy in question answering (QA) tasks. However, prompt engineering and resource efficiency remain significant bottlenecks in developing optimal and robust RAG solutions for real-world QA applications. Recent studies have shown success in using fine tuning to address these problems; in particular, Retrieval Augmented Fine Tuning (RAFT) applied to smaller 7B models has demonstrated superior performance compared to RAG setups with much larger models such as GPT-3.5. The combination of RAFT with parameter-efficient fine tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), promises an even more efficient solution, yet remains an unexplored area. In this work, we combine RAFT with LoRA to reduce fine tuning and storage requirements and gain faster inference times while maintaining comparable RAG performance. This results in a more compute-efficient RAFT, or CRAFT, which is particularly useful for knowledge-intensive QA tasks in resource-constrained environments where internet access may be restricted and hardware resources limited.
摘要:检索增强生成 (Retrieval Augmented Generation, RAG) 是一种常见的方法,用于将外部知识整合到预训练的大语言模型 (Large Language Models, LLMs) 中,以提高问答 (Question Answering, QA) 任务中的准确性和相关性。然而,提示工程和资源效率仍然是开发适用于实际 QA 应用的最佳且稳健的 RAG 解决方案的主要瓶颈。最近的研究表明,通过微调可以有效解决这些问题;特别是,应用于较小 7B 模型的检索增强微调 (Retrieval Augmented Fine Tuning, RAFT) 在性能上优于使用更大模型(如 GPT-3.5)的 RAG 设置。将 RAFT 与参数高效微调 (Parameter-Efficient Fine Tuning, PEFT) 技术(如低秩适应 (Low-Rank Adaptation, LoRA))相结合,有望提供更高效的解决方案,但这一领域仍未被充分探索。在本研究中,我们将 RAFT 与 LoRA 结合,以减少微调和存储需求,并实现更快的推理时间,同时保持与 RAG 相当的性能。这产生了一种更高效的 RAFT,即计算高效的 RAFT (Compute-Efficient RAFT, CRAFT),这对于资源受限环境中知识密集型 QA 任务特别有用,在这些环境中,互联网访问可能受限,硬件资源有限。
[NLP-34] 3: A Novel Zero-shot Transfer Learning Framework Iteratively Training on an Assistant Task for a Target Task
链接: https://arxiv.org/abs/2409.17640 作者: Xindi Tong,Yujin Zhu,Shijian Fan,Liang Xu 关键词-EN: Large Language Models, processing large volumes, efficiently processing large, contextual details dealing, Language Models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Long text summarization, gradually being essential for efficiently processing large volumes of information, stays challenging for Large Language Models (LLMs) such as GPT and LLaMA families because of the insufficient open-sourced training datasets and the high requirement of contextual details dealing. To address the issue, we design a novel zero-shot transfer learning framework, abbreviated as T3, to iteratively training a baseline LLM on an assistant task for the target task, where the former should own richer data resources and share structural or semantic similarity with the latter. In practice, T3 is approached to deal with the long text summarization task by utilizing question answering as the assistant task, and further validated its effectiveness on the BBC summary, NarraSum, FairytaleQA, and NLQuAD datasets, with up to nearly 14% improvement in ROUGE, 35% improvement in BLEU, and 16% improvement in Factscore compared to three baseline LLMs, demonstrating its potential for more assistant-target task combinations.
摘要:长文本摘要,作为高效处理大量信息的关键手段,对于 GPT 和 LLaMA 系列等大语言模型 (LLM) 来说仍然具有挑战性,这主要是因为开源训练数据集的不足以及对上下文细节处理的高要求。为解决这一问题,我们设计了一种新颖的零样本迁移学习框架,简称 T3,通过在辅助任务上迭代训练基线 LLM 以实现目标任务,其中辅助任务应拥有更丰富的数据资源,并与目标任务在结构或语义上具有相似性。在实际应用中,T3 通过利用问答作为辅助任务来处理长文本摘要任务,并在 BBC 摘要、NarraSum、FairytaleQA 和 NLQuAD 数据集上进一步验证了其有效性,相较于三个基线 LLM,ROUGE 提升了近 14%,BLEU 提升了 35%,Factscore 提升了 16%,展示了其在更多辅助-目标任务组合中的潜力。
[NLP-35] ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
链接: https://arxiv.org/abs/2409.17610 作者: Zhangpu Li,Changhong Zou,Suxue Ma,Zhicheng Yang,Chen Du,Youbao Tang,Zhenjie Cao,Ning Zhang,Jui-Hsin Lai,Ruei-Sung Lin,Yuan Ni,Xingzhi Sun,Jing Xiao,Kai Zhang,Mei Han 关键词-EN: multimodal medical dialogue, multi-turn multimodal medical, large language models, multimodal medical, medical dialogue 类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients’ mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the preceding text conversations before an image can infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to summarize the keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate our proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue to provide a fine-grained performance comparison. Our experiments across three different clinical departments remarkably demonstrate the efficacy of ZALM3 with statistical significance.
摘要:近年来,大语言模型 (LLM) 的迅猛发展推动了视觉-语言模型 (VLM) 在医疗领域的普及。在我们的在线医疗咨询场景中,医生通过多轮对话回应患者提供的文本和图像,以诊断其健康状况,形成了一种多轮多模态的医疗对话格式。与传统医疗视觉问答 (Med-VQA) 中由专业设备拍摄的高质量图像不同,我们场景中的图像由患者使用手机拍摄。这些图像质量控制较差,存在背景元素过多、病变区域严重偏离中心等问题,导致模型训练阶段的视觉-语言对齐效果下降。本文提出了 ZALM3,一种零样本策略,用于改进多轮多模态医疗对话中的视觉-语言对齐。鉴于我们观察到图像之前的文本对话可以推断出图像中的感兴趣区域 (RoI),ZALM3 采用大语言模型从上下文中总结关键词,并使用视觉定位模型提取 RoI。更新后的图像消除了不必要的背景噪声,提供了更有效的视觉-语言对齐。为了更好地评估我们提出的方法,我们设计了一种新的主观评估指标,用于多轮单模态/多模态医疗对话,以提供细粒度的性能比较。我们在三个不同临床部门的实验显著证明了 ZALM3 的有效性,并具有统计学意义。
[NLP-36] Deep CLAS: Deep Contextual Listen Attend and Spell
链接: https://arxiv.org/abs/2409.17603 作者: Shifu Xiong,Mengzhi Wang,Genshun Wan,Hang Chen,Jianqing Gao,Lirong Dai 关键词-EN: improving Automatic Speech, Automatic Speech Recognition, Automatic Speech, improving Automatic, Speech Recognition 类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS) 备注: Accepted by NCMMSC 2022
点击查看摘要
Abstract:Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraint which lead to insufficient use of contextual information. In this work, we propose deep CLAS to use contextual information better. We introduce bias loss forcing model to focus on contextual information. The query of bias attention is also enriched to improve the accuracy of the bias attention score. To get fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with conformer rather than LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments using the public AISHELL-1 and AISHELL-NER. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative recall and a 53.49% relative F1-score increase in the named entity recognition scene.
摘要:上下文感知语言模型 (Contextual-LAS, CLAS) 已被证明在提高罕见词的自动语音识别 (ASR) 方面有效。它依赖于短语级别的上下文建模和基于注意力的相关性评分,而没有明确的上下文约束,这导致上下文信息的利用不足。在这项工作中,我们提出了深度 CLAS 以更好地利用上下文信息。我们引入了偏置损失,迫使模型关注上下文信息。偏置注意力的查询也被丰富,以提高偏置注意力评分的准确性。为了获取细粒度的上下文信息,我们将短语级别的编码替换为字符级别的编码,并使用 Conformer 而不是 LSTM 来编码上下文信息。此外,我们直接使用偏置注意力评分来修正模型的输出概率分布。使用公开的 AISHELL-1 和 AISHELL-NER 进行实验。在 AISHELL-1 上,与 CLAS 基线相比,深度 CLAS 在命名实体识别场景中获得了 65.78% 的相对召回率和 53.49% 的相对 F1 分数提升。
[NLP-37] DualCoTs: Dual Chain-of-Thoughts Prompting for Sentiment Lexicon Expansion of Idioms
链接: https://arxiv.org/abs/2409.17588 作者: Fuqiang Niu,Minghuan Tan,Bowen Zhang,Min Yang,Ruifeng Xu 关键词-EN: idiom sentiment crucial, text sentiment analysis, idiom sentiment analysis, everyday discourse, rendering the nuanced 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Idioms represent a ubiquitous vehicle for conveying sentiments in the realm of everyday discourse, rendering the nuanced analysis of idiom sentiment crucial for a comprehensive understanding of emotional expression within real-world texts. Nevertheless, the existing corpora dedicated to idiom sentiment analysis considerably limit research in text sentiment analysis. In this paper, we propose an innovative approach to automatically expand the sentiment lexicon for idioms, leveraging the capabilities of large language models through the application of Chain-of-Thought prompting. To demonstrate the effectiveness of this approach, we integrate multiple existing resources and construct an emotional idiom lexicon expansion dataset (called EmoIdiomE), which encompasses a comprehensive repository of Chinese and English idioms. Then we designed the Dual Chain-of-Thoughts (DualCoTs) method, which combines insights from linguistics and psycholinguistics, to demonstrate the effectiveness of using large models to automatically expand the sentiment lexicon for idioms. Experiments show that DualCoTs is effective in idioms sentiment lexicon expansion in both Chinese and English. For reproducibility, we will release the data and code upon acceptance.
摘要:成语在日常对话领域中是传达情感的普遍媒介,因此对成语情感的细致分析对于全面理解现实文本中的情感表达至关重要。然而,现有专门用于成语情感分析的语料库在很大程度上限制了文本情感分析的研究。本文提出了一种创新方法,通过应用思维链提示 (Chain-of-Thought prompting) 利用大语言模型的能力,自动扩展成语的情感词典。为展示该方法的有效性,我们整合了多种现有资源,构建了一个情感成语词典扩展数据集 (称为 EmoIdiomE),该数据集包含全面的中英文成语库。随后,我们设计了双思维链 (Dual Chain-of-Thoughts, DualCoTs) 方法,结合了语言学和心理语言学的见解,以展示使用大模型自动扩展成语情感词典的有效性。实验表明,DualCoTs 在扩展中英文成语情感词典方面是有效的。为确保可重复性,我们将在接受后发布数据和代码。
[NLP-38] Leveraging Annotator Disagreement for Text Classification
链接: https://arxiv.org/abs/2409.17577 作者: Jin Xu,Mariët Theune,Daniel Braun 关键词-EN: common practice, annotated by multiple, text classification, abusive conversation detection, multiple annotators 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:It is common practice in text classification to only use one majority label for model training even if a dataset has been annotated by multiple annotators. Doing so can remove valuable nuances and diverse perspectives inherent in the annotators’ assessments. This paper proposes and compares three different strategies to leverage annotator disagreement for text classification: a probability-based multi-label method, an ensemble system, and instruction tuning. All three approaches are evaluated on the tasks of hate speech and abusive conversation detection, which inherently entail a high degree of subjectivity. Moreover, to evaluate the effectiveness of embracing annotation disagreements for model training, we conduct an online survey that compares the performance of the multi-label model against a baseline model, which is trained with the majority label. The results show that in hate speech detection, the multi-label method outperforms the other two approaches, while in abusive conversation detection, instruction tuning achieves the best performance. The results of the survey also show that the outputs from the multi-label models are considered a better representation of the texts than the single-label model. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2409.17577 [cs.CL] (or arXiv:2409.17577v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2409.17577 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要:在文本分类中,即使数据集由多个标注者进行标注,通常也只使用多数标签进行模型训练。这样做可能会忽略标注者评估中固有的细微差别和多样视角。本文提出了并比较了三种利用标注者分歧进行文本分类的不同策略:基于概率的多标签方法、集成系统和指令调优。所有三种方法都在仇恨言论和辱骂对话检测任务上进行了评估,这些任务本质上具有高度的主观性。此外,为了评估在模型训练中接受标注分歧的有效性,我们进行了一项在线调查,比较了多标签模型与使用多数标签训练的基线模型的性能。结果显示,在仇恨言论检测中,多标签方法优于其他两种方法,而在辱骂对话检测中,指令调优表现最佳。调查结果还表明,多标签模型的输出被认为比单标签模型更好地代表了文本。
链接: https://arxiv.org/abs/2409.17545 作者: Cheolhun Jang 关键词-EN: well-trained SFT model, Preference optimization methods, optimization methods typically, methods typically begin, reference model 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: 8pages, submitted to AAAI 2025
点击查看摘要
Abstract:Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from deviating too far from the reference model’s distribution, thereby avoiding the generation of anomalous responses. When the reference model is already well-aligned with the given data or only requires slight adjustments, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and requires significant deviation from its current state, a regularization term may actually hinder the model alignment. In this study, we propose \textbfModulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it. If the data is well-aligned, the intervention is increased to prevent the policy model from diverging significantly from reference model. Conversely, if the alignment is poor, the interference is reduced to facilitate more extensive training. We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B in Alpaca Eval 2.0 and MT-Bench. The experimental results demonstrate that MIPO consistently outperforms DPO across various evaluation scenarios.
摘要:偏好优化方法通常以一个经过良好训练的监督微调 (SFT) 模型作为参考模型开始训练。在强化学习人类反馈 (RLHF) 和直接偏好优化 (DPO) 中,偏好优化过程中使用了一个正则化项,以防止策略模型偏离参考模型的分布过远,从而避免生成异常响应。当参考模型已经与给定数据良好对齐或仅需要轻微调整时,这种方法可以产生一个良好对齐的模型。然而,如果参考模型与给定数据不对齐且需要显著偏离其当前状态,正则化项实际上可能阻碍模型的对齐。在本研究中,我们提出了调制干预偏好优化 (MIPO) 来解决这一问题。MIPO 根据给定数据与参考模型的对齐程度来调制参考模型的干预程度。如果数据与参考模型对齐良好,则增加干预以防止策略模型显著偏离参考模型;相反,如果对齐较差,则减少干预以促进更广泛的训练。我们使用 Mistral-7B 和 Llama3-8B 在 Alpaca Eval 2.0 和 MT-Bench 上比较了 MIPO 和 DPO 的性能。实验结果表明,MIPO 在各种评估场景中始终优于 DPO。
[NLP-40] Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models
链接: https://arxiv.org/abs/2409.17539 作者: Tongxuan Liu,Wenjiang Xu,Weizhe Huang,Xingyu Wang,Jiaxing Wang,Hailong Yang,Jing Li 关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, tasks remains unsatisfactory 类目: Computation and Language (cs.CL) 备注: 20 pages
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks but their performance in complex logical reasoning tasks remains unsatisfactory. Although some prompting methods, such as Chain-of-Thought, can improve the reasoning ability of LLMs to some extent, they suffer from an unfaithful issue where derived conclusions may not align with the generated reasoning chain. To address this issue, some studies employ the approach of propositional logic to further enhance logical reasoning abilities of LLMs. However, the potential omissions in the extraction of logical expressions in these methods can cause information loss in the logical reasoning process, thereby generating incorrect results. To this end, we propose Logic-of-Thought (LoT) prompting which employs propositional logic to generate expanded logical information from input context, and utilizes the generated logical information as an additional augmentation to the input prompts, thereby enhancing the capability of logical reasoning. The LoT is orthogonal to existing prompting methods and can be seamlessly integrated with them. Extensive experiments demonstrate that LoT boosts the performance of various prompting methods with a striking margin across five logical reasoning tasks. In particular, the LoT enhances Chain-of-Thought’s performance on the ReClor dataset by +4.35%; moreover, it improves Chain-of-Thought with Self-Consistency’s performance on LogiQA by +5%; additionally, it boosts performance of Tree-of-Thoughts on ProofWriter dataset by +8%.
摘要:大语言模型 (LLMs) 在各种任务中展示了显著的能力,但在复杂的逻辑推理任务中的表现仍不尽如人意。尽管一些提示方法,如思维链 (Chain-of-Thought),可以在一定程度上提高 LLMs 的推理能力,但它们存在一个不忠实的问题,即推导出的结论可能与生成的推理链不一致。为了解决这一问题,一些研究采用了命题逻辑的方法来进一步增强 LLMs 的逻辑推理能力。然而,这些方法在提取逻辑表达式时可能存在的遗漏会导致逻辑推理过程中的信息丢失,从而产生错误的结果。为此,我们提出了思维逻辑 (Logic-of-Thought, LoT) 提示方法,该方法利用命题逻辑从输入上下文中生成扩展的逻辑信息,并将生成的逻辑信息作为输入提示的额外增强,从而增强逻辑推理能力。LoT 与现有的提示方法是正交的,可以无缝地与它们集成。广泛的实验表明,LoT 在五个逻辑推理任务中显著提升了各种提示方法的性能。特别是,LoT 将思维链在 ReClor 数据集上的性能提升了 +4.35%;此外,它将思维链与自一致性在 LogiQA 上的性能提升了 +5%;另外,它将思维树在 ProofWriter 数据集上的性能提升了 +8%。
[NLP-41] On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy
链接: https://arxiv.org/abs/2409.17538 作者: Saber Malekmohammadi,Golnoosh Farnadi 关键词-EN: processing involves large-scale, involves large-scale pre-training, natural language processing, language processing involves, general domain data 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:A significant approach in natural language processing involves large-scale pre-training on general domain data followed by adaptation to specific tasks or domains. As models grow in size, full fine-tuning all parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g. LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices into some layers of the transformer architecture, called adapters. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to full fine-tuning all parameters. In this work, we look at low-rank adaptation from the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t the adapter parameters coming from their full fine-tuning, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the noise distribution and a Gaussian distribution with the same variance, we show that the dynamics of LoRA and FLoRA are very close to differentially private full fine-tuning the adapters, which suggests that low-rank adaptation implicitly provides privacy w.r.t the fine-tuning data. Finally, using Johnson-Lindenstrauss lemma, we show that when augmented with gradient clipping, low-rank adaptation is almost equivalent to differentially private full fine-tuning adapters with a fixed noise scale.
摘要:自然语言处理中的一个重要方法是在通用领域数据上进行大规模预训练,然后针对特定任务或领域进行适应。随着模型规模的扩大,对所有参数进行全面微调变得越来越不切实际。为了解决这个问题,一些针对语言模型的低秩任务适应方法被提出,例如 LoRA 和 FLoRA。这些方法保持预训练模型权重不变,并在 Transformer 架构的某些层中引入可训练的低秩分解矩阵,称为适配器。与对所有参数进行全面微调相比,这种方法显著减少了下游任务所需的可训练参数数量。在这项工作中,我们从数据隐私的角度审视低秩适应。我们理论上证明了 LoRA 和 FLoRA 中使用的低秩适应等同于向来自其全面微调的适配器参数的批次梯度中注入一些随机噪声,并量化了注入噪声的方差。通过在噪声分布与具有相同方差的高斯分布之间建立 Berry-Esseen 型总变差距离界限,我们表明 LoRA 和 FLoRA 的动力学非常接近于对适配器进行差分隐私全面微调,这表明低秩适应隐含地提供了关于微调数据的隐私保护。最后,利用 Johnson-Lindenstrauss 引理,我们表明,当与梯度裁剪结合时,低秩适应几乎等同于具有固定噪声尺度的差分隐私全面微调适配器。
[NLP-42] MUSE: Integrating Multi-Knowledge for Knowledge Graph Completion
链接: https://arxiv.org/abs/2409.17536 作者: Pengjie Liu 关键词-EN: Knowledge Graph Completion, Graph Completion, Knowledge Graph, aims to predict, existing KGC methods 类目: Computation and Language (cs.CL) 备注: arXiv admin note: text overlap with arXiv:2408.05283
点击查看摘要
Abstract:Knowledge Graph Completion (KGC) aims to predict the missing [relation] part of (head entity)–[relation]-(tail entity) triplet. Most existing KGC methods focus on single features (e.g., relation types) or sub-graph aggregation. However, they do not fully explore the Knowledge Graph (KG) features and neglect the guidance of external semantic knowledge. To address these shortcomings, we propose a knowledge-aware reasoning model (MUSE), which designs a novel multi-knowledge representation learning mechanism for missing relation prediction. Our model develops a tailored embedding space through three parallel components: 1) Prior Knowledge Learning for enhancing the triplets’ semantic representation by fine-tuning BERT; 2) Context Message Passing for enhancing the context messages of KG; 3) Relational Path Aggregation for enhancing the path representation from the head entity to the tail entity. The experimental results show that MUSE significantly outperforms other baselines on four public datasets, achieving over 5.50% H@1 improvement and 4.20% MRR improvement on the NELL995 dataset. The code and datasets will be released via this https URL.
摘要:知识图谱补全 (Knowledge Graph Completion, KGC) 旨在预测 (头实体)–[关系]-(尾实体) 三元组中缺失的 [关系] 部分。大多数现有的 KGC 方法侧重于单一特征(例如,关系类型)或子图聚合。然而,这些方法并未充分挖掘知识图谱 (Knowledge Graph, KG) 的特征,并且忽略了外部语义知识的指导。为了解决这些不足,我们提出了一种知识感知的推理模型 (MUSE),该模型设计了一种新颖的多知识表示学习机制,用于缺失关系的预测。我们的模型通过三个并行组件开发了一个定制的嵌入空间:1) 先验知识学习,通过微调 BERT 来增强三元组的语义表示;2) 上下文消息传递,用于增强 KG 的上下文消息;3) 关系路径聚合,用于增强从头实体到尾实体的路径表示。实验结果表明,MUSE 在四个公共数据集上显著优于其他基线方法,在 NELL995 数据集上实现了超过 5.50% 的 H@1 提升和 4.20% 的 MRR 提升。代码和数据集将通过此 https URL 发布。
[NLP-43] Data Proportion Detection for Optimized Data Management for Large Language Models
链接: https://arxiv.org/abs/2409.17527 作者: Hao Liang,Keshi Zhao,Yajie Yang,Bin Cui,Guosheng Dong,Zenan Zhou,Wentao Zhang 关键词-EN: Large language models, Large language, demonstrated exceptional performance, data preparation playing, data proportion detection 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks and domains, with data preparation playing a critical role in achieving these results. Pre-training data typically combines information from multiple domains. To maximize performance when integrating data from various domains, determining the optimal data proportion is essential. However, state-of-the-art (SOTA) LLMs rarely disclose details about their pre-training data, making it difficult for researchers to identify ideal data proportions. In this paper, we introduce a new topic, \textitdata proportion detection, which enables the automatic estimation of pre-training data proportions by analyzing the generated outputs of LLMs. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection. Based on these findings, we offer valuable insights into the challenges and future directions for effective data proportion detection and data management.
摘要:大语言模型 (LLMs) 在众多任务和领域中展现了卓越的性能,其中数据准备在实现这些成果中起到了关键作用。预训练数据通常结合了来自多个领域的信息。为了在整合来自不同领域的数据时最大化性能,确定最佳的数据比例至关重要。然而,最先进的 (SOTA) LLMs 很少披露其预训练数据的详细信息,这使得研究人员难以确定理想的数据比例。在本文中,我们引入了一个新课题,即数据比例检测 (data proportion detection),通过分析 LLMs 生成的输出来实现预训练数据比例的自动估算。我们提供了严格的理论证明、实用的算法以及初步的实验结果,以支持数据比例检测。基于这些发现,我们为有效数据比例检测和数据管理面临的挑战及未来方向提供了宝贵的见解。
[NLP-44] Comparing Unidirectional Bidirectional and Word2vec Models for Discovering Vulnerabilities in Compiled Lifted Code
链接: https://arxiv.org/abs/2409.17513 作者: Gary A. McCully,John D. Hastings,Shengjie Xu,Adam Fortier 关键词-EN: forms of malware, malware cause significant, significant financial, financial and operational, operational damage 类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE) 备注: 6 pages, 2 figures
点击查看摘要
Abstract:Ransomware and other forms of malware cause significant financial and operational damage to organizations by exploiting long-standing and often difficult-to-detect software vulnerabilities. To detect vulnerabilities such as buffer overflows in compiled code, this research investigates the application of unidirectional transformer-based embeddings, specifically GPT-2. Using a dataset of LLVM functions, we trained a GPT-2 model to generate embeddings, which were subsequently used to build LSTM neural networks to differentiate between vulnerable and non-vulnerable code. Our study reveals that embeddings from the GPT-2 model significantly outperform those from bidirectional models of BERT and RoBERTa, achieving an accuracy of 92.5% and an F1-score of 89.7%. LSTM neural networks were developed with both frozen and unfrozen embedding model layers. The model with the highest performance was achieved when the embedding layers were unfrozen. Further, the research finds that, in exploring the impact of different optimizers within this domain, the SGD optimizer demonstrates superior performance over Adam. Overall, these findings reveal important insights into the potential of unidirectional transformer-based approaches in enhancing cybersecurity defenses.
摘要:勒索软件和其他形式的恶意软件通过利用长期存在且难以检测的软件漏洞,对组织造成重大的财务和运营损害。为了检测编译代码中的漏洞,如缓冲区溢出,本研究探讨了单向 Transformer 嵌入的应用,特别是 GPT-2。使用 LLVM 函数的数据集,我们训练了一个 GPT-2 模型来生成嵌入,这些嵌入随后被用于构建 LSTM 神经网络,以区分易受攻击和不易受攻击的代码。我们的研究表明,GPT-2 模型生成的嵌入显著优于 BERT 和 RoBERTa 等双向模型的嵌入,达到了 92.5% 的准确率和 89.7% 的 F1 分数。LSTM 神经网络在冻结和未冻结嵌入模型层的情况下进行了开发。当嵌入层未冻结时,模型达到了最高性能。此外,研究发现,在探索该领域内不同优化器的影响时,SGD 优化器的表现优于 Adam。总体而言,这些发现揭示了单向 Transformer 方法在增强网络安全防御方面的潜力。
[NLP-45] HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection NEURIPS2024
链接: https://arxiv.org/abs/2409.17504 作者: Xuefeng Du,Chaowei Xiao,Yixuan Li 关键词-EN: large language models, language models, prompted concerns, misleading or fabricated, large language 类目: Machine Learning (cs.LG); Computation and Language (cs.CL) 备注: NeurIPS 2024 Spotlight
点击查看摘要
Abstract:The surge in applications of large language models (LLMs) has prompted concerns about the generation of misleading or fabricated information, known as hallucinations. Therefore, detecting hallucinations has become critical to maintaining trust in LLM-generated content. A primary challenge in learning a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To address the challenge, we introduce HaloScope, a novel learning framework that leverages the unlabeled LLM generations in the wild for hallucination detection. Such unlabeled data arises freely upon deploying LLMs in the open world, and consists of both truthful and hallucinated information. To harness the unlabeled data, we present an automated membership estimation score for distinguishing between truthful and untruthful generations within unlabeled mixture data, thereby enabling the training of a binary truthfulness classifier on top. Importantly, our framework does not require extra data collection and human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that HaloScope can achieve superior hallucination detection performance, outperforming the competitive rivals by a significant margin. Code is available at this https URL.
摘要:大语言模型 (LLM) 应用的激增引发了对其生成误导性或虚假信息(即幻觉)的担忧。因此,检测幻觉对于维护 LLM 生成内容的可信度至关重要。学习一个真实性分类器的主要挑战在于缺乏大量标记的真实和幻觉数据。为了应对这一挑战,我们引入了 HaloScope,这是一种新颖的学习框架,利用未标记的 LLM 生成数据进行幻觉检测。这种未标记数据在 LLM 部署于开放世界时自由产生,包含真实和幻觉信息。为了利用这些未标记数据,我们提出了一种自动成员资格估计评分,用于区分未标记混合数据中的真实和虚假生成,从而能够在此基础上训练一个二元真实性分类器。重要的是,我们的框架不需要额外的数据收集和人工标注,为实际应用提供了强大的灵活性和实用性。广泛的实验表明,HaloScope 能够实现卓越的幻觉检测性能,显著优于竞争对手。代码可在以下链接获取:https URL。
[NLP-46] MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models NEURIPS2024
链接: https://arxiv.org/abs/2409.17481 作者: Gongfan Fang,Hongxu Yin,Saurav Muralidharan,Greg Heinrich,Jeff Pool,Jan Kautz,Pavlo Molchanov,Xinchao Wang 关键词-EN: Large Language Models, massive parameter counts, Large Language, Language Models, significant redundancy 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: NeurIPS 2024 Spotlight
点击查看摘要
Abstract:Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M’') Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model’s 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM’s learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains. Code is available at \urlthis https URL.
摘要:大语言模型 (LLMs) 以其庞大的参数数量著称,这些参数通常导致显著的冗余。本文介绍了一种名为 MaskLLM 的可学习剪枝方法,该方法在大语言模型中建立了半结构化 (或称为“N:M”) 稀疏性,旨在减少推理过程中的计算开销。与开发新的重要性标准不同,MaskLLM 通过 Gumbel Softmax 采样显式地将 N:M 模式建模为可学习的分布。这种方法便于在大规模数据集上进行端到端训练,并具有两个显著优势:1) 高质量的掩码 - 我们的方法能够有效扩展到大型数据集并学习准确的掩码;2) 可迁移性 - 掩码分布的概率建模使得稀疏性可以在不同领域或任务之间进行迁移学习。我们使用 2:4 稀疏性对多种大语言模型进行了评估,包括 LLaMA-2、Nemotron-4 和 GPT-3,参数规模从 843M 到 15B 不等,实验结果显示我们的方法在性能上显著优于最先进的方法。例如,领先的方法在 Wikitext 数据集上达到的困惑度 (PPL) 为 10 或更高,而密集模型的 PPL 为 5.12,但 MaskLLM 仅通过学习掩码和冻结权重就达到了显著更低的 6.72 PPL。此外,MaskLLM 的可学习特性允许为下游任务或领域定制无损应用的 2:4 稀疏性掩码。代码可在 \urlthis https URL 获取。
[NLP-47] Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification
链接: https://arxiv.org/abs/2409.17474 作者: Guanyi Mou,Yichuan Li,Kyumin Lee 关键词-EN: model generalization ability, improving model generalization, generalization ability, shown its effectiveness, effectiveness in resolving 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: IEEE BigData 2021
点击查看摘要
Abstract:Data augmentation has shown its effectiveness in resolving the data-hungry problem and improving model’s generalization ability. However, the quality of augmented data can be varied, especially compared with the raw/original data. To boost deep learning models’ performance given augmented data/samples in text classification tasks, we propose a novel framework, which leverages both meta learning and contrastive learning techniques as parts of our design for reweighting the augmented samples and refining their feature representations based on their quality. As part of the framework, we propose novel weight-dependent enqueue and dequeue algorithms to utilize augmented samples’ weight/quality information effectively. Through experiments, we show that our framework can reasonably cooperate with existing deep learning models (e.g., RoBERTa-base and Text-CNN) and augmentation techniques (e.g., Wordnet and Easydata) for specific supervised learning tasks. Experiment results show that our framework achieves an average of 1.6%, up to 4.3% absolute improvement on Text-CNN encoders and an average of 1.4%, up to 4.4% absolute improvement on RoBERTa-base encoders on seven GLUE benchmark datasets compared with the best baseline. We present an indepth analysis of our framework design, revealing the non-trivial contributions of our network components. Our code is publicly available for better reproducibility.
摘要:数据增强在解决数据饥渴问题和提升模型泛化能力方面展现了其有效性。然而,增强数据的质量可能参差不齐,尤其是与原始数据相比。为了在文本分类任务中利用增强数据/样本提升深度学习模型的性能,我们提出了一种新颖的框架,该框架结合了元学习 (meta learning) 和对比学习 (contrastive learning) 技术,用于根据增强样本的质量重新加权并优化其特征表示。作为框架的一部分,我们提出了新的权重依赖的入队和出队算法,以有效利用增强样本的权重/质量信息。通过实验,我们展示了该框架能够合理地与现有的深度学习模型(如 RoBERTa-base 和 Text-CNN)以及增强技术(如 Wordnet 和 Easydata)协同工作,用于特定的监督学习任务。实验结果表明,与最佳基线相比,我们的框架在七个 GLUE 基准数据集上,对 Text-CNN 编码器实现了平均 1.6%、最高 4.3% 的绝对提升,对 RoBERTa-base 编码器实现了平均 1.4%、最高 4.4% 的绝对提升。我们深入分析了框架设计,揭示了网络组件的非平凡贡献。我们的代码已公开,以促进更好的可复现性。
[NLP-48] Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards EMNLP2024
链接: https://arxiv.org/abs/2409.17472 作者: Heejin Do,Sangwon Ryu,Gary Geunbae Lee 关键词-EN: provide enriched feedback, evaluating multiple traits, Recent advances, automated essay scoring, enriched feedback 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: EMNLP 2024
点击查看摘要
Abstract:Recent advances in automated essay scoring (AES) have shifted towards evaluating multiple traits to provide enriched feedback. Like typical AES systems, multi-trait AES employs the quadratic weighted kappa (QWK) to measure agreement with human raters, aligning closely with the rating schema; however, its non-differentiable nature prevents its direct use in neural network training. In this paper, we propose Scoring-aware Multi-reward Reinforcement Learning (SaMRL), which integrates actual evaluation schemes into the training process by designing QWK-based rewards with a mean-squared error penalty for multi-trait AES. Existing reinforcement learning (RL) applications in AES are limited to classification models despite associated performance degradation, as RL requires probability distributions; instead, we adopt an autoregressive score generation framework to leverage token generation probabilities for robust multi-trait score predictions. Empirical analyses demonstrate that SaMRL facilitates model training, notably enhancing scoring of previously inferior prompts.
摘要:近年来,自动作文评分 (Automated Essay Scoring, AES) 的发展趋势转向评估多种特征以提供更丰富的反馈。与典型的 AES 系统类似,多特征 AES 使用二次加权 Kappa (Quadratic Weighted Kappa, QWK) 来衡量与人工评分者的一致性,这与评分方案紧密契合;然而,其不可微分的特性使其无法直接用于神经网络训练。本文提出了一种评分感知的多奖励强化学习 (Scoring-aware Multi-reward Reinforcement Learning, SaMRL),通过设计基于 QWK 的奖励和均方误差惩罚,将实际评估方案整合到多特征 AES 的训练过程中。现有的强化学习 (Reinforcement Learning, RL) 在 AES 中的应用局限于分类模型,尽管存在性能下降的问题,因为 RL 需要概率分布;相反,我们采用了一种自回归评分生成框架,利用 Token 生成概率来实现稳健的多特征评分预测。实证分析表明,SaMRL 有助于模型训练,显著提升了以往评分较低的提示的评分效果。
[NLP-49] What is the social benefit of hate speech detection research? A Systematic Review
链接: https://arxiv.org/abs/2409.17467 作者: Sidney Gig-Jan Wong 关键词-EN: non-profit organisations, grown exponentially, minimal uptake, uptake or engagement, engagement from policy 类目: Computation and Language (cs.CL) 备注: Accepted to the 3rd Workshop on NLP for Positive Impact
点击查看摘要
Abstract:While NLP research into hate speech detection has grown exponentially in the last three decades, there has been minimal uptake or engagement from policy makers and non-profit organisations. We argue the absence of ethical frameworks have contributed to this rift between current practice and best practice. By adopting appropriate ethical frameworks, NLP researchers may enable the social impact potential of hate speech research. This position paper is informed by reviewing forty-eight hate speech detection systems associated with thirty-seven publications from different venues.
摘要:尽管过去三十年中自然语言处理 (NLP) 领域的仇恨言论检测研究呈指数级增长,但政策制定者和非营利组织对此的采纳和参与却极为有限。我们认为,缺乏伦理框架是导致当前实践与最佳实践之间差距的主要原因。通过采用适当的伦理框架,NLP 研究人员可以释放仇恨言论研究的社会影响潜力。本文通过对来自不同渠道的三十七篇相关出版物中的四十八个仇恨言论检测系统进行回顾,为这一立场提供了依据。
[NLP-50] RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
链接: https://arxiv.org/abs/2409.17458 作者: Yifan Jiang,Kriti Aggarwal,Tanmay Laud,Kashif Munir,Jay Pujara,Subhabrata Mukherjee 关键词-EN: Large Language Models, RED QUEEN ATTACK, presents challenges related, RED QUEEN, Large Language 类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we, first, propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model’s performance across standard benchmarks. Full implementation and dataset are publicly accessible at this https URL.
摘要:大语言模型 (LLM) 的快速发展为各个领域和应用带来了新的机遇;然而,它也带来了潜在滥用的挑战。为了减轻这些风险,红队测试 (red teaming) 已被用作一种主动的安全措施,通过越狱攻击 (jailbreak attacks) 来探测语言模型的有害输出。然而,当前的越狱攻击方法主要是单轮的,且恶意查询明确,未能完全捕捉现实世界交互的复杂性。实际上,用户可以与基于 LLM 的聊天助手进行多轮交互,从而以更隐蔽的方式隐藏其真实意图。为了填补这一空白,我们首先提出了一种新的越狱方法,即 RED QUEEN ATTACK。该方法构建了一个多轮场景,将恶意意图隐藏在防止伤害的伪装之下。我们设计了 40 个不同轮次的场景,并选择了 14 个有害类别,生成了 56k 个多轮攻击数据点。我们在四个不同大小的代表性 LLM 家族上进行了全面的 RED QUEEN ATTACK 实验。我们的实验结果表明,所有 LLM 都对 RED QUEEN ATTACK 存在漏洞,GPT-4o 的攻击成功率达到 87.62%,Llama3-70B 的攻击成功率达到 75.4%。进一步分析表明,较大的模型对 RED QUEEN ATTACK 更为敏感,多轮结构和隐藏策略是其成功的关键。为了优先考虑安全性,我们引入了一种简单的缓解策略,称为 RED QUEEN GUARD,该策略使 LLM 能够有效抵御对抗性攻击。这种方法将攻击成功率降低到 1% 以下,同时保持了模型在标准基准测试中的性能。完整的实现和数据集可在以下链接公开获取:https URL。
[NLP-51] Navigating the Shortcut Maze: A Comprehensive Analysis of Shortcut Learning in Text Classification by Language Models
链接: https://arxiv.org/abs/2409.17455 作者: Yuqing Zhou,Ruixiang Tang,Ziyu Yao,Ziwei Zhu 关键词-EN: spurious correlations, undermining their accuracy, accuracy and generalizability, depend on spurious, Language models 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Language models (LMs), despite their advances, often depend on spurious correlations, undermining their accuracy and generalizability. This study addresses the overlooked impact of subtler, more complex shortcuts that compromise model reliability beyond oversimplified shortcuts. We introduce a comprehensive benchmark that categorizes shortcuts into occurrence, style, and concept, aiming to explore the nuanced ways in which these shortcuts influence the performance of LMs. Through extensive experiments across traditional LMs, large language models, and state-of-the-art robust models, our research systematically investigates models’ resilience and susceptibilities to sophisticated shortcuts. Our benchmark and code can be found at: this https URL.
摘要:语言模型 (LMs) 尽管取得了进展,但往往依赖于虚假的相关性,从而削弱了其准确性和可推广性。本研究针对那些被忽视的、更为微妙且复杂的捷径对模型可靠性的影响,这些捷径超越了过于简化的捷径。我们引入了一个综合基准,将捷径分为发生、风格和概念三类,旨在探索这些捷径以微妙的方式影响 LMs 性能的途径。通过在传统 LMs、大语言模型以及最先进的鲁棒模型上进行广泛的实验,我们的研究系统地调查了模型对复杂捷径的韧性和脆弱性。我们的基准和代码可以在以下链接找到:this https URL。
[NLP-52] Enhancing Financial Sentiment Analysis with Expert-Designed Hint
链接: https://arxiv.org/abs/2409.17448 作者: Chung-Chi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao 关键词-EN: social media posts, financial social media, enhancing sentiment analysis, media posts, paper investigates 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:This paper investigates the role of expert-designed hint in enhancing sentiment analysis on financial social media posts. We explore the capability of large language models (LLMs) to empathize with writer perspectives and analyze sentiments. Our findings reveal that expert-designed hint, i.e., pointing out the importance of numbers, significantly improve performances across various LLMs, particularly in cases requiring perspective-taking skills. Further analysis on tweets containing different types of numerical data demonstrates that the inclusion of expert-designed hint leads to notable improvements in sentiment analysis performance, especially for tweets with monetary-related numbers. Our findings contribute to the ongoing discussion on the applicability of Theory of Mind in NLP and open new avenues for improving sentiment analysis in financial domains through the strategic use of expert knowledge. 摘要: 本文探讨了专家设计的提示在增强金融社交媒体帖子情感分析中的作用。我们研究了大语言模型 (LLMs) 在共情作者视角和分析情感方面的能力。研究结果表明,专家设计的提示,即指出数字的重要性,显著提升了各种 LLMs 的性能,特别是在需要视角转换技能的情况下。进一步对包含不同类型数值数据的推文进行分析,结果显示,引入专家设计的提示显著提升了情感分析的性能,尤其是在涉及货币相关数字的推文中。我们的研究为自然语言处理 (NLP) 中关于心智理论适用性的持续讨论做出了贡献,并为通过战略性运用专家知识来改进金融领域情感分析开辟了新的途径。
[NLP-53] HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows
Abstract:Despite recent advancements in large language models (LLMs), their performance on complex reasoning problems requiring multi-step thinking and combining various skills is still limited. To address this, we propose a novel framework HDFlow for complex reasoning with LLMs that combines fast and slow thinking modes in an adaptive manner. Our approach consists of two key components: 1) a new approach for slow, deliberate reasoning called Dynamic Workflow, which automatically decomposes complex problems into more manageable sub-tasks and dynamically designs a workflow to assemble specialized LLM or symbolic reasoning tools to solve sub-tasks; 2) Hybrid Thinking, a general framework that dynamically combines fast and slow thinking based on problem complexity. Finally, we propose an easy-to-scale method for automatically synthesizing a large-scale dataset of 27K challenging reasoning problems for complex reasoning and a hybrid thinking tuning method that trains smaller LLMs on this dataset to internalize the fast/slow hybrid reasoning strategies. Experiments on four reasoning benchmark datasets demonstrate that our slow thinking with dynamic workflows significantly outperforms Chain-of-Thought, and hybrid thinking achieves the highest accuracy while providing an effective balance between computational efficiency and performance. Fine-tuning using our hybrid thinking approach also significantly boosts the complex reasoning capabilities of open-source language models. The results showcase the promise of slow thinking, dynamic workflows, and hybrid thinking in expanding the frontier of complex problem-solving with LLMs\footnoteCode and data will be released at \urlthis https URL…
摘要:尽管大语言模型 (LLM) 在近年来取得了显著进展,但在处理需要多步骤思考和结合多种技能的复杂推理问题时,其表现仍然有限。为解决这一问题,我们提出了一种名为 HDFlow 的新框架,该框架通过自适应方式结合快速和慢速思考模式来进行复杂推理。我们的方法包括两个关键组件:1) 一种名为动态工作流 (Dynamic Workflow) 的新方法,用于缓慢、深思熟虑的推理,该方法能够自动将复杂问题分解为更易管理的子任务,并动态设计工作流程以组合专门的 LLM 或符号推理工具来解决这些子任务;2) 混合思考 (Hybrid Thinking),这是一个通用的框架,能够根据问题复杂性动态结合快速和慢速思考。最后,我们提出了一种易于扩展的方法,用于自动合成包含 27,000 个挑战性推理问题的复杂推理大规模数据集,并提出了一种混合思考调优方法,该方法训练较小的 LLM 以内化快速/慢速混合推理策略。在四个推理基准数据集上的实验表明,我们的动态工作流慢速思考方法显著优于思维链 (Chain-of-Thought),而混合思考在提供计算效率与性能之间有效平衡的同时,达到了最高的准确率。使用我们的混合思考方法进行微调,也显著提升了开源语言模型的复杂推理能力。这些结果展示了慢速思考、动态工作流和混合思考在扩展 LLM 解决复杂问题前沿的潜力 [20]。
链接: https://arxiv.org/abs/2409.17431 作者: Jinghong Chen,Guangyu Yang,Weizhe Lin,Jingbiao Mei,Bill Byrne 关键词-EN: pair-wise comparisons, derive and investigate, possibility of declaring, DPO variants, DPO 类目: Computation and Language (cs.CL) 备注: 24 pages
点击查看摘要
Abstract:We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization as opposed to simply discarding them.
摘要:我们推导并研究了两种 DPO 变体,这些变体明确地建模了在成对比较中宣布平局的可能性。我们用 Rao 和 Kupper 以及 Davidson 提出的两种著名建模扩展替换了 DPO 中的 Bradley-Terry 模型,这些扩展为平局分配了概率,作为明确偏好的替代方案。我们在神经机器翻译和摘要生成任务中的实验表明,可以向这些 DPO 变体的数据集中添加明确标记的平局,而不会出现将相同平局对呈现给 DPO 时观察到的任务性能下降。我们通过经验发现,包含平局会导致相对于参考策略的更强正则化,如 KL 散度所衡量,即使在 DPO 的原始形式中也能看到这一点。这些发现促使并实现了在偏好优化中包含平局对,而不是简单地丢弃它们。
[NLP-55] Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理长上下文输入时面临的计算资源和延迟增加的问题。解决方案的关键在于提出了一种名为GemFilter的新算法,该算法利用LLM早期层的注意力机制来筛选和压缩输入令牌,从而显著减少后续处理的上下文长度。这种方法不仅提高了推理速度(2.4倍加速)和GPU内存效率(减少30%的内存使用),而且在Needle in a Haystack任务中显著优于标准注意力机制和SnapKV/H2O,同时在LongBench挑战中表现相当。GemFilter的简单性、无需训练以及广泛适用性使其成为优化LLM设计和推理的重要工具。
链接: https://arxiv.org/abs/2409.17422 作者: Zhenmei Shi,Yifei Ming,Xuan-Phi Nguyen,Yingyu Liang,Shafiq Joty 关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, increased computational resources 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4 \times speedup and 30% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention, SnapKV and demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at \urlthis https URL.
摘要:大语言模型 (LLMs) 在处理长上下文输入方面展示了显著的能力,但这是以增加计算资源和延迟为代价的。我们的研究引入了一种新颖的方法来解决长上下文瓶颈问题,以加速 LLM 推理并减少 GPU 内存消耗。我们的研究表明,LLMs 在生成查询答案之前,可以在早期层中识别相关 Token。基于这一洞察,我们提出了一种算法,该算法利用 LLM 的早期层作为过滤器,选择和压缩输入 Token,从而显著减少后续处理的上下文长度。我们的方法,GemFilter,与现有技术(如标准注意力机制和 SnapKV/H2O)相比,在速度和内存效率方面展示了显著的改进。值得注意的是,与最先进的方法相比,它实现了 2.4 倍的加速和 30% 的 GPU 内存使用量减少。在“Needle in a Haystack”任务上的评估显示,GemFilter 显著优于标准注意力机制和 SnapKV,并在 LongBench 挑战中展示了可比拟的性能。GemFilter 简单、无需训练,并且广泛适用于不同的大语言模型。关键的是,它通过允许人类检查选定的输入序列,提供了可解释性。这些发现不仅为 LLM 部署提供了实际效益,而且增强了我们对于 LLM 内部机制的理解,为 LLM 设计和推理的进一步优化铺平了道路。我们的代码可在 \urlthis https URL 获取。
[NLP-56] Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction
Abstract:Understanding the duration of news events’ impact on the stock market is crucial for effective time-series forecasting, yet this facet is largely overlooked in current research. This paper addresses this research gap by introducing a novel dataset, the Impact Duration Estimation Dataset (IDED), specifically designed to estimate impact duration based on investor opinions. Our research establishes that pre-finetuning language models with IDED can enhance performance in text-based stock movement predictions. In addition, we juxtapose our proposed pre-finetuning task with sentiment analysis pre-finetuning, further affirming the significance of learning impact duration. Our findings highlight the promise of this novel research direction in stock movement prediction, offering a new avenue for financial forecasting. We also provide the IDED and pre-finetuned language models under the CC BY-NC-SA 4.0 license for academic use, fostering further exploration in this field.
摘要:理解新闻事件对股票市场影响的持续时间对于有效的时间序列预测至关重要,然而这一方面在当前研究中大多被忽视。本文通过引入一个新颖的数据集——影响持续时间估计数据集 (Impact Duration Estimation Dataset, IDED),旨在基于投资者意见估计影响持续时间,填补了这一研究空白。我们的研究表明,使用 IDED 对语言模型进行预微调可以提升基于文本的股票走势预测性能。此外,我们将提出的预微调任务与情感分析预微调进行对比,进一步确认了学习影响持续时间的重要性。研究结果突显了这一新颖研究方向在股票走势预测中的潜力,为金融预测开辟了新的途径。我们还根据 CC BY-NC-SA 4.0 许可证提供了 IDED 和预微调语言模型,以促进该领域的进一步探索。
[NLP-57] Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis
链接: https://arxiv.org/abs/2409.17417 作者: Chung-Chi Chen,Hen-Hsen Huang,Hsin-Hsi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao 关键词-EN: media platform development, individuals readily share, social media platform, rapid Internet, Internet and social 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:In the era of rapid Internet and social media platform development, individuals readily share their viewpoints online. The overwhelming quantity of these posts renders comprehensive analysis impractical. This necessitates an efficient recommendation system to filter and present significant, relevant opinions. Our research introduces a dual-pronged argument mining technique to improve recommendation system effectiveness, considering both professional and amateur investor perspectives. Our first strategy involves using the discrepancy between target and closing prices as an opinion indicator. The second strategy applies argument mining principles to score investors’ opinions, subsequently ranking them by these scores. Experimental results confirm the effectiveness of our approach, demonstrating its ability to identify opinions with higher profit potential. Beyond profitability, our research extends to risk analysis, examining the relationship between recommended opinions and investor behaviors. This offers a holistic view of potential outcomes following the adoption of these recommended opinions.
摘要:在互联网和社交媒体平台快速发展的时代,个人可以轻松地在线分享他们的观点。然而,海量的帖子使得全面分析变得不切实际。这要求我们开发一个高效的推荐系统,以筛选和展示重要且相关的意见。我们的研究提出了一种双管齐下的论点挖掘技术,以提高推荐系统的有效性,同时考虑专业投资者和业余投资者的视角。我们的第一种策略是利用目标价格与收盘价格之间的差异作为观点的指标。第二种策略则应用论点挖掘原则对投资者的观点进行评分,并根据这些评分对观点进行排序。实验结果证实了我们的方法的有效性,展示了其识别具有更高盈利潜力观点的能力。除了盈利性,我们的研究还扩展到风险分析,探讨了推荐观点与投资者行为之间的关系。这为我们提供了一个全面的视角,以评估在采纳这些推荐观点后可能产生的结果。
[NLP-58] From Deception to Detection: The Dual Roles of Large Language Models in Fake News
链接: https://arxiv.org/abs/2409.17416 作者: Dorsaf Sallami,Yuan-Chen Chang,Esma Aïmeur 关键词-EN: Fake, public trust, poses a significant, significant threat, ecosystems and public 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Fake news poses a significant threat to the integrity of information ecosystems and public trust. The advent of Large Language Models (LLMs) holds considerable promise for transforming the battle against fake news. Generally, LLMs represent a double-edged sword in this struggle. One major concern is that LLMs can be readily used to craft and disseminate misleading information on a large scale. This raises the pressing questions: Can LLMs easily generate biased fake news? Do all LLMs have this capability? Conversely, LLMs offer valuable prospects for countering fake news, thanks to their extensive knowledge of the world and robust reasoning capabilities. This leads to other critical inquiries: Can we use LLMs to detect fake news, and do they outperform typical detection models? In this paper, we aim to address these pivotal questions by exploring the performance of various LLMs. Our objective is to explore the capability of various LLMs in effectively combating fake news, marking this as the first investigation to analyze seven such models. Our results reveal that while some models adhere strictly to safety protocols, refusing to generate biased or misleading content, other models can readily produce fake news across a spectrum of biases. Additionally, our results show that larger models generally exhibit superior detection abilities and that LLM-generated fake news are less likely to be detected than human-written ones. Finally, our findings demonstrate that users can benefit from LLM-generated explanations in identifying fake news.
摘要:虚假新闻对信息生态系统的完整性和公众信任构成了重大威胁。大语言模型 (LLM) 的出现为对抗虚假新闻带来了巨大的希望。然而,LLM 在这场斗争中是一把双刃剑。一个主要担忧是,LLM 可以被轻易用于大规模制造和传播误导性信息。这引发了一个紧迫的问题:LLM 是否容易生成带有偏见的虚假新闻?所有 LLM 都具备这种能力吗?相反,LLM 由于其广泛的世界知识和强大的推理能力,为对抗虚假新闻提供了宝贵的可能性。这引出了其他关键问题:我们能否利用 LLM 来检测虚假新闻,并且它们是否优于典型的检测模型?本文旨在通过探索各种 LLM 的性能来回答这些关键问题。我们的目标是探索不同 LLM 在有效对抗虚假新闻方面的能力,这标志着首次对七种此类模型进行分析的研究。我们的结果表明,尽管某些模型严格遵守安全协议,拒绝生成带有偏见或误导性的内容,但其他模型可以轻易地在各种偏见范围内生成虚假新闻。此外,我们的结果显示,较大的模型通常表现出更强的检测能力,并且由 LLM 生成的虚假新闻比人类编写的更难被检测到。最后,我们的研究结果表明,用户可以从 LLM 生成的解释中受益,以识别虚假新闻。
[NLP-59] Post-hoc Reward Calibration: A Case Study on Length Bias
链接: https://arxiv.org/abs/2409.17407 作者: Zeyu Huang,Zihan Qiu,Zili Wang,Edoardo M. Ponti,Ivan Titov 关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Human Feedback aligns, translates human feedback 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: Preprint
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback aligns the outputs of Large Language Models with human values and preferences. Central to this process is the reward model (RM), which translates human feedback into training signals for optimising LLM behaviour. However, RMs can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality. These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours in LLMs alignment. This paper addresses the challenge of correcting such biases without additional data and training, introducing the concept of Post-hoc Reward Calibration. We first propose an intuitive approach to estimate the bias term and, thus, remove it to approximate the underlying true reward. We then extend the approach to a more general and robust form with the Locally Weighted Regression. Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the RewardBench dataset; (2) enhanced alignment of RM rankings with GPT-4 evaluations and human preferences based on the AlpacaEval benchmark; and (3) improved Length-Controlled win rate of the RLHF process in multiple LLM–RM combinations. Our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment. Our code and results are available at this https URL.
摘要:基于人类反馈的强化学习使大语言模型的输出与人类价值观和偏好相一致。这一过程的核心是奖励模型 (RM),它将人类反馈转化为优化大语言模型行为的训练信号。然而,RM 可能会通过利用训练数据中的虚假相关性产生偏差,例如偏好基于长度或风格的输出而非真正的质量。这些偏差可能导致输出排序错误、模型评估次优,以及在大语言模型对齐过程中放大不良行为。本文针对在不增加数据和训练的情况下纠正此类偏差的问题,提出了事后奖励校准的概念。我们首先提出了一种直观的方法来估计偏差项,从而将其移除以近似真实的奖励。然后,我们通过局部加权回归将该方法扩展为一种更通用和稳健的形式。重点针对普遍存在的长度偏差,我们在三种实验设置中验证了我们提出的方法,展示了持续的改进:(1) 在 RewardBench 数据集上,33 个奖励模型的平均性能提升了 3.11;(2) 增强了 RM 排序与 GPT-4 评估和基于 AlpacaEval 基准的人类偏好的一致性;(3) 在多个大语言模型与 RM 组合中,RLHF 过程的长度控制胜率有所提高。我们的方法在计算上高效且可推广到其他类型的偏差和 RM,为缓解大语言模型对齐中的偏差提供了一种可扩展且稳健的解决方案。我们的代码和结果可在以下链接获取:https URL。
[NLP-60] Severity Prediction in Mental Health: LLM-based Creation Analysis Evaluation of a Novel Multilingual Dataset
链接: https://arxiv.org/abs/2409.17397 作者: Konstantinos Skianis,John Pavlopoulos,A. Seza Doğruöz 关键词-EN: mental health support, mental health, health support systems, including mental health, health support 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly integrated into various medical fields, including mental health support systems. However, there is a gap in research regarding the effectiveness of LLMs in non-English mental health support applications. To address this problem, we present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages (Greek, Turkish, French, Portuguese, German, and Finnish). This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages. By experimenting with GPT and Llama, we observe considerable variability in performance across languages, despite being evaluated on the same translated dataset. This inconsistency underscores the complexities inherent in multilingual mental health support, where language-specific nuances and mental health data coverage can affect the accuracy of the models. Through comprehensive error analysis, we emphasize the risks of relying exclusively on large language models (LLMs) in medical settings (e.g., their potential to contribute to misdiagnoses). Moreover, our proposed approach offers significant cost savings for multilingual tasks, presenting a major advantage for broad-scale implementation.
摘要:大语言模型 (LLMs) 正越来越多地被整合到包括心理健康支持系统在内的各个医疗领域中。然而,关于 LLMs 在非英语心理健康支持应用中的有效性研究存在空白。为解决这一问题,我们提出了一种新颖的多语言适应方法,将广泛使用的心理健康数据集从英语翻译成六种语言 (希腊语、土耳其语、法语、葡萄牙语、德语和芬兰语)。该数据集能够全面评估 LLM 在检测心理健康状况及其严重程度方面的性能,涵盖多种语言。通过在 GPT 和 Llama 上进行实验,我们观察到尽管在相同的翻译数据集上进行评估,模型在不同语言中的表现存在显著差异。这种不一致性突显了多语言心理健康支持中固有的复杂性,其中语言特定的细微差别和心理健康数据的覆盖范围可能影响模型的准确性。通过全面的错误分析,我们强调了在医疗环境中完全依赖大语言模型 (LLMs) 的风险 (例如,它们可能导致误诊的潜在风险)。此外,我们提出的方法在多语言任务中提供了显著的成本节省,为大规模实施提供了主要优势。
[NLP-61] Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia EMNLP2024
链接: https://arxiv.org/abs/2409.17391 作者: Zhejian Zhou,Jiayu Wang,Dahua Lin,Kai Chen 关键词-EN: shown remarkable abilities, numeric operations accurately, Large Language Models, performing numeric operations, mathematics reasoning 类目: Computation and Language (cs.CL) 备注: EMNLP 2024 Findings
点击查看摘要
Abstract:Though Large Language Models (LLMs) have shown remarkable abilities in mathematics reasoning, they are still struggling with performing numeric operations accurately, such as addition and multiplication. Numbers can be tokenized into tokens in various ways by different LLMs and affect the numeric operations performance. Currently, there are two representatives: 1) Tokenize into 1 -digit, and 2) Tokenize into 1\sim 3 digit. The difference is roughly equivalent to using different numeral systems (namely base 10 or base 10^3 ). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base 10 system is consistently more data-efficient than a base 10^2 or 10^3 system across training data scale, model sizes under from-scratch training settings, while different number systems have very similar fine-tuning performances. We attribute this to higher token frequencies of a base 10 system. Additionally, we reveal extrapolation behavior patterns on addition and multiplication. We identify that base 100 and base 1000 systems struggle on token-level discernment and token-level operations. We also sheds light on the mechanism learnt by the models.
摘要:尽管大语言模型 (LLM) 在数学推理方面展现了卓越的能力,但在执行加法和乘法等数值运算时仍面临挑战。不同的大语言模型可以通过多种方式将数字 Token 化,从而影响数值运算的性能。目前,主要有两种代表性方法:1) 将数字 Token 化为 1 位数字,2) 将数字 Token 化为 1 到 3 位数字。这两种方法的差异大致相当于使用不同的数制(即十进制或千进制)。基于此,我们研究了在基于 Transformer 的大语言模型背景下,不同数制的缩放行为。我们通过实证表明,在从头开始训练的设置下,十进制系统在训练数据规模和模型大小方面始终比百进制或千进制系统更具数据效率,而不同的数制在微调性能上非常相似。我们将此归因于十进制系统更高的 Token 频率。此外,我们揭示了加法和乘法上的外推行为模式。我们发现,百进制和千进制系统在 Token 级别的辨别和 Token 级别的运算上存在困难。我们还阐明了模型所学机制的原理。
[NLP-62] data2lang2vec: Data Driven Typological Features Completion
链接: https://arxiv.org/abs/2409.17373 作者: Hamidreza Amirzadeh,Sadegh Jafari,Anika Harju,Rob van der Goot 关键词-EN: Natural Language Processing, diverse linguistic structures, enhance multi-lingual Natural, improving model adaptability, multi-lingual Natural Language 类目: Computation and Language (cs.CL) 备注: 9 pages, 11 figures
点击查看摘要
Abstract:Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely-used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage predicts missing values based on features from other languages or focuses on single features, we propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on likely to be missing typology features, and show that our approach outperforms previous work in both setups.
摘要:语言类型学数据库通过提高模型对多样语言结构的适应性,增强了多语言自然语言处理 (NLP) 的能力。广泛使用的 lang2vec 工具包整合了多个此类数据库,但其覆盖率仍限于 28.9%。先前的工作主要通过基于其他语言特征的预测来自动增加覆盖率,或专注于单一特征,我们提出利用文本数据进行更明智的特征预测。为此,我们引入了一个多语言词性标注器 (POS tagger),在 1,749 种语言中实现了超过 70% 的准确率,并实验了外部统计特征和多种机器学习算法。我们还引入了一种更现实的评估设置,重点关注可能缺失的类型学特征,并展示了我们的方法在两种设置下均优于先前的工作。
[NLP-63] Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
链接: https://arxiv.org/abs/2409.17353 作者: Robin Shing-Hei Yuen,Timothy Tin-Long Tse,Jian Zhu 关键词-EN: Current speech-based LLMs, Current speech-based, excelling in tasks, predominantly trained, trained on extensive 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. Our approach reduces latency and improves the model’s native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.
摘要:当前基于语音的大语言模型 (LLM) 主要通过大量的自动语音识别 (ASR) 和文本转语音 (TTS) 数据集进行训练,在这些领域相关的任务中表现出色。然而,它们在处理直接的语音到语音对话方面的能力仍然显著受限。这些模型通常依赖于一个 ASR 到 TTS 的链式思维流程,先将语音转换为文本进行处理,然后再生成音频响应,这引入了延迟并丢失了音频特征。我们提出了一种方法,将 ASR 链式思维隐式内化到语音大语言模型中,增强其对语音的固有理解能力。我们的方法减少了延迟,并提高了模型对语音的固有理解能力,为更高效和自然的实时音频交互铺平了道路。我们还发布了一个大规模的合成对话数据集,以促进进一步的研究。
[NLP-64] How Transliterations Improve Crosslingual Alignment
链接: https://arxiv.org/abs/2409.17326 作者: Yihong Liu,Mingyang Wang,Amir Hossein Kargaran,Ayyoob Imani,Orgest Xhelili,Haotian Ye,Chunlan Ma,François Yvon,Hinrich Schütze 关键词-EN: post-aligning multilingual pretrained, Recent studies, multilingual pretrained language, studies have shown, shown that post-aligning 类目: Computation and Language (cs.CL) 备注: preprint
点击查看摘要
Abstract:Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experiments show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better alignments. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.
摘要:近期研究表明,通过在原始数据和音译数据上使用对齐目标对多语言预训练语言模型 (mPLMs) 进行后对齐,可以提升跨语言对齐效果。这种提升进一步带来了更好的跨语言迁移性能。然而,目前尚不清楚这种更好的跨语言对齐是如何以及为何实现的,因为该技术仅涉及音译,并未使用任何平行数据。本文尝试明确评估跨语言对齐,并识别基于音译方法中对性能提升起关键作用的因素。为此,我们在两种相关语言对(波兰语与乌克兰语,以及印地语与乌尔都语)上,采用不同设置训练了多个模型。为了评估对齐效果,我们定义了基于句子表示的四种相似性类型。实验结果显示,仅添加音译数据就能提升整体相似性,即使是随机句子对。借助辅助对齐目标,特别是对比目标,模型能够区分匹配对与随机对,从而实现更好的对齐。然而,我们也发现更好的对齐并不总能带来更好的下游性能,这表明需要进一步研究以阐明对齐与性能之间的关系。
[NLP-65] Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation EMNLP2024
链接: https://arxiv.org/abs/2409.17313 作者: Zehao Wang,Minye Wu,Yixin Cao,Yubo Ma,Meiqi Chen,Tinne Tuytelaars 关键词-EN: study presents, instruction categories, evaluation framework, VLN, Vision-Language Navigation 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: EMNLP 2024 Findings; project page: this https URL
点击查看摘要
Abstract:This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models for various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task. The CFG serves as the basis for the problem decomposition and the core premise of the instruction categories design. We propose a semi-automatic method for CFG construction with the help of Large-Language Models (LLMs). Then, we induct and generate data spanning five principal instruction categories (i.e. direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other interesting findings contribute to the development of future language-guided navigation systems.
摘要:本研究提出了一种新颖的视觉-语言导航 (Vision-Language Navigation, VLN) 任务评估框架。其目标是在更细粒度的层面上诊断当前模型在各种指令类别中的表现。该框架围绕任务的上下文无关文法 (Context-Free Grammar, CFG) 构建。CFG 作为问题分解的基础和指令类别设计的核心前提。我们提出了一种半自动的 CFG 构建方法,借助大语言模型 (Large-Language Models, LLMs) 的帮助。随后,我们归纳并生成了涵盖五个主要指令类别(即方向变化、地标识别、区域识别、垂直移动和数值理解)的数据。我们对不同模型的分析揭示了显著的性能差异和反复出现的问题。数值理解的停滞、对方向概念的严重选择性偏差以及其他有趣的发现,为未来语言引导导航系统的发展提供了贡献。
[NLP-66] BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data CONLL2024
链接: https://arxiv.org/abs/2409.17312 作者: Jean-Loup Tastet,Inar Timiryasov 关键词-EN: million word corpus, million parameter model, parameter model distillation-pretrained, million word, million word datasets 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: 9 pages, 3 figures, 5 tables, submitted to the BabyLM Challenge (CoNLL 2024 Shared Task)
点击查看摘要
Abstract:We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantages of distillation cannot be attributed to suboptimal hyperparameter selection of the teachers. Our findings underscore the need for further investigation into distillation techniques, particularly in data-limited settings.
摘要:我们提出了 BabyLlama-2,这是一个 3.45 亿参数的模型,通过从两个教师模型在 1000 万词的语料库上进行蒸馏预训练,用于 BabyLM 竞赛。在 BLiMP 和 SuperGLUE 基准测试中,BabyLlama-2 的表现优于在相同数据混合的 1000 万和 1 亿词数据集上训练的基线模型,以及其教师模型。通过广泛的参数扫描,我们证明了蒸馏的优势不能归因于教师模型次优的超参数选择。我们的研究结果强调了进一步研究蒸馏技术,特别是在数据受限环境中的必要性。
[NLP-67] On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains
链接: https://arxiv.org/abs/2409.17275 作者: Xun Xian,Ganghua Wang,Xuan Bi,Jayanth Srinivasa,Ashish Kundu,Charles Fleming,Mingyi Hong,Jie Ding 关键词-EN: large language models, Retrieval-Augmented Generation, language models, legal contexts, empirically shown 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs’ generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q\A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any users, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query’s embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q\A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 已被实证证明能够提升大语言模型 (Large Language Models, LLMs) 在医疗、金融和法律等知识密集型领域的表现。给定一个查询,RAG 从语料库中检索相关文档,并将其整合到 LLMs 的生成过程中。在本研究中,我们探讨了 RAG 的对抗鲁棒性,特别关注检索系统的安全性。首先,在 225 种不同的语料库、检索器、查询和目标信息组合中,我们发现检索系统在医疗问答中容易受到普遍的投毒攻击。在这种攻击中,攻击者生成包含广泛目标信息的投毒文档,如个人身份信息。当这些投毒文档被插入到语料库中时,只要使用攻击者指定的查询,它们就能被准确检索到。为了理解这一漏洞,我们发现查询嵌入与投毒文档嵌入之间的偏差往往遵循一种模式,即投毒文档与查询之间的高相似性得以保留,从而实现精确检索。基于这些发现,我们开发了一种基于检测的新防御措施,以确保 RAG 的安全使用。通过在多个问答领域的广泛实验,我们观察到所提出的方法在几乎所有情况下都能持续实现优异的检测率。
[NLP-68] Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning
【速读】: 该论文试图解决大语言模型(LLMs)在处理新领域和复杂逻辑序列时推理不一致的问题。解决方案的关键在于引入“Proof of Thought”框架,通过将LLM生成的想法与形式逻辑验证相结合,使用自定义解释器将LLM输出转换为一阶逻辑结构,以便进行定理证明器的审查。核心方法包括一个基于JSON的领域特定语言,该语言在精确逻辑结构和直观人类概念之间取得平衡,从而实现严格的验证和易于理解的人类可解释性。此外,该方法还包括一个强大的类型系统,用于增强逻辑完整性,明确区分事实知识和推断知识,并提供灵活的架构以适应各种领域特定应用的扩展。
链接: https://arxiv.org/abs/2409.17270 作者: Debargha Ganguly,Srinivasan Iyengar,Vipin Chaudhary,Shivkumar Kalyanaraman 关键词-EN: Large Language Models, natural language processing, revolutionized natural language, complex logical sequences, Large Language 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE) 备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning, particularly in novel domains and complex logical sequences. This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs. Our approach bridges LLM-generated ideas with formal logic verification, employing a custom interpreter to convert LLM outputs into First Order Logic constructs for theorem prover scrutiny. Central to our method is an intermediary JSON-based Domain-Specific Language, which by design balances precise logical structures with intuitive human concepts. This hybrid representation enables both rigorous validation and accessible human comprehension of LLM reasoning processes. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge, and a flexible architecture that allows for easy extension to various domain-specific applications. We demonstrate Proof of Thought’s effectiveness through benchmarking on StrategyQA and a novel multimodal reasoning task, showing improved performance in open-ended scenarios. By providing verifiable and interpretable results, our technique addresses critical needs for AI system accountability and sets a foundation for human-in-the-loop oversight in high-stakes domains.
摘要:大语言模型 (LLMs) 已经彻底改变了自然语言处理领域,但它们在处理新领域和复杂逻辑序列时,推理过程往往不一致。本研究引入了“思维证明” (Proof of Thought) 框架,旨在提升 LLM 输出的可靠性和透明度。我们的方法通过自定义解释器,将 LLM 生成的想法与形式逻辑验证相结合,将 LLM 输出转换为用于定理证明器审查的一阶逻辑结构。该方法的核心是一个基于 JSON 的领域特定语言 (Domain-Specific Language),它在设计上平衡了精确的逻辑结构与直观的人类概念。这种混合表示方式既实现了严格的验证,又便于人类理解 LLM 的推理过程。主要贡献包括一个具有排序管理功能的强大类型系统,以增强逻辑完整性;明确表示规则,以清晰区分事实性知识和推理性知识;以及一个灵活的架构,便于轻松扩展到各种领域特定应用。我们通过在 StrategyQA 和一项新的多模态推理任务上的基准测试,展示了“思维证明”框架的有效性,表明在开放式场景中性能有所提升。通过提供可验证和可解释的结果,我们的技术满足了 AI 系统责任性的关键需求,并为高风险领域中的人机协同监督奠定了基础。
[NLP-69] Plurals: A System for Guiding LLMs Via Simulated Social Ensembles
链接: https://arxiv.org/abs/2409.17213 作者: Joshua Ashkinaze,Emily Fry,Narendra Edara,Eric Gilbert,Ceren Budak 关键词-EN: Recent debates raised, debates raised concerns, Recent debates, debates raised, raised concerns 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA) 备注:
点击查看摘要
Abstract:Recent debates raised concerns that language models may favor certain viewpoints. But what if the solution is not to aim for a ‘view from nowhere’ but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) which deliberate within customizable Structures, with Moderators overseeing deliberation. Plurals is a generator of simulated social ensembles. Plurals integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by democratic deliberation theory, and allows users to customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at this https URL and will be continually updated.
摘要:近期关于语言模型的讨论引发了对其可能偏袒某些观点的担忧。但如果解决方案不是追求“无立场”,而是利用不同的观点呢?我们引入了 Plurals,这是一个用于多元 AI 审议的系统和 Python 库。Plurals 由 AI 智能体(大语言模型,可选地带有角色)组成,这些智能体在可定制的结构中进行审议,并由主持人监督审议过程。Plurals 是一个模拟社会集合的生成器。Plurals 整合了政府数据集以创建具有国家代表性的角色,包含了受民主审议理论启发的审议模板,并允许用户自定义信息共享结构和结构内的审议行为。六个案例研究展示了其对理论构件的忠实性和有效性。三个随机实验表明,模拟焦点小组产生的输出与相关在线受众样本产生了共鸣(在 75% 的试验中选择了少样本生成而非零样本生成)。Plurals 既是一种范式,也是一个具体的多元 AI 系统。Plurals 库可通过此 https URL 获取,并将持续更新。
[NLP-70] An Effective Robust and Fairness-aware Hate Speech Detection Framework
链接: https://arxiv.org/abs/2409.17191 作者: Guanyi Mou,Kyumin Lee 关键词-EN: online social networks, widespread online social, speeches are spreading, spreading faster, faster and causing 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: IEEE BigData 2021
点击查看摘要
Abstract:With the widespread online social networks, hate speeches are spreading faster and causing more damage than ever before. Existing hate speech detection methods have limitations in several aspects, such as handling data insufficiency, estimating model uncertainty, improving robustness against malicious attacks, and handling unintended bias (i.e., fairness). There is an urgent need for accurate, robust, and fair hate speech classification in online social networks. To bridge the gap, we design a data-augmented, fairness addressed, and uncertainty estimated novel framework. As parts of the framework, we propose Bidirectional Quaternion-Quasi-LSTM layers to balance effectiveness and efficiency. To build a generalized model, we combine five datasets collected from three platforms. Experiment results show that our model outperforms eight state-of-the-art methods under both no attack scenario and various attack scenarios, indicating the effectiveness and robustness of our model. We share our code along with combined dataset for better future research
摘要:随着在线社交网络的广泛普及,仇恨言论的传播速度比以往任何时候都快,造成的损害也更为严重。现有的仇恨言论检测方法在多个方面存在局限性,如处理数据不足、估计模型不确定性、提高对恶意攻击的鲁棒性以及处理意外偏差(即公平性)。在线社交网络中迫切需要准确、鲁棒且公平的仇恨言论分类。为了填补这一空白,我们设计了一个数据增强、公平性处理和不确定性估计的新颖框架。作为该框架的一部分,我们提出了双向四元数准LSTM层,以平衡有效性和效率。为了构建一个泛化模型,我们结合了从三个平台收集的五个数据集。实验结果表明,在无攻击场景和各种攻击场景下,我们的模型均优于八种最先进的方法,显示出我们模型的有效性和鲁棒性。我们共享了代码和合并的数据集,以促进未来的研究。
[NLP-71] Fully automatic extraction of morphological traits from the Web: utopia or reality?
链接: https://arxiv.org/abs/2409.17179 作者: Diego Marcos,Robert van de Vlasakker,Ioannis N. Athanasiadis,Pierre Bonnet,Hervé Goeau,Alexis Joly,W. Daniel Kissling,César Leblanc,André S.J. van Proosdij,Konstantinos P. Panousis 关键词-EN: Plant morphological traits, observable characteristics, fundamental to understand, understand the role, role played 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Plant morphological traits, their observable characteristics, are fundamental to understand the role played by each species within their ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, massive amounts of information about species descriptions is available online in the form of text, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method managed to find values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, being limited by the availability of textual descriptions covering all the traits of interest.
摘要:植物形态特征,即其可观察到的特性,是理解每种物种在其生态系统中所扮演角色的基础。然而,即使是对中等数量的物种进行特征信息编纂,也是一项耗时耗力的任务,可能需要专家花费数年时间才能完成。与此同时,尽管网络上存在大量关于物种描述的文本信息,但由于缺乏结构化,使得这些数据难以大规模利用。为了克服这一问题,我们提出利用大语言模型 (LLMs) 的最新进展,设计一种机制,用于收集和处理以非结构化文本形式描述的植物特征信息,而无需人工编排。我们通过自动复制三个手动创建的物种-特征矩阵来评估我们的方法。我们的方法成功为超过一半的物种-特征对找到了数值,F1-score 超过 75%。这些结果表明,得益于 LLMs 的信息提取能力,目前从非结构化在线文本中大规模创建结构化特征数据库是可行的,其限制主要在于涵盖所有感兴趣特征的文本描述的可用性。
[NLP-72] CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Casual Significance and Consistency
链接: https://arxiv.org/abs/2409.17174 作者: Kangsheng Wang,Xiao Zhang,Zizheng Guo,Tianyu Hu,Huimin Ma 关键词-EN: large language models, causal significance, significance and consistency, reasoning, solving reasoning tasks 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Chain-based reasoning methods like chain of thought (CoT) play a rising role in solving reasoning tasks for large language models (LLMs). However, the causal illusions between \textita step of reasoning and \textitcorresponding state transitions are becoming a significant obstacle to advancing LLMs’ reasoning capabilities, especially in long-range reasoning tasks. This paper proposes a non-chain-based reasoning framework for simultaneous consideration of causal significance and consistency, i.e., the Causal Significance and Consistency Enhancer (CSCE). We customize LLM’s loss function utilizing treatment effect assessments to enhance its reasoning ability from two aspects: causal significance and consistency. This ensures that the model captures essential causal relationships and maintains robust and consistent performance across various scenarios. Additionally, we transform the reasoning process from the cascading multiple one-step reasoning commonly used in Chain-Based methods, like CoT, to a causal-enhanced method that outputs the entire reasoning process in one go, further improving the model’s reasoning efficiency. Extensive experiments show that our method improves both the reasoning success rate and speed. These improvements further demonstrate that non-chain-based methods can also aid LLMs in completing reasoning tasks.
摘要:基于链的推理方法,如思维链 (Chain of Thought, CoT),在大语言模型 (Large Language Models, LLMs) 解决推理任务中扮演着日益重要的角色。然而,推理步骤与相应状态转换之间的因果错觉正成为提升 LLMs 推理能力,尤其是在长距离推理任务中的一个重大障碍。本文提出了一种非链式推理框架,即因果显著性与一致性增强器 (Causal Significance and Consistency Enhancer, CSCE),用于同时考虑因果显著性和一致性。我们通过利用处理效应评估来定制 LLM 的损失函数,从因果显著性和一致性两个方面增强其推理能力。这确保了模型能够捕捉关键的因果关系,并在各种场景中保持稳健且一致的性能。此外,我们将推理过程从基于链的方法(如 CoT)中常见的级联多步推理转变为因果增强的方法,该方法一次性输出整个推理过程,从而进一步提高了模型的推理效率。大量实验表明,我们的方法在推理成功率和速度上都有所提升。这些改进进一步证明了非链式方法也能帮助 LLMs 完成推理任务。
[NLP-73] A Multiple-Fill-in-the-Blank Exam Approach for Enhancing Zero-Resource Hallucination Detection in Large Language Models
链接: https://arxiv.org/abs/2409.17173 作者: Satoshi Munakata,Taku Fukui,Takao Mohri 关键词-EN: Large language models, Large language, language models, fabricate a hallucinatory, Large 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: 20 pages
点击查看摘要
Abstract:Large language models (LLMs) often fabricate a hallucinatory text. Several methods have been developed to detect such text by semantically comparing it with the multiple versions probabilistically regenerated. However, a significant issue is that if the storyline of each regenerated text changes, the generated texts become incomparable, which worsen detection accuracy. In this paper, we propose a hallucination detection method that incorporates a multiple-fill-in-the-blank exam approach to address this storyline-changing issue. First, our method creates a multiple-fill-in-the-blank exam by masking multiple objects from the original text. Second, prompts an LLM to repeatedly answer this exam. This approach ensures that the storylines of the exam answers align with the original ones. Finally, quantifies the degree of hallucination for each original sentence by scoring the exam answers, considering the potential for \emphhallucination snowballing within the original text itself. Experimental results show that our method alone not only outperforms existing methods, but also achieves clearer state-of-the-art performance in the ensembles with existing methods.
摘要:大语言模型 (LLMs) 常常生成幻觉文本。已有多种方法通过语义比较这些文本与概率性重新生成的多个版本来进行检测。然而,一个显著的问题是,如果每个重新生成的文本的故事线发生变化,生成的文本将变得不可比较,从而降低检测准确性。本文提出了一种幻觉检测方法,该方法结合了多填空题考试方法来解决故事线变化的问题。首先,我们的方法通过从原始文本中屏蔽多个对象来创建多填空题考试。其次,提示 LLM 反复回答此考试。这种方法确保了考试答案的故事线与原始故事线一致。最后,通过评分考试答案来量化每个原始句子的幻觉程度,考虑到原始文本内部可能存在的幻觉滚雪球效应。实验结果表明,我们的方法不仅单独优于现有方法,而且在与现有方法的集成中实现了更清晰的最新性能。
[NLP-74] What Would You Ask When You First Saw a2b2=c2? Evaluating LLM on Curiosity-Driven Questioning
链接: https://arxiv.org/abs/2409.17172 作者: Shashidhar Reddy Javaji,Zining Zhu 关键词-EN: knowledge remains unknown, remains unknown, store a massive, massive amount, knowledge 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person when facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulties, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen’s kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems
摘要:大语言模型 (LLMs) 能够存储大量知识,但其获取新知识的能力仍未明确。我们提出了一种新颖的评估框架,用于评估这一能力。该框架引导 LLMs 针对引入科学知识的陈述生成问题,模拟初次接触该陈述的好奇者。我们通过评分生成问题的质量,从而评估 LLM 的知识获取潜力。我们采用控制性消融研究来验证评分程序。此外,我们创建了一个合成数据集,包含 1101 条物理、化学和数学领域的陈述,难度各异,300 条一般知识陈述,以及 567 条错误陈述。通过人类评估验证了我们的模型评估,在考虑的三项指标上达到了约 0.7 的加权 Cohen’s kappa 值。我们发现,尽管像 GPT-4 和 Mistral 8x7b 这样的大型模型擅长生成连贯且相关的问题,但较小的 Phi-2 模型同样或更为有效。这表明,模型大小并非决定知识获取潜力的唯一因素。所提出的框架量化了一种常被忽视的关键模型能力,并为开发更具知识性的 AI 系统开辟了研究机会。
[NLP-75] Cross-Domain Content Generation with Domain-Specific Small Language Models
链接: https://arxiv.org/abs/2409.17171 作者: Ankit Maloo Abhinav Garg 关键词-EN: small language models, language models poses, small language, minimal overlap, models poses challenges 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: 15 pages
点击查看摘要
Abstract:Generating domain-specific content using small language models poses challenges, especially when dealing with multiple distinct datasets with minimal overlap. In this study, we explore methods to enable a small language model to produce coherent and relevant outputs for two different domains: stories (Dataset A) and recipes (Dataset B). Our initial experiments show that training individual models on each dataset yields satisfactory results, with each model generating appropriate content within its domain. We find that utilizing custom tokenizers tailored to each dataset significantly enhances generation quality compared to using a generic tokenizer. Attempts to adapt a single model to both domains using Low-Rank Adaptation (LoRA) or standard fine-tuning do not yield substantial results, often failing to produce meaningful outputs. Moreover, full fine-tuning without freezing the model’s existing weights leads to catastrophic forgetting, where the model loses previously learned information and only retains knowledge from the new data. To overcome these challenges, we employ a knowledge expansion strategy: training only with additional parameters. This approach enables the model to generate both stories and recipes upon request, effectively handling multiple domains without suffering from catastrophic forgetting. Our findings demonstrate that knowledge expansion with frozen layers is an effective method for small language models to generate domain-specific content across distinct datasets. This work contributes to the development of efficient multi-domain language models and provides insights into managing catastrophic forgetting in small-scale architectures.
摘要:利用小型语言模型生成特定领域的内容面临挑战,尤其是在处理多个几乎没有重叠的不同数据集时。在本研究中,我们探讨了使小型语言模型能够为两个不同领域(故事(数据集 A)和食谱(数据集 B))生成连贯且相关输出的方法。我们的初步实验表明,针对每个数据集训练单独的模型可以获得满意的结果,每个模型都能在其领域内生成合适的内容。我们发现,使用针对每个数据集定制的 Tokenizer 显著提高了生成质量,相比于使用通用 Tokenizer。尝试使用低秩适应(Low-Rank Adaptation, LoRA)或标准微调来使单个模型适应两个领域并未取得显著成果,通常无法生成有意义的输出。此外,在不冻结模型现有权重的情况下进行全面微调会导致灾难性遗忘,模型会丢失之前学习的信息,仅保留新数据的知识。为了克服这些挑战,我们采用了知识扩展策略:仅通过增加参数进行训练。这种方法使模型能够根据请求生成故事和食谱,有效处理多个领域而不会遭受灾难性遗忘。我们的研究结果表明,冻结层的知识扩展是小型语言模型在不同数据集上生成特定领域内容的一种有效方法。这项工作有助于开发高效的多领域语言模型,并为管理小规模架构中的灾难性遗忘提供了见解。
[NLP-76] REAL: Response Embedding-based Alignment for LLMs
链接: https://arxiv.org/abs/2409.17169 作者: Honggen Zhang,Igor Molybog,June Zhang,Xufeng Zhao 关键词-EN: Aligning large language, Aligning large, Direct Preference Optimization, Preference Optimization rely, large language models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, which usually involve training on supervised datasets. Popular algorithms such as Direct Preference Optimization rely on pairs of AI-generated responses ranked according to human feedback. The labeling process is the most labor-intensive and costly part of the alignment pipeline, and improving its efficiency would have a meaningful impact on AI development. We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs for labeling out of a set of AI-generated responses. Experimental results on synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. We also applied our method to the real-world dataset SHP2, selecting optimal pairs from multiple responses. The model aligned on dissimilar response pairs obtained the best win rate on the dialogue task. Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators’ work.
摘要:将大语言模型 (LLMs) 对齐到人类偏好是构建有用且安全的 AI 工具的关键步骤,这通常涉及在监督数据集上进行训练。流行的算法如直接偏好优化 (Direct Preference Optimization) 依赖于根据人类反馈排序的 AI 生成响应对。标注过程是对齐流程中最耗费人力和成本的部分,提高其效率将对 AI 开发产生重大影响。我们提出了一种策略,用于从一组 AI 生成的响应中采样高质量的训练数据集,重点是获取最具信息量的响应对进行标注。在合成 HH-RLHF 基准上的实验结果表明,选择不相似的响应对可以增强 LLMs 的直接对齐,同时减少继承的标注错误。我们还应用了我们的方法到实际数据集 SHP2,从多个响应中选择最佳对。在对齐不相似响应对的模型在对话任务中获得了最高的胜率。我们的研究结果表明,专注于不太相似的对可以提高 LLM 对齐的效率,节省高达 65% 的标注者工作量。
[NLP-77] StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?
链接: https://arxiv.org/abs/2409.17167 作者: Guobin Shen,Dongcheng Zhao,Aorigele Bao,Xiang He,Yiting Dong,Yi Zeng 关键词-EN: Large Language Models, Language Models, Large Language, stress, LLMs 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: 11 pages, 9 figures
点击查看摘要
Abstract:Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.
摘要:人类经常经历压力,这会显著影响他们的表现。本研究探讨了大语言模型 (LLMs) 是否表现出类似于人类的压力反应,以及它们在不同压力诱导提示下的表现是否波动。为了研究这一点,我们开发了一套新颖的提示集,称为 StressPrompt,旨在诱导不同程度的压力。这些提示源自已建立的心理学框架,并根据人类参与者的评分进行了仔细校准。然后,我们将这些提示应用于多个 LLMs,以评估它们在指令跟随、复杂推理和情感智能等任务中的响应。研究结果表明,LLMs 与人类一样,在中等压力下表现最佳,这与 Yerkes-Dodson 定律一致。值得注意的是,它们在低压力和高压力条件下的表现均有所下降。我们的进一步分析揭示,这些 StressPrompts 显著改变了 LLMs 的内部状态,导致其神经表示发生变化,这些变化与人类对压力的反应相似。这项研究为 LLMs 的操作稳健性和灵活性提供了关键见解,展示了设计能够在压力普遍存在的现实世界场景中保持高性能的 AI 系统的重要性,例如在客户服务、医疗保健和应急响应环境中。此外,本研究通过提供关于 LLMs 如何处理不同场景及其与人类认知相似性的新视角,为更广泛的 AI 研究社区做出了贡献。
[NLP-78] BERTScoreVisualizer: A Web Tool for Understanding Simplified Text Evaluation with BERTScore
链接: https://arxiv.org/abs/2409.17160 作者: Sebastian Jaskowski,Sahasra Chava,Agam Shah 关键词-EN: evaluate automatic text, automatic text simplification, evaluate automatic, text simplification systems, BERTScore metric 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:The BERTScore metric is commonly used to evaluate automatic text simplification systems. However, current implementations of the metric fail to provide complete visibility into all information the metric can produce. Notably, the specific token matchings can be incredibly useful in generating clause-level insight into the quality of simplified text. We address this by introducing BERTScoreVisualizer, a web application that goes beyond reporting precision, recall, and F1 score and provides a visualization of the matching between tokens. We believe that our software can help improve the analysis of text simplification systems by specifically showing where generated, simplified text deviates from reference text. We host our code and demo on GitHub.
摘要:BERTScore 指标常用于评估自动文本简化系统。然而,当前的实现未能完全展示该指标所能提供的所有信息。特别是,具体的 Token 匹配信息对于生成关于简化文本质量的从句级洞察极为有用。我们通过引入 BERTScoreVisualizer,一个网页应用程序,来解决这一问题。该应用不仅报告精度、召回率和 F1 分数,还提供了 Token 匹配的可视化。我们相信,我们的软件可以通过具体展示生成的简化文本与参考文本的偏差,来帮助改进文本简化系统的分析。我们将代码和演示托管在 GitHub 上。
[NLP-79] Unveiling the Potential of Graph Neural Networks in SME Credit Risk Assessment
链接: https://arxiv.org/abs/2409.17909 作者: Bingyao Liu,Iris Li,Jianhua Yao,Yuan Chen,Guanming Huang,Jiajing Wang 关键词-EN: graph neural network, enterprise financial indicators, credit risk assessment, enterprise credit risk, neural network model 类目: Risk Management (q-fin.RM); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:This paper takes the graph neural network as the technical framework, integrates the intrinsic connections between enterprise financial indicators, and proposes a model for enterprise credit risk assessment. The main research work includes: Firstly, based on the experience of predecessors, we selected 29 enterprise financial data indicators, abstracted each indicator as a vertex, deeply analyzed the relationships between the indicators, constructed a similarity matrix of indicators, and used the maximum spanning tree algorithm to achieve the graph structure mapping of enterprises; secondly, in the representation learning phase of the mapped graph, a graph neural network model was built to obtain its embedded representation. The feature vector of each node was expanded to 32 dimensions, and three GraphSAGE operations were performed on the graph, with the results pooled using the Pool operation, and the final output of three feature vectors was averaged to obtain the graph’s embedded representation; finally, a classifier was constructed using a two-layer fully connected network to complete the prediction task. Experimental results on real enterprise data show that the model proposed in this paper can well complete the multi-level credit level estimation of enterprises. Furthermore, the tree-structured graph mapping deeply portrays the intrinsic connections of various indicator data of the company, and according to the ROC and other evaluation criteria, the model’s classification effect is significant and has good “robustness”.
摘要:本文以图神经网络为技术框架,整合企业财务指标间的内在联系,提出了一种企业信用风险评估模型。主要研究工作包括:首先,基于前人的经验,选取了29个企业财务数据指标,将每个指标抽象为一个顶点,深入分析指标间的关系,构建了指标的相似矩阵,并利用最大生成树算法实现了企业图结构的映射;其次,在映射图的表示学习阶段,构建了图神经网络模型以获取其嵌入表示。每个节点的特征向量被扩展到32维,并在图上进行了三次GraphSAGE操作,结果通过Pool操作进行池化,最终将三个特征向量的输出取平均,得到图的嵌入表示;最后,构建了一个两层的全连接网络分类器,完成了预测任务。在真实企业数据上的实验结果表明,本文提出的模型能够很好地完成企业多层次信用等级估计。此外,树结构的图映射深度描绘了公司各项指标数据的内在联系,根据ROC等评价标准,模型的分类效果显著,并具有良好的“鲁棒性”。
[NLP-80] Revisiting Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations
链接: https://arxiv.org/abs/2409.17899 作者: Yujia Sun,Zeyu Zhao,Korin Richmond,Yuanchao Li 关键词-EN: music SSL models, SSL models, speech and music, Emotion recognition, Music Emotion Recognition 类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD) 备注:
点击查看摘要
Abstract:Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, given the fact that SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotion speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Frechet audio distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary depending on different emotions due to their training strategies and domain-specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from each other. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential for cross-domain generalization to improve SER and MER systems.
摘要:语音和音乐的情感识别由于其声学重叠而具有相似性,这引起了跨领域知识转移的兴趣。然而,语音和音乐之间的共享声学线索,特别是由自监督学习 (Self-Supervised Learning, SSL) 模型编码的线索,在很大程度上仍未被探索,因为针对语音和音乐的 SSL 模型很少应用于跨领域研究。在这项工作中,我们重新审视了情感语音和音乐之间的声学相似性,首先分析了用于语音情感识别 (Speech Emotion Recognition, SER) 和音乐情感识别 (Music Emotion Recognition, MER) 的 SSL 模型的逐层行为。此外,我们通过在两阶段微调过程中比较几种方法,进行跨领域适应,探讨了利用音乐进行 SER 和利用语音进行 MER 的有效方式。最后,我们使用 Frechet 音频距离探索了情感语音和音乐之间的声学相似性,揭示了语音和音乐 SSL 模型中情感偏差的问题。我们的研究结果表明,尽管语音和音乐 SSL 模型确实捕捉到了共享的声学特征,但由于其训练策略和领域特异性,它们的行为会因不同情感而异。此外,参数高效的微调可以通过相互利用知识来提升 SER 和 MER 的性能。本研究为情感语音和音乐之间的声学相似性提供了新的见解,并强调了跨领域泛化以改进 SER 和 MER 系统的潜力。
[NLP-81] Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study
链接: https://arxiv.org/abs/2409.17750 作者: Keyu An,Shiliang Zhang,Zhijie Yan 关键词-EN: Automatic Speech Recognition, Speech Recognition, pre-trained language models, Automatic Speech, Character Error Rate 类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD) 备注: 8pages
点击查看摘要
Abstract:In this study, we delve into the efficacy of transformers within pre-trained language models (PLMs) when repurposed as encoders for Automatic Speech Recognition (ASR). Our underlying hypothesis posits that, despite being initially trained on text-based corpora, these transformers possess a remarkable capacity to extract effective features from the input sequence. This inherent capability, we argue, is transferrable to speech data, thereby augmenting the acoustic modeling ability of ASR. Through rigorous empirical analysis, our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated. Particularly, they serve as an advantageous starting point for initializing ASR encoders. Furthermore, we uncover that these transformers, when integrated into a well-established ASR encoder, can significantly boost performance, especially in scenarios where profound semantic comprehension is pivotal. This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems’ capabilities.
摘要:在本研究中,我们深入探讨了在预训练语言模型 (PLMs) 中使用的 Transformer 作为自动语音识别 (ASR) 编码器的有效性。我们的基本假设认为,尽管这些 Transformer 最初是在基于文本的语料库上进行训练的,但它们具有从输入序列中提取有效特征的显著能力。我们认为,这种固有能力可以转移到语音数据上,从而增强 ASR 的声学建模能力。通过严格的实证分析,我们的研究结果显示,当预训练语言模型中的 Transformer 被引入时,跨不同 ASR 任务的字符错误率 (CER) 和词错误率 (WER) 显著改善。特别是,它们为初始化 ASR 编码器提供了一个有利的起点。此外,我们发现,当这些 Transformer 被整合到一个成熟的 ASR 编码器中时,可以显著提升性能,尤其是在需要深刻语义理解的情况下。这突显了利用预训练 Transformer 中嵌入的语义优势来提升 ASR 系统能力的潜力。
[NLP-82] When A Man Says He Is Pregnant: ERP Evidence for A Rational Account of Speaker-contextualized Language Comprehension
链接: https://arxiv.org/abs/2409.17525 作者: Hanlin Wu,Zhenguang G. Cai 关键词-EN: includes the identities, Spoken, effect, Spoken language, ERP effects reflect 类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Spoken language is often, if not always, understood in a context that includes the identities of speakers. For instance, we can easily make sense of an utterance such as “I’m going to have a manicure this weekend” or “The first time I got pregnant I had a hard time” when the utterance is spoken by a woman, but it would be harder to understand when it is spoken by a man. Previous event-related potential (ERP) studies have shown mixed results regarding the neurophysiological responses to such speaker-mismatched utterances, with some reporting an N400 effect and others a P600 effect. In an experiment involving 64 participants, we showed that these different ERP effects reflect distinct cognitive processes employed to resolve the speaker-message mismatch. When possible, the message is integrated with the speaker context to arrive at an interpretation, as in the case of violations of social stereotypes (e.g., men getting a manicure), resulting in an N400 effect. However, when such integration is impossible due to violations of biological knowledge (e.g., men getting pregnant), listeners engage in an error correction process to revise either the perceived utterance or the speaker context, resulting in a P600 effect. Additionally, we found that the social N400 effect decreased as a function of the listener’s personality trait of openness, while the biological P600 effect remained robust. Our findings help to reconcile the empirical inconsistencies in the literature and provide a rational account of speaker-contextualized language comprehension.
摘要:口语交流通常(即使不是总是)在包含说话者身份的背景下被理解。例如,当一位女性说出“我这周末要去修指甲”或“我第一次怀孕时很困难”时,我们很容易理解这些话语,但如果这些话是由男性说出的,理解起来就会更加困难。先前的与事件相关电位(ERP)研究对这种说话者与信息不匹配的话语的神经生理反应结果不一,有些报告了N400效应,而另一些则报告了P600效应。在一个涉及64名参与者的实验中,我们发现这些不同的ERP效应反映了用于解决说话者与信息不匹配的不同认知过程。在可能的情况下,信息会与说话者背景整合以达成解释,例如在违反社会刻板印象(如男性修指甲)的情况下,这导致了N400效应。然而,当这种整合由于违反生物学知识(如男性怀孕)而变得不可能时,听者会进行错误修正过程,以修正感知到的话语或说话者背景,从而导致P600效应。此外,我们发现社会N400效应随着听者开放性人格特质的增加而减少,而生物P600效应则保持稳定。我们的研究有助于调和文献中的实证不一致性,并为说话者背景化的语言理解提供了合理的解释。
[NLP-83] Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control ICASSP2025
链接: https://arxiv.org/abs/2409.17452 作者: Ryuichi Yamamoto,Yuma Shirahata,Masaya Kawamura,Kentaro Tachibana 关键词-EN: cross-lingual control capability, TTS model trained, description-based controllable, TTS model, TTS 类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD) 备注: Submitted to ICASSP 2025
点击查看摘要
Abstract:We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages, even though no Japanese audio-description pairs are used.
摘要:我们提出了一种基于描述的可控跨语言文本到语音 (Text-to-Speech, TTS) 方法。为了解决目标语言中缺乏音频-描述配对数据的问题,我们将目标语言训练的 TTS 模型与另一种语言训练的描述控制模型相结合,该模型将输入文本描述映射到 TTS 模型的条件特征。这两个模型基于自监督学习 (Self-Supervised Learning, SSL) 共享解耦的音色和风格表示,从而实现解耦的语音控制,例如在保留原始音色的同时控制说话风格。此外,由于基于 SSL 的音色和风格表示是语言无关的,因此通过共享相同的嵌入空间将 TTS 和描述控制模型结合,可以有效地实现语音特征的跨语言控制。在英语和日语 TTS 上的实验表明,尽管没有使用日语音频-描述配对数据,我们的方法在两种语言上都实现了高自然度和可控性。
人工智能
[AI-0] Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography MICCAI2024
链接: https://arxiv.org/abs/2409.18119 作者: Yuexi Du,John Onofrey,Nicha C. Dvornek 关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Language-Image Pre-training, requires substantial data, shows promise 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is also the basis of the overall best solution for the MICCAI 2024 CXR-LT Challenge
点击查看摘要
Abstract:Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and data imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline.
[AI-1] Find Rhinos without Finding Rhinos: Active Learning with Multimodal Imagery of South African Rhino Habitats IJCAI2023
链接: https://arxiv.org/abs/2409.18104 作者: Lucia Gordon,Nikhil Behari,Samuel Collier,Elizabeth Bondi-Kelly,Jackson A. Killian,Catherine Ressijac,Peter Boucher,Andrew Davies,Milind Tambe 关键词-EN: Earth charismatic megafauna, crisis in Africa, Earth charismatic, human activities, charismatic megafauna 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 9 figures, IJCAI 2023 Special Track on AI for Good
点击查看摘要
Abstract:Much of Earth’s charismatic megafauna is endangered by human activities, particularly the rhino, which is at risk of extinction due to the poaching crisis in Africa. Monitoring rhinos’ movement is crucial to their protection but has unfortunately proven difficult because rhinos are elusive. Therefore, instead of tracking rhinos, we propose the novel approach of mapping communal defecation sites, called middens, which give information about rhinos’ spatial behavior valuable to anti-poaching, management, and reintroduction efforts. This paper provides the first-ever mapping of rhino midden locations by building classifiers to detect them using remotely sensed thermal, RGB, and LiDAR imagery in passive and active learning settings. As existing active learning methods perform poorly due to the extreme class imbalance in our dataset, we design MultimodAL, an active learning system employing a ranking technique and multimodality to achieve competitive performance with passive learning models with 94% fewer labels. Our methods could therefore save over 76 hours in labeling time when used on a similarly-sized dataset. Unexpectedly, our midden map reveals that rhino middens are not randomly distributed throughout the landscape; rather, they are clustered. Consequently, rangers should be targeted at areas with high midden densities to strengthen anti-poaching efforts, in line with UN Target 15.7.
[AI-2] AI-Powered Augmented Reality for Satellite Assembly Integration and Test
Abstract:The integration of Artificial Intelligence (AI) and Augmented Reality (AR) is set to transform satellite Assembly, Integration, and Testing (AIT) processes by enhancing precision, minimizing human error, and improving operational efficiency in cleanroom environments. This paper presents a technical description of the European Space Agency’s (ESA) project “AI for AR in Satellite AIT,” which combines real-time computer vision and AR systems to assist technicians during satellite assembly. Leveraging Microsoft HoloLens 2 as the AR interface, the system delivers context-aware instructions and real-time feedback, tackling the complexities of object recognition and 6D pose estimation in AIT workflows. All AI models demonstrated over 70% accuracy, with the detection model exceeding 95% accuracy, indicating a high level of performance and reliability. A key contribution of this work lies in the effective use of synthetic data for training AI models in AR applications, addressing the significant challenges of obtaining real-world datasets in highly dynamic satellite environments, as well as the creation of the Segmented Anything Model for Automatic Labelling (SAMAL), which facilitates the automatic annotation of real data, achieving speeds up to 20 times faster than manual human annotation. The findings demonstrate the efficacy of AI-driven AR systems in automating critical satellite assembly tasks, setting a foundation for future innovations in the space industry.
[AI-3] EfficientCrackNet: A Lightweight Model for Crack Segmentation
Abstract:Crack detection, particularly from pavement images, presents a formidable challenge in the domain of computer vision due to several inherent complexities such as intensity inhomogeneity, intricate topologies, low contrast, and noisy backgrounds. Automated crack detection is crucial for maintaining the structural integrity of essential infrastructures, including buildings, pavements, and bridges. Existing lightweight methods often face challenges including computational inefficiency, complex crack patterns, and difficult backgrounds, leading to inaccurate detection and impracticality for real-world applications. To address these limitations, we propose EfficientCrackNet, a lightweight hybrid model combining Convolutional Neural Networks (CNNs) and transformers for precise crack segmentation. EfficientCrackNet integrates depthwise separable convolutions (DSC) layers and MobileViT block to capture both global and local features. The model employs an Edge Extraction Method (EEM) and for efficient crack edge detection without pretraining, and Ultra-Lightweight Subspace Attention Module (ULSAM) to enhance feature extraction. Extensive experiments on three benchmark datasets Crack500, DeepCrack, and GAPs384 demonstrate that EfficientCrackNet achieves superior performance compared to existing lightweight models, while requiring only 0.26M parameters, and 0.483 FLOPs (G). The proposed model offers an optimal balance between accuracy and computational efficiency, outperforming state-of-the-art lightweight models, and providing a robust and adaptable solution for real-world crack segmentation.
Abstract:Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle’s surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets and our approach outperforms the state-of-the-art for SSC.
[AI-5] GSON: A Group-based Social Navigation Framework with Large Multimodal Model
链接: https://arxiv.org/abs/2409.18084 作者: Shangyi Luo,Ji Zhu,Peng Sun,Yuhong Deng,Cunjun Yu,Anxing Xiao,Xueqian Wang 关键词-EN: human-centered environments grows, Large Multimodal Model, environments grows, number of service, autonomous vehicles 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As the number of service robots and autonomous vehicles in human-centered environments grows, their requirements go beyond simply navigating to a destination. They must also take into account dynamic social contexts and ensure respect and comfort for others in shared spaces, which poses significant challenges for perception and planning. In this paper, we present a group-based social navigation framework GSON to enable mobile robots to perceive and exploit the social group of their surroundings by leveling the visual reasoning capability of the Large Multimodal Model (LMM). For perception, we apply visual prompting techniques to zero-shot extract the social relationship among pedestrians and combine the result with a robust pedestrian detection and tracking pipeline to alleviate the problem of low inference speed of the LMM. Given the perception result, the planning system is designed to avoid disrupting the current social structure. We adopt a social structure-based mid-level planner as a bridge between global path planning and local motion planning to preserve the global context and reactive response. The proposed method is validated on real-world mobile robot navigation tasks involving complex social structure understanding and reasoning. Experimental results demonstrate the effectiveness of the system in these scenarios compared with several baselines.
[AI-6] SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation
链接: https://arxiv.org/abs/2409.18082 作者: Xin Li,Siyuan Huang,Qiaojun Yu,Zhengkai Jiang,Ce Hao,Yimeng Zhu,Hongsheng Li,Peng Gao,Cewu Lu 关键词-EN: Automating garment manipulation, Automating garment, poses a significant, significant challenge, diverse and deformable 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our model enables robots to manage different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. In addition, this research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics for future.
[AI-7] Infer Humans Intentions Before Following Natural Language Instructions
Abstract:For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.
[AI-8] FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction
Abstract:Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user’s intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual ReferAttention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods. The code will be available at: this https URL.
[AI-9] Visual Data Diagnosis and Debiasing with Concept Graphs
链接: https://arxiv.org/abs/2409.18055 作者: Rwiddhi Chakraborty,Yinong Wang,Jialu Gao,Runkai Zheng,Cheng Zhang,Fernando De la Torre 关键词-EN: deep learning models, learning models today, size and complexity, widespread success, success of deep 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present CONBIAS, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. CONBIAS represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by CONBIAS improves generalization performance across multiple datasets compared to state-of-the-art methods. We will make our code and data publicly available.
[AI-10] DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving
链接: https://arxiv.org/abs/2409.18053 作者: Dingrui Wang,Marc Kaufeld,Johannes Betz 关键词-EN: designed to imitate, imitate human reasoning, autonomous driving framework, driving, autonomous driving 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Autonomous Driving, Large Language Models (LLMs), Human Reasoning, Critical Scenario
点击查看摘要
Abstract:We present a novel autonomous driving framework, DualAD, designed to imitate human reasoning during driving. DualAD comprises two layers: a rule-based motion planner at the bottom layer that handles routine driving tasks requiring minimal reasoning, and an upper layer featuring a rule-based text encoder that converts driving scenarios from absolute states into text description. This text is then processed by a large language model (LLM) to make driving decisions. The upper layer intervenes in the bottom layer’s decisions when potential danger is detected, mimicking human reasoning in critical situations. Closed-loop experiments demonstrate that DualAD, using a zero-shot pre-trained model, significantly outperforms rule-based motion planners that lack reasoning abilities. Our experiments also highlight the effectiveness of the text encoder, which considerably enhances the model’s scenario understanding. Additionally, the integrated DualAD model improves with stronger LLMs, indicating the framework’s potential for further enhancement. We make code and benchmarks publicly available.
[AI-11] Explaining Explaining
链接: https://arxiv.org/abs/2409.18052 作者: Sergei Nirenburg,Marjorie McShane,Kenneth W. Goodman,Sanjay Oruganti 关键词-EN: confidence in high-stakes, Abstract, machine learning, key to people, people having confidence 类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Explanation is key to people having confidence in high-stakes AI systems. However, machine-learning-based systems - which account for almost all current AI - can’t explain because they are usually black boxes. The explainable AI (XAI) movement hedges this problem by redefining “explanation”. The human-centered explainable AI (HCXAI) movement identifies the explanation-oriented needs of users but can’t fulfill them because of its commitment to machine learning. In order to achieve the kinds of explanations needed by real people operating in critical domains, we must rethink how to approach AI. We describe a hybrid approach to developing cognitive agents that uses a knowledge-based infrastructure supplemented by data obtained through machine learning when applicable. These agents will serve as assistants to humans who will bear ultimate responsibility for the decisions and actions of the human-robot team. We illustrate the explanatory potential of such agents using the under-the-hood panels of a demonstration system in which a team of simulated robots collaborates on a search task assigned by a human.
[AI-12] Revisit Anything: Visual Place Recognition via Image Segment Retrieval ECCV2024
链接: https://arxiv.org/abs/2409.18049 作者: Kartik Garg,Sai Shubodh Puligilla,Shishir Kolathaya,Madhava Krishna,Sourav Garg 关键词-EN: Accurately recognizing, localize and navigate, crucial for embodied, embodied agents, agents to localize 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Presented at ECCV 2024; Includes supplementary; 29 pages; 8 figures
点击查看摘要
Abstract:Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the “whole” image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: “the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap”. We address this by encoding and searching for “image segments” instead of the whole images. We propose to use open-set image segmentation to decompose an image into `meaningful’ entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to ``revisit anything’’ by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: this https URL.
[AI-13] HARMONIC: Cognitive and Control Collaboration in Human-Robotic Teams ICRA2025
链接: https://arxiv.org/abs/2409.18047 作者: Sanjay Oruganti,Sergei Nirenburg,Marjorie McShane,Jesse English,Michael K. Roberts,Christian Arndt 关键词-EN: planning and collaboration, paper presents, multi-robot planning, natural language communication, natural human-robot communication 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Submitted to ICRA 2025 Conference, Atlanta, GA, USA
点击查看摘要
Abstract:This paper presents a novel approach to multi-robot planning and collaboration. We demonstrate a cognitive strategy for robots in human-robot teams that incorporates metacognition, natural language communication, and explainability. The system is embodied using the HARMONIC architecture that flexibly integrates cognitive and control capabilities across the team. We evaluate our approach through simulation experiments involving a joint search task by a team of heterogeneous robots (a UGV and a drone) and a human. We detail the system’s handling of complex, real-world scenarios, effective action coordination between robots with different capabilities, and natural human-robot communication. This work demonstrates that the robots’ ability to reason about plans, goals, and attitudes, and to provide explanations for actions and decisions are essential prerequisites for realistic human-robot teaming.
[AI-14] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning EMNLP2024
链接: https://arxiv.org/abs/2409.18046 作者: Soeun Lee,Si-Woo Kim,Taewhan Kim,Dong-Jin Kim 关键词-EN: Recent advancements, paired image-text data, explored text-only training, text-only training, overcome the limitations 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024
点击查看摘要
Abstract:Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ( \textbfI mage-like Retrieval and \textbfF requency-based Entity Filtering for Zero-shot \textbfCap tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.
[AI-15] HARMONIC: A Framework for Explanatory Cognitive Robots ICRA
链接: https://arxiv.org/abs/2409.18037 作者: Sanjay Oruganti,Sergei Nirenburg,Marjorie McShane,Jesse English,Michael K. Roberts,Christian Arndt 关键词-EN: trusted teammates capable, transforms general-purpose robots, implementing cognitive robots, natural communication, human-level explanation 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注: Accepted for presentation at ICRA@40. 23-26 September 2024, Rotterdam, Netherlands
点击查看摘要
Abstract:We present HARMONIC, a framework for implementing cognitive robots that transforms general-purpose robots into trusted teammates capable of complex decision-making, natural communication and human-level explanation. The framework supports interoperability between a strategic (cognitive) layer for high-level decision-making and a tactical (robot) layer for low-level control and execution. We describe the core features of the framework and our initial implementation, in which HARMONIC was deployed on a simulated UGV and drone involved in a multi-robot search and retrieval task.
[AI-16] Compositional Hardness of Code in Large Language Models – A Probabilistic Perspective
链接: https://arxiv.org/abs/2409.18028 作者: Yotam Wolf,Binyamin Rothberg,Dorin Shteyman,Amnon Shashua 关键词-EN: large language model, complex analytical tasks, model context window, model context, usage for complex 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
点击查看摘要
Abstract:A common practice in large language model (LLM) usage for complex analytical tasks such as code generation, is to sample a solution for the entire task within the model’s context window. Previous works have shown that subtask decomposition within the model’s context (chain of thought), is beneficial for solving such tasks. In this work, we point a limitation of LLMs’ ability to perform several sub-tasks within the same context window - an in-context hardness of composition, pointing to an advantage for distributing a decomposed problem in a multi-agent system of LLMs. The hardness of composition is quantified by a generation complexity metric, i.e., the number of LLM generations required to sample at least one correct solution. We find a gap between the generation complexity of solving a compositional problem within the same context relative to distributing it among multiple agents, that increases exponentially with the solution’s length. We prove our results theoretically and demonstrate them empirically.
[AI-17] An Adversarial Perspective on Machine Unlearning for AI Safety
链接: https://arxiv.org/abs/2409.18025 作者: Jakub Łucki,Boyi Wei,Yangsibo Huang,Peter Henderson,Florian Tramèr,Javier Rando 关键词-EN: Large language models, Large language, finetuned to refuse, Large, hazardous knowledge 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
[AI-18] ransferring disentangled representations: bridging the gap between synthetic and real images
Abstract:Developing meaningful and efficient representations that separate the fundamental structure of the data generation mechanism is crucial in representation learning. However, Disentangled Representation Learning has not fully shown its potential on real images, because of correlated generative factors, their resolution and limited access to ground truth labels. Specifically on the latter, we investigate the possibility of leveraging synthetic data to learn general-purpose disentangled representations applicable to real data, discussing the effect of fine-tuning and what properties of disentanglement are preserved after the transfer. We provide an extensive empirical study to address these issues. In addition, we propose a new interpretable intervention-based metric, to measure the quality of factors encoding in the representation. Our results indicate that some level of disentanglement, transferring a representation from synthetic to real data, is possible and effective.
[AI-19] Role-RL: Online Long-Context Processing with Role Reinforcement Learning for Distinct LLMs in Their Optimal Roles
链接: https://arxiv.org/abs/2409.18014 作者: Lewei He,Tianyu Shi,Pengran Huang,Bingzhi Chen,Qianglong Chen,Jiahui Pan 关键词-EN: Online Long-context Processing, Large language models, long-context processing, named Online Long-context, language models 类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) with long-context processing are still challenging because of their implementation complexity, training efficiency and data sparsity. To address this issue, a new paradigm named Online Long-context Processing (OLP) is proposed when we process a document of unlimited length, which typically occurs in the information reception and organization of diverse streaming media such as automated news reporting, live e-commerce, and viral short videos. Moreover, a dilemma was often encountered when we tried to select the most suitable LLM from a large number of LLMs amidst explosive growth aiming for outstanding performance, affordable prices, and short response delays. In view of this, we also develop Role Reinforcement Learning (Role-RL) to automatically deploy different LLMs in their respective roles within the OLP pipeline according to their actual performance. Extensive experiments are conducted on our OLP-MINI dataset and it is found that OLP with Role-RL framework achieves OLP benchmark with an average recall rate of 93.2% and the LLM cost saved by 79.4%. The code and dataset are publicly available at: this https URL.
[AI-20] Control Industrial Automation System with Large Language Models
链接: https://arxiv.org/abs/2409.18009 作者: Yuchen Xia,Nasser Jazdi,Jize Zhang,Chaitanya Shah,Michael Weyrich 关键词-EN: require specialized expertise, Traditional industrial automation, systems require specialized, Traditional industrial, require specialized 类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Traditional industrial automation systems require specialized expertise to operate and complex reprogramming to adapt to new processes. Large language models offer the intelligence to make them more flexible and easier to use. However, LLMs’ application in industrial settings is underexplored. This paper introduces a framework for integrating LLMs to achieve end-to-end control of industrial automation systems. At the core of the framework are an agent system designed for industrial tasks, a structured prompting method, and an event-driven information modeling mechanism that provides real-time data for LLM inference. The framework supplies LLMs with real-time events on different context semantic levels, allowing them to interpret the information, generate production plans, and control operations on the automation system. It also supports structured dataset creation for fine-tuning on this downstream application of LLMs. Our contribution includes a formal system design, proof-of-concept implementation, and a method for generating task-specific datasets for LLM fine-tuning and testing. This approach enables a more adaptive automation system that can respond to spontaneous events, while allowing easier operation and configuration through natural language for more intuitive human-machine interaction. We provide demo videos and detailed data on GitHub: this https URL
[AI-21] Joint Localization and Planning using Diffusion ICRA2025
链接: https://arxiv.org/abs/2409.17995 作者: L. Lao Beyer,S. Karaman 关键词-EN: vehicle path planning, successfully applied, applied to robotics, manipulation and vehicle, path planning 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 9 figures. Submitted to ICRA 2025, under review
点击查看摘要
Abstract:Diffusion models have been successfully applied to robotics problems such as manipulation and vehicle path planning. In this work, we explore their application to end-to-end navigation – including both perception and planning – by considering the problem of jointly performing global localization and path planning in known but arbitrary 2D environments. In particular, we introduce a diffusion model which produces collision-free paths in a global reference frame given an egocentric LIDAR scan, an arbitrary map, and a desired goal position. To this end, we implement diffusion in the space of paths in SE(2), and describe how to condition the denoising process on both obstacles and sensor observations. In our evaluation, we show that the proposed conditioning techniques enable generalization to realistic maps of considerably different appearance than the training environment, demonstrate our model’s ability to accurately describe ambiguous solutions, and run extensive simulation experiments showcasing our model’s use as a real-time, end-to-end localization and planning stack.
链接: https://arxiv.org/abs/2409.17994 作者: Sawinder Kaur,Avery Gump,Jingyu Xin,Yi Xiao,Harshit Sharma,Nina R Benway,Jonathan L Preston,Asif Salekin 关键词-EN: diverse human sensing, advancement in deep, deep learning, led to diverse, human sensing applications 类目: Artificial Intelligence (cs.AI)
*备注: 31 pages, 10 figues and 13 tables
点击查看摘要
Abstract:The advancement in deep learning and internet-of-things have led to diverse human sensing applications. However, distinct patterns in human sensing, influenced by various factors or contexts, challenge generic neural network model’s performance due to natural distribution shifts. To address this, personalization tailors models to individual users. Yet most personalization studies overlook intra-user heterogeneity across contexts in sensory data, limiting intra-user generalizability. This limitation is especially critical in clinical applications, where limited data availability hampers both generalizability and personalization. Notably, intra-user sensing attributes are expected to change due to external factors such as treatment progression, further complicating the this http URL work introduces CRoP, a novel static personalization approach using an off-the-shelf pre-trained model and pruning to optimize personalization and generalization. CRoP shows superior personalization effectiveness and intra-user robustness across four human-sensing datasets, including two from real-world health domains, highlighting its practical and social impact. Additionally, to support CRoP’s generalization ability and design choices, we provide empirical justification through gradient inner product analysis, ablation studies, and comparisons against state-of-the-art baselines.
[AI-23] HydraViT: Stacking Heads for a Scalable ViT
Abstract:The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, requires multiple models of different sizes. However, this approach has limitations, such as training and storing each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly changing the size of the embedded dimensions throughout each layer and their corresponding number of attention heads in MHA during training, HydraViT induces multiple subnetworks. Thereby, HydraViT achieves adaptability across a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. HydraViT achieves up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy with the same throughput on ImageNet-1K compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time. Source code available at this https URL.
[AI-24] Enhancing elusive clues in knowledge learning by contrasting attention of language models
链接: https://arxiv.org/abs/2409.17954 作者: Jian Gao,Xiao Zhang,Ji Wu,Miao Li 关键词-EN: Causal language models, acquire vast amount, models acquire vast, general text corpus, Causal language 类目: Artificial Intelligence (cs.AI)
*备注: 7 pages and 17 figures
点击查看摘要
Abstract:Causal language models acquire vast amount of knowledge from general text corpus during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from knowledge-dense and small-sized corpora. The deficiency can come from long-distance dependencies which are hard to capture by language models, and overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, the paper proposes a method to enhance knowledge learning during language model pretraining, by enhancing elusive but important clues in text discovered by the language model themselves. We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues as a guide to perform token-dropout data augmentation on the training text, and observed a significant boost in both small and large models’ performance in fact memorization. This shows that the behavior contrast between more and less-performant language models contains important clues for knowledge learning, and it can be ``amplified" for a straight-forward improvement in knowledge learning efficiency.
[AI-25] Weak-To-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation
链接: https://arxiv.org/abs/2409.17946 作者: Shuai Zhao,Leilei Gan,Zhongliang Guo,Xiaobao Wu,Luwei Xiao,Xiaoyu Xu,Cong-Duy Nguyen,Luu Anh Tuan 关键词-EN: widely applied due, Large Language Models, Large Language, backdoor attacks, backdoor 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
点击查看摘要
Abstract:Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from weak to strong based on contrastive knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through contrastive knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.
[AI-26] On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms
链接: https://arxiv.org/abs/2409.17943 作者: Richard Yue,John E. Ortega,Kenneth Ward Church 关键词-EN: natural language processing, professional translator, models in natural, Google Translate, BLEU and COMET 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users
点击查看摘要
Abstract:The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do - predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near human parity using common metrics for measurement such as BLEU and COMET, we find that an important step is being missed: the translation of technical terms, specifically acronyms. Some state-of-the art machine translation systems like Google Translate which are publicly available can be erroneous when dealing with acronyms - as much as 50% in our findings. This article addresses acronym disambiguation for MT systems by proposing an additional step to the SL-TL (FR-EN) translation workflow where we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm that achieves nearly 10% increase when compared to Google Translate and OpusMT.
[AI-27] Predicting Anchored Text from Translation Memories for Machine Translation Using Deep Learning Methods
链接: https://arxiv.org/abs/2409.17939 作者: Richard Yue,John E. Ortega 关键词-EN: tools called computer-aided, called computer-aided translation, CAT tool, CAT tools, CAT 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users
点击查看摘要
Abstract:Translation memories (TMs) are the backbone for professional translation tools called computer-aided translation (CAT) tools. In order to perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the desired segment to translate (s’). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the TM that are close in distance to s’. After locating two similar segments, the CAT tool will present parallel segments (s, t) that contain one segment in the source language along with its translation in the target language. Additionally, CAT tools contain fuzzy-match repair (FMR) techniques that will automatically use the parallel segments from the TM to create new TM entries containing a modified version of the original with the idea in mind that it will be the translation of s’. Most FMR techniques use machine translation as a way of “repairing” those words that have to be modified. In this article, we show that for a large part of those words which are anchored, we can use other techniques that are based on machine learning approaches such as Word2Vec. BERT, and even ChatGPT. Specifically, we show that for anchored words that follow the continuous bag-of-words (CBOW) paradigm, Word2Vec, BERT, and GPT-4 can be used to achieve similar and, for some cases, better results than neural machine translation for translating anchored words from French to English.
[AI-28] Intelligent Energy Management: Remaining Useful Life Prediction and Charging Automation System Comprised of Deep Learning and the Internet of Things
链接: https://arxiv.org/abs/2409.17931 作者: Biplov Paneru,Bishwash Paneru,DP Sharma Mainali 关键词-EN: battery remaining life, remaining life, battery RUL dataset, battery remaining, Remaining Useful Life 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Remaining Useful Life (RUL) of battery is an important parameter to know the battery’s remaining life and need for recharge. The goal of this research project is to develop machine learning-based models for the battery RUL dataset. Different ML models are developed to classify the RUL of the vehicle, and the IoT (Internet of Things) concept is simulated for automating the charging system and managing any faults aligning. The graphs plotted depict the relationship between various vehicle parameters using the Blynk IoT platform. Results show that the catboost, Multi-Layer Perceptron (MLP), Gated Recurrent Unit (GRU), and hybrid model developed could classify RUL into three classes with 99% more accuracy. The data is fed using the tkinter GUI for simulating artificial intelligence (AI)-based charging, and with a pyserial backend, data can be entered into the Esp-32 microcontroller for making charge discharge possible with the model’s predictions. Also, with an IoT system, the charging can be disconnected, monitored, and analyzed for automation. The results show that an accuracy of 99% can be obtained on models MLP, catboost model and similar accuracy on GRU model can be obtained, and finally relay-based triggering can be made by prediction through the model used for automating the charging and energy-saving mechanism. By showcasing an exemplary Blynk platform-based monitoring and automation phenomenon, we further present innovative ways of monitoring parameters and automating the system.
[AI-29] Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion EMNLP24
Abstract:During pre-training, the Text-to-Image (T2I) diffusion models encode factual knowledge into their parameters. These parameterized facts enable realistic image generation, but they may become obsolete over time, thereby misrepresenting the current state of the world. Knowledge editing techniques aim to update model knowledge in a targeted way. However, facing the dual challenges posed by inadequate editing datasets and unreliable evaluation criterion, the development of T2I knowledge editing encounter difficulties in effectively generalizing injected knowledge. In this work, we design a T2I knowledge editing framework by comprehensively spanning on three phases: First, we curate a dataset \textbfCAKE, comprising paraphrase and multi-object test, to enable more fine-grained assessment on knowledge generalization. Second, we propose a novel criterion, \textbfadaptive CLIP threshold, to effectively filter out false successful images under the current criterion and achieve reliable editing evaluation. Finally, we introduce \textbfMPE, a simple but effective approach for T2I knowledge editing. Instead of tuning parameters, MPE precisely recognizes and edits the outdated part of the conditioning text-prompt to accommodate the up-to-date knowledge. A straightforward implementation of MPE (Based on in-context learning) exhibits better overall performance than previous model editors. We hope these efforts can further promote faithful evaluation of T2I knowledge editing methods.
[AI-30] Navigation in a simplified Urban Flow through Deep Reinforcement Learning
Abstract:The increasing number of unmanned aerial vehicles (UAVs) in urban environments requires a strategy to minimize their environmental impact, both in terms of energy efficiency and noise reduction. In order to reduce these concerns, novel strategies for developing prediction models and optimization of flight planning, for instance through deep reinforcement learning (DRL), are needed. Our goal is to develop DRL algorithms capable of enabling the autonomous navigation of UAVs in urban environments, taking into account the presence of buildings and other UAVs, optimizing the trajectories in order to reduce both energetic consumption and noise. This is achieved using fluid-flow simulations which represent the environment in which UAVs navigate and training the UAV as an agent interacting with an urban environment. In this work, we consider a domain domain represented by a two-dimensional flow field with obstacles, ideally representing buildings, extracted from a three-dimensional high-fidelity numerical simulation. The presented methodology, using PPO+LSTM cells, was validated by reproducing a simple but fundamental problem in navigation, namely the Zermelo’s problem, which deals with a vessel navigating in a turbulent flow, travelling from a starting point to a target location, optimizing the trajectory. The current method shows a significant improvement with respect to both a simple PPO and a TD3 algorithm, with a success rate (SR) of the PPO+LSTM trained policy of 98.7%, and a crash rate (CR) of 0.1%, outperforming both PPO (SR = 75.6%, CR=18.6%) and TD3 (SR=77.4% and CR=14.5%). This is the first step towards DRL strategies which will guide UAVs in a three-dimensional flow field using real-time signals, making the navigation efficient in terms of flight time and avoiding damages to the vehicle.
[AI-31] Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy
链接: https://arxiv.org/abs/2409.17904 作者: Owen Henkel,Hannah Horne-Robinson,Maria Dyshel,Nabil Ch,Baptiste Moreau-Pernet,Ralph Abood 关键词-EN: paper introduces AMMORE, pairs from Rori, African countries, AMMORE dataset enables, large language models 类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a learning platform used by students in several African countries and conducts two experiments to evaluate the use of large language models (LLM) for grading particularly challenging student answers. The AMMORE dataset enables various potential analyses and provides an important resource for researching student math acquisition in understudied, real-world, educational contexts. In experiment 1 we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach – chain-of-thought prompting – accurately scored 92% of these edge cases, effectively boosting the overall accuracy of the grading from 98.7% to 99.9%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy, by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We find that relatively modest improvements in model accuracy at the individual question level can lead to significant changes in the estimation of student mastery. Where the rules-based classifier currently used to grade student, answers misclassified the mastery status of 6.9% of students across their completed lessons, using the LLM chain-of-thought approach this misclassification rate was reduced to 2.6% of students. Taken together, these findings suggest that LLMs could be a valuable tool for grading open-response questions in K-12 mathematics education, potentially enabling encouraging wider adoption of open-ended questions in formative assessment.
[AI-32] Why Companies “Democratise” Artificial Intelligence: The Case of Open Source Software Donations
Abstract:Companies claim to “democratise” artificial intelligence (AI) when they donate AI open source software (OSS) to non-profit foundations or release AI models, among others, but what does this term mean and why do they do it? As the impact of AI on society and the economy grows, understanding the commercial incentives behind AI democratisation efforts is crucial for ensuring these efforts serve broader interests beyond commercial agendas. Towards this end, this study employs a mixed-methods approach to investigate commercial incentives for 43 AI OSS donations to the Linux Foundation. It makes contributions to both research and practice. It contributes a taxonomy of both individual and organisational social, economic, and technological incentives for AI democratisation. In particular, it highlights the role of democratising the governance and control rights of an OSS project (i.e., from one company to open governance) as a structural enabler for downstream goals, such as attracting external contributors, reducing development costs, and influencing industry standards, among others. Furthermore, OSS donations are often championed by individual developers within companies, highlighting the importance of the bottom-up incentives for AI democratisation. The taxonomy provides a framework and toolkit for discerning incentives for other AI democratisation efforts, such as the release of AI models. The paper concludes with a discussion of future research directions.
[AI-33] DarkSAM: Fooling Segment Anything Model to Segment Nothing NEURIPS’24
链接: https://arxiv.org/abs/2409.17874 作者: Ziqi Zhou,Yufei Song,Minghui Li,Shengshan Hu,Xianlong Wang,Leo Yu Zhang,Dezhong Yao,Hai Jin 关键词-EN: SAM, data and tasks, recently gained, gained much attention, outstanding generalization 类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by the 38th Annual Conference on Neural Information Processing Systems (NeurIPS’24)
点击查看摘要
Abstract:Segment Anything Model (SAM) has recently gained much attention for its outstanding generalization to unseen data and tasks. Despite its promising prospect, the vulnerabilities of SAM, especially to universal adversarial perturbation (UAP) have not been thoroughly investigated yet. In this paper, we propose DarkSAM, the first prompt-free universal attack framework against SAM, including a semantic decoupling-based spatial attack and a texture distortion-based frequency attack. We first divide the output of SAM into foreground and background. Then, we design a shadow target strategy to obtain the semantic blueprint of the image as the attack target. DarkSAM is dedicated to fooling SAM by extracting and destroying crucial object features from images in both spatial and frequency domains. In the spatial domain, we disrupt the semantics of both the foreground and background in the image to confuse SAM. In the frequency domain, we further enhance the attack effectiveness by distorting the high-frequency components (i.e., texture information) of the image. Consequently, with a single UAP, DarkSAM renders SAM incapable of segmenting objects across diverse images with varying prompts. Experimental results on four datasets for SAM and its two variant models demonstrate the powerful attack capability and transferability of DarkSAM.
[AI-34] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
链接: https://arxiv.org/abs/2409.17870 作者: Shaobo Ma,Chao Fang,Haikuo Shao,Zhongfeng Wang 关键词-EN: Large language models, Large language, GPU Tensor Core, GPU Tensor, language models 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate our approach’s effectiveness, with up to 13\times speedup in matrix multiplication compared to NVIDIA’s CUTLASS. When integrated into LLMs, we achieve up to 6.7\times inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.
[AI-35] Implementing a Nordic-Baltic Federated Health Data Network: a case report
链接: https://arxiv.org/abs/2409.17865 作者: Taridzo Chomutare,Aleksandar Babic,Laura-Maria Peltonen,Silja Elunurm,Peter Lundberg,Arne Jönsson,Emma Eneling,Ciprian-Virgil Gerstenberger,Troels Siggaard,Raivo Kolde,Oskar Jerdhaf,Martin Hansson,Alexandra Makhlysheva,Miroslav Muzny,Erik Ylipää,Søren Brunak,Hercules Dalianis 关键词-EN: including privacy concerns, national borders pose, borders pose significant, pose significant challenges, including privacy 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 24 pages (including appendices), 1 figure
点击查看摘要
Abstract:Background: Centralized collection and processing of healthcare data across national borders pose significant challenges, including privacy concerns, data heterogeneity and legal barriers. To address some of these challenges, we formed an interdisciplinary consortium to develop a feder-ated health data network, comprised of six institutions across five countries, to facilitate Nordic-Baltic cooperation on secondary use of health data. The objective of this report is to offer early insights into our experiences developing this network. Methods: We used a mixed-method ap-proach, combining both experimental design and implementation science to evaluate the factors affecting the implementation of our network. Results: Technically, our experiments indicate that the network functions without significant performance degradation compared to centralized simu-lation. Conclusion: While use of interdisciplinary approaches holds a potential to solve challeng-es associated with establishing such collaborative networks, our findings turn the spotlight on the uncertain regulatory landscape playing catch up and the significant operational costs.
[AI-36] A Multimodal Single-Branch Embedding Network for Recommendation in Cold-Start and Missing Modality Scenarios RECSYS’24
链接: https://arxiv.org/abs/2409.17864 作者: Christian Ganhör,Marta Moscati,Anna Hausberger,Shah Nawaz,Markus Schedl 关键词-EN: recommender systems adopt, adopt collaborative filtering, systems adopt collaborative, past collective interactions, provide recommendations based 类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted at 18th ACM Conference on Recommender Systems (RecSys '24)
点击查看摘要
Abstract:Most recommender systems adopt collaborative filtering (CF) and provide recommendations based on past collective interactions. Therefore, the performance of CF algorithms degrades when few or no interactions are available, a scenario referred to as cold-start. To address this issue, previous work relies on models leveraging both collaborative data and side information on the users or items. Similar to multimodal learning, these models aim at combining collaborative and content representations in a shared embedding space. In this work we propose a novel technique for multimodal recommendation, relying on a multimodal Single-Branch embedding network for Recommendation (SiBraR). Leveraging weight-sharing, SiBraR encodes interaction data as well as multimodal side information using the same single-branch embedding network on different modalities. This makes SiBraR effective in scenarios of missing modality, including cold start. Our extensive experiments on large-scale recommendation datasets from three different recommendation domains (music, movie, and e-commerce) and providing multimodal content information (audio, text, image, labels, and interactions) show that SiBraR significantly outperforms CF as well as state-of-the-art content-based RSs in cold-start scenarios, and is competitive in warm scenarios. We show that SiBraR’s recommendations are accurate in missing modality scenarios, and that the model is able to map different modalities to the same region of the shared embedding space, hence reducing the modality gap.
[AI-37] Machine Learning-based vs Deep Learning-based Anomaly Detection in Multivariate Time Series for Spacecraft Attitude Sensors
链接: https://arxiv.org/abs/2409.17841 作者: R. Gallon,F. Schiemenz,A. Krstova,A. Menicucci,E. Gill 关键词-EN: traditional threshold checking, limitations commonly imposed, Isolation and Recovery, framework of Failure, Failure Detection 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for the ESA SPAICE Conference 2024
点击查看摘要
Abstract:In the framework of Failure Detection, Isolation and Recovery (FDIR) on spacecraft, new AI-based approaches are emerging in the state of the art to overcome the limitations commonly imposed by traditional threshold checking. The present research aims at characterizing two different approaches to the problem of stuck values detection in multivariate time series coming from spacecraft attitude sensors. The analysis reveals the performance differences in the two approaches, while commenting on their interpretability and generalization to different scenarios. Comments: Accepted for the ESA SPAICE Conference 2024 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2409.17841 [cs.LG] (or arXiv:2409.17841v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.17841 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-38] Detecting and Measuring Confounding Using Causal Mechanism Shifts
Abstract:Detecting and measuring confounding effects from data is a key challenge in causal inference. Existing methods frequently assume causal sufficiency, disregarding the presence of unobserved confounding variables. Causal sufficiency is both unrealistic and empirically untestable. Additionally, existing methods make strong parametric assumptions about the underlying causal generative process to guarantee the identifiability of confounding variables. Relaxing the causal sufficiency and parametric assumptions and leveraging recent advancements in causal discovery and confounding analysis with non-i.i.d. data, we propose a comprehensive approach for detecting and measuring confounding. We consider various definitions of confounding and introduce tailored methodologies to achieve three objectives: (i) detecting and measuring confounding among a set of variables, (ii) separating observed and unobserved confounding effects, and (iii) understanding the relative strengths of confounding bias between different sets of variables. We present useful properties of a confounding measure and present measures that satisfy those properties. Empirical results support the theoretical analysis.
[AI-39] Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models NEURIPS2024
链接: https://arxiv.org/abs/2409.17836 作者: Hui-Po Wang,Mario Fritz 关键词-EN: neural network gradients, long been overlooked, neural network, statistical prior models, statistical prior 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To appear in NeurIPS 2024
点击查看摘要
Abstract:Despite the widespread use of statistical prior models in various fields, such models for neural network gradients have long been overlooked. The inherent challenge stems from their high-dimensional structures and complex interdependencies, which complicate effective modeling. In this work, we demonstrate the potential of large language models (LLMs) to act as gradient priors in a zero-shot setting. We examine the property by considering lossless gradient compression – a critical application in distributed learning – that depends heavily on precise probability modeling. To achieve this, we introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Our technique converts plain gradients into text-like formats, enhancing token efficiency by up to 38 times compared to their plain representations. We ensure that this data conversion maintains a close alignment with the structure of plain gradients and the symbols commonly recognized by LLMs. Our experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods, improving compression rates by 10% up to 17.2% across various datasets and architectures. Additionally, our approach shows promising compatibility with lossy compression techniques such as quantization and sparsification. These findings highlight the significant potential of LLMs as a model for effectively handling gradients. We will release the source code upon publication.
[AI-40] Inference-Time Language Model Alignment via Integrated Value Guidance EMNLP2024
链接: https://arxiv.org/abs/2409.17819 作者: Zhixuan Liu,Zhanhui Zhou,Yuanfu Wang,Chao Yang,Yu Qiao 关键词-EN: Large language models, human preferences, intensive and complex, tuning large models, Large language 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Findings
点击查看摘要
Abstract:Large language models are typically fine-tuned to align with human preferences, but tuning large models is computationally intensive and complex. In this work, we introduce \textitIntegrated Value Guidance (IVG), a method that uses implicit and explicit value functions to guide language model decoding at token and chunk-level respectively, efficiently aligning large language models purely at inference time. This approach circumvents the complexities of direct fine-tuning and outperforms traditional methods. Empirically, we demonstrate the versatility of IVG across various tasks. In controlled sentiment generation and summarization tasks, our method significantly improves the alignment of large models using inference-time guidance from \textttgpt2 -based value functions. Moreover, in a more challenging instruction-following benchmark AlpacaEval 2.0, we show that both specifically tuned and off-the-shelf value functions greatly improve the length-controlled win rates of large models against \textttgpt-4-turbo (e.g., 19.51% \rightarrow 26.51% for \textttMistral-7B-Instruct-v0.2 and 25.58% \rightarrow 33.75% for \textttMixtral-8x7B-Instruct-v0.1 with Tulu guidance).
[AI-41] DREAMS: A python framework to train deep learning models with model card reporting for medical and health applications
链接: https://arxiv.org/abs/2409.17815 作者: Rabindra Khadka,Pedro G Lind,Anis Yazidi,Asma Belhadi 关键词-EN: EEG data analysis, EEG data, observe brain activity, EEG, EEG data processing 类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Electroencephalography (EEG) data provides a non-invasive method for researchers and clinicians to observe brain activity in real time. The integration of deep learning techniques with EEG data has significantly improved the ability to identify meaningful patterns, leading to valuable insights for both clinical and research purposes. However, most of the frameworks so far, designed for EEG data analysis, are either too focused on pre-processing or in deep learning methods per, making their use for both clinician and developer communities problematic. Moreover, critical issues such as ethical considerations, biases, uncertainties, and the limitations inherent in AI models for EEG data analysis are frequently overlooked, posing challenges to the responsible implementation of these technologies. In this paper, we introduce a comprehensive deep learning framework tailored for EEG data processing, model training and report generation. While constructed in way to be adapted and developed further by AI developers, it enables to report, through model cards, the outcome and specific information of use for both developers and clinicians. In this way, we discuss how this framework can, in the future, provide clinical researchers and developers with the tools needed to create transparent and accountable AI models for EEG data analysis and diagnosis.
[AI-42] Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness EMNLP2024
链接: https://arxiv.org/abs/2409.17791 作者: Jian Li,Haojing Huang,Yujia Zhang,Pengfei Xu,Xi Chen,Rui Song,Lida Shi,Jingwen Wang,Hao Xu 关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Direct Preference Optimization, Language Models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at EMNLP 2024 Findings
点击查看摘要
Abstract:Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly use a binary cross-entropy mechanism on pairwise samples, i.e., minimizing and maximizing the loss based on preferred or dis-preferred responses, respectively. However, while this training strategy omits the reward model, it also overlooks the varying preference degrees within different responses. We hypothesize that this is a key factor hindering LLMs from sufficiently understanding human preferences. To address this problem, we propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference. Extensive experiments are conducted on two widely used datasets of different tasks. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods and significantly boost their performance to achieve state-of-the-art performance. We also conduct detailed analyses to offer comprehensive insights into SPO, which verifies its effectiveness. The code is available at this https URL.
[AI-43] Ophthalmic Biomarker Detection with Parallel Prediction of Transformer and Convolutional Architecture
链接: https://arxiv.org/abs/2409.17788 作者: Md. Touhidul Islam,Md. Abtahi Majeed Chowdhury,Mahmudul Hasan,Asif Quadir,Lutfa Aktar 关键词-EN: global health issue, Optical Coherence Tomography, Ophthalmic diseases represent, precise diagnostic tools, health issue 类目: Artificial Intelligence (cs.AI)
*备注: 5 pages
点击查看摘要
Abstract:Ophthalmic diseases represent a significant global health issue, necessitating the use of advanced precise diagnostic tools. Optical Coherence Tomography (OCT) imagery which offers high-resolution cross-sectional images of the retina has become a pivotal imaging modality in ophthalmology. Traditionally physicians have manually detected various diseases and biomarkers from such diagnostic imagery. In recent times, deep learning techniques have been extensively used for medical diagnostic tasks enabling fast and precise diagnosis. This paper presents a novel approach for ophthalmic biomarker detection using an ensemble of Convolutional Neural Network (CNN) and Vision Transformer. While CNNs are good for feature extraction within the local context of the image, transformers are known for their ability to extract features from the global context of the image. Using an ensemble of both techniques allows us to harness the best of both worlds. Our method has been implemented on the OLIVES dataset to detect 6 major biomarkers from the OCT images and shows significant improvement of the macro averaged F1 score on the dataset.
[AI-44] Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
链接: https://arxiv.org/abs/2409.17777 作者: Raja Kumar,Raghav Singhal,Pranamya Kulkarni,Deval Mehta,Kshitij Jadhav 关键词-EN: shown remarkable success, Deep multimodal learning, Deep multimodal, leveraging contrastive learning, Mixup-based contrastive loss 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: RK and RS contributed equally to this work, 20 Pages, 8 Figures, 9 Tables
点击查看摘要
Abstract:Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.
[AI-45] Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations EMNLP2024
链接: https://arxiv.org/abs/2409.17774 作者: Supriya Manna,Niladri Sett 关键词-EN: critical metric, metric to assess, assess the reliability, reliability of explainable, Faithfulness 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted as a Full Paper at EMNLP 2024 Workshop BlackBoxNLP
点击查看摘要
Abstract:Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer’s response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
[AI-46] Federated Learning under Attack: Improving Gradient Inversion for Batch of Images
Abstract:Federated Learning (FL) has emerged as a machine learning approach able to preserve the privacy of user’s data. Applying FL, clients train machine learning models on a local dataset and a central server aggregates the learned parameters coming from the clients, training a global machine learning model without sharing user’s data. However, the state-of-the-art shows several approaches to promote attacks on FL systems. For instance, inverting or leaking gradient attacks can find, with high precision, the local dataset used during the training phase of the FL. This paper presents an approach, called Deep Leakage from Gradients with Feedback Blending (DLG-FB), which is able to improve the inverting gradient attack, considering the spatial correlation that typically exists in batches of images. The performed evaluation shows an improvement of 19.18% and 48,82% in terms of attack success rate and the number of iterations per attacked image, respectively.
[AI-47] Confidence intervals uncovered: Are we ready for real-world medical imaging AI? MICCAI2024
链接: https://arxiv.org/abs/2409.17763 作者: Evangelia Christodoulou,Annika Reinke,Rola Houhou,Piotr Kalinowski,Selen Erkan,Carole H. Sudre,Ninon Burgos,Sofiène Boutaj,Sophie Loizillon,Maëlys Solal,Nicola Rieke,Veronika Cheplygina,Michela Antonelli,Leon D. Mayer,Minu D. Tizabi,M. Jorge Cardoso,Amber Simpson,Paul F. Jäger,Annette Kopp-Schneider,Gaël Varoquaux,Olivier Colliot,Lena Maier-Hein 关键词-EN: Medical imaging, transformation of healthcare, imaging is spearheading, Performance, Medical 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted at MICCAI 2024 conference
点击查看摘要
Abstract:Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determine which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03 which is three times larger than the median performance gap between the first and second ranked method. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.
[AI-48] Integrating Hierarchical Semantic into Iterative Generation Model for Entailment Tree Explanation
链接: https://arxiv.org/abs/2409.17757 作者: Qin Wang,Jianzhou Feng,Yiming Xu 关键词-EN: explainable question answering, Manifestly and logically, question answering, logically displaying, reasoning from evidence 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Manifestly and logically displaying the line of reasoning from evidence to answer is significant to explainable question answering (QA). The entailment tree exhibits the lines structurally, which is different from the self-explanation principle in large-scale language models. Existing methods rarely consider the semantic association of sentences between and within hierarchies within the tree structure, which is prone to apparent mistakes in combinations. In this work, we propose an architecture of integrating the Hierarchical Semantics of sentences under the framework of Controller-Generator (HiSCG) to explain answers. The HiSCG designs a hierarchical mapping between hypotheses and facts, discriminates the facts involved in tree constructions, and optimizes single-step entailments. To the best of our knowledge, We are the first to notice hierarchical semantics of sentences between the same layer and adjacent layers to yield improvements. The proposed method achieves comparable performance on all three settings of the EntailmentBank dataset. The generalization results on two out-of-domain datasets also demonstrate the effectiveness of our method.
[AI-49] SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning
链接: https://arxiv.org/abs/2409.17755 作者: Rimvydas Rubavicius,Peter David Fagan,Alex Lascarides,Subramanian Ramamoorthy 关键词-EN: interactive task learning, challenging interactive task, task learning scenario, paper addresses, addresses a challenging 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 10 pages,4 figures, 2 tables
点击查看摘要
Abstract:This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the robot is unaware of a concept that’s key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems by fixing a deficient domain model using embodied conversation. Through dialogue, the robot discovers and then learns to exploit unforeseen possibilities. Using SECURE, the robot not only learns from the user’s corrective feedback when it makes a mistake, but it also learns to make strategic dialogue decisions for revealing useful evidence about novel concepts for solving the instructed task. Together, these abilities allow the robot to generalise to subsequent tasks using newly acquired knowledge. We demonstrate that a robot that is semantics-aware – that is, it exploits the logical consequences of both sentence and discourse semantics in the learning and inference process – learns to solve rearrangement under unawareness more effectively than a robot that lacks such capabilities.
[AI-50] Byzantine-Robust Aggregation for Securing Decentralized Federated Learning
Abstract:Federated Learning (FL) emerges as a distributed machine learning approach that addresses privacy concerns by training AI models locally on devices. Decentralized Federated Learning (DFL) extends the FL paradigm by eliminating the central server, thereby enhancing scalability and robustness through the avoidance of a single point of failure. However, DFL faces significant challenges in optimizing security, as most Byzantine-robust algorithms proposed in the literature are designed for centralized scenarios. In this paper, we present a novel Byzantine-robust aggregation algorithm to enhance the security of Decentralized Federated Learning environments, coined WFAgg. This proposal handles the adverse conditions and strength robustness of dynamic decentralized topologies at the same time by employing multiple filters to identify and mitigate Byzantine attacks. Experimental results demonstrate the effectiveness of the proposed algorithm in maintaining model accuracy and convergence in the presence of various Byzantine attack scenarios, outperforming state-of-the-art centralized Byzantine-robust aggregation schemes (such as Multi-Krum or Clustering). These algorithms are evaluated on an IID image classification problem in both centralized and decentralized scenarios.
[AI-51] AlterMOMA: Fusion Redundancy Pruning for Camera-LiDAR Fusion Models with Alternative Modality Masking NEURIPS2024
Abstract:Camera-LiDAR fusion models significantly enhance perception performance in autonomous driving. The fusion mechanism leverages the strengths of each modality while minimizing their weaknesses. Moreover, in practice, camera-LiDAR fusion models utilize pre-trained backbones for efficient training. However, we argue that directly loading single-modal pre-trained camera and LiDAR backbones into camera-LiDAR fusion models introduces similar feature redundancy across modalities due to the nature of the fusion mechanism. Unfortunately, existing pruning methods are developed explicitly for single-modal models, and thus, they struggle to effectively identify these specific redundant parameters in camera-LiDAR fusion models. In this paper, to address the issue above on camera-LiDAR fusion models, we propose a novelty pruning framework Alternative Modality Masking Pruning (AlterMOMA), which employs alternative masking on each modality and identifies the redundant parameters. Specifically, when one modality parameters are masked (deactivated), the absence of features from the masked backbone compels the model to reactivate previous redundant features of the other modality backbone. Therefore, these redundant features and relevant redundant parameters can be identified via the reactivation process. The redundant parameters can be pruned by our proposed importance score evaluation function, Alternative Evaluation (AlterEva), which is based on the observation of the loss changes when certain modality parameters are activated and deactivated. Extensive experiments on the nuScene and KITTI datasets encompassing diverse tasks, baseline models, and pruning algorithms showcase that AlterMOMA outperforms existing pruning methods, attaining state-of-the-art performance.
[AI-52] Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience
链接: https://arxiv.org/abs/2409.17702 作者: Leonard Bärmann,Chad DeChant,Joana Plewnia,Fabian Peller-Konrad,Daniel Bauer,Tamim Asfour,Alex Waibel 关键词-EN: improving human-robot interaction, human-robot interaction, question answering, crucial ability, ability for improving 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Code, data and demo videos at this https URL
点击查看摘要
Abstract:Verbalization of robot experience, i.e., summarization of and question answering about a robot’s past, is a crucial ability for improving human-robot interaction. Previous works applied rule-based systems or fine-tuned deep models to verbalize short (several-minute-long) streams of episodic data, limiting generalization and transferability. In our work, we apply large pretrained models to tackle this task with zero or few examples, and specifically focus on verbalizing life-long experiences. For this, we derive a tree-like data structure from episodic memory (EM), with lower levels representing raw perception and proprioception data, and higher levels abstracting events to natural language concepts. Given such a hierarchical representation built from the experience stream, we apply a large language model as an agent to interactively search the EM given a user’s query, dynamically expanding (initially collapsed) tree nodes to find the relevant information. The approach keeps computational costs low even when scaling to months of robot experience data. We evaluate our method on simulated household robot data, human egocentric videos, and real-world robot recordings, demonstrating its flexibility and scalability.
[AI-53] MoJE: Mixture of Jailbreak Experts Naive Tabular Classifiers as Guard for Prompt Attacks
链接: https://arxiv.org/abs/2409.17699 作者: Giandomenico Cornacchia,Giulio Zizzo,Kieran Fraser,Muhammad Zaid Hamed,Ambrish Rawat,Mark Purcell 关键词-EN: Large Language Models, Large Language, diverse applications underscores, proliferation of Large, thwart potential jailbreak 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The proliferation of Large Language Models (LLMs) in diverse applications underscores the pressing need for robust security measures to thwart potential jailbreak attacks. These attacks exploit vulnerabilities within LLMs, endanger data integrity and user privacy. Guardrails serve as crucial protective mechanisms against such threats, but existing models often fall short in terms of both detection accuracy, and computational efficiency. This paper advocates for the significance of jailbreak attack prevention on LLMs, and emphasises the role of input guardrails in safeguarding these models. We introduce MoJE (Mixture of Jailbreak Expert), a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails. By employing simple linguistic statistical techniques, MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference. Through rigorous experimentation, MoJE demonstrates superior performance capable of detecting 90% of the attacks without compromising benign prompts, enhancing LLMs security against jailbreak attacks.
[AI-54] he application of GPT-4 in grading design university students assignment and providing feedback: An exploratory study
链接: https://arxiv.org/abs/2409.17698 作者: Qian Huang,Thijs Willems,King Wang Poon 关键词-EN: Custom GPT, GPT, Custom, design, design university students 类目: Artificial Intelligence (cs.AI)
*备注: 25 pages, 5 figures
点击查看摘要
Abstract:This study aims to investigate whether GPT-4 can effectively grade assignments for design university students and provide useful feedback. In design education, assignments do not have a single correct answer and often involve solving an open-ended design problem. This subjective nature of design projects often leads to grading problems,as grades can vary between different raters,for instance instructor from engineering background or architecture background. This study employs an iterative research approach in developing a Custom GPT with the aim of achieving more reliable results and testing whether it can provide design students with constructive feedback. The findings include: First,through several rounds of iterations the inter-reliability between GPT and human raters reached a level that is generally accepted by educators. This indicates that by providing accurate prompts to GPT,and continuously iterating to build a Custom GPT, it can be used to effectively grade students’ design assignments, serving as a reliable complement to human raters. Second, the intra-reliability of GPT’s scoring at different times is between 0.65 and 0.78. This indicates that, with adequate instructions, a Custom GPT gives consistent results which is a precondition for grading students. As consistency and comparability are the two main rules to ensure the reliability of educational assessment, this study has looked at whether a Custom GPT can be developed that adheres to these two rules. We finish the paper by testing whether Custom GPT can provide students with useful feedback and reflecting on how educators can develop and iterate a Custom GPT to serve as a complementary rater.
[AI-55] MIO: A Foundation Model on Multimodal Tokens
链接: https://arxiv.org/abs/2409.17692 作者: Zekun Wang,King Zhu,Chunpu Xu,Wangchunshu Zhou,Jiaheng Liu,Yibo Zhang,Jiashuo Wang,Ning Shi,Siyu Li,Yizhi Li,Haoran Que,Zhaoxiang Zhang,Yuanxing Zhang,Ge Zhang,Ke Xu,Jie Fu,Wenhao Huang 关键词-EN: foundation model built, large language models, autoregressive manner, understanding and generating, language models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Technical Report. Codes and models will be available soon
点击查看摘要
Abstract:In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
[AI-56] Efficient Bias Mitigation Without Privileged Information ECCV2024
链接: https://arxiv.org/abs/2409.17691 作者: Mateo Espinosa Zarlenga,Swami Sankaranarayanan,Jerone T. A. Andrews,Zohreh Shams,Mateja Jamnik,Alice Xiang 关键词-EN: Deep neural networks, empirical risk minimisation, Deep neural, grassy background, neural networks trained 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at the 18th European Conference on Computer Vision (ECCV 2024) as an Oral presentation
点击查看摘要
Abstract:Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups, particularly when group and task labels are spuriously correlated (e.g., “grassy background” and “cows”). Existing bias mitigation methods that aim to address this issue often either rely on group labels for training or validation, or require an extensive hyperparameter search. Such data and computational requirements hinder the practical deployment of these methods, especially when datasets are too large to be group-annotated, computational resources are limited, and models are trained through already complex pipelines. In this paper, we propose Targeted Augmentations for Bias Mitigation (TAB), a simple hyperparameter-free framework that leverages the entire training history of a helper model to identify spurious samples, and generate a group-balanced training set from which a robust model can be trained. We show that TAB improves worst-group performance without any group information or model selection, outperforming existing methods while maintaining overall accuracy.
[AI-57] Graph Edit Distance with General Costs Using Neural Set Divergence NEURIPS2024
链接: https://arxiv.org/abs/2409.17687 作者: Eeshaan Jain,Indradyumna Roy,Saswat Meher,Soumen Chakrabarti,Abir De 关键词-EN: Graph Edit Distance, minimum-cost edit sequence, Edit Distance, GED, sequence that transforms 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at NeurIPS 2024
点击查看摘要
Abstract:Graph Edit Distance (GED) measures the (dis-)similarity between two given graphs, in terms of the minimum-cost edit sequence that transforms one graph to the other. However, the exact computation of GED is NP-Hard, which has recently motivated the design of neural methods for GED estimation. However, they do not explicitly account for edit operations with different costs. In response, we propose GRAPHEDX, a neural GED estimator that can work with general costs specified for the four edit operations, viz., edge deletion, edge addition, node deletion and node addition. We first present GED as a quadratic assignment problem (QAP) that incorporates these four costs. Then, we represent each graph as a set of node and edge embeddings and use them to design a family of neural set divergence surrogates. We replace the QAP terms corresponding to each operation with their surrogates. Computing such neural set divergence require aligning nodes and edges of the two graphs. We learn these alignments using a Gumbel-Sinkhorn permutation generator, additionally ensuring that the node and edge alignments are consistent with each other. Moreover, these alignments are cognizant of both the presence and absence of edges between node-pairs. Experiments on several datasets, under a variety of edit cost settings, show that GRAPHEDX consistently outperforms state-of-the-art methods and heuristics in terms of prediction error.
[AI-58] Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets
Abstract:One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models is dependent on the quantity of the training dataset. However, in many medical applications, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL), designed to enhance classification performance on small medical datasets through synthetic data generation. The AGCL framework involves feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and the generation of synthetic data points from clusters with distinct class representations. This method was applied to Parkinson’s disease screening, utilizing facial expression data, and evaluated across multiple machine learning classifiers. Experimental results demonstrate that AGCL significantly improves classification accuracy compared to baseline, GN and kNNMTD. AGCL achieved the highest overall test accuracy of 83.33% and cross-validation accuracy of 90.90% in majority voting over different emotions, confirming its effectiveness in augmenting small datasets.
[AI-59] Preserving logical and functional dependencies in synthetic tabular data
链接: https://arxiv.org/abs/2409.17684 作者: Chaithra Umesh,Kristian Schultz,Manjunath Mahendra,Saparshi Bej,Olaf Wolkenhauer 关键词-EN: data generation, tabular data generation, tabular data, data, data generation algorithms 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to Pattern Recognition Journal
点击查看摘要
Abstract:Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data is yet to be explored. In addition to the existing notion of functional dependencies, we introduce the notion of logical dependencies among the attributes in this article. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Utilizing this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets. In addition, we also showed that some tabular synthetic data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state-of-the-art reveal research needs and opportunities to develop task-specific synthetic tabular data generation models.
[AI-60] Zero- and Few-shot Named Entity Recognition and Text Expansion in Medication Prescriptions using ChatGPT
链接: https://arxiv.org/abs/2409.17683 作者: Natthanaphop Isaradech,Andrea Riedel,Wachiranun Sirikul,Markus Kreuzthaler,Stefan Schulz 关键词-EN: local brand, formats and abbreviations, include a mix, wide range, range of idiosyncratic 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Introduction: Medication prescriptions are often in free text and include a mix of two languages, local brand names, and a wide range of idiosyncratic formats and abbreviations. Large language models (LLMs) have shown promising ability to generate text in response to input prompts. We use ChatGPT 3.5 to automatically structure and expand medication statements in discharge summaries and thus make them easier to interpret for people and machines. Methods: Named-entity Recognition (NER) and Text Expansion (EX) are used in a zero- and few-shot setting with different prompt strategies. 100 medication statements were manually annotated and curated. NER performance was measured by using strict and partial matching. For the task EX, two experts interpreted the results by assessing semantic equivalence between original and expanded statements. The model performance was measured by precision, recall, and F1 score. Results: For NER, the best-performing prompt reached an average F1 score of 0.94 in the test set. For EX, the few-shot prompt showed superior performance among other prompts, with an average F1 score of 0.87. Conclusion: Our study demonstrates good performance for NER and EX tasks in free-text medication statements using ChatGPT. Compared to a zero-shot baseline, a few-shot approach prevented the system from hallucinating, which would be unacceptable when processing safety-relevant medication data.
Abstract:Recent concept-based interpretable models have succeeded in providing meaningful explanations by pre-defined concept sets. However, the dependency on the pre-defined concepts restricts the application because of the limited number of concepts for explanations. This paper proposes a novel interpretable deep neural network called explanation bottleneck models (XBMs). XBMs generate a text explanation from the input without pre-defined concepts and then predict a final task prediction based on the generated explanation by leveraging pre-trained vision-language encoder-decoder models. To achieve both the target task performance and the explanation quality, we train XBMs through the target task loss with the regularization penalizing the explanation decoder via the distillation from the frozen pre-trained decoder. Our experiments, including a comparison to state-of-the-art concept bottleneck models, confirm that XBMs provide accurate and fluent natural language explanations without pre-defined concept sets. Code will be available at this https URL.
[AI-62] A Fuzzy-based Approach to Predict Human Interaction by Functional Near-Infrared Spectroscopy
Abstract:The paper introduces a Fuzzy-based Attention (Fuzzy Attention Layer) mechanism, a novel computational approach to enhance the interpretability and efficacy of neural models in psychological research. The proposed Fuzzy Attention Layer mechanism is integrated as a neural network layer within the Transformer Encoder model to facilitate the analysis of complex psychological phenomena through neural signals, such as those captured by functional Near-Infrared Spectroscopy (fNIRS). By leveraging fuzzy logic, the Fuzzy Attention Layer is capable of learning and identifying interpretable patterns of neural activity. This capability addresses a significant challenge when using Transformer: the lack of transparency in determining which specific brain activities most contribute to particular predictions. Our experimental results demonstrated on fNIRS data from subjects engaged in social interactions involving handholding reveal that the Fuzzy Attention Layer not only learns interpretable patterns of neural activity but also enhances model performance. Additionally, the learned patterns provide deeper insights into the neural correlates of interpersonal touch and emotional exchange. The application of our model shows promising potential in deciphering the subtle complexities of human social behaviors, thereby contributing significantly to the fields of social neuroscience and psychological AI.
[AI-63] Hierarchical End-to-End Autonomous Driving: Integrating BEV Perception with Deep Reinforcement Learning
链接: https://arxiv.org/abs/2409.17659 作者: Siyi Lu,Lei He,Shengbo Eben Li,Yugong Luo,Jianqiang Wang,Keqiang Li 关键词-EN: traditional modular pipeline, Deep Reinforcement Learning, modular pipeline, offers a streamlined, streamlined alternative 类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:End-to-end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird’s-Eye-View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20%.
[AI-64] Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection ICASSP2025
链接: https://arxiv.org/abs/2409.17656 作者: Pengfei Cai,Yan Song,Nan Jiang,Qing Gu,Ian McLoughlin 关键词-EN: sound event detection, labeled data due, high annotation costs, Masked Audio Model, labeled data 类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP2025; The code for this paper will be available at this https URL after the paper is accepted
点击查看摘要
Abstract:A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model~(PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique.
[AI-65] AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment
链接: https://arxiv.org/abs/2409.17655 作者: Nan Sun,Bo Mao,Yongchang Li,Lumeng Ma,Di Guo,Huaping Liu 关键词-EN: motivated significant research, autonomous robotic systems, Large Language Models, increasing demand, demand for intelligent 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 6 pages, 8 figures, 4 tables
点击查看摘要
Abstract:The increasing demand for intelligent assistants in human-populated environments has motivated significant research in autonomous robotic systems. Traditional service robots and virtual assistants, however, struggle with real-world task execution due to their limited capacity for dynamic reasoning and interaction, particularly when human collaboration is required. Recent developments in Large Language Models have opened new avenues for improving these systems, enabling more sophisticated reasoning and natural interaction capabilities. In this paper, we introduce AssistantX, an LLM-powered proactive assistant designed to operate autonomously in a physical office environment. Unlike conventional service robots, AssistantX leverages a novel multi-agent architecture, PPDR4X, which provides advanced inference capabilities and comprehensive collaboration awareness. By effectively bridging the gap between virtual operations and physical interactions, AssistantX demonstrates robust performance in managing complex real-world scenarios. Our evaluation highlights the architecture’s effectiveness, showing that AssistantX can respond to clear instructions, actively retrieve supplementary information from memory, and proactively seek collaboration from team members to ensure successful task completion. More details and videos can be found at this https URL.
[AI-66] FactorSim: Generative Simulation via Factorized Representation NEURIPS2024
链接: https://arxiv.org/abs/2409.17652 作者: Fan-Yun Sun,S. I. Harini,Angela Yi,Yihan Zhou,Alex Zook,Jonathan Tremblay,Logan Cross,Jiajun Wu,Nick Haber 关键词-EN: remains an open-ended, natural language input, train intelligent agents, open-ended challenge, task documentation 类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: neurips 2024, project website: this https URL
点击查看摘要
Abstract:Generating simulations to train intelligent agents in game-playing and robotics from natural language input, from user input or task documentation, remains an open-ended challenge. Existing approaches focus on parts of this challenge, such as generating reward functions or task hyperparameters. Unlike previous work, we introduce FACTORSIM that generates full simulations in code from language input that can be used to train agents. Exploiting the structural modularity specific to coded simulations, we propose to use a factored partially observable Markov decision process representation that allows us to reduce context dependence during each step of the generation. For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code’s accuracy and effectiveness in facilitating zero-shot transfers in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations regarding prompt alignment (e.g., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks.
[AI-67] Digital Twin Ecosystem for Oncology Clinical Operations
链接: https://arxiv.org/abs/2409.17650 作者: Himanshu Pandey,Akhil Amod,Shivang,Kshitij Jaggi,Ruchi Garg,Abheet Jain,Vinayak Tantia 关键词-EN: Large Language Models, Artificial Intelligence, Large Language, hold significant promise, Language Models 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Pre Print
点击查看摘要
Abstract:Artificial Intelligence (AI) and Large Language Models (LLMs) hold significant promise in revolutionizing healthcare, especially in clinical applications. Simultaneously, Digital Twin technology, which models and simulates complex systems, has gained traction in enhancing patient care. However, despite the advances in experimental clinical settings, the potential of AI and digital twins to streamline clinical operations remains largely untapped. This paper introduces a novel digital twin framework specifically designed to enhance oncology clinical operations. We propose the integration of multiple specialized digital twins, such as the Medical Necessity Twin, Care Navigator Twin, and Clinical History Twin, to enhance workflow efficiency and personalize care for each patient based on their unique data. Furthermore, by synthesizing multiple data sources and aligning them with the National Comprehensive Cancer Network (NCCN) guidelines, we create a dynamic Cancer Care Path, a continuously evolving knowledge base that enables these digital twins to provide precise, tailored clinical recommendations.
[AI-68] AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure
链接: https://arxiv.org/abs/2409.17642 作者: Xi Chen,Zhiyang Zhang,Fangkai Yang,Xiaoting Qin,Chao Du,Xi Cheng,Hangxin Liu,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Qi Zhang 关键词-EN: Large language model, Large language, language model, conversational interfaces, increasingly utilized 类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:Large language model (LLM)-based AI delegates are increasingly utilized to act on behalf of users, assisting them with a wide range of tasks through conversational interfaces. Despite their advantages, concerns arise regarding the potential risk of privacy leaks, particularly in scenarios involving social interactions. While existing research has focused on protecting privacy by limiting the access of AI delegates to sensitive user information, many social scenarios require disclosing private details to achieve desired outcomes, necessitating a balance between privacy protection and disclosure. To address this challenge, we conduct a pilot study to investigate user preferences for AI delegates across various social relations and task scenarios, and then propose a novel AI delegate system that enables privacy-conscious self-disclosure. Our user study demonstrates that the proposed AI delegate strategically protects privacy, pioneering its use in diverse and dynamic social interactions.
[AI-69] 3: A Novel Zero-shot Transfer Learning Framework Iteratively Training on an Assistant Task for a Target Task
链接: https://arxiv.org/abs/2409.17640 作者: Xindi Tong,Yujin Zhu,Shijian Fan,Liang Xu 关键词-EN: Large Language Models, processing large volumes, efficiently processing large, contextual details dealing, Language Models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Long text summarization, gradually being essential for efficiently processing large volumes of information, stays challenging for Large Language Models (LLMs) such as GPT and LLaMA families because of the insufficient open-sourced training datasets and the high requirement of contextual details dealing. To address the issue, we design a novel zero-shot transfer learning framework, abbreviated as T3, to iteratively training a baseline LLM on an assistant task for the target task, where the former should own richer data resources and share structural or semantic similarity with the latter. In practice, T3 is approached to deal with the long text summarization task by utilizing question answering as the assistant task, and further validated its effectiveness on the BBC summary, NarraSum, FairytaleQA, and NLQuAD datasets, with up to nearly 14% improvement in ROUGE, 35% improvement in BLEU, and 16% improvement in Factscore compared to three baseline LLMs, demonstrating its potential for more assistant-target task combinations.
[AI-70] P4Q: Learning to Prompt for Quantization in Visual-language Models
Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have gained prominence in various visual and multimodal tasks, yet the deployment of VLMs on downstream application platforms remains challenging due to their prohibitive requirements of training samples and computing resources. Fine-tuning and quantization of VLMs can substantially reduce the sample and computation costs, which are in urgent need. There are two prevailing paradigms in quantization, Quantization-Aware Training (QAT) can effectively quantize large-scale VLMs but incur a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization named ``Prompt for Quantization’’ (P4Q), in which we design a lightweight architecture to leverage contrastive loss supervision to enhance the recognition performance of a PTQ model. Our method can effectively reduce the gap between image features and text features caused by low-bit quantization, based on learnable prompts to reorganize textual representations and a low-bit adapter to realign the distributions of image and text features. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model using a full-precision teacher. Extensive experimental results demonstrate that our P4Q method outperforms prior arts, even achieving comparable results to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4 \times while achieving 66.94% Top-1 accuracy, outperforming the learnable prompt fine-tuned full-precision model by 2.24% with negligible additional parameters on the ImageNet dataset.
[AI-71] Hand-object reconstruction via interaction-aware graph attention mechanism ICIP2024
链接: https://arxiv.org/abs/2409.17629 作者: Taeyun Woo,Tae-Kyun Kim,Jinah Park 关键词-EN: advanced vision computing, Estimating the poses, vision computing, important area, area of research 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, Accepted by ICIP 2024
点击查看摘要
Abstract:Estimating the poses of both a hand and an object has become an important area of research due to the growing need for advanced vision computing. The primary challenge involves understanding and reconstructing how hands and objects interact, such as contact and physical plausibility. Existing approaches often adopt a graph neural network to incorporate spatial information of hand and object meshes. However, these approaches have not fully exploited the potential of graphs without modification of edges within and between hand- and object-graphs. We propose a graph-based refinement method that incorporates an interaction-aware graph-attention mechanism to account for hand-object interactions. Using edges, we establish connections among closely correlated nodes, both within individual graphs and across different graphs. Experiments demonstrate the effectiveness of our proposed method with notable improvements in the realm of physical plausibility.
[AI-72] Neural P3M: A Long-Range Interaction Modeling Enhancer for Geometric GNNs NEURIPS2024
链接: https://arxiv.org/abs/2409.17622 作者: Yusong Wang,Chaoran Cheng,Shaoning Li,Yuxuan Ren,Bin Shao,Ge Liu,Pheng-Ann Heng,Nanning Zheng 关键词-EN: modeling molecular geometry, graph neural networks, Geometric graph neural, emerged as powerful, powerful tools 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published as a conference paper at NeurIPS 2024
点击查看摘要
Abstract:Geometric graph neural networks (GNNs) have emerged as powerful tools for modeling molecular geometry. However, they encounter limitations in effectively capturing long-range interactions in large molecular systems. To address this challenge, we introduce Neural P ^3 M, a versatile enhancer of geometric GNNs to expand the scope of their capabilities by incorporating mesh points alongside atoms and reimaging traditional mathematical operations in a trainable manner. Neural P ^3 M exhibits flexibility across a wide range of molecular systems and demonstrates remarkable accuracy in predicting energies and forces, outperforming on benchmarks such as the MD22 dataset. It also achieves an average improvement of 22% on the OE62 dataset while integrating with various architectures.
[AI-73] Dirichlet-Based Coarse-to-Fine Example Selection For Open-Set Annotation
链接: https://arxiv.org/abs/2409.17607 作者: Ye-Wen Wang,Chen-Chen Zong,Ming-Kun Xie,Sheng-Jun Huang 关键词-EN: achieved great success, Active learning, achieved great, great success, Active 类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Active learning (AL) has achieved great success by selecting the most valuable examples from unlabeled data. However, they usually deteriorate in real scenarios where open-set noise gets involved, which is studied as open-set annotation (OSA). In this paper, we owe the deterioration to the unreliable predictions arising from softmax-based translation invariance and propose a Dirichlet-based Coarse-to-Fine Example Selection (DCFS) strategy accordingly. Our method introduces simplex-based evidential deep learning (EDL) to break translation invariance and distinguish known and unknown classes by considering evidence-based data and distribution uncertainty simultaneously. Furthermore, hard known-class examples are identified by model discrepancy generated from two classifier heads, where we amplify and alleviate the model discrepancy respectively for unknown and known classes. Finally, we combine the discrepancy with uncertainties to form a two-stage strategy, selecting the most informative examples from known classes. Extensive experiments on various openness ratio datasets demonstrate that DCFS achieves state-of-art performance.
[AI-74] Open Digital Rights Enforcement Framework (ODRE): from descriptive to enforceable policies
链接: https://arxiv.org/abs/2409.17602 作者: Andrea Cimmino,Juan Cano-Benito,Raúl García-Castro 关键词-EN: Data Spaces, Open Digital, ODRL, data usage policies, decentralised ecosystems 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 20 pages, 3 Figures, Submitted to Computers Security journal
点击查看摘要
Abstract:From centralised platforms to decentralised ecosystems, like Data Spaces, sharing data has become a paramount challenge. For this reason, the definition of data usage policies has become crucial in these domains, highlighting the necessity of effective policy enforcement mechanisms. The Open Digital Rights Language (ODRL) is a W3C standard ontology designed to describe data usage policies, however, it lacks built-in enforcement capabilities, limiting its practical application. This paper introduces the Open Digital Rights Enforcement (ODRE) framework, whose goal is to provide ODRL with enforcement capabilities. The ODRE framework proposes a novel approach to express ODRL policies that integrates the descriptive ontology terms of ODRL with other languages that allow behaviour specification, such as dynamic data handling or function evaluation. The framework includes an enforcement algorithm for ODRL policies and two open-source implementations in Python and Java. The ODRE framework is also designed to support future extensions of ODRL to specific domain scenarios. In addition, current limitations of ODRE, ODRL, and current challenges are reported. Finally, to demonstrate the enforcement capabilities of the implementations, their performance, and their extensibility features, several experiments have been carried out with positive results.
[AI-75] A-Cleaner: A Fine-grained Text Alignment Backdoor Defense Strategy for Multimodal Contrastive Learning
Abstract:Pre-trained large models for multimodal contrastive learning, such as CLIP, have been widely recognized in the industry as highly susceptible to data-poisoned backdoor attacks. This poses significant risks to downstream model training. In response to such potential threats, finetuning offers a simpler and more efficient defense choice compared to retraining large models with augmented data. In the supervised learning domain, fine-tuning defense strategies can achieve excellent defense performance. However, in the unsupervised and semi-supervised domain, we find that when CLIP faces some complex attack techniques, the existing fine-tuning defense strategy, CleanCLIP, has some limitations on defense performance. The synonym substitution of its text-augmentation is insufficient to enhance the text feature space. To compensate for this weakness, we improve it by proposing a fine-grained \textbfText \textbfAlignment \textbfCleaner (TA-Cleaner) to cut off feature connections of backdoor triggers. We randomly select a few samples for positive and negative subtext generation at each epoch of CleanCLIP, and align the subtexts to the images to strengthen the text self-supervision. We evaluate the effectiveness of our TA-Cleaner against six attack algorithms and conduct comprehensive zero-shot classification tests on ImageNet1K. Our experimental results demonstrate that TA-Cleaner achieves state-of-the-art defensiveness among finetuning-based defense techniques. Even when faced with the novel attack technique BadCLIP, our TA-Cleaner outperforms CleanCLIP by reducing the ASR of Top-1 and Top-10 by 52.02% and 63.88%, respectively.
[AI-76] Subjective and Objective Quality-of-Experience Evaluation Study for Live Video Streaming
链接: https://arxiv.org/abs/2409.17596 作者: Zehao Zhu,Wei Sun,Jun Jia,Wei Wu,Sibin Deng,Kai Li,Ying Chen,Xiongkuo Min,Jia Wang,Guangtao Zhai 关键词-EN: live video streaming, gained widespread popularity, social media platforms, QoE, live video 类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:In recent years, live video streaming has gained widespread popularity across various social media platforms. Quality of experience (QoE), which reflects end-users’ satisfaction and overall experience, plays a critical role for media service providers to optimize large-scale live compression and transmission strategies to achieve perceptually optimal rate-distortion trade-off. Although many QoE metrics for video-on-demand (VoD) have been proposed, there remain significant challenges in developing QoE metrics for live video streaming. To bridge this gap, we conduct a comprehensive study of subjective and objective QoE evaluations for live video streaming. For the subjective QoE study, we introduce the first live video streaming QoE dataset, TaoLive QoE, which consists of 42 source videos collected from real live broadcasts and 1,155 corresponding distorted ones degraded due to a variety of streaming distortions, including conventional streaming distortions such as compression, stalling, as well as live streaming-specific distortions like frame skipping, variable frame rate, etc. Subsequently, a human study was conducted to derive subjective QoE scores of videos in the TaoLive QoE dataset. For the objective QoE study, we benchmark existing QoE models on the TaoLive QoE dataset as well as publicly available QoE datasets for VoD scenarios, highlighting that current models struggle to accurately assess video QoE, particularly for live content. Hence, we propose an end-to-end QoE evaluation model, Tao-QoE, which integrates multi-scale semantic features and optical flow-based motion features to predicting a retrospective QoE score, eliminating reliance on statistical quality of service (QoS) features.
[AI-77] Deep Manifold Part 1: Anatomy of Neural Network Manifold
Abstract:Based on the numerical manifold method principle, we developed a mathematical framework of a neural network manifold: Deep Manifold and discovered that neural networks: 1) is numerical computation combining forward and inverse; 2) have near infinite degrees of freedom; 3) exponential learning capacity with depth; 4) have self-progressing boundary conditions; 5) has training hidden bottleneck. We also define two concepts: neural network learning space and deep manifold space and introduce two concepts: neural network intrinsic pathway and fixed point. We raise three fundamental questions: 1). What is the training completion definition; 2). where is the deep learning convergence point (neural network fixed point); 3). How important is token timestamp in training data given negative time is critical in inverse problem.
[AI-78] Improving Fast Adversarial Training via Self-Knowledge Guidance
Abstract:Adversarial training has achieved remarkable advancements in defending against adversarial attacks. Among them, fast adversarial training (FAT) is gaining attention for its ability to achieve competitive robustness with fewer computing resources. Existing FAT methods typically employ a uniform strategy that optimizes all training data equally without considering the influence of different examples, which leads to an imbalanced optimization. However, this imbalance remains unexplored in the field of FAT. In this paper, we conduct a comprehensive study of the imbalance issue in FAT and observe an obvious class disparity regarding their performances. This disparity could be embodied from a perspective of alignment between clean and robust accuracy. Based on the analysis, we mainly attribute the observed misalignment and disparity to the imbalanced optimization in FAT, which motivates us to optimize different training data adaptively to enhance robustness. Specifically, we take disparity and misalignment into consideration. First, we introduce self-knowledge guided regularization, which assigns differentiated regularization weights to each class based on its training state, alleviating class disparity. Additionally, we propose self-knowledge guided label relaxation, which adjusts label relaxation according to the training accuracy, alleviating the misalignment and improving robustness. By combining these methods, we formulate the Self-Knowledge Guided FAT (SKG-FAT), leveraging naturally generated knowledge during training to enhance the adversarial robustness without compromising training efficiency. Extensive experiments on four standard datasets demonstrate that the SKG-FAT improves the robustness and preserves competitive clean accuracy, outperforming the state-of-the-art methods.
[AI-79] Multimodal Banking Dataset: Understanding Client Needs through Event Sequences
Abstract:Financial organizations collect a huge amount of data about clients that typically has a temporal (sequential) structure and is collected from various sources (modalities). Due to privacy issues, there are no large-scale open-source multimodal datasets of event sequences, which significantly limits the research in this area. In this paper, we present the industrial-scale publicly available multimodal banking dataset, MBD, that contains more than 1.5M corporate clients with several modalities: 950M bank transactions, 1B geo position events, 5M embeddings of dialogues with technical support and monthly aggregated purchases of four bank’s products. All entries are properly anonymized from real proprietary bank data. Using this dataset, we introduce a novel benchmark with two business tasks: campaigning (purchase prediction in the next month) and matching of clients. We provide numerical results that demonstrate the superiority of our multi-modal baselines over single-modal techniques for each task. As a result, the proposed dataset can open new perspectives and facilitate the future development of practically important large-scale multimodal algorithms for event sequences. HuggingFace Link: this https URL Github Link: this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2409.17587 [cs.LG] (or arXiv:2409.17587v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.17587 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-80] A Scalable Data-Driven Framework for Systematic Analysis of SEC 10-K Filings Using Large Language Models
链接: https://arxiv.org/abs/2409.17581 作者: Syed Affan Daimi,Asma Iqbal 关键词-EN: number of companies, growing exponentially, market analysts, significant challenge, challenge for market 类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 7 figures
点击查看摘要
Abstract:The number of companies listed on the NYSE has been growing exponentially, creating a significant challenge for market analysts, traders, and stockholders who must monitor and assess the performance and strategic shifts of a large number of companies regularly. There is an increasing need for a fast, cost-effective, and comprehensive method to evaluate the performance and detect and compare many companies’ strategy changes efficiently. We propose a novel data-driven approach that leverages large language models (LLMs) to systematically analyze and rate the performance of companies based on their SEC 10-K filings. These filings, which provide detailed annual reports on a company’s financial performance and strategic direction, serve as a rich source of data for evaluating various aspects of corporate health, including confidence, environmental sustainability, innovation, and workforce management. We also introduce an automated system for extracting and preprocessing 10-K filings. This system accurately identifies and segments the required sections as outlined by the SEC, while also isolating key textual content that contains critical information about the company. This curated data is then fed into Cohere’s Command-R+ LLM to generate quantitative ratings across various performance metrics. These ratings are subsequently processed and visualized to provide actionable insights. The proposed scheme is then implemented on an interactive GUI as a no-code solution for running the data pipeline and creating the visualizations. The application showcases the rating results and provides year-on-year comparisons of company performance.
[AI-81] Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study
Abstract:Extracting meaningful insights from large and complex datasets poses significant challenges, particularly in ensuring the accuracy and relevance of retrieved information. Traditional data retrieval methods such as sequential search and index-based retrieval often fail when handling intricate and interconnected data structures, resulting in incomplete or misleading outputs. To overcome these limitations, we introduce Structured-GraphRAG, a versatile framework designed to enhance information retrieval across structured datasets in natural language queries. Structured-GraphRAG utilizes multiple knowledge graphs, which represent data in a structured format and capture complex relationships between entities, enabling a more nuanced and comprehensive retrieval of information. This graph-based approach reduces the risk of errors in language model outputs by grounding responses in a structured format, thereby enhancing the reliability of results. We demonstrate the effectiveness of Structured-GraphRAG by comparing its performance with that of a recently published method using traditional retrieval-augmented generation. Our findings show that Structured-GraphRAG significantly improves query processing efficiency and reduces response times. While our case study focuses on soccer data, the framework’s design is broadly applicable, offering a powerful tool for data analysis and enhancing language model applications across various structured domains.
[AI-82] Dr. GPT in Campus Counseling: Understanding Higher Education Students Opinions on LLM-assisted Mental Health Services
链接: https://arxiv.org/abs/2409.17572 作者: Owen Xingjian Zhang,Shuyao Zhou,Jiayi Geng,Yuhan Liu,Sunny Xun Liu 关键词-EN: Large Language Models, Language Models, health challenges faced, Large Language, General Information Inquiry 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 5 pages
点击查看摘要
Abstract:In response to the increasing mental health challenges faced by college students, we sought to understand their perspectives on how AI applications, particularly Large Language Models (LLMs), can be leveraged to enhance their mental well-being. Through pilot interviews with ten diverse students, we explored their opinions on the use of LLMs across five fictional scenarios: General Information Inquiry, Initial Screening, Reshaping Patient-Expert Dynamics, Long-term Care, and Follow-up Care. Our findings revealed that students’ acceptance of LLMs varied by scenario, with participants highlighting both potential benefits, such as proactive engagement and personalized follow-up care, and concerns, including limitations in training data and emotional support. These insights inform how AI technology should be designed and implemented to effectively support and enhance students’ mental well-being, particularly in scenarios where LLMs can complement traditional methods, while maintaining empathy and respecting individual preferences.
[AI-83] Showing Many Labels in Multi-label Classification Models: An Empirical Study of Adversarial Examples
链接: https://arxiv.org/abs/2409.17568 作者: Yujiang Liu,Wenjian Luo,Zhijian Chen,Muhammad Luqman Naseem 关键词-EN: Deep Neural Networks, Neural Networks, Deep Neural, development of Deep, numerous fields 类目: Artificial Intelligence (cs.AI)
*备注: 14 pages
点击查看摘要
Abstract:With the rapid development of Deep Neural Networks (DNNs), they have been applied in numerous fields. However, research indicates that DNNs are susceptible to adversarial examples, and this is equally true in the multi-label domain. To further investigate multi-label adversarial examples, we introduce a novel type of attacks, termed “Showing Many Labels”. The objective of this attack is to maximize the number of labels included in the classifier’s prediction results. In our experiments, we select nine attack algorithms and evaluate their performance under “Showing Many Labels”. Eight of the attack algorithms were adapted from the multi-class environment to the multi-label environment, while the remaining one was specifically designed for the multi-label environment. We choose ML-LIW and ML-GCN as target models and train them on four popular multi-label datasets: VOC2007, VOC2012, NUS-WIDE, and COCO. We record the success rate of each algorithm when it shows the expected number of labels in eight different scenarios. Experimental results indicate that under the “Showing Many Labels”, iterative attacks perform significantly better than one-step attacks. Moreover, it is possible to show all labels in the dataset.
[AI-84] Pixel-Space Post-Training of Latent Diffusion Models
链接: https://arxiv.org/abs/2409.17565 作者: Christina Zhang,Simran Motwani,Matthew Yu,Ji Hou,Felix Juefei-Xu,Sam Tsai,Peter Vajda,Zijian He,Jialiang Wang 关键词-EN: made significant advancements, recent years, made significant, significant advancements, generation in recent 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is due to the fact that all pre- and post-training of LDMs are done in latent space, which is typically 8 \times 8 lower spatial-resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training by a large margin on a state-of-the-art DiT transformer and U-Net diffusion models in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
Abstract:Existing 3D mask learning methods encounter performance bottlenecks under limited data, and our objective is to overcome this limitation. In this paper, we introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders to achieve multi-mask learning for 3D point clouds. Specifically, we augment the baselines with two additional mask choices (i.e., medium mask and low mask) as our core insight is that the recovery process of an object can manifest in diverse ways. Previous high-masking schemes focus on capturing the global representation but lack the fine-grained recovery capability, so that the generated pre-trained weights tend to play a limited role in the fine-tuning process. With the support of the proposed TPM, available methods can exhibit more flexible and accurate completion capabilities, enabling the potential autoencoder in the pre-training stage to consider multiple representations of a single 3D object. In addition, an SVM-guided weight selection module is proposed to fill the encoder parameters for downstream networks with the optimal weight during the fine-tuning stage, maximizing linear accuracy and facilitating the acquisition of intricate representations for new objects. Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.
[AI-86] Modulated Intervention Preference Optimization (MIPO): Keey the Easy Refine the Difficult AAAI2025
链接: https://arxiv.org/abs/2409.17545 作者: Cheolhun Jang 关键词-EN: well-trained SFT model, Preference optimization methods, optimization methods typically, methods typically begin, reference model 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8pages, submitted to AAAI 2025
点击查看摘要
Abstract:Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from deviating too far from the reference model’s distribution, thereby avoiding the generation of anomalous responses. When the reference model is already well-aligned with the given data or only requires slight adjustments, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and requires significant deviation from its current state, a regularization term may actually hinder the model alignment. In this study, we propose \textbfModulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it. If the data is well-aligned, the intervention is increased to prevent the policy model from diverging significantly from reference model. Conversely, if the alignment is poor, the interference is reduced to facilitate more extensive training. We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B in Alpaca Eval 2.0 and MT-Bench. The experimental results demonstrate that MIPO consistently outperforms DPO across various evaluation scenarios.
[AI-87] On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy
链接: https://arxiv.org/abs/2409.17538 作者: Saber Malekmohammadi,Golnoosh Farnadi 关键词-EN: processing involves large-scale, involves large-scale pre-training, natural language processing, language processing involves, general domain data 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
点击查看摘要
Abstract:A significant approach in natural language processing involves large-scale pre-training on general domain data followed by adaptation to specific tasks or domains. As models grow in size, full fine-tuning all parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g. LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices into some layers of the transformer architecture, called adapters. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to full fine-tuning all parameters. In this work, we look at low-rank adaptation from the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t the adapter parameters coming from their full fine-tuning, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the noise distribution and a Gaussian distribution with the same variance, we show that the dynamics of LoRA and FLoRA are very close to differentially private full fine-tuning the adapters, which suggests that low-rank adaptation implicitly provides privacy w.r.t the fine-tuning data. Finally, using Johnson-Lindenstrauss lemma, we show that when augmented with gradient clipping, low-rank adaptation is almost equivalent to differentially private full fine-tuning adapters with a fixed noise scale.
[AI-88] Just say what you want: only-prompting self-rewarding online preference optimization
Abstract:We address the challenge of online Reinforcement Learning from Human Feedback (RLHF) with a focus on self-rewarding alignment methods. In online RLHF, obtaining feedback requires interaction with the environment, which can be costly when using additional reward models or the GPT-4 API. Current self-rewarding approaches rely heavily on the discriminator’s judgment capabilities, which are effective for large-scale models but challenging to transfer to smaller ones. To address these limitations, we propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities. Additionally, we employ fine-grained arithmetic control over the optimality gap between positive and negative examples, generating more hard negatives in the later stages of training to help the model better capture subtle human preferences. Finally, we conduct extensive experiments on two base models, Mistral-7B and Mistral-Instruct-7B, which significantly bootstrap the performance of the reference model, achieving 34.5% in the Length-controlled Win Rates of AlpacaEval 2.0.
[AI-89] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion NEURIPS2024
链接: https://arxiv.org/abs/2409.17531 作者: Ming Dai,Lingfeng Yang,Yihao Xu,Zhenhua Feng,Wankou Yang 关键词-EN: involves grounding descriptive, grounding descriptive sentences, common vision task, common vision, descriptive sentences 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 21pages, 11figures, NeurIPS2024
点击查看摘要
Abstract:Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because the former paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding. Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks. Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch. This branch only consists of a lightweight MLP, which simplifies the structure and improves reasoning speed. Experiments on six widely used VG datasets, i.e., RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG. Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks. Codes and models will be available at \urlthis https URL.
[AI-90] Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Integrating SGBM and Segmentation Models
链接: https://arxiv.org/abs/2409.17526 作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green 关键词-EN: radiata pine trees, pine trees presents, trees presents significant, safety risks due, Manual pruning 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Manual pruning of radiata pine trees presents significant safety risks due to their substantial height and the challenging terrains in which they thrive. To address these risks, this research proposes the development of a drone-based pruning system equipped with specialized pruning tools and a stereo vision camera, enabling precise detection and trimming of branches. Deep learning algorithms, including YOLO and Mask R-CNN, are employed to ensure accurate branch detection, while the Semi-Global Matching algorithm is integrated to provide reliable distance estimation. The synergy between these techniques facilitates the precise identification of branch locations and enables efficient, targeted pruning. Experimental results demonstrate that the combined implementation of YOLO and SGBM enables the drone to accurately detect branches and measure their distances from the drone. This research not only improves the safety and efficiency of pruning operations but also makes a significant contribution to the advancement of drone technology in the automation of agricultural and forestry practices, laying a foundational framework for further innovations in environmental management.
链接: https://arxiv.org/abs/2409.17523 作者: Jing Bi,Yunlong Tang,Luchuan Song,Ali Vosoughi,Nguyen Nguyen,Chenliang Xu 关键词-EN: video analysis brings, understanding human activities, first-person perspective, egocentric video analysis, egocentric video 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ACMMM 24
点击查看摘要
Abstract:The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation in tasks like action recognition, procedure learning, and moment retrieval, \etc, coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the \textitfirst large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLM for egocentric video understanding. Our extensive experiments demonstrate EAGLE’s superior performance over existing models, highlighting its ability to balance task-specific understanding with holistic video interpretation. With EAGLE, we aim to pave the way for research opportunities and practical applications in real-world scenarios.
[AI-92] Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization
链接: https://arxiv.org/abs/2409.17519 作者: Kento Kawaharazuka,Yoshiki Obinata,Naoaki Kanazawa,Kei Okada,Masayuki Inaba 关键词-EN: diverse environments, environmental state recognition, autonomously navigate, navigate and operate, operate in diverse 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Advanced Robotics, website - this https URL
点击查看摘要
Abstract:In order for robots to autonomously navigate and operate in diverse environments, it is essential for them to recognize the state of their environment. On the other hand, the environmental state recognition has traditionally involved distinct methods tailored to each state to be recognized. In this study, we perform a unified environmental state recognition for robots through the spoken language with pre-trained large-scale vision-language models. We apply Visual Question Answering and Image-to-Text Retrieval, which are tasks of Vision-Language Models. We show that with our method, it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed and whether water is running in a sink, without training neural networks or manual programming. In addition, the recognition accuracy can be improved by selecting appropriate texts from the set of prepared texts based on black-box optimization. For each state recognition, only the text set and its weighting need to be changed, eliminating the need to prepare multiple different models and programs, and facilitating the management of source code and computer resource. We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch.
[AI-93] Multi-Designated Detector Watermarking for Language Models
链接: https://arxiv.org/abs/2409.17518 作者: Zhengan Huang,Gongxian Zeng,Xin Mu,Yu Wang,Yue Yu 关键词-EN: large language models, multi-designated detector watermarking, initiate the study, large language, MDDW 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we initiate the study of \emphmulti-designated detector watermarking (MDDW) for large language models (LLMs). This technique allows model providers to generate watermarked outputs from LLMs with two key properties: (i) only specific, possibly multiple, designated detectors can identify the watermarks, and (ii) there is no perceptible degradation in the output quality for ordinary users. We formalize the security definitions for MDDW and present a framework for constructing MDDW for any LLM using multi-designated verifier signatures (MDVS). Recognizing the significant economic value of LLM outputs, we introduce claimability as an optional security feature for MDDW, enabling model providers to assert ownership of LLM outputs within designated-detector settings. To support claimable MDDW, we propose a generic transformation converting any MDVS to a claimable MDVS. Our implementation of the MDDW scheme highlights its advanced functionalities and flexibility over existing methods, with satisfactory performance metrics.
[AI-94] Dataset Distillation-based Hybrid Federated Learning on Non-IID Data
链接: https://arxiv.org/abs/2409.17517 作者: Xiufang Shi,Wei Zhang,Mincheng Wu,Guangyi Liu,Zhenyu Wen,Shibo He,Tejal Shah,Rajiv Ranjan 关键词-EN: data, model training, data labels, federated learning, training 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In federated learning, the heterogeneity of client data has a great impact on the performance of model training. Many heterogeneity issues in this process are raised by non-independently and identically distributed (Non-IID) data. This study focuses on the issue of label distribution skew. To address it, we propose a hybrid federated learning framework called HFLDD, which integrates dataset distillation to generate approximately independent and equally distributed (IID) data, thereby improving the performance of model training. Particularly, we partition the clients into heterogeneous clusters, where the data labels among different clients within a cluster are unbalanced while the data labels among different clusters are balanced. The cluster headers collect distilled data from the corresponding cluster members, and conduct model training in collaboration with the server. This training process is like traditional federated learning on IID data, and hence effectively alleviates the impact of Non-IID data on model training. Furthermore, we compare our proposed method with typical baseline methods on public datasets. Experimental results demonstrate that when the data labels are severely imbalanced, the proposed HFLDD outperforms the baseline methods in terms of both test accuracy and communication cost.
[AI-95] Functional Classification of Spiking Signal Data Using Artificial Intelligence Techniques: A Review
Abstract:Human brain neuron activities are incredibly significant nowadays. Neuronal behavior is assessed by analyzing signal data such as electroencephalography (EEG), which can offer scientists valuable information about diseases and human-computer interaction. One of the difficulties researchers confront while evaluating these signals is the existence of large volumes of spike data. Spikes are some considerable parts of signal data that can happen as a consequence of vital biomarkers or physical issues such as electrode movements. Hence, distinguishing types of spikes is important. From this spot, the spike classification concept commences. Previously, researchers classified spikes manually. The manual classification was not precise enough as it involves extensive analysis. Consequently, Artificial Intelligence (AI) was introduced into neuroscience to assist clinicians in classifying spikes correctly. This review discusses the importance and use of AI in spike classification, focusing on the recognition of neural activity noises. The task is divided into three main components: preprocessing, classification, and evaluation. Existing methods are introduced and their importance is determined. The review also highlights the need for more efficient algorithms. The primary goal is to provide a perspective on spike classification for future research and provide a comprehensive understanding of the methodologies and issues involved. The review organizes materials in the spike classification field for future studies. In this work, numerous studies were extracted from different databases. The PRISMA-related research guidelines were then used to choose papers. Then, research studies based on spike classification using machine learning and deep learning approaches with effective preprocessing were selected.
[AI-96] From News to Forecast: Integrating Event Analysis in LLM-Based Time Series Forecasting with Reflection NEURIPS2024
链接: https://arxiv.org/abs/2409.17515 作者: Xinlei Wang,Maike Feng,Jing Qiu,Jinjin Gu,Junhua Zhao 关键词-EN: Large Language Models, Large Language, time series forecasting, enhance time series, Generative Agents 类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted for NeurIPS 2024
点击查看摘要
Abstract:This paper introduces a novel approach to enhance time series forecasting using Large Language Models (LLMs) and Generative Agents. With language as a medium, our method adaptively integrates various social events into forecasting models, aligning news content with time series fluctuations for enriched insights. Specifically, we utilize LLM-based agents to iteratively filter out irrelevant news and employ human-like reasoning and reflection to evaluate predictions. This enables our model to analyze complex events, such as unexpected incidents and shifts in social behavior, and continuously refine the selection logic of news and the robustness of the agent’s output. By compiling selected news with time series data, we fine-tune the LLaMa2 pre-trained model. The results demonstrate significant improvements in forecasting accuracy and suggest a potential paradigm shift in time series forecasting by effectively harnessing unstructured news data.
[AI-97] Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
链接: https://arxiv.org/abs/2409.17508 作者: Xun Zhu,Ying Hu,Fanbin Mo,Miao Li,Ji Wu 关键词-EN: shown impressive capabilities, Multi-modal large language, large language models, large language, shown impressive 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE that leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves efficient solution to the tug-of-war problem and can perform six different medical tasks including question answering, visual question answering, report generation, referring expression comprehension, referring expression generation and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with up to an average 8% performance gains. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code, data and model will be soon available at GitHub.
[AI-98] GLinSAT: The General Linear Satisfiability Neural Network Layer By Accelerated Gradient Descent
链接: https://arxiv.org/abs/2409.17500 作者: Hongtai Zeng,Chao Yang,Yanzhen Zhou,Cheng Yang,Qinglai Guo 关键词-EN: applying neural networks, networks satisfy specific, neural networks satisfy, satisfy specific constraints, neural network outputs 类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Ensuring that the outputs of neural networks satisfy specific constraints is crucial for applying neural networks to real-life decision-making problems. In this paper, we consider making a batch of neural network outputs satisfy bounded and general linear constraints. We first reformulate the neural network output projection problem as an entropy-regularized linear programming problem. We show that such a problem can be equivalently transformed into an unconstrained convex optimization problem with Lipschitz continuous gradient according to the duality theorem. Then, based on an accelerated gradient descent algorithm with numerical performance enhancement, we present our architecture, GLinSAT, to solve the problem. To the best of our knowledge, this is the first general linear satisfiability layer in which all the operations are differentiable and matrix-factorization-free. Despite the fact that we can explicitly perform backpropagation based on automatic differentiation mechanism, we also provide an alternative approach in GLinSAT to calculate the derivatives based on implicit differentiation of the optimality condition. Experimental results on constrained traveling salesman problems, partial graph matching with outliers, predictive portfolio allocation and power system unit commitment demonstrate the advantages of GLinSAT over existing satisfiability layers.
[AI-99] Human Mobility Modeling with Limited Information via Large Language Models
链接: https://arxiv.org/abs/2409.17495 作者: Yifan Liu,Xishun Liao,Haoxuan Ma,Brian Yueshuai He,Chris Stanford,Jiaqi Ma 关键词-EN: human mobility modeling, human mobility, Understanding human mobility, complex challenge, challenge in transportation 类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Understanding human mobility patterns has traditionally been a complex challenge in transportation modeling. Due to the difficulties in obtaining high-quality training datasets across diverse locations, conventional activity-based models and learning-based human mobility modeling algorithms are particularly limited by the availability and quality of datasets. Furthermore, current research mainly focuses on the spatial-temporal travel pattern but lacks an understanding of the semantic information between activities, which is crucial for modeling the interdependence between activities. In this paper, we propose an innovative Large Language Model (LLM) empowered human mobility modeling framework. Our proposed approach significantly reduces the reliance on detailed human mobility statistical data, utilizing basic socio-demographic information of individuals to generate their daily mobility patterns. We have validated our results using the NHTS and SCAG-ABM datasets, demonstrating the effective modeling of mobility patterns and the strong adaptability of our framework across various geographic locations.
[AI-100] Global-Local Medical SAM Adaptor Based on Full Adaption
链接: https://arxiv.org/abs/2409.17486 作者: Meng Wang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yarong Feng(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yongwei Tang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Tian Zhang(Software college Northeastern University Shenyang, Liaoning Province, P. R. China),Yuxin Liang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Chao Lv(Department of General Surgery, Shengjing Hospital China Medical University Shenyang, Liaoning Province, P. R. China) 关键词-EN: Medical SAM adaptor, visual language models, made great breakthroughs, Emerging of visual, SAM adaptor 类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Emerging of visual language models, such as the segment anything model (SAM), have made great breakthroughs in the field of universal semantic segmentation and significantly aid the improvements of medical image segmentation, in particular with the help of Medical SAM adaptor (Med-SA). However, Med-SA still can be improved, as it fine-tunes SAM in a partial adaption manner. To resolve this problem, we present a novel global medical SAM adaptor (GMed-SA) with full adaption, which can adapt SAM globally. We further combine GMed-SA and Med-SA to propose a global-local medical SAM adaptor (GLMed-SA) to adapt SAM both globally and locally. Extensive experiments have been performed on the challenging public 2D melanoma segmentation dataset. The results show that GLMed-SA outperforms several state-of-the-art semantic segmentation methods on various evaluation metrics, demonstrating the superiority of our methods.
[AI-101] MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models NEURIPS2024
链接: https://arxiv.org/abs/2409.17481 作者: Gongfan Fang,Hongxu Yin,Saurav Muralidharan,Greg Heinrich,Jeff Pool,Jan Kautz,Pavlo Molchanov,Xinchao Wang 关键词-EN: Large Language Models, massive parameter counts, Large Language, Language Models, significant redundancy 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: NeurIPS 2024 Spotlight
点击查看摘要
Abstract:Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M’') Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model’s 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM’s learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains. Code is available at \urlthis https URL.
[AI-102] What Would Happen Next? Predicting Consequences from An Event Causality Graph
Abstract:Existing script event prediction task forcasts the subsequent event based on an event script chain. However, the evolution of historical events are more complicated in real world scenarios and the limited information provided by the event script chain also make it difficult to accurately predict subsequent events. This paper introduces a Causality Graph Event Prediction(CGEP) task that forecasting consequential event based on an Event Causality Graph (ECG). We propose a Semantic Enhanced Distance-sensitive Graph Prompt Learning (SeDGPL) Model for the CGEP task. In SeDGPL, (1) we design a Distance-sensitive Graph Linearization (DsGL) module to reformulate the ECG into a graph prompt template as the input of a PLM; (2) propose an Event-Enriched Causality Encoding (EeCE) module to integrate both event contextual semantic and graph schema information; (3) propose a Semantic Contrast Event Prediction (ScEP) module to enhance the event representation among numerous candidate events and predict consequential event following prompt learning paradigm. %We construct two CGEP datasets based on existing MAVEN-ERE and ESC corpus for experiments. Experiment results validate our argument our proposed SeDGPL model outperforms the advanced competitors for the CGEP task.
[AI-103] Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards EMNLP2024
链接: https://arxiv.org/abs/2409.17472 作者: Heejin Do,Sangwon Ryu,Gary Geunbae Lee 关键词-EN: provide enriched feedback, evaluating multiple traits, Recent advances, automated essay scoring, enriched feedback 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024
点击查看摘要
Abstract:Recent advances in automated essay scoring (AES) have shifted towards evaluating multiple traits to provide enriched feedback. Like typical AES systems, multi-trait AES employs the quadratic weighted kappa (QWK) to measure agreement with human raters, aligning closely with the rating schema; however, its non-differentiable nature prevents its direct use in neural network training. In this paper, we propose Scoring-aware Multi-reward Reinforcement Learning (SaMRL), which integrates actual evaluation schemes into the training process by designing QWK-based rewards with a mean-squared error penalty for multi-trait AES. Existing reinforcement learning (RL) applications in AES are limited to classification models despite associated performance degradation, as RL requires probability distributions; instead, we adopt an autoregressive score generation framework to leverage token generation probabilities for robust multi-trait score predictions. Empirical analyses demonstrate that SaMRL facilitates model training, notably enhancing scoring of previously inferior prompts.
[AI-104] CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches
链接: https://arxiv.org/abs/2409.17457 作者: Sifan Wu,Amir Khasahmadi,Mor Katz,Pradeep Kumar Jayaraman,Yewen Pu,Karl Willis,Bang Liu 关键词-EN: contemporary mechanical design, CAD, mechanical design, central to contemporary, Design 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. However, it encounters challenges in achieving precise parametric sketch modeling and lacks practical evaluation metrics suitable for mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design.
[AI-105] A Time Series is Worth Five Experts: Heterogeneous Mixture of Experts for Traffic Flow Prediction
Abstract:Accurate traffic prediction faces significant challenges, necessitating a deep understanding of both temporal and spatial cues and their complex interactions across multiple variables. Recent advancements in traffic prediction systems are primarily due to the development of complex sequence-centric models. However, existing approaches often embed multiple variables and spatial relationships at each time step, which may hinder effective variable-centric learning, ultimately leading to performance degradation in traditional traffic prediction tasks. To overcome these limitations, we introduce variable-centric and prior knowledge-centric modeling techniques. Specifically, we propose a Heterogeneous Mixture of Experts (TITAN) model for traffic flow prediction. TITAN initially consists of three experts focused on sequence-centric modeling. Then, designed a low-rank adaptive method, TITAN simultaneously enables variable-centric modeling. Furthermore, we supervise the gating process using a prior knowledge-centric modeling strategy to ensure accurate routing. Experiments on two public traffic network datasets, METR-LA and PEMS-BAY, demonstrate that TITAN effectively captures variable-centric dependencies while ensuring accurate routing. Consequently, it achieves improvements in all evaluation metrics, ranging from approximately 4.37% to 11.53%, compared to previous state-of-the-art (SOTA) models. The code is open at \hrefthis https URLthis https URL.
[AI-106] HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows
Abstract:Despite recent advancements in large language models (LLMs), their performance on complex reasoning problems requiring multi-step thinking and combining various skills is still limited. To address this, we propose a novel framework HDFlow for complex reasoning with LLMs that combines fast and slow thinking modes in an adaptive manner. Our approach consists of two key components: 1) a new approach for slow, deliberate reasoning called Dynamic Workflow, which automatically decomposes complex problems into more manageable sub-tasks and dynamically designs a workflow to assemble specialized LLM or symbolic reasoning tools to solve sub-tasks; 2) Hybrid Thinking, a general framework that dynamically combines fast and slow thinking based on problem complexity. Finally, we propose an easy-to-scale method for automatically synthesizing a large-scale dataset of 27K challenging reasoning problems for complex reasoning and a hybrid thinking tuning method that trains smaller LLMs on this dataset to internalize the fast/slow hybrid reasoning strategies. Experiments on four reasoning benchmark datasets demonstrate that our slow thinking with dynamic workflows significantly outperforms Chain-of-Thought, and hybrid thinking achieves the highest accuracy while providing an effective balance between computational efficiency and performance. Fine-tuning using our hybrid thinking approach also significantly boosts the complex reasoning capabilities of open-source language models. The results showcase the promise of slow thinking, dynamic workflows, and hybrid thinking in expanding the frontier of complex problem-solving with LLMs\footnoteCode and data will be released at \urlthis https URL…
[AI-107] Exploring the Use of ChatGPT for a Systematic Literature Review: a Design-Based Research
Abstract:ChatGPT has been used in several educational contexts,including learning, teaching and research. It also has potential to conduct the systematic literature review (SLR). However, there are limited empirical studies on how to use ChatGPT in conducting a SLR. Based on a SLR published,this study used ChatGPT to conduct a SLR of the same 33 papers in a design-based approach, to see what the differences are by comparing the reviews’ results,and to answer: To what extent can ChatGPT conduct SLR? What strategies can human researchers utilize to structure prompts for ChatGPT that enhance the reliability and validity of a SLR? This study found that ChatGPT could conduct a SLR. It needs detailed and accurate prompts to analyze the literature. It also has limitations. Guiding principles are summarized from this study for researchers to follow when they need to conduct SLRs using ChatGPT.
[AI-108] Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
链接: https://arxiv.org/abs/2409.17422 作者: Zhenmei Shi,Yifei Ming,Xuan-Phi Nguyen,Yingyu Liang,Shafiq Joty 关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, increased computational resources 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4 \times speedup and 30% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention, SnapKV and demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at \urlthis https URL.
[AI-109] From Deception to Detection: The Dual Roles of Large Language Models in Fake News
链接: https://arxiv.org/abs/2409.17416 作者: Dorsaf Sallami,Yuan-Chen Chang,Esma Aïmeur 关键词-EN: Fake, public trust, poses a significant, significant threat, ecosystems and public 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Fake news poses a significant threat to the integrity of information ecosystems and public trust. The advent of Large Language Models (LLMs) holds considerable promise for transforming the battle against fake news. Generally, LLMs represent a double-edged sword in this struggle. One major concern is that LLMs can be readily used to craft and disseminate misleading information on a large scale. This raises the pressing questions: Can LLMs easily generate biased fake news? Do all LLMs have this capability? Conversely, LLMs offer valuable prospects for countering fake news, thanks to their extensive knowledge of the world and robust reasoning capabilities. This leads to other critical inquiries: Can we use LLMs to detect fake news, and do they outperform typical detection models? In this paper, we aim to address these pivotal questions by exploring the performance of various LLMs. Our objective is to explore the capability of various LLMs in effectively combating fake news, marking this as the first investigation to analyze seven such models. Our results reveal that while some models adhere strictly to safety protocols, refusing to generate biased or misleading content, other models can readily produce fake news across a spectrum of biases. Additionally, our results show that larger models generally exhibit superior detection abilities and that LLM-generated fake news are less likely to be detected than human-written ones. Finally, our findings demonstrate that users can benefit from LLM-generated explanations in identifying fake news.
[AI-110] Exploring Semantic Clustering in Deep Reinforcement Learning for Video Games
Abstract:In this paper, we investigate the semantic clustering properties of deep reinforcement learning (DRL) for video games, enriching our understanding of the internal dynamics of DRL and advancing its interpretability. In this context, semantic clustering refers to the inherent capacity of neural networks to internally group video inputs based on semantic similarity. To achieve this, we propose a novel DRL architecture that integrates a semantic clustering module featuring both feature dimensionality reduction and online clustering. This module seamlessly integrates into the DRL training pipeline, addressing instability issues observed in previous t-SNE-based analysis methods and eliminating the necessity for extensive manual annotation of semantic analysis. Through experiments, we validate the effectiveness of the proposed module and the semantic clustering properties in DRL for video games. Additionally, based on these properties, we introduce new analytical methods to help understand the hierarchical structure of policies and the semantic distribution within the feature space.
[AI-111] Sociotechnical Approach to Enterprise Generative Artificial Intelligence (E-GenAI)
链接: https://arxiv.org/abs/2409.17408 作者: Leoncio Jimenez,Francisco Venegas 关键词-EN: Imperfect Knowledge Management, Inventive Problem Solving, Knowledge Management, proposed to characterize, sociotechnical approach 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:
点击查看摘要
Abstract:In this theoretical article, a sociotechnical approach is proposed to characterize. First, the business ecosystem, focusing on the relationships among Providers, Enterprise, and Customers through SCM, ERP, and CRM platforms to align: (1) Business Intelligence (BI), Fuzzy Logic (FL), and TRIZ (Theory of Inventive Problem Solving), through the OID model, and (2) Knowledge Management (KM) and Imperfect Knowledge Management (IKM), through the OIDK model. Second, the article explores the E-GenAI business ecosystem, which integrates GenAI-based platforms for SCM, ERP, and CRM with GenAI-based platforms for BI, FL, TRIZ, KM, and IKM, to align Large Language Models (LLMs) through the E-GenAI (OID) model. Finally, to understand the dynamics of LLMs, we utilize finite automata to model the relationships between Followers and Followees. This facilitates the construction of LLMs that can identify specific characteristics of users on a social media platform.
[AI-112] Post-hoc Reward Calibration: A Case Study on Length Bias
链接: https://arxiv.org/abs/2409.17407 作者: Zeyu Huang,Zihan Qiu,Zili Wang,Edoardo M. Ponti,Ivan Titov 关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Human Feedback aligns, translates human feedback 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Preprint
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback aligns the outputs of Large Language Models with human values and preferences. Central to this process is the reward model (RM), which translates human feedback into training signals for optimising LLM behaviour. However, RMs can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality. These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours in LLMs alignment. This paper addresses the challenge of correcting such biases without additional data and training, introducing the concept of Post-hoc Reward Calibration. We first propose an intuitive approach to estimate the bias term and, thus, remove it to approximate the underlying true reward. We then extend the approach to a more general and robust form with the Locally Weighted Regression. Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the RewardBench dataset; (2) enhanced alignment of RM rankings with GPT-4 evaluations and human preferences based on the AlpacaEval benchmark; and (3) improved Length-Controlled win rate of the RLHF process in multiple LLM–RM combinations. Our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment. Our code and results are available at this https URL.
[AI-113] AI Enabled Neutron Flux Measurement and Virtual Calibration in Boiling Water Reactors
Abstract:Accurately capturing the three dimensional power distribution within a reactor core is vital for ensuring the safe and economical operation of the reactor, compliance with Technical Specifications, and fuel cycle planning (safety, control, and performance evaluation). Offline (that is, during cycle planning and core design), a three dimensional neutronics simulator is used to estimate the reactor’s power, moderator, void, and flow distributions, from which margin to thermal limits and fuel exposures can be approximated. Online, this is accomplished with a system of local power range monitors (LPRMs) designed to capture enough neutron flux information to infer the full nodal power distribution. Certain problems with this process, ranging from measurement and calibration to the power adaption process, pose challenges to operators and limit the ability to design reload cores economically (e.g., engineering in insufficient margin or more margin than required). Artificial intelligence (AI) and machine learning (ML) are being used to solve the problems to reduce maintenance costs, improve the accuracy of online local power measurements, and decrease the bias between offline and online power distributions, thereby leading to a greater ability to design safe and economical reload cores. We present ML models trained from two deep neural network (DNN) architectures, SurrogateNet and LPRMNet, that demonstrate a testing error of 1 percent and 3 percent, respectively. Applications of these models can include virtual sensing capability for bypassed or malfunctioning LPRMs, on demand virtual calibration of detectors between successive calibrations, highly accurate nuclear end of life determinations for LPRMs, and reduced bias between measured and predicted power distributions within the core.
[AI-114] ransient Adversarial 3D Projection Attacks on Object Detection in Autonomous Driving
链接: https://arxiv.org/abs/2409.17403 作者: Ce Zhou,Qiben Yan,Sijia Liu 关键词-EN: Object detection, crucial task, targeting object detection, patches or stickers, Object 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 7 figures, SmartSP 2024
点击查看摘要
Abstract:Object detection is a crucial task in autonomous driving. While existing research has proposed various attacks on object detection, such as those using adversarial patches or stickers, the exploration of projection attacks on 3D surfaces remains largely unexplored. Compared to adversarial patches or stickers, which have fixed adversarial patterns, projection attacks allow for transient modifications to these patterns, enabling a more flexible attack. In this paper, we introduce an adversarial 3D projection attack specifically targeting object detection in autonomous driving scenarios. We frame the attack formulation as an optimization problem, utilizing a combination of color mapping and geometric transformation models. Our results demonstrate the effectiveness of the proposed attack in deceiving YOLOv3 and Mask R-CNN in physical settings. Evaluations conducted in an indoor environment show an attack success rate of up to 100% under low ambient light conditions, highlighting the potential damage of our attack in real-world driving scenarios.
[AI-115] Enhancing Recommendation with Denoising Auxiliary Task
Abstract:The historical interaction sequences of users plays a crucial role in training recommender systems that can accurately predict user preferences. However, due to the arbitrariness of user behavior, the presence of noise in these sequences poses a challenge to predicting their next actions in recommender systems. To address this issue, our motivation is based on the observation that training noisy sequences and clean sequences (sequences without noise) with equal weights can impact the performance of the model. We propose a novel self-supervised Auxiliary Task Joint Training (ATJT) method aimed at more accurately reweighting noisy sequences in recommender systems. Specifically, we strategically select subsets from users’ original sequences and perform random replacements to generate artificially replaced noisy sequences. Subsequently, we perform joint training on these artificially replaced noisy sequences and the original sequences. Through effective reweighting, we incorporate the training results of the noise recognition model into the recommender model. We evaluate our method on three datasets using a consistent base model. Experimental results demonstrate the effectiveness of introducing self-supervised auxiliary task to enhance the base model’s performance.
[AI-116] AgRegNet: A Deep Regression Network for Flower and Fruit Density Estimation Localization and Counting in Orchards
链接: https://arxiv.org/abs/2409.17400 作者: Uddhav Bhattarai,Santosh Bhusal,Qin Zhang,Manoj Karkee 关键词-EN: agricultural industry today, manual labor availability, fruit density estimation, major challenges, agricultural industry 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:One of the major challenges for the agricultural industry today is the uncertainty in manual labor availability and the associated cost. Automated flower and fruit density estimation, localization, and counting could help streamline harvesting, yield estimation, and crop-load management strategies such as flower and fruitlet thinning. This article proposes a deep regression-based network, AgRegNet, to estimate density, count, and location of flower and fruit in tree fruit canopies without explicit object detection or polygon annotation. Inspired by popular U-Net architecture, AgRegNet is a U-shaped network with an encoder-to-decoder skip connection and modified ConvNeXt-T as an encoder feature extractor. AgRegNet can be trained based on information from point annotation and leverages segmentation information and attention modules (spatial and channel) to highlight relevant flower and fruit features while suppressing non-relevant background features. Experimental evaluation in apple flower and fruit canopy images under an unstructured orchard environment showed that AgRegNet achieved promising accuracy as measured by Structural Similarity Index (SSIM), percentage Mean Absolute Error (pMAE) and mean Average Precision (mAP) to estimate flower and fruit density, count, and centroid location, respectively. Specifically, the SSIM, pMAE, and mAP values for flower images were 0.938, 13.7%, and 0.81, respectively. For fruit images, the corresponding values were 0.910, 5.6%, and 0.93. Since the proposed approach relies on information from point annotation, it is suitable for sparsely and densely located objects. This simplified technique will be highly applicable for growers to accurately estimate yields and decide on optimal chemical and mechanical flower thinning practices.
链接: https://arxiv.org/abs/2409.17386 作者: Zhixiang Shen,Shuo Wang,Zhao Kang 关键词-EN: Unsupervised Multiplex Graph, learn node representations, Multiplex Graph, manual labeling, Multiplex Graph Learning 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: Appear in NeurIPS 2024
点击查看摘要
Abstract:Unsupervised Multiplex Graph Learning (UMGL) aims to learn node representations on various edge types without manual labeling. However, existing research overlooks a key factor: the reliability of the graph structure. Real-world data often exhibit a complex nature and contain abundant task-irrelevant noise, severely compromising UMGL’s performance. Moreover, existing methods primarily rely on contrastive learning to maximize mutual information across different graphs, limiting them to multiplex graph redundant scenarios and failing to capture view-unique task-relevant information. In this paper, we focus on a more realistic and challenging task: to unsupervisedly learn a fused graph from multiple graphs that preserve sufficient task-relevant information while removing task-irrelevant noise. Specifically, our proposed Information-aware Unsupervised Multiplex Graph Fusion framework (InfoMGF) uses graph structure refinement to eliminate irrelevant noise and simultaneously maximizes view-shared and view-unique task-relevant information, thereby tackling the frontier of non-redundant multiplex graph. Theoretical analyses further guarantee the effectiveness of InfoMGF. Comprehensive experiments against various baselines on different downstream tasks demonstrate its superior performance and robustness. Surprisingly, our unsupervised method even beats the sophisticated supervised approaches. The source code and datasets are available at this https URL.
[AI-118] Data-efficient Trajectory Prediction via Coreset Selection
链接: https://arxiv.org/abs/2409.17385 作者: Ruining Yang,Lili Su 关键词-EN: multiple information-collection devices, Modern vehicles, sensors and cameras, continuously generating, equipped with multiple 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Modern vehicles are equipped with multiple information-collection devices such as sensors and cameras, continuously generating a large volume of raw data. Accurately predicting the trajectories of neighboring vehicles is a vital component in understanding the complex driving environment. Yet, training trajectory prediction models is challenging in two ways. Processing the large-scale data is computation-intensive. Moreover, easy-medium driving scenarios often overwhelmingly dominate the dataset, leaving challenging driving scenarios such as dense traffic under-represented. For example, in the Argoverse motion prediction dataset, there are very few instances with \ge 50 agents, while scenarios with 10 \thicksim 20 agents are far more common. In this paper, to mitigate data redundancy in the over-represented driving scenarios and to reduce the bias rooted in the data scarcity of complex ones, we propose a novel data-efficient training method based on coreset selection. This method strategically selects a small but representative subset of data while balancing the proportions of different scenario difficulties. To the best of our knowledge, we are the first to introduce a method capable of effectively condensing large-scale trajectory dataset, while achieving a state-of-the-art compression ratio. Notably, even when using only 50% of the Argoverse dataset, the model can be trained with little to no decline in performance. Moreover, the selected coreset maintains excellent generalization ability.
[AI-119] VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and Optimized Search
Abstract:Traditional retrieval methods have been essential for assessing document similarity but struggle with capturing semantic nuances. Despite advancements in latent semantic analysis (LSA) and deep learning, achieving comprehensive semantic understanding and accurate retrieval remains challenging due to high dimensionality and semantic gaps. The above challenges call for new techniques to effectively reduce the dimensions and close the semantic gaps. To this end, we propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval. By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy. Experiments on real-world datasets show that VectorSearch outperforms baseline metrics, demonstrating its efficacy for large-scale retrieval tasks.
[AI-120] slas Autopilot: Ethics and Tragedy
链接: https://arxiv.org/abs/2409.17380 作者: Aravinda Jatavallabha 关键词-EN: emphasizing Tesla Motors’, involving Tesla Autopilot, Tesla Motors’ moral, Motors’ moral responsibility, Tesla Autopilot 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This case study delves into the ethical ramifications of an incident involving Tesla’s Autopilot, emphasizing Tesla Motors’ moral responsibility. Using a seven-step ethical decision-making process, it examines user behavior, system constraints, and regulatory implications. This incident prompts a broader evaluation of ethical challenges in the automotive industry’s adoption of autonomous technologies, urging a reconsideration of industry norms and legal frameworks. The analysis offers a succinct exploration of ethical considerations in evolving technological landscapes.
[AI-121] Search for Efficient Large Language Models NEURIPS2024
链接: https://arxiv.org/abs/2409.17372 作者: Xuan Shen,Pu Zhao,Yifan Gong,Zhenglun Kong,Zheng Zhan,Yushu Wu,Ming Lin,Chao Wu,Xue Lin,Yanzhi Wang 关键词-EN: Large Language Models, Large Language, artificial intelligence research, long held sway, Language Models 类目: Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration.
[AI-122] he Overfocusing Bias of Convolutional Neural Networks: A Saliency-Guided Regularization Approach
链接: https://arxiv.org/abs/2409.17370 作者: David Bertoin,Eduardo Hugo Sanchez,Mehdi Zouitine,Emmanuel Rachelson 关键词-EN: computer vision, low-data regimes, transformers being considered, standard in computer, convolutional neural networks 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Despite transformers being considered as the new standard in computer vision, convolutional neural networks (CNNs) still outperform them in low-data regimes. Nonetheless, CNNs often make decisions based on narrow, specific regions of input images, especially when training data is limited. This behavior can severely compromise the model’s generalization capabilities, making it disproportionately dependent on certain features that might not represent the broader context of images. While the conditions leading to this phenomenon remain elusive, the primary intent of this article is to shed light on this observed behavior of neural networks. Our research endeavors to prioritize comprehensive insight and to outline an initial response to this phenomenon. In line with this, we introduce Saliency Guided Dropout (SGDrop), a pioneering regularization approach tailored to address this specific issue. SGDrop utilizes attribution methods on the feature map to identify and then reduce the influence of the most salient features during training. This process encourages the network to diversify its attention and not focus solely on specific standout areas. Our experiments across several visual classification benchmarks validate SGDrop’s role in enhancing generalization. Significantly, models incorporating SGDrop display more expansive attributions and neural activity, offering a more comprehensive view of input images in contrast to their traditionally trained counterparts.
[AI-123] Koopman-driven grip force prediction through EMG sensing
链接: https://arxiv.org/abs/2409.17340 作者: Tomislav Bazina,Ervin Kamenar,Maria Fonoberova,Igor Mezić 关键词-EN: impacts daily activities, multiple sclerosis significantly, sclerosis significantly impacts, significantly impacts daily, hand function due 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
*备注: 11 pages, 8 figures, journal
点击查看摘要
Abstract:Loss of hand function due to conditions like stroke or multiple sclerosis significantly impacts daily activities. Robotic rehabilitation provides tools to restore hand function, while novel methods based on surface electromyography (sEMG) enable the adaptation of the device’s force output according to the user’s condition, thereby improving rehabilitation outcomes. This study aims to achieve accurate force estimations during medium wrap grasps using a single sEMG sensor pair, thereby addressing the challenge of escalating sensor requirements for precise predictions. We conducted sEMG measurements on 13 subjects at two forearm positions, validating results with a hand dynamometer. We established flexible signal-processing steps, yielding high peak cross-correlations between the processed sEMG signal (representing meaningful muscle activity) and grip force. Influential parameters were subsequently identified through sensitivity analysis. Leveraging a novel data-driven Koopman operator theory-based approach and problem-specific data lifting techniques, we devised a methodology for the estimation and short-term prediction of grip force from processed sEMG signals. A weighted mean absolute percentage error (wMAPE) of approx. 5.5% was achieved for the estimated grip force, whereas predictions with a 0.5-second prediction horizon resulted in a wMAPE of approx. 17.9%. The methodology proved robust regarding precise electrode positioning, as the effect of sensing position on error metrics was non-significant. The algorithm executes exceptionally fast, processing, estimating, and predicting a 0.5-second sEMG signal batch in just approx. 30 ms, facilitating real-time implementation.
[AI-124] he Technology of Outrage: Bias in Artificial Intelligence
链接: https://arxiv.org/abs/2409.17336 作者: Will Bridewell,Paul F. Bello,Selmer Bringsjord 关键词-EN: offload decision making, learning are increasingly, offload decision, decision making, machine learning 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Distribution Statement A. Approved for public release; distribution is unlimited
点击查看摘要
Abstract:Artificial intelligence and machine learning are increasingly used to offload decision making from people. In the past, one of the rationales for this replacement was that machines, unlike people, can be fair and unbiased. Evidence suggests otherwise. We begin by entertaining the ideas that algorithms can replace people and that algorithms cannot be biased. Taken as axioms, these statements quickly lead to absurdity. Spurred on by this result, we investigate the slogans more closely and identify equivocation surrounding the word ‘bias.’ We diagnose three forms of outrage-intellectual, moral, and political-that are at play when people react emotionally to algorithmic bias. Then we suggest three practical approaches to addressing bias that the AI community could take, which include clarifying the language around bias, developing new auditing methods for intelligent systems, and building certain capabilities into these systems. We conclude by offering a moral regarding the conversations about algorithmic bias that may transfer to other areas of artificial intelligence.
[AI-125] Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting
链接: https://arxiv.org/abs/2409.17332 作者: Jay Zoellin,Colin Merk,Mischa Buob,Amr Saad,Samuel Giesser,Tahm Spitznagel,Ferhat Turgut,Rui Santos,Yukun Zhou,Sigfried Wagner,Pearse A. Keane,Yih Chung Tham,Delia Cabrera DeBuc,Matthias D. Becker,Gabor M. Somfai 关键词-EN: Integrating deep learning, greatly advance diagnostic, Integrating deep, self-supervised learning, DINORET 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this http URL , C. Merk and M. Buob contributed equally as shared-first authors. D. Cabrera DeBuc, M. D. Becker and G. M. Somfai contributed equally as senior authors for this work
点击查看摘要
Abstract:Integrating deep learning into medical imaging is poised to greatly advance diagnostic methods but it faces challenges with generalizability. Foundation models, based on self-supervised learning, address these issues and improve data efficiency. Natural domain foundation models show promise for medical imaging, but systematic research evaluating domain adaptation, especially using self-supervised learning and parameter-efficient fine-tuning, remains underexplored. Additionally, little research addresses the issue of catastrophic forgetting during fine-tuning of foundation models. We adapted the DINOv2 vision transformer for retinal imaging classification tasks using self-supervised learning and generated two novel foundation models termed DINORET and BE DINORET. Publicly available color fundus photographs were employed for model development and subsequent fine-tuning for diabetic retinopathy staging and glaucoma detection. We introduced block expansion as a novel domain adaptation strategy and assessed the models for catastrophic forgetting. Models were benchmarked to RETFound, a state-of-the-art foundation model in ophthalmology. DINORET and BE DINORET demonstrated competitive performance on retinal imaging tasks, with the block expanded model achieving the highest scores on most datasets. Block expansion successfully mitigated catastrophic forgetting. Our few-shot learning studies indicated that DINORET and BE DINORET outperform RETFound in terms of data-efficiency. This study highlights the potential of adapting natural domain vision models to retinal imaging using self-supervised learning and block expansion. BE DINORET offers robust performance without sacrificing previously acquired capabilities. Our findings suggest that these methods could enable healthcare institutions to develop tailored vision models for their patient populations, enhancing global healthcare inclusivity.
[AI-126] KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation
链接: https://arxiv.org/abs/2409.17315 作者: Anantaa Kotal,Anupam Joshi 关键词-EN: Generative Deep Learning, Deep Learning models, including differential privacy, differential privacy techniques, provable privacy guarantee 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:The integration of privacy measures, including differential privacy techniques, ensures a provable privacy guarantee for the synthetic data. However, challenges arise for Generative Deep Learning models when tasked with generating realistic data, especially in critical domains such as Cybersecurity and Healthcare. Generative Models optimized for continuous data struggle to model discrete and non-Gaussian features that have domain constraints. Challenges increase when the training datasets are limited and not diverse. In such cases, generative models create synthetic data that repeats sensitive features, which is a privacy risk. Moreover, generative models face difficulties comprehending attribute constraints in specialized domains. This leads to the generation of unrealistic data that impacts downstream accuracy. To address these issues, this paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation. The novel framework augments the training of generative models with supplementary context about attribute values and enforces domain constraints during training. This added guidance enhances the model’s capacity to generate realistic and domain-compliant synthetic data. The proposed model is evaluated on real-world datasets, specifically in the domains of Cybersecurity and Healthcare, where domain constraints and rules add to the complexity of the data. Our experiments evaluate the privacy resilience and downstream accuracy of the model against benchmark methods, demonstrating its effectiveness in addressing the balance between privacy preservation and data accuracy in complex domains.
[AI-127] Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation EMNLP2024
链接: https://arxiv.org/abs/2409.17313 作者: Zehao Wang,Minye Wu,Yixin Cao,Yubo Ma,Meiqi Chen,Tinne Tuytelaars 关键词-EN: study presents, instruction categories, evaluation framework, VLN, Vision-Language Navigation 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Findings; project page: this https URL
点击查看摘要
Abstract:This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models for various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task. The CFG serves as the basis for the problem decomposition and the core premise of the instruction categories design. We propose a semi-automatic method for CFG construction with the help of Large-Language Models (LLMs). Then, we induct and generate data spanning five principal instruction categories (i.e. direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other interesting findings contribute to the development of future language-guided navigation systems.
[AI-128] A Hybrid Quantum-Classical AI-Based Detection Strategy for Generative Adversarial Network-Based Deepfake Attacks on an Autonomous Vehicle Traffic Sign Classification System
Abstract:The perception module in autonomous vehicles (AVs) relies heavily on deep learning-based models to detect and identify various objects in their surrounding environment. An AV traffic sign classification system is integral to this module, which helps AVs recognize roadway traffic signs. However, adversarial attacks, in which an attacker modifies or alters the image captured for traffic sign recognition, could lead an AV to misrecognize the traffic signs and cause hazardous consequences. Deepfake presents itself as a promising technology to be used for such adversarial attacks, in which a deepfake traffic sign would replace a real-world traffic sign image before the image is fed to the AV traffic sign classification system. In this study, the authors present how a generative adversarial network-based deepfake attack can be crafted to fool the AV traffic sign classification systems. The authors developed a deepfake traffic sign image detection strategy leveraging hybrid quantum-classical neural networks (NNs). This hybrid approach utilizes amplitude encoding to represent the features of an input traffic sign image using quantum states, which substantially reduces the memory requirement compared to its classical counterparts. The authors evaluated this hybrid deepfake detection approach along with several baseline classical convolutional NNs on real-world and deepfake traffic sign images. The results indicate that the hybrid quantum-classical NNs for deepfake detection could achieve similar or higher performance than the baseline classical convolutional NNs in most cases while requiring less than one-third of the memory required by the shallowest classical convolutional NN considered in this study.
[AI-129] Neural Network Plasticity and Loss Sharpness
链接: https://arxiv.org/abs/2409.17300 作者: Max Koster,Jude Kukla 关键词-EN: increasingly popular research, popular research field, research field due, evolve over time, gearing towards complex 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In recent years, continual learning, a prediction setting in which the problem environment may evolve over time, has become an increasingly popular research field due to the framework’s gearing towards complex, non-stationary objectives. Learning such objectives requires plasticity, or the ability of a neural network to adapt its predictions to a different task. Recent findings indicate that plasticity loss on new tasks is highly related to loss landscape sharpness in non-stationary RL frameworks. We explore the usage of sharpness regularization techniques, which seek out smooth minima and have been touted for their generalization capabilities in vanilla prediction settings, in efforts to combat plasticity loss. Our findings indicate that such techniques have no significant effect on reducing plasticity loss.
[AI-130] SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
Abstract:This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, existing datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Existing SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. We present SpoofCeleb, which leverages a fully automated pipeline that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. The resulting SpoofCeleb dataset comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We provide baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at this https URL.
[AI-131] Memory Networks: Towards Fully Biologically Plausible Learning
Abstract:The field of artificial intelligence faces significant challenges in achieving both biological plausibility and computational efficiency, particularly in visual learning tasks. Current artificial neural networks, such as convolutional neural networks, rely on techniques like backpropagation and weight sharing, which do not align with the brain’s natural information processing methods. To address these issues, we propose the Memory Network, a model inspired by biological principles that avoids backpropagation and convolutions, and operates in a single pass. This approach enables rapid and efficient learning, mimicking the brain’s ability to adapt quickly with minimal exposure to data. Our experiments demonstrate that the Memory Network achieves efficient and biologically plausible learning, showing strong performance on simpler datasets like MNIST. However, further refinement is needed for the model to handle more complex datasets such as CIFAR10, highlighting the need to develop new algorithms and techniques that closely align with biological processes while maintaining computational efficiency.
[AI-132] On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains
链接: https://arxiv.org/abs/2409.17275 作者: Xun Xian,Ganghua Wang,Xuan Bi,Jayanth Srinivasa,Ashish Kundu,Charles Fleming,Mingyi Hong,Jie Ding 关键词-EN: large language models, Retrieval-Augmented Generation, language models, legal contexts, empirically shown 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs’ generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q\A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any users, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query’s embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q\A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.
[AI-133] Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning
链接: https://arxiv.org/abs/2409.17270 作者: Debargha Ganguly,Srinivasan Iyengar,Vipin Chaudhary,Shivkumar Kalyanaraman 关键词-EN: Large Language Models, natural language processing, revolutionized natural language, complex logical sequences, Large Language 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning, particularly in novel domains and complex logical sequences. This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs. Our approach bridges LLM-generated ideas with formal logic verification, employing a custom interpreter to convert LLM outputs into First Order Logic constructs for theorem prover scrutiny. Central to our method is an intermediary JSON-based Domain-Specific Language, which by design balances precise logical structures with intuitive human concepts. This hybrid representation enables both rigorous validation and accessible human comprehension of LLM reasoning processes. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge, and a flexible architecture that allows for easy extension to various domain-specific applications. We demonstrate Proof of Thought’s effectiveness through benchmarking on StrategyQA and a novel multimodal reasoning task, showing improved performance in open-ended scenarios. By providing verifiable and interpretable results, our technique addresses critical needs for AI system accountability and sets a foundation for human-in-the-loop oversight in high-stakes domains.
[AI-134] Model aggregation: minimizing empirical variance outperforms minimizing empirical error
链接: https://arxiv.org/abs/2409.17267 作者: Théo Bourdais,Houman Owhadi 关键词-EN: deterministic or stochastic, Minimal Variance Aggregation, designed to approximate, approximate a specific, aggregation 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: The code in this paper is available for download at this https URL
点击查看摘要
Abstract:Whether deterministic or stochastic, models can be viewed as functions designed to approximate a specific quantity of interest. We propose a data-driven framework that aggregates predictions from diverse models into a single, more accurate output. This aggregation approach exploits each model’s strengths to enhance overall accuracy. It is non-intrusive - treating models as black-box functions - model-agnostic, requires minimal assumptions, and can combine outputs from a wide range of models, including those from machine learning and numerical solvers. We argue that the aggregation process should be point-wise linear and propose two methods to find an optimal aggregate: Minimal Error Aggregation (MEA), which minimizes the aggregate’s prediction error, and Minimal Variance Aggregation (MVA), which minimizes its variance. While MEA is inherently more accurate when correlations between models and the target quantity are perfectly known, Minimal Empirical Variance Aggregation (MEVA), an empirical version of MVA - consistently outperforms Minimal Empirical Error Aggregation (MEEA), the empirical counterpart of MEA, when these correlations must be estimated from data. The key difference is that MEVA constructs an aggregate by estimating model errors, while MEEA treats the models as features for direct interpolation of the quantity of interest. This makes MEEA more susceptible to overfitting and poor generalization, where the aggregate may underperform individual models during testing. We demonstrate the versatility and effectiveness of our framework in various applications, such as data science and partial differential equations, showing how it successfully integrates traditional solvers with machine learning models to improve both robustness and accuracy.
[AI-135] AAPM: Large Language Model Agent -based Asset Pricing Models
Abstract:In this study, we propose a novel asset pricing approach, LLM Agent-based Asset Pricing Models (AAPM), which fuses qualitative discretionary investment analysis from LLM agents and quantitative manual financial economic factors to predict excess asset returns. The experimental results show that our approach outperforms machine learning-based asset pricing baselines in portfolio optimization and asset pricing errors. Specifically, the Sharpe ratio and average |\alpha| for anomaly portfolios improved significantly by 9.6% and 10.8% respectively. In addition, we conducted extensive ablation studies on our model and analysis of the data to reveal further insights into the proposed method.
[AI-136] Collaborative Comic Generation: Integrating Visual Narrative Theories with AI Models for Enhanced Creativity ECAI
链接: https://arxiv.org/abs/2409.17263 作者: Yi-Chun Chen,Arnav Jhala 关键词-EN: integrates conceptual principles-comic, conceptual principles-comic authoring, principles-comic authoring idioms-with, authoring idioms-with generative, theory-inspired visual narrative 类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted for oral presentation at CREAI2024, ECAI, 2024. However, the author’s attendance is currently uncertain due to visa issues
点击查看摘要
Abstract:This study presents a theory-inspired visual narrative generative system that integrates conceptual principles-comic authoring idioms-with generative and language models to enhance the comic creation process. Our system combines human creativity with AI models to support parts of the generative process, providing a collaborative platform for creating comic content. These comic-authoring idioms, derived from prior human-created image sequences, serve as guidelines for crafting and refining storytelling. The system translates these principles into system layers that facilitate comic creation through sequential decision-making, addressing narrative elements such as panel composition, story tension changes, and panel transitions. Key contributions include integrating machine learning models into the human-AI cooperative comic generation process, deploying abstract narrative theories into AI-driven comic creation, and a customizable tool for narrative-driven image sequences. This approach improves narrative elements in generated image sequences and engages human creativity in an AI-generative process of comics. We open-source the code at this https URL.
[AI-137] Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies
链接: https://arxiv.org/abs/2409.17216 作者: Ritwik Gupta,Leah Walker,Rodolfo Corona,Stephanie Fu,Suzanne Petryk,Janet Napolitano,Trevor Darrell,Andrew W. Reddie 关键词-EN: Current regulations, regulations on powerful, narrowly focused, Current, models 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Current regulations on powerful AI capabilities are narrowly focused on “foundation” or “frontier” models. However, these terms are vague and inconsistently defined, leading to an unstable foundation for governance efforts. Critically, policy debates often fail to consider the data used with these models, despite the clear link between data and model performance. Even (relatively) “small” models that fall outside the typical definitions of foundation and frontier models can achieve equivalent outcomes when exposed to sufficiently specific datasets. In this work, we illustrate the importance of considering dataset size and content as essential factors in assessing the risks posed by models both today and in the future. More broadly, we emphasize the risk posed by over-regulating reactively and provide a path towards careful, quantitative evaluation of capabilities that can lead to a simplified regulatory environment.
[AI-138] Plurals: A System for Guiding LLMs Via Simulated Social Ensembles
链接: https://arxiv.org/abs/2409.17213 作者: Joshua Ashkinaze,Emily Fry,Narendra Edara,Eric Gilbert,Ceren Budak 关键词-EN: Recent debates raised, debates raised concerns, Recent debates, debates raised, raised concerns 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Recent debates raised concerns that language models may favor certain viewpoints. But what if the solution is not to aim for a ‘view from nowhere’ but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) which deliberate within customizable Structures, with Moderators overseeing deliberation. Plurals is a generator of simulated social ensembles. Plurals integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by democratic deliberation theory, and allows users to customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at this https URL and will be continually updated.
[AI-139] 2024 BRAVO Challenge Track 1 1st Place Report: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation
链接: https://arxiv.org/abs/2409.17208 作者: Tommie Kerssies,Daan de Geus,Gijs Dubbelman 关键词-EN: BRAVO Challenge, trained on Cityscapes, robustness is evaluated, solution for Track, present our solution 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: arXiv admin note: substantial text overlap with arXiv:2409.15107
点击查看摘要
Abstract:In this report, we present our solution for Track 1 of the 2024 BRAVO Challenge, where a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models, by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches, and achieves 1st place in the challenge. Our code is publicly available at this https URL.
[AI-140] Enhancing Guardrails for Safe and Secure Healthcare AI
链接: https://arxiv.org/abs/2409.17190 作者: Ananya Gangavarapu 关键词-EN: holds immense promise, numerous innovative applications, Generative AI holds, addressing global healthcare, global healthcare access 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Generative AI holds immense promise in addressing global healthcare access challenges, with numerous innovative applications now ready for use across various healthcare domains. However, a significant barrier to the widespread adoption of these domain-specific AI solutions is the lack of robust safety mechanisms to effectively manage issues such as hallucination, misinformation, and ensuring truthfulness. Left unchecked, these risks can compromise patient safety and erode trust in healthcare AI systems. While general-purpose frameworks like Llama Guard are useful for filtering toxicity and harmful content, they do not fully address the stringent requirements for truthfulness and safety in healthcare contexts. This paper examines the unique safety and security challenges inherent to healthcare AI, particularly the risk of hallucinations, the spread of misinformation, and the need for factual accuracy in clinical settings. I propose enhancements to existing guardrails frameworks, such as Nvidia NeMo Guardrails, to better suit healthcare-specific needs. By strengthening these safeguards, I aim to ensure the secure, reliable, and accurate use of AI in healthcare, mitigating misinformation risks and improving patient safety.
[AI-141] Fully automatic extraction of morphological traits from the Web: utopia or reality?
链接: https://arxiv.org/abs/2409.17179 作者: Diego Marcos,Robert van de Vlasakker,Ioannis N. Athanasiadis,Pierre Bonnet,Hervé Goeau,Alexis Joly,W. Daniel Kissling,César Leblanc,André S.J. van Proosdij,Konstantinos P. Panousis 关键词-EN: Plant morphological traits, observable characteristics, fundamental to understand, understand the role, role played 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Plant morphological traits, their observable characteristics, are fundamental to understand the role played by each species within their ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, massive amounts of information about species descriptions is available online in the form of text, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method managed to find values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, being limited by the availability of textual descriptions covering all the traits of interest.
[AI-142] CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Casual Significance and Consistency
链接: https://arxiv.org/abs/2409.17174 作者: Kangsheng Wang,Xiao Zhang,Zizheng Guo,Tianyu Hu,Huimin Ma 关键词-EN: large language models, causal significance, significance and consistency, reasoning, solving reasoning tasks 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Chain-based reasoning methods like chain of thought (CoT) play a rising role in solving reasoning tasks for large language models (LLMs). However, the causal illusions between \textita step of reasoning and \textitcorresponding state transitions are becoming a significant obstacle to advancing LLMs’ reasoning capabilities, especially in long-range reasoning tasks. This paper proposes a non-chain-based reasoning framework for simultaneous consideration of causal significance and consistency, i.e., the Causal Significance and Consistency Enhancer (CSCE). We customize LLM’s loss function utilizing treatment effect assessments to enhance its reasoning ability from two aspects: causal significance and consistency. This ensures that the model captures essential causal relationships and maintains robust and consistent performance across various scenarios. Additionally, we transform the reasoning process from the cascading multiple one-step reasoning commonly used in Chain-Based methods, like CoT, to a causal-enhanced method that outputs the entire reasoning process in one go, further improving the model’s reasoning efficiency. Extensive experiments show that our method improves both the reasoning success rate and speed. These improvements further demonstrate that non-chain-based methods can also aid LLMs in completing reasoning tasks.
[AI-143] A Multiple-Fill-in-the-Blank Exam Approach for Enhancing Zero-Resource Hallucination Detection in Large Language Models
链接: https://arxiv.org/abs/2409.17173 作者: Satoshi Munakata,Taku Fukui,Takao Mohri 关键词-EN: Large language models, Large language, language models, fabricate a hallucinatory, Large 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 20 pages
点击查看摘要
Abstract:Large language models (LLMs) often fabricate a hallucinatory text. Several methods have been developed to detect such text by semantically comparing it with the multiple versions probabilistically regenerated. However, a significant issue is that if the storyline of each regenerated text changes, the generated texts become incomparable, which worsen detection accuracy. In this paper, we propose a hallucination detection method that incorporates a multiple-fill-in-the-blank exam approach to address this storyline-changing issue. First, our method creates a multiple-fill-in-the-blank exam by masking multiple objects from the original text. Second, prompts an LLM to repeatedly answer this exam. This approach ensures that the storylines of the exam answers align with the original ones. Finally, quantifies the degree of hallucination for each original sentence by scoring the exam answers, considering the potential for \emphhallucination snowballing within the original text itself. Experimental results show that our method alone not only outperforms existing methods, but also achieves clearer state-of-the-art performance in the ensembles with existing methods.
[AI-144] What Would You Ask When You First Saw a2b2=c2? Evaluating LLM on Curiosity-Driven Questioning
链接: https://arxiv.org/abs/2409.17172 作者: Shashidhar Reddy Javaji,Zining Zhu 关键词-EN: knowledge remains unknown, remains unknown, store a massive, massive amount, knowledge 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person when facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulties, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen’s kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems
[AI-145] Cross-Domain Content Generation with Domain-Specific Small Language Models
链接: https://arxiv.org/abs/2409.17171 作者: Ankit Maloo Abhinav Garg 关键词-EN: small language models, language models poses, small language, minimal overlap, models poses challenges 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages
点击查看摘要
Abstract:Generating domain-specific content using small language models poses challenges, especially when dealing with multiple distinct datasets with minimal overlap. In this study, we explore methods to enable a small language model to produce coherent and relevant outputs for two different domains: stories (Dataset A) and recipes (Dataset B). Our initial experiments show that training individual models on each dataset yields satisfactory results, with each model generating appropriate content within its domain. We find that utilizing custom tokenizers tailored to each dataset significantly enhances generation quality compared to using a generic tokenizer. Attempts to adapt a single model to both domains using Low-Rank Adaptation (LoRA) or standard fine-tuning do not yield substantial results, often failing to produce meaningful outputs. Moreover, full fine-tuning without freezing the model’s existing weights leads to catastrophic forgetting, where the model loses previously learned information and only retains knowledge from the new data. To overcome these challenges, we employ a knowledge expansion strategy: training only with additional parameters. This approach enables the model to generate both stories and recipes upon request, effectively handling multiple domains without suffering from catastrophic forgetting. Our findings demonstrate that knowledge expansion with frozen layers is an effective method for small language models to generate domain-specific content across distinct datasets. This work contributes to the development of efficient multi-domain language models and provides insights into managing catastrophic forgetting in small-scale architectures.
[AI-146] REAL: Response Embedding-based Alignment for LLMs
链接: https://arxiv.org/abs/2409.17169 作者: Honggen Zhang,Igor Molybog,June Zhang,Xufeng Zhao 关键词-EN: Aligning large language, Aligning large, Direct Preference Optimization, Preference Optimization rely, large language models 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, which usually involve training on supervised datasets. Popular algorithms such as Direct Preference Optimization rely on pairs of AI-generated responses ranked according to human feedback. The labeling process is the most labor-intensive and costly part of the alignment pipeline, and improving its efficiency would have a meaningful impact on AI development. We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs for labeling out of a set of AI-generated responses. Experimental results on synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. We also applied our method to the real-world dataset SHP2, selecting optimal pairs from multiple responses. The model aligned on dissimilar response pairs obtained the best win rate on the dialogue task. Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators’ work.
[AI-147] StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?
链接: https://arxiv.org/abs/2409.17167 作者: Guobin Shen,Dongcheng Zhao,Aorigele Bao,Xiang He,Yiting Dong,Yi Zeng 关键词-EN: Large Language Models, Language Models, Large Language, stress, LLMs 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 9 figures
点击查看摘要
Abstract:Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.
[AI-148] ScriptSmith: A Unified LLM Framework for Enhancing IT Operations via Automated Bash Script Generation Assessment and Refinement
链接: https://arxiv.org/abs/2409.17166 作者: Oishik Chatterjee,Pooja Aggarwal,Suranjana Samanta,Ting Dai,Prateeti Mohapatra,Debanjana Kar,Ruchi Mahindru,Steve Barbieri,Eugen Postea,Brad Blancett,Arthur De Magalhaes 关键词-EN: site reliability engineering, rapidly evolving landscape, site reliability, reliability engineering, applications is paramount 类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Under Review
点击查看摘要
Abstract:In the rapidly evolving landscape of site reliability engineering (SRE), the demand for efficient and effective solutions to manage and resolve issues in site and cloud applications is paramount. This paper presents an innovative approach to action automation using large language models (LLMs) for script generation, assessment, and refinement. By leveraging the capabilities of LLMs, we aim to significantly reduce the human effort involved in writing and debugging scripts, thereby enhancing the productivity of SRE teams. Our experiments focus on Bash scripts, a commonly used tool in SRE, and involve the CodeSift dataset of 100 tasks and the InterCode dataset of 153 tasks. The results show that LLMs can automatically assess and refine scripts efficiently, reducing the need for script validation in an execution environment. Results demonstrate that the framework shows an overall improvement of 7-10% in script generation.
[AI-149] Cross Dataset Analysis and Network Architecture Repair for Autonomous Car Lane Detection
Abstract:Transfer Learning has become one of the standard methods to solve problems to overcome the isolated learning paradigm by utilizing knowledge acquired for one task to solve another related one. However, research needs to be done, to identify the initial steps before inducing transfer learning to applications for further verification and explainablity. In this research, we have performed cross dataset analysis and network architecture repair for the lane detection application in autonomous vehicles. Lane detection is an important aspect of autonomous vehicles driving assistance system. In most circumstances, modern deep-learning-based lane recognition systems are successful, but they struggle with lanes with complex topologies. The proposed architecture, ERFCondLaneNet is an enhancement to the CondlaneNet used for lane identification framework to solve the difficulty of detecting lane lines with complex topologies like dense, curved and fork lines. The newly proposed technique was tested on two common lane detecting benchmarks, CULane and CurveLanes respectively, and two different backbones, ResNet and ERFNet. The researched technique with ERFCondLaneNet, exhibited similar performance in comparison to ResnetCondLaneNet, while using 33% less features, resulting in a reduction of model size by 46%.
[AI-150] Confident Teacher Confident Student? A Novel User Study Design for Investigating the Didactic Potential of Explanations and their Impact on Uncertainty ECML2024
链接: https://arxiv.org/abs/2409.17157 作者: Teodor Chiaburu,Frank Haußer,Felix Bießmann 关键词-EN: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, potential of XAI, research community 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 15 pages, 5 figures, 1 table, presented at ECML 2024, AIMLAI Workshop, Vilnius
点击查看摘要
Abstract:Evaluating the quality of explanations in Explainable Artificial Intelligence (XAI) is to this day a challenging problem, with ongoing debate in the research community. While some advocate for establishing standardized offline metrics, others emphasize the importance of human-in-the-loop (HIL) evaluation. Here we propose an experimental design to evaluate the potential of XAI in human-AI collaborative settings as well as the potential of XAI for didactics. In a user study with 1200 participants we investigate the impact of explanations on human performance on a challenging visual task - annotation of biological species in complex taxonomies. Our results demonstrate the potential of XAI in complex visual annotation tasks: users become more accurate in their annotations and demonstrate less uncertainty with AI assistance. The increase in accuracy was, however, not significantly different when users were shown the mere prediction of the model compared to when also providing an explanation. We also find negative effects of explanations: users tend to replicate the model’s predictions more often when shown explanations, even when those predictions are wrong. When evaluating the didactic effects of explanations in collaborative human-AI settings, we find that users’ annotations are not significantly better after performing annotation with AI assistance. This suggests that explanations in visual human-AI collaboration do not appear to induce lasting learning effects. All code and experimental data can be found in our GitHub repository: this https URL.
[AI-151] PhantomLiDAR: Cross-modality Signal Injection Attacks against LiDAR
链接: https://arxiv.org/abs/2409.17907 作者: Zizhi Jin,Qinhong Jiang,Xuancun Lu,Chen Yan,Xiaoyu Ji,Wenyuan Xu 关键词-EN: Light Detection, Detection and Ranging, offering precise, spatial information, autonomous driving 类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:LiDAR (Light Detection and Ranging) is a pivotal sensor for autonomous driving, offering precise 3D spatial information. Previous signal attacks against LiDAR systems mainly exploit laser signals. In this paper, we investigate the possibility of cross-modality signal injection attacks, i.e., injecting intentional electromagnetic interference (IEMI) to manipulate LiDAR output. Our insight is that the internal modules of a LiDAR, i.e., the laser receiving circuit, the monitoring sensors, and the beam-steering modules, even with strict electromagnetic compatibility (EMC) testing, can still couple with the IEMI attack signals and result in the malfunction of LiDAR systems. Based on the above attack surfaces, we propose the PhantomLiDAR attack, which manipulates LiDAR output in terms of Points Interference, Points Injection, Points Removal, and even LiDAR Power-Off. We evaluate and demonstrate the effectiveness of PhantomLiDAR with both simulated and real-world experiments on five COTS LiDAR systems. We also conduct feasibility experiments in real-world moving scenarios. We provide potential defense measures that can be implemented at both the sensor level and the vehicle system level to mitigate the risks associated with IEMI attacks. Video demonstrations can be viewed at this https URL.
[AI-152] Revisiting Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations
链接: https://arxiv.org/abs/2409.17899 作者: Yujia Sun,Zeyu Zhao,Korin Richmond,Yuanchao Li 关键词-EN: music SSL models, SSL models, speech and music, Emotion recognition, Music Emotion Recognition 类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
*备注:
点击查看摘要
Abstract:Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, given the fact that SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotion speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Frechet audio distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary depending on different emotions due to their training strategies and domain-specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from each other. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential for cross-domain generalization to improve SER and MER systems.
[AI-153] Let the Quantum Creep In: Designing Quantum Neural Network Models by Gradually Swapping Out Classical Components
链接: https://arxiv.org/abs/2409.17583 作者: Peiyong Wang,Casey. R. Myers,Lloyd C. L. Hollenberg,Udaya Parampalli 关键词-EN: Artificial Intelligence, quantum neural network, neural network, classical neural network, quantum 类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 50 pages (including Appendix), many figures, accepted as a poster on QTML2024. Code available at this https URL
点击查看摘要
Abstract:Artificial Intelligence (AI), with its multiplier effect and wide applications in multiple areas, could potentially be an important application of quantum computing. Since modern AI systems are often built on neural networks, the design of quantum neural networks becomes a key challenge in integrating quantum computing into AI. To provide a more fine-grained characterisation of the impact of quantum components on the performance of neural networks, we propose a framework where classical neural network layers are gradually replaced by quantum layers that have the same type of input and output while keeping the flow of information between layers unchanged, different from most current research in quantum neural network, which favours an end-to-end quantum model. We start with a simple three-layer classical neural network without any normalisation layers or activation functions, and gradually change the classical layers to the corresponding quantum versions. We conduct numerical experiments on image classification datasets such as the MNIST, FashionMNIST and CIFAR-10 datasets to demonstrate the change of performance brought by the systematic introduction of quantum components. Through this framework, our research sheds new light on the design of future quantum neural network models where it could be more favourable to search for methods and frameworks that harness the advantages from both the classical and quantum worlds.
[AI-154] NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes NEURIPS2024
链接: https://arxiv.org/abs/2409.17510 作者: Ziquan Wei,Tingting Dan,Jiaqi Ding,Paul J Laurienti,Guorong Wu 关键词-EN: modern imaging technologies, fluctuations emerge remarkable, emerge remarkable cognition, brain regions in-vivo, spontaneous functional fluctuations 类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:Although modern imaging technologies allow us to study connectivity between two distinct brain regions in-vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations emerge remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating such network neuroscience question into an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop interacted by brain structure and function. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biological-inspired deep model, coined as NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance by our NeuroPath indicates great potential in network neuroscience.
[AI-155] Adjusting Regression Models for Conditional Uncertainty Calibration
Abstract:Conformal Prediction methods have finite-sample distribution-free marginal coverage guarantees. However, they generally do not offer conditional coverage guarantees, which can be important for high-stakes decisions. In this paper, we propose a novel algorithm to train a regression function to improve the conditional coverage after applying the split conformal prediction procedure. We establish an upper bound for the miscoverage gap between the conditional coverage and the nominal coverage rate and propose an end-to-end algorithm to control this upper bound. We demonstrate the efficacy of our method empirically on synthetic and real-world datasets.
[AI-156] Solar Active Regions Emergence Prediction Using Long Short-Term Memory Networks
链接: https://arxiv.org/abs/2409.17421 作者: Spiridon Kasapis,Irina N. Kitiashvili,Alexander G. Kosovichev,John T. Stefan 关键词-EN: Long Short-Term Memory, developed Long Short-Term, Short-Term Memory, developed Long, Long Short-Term 类目: olar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 5 tables, under review at the AAS Astrophysical Journal
点击查看摘要
Abstract:We developed Long Short-Term Memory (LSTM) models to predict the formation of active regions (ARs) on the solar surface. Using the Doppler shift velocity, the continuum intensity, and the magnetic field observations from the Solar Dynamics Observatory (SDO) Helioseismic and Magnetic Imager (HMI), we have created time-series datasets of acoustic power and magnetic flux, which are used to train LSTM models on predicting continuum intensity, 12 hours in advance. These novel machine learning (ML) models are able to capture variations of the acoustic power density associated with upcoming magnetic flux emergence and continuum intensity decrease. Testing of the models’ performance was done on data for 5 ARs, unseen from the models during training. Model 8, the best performing model trained, was able to make a successful prediction of emergence for all testing active regions in an experimental setting and three of them in an operational. The model predicted the emergence of AR11726, AR13165, and AR13179 respectively 10, 29, and 5 hours in advance, and variations of this model achieved average RMSE values of 0.11 for both active and quiet areas on the solar disc. This work sets the foundations for ML-aided prediction of solar ARs.
[AI-157] Disk2Planet: A Robust and Automated Machine Learning Tool for Parameter Inference in Disk-Planet Systems
Abstract:We introduce Disk2Planet, a machine learning-based tool to infer key parameters in disk-planet systems from observed protoplanetary disk structures. Disk2Planet takes as input the disk structures in the form of two-dimensional density and velocity maps, and outputs disk and planet properties, that is, the Shakura–Sunyaev viscosity, the disk aspect ratio, the planet–star mass ratio, and the planet’s radius and azimuth. We integrate the Covariance Matrix Adaptation Evolution Strategy (CMA–ES), an evolutionary algorithm tailored for complex optimization problems, and the Protoplanetary Disk Operator Network (PPDONet), a neural network designed to predict solutions of disk–planet interactions. Our tool is fully automated and can retrieve parameters in one system in three minutes on an Nvidia A100 graphics processing unit. We empirically demonstrate that our tool achieves percent-level or higher accuracy, and is able to handle missing data and unknown levels of noise.
[AI-158] ransfer learning for financial data predictions: a systematic review
链接: https://arxiv.org/abs/2409.17183 作者: V. Lanzetta 关键词-EN: time series data, financial time series, series data pose, data pose significant, time series 类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 43 pages, 5 tables, 1 figure
点击查看摘要
Abstract:Literature highlighted that financial time series data pose significant challenges for accurate stock price prediction, because these data are characterized by noise and susceptibility to news; traditional statistical methodologies made assumptions, such as linearity and normality, which are not suitable for the non-linear nature of financial time series; on the other hand, machine learning methodologies are able to capture non linear relationship in the data. To date, neural network is considered the main machine learning tool for the financial prices prediction. Transfer Learning, as a method aimed at transferring knowledge from source tasks to target tasks, can represent a very useful methodological tool for getting better financial prediction capability. Current reviews on the above body of knowledge are mainly focused on neural network architectures, for financial prediction, with very little emphasis on the transfer learning methodology; thus, this paper is aimed at going deeper on this topic by developing a systematic review with respect to application of Transfer Learning for financial market predictions and to challenges/potential future directions of the transfer learning methodologies for stock market predictions.
计算机视觉
[CV-0] FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner NEURIPS2024
Abstract:Building on the success of diffusion models in visual generation, flow-based models reemerge as another prominent family of generative models that have achieved competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow-matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models for which fast samplers are well-developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor’s outputs in the flow-based models will become stable during the sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques including a pseudo corrector and sample-aware compilation to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied for various tasks such as image editing, inpainting, etc. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1% \sim 58.3% on class-conditional generation and 29.8% \sim 38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet with 100 (ms / img) and FID of 3.93 with 38 (ms / img), achieving the real-time image generation and establishing the new state-of-the-art. Code is available at this https URL.
[CV-1] EgoLM: Multi-Modal Language Model of Egocentric Motions
链接: https://arxiv.org/abs/2409.18127 作者: Fangzhou Hong,Vladimir Guzov,Hyo Jin Kim,Yuting Ye,Richard Newcombe,Ziwei Liu,Lingni Ma 关键词-EN: wearable devices, prevalence of wearable, essential to develop, develop contextual, egocentric 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL
点击查看摘要
Abstract:As the prevalence of wearable devices, learning egocentric motions becomes essential to develop contextual AI. In this work, we present EgoLM, a versatile framework that tracks and understands egocentric motions from multi-modal inputs, e.g., egocentric videos and motion sensors. EgoLM exploits rich contexts for the disambiguation of egomotion tracking and understanding, which are ill-posed under single modality conditions. To facilitate the versatile and multi-modal framework, our key insight is to model the joint distribution of egocentric motions and natural languages using large language models (LLM). Multi-modal sensor inputs are encoded and projected to the joint latent space of language models, and used to prompt motion generation or text generation for egomotion tracking or understanding, respectively. Extensive experiments on large-scale multi-modal human motion dataset validate the effectiveness of EgoLM as a generalist model for universal egocentric learning.
[CV-2] LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
链接: https://arxiv.org/abs/2409.18125 作者: Chenming Zhu,Tai Wang,Wenwei Zhang,Jiangmiao Pang,Xihui Liu 关键词-EN: Large Multimodal Models, Multimodal Models, Large Multimodal, Recent advancements, advancements in Large 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL
点击查看摘要
Abstract:Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.
[CV-3] Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
链接: https://arxiv.org/abs/2409.18124 作者: Jing He,Haodong Li,Wei Yin,Yixun Liang,Leheng Li,Kaiqiang Zhou,Hongbo Liu,Bingbing Liu,Ying-Cong Chen 关键词-EN: dense prediction tasks, dense prediction, priors of pre-trained, offers a promising, promising solution 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL
点击查看摘要
Abstract:Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising solution to enhance zero-shot generalization in dense prediction tasks. However, existing methods often uncritically use the original diffusion formulation, which may not be optimal due to the fundamental differences between dense prediction and image generation. In this paper, we provide a systemic analysis of the diffusion formulation for the dense prediction, focusing on both quality and efficiency. And we find that the original parameterization type for image generation, which learns to predict noise, is harmful for dense prediction; the multi-step noising/denoising diffusion process is also unnecessary and challenging to optimize. Based on these insights, we introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to directly predict annotations instead of noise, thereby avoiding harmful variance. We also reformulate the diffusion process into a single-step procedure, simplifying optimization and significantly boosting inference speed. Additionally, we introduce a novel tuning strategy called detail preserver, which achieves more accurate and fine-grained predictions. Without scaling up the training data or model capacity, Lotus achieves SoTA performance in zero-shot depth and normal estimation across various datasets. It also significantly enhances efficiency, being hundreds of times faster than most existing diffusion-based methods.
[CV-4] Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction
Abstract:Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration’s intended behavior while considering the robot’s own morphological limits, rather than attempting to reproduce the hand’s motion. We evaluate 4D-DPM’s 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD’s physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average of 87% success rate, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models – without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: this https URL
[CV-5] EvMAPPER: High Altitude Orthomapping with Event Cameras
链接: https://arxiv.org/abs/2409.18120 作者: Fernando Cladera,Kenneth Chaney,M. Ani Hsieh,Camillo J. Taylor,Vijay Kumar 关键词-EN: unmanned aerial vehicles, unmanned aerial, aerial vehicles, Traditionally, collect images 类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 7 figures
点击查看摘要
Abstract:Traditionally, unmanned aerial vehicles (UAVs) rely on CMOS-based cameras to collect images about the world below. One of the most successful applications of UAVs is to generate orthomosaics or orthomaps, in which a series of images are integrated together to develop a larger map. However, the use of CMOS-based cameras with global or rolling shutters mean that orthomaps are vulnerable to challenging light conditions, motion blur, and high-speed motion of independently moving objects under the camera. Event cameras are less sensitive to these issues, as their pixels are able to trigger asynchronously on brightness changes. This work introduces the first orthomosaic approach using event cameras. In contrast to existing methods relying only on CMOS cameras, our approach enables map generation even in challenging light conditions, including direct sunlight and after sunset.
[CV-6] Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography MICCAI2024
链接: https://arxiv.org/abs/2409.18119 作者: Yuexi Du,John Onofrey,Nicha C. Dvornek 关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Language-Image Pre-training, requires substantial data, shows promise 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is also the basis of the overall best solution for the MICCAI 2024 CXR-LT Challenge
点击查看摘要
Abstract:Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and data imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline.
[CV-7] EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation
链接: https://arxiv.org/abs/2409.18114 作者: Jiaxiang Tang,Zhaoshuo Li,Zekun Hao,Xian Liu,Gang Zeng,Ming-Yu Liu,Qinsheng Zhang 关键词-EN: Current auto-regressive mesh, Current auto-regressive, insufficient detail, generation methods suffer, methods suffer 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL
点击查看摘要
Abstract:Current auto-regressive mesh generation methods suffer from issues such as incompleteness, insufficient detail, and poor generalization. In this paper, we propose an Auto-regressive Auto-encoder (ArAE) model capable of generating high-quality 3D meshes with up to 4,000 faces at a spatial resolution of 512^3 . We introduce a novel mesh tokenization algorithm that efficiently compresses triangular meshes into 1D token sequences, significantly enhancing training efficiency. Furthermore, our model compresses variable-length triangular meshes into a fixed-length latent space, enabling training latent diffusion models for better generalization. Extensive experiments demonstrate the superior quality, diversity, and generalization capabilities of our model in both point cloud and image-conditioned mesh generation tasks.
[CV-8] E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding NEURIPS2024
链接: https://arxiv.org/abs/2409.18111 作者: Ye Liu,Zongyang Ma,Zhongang Qi,Yang Wu,Ying Shan,Chang Wen Chen 关键词-EN: Large Language Models, Video Large Language, Large Language, Recent advances, Language Models 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024 Datasets and Benchmarks Track
点击查看摘要
Abstract:Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) under 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding event-of-interests within videos, largely due to the short video context length, improper time representations, and lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset E.T. Instruct 164K tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.
[CV-9] Find Rhinos without Finding Rhinos: Active Learning with Multimodal Imagery of South African Rhino Habitats IJCAI2023
链接: https://arxiv.org/abs/2409.18104 作者: Lucia Gordon,Nikhil Behari,Samuel Collier,Elizabeth Bondi-Kelly,Jackson A. Killian,Catherine Ressijac,Peter Boucher,Andrew Davies,Milind Tambe 关键词-EN: Earth charismatic megafauna, crisis in Africa, Earth charismatic, human activities, charismatic megafauna 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 9 figures, IJCAI 2023 Special Track on AI for Good
点击查看摘要
Abstract:Much of Earth’s charismatic megafauna is endangered by human activities, particularly the rhino, which is at risk of extinction due to the poaching crisis in Africa. Monitoring rhinos’ movement is crucial to their protection but has unfortunately proven difficult because rhinos are elusive. Therefore, instead of tracking rhinos, we propose the novel approach of mapping communal defecation sites, called middens, which give information about rhinos’ spatial behavior valuable to anti-poaching, management, and reintroduction efforts. This paper provides the first-ever mapping of rhino midden locations by building classifiers to detect them using remotely sensed thermal, RGB, and LiDAR imagery in passive and active learning settings. As existing active learning methods perform poorly due to the extreme class imbalance in our dataset, we design MultimodAL, an active learning system employing a ranking technique and multimodality to achieve competitive performance with passive learning models with 94% fewer labels. Our methods could therefore save over 76 hours in labeling time when used on a similarly-sized dataset. Unexpectedly, our midden map reveals that rhino middens are not randomly distributed throughout the landscape; rather, they are clustered. Consequently, rangers should be targeted at areas with high midden densities to strengthen anti-poaching efforts, in line with UN Target 15.7.
[CV-10] MALPOLON: A Framework for Deep Species Distribution Modeling
链接: https://arxiv.org/abs/2409.18102 作者: Theo Larcher,Lukas Picek,Benjamin Deneu,Titouan Lorieul,Maximilien Servajean,Alexis Joly 关键词-EN: paper describes, Python language skills, general Python language, deep species distribution, testing deep learning 类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:This paper describes a deep-SDM framework, MALPOLON. Written in Python and built upon the PyTorch library, this framework aims to facilitate training and inferences of deep species distribution models (deep-SDM) and sharing for users with only general Python language skills (e.g., modeling ecologists) who are interested in testing deep learning approaches to build new SDMs. More advanced users can also benefit from the framework’s modularity to run more specific experiments by overriding existing classes while taking advantage of press-button examples to train neural networks on multiple classification tasks using custom or provided raw and pre-processed datasets. The framework is open-sourced on GitHub and PyPi along with extensive documentation and examples of use in various scenarios. MALPOLON offers straightforward installation, YAML-based configuration, parallel computing, multi-GPU utilization, baseline and foundational models for benchmarking, and extensive tutorials/documentation, aiming to enhance accessibility and performance scalability for ecologists and researchers.
[CV-11] AI-Powered Augmented Reality for Satellite Assembly Integration and Test
Abstract:The integration of Artificial Intelligence (AI) and Augmented Reality (AR) is set to transform satellite Assembly, Integration, and Testing (AIT) processes by enhancing precision, minimizing human error, and improving operational efficiency in cleanroom environments. This paper presents a technical description of the European Space Agency’s (ESA) project “AI for AR in Satellite AIT,” which combines real-time computer vision and AR systems to assist technicians during satellite assembly. Leveraging Microsoft HoloLens 2 as the AR interface, the system delivers context-aware instructions and real-time feedback, tackling the complexities of object recognition and 6D pose estimation in AIT workflows. All AI models demonstrated over 70% accuracy, with the detection model exceeding 95% accuracy, indicating a high level of performance and reliability. A key contribution of this work lies in the effective use of synthetic data for training AI models in AR applications, addressing the significant challenges of obtaining real-world datasets in highly dynamic satellite environments, as well as the creation of the Segmented Anything Model for Automatic Labelling (SAMAL), which facilitates the automatic annotation of real data, achieving speeds up to 20 times faster than manual human annotation. The findings demonstrate the efficacy of AI-driven AR systems in automating critical satellite assembly tasks, setting a foundation for future innovations in the space industry.
[CV-12] Self-supervised Pretraining for Cardiovascular Magnetic Resonance Cine Segmentation MICCAI2024
链接: https://arxiv.org/abs/2409.18100 作者: Rob A. J. de Mooij,Josien P. W. Pluim,Cian M. Scannell 关键词-EN: cardiovascular magnetic resonance, shown promising results, automated cardiovascular magnetic, CMR cine segmentation, SSP 类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to Data Engineering in Medical Imaging (DEMI) Workshop at MICCAI 2024
点击查看摘要
Abstract:Self-supervised pretraining (SSP) has shown promising results in learning from large unlabeled datasets and, thus, could be useful for automated cardiovascular magnetic resonance (CMR) short-axis cine segmentation. However, inconsistent reports of the benefits of SSP for segmentation have made it difficult to apply SSP to CMR. Therefore, this study aimed to evaluate SSP methods for CMR cine segmentation. To this end, short-axis cine stacks of 296 subjects (90618 2D slices) were used for unlabeled pretraining with four SSP methods; SimCLR, positional contrastive learning, DINO, and masked image modeling (MIM). Subsets of varying numbers of subjects were used for supervised fine-tuning of 2D models for each SSP method, as well as to train a 2D baseline model from scratch. The fine-tuned models were compared to the baseline using the 3D Dice similarity coefficient (DSC) in a test dataset of 140 subjects. The SSP methods showed no performance gains with the largest supervised fine-tuning subset compared to the baseline (DSC = 0.89). When only 10 subjects (231 2D slices) are available for supervised training, SSP using MIM (DSC = 0.86) improves over training from scratch (DSC = 0.82). This study found that SSP is valuable for CMR cine segmentation when labeled training data is scarce, but does not aid state-of-the-art deep learning methods when ample labeled data is available. Moreover, the choice of SSP method is important. The code is publicly available at: this https URL Comments: Accepted to Data Engineering in Medical Imaging (DEMI) Workshop at MICCAI 2024 Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2409.18100 [cs.CV] (or arXiv:2409.18100v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2409.18100 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rob De Mooij [view email] [v1] Thu, 26 Sep 2024 17:44:29 UTC (1,439 KB)
[CV-13] EfficientCrackNet: A Lightweight Model for Crack Segmentation
Abstract:Crack detection, particularly from pavement images, presents a formidable challenge in the domain of computer vision due to several inherent complexities such as intensity inhomogeneity, intricate topologies, low contrast, and noisy backgrounds. Automated crack detection is crucial for maintaining the structural integrity of essential infrastructures, including buildings, pavements, and bridges. Existing lightweight methods often face challenges including computational inefficiency, complex crack patterns, and difficult backgrounds, leading to inaccurate detection and impracticality for real-world applications. To address these limitations, we propose EfficientCrackNet, a lightweight hybrid model combining Convolutional Neural Networks (CNNs) and transformers for precise crack segmentation. EfficientCrackNet integrates depthwise separable convolutions (DSC) layers and MobileViT block to capture both global and local features. The model employs an Edge Extraction Method (EEM) and for efficient crack edge detection without pretraining, and Ultra-Lightweight Subspace Attention Module (ULSAM) to enhance feature extraction. Extensive experiments on three benchmark datasets Crack500, DeepCrack, and GAPs384 demonstrate that EfficientCrackNet achieves superior performance compared to existing lightweight models, while requiring only 0.26M parameters, and 0.483 FLOPs (G). The proposed model offers an optimal balance between accuracy and computational efficiency, outperforming state-of-the-art lightweight models, and providing a robust and adaptable solution for real-world crack segmentation.
Abstract:Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle’s surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets and our approach outperforms the state-of-the-art for SSC.
[CV-15] Stable Video Portraits ECCV2024
链接: https://arxiv.org/abs/2409.18083 作者: Mirela Ostrek,Justus Thies 关键词-EN: Rapid advances, computer-generated imagery today, perceive computer-generated imagery, field of generative, perceive computer-generated 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024, Project: this https URL
点击查看摘要
Abstract:Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any fine-tuning at test time. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.
[CV-16] SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation
链接: https://arxiv.org/abs/2409.18082 作者: Xin Li,Siyuan Huang,Qiaojun Yu,Zhengkai Jiang,Ce Hao,Yimeng Zhu,Hongsheng Li,Peng Gao,Cewu Lu 关键词-EN: Automating garment manipulation, Automating garment, poses a significant, significant challenge, diverse and deformable 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our model enables robots to manage different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. In addition, this research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics for future.
[CV-17] FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction
Abstract:Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user’s intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual ReferAttention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods. The code will be available at: this https URL.
[CV-18] LightAvatar: Efficient Head Avatar as Dynamic Neural Light Field ECCV’24
链接: https://arxiv.org/abs/2409.18057 作者: Huan Wang,Feitong Tan,Ziqian Bai,Yinda Zhang,Shichen Liu,Qiangeng Xu,Menglei Chai,Anish Prabhu,Rohit Pandey,Sean Fanello,Zeng Huang,Yun Fu 关键词-EN: build photorealistic head, Recent works, photorealistic head avatars, monocular video, neural radiance fields 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Appear in ECCV’24 CADL Workshop. Code: this https URL
点击查看摘要
Abstract:Recent works have shown that neural radiance fields (NeRFs) on top of parametric models have reached SOTA quality to build photorealistic head avatars from a monocular video. However, one major limitation of the NeRF-based avatars is the slow rendering speed due to the dense point sampling of NeRF, preventing them from broader utility on resource-constrained devices. We introduce LightAvatar, the first head avatar model based on neural light fields (NeLFs). LightAvatar renders an image from 3DMM parameters and a camera pose via a single network forward pass, without using mesh or volume rendering. The proposed approach, while being conceptually appealing, poses a significant challenge towards real-time efficiency and training stability. To resolve them, we introduce dedicated network designs to obtain proper representations for the NeLF model and maintain a low FLOPs budget. Meanwhile, we tap into a distillation-based training strategy that uses a pretrained avatar model as teacher to synthesize abundant pseudo data for training. A warping field network is introduced to correct the fitting error in the real data so that the model can learn better. Extensive experiments suggest that our method can achieve new SOTA image quality quantitatively or qualitatively, while being significantly faster than the counterparts, reporting 174.1 FPS (512x512 resolution) on a consumer-grade GPU (RTX3090) with no customized optimization.
[CV-19] Visual Data Diagnosis and Debiasing with Concept Graphs
链接: https://arxiv.org/abs/2409.18055 作者: Rwiddhi Chakraborty,Yinong Wang,Jialu Gao,Runkai Zheng,Cheng Zhang,Fernando De la Torre 关键词-EN: deep learning models, learning models today, size and complexity, widespread success, success of deep 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present CONBIAS, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. CONBIAS represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by CONBIAS improves generalization performance across multiple datasets compared to state-of-the-art methods. We will make our code and data publicly available.
[CV-20] Revisit Anything: Visual Place Recognition via Image Segment Retrieval ECCV2024
链接: https://arxiv.org/abs/2409.18049 作者: Kartik Garg,Sai Shubodh Puligilla,Shishir Kolathaya,Madhava Krishna,Sourav Garg 关键词-EN: Accurately recognizing, localize and navigate, crucial for embodied, embodied agents, agents to localize 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Presented at ECCV 2024; Includes supplementary; 29 pages; 8 figures
点击查看摘要
Abstract:Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the “whole” image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: “the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap”. We address this by encoding and searching for “image segments” instead of the whole images. We propose to use open-set image segmentation to decompose an image into `meaningful’ entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to ``revisit anything’’ by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: this https URL.
[CV-21] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning EMNLP2024
链接: https://arxiv.org/abs/2409.18046 作者: Soeun Lee,Si-Woo Kim,Taewhan Kim,Dong-Jin Kim 关键词-EN: Recent advancements, paired image-text data, explored text-only training, text-only training, overcome the limitations 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024
点击查看摘要
Abstract:Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ( \textbfI mage-like Retrieval and \textbfF requency-based Entity Filtering for Zero-shot \textbfCap tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.
[CV-22] EMOVA: Empowering Language Models to See Hear and Speak with Vivid Emotions
链接: https://arxiv.org/abs/2409.18042 作者: Kai Chen,Yunhao Gou,Runhui Huang,Zhili Liu,Daxin Tan,Jing Xu,Chunwei Wang,Yi Zhu,Yihan Zeng,Kuo Yang,Dingdong Wang,Kun Xiang,Haoyuan Li,Haoli Bai,Jianhua Han,Xiaohui Li,Weike Jin,Nian Xie,Yu Zhang,James T. Kwok,Hengshuang Zhao,Xiaodan Liang,Dit-Yan Yeung,Xiao Chen,Zhenguo Li,Wei Zhang,Qun Liu,Lanqing Hong,Lu Hou,Hang Xu 关键词-EN: Large Language Models, enables vocal conversations, Large Language, empowering Large Language, enable Large Language 类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Project Page: this https URL
点击查看摘要
Abstract:GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.
[CV-23] ReliOcc: Towards Reliable Semantic Occupancy Prediction via Uncertainty Learning
链接: https://arxiv.org/abs/2409.18026 作者: Song Wang,Zhongdao Wang,Jiawei Yu,Wentong Li,Bailan Feng,Junbo Chen,Jianke Zhu 关键词-EN: Vision-centric semantic occupancy, Vision-centric semantic, autonomous driving, plays a crucial, crucial role 类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Technical report. Work in progress
点击查看摘要
Abstract:Vision-centric semantic occupancy prediction plays a crucial role in autonomous driving, which requires accurate and reliable predictions from low-cost sensors. Although having notably narrowed the accuracy gap with LiDAR, there is still few research effort to explore the reliability in predicting semantic occupancy from camera. In this paper, we conduct a comprehensive evaluation of existing semantic occupancy prediction models from a reliability perspective for the first time. Despite the gradual alignment of camera-based models with LiDAR in term of accuracy, a significant reliability gap persists. To addresses this concern, we propose ReliOcc, a method designed to enhance the reliability of camera-based occupancy networks. ReliOcc provides a plug-and-play scheme for existing models, which integrates hybrid uncertainty from individual voxels with sampling-based noise and relative voxels through mix-up learning. Besides, an uncertainty-aware calibration strategy is devised to further enhance model reliability in offline mode. Extensive experiments under various settings demonstrate that ReliOcc significantly enhances model reliability while maintaining the accuracy of both geometric and semantic predictions. Importantly, our proposed approach exhibits robustness to sensor failures and out of domain noises during inference.
[CV-24] ransferring disentangled representations: bridging the gap between synthetic and real images
Abstract:Developing meaningful and efficient representations that separate the fundamental structure of the data generation mechanism is crucial in representation learning. However, Disentangled Representation Learning has not fully shown its potential on real images, because of correlated generative factors, their resolution and limited access to ground truth labels. Specifically on the latter, we investigate the possibility of leveraging synthetic data to learn general-purpose disentangled representations applicable to real data, discussing the effect of fine-tuning and what properties of disentanglement are preserved after the transfer. We provide an extensive empirical study to address these issues. In addition, we propose a new interpretable intervention-based metric, to measure the quality of factors encoding in the representation. Our results indicate that some level of disentanglement, transferring a representation from synthetic to real data, is possible and effective.
[CV-25] InterNet: Unsupervised Cross-modal Homography Estimation Based on Interleaved Modality Transfer and Self-supervised Homography Prediction
Abstract:We propose a novel unsupervised cross-modal homography estimation framework, based on interleaved modality transfer and self-supervised homography prediction, named InterNet. InterNet integrates modality transfer and self-supervised homography estimation, introducing an innovative interleaved optimization framework to alternately promote both components. The modality transfer gradually narrows the modality gaps, facilitating the self-supervised homography estimation to fully leverage the synthetic intra-modal data. The self-supervised homography estimation progressively achieves reliable predictions, thereby providing robust cross-modal supervision for the modality transfer. To further boost the estimation accuracy, we also formulate a fine-grained homography feature loss to improve the connection between two components. Furthermore, we employ a simple yet effective distillation training technique to reduce model parameters and improve cross-domain generalization ability while maintaining comparable performance. Experiments reveal that InterNet achieves the state-of-the-art (SOTA) performance among unsupervised methods, and even outperforms many supervised methods such as MHN and LocalTrans.
[CV-26] Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions ECCV2024
链接: https://arxiv.org/abs/2409.17988 作者: Weng Fei Low,Gim Hee Lee 关键词-EN: high dynamic range, standard cameras underperform, event motion blur, event camera makes, event 类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Accepted to ECCV 2024. Project website is accessible at this https URL . arXiv admin note: text overlap with arXiv:2006.07722 by other authors
点击查看摘要
Abstract:The stark contrast in the design philosophy of an event camera makes it particularly ideal for operating under high-speed, high dynamic range and low-light conditions, where standard cameras underperform. Nonetheless, event cameras still suffer from some amount of motion blur, especially under these challenging conditions, in contrary to what most think. This is attributed to the limited bandwidth of the event sensor pixel, which is mostly proportional to the light intensity. Thus, to ensure that event cameras can truly excel in such conditions where it has an edge over standard cameras, it is crucial to account for event motion blur in downstream applications, especially reconstruction. However, none of the recent works on reconstructing Neural Radiance Fields (NeRFs) from events, nor event simulators, have considered the full effects of event motion blur. To this end, we propose, Deblur e-NeRF, a novel method to directly and effectively reconstruct blur-minimal NeRFs from motion-blurred events generated under high-speed motion or low-light conditions. The core component of this work is a physically-accurate pixel bandwidth model proposed to account for event motion blur under arbitrary speed and lighting conditions. We also introduce a novel threshold-normalized total variation loss to improve the regularization of large textureless patches. Experiments on real and novel realistically simulated sequences verify our effectiveness. Our code, event simulator and synthetic event dataset will be open-sourced.
[CV-27] LLM4Brain: Training a Large Language Model for Brain Video Understanding ECCV2024
链接: https://arxiv.org/abs/2409.17987 作者: Ruizhe Zheng,Lichao Sun 关键词-EN: limited data availability, poses significant challenges, subjects poses significant, functional MRI, Decoding visual-semantic information 类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: ECCV2024 Workshop
点击查看摘要
Abstract:Decoding visual-semantic information from brain signals, such as functional MRI (fMRI), across different subjects poses significant challenges, including low signal-to-noise ratio, limited data availability, and cross-subject variability. Recent advancements in large language models (LLMs) show remarkable effectiveness in processing multimodal information. In this study, we introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli. Specifically, we employ fine-tuning techniques on an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli. Subsequently, these representations are mapped to textual modality by LLM. In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses. Our proposed method achieves good results using various quantitative semantic metrics, while yielding similarity with ground-truth information.
[CV-28] BlinkTrack: Feature Tracking over 100 FPS via Events and Images
Abstract:Feature tracking is crucial for, structure from motion (SFM), simultaneous localization and mapping (SLAM), object tracking and various computer vision tasks. Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with RGB images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches. This approach improves single-modality tracking, resolves ambiguities, and supports asynchronous data fusion. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing event-based methods, exceeding 100 FPS with preprocessed event data and 80 FPS with multi-modality data.
[CV-29] HydraViT: Stacking Heads for a Scalable ViT
Abstract:The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, requires multiple models of different sizes. However, this approach has limitations, such as training and storing each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly changing the size of the embedded dimensions throughout each layer and their corresponding number of attention heads in MHA during training, HydraViT induces multiple subnetworks. Thereby, HydraViT achieves adaptability across a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. HydraViT achieves up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy with the same throughput on ImageNet-1K compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time. Source code available at this https URL.
[CV-30] Cross-Modality Attack Boosted by Gradient-Evolutionary Multiform Optimization
Abstract:In recent years, despite significant advancements in adversarial attack research, the security challenges in cross-modal scenarios, such as the transferability of adversarial attacks between infrared, thermal, and RGB images, have been overlooked. These heterogeneous image modalities collected by different hardware devices are widely prevalent in practical applications, and the substantial differences between modalities pose significant challenges to attack transferability. In this work, we explore a novel cross-modal adversarial attack strategy, termed multiform attack. We propose a dual-layer optimization framework based on gradient-evolution, facilitating efficient perturbation transfer between modalities. In the first layer of optimization, the framework utilizes image gradients to learn universal perturbations within each modality and employs evolutionary algorithms to search for shared perturbations with transferability across different modalities through secondary optimization. Through extensive testing on multiple heterogeneous datasets, we demonstrate the superiority and robustness of Multiform Attack compared to existing techniques. This work not only enhances the transferability of cross-modal adversarial attacks but also provides a new perspective for understanding security vulnerabilities in cross-modal systems.
[CV-31] CNCA: Toward Customizable and Natural Generation of Adversarial Camouflage for Vehicle Detectors
链接: https://arxiv.org/abs/2409.17963 作者: Linye Lyu,Jiawei Zhou,Daojing He,Yu Li 关键词-EN: Prior works, detectors mainly focus, effectiveness and robustness, vehicle detectors, Prior 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Prior works on physical adversarial camouflage against vehicle detectors mainly focus on the effectiveness and robustness of the attack. The current most successful methods optimize 3D vehicle texture at a pixel level. However, this results in conspicuous and attention-grabbing patterns in the generated camouflage, which humans can easily identify. To address this issue, we propose a Customizable and Natural Camouflage Attack (CNCA) method by leveraging an off-the-shelf pre-trained diffusion model. By sampling the optimal texture image from the diffusion model with a user-specific text prompt, our method can generate natural and customizable adversarial camouflage while maintaining high attack performance. With extensive experiments on the digital and physical worlds and user studies, the results demonstrate that our proposed method can generate significantly more natural-looking camouflage than the state-of-the-art baselines while achieving competitive attack performance. Our code is available at \hrefthis https URLthis https URL
[CV-32] he Hard Positive Truth about Vision-Language Compositionality ECCV2024
链接: https://arxiv.org/abs/2409.17958 作者: Amita Kamath,Cheng-Yu Hsieh,Kai-Wei Chang,Ranjay Krishna 关键词-EN: hard, CLIP, hard positives, hard negatives, vision-language models 类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024
点击查看摘要
Abstract:Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model’s ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated – because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP’s performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related “positive” concepts.
[CV-33] Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition
链接: https://arxiv.org/abs/2409.17951 作者: Xinpeng Yin,Wenming Cao 关键词-EN: skeleton-based action recognition, self-supervised skeleton-based action, mask reconstruction paradigm, enhancing model refinement, action recognition 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages,6 figures,IEEE Trans
点击查看摘要
Abstract:In self-supervised skeleton-based action recognition, the mask reconstruction paradigm is gaining interest in enhancing model refinement and robustness through effective masking. However, previous works primarily relied on a single masking criterion, resulting in the model overfitting specific features and overlooking other effective information. In this paper, we introduce a hierarchy and attention guided cross-masking framework (HA-CM) that applies masking to skeleton sequences from both spatial and temporal perspectives. Specifically, in spatial graphs, we utilize hyperbolic space to maintain joint distinctions and effectively preserve the hierarchical structure of high-dimensional skeletons, employing joint hierarchy as the masking criterion. In temporal flows, we substitute traditional distance metrics with the global attention of joints for masking, addressing the convergence of distances in high-dimensional space and the lack of a global perspective. Additionally, we incorporate cross-contrast loss based on the cross-masking framework into the loss function to enhance the model’s learning of instance-level features. HA-CM shows efficiency and universality on three public large-scale datasets, NTU-60, NTU-120, and PKU-MMD. The source code of our HA-CM is available at this https URL.
链接: https://arxiv.org/abs/2409.17941 作者: Filippo Bartolucci,Iacopo Masi,Giuseppe Lisanti 关键词-EN: received considerable attention, Generative Models, localization have received, received considerable, considerable attention 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Image manipulation detection and localization have received considerable attention from the research community given the blooming of Generative Models (GMs). Detection methods that follow a passive approach may overfit to specific GMs, limiting their application in real-world scenarios, due to the growing diversity of generative models. Recently, approaches based on a proactive framework have shown the possibility of dealing with this limitation. However, these methods suffer from two main limitations, which raises concerns about potential vulnerabilities: i) the manipulation detector is not robust to noise and hence can be easily fooled; ii) the fact that they rely on fixed perturbations for image protection offers a predictable exploit for malicious attackers, enabling them to reverse-engineer and evade detection. To overcome this issue we propose PADL, a new solution able to generate image-specific perturbations using a symmetric scheme of encoding and decoding based on cross-attention, which drastically reduces the possibility of reverse engineering, even when evaluated with adaptive attack [31]. Additionally, PADL is able to pinpoint manipulated areas, facilitating the identification of specific regions that have undergone alterations, and has more generalization power than prior art on held-out generative models. Indeed, although being trained only on an attribute manipulation GAN model [15], our method generalizes to a range of unseen models with diverse architectural designs, such as StarGANv2, BlendGAN, DiffAE, StableDiffusion and StableDiffusionXL. Additionally, we introduce a novel evaluation protocol, which offers a fair evaluation of localisation performance in function of detection accuracy and better captures real-world scenarios.
[CV-35] Neural Light Spheres for Implicit Image Stitching and View Synthesis
链接: https://arxiv.org/abs/2409.17924 作者: Ilya Chugunov,Amogh Joshi,Kiran Murthy,Francois Bleibel,Felix Heide 关键词-EN: panorama paradoxically remains, mobile camera applications, modern mobile camera, challenging to display, cellphone screen 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project site: this https URL
点击查看摘要
Abstract:Challenging to capture, and challenging to display on a cellphone screen, the panorama paradoxically remains both a staple and underused feature of modern mobile camera applications. In this work we address both of these challenges with a spherical neural light field model for implicit panoramic image stitching and re-rendering; able to accommodate for depth parallax, view-dependent lighting, and local scene motion and color changes during capture. Fit during test-time to an arbitrary path panoramic video capture – vertical, horizontal, random-walk – these neural light spheres jointly estimate the camera path and a high-resolution scene reconstruction to produce novel wide field-of-view projections of the environment. Our single-layer model avoids expensive volumetric sampling, and decomposes the scene into compact view-dependent ray offset and color components, with a total model size of 80 MB per scene, and real-time (50 FPS) rendering at 1080p resolution. We demonstrate improved reconstruction quality over traditional image stitching and radiance field methods, with significantly higher tolerance to scene motion and non-ideal capture settings.
[CV-36] Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation
Abstract:Personalized text-to-image generation methods can generate customized images based on the reference images, which have garnered wide research interest. Recent methods propose a finetuning-free approach with a decoupled cross-attention mechanism to generate personalized images requiring no test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism encounters the object confusion problem and fails to map each reference image to its corresponding object, thereby seriously limiting its scope of application. To address the object confusion problem, in this work we investigate the relevance of different positions of the latent image features to the target object in diffusion model, and accordingly propose a weighted-merge method to merge multiple reference image features into the corresponding objects. Next, we integrate this weighted-merge method into existing pre-trained models and continue to train the model on a multi-object dataset constructed from the open-sourced SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score to estimate the image quality for the selection of high-quality training samples. Furthermore, our weighted-merge training framework can be employed on single-object generation when a single object has multiple reference images. The experiments verify that our method achieves superior performance to the state-of-the-arts on the Concept101 dataset and DreamBooth dataset of multi-object personalized image generation, and remarkably improves the performance on single-object personalized image generation. Our code is available at this https URL.
[CV-37] WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians
Abstract:While style transfer techniques have been well-developed for 2D image stylization, the extension of these methods to 3D scenes remains relatively unexplored. Existing approaches demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover’s Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Additionally, we decompose the scene stylization problem into smaller chunks to enhance efficiency. This paradigm shift reframes stylization from a pure generative process driven by latent space losses to an explicit matching of distributions between two Gaussian representations. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without necessitating any training, as it relies solely on optimization-based techniques. See our project page for additional results and source code: \hrefthis https URLthis https URL .
[CV-38] LKA-ReID:Vehicle Re-Identification with Large Kernel Attention ICASSP2025
链接: https://arxiv.org/abs/2409.17908 作者: Xuezhi Xiang,Zhushan Ma,Lei Zhang,Denis Ombati,Himaloy Himu,Xiantong Zhen 关键词-EN: smart city infrastructure, intelligent transportation systems, important research field, Vehicle Re-ID technology, Vehicle Re-ID 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper is under consideration at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
点击查看摘要
Abstract:With the rapid development of intelligent transportation systems and the popularity of smart city infrastructure, Vehicle Re-ID technology has become an important research field. The vehicle Re-ID task faces an important challenge, which is the high similarity between different vehicles. Existing methods use additional detection or segmentation models to extract differentiated local features. However, these methods either rely on additional annotations or greatly increase the computational cost. Using attention mechanism to capture global and local features is crucial to solve the challenge of high similarity between classes in vehicle Re-ID tasks. In this paper, we propose LKA-ReID with large kernel attention. Specifically, the large kernel attention (LKA) utilizes the advantages of self-attention and also benefits from the advantages of convolution, which can extract the global and local features of the vehicle more comprehensively. We also introduce hybrid channel attention (HCA) combines channel attention with spatial information, so that the model can better focus on channels and feature regions, and ignore background and other disturbing information. Experiments on VeRi-776 dataset demonstrated the effectiveness of LKA-ReID, with mAP reaches 86.65% and Rank-1 reaches 98.03%.
[CV-39] Self-supervised Monocular Depth Estimation with Large Kernel Attention ICASSP2025
链接: https://arxiv.org/abs/2409.17895 作者: Xuezhi Xiang,Yao Wang,Lei Zhang,Denis Ombati,Himaloy Himu,Xiantong Zhen 关键词-EN: labeled training data, Self-supervised monocular depth, training data, promising approach, rely on labeled 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper is under consideration at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
点击查看摘要
Abstract:Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies to estimate depth accurately. However, Transformer treats 2D image features as 1D sequences, and positional encoding somewhat mitigates the loss of spatial information between different feature blocks, tending to overlook channel features, which limit the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimension structure of features while maintaining feature channel adaptivity. In addition, we introduce a up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.
[CV-40] Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target Detection ECCV2024
链接: https://arxiv.org/abs/2409.17886 作者: Andrea Toaiari,Vittorio Murino,Marco Cristani,Cigdem Beyan 关键词-EN: Gaze Target Detection, Gaze Target, external viewpoint, challenging task, Target Detection 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in the T-CAP workshop at ECCV 2024
点击查看摘要
Abstract:Gaze Target Detection (GTD), i.e., determining where a person is looking within a scene from an external viewpoint, is a challenging task, particularly in 3D space. Existing approaches heavily rely on analyzing the person’s appearance, primarily focusing on their face to predict the gaze target. This paper presents a novel approach to tackle this problem by utilizing the person’s upper-body pose and available depth maps to extract a 3D gaze direction and employing a multi-stage or an end-to-end pipeline to predict the gazed target. When predicted accurately, the human body pose can provide valuable information about the head pose, which is a good approximation of the gaze direction, as well as the position of the arms and hands, which are linked to the activity the person is performing and the objects they are likely focusing on. Consequently, in addition to performing gaze estimation in 3D, we are also able to perform GTD simultaneously. We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset without requiring images of the person’s face, thus promoting privacy preservation in various application contexts. The code is available at this https URL.
[CV-41] Self-Distilled Depth Refinement with Noisy Poisson Fusion NEURIPS2024
Abstract:Depth refinement aims to infer high-resolution depth with fine-grained edges and details, refining low-resolution results of depth estimation models. The prevailing methods adopt tile-based manners by merging numerous patches, which lacks efficiency and produces inconsistency. Besides, prior arts suffer from fuzzy depth boundaries and limited generalizability. Analyzing the fundamental reasons for these limitations, we model depth refinement as a noisy Poisson fusion problem with local inconsistency and edge deformation noises. We propose the Self-distilled Depth Refinement (SDDR) framework to enforce robustness against the noises, which mainly consists of depth edge representation and edge-based guidance. With noisy depth predictions as input, SDDR generates low-noise depth edge representations as pseudo-labels by coarse-to-fine self-distillation. Edge-based guidance with edge-guided gradient loss and edge-based fusion loss serves as the optimization objective equivalent to Poisson fusion. When depth maps are better refined, the labels also become more noise-free. Our model can acquire strong robustness to the noises, achieving significant improvements in accuracy, edge quality, efficiency, and generalizability on five different benchmarks. Moreover, directly training another model with edge labels produced by SDDR brings improvements, suggesting that our method could help with training robust refinement models in future works.
[CV-42] Visualization of Age Distributions as Elements of Medical Data-Stories
链接: https://arxiv.org/abs/2409.17854 作者: Sophia Dowlatabadi,Bernhard Preim,Monique Meuschke 关键词-EN: including medicine, age distributions, enhance health communication, Abstract, distributions are crucial 类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 11 pages, 7 figures
点击查看摘要
Abstract:In various fields, including medicine, age distributions are crucial. Despite widespread media coverage of health topics, there remains a need to enhance health communication. Narrative medical visualization is promising for improving information comprehension and retention. This study explores the most effective ways to present age distributions of diseases through narrative visualizations. We conducted a thorough analysis of existing visualizations, held workshops with a broad audience, and reviewed relevant literature. From this, we identified design choices focusing on comprehension, aesthetics, engagement, and memorability. We specifically tested three pictogram variants: pictograms as bars, stacked pictograms, and annotations. After evaluating 18 visualizations with 72 participants and three expert reviews, we determined that annotations were most effective for comprehension and aesthetics. However, traditional bar charts were preferred for engagement, and other variants were more memorable. The study provides a set of design recommendations based on these insights.
[CV-43] A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts ECCV2024
链接: https://arxiv.org/abs/2409.17851 作者: Aurel Pjetri(1 and 2),Stefano Caprasecca(1),Leonardo Taccari(1),Matteo Simoncini(1),Henrique Piñeiro Monteagudo(1 and 3),Walter Wallace(1),Douglas Coimbra de Andrade(4),Francesco Sambo(1),Andrew David Bagdanov(1) ((1) Verizon Connect Research, Florence, Italy, (2) Department of Information Engineering, University of Florence, Florence, Italy, (3) University of Bologna, Bologna, Italy, (4) SENAI Institute of Innovation, Rio de Janeiro, Brazil) 关键词-EN: Monocular depth estimation, computer vision applications, depth estimation, Monocular depth, critical task 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 5 figures. Accepted at ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD)
点击查看摘要
Abstract:Monocular depth estimation is a critical task for autonomous driving and many other computer vision applications. While significant progress has been made in this field, the effects of viewpoint shifts on depth estimation models remain largely underexplored. This paper introduces a novel dataset and evaluation methodology to quantify the impact of different camera positions and orientations on monocular depth estimation performance. We propose a ground truth strategy based on homography estimation and object detection, eliminating the need for expensive lidar sensors. We collect a diverse dataset of road scenes from multiple viewpoints and use it to assess the robustness of a modern depth estimation model to geometric shifts. After assessing the validity of our strategy on a public dataset, we provide valuable insights into the limitations of current models and highlight the importance of considering viewpoint variations in real-world applications.
[CV-44] Unsupervised Learning Based Multi-Scale Exposure Fusion
链接: https://arxiv.org/abs/2409.17830 作者: Chaobing Zheng,Shiqian Wu,Zhenggguo Li 关键词-EN: low dynamic range, high dynamic range, Unsupervised learning based, higher quality LDR, dynamic range 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages
点击查看摘要
Abstract:Unsupervised learning based multi-scale exposure fusion (ULMEF) is efficient for fusing differently exposed low dynamic range (LDR) images into a higher quality LDR image for a high dynamic range (HDR) scene. Unlike supervised learning, loss functions play a crucial role in the ULMEF. In this paper, novel loss functions are proposed for the ULMEF and they are defined by using all the images to be fused and other differently exposed images from the same HDR scene. The proposed loss functions can guide the proposed ULMEF to learn more reliable information from the HDR scene than existing loss functions which are defined by only using the set of images to be fused. As such, the quality of the fused image is significantly improved. The proposed ULMEF also adopts a multi-scale strategy that includes a multi-scale attention module to effectively preserve the scene depth and local contrast in the fused image. Meanwhile, the proposed ULMEF can be adopted to achieve exposure interpolation and exposure extrapolation. Extensive experiments show that the proposed ULMEF algorithm outperforms state-of-the-art exposure fusion algorithms.
[CV-45] Kendalls tau Coefficient for Logits Distillation
Abstract:Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the student model’s output to match the soft labels provided by the teacher model exactly. However, sometimes the optimization direction of the KL divergence loss is not always aligned with the task loss, where a smaller KL divergence could lead to erroneous predictions that diverge from the soft labels. This limitation often results in suboptimal optimization for the student. Moreover, even under temperature scaling, the KL divergence loss function tends to overly focus on the larger-valued channels in the logits, disregarding the rich inter-class information provided by the multitude of smaller-valued channels. This hard constraint proves too challenging for lightweight students, hindering further knowledge distillation. To address this issue, we propose a plug-and-play ranking loss based on Kendall’s \tau coefficient, called Rank-Kendall Knowledge Distillation (RKKD). RKKD balances the attention to smaller-valued channels by constraining the order of channel values in student logits, providing more inter-class relational information. The rank constraint on the top-valued channels helps avoid suboptimal traps during optimization. We also discuss different differentiable forms of Kendall’s \tau coefficient and demonstrate that the proposed ranking loss function shares a consistent optimization objective with the KL divergence. Extensive experiments on the CIFAR-100 and ImageNet datasets show that our RKKD can enhance the performance of various knowledge distillation baselines and offer broad improvements across multiple teacher-student architecture combinations.
[CV-46] Cascade Prompt Learning for Vision-Language Model Adaptation ECCV2024
Abstract:Prompt learning has surfaced as an effective approach to enhance the performance of Vision-Language Models (VLMs) like CLIP when applied to downstream tasks. However, current learnable prompt tokens are primarily used for the single phase of adapting to tasks (i.e., adapting prompt), easily leading to overfitting risks. In this work, we propose a novel Cascade Prompt Learning CasPL framework to enable prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompt) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first boosting prompt is crafted to extract domain-general knowledge from a senior larger CLIP teacher model by aligning their predicted logits using extensive unlabeled domain images. The second adapting prompt is then cascaded with the frozen first set to fine-tune the downstream tasks, following the approaches employed in prior research. In this manner, CasPL can effectively capture both domain-general and task-specific representations into explicitly different gradual groups of prompts, thus potentially alleviating overfitting issues in the target domain. It’s worth noting that CasPL serves as a plug-and-play module that can seamlessly integrate into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLM models in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at: this https URL.
[CV-47] Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs
链接: https://arxiv.org/abs/2409.17792 作者: Xinya Shu,Yu Li,Dongwei Ren,Xiaohe Wu,Jin Li,Wangmeng Zuo 关键词-EN: image defocus deblurring, acquiring well-aligned training, defocus deblurring, single image defocus, defocus 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The source code and dataset are available at this https URL
点击查看摘要
Abstract:For single image defocus deblurring, acquiring well-aligned training pairs (or training triplets), i.e., a defocus blurry image, an all-in-focus sharp image (and a defocus blur map), is an intricate task for the development of deblurring models. Existing image defocus deblurring methods typically rely on training data collected by specialized imaging equipment, presupposing that these pairs or triplets are perfectly aligned. However, in practical scenarios involving the collection of real-world data, direct acquisition of training triplets is infeasible, and training pairs inevitably encounter spatial misalignment issues. In this work, we introduce a reblurring-guided learning framework for single image defocus deblurring, enabling the learning of a deblurring network even with misaligned training pairs. Specifically, we first propose a baseline defocus deblurring network that utilizes spatially varying defocus blur map as degradation prior to enhance the deblurring performance. Then, to effectively learn the baseline defocus deblurring network with misaligned training pairs, our reblurring module ensures spatial consistency between the deblurred image, the reblurred image and the input blurry image by reconstructing spatially variant isotropic blur kernels. Moreover, the spatially variant blur derived from the reblurring module can serve as pseudo supervision for defocus blur map during training, interestingly transforming training pairs into training triplets. Additionally, we have collected a new dataset specifically for single image defocus deblurring (SDD) with typical misalignments, which not only substantiates our proposed method but also serves as a benchmark for future research.
[CV-48] CASPFormer: Trajectory Prediction from BEV Images with Deformable Attention ICPR2024
Abstract:Motion prediction is an important aspect for Autonomous Driving (AD) and Advance Driver Assistance Systems (ADAS). Current state-of-the-art motion prediction methods rely on High Definition (HD) maps for capturing the surrounding context of the ego vehicle. Such systems lack scalability in real-world deployment as HD maps are expensive to produce and update in real-time. To overcome this issue, we propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from rasterized Bird-Eye-View (BEV) images. Our system can be integrated with any upstream perception module that is capable of generating BEV images. Moreover, CASPFormer directly decodes vectorized trajectories without any postprocessing. Trajectories are decoded recurrently using deformable attention, as it is computationally efficient and provides the network with the ability to focus its attention on the important spatial locations of the BEV images. In addition, we also address the issue of mode collapse for generating multiple scene-consistent trajectories by incorporating learnable mode queries. We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art across multiple metrics
[CV-49] aming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs NEURIPS2024
链接: https://arxiv.org/abs/2409.17778 作者: Qinpeng Cui,Yixuan Liu,Xinyi Zhang,Qiqi Bao,Zhongdao Wang,Qingmin Liao,Li Wang,Tian Lu,Emad Barsoum 关键词-EN: attracted substantial interest, substantial interest due, image restoration capabilities, powerful image restoration, Diffusion-based image super-resolution 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper is accepted by NeurIPS 2024
点击查看摘要
Abstract:Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate a dozens of forward passes starting from random noises, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed as DoS-SDEs. This advancement leads to the fast and customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion prior based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency. Code: this https URL.
[CV-50] Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
链接: https://arxiv.org/abs/2409.17777 作者: Raja Kumar,Raghav Singhal,Pranamya Kulkarni,Deval Mehta,Kshitij Jadhav 关键词-EN: shown remarkable success, Deep multimodal learning, Deep multimodal, leveraging contrastive learning, Mixup-based contrastive loss 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: RK and RS contributed equally to this work, 20 Pages, 8 Figures, 9 Tables
点击查看摘要
Abstract:Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.
[CV-51] UNICORN: A Deep Learning Model for Integrating Multi-Stain Data in Histopathology
链接: https://arxiv.org/abs/2409.17775 作者: Valentin Koch,Sabine Bauer,Valerio Luppberger,Michael Joner,Heribert Schunkert,Julia A. Schnabel,Moritz von Scheidt,Carsten Marr 关键词-EN: deep learning poses, poses a significant, Background, data, digital histopathology 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Background: The integration of multi-stain histopathology images through deep learning poses a significant challenge in digital histopathology. Current multi-modal approaches struggle with data heterogeneity and missing data. This study aims to overcome these limitations by developing a novel transformer model for multi-stain integration that can handle missing data during training as well as inference. Methods: We propose UNICORN (UNiversal modality Integration Network for CORonary classificatioN) a multi-modal transformer capable of processing multi-stain histopathology for atherosclerosis severity class prediction. The architecture comprises a two-stage, end-to-end trainable model with specialized modules utilizing transformer self-attention blocks. The initial stage employs domain-specific expert modules to extract features from each modality. In the subsequent stage, an aggregation expert module integrates these features by learning the interactions between the different data modalities. Results: Evaluation was performed using a multi-class dataset of atherosclerotic lesions from the Munich Cardiovascular Studies Biobank (MISSION), using over 4,000 paired multi-stain whole slide images (WSIs) from 170 deceased individuals on 7 prespecified segments of the coronary tree, each stained according to four histopathological protocols. UNICORN achieved a classification accuracy of 0.67, outperforming other state-of-the-art models. The model effectively identifies relevant tissue phenotypes across stainings and implicitly models disease progression. Conclusion: Our proposed multi-modal transformer model addresses key challenges in medical data analysis, including data heterogeneity and missing modalities. Explainability and the model’s effectiveness in predicting atherosclerosis progression underscores its potential for broader applications in medical research.
[CV-52] Confidence intervals uncovered: Are we ready for real-world medical imaging AI? MICCAI2024
链接: https://arxiv.org/abs/2409.17763 作者: Evangelia Christodoulou,Annika Reinke,Rola Houhou,Piotr Kalinowski,Selen Erkan,Carole H. Sudre,Ninon Burgos,Sofiène Boutaj,Sophie Loizillon,Maëlys Solal,Nicola Rieke,Veronika Cheplygina,Michela Antonelli,Leon D. Mayer,Minu D. Tizabi,M. Jorge Cardoso,Amber Simpson,Paul F. Jäger,Annette Kopp-Schneider,Gaël Varoquaux,Olivier Colliot,Lena Maier-Hein 关键词-EN: Medical imaging, transformation of healthcare, imaging is spearheading, Performance, Medical 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted at MICCAI 2024 conference
点击查看摘要
Abstract:Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determine which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03 which is three times larger than the median performance gap between the first and second ranked method. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.
[CV-53] xt Image Generation for Low-Resource Languages with Dual Translation Learning
链接: https://arxiv.org/abs/2409.17747 作者: Chihiro Noguchi,Shun Fukuda,Shoichiro Mihara,Masao Yamanaka 关键词-EN: frequently faces challenges, faces challenges due, training datasets derived, languages frequently faces, text images 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 11 figures
点击查看摘要
Abstract:Scene text recognition in low-resource languages frequently faces challenges due to the limited availability of training datasets derived from real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach utilizes a diffusion model that is conditioned on binary states: synthetic'' and real.‘’ The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images, based on the binary states. This approach not only effectively differentiates between the two domains but also facilitates the model’s explicit recognition of characters in the target language. Furthermore, to enhance the accuracy and variety of generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experimental results demonstrate that the text images generated by our proposed framework can significantly improve the performance of scene text recognition models for low-resource languages.
[CV-54] AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status
Abstract:Diffusion models have made compelling progress on facilitating high-throughput daily production. Nevertheless, the appealing customized requirements are remain suffered from instance-level finetuning for authentic fidelity. Prior zero-shot customization works achieve the semantic consistence through the condensed injection of identity features, while addressing detailed low-level signatures through complex model configurations and subject-specific fabrications, which significantly break the statistical coherence within the overall system and limit the applicability across various scenarios. To facilitate the generic signature concentration with rectified efficiency, we present \textbfAnyLogo, a zero-shot region customizer with remarkable detail consistency, building upon the symbiotic diffusion system with eliminated cumbersome designs. Streamlined as vanilla image generation, we discern that the rigorous signature extraction and creative content generation are promisingly compatible and can be systematically recycled within a single denoising model. In place of the external configurations, the gemini status of the denoising model promote the reinforced subject transmission efficiency and disentangled semantic-signature space with continuous signature decoration. Moreover, the sparse recycling paradigm is adopted to prevent the duplicated risk with compressed transmission quota for diversified signature stimulation. Extensive experiments on constructed logo-level benchmarks demonstrate the effectiveness and practicability of our methods.
[CV-55] Neural Implicit Representation for Highly Dynamic LiDAR Mapping and Odometry
链接: https://arxiv.org/abs/2409.17729 作者: Qi Zhang,He Wang,Ru Li,Wenbin Li 关键词-EN: Simultaneous Localization, Neural Radiance Fields, Recent advancements, advancements in Simultaneous, LiDAR-based techniques 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Recent advancements in Simultaneous Localization and Mapping (SLAM) have increasingly highlighted the robustness of LiDAR-based techniques. At the same time, Neural Radiance Fields (NeRF) have introduced new possibilities for 3D scene reconstruction, exemplified by SLAM systems. Among these, NeRF-LOAM has shown notable performance in NeRF-based SLAM applications. However, despite its strengths, these systems often encounter difficulties in dynamic outdoor environments due to their inherent static assumptions. To address these limitations, this paper proposes a novel method designed to improve reconstruction in highly dynamic outdoor scenes. Based on NeRF-LOAM, the proposed approach consists of two primary components. First, we separate the scene into static background and dynamic foreground. By identifying and excluding dynamic elements from the mapping process, this segmentation enables the creation of a dense 3D map that accurately represents the static background only. The second component extends the octree structure to support multi-resolution representation. This extension not only enhances reconstruction quality but also aids in the removal of dynamic objects identified by the first module. Additionally, Fourier feature encoding is applied to the sampled points, capturing high-frequency information and leading to more complete reconstruction results. Evaluations on various datasets demonstrate that our method achieves more competitive results compared to current state-of-the-art approaches.
[CV-56] AlterMOMA: Fusion Redundancy Pruning for Camera-LiDAR Fusion Models with Alternative Modality Masking NEURIPS2024
Abstract:Camera-LiDAR fusion models significantly enhance perception performance in autonomous driving. The fusion mechanism leverages the strengths of each modality while minimizing their weaknesses. Moreover, in practice, camera-LiDAR fusion models utilize pre-trained backbones for efficient training. However, we argue that directly loading single-modal pre-trained camera and LiDAR backbones into camera-LiDAR fusion models introduces similar feature redundancy across modalities due to the nature of the fusion mechanism. Unfortunately, existing pruning methods are developed explicitly for single-modal models, and thus, they struggle to effectively identify these specific redundant parameters in camera-LiDAR fusion models. In this paper, to address the issue above on camera-LiDAR fusion models, we propose a novelty pruning framework Alternative Modality Masking Pruning (AlterMOMA), which employs alternative masking on each modality and identifies the redundant parameters. Specifically, when one modality parameters are masked (deactivated), the absence of features from the masked backbone compels the model to reactivate previous redundant features of the other modality backbone. Therefore, these redundant features and relevant redundant parameters can be identified via the reactivation process. The redundant parameters can be pruned by our proposed importance score evaluation function, Alternative Evaluation (AlterEva), which is based on the observation of the loss changes when certain modality parameters are activated and deactivated. Extensive experiments on the nuScene and KITTI datasets encompassing diverse tasks, baseline models, and pruning algorithms showcase that AlterMOMA outperforms existing pruning methods, attaining state-of-the-art performance.
[CV-57] Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications
链接: https://arxiv.org/abs/2409.17727 作者: Nghia Nguyen,Minh Nhat Vu,Tung D. Ta,Baoru Huang,Thieu Vo,Ngan Le,Anh Nguyen 关键词-EN: extracting meaningful features, played a key, key role, role in extracting, extracting meaningful 类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages
点击查看摘要
Abstract:Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (~7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP’s strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.
[CV-58] Scene Understanding in Pick-and-Place Tasks: Analyzing Transformations Between Initial and Final Scenes
链接: https://arxiv.org/abs/2409.17720 作者: Seraj Ghasemi,Hamed Hosseini,MohammadHossein Koosheshi,Mehdi Tale Masouleh,Ahmad Kalhor 关键词-EN: robots increasingly collaborating, robotic systems capable, pick and place, place tasks, robots increasingly 类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Conference Paper, ICEE 2024, 7 pages, 5 figures
点击查看摘要
Abstract:With robots increasingly collaborating with humans in everyday tasks, it is important to take steps toward robotic systems capable of understanding the environment. This work focuses on scene understanding to detect pick and place tasks given initial and final images from the scene. To this end, a dataset is collected for object detection and pick and place task detection. A YOLOv5 network is subsequently trained to detect the objects in the initial and final scenes. Given the detected objects and their bounding boxes, two methods are proposed to detect the pick and place tasks which transform the initial scene into the final scene. A geometric method is proposed which tracks objects’ movements in the two scenes and works based on the intersection of the bounding boxes which moved within scenes. Contrarily, the CNN-based method utilizes a Convolutional Neural Network to classify objects with intersected bounding boxes into 5 classes, showing the spatial relationship between the involved objects. The performed pick and place tasks are then derived from analyzing the experiments with both scenes. Results show that the CNN-based method, using a VGG16 backbone, outscores the geometric method by roughly 12 percentage points in certain scenarios, with an overall success rate of 84.3%.
链接: https://arxiv.org/abs/2409.17717 作者: Dimitrios Kollias,Chunchang Shao,Odysseus Kaloidas,Ioannis Patras 关键词-EN: Action Unit Detection, integrating Face Localization, facial behavior analysis, Face Localization, Unit Detection 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In this paper, we introduce Behavior4All, a comprehensive, open-source toolkit for in-the-wild facial behavior analysis, integrating Face Localization, Valence-Arousal Estimation, Basic Expression Recognition and Action Unit Detection, all within a single framework. Available in both CPU-only and GPU-accelerated versions, Behavior4All leverages 12 large-scale, in-the-wild datasets consisting of over 5 million images from diverse demographic groups. It introduces a novel framework that leverages distribution matching and label co-annotation to address tasks with non-overlapping annotations, encoding prior knowledge of their relatedness. In the largest study of its kind, Behavior4All outperforms both state-of-the-art and toolkits in overall performance as well as fairness across all databases and tasks. It also demonstrates superior generalizability on unseen databases and on compound expression recognition. Finally, Behavior4All is way times faster than other toolkits.
[CV-60] MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling NEURIPS2024
Abstract:Motion generation from discrete quantization offers many advantages over continuous regression, but at the cost of inevitable approximation errors. Previous methods usually quantize the entire body pose into one code, which not only faces the difficulty in encoding all joints within one vector but also loses the spatial relationship between different joints. Differently, in this work we quantize each individual joint into one vector, which i) simplifies the quantization process as the complexity associated with a single joint is markedly lower than that of the entire pose; ii) maintains a spatial-temporal structure that preserves both the spatial relationships among joints and the temporal movement patterns; iii) yields a 2D token map, which enables the application of various 2D operations widely used in 2D images. Grounded in the 2D motion quantization, we build a spatial-temporal modeling framework, where 2D joint VQVAE, temporal-spatial 2D masking technique, and spatial-temporal 2D attention are proposed to take advantage of spatial-temporal signals among the 2D tokens. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a 26.6% decrease of FID on HumanML3D and a 29.9% decrease on KIT-ML.
[CV-61] Dark Miner: Defend against unsafe generation for text-to-image diffusion models
链接: https://arxiv.org/abs/2409.17682 作者: Zheling Meng,Bo Peng,Xiaochuan Jin,Yue Jiang,Jing Dong,Wei Wang,Tieniu Tan 关键词-EN: large-scale training data, unfiltered large-scale training, unsafe generation due, shocking images, due to unfiltered 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Text-to-image diffusion models have been demonstrated with unsafe generation due to unfiltered large-scale training data, such as violent, sexual, and shocking images, necessitating the erasure of unsafe concepts. Most existing methods focus on modifying the generation probabilities conditioned on the texts containing unsafe descriptions. However, they fail to guarantee safe generation for unseen texts in the training phase, especially for the prompts from adversarial attacks. In this paper, we re-analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probabilities of unsafe generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. It greedily mines embeddings with maximum generation probabilities of unsafe concepts and reduces unsafe generation more effectively. In the experiments, we evaluate its performance on two inappropriate concepts, two objects, and two styles. Compared with 6 previous state-of-the-art methods, our method achieves better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model’s native generation capability. Our code will be available on GitHub.
[CV-62] Event-based Stereo Depth Estimation: A Survey
链接: https://arxiv.org/abs/2409.17680 作者: Suman Ghosh,Guillermo Gallego 关键词-EN: Stereopsis has widespread, widespread appeal, appeal in robotics, living beings perceive, high temporal 类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 28 pages, 20 figures, 7 tables
点击查看摘要
Abstract:Stereopsis has widespread appeal in robotics as it is the predominant way by which living beings perceive depth to navigate our 3D world. Event cameras are novel bio-inspired sensors that detect per-pixel brightness changes asynchronously, with very high temporal resolution and high dynamic range, enabling machine perception in high-speed motion and broad illumination conditions. The high temporal precision also benefits stereo matching, making disparity (depth) estimation a popular research area for event cameras ever since its inception. Over the last 30 years, the field has evolved rapidly, from low-latency, low-power circuit design to current deep learning (DL) approaches driven by the computer vision community. The bibliography is vast and difficult to navigate for non-experts due its highly interdisciplinary nature. Past surveys have addressed distinct aspects of this topic, in the context of applications, or focusing only on a specific class of techniques, but have overlooked stereo datasets. This survey provides a comprehensive overview, covering both instantaneous stereo and long-term methods suitable for simultaneous localization and mapping (SLAM), along with theoretical and empirical comparisons. It is the first to extensively review DL methods as well as stereo datasets, even providing practical suggestions for creating new benchmarks to advance the field. The main advantages and challenges faced by event-based stereo depth estimation are also discussed. Despite significant progress, challenges remain in achieving optimal performance in not only accuracy but also efficiency, a cornerstone of event-based computing. We identify several gaps and propose future research directions. We hope this survey inspires future research in this area, by serving as an accessible entry point for newcomers, as well as a practical guide for seasoned researchers in the community.
[CV-63] EM-Net: Efficient Channel and Frequency Learning with Mamba for 3D Medical Image Segmentation MICCAI2024
链接: https://arxiv.org/abs/2409.17675 作者: Ao Chang,Jiajun Zeng,Ruobing Huang,Dong Ni 关键词-EN: Convolutional neural networks, small receptive fields, Convolutional neural, primarily led, receptive fields 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 3 figures, accepted by MICCAI 2024
点击查看摘要
Abstract:Convolutional neural networks have primarily led 3D medical image segmentation but may be limited by small receptive fields. Transformer models excel in capturing global relationships through self-attention but are challenged by high computational costs at high resolutions. Recently, Mamba, a state space model, has emerged as an effective approach for sequential modeling. Inspired by its success, we introduce a novel Mamba-based 3D medical image segmentation model called EM-Net. It not only efficiently captures attentive interaction between regions by integrating and selecting channels, but also effectively utilizes frequency domain to harmonize the learning of features across varying scales, while accelerating training speed. Comprehensive experiments on two challenging multi-organ datasets with other state-of-the-art (SOTA) algorithms show that our method exhibits better segmentation accuracy while requiring nearly half the parameter size of SOTA models and 2x faster training speed.
[CV-64] Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation
Abstract:Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations through data-driven approaches, we explore the representation of gestures in co-speech, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model which incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate hand gestures generation, which are crucial for generating realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with an improvement from 2.7 to 4.5% for FGD, DIV, and FVD, and 8.1% for PSNR, 2.5% for SSIM over the current state-of-the-art methods.
[CV-65] Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes
链接: https://arxiv.org/abs/2409.17671 作者: Katja Ludwig,Julian Lorenz,Daniel Kienzle,Tuan Bui,Rainer Lienhart 关键词-EN: body shape, HME models, basic body shape, HME, body 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:The basic body shape of a person does not change within a single video. However, most SOTA human mesh estimation (HME) models output a slightly different body shape for each video frame, which results in inconsistent body shapes for the same person. In contrast, we leverage anthropometric measurements like tailors are already obtaining from humans for centuries. We create a model called A2B that converts such anthropometric measurements to body shape parameters of human mesh models. Moreover, we find that finetuned SOTA 3D human pose estimation (HPE) models outperform HME models regarding the precision of the estimated keypoints. We show that applying inverse kinematics (IK) to the results of such a 3D HPE model and combining the resulting body pose with the A2B body shape leads to superior and consistent human meshes for challenging datasets like ASPset or fit3D, where we can lower the MPJPE by over 30 mm compared to SOTA HME models. Further, replacing HME models estimates of the body shape parameters with A2B model results not only increases the performance of these HME models, but also leads to consistent body shapes.
Abstract:Recent concept-based interpretable models have succeeded in providing meaningful explanations by pre-defined concept sets. However, the dependency on the pre-defined concepts restricts the application because of the limited number of concepts for explanations. This paper proposes a novel interpretable deep neural network called explanation bottleneck models (XBMs). XBMs generate a text explanation from the input without pre-defined concepts and then predict a final task prediction based on the generated explanation by leveraging pre-trained vision-language encoder-decoder models. To achieve both the target task performance and the explanation quality, we train XBMs through the target task loss with the regularization penalizing the explanation decoder via the distillation from the frozen pre-trained decoder. Our experiments, including a comparison to state-of-the-art concept bottleneck models, confirm that XBMs provide accurate and fluent natural language explanations without pre-defined concept sets. Code will be available at this https URL.
[CV-67] Provable Performance Guarantees of Copy Detection Patterns
链接: https://arxiv.org/abs/2409.17649 作者: Joakim Tutt,Slava Voloshynovskiy 关键词-EN: Copy Detection Patterns, Copy Detection, Detection Patterns, modern security applications, playing a vital 类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Copy Detection Patterns (CDPs) are crucial elements in modern security applications, playing a vital role in safeguarding industries such as food, pharmaceuticals, and cosmetics. Current performance evaluations of CDPs predominantly rely on empirical setups using simplistic metrics like Hamming distances or Pearson correlation. These methods are often inadequate due to their sensitivity to distortions, degradation, and their limitations to stationary statistics of printing and imaging. Additionally, machine learning-based approaches suffer from distribution biases and fail to generalize to unseen counterfeit samples. Given the critical importance of CDPs in preventing counterfeiting, including the counterfeit vaccines issue highlighted during the COVID-19 pandemic, there is an urgent need for provable performance guarantees across various criteria. This paper aims to establish a theoretical framework to derive optimal criteria for the analysis, optimization, and future development of CDP authentication technologies, ensuring their reliability and effectiveness in diverse security scenarios.
[CV-68] MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning NEURIPS2024
链接: https://arxiv.org/abs/2409.17647 作者: Tieyuan Chen,Huabin Liu,Tianyao He,Yihang Chen,Chaofan Gan,Xiao Ma,Cheng Zhong,Yang Zhang,Yingxue Wang,Hui Lin,Weiyao Lin 关键词-EN: causal, achieve a high-level, high-level understanding, causal relationships, MECD 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024 as a spotlight paper
点击查看摘要
Abstract:Video causal reasoning aims to achieve a high-level understanding of video content from a causal perspective. However, current video reasoning tasks are limited in scope, primarily executed in a question-answering paradigm and focusing on short videos containing only a single event and simple causal relationships, lacking comprehensive and structured causality analysis for videos with multiple events. To fill this gap, we introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relationships between events distributed chronologically across long videos. Given visual segments and textual descriptions of events, MECD requires identifying the causal associations between these events to derive a comprehensive, structured event-level video causal diagram explaining why and how the final result event occurred. To address MECD, we devise a novel framework inspired by the Granger Causality method, using an efficient mask-based event prediction model to perform an Event Granger Test, which estimates causality by comparing the predicted result event when premise events are masked versus unmasked. Furthermore, we integrate causal inference techniques such as front-door adjustment and counterfactual inference to address challenges in MECD like causality confounding and illusory causality. Experiments validate the effectiveness of our framework in providing causal relationships in multi-event videos, outperforming GPT-4o and VideoLLaVA by 5.7% and 4.1%, respectively.
[CV-69] P4Q: Learning to Prompt for Quantization in Visual-language Models
Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have gained prominence in various visual and multimodal tasks, yet the deployment of VLMs on downstream application platforms remains challenging due to their prohibitive requirements of training samples and computing resources. Fine-tuning and quantization of VLMs can substantially reduce the sample and computation costs, which are in urgent need. There are two prevailing paradigms in quantization, Quantization-Aware Training (QAT) can effectively quantize large-scale VLMs but incur a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization named ``Prompt for Quantization’’ (P4Q), in which we design a lightweight architecture to leverage contrastive loss supervision to enhance the recognition performance of a PTQ model. Our method can effectively reduce the gap between image features and text features caused by low-bit quantization, based on learnable prompts to reorganize textual representations and a low-bit adapter to realign the distributions of image and text features. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model using a full-precision teacher. Extensive experimental results demonstrate that our P4Q method outperforms prior arts, even achieving comparable results to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4 \times while achieving 66.94% Top-1 accuracy, outperforming the learnable prompt fine-tuned full-precision model by 2.24% with negligible additional parameters on the ImageNet dataset.
[CV-70] Hand-object reconstruction via interaction-aware graph attention mechanism ICIP2024
链接: https://arxiv.org/abs/2409.17629 作者: Taeyun Woo,Tae-Kyun Kim,Jinah Park 关键词-EN: advanced vision computing, Estimating the poses, vision computing, important area, area of research 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, Accepted by ICIP 2024
点击查看摘要
Abstract:Estimating the poses of both a hand and an object has become an important area of research due to the growing need for advanced vision computing. The primary challenge involves understanding and reconstructing how hands and objects interact, such as contact and physical plausibility. Existing approaches often adopt a graph neural network to incorporate spatial information of hand and object meshes. However, these approaches have not fully exploited the potential of graphs without modification of edges within and between hand- and object-graphs. We propose a graph-based refinement method that incorporates an interaction-aware graph-attention mechanism to account for hand-object interactions. Using edges, we establish connections among closely correlated nodes, both within individual graphs and across different graphs. Experiments demonstrate the effectiveness of our proposed method with notable improvements in the realm of physical plausibility.
[CV-71] Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment
链接: https://arxiv.org/abs/2409.17612 作者: Jiawei Du,Xin Zhang,Juncheng Hu,Wenxin Huang,Joey Tianyi Zhou 关键词-EN: sharp increase, increase in data-related, motivated research, research into condensing, datasets 类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:The sharp increase in data-related expenses has motivated research into condensing datasets while retaining the most informative features. Dataset distillation has thus recently come to the fore. This paradigm generates synthetic dataset that are representative enough to replace the original dataset in training a neural network. To avoid redundancy in these synthetic datasets, it is crucial that each element contains unique features and remains diverse from others during the synthesis stage. In this paper, we provide a thorough theoretical and empirical analysis of diversity within synthesized datasets. We argue that enhancing diversity can improve the parallelizable yet isolated synthesizing approach. Specifically, we introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process, thereby maximizing the representativeness and diversity of each synthetic instance. Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset. Extensive experiments across multiple datasets, including CIFAR, Tiny-ImageNet, and ImageNet-1K, demonstrate the superior performance of our method, highlighting its effectiveness in producing diverse and representative synthetic datasets with minimal computational expense.
[CV-72] ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
链接: https://arxiv.org/abs/2409.17610 作者: Zhangpu Li,Changhong Zou,Suxue Ma,Zhicheng Yang,Chen Du,Youbao Tang,Zhenjie Cao,Ning Zhang,Jui-Hsin Lai,Ruei-Sung Lin,Yuan Ni,Xingzhi Sun,Jing Xiao,Kai Zhang,Mei Han 关键词-EN: multimodal medical dialogue, multi-turn multimodal medical, large language models, multimodal medical, medical dialogue 类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients’ mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the preceding text conversations before an image can infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to summarize the keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate our proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue to provide a fine-grained performance comparison. Our experiments across three different clinical departments remarkably demonstrate the efficacy of ZALM3 with statistical significance.
[CV-73] Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection
链接: https://arxiv.org/abs/2409.17608 作者: Jiahao Lyu,Minghua Zhao,Jing Hu,Xuewen Huang,Shuangli Du,Cheng Shi,Zhiyong Lv 关键词-EN: measuring significant deviations, Video anomaly detection, significant deviations, Video anomaly, learns the distribution 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 11 figures
点击查看摘要
Abstract:Video anomaly detection (VAD) often learns the distribution of normal samples and detects the anomaly through measuring significant deviations, but the undesired generalization may reconstruct a few anomalies thus suppressing the deviations. Meanwhile, most VADs cannot cope with cross-dataset validation for new target domains, and few-shot methods must laboriously rely on model-tuning from the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module to achieve cross-dataset validation with zero-shot. First, we add Gaussian blur to the raw appearance images, thereby constructing the global pseudo-anomaly, which serves as the input to the network. Then, we propose multi-scale residual channel attention to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase, which are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.
[CV-74] Good Data Is All Imitation Learning Needs
链接: https://arxiv.org/abs/2409.17605 作者: Amir Samadi,Konstantinos Koufos,Kurt Debattista,Mehrdad Dianati 关键词-EN: Automated Driving Systems, context of Autonomous, traditional teacher-student models, imitation learning, Automated Driving 类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we address the limitations of traditional teacher-student models, imitation learning, and behaviour cloning in the context of Autonomous/Automated Driving Systems (ADS), where these methods often struggle with incomplete coverage of real-world scenarios. To enhance the robustness of such models, we introduce the use of Counterfactual Explanations (CFEs) as a novel data augmentation technique for end-to-end ADS. CFEs, by generating training samples near decision boundaries through minimal input modifications, lead to a more comprehensive representation of expert driver strategies, particularly in safety-critical scenarios. This approach can therefore help improve the model’s ability to handle rare and challenging driving events, such as anticipating darting out pedestrians, ultimately leading to safer and more trustworthy decision-making for ADS. Our experiments in the CARLA simulator demonstrate that CF-Driver outperforms the current state-of-the-art method, achieving a higher driving score and lower infraction rates. Specifically, CF-Driver attains a driving score of 84.2, surpassing the previous best model by 15.02 percentage points. These results highlight the effectiveness of incorporating CFEs in training end-to-end ADS. To foster further research, the CF-Driver code is made publicly available.
[CV-75] A-Cleaner: A Fine-grained Text Alignment Backdoor Defense Strategy for Multimodal Contrastive Learning
Abstract:Pre-trained large models for multimodal contrastive learning, such as CLIP, have been widely recognized in the industry as highly susceptible to data-poisoned backdoor attacks. This poses significant risks to downstream model training. In response to such potential threats, finetuning offers a simpler and more efficient defense choice compared to retraining large models with augmented data. In the supervised learning domain, fine-tuning defense strategies can achieve excellent defense performance. However, in the unsupervised and semi-supervised domain, we find that when CLIP faces some complex attack techniques, the existing fine-tuning defense strategy, CleanCLIP, has some limitations on defense performance. The synonym substitution of its text-augmentation is insufficient to enhance the text feature space. To compensate for this weakness, we improve it by proposing a fine-grained \textbfText \textbfAlignment \textbfCleaner (TA-Cleaner) to cut off feature connections of backdoor triggers. We randomly select a few samples for positive and negative subtext generation at each epoch of CleanCLIP, and align the subtexts to the images to strengthen the text self-supervision. We evaluate the effectiveness of our TA-Cleaner against six attack algorithms and conduct comprehensive zero-shot classification tests on ImageNet1K. Our experimental results demonstrate that TA-Cleaner achieves state-of-the-art defensiveness among finetuning-based defense techniques. Even when faced with the novel attack technique BadCLIP, our TA-Cleaner outperforms CleanCLIP by reducing the ASR of Top-1 and Top-10 by 52.02% and 63.88%, respectively.
[CV-76] Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution
Abstract:Window-based transformers have demonstrated outstanding performance in super-resolution tasks due to their adaptive modeling capabilities through local self-attention (SA). However, they exhibit higher computational complexity and inference latency than convolutional neural networks. In this paper, we first identify that the adaptability of the Transformers is derived from their adaptive spatial aggregation and advanced structural design, while their high latency results from the computational costs and memory layout transformations associated with the local SA. To simulate this aggregation approach, we propose an effective convolution-based linear focal separable attention (FSA), allowing for long-range dynamic modeling with linear complexity. Additionally, we introduce an effective dual-branch structure combined with an ultra-lightweight information exchange module (IEM) to enhance the aggregation of information by the Token Mixer. Finally, with respect to the structure, we modify the existing spatial-gate-based feedforward neural networks by incorporating a self-gate mechanism to preserve high-dimensional channel information, enabling the modeling of more complex relationships. With these advancements, we construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet). Extensive experiments demonstrate that LAMNet achieves better performance than existing SA-based Transformer methods while maintaining the computational efficiency of convolutional neural networks, which can achieve a (3\times) speedup of inference time. The code will be publicly available at: this https URL.
[CV-77] Improving Fast Adversarial Training via Self-Knowledge Guidance
Abstract:Adversarial training has achieved remarkable advancements in defending against adversarial attacks. Among them, fast adversarial training (FAT) is gaining attention for its ability to achieve competitive robustness with fewer computing resources. Existing FAT methods typically employ a uniform strategy that optimizes all training data equally without considering the influence of different examples, which leads to an imbalanced optimization. However, this imbalance remains unexplored in the field of FAT. In this paper, we conduct a comprehensive study of the imbalance issue in FAT and observe an obvious class disparity regarding their performances. This disparity could be embodied from a perspective of alignment between clean and robust accuracy. Based on the analysis, we mainly attribute the observed misalignment and disparity to the imbalanced optimization in FAT, which motivates us to optimize different training data adaptively to enhance robustness. Specifically, we take disparity and misalignment into consideration. First, we introduce self-knowledge guided regularization, which assigns differentiated regularization weights to each class based on its training state, alleviating class disparity. Additionally, we propose self-knowledge guided label relaxation, which adjusts label relaxation according to the training accuracy, alleviating the misalignment and improving robustness. By combining these methods, we formulate the Self-Knowledge Guided FAT (SKG-FAT), leveraging naturally generated knowledge during training to enhance the adversarial robustness without compromising training efficiency. Extensive experiments on four standard datasets demonstrate that the SKG-FAT improves the robustness and preserves competitive clean accuracy, outperforming the state-of-the-art methods.
[CV-78] ID3: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition NEURIPS2024
链接: https://arxiv.org/abs/2409.17576 作者: Shen Li,Jianqing Xu,Jiaying Wu,Miao Xiong,Ailin Deng,Jiazhen Ji,Yuge Huang,Wenjie Feng,Shouhong Ding,Bryan Hooi 关键词-EN: Synthetic face, generate synthetic face, Synthetic face recognition, synthetic face datasets, privacy-preserving manner 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024
点击查看摘要
Abstract:Synthetic face recognition (SFR) aims to generate synthetic face datasets that mimic the distribution of real face data, which allows for training face recognition models in a privacy-preserving manner. Despite the remarkable potential of diffusion models in image generation, current diffusion-based SFR models struggle with generalization to real-world faces. To address this limitation, we outline three key objectives for SFR: (1) promoting diversity across identities (inter-class diversity), (2) ensuring diversity within each identity by injecting various facial attributes (intra-class diversity), and (3) maintaining identity consistency within each identity group (intra-class identity preservation). Inspired by these goals, we introduce a diffusion-fueled SFR model termed \textID^3 . \textID^3 employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances. Theoretically, we show that minimizing this loss is equivalent to maximizing the lower bound of an adjusted conditional log-likelihood over ID-preserving data. This equivalence motivates an ID-preserving sampling algorithm, which operates over an adjusted gradient vector field, enabling the generation of fake face recognition datasets that approximate the distribution of real-world faces. Extensive experiments across five challenging benchmarks validate the advantages of \textID^3 .
[CV-79] Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule
Abstract:Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images. Despite their effectiveness, these models often require significant computational resources owing to their numerous sequential denoising steps and the significant inference cost of each step. Recently, Neural Architecture Search (NAS) techniques have been employed to automatically search for faster generation processes. However, NAS for diffusion is inherently time-consuming as it requires estimating thousands of diffusion models to search for the optimal one. In this paper, we introduce Flexiffusion, a novel training-free NAS paradigm designed to accelerate diffusion models by concurrently optimizing generation steps and network structures. Specifically, we partition the generation process into isometric step segments, each sequentially composed of a full step, multiple partial steps, and several null steps. The full step computes all network blocks, while the partial step involves part of the blocks, and the null step entails no computation. Flexiffusion autonomously explores flexible step combinations for each segment, substantially reducing search costs and enabling greater acceleration compared to the state-of-the-art (SOTA) method for diffusion models. Our searched models reported speedup factors of 2.6\times and 1.5\times for the original LDM-4-G and the SOTA, respectively. The factors for Stable Diffusion V1.5 and the SOTA are 5.1\times and 2.0\times . We also verified the performance of Flexiffusion on multiple datasets, and positive experiment results indicate that Flexiffusion can effectively reduce redundancy in diffusion models.
[CV-80] Pixel-Space Post-Training of Latent Diffusion Models
链接: https://arxiv.org/abs/2409.17565 作者: Christina Zhang,Simran Motwani,Matthew Yu,Ji Hou,Felix Juefei-Xu,Sam Tsai,Peter Vajda,Zijian He,Jialiang Wang 关键词-EN: made significant advancements, recent years, made significant, significant advancements, generation in recent 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is due to the fact that all pre- and post-training of LDMs are done in latent space, which is typically 8 \times 8 lower spatial-resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training by a large margin on a state-of-the-art DiT transformer and U-Net diffusion models in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
[CV-81] General Compression Framework for Efficient Transformer Object Tracking
Abstract:Transformer-based trackers have established a dominant role in the field of visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging due to inefficiencies. To improve the inference efficiency and reduce the computation cost, prior approaches have aimed to either design lightweight trackers or distill knowledge from larger teacher models into more compact student trackers. However, these solutions often sacrifice accuracy for speed. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce the size of a pre-trained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. Additionally, we also design a unique replacement training technique that involves randomly substituting specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model’s ability to replicate the teacher model’s behavior. To further forcing student model to emulate teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the teacher model’s compression process. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiment to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4 with 4 transformer layers, which is compressed from OSTrack, retains about 96% performance on LaSOT (66.1% AUC) while achieves 2.17x speed up.
[CV-82] Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking
链接: https://arxiv.org/abs/2409.17560 作者: Pengcheng Shao,Tianyang Xu,Xuefeng Zhu,Xiaojun Wu,Josef Kittler 关键词-EN: bionic camera asynchronously, high dynamic range, Event-based bionic camera, RGB under conditions, high temporal resolution 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 8 figures, conference
点击查看摘要
Abstract:Event-based bionic camera asynchronously captures dynamic scenes with high temporal resolution and high dynamic range, offering potential for the integration of events and RGB under conditions of illumination degradation and fast motion. Existing RGB-E tracking methods model event characteristics utilising attention mechanism of Transformer before integrating both modalities. Nevertheless, these methods involve aggregating the event stream into a single event frame, lacking the utilisation of the temporal information inherent in the event this http URL, the traditional attention mechanism is well-suited for dense semantic features, while the attention mechanism for sparse event features require revolution. In this paper, we propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters, aiming to capture spatio-temporal features that contain motion cues. Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions. The experimental results indicate that our method outperforms existing state-of-the-art methods on the FE240 and COESOT datasets, providing an effective processing manner for the event data.
链接: https://arxiv.org/abs/2409.17555 作者: Kunyu Peng,Di Wen,Kailun Yang,Ao Luo,Yufan Chen,Jia Fu,M. Saquib Sarfraz,Alina Roitberg,Rainer Stiefelhagen 关键词-EN: Open-Set Domain Generalization, Domain Generalization, open-set conditions, domain scheduler, Domain 类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024. The source code will be available at this https URL
点击查看摘要
Abstract:In Open-Set Domain Generalization (OSDG), the model is exposed to both new variations of data appearance (domains) and open-set conditions, where both known and novel categories are present at test time. The challenges of this task arise from the dual need to generalize across diverse domains and accurately quantify category novelty, which is critical for applications in dynamic environments. Recently, meta-learning techniques have demonstrated superior results in OSDG, effectively orchestrating the meta-train and -test tasks by employing varied random categories and predefined domain partition strategies. These approaches prioritize a well-designed training schedule over traditional methods that focus primarily on data augmentation and the enhancement of discriminative feature learning. The prevailing meta-learning models in OSDG typically utilize a predefined sequential domain scheduler to structure data partitions. However, a crucial aspect that remains inadequately explored is the influence brought by strategies of domain schedulers during training. In this paper, we observe that an adaptive domain scheduler benefits more in OSDG compared with prefixed sequential and random domain schedulers. We propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve an adaptive domain scheduler. This method strategically sequences domains by assessing their reliabilities in utilizing a follower network, trained with confidence scores learned in an evidential manner, regularized by max rebiasing discrepancy, and optimized in a bi-level manner. The results show that our method substantially improves OSDG performance and achieves more discriminative embeddings for both the seen and unseen categories. The source code will be available at this https URL.
Abstract:Existing 3D mask learning methods encounter performance bottlenecks under limited data, and our objective is to overcome this limitation. In this paper, we introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders to achieve multi-mask learning for 3D point clouds. Specifically, we augment the baselines with two additional mask choices (i.e., medium mask and low mask) as our core insight is that the recovery process of an object can manifest in diverse ways. Previous high-masking schemes focus on capturing the global representation but lack the fine-grained recovery capability, so that the generated pre-trained weights tend to play a limited role in the fine-tuning process. With the support of the proposed TPM, available methods can exhibit more flexible and accurate completion capabilities, enabling the potential autoencoder in the pre-training stage to consider multiple representations of a single 3D object. In addition, an SVM-guided weight selection module is proposed to fill the encoder parameters for downstream networks with the optimal weight during the fine-tuning stage, maximizing linear accuracy and facilitating the acquisition of intricate representations for new objects. Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.
[CV-85] CAMOT: Camera Angle-aware Multi-Object Tracking
链接: https://arxiv.org/abs/2409.17533 作者: Felix Limanta,Kuniaki Uto,Koichi Shinoda 关键词-EN: inaccurate distance estimation, paper proposes CAMOT, simple camera angle, tackle two problems, inaccurate distance 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:This paper proposes CAMOT, a simple camera angle estimator for multi-object tracking to tackle two problems: 1) occlusion and 2) inaccurate distance estimation in the depth direction. Under the assumption that multiple objects are located on a flat plane in each video frame, CAMOT estimates the camera angle using object detection. In addition, it gives the depth of each object, enabling pseudo-3D MOT. We evaluated its performance by adding it to various 2D MOT methods on the MOT17 and MOT20 datasets and confirmed its effectiveness. Applying CAMOT to ByteTrack, we obtained 63.8% HOTA, 80.6% MOTA, and 78.5% IDF1 in MOT17, which are state-of-the-art results. Its computational cost is significantly lower than the existing deep-learning-based depth estimators for tracking.
[CV-86] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion NEURIPS2024
链接: https://arxiv.org/abs/2409.17531 作者: Ming Dai,Lingfeng Yang,Yihao Xu,Zhenhua Feng,Wankou Yang 关键词-EN: involves grounding descriptive, grounding descriptive sentences, common vision task, common vision, descriptive sentences 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 21pages, 11figures, NeurIPS2024
点击查看摘要
Abstract:Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because the former paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding. Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks. Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch. This branch only consists of a lightweight MLP, which simplifies the structure and improves reasoning speed. Experiments on six widely used VG datasets, i.e., RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG. Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks. Codes and models will be available at \urlthis https URL.
[CV-87] Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Integrating SGBM and Segmentation Models
链接: https://arxiv.org/abs/2409.17526 作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green 关键词-EN: radiata pine trees, pine trees presents, trees presents significant, safety risks due, Manual pruning 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Manual pruning of radiata pine trees presents significant safety risks due to their substantial height and the challenging terrains in which they thrive. To address these risks, this research proposes the development of a drone-based pruning system equipped with specialized pruning tools and a stereo vision camera, enabling precise detection and trimming of branches. Deep learning algorithms, including YOLO and Mask R-CNN, are employed to ensure accurate branch detection, while the Semi-Global Matching algorithm is integrated to provide reliable distance estimation. The synergy between these techniques facilitates the precise identification of branch locations and enables efficient, targeted pruning. Experimental results demonstrate that the combined implementation of YOLO and SGBM enables the drone to accurately detect branches and measure their distances from the drone. This research not only improves the safety and efficiency of pruning operations but also makes a significant contribution to the advancement of drone technology in the automation of agricultural and forestry practices, laying a foundational framework for further innovations in environmental management.
[CV-88] JoyType: A Robust Design for Multilingual Visual Text Creation AAAI2025
链接: https://arxiv.org/abs/2409.17524 作者: Chao Li,Chen Jiang,Xiaolong Liu,Jun Zhao,Guoxin Wang 关键词-EN: non-Latin languages, poses a significant, accurately represented text, accurately represented, significant challenge 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review at AAAI 2025
点击查看摘要
Abstract:Generating images with accurately represented text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million pairs of data. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then developed a text control network, Font ControlNet, tasked with extracting font style information to steer the image generation. To further enhance our model’s ability to maintain font style, notably in generating small-font text, we incorporated a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, facilitating the creation of varied image styles in conjunction with other stable diffusion models on HuggingFace and CivitAI. Our project is open-sourced on this https URL.
链接: https://arxiv.org/abs/2409.17523 作者: Jing Bi,Yunlong Tang,Luchuan Song,Ali Vosoughi,Nguyen Nguyen,Chenliang Xu 关键词-EN: video analysis brings, understanding human activities, first-person perspective, egocentric video analysis, egocentric video 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ACMMM 24
点击查看摘要
Abstract:The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation in tasks like action recognition, procedure learning, and moment retrieval, \etc, coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the \textitfirst large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLM for egocentric video understanding. Our extensive experiments demonstrate EAGLE’s superior performance over existing models, highlighting its ability to balance task-specific understanding with holistic video interpretation. With EAGLE, we aim to pave the way for research opportunities and practical applications in real-world scenarios.
[CV-90] Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization
链接: https://arxiv.org/abs/2409.17519 作者: Kento Kawaharazuka,Yoshiki Obinata,Naoaki Kanazawa,Kei Okada,Masayuki Inaba 关键词-EN: diverse environments, environmental state recognition, autonomously navigate, navigate and operate, operate in diverse 类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Advanced Robotics, website - this https URL
点击查看摘要
Abstract:In order for robots to autonomously navigate and operate in diverse environments, it is essential for them to recognize the state of their environment. On the other hand, the environmental state recognition has traditionally involved distinct methods tailored to each state to be recognized. In this study, we perform a unified environmental state recognition for robots through the spoken language with pre-trained large-scale vision-language models. We apply Visual Question Answering and Image-to-Text Retrieval, which are tasks of Vision-Language Models. We show that with our method, it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed and whether water is running in a sink, without training neural networks or manual programming. In addition, the recognition accuracy can be improved by selecting appropriate texts from the set of prepared texts based on black-box optimization. For each state recognition, only the text set and its weighting need to be changed, eliminating the need to prepare multiple different models and programs, and facilitating the management of source code and computer resource. We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch.
[CV-91] SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning ECCV2024
Abstract:Open-set semi-supervised learning (OSSL) leverages practical open-set unlabeled data, comprising both in-distribution (ID) samples from seen classes and out-of-distribution (OOD) samples from unseen classes, for semi-supervised learning (SSL). Prior OSSL methods initially learned the decision boundary between ID and OOD with labeled ID data, subsequently employing self-training to refine this boundary. These methods, however, suffer from the tendency to overtrust the labeled ID data: the scarcity of labeled data caused the distribution bias between the labeled samples and the entire ID data, which misleads the decision boundary to overfit. The subsequent self-training process, based on the overfitted result, fails to rectify this problem. In this paper, we address the overtrusting issue by treating OOD samples as an additional class, forming a new SSL process. Specifically, we propose SCOMatch, a novel OSSL method that 1) selects reliable OOD samples as new labeled data with an OOD memory queue and a corresponding update strategy and 2) integrates the new SSL process into the original task through our Simultaneous Close-set and Open-set self-training. SCOMatch refines the decision boundary of ID and OOD classes across the entire dataset, thereby leading to improved results. Extensive experimental results show that SCOMatch significantly outperforms the state-of-the-art methods on various benchmarks. The effectiveness is further verified through ablation studies and visualization. Comments: ECCV 2024 accepted Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2409.17512 [cs.CV] (or arXiv:2409.17512v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2409.17512 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-92] Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
链接: https://arxiv.org/abs/2409.17508 作者: Xun Zhu,Ying Hu,Fanbin Mo,Miao Li,Ji Wu 关键词-EN: shown impressive capabilities, Multi-modal large language, large language models, large language, shown impressive 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE that leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves efficient solution to the tug-of-war problem and can perform six different medical tasks including question answering, visual question answering, report generation, referring expression comprehension, referring expression generation and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with up to an average 8% performance gains. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code, data and model will be soon available at GitHub.
[CV-93] Learning Quantized Adaptive Conditions for Diffusion Models
Abstract:The curvature of ODE trajectories in diffusion models hinders their ability to generate high-quality images in a few number of function evaluations (NFE). In this paper, we propose a novel and effective approach to reduce trajectory curvature by utilizing adaptive conditions. By employing a extremely light-weight quantized encoder, our method incurs only an additional 1% of training parameters, eliminates the need for extra regularization terms, yet achieves significantly better sample quality. Our approach accelerates ODE sampling while preserving the downstream task image editing capabilities of SDE techniques. Extensive experiments verify that our method can generate high quality results under extremely limited sampling costs. With only 6 NFE, we achieve 5.14 FID on CIFAR-10, 6.91 FID on FFHQ 64x64 and 3.10 FID on AFHQv2.
[CV-94] Global-Local Medical SAM Adaptor Based on Full Adaption
链接: https://arxiv.org/abs/2409.17486 作者: Meng Wang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yarong Feng(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yongwei Tang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Tian Zhang(Software college Northeastern University Shenyang, Liaoning Province, P. R. China),Yuxin Liang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Chao Lv(Department of General Surgery, Shengjing Hospital China Medical University Shenyang, Liaoning Province, P. R. China) 关键词-EN: Medical SAM adaptor, visual language models, made great breakthroughs, Emerging of visual, SAM adaptor 类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Emerging of visual language models, such as the segment anything model (SAM), have made great breakthroughs in the field of universal semantic segmentation and significantly aid the improvements of medical image segmentation, in particular with the help of Medical SAM adaptor (Med-SA). However, Med-SA still can be improved, as it fine-tunes SAM in a partial adaption manner. To resolve this problem, we present a novel global medical SAM adaptor (GMed-SA) with full adaption, which can adapt SAM globally. We further combine GMed-SA and Med-SA to propose a global-local medical SAM adaptor (GLMed-SA) to adapt SAM both globally and locally. Extensive experiments have been performed on the challenging public 2D melanoma segmentation dataset. The results show that GLMed-SA outperforms several state-of-the-art semantic segmentation methods on various evaluation metrics, demonstrating the superiority of our methods.
[CV-95] Revisiting Deep Ensemble Uncertainty for Enhanced Medical Anomaly Detection MICCAI2024
链接: https://arxiv.org/abs/2409.17485 作者: Yi Gu,Yi Lin,Kwang-Ting Cheng,Hao Chen 关键词-EN: identification and localization, Medical anomaly detection, crucial in pathological, pathological identification, anomaly detection 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Early accepted by MICCAI2024
点击查看摘要
Abstract:Medical anomaly detection (AD) is crucial in pathological identification and localization. Current methods typically rely on uncertainty estimation in deep ensembles to detect anomalies, assuming that ensemble learners should agree on normal samples while exhibiting disagreement on unseen anomalies in the output space. However, these methods may suffer from inadequate disagreement on anomalies or diminished agreement on normal samples. To tackle these issues, we propose D2UE, a Diversified Dual-space Uncertainty Estimation framework for medical anomaly detection. To effectively balance agreement and disagreement for anomaly detection, we propose Redundancy-Aware Repulsion (RAR), which uses a similarity kernel that remains invariant to both isotropic scaling and orthogonal transformations, explicitly promoting diversity in learners’ feature space. Moreover, to accentuate anomalous regions, we develop Dual-Space Uncertainty (DSU), which utilizes the ensemble’s uncertainty in input and output spaces. In input space, we first calculate gradients of reconstruction error with respect to input images. The gradients are then integrated with reconstruction outputs to estimate uncertainty for inputs, enabling effective anomaly discrimination even when output space disagreement is minimal. We conduct a comprehensive evaluation of five medical benchmarks with different backbones. Experimental results demonstrate the superiority of our method to state-of-the-art methods and the effectiveness of each component in our framework. Our code is available at this https URL.
[CV-96] FS-NeRF: Template-Free NeRF for Semantic 3D Reconstruction of Dynamic Scene NEURIPS2024
Abstract:Despite advancements in Neural Implicit models for 3D surface reconstruction, handling dynamic environments with arbitrary rigid, non-rigid, or deformable entities remains challenging. Many template-based methods are entity-specific, focusing on humans, while generic reconstruction methods adaptable to such dynamic scenes often require additional inputs like depth or optical flow or rely on pre-trained image features for reasonable outcomes. These methods typically use latent codes to capture frame-by-frame deformations. In contrast, some template-free methods bypass these requirements and adopt traditional LBS (Linear Blend Skinning) weights for a detailed representation of deformable object motions, although they involve complex optimizations leading to lengthy training times. To this end, as a remedy, this paper introduces TFS-NeRF, a template-free 3D semantic NeRF for dynamic scenes captured from sparse or single-view RGB videos, featuring interactions among various entities and more time-efficient than other LBS-based approaches. Our framework uses an Invertible Neural Network (INN) for LBS prediction, simplifying the training process. By disentangling the motions of multiple entities and optimizing per-entity skinning weights, our method efficiently generates accurate, semantically separable geometries. Extensive experiments demonstrate that our approach produces high-quality reconstructions of both deformable and non-deformable objects in complex interactions, with improved training efficiency compared to existing methods.
[CV-97] CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches
链接: https://arxiv.org/abs/2409.17457 作者: Sifan Wu,Amir Khasahmadi,Mor Katz,Pradeep Kumar Jayaraman,Yewen Pu,Karl Willis,Bang Liu 关键词-EN: contemporary mechanical design, CAD, mechanical design, central to contemporary, Design 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. However, it encounters challenges in achieving precise parametric sketch modeling and lacks practical evaluation metrics suitable for mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design.
[CV-98] AgMTR: Agent Mining Transformer for Few-shot Segmentation in Remote Sensing
链接: https://arxiv.org/abs/2409.17453 作者: Hanbo Bi,Yingchao Feng,Yongqiang Mao,Jianning Pei,Wenhui Diao,Hongqi Wang,Xian Sun 关键词-EN: Few-shot Segmentation, aims to segment, labeled samples, segment the interested, interested objects 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted to IJCV
点击查看摘要
Abstract:Few-shot Segmentation (FSS) aims to segment the interested objects in the query image with just a handful of labeled samples (i.e., support images). Previous schemes would leverage the similarity between support-query pixel pairs to construct the pixel-level semantic correlation. However, in remote sensing scenarios with extreme intra-class variations and cluttered backgrounds, such pixel-level correlations may produce tremendous mismatches, resulting in semantic ambiguity between the query foreground (FG) and background (BG) pixels. To tackle this problem, we propose a novel Agent Mining Transformer (AgMTR), which adaptively mines a set of local-aware agents to construct agent-level semantic correlation. Compared with pixel-level semantics, the given agents are equipped with local-contextual information and possess a broader receptive field. At this point, different query pixels can selectively aggregate the fine-grained local semantics of different agents, thereby enhancing the semantic clarity between query FG and BG pixels. Concretely, the Agent Learning Encoder (ALE) is first proposed to erect the optimal transport plan that arranges different agents to aggregate support semantics under different local regions. Then, for further optimizing the agents, the Agent Aggregation Decoder (AAD) and the Semantic Alignment Decoder (SAD) are constructed to break through the limited support set for mining valuable class-specific semantics from unlabeled data sources and the query image itself, respectively. Extensive experiments on the remote sensing benchmark iSAID indicate that the proposed method achieves state-of-the-art performance. Surprisingly, our method remains quite competitive when extended to more common natural scenarios, i.e., PASCAL-5i and COCO-20i.
链接: https://arxiv.org/abs/2409.17439 作者: Chirag Vashist,Shichong Peng,Ke Li 关键词-EN: learn deep generative, deep generative models, limited training data, Maximum Likelihood Estimation, Implicit Maximum Likelihood 类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:An emerging area of research aims to learn deep generative models with limited training data. Prior generative models like GANs and diffusion models require a lot of data to perform well, and their performance degrades when they are trained on only a small amount of data. A recent technique called Implicit Maximum Likelihood Estimation (IMLE) has been adapted to the few-shot setting, achieving state-of-the-art performance. However, current IMLE-based approaches encounter challenges due to inadequate correspondence between the latent codes selected for training and those drawn during inference. This results in suboptimal test-time performance. We theoretically show a way to address this issue and propose RS-IMLE, a novel approach that changes the prior distribution used for training. This leads to substantially higher quality image generation compared to existing GAN and IMLE-based methods, as validated by comprehensive experiments conducted on nine few-shot image datasets.
[CV-100] HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing
链接: https://arxiv.org/abs/2409.17432 作者: Md Tanvir Islam,Nasir Rahim,Saeed Anwar,Muhammad Saqib,Sambit Bakshi,Khan Muhammad 关键词-EN: computer vision applications, haze type classification, type classification, haze type, haze 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM Multimedia 2024
点击查看摘要
Abstract:Reducing the atmospheric haze and enhancing image clarity is crucial for computer vision applications. The lack of real-life hazy ground truth images necessitates synthetic datasets, which often lack diverse haze types, impeding effective haze type classification and dehazing algorithm selection. This research introduces the HazeSpace2M dataset, a collection of over 2 million images designed to enhance dehazing through haze type classification. HazeSpace2M includes diverse scenes with 10 haze intensity levels, featuring Fog, Cloud, and Environmental Haze (EH). Using the dataset, we introduce a technique of haze type classification followed by specialized dehazers to clear hazy images. Unlike conventional methods, our approach classifies haze types before applying type-specific dehazing, improving clarity in real-life hazy images. Benchmarking with state-of-the-art (SOTA) models, ResNet50 and AlexNet achieve 92.75% and 92.50% accuracy, respectively, against existing synthetic datasets. However, these models achieve only 80% and 70% accuracy, respectively, against our Real Hazy Testset (RHT), highlighting the challenging nature of our HazeSpace2M dataset. Additional experiments show that haze type classification followed by specialized dehazing improves results by 2.41% in PSNR, 17.14% in SSIM, and 10.2% in MSE over general dehazers. Moreover, when testing with SOTA dehazing models, we found that applying our proposed framework significantly improves their performance. These results underscore the significance of HazeSpace2M and our proposed framework in addressing atmospheric haze in multimedia processing. Complete code and dataset is available on \hrefthis https URL \textcolorblue\textbfGitHub.
[CV-101] ransient Adversarial 3D Projection Attacks on Object Detection in Autonomous Driving
链接: https://arxiv.org/abs/2409.17403 作者: Ce Zhou,Qiben Yan,Sijia Liu 关键词-EN: Object detection, crucial task, targeting object detection, patches or stickers, Object 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 7 figures, SmartSP 2024
点击查看摘要
Abstract:Object detection is a crucial task in autonomous driving. While existing research has proposed various attacks on object detection, such as those using adversarial patches or stickers, the exploration of projection attacks on 3D surfaces remains largely unexplored. Compared to adversarial patches or stickers, which have fixed adversarial patterns, projection attacks allow for transient modifications to these patterns, enabling a more flexible attack. In this paper, we introduce an adversarial 3D projection attack specifically targeting object detection in autonomous driving scenarios. We frame the attack formulation as an optimization problem, utilizing a combination of color mapping and geometric transformation models. Our results demonstrate the effectiveness of the proposed attack in deceiving YOLOv3 and Mask R-CNN in physical settings. Evaluations conducted in an indoor environment show an attack success rate of up to 100% under low ambient light conditions, highlighting the potential damage of our attack in real-world driving scenarios.
[CV-102] AgRegNet: A Deep Regression Network for Flower and Fruit Density Estimation Localization and Counting in Orchards
链接: https://arxiv.org/abs/2409.17400 作者: Uddhav Bhattarai,Santosh Bhusal,Qin Zhang,Manoj Karkee 关键词-EN: agricultural industry today, manual labor availability, fruit density estimation, major challenges, agricultural industry 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:One of the major challenges for the agricultural industry today is the uncertainty in manual labor availability and the associated cost. Automated flower and fruit density estimation, localization, and counting could help streamline harvesting, yield estimation, and crop-load management strategies such as flower and fruitlet thinning. This article proposes a deep regression-based network, AgRegNet, to estimate density, count, and location of flower and fruit in tree fruit canopies without explicit object detection or polygon annotation. Inspired by popular U-Net architecture, AgRegNet is a U-shaped network with an encoder-to-decoder skip connection and modified ConvNeXt-T as an encoder feature extractor. AgRegNet can be trained based on information from point annotation and leverages segmentation information and attention modules (spatial and channel) to highlight relevant flower and fruit features while suppressing non-relevant background features. Experimental evaluation in apple flower and fruit canopy images under an unstructured orchard environment showed that AgRegNet achieved promising accuracy as measured by Structural Similarity Index (SSIM), percentage Mean Absolute Error (pMAE) and mean Average Precision (mAP) to estimate flower and fruit density, count, and centroid location, respectively. Specifically, the SSIM, pMAE, and mAP values for flower images were 0.938, 13.7%, and 0.81, respectively. For fruit images, the corresponding values were 0.910, 5.6%, and 0.93. Since the proposed approach relies on information from point annotation, it is suitable for sparsely and densely located objects. This simplified technique will be highly applicable for growers to accurately estimate yields and decide on optimal chemical and mechanical flower thinning practices.
[CV-103] Data-efficient Trajectory Prediction via Coreset Selection
链接: https://arxiv.org/abs/2409.17385 作者: Ruining Yang,Lili Su 关键词-EN: multiple information-collection devices, Modern vehicles, sensors and cameras, continuously generating, equipped with multiple 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Modern vehicles are equipped with multiple information-collection devices such as sensors and cameras, continuously generating a large volume of raw data. Accurately predicting the trajectories of neighboring vehicles is a vital component in understanding the complex driving environment. Yet, training trajectory prediction models is challenging in two ways. Processing the large-scale data is computation-intensive. Moreover, easy-medium driving scenarios often overwhelmingly dominate the dataset, leaving challenging driving scenarios such as dense traffic under-represented. For example, in the Argoverse motion prediction dataset, there are very few instances with \ge 50 agents, while scenarios with 10 \thicksim 20 agents are far more common. In this paper, to mitigate data redundancy in the over-represented driving scenarios and to reduce the bias rooted in the data scarcity of complex ones, we propose a novel data-efficient training method based on coreset selection. This method strategically selects a small but representative subset of data while balancing the proportions of different scenario difficulties. To the best of our knowledge, we are the first to introduce a method capable of effectively condensing large-scale trajectory dataset, while achieving a state-of-the-art compression ratio. Notably, even when using only 50% of the Argoverse dataset, the model can be trained with little to no decline in performance. Moreover, the selected coreset maintains excellent generalization ability.
[CV-104] Optical Lens Attack on Deep Learning Based Monocular Depth Estimation
链接: https://arxiv.org/abs/2409.17376 作者: Ce Zhou(1),Qiben Yan(1),Daniel Kent(1),Guangjing Wang(1),Ziqi Zhang(2),Hayder Radha(1) ((1) Michigan State University, (2) Peking University) 关键词-EN: Monocular Depth Estimation, plays a crucial, Depth Estimation, vision-based Autonomous Driving, crucial role 类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 26 pages, 13 figures, SecureComm 2024
点击查看摘要
Abstract:Monocular Depth Estimation (MDE) plays a crucial role in vision-based Autonomous Driving (AD) systems. It utilizes a single-camera image to determine the depth of objects, facilitating driving decisions such as braking a few meters in front of a detected obstacle or changing lanes to avoid collision. In this paper, we investigate the security risks associated with monocular vision-based depth estimation algorithms utilized by AD systems. By exploiting the vulnerabilities of MDE and the principles of optical lenses, we introduce LensAttack, a physical attack that involves strategically placing optical lenses on the camera of an autonomous vehicle to manipulate the perceived object depths. LensAttack encompasses two attack formats: concave lens attack and convex lens attack, each utilizing different optical lenses to induce false depth perception. We begin by constructing a mathematical model of our attack, incorporating various attack parameters. Subsequently, we simulate the attack and evaluate its real-world performance in driving scenarios to demonstrate its effect on state-of-the-art MDE models. The results highlight the significant impact of LensAttack on the accuracy of depth estimation in AD systems.
[CV-105] he Overfocusing Bias of Convolutional Neural Networks: A Saliency-Guided Regularization Approach
链接: https://arxiv.org/abs/2409.17370 作者: David Bertoin,Eduardo Hugo Sanchez,Mehdi Zouitine,Emmanuel Rachelson 关键词-EN: computer vision, low-data regimes, transformers being considered, standard in computer, convolutional neural networks 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Despite transformers being considered as the new standard in computer vision, convolutional neural networks (CNNs) still outperform them in low-data regimes. Nonetheless, CNNs often make decisions based on narrow, specific regions of input images, especially when training data is limited. This behavior can severely compromise the model’s generalization capabilities, making it disproportionately dependent on certain features that might not represent the broader context of images. While the conditions leading to this phenomenon remain elusive, the primary intent of this article is to shed light on this observed behavior of neural networks. Our research endeavors to prioritize comprehensive insight and to outline an initial response to this phenomenon. In line with this, we introduce Saliency Guided Dropout (SGDrop), a pioneering regularization approach tailored to address this specific issue. SGDrop utilizes attribution methods on the feature map to identify and then reduce the influence of the most salient features during training. This process encourages the network to diversify its attention and not focus solely on specific standout areas. Our experiments across several visual classification benchmarks validate SGDrop’s role in enhancing generalization. Significantly, models incorporating SGDrop display more expansive attributions and neural activity, offering a more comprehensive view of input images in contrast to their traditionally trained counterparts.
[CV-106] Implicit Neural Representations for Simultaneous Reduction and Continuous Reconstruction of Multi-Altitude Climate Data
链接: https://arxiv.org/abs/2409.17367 作者: Alif Bin Abdul Qayyum,Xihaier Luo,Nathan M. Urban,Xiaoning Qian,Byung-Jun Yoon 关键词-EN: renewable energy sources, greenhouse gas emissions, reduce greenhouse gas, energy sources, global warming 类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2401.16936
点击查看摘要
Abstract:The world is moving towards clean and renewable energy sources, such as wind energy, in an attempt to reduce greenhouse gas emissions that contribute to global warming. To enhance the analysis and storage of wind data, we introduce a deep learning framework designed to simultaneously enable effective dimensionality reduction and continuous representation of multi-altitude wind data from discrete observations. The framework consists of three key components: dimensionality reduction, cross-modal prediction, and super-resolution. We aim to: (1) improve data resolution across diverse climatic conditions to recover high-resolution details; (2) reduce data dimensionality for more efficient storage of large climate datasets; and (3) enable cross-prediction between wind data measured at different heights. Comprehensive testing confirms that our approach surpasses existing methods in both super-resolution quality and compression efficiency.
[CV-107] Improving satellite imagery segmentation using multiple Sentinel-2 revisits
链接: https://arxiv.org/abs/2409.17363 作者: Kartik Jindgar,Grace W. Lindsay 关键词-EN: traditional computer vision, computer vision, recent years, shared models pre-trained, benefited immensely 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In recent years, analysis of remote sensing data has benefited immensely from borrowing techniques from the broader field of computer vision, such as the use of shared models pre-trained on large and diverse datasets. However, satellite imagery has unique features that are not accounted for in traditional computer vision, such as the existence of multiple revisits of the same location. Here, we explore the best way to use revisits in the framework of fine-tuning pre-trained remote sensing models. We focus on an applied research question of relevance to climate change mitigation – power substation segmentation – that is representative of applied uses of pre-trained models more generally. Through extensive tests of different multi-temporal input schemes across diverse model architectures, we find that fusing representations from multiple revisits in the model latent space is superior to other methods of using revisits, including as a form of data augmentation. We also find that a SWIN Transformer-based architecture performs better than U-nets and ViT-based models. We verify the generality of our results on a separate building density estimation task.
[CV-108] A vision-based framework for human behavior understanding in industrial assembly lines
链接: https://arxiv.org/abs/2409.17356 作者: Konstantinos Papoutsakis,Nikolaos Bakalos,Konstantinos Fragkoulis,Athena Zacharia,Georgia Kapetadimitri,Maria Pateraki 关键词-EN: understanding human behavior, industrial assembly lines, paper introduces, introduces a vision-based, capturing and understanding 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:This paper introduces a vision-based framework for capturing and understanding human behavior in industrial assembly lines, focusing on car door manufacturing. The framework leverages advanced computer vision techniques to estimate workers’ locations and 3D poses and analyze work postures, actions, and task progress. A key contribution is the introduction of the CarDA dataset, which contains domain-relevant assembly actions captured in a realistic setting to support the analysis of the framework for human pose and action analysis. The dataset comprises time-synchronized multi-camera RGB-D videos, motion capture data recorded in a real car manufacturing environment, and annotations for EAWS-based ergonomic risk scores and assembly activities. Experimental results demonstrate the effectiveness of the proposed approach in classifying worker postures and robust performance in monitoring assembly task progress.
[CV-109] SeaSplat: Representing Underwater Scenes with 3D Gaussian Splatting and a Physically Grounded Image Formation Model
链接: https://arxiv.org/abs/2409.17345 作者: Daniel Yang,John J. Leonard,Yogesh Girdhar 关键词-EN: enable real-time rendering, method to enable, underwater image formation, real-time rendering, radiance fields 类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Project page here: this https URL
点击查看摘要
Abstract:We introduce SeaSplat, a method to enable real-time rendering of underwater scenes leveraging recent advances in 3D radiance fields. Underwater scenes are challenging visual environments, as rendering through a medium such as water introduces both range and color dependent effects on image capture. We constrain 3D Gaussian Splatting (3DGS), a recent advance in radiance fields enabling rapid training and real-time rendering of full 3D scenes, with a physically grounded underwater image formation model. Applying SeaSplat to the real-world scenes from SeaThru-NeRF dataset, a scene collected by an underwater vehicle in the US Virgin Islands, and simulation-degraded real-world scenes, not only do we see increased quantitative performance on rendering novel viewpoints from the scene with the medium present, but are also able to recover the underlying true color of the scene and restore renders to be without the presence of the intervening medium. We show that the underwater image formation helps learn scene structure, with better depth maps, as well as show that our improvements maintain the significant computational improvements afforded by leveraging a 3D Gaussian representation.
[CV-110] Energy-Efficient Real-Time Computer Vision with Intelligent Skipping via Reconfigurable CMOS Image Sensors
链接: https://arxiv.org/abs/2409.17341 作者: Md Abdullah-Al Kaiser,Sreetama Sarkar,Peter A. Beerel,Akhilesh R. Jaiswal,Gourav Datta 关键词-EN: Current video-based computer, video-based computer vision, Current video-based, high energy consumption, applications typically suffer 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review
点击查看摘要
Abstract:Current video-based computer vision (CV) applications typically suffer from high energy consumption due to reading and processing all pixels in a frame, regardless of their significance. While previous works have attempted to reduce this energy by skipping input patches or pixels and using feedback from the end task to guide the skipping algorithm, the skipping is not performed during the sensor read phase. As a result, these methods can not optimize the front-end sensor energy. Moreover, they may not be suitable for real-time applications due to the long latency of modern CV networks that are deployed in the back-end. To address this challenge, this paper presents a custom-designed reconfigurable CMOS image sensor (CIS) system that improves energy efficiency by selectively skipping uneventful regions or rows within a frame during the sensor’s readout phase, and the subsequent analog-to-digital conversion (ADC) phase. A novel masking algorithm intelligently directs the skipping process in real-time, optimizing both the front-end sensor and back-end neural networks for applications including autonomous driving and augmented/virtual reality (AR/VR). Our system can also operate in standard mode without skipping, depending on application needs. We evaluate our hardware-algorithm co-design framework on object detection based on BDD100K and ImageNetVID, and gaze estimation based on OpenEDS, achieving up to 53% reduction in front-end sensor energy while maintaining state-of-the-art (SOTA) accuracy.
[CV-111] Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting
链接: https://arxiv.org/abs/2409.17332 作者: Jay Zoellin,Colin Merk,Mischa Buob,Amr Saad,Samuel Giesser,Tahm Spitznagel,Ferhat Turgut,Rui Santos,Yukun Zhou,Sigfried Wagner,Pearse A. Keane,Yih Chung Tham,Delia Cabrera DeBuc,Matthias D. Becker,Gabor M. Somfai 关键词-EN: Integrating deep learning, greatly advance diagnostic, Integrating deep, self-supervised learning, DINORET 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this http URL , C. Merk and M. Buob contributed equally as shared-first authors. D. Cabrera DeBuc, M. D. Becker and G. M. Somfai contributed equally as senior authors for this work
点击查看摘要
Abstract:Integrating deep learning into medical imaging is poised to greatly advance diagnostic methods but it faces challenges with generalizability. Foundation models, based on self-supervised learning, address these issues and improve data efficiency. Natural domain foundation models show promise for medical imaging, but systematic research evaluating domain adaptation, especially using self-supervised learning and parameter-efficient fine-tuning, remains underexplored. Additionally, little research addresses the issue of catastrophic forgetting during fine-tuning of foundation models. We adapted the DINOv2 vision transformer for retinal imaging classification tasks using self-supervised learning and generated two novel foundation models termed DINORET and BE DINORET. Publicly available color fundus photographs were employed for model development and subsequent fine-tuning for diabetic retinopathy staging and glaucoma detection. We introduced block expansion as a novel domain adaptation strategy and assessed the models for catastrophic forgetting. Models were benchmarked to RETFound, a state-of-the-art foundation model in ophthalmology. DINORET and BE DINORET demonstrated competitive performance on retinal imaging tasks, with the block expanded model achieving the highest scores on most datasets. Block expansion successfully mitigated catastrophic forgetting. Our few-shot learning studies indicated that DINORET and BE DINORET outperform RETFound in terms of data-efficiency. This study highlights the potential of adapting natural domain vision models to retinal imaging using self-supervised learning and block expansion. BE DINORET offers robust performance without sacrificing previously acquired capabilities. Our findings suggest that these methods could enable healthcare institutions to develop tailored vision models for their patient populations, enhancing global healthcare inclusivity.
[CV-112] ChatCam: Empowering Camera Control through Conversational AI NEURIPS2024
链接: https://arxiv.org/abs/2409.17331 作者: Xinhang Liu,Yu-Wing Tai,Chi-Keung Tang 关键词-EN: crafting compelling visual, compelling visual narratives, Cinematographers adeptly capture, crafting compelling, intricate camera movements 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper accepted to NeurIPS 2024
点击查看摘要
Abstract:Cinematographers adeptly capture the essence of the world, crafting compelling visual narratives through intricate camera movements. Witnessing the strides made by large language models in perceiving and interacting with the 3D world, this study explores their capability to control cameras with human language guidance. We introduce ChatCam, a system that navigates camera movements through conversations with users, mimicking a professional cinematographer’s workflow. To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation. We also develop an Anchor Determinator to ensure precise camera trajectory placement. ChatCam understands user requests and employs our proposed tools to generate trajectories, which can be used to render high-quality video footage on radiance field representations. Our experiments, including comparisons to state-of-the-art approaches and user studies, demonstrate our approach’s ability to interpret and execute complex instructions for camera operation, showing promising applications in real-world production settings.
链接: https://arxiv.org/abs/2409.17330 作者: Liangyu Zhong,Joachim Sicking,Fabian Hüger,Hanno Gottschalk 关键词-EN: achieved significant success, identically distributed data, Semantic segmentation networks, achieved significant, significant success 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 27 pages, 9 figures, to be published in ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD)
点击查看摘要
Abstract:Semantic segmentation networks have achieved significant success under the assumption of independent and identically distributed data. However, these networks often struggle to detect anomalies from unknown semantic classes due to the limited set of visual concepts they are typically trained on. To address this issue, anomaly segmentation often involves fine-tuning on outlier samples, necessitating additional efforts for data collection, labeling, and model retraining. Seeking to avoid this cumbersome work, we take a different approach and propose to incorporate Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness. Additionally, we propose a new scoring function that enables data- and training-free outlier supervision via textual prompts. The resulting VL4AD model, which includes max-logit prompt ensembling and a class-merging strategy, achieves competitive performance on widely used benchmark datasets, thereby demonstrating the potential of vision-language models for pixel-wise anomaly detection.
[CV-114] Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement
Abstract:Remote photoplethysmography (rPPG) is gaining prominence for its non-invasive approach to monitoring physiological signals using only cameras. Despite its promise, the adaptability of rPPG models to new, unseen domains is hindered due to the environmental sensitivity of physiological signals. To address this, we pioneer the Test-Time Adaptation (TTA) in rPPG, enabling the adaptation of pre-trained models to the target domain during inference, sidestepping the need for annotations or source data due to privacy considerations. Particularly, utilizing only the user’s face video stream as the accessible target domain data, the rPPG model is adjusted by tuning on each single instance it encounters. However, 1) TTA algorithms are designed predominantly for classification tasks, ill-suited in regression tasks such as rPPG due to inadequate supervision. 2) Tuning pre-trained models in a single-instance manner introduces variability and instability, posing challenges to effectively filtering domain-relevant from domain-irrelevant features while simultaneously preserving the learned information. To overcome these challenges, we present Bi-TTA, a novel expert knowledge-based Bidirectional Test-Time Adapter framework. Specifically, leveraging two expert-knowledge priors for providing self-supervision, our Bi-TTA primarily comprises two modules: a prospective adaptation (PA) module using sharpness-aware minimization to eliminate domain-irrelevant noise, enhancing the stability and efficacy during the adaptation process, and a retrospective stabilization (RS) module to dynamically reinforce crucial learned model parameters, averting performance degradation caused by overfitting or catastrophic forgetting. To this end, we established a large-scale benchmark for rPPG tasks under TTA protocol. The experimental results demonstrate the significant superiority of our approach over the state-of-the-art.
[CV-115] Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation EMNLP2024
链接: https://arxiv.org/abs/2409.17313 作者: Zehao Wang,Minye Wu,Yixin Cao,Yubo Ma,Meiqi Chen,Tinne Tuytelaars 关键词-EN: study presents, instruction categories, evaluation framework, VLN, Vision-Language Navigation 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Findings; project page: this https URL
点击查看摘要
Abstract:This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models for various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task. The CFG serves as the basis for the problem decomposition and the core premise of the instruction categories design. We propose a semi-automatic method for CFG construction with the help of Large-Language Models (LLMs). Then, we induct and generate data spanning five principal instruction categories (i.e. direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other interesting findings contribute to the development of future language-guided navigation systems.
[CV-116] Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
链接: https://arxiv.org/abs/2409.17280 作者: Hui En Pang,Shuai Liu,Zhongang Cai,Lei Yang,Tianwei Zhang,Ziwei Liu 关键词-EN: Gaussian Splatting framework, Splatting framework, Gaussian Splatting, textbf, Splatting 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:We present \textbfDisco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. Different from existing methods, Disco4D distinctively disentangles clothings (with Gaussian models) from the human body (with SMPL-X model), significantly enhancing the generation details and flexibility. It has the following technical innovations. \textbf1) Disco4D learns to efficiently fit the clothing Gaussians over the SMPL-X Gaussians. \textbf2) It adopts diffusion models to enhance the 3D generation process, \textite.g., modeling occluded parts not visible in the input image. \textbf3) It learns an identity encoding for each clothing Gaussian to facilitate the separation and extraction of clothing assets. Furthermore, Disco4D naturally supports 4D human animation with vivid dynamics. Extensive experiments demonstrate the superiority of Disco4D on 4D human generation and animation tasks. Our visualizations can be found in \urlthis https URL.
[CV-117] Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs ECCV2024
Abstract:The supervision of state-of-the-art multiple object tracking (MOT) methods requires enormous annotation efforts to provide bounding boxes for all frames of all videos, and instance IDs to associate them through time. To this end, we introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations, and no tracking labels. First, we design a quasi-dense temporal object appearance graph, and propose a novel multi-positive contrastive objective to optimize random walks on the graph and learn instance similarities. Then, we introduce an algorithm to enforce mutually-exclusive connective properties across instances in the graph, optimizing the learned topology for MOT. At inference time, we propose to associate detected instances to tracklets based on the max-likelihood transition state under motion-constrained bi-directional walks. Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K. Remarkably, our proposal outperforms the previous self-supervised trackers even when drastically reducing the annotation requirements by up to 400x.
[CV-118] Neural Network Architecture Search Enabled Wide-Deep Learning (NAS-WD) for Spatially Heterogenous Property Awared Chicken Woody Breast Classification and Hardness Regression
链接: https://arxiv.org/abs/2409.17210 作者: Chaitanya Pallerla,Yihong Feng,Casey M. Owens,Ramesh Bahadur Bist,Siavash Mahmoudi,Pouya Sohrabipour,Amirreza Davar,Dongyi Wang 关键词-EN: intensive genetic selection, rapid growth rates, global poultry industry, high broiler yields, Due to intensive 类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
*备注:
点击查看摘要
Abstract:Due to intensive genetic selection for rapid growth rates and high broiler yields in recent years, the global poultry industry has faced a challenging problem in the form of woody breast (WB) conditions. This condition has caused significant economic losses as high as 200 million annually, and the root cause of WB has yet to be identified. Human palpation is the most common method of distinguishing a WB from others. However, this method is time-consuming and subjective. Hyperspectral imaging (HSI) combined with machine learning algorithms can evaluate the WB conditions of fillets in a non-invasive, objective, and high-throughput manner. In this study, 250 raw chicken breast fillet samples (normal, mild, severe) were taken, and spatially heterogeneous hardness distribution was first considered when designing HSI processing models. The study not only classified the WB levels from HSI but also built a regression model to correlate the spectral information with sample hardness data. To achieve a satisfactory classification and regression model, a neural network architecture search (NAS) enabled a wide-deep neural network model named NAS-WD, which was developed. In NAS-WD, NAS was first used to automatically optimize the network architecture and hyperparameters. The classification results show that NAS-WD can classify the three WB levels with an overall accuracy of 95%, outperforming the traditional machine learning model, and the regression correlation between the spectral data and hardness was 0.75, which performs significantly better than traditional regression models.
[CV-119] 2024 BRAVO Challenge Track 1 1st Place Report: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation
链接: https://arxiv.org/abs/2409.17208 作者: Tommie Kerssies,Daan de Geus,Gijs Dubbelman 关键词-EN: BRAVO Challenge, trained on Cityscapes, robustness is evaluated, solution for Track, present our solution 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: arXiv admin note: substantial text overlap with arXiv:2409.15107
点击查看摘要
Abstract:In this report, we present our solution for Track 1 of the 2024 BRAVO Challenge, where a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models, by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches, and achieves 1st place in the challenge. Our code is publicly available at this https URL.
[CV-120] AACLiteNet: A Lightweight Model for Detection of Fine-Grained Abdominal Aortic Calcification
链接: https://arxiv.org/abs/2409.17203 作者: Zaid Ilyas,Afsah Saleem,David Suter,Siobhan Reid,John Schousboe,William Leslie,Joshua Lewis,Syed Zulqarnain Gilani 关键词-EN: Cardiovascular Diseases, million lives annually, Vertebral Fracture Assessment, Abdominal Aortic Calcification, death worldwide 类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages including references
点击查看摘要
Abstract:Cardiovascular Diseases (CVDs) are the leading cause of death worldwide, taking 17.9 million lives annually. Abdominal Aortic Calcification (AAC) is an established marker for CVD, which can be observed in lateral view Vertebral Fracture Assessment (VFA) scans, usually done for vertebral fracture detection. Early detection of AAC may help reduce the risk of developing clinical CVDs by encouraging preventive measures. Manual analysis of VFA scans for AAC measurement is time consuming and requires trained human assessors. Recently, efforts have been made to automate the process, however, the proposed models are either low in accuracy, lack granular level score prediction, or are too heavy in terms of inference time and memory footprint. Considering all these shortcomings of existing algorithms, we propose ‘AACLiteNet’, a lightweight deep learning model that predicts both cumulative and granular level AAC scores with high accuracy, and also has a low memory footprint, and computation cost (Floating Point Operations (FLOPs)). The AACLiteNet achieves a significantly improved one-vs-rest average accuracy of 85.94% as compared to the previous best 81.98%, with 19.88 times less computational cost and 2.26 times less memory footprint, making it implementable on portable computing devices.
[CV-121] Cross Dataset Analysis and Network Architecture Repair for Autonomous Car Lane Detection
Abstract:Transfer Learning has become one of the standard methods to solve problems to overcome the isolated learning paradigm by utilizing knowledge acquired for one task to solve another related one. However, research needs to be done, to identify the initial steps before inducing transfer learning to applications for further verification and explainablity. In this research, we have performed cross dataset analysis and network architecture repair for the lane detection application in autonomous vehicles. Lane detection is an important aspect of autonomous vehicles driving assistance system. In most circumstances, modern deep-learning-based lane recognition systems are successful, but they struggle with lanes with complex topologies. The proposed architecture, ERFCondLaneNet is an enhancement to the CondlaneNet used for lane identification framework to solve the difficulty of detecting lane lines with complex topologies like dense, curved and fork lines. The newly proposed technique was tested on two common lane detecting benchmarks, CULane and CurveLanes respectively, and two different backbones, ResNet and ERFNet. The researched technique with ERFCondLaneNet, exhibited similar performance in comparison to ResnetCondLaneNet, while using 33% less features, resulting in a reduction of model size by 46%.
[CV-122] An Art-centric perspective on AI-based content moderation of nudity ECCV2024
链接: https://arxiv.org/abs/2409.17156 作者: Piera Riccio,Georgina Curto,Thomas Hofmann,Nuria Oliver 关键词-EN: generative Artificial Intelligence, Artificial Intelligence, highly debated topic, artistic nudity online, generative Artificial 类目: Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
*备注: To be published at the AI4VA (AI for Visual Arts) Workshop and Challenges at ECCV 2024
点击查看摘要
Abstract:At a time when the influence of generative Artificial Intelligence on visual arts is a highly debated topic, we raise the attention towards a more subtle phenomenon: the algorithmic censorship of artistic nudity online. We analyze the performance of three "Not-Safe-For-Work’’ image classifiers on artistic nudity, and empirically uncover the existence of a gender and a stylistic bias, as well as evident technical limitations, especially when only considering visual information. Hence, we propose a multi-modal zero-shot classification approach that improves artistic nudity classification. From our research, we draw several implications that we hope will inform future research on this topic.
[CV-123] PhoCoLens: Photorealistic and Consistent Reconstruction in Lensless Imaging NEURIPS2024
链接: https://arxiv.org/abs/2409.17996 作者: Xin Cai,Zhiyuan You,Hailong Zhang,Wentao Liu,Jinwei Gu,Tianfan Xue 关键词-EN: offer significant advantages, cameras offer significant, traditional lens-based systems, Lensless cameras offer, advantages in size 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS 2024 Spotlight
点击查看摘要
Abstract:Lensless cameras offer significant advantages in size, weight, and cost compared to traditional lens-based systems. Without a focusing lens, lensless cameras rely on computational algorithms to recover the scenes from multiplexed measurements. However, current algorithms struggle with inaccurate forward imaging models and insufficient priors to reconstruct high-quality images. To overcome these limitations, we introduce a novel two-stage approach for consistent and photorealistic lensless image reconstruction. The first stage of our approach ensures data consistency by focusing on accurately reconstructing the low-frequency content with a spatially varying deconvolution method that adjusts to changes in the Point Spread Function (PSF) across the camera’s field of view. The second stage enhances photorealism by incorporating a generative prior from pre-trained diffusion models. By conditioning on the low-frequency content retrieved in the first stage, the diffusion model effectively reconstructs the high-frequency details that are typically lost in the lensless imaging process, while also maintaining image fidelity. Our method achieves a superior balance between data fidelity and visual quality compared to existing methods, as demonstrated with two popular lensless systems, PhlatCam and DiffuserCam. Project website: this https URL.
[CV-124] LGFN: Lightweight Light Field Image Super-Resolution using Local Convolution Modulation and Global Attention Feature Extraction
链接: https://arxiv.org/abs/2409.17759 作者: Zhongxin Yu,Liang Chen,Zhiyun Zeng,Kunping Yang,Shaofei Luo,Shaorui Chen,Cheng Zhong 关键词-EN: Capturing different intensity, scene Light field, scene cues, post-capture refocusing, depth sensing 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures
点击查看摘要
Abstract:Capturing different intensity and directions of light rays at the same scene Light field (LF) can encode the 3D scene cues into a 4D LF image which has a wide range of applications (i.e. post-capture refocusing and depth sensing). LF image super-resolution (SR) aims to improve the image resolution limited by the performance of LF camera sensor. Although existing methods have achieved promising results the practical application of these models is limited because they are not lightweight enough. In this paper we propose a lightweight model named LGFN which integrates the local and global features of different views and the features of different channels for LF image SR. Specifically owing to neighboring regions of the same pixel position in different sub-aperture images exhibit similar structural relationships we design a lightweight CNN-based feature extraction module (namely DGCE) to extract local features better through feature modulation. Meanwhile as the position beyond the boundaries in the LF image presents a large disparity we propose an efficient spatial attention module (namely ESAM) which uses decomposable large-kernel convolution to obtain an enlarged receptive field and an efficient channel attention module (namely ECAM). Compared with the existing LF image SR models with large parameter our model has a parameter of 0.45M and a FLOPs of 19.33G which has achieved a competitive effect. Extensive experiments with ablation studies demonstrate the effectiveness of our proposed method which ranked the second place in the Track 2 Fidelity Efficiency of NTIRE2024 Light Field Super Resolution Challenge and the seventh place in the Track 1 Fidelity.
[CV-125] Let the Quantum Creep In: Designing Quantum Neural Network Models by Gradually Swapping Out Classical Components
链接: https://arxiv.org/abs/2409.17583 作者: Peiyong Wang,Casey. R. Myers,Lloyd C. L. Hollenberg,Udaya Parampalli 关键词-EN: Artificial Intelligence, quantum neural network, neural network, classical neural network, quantum 类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 50 pages (including Appendix), many figures, accepted as a poster on QTML2024. Code available at this https URL
点击查看摘要
Abstract:Artificial Intelligence (AI), with its multiplier effect and wide applications in multiple areas, could potentially be an important application of quantum computing. Since modern AI systems are often built on neural networks, the design of quantum neural networks becomes a key challenge in integrating quantum computing into AI. To provide a more fine-grained characterisation of the impact of quantum components on the performance of neural networks, we propose a framework where classical neural network layers are gradually replaced by quantum layers that have the same type of input and output while keeping the flow of information between layers unchanged, different from most current research in quantum neural network, which favours an end-to-end quantum model. We start with a simple three-layer classical neural network without any normalisation layers or activation functions, and gradually change the classical layers to the corresponding quantum versions. We conduct numerical experiments on image classification datasets such as the MNIST, FashionMNIST and CIFAR-10 datasets to demonstrate the change of performance brought by the systematic introduction of quantum components. Through this framework, our research sheds new light on the design of future quantum neural network models where it could be more favourable to search for methods and frameworks that harness the advantages from both the classical and quantum worlds.
[CV-126] NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes NEURIPS2024
链接: https://arxiv.org/abs/2409.17510 作者: Ziquan Wei,Tingting Dan,Jiaqi Ding,Paul J Laurienti,Guorong Wu 关键词-EN: modern imaging technologies, fluctuations emerge remarkable, emerge remarkable cognition, brain regions in-vivo, spontaneous functional fluctuations 类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:Although modern imaging technologies allow us to study connectivity between two distinct brain regions in-vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations emerge remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating such network neuroscience question into an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop interacted by brain structure and function. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biological-inspired deep model, coined as NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance by our NeuroPath indicates great potential in network neuroscience.
[CV-127] Shape-intensity knowledge distillation for robust medical image segmentation
Abstract:Many medical image segmentation methods have achieved impressive results. Yet, most existing methods do not take into account the shape-intensity prior information. This may lead to implausible segmentation results, in particular for images of unseen datasets. In this paper, we propose a novel approach to incorporate joint shape-intensity prior information into the segmentation network. Specifically, we first train a segmentation network (regarded as the teacher network) on class-wise averaged training images to extract valuable shape-intensity information, which is then transferred to a student segmentation network with the same network architecture as the teacher via knowledge distillation. In this way, the student network regarded as the final segmentation model can effectively integrate the shape-intensity prior information, yielding more accurate segmentation results. Despite its simplicity, experiments on five medical image segmentation tasks of different modalities demonstrate that the proposed Shape-Intensity Knowledge Distillation (SIKD) consistently improves several baseline models (including recent MaxStyle and SAMed) under intra-dataset evaluation, and significantly improves the cross-dataset generalization ability. The code is available at this https URL.
[CV-128] Study of Subjective and Objective Quality in Super-Resolution Enhanced Broadcast Images on a Novel SR-IQA Dataset
链接: https://arxiv.org/abs/2409.17451 作者: Yongrok Kim,Junha Shin,Juhyun Lee,Hyunsuk Ko 关键词-EN: key consumer technology, display low-quality broadcast, full-screen format, application of Super-Resolution, consumer technology 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
点击查看摘要
Abstract:To display low-quality broadcast content on high-resolution screens in full-screen format, the application of Super-Resolution (SR), a key consumer technology, is essential. Recently, SR methods have been developed that not only increase resolution while preserving the original image information but also enhance the perceived quality. However, evaluating the quality of SR images generated from low-quality sources, such as SR-enhanced broadcast content, is challenging due to the need to consider both distortions and improvements. Additionally, assessing SR image quality without original high-quality sources presents another significant challenge. Unfortunately, there has been a dearth of research specifically addressing the Image Quality Assessment (IQA) of SR images under these conditions. In this work, we introduce a new IQA dataset for SR broadcast images in both 2K and 4K resolutions. We conducted a subjective quality evaluation to obtain the Mean Opinion Score (MOS) for these SR images and performed a comprehensive human study to identify the key factors influencing the perceived quality. Finally, we evaluated the performance of existing IQA metrics on our dataset. This study reveals the limitations of current metrics, highlighting the need for a more robust IQA metric that better correlates with the perceived quality of SR images.
[CV-129] Multi-scale decomposition of sea surface height snapshots using machine learning
链接: https://arxiv.org/abs/2409.17354 作者: Jingwen Lyu,Yue Wang,Christian Pedersen,Spencer Jones,Dhruv Balwada 关键词-EN: Sea Surface Height, Knowledge of ocean, weather and climate, blue economy, important for understanding 类目: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Knowledge of ocean circulation is important for understanding and predicting weather and climate, and managing the blue economy. This circulation can be estimated through Sea Surface Height (SSH) observations, but requires decomposing the SSH into contributions from balanced and unbalanced motions (BMs and UBMs). This decomposition is particularly pertinent for the novel SWOT satellite, which measures SSH at an unprecedented spatial resolution. Specifically, the requirement, and the goal of this work, is to decompose instantaneous SSH into BMs and UBMs. While a few studies using deep learning (DL) approaches have shown promise in framing this decomposition as an image-to-image translation task, these models struggle to work well across a wide range of spatial scales and require extensive training data, which is scarce in this domain. These challenges are not unique to our task, and pervade many problems requiring multi-scale fidelity. We show that these challenges can be addressed by using zero-phase component analysis (ZCA) whitening and data augmentation; making this a viable option for SSH decomposition across scales.
[CV-130] An Integrated Deep Learning Framework for Effective Brain Tumor Localization Segmentation and Classification from Magnetic Resonance Images
链接: https://arxiv.org/abs/2409.17273 作者: Pandiyaraju V,Shravan Venkatraman,Abeshek A,Aravintakshan S A,Pavan Kumar S,Madhan S 关键词-EN: abnormal cell growth, abnormal cell, brain cells, brain tissue, cell growth 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 36 pages, 27 figures, 5 tables
点击查看摘要
Abstract:Tumors in the brain result from abnormal cell growth within the brain tissue, arising from various types of brain cells. When left undiagnosed, they lead to severe neurological deficits such as cognitive impairment, motor dysfunction, and sensory loss. As the tumor grows, it causes an increase in intracranial pressure, potentially leading to life-threatening complications such as brain herniation. Therefore, early detection and treatment are necessary to manage the complications caused by such tumors to slow down their growth. Numerous works involving deep learning (DL) and artificial intelligence (AI) are being carried out to assist physicians in early diagnosis by utilizing the scans obtained through Magnetic Resonance Imaging (MRI). Our research proposes DL frameworks for localizing, segmenting, and classifying the grade of these gliomas from MRI images to solve this critical issue. In our localization framework, we enhance the LinkNet framework with a VGG19- inspired encoder architecture for improved multimodal tumor feature extraction, along with spatial and graph attention mechanisms to refine feature focus and inter-feature relationships. Following this, we integrated the SeResNet101 CNN model as the encoder backbone into the LinkNet framework for tumor segmentation, which achieved an IoU Score of 96%. To classify the segmented tumors, we combined the SeResNet152 feature extractor with an Adaptive Boosting classifier, which yielded an accuracy of 98.53%. Our proposed models demonstrated promising results, with the potential to advance medical AI by enabling early diagnosis and providing more accurate treatment options for patients.
[CV-131] AIM 2024 Challenge on Efficient Video Super-Resolution for AV1 Compressed Content ECCV
链接: https://arxiv.org/abs/2409.17256 作者: Marcos V Conde,Zhijun Lei,Wen Li,Christos Bampis,Ioannis Katsavounidis,Radu Timofte 关键词-EN: critical task, task for enhancing, enhancing low-bitrate, low-bitrate and low-resolution, VSR 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
*备注: European Conference on Computer Vision (ECCV) 2024 - Advances in Image Manipulation (AIM)
点击查看摘要
Abstract:Video super-resolution (VSR) is a critical task for enhancing low-bitrate and low-resolution videos, particularly in streaming applications. While numerous solutions have been developed, they often suffer from high computational demands, resulting in low frame rates (FPS) and poor power efficiency, especially on mobile platforms. In this work, we compile different methods to address these challenges, the solutions are end-to-end real-time video super-resolution frameworks optimized for both high performance and low runtime. We also introduce a new test set of high-quality 4K videos to further validate the approaches. The proposed solutions tackle video up-scaling for two applications: 540p to 4K (x4) as a general case, and 360p to 1080p (x3) more tailored towards mobile devices. In both tracks, the solutions have a reduced number of parameters and operations (MACs), allow high FPS, and improve VMAF and PSNR over interpolation baselines. This report gauges some of the most efficient video super-resolution methods to date.
[CV-132] MODELCO: Exoplanet detection in angular differential imaging by learning across multiple observations
链接: https://arxiv.org/abs/2409.17178 作者: Théo Bodrito,Olivier Flasseur,Julien Mairal,Jean Ponce,Maud Langlois,Anne-Marie Lagrange 关键词-EN: small angular separation, angular separations due, Direct imaging, star luminosities, high contrast 类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
点击查看摘要
Abstract:Direct imaging of exoplanets is particularly challenging due to the high contrast between the planet and the star luminosities, and their small angular separation. In addition to tailored instrumental facilities implementing adaptive optics and coronagraphy, post-processing methods combining several images recorded in pupil tracking mode are needed to attenuate the nuisances corrupting the signals of interest. Most of these post-processing methods build a model of the nuisances from the target observations themselves, resulting in strongly limited detection sensitivity at short angular separations due to the lack of angular diversity. To address this issue, we propose to build the nuisance model from an archive of multiple observations by leveraging supervised deep learning techniques. The proposed approach casts the detection problem as a reconstruction task and captures the structure of the nuisance from two complementary representations of the data. Unlike methods inspired by reference differential imaging, the proposed model is highly non-linear and does not resort to explicit image-to-image similarity measurements and subtractions. The proposed approach also encompasses statistical modeling of learnable spatial features. The latter is beneficial to improve both the detection sensitivity and the robustness against heterogeneous data. We apply the proposed algorithm to several datasets from the VLT/SPHERE instrument, and demonstrate a superior precision-recall trade-off compared to the PACO algorithm. Interestingly, the gain is especially important when the diversity induced by ADI is the most limited, thus supporting the ability of the proposed approach to learn information across multiple observations.
机器学习
[LG-0] Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography MICCAI2024
链接: https://arxiv.org/abs/2409.18119 作者: Yuexi Du,John Onofrey,Nicha C. Dvornek 关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Language-Image Pre-training, requires substantial data, shows promise 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is also the basis of the overall best solution for the MICCAI 2024 CXR-LT Challenge
点击查看摘要
Abstract:Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modal