本篇博文主要展示 2024-09-27 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上11:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2024-09-27)

今日共更新537篇论文,其中:

  • 自然语言处理84篇(Computation and Language (cs.CL))
  • 人工智能159篇(Artificial Intelligence (cs.AI))
  • 计算机视觉133篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习156篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Open-World Evaluation for Retrieving Diverse Perspectives

【速读】: 该论文试图解决在复杂和有争议的问题上检索包含多样化观点的文档集合的问题。解决方案的关键在于构建了一个名为BERDS的多样化检索基准,并开发了一种基于语言模型的自动评估器,用于判断检索到的文档是否包含特定观点。通过这种方式,论文评估了不同类型的语料库(如Wikipedia、网页快照和动态构建的语料库)与检索器的组合性能,并探讨了查询扩展和专注于多样性的重排序方法对检索效果的影响。最终,论文为处理复杂查询的多样化检索研究奠定了基础。

链接: https://arxiv.org/abs/2409.18110
作者: Hung-Ting Chen,Eunsol Choi
关键词-EN: harm than good, contentious question, Subjective questions, question, Retrieval Diversity
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., will ChatGPT do more harm than good?). We curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and diverse perspectives associated with the question, sourced from survey questions and debate websites. On this data, retrievers paired with a corpus are evaluated to surface a document set that contains diverse perspectives. Our framing diverges from most retrieval tasks in that document relevancy cannot be decided by simple string matches to references. Instead, we build a language model based automatic evaluator that decides whether each retrieved document contains a perspective. This allows us to evaluate the performance of three different types of corpus (Wikipedia, web snapshot, and corpus constructed on the fly with retrieved pages from the search engine) paired with retrievers. Retrieving diverse documents remains challenging, with the outputs from existing retrievers covering all perspectives on only 33.74% of the examples. We further study the impact of query expansion and diversity-focused reranking approaches and analyze retriever sycophancy. Together, we lay the foundation for future studies in retrieval diversity handling complex queries.
摘要:我们研究如何检索一组涵盖复杂且有争议问题的各种观点的文档(例如,ChatGPT 是否会带来更多危害而非益处?)。我们构建了一个主观问题检索多样性基准 (BERDS),其中每个示例包含一个问题及其相关的多样化观点,这些观点来源于调查问卷和辩论网站。在此数据集上,结合语料库的检索器被评估以呈现包含多样化观点的文档集。我们的框架与大多数检索任务不同,因为文档的相关性不能通过简单的字符串匹配来确定。相反,我们构建了一个基于语言模型的自动评估器,用于判断每个检索到的文档是否包含某种观点。这使我们能够评估三种不同类型的语料库(维基百科、网页快照以及通过搜索引擎检索页面即时构建的语料库)与检索器配对时的性能。检索多样化文档仍然具有挑战性,现有检索器的输出仅在 33.74% 的示例中涵盖了所有观点。我们进一步研究了查询扩展和专注于多样性的重排序方法的影响,并分析了检索器的盲从性。综上所述,我们为未来处理复杂查询的检索多样性研究奠定了基础。

[NLP-1] Infer Humans Intentions Before Following Natural Language Instructions

【速读】: 该论文试图解决AI代理在遵循自然语言指令完成日常协作任务时,由于人类指令固有的模糊性而导致的执行失败问题。解决方案的关键在于提出了一个新的框架——Follow Instructions with Social and Embodied Reasoning (FISER),该框架通过显式推理人类目标和意图作为中间推理步骤,从而更好地处理指令中的模糊性。通过使用基于Transformer的模型,FISER在HandMeThat基准测试中表现优异,显著超越了纯粹的端到端方法和现有的强基线模型,达到了该领域的最新技术水平。

链接: https://arxiv.org/abs/2409.18073
作者: Yanming Wan,Yue Wu,Yiping Wang,Jiayuan Mao,Natasha Jaques
关键词-EN: complete everyday cooperative, everyday cooperative tasks, complete everyday, everyday cooperative, human
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.
摘要:为了使 AI 智能体对人类有所帮助,它们应当能够遵循自然语言指令,在人类环境中完成日常的合作任务。然而,真实的人类指令本身具有模糊性,因为说话者假设听者对其隐藏的目标和意图有足够的先验知识。标准的语言接地和规划方法无法解决这种模糊性,因为它们没有将人类的内在目标建模为环境中的额外部分可观察因素。我们提出了一种新的框架,即遵循指令与社会和具身推理 (Follow Instructions with Social and Embodied Reasoning, FISER),旨在更好地遵循合作具身任务中的自然语言指令。我们的框架明确地将人类目标和意图作为中间推理步骤进行推断。我们实现了一系列基于 Transformer 的模型,并在一个具有挑战性的基准测试 HandMeThat 上进行了评估。实证结果表明,在制定行动计划之前,使用社会推理明确推断人类意图的方法优于纯粹的端到端方法。我们还将其与强大的基线方法进行了比较,包括在最大可用预训练大语言模型上进行的思维链提示 (Chain of Thought prompting),发现 FISER 在所研究的具身社会推理任务中表现更佳,达到了 HandMeThat 上的最新技术水平。

[NLP-2] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning EMNLP2024

【速读】: 该论文试图解决图像描述生成中仅依赖文本训练数据带来的模态差距问题,即训练时使用文本数据而推理时使用图像数据的差异。解决方案的关键在于提出了一种名为Image-like Retrieval的新方法,通过将文本特征与视觉相关特征对齐来缓解模态差距。此外,论文还设计了一个融合模块(Fusion Module)来整合检索到的描述与输入特征,以及一种基于频率的实体过滤技术(Frequency-based Entity Filtering)来提升描述质量。这些方法被整合到一个统一的框架中,称为IFCap,通过实验验证了其在图像和视频描述生成任务中显著优于现有的仅依赖文本训练的零样本描述生成方法。

链接: https://arxiv.org/abs/2409.18046
作者: Soeun Lee,Si-Woo Kim,Taewhan Kim,Dong-Jin Kim
关键词-EN: Recent advancements, paired image-text data, explored text-only training, text-only training, overcome the limitations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ( \textbfI mage-like Retrieval and \textbfF requency-based Entity Filtering for Zero-shot \textbfCap tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.
摘要:近年来,图像描述生成领域的进展探索了仅使用文本数据的训练方法,以克服配对图像-文本数据的局限性。然而,现有的仅使用文本数据的训练方法往往忽略了训练过程中使用文本数据与推理过程中使用图像之间的模态差异。为了解决这一问题,我们提出了一种名为“类图像检索”的新方法,该方法通过将文本特征与视觉相关特征对齐来缓解模态差异。我们的方法通过设计一个融合模块,将检索到的描述与输入特征相结合,进一步提高了生成描述的准确性。此外,我们还引入了一种基于频率的实体过滤技术,显著提升了描述质量。我们将这些方法整合到一个统一的框架中,称之为IFCap(Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning)。通过广泛的实验验证,我们这种简单而强大的方法展示了其有效性,在图像描述生成和视频描述生成方面,相比仅基于文本训练的零样本描述生成方法,显著优于当前最先进的方法。

[NLP-3] Unveiling the Role of Pretraining in Direct Speech Translation EMNLP2024

【速读】: 该论文试图解决直接语音到文本翻译系统在数据稀缺情况下训练效率低下的问题。解决方案的关键在于改进解码器的交叉注意力机制,使其能够在训练早期阶段更好地整合源语音信息。通过这一改进,从零开始训练的模型能够达到与预训练模型相当的性能,同时显著缩短训练时间。

链接: https://arxiv.org/abs/2409.18044
作者: Belen Alastruey,Gerard I. Gállego,Marta R. Costa-jussà
关键词-EN: data scarcity, translation systems encounter, encounter an important, important drawback, drawback in data
类目: Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists on pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.
摘要:直接语音到文本翻译系统面临数据稀缺的重要缺陷。常见的解决方案是在自动语音识别上预训练编码器,从而在训练过程中损失效率。在本研究中,我们比较了使用预训练编码器的系统、传统方法以及从头开始训练的系统的训练动态。我们观察到,在整个训练过程中,随机初始化的模型在预测时难以整合语音输入的信息。因此,我们假设这一问题源于有效训练直接语音翻译编码器的困难。虽然从头开始训练的模型需要同时学习声学和语义建模,但预训练的模型只需专注于后者。基于这些发现,我们提出了一种微妙的解码器交叉注意力变化,以在训练的早期步骤中整合源信息。我们展示了通过这一变化,从头开始训练的模型可以达到与预训练模型相当的性能,同时减少训练时间。

[NLP-4] EMOVA: Empowering Language Models to See Hear and Speak with Vivid Emotions

【速读】: 该论文试图解决大型语言模型(LLMs)在感知和生成图像、文本及语音时,依赖外部工具进行语音处理,以及语音模型缺乏视觉理解能力的问题。解决方案的关键在于提出了EMOVA(EMotionally Omni-present Voice Assistant),通过一个语义-声学解耦的语音标记器,实现了端到端的语音能力,同时保持了领先的视觉-语言性能。此外,论文还引入了一个轻量级的风格模块,用于灵活控制语音风格(如情感和音调),从而在视觉-语言和语音基准测试中达到了最先进的表现,并支持带有生动情感的全模态对话。

链接: https://arxiv.org/abs/2409.18042
作者: Kai Chen,Yunhao Gou,Runhui Huang,Zhili Liu,Daxin Tan,Jing Xu,Chunwei Wang,Yi Zhu,Yihan Zeng,Kuo Yang,Dingdong Wang,Kun Xiang,Haoyuan Li,Haoli Bai,Jianhua Han,Xiaohui Li,Weike Jin,Nian Xie,Yu Zhang,James T. Kwok,Hengshuang Zhao,Xiaodan Liang,Dit-Yan Yeung,Xiao Chen,Zhenguo Li,Wei Zhang,Qun Liu,Lanqing Hong,Lu Hou,Hang Xu
关键词-EN: Large Language Models, enables vocal conversations, Large Language, empowering Large Language, enable Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.
摘要:GPT-4o,一种能够实现带有多种情感和语调的语音对话的全模态模型,标志着全模态基础模型的一个重要里程碑。然而,在开源社区中,使大语言模型能够感知和生成图像、文本和语音的端到端能力,并利用公开数据仍然是一个挑战。现有的视觉-语言模型依赖于外部工具进行语音处理,而语音-语言模型仍然面临视觉理解能力有限甚至缺失的问题。为了填补这一空白,我们提出了EMOVA(EMotionally Omni-present Voice Assistant),以赋予大语言模型端到端的语音能力,同时保持领先的视觉-语言性能。通过一种语义-声学解耦的语音Token化器,我们发现全模态对齐可以进一步增强视觉-语言和语音能力,相比于相应的双模态对齐模型。此外,我们还提出了一种轻量级的风格模块,用于灵活的语音风格控制(例如,情感和音调)。首次,EMOVA在视觉-语言和语音基准测试中均达到了最先进的性能,同时支持带有生动情感的全模态语音对话。

[NLP-5] Automated Detection and Analysis of Power Words in Persuasive Text Using Natural Language Processing

【速读】: 该论文试图解决如何自动化检测和分析说服性文本中的“power words”(具有强烈情感反应的词汇),以评估其对读者情感和参与度的影响。解决方案的关键在于使用自定义词典和Python的TextBlob库,通过识别和统计文本中power words的存在和频率,进而分类和分析其对文本情感和读者行为的影响。该研究通过跨领域的多样化数据集,提供了关于power words效果的深入见解,为内容创作者、广告商和政策制定者提供了实际应用价值。

链接: https://arxiv.org/abs/2409.18033
作者: Sahil Garje
关键词-EN: influence readers’ behavior, evoke strong emotional, strong emotional responses, significantly influence readers’, Power words
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Power words are terms that evoke strong emotional responses and significantly influence readers’ behavior, playing a crucial role in fields like marketing, politics, and motivational writing. This study proposes a methodology for the automated detection and analysis of power words in persuasive text using a custom lexicon and the TextBlob library in Python. By identifying the presence and frequency of power words within a given text, we aim to classify and analyze their impact on sentiment and reader engagement. This research examines diverse datasets across various domains to provide insights into the effectiveness of power words, offering practical applications for content creators, advertisers, and policymakers.
摘要:影响力词汇是指能够引发强烈情感反应并显著影响读者行为的术语,在营销、政治和励志写作等领域中发挥着至关重要的作用。本研究提出了一种利用自定义词典和 Python 中的 TextBlob 库来自动检测和分析说服性文本中影响力词汇的方法。通过识别文本中影响力词汇的存在及其频率,我们旨在对其对情感和读者参与度的影响进行分类和分析。本研究考察了跨多个领域的多样化数据集,以提供关于影响力词汇有效性的见解,为内容创作者、广告商和政策制定者提供实际应用。

[NLP-6] Compositional Hardness of Code in Large Language Models – A Probabilistic Perspective

【速读】: 该论文试图解决大型语言模型(LLM)在处理复杂分析任务(如代码生成)时,由于上下文窗口限制导致的组合难题(in-context hardness of composition)。解决方案的关键在于通过多智能体系统(multi-agent system)将分解后的子任务分布到多个LLM中,从而降低生成复杂度。论文通过理论证明和实证研究,展示了在同一上下文中解决组合问题的生成复杂度与分布式多智能体系统之间的指数级差距。

链接: https://arxiv.org/abs/2409.18028
作者: Yotam Wolf,Binyamin Rothberg,Dorin Shteyman,Amnon Shashua
关键词-EN: large language model, complex analytical tasks, model context window, model context, usage for complex
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A common practice in large language model (LLM) usage for complex analytical tasks such as code generation, is to sample a solution for the entire task within the model’s context window. Previous works have shown that subtask decomposition within the model’s context (chain of thought), is beneficial for solving such tasks. In this work, we point a limitation of LLMs’ ability to perform several sub-tasks within the same context window - an in-context hardness of composition, pointing to an advantage for distributing a decomposed problem in a multi-agent system of LLMs. The hardness of composition is quantified by a generation complexity metric, i.e., the number of LLM generations required to sample at least one correct solution. We find a gap between the generation complexity of solving a compositional problem within the same context relative to distributing it among multiple agents, that increases exponentially with the solution’s length. We prove our results theoretically and demonstrate them empirically.
摘要:在大语言模型 (LLM) 用于代码生成等复杂分析任务时,常见做法是在模型的上下文窗口内采样整个任务的解决方案。先前的工作表明,在模型的上下文内进行子任务分解(即思维链),对解决此类任务是有益的。在本研究中,我们指出 LLM 在同一上下文窗口内执行多个子任务的能力存在局限性——即上下文内的组合难度,这表明在多智能体系统中分解问题具有优势。组合难度通过生成复杂度指标量化,即采样至少一个正确解决方案所需的大语言模型生成次数。我们发现,相对于在多个智能体之间分配问题,在同一上下文中解决组合问题的生成复杂度存在差距,并且随着解决方案长度的增加,这一差距呈指数级增长。我们通过理论证明和实证验证了这些结果。

[NLP-7] An Adversarial Perspective on Machine Unlearning for AI Safety

【速读】: 该论文试图解决的问题是现有的大语言模型在经过“遗忘”(unlearning)处理后,仍可能被对抗性方法恢复其潜在的危险能力。解决方案的关键在于揭示并挑战传统安全后训练与遗忘方法之间的根本差异,通过开发适应性方法来恢复被遗忘的能力,如在激活空间中移除特定方向或对无关示例进行微调,从而证明当前遗忘方法的鲁棒性不足,并质疑其相对于安全训练的优势。

链接: https://arxiv.org/abs/2409.18025
作者: Jakub Łucki,Boyi Wei,Yangsibo Huang,Peter Henderson,Florian Tramèr,Javier Rando
关键词-EN: Large language models, Large language, finetuned to refuse, Large, hazardous knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
摘要:大语言模型经过微调以拒绝关于危险知识的提问,但这些保护措施往往可以被绕过。遗忘方法旨在完全移除模型的危险能力,使其无法被对手访问。本文从对抗角度探讨了遗忘与传统安全后训练之间的根本差异。我们证明,先前被认为对遗忘无效的现有越狱方法,在谨慎应用时可以成功。此外,我们开发了多种自适应方法,恢复了大部分被认为已遗忘的能力。例如,我们展示了在10个不相关的示例上进行微调或在激活空间中移除特定方向,可以恢复使用RMU(一种最先进的遗忘方法)编辑的模型的大部分危险能力。我们的研究结果挑战了当前遗忘方法的鲁棒性,并质疑其在安全训练上的优势。

[NLP-8] DARE: Diverse Visual Question Answering with Robustness Evaluation

【速读】: 该论文试图解决现有视觉语言模型(VLMs)在多模态推理能力(如计数和空间推理)上的不足,以及在面对指令和评估协议微小变化时的脆弱性问题。解决方案的关键在于引入了一个名为DARE的多样化视觉问答基准,该基准通过五种不同类别的任务和四种基于提示、答案选项子集、输出格式和正确答案数量的鲁棒性评估,全面评估VLMs的性能和鲁棒性。研究发现,即使是当前最先进的VLMs在大多数类别的问题上仍表现不佳,且在不同鲁棒性评估中的表现波动较大,表明这些模型在面对多样化和微小变化时的鲁棒性仍有待提高。

链接: https://arxiv.org/abs/2409.18023
作者: Hannah Sterz,Jonas Pfeiffer,Ivan Vulić
关键词-EN: Vision Language Models, text-only large language, Vision Language, large language models, extend remarkable capabilities
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision Language Models (VLMs) extend remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on the variations of: prompts, the subsets of answer options, the output format and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of the open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match the closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.
摘要:视觉语言模型 (Vision Language Models, VLMs) 扩展了仅文本大语言模型和仅视觉模型的显著能力,能够从多模态的视觉文本输入中学习和处理信息。尽管现代 VLMs 在许多标准图像分类和图文匹配任务中表现出色,但它们在计数和空间推理等关键视觉语言 (Vision-Language, VL) 推理能力方面仍显不足。此外,尽管它们对指令和/或评估协议的小变化可能非常脆弱,但现有基准未能评估其鲁棒性(或更确切地说,缺乏鲁棒性)。为了将具有挑战性的 VL 场景与全面的鲁棒性评估相结合,我们引入了 DARE,即多样化的视觉问答与鲁棒性评估 (Diverse Visual Question Answering with Robustness Evaluation),这是一个精心创建和策划的多项选择 VQA 基准。DARE 评估 VLM 在五个不同类别上的表现,并包含四个基于提示、答案选项子集、输出格式和正确答案数量变化的鲁棒性评估。在众多其他发现中,我们报告称,最先进的 VLMs 在大多数类别的问题上仍显吃力,并且在测试的鲁棒性评估中无法持续展现其峰值性能。在选项子集中的最差表现比标准情况下的表现低达 34%。开源 VLMs 如 LLaVA 1.6 和 Idefics2 的鲁棒性无法与 GPT-4 和 Gemini 等闭源模型相媲美,但即使是后者,对不同变化的鲁棒性也非常脆弱。

[NLP-9] Multilingual Evaluation of Long Context Retrieval and Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)在多语言环境下处理长文本和多个目标句子的性能问题。解决方案的关键在于全面评估多个长上下文LLMs在五种不同语言(英语、越南语、印尼语、斯瓦希里语和索马里语)中的检索和推理任务,揭示了语言资源水平和目标句子数量对模型性能的显著影响。研究发现,即使在同一脚本(拉丁文)下,不同语言家族和资源水平的语言之间存在显著的性能差距,特别是在处理多个目标句子时,模型的准确率显著下降。

链接: https://arxiv.org/abs/2409.18006
作者: Ameeta Agrawal,Andy Dang,Sina Bagheri Nezhad,Rhitabrat Pokharel,Russell Scheinberg
关键词-EN: demonstrate impressive capabilities, Recent large language, exhibiting near-perfect recall, Recent large, handling long contexts
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We comprehensively evaluate several long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. The best-performing models such as Gemini-1.5 and GPT-4o, achieve around 96% accuracy in English to around 36% in Somali with a single target sentence. However, this accuracy drops to 40% in English and 0% in Somali when dealing with three target sentences. Our findings highlight the challenges long-context LLMs face when processing longer contexts, an increase in the number of target sentences, or languages of lower resource levels.
摘要:近期的大语言模型 (LLMs) 在处理长上下文方面展示了令人印象深刻的能力,其中一些在合成检索任务上表现出近乎完美的召回率。然而,这些评估主要集中在英文文本上,并且涉及长上下文中的单一目标句子。我们的研究探讨了 LLM 性能在多语言环境中如何泛化,特别是在存在多个隐藏目标句子的情况下。我们全面评估了几种长上下文 LLM 在检索和推理任务上的表现,涵盖了五种语言:英语、越南语、印度尼西亚语、斯瓦希里语和索马里语。这些语言虽然共享拉丁字母,但属于不同的语言家族和资源级别。我们的分析揭示了语言之间的显著性能差距。表现最佳的模型如 Gemini-1.5 和 GPT-4o,在英语中达到约 96% 的准确率,而在索马里语中仅为约 36%,且仅涉及单一目标句子。然而,当处理三个目标句子时,英语中的准确率降至 40%,而在索马里语中降至 0%。我们的研究结果突显了长上下文 LLM 在处理更长上下文、增加目标句子数量或资源较低语言时所面临的挑战。

[NLP-10] Extracting Affect Aggregates from Longitudinal Social Media Data with Temporal Adapters for Large Language Models

【速读】: 该论文试图解决如何利用大语言模型(LLMs)进行社交媒体数据的纵向分析问题。解决方案的关键在于提出了时间对齐的LLMs,并通过微调Llama 3 8B模型中的Temporal Adapters来处理英国Twitter用户的全时间线数据,从而提取情感和态度的纵向聚合。该方法通过与英国代表性调查数据的验证,显示出与传统分类模型相当的稳健性和显著的正相关性,为社交媒体数据的纵向分析提供了新的途径。

链接: https://arxiv.org/abs/2409.17990
作者: Georg Ahnert,Max Pellert,David Garcia,Markus Strohmaier
关键词-EN: aligned Large Language, Large Language Models, temporally aligned Large, Large Language, paper proposes temporally
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Code available at this https URL

点击查看摘要

Abstract:This paper proposes temporally aligned Large Language Models (LLMs) as a tool for longitudinal analysis of social media data. We fine-tune Temporal Adapters for Llama 3 8B on full timelines from a panel of British Twitter users, and extract longitudinal aggregates of emotions and attitudes with established questionnaires. We validate our estimates against representative British survey data and find strong positive, significant correlations for several collective emotions. The obtained estimates are robust across multiple training seeds and prompt formulations, and in line with collective emotions extracted using a traditional classification model trained on labeled data. To the best of our knowledge, this is the first work to extend the analysis of affect in LLMs to a longitudinal setting through Temporal Adapters. Our work enables new approaches towards the longitudinal analysis of social media data.
摘要: 本文提出将时间对齐的大语言模型 (LLMs) 作为社交媒体数据纵向分析的工具。我们对 Llama 3 8B 模型的时间适配器 (Temporal Adapters) 进行了微调,以处理来自英国 Twitter 用户面板的完整时间线数据,并使用既定问卷提取情感和态度的纵向聚合数据。我们通过与代表性英国调查数据进行对比验证了我们的估计,发现多个集体情感指标存在显著的正相关关系。所获得的估计值在多个训练种子和提示语句组合下均表现出稳健性,并与使用传统分类模型(基于标注数据训练)提取的集体情感相一致。据我们所知,这是首次通过时间适配器将大语言模型中的情感分析扩展到纵向分析领域。我们的工作为社交媒体数据的纵向分析开辟了新的途径。

[NLP-11] BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search

【速读】: 该论文试图解决大语言模型(LLMs)在解决数学问题时表现不佳的问题,特别是由于数学问题的严谨性和逻辑性,传统方法如监督微调(SFT)、提示工程和基于搜索的方法虽然有所改进,但仍需大量计算资源且效果有限。论文提出的解决方案之关键是BEATS方法,该方法通过设计新的提示引导模型迭代重写、逐步推进并基于前一步生成答案,同时引入回溯验证技术利用LLMs验证答案的正确性,并采用剪枝树搜索优化搜索时间,从而显著提升数学问题解决能力,在MATH基准测试中将Qwen2-7b-Instruct的得分从36.94提升至61.52,超越了GPT4的42.5。

链接: https://arxiv.org/abs/2409.17972
作者: Linzhuang Sun,Hao Liang,Wentao Zhang
关键词-EN: Large Language Models, Large Language, exhibited exceptional performance, Language Models, tasks and domains
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains. However, they still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics. Previous studies have employed techniques such as supervised fine-tuning (SFT), prompt engineering, and search-based methods to improve the mathematical problem-solving abilities of LLMs. Despite these efforts, their performance remains suboptimal and demands substantial computational resources. To address this issue, we propose a novel approach, BEATS, to enhance mathematical problem-solving abilities. Our method leverages newly designed prompts that guide the model to iteratively rewrite, advance by one step, and generate answers based on previous steps. Additionally, we introduce a new back-verification technique that uses LLMs to validate the correctness of the generated answers. Furthermore, we employ a pruning tree search to optimize search time while achieving strong performance. Notably, our method improves Qwen2-7b-Instruct’s score from 36.94 to 61.52, outperforming GPT4’s 42.5 on the MATH benchmark.
摘要:大语言模型 (LLMs) 在广泛的任务和领域中展现了卓越的性能。然而,由于数学的严谨性和逻辑性,它们在解决数学问题时仍面临困难。先前的研究采用了监督微调 (SFT)、提示工程和基于搜索的方法来提升大语言模型的数学问题解决能力。尽管如此,这些方法的性能仍不尽如人意,并且需要大量的计算资源。为了解决这一问题,我们提出了一种新颖的方法,BEATS,以增强数学问题解决能力。我们的方法利用了新设计的提示,指导模型通过迭代重写、逐步推进并基于先前步骤生成答案。此外,我们引入了一种新的后验证技术,使用大语言模型来验证生成答案的正确性。同时,我们采用了一种剪枝树搜索来优化搜索时间,同时实现强大的性能。值得注意的是,我们的方法将 Qwen2-7b-Instruct 的分数从 36.94 提升至 61.52,超过了 GPT4 在 MATH 基准测试中的 42.5 分。

[NLP-12] he Hard Positive Truth about Vision-Language Compositionality ECCV2024

【速读】: 该论文试图解决现有视觉-语言模型(如CLIP)在组合性任务中表现不佳的问题,特别是模型在面对复杂干扰项时的鲁棒性不足。解决方案的关键在于引入“硬正例”(hard positives)进行训练和评估,以揭示现有模型在处理相关“正例”概念时的不变性问题。通过构建包含112,382个硬负例和硬正例的评估数据集,论文发现仅使用硬负例进行微调会导致模型性能大幅下降,而人类在这类任务中表现近乎完美。因此,论文提出了一种新的训练方法,即同时使用硬负例和硬正例进行训练,以提升模型在现有基准测试中的表现,并增强其在处理复杂语义关系时的鲁棒性。这一方法不仅改善了模型在组合性任务中的表现,还表明了在训练过程中考虑硬正例的重要性。

链接: https://arxiv.org/abs/2409.17958
作者: Amita Kamath,Cheng-Yu Hsieh,Kai-Wei Chang,Ranjay Krishna
关键词-EN: hard, CLIP, hard positives, hard negatives, vision-language models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2024

点击查看摘要

Abstract:Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model’s ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated – because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP’s performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related “positive” concepts.
摘要:多项基准测试得出结论,我们最先进的视觉语言模型(例如 CLIP)在组合性方面存在不足。给定一张图像,这些基准测试考察模型在一组组合性干扰项中识别其相关描述的能力。为此,近期涌现出一系列通过使用干扰项作为硬负样本对 CLIP 进行微调以提升性能的方案。我们的研究揭示,这些改进实际上被显著夸大了——因为现有基准测试并未探究微调后的视觉语言模型是否对硬正样本保持不变性。通过精心构建包含 112,382 个硬负样本和硬正样本的评估数据集,我们发现引入硬正样本会使 CLIP 的性能下降 12.9%,而人类在此任务上的表现则轻松达到 99%。采用硬负样本微调的 CLIP 性能下降更为严重,高达 38.7%。基于这一发现,我们随后生成了一个包含 1,775,259 个图像-文本对的大型训练集,其中同时涵盖了硬负样本和硬正样本的描述。通过同时训练这两种样本,我们不仅在现有基准测试中看到了性能提升,而且在处理硬正样本时也表现更佳,这表明组合性方面的改进更为稳健。我们的工作表明,未来研究需要严格测试并提升 CLIP 对相关“正”概念间语义关系的理解能力。

[NLP-13] Weak-To-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation

【速读】: 该论文试图解决在参数高效微调(PEFT)背景下,大语言模型(LLMs)中后门攻击效果不佳的问题。解决方案的关键在于提出了一种基于对比知识蒸馏的从弱到强后门攻击算法(W2SAttack)。具体来说,通过全参数微调毒化小规模语言模型作为教师模型,然后利用对比知识蒸馏将后门秘密传递给大规模学生模型,过程中采用PEFT技术。这种方法在理论上增强了后门攻击的效果,并在实验中展示了接近100%的成功率。

链接: https://arxiv.org/abs/2409.17946
作者: Shuai Zhao,Leilei Gan,Zhongliang Guo,Xiaobao Wu,Luwei Xiao,Xiaoyu Xu,Cong-Duy Nguyen,Luu Anh Tuan
关键词-EN: widely applied due, Large Language Models, Large Language, backdoor attacks, backdoor
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from weak to strong based on contrastive knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through contrastive knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.
摘要:尽管大语言模型 (Large Language Models, LLMs) 因其卓越的能力而被广泛应用,但已被证明容易受到后门攻击。这些攻击通过毒化训练样本和全参数微调引入目标漏洞。然而,这种后门攻击受限于其需要大量计算资源,尤其是随着 LLMs 规模的增加。此外,参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT) 提供了一种替代方案,但其受限的参数更新可能阻碍触发器与目标标签的对齐。在本研究中,我们首先验证了使用 PEFT 的后门攻击可能在实现可行性能方面遇到挑战。为了解决这些问题并提高使用 PEFT 的后门攻击的有效性,我们提出了一种基于对比知识蒸馏 (Contrastive Knowledge Distillation) 的由弱到强的新型后门攻击算法 (W2SAttack)。具体而言,我们通过全参数微调毒化小规模语言模型,作为教师模型。然后,教师模型通过对比知识蒸馏,使用 PEFT 将后门秘密转移给大规模学生模型。理论分析表明,W2SAttack 具有增强后门攻击效果的潜力。我们在四个语言模型、四种后门攻击算法和两种不同架构的教师模型上展示了 W2SAttack 在分类任务中的优越性能。实验结果显示,针对 PEFT 的后门攻击成功率接近 100%。

[NLP-14] On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms

【速读】: 该论文试图解决机器翻译系统在处理技术术语缩写(acronyms)时的歧义问题。解决方案的关键在于在源语言到目标语言的翻译流程中引入一个额外的步骤,即首先提供一个新的缩写语料库供公众使用,然后采用基于搜索的阈值算法进行处理,该算法相较于Google Translate和OpusMT,在处理缩写时准确率提升了近10%。

链接: https://arxiv.org/abs/2409.17943
作者: Richard Yue,John E. Ortega,Kenneth Ward Church
关键词-EN: natural language processing, professional translator, models in natural, Google Translate, BLEU and COMET
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users

点击查看摘要

Abstract:The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do - predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near human parity using common metrics for measurement such as BLEU and COMET, we find that an important step is being missed: the translation of technical terms, specifically acronyms. Some state-of-the art machine translation systems like Google Translate which are publicly available can be erroneous when dealing with acronyms - as much as 50% in our findings. This article addresses acronym disambiguation for MT systems by proposing an additional step to the SL-TL (FR-EN) translation workflow where we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm that achieves nearly 10% increase when compared to Google Translate and OpusMT.
摘要:专业翻译人员将文档从源语言 (Source Language, SL) 翻译为目标语言 (Target Language, TL) 的典型工作流程,并不总是专注于自然语言处理 (Natural Language Processing, NLP) 中的许多语言模型所做的——预测一系列单词中的下一个单词。尽管像英语和法语这样的高资源语言在使用 BLEU 和 COMET 等常见度量标准进行测量时,已报告达到接近人类的水平,但我们发现一个重要的步骤被忽略了:技术术语,特别是缩略词的翻译。一些公开可用的最先进的机器翻译系统,如 Google Translate,在处理缩略词时可能会出现错误,根据我们的研究,错误率高达 50%。本文通过在 SL-TL (FR-EN) 翻译工作流程中增加一个步骤来解决机器翻译 (Machine Translation, MT) 系统的缩略词歧义问题,首先提供一个新的缩略词语料库供公众使用,然后实验一种基于搜索的阈值算法,与 Google Translate 和 OpusMT 相比,该算法实现了近 10% 的提升。

[NLP-15] Predicting Anchored Text from Translation Memories for Machine Translation Using Deep Learning Methods

【速读】: 该论文试图解决在计算机辅助翻译(CAT)工具中,如何更有效地处理模糊匹配(fuzzy-match)并修复锚定词(anchored words)的问题。解决方案的关键在于利用基于机器学习的模型(如Word2Vec、BERT和GPT-4)来替代传统的神经机器翻译(NMT)方法,特别是在处理遵循连续词袋(CBOW)范式的锚定词时,这些模型能够提供更优或至少相似的翻译效果,尤其是在法语到英语的翻译任务中。

链接: https://arxiv.org/abs/2409.17939
作者: Richard Yue,John E. Ortega
关键词-EN: tools called computer-aided, called computer-aided translation, CAT tool, CAT tools, CAT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users

点击查看摘要

Abstract:Translation memories (TMs) are the backbone for professional translation tools called computer-aided translation (CAT) tools. In order to perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the desired segment to translate (s’). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the TM that are close in distance to s’. After locating two similar segments, the CAT tool will present parallel segments (s, t) that contain one segment in the source language along with its translation in the target language. Additionally, CAT tools contain fuzzy-match repair (FMR) techniques that will automatically use the parallel segments from the TM to create new TM entries containing a modified version of the original with the idea in mind that it will be the translation of s’. Most FMR techniques use machine translation as a way of “repairing” those words that have to be modified. In this article, we show that for a large part of those words which are anchored, we can use other techniques that are based on machine learning approaches such as Word2Vec. BERT, and even ChatGPT. Specifically, we show that for anchored words that follow the continuous bag-of-words (CBOW) paradigm, Word2Vec, BERT, and GPT-4 can be used to achieve similar and, for some cases, better results than neural machine translation for translating anchored words from French to English.
摘要:翻译记忆库 (Translation Memories, TMs) 是专业翻译工具——计算机辅助翻译 (Computer-Aided Translation, CAT) 工具的核心。在使用 CAT 工具进行翻译时,译者利用 TM 来收集与目标翻译片段 (s’) 相似的翻译。许多 CAT 工具提供模糊匹配算法,以定位 TM 中与 s’ 距离相近的片段 (s)。在找到两个相似片段后,CAT 工具会展示包含源语言片段及其目标语言翻译的平行片段 (s, t)。此外,CAT 工具还包含模糊匹配修复 (Fuzzy-Match Repair, FMR) 技术,这些技术会自动使用 TM 中的平行片段来创建新的 TM 条目,这些条目包含原始片段的修改版本,旨在作为 s’ 的翻译。大多数 FMR 技术使用机器翻译来“修复”那些需要修改的词汇。在本文中,我们展示了对于那些锚定的词汇,我们可以使用基于机器学习的方法,如 Word2Vec、BERT 和 ChatGPT,来替代大部分词汇的修复工作。具体而言,我们展示了对于遵循连续词袋 (Continuous Bag-of-Words, CBOW) 范式的锚定词汇,Word2Vec、BERT 和 GPT-4 可以用于实现与神经机器翻译相似,甚至在某些情况下更好的结果,以将法语中的锚定词汇翻译为英语。

[NLP-16] he Lou Dataset – Exploring the Impact of Gender-Fair Language in German Text Classification

【速读】: 该论文试图解决性别公平语言对语言模型(LMs)分类任务影响评估的资源缺乏问题。解决方案的关键在于提出了Lou数据集,这是首个包含高质量德语文本分类重构的数据集,涵盖七个任务(如立场检测和毒性分类)。通过评估16种单语和多语言LMs在Lou上的表现,研究发现性别公平语言显著影响预测结果,包括标签翻转、确定性降低和注意力模式改变。尽管如此,现有评估的有效性未受影响,因为原始和重构实例的LM排名差异不大。该研究不仅为德语文本分类提供了初步见解,还暗示其发现可能适用于其他语言。

链接: https://arxiv.org/abs/2409.17929
作者: Andreas Waldis,Joel Birrer,Anne Lauscher,Iryna Gurevych
关键词-EN: evolving German linguistic, German linguistic variation, fosters inclusion, neutral forms, inclusion by addressing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Gender-fair language, an evolving German linguistic variation, fosters inclusion by addressing all genders or using neutral forms. Nevertheless, there is a significant lack of resources to assess the impact of this linguistic shift on classification using language models (LMs), which are probably not trained on such variations. To address this gap, we present Lou, the first dataset featuring high-quality reformulations for German text classification covering seven tasks, like stance detection and toxicity classification. Evaluating 16 mono- and multi-lingual LMs on Lou shows that gender-fair language substantially impacts predictions by flipping labels, reducing certainty, and altering attention patterns. However, existing evaluations remain valid, as LM rankings of original and reformulated instances do not significantly differ. While we offer initial insights on the effect on German text classification, the findings likely apply to other languages, as consistent patterns were observed in multi-lingual and English LMs.
摘要:性别公平语言是一种不断发展的德语语言变体,通过涵盖所有性别或使用中性形式来促进包容性。然而,目前缺乏资源来评估这种语言变化对使用语言模型 (LMs) 进行分类的影响,这些模型可能并未针对此类变体进行训练。为了填补这一空白,我们推出了 Lou,这是首个包含高质量重构文本的德语文本分类数据集,涵盖了立场检测和毒性分类等七项任务。通过对 Lou 上的 16 个单语和多语言 LMs 进行评估,我们发现性别公平语言显著影响了预测结果,包括标签翻转、确定性降低以及注意力模式的变化。然而,现有的评估仍然有效,因为原始实例和重构实例的 LM 排名没有显著差异。尽管我们提供了关于性别公平语言对德语文本分类影响的初步见解,但这些发现很可能适用于其他语言,因为在多语言和英语 LMs 中观察到了一致的模式。

[NLP-17] Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion EMNLP24

【速读】: 该论文试图解决文本到图像(T2I)扩散模型中知识过时的问题,即模型参数中编码的事实知识可能随时间变得陈旧,导致生成的图像与现实世界不符。解决方案的关键在于设计了一个三阶段的T2I知识编辑框架:首先,构建了一个名为CAKE的数据集,包含释义和多对象测试,以实现更细粒度的知识泛化评估;其次,提出了自适应CLIP阈值标准,用于有效过滤当前标准下的假阳性图像,确保编辑评估的可靠性;最后,引入了一种简单但有效的T2I知识编辑方法MPE,通过精确识别和编辑条件文本提示中的过时部分,以适应最新的知识。MPE基于上下文学习,实现了比以往模型编辑器更好的整体性能。

链接: https://arxiv.org/abs/2409.17928
作者: Hengrui Gu,Kaixiong Zhou,Yili Wang,Ruobing Wang,Xin Wang
关键词-EN: diffusion models encode, models encode factual, encode factual knowledge, knowledge, encode factual
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP24 Findings

点击查看摘要

Abstract:During pre-training, the Text-to-Image (T2I) diffusion models encode factual knowledge into their parameters. These parameterized facts enable realistic image generation, but they may become obsolete over time, thereby misrepresenting the current state of the world. Knowledge editing techniques aim to update model knowledge in a targeted way. However, facing the dual challenges posed by inadequate editing datasets and unreliable evaluation criterion, the development of T2I knowledge editing encounter difficulties in effectively generalizing injected knowledge. In this work, we design a T2I knowledge editing framework by comprehensively spanning on three phases: First, we curate a dataset \textbfCAKE, comprising paraphrase and multi-object test, to enable more fine-grained assessment on knowledge generalization. Second, we propose a novel criterion, \textbfadaptive CLIP threshold, to effectively filter out false successful images under the current criterion and achieve reliable editing evaluation. Finally, we introduce \textbfMPE, a simple but effective approach for T2I knowledge editing. Instead of tuning parameters, MPE precisely recognizes and edits the outdated part of the conditioning text-prompt to accommodate the up-to-date knowledge. A straightforward implementation of MPE (Based on in-context learning) exhibits better overall performance than previous model editors. We hope these efforts can further promote faithful evaluation of T2I knowledge editing methods.
摘要:在预训练阶段,文本到图像 (Text-to-Image, T2I) 扩散模型将其参数化的事实知识编码到模型参数中。这些参数化的事实使得模型能够生成逼真的图像,但随着时间的推移,这些知识可能会变得过时,从而导致对当前世界状态的错误描述。知识编辑技术旨在以有针对性的方式更新模型知识。然而,面对编辑数据集不足和评估标准不可靠的双重挑战,T2I 知识编辑的发展在有效泛化注入知识方面遇到了困难。在本研究中,我们设计了一个 T2I 知识编辑框架,全面涵盖了三个阶段:首先,我们精心构建了一个数据集 \textbf{CAKE},该数据集包含释义和多对象测试,以实现对知识泛化的更精细评估。其次,我们提出了一种新的标准,即 \textbf{自适应 CLIP 阈值},以有效过滤当前标准下的虚假成功图像,并实现可靠的编辑评估。最后,我们引入了 \textbf{MPE},这是一种简单但有效的 T2I 知识编辑方法。MPE 不是调整参数,而是精确识别并编辑条件文本提示中过时的部分,以适应最新的知识。基于上下文学习的 MPE 直接实现展示了比先前模型编辑器更好的整体性能。我们希望这些努力能够进一步促进 T2I 知识编辑方法的忠实评估。

[NLP-18] Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

【速读】: 该论文试图解决低资源语言变体(如摩洛哥阿拉伯语Darija)在大型语言模型(LLMs)中的应用问题。解决方案的关键在于构建了一个专门针对Darija的指令数据集,通过整合现有资源、手动和合成创建新数据集,并严格控制翻译质量,从而训练出Atlas-Chat-9B和2B模型。这些模型在遵循Darija指令和执行标准NLP任务方面表现优异,超越了现有的最先进和阿拉伯语专用LLMs,如LLaMa、Jais和AceGPT,特别是在新引入的DarijaMMLU评估套件中,实现了显著的性能提升。此外,论文还通过实验分析了不同的微调策略和基础模型选择,以确定最佳配置。

链接: https://arxiv.org/abs/2409.17912
作者: Guokan Shang,Hadi Abdine,Yousef Khoubrane,Amr Mohamed,Yassine Abbahaddou,Sofiane Ennadir,Imane Momayiz,Xuguang Ren,Eric Moulines,Preslav Nakov,Michalis Vazirgiannis,Eric Xing
关键词-EN: models specifically developed, dialectal Arabic, Moroccan Arabic, introduce Atlas-Chat, large language models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Atlas-Chat, the first-ever collection of large language models specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., achieving a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource language variants, which are often neglected in favor of data-rich languages by contemporary LLMs.
摘要:我们介绍了 Atlas-Chat,这是首个专门为阿拉伯方言开发的大语言模型集合。聚焦于摩洛哥阿拉伯语(也称为 Darija),我们通过整合现有的 Darija 语言资源、手动和合成创建新数据集,以及严格质量控制下的英语指令翻译,构建了我们的指令数据集。经过数据集微调的 Atlas-Chat-9B 和 2B 模型,在遵循 Darija 指令和执行标准自然语言处理任务方面表现出卓越能力。值得注意的是,我们的模型在 DarijaMMLU 上,相较于更大的 13B 模型,实现了 13% 的性能提升,超越了包括 LLaMa、Jais 和 AceGPT 在内的最先进和专门针对阿拉伯语的 LLM。此外,我们对多种微调策略和基础模型选择进行了实验分析,以确定最佳配置。所有资源均为公开可访问,我们相信我们的工作为低资源语言变体的指令微调提供了全面的设计方法,这些变体在当代 LLM 中往往被数据丰富的语言所忽视。

[NLP-19] EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

【速读】: 该论文试图解决低资源语言在大型语言模型中的覆盖不足问题。解决方案的关键在于通过持续预训练(continual pre-training)策略,利用包含546种语言的MaLA语料库对Llama 2 7B模型进行扩展训练,从而生成EMMA-500模型。这一方法显著提升了模型在多语言任务中的表现,特别是在跨语言迁移、任务泛化和语言适应性方面取得了显著进步,有效增强了低资源语言的处理能力。

链接: https://arxiv.org/abs/2409.17892
作者: Shaoxiong Ji,Zihao Li,Indraneil Paul,Jaakko Paavola,Peiqin Lin,Pinzhen Chen,Dayyán O’Brien,Hengyu Luo,Hinrich Schütze,Jörg Tiedemann,Barry Haddow
关键词-EN: improving language coverage, focusing on improving, enhanced multilingual performance, continue-trained on texts, designed for enhanced
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models’ language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.
摘要:在本研究中,我们介绍了 EMMA-500,这是一个大规模的多语言语言模型,针对 546 种语言进行了持续训练,旨在提升多语言性能,特别是改善低资源语言的覆盖率。为了支持持续预训练,我们编纂了 MaLA 语料库,这是一个综合性的多语言数据集,涵盖了多个领域的精选数据集。利用这一语料库,我们对 Llama 2 7B 模型进行了广泛的持续预训练,从而生成了 EMMA-500,该模型在包括多语言任务和本研究开发的开放式生成基准 PolyWrite 在内的一系列基准测试中表现出色。我们的研究结果突显了持续预训练在扩展大语言模型语言能力方面的有效性,特别是在代表性不足的语言方面,显著提升了跨语言迁移、任务泛化和语言适应性。

[NLP-20] Implementing a Nordic-Baltic Federated Health Data Network: a case report

【速读】: 该论文试图解决跨国医疗数据集中收集和处理中的隐私、数据异质性和法律障碍等问题。解决方案的关键在于建立一个跨学科联盟,开发一个联邦健康数据网络,通过六家跨五国的机构合作,促进北欧-波罗的海地区的健康数据二次利用。该网络采用混合方法,结合实验设计和实施科学,评估影响网络实施的因素,实验结果表明网络在技术上运行良好,但需关注不确定的监管环境和显著的运营成本。

链接: https://arxiv.org/abs/2409.17865
作者: Taridzo Chomutare,Aleksandar Babic,Laura-Maria Peltonen,Silja Elunurm,Peter Lundberg,Arne Jönsson,Emma Eneling,Ciprian-Virgil Gerstenberger,Troels Siggaard,Raivo Kolde,Oskar Jerdhaf,Martin Hansson,Alexandra Makhlysheva,Miroslav Muzny,Erik Ylipää,Søren Brunak,Hercules Dalianis
关键词-EN: including privacy concerns, national borders pose, borders pose significant, pose significant challenges, including privacy
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 24 pages (including appendices), 1 figure

点击查看摘要

Abstract:Background: Centralized collection and processing of healthcare data across national borders pose significant challenges, including privacy concerns, data heterogeneity and legal barriers. To address some of these challenges, we formed an interdisciplinary consortium to develop a feder-ated health data network, comprised of six institutions across five countries, to facilitate Nordic-Baltic cooperation on secondary use of health data. The objective of this report is to offer early insights into our experiences developing this network. Methods: We used a mixed-method ap-proach, combining both experimental design and implementation science to evaluate the factors affecting the implementation of our network. Results: Technically, our experiments indicate that the network functions without significant performance degradation compared to centralized simu-lation. Conclusion: While use of interdisciplinary approaches holds a potential to solve challeng-es associated with establishing such collaborative networks, our findings turn the spotlight on the uncertain regulatory landscape playing catch up and the significant operational costs.
摘要:
背景: 跨国集中收集和处理医疗数据面临重大挑战,包括隐私问题、数据异质性和法律障碍。为了应对其中一些挑战,我们组建了一个跨学科联盟,旨在开发一个联邦健康数据网络,该网络由五个国家的六个机构组成,以促进北欧-波罗的海地区在医疗数据二次使用方面的合作。本报告旨在提供我们在开发这一网络过程中的早期见解。方法: 我们采用了混合方法,结合实验设计和实施科学来评估影响我们网络实施的因素。结果: 从技术角度来看,我们的实验表明,与集中式模拟相比,该网络在功能上没有显著的性能下降。结论: 尽管跨学科方法具有解决建立此类协作网络所面临挑战的潜力,但我们的发现突显了监管环境的不确定性以及显著的运营成本。

[NLP-21] PEDRO: Parameter-Efficient Fine-tuning with Prompt DEpenDent Representation MOdification

【速读】: 该论文试图解决在单骨干多租户框架下部署大型语言模型(LLMs)时,如何在保持高效推理的同时,实现对下游任务的竞争性性能的问题。解决方案的关键在于提出了一种名为Prompt Dependent Representation Modification (PEDRO)的新型参数高效微调(PEFT)方法。PEDRO通过在每个Transformer层中集成一个轻量级的向量生成器,根据输入提示生成向量,并通过点积操作修改LLM的隐藏表示,从而影响模型的语义输出和生成内容。实验结果表明,PEDRO在相似的可调参数数量下优于最近的PEFT基准,并且在单骨干多租户部署模型中表现出比LoRA更高的效率,显示出显著的工业应用潜力。

链接: https://arxiv.org/abs/2409.17834
作者: Tianfang Xie,Tianjing Li,Wei Zhu,Wei Han,Yi Zhao
关键词-EN: large language models, substantial sizes, large language, typically deployed, underline
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2405.18203

点击查看摘要

Abstract:Due to their substantial sizes, large language models (LLMs) are typically deployed within a single-backbone multi-tenant framework. In this setup, a single instance of an LLM backbone must cater to multiple users or tasks through the application of various parameter-efficient fine-tuning (PEFT) models. Despite the availability of numerous effective PEFT techniques such as LoRA, there remains a need for a PEFT approach that achieves both high efficiency during inference and competitive performance on downstream tasks. In this research, we introduce a new and straightforward PEFT methodology named \underlinePrompt D\underlineEpen\underlineDent \underlineRepresentation M\underlineOdification (PEDRO). The proposed method involves integrating a lightweight vector generator into each Transformer layer, which generates vectors contingent upon the input prompts. These vectors then modify the hidden representations created by the LLM through a dot product operation, thereby influencing the semantic output and generated content of the model. Extensive experimentation across a variety of tasks indicates that: (a) PEDRO surpasses recent PEFT benchmarks when using a similar number of tunable parameters. (b) Under the single-backbone multi-tenant deployment model, PEDRO exhibits superior efficiency compared to LoRA, indicating significant industrial potential.
摘要:由于大语言模型 (Large Language Model, LLM) 的规模庞大,它们通常部署在单一主干多租户框架中。在这种架构下,LLM 主干的一个实例必须通过应用各种参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT) 模型来服务于多个用户或任务。尽管存在许多有效的 PEFT 技术,如 LoRA,但仍然需要一种在推理过程中实现高效性并在下游任务中表现出色的 PEFT 方法。在本研究中,我们引入了一种新的、直接的 PEFT 方法,名为 \underline{Prompt D\underlineEpen\underlineDent \underlineRepresentation M\underlineOdification} (PEDRO)。该方法涉及将一个轻量级向量生成器集成到每个 Transformer 层中,该生成器根据输入提示生成向量。这些向量随后通过点积操作修改由 LLM 生成的隐藏表示,从而影响模型的语义输出和生成内容。在多种任务上的广泛实验表明:(a) 在使用相似数量的可调参数时,PEDRO 超越了最近的 PEFT 基准。(b) 在单一主干多租户部署模型下,PEDRO 相比 LoRA 表现出更高的效率,显示出显著的工业应用潜力。

[NLP-22] BeanCounter: A low-toxicity large-scale and open dataset of business-oriented text

【速读】: 该论文试图解决现有语言模型训练数据集中存在的低质量和较高毒性问题,并提出了一种新的数据源——BeanCounter,这是一个包含超过1590亿个标记的企业披露数据集。解决方案的关键在于利用BeanCounter数据集进行持续预训练,以显著减少模型生成的有毒内容,并提升在金融领域的性能。实验结果表明,使用BeanCounter数据集预训练的模型在减少有毒生成方面有18-33%的改进,同时在金融领域的表现也有所提升,这表明BeanCounter是一个新颖的、低毒性且高质量的领域特定数据源,足以用于训练大规模的参数化语言模型。

链接: https://arxiv.org/abs/2409.17827
作者: Siyan Wang,Bradford Levy
关键词-EN: breakthroughs in language, language modeling, modeling have resulted, resulted from scaling, scaling effectively
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Many of the recent breakthroughs in language modeling have resulted from scaling effectively the same model architecture to larger datasets. In this vein, recent work has highlighted performance gains from increasing training dataset size and quality, suggesting a need for novel sources of large-scale datasets. In this work, we introduce BeanCounter, a public dataset consisting of more than 159B tokens extracted from businesses’ disclosures. We show that this data is indeed novel: less than 0.1% of BeanCounter appears in Common Crawl-based datasets and it is an order of magnitude larger than datasets relying on similar sources. Given the data’s provenance, we hypothesize that BeanCounter is comparatively more factual and less toxic than web-based datasets. Exploring this hypothesis, we find that many demographic identities occur with similar prevalence in BeanCounter but with significantly less toxic context relative to other datasets. To demonstrate the utility of BeanCounter, we evaluate and compare two LLMs continually pre-trained on BeanCounter with their base models. We find an 18-33% reduction in toxic generation and improved performance within the finance domain for the continually pretrained models. Collectively, our work suggests that BeanCounter is a novel source of low-toxicity and high-quality domain-specific data with sufficient scale to train multi-billion parameter LLMs.
摘要:近年来,语言模型领域的许多突破性进展源于将相同的模型架构有效地扩展到更大的数据集上。在此背景下,最近的研究强调了增加训练数据集规模和质量带来的性能提升,这表明需要寻找新的、大规模的数据集来源。本文中,我们引入了 BeanCounter,这是一个包含超过 1590 亿 Token 的公开数据集,这些 Token 提取自企业的披露信息。我们证明,这些数据确实是新颖的:BeanCounter 中不到 0.1% 的内容出现在基于 Common Crawl 的数据集中,并且其规模比依赖类似来源的数据集大一个数量级。鉴于数据的来源,我们假设 BeanCounter 相对于基于网络的数据集,具有更高的真实性和更低的毒性。通过探索这一假设,我们发现许多人口统计身份在 BeanCounter 中出现的频率与其他数据集相似,但在相对较少的毒性环境中出现。为了展示 BeanCounter 的实用性,我们评估并比较了两个在 BeanCounter 上持续预训练的大语言模型与其基础模型。我们发现,持续预训练的模型在生成毒性内容方面减少了 18-33%,并且在金融领域的表现有所提升。总体而言,我们的研究表明,BeanCounter 是一个新颖的、低毒性且高质量的领域特定数据源,其规模足以训练拥有数十亿参数的大语言模型。

[NLP-23] Inference-Time Language Model Alignment via Integrated Value Guidance EMNLP2024

【速读】: 该论文试图解决大规模语言模型在微调过程中计算复杂且耗时的问题。解决方案的关键在于引入了一种名为**Integrated Value Guidance (IVG)**的方法,该方法通过隐式和显式的价值函数分别在token和chunk级别上指导语言模型的解码过程,从而在推理阶段高效地对齐大规模语言模型,避免了直接微调的复杂性,并在多个任务中显著提升了模型的对齐效果。

链接: https://arxiv.org/abs/2409.17819
作者: Zhixuan Liu,Zhanhui Zhou,Yuanfu Wang,Chao Yang,Yu Qiao
关键词-EN: Large language models, human preferences, intensive and complex, tuning large models, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024 Findings

点击查看摘要

Abstract:Large language models are typically fine-tuned to align with human preferences, but tuning large models is computationally intensive and complex. In this work, we introduce \textitIntegrated Value Guidance (IVG), a method that uses implicit and explicit value functions to guide language model decoding at token and chunk-level respectively, efficiently aligning large language models purely at inference time. This approach circumvents the complexities of direct fine-tuning and outperforms traditional methods. Empirically, we demonstrate the versatility of IVG across various tasks. In controlled sentiment generation and summarization tasks, our method significantly improves the alignment of large models using inference-time guidance from \textttgpt2 -based value functions. Moreover, in a more challenging instruction-following benchmark AlpacaEval 2.0, we show that both specifically tuned and off-the-shelf value functions greatly improve the length-controlled win rates of large models against \textttgpt-4-turbo (e.g., 19.51% \rightarrow 26.51% for \textttMistral-7B-Instruct-v0.2 and 25.58% \rightarrow 33.75% for \textttMixtral-8x7B-Instruct-v0.1 with Tulu guidance).
摘要:大语言模型通常经过微调以符合人类偏好,但微调大型模型在计算上既耗费资源又复杂。在本研究中,我们引入了 综合价值引导 (Integrated Value Guidance, IVG),这是一种利用隐式和显式价值函数分别在 Token 和块级别引导语言模型解码的方法,从而在推理时高效地对齐大语言模型。这种方法避免了直接微调的复杂性,并优于传统方法。通过实验,我们展示了 IVG 在各种任务中的广泛适用性。在受控情感生成和摘要任务中,我们的方法显著提升了大模型在推理时通过基于 gpt2 的价值函数引导下的对齐效果。此外,在更具挑战性的指令遵循基准测试 AlpacaEval 2.0 中,我们展示了专门调优和现成的价值函数都能大幅提高大模型在长度控制下的胜率,例如,Mistral-7B-Instruct-v0.2 的胜率从 19.51% 提升至 26.51%,而 Mixtral-8x7B-Instruct-v0.1 的胜率从 25.58% 提升至 33.75%(使用 Tulu 引导)。

[NLP-24] Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness EMNLP2024

【速读】: 该论文试图解决在基于人类反馈的强化学习(RLHF)中,现有方法如直接偏好优化(DPO)及其变体在处理不同响应间的偏好程度差异时存在的不足。现有方法通常使用二元交叉熵机制处理成对样本,忽略了响应间偏好程度的细微差别,这阻碍了大型语言模型(LLMs)充分理解人类偏好。论文提出的解决方案是引入一种新的自监督偏好优化(SPO)框架,该框架通过构建自监督的偏好程度损失与对齐损失相结合,帮助LLMs更好地理解偏好程度,从而显著提升现有偏好优化方法的性能,达到最先进的水平。

链接: https://arxiv.org/abs/2409.17791
作者: Jian Li,Haojing Huang,Yujia Zhang,Pengfei Xu,Xi Chen,Rui Song,Lida Shi,Jingwen Wang,Hao Xu
关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Direct Preference Optimization, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP 2024 Findings

点击查看摘要

Abstract:Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly use a binary cross-entropy mechanism on pairwise samples, i.e., minimizing and maximizing the loss based on preferred or dis-preferred responses, respectively. However, while this training strategy omits the reward model, it also overlooks the varying preference degrees within different responses. We hypothesize that this is a key factor hindering LLMs from sufficiently understanding human preferences. To address this problem, we propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference. Extensive experiments are conducted on two widely used datasets of different tasks. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods and significantly boost their performance to achieve state-of-the-art performance. We also conduct detailed analyses to offer comprehensive insights into SPO, which verifies its effectiveness. The code is available at this https URL.
摘要:最近,对于大语言模型 (LLM),如直接偏好优化 (DPO) 及其变体,在强化学习与人类反馈 (RLHF) 方法中替换奖励模型的兴趣显著增加。这些方法通常使用成对样本上的二元交叉熵机制,即分别基于偏好或非偏好的响应来最小化和最大化损失。然而,尽管这种训练策略省略了奖励模型,但它也忽略了不同响应中偏好的不同程度。我们假设这是阻碍 LLM 充分理解人类偏好的关键因素。为了解决这个问题,我们提出了一种新的自监督偏好优化 (SPO) 框架,该框架构建了一个自监督的偏好程度损失,结合对齐损失,从而帮助 LLM 提高理解偏好程度的能力。我们在两个广泛使用的不同任务的数据集上进行了大量实验。结果表明,SPO 可以无缝集成到现有的偏好优化方法中,并显著提升其性能,达到最先进的水平。我们还进行了详细的分析,以提供对 SPO 的全面见解,验证了其有效性。代码可在以下链接获取:https URL。

[NLP-25] Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations EMNLP2024

【速读】: 该论文试图解决现有可解释AI(Explainable AI)在自然语言处理(NLP)中 faithfulness 评估方法存在的偏差和不一致性问题。解决方案的关键在于引入了一种名为“Adversarial Sensitivity”的新评估方法,该方法通过分析解释器在模型遭受对抗攻击时的响应来评估其 faithfulness,从而捕捉到解释器对对抗输入变化的敏感性,弥补了现有评估技术的不足。

链接: https://arxiv.org/abs/2409.17774
作者: Supriya Manna,Niladri Sett
关键词-EN: critical metric, metric to assess, assess the reliability, reliability of explainable, Faithfulness
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as a Full Paper at EMNLP 2024 Workshop BlackBoxNLP

点击查看摘要

Abstract:Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer’s response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
摘要:忠实度无疑是评估可解释 AI 可靠性的最关键指标。在自然语言处理 (NLP) 领域,当前的忠实度评估方法存在诸多差异和偏见,往往无法捕捉模型的真实推理过程。我们提出了一种新颖的忠实度评估方法——对抗敏感性 (Adversarial Sensitivity),重点关注解释器在模型遭受对抗攻击时的响应。我们的方法通过捕捉对抗输入变化的敏感性来评估解释器的忠实度。这项工作解决了现有评估技术的重大局限性,并进一步从一种关键但未充分探索的范式中量化了忠实度。

[NLP-26] Integrating Hierarchical Semantic into Iterative Generation Model for Entailment Tree Explanation

【速读】: 该论文旨在解决现有解释性问答(QA)方法在处理蕴含树结构时,未能充分考虑句子间及层级间语义关联的问题。解决方案的关键在于提出了一个名为HiSCG(Hierarchical Semantics Controller-Generator)的架构,该架构通过设计假设与事实之间的层级映射,区分参与树构建的事实,并优化单步蕴含关系,从而在同一层级和相邻层级之间引入句子的层级语义,显著提升了蕴含树构建的准确性。

链接: https://arxiv.org/abs/2409.17757
作者: Qin Wang,Jianzhou Feng,Yiming Xu
关键词-EN: explainable question answering, Manifestly and logically, question answering, logically displaying, reasoning from evidence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Manifestly and logically displaying the line of reasoning from evidence to answer is significant to explainable question answering (QA). The entailment tree exhibits the lines structurally, which is different from the self-explanation principle in large-scale language models. Existing methods rarely consider the semantic association of sentences between and within hierarchies within the tree structure, which is prone to apparent mistakes in combinations. In this work, we propose an architecture of integrating the Hierarchical Semantics of sentences under the framework of Controller-Generator (HiSCG) to explain answers. The HiSCG designs a hierarchical mapping between hypotheses and facts, discriminates the facts involved in tree constructions, and optimizes single-step entailments. To the best of our knowledge, We are the first to notice hierarchical semantics of sentences between the same layer and adjacent layers to yield improvements. The proposed method achieves comparable performance on all three settings of the EntailmentBank dataset. The generalization results on two out-of-domain datasets also demonstrate the effectiveness of our method.
摘要:在可解释问答 (QA) 中,从证据到答案的推理路径的显式和逻辑展示至关重要。蕴涵树以结构化的方式展示这些路径,这与大规模语言模型中的自我解释原则不同。现有方法很少考虑树结构中层次之间和层次内部的句子语义关联,这容易导致组合中的明显错误。在本研究中,我们提出了一种在控制器-生成器 (Controller-Generator) 框架下整合句子层次语义 (Hierarchical Semantics of sentences) 的架构 (HiSCG) 来解释答案。HiSCG 设计了假设与事实之间的层次映射,区分了参与树构建的事实,并优化了单步蕴涵。据我们所知,我们是第一个注意到同一层和相邻层之间句子层次语义以实现改进的。所提出的方法在 EntailmentBank 数据集的所有三种设置中均取得了可比的表现。在两个域外数据集上的泛化结果也证明了我们方法的有效性。

[NLP-27] SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning

【速读】: 该论文试图解决机器人对关键概念无意识情况下的交互任务学习问题,即在机器人不了解解决任务所需的关键概念时,如何通过交互学习来完成任务。解决方案的关键是提出了SECURE框架,该框架通过具身对话来修正缺陷的领域模型,使机器人能够通过对话发现并利用未预见的可能性。SECURE不仅使机器人能够从用户的纠正反馈中学习,还能策略性地进行对话以揭示解决任务所需的新概念,从而实现对后续任务的泛化学习。

链接: https://arxiv.org/abs/2409.17755
作者: Rimvydas Rubavicius,Peter David Fagan,Alex Lascarides,Subramanian Ramamoorthy
关键词-EN: interactive task learning, challenging interactive task, task learning scenario, paper addresses, addresses a challenging
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages,4 figures, 2 tables

点击查看摘要

Abstract:This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the robot is unaware of a concept that’s key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems by fixing a deficient domain model using embodied conversation. Through dialogue, the robot discovers and then learns to exploit unforeseen possibilities. Using SECURE, the robot not only learns from the user’s corrective feedback when it makes a mistake, but it also learns to make strategic dialogue decisions for revealing useful evidence about novel concepts for solving the instructed task. Together, these abilities allow the robot to generalise to subsequent tasks using newly acquired knowledge. We demonstrate that a robot that is semantics-aware – that is, it exploits the logical consequences of both sentence and discourse semantics in the learning and inference process – learns to solve rearrangement under unawareness more effectively than a robot that lacks such capabilities.
摘要:本文探讨了一种具有挑战性的交互式任务学习场景,我们称之为“无意识重排”:在机器人对解决指令任务的关键概念一无所知的情况下,操控刚体环境。我们提出了 SECURE,这是一个交互式任务学习框架,旨在通过实体对话修复缺陷领域模型来解决此类问题。通过对话,机器人能够发现并学习利用未预见的可能性。使用 SECURE,机器人不仅在出错时从用户的纠正反馈中学习,还能学会做出战略性对话决策,以揭示解决指令任务的新概念的有用证据。这些能力共同使机器人能够利用新获得的知识推广到后续任务中。我们证明,一个具有语义意识的机器人——即在学习与推理过程中利用句子和话语语义的逻辑结果——比缺乏此类能力的机器人更有效地解决无意识重排问题。

[NLP-28] Few-shot Pairwise Rank Prompting: An Effective Non-Parametric Retrieval Model EMNLP2024

【速读】: 该论文试图解决零样本推理在排序模型中的性能问题,特别是如何在不依赖复杂训练流程的情况下提升排序模型的表现。解决方案的关键在于提出了一种成对少样本排序器(pairwise few-shot ranker),通过利用训练集中相似查询的偏好示例来增强对查询和文档对的偏好预测任务。这种方法在不增加复杂训练过程的前提下,显著提升了零样本基线在域内(TREC DL)和域外(BEIR subset)检索基准上的表现,接近监督模型的性能。

链接: https://arxiv.org/abs/2409.17745
作者: Nilanjan Sinhababu,Andrew Parry,Debasis Ganguly,Debasis Samanta,Pabitra Mitra
关键词-EN: typically multiple stages, involves complex processing, typically multiple, pre-training and fine-tuning, multiple stages
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:A supervised ranking model, despite its advantage of being effective, usually involves complex processing - typically multiple stages of task-specific pre-training and fine-tuning. This has motivated researchers to explore simpler pipelines leveraging large language models (LLMs) that are capable of working in a zero-shot manner. However, since zero-shot inference does not make use of a training set of pairs of queries and their relevant documents, its performance is mostly worse than that of supervised models, which are trained on such example pairs. Motivated by the existing findings that training examples generally improve zero-shot performance, in our work, we explore if this also applies to ranking models. More specifically, given a query and a pair of documents, the preference prediction task is improved by augmenting examples of preferences for similar queries from a training set. Our proposed pairwise few-shot ranker demonstrates consistent improvements over the zero-shot baseline on both in-domain (TREC DL) and out-domain (BEIR subset) retrieval benchmarks. Our method also achieves a close performance to that of a supervised model without requiring any complex training pipeline.
摘要:尽管监督排序模型具有有效性的优势,但其通常涉及复杂的处理流程——通常包括多个阶段的任务特定预训练和微调。这促使研究人员探索利用大语言模型 (LLM) 的更简单流程,这些模型能够在零样本 (zero-shot) 模式下工作。然而,由于零样本推理不使用查询及其相关文档对的训练集,其性能通常不如在类似示例对上训练的监督模型。受现有研究结果的启发,即训练示例通常能提升零样本性能,我们在工作中探讨了这一现象是否也适用于排序模型。更具体地说,给定一个查询和一对文档,通过增加训练集中相似查询的偏好示例,偏好预测任务得到了改进。我们提出的成对少样本 (few-shot) 排序器在域内 (TREC DL) 和域外 (BEIR 子集) 检索基准测试中均显示出对零样本基线的持续改进。我们的方法还实现了与监督模型相近的性能,而无需任何复杂的训练流程。

[NLP-29] MIO: A Foundation Model on Multimodal Tokens

【速读】: 该论文试图解决现有大型语言模型(LLMs)和多模态大型语言模型(MM-LLMs)在理解和生成多模态数据(如语音、文本、图像和视频)时缺乏真正的任意到任意(any-to-any)理解和生成能力的问题。解决方案的关键在于提出了MIO模型,该模型通过多模态令牌(multimodal tokens)进行端到端的自回归生成,并经过四个阶段的训练过程:对齐预训练、交错预训练、语音增强预训练和综合监督微调。MIO模型不仅在多模态任务上表现优异,还展示了如交错视频-文本生成、视觉思维链推理、视觉指南生成和指导性图像编辑等先进的多模态生成能力。

链接: https://arxiv.org/abs/2409.17692
作者: Zekun Wang,King Zhu,Chunpu Xu,Wangchunshu Zhou,Jiaheng Liu,Yibo Zhang,Jiashuo Wang,Ning Shi,Siyu Li,Yizhi Li,Haoran Que,Zhaoxiang Zhang,Yuanxing Zhang,Ge Zhang,Ke Xu,Jie Fu,Wenhao Huang
关键词-EN: foundation model built, large language models, autoregressive manner, understanding and generating, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical Report. Codes and models will be available soon

点击查看摘要

Abstract:In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
摘要:本文介绍了一种名为 MIO 的新型基础模型,该模型基于多模态 Token,能够以端到端、自回归的方式理解和生成语音、文本、图像和视频。尽管大语言模型 (LLM) 和多模态大语言模型 (MM-LLM) 通过其多功能性推动了通用人工智能 (AGI) 的发展,但它们仍然缺乏真正的任意到任意理解和生成能力。最近,GPT-4o 的发布展示了任意到任意大语言模型在复杂现实任务中的显著潜力,实现了图像、语音和文本之间的全方位输入和输出。然而,它是闭源的,并且不支持多模态交错序列的生成。为了填补这一空白,我们提出了 MIO,该模型通过因果多模态建模在四种模态的混合离散 Token 上进行训练。MIO 经历了四个阶段的训练过程:(1) 对齐预训练,(2) 交错预训练,(3) 语音增强预训练,以及 (4) 在多样化的文本、视觉和语音任务上的综合监督微调。我们的实验结果表明,MIO 在性能上与之前的双模态基线、任意到任意模型基线以及特定模态基线相比具有竞争力,甚至在某些情况下表现更优。此外,MIO 展示了其任意到任意特性所固有的高级能力,例如交错视频-文本生成、视觉思维链推理、视觉指南生成、指导性图像编辑等。

[NLP-30] Zero- and Few-shot Named Entity Recognition and Text Expansion in Medication Prescriptions using ChatGPT

【速读】: 该论文试图解决医疗处方中自由文本的结构化和扩展问题,这些文本通常包含混合语言、本地品牌名称以及各种特异格式和缩写。解决方案的关键在于利用ChatGPT 3.5进行命名实体识别(NER)和文本扩展(EX),通过零样本和少样本学习设置下的不同提示策略,自动结构化和扩展出院总结中的药物声明,从而提高其可解释性。研究结果显示,NER任务的最佳提示策略在测试集上达到了0.94的F1分数,而EX任务的少样本提示策略表现更优,平均F1分数为0.87,表明ChatGPT在处理安全相关的药物数据时能够有效避免生成错误信息。

链接: https://arxiv.org/abs/2409.17683
作者: Natthanaphop Isaradech,Andrea Riedel,Wachiranun Sirikul,Markus Kreuzthaler,Stefan Schulz
关键词-EN: local brand, formats and abbreviations, include a mix, wide range, range of idiosyncratic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Introduction: Medication prescriptions are often in free text and include a mix of two languages, local brand names, and a wide range of idiosyncratic formats and abbreviations. Large language models (LLMs) have shown promising ability to generate text in response to input prompts. We use ChatGPT 3.5 to automatically structure and expand medication statements in discharge summaries and thus make them easier to interpret for people and machines. Methods: Named-entity Recognition (NER) and Text Expansion (EX) are used in a zero- and few-shot setting with different prompt strategies. 100 medication statements were manually annotated and curated. NER performance was measured by using strict and partial matching. For the task EX, two experts interpreted the results by assessing semantic equivalence between original and expanded statements. The model performance was measured by precision, recall, and F1 score. Results: For NER, the best-performing prompt reached an average F1 score of 0.94 in the test set. For EX, the few-shot prompt showed superior performance among other prompts, with an average F1 score of 0.87. Conclusion: Our study demonstrates good performance for NER and EX tasks in free-text medication statements using ChatGPT. Compared to a zero-shot baseline, a few-shot approach prevented the system from hallucinating, which would be unacceptable when processing safety-relevant medication data.

1
2
3
4
5
6
7
8
9
**摘要:**

**引言:** 药物处方通常以自由文本形式存在,并包含两种语言的混合、本地品牌名称以及各种独特的格式和缩写。大语言模型 (Large Language Models, LLMs) 在根据输入提示生成文本方面展示了令人鼓舞的能力。我们使用 ChatGPT 3.5 来自动结构化和扩展出院总结中的药物声明,从而使其更易于人类和机器解读。

**方法:** 在零样本 (Zero-shot) 和少样本 (Few-shot) 设置下,使用不同的提示策略进行命名实体识别 (Named-entity Recognition, NER) 和文本扩展 (Text Expansion, EX)。100 条药物声明被手动标注和整理。NER 性能通过严格匹配和部分匹配进行测量。对于 EX 任务,两位专家通过评估原始声明与扩展声明之间的语义等价性来解释结果。模型性能通过精确度 (Precision)、召回率 (Recall) 和 F1 分数 (F1 Score) 进行测量。

**结果:** 对于 NER,表现最佳的提示在测试集上达到了平均 F1 分数 0.94。对于 EX,少样本提示在其他提示中表现最佳,平均 F1 分数为 0.87。

**结论:** 我们的研究表明,使用 ChatGPT 在自由文本药物声明中进行 NER 和 EX 任务表现良好。与零样本基线相比,少样本方法防止了系统产生幻觉,这在处理与安全相关的药物数据时是不可接受的。

[NLP-31] Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization

【速读】: 该论文试图解决神经机器翻译(NMT)中存在的任务与数据不匹配问题,解决方案的关键在于引入了一种名为Direct Quality Optimization(DQO)的任务对齐算法。DQO是Direct Preference Optimization(DPO)的变体,通过利用预训练的翻译质量评估模型作为人类偏好的代理,来优化翻译任务。实验结果表明,即使仅对多语言模型中的部分语言应用DQO,也能显著提升所有语言的翻译质量,并通过自动评估和人工评估验证了其有效性。

链接: https://arxiv.org/abs/2409.17673
作者: Kaden Uhlig,Joern Wuebker,Raphael Reinauer,John DeNero
关键词-EN: Reinforcement Learning, Direct Preference Optimization, Direct Quality Optimization, Human Feedback, repurpose general
类目: Computation and Language (cs.CL)
备注: 17 pages, 1 figure

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general, foundational models for specific tasks. We show that applying task-alignment to neural machine translation (NMT) addresses an existing task–data mismatch in NMT, leading to improvements across all languages of a multilingual model, even when task-alignment is only applied to a subset of those languages. We do so by introducing Direct Quality Optimization (DQO), a variant of DPO leveraging a pre-trained translation quality estimation model as a proxy for human preferences, and verify the improvements with both automatic metrics and human evaluation.
摘要:基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 及其衍生技术,如直接偏好优化 (Direct Preference Optimization, DPO),是用于将通用基础模型重新定位到特定任务的任务对齐算法。我们展示了将任务对齐应用于神经机器翻译 (Neural Machine Translation, NMT) 可以解决 NMT 中现有的任务与数据不匹配问题,从而在多语言模型的所有语言中实现改进,即使任务对齐仅应用于这些语言的一个子集。我们通过引入直接质量优化 (Direct Quality Optimization, DQO) 来实现这一点,DQO 是 DPO 的一个变体,利用预训练的翻译质量评估模型作为人类偏好的代理,并通过自动指标和人工评估验证了这些改进。

[NLP-32] Digital Twin Ecosystem for Oncology Clinical Operations

【速读】: 该论文试图解决在肿瘤学临床操作中如何利用人工智能和数字孪生技术提高工作效率和个性化护理的问题。解决方案的关键在于引入一个创新的数字孪生框架,整合多个专业化的数字孪生模型(如医疗必要性孪生、护理导航孪生和临床历史孪生),并通过综合多源数据并结合国家综合癌症网络(NCCN)指南,创建一个动态的癌症护理路径。这一路径作为持续进化的知识库,使数字孪生模型能够提供精确、个性化的临床建议,从而优化临床操作。

链接: https://arxiv.org/abs/2409.17650
作者: Himanshu Pandey,Akhil Amod,Shivang,Kshitij Jaggi,Ruchi Garg,Abheet Jain,Vinayak Tantia
关键词-EN: Large Language Models, Artificial Intelligence, Large Language, hold significant promise, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Pre Print

点击查看摘要

Abstract:Artificial Intelligence (AI) and Large Language Models (LLMs) hold significant promise in revolutionizing healthcare, especially in clinical applications. Simultaneously, Digital Twin technology, which models and simulates complex systems, has gained traction in enhancing patient care. However, despite the advances in experimental clinical settings, the potential of AI and digital twins to streamline clinical operations remains largely untapped. This paper introduces a novel digital twin framework specifically designed to enhance oncology clinical operations. We propose the integration of multiple specialized digital twins, such as the Medical Necessity Twin, Care Navigator Twin, and Clinical History Twin, to enhance workflow efficiency and personalize care for each patient based on their unique data. Furthermore, by synthesizing multiple data sources and aligning them with the National Comprehensive Cancer Network (NCCN) guidelines, we create a dynamic Cancer Care Path, a continuously evolving knowledge base that enables these digital twins to provide precise, tailored clinical recommendations.
摘要:人工智能 (AI) 和大语言模型 (LLM) 在革新医疗领域,特别是在临床应用方面,具有巨大的潜力。同时,数字孪生技术,通过建模和模拟复杂系统,在提升患者护理方面也逐渐受到重视。然而,尽管在实验临床环境中取得了进展,AI 和数字孪生在优化临床操作方面的潜力仍未得到充分开发。本文介绍了一种专为提升肿瘤临床操作而设计的新型数字孪生框架。我们提出整合多种专业数字孪生,如医疗必需性孪生、护理导航孪生和临床历史孪生,以提高工作流程效率并根据每位患者的独特数据个性化护理。此外,通过综合多个数据源并与国家综合癌症网络 (NCCN) 指南对齐,我们创建了一个动态的癌症护理路径,这是一个持续演进的知识库,使这些数字孪生能够提供精确、定制化的临床建议。

[NLP-33] Efficient In-Domain Question Answering for Resource-Constrained Environments

【速读】: 该论文试图解决在实际问答应用中,使用检索增强生成(RAG)方法时面临的提示工程复杂性和资源效率低下的问题。解决方案的关键在于结合检索增强微调(RAFT)和参数高效微调(PEFT)技术,特别是低秩适应(LoRA),以减少微调和存储需求,并提高推理速度,同时保持与传统RAG相当的性能。这种结合形成的计算效率更高的RAFT(CRAFT)特别适用于资源受限环境中知识密集型问答任务。

链接: https://arxiv.org/abs/2409.17648
作者: Isaac Chung,Phat Vo,Arman Kizilkale,Aaron Reite
关键词-EN: pretrained Large Language, Retrieval Augmented Generation, Large Language Models, Large Language, integrating external knowledge
类目: Computation and Language (cs.CL)
备注: 6 pages, 2 tables

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) is a common method for integrating external knowledge into pretrained Large Language Models (LLMs) to enhance accuracy and relevancy in question answering (QA) tasks. However, prompt engineering and resource efficiency remain significant bottlenecks in developing optimal and robust RAG solutions for real-world QA applications. Recent studies have shown success in using fine tuning to address these problems; in particular, Retrieval Augmented Fine Tuning (RAFT) applied to smaller 7B models has demonstrated superior performance compared to RAG setups with much larger models such as GPT-3.5. The combination of RAFT with parameter-efficient fine tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), promises an even more efficient solution, yet remains an unexplored area. In this work, we combine RAFT with LoRA to reduce fine tuning and storage requirements and gain faster inference times while maintaining comparable RAG performance. This results in a more compute-efficient RAFT, or CRAFT, which is particularly useful for knowledge-intensive QA tasks in resource-constrained environments where internet access may be restricted and hardware resources limited.
摘要:检索增强生成 (Retrieval Augmented Generation, RAG) 是一种常见的方法,用于将外部知识整合到预训练的大语言模型 (Large Language Models, LLMs) 中,以提高问答 (Question Answering, QA) 任务中的准确性和相关性。然而,提示工程和资源效率仍然是开发适用于实际 QA 应用的最佳且稳健的 RAG 解决方案的主要瓶颈。最近的研究表明,通过微调可以有效解决这些问题;特别是,应用于较小 7B 模型的检索增强微调 (Retrieval Augmented Fine Tuning, RAFT) 在性能上优于使用更大模型(如 GPT-3.5)的 RAG 设置。将 RAFT 与参数高效微调 (Parameter-Efficient Fine Tuning, PEFT) 技术(如低秩适应 (Low-Rank Adaptation, LoRA))相结合,有望提供更高效的解决方案,但这一领域仍未被充分探索。在本研究中,我们将 RAFT 与 LoRA 结合,以减少微调和存储需求,并实现更快的推理时间,同时保持与 RAG 相当的性能。这产生了一种更高效的 RAFT,即计算高效的 RAFT (Compute-Efficient RAFT, CRAFT),这对于资源受限环境中知识密集型 QA 任务特别有用,在这些环境中,互联网访问可能受限,硬件资源有限。

[NLP-34] 3: A Novel Zero-shot Transfer Learning Framework Iteratively Training on an Assistant Task for a Target Task

【速读】: 该论文试图解决长文本摘要任务中,由于开源训练数据集不足和上下文细节处理要求高,导致大型语言模型(如GPT和LLaMA系列)性能受限的问题。解决方案的关键在于设计了一种名为T3的新型零样本迁移学习框架,通过迭代训练基线LLM在辅助任务上,以实现目标任务的优化。具体来说,T3利用问答任务作为辅助任务来处理长文本摘要任务,并在多个数据集上验证了其有效性,相较于三个基线LLM,在ROUGE、BLEU和Factscore指标上分别提升了近14%、35%和16%,展示了其在更多辅助-目标任务组合中的潜力。

链接: https://arxiv.org/abs/2409.17640
作者: Xindi Tong,Yujin Zhu,Shijian Fan,Liang Xu
关键词-EN: Large Language Models, processing large volumes, efficiently processing large, contextual details dealing, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long text summarization, gradually being essential for efficiently processing large volumes of information, stays challenging for Large Language Models (LLMs) such as GPT and LLaMA families because of the insufficient open-sourced training datasets and the high requirement of contextual details dealing. To address the issue, we design a novel zero-shot transfer learning framework, abbreviated as T3, to iteratively training a baseline LLM on an assistant task for the target task, where the former should own richer data resources and share structural or semantic similarity with the latter. In practice, T3 is approached to deal with the long text summarization task by utilizing question answering as the assistant task, and further validated its effectiveness on the BBC summary, NarraSum, FairytaleQA, and NLQuAD datasets, with up to nearly 14% improvement in ROUGE, 35% improvement in BLEU, and 16% improvement in Factscore compared to three baseline LLMs, demonstrating its potential for more assistant-target task combinations.
摘要:长文本摘要,作为高效处理大量信息的关键手段,对于 GPT 和 LLaMA 系列等大语言模型 (LLM) 来说仍然具有挑战性,这主要是因为开源训练数据集的不足以及对上下文细节处理的高要求。为解决这一问题,我们设计了一种新颖的零样本迁移学习框架,简称 T3,通过在辅助任务上迭代训练基线 LLM 以实现目标任务,其中辅助任务应拥有更丰富的数据资源,并与目标任务在结构或语义上具有相似性。在实际应用中,T3 通过利用问答作为辅助任务来处理长文本摘要任务,并在 BBC 摘要、NarraSum、FairytaleQA 和 NLQuAD 数据集上进一步验证了其有效性,相较于三个基线 LLM,ROUGE 提升了近 14%,BLEU 提升了 35%,Factscore 提升了 16%,展示了其在更多辅助-目标任务组合中的潜力。

[NLP-35] ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

【速读】: 该论文试图解决在多轮多模态医疗对话中,由于患者使用手机拍摄的图像质量较差(如背景杂乱、病变区域偏离中心),导致视觉-语言模型训练时的对齐效果不佳的问题。解决方案的关键在于提出了一种零样本策略ZALM3,通过利用大语言模型(LLM)从先前的文本对话中提取关键词,并结合视觉定位模型提取图像中的感兴趣区域(RoIs),从而更新图像以消除不必要的背景噪声,提升视觉-语言对齐效果。

链接: https://arxiv.org/abs/2409.17610
作者: Zhangpu Li,Changhong Zou,Suxue Ma,Zhicheng Yang,Chen Du,Youbao Tang,Zhenjie Cao,Ning Zhang,Jui-Hsin Lai,Ruei-Sung Lin,Yuan Ni,Xingzhi Sun,Jing Xiao,Kai Zhang,Mei Han
关键词-EN: multimodal medical dialogue, multi-turn multimodal medical, large language models, multimodal medical, medical dialogue
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients’ mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the preceding text conversations before an image can infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to summarize the keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate our proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue to provide a fine-grained performance comparison. Our experiments across three different clinical departments remarkably demonstrate the efficacy of ZALM3 with statistical significance.
摘要:近年来,大语言模型 (LLM) 的迅猛发展推动了视觉-语言模型 (VLM) 在医疗领域的普及。在我们的在线医疗咨询场景中,医生通过多轮对话回应患者提供的文本和图像,以诊断其健康状况,形成了一种多轮多模态的医疗对话格式。与传统医疗视觉问答 (Med-VQA) 中由专业设备拍摄的高质量图像不同,我们场景中的图像由患者使用手机拍摄。这些图像质量控制较差,存在背景元素过多、病变区域严重偏离中心等问题,导致模型训练阶段的视觉-语言对齐效果下降。本文提出了 ZALM3,一种零样本策略,用于改进多轮多模态医疗对话中的视觉-语言对齐。鉴于我们观察到图像之前的文本对话可以推断出图像中的感兴趣区域 (RoI),ZALM3 采用大语言模型从上下文中总结关键词,并使用视觉定位模型提取 RoI。更新后的图像消除了不必要的背景噪声,提供了更有效的视觉-语言对齐。为了更好地评估我们提出的方法,我们设计了一种新的主观评估指标,用于多轮单模态/多模态医疗对话,以提供细粒度的性能比较。我们在三个不同临床部门的实验显著证明了 ZALM3 的有效性,并具有统计学意义。

[NLP-36] Deep CLAS: Deep Contextual Listen Attend and Spell

【速读】: 该论文试图解决现有Contextual-LAS模型在处理罕见词时对上下文信息利用不足的问题。解决方案的关键在于引入深度Contextual-LAS(deep CLAS),通过引入偏置损失(bias loss)强制模型关注上下文信息,丰富偏置注意力的查询以提高其评分准确性,并采用字符级编码和Conformer编码器来获取更细粒度的上下文信息。此外,直接使用偏置注意力评分来修正模型的输出概率分布,从而显著提升在命名实体识别场景中的召回率和F1分数。

链接: https://arxiv.org/abs/2409.17603
作者: Shifu Xiong,Mengzhi Wang,Genshun Wan,Hang Chen,Jianqing Gao,Lirong Dai
关键词-EN: improving Automatic Speech, Automatic Speech Recognition, Automatic Speech, improving Automatic, Speech Recognition
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by NCMMSC 2022

点击查看摘要

Abstract:Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraint which lead to insufficient use of contextual information. In this work, we propose deep CLAS to use contextual information better. We introduce bias loss forcing model to focus on contextual information. The query of bias attention is also enriched to improve the accuracy of the bias attention score. To get fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with conformer rather than LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments using the public AISHELL-1 and AISHELL-NER. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative recall and a 53.49% relative F1-score increase in the named entity recognition scene.
摘要:上下文感知语言模型 (Contextual-LAS, CLAS) 已被证明在提高罕见词的自动语音识别 (ASR) 方面有效。它依赖于短语级别的上下文建模和基于注意力的相关性评分,而没有明确的上下文约束,这导致上下文信息的利用不足。在这项工作中,我们提出了深度 CLAS 以更好地利用上下文信息。我们引入了偏置损失,迫使模型关注上下文信息。偏置注意力的查询也被丰富,以提高偏置注意力评分的准确性。为了获取细粒度的上下文信息,我们将短语级别的编码替换为字符级别的编码,并使用 Conformer 而不是 LSTM 来编码上下文信息。此外,我们直接使用偏置注意力评分来修正模型的输出概率分布。使用公开的 AISHELL-1 和 AISHELL-NER 进行实验。在 AISHELL-1 上,与 CLAS 基线相比,深度 CLAS 在命名实体识别场景中获得了 65.78% 的相对召回率和 53.49% 的相对 F1 分数提升。

[NLP-37] DualCoTs: Dual Chain-of-Thoughts Prompting for Sentiment Lexicon Expansion of Idioms

【速读】: 该论文试图解决现有习语情感分析语料库的局限性问题,提出了一种利用大型语言模型通过Chain-of-Thought提示自动扩展习语情感词典的创新方法。解决方案的关键在于设计了Dual Chain-of-Thoughts(DualCoTs)方法,该方法结合了语言学和心理语言学的见解,有效利用大型模型自动扩展中英文习语的情感词典,并通过实验验证了其有效性。

链接: https://arxiv.org/abs/2409.17588
作者: Fuqiang Niu,Minghuan Tan,Bowen Zhang,Min Yang,Ruifeng Xu
关键词-EN: idiom sentiment crucial, text sentiment analysis, idiom sentiment analysis, everyday discourse, rendering the nuanced
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Idioms represent a ubiquitous vehicle for conveying sentiments in the realm of everyday discourse, rendering the nuanced analysis of idiom sentiment crucial for a comprehensive understanding of emotional expression within real-world texts. Nevertheless, the existing corpora dedicated to idiom sentiment analysis considerably limit research in text sentiment analysis. In this paper, we propose an innovative approach to automatically expand the sentiment lexicon for idioms, leveraging the capabilities of large language models through the application of Chain-of-Thought prompting. To demonstrate the effectiveness of this approach, we integrate multiple existing resources and construct an emotional idiom lexicon expansion dataset (called EmoIdiomE), which encompasses a comprehensive repository of Chinese and English idioms. Then we designed the Dual Chain-of-Thoughts (DualCoTs) method, which combines insights from linguistics and psycholinguistics, to demonstrate the effectiveness of using large models to automatically expand the sentiment lexicon for idioms. Experiments show that DualCoTs is effective in idioms sentiment lexicon expansion in both Chinese and English. For reproducibility, we will release the data and code upon acceptance.
摘要:成语在日常对话领域中是传达情感的普遍媒介,因此对成语情感的细致分析对于全面理解现实文本中的情感表达至关重要。然而,现有专门用于成语情感分析的语料库在很大程度上限制了文本情感分析的研究。本文提出了一种创新方法,通过应用思维链提示 (Chain-of-Thought prompting) 利用大语言模型的能力,自动扩展成语的情感词典。为展示该方法的有效性,我们整合了多种现有资源,构建了一个情感成语词典扩展数据集 (称为 EmoIdiomE),该数据集包含全面的中英文成语库。随后,我们设计了双思维链 (Dual Chain-of-Thoughts, DualCoTs) 方法,结合了语言学和心理语言学的见解,以展示使用大模型自动扩展成语情感词典的有效性。实验表明,DualCoTs 在扩展中英文成语情感词典方面是有效的。为确保可重复性,我们将在接受后发布数据和代码。

[NLP-38] Leveraging Annotator Disagreement for Text Classification

【速读】: 该论文试图解决在文本分类中仅使用多数标注标签进行模型训练时,可能忽略标注者之间分歧所蕴含的细微差别和多样视角的问题。解决方案的关键在于提出了三种利用标注分歧的策略:基于概率的多标签方法、集成系统和指令调优。这些方法在仇恨言论和辱骂性对话检测任务中进行了评估,结果显示在仇恨言论检测中,多标签方法表现最佳,而在辱骂性对话检测中,指令调优效果最好。此外,通过在线调查比较了多标签模型与单一标签基准模型的性能,结果表明多标签模型的输出更被认为是文本的更好代表。

链接: https://arxiv.org/abs/2409.17577
作者: Jin Xu,Mariët Theune,Daniel Braun
关键词-EN: common practice, annotated by multiple, text classification, abusive conversation detection, multiple annotators
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:It is common practice in text classification to only use one majority label for model training even if a dataset has been annotated by multiple annotators. Doing so can remove valuable nuances and diverse perspectives inherent in the annotators’ assessments. This paper proposes and compares three different strategies to leverage annotator disagreement for text classification: a probability-based multi-label method, an ensemble system, and instruction tuning. All three approaches are evaluated on the tasks of hate speech and abusive conversation detection, which inherently entail a high degree of subjectivity. Moreover, to evaluate the effectiveness of embracing annotation disagreements for model training, we conduct an online survey that compares the performance of the multi-label model against a baseline model, which is trained with the majority label. The results show that in hate speech detection, the multi-label method outperforms the other two approaches, while in abusive conversation detection, instruction tuning achieves the best performance. The results of the survey also show that the outputs from the multi-label models are considered a better representation of the texts than the single-label model. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2409.17577 [cs.CL] (or arXiv:2409.17577v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2409.17577 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要:在文本分类中,即使数据集由多个标注者进行标注,通常也只使用多数标签进行模型训练。这样做可能会忽略标注者评估中固有的细微差别和多样视角。本文提出了并比较了三种利用标注者分歧进行文本分类的不同策略:基于概率的多标签方法、集成系统和指令调优。所有三种方法都在仇恨言论和辱骂对话检测任务上进行了评估,这些任务本质上具有高度的主观性。此外,为了评估在模型训练中接受标注分歧的有效性,我们进行了一项在线调查,比较了多标签模型与使用多数标签训练的基线模型的性能。结果显示,在仇恨言论检测中,多标签方法优于其他两种方法,而在辱骂对话检测中,指令调优表现最佳。调查结果还表明,多标签模型的输出被认为比单标签模型更好地代表了文本。

主题:计算与语言 (cs.CL)
引用方式:arXiv:2409.17577 [cs.CL]
(或 arXiv:2409.17577v1 [cs.CL] 用于此版本)
https://doi.org/10.48550/arXiv.2409.17577
通过 DataCite 发布的 arXiv DOI(待注册)

[NLP-39] Modulated Intervention Preference Optimization (MIPO): Keey the Easy Refine the Difficult AAAI2025

【速读】: 该论文试图解决在偏好优化过程中,当参考模型与给定数据对齐不佳时,正则化项可能阻碍模型对齐的问题。解决方案的关键在于提出了调制干预偏好优化(MIPO)方法,该方法根据给定数据与参考模型的对齐程度动态调整干预强度。具体来说,当数据与参考模型对齐良好时,增加干预以防止策略模型偏离参考模型;当对齐不佳时,减少干预以促进更广泛的训练。实验结果表明,MIPO在各种评估场景中均优于传统的DPO方法。

链接: https://arxiv.org/abs/2409.17545
作者: Cheolhun Jang
关键词-EN: well-trained SFT model, Preference optimization methods, optimization methods typically, methods typically begin, reference model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8pages, submitted to AAAI 2025

点击查看摘要

Abstract:Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from deviating too far from the reference model’s distribution, thereby avoiding the generation of anomalous responses. When the reference model is already well-aligned with the given data or only requires slight adjustments, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and requires significant deviation from its current state, a regularization term may actually hinder the model alignment. In this study, we propose \textbfModulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it. If the data is well-aligned, the intervention is increased to prevent the policy model from diverging significantly from reference model. Conversely, if the alignment is poor, the interference is reduced to facilitate more extensive training. We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B in Alpaca Eval 2.0 and MT-Bench. The experimental results demonstrate that MIPO consistently outperforms DPO across various evaluation scenarios.
摘要:偏好优化方法通常以一个经过良好训练的监督微调 (SFT) 模型作为参考模型开始训练。在强化学习人类反馈 (RLHF) 和直接偏好优化 (DPO) 中,偏好优化过程中使用了一个正则化项,以防止策略模型偏离参考模型的分布过远,从而避免生成异常响应。当参考模型已经与给定数据良好对齐或仅需要轻微调整时,这种方法可以产生一个良好对齐的模型。然而,如果参考模型与给定数据不对齐且需要显著偏离其当前状态,正则化项实际上可能阻碍模型的对齐。在本研究中,我们提出了调制干预偏好优化 (MIPO) 来解决这一问题。MIPO 根据给定数据与参考模型的对齐程度来调制参考模型的干预程度。如果数据与参考模型对齐良好,则增加干预以防止策略模型显著偏离参考模型;相反,如果对齐较差,则减少干预以促进更广泛的训练。我们使用 Mistral-7B 和 Llama3-8B 在 Alpaca Eval 2.0 和 MT-Bench 上比较了 MIPO 和 DPO 的性能。实验结果表明,MIPO 在各种评估场景中始终优于 DPO。

[NLP-40] Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在复杂逻辑推理任务中表现不佳的问题,特别是现有提示方法(如Chain-of-Thought)在推理过程中可能出现的推理链与结论不一致的问题。解决方案的关键在于提出了一种名为Logic-of-Thought(LoT)的提示方法,该方法通过使用命题逻辑从输入上下文中生成扩展的逻辑信息,并将这些逻辑信息作为额外的增强内容添加到输入提示中,从而提升模型的逻辑推理能力。LoT方法与现有的提示方法正交,可以无缝集成,实验结果表明,LoT显著提升了多种提示方法在五个逻辑推理任务中的表现。

链接: https://arxiv.org/abs/2409.17539
作者: Tongxuan Liu,Wenjiang Xu,Weizhe Huang,Xingyu Wang,Jiaxing Wang,Hailong Yang,Jing Li
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, tasks remains unsatisfactory
类目: Computation and Language (cs.CL)
备注: 20 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks but their performance in complex logical reasoning tasks remains unsatisfactory. Although some prompting methods, such as Chain-of-Thought, can improve the reasoning ability of LLMs to some extent, they suffer from an unfaithful issue where derived conclusions may not align with the generated reasoning chain. To address this issue, some studies employ the approach of propositional logic to further enhance logical reasoning abilities of LLMs. However, the potential omissions in the extraction of logical expressions in these methods can cause information loss in the logical reasoning process, thereby generating incorrect results. To this end, we propose Logic-of-Thought (LoT) prompting which employs propositional logic to generate expanded logical information from input context, and utilizes the generated logical information as an additional augmentation to the input prompts, thereby enhancing the capability of logical reasoning. The LoT is orthogonal to existing prompting methods and can be seamlessly integrated with them. Extensive experiments demonstrate that LoT boosts the performance of various prompting methods with a striking margin across five logical reasoning tasks. In particular, the LoT enhances Chain-of-Thought’s performance on the ReClor dataset by +4.35%; moreover, it improves Chain-of-Thought with Self-Consistency’s performance on LogiQA by +5%; additionally, it boosts performance of Tree-of-Thoughts on ProofWriter dataset by +8%.
摘要:大语言模型 (LLMs) 在各种任务中展示了显著的能力,但在复杂的逻辑推理任务中的表现仍不尽如人意。尽管一些提示方法,如思维链 (Chain-of-Thought),可以在一定程度上提高 LLMs 的推理能力,但它们存在一个不忠实的问题,即推导出的结论可能与生成的推理链不一致。为了解决这一问题,一些研究采用了命题逻辑的方法来进一步增强 LLMs 的逻辑推理能力。然而,这些方法在提取逻辑表达式时可能存在的遗漏会导致逻辑推理过程中的信息丢失,从而产生错误的结果。为此,我们提出了思维逻辑 (Logic-of-Thought, LoT) 提示方法,该方法利用命题逻辑从输入上下文中生成扩展的逻辑信息,并将生成的逻辑信息作为输入提示的额外增强,从而增强逻辑推理能力。LoT 与现有的提示方法是正交的,可以无缝地与它们集成。广泛的实验表明,LoT 在五个逻辑推理任务中显著提升了各种提示方法的性能。特别是,LoT 将思维链在 ReClor 数据集上的性能提升了 +4.35%;此外,它将思维链与自一致性在 LogiQA 上的性能提升了 +5%;另外,它将思维树在 ProofWriter 数据集上的性能提升了 +8%。

[NLP-41] On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy

【速读】: 该论文试图解决在大规模预训练语言模型中,全参数微调带来的计算和存储成本过高的问题,并探讨低秩适应方法在数据隐私方面的潜在优势。解决方案的关键在于通过低秩分解矩阵(如LoRA和FLoRA中的adapter)替代全参数微调,从而显著减少可训练参数的数量,并理论证明这种低秩适应方法在梯度更新中引入了随机噪声,近似于差分隐私保护下的全参数微调,从而在降低计算成本的同时,隐式地提供了对微调数据的隐私保护。

链接: https://arxiv.org/abs/2409.17538
作者: Saber Malekmohammadi,Golnoosh Farnadi
关键词-EN: processing involves large-scale, involves large-scale pre-training, natural language processing, language processing involves, general domain data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A significant approach in natural language processing involves large-scale pre-training on general domain data followed by adaptation to specific tasks or domains. As models grow in size, full fine-tuning all parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g. LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices into some layers of the transformer architecture, called adapters. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to full fine-tuning all parameters. In this work, we look at low-rank adaptation from the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t the adapter parameters coming from their full fine-tuning, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the noise distribution and a Gaussian distribution with the same variance, we show that the dynamics of LoRA and FLoRA are very close to differentially private full fine-tuning the adapters, which suggests that low-rank adaptation implicitly provides privacy w.r.t the fine-tuning data. Finally, using Johnson-Lindenstrauss lemma, we show that when augmented with gradient clipping, low-rank adaptation is almost equivalent to differentially private full fine-tuning adapters with a fixed noise scale.
摘要:自然语言处理中的一个重要方法是在通用领域数据上进行大规模预训练,然后针对特定任务或领域进行适应。随着模型规模的扩大,对所有参数进行全面微调变得越来越不切实际。为了解决这个问题,一些针对语言模型的低秩任务适应方法被提出,例如 LoRA 和 FLoRA。这些方法保持预训练模型权重不变,并在 Transformer 架构的某些层中引入可训练的低秩分解矩阵,称为适配器。与对所有参数进行全面微调相比,这种方法显著减少了下游任务所需的可训练参数数量。在这项工作中,我们从数据隐私的角度审视低秩适应。我们理论上证明了 LoRA 和 FLoRA 中使用的低秩适应等同于向来自其全面微调的适配器参数的批次梯度中注入一些随机噪声,并量化了注入噪声的方差。通过在噪声分布与具有相同方差的高斯分布之间建立 Berry-Esseen 型总变差距离界限,我们表明 LoRA 和 FLoRA 的动力学非常接近于对适配器进行差分隐私全面微调,这表明低秩适应隐含地提供了关于微调数据的隐私保护。最后,利用 Johnson-Lindenstrauss 引理,我们表明,当与梯度裁剪结合时,低秩适应几乎等同于具有固定噪声尺度的差分隐私全面微调适配器。

[NLP-42] MUSE: Integrating Multi-Knowledge for Knowledge Graph Completion

【速读】: 该论文试图解决知识图谱补全(KGC)中缺失关系预测的问题,现有方法未能充分利用知识图谱特征和外部语义知识的指导。解决方案的关键在于提出了一种知识感知的推理模型(MUSE),通过多知识表示学习机制来增强缺失关系预测。该模型通过三个并行组件实现:1) 先验知识学习,利用BERT微调增强三元组的语义表示;2) 上下文消息传递,增强知识图谱的上下文信息;3) 关系路径聚合,增强从头部实体到尾部实体的路径表示。实验结果表明,MUSE在多个公开数据集上显著优于其他基线方法。

链接: https://arxiv.org/abs/2409.17536
作者: Pengjie Liu
关键词-EN: Knowledge Graph Completion, Graph Completion, Knowledge Graph, aims to predict, existing KGC methods
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2408.05283

点击查看摘要

Abstract:Knowledge Graph Completion (KGC) aims to predict the missing [relation] part of (head entity)–[relation]-(tail entity) triplet. Most existing KGC methods focus on single features (e.g., relation types) or sub-graph aggregation. However, they do not fully explore the Knowledge Graph (KG) features and neglect the guidance of external semantic knowledge. To address these shortcomings, we propose a knowledge-aware reasoning model (MUSE), which designs a novel multi-knowledge representation learning mechanism for missing relation prediction. Our model develops a tailored embedding space through three parallel components: 1) Prior Knowledge Learning for enhancing the triplets’ semantic representation by fine-tuning BERT; 2) Context Message Passing for enhancing the context messages of KG; 3) Relational Path Aggregation for enhancing the path representation from the head entity to the tail entity. The experimental results show that MUSE significantly outperforms other baselines on four public datasets, achieving over 5.50% H@1 improvement and 4.20% MRR improvement on the NELL995 dataset. The code and datasets will be released via this https URL.
摘要:知识图谱补全 (Knowledge Graph Completion, KGC) 旨在预测 (头实体)–[关系]-(尾实体) 三元组中缺失的 [关系] 部分。大多数现有的 KGC 方法侧重于单一特征(例如,关系类型)或子图聚合。然而,这些方法并未充分挖掘知识图谱 (Knowledge Graph, KG) 的特征,并且忽略了外部语义知识的指导。为了解决这些不足,我们提出了一种知识感知的推理模型 (MUSE),该模型设计了一种新颖的多知识表示学习机制,用于缺失关系的预测。我们的模型通过三个并行组件开发了一个定制的嵌入空间:1) 先验知识学习,通过微调 BERT 来增强三元组的语义表示;2) 上下文消息传递,用于增强 KG 的上下文消息;3) 关系路径聚合,用于增强从头实体到尾实体的路径表示。实验结果表明,MUSE 在四个公共数据集上显著优于其他基线方法,在 NELL995 数据集上实现了超过 5.50% 的 H@1 提升和 4.20% 的 MRR 提升。代码和数据集将通过此 https URL 发布。

[NLP-43] Data Proportion Detection for Optimized Data Management for Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在预训练数据中不同领域数据比例不透明的问题。解决方案的关键在于提出了一种新的研究方向——数据比例检测(data proportion detection),通过分析LLMs生成的输出来自动估计预训练数据中各领域的比例。论文提供了严格的理论证明、实用的算法以及初步的实验结果,为有效进行数据比例检测和数据管理提供了有价值的见解。

链接: https://arxiv.org/abs/2409.17527
作者: Hao Liang,Keshi Zhao,Yajie Yang,Bin Cui,Guosheng Dong,Zenan Zhou,Wentao Zhang
关键词-EN: Large language models, Large language, demonstrated exceptional performance, data preparation playing, data proportion detection
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks and domains, with data preparation playing a critical role in achieving these results. Pre-training data typically combines information from multiple domains. To maximize performance when integrating data from various domains, determining the optimal data proportion is essential. However, state-of-the-art (SOTA) LLMs rarely disclose details about their pre-training data, making it difficult for researchers to identify ideal data proportions. In this paper, we introduce a new topic, \textitdata proportion detection, which enables the automatic estimation of pre-training data proportions by analyzing the generated outputs of LLMs. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection. Based on these findings, we offer valuable insights into the challenges and future directions for effective data proportion detection and data management.
摘要:大语言模型 (LLMs) 在众多任务和领域中展现了卓越的性能,其中数据准备在实现这些成果中起到了关键作用。预训练数据通常结合了来自多个领域的信息。为了在整合来自不同领域的数据时最大化性能,确定最佳的数据比例至关重要。然而,最先进的 (SOTA) LLMs 很少披露其预训练数据的详细信息,这使得研究人员难以确定理想的数据比例。在本文中,我们引入了一个新课题,即数据比例检测 (data proportion detection),通过分析 LLMs 生成的输出来实现预训练数据比例的自动估算。我们提供了严格的理论证明、实用的算法以及初步的实验结果,以支持数据比例检测。基于这些发现,我们为有效数据比例检测和数据管理面临的挑战及未来方向提供了宝贵的见解。

[NLP-44] Comparing Unidirectional Bidirectional and Word2vec Models for Discovering Vulnerabilities in Compiled Lifted Code

【速读】: 该论文试图解决在编译代码中检测缓冲区溢出等软件漏洞的问题,解决方案的关键在于应用基于单向Transformer的嵌入技术,特别是GPT-2模型。通过训练GPT-2模型生成嵌入向量,并利用这些嵌入向量构建LSTM神经网络来区分易受攻击和不易受攻击的代码,研究结果表明,GPT-2嵌入在准确性和F1分数上显著优于BERT和RoBERTa等双向模型,且在优化器选择上,SGD表现优于Adam。这一方法为提升网络安全防御提供了重要的新思路。

链接: https://arxiv.org/abs/2409.17513
作者: Gary A. McCully,John D. Hastings,Shengjie Xu,Adam Fortier
关键词-EN: forms of malware, malware cause significant, significant financial, financial and operational, operational damage
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:Ransomware and other forms of malware cause significant financial and operational damage to organizations by exploiting long-standing and often difficult-to-detect software vulnerabilities. To detect vulnerabilities such as buffer overflows in compiled code, this research investigates the application of unidirectional transformer-based embeddings, specifically GPT-2. Using a dataset of LLVM functions, we trained a GPT-2 model to generate embeddings, which were subsequently used to build LSTM neural networks to differentiate between vulnerable and non-vulnerable code. Our study reveals that embeddings from the GPT-2 model significantly outperform those from bidirectional models of BERT and RoBERTa, achieving an accuracy of 92.5% and an F1-score of 89.7%. LSTM neural networks were developed with both frozen and unfrozen embedding model layers. The model with the highest performance was achieved when the embedding layers were unfrozen. Further, the research finds that, in exploring the impact of different optimizers within this domain, the SGD optimizer demonstrates superior performance over Adam. Overall, these findings reveal important insights into the potential of unidirectional transformer-based approaches in enhancing cybersecurity defenses.
摘要:勒索软件和其他形式的恶意软件通过利用长期存在且难以检测的软件漏洞,对组织造成重大的财务和运营损害。为了检测编译代码中的漏洞,如缓冲区溢出,本研究探讨了单向 Transformer 嵌入的应用,特别是 GPT-2。使用 LLVM 函数的数据集,我们训练了一个 GPT-2 模型来生成嵌入,这些嵌入随后被用于构建 LSTM 神经网络,以区分易受攻击和不易受攻击的代码。我们的研究表明,GPT-2 模型生成的嵌入显著优于 BERT 和 RoBERTa 等双向模型的嵌入,达到了 92.5% 的准确率和 89.7% 的 F1 分数。LSTM 神经网络在冻结和未冻结嵌入模型层的情况下进行了开发。当嵌入层未冻结时,模型达到了最高性能。此外,研究发现,在探索该领域内不同优化器的影响时,SGD 优化器的表现优于 Adam。总体而言,这些发现揭示了单向 Transformer 方法在增强网络安全防御方面的潜力。

[NLP-45] HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection NEURIPS2024

【速读】: 该论文试图解决大语言模型(LLMs)生成内容中存在的误导性或虚假信息(即幻觉)的检测问题。解决方案的关键在于提出了HaloScope框架,该框架利用未标注的LLM生成数据进行幻觉检测。通过自动化的成员估计分数,HaloScope能够区分未标注混合数据中的真实与虚假信息,从而训练出一个二分类的真实性分类器。该方法无需额外数据收集和人工标注,具有高度的灵活性和实用性,实验结果表明其幻觉检测性能显著优于现有方法。

链接: https://arxiv.org/abs/2409.17504
作者: Xuefeng Du,Chaowei Xiao,Yixuan Li
关键词-EN: large language models, language models, prompted concerns, misleading or fabricated, large language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: NeurIPS 2024 Spotlight

点击查看摘要

Abstract:The surge in applications of large language models (LLMs) has prompted concerns about the generation of misleading or fabricated information, known as hallucinations. Therefore, detecting hallucinations has become critical to maintaining trust in LLM-generated content. A primary challenge in learning a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To address the challenge, we introduce HaloScope, a novel learning framework that leverages the unlabeled LLM generations in the wild for hallucination detection. Such unlabeled data arises freely upon deploying LLMs in the open world, and consists of both truthful and hallucinated information. To harness the unlabeled data, we present an automated membership estimation score for distinguishing between truthful and untruthful generations within unlabeled mixture data, thereby enabling the training of a binary truthfulness classifier on top. Importantly, our framework does not require extra data collection and human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that HaloScope can achieve superior hallucination detection performance, outperforming the competitive rivals by a significant margin. Code is available at this https URL.
摘要:大语言模型 (LLM) 应用的激增引发了对其生成误导性或虚假信息(即幻觉)的担忧。因此,检测幻觉对于维护 LLM 生成内容的可信度至关重要。学习一个真实性分类器的主要挑战在于缺乏大量标记的真实和幻觉数据。为了应对这一挑战,我们引入了 HaloScope,这是一种新颖的学习框架,利用未标记的 LLM 生成数据进行幻觉检测。这种未标记数据在 LLM 部署于开放世界时自由产生,包含真实和幻觉信息。为了利用这些未标记数据,我们提出了一种自动成员资格估计评分,用于区分未标记混合数据中的真实和虚假生成,从而能够在此基础上训练一个二元真实性分类器。重要的是,我们的框架不需要额外的数据收集和人工标注,为实际应用提供了强大的灵活性和实用性。广泛的实验表明,HaloScope 能够实现卓越的幻觉检测性能,显著优于竞争对手。代码可在以下链接获取:https URL。

[NLP-46] MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models NEURIPS2024

【速读】: 该论文试图解决大型语言模型(LLMs)在推理过程中计算开销过大的问题,提出了一种名为MaskLLM的可学习剪枝方法,通过在LLMs中建立半结构化(N:M)稀疏性来减少计算负担。解决方案的关键在于利用Gumbel Softmax采样显式地将N:M模式建模为可学习的分布,从而实现端到端的训练,并具备高质量掩码生成和跨领域/任务的稀疏性迁移能力。实验结果表明,MaskLLM在多个LLMs上的表现显著优于现有最先进的方法,特别是在保持模型性能的同时大幅降低了困惑度(PPL)。

链接: https://arxiv.org/abs/2409.17481
作者: Gongfan Fang,Hongxu Yin,Saurav Muralidharan,Greg Heinrich,Jeff Pool,Jan Kautz,Pavlo Molchanov,Xinchao Wang
关键词-EN: Large Language Models, massive parameter counts, Large Language, Language Models, significant redundancy
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: NeurIPS 2024 Spotlight

点击查看摘要

Abstract:Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M’') Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model’s 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM’s learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains. Code is available at \urlthis https URL.
摘要:大语言模型 (LLMs) 以其庞大的参数数量著称,这些参数通常导致显著的冗余。本文介绍了一种名为 MaskLLM 的可学习剪枝方法,该方法在大语言模型中建立了半结构化 (或称为“N:M”) 稀疏性,旨在减少推理过程中的计算开销。与开发新的重要性标准不同,MaskLLM 通过 Gumbel Softmax 采样显式地将 N:M 模式建模为可学习的分布。这种方法便于在大规模数据集上进行端到端训练,并具有两个显著优势:1) 高质量的掩码 - 我们的方法能够有效扩展到大型数据集并学习准确的掩码;2) 可迁移性 - 掩码分布的概率建模使得稀疏性可以在不同领域或任务之间进行迁移学习。我们使用 2:4 稀疏性对多种大语言模型进行了评估,包括 LLaMA-2、Nemotron-4 和 GPT-3,参数规模从 843M 到 15B 不等,实验结果显示我们的方法在性能上显著优于最先进的方法。例如,领先的方法在 Wikitext 数据集上达到的困惑度 (PPL) 为 10 或更高,而密集模型的 PPL 为 5.12,但 MaskLLM 仅通过学习掩码和冻结权重就达到了显著更低的 6.72 PPL。此外,MaskLLM 的可学习特性允许为下游任务或领域定制无损应用的 2:4 稀疏性掩码。代码可在 \urlthis https URL 获取。

[NLP-47] Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification

【速读】: 该论文试图解决数据增强技术在文本分类任务中引入的噪声问题,即增强数据的质量参差不齐,影响模型性能。解决方案的关键在于提出了一种结合元学习和对比学习的新框架,通过重新加权增强样本和优化其特征表示,以提高增强数据的质量。具体实现包括创新的权重依赖的入队和出队算法,有效利用增强样本的权重信息。实验结果表明,该框架在与现有深度学习模型(如RoBERTa-base和Text-CNN)和增强技术(如Wordnet和Easydata)结合时,显著提升了模型在GLUE基准数据集上的性能,平均提升1.6%至4.3%。

链接: https://arxiv.org/abs/2409.17474
作者: Guanyi Mou,Yichuan Li,Kyumin Lee
关键词-EN: model generalization ability, improving model generalization, generalization ability, shown its effectiveness, effectiveness in resolving
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: IEEE BigData 2021

点击查看摘要

Abstract:Data augmentation has shown its effectiveness in resolving the data-hungry problem and improving model’s generalization ability. However, the quality of augmented data can be varied, especially compared with the raw/original data. To boost deep learning models’ performance given augmented data/samples in text classification tasks, we propose a novel framework, which leverages both meta learning and contrastive learning techniques as parts of our design for reweighting the augmented samples and refining their feature representations based on their quality. As part of the framework, we propose novel weight-dependent enqueue and dequeue algorithms to utilize augmented samples’ weight/quality information effectively. Through experiments, we show that our framework can reasonably cooperate with existing deep learning models (e.g., RoBERTa-base and Text-CNN) and augmentation techniques (e.g., Wordnet and Easydata) for specific supervised learning tasks. Experiment results show that our framework achieves an average of 1.6%, up to 4.3% absolute improvement on Text-CNN encoders and an average of 1.4%, up to 4.4% absolute improvement on RoBERTa-base encoders on seven GLUE benchmark datasets compared with the best baseline. We present an indepth analysis of our framework design, revealing the non-trivial contributions of our network components. Our code is publicly available for better reproducibility.
摘要:数据增强在解决数据饥渴问题和提升模型泛化能力方面展现了其有效性。然而,增强数据的质量可能参差不齐,尤其是与原始数据相比。为了在文本分类任务中利用增强数据/样本提升深度学习模型的性能,我们提出了一种新颖的框架,该框架结合了元学习 (meta learning) 和对比学习 (contrastive learning) 技术,用于根据增强样本的质量重新加权并优化其特征表示。作为框架的一部分,我们提出了新的权重依赖的入队和出队算法,以有效利用增强样本的权重/质量信息。通过实验,我们展示了该框架能够合理地与现有的深度学习模型(如 RoBERTa-base 和 Text-CNN)以及增强技术(如 Wordnet 和 Easydata)协同工作,用于特定的监督学习任务。实验结果表明,与最佳基线相比,我们的框架在七个 GLUE 基准数据集上,对 Text-CNN 编码器实现了平均 1.6%、最高 4.3% 的绝对提升,对 RoBERTa-base 编码器实现了平均 1.4%、最高 4.4% 的绝对提升。我们深入分析了框架设计,揭示了网络组件的非平凡贡献。我们的代码已公开,以促进更好的可复现性。

[NLP-48] Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards EMNLP2024

【速读】: 该论文试图解决自动作文评分(AES)系统中多特质评分模型训练中的不可微性问题,特别是如何将非可微的二次加权kappa(QWK)指标融入到神经网络训练过程中。解决方案的关键在于提出了Scoring-aware Multi-reward Reinforcement Learning(SaMRL)方法,通过设计基于QWK的奖励机制并结合均方误差惩罚,将实际评分方案整合到训练过程中。此外,论文采用自回归评分生成框架,利用token生成概率来增强多特质评分的鲁棒性,从而显著提升模型对先前评分较差提示的评分能力。

链接: https://arxiv.org/abs/2409.17472
作者: Heejin Do,Sangwon Ryu,Gary Geunbae Lee
关键词-EN: provide enriched feedback, evaluating multiple traits, Recent advances, automated essay scoring, enriched feedback
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024

点击查看摘要

Abstract:Recent advances in automated essay scoring (AES) have shifted towards evaluating multiple traits to provide enriched feedback. Like typical AES systems, multi-trait AES employs the quadratic weighted kappa (QWK) to measure agreement with human raters, aligning closely with the rating schema; however, its non-differentiable nature prevents its direct use in neural network training. In this paper, we propose Scoring-aware Multi-reward Reinforcement Learning (SaMRL), which integrates actual evaluation schemes into the training process by designing QWK-based rewards with a mean-squared error penalty for multi-trait AES. Existing reinforcement learning (RL) applications in AES are limited to classification models despite associated performance degradation, as RL requires probability distributions; instead, we adopt an autoregressive score generation framework to leverage token generation probabilities for robust multi-trait score predictions. Empirical analyses demonstrate that SaMRL facilitates model training, notably enhancing scoring of previously inferior prompts.
摘要:近年来,自动作文评分 (Automated Essay Scoring, AES) 的发展趋势转向评估多种特征以提供更丰富的反馈。与典型的 AES 系统类似,多特征 AES 使用二次加权 Kappa (Quadratic Weighted Kappa, QWK) 来衡量与人工评分者的一致性,这与评分方案紧密契合;然而,其不可微分的特性使其无法直接用于神经网络训练。本文提出了一种评分感知的多奖励强化学习 (Scoring-aware Multi-reward Reinforcement Learning, SaMRL),通过设计基于 QWK 的奖励和均方误差惩罚,将实际评估方案整合到多特征 AES 的训练过程中。现有的强化学习 (Reinforcement Learning, RL) 在 AES 中的应用局限于分类模型,尽管存在性能下降的问题,因为 RL 需要概率分布;相反,我们采用了一种自回归评分生成框架,利用 Token 生成概率来实现稳健的多特征评分预测。实证分析表明,SaMRL 有助于模型训练,显著提升了以往评分较低的提示的评分效果。

[NLP-49] What is the social benefit of hate speech detection research? A Systematic Review

【速读】: 该论文试图解决的问题是自然语言处理(NLP)领域中仇恨言论检测研究与政策制定者及非营利组织之间的脱节现象。论文指出,缺乏伦理框架是导致这一脱节的主要原因。解决方案的关键在于采用适当的伦理框架,以促进仇恨言论检测研究的社会影响力,从而更好地与政策制定者和非营利组织合作,实现最佳实践。

链接: https://arxiv.org/abs/2409.17467
作者: Sidney Gig-Jan Wong
关键词-EN: non-profit organisations, grown exponentially, minimal uptake, uptake or engagement, engagement from policy
类目: Computation and Language (cs.CL)
备注: Accepted to the 3rd Workshop on NLP for Positive Impact

点击查看摘要

Abstract:While NLP research into hate speech detection has grown exponentially in the last three decades, there has been minimal uptake or engagement from policy makers and non-profit organisations. We argue the absence of ethical frameworks have contributed to this rift between current practice and best practice. By adopting appropriate ethical frameworks, NLP researchers may enable the social impact potential of hate speech research. This position paper is informed by reviewing forty-eight hate speech detection systems associated with thirty-seven publications from different venues.
摘要:尽管过去三十年中自然语言处理 (NLP) 领域的仇恨言论检测研究呈指数级增长,但政策制定者和非营利组织对此的采纳和参与却极为有限。我们认为,缺乏伦理框架是导致当前实践与最佳实践之间差距的主要原因。通过采用适当的伦理框架,NLP 研究人员可以释放仇恨言论研究的社会影响潜力。本文通过对来自不同渠道的三十七篇相关出版物中的四十八个仇恨言论检测系统进行回顾,为这一立场提供了依据。

[NLP-50] RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

【速读】: 该论文试图解决大型语言模型(LLMs)在多轮对话中容易被恶意用户通过隐蔽手段进行“越狱”攻击的问题。解决方案的关键在于提出了RED QUEEN ATTACK方法,通过构建多轮对话场景,将恶意意图隐藏在看似无害的对话中,从而揭示LLMs在多轮交互中的脆弱性。实验结果表明,所有测试的LLMs均对RED QUEEN ATTACK表现出高度的脆弱性,尤其是更大规模的模型。为应对这一问题,论文还提出了RED QUEEN GUARD策略,通过调整模型参数,有效降低了攻击成功率至1%以下,同时保持了模型在标准基准测试中的性能。

链接: https://arxiv.org/abs/2409.17458
作者: Yifan Jiang,Kriti Aggarwal,Tanmay Laud,Kashif Munir,Jay Pujara,Subhabrata Mukherjee
关键词-EN: Large Language Models, RED QUEEN ATTACK, presents challenges related, RED QUEEN, Large Language
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we, first, propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model’s performance across standard benchmarks. Full implementation and dataset are publicly accessible at this https URL.
摘要:大语言模型 (LLM) 的快速发展为各个领域和应用带来了新的机遇;然而,它也带来了潜在滥用的挑战。为了减轻这些风险,红队测试 (red teaming) 已被用作一种主动的安全措施,通过越狱攻击 (jailbreak attacks) 来探测语言模型的有害输出。然而,当前的越狱攻击方法主要是单轮的,且恶意查询明确,未能完全捕捉现实世界交互的复杂性。实际上,用户可以与基于 LLM 的聊天助手进行多轮交互,从而以更隐蔽的方式隐藏其真实意图。为了填补这一空白,我们首先提出了一种新的越狱方法,即 RED QUEEN ATTACK。该方法构建了一个多轮场景,将恶意意图隐藏在防止伤害的伪装之下。我们设计了 40 个不同轮次的场景,并选择了 14 个有害类别,生成了 56k 个多轮攻击数据点。我们在四个不同大小的代表性 LLM 家族上进行了全面的 RED QUEEN ATTACK 实验。我们的实验结果表明,所有 LLM 都对 RED QUEEN ATTACK 存在漏洞,GPT-4o 的攻击成功率达到 87.62%,Llama3-70B 的攻击成功率达到 75.4%。进一步分析表明,较大的模型对 RED QUEEN ATTACK 更为敏感,多轮结构和隐藏策略是其成功的关键。为了优先考虑安全性,我们引入了一种简单的缓解策略,称为 RED QUEEN GUARD,该策略使 LLM 能够有效抵御对抗性攻击。这种方法将攻击成功率降低到 1% 以下,同时保持了模型在标准基准测试中的性能。完整的实现和数据集可在以下链接公开获取:https URL。

[NLP-51] Navigating the Shortcut Maze: A Comprehensive Analysis of Shortcut Learning in Text Classification by Language Models

【速读】: 该论文试图解决语言模型(LMs)在依赖虚假关联时,其准确性和泛化能力受到损害的问题。研究的关键在于识别并分类那些更为微妙和复杂的“捷径”(shortcuts),这些捷径超越了简单的模式,对模型的可靠性构成威胁。为此,论文引入了一个综合基准,将这些捷径分为发生型、风格型和概念型,并通过广泛的实验,系统地研究了传统LMs、大型语言模型以及最先进的鲁棒模型对这些复杂捷径的抵抗能力和易感性。

链接: https://arxiv.org/abs/2409.17455
作者: Yuqing Zhou,Ruixiang Tang,Ziyu Yao,Ziwei Zhu
关键词-EN: spurious correlations, undermining their accuracy, accuracy and generalizability, depend on spurious, Language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language models (LMs), despite their advances, often depend on spurious correlations, undermining their accuracy and generalizability. This study addresses the overlooked impact of subtler, more complex shortcuts that compromise model reliability beyond oversimplified shortcuts. We introduce a comprehensive benchmark that categorizes shortcuts into occurrence, style, and concept, aiming to explore the nuanced ways in which these shortcuts influence the performance of LMs. Through extensive experiments across traditional LMs, large language models, and state-of-the-art robust models, our research systematically investigates models’ resilience and susceptibilities to sophisticated shortcuts. Our benchmark and code can be found at: this https URL.
摘要:语言模型 (LMs) 尽管取得了进展,但往往依赖于虚假的相关性,从而削弱了其准确性和可推广性。本研究针对那些被忽视的、更为微妙且复杂的捷径对模型可靠性的影响,这些捷径超越了过于简化的捷径。我们引入了一个综合基准,将捷径分为发生、风格和概念三类,旨在探索这些捷径以微妙的方式影响 LMs 性能的途径。通过在传统 LMs、大语言模型以及最先进的鲁棒模型上进行广泛的实验,我们的研究系统地调查了模型对复杂捷径的韧性和脆弱性。我们的基准和代码可以在以下链接找到:this https URL。

[NLP-52] Enhancing Financial Sentiment Analysis with Expert-Designed Hint

【速读】: 该论文试图解决在金融社交媒体帖子中进行情感分析时,如何通过专家设计的提示(hint)来提升大型语言模型(LLMs)的性能。解决方案的关键在于专家设计的提示,特别是强调数字的重要性,这显著提高了LLMs在需要视角转换技能的情境中的表现。研究还发现,这种提示在处理包含货币相关数字的推文时,对情感分析性能的提升尤为显著。

链接: https://arxiv.org/abs/2409.17448
作者: Chung-Chi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao
关键词-EN: social media posts, financial social media, enhancing sentiment analysis, media posts, paper investigates
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper investigates the role of expert-designed hint in enhancing sentiment analysis on financial social media posts. We explore the capability of large language models (LLMs) to empathize with writer perspectives and analyze sentiments. Our findings reveal that expert-designed hint, i.e., pointing out the importance of numbers, significantly improve performances across various LLMs, particularly in cases requiring perspective-taking skills. Further analysis on tweets containing different types of numerical data demonstrates that the inclusion of expert-designed hint leads to notable improvements in sentiment analysis performance, especially for tweets with monetary-related numbers. Our findings contribute to the ongoing discussion on the applicability of Theory of Mind in NLP and open new avenues for improving sentiment analysis in financial domains through the strategic use of expert knowledge.
摘要: 本文探讨了专家设计的提示在增强金融社交媒体帖子情感分析中的作用。我们研究了大语言模型 (LLMs) 在共情作者视角和分析情感方面的能力。研究结果表明,专家设计的提示,即指出数字的重要性,显著提升了各种 LLMs 的性能,特别是在需要视角转换技能的情况下。进一步对包含不同类型数值数据的推文进行分析,结果显示,引入专家设计的提示显著提升了情感分析的性能,尤其是在涉及货币相关数字的推文中。我们的研究为自然语言处理 (NLP) 中关于心智理论适用性的持续讨论做出了贡献,并为通过战略性运用专家知识来改进金融领域情感分析开辟了新的途径。

[NLP-53] HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows

【速读】: 该论文试图解决大型语言模型(LLMs)在处理需要多步骤思考和结合多种技能的复杂推理问题时表现不足的问题。解决方案的关键在于提出了一种名为HDFlow的新框架,该框架通过结合快速和慢速思考模式来适应性地处理复杂推理任务。具体来说,HDFlow包括两个核心组件:一是动态工作流(Dynamic Workflow),它能够自动将复杂问题分解为更易管理的子任务,并动态设计工作流程以组合专门的LLM或符号推理工具来解决这些子任务;二是混合思考(Hybrid Thinking),这是一个通用框架,根据问题的复杂性动态结合快速和慢速思考。此外,论文还提出了一种易于扩展的方法,用于自动合成大规模的复杂推理问题数据集,并采用混合思考调优方法训练较小的LLMs,以内化快速/慢速混合推理策略。实验结果表明,这种慢速思考与动态工作流的结合显著优于传统的思维链方法,而混合思考则在保持计算效率的同时实现了最高的准确性。

链接: https://arxiv.org/abs/2409.17433
作者: Wenlin Yao,Haitao Mi,Dong Yu
关键词-EN: requiring multi-step thinking, problems requiring multi-step, Hybrid Thinking, thinking, slow thinking
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 27 pages, 5 figures

点击查看摘要

Abstract:Despite recent advancements in large language models (LLMs), their performance on complex reasoning problems requiring multi-step thinking and combining various skills is still limited. To address this, we propose a novel framework HDFlow for complex reasoning with LLMs that combines fast and slow thinking modes in an adaptive manner. Our approach consists of two key components: 1) a new approach for slow, deliberate reasoning called Dynamic Workflow, which automatically decomposes complex problems into more manageable sub-tasks and dynamically designs a workflow to assemble specialized LLM or symbolic reasoning tools to solve sub-tasks; 2) Hybrid Thinking, a general framework that dynamically combines fast and slow thinking based on problem complexity. Finally, we propose an easy-to-scale method for automatically synthesizing a large-scale dataset of 27K challenging reasoning problems for complex reasoning and a hybrid thinking tuning method that trains smaller LLMs on this dataset to internalize the fast/slow hybrid reasoning strategies. Experiments on four reasoning benchmark datasets demonstrate that our slow thinking with dynamic workflows significantly outperforms Chain-of-Thought, and hybrid thinking achieves the highest accuracy while providing an effective balance between computational efficiency and performance. Fine-tuning using our hybrid thinking approach also significantly boosts the complex reasoning capabilities of open-source language models. The results showcase the promise of slow thinking, dynamic workflows, and hybrid thinking in expanding the frontier of complex problem-solving with LLMs\footnoteCode and data will be released at \urlthis https URL…
摘要:尽管大语言模型 (LLM) 在近年来取得了显著进展,但在处理需要多步骤思考和结合多种技能的复杂推理问题时,其表现仍然有限。为解决这一问题,我们提出了一种名为 HDFlow 的新框架,该框架通过自适应方式结合快速和慢速思考模式来进行复杂推理。我们的方法包括两个关键组件:1) 一种名为动态工作流 (Dynamic Workflow) 的新方法,用于缓慢、深思熟虑的推理,该方法能够自动将复杂问题分解为更易管理的子任务,并动态设计工作流程以组合专门的 LLM 或符号推理工具来解决这些子任务;2) 混合思考 (Hybrid Thinking),这是一个通用的框架,能够根据问题复杂性动态结合快速和慢速思考。最后,我们提出了一种易于扩展的方法,用于自动合成包含 27,000 个挑战性推理问题的复杂推理大规模数据集,并提出了一种混合思考调优方法,该方法训练较小的 LLM 以内化快速/慢速混合推理策略。在四个推理基准数据集上的实验表明,我们的动态工作流慢速思考方法显著优于思维链 (Chain-of-Thought),而混合思考在提供计算效率与性能之间有效平衡的同时,达到了最高的准确率。使用我们的混合思考方法进行微调,也显著提升了开源语言模型的复杂推理能力。这些结果展示了慢速思考、动态工作流和混合思考在扩展 LLM 解决复杂问题前沿的潜力 [20]。

代码和数据将在 此链接 发布。

[NLP-54] On Extending Direct Preference Optimization to Accommodate Ties

【速读】: 该论文试图解决在成对比较中如何处理平局(tie)的问题。解决方案的关键在于提出了两种DPO(Direct Preference Optimization)变体,这些变体明确地建模了在成对比较中宣布平局的可能性。具体来说,论文通过引入Rao-Kupper和Davidson的模型扩展,替代了传统的Bradley-Terry模型,这些扩展模型为平局分配了概率,从而避免了在原始DPO中简单丢弃平局数据时观察到的任务性能下降。实验结果表明,明确标记的平局可以加入到这些DPO变体的数据集中,而不会导致性能下降,并且这种做法还能增强相对于参考策略的正则化效果,如KL散度所测量的那样。

链接: https://arxiv.org/abs/2409.17431
作者: Jinghong Chen,Guangyu Yang,Weizhe Lin,Jingbiao Mei,Bill Byrne
关键词-EN: pair-wise comparisons, derive and investigate, possibility of declaring, DPO variants, DPO
类目: Computation and Language (cs.CL)
备注: 24 pages

点击查看摘要

Abstract:We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization as opposed to simply discarding them.
摘要:我们推导并研究了两种 DPO 变体,这些变体明确地建模了在成对比较中宣布平局的可能性。我们用 Rao 和 Kupper 以及 Davidson 提出的两种著名建模扩展替换了 DPO 中的 Bradley-Terry 模型,这些扩展为平局分配了概率,作为明确偏好的替代方案。我们在神经机器翻译和摘要生成任务中的实验表明,可以向这些 DPO 变体的数据集中添加明确标记的平局,而不会出现将相同平局对呈现给 DPO 时观察到的任务性能下降。我们通过经验发现,包含平局会导致相对于参考策略的更强正则化,如 KL 散度所衡量,即使在 DPO 的原始形式中也能看到这一点。这些发现促使并实现了在偏好优化中包含平局对,而不是简单地丢弃它们。

[NLP-55] Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

【速读】: 该论文旨在解决大型语言模型(LLMs)在处理长上下文输入时面临的计算资源和延迟增加的问题。解决方案的关键在于提出了一种名为GemFilter的新算法,该算法利用LLM早期层的注意力机制来筛选和压缩输入令牌,从而显著减少后续处理的上下文长度。这种方法不仅提高了推理速度(2.4倍加速)和GPU内存效率(减少30%的内存使用),而且在Needle in a Haystack任务中显著优于标准注意力机制和SnapKV/H2O,同时在LongBench挑战中表现相当。GemFilter的简单性、无需训练以及广泛适用性使其成为优化LLM设计和推理的重要工具。

链接: https://arxiv.org/abs/2409.17422
作者: Zhenmei Shi,Yifei Ming,Xuan-Phi Nguyen,Yingyu Liang,Shafiq Joty
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, increased computational resources
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4 \times speedup and 30% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention, SnapKV and demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at \urlthis https URL.
摘要:大语言模型 (LLMs) 在处理长上下文输入方面展示了显著的能力,但这是以增加计算资源和延迟为代价的。我们的研究引入了一种新颖的方法来解决长上下文瓶颈问题,以加速 LLM 推理并减少 GPU 内存消耗。我们的研究表明,LLMs 在生成查询答案之前,可以在早期层中识别相关 Token。基于这一洞察,我们提出了一种算法,该算法利用 LLM 的早期层作为过滤器,选择和压缩输入 Token,从而显著减少后续处理的上下文长度。我们的方法,GemFilter,与现有技术(如标准注意力机制和 SnapKV/H2O)相比,在速度和内存效率方面展示了显著的改进。值得注意的是,与最先进的方法相比,它实现了 2.4 倍的加速和 30% 的 GPU 内存使用量减少。在“Needle in a Haystack”任务上的评估显示,GemFilter 显著优于标准注意力机制和 SnapKV,并在 LongBench 挑战中展示了可比拟的性能。GemFilter 简单、无需训练,并且广泛适用于不同的大语言模型。关键的是,它通过允许人类检查选定的输入序列,提供了可解释性。这些发现不仅为 LLM 部署提供了实际效益,而且增强了我们对于 LLM 内部机制的理解,为 LLM 设计和推理的进一步优化铺平了道路。我们的代码可在 \urlthis https URL 获取。

[NLP-56] Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction

【速读】: 该论文试图解决新闻事件对股票市场影响的持续时间问题,这一问题在当前研究中被忽视。解决方案的关键在于引入了一个名为Impact Duration Estimation Dataset (IDED)的新数据集,用于基于投资者意见估计影响持续时间。通过在IDED上对语言模型进行预训练,可以显著提升基于文本的股票走势预测性能。此外,论文还通过对比情感分析预训练任务,进一步确认了学习影响持续时间的重要性,为金融预测开辟了新的研究方向。

链接: https://arxiv.org/abs/2409.17419
作者: Chr-Jr Chiu,Chung-Chi Chen,Hen-Hsen Huang,Hsin-Hsi Chen
关键词-EN: Duration Estimation Dataset, effective time-series forecasting, Impact Duration Estimation, Impact Duration, market is crucial
类目: Computation and Language (cs.CL)
备注: NTCIR-18 FinArg-2 Dataset

点击查看摘要

Abstract:Understanding the duration of news events’ impact on the stock market is crucial for effective time-series forecasting, yet this facet is largely overlooked in current research. This paper addresses this research gap by introducing a novel dataset, the Impact Duration Estimation Dataset (IDED), specifically designed to estimate impact duration based on investor opinions. Our research establishes that pre-finetuning language models with IDED can enhance performance in text-based stock movement predictions. In addition, we juxtapose our proposed pre-finetuning task with sentiment analysis pre-finetuning, further affirming the significance of learning impact duration. Our findings highlight the promise of this novel research direction in stock movement prediction, offering a new avenue for financial forecasting. We also provide the IDED and pre-finetuned language models under the CC BY-NC-SA 4.0 license for academic use, fostering further exploration in this field.
摘要:理解新闻事件对股票市场影响的持续时间对于有效的时间序列预测至关重要,然而这一方面在当前研究中大多被忽视。本文通过引入一个新颖的数据集——影响持续时间估计数据集 (Impact Duration Estimation Dataset, IDED),旨在基于投资者意见估计影响持续时间,填补了这一研究空白。我们的研究表明,使用 IDED 对语言模型进行预微调可以提升基于文本的股票走势预测性能。此外,我们将提出的预微调任务与情感分析预微调进行对比,进一步确认了学习影响持续时间的重要性。研究结果突显了这一新颖研究方向在股票走势预测中的潜力,为金融预测开辟了新的途径。我们还根据 CC BY-NC-SA 4.0 许可证提供了 IDED 和预微调语言模型,以促进该领域的进一步探索。

[NLP-57] Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis

【速读】: 该论文试图解决在互联网和社交媒体平台快速发展背景下,海量在线观点难以全面分析的问题。解决方案的关键在于引入了一种双管齐下的论点挖掘技术,以提高推荐系统的有效性。具体策略包括:1) 利用目标价格与收盘价格之间的差异作为观点指标;2) 应用论点挖掘原理对投资者观点进行评分,并根据评分进行排序。实验结果表明,该方法能够有效识别具有更高盈利潜力的观点,并进一步扩展到风险分析,探讨推荐观点与投资者行为之间的关系,从而提供全面的潜在结果评估。

链接: https://arxiv.org/abs/2409.17417
作者: Chung-Chi Chen,Hen-Hsen Huang,Hsin-Hsi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao
关键词-EN: media platform development, individuals readily share, social media platform, rapid Internet, Internet and social
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the era of rapid Internet and social media platform development, individuals readily share their viewpoints online. The overwhelming quantity of these posts renders comprehensive analysis impractical. This necessitates an efficient recommendation system to filter and present significant, relevant opinions. Our research introduces a dual-pronged argument mining technique to improve recommendation system effectiveness, considering both professional and amateur investor perspectives. Our first strategy involves using the discrepancy between target and closing prices as an opinion indicator. The second strategy applies argument mining principles to score investors’ opinions, subsequently ranking them by these scores. Experimental results confirm the effectiveness of our approach, demonstrating its ability to identify opinions with higher profit potential. Beyond profitability, our research extends to risk analysis, examining the relationship between recommended opinions and investor behaviors. This offers a holistic view of potential outcomes following the adoption of these recommended opinions.
摘要:在互联网和社交媒体平台快速发展的时代,个人可以轻松地在线分享他们的观点。然而,海量的帖子使得全面分析变得不切实际。这要求我们开发一个高效的推荐系统,以筛选和展示重要且相关的意见。我们的研究提出了一种双管齐下的论点挖掘技术,以提高推荐系统的有效性,同时考虑专业投资者和业余投资者的视角。我们的第一种策略是利用目标价格与收盘价格之间的差异作为观点的指标。第二种策略则应用论点挖掘原则对投资者的观点进行评分,并根据这些评分对观点进行排序。实验结果证实了我们的方法的有效性,展示了其识别具有更高盈利潜力观点的能力。除了盈利性,我们的研究还扩展到风险分析,探讨了推荐观点与投资者行为之间的关系。这为我们提供了一个全面的视角,以评估在采纳这些推荐观点后可能产生的结果。

[NLP-58] From Deception to Detection: The Dual Roles of Large Language Models in Fake News

【速读】: 该论文试图解决关于大型语言模型(LLMs)在对抗假新闻中的双重作用问题,即LLMs是否容易生成偏见性假新闻,以及它们在检测假新闻方面的表现是否优于传统模型。解决方案的关键在于通过对比分析七种不同的LLMs,评估它们在生成和检测假新闻方面的能力。研究发现,尽管某些模型严格遵循安全协议,拒绝生成误导性内容,但其他模型能够轻易生成各种偏见的假新闻。此外,较大的模型通常在检测能力上表现更优,且LLM生成的假新闻比人类编写的更难被检测到。最终,研究结果表明,用户可以通过LLM生成的解释来更好地识别假新闻。

链接: https://arxiv.org/abs/2409.17416
作者: Dorsaf Sallami,Yuan-Chen Chang,Esma Aïmeur
关键词-EN: Fake, public trust, poses a significant, significant threat, ecosystems and public
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fake news poses a significant threat to the integrity of information ecosystems and public trust. The advent of Large Language Models (LLMs) holds considerable promise for transforming the battle against fake news. Generally, LLMs represent a double-edged sword in this struggle. One major concern is that LLMs can be readily used to craft and disseminate misleading information on a large scale. This raises the pressing questions: Can LLMs easily generate biased fake news? Do all LLMs have this capability? Conversely, LLMs offer valuable prospects for countering fake news, thanks to their extensive knowledge of the world and robust reasoning capabilities. This leads to other critical inquiries: Can we use LLMs to detect fake news, and do they outperform typical detection models? In this paper, we aim to address these pivotal questions by exploring the performance of various LLMs. Our objective is to explore the capability of various LLMs in effectively combating fake news, marking this as the first investigation to analyze seven such models. Our results reveal that while some models adhere strictly to safety protocols, refusing to generate biased or misleading content, other models can readily produce fake news across a spectrum of biases. Additionally, our results show that larger models generally exhibit superior detection abilities and that LLM-generated fake news are less likely to be detected than human-written ones. Finally, our findings demonstrate that users can benefit from LLM-generated explanations in identifying fake news.
摘要:虚假新闻对信息生态系统的完整性和公众信任构成了重大威胁。大语言模型 (LLM) 的出现为对抗虚假新闻带来了巨大的希望。然而,LLM 在这场斗争中是一把双刃剑。一个主要担忧是,LLM 可以被轻易用于大规模制造和传播误导性信息。这引发了一个紧迫的问题:LLM 是否容易生成带有偏见的虚假新闻?所有 LLM 都具备这种能力吗?相反,LLM 由于其广泛的世界知识和强大的推理能力,为对抗虚假新闻提供了宝贵的可能性。这引出了其他关键问题:我们能否利用 LLM 来检测虚假新闻,并且它们是否优于典型的检测模型?本文旨在通过探索各种 LLM 的性能来回答这些关键问题。我们的目标是探索不同 LLM 在有效对抗虚假新闻方面的能力,这标志着首次对七种此类模型进行分析的研究。我们的结果表明,尽管某些模型严格遵守安全协议,拒绝生成带有偏见或误导性的内容,但其他模型可以轻易地在各种偏见范围内生成虚假新闻。此外,我们的结果显示,较大的模型通常表现出更强的检测能力,并且由 LLM 生成的虚假新闻比人类编写的更难被检测到。最后,我们的研究结果表明,用户可以从 LLM 生成的解释中受益,以识别虚假新闻。

[NLP-59] Post-hoc Reward Calibration: A Case Study on Length Bias

【速读】: 该论文旨在解决强化学习从人类反馈中训练的奖励模型(RM)可能存在的偏差问题,特别是这些模型可能依赖于训练数据中的虚假相关性(如输出长度或风格)而非真正的质量来评估输出。解决方案的关键在于提出了一种后验奖励校准(Post-hoc Reward Calibration)方法,通过估计并消除偏差项来近似真实的奖励值。具体实现包括一种直观的偏差估计方法和扩展的局部加权回归方法,这些方法在多个实验设置中展示了显著的改进效果,且具有计算效率高和可扩展性强的特点。

链接: https://arxiv.org/abs/2409.17407
作者: Zeyu Huang,Zihan Qiu,Zili Wang,Edoardo M. Ponti,Ivan Titov
关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Human Feedback aligns, translates human feedback
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback aligns the outputs of Large Language Models with human values and preferences. Central to this process is the reward model (RM), which translates human feedback into training signals for optimising LLM behaviour. However, RMs can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality. These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours in LLMs alignment. This paper addresses the challenge of correcting such biases without additional data and training, introducing the concept of Post-hoc Reward Calibration. We first propose an intuitive approach to estimate the bias term and, thus, remove it to approximate the underlying true reward. We then extend the approach to a more general and robust form with the Locally Weighted Regression. Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the RewardBench dataset; (2) enhanced alignment of RM rankings with GPT-4 evaluations and human preferences based on the AlpacaEval benchmark; and (3) improved Length-Controlled win rate of the RLHF process in multiple LLM–RM combinations. Our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment. Our code and results are available at this https URL.
摘要:基于人类反馈的强化学习使大语言模型的输出与人类价值观和偏好相一致。这一过程的核心是奖励模型 (RM),它将人类反馈转化为优化大语言模型行为的训练信号。然而,RM 可能会通过利用训练数据中的虚假相关性产生偏差,例如偏好基于长度或风格的输出而非真正的质量。这些偏差可能导致输出排序错误、模型评估次优,以及在大语言模型对齐过程中放大不良行为。本文针对在不增加数据和训练的情况下纠正此类偏差的问题,提出了事后奖励校准的概念。我们首先提出了一种直观的方法来估计偏差项,从而将其移除以近似真实的奖励。然后,我们通过局部加权回归将该方法扩展为一种更通用和稳健的形式。重点针对普遍存在的长度偏差,我们在三种实验设置中验证了我们提出的方法,展示了持续的改进:(1) 在 RewardBench 数据集上,33 个奖励模型的平均性能提升了 3.11;(2) 增强了 RM 排序与 GPT-4 评估和基于 AlpacaEval 基准的人类偏好的一致性;(3) 在多个大语言模型与 RM 组合中,RLHF 过程的长度控制胜率有所提高。我们的方法在计算上高效且可推广到其他类型的偏差和 RM,为缓解大语言模型对齐中的偏差提供了一种可扩展且稳健的解决方案。我们的代码和结果可在以下链接获取:https URL。

[NLP-60] Severity Prediction in Mental Health: LLM-based Creation Analysis Evaluation of a Novel Multilingual Dataset

【速读】: 该论文试图解决大语言模型(LLMs)在非英语心理健康支持应用中的有效性问题。解决方案的关键在于提出了一个多语言适应的广泛使用的心理健康数据集,该数据集从英语翻译成六种语言(希腊语、土耳其语、法语、葡萄牙语、德语和芬兰语),从而能够全面评估LLMs在多语言环境下检测心理健康状况和评估其严重程度的表现。通过实验发现,尽管使用相同的翻译数据集,LLMs在不同语言中的表现存在显著差异,这突显了多语言心理健康支持的复杂性,其中语言特定的细微差别和心理健康数据的覆盖范围会影响模型的准确性。此外,该研究强调了在医疗环境中仅依赖LLMs的风险,并提出了显著的成本节约优势,为多语言任务的广泛实施提供了重要支持。

链接: https://arxiv.org/abs/2409.17397
作者: Konstantinos Skianis,John Pavlopoulos,A. Seza Doğruöz
关键词-EN: mental health support, mental health, health support systems, including mental health, health support
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly integrated into various medical fields, including mental health support systems. However, there is a gap in research regarding the effectiveness of LLMs in non-English mental health support applications. To address this problem, we present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages (Greek, Turkish, French, Portuguese, German, and Finnish). This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages. By experimenting with GPT and Llama, we observe considerable variability in performance across languages, despite being evaluated on the same translated dataset. This inconsistency underscores the complexities inherent in multilingual mental health support, where language-specific nuances and mental health data coverage can affect the accuracy of the models. Through comprehensive error analysis, we emphasize the risks of relying exclusively on large language models (LLMs) in medical settings (e.g., their potential to contribute to misdiagnoses). Moreover, our proposed approach offers significant cost savings for multilingual tasks, presenting a major advantage for broad-scale implementation.
摘要:大语言模型 (LLMs) 正越来越多地被整合到包括心理健康支持系统在内的各个医疗领域中。然而,关于 LLMs 在非英语心理健康支持应用中的有效性研究存在空白。为解决这一问题,我们提出了一种新颖的多语言适应方法,将广泛使用的心理健康数据集从英语翻译成六种语言 (希腊语、土耳其语、法语、葡萄牙语、德语和芬兰语)。该数据集能够全面评估 LLM 在检测心理健康状况及其严重程度方面的性能,涵盖多种语言。通过在 GPT 和 Llama 上进行实验,我们观察到尽管在相同的翻译数据集上进行评估,模型在不同语言中的表现存在显著差异。这种不一致性突显了多语言心理健康支持中固有的复杂性,其中语言特定的细微差别和心理健康数据的覆盖范围可能影响模型的准确性。通过全面的错误分析,我们强调了在医疗环境中完全依赖大语言模型 (LLMs) 的风险 (例如,它们可能导致误诊的潜在风险)。此外,我们提出的方法在多语言任务中提供了显著的成本节省,为大规模实施提供了主要优势。

[NLP-61] Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia EMNLP2024

【速读】: 该论文试图解决大语言模型(LLMs)在执行数值运算(如加法和乘法)时准确性不足的问题。解决方案的关键在于研究不同数值系统(如十进制和千进制)在基于Transformer的大型语言模型中的扩展行为。论文通过实证表明,十进制系统在从头训练设置下,无论是在训练数据规模还是模型大小方面,都比百进制或千进制系统更具数据效率,这归因于十进制系统中更高的符号频率。此外,论文揭示了模型在加法和乘法运算中的外推行为模式,并指出百进制和千进制系统在符号级别辨别和操作上的困难。

链接: https://arxiv.org/abs/2409.17391
作者: Zhejian Zhou,Jiayu Wang,Dahua Lin,Kai Chen
关键词-EN: shown remarkable abilities, numeric operations accurately, Large Language Models, performing numeric operations, mathematics reasoning
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 Findings

点击查看摘要

Abstract:Though Large Language Models (LLMs) have shown remarkable abilities in mathematics reasoning, they are still struggling with performing numeric operations accurately, such as addition and multiplication. Numbers can be tokenized into tokens in various ways by different LLMs and affect the numeric operations performance. Currently, there are two representatives: 1) Tokenize into 1 -digit, and 2) Tokenize into 1\sim 3 digit. The difference is roughly equivalent to using different numeral systems (namely base 10 or base 10^3 ). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base 10 system is consistently more data-efficient than a base 10^2 or 10^3 system across training data scale, model sizes under from-scratch training settings, while different number systems have very similar fine-tuning performances. We attribute this to higher token frequencies of a base 10 system. Additionally, we reveal extrapolation behavior patterns on addition and multiplication. We identify that base 100 and base 1000 systems struggle on token-level discernment and token-level operations. We also sheds light on the mechanism learnt by the models.
摘要:尽管大语言模型 (LLM) 在数学推理方面展现了卓越的能力,但在执行加法和乘法等数值运算时仍面临挑战。不同的大语言模型可以通过多种方式将数字 Token 化,从而影响数值运算的性能。目前,主要有两种代表性方法:1) 将数字 Token 化为 1 位数字,2) 将数字 Token 化为 1 到 3 位数字。这两种方法的差异大致相当于使用不同的数制(即十进制或千进制)。基于此,我们研究了在基于 Transformer 的大语言模型背景下,不同数制的缩放行为。我们通过实证表明,在从头开始训练的设置下,十进制系统在训练数据规模和模型大小方面始终比百进制或千进制系统更具数据效率,而不同的数制在微调性能上非常相似。我们将此归因于十进制系统更高的 Token 频率。此外,我们揭示了加法和乘法上的外推行为模式。我们发现,百进制和千进制系统在 Token 级别的辨别和 Token 级别的运算上存在困难。我们还阐明了模型所学机制的原理。

[NLP-62] data2lang2vec: Data Driven Typological Features Completion

【速读】: 该论文试图解决语言类型学数据库在多语言自然语言处理(NLP)中的覆盖率不足问题,特别是lang2vec工具包仅覆盖28.9%的语言。解决方案的关键在于利用文本数据进行更精准的特征预测,通过引入多语言词性标注器(POS tagger),实现了在1,749种语言上超过70%的准确率,并结合外部统计特征和多种机器学习算法,显著提升了特征预测的准确性和覆盖率。

链接: https://arxiv.org/abs/2409.17373
作者: Hamidreza Amirzadeh,Sadegh Jafari,Anika Harju,Rob van der Goot
关键词-EN: Natural Language Processing, diverse linguistic structures, enhance multi-lingual Natural, improving model adaptability, multi-lingual Natural Language
类目: Computation and Language (cs.CL)
备注: 9 pages, 11 figures

点击查看摘要

Abstract:Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely-used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage predicts missing values based on features from other languages or focuses on single features, we propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on likely to be missing typology features, and show that our approach outperforms previous work in both setups.
摘要:语言类型学数据库通过提高模型对多样语言结构的适应性,增强了多语言自然语言处理 (NLP) 的能力。广泛使用的 lang2vec 工具包整合了多个此类数据库,但其覆盖率仍限于 28.9%。先前的工作主要通过基于其他语言特征的预测来自动增加覆盖率,或专注于单一特征,我们提出利用文本数据进行更明智的特征预测。为此,我们引入了一个多语言词性标注器 (POS tagger),在 1,749 种语言中实现了超过 70% 的准确率,并实验了外部统计特征和多种机器学习算法。我们还引入了一种更现实的评估设置,重点关注可能缺失的类型学特征,并展示了我们的方法在两种设置下均优于先前的工作。

[NLP-63] Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

【速读】: 该论文试图解决当前基于语音的大型语言模型(LLMs)在直接语音对话处理中存在的延迟和音频特征丢失问题。解决方案的关键在于将自动语音识别(ASR)的链式思维隐式内化到语音LLM中,从而增强模型对语音的直接理解能力,减少延迟并提高实时音频交互的自然性和效率。

链接: https://arxiv.org/abs/2409.17353
作者: Robin Shing-Hei Yuen,Timothy Tin-Long Tse,Jian Zhu
关键词-EN: Current speech-based LLMs, Current speech-based, excelling in tasks, predominantly trained, trained on extensive
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. Our approach reduces latency and improves the model’s native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.
摘要:当前基于语音的大语言模型 (LLM) 主要通过大量的自动语音识别 (ASR) 和文本转语音 (TTS) 数据集进行训练,在这些领域相关的任务中表现出色。然而,它们在处理直接的语音到语音对话方面的能力仍然显著受限。这些模型通常依赖于一个 ASR 到 TTS 的链式思维流程,先将语音转换为文本进行处理,然后再生成音频响应,这引入了延迟并丢失了音频特征。我们提出了一种方法,将 ASR 链式思维隐式内化到语音大语言模型中,增强其对语音的固有理解能力。我们的方法减少了延迟,并提高了模型对语音的固有理解能力,为更高效和自然的实时音频交互铺平了道路。我们还发布了一个大规模的合成对话数据集,以促进进一步的研究。

[NLP-64] How Transliterations Improve Crosslingual Alignment

【速读】: 该论文试图解决的问题是如何通过仅使用音译数据而不依赖平行数据来提升多语言预训练语言模型(mPLMs)的跨语言对齐效果,并探讨其背后的机制。解决方案的关键在于通过在原始数据和音译数据上应用对齐目标,特别是对比学习目标,来增强模型对匹配句子和随机句子对的区分能力,从而改善跨语言对齐。实验结果表明,音译数据的加入显著提升了句子表示的相似性,但同时也指出,更好的对齐并不总是带来更好的下游任务性能,这表明需要进一步研究以明确对齐与性能之间的关系。

链接: https://arxiv.org/abs/2409.17326
作者: Yihong Liu,Mingyang Wang,Amir Hossein Kargaran,Ayyoob Imani,Orgest Xhelili,Haotian Ye,Chunlan Ma,François Yvon,Hinrich Schütze
关键词-EN: post-aligning multilingual pretrained, Recent studies, multilingual pretrained language, studies have shown, shown that post-aligning
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experiments show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better alignments. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.
摘要:近期研究表明,通过在原始数据和音译数据上使用对齐目标对多语言预训练语言模型 (mPLMs) 进行后对齐,可以提升跨语言对齐效果。这种提升进一步带来了更好的跨语言迁移性能。然而,目前尚不清楚这种更好的跨语言对齐是如何以及为何实现的,因为该技术仅涉及音译,并未使用任何平行数据。本文尝试明确评估跨语言对齐,并识别基于音译方法中对性能提升起关键作用的因素。为此,我们在两种相关语言对(波兰语与乌克兰语,以及印地语与乌尔都语)上,采用不同设置训练了多个模型。为了评估对齐效果,我们定义了基于句子表示的四种相似性类型。实验结果显示,仅添加音译数据就能提升整体相似性,即使是随机句子对。借助辅助对齐目标,特别是对比目标,模型能够区分匹配对与随机对,从而实现更好的对齐。然而,我们也发现更好的对齐并不总能带来更好的下游性能,这表明需要进一步研究以阐明对齐与性能之间的关系。

[NLP-65] Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation EMNLP2024

【速读】: 该论文试图解决Vision-Language Navigation (VLN)任务中模型在不同指令类别上的性能诊断问题,提出了一种基于上下文无关文法(CFG)的评估框架。解决方案的关键在于利用大型语言模型(LLMs)辅助构建CFG,并基于此设计了五个主要的指令类别(方向变化、地标识别、区域识别、垂直移动和数值理解),通过生成和分析跨类别的数据,揭示了不同模型在各指令类别上的性能差异和常见问题,为未来语言引导导航系统的发展提供了重要见解。

链接: https://arxiv.org/abs/2409.17313
作者: Zehao Wang,Minye Wu,Yixin Cao,Yubo Ma,Meiqi Chen,Tinne Tuytelaars
关键词-EN: study presents, instruction categories, evaluation framework, VLN, Vision-Language Navigation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2024 Findings; project page: this https URL

点击查看摘要

Abstract:This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models for various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task. The CFG serves as the basis for the problem decomposition and the core premise of the instruction categories design. We propose a semi-automatic method for CFG construction with the help of Large-Language Models (LLMs). Then, we induct and generate data spanning five principal instruction categories (i.e. direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other interesting findings contribute to the development of future language-guided navigation systems.
摘要:本研究提出了一种新颖的视觉-语言导航 (Vision-Language Navigation, VLN) 任务评估框架。其目标是在更细粒度的层面上诊断当前模型在各种指令类别中的表现。该框架围绕任务的上下文无关文法 (Context-Free Grammar, CFG) 构建。CFG 作为问题分解的基础和指令类别设计的核心前提。我们提出了一种半自动的 CFG 构建方法,借助大语言模型 (Large-Language Models, LLMs) 的帮助。随后,我们归纳并生成了涵盖五个主要指令类别(即方向变化、地标识别、区域识别、垂直移动和数值理解)的数据。我们对不同模型的分析揭示了显著的性能差异和反复出现的问题。数值理解的停滞、对方向概念的严重选择性偏差以及其他有趣的发现,为未来语言引导导航系统的发展提供了贡献。

[NLP-66] BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data CONLL2024

【速读】: 该论文试图解决在数据有限的情况下,如何通过模型蒸馏技术提升语言模型的性能问题。解决方案的关键在于利用两个教师模型对一个345百万参数的BabyLlama-2模型进行蒸馏预训练,并在10百万词的语料库上进行训练。通过广泛的参数调整实验,证明了蒸馏技术在数据有限环境下的优势,并强调了进一步研究蒸馏技术的重要性。

链接: https://arxiv.org/abs/2409.17312
作者: Jean-Loup Tastet,Inar Timiryasov
关键词-EN: million word corpus, million parameter model, parameter model distillation-pretrained, million word, million word datasets
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 3 figures, 5 tables, submitted to the BabyLM Challenge (CoNLL 2024 Shared Task)

点击查看摘要

Abstract:We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantages of distillation cannot be attributed to suboptimal hyperparameter selection of the teachers. Our findings underscore the need for further investigation into distillation techniques, particularly in data-limited settings.
摘要:我们提出了 BabyLlama-2,这是一个 3.45 亿参数的模型,通过从两个教师模型在 1000 万词的语料库上进行蒸馏预训练,用于 BabyLM 竞赛。在 BLiMP 和 SuperGLUE 基准测试中,BabyLlama-2 的表现优于在相同数据混合的 1000 万和 1 亿词数据集上训练的基线模型,以及其教师模型。通过广泛的参数扫描,我们证明了蒸馏的优势不能归因于教师模型次优的超参数选择。我们的研究结果强调了进一步研究蒸馏技术,特别是在数据受限环境中的必要性。

[NLP-67] On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains

【速读】: 该论文试图解决检索增强生成(RAG)系统在知识密集型领域(如医疗、金融和法律)中的对抗性鲁棒性问题,特别是针对检索系统的通用中毒攻击。解决方案的关键在于发现并利用中毒文档与查询之间的嵌入偏差模式,开发了一种基于检测的防御机制,通过识别和过滤潜在的中毒文档,确保RAG系统的安全使用。实验结果表明,该方法在各种问答领域中均能实现高检测率。

链接: https://arxiv.org/abs/2409.17275
作者: Xun Xian,Ganghua Wang,Xuan Bi,Jayanth Srinivasa,Ashish Kundu,Charles Fleming,Mingyi Hong,Jie Ding
关键词-EN: large language models, Retrieval-Augmented Generation, language models, legal contexts, empirically shown
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs’ generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q\A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any users, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query’s embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q\A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 已被实证证明能够提升大语言模型 (Large Language Models, LLMs) 在医疗、金融和法律等知识密集型领域的表现。给定一个查询,RAG 从语料库中检索相关文档,并将其整合到 LLMs 的生成过程中。在本研究中,我们探讨了 RAG 的对抗鲁棒性,特别关注检索系统的安全性。首先,在 225 种不同的语料库、检索器、查询和目标信息组合中,我们发现检索系统在医疗问答中容易受到普遍的投毒攻击。在这种攻击中,攻击者生成包含广泛目标信息的投毒文档,如个人身份信息。当这些投毒文档被插入到语料库中时,只要使用攻击者指定的查询,它们就能被准确检索到。为了理解这一漏洞,我们发现查询嵌入与投毒文档嵌入之间的偏差往往遵循一种模式,即投毒文档与查询之间的高相似性得以保留,从而实现精确检索。基于这些发现,我们开发了一种基于检测的新防御措施,以确保 RAG 的安全使用。通过在多个问答领域的广泛实验,我们观察到所提出的方法在几乎所有情况下都能持续实现优异的检测率。

[NLP-68] Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning

【速读】: 该论文试图解决大语言模型(LLMs)在处理新领域和复杂逻辑序列时推理不一致的问题。解决方案的关键在于引入“Proof of Thought”框架,通过将LLM生成的想法与形式逻辑验证相结合,使用自定义解释器将LLM输出转换为一阶逻辑结构,以便进行定理证明器的审查。核心方法包括一个基于JSON的领域特定语言,该语言在精确逻辑结构和直观人类概念之间取得平衡,从而实现严格的验证和易于理解的人类可解释性。此外,该方法还包括一个强大的类型系统,用于增强逻辑完整性,明确区分事实知识和推断知识,并提供灵活的架构以适应各种领域特定应用的扩展。

链接: https://arxiv.org/abs/2409.17270
作者: Debargha Ganguly,Srinivasan Iyengar,Vipin Chaudhary,Shivkumar Kalyanaraman
关键词-EN: Large Language Models, natural language processing, revolutionized natural language, complex logical sequences, Large Language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning, particularly in novel domains and complex logical sequences. This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs. Our approach bridges LLM-generated ideas with formal logic verification, employing a custom interpreter to convert LLM outputs into First Order Logic constructs for theorem prover scrutiny. Central to our method is an intermediary JSON-based Domain-Specific Language, which by design balances precise logical structures with intuitive human concepts. This hybrid representation enables both rigorous validation and accessible human comprehension of LLM reasoning processes. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge, and a flexible architecture that allows for easy extension to various domain-specific applications. We demonstrate Proof of Thought’s effectiveness through benchmarking on StrategyQA and a novel multimodal reasoning task, showing improved performance in open-ended scenarios. By providing verifiable and interpretable results, our technique addresses critical needs for AI system accountability and sets a foundation for human-in-the-loop oversight in high-stakes domains.
摘要:大语言模型 (LLMs) 已经彻底改变了自然语言处理领域,但它们在处理新领域和复杂逻辑序列时,推理过程往往不一致。本研究引入了“思维证明” (Proof of Thought) 框架,旨在提升 LLM 输出的可靠性和透明度。我们的方法通过自定义解释器,将 LLM 生成的想法与形式逻辑验证相结合,将 LLM 输出转换为用于定理证明器审查的一阶逻辑结构。该方法的核心是一个基于 JSON 的领域特定语言 (Domain-Specific Language),它在设计上平衡了精确的逻辑结构与直观的人类概念。这种混合表示方式既实现了严格的验证,又便于人类理解 LLM 的推理过程。主要贡献包括一个具有排序管理功能的强大类型系统,以增强逻辑完整性;明确表示规则,以清晰区分事实性知识和推理性知识;以及一个灵活的架构,便于轻松扩展到各种领域特定应用。我们通过在 StrategyQA 和一项新的多模态推理任务上的基准测试,展示了“思维证明”框架的有效性,表明在开放式场景中性能有所提升。通过提供可验证和可解释的结果,我们的技术满足了 AI 系统责任性的关键需求,并为高风险领域中的人机协同监督奠定了基础。

[NLP-69] Plurals: A System for Guiding LLMs Via Simulated Social Ensembles

【速读】: 该论文试图解决语言模型可能偏向特定观点的问题,提出了一种名为Plurals的系统及Python库,通过模拟多元视角的审议过程来实现更公正的决策。解决方案的关键在于利用不同的视角(Agents)在可定制的结构(Structures)中进行审议,并由主持人(Moderators)监督,从而生成具有代表性的社会群体模拟结果。Plurals不仅整合了政府数据以创建具有国家代表性的角色,还借鉴了民主审议理论的模板,允许用户自定义信息共享结构和审议行为,从而实现多元视角的有效整合。

链接: https://arxiv.org/abs/2409.17213
作者: Joshua Ashkinaze,Emily Fry,Narendra Edara,Eric Gilbert,Ceren Budak
关键词-EN: Recent debates raised, debates raised concerns, Recent debates, debates raised, raised concerns
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Recent debates raised concerns that language models may favor certain viewpoints. But what if the solution is not to aim for a ‘view from nowhere’ but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) which deliberate within customizable Structures, with Moderators overseeing deliberation. Plurals is a generator of simulated social ensembles. Plurals integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by democratic deliberation theory, and allows users to customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at this https URL and will be continually updated.
摘要:近期关于语言模型的讨论引发了对其可能偏袒某些观点的担忧。但如果解决方案不是追求“无立场”,而是利用不同的观点呢?我们引入了 Plurals,这是一个用于多元 AI 审议的系统和 Python 库。Plurals 由 AI 智能体(大语言模型,可选地带有角色)组成,这些智能体在可定制的结构中进行审议,并由主持人监督审议过程。Plurals 是一个模拟社会集合的生成器。Plurals 整合了政府数据集以创建具有国家代表性的角色,包含了受民主审议理论启发的审议模板,并允许用户自定义信息共享结构和结构内的审议行为。六个案例研究展示了其对理论构件的忠实性和有效性。三个随机实验表明,模拟焦点小组产生的输出与相关在线受众样本产生了共鸣(在 75% 的试验中选择了少样本生成而非零样本生成)。Plurals 既是一种范式,也是一个具体的多元 AI 系统。Plurals 库可通过此 https URL 获取,并将持续更新。

[NLP-70] An Effective Robust and Fairness-aware Hate Speech Detection Framework

【速读】: 该论文旨在解决在线社交网络中仇恨言论检测的准确性、鲁棒性和公平性问题。解决方案的关键在于设计了一个数据增强、公平性处理和不确定性估计的新框架,其中引入了双向四元数-准LSTM层以平衡效果和效率,并通过整合来自三个平台的五个数据集来构建一个泛化能力强的模型。实验结果表明,该模型在无攻击和各种攻击场景下均优于现有的八种最先进方法,显示出其有效性和鲁棒性。

链接: https://arxiv.org/abs/2409.17191
作者: Guanyi Mou,Kyumin Lee
关键词-EN: online social networks, widespread online social, speeches are spreading, spreading faster, faster and causing
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: IEEE BigData 2021

点击查看摘要

Abstract:With the widespread online social networks, hate speeches are spreading faster and causing more damage than ever before. Existing hate speech detection methods have limitations in several aspects, such as handling data insufficiency, estimating model uncertainty, improving robustness against malicious attacks, and handling unintended bias (i.e., fairness). There is an urgent need for accurate, robust, and fair hate speech classification in online social networks. To bridge the gap, we design a data-augmented, fairness addressed, and uncertainty estimated novel framework. As parts of the framework, we propose Bidirectional Quaternion-Quasi-LSTM layers to balance effectiveness and efficiency. To build a generalized model, we combine five datasets collected from three platforms. Experiment results show that our model outperforms eight state-of-the-art methods under both no attack scenario and various attack scenarios, indicating the effectiveness and robustness of our model. We share our code along with combined dataset for better future research
摘要:随着在线社交网络的广泛普及,仇恨言论的传播速度比以往任何时候都快,造成的损害也更为严重。现有的仇恨言论检测方法在多个方面存在局限性,如处理数据不足、估计模型不确定性、提高对恶意攻击的鲁棒性以及处理意外偏差(即公平性)。在线社交网络中迫切需要准确、鲁棒且公平的仇恨言论分类。为了填补这一空白,我们设计了一个数据增强、公平性处理和不确定性估计的新颖框架。作为该框架的一部分,我们提出了双向四元数准LSTM层,以平衡有效性和效率。为了构建一个泛化模型,我们结合了从三个平台收集的五个数据集。实验结果表明,在无攻击场景和各种攻击场景下,我们的模型均优于八种最先进的方法,显示出我们模型的有效性和鲁棒性。我们共享了代码和合并的数据集,以促进未来的研究。

[NLP-71] Fully automatic extraction of morphological traits from the Web: utopia or reality?

【速读】: 该论文试图解决植物形态特征信息的大规模结构化问题,即如何从非结构化的在线文本中自动提取并构建植物物种与特征的矩阵。解决方案的关键在于利用大型语言模型(LLMs)的信息提取能力,通过自动化机制处理和解析非结构化的植物特征描述文本,从而实现无需人工干预的大规模特征数据库创建。研究结果表明,该方法能够有效提取超过一半的物种-特征对,F1-score达到75%以上,显示出LLMs在处理此类任务中的潜力。

链接: https://arxiv.org/abs/2409.17179
作者: Diego Marcos,Robert van de Vlasakker,Ioannis N. Athanasiadis,Pierre Bonnet,Hervé Goeau,Alexis Joly,W. Daniel Kissling,César Leblanc,André S.J. van Proosdij,Konstantinos P. Panousis
关键词-EN: Plant morphological traits, observable characteristics, fundamental to understand, understand the role, role played
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Plant morphological traits, their observable characteristics, are fundamental to understand the role played by each species within their ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, massive amounts of information about species descriptions is available online in the form of text, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method managed to find values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, being limited by the availability of textual descriptions covering all the traits of interest.
摘要:植物形态特征,即其可观察到的特性,是理解每种物种在其生态系统中所扮演角色的基础。然而,即使是对中等数量的物种进行特征信息编纂,也是一项耗时耗力的任务,可能需要专家花费数年时间才能完成。与此同时,尽管网络上存在大量关于物种描述的文本信息,但由于缺乏结构化,使得这些数据难以大规模利用。为了克服这一问题,我们提出利用大语言模型 (LLMs) 的最新进展,设计一种机制,用于收集和处理以非结构化文本形式描述的植物特征信息,而无需人工编排。我们通过自动复制三个手动创建的物种-特征矩阵来评估我们的方法。我们的方法成功为超过一半的物种-特征对找到了数值,F1-score 超过 75%。这些结果表明,得益于 LLMs 的信息提取能力,目前从非结构化在线文本中大规模创建结构化特征数据库是可行的,其限制主要在于涵盖所有感兴趣特征的文本描述的可用性。

[NLP-72] CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Casual Significance and Consistency

【速读】: 该论文试图解决大型语言模型(LLMs)在长程推理任务中因因果幻觉导致的推理能力受限问题。解决方案的关键在于提出了一种非链式推理框架——因果显著性和一致性增强器(CSCE),通过定制LLM的损失函数,利用处理效应评估来增强模型的因果显著性和一致性,确保模型能够捕捉关键的因果关系并保持跨场景的稳健和一致性能。此外,该方法将推理过程从传统的链式多步推理转变为因果增强的单步输出,从而提高了推理效率。

链接: https://arxiv.org/abs/2409.17174
作者: Kangsheng Wang,Xiao Zhang,Zizheng Guo,Tianyu Hu,Huimin Ma
关键词-EN: large language models, causal significance, significance and consistency, reasoning, solving reasoning tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-based reasoning methods like chain of thought (CoT) play a rising role in solving reasoning tasks for large language models (LLMs). However, the causal illusions between \textita step of reasoning and \textitcorresponding state transitions are becoming a significant obstacle to advancing LLMs’ reasoning capabilities, especially in long-range reasoning tasks. This paper proposes a non-chain-based reasoning framework for simultaneous consideration of causal significance and consistency, i.e., the Causal Significance and Consistency Enhancer (CSCE). We customize LLM’s loss function utilizing treatment effect assessments to enhance its reasoning ability from two aspects: causal significance and consistency. This ensures that the model captures essential causal relationships and maintains robust and consistent performance across various scenarios. Additionally, we transform the reasoning process from the cascading multiple one-step reasoning commonly used in Chain-Based methods, like CoT, to a causal-enhanced method that outputs the entire reasoning process in one go, further improving the model’s reasoning efficiency. Extensive experiments show that our method improves both the reasoning success rate and speed. These improvements further demonstrate that non-chain-based methods can also aid LLMs in completing reasoning tasks.
摘要:基于链的推理方法,如思维链 (Chain of Thought, CoT),在大语言模型 (Large Language Models, LLMs) 解决推理任务中扮演着日益重要的角色。然而,推理步骤与相应状态转换之间的因果错觉正成为提升 LLMs 推理能力,尤其是在长距离推理任务中的一个重大障碍。本文提出了一种非链式推理框架,即因果显著性与一致性增强器 (Causal Significance and Consistency Enhancer, CSCE),用于同时考虑因果显著性和一致性。我们通过利用处理效应评估来定制 LLM 的损失函数,从因果显著性和一致性两个方面增强其推理能力。这确保了模型能够捕捉关键的因果关系,并在各种场景中保持稳健且一致的性能。此外,我们将推理过程从基于链的方法(如 CoT)中常见的级联多步推理转变为因果增强的方法,该方法一次性输出整个推理过程,从而进一步提高了模型的推理效率。大量实验表明,我们的方法在推理成功率和速度上都有所提升。这些改进进一步证明了非链式方法也能帮助 LLMs 完成推理任务。

[NLP-73] A Multiple-Fill-in-the-Blank Exam Approach for Enhancing Zero-Resource Hallucination Detection in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)生成幻觉文本的问题,特别是在多次生成文本时,由于故事情节的变化导致难以进行语义比较,从而影响检测准确性的问题。解决方案的关键在于提出了一种基于多填空测试的方法,通过在原始文本中遮蔽多个对象,并让LLM重复回答这些填空问题,确保生成的答案与原始故事情节一致。这种方法通过量化每个原始句子的幻觉程度,考虑了幻觉在文本中的累积效应,从而提高了检测的准确性和效果。

链接: https://arxiv.org/abs/2409.17173
作者: Satoshi Munakata,Taku Fukui,Takao Mohri
关键词-EN: Large language models, Large language, language models, fabricate a hallucinatory, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Large language models (LLMs) often fabricate a hallucinatory text. Several methods have been developed to detect such text by semantically comparing it with the multiple versions probabilistically regenerated. However, a significant issue is that if the storyline of each regenerated text changes, the generated texts become incomparable, which worsen detection accuracy. In this paper, we propose a hallucination detection method that incorporates a multiple-fill-in-the-blank exam approach to address this storyline-changing issue. First, our method creates a multiple-fill-in-the-blank exam by masking multiple objects from the original text. Second, prompts an LLM to repeatedly answer this exam. This approach ensures that the storylines of the exam answers align with the original ones. Finally, quantifies the degree of hallucination for each original sentence by scoring the exam answers, considering the potential for \emphhallucination snowballing within the original text itself. Experimental results show that our method alone not only outperforms existing methods, but also achieves clearer state-of-the-art performance in the ensembles with existing methods.
摘要:大语言模型 (LLMs) 常常生成幻觉文本。已有多种方法通过语义比较这些文本与概率性重新生成的多个版本来进行检测。然而,一个显著的问题是,如果每个重新生成的文本的故事线发生变化,生成的文本将变得不可比较,从而降低检测准确性。本文提出了一种幻觉检测方法,该方法结合了多填空题考试方法来解决故事线变化的问题。首先,我们的方法通过从原始文本中屏蔽多个对象来创建多填空题考试。其次,提示 LLM 反复回答此考试。这种方法确保了考试答案的故事线与原始故事线一致。最后,通过评分考试答案来量化每个原始句子的幻觉程度,考虑到原始文本内部可能存在的幻觉滚雪球效应。实验结果表明,我们的方法不仅单独优于现有方法,而且在与现有方法的集成中实现了更清晰的最新性能。

[NLP-74] What Would You Ask When You First Saw a2b2=c2? Evaluating LLM on Curiosity-Driven Questioning

【速读】: 该论文试图解决的问题是评估大型语言模型(LLMs)获取新知识的能力。解决方案的关键在于提出了一种新颖的评估框架,该框架通过提示LLMs生成关于科学知识陈述的问题,模拟初次接触该陈述时的自然好奇心,并根据生成问题的质量评分来评估模型的知识获取潜力。通过控制性消融研究和人工评估验证了评分过程的有效性,并发现模型大小并非决定知识获取潜力的唯一因素。

链接: https://arxiv.org/abs/2409.17172
作者: Shashidhar Reddy Javaji,Zining Zhu
关键词-EN: knowledge remains unknown, remains unknown, store a massive, massive amount, knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person when facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulties, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen’s kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems
摘要:大语言模型 (LLMs) 能够存储大量知识,但其获取新知识的能力仍未明确。我们提出了一种新颖的评估框架,用于评估这一能力。该框架引导 LLMs 针对引入科学知识的陈述生成问题,模拟初次接触该陈述的好奇者。我们通过评分生成问题的质量,从而评估 LLM 的知识获取潜力。我们采用控制性消融研究来验证评分程序。此外,我们创建了一个合成数据集,包含 1101 条物理、化学和数学领域的陈述,难度各异,300 条一般知识陈述,以及 567 条错误陈述。通过人类评估验证了我们的模型评估,在考虑的三项指标上达到了约 0.7 的加权 Cohen’s kappa 值。我们发现,尽管像 GPT-4 和 Mistral 8x7b 这样的大型模型擅长生成连贯且相关的问题,但较小的 Phi-2 模型同样或更为有效。这表明,模型大小并非决定知识获取潜力的唯一因素。所提出的框架量化了一种常被忽视的关键模型能力,并为开发更具知识性的 AI 系统开辟了研究机会。

[NLP-75] Cross-Domain Content Generation with Domain-Specific Small Language Models

【速读】: 该论文试图解决小规模语言模型在处理多个不重叠领域数据时生成内容的相关性和一致性问题。解决方案的关键在于采用知识扩展策略,即在冻结原有模型层的基础上,仅训练额外的参数,从而使模型能够在不遗忘先前学习内容的情况下,生成适用于多个领域(如故事和食谱)的内容。这种方法有效避免了全模型微调导致的灾难性遗忘问题,并提升了模型在多领域数据上的表现。

链接: https://arxiv.org/abs/2409.17171
作者: Ankit Maloo Abhinav Garg
关键词-EN: small language models, language models poses, small language, minimal overlap, models poses challenges
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:Generating domain-specific content using small language models poses challenges, especially when dealing with multiple distinct datasets with minimal overlap. In this study, we explore methods to enable a small language model to produce coherent and relevant outputs for two different domains: stories (Dataset A) and recipes (Dataset B). Our initial experiments show that training individual models on each dataset yields satisfactory results, with each model generating appropriate content within its domain. We find that utilizing custom tokenizers tailored to each dataset significantly enhances generation quality compared to using a generic tokenizer. Attempts to adapt a single model to both domains using Low-Rank Adaptation (LoRA) or standard fine-tuning do not yield substantial results, often failing to produce meaningful outputs. Moreover, full fine-tuning without freezing the model’s existing weights leads to catastrophic forgetting, where the model loses previously learned information and only retains knowledge from the new data. To overcome these challenges, we employ a knowledge expansion strategy: training only with additional parameters. This approach enables the model to generate both stories and recipes upon request, effectively handling multiple domains without suffering from catastrophic forgetting. Our findings demonstrate that knowledge expansion with frozen layers is an effective method for small language models to generate domain-specific content across distinct datasets. This work contributes to the development of efficient multi-domain language models and provides insights into managing catastrophic forgetting in small-scale architectures.
摘要:利用小型语言模型生成特定领域的内容面临挑战,尤其是在处理多个几乎没有重叠的不同数据集时。在本研究中,我们探讨了使小型语言模型能够为两个不同领域(故事(数据集 A)和食谱(数据集 B))生成连贯且相关输出的方法。我们的初步实验表明,针对每个数据集训练单独的模型可以获得满意的结果,每个模型都能在其领域内生成合适的内容。我们发现,使用针对每个数据集定制的 Tokenizer 显著提高了生成质量,相比于使用通用 Tokenizer。尝试使用低秩适应(Low-Rank Adaptation, LoRA)或标准微调来使单个模型适应两个领域并未取得显著成果,通常无法生成有意义的输出。此外,在不冻结模型现有权重的情况下进行全面微调会导致灾难性遗忘,模型会丢失之前学习的信息,仅保留新数据的知识。为了克服这些挑战,我们采用了知识扩展策略:仅通过增加参数进行训练。这种方法使模型能够根据请求生成故事和食谱,有效处理多个领域而不会遭受灾难性遗忘。我们的研究结果表明,冻结层的知识扩展是小型语言模型在不同数据集上生成特定领域内容的一种有效方法。这项工作有助于开发高效的多领域语言模型,并为管理小规模架构中的灾难性遗忘提供了见解。

[NLP-76] REAL: Response Embedding-based Alignment for LLMs

【速读】: 该论文试图解决大型语言模型(LLMs)在人类偏好对齐过程中,标注数据集构建的效率问题。解决方案的关键在于提出一种策略,通过选择信息量更大、差异性更高的AI生成响应对进行标注,从而提高训练数据集的质量。实验结果表明,选择差异性较大的响应对不仅增强了LLMs的直接对齐效果,还减少了标注错误,最终在对话任务中取得了最佳表现,并节省了高达65%的标注工作量。

链接: https://arxiv.org/abs/2409.17169
作者: Honggen Zhang,Igor Molybog,June Zhang,Xufeng Zhao
关键词-EN: Aligning large language, Aligning large, Direct Preference Optimization, Preference Optimization rely, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, which usually involve training on supervised datasets. Popular algorithms such as Direct Preference Optimization rely on pairs of AI-generated responses ranked according to human feedback. The labeling process is the most labor-intensive and costly part of the alignment pipeline, and improving its efficiency would have a meaningful impact on AI development. We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs for labeling out of a set of AI-generated responses. Experimental results on synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. We also applied our method to the real-world dataset SHP2, selecting optimal pairs from multiple responses. The model aligned on dissimilar response pairs obtained the best win rate on the dialogue task. Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators’ work.
摘要:将大语言模型 (LLMs) 对齐到人类偏好是构建有用且安全的 AI 工具的关键步骤,这通常涉及在监督数据集上进行训练。流行的算法如直接偏好优化 (Direct Preference Optimization) 依赖于根据人类反馈排序的 AI 生成响应对。标注过程是对齐流程中最耗费人力和成本的部分,提高其效率将对 AI 开发产生重大影响。我们提出了一种策略,用于从一组 AI 生成的响应中采样高质量的训练数据集,重点是获取最具信息量的响应对进行标注。在合成 HH-RLHF 基准上的实验结果表明,选择不相似的响应对可以增强 LLMs 的直接对齐,同时减少继承的标注错误。我们还应用了我们的方法到实际数据集 SHP2,从多个响应中选择最佳对。在对齐不相似响应对的模型在对话任务中获得了最高的胜率。我们的研究结果表明,专注于不太相似的对可以提高 LLM 对齐的效率,节省高达 65% 的标注者工作量。

[NLP-77] StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?

【速读】: 该论文试图解决的问题是探究大型语言模型(LLMs)是否像人类一样在压力下表现出性能波动,并评估不同压力诱导提示对其性能的影响。解决方案的关键在于开发了一套名为StressPrompt的新型提示集,这些提示基于心理学框架设计,并通过人类参与者进行校准,以模拟不同程度的压力。通过将这些提示应用于多个LLMs,研究评估了它们在指令跟随、复杂推理和情感智能等任务中的表现。研究发现,LLMs在适度压力下表现最佳,与Yerkes-Dodson定律一致,而在低和高压力条件下性能下降。此外,这些StressPrompt显著改变了LLMs的内部状态,导致其神经表示发生变化,类似于人类对压力的反应。这一研究为设计在压力环境下仍能保持高性能的AI系统提供了重要见解。

链接: https://arxiv.org/abs/2409.17167
作者: Guobin Shen,Dongcheng Zhao,Aorigele Bao,Xiang He,Yiting Dong,Yi Zeng
关键词-EN: Large Language Models, Language Models, Large Language, stress, LLMs
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.
摘要:人类经常经历压力,这会显著影响他们的表现。本研究探讨了大语言模型 (LLMs) 是否表现出类似于人类的压力反应,以及它们在不同压力诱导提示下的表现是否波动。为了研究这一点,我们开发了一套新颖的提示集,称为 StressPrompt,旨在诱导不同程度的压力。这些提示源自已建立的心理学框架,并根据人类参与者的评分进行了仔细校准。然后,我们将这些提示应用于多个 LLMs,以评估它们在指令跟随、复杂推理和情感智能等任务中的响应。研究结果表明,LLMs 与人类一样,在中等压力下表现最佳,这与 Yerkes-Dodson 定律一致。值得注意的是,它们在低压力和高压力条件下的表现均有所下降。我们的进一步分析揭示,这些 StressPrompts 显著改变了 LLMs 的内部状态,导致其神经表示发生变化,这些变化与人类对压力的反应相似。这项研究为 LLMs 的操作稳健性和灵活性提供了关键见解,展示了设计能够在压力普遍存在的现实世界场景中保持高性能的 AI 系统的重要性,例如在客户服务、医疗保健和应急响应环境中。此外,本研究通过提供关于 LLMs 如何处理不同场景及其与人类认知相似性的新视角,为更广泛的 AI 研究社区做出了贡献。

[NLP-78] BERTScoreVisualizer: A Web Tool for Understanding Simplified Text Evaluation with BERTScore

【速读】: 该论文旨在解决BERTScore评估自动文本简化系统时缺乏对具体词匹配信息的可视化问题。解决方案的关键是引入了BERTScoreVisualizer,这是一个网络应用程序,不仅报告精度、召回率和F1分数,还提供了词匹配的可视化,从而帮助分析简化文本与参考文本之间的偏差,提升文本简化系统的分析质量。

链接: https://arxiv.org/abs/2409.17160
作者: Sebastian Jaskowski,Sahasra Chava,Agam Shah
关键词-EN: evaluate automatic text, automatic text simplification, evaluate automatic, text simplification systems, BERTScore metric
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The BERTScore metric is commonly used to evaluate automatic text simplification systems. However, current implementations of the metric fail to provide complete visibility into all information the metric can produce. Notably, the specific token matchings can be incredibly useful in generating clause-level insight into the quality of simplified text. We address this by introducing BERTScoreVisualizer, a web application that goes beyond reporting precision, recall, and F1 score and provides a visualization of the matching between tokens. We believe that our software can help improve the analysis of text simplification systems by specifically showing where generated, simplified text deviates from reference text. We host our code and demo on GitHub.
摘要:BERTScore 指标常用于评估自动文本简化系统。然而,当前的实现未能完全展示该指标所能提供的所有信息。特别是,具体的 Token 匹配信息对于生成关于简化文本质量的从句级洞察极为有用。我们通过引入 BERTScoreVisualizer,一个网页应用程序,来解决这一问题。该应用不仅报告精度、召回率和 F1 分数,还提供了 Token 匹配的可视化。我们相信,我们的软件可以通过具体展示生成的简化文本与参考文本的偏差,来帮助改进文本简化系统的分析。我们将代码和演示托管在 GitHub 上。

[NLP-79] Unveiling the Potential of Graph Neural Networks in SME Credit Risk Assessment

【速读】: 该论文试图解决企业信用风险评估问题,其解决方案的关键在于利用图神经网络技术,通过构建企业财务指标间的图结构映射,深入分析指标间的关系,并利用图神经网络模型进行特征嵌入和分类预测。具体步骤包括:选择29个财务指标并构建相似矩阵,使用最大生成树算法实现图结构映射;在图表示学习阶段,通过GraphSAGE操作和池化操作获取图的嵌入表示;最后,利用两层全连接网络构建分类器完成预测任务。实验结果表明,该模型能有效完成企业多层次信用等级评估,且具有显著的分类效果和良好的鲁棒性。

链接: https://arxiv.org/abs/2409.17909
作者: Bingyao Liu,Iris Li,Jianhua Yao,Yuan Chen,Guanming Huang,Jiajing Wang
关键词-EN: graph neural network, enterprise financial indicators, credit risk assessment, enterprise credit risk, neural network model
类目: Risk Management (q-fin.RM); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper takes the graph neural network as the technical framework, integrates the intrinsic connections between enterprise financial indicators, and proposes a model for enterprise credit risk assessment. The main research work includes: Firstly, based on the experience of predecessors, we selected 29 enterprise financial data indicators, abstracted each indicator as a vertex, deeply analyzed the relationships between the indicators, constructed a similarity matrix of indicators, and used the maximum spanning tree algorithm to achieve the graph structure mapping of enterprises; secondly, in the representation learning phase of the mapped graph, a graph neural network model was built to obtain its embedded representation. The feature vector of each node was expanded to 32 dimensions, and three GraphSAGE operations were performed on the graph, with the results pooled using the Pool operation, and the final output of three feature vectors was averaged to obtain the graph’s embedded representation; finally, a classifier was constructed using a two-layer fully connected network to complete the prediction task. Experimental results on real enterprise data show that the model proposed in this paper can well complete the multi-level credit level estimation of enterprises. Furthermore, the tree-structured graph mapping deeply portrays the intrinsic connections of various indicator data of the company, and according to the ROC and other evaluation criteria, the model’s classification effect is significant and has good “robustness”.
摘要:本文以图神经网络为技术框架,整合企业财务指标间的内在联系,提出了一种企业信用风险评估模型。主要研究工作包括:首先,基于前人的经验,选取了29个企业财务数据指标,将每个指标抽象为一个顶点,深入分析指标间的关系,构建了指标的相似矩阵,并利用最大生成树算法实现了企业图结构的映射;其次,在映射图的表示学习阶段,构建了图神经网络模型以获取其嵌入表示。每个节点的特征向量被扩展到32维,并在图上进行了三次GraphSAGE操作,结果通过Pool操作进行池化,最终将三个特征向量的输出取平均,得到图的嵌入表示;最后,构建了一个两层的全连接网络分类器,完成了预测任务。在真实企业数据上的实验结果表明,本文提出的模型能够很好地完成企业多层次信用等级估计。此外,树结构的图映射深度描绘了公司各项指标数据的内在联系,根据ROC等评价标准,模型的分类效果显著,并具有良好的“鲁棒性”。

[NLP-80] Revisiting Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

【速读】: 该论文试图解决情感识别领域中语音情感识别(SER)和音乐情感识别(MER)之间的跨域知识迁移问题。解决方案的关键在于分析自监督学习(SSL)模型在SER和MER任务中的层级行为,并通过两阶段微调过程进行跨域适应,探索如何有效利用音乐信息提升SER性能,以及利用语音信息提升MER性能。研究还通过Frechet音频距离分析了情感语音和音乐之间的声学相似性,揭示了情感偏差问题,并提出参数高效的微调方法以增强跨域性能。最终,该研究强调了跨域泛化在提升SER和MER系统中的潜力。

链接: https://arxiv.org/abs/2409.17899
作者: Yujia Sun,Zeyu Zhao,Korin Richmond,Yuanchao Li
关键词-EN: music SSL models, SSL models, speech and music, Emotion recognition, Music Emotion Recognition
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, given the fact that SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotion speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Frechet audio distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary depending on different emotions due to their training strategies and domain-specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from each other. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential for cross-domain generalization to improve SER and MER systems.
摘要:语音和音乐的情感识别由于其声学重叠而具有相似性,这引起了跨领域知识转移的兴趣。然而,语音和音乐之间的共享声学线索,特别是由自监督学习 (Self-Supervised Learning, SSL) 模型编码的线索,在很大程度上仍未被探索,因为针对语音和音乐的 SSL 模型很少应用于跨领域研究。在这项工作中,我们重新审视了情感语音和音乐之间的声学相似性,首先分析了用于语音情感识别 (Speech Emotion Recognition, SER) 和音乐情感识别 (Music Emotion Recognition, MER) 的 SSL 模型的逐层行为。此外,我们通过在两阶段微调过程中比较几种方法,进行跨领域适应,探讨了利用音乐进行 SER 和利用语音进行 MER 的有效方式。最后,我们使用 Frechet 音频距离探索了情感语音和音乐之间的声学相似性,揭示了语音和音乐 SSL 模型中情感偏差的问题。我们的研究结果表明,尽管语音和音乐 SSL 模型确实捕捉到了共享的声学特征,但由于其训练策略和领域特异性,它们的行为会因不同情感而异。此外,参数高效的微调可以通过相互利用知识来提升 SER 和 MER 的性能。本研究为情感语音和音乐之间的声学相似性提供了新的见解,并强调了跨领域泛化以改进 SER 和 MER 系统的潜力。

[NLP-81] Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study

【速读】: 该论文试图解决将预训练语言模型(PLMs)中的transformer结构重新用于自动语音识别(ASR)中的编码器时,其有效性和性能提升的问题。解决方案的关键在于利用transformer在文本数据上预训练时所具备的强大特征提取能力,并将其转移到语音数据上,从而增强ASR的声学建模能力。研究通过实验验证了在不同ASR任务中,使用预训练LM中的transformer作为ASR编码器的初始化点,能够显著降低字符错误率(CER)和词错误率(WER),特别是在需要深刻语义理解的情况下,这种集成方法能够大幅提升ASR系统的性能。

链接: https://arxiv.org/abs/2409.17750
作者: Keyu An,Shiliang Zhang,Zhijie Yan
关键词-EN: Automatic Speech Recognition, Speech Recognition, pre-trained language models, Automatic Speech, Character Error Rate
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 8pages

点击查看摘要

Abstract:In this study, we delve into the efficacy of transformers within pre-trained language models (PLMs) when repurposed as encoders for Automatic Speech Recognition (ASR). Our underlying hypothesis posits that, despite being initially trained on text-based corpora, these transformers possess a remarkable capacity to extract effective features from the input sequence. This inherent capability, we argue, is transferrable to speech data, thereby augmenting the acoustic modeling ability of ASR. Through rigorous empirical analysis, our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated. Particularly, they serve as an advantageous starting point for initializing ASR encoders. Furthermore, we uncover that these transformers, when integrated into a well-established ASR encoder, can significantly boost performance, especially in scenarios where profound semantic comprehension is pivotal. This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems’ capabilities.
摘要:在本研究中,我们深入探讨了在预训练语言模型 (PLMs) 中使用的 Transformer 作为自动语音识别 (ASR) 编码器的有效性。我们的基本假设认为,尽管这些 Transformer 最初是在基于文本的语料库上进行训练的,但它们具有从输入序列中提取有效特征的显著能力。我们认为,这种固有能力可以转移到语音数据上,从而增强 ASR 的声学建模能力。通过严格的实证分析,我们的研究结果显示,当预训练语言模型中的 Transformer 被引入时,跨不同 ASR 任务的字符错误率 (CER) 和词错误率 (WER) 显著改善。特别是,它们为初始化 ASR 编码器提供了一个有利的起点。此外,我们发现,当这些 Transformer 被整合到一个成熟的 ASR 编码器中时,可以显著提升性能,尤其是在需要深刻语义理解的情况下。这突显了利用预训练 Transformer 中嵌入的语义优势来提升 ASR 系统能力的潜力。

[NLP-82] When A Man Says He Is Pregnant: ERP Evidence for A Rational Account of Speaker-contextualized Language Comprehension

【速读】: 该论文试图解决关于说话者身份与言语内容不匹配时神经生理反应的矛盾结果问题。解决方案的关键在于区分两种不同的认知过程:一种是基于社会刻板印象的整合过程,表现为N400效应;另一种是基于生物学知识的错误修正过程,表现为P600效应。通过实验验证,论文揭示了这两种效应分别对应于不同的语言理解策略,从而调和了先前研究中的不一致性,并为说话者情境化的语言理解提供了合理的解释。

链接: https://arxiv.org/abs/2409.17525
作者: Hanlin Wu,Zhenguang G. Cai
关键词-EN: includes the identities, Spoken, effect, Spoken language, ERP effects reflect
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Spoken language is often, if not always, understood in a context that includes the identities of speakers. For instance, we can easily make sense of an utterance such as “I’m going to have a manicure this weekend” or “The first time I got pregnant I had a hard time” when the utterance is spoken by a woman, but it would be harder to understand when it is spoken by a man. Previous event-related potential (ERP) studies have shown mixed results regarding the neurophysiological responses to such speaker-mismatched utterances, with some reporting an N400 effect and others a P600 effect. In an experiment involving 64 participants, we showed that these different ERP effects reflect distinct cognitive processes employed to resolve the speaker-message mismatch. When possible, the message is integrated with the speaker context to arrive at an interpretation, as in the case of violations of social stereotypes (e.g., men getting a manicure), resulting in an N400 effect. However, when such integration is impossible due to violations of biological knowledge (e.g., men getting pregnant), listeners engage in an error correction process to revise either the perceived utterance or the speaker context, resulting in a P600 effect. Additionally, we found that the social N400 effect decreased as a function of the listener’s personality trait of openness, while the biological P600 effect remained robust. Our findings help to reconcile the empirical inconsistencies in the literature and provide a rational account of speaker-contextualized language comprehension.
摘要:口语交流通常(即使不是总是)在包含说话者身份的背景下被理解。例如,当一位女性说出“我这周末要去修指甲”或“我第一次怀孕时很困难”时,我们很容易理解这些话语,但如果这些话是由男性说出的,理解起来就会更加困难。先前的与事件相关电位(ERP)研究对这种说话者与信息不匹配的话语的神经生理反应结果不一,有些报告了N400效应,而另一些则报告了P600效应。在一个涉及64名参与者的实验中,我们发现这些不同的ERP效应反映了用于解决说话者与信息不匹配的不同认知过程。在可能的情况下,信息会与说话者背景整合以达成解释,例如在违反社会刻板印象(如男性修指甲)的情况下,这导致了N400效应。然而,当这种整合由于违反生物学知识(如男性怀孕)而变得不可能时,听者会进行错误修正过程,以修正感知到的话语或说话者背景,从而导致P600效应。此外,我们发现社会N400效应随着听者开放性人格特质的增加而减少,而生物P600效应则保持稳定。我们的研究有助于调和文献中的实证不一致性,并为说话者背景化的语言理解提供了合理的解释。

[NLP-83] Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control ICASSP2025

【速读】: 该论文试图解决跨语言控制文本到语音合成(TTS)中的语音特征问题,特别是在目标语言缺乏音频-描述配对数据的情况下。解决方案的关键在于结合目标语言的TTS模型与另一种语言的描述控制模型,通过共享基于自监督学习(SSL)的解耦音色和风格表示,实现跨语言的语音特征控制。这种方法不仅允许在保留原始音色的同时控制说话风格,还利用了SSL的跨语言通用性,使得在不同语言间共享嵌入空间成为可能,从而在没有目标语言音频-描述配对数据的情况下,仍能实现高质量的自然语音合成和控制。

链接: https://arxiv.org/abs/2409.17452
作者: Ryuichi Yamamoto,Yuma Shirahata,Masaya Kawamura,Kentaro Tachibana
关键词-EN: cross-lingual control capability, TTS model trained, description-based controllable, TTS model, TTS
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages, even though no Japanese audio-description pairs are used.
摘要:我们提出了一种基于描述的可控跨语言文本到语音 (Text-to-Speech, TTS) 方法。为了解决目标语言中缺乏音频-描述配对数据的问题,我们将目标语言训练的 TTS 模型与另一种语言训练的描述控制模型相结合,该模型将输入文本描述映射到 TTS 模型的条件特征。这两个模型基于自监督学习 (Self-Supervised Learning, SSL) 共享解耦的音色和风格表示,从而实现解耦的语音控制,例如在保留原始音色的同时控制说话风格。此外,由于基于 SSL 的音色和风格表示是语言无关的,因此通过共享相同的嵌入空间将 TTS 和描述控制模型结合,可以有效地实现语音特征的跨语言控制。在英语和日语 TTS 上的实验表明,尽管没有使用日语音频-描述配对数据,我们的方法在两种语言上都实现了高自然度和可控性。

人工智能

[AI-0] Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography MICCAI2024

链接: https://arxiv.org/abs/2409.18119
作者: Yuexi Du,John Onofrey,Nicha C. Dvornek
关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Language-Image Pre-training, requires substantial data, shows promise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is also the basis of the overall best solution for the MICCAI 2024 CXR-LT Challenge

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and data imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline.

[AI-1] Find Rhinos without Finding Rhinos: Active Learning with Multimodal Imagery of South African Rhino Habitats IJCAI2023

链接: https://arxiv.org/abs/2409.18104
作者: Lucia Gordon,Nikhil Behari,Samuel Collier,Elizabeth Bondi-Kelly,Jackson A. Killian,Catherine Ressijac,Peter Boucher,Andrew Davies,Milind Tambe
关键词-EN: Earth charismatic megafauna, crisis in Africa, Earth charismatic, human activities, charismatic megafauna
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 9 figures, IJCAI 2023 Special Track on AI for Good

点击查看摘要

Abstract:Much of Earth’s charismatic megafauna is endangered by human activities, particularly the rhino, which is at risk of extinction due to the poaching crisis in Africa. Monitoring rhinos’ movement is crucial to their protection but has unfortunately proven difficult because rhinos are elusive. Therefore, instead of tracking rhinos, we propose the novel approach of mapping communal defecation sites, called middens, which give information about rhinos’ spatial behavior valuable to anti-poaching, management, and reintroduction efforts. This paper provides the first-ever mapping of rhino midden locations by building classifiers to detect them using remotely sensed thermal, RGB, and LiDAR imagery in passive and active learning settings. As existing active learning methods perform poorly due to the extreme class imbalance in our dataset, we design MultimodAL, an active learning system employing a ranking technique and multimodality to achieve competitive performance with passive learning models with 94% fewer labels. Our methods could therefore save over 76 hours in labeling time when used on a similarly-sized dataset. Unexpectedly, our midden map reveals that rhino middens are not randomly distributed throughout the landscape; rather, they are clustered. Consequently, rangers should be targeted at areas with high midden densities to strengthen anti-poaching efforts, in line with UN Target 15.7.

[AI-2] AI-Powered Augmented Reality for Satellite Assembly Integration and Test

链接: https://arxiv.org/abs/2409.18101
作者: Alvaro Patricio,Joao Valente,Atabak Dehban,Ines Cadilha,Daniel Reis,Rodrigo Ventura
关键词-EN: improving operational efficiency, Artificial Intelligence, Augmented Reality, transform satellite Assembly, minimizing human error
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) and Augmented Reality (AR) is set to transform satellite Assembly, Integration, and Testing (AIT) processes by enhancing precision, minimizing human error, and improving operational efficiency in cleanroom environments. This paper presents a technical description of the European Space Agency’s (ESA) project “AI for AR in Satellite AIT,” which combines real-time computer vision and AR systems to assist technicians during satellite assembly. Leveraging Microsoft HoloLens 2 as the AR interface, the system delivers context-aware instructions and real-time feedback, tackling the complexities of object recognition and 6D pose estimation in AIT workflows. All AI models demonstrated over 70% accuracy, with the detection model exceeding 95% accuracy, indicating a high level of performance and reliability. A key contribution of this work lies in the effective use of synthetic data for training AI models in AR applications, addressing the significant challenges of obtaining real-world datasets in highly dynamic satellite environments, as well as the creation of the Segmented Anything Model for Automatic Labelling (SAMAL), which facilitates the automatic annotation of real data, achieving speeds up to 20 times faster than manual human annotation. The findings demonstrate the efficacy of AI-driven AR systems in automating critical satellite assembly tasks, setting a foundation for future innovations in the space industry.

[AI-3] EfficientCrackNet: A Lightweight Model for Crack Segmentation

链接: https://arxiv.org/abs/2409.18099
作者: Abid Hasan Zim,Aquib Iqbal,Zaid Al-Huda,Asad Malik,Minoru Kuribayash
关键词-EN: computer vision due, intricate topologies, low contrast, presents a formidable, intensity inhomogeneity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Crack detection, particularly from pavement images, presents a formidable challenge in the domain of computer vision due to several inherent complexities such as intensity inhomogeneity, intricate topologies, low contrast, and noisy backgrounds. Automated crack detection is crucial for maintaining the structural integrity of essential infrastructures, including buildings, pavements, and bridges. Existing lightweight methods often face challenges including computational inefficiency, complex crack patterns, and difficult backgrounds, leading to inaccurate detection and impracticality for real-world applications. To address these limitations, we propose EfficientCrackNet, a lightweight hybrid model combining Convolutional Neural Networks (CNNs) and transformers for precise crack segmentation. EfficientCrackNet integrates depthwise separable convolutions (DSC) layers and MobileViT block to capture both global and local features. The model employs an Edge Extraction Method (EEM) and for efficient crack edge detection without pretraining, and Ultra-Lightweight Subspace Attention Module (ULSAM) to enhance feature extraction. Extensive experiments on three benchmark datasets Crack500, DeepCrack, and GAPs384 demonstrate that EfficientCrackNet achieves superior performance compared to existing lightweight models, while requiring only 0.26M parameters, and 0.483 FLOPs (G). The proposed model offers an optimal balance between accuracy and computational efficiency, outperforming state-of-the-art lightweight models, and providing a robust and adaptable solution for real-world crack segmentation.

[AI-4] DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2409.18092
作者: Helin Cao,Sven Behnke
关键词-EN: Perception systems play, computer vision algorithms, incorporating multiple sensors, Perception systems, incorporating multiple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Under review

点击查看摘要

Abstract:Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle’s surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets and our approach outperforms the state-of-the-art for SSC.

[AI-5] GSON: A Group-based Social Navigation Framework with Large Multimodal Model

链接: https://arxiv.org/abs/2409.18084
作者: Shangyi Luo,Ji Zhu,Peng Sun,Yuhong Deng,Cunjun Yu,Anxing Xiao,Xueqian Wang
关键词-EN: human-centered environments grows, Large Multimodal Model, environments grows, number of service, autonomous vehicles
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As the number of service robots and autonomous vehicles in human-centered environments grows, their requirements go beyond simply navigating to a destination. They must also take into account dynamic social contexts and ensure respect and comfort for others in shared spaces, which poses significant challenges for perception and planning. In this paper, we present a group-based social navigation framework GSON to enable mobile robots to perceive and exploit the social group of their surroundings by leveling the visual reasoning capability of the Large Multimodal Model (LMM). For perception, we apply visual prompting techniques to zero-shot extract the social relationship among pedestrians and combine the result with a robust pedestrian detection and tracking pipeline to alleviate the problem of low inference speed of the LMM. Given the perception result, the planning system is designed to avoid disrupting the current social structure. We adopt a social structure-based mid-level planner as a bridge between global path planning and local motion planning to preserve the global context and reactive response. The proposed method is validated on real-world mobile robot navigation tasks involving complex social structure understanding and reasoning. Experimental results demonstrate the effectiveness of the system in these scenarios compared with several baselines.

[AI-6] SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation

链接: https://arxiv.org/abs/2409.18082
作者: Xin Li,Siyuan Huang,Qiaojun Yu,Zhengkai Jiang,Ce Hao,Yimeng Zhu,Hongsheng Li,Peng Gao,Cewu Lu
关键词-EN: Automating garment manipulation, Automating garment, poses a significant, significant challenge, diverse and deformable
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our model enables robots to manage different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. In addition, this research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics for future.

[AI-7] Infer Humans Intentions Before Following Natural Language Instructions

链接: https://arxiv.org/abs/2409.18073
作者: Yanming Wan,Yue Wu,Yiping Wang,Jiayuan Mao,Natasha Jaques
关键词-EN: complete everyday cooperative, everyday cooperative tasks, complete everyday, everyday cooperative, human
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.

[AI-8] FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

链接: https://arxiv.org/abs/2409.18071
作者: Runze He,Kai Ma,Linjiang Huang,Shaofei Huang,Jialin Gao,Xiaoming Wei,Jiao Dai,Jizhong Han,Si Liu
关键词-EN: Introducing user-specified visual, Introducing user-specified, user-specified visual concepts, image editing, editing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages, 14 figures, project website: this https URL

点击查看摘要

Abstract:Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user’s intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual ReferAttention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods. The code will be available at: this https URL.

[AI-9] Visual Data Diagnosis and Debiasing with Concept Graphs

链接: https://arxiv.org/abs/2409.18055
作者: Rwiddhi Chakraborty,Yinong Wang,Jialu Gao,Runkai Zheng,Cheng Zhang,Fernando De la Torre
关键词-EN: deep learning models, learning models today, size and complexity, widespread success, success of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present CONBIAS, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. CONBIAS represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by CONBIAS improves generalization performance across multiple datasets compared to state-of-the-art methods. We will make our code and data publicly available.

[AI-10] DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving

链接: https://arxiv.org/abs/2409.18053
作者: Dingrui Wang,Marc Kaufeld,Johannes Betz
关键词-EN: designed to imitate, imitate human reasoning, autonomous driving framework, driving, autonomous driving
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Autonomous Driving, Large Language Models (LLMs), Human Reasoning, Critical Scenario

点击查看摘要

Abstract:We present a novel autonomous driving framework, DualAD, designed to imitate human reasoning during driving. DualAD comprises two layers: a rule-based motion planner at the bottom layer that handles routine driving tasks requiring minimal reasoning, and an upper layer featuring a rule-based text encoder that converts driving scenarios from absolute states into text description. This text is then processed by a large language model (LLM) to make driving decisions. The upper layer intervenes in the bottom layer’s decisions when potential danger is detected, mimicking human reasoning in critical situations. Closed-loop experiments demonstrate that DualAD, using a zero-shot pre-trained model, significantly outperforms rule-based motion planners that lack reasoning abilities. Our experiments also highlight the effectiveness of the text encoder, which considerably enhances the model’s scenario understanding. Additionally, the integrated DualAD model improves with stronger LLMs, indicating the framework’s potential for further enhancement. We make code and benchmarks publicly available.

[AI-11] Explaining Explaining

链接: https://arxiv.org/abs/2409.18052
作者: Sergei Nirenburg,Marjorie McShane,Kenneth W. Goodman,Sanjay Oruganti
关键词-EN: confidence in high-stakes, Abstract, machine learning, key to people, people having confidence
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Explanation is key to people having confidence in high-stakes AI systems. However, machine-learning-based systems - which account for almost all current AI - can’t explain because they are usually black boxes. The explainable AI (XAI) movement hedges this problem by redefining “explanation”. The human-centered explainable AI (HCXAI) movement identifies the explanation-oriented needs of users but can’t fulfill them because of its commitment to machine learning. In order to achieve the kinds of explanations needed by real people operating in critical domains, we must rethink how to approach AI. We describe a hybrid approach to developing cognitive agents that uses a knowledge-based infrastructure supplemented by data obtained through machine learning when applicable. These agents will serve as assistants to humans who will bear ultimate responsibility for the decisions and actions of the human-robot team. We illustrate the explanatory potential of such agents using the under-the-hood panels of a demonstration system in which a team of simulated robots collaborates on a search task assigned by a human.

[AI-12] Revisit Anything: Visual Place Recognition via Image Segment Retrieval ECCV2024

链接: https://arxiv.org/abs/2409.18049
作者: Kartik Garg,Sai Shubodh Puligilla,Shishir Kolathaya,Madhava Krishna,Sourav Garg
关键词-EN: Accurately recognizing, localize and navigate, crucial for embodied, embodied agents, agents to localize
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Presented at ECCV 2024; Includes supplementary; 29 pages; 8 figures

点击查看摘要

Abstract:Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the “whole” image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: “the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap”. We address this by encoding and searching for “image segments” instead of the whole images. We propose to use open-set image segmentation to decompose an image into `meaningful’ entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to ``revisit anything’’ by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: this https URL.

[AI-13] HARMONIC: Cognitive and Control Collaboration in Human-Robotic Teams ICRA2025

链接: https://arxiv.org/abs/2409.18047
作者: Sanjay Oruganti,Sergei Nirenburg,Marjorie McShane,Jesse English,Michael K. Roberts,Christian Arndt
关键词-EN: planning and collaboration, paper presents, multi-robot planning, natural language communication, natural human-robot communication
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Submitted to ICRA 2025 Conference, Atlanta, GA, USA

点击查看摘要

Abstract:This paper presents a novel approach to multi-robot planning and collaboration. We demonstrate a cognitive strategy for robots in human-robot teams that incorporates metacognition, natural language communication, and explainability. The system is embodied using the HARMONIC architecture that flexibly integrates cognitive and control capabilities across the team. We evaluate our approach through simulation experiments involving a joint search task by a team of heterogeneous robots (a UGV and a drone) and a human. We detail the system’s handling of complex, real-world scenarios, effective action coordination between robots with different capabilities, and natural human-robot communication. This work demonstrates that the robots’ ability to reason about plans, goals, and attitudes, and to provide explanations for actions and decisions are essential prerequisites for realistic human-robot teaming.

[AI-14] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning EMNLP2024

链接: https://arxiv.org/abs/2409.18046
作者: Soeun Lee,Si-Woo Kim,Taewhan Kim,Dong-Jin Kim
关键词-EN: Recent advancements, paired image-text data, explored text-only training, text-only training, overcome the limitations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ( \textbfI mage-like Retrieval and \textbfF requency-based Entity Filtering for Zero-shot \textbfCap tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.

[AI-15] HARMONIC: A Framework for Explanatory Cognitive Robots ICRA

链接: https://arxiv.org/abs/2409.18037
作者: Sanjay Oruganti,Sergei Nirenburg,Marjorie McShane,Jesse English,Michael K. Roberts,Christian Arndt
关键词-EN: trusted teammates capable, transforms general-purpose robots, implementing cognitive robots, natural communication, human-level explanation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注: Accepted for presentation at ICRA@40. 23-26 September 2024, Rotterdam, Netherlands

点击查看摘要

Abstract:We present HARMONIC, a framework for implementing cognitive robots that transforms general-purpose robots into trusted teammates capable of complex decision-making, natural communication and human-level explanation. The framework supports interoperability between a strategic (cognitive) layer for high-level decision-making and a tactical (robot) layer for low-level control and execution. We describe the core features of the framework and our initial implementation, in which HARMONIC was deployed on a simulated UGV and drone involved in a multi-robot search and retrieval task.

[AI-16] Compositional Hardness of Code in Large Language Models – A Probabilistic Perspective

链接: https://arxiv.org/abs/2409.18028
作者: Yotam Wolf,Binyamin Rothberg,Dorin Shteyman,Amnon Shashua
关键词-EN: large language model, complex analytical tasks, model context window, model context, usage for complex
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:A common practice in large language model (LLM) usage for complex analytical tasks such as code generation, is to sample a solution for the entire task within the model’s context window. Previous works have shown that subtask decomposition within the model’s context (chain of thought), is beneficial for solving such tasks. In this work, we point a limitation of LLMs’ ability to perform several sub-tasks within the same context window - an in-context hardness of composition, pointing to an advantage for distributing a decomposed problem in a multi-agent system of LLMs. The hardness of composition is quantified by a generation complexity metric, i.e., the number of LLM generations required to sample at least one correct solution. We find a gap between the generation complexity of solving a compositional problem within the same context relative to distributing it among multiple agents, that increases exponentially with the solution’s length. We prove our results theoretically and demonstrate them empirically.

[AI-17] An Adversarial Perspective on Machine Unlearning for AI Safety

链接: https://arxiv.org/abs/2409.18025
作者: Jakub Łucki,Boyi Wei,Yangsibo Huang,Peter Henderson,Florian Tramèr,Javier Rando
关键词-EN: Large language models, Large language, finetuned to refuse, Large, hazardous knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.

[AI-18] ransferring disentangled representations: bridging the gap between synthetic and real images

链接: https://arxiv.org/abs/2409.18017
作者: Jacopo Dapueto,Nicoletta Noceti,Francesca Odone
关键词-EN: Developing meaningful, data generation mechanism, Disentangled Representation Learning, representation learning, meaningful and efficient
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Developing meaningful and efficient representations that separate the fundamental structure of the data generation mechanism is crucial in representation learning. However, Disentangled Representation Learning has not fully shown its potential on real images, because of correlated generative factors, their resolution and limited access to ground truth labels. Specifically on the latter, we investigate the possibility of leveraging synthetic data to learn general-purpose disentangled representations applicable to real data, discussing the effect of fine-tuning and what properties of disentanglement are preserved after the transfer. We provide an extensive empirical study to address these issues. In addition, we propose a new interpretable intervention-based metric, to measure the quality of factors encoding in the representation. Our results indicate that some level of disentanglement, transferring a representation from synthetic to real data, is possible and effective.

[AI-19] Role-RL: Online Long-Context Processing with Role Reinforcement Learning for Distinct LLMs in Their Optimal Roles

链接: https://arxiv.org/abs/2409.18014
作者: Lewei He,Tianyu Shi,Pengran Huang,Bingzhi Chen,Qianglong Chen,Jiahui Pan
关键词-EN: Online Long-context Processing, Large language models, long-context processing, named Online Long-context, language models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) with long-context processing are still challenging because of their implementation complexity, training efficiency and data sparsity. To address this issue, a new paradigm named Online Long-context Processing (OLP) is proposed when we process a document of unlimited length, which typically occurs in the information reception and organization of diverse streaming media such as automated news reporting, live e-commerce, and viral short videos. Moreover, a dilemma was often encountered when we tried to select the most suitable LLM from a large number of LLMs amidst explosive growth aiming for outstanding performance, affordable prices, and short response delays. In view of this, we also develop Role Reinforcement Learning (Role-RL) to automatically deploy different LLMs in their respective roles within the OLP pipeline according to their actual performance. Extensive experiments are conducted on our OLP-MINI dataset and it is found that OLP with Role-RL framework achieves OLP benchmark with an average recall rate of 93.2% and the LLM cost saved by 79.4%. The code and dataset are publicly available at: this https URL.

[AI-20] Control Industrial Automation System with Large Language Models

链接: https://arxiv.org/abs/2409.18009
作者: Yuchen Xia,Nasser Jazdi,Jize Zhang,Chaitanya Shah,Michael Weyrich
关键词-EN: require specialized expertise, Traditional industrial automation, systems require specialized, Traditional industrial, require specialized
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Traditional industrial automation systems require specialized expertise to operate and complex reprogramming to adapt to new processes. Large language models offer the intelligence to make them more flexible and easier to use. However, LLMs’ application in industrial settings is underexplored. This paper introduces a framework for integrating LLMs to achieve end-to-end control of industrial automation systems. At the core of the framework are an agent system designed for industrial tasks, a structured prompting method, and an event-driven information modeling mechanism that provides real-time data for LLM inference. The framework supplies LLMs with real-time events on different context semantic levels, allowing them to interpret the information, generate production plans, and control operations on the automation system. It also supports structured dataset creation for fine-tuning on this downstream application of LLMs. Our contribution includes a formal system design, proof-of-concept implementation, and a method for generating task-specific datasets for LLM fine-tuning and testing. This approach enables a more adaptive automation system that can respond to spontaneous events, while allowing easier operation and configuration through natural language for more intuitive human-machine interaction. We provide demo videos and detailed data on GitHub: this https URL

[AI-21] Joint Localization and Planning using Diffusion ICRA2025

链接: https://arxiv.org/abs/2409.17995
作者: L. Lao Beyer,S. Karaman
关键词-EN: vehicle path planning, successfully applied, applied to robotics, manipulation and vehicle, path planning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 9 figures. Submitted to ICRA 2025, under review

点击查看摘要

Abstract:Diffusion models have been successfully applied to robotics problems such as manipulation and vehicle path planning. In this work, we explore their application to end-to-end navigation – including both perception and planning – by considering the problem of jointly performing global localization and path planning in known but arbitrary 2D environments. In particular, we introduce a diffusion model which produces collision-free paths in a global reference frame given an egocentric LIDAR scan, an arbitrary map, and a desired goal position. To this end, we implement diffusion in the space of paths in SE(2), and describe how to condition the denoising process on both obstacles and sensor observations. In our evaluation, we show that the proposed conditioning techniques enable generalization to realistic maps of considerably different appearance than the training environment, demonstrate our model’s ability to accurately describe ambiguous solutions, and run extensive simulation experiments showcasing our model’s use as a real-time, end-to-end localization and planning stack.

[AI-22] CRoP: Context-wise Robust Static Human-Sensing Personalization

链接: https://arxiv.org/abs/2409.17994
作者: Sawinder Kaur,Avery Gump,Jingyu Xin,Yi Xiao,Harshit Sharma,Nina R Benway,Jonathan L Preston,Asif Salekin
关键词-EN: diverse human sensing, advancement in deep, deep learning, led to diverse, human sensing applications
类目: Artificial Intelligence (cs.AI)
*备注: 31 pages, 10 figues and 13 tables

点击查看摘要

Abstract:The advancement in deep learning and internet-of-things have led to diverse human sensing applications. However, distinct patterns in human sensing, influenced by various factors or contexts, challenge generic neural network model’s performance due to natural distribution shifts. To address this, personalization tailors models to individual users. Yet most personalization studies overlook intra-user heterogeneity across contexts in sensory data, limiting intra-user generalizability. This limitation is especially critical in clinical applications, where limited data availability hampers both generalizability and personalization. Notably, intra-user sensing attributes are expected to change due to external factors such as treatment progression, further complicating the this http URL work introduces CRoP, a novel static personalization approach using an off-the-shelf pre-trained model and pruning to optimize personalization and generalization. CRoP shows superior personalization effectiveness and intra-user robustness across four human-sensing datasets, including two from real-world health domains, highlighting its practical and social impact. Additionally, to support CRoP’s generalization ability and design choices, we provide empirical justification through gradient inner product analysis, ablation studies, and comparisons against state-of-the-art baselines.

[AI-23] HydraViT: Stacking Heads for a Scalable ViT

链接: https://arxiv.org/abs/2409.17978
作者: Janek Haberer,Ali Hojjat,Olaf Landsiedel
关键词-EN: Vision Transformers, architecture of Vision, imposes substantial hardware, substantial hardware demands, Multi-head Attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, requires multiple models of different sizes. However, this approach has limitations, such as training and storing each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly changing the size of the embedded dimensions throughout each layer and their corresponding number of attention heads in MHA during training, HydraViT induces multiple subnetworks. Thereby, HydraViT achieves adaptability across a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. HydraViT achieves up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy with the same throughput on ImageNet-1K compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time. Source code available at this https URL.

[AI-24] Enhancing elusive clues in knowledge learning by contrasting attention of language models

链接: https://arxiv.org/abs/2409.17954
作者: Jian Gao,Xiao Zhang,Ji Wu,Miao Li
关键词-EN: Causal language models, acquire vast amount, models acquire vast, general text corpus, Causal language
类目: Artificial Intelligence (cs.AI)
*备注: 7 pages and 17 figures

点击查看摘要

Abstract:Causal language models acquire vast amount of knowledge from general text corpus during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from knowledge-dense and small-sized corpora. The deficiency can come from long-distance dependencies which are hard to capture by language models, and overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, the paper proposes a method to enhance knowledge learning during language model pretraining, by enhancing elusive but important clues in text discovered by the language model themselves. We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues as a guide to perform token-dropout data augmentation on the training text, and observed a significant boost in both small and large models’ performance in fact memorization. This shows that the behavior contrast between more and less-performant language models contains important clues for knowledge learning, and it can be ``amplified" for a straight-forward improvement in knowledge learning efficiency.

[AI-25] Weak-To-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation

链接: https://arxiv.org/abs/2409.17946
作者: Shuai Zhao,Leilei Gan,Zhongliang Guo,Xiaobao Wu,Luwei Xiao,Xiaoyu Xu,Cong-Duy Nguyen,Luu Anh Tuan
关键词-EN: widely applied due, Large Language Models, Large Language, backdoor attacks, backdoor
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from weak to strong based on contrastive knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through contrastive knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.

[AI-26] On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms

链接: https://arxiv.org/abs/2409.17943
作者: Richard Yue,John E. Ortega,Kenneth Ward Church
关键词-EN: natural language processing, professional translator, models in natural, Google Translate, BLEU and COMET
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users

点击查看摘要

Abstract:The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do - predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near human parity using common metrics for measurement such as BLEU and COMET, we find that an important step is being missed: the translation of technical terms, specifically acronyms. Some state-of-the art machine translation systems like Google Translate which are publicly available can be erroneous when dealing with acronyms - as much as 50% in our findings. This article addresses acronym disambiguation for MT systems by proposing an additional step to the SL-TL (FR-EN) translation workflow where we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm that achieves nearly 10% increase when compared to Google Translate and OpusMT.

[AI-27] Predicting Anchored Text from Translation Memories for Machine Translation Using Deep Learning Methods

链接: https://arxiv.org/abs/2409.17939
作者: Richard Yue,John E. Ortega
关键词-EN: tools called computer-aided, called computer-aided translation, CAT tool, CAT tools, CAT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: AMTA 2024 - The Association for Machine Translation in the Americas organizes biennial conferences devoted to researchers, commercial users, governmental and NGO users

点击查看摘要

Abstract:Translation memories (TMs) are the backbone for professional translation tools called computer-aided translation (CAT) tools. In order to perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the desired segment to translate (s’). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the TM that are close in distance to s’. After locating two similar segments, the CAT tool will present parallel segments (s, t) that contain one segment in the source language along with its translation in the target language. Additionally, CAT tools contain fuzzy-match repair (FMR) techniques that will automatically use the parallel segments from the TM to create new TM entries containing a modified version of the original with the idea in mind that it will be the translation of s’. Most FMR techniques use machine translation as a way of “repairing” those words that have to be modified. In this article, we show that for a large part of those words which are anchored, we can use other techniques that are based on machine learning approaches such as Word2Vec. BERT, and even ChatGPT. Specifically, we show that for anchored words that follow the continuous bag-of-words (CBOW) paradigm, Word2Vec, BERT, and GPT-4 can be used to achieve similar and, for some cases, better results than neural machine translation for translating anchored words from French to English.

[AI-28] Intelligent Energy Management: Remaining Useful Life Prediction and Charging Automation System Comprised of Deep Learning and the Internet of Things

链接: https://arxiv.org/abs/2409.17931
作者: Biplov Paneru,Bishwash Paneru,DP Sharma Mainali
关键词-EN: battery remaining life, remaining life, battery RUL dataset, battery remaining, Remaining Useful Life
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Remaining Useful Life (RUL) of battery is an important parameter to know the battery’s remaining life and need for recharge. The goal of this research project is to develop machine learning-based models for the battery RUL dataset. Different ML models are developed to classify the RUL of the vehicle, and the IoT (Internet of Things) concept is simulated for automating the charging system and managing any faults aligning. The graphs plotted depict the relationship between various vehicle parameters using the Blynk IoT platform. Results show that the catboost, Multi-Layer Perceptron (MLP), Gated Recurrent Unit (GRU), and hybrid model developed could classify RUL into three classes with 99% more accuracy. The data is fed using the tkinter GUI for simulating artificial intelligence (AI)-based charging, and with a pyserial backend, data can be entered into the Esp-32 microcontroller for making charge discharge possible with the model’s predictions. Also, with an IoT system, the charging can be disconnected, monitored, and analyzed for automation. The results show that an accuracy of 99% can be obtained on models MLP, catboost model and similar accuracy on GRU model can be obtained, and finally relay-based triggering can be made by prediction through the model used for automating the charging and energy-saving mechanism. By showcasing an exemplary Blynk platform-based monitoring and automation phenomenon, we further present innovative ways of monitoring parameters and automating the system.

[AI-29] Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion EMNLP24

链接: https://arxiv.org/abs/2409.17928
作者: Hengrui Gu,Kaixiong Zhou,Yili Wang,Ruobing Wang,Xin Wang
关键词-EN: diffusion models encode, models encode factual, encode factual knowledge, knowledge, encode factual
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP24 Findings

点击查看摘要

Abstract:During pre-training, the Text-to-Image (T2I) diffusion models encode factual knowledge into their parameters. These parameterized facts enable realistic image generation, but they may become obsolete over time, thereby misrepresenting the current state of the world. Knowledge editing techniques aim to update model knowledge in a targeted way. However, facing the dual challenges posed by inadequate editing datasets and unreliable evaluation criterion, the development of T2I knowledge editing encounter difficulties in effectively generalizing injected knowledge. In this work, we design a T2I knowledge editing framework by comprehensively spanning on three phases: First, we curate a dataset \textbfCAKE, comprising paraphrase and multi-object test, to enable more fine-grained assessment on knowledge generalization. Second, we propose a novel criterion, \textbfadaptive CLIP threshold, to effectively filter out false successful images under the current criterion and achieve reliable editing evaluation. Finally, we introduce \textbfMPE, a simple but effective approach for T2I knowledge editing. Instead of tuning parameters, MPE precisely recognizes and edits the outdated part of the conditioning text-prompt to accommodate the up-to-date knowledge. A straightforward implementation of MPE (Based on in-context learning) exhibits better overall performance than previous model editors. We hope these efforts can further promote faithful evaluation of T2I knowledge editing methods.

[AI-30] Navigation in a simplified Urban Flow through Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.17922
作者: Federica Tonti,Jean Rabault,Ricardo Vinuesa
关键词-EN: unmanned aerial vehicles, urban environments requires, environmental impact, increasing number, number of unmanned
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing number of unmanned aerial vehicles (UAVs) in urban environments requires a strategy to minimize their environmental impact, both in terms of energy efficiency and noise reduction. In order to reduce these concerns, novel strategies for developing prediction models and optimization of flight planning, for instance through deep reinforcement learning (DRL), are needed. Our goal is to develop DRL algorithms capable of enabling the autonomous navigation of UAVs in urban environments, taking into account the presence of buildings and other UAVs, optimizing the trajectories in order to reduce both energetic consumption and noise. This is achieved using fluid-flow simulations which represent the environment in which UAVs navigate and training the UAV as an agent interacting with an urban environment. In this work, we consider a domain domain represented by a two-dimensional flow field with obstacles, ideally representing buildings, extracted from a three-dimensional high-fidelity numerical simulation. The presented methodology, using PPO+LSTM cells, was validated by reproducing a simple but fundamental problem in navigation, namely the Zermelo’s problem, which deals with a vessel navigating in a turbulent flow, travelling from a starting point to a target location, optimizing the trajectory. The current method shows a significant improvement with respect to both a simple PPO and a TD3 algorithm, with a success rate (SR) of the PPO+LSTM trained policy of 98.7%, and a crash rate (CR) of 0.1%, outperforming both PPO (SR = 75.6%, CR=18.6%) and TD3 (SR=77.4% and CR=14.5%). This is the first step towards DRL strategies which will guide UAVs in a three-dimensional flow field using real-time signals, making the navigation efficient in terms of flight time and avoiding damages to the vehicle.

[AI-31] Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

链接: https://arxiv.org/abs/2409.17904
作者: Owen Henkel,Hannah Horne-Robinson,Maria Dyshel,Nabil Ch,Baptiste Moreau-Pernet,Ralph Abood
关键词-EN: paper introduces AMMORE, pairs from Rori, African countries, AMMORE dataset enables, large language models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a learning platform used by students in several African countries and conducts two experiments to evaluate the use of large language models (LLM) for grading particularly challenging student answers. The AMMORE dataset enables various potential analyses and provides an important resource for researching student math acquisition in understudied, real-world, educational contexts. In experiment 1 we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach – chain-of-thought prompting – accurately scored 92% of these edge cases, effectively boosting the overall accuracy of the grading from 98.7% to 99.9%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy, by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We find that relatively modest improvements in model accuracy at the individual question level can lead to significant changes in the estimation of student mastery. Where the rules-based classifier currently used to grade student, answers misclassified the mastery status of 6.9% of students across their completed lessons, using the LLM chain-of-thought approach this misclassification rate was reduced to 2.6% of students. Taken together, these findings suggest that LLMs could be a valuable tool for grading open-response questions in K-12 mathematics education, potentially enabling encouraging wider adoption of open-ended questions in formative assessment.

[AI-32] Why Companies “Democratise” Artificial Intelligence: The Case of Open Source Software Donations

链接: https://arxiv.org/abs/2409.17876
作者: Cailean Osborne
关键词-EN: open source software, artificial intelligence, source software, non-profit foundations, OSS
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 30 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Companies claim to “democratise” artificial intelligence (AI) when they donate AI open source software (OSS) to non-profit foundations or release AI models, among others, but what does this term mean and why do they do it? As the impact of AI on society and the economy grows, understanding the commercial incentives behind AI democratisation efforts is crucial for ensuring these efforts serve broader interests beyond commercial agendas. Towards this end, this study employs a mixed-methods approach to investigate commercial incentives for 43 AI OSS donations to the Linux Foundation. It makes contributions to both research and practice. It contributes a taxonomy of both individual and organisational social, economic, and technological incentives for AI democratisation. In particular, it highlights the role of democratising the governance and control rights of an OSS project (i.e., from one company to open governance) as a structural enabler for downstream goals, such as attracting external contributors, reducing development costs, and influencing industry standards, among others. Furthermore, OSS donations are often championed by individual developers within companies, highlighting the importance of the bottom-up incentives for AI democratisation. The taxonomy provides a framework and toolkit for discerning incentives for other AI democratisation efforts, such as the release of AI models. The paper concludes with a discussion of future research directions.

[AI-33] DarkSAM: Fooling Segment Anything Model to Segment Nothing NEURIPS’24

链接: https://arxiv.org/abs/2409.17874
作者: Ziqi Zhou,Yufei Song,Minghui Li,Shengshan Hu,Xianlong Wang,Leo Yu Zhang,Dezhong Yao,Hai Jin
关键词-EN: SAM, data and tasks, recently gained, gained much attention, outstanding generalization
类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by the 38th Annual Conference on Neural Information Processing Systems (NeurIPS’24)

点击查看摘要

Abstract:Segment Anything Model (SAM) has recently gained much attention for its outstanding generalization to unseen data and tasks. Despite its promising prospect, the vulnerabilities of SAM, especially to universal adversarial perturbation (UAP) have not been thoroughly investigated yet. In this paper, we propose DarkSAM, the first prompt-free universal attack framework against SAM, including a semantic decoupling-based spatial attack and a texture distortion-based frequency attack. We first divide the output of SAM into foreground and background. Then, we design a shadow target strategy to obtain the semantic blueprint of the image as the attack target. DarkSAM is dedicated to fooling SAM by extracting and destroying crucial object features from images in both spatial and frequency domains. In the spatial domain, we disrupt the semantics of both the foreground and background in the image to confuse SAM. In the frequency domain, we further enhance the attack effectiveness by distorting the high-frequency components (i.e., texture information) of the image. Consequently, with a single UAP, DarkSAM renders SAM incapable of segmenting objects across diverse images with varying prompts. Experimental results on four datasets for SAM and its two variant models demonstrate the powerful attack capability and transferability of DarkSAM.

[AI-34] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

链接: https://arxiv.org/abs/2409.17870
作者: Shaobo Ma,Chao Fang,Haikuo Shao,Zhongfeng Wang
关键词-EN: Large language models, Large language, GPU Tensor Core, GPU Tensor, language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate our approach’s effectiveness, with up to 13\times speedup in matrix multiplication compared to NVIDIA’s CUTLASS. When integrated into LLMs, we achieve up to 6.7\times inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.

[AI-35] Implementing a Nordic-Baltic Federated Health Data Network: a case report

链接: https://arxiv.org/abs/2409.17865
作者: Taridzo Chomutare,Aleksandar Babic,Laura-Maria Peltonen,Silja Elunurm,Peter Lundberg,Arne Jönsson,Emma Eneling,Ciprian-Virgil Gerstenberger,Troels Siggaard,Raivo Kolde,Oskar Jerdhaf,Martin Hansson,Alexandra Makhlysheva,Miroslav Muzny,Erik Ylipää,Søren Brunak,Hercules Dalianis
关键词-EN: including privacy concerns, national borders pose, borders pose significant, pose significant challenges, including privacy
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 24 pages (including appendices), 1 figure

点击查看摘要

Abstract:Background: Centralized collection and processing of healthcare data across national borders pose significant challenges, including privacy concerns, data heterogeneity and legal barriers. To address some of these challenges, we formed an interdisciplinary consortium to develop a feder-ated health data network, comprised of six institutions across five countries, to facilitate Nordic-Baltic cooperation on secondary use of health data. The objective of this report is to offer early insights into our experiences developing this network. Methods: We used a mixed-method ap-proach, combining both experimental design and implementation science to evaluate the factors affecting the implementation of our network. Results: Technically, our experiments indicate that the network functions without significant performance degradation compared to centralized simu-lation. Conclusion: While use of interdisciplinary approaches holds a potential to solve challeng-es associated with establishing such collaborative networks, our findings turn the spotlight on the uncertain regulatory landscape playing catch up and the significant operational costs.

[AI-36] A Multimodal Single-Branch Embedding Network for Recommendation in Cold-Start and Missing Modality Scenarios RECSYS’24

链接: https://arxiv.org/abs/2409.17864
作者: Christian Ganhör,Marta Moscati,Anna Hausberger,Shah Nawaz,Markus Schedl
关键词-EN: recommender systems adopt, adopt collaborative filtering, systems adopt collaborative, past collective interactions, provide recommendations based
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted at 18th ACM Conference on Recommender Systems (RecSys '24)

点击查看摘要

Abstract:Most recommender systems adopt collaborative filtering (CF) and provide recommendations based on past collective interactions. Therefore, the performance of CF algorithms degrades when few or no interactions are available, a scenario referred to as cold-start. To address this issue, previous work relies on models leveraging both collaborative data and side information on the users or items. Similar to multimodal learning, these models aim at combining collaborative and content representations in a shared embedding space. In this work we propose a novel technique for multimodal recommendation, relying on a multimodal Single-Branch embedding network for Recommendation (SiBraR). Leveraging weight-sharing, SiBraR encodes interaction data as well as multimodal side information using the same single-branch embedding network on different modalities. This makes SiBraR effective in scenarios of missing modality, including cold start. Our extensive experiments on large-scale recommendation datasets from three different recommendation domains (music, movie, and e-commerce) and providing multimodal content information (audio, text, image, labels, and interactions) show that SiBraR significantly outperforms CF as well as state-of-the-art content-based RSs in cold-start scenarios, and is competitive in warm scenarios. We show that SiBraR’s recommendations are accurate in missing modality scenarios, and that the model is able to map different modalities to the same region of the shared embedding space, hence reducing the modality gap.

[AI-37] Machine Learning-based vs Deep Learning-based Anomaly Detection in Multivariate Time Series for Spacecraft Attitude Sensors

链接: https://arxiv.org/abs/2409.17841
作者: R. Gallon,F. Schiemenz,A. Krstova,A. Menicucci,E. Gill
关键词-EN: traditional threshold checking, limitations commonly imposed, Isolation and Recovery, framework of Failure, Failure Detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for the ESA SPAICE Conference 2024

点击查看摘要

Abstract:In the framework of Failure Detection, Isolation and Recovery (FDIR) on spacecraft, new AI-based approaches are emerging in the state of the art to overcome the limitations commonly imposed by traditional threshold checking. The present research aims at characterizing two different approaches to the problem of stuck values detection in multivariate time series coming from spacecraft attitude sensors. The analysis reveals the performance differences in the two approaches, while commenting on their interpretability and generalization to different scenarios. Comments: Accepted for the ESA SPAICE Conference 2024 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2409.17841 [cs.LG] (or arXiv:2409.17841v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.17841 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-38] Detecting and Measuring Confounding Using Causal Mechanism Shifts

链接: https://arxiv.org/abs/2409.17840
作者: Abbavaram Gowtham Reddy,Vineeth N Balasubramanian
关键词-EN: confounding, key challenge, causal sufficiency, Detecting and measuring, causal
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Detecting and measuring confounding effects from data is a key challenge in causal inference. Existing methods frequently assume causal sufficiency, disregarding the presence of unobserved confounding variables. Causal sufficiency is both unrealistic and empirically untestable. Additionally, existing methods make strong parametric assumptions about the underlying causal generative process to guarantee the identifiability of confounding variables. Relaxing the causal sufficiency and parametric assumptions and leveraging recent advancements in causal discovery and confounding analysis with non-i.i.d. data, we propose a comprehensive approach for detecting and measuring confounding. We consider various definitions of confounding and introduce tailored methodologies to achieve three objectives: (i) detecting and measuring confounding among a set of variables, (ii) separating observed and unobserved confounding effects, and (iii) understanding the relative strengths of confounding bias between different sets of variables. We present useful properties of a confounding measure and present measures that satisfy those properties. Empirical results support the theoretical analysis.

[AI-39] Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models NEURIPS2024

链接: https://arxiv.org/abs/2409.17836
作者: Hui-Po Wang,Mario Fritz
关键词-EN: neural network gradients, long been overlooked, neural network, statistical prior models, statistical prior
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To appear in NeurIPS 2024

点击查看摘要

Abstract:Despite the widespread use of statistical prior models in various fields, such models for neural network gradients have long been overlooked. The inherent challenge stems from their high-dimensional structures and complex interdependencies, which complicate effective modeling. In this work, we demonstrate the potential of large language models (LLMs) to act as gradient priors in a zero-shot setting. We examine the property by considering lossless gradient compression – a critical application in distributed learning – that depends heavily on precise probability modeling. To achieve this, we introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Our technique converts plain gradients into text-like formats, enhancing token efficiency by up to 38 times compared to their plain representations. We ensure that this data conversion maintains a close alignment with the structure of plain gradients and the symbols commonly recognized by LLMs. Our experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods, improving compression rates by 10% up to 17.2% across various datasets and architectures. Additionally, our approach shows promising compatibility with lossy compression techniques such as quantization and sparsification. These findings highlight the significant potential of LLMs as a model for effectively handling gradients. We will release the source code upon publication.

[AI-40] Inference-Time Language Model Alignment via Integrated Value Guidance EMNLP2024

链接: https://arxiv.org/abs/2409.17819
作者: Zhixuan Liu,Zhanhui Zhou,Yuanfu Wang,Chao Yang,Yu Qiao
关键词-EN: Large language models, human preferences, intensive and complex, tuning large models, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Findings

点击查看摘要

Abstract:Large language models are typically fine-tuned to align with human preferences, but tuning large models is computationally intensive and complex. In this work, we introduce \textitIntegrated Value Guidance (IVG), a method that uses implicit and explicit value functions to guide language model decoding at token and chunk-level respectively, efficiently aligning large language models purely at inference time. This approach circumvents the complexities of direct fine-tuning and outperforms traditional methods. Empirically, we demonstrate the versatility of IVG across various tasks. In controlled sentiment generation and summarization tasks, our method significantly improves the alignment of large models using inference-time guidance from \textttgpt2 -based value functions. Moreover, in a more challenging instruction-following benchmark AlpacaEval 2.0, we show that both specifically tuned and off-the-shelf value functions greatly improve the length-controlled win rates of large models against \textttgpt-4-turbo (e.g., 19.51% \rightarrow 26.51% for \textttMistral-7B-Instruct-v0.2 and 25.58% \rightarrow 33.75% for \textttMixtral-8x7B-Instruct-v0.1 with Tulu guidance).

[AI-41] DREAMS: A python framework to train deep learning models with model card reporting for medical and health applications

链接: https://arxiv.org/abs/2409.17815
作者: Rabindra Khadka,Pedro G Lind,Anis Yazidi,Asma Belhadi
关键词-EN: EEG data analysis, EEG data, observe brain activity, EEG, EEG data processing
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) data provides a non-invasive method for researchers and clinicians to observe brain activity in real time. The integration of deep learning techniques with EEG data has significantly improved the ability to identify meaningful patterns, leading to valuable insights for both clinical and research purposes. However, most of the frameworks so far, designed for EEG data analysis, are either too focused on pre-processing or in deep learning methods per, making their use for both clinician and developer communities problematic. Moreover, critical issues such as ethical considerations, biases, uncertainties, and the limitations inherent in AI models for EEG data analysis are frequently overlooked, posing challenges to the responsible implementation of these technologies. In this paper, we introduce a comprehensive deep learning framework tailored for EEG data processing, model training and report generation. While constructed in way to be adapted and developed further by AI developers, it enables to report, through model cards, the outcome and specific information of use for both developers and clinicians. In this way, we discuss how this framework can, in the future, provide clinical researchers and developers with the tools needed to create transparent and accountable AI models for EEG data analysis and diagnosis.

[AI-42] Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness EMNLP2024

链接: https://arxiv.org/abs/2409.17791
作者: Jian Li,Haojing Huang,Yujia Zhang,Pengfei Xu,Xi Chen,Rui Song,Lida Shi,Jingwen Wang,Hao Xu
关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Direct Preference Optimization, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at EMNLP 2024 Findings

点击查看摘要

Abstract:Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly use a binary cross-entropy mechanism on pairwise samples, i.e., minimizing and maximizing the loss based on preferred or dis-preferred responses, respectively. However, while this training strategy omits the reward model, it also overlooks the varying preference degrees within different responses. We hypothesize that this is a key factor hindering LLMs from sufficiently understanding human preferences. To address this problem, we propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference. Extensive experiments are conducted on two widely used datasets of different tasks. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods and significantly boost their performance to achieve state-of-the-art performance. We also conduct detailed analyses to offer comprehensive insights into SPO, which verifies its effectiveness. The code is available at this https URL.

[AI-43] Ophthalmic Biomarker Detection with Parallel Prediction of Transformer and Convolutional Architecture

链接: https://arxiv.org/abs/2409.17788
作者: Md. Touhidul Islam,Md. Abtahi Majeed Chowdhury,Mahmudul Hasan,Asif Quadir,Lutfa Aktar
关键词-EN: global health issue, Optical Coherence Tomography, Ophthalmic diseases represent, precise diagnostic tools, health issue
类目: Artificial Intelligence (cs.AI)
*备注: 5 pages

点击查看摘要

Abstract:Ophthalmic diseases represent a significant global health issue, necessitating the use of advanced precise diagnostic tools. Optical Coherence Tomography (OCT) imagery which offers high-resolution cross-sectional images of the retina has become a pivotal imaging modality in ophthalmology. Traditionally physicians have manually detected various diseases and biomarkers from such diagnostic imagery. In recent times, deep learning techniques have been extensively used for medical diagnostic tasks enabling fast and precise diagnosis. This paper presents a novel approach for ophthalmic biomarker detection using an ensemble of Convolutional Neural Network (CNN) and Vision Transformer. While CNNs are good for feature extraction within the local context of the image, transformers are known for their ability to extract features from the global context of the image. Using an ensemble of both techniques allows us to harness the best of both worlds. Our method has been implemented on the OLIVES dataset to detect 6 major biomarkers from the OCT images and shows significant improvement of the macro averaged F1 score on the dataset.

[AI-44] Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

链接: https://arxiv.org/abs/2409.17777
作者: Raja Kumar,Raghav Singhal,Pranamya Kulkarni,Deval Mehta,Kshitij Jadhav
关键词-EN: shown remarkable success, Deep multimodal learning, Deep multimodal, leveraging contrastive learning, Mixup-based contrastive loss
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: RK and RS contributed equally to this work, 20 Pages, 8 Figures, 9 Tables

点击查看摘要

Abstract:Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.

[AI-45] Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations EMNLP2024

链接: https://arxiv.org/abs/2409.17774
作者: Supriya Manna,Niladri Sett
关键词-EN: critical metric, metric to assess, assess the reliability, reliability of explainable, Faithfulness
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted as a Full Paper at EMNLP 2024 Workshop BlackBoxNLP

点击查看摘要

Abstract:Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer’s response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.

[AI-46] Federated Learning under Attack: Improving Gradient Inversion for Batch of Images

链接: https://arxiv.org/abs/2409.17767
作者: Luiz Leite,Yuri Santo,Bruno L. Dalmazo,André Riker
关键词-EN: Federated Learning, machine learning, machine learning models, machine learning approach, train machine learning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 5 pages, 7 figures

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a machine learning approach able to preserve the privacy of user’s data. Applying FL, clients train machine learning models on a local dataset and a central server aggregates the learned parameters coming from the clients, training a global machine learning model without sharing user’s data. However, the state-of-the-art shows several approaches to promote attacks on FL systems. For instance, inverting or leaking gradient attacks can find, with high precision, the local dataset used during the training phase of the FL. This paper presents an approach, called Deep Leakage from Gradients with Feedback Blending (DLG-FB), which is able to improve the inverting gradient attack, considering the spatial correlation that typically exists in batches of images. The performed evaluation shows an improvement of 19.18% and 48,82% in terms of attack success rate and the number of iterations per attacked image, respectively.

[AI-47] Confidence intervals uncovered: Are we ready for real-world medical imaging AI? MICCAI2024

链接: https://arxiv.org/abs/2409.17763
作者: Evangelia Christodoulou,Annika Reinke,Rola Houhou,Piotr Kalinowski,Selen Erkan,Carole H. Sudre,Ninon Burgos,Sofiène Boutaj,Sophie Loizillon,Maëlys Solal,Nicola Rieke,Veronika Cheplygina,Michela Antonelli,Leon D. Mayer,Minu D. Tizabi,M. Jorge Cardoso,Amber Simpson,Paul F. Jäger,Annette Kopp-Schneider,Gaël Varoquaux,Olivier Colliot,Lena Maier-Hein
关键词-EN: Medical imaging, transformation of healthcare, imaging is spearheading, Performance, Medical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted at MICCAI 2024 conference

点击查看摘要

Abstract:Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determine which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03 which is three times larger than the median performance gap between the first and second ranked method. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.

[AI-48] Integrating Hierarchical Semantic into Iterative Generation Model for Entailment Tree Explanation

链接: https://arxiv.org/abs/2409.17757
作者: Qin Wang,Jianzhou Feng,Yiming Xu
关键词-EN: explainable question answering, Manifestly and logically, question answering, logically displaying, reasoning from evidence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Manifestly and logically displaying the line of reasoning from evidence to answer is significant to explainable question answering (QA). The entailment tree exhibits the lines structurally, which is different from the self-explanation principle in large-scale language models. Existing methods rarely consider the semantic association of sentences between and within hierarchies within the tree structure, which is prone to apparent mistakes in combinations. In this work, we propose an architecture of integrating the Hierarchical Semantics of sentences under the framework of Controller-Generator (HiSCG) to explain answers. The HiSCG designs a hierarchical mapping between hypotheses and facts, discriminates the facts involved in tree constructions, and optimizes single-step entailments. To the best of our knowledge, We are the first to notice hierarchical semantics of sentences between the same layer and adjacent layers to yield improvements. The proposed method achieves comparable performance on all three settings of the EntailmentBank dataset. The generalization results on two out-of-domain datasets also demonstrate the effectiveness of our method.

[AI-49] SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning

链接: https://arxiv.org/abs/2409.17755
作者: Rimvydas Rubavicius,Peter David Fagan,Alex Lascarides,Subramanian Ramamoorthy
关键词-EN: interactive task learning, challenging interactive task, task learning scenario, paper addresses, addresses a challenging
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 10 pages,4 figures, 2 tables

点击查看摘要

Abstract:This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the robot is unaware of a concept that’s key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems by fixing a deficient domain model using embodied conversation. Through dialogue, the robot discovers and then learns to exploit unforeseen possibilities. Using SECURE, the robot not only learns from the user’s corrective feedback when it makes a mistake, but it also learns to make strategic dialogue decisions for revealing useful evidence about novel concepts for solving the instructed task. Together, these abilities allow the robot to generalise to subsequent tasks using newly acquired knowledge. We demonstrate that a robot that is semantics-aware – that is, it exploits the logical consequences of both sentence and discourse semantics in the learning and inference process – learns to solve rearrangement under unawareness more effectively than a robot that lacks such capabilities.

[AI-50] Byzantine-Robust Aggregation for Securing Decentralized Federated Learning

链接: https://arxiv.org/abs/2409.17754
作者: Diego Cajaraville-Aboy,Ana Fernández-Vilas,Rebeca P. Díaz-Redondo,Manuel Fernández-Veiga
关键词-EN: Decentralized Federated Learning, distributed machine learning, machine learning approach, addresses privacy concerns, Federated Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 7 figures, 1 table

点击查看摘要

Abstract:Federated Learning (FL) emerges as a distributed machine learning approach that addresses privacy concerns by training AI models locally on devices. Decentralized Federated Learning (DFL) extends the FL paradigm by eliminating the central server, thereby enhancing scalability and robustness through the avoidance of a single point of failure. However, DFL faces significant challenges in optimizing security, as most Byzantine-robust algorithms proposed in the literature are designed for centralized scenarios. In this paper, we present a novel Byzantine-robust aggregation algorithm to enhance the security of Decentralized Federated Learning environments, coined WFAgg. This proposal handles the adverse conditions and strength robustness of dynamic decentralized topologies at the same time by employing multiple filters to identify and mitigate Byzantine attacks. Experimental results demonstrate the effectiveness of the proposed algorithm in maintaining model accuracy and convergence in the presence of various Byzantine attack scenarios, outperforming state-of-the-art centralized Byzantine-robust aggregation schemes (such as Multi-Krum or Clustering). These algorithms are evaluated on an IID image classification problem in both centralized and decentralized scenarios.

[AI-51] AlterMOMA: Fusion Redundancy Pruning for Camera-LiDAR Fusion Models with Alternative Modality Masking NEURIPS2024

链接: https://arxiv.org/abs/2409.17728
作者: Shiqi Sun,Yantao Lu,Ning Liu,Bo Jiang,JinChao Chen,Ying Zhang
关键词-EN: Camera-LiDAR fusion models, significantly enhance perception, Camera-LiDAR fusion, fusion models, models significantly enhance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 17 pages, 3 figures, Accepted by NeurIPS 2024

点击查看摘要

Abstract:Camera-LiDAR fusion models significantly enhance perception performance in autonomous driving. The fusion mechanism leverages the strengths of each modality while minimizing their weaknesses. Moreover, in practice, camera-LiDAR fusion models utilize pre-trained backbones for efficient training. However, we argue that directly loading single-modal pre-trained camera and LiDAR backbones into camera-LiDAR fusion models introduces similar feature redundancy across modalities due to the nature of the fusion mechanism. Unfortunately, existing pruning methods are developed explicitly for single-modal models, and thus, they struggle to effectively identify these specific redundant parameters in camera-LiDAR fusion models. In this paper, to address the issue above on camera-LiDAR fusion models, we propose a novelty pruning framework Alternative Modality Masking Pruning (AlterMOMA), which employs alternative masking on each modality and identifies the redundant parameters. Specifically, when one modality parameters are masked (deactivated), the absence of features from the masked backbone compels the model to reactivate previous redundant features of the other modality backbone. Therefore, these redundant features and relevant redundant parameters can be identified via the reactivation process. The redundant parameters can be pruned by our proposed importance score evaluation function, Alternative Evaluation (AlterEva), which is based on the observation of the loss changes when certain modality parameters are activated and deactivated. Extensive experiments on the nuScene and KITTI datasets encompassing diverse tasks, baseline models, and pruning algorithms showcase that AlterMOMA outperforms existing pruning methods, attaining state-of-the-art performance.

[AI-52] Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience

链接: https://arxiv.org/abs/2409.17702
作者: Leonard Bärmann,Chad DeChant,Joana Plewnia,Fabian Peller-Konrad,Daniel Bauer,Tamim Asfour,Alex Waibel
关键词-EN: improving human-robot interaction, human-robot interaction, question answering, crucial ability, ability for improving
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Code, data and demo videos at this https URL

点击查看摘要

Abstract:Verbalization of robot experience, i.e., summarization of and question answering about a robot’s past, is a crucial ability for improving human-robot interaction. Previous works applied rule-based systems or fine-tuned deep models to verbalize short (several-minute-long) streams of episodic data, limiting generalization and transferability. In our work, we apply large pretrained models to tackle this task with zero or few examples, and specifically focus on verbalizing life-long experiences. For this, we derive a tree-like data structure from episodic memory (EM), with lower levels representing raw perception and proprioception data, and higher levels abstracting events to natural language concepts. Given such a hierarchical representation built from the experience stream, we apply a large language model as an agent to interactively search the EM given a user’s query, dynamically expanding (initially collapsed) tree nodes to find the relevant information. The approach keeps computational costs low even when scaling to months of robot experience data. We evaluate our method on simulated household robot data, human egocentric videos, and real-world robot recordings, demonstrating its flexibility and scalability.

[AI-53] MoJE: Mixture of Jailbreak Experts Naive Tabular Classifiers as Guard for Prompt Attacks

链接: https://arxiv.org/abs/2409.17699
作者: Giandomenico Cornacchia,Giulio Zizzo,Kieran Fraser,Muhammad Zaid Hamed,Ambrish Rawat,Mark Purcell
关键词-EN: Large Language Models, Large Language, diverse applications underscores, proliferation of Large, thwart potential jailbreak
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of Large Language Models (LLMs) in diverse applications underscores the pressing need for robust security measures to thwart potential jailbreak attacks. These attacks exploit vulnerabilities within LLMs, endanger data integrity and user privacy. Guardrails serve as crucial protective mechanisms against such threats, but existing models often fall short in terms of both detection accuracy, and computational efficiency. This paper advocates for the significance of jailbreak attack prevention on LLMs, and emphasises the role of input guardrails in safeguarding these models. We introduce MoJE (Mixture of Jailbreak Expert), a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails. By employing simple linguistic statistical techniques, MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference. Through rigorous experimentation, MoJE demonstrates superior performance capable of detecting 90% of the attacks without compromising benign prompts, enhancing LLMs security against jailbreak attacks.

[AI-54] he application of GPT-4 in grading design university students assignment and providing feedback: An exploratory study

链接: https://arxiv.org/abs/2409.17698
作者: Qian Huang,Thijs Willems,King Wang Poon
关键词-EN: Custom GPT, GPT, Custom, design, design university students
类目: Artificial Intelligence (cs.AI)
*备注: 25 pages, 5 figures

点击查看摘要

Abstract:This study aims to investigate whether GPT-4 can effectively grade assignments for design university students and provide useful feedback. In design education, assignments do not have a single correct answer and often involve solving an open-ended design problem. This subjective nature of design projects often leads to grading problems,as grades can vary between different raters,for instance instructor from engineering background or architecture background. This study employs an iterative research approach in developing a Custom GPT with the aim of achieving more reliable results and testing whether it can provide design students with constructive feedback. The findings include: First,through several rounds of iterations the inter-reliability between GPT and human raters reached a level that is generally accepted by educators. This indicates that by providing accurate prompts to GPT,and continuously iterating to build a Custom GPT, it can be used to effectively grade students’ design assignments, serving as a reliable complement to human raters. Second, the intra-reliability of GPT’s scoring at different times is between 0.65 and 0.78. This indicates that, with adequate instructions, a Custom GPT gives consistent results which is a precondition for grading students. As consistency and comparability are the two main rules to ensure the reliability of educational assessment, this study has looked at whether a Custom GPT can be developed that adheres to these two rules. We finish the paper by testing whether Custom GPT can provide students with useful feedback and reflecting on how educators can develop and iterate a Custom GPT to serve as a complementary rater.

[AI-55] MIO: A Foundation Model on Multimodal Tokens

链接: https://arxiv.org/abs/2409.17692
作者: Zekun Wang,King Zhu,Chunpu Xu,Wangchunshu Zhou,Jiaheng Liu,Yibo Zhang,Jiashuo Wang,Ning Shi,Siyu Li,Yizhi Li,Haoran Que,Zhaoxiang Zhang,Yuanxing Zhang,Ge Zhang,Ke Xu,Jie Fu,Wenhao Huang
关键词-EN: foundation model built, large language models, autoregressive manner, understanding and generating, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Technical Report. Codes and models will be available soon

点击查看摘要

Abstract:In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.

[AI-56] Efficient Bias Mitigation Without Privileged Information ECCV2024

链接: https://arxiv.org/abs/2409.17691
作者: Mateo Espinosa Zarlenga,Swami Sankaranarayanan,Jerone T. A. Andrews,Zohreh Shams,Mateja Jamnik,Alice Xiang
关键词-EN: Deep neural networks, empirical risk minimisation, Deep neural, grassy background, neural networks trained
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at the 18th European Conference on Computer Vision (ECCV 2024) as an Oral presentation

点击查看摘要

Abstract:Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups, particularly when group and task labels are spuriously correlated (e.g., “grassy background” and “cows”). Existing bias mitigation methods that aim to address this issue often either rely on group labels for training or validation, or require an extensive hyperparameter search. Such data and computational requirements hinder the practical deployment of these methods, especially when datasets are too large to be group-annotated, computational resources are limited, and models are trained through already complex pipelines. In this paper, we propose Targeted Augmentations for Bias Mitigation (TAB), a simple hyperparameter-free framework that leverages the entire training history of a helper model to identify spurious samples, and generate a group-balanced training set from which a robust model can be trained. We show that TAB improves worst-group performance without any group information or model selection, outperforming existing methods while maintaining overall accuracy.

[AI-57] Graph Edit Distance with General Costs Using Neural Set Divergence NEURIPS2024

链接: https://arxiv.org/abs/2409.17687
作者: Eeshaan Jain,Indradyumna Roy,Saswat Meher,Soumen Chakrabarti,Abir De
关键词-EN: Graph Edit Distance, minimum-cost edit sequence, Edit Distance, GED, sequence that transforms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at NeurIPS 2024

点击查看摘要

Abstract:Graph Edit Distance (GED) measures the (dis-)similarity between two given graphs, in terms of the minimum-cost edit sequence that transforms one graph to the other. However, the exact computation of GED is NP-Hard, which has recently motivated the design of neural methods for GED estimation. However, they do not explicitly account for edit operations with different costs. In response, we propose GRAPHEDX, a neural GED estimator that can work with general costs specified for the four edit operations, viz., edge deletion, edge addition, node deletion and node addition. We first present GED as a quadratic assignment problem (QAP) that incorporates these four costs. Then, we represent each graph as a set of node and edge embeddings and use them to design a family of neural set divergence surrogates. We replace the QAP terms corresponding to each operation with their surrogates. Computing such neural set divergence require aligning nodes and edges of the two graphs. We learn these alignments using a Gumbel-Sinkhorn permutation generator, additionally ensuring that the node and edge alignments are consistent with each other. Moreover, these alignments are cognizant of both the presence and absence of edges between node-pairs. Experiments on several datasets, under a variety of edit cost settings, show that GRAPHEDX consistently outperforms state-of-the-art methods and heuristics in terms of prediction error.

[AI-58] Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets

链接: https://arxiv.org/abs/2409.17685
作者: Yasaman Haghbin,Hadi Moradi,Reshad Hosseini
关键词-EN: data generation techniques, Clustered Latent Space, Artificial Data Point, machine learning models, growing trends
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures

点击查看摘要

Abstract:One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models is dependent on the quantity of the training dataset. However, in many medical applications, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL), designed to enhance classification performance on small medical datasets through synthetic data generation. The AGCL framework involves feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and the generation of synthetic data points from clusters with distinct class representations. This method was applied to Parkinson’s disease screening, utilizing facial expression data, and evaluated across multiple machine learning classifiers. Experimental results demonstrate that AGCL significantly improves classification accuracy compared to baseline, GN and kNNMTD. AGCL achieved the highest overall test accuracy of 83.33% and cross-validation accuracy of 90.90% in majority voting over different emotions, confirming its effectiveness in augmenting small datasets.

[AI-59] Preserving logical and functional dependencies in synthetic tabular data

链接: https://arxiv.org/abs/2409.17684
作者: Chaithra Umesh,Kristian Schultz,Manjunath Mahendra,Saparshi Bej,Olaf Wolkenhauer
关键词-EN: data generation, tabular data generation, tabular data, data, data generation algorithms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to Pattern Recognition Journal

点击查看摘要

Abstract:Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data is yet to be explored. In addition to the existing notion of functional dependencies, we introduce the notion of logical dependencies among the attributes in this article. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Utilizing this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets. In addition, we also showed that some tabular synthetic data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state-of-the-art reveal research needs and opportunities to develop task-specific synthetic tabular data generation models.

[AI-60] Zero- and Few-shot Named Entity Recognition and Text Expansion in Medication Prescriptions using ChatGPT

链接: https://arxiv.org/abs/2409.17683
作者: Natthanaphop Isaradech,Andrea Riedel,Wachiranun Sirikul,Markus Kreuzthaler,Stefan Schulz
关键词-EN: local brand, formats and abbreviations, include a mix, wide range, range of idiosyncratic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Introduction: Medication prescriptions are often in free text and include a mix of two languages, local brand names, and a wide range of idiosyncratic formats and abbreviations. Large language models (LLMs) have shown promising ability to generate text in response to input prompts. We use ChatGPT 3.5 to automatically structure and expand medication statements in discharge summaries and thus make them easier to interpret for people and machines. Methods: Named-entity Recognition (NER) and Text Expansion (EX) are used in a zero- and few-shot setting with different prompt strategies. 100 medication statements were manually annotated and curated. NER performance was measured by using strict and partial matching. For the task EX, two experts interpreted the results by assessing semantic equivalence between original and expanded statements. The model performance was measured by precision, recall, and F1 score. Results: For NER, the best-performing prompt reached an average F1 score of 0.94 in the test set. For EX, the few-shot prompt showed superior performance among other prompts, with an average F1 score of 0.87. Conclusion: Our study demonstrates good performance for NER and EX tasks in free-text medication statements using ChatGPT. Compared to a zero-shot baseline, a few-shot approach prevented the system from hallucinating, which would be unacceptable when processing safety-relevant medication data.

[AI-61] Explanation Bottleneck Models

链接: https://arxiv.org/abs/2409.17663
作者: Shin’ya Yamaguchi,Kosuke Nishida
关键词-EN: Recent concept-based interpretable, Recent concept-based, providing meaningful explanations, pre-defined concept sets, concept-based interpretable models
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Recent concept-based interpretable models have succeeded in providing meaningful explanations by pre-defined concept sets. However, the dependency on the pre-defined concepts restricts the application because of the limited number of concepts for explanations. This paper proposes a novel interpretable deep neural network called explanation bottleneck models (XBMs). XBMs generate a text explanation from the input without pre-defined concepts and then predict a final task prediction based on the generated explanation by leveraging pre-trained vision-language encoder-decoder models. To achieve both the target task performance and the explanation quality, we train XBMs through the target task loss with the regularization penalizing the explanation decoder via the distillation from the frozen pre-trained decoder. Our experiments, including a comparison to state-of-the-art concept bottleneck models, confirm that XBMs provide accurate and fluent natural language explanations without pre-defined concept sets. Code will be available at this https URL.

[AI-62] A Fuzzy-based Approach to Predict Human Interaction by Functional Near-Infrared Spectroscopy

链接: https://arxiv.org/abs/2409.17661
作者: Xiaowei Jiang,Liang Ou,Yanan Chen,Na Ao,Yu-Cheng Chang,Thomas Do,Chin-Teng Lin
关键词-EN: Fuzzy Attention Layer, Fuzzy Attention, Attention Layer, Attention Layer mechanism, Fuzzy-based Attention
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:The paper introduces a Fuzzy-based Attention (Fuzzy Attention Layer) mechanism, a novel computational approach to enhance the interpretability and efficacy of neural models in psychological research. The proposed Fuzzy Attention Layer mechanism is integrated as a neural network layer within the Transformer Encoder model to facilitate the analysis of complex psychological phenomena through neural signals, such as those captured by functional Near-Infrared Spectroscopy (fNIRS). By leveraging fuzzy logic, the Fuzzy Attention Layer is capable of learning and identifying interpretable patterns of neural activity. This capability addresses a significant challenge when using Transformer: the lack of transparency in determining which specific brain activities most contribute to particular predictions. Our experimental results demonstrated on fNIRS data from subjects engaged in social interactions involving handholding reveal that the Fuzzy Attention Layer not only learns interpretable patterns of neural activity but also enhances model performance. Additionally, the learned patterns provide deeper insights into the neural correlates of interpersonal touch and emotional exchange. The application of our model shows promising potential in deciphering the subtle complexities of human social behaviors, thereby contributing significantly to the fields of social neuroscience and psychological AI.

[AI-63] Hierarchical End-to-End Autonomous Driving: Integrating BEV Perception with Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.17659
作者: Siyi Lu,Lei He,Shengbo Eben Li,Yugong Luo,Jianqiang Wang,Keqiang Li
关键词-EN: traditional modular pipeline, Deep Reinforcement Learning, modular pipeline, offers a streamlined, streamlined alternative
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:End-to-end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird’s-Eye-View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20%.

[AI-64] Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection ICASSP2025

链接: https://arxiv.org/abs/2409.17656
作者: Pengfei Cai,Yan Song,Nan Jiang,Qing Gu,Ian McLoughlin
关键词-EN: sound event detection, labeled data due, high annotation costs, Masked Audio Model, labeled data
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP2025; The code for this paper will be available at this https URL after the paper is accepted

点击查看摘要

Abstract:A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model~(PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique.

[AI-65] AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment

链接: https://arxiv.org/abs/2409.17655
作者: Nan Sun,Bo Mao,Yongchang Li,Lumeng Ma,Di Guo,Huaping Liu
关键词-EN: motivated significant research, autonomous robotic systems, Large Language Models, increasing demand, demand for intelligent
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 6 pages, 8 figures, 4 tables

点击查看摘要

Abstract:The increasing demand for intelligent assistants in human-populated environments has motivated significant research in autonomous robotic systems. Traditional service robots and virtual assistants, however, struggle with real-world task execution due to their limited capacity for dynamic reasoning and interaction, particularly when human collaboration is required. Recent developments in Large Language Models have opened new avenues for improving these systems, enabling more sophisticated reasoning and natural interaction capabilities. In this paper, we introduce AssistantX, an LLM-powered proactive assistant designed to operate autonomously in a physical office environment. Unlike conventional service robots, AssistantX leverages a novel multi-agent architecture, PPDR4X, which provides advanced inference capabilities and comprehensive collaboration awareness. By effectively bridging the gap between virtual operations and physical interactions, AssistantX demonstrates robust performance in managing complex real-world scenarios. Our evaluation highlights the architecture’s effectiveness, showing that AssistantX can respond to clear instructions, actively retrieve supplementary information from memory, and proactively seek collaboration from team members to ensure successful task completion. More details and videos can be found at this https URL.

[AI-66] FactorSim: Generative Simulation via Factorized Representation NEURIPS2024

链接: https://arxiv.org/abs/2409.17652
作者: Fan-Yun Sun,S. I. Harini,Angela Yi,Yihan Zhou,Alex Zook,Jonathan Tremblay,Logan Cross,Jiajun Wu,Nick Haber
关键词-EN: remains an open-ended, natural language input, train intelligent agents, open-ended challenge, task documentation
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: neurips 2024, project website: this https URL

点击查看摘要

Abstract:Generating simulations to train intelligent agents in game-playing and robotics from natural language input, from user input or task documentation, remains an open-ended challenge. Existing approaches focus on parts of this challenge, such as generating reward functions or task hyperparameters. Unlike previous work, we introduce FACTORSIM that generates full simulations in code from language input that can be used to train agents. Exploiting the structural modularity specific to coded simulations, we propose to use a factored partially observable Markov decision process representation that allows us to reduce context dependence during each step of the generation. For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code’s accuracy and effectiveness in facilitating zero-shot transfers in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations regarding prompt alignment (e.g., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks.

[AI-67] Digital Twin Ecosystem for Oncology Clinical Operations

链接: https://arxiv.org/abs/2409.17650
作者: Himanshu Pandey,Akhil Amod,Shivang,Kshitij Jaggi,Ruchi Garg,Abheet Jain,Vinayak Tantia
关键词-EN: Large Language Models, Artificial Intelligence, Large Language, hold significant promise, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Pre Print

点击查看摘要

Abstract:Artificial Intelligence (AI) and Large Language Models (LLMs) hold significant promise in revolutionizing healthcare, especially in clinical applications. Simultaneously, Digital Twin technology, which models and simulates complex systems, has gained traction in enhancing patient care. However, despite the advances in experimental clinical settings, the potential of AI and digital twins to streamline clinical operations remains largely untapped. This paper introduces a novel digital twin framework specifically designed to enhance oncology clinical operations. We propose the integration of multiple specialized digital twins, such as the Medical Necessity Twin, Care Navigator Twin, and Clinical History Twin, to enhance workflow efficiency and personalize care for each patient based on their unique data. Furthermore, by synthesizing multiple data sources and aligning them with the National Comprehensive Cancer Network (NCCN) guidelines, we create a dynamic Cancer Care Path, a continuously evolving knowledge base that enables these digital twins to provide precise, tailored clinical recommendations.

[AI-68] AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure

链接: https://arxiv.org/abs/2409.17642
作者: Xi Chen,Zhiyang Zhang,Fangkai Yang,Xiaoting Qin,Chao Du,Xi Cheng,Hangxin Liu,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Qi Zhang
关键词-EN: Large language model, Large language, language model, conversational interfaces, increasingly utilized
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Large language model (LLM)-based AI delegates are increasingly utilized to act on behalf of users, assisting them with a wide range of tasks through conversational interfaces. Despite their advantages, concerns arise regarding the potential risk of privacy leaks, particularly in scenarios involving social interactions. While existing research has focused on protecting privacy by limiting the access of AI delegates to sensitive user information, many social scenarios require disclosing private details to achieve desired outcomes, necessitating a balance between privacy protection and disclosure. To address this challenge, we conduct a pilot study to investigate user preferences for AI delegates across various social relations and task scenarios, and then propose a novel AI delegate system that enables privacy-conscious self-disclosure. Our user study demonstrates that the proposed AI delegate strategically protects privacy, pioneering its use in diverse and dynamic social interactions.

[AI-69] 3: A Novel Zero-shot Transfer Learning Framework Iteratively Training on an Assistant Task for a Target Task

链接: https://arxiv.org/abs/2409.17640
作者: Xindi Tong,Yujin Zhu,Shijian Fan,Liang Xu
关键词-EN: Large Language Models, processing large volumes, efficiently processing large, contextual details dealing, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Long text summarization, gradually being essential for efficiently processing large volumes of information, stays challenging for Large Language Models (LLMs) such as GPT and LLaMA families because of the insufficient open-sourced training datasets and the high requirement of contextual details dealing. To address the issue, we design a novel zero-shot transfer learning framework, abbreviated as T3, to iteratively training a baseline LLM on an assistant task for the target task, where the former should own richer data resources and share structural or semantic similarity with the latter. In practice, T3 is approached to deal with the long text summarization task by utilizing question answering as the assistant task, and further validated its effectiveness on the BBC summary, NarraSum, FairytaleQA, and NLQuAD datasets, with up to nearly 14% improvement in ROUGE, 35% improvement in BLEU, and 16% improvement in Factscore compared to three baseline LLMs, demonstrating its potential for more assistant-target task combinations.

[AI-70] P4Q: Learning to Prompt for Quantization in Visual-language Models

链接: https://arxiv.org/abs/2409.17634
作者: Huixin Sun,Runqi Wang,Yanjing Li,Xianbin Cao,Xiaolong Jiang,Yao Hu,Baochang Zhang
关键词-EN: downstream application platforms, application platforms remains, platforms remains challenging, remains challenging due, Large-scale pre-trained Vision-Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have gained prominence in various visual and multimodal tasks, yet the deployment of VLMs on downstream application platforms remains challenging due to their prohibitive requirements of training samples and computing resources. Fine-tuning and quantization of VLMs can substantially reduce the sample and computation costs, which are in urgent need. There are two prevailing paradigms in quantization, Quantization-Aware Training (QAT) can effectively quantize large-scale VLMs but incur a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization named ``Prompt for Quantization’’ (P4Q), in which we design a lightweight architecture to leverage contrastive loss supervision to enhance the recognition performance of a PTQ model. Our method can effectively reduce the gap between image features and text features caused by low-bit quantization, based on learnable prompts to reorganize textual representations and a low-bit adapter to realign the distributions of image and text features. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model using a full-precision teacher. Extensive experimental results demonstrate that our P4Q method outperforms prior arts, even achieving comparable results to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4 \times while achieving 66.94% Top-1 accuracy, outperforming the learnable prompt fine-tuned full-precision model by 2.24% with negligible additional parameters on the ImageNet dataset.

[AI-71] Hand-object reconstruction via interaction-aware graph attention mechanism ICIP2024

链接: https://arxiv.org/abs/2409.17629
作者: Taeyun Woo,Tae-Kyun Kim,Jinah Park
关键词-EN: advanced vision computing, Estimating the poses, vision computing, important area, area of research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, Accepted by ICIP 2024

点击查看摘要

Abstract:Estimating the poses of both a hand and an object has become an important area of research due to the growing need for advanced vision computing. The primary challenge involves understanding and reconstructing how hands and objects interact, such as contact and physical plausibility. Existing approaches often adopt a graph neural network to incorporate spatial information of hand and object meshes. However, these approaches have not fully exploited the potential of graphs without modification of edges within and between hand- and object-graphs. We propose a graph-based refinement method that incorporates an interaction-aware graph-attention mechanism to account for hand-object interactions. Using edges, we establish connections among closely correlated nodes, both within individual graphs and across different graphs. Experiments demonstrate the effectiveness of our proposed method with notable improvements in the realm of physical plausibility.

[AI-72] Neural P3M: A Long-Range Interaction Modeling Enhancer for Geometric GNNs NEURIPS2024

链接: https://arxiv.org/abs/2409.17622
作者: Yusong Wang,Chaoran Cheng,Shaoning Li,Yuxuan Ren,Bin Shao,Ge Liu,Pheng-Ann Heng,Nanning Zheng
关键词-EN: modeling molecular geometry, graph neural networks, Geometric graph neural, emerged as powerful, powerful tools
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published as a conference paper at NeurIPS 2024

点击查看摘要

Abstract:Geometric graph neural networks (GNNs) have emerged as powerful tools for modeling molecular geometry. However, they encounter limitations in effectively capturing long-range interactions in large molecular systems. To address this challenge, we introduce Neural P ^3 M, a versatile enhancer of geometric GNNs to expand the scope of their capabilities by incorporating mesh points alongside atoms and reimaging traditional mathematical operations in a trainable manner. Neural P ^3 M exhibits flexibility across a wide range of molecular systems and demonstrates remarkable accuracy in predicting energies and forces, outperforming on benchmarks such as the MD22 dataset. It also achieves an average improvement of 22% on the OE62 dataset while integrating with various architectures.

[AI-73] Dirichlet-Based Coarse-to-Fine Example Selection For Open-Set Annotation

链接: https://arxiv.org/abs/2409.17607
作者: Ye-Wen Wang,Chen-Chen Zong,Ming-Kun Xie,Sheng-Jun Huang
关键词-EN: achieved great success, Active learning, achieved great, great success, Active
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Active learning (AL) has achieved great success by selecting the most valuable examples from unlabeled data. However, they usually deteriorate in real scenarios where open-set noise gets involved, which is studied as open-set annotation (OSA). In this paper, we owe the deterioration to the unreliable predictions arising from softmax-based translation invariance and propose a Dirichlet-based Coarse-to-Fine Example Selection (DCFS) strategy accordingly. Our method introduces simplex-based evidential deep learning (EDL) to break translation invariance and distinguish known and unknown classes by considering evidence-based data and distribution uncertainty simultaneously. Furthermore, hard known-class examples are identified by model discrepancy generated from two classifier heads, where we amplify and alleviate the model discrepancy respectively for unknown and known classes. Finally, we combine the discrepancy with uncertainties to form a two-stage strategy, selecting the most informative examples from known classes. Extensive experiments on various openness ratio datasets demonstrate that DCFS achieves state-of-art performance.

[AI-74] Open Digital Rights Enforcement Framework (ODRE): from descriptive to enforceable policies

链接: https://arxiv.org/abs/2409.17602
作者: Andrea Cimmino,Juan Cano-Benito,Raúl García-Castro
关键词-EN: Data Spaces, Open Digital, ODRL, data usage policies, decentralised ecosystems
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 20 pages, 3 Figures, Submitted to Computers Security journal

点击查看摘要

Abstract:From centralised platforms to decentralised ecosystems, like Data Spaces, sharing data has become a paramount challenge. For this reason, the definition of data usage policies has become crucial in these domains, highlighting the necessity of effective policy enforcement mechanisms. The Open Digital Rights Language (ODRL) is a W3C standard ontology designed to describe data usage policies, however, it lacks built-in enforcement capabilities, limiting its practical application. This paper introduces the Open Digital Rights Enforcement (ODRE) framework, whose goal is to provide ODRL with enforcement capabilities. The ODRE framework proposes a novel approach to express ODRL policies that integrates the descriptive ontology terms of ODRL with other languages that allow behaviour specification, such as dynamic data handling or function evaluation. The framework includes an enforcement algorithm for ODRL policies and two open-source implementations in Python and Java. The ODRE framework is also designed to support future extensions of ODRL to specific domain scenarios. In addition, current limitations of ODRE, ODRL, and current challenges are reported. Finally, to demonstrate the enforcement capabilities of the implementations, their performance, and their extensibility features, several experiments have been carried out with positive results.

[AI-75] A-Cleaner: A Fine-grained Text Alignment Backdoor Defense Strategy for Multimodal Contrastive Learning

链接: https://arxiv.org/abs/2409.17601
作者: Yuan Xun,Siyuan Liang,Xiaojun Jia,Xinwei Liu,Xiaochun Cao
关键词-EN: Pre-trained large models, multimodal contrastive learning, Pre-trained large, multimodal contrastive, widely recognized
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pre-trained large models for multimodal contrastive learning, such as CLIP, have been widely recognized in the industry as highly susceptible to data-poisoned backdoor attacks. This poses significant risks to downstream model training. In response to such potential threats, finetuning offers a simpler and more efficient defense choice compared to retraining large models with augmented data. In the supervised learning domain, fine-tuning defense strategies can achieve excellent defense performance. However, in the unsupervised and semi-supervised domain, we find that when CLIP faces some complex attack techniques, the existing fine-tuning defense strategy, CleanCLIP, has some limitations on defense performance. The synonym substitution of its text-augmentation is insufficient to enhance the text feature space. To compensate for this weakness, we improve it by proposing a fine-grained \textbfText \textbfAlignment \textbfCleaner (TA-Cleaner) to cut off feature connections of backdoor triggers. We randomly select a few samples for positive and negative subtext generation at each epoch of CleanCLIP, and align the subtexts to the images to strengthen the text self-supervision. We evaluate the effectiveness of our TA-Cleaner against six attack algorithms and conduct comprehensive zero-shot classification tests on ImageNet1K. Our experimental results demonstrate that TA-Cleaner achieves state-of-the-art defensiveness among finetuning-based defense techniques. Even when faced with the novel attack technique BadCLIP, our TA-Cleaner outperforms CleanCLIP by reducing the ASR of Top-1 and Top-10 by 52.02% and 63.88%, respectively.

[AI-76] Subjective and Objective Quality-of-Experience Evaluation Study for Live Video Streaming

链接: https://arxiv.org/abs/2409.17596
作者: Zehao Zhu,Wei Sun,Jun Jia,Wei Wu,Sibin Deng,Kai Li,Ying Chen,Xiongkuo Min,Jia Wang,Guangtao Zhai
关键词-EN: live video streaming, gained widespread popularity, social media platforms, QoE, live video
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:In recent years, live video streaming has gained widespread popularity across various social media platforms. Quality of experience (QoE), which reflects end-users’ satisfaction and overall experience, plays a critical role for media service providers to optimize large-scale live compression and transmission strategies to achieve perceptually optimal rate-distortion trade-off. Although many QoE metrics for video-on-demand (VoD) have been proposed, there remain significant challenges in developing QoE metrics for live video streaming. To bridge this gap, we conduct a comprehensive study of subjective and objective QoE evaluations for live video streaming. For the subjective QoE study, we introduce the first live video streaming QoE dataset, TaoLive QoE, which consists of 42 source videos collected from real live broadcasts and 1,155 corresponding distorted ones degraded due to a variety of streaming distortions, including conventional streaming distortions such as compression, stalling, as well as live streaming-specific distortions like frame skipping, variable frame rate, etc. Subsequently, a human study was conducted to derive subjective QoE scores of videos in the TaoLive QoE dataset. For the objective QoE study, we benchmark existing QoE models on the TaoLive QoE dataset as well as publicly available QoE datasets for VoD scenarios, highlighting that current models struggle to accurately assess video QoE, particularly for live content. Hence, we propose an end-to-end QoE evaluation model, Tao-QoE, which integrates multi-scale semantic features and optical flow-based motion features to predicting a retrospective QoE score, eliminating reliance on statistical quality of service (QoS) features.

[AI-77] Deep Manifold Part 1: Anatomy of Neural Network Manifold

链接: https://arxiv.org/abs/2409.17592
作者: Max Y. Ma,Gen-Hua Shi
关键词-EN: self-progressing boundary conditions, numerical computation combining, manifold method principle, computation combining forward, numerical manifold method
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Based on the numerical manifold method principle, we developed a mathematical framework of a neural network manifold: Deep Manifold and discovered that neural networks: 1) is numerical computation combining forward and inverse; 2) have near infinite degrees of freedom; 3) exponential learning capacity with depth; 4) have self-progressing boundary conditions; 5) has training hidden bottleneck. We also define two concepts: neural network learning space and deep manifold space and introduce two concepts: neural network intrinsic pathway and fixed point. We raise three fundamental questions: 1). What is the training completion definition; 2). where is the deep learning convergence point (neural network fixed point); 3). How important is token timestamp in training data given negative time is critical in inverse problem.

[AI-78] Improving Fast Adversarial Training via Self-Knowledge Guidance

链接: https://arxiv.org/abs/2409.17589
作者: Chengze Jiang,Junkai Wang,Minjing Dong,Jie Gui,Xinli Shi,Yuan Cao,Yuan Yan Tang,James Tin-Yau Kwok
关键词-EN: achieved remarkable advancements, FAT, achieved remarkable, remarkable advancements, advancements in defending
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:Adversarial training has achieved remarkable advancements in defending against adversarial attacks. Among them, fast adversarial training (FAT) is gaining attention for its ability to achieve competitive robustness with fewer computing resources. Existing FAT methods typically employ a uniform strategy that optimizes all training data equally without considering the influence of different examples, which leads to an imbalanced optimization. However, this imbalance remains unexplored in the field of FAT. In this paper, we conduct a comprehensive study of the imbalance issue in FAT and observe an obvious class disparity regarding their performances. This disparity could be embodied from a perspective of alignment between clean and robust accuracy. Based on the analysis, we mainly attribute the observed misalignment and disparity to the imbalanced optimization in FAT, which motivates us to optimize different training data adaptively to enhance robustness. Specifically, we take disparity and misalignment into consideration. First, we introduce self-knowledge guided regularization, which assigns differentiated regularization weights to each class based on its training state, alleviating class disparity. Additionally, we propose self-knowledge guided label relaxation, which adjusts label relaxation according to the training accuracy, alleviating the misalignment and improving robustness. By combining these methods, we formulate the Self-Knowledge Guided FAT (SKG-FAT), leveraging naturally generated knowledge during training to enhance the adversarial robustness without compromising training efficiency. Extensive experiments on four standard datasets demonstrate that the SKG-FAT improves the robustness and preserves competitive clean accuracy, outperforming the state-of-the-art methods.

[AI-79] Multimodal Banking Dataset: Understanding Client Needs through Event Sequences

链接: https://arxiv.org/abs/2409.17587
作者: Mollaev Dzhambulat,Alexander Kostin,Postnova Maria,Ivan Karpukhin,Ivan A Kireev,Gleb Gusev,Andrey Savchenko
关键词-EN: Financial organizations collect, Financial organizations, organizations collect, collect a huge, huge amount
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Financial organizations collect a huge amount of data about clients that typically has a temporal (sequential) structure and is collected from various sources (modalities). Due to privacy issues, there are no large-scale open-source multimodal datasets of event sequences, which significantly limits the research in this area. In this paper, we present the industrial-scale publicly available multimodal banking dataset, MBD, that contains more than 1.5M corporate clients with several modalities: 950M bank transactions, 1B geo position events, 5M embeddings of dialogues with technical support and monthly aggregated purchases of four bank’s products. All entries are properly anonymized from real proprietary bank data. Using this dataset, we introduce a novel benchmark with two business tasks: campaigning (purchase prediction in the next month) and matching of clients. We provide numerical results that demonstrate the superiority of our multi-modal baselines over single-modal techniques for each task. As a result, the proposed dataset can open new perspectives and facilitate the future development of practically important large-scale multimodal algorithms for event sequences. HuggingFace Link: this https URL Github Link: this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2409.17587 [cs.LG] (or arXiv:2409.17587v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.17587 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-80] A Scalable Data-Driven Framework for Systematic Analysis of SEC 10-K Filings Using Large Language Models

链接: https://arxiv.org/abs/2409.17581
作者: Syed Affan Daimi,Asma Iqbal
关键词-EN: number of companies, growing exponentially, market analysts, significant challenge, challenge for market
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:The number of companies listed on the NYSE has been growing exponentially, creating a significant challenge for market analysts, traders, and stockholders who must monitor and assess the performance and strategic shifts of a large number of companies regularly. There is an increasing need for a fast, cost-effective, and comprehensive method to evaluate the performance and detect and compare many companies’ strategy changes efficiently. We propose a novel data-driven approach that leverages large language models (LLMs) to systematically analyze and rate the performance of companies based on their SEC 10-K filings. These filings, which provide detailed annual reports on a company’s financial performance and strategic direction, serve as a rich source of data for evaluating various aspects of corporate health, including confidence, environmental sustainability, innovation, and workforce management. We also introduce an automated system for extracting and preprocessing 10-K filings. This system accurately identifies and segments the required sections as outlined by the SEC, while also isolating key textual content that contains critical information about the company. This curated data is then fed into Cohere’s Command-R+ LLM to generate quantitative ratings across various performance metrics. These ratings are subsequently processed and visualized to provide actionable insights. The proposed scheme is then implemented on an interactive GUI as a no-code solution for running the data pipeline and creating the visualizations. The application showcases the rating results and provides year-on-year comparisons of company performance.

[AI-81] Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study

链接: https://arxiv.org/abs/2409.17580
作者: Zahra Sepasdar,Sushant Gautam,Cise Midoglu,Michael A. Riegler,Pål Halvorsen
关键词-EN: Extracting meaningful insights, poses significant challenges, Extracting meaningful, datasets poses significant, significant challenges
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Extracting meaningful insights from large and complex datasets poses significant challenges, particularly in ensuring the accuracy and relevance of retrieved information. Traditional data retrieval methods such as sequential search and index-based retrieval often fail when handling intricate and interconnected data structures, resulting in incomplete or misleading outputs. To overcome these limitations, we introduce Structured-GraphRAG, a versatile framework designed to enhance information retrieval across structured datasets in natural language queries. Structured-GraphRAG utilizes multiple knowledge graphs, which represent data in a structured format and capture complex relationships between entities, enabling a more nuanced and comprehensive retrieval of information. This graph-based approach reduces the risk of errors in language model outputs by grounding responses in a structured format, thereby enhancing the reliability of results. We demonstrate the effectiveness of Structured-GraphRAG by comparing its performance with that of a recently published method using traditional retrieval-augmented generation. Our findings show that Structured-GraphRAG significantly improves query processing efficiency and reduces response times. While our case study focuses on soccer data, the framework’s design is broadly applicable, offering a powerful tool for data analysis and enhancing language model applications across various structured domains.

[AI-82] Dr. GPT in Campus Counseling: Understanding Higher Education Students Opinions on LLM-assisted Mental Health Services

链接: https://arxiv.org/abs/2409.17572
作者: Owen Xingjian Zhang,Shuyao Zhou,Jiayi Geng,Yuhan Liu,Sunny Xun Liu
关键词-EN: Large Language Models, Language Models, health challenges faced, Large Language, General Information Inquiry
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 5 pages

点击查看摘要

Abstract:In response to the increasing mental health challenges faced by college students, we sought to understand their perspectives on how AI applications, particularly Large Language Models (LLMs), can be leveraged to enhance their mental well-being. Through pilot interviews with ten diverse students, we explored their opinions on the use of LLMs across five fictional scenarios: General Information Inquiry, Initial Screening, Reshaping Patient-Expert Dynamics, Long-term Care, and Follow-up Care. Our findings revealed that students’ acceptance of LLMs varied by scenario, with participants highlighting both potential benefits, such as proactive engagement and personalized follow-up care, and concerns, including limitations in training data and emotional support. These insights inform how AI technology should be designed and implemented to effectively support and enhance students’ mental well-being, particularly in scenarios where LLMs can complement traditional methods, while maintaining empathy and respecting individual preferences.

[AI-83] Showing Many Labels in Multi-label Classification Models: An Empirical Study of Adversarial Examples

链接: https://arxiv.org/abs/2409.17568
作者: Yujiang Liu,Wenjian Luo,Zhijian Chen,Muhammad Luqman Naseem
关键词-EN: Deep Neural Networks, Neural Networks, Deep Neural, development of Deep, numerous fields
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:With the rapid development of Deep Neural Networks (DNNs), they have been applied in numerous fields. However, research indicates that DNNs are susceptible to adversarial examples, and this is equally true in the multi-label domain. To further investigate multi-label adversarial examples, we introduce a novel type of attacks, termed “Showing Many Labels”. The objective of this attack is to maximize the number of labels included in the classifier’s prediction results. In our experiments, we select nine attack algorithms and evaluate their performance under “Showing Many Labels”. Eight of the attack algorithms were adapted from the multi-class environment to the multi-label environment, while the remaining one was specifically designed for the multi-label environment. We choose ML-LIW and ML-GCN as target models and train them on four popular multi-label datasets: VOC2007, VOC2012, NUS-WIDE, and COCO. We record the success rate of each algorithm when it shows the expected number of labels in eight different scenarios. Experimental results indicate that under the “Showing Many Labels”, iterative attacks perform significantly better than one-step attacks. Moreover, it is possible to show all labels in the dataset.

[AI-84] Pixel-Space Post-Training of Latent Diffusion Models

链接: https://arxiv.org/abs/2409.17565
作者: Christina Zhang,Simran Motwani,Matthew Yu,Ji Hou,Felix Juefei-Xu,Sam Tsai,Peter Vajda,Zijian He,Jialiang Wang
关键词-EN: made significant advancements, recent years, made significant, significant advancements, generation in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is due to the fact that all pre- and post-training of LDMs are done in latent space, which is typically 8 \times 8 lower spatial-resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training by a large margin on a state-of-the-art DiT transformer and U-Net diffusion models in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.

[AI-85] riple Point Masking

链接: https://arxiv.org/abs/2409.17547
作者: Jiaming Liu,Linghe Kong,Yue Wu,Maoguo Gong,Hao Li,Qiguang Miao,Wenping Ma,Can Qin
关键词-EN: encounter performance bottlenecks, learning methods encounter, overcome this limitation, mask learning methods, methods encounter performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing 3D mask learning methods encounter performance bottlenecks under limited data, and our objective is to overcome this limitation. In this paper, we introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders to achieve multi-mask learning for 3D point clouds. Specifically, we augment the baselines with two additional mask choices (i.e., medium mask and low mask) as our core insight is that the recovery process of an object can manifest in diverse ways. Previous high-masking schemes focus on capturing the global representation but lack the fine-grained recovery capability, so that the generated pre-trained weights tend to play a limited role in the fine-tuning process. With the support of the proposed TPM, available methods can exhibit more flexible and accurate completion capabilities, enabling the potential autoencoder in the pre-training stage to consider multiple representations of a single 3D object. In addition, an SVM-guided weight selection module is proposed to fill the encoder parameters for downstream networks with the optimal weight during the fine-tuning stage, maximizing linear accuracy and facilitating the acquisition of intricate representations for new objects. Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.

[AI-86] Modulated Intervention Preference Optimization (MIPO): Keey the Easy Refine the Difficult AAAI2025

链接: https://arxiv.org/abs/2409.17545
作者: Cheolhun Jang
关键词-EN: well-trained SFT model, Preference optimization methods, optimization methods typically, methods typically begin, reference model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8pages, submitted to AAAI 2025

点击查看摘要

Abstract:Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from deviating too far from the reference model’s distribution, thereby avoiding the generation of anomalous responses. When the reference model is already well-aligned with the given data or only requires slight adjustments, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and requires significant deviation from its current state, a regularization term may actually hinder the model alignment. In this study, we propose \textbfModulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it. If the data is well-aligned, the intervention is increased to prevent the policy model from diverging significantly from reference model. Conversely, if the alignment is poor, the interference is reduced to facilitate more extensive training. We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B in Alpaca Eval 2.0 and MT-Bench. The experimental results demonstrate that MIPO consistently outperforms DPO across various evaluation scenarios.

[AI-87] On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy

链接: https://arxiv.org/abs/2409.17538
作者: Saber Malekmohammadi,Golnoosh Farnadi
关键词-EN: processing involves large-scale, involves large-scale pre-training, natural language processing, language processing involves, general domain data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:A significant approach in natural language processing involves large-scale pre-training on general domain data followed by adaptation to specific tasks or domains. As models grow in size, full fine-tuning all parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g. LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices into some layers of the transformer architecture, called adapters. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to full fine-tuning all parameters. In this work, we look at low-rank adaptation from the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t the adapter parameters coming from their full fine-tuning, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the noise distribution and a Gaussian distribution with the same variance, we show that the dynamics of LoRA and FLoRA are very close to differentially private full fine-tuning the adapters, which suggests that low-rank adaptation implicitly provides privacy w.r.t the fine-tuning data. Finally, using Johnson-Lindenstrauss lemma, we show that when augmented with gradient clipping, low-rank adaptation is almost equivalent to differentially private full fine-tuning adapters with a fixed noise scale.

[AI-88] Just say what you want: only-prompting self-rewarding online preference optimization

链接: https://arxiv.org/abs/2409.17534
作者: Ruijie Xu,Zhihan Liu,Yongfei Liu,Shipeng Yan,Zhaoran Wang,Zhi Zhang,Xuming He
关键词-EN: online Reinforcement Learning, Reinforcement Learning, self-rewarding alignment methods, online Reinforcement, alignment methods
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We address the challenge of online Reinforcement Learning from Human Feedback (RLHF) with a focus on self-rewarding alignment methods. In online RLHF, obtaining feedback requires interaction with the environment, which can be costly when using additional reward models or the GPT-4 API. Current self-rewarding approaches rely heavily on the discriminator’s judgment capabilities, which are effective for large-scale models but challenging to transfer to smaller ones. To address these limitations, we propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities. Additionally, we employ fine-grained arithmetic control over the optimality gap between positive and negative examples, generating more hard negatives in the later stages of training to help the model better capture subtle human preferences. Finally, we conduct extensive experiments on two base models, Mistral-7B and Mistral-Instruct-7B, which significantly bootstrap the performance of the reference model, achieving 34.5% in the Length-controlled Win Rates of AlpacaEval 2.0.

[AI-89] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion NEURIPS2024

链接: https://arxiv.org/abs/2409.17531
作者: Ming Dai,Lingfeng Yang,Yihao Xu,Zhenhua Feng,Wankou Yang
关键词-EN: involves grounding descriptive, grounding descriptive sentences, common vision task, common vision, descriptive sentences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 21pages, 11figures, NeurIPS2024

点击查看摘要

Abstract:Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because the former paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding. Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks. Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch. This branch only consists of a lightweight MLP, which simplifies the structure and improves reasoning speed. Experiments on six widely used VG datasets, i.e., RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG. Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks. Codes and models will be available at \urlthis https URL.

[AI-90] Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Integrating SGBM and Segmentation Models

链接: https://arxiv.org/abs/2409.17526
作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
关键词-EN: radiata pine trees, pine trees presents, trees presents significant, safety risks due, Manual pruning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Manual pruning of radiata pine trees presents significant safety risks due to their substantial height and the challenging terrains in which they thrive. To address these risks, this research proposes the development of a drone-based pruning system equipped with specialized pruning tools and a stereo vision camera, enabling precise detection and trimming of branches. Deep learning algorithms, including YOLO and Mask R-CNN, are employed to ensure accurate branch detection, while the Semi-Global Matching algorithm is integrated to provide reliable distance estimation. The synergy between these techniques facilitates the precise identification of branch locations and enables efficient, targeted pruning. Experimental results demonstrate that the combined implementation of YOLO and SGBM enables the drone to accurately detect branches and measure their distances from the drone. This research not only improves the safety and efficiency of pruning operations but also makes a significant contribution to the advancement of drone technology in the automation of agricultural and forestry practices, laying a foundational framework for further innovations in environmental management.

[AI-91] EAGLE: Egocentric AGgregated Language-video Engine

链接: https://arxiv.org/abs/2409.17523
作者: Jing Bi,Yunlong Tang,Luchuan Song,Ali Vosoughi,Nguyen Nguyen,Chenliang Xu
关键词-EN: video analysis brings, understanding human activities, first-person perspective, egocentric video analysis, egocentric video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ACMMM 24

点击查看摘要

Abstract:The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation in tasks like action recognition, procedure learning, and moment retrieval, \etc, coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the \textitfirst large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLM for egocentric video understanding. Our extensive experiments demonstrate EAGLE’s superior performance over existing models, highlighting its ability to balance task-specific understanding with holistic video interpretation. With EAGLE, we aim to pave the way for research opportunities and practical applications in real-world scenarios.

[AI-92] Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization

链接: https://arxiv.org/abs/2409.17519
作者: Kento Kawaharazuka,Yoshiki Obinata,Naoaki Kanazawa,Kei Okada,Masayuki Inaba
关键词-EN: diverse environments, environmental state recognition, autonomously navigate, navigate and operate, operate in diverse
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Advanced Robotics, website - this https URL

点击查看摘要

Abstract:In order for robots to autonomously navigate and operate in diverse environments, it is essential for them to recognize the state of their environment. On the other hand, the environmental state recognition has traditionally involved distinct methods tailored to each state to be recognized. In this study, we perform a unified environmental state recognition for robots through the spoken language with pre-trained large-scale vision-language models. We apply Visual Question Answering and Image-to-Text Retrieval, which are tasks of Vision-Language Models. We show that with our method, it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed and whether water is running in a sink, without training neural networks or manual programming. In addition, the recognition accuracy can be improved by selecting appropriate texts from the set of prepared texts based on black-box optimization. For each state recognition, only the text set and its weighting need to be changed, eliminating the need to prepare multiple different models and programs, and facilitating the management of source code and computer resource. We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch.

[AI-93] Multi-Designated Detector Watermarking for Language Models

链接: https://arxiv.org/abs/2409.17518
作者: Zhengan Huang,Gongxian Zeng,Xin Mu,Yu Wang,Yue Yu
关键词-EN: large language models, multi-designated detector watermarking, initiate the study, large language, MDDW
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we initiate the study of \emphmulti-designated detector watermarking (MDDW) for large language models (LLMs). This technique allows model providers to generate watermarked outputs from LLMs with two key properties: (i) only specific, possibly multiple, designated detectors can identify the watermarks, and (ii) there is no perceptible degradation in the output quality for ordinary users. We formalize the security definitions for MDDW and present a framework for constructing MDDW for any LLM using multi-designated verifier signatures (MDVS). Recognizing the significant economic value of LLM outputs, we introduce claimability as an optional security feature for MDDW, enabling model providers to assert ownership of LLM outputs within designated-detector settings. To support claimable MDDW, we propose a generic transformation converting any MDVS to a claimable MDVS. Our implementation of the MDDW scheme highlights its advanced functionalities and flexibility over existing methods, with satisfactory performance metrics.

[AI-94] Dataset Distillation-based Hybrid Federated Learning on Non-IID Data

链接: https://arxiv.org/abs/2409.17517
作者: Xiufang Shi,Wei Zhang,Mincheng Wu,Guangyi Liu,Zhenyu Wen,Shibo He,Tejal Shah,Rajiv Ranjan
关键词-EN: data, model training, data labels, federated learning, training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In federated learning, the heterogeneity of client data has a great impact on the performance of model training. Many heterogeneity issues in this process are raised by non-independently and identically distributed (Non-IID) data. This study focuses on the issue of label distribution skew. To address it, we propose a hybrid federated learning framework called HFLDD, which integrates dataset distillation to generate approximately independent and equally distributed (IID) data, thereby improving the performance of model training. Particularly, we partition the clients into heterogeneous clusters, where the data labels among different clients within a cluster are unbalanced while the data labels among different clusters are balanced. The cluster headers collect distilled data from the corresponding cluster members, and conduct model training in collaboration with the server. This training process is like traditional federated learning on IID data, and hence effectively alleviates the impact of Non-IID data on model training. Furthermore, we compare our proposed method with typical baseline methods on public datasets. Experimental results demonstrate that when the data labels are severely imbalanced, the proposed HFLDD outperforms the baseline methods in terms of both test accuracy and communication cost.

[AI-95] Functional Classification of Spiking Signal Data Using Artificial Intelligence Techniques: A Review

链接: https://arxiv.org/abs/2409.17516
作者: Danial Sharifrazi,Nouman Javed,Javad Hassannataj Joloudari,Roohallah Alizadehsani,Prasad N. Paradkar,Ru-San Tan,U. Rajendra Acharya,Asim Bhatti
关键词-EN: Human brain neuron, incredibly significant nowadays, brain neuron activities, Human brain, significant nowadays
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 8 figures, 32 pages

点击查看摘要

Abstract:Human brain neuron activities are incredibly significant nowadays. Neuronal behavior is assessed by analyzing signal data such as electroencephalography (EEG), which can offer scientists valuable information about diseases and human-computer interaction. One of the difficulties researchers confront while evaluating these signals is the existence of large volumes of spike data. Spikes are some considerable parts of signal data that can happen as a consequence of vital biomarkers or physical issues such as electrode movements. Hence, distinguishing types of spikes is important. From this spot, the spike classification concept commences. Previously, researchers classified spikes manually. The manual classification was not precise enough as it involves extensive analysis. Consequently, Artificial Intelligence (AI) was introduced into neuroscience to assist clinicians in classifying spikes correctly. This review discusses the importance and use of AI in spike classification, focusing on the recognition of neural activity noises. The task is divided into three main components: preprocessing, classification, and evaluation. Existing methods are introduced and their importance is determined. The review also highlights the need for more efficient algorithms. The primary goal is to provide a perspective on spike classification for future research and provide a comprehensive understanding of the methodologies and issues involved. The review organizes materials in the spike classification field for future studies. In this work, numerous studies were extracted from different databases. The PRISMA-related research guidelines were then used to choose papers. Then, research studies based on spike classification using machine learning and deep learning approaches with effective preprocessing were selected.

[AI-96] From News to Forecast: Integrating Event Analysis in LLM-Based Time Series Forecasting with Reflection NEURIPS2024

链接: https://arxiv.org/abs/2409.17515
作者: Xinlei Wang,Maike Feng,Jing Qiu,Jinjin Gu,Junhua Zhao
关键词-EN: Large Language Models, Large Language, time series forecasting, enhance time series, Generative Agents
类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted for NeurIPS 2024

点击查看摘要

Abstract:This paper introduces a novel approach to enhance time series forecasting using Large Language Models (LLMs) and Generative Agents. With language as a medium, our method adaptively integrates various social events into forecasting models, aligning news content with time series fluctuations for enriched insights. Specifically, we utilize LLM-based agents to iteratively filter out irrelevant news and employ human-like reasoning and reflection to evaluate predictions. This enables our model to analyze complex events, such as unexpected incidents and shifts in social behavior, and continuously refine the selection logic of news and the robustness of the agent’s output. By compiling selected news with time series data, we fine-tune the LLaMa2 pre-trained model. The results demonstrate significant improvements in forecasting accuracy and suggest a potential paradigm shift in time series forecasting by effectively harnessing unstructured news data.

[AI-97] Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

链接: https://arxiv.org/abs/2409.17508
作者: Xun Zhu,Ying Hu,Fanbin Mo,Miao Li,Ji Wu
关键词-EN: shown impressive capabilities, Multi-modal large language, large language models, large language, shown impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE that leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves efficient solution to the tug-of-war problem and can perform six different medical tasks including question answering, visual question answering, report generation, referring expression comprehension, referring expression generation and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with up to an average 8% performance gains. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code, data and model will be soon available at GitHub.

[AI-98] GLinSAT: The General Linear Satisfiability Neural Network Layer By Accelerated Gradient Descent

链接: https://arxiv.org/abs/2409.17500
作者: Hongtai Zeng,Chao Yang,Yanzhen Zhou,Cheng Yang,Qinglai Guo
关键词-EN: applying neural networks, networks satisfy specific, neural networks satisfy, satisfy specific constraints, neural network outputs
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Ensuring that the outputs of neural networks satisfy specific constraints is crucial for applying neural networks to real-life decision-making problems. In this paper, we consider making a batch of neural network outputs satisfy bounded and general linear constraints. We first reformulate the neural network output projection problem as an entropy-regularized linear programming problem. We show that such a problem can be equivalently transformed into an unconstrained convex optimization problem with Lipschitz continuous gradient according to the duality theorem. Then, based on an accelerated gradient descent algorithm with numerical performance enhancement, we present our architecture, GLinSAT, to solve the problem. To the best of our knowledge, this is the first general linear satisfiability layer in which all the operations are differentiable and matrix-factorization-free. Despite the fact that we can explicitly perform backpropagation based on automatic differentiation mechanism, we also provide an alternative approach in GLinSAT to calculate the derivatives based on implicit differentiation of the optimality condition. Experimental results on constrained traveling salesman problems, partial graph matching with outliers, predictive portfolio allocation and power system unit commitment demonstrate the advantages of GLinSAT over existing satisfiability layers.

[AI-99] Human Mobility Modeling with Limited Information via Large Language Models

链接: https://arxiv.org/abs/2409.17495
作者: Yifan Liu,Xishun Liao,Haoxuan Ma,Brian Yueshuai He,Chris Stanford,Jiaqi Ma
关键词-EN: human mobility modeling, human mobility, Understanding human mobility, complex challenge, challenge in transportation
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Understanding human mobility patterns has traditionally been a complex challenge in transportation modeling. Due to the difficulties in obtaining high-quality training datasets across diverse locations, conventional activity-based models and learning-based human mobility modeling algorithms are particularly limited by the availability and quality of datasets. Furthermore, current research mainly focuses on the spatial-temporal travel pattern but lacks an understanding of the semantic information between activities, which is crucial for modeling the interdependence between activities. In this paper, we propose an innovative Large Language Model (LLM) empowered human mobility modeling framework. Our proposed approach significantly reduces the reliance on detailed human mobility statistical data, utilizing basic socio-demographic information of individuals to generate their daily mobility patterns. We have validated our results using the NHTS and SCAG-ABM datasets, demonstrating the effective modeling of mobility patterns and the strong adaptability of our framework across various geographic locations.

[AI-100] Global-Local Medical SAM Adaptor Based on Full Adaption

链接: https://arxiv.org/abs/2409.17486
作者: Meng Wang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yarong Feng(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yongwei Tang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Tian Zhang(Software college Northeastern University Shenyang, Liaoning Province, P. R. China),Yuxin Liang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Chao Lv(Department of General Surgery, Shengjing Hospital China Medical University Shenyang, Liaoning Province, P. R. China)
关键词-EN: Medical SAM adaptor, visual language models, made great breakthroughs, Emerging of visual, SAM adaptor
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Emerging of visual language models, such as the segment anything model (SAM), have made great breakthroughs in the field of universal semantic segmentation and significantly aid the improvements of medical image segmentation, in particular with the help of Medical SAM adaptor (Med-SA). However, Med-SA still can be improved, as it fine-tunes SAM in a partial adaption manner. To resolve this problem, we present a novel global medical SAM adaptor (GMed-SA) with full adaption, which can adapt SAM globally. We further combine GMed-SA and Med-SA to propose a global-local medical SAM adaptor (GLMed-SA) to adapt SAM both globally and locally. Extensive experiments have been performed on the challenging public 2D melanoma segmentation dataset. The results show that GLMed-SA outperforms several state-of-the-art semantic segmentation methods on various evaluation metrics, demonstrating the superiority of our methods.

[AI-101] MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models NEURIPS2024

链接: https://arxiv.org/abs/2409.17481
作者: Gongfan Fang,Hongxu Yin,Saurav Muralidharan,Greg Heinrich,Jeff Pool,Jan Kautz,Pavlo Molchanov,Xinchao Wang
关键词-EN: Large Language Models, massive parameter counts, Large Language, Language Models, significant redundancy
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: NeurIPS 2024 Spotlight

点击查看摘要

Abstract:Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M’') Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model’s 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM’s learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains. Code is available at \urlthis https URL.

[AI-102] What Would Happen Next? Predicting Consequences from An Event Causality Graph

链接: https://arxiv.org/abs/2409.17480
作者: Chuanhong Zhan,Wei Xiang,Chao Liang,Bang Wang
关键词-EN: event script chain, script event prediction, Graph Event Prediction, Causality Graph Event, Event Causality Graph
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing script event prediction task forcasts the subsequent event based on an event script chain. However, the evolution of historical events are more complicated in real world scenarios and the limited information provided by the event script chain also make it difficult to accurately predict subsequent events. This paper introduces a Causality Graph Event Prediction(CGEP) task that forecasting consequential event based on an Event Causality Graph (ECG). We propose a Semantic Enhanced Distance-sensitive Graph Prompt Learning (SeDGPL) Model for the CGEP task. In SeDGPL, (1) we design a Distance-sensitive Graph Linearization (DsGL) module to reformulate the ECG into a graph prompt template as the input of a PLM; (2) propose an Event-Enriched Causality Encoding (EeCE) module to integrate both event contextual semantic and graph schema information; (3) propose a Semantic Contrast Event Prediction (ScEP) module to enhance the event representation among numerous candidate events and predict consequential event following prompt learning paradigm. %We construct two CGEP datasets based on existing MAVEN-ERE and ESC corpus for experiments. Experiment results validate our argument our proposed SeDGPL model outperforms the advanced competitors for the CGEP task.

[AI-103] Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards EMNLP2024

链接: https://arxiv.org/abs/2409.17472
作者: Heejin Do,Sangwon Ryu,Gary Geunbae Lee
关键词-EN: provide enriched feedback, evaluating multiple traits, Recent advances, automated essay scoring, enriched feedback
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024

点击查看摘要

Abstract:Recent advances in automated essay scoring (AES) have shifted towards evaluating multiple traits to provide enriched feedback. Like typical AES systems, multi-trait AES employs the quadratic weighted kappa (QWK) to measure agreement with human raters, aligning closely with the rating schema; however, its non-differentiable nature prevents its direct use in neural network training. In this paper, we propose Scoring-aware Multi-reward Reinforcement Learning (SaMRL), which integrates actual evaluation schemes into the training process by designing QWK-based rewards with a mean-squared error penalty for multi-trait AES. Existing reinforcement learning (RL) applications in AES are limited to classification models despite associated performance degradation, as RL requires probability distributions; instead, we adopt an autoregressive score generation framework to leverage token generation probabilities for robust multi-trait score predictions. Empirical analyses demonstrate that SaMRL facilitates model training, notably enhancing scoring of previously inferior prompts.

[AI-104] CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

链接: https://arxiv.org/abs/2409.17457
作者: Sifan Wu,Amir Khasahmadi,Mor Katz,Pradeep Kumar Jayaraman,Yewen Pu,Karl Willis,Bang Liu
关键词-EN: contemporary mechanical design, CAD, mechanical design, central to contemporary, Design
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. However, it encounters challenges in achieving precise parametric sketch modeling and lacks practical evaluation metrics suitable for mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design.

[AI-105] A Time Series is Worth Five Experts: Heterogeneous Mixture of Experts for Traffic Flow Prediction

链接: https://arxiv.org/abs/2409.17440
作者: Guangyu Wang,Yujie Chen,Ming Gao,Zhiqiao Wu,Jiafu Tang,Jiabi Zhao
关键词-EN: faces significant challenges, prediction faces significant, significant challenges, necessitating a deep, faces significant
类目: Artificial Intelligence (cs.AI)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:Accurate traffic prediction faces significant challenges, necessitating a deep understanding of both temporal and spatial cues and their complex interactions across multiple variables. Recent advancements in traffic prediction systems are primarily due to the development of complex sequence-centric models. However, existing approaches often embed multiple variables and spatial relationships at each time step, which may hinder effective variable-centric learning, ultimately leading to performance degradation in traditional traffic prediction tasks. To overcome these limitations, we introduce variable-centric and prior knowledge-centric modeling techniques. Specifically, we propose a Heterogeneous Mixture of Experts (TITAN) model for traffic flow prediction. TITAN initially consists of three experts focused on sequence-centric modeling. Then, designed a low-rank adaptive method, TITAN simultaneously enables variable-centric modeling. Furthermore, we supervise the gating process using a prior knowledge-centric modeling strategy to ensure accurate routing. Experiments on two public traffic network datasets, METR-LA and PEMS-BAY, demonstrate that TITAN effectively captures variable-centric dependencies while ensuring accurate routing. Consequently, it achieves improvements in all evaluation metrics, ranging from approximately 4.37% to 11.53%, compared to previous state-of-the-art (SOTA) models. The code is open at \hrefthis https URLthis https URL.

[AI-106] HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows

链接: https://arxiv.org/abs/2409.17433
作者: Wenlin Yao,Haitao Mi,Dong Yu
关键词-EN: requiring multi-step thinking, problems requiring multi-step, Hybrid Thinking, thinking, slow thinking
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 27 pages, 5 figures

点击查看摘要

Abstract:Despite recent advancements in large language models (LLMs), their performance on complex reasoning problems requiring multi-step thinking and combining various skills is still limited. To address this, we propose a novel framework HDFlow for complex reasoning with LLMs that combines fast and slow thinking modes in an adaptive manner. Our approach consists of two key components: 1) a new approach for slow, deliberate reasoning called Dynamic Workflow, which automatically decomposes complex problems into more manageable sub-tasks and dynamically designs a workflow to assemble specialized LLM or symbolic reasoning tools to solve sub-tasks; 2) Hybrid Thinking, a general framework that dynamically combines fast and slow thinking based on problem complexity. Finally, we propose an easy-to-scale method for automatically synthesizing a large-scale dataset of 27K challenging reasoning problems for complex reasoning and a hybrid thinking tuning method that trains smaller LLMs on this dataset to internalize the fast/slow hybrid reasoning strategies. Experiments on four reasoning benchmark datasets demonstrate that our slow thinking with dynamic workflows significantly outperforms Chain-of-Thought, and hybrid thinking achieves the highest accuracy while providing an effective balance between computational efficiency and performance. Fine-tuning using our hybrid thinking approach also significantly boosts the complex reasoning capabilities of open-source language models. The results showcase the promise of slow thinking, dynamic workflows, and hybrid thinking in expanding the frontier of complex problem-solving with LLMs\footnoteCode and data will be released at \urlthis https URL…

[AI-107] Exploring the Use of ChatGPT for a Systematic Literature Review: a Design-Based Research

链接: https://arxiv.org/abs/2409.17426
作者: Qian Huang,Qiyun Wang
关键词-EN: educational contexts,including learning, SLR, teaching and research, contexts,including learning, educational contexts,including
类目: Artificial Intelligence (cs.AI)
*备注: 21 pages, 13 figures, 2 tables

点击查看摘要

Abstract:ChatGPT has been used in several educational contexts,including learning, teaching and research. It also has potential to conduct the systematic literature review (SLR). However, there are limited empirical studies on how to use ChatGPT in conducting a SLR. Based on a SLR published,this study used ChatGPT to conduct a SLR of the same 33 papers in a design-based approach, to see what the differences are by comparing the reviews’ results,and to answer: To what extent can ChatGPT conduct SLR? What strategies can human researchers utilize to structure prompts for ChatGPT that enhance the reliability and validity of a SLR? This study found that ChatGPT could conduct a SLR. It needs detailed and accurate prompts to analyze the literature. It also has limitations. Guiding principles are summarized from this study for researchers to follow when they need to conduct SLRs using ChatGPT.

[AI-108] Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

链接: https://arxiv.org/abs/2409.17422
作者: Zhenmei Shi,Yifei Ming,Xuan-Phi Nguyen,Yingyu Liang,Shafiq Joty
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, increased computational resources
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4 \times speedup and 30% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention, SnapKV and demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at \urlthis https URL.

[AI-109] From Deception to Detection: The Dual Roles of Large Language Models in Fake News

链接: https://arxiv.org/abs/2409.17416
作者: Dorsaf Sallami,Yuan-Chen Chang,Esma Aïmeur
关键词-EN: Fake, public trust, poses a significant, significant threat, ecosystems and public
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fake news poses a significant threat to the integrity of information ecosystems and public trust. The advent of Large Language Models (LLMs) holds considerable promise for transforming the battle against fake news. Generally, LLMs represent a double-edged sword in this struggle. One major concern is that LLMs can be readily used to craft and disseminate misleading information on a large scale. This raises the pressing questions: Can LLMs easily generate biased fake news? Do all LLMs have this capability? Conversely, LLMs offer valuable prospects for countering fake news, thanks to their extensive knowledge of the world and robust reasoning capabilities. This leads to other critical inquiries: Can we use LLMs to detect fake news, and do they outperform typical detection models? In this paper, we aim to address these pivotal questions by exploring the performance of various LLMs. Our objective is to explore the capability of various LLMs in effectively combating fake news, marking this as the first investigation to analyze seven such models. Our results reveal that while some models adhere strictly to safety protocols, refusing to generate biased or misleading content, other models can readily produce fake news across a spectrum of biases. Additionally, our results show that larger models generally exhibit superior detection abilities and that LLM-generated fake news are less likely to be detected than human-written ones. Finally, our findings demonstrate that users can benefit from LLM-generated explanations in identifying fake news.

[AI-110] Exploring Semantic Clustering in Deep Reinforcement Learning for Video Games

链接: https://arxiv.org/abs/2409.17411
作者: Liang Zhang,Adarsh Pyarelal,Justin Lieffers
关键词-EN: deep reinforcement learning, semantic clustering, semantic clustering properties, reinforcement learning, enriching our understanding
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we investigate the semantic clustering properties of deep reinforcement learning (DRL) for video games, enriching our understanding of the internal dynamics of DRL and advancing its interpretability. In this context, semantic clustering refers to the inherent capacity of neural networks to internally group video inputs based on semantic similarity. To achieve this, we propose a novel DRL architecture that integrates a semantic clustering module featuring both feature dimensionality reduction and online clustering. This module seamlessly integrates into the DRL training pipeline, addressing instability issues observed in previous t-SNE-based analysis methods and eliminating the necessity for extensive manual annotation of semantic analysis. Through experiments, we validate the effectiveness of the proposed module and the semantic clustering properties in DRL for video games. Additionally, based on these properties, we introduce new analytical methods to help understand the hierarchical structure of policies and the semantic distribution within the feature space.

[AI-111] Sociotechnical Approach to Enterprise Generative Artificial Intelligence (E-GenAI)

链接: https://arxiv.org/abs/2409.17408
作者: Leoncio Jimenez,Francisco Venegas
关键词-EN: Imperfect Knowledge Management, Inventive Problem Solving, Knowledge Management, proposed to characterize, sociotechnical approach
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:In this theoretical article, a sociotechnical approach is proposed to characterize. First, the business ecosystem, focusing on the relationships among Providers, Enterprise, and Customers through SCM, ERP, and CRM platforms to align: (1) Business Intelligence (BI), Fuzzy Logic (FL), and TRIZ (Theory of Inventive Problem Solving), through the OID model, and (2) Knowledge Management (KM) and Imperfect Knowledge Management (IKM), through the OIDK model. Second, the article explores the E-GenAI business ecosystem, which integrates GenAI-based platforms for SCM, ERP, and CRM with GenAI-based platforms for BI, FL, TRIZ, KM, and IKM, to align Large Language Models (LLMs) through the E-GenAI (OID) model. Finally, to understand the dynamics of LLMs, we utilize finite automata to model the relationships between Followers and Followees. This facilitates the construction of LLMs that can identify specific characteristics of users on a social media platform.

[AI-112] Post-hoc Reward Calibration: A Case Study on Length Bias

链接: https://arxiv.org/abs/2409.17407
作者: Zeyu Huang,Zihan Qiu,Zili Wang,Edoardo M. Ponti,Ivan Titov
关键词-EN: Large Language Models, Reinforcement Learning, Large Language, Human Feedback aligns, translates human feedback
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Preprint

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback aligns the outputs of Large Language Models with human values and preferences. Central to this process is the reward model (RM), which translates human feedback into training signals for optimising LLM behaviour. However, RMs can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality. These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours in LLMs alignment. This paper addresses the challenge of correcting such biases without additional data and training, introducing the concept of Post-hoc Reward Calibration. We first propose an intuitive approach to estimate the bias term and, thus, remove it to approximate the underlying true reward. We then extend the approach to a more general and robust form with the Locally Weighted Regression. Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the RewardBench dataset; (2) enhanced alignment of RM rankings with GPT-4 evaluations and human preferences based on the AlpacaEval benchmark; and (3) improved Length-Controlled win rate of the RLHF process in multiple LLM–RM combinations. Our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment. Our code and results are available at this https URL.

[AI-113] AI Enabled Neutron Flux Measurement and Virtual Calibration in Boiling Water Reactors

链接: https://arxiv.org/abs/2409.17405
作者: Anirudh Tunga,Jordan Heim,Michael Mueterthies,Thomas Gruenwald,Jonathan Nistor
关键词-EN: Technical Specifications, compliance with Technical, Accurately capturing, fuel cycle planning, dimensional power distribution
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately capturing the three dimensional power distribution within a reactor core is vital for ensuring the safe and economical operation of the reactor, compliance with Technical Specifications, and fuel cycle planning (safety, control, and performance evaluation). Offline (that is, during cycle planning and core design), a three dimensional neutronics simulator is used to estimate the reactor’s power, moderator, void, and flow distributions, from which margin to thermal limits and fuel exposures can be approximated. Online, this is accomplished with a system of local power range monitors (LPRMs) designed to capture enough neutron flux information to infer the full nodal power distribution. Certain problems with this process, ranging from measurement and calibration to the power adaption process, pose challenges to operators and limit the ability to design reload cores economically (e.g., engineering in insufficient margin or more margin than required). Artificial intelligence (AI) and machine learning (ML) are being used to solve the problems to reduce maintenance costs, improve the accuracy of online local power measurements, and decrease the bias between offline and online power distributions, thereby leading to a greater ability to design safe and economical reload cores. We present ML models trained from two deep neural network (DNN) architectures, SurrogateNet and LPRMNet, that demonstrate a testing error of 1 percent and 3 percent, respectively. Applications of these models can include virtual sensing capability for bypassed or malfunctioning LPRMs, on demand virtual calibration of detectors between successive calibrations, highly accurate nuclear end of life determinations for LPRMs, and reduced bias between measured and predicted power distributions within the core.

[AI-114] ransient Adversarial 3D Projection Attacks on Object Detection in Autonomous Driving

链接: https://arxiv.org/abs/2409.17403
作者: Ce Zhou,Qiben Yan,Sijia Liu
关键词-EN: Object detection, crucial task, targeting object detection, patches or stickers, Object
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 7 figures, SmartSP 2024

点击查看摘要

Abstract:Object detection is a crucial task in autonomous driving. While existing research has proposed various attacks on object detection, such as those using adversarial patches or stickers, the exploration of projection attacks on 3D surfaces remains largely unexplored. Compared to adversarial patches or stickers, which have fixed adversarial patterns, projection attacks allow for transient modifications to these patterns, enabling a more flexible attack. In this paper, we introduce an adversarial 3D projection attack specifically targeting object detection in autonomous driving scenarios. We frame the attack formulation as an optimization problem, utilizing a combination of color mapping and geometric transformation models. Our results demonstrate the effectiveness of the proposed attack in deceiving YOLOv3 and Mask R-CNN in physical settings. Evaluations conducted in an indoor environment show an attack success rate of up to 100% under low ambient light conditions, highlighting the potential damage of our attack in real-world driving scenarios.

[AI-115] Enhancing Recommendation with Denoising Auxiliary Task

链接: https://arxiv.org/abs/2409.17402
作者: Pengsheng Liu,Linan Zheng,Jiale Chen,Guangfa Zhang,Yang Xu,Jinyun Fang
关键词-EN: predict user preferences, historical interaction sequences, accurately predict user, sequences, recommender systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The historical interaction sequences of users plays a crucial role in training recommender systems that can accurately predict user preferences. However, due to the arbitrariness of user behavior, the presence of noise in these sequences poses a challenge to predicting their next actions in recommender systems. To address this issue, our motivation is based on the observation that training noisy sequences and clean sequences (sequences without noise) with equal weights can impact the performance of the model. We propose a novel self-supervised Auxiliary Task Joint Training (ATJT) method aimed at more accurately reweighting noisy sequences in recommender systems. Specifically, we strategically select subsets from users’ original sequences and perform random replacements to generate artificially replaced noisy sequences. Subsequently, we perform joint training on these artificially replaced noisy sequences and the original sequences. Through effective reweighting, we incorporate the training results of the noise recognition model into the recommender model. We evaluate our method on three datasets using a consistent base model. Experimental results demonstrate the effectiveness of introducing self-supervised auxiliary task to enhance the base model’s performance.

[AI-116] AgRegNet: A Deep Regression Network for Flower and Fruit Density Estimation Localization and Counting in Orchards

链接: https://arxiv.org/abs/2409.17400
作者: Uddhav Bhattarai,Santosh Bhusal,Qin Zhang,Manoj Karkee
关键词-EN: agricultural industry today, manual labor availability, fruit density estimation, major challenges, agricultural industry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the major challenges for the agricultural industry today is the uncertainty in manual labor availability and the associated cost. Automated flower and fruit density estimation, localization, and counting could help streamline harvesting, yield estimation, and crop-load management strategies such as flower and fruitlet thinning. This article proposes a deep regression-based network, AgRegNet, to estimate density, count, and location of flower and fruit in tree fruit canopies without explicit object detection or polygon annotation. Inspired by popular U-Net architecture, AgRegNet is a U-shaped network with an encoder-to-decoder skip connection and modified ConvNeXt-T as an encoder feature extractor. AgRegNet can be trained based on information from point annotation and leverages segmentation information and attention modules (spatial and channel) to highlight relevant flower and fruit features while suppressing non-relevant background features. Experimental evaluation in apple flower and fruit canopy images under an unstructured orchard environment showed that AgRegNet achieved promising accuracy as measured by Structural Similarity Index (SSIM), percentage Mean Absolute Error (pMAE) and mean Average Precision (mAP) to estimate flower and fruit density, count, and centroid location, respectively. Specifically, the SSIM, pMAE, and mAP values for flower images were 0.938, 13.7%, and 0.81, respectively. For fruit images, the corresponding values were 0.910, 5.6%, and 0.93. Since the proposed approach relies on information from point annotation, it is suitable for sparsely and densely located objects. This simplified technique will be highly applicable for growers to accurately estimate yields and decide on optimal chemical and mechanical flower thinning practices.

[AI-117] Beyond Redundancy: Information-aware Unsupervised Multiplex Graph Structure Learning NEURIPS2024

链接: https://arxiv.org/abs/2409.17386
作者: Zhixiang Shen,Shuo Wang,Zhao Kang
关键词-EN: Unsupervised Multiplex Graph, learn node representations, Multiplex Graph, manual labeling, Multiplex Graph Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: Appear in NeurIPS 2024

点击查看摘要

Abstract:Unsupervised Multiplex Graph Learning (UMGL) aims to learn node representations on various edge types without manual labeling. However, existing research overlooks a key factor: the reliability of the graph structure. Real-world data often exhibit a complex nature and contain abundant task-irrelevant noise, severely compromising UMGL’s performance. Moreover, existing methods primarily rely on contrastive learning to maximize mutual information across different graphs, limiting them to multiplex graph redundant scenarios and failing to capture view-unique task-relevant information. In this paper, we focus on a more realistic and challenging task: to unsupervisedly learn a fused graph from multiple graphs that preserve sufficient task-relevant information while removing task-irrelevant noise. Specifically, our proposed Information-aware Unsupervised Multiplex Graph Fusion framework (InfoMGF) uses graph structure refinement to eliminate irrelevant noise and simultaneously maximizes view-shared and view-unique task-relevant information, thereby tackling the frontier of non-redundant multiplex graph. Theoretical analyses further guarantee the effectiveness of InfoMGF. Comprehensive experiments against various baselines on different downstream tasks demonstrate its superior performance and robustness. Surprisingly, our unsupervised method even beats the sophisticated supervised approaches. The source code and datasets are available at this https URL.

[AI-118] Data-efficient Trajectory Prediction via Coreset Selection

链接: https://arxiv.org/abs/2409.17385
作者: Ruining Yang,Lili Su
关键词-EN: multiple information-collection devices, Modern vehicles, sensors and cameras, continuously generating, equipped with multiple
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Modern vehicles are equipped with multiple information-collection devices such as sensors and cameras, continuously generating a large volume of raw data. Accurately predicting the trajectories of neighboring vehicles is a vital component in understanding the complex driving environment. Yet, training trajectory prediction models is challenging in two ways. Processing the large-scale data is computation-intensive. Moreover, easy-medium driving scenarios often overwhelmingly dominate the dataset, leaving challenging driving scenarios such as dense traffic under-represented. For example, in the Argoverse motion prediction dataset, there are very few instances with \ge 50 agents, while scenarios with 10 \thicksim 20 agents are far more common. In this paper, to mitigate data redundancy in the over-represented driving scenarios and to reduce the bias rooted in the data scarcity of complex ones, we propose a novel data-efficient training method based on coreset selection. This method strategically selects a small but representative subset of data while balancing the proportions of different scenario difficulties. To the best of our knowledge, we are the first to introduce a method capable of effectively condensing large-scale trajectory dataset, while achieving a state-of-the-art compression ratio. Notably, even when using only 50% of the Argoverse dataset, the model can be trained with little to no decline in performance. Moreover, the selected coreset maintains excellent generalization ability.

[AI-119] VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and Optimized Search

链接: https://arxiv.org/abs/2409.17383
作者: Solmaz Seyed Monir,Irene Lau,Shubing Yang,Dongfang Zhao
关键词-EN: Traditional retrieval methods, assessing document similarity, capturing semantic nuances, Traditional retrieval, essential for assessing
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 10 pages, 14 figures

点击查看摘要

Abstract:Traditional retrieval methods have been essential for assessing document similarity but struggle with capturing semantic nuances. Despite advancements in latent semantic analysis (LSA) and deep learning, achieving comprehensive semantic understanding and accurate retrieval remains challenging due to high dimensionality and semantic gaps. The above challenges call for new techniques to effectively reduce the dimensions and close the semantic gaps. To this end, we propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval. By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy. Experiments on real-world datasets show that VectorSearch outperforms baseline metrics, demonstrating its efficacy for large-scale retrieval tasks.

[AI-120] slas Autopilot: Ethics and Tragedy

链接: https://arxiv.org/abs/2409.17380
作者: Aravinda Jatavallabha
关键词-EN: emphasizing Tesla Motors’, involving Tesla Autopilot, Tesla Motors’ moral, Motors’ moral responsibility, Tesla Autopilot
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This case study delves into the ethical ramifications of an incident involving Tesla’s Autopilot, emphasizing Tesla Motors’ moral responsibility. Using a seven-step ethical decision-making process, it examines user behavior, system constraints, and regulatory implications. This incident prompts a broader evaluation of ethical challenges in the automotive industry’s adoption of autonomous technologies, urging a reconsideration of industry norms and legal frameworks. The analysis offers a succinct exploration of ethical considerations in evolving technological landscapes.

[AI-121] Search for Efficient Large Language Models NEURIPS2024

链接: https://arxiv.org/abs/2409.17372
作者: Xuan Shen,Pu Zhao,Yifan Gong,Zhenglun Kong,Zheng Zhan,Yushu Wu,Ming Lin,Chao Wu,Xue Lin,Yanzhi Wang
关键词-EN: Large Language Models, Large Language, artificial intelligence research, long held sway, Language Models
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration.

[AI-122] he Overfocusing Bias of Convolutional Neural Networks: A Saliency-Guided Regularization Approach

链接: https://arxiv.org/abs/2409.17370
作者: David Bertoin,Eduardo Hugo Sanchez,Mehdi Zouitine,Emmanuel Rachelson
关键词-EN: computer vision, low-data regimes, transformers being considered, standard in computer, convolutional neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite transformers being considered as the new standard in computer vision, convolutional neural networks (CNNs) still outperform them in low-data regimes. Nonetheless, CNNs often make decisions based on narrow, specific regions of input images, especially when training data is limited. This behavior can severely compromise the model’s generalization capabilities, making it disproportionately dependent on certain features that might not represent the broader context of images. While the conditions leading to this phenomenon remain elusive, the primary intent of this article is to shed light on this observed behavior of neural networks. Our research endeavors to prioritize comprehensive insight and to outline an initial response to this phenomenon. In line with this, we introduce Saliency Guided Dropout (SGDrop), a pioneering regularization approach tailored to address this specific issue. SGDrop utilizes attribution methods on the feature map to identify and then reduce the influence of the most salient features during training. This process encourages the network to diversify its attention and not focus solely on specific standout areas. Our experiments across several visual classification benchmarks validate SGDrop’s role in enhancing generalization. Significantly, models incorporating SGDrop display more expansive attributions and neural activity, offering a more comprehensive view of input images in contrast to their traditionally trained counterparts.

[AI-123] Koopman-driven grip force prediction through EMG sensing

链接: https://arxiv.org/abs/2409.17340
作者: Tomislav Bazina,Ervin Kamenar,Maria Fonoberova,Igor Mezić
关键词-EN: impacts daily activities, multiple sclerosis significantly, sclerosis significantly impacts, significantly impacts daily, hand function due
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
*备注: 11 pages, 8 figures, journal

点击查看摘要

Abstract:Loss of hand function due to conditions like stroke or multiple sclerosis significantly impacts daily activities. Robotic rehabilitation provides tools to restore hand function, while novel methods based on surface electromyography (sEMG) enable the adaptation of the device’s force output according to the user’s condition, thereby improving rehabilitation outcomes. This study aims to achieve accurate force estimations during medium wrap grasps using a single sEMG sensor pair, thereby addressing the challenge of escalating sensor requirements for precise predictions. We conducted sEMG measurements on 13 subjects at two forearm positions, validating results with a hand dynamometer. We established flexible signal-processing steps, yielding high peak cross-correlations between the processed sEMG signal (representing meaningful muscle activity) and grip force. Influential parameters were subsequently identified through sensitivity analysis. Leveraging a novel data-driven Koopman operator theory-based approach and problem-specific data lifting techniques, we devised a methodology for the estimation and short-term prediction of grip force from processed sEMG signals. A weighted mean absolute percentage error (wMAPE) of approx. 5.5% was achieved for the estimated grip force, whereas predictions with a 0.5-second prediction horizon resulted in a wMAPE of approx. 17.9%. The methodology proved robust regarding precise electrode positioning, as the effect of sensing position on error metrics was non-significant. The algorithm executes exceptionally fast, processing, estimating, and predicting a 0.5-second sEMG signal batch in just approx. 30 ms, facilitating real-time implementation.

[AI-124] he Technology of Outrage: Bias in Artificial Intelligence

链接: https://arxiv.org/abs/2409.17336
作者: Will Bridewell,Paul F. Bello,Selmer Bringsjord
关键词-EN: offload decision making, learning are increasingly, offload decision, decision making, machine learning
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Distribution Statement A. Approved for public release; distribution is unlimited

点击查看摘要

Abstract:Artificial intelligence and machine learning are increasingly used to offload decision making from people. In the past, one of the rationales for this replacement was that machines, unlike people, can be fair and unbiased. Evidence suggests otherwise. We begin by entertaining the ideas that algorithms can replace people and that algorithms cannot be biased. Taken as axioms, these statements quickly lead to absurdity. Spurred on by this result, we investigate the slogans more closely and identify equivocation surrounding the word ‘bias.’ We diagnose three forms of outrage-intellectual, moral, and political-that are at play when people react emotionally to algorithmic bias. Then we suggest three practical approaches to addressing bias that the AI community could take, which include clarifying the language around bias, developing new auditing methods for intelligent systems, and building certain capabilities into these systems. We conclude by offering a moral regarding the conversations about algorithmic bias that may transfer to other areas of artificial intelligence.

[AI-125] Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting

链接: https://arxiv.org/abs/2409.17332
作者: Jay Zoellin,Colin Merk,Mischa Buob,Amr Saad,Samuel Giesser,Tahm Spitznagel,Ferhat Turgut,Rui Santos,Yukun Zhou,Sigfried Wagner,Pearse A. Keane,Yih Chung Tham,Delia Cabrera DeBuc,Matthias D. Becker,Gabor M. Somfai
关键词-EN: Integrating deep learning, greatly advance diagnostic, Integrating deep, self-supervised learning, DINORET
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this http URL , C. Merk and M. Buob contributed equally as shared-first authors. D. Cabrera DeBuc, M. D. Becker and G. M. Somfai contributed equally as senior authors for this work

点击查看摘要

Abstract:Integrating deep learning into medical imaging is poised to greatly advance diagnostic methods but it faces challenges with generalizability. Foundation models, based on self-supervised learning, address these issues and improve data efficiency. Natural domain foundation models show promise for medical imaging, but systematic research evaluating domain adaptation, especially using self-supervised learning and parameter-efficient fine-tuning, remains underexplored. Additionally, little research addresses the issue of catastrophic forgetting during fine-tuning of foundation models. We adapted the DINOv2 vision transformer for retinal imaging classification tasks using self-supervised learning and generated two novel foundation models termed DINORET and BE DINORET. Publicly available color fundus photographs were employed for model development and subsequent fine-tuning for diabetic retinopathy staging and glaucoma detection. We introduced block expansion as a novel domain adaptation strategy and assessed the models for catastrophic forgetting. Models were benchmarked to RETFound, a state-of-the-art foundation model in ophthalmology. DINORET and BE DINORET demonstrated competitive performance on retinal imaging tasks, with the block expanded model achieving the highest scores on most datasets. Block expansion successfully mitigated catastrophic forgetting. Our few-shot learning studies indicated that DINORET and BE DINORET outperform RETFound in terms of data-efficiency. This study highlights the potential of adapting natural domain vision models to retinal imaging using self-supervised learning and block expansion. BE DINORET offers robust performance without sacrificing previously acquired capabilities. Our findings suggest that these methods could enable healthcare institutions to develop tailored vision models for their patient populations, enhancing global healthcare inclusivity.

[AI-126] KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation

链接: https://arxiv.org/abs/2409.17315
作者: Anantaa Kotal,Anupam Joshi
关键词-EN: Generative Deep Learning, Deep Learning models, including differential privacy, differential privacy techniques, provable privacy guarantee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The integration of privacy measures, including differential privacy techniques, ensures a provable privacy guarantee for the synthetic data. However, challenges arise for Generative Deep Learning models when tasked with generating realistic data, especially in critical domains such as Cybersecurity and Healthcare. Generative Models optimized for continuous data struggle to model discrete and non-Gaussian features that have domain constraints. Challenges increase when the training datasets are limited and not diverse. In such cases, generative models create synthetic data that repeats sensitive features, which is a privacy risk. Moreover, generative models face difficulties comprehending attribute constraints in specialized domains. This leads to the generation of unrealistic data that impacts downstream accuracy. To address these issues, this paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation. The novel framework augments the training of generative models with supplementary context about attribute values and enforces domain constraints during training. This added guidance enhances the model’s capacity to generate realistic and domain-compliant synthetic data. The proposed model is evaluated on real-world datasets, specifically in the domains of Cybersecurity and Healthcare, where domain constraints and rules add to the complexity of the data. Our experiments evaluate the privacy resilience and downstream accuracy of the model against benchmark methods, demonstrating its effectiveness in addressing the balance between privacy preservation and data accuracy in complex domains.

[AI-127] Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation EMNLP2024

链接: https://arxiv.org/abs/2409.17313
作者: Zehao Wang,Minye Wu,Yixin Cao,Yubo Ma,Meiqi Chen,Tinne Tuytelaars
关键词-EN: study presents, instruction categories, evaluation framework, VLN, Vision-Language Navigation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Findings; project page: this https URL

点击查看摘要

Abstract:This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models for various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task. The CFG serves as the basis for the problem decomposition and the core premise of the instruction categories design. We propose a semi-automatic method for CFG construction with the help of Large-Language Models (LLMs). Then, we induct and generate data spanning five principal instruction categories (i.e. direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other interesting findings contribute to the development of future language-guided navigation systems.

[AI-128] A Hybrid Quantum-Classical AI-Based Detection Strategy for Generative Adversarial Network-Based Deepfake Attacks on an Autonomous Vehicle Traffic Sign Classification System

链接: https://arxiv.org/abs/2409.17311
作者: M Sabbir Salek,Shaozhi Li,Mashrur Chowdhury
关键词-EN: traffic sign, traffic sign classification, traffic sign image, deepfake traffic sign, deep learning-based models
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The perception module in autonomous vehicles (AVs) relies heavily on deep learning-based models to detect and identify various objects in their surrounding environment. An AV traffic sign classification system is integral to this module, which helps AVs recognize roadway traffic signs. However, adversarial attacks, in which an attacker modifies or alters the image captured for traffic sign recognition, could lead an AV to misrecognize the traffic signs and cause hazardous consequences. Deepfake presents itself as a promising technology to be used for such adversarial attacks, in which a deepfake traffic sign would replace a real-world traffic sign image before the image is fed to the AV traffic sign classification system. In this study, the authors present how a generative adversarial network-based deepfake attack can be crafted to fool the AV traffic sign classification systems. The authors developed a deepfake traffic sign image detection strategy leveraging hybrid quantum-classical neural networks (NNs). This hybrid approach utilizes amplitude encoding to represent the features of an input traffic sign image using quantum states, which substantially reduces the memory requirement compared to its classical counterparts. The authors evaluated this hybrid deepfake detection approach along with several baseline classical convolutional NNs on real-world and deepfake traffic sign images. The results indicate that the hybrid quantum-classical NNs for deepfake detection could achieve similar or higher performance than the baseline classical convolutional NNs in most cases while requiring less than one-third of the memory required by the shallowest classical convolutional NN considered in this study.

[AI-129] Neural Network Plasticity and Loss Sharpness

链接: https://arxiv.org/abs/2409.17300
作者: Max Koster,Jude Kukla
关键词-EN: increasingly popular research, popular research field, research field due, evolve over time, gearing towards complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, continual learning, a prediction setting in which the problem environment may evolve over time, has become an increasingly popular research field due to the framework’s gearing towards complex, non-stationary objectives. Learning such objectives requires plasticity, or the ability of a neural network to adapt its predictions to a different task. Recent findings indicate that plasticity loss on new tasks is highly related to loss landscape sharpness in non-stationary RL frameworks. We explore the usage of sharpness regularization techniques, which seek out smooth minima and have been touted for their generalization capabilities in vanilla prediction settings, in efforts to combat plasticity loss. Our findings indicate that such techniques have no significant effect on reducing plasticity loss.

[AI-130] SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

链接: https://arxiv.org/abs/2409.17285
作者: Jee-weon Jung,Yihan Wu,Xin Wang,Ji-Hoon Kim,Soumi Maiti,Yuta Matsunaga,Hye-jin Shim,Jinchuan Tian,Nicholas Evans,Joon Son Chung,Wangyou Zhang,Seyun Um,Shinnosuke Takamichi,Shinji Watanabe
关键词-EN: Speech Deepfake Detection, Automatic Speaker Verification, Spoofing-robust Automatic Speaker, Deepfake Detection, Spoofing-robust Automatic
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 9 pages, 2 figures, 8 tables

点击查看摘要

Abstract:This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, existing datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Existing SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. We present SpoofCeleb, which leverages a fully automated pipeline that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. The resulting SpoofCeleb dataset comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We provide baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at this https URL.

[AI-131] Memory Networks: Towards Fully Biologically Plausible Learning

链接: https://arxiv.org/abs/2409.17282
作者: Jacobo Ruiz,Manas Gupta
关键词-EN: intelligence faces significant, faces significant challenges, artificial intelligence faces, visual learning tasks, intelligence faces
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 2024

点击查看摘要

Abstract:The field of artificial intelligence faces significant challenges in achieving both biological plausibility and computational efficiency, particularly in visual learning tasks. Current artificial neural networks, such as convolutional neural networks, rely on techniques like backpropagation and weight sharing, which do not align with the brain’s natural information processing methods. To address these issues, we propose the Memory Network, a model inspired by biological principles that avoids backpropagation and convolutions, and operates in a single pass. This approach enables rapid and efficient learning, mimicking the brain’s ability to adapt quickly with minimal exposure to data. Our experiments demonstrate that the Memory Network achieves efficient and biologically plausible learning, showing strong performance on simpler datasets like MNIST. However, further refinement is needed for the model to handle more complex datasets such as CIFAR10, highlighting the need to develop new algorithms and techniques that closely align with biological processes while maintaining computational efficiency.

[AI-132] On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains

链接: https://arxiv.org/abs/2409.17275
作者: Xun Xian,Ganghua Wang,Xuan Bi,Jayanth Srinivasa,Ashish Kundu,Charles Fleming,Mingyi Hong,Jie Ding
关键词-EN: large language models, Retrieval-Augmented Generation, language models, legal contexts, empirically shown
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs’ generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q\A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any users, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query’s embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q\A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.

[AI-133] Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning

链接: https://arxiv.org/abs/2409.17270
作者: Debargha Ganguly,Srinivasan Iyengar,Vipin Chaudhary,Shivkumar Kalyanaraman
关键词-EN: Large Language Models, natural language processing, revolutionized natural language, complex logical sequences, Large Language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning, particularly in novel domains and complex logical sequences. This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs. Our approach bridges LLM-generated ideas with formal logic verification, employing a custom interpreter to convert LLM outputs into First Order Logic constructs for theorem prover scrutiny. Central to our method is an intermediary JSON-based Domain-Specific Language, which by design balances precise logical structures with intuitive human concepts. This hybrid representation enables both rigorous validation and accessible human comprehension of LLM reasoning processes. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge, and a flexible architecture that allows for easy extension to various domain-specific applications. We demonstrate Proof of Thought’s effectiveness through benchmarking on StrategyQA and a novel multimodal reasoning task, showing improved performance in open-ended scenarios. By providing verifiable and interpretable results, our technique addresses critical needs for AI system accountability and sets a foundation for human-in-the-loop oversight in high-stakes domains.

[AI-134] Model aggregation: minimizing empirical variance outperforms minimizing empirical error

链接: https://arxiv.org/abs/2409.17267
作者: Théo Bourdais,Houman Owhadi
关键词-EN: deterministic or stochastic, Minimal Variance Aggregation, designed to approximate, approximate a specific, aggregation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: The code in this paper is available for download at this https URL

点击查看摘要

Abstract:Whether deterministic or stochastic, models can be viewed as functions designed to approximate a specific quantity of interest. We propose a data-driven framework that aggregates predictions from diverse models into a single, more accurate output. This aggregation approach exploits each model’s strengths to enhance overall accuracy. It is non-intrusive - treating models as black-box functions - model-agnostic, requires minimal assumptions, and can combine outputs from a wide range of models, including those from machine learning and numerical solvers. We argue that the aggregation process should be point-wise linear and propose two methods to find an optimal aggregate: Minimal Error Aggregation (MEA), which minimizes the aggregate’s prediction error, and Minimal Variance Aggregation (MVA), which minimizes its variance. While MEA is inherently more accurate when correlations between models and the target quantity are perfectly known, Minimal Empirical Variance Aggregation (MEVA), an empirical version of MVA - consistently outperforms Minimal Empirical Error Aggregation (MEEA), the empirical counterpart of MEA, when these correlations must be estimated from data. The key difference is that MEVA constructs an aggregate by estimating model errors, while MEEA treats the models as features for direct interpolation of the quantity of interest. This makes MEEA more susceptible to overfitting and poor generalization, where the aggregate may underperform individual models during testing. We demonstrate the versatility and effectiveness of our framework in various applications, such as data science and partial differential equations, showing how it successfully integrates traditional solvers with machine learning models to improve both robustness and accuracy.

[AI-135] AAPM: Large Language Model Agent -based Asset Pricing Models

链接: https://arxiv.org/abs/2409.17266
作者: Junyan Cheng,Peter Chin
关键词-EN: LLM Agent-based Asset, Agent-based Asset Pricing, LLM Agent-based, excess asset returns, fuses qualitative discretionary
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:In this study, we propose a novel asset pricing approach, LLM Agent-based Asset Pricing Models (AAPM), which fuses qualitative discretionary investment analysis from LLM agents and quantitative manual financial economic factors to predict excess asset returns. The experimental results show that our approach outperforms machine learning-based asset pricing baselines in portfolio optimization and asset pricing errors. Specifically, the Sharpe ratio and average |\alpha| for anomaly portfolios improved significantly by 9.6% and 10.8% respectively. In addition, we conducted extensive ablation studies on our model and analysis of the data to reveal further insights into the proposed method.

[AI-136] Collaborative Comic Generation: Integrating Visual Narrative Theories with AI Models for Enhanced Creativity ECAI

链接: https://arxiv.org/abs/2409.17263
作者: Yi-Chun Chen,Arnav Jhala
关键词-EN: integrates conceptual principles-comic, conceptual principles-comic authoring, principles-comic authoring idioms-with, authoring idioms-with generative, theory-inspired visual narrative
类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted for oral presentation at CREAI2024, ECAI, 2024. However, the author’s attendance is currently uncertain due to visa issues

点击查看摘要

Abstract:This study presents a theory-inspired visual narrative generative system that integrates conceptual principles-comic authoring idioms-with generative and language models to enhance the comic creation process. Our system combines human creativity with AI models to support parts of the generative process, providing a collaborative platform for creating comic content. These comic-authoring idioms, derived from prior human-created image sequences, serve as guidelines for crafting and refining storytelling. The system translates these principles into system layers that facilitate comic creation through sequential decision-making, addressing narrative elements such as panel composition, story tension changes, and panel transitions. Key contributions include integrating machine learning models into the human-AI cooperative comic generation process, deploying abstract narrative theories into AI-driven comic creation, and a customizable tool for narrative-driven image sequences. This approach improves narrative elements in generated image sequences and engages human creativity in an AI-generative process of comics. We open-source the code at this https URL.

[AI-137] Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies

链接: https://arxiv.org/abs/2409.17216
作者: Ritwik Gupta,Leah Walker,Rodolfo Corona,Stephanie Fu,Suzanne Petryk,Janet Napolitano,Trevor Darrell,Andrew W. Reddie
关键词-EN: Current regulations, regulations on powerful, narrowly focused, Current, models
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current regulations on powerful AI capabilities are narrowly focused on “foundation” or “frontier” models. However, these terms are vague and inconsistently defined, leading to an unstable foundation for governance efforts. Critically, policy debates often fail to consider the data used with these models, despite the clear link between data and model performance. Even (relatively) “small” models that fall outside the typical definitions of foundation and frontier models can achieve equivalent outcomes when exposed to sufficiently specific datasets. In this work, we illustrate the importance of considering dataset size and content as essential factors in assessing the risks posed by models both today and in the future. More broadly, we emphasize the risk posed by over-regulating reactively and provide a path towards careful, quantitative evaluation of capabilities that can lead to a simplified regulatory environment.

[AI-138] Plurals: A System for Guiding LLMs Via Simulated Social Ensembles

链接: https://arxiv.org/abs/2409.17213
作者: Joshua Ashkinaze,Emily Fry,Narendra Edara,Eric Gilbert,Ceren Budak
关键词-EN: Recent debates raised, debates raised concerns, Recent debates, debates raised, raised concerns
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Recent debates raised concerns that language models may favor certain viewpoints. But what if the solution is not to aim for a ‘view from nowhere’ but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) which deliberate within customizable Structures, with Moderators overseeing deliberation. Plurals is a generator of simulated social ensembles. Plurals integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by democratic deliberation theory, and allows users to customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at this https URL and will be continually updated.

[AI-139] 2024 BRAVO Challenge Track 1 1st Place Report: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation

链接: https://arxiv.org/abs/2409.17208
作者: Tommie Kerssies,Daan de Geus,Gijs Dubbelman
关键词-EN: BRAVO Challenge, trained on Cityscapes, robustness is evaluated, solution for Track, present our solution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: arXiv admin note: substantial text overlap with arXiv:2409.15107

点击查看摘要

Abstract:In this report, we present our solution for Track 1 of the 2024 BRAVO Challenge, where a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models, by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches, and achieves 1st place in the challenge. Our code is publicly available at this https URL.

[AI-140] Enhancing Guardrails for Safe and Secure Healthcare AI

链接: https://arxiv.org/abs/2409.17190
作者: Ananya Gangavarapu
关键词-EN: holds immense promise, numerous innovative applications, Generative AI holds, addressing global healthcare, global healthcare access
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative AI holds immense promise in addressing global healthcare access challenges, with numerous innovative applications now ready for use across various healthcare domains. However, a significant barrier to the widespread adoption of these domain-specific AI solutions is the lack of robust safety mechanisms to effectively manage issues such as hallucination, misinformation, and ensuring truthfulness. Left unchecked, these risks can compromise patient safety and erode trust in healthcare AI systems. While general-purpose frameworks like Llama Guard are useful for filtering toxicity and harmful content, they do not fully address the stringent requirements for truthfulness and safety in healthcare contexts. This paper examines the unique safety and security challenges inherent to healthcare AI, particularly the risk of hallucinations, the spread of misinformation, and the need for factual accuracy in clinical settings. I propose enhancements to existing guardrails frameworks, such as Nvidia NeMo Guardrails, to better suit healthcare-specific needs. By strengthening these safeguards, I aim to ensure the secure, reliable, and accurate use of AI in healthcare, mitigating misinformation risks and improving patient safety.

[AI-141] Fully automatic extraction of morphological traits from the Web: utopia or reality?

链接: https://arxiv.org/abs/2409.17179
作者: Diego Marcos,Robert van de Vlasakker,Ioannis N. Athanasiadis,Pierre Bonnet,Hervé Goeau,Alexis Joly,W. Daniel Kissling,César Leblanc,André S.J. van Proosdij,Konstantinos P. Panousis
关键词-EN: Plant morphological traits, observable characteristics, fundamental to understand, understand the role, role played
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Plant morphological traits, their observable characteristics, are fundamental to understand the role played by each species within their ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, massive amounts of information about species descriptions is available online in the form of text, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method managed to find values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, being limited by the availability of textual descriptions covering all the traits of interest.

[AI-142] CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Casual Significance and Consistency

链接: https://arxiv.org/abs/2409.17174
作者: Kangsheng Wang,Xiao Zhang,Zizheng Guo,Tianyu Hu,Huimin Ma
关键词-EN: large language models, causal significance, significance and consistency, reasoning, solving reasoning tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Chain-based reasoning methods like chain of thought (CoT) play a rising role in solving reasoning tasks for large language models (LLMs). However, the causal illusions between \textita step of reasoning and \textitcorresponding state transitions are becoming a significant obstacle to advancing LLMs’ reasoning capabilities, especially in long-range reasoning tasks. This paper proposes a non-chain-based reasoning framework for simultaneous consideration of causal significance and consistency, i.e., the Causal Significance and Consistency Enhancer (CSCE). We customize LLM’s loss function utilizing treatment effect assessments to enhance its reasoning ability from two aspects: causal significance and consistency. This ensures that the model captures essential causal relationships and maintains robust and consistent performance across various scenarios. Additionally, we transform the reasoning process from the cascading multiple one-step reasoning commonly used in Chain-Based methods, like CoT, to a causal-enhanced method that outputs the entire reasoning process in one go, further improving the model’s reasoning efficiency. Extensive experiments show that our method improves both the reasoning success rate and speed. These improvements further demonstrate that non-chain-based methods can also aid LLMs in completing reasoning tasks.

[AI-143] A Multiple-Fill-in-the-Blank Exam Approach for Enhancing Zero-Resource Hallucination Detection in Large Language Models

链接: https://arxiv.org/abs/2409.17173
作者: Satoshi Munakata,Taku Fukui,Takao Mohri
关键词-EN: Large language models, Large language, language models, fabricate a hallucinatory, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 20 pages

点击查看摘要

Abstract:Large language models (LLMs) often fabricate a hallucinatory text. Several methods have been developed to detect such text by semantically comparing it with the multiple versions probabilistically regenerated. However, a significant issue is that if the storyline of each regenerated text changes, the generated texts become incomparable, which worsen detection accuracy. In this paper, we propose a hallucination detection method that incorporates a multiple-fill-in-the-blank exam approach to address this storyline-changing issue. First, our method creates a multiple-fill-in-the-blank exam by masking multiple objects from the original text. Second, prompts an LLM to repeatedly answer this exam. This approach ensures that the storylines of the exam answers align with the original ones. Finally, quantifies the degree of hallucination for each original sentence by scoring the exam answers, considering the potential for \emphhallucination snowballing within the original text itself. Experimental results show that our method alone not only outperforms existing methods, but also achieves clearer state-of-the-art performance in the ensembles with existing methods.

[AI-144] What Would You Ask When You First Saw a2b2=c2? Evaluating LLM on Curiosity-Driven Questioning

链接: https://arxiv.org/abs/2409.17172
作者: Shashidhar Reddy Javaji,Zining Zhu
关键词-EN: knowledge remains unknown, remains unknown, store a massive, massive amount, knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person when facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulties, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen’s kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems

[AI-145] Cross-Domain Content Generation with Domain-Specific Small Language Models

链接: https://arxiv.org/abs/2409.17171
作者: Ankit Maloo Abhinav Garg
关键词-EN: small language models, language models poses, small language, minimal overlap, models poses challenges
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages

点击查看摘要

Abstract:Generating domain-specific content using small language models poses challenges, especially when dealing with multiple distinct datasets with minimal overlap. In this study, we explore methods to enable a small language model to produce coherent and relevant outputs for two different domains: stories (Dataset A) and recipes (Dataset B). Our initial experiments show that training individual models on each dataset yields satisfactory results, with each model generating appropriate content within its domain. We find that utilizing custom tokenizers tailored to each dataset significantly enhances generation quality compared to using a generic tokenizer. Attempts to adapt a single model to both domains using Low-Rank Adaptation (LoRA) or standard fine-tuning do not yield substantial results, often failing to produce meaningful outputs. Moreover, full fine-tuning without freezing the model’s existing weights leads to catastrophic forgetting, where the model loses previously learned information and only retains knowledge from the new data. To overcome these challenges, we employ a knowledge expansion strategy: training only with additional parameters. This approach enables the model to generate both stories and recipes upon request, effectively handling multiple domains without suffering from catastrophic forgetting. Our findings demonstrate that knowledge expansion with frozen layers is an effective method for small language models to generate domain-specific content across distinct datasets. This work contributes to the development of efficient multi-domain language models and provides insights into managing catastrophic forgetting in small-scale architectures.

[AI-146] REAL: Response Embedding-based Alignment for LLMs

链接: https://arxiv.org/abs/2409.17169
作者: Honggen Zhang,Igor Molybog,June Zhang,Xufeng Zhao
关键词-EN: Aligning large language, Aligning large, Direct Preference Optimization, Preference Optimization rely, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, which usually involve training on supervised datasets. Popular algorithms such as Direct Preference Optimization rely on pairs of AI-generated responses ranked according to human feedback. The labeling process is the most labor-intensive and costly part of the alignment pipeline, and improving its efficiency would have a meaningful impact on AI development. We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs for labeling out of a set of AI-generated responses. Experimental results on synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. We also applied our method to the real-world dataset SHP2, selecting optimal pairs from multiple responses. The model aligned on dissimilar response pairs obtained the best win rate on the dialogue task. Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators’ work.

[AI-147] StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?

链接: https://arxiv.org/abs/2409.17167
作者: Guobin Shen,Dongcheng Zhao,Aorigele Bao,Xiang He,Yiting Dong,Yi Zeng
关键词-EN: Large Language Models, Language Models, Large Language, stress, LLMs
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 9 figures

点击查看摘要

Abstract:Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.

[AI-148] ScriptSmith: A Unified LLM Framework for Enhancing IT Operations via Automated Bash Script Generation Assessment and Refinement

链接: https://arxiv.org/abs/2409.17166
作者: Oishik Chatterjee,Pooja Aggarwal,Suranjana Samanta,Ting Dai,Prateeti Mohapatra,Debanjana Kar,Ruchi Mahindru,Steve Barbieri,Eugen Postea,Brad Blancett,Arthur De Magalhaes
关键词-EN: site reliability engineering, rapidly evolving landscape, site reliability, reliability engineering, applications is paramount
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Under Review

点击查看摘要

Abstract:In the rapidly evolving landscape of site reliability engineering (SRE), the demand for efficient and effective solutions to manage and resolve issues in site and cloud applications is paramount. This paper presents an innovative approach to action automation using large language models (LLMs) for script generation, assessment, and refinement. By leveraging the capabilities of LLMs, we aim to significantly reduce the human effort involved in writing and debugging scripts, thereby enhancing the productivity of SRE teams. Our experiments focus on Bash scripts, a commonly used tool in SRE, and involve the CodeSift dataset of 100 tasks and the InterCode dataset of 153 tasks. The results show that LLMs can automatically assess and refine scripts efficiently, reducing the need for script validation in an execution environment. Results demonstrate that the framework shows an overall improvement of 7-10% in script generation.

[AI-149] Cross Dataset Analysis and Network Architecture Repair for Autonomous Car Lane Detection

链接: https://arxiv.org/abs/2409.17158
作者: Parth Ganeriwala,Siddhartha Bhattacharyya,Raja Muthalagu
关键词-EN: isolated learning paradigm, utilizing knowledge acquired, Transfer Learning, inducing transfer learning, isolated learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer Learning has become one of the standard methods to solve problems to overcome the isolated learning paradigm by utilizing knowledge acquired for one task to solve another related one. However, research needs to be done, to identify the initial steps before inducing transfer learning to applications for further verification and explainablity. In this research, we have performed cross dataset analysis and network architecture repair for the lane detection application in autonomous vehicles. Lane detection is an important aspect of autonomous vehicles driving assistance system. In most circumstances, modern deep-learning-based lane recognition systems are successful, but they struggle with lanes with complex topologies. The proposed architecture, ERFCondLaneNet is an enhancement to the CondlaneNet used for lane identification framework to solve the difficulty of detecting lane lines with complex topologies like dense, curved and fork lines. The newly proposed technique was tested on two common lane detecting benchmarks, CULane and CurveLanes respectively, and two different backbones, ResNet and ERFNet. The researched technique with ERFCondLaneNet, exhibited similar performance in comparison to ResnetCondLaneNet, while using 33% less features, resulting in a reduction of model size by 46%.

[AI-150] Confident Teacher Confident Student? A Novel User Study Design for Investigating the Didactic Potential of Explanations and their Impact on Uncertainty ECML2024

链接: https://arxiv.org/abs/2409.17157
作者: Teodor Chiaburu,Frank Haußer,Felix Bießmann
关键词-EN: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, potential of XAI, research community
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 15 pages, 5 figures, 1 table, presented at ECML 2024, AIMLAI Workshop, Vilnius

点击查看摘要

Abstract:Evaluating the quality of explanations in Explainable Artificial Intelligence (XAI) is to this day a challenging problem, with ongoing debate in the research community. While some advocate for establishing standardized offline metrics, others emphasize the importance of human-in-the-loop (HIL) evaluation. Here we propose an experimental design to evaluate the potential of XAI in human-AI collaborative settings as well as the potential of XAI for didactics. In a user study with 1200 participants we investigate the impact of explanations on human performance on a challenging visual task - annotation of biological species in complex taxonomies. Our results demonstrate the potential of XAI in complex visual annotation tasks: users become more accurate in their annotations and demonstrate less uncertainty with AI assistance. The increase in accuracy was, however, not significantly different when users were shown the mere prediction of the model compared to when also providing an explanation. We also find negative effects of explanations: users tend to replicate the model’s predictions more often when shown explanations, even when those predictions are wrong. When evaluating the didactic effects of explanations in collaborative human-AI settings, we find that users’ annotations are not significantly better after performing annotation with AI assistance. This suggests that explanations in visual human-AI collaboration do not appear to induce lasting learning effects. All code and experimental data can be found in our GitHub repository: this https URL.

[AI-151] PhantomLiDAR: Cross-modality Signal Injection Attacks against LiDAR

链接: https://arxiv.org/abs/2409.17907
作者: Zizhi Jin,Qinhong Jiang,Xuancun Lu,Chen Yan,Xiaoyu Ji,Wenyuan Xu
关键词-EN: Light Detection, Detection and Ranging, offering precise, spatial information, autonomous driving
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:LiDAR (Light Detection and Ranging) is a pivotal sensor for autonomous driving, offering precise 3D spatial information. Previous signal attacks against LiDAR systems mainly exploit laser signals. In this paper, we investigate the possibility of cross-modality signal injection attacks, i.e., injecting intentional electromagnetic interference (IEMI) to manipulate LiDAR output. Our insight is that the internal modules of a LiDAR, i.e., the laser receiving circuit, the monitoring sensors, and the beam-steering modules, even with strict electromagnetic compatibility (EMC) testing, can still couple with the IEMI attack signals and result in the malfunction of LiDAR systems. Based on the above attack surfaces, we propose the PhantomLiDAR attack, which manipulates LiDAR output in terms of Points Interference, Points Injection, Points Removal, and even LiDAR Power-Off. We evaluate and demonstrate the effectiveness of PhantomLiDAR with both simulated and real-world experiments on five COTS LiDAR systems. We also conduct feasibility experiments in real-world moving scenarios. We provide potential defense measures that can be implemented at both the sensor level and the vehicle system level to mitigate the risks associated with IEMI attacks. Video demonstrations can be viewed at this https URL.

[AI-152] Revisiting Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

链接: https://arxiv.org/abs/2409.17899
作者: Yujia Sun,Zeyu Zhao,Korin Richmond,Yuanchao Li
关键词-EN: music SSL models, SSL models, speech and music, Emotion recognition, Music Emotion Recognition
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, given the fact that SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotion speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Frechet audio distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary depending on different emotions due to their training strategies and domain-specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from each other. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential for cross-domain generalization to improve SER and MER systems.

[AI-153] Let the Quantum Creep In: Designing Quantum Neural Network Models by Gradually Swapping Out Classical Components

链接: https://arxiv.org/abs/2409.17583
作者: Peiyong Wang,Casey. R. Myers,Lloyd C. L. Hollenberg,Udaya Parampalli
关键词-EN: Artificial Intelligence, quantum neural network, neural network, classical neural network, quantum
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 50 pages (including Appendix), many figures, accepted as a poster on QTML2024. Code available at this https URL

点击查看摘要

Abstract:Artificial Intelligence (AI), with its multiplier effect and wide applications in multiple areas, could potentially be an important application of quantum computing. Since modern AI systems are often built on neural networks, the design of quantum neural networks becomes a key challenge in integrating quantum computing into AI. To provide a more fine-grained characterisation of the impact of quantum components on the performance of neural networks, we propose a framework where classical neural network layers are gradually replaced by quantum layers that have the same type of input and output while keeping the flow of information between layers unchanged, different from most current research in quantum neural network, which favours an end-to-end quantum model. We start with a simple three-layer classical neural network without any normalisation layers or activation functions, and gradually change the classical layers to the corresponding quantum versions. We conduct numerical experiments on image classification datasets such as the MNIST, FashionMNIST and CIFAR-10 datasets to demonstrate the change of performance brought by the systematic introduction of quantum components. Through this framework, our research sheds new light on the design of future quantum neural network models where it could be more favourable to search for methods and frameworks that harness the advantages from both the classical and quantum worlds.

[AI-154] NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes NEURIPS2024

链接: https://arxiv.org/abs/2409.17510
作者: Ziquan Wei,Tingting Dan,Jiaqi Ding,Paul J Laurienti,Guorong Wu
关键词-EN: modern imaging technologies, fluctuations emerge remarkable, emerge remarkable cognition, brain regions in-vivo, spontaneous functional fluctuations
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Although modern imaging technologies allow us to study connectivity between two distinct brain regions in-vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations emerge remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating such network neuroscience question into an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop interacted by brain structure and function. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biological-inspired deep model, coined as NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance by our NeuroPath indicates great potential in network neuroscience.

[AI-155] Adjusting Regression Models for Conditional Uncertainty Calibration

链接: https://arxiv.org/abs/2409.17466
作者: Ruijiang Gao,Mingzhang Yin,James McInerney,Nathan Kallus
关键词-EN: finite-sample distribution-free marginal, marginal coverage guarantees, distribution-free marginal coverage, conditional coverage guarantees, Conformal Prediction
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Machine Learning Special Issue on Uncertainty Quantification

点击查看摘要

Abstract:Conformal Prediction methods have finite-sample distribution-free marginal coverage guarantees. However, they generally do not offer conditional coverage guarantees, which can be important for high-stakes decisions. In this paper, we propose a novel algorithm to train a regression function to improve the conditional coverage after applying the split conformal prediction procedure. We establish an upper bound for the miscoverage gap between the conditional coverage and the nominal coverage rate and propose an end-to-end algorithm to control this upper bound. We demonstrate the efficacy of our method empirically on synthetic and real-world datasets.

[AI-156] Solar Active Regions Emergence Prediction Using Long Short-Term Memory Networks

链接: https://arxiv.org/abs/2409.17421
作者: Spiridon Kasapis,Irina N. Kitiashvili,Alexander G. Kosovichev,John T. Stefan
关键词-EN: Long Short-Term Memory, developed Long Short-Term, Short-Term Memory, developed Long, Long Short-Term
类目: olar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 5 tables, under review at the AAS Astrophysical Journal

点击查看摘要

Abstract:We developed Long Short-Term Memory (LSTM) models to predict the formation of active regions (ARs) on the solar surface. Using the Doppler shift velocity, the continuum intensity, and the magnetic field observations from the Solar Dynamics Observatory (SDO) Helioseismic and Magnetic Imager (HMI), we have created time-series datasets of acoustic power and magnetic flux, which are used to train LSTM models on predicting continuum intensity, 12 hours in advance. These novel machine learning (ML) models are able to capture variations of the acoustic power density associated with upcoming magnetic flux emergence and continuum intensity decrease. Testing of the models’ performance was done on data for 5 ARs, unseen from the models during training. Model 8, the best performing model trained, was able to make a successful prediction of emergence for all testing active regions in an experimental setting and three of them in an operational. The model predicted the emergence of AR11726, AR13165, and AR13179 respectively 10, 29, and 5 hours in advance, and variations of this model achieved average RMSE values of 0.11 for both active and quiet areas on the solar disc. This work sets the foundations for ML-aided prediction of solar ARs.

[AI-157] Disk2Planet: A Robust and Automated Machine Learning Tool for Parameter Inference in Disk-Planet Systems

链接: https://arxiv.org/abs/2409.17228
作者: Shunyuan Mao,Ruobing Dong,Kwang Moo Yi,Lu Lu,Sifan Wang,Paris Perdikaris
关键词-EN: observed protoplanetary disk, infer key parameters, protoplanetary disk structures, machine learning-based tool, Protoplanetary Disk Operator
类目: Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to ApJ

点击查看摘要

Abstract:We introduce Disk2Planet, a machine learning-based tool to infer key parameters in disk-planet systems from observed protoplanetary disk structures. Disk2Planet takes as input the disk structures in the form of two-dimensional density and velocity maps, and outputs disk and planet properties, that is, the Shakura–Sunyaev viscosity, the disk aspect ratio, the planet–star mass ratio, and the planet’s radius and azimuth. We integrate the Covariance Matrix Adaptation Evolution Strategy (CMA–ES), an evolutionary algorithm tailored for complex optimization problems, and the Protoplanetary Disk Operator Network (PPDONet), a neural network designed to predict solutions of disk–planet interactions. Our tool is fully automated and can retrieve parameters in one system in three minutes on an Nvidia A100 graphics processing unit. We empirically demonstrate that our tool achieves percent-level or higher accuracy, and is able to handle missing data and unknown levels of noise.

[AI-158] ransfer learning for financial data predictions: a systematic review

链接: https://arxiv.org/abs/2409.17183
作者: V. Lanzetta
关键词-EN: time series data, financial time series, series data pose, data pose significant, time series
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 43 pages, 5 tables, 1 figure

点击查看摘要

Abstract:Literature highlighted that financial time series data pose significant challenges for accurate stock price prediction, because these data are characterized by noise and susceptibility to news; traditional statistical methodologies made assumptions, such as linearity and normality, which are not suitable for the non-linear nature of financial time series; on the other hand, machine learning methodologies are able to capture non linear relationship in the data. To date, neural network is considered the main machine learning tool for the financial prices prediction. Transfer Learning, as a method aimed at transferring knowledge from source tasks to target tasks, can represent a very useful methodological tool for getting better financial prediction capability. Current reviews on the above body of knowledge are mainly focused on neural network architectures, for financial prediction, with very little emphasis on the transfer learning methodology; thus, this paper is aimed at going deeper on this topic by developing a systematic review with respect to application of Transfer Learning for financial market predictions and to challenges/potential future directions of the transfer learning methodologies for stock market predictions.

计算机视觉

[CV-0] FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner NEURIPS2024

链接: https://arxiv.org/abs/2409.18128
作者: Wenliang Zhao,Minglei Shi,Xumin Yu,Jie Zhou,Jiwen Lu
关键词-EN: flow-based models, flow-based models reemerge, models, prominent family, achieved competitive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Building on the success of diffusion models in visual generation, flow-based models reemerge as another prominent family of generative models that have achieved competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow-matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models for which fast samplers are well-developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor’s outputs in the flow-based models will become stable during the sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques including a pseudo corrector and sample-aware compilation to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied for various tasks such as image editing, inpainting, etc. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1% \sim 58.3% on class-conditional generation and 29.8% \sim 38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet with 100 (ms / img) and FID of 3.93 with 38 (ms / img), achieving the real-time image generation and establishing the new state-of-the-art. Code is available at this https URL.

[CV-1] EgoLM: Multi-Modal Language Model of Egocentric Motions

链接: https://arxiv.org/abs/2409.18127
作者: Fangzhou Hong,Vladimir Guzov,Hyo Jin Kim,Yuting Ye,Richard Newcombe,Ziwei Liu,Lingni Ma
关键词-EN: wearable devices, prevalence of wearable, essential to develop, develop contextual, egocentric
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:As the prevalence of wearable devices, learning egocentric motions becomes essential to develop contextual AI. In this work, we present EgoLM, a versatile framework that tracks and understands egocentric motions from multi-modal inputs, e.g., egocentric videos and motion sensors. EgoLM exploits rich contexts for the disambiguation of egomotion tracking and understanding, which are ill-posed under single modality conditions. To facilitate the versatile and multi-modal framework, our key insight is to model the joint distribution of egocentric motions and natural languages using large language models (LLM). Multi-modal sensor inputs are encoded and projected to the joint latent space of language models, and used to prompt motion generation or text generation for egomotion tracking or understanding, respectively. Extensive experiments on large-scale multi-modal human motion dataset validate the effectiveness of EgoLM as a generalist model for universal egocentric learning.

[CV-2] LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

链接: https://arxiv.org/abs/2409.18125
作者: Chenming Zhu,Tai Wang,Wenwei Zhang,Jiangmiao Pang,Xihui Liu
关键词-EN: Large Multimodal Models, Multimodal Models, Large Multimodal, Recent advancements, advancements in Large
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

[CV-3] Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

链接: https://arxiv.org/abs/2409.18124
作者: Jing He,Haodong Li,Wei Yin,Yixun Liang,Leheng Li,Kaiqiang Zhou,Hongbo Liu,Bingbing Liu,Ying-Cong Chen
关键词-EN: dense prediction tasks, dense prediction, priors of pre-trained, offers a promising, promising solution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising solution to enhance zero-shot generalization in dense prediction tasks. However, existing methods often uncritically use the original diffusion formulation, which may not be optimal due to the fundamental differences between dense prediction and image generation. In this paper, we provide a systemic analysis of the diffusion formulation for the dense prediction, focusing on both quality and efficiency. And we find that the original parameterization type for image generation, which learns to predict noise, is harmful for dense prediction; the multi-step noising/denoising diffusion process is also unnecessary and challenging to optimize. Based on these insights, we introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to directly predict annotations instead of noise, thereby avoiding harmful variance. We also reformulate the diffusion process into a single-step procedure, simplifying optimization and significantly boosting inference speed. Additionally, we introduce a novel tuning strategy called detail preserver, which achieves more accurate and fine-grained predictions. Without scaling up the training data or model capacity, Lotus achieves SoTA performance in zero-shot depth and normal estimation across various datasets. It also significantly enhances efficiency, being hundreds of times faster than most existing diffusion-based methods.

[CV-4] Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

链接: https://arxiv.org/abs/2409.18121
作者: Justin Kerr,Chung Min Kim,Mingxuan Wu,Brent Yi,Qianqian Wang,Ken Goldberg,Angjoo Kanazawa
关键词-EN: RGB human demonstration, monocular RGB human, RGB human, simply watching, natural interface
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: CoRL 2024, Project page: this https URL

点击查看摘要

Abstract:Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration’s intended behavior while considering the robot’s own morphological limits, rather than attempting to reproduce the hand’s motion. We evaluate 4D-DPM’s 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD’s physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average of 87% success rate, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models – without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: this https URL

[CV-5] EvMAPPER: High Altitude Orthomapping with Event Cameras

链接: https://arxiv.org/abs/2409.18120
作者: Fernando Cladera,Kenneth Chaney,M. Ani Hsieh,Camillo J. Taylor,Vijay Kumar
关键词-EN: unmanned aerial vehicles, unmanned aerial, aerial vehicles, Traditionally, collect images
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 7 figures

点击查看摘要

Abstract:Traditionally, unmanned aerial vehicles (UAVs) rely on CMOS-based cameras to collect images about the world below. One of the most successful applications of UAVs is to generate orthomosaics or orthomaps, in which a series of images are integrated together to develop a larger map. However, the use of CMOS-based cameras with global or rolling shutters mean that orthomaps are vulnerable to challenging light conditions, motion blur, and high-speed motion of independently moving objects under the camera. Event cameras are less sensitive to these issues, as their pixels are able to trigger asynchronously on brightness changes. This work introduces the first orthomosaic approach using event cameras. In contrast to existing methods relying only on CMOS cameras, our approach enables map generation even in challenging light conditions, including direct sunlight and after sunset.

[CV-6] Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography MICCAI2024

链接: https://arxiv.org/abs/2409.18119
作者: Yuexi Du,John Onofrey,Nicha C. Dvornek
关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Language-Image Pre-training, requires substantial data, shows promise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is also the basis of the overall best solution for the MICCAI 2024 CXR-LT Challenge

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and data imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline.

[CV-7] EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

链接: https://arxiv.org/abs/2409.18114
作者: Jiaxiang Tang,Zhaoshuo Li,Zekun Hao,Xian Liu,Gang Zeng,Ming-Yu Liu,Qinsheng Zhang
关键词-EN: Current auto-regressive mesh, Current auto-regressive, insufficient detail, generation methods suffer, methods suffer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Current auto-regressive mesh generation methods suffer from issues such as incompleteness, insufficient detail, and poor generalization. In this paper, we propose an Auto-regressive Auto-encoder (ArAE) model capable of generating high-quality 3D meshes with up to 4,000 faces at a spatial resolution of 512^3 . We introduce a novel mesh tokenization algorithm that efficiently compresses triangular meshes into 1D token sequences, significantly enhancing training efficiency. Furthermore, our model compresses variable-length triangular meshes into a fixed-length latent space, enabling training latent diffusion models for better generalization. Extensive experiments demonstrate the superior quality, diversity, and generalization capabilities of our model in both point cloud and image-conditioned mesh generation tasks.

[CV-8] E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding NEURIPS2024

链接: https://arxiv.org/abs/2409.18111
作者: Ye Liu,Zongyang Ma,Zhongang Qi,Yang Wu,Ying Shan,Chang Wen Chen
关键词-EN: Large Language Models, Video Large Language, Large Language, Recent advances, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) under 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding event-of-interests within videos, largely due to the short video context length, improper time representations, and lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset E.T. Instruct 164K tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.

[CV-9] Find Rhinos without Finding Rhinos: Active Learning with Multimodal Imagery of South African Rhino Habitats IJCAI2023

链接: https://arxiv.org/abs/2409.18104
作者: Lucia Gordon,Nikhil Behari,Samuel Collier,Elizabeth Bondi-Kelly,Jackson A. Killian,Catherine Ressijac,Peter Boucher,Andrew Davies,Milind Tambe
关键词-EN: Earth charismatic megafauna, crisis in Africa, Earth charismatic, human activities, charismatic megafauna
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 9 figures, IJCAI 2023 Special Track on AI for Good

点击查看摘要

Abstract:Much of Earth’s charismatic megafauna is endangered by human activities, particularly the rhino, which is at risk of extinction due to the poaching crisis in Africa. Monitoring rhinos’ movement is crucial to their protection but has unfortunately proven difficult because rhinos are elusive. Therefore, instead of tracking rhinos, we propose the novel approach of mapping communal defecation sites, called middens, which give information about rhinos’ spatial behavior valuable to anti-poaching, management, and reintroduction efforts. This paper provides the first-ever mapping of rhino midden locations by building classifiers to detect them using remotely sensed thermal, RGB, and LiDAR imagery in passive and active learning settings. As existing active learning methods perform poorly due to the extreme class imbalance in our dataset, we design MultimodAL, an active learning system employing a ranking technique and multimodality to achieve competitive performance with passive learning models with 94% fewer labels. Our methods could therefore save over 76 hours in labeling time when used on a similarly-sized dataset. Unexpectedly, our midden map reveals that rhino middens are not randomly distributed throughout the landscape; rather, they are clustered. Consequently, rangers should be targeted at areas with high midden densities to strengthen anti-poaching efforts, in line with UN Target 15.7.

[CV-10] MALPOLON: A Framework for Deep Species Distribution Modeling

链接: https://arxiv.org/abs/2409.18102
作者: Theo Larcher,Lukas Picek,Benjamin Deneu,Titouan Lorieul,Maximilien Servajean,Alexis Joly
关键词-EN: paper describes, Python language skills, general Python language, deep species distribution, testing deep learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper describes a deep-SDM framework, MALPOLON. Written in Python and built upon the PyTorch library, this framework aims to facilitate training and inferences of deep species distribution models (deep-SDM) and sharing for users with only general Python language skills (e.g., modeling ecologists) who are interested in testing deep learning approaches to build new SDMs. More advanced users can also benefit from the framework’s modularity to run more specific experiments by overriding existing classes while taking advantage of press-button examples to train neural networks on multiple classification tasks using custom or provided raw and pre-processed datasets. The framework is open-sourced on GitHub and PyPi along with extensive documentation and examples of use in various scenarios. MALPOLON offers straightforward installation, YAML-based configuration, parallel computing, multi-GPU utilization, baseline and foundational models for benchmarking, and extensive tutorials/documentation, aiming to enhance accessibility and performance scalability for ecologists and researchers.

[CV-11] AI-Powered Augmented Reality for Satellite Assembly Integration and Test

链接: https://arxiv.org/abs/2409.18101
作者: Alvaro Patricio,Joao Valente,Atabak Dehban,Ines Cadilha,Daniel Reis,Rodrigo Ventura
关键词-EN: improving operational efficiency, Artificial Intelligence, Augmented Reality, transform satellite Assembly, minimizing human error
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) and Augmented Reality (AR) is set to transform satellite Assembly, Integration, and Testing (AIT) processes by enhancing precision, minimizing human error, and improving operational efficiency in cleanroom environments. This paper presents a technical description of the European Space Agency’s (ESA) project “AI for AR in Satellite AIT,” which combines real-time computer vision and AR systems to assist technicians during satellite assembly. Leveraging Microsoft HoloLens 2 as the AR interface, the system delivers context-aware instructions and real-time feedback, tackling the complexities of object recognition and 6D pose estimation in AIT workflows. All AI models demonstrated over 70% accuracy, with the detection model exceeding 95% accuracy, indicating a high level of performance and reliability. A key contribution of this work lies in the effective use of synthetic data for training AI models in AR applications, addressing the significant challenges of obtaining real-world datasets in highly dynamic satellite environments, as well as the creation of the Segmented Anything Model for Automatic Labelling (SAMAL), which facilitates the automatic annotation of real data, achieving speeds up to 20 times faster than manual human annotation. The findings demonstrate the efficacy of AI-driven AR systems in automating critical satellite assembly tasks, setting a foundation for future innovations in the space industry.

[CV-12] Self-supervised Pretraining for Cardiovascular Magnetic Resonance Cine Segmentation MICCAI2024

链接: https://arxiv.org/abs/2409.18100
作者: Rob A. J. de Mooij,Josien P. W. Pluim,Cian M. Scannell
关键词-EN: cardiovascular magnetic resonance, shown promising results, automated cardiovascular magnetic, CMR cine segmentation, SSP
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to Data Engineering in Medical Imaging (DEMI) Workshop at MICCAI 2024

点击查看摘要

Abstract:Self-supervised pretraining (SSP) has shown promising results in learning from large unlabeled datasets and, thus, could be useful for automated cardiovascular magnetic resonance (CMR) short-axis cine segmentation. However, inconsistent reports of the benefits of SSP for segmentation have made it difficult to apply SSP to CMR. Therefore, this study aimed to evaluate SSP methods for CMR cine segmentation. To this end, short-axis cine stacks of 296 subjects (90618 2D slices) were used for unlabeled pretraining with four SSP methods; SimCLR, positional contrastive learning, DINO, and masked image modeling (MIM). Subsets of varying numbers of subjects were used for supervised fine-tuning of 2D models for each SSP method, as well as to train a 2D baseline model from scratch. The fine-tuned models were compared to the baseline using the 3D Dice similarity coefficient (DSC) in a test dataset of 140 subjects. The SSP methods showed no performance gains with the largest supervised fine-tuning subset compared to the baseline (DSC = 0.89). When only 10 subjects (231 2D slices) are available for supervised training, SSP using MIM (DSC = 0.86) improves over training from scratch (DSC = 0.82). This study found that SSP is valuable for CMR cine segmentation when labeled training data is scarce, but does not aid state-of-the-art deep learning methods when ample labeled data is available. Moreover, the choice of SSP method is important. The code is publicly available at: this https URL Comments: Accepted to Data Engineering in Medical Imaging (DEMI) Workshop at MICCAI 2024 Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2409.18100 [cs.CV] (or arXiv:2409.18100v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2409.18100 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rob De Mooij [view email] [v1] Thu, 26 Sep 2024 17:44:29 UTC (1,439 KB)

[CV-13] EfficientCrackNet: A Lightweight Model for Crack Segmentation

链接: https://arxiv.org/abs/2409.18099
作者: Abid Hasan Zim,Aquib Iqbal,Zaid Al-Huda,Asad Malik,Minoru Kuribayash
关键词-EN: computer vision due, intricate topologies, low contrast, presents a formidable, intensity inhomogeneity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Crack detection, particularly from pavement images, presents a formidable challenge in the domain of computer vision due to several inherent complexities such as intensity inhomogeneity, intricate topologies, low contrast, and noisy backgrounds. Automated crack detection is crucial for maintaining the structural integrity of essential infrastructures, including buildings, pavements, and bridges. Existing lightweight methods often face challenges including computational inefficiency, complex crack patterns, and difficult backgrounds, leading to inaccurate detection and impracticality for real-world applications. To address these limitations, we propose EfficientCrackNet, a lightweight hybrid model combining Convolutional Neural Networks (CNNs) and transformers for precise crack segmentation. EfficientCrackNet integrates depthwise separable convolutions (DSC) layers and MobileViT block to capture both global and local features. The model employs an Edge Extraction Method (EEM) and for efficient crack edge detection without pretraining, and Ultra-Lightweight Subspace Attention Module (ULSAM) to enhance feature extraction. Extensive experiments on three benchmark datasets Crack500, DeepCrack, and GAPs384 demonstrate that EfficientCrackNet achieves superior performance compared to existing lightweight models, while requiring only 0.26M parameters, and 0.483 FLOPs (G). The proposed model offers an optimal balance between accuracy and computational efficiency, outperforming state-of-the-art lightweight models, and providing a robust and adaptable solution for real-world crack segmentation.

[CV-14] DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2409.18092
作者: Helin Cao,Sven Behnke
关键词-EN: Perception systems play, computer vision algorithms, incorporating multiple sensors, Perception systems, incorporating multiple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Under review

点击查看摘要

Abstract:Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle’s surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets and our approach outperforms the state-of-the-art for SSC.

[CV-15] Stable Video Portraits ECCV2024

链接: https://arxiv.org/abs/2409.18083
作者: Mirela Ostrek,Justus Thies
关键词-EN: Rapid advances, computer-generated imagery today, perceive computer-generated imagery, field of generative, perceive computer-generated
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024, Project: this https URL

点击查看摘要

Abstract:Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any fine-tuning at test time. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.

[CV-16] SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation

链接: https://arxiv.org/abs/2409.18082
作者: Xin Li,Siyuan Huang,Qiaojun Yu,Zhengkai Jiang,Ce Hao,Yimeng Zhu,Hongsheng Li,Peng Gao,Cewu Lu
关键词-EN: Automating garment manipulation, Automating garment, poses a significant, significant challenge, diverse and deformable
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our model enables robots to manage different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. In addition, this research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics for future.

[CV-17] FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

链接: https://arxiv.org/abs/2409.18071
作者: Runze He,Kai Ma,Linjiang Huang,Shaofei Huang,Jialin Gao,Xiaoming Wei,Jiao Dai,Jizhong Han,Si Liu
关键词-EN: Introducing user-specified visual, Introducing user-specified, user-specified visual concepts, image editing, editing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages, 14 figures, project website: this https URL

点击查看摘要

Abstract:Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user’s intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual ReferAttention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods. The code will be available at: this https URL.

[CV-18] LightAvatar: Efficient Head Avatar as Dynamic Neural Light Field ECCV’24

链接: https://arxiv.org/abs/2409.18057
作者: Huan Wang,Feitong Tan,Ziqian Bai,Yinda Zhang,Shichen Liu,Qiangeng Xu,Menglei Chai,Anish Prabhu,Rohit Pandey,Sean Fanello,Zeng Huang,Yun Fu
关键词-EN: build photorealistic head, Recent works, photorealistic head avatars, monocular video, neural radiance fields
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Appear in ECCV’24 CADL Workshop. Code: this https URL

点击查看摘要

Abstract:Recent works have shown that neural radiance fields (NeRFs) on top of parametric models have reached SOTA quality to build photorealistic head avatars from a monocular video. However, one major limitation of the NeRF-based avatars is the slow rendering speed due to the dense point sampling of NeRF, preventing them from broader utility on resource-constrained devices. We introduce LightAvatar, the first head avatar model based on neural light fields (NeLFs). LightAvatar renders an image from 3DMM parameters and a camera pose via a single network forward pass, without using mesh or volume rendering. The proposed approach, while being conceptually appealing, poses a significant challenge towards real-time efficiency and training stability. To resolve them, we introduce dedicated network designs to obtain proper representations for the NeLF model and maintain a low FLOPs budget. Meanwhile, we tap into a distillation-based training strategy that uses a pretrained avatar model as teacher to synthesize abundant pseudo data for training. A warping field network is introduced to correct the fitting error in the real data so that the model can learn better. Extensive experiments suggest that our method can achieve new SOTA image quality quantitatively or qualitatively, while being significantly faster than the counterparts, reporting 174.1 FPS (512x512 resolution) on a consumer-grade GPU (RTX3090) with no customized optimization.

[CV-19] Visual Data Diagnosis and Debiasing with Concept Graphs

链接: https://arxiv.org/abs/2409.18055
作者: Rwiddhi Chakraborty,Yinong Wang,Jialu Gao,Runkai Zheng,Cheng Zhang,Fernando De la Torre
关键词-EN: deep learning models, learning models today, size and complexity, widespread success, success of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present CONBIAS, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. CONBIAS represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by CONBIAS improves generalization performance across multiple datasets compared to state-of-the-art methods. We will make our code and data publicly available.

[CV-20] Revisit Anything: Visual Place Recognition via Image Segment Retrieval ECCV2024

链接: https://arxiv.org/abs/2409.18049
作者: Kartik Garg,Sai Shubodh Puligilla,Shishir Kolathaya,Madhava Krishna,Sourav Garg
关键词-EN: Accurately recognizing, localize and navigate, crucial for embodied, embodied agents, agents to localize
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Presented at ECCV 2024; Includes supplementary; 29 pages; 8 figures

点击查看摘要

Abstract:Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the “whole” image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: “the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap”. We address this by encoding and searching for “image segments” instead of the whole images. We propose to use open-set image segmentation to decompose an image into `meaningful’ entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to ``revisit anything’’ by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: this https URL.

[CV-21] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning EMNLP2024

链接: https://arxiv.org/abs/2409.18046
作者: Soeun Lee,Si-Woo Kim,Taewhan Kim,Dong-Jin Kim
关键词-EN: Recent advancements, paired image-text data, explored text-only training, text-only training, overcome the limitations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ( \textbfI mage-like Retrieval and \textbfF requency-based Entity Filtering for Zero-shot \textbfCap tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.

[CV-22] EMOVA: Empowering Language Models to See Hear and Speak with Vivid Emotions

链接: https://arxiv.org/abs/2409.18042
作者: Kai Chen,Yunhao Gou,Runhui Huang,Zhili Liu,Daxin Tan,Jing Xu,Chunwei Wang,Yi Zhu,Yihan Zeng,Kuo Yang,Dingdong Wang,Kun Xiang,Haoyuan Li,Haoli Bai,Jianhua Han,Xiaohui Li,Weike Jin,Nian Xie,Yu Zhang,James T. Kwok,Hengshuang Zhao,Xiaodan Liang,Dit-Yan Yeung,Xiao Chen,Zhenguo Li,Wei Zhang,Qun Liu,Lanqing Hong,Lu Hou,Hang Xu
关键词-EN: Large Language Models, enables vocal conversations, Large Language, empowering Large Language, enable Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Project Page: this https URL

点击查看摘要

Abstract:GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.

[CV-23] ReliOcc: Towards Reliable Semantic Occupancy Prediction via Uncertainty Learning

链接: https://arxiv.org/abs/2409.18026
作者: Song Wang,Zhongdao Wang,Jiawei Yu,Wentong Li,Bailan Feng,Junbo Chen,Jianke Zhu
关键词-EN: Vision-centric semantic occupancy, Vision-centric semantic, autonomous driving, plays a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Technical report. Work in progress

点击查看摘要

Abstract:Vision-centric semantic occupancy prediction plays a crucial role in autonomous driving, which requires accurate and reliable predictions from low-cost sensors. Although having notably narrowed the accuracy gap with LiDAR, there is still few research effort to explore the reliability in predicting semantic occupancy from camera. In this paper, we conduct a comprehensive evaluation of existing semantic occupancy prediction models from a reliability perspective for the first time. Despite the gradual alignment of camera-based models with LiDAR in term of accuracy, a significant reliability gap persists. To addresses this concern, we propose ReliOcc, a method designed to enhance the reliability of camera-based occupancy networks. ReliOcc provides a plug-and-play scheme for existing models, which integrates hybrid uncertainty from individual voxels with sampling-based noise and relative voxels through mix-up learning. Besides, an uncertainty-aware calibration strategy is devised to further enhance model reliability in offline mode. Extensive experiments under various settings demonstrate that ReliOcc significantly enhances model reliability while maintaining the accuracy of both geometric and semantic predictions. Importantly, our proposed approach exhibits robustness to sensor failures and out of domain noises during inference.

[CV-24] ransferring disentangled representations: bridging the gap between synthetic and real images

链接: https://arxiv.org/abs/2409.18017
作者: Jacopo Dapueto,Nicoletta Noceti,Francesca Odone
关键词-EN: Developing meaningful, data generation mechanism, Disentangled Representation Learning, representation learning, meaningful and efficient
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Developing meaningful and efficient representations that separate the fundamental structure of the data generation mechanism is crucial in representation learning. However, Disentangled Representation Learning has not fully shown its potential on real images, because of correlated generative factors, their resolution and limited access to ground truth labels. Specifically on the latter, we investigate the possibility of leveraging synthetic data to learn general-purpose disentangled representations applicable to real data, discussing the effect of fine-tuning and what properties of disentanglement are preserved after the transfer. We provide an extensive empirical study to address these issues. In addition, we propose a new interpretable intervention-based metric, to measure the quality of factors encoding in the representation. Our results indicate that some level of disentanglement, transferring a representation from synthetic to real data, is possible and effective.

[CV-25] InterNet: Unsupervised Cross-modal Homography Estimation Based on Interleaved Modality Transfer and Self-supervised Homography Prediction

链接: https://arxiv.org/abs/2409.17993
作者: Junchen Yu,Si-Yuan Cao,Runmin Zhang,Chenghao Zhang,Jianxin Hu,Zhu Yu,Hui-liang Shen
关键词-EN: self-supervised homography estimation, self-supervised homography, modality transfer, interleaved modality transfer, homography estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a novel unsupervised cross-modal homography estimation framework, based on interleaved modality transfer and self-supervised homography prediction, named InterNet. InterNet integrates modality transfer and self-supervised homography estimation, introducing an innovative interleaved optimization framework to alternately promote both components. The modality transfer gradually narrows the modality gaps, facilitating the self-supervised homography estimation to fully leverage the synthetic intra-modal data. The self-supervised homography estimation progressively achieves reliable predictions, thereby providing robust cross-modal supervision for the modality transfer. To further boost the estimation accuracy, we also formulate a fine-grained homography feature loss to improve the connection between two components. Furthermore, we employ a simple yet effective distillation training technique to reduce model parameters and improve cross-domain generalization ability while maintaining comparable performance. Experiments reveal that InterNet achieves the state-of-the-art (SOTA) performance among unsupervised methods, and even outperforms many supervised methods such as MHN and LocalTrans.

[CV-26] Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions ECCV2024

链接: https://arxiv.org/abs/2409.17988
作者: Weng Fei Low,Gim Hee Lee
关键词-EN: high dynamic range, standard cameras underperform, event motion blur, event camera makes, event
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Accepted to ECCV 2024. Project website is accessible at this https URL . arXiv admin note: text overlap with arXiv:2006.07722 by other authors

点击查看摘要

Abstract:The stark contrast in the design philosophy of an event camera makes it particularly ideal for operating under high-speed, high dynamic range and low-light conditions, where standard cameras underperform. Nonetheless, event cameras still suffer from some amount of motion blur, especially under these challenging conditions, in contrary to what most think. This is attributed to the limited bandwidth of the event sensor pixel, which is mostly proportional to the light intensity. Thus, to ensure that event cameras can truly excel in such conditions where it has an edge over standard cameras, it is crucial to account for event motion blur in downstream applications, especially reconstruction. However, none of the recent works on reconstructing Neural Radiance Fields (NeRFs) from events, nor event simulators, have considered the full effects of event motion blur. To this end, we propose, Deblur e-NeRF, a novel method to directly and effectively reconstruct blur-minimal NeRFs from motion-blurred events generated under high-speed motion or low-light conditions. The core component of this work is a physically-accurate pixel bandwidth model proposed to account for event motion blur under arbitrary speed and lighting conditions. We also introduce a novel threshold-normalized total variation loss to improve the regularization of large textureless patches. Experiments on real and novel realistically simulated sequences verify our effectiveness. Our code, event simulator and synthetic event dataset will be open-sourced.

[CV-27] LLM4Brain: Training a Large Language Model for Brain Video Understanding ECCV2024

链接: https://arxiv.org/abs/2409.17987
作者: Ruizhe Zheng,Lichao Sun
关键词-EN: limited data availability, poses significant challenges, subjects poses significant, functional MRI, Decoding visual-semantic information
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: ECCV2024 Workshop

点击查看摘要

Abstract:Decoding visual-semantic information from brain signals, such as functional MRI (fMRI), across different subjects poses significant challenges, including low signal-to-noise ratio, limited data availability, and cross-subject variability. Recent advancements in large language models (LLMs) show remarkable effectiveness in processing multimodal information. In this study, we introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli. Specifically, we employ fine-tuning techniques on an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli. Subsequently, these representations are mapped to textual modality by LLM. In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses. Our proposed method achieves good results using various quantitative semantic metrics, while yielding similarity with ground-truth information.

[CV-28] BlinkTrack: Feature Tracking over 100 FPS via Events and Images

链接: https://arxiv.org/abs/2409.17981
作者: Yichen Shen,Yijin Li,Shuo Chen,Guanglin Li,Zhaoyang Huang,Hujun Bao,Zhaopeng Cui,Guofeng Zhang
关键词-EN: computer vision tasks, structure from motion, simultaneous localization, localization and mapping, vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Feature tracking is crucial for, structure from motion (SFM), simultaneous localization and mapping (SLAM), object tracking and various computer vision tasks. Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with RGB images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches. This approach improves single-modality tracking, resolves ambiguities, and supports asynchronous data fusion. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing event-based methods, exceeding 100 FPS with preprocessed event data and 80 FPS with multi-modality data.

[CV-29] HydraViT: Stacking Heads for a Scalable ViT

链接: https://arxiv.org/abs/2409.17978
作者: Janek Haberer,Ali Hojjat,Olaf Landsiedel
关键词-EN: Vision Transformers, architecture of Vision, imposes substantial hardware, substantial hardware demands, Multi-head Attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, requires multiple models of different sizes. However, this approach has limitations, such as training and storing each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly changing the size of the embedded dimensions throughout each layer and their corresponding number of attention heads in MHA during training, HydraViT induces multiple subnetworks. Thereby, HydraViT achieves adaptability across a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. HydraViT achieves up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy with the same throughput on ImageNet-1K compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time. Source code available at this https URL.

[CV-30] Cross-Modality Attack Boosted by Gradient-Evolutionary Multiform Optimization

链接: https://arxiv.org/abs/2409.17977
作者: Yunpeng Gong,Qingyuan Zeng,Dejun Xu,Zhenzhong Wang,Min Jiang
关键词-EN: RGB images, adversarial attack research, recent years, attack research, RGB
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, despite significant advancements in adversarial attack research, the security challenges in cross-modal scenarios, such as the transferability of adversarial attacks between infrared, thermal, and RGB images, have been overlooked. These heterogeneous image modalities collected by different hardware devices are widely prevalent in practical applications, and the substantial differences between modalities pose significant challenges to attack transferability. In this work, we explore a novel cross-modal adversarial attack strategy, termed multiform attack. We propose a dual-layer optimization framework based on gradient-evolution, facilitating efficient perturbation transfer between modalities. In the first layer of optimization, the framework utilizes image gradients to learn universal perturbations within each modality and employs evolutionary algorithms to search for shared perturbations with transferability across different modalities through secondary optimization. Through extensive testing on multiple heterogeneous datasets, we demonstrate the superiority and robustness of Multiform Attack compared to existing techniques. This work not only enhances the transferability of cross-modal adversarial attacks but also provides a new perspective for understanding security vulnerabilities in cross-modal systems.

[CV-31] CNCA: Toward Customizable and Natural Generation of Adversarial Camouflage for Vehicle Detectors

链接: https://arxiv.org/abs/2409.17963
作者: Linye Lyu,Jiawei Zhou,Daojing He,Yu Li
关键词-EN: Prior works, detectors mainly focus, effectiveness and robustness, vehicle detectors, Prior
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Prior works on physical adversarial camouflage against vehicle detectors mainly focus on the effectiveness and robustness of the attack. The current most successful methods optimize 3D vehicle texture at a pixel level. However, this results in conspicuous and attention-grabbing patterns in the generated camouflage, which humans can easily identify. To address this issue, we propose a Customizable and Natural Camouflage Attack (CNCA) method by leveraging an off-the-shelf pre-trained diffusion model. By sampling the optimal texture image from the diffusion model with a user-specific text prompt, our method can generate natural and customizable adversarial camouflage while maintaining high attack performance. With extensive experiments on the digital and physical worlds and user studies, the results demonstrate that our proposed method can generate significantly more natural-looking camouflage than the state-of-the-art baselines while achieving competitive attack performance. Our code is available at \hrefthis https URLthis https URL

[CV-32] he Hard Positive Truth about Vision-Language Compositionality ECCV2024

链接: https://arxiv.org/abs/2409.17958
作者: Amita Kamath,Cheng-Yu Hsieh,Kai-Wei Chang,Ranjay Krishna
关键词-EN: hard, CLIP, hard positives, hard negatives, vision-language models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model’s ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated – because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP’s performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related “positive” concepts.

[CV-33] Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition

链接: https://arxiv.org/abs/2409.17951
作者: Xinpeng Yin,Wenming Cao
关键词-EN: skeleton-based action recognition, self-supervised skeleton-based action, mask reconstruction paradigm, enhancing model refinement, action recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages,6 figures,IEEE Trans

点击查看摘要

Abstract:In self-supervised skeleton-based action recognition, the mask reconstruction paradigm is gaining interest in enhancing model refinement and robustness through effective masking. However, previous works primarily relied on a single masking criterion, resulting in the model overfitting specific features and overlooking other effective information. In this paper, we introduce a hierarchy and attention guided cross-masking framework (HA-CM) that applies masking to skeleton sequences from both spatial and temporal perspectives. Specifically, in spatial graphs, we utilize hyperbolic space to maintain joint distinctions and effectively preserve the hierarchical structure of high-dimensional skeletons, employing joint hierarchy as the masking criterion. In temporal flows, we substitute traditional distance metrics with the global attention of joints for masking, addressing the convergence of distances in high-dimensional space and the lack of a global perspective. Additionally, we incorporate cross-contrast loss based on the cross-masking framework into the loss function to enhance the model’s learning of instance-level features. HA-CM shows efficiency and universality on three public large-scale datasets, NTU-60, NTU-120, and PKU-MMD. The source code of our HA-CM is available at this https URL.

[CV-34] Perturb Attend Detect and Localize (PADL): Robust Proactive Image Defense

链接: https://arxiv.org/abs/2409.17941
作者: Filippo Bartolucci,Iacopo Masi,Giuseppe Lisanti
关键词-EN: received considerable attention, Generative Models, localization have received, received considerable, considerable attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image manipulation detection and localization have received considerable attention from the research community given the blooming of Generative Models (GMs). Detection methods that follow a passive approach may overfit to specific GMs, limiting their application in real-world scenarios, due to the growing diversity of generative models. Recently, approaches based on a proactive framework have shown the possibility of dealing with this limitation. However, these methods suffer from two main limitations, which raises concerns about potential vulnerabilities: i) the manipulation detector is not robust to noise and hence can be easily fooled; ii) the fact that they rely on fixed perturbations for image protection offers a predictable exploit for malicious attackers, enabling them to reverse-engineer and evade detection. To overcome this issue we propose PADL, a new solution able to generate image-specific perturbations using a symmetric scheme of encoding and decoding based on cross-attention, which drastically reduces the possibility of reverse engineering, even when evaluated with adaptive attack [31]. Additionally, PADL is able to pinpoint manipulated areas, facilitating the identification of specific regions that have undergone alterations, and has more generalization power than prior art on held-out generative models. Indeed, although being trained only on an attribute manipulation GAN model [15], our method generalizes to a range of unseen models with diverse architectural designs, such as StarGANv2, BlendGAN, DiffAE, StableDiffusion and StableDiffusionXL. Additionally, we introduce a novel evaluation protocol, which offers a fair evaluation of localisation performance in function of detection accuracy and better captures real-world scenarios.

[CV-35] Neural Light Spheres for Implicit Image Stitching and View Synthesis

链接: https://arxiv.org/abs/2409.17924
作者: Ilya Chugunov,Amogh Joshi,Kiran Murthy,Francois Bleibel,Felix Heide
关键词-EN: panorama paradoxically remains, mobile camera applications, modern mobile camera, challenging to display, cellphone screen
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project site: this https URL

点击查看摘要

Abstract:Challenging to capture, and challenging to display on a cellphone screen, the panorama paradoxically remains both a staple and underused feature of modern mobile camera applications. In this work we address both of these challenges with a spherical neural light field model for implicit panoramic image stitching and re-rendering; able to accommodate for depth parallax, view-dependent lighting, and local scene motion and color changes during capture. Fit during test-time to an arbitrary path panoramic video capture – vertical, horizontal, random-walk – these neural light spheres jointly estimate the camera path and a high-resolution scene reconstruction to produce novel wide field-of-view projections of the environment. Our single-layer model avoids expensive volumetric sampling, and decomposes the scene into compact view-dependent ray offset and color components, with a total model size of 80 MB per scene, and real-time (50 FPS) rendering at 1080p resolution. We demonstrate improved reconstruction quality over traditional image stitching and radiance field methods, with significantly higher tolerance to scene motion and non-ideal capture settings.

[CV-36] Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

链接: https://arxiv.org/abs/2409.17920
作者: Qihan Huang,Siming Fu,Jinlong Liu,Hao Jiang,Yipeng Yu,Jie Song
关键词-EN: wide research interest, garnered wide research, customized images based, personalized image generation, generate personalized images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Personalized text-to-image generation methods can generate customized images based on the reference images, which have garnered wide research interest. Recent methods propose a finetuning-free approach with a decoupled cross-attention mechanism to generate personalized images requiring no test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism encounters the object confusion problem and fails to map each reference image to its corresponding object, thereby seriously limiting its scope of application. To address the object confusion problem, in this work we investigate the relevance of different positions of the latent image features to the target object in diffusion model, and accordingly propose a weighted-merge method to merge multiple reference image features into the corresponding objects. Next, we integrate this weighted-merge method into existing pre-trained models and continue to train the model on a multi-object dataset constructed from the open-sourced SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score to estimate the image quality for the selection of high-quality training samples. Furthermore, our weighted-merge training framework can be employed on single-object generation when a single object has multiple reference images. The experiments verify that our method achieves superior performance to the state-of-the-arts on the Concept101 dataset and DreamBooth dataset of multi-object personalized image generation, and remarkably improves the performance on single-object personalized image generation. Our code is available at this https URL.

[CV-37] WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians

链接: https://arxiv.org/abs/2409.17917
作者: Dmytro Kotovenko,Olga Grebenkova,Nikolaos Sarafianos,Avinash Paliwal,Pingchuan Ma,Omid Poursaeed,Sreyas Mohan,Yuchen Fan,Yilei Li,Rakesh Ranjan,Björn Ommer
关键词-EN: Earth Mover Distance, remains relatively unexplored, explicit Gaussian Splatting, style transfer techniques, scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While style transfer techniques have been well-developed for 2D image stylization, the extension of these methods to 3D scenes remains relatively unexplored. Existing approaches demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover’s Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Additionally, we decompose the scene stylization problem into smaller chunks to enhance efficiency. This paradigm shift reframes stylization from a pure generative process driven by latent space losses to an explicit matching of distributions between two Gaussian representations. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without necessitating any training, as it relies solely on optimization-based techniques. See our project page for additional results and source code: \hrefthis https URLthis https URL .

[CV-38] LKA-ReID:Vehicle Re-Identification with Large Kernel Attention ICASSP2025

链接: https://arxiv.org/abs/2409.17908
作者: Xuezhi Xiang,Zhushan Ma,Lei Zhang,Denis Ombati,Himaloy Himu,Xiantong Zhen
关键词-EN: smart city infrastructure, intelligent transportation systems, important research field, Vehicle Re-ID technology, Vehicle Re-ID
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper is under consideration at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:With the rapid development of intelligent transportation systems and the popularity of smart city infrastructure, Vehicle Re-ID technology has become an important research field. The vehicle Re-ID task faces an important challenge, which is the high similarity between different vehicles. Existing methods use additional detection or segmentation models to extract differentiated local features. However, these methods either rely on additional annotations or greatly increase the computational cost. Using attention mechanism to capture global and local features is crucial to solve the challenge of high similarity between classes in vehicle Re-ID tasks. In this paper, we propose LKA-ReID with large kernel attention. Specifically, the large kernel attention (LKA) utilizes the advantages of self-attention and also benefits from the advantages of convolution, which can extract the global and local features of the vehicle more comprehensively. We also introduce hybrid channel attention (HCA) combines channel attention with spatial information, so that the model can better focus on channels and feature regions, and ignore background and other disturbing information. Experiments on VeRi-776 dataset demonstrated the effectiveness of LKA-ReID, with mAP reaches 86.65% and Rank-1 reaches 98.03%.

[CV-39] Self-supervised Monocular Depth Estimation with Large Kernel Attention ICASSP2025

链接: https://arxiv.org/abs/2409.17895
作者: Xuezhi Xiang,Yao Wang,Lei Zhang,Denis Ombati,Himaloy Himu,Xiantong Zhen
关键词-EN: labeled training data, Self-supervised monocular depth, training data, promising approach, rely on labeled
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper is under consideration at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies to estimate depth accurately. However, Transformer treats 2D image features as 1D sequences, and positional encoding somewhat mitigates the loss of spatial information between different feature blocks, tending to overlook channel features, which limit the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimension structure of features while maintaining feature channel adaptivity. In addition, we introduce a up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.

[CV-40] Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target Detection ECCV2024

链接: https://arxiv.org/abs/2409.17886
作者: Andrea Toaiari,Vittorio Murino,Marco Cristani,Cigdem Beyan
关键词-EN: Gaze Target Detection, Gaze Target, external viewpoint, challenging task, Target Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in the T-CAP workshop at ECCV 2024

点击查看摘要

Abstract:Gaze Target Detection (GTD), i.e., determining where a person is looking within a scene from an external viewpoint, is a challenging task, particularly in 3D space. Existing approaches heavily rely on analyzing the person’s appearance, primarily focusing on their face to predict the gaze target. This paper presents a novel approach to tackle this problem by utilizing the person’s upper-body pose and available depth maps to extract a 3D gaze direction and employing a multi-stage or an end-to-end pipeline to predict the gazed target. When predicted accurately, the human body pose can provide valuable information about the head pose, which is a good approximation of the gaze direction, as well as the position of the arms and hands, which are linked to the activity the person is performing and the objects they are likely focusing on. Consequently, in addition to performing gaze estimation in 3D, we are also able to perform GTD simultaneously. We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset without requiring images of the person’s face, thus promoting privacy preservation in various application contexts. The code is available at this https URL.

[CV-41] Self-Distilled Depth Refinement with Noisy Poisson Fusion NEURIPS2024

链接: https://arxiv.org/abs/2409.17880
作者: Jiaqi Li,Yiran Wang,Jinghong Zheng,Zihao Huang,Ke Xian,Zhiguo Cao,Jianming Zhang
关键词-EN: refining low-resolution results, infer high-resolution depth, Depth, refining low-resolution, Depth refinement aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Depth refinement aims to infer high-resolution depth with fine-grained edges and details, refining low-resolution results of depth estimation models. The prevailing methods adopt tile-based manners by merging numerous patches, which lacks efficiency and produces inconsistency. Besides, prior arts suffer from fuzzy depth boundaries and limited generalizability. Analyzing the fundamental reasons for these limitations, we model depth refinement as a noisy Poisson fusion problem with local inconsistency and edge deformation noises. We propose the Self-distilled Depth Refinement (SDDR) framework to enforce robustness against the noises, which mainly consists of depth edge representation and edge-based guidance. With noisy depth predictions as input, SDDR generates low-noise depth edge representations as pseudo-labels by coarse-to-fine self-distillation. Edge-based guidance with edge-guided gradient loss and edge-based fusion loss serves as the optimization objective equivalent to Poisson fusion. When depth maps are better refined, the labels also become more noise-free. Our model can acquire strong robustness to the noises, achieving significant improvements in accuracy, edge quality, efficiency, and generalizability on five different benchmarks. Moreover, directly training another model with edge labels produced by SDDR brings improvements, suggesting that our method could help with training robust refinement models in future works.

[CV-42] Visualization of Age Distributions as Elements of Medical Data-Stories

链接: https://arxiv.org/abs/2409.17854
作者: Sophia Dowlatabadi,Bernhard Preim,Monique Meuschke
关键词-EN: including medicine, age distributions, enhance health communication, Abstract, distributions are crucial
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:In various fields, including medicine, age distributions are crucial. Despite widespread media coverage of health topics, there remains a need to enhance health communication. Narrative medical visualization is promising for improving information comprehension and retention. This study explores the most effective ways to present age distributions of diseases through narrative visualizations. We conducted a thorough analysis of existing visualizations, held workshops with a broad audience, and reviewed relevant literature. From this, we identified design choices focusing on comprehension, aesthetics, engagement, and memorability. We specifically tested three pictogram variants: pictograms as bars, stacked pictograms, and annotations. After evaluating 18 visualizations with 72 participants and three expert reviews, we determined that annotations were most effective for comprehension and aesthetics. However, traditional bar charts were preferred for engagement, and other variants were more memorable. The study provides a set of design recommendations based on these insights.

[CV-43] A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts ECCV2024

链接: https://arxiv.org/abs/2409.17851
作者: Aurel Pjetri(1 and 2),Stefano Caprasecca(1),Leonardo Taccari(1),Matteo Simoncini(1),Henrique Piñeiro Monteagudo(1 and 3),Walter Wallace(1),Douglas Coimbra de Andrade(4),Francesco Sambo(1),Andrew David Bagdanov(1) ((1) Verizon Connect Research, Florence, Italy, (2) Department of Information Engineering, University of Florence, Florence, Italy, (3) University of Bologna, Bologna, Italy, (4) SENAI Institute of Innovation, Rio de Janeiro, Brazil)
关键词-EN: Monocular depth estimation, computer vision applications, depth estimation, Monocular depth, critical task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 5 figures. Accepted at ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD)

点击查看摘要

Abstract:Monocular depth estimation is a critical task for autonomous driving and many other computer vision applications. While significant progress has been made in this field, the effects of viewpoint shifts on depth estimation models remain largely underexplored. This paper introduces a novel dataset and evaluation methodology to quantify the impact of different camera positions and orientations on monocular depth estimation performance. We propose a ground truth strategy based on homography estimation and object detection, eliminating the need for expensive lidar sensors. We collect a diverse dataset of road scenes from multiple viewpoints and use it to assess the robustness of a modern depth estimation model to geometric shifts. After assessing the validity of our strategy on a public dataset, we provide valuable insights into the limitations of current models and highlight the importance of considering viewpoint variations in real-world applications.

[CV-44] Unsupervised Learning Based Multi-Scale Exposure Fusion

链接: https://arxiv.org/abs/2409.17830
作者: Chaobing Zheng,Shiqian Wu,Zhenggguo Li
关键词-EN: low dynamic range, high dynamic range, Unsupervised learning based, higher quality LDR, dynamic range
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages

点击查看摘要

Abstract:Unsupervised learning based multi-scale exposure fusion (ULMEF) is efficient for fusing differently exposed low dynamic range (LDR) images into a higher quality LDR image for a high dynamic range (HDR) scene. Unlike supervised learning, loss functions play a crucial role in the ULMEF. In this paper, novel loss functions are proposed for the ULMEF and they are defined by using all the images to be fused and other differently exposed images from the same HDR scene. The proposed loss functions can guide the proposed ULMEF to learn more reliable information from the HDR scene than existing loss functions which are defined by only using the set of images to be fused. As such, the quality of the fused image is significantly improved. The proposed ULMEF also adopts a multi-scale strategy that includes a multi-scale attention module to effectively preserve the scene depth and local contrast in the fused image. Meanwhile, the proposed ULMEF can be adopted to achieve exposure interpolation and exposure extrapolation. Extensive experiments show that the proposed ULMEF algorithm outperforms state-of-the-art exposure fusion algorithms.

[CV-45] Kendalls tau Coefficient for Logits Distillation

链接: https://arxiv.org/abs/2409.17823
作者: Yuchen Guan,Runxi Cheng,Kang Liu,Chun Yuan
关键词-EN: student model output, distillation typically employs, Knowledge distillation typically, soft labels provided, Knowledge distillation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the student model’s output to match the soft labels provided by the teacher model exactly. However, sometimes the optimization direction of the KL divergence loss is not always aligned with the task loss, where a smaller KL divergence could lead to erroneous predictions that diverge from the soft labels. This limitation often results in suboptimal optimization for the student. Moreover, even under temperature scaling, the KL divergence loss function tends to overly focus on the larger-valued channels in the logits, disregarding the rich inter-class information provided by the multitude of smaller-valued channels. This hard constraint proves too challenging for lightweight students, hindering further knowledge distillation. To address this issue, we propose a plug-and-play ranking loss based on Kendall’s \tau coefficient, called Rank-Kendall Knowledge Distillation (RKKD). RKKD balances the attention to smaller-valued channels by constraining the order of channel values in student logits, providing more inter-class relational information. The rank constraint on the top-valued channels helps avoid suboptimal traps during optimization. We also discuss different differentiable forms of Kendall’s \tau coefficient and demonstrate that the proposed ranking loss function shares a consistent optimization objective with the KL divergence. Extensive experiments on the CIFAR-100 and ImageNet datasets show that our RKKD can enhance the performance of various knowledge distillation baselines and offer broad improvements across multiple teacher-student architecture combinations.

[CV-46] Cascade Prompt Learning for Vision-Language Model Adaptation ECCV2024

链接: https://arxiv.org/abs/2409.17805
作者: Ge Wu,Xin Zhang,Zheng Li,Zhaowei Chen,Jiajun Liang,Jian Yang,Xiang Li
关键词-EN: Prompt learning, Prompt, Cascade Prompt Learning, adapting prompt, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV2024

点击查看摘要

Abstract:Prompt learning has surfaced as an effective approach to enhance the performance of Vision-Language Models (VLMs) like CLIP when applied to downstream tasks. However, current learnable prompt tokens are primarily used for the single phase of adapting to tasks (i.e., adapting prompt), easily leading to overfitting risks. In this work, we propose a novel Cascade Prompt Learning CasPL framework to enable prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompt) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first boosting prompt is crafted to extract domain-general knowledge from a senior larger CLIP teacher model by aligning their predicted logits using extensive unlabeled domain images. The second adapting prompt is then cascaded with the frozen first set to fine-tune the downstream tasks, following the approaches employed in prior research. In this manner, CasPL can effectively capture both domain-general and task-specific representations into explicitly different gradual groups of prompts, thus potentially alleviating overfitting issues in the target domain. It’s worth noting that CasPL serves as a plug-and-play module that can seamlessly integrate into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLM models in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at: this https URL.

[CV-47] Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs

链接: https://arxiv.org/abs/2409.17792
作者: Xinya Shu,Yu Li,Dongwei Ren,Xiaohe Wu,Jin Li,Wangmeng Zuo
关键词-EN: image defocus deblurring, acquiring well-aligned training, defocus deblurring, single image defocus, defocus
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The source code and dataset are available at this https URL

点击查看摘要

Abstract:For single image defocus deblurring, acquiring well-aligned training pairs (or training triplets), i.e., a defocus blurry image, an all-in-focus sharp image (and a defocus blur map), is an intricate task for the development of deblurring models. Existing image defocus deblurring methods typically rely on training data collected by specialized imaging equipment, presupposing that these pairs or triplets are perfectly aligned. However, in practical scenarios involving the collection of real-world data, direct acquisition of training triplets is infeasible, and training pairs inevitably encounter spatial misalignment issues. In this work, we introduce a reblurring-guided learning framework for single image defocus deblurring, enabling the learning of a deblurring network even with misaligned training pairs. Specifically, we first propose a baseline defocus deblurring network that utilizes spatially varying defocus blur map as degradation prior to enhance the deblurring performance. Then, to effectively learn the baseline defocus deblurring network with misaligned training pairs, our reblurring module ensures spatial consistency between the deblurred image, the reblurred image and the input blurry image by reconstructing spatially variant isotropic blur kernels. Moreover, the spatially variant blur derived from the reblurring module can serve as pseudo supervision for defocus blur map during training, interestingly transforming training pairs into training triplets. Additionally, we have collected a new dataset specifically for single image defocus deblurring (SDD) with typical misalignments, which not only substantiates our proposed method but also serves as a benchmark for future research.

[CV-48] CASPFormer: Trajectory Prediction from BEV Images with Deformable Attention ICPR2024

链接: https://arxiv.org/abs/2409.17790
作者: Harsh Yadav,Maximilian Schaefer,Kun Zhao,Tobias Meisen
关键词-EN: Advance Driver Assistance, Driver Assistance Systems, Autonomous Driving, Advance Driver, Driver Assistance
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review at ICPR 2024, Kolkata

点击查看摘要

Abstract:Motion prediction is an important aspect for Autonomous Driving (AD) and Advance Driver Assistance Systems (ADAS). Current state-of-the-art motion prediction methods rely on High Definition (HD) maps for capturing the surrounding context of the ego vehicle. Such systems lack scalability in real-world deployment as HD maps are expensive to produce and update in real-time. To overcome this issue, we propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from rasterized Bird-Eye-View (BEV) images. Our system can be integrated with any upstream perception module that is capable of generating BEV images. Moreover, CASPFormer directly decodes vectorized trajectories without any postprocessing. Trajectories are decoded recurrently using deformable attention, as it is computationally efficient and provides the network with the ability to focus its attention on the important spatial locations of the BEV images. In addition, we also address the issue of mode collapse for generating multiple scene-consistent trajectories by incorporating learnable mode queries. We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art across multiple metrics

[CV-49] aming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs NEURIPS2024

链接: https://arxiv.org/abs/2409.17778
作者: Qinpeng Cui,Yixuan Liu,Xinyi Zhang,Qiqi Bao,Zhongdao Wang,Qingmin Liao,Li Wang,Tian Lu,Emad Barsoum
关键词-EN: attracted substantial interest, substantial interest due, image restoration capabilities, powerful image restoration, Diffusion-based image super-resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper is accepted by NeurIPS 2024

点击查看摘要

Abstract:Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate a dozens of forward passes starting from random noises, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed as DoS-SDEs. This advancement leads to the fast and customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion prior based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency. Code: this https URL.

[CV-50] Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

链接: https://arxiv.org/abs/2409.17777
作者: Raja Kumar,Raghav Singhal,Pranamya Kulkarni,Deval Mehta,Kshitij Jadhav
关键词-EN: shown remarkable success, Deep multimodal learning, Deep multimodal, leveraging contrastive learning, Mixup-based contrastive loss
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: RK and RS contributed equally to this work, 20 Pages, 8 Figures, 9 Tables

点击查看摘要

Abstract:Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.

[CV-51] UNICORN: A Deep Learning Model for Integrating Multi-Stain Data in Histopathology

链接: https://arxiv.org/abs/2409.17775
作者: Valentin Koch,Sabine Bauer,Valerio Luppberger,Michael Joner,Heribert Schunkert,Julia A. Schnabel,Moritz von Scheidt,Carsten Marr
关键词-EN: deep learning poses, poses a significant, Background, data, digital histopathology
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Background: The integration of multi-stain histopathology images through deep learning poses a significant challenge in digital histopathology. Current multi-modal approaches struggle with data heterogeneity and missing data. This study aims to overcome these limitations by developing a novel transformer model for multi-stain integration that can handle missing data during training as well as inference. Methods: We propose UNICORN (UNiversal modality Integration Network for CORonary classificatioN) a multi-modal transformer capable of processing multi-stain histopathology for atherosclerosis severity class prediction. The architecture comprises a two-stage, end-to-end trainable model with specialized modules utilizing transformer self-attention blocks. The initial stage employs domain-specific expert modules to extract features from each modality. In the subsequent stage, an aggregation expert module integrates these features by learning the interactions between the different data modalities. Results: Evaluation was performed using a multi-class dataset of atherosclerotic lesions from the Munich Cardiovascular Studies Biobank (MISSION), using over 4,000 paired multi-stain whole slide images (WSIs) from 170 deceased individuals on 7 prespecified segments of the coronary tree, each stained according to four histopathological protocols. UNICORN achieved a classification accuracy of 0.67, outperforming other state-of-the-art models. The model effectively identifies relevant tissue phenotypes across stainings and implicitly models disease progression. Conclusion: Our proposed multi-modal transformer model addresses key challenges in medical data analysis, including data heterogeneity and missing modalities. Explainability and the model’s effectiveness in predicting atherosclerosis progression underscores its potential for broader applications in medical research.

[CV-52] Confidence intervals uncovered: Are we ready for real-world medical imaging AI? MICCAI2024

链接: https://arxiv.org/abs/2409.17763
作者: Evangelia Christodoulou,Annika Reinke,Rola Houhou,Piotr Kalinowski,Selen Erkan,Carole H. Sudre,Ninon Burgos,Sofiène Boutaj,Sophie Loizillon,Maëlys Solal,Nicola Rieke,Veronika Cheplygina,Michela Antonelli,Leon D. Mayer,Minu D. Tizabi,M. Jorge Cardoso,Amber Simpson,Paul F. Jäger,Annette Kopp-Schneider,Gaël Varoquaux,Olivier Colliot,Lena Maier-Hein
关键词-EN: Medical imaging, transformation of healthcare, imaging is spearheading, Performance, Medical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted at MICCAI 2024 conference

点击查看摘要

Abstract:Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determine which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03 which is three times larger than the median performance gap between the first and second ranked method. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.

[CV-53] xt Image Generation for Low-Resource Languages with Dual Translation Learning

链接: https://arxiv.org/abs/2409.17747
作者: Chihiro Noguchi,Shun Fukuda,Shoichiro Mihara,Masao Yamanaka
关键词-EN: frequently faces challenges, faces challenges due, training datasets derived, languages frequently faces, text images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 11 figures

点击查看摘要

Abstract:Scene text recognition in low-resource languages frequently faces challenges due to the limited availability of training datasets derived from real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach utilizes a diffusion model that is conditioned on binary states: synthetic'' and real.‘’ The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images, based on the binary states. This approach not only effectively differentiates between the two domains but also facilitates the model’s explicit recognition of characters in the target language. Furthermore, to enhance the accuracy and variety of generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experimental results demonstrate that the text images generated by our proposed framework can significantly improve the performance of scene text recognition models for low-resource languages.

[CV-54] AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status

链接: https://arxiv.org/abs/2409.17740
作者: Jinghao Zhang,Wen Qian,Hao Luo,Fan Wang,Feng Zhao
关键词-EN: high-throughput daily production, made compelling progress, facilitating high-throughput daily, daily production, made compelling
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:Diffusion models have made compelling progress on facilitating high-throughput daily production. Nevertheless, the appealing customized requirements are remain suffered from instance-level finetuning for authentic fidelity. Prior zero-shot customization works achieve the semantic consistence through the condensed injection of identity features, while addressing detailed low-level signatures through complex model configurations and subject-specific fabrications, which significantly break the statistical coherence within the overall system and limit the applicability across various scenarios. To facilitate the generic signature concentration with rectified efficiency, we present \textbfAnyLogo, a zero-shot region customizer with remarkable detail consistency, building upon the symbiotic diffusion system with eliminated cumbersome designs. Streamlined as vanilla image generation, we discern that the rigorous signature extraction and creative content generation are promisingly compatible and can be systematically recycled within a single denoising model. In place of the external configurations, the gemini status of the denoising model promote the reinforced subject transmission efficiency and disentangled semantic-signature space with continuous signature decoration. Moreover, the sparse recycling paradigm is adopted to prevent the duplicated risk with compressed transmission quota for diversified signature stimulation. Extensive experiments on constructed logo-level benchmarks demonstrate the effectiveness and practicability of our methods.

[CV-55] Neural Implicit Representation for Highly Dynamic LiDAR Mapping and Odometry

链接: https://arxiv.org/abs/2409.17729
作者: Qi Zhang,He Wang,Ru Li,Wenbin Li
关键词-EN: Simultaneous Localization, Neural Radiance Fields, Recent advancements, advancements in Simultaneous, LiDAR-based techniques
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in Simultaneous Localization and Mapping (SLAM) have increasingly highlighted the robustness of LiDAR-based techniques. At the same time, Neural Radiance Fields (NeRF) have introduced new possibilities for 3D scene reconstruction, exemplified by SLAM systems. Among these, NeRF-LOAM has shown notable performance in NeRF-based SLAM applications. However, despite its strengths, these systems often encounter difficulties in dynamic outdoor environments due to their inherent static assumptions. To address these limitations, this paper proposes a novel method designed to improve reconstruction in highly dynamic outdoor scenes. Based on NeRF-LOAM, the proposed approach consists of two primary components. First, we separate the scene into static background and dynamic foreground. By identifying and excluding dynamic elements from the mapping process, this segmentation enables the creation of a dense 3D map that accurately represents the static background only. The second component extends the octree structure to support multi-resolution representation. This extension not only enhances reconstruction quality but also aids in the removal of dynamic objects identified by the first module. Additionally, Fourier feature encoding is applied to the sampled points, capturing high-frequency information and leading to more complete reconstruction results. Evaluations on various datasets demonstrate that our method achieves more competitive results compared to current state-of-the-art approaches.

[CV-56] AlterMOMA: Fusion Redundancy Pruning for Camera-LiDAR Fusion Models with Alternative Modality Masking NEURIPS2024

链接: https://arxiv.org/abs/2409.17728
作者: Shiqi Sun,Yantao Lu,Ning Liu,Bo Jiang,JinChao Chen,Ying Zhang
关键词-EN: Camera-LiDAR fusion models, significantly enhance perception, Camera-LiDAR fusion, fusion models, models significantly enhance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 17 pages, 3 figures, Accepted by NeurIPS 2024

点击查看摘要

Abstract:Camera-LiDAR fusion models significantly enhance perception performance in autonomous driving. The fusion mechanism leverages the strengths of each modality while minimizing their weaknesses. Moreover, in practice, camera-LiDAR fusion models utilize pre-trained backbones for efficient training. However, we argue that directly loading single-modal pre-trained camera and LiDAR backbones into camera-LiDAR fusion models introduces similar feature redundancy across modalities due to the nature of the fusion mechanism. Unfortunately, existing pruning methods are developed explicitly for single-modal models, and thus, they struggle to effectively identify these specific redundant parameters in camera-LiDAR fusion models. In this paper, to address the issue above on camera-LiDAR fusion models, we propose a novelty pruning framework Alternative Modality Masking Pruning (AlterMOMA), which employs alternative masking on each modality and identifies the redundant parameters. Specifically, when one modality parameters are masked (deactivated), the absence of features from the masked backbone compels the model to reactivate previous redundant features of the other modality backbone. Therefore, these redundant features and relevant redundant parameters can be identified via the reactivation process. The redundant parameters can be pruned by our proposed importance score evaluation function, Alternative Evaluation (AlterEva), which is based on the observation of the loss changes when certain modality parameters are activated and deactivated. Extensive experiments on the nuScene and KITTI datasets encompassing diverse tasks, baseline models, and pruning algorithms showcase that AlterMOMA outperforms existing pruning methods, attaining state-of-the-art performance.

[CV-57] Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications

链接: https://arxiv.org/abs/2409.17727
作者: Nghia Nguyen,Minh Nhat Vu,Tung D. Ta,Baoru Huang,Thieu Vo,Ngan Le,Anh Nguyen
关键词-EN: extracting meaningful features, played a key, key role, role in extracting, extracting meaningful
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages

点击查看摘要

Abstract:Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (~7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP’s strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.

[CV-58] Scene Understanding in Pick-and-Place Tasks: Analyzing Transformations Between Initial and Final Scenes

链接: https://arxiv.org/abs/2409.17720
作者: Seraj Ghasemi,Hamed Hosseini,MohammadHossein Koosheshi,Mehdi Tale Masouleh,Ahmad Kalhor
关键词-EN: robots increasingly collaborating, robotic systems capable, pick and place, place tasks, robots increasingly
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Conference Paper, ICEE 2024, 7 pages, 5 figures

点击查看摘要

Abstract:With robots increasingly collaborating with humans in everyday tasks, it is important to take steps toward robotic systems capable of understanding the environment. This work focuses on scene understanding to detect pick and place tasks given initial and final images from the scene. To this end, a dataset is collected for object detection and pick and place task detection. A YOLOv5 network is subsequently trained to detect the objects in the initial and final scenes. Given the detected objects and their bounding boxes, two methods are proposed to detect the pick and place tasks which transform the initial scene into the final scene. A geometric method is proposed which tracks objects’ movements in the two scenes and works based on the intersection of the bounding boxes which moved within scenes. Contrarily, the CNN-based method utilizes a Convolutional Neural Network to classify objects with intersected bounding boxes into 5 classes, showing the spatial relationship between the involved objects. The performed pick and place tasks are then derived from analyzing the experiments with both scenes. Results show that the CNN-based method, using a VGG16 backbone, outscores the geometric method by roughly 12 percentage points in certain scenarios, with an overall success rate of 84.3%.

[CV-59] Behaviour4All: in-the-wild Facial Behaviour Analysis Toolkit

链接: https://arxiv.org/abs/2409.17717
作者: Dimitrios Kollias,Chunchang Shao,Odysseus Kaloidas,Ioannis Patras
关键词-EN: Action Unit Detection, integrating Face Localization, facial behavior analysis, Face Localization, Unit Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce Behavior4All, a comprehensive, open-source toolkit for in-the-wild facial behavior analysis, integrating Face Localization, Valence-Arousal Estimation, Basic Expression Recognition and Action Unit Detection, all within a single framework. Available in both CPU-only and GPU-accelerated versions, Behavior4All leverages 12 large-scale, in-the-wild datasets consisting of over 5 million images from diverse demographic groups. It introduces a novel framework that leverages distribution matching and label co-annotation to address tasks with non-overlapping annotations, encoding prior knowledge of their relatedness. In the largest study of its kind, Behavior4All outperforms both state-of-the-art and toolkits in overall performance as well as fairness across all databases and tasks. It also demonstrates superior generalizability on unseen databases and on compound expression recognition. Finally, Behavior4All is way times faster than other toolkits.

[CV-60] MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling NEURIPS2024

链接: https://arxiv.org/abs/2409.17686
作者: Weihao Yuan,Weichao Shen,Yisheng He,Yuan Dong,Xiaodong Gu,Zilong Dong,Liefeng Bo,Qixing Huang
关键词-EN: inevitable approximation errors, discrete quantization offers, continuous regression, approximation errors, generation from discrete
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Motion generation from discrete quantization offers many advantages over continuous regression, but at the cost of inevitable approximation errors. Previous methods usually quantize the entire body pose into one code, which not only faces the difficulty in encoding all joints within one vector but also loses the spatial relationship between different joints. Differently, in this work we quantize each individual joint into one vector, which i) simplifies the quantization process as the complexity associated with a single joint is markedly lower than that of the entire pose; ii) maintains a spatial-temporal structure that preserves both the spatial relationships among joints and the temporal movement patterns; iii) yields a 2D token map, which enables the application of various 2D operations widely used in 2D images. Grounded in the 2D motion quantization, we build a spatial-temporal modeling framework, where 2D joint VQVAE, temporal-spatial 2D masking technique, and spatial-temporal 2D attention are proposed to take advantage of spatial-temporal signals among the 2D tokens. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a 26.6% decrease of FID on HumanML3D and a 29.9% decrease on KIT-ML.

[CV-61] Dark Miner: Defend against unsafe generation for text-to-image diffusion models

链接: https://arxiv.org/abs/2409.17682
作者: Zheling Meng,Bo Peng,Xiaochuan Jin,Yue Jiang,Jing Dong,Wei Wang,Tieniu Tan
关键词-EN: large-scale training data, unfiltered large-scale training, unsafe generation due, shocking images, due to unfiltered
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion models have been demonstrated with unsafe generation due to unfiltered large-scale training data, such as violent, sexual, and shocking images, necessitating the erasure of unsafe concepts. Most existing methods focus on modifying the generation probabilities conditioned on the texts containing unsafe descriptions. However, they fail to guarantee safe generation for unseen texts in the training phase, especially for the prompts from adversarial attacks. In this paper, we re-analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probabilities of unsafe generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. It greedily mines embeddings with maximum generation probabilities of unsafe concepts and reduces unsafe generation more effectively. In the experiments, we evaluate its performance on two inappropriate concepts, two objects, and two styles. Compared with 6 previous state-of-the-art methods, our method achieves better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model’s native generation capability. Our code will be available on GitHub.

[CV-62] Event-based Stereo Depth Estimation: A Survey

链接: https://arxiv.org/abs/2409.17680
作者: Suman Ghosh,Guillermo Gallego
关键词-EN: Stereopsis has widespread, widespread appeal, appeal in robotics, living beings perceive, high temporal
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 28 pages, 20 figures, 7 tables

点击查看摘要

Abstract:Stereopsis has widespread appeal in robotics as it is the predominant way by which living beings perceive depth to navigate our 3D world. Event cameras are novel bio-inspired sensors that detect per-pixel brightness changes asynchronously, with very high temporal resolution and high dynamic range, enabling machine perception in high-speed motion and broad illumination conditions. The high temporal precision also benefits stereo matching, making disparity (depth) estimation a popular research area for event cameras ever since its inception. Over the last 30 years, the field has evolved rapidly, from low-latency, low-power circuit design to current deep learning (DL) approaches driven by the computer vision community. The bibliography is vast and difficult to navigate for non-experts due its highly interdisciplinary nature. Past surveys have addressed distinct aspects of this topic, in the context of applications, or focusing only on a specific class of techniques, but have overlooked stereo datasets. This survey provides a comprehensive overview, covering both instantaneous stereo and long-term methods suitable for simultaneous localization and mapping (SLAM), along with theoretical and empirical comparisons. It is the first to extensively review DL methods as well as stereo datasets, even providing practical suggestions for creating new benchmarks to advance the field. The main advantages and challenges faced by event-based stereo depth estimation are also discussed. Despite significant progress, challenges remain in achieving optimal performance in not only accuracy but also efficiency, a cornerstone of event-based computing. We identify several gaps and propose future research directions. We hope this survey inspires future research in this area, by serving as an accessible entry point for newcomers, as well as a practical guide for seasoned researchers in the community.

[CV-63] EM-Net: Efficient Channel and Frequency Learning with Mamba for 3D Medical Image Segmentation MICCAI2024

链接: https://arxiv.org/abs/2409.17675
作者: Ao Chang,Jiajun Zeng,Ruobing Huang,Dong Ni
关键词-EN: Convolutional neural networks, small receptive fields, Convolutional neural, primarily led, receptive fields
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 3 figures, accepted by MICCAI 2024

点击查看摘要

Abstract:Convolutional neural networks have primarily led 3D medical image segmentation but may be limited by small receptive fields. Transformer models excel in capturing global relationships through self-attention but are challenged by high computational costs at high resolutions. Recently, Mamba, a state space model, has emerged as an effective approach for sequential modeling. Inspired by its success, we introduce a novel Mamba-based 3D medical image segmentation model called EM-Net. It not only efficiently captures attentive interaction between regions by integrating and selecting channels, but also effectively utilizes frequency domain to harmonize the learning of features across varying scales, while accelerating training speed. Comprehensive experiments on two challenging multi-organ datasets with other state-of-the-art (SOTA) algorithms show that our method exhibits better segmentation accuracy while requiring nearly half the parameter size of SOTA models and 2x faster training speed.

[CV-64] Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation

链接: https://arxiv.org/abs/2409.17674
作者: Huan Yang,Jiahui Chen,Chaofan Ding,Runhua Shi,Siyu Xiong,Qingqi Hong,Xiaoqi Mo,Xinhan Di
关键词-EN: enhancing co-speech communication, pivotal in enhancing, co-speech communication, enhancing co-speech, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 5 figures, conference

点击查看摘要

Abstract:Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations through data-driven approaches, we explore the representation of gestures in co-speech, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model which incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate hand gestures generation, which are crucial for generating realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with an improvement from 2.7 to 4.5% for FGD, DIV, and FVD, and 8.1% for PSNR, 2.5% for SSIM over the current state-of-the-art methods.

[CV-65] Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes

链接: https://arxiv.org/abs/2409.17671
作者: Katja Ludwig,Julian Lorenz,Daniel Kienzle,Tuan Bui,Rainer Lienhart
关键词-EN: body shape, HME models, basic body shape, HME, body
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The basic body shape of a person does not change within a single video. However, most SOTA human mesh estimation (HME) models output a slightly different body shape for each video frame, which results in inconsistent body shapes for the same person. In contrast, we leverage anthropometric measurements like tailors are already obtaining from humans for centuries. We create a model called A2B that converts such anthropometric measurements to body shape parameters of human mesh models. Moreover, we find that finetuned SOTA 3D human pose estimation (HPE) models outperform HME models regarding the precision of the estimated keypoints. We show that applying inverse kinematics (IK) to the results of such a 3D HPE model and combining the resulting body pose with the A2B body shape leads to superior and consistent human meshes for challenging datasets like ASPset or fit3D, where we can lower the MPJPE by over 30 mm compared to SOTA HME models. Further, replacing HME models estimates of the body shape parameters with A2B model results not only increases the performance of these HME models, but also leads to consistent body shapes.

[CV-66] Explanation Bottleneck Models

链接: https://arxiv.org/abs/2409.17663
作者: Shin’ya Yamaguchi,Kosuke Nishida
关键词-EN: Recent concept-based interpretable, Recent concept-based, providing meaningful explanations, pre-defined concept sets, concept-based interpretable models
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Recent concept-based interpretable models have succeeded in providing meaningful explanations by pre-defined concept sets. However, the dependency on the pre-defined concepts restricts the application because of the limited number of concepts for explanations. This paper proposes a novel interpretable deep neural network called explanation bottleneck models (XBMs). XBMs generate a text explanation from the input without pre-defined concepts and then predict a final task prediction based on the generated explanation by leveraging pre-trained vision-language encoder-decoder models. To achieve both the target task performance and the explanation quality, we train XBMs through the target task loss with the regularization penalizing the explanation decoder via the distillation from the frozen pre-trained decoder. Our experiments, including a comparison to state-of-the-art concept bottleneck models, confirm that XBMs provide accurate and fluent natural language explanations without pre-defined concept sets. Code will be available at this https URL.

[CV-67] Provable Performance Guarantees of Copy Detection Patterns

链接: https://arxiv.org/abs/2409.17649
作者: Joakim Tutt,Slava Voloshynovskiy
关键词-EN: Copy Detection Patterns, Copy Detection, Detection Patterns, modern security applications, playing a vital
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Copy Detection Patterns (CDPs) are crucial elements in modern security applications, playing a vital role in safeguarding industries such as food, pharmaceuticals, and cosmetics. Current performance evaluations of CDPs predominantly rely on empirical setups using simplistic metrics like Hamming distances or Pearson correlation. These methods are often inadequate due to their sensitivity to distortions, degradation, and their limitations to stationary statistics of printing and imaging. Additionally, machine learning-based approaches suffer from distribution biases and fail to generalize to unseen counterfeit samples. Given the critical importance of CDPs in preventing counterfeiting, including the counterfeit vaccines issue highlighted during the COVID-19 pandemic, there is an urgent need for provable performance guarantees across various criteria. This paper aims to establish a theoretical framework to derive optimal criteria for the analysis, optimization, and future development of CDP authentication technologies, ensuring their reliability and effectiveness in diverse security scenarios.

[CV-68] MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning NEURIPS2024

链接: https://arxiv.org/abs/2409.17647
作者: Tieyuan Chen,Huabin Liu,Tianyao He,Yihang Chen,Chaofan Gan,Xiao Ma,Cheng Zhong,Yang Zhang,Yingxue Wang,Hui Lin,Weiyao Lin
关键词-EN: causal, achieve a high-level, high-level understanding, causal relationships, MECD
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024 as a spotlight paper

点击查看摘要

Abstract:Video causal reasoning aims to achieve a high-level understanding of video content from a causal perspective. However, current video reasoning tasks are limited in scope, primarily executed in a question-answering paradigm and focusing on short videos containing only a single event and simple causal relationships, lacking comprehensive and structured causality analysis for videos with multiple events. To fill this gap, we introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relationships between events distributed chronologically across long videos. Given visual segments and textual descriptions of events, MECD requires identifying the causal associations between these events to derive a comprehensive, structured event-level video causal diagram explaining why and how the final result event occurred. To address MECD, we devise a novel framework inspired by the Granger Causality method, using an efficient mask-based event prediction model to perform an Event Granger Test, which estimates causality by comparing the predicted result event when premise events are masked versus unmasked. Furthermore, we integrate causal inference techniques such as front-door adjustment and counterfactual inference to address challenges in MECD like causality confounding and illusory causality. Experiments validate the effectiveness of our framework in providing causal relationships in multi-event videos, outperforming GPT-4o and VideoLLaVA by 5.7% and 4.1%, respectively.

[CV-69] P4Q: Learning to Prompt for Quantization in Visual-language Models

链接: https://arxiv.org/abs/2409.17634
作者: Huixin Sun,Runqi Wang,Yanjing Li,Xianbin Cao,Xiaolong Jiang,Yao Hu,Baochang Zhang
关键词-EN: downstream application platforms, application platforms remains, platforms remains challenging, remains challenging due, Large-scale pre-trained Vision-Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have gained prominence in various visual and multimodal tasks, yet the deployment of VLMs on downstream application platforms remains challenging due to their prohibitive requirements of training samples and computing resources. Fine-tuning and quantization of VLMs can substantially reduce the sample and computation costs, which are in urgent need. There are two prevailing paradigms in quantization, Quantization-Aware Training (QAT) can effectively quantize large-scale VLMs but incur a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization named ``Prompt for Quantization’’ (P4Q), in which we design a lightweight architecture to leverage contrastive loss supervision to enhance the recognition performance of a PTQ model. Our method can effectively reduce the gap between image features and text features caused by low-bit quantization, based on learnable prompts to reorganize textual representations and a low-bit adapter to realign the distributions of image and text features. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model using a full-precision teacher. Extensive experimental results demonstrate that our P4Q method outperforms prior arts, even achieving comparable results to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4 \times while achieving 66.94% Top-1 accuracy, outperforming the learnable prompt fine-tuned full-precision model by 2.24% with negligible additional parameters on the ImageNet dataset.

[CV-70] Hand-object reconstruction via interaction-aware graph attention mechanism ICIP2024

链接: https://arxiv.org/abs/2409.17629
作者: Taeyun Woo,Tae-Kyun Kim,Jinah Park
关键词-EN: advanced vision computing, Estimating the poses, vision computing, important area, area of research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, Accepted by ICIP 2024

点击查看摘要

Abstract:Estimating the poses of both a hand and an object has become an important area of research due to the growing need for advanced vision computing. The primary challenge involves understanding and reconstructing how hands and objects interact, such as contact and physical plausibility. Existing approaches often adopt a graph neural network to incorporate spatial information of hand and object meshes. However, these approaches have not fully exploited the potential of graphs without modification of edges within and between hand- and object-graphs. We propose a graph-based refinement method that incorporates an interaction-aware graph-attention mechanism to account for hand-object interactions. Using edges, we establish connections among closely correlated nodes, both within individual graphs and across different graphs. Experiments demonstrate the effectiveness of our proposed method with notable improvements in the realm of physical plausibility.

[CV-71] Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment

链接: https://arxiv.org/abs/2409.17612
作者: Jiawei Du,Xin Zhang,Juncheng Hu,Wenxin Huang,Joey Tianyi Zhou
关键词-EN: sharp increase, increase in data-related, motivated research, research into condensing, datasets
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The sharp increase in data-related expenses has motivated research into condensing datasets while retaining the most informative features. Dataset distillation has thus recently come to the fore. This paradigm generates synthetic dataset that are representative enough to replace the original dataset in training a neural network. To avoid redundancy in these synthetic datasets, it is crucial that each element contains unique features and remains diverse from others during the synthesis stage. In this paper, we provide a thorough theoretical and empirical analysis of diversity within synthesized datasets. We argue that enhancing diversity can improve the parallelizable yet isolated synthesizing approach. Specifically, we introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process, thereby maximizing the representativeness and diversity of each synthetic instance. Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset. Extensive experiments across multiple datasets, including CIFAR, Tiny-ImageNet, and ImageNet-1K, demonstrate the superior performance of our method, highlighting its effectiveness in producing diverse and representative synthetic datasets with minimal computational expense.

[CV-72] ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

链接: https://arxiv.org/abs/2409.17610
作者: Zhangpu Li,Changhong Zou,Suxue Ma,Zhicheng Yang,Chen Du,Youbao Tang,Zhenjie Cao,Ning Zhang,Jui-Hsin Lai,Ruei-Sung Lin,Yuan Ni,Xingzhi Sun,Jing Xiao,Kai Zhang,Mei Han
关键词-EN: multimodal medical dialogue, multi-turn multimodal medical, large language models, multimodal medical, medical dialogue
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients’ mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the preceding text conversations before an image can infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to summarize the keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate our proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue to provide a fine-grained performance comparison. Our experiments across three different clinical departments remarkably demonstrate the efficacy of ZALM3 with statistical significance.

[CV-73] Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection

链接: https://arxiv.org/abs/2409.17608
作者: Jiahao Lyu,Minghua Zhao,Jing Hu,Xuewen Huang,Shuangli Du,Cheng Shi,Zhiyong Lv
关键词-EN: measuring significant deviations, Video anomaly detection, significant deviations, Video anomaly, learns the distribution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 11 figures

点击查看摘要

Abstract:Video anomaly detection (VAD) often learns the distribution of normal samples and detects the anomaly through measuring significant deviations, but the undesired generalization may reconstruct a few anomalies thus suppressing the deviations. Meanwhile, most VADs cannot cope with cross-dataset validation for new target domains, and few-shot methods must laboriously rely on model-tuning from the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module to achieve cross-dataset validation with zero-shot. First, we add Gaussian blur to the raw appearance images, thereby constructing the global pseudo-anomaly, which serves as the input to the network. Then, we propose multi-scale residual channel attention to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase, which are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.

[CV-74] Good Data Is All Imitation Learning Needs

链接: https://arxiv.org/abs/2409.17605
作者: Amir Samadi,Konstantinos Koufos,Kurt Debattista,Mehrdad Dianati
关键词-EN: Automated Driving Systems, context of Autonomous, traditional teacher-student models, imitation learning, Automated Driving
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we address the limitations of traditional teacher-student models, imitation learning, and behaviour cloning in the context of Autonomous/Automated Driving Systems (ADS), where these methods often struggle with incomplete coverage of real-world scenarios. To enhance the robustness of such models, we introduce the use of Counterfactual Explanations (CFEs) as a novel data augmentation technique for end-to-end ADS. CFEs, by generating training samples near decision boundaries through minimal input modifications, lead to a more comprehensive representation of expert driver strategies, particularly in safety-critical scenarios. This approach can therefore help improve the model’s ability to handle rare and challenging driving events, such as anticipating darting out pedestrians, ultimately leading to safer and more trustworthy decision-making for ADS. Our experiments in the CARLA simulator demonstrate that CF-Driver outperforms the current state-of-the-art method, achieving a higher driving score and lower infraction rates. Specifically, CF-Driver attains a driving score of 84.2, surpassing the previous best model by 15.02 percentage points. These results highlight the effectiveness of incorporating CFEs in training end-to-end ADS. To foster further research, the CF-Driver code is made publicly available.

[CV-75] A-Cleaner: A Fine-grained Text Alignment Backdoor Defense Strategy for Multimodal Contrastive Learning

链接: https://arxiv.org/abs/2409.17601
作者: Yuan Xun,Siyuan Liang,Xiaojun Jia,Xinwei Liu,Xiaochun Cao
关键词-EN: Pre-trained large models, multimodal contrastive learning, Pre-trained large, multimodal contrastive, widely recognized
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pre-trained large models for multimodal contrastive learning, such as CLIP, have been widely recognized in the industry as highly susceptible to data-poisoned backdoor attacks. This poses significant risks to downstream model training. In response to such potential threats, finetuning offers a simpler and more efficient defense choice compared to retraining large models with augmented data. In the supervised learning domain, fine-tuning defense strategies can achieve excellent defense performance. However, in the unsupervised and semi-supervised domain, we find that when CLIP faces some complex attack techniques, the existing fine-tuning defense strategy, CleanCLIP, has some limitations on defense performance. The synonym substitution of its text-augmentation is insufficient to enhance the text feature space. To compensate for this weakness, we improve it by proposing a fine-grained \textbfText \textbfAlignment \textbfCleaner (TA-Cleaner) to cut off feature connections of backdoor triggers. We randomly select a few samples for positive and negative subtext generation at each epoch of CleanCLIP, and align the subtexts to the images to strengthen the text self-supervision. We evaluate the effectiveness of our TA-Cleaner against six attack algorithms and conduct comprehensive zero-shot classification tests on ImageNet1K. Our experimental results demonstrate that TA-Cleaner achieves state-of-the-art defensiveness among finetuning-based defense techniques. Even when faced with the novel attack technique BadCLIP, our TA-Cleaner outperforms CleanCLIP by reducing the ASR of Top-1 and Top-10 by 52.02% and 63.88%, respectively.

[CV-76] Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution

链接: https://arxiv.org/abs/2409.17597
作者: Zhenyu Hu,Wanjie Sun
关键词-EN: super-resolution tasks due, demonstrated outstanding performance, Window-based transformers, demonstrated outstanding, super-resolution tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Window-based transformers have demonstrated outstanding performance in super-resolution tasks due to their adaptive modeling capabilities through local self-attention (SA). However, they exhibit higher computational complexity and inference latency than convolutional neural networks. In this paper, we first identify that the adaptability of the Transformers is derived from their adaptive spatial aggregation and advanced structural design, while their high latency results from the computational costs and memory layout transformations associated with the local SA. To simulate this aggregation approach, we propose an effective convolution-based linear focal separable attention (FSA), allowing for long-range dynamic modeling with linear complexity. Additionally, we introduce an effective dual-branch structure combined with an ultra-lightweight information exchange module (IEM) to enhance the aggregation of information by the Token Mixer. Finally, with respect to the structure, we modify the existing spatial-gate-based feedforward neural networks by incorporating a self-gate mechanism to preserve high-dimensional channel information, enabling the modeling of more complex relationships. With these advancements, we construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet). Extensive experiments demonstrate that LAMNet achieves better performance than existing SA-based Transformer methods while maintaining the computational efficiency of convolutional neural networks, which can achieve a (3\times) speedup of inference time. The code will be publicly available at: this https URL.

[CV-77] Improving Fast Adversarial Training via Self-Knowledge Guidance

链接: https://arxiv.org/abs/2409.17589
作者: Chengze Jiang,Junkai Wang,Minjing Dong,Jie Gui,Xinli Shi,Yuan Cao,Yuan Yan Tang,James Tin-Yau Kwok
关键词-EN: achieved remarkable advancements, FAT, achieved remarkable, remarkable advancements, advancements in defending
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:Adversarial training has achieved remarkable advancements in defending against adversarial attacks. Among them, fast adversarial training (FAT) is gaining attention for its ability to achieve competitive robustness with fewer computing resources. Existing FAT methods typically employ a uniform strategy that optimizes all training data equally without considering the influence of different examples, which leads to an imbalanced optimization. However, this imbalance remains unexplored in the field of FAT. In this paper, we conduct a comprehensive study of the imbalance issue in FAT and observe an obvious class disparity regarding their performances. This disparity could be embodied from a perspective of alignment between clean and robust accuracy. Based on the analysis, we mainly attribute the observed misalignment and disparity to the imbalanced optimization in FAT, which motivates us to optimize different training data adaptively to enhance robustness. Specifically, we take disparity and misalignment into consideration. First, we introduce self-knowledge guided regularization, which assigns differentiated regularization weights to each class based on its training state, alleviating class disparity. Additionally, we propose self-knowledge guided label relaxation, which adjusts label relaxation according to the training accuracy, alleviating the misalignment and improving robustness. By combining these methods, we formulate the Self-Knowledge Guided FAT (SKG-FAT), leveraging naturally generated knowledge during training to enhance the adversarial robustness without compromising training efficiency. Extensive experiments on four standard datasets demonstrate that the SKG-FAT improves the robustness and preserves competitive clean accuracy, outperforming the state-of-the-art methods.

[CV-78] ID3: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition NEURIPS2024

链接: https://arxiv.org/abs/2409.17576
作者: Shen Li,Jianqing Xu,Jiaying Wu,Miao Xiong,Ailin Deng,Jiazhen Ji,Yuge Huang,Wenjie Feng,Shouhong Ding,Bryan Hooi
关键词-EN: Synthetic face, generate synthetic face, Synthetic face recognition, synthetic face datasets, privacy-preserving manner
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Synthetic face recognition (SFR) aims to generate synthetic face datasets that mimic the distribution of real face data, which allows for training face recognition models in a privacy-preserving manner. Despite the remarkable potential of diffusion models in image generation, current diffusion-based SFR models struggle with generalization to real-world faces. To address this limitation, we outline three key objectives for SFR: (1) promoting diversity across identities (inter-class diversity), (2) ensuring diversity within each identity by injecting various facial attributes (intra-class diversity), and (3) maintaining identity consistency within each identity group (intra-class identity preservation). Inspired by these goals, we introduce a diffusion-fueled SFR model termed \textID^3 . \textID^3 employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances. Theoretically, we show that minimizing this loss is equivalent to maximizing the lower bound of an adjusted conditional log-likelihood over ID-preserving data. This equivalence motivates an ID-preserving sampling algorithm, which operates over an adjusted gradient vector field, enabling the generation of fake face recognition datasets that approximate the distribution of real-world faces. Extensive experiments across five challenging benchmarks validate the advantages of \textID^3 .

[CV-79] Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

链接: https://arxiv.org/abs/2409.17566
作者: Hongtao Huang,Xiaojun Chang,Lina Yao
关键词-EN: cutting-edge generative models, generative models adept, Diffusion models, Neural Architecture Search, high-quality images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images. Despite their effectiveness, these models often require significant computational resources owing to their numerous sequential denoising steps and the significant inference cost of each step. Recently, Neural Architecture Search (NAS) techniques have been employed to automatically search for faster generation processes. However, NAS for diffusion is inherently time-consuming as it requires estimating thousands of diffusion models to search for the optimal one. In this paper, we introduce Flexiffusion, a novel training-free NAS paradigm designed to accelerate diffusion models by concurrently optimizing generation steps and network structures. Specifically, we partition the generation process into isometric step segments, each sequentially composed of a full step, multiple partial steps, and several null steps. The full step computes all network blocks, while the partial step involves part of the blocks, and the null step entails no computation. Flexiffusion autonomously explores flexible step combinations for each segment, substantially reducing search costs and enabling greater acceleration compared to the state-of-the-art (SOTA) method for diffusion models. Our searched models reported speedup factors of 2.6\times and 1.5\times for the original LDM-4-G and the SOTA, respectively. The factors for Stable Diffusion V1.5 and the SOTA are 5.1\times and 2.0\times . We also verified the performance of Flexiffusion on multiple datasets, and positive experiment results indicate that Flexiffusion can effectively reduce redundancy in diffusion models.

[CV-80] Pixel-Space Post-Training of Latent Diffusion Models

链接: https://arxiv.org/abs/2409.17565
作者: Christina Zhang,Simran Motwani,Matthew Yu,Ji Hou,Felix Juefei-Xu,Sam Tsai,Peter Vajda,Zijian He,Jialiang Wang
关键词-EN: made significant advancements, recent years, made significant, significant advancements, generation in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is due to the fact that all pre- and post-training of LDMs are done in latent space, which is typically 8 \times 8 lower spatial-resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training by a large margin on a state-of-the-art DiT transformer and U-Net diffusion models in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.

[CV-81] General Compression Framework for Efficient Transformer Object Tracking

链接: https://arxiv.org/abs/2409.17564
作者: Lingyi Hong,Jinglun Li,Xinyu Zhou,Shilin Yan,Pinxue Guo,Kaixun Jiang,Zhaoyu Chen,Shuyong Gao,Wei Zhang,Hong Lu,Wenqiang Zhang
关键词-EN: Transformer-based trackers, model, teacher model, student model, teacher
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transformer-based trackers have established a dominant role in the field of visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging due to inefficiencies. To improve the inference efficiency and reduce the computation cost, prior approaches have aimed to either design lightweight trackers or distill knowledge from larger teacher models into more compact student trackers. However, these solutions often sacrifice accuracy for speed. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce the size of a pre-trained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. Additionally, we also design a unique replacement training technique that involves randomly substituting specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model’s ability to replicate the teacher model’s behavior. To further forcing student model to emulate teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the teacher model’s compression process. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiment to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4 with 4 transformer layers, which is compressed from OSTrack, retains about 96% performance on LaSOT (66.1% AUC) while achieves 2.17x speed up.

[CV-82] Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking

链接: https://arxiv.org/abs/2409.17560
作者: Pengcheng Shao,Tianyang Xu,Xuefeng Zhu,Xiaojun Wu,Josef Kittler
关键词-EN: bionic camera asynchronously, high dynamic range, Event-based bionic camera, RGB under conditions, high temporal resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 8 figures, conference

点击查看摘要

Abstract:Event-based bionic camera asynchronously captures dynamic scenes with high temporal resolution and high dynamic range, offering potential for the integration of events and RGB under conditions of illumination degradation and fast motion. Existing RGB-E tracking methods model event characteristics utilising attention mechanism of Transformer before integrating both modalities. Nevertheless, these methods involve aggregating the event stream into a single event frame, lacking the utilisation of the temporal information inherent in the event this http URL, the traditional attention mechanism is well-suited for dense semantic features, while the attention mechanism for sparse event features require revolution. In this paper, we propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters, aiming to capture spatio-temporal features that contain motion cues. Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions. The experimental results indicate that our method outperforms existing state-of-the-art methods on the FE240 and COESOT datasets, providing an effective processing manner for the event data.

[CV-83] Advancing Open-Set Domain Generalization Using Evidential Bi-Level Hardest Domain Scheduler NEURIPS2024

链接: https://arxiv.org/abs/2409.17555
作者: Kunyu Peng,Di Wen,Kailun Yang,Ao Luo,Yufan Chen,Jia Fu,M. Saquib Sarfraz,Alina Roitberg,Rainer Stiefelhagen
关键词-EN: Open-Set Domain Generalization, Domain Generalization, open-set conditions, domain scheduler, Domain
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024. The source code will be available at this https URL

点击查看摘要

Abstract:In Open-Set Domain Generalization (OSDG), the model is exposed to both new variations of data appearance (domains) and open-set conditions, where both known and novel categories are present at test time. The challenges of this task arise from the dual need to generalize across diverse domains and accurately quantify category novelty, which is critical for applications in dynamic environments. Recently, meta-learning techniques have demonstrated superior results in OSDG, effectively orchestrating the meta-train and -test tasks by employing varied random categories and predefined domain partition strategies. These approaches prioritize a well-designed training schedule over traditional methods that focus primarily on data augmentation and the enhancement of discriminative feature learning. The prevailing meta-learning models in OSDG typically utilize a predefined sequential domain scheduler to structure data partitions. However, a crucial aspect that remains inadequately explored is the influence brought by strategies of domain schedulers during training. In this paper, we observe that an adaptive domain scheduler benefits more in OSDG compared with prefixed sequential and random domain schedulers. We propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve an adaptive domain scheduler. This method strategically sequences domains by assessing their reliabilities in utilizing a follower network, trained with confidence scores learned in an evidential manner, regularized by max rebiasing discrepancy, and optimized in a bi-level manner. The results show that our method substantially improves OSDG performance and achieves more discriminative embeddings for both the seen and unseen categories. The source code will be available at this https URL.

[CV-84] riple Point Masking

链接: https://arxiv.org/abs/2409.17547
作者: Jiaming Liu,Linghe Kong,Yue Wu,Maoguo Gong,Hao Li,Qiguang Miao,Wenping Ma,Can Qin
关键词-EN: encounter performance bottlenecks, learning methods encounter, overcome this limitation, mask learning methods, methods encounter performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing 3D mask learning methods encounter performance bottlenecks under limited data, and our objective is to overcome this limitation. In this paper, we introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders to achieve multi-mask learning for 3D point clouds. Specifically, we augment the baselines with two additional mask choices (i.e., medium mask and low mask) as our core insight is that the recovery process of an object can manifest in diverse ways. Previous high-masking schemes focus on capturing the global representation but lack the fine-grained recovery capability, so that the generated pre-trained weights tend to play a limited role in the fine-tuning process. With the support of the proposed TPM, available methods can exhibit more flexible and accurate completion capabilities, enabling the potential autoencoder in the pre-training stage to consider multiple representations of a single 3D object. In addition, an SVM-guided weight selection module is proposed to fill the encoder parameters for downstream networks with the optimal weight during the fine-tuning stage, maximizing linear accuracy and facilitating the acquisition of intricate representations for new objects. Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.

[CV-85] CAMOT: Camera Angle-aware Multi-Object Tracking

链接: https://arxiv.org/abs/2409.17533
作者: Felix Limanta,Kuniaki Uto,Koichi Shinoda
关键词-EN: inaccurate distance estimation, paper proposes CAMOT, simple camera angle, tackle two problems, inaccurate distance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper proposes CAMOT, a simple camera angle estimator for multi-object tracking to tackle two problems: 1) occlusion and 2) inaccurate distance estimation in the depth direction. Under the assumption that multiple objects are located on a flat plane in each video frame, CAMOT estimates the camera angle using object detection. In addition, it gives the depth of each object, enabling pseudo-3D MOT. We evaluated its performance by adding it to various 2D MOT methods on the MOT17 and MOT20 datasets and confirmed its effectiveness. Applying CAMOT to ByteTrack, we obtained 63.8% HOTA, 80.6% MOTA, and 78.5% IDF1 in MOT17, which are state-of-the-art results. Its computational cost is significantly lower than the existing deep-learning-based depth estimators for tracking.

[CV-86] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion NEURIPS2024

链接: https://arxiv.org/abs/2409.17531
作者: Ming Dai,Lingfeng Yang,Yihao Xu,Zhenhua Feng,Wankou Yang
关键词-EN: involves grounding descriptive, grounding descriptive sentences, common vision task, common vision, descriptive sentences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 21pages, 11figures, NeurIPS2024

点击查看摘要

Abstract:Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because the former paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding. Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks. Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch. This branch only consists of a lightweight MLP, which simplifies the structure and improves reasoning speed. Experiments on six widely used VG datasets, i.e., RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG. Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks. Codes and models will be available at \urlthis https URL.

[CV-87] Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Integrating SGBM and Segmentation Models

链接: https://arxiv.org/abs/2409.17526
作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
关键词-EN: radiata pine trees, pine trees presents, trees presents significant, safety risks due, Manual pruning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Manual pruning of radiata pine trees presents significant safety risks due to their substantial height and the challenging terrains in which they thrive. To address these risks, this research proposes the development of a drone-based pruning system equipped with specialized pruning tools and a stereo vision camera, enabling precise detection and trimming of branches. Deep learning algorithms, including YOLO and Mask R-CNN, are employed to ensure accurate branch detection, while the Semi-Global Matching algorithm is integrated to provide reliable distance estimation. The synergy between these techniques facilitates the precise identification of branch locations and enables efficient, targeted pruning. Experimental results demonstrate that the combined implementation of YOLO and SGBM enables the drone to accurately detect branches and measure their distances from the drone. This research not only improves the safety and efficiency of pruning operations but also makes a significant contribution to the advancement of drone technology in the automation of agricultural and forestry practices, laying a foundational framework for further innovations in environmental management.

[CV-88] JoyType: A Robust Design for Multilingual Visual Text Creation AAAI2025

链接: https://arxiv.org/abs/2409.17524
作者: Chao Li,Chen Jiang,Xiaolong Liu,Jun Zhao,Guoxin Wang
关键词-EN: non-Latin languages, poses a significant, accurately represented text, accurately represented, significant challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review at AAAI 2025

点击查看摘要

Abstract:Generating images with accurately represented text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million pairs of data. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then developed a text control network, Font ControlNet, tasked with extracting font style information to steer the image generation. To further enhance our model’s ability to maintain font style, notably in generating small-font text, we incorporated a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, facilitating the creation of varied image styles in conjunction with other stable diffusion models on HuggingFace and CivitAI. Our project is open-sourced on this https URL.

[CV-89] EAGLE: Egocentric AGgregated Language-video Engine

链接: https://arxiv.org/abs/2409.17523
作者: Jing Bi,Yunlong Tang,Luchuan Song,Ali Vosoughi,Nguyen Nguyen,Chenliang Xu
关键词-EN: video analysis brings, understanding human activities, first-person perspective, egocentric video analysis, egocentric video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ACMMM 24

点击查看摘要

Abstract:The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation in tasks like action recognition, procedure learning, and moment retrieval, \etc, coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the \textitfirst large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLM for egocentric video understanding. Our extensive experiments demonstrate EAGLE’s superior performance over existing models, highlighting its ability to balance task-specific understanding with holistic video interpretation. With EAGLE, we aim to pave the way for research opportunities and practical applications in real-world scenarios.

[CV-90] Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization

链接: https://arxiv.org/abs/2409.17519
作者: Kento Kawaharazuka,Yoshiki Obinata,Naoaki Kanazawa,Kei Okada,Masayuki Inaba
关键词-EN: diverse environments, environmental state recognition, autonomously navigate, navigate and operate, operate in diverse
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Advanced Robotics, website - this https URL

点击查看摘要

Abstract:In order for robots to autonomously navigate and operate in diverse environments, it is essential for them to recognize the state of their environment. On the other hand, the environmental state recognition has traditionally involved distinct methods tailored to each state to be recognized. In this study, we perform a unified environmental state recognition for robots through the spoken language with pre-trained large-scale vision-language models. We apply Visual Question Answering and Image-to-Text Retrieval, which are tasks of Vision-Language Models. We show that with our method, it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed and whether water is running in a sink, without training neural networks or manual programming. In addition, the recognition accuracy can be improved by selecting appropriate texts from the set of prepared texts based on black-box optimization. For each state recognition, only the text set and its weighting need to be changed, eliminating the need to prepare multiple different models and programs, and facilitating the management of source code and computer resource. We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch.

[CV-91] SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning ECCV2024

链接: https://arxiv.org/abs/2409.17512
作者: Zerun Wang,Liuyu Xiang,Lang Huang,Jiafeng Mao,Ling Xiao,Toshihiko Yamasaki
关键词-EN: Open-set semi-supervised learning, semi-supervised learning, leverages practical open-set, practical open-set unlabeled, open-set unlabeled data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 accepted

点击查看摘要

Abstract:Open-set semi-supervised learning (OSSL) leverages practical open-set unlabeled data, comprising both in-distribution (ID) samples from seen classes and out-of-distribution (OOD) samples from unseen classes, for semi-supervised learning (SSL). Prior OSSL methods initially learned the decision boundary between ID and OOD with labeled ID data, subsequently employing self-training to refine this boundary. These methods, however, suffer from the tendency to overtrust the labeled ID data: the scarcity of labeled data caused the distribution bias between the labeled samples and the entire ID data, which misleads the decision boundary to overfit. The subsequent self-training process, based on the overfitted result, fails to rectify this problem. In this paper, we address the overtrusting issue by treating OOD samples as an additional class, forming a new SSL process. Specifically, we propose SCOMatch, a novel OSSL method that 1) selects reliable OOD samples as new labeled data with an OOD memory queue and a corresponding update strategy and 2) integrates the new SSL process into the original task through our Simultaneous Close-set and Open-set self-training. SCOMatch refines the decision boundary of ID and OOD classes across the entire dataset, thereby leading to improved results. Extensive experimental results show that SCOMatch significantly outperforms the state-of-the-art methods on various benchmarks. The effectiveness is further verified through ablation studies and visualization. Comments: ECCV 2024 accepted Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2409.17512 [cs.CV] (or arXiv:2409.17512v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2409.17512 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-92] Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

链接: https://arxiv.org/abs/2409.17508
作者: Xun Zhu,Ying Hu,Fanbin Mo,Miao Li,Ji Wu
关键词-EN: shown impressive capabilities, Multi-modal large language, large language models, large language, shown impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE that leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves efficient solution to the tug-of-war problem and can perform six different medical tasks including question answering, visual question answering, report generation, referring expression comprehension, referring expression generation and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with up to an average 8% performance gains. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code, data and model will be soon available at GitHub.

[CV-93] Learning Quantized Adaptive Conditions for Diffusion Models

链接: https://arxiv.org/abs/2409.17487
作者: Yuchen Liang,Yuchuan Tian,Lei Yu,Huao Tang,Jie Hu,Xiangzhong Fang,Hanting Chen
关键词-EN: diffusion models hinders, function evaluations, trajectories in diffusion, diffusion models, models hinders
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The curvature of ODE trajectories in diffusion models hinders their ability to generate high-quality images in a few number of function evaluations (NFE). In this paper, we propose a novel and effective approach to reduce trajectory curvature by utilizing adaptive conditions. By employing a extremely light-weight quantized encoder, our method incurs only an additional 1% of training parameters, eliminates the need for extra regularization terms, yet achieves significantly better sample quality. Our approach accelerates ODE sampling while preserving the downstream task image editing capabilities of SDE techniques. Extensive experiments verify that our method can generate high quality results under extremely limited sampling costs. With only 6 NFE, we achieve 5.14 FID on CIFAR-10, 6.91 FID on FFHQ 64x64 and 3.10 FID on AFHQv2.

[CV-94] Global-Local Medical SAM Adaptor Based on Full Adaption

链接: https://arxiv.org/abs/2409.17486
作者: Meng Wang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yarong Feng(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Yongwei Tang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Tian Zhang(Software college Northeastern University Shenyang, Liaoning Province, P. R. China),Yuxin Liang(School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China),Chao Lv(Department of General Surgery, Shengjing Hospital China Medical University Shenyang, Liaoning Province, P. R. China)
关键词-EN: Medical SAM adaptor, visual language models, made great breakthroughs, Emerging of visual, SAM adaptor
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Emerging of visual language models, such as the segment anything model (SAM), have made great breakthroughs in the field of universal semantic segmentation and significantly aid the improvements of medical image segmentation, in particular with the help of Medical SAM adaptor (Med-SA). However, Med-SA still can be improved, as it fine-tunes SAM in a partial adaption manner. To resolve this problem, we present a novel global medical SAM adaptor (GMed-SA) with full adaption, which can adapt SAM globally. We further combine GMed-SA and Med-SA to propose a global-local medical SAM adaptor (GLMed-SA) to adapt SAM both globally and locally. Extensive experiments have been performed on the challenging public 2D melanoma segmentation dataset. The results show that GLMed-SA outperforms several state-of-the-art semantic segmentation methods on various evaluation metrics, demonstrating the superiority of our methods.

[CV-95] Revisiting Deep Ensemble Uncertainty for Enhanced Medical Anomaly Detection MICCAI2024

链接: https://arxiv.org/abs/2409.17485
作者: Yi Gu,Yi Lin,Kwang-Ting Cheng,Hao Chen
关键词-EN: identification and localization, Medical anomaly detection, crucial in pathological, pathological identification, anomaly detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Early accepted by MICCAI2024

点击查看摘要

Abstract:Medical anomaly detection (AD) is crucial in pathological identification and localization. Current methods typically rely on uncertainty estimation in deep ensembles to detect anomalies, assuming that ensemble learners should agree on normal samples while exhibiting disagreement on unseen anomalies in the output space. However, these methods may suffer from inadequate disagreement on anomalies or diminished agreement on normal samples. To tackle these issues, we propose D2UE, a Diversified Dual-space Uncertainty Estimation framework for medical anomaly detection. To effectively balance agreement and disagreement for anomaly detection, we propose Redundancy-Aware Repulsion (RAR), which uses a similarity kernel that remains invariant to both isotropic scaling and orthogonal transformations, explicitly promoting diversity in learners’ feature space. Moreover, to accentuate anomalous regions, we develop Dual-Space Uncertainty (DSU), which utilizes the ensemble’s uncertainty in input and output spaces. In input space, we first calculate gradients of reconstruction error with respect to input images. The gradients are then integrated with reconstruction outputs to estimate uncertainty for inputs, enabling effective anomaly discrimination even when output space disagreement is minimal. We conduct a comprehensive evaluation of five medical benchmarks with different backbones. Experimental results demonstrate the superiority of our method to state-of-the-art methods and the effectiveness of each component in our framework. Our code is available at this https URL.

[CV-96] FS-NeRF: Template-Free NeRF for Semantic 3D Reconstruction of Dynamic Scene NEURIPS2024

链接: https://arxiv.org/abs/2409.17459
作者: Sandika Biswas,Qianyi Wu,Biplab Banerjee,Hamid Rezatofighi
关键词-EN: Neural Implicit models, handling dynamic environments, Implicit models, entities remains challenging, Neural Implicit
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in NeuRIPS 2024

点击查看摘要

Abstract:Despite advancements in Neural Implicit models for 3D surface reconstruction, handling dynamic environments with arbitrary rigid, non-rigid, or deformable entities remains challenging. Many template-based methods are entity-specific, focusing on humans, while generic reconstruction methods adaptable to such dynamic scenes often require additional inputs like depth or optical flow or rely on pre-trained image features for reasonable outcomes. These methods typically use latent codes to capture frame-by-frame deformations. In contrast, some template-free methods bypass these requirements and adopt traditional LBS (Linear Blend Skinning) weights for a detailed representation of deformable object motions, although they involve complex optimizations leading to lengthy training times. To this end, as a remedy, this paper introduces TFS-NeRF, a template-free 3D semantic NeRF for dynamic scenes captured from sparse or single-view RGB videos, featuring interactions among various entities and more time-efficient than other LBS-based approaches. Our framework uses an Invertible Neural Network (INN) for LBS prediction, simplifying the training process. By disentangling the motions of multiple entities and optimizing per-entity skinning weights, our method efficiently generates accurate, semantically separable geometries. Extensive experiments demonstrate that our approach produces high-quality reconstructions of both deformable and non-deformable objects in complex interactions, with improved training efficiency compared to existing methods.

[CV-97] CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

链接: https://arxiv.org/abs/2409.17457
作者: Sifan Wu,Amir Khasahmadi,Mor Katz,Pradeep Kumar Jayaraman,Yewen Pu,Karl Willis,Bang Liu
关键词-EN: contemporary mechanical design, CAD, mechanical design, central to contemporary, Design
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. However, it encounters challenges in achieving precise parametric sketch modeling and lacks practical evaluation metrics suitable for mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design.

[CV-98] AgMTR: Agent Mining Transformer for Few-shot Segmentation in Remote Sensing

链接: https://arxiv.org/abs/2409.17453
作者: Hanbo Bi,Yingchao Feng,Yongqiang Mao,Jianning Pei,Wenhui Diao,Hongqi Wang,Xian Sun
关键词-EN: Few-shot Segmentation, aims to segment, labeled samples, segment the interested, interested objects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted to IJCV

点击查看摘要

Abstract:Few-shot Segmentation (FSS) aims to segment the interested objects in the query image with just a handful of labeled samples (i.e., support images). Previous schemes would leverage the similarity between support-query pixel pairs to construct the pixel-level semantic correlation. However, in remote sensing scenarios with extreme intra-class variations and cluttered backgrounds, such pixel-level correlations may produce tremendous mismatches, resulting in semantic ambiguity between the query foreground (FG) and background (BG) pixels. To tackle this problem, we propose a novel Agent Mining Transformer (AgMTR), which adaptively mines a set of local-aware agents to construct agent-level semantic correlation. Compared with pixel-level semantics, the given agents are equipped with local-contextual information and possess a broader receptive field. At this point, different query pixels can selectively aggregate the fine-grained local semantics of different agents, thereby enhancing the semantic clarity between query FG and BG pixels. Concretely, the Agent Learning Encoder (ALE) is first proposed to erect the optimal transport plan that arranges different agents to aggregate support semantics under different local regions. Then, for further optimizing the agents, the Agent Aggregation Decoder (AAD) and the Semantic Alignment Decoder (SAD) are constructed to break through the limited support set for mining valuable class-specific semantics from unlabeled data sources and the query image itself, respectively. Extensive experiments on the remote sensing benchmark iSAID indicate that the proposed method achieves state-of-the-art performance. Surprisingly, our method remains quite competitive when extended to more common natural scenarios, i.e., PASCAL-5i and COCO-20i.

[CV-99] Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

链接: https://arxiv.org/abs/2409.17439
作者: Chirag Vashist,Shichong Peng,Ke Li
关键词-EN: learn deep generative, deep generative models, limited training data, Maximum Likelihood Estimation, Implicit Maximum Likelihood
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An emerging area of research aims to learn deep generative models with limited training data. Prior generative models like GANs and diffusion models require a lot of data to perform well, and their performance degrades when they are trained on only a small amount of data. A recent technique called Implicit Maximum Likelihood Estimation (IMLE) has been adapted to the few-shot setting, achieving state-of-the-art performance. However, current IMLE-based approaches encounter challenges due to inadequate correspondence between the latent codes selected for training and those drawn during inference. This results in suboptimal test-time performance. We theoretically show a way to address this issue and propose RS-IMLE, a novel approach that changes the prior distribution used for training. This leads to substantially higher quality image generation compared to existing GAN and IMLE-based methods, as validated by comprehensive experiments conducted on nine few-shot image datasets.

[CV-100] HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing

链接: https://arxiv.org/abs/2409.17432
作者: Md Tanvir Islam,Nasir Rahim,Saeed Anwar,Muhammad Saqib,Sambit Bakshi,Khan Muhammad
关键词-EN: computer vision applications, haze type classification, type classification, haze type, haze
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM Multimedia 2024

点击查看摘要

Abstract:Reducing the atmospheric haze and enhancing image clarity is crucial for computer vision applications. The lack of real-life hazy ground truth images necessitates synthetic datasets, which often lack diverse haze types, impeding effective haze type classification and dehazing algorithm selection. This research introduces the HazeSpace2M dataset, a collection of over 2 million images designed to enhance dehazing through haze type classification. HazeSpace2M includes diverse scenes with 10 haze intensity levels, featuring Fog, Cloud, and Environmental Haze (EH). Using the dataset, we introduce a technique of haze type classification followed by specialized dehazers to clear hazy images. Unlike conventional methods, our approach classifies haze types before applying type-specific dehazing, improving clarity in real-life hazy images. Benchmarking with state-of-the-art (SOTA) models, ResNet50 and AlexNet achieve 92.75% and 92.50% accuracy, respectively, against existing synthetic datasets. However, these models achieve only 80% and 70% accuracy, respectively, against our Real Hazy Testset (RHT), highlighting the challenging nature of our HazeSpace2M dataset. Additional experiments show that haze type classification followed by specialized dehazing improves results by 2.41% in PSNR, 17.14% in SSIM, and 10.2% in MSE over general dehazers. Moreover, when testing with SOTA dehazing models, we found that applying our proposed framework significantly improves their performance. These results underscore the significance of HazeSpace2M and our proposed framework in addressing atmospheric haze in multimedia processing. Complete code and dataset is available on \hrefthis https URL \textcolorblue\textbfGitHub.

[CV-101] ransient Adversarial 3D Projection Attacks on Object Detection in Autonomous Driving

链接: https://arxiv.org/abs/2409.17403
作者: Ce Zhou,Qiben Yan,Sijia Liu
关键词-EN: Object detection, crucial task, targeting object detection, patches or stickers, Object
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 7 figures, SmartSP 2024

点击查看摘要

Abstract:Object detection is a crucial task in autonomous driving. While existing research has proposed various attacks on object detection, such as those using adversarial patches or stickers, the exploration of projection attacks on 3D surfaces remains largely unexplored. Compared to adversarial patches or stickers, which have fixed adversarial patterns, projection attacks allow for transient modifications to these patterns, enabling a more flexible attack. In this paper, we introduce an adversarial 3D projection attack specifically targeting object detection in autonomous driving scenarios. We frame the attack formulation as an optimization problem, utilizing a combination of color mapping and geometric transformation models. Our results demonstrate the effectiveness of the proposed attack in deceiving YOLOv3 and Mask R-CNN in physical settings. Evaluations conducted in an indoor environment show an attack success rate of up to 100% under low ambient light conditions, highlighting the potential damage of our attack in real-world driving scenarios.

[CV-102] AgRegNet: A Deep Regression Network for Flower and Fruit Density Estimation Localization and Counting in Orchards

链接: https://arxiv.org/abs/2409.17400
作者: Uddhav Bhattarai,Santosh Bhusal,Qin Zhang,Manoj Karkee
关键词-EN: agricultural industry today, manual labor availability, fruit density estimation, major challenges, agricultural industry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the major challenges for the agricultural industry today is the uncertainty in manual labor availability and the associated cost. Automated flower and fruit density estimation, localization, and counting could help streamline harvesting, yield estimation, and crop-load management strategies such as flower and fruitlet thinning. This article proposes a deep regression-based network, AgRegNet, to estimate density, count, and location of flower and fruit in tree fruit canopies without explicit object detection or polygon annotation. Inspired by popular U-Net architecture, AgRegNet is a U-shaped network with an encoder-to-decoder skip connection and modified ConvNeXt-T as an encoder feature extractor. AgRegNet can be trained based on information from point annotation and leverages segmentation information and attention modules (spatial and channel) to highlight relevant flower and fruit features while suppressing non-relevant background features. Experimental evaluation in apple flower and fruit canopy images under an unstructured orchard environment showed that AgRegNet achieved promising accuracy as measured by Structural Similarity Index (SSIM), percentage Mean Absolute Error (pMAE) and mean Average Precision (mAP) to estimate flower and fruit density, count, and centroid location, respectively. Specifically, the SSIM, pMAE, and mAP values for flower images were 0.938, 13.7%, and 0.81, respectively. For fruit images, the corresponding values were 0.910, 5.6%, and 0.93. Since the proposed approach relies on information from point annotation, it is suitable for sparsely and densely located objects. This simplified technique will be highly applicable for growers to accurately estimate yields and decide on optimal chemical and mechanical flower thinning practices.

[CV-103] Data-efficient Trajectory Prediction via Coreset Selection

链接: https://arxiv.org/abs/2409.17385
作者: Ruining Yang,Lili Su
关键词-EN: multiple information-collection devices, Modern vehicles, sensors and cameras, continuously generating, equipped with multiple
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Modern vehicles are equipped with multiple information-collection devices such as sensors and cameras, continuously generating a large volume of raw data. Accurately predicting the trajectories of neighboring vehicles is a vital component in understanding the complex driving environment. Yet, training trajectory prediction models is challenging in two ways. Processing the large-scale data is computation-intensive. Moreover, easy-medium driving scenarios often overwhelmingly dominate the dataset, leaving challenging driving scenarios such as dense traffic under-represented. For example, in the Argoverse motion prediction dataset, there are very few instances with \ge 50 agents, while scenarios with 10 \thicksim 20 agents are far more common. In this paper, to mitigate data redundancy in the over-represented driving scenarios and to reduce the bias rooted in the data scarcity of complex ones, we propose a novel data-efficient training method based on coreset selection. This method strategically selects a small but representative subset of data while balancing the proportions of different scenario difficulties. To the best of our knowledge, we are the first to introduce a method capable of effectively condensing large-scale trajectory dataset, while achieving a state-of-the-art compression ratio. Notably, even when using only 50% of the Argoverse dataset, the model can be trained with little to no decline in performance. Moreover, the selected coreset maintains excellent generalization ability.

[CV-104] Optical Lens Attack on Deep Learning Based Monocular Depth Estimation

链接: https://arxiv.org/abs/2409.17376
作者: Ce Zhou(1),Qiben Yan(1),Daniel Kent(1),Guangjing Wang(1),Ziqi Zhang(2),Hayder Radha(1) ((1) Michigan State University, (2) Peking University)
关键词-EN: Monocular Depth Estimation, plays a crucial, Depth Estimation, vision-based Autonomous Driving, crucial role
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 26 pages, 13 figures, SecureComm 2024

点击查看摘要

Abstract:Monocular Depth Estimation (MDE) plays a crucial role in vision-based Autonomous Driving (AD) systems. It utilizes a single-camera image to determine the depth of objects, facilitating driving decisions such as braking a few meters in front of a detected obstacle or changing lanes to avoid collision. In this paper, we investigate the security risks associated with monocular vision-based depth estimation algorithms utilized by AD systems. By exploiting the vulnerabilities of MDE and the principles of optical lenses, we introduce LensAttack, a physical attack that involves strategically placing optical lenses on the camera of an autonomous vehicle to manipulate the perceived object depths. LensAttack encompasses two attack formats: concave lens attack and convex lens attack, each utilizing different optical lenses to induce false depth perception. We begin by constructing a mathematical model of our attack, incorporating various attack parameters. Subsequently, we simulate the attack and evaluate its real-world performance in driving scenarios to demonstrate its effect on state-of-the-art MDE models. The results highlight the significant impact of LensAttack on the accuracy of depth estimation in AD systems.

[CV-105] he Overfocusing Bias of Convolutional Neural Networks: A Saliency-Guided Regularization Approach

链接: https://arxiv.org/abs/2409.17370
作者: David Bertoin,Eduardo Hugo Sanchez,Mehdi Zouitine,Emmanuel Rachelson
关键词-EN: computer vision, low-data regimes, transformers being considered, standard in computer, convolutional neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite transformers being considered as the new standard in computer vision, convolutional neural networks (CNNs) still outperform them in low-data regimes. Nonetheless, CNNs often make decisions based on narrow, specific regions of input images, especially when training data is limited. This behavior can severely compromise the model’s generalization capabilities, making it disproportionately dependent on certain features that might not represent the broader context of images. While the conditions leading to this phenomenon remain elusive, the primary intent of this article is to shed light on this observed behavior of neural networks. Our research endeavors to prioritize comprehensive insight and to outline an initial response to this phenomenon. In line with this, we introduce Saliency Guided Dropout (SGDrop), a pioneering regularization approach tailored to address this specific issue. SGDrop utilizes attribution methods on the feature map to identify and then reduce the influence of the most salient features during training. This process encourages the network to diversify its attention and not focus solely on specific standout areas. Our experiments across several visual classification benchmarks validate SGDrop’s role in enhancing generalization. Significantly, models incorporating SGDrop display more expansive attributions and neural activity, offering a more comprehensive view of input images in contrast to their traditionally trained counterparts.

[CV-106] Implicit Neural Representations for Simultaneous Reduction and Continuous Reconstruction of Multi-Altitude Climate Data

链接: https://arxiv.org/abs/2409.17367
作者: Alif Bin Abdul Qayyum,Xihaier Luo,Nathan M. Urban,Xiaoning Qian,Byung-Jun Yoon
关键词-EN: renewable energy sources, greenhouse gas emissions, reduce greenhouse gas, energy sources, global warming
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2401.16936

点击查看摘要

Abstract:The world is moving towards clean and renewable energy sources, such as wind energy, in an attempt to reduce greenhouse gas emissions that contribute to global warming. To enhance the analysis and storage of wind data, we introduce a deep learning framework designed to simultaneously enable effective dimensionality reduction and continuous representation of multi-altitude wind data from discrete observations. The framework consists of three key components: dimensionality reduction, cross-modal prediction, and super-resolution. We aim to: (1) improve data resolution across diverse climatic conditions to recover high-resolution details; (2) reduce data dimensionality for more efficient storage of large climate datasets; and (3) enable cross-prediction between wind data measured at different heights. Comprehensive testing confirms that our approach surpasses existing methods in both super-resolution quality and compression efficiency.

[CV-107] Improving satellite imagery segmentation using multiple Sentinel-2 revisits

链接: https://arxiv.org/abs/2409.17363
作者: Kartik Jindgar,Grace W. Lindsay
关键词-EN: traditional computer vision, computer vision, recent years, shared models pre-trained, benefited immensely
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, analysis of remote sensing data has benefited immensely from borrowing techniques from the broader field of computer vision, such as the use of shared models pre-trained on large and diverse datasets. However, satellite imagery has unique features that are not accounted for in traditional computer vision, such as the existence of multiple revisits of the same location. Here, we explore the best way to use revisits in the framework of fine-tuning pre-trained remote sensing models. We focus on an applied research question of relevance to climate change mitigation – power substation segmentation – that is representative of applied uses of pre-trained models more generally. Through extensive tests of different multi-temporal input schemes across diverse model architectures, we find that fusing representations from multiple revisits in the model latent space is superior to other methods of using revisits, including as a form of data augmentation. We also find that a SWIN Transformer-based architecture performs better than U-nets and ViT-based models. We verify the generality of our results on a separate building density estimation task.

[CV-108] A vision-based framework for human behavior understanding in industrial assembly lines

链接: https://arxiv.org/abs/2409.17356
作者: Konstantinos Papoutsakis,Nikolaos Bakalos,Konstantinos Fragkoulis,Athena Zacharia,Georgia Kapetadimitri,Maria Pateraki
关键词-EN: understanding human behavior, industrial assembly lines, paper introduces, introduces a vision-based, capturing and understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a vision-based framework for capturing and understanding human behavior in industrial assembly lines, focusing on car door manufacturing. The framework leverages advanced computer vision techniques to estimate workers’ locations and 3D poses and analyze work postures, actions, and task progress. A key contribution is the introduction of the CarDA dataset, which contains domain-relevant assembly actions captured in a realistic setting to support the analysis of the framework for human pose and action analysis. The dataset comprises time-synchronized multi-camera RGB-D videos, motion capture data recorded in a real car manufacturing environment, and annotations for EAWS-based ergonomic risk scores and assembly activities. Experimental results demonstrate the effectiveness of the proposed approach in classifying worker postures and robust performance in monitoring assembly task progress.

[CV-109] SeaSplat: Representing Underwater Scenes with 3D Gaussian Splatting and a Physically Grounded Image Formation Model

链接: https://arxiv.org/abs/2409.17345
作者: Daniel Yang,John J. Leonard,Yogesh Girdhar
关键词-EN: enable real-time rendering, method to enable, underwater image formation, real-time rendering, radiance fields
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Project page here: this https URL

点击查看摘要

Abstract:We introduce SeaSplat, a method to enable real-time rendering of underwater scenes leveraging recent advances in 3D radiance fields. Underwater scenes are challenging visual environments, as rendering through a medium such as water introduces both range and color dependent effects on image capture. We constrain 3D Gaussian Splatting (3DGS), a recent advance in radiance fields enabling rapid training and real-time rendering of full 3D scenes, with a physically grounded underwater image formation model. Applying SeaSplat to the real-world scenes from SeaThru-NeRF dataset, a scene collected by an underwater vehicle in the US Virgin Islands, and simulation-degraded real-world scenes, not only do we see increased quantitative performance on rendering novel viewpoints from the scene with the medium present, but are also able to recover the underlying true color of the scene and restore renders to be without the presence of the intervening medium. We show that the underwater image formation helps learn scene structure, with better depth maps, as well as show that our improvements maintain the significant computational improvements afforded by leveraging a 3D Gaussian representation.

[CV-110] Energy-Efficient Real-Time Computer Vision with Intelligent Skipping via Reconfigurable CMOS Image Sensors

链接: https://arxiv.org/abs/2409.17341
作者: Md Abdullah-Al Kaiser,Sreetama Sarkar,Peter A. Beerel,Akhilesh R. Jaiswal,Gourav Datta
关键词-EN: Current video-based computer, video-based computer vision, Current video-based, high energy consumption, applications typically suffer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Current video-based computer vision (CV) applications typically suffer from high energy consumption due to reading and processing all pixels in a frame, regardless of their significance. While previous works have attempted to reduce this energy by skipping input patches or pixels and using feedback from the end task to guide the skipping algorithm, the skipping is not performed during the sensor read phase. As a result, these methods can not optimize the front-end sensor energy. Moreover, they may not be suitable for real-time applications due to the long latency of modern CV networks that are deployed in the back-end. To address this challenge, this paper presents a custom-designed reconfigurable CMOS image sensor (CIS) system that improves energy efficiency by selectively skipping uneventful regions or rows within a frame during the sensor’s readout phase, and the subsequent analog-to-digital conversion (ADC) phase. A novel masking algorithm intelligently directs the skipping process in real-time, optimizing both the front-end sensor and back-end neural networks for applications including autonomous driving and augmented/virtual reality (AR/VR). Our system can also operate in standard mode without skipping, depending on application needs. We evaluate our hardware-algorithm co-design framework on object detection based on BDD100K and ImageNetVID, and gaze estimation based on OpenEDS, achieving up to 53% reduction in front-end sensor energy while maintaining state-of-the-art (SOTA) accuracy.

[CV-111] Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting

链接: https://arxiv.org/abs/2409.17332
作者: Jay Zoellin,Colin Merk,Mischa Buob,Amr Saad,Samuel Giesser,Tahm Spitznagel,Ferhat Turgut,Rui Santos,Yukun Zhou,Sigfried Wagner,Pearse A. Keane,Yih Chung Tham,Delia Cabrera DeBuc,Matthias D. Becker,Gabor M. Somfai
关键词-EN: Integrating deep learning, greatly advance diagnostic, Integrating deep, self-supervised learning, DINORET
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this http URL , C. Merk and M. Buob contributed equally as shared-first authors. D. Cabrera DeBuc, M. D. Becker and G. M. Somfai contributed equally as senior authors for this work

点击查看摘要

Abstract:Integrating deep learning into medical imaging is poised to greatly advance diagnostic methods but it faces challenges with generalizability. Foundation models, based on self-supervised learning, address these issues and improve data efficiency. Natural domain foundation models show promise for medical imaging, but systematic research evaluating domain adaptation, especially using self-supervised learning and parameter-efficient fine-tuning, remains underexplored. Additionally, little research addresses the issue of catastrophic forgetting during fine-tuning of foundation models. We adapted the DINOv2 vision transformer for retinal imaging classification tasks using self-supervised learning and generated two novel foundation models termed DINORET and BE DINORET. Publicly available color fundus photographs were employed for model development and subsequent fine-tuning for diabetic retinopathy staging and glaucoma detection. We introduced block expansion as a novel domain adaptation strategy and assessed the models for catastrophic forgetting. Models were benchmarked to RETFound, a state-of-the-art foundation model in ophthalmology. DINORET and BE DINORET demonstrated competitive performance on retinal imaging tasks, with the block expanded model achieving the highest scores on most datasets. Block expansion successfully mitigated catastrophic forgetting. Our few-shot learning studies indicated that DINORET and BE DINORET outperform RETFound in terms of data-efficiency. This study highlights the potential of adapting natural domain vision models to retinal imaging using self-supervised learning and block expansion. BE DINORET offers robust performance without sacrificing previously acquired capabilities. Our findings suggest that these methods could enable healthcare institutions to develop tailored vision models for their patient populations, enhancing global healthcare inclusivity.

[CV-112] ChatCam: Empowering Camera Control through Conversational AI NEURIPS2024

链接: https://arxiv.org/abs/2409.17331
作者: Xinhang Liu,Yu-Wing Tai,Chi-Keung Tang
关键词-EN: crafting compelling visual, compelling visual narratives, Cinematographers adeptly capture, crafting compelling, intricate camera movements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper accepted to NeurIPS 2024

点击查看摘要

Abstract:Cinematographers adeptly capture the essence of the world, crafting compelling visual narratives through intricate camera movements. Witnessing the strides made by large language models in perceiving and interacting with the 3D world, this study explores their capability to control cameras with human language guidance. We introduce ChatCam, a system that navigates camera movements through conversations with users, mimicking a professional cinematographer’s workflow. To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation. We also develop an Anchor Determinator to ensure precise camera trajectory placement. ChatCam understands user requests and employs our proposed tools to generate trajectories, which can be used to render high-quality video footage on radiance field representations. Our experiments, including comparisons to state-of-the-art approaches and user studies, demonstrate our approach’s ability to interpret and execute complex instructions for camera operation, showing promising applications in real-world production settings.

[CV-113] VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection ECCV2024

链接: https://arxiv.org/abs/2409.17330
作者: Liangyu Zhong,Joachim Sicking,Fabian Hüger,Hanno Gottschalk
关键词-EN: achieved significant success, identically distributed data, Semantic segmentation networks, achieved significant, significant success
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 27 pages, 9 figures, to be published in ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD)

点击查看摘要

Abstract:Semantic segmentation networks have achieved significant success under the assumption of independent and identically distributed data. However, these networks often struggle to detect anomalies from unknown semantic classes due to the limited set of visual concepts they are typically trained on. To address this issue, anomaly segmentation often involves fine-tuning on outlier samples, necessitating additional efforts for data collection, labeling, and model retraining. Seeking to avoid this cumbersome work, we take a different approach and propose to incorporate Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness. Additionally, we propose a new scoring function that enables data- and training-free outlier supervision via textual prompts. The resulting VL4AD model, which includes max-logit prompt ensembling and a class-merging strategy, achieves competitive performance on widely used benchmark datasets, thereby demonstrating the potential of vision-language models for pixel-wise anomaly detection.

[CV-114] Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement

链接: https://arxiv.org/abs/2409.17316
作者: Haodong Li,Hao Lu,Ying-Cong Chen
关键词-EN: monitoring physiological signals, Remote photoplethysmography, physiological signals, gaining prominence, monitoring physiological
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) is gaining prominence for its non-invasive approach to monitoring physiological signals using only cameras. Despite its promise, the adaptability of rPPG models to new, unseen domains is hindered due to the environmental sensitivity of physiological signals. To address this, we pioneer the Test-Time Adaptation (TTA) in rPPG, enabling the adaptation of pre-trained models to the target domain during inference, sidestepping the need for annotations or source data due to privacy considerations. Particularly, utilizing only the user’s face video stream as the accessible target domain data, the rPPG model is adjusted by tuning on each single instance it encounters. However, 1) TTA algorithms are designed predominantly for classification tasks, ill-suited in regression tasks such as rPPG due to inadequate supervision. 2) Tuning pre-trained models in a single-instance manner introduces variability and instability, posing challenges to effectively filtering domain-relevant from domain-irrelevant features while simultaneously preserving the learned information. To overcome these challenges, we present Bi-TTA, a novel expert knowledge-based Bidirectional Test-Time Adapter framework. Specifically, leveraging two expert-knowledge priors for providing self-supervision, our Bi-TTA primarily comprises two modules: a prospective adaptation (PA) module using sharpness-aware minimization to eliminate domain-irrelevant noise, enhancing the stability and efficacy during the adaptation process, and a retrospective stabilization (RS) module to dynamically reinforce crucial learned model parameters, averting performance degradation caused by overfitting or catastrophic forgetting. To this end, we established a large-scale benchmark for rPPG tasks under TTA protocol. The experimental results demonstrate the significant superiority of our approach over the state-of-the-art.

[CV-115] Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation EMNLP2024

链接: https://arxiv.org/abs/2409.17313
作者: Zehao Wang,Minye Wu,Yixin Cao,Yubo Ma,Meiqi Chen,Tinne Tuytelaars
关键词-EN: study presents, instruction categories, evaluation framework, VLN, Vision-Language Navigation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Findings; project page: this https URL

点击查看摘要

Abstract:This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models for various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task. The CFG serves as the basis for the problem decomposition and the core premise of the instruction categories design. We propose a semi-automatic method for CFG construction with the help of Large-Language Models (LLMs). Then, we induct and generate data spanning five principal instruction categories (i.e. direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other interesting findings contribute to the development of future language-guided navigation systems.

[CV-116] Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

链接: https://arxiv.org/abs/2409.17280
作者: Hui En Pang,Shuai Liu,Zhongang Cai,Lei Yang,Tianwei Zhang,Ziwei Liu
关键词-EN: Gaussian Splatting framework, Splatting framework, Gaussian Splatting, textbf, Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present \textbfDisco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. Different from existing methods, Disco4D distinctively disentangles clothings (with Gaussian models) from the human body (with SMPL-X model), significantly enhancing the generation details and flexibility. It has the following technical innovations. \textbf1) Disco4D learns to efficiently fit the clothing Gaussians over the SMPL-X Gaussians. \textbf2) It adopts diffusion models to enhance the 3D generation process, \textite.g., modeling occluded parts not visible in the input image. \textbf3) It learns an identity encoding for each clothing Gaussian to facilitate the separation and extraction of clothing assets. Furthermore, Disco4D naturally supports 4D human animation with vivid dynamics. Extensive experiments demonstrate the superiority of Disco4D on 4D human generation and animation tasks. Our visualizations can be found in \urlthis https URL.

[CV-117] Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs ECCV2024

链接: https://arxiv.org/abs/2409.17221
作者: Mattia Segu,Luigi Piccinelli,Siyuan Li,Luc Van Gool,Fisher Yu,Bernt Schiele
关键词-EN: methods requires enormous, provide bounding boxes, requires enormous annotation, enormous annotation efforts, multiple object tracking
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:The supervision of state-of-the-art multiple object tracking (MOT) methods requires enormous annotation efforts to provide bounding boxes for all frames of all videos, and instance IDs to associate them through time. To this end, we introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations, and no tracking labels. First, we design a quasi-dense temporal object appearance graph, and propose a novel multi-positive contrastive objective to optimize random walks on the graph and learn instance similarities. Then, we introduce an algorithm to enforce mutually-exclusive connective properties across instances in the graph, optimizing the learned topology for MOT. At inference time, we propose to associate detected instances to tracklets based on the max-likelihood transition state under motion-constrained bi-directional walks. Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K. Remarkably, our proposal outperforms the previous self-supervised trackers even when drastically reducing the annotation requirements by up to 400x.

[CV-118] Neural Network Architecture Search Enabled Wide-Deep Learning (NAS-WD) for Spatially Heterogenous Property Awared Chicken Woody Breast Classification and Hardness Regression

链接: https://arxiv.org/abs/2409.17210
作者: Chaitanya Pallerla,Yihong Feng,Casey M. Owens,Ramesh Bahadur Bist,Siavash Mahmoudi,Pouya Sohrabipour,Amirreza Davar,Dongyi Wang
关键词-EN: intensive genetic selection, rapid growth rates, global poultry industry, high broiler yields, Due to intensive
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Due to intensive genetic selection for rapid growth rates and high broiler yields in recent years, the global poultry industry has faced a challenging problem in the form of woody breast (WB) conditions. This condition has caused significant economic losses as high as 200 million annually, and the root cause of WB has yet to be identified. Human palpation is the most common method of distinguishing a WB from others. However, this method is time-consuming and subjective. Hyperspectral imaging (HSI) combined with machine learning algorithms can evaluate the WB conditions of fillets in a non-invasive, objective, and high-throughput manner. In this study, 250 raw chicken breast fillet samples (normal, mild, severe) were taken, and spatially heterogeneous hardness distribution was first considered when designing HSI processing models. The study not only classified the WB levels from HSI but also built a regression model to correlate the spectral information with sample hardness data. To achieve a satisfactory classification and regression model, a neural network architecture search (NAS) enabled a wide-deep neural network model named NAS-WD, which was developed. In NAS-WD, NAS was first used to automatically optimize the network architecture and hyperparameters. The classification results show that NAS-WD can classify the three WB levels with an overall accuracy of 95%, outperforming the traditional machine learning model, and the regression correlation between the spectral data and hardness was 0.75, which performs significantly better than traditional regression models.

[CV-119] 2024 BRAVO Challenge Track 1 1st Place Report: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation

链接: https://arxiv.org/abs/2409.17208
作者: Tommie Kerssies,Daan de Geus,Gijs Dubbelman
关键词-EN: BRAVO Challenge, trained on Cityscapes, robustness is evaluated, solution for Track, present our solution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: arXiv admin note: substantial text overlap with arXiv:2409.15107

点击查看摘要

Abstract:In this report, we present our solution for Track 1 of the 2024 BRAVO Challenge, where a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models, by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches, and achieves 1st place in the challenge. Our code is publicly available at this https URL.

[CV-120] AACLiteNet: A Lightweight Model for Detection of Fine-Grained Abdominal Aortic Calcification

链接: https://arxiv.org/abs/2409.17203
作者: Zaid Ilyas,Afsah Saleem,David Suter,Siobhan Reid,John Schousboe,William Leslie,Joshua Lewis,Syed Zulqarnain Gilani
关键词-EN: Cardiovascular Diseases, million lives annually, Vertebral Fracture Assessment, Abdominal Aortic Calcification, death worldwide
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages including references

点击查看摘要

Abstract:Cardiovascular Diseases (CVDs) are the leading cause of death worldwide, taking 17.9 million lives annually. Abdominal Aortic Calcification (AAC) is an established marker for CVD, which can be observed in lateral view Vertebral Fracture Assessment (VFA) scans, usually done for vertebral fracture detection. Early detection of AAC may help reduce the risk of developing clinical CVDs by encouraging preventive measures. Manual analysis of VFA scans for AAC measurement is time consuming and requires trained human assessors. Recently, efforts have been made to automate the process, however, the proposed models are either low in accuracy, lack granular level score prediction, or are too heavy in terms of inference time and memory footprint. Considering all these shortcomings of existing algorithms, we propose ‘AACLiteNet’, a lightweight deep learning model that predicts both cumulative and granular level AAC scores with high accuracy, and also has a low memory footprint, and computation cost (Floating Point Operations (FLOPs)). The AACLiteNet achieves a significantly improved one-vs-rest average accuracy of 85.94% as compared to the previous best 81.98%, with 19.88 times less computational cost and 2.26 times less memory footprint, making it implementable on portable computing devices.

[CV-121] Cross Dataset Analysis and Network Architecture Repair for Autonomous Car Lane Detection

链接: https://arxiv.org/abs/2409.17158
作者: Parth Ganeriwala,Siddhartha Bhattacharyya,Raja Muthalagu
关键词-EN: isolated learning paradigm, utilizing knowledge acquired, Transfer Learning, inducing transfer learning, isolated learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer Learning has become one of the standard methods to solve problems to overcome the isolated learning paradigm by utilizing knowledge acquired for one task to solve another related one. However, research needs to be done, to identify the initial steps before inducing transfer learning to applications for further verification and explainablity. In this research, we have performed cross dataset analysis and network architecture repair for the lane detection application in autonomous vehicles. Lane detection is an important aspect of autonomous vehicles driving assistance system. In most circumstances, modern deep-learning-based lane recognition systems are successful, but they struggle with lanes with complex topologies. The proposed architecture, ERFCondLaneNet is an enhancement to the CondlaneNet used for lane identification framework to solve the difficulty of detecting lane lines with complex topologies like dense, curved and fork lines. The newly proposed technique was tested on two common lane detecting benchmarks, CULane and CurveLanes respectively, and two different backbones, ResNet and ERFNet. The researched technique with ERFCondLaneNet, exhibited similar performance in comparison to ResnetCondLaneNet, while using 33% less features, resulting in a reduction of model size by 46%.

[CV-122] An Art-centric perspective on AI-based content moderation of nudity ECCV2024

链接: https://arxiv.org/abs/2409.17156
作者: Piera Riccio,Georgina Curto,Thomas Hofmann,Nuria Oliver
关键词-EN: generative Artificial Intelligence, Artificial Intelligence, highly debated topic, artistic nudity online, generative Artificial
类目: Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
*备注: To be published at the AI4VA (AI for Visual Arts) Workshop and Challenges at ECCV 2024

点击查看摘要

Abstract:At a time when the influence of generative Artificial Intelligence on visual arts is a highly debated topic, we raise the attention towards a more subtle phenomenon: the algorithmic censorship of artistic nudity online. We analyze the performance of three "Not-Safe-For-Work’’ image classifiers on artistic nudity, and empirically uncover the existence of a gender and a stylistic bias, as well as evident technical limitations, especially when only considering visual information. Hence, we propose a multi-modal zero-shot classification approach that improves artistic nudity classification. From our research, we draw several implications that we hope will inform future research on this topic.

[CV-123] PhoCoLens: Photorealistic and Consistent Reconstruction in Lensless Imaging NEURIPS2024

链接: https://arxiv.org/abs/2409.17996
作者: Xin Cai,Zhiyuan You,Hailong Zhang,Wentao Liu,Jinwei Gu,Tianfan Xue
关键词-EN: offer significant advantages, cameras offer significant, traditional lens-based systems, Lensless cameras offer, advantages in size
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS 2024 Spotlight

点击查看摘要

Abstract:Lensless cameras offer significant advantages in size, weight, and cost compared to traditional lens-based systems. Without a focusing lens, lensless cameras rely on computational algorithms to recover the scenes from multiplexed measurements. However, current algorithms struggle with inaccurate forward imaging models and insufficient priors to reconstruct high-quality images. To overcome these limitations, we introduce a novel two-stage approach for consistent and photorealistic lensless image reconstruction. The first stage of our approach ensures data consistency by focusing on accurately reconstructing the low-frequency content with a spatially varying deconvolution method that adjusts to changes in the Point Spread Function (PSF) across the camera’s field of view. The second stage enhances photorealism by incorporating a generative prior from pre-trained diffusion models. By conditioning on the low-frequency content retrieved in the first stage, the diffusion model effectively reconstructs the high-frequency details that are typically lost in the lensless imaging process, while also maintaining image fidelity. Our method achieves a superior balance between data fidelity and visual quality compared to existing methods, as demonstrated with two popular lensless systems, PhlatCam and DiffuserCam. Project website: this https URL.

[CV-124] LGFN: Lightweight Light Field Image Super-Resolution using Local Convolution Modulation and Global Attention Feature Extraction

链接: https://arxiv.org/abs/2409.17759
作者: Zhongxin Yu,Liang Chen,Zhiyun Zeng,Kunping Yang,Shaofei Luo,Shaorui Chen,Cheng Zhong
关键词-EN: Capturing different intensity, scene Light field, scene cues, post-capture refocusing, depth sensing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Capturing different intensity and directions of light rays at the same scene Light field (LF) can encode the 3D scene cues into a 4D LF image which has a wide range of applications (i.e. post-capture refocusing and depth sensing). LF image super-resolution (SR) aims to improve the image resolution limited by the performance of LF camera sensor. Although existing methods have achieved promising results the practical application of these models is limited because they are not lightweight enough. In this paper we propose a lightweight model named LGFN which integrates the local and global features of different views and the features of different channels for LF image SR. Specifically owing to neighboring regions of the same pixel position in different sub-aperture images exhibit similar structural relationships we design a lightweight CNN-based feature extraction module (namely DGCE) to extract local features better through feature modulation. Meanwhile as the position beyond the boundaries in the LF image presents a large disparity we propose an efficient spatial attention module (namely ESAM) which uses decomposable large-kernel convolution to obtain an enlarged receptive field and an efficient channel attention module (namely ECAM). Compared with the existing LF image SR models with large parameter our model has a parameter of 0.45M and a FLOPs of 19.33G which has achieved a competitive effect. Extensive experiments with ablation studies demonstrate the effectiveness of our proposed method which ranked the second place in the Track 2 Fidelity Efficiency of NTIRE2024 Light Field Super Resolution Challenge and the seventh place in the Track 1 Fidelity.

[CV-125] Let the Quantum Creep In: Designing Quantum Neural Network Models by Gradually Swapping Out Classical Components

链接: https://arxiv.org/abs/2409.17583
作者: Peiyong Wang,Casey. R. Myers,Lloyd C. L. Hollenberg,Udaya Parampalli
关键词-EN: Artificial Intelligence, quantum neural network, neural network, classical neural network, quantum
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 50 pages (including Appendix), many figures, accepted as a poster on QTML2024. Code available at this https URL

点击查看摘要

Abstract:Artificial Intelligence (AI), with its multiplier effect and wide applications in multiple areas, could potentially be an important application of quantum computing. Since modern AI systems are often built on neural networks, the design of quantum neural networks becomes a key challenge in integrating quantum computing into AI. To provide a more fine-grained characterisation of the impact of quantum components on the performance of neural networks, we propose a framework where classical neural network layers are gradually replaced by quantum layers that have the same type of input and output while keeping the flow of information between layers unchanged, different from most current research in quantum neural network, which favours an end-to-end quantum model. We start with a simple three-layer classical neural network without any normalisation layers or activation functions, and gradually change the classical layers to the corresponding quantum versions. We conduct numerical experiments on image classification datasets such as the MNIST, FashionMNIST and CIFAR-10 datasets to demonstrate the change of performance brought by the systematic introduction of quantum components. Through this framework, our research sheds new light on the design of future quantum neural network models where it could be more favourable to search for methods and frameworks that harness the advantages from both the classical and quantum worlds.

[CV-126] NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes NEURIPS2024

链接: https://arxiv.org/abs/2409.17510
作者: Ziquan Wei,Tingting Dan,Jiaqi Ding,Paul J Laurienti,Guorong Wu
关键词-EN: modern imaging technologies, fluctuations emerge remarkable, emerge remarkable cognition, brain regions in-vivo, spontaneous functional fluctuations
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Although modern imaging technologies allow us to study connectivity between two distinct brain regions in-vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations emerge remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating such network neuroscience question into an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop interacted by brain structure and function. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biological-inspired deep model, coined as NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance by our NeuroPath indicates great potential in network neuroscience.

[CV-127] Shape-intensity knowledge distillation for robust medical image segmentation

链接: https://arxiv.org/abs/2409.17503
作者: Wenhui Dong,Bo Du,Yongchao Xu
关键词-EN: achieved impressive results, shape-intensity prior information, achieved impressive, segmentation, shape-intensity prior
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Many medical image segmentation methods have achieved impressive results. Yet, most existing methods do not take into account the shape-intensity prior information. This may lead to implausible segmentation results, in particular for images of unseen datasets. In this paper, we propose a novel approach to incorporate joint shape-intensity prior information into the segmentation network. Specifically, we first train a segmentation network (regarded as the teacher network) on class-wise averaged training images to extract valuable shape-intensity information, which is then transferred to a student segmentation network with the same network architecture as the teacher via knowledge distillation. In this way, the student network regarded as the final segmentation model can effectively integrate the shape-intensity prior information, yielding more accurate segmentation results. Despite its simplicity, experiments on five medical image segmentation tasks of different modalities demonstrate that the proposed Shape-Intensity Knowledge Distillation (SIKD) consistently improves several baseline models (including recent MaxStyle and SAMed) under intra-dataset evaluation, and significantly improves the cross-dataset generalization ability. The code is available at this https URL.

[CV-128] Study of Subjective and Objective Quality in Super-Resolution Enhanced Broadcast Images on a Novel SR-IQA Dataset

链接: https://arxiv.org/abs/2409.17451
作者: Yongrok Kim,Junha Shin,Juhyun Lee,Hyunsuk Ko
关键词-EN: key consumer technology, display low-quality broadcast, full-screen format, application of Super-Resolution, consumer technology
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:To display low-quality broadcast content on high-resolution screens in full-screen format, the application of Super-Resolution (SR), a key consumer technology, is essential. Recently, SR methods have been developed that not only increase resolution while preserving the original image information but also enhance the perceived quality. However, evaluating the quality of SR images generated from low-quality sources, such as SR-enhanced broadcast content, is challenging due to the need to consider both distortions and improvements. Additionally, assessing SR image quality without original high-quality sources presents another significant challenge. Unfortunately, there has been a dearth of research specifically addressing the Image Quality Assessment (IQA) of SR images under these conditions. In this work, we introduce a new IQA dataset for SR broadcast images in both 2K and 4K resolutions. We conducted a subjective quality evaluation to obtain the Mean Opinion Score (MOS) for these SR images and performed a comprehensive human study to identify the key factors influencing the perceived quality. Finally, we evaluated the performance of existing IQA metrics on our dataset. This study reveals the limitations of current metrics, highlighting the need for a more robust IQA metric that better correlates with the perceived quality of SR images.

[CV-129] Multi-scale decomposition of sea surface height snapshots using machine learning

链接: https://arxiv.org/abs/2409.17354
作者: Jingwen Lyu,Yue Wang,Christian Pedersen,Spencer Jones,Dhruv Balwada
关键词-EN: Sea Surface Height, Knowledge of ocean, weather and climate, blue economy, important for understanding
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Knowledge of ocean circulation is important for understanding and predicting weather and climate, and managing the blue economy. This circulation can be estimated through Sea Surface Height (SSH) observations, but requires decomposing the SSH into contributions from balanced and unbalanced motions (BMs and UBMs). This decomposition is particularly pertinent for the novel SWOT satellite, which measures SSH at an unprecedented spatial resolution. Specifically, the requirement, and the goal of this work, is to decompose instantaneous SSH into BMs and UBMs. While a few studies using deep learning (DL) approaches have shown promise in framing this decomposition as an image-to-image translation task, these models struggle to work well across a wide range of spatial scales and require extensive training data, which is scarce in this domain. These challenges are not unique to our task, and pervade many problems requiring multi-scale fidelity. We show that these challenges can be addressed by using zero-phase component analysis (ZCA) whitening and data augmentation; making this a viable option for SSH decomposition across scales.

[CV-130] An Integrated Deep Learning Framework for Effective Brain Tumor Localization Segmentation and Classification from Magnetic Resonance Images

链接: https://arxiv.org/abs/2409.17273
作者: Pandiyaraju V,Shravan Venkatraman,Abeshek A,Aravintakshan S A,Pavan Kumar S,Madhan S
关键词-EN: abnormal cell growth, abnormal cell, brain cells, brain tissue, cell growth
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 36 pages, 27 figures, 5 tables

点击查看摘要

Abstract:Tumors in the brain result from abnormal cell growth within the brain tissue, arising from various types of brain cells. When left undiagnosed, they lead to severe neurological deficits such as cognitive impairment, motor dysfunction, and sensory loss. As the tumor grows, it causes an increase in intracranial pressure, potentially leading to life-threatening complications such as brain herniation. Therefore, early detection and treatment are necessary to manage the complications caused by such tumors to slow down their growth. Numerous works involving deep learning (DL) and artificial intelligence (AI) are being carried out to assist physicians in early diagnosis by utilizing the scans obtained through Magnetic Resonance Imaging (MRI). Our research proposes DL frameworks for localizing, segmenting, and classifying the grade of these gliomas from MRI images to solve this critical issue. In our localization framework, we enhance the LinkNet framework with a VGG19- inspired encoder architecture for improved multimodal tumor feature extraction, along with spatial and graph attention mechanisms to refine feature focus and inter-feature relationships. Following this, we integrated the SeResNet101 CNN model as the encoder backbone into the LinkNet framework for tumor segmentation, which achieved an IoU Score of 96%. To classify the segmented tumors, we combined the SeResNet152 feature extractor with an Adaptive Boosting classifier, which yielded an accuracy of 98.53%. Our proposed models demonstrated promising results, with the potential to advance medical AI by enabling early diagnosis and providing more accurate treatment options for patients.

[CV-131] AIM 2024 Challenge on Efficient Video Super-Resolution for AV1 Compressed Content ECCV

链接: https://arxiv.org/abs/2409.17256
作者: Marcos V Conde,Zhijun Lei,Wen Li,Christos Bampis,Ioannis Katsavounidis,Radu Timofte
关键词-EN: critical task, task for enhancing, enhancing low-bitrate, low-bitrate and low-resolution, VSR
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
*备注: European Conference on Computer Vision (ECCV) 2024 - Advances in Image Manipulation (AIM)

点击查看摘要

Abstract:Video super-resolution (VSR) is a critical task for enhancing low-bitrate and low-resolution videos, particularly in streaming applications. While numerous solutions have been developed, they often suffer from high computational demands, resulting in low frame rates (FPS) and poor power efficiency, especially on mobile platforms. In this work, we compile different methods to address these challenges, the solutions are end-to-end real-time video super-resolution frameworks optimized for both high performance and low runtime. We also introduce a new test set of high-quality 4K videos to further validate the approaches. The proposed solutions tackle video up-scaling for two applications: 540p to 4K (x4) as a general case, and 360p to 1080p (x3) more tailored towards mobile devices. In both tracks, the solutions have a reduced number of parameters and operations (MACs), allow high FPS, and improve VMAF and PSNR over interpolation baselines. This report gauges some of the most efficient video super-resolution methods to date.

[CV-132] MODELCO: Exoplanet detection in angular differential imaging by learning across multiple observations

链接: https://arxiv.org/abs/2409.17178
作者: Théo Bodrito,Olivier Flasseur,Julien Mairal,Jean Ponce,Maud Langlois,Anne-Marie Lagrange
关键词-EN: small angular separation, angular separations due, Direct imaging, star luminosities, high contrast
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Direct imaging of exoplanets is particularly challenging due to the high contrast between the planet and the star luminosities, and their small angular separation. In addition to tailored instrumental facilities implementing adaptive optics and coronagraphy, post-processing methods combining several images recorded in pupil tracking mode are needed to attenuate the nuisances corrupting the signals of interest. Most of these post-processing methods build a model of the nuisances from the target observations themselves, resulting in strongly limited detection sensitivity at short angular separations due to the lack of angular diversity. To address this issue, we propose to build the nuisance model from an archive of multiple observations by leveraging supervised deep learning techniques. The proposed approach casts the detection problem as a reconstruction task and captures the structure of the nuisance from two complementary representations of the data. Unlike methods inspired by reference differential imaging, the proposed model is highly non-linear and does not resort to explicit image-to-image similarity measurements and subtractions. The proposed approach also encompasses statistical modeling of learnable spatial features. The latter is beneficial to improve both the detection sensitivity and the robustness against heterogeneous data. We apply the proposed algorithm to several datasets from the VLT/SPHERE instrument, and demonstrate a superior precision-recall trade-off compared to the PACO algorithm. Interestingly, the gain is especially important when the diversity induced by ADI is the most limited, thus supporting the ability of the proposed approach to learn information across multiple observations.

机器学习

[LG-0] Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography MICCAI2024

链接: https://arxiv.org/abs/2409.18119
作者: Yuexi Du,John Onofrey,Nicha C. Dvornek
关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Language-Image Pre-training, requires substantial data, shows promise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is also the basis of the overall best solution for the MICCAI 2024 CXR-LT Challenge

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modal